Authors: Daniel Wesego
Abstract: Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Not only that, diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We only need the encoder at the end to extract representations. Our approach demonstrates the potential of discrete diffusion models to be used for graph representation learning.
Authors: Marco Angioli, Marcello Barbirotta, Abdallah Cheikh, Antonio Mastrandrea, Francesco Menichelli, Mauro Olivieri
Abstract: As the Internet of Things expands, embedding Artificial Intelligence algorithms in resource-constrained devices has become increasingly important to enable real-time, autonomous decision-making without relying on centralized cloud servers. However, implementing and executing complex algorithms in embedded devices poses significant challenges due to limited computational power, memory, and energy resources. This paper presents algorithmic and hardware techniques to efficiently implement two LinearUCB Contextual Bandits algorithms on resource-constrained embedded devices. Algorithmic modifications based on the Sherman-Morrison-Woodbury formula streamline model complexity, while vector acceleration is harnessed to speed up matrix operations. We analyze the impact of each optimization individually and then combine them in a two-pronged strategy. The results show notable improvements in execution time and energy consumption, demonstrating the effectiveness of combining algorithmic and hardware optimizations to enhance learning models for edge computing environments with low-power and real-time requirements.
Authors: Qiongyan Wang, Yutong Xia, Siru ZHong, Weichuang Li, Yuankai Wu, Shifen Cheng, Junbo Zhang, Yu Zheng, Yuxuan Liang
Abstract: Monitoring real-time air quality is essential for safeguarding public health and fostering social progress. However, the widespread deployment of air quality monitoring stations is constrained by their significant costs. To address this limitation, we introduce \emph{AirRadar}, a deep neural network designed to accurately infer real-time air quality in locations lacking monitoring stations by utilizing data from existing ones. By leveraging learnable mask tokens, AirRadar reconstructs air quality features in unmonitored regions. Specifically, it operates in two stages: first capturing spatial correlations and then adjusting for distribution shifts. We validate AirRadar's efficacy using a year-long dataset from 1,085 monitoring stations across China, demonstrating its superiority over multiple baselines, even with varying degrees of unobserved data. The source code can be accessed at https://github.com/CityMind-Lab/AirRadar.
Authors: Yichen Wu, Hongming Piao, Long-Kai Huang, Renzhen Wang, Wanhua Li, Hanspeter Pfister, Deyu Meng, Kede Ma, Ying Wei
Abstract: Continual Learning (CL) with foundation models has recently emerged as a promising approach to harnessing the power of pre-trained models for sequential tasks. Existing prompt-based methods generally use a gating mechanism to select relevant prompts aligned with the test query for further processing. However, the success of these methods largely depends on the precision of the gating mechanism, which becomes less scalable with additional computational overhead as tasks increases. To overcome these issues, we propose a Scalable Low-Rank Adaptation (S-LoRA) method for CL (in particular class incremental learning), which incrementally decouples the learning of the direction and magnitude of LoRA parameters. S-LoRA supports efficient inference by employing the last-stage trained model for direct testing without a gating process. Our theoretical and empirical analysis demonstrates that S-LoRA tends to follow a low-loss trajectory that converges to an overlapped low-loss region, resulting in an excellent stability-plasticity trade-off in CL. Furthermore, based on our findings, we develop variants of S-LoRA with further improved scalability. Extensive experiments across multiple CL benchmarks and various foundation models consistently validate the effectiveness of S-LoRA.
Authors: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev
Abstract: Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
Authors: Xiaoyang Wang, Christopher C. Yang
Abstract: Artificial intelligence (AI) systems in healthcare have demonstrated remarkable potential to improve patient outcomes. However, if not designed with fairness in mind, they also carry the risks of perpetuating or exacerbating existing health disparities. Although numerous fairness-enhancing techniques have been proposed, most focus on a single sensitive attribute and neglect the broader impact that optimizing fairness for one attribute may have on the fairness of other sensitive attributes. In this work, we introduce a novel approach to multi-attribute fairness optimization in healthcare AI, tackling fairness concerns across multiple demographic attributes concurrently. Our method follows a two-phase approach: initially optimizing for predictive performance, followed by fine-tuning to achieve fairness across multiple sensitive attributes. We develop our proposed method using two strategies, sequential and simultaneous. Our results show a significant reduction in Equalized Odds Disparity (EOD) for multiple attributes, while maintaining high predictive accuracy. Notably, we demonstrate that single-attribute fairness methods can inadvertently increase disparities in non-targeted attributes whereas simultaneous multi-attribute optimization achieves more balanced fairness improvements across all attributes. These findings highlight the importance of comprehensive fairness strategies in healthcare AI and offer promising directions for future research in this critical area.
Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Abstract: As large-scale vision-language models (VLMs) become increasingly central to modern AI applications, understanding and mitigating social biases in these systems has never been more critical.We investigate how dataset composition, model size, and multilingual training affect gender and racial bias in a popular VLM, CLIP, and its open-source variants. In particular, we systematically evaluate models trained on varying dataset scales and architectures, as well as multilingual versions encompassing English along with Persian, Turkish, and Finnish, languages with minimal gender marking. To assess social perception bias, we measure the zero-shot performance on face images featuring socially charged terms rooted in the psychological constructs of communion and agency, and demographic labeling bias using both the FairFace and PATA datasets. Our findings reveal three key insights. First, while larger training datasets can mitigate some biases, they may also introduce or amplify others when the data composition is imbalanced. Second, although increasing model size generally improves performance, it does not consistently reduce bias and can, in certain cases, exacerbate it. Finally, while multilingual training broadens linguistic coverage, it does not inherently neutralize bias and can transfer or intensify inequities across languages. Taken together, these results highlight the necessity of inclusive, carefully curated training data to foster fairness rather than relying solely on model scaling or language expansion. We provide a systematic evaluation of vision language bias across diverse demographics, underscoring the urgent need for intentional bias mitigation strategies in next generation AI systems.
Authors: D\'avid Terj\'ek, Diego Gonz\'alez-S\'anchez
Abstract: We study the properties of the Neural Tangent Kernel (NTK) $\overset{\scriptscriptstyle\infty}{K} : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ corresponding to infinitely wide $l$-layer Multilayer Perceptrons (MLPs) taking inputs from $\mathbb{R}^{m_0}$ to outputs in $\mathbb{R}^{m_l}$ equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ and initialized at the Edge Of Chaos (EOC). We find that the entries $\overset{\scriptscriptstyle\infty}{K}(x_1,x_2)$ can be approximated by the inverses of the cosine distances of the activations corresponding to $x_1$ and $x_2$ increasingly better as the depth $l$ increases. By quantifying these inverse cosine distances and the spectrum of the matrix containing them, we obtain tight spectral bounds for the NTK matrix $\overset{\scriptscriptstyle\infty}{K} = [\frac{1}{n} \overset{\scriptscriptstyle\infty}{K}(x_{i_1},x_{i_2}) : i_1, i_2 \in [1:n]]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$, transferred from the inverse cosine distance matrix via our approximation result. Our results show that $\Delta_\phi = \frac{b^2}{a^2+b^2}$ determines the rate at which the condition number of the NTK matrix converges to its limit as depth increases, implying in particular that the absolute value ($\Delta_\phi=1$) is better than the ReLU ($\Delta_\phi=\frac{1}{2}$) in this regard.
Authors: Yan Ru Pei
Abstract: We introduce Centaurus, a class of networks composed of generalized state-space model (SSM) blocks, where the SSM operations can be treated as tensor contractions during training. The optimal order of tensor contractions can then be systematically determined for every SSM block to maximize training efficiency. This allows more flexibility in designing SSM blocks beyond the depthwise-separable configuration commonly implemented. The new design choices will take inspiration from classical convolutional blocks including group convolutions, full convolutions, and bottleneck blocks. We architect the Centaurus network with a mixture of these blocks, to balance between network size and performance, as well as memory and computational efficiency during both training and inference. We show that this heterogeneous network design outperforms its homogeneous counterparts in raw audio processing tasks including keyword spotting, speech denoising, and automatic speech recognition (ASR). For ASR, Centaurus is the first network with competitive performance that can be made fully state-space based, without using any nonlinear recurrence (LSTMs), explicit convolutions (CNNs), or (surrogate) attention mechanism. Source code is available at github.com/Brainchip-Inc/Centaurus
Authors: Xintong Duan, Yutong He, Fahim Tajwar, Wen-Tse Chen, Ruslan Salakhutdinov, Jeff Schneider
Abstract: Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on expert trajectory generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.
Authors: Reza Saadati Fard, Emmanuel Agu, Palawat Busaranuvong, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong
Abstract: Chronic wounds affect 8.5 million Americans, particularly the elderly and patients with diabetes. These wounds can take up to nine months to heal, making regular care essential to ensure healing and prevent severe outcomes like limb amputations. Many patients receive care at home from visiting nurses with varying levels of wound expertise, leading to inconsistent care. Problematic, non-healing wounds should be referred to wound specialists, but referral decisions in non-clinical settings are often erroneous, delayed, or unnecessary. This paper introduces the Deep Multimodal Wound Assessment Tool (DM-WAT), a machine learning framework designed to assist visiting nurses in deciding whether to refer chronic wound patients. DM-WAT analyzes smartphone-captured wound images and clinical notes from Electronic Health Records (EHRs). It uses DeiT-Base-Distilled, a Vision Transformer (ViT), to extract visual features from images and DeBERTa-base to extract text features from clinical notes. DM-WAT combines visual and text features using an intermediate fusion approach. To address challenges posed by a small and imbalanced dataset, it integrates image and text augmentation with transfer learning to achieve high performance. In evaluations, DM-WAT achieved 77% with std 3% accuracy and a 70% with std 2% F1 score, outperforming prior approaches. Score-CAM and Captum interpretation algorithms provide insights into specific parts of image and text inputs that influence recommendations, enhancing interpretability and trust.
Authors: Ali Nazari, Michael Weiss
Abstract: This study presents a method for exploring advancements in a specific technological domain. It combines topic modeling, expert input, and reinforcement learning (RL). The proposed approach has three key steps: (1) generate aspect-based topic models using expert-weighted keywords to emphasize critical aspects, (2) analyze similarities and entropy changes by comparing topic distributions across iterative models, and (3) refine topic selection using reinforcement learning (RL) with a modified reward function that integrates changes in topic divergence and similarity across iterations. The method is tested on quantum communication documents with a focus on advances in cryptography and security protocols. The results show the method's effectiveness and can identify, rank, and track trends that match expert input. The framework provides a robust tool for exploring evolving technological landscapes.
Authors: Peiqi Li, Jie Chen
Abstract: The novel neural networks show great potential in solving partial differential equations. For single-phase flow problems in subsurface porous media with high-contrast coefficients, the key is to develop neural operators with accurate reconstruction capability and strict adherence to physical laws. In this study, we proposed a hybrid two-stage framework that uses multiscale basis functions and physics-guided deep learning to solve the Darcy flow problem in high-contrast fractured porous media. In the first stage, a data-driven model is used to reconstruct the multiscale basis function based on the permeability field to achieve effective dimensionality reduction while preserving the necessary multiscale features. In the second stage, the physics-informed neural network, together with Transformer-based global information extractor is used to reconstruct the pressure field by integrating the physical constraints derived from the Darcy equation, ensuring consistency with the physical laws of the real world. The model was evaluated on datasets with different combinations of permeability and basis functions and performed well in terms of reconstruction accuracy. Specifically, the framework achieves R2 values above 0.9 in terms of basis function fitting and pressure reconstruction, and the residual indicator is on the order of $1\times 10^{-4}$. These results validate the ability of the proposed framework to achieve accurate reconstruction while maintaining physical consistency.
Authors: Gaojie Jin, Sihao Wu, Jiaxu Liu, Tianjin Huang, Ronghui Mu
Abstract: Recent research has highlighted a critical issue known as ``robust fairness", where robust accuracy varies significantly across different classes, undermining the reliability of deep neural networks (DNNs). A common approach to address this has been to dynamically reweight classes during training, giving more weight to those with lower empirical robust performance. However, we find there is a divergence of class-wise robust performance between training set and testing set, which limits the effectiveness of these explicit reweighting methods, indicating the need for a principled alternative. In this work, we derive a robust generalization bound for the worst-class robust error within the PAC-Bayesian framework, accounting for unknown data distributions. Our analysis shows that the worst-class robust error is influenced by two main factors: the spectral norm of the empirical robust confusion matrix and the information embedded in the model and training set. While the latter has been extensively studied, we propose a novel regularization technique targeting the spectral norm of the robust confusion matrix to improve worst-class robust accuracy and enhance robust fairness. We validate our approach through comprehensive experiments on various datasets and models, demonstrating its effectiveness in enhancing robust fairness.
Authors: Hao Yuan Bai, Xue Liu
Abstract: Time series data is ubiquitous and appears in all fields of study. In multivariate time series, observations are interconnected both temporally and across components. For instance, in traffic flow analysis, traffic speeds at different intersections exhibit complex spatiotemporal correlations. Modelling this dual structure poses significant challenges. Most existing forecasting methods tackle these challenges by separately learning spatial and temporal dependencies. In this work, we introduce T-Graphormer, a Transformer-based approach designed to model spatiotemporal correlations directly. Extending the Graphormer architecture to incorporate temporal dynamics, our method updates each node representation by selectively attending to all other nodes within a graph sequence. This design enables the model to capture rich spatiotemporal patterns with minimal reliance on predefined spacetime inductive biases. We validate the effectiveness of T-Graphormer on real-world traffic prediction benchmark datasets, achieving up to 10% reductions in both root mean squared error (RMSE) and mean absolute percentage error (MAPE) compared to state-of-the-art methods.
Authors: Takuro Kutsuna
Abstract: Importance sampling is widely used to improve the efficiency of deep neural network (DNN) training by reducing the variance of gradient estimators. However, efficiently assessing the variance reduction relative to uniform sampling remains challenging due to computational overhead. This paper proposes a method for estimating variance reduction during DNN training using only minibatches sampled under importance sampling. By leveraging the proposed method, the paper also proposes an effective minibatch size to enable automatic learning rate adjustment. An absolute metric to quantify the efficiency of importance sampling is also introduced as well as an algorithm for real-time estimation of importance scores based on moving gradient statistics. Theoretical analysis and experiments on benchmark datasets demonstrated that the proposed algorithm consistently reduces variance, improves training efficiency, and enhances model accuracy compared with current importance-sampling approaches while maintaining minimal computational overhead.
Authors: Yiming Yang, Xiaoyuan Cheng, Daniel Giles, Sibo Cheng, Yi He, Xiao Xue, Boli Chen, Yukun Hu
Abstract: Variational data assimilation estimates the dynamical system states by minimizing a cost function that fits the numerical models with observational data. The widely used method, four-dimensional variational assimilation (4D-Var), has two primary challenges: (1) computationally demanding for complex nonlinear systems and (2) relying on state-observation mappings, which are often not perfectly known. Deep learning (DL) has been used as a more expressive class of efficient model approximators to address these challenges. However, integrating such models into 4D-Var remains challenging due to their inherent nonlinearities and the lack of theoretical guarantees for consistency in assimilation results. In this paper, we propose \textit{Tensor-Var} to address these challenges using kernel Conditional Mean Embedding (CME). Tensor-Var improves optimization efficiency by characterizing system dynamics and state-observation mappings as linear operators, leading to a convex cost function in the feature space. Furthermore, our method provides a new perspective to incorporate CME into 4D-Var, offering theoretical guarantees of consistent assimilation results between the original and feature spaces. To improve scalability, we propose a method to learn deep features (DFs) using neural networks within the Tensor-Var framework. Experiments on chaotic systems and global weather prediction with real-time observations show that Tensor-Var outperforms conventional and DL hybrid 4D-Var baselines in accuracy while achieving efficiency comparable to the static 3D-Var method.
Authors: Mars Liyao Gao, Jan P. Williams, J. Nathan Kutz
Abstract: Spatiotemporal modeling of real-world data poses a challenging problem due to inherent high dimensionality, measurement noise, and expensive data collection procedures. In this paper, we present Sparse Identification of Nonlinear Dynamics with SHallow REcurrent Decoder networks (SINDy-SHRED), a method to jointly solve the sensing and model identification problems with simple implementation, efficient computation, and robust performance. SINDy-SHRED uses Gated Recurrent Units (GRUs) to model the temporal sequence of sensor measurements along with a shallow decoder network to reconstruct the full spatiotemporal field from the latent state space using only a few available sensors. Our proposed algorithm introduces a SINDy-based regularization; beginning with an arbitrary latent state space, the dynamics of the latent space progressively converges to a SINDy-class functional, provided the projection remains within the set. In restricting SINDy to a linear model, the architecture produces a Koopman-SHRED model which enforces a linear latent space dynamics. We conduct a systematic experimental study including synthetic PDE data, real-world sensor measurements for sea surface temperature, and direct video data. With no explicit encoder, SINDy-SHRED and Koopman-SHRED enable efficient training with minimal hyperparameter tuning and laptop-level computing; further, it demonstrates robust generalization in a variety of applications with minimal to no hyperparameter adjustments. Finally, the interpretable SINDy and Koopman models of latent state dynamics enables accurate long-term video predictions, achieving state-of-the-art performance and outperforming all baseline methods considered, including Convolutional LSTM, PredRNN, ResNet, and SimVP.
Authors: Dongyoung Lee, Seungkyu Choi, Ik Joon Chang
Abstract: Large-scale language models (LLMs) have demonstrated outstanding performance in language processing tasks, yet their deployment is often hindered by high memory demands and computational complexity. Although low-bit quantization techniques, such as 4-bit quantization, present a potential solution, they frequently lead to significant accuracy degradation or require substantial effort for such aggressive quantization approaches. To overcome these challenges, we introduce QRazor, a reliable and effortless quantization scheme designed to enable 4-bit quantization for weights, activations, and KV cache in transformer-based LLMs. The scheme involves two main stages: quantization and compression. During the quantization stage, weights, activations, and KV cache values are quantized with wider 8 or 16-bit integers as a basis to achieve nearly identical accuracy to the original full-precision LLM models, using the absolute max scaling. Subsequently, all data are compressed to 4-bit using our proposed significant data razoring (SDR) technique, which retains only the four most salient bits while discarding the others. Furthermore, we present an integer-based arithmetic unit dedicated to QRazor, enabling direct low-precision arithmetic operations without decompressing the SDR data. Despite the reduced quantization effort, QRazor achieves LLM accuracies better or comparable to state-of-the-art 4-bit methods. By also validating the hardware efficiency, our decompression-free arithmetic unit achieves 61.2% and 57.8% reduction in area and power consumption, respectively.
Authors: Zhendong Guo, Yew-Soon Ong, Tiantian He, Haitao Liu
Abstract: Bayesian optimization (BO) is well known to be sample-efficient for solving black-box problems. However, the BO algorithms can sometimes get stuck in suboptimal solutions even with plenty of samples. Intrinsically, such suboptimal problem of BO can attribute to the poor surrogate accuracy of the trained Gaussian process (GP), particularly that in the regions where the optimal solutions locate. Hence, we propose to build multiple GP models instead of a single GP surrogate to complement each other and thus resolving the suboptimal problem of BO. Nevertheless, according to the bias-variance tradeoff equation, the individual prediction errors can increase when increasing the diversity of models, which may lead to even worse overall surrogate accuracy. On the other hand, based on the theory of Rademacher complexity, it has been proved that exploiting the agreement of models on unlabeled information can help to reduce the complexity of the hypothesis space, and therefore achieving the required surrogate accuracy with fewer samples. Such value of model agreement has been extensively demonstrated for co-training style algorithms to boost model accuracy with a small portion of samples. Inspired by the above, we propose a novel BO algorithm labeled as co-learning BO (CLBO), which exploits both model diversity and agreement on unlabeled information to improve the overall surrogate accuracy with limited samples, and therefore achieving more efficient global optimization. Through tests on five numerical toy problems and three engineering benchmarks, the effectiveness of proposed CLBO has been well demonstrated.
Authors: Joshua Park, Yongfeng Zhang
Abstract: Multi-agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. On test data, we are able to achieve a top-1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine-tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open-sourced to the public along with the code for AgentRec recommendation system at https://github.com/joshprk/agentrec.
Authors: Qingyue Long, Can Rong, Huandong Wang, Yong Li
Abstract: Trajectory data play a crucial role in many applications, ranging from network optimization to urban planning. Existing studies on trajectory data are task-specific, and their applicability is limited to the specific tasks on which they have been trained, such as generation, recovery, or prediction. However, the potential of a unified model has not yet been fully explored in trajectory modeling. Although various trajectory tasks differ in inputs, outputs, objectives, and conditions, they share common mobility patterns. Based on these common patterns, we can construct a general framework that enables a single model to address different tasks. However, building a trajectory task-general framework faces two critical challenges: 1) the diversity in the formats of different tasks and 2) the complexity of the conditions imposed on different tasks. In this work, we propose a general trajectory modeling framework via masked conditional diffusion (named GenMove). Specifically, we utilize mask conditions to unify diverse formats. To adapt to complex conditions associated with different tasks, we utilize historical trajectory data to obtain contextual trajectory embeddings, which include rich contexts such as spatiotemporal characteristics and user preferences. Integrating the contextual trajectory embedding into diffusion models through a classifier-free guidance approach allows the model to flexibly adjust its outputs based on different conditions. Extensive experiments on mainstream tasks demonstrate that our model significantly outperforms state-of-the-art baselines, with the highest performance improvement exceeding 13% in generation tasks.
Authors: Rishikesh Ranade, Mohammad Amin Nabian, Kaustubh Tangsali, Alexey Kamenev, Oliver Hennigh, Ram Cherukuri, Sanjay Choudhry
Abstract: Numerical simulations play a critical role in design and development of engineering products and processes. Traditional computational methods, such as CFD, can provide accurate predictions but are computationally expensive, particularly for complex geometries. Several machine learning (ML) models have been proposed in the literature to significantly reduce computation time while maintaining acceptable accuracy. However, ML models often face limitations in terms of accuracy and scalability and depend on significant mesh downsampling, which can negatively affect prediction accuracy and generalization. In this work, we propose a novel ML model architecture, DoMINO (Decomposable Multi-scale Iterative Neural Operator) developed in NVIDIA Modulus to address the various challenges of machine learning based surrogate modeling of engineering simulations. DoMINO is a point cloudbased ML model that uses local geometric information to predict flow fields on discrete points. The DoMINO model is validated for the automotive aerodynamics use case using the DrivAerML dataset. Through our experiments we demonstrate the scalability, performance, accuracy and generalization of our model to both in-distribution and out-of-distribution testing samples. Moreover, the results are analyzed using a range of engineering specific metrics important for validating numerical simulations.
Authors: Zihao Hu, Xiaoyu Fan, Yuan Yao, Jiheng Zhang, Zhengyuan Zhou
Abstract: First-price auctions have recently gained significant traction in digital advertising markets, exemplified by Google's transition from second-price to first-price auctions. Unlike in second-price auctions, where bidding one's private valuation is a dominant strategy, determining an optimal bidding strategy in first-price auctions is more complex. From a learning perspective, the learner (a specific bidder) can interact with the environment (other bidders) sequentially to infer their behaviors. Existing research often assumes specific environmental conditions and benchmarks performance against the best fixed policy (static benchmark). While this approach ensures strong learning guarantees, the static benchmark can deviate significantly from the optimal strategy in environments with even mild non-stationarity. To address such scenarios, a dynamic benchmark, which represents the sum of the best possible rewards at each time step, offers a more suitable objective. However, achieving no-regret learning with respect to the dynamic benchmark requires additional constraints. By inspecting reward functions in online first-price auctions, we introduce two metrics to quantify the regularity of the bidding sequence, which serve as measures of non-stationarity. We provide a minimax-optimal characterization of the dynamic regret when either of these metrics is sub-linear in the time horizon.
Authors: Bishwash Paneru, Biplov Paneru, Tanka Mukhiya, Khem Narayan Poudyal
Abstract: In Nepal, air pollution is a serious public health concern, especially in cities like Kathmandu where particulate matter (PM2.5 and PM10) has a major influence on respiratory health and air quality. The Air Quality Index (AQI) is predicted in this work using a Random Forest Regressor, and the model's predictions are interpreted using SHAP (SHapley Additive exPlanations) analysis. With the lowest Testing RMSE (0.23) and flawless R2 scores (1.00), CatBoost performs better than other models, demonstrating its greater accuracy and generalization which is cross validated using a nested cross validation approach. NowCast Concentration and Raw Concentration are the most important elements influencing AQI values, according to SHAP research, which shows that the machine learning results are highly accurate. Their significance as major contributors to air pollution is highlighted by the fact that high values of these characteristics significantly raise the AQI. This study investigates the Hydrogen-Alpha (HA) biodegradable filter as a novel way to reduce the related health hazards. With removal efficiency of more than 98% for PM2.5 and 99.24% for PM10, the HA filter offers exceptional defense against dangerous airborne particles. These devices, which are biodegradable face masks and cigarette filters, address the environmental issues associated with traditional filters' non-biodegradable trash while also lowering exposure to air contaminants.
Authors: Fengmiao Bian, Jian-Feng Cai, Xiaoqun Zhang, Yuanwei Zhang
Abstract: Low-rank tensor completion aims to recover a tensor from partially observed entries, and it is widely applicable in fields such as quantum computing and image processing. Due to the significant advantages of the tensor train (TT) format in handling structured high-order tensors, this paper investigates the low-rank tensor completion problem based on the TT-format. We proposed a preconditioned Riemannian gradient descent algorithm (PRGD) to solve low TT-rank tensor completion and establish its linear convergence. Experimental results on both simulated and real datasets demonstrate the effectiveness of the PRGD algorithm. On the simulated dataset, the PRGD algorithm reduced the computation time by two orders of magnitude compared to existing classical algorithms. In practical applications such as hyperspectral image completion and quantum state tomography, the PRGD algorithm significantly reduced the number of iterations, thereby substantially reducing the computational time.
Authors: Thang Duong, Zhi Wang, Chicheng Zhang
Abstract: We study lifelong learning in linear bandits, where a learner interacts with a sequence of linear bandit tasks whose parameters lie in an $m$-dimensional subspace of $\mathbb{R}^d$, thereby sharing a low-rank representation. Current literature typically assumes that the tasks are diverse, i.e., their parameters uniformly span the $m$-dimensional subspace. This assumption allows the low-rank representation to be learned before all tasks are revealed, which can be unrealistic in real-world applications. In this work, we present the first nontrivial result for sequential multi-task linear bandits without the task diversity assumption. We develop an algorithm that efficiently learns and transfers low-rank representations. When facing $N$ tasks, each played over $\tau$ rounds, our algorithm achieves a regret guarantee of $\tilde{O}\big (Nm \sqrt{\tau} + N^{\frac{2}{3}} \tau^{\frac{2}{3}} d m^{\frac13} + Nd^2 + \tau m d \big)$ under the ellipsoid action set assumption. This result can significantly improve upon the baseline of $\tilde{O} \left (Nd \sqrt{\tau}\right)$ that does not leverage the low-rank structure when the number of tasks $N$ is sufficiently large and $m \ll d$. We also demonstrate empirically on synthetic data that our algorithm outperforms baseline algorithms, which rely on the task diversity assumption.
Authors: Yasamin Ghahremani, Vangelis Metsis
Abstract: Time series analysis has become crucial in various fields, from engineering and finance to healthcare and social sciences. In this paper, we present a comprehensive review and evaluation of time series embedding methods for effective representations in machine learning and deep learning models. We introduce a taxonomy of embedding techniques, categorizing them based on their theoretical foundations and application contexts. Unlike previous surveys, our work provides a quantitative evaluation of representative methods from each category by assessing their performance on downstream classification tasks across diverse real-world datasets. Our experimental results demonstrate that the performance of embedding methods varies significantly depending on the dataset and classification algorithm used, highlighting the importance of careful model selection and extensive experimentation for specific applications, including engineering systems. To facilitate further research and practical applications, we provide an open-source code repository implementing these embedding methods. This study contributes to the field by offering a systematic comparison of time series embedding techniques, guiding practitioners in selecting appropriate methods for their specific applications, and providing a foundation for future advancements in time series analysis.
Authors: Yan Chen, Qinxun Bai, Yiteng Zhang, Shi Dong, Maria Dimakopoulou, Qi Sun, Zhengyuan Zhou
Abstract: Designing learning agents that explore efficiently in a complex environment has been widely recognized as a fundamental challenge in reinforcement learning. While a number of works have demonstrated the effectiveness of techniques based on randomized value functions on a single agent, it remains unclear, from a theoretical point of view, whether injecting randomization can help a society of agents {\it concurently} explore an environment. The theoretical results %that we established in this work tender an affirmative answer to this question. We adapt the concurrent learning framework to \textit{randomized least-squares value iteration} (RLSVI) with \textit{aggregated state representation}. We demonstrate polynomial worst-case regret bounds in both finite- and infinite-horizon environments. In both setups the per-agent regret decreases at an optimal rate of $\Theta\left(\frac{1}{\sqrt{N}}\right)$, highlighting the advantage of concurent learning. Our algorithm exhibits significantly lower space complexity compared to \cite{russo2019worst} and \cite{agrawal2021improved}. We reduce the space complexity by a factor of $K$ while incurring only a $\sqrt{K}$ increase in the worst-case regret bound, compared to \citep{agrawal2021improved,russo2019worst}. Additionally, we conduct numerical experiments to demonstrate our theoretical findings.
Authors: Kamal Sarkar
Abstract: As the energy landscape changes quickly, grid operators face several challenges, especially when integrating renewable energy sources with the grid. The most important challenge is to balance supply and demand because the solar and wind energy are highly unpredictable. When dealing with such uncertainty, trustworthy short-term load and renewable energy forecasting can help stabilize the grid, maximize energy storage, and guarantee the effective use of renewable resources. Physical models and statistical techniques were the previous approaches employed for this kind of forecasting tasks. In forecasting renewable energy, machine learning and deep learning techniques have recently demonstrated encouraging results. More specifically, the deep learning techniques like CNN and LSTM and the conventional machine learning techniques like regression that are mostly utilized for load and renewable energy forecasting tasks. In this article, we will focus mainly on CNN and LSTM-based forecasting methods.
Authors: Yiming Tang, Abrar Anwar, Jesse Thomason
Abstract: Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Multi-party interactions include social signals like body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Incorporating all the multimodal signals in a multi-party interaction is difficult, and past work tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking which allows for the simultaneous processing of multiple social cues across multiple participants and their temporal interactions. This approach better captures social dynamics over time by considering longer horizons of social signals between individuals. We train and evaluate our unified model on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/
Authors: Rui Xu, Chao Chen, Yue Sun, Parvathinathan Venkitasubramaniam, Sihong Xie
Abstract: Conformal prediction yields a prediction set with guaranteed $1-\alpha$ coverage of the true target under the i.i.d. assumption, which may not hold and lead to a gap between $1-\alpha$ and the actual coverage. Prior studies bound the gap using total variation distance, which cannot identify the gap changes under distribution shift at a given $\alpha$. Besides, existing methods are mostly limited to covariate shift,while general joint distribution shifts are more common in practice but less researched.In response, we first propose a Wasserstein distance-based upper bound of the coverage gap and analyze the bound using probability measure pushforwards between the shifted joint data and conformal score distributions, enabling a separation of the effect of covariate and concept shifts over the coverage gap. We exploit the separation to design an algorithm based on importance weighting and regularized representation learning (WR-CP) to reduce the Wasserstein bound with a finite-sample error bound.WR-CP achieves a controllable balance between conformal prediction accuracy and efficiency. Experiments on six datasets prove that WR-CP can reduce coverage gaps to $3.1\%$ across different confidence levels and outputs prediction sets 38$\%$ smaller than the worst-case approach on average.
Authors: Yasmin Salehi, Dennis Giannacopoulos
Abstract: Graph clustering plays a crucial role in graph representation learning but often faces challenges in achieving feature-space diversity. While Deep Modularity Networks (DMoN) leverage modularity maximization and collapse regularization to ensure structural separation, they do not explicitly encourage diversity in the feature space among clusters. We address this limitation by proposing Deep Modularity Networks with Diversity-Preserving Regularization (DMoN-DPR), which introduces three novel regularization terms: distance-based for inter-cluster separation, variance-based for intra-cluster diversity, and entropy-based for balanced assignments. Our method enhances clustering performance on benchmark datasets, namely Cora, CiteSeer, PubMed, Coauthor CS, and Coauthor Physics, achieving significant improvements in Normalized Mutual Information (NMI), and F1 scores. These results demonstrate the effectiveness of incorporating diversity-preserving regularizations in creating meaningful and interpretable clusters, especially in feature-rich datasets.
Authors: Junhao Zheng, Xidi Cai, Shengjie Qiu, Qianli Ma
Abstract: Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning: despite extensive training, models experience significant performance declines, raising questions about task alignment and underlying knowledge retention. This study first explores the concept of "spurious forgetting", proposing that such performance drops often reflect a decline in task alignment rather than true knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fix the bottom layers of the model, leading to substantial improvements in four continual learning scenarios. Our findings underscore the critical distinction between task alignment and knowledge retention, paving the way for more effective strategies in continual learning.
Authors: Taoran Fang, Tianhong Gao, Chunping Wang, Yihao Shang, Wei Chow, Lei Chen, Yang Yang
Abstract: Graph neural networks (GNNs) with attention mechanisms, often referred to as attentive GNNs, have emerged as a prominent paradigm in advanced GNN models in recent years. However, our understanding of the critical process of scoring neighbor nodes remains limited, leading to the underperformance of many existing attentive GNNs. In this paper, we unify the scoring functions of current attentive GNNs and propose Kolmogorov-Arnold Attention (KAA), which integrates the Kolmogorov-Arnold Network (KAN) architecture into the scoring process. KAA enhances the performance of scoring functions across the board and can be applied to nearly all existing attentive GNNs. To compare the expressive power of KAA with other scoring functions, we introduce Maximum Ranking Distance (MRD) to quantitatively estimate their upper bounds in ranking errors for node importance. Our analysis reveals that, under limited parameters and constraints on width and depth, both linear transformation-based and MLP-based scoring functions exhibit finite expressive power. In contrast, our proposed KAA, even with a single-layer KAN parameterized by zero-order B-spline functions, demonstrates nearly infinite expressive power. Extensive experiments on both node-level and graph-level tasks using various backbone models show that KAA-enhanced scoring functions consistently outperform their original counterparts, achieving performance improvements of over 20% in some cases.
Authors: Rishabh Agrawal
Abstract: Few-shot learning (FSL) enables machine learning models to generalize effectively with minimal labeled data, making it crucial for data-scarce domains such as healthcare, robotics, and natural language processing. Despite its potential, FSL faces challenges including sensitivity to initialization, difficulty in adapting to diverse domains, and vulnerability to noisy datasets. To address these issues, this paper introduces Adaptive Few-Shot Learning (AFSL), a framework that integrates advancements in meta-learning, domain alignment, noise resilience, and multi-modal integration. AFSL consists of four key modules: a Dynamic Stability Module for performance consistency, a Contextual Domain Alignment Module for domain adaptation, a Noise-Adaptive Resilience Module for handling noisy data, and a Multi-Modal Fusion Module for integrating diverse modalities. This work also explores strategies such as task-aware data augmentation, semi-supervised learning, and explainable AI techniques to enhance the applicability and robustness of FSL. AFSL provides scalable, reliable, and impactful solutions for real-world, high-stakes domains.
Authors: Zukang Xu, Yuxuan Yue, Xing Hu, Zhihang Yuan, Zixu Jiang, Zhixuan Chen, Jiangyong Yu, Chen Xu, Sifan Zhou, Dawei Yang
Abstract: Mamba is an efficient sequence model that rivals Transformers and demonstrates significant potential as a foundational architecture for various tasks. Quantization is commonly used in neural networks to reduce model size and computational latency. However, applying quantization to Mamba remains underexplored, and existing quantization methods, which have been effective for CNN and Transformer models, appear inadequate for Mamba models (e.g., Quarot suffers a 21% accuracy drop on Vim-T$^\dagger$ even under W8A8). We have pioneered the exploration of this issue and identified several key challenges. First, significant outliers are present in gate projections, output projections, and matrix multiplications. Second, Mamba's unique parallel scan further amplifies these outliers, leading to uneven and heavy-tailed data distributions. Third, even with the application of the Hadamard transform, the variance across channels in weights and activations still remains inconsistent. To these ends, we propose MambaQuant, a post-training quantization (PTQ) framework consisting of: 1) Karhunen-Loeve Transformation (KLT) enhanced rotation, rendering the rotation matrix adaptable to diverse channel distributions. 2) Smooth-Fused rotation, which equalizes channel variances and can merge additional parameters into model weights. Experiments show that MambaQuant can quantize both weights and activations into 8-bit with less than 1% accuracy loss for Mamba-based vision and language tasks. To the best of our knowledge, MambaQuant is the first comprehensive PTQ design for the Mamba family, paving the way for further advancements in its application.
Authors: Zehao Liu, Mengzhou Gao, Pengfei Jiao
Abstract: Multivariate time series anomaly detection has numerous real-world applications and is being extensively studied. Modeling pairwise correlations between variables is crucial. Existing methods employ learnable graph structures and graph neural networks to explicitly model the spatial dependencies between variables. However, these methods are primarily based on prediction or reconstruction tasks, which can only learn similarity relationships between sequence embeddings and lack interpretability in how graph structures affect time series evolution. In this paper, we designed a framework that models spatial dependencies using interpretable causal relationships and detects anomalies through changes in causal patterns. Specifically, we propose a method to dynamically discover Granger causality using gradients in nonlinear deep predictors and employ a simple sparsification strategy to obtain a Granger causality graph, detecting anomalies from a causal perspective. Experiments on real-world datasets demonstrate that the proposed model achieves more accurate anomaly detection compared to baseline methods.
Authors: Xiaoxing Ren, Nicola Bastianello, Karl H. Johansson, Thomas Parisini
Abstract: We address distributed learning problems, both nonconvex and convex, over undirected networks. In particular, we design a novel algorithm based on the distributed Alternating Direction Method of Multipliers (ADMM) to address the challenges of high communication costs, and large datasets. Our design tackles these challenges i) by enabling the agents to perform multiple local training steps between each round of communications; and ii) by allowing the agents to employ stochastic gradients while carrying out local computations. We show that the proposed algorithm converges to a neighborhood of a stationary point, for nonconvex problems, and of an optimal point, for convex problems. We also propose a variant of the algorithm to incorporate variance reduction thus achieving exact convergence. We show that the resulting algorithm indeed converges to a stationary (or optimal) point, and moreover that local training accelerates convergence. We thoroughly compare the proposed algorithms with the state of the art, both theoretically and through numerical results.
Authors: Rui Wang, Mingxuan Xia, Chang Yao, Lei Feng, Junbo Zhao, Gang Chen, Haobo Wang
Abstract: Traditional Incremental Learning (IL) targets to handle sequential fully-supervised learning problems where novel classes emerge from time to time. However, due to inherent annotation uncertainty and ambiguity, collecting high-quality annotated data in a dynamic learning system can be extremely expensive. To mitigate this problem, we propose a novel weakly-supervised learning paradigm called Incremental Partial Label Learning (IPLL), where the sequentially arrived data relate to a set of candidate labels rather than the ground truth. Technically, we develop the Prototype-Guided Disambiguation and Replay Algorithm (PGDR) which leverages the class prototypes as a proxy to mitigate two intertwined challenges in IPLL, i.e., label ambiguity and catastrophic forgetting. To handle the former, PGDR encapsulates a momentum-based pseudo-labeling algorithm along with prototype-guided initialization, resulting in a balanced perception of classes. To alleviate forgetting, we develop a memory replay technique that collects well-disambiguated samples while maintaining representativeness and diversity. By jointly distilling knowledge from curated memory data, our framework exhibits a great disambiguation ability for samples of new tasks and achieves less forgetting of knowledge. Extensive experiments demonstrate that PGDR achieves superior
Authors: Yuxuan (Edison), Liu, Jinpei Han, Padmanabhan Ramnarayan, A. Aldo Faisal
Abstract: Clinical machine learning deployment across institutions faces significant challenges when patient populations and clinical practices differ substantially. We present a systematic framework for cross-institutional knowledge transfer in clinical time series, demonstrated through pediatric ventilation management between a general pediatric intensive care unit (PICU) and a cardiac-focused unit. Using contrastive predictive coding (CPC) for representation learning, we investigate how different data regimes and fine-tuning strategies affect knowledge transfer across institutional boundaries. Our results show that while direct model transfer performs poorly, CPC with appropriate fine-tuning enables effective knowledge sharing between institutions, with benefits particularly evident in limited data scenarios. Analysis of transfer patterns reveals an important asymmetry: temporal progression patterns transfer more readily than point-of-care decisions, suggesting practical pathways for cross-institutional deployment. Through a systematic evaluation of fine-tuning approaches and transfer patterns, our work provides insights for developing more generalizable clinical decision support systems while enabling smaller specialized units to leverage knowledge from larger centers.
Authors: Claire Bizon Monroc, Ana Bu\v{s}i\'c, Donatien Dubuc, Jiamin Zhu
Abstract: The wind farm control problem is challenging, since conventional model-based control strategies require tractable models of complex aerodynamical interactions between the turbines and suffer from the curse of dimension when the number of turbines increases. Recently, model-free and multi-agent reinforcement learning approaches have been used to address this challenge. In this article, we introduce WFCRL (Wind Farm Control with Reinforcement Learning), the first open suite of multi-agent reinforcement learning environments for the wind farm control problem. WFCRL frames a cooperative Multi-Agent Reinforcement Learning (MARL) problem: each turbine is an agent and can learn to adjust its yaw, pitch or torque to maximize the common objective (e.g. the total power production of the farm). WFCRL also offers turbine load observations that will allow to optimize the farm performance while limiting turbine structural damages. Interfaces with two state-of-the-art farm simulators are implemented in WFCRL: a static simulator (FLORIS) and a dynamic simulator (FAST.Farm). For each simulator, $10$ wind layouts are provided, including $5$ real wind farms. Two state-of-the-art online MARL algorithms are implemented to illustrate the scaling challenges. As learning online on FAST.Farm is highly time-consuming, WFCRL offers the possibility of designing transfer learning strategies from FLORIS to FAST.Farm.
Authors: Kamal Berahmand, Farid Saberi-Movahed, Razieh Sheikhpour, Yuefeng Li, Mahdi Jalili
Abstract: Spectral clustering is a powerful technique for clustering high-dimensional data, utilizing graph-based representations to detect complex, non-linear structures and non-convex clusters. The construction of a similarity graph is essential for ensuring accurate and effective clustering, making graph structure learning (GSL) central for enhancing spectral clustering performance in response to the growing demand for scalable solutions. Despite advancements in GSL, there is a lack of comprehensive surveys specifically addressing its role within spectral clustering. To bridge this gap, this survey presents a comprehensive review of spectral clustering methods, emphasizing on the critical role of GSL. We explore various graph construction techniques, including pairwise, anchor, and hypergraph-based methods, in both fixed and adaptive settings. Additionally, we categorize spectral clustering approaches into single-view and multi-view frameworks, examining their applications within one-step and two-step clustering processes. We also discuss multi-view information fusion techniques and their impact on clustering data. By addressing current challenges and proposing future research directions, this survey provides valuable insights for advancing spectral clustering methodologies and highlights the pivotal role of GSL in tackling large-scale and high-dimensional data clustering tasks.
Authors: Younes Yousef, Lukas Galke, Ansgar Scherp
Abstract: Recent approaches in hierarchical text classification (HTC) rely on the capabilities of a pre-trained transformer model and exploit the label semantics and a graph encoder for the label hierarchy. In this paper, we introduce an effective hierarchical text classifier RADAr (Transformer-based Autoregressive Decoder Architecture) that is based only on an off-the-shelf RoBERTa transformer to process the input and a custom autoregressive decoder with two decoder layers for generating the classification output. Thus, unlike existing approaches for HTC, the encoder of RADAr has no explicit encoding of the label hierarchy and the decoder solely relies on the label sequences of the samples observed during training. We demonstrate on three benchmark datasets that RADAr achieves results competitive to the state of the art with less training and inference time. Our model consistently performs better when organizing the label sequences from children to parents versus the inverse, as done in existing HTC approaches. Our experiments show that neither the label semantics nor an explicit graph encoder for the hierarchy is needed. This has strong practical implications for HTC as the architecture has fewer requirements and provides a speed-up by a factor of 2 at inference time. Moreover, training a separate decoder from scratch in conjunction with fine-tuning the encoder allows future researchers and practitioners to exchange the encoder part as new models arise. The source code is available at https://github.com/yousef-younes/RADAr.
Authors: Maria Hartmann, Gr\'egoire Danoy, Pascal Bouvry
Abstract: Federated Learning (FL) is a distributed machine learning strategy, developed for settings where training data is owned by distributed devices and cannot be shared. FL circumvents this constraint by carrying out model training in distribution. The parameters of these local models are shared intermittently among participants and aggregated to enhance model accuracy. This strategy has been rapidly adopted by the industry in efforts to overcome privacy and resource constraints in model training. However, the application of FL to real-world settings brings additional challenges associated with heterogeneity between participants. Research into mitigating these difficulties in FL has largely focused on only two types of heterogeneity: the unbalanced distribution of training data, and differences in client resources. Yet more types of heterogeneity are becoming relevant as the capability of FL expands to cover more complex problems, from the tuning of LLMs to enabling machine learning on edge devices. In this work, we discuss a novel type of heterogeneity that is likely to become increasingly relevant in future applications: this is preference heterogeneity, emerging when clients learn under multiple objectives, with different importance assigned to each objective on different clients. In this work, we discuss the implications of this type of heterogeneity and propose FedPref, a first algorithm designed to facilitate personalised FL in this setting. We demonstrate the effectiveness of the algorithm across different problems, preference distributions and model architectures. In addition, we introduce a new analytical point of view, based on multi-objective metrics, for evaluating the performance of FL algorithms in this setting beyond the traditional client-focused metrics. We perform a second experimental analysis based in this view, and show that FedPref outperforms compared algorithms.
Authors: Zhirui Chen, P. N. Karthik, Yeow Meng Chee, Vincent Y. F. Tan
Abstract: We consider a multi-armed bandit setting with finitely many arms, in which each arm yields an $M$-dimensional vector reward upon selection. We assume that the reward of each dimension (a.k.a. {\em objective}) is generated independently of the others. The best arm of any given objective is the arm with the largest component of mean corresponding to the objective. The end goal is to identify the best arm of {\em every} objective in the shortest (expected) time subject to an upper bound on the probability of error (i.e., fixed-confidence regime). We establish a problem-dependent lower bound on the limiting growth rate of the expected stopping time, in the limit of vanishing error probabilities. This lower bound, we show, is characterised by a max-min optimisation problem that is computationally expensive to solve at each time step. We propose an algorithm that uses the novel idea of {\em surrogate proportions} to sample the arms at each time step, eliminating the need to solve the max-min optimisation problem at each step. We demonstrate theoretically that our algorithm is asymptotically optimal. In addition, we provide extensive empirical studies to substantiate the efficiency of our algorithm. While existing works on pure exploration with multi-objective multi-armed bandits predominantly focus on {\em Pareto frontier identification}, our work fills the gap in the literature by conducting a formal investigation of the multi-objective best arm identification problem.
Authors: Olaya P\'erez-Mon, Juan Jos\'e del Coz, Pablo Gonz\'alez
Abstract: Quantification, or prevalence estimation, is the task of predicting the prevalence of each class within an unknown bag of examples. Most existing quantification methods in the literature rely on prior probability shift assumptions to create a quantification model that uses the predictions of an underlying classifier to make optimal prevalence estimates. In this work, we present an end-to-end neural network that uses Gaussian distributions in latent spaces to obtain invariant representations of bags of examples. This approach addresses the quantification problem using deep learning, enabling the optimization of specific loss functions relevant to the problem and avoiding the need for an intermediate classifier, tackling the quantification problem as a direct optimization problem. Our method achieves state-of-the-art results, both against traditional quantification methods and other deep learning approaches for quantification. The code needed to reproduce all our experiments is publicly available at https://github.com/AICGijon/gmnet.
Authors: Shinsaku Sakaue, Han Bao, Taira Tsuchiya
Abstract: This paper revisits the online learning approach to inverse linear optimization studied by B\"armann et al. (2017), where the goal is to infer an unknown linear objective function of an agent from sequential observations of the agent's input-output pairs. First, we provide a simple understanding of the online learning approach through its connection to online convex optimization of \emph{Fenchel--Young losses}. As a byproduct, we present an offline guarantee on the \emph{suboptimality loss}, which measures how well predicted objectives explain the agent's choices, without assuming the optimality of the agent's choices. Second, assuming that there is a gap between optimal and suboptimal objective values in the agent's decision problems, we obtain an upper bound independent of the time horizon $T$ on the sum of suboptimality and \emph{estimate losses}, where the latter measures the quality of solutions recommended by predicted objectives. Interestingly, our gap-dependent analysis achieves a faster rate than the standard $O(\sqrt{T})$ regret bound by exploiting structures specific to inverse linear optimization, even though neither the loss functions nor their domains enjoy desirable properties, such as strong convexity.
Authors: Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher
Abstract: Text classifiers suffer from small perturbations, that if chosen adversarially, can dramatically change the output of the model. Verification methods can provide robustness certificates against such adversarial perturbations, by computing a sound lower bound on the robust accuracy. Nevertheless, existing verification methods incur in prohibitive costs and cannot practically handle Levenshtein distance constraints. We propose the first method for computing the Lipschitz constant of convolutional classifiers with respect to the Levenshtein distance. We use these Lipschitz constant estimates for training 1-Lipschitz classifiers. This enables computing the certified radius of a classifier in a single forward pass. Our method, LipsLev, is able to obtain $38.80$% and $13.93$% verified accuracy at distance $1$ and $2$ respectively in the AG-News dataset, while being $4$ orders of magnitude faster than existing approaches. We believe our work can open the door to more efficient verification in the text domain.
Authors: Zihui Wu, Haichang Gao, Jiacheng Luo, Zhaoxiang Liu
Abstract: Large Language Models (LLMs) commonly rely on explicit refusal prefixes for safety, making them vulnerable to prefix injection attacks. We introduce HumorReject, a novel data-driven approach that fundamentally reimagines LLM safety by decoupling it from refusal prefixes through the use of humor as an indirect refusal strategy. Rather than explicitly rejecting harmful instructions, HumorReject responds with contextually appropriate humor that naturally defuses potentially dangerous requests while maintaining engaging interactions. Our approach effectively addresses the common "over-defense" issues in existing safety mechanisms, demonstrating superior robustness against various attack vectors while preserving natural and high-quality interactions on legitimate tasks. Our findings suggest that innovations at the data level are even more fundamental than the alignment algorithm itself in achieving effective LLM safety, opening new directions for developing more resilient and user-friendly AI systems.
Authors: Ayush K. Varshney, Konstantinos Vandikas, Vicen\c{c} Torra
Abstract: Federated Learning (FL) has emerged as a prominent distributed learning paradigm. Within the scope of privacy preservation, information privacy regulations such as GDPR entitle users to request the removal (or unlearning) of their contribution from a service that is hosting the model. For this purpose, a server hosting an ML model must be able to unlearn certain information in cases such as copyright infringement or security issues that can make the model vulnerable or impact the performance of a service based on that model. While most unlearning approaches in FL focus on Horizontal FL (HFL), where clients share the feature space and the global model, Vertical FL (VFL) has received less attention from the research community. VFL involves clients (passive parties) sharing the sample space among them while not having access to the labels. In this paper, we explore unlearning in VFL from three perspectives: unlearning clients, unlearning features, and unlearning samples. To unlearn clients and features we introduce VFU-KD which is based on knowledge distillation (KD) while to unlearn samples, VFU-GA is introduced which is based on gradient ascent. To provide evidence of approximate unlearning, we utilize Membership Inference Attack (MIA) to audit the effectiveness of our unlearning approach. Our experiments across six tabular datasets and two image datasets demonstrate that VFU-KD and VFU-GA achieve performance comparable to or better than both retraining from scratch and the benchmark R2S method in many cases, with improvements of $(0-2\%)$. In the remaining cases, utility scores remain comparable, with a modest utility loss ranging from $1-5\%$. Unlike existing methods, VFU-KD and VFU-GA require no communication between active and passive parties during unlearning. However, they do require the active party to store the previously communicated embeddings.
Authors: Yuchun Li, Zihan Lin, Xize Wang, Chunyang Liu, Liaoyuan Wu, Fang Zhang
Abstract: In modern warfare, real-time and accurate battle situation analysis is crucial for making strategic and tactical decisions. The proposed real-time battle situation intelligent awareness system (BSIAS) aims at meta-learning analysis and stepwise RNN (recurrent neural network) modeling, where the former carries out the basic processing and analysis of battlefield data, which includes multi-steps such as data cleansing, data fusion, data mining and continuously updates, and the latter optimizes the battlefield modeling by stepwise capturing the temporal dependencies of data set. BSIAS can predict the possible movement from any side of the fence and attack routes by taking a simulated battle as an example, which can be an intelligent support platform for commanders to make scientific decisions during wartime. This work delivers the potential application of integrated BSIAS in the field of battlefield command & analysis engineering.
Authors: Maria-Florina Balcan, Anh Tuan Nguyen, Dravyansh Sharma
Abstract: Modern machine learning algorithms, especially deep learning based techniques, typically involve careful hyperparameter tuning to achieve the best performance. Despite the surge of intense interest in practical techniques like Bayesian optimization and random search based approaches to automating this laborious and compute-intensive task, the fundamental learning theoretic complexity of tuning hyperparameters for deep neural networks is poorly understood. Inspired by this glaring gap, we initiate the formal study of hyperparameter tuning complexity in deep learning through a recently introduced data driven setting. We assume that we have a series of deep learning tasks, and we have to tune hyperparameters to do well on average over the distribution of tasks. A major difficulty is that the utility function as a function of the hyperparameter is very volatile and furthermore, it is given implicitly by an optimization problem over the model parameters. This is unlike previous work in data driven design, where one can typically explicitly model the algorithmic behavior as a function of the hyperparameters. To tackle this challenge, we introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance as we vary the hyperparameter, our analysis relies on subtle concepts including tools from differential/algebraic geometry and constrained optimization. This can be used to show that the learning theoretic complexity of the corresponding family of utility functions is bounded. We instantiate our results and provide sample complexity bounds for concrete applications tuning a hyperparameter that interpolates neural activation functions and setting the kernel parameter in graph neural networks.
Authors: Te Pei, Fuat Alican, Aaron Ontoyin Yin, Yigit Ihlamur
Abstract: This paper introduces GPT-HTree, a framework combining hierarchical clustering, decision trees, and large language models (LLMs) to address this challenge. By leveraging hierarchical clustering to segment individuals based on salient features, resampling techniques to balance class distributions, and decision trees to tailor classification paths within each cluster, GPT-HTree ensures both accuracy and interpretability. LLMs enhance the framework by generating human-readable cluster descriptions, bridging quantitative analysis with actionable insights.
Authors: Thomas Wedenig, Rishub Nagpal, Ga\"etan Cassiers, Stefan Mangard, Robert Peharz
Abstract: Detecting weaknesses in cryptographic algorithms is of utmost importance for designing secure information systems. The state-of-the-art soft analytical side-channel attack (SASCA) uses physical leakage information to make probabilistic predictions about intermediate computations and combines these "guesses" with the known algorithmic logic to compute the posterior distribution over the key. This attack is commonly performed via loopy belief propagation, which, however, lacks guarantees in terms of convergence and inference quality. In this paper, we develop a fast and exact inference method for SASCA, denoted as ExSASCA, by leveraging knowledge compilation and tractable probabilistic circuits. When attacking the Advanced Encryption Standard (AES), the most widely used encryption algorithm to date, ExSASCA outperforms SASCA by more than 31% top-1 success rate absolute. By leveraging sparse belief messages, this performance is achieved with little more computational cost than SASCA, and about 3 orders of magnitude less than exact inference via exhaustive enumeration. Even with dense belief messages, ExSASCA still uses 6 times less computations than exhaustive inference.
Authors: Nanjangud C. Narendra, Nithin Nagaraj
Abstract: Deep learning implemented via neural networks, has revolutionized machine learning by providing methods for complex tasks such as object detection/classification and prediction. However, architectures based on deep neural networks have started to yield diminishing returns, primarily due to their statistical nature and inability to capture causal structure in the training data. Another issue with deep learning is its high energy consumption, which is not that desirable from a sustainability perspective. Therefore, alternative approaches are being considered to address these issues, both of which are inspired by the functioning of the human brain. One approach is causal learning, which takes into account causality among the items in the dataset on which the neural network is trained. It is expected that this will help minimize the spurious correlations that are prevalent in the learned representations of deep neural networks. The other approach is Neurochaos Learning, a recent development, which draws its inspiration from the nonlinear chaotic firing intrinsic to neurons in biological neural networks (brain/central nervous system). Both approaches have shown improved results over just deep learning alone. To that end, in this position paper, we investigate how causal and neurochaos learning approaches can be integrated together to produce better results, especially in domains that contain linked data. We propose an approach for this integration to enhance classification, prediction and reinforcement learning. We also propose a set of research questions that need to be investigated in order to make this integration a reality.
Authors: Mingzhao Wang, You Zhou, Zhiguang Cao, Yubin Xiao, Xuan Wu, Wei Pang, Yuan Jiang, Hui Yang, Peng Zhao, Yuanshu Li
Abstract: Recent advances in neural models have shown considerable promise in solving Traveling Salesman Problems (TSPs) without relying on much hand-crafted engineering. However, while non-autoregressive (NAR) approaches benefit from faster inference through parallelism, they typically deliver solutions of inferior quality compared to autoregressive ones. To enhance the solution quality while maintaining fast inference, we propose DEITSP, a diffusion model with efficient iterations tailored for TSP that operates in a NAR manner. Firstly, we introduce a one-step diffusion model that integrates the controlled discrete noise addition process with self-consistency enhancement, enabling optimal solution prediction through simultaneous denoising of multiple solutions. Secondly, we design a dual-modality graph transformer to bolster the extraction and fusion of features from node and edge modalities, while further accelerating the inference with fewer layers. Thirdly, we develop an efficient iterative strategy that alternates between adding and removing noise to improve exploration compared to previous diffusion methods. Additionally, we devise a scheduling framework to progressively refine the solution space by adjusting noise levels, facilitating a smooth search for optimal solutions. Extensive experiments on real-world and large-scale TSP instances demonstrate that DEITSP performs favorably against existing neural approaches in terms of solution quality, inference latency, and generalization ability. Our code is available at $\href{https://github.com/DEITSP/DEITSP}{https://github.com/DEITSP/DEITSP}$.
URLs: https://github.com/DEITSP/DEITSP, https://github.com/DEITSP/DEITSP
Authors: Lorenz Kummer, Samir Moustafa, Wilfried Gansterer, Nils Kriege
Abstract: Bit Flip Attacks (BFAs) are a well-established class of adversarial attacks, originally developed for Convolutional Neural Networks within the computer vision domain. Most recently, these attacks have been extended to target Graph Neural Networks (GNNs), revealing significant vulnerabilities. This new development naturally raises questions about the best strategies to defend GNNs against BFAs, a challenge for which no solutions currently exist. Given the applications of GNNs in critical fields, any defense mechanism must not only maintain network performance, but also verifiably restore the network to its pre-attack state. Verifiably restoring the network to its pre-attack state also eliminates the need for costly evaluations on test data to ensure network quality. We offer first insights into the effectiveness of existing honeypot- and hashing-based defenses against BFAs adapted from the computer vision domain to GNNs, and characterize the shortcomings of these approaches. To overcome their limitations, we propose Crossfire, a hybrid approach that exploits weight sparsity and combines hashing and honeypots with bit-level correction of out-of-distribution weight elements to restore network integrity. Crossfire is retraining-free and does not require labeled data. Averaged over 2,160 experiments on six benchmark datasets, Crossfire offers a 21.8% higher probability than its competitors of reconstructing a GNN attacked by a BFA to its pre-attack state. These experiments cover up to 55 bit flips from various attacks. Moreover, it improves post-repair prediction quality by 10.85%. Computational and storage overheads are negligible compared to the inherent complexity of even the simplest GNNs.
Authors: Tanya Rodchenko, Natasha Noy, Nino Scherrer, Jennifer Prendki
Abstract: While Large Language Models require more and more data to train and scale, rather than looking for any data to acquire, we should consider what types of tasks are more likely to benefit from data scaling. We should be intentional in our data acquisition. We argue that the topology of data itself informs which tasks to prioritize in data scaling, and shapes the development of the next generation of compute paradigms for tasks where data scaling is inefficient, or even insufficient.
Authors: Rahul Bordoloi, Cl\'emence R\'eda, Saptarshi Bej
Abstract: Missing feature values are a significant hurdle for downstream machine-learning tasks such as classification and regression. However, they are pervasive in multiple real-life use cases, for instance, in drug discovery research. Moreover, imputation methods might be time-consuming and offer few guarantees on the imputation quality, especially for not-missing-at-random mechanisms. We propose an imputation approach named F3I based on the iterative improvement of a K-nearest neighbor imputation that learns the weights for each neighbor of a data point, optimizing for the most likely distribution of points over data points. This algorithm can also be jointly trained with a downstream task on the imputed values. We provide a theoretical analysis of the imputation quality by F3I for several types of missing mechanisms. We also demonstrate the performance of F3I on both synthetic data sets and real-life drug repurposing and handwritten-digit recognition data.
Authors: Michael Crawshaw, Blake Woodworth, Mingrui Liu
Abstract: We analyze two variants of Local Gradient Descent applied to distributed logistic regression with heterogeneous, separable data and show convergence at the rate $O(1/KR)$ for $K$ local steps and sufficiently large $R$ communication rounds. In contrast, all existing convergence guarantees for Local GD applied to any problem are at least $\Omega(1/R)$, meaning they fail to show the benefit of local updates. The key to our improved guarantee is showing progress on the logistic regression objective when using a large stepsize $\eta \gg 1/K$, whereas prior analysis depends on $\eta \leq 1/K$.
Authors: Zhi Sheng, Yuan Yuan, Jingtao Ding, Yong Li
Abstract: Accurate prediction of mobile traffic, \textit{i.e.,} network traffic from cellular base stations, is crucial for optimizing network performance and supporting urban development. However, the non-stationary nature of mobile traffic, driven by human activity and environmental changes, leads to both regular patterns and abrupt variations. Diffusion models excel in capturing such complex temporal dynamics due to their ability to capture the inherent uncertainties. Most existing approaches prioritize designing novel denoising networks but often neglect the critical role of noise itself, potentially leading to sub-optimal performance. In this paper, we introduce a novel perspective by emphasizing the role of noise in the denoising process. Our analysis reveals that noise fundamentally shapes mobile traffic predictions, exhibiting distinct and consistent patterns. We propose NPDiff, a framework that decomposes noise into \textit{prior} and \textit{residual} components, with the \textit{prior} derived from data dynamics, enhancing the model's ability to capture both regular and abrupt variations. NPDiff can seamlessly integrate with various diffusion-based prediction models, delivering predictions that are effective, efficient, and robust. Extensive experiments demonstrate that it achieves superior performance with an improvement over 30\%, offering a new perspective on leveraging diffusion models in this domain.
Authors: Yu Wu, Yansong Li, Zeyu Dong, Nitya Sathyavageeswaran, Anand D. Sarwate
Abstract: Deploying complex machine learning models on resource-constrained devices is challenging due to limited computational power, memory, and model retrainability. To address these limitations, a hybrid system can be established by augmenting the local model with a server-side model, where samples are selectively deferred by a rejector and then sent to the server for processing. The hybrid system enables efficient use of computational resources while minimizing the overhead associated with server usage. The recently proposed Learning to Help (L2H) model trains a server model given a fixed local (client) model, differing from the Learning to Defer (L2D) framework, which trains the client for a fixed (expert) server. In both L2D and L2H, the training includes learning a rejector at the client to determine when to query the server. In this work, we extend the L2H model from binary to multi-class classification problems and demonstrate its applicability in a number of different scenarios of practical interest in which access to the server may be limited by cost, availability, or policy. We derive a stage-switching surrogate loss function that is differentiable, convex, and consistent with the Bayes rule corresponding to the 0-1 loss for the L2H model. Experiments show that our proposed methods offer an efficient and practical solution for multi-class classification in resource-constrained environments.
Authors: Shiling Deng, Serge Belongie, Peter Ebert Christensen
Abstract: Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels overcoming the labor intensive demands of manual annotation. Additionally, we propose a meme-text retrieval CLIP model (mtrCLIP) that utilizes cross-modal embedding to enhance meme analysis, significantly improving retrieval performance. Our contributions include:(1) a novel dataset for large-scale meme study, (2) a scalable meme annotation framework, and (3) a fine-tuned CLIP for meme-text retrieval, all aimed at advancing the understanding and analysis of memes at scale.
Authors: Roel Bouman, Tom Heskes
Abstract: Autoencoders are frequently used for anomaly detection, both in the unsupervised and semi-supervised settings. They rely on the assumption that when trained using the reconstruction loss, they will be able to reconstruct normal data more accurately than anomalous data. Some recent works have posited that this assumption may not always hold, but little has been done to study the validity of the assumption in theory. In this work we show that this assumption indeed does not hold, and illustrate that anomalies, lying far away from normal data, can be perfectly reconstructed in practice. We revisit the theory of failure of linear autoencoders for anomaly detection by showing how they can perfectly reconstruct out of bounds, or extrapolate undesirably, and note how this can be dangerous in safety critical applications. We connect this to non-linear autoencoders through experiments on both tabular data and real-world image data, the two primary application areas of autoencoders for anomaly detection.
Authors: Maty\'a\v{s} Lorenc
Abstract: We explore a capability of evolution strategies to train an agent with its policy based on a transformer architecture in a reinforcement learning setting. We performed experiments using OpenAI's highly parallelizable evolution strategy to train Decision Transformer in Humanoid locomotion environment and in the environment of Atari games, testing the ability of this black-box optimization technique to train even such relatively large and complicated models (compared to those previously tested in the literature). We also proposed a method to aid the training by first pretraining the model before using the OpenAI-ES to train it further, and tested its effectiveness. The examined evolution strategy proved to be, in general, capable of achieving strong results and managed to obtain high-performing agents. Therefore, the pretraining was shown to be unnecessary; yet still, it helped us observe and formulate several further insights.
Authors: Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj
Abstract: Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight on the decision making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation on the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggest that the XAI results obtained from analyzing limited utterances don't necessarily hold when evaluated on large datasets.
Authors: Ali Abedi, Charlene H. Chu, Shehroz S. Khan
Abstract: Lower-Limb Fractures (LLF) are a major health concern for older adults, often leading to reduced mobility and prolonged recovery, potentially impairing daily activities and independence. During recovery, older adults frequently face social isolation and functional decline, complicating rehabilitation and adversely affecting physical and mental health. Multi-modal sensor platforms that continuously collect data and analyze it using machine-learning algorithms can remotely monitor this population and infer health outcomes. They can also alert clinicians to individuals at risk of isolation and decline. This paper presents a new publicly available multi-modal sensor dataset, MAISON-LLF, collected from older adults recovering from LLF in community settings. The dataset includes data from smartphone and smartwatch sensors, motion detectors, sleep-tracking mattresses, and clinical questionnaires on isolation and decline. The dataset was collected from ten older adults living alone at home for eight weeks each, totaling 560 days of 24-hour sensor data. For technical validation, supervised machine-learning and deep-learning models were developed using the sensor and clinical questionnaire data, providing a foundational comparison for the research community.
Authors: Ayush Mohanty, Nazal Mohamed, Paritosh Ramanan, Nagi Gebraeel
Abstract: Advanced sensors and IoT devices have improved the monitoring and control of complex industrial enterprises. They have also created an interdependent fabric of geographically distributed process operations (clients) across these enterprises. Granger causality is an effective approach to detect and quantify interdependencies by examining how one client's state affects others over time. Understanding these interdependencies captures how localized events, such as faults and disruptions, can propagate throughout the system, possibly causing widespread operational impacts. However, the large volume and complexity of industrial data pose challenges in modeling these interdependencies. This paper develops a federated approach to learning Granger causality. We utilize a linear state space system framework that leverages low-dimensional state estimates to analyze interdependencies. This addresses bandwidth limitations and the computational burden commonly associated with centralized data processing. We propose augmenting the client models with the Granger causality information learned by the server through a Machine Learning (ML) function. We examine the co-dependence between the augmented client and server models and reformulate the framework as a standalone ML algorithm providing conditions for its sublinear and linear convergence rates. We also study the convergence of the framework to a centralized oracle model. Moreover, we include a differential privacy analysis to ensure data security while preserving causal insights. Using synthetic data, we conduct comprehensive experiments to demonstrate the robustness of our approach to perturbations in causality, the scalability to the size of communication, number of clients, and the dimensions of raw data. We also evaluate the performance on two real-world industrial control system datasets by reporting the volume of data saved by decentralization.
Authors: Linh Tran, Wei Sun, Stacy Patterson, Ana Milanova
Abstract: Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated Prompt Learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs such as vision-language models with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting, reducing generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge by leveraging a low-rank adaptation scheme to capture generalization while maintaining a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method where we apply local differential privacy to the two low-rank components of the local prompt, and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on the model performance while balancing the tradeoff between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other benchmarks.
Authors: Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne
Abstract: Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, ${{\sf \small TDBench}}$. Based on an elaborate evaluation on ${{\sf \small TDBench}}$, resulting in 226,890 distilled datasets and 548,880 models trained on them, we demonstrate that $\texttt{TDColER}$ is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.
Authors: Linh Tran, Timothy Castiglia, Stacy Patterson, Ana Milanova
Abstract: We present Poisson Binomial Mechanism Vertical Federated Learning (PBM-VFL), a communication-efficient Vertical Federated Learning algorithm with Differential Privacy guarantees. PBM-VFL combines Secure Multi-Party Computation with the recently introduced Poisson Binomial Mechanism to protect parties' private datasets during model training. We define the novel concept of feature privacy and analyze end-to-end feature and sample privacy of our algorithm. We compare sample privacy loss in VFL with privacy loss in HFL. We also provide the first theoretical characterization of the relationship between privacy budget, convergence error, and communication cost in differentially-private VFL. Finally, we empirically show that our model performs well with high levels of privacy.
Authors: Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, Neeraja J. Yadwadkar
Abstract: Large Language Models (LLMs) are becoming ubiquitous across industries, where applications demand they fulfill diverse user intents. However, developers currently face the challenge of manually exploring numerous deployment configurations - combinations of parallelism and compression techniques that impact resource usage, latency, cost, and accuracy - to meet these intents. Assessing the impact of these configurations on user metrics requires extensive, costly profiling for each model. Existing approaches avoid this expense by using fixed, static configurations, but this often leads to sub-optimal performance and higher costs. Moreover, none of these solutions dynamically adapt to changing user intents to balance latency and cost, effectively. We present iServe, an automated, intent-based system for distributed LLM inference. Instead of manually selecting deployment configurations, developers simply specify their intent - such as minimizing latency, reducing cost, or meeting specific targets for either. iServe introduces fingerprints, lightweight representations of LLMs, to efficiently estimate how different configurations impact latency and memory usage. Based on these insights and GPU availability, iServe dynamically selects the optimal configuration to align with the user's intent. For various LLMs and query arrival rates, iServe best meets user intents compared to state-of-the-art systems by reducing latency by 77.62% and SLO violations by 7.09x while improving GPU throughput by 4.72x. Moreover, iServe's fingerprint-based profiling reduces profiling cost by 6.05x (GPU-hours) compared to baselines.
Authors: Ambreesh Parthasarathy, Chandrasekar Subramanian, Ganesh Senrayan, Shreyash Adappanavar, Aparna Taneja, Balaraman Ravindran, Milind Tambe
Abstract: Restless Multi-Armed Bandits (RMABs) have been successfully applied to resource allocation problems in a variety of settings, including public health. With the rapid development of powerful large language models (LLMs), they are increasingly used to design reward functions to better match human preferences. Recent work has shown that LLMs can be used to tailor automated allocation decisions to community needs using language prompts. However, this has been studied primarily for English prompts and with a focus on task performance only. This can be an issue since grassroots workers, especially in developing countries like India, prefer to work in local languages, some of which are low-resource. Further, given the nature of the problem, biases along population groups unintended by the user are also undesirable. In this work, we study the effects on both task performance and fairness when the DLM algorithm, a recent work on using LLMs to design reward functions for RMABs, is prompted with non-English language commands. Specifically, we run the model on a synthetic environment for various prompts translated into multiple languages. The prompts themselves vary in complexity. Our results show that the LLM-proposed reward functions are significantly better when prompted in English compared to other languages. We also find that the exact phrasing of the prompt impacts task performance. Further, as prompt complexity increases, performance worsens for all languages; however, it is more robust with English prompts than with lower-resource languages. On the fairness side, we find that low-resource languages and more complex prompts are both highly likely to create unfairness along unintended dimensions.
Authors: Alexis Huet, Zied Ben Houidi, Dario Rossi
Abstract: Episodic memory -- the ability to recall specific events grounded in time and space -- is a cornerstone of human cognition, enabling not only coherent storytelling, but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLM is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and ground their output in real-world episodic events, hence avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships -- even in contexts as short as 10k-100k tokens.
Authors: Yooseop Lee, Suin Kim, Yohan Jo
Abstract: In designing multiple-choice questions (MCQs) in education, creating plausible distractors is crucial for identifying students' misconceptions and gaps in knowledge and accurately assessing their understanding. However, prior studies on distractor generation have not paid sufficient attention to enhancing the difficulty of distractors, resulting in reduced effectiveness of MCQs. This study presents a pipeline for training a model to generate distractors that are more likely to be selected by students. First, we train a pairwise ranker to reason about students' misconceptions and assess the relative plausibility of two distractors. Using this model, we create a dataset of pairwise distractor ranks and then train a distractor generator via Direct Preference Optimization (DPO) to generate more plausible distractors. Experiments on computer science subjects (Python, DB, MLDL) demonstrate that our pairwise ranker effectively identifies students' potential misunderstandings and achieves ranking accuracy comparable to human experts. Furthermore, our distractor generator outperforms several baselines in generating plausible distractors and produces questions with a higher item discrimination index (DI).
Authors: Aniket Pramanik, Singanallur V. Venkatakrishnan, Obaidullah Rahman, Amirkoushyar Ziabari
Abstract: Industrial X-ray cone-beam CT (XCT) scanners are widely used for scientific imaging and non-destructive characterization. Industrial CBCT scanners use large detectors containing millions of pixels and the subsequent 3D reconstructions can be of the order of billions of voxels. In order to obtain high-quality reconstruction when using typical analytic algorithms, the scan involves collecting a large number of projections/views which results in large measurement times - limiting the utility of the technique. Model-based iterative reconstruction (MBIR) algorithms can produce high-quality reconstructions from fast sparse-view CT scans, but are computationally expensive and hence are avoided in practice. Single-step deep-learning (DL) based methods have demonstrated that it is possible to obtain fast and high-quality reconstructions from sparse-view data but they do not generalize well to out-of-distribution scenarios. In this work, we propose a half-quadratic splitting-based algorithm that uses convolutional neural networks (CNN) in order to obtain high-quality reconstructions from large sparse-view cone-beam CT (CBCT) measurements while overcoming the challenges with typical approaches. The algorithm alternates between the application of a CNN and a conjugate gradient (CG) step enforcing data-consistency (DC). The proposed method outperforms other methods on the publicly available Walnuts data-set.
Authors: Ramin Mousa, Meysam Afrookhteh, Hooman Khaloo, Amir Ali Bengari, Gholamreza Heidary
Abstract: Digital currencies have become popular in the last decade due to their non-dependency and decentralized nature. The price of these currencies has seen a lot of fluctuations at times, which has increased the need for prediction. As their most popular, Bitcoin(BTC) has become a research hotspot. The main challenge and trend of digital currencies, especially BTC, is price fluctuations, which require studying the basic price prediction model. This research presents a classification and regression model based on stack deep learning that uses a wavelet to remove noise to predict movements and prices of BTC at different time intervals. The proposed model based on the stacking technique uses models based on deep learning, especially neural networks and transformers, for one, seven, thirty and ninety-day forecasting. Three feature selection models, Chi2, RFE and Embedded, were also applied to the data in the pre-processing stage. The classification model achieved 63\% accuracy for predicting the next day and 64\%, 67\% and 82\% for predicting the seventh, thirty and ninety days, respectively. For daily price forecasting, the percentage error was reduced to 0.58, while the error ranged from 2.72\% to 2.85\% for seven- to ninety-day horizons. These results show that the proposed model performed better than other models in the literature.
Authors: Naman Jain, Amir Kalev
Abstract: We introduce Quantum Feature Extraction (QuFeX), a novel quantum machine learning module. The proposed module enables feature extraction in a reduced-dimensional space, significantly decreasing the number of parallel evaluations required in typical quantum convolutional neural network architectures. Its design allows seamless integration into deep classical neural networks, making it particularly suitable for hybrid quantum-classical models. As an application of QuFeX, we propose Qu-Net -- a hybrid architecture which integrates QuFeX at the bottleneck of a U-Net architecture. The latter is widely used for image segmentation tasks such as medical imaging and autonomous driving. Our numerical analysis indicates that the Qu-Net can achieve superior segmentation performance compared to a U-Net baseline. These results highlight the potential of QuFeX to enhance deep neural networks by leveraging hybrid computational paradigms, providing a path towards a robust framework for real-world applications requiring precise feature extraction.
Authors: Francesco Sacco, Dalton A R Sakthivadivel, Michael Levin
Abstract: All intelligence is collective intelligence, in the sense that it is made of parts which must align with respect to system-level goals. Understanding the dynamics which facilitate or limit navigation of problem spaces by aligned parts thus impacts many fields ranging across life sciences and engineering. To that end, consider a system on the vertices of a planar graph, with pairwise interactions prescribed by the edges of the graph. Such systems can sometimes exhibit long-range order, distinguishing one phase of macroscopic behaviour from another. In networks of interacting systems we may view spontaneous ordering as a form of self-organisation, modelling neural and basal forms of cognition. Here, we discuss necessary conditions on the topology of the graph for an ordered phase to exist, with an eye towards finding constraints on the ability of a system with local interactions to maintain an ordered target state. By studying the scaling of free energy under the formation of domain walls in three model systems -- the Potts model, autoregressive models, and hierarchical networks -- we show how the combinatorics of interactions on a graph prevent or allow spontaneous ordering. As an application we are able to analyse why multiscale systems like those prevalent in biology are capable of organising into complex patterns, whereas rudimentary language models are challenged by long sequences of outputs.
Authors: Alexander Spinos, Bradley Woosley, Justin Rokisky, Christopher Korpela, John G. Rogers III, Brian A. Bittner
Abstract: Traditionally, autonomous reconnaissance applications have acted on explicit sets of historical observations. Aided by recent breakthroughs in generative technologies, this work enables robot teams to act beyond what is currently known about the environment by inferring a distribution of reasonable interpretations of the scene. We developed a map predictor that inpaints the unknown space in a multi-agent 2D occupancy map during an exploration mission. From a comparison of several inpainting methods, we found that a fine-tuned latent diffusion inpainting model could provide rich and coherent interpretations of simulated urban environments with relatively little computation time. By iteratively inferring interpretations of the scene throughout an exploration run, we are able to identify areas that exhibit high uncertainty in the prediction, which we formalize with the concept of generative entropy. We prioritize tasks in regions of high generative entropy, hypothesizing that this will expedite convergence on an accurate predicted map of the scene. In our study we juxtapose this new paradigm of task ranking with the state of the art, which ranks regions to explore by those which maximize expected information recovery. We compare both of these methods in a simulated urban environment with three vehicles. Our results demonstrate that by using our new task ranking method, we can predict a correct scene significantly faster than with a traditional information-guided method.
Authors: Yang Bai, Christan Earl Grant, Daisy Zhe Wang
Abstract: Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: https://github.com/TonyBY/RAMQA
Authors: Yixuan Wang, Leonor Fermoselle, Tarik Kelestemur, Jiuguang Wang, Yunzhu Li
Abstract: Mobile exploration is a longstanding challenge in robotics, yet current methods primarily focus on active perception instead of active interaction, limiting the robot's ability to interact with and fully explore its environment. Existing robotic exploration approaches via active interaction are often restricted to tabletop scenes, neglecting the unique challenges posed by mobile exploration, such as large exploration spaces, complex action spaces, and diverse object relations. In this work, we introduce a 3D relational object graph that encodes diverse object relations and enables exploration through active interaction. We develop a system based on this representation and evaluate it across diverse scenes. Our qualitative and quantitative results demonstrate the system's effectiveness and generalization capabilities, outperforming methods that rely solely on vision-language models (VLMs).
Authors: Tianyuan Yao, Zhiyuan Li, Praitayini Kanakaraj, Derek B. Archer, Kurt Schilling, Lori Beason-Held, Susan Resnick, Bennett A. Landman, Yuankai Huo
Abstract: Diffusion-weighted Magnetic Resonance Imaging (dMRI) is an essential tool in neuroimaging. It is arguably the sole noninvasive technique for examining the microstructural properties and structural connectivity of the brain. Recent years have seen the emergence of machine learning and data-driven approaches that enhance the speed, accuracy, and consistency of dMRI data analysis. However, traditional deep learning models often fell short, as they typically utilize pixel-level or volumetric patch-level embeddings similar to those used in structural MRI, and do not account for the unique distribution of various gradient encodings. In this paper, we propose a novel method called Polyhedra Encoding Transformer (PE-Transformer) for dMRI, designed specifically to handle spherical signals. Our approach involves projecting an icosahedral polygon onto a unit sphere to resample signals from predetermined directions. These resampled signals are then transformed into embeddings, which are processed by a transformer encoder that incorporates orientational information reflective of the icosahedral structure. Through experimental validation with various gradient encoding protocols, our method demonstrates superior accuracy in estimating multi-compartment models and Fiber Orientation Distributions (FOD), outperforming both conventional CNN architectures and standard transformers.
Authors: Yuzhuo Li, Di Zhao, Yihao Wu, Yun Sing Koh
Abstract: Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing methods rely exclusively on visual data, neglecting environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. To bridge this gap, we propose the Meta-Feature Adapter (MFA), a lightweight module designed to integrate environmental metadata into vision-language foundation models, such as CLIP, to enhance Animal ReID performance. Our approach translates environmental metadata into natural language descriptions, encodes them into metadata-aware text embeddings, and incorporates these embeddings into image features through a cross-attention mechanism. Furthermore, we introduce a Gated Cross-Attention mechanism that dynamically adjusts the weights of metadata contributions, further improving performance. To validate our approach, we constructed the Metadata Augmented Animal Re-identification (MAAR) dataset, encompassing six species from New Zealand and featuring paired image data and environmental metadata. Extensive experiments demonstrate that MFA consistently improves Animal ReID performance across multiple baseline models.
Authors: Bishwash Panerua, Biplov Paneru
Abstract: This study focuses on membrane-based systems for CO$_2$ separation, addressing the urgent need for efficient carbon capture solutions to mitigate climate change. Linear regression models, based on membrane equations, were utilized to estimate key parameters, including porosity ($\epsilon$) of 0.4805, Kozeny constant (K) of 2.9084, specific surface area ($\sigma$) of 105.3272 m$^2$/m$^3$, mean pressure (Pm) of 6.2166 MPa, viscosity ($\mu$) of 0.1997 Ns/m$^2$, and gas flux (Jg) of 3.2559 kg m$^{-2}$ s$^{-1}$. These parameters were derived from the analysis of synthetic datasets using linear regression. The study also provides insights into the performance of the membrane, with a flow rate (Q) of 9.8778 $\times$ 10$^{-4}$ m$^3$/s, an injection pressure (P$_1$) of 2.8219 MPa, and an exit pressure (P$_2$) of 2.5762 MPa. The permeability value of 0.045 for CO$_2$ indicates the potential for efficient separation. Optimizing membrane properties to selectively block CO$_2$ while allowing other gases to pass is crucial for improving carbon capture efficiency. By integrating these technologies into industrial processes, significant reductions in greenhouse gas emissions can be achieved, fostering a circular carbon economy and contributing to global climate goals. This study also explores how artificial intelligence (AI) can aid in designing membranes for carbon capture, addressing the global climate change challenge and supporting the Sustainable Development Goals (SDGs) set by the United Nations.
Authors: Meng-Ping Lin, Jen-Cheng Hou, Chia-Wei Chen, Shao-Yi Chien, Jun-Cheng Chen, Xugang Lu, Yu Tsao
Abstract: Speech Enhancement (SE) aims to improve the quality of noisy speech. It has been shown that additional visual cues can further improve performance. Given that speech communication involves audio, visual, and linguistic modalities, it is natural to expect another performance boost by incorporating linguistic information. However, bridging the modality gaps to efficiently incorporate linguistic information, along with audio and visual modalities during knowledge transfer, is a challenging task. In this paper, we propose a novel multi-modality learning framework for SE. In the model framework, a state-of-the-art diffusion Model backbone is utilized for Audio-Visual Speech Enhancement (AVSE) modeling where both audio and visual information are directly captured by microphones and video cameras. Based on this AVSE, the linguistic modality employs a PLM to transfer linguistic knowledge to the visual acoustic modality through a process termed Cross-Modal Knowledge Transfer (CMKT) during AVSE model training. After the model is trained, it is supposed that linguistic knowledge is encoded in the feature processing of the AVSE model by the CMKT, and the PLM will not be involved during inference stage. We carry out SE experiments to evaluate the proposed model framework. Experimental results demonstrate that our proposed AVSE system significantly enhances speech quality and reduces generative artifacts, such as phonetic confusion compared to the state-of-the-art. Moreover, our visualization results demonstrate that our Cross-Modal Knowledge Transfer method further improves the generated speech quality of our AVSE system. These findings not only suggest that Diffusion Model-based techniques hold promise for advancing the state-of-the-art in AVSE but also justify the effectiveness of incorporating linguistic information to improve the performance of Diffusion-based AVSE systems.
Authors: Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang
Abstract: Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly replacing some tokens in the input sentences with $\texttt{[MASK]}$ tokens and predicting the original tokens based on the remaining context. This paper explores the impact of $\texttt{[MASK]}$ tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $\texttt{[MASK]}$ tokens in the input context and models the dependencies between these expanded states. This expansion increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enhances semantic representations through context enhancement, and effectively reduces the multimodality problem commonly observed in MLMs.
Authors: Gyuhyeon Pak, Euntai Kim
Abstract: Recently, map representations based on radiance fields such as 3D Gaussian Splatting and NeRF, which excellent for realistic depiction, have attracted considerable attention, leading to attempts to combine them with SLAM. While these approaches can build highly realistic maps, large-scale SLAM still remains a challenge because they require a large number of Gaussian images for mapping and adjacent images as keyframes for tracking. We propose a novel 3D Gaussian Splatting SLAM method, VIGS SLAM, that utilizes sensor fusion of RGB-D and IMU sensors for large-scale indoor environments. To reduce the computational load of 3DGS-based tracking, we adopt an ICP-based tracking framework that combines IMU preintegration to provide a good initial guess for accurate pose estimation. Our proposed method is the first to propose that Gaussian Splatting-based SLAM can be effectively performed in large-scale environments by integrating IMU sensor measurements. This proposal not only enhances the performance of Gaussian Splatting SLAM beyond room-scale scenarios but also achieves SLAM performance comparable to state-of-the-art methods in large-scale indoor environments.
Authors: Jaewon Lee, Mangyu Kong, Minseong Park, Euntai Kim
Abstract: Mapping and localization are crucial problems in robotics and autonomous driving. Recent advances in 3D Gaussian Splatting (3DGS) have enabled precise 3D mapping and scene understanding by rendering photo-realistic images. However, existing 3DGS methods often struggle to accurately reconstruct a 3D map that reflects the actual scale and geometry of the real world, which degrades localization performance. To address these limitations, we propose a novel 3DGS method called Geometry-Aware Gaussian Splatting (GeomGS). This method fully integrates LiDAR data into 3D Gaussian primitives via a probabilistic approach, as opposed to approaches that only use LiDAR as initial points or introduce simple constraints for Gaussian points. To this end, we introduce a Geometric Confidence Score (GCS), which identifies the structural reliability of each Gaussian point. The GCS is optimized simultaneously with Gaussians under probabilistic distance constraints to construct a precise structure. Furthermore, we propose a novel localization method that fully utilizes both the geometric and photometric properties of GeomGS. Our GeomGS demonstrates state-of-the-art geometric and localization performance across several benchmarks, while also improving photometric performance.
Authors: Anoop Mishra, Deepak Khazanchi
Abstract: In machine learning (ML) applications, unfairness is triggered due to bias in the data, the data curation process, erroneous assumptions, and implicit bias rendered during the development process. It is also well-accepted by researchers that fairness in ML application development is highly subjective, with a lack of clarity of what it means from an ML development and implementation perspective. Thus, in this research, we investigate and formalize the notion of the perceived fairness of ML development from a sociotechnical lens. Our goal in this research is to understand the characteristics of perceived fairness in ML applications. We address this research goal using a three-pronged strategy: 1) conducting virtual focus groups with ML developers, 2) reviewing existing literature on fairness in ML, and 3) incorporating aspects of justice theory relating to procedural and distributive justice. Based on our theoretical exposition, we propose operational attributes of perceived fairness to be transparency, accountability, and representativeness. These are described in terms of multiple concepts that comprise each dimension of perceived fairness. We use this operationalization to empirically validate the notion of perceived fairness of machine learning (ML) applications from both the ML practioners and users perspectives. The multidimensional framework for perceived fairness offers a comprehensive understanding of perceived fairness, which can guide the creation of fair ML systems with positive implications for society and businesses.
Authors: Bo Gao, Michael W. Spratling
Abstract: Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic length scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens. When combined with our proposed attention mechanism, this approach demonstrates significant promise in managing longer sequences, maintaining nearly constant validation loss even at 16$\times$ the training token length while ensuring numerical stability. Our code is available at: https://github.com/iminfine/freeatten.
Authors: Samer Attrah
Abstract: Emotion estimation in general is a field that has been studied for a long time, and several approaches exist using machine learning. in this paper, we present an LSTM model, that processes the blend-shapes produced by the library MediaPipe, for a face detected in a live stream of a camera, to estimate the main emotion from the facial expressions, this model is trained on the FER2013 dataset and delivers a result of 71% accuracy and 62% f1-score which meets the accuracy benchmark of the FER2013 dataset, with significantly reduced computation costs. https://github.com/ Samir-atra/Emotion_estimation_from_video_footage_with_LSTM_ML_algorithm
URLs: https://github.com/
Authors: Deepak Ghimire, Dayoung Kil, Seonghwan Jeong, Jaesik Park, Seong-heum Kim
Abstract: Existing structured pruning typically involves multi-stage training procedures that often demand heavy computation. Pruning at initialization, which aims to address this limitation, reduces training costs but struggles with performance. To address these challenges, we propose an efficient framework for one-cycle structured pruning without compromising model performance. In this approach, we integrate pre-training, pruning, and fine-tuning into a single training cycle, referred to as the `one cycle approach'. The core idea is to search for the optimal sub-network during the early stages of network training, guided by norm-based group saliency criteria and structured sparsity regularization. We introduce a novel pruning indicator that determines the stable pruning epoch by assessing the similarity between evolving pruning sub-networks across consecutive training epochs. Also, group sparsity regularization helps to accelerate the pruning process and results in speeding up the entire process. Extensive experiments on datasets, including CIFAR-10/100, and ImageNet, using VGGNet, ResNet, MobileNet, and ViT architectures, demonstrate that our method achieves state-of-the-art accuracy while being one of the most efficient pruning frameworks in terms of training time. The source code will be made publicly available.
Authors: Simeon Emanuilov, Aleksandar Dimov
Abstract: This paper presents a novel approach for similarity search with complex filtering capabilities on billion-scale datasets, optimized for CPU inference. Our method extends the classical IVF-Flat index structure to integrate multi-dimensional filters. The proposed algorithm combines dense embeddings with discrete filtering attributes, enabling fast retrieval in high-dimensional spaces. Designed specifically for CPU-based systems, our disk-based approach offers a cost-effective solution for large-scale similarity search. We demonstrate the effectiveness of our method through a case study, showcasing its potential for various practical uses.
Authors: Yulong Hu, Siyuan Feng, Sen Li
Abstract: This paper introduces Localized Bipartite Match Graph Attention Q-Learning (BMG-Q), a novel Multi-Agent Reinforcement Learning (MARL) algorithm framework tailored for ride-pooling order dispatch. BMG-Q advances ride-pooling decision-making process with the localized bipartite match graph underlying the Markov Decision Process, enabling the development of novel Graph Attention Double Deep Q Network (GATDDQN) as the MARL backbone to capture the dynamic interactions among ride-pooling vehicles in fleet. Our approach enriches the state information for each agent with GATDDQN by leveraging a localized bipartite interdependence graph and enables a centralized global coordinator to optimize order matching and agent behavior using Integer Linear Programming (ILP). Enhanced by gradient clipping and localized graph sampling, our GATDDQN improves scalability and robustness. Furthermore, the inclusion of a posterior score function in the ILP captures the online exploration-exploitation trade-off and reduces the potential overestimation bias of agents, thereby elevating the quality of the derived solutions. Through extensive experiments and validation, BMG-Q has demonstrated superior performance in both training and operations for thousands of vehicle agents, outperforming benchmark reinforcement learning frameworks by around 10% in accumulative rewards and showing a significant reduction in overestimation bias by over 50%. Additionally, it maintains robustness amidst task variations and fleet size changes, establishing BMG-Q as an effective, scalable, and robust framework for advancing ride-pooling order dispatch operations.
Authors: Ruijia Liu, Ancheng Hou, Xiao Yu, Xiang Yin
Abstract: Signal Temporal Logic (STL) is a powerful specification language for describing complex temporal behaviors of continuous signals, making it well-suited for high-level robotic task descriptions. However, generating executable plans for STL tasks is challenging, as it requires consideration of the coupling between the task specification and the system dynamics. Existing approaches either follow a model-based setting that explicitly requires knowledge of the system dynamics or adopt a task-oriented data-driven approach to learn plans for specific tasks. In this work, we investigate the problem of generating executable STL plans for systems whose dynamics are unknown a priori. We propose a new planning framework that uses only task-agnostic data during the offline training stage, enabling zero-shot generalization to new STL tasks. Our framework is hierarchical, involving: (i) decomposing the STL task into a set of progress and time constraints, (ii) searching for time-aware waypoints guided by task-agnostic data, and (iii) generating trajectories using a pre-trained safe diffusion model. Simulation results demonstrate the effectiveness of our method indeed in achieving zero-shot generalization to various STL tasks.
Authors: Le Xu, Lei Cheng, Junting Chen, Wenqiang Pu, Xiao Fu
Abstract: Radio map estimation (RME), also known as spectrum cartography, aims to reconstruct the strength of radio interference across different domains (e.g., space and frequency) from sparsely sampled measurements. To tackle this typical inverse problem, state-of-the-art RME methods rely on handcrafted or data-driven structural information of radio maps. However, the former often struggles to model complex radio frequency (RF) environments and the latter requires excessive training -- making it hard to quickly adapt to in situ sensing tasks. This work presents a spatio-spectral RME approach based on plug-and-play (PnP) denoising, a technique from computational imaging. The idea is to leverage the observation that the denoising operations of signals like natural images and radio maps are similar -- despite the nontrivial differences of the signals themselves. Hence, sophisticated denoisers designed for or learned from natural images can be directly employed to assist RME, avoiding using radio map data for training. Unlike conventional PnP methods that operate directly in the data domain, the proposed method exploits the underlying physical structure of radio maps and proposes an ADMM algorithm that denoises in a latent domain. This design significantly improves computational efficiency and enhances noise robustness. Theoretical aspects, e.g., recoverability of the complete radio map and convergence of the ADMM algorithm are analyzed. Synthetic and real data experiments are conducted to demonstrate the effectiveness of our approach.
Authors: Aayush Mishra, Daniel Habermann, Marvin Schmitt, Stefan T. Radev, Paul-Christian B\"urkner
Abstract: Neural amortized Bayesian inference (ABI) can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, neural ABI is not yet sufficiently robust for widespread and safe applicability. In particular, when performing inference on observations outside of the scope of the simulated data seen during training, for example, because of model misspecification, the posterior approximations are likely to become highly biased. Due to the bad pre-asymptotic behavior of current neural posterior estimators in the out-of-simulation regime, the resulting estimation biases cannot be fixed in acceptable time by just simulating more training data. In this proof-of-concept paper, we propose a semi-supervised approach that enables training not only on (labeled) simulated data generated from the model, but also on unlabeled data originating from any source, including real-world data. To achieve the latter, we exploit Bayesian self-consistency properties that can be transformed into strictly proper losses without requiring knowledge of true parameter values, that is, without requiring data labels. The results of our initial experiments show remarkable improvements in the robustness of ABI on out-of-simulation data. Even if the observed data is far away from both labeled and unlabeled training data, inference remains highly accurate. If our findings also generalize to other scenarios and model classes, we believe that our new method represents a major breakthrough in neural ABI.
Authors: Zicheng Pan, Xiaohan Yu, Weichuan Zhang, Yongsheng Gao
Abstract: Source-free active domain adaptation (SFADA) addresses the challenge of adapting a pre-trained model to new domains without access to source data while minimizing the need for target domain annotations. This scenario is particularly relevant in real-world applications where data privacy, storage limitations, or labeling costs are significant concerns. Key challenges in SFADA include selecting the most informative samples from the target domain for labeling, effectively leveraging both labeled and unlabeled target data, and adapting the model without relying on source domain information. Additionally, existing methods often struggle with noisy or outlier samples and may require impractical progressive labeling during training. To effectively select more informative samples without frequently requesting human annotations, we propose the Propensity-driven Uncertainty Learning (ProULearn) framework. ProULearn utilizes a novel homogeneity propensity estimation mechanism combined with correlation index calculation to evaluate feature-level relationships. This approach enables the identification of representative and challenging samples while avoiding noisy outliers. Additionally, we develop a central correlation loss to refine pseudo-labels and create compact class distributions during adaptation. In this way, ProULearn effectively bridges the domain gap and maximizes adaptation performance. The principles of informative sample selection underlying ProULearn have broad implications beyond SFADA, offering benefits across various deep learning tasks where identifying key data points or features is crucial. Extensive experiments on four benchmark datasets demonstrate that ProULearn outperforms state-of-the-art methods in domain adaptation scenarios.
Authors: Wenzhuo Ma, Zhenzhong Chen
Abstract: Recently, foundational diffusion models have attracted considerable attention in image compression tasks, whereas their application to video compression remains largely unexplored. In this article, we introduce DiffVC, a diffusion-based perceptual neural video compression framework that effectively integrates foundational diffusion model with the video conditional coding paradigm. This framework uses temporal context from previously decoded frame and the reconstructed latent representation of the current frame to guide the diffusion model in generating high-quality results. To accelerate the iterative inference process of diffusion model, we propose the Temporal Diffusion Information Reuse (TDIR) strategy, which significantly enhances inference efficiency with minimal performance loss by reusing the diffusion information from previous frames. Additionally, to address the challenges posed by distortion differences across various bitrates, we propose the Quantization Parameter-based Prompting (QPP) mechanism, which utilizes quantization parameters as prompts fed into the foundational diffusion model to explicitly modulate intermediate features, thereby enabling a robust variable bitrate diffusion-based neural compression framework. Experimental results demonstrate that our proposed solution delivers excellent performance in both perception metrics and visual quality.
Authors: Wailing Tang, Biqi Yang, Pheng-Ann Heng, Yun-Hui Liu, Chi-Wing Fu
Abstract: Few-shot Semantic Segmentation (FSS) is a challenging task that utilizes limited support images to segment associated unseen objects in query images. However, recent FSS methods are observed to perform worse, when enlarging the number of shots. As the support set enlarges, existing FSS networks struggle to concentrate on the high-contributed supports and could easily be overwhelmed by the low-contributed supports that could severely impair the mask predictions. In this work, we study this challenging issue, called support dilution, our goal is to recognize, select, preserve, and enhance those high-contributed supports in the raw support pool. Technically, our method contains three novel parts. First, we propose a contribution index, to quantitatively estimate if a high-contributed support dilutes. Second, we develop the Symmetric Correlation (SC) module to preserve and enhance the high-contributed support features, minimizing the distraction by the low-contributed features. Third, we design the Support Image Pruning operation, to retrieve a compact and high quality subset by discarding low-contributed supports. We conduct extensive experiments on two FSS benchmarks, COCO-20i and PASCAL-5i, the segmentation results demonstrate the compelling performance of our solution over state-of-the-art FSS approaches. Besides, we apply our solution for online segmentation and real-world segmentation, convincing segmentation results showing the practical ability of our work for real-world demonstrations.
Authors: Francis Rhys Ward
Abstract: I am a person and so are you. Philosophically we sometimes grant personhood to non-human animals, and entities such as sovereign states or corporations can legally be considered persons. But when, if ever, should we ascribe personhood to AI systems? In this paper, we outline necessary conditions for AI personhood, focusing on agency, theory-of-mind, and self-awareness. We discuss evidence from the machine learning literature regarding the extent to which contemporary AI systems, such as language models, satisfy these conditions, finding the evidence surprisingly inconclusive. If AI systems can be considered persons, then typical framings of AI alignment may be incomplete. Whereas agency has been discussed at length in the literature, other aspects of personhood have been relatively neglected. AI agents are often assumed to pursue fixed goals, but AI persons may be self-aware enough to reflect on their aims, values, and positions in the world and thereby induce their goals to change. We highlight open research directions to advance the understanding of AI personhood and its relevance to alignment. Finally, we reflect on the ethical considerations surrounding the treatment of AI systems. If AI systems are persons, then seeking control and alignment may be ethically untenable.
Authors: Nicolas Menet (ETH Z\"urich), Jonas H\"ubotter (ETH Z\"urich), Parnian Kassraie (ETH Z\"urich), Andreas Krause (ETH Z\"urich)
Abstract: We consider the problem of computing the probability of maximality (PoM) of a Gaussian random vector, i.e., the probability for each dimension to be maximal. This is a key challenge in applications ranging from Bayesian optimization to reinforcement learning, where the PoM not only helps with finding an optimal action, but yields a fine-grained analysis of the action domain, crucial in tasks such as drug discovery. Existing techniques are costly, scaling polynomially in computation and memory with the vector size. We introduce LITE, the first approach for estimating Gaussian PoM with almost-linear time and memory complexity. LITE achieves SOTA accuracy on a number of tasks, while being in practice several orders of magnitude faster than the baselines. This also translates to a better performance on downstream tasks such as entropy estimation and optimal control of bandits. Theoretically, we cast LITE as entropy-regularized UCB and connect it to prior PoM estimators.
Authors: G Krishnakumar, Abhishek Sinha
Abstract: We consider an online channel scheduling problem for a single transmitter-receiver pair equipped with $N$ arbitrarily varying wireless channels. The transmission rates of the channels might be non-stationary and could be controlled by an oblivious adversary. At every slot, incoming data arrives at an infinite-capacity data queue located at the transmitter. A scheduler, which is oblivious to the current channel rates, selects one of the $N$ channels for transmission. At the end of the slot, the scheduler only gets to know the transmission rate of the selected channel. The objective is to minimize the queue length regret, defined as the difference between the queue length at some time $T$ achieved by an online policy and the queue length obtained by always transmitting over the single best channel in hindsight. We propose a weakly adaptive Multi-Armed Bandit (MAB) algorithm for minimizing the queue length regret in this setup. Unlike previous works, we do not make any stability assumptions about the queue or the arrival process. Hence, our result holds even when the queueing process is unstable. Our main observation is that the queue length regret can be upper bounded by the regret of a MAB policy that competes against the best channel in hindsight uniformly over all sub-intervals of $[T]$. As a technical contribution of independent interest, we then propose a weakly adaptive adversarial MAB policy which achieves $\tilde{O}(\sqrt{N}T^{\frac{3}{4}})$ regret with high probability, implying the same bound for queue length regret.
Authors: Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri
Abstract: Artificial intelligence (AI) is expected to significantly enhance radio resource management (RRM) in sixth-generation (6G) networks. However, the lack of explainability in complex deep learning (DL) models poses a challenge for practical implementation. This paper proposes a novel explainable AI (XAI)- based framework for feature selection and model complexity reduction in a model-agnostic manner. Applied to a multi-agent deep reinforcement learning (MADRL) setting, our approach addresses the joint sub-band assignment and power allocation problem in cellular vehicle-to-everything (V2X) communications. We propose a novel two-stage systematic explainability framework leveraging feature relevance-oriented XAI to simplify the DRL agents. While the former stage generates a state feature importance ranking of the trained models using Shapley additive explanations (SHAP)-based importance scores, the latter stage exploits these importance-based rankings to simplify the state space of the agents by removing the least important features from the model input. Simulation results demonstrate that the XAI-assisted methodology achieves 97% of the original MADRL sum-rate performance while reducing optimal state features by 28%, average training time by 11%, and trainable weight parameters by 46% in a network with eight vehicular pairs.
Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Abstract: Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training in large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.
Authors: Yuecheng Zhang, Guanhua Fang, Wen Yu
Abstract: Event stream is an important data format in real life. The events are usually expected to follow some regular patterns over time. However, the patterns could be contaminated by unexpected absences or occurrences of events. In this paper, we adopt the temporal point process framework for learning event stream and we provide a simple-but-effective method to deal with both commission and omission event outliers.In particular, we introduce a novel weight function to dynamically adjust the importance of each observed event so that the final estimator could offer multiple statistical merits. We compare the proposed method with the vanilla one in the classification problems, where event streams can be clustered into different groups. Both theoretical and numerical results confirm the effectiveness of our new approach. To our knowledge, our method is the first one to provably handle both commission and omission outliers simultaneously.
Authors: Oliver Goldstein
Abstract: This paper proposes a novel representation of molecules through Algebraic Data Types (ADTs). The representation has useful properties primarily by including type information. The representation uses the Dietz representation enabling representation of organometallics with multi-centre, multi-atom bonding and delocalised electrons, resonant structures and co-ordinate data of atoms. Furthermore, this representation goes further than any other in the literature, providing a natural data structure to represent shells, subshells and orbitals. Perks of the representation include it's natural inclusion in reaction descriptions and the ability to make molecules instances of algebraic groups. The representation is further motivated as providing guarantees for those wishing to do Bayesian machine learning (probabilistic programming) over molecular structures. A criticism of competing and commonly used representations such as SMILES and SELFIES is provided and solutions are proposed to the weaknesses of these along with an open source library, written in Haskell. An example of integrating the library with LazyPPL -- a lazy probabilistic programming library written in Haskell -- is provided, conceptually justifying the efficiency of the representation over string based representations and recent work such as SELFIES. This library distinguishes between the data and the type of data -- enabling a separation of concerns between interface and object. I solve three problems associated with the future of SELFIES, molecular programming language, 3D information, syntactic invalidity and Dietz representation.
Authors: Pravin Pandey (Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany), Julia Reuter (Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany), Christoph Steup (Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany), Sanaz Mostaghim (Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany, Fraunhofer Institute for Transportation and Infrastructure Systems IVI, Dresden, Germany)
Abstract: This paper shows how a Graph Neural Network (GNN) can be used to learn an Inverse Kinematics (IK) based on an automatically generated dataset. The generated Inverse Kinematics is generalized to a family of manipulators with the same Degree of Freedom (DOF), but varying link length configurations. The results indicate a position error of less than 1.0 cm for 3 DOF and 4.5 cm for 5 DOF, and orientation error of 2$^\circ$ for 3 DOF and 8.2$^\circ$ for 6 DOF, which allows the deployment to certain real world-problems. However, out-of-domain errors and lack of extrapolation can be observed in the resulting GNN. An extensive analysis of these errors indicates potential for enhancement in the future. Consequently, the generated GNNs are tailored to be used in future work as an inductive bias to generate analytical equations through symbolic regression.
Authors: Stefanos Bakirtzis, \c{C}a\u{g}kan Yapar, Kehai Qiu, Ian Wassell, Jie Zhang
Abstract: To encourage further research and to facilitate fair comparisons in the development of deep learning-based radio propagation models, in the less explored case of directional radio signal emissions in indoor propagation environments, we have launched the ICASSP 2025 First Indoor Pathloss Radio Map Prediction Challenge. This overview paper describes the indoor path loss prediction problem, the datasets used, the Challenge tasks, and the evaluation methodology. Finally, the results of the Challenge and a summary of the submitted methods are presented.
Authors: Fabian Raisch, Thomas Krug, Christoph Goebel, Benjamin Tischler
Abstract: Transfer Learning (TL) is an emerging field in modeling building thermal dynamics. This method reduces the data required for a data-driven model of a target building by leveraging knowledge from a source building. Consequently, it enables the creation of data-efficient models that can be used for advanced control and fault detection & diagnosis. A major limitation of the TL approach is its inconsistent performance across different sources. Although accurate source-building selection for a target is crucial, it remains a persistent challenge. We present GenTL, a general transfer learning model for single-family houses in Central Europe. GenTL can be efficiently fine-tuned to a large variety of target buildings. It is pretrained on a Long Short-Term Memory (LSTM) network with data from 450 different buildings. The general transfer learning model eliminates the need for source-building selection by serving as a universal source for fine-tuning. Comparative analysis with conventional single-source to single-target TL demonstrates the efficacy and reliability of the general pretraining approach. Testing GenTL on 144 target buildings for fine-tuning reveals an average prediction error (RMSE) reduction of 42.1 % compared to fine-tuning single-source models.
Authors: Abdulrahman Oladipupo Ibraheem
Abstract: I introduce two novel loss functions for classification in deep learning. The two loss functions extend standard cross entropy loss by regularizing it with minimum entropy and Kullback-Leibler (K-L) divergence terms. The first of the two novel loss functions is termed mixed entropy loss (MIX-ENT for short), while the second one is termed minimum entropy regularized cross-entropy loss (MIN-ENT for short). The MIX-ENT function introduces a regularizer that can be shown to be equivalent to the sum of a minimum entropy term and a K-L divergence term. However, it should be noted that the K-L divergence term here is different from that in the standard cross-entropy loss function, in the sense that it swaps the roles of the target probability and the hypothesis probability. The MIN-ENT function simply adds a minimum entropy regularizer to the standard cross entropy loss function. In both MIX-ENT and MIN-ENT, the minimum entropy regularizer minimizes the entropy of the hypothesis probability distribution which is output by the neural network. Experiments on the EMNIST-Letters dataset shows that my implementation of MIX-ENT and MIN-ENT lets the VGG model climb from its previous 3rd position on the paperswithcode leaderboard to reach the 2nd position on the leaderboard, outperforming the Spinal-VGG model in so doing. Specifically, using standard cross-entropy, VGG achieves 95.86% while Spinal-VGG achieves 95.88% classification accuracies, whereas using VGG (without Spinal-VGG) our MIN-ENT achieved 95.933%, while our MIX-ENT achieved 95.927% accuracies. The pre-trained models for both MIX-ENT and MIN-ENT are at https://github.com/rahmanoladi/minimum entropy project.
Authors: Mark Chevallier, Filip Smola, Richard Schmoetten, Jacques D. Fleuriot
Abstract: We present a novel formalisation of tensor semantics for linear temporal logic on finite traces (LTLf), with formal proofs of correctness carried out in the theorem prover Isabelle/HOL. We demonstrate that this formalisation can be integrated into a neurosymbolic learning process by defining and verifying a differentiable loss function for the LTLf constraints, and automatically generating an implementation that integrates with PyTorch. We show that, by using this loss, the process learns to satisfy pre-specified logical constraints. Our approach offers a fully rigorous framework for constrained training, eliminating many of the inherent risks of ad-hoc, manual implementations of logical aspects directly in an "unsafe" programming language such as Python, while retaining efficiency in implementation.
Authors: Timothy Chase Jr, Christopher Wilson, Karthik Dantu
Abstract: The in-situ detection of planetary, lunar, and small-body surface terrain is crucial for autonomous spacecraft applications, where learning-based computer vision methods are increasingly employed to enable intelligence without prior information or human intervention. However, many of these methods remain computationally expensive for spacecraft processors and prevent real-time operation. Training of such algorithms is additionally complex due to the scarcity of labeled data and reliance on supervised learning approaches. Unsupervised Domain Adaptation (UDA) offers a promising solution by facilitating model training with disparate data sources such as simulations or synthetic scenes, although UDA is difficult to apply to celestial environments where challenging feature spaces are paramount. To alleviate such issues, You Only Crash Once (YOCOv1) has studied the integration of Visual Similarity-based Alignment (VSA) into lightweight one-stage object detection architectures to improve space terrain UDA. Although proven effective, the approach faces notable limitations, including performance degradations in multi-class and high-altitude scenarios. Building upon the foundation of YOCOv1, we propose novel additions to the VSA scheme that enhance terrain detection capabilities under UDA, and our approach is evaluated across both simulated and real-world data. Our second YOCO rendition, YOCOv2, is capable of achieving state-of-the-art UDA performance on surface terrain detection, where we showcase improvements upwards of 31% compared with YOCOv1 and terrestrial state-of-the-art. We demonstrate the practical utility of YOCOv2 with spacecraft flight hardware performance benchmarking and qualitative evaluation of NASA mission data.
Authors: Rafael P. Eufrazio, Eduardo Fernandes Montesuma, Charles C. Cavalcante
Abstract: Analyzing relationships between objects is a pivotal problem within data science. In this context, Dimensionality reduction (DR) techniques are employed to generate smaller and more manageable data representations. This paper proposes a new method for dimensionality reduction, based on optimal transportation theory and the Gromov-Wasserstein distance. We offer a new probabilistic view of the classical Multidimensional Scaling (MDS) algorithm and the nonlinear dimensionality reduction algorithm, Isomap (Isometric Mapping or Isometric Feature Mapping) that extends the classical MDS, in which we use the Gromov-Wasserstein distance between the probability measure of high-dimensional data, and its low-dimensional representation. Through gradient descent, our method embeds high-dimensional data into a lower-dimensional space, providing a robust and efficient solution for analyzing complex high-dimensional datasets.
Authors: Ziheng Wang, Toni Lassila, Sharib Ali
Abstract: In real-world data, long-tailed data distribution is common, making it challenging for models trained on empirical risk minimisation to learn and classify tail classes effectively. While many studies have sought to improve long tail recognition by altering the data distribution in the feature space and adjusting model decision boundaries, research on the synergy and corrective approach among various methods is limited. Our study delves into three long-tail recognition techniques: Supervised Contrastive Learning (SCL), Rare-Class Sample Generator (RSG), and Label-Distribution-Aware Margin Loss (LDAM). SCL enhances intra-class clusters based on feature similarity and promotes clear inter-class separability but tends to favour dominant classes only. When RSG is integrated into the model, we observed that the intra-class features further cluster towards the class centre, which demonstrates a synergistic effect together with SCL's principle of enhancing intra-class clustering. RSG generates new tail features and compensates for the tail feature space squeezed by SCL. Similarly, LDAM is known to introduce a larger margin specifically for tail classes; we demonstrate that LDAM further bolsters the model's performance on tail classes when combined with the more explicit decision boundaries achieved by SCL and RSG. Furthermore, SCL can compensate for the dominant class accuracy sacrificed by RSG and LDAM. Our research emphasises the synergy and balance among the three techniques, with each amplifying the strengths of the others and mitigating their shortcomings. Our experiment on long-tailed distribution datasets, using an end-to-end architecture, yields competitive results by enhancing tail class accuracy without compromising dominant class performance, achieving a balanced improvement across all classes.
Authors: Yumeng Wang, Ziran Zhou, Junjin Wang
Abstract: Effective sentence embeddings that capture semantic nuances and generalize well across diverse contexts are crucial for natural language processing tasks. We address this challenge by applying SimCSE (Simple Contrastive Learning of Sentence Embeddings) using contrastive learning to fine-tune the minBERT model for sentiment analysis, semantic textual similarity (STS), and paraphrase detection. Our contributions include experimenting with three different dropout techniques, namely standard dropout, curriculum dropout, and adaptive dropout, to tackle overfitting, proposing a novel 2-Tier SimCSE Fine-tuning Model that combines both unsupervised and supervised SimCSE on STS task, and exploring transfer learning potential for Paraphrase and SST tasks. Our findings demonstrate the effectiveness of SimCSE, with the 2-Tier model achieving superior performance on the STS task, with an average test score of 0.742 across all three downstream tasks. The results of error analysis reveals challenges in handling complex sentiments and reliance on lexical overlap for paraphrase detection, highlighting areas for future research. The ablation study revealed that removing Adaptive Dropout in the Single-Task Unsupervised SimCSE Model led to improved performance on the STS task, indicating overfitting due to added parameters. Transfer learning from SimCSE models on Paraphrase and SST tasks did not enhance performance, suggesting limited transferability of knowledge from the STS task.
Authors: Erjia Xiao, Hao Cheng, Jing Shao, Jinhao Duan, Kaidi Xu, Le Yang, Jindong Gu, Renjing Xu
Abstract: Large Language Models (LLMs) demonstrate remarkable zero-shot performance across various natural language processing tasks. The integration of multimodal encoders extends their capabilities, enabling the development of Multimodal Large Language Models that process vision, audio, and text. However, these capabilities also raise significant security concerns, as these models can be manipulated to generate harmful or inappropriate content through jailbreak. While extensive research explores the impact of modality-specific input edits on text-based LLMs and Large Vision-Language Models in jailbreak, the effects of audio-specific edits on Large Audio-Language Models (LALMs) remain underexplored. Hence, this paper addresses this gap by investigating how audio-specific edits influence LALMs inference regarding jailbreak. We introduce the Audio Editing Toolbox (AET), which enables audio-modality edits such as tone adjustment, word emphasis, and noise injection, and the Edited Audio Datasets (EADs), a comprehensive audio jailbreak benchmark. We also conduct extensive evaluations of state-of-the-art LALMs to assess their robustness under different audio edits. This work lays the groundwork for future explorations on audio-modality interactions in LALMs security.
Authors: Trung-Khang Tran, Thach V. Bui
Abstract: The main goal of group testing is to identify a small number of defective items in a large population of items. A test on a subset of items is positive if the subset contains at least one defective item and negative otherwise. In non-adaptive design, all tests can be tested simultaneously and represented by a measurement matrix in which a row and a column represent a test and an item, respectively. An entry in row $i$ and column $j$ is 1 if item $j$ belongs to the test $i$ and is 0 otherwise. Given an unknown set of defective items, the objective is to design a measurement matrix such that, by observing its corresponding outcome vector, the defective items can be recovered efficiently. The basic trait of this approach is that the measurement matrix has remained unchanged throughout the course of generating the outcome vector and recovering defective items. In this paper, we study the case in which some entries in the measurement matrix are erased, called \emph{the missing measurement matrix}, before the recovery phase of the defective items, and our objective is to fully recover the measurement matrix from the missing measurement matrix. In particular, we show that some specific rows with erased entries provide information aiding the recovery while others do not. Given measurement matrices and erased entries follow the Bernoulli distribution, we show that before the erasing event happens, sampling sufficient sets of defective items and their corresponding outcome vectors can help us recover the measurement matrix from the missing measurement matrix.
Authors: Ping He, Lorenzo Cavallaro, Shouling Ji
Abstract: Android malware presents a persistent threat to users' privacy and data integrity. To combat this, researchers have proposed machine learning-based (ML-based) Android malware detection (AMD) systems. However, adversarial Android malware attacks compromise the detection integrity of the ML-based AMD systems, raising significant concerns. Existing defenses against adversarial Android malware provide protections against feature space attacks which generate adversarial feature vectors only, leaving protection against realistic threats from problem space attacks which generate real adversarial malware an open problem. In this paper, we address this gap by proposing ADD, a practical adversarial Android malware defense framework designed as a plug-in to enhance the adversarial robustness of the ML-based AMD systems against problem space attacks. Our extensive evaluation across various ML-based AMD systems demonstrates that ADD is effective against state-of-the-art problem space adversarial Android malware attacks. Additionally, ADD shows the defense effectiveness in enhancing the adversarial robustness of real-world antivirus solutions.
Authors: Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang
Abstract: This survey delves into the realm of Parameter-Efficient Fine-Tuning (PEFT) within the context of Foundation Models (FMs). PEFT, a cost-effective fine-tuning technique, minimizes parameters and computational complexity while striving for optimal downstream task performance. FMs, like ChatGPT, DALL-E, and LLaVA specialize in language understanding, generative tasks, and multimodal tasks, trained on diverse datasets spanning text, images, and videos. The diversity of FMs guides various adaptation strategies for PEFT. Therefore, this survey aims to provide a comprehensive overview of PEFT techniques applied to diverse FMs and address critical gaps in understanding the techniques, trends, and applications. We start by providing a detailed development of FMs and PEFT. Subsequently, we systematically review the key categories and core mechanisms of PEFT across diverse FMs to offer a comprehensive understanding of trends. We also explore the most recent applications across various FMs to demonstrate the versatility of PEFT, shedding light on the integration of systematic PEFT methods with a range of FMs. Furthermore, we identify potential research and development directions for improving PEFTs in the future. This survey provides a valuable resource for both newcomers and experts seeking to understand and use the power of PEFT across FMs. All reviewed papers are listed at \url{https://github.com/THUDM/Awesome-Parameter-Efficient-Fine-Tuning-for-Foundation-Models}.
URLs: https://github.com/THUDM/Awesome-Parameter-Efficient-Fine-Tuning-for-Foundation-Models
Authors: Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Abstract: Deep neural networks are increasingly employed in high-stakes medical applications, despite their tendency for shortcut learning in the presence of spurious correlations, which can have potentially fatal consequences in practice. Detecting and mitigating shortcut behavior is a challenging task that often requires significant labeling efforts from domain experts. To alleviate this problem, we introduce a semi-automated framework for the identification of spurious behavior from both data and model perspective by leveraging insights from eXplainable Artificial Intelligence (XAI). This allows the retrieval of spurious data points and the detection of model circuits that encode the associated prediction rules. Moreover, we demonstrate how these shortcut encodings can be used for XAI-based sample- and pixel-level data annotation, providing valuable information for bias mitigation methods to unlearn the undesired shortcut behavior. We show the applicability of our framework using four medical datasets across two modalities, featuring controlled and real-world spurious correlations caused by data artifacts. We successfully identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision Transformer models, ultimately increasing their robustness and applicability for real-world medical tasks.
Authors: Ian V\"alimaa, Lasse Leskel\"a
Abstract: High-order clustering aims to classify objects in multiway datasets that are prevalent in various fields such as bioinformatics, social network analysis, and recommendation systems. These tasks often involve data that is sparse and high-dimensional, presenting significant statistical and computational challenges. This paper introduces a tensor block model specifically designed for sparse integer-valued data tensors. We propose a simple spectral clustering algorithm augmented with a trimming step to mitigate noise fluctuations, and identify a density threshold that ensures the algorithm's consistency. Our approach models sparsity using a sub-Poisson noise concentration framework, accommodating heavier than sub-Gaussian tails. Remarkably, this natural class of tensor block models is closed under aggregation across arbitrary modes. Consequently, we obtain a comprehensive framework for evaluating the tradeoff between signal loss and noise reduction during data aggregation. The analysis is based on a novel concentration bound for sparse random Gram matrices. The theoretical findings are illustrated through simulation experiments.
Authors: Tharini Suresh, Salma Afifi, Sudeep Pasricha
Abstract: Generative Adversarial Networks (GANs) are at the forefront of AI innovation, driving advancements in areas such as image synthesis, medical imaging, and data augmentation. However, the unique computational operations within GANs, such as transposed convolutions and instance normalization, introduce significant inefficiencies when executed on traditional electronic accelerators, resulting in high energy consumption and suboptimal performance. To address these challenges, we introduce PhotoGAN, the first silicon-photonic accelerator designed to handle the specialized operations of GAN models. By leveraging the inherent high throughput and energy efficiency of silicon photonics, PhotoGAN offers an innovative, reconfigurable architecture capable of accelerating transposed convolutions and other GAN-specific layers. The accelerator also incorporates a sparse computation optimization technique to reduce redundant operations, improving computational efficiency. Our experimental results demonstrate that PhotoGAN achieves at least 4.4x higher GOPS and 2.18x lower energy-per-bit (EPB) compared to state-of-the-art accelerators, including GPUs and TPUs. These findings showcase PhotoGAN as a promising solution for the next generation of GAN acceleration, providing substantial gains in both performance and energy efficiency.
Authors: Yan Yang, Bin Gao, Ya-xiang Yuan
Abstract: Imposing additional constraints on low-rank optimization has garnered growing interest. However, the geometry of coupled constraints hampers the well-developed low-rank structure and makes the problem intricate. To this end, we propose a space-decoupling framework for optimization on bounded-rank matrices with orthogonally invariant constraints. The ``space-decoupling" is reflected in several ways. We show that the tangent cone of coupled constraints is the intersection of tangent cones of each constraint. Moreover, we decouple the intertwined bounded-rank and orthogonally invariant constraints into two spaces, leading to optimization on a smooth manifold. Implementing Riemannian algorithms on this manifold is painless as long as the geometry of additional constraints is known. In addition, we unveil the equivalence between the reformulated problem and the original problem. Numerical experiments on real-world applications -- spherical data fitting, graph similarity measuring, low-rank SDP, model reduction of Markov processes, reinforcement learning, and deep learning -- validate the superiority of the proposed framework.
Authors: Hao Zhang, Felix Stahlberg, Shankar Kumar
Abstract: Large Language Models (LLMs) excel at rewriting tasks such as text style transfer and grammatical error correction. While there is considerable overlap between the inputs and outputs in these tasks, the decoding cost still increases with output length, regardless of the amount of overlap. By leveraging the overlap between the input and the output, Kaneko and Okazaki (2023) proposed model-agnostic edit span representations to compress the rewrites to save computation. They reported an output length reduction rate of nearly 80% with minimal accuracy impact in four rewriting tasks. In this paper, we propose alternative edit phrase representations inspired by phrase-based statistical machine translation. We systematically compare our phrasal representations with their span representations. We apply the LLM rewriting model to the task of Automatic Speech Recognition (ASR) post editing and show that our target-phrase-only edit representation has the best efficiency-accuracy trade-off. On the LibriSpeech test set, our method closes 50-60% of the WER gap between the edit span model and the full rewrite model while losing only 10-20% of the length reduction rate of the edit span model.
Authors: Mohammad Ali Rezaei, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Abstract: Accurate prediction of pedestrian trajectories is crucial for enhancing the safety of autonomous vehicles and reducing traffic fatalities involving pedestrians. While numerous studies have focused on modeling interactions among pedestrians to forecast their movements, the influence of environmental factors and scene-object placements has been comparatively underexplored. In this paper, we present a novel trajectory prediction model that integrates both pedestrian interactions and environmental context to improve prediction accuracy. Our approach captures spatial and temporal interactions among pedestrians within a sparse graph framework. To account for pedestrian-scene interactions, we employ advanced image enhancement and semantic segmentation techniques to extract detailed scene features. These scene and interaction features are then fused through a cross-attention mechanism, enabling the model to prioritize relevant environmental factors that influence pedestrian movements. Finally, a temporal convolutional network processes the fused features to predict future pedestrian trajectories. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches, achieving ADE and FDE values of 0.252 and 0.372 meters, respectively, underscoring the importance of incorporating both social interactions and environmental context in pedestrian trajectory prediction.
Authors: Timo Lange, Ajish Babu, Philipp Meyer, Matthis Keppner, Tim Tiedemann, Martin Wittmaier, Sebastian Wolff, Thomas V\"ogele
Abstract: Current disposal facilities for coarse-grained waste perform manual sorting of materials with heavy machinery. Large quantities of recyclable materials are lost to coarse waste, so more effective sorting processes must be developed to recover them. Two key aspects to automate the sorting process are object detection with material classification in mixed piles of waste, and autonomous control of hydraulic machinery. Because most objects in those accumulations of waste are damaged or destroyed, object detection alone is not feasible in the majority of cases. To address these challenges, we propose a classification of materials with multispectral images of ultraviolet (UV), visual (VIS), near infrared (NIR), and short-wave infrared (SWIR) spectrums. Solution for autonomous control of hydraulic heavy machines for sorting of bulky waste is being investigated using cost-effective cameras and artificial intelligence-based controllers.
Authors: Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu
Abstract: We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in both fine-grained visual understanding and detailed language generation. Furthermore, we leverage Pix2Cap-COCO for Supervised Fine-Tuning (SFT) on large multimodal models (LMMs) to enhance their performance. For example, training with Pix2Cap-COCO significantly improves the performance of GPT4RoI, yielding gains in CIDEr +1.4%, ROUGE +0.4%, and SPICE +0.5% on Visual Genome dataset, and strengthens its region understanding ability on the ViP-BENCH, with an overall improvement of +5.1%, including notable increases in recognition accuracy +11.2% and language generation quality +22.2%.
Authors: Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu
Abstract: Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works of GUI action grounding leverage large GUI datasets to fine-tune MLLMs. However, the fine-tuning data always covers limited GUI environments, and we find the performance of the resulting model deteriorates in novel environments. We argue that the GUI grounding models should be further aligned to the novel environments to reveal their full potential, when the inference is known to involve novel environments, i.e., environments not used during the previous fine-tuning. To realize this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration and then continuously fine-tune GUI grounding models with the collected data. Our agent leverages a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) method to optimize exploration efficiency and data quality. Additionally, we introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments and demonstrate the effectiveness of data collected by GUI-Bee in the experiments. Furthermore, we conduct an ablation study to validate the Q-ICRL method in enhancing the efficiency of GUI-Bee. Project page: https://gui-bee.github.io
Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.
Authors: Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
Abstract: With the rapid development of diffusion models, text-to-image(T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed the IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism, and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
Authors: Hao Dong, Eleni Chatzi, Olga Fink
Abstract: Test-time adaptation (TTA) has demonstrated significant potential in addressing distribution shifts between training and testing data. Open-set test-time adaptation (OSTTA) aims to adapt a source pre-trained model online to an unlabeled target domain that contains unknown classes. This task becomes more challenging when multiple modalities are involved. Existing methods have primarily focused on unimodal OSTTA, often filtering out low-confidence samples without addressing the complexities of multimodal data. In this work, we present Adaptive Entropy-aware Optimization (AEO), a novel framework specifically designed to tackle Multimodal Open-set Test-time Adaptation (MM-OSTTA) for the first time. Our analysis shows that the entropy difference between known and unknown samples in the target domain strongly correlates with MM-OSTTA performance. To leverage this, we propose two key components: Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP). These components enhance the ability of model to distinguish unknown class samples during online adaptation by amplifying the entropy difference between known and unknown samples. To thoroughly evaluate our proposed methods in the MM-OSTTA setting, we establish a new benchmark derived from existing datasets. This benchmark includes two downstream tasks and incorporates five modalities. Extensive experiments across various domain shift situations demonstrate the efficacy and versatility of the AEO framework. Additionally, we highlight the strong performance of AEO in long-term and continual MM-OSTTA settings, both of which are challenging and highly relevant to real-world applications. Our source code is available at https://github.com/donghao51/AEO.
Authors: Alex Glushkovsky
Abstract: The periodic table is a fundamental representation of chemical elements that plays essential theoretical and practical roles. The research article discusses the experiences of unsupervised training of neural networks to represent elements on the 2D latent space based on their electron configurations. To emphasize chemical properties of the elements, the original data of electron configurations has been realigned towards valence orbitals. Recognizing seven shells and four subshells, the input data has been arranged as 7x4 images. Latent space representation has been performed using a convolutional beta variational autoencoder (beta-VAE). Despite discrete and sparse input data, the beta-VAE disentangles elements of different periods, blocks, groups, and types. The unsupervised representation of elements on the latent space reveals pairwise symmetries of periods and elements related to the invariance of quantum numbers of corresponding elements. In addition, it isolates outliers that turned out to be known cases of Madelung's rule violations for lanthanide and actinide elements. Considering the generative capabilities of beta-VAE, the supervised machine learning has been set to find out if there are insightful patterns distinguishing electron configurations between real elements and decoded artificial ones. Also, the article addresses the capability of dual representation by autoencoders. Conventionally, autoencoders represent observations of input data on the latent space. By transposing and duplicating original input data, it is possible to represent variables on the latent space which can lead to the discovery of meaningful patterns among input variables. Applying that unsupervised learning for transposed data of electron configurations, the order of input variables that has been arranged by the encoder on the latent space has turned out to exactly match the sequence of Madelung's rule.
Authors: Filipa Valdeira, Cl\'audia Soares
Abstract: In this work, we leverage a generative data model considering comparison noise to develop a fast, precise, and informative ranking algorithm from pairwise comparisons that produces a measure of confidence on each comparison. The problem of ranking a large number of items from noisy and sparse pairwise comparison data arises in diverse applications, like ranking players in online games, document retrieval or ranking human perceptions. Although different algorithms are available, we need fast, large-scale algorithms whose accuracy degrades gracefully when the number of comparisons is too small. Fitting our proposed model entails solving a non-convex optimization problem, which we tightly approximate by a sum of quasi-convex functions and a regularization term. Resorting to an iterative reweighted minimization and the Primal-Dual Hybrid Gradient method, we obtain PD-Rank, achieving a Kendall tau 0.1 higher than all comparing methods, even for 10\% of wrong comparisons in simulated data matching our data model, and leading in accuracy if data is generated according to the Bradley-Terry model, in both cases faster by one order of magnitude, in seconds. In real data, PD-Rank requires less computational time to achieve the same Kendall tau than active learning methods.
Authors: Simone Luetto, Fabrizio Garuti, Enver Sangineto, Lorenzo Forni, Rita Cucchiara
Abstract: There is a recent growing interest in applying Deep Learning techniques to tabular data, in order to replicate the success of other Artificial Intelligence areas in this structured domain. Specifically interesting is the case in which tabular data have a time dependence, such as, for instance financial transactions. However, the heterogeneity of the tabular values, in which categorical elements are mixed with numerical items, makes this adaptation difficult. In this paper we propose a Transformer architecture to represent heterogeneous time-dependent tabular data, in which numerical features are represented using a set of frequency functions and the whole network is uniformly trained with a unique loss function.
Authors: Subrata Biswas, Mohammad Nur Hossain Khan, Alex Colwell, Jack Adiletta, Bashima Islam
Abstract: Accurate sound source localization (SSL) requires consistent multichannel data for reliable degree of arrival (DoA) estimation. However, intermittently powered batteryless systems often suffer from incomplete sensor data due to the stochastic nature of energy harvesting. Existing methods struggle with missing channels, leading to significant performance degradation. In this paper, we propose $\textit{LOCUS}$, a novel deep learning-based system designed to recover corrupted features for SSL in batteryless systems. $\textit{LOCUS}$ addresses missing data by leveraging information entropy estimation and conditional interpolation, combining three modules: (1) Information-Weighted Focus (InFo), which identifies and quantifies corrupted data elements, (2) Latent Feature Synthesizer (LaFS), which synthesizes missing features, and (3) Guided Replacement (GRep), which intelligently replaces missing elements while preserving valid data. We demonstrate significant performance improvements using two datasets: DCASE and LargeSet, where $\textit{LOCUS}$ achieves up to $36.91\%$ lower DoA error compared to existing methods. Real-world evaluations across three environments with intermittent power sources show a $25.87-59.46\%$ improvement in performance when channels are stochastically missing. Additionally, we release a 50-hour multichannel dataset to support further research in SSL.
Authors: Mark Deutel, Georgios Kontes, Christopher Mutschler, J\"urgen Teich
Abstract: Deploying deep neural networks (DNNs) on microcontrollers (TinyML) is a common trend to process the increasing amount of sensor data generated at the edge, but in practice, resource and latency constraints make it difficult to find optimal DNN candidates. Neural architecture search (NAS) is an excellent approach to automate this search and can easily be combined with DNN compression techniques commonly used in TinyML. However, many NAS techniques are not only computationally expensive, especially hyperparameter optimization (HPO), but also often focus on optimizing only a single objective, e.g., maximizing accuracy, without considering additional objectives such as memory requirements or computational complexity of a DNN, which are key to making deployment at the edge feasible. In this paper, we propose a novel NAS strategy for TinyML based on multi-objective Bayesian optimization (MOBOpt) and an ensemble of competing parametric policies trained using Augmented Random Search (ARS) reinforcement learning (RL) agents. Our methodology aims at efficiently finding tradeoffs between a DNN's predictive accuracy, memory requirements on a given target system, and computational complexity. Our experiments show that we consistently outperform existing MOBOpt approaches on different datasets and architectures such as ResNet-18 and MobileNetV3.
Authors: Tiezhi Wang, Nils Strodthoff
Abstract: Scoring sleep stages in polysomnography recordings is a time-consuming task plagued by significant inter-rater variability. Therefore, it stands to benefit from the application of machine learning algorithms. While many algorithms have been proposed for this purpose, certain critical architectural decisions have not received systematic exploration. In this study, we meticulously investigate these design choices within the broad category of encoder-predictor architectures. We identify robust architectures applicable to both time series and spectrogram input representations. These architectures incorporate structured state space models as integral components and achieve statistically significant performance improvements compared to state-of-the-art approaches on the extensive Sleep Heart Health Study dataset. We anticipate that the architectural insights gained from this study along with the refined methodology for architecture search demonstrated herein will not only prove valuable for future research in sleep staging but also hold relevance for other time series annotation tasks.
Authors: Pierre Wolinski
Abstract: When training large models, such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible, due to their computational cost. This is why, among the second-order optimization methods, it is very common to bypass the computation of the Hessian by using first-order information, such as the gradient of the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives on well-chosen subspaces, which are relevant for optimization. Namely, for a given partition of the set of parameters, it is possible to compute tensors which can be seen as "higher-order derivatives according to the partition", at a reasonable cost as long as the number of subsets of the partition remains small. Then, we propose an optimization method exploiting these tensors at order 2 and 3 with several interesting properties, including: it outputs a learning rate per subset of parameters, which can be used for hyperparameter tuning; it takes into account long-range interactions between the layers of the trained neural network, which is usually not the case in similar methods (e.g., K-FAC); the trajectory of the optimization is invariant under affine layer-wise reparameterization. Code available at https://github.com/p-wol/GroupedNewton/ .
Authors: Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra
Abstract: Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as return. It is increasingly important to adjust the performance of AI agents to meet human requirements, for example, in applications like video games and education tools. Decision Transformer (DT) optimizes a policy that generates actions conditioned on the target return through supervised learning and includes a mechanism to control the agent's performance using the target return. However, the action generation is hardly influenced by the target return because DT's self-attention allocates scarce attention scores to the return tokens. In this paper, we propose Return-Aligned Decision Transformer (RADT), designed to more effectively align the actual return with the target return. RADT leverages features extracted by paying attention solely to the return, enabling action generation to consistently depend on the target return. Extensive experiments show that RADT significantly reduces the discrepancies between the actual return and the target return compared to DT-based methods.
Authors: Benjamin Plaut, Hanlin Zhu, Stuart Russell
Abstract: Most learning algorithms with formal regret guarantees assume that all mistakes are recoverable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are \emph{catastrophic}, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We first show that in general, any algorithm either constantly queries the mentor or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
Authors: Futa Waseda, Ching-Chun Chang, Isao Echizen
Abstract: Adversarial training often suffers from a robustness-accuracy trade-off, where achieving high robustness comes at the cost of accuracy. One approach to mitigate this trade-off is leveraging invariance regularization, which encourages model invariance under adversarial perturbations; however, it still leads to accuracy loss. In this work, we closely analyze the challenges of using invariance regularization in adversarial training and understand how to address them. Our analysis identifies two key issues: (1) a ``gradient conflict" between invariance and classification objectives, leading to suboptimal convergence, and (2) the mixture distribution problem arising from diverged distributions between clean and adversarial inputs. To address these issues, we propose Asymmetric Representation-regularized Adversarial Training (ARAT), which incorporates asymmetric invariance loss with stop-gradient operation and a predictor to avoid gradient conflict, and a split-BatchNorm (BN) structure to resolve the mixture distribution problem. Our detailed analysis demonstrates that each component effectively addresses the identified issues, offering novel insights into adversarial defense. ARAT shows superiority over existing methods across various settings. Finally, we discuss the implications of our findings to knowledge distillation-based defenses, providing a new perspective on their relative successes.
Authors: Eunjee Choi, Jong-Kook Kim
Abstract: Detecting fake news has received a lot of attention. Many previous methods concatenate independently encoded unimodal data, ignoring the benefits of integrated multimodal information. Also, the absence of specialized feature extraction for text and images further limits these methods. This paper introduces an end-to-end model called TT-BLIP that applies the bootstrapping language-image pretraining for unified vision-language understanding and generation (BLIP) for three types of information: BERT and BLIPTxt for text, ResNet and BLIPImg for images, and bidirectional BLIP encoders for multimodal information. The Multimodal Tri-Transformer fuses tri-modal features using three types of multi-head attention mechanisms, ensuring integrated modalities for enhanced representations and improved multimodal data analysis. The experiments are performed using two fake news datasets, Weibo and Gossipcop. The results indicate TT-BLIP outperforms the state-of-the-art models.
Authors: Felix den Breejen, Sangmin Bae, Stephen Cha, Se-Young Yun
Abstract: The recently introduced TabPFN pretrains an In-Context Learning (ICL) transformer on synthetic data to perform tabular data classification. In this work, we extend TabPFN to the fine-tuning setting, resulting in a significant performance boost. We also discover that fine-tuning enables ICL-transformers to create complex decision boundaries, a property regular neural networks do not have. Based on this observation, we propose to pretrain ICL-transformers on a new forest dataset generator which creates datasets that are unrealistic, but have complex decision boundaries. TabForest, the ICL-transformer pretrained on this dataset generator, shows better fine-tuning performance when pretrained on more complex datasets. Additionally, TabForest outperforms TabPFN on some real-world datasets when fine-tuning, despite having lower zero-shot performance due to the unrealistic nature of the pretraining datasets. By combining both dataset generators, we create TabForestPFN, an ICL-transformer that achieves excellent fine-tuning performance and good zero-shot performance.
Authors: Xinyi Gao, Guanhua Ye, Tong Chen, Wentao Zhang, Junliang Yu, Hongzhi Yin
Abstract: The increasing prevalence of large-scale graphs poses a significant challenge for graph neural network training, attributed to their substantial computational requirements. In response, graph condensation (GC) emerges as a promising data-centric solution aiming to substitute the large graph with a small yet informative condensed graph to facilitate data-efficient GNN training. However, existing GC methods suffer from intricate optimization processes, necessitating excessive computing resources and training time. In this paper, we revisit existing GC optimization strategies and identify two pervasive issues therein: (1) various GC optimization strategies converge to coarse-grained class-level node feature matching between the original and condensed graphs; (2) existing GC methods rely on a Siamese graph network architecture that requires time-consuming bi-level optimization with iterative gradient computations. To overcome these issues, we propose a training-free GC framework termed Class-partitioned Graph Condensation (CGC), which refines the node distribution matching from the class-to-class paradigm into a novel class-to-node paradigm, transforming the GC optimization into a class partition problem which can be efficiently solved by any clustering methods. Moreover, CGC incorporates a pre-defined graph structure to enable a closed-form solution for condensed node features, eliminating the need for back-and-forth gradient descent in existing GC approaches. Extensive experiments demonstrate that CGC achieves an exceedingly efficient condensation process with advanced accuracy. Compared with the state-of-the-art GC methods, CGC condenses the Ogbn-products graph within 30 seconds, achieving a speedup ranging from $10^2$X to $10^4$X and increasing accuracy by up to 4.2%.
Authors: Hyunseung Kim, Byungkun Lee, Hojoon Lee, Dongyoon Hwang, Donghu Kim, Jaegul Choo
Abstract: Unsupervised skill discovery is a learning paradigm that aims to acquire diverse behaviors without explicit rewards. However, it faces challenges in learning complex behaviors and often leads to learning unsafe or undesirable behaviors. For instance, in various continuous control tasks, current unsupervised skill discovery methods succeed in learning basic locomotions like standing but struggle with learning more complex movements such as walking and running. Moreover, they may acquire unsafe behaviors like tripping and rolling or navigate to undesirable locations such as pitfalls or hazardous areas. In response, we present DoDont (Do's and Don'ts), an instruction-based skill discovery algorithm composed of two stages. First, in an instruction learning stage, DoDont leverages action-free instruction videos to train an instruction network to distinguish desirable transitions from undesirable ones. Then, in the skill learning stage, the instruction network adjusts the reward function of the skill discovery algorithm to weight the desired behaviors. Specifically, we integrate the instruction network into a distance-maximizing skill discovery algorithm, where the instruction network serves as the distance function. Empirically, with less than 8 instruction videos, DoDont effectively learns desirable behaviors and avoids undesirable ones across complex continuous control tasks. Code and videos are available at https://mynsng.github.io/dodont/
Authors: Mengdan Zhu, Raasikh Kanjiani, Jiahui Lu, Andrew Choi, Qirui Ye, Liang Zhao
Abstract: Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces \textit{LatentExplainer}, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. \textit{LatentExplainer} tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. Our approach perturbs latent variables, interpreting changes in generated data, and uses multi-modal large language models (MLLMs) to produce human-understandable explanations. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations for latent variables. The results highlight the effectiveness of incorporating inductive biases and uncertainty quantification, significantly enhancing model interpretability.
Authors: Hubert Baniecki, Giuseppe Casalicchio, Bernd Bischl, Przemyslaw Biecek
Abstract: We discover a theoretical connection between explanation estimation and distribution compression that significantly improves the approximation of feature attributions, importance, and effects. While the exact computation of various machine learning explanations requires numerous model inferences and becomes impractical, the computational cost of approximation increases with an ever-increasing size of data and model parameters. We show that the standard i.i.d. sampling used in a broad spectrum of algorithms for post-hoc explanation leads to an approximation error worthy of improvement. To this end, we introduce Compress Then Explain (CTE), a new paradigm of sample-efficient explainability. It relies on distribution compression through kernel thinning to obtain a data sample that best approximates its marginal distribution. CTE significantly improves the accuracy and stability of explanation estimation with negligible computational overhead. It often achieves an on-par explanation approximation error 2-3x faster by using fewer samples, i.e. requiring 2-3x fewer model evaluations. CTE is a simple, yet powerful, plug-in for any explanation method that now relies on i.i.d. sampling.
Authors: Werner Zellinger
Abstract: Estimating the ratio of two probability densities from a finite number of observations is a central machine learning problem. A common approach is to construct estimators using binary classifiers that distinguish observations from the two densities. However, the accuracy of these estimators depends on the choice of the binary loss function, raising the question of which loss function to choose based on desired error properties. For example, traditional loss functions, such as logistic or boosting loss, prioritize accurate estimation of small density ratio values over large ones, even though the latter are more critical in many applications. In this work, we start with prescribed error measures in a class of Bregman divergences and characterize all loss functions that result in density ratio estimators with small error. Our characterization extends results on composite binary losses from (Reid & Williamson, 2010) and their connection to density ratio estimation as identified by (Menon & Ong, 2016). As a result, we obtain a simple recipe for constructing loss functions with certain properties, such as those that prioritize an accurate estimation of large density ratio values. Our novel loss functions outperform related approaches for resolving parameter choice issues of 11 deep domain adaptation algorithms in average performance across 484 real-world tasks including sensor signals, texts, and images.
Authors: Simone Appella, Simon Arridge, Chris Budd, Teo Deveney, Lisa Maria Kreusser
Abstract: We consider the problem of univariate nonlinear function approximation using shallow neural networks (NN) with a rectified linear unit (ReLU) activation function. We show that the $L_2$ based approximation problem is ill-conditioned and the behaviour of optimisation algorithms used in training these networks degrades rapidly as the width of the network increases. This can lead to significantly poorer approximation in practice than expected from the theoretical expressivity of the ReLU architecture and traditional methods such as univariate Free Knot Splines (FKS). Univariate shallow ReLU NNs and FKS span the same function space, and thus have the same theoretical expressivity. However, the FKS representation remains well-conditioned as the number of knots increases. We leverage the theory of optimal piecewise linear interpolants to improve the training procedure for ReLU NNs. Using the equidistribution principle, we propose a two-level procedure for training the FKS by first solving the nonlinear problem of finding the optimal knot locations of the interpolating FKS, and then determine the optimal weights and knots of the FKS by solving a nearly linear, well-conditioned problem. The training of the FKS gives insights into how we can train a ReLU NN effectively, with an equally accurate approximation. We combine the training of the ReLU NN with an equidistribution-based loss to find the breakpoints of the ReLU functions. This is then combined with preconditioning the ReLU NN approximation to find the scalings of the ReLU functions. This fast, well-conditioned and reliable method finds an accurate shallow ReLU NN approximation to a univariate target function. We test this method on a series of regular, singular, and rapidly varying target functions and obtain good results, realising the expressivity of the shallow ReLU network in all cases. We then extend our results to deeper networks.
Authors: Jianke Yu, Hanchen Wang, Chen Chen, Xiaoyang Wang, Lu Qin, Wenjie Zhang, Ying Zhang, Xijuan Liu
Abstract: Graph Neural Networks (GNNs) are vital in data science but are increasingly susceptible to adversarial attacks. To help researchers develop more robust GNN models, it's essential to focus on designing strong attack models as foundational benchmarks and guiding references. Among adversarial attacks, gray-box poisoning attacks are noteworthy due to their effectiveness and fewer constraints. These attacks exploit GNNs' need for retraining on updated data, thereby impacting their performance by perturbing these datasets. However, current research overlooks the real-world scenario of incomplete graphs.To address this gap, we introduce the Robust Incomplete Deep Attack Framework (RIDA). It is the first algorithm for robust gray-box poisoning attacks on incomplete graphs. The approach innovatively aggregates distant vertex information and ensures powerful data utilization.Extensive tests against 9 SOTA baselines on 3 real-world datasets demonstrate RIDA's superiority in handling incompleteness and high attack performance on the incomplete graph.
Authors: Chang-Wei Shi, Yi-Rui Yang, Wu-Jun Li
Abstract: Distributed learning is essential for training large-scale deep models. Asynchronous SGD (ASGD) and its variants are commonly used distributed learning methods, particularly in scenarios where the computing capabilities of workers in the cluster are heterogeneous. Momentum has been acknowledged for its benefits in both optimization and generalization in deep model training. However, existing works have found that naively incorporating momentum into ASGD can impede the convergence. In this paper, we propose a novel method called ordered momentum (OrMo) for ASGD. In OrMo, momentum is incorporated into ASGD by organizing the gradients in order based on their iteration indexes. We theoretically prove the convergence of OrMo with both constant and delay-adaptive learning rates for non-convex problems. To the best of our knowledge, this is the first work to establish the convergence analysis of ASGD with momentum without dependence on the maximum delay. Empirical results demonstrate that OrMo can achieve better convergence performance compared with ASGD and other asynchronous methods with momentum.
Authors: Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, Dongfang Liu
Abstract: Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident responses. To address the challenge, we introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes, tailoring the problem-solving approach to different task complexity. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data. With this novel design, we advocate a flexible system, hierarchical reasoning capabilities, and a transparent decision-making pipeline, all of which contribute to its ability to emulate human-like cognitive processes in visual intelligence. Empirical results demonstrate that FaST outperforms various well-known baselines, achieving 80.8% accuracy over VQA^{v2} for visual question answering and 48.7% GIoU score over ReasonSeg for reasoning segmentation, demonstrate FaST's superior performance. Extensive testing validates the efficacy and robustness of FaST's core components, showcasing its potential to advance the development of cognitive visual agents in AI systems. The code is available at ttps://github.com/GuangyanS/Sys2-LLaVA.
Authors: Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier
Abstract: Deep learning models have been shown to be a powerful solution for Time Series Classification (TSC). State-of-the-art architectures, while producing promising results on the UCR and the UEA archives , present a high number of trainable parameters. This can lead to long training with high CO2 emission, power consumption and possible increase in the number of FLoating-point Operation Per Second (FLOPS). In this paper, we present a new architecture for TSC, the Light Inception with boosTing tEchnique (LITE) with only 2.34% of the number of parameters of the state-of-the-art InceptionTime model, while preserving performance. This architecture, with only 9, 814 trainable parameters due to the usage of DepthWise Separable Convolutions (DWSC), is boosted by three techniques: multiplexing, custom filters, and dilated convolution. The LITE architecture, trained on the UCR, is 2.78 times faster than InceptionTime and consumes 2.79 times less CO2 and power. To evaluate the performance of the proposed architecture on multivariate time series data, we adapt LITE to handle multivariate time series, we call this version LITEMV. To bring theory into application, we also conducted experiments using LITEMV on multivariate time series representing human rehabilitation movements, showing that LITEMV not only is the most efficient model but also the best performing for this application on the Kimore dataset, a skeleton based human rehabilitation exercises dataset. Moreover, to address the interpretability of LITEMV, we present a study using Class Activation Maps to understand the classification decision taken by the model during evaluation.
Authors: Yun-Wei Chu, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher Brinton
Abstract: Over the past several years, various federated learning (FL) methodologies have been developed to improve model accuracy, a primary performance metric in machine learning. However, to utilize FL in practical decision-making scenarios, beyond considering accuracy, the trained model must also have a reliable confidence in each of its predictions, an aspect that has been largely overlooked in existing FL research. Motivated by this gap, we propose Non-Uniform Calibration for Federated Learning (NUCFL), a generic framework that integrates FL with the concept of model calibration. The inherent data heterogeneity in FL environments makes model calibration particularly difficult, as it must ensure reliability across diverse data distributions and client conditions. Our NUCFL addresses this challenge by dynamically adjusting the model calibration objectives based on statistical relationships between each client's local model and the global model in FL. In particular, NUCFL assesses the similarity between local and global model relationships, and controls the penalty term for the calibration loss during client-side local training. By doing so, NUCFL effectively aligns calibration needs for the global model in heterogeneous FL settings while not sacrificing accuracy. Extensive experiments show that NUCFL offers flexibility and effectiveness across various FL algorithms, enhancing accuracy as well as model calibration.
Authors: Chia-Yuan Wu, Frank E. Curtis, Daniel P. Robinson
Abstract: In distributed computing environments, collaborative machine learning enables multiple clients to train a global model collaboratively. To preserve privacy in such settings, a common technique is to utilize frequent updates and transmissions of model parameters. However, this results in high communication costs between the clients and the server. To tackle unfairness concerns in distributed environments, client-specific information (e.g., local dataset size or data-related fairness metrics) must be sent to the server to compute algorithmic quantities (e.g., aggregation weights), which leads to a potential leakage of client information. To address these challenges, we propose a two-stage strategy that promotes fair predictions, prevents client-data leakage, and reduces communication costs in certain scenarios without the need to pass information between clients and server iteratively. In the first stage, for each client, we use its local dataset to obtain a synthetic dataset by solving a bilevel optimization problem that aims to ensure that the ultimate global model yields fair predictions. In the second stage, we apply a method with differential privacy guarantees to the synthetic dataset from the first stage to obtain a second synthetic data. We then pass each client's second-stage synthetic dataset to the server, the collection of which is used to train the server model using conventional machine learning techniques (that no longer need to take fairness metrics or privacy into account). Thus, we eliminate the need to handle fairness-specific aggregation weights while preserving client privacy. Our approach requires only a single communication between the clients and the server (thus making it communication cost-effective), maintains data privacy, and promotes fairness. We present empirical evidence to demonstrate the advantages of our approach.
Authors: Eslam Eldeeb, Hirley Alves
Abstract: Reinforcement learning (RL) has proved to have a promising role in future intelligent wireless networks. Online RL has been adopted for radio resource management (RRM), taking over traditional schemes. However, due to its reliance on online interaction with the environment, its role becomes limited in practical, real-world problems where online interaction is not feasible. In addition, traditional RL stands short in front of the uncertainties and risks in real-world stochastic environments. In this manner, we propose an offline and distributional RL scheme for the RRM problem, enabling offline training using a static dataset without any interaction with the environment and considering the sources of uncertainties using the distributions of the return. Simulation results demonstrate that the proposed scheme outperforms conventional resource management models. In addition, it is the only scheme that surpasses online RL with a 10 % gain over online RL.
Authors: Kazuki Adachi, Shin'ya Yamaguchi, Atsutoshi Kumagai, Tomoki Hamagami
Abstract: This paper investigates test-time adaptation (TTA) for regression, where a regression model pre-trained in a source domain is adapted to an unknown target distribution with unlabeled target data. Although regression is one of the fundamental tasks in machine learning, most of the existing TTA methods have classification-specific designs, which assume that models output class-categorical predictions, whereas regression models typically output only single scalar values. To enable TTA for regression, we adopt a feature alignment approach, which aligns the feature distributions between the source and target domains to mitigate the domain gap. However, we found that naive feature alignment employed in existing TTA methods for classification is ineffective or even worse for regression because the features are distributed in a small subspace and many of the raw feature dimensions have little significance to the output. For an effective feature alignment in TTA for regression, we propose Significant-subspace Alignment (SSA). SSA consists of two components: subspace detection and dimension weighting. Subspace detection finds the feature subspace that is representative and significant to the output. Then, the feature alignment is performed in the subspace during TTA. Meanwhile, dimension weighting raises the importance of the dimensions of the feature subspace that have greater significance to the output. We experimentally show that SSA outperforms various baselines on real-world datasets.
Authors: Yahong Yang
Abstract: In this paper, we investigate the use of operator learning, specifically DeepONet, for solving nonlinear partial differential equations (PDEs). Unlike conventional function learning methods that require training separate neural networks for each PDE, operator learning enables generalization across different PDEs without retraining. This study examines the performance of DeepONet in physics-informed training, focusing on two key aspects: (1) the approximation capabilities of deep branch and trunk networks, and (2) the generalization error in Sobolev norms. Our results demonstrate that deep branch networks provide substantial performance improvements, while trunk networks achieve optimal results when kept relatively simple. Furthermore, we derive a bound on the generalization error of DeepONet for solving nonlinear PDEs by analyzing the Rademacher complexity of its derivatives in terms of pseudo-dimension. This work bridges a critical theoretical gap by delivering rigorous error estimates. This paper fills a theoretical gap by providing error estimations for a wide range of physics-informed machine learning models and applications.
Authors: Shakila Mahjabin Tonni, Pedro Faustini, Mark Dras
Abstract: Adversarial examples pose a significant challenge to deep neural networks (DNNs) across both image and text domains, with the intent to degrade model performance through meticulously altered inputs. Adversarial texts, however, are distinct from adversarial images due to their requirement for semantic similarity and the discrete nature of the textual contents. This study delves into the concept of human suspiciousness, a quality distinct from the traditional focus on imperceptibility found in image-based adversarial examples. Unlike images, where adversarial changes are meant to be indistinguishable to the human eye, textual adversarial content must often remain undetected or non-suspicious to human readers, even when the text's purpose is to deceive NLP systems or bypass filters. In this research, we expand the study of human suspiciousness by analyzing how individuals perceive adversarial texts. We gather and publish a novel dataset of Likert-scale human evaluations on the suspiciousness of adversarial sentences, crafted by four widely used adversarial attack methods and assess their correlation with the human ability to detect machine-generated alterations. Additionally, we develop a regression-based model to quantify suspiciousness and establish a baseline for future research in reducing the suspiciousness in adversarial text generation. We also demonstrate how the regressor-generated suspicious scores can be incorporated into adversarial generation methods to produce texts that are less likely to be perceived as computer-generated. We make our human suspiciousness annotated data and our code available.
Authors: Jiayi Han, Liang Du, Hongwei Du, Xiangguo Zhou, Yiwen Wu, Weibo Zheng, Donghong Han
Abstract: Although many efforts have been made, it is still a challenge to balance the training budget, downstream performance, and the general capabilities of the LLMs in many applications. Training the whole model for downstream tasks is expensive, and could easily result in catastrophic forgetting. By introducing parameter-efficient fine-tuning (PEFT), the training cost could be reduced, but it still suffers from forgetting, and limits the learning on the downstream tasks. To efficiently fine-tune the LLMs with less limitation to their downstream performance while mitigating the forgetting of general capabilities, we propose a novel mixture of expert (MoE) framework based on Soft LoRA and Identity Mixture (SLIM), that allows dynamic routing between LoRA adapters and skipping connection, enables the suppression of forgetting. We adopt weight-yielding with sliding clustering for better out-of-domain distinguish to enhance the routing. We also propose to convert the mixture of low-rank adapters to the model merging formulation and introduce fast dynamic merging of LoRA adapters to keep the general capabilities of the base model. Extensive experiments demonstrate that the proposed SLIM is comparable to the state-of-the-art PEFT approaches on the downstream tasks while achieving the leading performance in mitigating catastrophic forgetting.
Authors: Sangyun Lee, Yilun Xu, Tomas Geffner, Giulia Fanti, Karsten Kreis, Arash Vahdat, Weili Nie
Abstract: Consistency models have recently been introduced to accelerate sampling from diffusion models by directly predicting the solution (i.e., data) of the probability flow ODE (PF ODE) from initial noise. However, the training of consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. This task is much more challenging than the ultimate objective of one-step generation, which only concerns the PF ODE's noise-to-data mapping. We empirically find that this training paradigm limits the one-step generation performance of consistency models. To address this issue, we generalize consistency training to the truncated time range, which allows the model to ignore denoising tasks at earlier time steps and focus its capacity on generation. We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution. Experiments on CIFAR-10 and ImageNet $64\times64$ datasets show that our method achieves better one-step and two-step FIDs than the state-of-the-art consistency models such as iCT-deep, using more than 2$\times$ smaller networks. Project page: https://truncated-cm.github.io/
Authors: Shicheng Liu, Minghui Zhu
Abstract: Inverse reinforcement learning (IRL) aims to learn a reward function and a corresponding policy that best fit the demonstrated trajectories of an expert. However, current IRL works cannot learn incrementally from an ongoing trajectory because they have to wait to collect at least one complete trajectory to learn. To bridge the gap, this paper considers the problem of learning a reward function and a corresponding policy while observing the initial state-action pair of an ongoing trajectory and keeping updating the learned reward and policy when new state-action pairs of the ongoing trajectory are observed. We formulate this problem as an online bi-level optimization problem where the upper level dynamically adjusts the learned reward according to the newly observed state-action pairs with the help of a meta-regularization term, and the lower level learns the corresponding policy. We propose a novel algorithm to solve this problem and guarantee that the algorithm achieves sub-linear local regret $O(\sqrt{T}+\log T+\sqrt{T}\log T)$. If the reward function is linear, we prove that the proposed algorithm achieves sub-linear regret $O(\log T)$. Experiments are used to validate the proposed algorithm.
Authors: Kaiyan Zhao, Yiming Wang, Yuyang Chen, Yan Li, Leong Hou U, Xiaoguang Niu
Abstract: Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity-based Experience Replay (EDER). EDER employs a deterministic point process to model the diversity between samples and prioritizes replay based on the diversity between samples. To further enhance learning efficiency, we incorporate Cholesky decomposition for handling large state spaces in realistic environments. Additionally, rejection sampling is applied to select samples with higher diversity, thereby improving overall learning efficacy. Extensive experiments are conducted on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results demonstrate that our approach not only significantly improves learning efficiency but also achieves superior performance in high-dimensional, realistic environments.
Authors: Nicolas Langren\'e, Xavier Warin, Pierre Gruet
Abstract: Rahimi and Recht (2007) introduced the idea of decomposing positive definite shift-invariant kernels by randomly sampling from their spectral distribution. This famous technique, known as Random Fourier Features (RFF), is in principle applicable to any such kernel whose spectral distribution can be identified and simulated. In practice, however, it is usually applied to the Gaussian kernel because of its simplicity, since its spectral distribution is also Gaussian. Clearly, simple spectral sampling formulas would be desirable for broader classes of kernels. In this paper, we prove that the spectral distribution of every positive definite isotropic kernel can be decomposed as a scale mixture of $\alpha$-stable random vectors, and we identify the scaling distribution as a function of the kernel. This constructive decomposition provides a simple and ready-to-use spectral sampling formula for every multivariate positive definite shift-invariant kernel, including exponential power kernels, generalized Mat\'ern kernels, generalized Cauchy kernels, as well as newly introduced kernels such as the Beta, Kummer, and Tricomi kernels. In particular, we show that the spectral distributions of these kernels are scale mixtures of the multivariate Gaussian distribution. This provides a very simple way to adapt existing random Fourier features software based on Gaussian kernels to any positive definite shift-invariant kernel. This result has broad applications for support vector machines, kernel ridge regression, Gaussian processes, and other kernel-based machine learning techniques for which the random Fourier features technique is applicable.
Authors: Libo Wang
Abstract: In order to address the chain of thought in the large language model inference cost surge, this research proposes to use a sparse attention mechanism that only focuses on a few relevant tokens. The researcher constructed a new attention mechanism and used GiantRabbit trained with custom GPTs as an experimental tool. The experiment tested and compared the reasoning time, correctness score and chain of thought length of this model and o1 Preview in solving the linear algebra test questions of MIT OpenCourseWare. The results show that GiantRabbit's reasoning time and chain of thought length are significantly lower than o1 Preview. It verifies the feasibility of sparse attention mechanism for optimizing chain of thought reasoning. Detailed architectural details and experimental process have been uploaded to Github, the link is:https://github.com/brucewang123456789/GeniusTrail.git.
URLs: https://github.com/brucewang123456789/GeniusTrail.git.
Authors: Guochen Yan, Xunkai Li, Luyuan Xie, Wentao Zhang, Qingni Shen, Yuejian Fang, Zhonghai Wu
Abstract: Federated Graph Learning (FGL) has emerged as a promising paradigm for breaking data silos among distributed private graphs. In practical scenarios involving heterogeneous distributed graph data, personalized Federated Graph Learning (pFGL) aims to enhance model utility by training personalized models tailored to client needs. However, existing pFGL methods often require numerous communication rounds under heterogeneous graphs, leading to significant communication overhead and security concerns. While One-shot Federated Learning (OFL) enables collaboration in a single round, existing OFL methods are designed for image-centric tasks and ineffective for graph data, leaving a critical gap in the field. Additionally, personalized models derived from existing methods suffer from bias, failing to effectively generalize to the minority. To address these challenges, we propose the first $\textbf{O}$ne-shot $\textbf{p}$ersonalized $\textbf{F}$ederated $\textbf{G}$raph $\textbf{L}$earning method ($\textbf{O-pFGL}$) for node classification, compatible with Secure Aggregation protocols for privacy preservation. Specifically, for effective graph learning in one communication round, our method estimates and aggregates class-wise feature distribution statistics to construct a global pseudo-graph on the server, facilitating the training of a global graph model. To mitigate bias, we introduce a two-stage personalized training approach that adaptively balances local personal information and global insights from the pseudo-graph, improving both personalization and generalization. Extensive experiments on 12 multi-scale graph datasets demonstrate that our method significantly outperforms state-of-the-art baselines across various settings.
Authors: Zheng Li, Feng Xie, Xichen Guo, Yan Zeng, Hao Zhang, Zhi Geng
Abstract: Estimating causal effects from nonexperimental data is a fundamental problem in many fields of science. A key component of this task is selecting an appropriate set of covariates for confounding adjustment to avoid bias. Most existing methods for covariate selection often assume the absence of latent variables and rely on learning the global network structure among variables. However, identifying the global structure can be unnecessary and inefficient, especially when our primary interest lies in estimating the effect of a treatment variable on an outcome variable. To address this limitation, we propose a novel local learning approach for covariate selection in nonparametric causal effect estimation, which accounts for the presence of latent variables. Our approach leverages testable independence and dependence relationships among observed variables to identify a valid adjustment set for a target causal relationship, ensuring both soundness and completeness under standard assumptions. We validate the effectiveness of our algorithm through extensive experiments on both synthetic and real-world data.
Authors: Indranil Halder
Abstract: We develop an analytically tractable single-step diffusion model based on a linear denoiser and present explicit formula for the Kullback-Leibler divergence between generated and sampling distribution, taken to be isotropic Gaussian, showing the effect of finite diffusion time and noise scale. Our study further reveals that the monotonic fall phase of Kullback-Leibler divergence begins when the training dataset size reaches the dimension of the data points. Along the way, we provide a mathematically precise definition of memorization to non-memorization transition when only finite number of data points are available. It is shown that the simplified model also features this transition during the monotonic fall phase of the aforementioned Kullback-Leibler divergence. For large-scale practical diffusion models, we explain why higher number of diffusion steps enhance production quality based on the theoretical arguments presented before. In addition, we show that higher diffusion steps does not necessarily help in reducing memorization. These two facts combined suggests existence of an optimal number of diffusion steps for finite number of training samples.
Authors: Bethia Sun, Maurice Pagnucco, Yang Song
Abstract: Since the inception of the classicalist vs. connectionist debate, it has been argued that the ability to systematically combine symbol-like entities into compositional representations is crucial for human intelligence. In connectionist systems, the field of disentanglement has gained prominence for its ability to produce explicitly compositional representations; however, it relies on a fundamentally symbolic, concatenative representation of compositional structure that clashes with the continuous, distributed foundations of deep learning. To resolve this tension, we extend Smolensky's Tensor Product Representation (TPR) and introduce Soft TPR, a representational form that encodes compositional structure in an inherently distributed, flexible manner, along with Soft TPR Autoencoder, a theoretically-principled architecture designed specifically to learn Soft TPRs. Comprehensive evaluations in the visual representation learning domain demonstrate that the Soft TPR framework consistently outperforms conventional disentanglement alternatives -- achieving state-of-the-art disentanglement, boosting representation learner convergence, and delivering superior sample efficiency and low-sample regime performance in downstream tasks. These findings highlight the promise of a distributed and flexible approach to representing compositional structure by potentially enhancing alignment with the core principles of deep learning over the conventional symbolic approach.
Authors: Seungeun Lee, Il-Youp Kwak, Kihwan Lee, Subin Bae, Sangjun Lee, Seulbin Lee, Seungsang Oh
Abstract: Recent advancements in deep learning for tabular data have shown promise, but challenges remain in achieving interpretable and lightweight models. This paper introduces Table2Image, a novel framework that transforms tabular data into realistic and diverse image representations, enabling deep learning methods to achieve competitive classification performance. To address multicollinearity in tabular data, we propose a variance inflation factor (VIF) initialization, which enhances model stability and robustness by incorporating statistical feature relationships. Additionally, we present an interpretability framework that integrates insights from both the original tabular data and its transformed image representations, by leveraging Shapley additive explanations (SHAP) and methods to minimize distributional discrepancies. Experiments on benchmark datasets demonstrate the efficacy of our approach, achieving superior accuracy, area under the curve, and interpretability compared to recent leading deep learning models. Our lightweight method provides a scalable and reliable solution for tabular data classification.
Authors: Muyun Li, Aaron Fainman, Stefan Vlaski
Abstract: Decentralized learning strategies allow a collection of agents to learn efficiently from local data sets without the need for central aggregation or orchestration. Current decentralized learning paradigms typically rely on an averaging mechanism to encourage agreement in the parameter space. We argue that in the context of deep neural networks, which are often over-parameterized, encouraging consensus of the neural network outputs, as opposed to their parameters can be more appropriate. This motivates the development of a new decentralized learning algorithm, termed DRT diffusion, based on deep relative trust (DRT), a recently introduced similarity measure for neural networks. We provide convergence analysis for the proposed strategy, and numerically establish its benefit to generalization, especially with sparse topologies, in an image classification task.
Authors: Haishan Ye, Yinghui Huang, Hao Di, Xiangyu Chang
Abstract: We propose an enhanced zeroth-order stochastic Frank-Wolfe framework to address constrained finite-sum optimization problems, a structure prevalent in large-scale machine-learning applications. Our method introduces a novel double variance reduction framework that effectively reduces the gradient approximation variance induced by zeroth-order oracles and the stochastic sampling variance from finite-sum objectives. By leveraging this framework, our algorithm achieves significant improvements in query efficiency, making it particularly well-suited for high-dimensional optimization tasks. Specifically, for convex objectives, the algorithm achieves a query complexity of O(d \sqrt{n}/\epsilon ) to find an epsilon-suboptimal solution, where d is the dimensionality and n is the number of functions in the finite-sum objective. For non-convex objectives, it achieves a query complexity of O(d^{3/2}\sqrt{n}/\epsilon^2 ) without requiring the computation ofd partial derivatives at each iteration. These complexities are the best known among zeroth-order stochastic Frank-Wolfe algorithms that avoid explicit gradient calculations. Empirical experiments on convex and non-convex machine learning tasks, including sparse logistic regression, robust classification, and adversarial attacks on deep networks, validate the computational efficiency and scalability of our approach. Our algorithm demonstrates superior performance in both convergence rate and query complexity compared to existing methods.
Authors: Ping Guo, Cheng Gong, Xi Lin, Fei Liu, Zhichao Lu, Qingfu Zhang, Zhenkun Wang
Abstract: Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single objective methods, namely adversarial attacks focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, as a result of insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework leveraging multiple loss functions and automatically uncovering their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments have shown that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions.
Authors: Zhenxing Niu, Haoxuan Ji, Yuyao Sun, Zheng Lin, Fei Gao, Yuhang Wang, Haichao Gao
Abstract: Modern privacy regulations have spurred the evolution of machine unlearning, a technique enabling a trained model to efficiently forget specific training data. In prior unlearning methods, the concept of "data forgetting" is often interpreted and implemented as achieving zero classification accuracy on such data. Nevertheless, the authentic aim of machine unlearning is to achieve alignment between the unlearned model and the gold model, i.e., encouraging them to have identical classification accuracy. On the other hand, the gold model often exhibits non-zero classification accuracy due to its generalization ability. To achieve aligned data forgetting, we propose a Twin Machine Unlearning (TMU) approach, where a twin unlearning problem is defined corresponding to the original unlearning problem. Consequently, the generalization-label predictor trained on the twin problem can be transferred to the original problem, facilitating aligned data forgetting. Comprehensive empirical experiments illustrate that our approach significantly enhances the alignment between the unlearned model and the gold model.
Authors: Paul Garnier, Vincent Lannelongue, Jonathan Viquerat, Elie Hachem
Abstract: We introduce a novel masked pre-training technique for graph neural networks (GNNs) applied to computational fluid dynamics (CFD) problems. By randomly masking up to 40\% of input mesh nodes during pre-training, we force the model to learn robust representations of complex fluid dynamics. We pair this masking strategy with an asymmetric encoder-decoder architecture and gated multi-layer perceptrons to further enhance performance. The proposed method achieves state-of-the-art results on seven CFD datasets, including a new challenging dataset of 3D intracranial aneurysm simulations with over 250,000 nodes per mesh. Moreover, it significantly improves model performance and training efficiency across such diverse range of fluid simulation tasks. We demonstrate improvements of up to 60\% in long-term prediction accuracy compared to previous best models, while maintaining similar computational costs. Notably, our approach enables effective pre-training on multiple datasets simultaneously, significantly reducing the time and data required to achieve high performance on new tasks. Through extensive ablation studies, we provide insights into the optimal masking ratio, architectural choices, and training strategies.
Authors: Jiannan Li, Yiyang Yang, Yao Wang, Shaojie Tang
Abstract: Modern decision-making scenarios often involve data that is both high-dimensional and rich in higher-order contextual information, where existing bandits algorithms fail to generate effective policies. In response, we propose in this paper a generalized linear tensor bandits algorithm designed to tackle these challenges by incorporating low-dimensional tensor structures, and further derive a unified analytical framework of the proposed algorithm. Specifically, our framework introduces a convex optimization approach with the weakly decomposable regularizers, enabling it to not only achieve better results based on the tensor low-rankness structure assumption but also extend to cases involving other low-dimensional structures such as slice sparsity and low-rankness. The theoretical analysis shows that, compared to existing low-rankness tensor result, our framework not only provides better bounds but also has a broader applicability. Notably, in the special case of degenerating to low-rank matrices, our bounds still offer advantages in certain scenarios.
Authors: Deniz Sezin Ayvaz, Lorenzo Belenguer, Hankun He, Deborah Dormah Kanubala, Mingxu Li, Soung Low, Carlos Mougan, Faithful Chiagoziem Onwuegbuche, Yulu Pi, Natalia Sikora, Dan Tran, Shresth Verma, Hanzhi Wang, Skyler Xie, Adeline Pelletier
Abstract: Mastercard, a global leader in financial services, develops and deploys machine learning models aimed at optimizing card usage and preventing attrition through advanced predictive models. These models use aggregated and anonymized card usage patterns, including cross-border transactions and industry-specific spending, to tailor bank offerings and maximize revenue opportunities. Mastercard has established an AI Governance program, based on its Data and Tech Responsibility Principles, to evaluate any built and bought AI for efficacy, fairness, and transparency. As part of this effort, Mastercard has sought expertise from the Turing Institute through a Data Study Group to better assess fairness in more complex AI/ML models. The Data Study Group challenge lies in defining, measuring, and mitigating fairness in these predictions, which can be complex due to the various interpretations of fairness, gaps in the research literature, and ML-operations challenges.
Authors: Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai
Abstract: Pre-trained large models attract widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-proved theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to substitute them because pre-trained models exhibit poor generality beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods, bringing new feature of incorporating prior knowledges. In CL, a signal of length $n$ only needs a minimal of $n$ trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100 to 1000 folds for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves structural similarity index by 292%, 98%, 45% for sampling rates of 0.1, 0.3, 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solution.
Authors: Zhuang Li
Abstract: The rapid proliferation of Large Language Models (LLMs) has raised concerns about misuse and the challenges of distinguishing AI-generated text from human-written content. Existing watermarking techniques, such as \kgw, still face limitations under low watermark strength, stringent false-positive requirements, and low-entropy scenarios. Our analysis reveals that current detection methods rely on coarse estimates of non-watermarked text, which constrains watermark detectability. We propose the Bipolar Watermark (BiMarker), a novel approach that divides generated text into positive and negative poles, leveraging the difference in green token counts for detection. This differential mechanism significantly enhances the detectability of watermarked text. Theoretical analysis and experimental results demonstrate BiMarker's effectiveness and compatibility with existing optimization techniques, offering a new optimization dimension for watermarking in LLM-generated content.
Authors: Emir Ceyani, Han Xie, Baturalp Buyukates, Carl Yang, Salman Avestimehr
Abstract: Graphs are crucial for modeling relational and biological data. As datasets grow larger in real-world scenarios, the risk of exposing sensitive information increases, making privacy-preserving training methods like federated learning (FL) essential to ensure data security and compliance with privacy regulations. Recently proposed personalized subgraph FL methods have become the de-facto standard for training personalized Graph Neural Networks (GNNs) in a federated manner while dealing with the missing links across clients' subgraphs due to privacy restrictions. However, personalized subgraph FL faces significant challenges due to the heterogeneity in client subgraphs, such as degree distributions among the nodes, which complicate federated training of graph models. To address these challenges, we propose \textit{FedGrAINS}, a novel data-adaptive and sampling-based regularization method for subgraph FL. FedGrAINS leverages generative flow networks (GFlowNets) to evaluate node importance concerning clients' tasks, dynamically adjusting the message-passing step in clients' GNNs. This adaptation reflects task-optimized sampling aligned with a trajectory balance objective. Experimental results demonstrate that the inclusion of \textit{FedGrAINS} as a regularizer consistently improves the FL performance compared to baselines that do not leverage such regularization.
Authors: Myunsoo Kim, Hayeong Lee, Seong-Woong Shim, JunHo Seo, Byung-Jun Lee
Abstract: Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential for further exploration and faster policy learning. In this work, we propose to learn a simple and effective termination condition that identifies decision points through a state-action novelty module that leverages agent experience data. Our approach, Novelty-based Decision Point Identification (NBDI), outperforms previous baselines in complex, long-horizon tasks, and remains effective even in the presence of significant variations in the environment configurations of downstream tasks, highlighting the importance of decision point identification in skill learning.
Authors: Oliver R. A. Dunbar, Charles M. Elliott, Lisa Maria Kreusser
Abstract: We propose and unify classes of different models for information propagation over graphs. In a first class, propagation is modelled as a wave which emanates from a set of \emph{known} nodes at an initial time, to all other \emph{unknown} nodes at later times with an ordering determined by the arrival time of the information wave front. A second class of models is based on the notion of a travel time along paths between nodes. The time of information propagation from an initial \emph{known} set of nodes to a node is defined as the minimum of a generalised travel time over subsets of all admissible paths. A final class is given by imposing a local equation of an eikonal form at each \emph{unknown} node, with boundary conditions at the \emph{known} nodes. The solution value of the local equation at a node is coupled to those of neighbouring nodes with lower values. We provide precise formulations of the model classes and prove equivalences between them. Finally we apply the front propagation models on graphs to semi-supervised learning via label propagation and information propagation on trust networks.
Authors: Mohamed elShehaby, Ashraf Matrawy
Abstract: Machine Learning (ML) has become ubiquitous, and its deployment in Network Intrusion Detection Systems (NIDS) is inevitable due to its automated nature and high accuracy compared to traditional models in processing and classifying large volumes of data. However, ML has been found to have several flaws, most importantly, adversarial attacks, which aim to trick ML models into producing faulty predictions. While most adversarial attack research focuses on computer vision datasets, recent studies have explored the suitability of these attacks against ML-based network security entities, especially NIDS, due to the wide difference between different domains regarding the generation of adversarial attacks. To further explore the practicality of adversarial attacks against ML-based NIDS in-depth, this paper presents three distinct contributions: identifying numerous practicality issues for evasion adversarial attacks on ML-NIDS using an attack tree threat model, introducing a taxonomy of practicality issues associated with adversarial attacks against ML-based NIDS, and investigating how the dynamicity of some real-world ML models affects adversarial attacks against NIDS. Our experiments indicate that continuous re-training, even without adversarial training, can reduce the effectiveness of adversarial attacks. While adversarial attacks can compromise ML-based NIDSs, our aim is to highlight the significant gap between research and real-world practicality in this domain, warranting attention.
Authors: Fabio Calefato, Luigi Quaranta, Filippo Lanubile, Marcos Kalinowski
Abstract: Background. Due to the widespread adoption of Artificial Intelligence (AI) and Machine Learning (ML) for building software applications, companies are struggling to recruit employees with a deep understanding of such technologies. In this scenario, AutoML is soaring as a promising solution to fill the AI/ML skills gap since it promises to automate the building of end-to-end AI/ML pipelines that would normally be engineered by specialized team members. Aims. Despite the growing interest and high expectations, there is a dearth of information about the extent to which AutoML is currently adopted by teams developing AI/ML-enabled systems and how it is perceived by practitioners and researchers. Method. To fill these gaps, in this paper, we present a mixed-method study comprising a benchmark of 12 end-to-end AutoML tools on two SE datasets and a user survey with follow-up interviews to further our understanding of AutoML adoption and perception. Results. We found that AutoML solutions can generate models that outperform those trained and optimized by researchers to perform classification tasks in the SE domain. Also, our findings show that the currently available AutoML solutions do not live up to their names as they do not equally support automation across the stages of the ML development workflow and for all the team members. Conclusions. We derive insights to inform the SE research community on how AutoML can facilitate their activities and tool builders on how to design the next generation of AutoML technologies.
Authors: Tommaso Salvatori, Ankur Mali, Christopher L. Buckley, Thomas Lukasiewicz, Rajesh P. N. Rao, Karl Friston, Alexander Ororbia
Abstract: Artificial intelligence (AI) is rapidly becoming one of the key technologies of this century. The majority of results in AI thus far have been achieved using deep neural networks trained with the error backpropagation learning algorithm. However, the ubiquitous adoption of this approach has highlighted some important limitations such as substantial computational cost, difficulty in quantifying uncertainty, lack of robustness, unreliability, and biological implausibility. It is possible that addressing these limitations may require schemes that are inspired and guided by neuroscience theories. One such theory, called predictive coding (PC), has shown promising performance in machine intelligence tasks, exhibiting exciting properties that make it potentially valuable for the machine learning community: PC can model information processing in different brain areas, can be used in cognitive control and robotics, and has a solid mathematical grounding in variational inference, offering a powerful inversion scheme for a specific class of continuous-state generative models. With the hope of foregrounding research in this direction, we survey the literature that has contributed to this perspective, highlighting the many ways that PC might play a role in the future of machine learning and computational intelligence at large.
Authors: Finn Behrendt, Debayan Bhattacharya, Robin Mieling, Lennart Maack, Julia Kr\"uger, Roland Opfer, Alexander Schlaefer
Abstract: The application of supervised models to clinical screening tasks is challenging due to the need for annotated data for each considered pathology. Unsupervised Anomaly Detection (UAD) is an alternative approach that aims to identify any anomaly as an outlier from a healthy training distribution. A prevalent strategy for UAD in brain MRI involves using generative models to learn the reconstruction of healthy brain anatomy for a given input image. As these models should fail to reconstruct unhealthy structures, the reconstruction errors indicate anomalies. However, a significant challenge is to balance the accurate reconstruction of healthy anatomy and the undesired replication of abnormal structures. While diffusion models have shown promising results with detailed and accurate reconstructions, they face challenges in preserving intensity characteristics, resulting in false positives. We propose conditioning the denoising process of diffusion models with additional information derived from a latent representation of the input image. We demonstrate that this conditioning allows for accurate and local adaptation to the general input intensity distribution while avoiding the replication of unhealthy structures. We compare the novel approach to different state-of-the-art methods and for different data sets. Our results show substantial improvements in the segmentation performance, with the Dice score improved by 11.9%, 20.0%, and 44.6%, for the BraTS, ATLAS and MSLUB data sets, respectively, while maintaining competitive performance on the WMH data set. Furthermore, our results indicate effective domain adaptation across different MRI acquisitions and simulated contrasts, an important attribute for general anomaly detection methods. The code for our work is available at https://github.com/FinnBehrendt/Conditioned-Diffusion-Models-UAD
URLs: https://github.com/FinnBehrendt/Conditioned-Diffusion-Models-UAD
Authors: Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland
Abstract: Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have raised growing interest. In this context, enhancing the robustness of the latent features can improve the efficiency and effectiveness of the training of downstream tasks. A promising step in this direction is to disentangle the factors that cause variation in the data. Previously, Invariant Slot Attention disentangled position, scale, and orientation from the remaining features. Extending this approach, we focus on separating the shape and texture components. In particular, we propose a novel architecture that biases object-centric models toward disentangling shape and texture components into two non-overlapping subsets of the latent space dimensions. These subsets are known a priori, hence before the training process. Experiments on a range of object-centric benchmarks reveal that our approach achieves the desired disentanglement while also numerically improving baseline performance in most cases. In addition, we show that our method can generate novel textures for a specific object or transfer textures between objects with distinct shapes.
Authors: Jan Heiland, Yongho Kim, Steffen W. R. Werner
Abstract: Polytopic autoencoders provide low-di\-men\-sion\-al parametrizations of states in a polytope. For nonlinear PDEs, this is readily applied to low-dimensional linear parameter-varying (LPV) approximations as they have been exploited for efficient nonlinear controller design via series expansions of the solution to the state-dependent Riccati equation. In this work, we develop a polytopic autoencoder for control applications and show how it improves on standard linear approaches in view of LPV approximations of nonlinear systems. We discuss how the particular architecture enables exact representation of target states and higher order series expansions of the nonlinear feedback law at little extra computational effort in the online phase and how the linear though high-dimensional and nonstandard Lyapunov equations are efficiently computed during the offline phase. In a numerical study, we illustrate the procedure and how this approach can reliably outperform the standard linear-quadratic regulator design.
Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Chong-Kwon Kim
Abstract: The issue of data sparsity poses a significant challenge to recommender systems. In response to this, algorithms that leverage side information such as review texts have been proposed. Furthermore, Cross-Domain Recommendation (CDR), which captures domain-shareable knowledge and transfers it from a richer domain (source) to a sparser one (target), has received notable attention. Nevertheless, the majority of existing methodologies assume a Euclidean embedding space, encountering difficulties in accurately representing richer text information and managing complex interactions between users and items. This paper advocates a hyperbolic CDR approach based on review texts for modeling user-item relationships. We first emphasize that conventional distance-based domain alignment techniques may cause problems because small modifications in hyperbolic geometry result in magnified perturbations, ultimately leading to the collapse of hierarchical structures. To address this challenge, we propose hierarchy-aware embedding and domain alignment schemes that adjust the scale to extract domain-shareable information without disrupting structural forms. The process involves the initial embedding of review texts in hyperbolic space, followed by feature extraction incorporating degree-based normalization and structure alignment. We conducted extensive experiments to substantiate the efficiency, robustness, and scalability of our proposed model in comparison to state-of-the-art baselines.
Authors: Joel Niklaus, Lucia Zheng, Arya D. McCarthy, Christopher Hahn, Brian M. Rosen, Peter Henderson, Daniel E. Ho, Garrett Honke, Percy Liang, Christopher Manning
Abstract: Instruction tuning is an important step in making language models useful for direct user interaction. However, the legal domain is underrepresented in typical instruction datasets (e.g., only 10 out of 1600+ tasks in Super-NaturalInstructions). To study whether instruction tuning on legal datasets is necessary for strong legal reasoning, we aggregate 58 annotated legal datasets and write instructions for each, creating LawInstruct. LawInstruct covers 17 global jurisdictions, 24 languages and a total of 12M examples across diverse tasks such as legal QA, summarization of court cases, and legal argument mining. We evaluate our models on LegalBench, measuring legal reasoning across five categories in 162 challenging and realistic legal tasks, and MMLU, to measure potential drops in general reasoning capabilities. We find that legal-specific instruction tuning on Flan-T5 - yielding FLawN-T5 - improves performance on LegalBench across all model sizes, with an aggregate increase of 15 points or 50% over Flan-T5 for the base size. No model size shows performance drops in MMLU. We publish LawInstruct as a resource for further study of instruction tuning in the legal domain.
Authors: Haonan Zhang, Dongxia Wang, Zhu Sun, Yanhui Li, Youcheng Sun, Huizhi Liang, Wenhai Wang
Abstract: Recommender systems (RSs) are designed to provide personalized recommendations to users. Recently, knowledge graphs (KGs) have been widely introduced in RSs to improve recommendation accuracy. In this study, however, we demonstrate that RSs do not necessarily perform worse even if the KG is downgraded to the user-item interaction graph only (or removed). We propose an evaluation framework KG4RecEval to systematically evaluate how much a KG contributes to the recommendation accuracy of a KG-based RS, using our defined metric KGER (KG utilization efficiency in recommendation). We consider the scenarios where knowledge in a KG gets completely removed, randomly distorted and decreased, and also where recommendations are for cold-start users. Our extensive experiments on four commonly used datasets and a number of state-of-the-art KG-based RSs reveal that: to remove, randomly distort or decrease knowledge does not necessarily decrease recommendation accuracy, even for cold-start users. These findings inspire us to rethink how to better utilize knowledge from existing KGs, whereby we discuss and provide insights into what characteristics of datasets and KG-based RSs may help improve KG utilization efficiency. The code and supplementary material of this paper are available at: https://github.com/HotBento/KG4RecEval.
Authors: Arthur Ledaguenel, C\'eline Hudelot, Mostepha Khouadjia
Abstract: Neurosymbolic artificial intelligence is a growing field of research aiming to combine neural network learning capabilities with the reasoning abilities of symbolic systems. Informed multi-label classification is a sub-field of neurosymbolic AI which studies how to leverage prior knowledge to improve neural classification systems. Recently, a family of neurosymbolic techniques for informed classification based on probabilistic reasoning has gained significant traction. Unfortunately, depending on the language used to represent prior knowledge, solving certain probabilistic reasoning problems can become prohibitively hard when the number of classes increases. Therefore, the asymptotic complexity of probabilistic reasoning is of cardinal importance to assess the scalability of such techniques. In this paper, we develop a unified formalism for four probabilistic reasoning problems. Then, we compile several known and new tractability results into a single complexity map of probabilistic reasoning. We build on top of this complexity map to characterize the domains of scalability of several techniques. We hope this work will help neurosymbolic AI practitioners navigate the scalability landscape of probabilistic neurosymbolic techniques.
Authors: Olivier C. Pasche, Jonathan Wider, Zhongwei Zhang, Jakob Zscheischler, Sebastian Engelke
Abstract: The forecast accuracy of machine learning (ML) weather prediction models is improving rapidly, leading many to speak of a "second revolution in weather forecasting". With numerous methods being developed and limited physical guarantees offered by ML models, there is a critical need for a comprehensive evaluation of these emerging techniques. While this need has been partly fulfilled by benchmark datasets, they provide little information on rare and impactful extreme events or on compound impact metrics, for which model accuracy might degrade due to misrepresented dependencies between variables. To address these issues, we compare ML weather prediction models (GraphCast, PanguWeather, and FourCastNet) and ECMWF's high-resolution forecast system (HRES) in three case studies: the 2021 Pacific Northwest heatwave, the 2023 South Asian humid heatwave, and the North American winter storm in 2021. We find that ML weather prediction models locally achieve similar accuracy to HRES on the record-shattering Pacific Northwest heatwave but underperform when aggregated over space and time. However, they forecast the compound winter storm substantially better. We also highlight structural differences in how the errors of HRES and the ML models build up to that event. The ML forecasts lack important variables for a detailed assessment of the health risks of the 2023 humid heatwave. Using a possible substitute variable, prediction errors show spatial patterns with the highest danger levels over Bangladesh being underestimated by the ML models. Generally, case-study-driven, impact-centric evaluation can complement existing research, increase public trust, and aid in developing reliable ML weather prediction models.
Authors: Javier Lopez-Piqueres, Jing Chen
Abstract: In this study, we introduce a novel family of tensor networks, termed \textit{constrained matrix product states} (MPS), designed to incorporate exactly arbitrary discrete linear constraints, including inequalities, into sparse block structures. These tensor networks are particularly tailored for modeling distributions with support strictly over the feasible space, offering benefits such as reducing the search space in optimization problems, alleviating overfitting, improving training efficiency, and decreasing model size. Central to our approach is the concept of a quantum region, an extension of quantum numbers traditionally used in U(1) symmetric tensor networks, adapted to capture any linear constraint, including the unconstrained scenario. We further develop a novel canonical form for these new MPS, which allow for the merging and factorization of tensor blocks according to quantum region fusion rules and permit optimal truncation schemes. Utilizing this canonical form, we apply an unsupervised training strategy to optimize arbitrary objective functions subject to discrete linear constraints. Our method's efficacy is demonstrated by solving the quadratic knapsack problem, achieving superior performance compared to a leading nonlinear integer programming solver. Additionally, we analyze the complexity and scalability of our approach, demonstrating its potential in addressing complex constrained combinatorial optimization problems.
Authors: Haoran Chen, Micah Goldblum, Zuxuan Wu, Yu-Gang Jiang
Abstract: Continual learning, also known as lifelong learning or incremental learning, refers to the process by which a model learns from a stream of incoming data over time. A common problem in continual learning is the classification layer's bias towards the most recent task. Traditionally, methods have relied on incorporating data from past tasks during training to mitigate this issue. However, the recent shift in continual learning to memory-free environments has rendered these approaches infeasible. In this study, we propose a solution focused on the testing phase. We first introduce a simple Out-of-Task Detection method, OTD, designed to accurately identify samples from past tasks during testing. Leveraging OTD, we then propose: (1) an Adaptive Retention mechanism for dynamically tuning the classifier layer on past task data; (2) an Adaptive Correction mechanism for revising predictions when the model classifies data from previous tasks into classes from the current task. We name our approach Adaptive Retention & Correction (ARC). While designed for memory-free environments, ARC also proves effective in memory-based settings. Extensive experiments show that our proposed method can be plugged in to virtually any existing continual learning approach without requiring any modifications to its training procedure. Specifically, when integrated with state-of-the-art approaches, ARC achieves an average performance increase of 2.7% and 2.6% on the CIFAR-100 and Imagenet-R datasets, respectively.
Authors: Heejun Lee, Geon Park, Youngwan Lee, Jaduk Suh, Jina Kim, Wonyoung Jeong, Bumsik Kim, Hyemin Lee, Myeongjae Jeon, Sung Ju Hwang
Abstract: In modern large language models (LLMs), increasing the context length is crucial for improving comprehension and coherence in long-context, multi-modal, and retrieval-augmented language generation. While many recent transformer models attempt to extend their context length over a million tokens, they remain impractical due to the quadratic time and space complexities. Although recent works on linear and sparse attention mechanisms can achieve this goal, their real-world applicability is often limited by the need to re-train from scratch and significantly worse performance. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We notice a pattern in the attention scores of pretrained LLMs where tokens close together tend to have similar scores, which we call ``attention locality''. Based on this observation, we utilize a novel tree-search-like algorithm that estimates the top-$k$ key tokens for a given query on the fly, which is mathematically guaranteed to have better performance than random attention pruning. In addition to improving the time complexity of the attention mechanism, we further optimize GPU memory usage by implementing KV cache offloading, which stores only $O(\log T)$ tokens on the GPU while maintaining similar decoding throughput. Experiments on benchmarks show that HiP, with its training-free nature, significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation. HiP enables pretrained LLMs to scale up to millions of tokens on commodity GPUs, potentially unlocking long-context LLM applications previously deemed infeasible.
Authors: Tinh Son Luong, Thanh-Thien Le, Thang Viet Doan, Linh Ngo Van, Thien Huu Nguyen, Diep Thi-Ngoc Nguyen
Abstract: Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.
Authors: Ping Chang, Tosiron Adegbija, Yuchao Liao, Claudio Talarico, Ao Li, Janet Roveda
Abstract: High-level synthesis (HLS) has significantly advanced the automation of digital circuits design, yet the need for expertise and time in pragma tuning remains challenging. Existing solutions for the design space exploration (DSE) adopt either heuristic methods, lacking essential information for further optimization potential, or predictive models, missing sufficient generalization due to the time-consuming nature of HLS and the exponential growth of the design space. To address these challenges, we propose Deep Inverse Design for HLS (DID4HLS), a novel approach that integrates graph neural networks and generative models. DID4HLS iteratively optimizes hardware designs aimed at compute-intensive algorithms by learning conditional distributions of design features from post-HLS data. Compared to four state-of-the-art DSE baselines, our method achieved an average improvement of 42.8% on average distance to reference set (ADRS) compared to the best-performing baselines across six benchmarks, while demonstrating high robustness and efficiency.
Authors: Joel Sol, Jamil Fayyad, Shadi Alijani, Homayoun Najjaran
Abstract: Deformation detection is vital for enabling accurate assessment and prediction of structural changes in materials, ensuring timely and effective interventions to maintain safety and integrity. Automating deformation detection through computer vision is crucial for efficient monitoring, but it faces significant challenges in creating a comprehensive dataset of both deformed and non-deformed objects, which can be difficult to obtain in many scenarios. In this paper, we introduce a novel framework for generating controlled synthetic data that simulates deformed objects. This approach allows for the realistic modeling of object deformations under various conditions. Our framework integrates an intelligent adapter network that facilitates sim-to-real domain adaptation, enhancing classification results without requiring real data from deformed objects. We conduct experiments on domain adaptation and classification tasks and demonstrate that our framework improves sim-to-real classification results compared to simulation baseline.
Authors: Jingying Ma, Qika Lin, Ziyu Jia, Mengling Feng
Abstract: Sleep staging is critical to assess sleep quality and diagnose disorders. Despite advancements in artificial intelligence enabling automated sleep staging, significant challenges remain: (1) Simultaneously extracting prominent temporal and spatial sleep features from multi-channel raw signals, including characteristic sleep waveforms and salient spatial brain networks. (2) Capturing the spatial-temporal coupling patterns essential for accurate sleep staging. To address these challenges, we propose a novel framework named ST-USleepNet, comprising a spatial-temporal graph construction module (ST) and a U-shaped sleep network (USleepNet). The ST module converts raw signals into a spatial-temporal graph based on signal similarity, temporal, and spatial relationships to model spatial-temporal coupling patterns. The USleepNet employs a U-shaped structure for both the temporal and spatial streams, mirroring its original use in image segmentation to isolate significant targets. Applied to raw sleep signals and graph data from the ST module, USleepNet effectively segments these inputs, simultaneously extracting prominent temporal and spatial sleep features. Testing on three datasets demonstrates that ST-USleepNet outperforms existing baselines, and model visualizations confirm its efficacy in extracting prominent sleep features and temporal-spatial coupling patterns across various sleep stages. The code is available at: https://github.com/Majy-Yuji/ST-USleepNet.git.
Authors: Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu
Abstract: Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.
Authors: Arjun Roy, Kaushik Roy
Abstract: The convergence of fully homomorphic encryption (FHE) and machine learning offers unprecedented opportunities for private inference of sensitive data. FHE enables computation directly on encrypted data, safeguarding the entire machine learning pipeline, including data and model confidentiality. However, existing FHE-based implementations for deep neural networks face significant challenges in computational cost, latency, and scalability, limiting their practical deployment. This paper introduces DCT-CryptoNets, a novel approach that operates directly in the frequency-domain to reduce the burden of computationally expensive non-linear activations and homomorphic bootstrap operations during private inference. It does so by utilizing the discrete cosine transform (DCT), commonly employed in JPEG encoding, which has inherent compatibility with remote computing services where images are generally stored and transmitted in this encoded format. DCT-CryptoNets demonstrates a substantial latency reductions of up to 5.3$\times$ compared to prior work on benchmark image classification tasks. Notably, it demonstrates inference on the ImageNet dataset within 2.5 hours (down from 12.5 hours on equivalent 96-thread compute resources). Furthermore, by learning perceptually salient low-frequency information DCT-CryptoNets improves the reliability of encrypted predictions compared to RGB-based networks by reducing error accumulating homomorphic bootstrap operations. DCT-CryptoNets also demonstrates superior scalability to RGB-based networks by further reducing computational cost as image size increases. This study demonstrates a promising avenue for achieving efficient and practical private inference of deep learning models on high resolution images seen in real-world applications.
Authors: Juli\'an Tachella, Mike Davies, Laurent Jacques
Abstract: Recently, many self-supervised learning methods for image reconstruction have been proposed that can learn from noisy data alone, bypassing the need for ground-truth references. Most existing methods cluster around two classes: i) Stein's Unbiased Risk Estimate (SURE) and similar approaches that assume full knowledge of the distribution, and ii) Noise2Self and similar cross-validation methods that require very mild knowledge about the noise distribution. The first class of methods tends to be impractical, as the noise level is often unknown in real-world applications, and the second class is often suboptimal compared to supervised learning. In this paper, we provide a theoretical framework that characterizes this expressivity-robustness trade-off and propose a new approach based on SURE, but unlike the standard SURE, does not require knowledge about the noise level. Throughout a series of experiments, we show that the proposed estimator outperforms other existing self-supervised methods on various imaging inverse problems.
Authors: Shijing Wang, Yaping Huang, Jun Xie, Yi Tian, Feng Chen, Zhepeng Wang
Abstract: Achieving accurate and reliable gaze predictions in complex and diverse environments remains challenging. Fortunately, it is straightforward to access diverse gaze datasets in real-world applications. We discover that training these datasets jointly can significantly improve the generalization of gaze estimation, which is overlooked in previous works. However, due to the inherent distribution shift across different datasets, simply mixing multiple dataset decreases the performance in the original domain despite gaining better generalization abilities. To address the problem of ``cross-dataset gaze estimation'', we propose a novel Evidential Inter-intra Fusion EIF framework, for training a cross-dataset model that performs well across all source and unseen domains. Specifically, we build independent single-dataset branches for various datasets where the data space is partitioned into overlapping subspaces within each dataset for local regression, and further create a cross-dataset branch to integrate the generalizable features from single-dataset branches. Furthermore, evidential regressors based on the Normal and Inverse-Gamma (NIG) distribution are designed to additionally provide uncertainty estimation apart from predicting gaze. Building upon this foundation, our proposed framework achieves both intra-evidential fusion among multiple local regressors within each dataset and inter-evidential fusion among multiple branches by Mixture \textbfof Normal Inverse-Gamma (MoNIG distribution. Experiments demonstrate that our method consistently achieves notable improvements in both source domains and unseen domains.
Authors: Hanfeng Zhai
Abstract: Numerical modeling of polycrystal plasticity is computationally intensive. We employ Graph Neural Networks (GNN) to predict stresses on complex geometries for polycrystal plasticity from Finite Element Method (FEM) simulations. We present a novel message-passing GNN that encodes nodal strain and edge distances between FEM mesh cells, and aggregates to obtain embeddings and combines the decoded embeddings with the nodal strains to predict stress tensors on graph nodes. The GNN is trained on subgraphs generated from FEM mesh graphs, in which the mesh cells are converted to nodes and edges are created between adjacent cells. We apply the trained GNN to periodic polycrystals with complex geometries and learn the strain-stress maps based on crystal plasticity theory. The GNN is accurately trained on FEM graphs, in which the $R^2$ for both training and testing sets are larger than 0.99. The proposed GNN approach speeds up more than 150 times compared with FEM on stress predictions. We also apply the trained GNN to unseen simulations for validations and the GNN generalizes well with an overall $R^2$ of 0.992. The GNN accurately predicts the von Mises stress on polycrystals. The proposed model does not overfit and generalizes well beyond the training data, as the error distributions demonstrate. This work outlooks surrogating crystal plasticity simulations using graph data.
Authors: Nicholas Kr\"amer
Abstract: Practical implementations of Gaussian smoothing algorithms have received a great deal of attention in the last 60 years. However, almost all work focuses on estimating complete time series (''fixed-interval smoothing'', $\mathcal{O}(K)$ memory) through variations of the Rauch--Tung--Striebel smoother, rarely on estimating the initial states (''fixed-point smoothing'', $\mathcal{O}(1)$ memory). Since fixed-point smoothing is a crucial component of algorithms for dynamical systems with unknown initial conditions, we close this gap by introducing a new formulation of a Gaussian fixed-point smoother. In contrast to prior approaches, our perspective admits a numerically robust Cholesky-based form (without downdates) and avoids state augmentation, which would needlessly inflate the state-space model and reduce the numerical practicality of any fixed-point smoother code. The experiments demonstrate how a JAX implementation of our algorithm matches the runtime of the fastest methods and the robustness of the most robust techniques while existing implementations must always sacrifice one for the other.
Authors: Qiushuo Hou, Sangwoo Park, Matteo Zecchin, Yunlong Cai, Guanding Yu, Osvaldo Simeone
Abstract: In modern wireless network architectures, such as Open Radio Access Network (O-RAN), the operation of the radio access network (RAN) is managed by applications, or apps for short, deployed at intelligent controllers. These apps are selected from a given catalog based on current contextual information. For instance, a scheduling app may be selected on the basis of current traffic and network conditions. Once an app is chosen and run, it is no longer possible to directly test the key performance indicators (KPIs) that would have been obtained with another app. In other words, we can never simultaneously observe both the actual KPI, obtained by the selected app, and the counterfactual KPI, which would have been attained with another app, for the same network condition, making individual-level counterfactual KPIs analysis particularly challenging. This what-if analysis, however, would be valuable to monitor and optimize the network operation, e.g., to identify suboptimal app selection strategies. This paper addresses the problem of estimating the values of KPIs that would have been obtained if a different app had been implemented by the RAN. To this end, we propose a conformal-prediction-based counterfactual analysis method for wireless systems that provides reliable error bars for the estimated KPIs, despite the inherent covariate shift between logged and test data. Experimental results for medium access control-layer apps and for physical-layer apps demonstrate the merits of the proposed method.
Authors: Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
Abstract: Self-consistency-based approaches, which involve repeatedly sampling multiple outputs and selecting the most consistent one as the final response, prove to be remarkably effective in improving the factual accuracy of large language models. Nonetheless, existing methods usually have strict constraints on the task format, largely limiting their applicability. In this paper, we present Integrative Decoding (ID), to unlock the potential of self-consistency in open-ended generation tasks. ID operates by constructing a set of inputs, each prepended with a previously sampled response, and then processes them concurrently, with the next token being selected by aggregating of all their corresponding predictions at each decoding step. In essence, this simple approach implicitly incorporates self-consistency in the decoding objective. Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance gains amplify progressively as the number of sampled responses increases, indicating the potential of ID to scale up with repeated sampling.
Authors: Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Yu Wang, Guohao Dai
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform. This provides a systematic and comprehensive summary of existing inference acceleration work by integrating software optimization methods and hardware platforms. We point out that three trends (multimodality, inference-time compute, and higher inference energy efficiency) are promising to redefine the capabilities of edge artificial intelligence systems. Our project is available at https://dai.sjtu.edu.cn/project.html.
Authors: Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai
Abstract: Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process which gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model that has the same architecture as standard language models. DART does not rely on image quantization, which enables more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and text-to-image generation tasks, offering a scalable, efficient alternative to traditional diffusion models. Through this unified framework, DART sets a new benchmark for scalable, high-quality image synthesis.
Authors: Bryan Chan, Anson Leung, James Bergstra
Abstract: Offline-to-online reinforcement learning (O2O RL) aims to obtain a continually improving policy as it interacts with the environment, while ensuring the initial policy behaviour is satisficing. This satisficing behaviour is necessary for robotic manipulation where random exploration can be costly due to catastrophic failures and time. O2O RL is especially compelling when we can only obtain a scarce amount of (potentially suboptimal) demonstrations$\unicode{x2014}$a scenario where behavioural cloning (BC) is known to suffer from distribution shift. Previous works have outlined the challenges in applying O2O RL algorithms under the image-based environments. In this work, we propose a novel O2O RL algorithm that can learn in a real-life image-based robotic vacuum grasping task with a small number of demonstrations where BC fails majority of the time. The proposed algorithm replaces the target network in off-policy actor-critic algorithms with a regularization technique inspired by neural tangent kernel. We demonstrate that the proposed algorithm can reach above 90\% success rate in under two hours of interaction time, with only 50 human demonstrations, while BC and existing commonly-used RL algorithms fail to achieve similar performance.
Authors: Amitai Yacobi, Ofir Lindenbaum, Uri Shaham
Abstract: Multi-view representation learning (MvRL) has garnered substantial attention in recent years, driven by the increasing demand for applications that can effectively process and analyze data from multiple sources. In this context, graph Laplacian-based MvRL methods have demonstrated remarkable success in representing multi-view data. However, these methods often struggle with generalization to new data and face challenges with scalability. Moreover, in many practical scenarios, multi-view data is contaminated by noise or outliers. In such cases, modern deep-learning-based MvRL approaches that rely on alignment or contrastive objectives present degraded performance in downstream tasks, as they may impose incorrect consistency between clear and corrupted data sources. We introduce $\textit{SpecRaGE}$, a novel fusion-based framework that integrates the strengths of graph Laplacian methods with the power of deep learning to overcome these challenges. SpecRage uses neural networks to learn parametric mapping that approximates a joint diagonalization of graph Laplacians. This solution bypasses the need for alignment while enabling generalizable and scalable learning of informative and meaningful representations. Moreover, it incorporates a meta-learning fusion module that dynamically adapts to data quality, ensuring robustness against outliers and noisy views. Our extensive experiments demonstrate that SpecRaGE outperforms state-of-the-art methods, particularly in scenarios with data contamination, paving the way for more reliable and efficient multi-view learning.
Authors: Hongyi Pan, Ziliang Hong, Gorkem Durak, Elif Keles, Halil Ertugrul Aktas, Yavuz Taktak, Alpay Medetalibeyoglu, Zheyuan Zhang, Yury Velichko, Concetto Spampinato, Ivo Schoots, Marco J. Bruno, Pallavi Tiwari, Candice Bolan, Tamas Gonda, Frank Miller, Rajesh N. Keswani, Michael B. Wallace, Ziyue Xu, Ulas Bagci
Abstract: Accurate classification of Intraductal Papillary Mucinous Neoplasms (IPMN) is essential for identifying high-risk cases that require timely intervention. In this study, we develop a federated learning framework for multi-center IPMN classification utilizing a comprehensive pancreas MRI dataset. This dataset includes 652 T1-weighted and 655 T2-weighted MRI images, accompanied by corresponding IPMN risk scores from 7 leading medical institutions, making it the largest and most diverse dataset for IPMN classification to date. We assess the performance of DenseNet-121 in both centralized and federated settings for training on distributed data. Our results demonstrate that the federated learning approach achieves high classification accuracy comparable to centralized learning while ensuring data privacy across institutions. This work marks a significant advancement in collaborative IPMN classification, facilitating secure and high-accuracy model training across multiple centers.
Authors: Taiyi Wang, Jianheng Liu, Bryan Lee, Zhihao Wu, Yu Wu
Abstract: In many practical applications, decision-making processes must balance the costs of acquiring information with the benefits it provides. Traditional control systems often assume full observability, an unrealistic assumption when observations are expensive. We tackle the challenge of simultaneously learning observation and control strategies in such cost-sensitive environments by introducing the Observation-Constrained Markov Decision Process (OCMDP), where the policy influences the observability of the true state. To manage the complexity arising from the combined observation and control actions, we develop an iterative, model-free deep reinforcement learning algorithm that separates the sensing and control components of the policy. This decomposition enables efficient learning in the expanded action space by focusing on when and what to observe, as well as determining optimal control actions, without requiring knowledge of the environment's dynamics. We validate our approach on a simulated diagnostic task and a realistic healthcare environment using HeartPole. Given both scenarios, the experimental results demonstrate that our model achieves a substantial reduction in observation costs on average, significantly outperforming baseline methods by a notable margin in efficiency.
Authors: Amirabbas Afzali, Borna Khodabandeh, Ali Rasekh, Mahyar JafariNodeh, Sepehr kazemi, Simon Gottschalk
Abstract: Contrastive learning models have demonstrated impressive abilities to capture semantic similarities by aligning representations in the embedding space. However, their performance can be limited by the quality of the training data and its inherent biases. While Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to generative models to align them with human preferences, their use in contrastive learning has yet to be explored. This paper introduces a novel method for training contrastive learning models using Preference Optimization (PO) to break down complex concepts. Our method systematically aligns model behavior with desired preferences, enhancing performance on the targeted task. In particular, we focus on enhancing model robustness against typographic attacks, commonly seen in contrastive models like CLIP. We further apply our method to disentangle gender understanding and mitigate gender biases, offering a more nuanced control over these sensitive attributes. Our experiments demonstrate that models trained using PO outperform standard contrastive learning techniques while retaining their ability to handle adversarial challenges and maintain accuracy on other downstream tasks. This makes our method well-suited for tasks requiring fairness, robustness, and alignment with specific preferences. We evaluate our method on several vision-language tasks, tackling challenges such as typographic attacks. Additionally, we explore the model's ability to disentangle gender concepts and mitigate gender bias, showcasing the versatility of our approach.
Authors: Hanning Yuan, Zhihui Zhang, Qi Guo, Lianhua Chi, Sijie Ruan, Jinhui Pang, Xiaoshuai Hao
Abstract: Multi-view contrastive clustering (MVCC) has gained significant attention for generating consistent clustering structures from multiple views through contrastive learning. However, most existing MVCC methods create cross-views by combining any two views, leading to a high volume of unreliable pairs. Furthermore, these approaches often overlook discrepancies in multi-view representations, resulting in representation degeneration. To address these challenges, we introduce a novel model called Dual-Weighted Contrastive Learning (DWCL) for Multi-View Clustering. Specifically, to reduce the impact of unreliable cross-views, we introduce an innovative Best-Other (B-O) contrastive mechanism that enhances the representation of individual views at a low computational cost. Furthermore, we develop a dual weighting strategy that combines a view quality weight, reflecting the quality of each view, with a view discrepancy weight. This approach effectively mitigates representation degeneration by downplaying cross-views that are both low in quality and high in discrepancy. We theoretically validate the efficiency of the B-O contrastive mechanism and the effectiveness of the dual weighting strategy. Extensive experiments demonstrate that DWCL outperforms previous methods across eight multi-view datasets, showcasing superior performance and robustness in MVCC. Specifically, our method achieves absolute accuracy improvements of 5.4\% and 5.6\% compared to state-of-the-art methods on the Caltech6V7 and MSRCv1 datasets, respectively.
Authors: Maxwell A. Xu, Jaya Narain, Gregory Darnell, Haraldur Hallgrimsson, Hyewon Jeong, Darren Forde, Richard Fineman, Karthik J. Raghuram, James M. Rehg, Shirley Ren
Abstract: We present RelCon, a novel self-supervised *Rel*ative *Con*trastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss for training an motion foundation model from wearable sensors. The learnable distance measure captures motif similarity and domain-specific semantic information such as rotation invariance. The learned distance provides a measurement of semantic similarity between a pair of accelerometer time-series segments, which is used to measure the distance between an anchor and various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.
Authors: Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Bryan A. Plummer, Kate Saenko
Abstract: Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becoming better over time, or if improved benchmark performance is simply an artifact of stronger pre-training. Prior studies have shown that perceptual similarity to pre-training data correlates with zero-shot performance, but we find the effect limited in the DG setting. Instead, we posit that having perceptually similar data in pretraining is not enough; and that it is how well these data were learned that determines performance. This leads us to introduce the Alignment Hypothesis, which states that the final DG performance will be high if and only if alignment of image and class label text embeddings is high. Our experiments confirm the Alignment Hypothesis is true, and we use it as an analysis tool of existing DG methods evaluated on DomainBed datasets by splitting evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP). We show that all evaluated DG methods struggle on DomainBed-OOP, while recent methods excel on DomainBed-IP. Put together, our findings highlight the need for DG methods which can generalize beyond pretraining alignment.
Authors: Zheng Cao, Helyette Geman
Abstract: This article introduces a Hype-Adjusted Probability Measure in the context of a new Natural Language Processing (NLP) approach for stock return and volatility forecasting. A novel sentiment score equation is proposed to represent the impact of intraday news on forecasting next-period stock return and volatility for selected U.S. semiconductor tickers, a very vibrant industry sector. This work improves the forecast accuracy by addressing news bias, memory, and weight, and incorporating shifts in sentiment direction. More importantly, it extends the use of the remarkable tool of change of Probability Measure developed in the finance of Asset Pricing to NLP forecasting by constructing a Hype-Adjusted Probability Measure, obtained from a redistribution of the weights in the probability space, meant to correct for excessive or insufficient news.
Authors: Joohyung Lee, Jungchan Cho, Wonjun Lee, Mohamed Seif, H. Vincent Poor
Abstract: To alleviate the training burden in federated learning while enhancing convergence speed, Split Federated Learning (SFL) has emerged as a promising approach by combining the advantages of federated and split learning. However, recent studies have largely overlooked competitive situations. In this framework, the SFL model owner can choose the cut layer to balance the training load between the server and clients, ensuring the necessary level of privacy for the clients. Additionally, the SFL model owner sets incentives to encourage client participation in the SFL process. The optimization strategies employed by the SFL model owner influence clients' decisions regarding the amount of data they contribute, taking into account the shared incentives over clients and anticipated energy consumption during SFL. To address this framework, we model the problem using a hierarchical decision-making approach, formulated as a single-leader multi-follower Stackelberg game. We demonstrate the existence and uniqueness of the Nash equilibrium among clients and analyze the Stackelberg equilibrium by examining the leader's game. Furthermore, we discuss privacy concerns related to differential privacy and the criteria for selecting the minimum required cut layer. Our findings show that the Stackelberg equilibrium solution maximizes the utility for both the clients and the SFL model owner.
Authors: Robyn Brooks, Marissa Masden
Abstract: One common function class in machine learning is the class of ReLU neural networks. ReLU neural networks induce a piecewise linear decomposition of their input space called the canonical polyhedral complex. It has previously been established that it is decidable whether a ReLU neural network is piecewise linear Morse. In order to expand computational tools for analyzing the topological properties of ReLU neural networks, and to harness the strengths of discrete Morse theory, we introduce a schematic for translating between a given piecewise linear Morse function (e.g. parameters of a ReLU neural network) on a canonical polyhedral complex and a compatible (``relatively perfect") discrete Morse function on the same complex. Our approach is constructive, producing an algorithm that can be used to determine if a given vertex in a canonical polyhedral complex corresponds to a piecewise linear Morse critical point. Furthermore we provide an algorithm for constructing a consistent discrete Morse pairing on cells in the canonical polyhedral complex which contain this vertex. We additionally provide some new realizability results with respect to sublevel set topology in the case of shallow ReLU neural networks.
Authors: Saskia Laura Schr\"oer, Giovanni Apruzzese, Soheil Human, Pavel Laskov, Hyrum S. Anderson, Edward W. N. Bernroider, Aurore Fass, Ben Nassi, Vera Rimmer, Fabio Roli, Samer Salam, Ashley Shen, Ali Sunyaev, Tim Wadwha-Brown, Isabel Wagner, Gang Wang
Abstract: Our society increasingly benefits from Artificial Intelligence (AI). Unfortunately, more and more evidence shows that AI is also used for offensive purposes. Prior works have revealed various examples of use cases in which the deployment of AI can lead to violation of security and privacy objectives. No extant work, however, has been able to draw a holistic picture of the offensive potential of AI. In this SoK paper we seek to lay the ground for a systematic analysis of the heterogeneous capabilities of offensive AI. In particular we (i) account for AI risks to both humans and systems while (ii) consolidating and distilling knowledge from academic literature, expert opinions, industrial venues, as well as laypeople -- all of which being valuable sources of information on offensive AI. To enable alignment of such diverse sources of knowledge, we devise a common set of criteria reflecting essential technological factors related to offensive AI. With the help of such criteria, we systematically analyze: 95 research papers; 38 InfoSec briefings (from, e.g., BlackHat); the responses of a user study (N=549) entailing individuals with diverse backgrounds and expertise; and the opinion of 12 experts. Our contributions not only reveal concerning ways (some of which overlooked by prior work) in which AI can be offensively used today, but also represent a foothold to address this threat in the years to come.
Authors: Phuc Nguyen, Miao Li, Alexandra Morgan, Rima Arnaout, Ramy Arnaout
Abstract: Generative models hold great potential, but only if one can trust the evaluation of the data they generate. We show that many commonly used quality scores for comparing two-dimensional distributions of synthetic vs. ground-truth data give better results than they should, a phenomenon we call the "grade inflation problem." We show that the correlation score, Jaccard score, earth-mover's score, and Kullback-Leibler (relative-entropy) score all suffer grade inflation. We propose that any score that values all datapoints equally, as these do, will also exhibit grade inflation; we refer to such scores as "equipoint" scores. We introduce the concept of "equidensity" scores, and present the Eden score, to our knowledge the first example of such a score. We found that Eden avoids grade inflation and agrees better with human perception of goodness-of-fit than the equipoint scores above. We propose that any reasonable equidensity score will avoid grade inflation. We identify a connection between equidensity scores and R\'enyi entropy of negative order. We conclude that equidensity scores are likely to outperform equipoint scores for generative models, and for comparing low-dimensional distributions more generally.
Authors: Travis E. Gibson, Sawal Acharya
Abstract: Online learning and model reference adaptive control have many interesting intersections. One area where they differ however is in how the algorithms are analyzed and what objective or metric is used to discriminate "good" algorithms from "bad" algorithms. In adaptive control there are usually two objectives: 1) prove that all time varying parameters/states of the system are bounded, and 2) that the instantaneous error between the adaptively controlled system and a reference system converges to zero over time (or at least a compact set). For online learning the performance of algorithms is often characterized by the regret the algorithm incurs. Regret is defined as the cumulative loss (cost) over time from the online algorithm minus the cumulative loss (cost) of the single optimal fixed parameter choice in hindsight. Another significant difference between the two areas of research is with regard to the assumptions made in order to obtain said results. Adaptive control makes assumptions about the input-output properties of the control problem and derives solutions for a fixed error model or optimization task. In the online learning literature results are derived for classes of loss functions (i.e. convex) while a priori assuming certain signals are bounded. In this work we discuss these differences in detail through the regret based analysis of gradient descent for convex functions and the control based analysis of a streaming regression problem. We close with a discussion about the newly defined paradigm of online adaptive control.
Authors: Zephan M. Enciso, Boyang Cheng, Likai Pei, Jianbo Liu, Steven Davis, Michael Niemier, Ningyuan Cao
Abstract: Uncertainty estimation is an indispensable capability for AI-enabled, safety-critical applications, e.g. autonomous vehicles or medical diagnosis. Bayesian neural networks (BNNs) use Bayesian statistics to provide both classification predictions and uncertainty estimation, but they suffer from high computational overhead associated with random number generation and repeated sample iterations. Furthermore, BNNs are not immediately amenable to acceleration through compute-in-memory architectures due to the frequent memory writes necessary after each RNG operation. To address these challenges, we present an ASIC that integrates 360 fJ/Sample Gaussian RNG directly into the SRAM memory words. This integration reduces RNG overhead and enables fully-parallel compute-in-memory operations for BNNs. The prototype chip achieves 5.12 GSa/s RNG throughput and 102 GOp/s neural network throughput while occupying 0.45 mm2, bringing AI uncertainty estimation to edge computation.
Authors: Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang
Abstract: Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs). The introduction of process supervision for CoT trajectories has sparked discussions on improving test-time scaling, thereby unlocking the System 2-style thinking capabilities of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving both deliberate reasoning and fine-grained verification. In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning. We introduce a three-module CoT data synthesis process that integrates CoT distillation, trajectory-format rewriting, and format unification. This process generates MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset. Furthermore, we implement a dual-view trajectory labeling automation that targets both visual grounding fidelity and deductive chain validity, resulting in the DualMath-1.1M dataset. The URSA-8B model, trained on MMathCoT-1M, achieves new state-of-the-art (SOTA) performance among similarly sized multimodal LLMs on six popular reasoning benchmarks. Training URSA-8B further on the DualMath-1.1M dataset yields URSA-RM-8B, a verifier that enhances URSA-8B's test-time performance and surpasses strong closed-source multimodal MLLMs like GPT-4o. The model weights, training data, and code have been open-sourced: https://github.com/URSA-MATH/URSA-MATH.
Authors: Feiyi Chen, Leilei Zhang, Guansong Pang, Roger Zimmermann, Shuiguang Deng
Abstract: In anomaly detection, methods based on large language models (LLMs) can incorporate expert knowledge by reading professional document, while task-specific small models excel at extracting normal data patterns and detecting value fluctuations from training data of target applications. Inspired by the human nervous system, where the brain stores expert knowledge and the peripheral nervous system and spinal cord handle specific tasks like withdrawal and knee-jerk reflexes, we propose CoLLaTe, a framework designed to facilitate collaboration between LLMs and task-specific models, leveraging the strengths of both models for anomaly detection. In particular, we first formulate the collaboration process and identify two key challenges in the collaboration: (1) the misalignment between the expression domains of the LLMs and task-specific small models, and (2) error accumulation arising from the predictions of both models. To address these challenges, we then introduce two key components in CoLLaTe: a model alignment module and a collaborative loss function. Through theoretical analysis and experimental validation, we demonstrate that these components effectively mitigate the identified challenges and achieve better performance than both LLM-based and task-specific models.
Authors: Guido Nannini, Julian Suk, Patryk Rygiel, Simone Saitta, Luca Mariani, Riccardo Maragna, Andrea Baggiano, Gianluca Pontone, Jelmer M. Wolterink, Alberto Redaelli
Abstract: Coronary artery disease, caused by the narrowing of coronary vessels due to atherosclerosis, is the leading cause of death worldwide. The diagnostic gold standard, fractional flow reserve (FFR), measures the trans-stenotic pressure ratio during maximal vasodilation but is invasive and costly. This has driven the development of virtual FFR (vFFR) using computational fluid dynamics (CFD) to simulate coronary flow. Geometric deep learning algorithms have shown promise for learning features on meshes, including cardiovascular research applications. This study empirically analyzes various backends for predicting vFFR fields in coronary arteries as CFD surrogates, comparing six backends for learning hemodynamics on meshes using CFD solutions as ground truth. The study has two parts: i) Using 1,500 synthetic left coronary artery bifurcations, models were trained to predict pressure-related fields for vFFR reconstruction, comparing different learning variables. ii) Using 427 patient-specific CFD simulations, experiments were repeated focusing on the best-performing learning variable from the synthetic dataset. Most backends performed well on the synthetic dataset, especially when predicting pressure drop over the manifold. Transformer-based backends outperformed others when predicting pressure and vFFR fields and were the only models achieving strong performance on patient-specific data, excelling in both average per-point error and vFFR accuracy in stenotic lesions. These results suggest geometric deep learning backends can effectively replace CFD for simple geometries, while transformer-based networks are superior for complex, heterogeneous datasets. Pressure drop was identified as the optimal network output for learning pressure-related fields.
Authors: Iosif Tsangko, Andreas Triantafyllopoulos, Michael M\"uller, Hendrik Schr\"oter, Bj\"orn W. Schuller
Abstract: The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a `one-size-fits-all' approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper its generalisability. Recent work has shown that in-context adaptation can improve performance by conditioning the denoising process on additional information extracted from background recordings to mitigate this. These recordings can be offloaded outside the hearing aid, thus improving performance while adding minimal computational overhead. We introduce these principles to the DFN model, thus proposing the DFingerNet (DFiN) model, which shows superior performance on various benchmarks inspired by the DNS Challenge.
Authors: Pengcheng Dong, Wenbo Wan, Huaxiang Zhang, Jiande Sun
Abstract: In recent years, action recognition has received much attention and wide application due to its important role in video understanding. Most of the researches on action recognition methods focused on improving the performance via various deep learning methods rather than the classification of skeleton points. The topological modeling between skeleton points and body parts was seldom considered. Although some studies have used a data-driven approach to classify the topology of the skeleton point, the nature of the skeleton point in terms of kinematics has not been taken into consideration. Therefore, in this paper, we draw on the theory of kinematics to adapt the topological relations of the skeleton point and propose a topological relation classification based on body parts and distance from core of body. To synthesize these topological relations for action recognition, we propose a novel Hypergraph Fusion Graph Convolutional Network (HFGCN). In particular, the proposed model is able to focus on the human skeleton points and the different body parts simultaneously, and thus construct the topology, which improves the recognition accuracy obviously. We use a hypergraph to represent the categorical relationships of these skeleton points and incorporate the hypergraph into a graph convolution network to model the higher-order relationships among the skeleton points and enhance the feature representation of the network. In addition, our proposed hypergraph attention module and hypergraph graph convolution module optimize topology modeling in temporal and channel dimensions, respectively, to further enhance the feature representation of the network. We conducted extensive experiments on three widely used datasets.The results validate that our proposed method can achieve the best performance when compared with the state-of-the-art skeleton-based methods.
Authors: Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
Abstract: Recent efforts leverage Large Language Models (LLMs) for modeling text-attributed graph structures in node classification tasks. These approaches describe graph structures for LLMs to understand or aggregate LLM-generated textual attribute embeddings through graph structure. However, these approaches face two main limitations in modeling graph structures with LLMs. (i) Graph descriptions become verbose in describing high-order graph structure. (ii) Textual attributes alone do not contain adequate graph structure information. It is challenging to model graph structure concisely and adequately with LLMs. LLMs lack built-in mechanisms to model graph structures directly. They also struggle with complex long-range dependencies between high-order nodes and target nodes. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose \textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge \textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates graphs into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand graph structures. During fine-tuning, this corpus describes the structural information of target nodes concisely with only a few tokens. By treating graphs as a new language, GDL4LLM enables LLMs to model graph structures adequately and concisely for node classification tasks. Extensive experiments on three real-world datasets demonstrate that GDL4LLM outperforms description-based and textual attribute embeddings-based baselines by efficiently modeling different orders of graph structure with LLMs.
Authors: Titouan Vayer (OCKHAM), Etienne Lasalle (OCKHAM)
Abstract: This note aims to demonstrate that performing maximum-likelihood estimation for a mixture model is equivalent to minimizing over the parameters an optimal transport problem with entropic regularization. The objective is pedagogical: we seek to present this already known result in a concise and hopefully simple manner. We give an illustration with Gaussian mixture models by showing that the standard EM algorithm is a specific block-coordinate descent on an optimal transport loss.
Authors: Frank Sifei Luan, Ziming Mao, Ron Yifeng Wang, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, Stephanie Wang
Abstract: While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of the two models that enables efficient and fault-tolerant heterogeneous execution. The key idea is to execute one partition at a time to allow lineage-based recovery with dynamic resource allocation. This enables memory-efficient pipelining across heterogeneous resources, similar to stream processing, but also offers the elasticity and fault tolerance properties of batch processing. We present Ray Data, an implementation of the streaming batch model that improves throughput on heterogeneous batch inference pipelines by 3--8$\times$ compared to traditional batch and stream processing systems. When training Stable Diffusion, Ray Data matches the throughput of single-node ML data loaders while additionally leveraging distributed heterogeneous clusters to further improve training throughput by 31%.
Authors: Kayvon Mazooji, Ilan Shomorony
Abstract: Clustering is often a challenging problem because of the inherent ambiguity in what the "correct" clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the clustering. The algorithm first identifies $K$ partial clusters (or "seeds") using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.