Authors: Tan Zhi-Xuan, Gloria Kang, Vikash Mansinghka, Joshua B. Tenenbaum
Abstract: The space of human goals is tremendously vast; and yet, from just a few moments of watching a scene or reading a story, we seem to spontaneously infer a range of plausible motivations for the people and characters involved. What explains this remarkable capacity for intuiting other agents' goals, despite the infinitude of ends they might pursue? And how does this cohere with our understanding of other people as approximately rational agents? In this paper, we introduce a sequential Monte Carlo model of open-ended goal inference, which combines top-down Bayesian inverse planning with bottom-up sampling based on the statistics of co-occurring subgoals. By proposing goal hypotheses related to the subgoals achieved by an agent, our model rapidly generates plausible goals without exhaustive search, then filters out goals that would be irrational given the actions taken so far. We validate this model in a goal inference task called Block Words, where participants try to guess the word that someone is stacking out of lettered blocks. In comparison to both heuristic bottom-up guessing and exact Bayesian inference over hundreds of goals, our model better predicts the mean, variance, efficiency, and resource rationality of human goal inferences, achieving similar accuracy to the exact model at a fraction of the cognitive cost, while also explaining garden-path effects that arise from misleading bottom-up cues. Our experiments thus highlight the importance of uniting top-down and bottom-up models for explaining the speed, accuracy, and generality of human theory-of-mind.
Authors: Jared Quincy Davis, Boris Hanin, Lingjiao Chen, Peter Bailis, Ion Stoica, Matei Zaharia
Abstract: As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in complexity theory that we show empirically extends to Language Models (LMs). We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems. Through experiments on synthetic tasks such as prime factorization, and core benchmarks such as the MMLU, we demonstrate notable performance gains. For instance, in factoring products of two 3-digit primes, a simple NoN improves accuracy from 3.7\% to 36.6\%. On MMLU, a verifier-based judge construction with only 3 generators boosts accuracy over individual GPT-4-Turbo calls by 2.8\%. Our analysis reveals that these gains are most pronounced in domains where verification is notably easier than generation--a characterization which we believe subsumes many reasoning and procedural knowledge tasks, but doesn't often hold for factual and declarative knowledge-based settings. For mathematical and formal logic reasoning-based subjects of MMLU, we observe a 5-8\% or higher gain, whilst no gain on others such as geography and religion. We provide key takeaways for ML practitioners, including the importance of considering verification complexity, the impact of witness format on verifiability, and a simple test to determine the potential benefit of this NoN approach for a given problem distribution. This work aims to inform future research and practice in the design of compound AI systems.
Authors: Nur Ahmad Khatim, Ahmad Azmul Asmar Irfan, Amaliya Mata'ul Hayah, Mansur M. Arief
Abstract: This study addresses the challenge of stroke diagnosis and treatment under uncertainty, a critical issue given the rapid progression and severe consequences of stroke conditions such as aneurysms, arteriovenous malformations (AVM), and occlusions. Current diagnostic methods, including Digital Subtraction Angiography (DSA), face limitations due to high costs and its invasive nature. To overcome these challenges, we propose a novel approach using a Partially Observable Markov Decision Process (POMDP) framework. Our model integrates advanced diagnostic tools and treatment approaches with a decision-making algorithm that accounts for the inherent uncertainties in stroke diagnosis. Our approach combines noisy observations from CT scans, Siriraj scores, and DSA reports to inform the subsequent treatment options. We utilize the online solver DESPOT, which employs tree-search methods and particle filters, to simulate potential future scenarios and guide our strategies. The results indicate that our POMDP framework balances diagnostic and treatment objectives, striking a tradeoff between the need for precise stroke identification via invasive procedures like DSA and the constraints of limited healthcare resources that necessitate more cost-effective strategies, such as in-hospital or at-home observation, by relying only relying on simulation rollouts and not imposing any prior knowledge. Our study offers a significant contribution by presenting a systematic framework that optimally integrates diagnostic and treatment processes for stroke and accounting for various uncertainties, thereby improving care and outcomes in stroke management.
Authors: Jake R. Watts, Joel Sokol
Abstract: This paper proposes a new method for preventing unsafe or otherwise low quality large language model (LLM) outputs, by leveraging the stochasticity of LLMs. We propose a system whereby LLM checkers vote on the acceptability of a generated output, regenerating it if a threshold of disapproval is reached, until sufficient checkers approve. We further propose estimators for cost and failure rate, and based on those estimators and experimental data tailored to the application, we propose an algorithm that achieves a desired failure rate at the least possible cost. We demonstrate that, under these models, failure rate decreases exponentially as a function of cost when voter count and threshold are chosen according to the algorithm, and that the models reasonably estimate the actual performance of such a system in action, even with limited data.
Authors: Zuoyin Tang, Jianhua He, Dashuai Pei, Kezhong Liu, Tao Gao
Abstract: Handling long tail corner cases is a major challenge faced by autonomous vehicles (AVs). While large language models (LLMs) hold great potentials to handle the corner cases with excellent generalization and explanation capabilities and received increasing research interest on application to autonomous driving, there are still technical barriers to be tackled, such as strict model performance and huge computing resource requirements of LLMs. In this paper, we investigate a new approach of applying remote or edge LLMs to support autonomous driving. A key issue for such LLM assisted driving system is the assessment of LLMs on their understanding of driving theory and skills, ensuring they are qualified to undertake safety critical driving assistance tasks for CAVs. We design and run driving theory tests for several proprietary LLM models (OpenAI GPT models, Baidu Ernie and Ali QWen) and open-source LLM models (Tsinghua MiniCPM-2B and MiniCPM-Llama3-V2.5) with more than 500 multiple-choices theory test questions. Model accuracy, cost and processing latency are measured from the experiments. Experiment results show that while model GPT-4 passes the test with improved domain knowledge and Ernie has an accuracy of 85% (just below the 86% passing threshold), other LLM models including GPT-3.5 fail the test. For the test questions with images, the multimodal model GPT4-o has an excellent accuracy result of 96%, and the MiniCPM-Llama3-V2.5 achieves an accuracy of 76%. While GPT-4 holds stronger potential for CAV driving assistance applications, the cost of using model GPT4 is much higher, almost 50 times of that of using GPT3.5. The results can help make decision on the use of the existing LLMs for CAV applications and balancing on the model performance and cost.
Authors: Zijian Wu, Jiayu Wang, Dahua Lin, Kai Chen
Abstract: Recently, large language models have presented promising results in aiding formal mathematical reasoning. However, their performance is restricted due to the scarcity of formal theorem-proving data, which requires additional effort to be extracted from raw formal language corpora. Meanwhile, a significant amount of human-written formal language corpora remains underutilized. To address this issue, we propose LEAN-GitHub, a dataset consisting of large-scale formal data extracted from almost all Lean 4 repositories on GitHub. After fine-tuning InternLM-math-plus on this dataset, our model achieved accuracies of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F test, surpassing state-of-the-art method at 52%. And it also achieves state-of-the-art on two other Lean 4 benchmarks (ProofNet and Putnam) targeting different fields/levels of math. These results demonstrate that our proposed dataset is beneficial for formal reasoning on a wide range of math topics. We open-source our model at https://GitHub. com/InternLM/InternLM-Math and our data at https://huggingface.co/ datasets/InternLM/Lean-GitHub
Authors: Irtaza Khalid, Steven Schockaert
Abstract: Developing models that can learn to reason is a notoriously challenging problem. We focus on reasoning in relational domains, where the use of Graph Neural Networks (GNNs) seems like a natural choice. However, previous work on reasoning with GNNs has shown that such models tend to fail when presented with test examples that require longer inference chains than those seen during training. This suggests that GNNs lack the ability to generalize from training examples in a systematic way, which would fundamentally limit their reasoning abilities. A common solution is to instead rely on neuro-symbolic methods, which are capable of reasoning in a systematic way by design. Unfortunately, the scalability of such methods is often limited and they tend to rely on overly strong assumptions, e.g.\ that queries can be answered by inspecting a single relational path. In this paper, we revisit the idea of reasoning with GNNs, showing that systematic generalization is possible as long as the right inductive bias is provided. In particular, we argue that node embeddings should be treated as epistemic states and that GNN should be parameterised accordingly. We propose a simple GNN architecture which is based on this view and show that it is capable of achieving state-of-the-art results. We furthermore introduce a benchmark which requires models to aggregate evidence from multiple relational paths. We show that existing neuro-symbolic approaches fail on this benchmark, whereas our considered GNN model learns to reason accurately.
Authors: Tsunehiko Tanaka, Edgar Simo-Serra
Abstract: To lower the barriers to game design development, automated game design, which generates game designs through computational processes, has been explored. In automated game design, machine learning-based techniques such as evolutionary algorithms have achieved success. Benefiting from the remarkable advancements in deep learning, applications in computer vision and natural language processing have progressed in level generation. However, due to the limited amount of data in game design, the application of deep learning has been insufficient for tasks such as game description generation. To pioneer a new approach for handling limited data in automated game design, we focus on the in-context learning of large language models (LLMs). LLMs can capture the features of a task from a few demonstration examples and apply the capabilities acquired during pre-training. We introduce the grammar of game descriptions, which effectively structures the game design space, into the LLMs' reasoning process. Grammar helps LLMs capture the characteristics of the complex task of game description generation. Furthermore, we propose a decoding method that iteratively improves the generated output by leveraging the grammar. Our experiments demonstrate that this approach performs well in generating game descriptions.
Authors: Markus Iser
Abstract: Automated reasoning is a key technology in the young but rapidly growing field of Explainable Artificial Intelligence (XAI). Explanability helps build trust in artificial intelligence systems beyond their mere predictive accuracy and robustness. In this paper, we propose a cycle of scientific discovery that combines machine learning with automated reasoning for the generation and the selection of explanations. We present a taxonomy of explanation selection problems that draws on insights from sociology and cognitive science. These selection criteria subsume existing notions and extend them with new properties.
Authors: Huandong Wang, Huan Yan, Can Rong, Yuan Yuan, Fenyu Jiang, Zhenyu Han, Hongjie Sui, Depeng Jin, Yong Li
Abstract: Complex system simulation has been playing an irreplaceable role in understanding, predicting, and controlling diverse complex systems. In the past few decades, the multi-scale simulation technique has drawn increasing attention for its remarkable ability to overcome the challenges of complex system simulation with unknown mechanisms and expensive computational costs. In this survey, we will systematically review the literature on multi-scale simulation of complex systems from the perspective of knowledge and data. Firstly, we will present background knowledge about simulating complex system simulation and the scales in complex systems. Then, we divide the main objectives of multi-scale modeling and simulation into five categories by considering scenarios with clear scale and scenarios with unclear scale, respectively. After summarizing the general methods for multi-scale simulation based on the clues of knowledge and data, we introduce the adopted methods to achieve different objectives. Finally, we introduce the applications of multi-scale simulation in typical matter systems and social systems.
Authors: Tao Meng, Fuchen Zhang, Yuntao Shou, Hongen Shao, Wei Ai, Keqin Li
Abstract: Since Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields, it has received extensive research attention in recent years. Unlike traditional unimodal emotion recognition, MERC can fuse complementary semantic information between multiple modalities (e.g., text, audio, and vision) to improve emotion recognition. However, previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion but directly fuses multimodal features, which will hinder the model for representation learning. In this study, we have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem, which uses a recurrent iterative module with memory to align multimodal features, and then uses the masked GCN for multimodal feature fusion. First, we employ LSTM to capture contextual information and use a graph attention-filtering mechanism to eliminate noise effectively within the modality. Second, we build a recurrent iteration module with a memory function, which can use communication between different modalities to eliminate the gap between modalities and achieve the preliminary alignment of features between modalities. Then, a cross-modal multi-head attention mechanism is introduced to achieve feature alignment between modalities and construct a masked GCN for multimodal feature fusion, which can perform random mask reconstruction on the nodes in the graph to obtain better node feature representation. Finally, we utilize a multilayer perceptron (MLP) for emotion recognition. Extensive experiments on two benchmark datasets (i.e., IEMOCAP and MELD) demonstrate that {MGLRA} outperforms state-of-the-art methods.
Authors: Yufeng Li, Wenchao Zhao, Bo Dang, Xu Yan, Weimin Wang, Min Gao, Mingxuan Xiao
Abstract: In clinical treatment, identifying potential adverse reactions of drugs can help assist doctors in making medication decisions. In response to the problems in previous studies that features are high-dimensional and sparse, independent prediction models need to be constructed for each adverse reaction of drugs, and the prediction accuracy is low, this paper develops an adverse drug reaction prediction model based on knowledge graph embedding and deep learning, which can predict experimental results. Unified prediction of adverse drug reactions covered. Knowledge graph embedding technology can fuse the associated information between drugs and alleviate the shortcomings of high-dimensional sparsity in feature matrices, and the efficient training capabilities of deep learning can improve the prediction accuracy of the model. This article builds an adverse drug reaction knowledge graph based on drug feature data; by analyzing the embedding effect of the knowledge graph under different embedding strategies, the best embedding strategy is selected to obtain sample vectors; and then a convolutional neural network model is constructed to predict adverse reactions. The results show that under the DistMult embedding model and 400-dimensional embedding strategy, the convolutional neural network model has the best prediction effect; the average accuracy, F_1 score, recall rate and area under the curve of repeated experiments are better than the methods reported in the literature. The obtained prediction model has good prediction accuracy and stability, and can provide an effective reference for later safe medication guidance.
Authors: Jiaqiang Zhang, Songcan Chen
Abstract: Graph contrastive learning (GCL) is an effective paradigm for node representation learning in graphs. The key components hidden behind GCL are data augmentation and positive-negative pair selection. Typical data augmentations in GCL, such as uniform deletion of edges, are generally blind and resort to local perturbation, which is prone to producing under-diversity views. Additionally, there is a risk of making the augmented data traverse to other classes. Moreover, most methods always treat all other samples as negatives. Such a negative pairing naturally results in sampling bias and likewise may make the learned representation suffer from semantic drift. Therefore, to increase the diversity of the contrastive view, we propose two simple and effective global topological augmentations to compensate current GCL. One is to mine the semantic correlation between nodes in the feature space. The other is to utilize the algebraic properties of the adjacency matrix to characterize the topology by eigen-decomposition. With the help of both, we can retain important edges to build a better view. To reduce the risk of semantic drift, a prototype-based negative pair selection is further designed which can filter false negative samples. Extensive experiments on various tasks demonstrate the advantages of the model compared to the state-of-the-art methods.
Authors: Vivek Khatana, Murti V. Salapaka
Abstract: In this article, we focus on solving a class of distributed optimization problems involving $n$ agents with the local objective function at every agent $i$ given by the difference of two convex functions $f_i$ and $g_i$ (difference-of-convex (DC) form), where $f_i$ and $g_i$ are potentially nonsmooth. The agents communicate via a directed graph containing $n$ nodes. We create smooth approximations of the functions $f_i$ and $g_i$ and develop a distributed algorithm utilizing the gradients of the smooth surrogates and a finite-time approximate consensus protocol. We term this algorithm as DDC-Consensus. The developed DDC-Consensus algorithm allows for non-symmetric directed graph topologies and can be synthesized distributively. We establish that the DDC-Consensus algorithm converges to a stationary point of the nonconvex distributed optimization problem. The performance of the DDC-Consensus algorithm is evaluated via a simulation study to solve a nonconvex DC-regularized distributed least squares problem. The numerical results corroborate the efficacy of the proposed algorithm.
Authors: Huandong Wang, Changzheng Gao, Yuchen Wu, Depeng Jin, Lina Yao, Yong Li
Abstract: Generating human mobility trajectories is of great importance to solve the lack of large-scale trajectory data in numerous applications, which is caused by privacy concerns. However, existing mobility trajectory generation methods still require real-world human trajectories centrally collected as the training data, where there exists an inescapable risk of privacy leakage. To overcome this limitation, in this paper, we propose PateGail, a privacy-preserving imitation learning model to generate mobility trajectories, which utilizes the powerful generative adversary imitation learning model to simulate the decision-making process of humans. Further, in order to protect user privacy, we train this model collectively based on decentralized mobility data stored in user devices, where personal discriminators are trained locally to distinguish and reward the real and generated human trajectories. In the training process, only the generated trajectories and their rewards obtained based on personal discriminators are shared between the server and devices, whose privacy is further preserved by our proposed perturbation mechanisms with theoretical proof to satisfy differential privacy. Further, to better model the human decision-making process, we propose a novel aggregation mechanism of the rewards obtained from personal discriminators. We theoretically prove that under the reward obtained based on the aggregation mechanism, our proposed model maximizes the lower bound of the discounted total rewards of users. Extensive experiments show that the trajectories generated by our model are able to resemble real-world trajectories in terms of five key statistical metrics, outperforming state-of-the-art algorithms by over 48.03%. Furthermore, we demonstrate that the synthetic trajectories are able to efficiently support practical applications, including mobility prediction and location recommendation.
Authors: Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai, Zhi Zheng, Guoyang Zeng, Zhiyuan Liu
Abstract: The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which are representative of various daily coding tasks. To address this gap, we introduce \textbf{PyBench}, a benchmark encompassing five main categories of real-world tasks, covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response to fulfill the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code. Our evaluations indicate that current open-source LLMs are struggling with these tasks. Hence, we conduct analysis and experiments on four kinds of datasets proving that comprehensive abilities are needed for PyBench. Our fine-tuned 8B size model: \textbf{PyLlama3} achieves an exciting performance on PyBench which surpasses many 33B and 70B size models. Our Benchmark, Training Dataset, and Model are available at: \href{https://github.com/Mercury7353/PyBench}{https://github.com/Mercury7353/PyBench}
URLs: https://github.com/Mercury7353/PyBench, https://github.com/Mercury7353/PyBench
Authors: Xiaojin Zhang, Wei Chen
Abstract: Federated learning has emerged as a promising paradigm for collaborative model training while preserving data privacy. However, recent studies have shown that it is vulnerable to various privacy attacks, such as data reconstruction attacks. In this paper, we provide a theoretical analysis of privacy leakage in federated learning from two perspectives: linear algebra and optimization theory. From the linear algebra perspective, we prove that when the Jacobian matrix of the batch data is not full rank, there exist different batches of data that produce the same model update, thereby ensuring a level of privacy. We derive a sufficient condition on the batch size to prevent data reconstruction attacks. From the optimization theory perspective, we establish an upper bound on the privacy leakage in terms of the batch size, the distortion extent, and several other factors. Our analysis provides insights into the relationship between privacy leakage and various aspects of federated learning, offering a theoretical foundation for designing privacy-preserving federated learning algorithms.
Authors: Aws Khalil, Jaerock Kwon
Abstract: This study introduces the Perception Latency Mitigation Network (PLM-Net), a novel deep learning approach for addressing perception latency in vision-based Autonomous Vehicle (AV) lateral control systems. Perception latency is the delay between capturing the environment through vision sensors (e.g., cameras) and applying an action (e.g., steering). This issue is understudied in both classical and neural-network-based control methods. Reducing this latency with powerful GPUs and FPGAs is possible but impractical for automotive platforms. PLM-Net comprises the Base Model (BM) and the Timed Action Prediction Model (TAPM). BM represents the original Lane Keeping Assist (LKA) system, while TAPM predicts future actions for different latency values. By integrating these models, PLM-Net mitigates perception latency. The final output is determined through linear interpolation of BM and TAPM outputs based on real-time latency. This design addresses both constant and varying latency, improving driving trajectories and steering control. Experimental results validate the efficacy of PLM-Net across various latency conditions. Source code: https://github.com/AwsKhalil/oscar/tree/devel-plm-net.
URLs: https://github.com/AwsKhalil/oscar/tree/devel-plm-net.
Authors: Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig
Abstract: Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenDevin, a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-Bench) and web browsing (e.g., WebArena), among others. Released under the permissive MIT license, OpenDevin is a community project spanning academia and industry with more than 1.3K contributions from over 160 contributors and will improve going forward.
Authors: Riku Arakawa, Hiromu Yakura, Mayank Goel
Abstract: We routinely perform procedures (such as cooking) that include a set of atomic steps. Often, inadvertent omission or misordering of a single step can lead to serious consequences, especially for those experiencing cognitive challenges such as dementia. This paper introduces PrISM-Observer, a smartwatch-based, context-aware, real-time intervention system designed to support daily tasks by preventing errors. Unlike traditional systems that require users to seek out information, the agent observes user actions and intervenes proactively. This capability is enabled by the agent's ability to continuously update its belief in the user's behavior in real-time through multimodal sensing and forecast optimal intervention moments and methods. We first validated the steps-tracking performance of our framework through evaluations across three datasets with different complexities. Then, we implemented a real-time agent system using a smartwatch and conducted a user study in a cooking task scenario. The system generated helpful interventions, and we gained positive feedback from the participants. The general applicability of PrISM-Observer to daily tasks promises broad applications, for instance, including support for users requiring more involved interventions, such as people with dementia or post-surgical patients.
Authors: Benjamin Wilson, Nicholas Autio Mitchell, Jhony Kaesemodel Pontes, James Hays
Abstract: Lidar-based perception pipelines rely on 3D object detection models to interpret complex scenes. While multiple representations for lidar exist, the range-view is enticing since it losslessly encodes the entire lidar sensor output. In this work, we achieve state-of-the-art amongst range-view 3D object detection models without using multiple techniques proposed in past range-view literature. We explore range-view 3D object detection across two modern datasets with substantially different properties: Argoverse 2 and Waymo Open. Our investigation reveals key insights: (1) input feature dimensionality significantly influences the overall performance, (2) surprisingly, employing a classification loss grounded in 3D spatial proximity works as well or better compared to more elaborate IoU-based losses, and (3) addressing non-uniform lidar density via a straightforward range subsampling technique outperforms existing multi-resolution, range-conditioned networks. Our experiments reveal that techniques proposed in recent range-view literature are not needed to achieve state-of-the-art performance. Combining the above findings, we establish a new state-of-the-art model for range-view 3D object detection -- improving AP by 2.2% on the Waymo Open dataset while maintaining a runtime of 10 Hz. We establish the first range-view model on the Argoverse 2 dataset and outperform strong voxel-based baselines. All models are multi-class and open-source. Code is available at https://github.com/benjaminrwilson/range-view-3d-detection.
URLs: https://github.com/benjaminrwilson/range-view-3d-detection.
Authors: Jae Soon Baik, In Young Yoon, Kun Hoon Kim, Jun Won Choi
Abstract: Deep neural networks have demonstrated remarkable advancements in various fields using large, well-annotated datasets. However, real-world data often exhibit long-tailed distributions and label noise, significantly degrading generalization performance. Recent studies addressing these issues have focused on noisy sample selection methods that estimate the centroid of each class based on high-confidence samples within each target class. The performance of these methods is limited because they use only the training samples within each class for class centroid estimation, making the quality of centroids susceptible to long-tailed distributions and noisy labels. In this study, we present a robust training framework called Distribution-aware Sample Selection and Contrastive Learning (DaSC). Specifically, DaSC introduces a Distribution-aware Class Centroid Estimation (DaCC) to generate enhanced class centroids. DaCC performs weighted averaging of the features from all samples, with weights determined based on model predictions. Additionally, we propose a confidence-aware contrastive learning strategy to obtain balanced and robust representations. The training samples are categorized into high-confidence and low-confidence samples. Our method then applies Semi-supervised Balanced Contrastive Loss (SBCL) using high-confidence samples, leveraging reliable label information to mitigate class bias. For the low-confidence samples, our method computes Mixup-enhanced Instance Discrimination Loss (MIDL) to improve their representations in a self-supervised manner. Our experimental results on CIFAR and real-world noisy-label datasets demonstrate the superior performance of the proposed DaSC compared to previous approaches.
Authors: Abhi Kamboj, Anh Duy Nguyen, Minh Do
Abstract: Despite living in a multi-sensory world, most AI models are limited to textual and visual interpretations of human motion and behavior. Inertial measurement units (IMUs) provide a salient signal to understand human motion; however, they are challenging to use due to their uninterpretability and scarcity of their data. We investigate a method to transfer knowledge between visual and inertial modalities using the structure of an informative joint representation space designed for human action recognition (HAR). We apply the resulting Fusion and Cross-modal Transfer (FACT) method to a novel setup, where the model does not have access to labeled IMU data during training and is able to perform HAR with only IMU data during testing. Extensive experiments on a wide range of RGB-IMU datasets demonstrate that FACT significantly outperforms existing methods in zero-shot cross-modal transfer.
Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
Abstract: The application of machine learning (ML) in detecting, diagnosing, and treating mental health disorders is garnering increasing attention. Traditionally, research has focused on single modalities, such as text from clinical notes, audio from speech samples, or video of interaction patterns. Recently, multimodal ML, which combines information from multiple modalities, has demonstrated significant promise in offering novel insights into human behavior patterns and recognizing mental health symptoms and risk factors. Despite its potential, multimodal ML in mental health remains an emerging field, facing several complex challenges before practical applications can be effectively developed. This survey provides a comprehensive overview of the data availability and current state-of-the-art multimodal ML applications for mental health. It discusses key challenges that must be addressed to advance the field. The insights from this survey aim to deepen the understanding of the potential and limitations of multimodal ML in mental health, guiding future research and development in this evolving domain.
Authors: Mikhail Terekhov, Caglar Gulcehre
Abstract: Multi-objective reinforcement learning (MORL) is essential for addressing the intricacies of real-world RL problems, which often require trade-offs between multiple utility functions. However, MORL is challenging due to unstable learning dynamics with deep learning-based function approximators. The research path most taken has been to explore different value-based loss functions for MORL to overcome this issue. Our work empirically explores model-free policy learning loss functions and the impact of different architectural choices. We introduce two different approaches: Multi-objective Proximal Policy Optimization (MOPPO), which extends PPO to MORL, and Multi-objective Advantage Actor Critic (MOA2C), which acts as a simple baseline in our ablations. Our proposed approach is straightforward to implement, requiring only small modifications at the level of function approximator. We conduct comprehensive evaluations on the MORL Deep Sea Treasure, Minecart, and Reacher environments and show that MOPPO effectively captures the Pareto front. Our extensive ablation studies and empirical analyses reveal the impact of different architectural choices, underscoring the robustness and versatility of MOPPO compared to popular MORL approaches like Pareto Conditioned Networks (PCN) and Envelope Q-learning in terms of MORL metrics, including hypervolume and expected utility.
Authors: Yuheng Wang, Tianze Yu, Jiayue Cai, Sunil Kalia, Harvey Lui, Z. Jane Wang, Tim K. Lee
Abstract: The 7-point checklist (7PCL) is widely used in dermoscopy to identify malignant melanoma lesions needing urgent medical attention. It assigns point values to seven attributes: major attributes are worth two points each, and minor ones are worth one point each. A total score of three or higher prompts further evaluation, often including a biopsy. However, a significant limitation of current methods is the uniform weighting of attributes, which leads to imprecision and neglects their interconnections. Previous deep learning studies have treated the prediction of each attribute with the same importance as predicting melanoma, which fails to recognize the clinical significance of the attributes for melanoma. To address these limitations, we introduce a novel diagnostic method that integrates two innovative elements: a Clinical Knowledge-Based Topological Graph (CKTG) and a Gradient Diagnostic Strategy with Data-Driven Weighting Standards (GD-DDW). The CKTG integrates 7PCL attributes with diagnostic information, revealing both internal and external associations. By employing adaptive receptive domains and weighted edges, we establish connections among melanoma's relevant features. Concurrently, GD-DDW emulates dermatologists' diagnostic processes, who first observe the visual characteristics associated with melanoma and then make predictions. Our model uses two imaging modalities for the same lesion, ensuring comprehensive feature acquisition. Our method shows outstanding performance in predicting malignant melanoma and its features, achieving an average AUC value of 85%. This was validated on the EDRA dataset, the largest publicly available dataset for the 7-point checklist algorithm. Specifically, the integrated weighting system can provide clinicians with valuable data-driven benchmarks for their evaluations.
Authors: Timo Wilm, Philipp Normann, Felix Stepprath
Abstract: This work introduces MultiTRON, an approach that adapts Pareto front approximation techniques to multi-objective session-based recommender systems using a transformer neural network. Our approach optimizes trade-offs between key metrics such as click-through and conversion rates by training on sampled preference vectors. A significant advantage is that after training, a single model can access the entire Pareto front, allowing it to be tailored to meet the specific requirements of different stakeholders by adjusting an additional input vector that weights the objectives. We validate the model's performance through extensive offline and online evaluation. For broader application and research, the source code is made available at https://github.com/otto-de/MultiTRON . The results confirm the model's ability to manage multiple recommendation objectives effectively, offering a flexible tool for diverse business needs.
Authors: Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky
Abstract: Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and LC across various public datasets using three latest LLMs. Results reveal that when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage. Based on this observation, we propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. Self-Route significantly reduces the computation cost while maintaining a comparable performance to LC. Our findings provide a guideline for long-context applications of LLMs using RAG and LC.
Authors: Harish Neelam
Abstract: This paper presents a multilevel hierarchical framework for the classification of weather conditions and hazard prediction. In recent years, the importance of data has grown significantly, with various types like text, numbers, images, audio, and videos playing a key role. Among these, images make up a large portion of the data available. This application shows promise for various purposes, especially when combined with decision support systems for traffic management, afforestation, and weather forecasting. It's particularly useful in situations where traditional weather predictions are not very accurate, such as ensuring the safe operation of self driving cars in dangerous weather. While previous studies have looked at this topic with fewer categories, this paper focuses on eleven specific types of weather images. The goal is to create a model that can accurately predict weather conditions after being trained on a large dataset of images. Accuracy is crucial in real-life situations to prevent accidents, making it the top priority for this paper. This work lays the groundwork for future applications in weather prediction, especially in situations where human expertise is not available or may be biased. The framework, capable of classifying images into eleven weather categories: dew, frost, glaze, rime, snow, hail, rain, lightning, rainbow, and sandstorm, provides real-time weather information with an accuracy of 0.9329. The proposed framework addresses the growing need for accurate weather classification and hazard prediction, offering a robust solution for various applications in the field.
Authors: Jihyung Kil, Zheda Mai, Justin Lee, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, Arpita Chowdhury, Wei-Lun Chao
Abstract: The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and CLIP similarity scores. These image pairs span a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance. We use CompBench to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6. Our results reveal notable shortcomings in their comparative abilities. We believe CompBench not only sheds light on these limitations but also establishes a solid foundation for future enhancements in the comparative capability of MLLMs.
Authors: Pai Zhu, Dhruuv Agarwal, Jacob W. Bartel, Kurt Partridge, Hyun Jin Park, Quan Wang
Abstract: One of the challenges in developing a high quality custom keyword spotting (KWS) model is the lengthy and expensive process of collecting training data covering a wide range of languages, phrases and speaking styles. We introduce Synth4Kws - a framework to leverage Text to Speech (TTS) synthesized data for custom KWS in different resource settings. With no real data, we found increasing TTS phrase diversity and utterance sampling monotonically improves model performance, as evaluated by EER and AUC metrics over 11k utterances of the speech command dataset. In low resource settings, with 50k real utterances as a baseline, we found using optimal amounts of TTS data can improve EER by 30.1% and AUC by 46.7%. Furthermore, we mix TTS data with varying amounts of real data and interpolate the real data needed to achieve various quality targets. Our experiments are based on English and single word utterances but the findings generalize to i18n languages and other keyword types.
Authors: Zhixiang Shen, Haolan He, Zhao Kang
Abstract: Multi-relational graph clustering has demonstrated remarkable success in uncovering underlying patterns in complex networks. Representative methods manage to align different views motivated by advances in contrastive learning. Our empirical study finds the pervasive presence of imbalance in real-world graphs, which is in principle contradictory to the motivation of alignment. In this paper, we first propose a novel metric, the Aggregation Class Distance, to empirically quantify structural disparities among different graphs. To address the challenge of view imbalance, we propose Balanced Multi-Relational Graph Clustering (BMGC), comprising unsupervised dominant view mining and dual signals guided representation learning. It dynamically mines the dominant view throughout the training process, synergistically improving clustering performance with representation learning. Theoretical analysis ensures the effectiveness of dominant view mining. Extensive experiments and in-depth analysis on real-world and synthetic datasets showcase that BMGC achieves state-of-the-art performance, underscoring its superiority in addressing the view imbalance inherent in multi-relational graphs. The source code and datasets are available at https://github.com/zxlearningdeep/BMGC.
Authors: Jianing Xia (Deakin University, Australia), Man Li (Macquarie University, Australia), Jianxin Li (Deakin University, Australia)
Abstract: This research investigates the curriculum structures of undergraduate Artificial Intelligence (AI) education across universities worldwide. By examining the curricula of leading universities, the research seeks to contribute to a deeper understanding of AI education on a global scale, facilitating the alignment of educational practices with the evolving needs of the AI landscape. This research delves into the diverse course structures of leading universities, exploring contemporary trends and priorities to reveal the nuanced approaches in AI education. It also investigates the core AI topics and learning contents frequently taught, comparing them with the CS2023 curriculum guidance to identify convergence and divergence. Additionally, it examines how universities across different countries approach AI education, analyzing educational objectives, priorities, potential careers, and methodologies to understand the global landscape and implications of AI pedagogy.
Authors: Nitisha Jain, Mubashara Akhtar, Joan Giner-Miguelez, Rajat Shinde, Joaquin Vanschoren, Steffen Vogler, Sujata Goswami, Yuhan Rao, Tim Santos, Luis Oala, Michalis Karamousadakis, Manil Maskey, Pierre Marcenac, Costanza Conforti, Michael Kuchnik, Lora Aroyo, Omar Benjelloun, Elena Simperl
Abstract: Data is critical to advancing AI technologies, yet its quality and documentation remain significant challenges, leading to adverse downstream effects (e.g., potential biases) in AI applications. This paper addresses these issues by introducing Croissant-RAI, a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets. Croissant-RAI extends the Croissant metadata format and builds upon existing responsible AI (RAI) documentation frameworks, offering a standardized set of attributes and practices to facilitate community-wide adoption. Leveraging established web-publishing practices, such as Schema.org, Croissant-RAI enables dataset users to easily find and utilize RAI metadata regardless of the platform on which the datasets are published. Furthermore, it is seamlessly integrated into major data search engines, repositories, and machine learning frameworks, streamlining the reading and writing of responsible AI metadata within practitioners' existing workflows. Croissant-RAI was developed through a community-led effort. It has been designed to be adaptable to evolving documentation requirements and is supported by a Python library and a visual editor.
Authors: Pooja Thakar, Anil Mehta, Manisha
Abstract: Educational Data Mining (EDM) is a promising field, where data mining is widely used for predicting students performance. One of the most prevalent and recent challenge that higher education faces today is making students skillfully employable. Institutions possess large volume of data; still they are unable to reveal knowledge and guide their students. Data in education is generally very large, multidimensional and unbalanced in nature. Process of extracting knowledge from such data has its own set of problems and is a very complicated task. In this paper, Engineering and MCA (Masters in Computer Applications) students data is collected from various universities and institutes pan India. The dataset is large, unbalanced and multidimensional in nature. A cluster based model is presented in this paper, which, when applied at preprocessing stage helps in parsimonious selection of variables and improves the performance of predictive algorithms. Hence, facilitate in better prediction of Students Employability.
Authors: Pontus Strimling, Joel Krueger, Simon Karlsson
Abstract: Prior research demonstrates that Open AI's GPT models can predict variations in moral opinions between countries but that the accuracy tends to be substantially higher among high-income countries compared to low-income ones. This study aims to replicate previous findings and advance the research by examining how accuracy varies with different types of moral questions. Using responses from the World Value Survey and the European Value Study, covering 18 moral issues across 63 countries, we calculated country-level mean scores for each moral issue and compared them with GPT-4's predictions. Confirming previous findings, our results show that GPT-4 has greater predictive success in high-income than in low-income countries. However, our factor analysis reveals that GPT-4 bases its predictions primarily on a single dimension, presumably reflecting countries' degree of conservatism/liberalism. Conversely, the real-world moral landscape appears to be two-dimensional, differentiating between personal-sexual and violent-dishonest issues. When moral issues are categorized based on their moral domain, GPT-4's predictions are found to be remarkably accurate in the personal-sexual domain, across both high-income (r = .77) and low-income (r = .58) countries. Yet the predictive accuracy significantly drops in the violent-dishonest domain for both high-income (r = .30) and low-income (r = -.16) countries, indicating that GPT-4's one-dimensional world-view does not fully capture the complexity of the moral landscape. In sum, this study underscores the importance of not only considering country-specific characteristics to understand GPT-4's moral understanding, but also the characteristics of the moral issues at hand.
Authors: Sel\c{c}uk K{\i}l{\i}n\c{c}
Abstract: The integration of generative artificial intelligence (GenAI) tools into education has been a game-changer for teaching and assessment practices, bringing new opportunities, but also novel challenges which need to be dealt with. This paper presents the Comprehensive AI Assessment Framework (CAIAF), an evolved version of the AI Assessment Scale (AIAS) by Perkins, Furze, Roe, and MacVaugh, targeted toward the ethical integration of AI into educational assessments. This is where the CAIAF differs, as it incorporates stringent ethical guidelines, with clear distinctions based on educational levels, and advanced AI capabilities of real-time interactions and personalized assistance. The framework developed herein has a very intuitive use, mainly through the use of a color gradient that enhances the user-friendliness of the framework. Methodologically, the framework has been developed through the huge support of a thorough literature review and practical insight into the topic, becoming a dynamic tool to be used in different educational settings. The framework will ensure better learning outcomes, uphold academic integrity, and promote responsible use of AI, hence the need for this framework in modern educational practice.
Authors: Akshat Dubey, Zewen Yang, Georges Hattab
Abstract: The burgeoning field of artificial intelligence (AI) has yet to fully permeate real-world applications, largely due to issues of trust, transparency, and concerns about fairness and discrimination. Despite the increasing need for new and revised regulations to address the ethical and legal risks of using AI, there is a mismatch between regulatory science and AI, hindering the creation of a consistent framework. This highlights the need for guidance and regulation, especially as new AI legislation emerges. To bridge this gap, we propose a five-layer nested model for AI design and validation. This model is designed to address the challenges faced by AI practitioners and streamline the design and validation of AI applications and workflows, thereby improving fairness, trust, and AI adoption. This model not only parallels regulations and addresses the daily challenges faced by AI practitioners, but also provides prescriptive guidance for determining appropriate evaluation approaches by identifying threats to validity unique to each layer. We also provide three recommendations motivated by this model: authors should distinguish between layers when claiming contributions to clarify the specific areas in which the contribution is made and to avoid confusion, authors should explicitly state upstream assumptions to ensure that the context and limitations of their AI system are clearly understood, AI venues should promote thorough testing and validation of AI systems and their compliance with regulatory requirements
Authors: Massimo Passamonti
Abstract: In this essay, I argue that explicit ethical machines, whose moral principles are inferred through a bottom-up approach, are unable to replicate human-like moral reasoning and cannot be considered moral agents. By utilizing Alan Turing's theory of computation, I demonstrate that moral reasoning is computationally intractable by these machines due to the halting problem. I address the frontiers of machine ethics by formalizing moral problems into 'algorithmic moral questions' and by exploring moral psychology's dual-process model. While the nature of Turing Machines theoretically allows artificial agents to engage in recursive moral reasoning, critical limitations are introduced by the halting problem, which states that it is impossible to predict with certainty whether a computational process will halt. A thought experiment involving a military drone illustrates this issue, showing that an artificial agent might fail to decide between actions due to the halting problem, which limits the agent's ability to make decisions in all instances, undermining its moral agency.
Authors: Erik Johannes Husom, Arda Goknil, Lwin Khin Shar, Sagar Sen
Abstract: In the rapidly evolving realm of artificial intelligence, deploying large language models (LLMs) poses increasingly pressing computational and environmental challenges. This paper introduces MELODI - Monitoring Energy Levels and Optimization for Data-driven Inference - a multifaceted framework crafted to monitor and analyze the energy consumed during LLM inference processes. MELODI enables detailed observations of power consumption dynamics and facilitates the creation of a comprehensive dataset reflective of energy efficiency across varied deployment scenarios. The dataset, generated using MELODI, encompasses a broad spectrum of LLM deployment frameworks, multiple language models, and extensive prompt datasets, enabling a comparative analysis of energy use. Using the dataset, we investigate how prompt attributes, including length and complexity, correlate with energy expenditure. Our findings indicate substantial disparities in energy efficiency, suggesting ample scope for optimization and adoption of sustainable measures in LLM deployment. Our contribution lies not only in the MELODI framework but also in the novel dataset, a resource that can be expanded by other researchers. Thus, MELODI is a foundational tool and dataset for advancing research into energy-conscious LLM deployment, steering the field toward a more sustainable future.
Authors: Wim Vanderbauwhede
Abstract: AI-generated answers to conventional search queries dramatically increase the energy consumption. By our estimates, energy demand increase by 60-70 times. This is a based on an updated estimate of energy consumption for conventional search and recent work on the energy demand of queries to the BLOOM model, a 176B parameter model, and OpenAI's ChatGPT, which is of similar complexity.
Authors: Jennifer Chien, A. Stevie Bergman, Kevin R. McKee, Nenad Tomasev, Vinodkumar Prabhakaran, Rida Qadri, Nahema Marchal, William Isaac
Abstract: Algorithmic fairness has emerged as a critical concern in artificial intelligence (AI) research. However, the development of fair AI systems is not an objective process. Fairness is an inherently subjective concept, shaped by the values, experiences, and identities of those involved in research and development. To better understand the norms and values embedded in current fairness research, we conduct a meta-analysis of algorithmic fairness papers from two leading conferences on AI fairness and ethics, AIES and FAccT, covering a final sample of 139 papers over the period from 2018 to 2022. Our investigation reveals two concerning trends: first, a US-centric perspective dominates throughout fairness research; and second, fairness studies exhibit a widespread reliance on binary codifications of human identity (e.g., "Black/White", "male/female"). These findings highlight how current research often overlooks the complexities of identity and lived experiences, ultimately failing to represent diverse global contexts when defining algorithmic bias and fairness. We discuss the limitations of these research design choices and offer recommendations for fostering more inclusive and representative approaches to fairness in AI systems, urging a paradigm shift that embraces nuanced, global understandings of human identity and values.
Authors: Daniel Hopp
Abstract: Generative artificial intelligence (AI), and in particular Large Language Models (LLMs), have exploded in popularity and attention since the release to the public of ChatGPT's Generative Pre-trained Transformer (GPT)-3.5 model in November of 2022. Due to the power of these general purpose models and their ability to communicate in natural language, they can be useful in a range of domains, including the work of official statistics and international organizations. However, with such a novel and seemingly complex technology, it can feel as if generative AI is something that happens to an organization, something that can be talked about but not understood, that can be commented on but not contributed to. Additionally, the costs of adoption and operation of proprietary solutions can be both uncertain and high, a barrier for often cost-constrained international organizations. In the face of these challenges, United Nations Trade and Development (UNCTAD), through its Global Crisis Response Group (GCRG), has explored and developed its own open-source Retrieval Augmented Generation (RAG) LLM application. RAG makes LLMs aware of and more useful for the organization's domain and work. Developing in-house solutions comes with pros and cons, with pros including cost, flexibility, and fostering institutional knowledge. Cons include time and skill investments and gaps and application polish and power. The three libraries developed to produce the app, nlp_pipeline for document processing and statistical analysis, local_rag_llm for running a local RAG LLM, and streamlit_rag for the user interface, are publicly available on PyPI and GitHub with Dockerfiles. A fourth library, local_llm_finetune, is also available for fine-tuning existing LLMs which can then be used in the application.
Authors: Kevin Wu, Eric Wu, Kit Rodolfa, Daniel E. Ho, James Zou
Abstract: While the pace of development of AI has rapidly progressed in recent years, the implementation of safe and effective regulatory frameworks has lagged behind. In particular, the adaptive nature of AI models presents unique challenges to regulators as updating a model can improve its performance but also introduce safety risks. In the US, the Food and Drug Administration (FDA) has been a forerunner in regulating and approving hundreds of AI medical devices. To better understand how AI is updated and its regulatory considerations, we systematically analyze the frequency and nature of updates in FDA-approved AI medical devices. We find that less than 2% of all devices report having been updated by being re-trained on new data. Meanwhile, nearly a quarter of devices report updates in the form of new functionality and marketing claims. As an illustrative case study, we analyze pneumothorax detection models and find that while model performance can degrade by as much as 0.18 AUC when evaluated on new sites, re-training on site-specific data can mitigate this performance drop, recovering up to 0.23 AUC. However, we also observed significant degradation on the original site after re-training using data from new sites, providing insight from one example that challenges the current one-model-fits-all approach to regulatory approvals. Our analysis provides an in-depth look at the current state of FDA-approved AI device updates and insights for future regulatory policies toward model updating and adaptive AI.
Authors: Laleh Jalilian, Daniel McDuff, Achuta Kadambi
Abstract: Generative artificial intelligence (GenAI) has the potential to improve healthcare through automation that enhances the quality and safety of patient care. Powered by foundation models that have been pretrained and can generate complex content, GenAI represents a paradigm shift away from the more traditional focus on task-specific classifiers that have dominated the AI landscape thus far. We posit that the imminent application of GenAI in healthcare will be through well-defined, low risk, high value, and narrow applications that automate healthcare workflows at the point of care using smaller foundation models. These models will be finetuned for different capabilities and application specific scenarios and will have the ability to provide medical explanations, reference evidence within a retrieval augmented framework and utilizing external tools. We contrast this with a general, all-purpose AI model for end-to-end clinical decision making that improves clinician performance, including safety-critical diagnostic tasks, which will require greater research prior to implementation. We consider areas where 'human in the loop' Generative AI can improve healthcare quality and safety by automating mundane tasks. Using the principles of implementation science will be critical for integrating 'end to end' GenAI systems that will be accepted by healthcare teams.
Authors: Akash Wasil, Tim Durgin
Abstract: The United States and China will play an important role in navigating safety and security challenges relating to advanced artificial intelligence. We sought to better understand how experts in each country describe safety and security threats from advanced artificial intelligence, extreme risks from AI, and the potential for international cooperation. Specifically, we compiled publicly-available statements from major technical and policy leaders in both the United States and China. We focused our analysis on advanced forms of artificial intelligence, such as artificial general intelligence (AGI), that may have the most significant impacts on national and global security. Experts in both countries expressed concern about risks from AGI, risks from intelligence explosions, and risks from AI systems that escape human control. Both countries have also launched early efforts designed to promote international cooperation around safety standards and risk management practices. Notably, our findings only reflect information from publicly available sources. Nonetheless, our findings can inform policymakers and researchers about the state of AI discourse in the US and China. We hope such work can contribute to policy discussions around advanced AI, its global security threats, and potential international dialogues or agreements to mitigate such threats.
Authors: Chanseo Lee, Kimon-Aristotelis Vogt, Sonu Kumar
Abstract: Effective summarization of unstructured patient data in electronic health records (EHRs) is crucial for accurate diagnosis and efficient patient care, yet clinicians often struggle with information overload and time constraints. This review dives into recent literature and case studies on both the significant impacts and outstanding issues of patient chart review on communications, diagnostics, and management. It also discusses recent efforts to integrate artificial intelligence (AI) into clinical summarization tasks, and its transformative impact on the clinician's potential, including but not limited to reductions of administrative burden and improved patient-centered care.
Authors: Georgios Kollias, Payel Das, Subhajit Chaudhury
Abstract: Addressing the issue of hallucinations in large language models (LLMs) is a critical challenge. As the cognitive mechanisms of hallucination have been related to memory, here we explore hallucination for LLM that is enabled with explicit memory mechanisms. We empirically demonstrate that by simply scaling the readout vector that constrains generation in a memory-augmented LLM decoder, hallucination mitigation can be achieved in a training-free manner. Our method is geometry-inspired and outperforms a state-of-the-art LLM editing method on the task of generation of Wikipedia-like biography entries both in terms of generation quality and runtime complexity.
Authors: Georgi Ganev
Abstract: In this paper, we argue that similarity-based privacy metrics cannot ensure regulatory compliance of synthetic data. Our analysis and counter-examples show that they do not protect against singling out and linkability and, among other fundamental issues, completely ignore the motivated intruder test.
Authors: Jesse Merhi, Erik Buchholz, Salil S. Kanhere
Abstract: Location trajectories provide valuable insights for applications from urban planning to pandemic control. However, mobility data can also reveal sensitive information about individuals, such as political opinions, religious beliefs, or sexual orientations. Existing privacy-preserving approaches for publishing this data face a significant utility-privacy trade-off. Releasing synthetic trajectory data generated through deep learning offers a promising solution. Due to the trajectories' sequential nature, most existing models are based on recurrent neural networks (RNNs). However, research in generative adversarial networks (GANs) largely employs convolutional neural networks (CNNs) for image generation. This discrepancy raises the question of whether advances in computer vision can be applied to trajectory generation. In this work, we introduce a Reversible Trajectory-to-CNN Transformation (RTCT) that adapts trajectories into a format suitable for CNN-based models. We integrated this transformation with the well-known DCGAN in a proof-of-concept (PoC) and evaluated its performance against an RNN-based trajectory GAN using four metrics across two datasets. The PoC was superior in capturing spatial distributions compared to the RNN model but had difficulty replicating sequential and temporal properties. Although the PoC's utility is not sufficient for practical applications, the results demonstrate the transformation's potential to facilitate the use of CNNs for trajectory generation, opening up avenues for future research. To support continued research, all source code has been made available under an open-source license.
Authors: Jaewoong Choi, Janghyeok Yoon, Changyong Lee
Abstract: Despite the usefulness of machine learning approaches for the early screening of potential breakthrough technologies, their practicality is often hindered by opaque models. To address this, we propose an interpretable machine learning approach to predicting future citation counts from patent texts using a patent-specific hierarchical attention network (PatentHAN) model. Central to this approach are (1) a patent-specific pre-trained language model, capturing the meanings of technical words in patent claims, (2) a hierarchical network structure, enabling detailed analysis at the claim level, and (3) a claim-wise self-attention mechanism, revealing pivotal claims during the screening process. A case study of 35,376 pharmaceutical patents demonstrates the effectiveness of our approach in early screening of potential breakthrough technologies while ensuring interpretability. Furthermore, we conduct additional analyses using different language models and claim types to examine the robustness of the approach. It is expected that the proposed approach will enhance expert-machine collaboration in identifying breakthrough technologies, providing new insight derived from text mining into technological value.
Authors: Jingze Shi, Lu He, Yuhan Wang, Tianyu He, Bingheng Wu, Mingkun Hou
Abstract: Recent studies have shown that, relative position encoding performs well in selective state space model scanning algorithms, and the architecture that balances SSM and Attention enhances the efficiency and effectiveness of the algorithm, while the sparse activation of the mixture of experts reduces the training cost. I studied the effectiveness of using different position encodings in structured state space dual algorithms, and the more effective SSD-Attn internal and external function mixing method, and designed a more efficient cross domain mixture of experts. I found that the same matrix is very wonderful in different algorithms, which allows us to establish a new hybrid sparse architecture: Cheems. Compared with other hybrid architectures, it is more efficient and more effective in language modeling tasks.
Authors: Derek Fox, Samuel Hernandez, Qianqian Tong
Abstract: Stochastic optimization algorithms are widely used for large-scale data analysis due to their low per-iteration costs, but they often suffer from slow asymptotic convergence caused by inherent variance. Variance-reduced techniques have been therefore used to address this issue in structured sparse models utilizing sparsity-inducing norms or $\ell_0$-norms. However, these techniques are not directly applicable to complex (non-convex) graph sparsity models, which are essential in applications like disease outbreak monitoring and social network analysis. In this paper, we introduce two stochastic variance-reduced gradient-based methods to solve graph sparsity optimization: GraphSVRG-IHT and GraphSCSG-IHT. We provide a general framework for theoretical analysis, demonstrating that our methods enjoy a linear convergence speed. Extensive experiments validate
Authors: Sa\"uc Abadal Lloret, Shehzaad Dhuliawala, Keerthiram Murugesan, Mrinmaya Sachan
Abstract: We present ALT (ALignment with Textual feedback), an approach that aligns language models with user preferences expressed in text. We argue that text offers greater expressiveness, enabling users to provide richer feedback than simple comparative preferences and this richer feedback can lead to more efficient and effective alignment. ALT aligns the model by conditioning its generation on the textual feedback. Our method relies solely on language modeling techniques and requires minimal hyper-parameter tuning, though it still presents the main benefits of RL-based alignment algorithms and can effectively learn from textual feedback. We explore the efficacy and efficiency of textual feedback across different tasks such as toxicity reduction, summarization, and dialog response generation. We find that ALT outperforms PPO for the task of toxicity reduction while being able to match its performance on summarization with only 20% of the samples. We also explore how ALT can be used with feedback provided by an existing LLM where we explore an LLM providing constrained and unconstrained textual feedback. We also outline future directions to align models with natural language feedback.
Authors: Ziwei Zhao, David Leake, Xiaomeng Ye, David Crandall
Abstract: This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases, to illustrate aspects of similarity relevant to those cases.
Authors: Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji
Abstract: This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, an exquisite synthetic dataset by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and object descriptions. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from only text. Extensive experiments demonstrate that Diffree excels in adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.
Authors: Changchang Yin, Pin-Yu Chen, Bingsheng Yao, Dakuo Wang, Jeffrey Caterino, Ping Zhang
Abstract: Sepsis is the leading cause of in-hospital mortality in the USA. Early sepsis onset prediction and diagnosis could significantly improve the survival of sepsis patients. Existing predictive models are usually trained on high-quality data with few missing information, while missing values widely exist in real-world clinical scenarios (especially in the first hours of admissions to the hospital), which causes a significant decrease in accuracy and an increase in uncertainty for the predictive models. The common method to handle missing values is imputation, which replaces the unavailable variables with estimates from the observed data. The uncertainty of imputation results can be propagated to the sepsis prediction outputs, which have not been studied in existing works on either sepsis prediction or uncertainty quantification. In this study, we first define such propagated uncertainty as the variance of prediction output and then introduce uncertainty propagation methods to quantify the propagated uncertainty. Moreover, for the potential high-risk patients with low confidence due to limited observations, we propose a robust active sensing algorithm to increase confidence by actively recommending clinicians to observe the most informative variables. We validate the proposed models in both publicly available data (i.e., MIMIC-III and AmsterdamUMCdb) and proprietary data in The Ohio State University Wexner Medical Center (OSUWMC). The experimental results show that the propagated uncertainty is dominant at the beginning of admissions to hospitals and the proposed algorithm outperforms state-of-the-art active sensing methods. Finally, we implement a SepsisLab system for early sepsis prediction and active sensing based on our pre-trained models. Clinicians and potential sepsis patients can benefit from the system in early prediction and diagnosis of sepsis.
Authors: Yoonseok Yang, Jack Liu, J. D. Zamfirescu-Pereira, John DeNero
Abstract: Small-group tutoring in Computer Science (CS) is effective, but presents the challenge of providing a dedicated tutor for each group and encouraging collaboration among group members at scale. We present Pensieve Discuss, a software platform that integrates synchronous editing for scaffolded programming problems with online human and AI tutors, designed to improve student collaboration and experience during group tutoring sessions. Our semester-long deployment to 800 students in a CS1 course demonstrated consistently high collaboration rates, positive feedback about the AI tutor's helpfulness and correctness, increased satisfaction with the group tutoring experience, and a substantial increase in question volume. The use of our system was preferred over an interface lacking AI tutors and synchronous editing capabilities. Our experiences suggest that small-group tutoring sessions are an important avenue for future research in educational AI.
Authors: Sara Vera Marjanovi\'c, Haeun Yu, Pepa Atanasova, Maria Maistro, Christina Lioma, Isabelle Augenstein
Abstract: Knowledge-intensive language understanding tasks require Language Models (LMs) to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context as it can conflict with the pre-existing LM's memory learned during pre-training. Moreover, conflicting knowledge can already be present in the LM's parameters, termed intra-memory conflict. Existing works have studied the two types of knowledge conflicts only in isolation. We conjecture that the (degree of) intra-memory conflicts can in turn affect LM's handling of context-memory conflicts. To study this, we introduce the DYNAMICQA dataset, which includes facts with a temporal dynamic nature where a fact can change with a varying time frequency and disputable dynamic facts, which can change depending on the viewpoint. DYNAMICQA is the first to include real-world knowledge conflicts and provide context to study the link between the different types of knowledge conflicts. With the proposed dataset, we assess the use of uncertainty for measuring the intra-memory conflict and introduce a novel Coherent Persuasion (CP) score to evaluate the context's ability to sway LM's semantic output. Our extensive experiments reveal that static facts, which are unlikely to change, are more easily updated with additional context, relative to temporal and disputable facts.
Authors: Qinfeng Zhu, Ningxin Weng, Lei Fan, Yuanzhi Cai
Abstract: Environmental monitoring of lakeside green areas is crucial for environmental protection. Compared to manual inspections, computer vision technologies offer a more efficient solution when deployed on-site. Multispectral imaging provides diverse information about objects under different spectrums, aiding in the differentiation between waste and lakeside lawn environments. This study introduces WasteMS, the first multispectral dataset established for the semantic segmentation of lakeside waste. WasteMS includes a diverse range of waste types in lawn environments, captured under various lighting conditions. We implemented a rigorous annotation process to label waste in images. Representative semantic segmentation frameworks were used to evaluate segmentation accuracy using WasteMS. Challenges encountered when using WasteMS for segmenting waste on lakeside lawns were discussed. The WasteMS dataset is available at https://github.com/zhuqinfeng1999/WasteMS.
Authors: Jian Xu, Delu Zeng, John Paisley
Abstract: Deep Gaussian processes (DGPs) provide a robust paradigm for Bayesian deep learning. In DGPs, a set of sparse integration locations called inducing points are selected to approximate the posterior distribution of the model. This is done to reduce computational complexity and improve model efficiency. However, inferring the posterior distribution of inducing points is not straightforward. Traditional variational inference approaches to posterior approximation often lead to significant bias. To address this issue, we propose an alternative method called Denoising Diffusion Variational Inference (DDVI) that uses a denoising diffusion stochastic differential equation (SDE) to generate posterior samples of inducing variables. We rely on score matching methods for denoising diffusion model to approximate score functions with a neural network. Furthermore, by combining classical mathematical theory of SDEs with the minimization of KL divergence between the approximate and true processes, we propose a novel explicit variational lower bound for the marginal likelihood function of DGP. Through experiments on various datasets and comparisons with baseline methods, we empirically demonstrate the effectiveness of DDVI for posterior inference of inducing points for DGP models.
Authors: Chanyoung Jung, Yun Jang
Abstract: Researchers have been persistently working to address the issue of missing values in time series data. Numerous models have been proposed, striving to estimate the distribution of the data. The Radial Basis Functions Neural Network (RBFNN) has recently exhibited exceptional performance in estimating data distribution. In this paper, we propose a time series imputation model based on RBFNN. Our imputation model learns local information from timestamps to create a continuous function. Additionally, we incorporate time gaps to facilitate learning information considering the missing terms of missing values. We name this model the Missing Imputation Multivariate RBFNN (MIM-RBFNN). However, MIM-RBFNN relies on a local information-based learning approach, which presents difficulties in utilizing temporal information. Therefore, we propose an extension called the Missing Value Imputation Recurrent Neural Network with Continuous Function (MIRNN-CF) using the continuous function generated by MIM-RBFNN. We evaluate the performance using two real-world datasets with non-random missing and random missing patterns, and conduct an ablation study comparing MIM-RBFNN and MIRNN-CF.
Authors: Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye
Abstract: In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer to propose a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on the representation and discrimination capability with token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained by the losses based on semantic tokens of the large model, which introduce enhanced generalization capability for various data and tasks. {Experimental results demonstrate that the proposed framework could efficiently achieve much better rate-accuracy performance compared to the state-of-the-art coding standard, Versatile Video Coding.} Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.
Authors: Kaixuan Li, Jian Zhang, Sen Chen, Han Liu, Yang Liu, Yixiang Chen
Abstract: Open-source software (OSS) vulnerabilities are increasingly prevalent, emphasizing the importance of security patches. However, in widely used security platforms like NVD, a substantial number of CVE records still lack trace links to patches. Although rank-based approaches have been proposed for security patch tracing, they heavily rely on handcrafted features in a single-step framework, which limits their effectiveness. In this paper, we propose PatchFinder, a two-phase framework with end-to-end correlation learning for better-tracing security patches. In the **initial retrieval** phase, we employ a hybrid patch retriever to account for both lexical and semantic matching based on the code changes and the description of a CVE, to narrow down the search space by extracting those commits as candidates that are similar to the CVE descriptions. Afterwards, in the **re-ranking** phase, we design an end-to-end architecture under the supervised fine-tuning paradigm for learning the semantic correlations between CVE descriptions and commits. In this way, we can automatically rank the candidates based on their correlation scores while maintaining low computation overhead. We evaluated our system against 4,789 CVEs from 532 OSS projects. The results are highly promising: PatchFinder achieves a Recall@10 of 80.63% and a Mean Reciprocal Rank (MRR) of 0.7951. Moreover, the Manual Effort@10 required is curtailed to 2.77, marking a 1.94 times improvement over current leading methods. When applying PatchFinder in practice, we initially identified 533 patch commits and submitted them to the official, 482 of which have been confirmed by CVE Numbering Authorities.
Authors: Ziyue Chen, Tongya Zheng, Mingli Song
Abstract: Temporal networks are effective in capturing the evolving interactions of networks over time, such as social networks and e-commerce networks. In recent years, researchers have primarily concentrated on developing specific model architectures for Temporal Graph Neural Networks (TGNNs) in order to improve the representation quality of temporal nodes and edges. However, limited attention has been given to the quality of negative samples during the training of TGNNs. When compared with static networks, temporal networks present two specific challenges for negative sampling: positive sparsity and positive shift. Positive sparsity refers to the presence of a single positive sample amidst numerous negative samples at each timestamp, while positive shift relates to the variations in positive samples across different timestamps. To robustly address these challenges in training TGNNs, we introduce Curriculum Negative Mining (CurNM), a model-aware curriculum learning framework that adaptively adjusts the difficulty of negative samples. Within this framework, we first establish a dynamically updated negative pool that balances random, historical, and hard negatives to address the challenges posed by positive sparsity. Secondly, we implement a temporal-aware negative selection module that focuses on learning from the disentangled factors of recently active edges, thus accurately capturing shifting preferences. Extensive experiments on 12 datasets and 3 TGNNs demonstrate that our method outperforms baseline methods by a significant margin. Additionally, thorough ablation studies and parameter sensitivity experiments verify the usefulness and robustness of our approach. Our code is available at https://github.com/zziyue83/CurNM.
Authors: Jinghong Li, Wen Gu, Koichi Ota, Shinobu Hasegawa
Abstract: With the exponential growth in the number of papers and the trend of AI research, the use of Generative AI for information retrieval and question-answering has become popular for conducting research surveys. However, novice researchers unfamiliar with a particular field may not significantly improve their efficiency in interacting with Generative AI because they have not developed divergent thinking in that field. This study aims to develop an in-depth Survey Forest Diagram that guides novice researchers in divergent thinking about the research topic by indicating the citation clues among multiple papers, to help expand the survey perspective for novice researchers.
Authors: Adam Goodge, Bryan Hooi, Wee Siong Ng
Abstract: Contrastive Language-Image Pre-training (CLIP) achieves remarkable performance in various downstream tasks through the alignment of image and text input embeddings and holds great promise for anomaly detection. However, our empirical experiments show that the embeddings of text inputs unexpectedly tightly cluster together, far away from image embeddings, contrary to the model's contrastive training objective to align image-text input pairs. We show that this phenomenon induces a `similarity bias' - in which false negative and false positive errors occur due to bias in the similarities between images and the normal label text embeddings. To address this bias, we propose a novel methodology called BLISS which directly accounts for this similarity bias through the use of an auxiliary, external set of text inputs. BLISS is simple, it does not require strong inductive biases about anomalous behaviour nor an expensive training process, and it significantly outperforms baseline methods on benchmark image datasets, even when access to normal data is extremely limited.
Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman
Abstract: We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced as over), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end time of the repetitions, and also a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a huge variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance assessed with and without using text to specify the target class to count. The performance is also compared to a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/
Authors: Shuyan Huang, Zitao Liu, Xiangyu Zhao, Weiqi Luo, Jian Weng
Abstract: Knowledge tracing (KT) is the problem of predicting students' future performance based on their historical interaction sequences. With the advanced capability of capturing contextual long-term dependency, attention mechanism becomes one of the essential components in many deep learning based KT (DLKT) models. In spite of the impressive performance achieved by these attentional DLKT models, many of them are often vulnerable to run the risk of overfitting, especially on small-scale educational datasets. Therefore, in this paper, we propose \textsc{sparseKT}, a simple yet effective framework to improve the robustness and generalization of the attention based DLKT approaches. Specifically, we incorporate a k-selection module to only pick items with the highest attention scores. We propose two sparsification heuristics : (1) soft-thresholding sparse attention and (2) top-$K$ sparse attention. We show that our \textsc{sparseKT} is able to help attentional KT models get rid of irrelevant student interactions and have comparable predictive performance when compared to 11 state-of-the-art KT models on three publicly available real-world educational datasets. To encourage reproducible research, we make our data and code publicly available at \url{https://github.com/pykt-team/pykt-toolkit}\footnote{We merged our model to the \textsc{pyKT} benchmark at \url{https://pykt.org/}.}.
URLs: https://github.com/pykt-team/pykt-toolkit, https://pykt.org/
Authors: Mu Chen, Zhedong Zheng, Yi Yang
Abstract: Unsupervised domain adaptive segmentation aims to improve the segmentation accuracy of models on target domains without relying on labeled data from those domains. This approach is crucial when labeled target domain data is scarce or unavailable. It seeks to align the feature representations of the source domain (where labeled data is available) and the target domain (where only unlabeled data is present), thus enabling the model to generalize well to the target domain. Current image- and video-level domain adaptation have been addressed using different and specialized frameworks, training strategies and optimizations despite their underlying connections. In this paper, we propose a unified framework PiPa++, which leverages the core idea of ``comparing'' to (1) explicitly encourage learning of discriminative pixel-wise features with intraclass compactness and inter-class separability, (2) promote the robust feature learning of the identical patch against different contexts or fluctuations, and (3) enable the learning of temporal continuity under dynamic environments. With the designed task-smart contrastive sampling strategy, PiPa++ enables the mining of more informative training samples according to the task demand. Extensive experiments demonstrate the effectiveness of our method on both image-level and video-level domain adaption benchmarks. Moreover, the proposed method is compatible with other UDA approaches to further improve the performance without introducing extra parameters.
Authors: Arun Verma, Zhongxiang Dai, Xiaoqiang Lin, Patrick Jaillet, Bryan Kian Hsiang Low
Abstract: Contextual dueling bandit is used to model the bandit problems, where a learner's goal is to find the best arm for a given context using observed noisy preference feedback over the selected arms for the past contexts. However, existing algorithms assume the reward function is linear, which can be complex and non-linear in many real-life applications like online recommendations or ranking web search results. To overcome this challenge, we use a neural network to estimate the reward function using preference feedback for the previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution. Experimental results on the problem instances derived from synthetic datasets corroborate our theoretical results.
Authors: Edward, Mohamed Ragab, Yuecong Xu, Min Wu, Yuecong Xu, Zhenghua Chen, Abdulla Alseiari, Xiaoli Li
Abstract: Unsupervised Domain Adaptation (UDA) has emerged as a key solution in data-driven fault diagnosis, addressing domain shift where models underperform in changing environments. However, under the realm of continually changing environments, UDA tends to underperform on previously seen domains when adapting to new ones - a problem known as catastrophic forgetting. To address this limitation, we introduce the EverAdapt framework, specifically designed for continuous model adaptation in dynamic environments. Central to EverAdapt is a novel Continual Batch Normalization (CBN), which leverages source domain statistics as a reference point to standardize feature representations across domains. EverAdapt not only retains statistical information from previous domains but also adapts effectively to new scenarios. Complementing CBN, we design a class-conditional domain alignment module for effective integration of target domains, and a Sample-efficient Replay strategy to reinforce memory retention. Experiments on real-world datasets demonstrate EverAdapt superiority in maintaining robust fault diagnosis in dynamic environments. Our code is available: https://github.com/mohamedr002/EverAdapt
Authors: Jingren Liu, Zhong Ji, YunLong Yu, Jiale Cao, Yanwei Pang, Jungong Han, Xuelong Li
Abstract: Parameter-efficient fine-tuning for continual learning (PEFT-CL) has shown promise in adapting pre-trained models to sequential tasks while mitigating catastrophic forgetting problem. However, understanding the mechanisms that dictate continual performance in this paradigm remains elusive. To tackle this complexity, we undertake a rigorous analysis of PEFT-CL dynamics to derive relevant metrics for continual scenarios using Neural Tangent Kernel (NTK) theory. With the aid of NTK as a mathematical analysis tool, we recast the challenge of test-time forgetting into the quantifiable generalization gaps during training, identifying three key factors that influence these gaps and the performance of PEFT-CL: training sample size, task-level feature orthogonality, and regularization. To address these challenges, we introduce NTK-CL, a novel framework that eliminates task-specific parameter storage while adaptively generating task-relevant features. Aligning with theoretical guidance, NTK-CL triples the feature representation of each sample, theoretically and empirically reducing the magnitude of both task-interplay and task-specific generalization gaps. Grounded in NTK analysis, our approach imposes an adaptive exponential moving average mechanism and constraints on task-level feature orthogonality, maintaining intra-task NTK forms while attenuating inter-task NTK forms. Ultimately, by fine-tuning optimizable parameters with appropriate regularization, NTK-CL achieves state-of-the-art performance on established PEFT-CL benchmarks. This work provides a theoretical foundation for understanding and improving PEFT-CL models, offering insights into the interplay between feature representation, task orthogonality, and generalization, contributing to the development of more efficient continual learning systems.
Authors: Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding
Abstract: Extracting social determinants of health (SDoH) from unstructured medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. In this study we introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method leveraging contrastive examples and concise instructions to extract SDoH without relying on extensive medical annotations or costly human intervention. It achieved tenfold and twentyfold reductions in time and cost respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of SDoH-GPT and XGBoost leverages the strengths of both, ensuring high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores. Testing across three distinct datasets has confirmed its robustness and accuracy. This study highlights the potential of leveraging LLMs to revolutionize medical note classification, demonstrating their capability to achieve highly accurate classifications with significantly reduced time and cost.
Authors: Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao
Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper poses a particular emphasis on the impact of multi-images on meme captioning. After that, we introduce the \textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, manifest a marked improvement in caption generation for both single-image and multi-image memes, as well as different meme categories. \textsc{XMeCap} achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71\% and 4.82\%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.
Authors: Jan Lehe\v{c}ka, Josef V. Psutka, Lubo\v{s} \v{S}m\'idl, Pavel Ircing, Josef Psutka
Abstract: In this paper, we are comparing monolingual Wav2Vec 2.0 models with various multilingual models to see whether we could improve speech recognition performance on a unique oral history archive containing a lot of mixed-language sentences. Our main goal is to push forward research on this unique dataset, which is an extremely valuable part of our cultural heritage. Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models, even when processing the oral history archive full of mixed-language sentences from non-native speakers. We also performed the same experiments on the public CommonVoice dataset to verify our results. We are contributing to the research community by releasing our pre-trained models to the public.
Authors: Xiaoyu Tan, Bin Li, Xihe Qiu, Jingjing Huang, Yinghui Xu, Wei Chu
Abstract: Integrating deep neural networks with the Hawkes process has significantly improved predictive capabilities in finance, health informatics, and information technology. Nevertheless, these models often face challenges in real-world settings, particularly due to substantial label noise. This issue is of significant concern in the medical field, where label noise can arise from delayed updates in electronic medical records or misdiagnoses, leading to increased prediction risks. Our research indicates that deep Hawkes process models exhibit reduced robustness when dealing with label noise, particularly when it affects both event types and timing. To address these challenges, we first investigate the influence of label noise in approximated intensity functions and present a novel framework, the Robust Deep Hawkes Process (RDHP), to overcome the impact of label noise on the intensity function of Hawkes models, considering both the events and their occurrences. We tested RDHP using multiple open-source benchmarks with synthetic noise and conducted a case study on obstructive sleep apnea-hypopnea syndrome (OSAHS) in a real-world setting with inherent label noise. The results demonstrate that RDHP can effectively perform classification and regression tasks, even in the presence of noise related to events and their timing. To the best of our knowledge, this is the first study to successfully address both event and time label noise in deep Hawkes process models, offering a promising solution for medical applications, specifically in diagnosing OSAHS.
Authors: Saad Lahlali, Nicolas Granger, Herv\'e Le Borgne, Quoc-Cuong Pham
Abstract: 3D object detection plays a crucial role in various applications such as autonomous vehicles, robotics and augmented reality. However, training 3D detectors requires a costly precise annotation, which is a hindrance to scaling annotation to large datasets. To address this challenge, we propose a weakly supervised 3D annotator that relies solely on 2D bounding box annotations from images, along with size priors. One major problem is that supervising a 3D detection model using only 2D boxes is not reliable due to ambiguities between different 3D poses and their identical 2D projection. We introduce a simple yet effective and generic solution: we build 3D proxy objects with annotations by construction and add them to the training dataset. Our method requires only size priors to adapt to new classes. To better align 2D supervision with 3D detection, our method ensures depth invariance with a novel expression of the 2D losses. Finally, to detect more challenging instances, our annotator follows an offline pseudo-labelling scheme which gradually improves its 3D pseudo-labels. Extensive experiments on the KITTI dataset demonstrate that our method not only performs on-par or above previous works on the Car category, but also achieves performance close to fully supervised methods on more challenging classes. We further demonstrate the effectiveness and robustness of our method by being the first to experiment on the more challenging nuScenes dataset. We additionally propose a setting where weak labels are obtained from a 2D detector pre-trained on MS-COCO instead of human annotations.
Authors: Jonathan Pirnay, Dominik G. Grimm
Abstract: The constructive approach within Neural Combinatorial Optimization (NCO) treats a combinatorial optimization problem as a finite Markov decision process, where solutions are built incrementally through a sequence of decisions guided by a neural policy network. To train the policy, recent research is shifting toward a 'self-improved' learning methodology that addresses the limitations of reinforcement learning and supervised approaches. Here, the policy is iteratively trained in a supervised manner, with solutions derived from the current policy serving as pseudo-labels. The way these solutions are obtained from the policy determines the quality of the pseudo-labels. In this paper, we present a simple and problem-independent sequence decoding method for self-improved learning based on sampling sequences without replacement. We incrementally follow the best solution found and repeat the sampling process from intermediate partial solutions. By modifying the policy to ignore previously sampled sequences, we force it to consider only unseen alternatives, thereby increasing solution diversity. Experimental results for the Traveling Salesman and Capacitated Vehicle Routing Problem demonstrate its strong performance. Furthermore, our method outperforms previous NCO approaches on the Job Shop Scheduling Problem.
Authors: Uro\v{s} Petkovi\'c, Jonas Frenkel, Olaf Hellwich, Rebecca Lazarides
Abstract: This paper introduces a novel computational approach for analyzing nonverbal social behavior in educational settings. Integrating multimodal behavioral cues, including facial expressions, gesture intensity, and spatial dynamics, the model assesses the nonverbal immediacy (NVI) of teachers from RGB classroom videos. A dataset of 400 30-second video segments from German classrooms was constructed for model training and validation. The gesture intensity regressor achieved a correlation of 0.84, the perceived distance regressor 0.55, and the NVI model 0.44 with median human ratings. The model demonstrates the potential to provide a valuable support in nonverbal behavior assessment, approximating the accuracy of individual human raters. Validated against both questionnaire data and trained observer ratings, our models show moderate to strong correlations with relevant educational outcomes, indicating their efficacy in reflecting effective teaching behaviors. This research advances the objective assessment of nonverbal communication behaviors, opening new pathways for educational research.
Authors: Yilie Huang, Yanwei Jia, Xun Yu Zhou
Abstract: We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions where volatility of the state processes depends on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of a novel exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of $O(N^{\frac{3}{4}})$ up to a logarithmic factor. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and those of the recent model-based stochastic LQ RL studies adapted to the state- and control-dependent volatility setting, demonstrating a better performance of the former in terms of regret bounds.
Authors: Abhijith R. Beeravolu, Mirjam Jonkman, Sami Azam, Friso De Boer
Abstract: Recent advancements in natural language processing (NLP) have led to automation in various domains. However, clinical NLP often relies on benchmark datasets that may not reflect real-world scenarios accurately. Automatic ICD coding, a vital NLP task, typically uses outdated and imbalanced datasets like MIMIC-III, with existing methods yielding micro-averaged F1 scores between 0.4 and 0.7 due to many false positives. Our research introduces an enhanced approach to ICD coding that improves F1 scores by using chapter-based named entities and attentional models. This method categorizes discharge summaries into ICD-9 Chapters and develops attentional models with chapter-specific data, eliminating the need to consider external data for code identification. For categorization, we use Chapter-IV to de-bias and influence key entities and weights without neural networks, creating accurate thresholds and providing interpretability for human validation. Post-validation, we develop attentional models for three frequent and three non-frequent codes from Chapter-IV using Bidirectional-Gated Recurrent Units (GRUs) with Attention and Transformer with Multi-head Attention architectures. The average Micro-F1 scores of 0.79 and 0.81 from these models demonstrate significant performance improvements in ICD coding.
Authors: Enamundram Naga Karthik, Jan Valo\v{s}ek, Lynn Farner, Dario Pfyffer, Simon Schading-Sassenhausen, Anna Lebret, Gergely David, Andrew C. Smith, Kenneth A. Weber II, Maryam Seif, RHSCIR Network Imaging Group, Patrick Freund, Julien Cohen-Adad
Abstract: Spinal cord injury (SCI) is a devastating incidence leading to permanent paralysis and loss of sensory-motor functions potentially resulting in the formation of lesions within the spinal cord. Imaging biomarkers obtained from magnetic resonance imaging (MRI) scans can predict the functional recovery of individuals with SCI and help choose the optimal treatment strategy. Currently, most studies employ manual quantification of these MRI-derived biomarkers, which is a subjective and tedious task. In this work, we propose (i) a universal tool for the automatic segmentation of intramedullary SCI lesions, dubbed \texttt{SCIsegV2}, and (ii) a method to automatically compute the width of the tissue bridges from the segmented lesion. Tissue bridges represent the spared spinal tissue adjacent to the lesion, which is associated with functional recovery in SCI patients. The tool was trained and validated on a heterogeneous dataset from 7 sites comprising patients from different SCI phases (acute, sub-acute, and chronic) and etiologies (traumatic SCI, ischemic SCI, and degenerative cervical myelopathy). Tissue bridges quantified automatically did not significantly differ from those computed manually, suggesting that the proposed automatic tool can be used to derive relevant MRI biomarkers. \texttt{SCIsegV2} and the automatic tissue bridges computation are open-source and available in Spinal Cord Toolbox (v6.4 and above) via the \texttt{sct\_deepseg -task seg\_sc\_lesion\_t2w\_sci} and \texttt{sct\_analyze\_lesion} functions, respectively.
Authors: Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua
Abstract: Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line, which assigns images with unique string identifiers and generates the target identifier as the retrieval target. Despite its great potential, existing generative approaches are limited due to the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and learning gap towards the retrieval target. To address the above issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens as the identifier of the image, while maintaining the alignment with both the visual information and high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval target, we incorporate discriminative training to modify the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.
Authors: Leo Yu-Ho Lo, Huamin Qu
Abstract: In this study, we address the growing issue of misleading charts, a prevalent problem that undermines the integrity of information dissemination. Misleading charts can distort the viewer's perception of data, leading to misinterpretations and decisions based on false information. The development of effective automatic detection methods for misleading charts is an urgent field of research. The recent advancement of multimodal Large Language Models (LLMs) has introduced a promising direction for addressing this challenge. We explored the capabilities of these models in analyzing complex charts and assessing the impact of different prompting strategies on the models' analyses. We utilized a dataset of misleading charts collected from the internet by prior research and crafted nine distinct prompts, ranging from simple to complex, to test the ability of four different multimodal LLMs in detecting over 21 different chart issues. Through three experiments--from initial exploration to detailed analysis--we progressively gained insights into how to effectively prompt LLMs to identify misleading charts and developed strategies to address the scalability challenges encountered as we expanded our detection range from the initial five issues to 21 issues in the final experiment. Our findings reveal that multimodal LLMs possess a strong capability for chart comprehension and critical thinking in data interpretation. There is significant potential in employing multimodal LLMs to counter misleading information by supporting critical thinking and enhancing visualization literacy. This study demonstrates the applicability of LLMs in addressing the pressing concern of misleading charts.
Authors: Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis
Abstract: Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce a novel methodology for the classification of demented and non-demented elderly patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and Dem3D EfficientNet. These models work synergistically to enhance decision-making accuracy, leveraging their collective strengths. Tested on the Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
Authors: Aleksander Ogonowski, Micha{\l} \.Zebrowski, Arkadiusz \'Cwiek, Tobiasz Jarosiewicz, Konrad Klimaszewski, Adam Padee, Piotr Wasiuk, Micha{\l} W\'ojcik
Abstract: Most of the intrusion detection methods in computer networks are based on traffic flow characteristics. However, this approach may not fully exploit the potential of deep learning algorithms to directly extract features and patterns from raw packets. Moreover, it impedes real-time monitoring due to the necessity of waiting for the processing pipeline to complete and introduces dependencies on additional software components. In this paper, we investigate deep learning methodologies capable of detecting attacks in real-time directly from raw packet data within network traffic. We propose a novel approach where packets are stacked into windows and separately recognised, with a 2D image representation suitable for processing with computer vision models. Our investigation utilizes the CIC IDS-2017 dataset, which includes both benign traffic and prevalent real-world attacks, providing a comprehensive foundation for our research.
Authors: Alejandra P\'erez, Santiago Rodr\'iguez, Nicol\'as Ayobi, Nicol\'as Aparicio, Eug\'enie Dessevres, Pablo Arbel\'aez
Abstract: Phase recognition in surgical videos is crucial for enhancing computer-aided surgical systems as it enables automated understanding of sequential procedural stages. Existing methods often rely on fixed temporal windows for video analysis to identify dynamic surgical phases. Thus, they struggle to simultaneously capture short-, mid-, and long-term information necessary to fully understand complex surgical procedures. To address these issues, we propose Multi-Scale Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based approach that combines a Multi-Term Frame encoder with a Temporal Consistency Module to capture information across multiple temporal scales of a surgical video. Our Multi-Term Frame Encoder computes interdependencies across a hierarchy of temporal scales by sampling sequences at increasing strides around the frame of interest. Furthermore, we employ a long-term Transformer encoder over the frame embeddings to further enhance long-term reasoning. MuST achieves higher performance than previous state-of-the-art methods on three different public benchmarks.
Authors: Edyta Bogucka, Marios Constantinides, Sanja \v{S}\'cepanovi\'c, Daniele Quercia
Abstract: In the evolving landscape of AI regulation, it is crucial for companies to conduct impact assessments and document their compliance through comprehensive reports. However, current reports lack grounding in regulations and often focus on specific aspects like privacy in relation to AI systems, without addressing the real-world uses of these systems. Moreover, there is no systematic effort to design and evaluate these reports with both AI practitioners and AI compliance experts. To address this gap, we conducted an iterative co-design process with 14 AI practitioners and 6 AI compliance experts and proposed a template for impact assessment reports grounded in the EU AI Act, NIST's AI Risk Management Framework, and ISO 42001 AI Management System. We evaluated the template by producing an impact assessment report for an AI-based meeting companion at a major tech company. A user study with 8 AI practitioners from the same company and 5 AI compliance experts from industry and academia revealed that our template effectively provides necessary information for impact assessments and documents the broad impacts of AI systems. Participants envisioned using the template not only at the pre-deployment stage for compliance but also as a tool to guide the design stage of AI uses.
Authors: Amirreza Naziri, Hossein Zeinali
Abstract: Writing, as an omnipresent form of human communication, permeates nearly every aspect of contemporary life. Consequently, inaccuracies or errors in written communication can lead to profound consequences, ranging from financial losses to potentially life-threatening situations. Spelling mistakes, among the most prevalent writing errors, are frequently encountered due to various factors. This research aims to identify and rectify diverse spelling errors in text using neural networks, specifically leveraging the Bidirectional Encoder Representations from Transformers (BERT) masked language model. To achieve this goal, we compiled a comprehensive dataset encompassing both non-real-word and real-word errors after categorizing different types of spelling mistakes. Subsequently, multiple pre-trained BERT models were employed. To ensure optimal performance in correcting misspelling errors, we propose a combined approach utilizing the BERT masked language model and Levenshtein distance. The results from our evaluation data demonstrate that the system presented herein exhibits remarkable capabilities in identifying and rectifying spelling mistakes, often surpassing existing systems tailored for the Persian language.
Authors: Yida Zhao, Chao Lou, Kewei Tu
Abstract: Syntactic Transformer language models aim to achieve better generalization through simultaneously modeling syntax trees and sentences. While prior work has been focusing on adding constituency-based structures to Transformers, we introduce Dependency Transformer Grammars (DTGs), a new class of Transformer language model with explicit dependency-based inductive bias. DTGs simulate dependency transition systems with constrained attention patterns by modifying attention masks, incorporate the stack information through relative positional encoding, and augment dependency arc representation with a combination of token embeddings and operation embeddings. When trained on a dataset of sentences annotated with dependency trees, DTGs achieve better generalization while maintaining comparable perplexity with Transformer language model baselines. DTGs also outperform recent constituency-based models, showing that dependency can better guide Transformer language models. Our code is released at https://github.com/zhaoyd1/Dep_Transformer_Grammars.
Authors: Tianjin Huang, Fang Meng, Li Shen, Fan Liu, Yulong Pei, Mykola Pechenizkiy, Shiwei Liu, Tianlong Chen
Abstract: Large-scale neural networks have demonstrated remarkable performance in different domains like vision and language processing, although at the cost of massive computation resources. As illustrated by compression literature, structural model pruning is a prominent algorithm to encourage model efficiency, thanks to its acceleration-friendly sparsity patterns. One of the key questions of structural pruning is how to estimate the channel significance. In parallel, work on data-centric AI has shown that prompting-based techniques enable impressive generalization of large language models across diverse downstream tasks. In this paper, we investigate a charming possibility - \textit{leveraging visual prompts to capture the channel importance and derive high-quality structural sparsity}. To this end, we propose a novel algorithmic framework, namely \texttt{PASS}. It is a tailored hyper-network to take both visual prompts and network weight statistics as input, and output layer-wise channel sparsity in a recurrent manner. Such designs consider the intrinsic channel dependency between layers. Comprehensive experiments across multiple network architectures and six datasets demonstrate the superiority of \texttt{PASS} in locating good structural sparsity. For example, at the same FLOPs level, \texttt{PASS} subnetworks achieve $1\%\sim 3\%$ better accuracy on Food101 dataset; or with a similar performance of $80\%$ accuracy, \texttt{PASS} subnetworks obtain $0.35\times$ more speedup than the baselines.
Authors: Kaarthik Sundar, Sivakumar Rathinam
Abstract: We present a novel algorithm that fuses the existing convex-programming based approach with heuristic information to find optimality guarantees and near-optimal paths for the Shortest Path Problem in the Graph of Convex Sets (SPP-GCS). Our method, inspired by $A^*$, initiates a best-first-like procedure from a designated subset of vertices and iteratively expands it until further growth is neither possible nor beneficial. Traditionally, obtaining solutions with bounds for an optimization problem involves solving a relaxation, modifying the relaxed solution to a feasible one, and then comparing the two solutions to establish bounds. However, for SPP-GCS, we demonstrate that reversing this process can be more advantageous, especially with Euclidean travel costs. In other words, we initially employ $A^*$ to find a feasible solution for SPP-GCS, then solve a convex relaxation restricted to the vertices explored by $A^*$ to obtain a relaxed solution, and finally, compare the solutions to derive bounds. We present numerical results to highlight the advantages of our algorithm over the existing approach in terms of the sizes of the convex programs solved and computation time.
Authors: Jesin James, Balamurali B. T., Binu Abeysinghe, Junchen Liu
Abstract: This study investigates discriminative patterns learned by neural networks for accurate speech classification, with a specific focus on vowel classification tasks. By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms. Through the use of class activation mapping, we identify the frequencies that contribute to vowel classification and compare these findings with linguistic knowledge. Experiments on a American English dataset of vowels showcases the explainability of neural networks and provides valuable insights into the causes of misclassifications and their characteristics when differentiating them from unvoiced speech. This study not only enhances our understanding of the underlying acoustic cues in vowel classification but also offers opportunities for improving speech recognition by bridging the gap between abstract representations in neural networks and established linguistic knowledge
Authors: Zijun Zhan, Yaxian Dong, Yuqing Hu, Shuai Li, Shaohua Cao, Zhu Han
Abstract: Integrating low-light image enhancement techniques, in which diffusion-based AI-generated content (AIGC) models are promising, is necessary to enhance nighttime teleoperation. Remarkably, the AIGC model is computation-intensive, thus necessitating the allocation of AIGC tasks to edge servers with ample computational resources. Given the distinct cost of the AIGC model trained with varying-sized datasets and AIGC tasks possessing disparate demand, it is imperative to formulate a differential pricing strategy to optimize the utility of teleoperators and edge servers concurrently. Nonetheless, the pricing strategy formulation is under information asymmetry, i.e., the demand (e.g., the difficulty level of AIGC tasks and their distribution) of AIGC tasks is hidden information to edge servers. Additionally, manually assessing the difficulty level of AIGC tasks is tedious and unnecessary for teleoperators. To this end, we devise a framework of AIGC task allocation assisted by the Vision Language Model (VLM)-empowered contract theory, which includes two components: VLM-empowered difficulty assessment and contract theory-assisted AIGC task allocation. The first component enables automatic and accurate AIGC task difficulty assessment. The second component is capable of formulating the pricing strategy for edge servers under information asymmetry, thereby optimizing the utility of both edge servers and teleoperators. The simulation results demonstrated that our proposed framework can improve the average utility of teleoperators and edge servers by 10.88~12.43% and 1.4~2.17%, respectively. Code and data are available at https://github.com/ZiJun0819/VLM-Contract-Theory.
Authors: Pratyusha Maiti, Ashok K. Goel
Abstract: Jill Watson, a virtual teaching assistant powered by LLMs, answers student questions and engages them in extended conversations on courseware provided by the instructors. In this paper, we analyze student interactions with Jill across multiple courses and colleges, focusing on the types and complexity of student questions based on Bloom's Revised Taxonomy and tool usage patterns. We find that, by supporting a wide range of cognitive demands, Jill encourages students to engage in sophisticated, higher-order cognitive questions. However, the frequency of usage varies significantly across deployments, and the types of questions asked depend on course-specific contexts. These findings pave the way for future work on AI-driven educational tools tailored to individual learning styles and course structure, potentially enhancing both the teaching and learning experience in classrooms.
Authors: Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li
Abstract: Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI risks study, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-Bench 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-Bench 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.
Authors: Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin
Abstract: Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation.To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at \url{https://github.com/zhenzhiwang/HumanVid/}.
Authors: Jianpeng Yao, Xiaopan Zhang, Yu Xia, Zejin Wang, Amit K. Roy-Chowdhury, Jiachen Li
Abstract: Reinforcement Learning (RL) has enabled social robots to generate trajectories without human-designed rules or interventions, which makes it more effective than hard-coded systems for generalizing to complex real-world scenarios. However, social navigation is a safety-critical task that requires robots to avoid collisions with pedestrians while previous RL-based solutions fall short in safety performance in complex environments. To enhance the safety of RL policies, to the best of our knowledge, we propose the first algorithm, SoNIC, that integrates adaptive conformal inference (ACI) with constrained reinforcement learning (CRL) to learn safe policies for social navigation. More specifically, our method augments RL observations with ACI-generated nonconformity scores and provides explicit guidance for agents to leverage the uncertainty metrics to avoid safety-critical areas by incorporating safety constraints with spatial relaxation. Our method outperforms state-of-the-art baselines in terms of both safety and adherence to social norms by a large margin and demonstrates much stronger robustness to out-of-distribution scenarios. Our code and video demos are available on our project website: https://sonic-social-nav.github.io/.
Authors: Wenting Zhao, Tanya Goyal, Yu Ying Chiu, Liwei Jiang, Benjamin Newman, Abhilasha Ravichander, Khyathi Chandu, Ronan Le Bras, Claire Cardie, Yuntian Deng, Yejin Choi
Abstract: While hallucinations of large language models (LLMs) prevail as a major challenge, existing evaluation benchmarks on factuality do not cover the diverse domains of knowledge that the real-world users of LLMs seek information about. To bridge this gap, we introduce WildHallucinations, a benchmark that evaluates factuality. It does so by prompting LLMs to generate information about entities mined from user-chatbot conversations in the wild. These generations are then automatically fact-checked against a systematically curated knowledge source collected from web search. Notably, half of these real-world entities do not have associated Wikipedia pages. We evaluate 118,785 generations from 15 LLMs on 7,919 entities. We find that LLMs consistently hallucinate more on entities without Wikipedia pages and exhibit varying hallucination rates across different domains. Finally, given the same base models, adding a retrieval component only slightly reduces hallucinations but does not eliminate hallucinations.
Authors: Wenting Zhao, Ge Gao, Claire Cardie, Alexander M. Rush
Abstract: When seeking information from unfamiliar documents, users frequently pose questions that cannot be answered by the documents. While existing large language models (LLMs) identify these unanswerable questions, they do not assist users in reformulating their questions, thereby reducing their overall utility. We curate CouldAsk, an evaluation benchmark composed of existing and new datasets for document-grounded question answering, specifically designed to study reformulating unanswerable questions. We evaluate state-of-the-art open-source and proprietary LLMs on CouldAsk. The results demonstrate the limited capabilities of these models in reformulating questions. Specifically, GPT-4 and Llama2-7B successfully reformulate questions only 26% and 12% of the time, respectively. Error analysis shows that 62% of the unsuccessful reformulations stem from the models merely rephrasing the questions or even generating identical questions. We publicly release the benchmark and the code to reproduce the experiments.
Authors: Yongseung Jegal, Jaewoong Choi, Jiho Lee, Ki-Su Park, Seyoung Lee, Janghyeok Yoon
Abstract: Drug repositioning-a promising strategy for discovering new therapeutic uses for existing drugs-has been increasingly explored in the computational science literature using biomedical databases. However, the technological potential of drug repositioning candidates has often been overlooked. This study presents a novel protocol to comprehensively analyse various sources such as pharmaceutical patents and biomedical databases, and identify drug repositioning candidates with both technological potential and scientific evidence. To this end, first, we constructed a scientific biomedical knowledge graph (s-BKG) comprising relationships between drugs, diseases, and genes derived from biomedical databases. Our protocol involves identifying drugs that exhibit limited association with the target disease but are closely located in the s-BKG, as potential drug candidates. We constructed a patent-informed biomedical knowledge graph (p-BKG) by adding pharmaceutical patent information. Finally, we developed a graph embedding protocol to ascertain the structure of the p-BKG, thereby calculating the relevance scores of those candidates with target disease-related patents to evaluate their technological potential. Our case study on Alzheimer's disease demonstrates its efficacy and feasibility, while the quantitative outcomes and systematic methods are expected to bridge the gap between computational discoveries and successful market applications in drug repositioning research.
Authors: Shengyin Sun, Yuxiang Ren, Chen Ma, Xuecang Zhang
Abstract: The latest advancements in large language models (LLMs) have revolutionized the field of natural language processing (NLP). Inspired by the success of LLMs in NLP tasks, some recent work has begun investigating the potential of applying LLMs in graph learning tasks. However, most of the existing work focuses on utilizing LLMs as powerful node feature augmenters, leaving employing LLMs to enhance graph topological structures an understudied problem. In this work, we explore how to leverage the information retrieval and text generation capabilities of LLMs to refine/enhance the topological structure of text-attributed graphs (TAGs) under the node classification setting. First, we propose using LLMs to help remove unreliable edges and add reliable ones in the TAG. Specifically, we first let the LLM output the semantic similarity between node attributes through delicate prompt designs, and then perform edge deletion and edge addition based on the similarity. Second, we propose using pseudo-labels generated by the LLM to improve graph topology, that is, we introduce the pseudo-label propagation as a regularization to guide the graph neural network (GNN) in learning proper edge weights. Finally, we incorporate the two aforementioned LLM-based methods for graph topological refinement into the process of GNN training, and perform extensive experiments on four real-world datasets. The experimental results demonstrate the effectiveness of LLM-based graph topology refinement (achieving a 0.15%--2.47% performance gain on public benchmarks).
Authors: Dirk Fahland, Fabiana Fournier, Lior Limonad, Inna Skarbovsky, Ava J. E. Swevels
Abstract: Large Language Models (LLMs) are likely to play a prominent role in future AI-augmented business process management systems (ABPMSs) catering functionalities across all system lifecycle stages. One such system's functionality is Situation-Aware eXplainability (SAX), which relates to generating causally sound and yet human-interpretable explanations that take into account the process context in which the explained condition occurred. In this paper, we present the SAX4BPM framework developed to generate SAX explanations. The SAX4BPM suite consists of a set of services and a central knowledge repository. The functionality of these services is to elicit the various knowledge ingredients that underlie SAX explanations. A key innovative component among these ingredients is the causal process execution view. In this work, we integrate the framework with an LLM to leverage its power to synthesize the various input ingredients for the sake of improved SAX explanations. Since the use of LLMs for SAX is also accompanied by a certain degree of doubt related to its capacity to adequately fulfill SAX along with its tendency for hallucination and lack of inherent capacity to reason, we pursued a methodological evaluation of the quality of the generated explanations. To this aim, we developed a designated scale and conducted a rigorous user study. Our findings show that the input presented to the LLMs aided with the guard-railing of its performance, yielding SAX explanations having better-perceived fidelity. This improvement is moderated by the perception of trust and curiosity. More so, this improvement comes at the cost of the perceived interpretability of the explanation.
Authors: Guilherme Lima, Jo\~ao M. B. Rodrigues, Marcelo Machado, Elton Soares, Sandro R. Fiorini, Raphael Thiago, Leonardo G. Azevedo, Viviane T. da Silva, Renato Cerqueira
Abstract: We present a Wikidata-based framework, called KIF, for virtually integrating heterogeneous knowledge sources. KIF is written in Python and is released as open-source. It leverages Wikidata's data model and vocabulary plus user-defined mappings to construct a unified view of the underlying sources while keeping track of the context and provenance of their statements. The underlying sources can be triplestores, relational databases, CSV files, etc., which may or may not use the vocabulary and RDF encoding of Wikidata. The end result is a virtual knowledge base which behaves like an "extended Wikidata" and which can be queried using a simple but expressive pattern language, defined in terms of Wikidata's data model. In this paper, we present the design and implementation of KIF, discuss how we have used it to solve a real integration problem in the domain of chemistry (involving Wikidata, PubChem, and IBM CIRCA), and present experimental results on the performance and overhead of KIF
Authors: Renjie Liang, Li Li, Chongzhi Zhang, Jing Wang, Xizhou Zhu, Aixin Sun
Abstract: In this paper, we propose the task of \textit{Ranked Video Moment Retrieval} (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and studied by CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset. Our key contribution is the manual annotation of relevance levels for 94,442 query-moment pairs. We then develop the $NDCG@K, IoU\geq \mu$ evaluation metric for this new task and conduct experiments to evaluate three baseline models. Our experiments show that the new RVMR task brings new challenges to existing models and we believe this new dataset contributes to the research on multi-modality search. The dataset is available at \url{https://github.com/Ranking-VMR/TVR-Ranking}
Authors: Maud van Lier, Gorka Mu\~noz-Gil
Abstract: The arrival of Large Language Models (LLMs) has stirred up philosophical debates about the possibility of realizing agency in an artificial manner. In this work we contribute to the debate by presenting a theoretical model that can be used as a threshold conception for artificial agents. The model defines agents as systems whose actions and goals are always influenced by a dynamic framework of factors that consists of the agent's accessible history, its adaptive repertoire and its external environment. This framework, in turn, is influenced by the actions that the agent takes and the goals that it forms. We show with the help of the model that state-of-the-art LLMs are not agents yet, but that there are elements to them that suggest a way forward. The paper argues that a combination of the agent architecture presented in Park et al. (2023) together with the use of modules like the Coscientist in Boiko et al. (2023) could potentially be a way to realize agency in an artificial manner. We end the paper by reflecting on the obstacles one might face in building such an artificial agent and by presenting possible directions for future research.
Authors: Alejandro Pozas-Kerstjens, Senaida Hern\'andez-Santana, Jos\'e Ram\'on Pareja Monturiol, Marco Castrill\'on L\'opez, Giannicola Scarpa, Carlos E. Gonz\'alez-Guill\'en, David P\'erez-Garc\'ia
Abstract: Tensor networks, widely used for providing efficient representations of low-energy states of local quantum many-body systems, have been recently proposed as machine learning architectures which could present advantages with respect to traditional ones. In this work we show that tensor network architectures have especially prospective properties for privacy-preserving machine learning, which is important in tasks such as the processing of medical records. First, we describe a new privacy vulnerability that is present in feedforward neural networks, illustrating it in synthetic and real-world datasets. Then, we develop well-defined conditions to guarantee robustness to such vulnerability, which involve the characterization of models equivalent under gauge symmetry. We rigorously prove that such conditions are satisfied by tensor-network architectures. In doing so, we define a novel canonical form for matrix product states, which has a high degree of regularity and fixes the residual gauge that is left in the canonical forms based on singular value decompositions. We supplement the analytical findings with practical examples where matrix product states are trained on datasets of medical records, which show large reductions on the probability of an attacker extracting information about the training dataset from the model's parameters. Given the growing expertise in training tensor-network architectures, these results imply that one may not have to be forced to make a choice between accuracy in prediction and ensuring the privacy of the information processed.
Authors: Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo
Abstract: This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations. Recently, prompting Large Language Models (LLMs) to generate actions iteratively has become a prevalent paradigm due to its superior performance and user-friendliness. However, this paradigm is plagued by two inefficiencies: high token consumption and redundant error correction, both of which hinder its scalability for large-scale testing and applications. To address these issues, we propose Tree-Planner, which reframes task planning with LLMs into three distinct phases: plan sampling, action tree construction, and grounded deciding. Tree-Planner starts by using an LLM to sample a set of potential plans before execution, followed by the aggregation of them to form an action tree. Finally, the LLM performs a top-down decision-making process on the tree, taking into account real-time environmental information. Experiments show that Tree-Planner achieves state-of-the-art performance while maintaining high efficiency. By decomposing LLM queries into a single plan-sampling call and multiple grounded-deciding calls, a considerable part of the prompt are less likely to be repeatedly consumed. As a result, token consumption is reduced by 92.2% compared to the previously best-performing model. Additionally, by enabling backtracking on the action tree as needed, the correction process becomes more flexible, leading to a 40.5% decrease in error corrections.
Authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
Authors: M\'elanie Ducoffe, Guillaume Pov\'eda, Audrey Galametz, Ryma Boumazouza, Marion-C\'ecile Martin, Julien Baris, Derk Daverschot, Eugene O'Higgins
Abstract: Surrogate Neural Networks are nowadays routinely used in industry as substitutes for computationally demanding engineering simulations (e.g., in structural analysis). They allow to generate faster predictions and thus analyses in industrial applications e.g., during a product design, testing or monitoring phases. Due to their performance and time-efficiency, these surrogate models are now being developed for use in safety-critical applications. Neural network verification and in particular the assessment of their robustness (e.g., to perturbations) is the next critical step to allow their inclusion in real-life applications and certification. We assess the applicability and scalability of empirical and formal methods in the context of aircraft predictive maintenance for surrogate neural networks designed to predict the stress sustained by an aircraft part from external loads. The case study covers a high-dimensional input and output space and the verification process thus accommodates multi-objective constraints. We explore the complementarity of verification methods in assessing the local stability property of such surrogate models to input noise. We showcase the effectiveness of sequentially combining methods in one verification 'pipeline' and demonstrate the subsequent gain in runtime required to assess the targeted property.
Authors: Qixin Zhang, Zongqi Wan, Zengde Deng, Zaiyi Chen, Xiaoming Sun, Jialin Zhang, Yu Yang
Abstract: Projected Gradient Ascent (PGA) is the most commonly used optimization scheme in machine learning and operations research areas. Nevertheless, numerous studies and examples have shown that the PGA methods may fail to achieve the tight approximation ratio for continuous DR-submodular maximization problems. To address this challenge, we present a boosting technique in this paper, which can efficiently improve the approximation guarantee of the standard PGA to \emph{optimal} with only small modifications on the objective function. The fundamental idea of our boosting technique is to exploit non-oblivious search to derive a novel auxiliary function $F$, whose stationary points are excellent approximations to the global maximum of the original DR-submodular objective $f$. Specifically, when $f$ is monotone and $\gamma$-weakly DR-submodular, we propose an auxiliary function $F$ whose stationary points can provide a better $(1-e^{-\gamma})$-approximation than the $(\gamma^2/(1+\gamma^2))$-approximation guaranteed by the stationary points of $f$ itself. Similarly, for the non-monotone case, we devise another auxiliary function $F$ whose stationary points can achieve an optimal $\frac{1-\min_{\boldsymbol{x}\in\mathcal{C}}\|\boldsymbol{x}\|_{\infty}}{4}$-approximation guarantee where $\mathcal{C}$ is a convex constraint set. In contrast, the stationary points of the original non-monotone DR-submodular function can be arbitrarily bad~\citep{chen2023continuous}. Furthermore, we demonstrate the scalability of our boosting technique on four problems. In all of these four problems, our resulting variants of boosting PGA algorithm beat the previous standard PGA in several aspects such as approximation ratio and efficiency. Finally, we corroborate our theoretical findings with numerical experiments, which demonstrate the effectiveness of our boosting PGA methods.
Authors: Chenghua Gong, Yao Cheng, Xiang Li, Caihua Shan, Siqiang Luo
Abstract: Graphs are structured data that models complex relations between real-world entities. Heterophilous graphs, where linked nodes are prone to be with different labels or dissimilar features, have recently attracted significant attention and found many applications. Meanwhile, increasing efforts have been made to advance learning from heterophilous graphs. Although there exist surveys on the relevant topic, they focus on heterophilous GNNs, which are only sub-topics of heterophilous graph learning. In this survey, we comprehensively overview existing works on learning from graphs with heterophily.First, we collect over 180 publications and introduce the development of this field. Then, we systematically categorize existing methods based on a hierarchical taxonomy including learning strategies, model architectures and practical applications. Finally, we discuss the primary challenges of existing studies and highlight promising avenues for future research.More publication details and corresponding open-source codes can be accessed and will be continuously updated at our repositories:https://github.com/gongchenghua/Papers-Graphs-with-Heterophily.
URLs: https://github.com/gongchenghua/Papers-Graphs-with-Heterophily.
Authors: Vassilis Papadopoulos, J\'er\'emie Wenger, Cl\'ement Hongler
Abstract: We study the probabilistic modeling performed by Autoregressive Large Language Models (LLMs) through the angle of time directionality, addressing a question first raised in (Shannon, 1951). For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
Authors: Arnau Pastor, Pau Escofet, Sahar Ben Rached, Eduard Alarc\'on, Pere Barlet-Ros, Sergi Abadal
Abstract: Quantum computing holds immense potential for solving classically intractable problems by leveraging the unique properties of quantum mechanics. The scalability of quantum architectures remains a significant challenge. Multi-core quantum architectures are proposed to solve the scalability problem, arising a new set of challenges in hardware, communications and compilation, among others. One of these challenges is to adapt a quantum algorithm to fit within the different cores of the quantum computer. This paper presents a novel approach for circuit partitioning using Deep Reinforcement Learning, contributing to the advancement of both quantum computing and graph partitioning. This work is the first step in integrating Deep Reinforcement Learning techniques into Quantum Circuit Mapping, opening the door to a new paradigm of solutions to such problems.
Authors: Annie Wong, Jacob de Nobel, Thomas B\"ack, Aske Plaat, Anna V. Kononova
Abstract: Although deep reinforcement learning methods can learn effective policies for challenging problems such as Atari games and robotics tasks, algorithms are complex, and training times are often long. This study investigates how Evolution Strategies perform compared to gradient-based deep reinforcement learning methods. We use Evolution Strategies to optimize the weights of a neural network via neuroevolution, performing direct policy search. We benchmark both deep policy networks and networks consisting of a single linear layer from observations to actions for three gradient-based methods, such as Proximal Policy Optimization. These methods are evaluated against three classical Evolution Strategies and Augmented Random Search, which all use linear policy networks. Our results reveal that Evolution Strategies can find effective linear policies for many reinforcement learning benchmark tasks, unlike deep reinforcement learning methods that can only find successful policies using much larger networks, suggesting that current benchmarks are easier to solve than previously assumed. Interestingly, Evolution Strategies also achieve results comparable to gradient-based deep reinforcement learning algorithms for higher-complexity tasks. Furthermore, we find that by directly accessing the memory state of the game, Evolution Strategies can find successful policies in Atari that outperform the policies found by Deep Q-Learning. Evolution Strategies also outperform Augmented Random Search in most benchmarks, demonstrating superior sample efficiency and robustness in training linear policy networks.
Authors: Jianhao Yan, Yun Luo, Yue Zhang
Abstract: The application scope of large language models (LLMs) is increasingly expanding. In practical use, users might provide feedback based on the model's output, hoping for a responsive model that can complete responses according to their feedback. Whether the model can appropriately respond to users' refuting feedback and consistently follow through with execution has not been thoroughly analyzed. In light of this, this paper proposes a comprehensive benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing. The evaluation aims to assess whether models can positively accept feedback in form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation. We conduct evaluations on numerous LLMs and find that LLMs are stubborn, i.e. exhibit inclination to their internal knowledge, often failing to comply with user feedback. Additionally, as the length of the conversation increases, models gradually forget the user's stated feedback and roll back to their own responses. We further propose a recall-and-repeat prompts as a simple and effective way to enhance the model's responsiveness to feedback.
Authors: Sinan Abdulhak, Wayne Hubbard, Karthik Gopalakrishnan, Max Z. Li
Abstract: Generative artificial intelligence (AI) and large language models (LLMs) have gained rapid popularity through publicly available tools such as ChatGPT. The adoption of LLMs for personal and professional use is fueled by the natural interactions between human users and computer applications such as ChatGPT, along with powerful summarization and text generation capabilities. Given the widespread use of such generative AI tools, in this work we investigate how these tools can be deployed in a non-safety critical, strategic traffic flow management setting. Specifically, we train an LLM, CHATATC, based on a large historical data set of Ground Delay Program (GDP) issuances, spanning 2000-2023 and consisting of over 80,000 GDP implementations, revisions, and cancellations. We test the query and response capabilities of CHATATC, documenting successes (e.g., providing correct GDP rates, durations, and reason) and shortcomings (e.g,. superlative questions). We also detail the design of a graphical user interface for future users to interact and collaborate with the CHATATC conversational agent.
Authors: Greg d'Eon, Neil Newman, Kevin Leyton-Brown
Abstract: Iterative combinatorial auctions are widely used in high stakes settings such as spectrum auctions. Such auctions can be hard to analyze, making it difficult for bidders to determine how to behave and for designers to optimize auction rules to ensure desirable outcomes such as high revenue or welfare. In this paper, we investigate whether multi-agent reinforcement learning (MARL) algorithms can be used to understand iterative combinatorial auctions, given that these algorithms have recently shown empirical success in several other domains. We find that MARL can indeed benefit auction analysis, but that deploying it effectively is nontrivial. We begin by describing modelling decisions that keep the resulting game tractable without sacrificing important features such as imperfect information or asymmetry between bidders. We also discuss how to navigate pitfalls of various MARL algorithms, how to overcome challenges in verifying convergence, and how to generate and interpret multiple equilibria. We illustrate the promise of our resulting approach by using it to evaluate a specific rule change to a clock auction, finding substantially different auction outcomes due to complex changes in bidders' behavior.
Authors: Emi Zeger, Yifei Wang, Aaron Mishkin, Tolga Ergen, Emmanuel Cand\`es, Mert Pilanci
Abstract: We prove that training neural networks on 1-D data is equivalent to solving convex Lasso problems with discrete, explicitly defined dictionary matrices. We consider neural networks with piecewise linear activations and depths ranging from 2 to an arbitrary but finite number of layers. We first show that two-layer networks with piecewise linear activations are equivalent to Lasso models using a discrete dictionary of ramp functions, with breakpoints corresponding to the training data points. In certain general architectures with absolute value or ReLU activations, a third layer surprisingly creates features that reflect the training data about themselves. Additional layers progressively generate reflections of these reflections. The Lasso representation provides valuable insights into the analysis of globally optimal networks, elucidating their solution landscapes and enabling closed-form solutions in certain special cases. Numerical results show that reflections also occur when optimizing standard deep networks using standard non-convex optimizers. Additionally, we demonstrate our theory with autoregressive time series models.
Authors: Qifeng Zhou, Wenliang Zhong, Yuzhi Guo, Michael Xiao, Hehuan Ma, Junzhou Huang
Abstract: In the field of computational histopathology, both whole slide images (WSIs) and diagnostic captions provide valuable insights for making diagnostic decisions. However, aligning WSIs with diagnostic captions presents a significant challenge. This difficulty arises from two main factors: 1) Gigapixel WSIs are unsuitable for direct input into deep learning models, and the redundancy and correlation among the patches demand more attention; and 2) Authentic WSI diagnostic captions are extremely limited, making it difficult to train an effective model. To overcome these obstacles, we present PathM3, a multimodal, multi-task, multiple instance learning (MIL) framework for WSI classification and captioning. PathM3 adapts a query-based transformer to effectively align WSIs with diagnostic captions. Given that histopathology visual patterns are redundantly distributed across WSIs, we aggregate each patch feature with MIL method that considers the correlations among instances. Furthermore, our PathM3 overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data in the manner of multi-task joint learning. Extensive experiments with improved classification accuracy and caption generation demonstrate the effectiveness of our method on both WSI classification and captioning task.
Authors: Yupeng Li, Haorui He, Jin Bai, Dacheng Wen
Abstract: The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, such news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.
Authors: Daniel Tanneberg, Felix Ocker, Stephan Hasler, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Heiko Wersing, Bernhard Sendhoff, Michael Gienger
Abstract: How can a robot provide unobtrusive physical support within a group of humans? We present Attentive Support, a novel interaction concept for robots to support a group of humans. It combines scene perception, dialogue acquisition, situation understanding, and behavior generation with the common-sense reasoning capabilities of Large Language Models (LLMs). In addition to following user instructions, Attentive Support is capable of deciding when and how to support the humans, and when to remain silent to not disturb the group. With a diverse set of scenarios, we show and evaluate the robot's attentive behavior, which supports and helps the humans when required, while not disturbing if no help is needed.
Authors: Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li
Abstract: Recent advancements in Large Language Models (LLMs) have showcased remarkable capabilities across various tasks in different domains. However, the emergence of biases and the potential for generating harmful content in LLMs, particularly under malicious inputs, pose significant challenges. Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently and effectively moderate harmful and unsafe inputs and outputs for LLMs. By employing a multi-faceted approach that includes energy-based training data augmentation through Langevin dynamics, optimizing a safe suffix for inputs via minimax optimization, and integrating a fusion-based model combining robust KNN with LLMs based on our data augmentation, RigorLLM offers a robust solution to harmful content moderation. Our experimental evaluations demonstrate that RigorLLM not only outperforms existing baselines like OpenAI API and Perspective API in detecting harmful content but also exhibits unparalleled resilience to jailbreaking attacks. The innovative use of constrained optimization and a fusion-based guardrail approach represents a significant step forward in developing more secure and reliable LLMs, setting a new standard for content moderation frameworks in the face of evolving digital threats.
Authors: Akshat Gupta, Dev Sajnani, Gopala Anumanchipalli
Abstract: ROME and MEMIT are largely believed to be two different model editing algorithms, with the major difference between them being the ability to perform batched edits. In this paper, we unify these two algorithms under a single conceptual umbrella, optimizing for the same goal, which we call the preservation-memorization objective. ROME uses an equality constraint to optimize this objective to perform one edit at a time, whereas MEMIT employs a more flexible least-square constraint that allows for batched edits. We generalize ROME and enable batched editing with equality constraint in the form of EMMET - an Equality-constrained Mass Model Editing algorithm for Transformers, a new batched memory-editing algorithm. EMMET can perform batched-edits up to a batch-size of 10,000, with very similar performance to MEMIT across multiple dimensions. With the introduction of EMMET, we truly unify ROME and MEMIT and show that both algorithms are equivalent in terms of their optimization objective, their abilities (singular and batched editing), their model editing performance and their limitations.
Authors: Mathieu Blondel, Vincent Roulet
Abstract: Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of differentiable programming. This new programming paradigm enables end-to-end differentiation of complex computer programs (including those with control flows and data structures), making gradient-based optimization of program parameters possible. As an emerging paradigm, differentiable programming builds upon several areas of computer science and applied mathematics, including automatic differentiation, graphical models, optimization and statistics. This book presents a comprehensive review of the fundamental concepts useful for differentiable programming. We adopt two main perspectives, that of optimization and that of probability, with clear analogies between the two. Differentiable programming is not merely the differentiation of programs, but also the thoughtful design of programs intended for differentiation. By making programs differentiable, we inherently introduce probability distributions over their execution, providing a means to quantify the uncertainty associated with program outputs.
Authors: Ali Behrouz, Michele Santacatterina, Ramin Zabih
Abstract: Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space in input size, limiting their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbone for multi-dimensional data, such as images and multivariate time series, existing models are either data independent, or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models, with efficient hardware-aware implementation, have shown promising potential for long sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while demonstrating significantly improved computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, neither is necessary.
Authors: Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola
Abstract: We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
Authors: Rickard Stureborg, Sanxing Chen, Ruoyu Xie, Aayushi Patel, Christopher Li, Chloe Qinyu Zhu, Tingnan Hu, Jun Yang, Bhuwan Dhingra
Abstract: One way to personalize chatbot interactions is by establishing common ground with the intended reader. A domain where establishing mutual understanding could be particularly impactful is vaccine concerns and misinformation. Vaccine interventions are forms of messaging which aim to answer concerns expressed about vaccination. Tailoring responses in this domain is difficult, since opinions often have seemingly little ideological overlap. We define the task of tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring responses to a CGO involves meaningfully improving the answer by relating it to an opinion or belief the reader holds. In this paper we introduce TAILOR-CGO, a dataset for evaluating how well responses are tailored to provided CGOs. We benchmark several major LLMs on this task; finding GPT-4-Turbo performs significantly better than others. We also build automatic evaluation metrics, including an efficient and accurate BERT model that outperforms finetuned LLMs, investigate how to successfully tailor vaccine messaging to CGOs, and provide actionable recommendations from this investigation. Code and model weights: https://github.com/rickardstureborg/tailor-cgo Dataset: https://huggingface.co/datasets/DukeNLP/tailor-cgo
URLs: https://github.com/rickardstureborg/tailor-cgo, https://huggingface.co/datasets/DukeNLP/tailor-cgo
Authors: Rongyi Zhu, Zeliang Zhang, Susan Liang, Zhuo Liu, Chenliang Xu
Abstract: Adversarial examples, crafted by adding perturbations imperceptible to humans, can deceive neural networks. Recent studies identify the adversarial transferability across various models, \textit{i.e.}, the cross-model attack ability of adversarial samples. To enhance such adversarial transferability, existing input transformation-based methods diversify input data with transformation augmentation. However, their effectiveness is limited by the finite number of available transformations. In our study, we introduce a novel approach named Learning to Transform (L2T). L2T increases the diversity of transformed images by selecting the optimal combination of operations from a pool of candidates, consequently improving adversarial transferability. We conceptualize the selection of optimal transformation combinations as a trajectory optimization problem and employ a reinforcement learning strategy to effectively solve the problem. Comprehensive experiments on the ImageNet dataset, as well as practical tests with Google Vision and GPT-4V, reveal that L2T surpasses current methodologies in enhancing adversarial transferability, thereby confirming its effectiveness and practical significance. The code is available at https://github.com/RongyiZhu/L2T.
Authors: Craig Innes, Subramanian Ramamoorthy
Abstract: Autonomous Vehicles (AVs) are often tested in simulation to estimate the probability they will violate safety specifications. Two common issues arise when using existing techniques to produce this estimation: If violations occur rarely, simple Monte-Carlo sampling techniques can fail to produce efficient estimates; if simulation horizons are too long, importance sampling techniques (which learn proposal distributions from past simulations) can fail to converge. This paper addresses both issues by interleaving rare-event sampling techniques with online specification monitoring algorithms. We use adaptive multi-level splitting to decompose simulations into partial trajectories, then calculate the distance of those partial trajectories to failure by leveraging robustness metrics from Signal Temporal Logic (STL). By caching those partial robustness metric values, we can efficiently re-use computations across multiple sampling stages. Our experiments on an interstate lane-change scenario show our method is viable for testing simulated AV-pipelines, efficiently estimating failure probabilities for STL specifications based on real traffic rules. We produce better estimates than Monte-Carlo and importance sampling in fewer simulations.
Authors: Zhangyu Wang, Krzysztof Janowicz, Gengchen Mai, Ivan Majic
Abstract: Intuitively, there is a relation between measures of spatial dependence and information theoretical measures of entropy. For instance, we can provide an intuition of why spatial data is special by stating that, on average, spatial data samples contain less than expected information. Similarly, spatial data, e.g., remotely sensed imagery, that is easy to compress is also likely to show significant spatial autocorrelation. Formulating our (highly specific) core concepts of spatial information theory in the widely used language of information theory opens new perspectives on their differences and similarities and also fosters cross-disciplinary collaboration, e.g., with the broader AI/ML communities. Interestingly, however, this intuitive relation is challenging to formalize and generalize, leading prior work to rely mostly on experimental results, e.g., for describing landscape patterns. In this work, we will explore the information theoretical roots of spatial autocorrelation, more specifically Moran's I, through the lens of self-information (also known as surprisal) and provide both formal proofs and experiments.
Authors: Jiaming Zhao, Wenbo Qiao, Peng Zhang, Hui Gao
Abstract: Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image superresolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.
Authors: Amandeep Kumar, Muhammad Awais, Sanath Narayan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Abstract: Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers, learning separate model weights for each attribute, and are not scalable for novel attributes. In this work, we propose an efficient, plug-and-play, 3D-aware face editing framework based on attribute-specific prompt learning, enabling the generation of facial images with controllable attributes across various target poses. To this end, we introduce a text-driven learnable style token-based latent attribute editor (LAE). The LAE harnesses a pre-trained vision-language model to find text-guided attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN. It utilizes learnable style tokens and style mappers to learn and transform this editing direction to 3D latent space. To train LAE with multiple attributes, we use directional contrastive loss and style token loss. Furthermore, to ensure view consistency and identity preservation across different poses and attributes, we employ several 3D-aware identity and pose preservation losses. Our experiments show that our proposed framework generates high-quality images with 3D awareness and view consistency while maintaining attribute-specific features. We demonstrate the effectiveness of our method on different facial attributes, including hair color and style, expression, and others.
Authors: Siyan Zhao, Tung Nguyen, Aditya Grover
Abstract: In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors. In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. Decision boundaries are straightforward to visualize and provide important information about the qualitative behavior of the inductive biases of standard classifiers. To our surprise, we find that the decision boundaries learned by current LLMs in simple binary classification tasks are often irregular and non-smooth, regardless of linear separability in the underlying task. This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability. We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner. Our findings provide a deeper understanding of in-context learning dynamics and offer practical improvements for enhancing robustness and generalizability of in-context learning.
Authors: Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang
Abstract: Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM)} for document understanding. In particular, LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in Key Information Extraction (KIE) and Visual Question Answering (VQA). Comprehensive benchmark evaluations reveal significant improvements, with a 27.2% increase on KIE tasks and 12.0% on VQA tasks compared to previous state-of-the-art document understanding MLLMs, as well as a 15.1% improvement over other SOTA OCR-based LLMs on KIE tasks.
Authors: Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe
Abstract: Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition~(ASR) systems due to their efficient modelling of local context. Notably, its use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. Towards this, we introduce Multi-Convformer that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This helps in improved modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate~(WER) improvements.
Authors: Shiquan Zhang, Ying Ma, Le Fang, Hong Jia, Simon D'Alfonso, Vassilis Kostakos
Abstract: This demo presents a novel end-to-end framework that combines on-device large language models (LLMs) with smartphone sensing technologies to achieve context-aware and personalized services. The framework addresses critical limitations of current personalization solutions via cloud LLMs, such as privacy concerns, latency and cost, and limited personal information. To achieve this, we innovatively proposed deploying LLMs on smartphones with multimodal sensor data through context-aware sensing and customized prompt engineering, ensuring privacy and enhancing personalization performance. A case study involving a university student demonstrated the capability of the framework to provide tailored recommendations. In addition, we show that the framework achieves the best trade-off in privacy, performance, latency, cost, battery and energy consumption between on-device and cloud LLMs. To the best of our knowledge, this is the first framework to provide on-device LLMs personalization with smartphone sensing. Future work will incorporate more diverse sensor data and involve extensive user studies to enhance personalization. Our proposed framework has the potential to substantially improve user experiences across domains including healthcare, productivity, and entertainment.
Authors: Younggun Kim (Department of Civil Engineering, University of Central Florida, Florida, USA), Beomsik Cho (Department of AI Mobility Engineering, Suwon, South Korea), Seonghoon Ryoo (Department of AI Mobility Engineering, Suwon, South Korea), Soomok Lee (Department of AI Mobility Engineering, Suwon, South Korea)
Abstract: Adapting deep learning networks for point cloud data recognition in self-driving vehicles faces challenges due to the variability in datasets and sensor technologies, emphasizing the need for adaptive techniques to maintain accuracy across different conditions. In this paper, we introduce the 3D Adaptive Structural Convolution Network (3D-ASCN), a cutting-edge framework for 3D point cloud recognition. It combines 3D convolution kernels, a structural tree structure, and adaptive neighborhood sampling for effective geometric feature extraction. This method obtains domain-invariant features and demonstrates robust, adaptable performance on a variety of point cloud datasets, ensuring compatibility across diverse sensor configurations without the need for parameter adjustments. This highlights its potential to significantly enhance the reliability and efficiency of self-driving vehicle technology.
Authors: Rapha\"el Tinarrage, Henrique Ennes, Lucas E. Resck, Lucas T. Gomes, Jean R. Ponciano, Jorge Poco
Abstract: Binding precedents (S\'umulas Vinculantes) constitute a juridical instrument unique to the Brazilian legal system and whose objectives include the protection of the Federal Supreme Court against repetitive demands. Studies of the effectiveness of these instruments in decreasing the Court's exposure to similar cases, however, indicate that they tend to fail in such a direction, with some of the binding precedents seemingly creating new demands. We empirically assess the legal impact of five binding precedents, 11, 14, 17, 26 and 37, at the highest court level through their effects on the legal subjects they address. This analysis is only possible through the comparison of the Court's ruling about the precedents' themes before they are created, which means that these decisions should be detected through techniques of Similar Case Retrieval. The contributions of this article are therefore twofold: on the mathematical side, we compare the uses of different methods of Natural Language Processing -- TF-IDF, LSTM, BERT, and regex -- for Similar Case Retrieval, whereas on the legal side, we contrast the inefficiency of these binding precedents with a set of hypotheses that may justify their repeated usage. We observe that the deep learning models performed significantly worse in the specific Similar Case Retrieval task and that the reasons for binding precedents to fail in responding to repetitive demand are heterogeneous and case-dependent, making it impossible to single out a specific cause.
Authors: Hossein Entezari Zarch, Abdulla Alshabanah, Chaoyi Jiang, Murali Annavaram
Abstract: Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of training data used to train these large models is growing exponentially, leading to substantial training hurdles. The training dataset contains two primary types of information: content-based information (features of users and items) and collaborative information (interactions between users and items). One approach to reduce the training dataset is to remove user-item interactions. But that significantly diminishes collaborative information, which is crucial for maintaining accuracy due to its inclusion of interaction histories. This loss profoundly impacts DLRM performance. This paper makes an important observation that if one can capture the user-item interaction history to enrich the user and item embeddings, then the interaction history can be compressed without losing model accuracy. Thus, this work, Collaborative Aware Data Compression (CADC), takes a two-step approach to training dataset compression. In the first step, we use matrix factorization of the user-item interaction matrix to create a novel embedding representation for both the users and items. Once the user and item embeddings are enriched by the interaction history information the approach then applies uniform random sampling of the training dataset to drastically reduce the training dataset size while minimizing model accuracy drop. The source code of CADC is available at \href{https://anonymous.4open.science/r/DSS-RM-8C1D/README.md}{https://anonymous.4open.science/r/DSS-RM-8C1D/README.md}.
URLs: https://anonymous.4open.science/r/DSS-RM-8C1D/README.md, https://anonymous.4open.science/r/DSS-RM-8C1D/README.md
Authors: Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan
Abstract: The rapid progress in machine learning (ML) has brought forth many large language models (LLMs) that excel in various tasks and areas. These LLMs come with different abilities and costs in terms of computation or pricing. Since the demand for each query can vary, e.g., because of the queried domain or its complexity, defaulting to one LLM in an application is not usually the best choice, whether it is the biggest, priciest, or even the one with the best average test performance. Consequently, picking the right LLM that is both accurate and cost-effective for an application remains a challenge. In this paper, we introduce MetaLLM, a framework that dynamically and intelligently routes each query to the optimal LLM (among several available LLMs) for classification tasks, achieving significantly improved accuracy and cost-effectiveness. By framing the selection problem as a multi-armed bandit, MetaLLM balances prediction accuracy and cost efficiency under uncertainty. Our experiments, conducted on popular LLM platforms such as OpenAI's GPT models, Amazon's Titan, Anthropic's Claude, and Meta's LLaMa, showcase MetaLLM's efficacy in real-world scenarios, laying the groundwork for future extensions beyond classification tasks.
Authors: Xuhong Wang, Haoyu Jiang, Yi Yu, Jingru Yu, Yilun Lin, Ping Yi, Yingchun Wang, Yu Qiao, Li Li, Fei-Yue Wang
Abstract: Large Language Models (LLMs) are increasingly integrated into diverse industries, posing substantial security risks due to unauthorized replication and misuse. To mitigate these concerns, robust identification mechanisms are widely acknowledged as an effective strategy. Identification systems for LLMs now rely heavily on watermarking technology to manage and protect intellectual property and ensure data security. However, previous studies have primarily concentrated on the basic principles of algorithms and lacked a comprehensive analysis of watermarking theory and practice from the perspective of intelligent identification. To bridge this gap, firstly, we explore how a robust identity recognition system can be effectively implemented and managed within LLMs by various participants using watermarking technology. Secondly, we propose a mathematical framework based on mutual information theory, which systematizes the identification process to achieve more precise and customized watermarking. Additionally, we present a comprehensive evaluation of performance metrics for LLM watermarking, reflecting participant preferences and advancing discussions on its identification applications. Lastly, we outline the existing challenges in current watermarking technologies and theoretical frameworks, and provide directional guidance to address these challenges. Our systematic classification and detailed exposition aim to enhance the comparison and evaluation of various methods, fostering further research and development toward a transparent, secure, and equitable LLM ecosystem.
Authors: Chung Park, Taesan Kim, Hyungjun Yoon, Junui Hong, Yelim Yu, Mincheol Cho, Minsung Choi, Jaegul Choo
Abstract: Cross-Domain Sequential Recommendation (CDSR) improves recommendation performance by utilizing information from multiple domains, which contrasts with Single-Domain Sequential Recommendation (SDSR) that relies on a historical interaction within a specific domain. However, CDSR may underperform compared to the SDSR approach in certain domains due to negative transfer, which occurs when there is a lack of relation between domains or different levels of data sparsity. To address the issue of negative transfer, our proposed CDSR model estimates the degree of negative transfer of each domain and adaptively assigns it as a weight factor to the prediction loss, to control gradient flows through domains with significant negative transfer. To this end, our model compares the performance of a model trained on multiple domains (CDSR) with a model trained solely on the specific domain (SDSR) to evaluate the negative transfer of each domain using our asymmetric cooperative network. In addition, to facilitate the transfer of valuable cues between the SDSR and CDSR tasks, we developed an auxiliary loss that maximizes the mutual information between the representation pairs from both tasks on a per-domain basis. This cooperative learning between SDSR and CDSR tasks is similar to the collaborative dynamics between pacers and runners in a marathon. Our model outperformed numerous previous works in extensive experiments on two real-world industrial datasets across ten service domains. We also have deployed our model in the recommendation system of our personal assistant app service, resulting in 21.4% increase in click-through rate compared to existing models, which is valuable to real-world business.
Authors: Shubham Vatsal, Harsh Dubey
Abstract: Large language models (LLMs) have shown remarkable performance on many different Natural Language Processing (NLP) tasks. Prompt engineering plays a key role in adding more to the already existing abilities of LLMs to achieve significant performance gains on various NLP tasks. Prompt engineering requires composing natural language instructions called prompts to elicit knowledge from LLMs in a structured way. Unlike previous state-of-the-art (SoTA) models, prompt engineering does not require extensive parameter re-training or fine-tuning based on the given NLP task and thus solely operates on the embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently extract LLMs' knowledge through a basic natural language conversational exchange or prompt engineering, allowing more and more people even without deep mathematical machine learning background to experiment with LLMs. With prompt engineering gaining popularity in the last two years, researchers have come up with numerous engineering techniques around designing prompts to improve accuracy of information extraction from the LLMs. In this paper, we summarize different prompting techniques and club them together based on different NLP tasks that they have been used for. We further granularly highlight the performance of these prompting strategies on various datasets belonging to that NLP task, talk about the corresponding LLMs used, present a taxonomy diagram and discuss the possible SoTA for specific datasets. In total, we read and present a survey of 44 research papers which talk about 39 different prompting methods on 29 different NLP tasks of which most of them have been published in the last two years.
Authors: Sourojit Ghosh, Pranav Narayanan Venkit, Sanjana Gautam, Shomir Wilson, Aylin Caliskan
Abstract: Our research investigates the impact of Generative Artificial Intelligence (GAI) models, specifically text-to-image generators (T2Is), on the representation of non-Western cultures, with a focus on Indian contexts. Despite the transformative potential of T2Is in content creation, concerns have arisen regarding biases that may lead to misrepresentations and marginalizations. Through a community-centered approach and grounded theory analysis of 5 focus groups from diverse Indian subcultures, we explore how T2I outputs to English prompts depict Indian culture and its subcultures, uncovering novel representational harms such as exoticism and cultural misappropriation. These findings highlight the urgent need for inclusive and culturally sensitive T2I systems. We propose design guidelines informed by a sociotechnical perspective, aiming to address these issues and contribute to the development of more equitable and representative GAI technologies globally. Our work also underscores the necessity of adopting a community-centered approach to comprehend the sociotechnical dynamics of these models, complementing existing work in this space while identifying and addressing the potential negative repercussions and harms that may arise when these models are deployed on a global scale.
Authors: Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
Abstract: General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use preferences are changing over time. We observe a proliferation of AI-specific clauses to limit use, acute differences in restrictions on AI developers, as well as general inconsistencies between websites' expressed intentions in their Terms of Service and their robots.txt. We diagnose these as symptoms of ineffective web protocols, not designed to cope with the widespread re-purposing of the internet for AI. Our longitudinal analyses show that in a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources, rendering ~5%+ of all tokens in C4, or 28%+ of the most actively maintained, critical sources in C4, fully restricted from use. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted. If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems. We hope to illustrate the emerging crises in data consent, for both developers and creators. The foreclosure of much of the open web will impact not only commercial AI, but also non-commercial AI and academic research.
Authors: Desta Haileselassie Hagos, Rick Battle, Danda B. Rawat
Abstract: The emergence of Generative Artificial Intelligence (AI) and Large Language Models (LLMs) has marked a new era of Natural Language Processing (NLP), introducing unprecedented capabilities that are revolutionizing various domains. This paper explores the current state of these cutting-edge technologies, demonstrating their remarkable advancements and wide-ranging applications. Our paper contributes to providing a holistic perspective on the technical foundations, practical applications, and emerging challenges within the evolving landscape of Generative AI and LLMs. We believe that understanding the generative capabilities of AI systems and the specific context of LLMs is crucial for researchers, practitioners, and policymakers to collaboratively shape the responsible and ethical integration of these technologies into various domains. Furthermore, we identify and address main research gaps, providing valuable insights to guide future research endeavors within the AI research community.
Authors: Liangyan Jiang, Chuang Zhu, Yanxu Chen
Abstract: The spike camera, with its high temporal resolution, low latency, and high dynamic range, addresses high-speed imaging challenges like motion blur. It captures photons at each pixel independently, creating binary spike streams rich in temporal information but challenging for image reconstruction. Current algorithms, both traditional and deep learning-based, still need to be improved in the utilization of the rich temporal detail and the restoration of the details of the reconstructed image. To overcome this, we introduce Swin Spikeformer (SwinSF), a novel model for dynamic scene reconstruction from spike streams. SwinSF is composed of Spike Feature Extraction, Spatial-Temporal Feature Extraction, and Final Reconstruction Module. It combines shifted window self-attention and proposed temporal spike attention, ensuring a comprehensive feature extraction that encapsulates both spatial and temporal dynamics, leading to a more robust and accurate reconstruction of spike streams. Furthermore, we build a new synthesized dataset for spike image reconstruction which matches the resolution of the latest spike camera, ensuring its relevance and applicability to the latest developments in spike camera imaging. Experimental results demonstrate that the proposed network SwinSF sets a new benchmark, achieving state-of-the-art performance across a series of datasets, including both real-world and synthesized data across various resolutions. Our codes and proposed dataset will be available soon.
Authors: Dario Pasquini, Evgenios M. Kornaropoulos, Giuseppe Ateniese
Abstract: We introduce LLMmap, a first-generation fingerprinting attack targeted at LLM-integrated applications. LLMmap employs an active fingerprinting approach, sending carefully crafted queries to the application and analyzing the responses to identify the specific LLM model in use. With as few as 8 interactions, LLMmap can accurately identify LLMs with over 95% accuracy. More importantly, LLMmap is designed to be robust across different application layers, allowing it to identify LLMs operating under various system prompts, stochastic sampling hyperparameters, and even complex generation frameworks such as RAG or Chain-of-Thought.
Authors: Letian Gong, Huaiyu Wan, Shengnan Guo, Xiucheng Li, Yan Lin, Erwen Zheng, Tianyi Wang, Zeyu Zhou, Youfang Lin
Abstract: The rapid growth of location-based services (LBS) has yielded massive amounts of data on human mobility. Effectively extracting meaningful representations for user-generated check-in sequences is pivotal for facilitating various downstream services. However, the user-generated check-in data are simultaneously influenced by the surrounding objective circumstances and the user's subjective intention. Specifically, the temporal uncertainty and spatial diversity exhibited in check-in data make it difficult to capture the macroscopic spatial-temporal patterns of users and to understand the semantics of user mobility activities. Furthermore, the distinct characteristics of the temporal and spatial information in check-in sequences call for an effective fusion method to incorporate these two types of information. In this paper, we propose a novel Spatial-Temporal Cross-view Contrastive Representation (STCCR) framework for check-in sequence representation learning. Specifically, STCCR addresses the above challenges by employing self-supervision from "spatial topic" and "temporal intention" views, facilitating effective fusion of spatial and temporal information at the semantic level. Besides, STCCR leverages contrastive clustering to uncover users' shared spatial topics from diverse mobility activities, while employing angular momentum contrast to mitigate the impact of temporal uncertainty and noise. We extensively evaluate STCCR on three real-world datasets and demonstrate its superior performance across three downstream tasks.
Authors: Maciej Wodzi\'nski, Marcin Rz\k{a}deczka, Anastazja Szu{\l}a, Marta Sok\'o{\l}, Marcin Moskalewicz
Abstract: Avoiding systemic discrimination requires investigating AI models' potential to propagate stereotypes resulting from the inherent biases of training datasets. Our study investigated how text-to-image models unintentionally perpetuate non-rational beliefs regarding autism. The research protocol involved generating images based on 53 prompts aimed at visualizing concrete objects and abstract concepts related to autism across four models: DALL-E, Stable Diffusion, SDXL, and Midjourney (N=249). Expert assessment of results was performed via a framework of 10 deductive codes representing common stereotypes contested by the community regarding their presence and spatial intensity, quantified on ordinal scales and subject to statistical analysis of inter-rater reliability and size effects. The models frequently utilised controversial themes and symbols which were unevenly distributed, however, with striking homogeneity in terms of skin colour, gender, and age, with autistic individuals portrayed as engaged in solitary activities, interacting with objects rather than people, and displaying stereotypical emotional expressions such as pale, anger, or sad. Secondly we observed representational insensitivity regarding autism images despite directional prompting aimed at falsifying the above results. Additionally, DALL-E explicitly denied perpetuating stereotypes. We interpret this as ANNs mirroring the human cognitive architecture regarding the discrepancy between background and reflective knowledge, as justified by our previous research on autism-related stereotypes in humans.
Authors: Wenbo Huang, Jinghui Zhang, Xuwei Qian, Zhen Wu, Meng Wang, Lei Zhang
Abstract: High frame-rate (HFR) videos of action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, promoting few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we designed with such architecture refers to SOAP-Net. Temporal connections between different feature channels and spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.
Authors: Fang-Duo Tsai, Shih-Lun Wu, Haven Kim, Bo-Yu Chen, Hao-Chung Cheng, Yi-Hsuan Yang
Abstract: Text-to-music models allow users to generate nearly realistic musical audio with textual commands. However, editing music audios remains challenging due to the conflicting desiderata of performing fine-grained alterations on the audio while maintaining a simple user interface. To address this challenge, we propose Audio Prompt Adapter (or AP-Adapter), a lightweight addition to pretrained text-to-music models. We utilize AudioMAE to extract features from the input audio, and construct attention-based adapters to feedthese features into the internal layers of AudioLDM2, a diffusion-based text-to-music model. With 22M trainable parameters, AP-Adapter empowers users to harness both global (e.g., genre and timbre) and local (e.g., melody) aspects of music, using the original audio and a short text as inputs. Through objective and subjective studies, we evaluate AP-Adapter on three tasks: timbre transfer, genre transfer, and accompaniment generation. Additionally, we demonstrate its effectiveness on out-of-domain audios containing unseen instruments during training.
Authors: Chunyu Luo, Yi Zhou, Zhengren Wang, Mingyu Xiao
Abstract: A $k$-defective clique of an undirected graph $G$ is a subset of its vertices that induces a nearly complete graph with a maximum of $k$ missing edges. The maximum $k$-defective clique problem, which asks for the largest $k$-defective clique from the given graph, is important in many applications, such as social and biological network analysis. In the paper, we propose a new branching algorithm that takes advantage of the structural properties of the $k$-defective clique and uses the efficient maximum clique algorithm as a subroutine. As a result, the algorithm has a better asymptotic running time than the existing ones. We also investigate upper-bounding techniques and propose a new upper bound utilizing the \textit{conflict relationship} between vertex pairs. Because conflict relationship is common in many graph problems, we believe that this technique can be potentially generalized. Finally, experiments show that our algorithm outperforms state-of-the-art solvers on a wide range of open benchmarks.