Authors: Junchao Gong, Tao Han, Kang Chen, Lei Bai
Abstract: Numerical Weather Prediction (NWP) system is an infrastructure that exerts considerable impacts on modern society.Traditional NWP system, however, resolves it by solving complex partial differential equations with a huge computing cluster, resulting in tons of carbon emission. Exploring efficient and eco-friendly solutions for NWP attracts interest from Artificial Intelligence (AI) and earth science communities. To narrow the performance gap between the AI-based methods and physic predictor, this work proposes a new transformer-based NWP framework, termed as WeatherFormer, to model the complex spatio-temporal atmosphere dynamics and empowering the capability of data-driven NWP. WeatherFormer innovatively introduces the space-time factorized transformer blocks to decrease the parameters and memory consumption, in which Position-aware Adaptive Fourier Neural Operator (PAFNO) is proposed for location sensible token mixing. Besides, two data augmentation strategies are utilized to boost the performance and decrease training consumption. Extensive experiments on WeatherBench dataset show WeatherFormer achieves superior performance over existing deep learning methods and further approaches the most advanced physical model.
Authors: Ville Heilala, Roberto Araya, Raija H\"am\"al\"ainen
Abstract: Generative artificial intelligence (GenAI) can reshape education and learning. While large language models (LLMs) like ChatGPT dominate current educational research, multimodal capabilities, such as text-to-speech and text-to-image, are less explored. This study uses topic modeling to map the research landscape of multimodal and generative AI in education. An extensive literature search using Dimensions.ai yielded 4175 articles. Employing a topic modeling approach, latent topics were extracted, resulting in 38 interpretable topics organized into 14 thematic areas. Findings indicate a predominant focus on text-to-text models in educational contexts, with other modalities underexplored, overlooking the broader potential of multimodal approaches. The results suggest a research gap, stressing the importance of more balanced attention across different AI modalities and educational levels. In summary, this research provides an overview of current trends in generative AI for education, underlining opportunities for future exploration of multimodal technologies to fully realize the transformative potential of artificial intelligence in education.
Authors: Jiho Lee, Nisar R. Ahmed, Kyle H. Wray, Zachary N. Sunberg
Abstract: Partially Observable Markov Decision Processes (POMDPs) provide a structured framework for decision-making under uncertainty, but their application requires efficient belief updates. Sequential Importance Resampling Particle Filters (SIRPF), also known as Bootstrap Particle Filters, are commonly used as belief updaters in large approximate POMDP solvers, but they face challenges such as particle deprivation and high computational costs as the system's state dimension grows. To address these issues, this study introduces Rao-Blackwellized POMDP (RB-POMDP) approximate solvers and outlines generic methods to apply Rao-Blackwellization in both belief updates and online planning. We compare the performance of SIRPF and Rao-Blackwellized Particle Filters (RBPF) in a simulated localization problem where an agent navigates toward a target in a GPS-denied environment using POMCPOW and RB-POMCPOW planners. Our results not only confirm that RBPFs maintain accurate belief approximations over time with fewer particles, but, more surprisingly, RBPFs combined with quadrature-based integration improve planning quality significantly compared to SIRPF-based planning under the same computational limits.
Authors: Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
Abstract: Medication errors significantly threaten patient safety, leading to adverse drug events and substantial economic burdens on healthcare systems. Clinical Decision Support Systems (CDSSs) aimed at mitigating these errors often face limitations, including reliance on static databases and rule-based algorithms, which can result in high false alert rates and alert fatigue among clinicians. This paper introduces HELIOT, an innovative CDSS for drug allergy management, integrating Large Language Models (LLMs) with a comprehensive pharmaceutical data repository. HELIOT leverages advanced natural language processing capabilities to interpret complex medical texts and synthesize unstructured data, overcoming the limitations of traditional CDSSs. An empirical evaluation using a synthetic patient dataset and expert-verified ground truth demonstrates HELIOT's high accuracy, precision, recall, and F1 score, uniformly reaching 100\% across multiple experimental runs. The results underscore HELIOT's potential to enhance decision support in clinical settings, offering a scalable, efficient, and reliable solution for managing drug allergies.
Authors: Xuhui Zhou, Hyunwoo Kim, Faeze Brahman, Liwei Jiang, Hao Zhu, Ximing Lu, Frank Xu, Bill Yuchen Lin, Yejin Choi, Niloofar Mireshghallah, Ronan Le Bras, Maarten Sap
Abstract: AI agents are increasingly autonomous in their interactions with human users and tools, leading to increased interactional safety risks. We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions. HAICOSYSTEM features a modular sandbox environment that simulates multi-turn interactions between human users and AI agents, where the AI agents are equipped with a variety of tools (e.g., patient management platforms) to navigate diverse scenarios (e.g., a user attempting to access other patients' profiles). To examine the safety of AI agents in these interactions, we develop a comprehensive multi-dimensional evaluation framework that uses metrics covering operational, content-related, societal, and legal risks. Through running 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education), we demonstrate that HAICOSYSTEM can emulate realistic user-AI interactions and complex tool use by AI agents. Our experiments show that state-of-the-art LLMs, both proprietary and open-sourced, exhibit safety risks in over 50\% cases, with models generally showing higher risks when interacting with simulated malicious users. Our findings highlight the ongoing challenge of building agents that can safely navigate complex interactions, particularly when faced with malicious users. To foster the AI agent safety ecosystem, we release a code platform that allows practitioners to create custom scenarios, simulate interactions, and evaluate the safety and performance of their agents.
Authors: Amin Zakaie Far, Mohammad Zakaie Far, Sonia Gharibzadeh, Shiva Zangeneh, Leila Amini, Morteza Rahimi
Abstract: The accelerated expansion of the Internet of Things (IoT) has raised critical challenges associated with privacy, security, and data integrity, specifically in infrastructures such as smart cities or smart manufacturing. Blockchain technology provides immutable, scalable, and decentralized solutions to address these challenges, and integrating deep reinforcement learning (DRL) into the IoT environment offers enhanced adaptability and decision-making. This paper investigates the integration of blockchain and DRL to optimize mobile transmission and secure data exchange in IoT-assisted smart cities. Through the clustering and categorization of IoT application systems, the combination of DRL and blockchain is shown to enhance the performance of IoT networks by maintaining privacy and security. Based on the review of papers published between 2015 and 2024, we have classified the presented approaches and offered practical taxonomies, which provide researchers with critical perspectives and highlight potential areas for future exploration and research. Our investigation shows how combining blockchain's decentralized framework with DRL can address privacy and security issues, improve mobile transmission efficiency, and guarantee robust, privacy-preserving IoT systems. Additionally, we explore blockchain integration for DRL and outline the notable applications of DRL technology. By addressing the challenges of machine learning and blockchain integration, this study proposes novel perspectives for researchers and serves as a foundational exploration from an interdisciplinary standpoint.
Authors: Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, William M. Campbell
Abstract: Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique via instruction-tuning the pre-trained encoder-decoder large language models (LLM) under the dual-encoder retrieval framework. We demonstrate the corpus representation can be augmented by the representations of relevant synthetic queries generated by the instruct-tuned LLM founded on the Rao-Blackwell theorem. Furthermore, we effectively align the query and corpus text representation with self-instructed-tuning. Specifically, we first prompt an open-box pre-trained LLM to follow defined instructions (i.e. question generation and keyword summarization) to generate synthetic queries. Next, we fine-tune the pre-trained LLM with defined instructions and the generated queries that passed quality check. Finally, we generate synthetic queries with the instruction-tuned LLM for each corpora and represent each corpora by weighted averaging the synthetic queries and original corpora embeddings. We evaluate our proposed method under low-resource settings on three English and one German retrieval datasets measuring NDCG@10, MRR@100, Recall@100. We significantly improve the average zero-shot retrieval performance on all metrics, increasing open-box FLAN-T5 model variations by [3.34%, 3.50%] in absolute and exceeding three competitive dense retrievers (i.e. mDPR, T-Systems, mBART-Large), with model of size at least 38% smaller, by 1.96%, 4.62%, 9.52% absolute on NDCG@10.
Authors: Mengchen Liu, Qixiu Li, Dongdong Chen, Dong Chen, Jianmin Bao, Yunsheng Li
Abstract: With the release of GPT-4V(O), its use in generating pseudo labels for multi-modality tasks has gained significant popularity. However, it is still a secret how to build such advanced models from its base large language models (LLMs). This work explores the potential of using LLMs alone for data generation and develop competitive multi-modality models focusing on chart understanding. We construct a large-scale chart dataset, SynChart, which contains approximately 4 million diverse chart images with over 75 million dense annotations, including data tables, code, descriptions, and question-answer sets. We trained a 4.2B chart-expert model using this dataset and achieve near-GPT-4O performance on the ChartQA task, surpassing GPT-4V.
Authors: Zihao Jing
Abstract: With the process of urbanization and the rapid growth of population, the issue of traffic congestion has become an increasingly critical concern. Intelligent transportation systems heavily rely on real-time and precise prediction algorithms to address this problem. While Recurrent Neural Network (RNN) and Graph Convolutional Network (GCN) methods in deep learning have demonstrated high accuracy in predicting road conditions when sufficient data is available, forecasting in road networks with limited data remains a challenging task. This study proposed a novel Spatial-temporal Convolutional Network (TL-GPSTGN) based on graph pruning and transfer learning framework to tackle this issue. Firstly, the essential structure and information of the graph are extracted by analyzing the correlation and information entropy of the road network structure and feature data. By utilizing graph pruning techniques, the adjacency matrix of the graph and the input feature data are processed, resulting in a significant improvement in the model's migration performance. Subsequently, the well-characterized data are inputted into the spatial-temporal graph convolutional network to capture the spatial-temporal relationships and make predictions regarding the road conditions. Furthermore, this study conducts comprehensive testing and validation of the TL-GPSTGN method on real datasets, comparing its prediction performance against other commonly used models under identical conditions. The results demonstrate the exceptional predictive accuracy of TL-GPSTGN on a single dataset, as well as its robust migration performance across different datasets.
Authors: Yuanchang Luo, Jiaxin Guo, Daimeng Wei, Hengchao Shang, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Hao Yang
Abstract: This report outlines our approach for the WMT24 Discourse-Level Literary Translation Task, focusing on the Chinese-English language pair in the Constrained Track. Translating literary texts poses significant challenges due to the nuanced meanings, idiomatic expressions, and intricate narrative structures inherent in such works. To address these challenges, we leveraged the Chinese-Llama2 model, specifically enhanced for this task through a combination of Continual Pre-training (CPT) and Supervised Fine-Tuning (SFT). Our methodology includes a novel Incremental Decoding framework, which ensures that each sentence is translated with consideration of its broader context, maintaining coherence and consistency throughout the text. This approach allows the model to capture long-range dependencies and stylistic elements, producing translations that faithfully preserve the original literary quality. Our experiments demonstrate significant improvements in both sentence-level and document-level BLEU scores, underscoring the effectiveness of our proposed framework in addressing the complexities of document-level literary translation.
Authors: Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun
Abstract: Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given drafts sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. Besides, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling...
Authors: Yishu Wei, Xindi Wang, Hanley Ong, Yiliang Zhou, Adam Flanders, George Shih, Yifan Peng
Abstract: Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent them from practical applications. Among these are the constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning with datasets using synthetic labels. Two tasks are jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT4- o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 v.s. 0.63) when calibrated against curated labels, indicating the strong inherent underlying capability of the model. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.
Authors: Xin Jing, Yichen Jing, Yuhuan Lu, Bangchao Deng, Xueqin Chen, Dingqi Yang
Abstract: The rapid spread of diverse information on online social platforms has prompted both academia and industry to realize the importance of predicting content popularity, which could benefit a wide range of applications, such as recommendation systems and strategic decision-making. Recent works mainly focused on extracting spatiotemporal patterns inherent in the information diffusion process within a given observation period so as to predict its popularity over a future period of time. However, these works often overlook the future popularity trend, as future popularity could either increase exponentially or stagnate, introducing uncertainties to the prediction performance. Additionally, how to transfer the preceding-term dynamics learned from the observed diffusion process into future-term trends remains an unexplored challenge. Against this background, we propose CasFT, which leverages observed information Cascades and dynamic cues extracted via neural ODEs as conditions to guide the generation of Future popularity-increasing Trends through a diffusion model. These generated trends are then combined with the spatiotemporal patterns in the observed information cascade to make the final popularity prediction. Extensive experiments conducted on three real-world datasets demonstrate that CasFT significantly improves the prediction accuracy, compared to state-of-the-art approaches, yielding 2.2%-19.3% improvement across different datasets.
Authors: Esteban Aldana Guerra
Abstract: Monte Carlo Tree Search (MCTS) is a powerful algorithm for solving complex decision-making problems. This paper presents an optimized MCTS implementation applied to the FrozenLake environment, a classic reinforcement learning task characterized by stochastic transitions. The optimization leverages cumulative reward and visit count tables along with the Upper Confidence Bound for Trees (UCT) formula, resulting in efficient learning in a slippery grid world. We benchmark our implementation against other decision-making algorithms, including MCTS with Policy and Q-Learning, and perform a detailed comparison of their performance. The results demonstrate that our optimized approach effectively maximizes rewards and success rates while minimizing convergence time, outperforming baseline methods, especially in environments with inherent randomness.
Authors: Bhanuka Silva, Dishanika Denipitiyage, Suranga Seneviratne, Anirban Mahanti, Aruna Seneviratne
Abstract: While many online services provide privacy policies for end users to read and understand what personal data are being collected, these documents are often lengthy and complicated. As a result, the vast majority of users do not read them at all, leading to data collection under uninformed consent. Several attempts have been made to make privacy policies more user friendly by summarising them, providing automatic annotations or labels for key sections, or by offering chat interfaces to ask specific questions. With recent advances in Large Language Models (LLMs), there is an opportunity to develop more effective tools to parse privacy policies and help users make informed decisions. In this paper, we propose an entailment-driven LLM based framework to classify paragraphs of privacy policies into meaningful labels that are easily understood by users. The results demonstrate that our framework outperforms traditional LLM methods, improving the F1 score in average by 11.2%. Additionally, our framework provides inherently explainable and meaningful predictions.
Authors: Xin Jing, Yichen Jing, Yuhuan Lu, Bangchao Deng, Sikun Yang, Dingqi Yang
Abstract: Information popularity prediction is important yet challenging in various domains, including viral marketing and news recommendations. The key to accurately predicting information popularity lies in subtly modeling the underlying temporal information diffusion process behind observed events of an information cascade, such as the retweets of a tweet. To this end, most existing methods either adopt recurrent networks to capture the temporal dynamics from the first to the last observed event or develop a statistical model based on self-exciting point processes to make predictions. However, information diffusion is intrinsically a complex continuous-time process with irregularly observed discrete events, which is oversimplified using recurrent networks as they fail to capture the irregular time intervals between events, or using self-exciting point processes as they lack flexibility to capture the complex diffusion process. Against this background, we propose ConCat, modeling the Continuous-time dynamics of Cascades for information popularity prediction. On the one hand, it leverages neural Ordinary Differential Equations (ODEs) to model irregular events of a cascade in continuous time based on the cascade graph and sequential event information. On the other hand, it considers cascade events as neural temporal point processes (TPPs) parameterized by a conditional intensity function which can also benefit the popularity prediction task. We conduct extensive experiments to evaluate ConCat on three real-world datasets. Results show that ConCat achieves superior performance compared to state-of-the-art baselines, yielding a 2.3%-33.2% improvement over the best-performing baselines across the three datasets.
Authors: Sungjune Park, Daeseon Choi
Abstract: This paper proposes a novel prompt engineering technique called Judgment of Thought (JoT) that is specifically tailored for binary logical reasoning tasks. JoT employs three roles$\unicode{x2014}$lawyer, prosecutor, and judge$\unicode{x2014}$to facilitate more reliable and accurate reasoning by the model. In this framework, the judge utilizes a high$\unicode{x2010}$level model, while the lawyer and prosecutor utilize low$\unicode{x2010}$level models. This structure helps the judge better understand the responses from both the lawyer and prosecutor, enabling a more accurate judgment. Experimental results on large language model (LLM) benchmark datasets, such as BigBenchHard and Winogrande, demonstrate that JoT outperforms existing methods, including Chain of Thought (CoT) and Self$\unicode{x2010}$Consistency (SC), in binary logical reasoning tasks. Additionally, in real$\unicode{x2010}$world tasks, such as Fake News Detection and SMS Spam Detection, JoT shows comparable or improved performance compared to existing techniques. JoT significantly enhances the accuracy and reliability of models in binary reasoning tasks and show potential for practical applicability across various domains. Future research should aim to further broaden the applicability of JoT and optimize its implementation for real$\unicode{x2010}$world problem$\unicode{x2010}$solving.
Authors: Dayuan Fu, Biqing Qi, Yihuai Gao, Che Jiang, Guanting Dong, Bowen Zhou
Abstract: Long-term memory is significant for agents, in which insights play a crucial role. However, the emergence of irrelevant insight and the lack of general insight can greatly undermine the effectiveness of insight. To solve this problem, in this paper, we introduce Multi-Scale Insight Agent (MSI-Agent), an embodied agent designed to improve LLMs' planning and decision-making ability by summarizing and utilizing insight effectively across different scales. MSI achieves this through the experience selector, insight generator, and insight selector. Leveraging a three-part pipeline, MSI can generate task-specific and high-level insight, store it in a database, and then use relevant insight from it to aid in decision-making. Our experiments show that MSI outperforms another insight strategy when planning by GPT3.5. Moreover, We delve into the strategies for selecting seed experience and insight, aiming to provide LLM with more useful and relevant insight for better decision-making. Our observations also indicate that MSI exhibits better robustness when facing domain-shifting scenarios.
Authors: Romain Xu-Darme (LSL), Aymeric Varasse (LSL), Alban Grastien (LSL), Julien Girard (LSL), Zakaria Chihani (LSL)
Abstract: In the field of explainable AI, a vibrant effort is dedicated to the design of self-explainable models, as a more principled alternative to post-hoc methods that attempt to explain the decisions after a model opaquely makes them. However, this productive line of research suffers from common downsides: lack of reproducibility, unfeasible comparison, diverging standards. In this paper, we propose CaBRNet, an open-source, modular, backward-compatible framework for Case-Based Reasoning Networks: https://github.com/aiser-team/cabrnet.
Authors: Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu
Abstract: Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.
Authors: Syed Mohd Faisal Malik, Md Tabrez Nafis, Mohd Abdul Ahad, Safdar Tanweer
Abstract: In contemporary healthcare, to protect patient data, electronic health records have become invaluable repositories, creating vast opportunities to leverage deep learning techniques for predictive analysis. Retinal fundus images, cirrhosis stages, and heart disease diagnostic predictions have shown promising results through the integration of deep learning techniques for classifying diverse datasets. This study proposes a novel deep learning predictive analysis framework for classifying multiple datasets by pre-processing data from three distinct sources. A hybrid deep learning model combining Residual Networks and Artificial Neural Networks is proposed to detect acute and chronic diseases such as heart diseases, cirrhosis, and retinal conditions, outperforming existing models. Dataset preparation involves aspects such as categorical data transformation, dimensionality reduction, and missing data synthesis. Feature extraction is effectively performed using scaler transformation for categorical datasets and ResNet architecture for image datasets. The resulting features are integrated into a unified classification model. Rigorous experimentation and evaluation resulted in high accuracies of 93%, 99%, and 95% for retinal fundus images, cirrhosis stages, and heart disease diagnostic predictions, respectively. The efficacy of the proposed method is demonstrated through a detailed analysis of F1-score, precision, and recall metrics. This study offers a comprehensive exploration of methodologies and experiments, providing in-depth knowledge of deep learning predictive analysis in electronic health records.
Authors: Ning Sun, Yufei Wang, Yuwei Zhang, Jixiang Wan, Shenyue Wang, Ping Liu, Xudong Zhang
Abstract: Human Activity Recognition (HAR) has gained great attention from researchers due to the popularity of mobile devices and the need to observe users' daily activity data for better human-computer interaction. In this work, we collect a human activity recognition dataset called OPPOHAR consisting of phone IMU data. To facilitate the employment of HAR system in mobile phone and to achieve user-specific activity recognition, we propose a novel light-weight network called Non-stationary BERT with a two-stage training method. We also propose a simple yet effective data augmentation method to explore the deeper relationship between the accelerator and gyroscope data from the IMU. The network achieves the state-of-the-art performance testing on various activity recognition datasets and the data augmentation method demonstrates its wide applicability.
Authors: Marc-Antoine Allard, Matin Ansaripour, Maria Yuffa, Paul Teiletche
Abstract: Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs). To address this issue, we developed LLaMa-SciQ, an educational chatbot designed to assist college students in solving and understanding MCQs in STEM fields. We begin by fine-tuning and aligning the models to human preferences. After comparing the performance of Mistral-7B and LLaMa-8B, we selected the latter as the base model due to its higher evaluation accuracy. To further enhance accuracy, we implement Retrieval-Augmented Generation (RAG) and apply quantization to compress the model, reducing inference time and increasing accessibility for students. For mathematical reasoning, LLaMa-SciQ achieved 74.5% accuracy on the GSM8k dataset and 30% on the MATH dataset. However, RAG does not improve performance and even reduces it, likely due to retriever issues or the model's unfamiliarity with context. Despite this, the quantized model shows only a 5% loss in performance, demonstrating significant efficiency improvements.
Authors: Ujjawal Sharma, Madhav Biyani, Akhil Dev Suresh, Debi Prasad Bhuyan, Saroj Kanta Mishra, Tanmoy Chakraborty
Abstract: Reliable prediction of the All India Summer Monsoon Rainfall (AISMR) is pivotal for informed policymaking for the country, impacting the lives of billions of people. However, accurate simulation of AISMR has been a persistent challenge due to the complex interplay of various muti-scale factors and the inherent variability of the monsoon system. This research focuses on adapting and fine-tuning the latest LLM model, PatchTST, to accurately predict AISMR with a lead time of three months. The fine-tuned PatchTST model, trained with historical AISMR data, the Ni\~no3.4 index, and categorical Indian Ocean Dipole values, outperforms several popular neural network models and statistical models. This fine-tuned LLM model exhibits an exceptionally low RMSE percentage of 0.07% and a Spearman correlation of 0.976. This is particularly impressive, since it is nearly 80% more accurate than the best-performing NN models. The model predicts an above-normal monsoon for the year 2024, with an accumulated rainfall of 921.6 mm in the month of June-September for the entire country.
Authors: Purin Sukpanichnant, Anna Rapberger, Francesca Toni
Abstract: Peer review is an essential process to determine the quality of papers submitted to scientific conferences or journals. However, it is subjective and prone to biases. Several studies have been conducted to apply techniques from NLP to support peer review, but they are based on black-box techniques and their outputs are difficult to interpret and trust. In this paper, we propose a novel pipeline to support and understand the reviewing and decision-making processes of peer review: the PeerArg system combining LLMs with methods from knowledge representation. PeerArg takes in input a set of reviews for a paper and outputs the paper acceptance prediction. We evaluate the performance of the PeerArg pipeline on three different datasets, in comparison with a novel end-2-end LLM that uses few-shot learning to predict paper acceptance given reviews. The results indicate that the end-2-end LLM is capable of predicting paper acceptance from reviews, but a variant of the PeerArg pipeline outperforms this LLM.
Authors: Jonathan H. Rystr{\o}m, Kenneth C. Enevoldsen
Abstract: Cultural AI benchmarks often rely on implicit assumptions about measured constructs, leading to vague formulations with poor validity and unclear interrelations. We propose exposing these assumptions using explicit cognitive models formulated as Structural Equation Models. Using cross-lingual alignment transfer as an example, we show how this approach can answer key research questions and identify missing datasets. This framework grounds benchmark construction theoretically and guides dataset development to improve construct measurement. By embracing transparency, we move towards more rigorous, cumulative AI evaluation science, challenging researchers to critically examine their assessment foundations.
Authors: Xiao Chi
Abstract: Mediation is often treated as an extension of negotiation, without taking into account the unique role that norms and facts play in legal mediation. Additionally, current approaches for updating argument acceptability in response to changing variables frequently require the introduction of new arguments or the removal of existing ones, which can be inefficient and cumbersome in decision-making processes within legal disputes. In this paper, our contribution is two-fold. First, we introduce a QuAM (Quantitative Argumentation Mediate) framework, which integrates the parties' knowledge and the mediator's knowledge, including facts and legal norms, when determining the acceptability of a mediation goal. Second, we develop a new formalism to model the relationship between the acceptability of a goal argument and the values assigned to a variable associated with the argument. We use a real-world legal mediation as a running example to illustrate our approach.
Authors: Shunyu Yao, Fei Liu, Xi Lin, Zhichao Lu, Zhenkun Wang, Qingfu Zhang
Abstract: Heuristics are commonly used to tackle diverse search and optimization problems. Design heuristics usually require tedious manual crafting with domain knowledge. Recent works have incorporated large language models (LLMs) into automatic heuristic search leveraging their powerful language and coding capacity. However, existing research focuses on the optimal performance on the target problem as the sole objective, neglecting other criteria such as efficiency and scalability, which are vital in practice. To tackle this challenge, we propose to model heuristic search as a multi-objective optimization problem and consider introducing other practical criteria beyond optimal performance. Due to the complexity of the search space, conventional multi-objective optimization methods struggle to effectively handle multi-objective heuristic search. We propose the first LLM-based multi-objective heuristic search framework, Multi-objective Evolution of Heuristic (MEoH), which integrates LLMs in a zero-shot manner to generate a non-dominated set of heuristics to meet multiple design criteria. We design a new dominance-dissimilarity mechanism for effective population management and selection, which incorporates both code dissimilarity in the search space and dominance in the objective space. MEoH is demonstrated in two well-known combinatorial optimization problems: the online Bin Packing Problem (BPP) and the Traveling Salesman Problem (TSP). Results indicate that a variety of elite heuristics are automatically generated in a single run, offering more trade-off options than existing methods. It successfully achieves competitive or superior performance while improving efficiency up to 10 times. Moreover, we also observe that the multi-objective search introduces novel insights into heuristic design and leads to the discovery of diverse heuristics.
Authors: Haocheng Lin
Abstract: The popularisation of applying AI in businesses poses significant challenges relating to ethical principles, governance, and legal compliance. Although businesses have embedded AI into their day-to-day processes, they lack a unified approach for mitigating its potential risks. This paper introduces a framework ensuring that AI must be ethical, controllable, viable, and desirable. Balancing these factors ensures the design of a framework that addresses its trade-offs, such as balancing performance against explainability. A successful framework provides practical advice for businesses to meet regulatory requirements in sectors such as finance and healthcare, where it is critical to comply with standards like GPDR and the EU AI Act. Different case studies validate this framework by integrating AI in both academic and practical environments. For instance, large language models are cost-effective alternatives for generating synthetic opinions that emulate attitudes to environmental issues. These case studies demonstrate how having a structured framework could enhance transparency and maintain performance levels as shown from the alignment between synthetic and expected distributions. This alignment is quantified using metrics like Chi-test scores, normalized mutual information, and Jaccard indexes. Future research should explore the framework's empirical validation in diverse industrial settings further, ensuring the model's scalability and adaptability.
Authors: Xusen Guo, Xinxi Yang, Mingxing Peng, Hongliang Lu, Meixin Zhu, Hai Yang
Abstract: Developing efficient traffic models is essential for optimizing transportation systems, yet current approaches remain time-intensive and susceptible to human errors due to their reliance on manual processes. Traditional workflows involve exhaustive literature reviews, formula optimization, and iterative testing, leading to inefficiencies in research. In response, we introduce the Traffic Research Agent (TR-Agent), an AI-driven system designed to autonomously develop and refine traffic models through an iterative, closed-loop process. Specifically, we divide the research pipeline into four key stages: idea generation, theory formulation, theory evaluation, and iterative optimization; and construct TR-Agent with four corresponding modules: Idea Generator, Code Generator, Evaluator, and Analyzer. Working in synergy, these modules retrieve knowledge from external resources, generate novel ideas, implement and debug models, and finally assess them on the evaluation datasets. Furthermore, the system continuously refines these models based on iterative feedback, enhancing research efficiency and model performance. Experimental results demonstrate that TR-Agent achieves significant performance improvements across multiple traffic models, including the Intelligent Driver Model (IDM) for car following, the MOBIL lane-changing model, and the Lighthill-Whitham-Richards (LWR) traffic flow model. Additionally, TR-Agent provides detailed explanations for its optimizations, allowing researchers to verify and build upon its improvements easily. This flexibility makes the framework a powerful tool for researchers in transportation and beyond. To further support research and collaboration, we have open-sourced both the code and data used in our experiments, facilitating broader access and enabling continued advancements in the field.
Authors: Jaeyoung Huh, Paul Klein, Gareth Funka-Lea, Puneet Sharma, Ankur Kapoor, Young-Ho Kim
Abstract: Intra-cardiac Echocardiography (ICE) is a crucial imaging modality used in electrophysiology (EP) and structural heart disease (SHD) interventions, providing real-time, high-resolution views from within the heart. Despite its advantages, effective manipulation of the ICE catheter requires significant expertise, which can lead to inconsistent outcomes, particularly among less experienced operators. To address this challenge, we propose an AI-driven closed-loop view guidance system with human-in-the-loop feedback, designed to assist users in navigating ICE imaging without requiring specialized knowledge. Our method models the relative position and orientation vectors between arbitrary views and clinically defined ICE views in a spatial coordinate system, guiding users on how to manipulate the ICE catheter to transition from the current view to the desired view over time. Operating in a closed-loop configuration, the system continuously predicts and updates the necessary catheter manipulations, ensuring seamless integration into existing clinical workflows. The effectiveness of the proposed system is demonstrated through a simulation-based evaluation, achieving an 89% success rate with the 6532 test dataset, highlighting its potential to improve the accuracy and efficiency of ICE imaging procedures.
Authors: Wenhao Liu, Siyu An, Junru Lu, Muling Wu, Tianlong Li, Xiaohua Wang, Xiaoqing Zheng, Di Yin, Xing Sun, Xuanjing Huang
Abstract: Role-Playing Agents (RPAs) have shown remarkable performance in various applications, yet they often struggle to recognize and appropriately respond to hard queries that conflict with their role-play knowledge. To investigate RPAs' performance when faced with different types of conflicting requests, we develop an evaluation benchmark that includes contextual knowledge conflicting requests, parametric knowledge conflicting requests, and non-conflicting requests to assess RPAs' ability to identify conflicts and refuse to answer appropriately without over-refusing. Through extensive evaluation, we find that most RPAs behave significant performance gaps toward different conflict requests. To elucidate the reasons, we conduct an in-depth representation-level analysis of RPAs under various conflict scenarios. Our findings reveal the existence of rejection regions and direct response regions within the model's forwarding representation, and thus influence the RPA's final response behavior. Therefore, we introduce a lightweight representation editing approach that conveniently shifts conflicting requests to the rejection region, thereby enhancing the model's refusal accuracy. The experimental results validate the effectiveness of our editing method, improving RPAs' refusal ability of conflicting requests while maintaining their general role-playing capabilities.
Authors: Yong-Siang Shih, Zach Zhao, Chenhao Niu, Bruce Iberg, James Sharpnack, Mirza Basim Baig
Abstract: For high-stakes online exams, it is important to detect potential rule violations to ensure the security of the test. In this study, we investigate the task of detecting whether test takers are looking away from the screen, as such behavior could be an indication that the test taker is consulting external resources. For asynchronous proctoring, the exam videos are recorded and reviewed by the proctors. However, when the length of the exam is long, it could be tedious for proctors to watch entire exam videos to determine the exact moments when test takers look away. We present an AI-assisted gaze detection system, which allows proctors to navigate between different video frames and discover video frames where the test taker is looking in similar directions. The system enables proctors to work more effectively to identify suspicious moments in videos. An evaluation framework is proposed to evaluate the system against human-only and ML-only proctoring, and a user study is conducted to gather feedback from proctors, aiming to demonstrate the effectiveness of the system.
Authors: Mario Bifulco, Luca Roversi
Abstract: In this study, we initially investigate the application of a hybrid classical-quantum classifier (HCQC) for sentiment analysis, comparing its performance against the classical CPLEX classifier and the Transformer architecture. Our findings indicate that while the HCQC underperforms relative to the Transformer in terms of classification accuracy, but it requires significantly less time to converge to a reasonably good approximate solution. This experiment also reveals a critical bottleneck in the HCQC, whose architecture is partially undisclosed by the D-Wave property. To address this limitation, we propose a novel algorithm based on the algebraic decomposition of QUBO models, which enhances the time the quantum processing unit can allocate to problem-solving tasks.
Authors: Bastiaan Bruinsma, Annika Fred\'en, Kajsa Hansson, Moa Johansson, Pasko Kisi\'c-Merino, Denitsa Saynova
Abstract: This paper examines the development of the Artificial Intelligence (AI) meta-debate in Sweden before and after the release of ChatGPT. From the perspective of agenda-setting theory, we propose that it is an elite outside of party politics that is leading the debate -- i.e. that the politicians are relatively silent when it comes to this rapid development. We also suggest that the debate has become more substantive and risk-oriented in recent years. To investigate this claim, we draw on an original dataset of elite-level documents from the early 2010s to the present, using op-eds published in a number of leading Swedish newspapers. By conducting a qualitative content analysis of these materials, our preliminary findings lend support to the expectation that an academic, rather than a political elite is steering the debate.
Authors: Lorenzo Fiaschi, Marco Cococcioni
Abstract: This work proposes a novel approach to the deep hierarchical classification task, i.e., the problem of classifying data according to multiple labels organized in a rigid parent-child structure. It consists in a multi-output deep neural network equipped with specific projection operators placed before each output layer. The design of such an architecture, called lexicographic hybrid deep neural network (LH-DNN), has been possible by combining tools from different and quite distant research fields: lexicographic multi-objective optimization, non-standard analysis, and deep learning. To assess the efficacy of the approach, the resulting network is compared against the B-CNN, a convolutional neural network tailored for hierarchical classification tasks, on the CIFAR10, CIFAR100 (where it has been originally and recently proposed before being adopted and tuned for multiple real-world applications) and Fashion-MNIST benchmarks. Evidence states that an LH-DNN can achieve comparable if not superior performance, especially in the learning of the hierarchical relations, in the face of a drastic reduction of the learning parameters, training epochs, and computational time, without the need for ad-hoc loss functions weighting values.
Authors: P Aditya Sreekar, Sahil Verma, Suransh Chopra, Sarik Ghazarian, Abhishek Persad, Narayanan Sadagopan
Abstract: Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning and pinpointing inconsistent text spans. AXCEL is also a generalizable metric which can be adopted to multiple tasks without changing the prompt. AXCEL outperforms both non-prompt and prompt-based state-of-the-art (SOTA) metrics in detecting inconsistencies across summarization by 8.7%, free text generation by 6.2%, and data-to-text conversion tasks by 29.4%. We also evaluate the influence of underlying LLMs on prompt based metric performance and recalibrate the SOTA prompt-based metrics with the latest LLMs for fair comparison. Further, we show that AXCEL demonstrates strong performance using open source LLMs.
Authors: Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ye Yuan, Guoren Wang, Conghui He
Abstract: Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, $i.e.,$ a high influence score indicates that incorporating this instance to the training set is likely to enhance the model performance. Consequently, they select the top-$k$ instances with the highest scores. However, this approach has several limitations. (1) Computing the influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pre-trained model's ability to generalize effectively to various downstream tasks. In this paper, we introduce \texttt{Quad}, a data selection approach that considers both quality and diversity by using data influence to achieve state-of-the-art pre-training results. In particular, noting that attention layers capture extensive semantic details, we have adapted the accelerated $iHVP$ computation methods for attention layers, enhancing our ability to evaluate the influence of data, $i.e.,$ its quality. For the diversity, \texttt{Quad} clusters the dataset into similar data instances within each cluster and diverse instances across different clusters. For each cluster, if we opt to select data from it, we take some samples to evaluate the influence to prevent processing all instances. To determine which clusters to select, we utilize the classic Multi-Armed Bandit method, treating each cluster as an arm. This approach favors clusters with highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby well balancing between quality and diversity.
Authors: Sasha Boguraev, Ben Lipkin, Leonie Weissweiler, Kyle Mahowald
Abstract: Math is constructed by people for people: just as natural language corpora reflect not just propositions but the communicative goals of language users, the math data that models are trained on reflects not just idealized mathematical entities but rich communicative intentions. While there are important advantages to treating math in a purely symbolic manner, we here hypothesize that there are benefits to treating math as situated linguistic communication and that language models are well suited for this goal, in ways that are not fully appreciated. We illustrate these points with two case studies. First, we ran an experiment in which we found that language models interpret the equals sign in a humanlike way -- generating systematically different word problems for the same underlying equation arranged in different ways. Second, we found that language models prefer proofs to be ordered in naturalistic ways, even though other orders would be logically equivalent. We advocate for AI systems that learn from and represent the communicative intentions latent in human-generated math.
Authors: Antoine Poupon, Hugo de Rohan Willner, Pierre Nikitits, Adam Abdin
Abstract: The proliferation of debris in Low Earth Orbit (LEO) represents a significant threat to space sustainability and spacecraft safety. Active Debris Removal (ADR) has emerged as a promising approach to address this issue, utilising Orbital Transfer Vehicles (OTVs) to facilitate debris deorbiting, thereby reducing future collision risks. However, ADR missions are substantially complex, necessitating accurate planning to make the missions economically viable and technically effective. Moreover, these servicing missions require a high level of autonomous capability to plan under evolving orbital conditions and changing mission requirements. In this paper, an autonomous decision-planning model based on Deep Reinforcement Learning (DRL) is developed to train an OTV to plan optimal debris removal sequencing. It is shown that using the proposed framework, the agent can find optimal mission plans and learn to update the planning autonomously to include risk handling of debris with high collision risk.
Authors: Azmul Asmar Irfan, Nur Ahmad Khatim, Mansur M. Arief
Abstract: One of the key issues contributing to inefficiency in Puskesmas is the time-consuming nature of doctor-patient interactions. Doctors need to conduct thorough consultations, which include diagnosing the patient's condition, providing treatment advice, and transcribing detailed notes into medical records. In regions with diverse linguistic backgrounds, doctors often have to ask clarifying questions, further prolonging the process. While diagnosing is essential, transcription and summarization can often be automated using AI to improve time efficiency and help doctors enhance care quality and enable early diagnosis and intervention. This paper proposes a solution using a localized large language model (LLM) to transcribe, translate, and summarize doctor-patient conversations. We utilize the Whisper model for transcription and GPT-3 to summarize them into the ePuskemas medical records format. This system is implemented as an add-on to an existing web browser extension, allowing doctors to fill out patient forms while talking. By leveraging this solution for real-time transcription, translation, and summarization, doctors can improve the turnaround time for patient care while enhancing the quality of records, which become more detailed and insightful for future visits. This innovation addresses challenges like overcrowded facilities and the administrative burden on healthcare providers in Indonesia. We believe this solution will help doctors save time, provide better care, and produce more accurate medical records, representing a significant step toward modernizing healthcare and ensuring patients receive timely, high-quality care, even in resource-constrained settings.
Authors: Lucas Robinet, Ahmad Berjaoui, Ziad Kheil, Elizabeth Cohen-Jonathan Moyal
Abstract: Real-life medical data is often multimodal and incomplete, fueling the growing need for advanced deep learning models capable of integrating them efficiently. The use of diverse modalities, including histopathology slides, MRI, and genetic data, offers unprecedented opportunities to improve prognosis prediction and to unveil new treatment pathways. Contrastive learning, widely used for deriving representations from paired data in multimodal tasks, assumes that different views contain the same task-relevant information and leverages only shared information. This assumption becomes restrictive when handling medical data since each modality also harbors specific knowledge relevant to downstream tasks. We introduce DRIM, a new multimodal method for capturing these shared and unique representations, despite data sparsity. More specifically, given a set of modalities, we aim to encode a representation for each one that can be divided into two components: one encapsulating patient-related information common across modalities and the other, encapsulating modality-specific details. This is achieved by increasing the shared information among different patient modalities while minimizing the overlap between shared and unique components within each modality. Our method outperforms state-of-the-art algorithms on glioma patients survival prediction tasks, while being robust to missing modalities. To promote reproducibility, the code is made publicly available at https://github.com/Lucas-rbnt/DRIM
Authors: Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
Abstract: Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by $0.01$-$0.34$ on LLaMA-2, $0.38$-$0.68$ on Mistral-7B, $4.41$-$7.34$ on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of $0.79$-$1.5\%$ on LLaMA-2, $1\%$ on Mistral-7B, $11$-$22\%$ on LLaMA-3 on QA tasks on average. We only utilize $10.4$-$18.6\%$ of the quantization algorithm execution time, resulting in a $1.6$-$1.8\times$ increase in inference throughput compared to SOTA.
Authors: Susmitha Patnala, Adam Abdin
Abstract: This study develops an AI-based implementation of autonomous On-Orbit Servicing (OOS) mission to assist with spacecraft collision avoidance maneuvers (CAMs). We propose an autonomous `servicer' trained with Reinforcement Learning (RL) to autonomously detect potential collisions between a target satellite and space debris, rendezvous and dock with endangered satellites, and execute optimal CAM. The RL model integrates collision risk estimates, satellite specifications, and debris data to generate an optimal maneuver matrix for OOS rendezvous and collision prevention. We employ the Cross-Entropy algorithm to find optimal decision policies efficiently. Initial results demonstrate the feasibility of autonomous robotic OOS for collision avoidance services, focusing on one servicer spacecraft to one endangered satellite scenario. However, merging spacecraft rendezvous and optimal CAM presents significant complexities. We discuss design challenges and critical parameters for the successful implementation of the framework presented through a case study.
Authors: Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Abstract: Multimodal large language models (MLLMs) have enabled LLM-based agents to directly interact with application user interfaces (UIs), enhancing agents' performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agents framework prioritize actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compare to humans. Our work contributes to a new human-agent-computer interaction (HACI) framework and a fresh UI design principle for application providers in the era of LLMs. It also explores the possibility of turning every applications into agents, paving the way towards an agent-centric operating system (Agent OS).
Authors: Zhiyu Lin, Upol Ehsan, Rohan Agarwal, Samihan Dani, Vidushi Vashishth, Mark Riedl
Abstract: Generative Artificial Intelligence (AI) encounters limitations in efficiency and fairness within the realm of Procedural Content Generation (PCG) when human creators solely drive and bear responsibility for the generative process. Alternative setups, such as Mixed-Initiative Co-Creative (MI-CC) systems, exhibited their promise. Still, the potential of an active mixed initiative, where AI takes a role beyond following, is understudied. This work investigates the influence of the adaptive ability of an active and learning AI agent on creators' expectancy of creative responsibilities in an MI-CC setting. We built and studied a system that employs reinforcement learning (RL) methods to learn the creative responsibility preferences of a human user during online interactions. Situated in story co-creation, we develop a Multi-armed-bandit agent that learns from the human creator, updates its collaborative decision-making belief, and switches between its capabilities during an MI-CC experience. With 39 participants joining a human subject study, Our developed system's learning capabilities are well recognized compared to the non-learning ablation, corresponding to a significant increase in overall satisfaction with the MI-CC experience. These findings indicate a robust association between effective MI-CC collaborative interactions, particularly the implementation of proactive AI initiatives, and deepened understanding among all participants.
Authors: Nhut Truong, Dario Pesenti, Uri Hasson
Abstract: We present a computational explainability approach for human comparison tasks, using Alignment Importance Score (AIS) heatmaps derived from deep-vision models. The AIS reflects a feature-map's unique contribution to the alignment between Deep Neural Network's (DNN) representational geometry and that of humans. We first validate the AIS by showing that prediction of out-of-sample human similarity judgments is improved when constructing representations using only higher-scoring AIS feature maps identified from a training set. We then compute image-specific heatmaps that visually indicate the areas that correspond to feature-maps with higher AIS scores. These maps provide an intuitive explanation of which image areas are more important when it is compared to other images in a cohort. We observe a correspondence between these heatmaps and saliency maps produced by a gaze-prediction model. However, in some cases, meaningful differences emerge, as the dimensions relevant for comparison are not necessarily the most visually salient. To conclude, Alignment Importance improves prediction of human similarity judgments from DNN embeddings, and provides interpretable insights into the relevant information in image space.
Authors: Huy Nhat Phan, Phong X. Nguyen, Nghi D. Q. Bui
Abstract: Large Language Models (LLMs) have revolutionized software engineering (SE), demonstrating remarkable capabilities in various coding tasks. While recent efforts have produced autonomous software agents based on LLMs for end-to-end development tasks, these systems are typically designed for specific SE tasks. We introduce HyperAgent, a novel generalist multi-agent system designed to address a wide spectrum of SE tasks across different programming languages by mimicking human developers' workflows. Comprising four specialized agents - Planner, Navigator, Code Editor, and Executor. HyperAgent manages the full lifecycle of SE tasks, from initial conception to final verification. Through extensive evaluations, HyperAgent achieves state-of-the-art performance across diverse SE tasks: it attains a 25.01% success rate on SWE-Bench-Lite and 31.40% on SWE-Bench-Verified for GitHub issue resolution, surpassing existing methods. Furthermore, HyperAgent demonstrates SOTA performance in repository-level code generation (RepoExec), and in fault localization and program repair (Defects4J), often outperforming specialized systems. This work represents a significant advancement towards versatile, autonomous agents capable of handling complex, multi-step SE tasks across various domains and languages, potentially transforming AI-assisted software development practices.
Authors: Jon Oleson
Abstract: Medical practitioners are rapidly adopting generative AI solutions for clinical documentation, leading to significant time savings and reduced stress. However, evaluating the quality of AI-generated documentation is a complex and ongoing challenge. This paper presents an overview of DeepScribe's methodologies for assessing and managing note quality, focusing on various metrics and the composite "DeepScore", an overall index of quality and accuracy. These methodologies aim to enhance the quality of patient care documentation through accountability and continuous improvement.
Authors: Yitian Tao, Yan Liang, Luoyu Wang, Yongqing Li, Qing Yang, Han Zhang
Abstract: Decoding neurophysiological signals into language is of great research interest within brain-computer interface (BCI) applications. Electroencephalography (EEG), known for its non-invasiveness, ease of use, and cost-effectiveness, has been a popular method in this field. However, current EEG-to-Text decoding approaches face challenges due to the huge domain gap between EEG recordings and raw texts, inherent data bias, and small closed vocabularies. In this paper, we propose SEE: Semantically Aligned EEG-to-Text Translation, a novel method aimed at improving EEG-to-Text decoding by seamlessly integrating two modules into a pre-trained BART language model. These two modules include (1) a Cross-Modal Codebook that learns cross-modal representations to enhance feature consolidation and mitigate domain gap, and (2) a Semantic Matching Module that fully utilizes pre-trained text representations to align multi-modal features extracted from EEG-Text pairs while considering noise caused by false negatives, i.e., data from different EEG-Text pairs that have similar semantic meanings. Experimental results on the Zurich Cognitive Language Processing Corpus (ZuCo) demonstrate the effectiveness of SEE, which enhances the feasibility of accurate EEG-to-Text decoding.
Authors: K. R. Schuurman, A. Meyer
Abstract: Accurate estimates of surface solar irradiance (SSI) are essential for solar resource assessments and solar energy forecasts in grid integration and building control applications. SSI estimates for spatially extended regions can be retrieved from geostationary satellites such as Meteosat. Traditional SSI satellite retrievals like Heliosat rely on physical radiative transfer modelling. We introduce the first machine-learning-based satellite retrieval for instantaneous SSI and demonstrate its capability to provide accurate and generalizable SSI estimates across Europe. Our deep learning retrieval provides near real-time SSI estimates based on data-driven emulation of Heliosat and fine-tuning on pyranometer networks. By including SSI from ground stations, our SSI retrieval model can outperform Heliosat accuracy and generalize well to regions with other climates and surface albedos in cloudy conditions (clear-sky index < 0.8). We also show that the SSI retrieved from Heliosat exhibits large biases in mountain regions, and that training and fine-tuning our retrieval models on SSI data from ground stations strongly reduces these biases, outperforming Heliosat. Furthermore, we quantify the relative importance of the Meteosat channels and other predictor variables like solar zenith angle for the accuracy of our deep learning SSI retrieval model in different cloud conditions. We find that in cloudy conditions multiple near-infrared and infrared channels enhance the performance. Our results can facilitate the development of more accurate satellite retrieval models of surface solar irradiance.
Authors: Syed Muhammad Aqdas Rizvi
Abstract: This literature review surveys the advancements of keyword spotting (KWS) technologies, specifically focusing on Urdu, Pakistan's low-resource language (LRL), which has complex phonetics. Despite the global strides in speech technology, Urdu presents unique challenges requiring more tailored solutions. The review traces the evolution from foundational Gaussian Mixture Models to sophisticated neural architectures like deep neural networks and transformers, highlighting significant milestones such as integrating multi-task learning and self-supervised approaches that leverage unlabeled data. It examines emerging technologies' role in enhancing KWS systems' performance within multilingual and resource-constrained settings, emphasizing the need for innovations that cater to languages like Urdu. Thus, this review underscores the need for context-specific research addressing the inherent complexities of Urdu and similar URLs and the means of regions communicating through such languages for a more inclusive approach to speech technology.
Authors: Suwichaya Suwanwimolkul, Natanon Tongamrak, Nuttamon Thungka, Naebboon Hoonchareon, Jitkomut Songsiri
Abstract: This paper presents an online platform that shows Thailand's solar irradiance map every 30 minutes. It is available at https://www.cusolarforecast.com. The methodology for estimating global horizontal irradiance (GHI) across Thailand relies on cloud index extracted from Himawari-8 satellite imagery, Ineichen clear-sky model with locally-tuned Linke turbidity, and machine learning models. The methods take clear-sky irradiance, cloud index, re-analyzed GHI and temperature data from the MERRA-2 database, and date-time as inputs for GHI estimation models, including LightGBM, LSTM, Informer, and Transformer. These are benchmarked with the estimate from the SolCast service by evaluation of 15-minute ground GHI data from 53 ground stations over 1.5 years during 2022-2023. The results show that the four models have competitive performances and outperform the SolCast service. The best model is LightGBM, with an MAE of 78.58 W/sqm and RMSE of 118.97 W/sqm. Obtaining re-analyzed MERRA-2 data for Thailand is not economically feasible for deployment. When removing these features, the Informer model has a winning performance of 78.67 W/sqm in MAE. The obtained performance aligns with existing literature by taking the climate zone and time granularity of data into consideration. As the map shows an estimate of GHI over 93,000 grids with a frequent update, the paper also describes a computational framework for displaying the entire map. It tests the runtime performance of deep learning models in the GHI estimation process.
Authors: Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng
Abstract: Alzheimer's Disease (AD) detection has emerged as a promising research area that employs machine learning classification models to distinguish between individuals with AD and those without. Unlike conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Given that many AD detection tasks lack fine-grained labels, simplistic binary classification may overlook two crucial aspects: within-class differences and instance-level imbalance. The former compels the model to map AD samples with varying degrees of impairment to a single diagnostic label, disregarding certain changes in cognitive function. While the latter biases the model towards overrepresented severity levels. This work presents early efforts to address these challenges. We propose two novel methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), targeting two problems respectively. Experiments on the ADReSS and ADReSSo datasets demonstrate that the proposed methods significantly improve detection accuracy. Further analysis reveals that SoTD effectively harnesses the strengths of multiple component models, while InRe substantially alleviates model over-fitting. These findings provide insights for developing more robust and reliable AD detection models.
Authors: Julie Keisler (CRIStAL, EDF R\&D OSIRIS, EDF R\&D), Margaux Bregere (EDF R\&D, EDF R\&D OSIRIS, LPSM)
Abstract: Electricity is difficult to store, except at prohibitive cost, and therefore the balance between generation and load must be maintained at all times. Electricity is traditionally managed by anticipating demand and intermittent production (wind, solar) and matching flexible production (hydro, nuclear, coal and gas). Accurate forecasting of electricity load and renewable production is therefore essential to ensure grid performance and stability. Both are highly dependent on meteorological variables (temperature, wind, sunshine). These dependencies are complex and difficult to model. On the one hand, spatial variations do not have a uniform impact because population, industry, and wind and solar farms are not evenly distributed across the territory. On the other hand, temporal variations can have delayed effects on load (due to the thermal inertia of buildings). With access to observations from different weather stations and simulated data from meteorological models, we believe that both phenomena can be modeled together. In today's state-of-the-art load forecasting models, the spatio-temporal modeling of the weather is fixed. In this work, we aim to take advantage of the automated representation and spatio-temporal feature extraction capabilities of deep neural networks to improve spatio-temporal weather modeling for load forecasting. We compare our deep learning-based methodology with the state-of-the-art on French national load. This methodology could also be fully adapted to forecasting renewable energy production.
Authors: Stanislav Koz\'ak
Abstract: Radiomics is a relatively new field which utilises automatically identified features from radiological scans. It has found a widespread application, particularly in oncology because many of the important oncological biomarkers are not visible to the naked eye. The recent advent of big data, including in medical imaging, and the development of new ML techniques brought the possibility of faster and more accurate oncological diagnosis. Furthermore, standardised mathematical feature extraction based on radiomics helps to eliminate possible radiologist bias. This paper reviews the recent development in the oncological use of MRI radiomic features. It focuses on the identification of the isocitrate dehydrogenase (IDH) mutation status, which is an important biomarker for the diagnosis of glioblastoma and grade IV astrocytoma.
Authors: Jinlong Yang, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Yuhao Xie, Yuanchang Luo, Jiawei Zheng, Bin Wei, Hao Yang
Abstract: This paper describes the submissions of Huawei Translation Services Center(HW-TSC) to WMT24 chat translation shared task on English$\leftrightarrow$Germany (en-de) bidirection. The experiments involved fine-tuning models using chat data and exploring various strategies, including Minimum Bayesian Risk (MBR) decoding and self-training. The results show significant performance improvements in certain directions, with the MBR self-training method achieving the best results. The Large Language Model also discusses the challenges and potential avenues for further research in the field of chat translation.
Authors: Nikolas Koutsoubis, Asim Waqas, Yasin Yilmaz, Ravi P. Ramachandran, Matthew Schabath, Ghulam Rasool
Abstract: Artificial Intelligence (AI) has demonstrated significant potential in automating various medical imaging tasks, which could soon become routine in clinical practice for disease diagnosis, prognosis, treatment planning, and post-treatment surveillance. However, the privacy concerns surrounding patient data present a major barrier to the widespread adoption of AI in medical imaging, as large, diverse training datasets are essential for developing accurate, generalizable, and robust Artificial intelligence models. Federated Learning (FL) offers a solution that enables organizations to train AI models collaboratively without sharing sensitive data. federated learning exchanges model training information, such as gradients, between the participating sites. Despite its promise, federated learning is still in its developmental stages and faces several challenges. Notably, sensitive information can still be inferred from the gradients shared during model training. Quantifying AI models' uncertainty is vital due to potential data distribution shifts post-deployment, which can affect model performance. Uncertainty quantification (UQ) in FL is particularly challenging due to data heterogeneity across participating sites. This review provides a comprehensive examination of FL, privacy-preserving FL (PPFL), and UQ in FL. We identify key gaps in current FL methodologies and propose future research directions to enhance data privacy and trustworthiness in medical imaging applications.
Authors: Satyananda Kashyap, Niharika S. D'Souza, Luyao Shi, Ken C. L. Wong, Hongzhi Wang, Tanveer Syeda-Mahmood
Abstract: Content-addressable memories such as Modern Hopfield Networks (MHN) have been studied as mathematical models of auto-association and storage/retrieval in the human declarative memory, yet their practical use for large-scale content storage faces challenges. Chief among them is the occurrence of meta-stable states, particularly when handling large amounts of high dimensional content. This paper introduces Hopfield Encoding Networks (HEN), a framework that integrates encoded neural representations into MHNs to improve pattern separability and reduce meta-stable states. We show that HEN can also be used for retrieval in the context of hetero association of images with natural language queries, thus removing the limitation of requiring access to partial content in the same domain. Experimental results demonstrate substantial reduction in meta-stable states and increased storage capacity while still enabling perfect recall of a significantly larger number of inputs advancing the practical utility of associative memory networks for real-world tasks.
Authors: Chung-Yu Wang, Alireza DaghighFarsoodeh, Hung Viet Pham
Abstract: Large Language Models (LLMs) have demonstrated impressive performance in software engineering tasks. However, improving their accuracy in generating correct and reliable code remains challenging. Numerous prompt engineering techniques (PETs) have been developed to address this, but no single approach is universally optimal. Selecting the right PET for each query is difficult for two primary reasons: (1) interactive prompting techniques may not consistently deliver the expected benefits, especially for simpler queries, and (2) current automated prompt engineering methods lack adaptability and fail to fully utilize multi-stage responses. To overcome these challenges, we propose PET-Select, a PET-agnostic selection model that uses code complexity as a proxy to classify queries and select the most appropriate PET. By incorporating contrastive learning, PET-Select effectively distinguishes between simple and complex problems, allowing it to choose PETs that are best suited for each query's complexity level. Our evaluations on the MBPP and HumanEval benchmarks using GPT-3.5 Turbo and GPT-4o show up to a 1.9% improvement in pass@1 accuracy, along with a 74.8% reduction in token usage. Additionally, we provide both quantitative and qualitative results to demonstrate how PET-Select effectively selects the most appropriate techniques for each code generation query, further showcasing its efficiency in optimizing PET selection.
Authors: Chung-Yu Wang, Alireza DaghighFarsoodeh, Hung Viet Pham
Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities across various tasks, leveraging advanced reasoning. Yet, they struggle with task-oriented prompts due to a lack of specific prior knowledge of the task answers. The current state-of-the-art approach, PAL, utilizes code generation to address this issue. However, PAL depends on manually crafted prompt templates and examples while still producing inaccurate results. In this work, we present TITAN-a novel strategy designed to enhance LLMs' performance on task-oriented prompts. TITAN achieves this by generating scripts using a universal approach and zero-shot learning. Unlike existing methods, TITAN eliminates the need for detailed task-specific instructions and extensive manual efforts. TITAN enhances LLMs' performance on various tasks by utilizing their analytical and code-generation capabilities in a streamlined process. TITAN employs two key techniques: (1) step-back prompting to extract the task's input specifications and (2) chain-of-thought prompting to identify required procedural steps. This information is used to improve the LLMs' code-generation process. TITAN further refines the generated script through post-processing and the script is executed to retrieve the final answer. Our comprehensive evaluation demonstrates TITAN's effectiveness in a diverse set of tasks. On average, TITAN outperforms the state-of-the-art zero-shot approach by 7.6% and 3.9% when paired with GPT-3.5 and GPT-4. Overall, without human annotation, TITAN achieves state-of-the-art performance in 8 out of 11 cases while only marginally losing to few-shot approaches (which needed human intervention) on three occasions by small margins. This work represents a significant advancement in addressing task-oriented prompts, offering a novel solution for effectively utilizing LLMs in everyday life tasks.
Authors: Kevin Paeth, Daniel Atherton, Nikiforos Pittaras, Heather Frase, Sean McGregor
Abstract: As artificial intelligence (AI) systems become increasingly deployed across the world, they are also increasingly implicated in AI incidents - harm events to individuals and society. As a result, industry, civil society, and governments worldwide are developing best practices and regulations for monitoring and analyzing AI incidents. The AI Incident Database (AIID) is a project that catalogs AI incidents and supports further research by providing a platform to classify incidents for different operational and research-oriented goals. This study reviews the AIID's dataset of 750+ AI incidents and two independent taxonomies applied to these incidents to identify common challenges to indexing and analyzing AI incidents. We find that certain patterns of AI incidents present structural ambiguities that challenge incident databasing and explore how epistemic uncertainty in AI incident reporting is unavoidable. We therefore report mitigations to make incident processes more robust to uncertainty related to cause, extent of harm, severity, or technical details of implicated systems. With these findings, we discuss how to develop future AI incident reporting practices.
Authors: Ruo Yang, Binghui Wang, Mustafa Bilgic
Abstract: Numerous explanation methods have been recently developed to interpret the decisions made by deep neural network (DNN) models. For image classifiers, these methods typically provide an attribution score to each pixel in the image to quantify its contribution to the prediction. However, most of these explanation methods appropriate attribution scores to pixels independently, even though both humans and DNNs make decisions by analyzing a set of closely related pixels simultaneously. Hence, the attribution score of a pixel should be evaluated jointly by considering itself and its structurally-similar pixels. We propose a method called IProp, which models each pixel's individual attribution score as a source of explanatory information and explains the image prediction through the dynamic propagation of information across all pixels. To formulate the information propagation, IProp adopts the Markov Reward Process, which guarantees convergence, and the final status indicates the desired pixels' attribution scores. Furthermore, IProp is compatible with any existing attribution-based explanation method. Extensive experiments on various explanation methods and DNN models verify that IProp significantly improves them on a variety of interpretability metrics.
Authors: Rajesh Ranjan, Shailja Gupta, Surya Narayan Singh
Abstract: Large Language Models(LLMs) have revolutionized various applications in natural language processing (NLP) by providing unprecedented text generation, translation, and comprehension capabilities. However, their widespread deployment has brought to light significant concerns regarding biases embedded within these models. This paper presents a comprehensive survey of biases in LLMs, aiming to provide an extensive review of the types, sources, impacts, and mitigation strategies related to these biases. We systematically categorize biases into several dimensions. Our survey synthesizes current research findings and discusses the implications of biases in real-world applications. Additionally, we critically assess existing bias mitigation techniques and propose future research directions to enhance fairness and equity in LLMs. This survey serves as a foundational resource for researchers, practitioners, and policymakers concerned with addressing and understanding biases in LLMs.
Authors: Zheda Mai, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Li Zhang, Wei-Lun Chao
Abstract: Parameter-efficient transfer learning (PETL) has attracted significant attention lately, due to the increasing size of pre-trained models and the need to fine-tune (FT) them for superior downstream performance. This community-wide enthusiasm has sparked a plethora of new methods. Nevertheless, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like when to apply PETL and which method to use largely unanswered. In this paper, we conduct a unifying empirical study of representative PETL methods in the context of Vision Transformers. We systematically tune their hyper-parameters to fairly compare their accuracy on downstream tasks. Our study not only offers a valuable user guide but also unveils several new insights. First, if tuned carefully, different PETL methods can obtain quite similar accuracy in the low-shot benchmark VTAB-1K. This includes simple methods like FT the bias terms that were reported inferior. Second, though with similar accuracy, we find that PETL methods make different mistakes and high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementariness) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PETL is also useful in many-shot regimes -- it achieves comparable and sometimes better accuracy than full FT, using much fewer learnable parameters. Last but not least, we investigate PETL's ability to preserve a pre-trained model's robustness to distribution shifts (e.g., a CLIP backbone). Perhaps not surprisingly, PETL methods outperform full FT alone. However, with weight-space ensembles, the fully FT model can achieve a better balance between downstream and out-of-distribution performance, suggesting a future research direction for PETL.
Authors: Erica Coppolillo, Simone Mungari, Ettore Ritacco, Francesco Fabbri, Marco Minici, Francesco Bonchi, Giuseppe Manco
Abstract: Digital platforms such as social media and e-commerce websites adopt Recommender Systems to provide value to the user. However, the social consequences deriving from their adoption are still unclear. Many scholars argue that recommenders may lead to detrimental effects, such as bias-amplification deriving from the feedback loop between algorithmic suggestions and users' choices. Nonetheless, the extent to which recommenders influence changes in users leaning remains uncertain. In this context, it is important to provide a controlled environment for evaluating the recommendation algorithm before deployment. To address this, we propose a stochastic simulation framework that mimics user-recommender system interactions in a long-term scenario. In particular, we simulate the user choices by formalizing a user model, which comprises behavioral aspects, such as the user resistance towards the recommendation algorithm and their inertia in relying on the received suggestions. Additionally, we introduce two novel metrics for quantifying the algorithm's impact on user preferences, specifically in terms of drift over time. We conduct an extensive evaluation on multiple synthetic datasets, aiming at testing the robustness of our framework when considering different scenarios and hyper-parameters setting. The experimental results prove that the proposed methodology is effective in detecting and quantifying the drift over the users preferences by means of the simulation. All the code and data used to perform the experiments are publicly available.
Authors: Imra Aqeel
Abstract: The global pandemic due to emergence of COVID 19 has created the unrivaled public health crisis. It has huge morbidity rate never comprehended in the recent decades. Researchers have made many efforts to find the optimal solution of this pandemic. Progressively, drug repurposing is an emergent and powerful strategy with saving cost, time, and labor. Lacking of identified repurposed drug candidates against COVID 19 demands more efforts to explore the potential inhibitors for effective cure. In this study, we used the combination of molecular docking and machine learning regression approaches to explore the potential inhibitors for the treatment of COVID 19. We calculated the binding affinities of these drugs to multitarget proteins using molecular docking process. We perform the QSAR modeling by employing various machine learning regression approaches to identify the potential inhibitors against COVID 19. Our findings with best scores of R2 and RMSE demonstrated that our proposed Decision Tree Regression (DTR) model is the most appropriate model to explore the potential inhibitors. We proposed five novel promising inhibitors with their respective Zinc IDs ZINC (3873365, 85432544, 8214470, 85536956, and 261494640) within the range of -19.7 kcal/mol to -12.6 kcal/mol. We further analyzed the physiochemical and pharmacokinetic properties of these most potent inhibitors to examine their behavior. The analysis of these properties is the key factor to promote an effective cure for public health. Our work constructs an efficient structure with which to probe the potential inhibitors against COVID-19, creating the combination of molecular docking with machine learning regression approaches.
Authors: Gennady Sidorov, Malik Mohrat, Ksenia Lebedeva, Ruslan Rakhimov, Sergey Kolyubin
Abstract: Although various visual localization approaches exist, such as scene coordinate and pose regression, these methods often struggle with high memory consumption or extensive optimization requirements. To address these challenges, we utilize recent advancements in novel view synthesis, particularly 3D Gaussian Splatting (3DGS), to enhance localization. 3DGS allows for the compact encoding of both 3D geometry and scene appearance with its spatial features. Our method leverages the dense description maps produced by XFeat's lightweight keypoint detection and description model. We propose distilling these dense keypoint descriptors into 3DGS to improve the model's spatial understanding, leading to more accurate camera pose predictions through 2D-3D correspondences. After estimating an initial pose, we refine it using a photometric warping loss. Benchmarking on popular indoor and outdoor datasets shows that our approach surpasses state-of-the-art Neural Render Pose (NRP) methods, including NeRFMatch and PNeRFLoc.
Authors: Ryan Lagerquist, Galina Chirokova, Robert DeMaria, Mark DeMaria, Imme Ebert-Uphoff
Abstract: Determining the location of a tropical cyclone's (TC) surface circulation center -- "center-fixing" -- is a critical first step in the TC-forecasting process, affecting current and future estimates of track, intensity, and structure. Despite a recent increase in the number of automated center-fixing methods, only one such method (ARCHER-2) is operational, and its best performance is achieved when using microwave or scatterometer data, which are not available at every forecast cycle. We develop a deep-learning algorithm called GeoCenter; it relies only on geostationary IR satellite imagery, which is available for all TC basins at high frequency (10-15 min) and low latency (< 10 min) during both day and night. GeoCenter ingests an animation (time series) of IR images, including 10 channels at lag times up to 3 hours. The animation is centered at a "first guess" location, offset from the true TC-center location by 48 km on average and sometimes > 100 km; GeoCenter is tasked with correcting this offset. On an independent testing dataset, GeoCenter achieves a mean/median/RMS (root mean square) error of 26.9/23.3/32.0 km for all systems, 25.7/22.3/30.5 km for tropical systems, and 15.7/13.6/18.6 km for category-2--5 hurricanes. These values are similar to ARCHER-2 errors when microwave or scatterometer data are available, and better than ARCHER-2 errors when only IR data are available. GeoCenter also performs skillful uncertainty quantification (UQ), producing a well calibrated ensemble of 200 TC-center locations. Furthermore, all predictors used by GeoCenter are available in real time, which would make GeoCenter easy to implement operationally every 10-15 min.
Authors: Simon Varailhon, Masih Aminbeidokhti, Marco Pedersoli, Eric Granger
Abstract: Source-free domain adaptation (SFDA) is a challenging problem in object detection, where a pre-trained source model is adapted to a new target domain without using any source domain data for privacy and efficiency reasons. Most state-of-the-art SFDA methods for object detection have been proposed for Faster-RCNN, a detector that is known to have high computational complexity. This paper focuses on domain adaptation techniques for real-world vision systems, particularly for the YOLO family of single-shot detectors known for their fast baselines and practical applications. Our proposed SFDA method - Source-Free YOLO (SF-YOLO) - relies on a teacher-student framework in which the student receives images with a learned, target domain-specific augmentation, allowing the model to be trained with only unlabeled target data and without requiring feature alignment. A challenge with self-training using a mean-teacher architecture in the absence of labels is the rapid decline of accuracy due to noisy or drifting pseudo-labels. To address this issue, a teacher-to-student communication mechanism is introduced to help stabilize the training and reduce the reliance on annotated target data for model selection. Despite its simplicity, our approach is competitive with state-of-the-art detectors on several challenging benchmark datasets, even sometimes outperforming methods that use source data for adaptation.
Authors: Yangxiao Cai, Peng Liang, Yifei Wang, Zengyang Li, Mojtaba Shahin
Abstract: With the advancements of Large Language Models (LLMs), an increasing number of open-source software projects are using LLMs as their core functional component. Although research and practice on LLMs are capturing considerable interest, no dedicated studies explored the challenges faced by practitioners of LLM open-source projects, the causes of these challenges, and potential solutions. To fill this research gap, we conducted an empirical study to understand the issues that practitioners encounter when developing and using LLM open-source software, the possible causes of these issues, and potential solutions.We collected all closed issues from 15 LLM open-source projects and labelled issues that met our requirements. We then randomly selected 994 issues from the labelled issues as the sample for data extraction and analysis to understand the prevalent issues, their underlying causes, and potential solutions. Our study results show that (1) Model Issue is the most common issue faced by practitioners, (2) Model Problem, Configuration and Connection Problem, and Feature and Method Problem are identified as the most frequent causes of the issues, and (3) Optimize Model is the predominant solution to the issues. Based on the study results, we provide implications for practitioners and researchers of LLM open-source projects.
Authors: Chao Huang, Wenshuo Zang, Carlo Pinciroli, Zhi Jane Li, Taposh Banerjee, Lili Su, Rui Liu
Abstract: Compared with single robots, Multi-Robot Systems (MRS) can perform missions more efficiently due to the presence of multiple members with diverse capabilities. However, deploying an MRS in wide real-world environments is still challenging due to uncertain and various obstacles (e.g., building clusters and trees). With a limited understanding of environmental uncertainty on performance, an MRS cannot flexibly adjust its behaviors (e.g., teaming, load sharing, trajectory planning) to ensure both environment adaptation and task accomplishments. In this work, a novel joint preference landscape learning and behavior adjusting framework (PLBA) is designed. PLBA efficiently integrates real-time human guidance to MRS coordination and utilizes Sparse Variational Gaussian Processes with Varying Output Noise to quickly assess human preferences by leveraging spatial correlations between environment characteristics. An optimization-based behavior-adjusting method then safely adapts MRS behaviors to environments. To validate PLBA's effectiveness in MRS behavior adaption, a flood disaster search and rescue task was designed. 20 human users provided 1764 feedback based on human preferences obtained from MRS behaviors related to "task quality", "task progress", "robot safety". The prediction accuracy and adaptation speed results show the effectiveness of PLBA in preference learning and MRS behavior adaption.
Authors: Tengfei Lyu, Weijia Zhang, Jinliang Deng, Hao Liu
Abstract: Spatio-temporal forecasting is a critical component of various smart city applications, such as transportation optimization, energy management, and socio-economic analysis. Recently, several automated spatio-temporal forecasting methods have been proposed to automatically search the optimal neural network architecture for capturing complex spatio-temporal dependencies. However, the existing automated approaches suffer from expensive neural architecture search overhead, which hinders their practical use and the further exploration of diverse spatio-temporal operators in a finer granularity. In this paper, we propose AutoSTF, a decoupled automatic neural architecture search framework for cost-effective automated spatio-temporal forecasting. From the efficiency perspective, we first decouple the mixed search space into temporal space and spatial space and respectively devise representation compression and parameter-sharing schemes to mitigate the parameter explosion. The decoupled spatio-temporal search not only expedites the model optimization process but also leaves new room for more effective spatio-temporal dependency modeling. From the effectiveness perspective, we propose a multi-patch transfer module to jointly capture multi-granularity temporal dependencies and extend the spatial search space to enable finer-grained layer-wise spatial dependency search. Extensive experiments on eight datasets demonstrate the superiority of AutoSTF in terms of both accuracy and efficiency. Specifically, our proposed method achieves up to 13.48x speed-up compared to state-of-the-art automatic spatio-temporal forecasting methods while maintaining the best forecasting accuracy.
Authors: Tong Wu, Zhiyong Chen, Meixia Tao, Yaping Sun, Xiaodong Xu, Wenjun Zhang, Ping Zhang
Abstract: Lightweight and efficient neural network models for deep joint source-channel coding (JSCC) are crucial for semantic communications. In this paper, we propose a novel JSCC architecture, named MambaJSCC, that achieves state-of-the-art performance with low computational and parameter overhead. MambaJSCC utilizes the visual state space model with channel adaptation (VSSM-CA) blocks as its backbone for transmitting images over wireless channels, where the VSSM-CA primarily consists of the generalized state space models (GSSM) and the zero-parameter, zero-computational channel adaptation method (CSI-ReST). We design the GSSM module, leveraging reversible matrix transformations to express generalized scan expanding operations, and theoretically prove that two GSSM modules can effectively capture global information. We discover that GSSM inherently possesses the ability to adapt to channels, a form of endogenous intelligence. Based on this, we design the CSI-ReST method, which injects channel state information (CSI) into the initial state of GSSM to utilize its native response, and into the residual state to mitigate CSI forgetting, enabling effective channel adaptation without introducing additional computational and parameter overhead. Experimental results show that MambaJSCC not only outperforms existing JSCC methods (e.g., SwinJSCC) across various scenarios but also significantly reduces parameter size, computational overhead, and inference delay.
Authors: Hevish Cowlessur, Chandra Thapa, Tansu Alpcan, Seyit Camtepe
Abstract: Quantum Machine Learning (QML) is an emerging field of research with potential applications to distributed collaborative learning, such as Split Learning (SL). SL allows resource-constrained clients to collaboratively train ML models with a server, reduce their computational overhead, and enable data privacy by avoiding raw data sharing. Although QML with SL has been studied, the problem remains open in resource-constrained environments where clients lack quantum computing capabilities. Additionally, data privacy leakage between client and server in SL poses risks of reconstruction attacks on the server side. To address these issues, we propose Hybrid Quantum Split Learning (HQSL), an application of Hybrid QML in SL. HQSL enables classical clients to train models with a hybrid quantum server and curtails reconstruction attacks. In addition, we introduce a novel qubit-efficient data-loading technique for designing a quantum layer in HQSL, minimizing both the number of qubits and circuit depth. Experiments on five datasets demonstrate HQSL's feasibility and ability to enhance classification performance compared to its classical models. Notably, HQSL achieves mean improvements of over 3% in both accuracy and F1-score for the Fashion-MNIST dataset, and over 1.5% in both metrics for the Speech Commands dataset. We expand these studies to include up to 100 clients, confirming HQSL's scalability. Moreover, we introduce a noise-based defense mechanism to tackle reconstruction attacks on the server side. Overall, HQSL enables classical clients to collaboratively train their models with a hybrid quantum server, leveraging quantum advantages while improving model performance and security against data privacy leakage-related reconstruction attacks.
Authors: Ethan Lin, Zhiyuan Peng, Yi Fang
Abstract: Recent studies have evaluated the creativity/novelty of large language models (LLMs) primarily from a semantic perspective, using benchmarks from cognitive science. However, accessing the novelty in scholarly publications is a largely unexplored area in evaluating LLMs. In this paper, we introduce a scholarly novelty benchmark (SchNovel) to evaluate LLMs' ability to assess novelty in scholarly papers. SchNovel consists of 15000 pairs of papers across six fields sampled from the arXiv dataset with publication dates spanning 2 to 10 years apart. In each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, which simulates the review process taken by human reviewers by leveraging the retrieval of similar papers to assess novelty. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models.
Authors: Matthew A. Reyna, Deepanshi, James Weigle, Zuzana Koscova, Kiersten Campbell, Kshama Kodthalu Shivashankara, Soheil Saghafi, Sepideh Nikookar, Mohsen Motie-Shirazi, Yashar Kiarashi, Salman Seyedi, Gari D. Clifford, Reza Sameni
Abstract: We introduce the ECG-Image-Database, a large and diverse collection of electrocardiogram (ECG) images generated from ECG time-series data, with real-world scanning, imaging, and physical artifacts. We used ECG-Image-Kit, an open-source Python toolkit, to generate realistic images of 12-lead ECG printouts from raw ECG time-series. The images include realistic distortions such as noise, wrinkles, stains, and perspective shifts, generated both digitally and physically. The toolkit was applied to 977 12-lead ECG records from the PTB-XL database and 1,000 from Emory Healthcare to create high-fidelity synthetic ECG images. These unique images were subjected to both programmatic distortions using ECG-Image-Kit and physical effects like soaking, staining, and mold growth, followed by scanning and photography under various lighting conditions to create real-world artifacts. The resulting dataset includes 35,595 software-labeled ECG images with a wide range of imaging artifacts and distortions. The dataset provides ground truth time-series data alongside the images, offering a reference for developing machine and deep learning models for ECG digitization and classification. The images vary in quality, from clear scans of clean papers to noisy photographs of degraded papers, enabling the development of more generalizable digitization algorithms. ECG-Image-Database addresses a critical need for digitizing paper-based and non-digital ECGs for computerized analysis, providing a foundation for developing robust machine and deep learning models capable of converting ECG images into time-series. The dataset aims to serve as a reference for ECG digitization and computerized annotation efforts. ECG-Image-Database was used in the PhysioNet Challenge 2024 on ECG image digitization and classification.
Authors: Minkyoo Song, Hanna Kim, Jaehan Kim, Youngjin Jin, Seungwon Shin
Abstract: Recent advances in natural language processing and the increased use of large language models have exposed new security vulnerabilities, such as backdoor attacks. Previous backdoor attacks require input manipulation after model distribution to activate the backdoor, posing limitations in real-world applicability. Addressing this gap, we introduce a novel Claim-Guided Backdoor Attack (CGBA), which eliminates the need for such manipulations by utilizing inherent textual claims as triggers. CGBA leverages claim extraction, clustering, and targeted training to trick models to misbehave on targeted claims without affecting their performance on clean data. CGBA demonstrates its effectiveness and stealthiness across various datasets and models, significantly enhancing the feasibility of practical backdoor attacks. Our code and data will be available at https://github.com/PaperCGBA/CGBA.
Authors: Yuanyong Luo, Zhongxing Zhang, Richard Wu, Hu Liu, Ying Jin, Kai Zheng, Minmin Wang, Zhanying He, Guipeng Hu, Luyao Chen, Tianchi Hu, Junsong Wang, Minqi Chen, Mikhaylov Dmitry, Korviakov Vladimir, Bobrin Maxim, Yuhao Hu, Guanfu Chen, Zeyi Huang
Abstract: This preliminary white paper proposes a novel 8-bit floating-point data format HiFloat8 (abbreviated as HiF8) for deep learning. HiF8 features tapered precision. For normal value encoding, it provides 7 exponents with 3-bit mantissa, 8 exponents with 2-bit mantissa, and 16 exponents with 1-bit mantissa. For denormal or subnormal value encoding, it extends the dynamic range by 7 extra powers of 2, from 31 to 38 binades (notice that FP16 covers 40 binades). Meanwhile, HiF8 encodes all the special values except that positive zero and negative zero are represented by only one bit-pattern. Thanks to the better balance between precision and dynamic range, HiF8 can be simultaneously used in both forward and backward passes of AI training. In this paper, we will describe the definition and rounding methods of HiF8, as well as the tentative training and inference solutions. To demonstrate the efficacy of HiF8 format, massive simulation results on various neural networks, including traditional neural networks and large language models (LLMs), will also be presented.
Authors: Bum Jun Kim, Sang Woo Kim
Abstract: Regularization of deep neural networks has been an important issue to achieve higher generalization performance without overfitting problems. Although the popular method of Dropout provides a regularization effect, it causes inconsistent properties in the output, which may degrade the performance of deep neural networks. In this study, we propose a new module called stochastic average pooling, which incorporates Dropout-like stochasticity in pooling. We describe the properties of stochastic subsampling and average pooling and leverage them to design a module without any inconsistency problem. The stochastic average pooling achieves a regularization effect without any potential performance degradation due to the inconsistency issue and can easily be plugged into existing architectures of deep neural networks. Experiments demonstrate that replacing existing average pooling with stochastic average pooling yields consistent improvements across a variety of tasks, datasets, and models.
Authors: Samuel Arnesen, David Rein, Julian Michael
Abstract: We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language model based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.
Authors: Soorin Yim, Dae-Woong Jeong, Sung Moon Ko, Sumin Lee, Hyunseung Kim, Chanhui Lee, Sehui Han
Abstract: Training deep learning models on limited data while maintaining generalization is one of the fundamental challenges in molecular property prediction. One effective solution is transferring knowledge extracted from abundant datasets to those with scarce data. Recently, a novel algorithm called Geometrically Aligned Transfer Encoder (GATE) has been introduced, which uses soft parameter sharing by aligning the geometrical shapes of task-specific latent spaces. However, GATE faces limitations in scaling to multiple tasks due to computational costs. In this study, we propose a task addition approach for GATE to improve performance on target tasks with limited data while minimizing computational complexity. It is achieved through supervised multi-task pre-training on a large dataset, followed by the addition and training of task-specific modules for each target task. Our experiments demonstrate the superior performance of the task addition strategy for GATE over conventional multi-task methods, with comparable computational costs.
Authors: Changhong Fu, Xiang Lei, Haobo Zuo, Liangliang Yao, Guangze Zheng, Jia Pan
Abstract: Visual object tracking has significantly promoted autonomous applications for unmanned aerial vehicles (UAVs). However, learning robust object representations for UAV tracking is especially challenging in complex dynamic environments, when confronted with aspect ratio change and occlusion. These challenges severely alter the original information of the object. To handle the above issues, this work proposes a novel progressive representation learning framework for UAV tracking, i.e., PRL-Track. Specifically, PRL-Track is divided into coarse representation learning and fine representation learning. For coarse representation learning, two innovative regulators, which rely on appearance and semantic information, are designed to mitigate appearance interference and capture semantic information. Furthermore, for fine representation learning, a new hierarchical modeling generator is developed to intertwine coarse object representations. Exhaustive experiments demonstrate that the proposed PRL-Track delivers exceptional performance on three authoritative UAV tracking benchmarks. Real-world tests indicate that the proposed PRL-Track realizes superior tracking performance with 42.6 frames per second on the typical UAV platform equipped with an edge smart camera. The code, model, and demo videos are available at \url{https://github.com/vision4robotics/PRL-Track}.
Authors: Zhe-Rui Yang, Jindong Han, Chang-Dong Wang, Hao Liu
Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in handling a range of graph analytical tasks across various domains, such as e-commerce and social networks. Despite their versatility, GNNs face significant challenges in transferability, limiting their utility in real-world applications. Existing research in GNN transfer learning overlooks discrepancies in distribution among various graph datasets, facing challenges when transferring across different distributions. How to effectively adopt a well-trained GNN to new graphs with varying feature and structural distributions remains an under-explored problem. Taking inspiration from the success of Low-Rank Adaptation (LoRA) in adapting large language models to various domains, we propose GraphLoRA, an effective and parameter-efficient method for transferring well-trained GNNs to diverse graph domains. Specifically, we first propose a Structure-aware Maximum Mean Discrepancy (SMMD) to align divergent node feature distributions across source and target graphs. Moreover, we introduce low-rank adaptation by injecting a small trainable GNN alongside the pre-trained one, effectively bridging structural distribution gaps while mitigating the catastrophic forgetting. Additionally, a structure-aware regularization objective is proposed to enhance the adaptability of the pre-trained GNN to target graph with scarce supervision labels. Extensive experiments on six real-world datasets demonstrate the effectiveness of GraphLoRA against eleven baselines by tuning only 20% of parameters, even across disparate graph domains. The code is available at https://anonymous.4open.science/r/GraphLoRA.
Authors: Tingting Yang, Liang Xiao, Yizhe Zhang
Abstract: A global threshold (e.g., 0.5) is often applied to determine which bounding boxes should be included in the final results for an object detection task. A higher threshold reduces false positives but may result in missing a significant portion of true positives. A lower threshold can increase detection recall but may also result in more false positives. Because of this, using a preset global threshold (e.g., 0.5) applied to all the bounding box candidates may lead to suboptimal solutions. In this paper, we propose a Test-time Self-guided Bounding-box Propagation (TSBP) method, leveraging Earth Mover's Distance (EMD) to enhance object detection in histology images. TSBP utilizes bounding boxes with high confidence to influence those with low confidence, leveraging visual similarities between them. This propagation mechanism enables bounding boxes to be selected in a controllable, explainable, and robust manner, which surpasses the effectiveness of using simple thresholds and uncertainty calibration methods. Importantly, TSBP does not necessitate additional labeled samples for model training or parameter estimation, unlike calibration methods. We conduct experiments on gland detection and cell detection tasks in histology images. The results show that our proposed TSBP significantly improves detection outcomes when working in conjunction with state-of-the-art deep learning-based detection networks. Compared to other methods such as uncertainty calibration, TSBP yields more robust and accurate object detection predictions while using no additional labeled samples. The code is available at https://github.com/jwhgdeu/TSBP.
Authors: Zhe-Rui Yang, Jindong Han, Chang-Dong Wang, Hao Liu
Abstract: Graph unlearning, which aims to eliminate the influence of specific nodes, edges, or attributes from a trained Graph Neural Network (GNN), is essential in applications where privacy, bias, or data obsolescence is a concern. However, existing graph unlearning techniques often necessitate additional training on the remaining data, leading to significant computational costs, particularly with large-scale graphs. To address these challenges, we propose a two-stage training-free approach, Erase then Rectify (ETR), designed for efficient and scalable graph unlearning while preserving the model utility. Specifically, we first build a theoretical foundation showing that masking parameters critical for unlearned samples enables effective unlearning. Building on this insight, the Erase stage strategically edits model parameters to eliminate the impact of unlearned samples and their propagated influence on intercorrelated nodes. To further ensure the GNN's utility, the Rectify stage devises a gradient approximation method to estimate the model's gradient on the remaining dataset, which is then used to enhance model performance. Overall, ETR achieves graph unlearning without additional training or full training data access, significantly reducing computational overhead and preserving data privacy. Extensive experiments on seven public datasets demonstrate the consistent superiority of ETR in model utility, unlearning efficiency, and unlearning effectiveness, establishing it as a promising solution for real-world graph unlearning challenges.
Authors: Shoma Iwai, Atsuki Osanai, Shunsuke Kitada, Shinichiro Omachi
Abstract: Layout generation is a task to synthesize a harmonious layout with elements characterized by attributes such as category, position, and size. Human designers experiment with the placement and modification of elements to create aesthetic layouts, however, we observed that current discrete diffusion models (DDMs) struggle to correct inharmonious layouts after they have been generated. In this paper, we first provide novel insights into layout sticking phenomenon in DDMs and then propose a simple yet effective layout-assessment module Layout-Corrector, which works in conjunction with existing DDMs to address the layout sticking problem. We present a learning-based module capable of identifying inharmonious elements within layouts, considering overall layout harmony characterized by complex composition. During the generation process, Layout-Corrector evaluates the correctness of each token in the generated layout, reinitializing those with low scores to the ungenerated state. The DDM then uses the high-scored tokens as clues to regenerate the harmonized tokens. Layout-Corrector, tested on common benchmarks, consistently boosts layout-generation performance when in conjunction with various state-of-the-art DDMs. Furthermore, our extensive analysis demonstrates that the Layout-Corrector (1) successfully identifies erroneous tokens, (2) facilitates control over the fidelity-diversity trade-off, and (3) significantly mitigates the performance drop associated with fast sampling.
Authors: Youngwan Jin, Incheol Park, Hanbin Song, Hyeongjin Ju, Yagiz Nalcakan, Shiho Kim
Abstract: This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We performed experiments on the RANUS dataset to demonstrate Pix2Next's advantages in quantitative metrics and visual quality, improving the FID score by 34.81% compared to existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed approach enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.
Authors: Ming Li, Jike Zhong, Chenxin Li, Liuzhuozheng Li, Nie Lin, Masashi Sugiyama
Abstract: Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while the classic model fine-tuning on inherent parameters seems to be overlooked. It is believed that fine-tuning the parameters of VLMs with few-shot samples corrupts the pre-trained knowledge since fine-tuning the CLIP model even degrades performance. In this paper, we revisit this viewpoint, and propose a new perspective: fine-tuning the specific parameters instead of all will uncover the power of classic model fine-tuning on VLMs. Through our meticulous study, we propose ClipFit, a simple yet effective method to fine-tune CLIP without introducing any overhead of extra parameters. We demonstrate that by only fine-tuning the specific bias terms and normalization layers, ClipFit can improve the performance of zero-shot CLIP by 7.27\% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that low-level text bias layers and the first layer normalization layer change much more than other layers. The code is available at \url{https://github.com/minglllli/CLIPFit}.
Authors: M. Sajid, A. Quadir, M. Tanveer
Abstract: The random vector functional link (RVFL) network is a prominent classification model with strong generalization ability. However, RVFL treats all samples uniformly, ignoring whether they are pure or noisy, and its scalability is limited due to the need for inverting the entire training matrix. To address these issues, we propose granular ball RVFL (GB-RVFL) model, which uses granular balls (GBs) as inputs instead of training samples. This approach enhances scalability by requiring only the inverse of the GB center matrix and improves robustness against noise and outliers through the coarse granularity of GBs. Furthermore, RVFL overlooks the dataset's geometric structure. To address this, we propose graph embedding GB-RVFL (GE-GB-RVFL) model, which fuses granular computing and graph embedding (GE) to preserve the topological structure of GBs. The proposed GB-RVFL and GE-GB-RVFL models are evaluated on KEEL, UCI, NDC and biomedical datasets, demonstrating superior performance compared to baseline models.
Authors: Eslam Eldeeb, Hirley Alves
Abstract: Reinforcement learning (RL) has proved to have a promising role in future intelligent wireless networks. Online RL has been adopted for radio resource management (RRM), taking over traditional schemes. However, due to its reliance on online interaction with the environment, its role becomes limited in practical, real-world problems where online interaction is not feasible. In addition, traditional RL stands short in front of the uncertainties and risks in real-world stochastic environments. In this manner, we propose an offline and distributional RL scheme for the RRM problem, enabling offline training using a static dataset without any interaction with the environment and considering the sources of uncertainties using the distributions of the return. Simulation results demonstrate that the proposed scheme outperforms conventional resource management models. In addition, it is the only scheme that surpasses online RL and achieves a $16 \%$ gain over online RL.
Authors: Katharina Anderer, Andreas Reich, Matthias W\"olfel
Abstract: This paper presents a benchmark dataset for aligning lecture videos with corresponding slides and introduces a novel multimodal algorithm leveraging features from speech, text, and images. It achieves an average accuracy of 0.82 in comparison to SIFT (0.56) while being approximately 11 times faster. Using dynamic programming the algorithm tries to determine the optimal slide sequence. The results show that penalizing slide transitions increases accuracy. Features obtained via optical character recognition (OCR) contribute the most to a high matching accuracy, followed by image features. The findings highlight that audio transcripts alone provide valuable information for alignment and are beneficial if OCR data is lacking. Variations in matching accuracy across different lectures highlight the challenges associated with video quality and lecture style. The novel multimodal algorithm demonstrates robustness to some of these challenges, underscoring the potential of the approach.
Authors: Jatin Chaudhary, Dipak Nidhi, Jukka Heikkonen, Haari Merisaari, Rajiv Kanth
Abstract: The objective of this paper is to enhance the optimization process for neural networks by developing a dynamic learning rate algorithm that effectively integrates exponential decay and advanced anti-overfitting strategies. Our primary contribution is the establishment of a theoretical framework where we demonstrate that the optimization landscape, under the influence of our algorithm, exhibits unique stability characteristics defined by Lyapunov stability principles. Specifically, we prove that the superlevel sets of the loss function, as influenced by our adaptive learning rate, are always connected, ensuring consistent training dynamics. Furthermore, we establish the "equiconnectedness" property of these superlevel sets, which maintains uniform stability across varying training conditions and epochs. This paper contributes to the theoretical understanding of dynamic learning rate mechanisms in neural networks and also pave the way for the development of more efficient and reliable neural optimization techniques. This study intends to formalize and validate the equiconnectedness of loss function as superlevel sets in the context of neural network training, opening newer avenues for future research in adaptive machine learning algorithms. We leverage previous theoretical discoveries to propose training mechanisms that can effectively handle complex and high-dimensional data landscapes, particularly in applications requiring high precision and reliability.
Authors: Jinchuan Zhang, Yan Zhou, Yaxin Liu, Ziming Li, Songlin Hu
Abstract: Automated red teaming is an effective method for identifying misaligned behaviors in large language models (LLMs). Existing approaches, however, often focus primarily on improving attack success rates while overlooking the need for comprehensive test case coverage. Additionally, most of these methods are limited to single-turn red teaming, failing to capture the multi-turn dynamics of real-world human-machine interactions. To overcome these limitations, we propose HARM (Holistic Automated Red teaMing), which scales up the diversity of test cases using a top-down approach based on an extensible, fine-grained risk taxonomy. Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner. Experimental results demonstrate that our framework enables a more systematic understanding of model vulnerabilities and offers more targeted guidance for the alignment process.
Authors: Alexander Hinterleitner, Thomas Bartz-Beielstein, Richard Schulz, Sebastian Spengler, Thomas Winter, Christoph Leitenmeier
Abstract: Research in Explainable Artificial Intelligence (XAI) is increasing, aiming to make deep learning models more transparent. Most XAI methods focus on justifying the decisions made by Artificial Intelligence (AI) systems in security-relevant applications. However, relatively little attention has been given to using these methods to improve the performance and robustness of deep learning algorithms. Additionally, much of the existing XAI work primarily addresses classification problems. In this study, we investigate the potential of feature attribution methods to filter out uninformative features in input data for regression problems, thereby improving the accuracy and stability of predictions. We introduce a feature selection pipeline that combines Integrated Gradients with k-means clustering to select an optimal set of variables from the initial data space. To validate the effectiveness of this approach, we apply it to a real-world industrial problem - blade vibration analysis in the development process of turbo machinery.
Authors: Mohsen Ghaffari, Mahsa Varshosaz, Einar Broch Johnsen, Andrzej W\k{a}sowski
Abstract: Tabular reinforcement learning methods cannot operate directly on continuous state spaces. One solution for this problem is to partition the state space. A good partitioning enables generalization during learning and more efficient exploitation of prior experiences. Consequently, the learning process becomes faster and produces more reliable policies. However, partitioning introduces approximation, which is particularly harmful in the presence of nonlinear relations between state components. An ideal partition should be as coarse as possible, while capturing the key structure of the state space for the given problem. This work extracts partitions from the environment dynamics by symbolic execution. We show that symbolic partitioning improves state space coverage with respect to environmental behavior and allows reinforcement learning to perform better for sparse rewards. We evaluate symbolic state space partitioning with respect to precision, scalability, learning agent performance and state space coverage for the learnt policies.
Authors: Alexander Rubinstein, Luca Scimeca, Damien Teney, Seong Joon Oh
Abstract: Training a diverse ensemble of models has several practical applications such as providing candidates for model selection with better out-of-distribution (OOD) generalization, and enabling the detection of OOD samples via Bayesian principles. An existing approach to diverse ensemble training encourages the models to disagree on provided OOD samples. However, the approach is computationally expensive and it requires well-separated ID and OOD examples, such that it has only been demonstrated in small-scale settings. $\textbf{Method.}$ This work presents a method for Scalable Ensemble Diversification (SED) applicable to large-scale settings (e.g. ImageNet) that does not require OOD samples. Instead, SED identifies hard training samples on the fly and encourages the ensemble members to disagree on these. To improve scaling, we show how to avoid the expensive computations in existing methods of exhaustive pairwise disagreements across models. $\textbf{Results.}$ We evaluate the benefits of diversification with experiments on ImageNet. First, for OOD generalization, we observe large benefits from the diversification in multiple settings including output-space (classical) ensembles and weight-space ensembles (model soups). Second, for OOD detection, we turn the diversity of ensemble hypotheses into a novel uncertainty score estimator that surpasses a large number of OOD detection baselines. Code is available here: https://github.com/AlexanderRubinstein/diverse-universe-public.
URLs: https://github.com/AlexanderRubinstein/diverse-universe-public.
Authors: Maximilian Andreas Hoefler, Karsten Mueller, Wojciech Samek
Abstract: Power grids serve as a vital component in numerous industries, seamlessly delivering electrical energy to industrial processes and technologies, making their safe and reliable operation indispensable. However, powerlines can be hard to inspect due to difficult terrain or harsh climatic conditions. Therefore, unmanned aerial vehicles are increasingly deployed to inspect powerlines, resulting in a substantial stream of visual data which requires swift and accurate processing. Deep learning methods have become widely popular for this task, proving to be a valuable asset in fault detection. In particular, the detection of insulator defects is crucial for predicting powerline failures, since their malfunction can lead to transmission disruptions. It is therefore of great interest to continuously maintain and rigorously inspect insulator components. In this work we propose a novel pipeline to tackle this task. We utilize state-of-the-art object detection to detect and subsequently classify individual insulator anomalies. Our approach addresses dataset challenges such as imbalance and motion-blurred images through a fine-tuning methodology which allows us to alter the classification focus of the model by increasing the classification accuracy of anomalous insulators. In addition, we employ explainable-AI tools for precise localization and explanation of anomalies. This proposed method contributes to the field of anomaly detection, particularly vision-based industrial inspection and predictive maintenance. We significantly improve defect detection accuracy by up to 13%, while also offering a detailed analysis of model mis-classifications and localization quality, showcasing the potential of our method on real-world data.
Authors: Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters
Abstract: Optimal decision-making under partial observability requires reasoning about the uncertainty of the environment's hidden state. However, most reinforcement learning architectures handle partial observability with sequence models that have no internal mechanism to incorporate uncertainty in their hidden state representation, such as recurrent neural networks, deterministic state-space models and transformers. Inspired by advances in probabilistic world models for reinforcement learning, we propose a standalone Kalman filter layer that performs closed-form Gaussian inference in linear state-space models and train it end-to-end within a model-free architecture to maximize returns. Similar to efficient linear recurrent layers, the Kalman filter layer processes sequential data using a parallel scan, which scales logarithmically with the sequence length. By design, Kalman filter layers are a drop-in replacement for other recurrent layers in standard model-free architectures, but importantly they include an explicit mechanism for probabilistic filtering of the latent state representation. Experiments in a wide variety of tasks with partial observability show that Kalman filter layers excel in problems where uncertainty reasoning is key for decision-making, outperforming other stateful models.
Authors: \'Alvaro Fern\'andez Corral, Nicol\'as Mendoza, Armin Iske, Andrey Yachmenev, Jochen K\"upper
Abstract: We present a computational framework for obtaining multidimensional phase-space solutions of systems of non-linear coupled differential equations, using high-order implicit Runge-Kutta Physics- Informed Neural Networks (IRK-PINNs) schemes. Building upon foundational work originally solving differential equations for fields depending on coordinates [J. Comput. Phys. 378, 686 (2019)], we adapt the scheme to a context where the coordinates are treated as functions. This modification enables us to efficiently solve equations of motion for a particle in an external field. Our scheme is particularly useful for explicitly time-independent and periodic fields. We apply this approach to successfully solve the equations of motion for a mass particle placed in a central force field and a charged particle in a periodic electric field.
Authors: Vincenzo Petrone, Enrico Ferrentino, Pasquale Chiacchio
Abstract: This position paper explores the integration of Artificial Intelligence (AI) into force-controlled robotic tasks within the scope of advanced manufacturing, a cornerstone of Industry 4.0. AI's role in enhancing robotic manipulators - key drivers in the Fourth Industrial Revolution - is rapidly leading to significant innovations in smart manufacturing. The objective of this article is to frame these innovations in practical force-controlled applications - e.g. deburring, polishing, and assembly tasks like peg-in-hole (PiH) - highlighting their necessity for maintaining high-quality production standards. By reporting on recent AI-based methodologies, this article contrasts them and identifies current challenges to be addressed in future research. The analysis concludes with a perspective on future research directions, emphasizing the need for common performance metrics to validate AI techniques, integration of various enhancements for performance optimization, and the importance of validating them in relevant scenarios. These future directions aim to provide consistency with already adopted approaches, so as to be compatible with manufacturing standards, increasing the relevance of AI-driven methods in both academic and industrial contexts.
Authors: Srikar Babu Gadipudi, Srujan Deolasee, Siva Kailas, Wenhao Luo, Katia Sycara, Woojun Kim
Abstract: Informative path planning (IPP) is a crucial task in robotics, where agents must design paths to gather valuable information about a target environment while adhering to resource constraints. Reinforcement learning (RL) has been shown to be effective for IPP, however, it requires environment interactions, which are risky and expensive in practice. To address this problem, we propose an offline RL-based IPP framework that optimizes information gain without requiring real-time interaction during training, offering safety and cost-efficiency by avoiding interaction, as well as superior performance and fast computation during execution -- key advantages of RL. Our framework leverages batch-constrained reinforcement learning to mitigate extrapolation errors, enabling the agent to learn from pre-collected datasets generated by arbitrary algorithms. We validate the framework through extensive simulations and real-world experiments. The numerical results show that our framework outperforms the baselines, demonstrating the effectiveness of the proposed approach.
Authors: Amna Khalid, Ayma Khalid, Umar Khalid
Abstract: The application of large language models (LLMs) in healthcare has gained significant attention due to their ability to process complex medical data and provide insights for clinical decision-making. These models have demonstrated substantial capabilities in understanding and generating natural language, which is crucial for medical documentation, diagnostics, and patient interaction. This review examines the trajectory of language models from their early stages to the current state-of-the-art LLMs, highlighting their strengths in healthcare applications and discussing challenges such as data privacy, bias, and ethical considerations. The potential of LLMs to enhance healthcare delivery is explored, alongside the necessary steps to ensure their ethical and effective integration into medical practice.
Authors: Agni Bandyopadhyay, Guenther Waxenegger-Wilfing
Abstract: This research introduces a novel application of a masked Proximal Policy Optimization (PPO) algorithm from the field of deep reinforcement learning (RL), for determining the most efficient sequence of space debris visitation, utilizing the Lambert solver as per Izzo's adaptation for individual rendezvous. The aim is to optimize the sequence in which all the given debris should be visited to get the least total time for rendezvous for the entire mission. A neural network (NN) policy is developed, trained on simulated space missions with varying debris fields. After training, the neural network calculates approximately optimal paths using Izzo's adaptation of Lambert maneuvers. Performance is evaluated against standard heuristics in mission planning. The reinforcement learning approach demonstrates a significant improvement in planning efficiency by optimizing the sequence for debris rendezvous, reducing the total mission time by an average of approximately {10.96\%} and {13.66\%} compared to the Genetic and Greedy algorithms, respectively. The model on average identifies the most time-efficient sequence for debris visitation across various simulated scenarios with the fastest computational speed. This approach signifies a step forward in enhancing mission planning strategies for space debris clearance.
Authors: Sara Incao, Carlo Mazzola, Giulia Belgiovine, Alessandra Sciutti
Abstract: The fusion of Large Language Models (LLMs) and robotic systems has led to a transformative paradigm in the robotic field, offering unparalleled capabilities not only in the communication domain but also in skills like multimodal input handling, high-level reasoning, and plan generation. The grounding of LLMs knowledge into the empirical world has been considered a crucial pathway to exploit the efficiency of LLMs in robotics. Nevertheless, connecting LLMs' representations to the external world with multimodal approaches or with robots' bodies is not enough to let them understand the meaning of the language they are manipulating. Taking inspiration from humans, this work draws attention to three necessary elements for an agent to grasp and experience the world. The roadmap for LLMs grounding is envisaged in an active bodily system as the reference point for experiencing the environment, a temporally structured experience for a coherent, self-related interaction with the external world, and social skills to acquire a common-grounded shared experience.
Authors: Chunhui Zhang, Li Liu, Guanjie Huang, Hao Wen, Xi Zhou, Yanfeng Wang
Abstract: Over the past decade, significant progress has been made in visual object tracking, largely due to the availability of large-scale training datasets. However, existing tracking datasets are primarily focused on open-air scenarios, which greatly limits the development of object tracking in underwater environments. To address this issue, we take a step forward by proposing the first large-scale underwater camouflaged object tracking dataset, namely UW-COT. Based on the proposed dataset, this paper presents an experimental evaluation of several advanced visual object tracking methods and the latest advancements in image and video segmentation. Specifically, we compare the performance of the Segment Anything Model (SAM) and its updated version, SAM 2, in challenging underwater environments. Our findings highlight the improvements in SAM 2 over SAM, demonstrating its enhanced capability to handle the complexities of underwater camouflaged objects. Compared to current advanced visual object tracking methods, the latest video segmentation foundation model SAM 2 also exhibits significant advantages, providing valuable insights into the development of more effective tracking technologies for underwater scenarios. The dataset will be accessible at \color{magenta}{https://github.com/983632847/Awesome-Multimodal-Object-Tracking}.
URLs: https://github.com/983632847/Awesome-Multimodal-Object-Tracking
Authors: Yalan Qin, Nan Pu, Hanzhou Wu, Nicu Sebe
Abstract: Multi-view clustering aims to study the complementary information across views and discover the underlying structure. For solving the relatively high computational cost for the existing approaches, works based on anchor have been presented recently. Even with acceptable clustering performance, these methods tend to map the original representation from multiple views into a fixed shared graph based on the original dataset. However, most studies ignore the discriminative property of the learned anchors, which ruin the representation capability of the built model. Moreover, the complementary information among anchors across views is neglected to be ensured by simply learning the shared anchor graph without considering the quality of view-specific anchors. In this paper, we propose discriminative anchor learning for multi-view clustering (DALMC) for handling the above issues. We learn discriminative view-specific feature representations according to the original dataset and build anchors from different views based on these representations, which increase the quality of the shared anchor graph. The discriminative feature learning and consensus anchor graph construction are integrated into a unified framework to improve each other for realizing the refinement. The optimal anchors from multiple views and the consensus anchor graph are learned with the orthogonal constraints. We give an iterative algorithm to deal with the formulated problem. Extensive experiments on different datasets show the effectiveness and efficiency of our method compared with other methods.
Authors: Wanqi Yang, Yanda Li, Meng Fang, Ling Chen
Abstract: Time-Sensitive Question Answering (TSQA) demands the effective utilization of specific temporal contexts, encompassing multiple time-evolving facts, to address time-sensitive questions. This necessitates not only the parsing of temporal information within questions but also the identification and understanding of time-evolving facts to generate accurate answers. However, current large language models still have limited sensitivity to temporal information and their inadequate temporal reasoning capabilities.In this paper, we propose a novel framework that enhances temporal awareness and reasoning through Temporal Information-Aware Embedding and Granular Contrastive Reinforcement Learning. Experimental results on four TSQA datasets demonstrate that our framework significantly outperforms existing LLMs in TSQA tasks, marking a step forward in bridging the performance gap between machine and human temporal understanding and reasoning.
Authors: Zhichen Han, Tianqi Geng, Hui Feng, Jiahong Yuan, Korin Richmond, Yuanchao Li
Abstract: Utilizing Self-Supervised Learning (SSL) models for Speech Emotion Recognition (SER) has proven effective, yet limited research has explored cross-lingual scenarios. This study presents a comparative analysis between human performance and SSL models, beginning with a layer-wise analysis and an exploration of parameter-efficient fine-tuning strategies in monolingual, cross-lingual, and transfer learning contexts. We further compare the SER ability of models and humans at both utterance- and segment-levels. Additionally, we investigate the impact of dialect on cross-lingual SER through human evaluation. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers. We also demonstrate the significant effect of dialect on SER for individuals without prior linguistic and paralinguistic background. Moreover, both humans and models exhibit distinct behaviors across different emotions. These results offer new insights into the cross-lingual SER capabilities of SSL models, underscoring both their similarities to and differences from human emotion perception.
Authors: Emanuela Boros, Maud Ehrmann
Abstract: This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models' performance on noisy text.
Authors: Yuanchao Li, Zixing Zhang, Jing Han, Peter Bell, Catherine Lai
Abstract: The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Frechet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.
Authors: Hongliang Zhong, Can Wang, Jingbo Zhang, Jing Liao
Abstract: Generating and inserting new objects into 3D content is a compelling approach for achieving versatile scene recreation. Existing methods, which rely on SDS optimization or single-view inpainting, often struggle to produce high-quality results. To address this, we propose a novel method for object insertion in 3D content represented by Gaussian Splatting. Our approach introduces a multi-view diffusion model, dubbed MVInpainter, which is built upon a pre-trained stable video diffusion model to facilitate view-consistent object inpainting. Within MVInpainter, we incorporate a ControlNet-based conditional injection module to enable controlled and more predictable multi-view generation. After generating the multi-view inpainted results, we further propose a mask-aware 3D reconstruction technique to refine Gaussian Splatting reconstruction from these sparse inpainted views. By leveraging these fabricate techniques, our approach yields diverse results, ensures view-consistent and harmonious insertions, and produces better object quality. Extensive experiments demonstrate that our approach outperforms existing methods.
Authors: Phu Pham, Dipam Patel, Damon Conover, Aniket Bera
Abstract: We introduce Go-SLAM, a novel framework that utilizes 3D Gaussian Splatting SLAM to reconstruct dynamic environments while embedding object-level information within the scene representations. This framework employs advanced object segmentation techniques, assigning a unique identifier to each Gaussian splat that corresponds to the object it represents. Consequently, our system facilitates open-vocabulary querying, allowing users to locate objects using natural language descriptions. Furthermore, the framework features an optimal path generation module that calculates efficient navigation paths for robots toward queried objects, considering obstacles and environmental uncertainties. Comprehensive evaluations in various scene settings demonstrate the effectiveness of our approach in delivering high-fidelity scene reconstructions, precise object segmentation, flexible object querying, and efficient robot path planning. This work represents an additional step forward in bridging the gap between 3D scene reconstruction, semantic object understanding, and real-time environment interactions.
Authors: Vineet Punyamoorty, Pascal Jutras-Dub\'e, Ruqi Zhang, Vaneet Aggarwal, Damon Conover, Aniket Bera
Abstract: By framing reinforcement learning as a sequence modeling problem, recent work has enabled the use of generative models, such as diffusion models, for planning. While these models are effective in predicting long-horizon state trajectories in deterministic environments, they face challenges in dynamic settings with moving obstacles. Effective collision avoidance demands continuous monitoring and adaptive decision-making. While replanning at every timestep could ensure safety, it introduces substantial computational overhead due to the repetitive prediction of overlapping state sequences -- a process that is particularly costly with diffusion models, known for their intensive iterative sampling procedure. We propose an adaptive generative planning approach that dynamically adjusts replanning frequency based on the uncertainty of action predictions. Our method minimizes the need for frequent, computationally expensive, and redundant replanning while maintaining robust collision avoidance performance. In experiments, we obtain a 13.5% increase in the mean trajectory length and a 12.7% increase in mean reward over long-horizon planning, indicating a reduction in collision rates and an improved ability to navigate the environment safely.
Authors: Rafael Mendoza, Isabella Cruz, Richard Liu, Aarav Deshmukh, David Williams, Jesscia Peng, Rohan Iyer
Abstract: Large language models (LLMs) have revolutionized how we interact with technology, but their personalization to individual user preferences remains a significant challenge, particularly in on-device applications. Traditional methods often depend heavily on labeled datasets and can be resource-intensive. To address these issues, we present Adaptive Self-Supervised Learning Strategies (ASLS), which utilizes self-supervised learning techniques to personalize LLMs dynamically. The framework comprises a user profiling layer for collecting interaction data and a neural adaptation layer for real-time model fine-tuning. This innovative approach enables continuous learning from user feedback, allowing the model to generate responses that align closely with user-specific contexts. The adaptive mechanisms of ASLS minimize computational demands and enhance personalization efficiency. Experimental results across various user scenarios illustrate the superior performance of ASLS in boosting user engagement and satisfaction, highlighting its potential to redefine LLMs as highly responsive and context-aware systems on-device.
Authors: Zeyneb N. Kaya, Souvick Ghosh
Abstract: There have been rapid advancements in the capabilities of large language models (LLMs) in recent years, greatly revolutionizing the field of natural language processing (NLP) and artificial intelligence (AI) to understand and interact with human language. Therefore, in this work, we conduct a systematic investigation of the literature to identify the prominent themes and directions of LLM developments, impacts, and limitations. Our findings illustrate the aims, methodologies, limitations, and future directions of LLM research. It includes responsible development considerations, algorithmic improvements, ethical challenges, and societal implications of LLM development. Overall, this paper provides a rigorous and comprehensive overview of current research in LLM and identifies potential directions for future development. The article highlights the application areas that could have a positive impact on society along with the ethical considerations.
Authors: Elisa Nguyen, Johannes Bertram, Evgenii Kortukov, Jean Y. Song, Seong Joon Oh
Abstract: While Explainable AI (XAI) aims to make AI understandable and useful to humans, it has been criticised for relying too much on formalism and solutionism, focusing more on mathematical soundness than user needs. We propose an alternative to this bottom-up approach inspired by design thinking: the XAI research community should adopt a top-down, user-focused perspective to ensure user relevance. We illustrate this with a relatively young subfield of XAI, Training Data Attribution (TDA). With the surge in TDA research and growing competition, the field risks repeating the same patterns of solutionism. We conducted a needfinding study with a diverse group of AI practitioners to identify potential user needs related to TDA. Through interviews (N=10) and a systematic survey (N=31), we uncovered new TDA tasks that are currently largely overlooked. We invite the TDA and XAI communities to consider these novel tasks and improve the user relevance of their research outcomes.
Authors: Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Lei Su, Tong Yang
Abstract: As the foundation of large language models (LLMs), self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats like INT4, etc. Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data format.
Authors: Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Abstract: "Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-instruct and conduct both qualitative and quantitative analyses of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
Authors: Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, S\'ebastien Brati\`eres, Paolo Merialdo, Simone Scardapane
Abstract: The remarkable performance achieved by Large Language Models (LLM) has driven research efforts to leverage them for a wide range of tasks and input modalities. In speech-to-text (S2T) tasks, the emerging solution consists of projecting the output of the encoder of a Speech Foundational Model (SFM) into the LLM embedding space through an adapter module. However, no work has yet investigated how much the downstream-task performance depends on each component (SFM, adapter, LLM) nor whether the best design of the adapter depends on the chosen SFM and LLM. To fill this gap, we evaluate the combination of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, namely Automatic Speech Recognition and Speech Translation. Our results demonstrate that the SFM plays a pivotal role in downstream performance, while the adapter choice has moderate impact and depends on the SFM and LLM.
Authors: Phillip Mueller, Sebastian Mueller, Lars Mikelsons
Abstract: We provide a dataset for enabling Deep Generative Models (DGMs) in engineering design and propose methods to automate data labeling by utilizing large-scale foundation models. GeoBiked is curated to contain 4 355 bicycle images, annotated with structural and technical features and is used to investigate two automated labeling techniques: The utilization of consolidated latent features (Hyperfeatures) from image-generation models to detect geometric correspondences (e.g. the position of the wheel center) in structural images and the generation of diverse text descriptions for structural images. GPT-4o, a vision-language-model (VLM), is instructed to analyze images and produce diverse descriptions aligned with the system-prompt. By representing technical images as Diffusion-Hyperfeatures, drawing geometric correspondences between them is possible. The detection accuracy of geometric points in unseen samples is improved by presenting multiple annotated source images. GPT-4o has sufficient capabilities to generate accurate descriptions of technical images. Grounding the generation only on images leads to diverse descriptions but causes hallucinations, while grounding it on categorical labels restricts the diversity. Using both as input balances creativity and accuracy. Successfully using Hyperfeatures for geometric correspondence suggests that this approach can be used for general point-detection and annotation tasks in technical images. Labeling such images with text descriptions using VLMs is possible, but dependent on the models detection capabilities, careful prompt-engineering and the selection of input information. Applying foundation models in engineering design is largely unexplored. We aim to bridge this gap with a dataset to explore training, finetuning and conditioning DGMs in this field and suggesting approaches to bootstrap foundation models to process technical images.
Authors: Fangshuo Zhou, Huaxia Li, Rui Hu, Sensen Wu, Hailin Feng, Zhenhong Du, Liuchang Xu
Abstract: Volunteer Geographic Information (VGI), with its rich variety, large volume, rapid updates, and diverse sources, has become a critical source of geospatial data. However, VGI data from platforms like OSM exhibit significant quality heterogeneity across different data types, particularly with urban building data. To address this, we propose a multi-source geographic data transformation solution, utilizing accessible and complete VGI data to assist in generating urban building footprint data. We also employ a multimodal data generation framework to improve accuracy. First, we introduce a pipeline for constructing an 'image-text-metadata-building footprint' dataset, primarily based on road network data and supplemented by other multimodal data. We then present ControlCity, a geographic data transformation method based on a multimodal diffusion model. This method first uses a pre-trained text-to-image model to align text, metadata, and building footprint data. An improved ControlNet further integrates road network and land-use imagery, producing refined building footprint data. Experiments across 22 global cities demonstrate that ControlCity successfully simulates real urban building patterns, achieving state-of-the-art performance. Specifically, our method achieves an average FID score of 50.94, reducing error by 71.01% compared to leading methods, and a MIoU score of 0.36, an improvement of 38.46%. Additionally, our model excels in tasks like urban morphology transfer, zero-shot city generation, and spatial data completeness assessment. In the zero-shot city task, our method accurately predicts and generates similar urban structures, demonstrating strong generalization. This study confirms the effectiveness of our approach in generating urban building footprint data and capturing complex city characteristics.
Authors: Neda Zamanitajeddin, Mostafa Jahanifar, Kesi Xu, Fouzia Siraj, Nasir Rajpoot
Abstract: Deep learning models have shown immense promise in computational pathology (CPath) tasks, but their performance often suffers when applied to unseen data due to domain shifts. Addressing this requires domain generalization (DG) algorithms. However, a systematic evaluation of DG algorithms in the CPath context is lacking. This study aims to benchmark the effectiveness of 30 DG algorithms on 3 CPath tasks of varying difficulty through 7,560 cross-validation runs. We evaluate these algorithms using a unified and robust platform, incorporating modality-specific techniques and recent advances like pretrained foundation models. Our extensive cross-validation experiments provide insights into the relative performance of various DG strategies. We observe that self-supervised learning and stain augmentation consistently outperform other methods, highlighting the potential of pretrained models and data augmentation. Furthermore, we introduce a new pan-cancer tumor detection dataset (HISTOPANTUM) as a benchmark for future research. This study offers valuable guidance to researchers in selecting appropriate DG approaches for CPath tasks.
Authors: Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
Abstract: The subjective quality of natural signals can be approximated with objective perceptual metrics. Designed to approximate the perceptual behaviour of human observers, perceptual metrics often reflect structures found in natural signals and neurological pathways. Models trained with perceptual metrics as loss functions can capture perceptually meaningful features from the structures held within these metrics. We demonstrate that using features extracted from autoencoders trained with perceptual losses can improve performance on music understanding tasks, i.e. genre classification, over using these metrics directly as distances when learning a classifier. This result suggests improved generalisation to novel signals when using perceptual metrics as loss functions for representation learning.
Authors: Luigi Russo, Francesco Mauro, Alessandro Sebastianelli, Paolo Gamba, Silvia Liberata Ullo
Abstract: Climate change and increasing droughts pose significant challenges to water resource management around the world. These problems lead to severe water shortages that threaten ecosystems, agriculture, and human communities. To advance the fight against these challenges, we present a new dataset, SEN12-WATER, along with a benchmark using a novel end-to-end Deep Learning (DL) framework for proactive drought-related analysis. The dataset, identified as a spatiotemporal datacube, integrates SAR polarization, elevation, slope, and multispectral optical bands. Our DL framework enables the analysis and estimation of water losses over time in reservoirs of interest, revealing significant insights into water dynamics for drought analysis by examining temporal changes in physical quantities such as water volume. Our methodology takes advantage of the multitemporal and multimodal characteristics of the proposed dataset, enabling robust generalization and advancing understanding of drought, contributing to climate change resilience and sustainable water resource management. The proposed framework involves, among the several components, speckle noise removal from SAR data, a water body segmentation through a U-Net architecture, the time series analysis, and the predictive capability of a Time-Distributed-Convolutional Neural Network (TD-CNN). Results are validated through ground truth data acquired on-ground via dedicated sensors and (tailored) metrics, such as Precision, Recall, Intersection over Union, Mean Squared Error, Structural Similarity Index Measure and Peak Signal-to-Noise Ratio.
Authors: Xinrui Zhou, Yuhao Huang, Haoran Dou, Shijing Chen, Ao Chang, Jia Liu, Weiran Long, Jian Zheng, Erjiao Xu, Jie Ren, Ruobing Huang, Jun Cheng, Wufeng Xue, Dong Ni
Abstract: In the medical field, the limited availability of large-scale datasets and labor-intensive annotation processes hinder the performance of deep models. Diffusion-based generative augmentation approaches present a promising solution to this issue, having been proven effective in advancing downstream medical recognition tasks. Nevertheless, existing works lack sufficient semantic and sequential steerability for challenging video/3D sequence generation, and neglect quality control of noisy synthesized samples, resulting in unreliable synthetic databases and severely limiting the performance of downstream tasks. In this work, we present Ctrl-GenAug, a novel and general generative augmentation framework that enables highly semantic- and sequential-customized sequence synthesis and suppresses incorrectly synthesized samples, to aid medical sequence classification. Specifically, we first design a multimodal conditions-guided sequence generator for controllably synthesizing diagnosis-promotive samples. A sequential augmentation module is integrated to enhance the temporal/stereoscopic coherence of generated samples. Then, we propose a noisy synthetic data filter to suppress unreliable cases at semantic and sequential levels. Extensive experiments on 3 medical datasets, using 11 networks trained on 3 paradigms, comprehensively analyze the effectiveness and generality of Ctrl-GenAug, particularly in underrepresented high-risk populations and out-domain conditions.
Authors: Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab
Abstract: Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, ours marks the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical framework of accumulator-aware extensions designed to endow overflow avoidance guarantees to existing layer-wise PTQ algorithms. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across image classification and language generation models, and observe significant improvements in the trade-off between accumulator bit width and model accuracy over baseline methods.
Authors: Mert Keser, Gesina Schwalbe, Niki Amini-Naieni, Matthias Rottmann, Alois Knoll
Abstract: Ontological commitment, i.e., used concepts, relations, and assumptions, are a corner stone of qualitative reasoning (QR) models. The state-of-the-art for processing raw inputs, though, are deep neural networks (DNNs), nowadays often based off from multimodal foundation models. These automatically learn rich representations of concepts and respective reasoning. Unfortunately, the learned qualitative knowledge is opaque, preventing easy inspection, validation, or adaptation against available QR models. So far, it is possible to associate pre-defined concepts with latent representations of DNNs, but extractable relations are mostly limited to semantic similarity. As a next step towards QR for validation and verification of DNNs: Concretely, we propose a method that extracts the learned superclass hierarchy from a multimodal DNN for a given set of leaf concepts. Under the hood we (1) obtain leaf concept embeddings using the DNN's textual input modality; (2) apply hierarchical clustering to them, using that DNNs encode semantic similarities via vector distances; and (3) label the such-obtained parent concepts using search in available ontologies from QR. An initial evaluation study shows that meaningful ontological class hierarchies can be extracted from state-of-the-art foundation models. Furthermore, we demonstrate how to validate and verify a DNN's learned representations against given ontologies. Lastly, we discuss potential future applications in the context of QR.
Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
Abstract: Large language model pre-training has traditionally relied on human experts to craft heuristics for improving the corpora quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual example effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform either original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, with 14.6% for Llama-2-7B and 20.3% for CodeLlama-7B, all within 10B tokens to be comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training.We are open-sourcing ProX with >100B corpus, models, and sharing all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
Authors: Andrew Goldberg, Kavish Kondap, Tianshuang Qiu, Zehan Ma, Letian Fu, Justin Kerr, Huang Huang, Kaiyuan Chen, Kuan Fang, Ken Goldberg
Abstract: Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the rich history of research in industrial ''Design for Assembly'', we introduce a novel problem: Generative Design-for-Robot-Assembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., ''giraffe'') and an image of available physical components, such as 3D-printed blocks. The output is an assembly, a spatial arrangement of these components, and instructions for a robot to build this assembly. The output must 1) resemble the requested object and 2) be reliably assembled by a 6 DoF robot arm with a suction gripper. We then present Blox-Net, a GDfRA system that combines generative vision language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems with minimal human supervision. Blox-Net achieved a Top-1 accuracy of 63.5% in the ''recognizability'' of its designed assemblies (eg, resembling giraffe as judged by a VLM). These designs, after automated perturbation redesign, were reliably assembled by a robot, achieving near-perfect success across 10 consecutive assembly iterations with human intervention only during reset prior to assembly. Surprisingly, this entire design process from textual word (''giraffe'') to reliable physical assembly is performed with zero human intervention.
Authors: Fazal Mittu, Yihuan Bu, Akshat Gupta, Ashok Devireddy, Alp Eren Ozdarendeli, Anant Singh, Gopala Anumanchipalli
Abstract: While the language modeling objective has been shown to be deeply connected with compression, it is surprising that modern LLMs are not employed in practical text compression systems. In this paper, we provide an in-depth analysis of neural network and transformer-based compression techniques to answer this question. We compare traditional text compression systems with neural network and LLM-based text compression methods. Although LLM-based systems significantly outperform conventional compression methods, they are highly impractical. Specifically, LLMZip, a recent text compression system using Llama3-8B requires 9.5 days to compress just 10 MB of text, although with huge improvements in compression ratios. To overcome this, we present FineZip - a novel LLM-based text compression system that combines ideas of online memorization and dynamic context to reduce the compression time immensely. FineZip can compress the above corpus in approximately 4 hours compared to 9.5 days, a 54 times improvement over LLMZip and comparable performance. FineZip outperforms traditional algorithmic compression methods with a large margin, improving compression ratios by approximately 50\%. With this work, we take the first step towards making lossless text compression with LLMs a reality. While FineZip presents a significant step in that direction, LLMs are still not a viable solution for large-scale text compression. We hope our work paves the way for future research and innovation to solve this problem.
Authors: Runpeng Yu, Weihao Yu, Xinchao Wang
Abstract: Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can also accept images as input, thus showcasing more interesting emergent capabilities and demonstrating impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored to enhance LVLMs' capabilities of perceiving visual information. However, previous visual prompting techniques solely process visual inputs without considering text queries, limiting the models' ability to follow text instructions to complete tasks. To fill this gap, in this work, we propose a new prompting technique named Attention Prompting on Image, which just simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLM on various tasks. Specifically, we generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP. Then the heatmap simply multiplies the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vison-language benchmarks verify the effectiveness of our technique. For example, Attention Prompting on Image improves LLaVA-1.5 by 3.8% and 2.9% on MM-Vet and LLaVA-Wild benchmarks, respectively.
Authors: Francisco Aguilera-Mart\'inez, Fernando Berzal
Abstract: Training machine learning models based on neural networks requires large datasets, which may contain sensitive information. The models, however, should not expose private information from these datasets. Differentially private SGD [DP-SGD] requires the modification of the standard stochastic gradient descent [SGD] algorithm for training new models. In this short paper, a novel regularization strategy is proposed to achieve the same goal in a more efficient manner.
Authors: Siddhant Bhambri, Mudit Verma, Upasana Biswas, Anil Murthy, Subbarao Kambhampati
Abstract: Preference-based Reinforcement Learning (PbRL) has made significant strides in single-agent settings, but has not been studied for multi-agent frameworks. On the other hand, modeling cooperation between multiple agents, specifically, Human-AI Teaming settings while ensuring successful task completion is a challenging problem. To this end, we perform the first investigation of multi-agent PbRL by extending single-agent PbRL to the two-agent teaming settings and formulate it as a Human-AI PbRL Cooperation Game, where the RL agent queries the human-in-the-loop to elicit task objective and human's preferences on the joint team behavior. Under this game formulation, we first introduce the notion of Human Flexibility to evaluate team performance based on if humans prefer to follow a fixed policy or adapt to the RL agent on the fly. Secondly, we study the RL agent's varying access to the human policy. We highlight a special case along these two dimensions, which we call Specified Orchestration, where the human is least flexible and agent has complete access to human policy. We motivate the need for taking Human Flexibility into account and the usefulness of Specified Orchestration through a gamified user study. We evaluate state-of-the-art PbRL algorithms for Human-AI cooperative setups through robot locomotion based domains that explicitly require forced cooperation. Our findings highlight the challenges associated with PbRL by varying Human Flexibility and agent's access to the human policy. Finally, we draw insights from our user study and empirical results, and conclude that Specified Orchestration can be seen as an upper bound PbRL performance for future research in Human-AI teaming scenarios.
Authors: Liming Xu, Stephen Mak, Yaniv Proselkov, Alexandra Brintrup
Abstract: Recent global disruptions, such as the pandemic and geopolitical conflicts, have profoundly exposed vulnerabilities in traditional supply chains, requiring exploration of more resilient alternatives. Autonomous supply chains (ASCs) have emerged as a potential solution, offering increased visibility, flexibility, and resilience in turbulent trade environments. Despite discussions in industry and academia over several years, ASCs lack well-established theoretical foundations. This paper addresses this research gap by presenting a formal definition of ASC along with its defining characteristics and auxiliary concepts. We propose a layered conceptual framework called the MIISI model. An illustrative case study focusing on the meat supply chain demonstrates an initial ASC implementation based on this conceptual model. Additionally, we introduce a seven-level supply chain autonomy reference model, delineating a trajectory towards achieving a full supply chain autonomy. Recognising that this work represents an initial endeavour, we emphasise the need for continued exploration in this emerging domain. We anticipate that this work will stimulate further research, both theoretical and technical, and contribute to the continual evolution of ASCs.
Authors: Sevin Mohammadi, Andrew W. Smyth
Abstract: Large-scale geolocation telematics data acquired from connected vehicles has the potential to significantly enhance mobility infrastructures and operational systems within smart cities. To effectively utilize this data, it is essential to accurately match the geolocation data to the road segments. However, this matching is often not trivial due to the low sampling rate and errors exacerbated by multipath effects in urban environments. Traditionally, statistical modeling techniques such as Hidden-Markov models incorporating domain knowledge into the matching process have been extensively used for map-matching tasks. However, rule-based map-matching tasks are noise-sensitive and inefficient in processing large-scale trajectory data. Deep learning techniques directly learn the relationship between observed data and road networks from the data, often without the need for hand-crafted rules or domain knowledge. This renders them an efficient approach for map-matching large-scale datasets and more robust to the noise. This paper introduces a deep-learning model, specifically the transformer-based encoder-decoder model, to perform as a surrogate for offline map-matching algorithms. The encoder-decoder architecture initially encodes the series of noisy GPS points into a representation that automatically captures autoregressive behavior and spatial correlations between GPS points. Subsequently, the decoder associates data points with the road network features and thus transforms these representations into a sequence of road segments. The model is trained and evaluated using GPS traces collected in Manhattan, New York. Achieving an accuracy of 75%, transformer-based encoder-decoder models extensively employed in natural language processing presented a promising performance for translating noisy GPS data to the navigated routes in urban road networks.
Authors: Gianmarco Mengaldo
Abstract: The scientific method is the cornerstone of human progress across all branches of the natural and applied sciences, from understanding the human body to explaining how the universe works. The scientific method is based on identifying systematic rules or principles that describe the phenomenon of interest in a reproducible way that can be validated through experimental evidence. In the era of artificial intelligence (AI), there are discussions on how AI systems may discover new knowledge. We argue that human complex reasoning for scientific discovery remains of vital importance, at least before the advent of artificial general intelligence. Yet, AI can be leveraged for scientific discovery via explainable AI. More specifically, knowing what data AI systems deemed important to make decisions can be a point of contact with domain experts and scientists, that can lead to divergent or convergent views on a given scientific problem. Divergent views may spark further scientific investigations leading to new scientific knowledge.
Authors: Liang Zhang, Jionghao Lin, Ziyi Kuang, Sheng Xu, Xiangen Hu
Abstract: Dialogue-based Intelligent Tutoring Systems (ITSs) have significantly advanced adaptive and personalized learning by automating sophisticated human tutoring strategies within interactive dialogues. However, replicating the nuanced patterns of expert human communication remains a challenge in Natural Language Processing (NLP). Recent advancements in NLP, particularly Large Language Models (LLMs) such as OpenAI's GPT-4, offer promising solutions by providing human-like and context-aware responses based on extensive pre-trained knowledge. Motivated by the effectiveness of LLMs in various educational tasks (e.g., content creation and summarization, problem-solving, and automated feedback provision), our study introduces the Socratic Playground for Learning (SPL), a dialogue-based ITS powered by the GPT-4 model, which employs the Socratic teaching method to foster critical thinking among learners. Through extensive prompt engineering, SPL can generate specific learning scenarios and facilitates efficient multi-turn tutoring dialogues. The SPL system aims to enhance personalized and adaptive learning experiences tailored to individual needs, specifically focusing on improving critical thinking skills. Our pilot experimental results from essay writing tasks demonstrate SPL has the potential to improve tutoring interactions and further enhance dialogue-based ITS functionalities. Our study, exemplified by SPL, demonstrates how LLMs enhance dialogue-based ITSs and expand the accessibility and efficacy of educational technologies.
Authors: Rujia Shen, Yang Yang, Yaoxion Lin, Liangliang Liu, Boran Wang, Yi Guan, Jingchi Jiang
Abstract: Time series forecasting (TSF) plays a crucial role in various applications, including electricity transformation, medical monitoring, and crop growth. Despite the advancements in deep learning methods for TSF, their capacity to predict long-term series remains constrained. This limitation arises from the failure to account for both intra- and inter-variable variations meanwhile. To mitigate this challenge, we introduce the Fi$^2$VBlock, which leverages a \textbf{F}requency domain perspective to capture \textbf{i}ntra- and \textbf{i}nter-variable \textbf{V}ariations. After transforming into the frequency domain via the Frequency Transform Module, the Frequency Cross Attention between the real and imaginary parts is designed to obtain enhanced frequency representations and capture intra-variable variations. Furthermore, Inception blocks are employed to integrate information, thus capturing correlations across different variables. Our backbone network, Fi$^2$VTS, employs a residual architecture by concatenating multiple Fi$^2$VBlocks, thereby preventing degradation issues. Theoretically, we demonstrate that Fi$^2$VTS achieves a substantial reduction in both time and memory complexity, decreasing from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ per Fi$^2$VBlock computation. Empirical evaluations on three benchmark datasets reveal that Fi$^2$VTS delivers an overall relative Mean Squared Error (MSE) reduction of 30\% and an overall relative Mean Absolute Error (MAE) reduction of 22\% when compared to the latest state-of-the-art methods. The implementation code is accessible at \url{https://github.com/HITshenrj/Fi2VTS}.
Authors: Zhanpeng Chen, Chengjin Xu, Yiyan Qi, Jian Guo
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in processing and generating content across multiple data modalities. However, a significant drawback of MLLMs is their reliance on static training data, leading to outdated information and limited contextual awareness. This static nature hampers their ability to provide accurate and up-to-date responses, particularly in dynamic or rapidly evolving contexts. Though integrating Multimodal Retrieval-augmented Generation (Multimodal RAG) offers a promising solution, the system would inevitably encounter the multi-granularity noisy correspondence (MNC) problem, which hinders accurate retrieval and generation. In this work, we propose RagVL, a novel framework with knowledge-enhanced reranking and noise-injected training, to address these limitations. We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability and serve it as a reranker to precisely filter the top-k retrieved images. For generation, we inject visual noise during training at the data and token levels to enhance the generator's robustness. Extensive experiments on the subsets of two datasets that require retrieving and reasoning over images to answer a given query verify the effectiveness of our method. Code and models are available at https://github.com/IDEA-FinAI/RagVL.
Authors: Ismail Labiad, Thomas B\"ack, Pierre Fernandez, Laurent Najman, Tom Sander, Furong Ye, Mariia Zameshina, Olivier Teytaud
Abstract: In many cases, adversarial attacks are based on specialized algorithms specifically dedicated to attacking automatic image classifiers. These algorithms perform well, thanks to an excellent ad hoc distribution of initial attacks. However, these attacks are easily detected due to their specific initial distribution. We therefore consider other black-box attacks, inspired from generic black-box optimization tools, and in particular the log-normal algorithm. We apply the log-normal method to the attack of fake detectors, and get successful attacks: importantly, these attacks are not detected by detectors specialized on classical adversarial attacks. Then, combining these attacks and deep detection, we create improved fake detectors.
Authors: Rui Gan, Haotian Shi, Pei Li, Keshu Wu, Bocheng An, Linheng Li, Junyi Ma, Chengyuan Ma, Bin Ran
Abstract: Vehicle trajectory prediction plays a vital role in intelligent transportation systems and autonomous driving, as it significantly affects vehicle behavior planning and control, thereby influencing traffic safety and efficiency. Numerous studies have been conducted to predict short-term vehicle trajectories in the immediate future. However, long-term trajectory prediction remains a major challenge due to accumulated errors and uncertainties. Additionally, balancing accuracy with interpretability in the prediction is another challenging issue in predicting vehicle trajectory. To address these challenges, this paper proposes a Goal-based Neural Physics Vehicle Trajectory Prediction Model (GNP). The GNP model simplifies vehicle trajectory prediction into a two-stage process: determining the vehicle's goal and then choosing the appropriate trajectory to reach this goal. The GNP model contains two sub-modules to achieve this process. The first sub-module employs a multi-head attention mechanism to accurately predict goals. The second sub-module integrates a deep learning model with a physics-based social force model to progressively predict the complete trajectory using the generated goals. The GNP demonstrates state-of-the-art long-term prediction accuracy compared to four baseline models. We provide interpretable visualization results to highlight the multi-modality and inherent nature of our neural physics framework. Additionally, ablation studies are performed to validate the effectiveness of our key designs.
Authors: Taowen Wang, Yiyang Liu, James Chenhao Liang, junhan zhao, Yiming Cui, Yuning Mao, Shaoliang Nie, Jiahao Liu, Fuli Feng, Zenglin Xu, Cheng Han, Lifu Huang, Qifan Wang, Dongfang Liu
Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains, with increasing emphasis on enhancing their zero-shot generalization capabilities for unseen tasks across various modalities. Instruction tuning has emerged as an effective strategy for achieving zero-shot generalization by finetuning pretrained models on diverse multimodal tasks. As the scale of MLLMs continues to grow, parameter-efficient finetuning becomes increasingly critical. However, most existing parameter-efficient approaches focus only on single modalities and often overlook the multimodal characteristics during finetuning. In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs. M$^2$PT effectively integrates visual and textual prompts into the vision encoder and language processor respectively during finetuning, facilitating the extraction and alignment of features across modalities. Empirical results on various multimodal evaluation datasets demonstrate the superior performance of our approach compared to several state-of-the-art baselines. A comprehensive set of ablation studies validates the effectiveness of our prompt design and the efficiency of our approach.
Authors: Moucheng Xu, Evangelos Chatzaroulas, Luc McCutcheon, Abdul Ahad, Hamzah Azeem, Janusz Marecki, Ammar Anwar
Abstract: A Standard Operating Procedure (SOP) defines a low-level, step-by-step written guide for a business software workflow based on a video demonstration. SOPs are a crucial step toward automating end-to-end software workflows. Manually creating SOPs can be time-consuming. Recent advancements in large video-language models offer the potential for automating SOP generation by analyzing recordings of human demonstrations. However, current large video-language models face challenges with zero-shot SOP generation. We explore in-context learning with video-language models for SOP generation. We report that in-context learning sometimes helps video-language models at SOP generation. We then propose an in-context ensemble learning to further enhance the capabilities of the models in SOP generation.
Authors: Qian-Wen Zhang, Haochen Wang, Fang Li, Siyu An, Lingfeng Qiao, Liangcai Gao, Di Yin, Xing Sun
Abstract: Online education platforms have significantly transformed the dissemination of educational resources by providing a dynamic and digital infrastructure. With the further enhancement of this transformation, the advent of Large Language Models (LLMs) has elevated the intelligence levels of these platforms. However, current academic benchmarks provide limited guidance for real-world industry scenarios. This limitation arises because educational applications require more than mere test question responses. To bridge this gap, we introduce CJEval, a benchmark based on Chinese Junior High School Exam Evaluations. CJEval consists of 26,136 samples across four application-level educational tasks covering ten subjects. These samples include not only questions and answers but also detailed annotations such as question types, difficulty levels, knowledge concepts, and answer explanations. By utilizing this benchmark, we assessed LLMs' potential applications and conducted a comprehensive analysis of their performance by fine-tuning on various educational tasks. Extensive experiments and discussions have highlighted the opportunities and challenges of applying LLMs in the field of education.
Authors: Alina Jade Barnett, Zhicheng Guo, Jin Jing, Wendong Ge, Peter W. Kaplan, Wan Yee Kong, Ioannis Karakis, Aline Herlopian, Lakshman Arcot Jayagopal, Olga Taraschenko, Olga Selioutski, Gamaleldin Osman, Daniel Goldenholz, Cynthia Rudin, M. Brandon Westover
Abstract: In intensive care units (ICUs), critically ill patients are monitored with electroencephalograms (EEGs) to prevent serious brain injury. The number of patients who can be monitored is constrained by the availability of trained physicians to read EEGs, and EEG interpretation can be subjective and prone to inter-observer variability. Automated deep learning systems for EEG could reduce human bias and accelerate the diagnostic process. However, black box deep learning models are untrustworthy, difficult to troubleshoot, and lack accountability in real-world applications, leading to a lack of trust and adoption by clinicians. To address these challenges, we propose a novel interpretable deep learning model that not only predicts the presence of harmful brainwave patterns but also provides high-quality case-based explanations of its decisions. Our model performs better than the corresponding black box model, despite being constrained to be interpretable. The learned 2D embedded space provides the first global overview of the structure of ictal-interictal-injury continuum brainwave patterns. The ability to understand how our model arrived at its decisions will not only help clinicians to diagnose and treat harmful brain activities more accurately but also increase their trust and adoption of machine learning models in clinical practice; this could be an integral component of the ICU neurologists' standard workflow.
Authors: Rishikesh Jha, Siddharth Subramaniyam, Ethan Benjamin, Thrivikrama Taula
Abstract: Embedding-based neural retrieval is a prevalent approach to address the semantic gap problem which often arises in product search on tail queries. In contrast, popular queries typically lack context and have a broad intent where additional context from users historical interaction can be helpful. In this paper, we share our novel approach to address both: the semantic gap problem followed by an end to end trained model for personalized semantic retrieval. We propose learning a unified embedding model incorporating graph, transformer and term-based embeddings end to end and share our design choices for optimal tradeoff between performance and efficiency. We share our learnings in feature engineering, hard negative sampling strategy, and application of transformer model, including a novel pre-training strategy and other tricks for improving search relevance and deploying such a model at industry scale. Our personalized retrieval model significantly improves the overall search experience, as measured by a 5.58% increase in search purchase rate and a 2.63% increase in site-wide conversion rate, aggregated across multiple A/B tests - on live traffic.
Authors: Tomoya Yamashita, Masanori Yamada, Takashi Shibata
Abstract: Ethical and privacy issues inherent in artificial intelligence (AI) applications have been a growing concern with the rapid spread of deep learning. Machine unlearning (MU) is the research area that addresses these issues by making a trained AI model forget about undesirable training data. Unfortunately, most existing MU methods incur significant time and computational costs for forgetting. Therefore, it is often difficult to apply these methods to practical datasets and sophisticated architectures, e.g., ImageNet and Transformer. To tackle this problem, we propose a lightweight and effective MU method. Our method identifies the model parameters sensitive to the forgetting targets and adds perturbation to such model parameters. We identify the sensitive parameters by calculating the Fisher Information Matrix (FIM). This approach does not require time-consuming additional training for forgetting. In addition, we introduce class-specific random signals called mnemonic code to reduce the cost of FIM calculation, which generally requires the entire training data and incurs significant computational costs. In our method, we train the model with mnemonic code; when forgetting, we use a small number of mnemonic codes to calculate the FIM and get the effective perturbation for forgetting. Comprehensive experiments demonstrate that our method is faster and better at forgetting than existing MU methods. Furthermore, we show that our method can scale to more practical datasets and sophisticated architectures.
Authors: Hyunki Seong, David Hyunchul Shim
Abstract: Dogfighting is a challenging scenario in aerial applications that requires a comprehensive understanding of both strategic maneuvers and the aerodynamics of agile aircraft. The aerial agent needs to not only understand tactically evolving maneuvers of fighter jets from a long-term perspective but also react to rapidly changing aerodynamics of aircraft from a short-term viewpoint. In this paper, we introduce TempFuser, a novel long short-term temporal fusion transformer architecture that can learn agile, tactical, and acrobatic flight maneuvers in complex dogfight problems. Our approach integrates two distinct temporal transition embeddings into a transformer-based network to comprehensively capture both the long-term tactics and short-term agility of aerial agents. By incorporating these perspectives, our policy network generates end-to-end flight commands that secure dominant positions over the long term and effectively outmaneuver agile opponents. After training in a high-fidelity flight simulator, our model successfully learns to execute strategic maneuvers, outperforming baseline policy models against various types of opponent aircraft. Notably, our model exhibits human-like acrobatic maneuvers even when facing adversaries with superior specifications, all without relying on prior knowledge. Moreover, it demonstrates robust pursuit performance in challenging supersonic and low-altitude situations. Demo videos are available at https://sites.google.com/view/tempfuser.
Authors: Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang
Abstract: Large Language Models are cognitively biased judges. Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and calculate the average Rank-Biased Overlap (RBO) score to be 49.6%, indicating that machine preferences are misaligned with humans. According to our findings, LLMs may still be unable to be utilized for automatic annotation aligned with human preferences. Our project page is at: https://minnesotanlp.github.io/cobbler.
Authors: Jingtan Wang, Xinyang Lu, Zitong Zhao, Zhongxiang Dai, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low
Abstract: The impressive performances of Large Language Models (LLMs) and their immense potential for commercialization have given rise to serious concerns over the Intellectual Property (IP) of their training data. In particular, the synthetic texts generated by LLMs may infringe the IP of the data being used to train the LLMs. To this end, it is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text by an LLM. In this paper, we show that this problem can be tackled by watermarking, i.e., by enabling an LLM to generate synthetic texts with embedded watermarks that contain information about their source(s). We identify the key properties of such watermarking frameworks (e.g., source attribution accuracy, robustness against adversaries), and propose a source attribution framework that satisfies these key properties due to our algorithmic designs. Our framework enables an LLM to learn an accurate mapping from the generated texts to data providers, which sets the foundation for effective source attribution. Extensive empirical evaluations show that our framework achieves effective source attribution.
Authors: Quanqi Hu, Dixian Zhu, Tianbao Yang
Abstract: This paper investigates new families of compositional optimization problems, called $\underline{\bf n}$on-$\underline{\bf s}$mooth $\underline{\bf w}$eakly-$\underline{\bf c}$onvex $\underline{\bf f}$inite-sum $\underline{\bf c}$oupled $\underline{\bf c}$ompositional $\underline{\bf o}$ptimization (NSWC FCCO). There has been a growing interest in FCCO due to its wide-ranging applications in machine learning and AI, as well as its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO presumes that both the inner and outer functions are smooth, limiting their potential to tackle a more diverse set of problems. Our research expands on this area by examining non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing, and the inner function is weakly-convex. We analyze a single-loop algorithm and establish its complexity for finding an $\epsilon$-stationary point of the Moreau envelop of the objective function. Additionally, we also extend the algorithm to solving novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems, which feature a nested arrangement of three functions. Lastly, we explore the applications of our algorithms in deep learning for two-way partial AUC maximization and multi-instance two-way partial AUC maximization, using empirical studies to showcase the effectiveness of the proposed algorithms.
Authors: Gergely Cs\'aji, David Manlove, Iain McBride, James Trimble
Abstract: In this paper, we study the Hospitals / Residents problem with Couples (HRC), where a solution is a stable matching or a report that none exists. We present a novel polynomial-time algorithm that can find a near-feasible stable matching (adjusting the hospitals' capacities by at most 1) in an HRC instance where the couples' preferences are sub-responsive (i.e., if one member switches to a better hospital, then the couple also improves) and sub-complete (i.e., each pair of hospitals that are individually acceptable to both members are jointly acceptable for the couple) by reducing it to an instance of the Stable Fixtures problem. We also present a polynomial-time algorithm for HRC in a sub-responsive, sub-complete instance that is a Dual Market, or where all couples are one of several possible types. We show that our algorithm also implies the polynomial-time solvability of a stable b-matching problem, where the underlying graph is a multigraph with loops. We complement our algorithms with several hardness results. We show that HRC with sub-responsive and sub-complete couples is NP-hard, even with other strong restrictions. We also show that HRC with a Dual Market is NP-hard under several simultaneous restrictions. Finally, we show that the problem of finding a matching with the minimum number of blocking pairs in HRC is not approximable within $m^{1-\varepsilon}$, for any $\varepsilon>0$, where $m$ is the total length of the hospitals' preference lists, unless P=NP, even if each couple applies to only one pair of hospitals. Our polynomial-time solvability results greatly expand the class of known tractable instances of HRC and provide a useful tool for designing better and more efficient mechanisms in the future.
Authors: Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev
Abstract: Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.
Authors: Ana-Maria Marcu, Long Chen, Jan H\"unermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski
Abstract: We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark https://github.com/wayveai/LingoQA as an evaluation platform for vision-language models in autonomous driving.
Authors: Seyed Mohammad Hossein Hashemi, Leila Safari, Amirhossein Dadashzadeh Taromi
Abstract: In the field of medical sciences, reliable detection and classification of brain tumors from images remains a formidable challenge due to the rarity of tumors within the population of patients. Therefore, the ability to detect tumors in anomaly scenarios is paramount for ensuring timely interventions and improved patient outcomes. This study addresses the issue by leveraging deep learning (DL) techniques to detect and classify brain tumors in challenging situations. The curated data set from the National Brain Mapping Lab (NBML) comprises 81 patients, including 30 Tumor cases and 51 Normal cases. The detection and classification pipelines are separated into two consecutive tasks. The detection phase involved comprehensive data analysis and pre-processing to modify the number of image samples and the number of patients of each class to anomaly distribution (9 Normal per 1 Tumor) to comply with real world scenarios. Next, in addition to common evaluation metrics for the testing, we employed a novel performance evaluation method called Patient to Patient (PTP), focusing on the realistic evaluation of the model. In the detection phase, we fine-tuned a YOLOv8n detection model to detect the tumor region. Subsequent testing and evaluation yielded competitive performance both in Common Evaluation Metrics and PTP metrics. Furthermore, using the Data Efficient Image Transformer (DeiT) module, we distilled a Vision Transformer (ViT) model from a fine-tuned ResNet152 as a teacher in the classification phase. This approach demonstrates promising strides in reliable tumor detection and classification, offering potential advancements in tumor diagnosis for real-world medical imaging scenarios.
Authors: Santiago Miret, N M Anoop Krishnan
Abstract: Large Language Models (LLMs) create exciting possibilities for powerful language processing tools to accelerate research in materials science. While LLMs have great potential to accelerate materials understanding and discovery, they currently fall short in being practical materials science tools. In this position paper, we show relevant failure cases of LLMs in materials science that reveal current limitations of LLMs related to comprehending and reasoning over complex, interconnected materials science knowledge. Given those shortcomings, we outline a framework for developing Materials Science LLMs (MatSci-LLMs) that are grounded in materials science knowledge and hypothesis generation followed by hypothesis testing. The path to attaining performant MatSci-LLMs rests in large part on building high-quality, multi-modal datasets sourced from scientific literature where various information extraction challenges persist. As such, we describe key materials science information extraction challenges which need to be overcome in order to build large-scale, multi-modal datasets that capture valuable materials science knowledge. Finally, we outline a roadmap for applying future MatSci-LLMs for real-world materials discovery via: 1. Automated Knowledge Base Generation; 2. Automated In-Silico Material Design; and 3. MatSci-LLM Integrated Self-Driving Materials Laboratories.
Authors: Zhongqi Yang, Elahe Khatibi, Nitish Nagesh, Mahyar Abbasian, Iman Azimi, Ramesh Jain, Amir M. Rahmani
Abstract: The profound impact of food on health necessitates advanced nutrition-oriented food recommendation services. Conventional methods often lack the crucial elements of personalization, explainability, and interactivity. While Large Language Models (LLMs) bring interpretability and explainability, their standalone use falls short of achieving true personalization. In this paper, we introduce ChatDiet, a novel LLM-powered framework designed specifically for personalized nutrition-oriented food recommendation chatbots. ChatDiet integrates personal and population models, complemented by an orchestrator, to seamlessly retrieve and process pertinent information. The personal model leverages causal discovery and inference techniques to assess personalized nutritional effects for a specific user, whereas the population model provides generalized information on food nutritional content. The orchestrator retrieves, synergizes and delivers the output of both models to the LLM, providing tailored food recommendations designed to support targeted health outcomes. The result is a dynamic delivery of personalized and explainable food recommendations, tailored to individual user preferences. Our evaluation of ChatDiet includes a compelling case study, where we establish a causal personal model to estimate individual nutrition effects. Our assessments, including a food recommendation test showcasing a 92\% effectiveness rate, coupled with illustrative dialogue examples, underscore ChatDiet's strengths in explainability, personalization, and interactivity.
Authors: Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-fu Chang
Abstract: Procedure Planning in instructional videos entails generating a sequence of action steps based on visual observations of the initial and target states. Despite the rapid progress in this task, there remain several critical challenges to be solved: (1) Adaptive procedures: Prior works hold an unrealistic assumption that the number of action steps is known and fixed, leading to non-generalizable models in real-world scenarios where the sequence length varies. (2) Temporal relation: Understanding the step temporal relation knowledge is essential in producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamp) or sequence-level labels (i.e., action category) is demanding and labor-intensive, limiting its generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined. To address these challenges, we introduce Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP adaptively determines the conclusion of actions using an auto-regressive model architecture. For temporal relation, RAP establishes an external memory module to explicitly retrieve the most relevant state-action pairs from the training videos and revises the generated procedures. To tackle high annotation cost, RAP utilizes a weakly-supervised learning manner to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.
Authors: Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, Przemys{\l}aw Kazienko, Kranthi Kiran GV, Jan Koco\'n, Bart{\l}omiej Koptyra, Satyapriya Krishna, Ronald McClelland Jr., Niklas Muennighoff, Fares Obeid, Atsushi Saito, Guangyu Song, Haoqin Tu, Stanis{\l}aw Wo\'zniak, Ruichong Zhang, Bingchen Zhao, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu
Abstract: We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism that improve expressivity while maintaining the inference efficiency characteristics of RNNs. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality. We trained four Eagle models, ranging from 0.46 to 7.5 billion parameters, and two Finch models with 1.6 and 3.1 billion parameters and find that they achieve competitive performance across a wide variety of benchmarks. We release all our models on HuggingFace under the Apache 2.0 license. Models at: https://huggingface.co/RWKV Training code at: https://github.com/RWKV/RWKV-LM Inference code at: https://github.com/RWKV/ChatRWKV Time-parallel training code at: https://github.com/RWKV/RWKV-infctx-trainer
URLs: https://huggingface.co/RWKV, https://github.com/RWKV/RWKV-LM, https://github.com/RWKV/ChatRWKV, https://github.com/RWKV/RWKV-infctx-trainer
Authors: Vincenzo Auletta, Diodato Ferraioli, Cosimo Vinci
Abstract: In this work, we study the Stochastic Budgeted Multi-round Submodular Maximization (SBMSm) problem, where we aim to adaptively maximize the sum, over multiple rounds, of a monotone and submodular objective function defined on subsets of items. The objective function also depends on the realization of stochastic events, and the total number of items we can select over all rounds is bounded by a limited budget. This problem extends, and generalizes to multiple round settings, well-studied problems such as (adaptive) influence maximization and stochastic probing. We show that, if the number of items and stochastic events is somehow bounded, there is a polynomial time dynamic programming algorithm for SBMSm. Then, we provide a simple greedy $1/2(1-1/e-\epsilon)\approx 0.316$-approximation algorithm for SBMSm, that first non-adaptively allocates the budget to be spent at each round, and then greedily and adaptively maximizes the objective function by using the budget assigned at each round. Finally, we introduce the {\em budget-adaptivity gap}, by which we measure how much an adaptive policy for SBMSm is better than an optimal partially adaptive one that, as in our greedy algorithm, determines the budget allocation in advance. We show that the budget-adaptivity gap lies between $e/(e-1)\approx 1.582$ and $2$.
Authors: Hyeonseok Moon, Seungyoon Lee, Seongtae Hong, Seungjun Lee, Chanjun Park, Heuiseok Lim
Abstract: Translating major language resources to build minor language resources becomes a widely-used approach. Particularly in translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation in implementing MT for training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and subsequently reconstructed to the data components after translation. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we have achieved a considerable improvement in translation quality itself, along with its effectiveness as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that enhances the performance of the trained model by 2.690 points for the web page ranking (WPR) task, and 0.845 for the question generation (QG) task in the XGLUE benchmark.
Authors: Zhixuan Xu, Chongkai Gao, Zixuan Liu, Gang Yang, Chenrui Tie, Haozhuo Zheng, Haoyu Zhou, Weikun Peng, Debang Wang, Tianrun Hu, Tianyi Chen, Zhouliang Yu, Lin Shao
Abstract: To substantially enhance robot intelligence, there is a pressing need to develop a large model that enables general-purpose robots to proficiently undertake a broad spectrum of manipulation tasks, akin to the versatile task-planning ability exhibited by LLMs. The vast diversity in objects, robots, and manipulation tasks presents huge challenges. Our work introduces a comprehensive framework to develop a foundation model for general robotic manipulation that formalizes a manipulation task as contact synthesis. Specifically, our model takes as input object and robot manipulator point clouds, object physical attributes, target motions, and manipulation region masks. It outputs contact points on the object and associated contact forces or post-contact motions for robots to achieve the desired manipulation task. We perform extensive experiments both in the simulation and real-world settings, manipulating articulated rigid objects, rigid objects, and deformable objects that vary in dimensionality, ranging from one-dimensional objects like ropes to two-dimensional objects like cloth and extending to three-dimensional objects such as plasticine. Our model achieves average success rates of around 90\%. Supplementary materials and videos are available on our project website at https://manifoundationmodel.github.io/.
Authors: Cristian J. Vaca-Rubio, Luis Blanco, Roberto Pereira, M\`arius Caus
Abstract: This paper introduces a novel application of Kolmogorov-Arnold Networks (KANs) to time series forecasting, leveraging their adaptive activation functions for enhanced predictive modeling. Inspired by the Kolmogorov-Arnold representation theorem, KANs replace traditional linear weights with spline-parametrized univariate functions, allowing them to learn activation patterns dynamically. We demonstrate that KANs outperforms conventional Multi-Layer Perceptrons (MLPs) in a real-world satellite traffic forecasting task, providing more accurate results with considerably fewer number of learnable parameters. We also provide an ablation study of KAN-specific parameters impact on performance. The proposed approach opens new avenues for adaptive forecasting models, emphasizing the potential of KANs as a powerful tool in predictive analytics.
Authors: Joseph Spracklen, Raveen Wijewickrama, A H M Nazmus Sakib, Anindya Maiti, Bimal Viswanath, Murtuza Jadliwala
Abstract: The reliance of popular programming languages such as Python and JavaScript on centralized package repositories and open-source software, combined with the emergence of code-generating Large Language Models (LLMs), has created a new type of threat to the software supply chain: package hallucinations. These hallucinations, which arise from fact-conflicting errors when generating code using LLMs, represent a novel form of package confusion attack that poses a critical threat to the integrity of the software supply chain. This paper conducts a rigorous and comprehensive evaluation of package hallucinations across different programming languages, settings, and parameters, exploring how a diverse set of models and configurations affect the likelihood of generating erroneous package recommendations and identifying the root causes of this phenomenon. Using 16 popular LLMs for code generation and two unique prompt datasets, we generate 576,000 code samples in two programming languages that we analyze for package hallucinations. Our findings reveal that that the average percentage of hallucinated packages is at least 5.2% for commercial models and 21.7% for open-source models, including a staggering 205,474 unique examples of hallucinated package names, further underscoring the severity and pervasiveness of this threat. To overcome this problem, we implement several hallucination mitigation strategies and show that they are able to significantly reduce the number of package hallucinations while maintaining code quality. Our experiments and findings highlight package hallucinations as a persistent and systemic phenomenon while using state-of-the-art LLMs for code generation, and a significant challenge which deserves the research community's urgent attention.
Authors: Siyu Yuan, Cheng Jiayang, Lin Qiu, Deqing Yang
Abstract: Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA.
Authors: Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
Abstract: The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
Authors: Chaoming Wang, Muyang Lyu, Tianqiu Zhang, Sichao He, Si Wu
Abstract: We present a multi-scale differentiable brain modeling workflow utilizing BrainPy, a unique differentiable brain simulator that combines accurate brain simulation with powerful gradient-based optimization. We leverage this capability of BrainPy across different brain scales. At the single-neuron level, we implement differentiable neuron models and employ gradient methods to optimize their fit to electrophysiological data. On the network level, we incorporate connectomic data to construct biologically constrained network models. Finally, to replicate animal behavior, we train these models on cognitive tasks using gradient-based learning rules. Experiments demonstrate that our approach achieves superior performance and speed in fitting generalized leaky integrate-and-fire and Hodgkin-Huxley single neuron models. Additionally, training a biologically-informed network of excitatory and inhibitory spiking neurons on working memory tasks successfully replicates observed neural activity and synaptic weight distributions. Overall, our differentiable multi-scale simulation approach offers a promising tool to bridge neuroscience data across electrophysiological, anatomical, and behavioral scales.
Authors: Matteo Belenchia, Flavio Corradini, Michela Quadrini, Michele Loreti
Abstract: Graph neural networks form a class of deep learning architectures specifically designed to work with graph-structured data. As such, they share the inherent limitations and problems of deep learning, especially regarding the issues of explainability and trustworthiness. We propose $\mu\mathcal{G}$, an original domain-specific language for the specification of graph neural networks that aims to overcome these issues. The language's syntax is introduced, and its meaning is rigorously defined by a denotational semantics. An equivalent characterization in the form of an operational semantics is also provided and, together with a type system, is used to prove the type soundness of $\mu\mathcal{G}$. We show how $\mu\mathcal{G}$ programs can be represented in a more user-friendly graphical visualization, and provide examples of its generality by showing how it can be used to define some of the most popular graph neural network models, or to develop any custom graph processing application.
Authors: Jingyue Huang, Yi-Hsuan Yang
Abstract: Emotion-driven melody harmonization aims to generate diverse harmonies for a single melody to convey desired emotions. Previous research found it hard to alter the perceived emotional valence of lead sheets only by harmonizing the same melody with different chords, which may be attributed to the constraints imposed by the melody itself and the limitation of existing music representation. In this paper, we propose a novel functional representation for symbolic music. This new method takes musical keys into account, recognizing their significant role in shaping music's emotional character through major-minor tonality. It also allows for melodic variation with respect to keys and addresses the problem of data scarcity for better emotion modeling. A Transformer is employed to harmonize key-adaptable melodies, allowing for keys determined in rule-based or model-based manner. Experimental results confirm the effectiveness of our new representation in generating key-aware harmonies, with objective and subjective evaluations affirming the potential of our approach to convey specific valence for versatile melody.
Authors: William Yicheng Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang
Abstract: Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of image-object-attribute relations to use towards attribute recognition. Specifically, for each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute's relation to objects on the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual Genome Attribute Ranking (VGARank).
Authors: Ka Hei Carrie Lau, Efe Bozkir, Hong Gao, Enkelejda Kasneci
Abstract: This paper explores the innovative application of Large Language Models (LLMs) in Virtual Reality (VR) environments to promote heritage education, focusing on traditional Scottish curling presented in the game ``Scottish Bonspiel VR''. Our study compares the effectiveness of LLM-based chatbots with pre-defined scripted chatbots, evaluating key criteria such as usability, user engagement, and learning outcomes. The results show that LLM-based chatbots significantly improve interactivity and engagement, creating a more dynamic and immersive learning environment. This integration helps document and preserve cultural heritage and enhances dissemination processes, which are crucial for safeguarding intangible cultural heritage (ICH) amid environmental changes. Furthermore, the study highlights the potential of novel technologies in education to provide immersive experiences that foster a deeper appreciation of cultural heritage. These findings support the wider application of LLMs and VR in cultural education to address global challenges and promote sustainable practices to preserve and enhance cultural heritage.
Authors: Lionel Z. Wang, Yiming Ma, Renfei Gao, Beichen Guo, Han Zhu, Wenqi Fan, Zexin Lu, Ka Chung Ng
Abstract: The advent of large language models (LLMs) has revolutionized online content creation, making it much easier to generate high-quality fake news. This misuse threatens the integrity of our digital environment and ethical standards. Therefore, understanding the motivations and mechanisms behind LLM-generated fake news is crucial. In this study, we analyze the creation of fake news from a social psychology perspective and develop a comprehensive LLM-based theoretical framework, LLM-Fake Theory. We introduce a novel pipeline that automates the generation of fake news using LLMs, thereby eliminating the need for manual annotation. Utilizing this pipeline, we create a theoretically informed Machine-generated Fake news dataset, MegaFake, derived from the GossipCop dataset. We conduct comprehensive analyses to evaluate our MegaFake dataset. We believe that our dataset and insights will provide valuable contributions to future research focused on the detection and governance of fake news in the era of LLMs.
Authors: Joseph Cho, Samuel Schmidgall, Cyril Zakka, Mrudang Mathur, Dhamanpreet Kaur, Rohan Shad, William Hiesinger
Abstract: Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis. SurGen produces videos with the highest resolution and longest duration among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment to the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.
Authors: Haisheng Su, Feixiang Song, Cong Ma, Wei Wu, Junchi Yan
Abstract: Robust object detection and tracking under arbitrary sight of view is challenging yet essential for the development of Autonomous Vehicle technology. With the growing demand of unmanned function vehicles, near-field scene understanding becomes an important research topic in the areas of low-speed autonomous driving. Due to the complexity of driving conditions and diversity of near obstacles such as blind spots and high occlusion, the perception capability of near-field environment is still inferior than its farther counterpart. To further enhance the intelligent ability of unmanned vehicles, in this paper, we construct a multimodal data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable dynamic sight of view for ego vehicle, either global view or local view. Meanwhile, a large-scale multi-sensor dataset is built, named RoboSense, to facilitate near-field scene understanding. RoboSense contains more than 133K synchronized data with 1.4M 3D bounding box and IDs annotated in the full $360^{\circ}$ view, forming 216K trajectories across 7.6K temporal sequences. It has $270\times$ and $18\times$ as many annotations of near-field obstacles within 5$m$ as the previous single-vehicle datasets such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate the future development of related research, where the detailed data analysis as well as benchmarks are also provided accordingly. Code and dataset will be available at https://github.com/suhaisheng/RoboSense.
Authors: Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik
Abstract: Multi-agent pathfinding (MAPF) is a challenging computational problem that typically requires to find collision-free paths for multiple agents in a shared environment. Solving MAPF optimally is NP-hard, yet efficient solutions are critical for numerous applications, including automated warehouses and transportation systems. Recently, learning-based approaches to MAPF have gained attention, particularly those leveraging deep reinforcement learning. Following current trends in machine learning, we have created a foundation model for the MAPF problems called MAPF-GPT. Using imitation learning, we have trained a policy on a set of pre-collected sub-optimal expert trajectories that can generate actions in conditions of partial observability without additional heuristics, reward functions, or communication with other agents. The resulting MAPF-GPT model demonstrates zero-shot learning abilities when solving the MAPF problem instances that were not present in the training dataset. We show that MAPF-GPT notably outperforms the current best-performing learnable-MAPF solvers on a diverse range of problem instances and is efficient in terms of computation (in the inference mode).
Authors: Chen-Chi Chang, Ching-Yuan Chen, Hung-Shin Lee, Chih-Cheng Lee
Abstract: This study introduces a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge, with a specific focus on Hakka culture as a case study. Leveraging Bloom's Taxonomy, the study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. This benchmark extends beyond traditional single-dimensional evaluations by providing a deeper analysis of LLMs' abilities to handle culturally specific content, ranging from basic recall of facts to higher-order cognitive tasks such as creative synthesis. Additionally, the study integrates Retrieval-Augmented Generation (RAG) technology to address the challenges of minority cultural knowledge representation in LLMs, demonstrating how RAG enhances the models' performance by dynamically incorporating relevant external information. The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge. However, the findings also reveal the limitations of RAG in creative tasks, underscoring the need for further optimization. This benchmark provides a robust tool for evaluating and comparing LLMs in culturally diverse contexts, offering valuable insights for future research and development in AI-driven cultural knowledge preservation and dissemination.
Authors: Fatma Yasmine Loumachi, Mohamed Chahine Ghanem
Abstract: Timeline Analysis (TA) plays a crucial role in Timeline Forensics (TF) within the field of Digital Forensics (DF). It focuses on examining and analyzing time-based digital artefacts, such as timestamps derived from event logs, file metadata, and other relevant data, to correlate events linked to cyber incidents and reconstruct their chronological sequence. Traditional tools often struggle to efficiently handle the large volume and variety of data generated during DF investigations and Incident Response (IR) processes. This paper introduces a novel framework, GenDFIR, which combines Rule-Based Artificial Intelligence (R-BAI) algorithms with Large Language Models (LLMs) to enhance and automate the TA process. The proposed approach consists of two key stages: (1) R-BAI is used to identify and select anomalous digital artefacts based on predefined rules. (2) The selected artefacts are then transformed into embeddings for processing by an LLM with the assistance of a Retrieval-Augmented Generation (RAG) agent. The LLM uses its capabilities to perform automated TA on the artefacts and predict potential incident outcomes. To validate the framework, we evaluated its performance, efficiency, and reliability. Several metrics were applied to simulated cyber incident scenarios, which were presented as forensic case documents. Our findings demonstrate the significant potential of integrating R-BAI and LLMs for TA. This innovative approach underscores the power of Generative AI (GenAI), particularly LLMs, and opens up new possibilities for advanced threat detection and incident reconstruction, marking a significant advancement in the field.
Authors: Khiem Ton, Nhi Nguyen, Mahmoud Nazzal, Abdallah Khreishah, Cristian Borcea, NhatHai Phan, Ruoming Jin, Issa Khalil, Yelong Shen
Abstract: This paper introduces SGCode, a flexible prompt-optimizing system to generate secure code with large language models (LLMs). SGCode integrates recent prompt-optimization approaches with LLMs in a unified system accessible through front-end and back-end APIs, enabling users to 1) generate secure code, which is free of vulnerabilities, 2) review and share security analysis, and 3) easily switch from one prompt optimization approach to another, while providing insights on model and system performance. We populated SGCode on an AWS server with PromSec, an approach that optimizes prompts by combining an LLM and security tools with a lightweight generative adversarial graph neural network to detect and fix security vulnerabilities in the generated code. Extensive experiments show that SGCode is practical as a public tool to gain insights into the trade-offs between model utility, secure code generation, and system cost. SGCode has only a marginal cost compared with prompting LLMs. SGCode is available at: https://sgcode.codes/.
URLs: https://sgcode.codes/.
Authors: Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak
Abstract: Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, majority of the models are deterministic, meaning given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we propose ProbTalk3D a non-deterministic neural network approach for emotion controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and an emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. To our knowledge, that is the first non-deterministic 3D facial animation synthesis method incorporating a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).
Authors: Rania Hossam, Ahmed Heakl, Walid Gomaa
Abstract: Traditional fish farming practices often lead to inefficient feeding, resulting in environmental issues and reduced productivity. We developed an innovative system combining computer vision and IoT technologies for precise Tilapia feeding. Our solution uses real-time IoT sensors to monitor water quality parameters and computer vision algorithms to analyze fish size and count, determining optimal feed amounts. A mobile app enables remote monitoring and control. We utilized YOLOv8 for keypoint detection to measure Tilapia weight from length, achieving \textbf{94\%} precision on 3,500 annotated images. Pixel-based measurements were converted to centimeters using depth estimation for accurate feeding calculations. Our method, with data collection mirroring inference conditions, significantly improved results. Preliminary estimates suggest this approach could increase production up to 58 times compared to traditional farms. Our models, code, and dataset are open-source~\footnote{The code, dataset, and models are available upon reasonable request.
Authors: Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller
Abstract: While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific exploration. We build on an opensource tropical and infectious diseases (TRINDs) dataset, expanding it to include demographic and semantic clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM performance on these, comparing generalist and medical LLMs, as well as LLM outcomes to human experts. We demonstrate through systematic experimentation, the benefit of contextual information such as demographics, location, gender, risk factors for optimal LLM response. Finally we develop a prototype of TRINDs-LM, a research tool that provides a playground to navigate how context impacts LLM outputs for health.
Authors: Sebastian Towers, Aleksandra Kalisz, Philippe A. Robert, Alicia Higueruelo, Francesca Vianello, Ming-Han Chloe Tsai, Harrison Steel, Jakob N. Foerster
Abstract: Anti-viral therapies are typically designed to target the current strains of a virus. Game theoretically, this corresponds to a short-sighted, or myopic, response. However, therapy-induced selective pressures act on viral antigens to drive the emergence of mutated strains, against which initial therapies have reduced efficacy. Building on a computational model of binding between antibodies and viral antigens (the Absolut! framework), we design and implement a genetic simulation of such viral evolutionary escape. Crucially, this allows our antibody optimisation algorithm to consider and influence the entire escape curve of the virus, i.e. to guide (or ''shape'') the viral evolution. This is inspired by opponent shaping which, in general-sum learning, accounts for the adaptation of the co-player rather than playing a myopic best response. Hence we call the optimised antibodies shapers. Within our simulations, we demonstrate that our shapers target both current and simulated future viral variants, outperforming the antibodies chosen in a myopic way. Furthermore, we show that shapers exert specific evolutionary pressure on the virus compared to myopic antibodies. Altogether, shapers modify the evolutionary trajectories of viral strains and minimise the viral escape compared to their myopic counterparts. While this is a simplified model, we hope that our proposed paradigm will enable the discovery of better long-lived vaccines and antibody therapies in the future, enabled by rapid advancements in the capabilities of simulation tools. Our code is available at https://github.com/olakalisz/antibody-shapers.
Authors: Hanjun Luo, Yingbin Jin, Xuecheng Liu, Tong Shang, Ruizhe Chen, Zuozhu Liu
Abstract: Large Language Models (LLMs) have supplanted traditional methods in numerous natural language processing tasks. Nonetheless, in Named Entity Recognition (NER), existing LLM-based methods underperform compared to baselines and require significantly more computational resources, limiting their application. In this paper, we introduce the task of generation-based extraction and in-context classification (GEIC), designed to leverage LLMs' prior knowledge and self-attention mechanisms for NER tasks. We then propose CascadeNER, a universal and multilingual GEIC framework for few-shot and zero-shot NER. CascadeNER employs model cascading to utilize two small-parameter LLMs to extract and classify independently, reducing resource consumption while enhancing accuracy. We also introduce AnythingNER, the first NER dataset specifically designed for LLMs, including 8 languages, 155 entity types and a novel dynamic categorization system. Experiments show that CascadeNER achieves state-of-the-art performance on low-resource and fine-grained scenarios, including CrossNER and FewNERD. Our work is openly accessible.
Authors: Xinyue Fang, Zhen Huang, Zhiliang Tian, Minghui Fang, Ziyi Pan, Quntian Fang, Zhihua Wen, Hengyue Pan, Dongsheng Li
Abstract: LLMs obtain remarkable performance but suffer from hallucinations. Most research on detecting hallucination focuses on the questions with short and concrete correct answers that are easy to check the faithfulness. Hallucination detections for text generation with open-ended answers are more challenging. Some researchers use external knowledge to detect hallucinations in generated texts, but external resources for specific scenarios are hard to access. Recent studies on detecting hallucinations in long text without external resources conduct consistency comparison among multiple sampled outputs. To handle long texts, researchers split long texts into multiple facts and individually compare the consistency of each pairs of facts. However, these methods (1) hardly achieve alignment among multiple facts; (2) overlook dependencies between multiple contextual facts. In this paper, we propose a graph-based context-aware (GCA) hallucination detection for text generations, which aligns knowledge facts and considers the dependencies between contextual knowledge triples in consistency comparison. Particularly, to align multiple facts, we conduct a triple-oriented response segmentation to extract multiple knowledge triples. To model dependencies among contextual knowledge triple (facts), we construct contextual triple into a graph and enhance triples' interactions via message passing and aggregating via RGCN. To avoid the omission of knowledge triples in long text, we conduct a LLM-based reverse verification via reconstructing the knowledge triples. Experiments show that our model enhances hallucination detection and excels all baselines.
Authors: DaDong Jiang, Xianghui Yang, Zibo Zhao, Sheng Zhang, Jiaao Yu, Zeqiang Lai, Shaoxiong Yang, Chunchao Guo, Xiaobo Zhou, Zhihui Ke
Abstract: Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.
Authors: Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas
Abstract: Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
Authors: Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, Hang Zhao
Abstract: Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at \url{https://github.com/Tsinghua-MARS-Lab/CVT-Occ}.
Authors: Maria Wang, Srinivas Sunkara, Gilles Baechler, Jason Lin, Yun Zhu, Fedir Zubach, Lei Shu, Jindong Chen
Abstract: The rise of powerful multimodal LLMs has enhanced the viability of building web agents which can, with increasing levels of autonomy, assist users to retrieve information and complete tasks on various human-computer interfaces. It is hence necessary to build challenging benchmarks that span a wide-variety of use cases reflecting real-world usage. In this work, we present WebQuest, a multi-page question-answering dataset that requires reasoning across multiple related web pages. In contrast to existing UI benchmarks that focus on multi-step web navigation and task completion, our dataset evaluates information extraction, multimodal retrieval and composition of information from many web pages. WebQuest includes three question categories: single-screen QA, multi-screen QA, and QA based on navigation traces. We evaluate leading proprietary multimodal models like GPT-4V, Gemini Flash, Claude 3, and open source models like InstructBLIP, PaliGemma on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. Finally, we investigate inference time techniques like Chain-of-Thought prompting to improve model capabilities on multi-screen reasoning.
Authors: Chi Chiu So, Siu Pang Yung
Abstract: Finding solutions to partial differential equations (PDEs) is an important and essential component in many scientific and engineering discoveries. One of the common approaches empowered by deep learning is Physics-informed Neural Networks (PINNs). Recently, a new type of fundamental neural network model, Kolmogorov-Arnold Networks (KANs), has been proposed as a substitute of Multilayer Perceptions (MLPs), and possesses trainable activation functions. To enhance KANs in fitting accuracy, a modification of KANs, so called ReLU-KANs, using "square of ReLU" as the basis of its activation functions, has been suggested. In this work, we propose another basis of activation functions, namely, Higherorder-ReLU (HR), which is simpler than the basis of activation functions used in KANs, namely, Bsplines; allows efficient KAN matrix operations; and possesses smooth and non-zero higher-order derivatives, essential to physicsinformed neural networks. We name such KANs with Higher-order-ReLU (HR) as their activations, HRKANs. Our detailed experiments on two famous and representative PDEs, namely, the linear Poisson equation and nonlinear Burgers' equation with viscosity, reveal that our proposed Higher-order-ReLU-KANs (HRKANs) achieve the highest fitting accuracy and training robustness and lowest training time significantly among KANs, ReLU-KANs and HRKANs. The codes to replicate our experiments are available at https://github.com/kelvinhkcs/HRKAN.
Authors: Xuewen Liu, Zhikai Li, Qingyi Gu
Abstract: Diffusion models have shown excellent performance on various image generation tasks, but the substantial computational costs and huge memory footprint hinder their low-latency applications in real-world scenarios. Quantization is a promising way to compress and accelerate models. Nevertheless, due to the wide range and time-varying activations in diffusion models, existing methods cannot maintain both accuracy and efficiency simultaneously for low-bit quantization. To tackle this issue, we propose DilateQuant, a novel quantization framework for diffusion models that offers comparable accuracy and high efficiency. Specifically, we keenly aware of numerous unsaturated in-channel weights, which can be cleverly exploited to reduce the range of activations without additional computation cost. Based on this insight, we propose Weight Dilation (WD) that maximally dilates the unsaturated in-channel weights to a constrained range through a mathematically equivalent scaling. WD costlessly absorbs the activation quantization errors into weight quantization. The range of activations decreases, which makes activations quantization easy. The range of weights remains constant, which makes model easy to converge in training stage. Considering the temporal network leads to time-varying activations, we design a Temporal Parallel Quantizer (TPQ), which sets time-step quantization parameters and supports parallel quantization for different time steps, significantly improving the performance and reducing time cost. To further enhance performance while preserving efficiency, we introduce a Block-wise Knowledge Distillation (BKD) to align the quantized models with the full-precision models at a block level. The simultaneous training of time-step quantization parameters and weights minimizes the time required, and the shorter backpropagation paths decreases the memory footprint of the quantization process.
Authors: David Chanin, James Wilken-Smith, Tom\'a\v{s} Dulka, Hardik Bhatnagar, Joseph Bloom
Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
Authors: Yannis Bendi-Ouis, Dan Dutarte, Xavier Hinaut
Abstract: Since the release of ChatGPT in November 2022, large language models (LLMs) have seen considerable success, including in the open-source community, with many open-weight models available. However, the requirements to deploy such a service are often unknown and difficult to evaluate in advance. To facilitate this process, we conducted numerous tests at the Centre Inria de l'Universit\'e de Bordeaux. In this article, we propose a comparison of the performance of several models of different sizes (mainly Mistral and LLaMa) depending on the available GPUs, using vLLM, a Python library designed to optimize the inference of these models. Our results provide valuable information for private and public groups wishing to deploy LLMs, allowing them to evaluate the performance of different models based on their available hardware. This study thus contributes to facilitating the adoption and use of these large language models in various application domains.
Authors: Kehui Liu, Zixin Tang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
Abstract: Leveraging the powerful reasoning capabilities of large language models (LLMs), recent LLM-based robot task planning methods yield promising results. However, they mainly focus on single or multiple homogeneous robots on simple tasks. Practically, complex long-horizon tasks always require collaborations among multiple heterogeneous robots especially with more complex action spaces, which makes these tasks more challenging. To this end, we propose COHERENT, a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems including quadrotors, robotic dogs, and robotic arms. Specifically, a Proposal-Execution-Feedback-Adjustment (PEFA) mechanism is designed to decompose and assign actions for individual robots, where a centralized task assigner makes a task planning proposal to decompose the complex task into subtasks, and then assigns subtasks to robot executors. Each robot executor selects a feasible action to implement the assigned subtask and reports self-reflection feedback to the task assigner for plan adjustment. The PEFA loops until the task is completed. Moreover, we create a challenging heterogeneous multi-robot task planning benchmark encompassing 100 complex long-horizon tasks. The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency. The experimental videos, code, and benchmark are released at https://github.com/MrKeee/COHERENT.
Authors: East Sun, Yan Wang, Lan Tian
Abstract: We introduce Block-Attention, an attention mechanism designed to address the increased inference latency and cost in Retrieval-Augmented Generation (RAG) scenarios. Unlike existing works that encodes the whole context, its main idea lies in dividing the retrieved documents into blocks, where each block calculates key-value (KV) states independently except for the final block. In RAG scenarios, by defining each passage as a block, Block-Attention enables us to pre-compute the KV states for all passages and cache them in memory, significantly reducing the latency and the computation cost during inference. The implementation involves block segmentation, positional encoding calculation, and fine-tuning the LLM to adapt to the Block-Attention mechanism. Experiments on four RAG benchmarks demonstrate that after block fine-tuning, the Block Attention model can achieve performance comparable to (68.4\% vs 67.9\% on Llama3) or even better (62.8\% vs 59.6\% on Mistral) than self-attention models. Notably, Block-Attention reduces the TTFT (the time to first token) and FLOPs (floating point operations) to a very low level. It only takes 45 ms to output the first token for an input sequence with a total length of 32K. Compared with the self-attention model, the time consumption and corresponding FLOPs are reduced by 98.7\% and 99.8\%, respectively.
Authors: Rosemary Y. He, Jeffrey N. Chiang
Abstract: Trajectory forecasting in healthcare data has been an important area of research in precision care and clinical integration for computational methods. In recent years, generative AI models have demonstrated promising results in capturing short and long range dependencies in time series data. While these models have also been applied in healthcare, most of them only predict one value at a time, which is unrealistic in a clinical setting where multiple measures are taken at once. In this work, we extend the framework temporal fusion transformer (TFT), a multi-horizon time series prediction tool, and propose TFT-multi, an end-to-end framework that can predict multiple vital trajectories simultaneously. We apply TFT-multi to forecast 5 vital signs recorded in the intensive care unit: blood pressure, pulse, SpO2, temperature and respiratory rate. We hypothesize that by jointly predicting these measures, which are often correlated with one another, we can make more accurate predictions, especially in variables with large missingness. We validate our model on the public MIMIC dataset and an independent institutional dataset, and demonstrate that this approach outperforms state-of-the-art univariate prediction tools including the original TFT and Prophet, as well as vector regression modeling for multivariate prediction. Furthermore, we perform a study case analysis by applying our pipeline to forecast blood pressure changes in response to actual and hypothetical pressor administration.
Authors: Xiaoyu He, Chenlin Wu, Zike Li, Zibin Zheng
Abstract: Federated learning is a distributed learning framework which enables clients to train models individually and to upload their model updates for aggregation. The local training process heavily relies on distributed gradient descent techniques. In the situation where gradient information is not available, the gradients need to be estimated from zeroth-order information, which typically involves computing finite-differences along isotropic random directions. This method suffers from high estimation errors, as the geometric features of the objective landscape may be overlooked during the isotropic sampling. In this work, we propose a non-isotropic sampling method to improve the gradient estimation procedure. Gradients in our method are estimated in a subspace spanned by historical trajectories of solutions, aiming to encourage the exploration of promising regions and hence improve the convergence. We implement this method in zeroth-order federated settings, and show that the convergence rate aligns with existing ones while introducing no significant overheads in communication or local computation. The effectiveness of our proposal is verified on several numerical experiments in comparison to several commonly-used zeroth-order federated optimization algorithms.