Authors: Yuan Liao, Jiang Bian, Yuhui Yun, Shuo Wang, Yubo Zhang, Jiaming Chu, Tao Wang, Kewei Li, Yuchen Li, Xuhong Li, Shilei Ji, Haoyi Xiong
Abstract: While the field of NL2SQL has made significant advancements in translating natural language instructions into executable SQL scripts for data querying and processing, achieving full automation within the broader data science pipeline - encompassing data querying, analysis, visualization, and reporting - remains a complex challenge. This study introduces SageCopilot, an advanced, industry-grade system system that automates the data science pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). Specifically, SageCopilot incorporates a two-phase design: an online component refining users' inputs into executable scripts through In-Context Learning (ICL) and running the scripts for results reporting & visualization, and an offline preparing demonstrations requested by ICL in the online phase. A list of trending strategies such as Chain-of-Thought and prompt-tuning have been used to augment SageCopilot for enhanced performance. Through rigorous testing and comparative analysis against prompt-based solutions, SageCopilot has been empirically validated to achieve superior end-to-end performance in generating or executing scripts and offering results with visualization, backed by real-world datasets. Our in-depth ablation studies highlight the individual contributions of various components and strategies used by SageCopilot to the end-to-end correctness for data sciences.
Authors: Marco AF Pimentel, Cl\'ement Christophe, Tathagata Raha, Prateek Munjal, Praveen K Kanithi, Shadab Khan
Abstract: As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge that requires careful consideration of various linguistic tasks, model architectures, and benchmarking methodologies. In recent years, various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks for assessing the capabilities of LLMs across diverse domains. This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
Authors: Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren
Abstract: We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.
Authors: Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu
Abstract: KV-Cache compression methods generally sample a KV-Cache of effectual tokens or quantize it into lower bits. However, these methods cannot exploit the redundancy of the hidden dimension of KV tensors. This paper investigates a unique hidden dimension approach called Palu, a novel KV-Cache compression framework that utilizes low-rank projection. Palu decomposes the linear layers into low-rank matrices, caches the smaller intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) a low-rank-aware quantization algorithm, and (4) matrix fusion with optimized GPU kernels. Our extensive experiments with popular LLMs show that Palu can compress KV-Cache by more than 91.25% while maintaining a significantly better accuracy (up to 1.19 lower perplexity) than state-of-the-art KV-Cache quantization methods at a similar or even higher memory usage. When compressing KV-Cache for 50%, Palu delivers up to 1.61x end-to-end speedup for the attention module. Our code is publicly available at https://github.com/shadowpa0327/Palu.
Authors: Arne Decadt, Alexander Erreygers, Jasper De Bock
Abstract: We study how to infer new choices from prior choices using the framework of choice functions, a unifying mathematical framework for decision-making based on sets of preference orders. In particular, we define the natural (most conservative) extension of a given choice assessment to a coherent choice function -- whenever possible -- and use this natural extension to make new choices. We provide a practical algorithm for computing this natural extension and various ways to improve scalability. Finally, we test these algorithms for different types of choice assessments.
Authors: Fandi Meng, Simon Lucas
Abstract: We present a game framework tailored for deduction games, enabling structured analysis from the perspective of Shannon entropy variations. Additionally, we introduce a new forward search algorithm, Information Set Entropy Search (ISES), which effectively solves many single-player deduction games. The ISES algorithm, augmented with sampling techniques, allows agents to make decisions within controlled computational resources and time constraints. Experimental results on eight games within our framework demonstrate the significant superiority of our method over the Single Observer Information Set Monte Carlo Tree Search(SO-ISMCTS) algorithm under limited decision time constraints. The entropy variation of game states in our framework enables explainable decision-making, which can also be used to analyze the appeal of deduction games and provide insights for game designers.
Authors: Athena Vakali, Nicoleta Tantalaki
Abstract: Nowadays, we delegate many of our decisions to Artificial Intelligence (AI) that acts either in solo or as a human companion in decisions made to support several sensitive domains, like healthcare, financial services and law enforcement. AI systems, even carefully designed to be fair, are heavily criticized for delivering misjudged and discriminated outcomes against individuals and groups. Numerous work on AI algorithmic fairness is devoted on Machine Learning pipelines which address biases and quantify fairness under a pure computational view. However, the continuous unfair and unjust AI outcomes, indicate that there is urgent need to understand AI as a sociotechnical system, inseparable from the conditions in which it is designed, developed and deployed. Although, the synergy of humans and machines seems imperative to make AI work, the significant impact of human and societal factors on AI bias is currently overlooked. We address this critical issue by following a radical new methodology under which human cognitive biases become core entities in our AI fairness overview. Inspired by the cognitive science definition and taxonomy of human heuristics, we identify how harmful human actions influence the overall AI lifecycle, and reveal human to AI biases hidden pathways. We introduce a new mapping, which justifies the human heuristics to AI biases reflections and we detect relevant fairness intensities and inter-dependencies. We envision that this approach will contribute in revisiting AI fairness under deeper human-centric case studies, revealing hidden biases cause and effects.
Authors: H. Emre Erdem, Henry Leung
Abstract: Static noise maps depicting long-term noise levels over wide areas are valuable urban planning assets for municipalities in decreasing noise exposure of residents. However, non-traffic noise sources with transient behavior, which people complain frequently, are usually ignored by static maps. We propose here a dynamic noise mapping approach using the data collected via low-power wide-area network (LPWAN, specifically LoRaWAN) based internet of things (IoT) infrastructure, which is one of the most common communication backbones for smart cities. Noise mapping based on LPWAN is challenging due to the low data rates of these protocols. The proposed dynamic noise mapping approach diminishes the negative implications of data rate limitations using machine learning (ML) for event and location prediction of non-traffic sources based on the scarce data. The strength of these models lies in their consideration of the spatial variance in acoustic behavior caused by the buildings in urban settings. The effectiveness of the proposed method and the accuracy of the resulting dynamic maps are evaluated in field tests. The results show that the proposed system can decrease the map error caused by non-traffic sources up to 51% and can stay effective under significant packet losses.
Authors: Idowu Paul Okuwobi, Zexuan Ji, Wen Fan, Songtao Yuan, Loza Bekalo, Qiang Chen
Abstract: The presence of hyperreflective foci (HFs) is related to retinal disease progression, and the quantity has proven to be a prognostic factor of visual and anatomical outcome in various retinal diseases. However, lack of efficient quantitative tools for evaluating the HFs has deprived ophthalmologist of assessing the volume of HFs. For this reason, we propose an automated quantification algorithm to segment and quantify HFs in spectral domain optical coherence tomography (SD-OCT). The proposed algorithm consists of two parallel processes namely: region of interest (ROI) generation and HFs estimation. To generate the ROI, we use morphological reconstruction to obtain the reconstructed image and histogram constructed for data distributions and clustering. In parallel, we estimate the HFs by extracting the extremal regions from the connected regions obtained from a component tree. Finally, both the ROI and the HFs estimation process are merged to obtain the segmented HFs. The proposed algorithm was tested on 40 3D SD-OCT volumes from 40 patients diagnosed with non-proliferative diabetic retinopathy (NPDR), proliferative diabetic retinopathy (PDR), and diabetic macular edema (DME). The average dice similarity coefficient (DSC) and correlation coefficient (r) are 69.70%, 0.99 for NPDR, 70.31%, 0.99 for PDR, and 71.30%, 0.99 for DME, respectively. The proposed algorithm can provide ophthalmologist with good HFs quantitative information, such as volume, size, and location of the HFs.
Authors: Rujia Shen, Liangliang Liu, Boran Wang, Yi Guan, Yang Yang, Jingchi Jiang
Abstract: Time series forecasting (TSF) is immensely important in extensive applications, such as electricity transformation, financial trade, medical monitoring, and smart agriculture. Although Transformer-based methods can handle time series data, their ability to predict long-term time series is limited due to the ``anti-order" nature of the self-attention mechanism. To address this problem, we focus on frequency domain to weaken the impact of order in TSF and propose the FreqBlock, where we first obtain frequency representations through the Frequency Transform Module. Subsequently, a newly designed Frequency Cross Attention is used to obtian enhanced frequency representations between the real and imaginary parts, thus establishing a link between the attention mechanism and the inherent Kramer-Kronig relations (KKRs). Our backbone network, FreqTSF, adopts a residual structure by concatenating multiple FreqBlocks to simulate KKRs in the frequency domain and avoid degradation problems. On a theoretical level, we demonstrate that the proposed two modules can significantly reduce the time and memory complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for each FreqBlock computation. Empirical studies on four benchmark datasets show that FreqTSF achieves an overall relative MSE reduction of 15\% and an overall relative MAE reduction of 11\% compared to the state-of-the-art methods. The code will be available soon.
Authors: Rubing Chen, Xulu Zhang, Jiaxin Wu, Wenqi Fan, Xiao-Yong Wei, Qing Li
Abstract: This paper addresses the need for improved precision in existing Retrieval-Augmented Generation (RAG) methods that primarily focus on enhancing recall. We propose a multi-layer knowledge pyramid approach within the RAG framework to achieve a better balance between precision and recall. The knowledge pyramid consists of three layers: Ontologies, Knowledge Graphs (KGs), and chunk-based raw text. We employ cross-layer augmentation techniques for comprehensive knowledge coverage and dynamic updates of the Ontology schema and instances. To ensure compactness, we utilize cross-layer filtering methods for knowledge condensation in KGs. Our approach, named PolyRAG, follows a waterfall model for retrieval, starting from the top of the pyramid and progressing down until a confident answer is obtained. We introduce two benchmarks for domain-specific knowledge retrieval, one in the academic domain and the other in the financial domain. The effectiveness of the methods has been validated through comprehensive experiments by outperforming 19 SOTA methods. An encouraging observation is that the proposed method has augmented the GPT-4, providing 395\% F1 gain by improving its performance from 0.1636 to 0.8109.
Authors: Yuxuan Chena, Xu Zhua, Hua Zhoua, Zhuyin Rena
Abstract: Remarkable progress has been made in automated problem solving through societies of agents based on large language models (LLMs). Computational fluid dynamics (CFD), as a complex problem, presents unique challenges in automated simulations that require sophisticated solutions. MetaOpenFOAM, as a novel multi-agent collaborations framework, aims to complete CFD simulation tasks with only natural language as input. These simulation tasks include mesh pre-processing, simulation and post-processing, etc. MetaOpenFOAM harnesses the power of MetaGPT's assembly line paradigm, which assigns diverse roles to various agents, efficiently breaking down complex CFD tasks into manageable subtasks. Langchain further complements MetaOpenFOAM by integrating Retrieval-Augmented Generation (RAG) technology, which enhances the framework's ability by integrating a searchable database of OpenFOAM tutorials for LLMs. Tests on a benchmark for natural language-based CFD solver, consisting of 8 CFD simulation tasks, have shown that MetaOpenFOAM achieved a high pass rate per test (85%), with each test case costing only $0.22 on average. The 8 CFD simulation tasks include compressible and incompressible flows, 2D and 3D flows, heat transfer, and combustion, demonstrating the ability to automate CFD simulations using only natural language input and iteratively correct errors to achieve the desired simulation at a low cost. An ablation study was conducted to verify the necessity of each component in the multi-agent system and the RAG technology. A sensitivity study on the randomness of LLM showed that LLM with low randomness can obtain more stable and accurate results. Additionally, MetaOpenFOAM own the ability to identify and modify key parameters in user requirements and excels in correcting bugs when failures occur, with or without human participation, which demonstrates the generalization of MetaOpenFOAM.
Authors: David Valencia, Henry Williams, Yuning Xing, Trevor Gee, Minas Liarokapis, Bruce A. MacDonald
Abstract: Reinforcement Learning (RL) has been widely used to solve tasks where the environment consistently provides a dense reward value. However, in real-world scenarios, rewards can often be poorly defined or sparse. Auxiliary signals are indispensable for discovering efficient exploration strategies and aiding the learning process. In this work, inspired by intrinsic motivation theory, we postulate that the intrinsic stimuli of novelty and surprise can assist in improving exploration in complex, sparsely rewarded environments. We introduce a novel sample-efficient method able to learn directly from pixels, an image-based extension of TD3 with an autoencoder called \textit{NaSA-TD3}. The experiments demonstrate that NaSA-TD3 is easy to train and an efficient method for tackling complex continuous-control robotic tasks, both in simulated environments and real-world settings. NaSA-TD3 outperforms existing state-of-the-art RL image-based methods in terms of final performance without requiring pre-trained models or human demonstrations.
Authors: Jingyao Wu, Ting Dang, Vidhyasaharan Sethu, Eliathamby Ambikairajah
Abstract: There has been a significant focus on modelling emotion ambiguity in recent years, with advancements made in representing emotions as distributions to capture ambiguity. However, there has been comparatively less effort devoted to the consideration of temporal dependencies in emotion distributions which encodes ambiguity in perceived emotions that evolve smoothly over time. Recognizing the benefits of using constrained dynamical neural ordinary differential equations (CD-NODE) to model time series as dynamic processes, we propose an ambiguity-aware dual-constrained Neural ODE approach to model the dynamics of emotion distributions on arousal and valence. In our approach, we utilize ODEs parameterised by neural networks to estimate the distribution parameters, and we integrate additional constraints to restrict the range of the system outputs to ensure the validity of predicted distributions. We evaluated our proposed system on the publicly available RECOLA dataset and observed very promising performance across a range of evaluation metrics.
Authors: Elan Markowitz, Anil Ramakrishna, Jwala Dhamala, Ninareh Mehrabi, Charith Peris, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
Abstract: Knowledge graphs (KGs) complement Large Language Models (LLMs) by providing reliable, structured, domain-specific, and up-to-date external knowledge. However, KGs and LLMs are often developed separately and must be integrated after training. We introduce Tree-of-Traversals, a novel zero-shot reasoning algorithm that enables augmentation of black-box LLMs with one or more KGs. The algorithm equips a LLM with actions for interfacing a KG and enables the LLM to perform tree search over possible thoughts and actions to find high confidence reasoning paths. We evaluate on two popular benchmark datasets. Our results show that Tree-of-Traversals significantly improves performance on question answering and KG question answering tasks. Code is available at \url{https://github.com/amazon-science/tree-of-traversals}
Authors: Hongxun Zhou, Xiangyu Chen, Ye Yuan
Abstract: A dynamic weighted directed graph (DWDG) is commonly encountered in various application scenarios. It involves extensive dynamic interactions among numerous nodes. Most existing approaches explore the intricate temporal patterns hidden in a DWDG from the purely data-driven perspective, which suffers from accuracy loss when a DWDG exhibits strong fluctuations over time. To address this issue, this study proposes a novel Extended-Kalman-Filter-Incorporated Latent Feature (EKLF) model to represent a DWDG from the model-driven perspective. Its main idea is divided into the following two-fold ideas: a) adopting a control model, i.e., the Extended Kalman Filter (EKF), to track the complex temporal patterns precisely with its nonlinear state-transition and observation functions; and b) introducing an alternating least squares (ALS) algorithm to train the latent features (LFs) alternatively for precisely representing a DWDG. Empirical studies on DWDG datasets demonstrate that the proposed EKLF model outperforms state-of-the-art models in prediction accuracy and computational efficiency for missing edge weights of a DWDG. It unveils the potential for precisely representing a DWDG by incorporating a control model.
Authors: Andreas Birk
Abstract: We introduce SmileyNet, a novel neural network with psychic abilities. It is inspired by the fact that a positive mood can lead to improved cognitive capabilities including classification tasks. The network is hence presented in a first phase with smileys and an encouraging loss function is defined to bias it into a good mood. SmileyNet is then used to forecast the flipping of a coin based on an established method of Tasseology, namely by reading tea leaves. Training and testing in this second phase are done with a high-fidelity simulation based on real-world pixels sampled from a professional tea-reading cup. SmileyNet has an amazing accuracy of 72% to correctly predict the flip of a coin. Resnet-34, respectively YOLOv5 achieve only 49%, respectively 53%. It is then shown how multiple SmileyNets can be combined to win the lottery.
Authors: Zhanpeng Chen, Chengjin Xu, Yiyan Qi, Jian Guo
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in processing and generating content across multiple data modalities, including text, images, audio, and video. However, a significant drawback of MLLMs is their reliance on static training data, leading to outdated information and limited contextual awareness. This static nature hampers their ability to provide accurate, up-to-date responses, particularly in dynamic or rapidly evolving contexts. Integrating Multimodal Retrieval-augmented Generation (Multimodal RAG) offers a promising solution, but the system would inevitably encounter the multi-granularity noisy correspondence (MNC) problem, which involves two types of noise: coarse-grained (query-caption) and fine-grained (query-image). This noise hinders accurate retrieval and generation. In this work, we propose \textbf{RagLLaVA}, a novel framework with knowledge-enhanced reranking and noise-injected training, to address these limitations. We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability and serve it as a reranker to precisely filter the top-k retrieved images. For generation, we inject visual noise during training at the data and token levels to enhance the generator's robustness. Extensive experiments are conducted on the subsets of two datasets that require retrieving and reasoning over images to answer a given query. Our results demonstrate the superiority of RagLLaVA in retrieving accurately and generating robustly. Code and models are available at https://github.com/IDEA-FinAI/RagLLaVA.
Authors: Gilang Fajar Febrian, Grazziela Figueredo
Abstract: Data is crucial for evidence-based policymaking and enhancing public services, including those at the Ministry of Finance of the Republic of Indonesia. However, the complexity and dynamic nature of governmental financial data and regulations can hinder decision-making. This study investigates the potential of Large Language Models (LLMs) to address these challenges, focusing on Indonesia's financial data and regulations. While LLMs are effective in the financial sector, their use in the public sector in Indonesia is unexplored. This study undertakes an iterative process to develop KemenkeuGPT using the LangChain with Retrieval-Augmented Generation (RAG), prompt engineering and fine-tuning. The dataset from 2003 to 2023 was collected from the Ministry of Finance, Statistics Indonesia and the International Monetary Fund (IMF). Surveys and interviews with Ministry officials informed, enhanced and fine-tuned the model. We evaluated the model using human feedback, LLM-based evaluation and benchmarking. The model's accuracy improved from 35% to 61%, with correctness increasing from 48% to 64%. The Retrieval-Augmented Generation Assessment (RAGAS) framework showed that KemenkeuGPT achieved 44% correctness with 73% faithfulness, 40% precision and 60% recall, outperforming several other base models. An interview with an expert from the Ministry of Finance indicated that KemenkeuGPT has the potential to become an essential tool for decision-making. These results are expected to improve with continuous human feedback.
Authors: Jeffrey Redondo, Nauman Aslam, Juan Zhang, Zhenhui Yuan
Abstract: Reinforcement Learning (RL) algorithms have been used to address the challenging problems in the offloading process of vehicular ad hoc networks (VANET). More recently, they have been utilized to improve the dissemination of high-definition (HD) Maps. Nevertheless, implementing solutions such as deep Q-learning (DQN) and Actor-critic at the autonomous vehicle (AV) may lead to an increase in the computational load, causing a heavy burden on the computational devices and higher costs. Moreover, their implementation might raise compatibility issues between technologies due to the required modifications to the standards. Therefore, in this paper, we assess the scalability of an application utilizing a Q-learning single-agent solution in a distributed multi-agent environment. This application improves the network performance by taking advantage of a smaller state, and action space whilst using a multi-agent approach. The proposed solution is extensively evaluated with different test cases involving reward function considering individual or overall network performance, number of agents, and centralized and distributed learning comparison. The experimental results demonstrate that the time latencies of our proposed solution conducted in voice, video, HD Map, and best-effort cases have significant improvements, with 40.4%, 36%, 43%, and 12% respectively, compared to the performances with the single-agent approach.
Authors: Xiny Pan, Daniel Hern\'andez, Philipp Seifer, Ralf L\"ammel, Steffen Staab
Abstract: Over the past few years, we have seen the emergence of large knowledge graphs combining information from multiple sources. Sometimes, this information is provided in the form of assertions about other assertions, defining contexts where assertions are valid. A recent extension to RDF which admits statements over statements, called RDF-star, is in revision to become a W3C standard. However, there is no proposal for a semantics of these RDF-star statements nor a built-in facility to operate over them. In this paper, we propose a query language for epistemic RDF-star metadata based on a four-valued logic, called eSPARQL. Our proposed query language extends SPARQL-star, the query language for RDF-star, with a new type of FROM clause to facilitate operating with multiple and sometimes conflicting beliefs. We show that the proposed query language can express four use case queries, including the following features: (i) querying the belief of an individual, (ii) the aggregating of beliefs, (iii) querying who is conflicting with somebody, and (iv) beliefs about beliefs (i.e., nesting of beliefs).
Authors: Alejandro Fern\'andez-Alburquerque, Javier Segovia-Aguas
Abstract: In recent years, there has been renewed interest in closing the performance gap between state-of-the-art planning solvers and generalized planning (GP), a research area of AI that studies the automated synthesis of algorithmic-like solutions capable of solving multiple classical planning instances. One of the current advancements has been the introduction of Best-First Generalized Planning (BFGP), a GP algorithm based on a novel solution space that can be explored with heuristic search, one of the foundations of modern planners. This paper evaluates the application of parallel search techniques to BFGP, another critical component in closing the performance gap. We first discuss why BFGP is well suited for parallelization and some of its differentiating characteristics from classical planners. Then, we propose two simple shared-memory parallel strategies with good scaling with the number of cores.
Authors: Yuna Yan, Xin Zhang, Lixin Li, Wensheng Lin, Rui Li, Wenchi Cheng, Zhu Han
Abstract: In this paper, we address the problem of image semantic communication in a multi-user deployment scenario and propose a federated learning (FL) strategy for a Swin Transformer-based semantic communication system (FSSC). Firstly, we demonstrate that the adoption of a Swin Transformer for joint source-channel coding (JSCC) effectively extracts semantic information in the communication system. Next, the FL framework is introduced to collaboratively learn a global model by aggregating local model parameters, rather than directly sharing clients' data. This approach enhances user privacy protection and reduces the workload on the server or mobile edge. Simulation evaluations indicate that our method outperforms the typical JSCC algorithm and traditional separate-based communication algorithms. Particularly after integrating local semantics, the global aggregation model has further increased the Peak Signal-to-Noise Ratio (PSNR) by more than 2dB, thoroughly proving the effectiveness of our algorithm.
Authors: Jesse Heyninck
Abstract: Choice constructs are an important part of the language of logic programming, yet the study of their semantics has been a challenging task. So far, only two-valued semantics have been studied, and the different proposals for such semantics have not been compared in a principled way. In this paper, an operator-based framework allow for the definition and comparison of different semantics in a principled way is proposed.
Authors: Yunlong Huang, Junshuo Liu, Jianan Zhang, Tiebin Mi, Xin Shi, Robert Caiming Qiu
Abstract: Gait recognition with radio frequency (RF) signals enables many potential applications requiring accurate identification. However, current systems require individuals to be within a line-of-sight (LOS) environment and struggle with low signal-to-noise ratio (SNR) when signals traverse concrete and thick walls. To address these challenges, we present TRGR, a novel transmissive reconfigurable intelligent surface (RIS)-aided gait recognition system. TRGR can recognize human identities through walls using only the magnitude measurements of channel state information (CSI) from a pair of transceivers. Specifically, by leveraging transmissive RIS alongside a configuration alternating optimization algorithm, TRGR enhances wall penetration and signal quality, enabling accurate gait recognition. Furthermore, a residual convolution network (RCNN) is proposed as the backbone network to learn robust human information. Experimental results confirm the efficacy of transmissive RIS, highlighting the significant potential of transmissive RIS in enhancing RF-based gait recognition systems. Extensive experiment results show that TRGR achieves an average accuracy of 97.88\% in identifying persons when signals traverse concrete walls, demonstrating the effectiveness and robustness of TRGR.
Authors: Hermione Warr, Yasin Ibrahim, Daniel R. McGowan, Konstantinos Kamnitsas
Abstract: Automation of medical image interpretation could alleviate bottlenecks in diagnostic workflows, and has become of particular interest in recent years due to advancements in natural language processing. Great strides have been made towards automated radiology report generation via AI, yet ensuring clinical accuracy in generated reports is a significant challenge, hindering deployment of such methods in clinical practice. In this work we propose a quality control framework for assessing the reliability of AI-generated radiology reports with respect to semantics of diagnostic importance using modular auxiliary auditing components (AC). Evaluating our pipeline on the MIMIC-CXR dataset, our findings show that incorporating ACs in the form of disease-classifiers can enable auditing that identifies more reliable reports, resulting in higher F1 scores compared to unfiltered generated reports. Additionally, leveraging the confidence of the AC labels further improves the audit's effectiveness.
Authors: Diego Mart\'in, Jordi Sanchez, Xavier Vizca\'ino
Abstract: This study investigates different approaches to classify human interactions in an artificial intelligence-based environment, specifically for Applus+ IDIADA's intelligent agent AIDA. The main objective is to develop a classifier that accurately identifies the type of interaction received (Conversation, Services, or Document Translation) to direct requests to the appropriate channel and provide a more specialized and efficient service. Various models are compared, including LLM-based classifiers, KNN using Titan and Cohere embeddings, SVM, and artificial neural networks. Results show that SVM and ANN models with Cohere embeddings achieve the best overall performance, with superior F1 scores and faster execution times compared to LLM-based approaches. The study concludes that the SVM model with Cohere embeddings is the most suitable option for classifying human interactions in the AIDA environment, offering an optimal balance between accuracy and computational efficiency.
Authors: Ming Zhang, Caishuang Huang, Yilong Wu, Shichun Liu, Huiyuan Zheng, Yurui Dong, Yujiong Shen, Shihan Dou, Jun Zhao, Junjie Ye, Qi Zhang, Tao Gui, Xuanjing Huang
Abstract: Task-oriented dialogue (TOD) systems aim to efficiently handle task-oriented conversations, including information gathering. How to utilize ToD accurately, efficiently and effectively for information gathering has always been a critical and challenging task. Recent studies have demonstrated that Large Language Models (LLMs) excel in dialogue, instruction generation, and reasoning, and can significantly enhance the performance of TOD through fine-tuning. However, current datasets primarily cater to user-led systems and are limited to predefined specific scenarios and slots, thereby necessitating improvements in the proactiveness, diversity, and capabilities of TOD. In this study, we present a detailed multi-domain task-oriented data construction process for conversations, and a Chinese dialogue dataset generated based on this process, \textbf{TransferTOD}, which authentically simulates human-machine dialogues in 30 popular life service scenarios. Leveraging this dataset, we trained a \textbf{TransferTOD-7B} model using full-parameter fine-tuning, showcasing notable abilities in slot filling and questioning. Our work has demonstrated its strong generalization capabilities in various downstream scenarios, significantly enhancing both data utilization efficiency and system performance. The data is released in https://github.com/KongLongGeFDU/TransferTOD.
Authors: Stefan Langer, Fabian Neuhaus, Andreas N\"urnberger
Abstract: Ontologies are formal representations of knowledge in specific domains that provide a structured framework for organizing and understanding complex information. Creating ontologies, however, is a complex and time-consuming endeavor. ChEBI is a well-known ontology in the field of chemistry, which provides a comprehensive resource for defining chemical entities and their properties. However, it covers only a small fraction of the rapidly growing knowledge in chemistry and does not provide references to the scientific literature. To address this, we propose a methodology that involves augmenting existing annotated text corpora with knowledge from Chebi and fine-tuning a large language model (LLM) to recognize chemical entities and their roles in scientific text. Our experiments demonstrate the effectiveness of our approach. By combining ontological knowledge and the language understanding capabilities of LLMs, we achieve high precision and recall rates in identifying both the chemical entities and roles in scientific literature. Furthermore, we extract them from a set of 8,000 ChemRxiv articles, and apply a second LLM to create a knowledge graph (KG) of chemical entities and roles (CEAR), which provides complementary information to ChEBI, and can help to extend it.
Authors: Dingkun Liu, Hongjie Zhou, Yilu Qu, Huimei Zhang, Yongdong Xu
Abstract: The abundance of intestinal flora is closely related to human diseases, but diseases are not caused by a single gut microbe. Instead, they result from the complex interplay of numerous microbial entities. This intricate and implicit connection among gut microbes poses a significant challenge for disease prediction using abundance information from OTU data. Recently, several methods have shown potential in predicting corresponding diseases. However, these methods fail to learn the inner association among gut microbes from different hosts, leading to unsatisfactory performance. In this paper, we present a novel architecture, Unsupervised Multi-graph Merge Adversarial Network (UMMAN). UMMAN can obtain the embeddings of nodes in the Multi-Graph in an unsupervised scenario, so that it helps learn the multiplex association. Our method is the first to combine Graph Neural Network with the task of intestinal flora disease prediction. We employ complex relation-types to construct the Original-Graph and disrupt the relationships among nodes to generate corresponding Shuffled-Graph. We introduce the Node Feature Global Integration (NFGI) module to represent the global features of the graph. Furthermore, we design a joint loss comprising adversarial loss and hybrid attention loss to ensure that the real graph embedding aligns closely with the Original-Graph and diverges from the Shuffled-Graph. Comprehensive experiments on five classical OTU gut microbiome datasets demonstrate the effectiveness and stability of our method. (We will release our code soon.)
Authors: Joanna F. DeFranco, Luke Biersmith
Abstract: The deployment of artificial intelligence (AI) applications has accelerated rapidly. AI enabled technologies are facing the public in many ways including infrastructure, consumer products and home applications. Because many of these technologies present risks either in the form of physical injury, or bias, potentially yielding unfair outcomes, policy makers must consider the need for oversight. Most policymakers, however, lack the technical knowledge to judge whether an emerging AI technology is safe, effective, and requires oversight, therefore policy makers must depend on expert opinion. But policymakers are better served when, in addition to expert opinion, they have some general understanding of existing guidelines and regulations. This work provides an overview [the landscape] of AI legislation and directives at the international, U.S. state, city and federal levels. It also reviews relevant business standards, and technical society initiatives. Then an overlap and gap analysis are performed resulting in a reference guide that includes recommendations and guidance for future policy making.
Authors: Alberto Pasqualetto, Lorenzo Serafini, Michele Sprocatti
Abstract: United Nations set Sustainable Development Goals and this paper focuses on 7th (Affordable and Clean Energy), 9th (Industries, Innovation and Infrastructure), and 13th (Climate Action) goals. Climate change is a major concern in our society; for this reason, a current global objective is to reduce energy waste. This work summarizes all main approaches towards energy efficiency using Artificial Intelligence with a particular focus on multi-agent systems to create smart buildings. It mentions the tight relationship between AI, especially IoT, and Big Data. It explains the application of AI to anomaly detection in smart buildings and a possible classification of Intelligent Energy Management Systems: Direct and Indirect. Finally, some drawbacks of AI approaches and some possible future research focuses are proposed.
Authors: Zhihan Chen, Peng Lin, Hao Hu, Shaowei Cai
Abstract: As a broadly applied technique in numerous optimization problems, recently, local search has been employed to solve Pseudo-Boolean Optimization (PBO) problem. A representative local search solver for PBO is LSPBO. In this paper, firstly, we improve LSPBO by a dynamic scoring mechanism, which dynamically strikes a balance between score on hard constraints and score on the objective function. Moreover, on top of this improved LSPBO , we develop the first parallel local search PBO solver. The main idea is to share good solutions among different threads to guide the search, by maintaining a pool of feasible solutions. For evaluating solutions when updating the pool, we propose a function that considers both the solution quality and the diversity of the pool. Furthermore, we calculate the polarity density in the pool to enhance the scoring function of local search. Our empirical experiments show clear benefits of the proposed parallel approach, making it competitive with the parallel version of the famous commercial solver Gurobi.
Authors: Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Gosh, Luke Zettlemoyer, Armen Aghajanyan
Abstract: We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
Authors: Felix Ocker, Daniel Tanneberg, Julian Eggert, Michael Gienger
Abstract: We introduce tulip agent, an architecture for autonomous LLM-based agents with Create, Read, Update, and Delete access to a tool library containing a potentially large number of tools. In contrast to state-of-the-art implementations, tulip agent does not encode the descriptions of all available tools in the system prompt, which counts against the model's context window, or embed the entire prompt for retrieving suitable tools. Instead, the tulip agent can recursively search for suitable tools in its extensible tool library, implemented exemplarily as a vector store. The tulip agent architecture significantly reduces inference costs, allows using even large tool libraries, and enables the agent to adapt and extend its set of tools. We evaluate the architecture with several ablation studies in a mathematics context and demonstrate its generalizability with an application to robotics. A reference implementation and the benchmark are available at github.com/HRI-EU/tulip_agent.
Authors: Abhimanyu Dubey (Jack), Abhinav Jauhri (Jack), Abhinav Pandey (Jack), Abhishek Kadian (Jack), Ahmad Al-Dahle (Jack), Aiesha Letman (Jack), Akhil Mathur (Jack), Alan Schelten (Jack), Amy Yang (Jack), Angela Fan (Jack), Anirudh Goyal (Jack), Anthony Hartshorn (Jack), Aobo Yang (Jack), Archi Mitra (Jack), Archie Sravankumar (Jack), Artem Korenev (Jack), Arthur Hinsvark (Jack), Arun Rao (Jack), Aston Zhang (Jack), Aurelien Rodriguez (Jack), Austen Gregerson (Jack), Ava Spataru (Jack), Baptiste Roziere (Jack), Bethany Biron (Jack), Binh Tang (Jack), Bobbie Chern (Jack), Charlotte Caucheteux (Jack), Chaya Nayak (Jack), Chloe Bi (Jack), Chris Marra (Jack), Chris McConnell (Jack), Christian Keller (Jack), Christophe Touret (Jack), Chunyang Wu (Jack), Corinne Wong (Jack), Cristian Canton Ferrer (Jack), Cyrus Nikolaidis (Jack), Damien Allonsius (Jack), Daniel Song (Jack), Danielle Pintz (Jack), Danny Livshits (Jack), David Esiobu (Jack), Dhruv Choudhary (Jack), Dhruv Mahajan (Jack), Diego Garcia-Olano (Jack), Diego Perino (Jack), Dieuwke Hupkes (Jack), Egor Lakomkin (Jack), Ehab AlBadawy (Jack), Elina Lobanova (Jack), Emily Dinan (Jack), Eric Michael Smith (Jack), Filip Radenovic (Jack), Frank Zhang (Jack), Gabriel Synnaeve (Jack), Gabrielle Lee (Jack), Georgia Lewis Anderson (Jack), Graeme Nail (Jack), Gregoire Mialon (Jack), Guan Pang (Jack), Guillem Cucurell (Jack), Hailey Nguyen (Jack), Hannah Korevaar (Jack), Hu Xu (Jack), Hugo Touvron (Jack), Iliyan Zarov (Jack), Imanol Arrieta Ibarra (Jack), Isabel Kloumann (Jack), Ishan Misra (Jack), Ivan Evtimov (Jack), Jade Copet (Jack), Jaewon Lee (Jack), Jan Geffert (Jack), Jana Vranes (Jack), Jason Park (Jack), Jay Mahadeokar (Jack), Jeet Shah (Jack), Jelmer van der Linde (Jack), Jennifer Billock (Jack), Jenny Hong (Jack), Jenya Lee (Jack), Jeremy Fu (Jack), Jianfeng Chi (Jack), Jianyu Huang (Jack), Jiawen Liu (Jack), Jie Wang (Jack), Jiecao Yu (Jack), Joanna Bitton (Jack), Joe Spisak (Jack), Jongsoo Park (Jack), Joseph Rocca (Jack), Joshua Johnstun (Jack), Joshua Saxe (Jack), Junteng Jia (Jack), Kalyan Vasuden Alwala (Jack), Kartikeya Upasani (Jack), Kate Plawiak (Jack), Ke Li (Jack), Kenneth Heafield (Jack), Kevin Stone (Jack), Khalid El-Arini (Jack), Krithika Iyer (Jack), Kshitiz Malik (Jack), Kuenley Chiu (Jack), Kunal Bhalla (Jack), Lauren Rantala-Yeary (Jack), Laurens van der Maaten (Jack), Lawrence Chen (Jack), Liang Tan (Jack), Liz Jenkins (Jack), Louis Martin (Jack), Lovish Madaan (Jack), Lubo Malo (Jack), Lukas Blecher (Jack), Lukas Landzaat (Jack), Luke de Oliveira (Jack), Madeline Muzzi (Jack), Mahesh Pasupuleti (Jack), Mannat Singh (Jack), Manohar Paluri (Jack), Marcin Kardas (Jack), Mathew Oldham (Jack), Mathieu Rita (Jack), Maya Pavlova (Jack), Melanie Kambadur (Jack), Mike Lewis (Jack), Min Si (Jack), Mitesh Kumar Singh (Jack), Mona Hassan (Jack), Naman Goyal (Jack), Narjes Torabi (Jack), Nikolay Bashlykov (Jack), Nikolay Bogoychev (Jack), Niladri Chatterji (Jack), Olivier Duchenne (Jack), Onur \c{C}elebi (Jack), Patrick Alrassy (Jack), Pengchuan Zhang (Jack), Pengwei Li (Jack), Petar Vasic (Jack), Peter Weng (Jack), Prajjwal Bhargava (Jack), Pratik Dubal (Jack), Praveen Krishnan (Jack), Punit Singh Koura (Jack), Puxin Xu (Jack), Qing He (Jack), Qingxiao Dong (Jack), Ragavan Srinivasan (Jack), Raj Ganapathy (Jack), Ramon Calderer (Jack), Ricardo Silveira Cabral (Jack), Robert Stojnic (Jack), Roberta Raileanu (Jack), Rohit Girdhar (Jack), Rohit Patel (Jack), Romain Sauvestre (Jack), Ronnie Polidoro (Jack), Roshan Sumbaly (Jack), Ross Taylor (Jack), Ruan Silva (Jack), Rui Hou (Jack), Rui Wang (Jack), Saghar Hosseini (Jack), Sahana Chennabasappa (Jack), Sanjay Singh (Jack), Sean Bell (Jack), Seohyun Sonia Kim (Jack), Sergey Edunov (Jack), Shaoliang Nie (Jack), Sharan Narang (Jack), Sharath Raparthy (Jack), Sheng Shen (Jack), Shengye Wan (Jack), Shruti Bhosale (Jack), Shun Zhang (Jack), Simon Vandenhende (Jack), Soumya Batra (Jack), Spencer Whitman (Jack), Sten Sootla (Jack), Stephane Collot (Jack), Suchin Gururangan (Jack), Sydney Borodinsky (Jack), Tamar Herman (Jack), Tara Fowler (Jack), Tarek Sheasha (Jack), Thomas Georgiou (Jack), Thomas Scialom (Jack), Tobias Speckbacher (Jack), Todor Mihaylov (Jack), Tong Xiao (Jack), Ujjwal Karn (Jack), Vedanuj Goswami (Jack), Vibhor Gupta (Jack), Vignesh Ramanathan (Jack), Viktor Kerkez (Jack), Vincent Gonguet (Jack), Virginie Do (Jack), Vish Vogeti (Jack), Vladan Petrovic (Jack), Weiwei Chu (Jack), Wenhan Xiong (Jack), Wenyin Fu (Jack), Whitney Meers (Jack), Xavier Martinet (Jack), Xiaodong Wang (Jack), Xiaoqing Ellen Tan (Jack), Xinfeng Xie (Jack), Xuchao Jia (Jack), Xuewei Wang (Jack), Yaelle Goldschlag (Jack), Yashesh Gaur (Jack), Yasmine Babaei (Jack), Yi Wen (Jack), Yiwen Song (Jack), Yuchen Zhang (Jack), Yue Li (Jack), Yuning Mao (Jack), Zacharie Delpierre Coudert (Jack), Zheng Yan (Jack), Zhengxing Chen (Jack), Zoe Papakipos (Jack), Aaditya Singh (Jack), Aaron Grattafiori (Jack), Abha Jain (Jack), Adam Kelsey (Jack), Adam Shajnfeld (Jack), Adithya Gangidi (Jack), Adolfo Victoria (Jack), Ahuva Goldstand (Jack), Ajay Menon (Jack), Ajay Sharma (Jack), Alex Boesenberg (Jack), Alex Vaughan (Jack), Alexei Baevski (Jack), Allie Feinstein (Jack), Amanda Kallet (Jack), Amit Sangani (Jack), Anam Yunus (Jack), Andrei Lupu (Jack), Andres Alvarado (Jack), Andrew Caples (Jack), Andrew Gu (Jack), Andrew Ho (Jack), Andrew Poulton (Jack), Andrew Ryan (Jack), Ankit Ramchandani (Jack), Annie Franco (Jack), Aparajita Saraf (Jack), Arkabandhu Chowdhury (Jack), Ashley Gabriel (Jack), Ashwin Bharambe (Jack), Assaf Eisenman (Jack), Azadeh Yazdan (Jack), Beau James (Jack), Ben Maurer (Jack), Benjamin Leonhardi (Jack), Bernie Huang (Jack), Beth Loyd (Jack), Beto De Paola (Jack), Bhargavi Paranjape (Jack), Bing Liu (Jack), Bo Wu (Jack), Boyu Ni (Jack), Braden Hancock (Jack), Bram Wasti (Jack), Brandon Spence (Jack), Brani Stojkovic (Jack), Brian Gamido (Jack), Britt Montalvo (Jack), Carl Parker (Jack), Carly Burton (Jack), Catalina Mejia (Jack), Changhan Wang (Jack), Changkyu Kim (Jack), Chao Zhou (Jack), Chester Hu (Jack), Ching-Hsiang Chu (Jack), Chris Cai (Jack), Chris Tindal (Jack), Christoph Feichtenhofer (Jack), Damon Civin (Jack), Dana Beaty (Jack), Daniel Kreymer (Jack), Daniel Li (Jack), Danny Wyatt (Jack), David Adkins (Jack), David Xu (Jack), Davide Testuggine (Jack), Delia David (Jack), Devi Parikh (Jack), Diana Liskovich (Jack), Didem Foss (Jack), Dingkang Wang (Jack), Duc Le (Jack), Dustin Holland (Jack), Edward Dowling (Jack), Eissa Jamil (Jack), Elaine Montgomery (Jack), Eleonora Presani (Jack), Emily Hahn (Jack), Emily Wood (Jack), Erik Brinkman (Jack), Esteban Arcaute (Jack), Evan Dunbar (Jack), Evan Smothers (Jack), Fei Sun (Jack), Felix Kreuk (Jack), Feng Tian (Jack), Firat Ozgenel (Jack), Francesco Caggioni (Jack), Francisco Guzm\'an (Jack), Frank Kanayet (Jack), Frank Seide (Jack), Gabriela Medina Florez (Jack), Gabriella Schwarz (Jack), Gada Badeer (Jack), Georgia Swee (Jack), Gil Halpern (Jack), Govind Thattai (Jack), Grant Herman (Jack), Grigory Sizov (Jack), Guangyi (Jack), Zhang (Sid), Guna Lakshminarayanan (Sid), Hamid Shojanazeri (Sid), Han Zou (Sid), Hannah Wang (Sid), Hanwen Zha (Sid), Haroun Habeeb (Sid), Harrison Rudolph (Sid), Helen Suk (Sid), Henry Aspegren (Sid), Hunter Goldman (Sid), Igor Molybog (Sid), Igor Tufanov (Sid), Irina-Elena Veliche (Sid), Itai Gat (Sid), Jake Weissman (Sid), James Geboski (Sid), James Kohli (Sid), Japhet Asher (Sid), Jean-Baptiste Gaya (Sid), Jeff Marcus (Sid), Jeff Tang (Sid), Jennifer Chan (Sid), Jenny Zhen (Sid), Jeremy Reizenstein (Sid), Jeremy Teboul (Sid), Jessica Zhong (Sid), Jian Jin (Sid), Jingyi Yang (Sid), Joe Cummings (Sid), Jon Carvill (Sid), Jon Shepard (Sid), Jonathan McPhie (Sid), Jonathan Torres (Sid), Josh Ginsburg (Sid), Junjie Wang (Sid), Kai Wu (Sid), Kam Hou U (Sid), Karan Saxena (Sid), Karthik Prasad (Sid), Kartikay Khandelwal (Sid), Katayoun Zand (Sid), Kathy Matosich (Sid), Kaushik Veeraraghavan (Sid), Kelly Michelena (Sid), Keqian Li (Sid), Kun Huang (Sid), Kunal Chawla (Sid), Kushal Lakhotia (Sid), Kyle Huang (Sid), Lailin Chen (Sid), Lakshya Garg (Sid), Lavender A (Sid), Leandro Silva (Sid), Lee Bell (Sid), Lei Zhang (Sid), Liangpeng Guo (Sid), Licheng Yu (Sid), Liron Moshkovich (Sid), Luca Wehrstedt (Sid), Madian Khabsa (Sid), Manav Avalani (Sid), Manish Bhatt (Sid), Maria Tsimpoukelli (Sid), Martynas Mankus (Sid), Matan Hasson (Sid), Matthew Lennie (Sid), Matthias Reso (Sid), Maxim Groshev (Sid), Maxim Naumov (Sid), Maya Lathi (Sid), Meghan Keneally (Sid), Michael L. Seltzer (Sid), Michal Valko (Sid), Michelle Restrepo (Sid), Mihir Patel (Sid), Mik Vyatskov (Sid), Mikayel Samvelyan (Sid), Mike Clark (Sid), Mike Macey (Sid), Mike Wang (Sid), Miquel Jubert Hermoso (Sid), Mo Metanat (Sid), Mohammad Rastegari (Sid), Munish Bansal (Sid), Nandhini Santhanam (Sid), Natascha Parks (Sid), Natasha White (Sid), Navyata Bawa (Sid), Nayan Singhal (Sid), Nick Egebo (Sid), Nicolas Usunier (Sid), Nikolay Pavlovich Laptev (Sid), Ning Dong (Sid), Ning Zhang (Sid), Norman Cheng (Sid), Oleg Chernoguz (Sid), Olivia Hart (Sid), Omkar Salpekar (Sid), Ozlem Kalinli (Sid), Parkin Kent (Sid), Parth Parekh (Sid), Paul Saab (Sid), Pavan Balaji (Sid), Pedro Rittner (Sid), Philip Bontrager (Sid), Pierre Roux (Sid), Piotr Dollar (Sid), Polina Zvyagina (Sid), Prashant Ratanchandani (Sid), Pritish Yuvraj (Sid), Qian Liang (Sid), Rachad Alao (Sid), Rachel Rodriguez (Sid), Rafi Ayub (Sid), Raghotham Murthy (Sid), Raghu Nayani (Sid), Rahul Mitra (Sid), Raymond Li (Sid), Rebekkah Hogan (Sid), Robin Battey (Sid), Rocky Wang (Sid), Rohan Maheswari (Sid), Russ Howes (Sid), Ruty Rinott (Sid), Sai Jayesh Bondu (Sid), Samyak Datta (Sid), Sara Chugh (Sid), Sara Hunt (Sid), Sargun Dhillon (Sid), Sasha Sidorov (Sid), Satadru Pan (Sid), Saurabh Verma (Sid), Seiji Yamamoto (Sid), Sharadh Ramaswamy (Sid), Shaun Lindsay (Sid), Shaun Lindsay (Sid), Sheng Feng (Sid), Shenghao Lin (Sid), Shengxin Cindy Zha (Sid), Shiva Shankar (Sid), Shuqiang Zhang (Sid), Shuqiang Zhang (Sid), Sinong Wang (Sid), Sneha Agarwal (Sid), Soji Sajuyigbe (Sid), Soumith Chintala (Sid), Stephanie Max (Sid), Stephen Chen (Sid), Steve Kehoe (Sid), Steve Satterfield (Sid), Sudarshan Govindaprasad (Sid), Sumit Gupta (Sid), Sungmin Cho (Sid), Sunny Virk (Sid), Suraj Subramanian (Sid), Sy Choudhury (Sid), Sydney Goldman (Sid), Tal Remez (Sid), Tamar Glaser (Sid), Tamara Best (Sid), Thilo Kohler (Sid), Thomas Robinson (Sid), Tianhe Li (Sid), Tianjun Zhang (Sid), Tim Matthews (Sid), Timothy Chou (Sid), Tzook Shaked (Sid), Varun Vontimitta (Sid), Victoria Ajayi (Sid), Victoria Montanez (Sid), Vijai Mohan (Sid), Vinay Satish Kumar (Sid), Vishal Mangla (Sid), Vlad Ionescu (Sid), Vlad Poenaru (Sid), Vlad Tiberiu Mihailescu (Sid), Vladimir Ivanov (Sid), Wei Li (Sid), Wenchen Wang (Sid), Wenwen Jiang (Sid), Wes Bouaziz (Sid), Will Constable (Sid), Xiaocheng Tang (Sid), Xiaofang Wang (Sid), Xiaojian Wu (Sid), Xiaolan Wang (Sid), Xide Xia (Sid), Xilun Wu (Sid), Xinbo Gao (Sid), Yanjun Chen (Sid), Ye Hu (Sid), Ye Jia (Sid), Ye Qi (Sid), Yenda Li (Sid), Yilin Zhang (Sid), Ying Zhang (Sid), Yossi Adi (Sid), Youngjin Nam (Sid), Yu (Sid), Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao
Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Authors: Xiaoyi Chen, Siyuan Tang, Rui Zhu, Shijun Yan, Lei Jin, Zihao Wang, Liya Su, Zhikun Zhang, XiaoFeng Wang, Haixu Tang
Abstract: The rapid advancements of large language models (LLMs) have raised public concerns about the privacy leakage of personally identifiable information (PII) within their extensive training datasets. Recent studies have demonstrated that an adversary could extract highly sensitive privacy data from the training data of LLMs with carefully designed prompts. However, these attacks suffer from the model's tendency to hallucinate and catastrophic forgetting (CF) in the pre-training stage, rendering the veracity of divulged PIIs negligible. In our research, we propose a novel attack, Janus, which exploits the fine-tuning interface to recover forgotten PIIs from the pre-training data in LLMs. We formalize the privacy leakage problem in LLMs and explain why forgotten PIIs can be recovered through empirical analysis on open-source language models. Based upon these insights, we evaluate the performance of Janus on both open-source language models and two latest LLMs, i.e., GPT-3.5-Turbo and LLaMA-2-7b. Our experiment results show that Janus amplifies the privacy risks by over 10 times in comparison with the baseline and significantly outperforms the state-of-the-art privacy extraction attacks including prefix attacks and in-context learning (ICL). Furthermore, our analysis validates that existing fine-tuning APIs provided by OpenAI and Azure AI Studio are susceptible to our Janus attack, allowing an adversary to conduct such an attack at a low cost.
Authors: Huan Ning, Zhenlong Li, Temitope Akinboyewa, M. Naser Lessani
Abstract: Powered by the emerging large language models (LLMs), autonomous geographic information systems (GIS) agents have the potential to accomplish spatial analyses and cartographic tasks. However, a research gap exists to support fully autonomous GIS agents: how to enable agents to discover and download the necessary data for geospatial analyses. This study proposes LLM-Find, an autonomous GIS agent framework capable of selecting and fetching required geospatial data by generating, executing, and debugging programs. LLM-Find utilizes the LLM as the decision-maker, selects the appropriate data source (s) from a pre-defined source list, and fetches the data from the chosen source. Each data source has a handbook that records the metadata and technical details for data retrieval. The proposed framework is designed in a plug-and-play style to ensure flexibility and extensibility. Human users or autonomous data scrawlers can add a new data source by adding a new handbook. We developed a prototype agent based on LLM-Find, and experiment results demonstrate its capability of retrieving data from various sources including OpenStreetMap, administrative boundaries and demographic data from the US Census Bureau, satellite basemaps from ESRI World Imagery, weather data from a commercial provider, and the COVID-19 data from the NYTimes GitHub. Our study is among the first attempts to develop an autonomous geospatial data retrieval agent.
Authors: Francesco Foscarin, Emmanouil Karystinaios, Eita Nakamura, Gerhard Widmer
Abstract: This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves. This is a fundamental part of the larger task of music score engraving (or score typesetting), which aims to produce readable musical scores for human performers. We focus on piano music and support homophonic voices, i.e., voices that can contain chords, and cross-staff voices, which are notably difficult tasks that have often been overlooked in previous research. We propose an end-to-end system based on graph neural networks that clusters notes that belong to the same chord and connects them with edges if they are part of a voice. Our results show clear and consistent improvements over a previous approach on two datasets of different styles. To aid the qualitative analysis of our results, we support the export in symbolic music formats and provide a direct visualization of our outputs graph over the musical score. All code and pre-trained models are available at https://github.com/CPJKU/piano_svsep
Authors: Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee
Abstract: This paper addresses the societal concerns arising from large-scale text-to-image diffusion models for generating potentially harmful or copyrighted content. Existing models rely heavily on internet-crawled data, wherein problematic concepts persist due to incomplete filtration processes. While previous approaches somewhat alleviate the issue, they often rely on text-specified concepts, introducing challenges in accurately capturing nuanced concepts and aligning model knowledge with human understandings. In response, we propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images. The proposed framework can be built upon existing techniques for the same purpose, enhancing their alignment with human judgment. By doing so, we simplify the training objective with a self-distillation-based technique, providing a strong baseline for concept removal. Our experimental results demonstrate our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
Authors: Jielong Tang, Zhenxing Wang, Ziyang Gong, Jianxing Yu, Shuang Wang, Jian Yin
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task, aiming to simultaneously extract entity spans, types, and entity-matched bounding box groundings in images from given sentence-image pairs data. Recent unified methods employing machine reading comprehension (MRC-based) frameworks or sequence generation-based models face challenges in understanding the relationships of multimodal entities. MRC-based frameworks, utilizing human-designed queries, struggle to model intra-entity connections. Meanwhile, sequence generation-based outputs excessively rely on inter-entity dependencies due to pre-defined decoding order. To tackle these, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels. Specifically, MQSPN consists of a Multi-grained Query Set (MQS) and a Multimodal Set Prediction Network (MSP). MQS combines specific type-grained and learnable entity-grained queries to adaptively strengthen intra-entity connections by explicitly aligning visual regions with textual spans. Based on solid intra-entity modeling, MSP reformulates GMNER as a set prediction, enabling the parallel prediction of multimodal entities in a non-autoregressive manner, eliminating redundant dependencies from preceding sequences, and guiding models to establish appropriate inter-entity relationships from a global matching perspective. Additionally, to boost better alignment of two-level relationships, we also incorporate a Query-guided Fusion Net (QFNet) to work as a glue network between MQS and MSP. Extensive experiments demonstrate that our approach achieves state-of-the-art performances in widely used benchmarks. Notably, our method improves 2.83% F1 in the difficult fine-grained GMNER benchmark.
Authors: Ray Friedman, Jaewoo Cho, Jeanne Brett, Xuhui Zhan, Ningyu Han, Sriram Kannan, Yingxiang Ma, Jesse Spencer-Smith, Elisabeth J\"ackel, Alfred Zerres, Madison Hooper, Katie Babbit, Manish Acharya, Wendi Adair, Soroush Aslani, Tayfun Ayka\c{c}, Chris Bauman, Rebecca Bennett, Garrett Brady, Peggy Briggs, Cheryl Dowie, Chase Eck, Igmar Geiger, Frank Jacob, Molly Kern, Sujin Lee, Leigh Anne Liu, Wu Liu, Jeffrey Loewenstein, Anne Lytle, Li Ma, Michel Mann, Alexandra Mislin, Tyree Mitchell, Hannah Martensen n\'ee Nagler, Amit Nandkeolyar, Mara Olekalns, Elena Paliakova, Jennifer Parlamis, Jason Pierce, Nancy Pierce, Robin Pinkley, Nathalie Prime, Jimena Ramirez-Marin, Kevin Rockmann, William Ross, Zhaleh Semnani-Azad, Juliana Schroeder, Philip Smith, Elena Stimmer, Roderick Swaab, Leigh Thompson, Cathy Tinsley, Ece Tuncel, Laurie Weingart, Robert Wilken, JingJing Yao, Zhi-Xue Zhang
Abstract: In recent years, Large Language Models (LLM) have demonstrated impressive capabilities in the field of natural language processing (NLP). This paper explores the application of LLMs in negotiation transcript analysis by the Vanderbilt AI Negotiation Lab. Starting in September 2022, we applied multiple strategies using LLMs from zero shot learning to fine tuning models to in-context learning). The final strategy we developed is explained, along with how to access and use the model. This study provides a sense of both the opportunities and roadblocks for the implementation of LLMs in real life applications and offers a model for how LLMs can be applied to coding in other fields.
Authors: Hanwen Zheng, Sijia Wang, Chris Thomas, Lifu Huang
Abstract: Chart comprehension presents significant challenges for machine learning models due to the diverse and intricate shapes of charts. Existing multimodal methods often overlook these visual features or fail to integrate them effectively for chart question answering (ChartQA). To address this, we introduce Chartformer, a unified framework that enhances chart component recognition by accurately identifying and classifying components such as bars, lines, pies, titles, legends, and axes. Additionally, we propose a novel Question-guided Deformable Co-Attention (QDCAt) mechanism, which fuses chart features encoded by Chartformer with the given question, leveraging the question's guidance to ground the correct answer. Extensive experiments demonstrate that the proposed approaches significantly outperform baseline models in chart component recognition and ChartQA tasks, achieving improvements of 3.2% in mAP and 15.4% in accuracy, respectively. These results underscore the robustness of our solution for detailed visual data interpretation across various applications.
Authors: Sudeshna Jana, Tirthankar Dasgupta, Lipika Dey
Abstract: In recent years, healthcare professionals are increasingly emphasizing on personalized and evidence-based patient care through the exploration of prognostic pathways. To study this, structured clinical variables from Electronic Health Records (EHRs) data have traditionally been employed by many researchers. Presently, Natural Language Processing models have received great attention in clinical research which expanded the possibilities of using clinical narratives. In this paper, we propose a systematic methodology for developing sepsis prognostic pathways derived from clinical notes, focusing on diverse patient subgroups identified by exploring comorbidities associated with sepsis and generating explanations of these subgroups using SHAP. The extracted prognostic pathways of these subgroups provide valuable insights into the dynamic trajectories of sepsis severity over time. Visualizing these pathways sheds light on the likelihood and direction of disease progression across various contexts and reveals patterns and pivotal factors or biomarkers influencing the transition between sepsis stages, whether toward deterioration or improvement. This empowers healthcare providers to implement more personalized and effective healthcare strategies for individual patients.
Authors: Mohammad Saeid Mahdavinejad, Peyman Adibi, Amirhassan Monadjemi, Pascal Hitzler
Abstract: Depression is a common mental health issue that requires prompt diagnosis and treatment. Despite the promise of social media data for depression detection, the opacity of employed deep learning models hinders interpretability and raises bias concerns. We address this challenge by introducing ProtoDep, a novel, explainable framework for Twitter-based depression detection. ProtoDep leverages prototype learning and the generative power of Large Language Models to provide transparent explanations at three levels: (i) symptom-level explanations for each tweet and user, (ii) case-based explanations comparing the user to similar individuals, and (iii) transparent decision-making through classification weights. Evaluated on five benchmark datasets, ProtoDep achieves near state-of-the-art performance while learning meaningful prototypes. This multi-faceted approach offers significant potential to enhance the reliability and transparency of depression detection on social media, ultimately aiding mental health professionals in delivering more informed care.
Authors: Yu Feng, Zhen Tian, Yifan Zhu, Zongfu Han, Haoran Luo, Guangwei Zhang, Meina Song
Abstract: The key challenge of cross-modal domain-incremental learning (DIL) is to enable the learning model to continuously learn from novel data with different feature distributions under the same task without forgetting old ones. However, existing top-performing methods still cause high forgetting rates, by lacking intra-domain knowledge extraction and inter-domain common prompting strategy. In this paper, we propose a simple yet effective framework, CP-Prompt, by training limited parameters to instruct a pre-trained model to learn new domains and avoid forgetting existing feature distributions. CP-Prompt captures intra-domain knowledge by compositionally inserting personalized prompts on multi-head self-attention layers and then learns the inter-domain knowledge with a common prompting strategy. CP-Prompt shows superiority compared with state-of-the-art baselines among three widely evaluated DIL tasks. The source code is available at https://github.com/dannis97500/CP_Prompt.
Authors: Boyan Xu, Liang Wen, Zihao Li, Yuxing Yang, Guanlan Wu, Xiongpeng Tang, Yu Li, Zihao Wu, Qingxian Su, Xueqing Shi, Yue Yang, Rui Tong, How Yong Ng
Abstract: Recent advancements in Large Language Models (LLMs) have sparked interest in their potential applications across various fields. This paper embarked on a pivotal inquiry: Can existing LLMs effectively serve as "water expert models" for water engineering and research tasks? This study was the first to evaluate LLMs' contributions across various water engineering and research tasks by establishing a domain-specific benchmark suite, namely, WaterER. Herein, we prepared 983 tasks related to water engineering and research, categorized into "wastewater treatment", "environmental restoration", "drinking water treatment and distribution", "sanitation", "anaerobic digestion" and "contaminants assessment". We evaluated the performance of seven LLMs (i.e., GPT-4, GPT-3.5, Gemini, GLM-4, ERNIE, QWEN and Llama3) on these tasks. We highlighted the strengths of GPT-4 in handling diverse and complex tasks of water engineering and water research, the specialized capabilities of Gemini in academic contexts, Llama3's strongest capacity to answer Chinese water engineering questions and the competitive performance of Chinese-oriented models like GLM-4, ERNIE and QWEN in some water engineering tasks. More specifically, current LLMs excelled particularly in generating precise research gaps for papers on "contaminants and related water quality monitoring and assessment". Additionally, they were more adept at creating appropriate titles for research papers on "treatment processes for wastewaters", "environmental restoration", and "drinking water treatment". Overall, this study pioneered evaluating LLMs in water engineering and research by introducing the WaterER benchmark to assess the trustworthiness of their predictions. This standardized evaluation framework would also drive future advancements in LLM technology by using targeting datasets, propelling these models towards becoming true "water expert".
Authors: Yuxuan Hu, Minghuan Tan, Chenwei Zhang, Zixuan Li, Xiaodan Liang, Min Yang, Chengming Li, Xiping Hu
Abstract: Empathetic response generation is designed to comprehend the emotions of others and select the most appropriate strategies to assist them in resolving emotional challenges. Empathy can be categorized into cognitive empathy and affective empathy. The former pertains to the ability to understand and discern the emotional issues and situations of others, while the latter involves the capacity to provide comfort. To enhance one's empathetic abilities, it is essential to develop both these aspects. Therefore, we develop an innovative framework that combines retrieval augmentation and emotional support strategy integration. Our framework starts with the introduction of a comprehensive emotional palette for empathy. We then apply appraisal theory to decompose this palette and create a database of empathetic responses. This database serves as an external resource and enhances the LLM's empathy by integrating semantic retrieval mechanisms. Moreover, our framework places a strong emphasis on the proper articulation of response strategies. By incorporating emotional support strategies, we aim to enrich the model's capabilities in both cognitive and affective empathy, leading to a more nuanced and comprehensive empathetic response. Finally, we extract datasets ED and ET from the empathetic dialogue dataset \textsc{EmpatheticDialogues} and ExTES based on dialogue length. Experiments demonstrate that our framework can enhance the empathy ability of LLMs from both cognitive and affective empathy perspectives. Our code is released at https://github.com/CAS-SIAT-XinHai/APTNESS.
Authors: Yannick Assogba, Donghao Ren
Abstract: As language models support larger and larger context sizes, evaluating their ability to make effective use of that context becomes increasingly important. We analyze the ability of several code generation models to handle long range dependencies using a suite of multi-step key retrieval tasks in context windows up to 8k tokens in length. The tasks progressively increase in difficulty and allow more nuanced evaluation of model capabilities than tests like the popular needle-in-the-haystack test. We find that performance degrades significantly (up to 2x) when a function references another function that is defined later in the prompt. We also observe that models that use sliding window attention mechanisms have difficulty handling references further than the size of a single window. We perform simple prompt modifications using call graph information to improve multi-step retrieval performance up to 3x. Our analysis highlights different facets of long-context performance and is suggestive of prompt construction strategies for code completion tools
Authors: Kun Peng, Lei Jiang, Qian Li, Haoran Li, Xiaoyan Yu, Li Sun, Shuo Sun, Yanxian Bi, Hao Peng
Abstract: Cross-domain Aspect Sentiment Triplet Extraction (ASTE) aims to extract fine-grained sentiment elements from target domain sentences by leveraging the knowledge acquired from the source domain. Due to the absence of labeled data in the target domain, recent studies tend to rely on pre-trained language models to generate large amounts of synthetic data for training purposes. However, these approaches entail additional computational costs associated with the generation process. Different from them, we discover a striking resemblance between table-filling methods in ASTE and two-stage Object Detection (OD) in computer vision, which inspires us to revisit the cross-domain ASTE task and approach it from an OD standpoint. This allows the model to benefit from the OD extraction paradigm and region-level alignment. Building upon this premise, we propose a novel method named \textbf{T}able-\textbf{F}illing via \textbf{M}ean \textbf{T}eacher (TFMT). Specifically, the table-filling methods encode the sentence into a 2D table to detect word relations, while TFMT treats the table as a feature map and utilizes a region consistency to enhance the quality of those generated pseudo labels. Additionally, considering the existence of the domain gap, a cross-domain consistency based on Maximum Mean Discrepancy is designed to alleviate domain shift problems. Our method achieves state-of-the-art performance with minimal parameters and computational costs, making it a strong baseline for cross-domain ASTE.
Authors: Pralaypati Ta, Bhumika Gupta, Arihant Jain, Sneha Sree C, Keerthi Ram, Mohanasankar Sivaprakasam
Abstract: An automated knowledge modeling algorithm for Cancer Clinical Practice Guidelines (CPGs) extracts the knowledge contained in the CPG documents and transforms it into a programmatically interactable, easy-to-update structured model with minimal human intervention. The existing automated algorithms have minimal scope and cannot handle the varying complexity of the knowledge content in the CPGs for different cancer types. This work proposes an improved automated knowledge modeling algorithm to create knowledge models from the National Comprehensive Cancer Network (NCCN) CPGs in Oncology for different cancer types. The proposed algorithm has been evaluated with NCCN CPGs for four different cancer types. We also proposed an algorithm to compare the knowledge models for different versions of a guideline to discover the specific changes introduced in the treatment protocol of a new version. We created a question-answering (Q&A) framework with the guideline knowledge models as the augmented knowledge base to study our ability to query the knowledge models. We compiled a set of 32 question-answer pairs derived from two reliable data sources for the treatment of Non-Small Cell Lung Cancer (NSCLC) to evaluate the Q&A framework. The framework was evaluated against the question-answer pairs from one data source, and it can generate the answers with 54.5% accuracy from the treatment algorithm and 81.8% accuracy from the discussion part of the NCCN NSCLC guideline knowledge model.
Authors: Khai Le-Duc, Khai-Nguyen Nguyen, Bach Phan Tat, Duy Le, Jerry Ngo, Long Vo-Dang, Anh Totti Nguyen, Truong-Son Hy
Abstract: Transparency in AI decision-making is crucial in healthcare due to the severe consequences of errors, and this is important for building trust among AI and users in sentiment analysis task. Incorporating reasoning capabilities helps Large Language Models (LLMs) understand human emotions within broader contexts, handle nuanced and ambiguous language, and infer underlying sentiments that may not be explicitly stated. In this work, we introduce a new task - Sentiment Reasoning - for both speech and text modalities, along with our proposed multimodal multitask framework and dataset. Our study showed that rationale-augmented training enhances model performance in sentiment classification across both human transcript and ASR settings. Also, we found that the generated rationales typically exhibit different vocabularies compared to human-generated rationales, but maintain similar semantics. All code, data (English-translated and Vietnamese) and models are published online: https://github.com/leduckhai/MultiMed
Authors: Terrance Liu, Zhiwei Steven Wu
Abstract: While large language models are rapidly moving towards consumer-facing applications, they are often still prone to factual errors and hallucinations. In order to reduce the potential harms that may come from these errors, it is important for users to know to what extent they can trust an LLM when it makes a factual claim. To this end, we study the problem of uncertainty quantification of factual correctness in long-form natural language generation. Given some output from a large language model, we study both uncertainty at the level of individual claims contained within the output (via calibration) and uncertainty across the entire output itself (via conformal prediction). Moreover, we invoke multicalibration and multivalid conformal prediction to ensure that such uncertainty guarantees are valid both marginally and across distinct groups of prompts. Using the task of biography generation, we demonstrate empirically that having access to and making use of additional group attributes for each prompt improves both overall and group-wise performance. As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored previously in the context of long-form text generation, we consider these empirical results to form a benchmark for this setting.
Authors: Muhammad Ali, Swetasudha Panda, Qinlan Shen, Michael Wick, Ari Kobren
Abstract: In the current landscape of language model research, larger models, larger datasets and more compute seems to be the only way to advance towards intelligence. While there have been extensive studies of scaling laws and models' scaling behaviors, the effect of scale on a model's social biases and stereotyping tendencies has received less attention. In this study, we explore the influence of model scale and pre-training data on its learnt social biases. We focus on BERT -- an extremely popular language model -- and investigate biases as they show up during language modeling (upstream), as well as during classification applications after fine-tuning (downstream). Our experiments on four architecture sizes of BERT demonstrate that pre-training data substantially influences how upstream biases evolve with model scale. With increasing scale, models pre-trained on large internet scrapes like Common Crawl exhibit higher toxicity, whereas models pre-trained on moderated data sources like Wikipedia show greater gender stereotypes. However, downstream biases generally decrease with increasing model scale, irrespective of the pre-training data. Our results highlight the qualitative role of pre-training data in the biased behavior of language models, an often overlooked aspect in the study of scale. Through a detailed case study of BERT, we shed light on the complex interplay of data and model scale, and investigate how it translates to concrete biases.
Authors: Yunfan Gao, Yun Xiong, Meng Wang, Haofen Wang
Abstract: Retrieval-augmented Generation (RAG) has markedly enhanced the capabilities of Large Language Models (LLMs) in tackling knowledge-intensive tasks. The increasing demands of application scenarios have driven the evolution of RAG, leading to the integration of advanced retrievers, LLMs and other complementary technologies, which in turn has amplified the intricacy of RAG systems. However, the rapid advancements are outpacing the foundational RAG paradigm, with many methods struggling to be unified under the process of "retrieve-then-generate". In this context, this paper examines the limitations of the existing RAG paradigm and introduces the modular RAG framework. By decomposing complex RAG systems into independent modules and specialized operators, it facilitates a highly reconfigurable framework. Modular RAG transcends the traditional linear architecture, embracing a more advanced design that integrates routing, scheduling, and fusion mechanisms. Drawing on extensive research, this paper further identifies prevalent RAG patterns-linear, conditional, branching, and looping-and offers a comprehensive analysis of their respective implementation nuances. Modular RAG presents innovative opportunities for the conceptualization and deployment of RAG systems. Finally, the paper explores the potential emergence of new operators and paradigms, establishing a solid theoretical foundation and a practical roadmap for the continued evolution and practical deployment of RAG technologies.
Authors: Stefan Fuchs, Michael Witbrock, Johannes Dimyadi, Robert Amor
Abstract: Compliance checking is an essential part of a construction project. The recent rapid uptake of building information models (BIM) in the construction industry has created more opportunities for automated compliance checking (ACC). BIM enables sharing of digital building design data that can be used for compliance checking with legal requirements, which are conventionally conveyed in natural language and not intended for machine processing. Creating a computable representation of legal requirements suitable for ACC is complex, costly, and time-consuming. Large language models (LLMs) such as the generative pre-trained transformers (GPT), GPT-3.5 and GPT-4, powering OpenAI's ChatGPT, can generate logically coherent text and source code responding to user prompts. This capability could be used to automate the conversion of building regulations into a semantic and computable representation. This paper evaluates the performance of LLMs in translating building regulations into LegalRuleML in a few-shot learning setup. By providing GPT-3.5 with only a few example translations, it can learn the basic structure of the format. Using a system prompt, we further specify the LegalRuleML representation and explore the existence of expert domain knowledge in the model. Such domain knowledge might be ingrained in GPT-3.5 through the broad pre-training but needs to be brought forth by careful contextualisation. Finally, we investigate whether strategies such as chain-of-thought reasoning and self-consistency could apply to this use case. As LLMs become more sophisticated, the increased common sense, logical coherence, and means to domain adaptation can significantly support ACC, leading to more efficient and effective checking processes.
Authors: Nakamasa Inoue, Shinta Otake, Takumi Hirose, Masanari Ohi, Rei Kawakami
Abstract: Self-supervised learning has emerged as a key approach for learning generic representations from speech data. Despite promising results in downstream tasks such as speech recognition, speaker verification, and emotion recognition, a significant number of parameters is required, which makes fine-tuning for each task memory-inefficient. To address this limitation, we introduce ELP-adapter tuning, a novel method for parameter-efficient fine-tuning using three types of adapter, namely encoder adapters (E-adapters), layer adapters (L-adapters), and a prompt adapter (P-adapter). The E-adapters are integrated into transformer-based encoder layers and help to learn fine-grained speech representations that are effective for speech recognition. The L-adapters create paths from each encoder layer to the downstream head and help to extract non-linguistic features from lower encoder layers that are effective for speaker verification and emotion recognition. The P-adapter appends pseudo features to CNN features to further improve effectiveness and efficiency. With these adapters, models can be quickly adapted to various speech processing tasks. Our evaluation across four downstream tasks using five backbone models demonstrated the effectiveness of the proposed method. With the WavLM backbone, its performance was comparable to or better than that of full fine-tuning on all tasks while requiring 90% fewer learnable parameters.
Authors: Yuewen Mei, Tong Nie, Jian Sun, Ye Tian
Abstract: Ensuring fault tolerance of Highly Automated Vehicles (HAVs) is crucial for their safety due to the presence of potentially severe faults. Hence, Fault Injection (FI) testing is conducted by practitioners to evaluate the safety level of HAVs. To fully cover test cases, various driving scenarios and fault settings should be considered. However, due to numerous combinations of test scenarios and fault settings, the testing space can be complex and high-dimensional. In addition, evaluating performance in all newly added scenarios is resource-consuming. The rarity of critical faults that can cause security problems further strengthens the challenge. To address these challenges, we propose to accelerate FI testing under the low-rank Smoothness Regularized Matrix Factorization (SRMF) framework. We first organize the sparse evaluated data into a structured matrix based on its safety values. Then the untested values are estimated by the correlation captured by the matrix structure. To address high dimensionality, a low-rank constraint is imposed on the testing space. To exploit the relationships between existing scenarios and new scenarios and capture the local regularity of critical faults, three types of smoothness regularization are further designed as a complement. We conduct experiments on car following and cut in scenarios. The results indicate that SRMF has the lowest prediction error in various scenarios and is capable of predicting rare critical faults compared to other machine learning models. In addition, SRMF can achieve 1171 acceleration rate, 99.3% precision and 91.1% F1 score in identifying critical faults. To the best of our knowledge, this is the first work to introduce low-rank models to FI testing of HAVs.
Authors: Pawel Renc, Yugang Jia, Anthony E. Samir, Jaroslaw Was, Quanzheng Li, David W. Bates, Arkadiusz Sitek
Abstract: Integrating modern machine learning and clinical decision-making has great promise for mitigating healthcare's increasing cost and complexity. We introduce the Enhanced Transformer for Health Outcome Simulation (ETHOS), a novel application of the transformer deep-learning architecture for analyzing high-dimensional, heterogeneous, and episodic health data. ETHOS is trained using Patient Health Timelines (PHTs)-detailed, tokenized records of health events-to predict future health trajectories, leveraging a zero-shot learning approach. ETHOS represents a significant advancement in foundation model development for healthcare analytics, eliminating the need for labeled data and model fine-tuning. Its ability to simulate various treatment pathways and consider patient-specific factors positions ETHOS as a tool for care optimization and addressing biases in healthcare delivery. Future developments will expand ETHOS' capabilities to incorporate a wider range of data types and data sources. Our work demonstrates a pathway toward accelerated AI development and deployment in healthcare.
Authors: Mayanka Chandrashekar, Ian Goethert, Md Inzamam Ul Haque, Benjamin McMahon, Sayera Dhaubhadel, Kathryn Knight, Joseph Erdos, Donna Reagan, Caroline Taylor, Peter Kuzmak, John Michael Gaziano, Eileen McAllister, Lauren Costa, Yuk-Lam Ho, Kelly Cho, Suzanne Tamang, Samah Fodeh-Jarad, Olga S. Ovchinnikova, Amy C. Justice, Jacob Hinkle, Ioana Danciu
Abstract: Objectives: This study aims to assess the impact of domain shift on chest X-ray classification accuracy and to analyze the influence of ground truth label quality and demographic factors such as age group, sex, and study year. Materials and Methods: We used a DenseNet121 model pretrained MIMIC-CXR dataset for deep learning-based multilabel classification using ground truth labels from radiology reports extracted using the CheXpert and CheXbert Labeler. We compared the performance of the 14 chest X-ray labels on the MIMIC-CXR and Veterans Healthcare Administration chest X-ray dataset (VA-CXR). The VA-CXR dataset comprises over 259k chest X-ray images spanning between the years 2010 and 2022. Results: The validation of ground truth and the assessment of multi-label classification performance across various NLP extraction tools revealed that the VA-CXR dataset exhibited lower disagreement rates than the MIMIC-CXR datasets. Additionally, there were notable differences in AUC scores between models utilizing CheXpert and CheXbert. When evaluating multi-label classification performance across different datasets, minimal domain shift was observed in unseen datasets, except for the label "Enlarged Cardiomediastinum." The study year's subgroup analyses exhibited the most significant variations in multi-label classification model performance. These findings underscore the importance of considering domain shifts in chest X-ray classification tasks, particularly concerning study years. Conclusion: Our study reveals the significant impact of domain shift and demographic factors on chest X-ray classification, emphasizing the need for improved transfer learning and equitable model development. Addressing these challenges is crucial for advancing medical imaging and enhancing patient care.
Authors: Selim F. Yilmaz, Burak Hasircioglu, Li Qiao, Deniz Gunduz
Abstract: We consider collaborative inference at the wireless edge, where each client's model is trained independently on their local datasets. Clients are queried in parallel to make an accurate decision collaboratively. In addition to maximizing the inference accuracy, we also want to ensure the privacy of local models. To this end, we leverage the superposition property of the multiple access channel to implement bandwidth-efficient multi-user inference methods. Specifically, we propose different methods for ensemble and multi-view classification that exploit over-the-air computation. We show that these schemes perform better than their orthogonal counterparts with statistically significant differences while using fewer resources and providing privacy guarantees. We also provide experimental results verifying the benefits of the proposed over-the-air multi-user inference approach and perform an ablation study to demonstrate the effectiveness of our design choices. We share the source code of the framework publicly on Github to facilitate further research and reproducibility.
Authors: Zack Dewis, Apratim Sen, Jeffrey Wong, Yujia Zhang
Abstract: This paper utilizes statistical data from various open datasets in Calgary to to uncover patterns and insights for community crimes, disorders, and traffic incidents. Community attributes like demographics, housing, and pet registration were collected and analyzed through geospatial visualization and correlation analysis. Strongly correlated features were identified using the chi-square test, and predictive models were built using association rule mining and machine learning algorithms. The findings suggest that crime rates are closely linked to factors such as population density, while pet registration has a smaller impact. This study offers valuable insights for city managers to enhance community safety strategies.
Authors: Maisha Binte Rashid, Pablo Rivas
Abstract: Multimodal machine learning models that combine visual and textual data are increasingly being deployed in critical applications, raising significant safety and security concerns due to their vulnerability to adversarial attacks. This paper presents an effective strategy to enhance the robustness of multimodal image captioning models against such attacks. By leveraging the Fast Gradient Sign Method (FGSM) to generate adversarial examples and incorporating adversarial training techniques, we demonstrate improved model robustness on two benchmark datasets: Flickr8k and COCO. Our findings indicate that selectively training only the text decoder of the multimodal architecture shows performance comparable to full adversarial training while offering increased computational efficiency. This targeted approach suggests a balance between robustness and training costs, facilitating the ethical deployment of multimodal AI systems across various domains.
Authors: Bernard J. Giron Castro, Christophe Peucheret, Darko Zibar, Francesco Da Ros
Abstract: Nowadays, as the ever-increasing demand for more powerful computing resources continues, alternative advanced computing paradigms are under extensive investigation. Significant effort has been made to deviate from conventional Von Neumann architectures. In-memory computing has emerged in the field of electronics as a possible solution to the infamous bottleneck between memory and computing processors, which reduces the effective throughput of data. In photonics, novel schemes attempt to collocate the computing processor and memory in a single device. Photonics offers the flexibility of multiplexing streams of data not only spatially and in time, but also in frequency or, equivalently, in wavelength, which makes it highly suitable for parallel computing. Here, we numerically show the use of time and wavelength division multiplexing (WDM) to solve four independent tasks at the same time in a single photonic chip, serving as a proof of concept for our proposal. The system is a time-delay reservoir computing (TDRC) based on a microring resonator (MRR). The addressed tasks cover different applications: Time-series prediction, waveform signal classification, wireless channel equalization, and radar signal prediction. The system is also tested for simultaneous computing of up to 10 instances of the same task, exhibiting excellent performance. The footprint of the system is reduced by using time-division multiplexing of the nodes that act as the neurons of the studied neural network scheme. WDM is used for the parallelization of wavelength channels, each addressing a single task. By adjusting the input power and frequency of each optical channel, we can achieve levels of performance for each of the tasks that are comparable to those quoted in state-of-the-art reports focusing on single-task operation...
Authors: Panfeng Cao, Pietro Lio
Abstract: Sequential recommendation is a task to capture hidden user preferences from historical user item interaction data. Significant progress has been made in this domain by leveraging classification based learning methods. Inspired by the recent paradigm of 'pretrain, prompt and predict' in NLP, we consider sequential recommendation as a sequence to sequence generation task and propose a novel model named Generative Recommendation (GenRec). Unlike classification based models that learn explicit user and item representations, GenRec utilizes the sequence modeling capability of Transformer and adopts the masked item prediction objective to effectively learn the hidden bidirectional sequential patterns. Different from existing generative sequential recommendation models, GenRec does not rely on manually designed hard prompts. The input to GenRec is textual user item sequence and the output is top ranked next items. Moreover, GenRec is lightweight and requires only a few hours to train effectively in low-resource settings, making it highly applicable to real-world scenarios and helping to democratize large language models in the sequential recommendation domain. Our extensive experiments have demonstrated that GenRec generalizes on various public real-world datasets and achieves state-of-the-art results. Our experiments also validate the effectiveness of the the proposed masked item prediction objective that improves the model performance by a large margin.
Authors: David Kremer, Victor Villar, Sanjay Vishwakarma, Ismael Faro, Juan Cruz-Benito
Abstract: This paper explores artificial intelligence (AI) methods for the approximate compiling of unitaries, focusing on the use of fixed two-qubit gates and arbitrary single-qubit rotations typical in superconducting hardware. Our approach involves three main stages: identifying an initial template that approximates the target unitary, predicting initial parameters for this template, and refining these parameters to maximize the fidelity of the circuit. We propose AI-driven approaches for the first two stages, with a deep learning model that suggests initial templates and an autoencoder-like model that suggests parameter values, which are refined through gradient descent to achieve the desired fidelity. We demonstrate the method on 2 and 3-qubit unitaries, showcasing promising improvements over exhaustive search and random parameter initialization. The results highlight the potential of AI to enhance the transpiling process, supporting more efficient quantum computations on current and future quantum hardware.
Authors: Florian Tambon, Amin Nikanjam, Foutse Khomh, Giuliano Antoniol
Abstract: Large Language Models (LLMs) show promising potential in Software Engineering, especially for code-related tasks like code completion and code generation. LLMs' evaluation is generally centred around general metrics computed over benchmarks. While painting a macroscopic view of the benchmarks and of the LLMs' capacity, it is unclear how each programming task in these benchmarks assesses the capabilities of the LLMs. In particular, the difficulty level of the tasks in the benchmarks is not reflected in the score used to report the performance of the model. Yet, a model achieving a 90% score on a benchmark of predominantly easy tasks is likely less capable than a model achieving a 90% score on a benchmark containing predominantly difficult tasks. This paper devises a framework, HardEval, for assessing task difficulty for LLMs and crafting new tasks based on identified hard tasks. The framework uses a diverse array of prompts for a single task across multiple LLMs to obtain a difficulty score for each task of a benchmark. Using two code generation benchmarks, HumanEval+ and ClassEval, we show that HardEval can reliably identify the hard tasks within those benchmarks, highlighting that only 21% of HumanEval+ and 27% of ClassEval tasks are hard for LLMs. Through our analysis of task difficulty, we also characterize 6 practical hard task topics which we used to generate new hard tasks. Orthogonal to current benchmarking evaluation efforts, HardEval can assist researchers and practitioners in fostering better assessments of LLMs. The difficulty score can be used to identify hard tasks within existing benchmarks. This, in turn, can be leveraged to generate more hard tasks centred around specific topics either for evaluation or improvement of LLMs. HardEval generalistic approach can be applied to other domains such as code completion or Q/A.
Authors: Aaron Ge, Monjoy Saha, Maire A. Duggan, Petra Lenz, Mustapha Abubakar, Montserrat Garc\'ia-Closas, Jeya Balasubramanian, Jonas S. Almeida, Praphulla MS Bhawsar
Abstract: Background: Tissue Microarrays (TMAs) significantly increase analytical efficiency in histopathology and large-scale epidemiologic studies by allowing multiple tissue cores to be scanned on a single slide. The individual cores can be digitally extracted and then linked to metadata for analysis in a process known as de-arraying. However, TMAs often contain core misalignments and artifacts due to assembly errors, which can adversely affect the reliability of the extracted cores during the de-arraying process. Moreover, conventional approaches for TMA de-arraying rely on desktop solutions.Therefore, a robust yet flexible de-arraying method is crucial to account for these inaccuracies and ensure effective downstream analyses. Results: We developed TMA-Grid, an in-browser, zero-footprint, interactive web application for TMA de-arraying. This web application integrates a convolutional neural network for precise tissue segmentation and a grid estimation algorithm to match each identified core to its expected location. The application emphasizes interactivity, allowing users to easily adjust segmentation and gridding results. Operating entirely in the web-browser, TMA-Grid eliminates the need for downloads or installations and ensures data privacy. Adhering to FAIR principles (Findable, Accessible, Interoperable, and Reusable), the application and its components are designed for seamless integration into TMA research workflows. Conclusions: TMA-Grid provides a robust, user-friendly solution for TMA dearraying on the web. As an open, freely accessible platform, it lays the foundation for collaborative analyses of TMAs and similar histopathology imaging data. Availability: Web application: https://episphere.github.io/tma-grid Code: https://github.com/episphere/tma-grid Tutorial: https://youtu.be/miajqyw4BVk
URLs: https://episphere.github.io/tma-grid, https://github.com/episphere/tma-grid, https://youtu.be/miajqyw4BVk
Authors: Hasan Yagiz Ozkan, Poul Einer Heegaard, Wolfgang Kellerer, Carmen Mas-Machuca
Abstract: Bugs are inevitable in software development, and their reporting in open repositories can enhance software transparency and reliability assessment. This study aims to extract information from the issue tracking system Jira and proposes a methodology to estimate resolution time for new bugs. The methodology is applied to network project ONAP, addressing concerns of network operators and manufacturers. This research provides insights into bug resolution times and related aspects in network softwarization projects.
Authors: Yixiu Zhao, Jiaxin Shi, Lester Mackey, Scott Linderman
Abstract: Discrete diffusion modeling is a promising framework for modeling and generating data in discrete spaces. To sample from these models, different strategies present trade-offs between computation and sample quality. A predominant sampling strategy is predictor-corrector $\tau$-leaping, which simulates the continuous time generative process with discretized predictor steps and counteracts the accumulation of discretization error via corrector steps. However, for absorbing state diffusion, an important class of discrete diffusion models, the standard forward-backward corrector can be ineffective in fixing such errors, resulting in subpar sample quality. To remedy this problem, we propose a family of informed correctors that more reliably counteracts discretization error by leveraging information learned by the model. For further efficiency gains, we also propose $k$-Gillespie's, a sampling algorithm that better utilizes each model evaluation, while still enjoying the speed and flexibility of $\tau$-leaping. Across several real and synthetic datasets, we show that $k$-Gillespie's with informed correctors reliably produces higher quality samples at lower computational cost.
Authors: Hamidreza Kasaei, Mohammadreza Kasaei
Abstract: Imitation Learning (IL) has emerged as a powerful approach in robotics, allowing robots to acquire new skills by mimicking human actions. Despite its potential, the data collection process for IL remains a significant challenge due to the logistical difficulties and high costs associated with obtaining high-quality demonstrations. To address these issues, we propose a low-cost visual teleoperation system for bimanual manipulation tasks, called VITAL. Our approach leverages affordable hardware and visual processing techniques to collect demonstrations, which are then augmented to create extensive training datasets for imitation learning. We enhance the generalizability and robustness of the learned policies by utilizing both real and simulated environments and human-in-the-loop corrections. We evaluated our method through several rounds of experiments in simulated and real-robot settings, focusing on tasks of varying complexity, including bottle collecting, stacking objects, and hammering. Our experimental results validate the effectiveness of our approach in learning robust robot policies from simulated data, significantly improved by human-in-the-loop corrections and real-world data integration. Additionally, we demonstrate the framework's capability to generalize to new tasks, such as setting a drink tray, showcasing its adaptability and potential for handling a wide range of real-world bimanual manipulation tasks. A video of the experiments can be found at: https://youtu.be/YeVAMRqRe64?si=R179xDlEGc7nPu8i
Authors: Jae-Won Yang, Seungbin Hong, Jae-Young Sim
Abstract: Person search is the task to localize a query person in gallery datasets of scene images. Existing methods have been mainly developed to handle a single target dataset only, however diverse datasets are continuously given in practical applications of person search. In such cases, they suffer from the catastrophic knowledge forgetting in the old datasets when trained on new datasets. In this paper, we first introduce a novel problem of lifelong person search (LPS) where the model is incrementally trained on the new datasets while preserving the knowledge learned in the old datasets. We propose an end-to-end LPS framework that facilitates the knowledge distillation to enforce the consistency learning between the old and new models by utilizing the prototype features of the foreground persons as well as the hard background proposals in the old domains. Moreover, we also devise the rehearsal-based instance matching to further improve the discrimination ability in the old domains by using the unlabeled person instances additionally. Experimental results demonstrate that the proposed method achieves significantly superior performance of both the detection and re-identification to preserve the knowledge learned in the old domains compared with the existing methods.
Authors: Taehyun Cho, Seungyub Han, Kyungjae Lee, Seokhun Ju, Dohyeong Kim, Jungwoo Lee
Abstract: Distributional reinforcement learning improves performance by effectively capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In this paper, we present a regret analysis for distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion of Bellman unbiasedness for a tractable and exactly learnable update via statistical functional dynamic programming. Our theoretical results show that approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only method to learn the statistical information unbiasedly, including nonlinear statistical functionals. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, achieving a regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
Authors: Dongwon Son, Sanghyeon Son, Jaehyung Kim, Beomjoon Kim
Abstract: We present DEF-oriCORN, a framework for language-directed manipulation tasks. By leveraging a novel object-based scene representation and diffusion-model-based state estimation algorithm, our framework enables efficient and robust manipulation planning in response to verbal commands, even in tightly packed environments with sparse camera views without any demonstrations. Unlike traditional representations, our representation affords efficient collision checking and language grounding. Compared to state-of-the-art baselines, our framework achieves superior estimation and motion planning performance from sparse RGB images and zero-shot generalizes to real-world scenarios with diverse materials, including transparent and reflective objects, despite being trained exclusively in simulation. Our code for data generation, training, inference, and pre-trained weights are publicly available at: https://sites.google.com/view/def-oricorn/home.
Authors: Rohini Banerjee, Cecilia G. Morales, Artur Dubrawski
Abstract: Efficient intravascular access in trauma and critical care significantly impacts patient outcomes. However, the availability of skilled medical personnel in austere environments is often limited. Autonomous robotic ultrasound systems can aid in needle insertion for medication delivery and support non-experts in such tasks. Despite advances in autonomous needle insertion, inaccuracies in vessel segmentation predictions pose risks. Understanding the uncertainty of predictive models in ultrasound imaging is crucial for assessing their reliability. We introduce MSU-Net, a novel multistage approach for training an ensemble of U-Nets to yield accurate ultrasound image segmentation maps. We demonstrate substantial improvements, 18.1% over a single Monte Carlo U-Net, enhancing uncertainty evaluations, model transparency, and trustworthiness. By highlighting areas of model certainty, MSU-Net can guide safe needle insertions, empowering non-experts to accomplish such tasks.
Authors: Marcelo Corrales Compagnucci, Mark Fenwick, Helena Haapio
Abstract: This chapter explores the essential role of Binding Corporate Rules (BCRs) in managing and facilitating secure health data transfers within corporate groups under the EU General Data Protection Regulation (GDPR). BCRs are tailored to ensure compliance with the GDPR and similar international data protection laws, presenting a flexible mechanism for transferring sensitive health and genomic data. The chapter situates BCRs within the broader spectrum of the GDPR international data transfer mechanisms, addressing the unique challenges posed by the sensitive nature of health data and the increased adoption of AI technologies. The European Data Protection Board (EDPB) Recommendations 1/2022 on BCRs, issued following the Schrems II decision, are critically analyzed, highlighting their stringent requirements and the need for a balanced approach that prioritizes data protection and an AI governance framework. The chapter outlines the BCR approval process, stressing the importance of streamlining this process to encourage broader adoption. It underscores the necessity of a multidisciplinary approach in developing BCRs, incorporating recently adopted international standards and frameworks, which offer valuable guidance for organizations to build trustworthy AI management systems. They guarantee the ethical development, deployment, and operation of AI, which is essential for its successful integration and the broader digital transformation. In conclusion, BCRs are positioned as essential tools for secure health data management, fostering transparency, accountability, and collaboration across international borders. The chapter calls for proactive measures to incentivize BCR adoption, streamline approval processes, and promote more innovative approaches, ensuring BCRs remain a robust mechanism for global data protection and compliance.
Authors: Yuhao Huang, Xin Yang, Han Zhou, Yan Cao, Haoran Dou, Fajin Dong, Dong Ni
Abstract: The Segment Anything Model (SAM) can achieve satisfactory segmentation performance under high-quality box prompts. However, SAM's robustness is compromised by the decline in box quality, limiting its practicality in clinical reality. In this study, we propose a novel Robust Box prompt based SAM (\textbf{RoBox-SAM}) to ensure SAM's segmentation performance under prompts with different qualities. Our contribution is three-fold. First, we propose a prompt refinement module to implicitly perceive the potential targets, and output the offsets to directly transform the low-quality box prompt into a high-quality one. We then provide an online iterative strategy for further prompt refinement. Second, we introduce a prompt enhancement module to automatically generate point prompts to assist the box-promptable segmentation effectively. Last, we build a self-information extractor to encode the prior information from the input image. These features can optimize the image embeddings and attention calculation, thus, the robustness of SAM can be further enhanced. Extensive experiments on the large medical segmentation dataset including 99,299 images, 5 modalities, and 25 organs/targets validated the efficacy of our proposed RoBox-SAM.
Authors: Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu
Abstract: Many fields could benefit from the rapid development of the large language models (LLMs). The end-to-end autonomous driving (e2eAD) is one of the typically fields facing new opportunities as the LLMs have supported more and more modalities. Here, by utilizing vision-language model (VLM), we proposed an e2eAD method called SimpleLLM4AD. In our method, the e2eAD task are divided into four stages, which are perception, prediction, planning, and behavior. Each stage consists of several visual question answering (VQA) pairs and VQA pairs interconnect with each other constructing a graph called Graph VQA (GVQA). By reasoning each VQA pair in the GVQA through VLM stage by stage, our method could achieve e2e driving with language. In our method, vision transformers (ViT) models are employed to process nuScenes visual data, while VLM are utilized to interpret and reason about the information extracted from the visual inputs. In the perception stage, the system identifies and classifies objects from the driving environment. The prediction stage involves forecasting the potential movements of these objects. The planning stage utilizes the gathered information to develop a driving strategy, ensuring the safety and efficiency of the autonomous vehicle. Finally, the behavior stage translates the planned actions into executable commands for the vehicle. Our experiments demonstrate that SimpleLLM4AD achieves competitive performance in complex driving scenarios.
Authors: An Wu, Yu Pan, Fuqi Zhou, Jinghui Yan, Chuanlu Liu
Abstract: Persistent homology is an effective method for extracting topological information, represented as persistent diagrams, of spatial structure data. Hence it is well-suited for the study of protein structures. Attempts to incorporate Persistent homology in machine learning methods of protein function prediction have resulted in several techniques for vectorizing persistent diagrams. However, current vectorization methods are excessively artificial and cannot ensure the effective utilization of information or the rationality of the methods. To address this problem, we propose a more geometrical vectorization method of persistent diagrams based on maximal margin classification for Banach space, and additionaly propose a framework that utilizes topological data analysis to identify proteins with specific functions. We evaluated our vectorization method using a binary classification task on proteins and compared it with the statistical methods that exhibit the best performance among thirteen commonly used vectorization methods. The experimental results indicate that our approach surpasses the statistical methods in both robustness and precision.
Authors: Kaustav Bhattacharjee, Soumya Kundu, Indrasis Chakraborty, Aritra Dasgupta
Abstract: Net load forecasting is crucial for energy planning and facilitating informed decision-making regarding trade and load distributions. However, evaluating forecasting models' performance against benchmark models remains challenging, thereby impeding experts' trust in the model's performance. In this context, there is a demand for technological interventions that allow scientists to compare models across various timeframes and solar penetration levels. This paper introduces a visual analytics-based application designed to compare the performance of deep-learning-based net load forecasting models with other models for probabilistic net load forecasting. This application employs carefully selected visual analytic interventions, enabling users to discern differences in model performance across different solar penetration levels, dataset resolutions, and hours of the day over multiple months. We also present observations made using our application through a case study, demonstrating the effectiveness of visualizations in aiding scientists in making informed decisions and enhancing trust in net load forecasting models.
Authors: Haoyu Kang (Central South University), Yuzhou Zhu (Dalian University of Technology), Yukun Zhong (Nanjing University), Ke Wang (Xidian University)
Abstract: Retrieval-augmented generation (RAG) has achieved great success in information retrieval to assist large models because it builds an external knowledge database. However, it also has many problems: it consumes a lot of memory because of the huge database. When faced with massive streaming data, it is unable to update the established index database in time. To save the memory of building the database and maintain accuracy simultaneously, we proposed a new approach combining a streaming algorithm and k-means cluster with RAG. Our approach applies a streaming algorithm to update the index and reduce memory consumption. Then use the k-means algorithm to cluster documents with high similarities together, the query time will be shortened by doing this. We conducted comparative experiments on four methods, and the results show that RAG with streaming algorithm and k-means cluster performs well in accuracy and memory. For massive streaming data, we find that our method behaves better than traditional RAG
Authors: Ali Abedi, Q. M. Jonathan Wu, Ning Zhang, Farhad Pourpanah
Abstract: Unsupervised domain adaptation (UDA) aims to mitigate the domain shift issue, where the distribution of training (source) data differs from that of testing (target) data. Many models have been developed to tackle this problem, and recently vision transformers (ViTs) have shown promising results. However, the complexity and large number of trainable parameters of ViTs restrict their deployment in practical applications. This underscores the need for an efficient model that not only reduces trainable parameters but also allows for adjustable complexity based on specific needs while delivering comparable performance. To achieve this, in this paper we introduce an Efficient Unsupervised Domain Adaptation (EUDA) framework. EUDA employs the DINOv2, which is a self-supervised ViT, as a feature extractor followed by a simplified bottleneck of fully connected layers to refine features for enhanced domain adaptation. Additionally, EUDA employs the synergistic domain alignment loss (SDAL), which integrates cross-entropy (CE) and maximum mean discrepancy (MMD) losses, to balance adaptation by minimizing classification errors in the source domain while aligning the source and target domain distributions. The experimental results indicate the effectiveness of EUDA in producing comparable results as compared with other state-of-the-art methods in domain adaptation with significantly fewer trainable parameters, between 42% to 99.7% fewer. This showcases the ability to train the model in a resource-limited environment. The code of the model is available at: https://github.com/A-Abedi/EUDA.
Authors: Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, Julia Hirschberg
Abstract: This paper introduces a novel approach to emotion detection in speech using Large Language Models (LLMs). We address the limitation of LLMs in processing audio inputs by translating speech characteristics into natural language descriptions. Our method integrates these descriptions into text prompts, enabling LLMs to perform multimodal emotion analysis without architectural modifications. We evaluate our approach on two datasets: IEMOCAP and MELD, demonstrating significant improvements in emotion recognition accuracy, particularly for high-quality audio data. Our experiments show that incorporating speech descriptions yields a 2 percentage point increase in weighted F1 score on IEMOCAP (from 70.111\% to 72.596\%). We also compare various LLM architectures and explore the effectiveness of different feature representations. Our findings highlight the potential of this approach in enhancing emotion detection capabilities of LLMs and underscore the importance of audio quality in speech-based emotion recognition tasks. We'll release the source code on Github.
Authors: Yulai Cong
Abstract: Cooperation plays a pivotal role in the evolution of human intelligence; moreover, it also underlies the recent revolutionary advancement of artificial intelligence (AI) that is driven by foundation models. Specifically, we reveal that the training of foundation models can be interpreted as a form of big cooperative learning (\textit{abbr.} big learning), where massive learning individuals/tasks \emph{cooperate} to approach the unique essence of data from diverse perspectives of data prediction, leveraging a universal model. The presented big learning therefore unifies most training objectives of foundation models within a consistent framework, where their underlying assumptions are exposed simultaneously. We design tailored simulations to demonstrate the principle of big learning, based on which we provide learning-perspective justifications for the successes of foundation models, with interesting side-products. Furthermore, we reveal that big learning is a new dimension for upgrading conventional machine learning paradigms, valuable for endowing reinvigorations to associated applications; as an illustrative example, we propose the BigLearn-GAN, which is a novel adversarially-trained foundation model with versatile data sampling capabilities. Code is available at \texttt{https://github.com/YulaiCong/BigCooperativeLearning}.
Authors: Adrian Celaya, Evan Lim, Rachel Glenn, Brayden Mi, Alex Balsells, Tucker Netherton, Caroline Chung, Beatrice Riviere, David Fuentes
Abstract: Medical imaging segmentation is a highly active area of research, with deep learning-based methods achieving state-of-the-art results in several benchmarks. However, the lack of standardized tools for training, testing, and evaluating new methods makes the comparison of methods difficult. To address this, we introduce the Medical Imaging Segmentation Toolkit (MIST), a simple, modular, and end-to-end medical imaging segmentation framework designed to facilitate consistent training, testing, and evaluation of deep learning-based medical imaging segmentation methods. MIST standardizes data analysis, preprocessing, and evaluation pipelines, accommodating multiple architectures and loss functions. This standardization ensures reproducible and fair comparisons across different methods. We detail MIST's data format requirements, pipelines, and auxiliary features and demonstrate its efficacy using the BraTS Adult Glioma Post-Treatment Challenge dataset. Our results highlight MIST's ability to produce accurate segmentation masks and its scalability across multiple GPUs, showcasing its potential as a powerful tool for future medical imaging research and development.
Authors: David Zagardo
Abstract: Traditional Differentially Private Stochastic Gradient Descent (DP-SGD) introduces statistical noise on top of gradients drawn from a Gaussian distribution to ensure privacy. This paper introduces the novel Differentially Private Block-wise Gradient Shuffle (DP-BloGS) algorithm for deep learning. BloGS builds off of existing private deep learning literature, but makes a definitive shift by taking a probabilistic approach to gradient noise introduction through shuffling modeled after information theoretic privacy analyses. The theoretical results presented in this paper show that the combination of shuffling, parameter-specific block size selection, batch layer clipping, and gradient accumulation allows DP-BloGS to achieve training times close to that of non-private training while maintaining similar privacy and utility guarantees to DP-SGD. DP-BloGS is found to be significantly more resistant to data extraction attempts than DP-SGD. The theoretical results are validated by the experimental findings.
Authors: Zilong Zhang, Chang Niu, Zhibin Zhao, Xingwu Zhang, Xuefeng Chen
Abstract: Vision-based industrial inspection (VII) aims to locate defects quickly and accurately. Supervised learning under a close-set setting and industrial anomaly detection, as two common paradigms in VII, face different problems in practical applications. The former is that various and sufficient defects are difficult to obtain, while the latter is that specific defects cannot be located. To solve these problems, in this paper, we focus on the few-shot semantic segmentation (FSS) method, which can locate unseen defects conditioned on a few annotations without retraining. Compared to common objects in natural images, the defects in VII are small. This brings two problems to current FSS methods: 1 distortion of target semantics and 2 many false positives for backgrounds. To alleviate these problems, we propose a small object few-shot segmentation (SOFS) model. The key idea for alleviating 1 is to avoid the resizing of the original image and correctly indicate the intensity of target semantics. SOFS achieves this idea via the non-resizing procedure and the prototype intensity downsampling of support annotations. To alleviate 2, we design an abnormal prior map in SOFS to guide the model to reduce false positives and propose a mixed normal Dice loss to preferentially prevent the model from predicting false positives. SOFS can achieve FSS and few-shot anomaly detection determined by support masks. Diverse experiments substantiate the superior performance of SOFS. Code is available at https://github.com/zhangzilongc/SOFS.
Authors: Liangliang Liu, Yi Guan, BoRan Wang, Rujia Shen, Yi Lin, Chaoran Kong, Lian Yan, Jingchi Jiang
Abstract: Imagining potential outcomes of actions before execution helps agents make more informed decisions, a prospective thinking ability fundamental to human cognition. However, mainstream model-free Reinforcement Learning (RL) methods lack the ability to proactively envision future scenarios, plan, and guide strategies. These methods typically rely on trial and error to adjust policy functions, aiming to maximize cumulative rewards or long-term value, even if such high-reward decisions place the environment in extremely dangerous states. To address this, we propose the Prospective (ProSpec) RL method, which makes higher-value, lower-risk optimal decisions by imagining future n-stream trajectories. Specifically, ProSpec employs a dynamic model to predict future states (termed "imagined states") based on the current state and a series of sampled actions. Furthermore, we integrate the concept of Model Predictive Control and introduce a cycle consistency constraint that allows the agent to evaluate and select the optimal actions from these trajectories. Moreover, ProSpec employs cycle consistency to mitigate two fundamental issues in RL: augmenting state reversibility to avoid irreversible events (low risk) and augmenting actions to generate numerous virtual trajectories, thereby improving data efficiency. We validated the effectiveness of our method on the DMControl benchmarks, where our approach achieved significant performance improvements. Code will be open-sourced upon acceptance.
Authors: Danfeng Guo, Demetri Terzopoulos
Abstract: Large Vision-Language Models (LVLMs) have achieved significant success in recent years, and they have been extended to the medical domain. Although demonstrating satisfactory performance on medical Visual Question Answering (VQA) tasks, Medical LVLMs (MLVLMs) suffer from the hallucination problem, which makes them fail to diagnose complex pathologies. Moreover, they readily fail to learn minority pathologies due to imbalanced training data. We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. In the first strategy, we provide a detailed explanation of the queried pathology. In the second strategy, we fine-tune a cheap, weak learner to achieve high performance on a specific metric, and textually provide its judgment to the MLVLM. Tested on the MIMIC-CXR-JPG and Chexpert datasets, our methods significantly improve the diagnostic F1 score, with the highest increase being 0.27. We also demonstrate that our prompting strategies can be extended to general LVLM domains. Based on POPE metrics, it effectively suppresses the false negative predictions of existing LVLMs and improves Recall by approximately 0.07.
Authors: Yanxu Mao, Peipei Liu, Tiehan Cui
Abstract: Document-level relation extraction (DocRE) aims to extract relations between entities from unstructured document text. Compared to sentence-level relation extraction, it requires more complex semantic understanding from a broader text context. Currently, some studies are utilizing logical rules within evidence sentences to enhance the performance of DocRE. However, in the data without provided evidence sentences, researchers often obtain a list of evidence sentences for the entire document through evidence retrieval (ER). Therefore, DocRE suffers from two challenges: firstly, the relevance between evidence and entity pairs is weak; secondly, there is insufficient extraction of complex cross-relations between long-distance multi-entities. To overcome these challenges, we propose GEGA, a novel model for DocRE. The model leverages graph neural networks to construct multiple weight matrices, guiding attention allocation to evidence sentences. It also employs multi-scale representation aggregation to enhance ER. Subsequently, we integrate the most efficient evidence information to implement both fully supervised and weakly supervised training processes for the model. We evaluate the GEGA model on three widely used benchmark datasets: DocRED, Re-DocRED, and Revisit-DocRED. The experimental results indicate that our model has achieved comprehensive improvements compared to the existing SOTA model.
Authors: Simon Valentin, Jinmiao Fu, Gianluca Detommaso, Shaoyuan Xu, Giovanni Zappella, Bryan Wang
Abstract: Large language models (LLMs) can be prone to hallucinations - generating unreliable outputs that are unfaithful to their inputs, external facts or internally inconsistent. In this work, we address several challenges for post-hoc hallucination detection in production settings. Our pipeline for hallucination detection entails: first, producing a confidence score representing the likelihood that a generated answer is a hallucination; second, calibrating the score conditional on attributes of the inputs and candidate response; finally, performing detection by thresholding the calibrated score. We benchmark a variety of state-of-the-art scoring methods on different datasets, encompassing question answering, fact checking, and summarization tasks. We employ diverse LLMs to ensure a comprehensive assessment of performance. We show that calibrating individual scoring methods is critical for ensuring risk-aware downstream decision making. Based on findings that no individual score performs best in all situations, we propose a multi-scoring framework, which combines different scores and achieves top performance across all datasets. We further introduce cost-effective multi-scoring, which can match or even outperform more expensive detection methods, while significantly reducing computational overhead.
Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
Abstract: The Gaussian diffusion model, initially designed for image generation, has recently been adapted for 3D point cloud generation. However, these adaptations have not fully considered the intrinsic geometric characteristics of 3D shapes, thereby constraining the diffusion model's potential for 3D shape manipulation. To address this limitation, we introduce a novel deformable 3D shape diffusion model that facilitates comprehensive 3D shape manipulation, including point cloud generation, mesh deformation, and facial animation. Our approach innovatively incorporates a differential deformation kernel, which deconstructs the generation of geometric structures into successive non-rigid deformation stages. By leveraging a probabilistic diffusion model to simulate this step-by-step process, our method provides a versatile and efficient solution for a wide range of applications, spanning from graphics rendering to facial expression animation. Empirical evidence highlights the effectiveness of our approach, demonstrating state-of-the-art performance in point cloud generation and competitive results in mesh deformation. Additionally, extensive visual demonstrations reveal the significant potential of our approach for practical applications. Our method presents a unique pathway for advancing 3D shape manipulation and unlocking new opportunities in the realm of virtual reality.
Authors: Taiji Li, Zhi Li, Yin Zhang
Abstract: Despite large language models (LLMs) have demonstrated impressive performance in various tasks, they are still suffering from the factual inconsistency problem called hallucinations. For instance, LLMs occasionally generate content that diverges from source article, and prefer to extract information that appears at the beginning and end of the context, especially in long document summarization. Inspired by these findings, we propose to improve the faithfulness of LLMs in summarization by impelling them to process the entire article more fairly and faithfully. We present a novel summary generation strategy, namely SliSum, which exploits the ideas of sliding windows and self-consistency. Specifically, SliSum divides the source article into overlapping windows, and utilizes LLM to generate local summaries for the content in the windows. Finally, SliSum aggregates all local summaries using clustering and majority voting algorithm to produce more faithful summary of entire article. Extensive experiments demonstrate that SliSum significantly improves the faithfulness of diverse LLMs including LLaMA-2, Claude-2 and GPT-3.5 in both short and long text summarization, while maintaining their fluency and informativeness and without additional fine-tuning and resources. We further conduct qualitative and quantitative studies to investigate why SliSum works and impacts of hyperparameters in SliSum on performance.
Authors: Zhaolan Huang, Adrien Tousnakhoff, Polina Kozyr, Roman Rehausen, Felix Bie{\ss}mann, Robert Lachlan, Cedric Adjih, Emmanuel Baccelli
Abstract: Monitoring biodiversity at scale is challenging. Detecting and identifying species in fine grained taxonomies requires highly accurate machine learning (ML) methods. Training such models requires large high quality data sets. And deploying these models to low power devices requires novel compression techniques and model architectures. While species classification methods have profited from novel data sets and advances in ML methods, in particular neural networks, deploying these state of the art models to low power devices remains difficult. Here we present a comprehensive empirical comparison of various tinyML neural network architectures and compression techniques for species classification. We focus on the example of bird song detection, more concretely a data set curated for studying the corn bunting bird species. The data set is released along with all code and experiments of this study. In our experiments we compare predictive performance, memory and time complexity of classical spectrogram based methods and recent approaches operating on raw audio signal. Our results indicate that individual bird species can be robustly detected with relatively simple architectures that can be readily deployed to low power devices.
Authors: Mengtian Kang, Yansong Hu, Shuo Gao, Yuanyuan Liu, Hongbei Meng, Xuemeng Li, Xuhang Chen, Hubin Zhao, Jing Fu, Guohua Hu, Wei Wang, Yanning Dai, Arokia Nathan, Peter Smielewski, Ningli Wang, Shiming Li
Abstract: Childhood myopia constitutes a significant global health concern. It exhibits an escalating prevalence and has the potential to evolve into severe, irreversible conditions that detrimentally impact familial well-being and create substantial economic costs. Contemporary research underscores the importance of precisely predicting myopia progression to enable timely and effective interventions, thereby averting severe visual impairment in children. Such predictions predominantly rely on subjective clinical assessments, which are inherently biased and resource-intensive, thus hindering their widespread application. In this study, we introduce a novel, high-accuracy method for quantitatively predicting the myopic trajectory and myopia risk in children using only fundus images and baseline refraction data. This approach was validated through a six-year longitudinal study of 3,408 children in Henan, utilizing 16,211 fundus images and corresponding refractive data. Our method based on deep learning demonstrated predictive accuracy with an error margin of 0.311D per year and AUC scores of 0.944 and 0.995 for forecasting the risks of developing myopia and high myopia, respectively. These findings confirm the utility of our model in supporting early intervention strategies and in significantly reducing healthcare costs, particularly by obviating the need for additional metadata and repeated consultations. Furthermore, our method was designed to rely only on fundus images and refractive error data, without the need for meta data or multiple inquiries from doctors, strongly reducing the associated medical costs and facilitating large-scale screening. Our model can even provide good predictions based on only a single time measurement. Consequently, the proposed method is an important means to reduce medical inequities caused by economic disparities.
Authors: Gero Kolhof, Sebastiaan J. van Zelst
Abstract: Process models are, like event data, first-class citizens in most process mining approaches. Several process modeling formalisms have been proposed and used, e.g., Petri nets, BPMN, and process trees. Despite their frequent use, little research addresses the formal properties of process trees and the corresponding potential to improve the efficiency of solving common computational problems. Therefore, in this paper, we propose an invertible state space definition for process trees and demonstrate that the corresponding state space graph is isomorphic to the state space graph of the tree's inverse. Our result supports the development of novel, time-efficient, decomposition strategies for applications of process trees. Our experiments confirm that our state space definition allows for the adoption of bidirectional state space search, which significantly improves the overall performance of state space searches.
Authors: Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
Abstract: Incorporating a temporal dimension into pretrained image diffusion models for video generation is a prevalent approach. However, this method is computationally demanding and necessitates large-scale video datasets. More critically, the heterogeneity between image and video datasets often results in catastrophic forgetting of the image expertise. Recent attempts to directly extract video snippets from image diffusion models have somewhat mitigated these problems. Nevertheless, these methods can only generate brief video clips with simple movements and fail to capture fine-grained motion or non-grid deformation. In this paper, we propose a novel Zero-Shot video Sampling algorithm, denoted as $\mathcal{ZS}^2$, capable of directly sampling high-quality video clips from existing image synthesis methods, such as Stable Diffusion, without any training or optimization. Specifically, $\mathcal{ZS}^2$ utilizes the dependency noise model and temporal momentum attention to ensure content consistency and animation coherence, respectively. This ability enables it to excel in related tasks, such as conditional and context-specialized video generation and instruction-guided video editing. Experimental results demonstrate that $\mathcal{ZS}^2$ achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods. Homepage: \url{https://densechen.github.io/zss/}.
Authors: Zhirui Kuai, Zuxu Chen, Huimu Wang, Mingming Li, Dadong Miao, Binbin Wang, Xusong Chen, Li Kuang, Yuxing Han, Jiaxing Wang, Guoyu Tang, Lin Liu, Songlin Wang, Jingwei Zhuo
Abstract: Generative retrieval (GR) has emerged as a transformative paradigm in search and recommender systems, leveraging numeric-based identifier representations to enhance efficiency and generalization. Notably, methods like TIGER employing Residual Quantization-based Semantic Identifiers (RQ-SID), have shown significant promise in e-commerce scenarios by effectively managing item IDs. However, a critical issue termed the "\textbf{Hourglass}" phenomenon, occurs in RQ-SID, where intermediate codebook tokens become overly concentrated, hindering the full utilization of generative retrieval methods. This paper analyses and addresses this problem by identifying data sparsity and long-tailed distribution as the primary causes. Through comprehensive experiments and detailed ablation studies, we analyze the impact of these factors on codebook utilization and data distribution. Our findings reveal that the "Hourglass" phenomenon substantially impacts the performance of RQ-SID in generative retrieval. We propose effective solutions to mitigate this issue, thereby significantly enhancing the effectiveness of generative retrieval in real-world E-commerce applications.
Authors: Giuliano Martinelli, Edoardo Barba, Roberto Navigli
Abstract: Large autoregressive generative models have emerged as the cornerstone for achieving the highest performance across several Natural Language Processing tasks. However, the urge to attain superior results has, at times, led to the premature replacement of carefully designed task-specific approaches without exhaustive experimentation. The Coreference Resolution task is no exception; all recent state-of-the-art solutions adopt large generative autoregressive models that outperform encoder-based discriminative systems. In this work,we challenge this recent trend by introducing Maverick, a carefully designed - yet simple - pipeline, which enables running a state-of-the-art Coreference Resolution system within the constraints of an academic budget, outperforming models with up to 13 billion parameters with as few as 500 million parameters. Maverick achieves state-of-the-art performance on the CoNLL-2012 benchmark, training with up to 0.006x the memory resources and obtaining a 170x faster inference compared to previous state-of-the-art systems. We extensively validate the robustness of the Maverick framework with an array of diverse experiments, reporting improvements over prior systems in data-scarce, long-document, and out-of-domain settings. We release our code and models for research purposes at https://github.com/SapienzaNLP/maverick-coref.
Authors: Junxuan Yu, Rusi Chen, Yongsong Zhou, Yanlin Chen, Yaofei Duan, Yuhao Huang, Han Zhou, Tan Tao, Xin Yang, Dong Ni
Abstract: Echocardiography video is a primary modality for diagnosing heart diseases, but the limited data poses challenges for both clinical teaching and machine learning training. Recently, video generative models have emerged as a promising strategy to alleviate this issue. However, previous methods often relied on holistic conditions during generation, hindering the flexible movement control over specific cardiac structures. In this context, we propose an explainable and controllable method for echocardiography video generation, taking an initial frame and a motion curve as guidance. Our contributions are three-fold. First, we extract motion information from each heart substructure to construct motion curves, enabling the diffusion model to synthesize customized echocardiography videos by modifying these curves. Second, we propose the structure-to-motion alignment module, which can map semantic features onto motion curves across cardiac structures. Third, The position-aware attention mechanism is designed to enhance video consistency utilizing Gaussian masks with structural position information. Extensive experiments on three echocardiography datasets show that our method outperforms others regarding fidelity and consistency. The full code will be released at https://github.com/mlmi-2024-72/ECM.
Authors: Jawad Haidar, Marc Mouawad, Imad Elhajj, Daniel Asmar
Abstract: Instance segmentation is an advanced form of image segmentation which, beyond traditional segmentation, requires identifying individual instances of repeating objects in a scene. Mask R-CNN is the most common architecture for instance segmentation, and improvements to this architecture include steps such as benefiting from bounding box refinements, adding semantics, or backbone enhancements. In all the proposed variations to date, the problem of competing kernels (each class aims to maximize its own accuracy) persists when models try to synchronously learn numerous classes. In this paper, we propose mitigating this problem by replacing mask prediction with a Switch-Split block that processes refined ROIs, classifies them, and assigns them to specialized mask predictors. We name the method MaskUno and test it on various models from the literature, which are then trained on multiple classes using the benchmark COCO dataset. An increase in the mean Average Precision (mAP) of 2.03% was observed for the high-performing DetectoRS when trained on 80 classes. MaskUno proved to enhance the mAP of instance segmentation models regardless of the number and typ
Authors: I. M. Chernenkiy, Y. A. Drach, S. R. Mustakimova, V. V. Kazantseva, N. A. Ushakov, S. K. Efetov, M. V. Feldsherov
Abstract: Colorectal cancer is the third-most common cancer in the Western Hemisphere. The segmentation of colorectal and colorectal cancer by computed tomography is an urgent problem in medicine. Indeed, a system capable of solving this problem will enable the detection of colorectal cancer at early stages of the disease, facilitate the search for pathology by the radiologist, and significantly accelerate the process of diagnosing the disease. However, scientific publications on medical image processing mostly use closed, non-public data. This paper presents an extension of the Medical Decathlon dataset with colorectal markups in order to improve the quality of segmentation algorithms. An experienced radiologist validated the data, categorized it into subsets by quality, and published it in the public domain. Based on the obtained results, we trained neural network models of the UNet architecture with 5-part cross-validation and achieved a Dice metric quality of $0.6988 \pm 0.3$. The published markups will improve the quality of colorectal cancer detection and simplify the radiologist's job for study description.
Authors: Boele Visser, Peter van der Putten, Amirhossein Zohrehvand
Abstract: This study examines the influence of perceived AI features on user motivation in virtual interactions. AI avatars, being disclosed as being an AI, or embodying specific genders, could be used in user-AI interactions. Leveraging insights from AI and avatar research, we explore how AI disclosure and gender affect user motivation. We conducted a game-based experiment involving over 72,500 participants who solved search problems alone or with an AI companion. Different groups experienced varying AI appearances and disclosures. We measured play intensity. Results revealed that the presence of another avatar led to less intense play compared to solo play. Disclosure of the avatar as AI heightened effort intensity compared to non-disclosed AI companions. Additionally, a masculine AI appearance reduced effort intensity.
Authors: Lingxi Cui, Huan Li, Ke Chen, Lidan Shou, Gang Chen
Abstract: Machine learning (ML) on tabular data is ubiquitous, yet obtaining abundant high-quality tabular data for model training remains a significant obstacle. Numerous works have focused on tabular data augmentation (TDA) to enhance the original table with additional data, thereby improving downstream ML tasks. Recently, there has been a growing interest in leveraging the capabilities of generative AI for TDA. Therefore, we believe it is time to provide a comprehensive review of the progress and future prospects of TDA, with a particular emphasis on the trending generative AI. Specifically, we present an architectural view of the TDA pipeline, comprising three main procedures: pre-augmentation, augmentation, and post-augmentation. Pre-augmentation encompasses preparation tasks that facilitate subsequent TDA, including error handling, table annotation, table simplification, table representation, table indexing, table navigation, schema matching, and entity matching. Augmentation systematically analyzes current TDA methods, categorized into retrieval-based methods, which retrieve external data, and generation-based methods, which generate synthetic data. We further subdivide these methods based on the granularity of the augmentation process at the row, column, cell, and table levels. Post-augmentation focuses on the datasets, evaluation and optimization aspects of TDA. We also summarize current trends and future directions for TDA, highlighting promising opportunities in the era of generative AI. In addition, the accompanying papers and related resources are continuously updated and maintained in the GitHub repository at https://github.com/SuDIS-ZJU/awesome-tabular-data-augmentation to reflect ongoing advancements in the field.
URLs: https://github.com/SuDIS-ZJU/awesome-tabular-data-augmentation
Authors: Jingyao Wang, Emmanuel Bergeret, Issam Falih
Abstract: Human Activity Recognition (HAR) is a field of study that focuses on identifying and classifying human activities. Skeleton-based Human Activity Recognition has received much attention in recent years, where Graph Convolutional Network (GCN) based method is widely used and has achieved remarkable results. However, the representation of skeleton data and the issue of over-smoothing in GCN still need to be studied. 1). Compared to central nodes, edge nodes can only aggregate limited neighbor information, and different edge nodes of the human body are always structurally related. However, the information from edge nodes is crucial for fine-grained activity recognition. 2). The Graph Convolutional Network suffers from a significant over-smoothing issue, causing nodes to become increasingly similar as the number of network layers increases. Based on these two ideas, we propose a two-stream graph convolution method called Spatial-Structural GCN (SpSt-GCN). Spatial GCN performs information aggregation based on the topological structure of the human body, and structural GCN performs differentiation based on the similarity of edge node sequences. The spatial connection is fixed, and the human skeleton naturally maintains this topology regardless of the actions performed by humans. However, the structural connection is dynamic and depends on the type of movement the human body is performing. Based on this idea, we also propose an entirely data-driven structural connection, which greatly increases flexibility. We evaluate our method on two large-scale datasets, i.e., NTU RGB+D and NTU RGB+D 120. The proposed method achieves good results while being efficient.
Authors: Jun Zhou, Dongyang Yu, Kamran Aziz, Fangfang Su, Qing Zhang, Fei Li, Donghong Ji
Abstract: Fine-grained sentiment analysis involves extracting and organizing sentiment elements from textual data. However, existing approaches often overlook issues of category semantic inclusion and overlap, as well as inherent structural patterns within the target sequence. This study introduces a generative sentiment analysis model. To address the challenges related to category semantic inclusion and overlap, a latent category distribution variable is introduced. By reconstructing the input of a variational autoencoder, the model learns the intensity of the relationship between categories and text, thereby improving sequence generation. Additionally, a trie data structure and constrained decoding strategy are utilized to exploit structural patterns, which in turn reduces the search space and regularizes the generation process. Experimental results on the Restaurant-ACOS and Laptop-ACOS datasets demonstrate a significant performance improvement compared to baseline models. Ablation experiments further confirm the effectiveness of latent category distribution and constrained decoding strategy.
Authors: Min Jae Jung, JooHee Kim
Abstract: Large Language Models (LLMs) encounter significant challenges in continual learning due to catastrophic forgetting, where new information overwrites previously acquired knowledge. This limitation leads to substantial environmental and economic waste. In this study, we introduce the PMoE, Progressive Mixture of Experts with Asymmetric Transformer, which aims to minimize forgetting by utilizing an asymmetric design with shallow layers dedicated to general knowledge and deep layers for new knowledge. PMoE incorporates progressively added experts in deep layers and a router that allocates new knowledge to the appropriate experts efficiently. The router, positioned adjacent to the deep layers, utilizes deep features aggregating consolidated information. This enables the router to perform efficiently, allocating new knowledge to the appropriate experts, which progressively increase in the deep layers. Extensive experiments on TRACE datasets and general language understanding datasets demonstrate that the proposed PMoE outperforms previous state-of-the-art approaches.
Authors: Kit M. Bransby, Woo-jin Cho Kim, Jorge Oliveira, Alex Thorley, Arian Beqiri, Alberto Gomez, Agisilaos Chartsias
Abstract: Building an echocardiography view classifier that maintains performance in real-life cases requires diverse multi-site data, and frequent updates with newly available data to mitigate model drift. Simply fine-tuning on new datasets results in "catastrophic forgetting", and cannot adapt to variations of view labels between sites. Alternatively, collecting all data on a single server and re-training may not be feasible as data sharing agreements may restrict image transfer, or datasets may only become available at different times. Furthermore, time and cost associated with re-training grows with every new dataset. We propose a class-incremental learning method which learns an expert network for each dataset, and combines all expert networks with a score fusion model. The influence of ``unqualified experts'' is minimised by weighting each contribution with a learnt in-distribution score. These weights promote transparency as the contribution of each expert is known during inference. Instead of using the original images, we use learned features from each dataset, which are easier to share and raise fewer licensing and privacy concerns. We validate our work on six datasets from multiple sites, demonstrating significant reductions in training time while improving view classification performance.
Authors: Tristan Coignion, Cl\'ement Quinton, Romain Rouvoy
Abstract: This study evaluates the efficiency of code generation by Large Language Models (LLMs) and measures their performance against human-crafted solutions using a dataset from Leetcode. We compare 18 LLMs, considering factors such as model temperature and success rate, and their impact on code performance. This research introduces a novel method for measuring and comparing the speed of LLM-generated code, revealing that LLMs produce code with comparable performance, irrespective of the adopted LLM. We also find that LLMs are capable of generating code that is, on average, more efficient than the code written by humans. The paper further discusses the use of Leetcode as a benchmarking dataset, the limitations imposed by potential data contamination, and the platform's measurement reliability. We believe that our findings contribute to a better understanding of LLM capabilities in code generation and set the stage for future optimizations in the field.
Authors: Antoine P. Sanner, Nils F. Grauhan, Marc A. Brockmann, Ahmed E. Othman, Anirban Mukhopadhyay
Abstract: Patients with Intracranial Hemorrhage (ICH) face a potentially life-threatening condition, and patient-centered individualized treatment remains challenging due to possible clinical complications. Deep-Learning-based methods can efficiently analyze the routinely acquired head CTs to support the clinical decision-making. The majority of early work focuses on the detection and segmentation of ICH, but do not model the complex relations between ICH and adjacent brain structures. In this work, we design a tailored object detection method for ICH, which we unite with segmentation-grounded Scene Graph Generation (SGG) methods to learn a holistic representation of the clinical cerebral scene. To the best of our knowledge, this is the first application of SGG for 3D voxel images. We evaluate our method on two head-CT datasets and demonstrate that our model can recall up to 74% of clinically relevant relations. This work lays the foundation towards SGG for 3D voxel data. The generated Scene Graphs can already provide insights for the clinician, but are also valuable for all downstream tasks as a compact and interpretable representation.
Authors: Steven N. Hart, Thomas E. Tavolara
Abstract: Unsupervised embeddings are fundamental to numerous machine learning applications, yet their evaluation remains a challenging task. Traditional assessment methods often rely on extrinsic variables, such as performance in downstream tasks, which can introduce confounding factors and mask the true quality of embeddings. This paper introduces the Intrinsic Distance Preservation Evaluation (IDPE) method, a novel approach for assessing embedding quality based on the preservation of Mahalanobis distances between data points in the original and embedded spaces. We demonstrate the limitations of extrinsic evaluation methods through a simple example, highlighting how they can lead to misleading conclusions about embedding quality. IDPE addresses these issues by providing a task-independent measure of how well embeddings preserve the intrinsic structure of the original data. Our method leverages efficient similarity search techniques to make it applicable to large-scale datasets. We compare IDPE with established intrinsic metrics like trustworthiness and continuity, as well as extrinsic metrics such as Average Rank and Mean Reciprocal Rank. Our results show that IDPE offers a more comprehensive and reliable assessment of embedding quality across various scenarios. We evaluate PCA and t-SNE embeddings using IDPE, revealing insights into their performance that are not captured by traditional metrics. This work contributes to the field by providing a robust, efficient, and interpretable method for embedding evaluation. IDPE's focus on intrinsic properties offers a valuable tool for researchers and practitioners seeking to develop and assess high-quality embeddings for diverse machine learning applications.
Authors: Shoujin Huang, Guanxiong Luo, Yuwan Wang, Kexin Yang, Lingyan Zhang, Jingzhe Liu, Hua Guo, Min Wang, Mengye Lyu
Abstract: Simultaneous multislice (SMS) imaging is a powerful technique for accelerating magnetic resonance imaging (MRI) acquisitions. However, SMS reconstruction remains challenging due to the complex signal interactions between and within the excited slices. This study presents a robust SMS MRI reconstruction method using deep generative priors. Starting from Gaussian noise, we leverage denoising diffusion probabilistic models (DDPM) to gradually recover the individual slices through reverse diffusion iterations while imposing data consistency from the measured k-space under readout concatenation framework. The posterior sampling procedure is designed such that the DDPM training can be performed on single-slice images without special adjustments for SMS tasks. Additionally, our method integrates a low-frequency enhancement (LFE) module to address a practical issue that SMS-accelerated fast spin echo (FSE) and echo-planar imaging (EPI) sequences cannot easily embed autocalibration signals. Extensive experiments demonstrate that our approach consistently outperforms existing methods and generalizes well to unseen datasets. The code is available at https://github.com/Solor-pikachu/ROGER after the review process.
Authors: Jiafeng Zhong, Bin Li, Jiangyan Yi
Abstract: The task of partially spoofed audio localization aims to accurately determine audio authenticity at a frame level. Although some works have achieved encouraging results, utilizing boundary information within a single model remains an unexplored research topic. In this work, we propose a novel method called Boundary-aware Attention Mechanism (BAM). Specifically, it consists of two core modules: Boundary Enhancement and Boundary Frame-wise Attention. The former assembles the intra-frame and inter-frame information to extract discriminative boundary features that are subsequently used for boundary position detection and authenticity decision, while the latter leverages boundary prediction results to explicitly control the feature interaction between frames, which achieves effective discrimination between real and fake frames. Experimental results on PartialSpoof database demonstrate our proposed method achieves the best performance. The code is available at https://github.com/media-sec-lab/BAM.
Authors: Pedro Sarmento, Jackson Loth, Mathieu Barthet
Abstract: Generative AI models have recently blossomed, significantly impacting artistic and musical traditions. Research investigating how humans interact with and deem these models is therefore crucial. Through a listening and reflection study, we explore participants' perspectives on AI- vs human-generated progressive metal, in symbolic format, using rock music as a control group. AI-generated examples were produced by ProgGP, a Transformer-based model. We propose a mixed methods approach to assess the effects of generation type (human vs. AI), genre (progressive metal vs. rock), and curation process (random vs. cherry-picked). This combines quantitative feedback on genre congruence, preference, creativity, consistency, playability, humanness, and repeatability, and qualitative feedback to provide insights into listeners' experiences. A total of 32 progressive metal fans completed the study. Our findings validate the use of fine-tuning to achieve genre-specific specialization in AI music generation, as listeners could distinguish between AI-generated rock and progressive metal. Despite some AI-generated excerpts receiving similar ratings to human music, listeners exhibited a preference for human compositions. Thematic analysis identified key features for genre and AI vs. human distinctions. Finally, we consider the ethical implications of our work in promoting musical data diversity within MIR research by focusing on an under-explored genre.
Authors: Gabriel Turinici
Abstract: Time is not a dimension as the others. In Physics-Informed Neural Networks (PINN) several proposals attempted to adapt the time sampling or time weighting to take into account the specifics of this special dimension. But these proposals are not principled and need guidance to be used. We explain here theoretically why the Lyapunov exponents give actionable insights and propose a weighting scheme to automatically adapt to chaotic, periodic or stable dynamics. We characterize theoretically the best weighting scheme under computational constraints as a cumulative exponential integral of the local Lyapunov exponent estimators and show that it performs well in practice under the regimes mentioned above.
Authors: Yash Zambre, Ekdev Rajkitkul, Akshatha Mohan, Joshua Peeples
Abstract: Object detection plays a crucial role in the field of computer vision by autonomously identifying and locating objects of interest. The You Only Look Once (YOLO) model is an effective single-shot detector. However, YOLO faces challenges in cluttered or partially occluded scenes and can struggle with small, low-contrast objects. We propose a new method that integrates spatial transformer networks (STNs) into YOLO to improve performance. The proposed STN-YOLO aims to enhance the model's effectiveness by focusing on important areas of the image and improving the spatial invariance of the model before the detection process. Our proposed method improved object detection performance both qualitatively and quantitatively. We explore the impact of different localization networks within the STN module as well as the robustness of the model across different spatial transformations. We apply the STN-YOLO on benchmark datasets for Agricultural object detection as well as a new dataset from a state-of-the-art plant phenotyping greenhouse facility. Our code and dataset are publicly available.
Authors: Aswini Kumar Patra, Ankit Varshney, Lingaraj Sahoo
Abstract: Early detection of drought stress is critical for taking timely measures for reducing crop loss before the drought impact becomes irreversible. The subtle phenotypical and physiological changes in response to drought stress are captured by non-invasive imaging techniques and these imaging data serve as valuable resource for machine learning methods to identify drought stress. While convolutional neural networks (CNNs) are in wide use, vision transformers (ViTs) present a promising alternative in capturing long-range dependencies and intricate spatial relationships, thereby enhancing the detection of subtle indicators of drought stress. We propose an explainable deep learning pipeline that leverages the power of ViTs for drought stress detection in potato crops using aerial imagery. We applied two distinct approaches: a synergistic combination of ViT and support vector machine (SVM), where ViT extracts intricate spatial features from aerial images, and SVM classifies the crops as stressed or healthy and an end-to-end approach using a dedicated classification layer within ViT to directly detect drought stress. Our key findings explain the ViT model's decision-making process by visualizing attention maps. These maps highlight the specific spatial features within the aerial images that the ViT model focuses as the drought stress signature. Our findings demonstrate that the proposed methods not only achieve high accuracy in drought stress identification but also shedding light on the diverse subtle plant features associated with drought stress. This offers a robust and interpretable solution for drought stress monitoring for farmers to undertake informed decisions for improved crop management.
Authors: Wei Wang, Qing Li
Abstract: Neural networks are increasingly evolving towards training large models with big data, a method that has demonstrated superior performance across many tasks. However, this approach introduces an urgent problem: current deep learning models are predominantly serial, meaning that as the number of network layers increases, so do the training and inference times. This is unacceptable if deep learning is to continue advancing. Therefore, this paper proposes a deep learning parallelization strategy based on the Universal Approximation Theorem (UAT). From this foundation, we designed a parallel network called Para-Former to test our theory. Unlike traditional serial models, the inference time of Para-Former does not increase with the number of layers, significantly accelerating the inference speed of multi-layer networks. Experimental results validate the effectiveness of this network.
Authors: Krishan Agyakari Raja Babu, Rachana Sathish, Mrunal Pattanaik, Rahul Venkataramani
Abstract: Synthetic data is becoming increasingly integral in data-scarce fields such as medical imaging, serving as a substitute for real data. However, its inherent statistical characteristics can significantly impact downstream tasks, potentially compromising deployment performance. In this study, we empirically investigate this issue and uncover a critical phenomenon: downstream neural networks often exploit spurious distinctions between real and synthetic data when there is a strong correlation between the data source and the task label. This exploitation manifests as \textit{simplicity bias}, where models overly rely on superficial features rather than genuine task-related complexities. Through principled experiments, we demonstrate that the source of data (real vs.\ synthetic) can introduce spurious correlating factors leading to poor performance during deployment when the correlation is absent. We first demonstrate this vulnerability on a digit classification task, where the model spuriously utilizes the source of data instead of the digit to provide an inference. We provide further evidence of this phenomenon in a medical imaging problem related to cardiac view classification in echocardiograms, particularly distinguishing between 2-chamber and 4-chamber views. Given the increasing role of utilizing synthetic datasets, we hope that our experiments serve as effective guidelines for the utilization of synthetic datasets in model training.
Authors: Jichuan Zhang, Wei Li, Shuang Cheng, Ya-Li Li, Shengjin Wang
Abstract: Incremental object detection (IOD) aims to sequentially learn new classes, while maintaining the capability to locate and identify old ones. As the training data arrives with annotations only with new classes, IOD suffers from catastrophic forgetting. Prior methodologies mainly tackle the forgetting issue through knowledge distillation and exemplar replay, ignoring the conflict between limited model capacity and increasing knowledge. In this paper, we explore \textit{dynamic object queries} for incremental object detection built on Transformer architecture. We propose the \textbf{Dy}namic object \textbf{Q}uery-based \textbf{DE}tection \textbf{TR}ansformer (DyQ-DETR), which incrementally expands the model representation ability to achieve stability-plasticity tradeoff. First, a new set of learnable object queries are fed into the decoder to represent new classes. These new object queries are aggregated with those from previous phases to adapt both old and new knowledge well. Second, we propose the isolated bipartite matching for object queries in different phases, based on disentangled self-attention. The interaction among the object queries at different phases is eliminated to reduce inter-class confusion. Thanks to the separate supervision and computation over object queries, we further present the risk-balanced partial calibration for effective exemplar replay. Extensive experiments demonstrate that DyQ-DETR significantly surpasses the state-of-the-art methods, with limited parameter overhead. Code will be made publicly available.
Authors: Dylan hillier, Cheston Tan, Jing Jiang
Abstract: Social learning plays an important role in the development of human intelligence. As children, we imitate our parents' speech patterns until we are able to produce sounds; we learn from them praising us and scolding us; and as adults, we learn by working with others. In this work, we survey the degree to which this paradigm -- social learning -- has been mirrored in machine learning. In particular, since learning socially requires interacting with others, we are interested in how embodied agents can and have utilised these techniques. This is especially in light of the degree to which recent advances in natural language processing (NLP) enable us to perform new forms of social learning. We look at how behavioural cloning and next-token prediction mirror human imitation, how learning from human feedback mirrors human education, and how we can go further to enable fully communicative agents that learn from each other. We find that while individual social learning techniques have been used successfully, there has been little unifying work showing how to bring them together into socially embodied agents.
Authors: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
Abstract: Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the close-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate the open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
Authors: Joseph Geo Benjamin, Mothilal Asokan, Amna Alhosani, Hussain Alasmawi, Werner Gerhard Diehl, Leanne Bricker, Karthik Nandakumar, Mohammad Yaqub
Abstract: Self-supervised learning (SSL) methods are popular since they can address situations with limited annotated data by directly utilising the underlying data distribution. However, the adoption of such methods is not explored enough in ultrasound (US) imaging, especially for fetal assessment. We investigate the potential of dual-encoder SSL in utilizing unlabelled US video data to improve the performance of challenging downstream Standard Fetal Cardiac Planes (SFCP) classification using limited labelled 2D US images. We study 7 SSL approaches based on reconstruction, contrastive loss, distillation, and information theory and evaluate them extensively on a large private US dataset. Our observations and findings are consolidated from more than 500 downstream training experiments under different settings. Our primary observation shows that for SSL training, the variance of the dataset is more crucial than its size because it allows the model to learn generalisable representations, which improve the performance of downstream tasks. Overall, the BarlowTwins method shows robust performance, irrespective of the training settings and data variations, when used as an initialisation for downstream tasks. Notably, full fine-tuning with 1% of labelled data outperforms ImageNet initialisation by 12% in F1-score and outperforms other SSL initialisations by at least 4% in F1-score, thus making it a promising candidate for transfer learning from US video to image data.
Authors: Mothilal Asokan, Joseph Geo Benjamin, Mohammad Yaqub, Karthik Nandakumar
Abstract: Adapting foundation models for medical image analysis requires finetuning them on a considerable amount of data because of extreme distribution shifts between natural (source) data used for pretraining and medical (target) data. However, collecting task-specific medical data for such finetuning at a central location raises many privacy concerns. Although Federated learning (FL) provides an effective means for training on private decentralized data, communication costs in federating large foundation models can quickly become a significant bottleneck, impacting the solution's scalability. In this work, we address this problem of efficient communication while ensuring effective learning in FL by combining the strengths of Parameter-Efficient Fine-tuning (PEFT) with FL. Specifically, we study plug-and-play Low-Rank Adapters (LoRA) in a federated manner to adapt the Segment Anything Model (SAM) for 3D medical image segmentation. Unlike prior works that utilize LoRA and finetune the entire decoder, we critically analyze the contribution of each granular component of SAM on finetuning performance. Thus, we identify specific layers to be federated that are very efficient in terms of communication cost while producing on-par accuracy. Our experiments show that retaining the parameters of the SAM model (including most of the decoder) in their original state during adaptation is beneficial because fine-tuning on small datasets tends to distort the inherent capabilities of the underlying foundation model. On Fed-KiTS, our approach decreases communication cost (~48x) compared to full fine-tuning while increasing performance (~6% Dice score) in 3D segmentation tasks. Our approach performs similar to SAMed while achieving ~2.8x reduction in communication and parameters to be finetuned. We further validate our approach with experiments on Fed-IXI and Prostate MRI datasets.
Authors: Zhibin Duan, Tiansheng Wen, Yifei Wang, Chen Zhu, Bo Chen, Mingyuan Zhou
Abstract: Factor analysis, often regarded as a Bayesian variant of matrix factorization, offers superior capabilities in capturing uncertainty, modeling complex dependencies, and ensuring robustness. As the deep learning era arrives, factor analysis is receiving less and less attention due to their limited expressive ability. On the contrary, contrastive learning has emerged as a potent technique with demonstrated efficacy in unsupervised representational learning. While the two methods are different paradigms, recent theoretical analysis has revealed the mathematical equivalence between contrastive learning and matrix factorization, providing a potential possibility for factor analysis combined with contrastive learning. Motivated by the interconnectedness of contrastive learning, matrix factorization, and factor analysis, this paper introduces a novel Contrastive Factor Analysis framework, aiming to leverage factor analysis's advantageous properties within the realm of contrastive learning. To further leverage the interpretability properties of non-negative factor analysis, which can learn disentangled representations, contrastive factor analysis is extended to a non-negative version. Finally, extensive experimental validation showcases the efficacy of the proposed contrastive (non-negative) factor analysis methodology across multiple key properties, including expressiveness, robustness, interpretability, and accurate uncertainty estimation.
Authors: Junwei He, Qianqian Xu, Yangbangyan Jiang, Zitai Wang, Yuchen Sun, Qingming Huang
Abstract: With the progressive advancements in deep graph learning, out-of-distribution (OOD) detection for graph data has emerged as a critical challenge. While the efficacy of auxiliary datasets in enhancing OOD detection has been extensively studied for image and text data, such approaches have not yet been explored for graph data. Unlike Euclidean data, graph data exhibits greater diversity but lower robustness to perturbations, complicating the integration of outliers. To tackle these challenges, we propose the introduction of \textbf{H}ybrid External and Internal \textbf{G}raph \textbf{O}utlier \textbf{E}xposure (HGOE) to improve graph OOD detection performance. Our framework involves using realistic external graph data from various domains and synthesizing internal outliers within ID subgroups to address the poor robustness and presence of OOD samples within the ID class. Furthermore, we develop a boundary-aware OE loss that adaptively assigns weights to outliers, maximizing the use of high-quality OOD samples while minimizing the impact of low-quality ones. Our proposed HGOE framework is model-agnostic and designed to enhance the effectiveness of existing graph OOD detection models. Experimental results demonstrate that our HGOE framework can significantly improve the performance of existing OOD detection models across all 8 real datasets.
Authors: Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher R\'e, Azalia Mirhoseini
Abstract: Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
Authors: Mihir Chauhan, Abhishek Satbhai, Mohammad Abuzar Hashemi, Mir Basheer Ali, Bina Ramamurthy, Mingchen Gao, Siwei Lyu, Sargur Srihari
Abstract: Handwriting Verification is a critical in document forensics. Deep learning based approaches often face skepticism from forensic document examiners due to their lack of explainability and reliance on extensive training data and handcrafted features. This paper explores using Vision Language Models (VLMs), such as OpenAI's GPT-4o and Google's PaliGemma, to address these challenges. By leveraging their Visual Question Answering capabilities and 0-shot Chain-of-Thought (CoT) reasoning, our goal is to provide clear, human-understandable explanations for model decisions. Our experiments on the CEDAR handwriting dataset demonstrate that VLMs offer enhanced interpretability, reduce the need for large training datasets, and adapt better to diverse handwriting styles. However, results show that the CNN-based ResNet-18 architecture outperforms the 0-shot CoT prompt engineering approach with GPT-4o (Accuracy: 70%) and supervised fine-tuned PaliGemma (Accuracy: 71%), achieving an accuracy of 84% on the CEDAR AND dataset. These findings highlight the potential of VLMs in generating human-interpretable decisions while underscoring the need for further advancements to match the performance of specialized deep learning models.
Authors: Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
Abstract: As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with upstream model capabilities, potentially enabling "safetywashing" -- where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
Authors: Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Yueqian Lin, Qing Yu, Go Irie, Shafiq Joty, Yixuan Li, Hai Li, Ziwei Liu, Toshihiko Yamasaki, Kiyoharu Aizawa
Abstract: Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework was proposed, taxonomically categorizing these five problems. However, Vision Language Models (VLMs) such as CLIP have significantly changed the paradigm and blurred the boundaries between these fields, again confusing researchers. In this survey, we first present a generalized OOD detection v2, encapsulating the evolution of AD, ND, OSR, OOD detection, and OD in the VLM era. Our framework reveals that, with some field inactivity and integration, the demanding challenges have become OOD detection and AD. In addition, we also highlight the significant shift in the definition, problem settings, and benchmarks; we thus feature a comprehensive review of the methodology for OOD detection, including the discussion over other related tasks to clarify their relationship to OOD detection. Finally, we explore the advancements in the emerging Large Vision Language Model (LVLM) era, such as GPT-4V. We conclude this survey with open challenges and future directions.
Authors: Robert J. Moss, Anthony Corso, Jef Caers, Mykel J. Kochenderfer
Abstract: Real-world planning problems, including autonomous driving and sustainable energy applications like carbon storage and resource exploration, have recently been modeled as partially observable Markov decision processes (POMDPs) and solved using approximate methods. To solve high-dimensional POMDPs in practice, state-of-the-art methods use online planning with problem-specific heuristics to reduce planning horizons and make the problems tractable. Algorithms that learn approximations to replace heuristics have recently found success in large-scale fully observable domains. The key insight is the combination of online Monte Carlo tree search with offline neural network approximations of the optimal policy and value function. In this work, we bring this insight to partially observable domains and propose BetaZero, a belief-state planning algorithm for high-dimensional POMDPs. BetaZero learns offline approximations that replace heuristics to enable online decision making in long-horizon problems. We address several challenges inherent in large-scale partially observable domains; namely challenges of transitioning in stochastic environments, prioritizing action branching with a limited search budget, and representing beliefs as input to the network. To formalize the use of all limited search information, we train against a novel $Q$-weighted visit counts policy. We test BetaZero on various well-established POMDP benchmarks found in the literature and a real-world problem of critical mineral exploration. Experiments show that BetaZero outperforms state-of-the-art POMDP solvers on a variety of tasks.
Authors: Beatrice Milik, Stefan Saraev, Cristian Bleotiu, Radu Lupaescu, Bogdan Hobeanu, Andrei Ionut Damian
Abstract: Over the past few years, ubiquitous, or pervasive computing has gained popularity as the primary approach for a wide range of applications, including enterprise-grade systems, consumer applications, and gaming systems. Ubiquitous computing refers to the integration of computing technologies into everyday objects and environments, creating a network of interconnected devices that can communicate with each other and with humans. By using ubiquitous computing technologies, communities can become more connected and efficient, with members able to communicate and collaborate more easily. This enabled interconnectedness and collaboration can lead to a more successful and sustainable community. The spread of ubiquitous computing, however, has emphasized the importance of automated learning and smart applications in general. Even though there have been significant strides in Artificial Intelligence and Deep Learning, large scale adoption has been hesitant due to mounting pressure on expensive and highly complex cloud numerical-compute infrastructures. Adopting, and even developing, practical machine learning systems can come with prohibitive costs, not only in terms of complex infrastructures but also of solid expertise in Data Science and Machine Learning. In this paper we present an innovative approach for low-code development and deployment of end-to-end AI cooperative application pipelines. We address infrastructure allocation, costs, and secure job distribution in a fully decentralized global cooperative community based on tokenized economics.
Authors: Tom Zahavy, Vivek Veeriah, Shaobo Hou, Kevin Waugh, Matthew Lai, Edouard Leurent, Nenad Tomasev, Lisa Schut, Demis Hassabis, Satinder Singh
Abstract: In recent years, Artificial Intelligence (AI) systems have surpassed human intelligence in a variety of computational tasks. However, AI systems, like humans, make mistakes, have blind spots, hallucinate, and struggle to generalize to new situations. This work explores whether AI can benefit from creative decision-making mechanisms when pushed to the limits of its computational rationality. In particular, we investigate whether a team of diverse AI systems can outperform a single AI in challenging tasks by generating more ideas as a group and then selecting the best ones. We study this question in the game of chess, the so-called drosophila of AI. We build on AlphaZero (AZ) and extend it to represent a league of agents via a latent-conditioned architecture, which we call AZ_db. We train AZ_db to generate a wider range of ideas using behavioral diversity techniques and select the most promising ones with sub-additive planning. Our experiments suggest that AZ_db plays chess in diverse ways, solves more puzzles as a group and outperforms a more homogeneous team. Notably, AZ_db solves twice as many challenging puzzles as AZ, including the challenging Penrose positions. When playing chess from different openings, we notice that players in AZ_db specialize in different openings, and that selecting a player for each opening using sub-additive planning results in a 50 Elo improvement over AZ. Our findings suggest that diversity bonuses emerge in teams of AI agents, just as they do in teams of humans and that diversity is a valuable asset in solving computationally hard problems.
Authors: Bj\"orn Schembera, Frank W\"ubbeling, Hendrik Kleikamp, Christine Biedinger, Jochen Fiedler, Marco Reidelbach, Aurela Shehu, Burkhard Schmidt, Thomas Koprucki, Dorothea Iglezakis, Dominik G\"oddeke
Abstract: In applied mathematics and related disciplines, the modeling-simulation-optimization workflow is a prominent scheme, with mathematical models and numerical algorithms playing a crucial role. For these types of mathematical research data, the Mathematical Research Data Initiative has developed, merged and implemented ontologies and knowledge graphs. This contributes to making mathematical research data FAIR by introducing semantic technology and documenting the mathematical foundations accordingly. Using the concrete example of microfracture analysis of porous media, it is shown how the knowledge of the underlying mathematical model and the corresponding numerical algorithms for its solution can be represented by the ontologies.
Authors: Yuyi Mao, Xianghao Yu, Kaibin Huang, Ying-Jun Angela Zhang, Jun Zhang
Abstract: Artificial intelligence (AI) technologies have emerged as pivotal enablers across a multitude of industries largely due to their significant resurgence over the past decade. The transformative power of AI is primarily derived from the utilization of deep neural networks (DNNs), which require extensive data for training and substantial computational resources for processing. Consequently, DNN models are typically trained and deployed on resource-rich cloud servers. However, due to potential latency issues associated with cloud communications, deep learning (DL) workflows are increasingly being transitioned to wireless edge networks in proximity to end-user devices (EUDs). This shift is designed to support latency-sensitive applications and has given rise to a new paradigm of edge AI, which will play a critical role in upcoming sixth-generation (6G) networks to support ubiquitous AI applications. Despite its considerable potential, edge AI faces substantial challenges, mostly due to the dichotomy between the resource limitations of wireless edge networks and the resource-intensive nature of DL. Specifically, the acquisition of large-scale data, as well as the training and inference processes of DNNs, can rapidly deplete the battery energy of EUDs. This necessitates an energy-conscious approach to edge AI to ensure both optimal and sustainable performance. In this paper, we present a contemporary survey on green edge AI. We commence by analyzing the principal energy consumption components of edge AI systems to identify the fundamental design principles of green edge AI. Guided by these principles, we then explore energy-efficient design methodologies for the three critical tasks in edge AI systems, including training data acquisition, edge training, and edge inference. Finally, we underscore potential future research directions to further enhance the energy efficiency of edge AI.
Authors: Shaokun Zhang, Jieyu Zhang, Jiale Liu, Linxin Song, Chi Wang, Ranjay Krishna, Qingyun Wu
Abstract: Researchers and practitioners have recently reframed powerful Large Language Models (LLMs) as agents, enabling them to automate complex tasks largely via the use of specialized functions. To facilitate the development of LLM agents, we present a novel paradigm of training LLM agents without modifying the LLM weights, which is particularly useful when the LLMs are difficult or inaccessible for modifications. Inspired by how humans continuously forge tools to adapt to real-world tasks, rather than change our biological structure to fit a static set of tools, we propose to progressively forge agent's functions to better solve the downstream tasks instead of modifying the LLM weights. By treating the functions as learnable `agent parameters' and leveraging the fundamental idea of model training in artificial intelligence, we develop AgentOptimizer that employs the LLM to update agents' functions and devise an agent training algorithm with two strategies, roll-back, and early-stop, to streamline the training process. With extensive experiments, we showcase that the agent training paradigm could significantly improve the performance of representative LLM agents in various downstream tasks. We also study the behavior of the agent training regarding aspects like the learning curve and domain transferability.
Authors: Marco Reidelbach, Bj\"orn Schembera, Marcus Weber
Abstract: Modeling-Simulation-Optimization workflows play a fundamental role in applied mathematics. The Mathematical Research Data Initiative, MaRDI, responded to this by developing a FAIR and machine-interpretable template for a comprehensive documentation of such workflows. MaRDMO, a Plugin for the Research Data Management Organiser, enables scientists from diverse fields to document and publish their workflows on the MaRDI Portal seamlessly using the MaRDI template. Central to these workflows are mathematical models. MaRDI addresses them with the MathModDB ontology, offering a structured formal model description. Here, we showcase the interaction between MaRDMO and the MathModDB Knowledge Graph through an algebraic modeling workflow from the Digital Humanities. This demonstration underscores the versatility of both services beyond their original numerical domain.
Authors: Juan G. Colonna, Ahmed A. Fares, M\'arcio Duarte, Ricardo Sousa
Abstract: Process Mining offers a powerful framework for uncovering, analyzing, and optimizing real-world business processes. Petri nets provide a versatile means of modeling process behavior. However, traditional methods often struggle to effectively compare complex Petri nets, hindering their potential for process enhancement. To address this challenge, we introduce PetriNet2Vec, an unsupervised methodology inspired by Doc2Vec. This approach converts Petri nets into embedding vectors, facilitating the comparison, clustering, and classification of process models. We validated our approach using the PDC Dataset, comprising 96 diverse Petri net models. The results demonstrate that PetriNet2Vec effectively captures the structural properties of process models, enabling accurate process classification and efficient process retrieval. Specifically, our findings highlight the utility of the learned embeddings in two key downstream tasks: process classification and process retrieval. In process classification, the embeddings allowed for accurate categorization of process models based on their structural properties. In process retrieval, the embeddings enabled efficient retrieval of similar process models using cosine distance. These results demonstrate the potential of PetriNet2Vec to significantly enhance process mining capabilities.
Authors: Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig
Abstract: Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field: Cognitive Neuroscience. Here we draw the relevant connections and highlight lessons that can be transferred productively between fields. Based on these, we propose a general conceptual framework and give concrete methodological strategies for building mechanistic explanations in AI inner interpretability research. With this conceptual framework, Inner Interpretability can fend off critiques and position itself on a productive path to explain AI systems.
Authors: Gianmarco Mengaldo
Abstract: The scientific method is the cornerstone of human progress across all branches of the natural and applied sciences, from understanding the human body to explaining how the universe works. The scientific method is based on identifying systematic rules or principles that describe the phenomenon of interest in a reproducible way that can be validated through experimental evidence. In the era of artificial intelligence (AI), there are discussions on how AI systems may discover new knowledge. We argue that human complex reasoning for scientific discovery remains of vital importance, at least before the advent of artificial general intelligence. Yet, AI can be leveraged for scientific discovery via explainable AI. More specifically, knowing what data AI systems deemed important to make decisions can be a point of contact with domain experts and scientists, that can lead to divergent or convergent views on a given scientific problem. Divergent views may spark further scientific investigations leading to new scientific knowledge.
Authors: Jiuding Yang, Weidong Guo, Kaitong Yang, Xiangyang Li, Zhuwei Rao, Yu Xu, Di Niu
Abstract: The effective alignment of Large Language Models (LLMs) with precise instructions is essential for their application in diverse real-world scenarios. Current methods focus on enhancing the diversity and complexity of training and evaluation samples, yet they fall short in accurately assessing LLMs' ability to follow similar instruction variants. We introduce an effective data augmentation technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants, thereby preserves the original instruction's context and complexity while introducing variability, which is critical for training and evaluating LLMs' instruction-following precision. We developed the DeMoRecon dataset using this method to both fine-tune and evaluate LLMs. Our findings show that LLMs fine-tuned with DeMoRecon will gain significant performance boost on both ours and commonly used instructions-following benchmarks.
Authors: Song Zhang, Yuqing Duan, Daoliang Li, Ran Zhao
Abstract: In underwater image enhancement (UIE), convolutional neural networks (CNN) have inherent limitations in modeling long-range dependencies and are less effective in recovering global features. While Transformers excel at modeling long-range dependencies, their quadratic computational complexity with increasing image resolution presents significant efficiency challenges. Additionally, most supervised learning methods lack effective physical model constraint, which can lead to insufficient realism and overfitting in generated images. To address these issues, we propose a physical model constraint-based underwater image enhancement framework, Mamba-UIE. Specifically, we decompose the input image into four components: underwater scene radiance, direct transmission map, backscatter transmission map, and global background light. These components are reassembled according to the revised underwater image formation model, and the reconstruction consistency constraint is applied between the reconstructed image and the original image, thereby achieving effective physical constraint on the underwater image enhancement process. To tackle the quadratic computational complexity of Transformers when handling long sequences, we introduce the Mamba-UIE network based on linear complexity state space models. By incorporating the Mamba in Convolution block, long-range dependencies are modeled at both the channel and spatial levels, while the CNN backbone is retained to recover local features and details. Extensive experiments on three public datasets demonstrate that our proposed Mamba-UIE outperforms existing state-of-the-art methods, achieving a PSNR of 27.13 and an SSIM of 0.93 on the UIEB dataset. Our method is available at https://github.com/zhangsong1213/Mamba-UIE.
Authors: Yunsheng Wang, Songhao Chen, Kevin Jin
Abstract: Knowledge graphs (KGs) are essential in applications such as network alignment, question-answering, and recommender systems (RSs) since they offer structured relational data that facilitate the inference of indirect relationships. However, the development of KG-based RSs capable of processing user inputs in natural language faces significant challenges. Firstly, natural language processing units must effectively handle the ambiguity and variability in human language to interpret user intents accurately. Secondly, the system must precisely identify and link entities, like product names, to their corresponding nodes in KGs. To overcome these challenges, supported by Lenovo, we developed a novel chatbot called "Prometheus," which integrates a KG with a large language model (LLM), specifically designed for recommending computer components. This chatbot can accurately decode user requests and deliver personalized recommendations derived from KGs, ensuring precise comprehension and response to their computer setup needs.
Authors: Xinhao Luo, Man Yao, Yuhong Chou, Bo Xu, Guoqi Li
Abstract: Brain-inspired Spiking Neural Networks (SNNs) have bio-plausibility and low-power advantages over Artificial Neural Networks (ANNs). Applications of SNNs are currently limited to simple classification tasks because of their poor performance. In this work, we focus on bridging the performance gap between ANNs and SNNs on object detection. Our design revolves around network architecture and spiking neuron. First, the overly complex module design causes spike degradation when the YOLO series is converted to the corresponding spiking version. We design a SpikeYOLO architecture to solve this problem by simplifying the vanilla YOLO and incorporating meta SNN blocks. Second, object detection is more sensitive to quantization errors in the conversion of membrane potentials into binary spikes by spiking neurons. To address this challenge, we design a new spiking neuron that activates Integer values during training while maintaining spike-driven by extending virtual timesteps during inference. The proposed method is validated on both static and neuromorphic object detection datasets. On the static COCO dataset, we obtain 66.2% mAP@50 and 48.9% mAP@50:95, which is +15.0% and +18.7% higher than the prior state-of-the-art SNN, respectively. On the neuromorphic Gen1 dataset, we achieve 67.2% mAP@50, which is +2.5% greater than the ANN with equivalent architecture, and the energy efficiency is improved by 5.7*. Code: https://github.com/BICLab/SpikeYOLO
Authors: Xavier Renard, Thibault Laugel, Marcin Detyniecki
Abstract: A multitude of classifiers can be trained on the same data to achieve similar performances during test time, while having learned significantly different classification patterns. This phenomenon, which we call prediction discrepancies, is often associated with the blind selection of one model instead of another with similar performances. When making a choice, the machine learning practitioner has no understanding on the differences between models, their limits, where they agree and where they don't. But his/her choice will result in concrete consequences for instances to be classified in the discrepancy zone, since the final decision will be based on the selected classification pattern. Besides the arbitrary nature of the result, a bad choice could have further negative consequences such as loss of opportunity or lack of fairness. This paper proposes to address this question by analyzing the prediction discrepancies in a pool of best-performing models trained on the same data. A model-agnostic algorithm, DIG, is proposed to capture and explain discrepancies locally, to enable the practitioner to make the best educated decision when selecting a model by anticipating its potential undesired consequences. All the code to reproduce the experiments is available.
Authors: Tina Dorosti, Manuel Schultheiss, Philipp Schmette, Jule Heuchert, Johannes Thalhammer, Florian Schaff, Thorsten Sellerer, Rafael Schick, Kirsten Taphorn, Korbinian Mechlem, Lorenz Birnbacher, Franz Pfeiffer, Daniela Pfeiffer
Abstract: Purpose: We aimed to estimate the total lung volume (TLV) from real and synthetic frontal X-ray radiographs on a pixel level using lung thickness maps generated by a U-Net. Methods: 5,959 thorax X-ray computed tomography (CT) scans were retrieved from two publicly available datasets of the lung nodule analysis 2016 (n=656) and the RSNA pulmonary embolism detection challenge 2020 (n=5,303). Additionally, thorax CT scans from 72 subjects (33 healthy: 20 men, mean age [range] = 62.4 [34, 80]; 39 suffering from chronic obstructive pulmonary disease: 25 men, mean age [range] = 69.0 [47, 91]) were retrospectively selected (10.2018-12.2019) from our in-house dataset such that for each subject, a frontal chest X-ray radiograph no older than seven days was available. All CT scans and their corresponding lung segmentation were forward projected using a simulated X-ray spectrum to generate synthetic radiographs and lung thickness maps, respectively. A U-Net model was trained and tested on synthetic radiographs from the public datasets to predict lung thickness maps and consequently estimate TLV. Model performance was further assessed by evaluating the TLV estimations for the in-house synthetic and real radiograph pairs using Pearson correlation coefficient (r) and significance testing. Results: Strong correlations were measured between the predicted and CT-derived ground truth TLV values for test data from synthetic ($n_{Public}$=1,191, r=0.987, P < 0.001; $n_{In-house}$=72, r=0.973, P < 0.001) and real radiographs (n=72, r=0.908, P < 0.001). Conclusion: TLV from U-Net-generated pixel-level lung thickness maps were successfully estimated for synthetic and real radiographs.
Authors: Spandan Garg, Roshanak Zilouchian Moghaddam, Neel Sundaresan
Abstract: Performance bugs are non-functional bugs that can even manifest in well-tested commercial products. Fixing these performance bugs is an important yet challenging problem. In this work, we address this challenge and present a new approach called Retrieval-Augmented Prompt Generation (RAPGen). Given a code snippet with a performance issue, RAPGen first retrieves a prompt instruction from a pre-constructed knowledge-base of previous performance bug fixes and then generates a prompt using the retrieved instruction. It then uses this prompt on a Large Language Model (such as Codex) in zero-shot to generate a fix. We compare our approach with the various prompt variations and state of the art methods in the task of performance bug fixing. Our evaluation shows that RAPGen can generate performance improvement suggestions equivalent or better than a developer in ~60% of the cases, getting ~42% of them verbatim, in an expert-verified dataset of past performance changes made by C# developers.
Authors: Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer
Abstract: The legality of training language models (LMs) on copyrighted or otherwise restricted data is under intense debate. However, as we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. The datastore allows use of high-risk data without training on it, supports sentence-level data attribution, and enables data producers to opt out from the model by removing content from the store. These capabilities can foster compliance with data-use regulations such as the fair use doctrine in the United States and the GDPR in the European Union. Our experiments show that the parametric LM struggles on domains not covered by OLC. However, access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile, a more diverse corpus with mostly high-risk text. We also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. Our results suggest that it is possible to build high quality language models while mitigating their legal risk.
Authors: Jiho Shin, Sepehr Hashtroudi, Hadi Hemmati, Song Wang
Abstract: Recently, deep learning-based test case generation approaches have been proposed to automate the generation of unit test cases. In this study, we leverage Transformer-based code models to generate unit tests with the help of Domain Adaptation (DA) at a project level. Specifically, we use CodeT5, a relatively small language model trained on source code data, and fine-tune it on the test generation task. Then, we apply domain adaptation to each target project data to learn project-specific knowledge (project-level DA). We use the Methods2test dataset to fine-tune CodeT5 for the test generation task and the Defects4j dataset for project-level domain adaptation and evaluation. We compare our approach with (a) CodeT5 fine-tuned on the test generation without DA, (b) the A3Test tool, and (c) GPT-4 on five projects from the Defects4j dataset. The results show that tests generated using DA can increase the line coverage by 18.62%, 19.88%, and 18.02% and mutation score by 16.45%, 16.01%, and 12.99% compared to the above (a), (b), and (c) baselines, respectively. The overall results show consistent improvements in metrics such as parse rate, compile rate, BLEU, and CodeBLEU. In addition, we show that our approach can be seen as a complementary solution alongside existing search-based test generation tools such as EvoSuite, to increase the overall coverage and mutation scores with an average of 34.42% and 6.8%, for line coverage and mutation score, respectively.
Authors: Kristina Gligoric, Tiziano Piccardi, Jake Hofman, Robert West
Abstract: Science is facing a reproducibility crisis. Previous work has proposed incorporating data analysis replications into classrooms as a potential solution. However, despite the potential benefits, it is unclear whether this approach is feasible, and if so, what the involved stakeholders-students, educators, and scientists-should expect from it. Can students perform a data analysis replication over the course of a class? What are the costs and benefits for educators? And how can this solution help benchmark and improve the state of science? In the present study, we incorporated data analysis replications in the project component of the Applied Data Analysis course (CS-401) taught at EPFL (N=354 students). Here we report pre-registered findings based on surveys administered throughout the course. First, we demonstrate that students can replicate previously published scientific papers, most of them qualitatively and some exactly. We find discrepancies between what students expect of data analysis replications and what they experience by doing them along with changes in expectations about reproducibility, which together serve as evidence of attitude shifts to foster students' critical thinking. Second, we provide information for educators about how much overhead is needed to incorporate replications into the classroom and identify concerns that replications bring as compared to more traditional assignments. Third, we identify tangible benefits of the in-class data analysis replications for scientific communities, such as a collection of replication reports and insights about replication barriers in scientific work that should be avoided going forward. Overall, we demonstrate that incorporating replication tasks into a large data science class can increase the reproducibility of scientific work as a by-product of data science instruction, thus benefiting both science and students.
Authors: Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
Abstract: A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
Authors: Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, Jun Xu
Abstract: Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search, by generating vast amounts of human-like texts on the Internet. As a result, IR systems in the LLM era are facing a new challenge: the indexed documents are now not only written by human beings but also automatically generated by the LLMs. How these LLM-generated documents influence the IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this category of biases in neural retrievers towards the LLM-generated content as the \textbf{source bias}. Moreover, we discover that this bias is not confined to the first-stage neural retrievers, but extends to the second-stage neural re-rankers. Then, in-depth analyses from the perspective of text compression indicate that LLM-generated texts exhibit more focused semantics with less noise, making it easier for neural retrieval models to semantic match. To mitigate the source bias, we also propose a plug-and-play debiased constraint for the optimization objective, and experimental results show its effectiveness. Finally, we discuss the potential severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the constructed two new benchmarks are available at https://github.com/KID-22/Source-Bias.
Authors: Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
Abstract: Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
Authors: Max Lamparth, Anthony Corso, Jacob Ganz, Oriana Skylar Mastro, Jacquelyn Schneider, Harold Trinkunas
Abstract: To some, the advent of AI promises better decision-making and increased military effectiveness while reducing the influence of human error and emotions. However, there is still debate about how AI systems, especially large language models (LLMs) that can be applied to many tasks, behave compared to humans in high-stakes military decision-making scenarios with the potential for increased risks towards escalation and unnecessary conflicts. To test this potential and scrutinize the use of LLMs for such purposes, we use a new wargame experiment with 107 national security experts designed to examine crisis escalation in a fictional US-China scenario and compare the behavior of human player teams to LLM-simulated team responses in separate simulations. Here, we find that the LLM-simulated responses can be more aggressive and significantly affected by changes in the scenario. We show a considerable high-level agreement in the LLM and human responses and significant quantitative and qualitative differences in individual actions and strategic tendencies. These differences depend on intrinsic biases in LLMs regarding the appropriate level of violence following strategic instructions, the choice of LLM, and whether the LLMs are tasked to decide for a team of players directly or first to simulate dialog between a team of players. When simulating the dialog, the discussions lack quality and maintain a farcical harmony. The LLM simulations cannot account for human player characteristics, showing no significant difference even for extreme traits, such as "pacifist" or "aggressive sociopath." When probing behavioral consistency across individual moves of the simulation, the tested LLMs deviated from each other but generally showed somewhat consistent behavior. Our results motivate policymakers to be cautious before granting autonomy or following AI-based strategy recommendations.
Authors: Shoujin Huang, Guanxiong Luo, Xi Wang, Ziran Chen, Yuwan Wang, Huaishui Yang, Pheng-Ann Heng, Lingyan Zhang, Mengye Lyu
Abstract: In general, diffusion model-based MRI reconstruction methods incrementally remove artificially added noise while imposing data consistency to reconstruct the underlying images. However, real-world MRI acquisitions already contain inherent noise due to thermal fluctuations. This phenomenon is particularly notable when using ultra-fast, high-resolution imaging sequences for advanced research, or using low-field systems favored by low- and middle-income countries. These common scenarios can lead to sub-optimal performance or complete failure of existing diffusion model-based reconstruction techniques. Specifically, as the artificially added noise is gradually removed, the inherent MRI noise becomes increasingly pronounced, making the actual noise level inconsistent with the predefined denoising schedule and consequently inaccurate image reconstruction. To tackle this problem, we propose a posterior sampling strategy with a novel NoIse Level Adaptive Data Consistency (Nila-DC) operation. Extensive experiments are conducted on two public datasets and an in-house clinical dataset with field strength ranging from 0.3T to 3T, showing that our method surpasses the state-of-the-art MRI reconstruction methods, and is highly robust against various noise levels. The code for Nila is available at https://github.com/Solor-pikachu/Nila.
Authors: Nathan Cornille, Marie-Francine Moens, Florian Mai
Abstract: By training to predict the next token in an unlabeled corpus, large language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require planning, such as writing a coherent article. In this paper, we train a module for planning the future writing process via a self-supervised learning objective. Given the textual context, this planning module learns to predict future abstract writing actions, which correspond to centroids in a clustered text embedding space. By conditioning on these actions, our model extends the successful language model formula to more abstract planning in an unsupervised way. Empirically, we demonstrate that our method improves language modeling performance in general, particularly with respect to the text structure. Because our framework uses a planner module that is unsupervised and external to the language model, new planner modules can be trained at large scale and easily be shared with the community.
Authors: Poulami Sinhamahapatra, Suprosanna Shit, Anjany Sekuboyina, Malek Husseini, David Schinz, Nicolas Lenhart, Joern Menze, Jan Kirschke, Karsten Roscher, Stephan Guennemann
Abstract: Vertebral fracture grading classifies the severity of vertebral fractures, which is a challenging task in medical imaging and has recently attracted Deep Learning (DL) models. Only a few works attempted to make such models human-interpretable despite the need for transparency and trustworthiness in critical use cases like DL-assisted medical diagnosis. Moreover, such models either rely on post-hoc methods or additional annotations. In this work, we propose a novel interpretable-by-design method, ProtoVerse, to find relevant sub-parts of vertebral fractures (prototypes) that reliably explain the model's decision in a human-understandable way. Specifically, we introduce a novel diversity-promoting loss to mitigate prototype repetitions in small datasets with intricate semantics. We have experimented with the VerSe'19 dataset and outperformed the existing prototype-based method. Further, our model provides superior interpretability against the post-hoc method. Importantly, expert radiologists validated the visual interpretability of our results, showing clinical applicability.
Authors: Denis Donadel, Francesco Marchiori, Luca Pajola, Mauro Conti
Abstract: Recent advancements in Artificial Intelligence, and particularly Large Language Models (LLMs), offer promising prospects for aiding system administrators in managing the complexity of modern networks. However, despite this potential, a significant gap exists in the literature regarding the extent to which LLMs can understand computer networks. Without empirical evidence, system administrators might rely on these models without assurance of their efficacy in performing network-related tasks accurately. In this paper, we are the first to conduct an exhaustive study on LLMs' comprehension of computer networks. We formulate several research questions to determine whether LLMs can provide correct answers when supplied with a network topology and questions on it. To assess them, we developed a thorough framework for evaluating LLMs' capabilities in various network-related tasks. We evaluate our framework on multiple computer networks employing proprietary (e.g., GPT4) and open-source (e.g., Llama2) models. Our findings in general purpose LLMs using a zero-shot scenario demonstrate promising results, with the best model achieving an average accuracy of 79.3%. Proprietary LLMs achieve noteworthy results in small and medium networks, while challenges persist in comprehending complex network topologies, particularly for open-source models. Moreover, we provide insight into how prompt engineering can enhance the accuracy of some tasks.
Authors: Patricia A. Apell\'aniz, Ana Jim\'enez, Borja Arroyo Galende, Juan Parras, Santiago Zazo
Abstract: The ever-increasing use of generative models in various fields where tabular data is used highlights the need for robust and standardized validation metrics to assess the similarity between real and synthetic data. Current methods lack a unified framework and rely on diverse and often inconclusive statistical measures. Divergences, which quantify discrepancies between data distributions, offer a promising avenue for validation. However, traditional approaches calculate divergences independently for each feature due to the complexity of joint distribution modeling. This paper addresses this challenge by proposing a novel approach that uses divergence estimation to overcome the limitations of marginal comparisons. Our core contribution lies in applying a divergence estimator to build a validation metric considering the joint distribution of real and synthetic data. We leverage a probabilistic classifier to approximate the density ratio between datasets, allowing the capture of complex relationships. We specifically calculate two divergences: the well-known Kullback-Leibler (KL) divergence and the Jensen-Shannon (JS) divergence. KL divergence offers an established use in the field, while JS divergence is symmetric and bounded, providing a reliable metric. The efficacy of this approach is demonstrated through a series of experiments with varying distribution complexities. The initial phase involves comparing estimated divergences with analytical solutions for simple distributions, setting a benchmark for accuracy. Finally, we validate our method on a real-world dataset and its corresponding synthetic counterpart, showcasing its effectiveness in practical applications. This research offers a significant contribution with applicability beyond tabular data and the potential to improve synthetic data validation in various fields.
Authors: Juyoung Yun, Jungmin Shin
Abstract: In the era of space exploration, coronal holes on the sun play a significant role due to their impact on satellites and aircraft through their open magnetic fields and increased solar wind emissions. This study employs computer vision techniques to detect coronal hole regions and estimate their sizes using imagery from the Solar Dynamics Observatory (SDO). Additionally, we utilize hybrid time series prediction model, specifically combination of Long Short-Term Memory (LSTM) networks and ARIMA, to analyze trends in the area of coronal holes and predict their areas across various solar regions over a span of seven days. By examining time series data, we aim to identify patterns in coronal hole behavior and understand their potential effects on space weather.
Authors: Mustafa Mert \c{C}elikok, Frans A. Oliehoek, Jan-Willem van de Meent
Abstract: We consider inverse reinforcement learning problems with concave utilities. Concave Utility Reinforcement Learning (CURL) is a generalisation of the standard RL objective, which employs a concave function of the state occupancy measure, rather than a linear function. CURL has garnered recent attention for its ability to represent instances of many important applications including the standard RL such as imitation learning, pure exploration, constrained MDPs, offline RL, human-regularized RL, and others. Inverse reinforcement learning is a powerful paradigm that focuses on recovering an unknown reward function that can rationalize the observed behaviour of an agent. There has been recent theoretical advances in inverse RL where the problem is formulated as identifying the set of feasible reward functions. However, inverse RL for CURL problems has not been considered previously. In this paper we show that most of the standard IRL results do not apply to CURL in general, since CURL invalidates the classical Bellman equations. This calls for a new theoretical framework for the inverse CURL problem. Using a recent equivalence result between CURL and Mean-field Games, we propose a new definition for the feasible rewards for I-CURL by proving that this problem is equivalent to an inverse game theory problem in a subclass of mean-field games. We present initial query and sample complexity results for the I-CURL problem under assumptions such as Lipschitz-continuity. Finally, we outline future directions and applications in human--AI collaboration enabled by our results.
Authors: Luca Podo, Marco Angelini, Paola Velardi
Abstract: NL2VIS (natural language to visualization) is a promising and recent research area that involves interpreting natural language queries and translating them into visualizations that accurately represent the underlying data. As we navigate the era of big data, NL2VIS holds considerable application potential since it greatly facilitates data exploration by non-expert users. Following the increasingly widespread usage of generative AI in NL2VIS applications, in this paper we present V-RECS, the first LLM-based Visual Recommender augmented with explanations(E), captioning(C), and suggestions(S) for further data exploration. V-RECS' visualization narratives facilitate both response verification and data exploration by non-expert users. Furthermore, our proposed solution mitigates computational, controllability, and cost issues associated with using powerful LLMs by leveraging a methodology to effectively fine-tune small models. To generate insightful visualization narratives, we use Chain-of-Thoughts (CoT), a prompt engineering technique to help LLM identify and generate the logical steps to produce a correct answer. Since CoT is reported to perform poorly with small LLMs, we adopted a strategy in which a large LLM (GPT-4), acting as a Teacher, generates CoT-based instructions to fine-tune a small model, Llama-2-7B, which plays the role of a Student. Extensive experiments-based on a framework for the quantitative evaluation of AI-based visualizations and on manual assessment by a group of participants-show that V-RECS achieves performance scores comparable to GPT-4, at a much lower cost. The efficacy of the V-RECS teacher-student paradigm is also demonstrated by the fact that the un-tuned Llama fails to perform the task in the vast majority of test cases. We release V-RECS for the visualization community to assist visualization designers throughout the entire visualization generation process.
Authors: William Stevens, Vishal Urs, Karthik Selvaraj, Gabriel Torres, Gaurish Lakhanpal
Abstract: With the increasing prevalence of autonomous vehicles, it is essential for computer vision algorithms to accurately assess road features in real-time. This study explores the LaneSegNet architecture, a new approach to lane topology prediction which integrates topological information with lane-line data to provide a more contextual understanding of road environments. The LaneSegNet architecture includes a feature extractor, lane encoder, lane decoder, and prediction head, leveraging components from ResNet-50, BEVFormer, and various attention mechanisms. We experimented with optimizations to the LaneSegNet architecture through feature extractor modification and transformer encoder-decoder stack modification. We found that modifying the encoder and decoder stacks offered an interesting tradeoff between training time and prediction accuracy, with certain combinations showing promising results. Our implementation, trained on a single NVIDIA Tesla A100 GPU, found that a 2:4 ratio reduced training time by 22.3% with only a 7.1% drop in mean average precision, while a 4:8 ratio increased training time by only 11.1% but improved mean average precision by a significant 23.7%. These results indicate that strategic hyperparameter tuning can yield substantial improvements depending on the resources of the user. This study provides valuable insights for optimizing LaneSegNet according to available computation power, making it more accessible for users with limited resources and increasing the capabilities for users with more powerful resources.
Authors: Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox
Abstract: Bimanual manipulation is challenging due to precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark comprising 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extended several state-of-the art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent -- PerAct2, which enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals. Project website with code is available at: http://bimanual.github.io
Authors: Weihong Zhong, Xiaocheng Feng, Liang Zhao, Qiming Li, Lei Huang, Yuxuan Gu, Weitao Ma, Yuan Xu, Bing Qin
Abstract: Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs' subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs' behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least $31\%$, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than $24\%$ of the snowballed multimodal hallucination while maintaining capabilities.
Authors: Ruibo Fu, Rui Liu, Chunyu Qiang, Yingming Gao, Yi Lu, Shuchen Shi, Tao Wang, Ya Li, Zhengqi Wen, Chen Zhang, Hui Bu, Yukun Liu, Xin Qi, Guanjun Li
Abstract: The Inspirational and Convincing Audio Generation Challenge 2024 (ICAGC 2024) is part of the ISCSLP 2024 Competitions and Challenges track. While current text-to-speech (TTS) technology can generate high-quality audio, its ability to convey complex emotions and controlled detail content remains limited. This constraint leads to a discrepancy between the generated audio and human subjective perception in practical applications like companion robots for children and marketing bots. The core issue lies in the inconsistency between high-quality audio generation and the ultimate human subjective experience. Therefore, this challenge aims to enhance the persuasiveness and acceptability of synthesized audio, focusing on human alignment convincing and inspirational audio generation. A total of 19 teams have registered for the challenge, and the results of the competition and the competition are described in this paper.
Authors: Manish Shetty, Yinfang Chen, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Dax Vandevoorde, Pedro Las-Casas, Shachee Mishra Gupta, Suman Nath, Chetan Bansal, Saravan Rajmohan
Abstract: The rapid growth in the use of Large Language Models (LLMs) and AI Agents as part of software development and deployment is revolutionizing the information technology landscape. While code generation receives significant attention, a higher-impact application lies in using AI agents for operational resilience of cloud services, which currently require significant human effort and domain knowledge. There is a growing interest in AI for IT Operations (AIOps) which aims to automate complex operational tasks, like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents. This vision paper lays the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. We also propose AIOpsLab, a prototype implementation leveraging agent-cloud-interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. We report promising results and lay the groundwork to build a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.
Authors: Chengzhi Lin, Shuchang Liu, Chuyuan Wang, Yongqi Liu
Abstract: Accurately predicting watch time is crucial for optimizing recommendations and user experience in short video platforms. However, existing methods that estimate a single average watch time often fail to capture the inherent uncertainty and diversity in user engagement patterns. In this paper, we propose the Conditional Quantile Estimation (CQE) framework to model the entire conditional distribution of watch time. Using quantile regression, CQE characterizes the complex watch-time distribution for each user-video pair, providing a flexible and comprehensive approach to understanding user behavior. We further design multiple strategies to combine the quantile estimates, adapting to different recommendation scenarios and user preferences. Extensive offline experiments and online A/B tests demonstrate the superiority of CQE in watch time prediction and user engagement modeling. In particular, the online deployment of CQE in KuaiShow has led to significant improvements in key evaluation metrics, including active days, active users, engagement duration, and video view counts. These results highlight the practical impact of our proposed approach in enhancing the user experience and overall performance of the short video recommendation system. The code will be released after publication.
Authors: Charles Jin, Martin Rinard
Abstract: As language models (LMs) deliver increasing performance on a range of NLP tasks, probing classifiers have become an indispensable technique in the effort to better understand their inner workings. A typical setup involves (1) defining an auxiliary task consisting of a dataset of text annotated with labels, then (2) supervising small classifiers to predict the labels from the representations of a pretrained LM as it processed the dataset. A high probing accuracy is interpreted as evidence that the LM has learned to perform the auxiliary task as an unsupervised byproduct of its original pretraining objective. Despite the widespread usage of probes, however, the robust design and analysis of probing experiments remains a challenge. We develop a formal perspective on probing using structural causal models (SCM). Specifically, given an SCM which explains the distribution of tokens observed during training, we frame the central hypothesis as whether the LM has learned to represent the latent variables of the SCM. Empirically, we extend a recent study of LMs in the context of a synthetic grid-world navigation task, where having an exact model of the underlying causal structure allows us to draw strong inferences from the result of probing experiments. Our techniques provide robust empirical evidence for the ability of LMs to induce the latent concepts underlying text.
Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Abstract: Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
Authors: Manish Marwah, Asad Narayanan, Stephan Jou, Martin Arlitt, Maria Pospelova
Abstract: The cost of errors related to machine learning classifiers, namely, false positives and false negatives, are not equal and are application dependent. For example, in cybersecurity applications, the cost of not detecting an attack is very different from marking a benign activity as an attack. Various design choices during machine learning model building, such as hyperparameter tuning and model selection, allow a data scientist to trade-off between these two errors. However, most of the commonly used metrics to evaluate model quality, such as $F_1$ score, which is defined in terms of model precision and recall, treat both these errors equally, making it difficult for users to optimize for the actual cost of these errors. In this paper, we propose a new cost-aware metric, $C_{score}$ based on precision and recall that can replace $F_1$ score for model evaluation and selection. It includes a cost ratio that takes into account the differing costs of handling false positives and false negatives. We derive and characterize the new cost metric, and compare it to $F_1$ score. Further, we use this metric for model thresholding for five cybersecurity related datasets for multiple cost ratios. The results show an average cost savings of 49%.
Authors: Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Abstract: Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.
Authors: Aki H\"arm\"a, Marcin Pietrasik, Anna Wilbik
Abstract: Large pretrained self-attention neural networks, or transformers, have been very successful in various tasks recently. The performance of a model on a given task depends on its ability to memorize and generalize the training data. Large transformer models, which may have billions of parameters, in theory have a huge capacity to memorize content. However, the current algorithms for the optimization fall short of the theoretical capacity, and the capacity is also highly dependent on the content. In this paper, we focus on the memory capacity of these models obtained using common training algorithms and synthetic training data. Based on the results, we derive an empirical capacity model (ECM) for a generic transformer. The ECM can be used to design task-specific transformer models with an optimal number of parameters in cases where the target memorization capability of the task can be defined.
Authors: Shi Lin, Rongchang Li, Xun Wang, Changting Lin, Wenpeng Xing, Meng Han
Abstract: The rapid development of Large Language Models (LLMs) has brought remarkable generative capabilities across diverse tasks. However, despite the impressive achievements, these models still have numerous security vulnerabilities, particularly when faced with jailbreak attacks. Therefore, by investigating jailbreak attacks, we can uncover hidden weaknesses in LLMs and guide us in developing more robust defense mechanisms to fortify their security. In this paper, we further explore the boundary of jailbreak attacks on LLMs and propose Analyzing-based Jailbreak (ABJ). This effective jailbreak attack method takes advantage of LLMs' growing analyzing and reasoning capability and reveals their underlying vulnerabilities when facing analysis-based tasks. We conduct a detailed evaluation of ABJ across various open-source and closed-source LLMs, which achieves 94.8% Attack Success Rate (ASR) and 1.06 Attack Efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency. Our research highlights the importance of prioritizing and enhancing the safety of LLMs to mitigate the risks of misuse.The code is publicly available at https://github.com/theshi-1128/ABJ-Attack.
Authors: Chanyoung Jung, Yun Jang
Abstract: Researchers have been persistently working to address the issue of missing values in time series data. Numerous models have been proposed, striving to estimate the distribution of the data. The Radial Basis Functions Neural Network (RBFNN) has recently exhibited exceptional performance in estimating data distribution. In this paper, we propose a time series imputation model based on RBFNN. Our imputation model learns local information from timestamps to create a continuous function. Additionally, we incorporate time gaps to facilitate learning information considering the missing terms of missing values. We name this model the Missing Imputation Multivariate RBFNN (MIM-RBFNN). However, MIM-RBFNN relies on a local information-based learning approach, which presents difficulties in utilizing temporal information. Therefore, we propose an extension called the Missing Value Imputation Recurrent Neural Network with Continuous Function (MIRNN-CF) using the continuous function generated by MIM-RBFNN. We evaluate the performance using two real-world datasets with non-random missing and random missing patterns, and conduct an ablation study comparing MIM-RBFNN and MIRNN-CF.
Authors: Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao
Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper poses a particular emphasis on the impact of multi-images on meme captioning. After that, we introduce the \textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, manifest a marked improvement in caption generation for both single-image and multi-image memes, as well as different meme categories. \textsc{XMeCap} achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71\% and 4.82\%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.
Authors: Laiyi Fu, Binbin Fan, Hongkai Du, Yanxiang Feng, Chunhua Li, Huping Song
Abstract: Ophthalmology consultations are crucial for diagnosing, treating, and preventing eye diseases. However, the growing demand for consultations exceeds the availability of ophthalmologists. By leveraging large pre-trained language models, we can design effective dialogues for specific scenarios, aiding in consultations. Traditional fine-tuning strategies for question-answering tasks are impractical due to increasing model size and often ignoring patient-doctor role function during consultations. In this paper, we propose EyeDoctor, an ophthalmic medical questioning large language model that enhances accuracy through doctor-patient role perception guided and an augmented knowledge base with external disease information. Experimental results show EyeDoctor achieves higher question-answering precision in ophthalmology consultations. Notably, EyeDoctor demonstrated a 7.25% improvement in Rouge-1 scores and a 10.16% improvement in F1 scores on multi-round datasets compared to second best model ChatGPT, highlighting the importance of doctor-patient role differentiation and dynamic knowledge base expansion for intelligent medical consultations. EyeDoc also serves as a free available web based service and souce code is available at https://github.com/sperfu/EyeDoc.
Authors: Tao Yu, Peng Gao, Fei Wang, Ru-Yue Yuan
Abstract: Due to the increasing complexity of chip design, existing placement methods still have many shortcomings in dealing with macro cells coverage and optimization efficiency. Aiming at the problems of layout overlap, inferior performance, and low optimization efficiency in existing chip design methods, this paper proposes an end-to-end placement method, SRLPlacer, based on reinforcement learning. First, the placement problem is transformed into a Markov decision process by establishing the coupling relationship graph model between macro cells to learn the strategy for optimizing layouts. Secondly, the whole placement process is optimized after integrating the standard cell layout. By assessing on the public benchmark ISPD2005, the proposed SRLPlacer can effectively solve the overlap problem between macro cells while considering routing congestion and shortening the total wire length to ensure routability.
Authors: Fangze Lin, Ying He, Fei Yu
Abstract: Personalized motion planning holds significant importance within urban automated driving, catering to the unique requirements of individual users. Nevertheless, prior endeavors have frequently encountered difficulties in simultaneously addressing two crucial aspects: personalized planning within intricate urban settings and enhancing planning performance through data utilization. The challenge arises from the expensive and limited nature of user data, coupled with the scene state space tending towards infinity. These factors contribute to overfitting and poor generalization problems during model training. Henceforth, we propose an instance-based transfer imitation learning approach. This method facilitates knowledge transfer from extensive expert domain data to the user domain, presenting a fundamental resolution to these issues. We initially train a pre-trained model using large-scale expert data. Subsequently, during the fine-tuning phase, we feed the batch data, which comprises expert and user data. Employing the inverse reinforcement learning technique, we extract the style feature distribution from user demonstrations, constructing the regularization term for the approximation of user style. In our experiments, we conducted extensive evaluations of the proposed method. Compared to the baseline methods, our approach mitigates the overfitting issue caused by sparse user data. Furthermore, we discovered that integrating the driving model with a differentiable nonlinear optimizer as a safety protection layer for end-to-end personalized fine-tuning results in superior planning performance.
Authors: Heedou Kim, Dain Kim, Jiwoo Lee, Chanwoong Yoon, Donghee Choi, Mogan Gim, Jaewoo Kang
Abstract: Crime situations are race against time. An AI-assisted criminal investigation system, providing prompt but precise legal counsel is in need for police officers. We introduce LAPIS (Language Model Augmented Police Investigation System), an automated system that assists police officers to perform rational and legal investigative actions. We constructed a finetuning dataset and retrieval knowledgebase specialized in crime investigation legal reasoning task. We extended the dataset's quality by incorporating manual curation efforts done by a group of domain experts. We then finetuned the pretrained weights of a smaller Korean language model to the newly constructed dataset and integrated it with the crime investigation knowledgebase retrieval approach. Experimental results show LAPIS' potential in providing reliable legal guidance for police officers, even better than the proprietary GPT-4 model. Qualitative analysis on the rationales generated by LAPIS demonstrate the model's reasoning ability to leverage the premises and derive legally correct conclusions.
Authors: Meghyn Bienvenu, Camille Bourgaux, Robin Jean
Abstract: In this paper, we explore a quantitative approach to querying inconsistent description logic knowledge bases. We consider weighted knowledge bases in which both axioms and assertions have (possibly infinite) weights, which are used to assign a cost to each interpretation based upon the axioms and assertions it violates. Two notions of certain and possible answer are defined by either considering interpretations whose cost does not exceed a given bound or restricting attention to optimal-cost interpretations. Our main contribution is a comprehensive analysis of the combined and data complexity of bounded cost satisfiability and certain and possible answer recognition, for description logics between ELbot and ALCO.
Authors: Yupeng Chen, Senmiao Wang, Zhihang Lin, Zeyu Qin, Yushun Zhang, Tian Ding, Ruoyu Sun
Abstract: Recently, large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks. Typically, an LLM is pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget the knowledge acquired in the pre-training stage, leading to a decline in general capabilities. To address this issue, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). The key idea of MoFO is to iteratively select and update the model parameters with the largest momentum magnitudes. Compared to full-parameter training, MoFO achieves similar fine-tuning performance while keeping parameters closer to the pre-trained model, thereby mitigating knowledge forgetting. Unlike most existing methods for forgetting mitigation, MoFO combines the following two advantages. First, MoFO does not require access to pre-training data. This makes MoFO particularly suitable for fine-tuning scenarios where pre-training data is unavailable, such as fine-tuning checkpoint-only open-source LLMs. Second, MoFO does not alter the original loss function. This could avoid impairing the model performance on the fine-tuning tasks. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its superiority over existing methods in mitigating forgetting and enhancing fine-tuning performance.