Authors: Yue Liu, Sin Kit Lo, Qinghua Lu, Liming Zhu, Dehai Zhao, Xiwei Xu, Stefan Harrer, Jon Whittle
Abstract: Foundation model-enabled generative artificial intelligence facilitates the development and implementation of agents, which can leverage distinguished reasoning and language processing capabilities to takes a proactive, autonomous role to pursue users' goals. Nevertheless, there is a lack of systematic knowledge to guide practitioners in designing the agents considering challenges of goal-seeking (including generating instrumental goals and plans), such as hallucinations inherent in foundation models, explainability of reasoning process, complex accountability, etc. To address this issue, we have performed a systematic literature review to understand the state-of-the-art foundation model-based agents and the broader ecosystem. In this paper, we present a pattern catalogue consisting of 16 architectural patterns with analyses of the context, forces, and trade-offs as the outcomes from the previous literature review. The proposed catalogue can provide holistic guidance for the effective use of patterns, and support the architecture design of foundation model-based agents by facilitating goal-seeking and plan generation.
Authors: Yu Xia, Sriram Narayanamoorthy, Zhengyuan Zhou, Joshua Mabry
Abstract: The development of open benchmarking platforms could greatly accelerate the adoption of AI agents in retail. This paper presents comprehensive simulations of customer shopping behaviors for the purpose of benchmarking reinforcement learning (RL) agents that optimize coupon targeting. The difficulty of this learning problem is largely driven by the sparsity of customer purchase events. We trained agents using offline batch data comprising summarized customer purchase histories to help mitigate this effect. Our experiments revealed that contextual bandit and deep RL methods that are less prone to over-fitting the sparse reward distributions significantly outperform static policies. This study offers a practical framework for simulating AI agents that optimize the entire retail customer journey. It aims to inspire the further development of simulation tools for retail AI systems.
Authors: Jaeik Jeong, Tai-Yeon Ku, Wan-Ki Park
Abstract: Energy storage devices, such as batteries, thermal energy storages, and hydrogen systems, can help mitigate climate change by ensuring a more stable and sustainable power supply. To maximize the effectiveness of such energy storage, determining the appropriate charging and discharging amounts for each time period is crucial. Reinforcement learning is preferred over traditional optimization for the control of energy storage due to its ability to adapt to dynamic and complex environments. However, the continuous nature of charging and discharging levels in energy storage poses limitations for discrete reinforcement learning, and time-varying feasible charge-discharge range based on state of charge (SoC) variability also limits the conventional continuous reinforcement learning. In this paper, we propose a continuous reinforcement learning approach that takes into account the time-varying feasible charge-discharge range. An additional objective function was introduced for learning the feasible action range for each time period, supplementing the objectives of training the actor for policy learning and the critic for value learning. This actively promotes the utilization of energy storage by preventing them from getting stuck in suboptimal states, such as continuous full charging or discharging. This is achieved through the enforcement of the charging and discharging levels into the feasible action range. The experimental results demonstrated that the proposed method further maximized the effectiveness of energy storage by actively enhancing its utilization.
Authors: Zhaohuan Zhan, Lisha Yu, Sijie Yu, Guang Tan
Abstract: In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction. While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability. Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities. However, existing LLM-based methods face limitations in memory construction and diversity of navigation strategies. To address these challenges, we propose a suite of techniques. Firstly, we introduce a method to maintain a topological map that stores navigation history, retaining information about viewpoints, objects, and their spatial relationships. This map also serves as a global action space. Additionally, we present a Navigation Chain of Thoughts module, leveraging human navigation examples to enrich navigation strategy diversity. Finally, we establish a pipeline that integrates navigational memory and strategies with perception and action prediction modules. Experimental results on the REVERIE and R2R datasets show that our method effectively enhances the navigation ability of the LLM and improves the interpretability of navigation reasoning.
Authors: Francesco Leofante, Hamed Ayoobi, Adam Dejl, Gabriel Freedman, Deniz Gorur, Junqi Jiang, Guilherme Paulino-Passos, Antonio Rago, Anna Rapberger, Fabrizio Russo, Xiang Yin, Dekai Zhang, Francesca Toni
Abstract: AI has become pervasive in recent years, but state-of-the-art approaches predominantly neglect the need for AI systems to be contestable. Instead, contestability is advocated by AI guidelines (e.g. by the OECD) and regulation of automated decision-making (e.g. GDPR). In this position paper we explore how contestability can be achieved computationally in and for AI. We argue that contestable AI requires dynamic (human-machine and/or machine-machine) explainability and decision-making processes, whereby machines can (i) interact with humans and/or other machines to progressively explain their outputs and/or their reasoning as well as assess grounds for contestation provided by these humans and/or other machines, and (ii) revise their decision-making processes to redress any issues successfully raised during contestation. Given that much of the current AI landscape is tailored to static AIs, the need to accommodate contestability will require a radical rethinking, that, we argue, computational argumentation is ideally suited to support.
Authors: Alyzia-Maria Konsta, Alberto Lluch Lafuente, Christoph Matheja
Abstract: Partially observable Markov Decision Processes (POMDPs) are a standard model for agents making decisions in uncertain environments. Most work on POMDPs focuses on synthesizing strategies based on the available capabilities. However, system designers can often control an agent's observation capabilities, e.g. by placing or selecting sensors. This raises the question of how one should select an agent's sensors cost-effectively such that it achieves the desired goals. In this paper, we study the novel optimal observability problem OOP: Given a POMDP M, how should one change M's observation capabilities within a fixed budget such that its (minimal) expected reward remains below a given threshold? We show that the problem is undecidable in general and decidable when considering positional strategies only. We present two algorithms for a decidable fragment of the OOP: one based on optimal strategies of M's underlying Markov decision process and one based on parameter synthesis with SMT. We report promising results for variants of typical examples from the POMDP literature.
Authors: Hongyi Yang, Fangyuan Chang, Dian Zhu, Muroi Fumie, Zhao Liu
Abstract: This review aims to systematically assess the current status and prospects of artificial intelligence (AI) in the rehabilitation management of patients with schizophrenia and their impact on the rehabilitation process. We selected 70 studies from 2012 to the present, focusing on application, technology categories, products, and data types of machine learning, deep learning, reinforcement learning, and other technologies in mental health interventions and management. The results indicate that AI can be widely used in symptom monitoring, relapse risk prediction, and rehabilitation treatment by analyzing ecological momentary assessment, behavioral, and speech data. This review further explores the potential challenges and future directions of emerging products, technologies, and analytical methods based on AI, such as social media analysis, serious games, and large language models in rehabilitation. In summary, this study systematically reviews the application status of AI in schizophrenia rehabilitation management and provides valuable insights and recommendations for future research paths.
Authors: Jixuan He, Yutong Guo, Jiacheng Zhao
Abstract: This paper delves into the paramount significance of information dissemination within the dynamic realm of social networks. It underscores the pivotal role of information communication models in unraveling the intricacies of data propagation in the digital age. By shedding light on the profound influence of these models, it not only lays the groundwork for exploring various hierarchies and their manifestations but also serves as a catalyst for further research in this formidable field.
Authors: Lingyun You, Nanning Guo, Zhengwu Long, Fusong Wang, Chundi Si, Aboelkasim Diab
Abstract: Asphalt pavements as the most prevalent transportation infrastructure, are prone to serious traffic safety problems due to functional or structural damage caused by stresses or strains imposed through repeated traffic loads and continuous climatic cycles. The good quality or high serviceability of infrastructure networks is vital to the urbanization and industrial development of nations. In order to maintain good functional pavement performance and extend the service life of asphalt pavements, the long-term performance of pavements under maintenance policies needs to be evaluated and favorable options selected based on the condition of the pavement. A major challenge in evaluating maintenance policies is to produce valid treatments for the outcome assessment under the control of uncertainty of vehicle loads and the disturbance of freeze-thaw cycles in the climatic environment. In this study, a novel causal inference approach combining a classical causal structural model and a potential outcome model framework is proposed to appraise the long-term effects of four preventive maintenance treatments for longitudinal cracking over a 5-year period of upkeep. Three fundamental issues were brought to our attention: 1) detection of causal relationships prior to variables under environmental loading (identification of causal structure); 2) obtaining direct causal effects of treatment on outcomes excluding covariates (identification of causal effects); and 3) sensitivity analysis of causal relationships. The results show that the method can accurately evaluate the effect of preventive maintenance treatments and assess the maintenance time to cater well for the functional performance of different preventive maintenance approaches. This framework could help policymakers to develop appropriate maintenance strategies for pavements.
Authors: Shikun Feng, Yuyan Ni, Minghao Li, Yanwen Huang, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan
Abstract: Recently, a noticeable trend has emerged in developing pre-trained foundation models in the domains of CV and NLP. However, for molecular pre-training, there lacks a universal model capable of effectively applying to various categories of molecular tasks, since existing prevalent pre-training methods exhibit effectiveness for specific types of downstream tasks. Furthermore, the lack of profound understanding of existing pre-training methods, including 2D graph masking, 2D-3D contrastive learning, and 3D denoising, hampers the advancement of molecular foundation models. In this work, we provide a unified comprehension of existing pre-training methods through the lens of contrastive learning. Thus their distinctions lie in clustering different views of molecules, which is shown beneficial to specific downstream tasks. To achieve a complete and general-purpose molecular representation, we propose a novel pre-training framework, named UniCorn, that inherits the merits of the three methods, depicting molecular views in three different levels. SOTA performance across quantum, physicochemical, and biological tasks, along with comprehensive ablation study, validate the universality and effectiveness of UniCorn.
Authors: Divyagna Bavikadi, Ayushi Agarwal, Shashank Ganta, Yunro Chung, Lusheng Song, Ji Qiu, Paulo Shakarian
Abstract: Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associated molecular measurements with diseases such as Alzheimer's, Liver, and Gastric Cancer. However, the use of thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable due to potentially formed spurious correlations. In this study, we evaluate 4 different methods for biomarker selection and 4 different machine learning (ML) classifiers for identifying correlations, evaluating 16 approaches in all. We found that contemporary methods outperform previously reported logistic regression in cases where 3 and 10 biomarkers are permitted. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). We also noted that causal-based methods for biomarker selection proved to be the most performant when fewer biomarkers were permitted, while univariate feature selection was the most performant when a greater number of biomarkers were permitted.
Authors: Jing Yang, Xiao Wang, Yutong Wang, Jiawei Wang, Fei-Yue Wang
Abstract: Temporal knowledge graphs (TKGs) can effectively model the ever-evolving nature of real-world knowledge, and their completeness and enhancement can be achieved by reasoning new events from existing ones. However, reasoning accuracy is adversely impacted due to an imbalance between new and recurring events in the datasets. To achieve more accurate TKG reasoning, we propose an attention masking-based contrastive event network (AMCEN) with local-global temporal patterns for the two-stage prediction of future events. In the network, historical and non-historical attention mask vectors are designed to control the attention bias towards historical and non-historical entities, acting as the key to alleviating the imbalance. A local-global message-passing module is proposed to comprehensively consider and capture multi-hop structural dependencies and local-global temporal evolution for the in-depth exploration of latent impact factors of different event types. A contrastive event classifier is used to classify events more accurately by incorporating local-global temporal patterns into contrastive learning. Therefore, AMCEN refines the prediction scope with the results of the contrastive event classification, followed by utilizing attention masking-based decoders to finalize the specific outcomes. The results of our experiments on four benchmark datasets highlight the superiority of AMCEN. Especially, the considerable improvements in Hits@1 prove that AMCEN can make more precise predictions about future occurrences.
Authors: Jing Liu, Yang Liu, Jieyu Lin, Jielin Li, Peng Sun, Bo Hu, Liang Song, Azzedine Boukerche, Victor C. M. Leung
Abstract: The increasing prevalence of surveillance cameras in smart cities, coupled with the surge of online video applications, has heightened concerns regarding public security and privacy protection, which propelled automated Video Anomaly Detection (VAD) into a fundamental research task within the Artificial Intelligence (AI) community. With the advancements in deep learning and edge computing, VAD has made significant progress and advances synergized with emerging applications in smart cities and video internet, which has moved beyond the conventional research scope of algorithm engineering to deployable Networking Systems for VAD (NSVAD), a practical hotspot for intersection exploration in the AI, IoVT, and computing fields. In this article, we delineate the foundational assumptions, learning frameworks, and applicable scenarios of various deep learning-driven VAD routes, offering an exhaustive tutorial for novices in NSVAD. This article elucidates core concepts by reviewing recent advances and typical solutions, and aggregating available research resources (e.g., literatures, code, tools, and workshops) accessible at https://github.com/fdjingliu/NSVAD. Additionally, we showcase our latest NSVAD research in industrial IoT and smart cities, along with an end-cloud collaborative architecture for deployable NSVAD to further elucidate its potential scope of research and application. Lastly, this article projects future development trends and discusses how the integration of AI and computing technologies can address existing research challenges and promote open opportunities, serving as an insightful guide for prospective researchers and engineers.
Authors: Lirong Wu, Yijun Tian, Haitao Lin, Yufei Huang, Siyuan Li, Nitesh V Chawla, Stan Z. Li
Abstract: Protein-protein bindings play a key role in a variety of fundamental biological processes, and thus predicting the effects of amino acid mutations on protein-protein binding is crucial. To tackle the scarcity of annotated mutation data, pre-training with massive unlabeled data has emerged as a promising solution. However, this process faces a series of challenges: (1) complex higher-order dependencies among multiple (more than paired) structural scales have not yet been fully captured; (2) it is rarely explored how mutations alter the local conformation of the surrounding microenvironment; (3) pre-training is costly, both in data size and computational burden. In this paper, we first construct a hierarchical prompt codebook to record common microenvironmental patterns at different structural scales independently. Then, we develop a novel codebook pre-training task, namely masked microenvironment modeling, to model the joint distribution of each mutation with their residue types, angular statistics, and local conformational changes in the microenvironment. With the constructed prompt codebook, we encode the microenvironment around each mutation into multiple hierarchical prompts and combine them to flexibly provide information to wild-type and mutated protein complexes about their microenvironmental differences. Such a hierarchical prompt learning framework has demonstrated superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction and a case study of optimizing human antibodies against SARS-CoV-2.
Authors: Muqsit Azeem, Marta Grobelna, Sudeep Kanav, Jan Kretinsky, Stefanie Mohr, Sabine Rieder
Abstract: The behavior of neural networks (NNs) on previously unseen types of data (out-of-distribution or OOD) is typically unpredictable. This can be dangerous if the network's output is used for decision-making in a safety-critical system. Hence, detecting that an input is OOD is crucial for the safe application of the NN. Verification approaches do not scale to practical NNs, making runtime monitoring more appealing for practical use. While various monitors have been suggested recently, their optimization for a given problem, as well as comparison with each other and reproduction of results, remain challenging. We present a tool for users and developers of NN monitors. It allows for (i) application of various types of monitors from the literature to a given input NN, (ii) optimization of the monitor's hyperparameters, and (iii) experimental evaluation and comparison to other approaches. Besides, it facilitates the development of new monitoring approaches. We demonstrate the tool's usability on several use cases of different types of users as well as on a case study comparing different approaches from recent literature.
Authors: Sarod Yatawatta
Abstract: Observing celestial objects and advancing our scientific knowledge about them involves tedious planning, scheduling, data collection and data post-processing. Many of these operational aspects of astronomy are guided and executed by expert astronomers. Reinforcement learning is a mechanism where we (as humans and astronomers) can teach agents of artificial intelligence to perform some of these tedious tasks. In this paper, we will present a state of the art overview of reinforcement learning and how it can benefit astronomy.
Authors: Yichuan Shi, Olivera Kotevska, Viktor Reshniak, Abhishek Singh, Ramesh Raskar
Abstract: Federated Learning (FL) has emerged as a leading paradigm for decentralized, privacy preserving machine learning training. However, recent research on gradient inversion attacks (GIAs) have shown that gradient updates in FL can leak information on private training samples. While existing surveys on GIAs have focused on the honest-but-curious server threat model, there is a dearth of research categorizing attacks under the realistic and far more privacy-infringing cases of malicious servers and clients. In this paper, we present a survey and novel taxonomy of GIAs that emphasize FL threat models, particularly that of malicious servers and clients. We first formally define GIAs and contrast conventional attacks with the malicious attacker. We then summarize existing honest-but-curious attack strategies, corresponding defenses, and evaluation metrics. Critically, we dive into attacks with malicious servers and clients to highlight how they break existing FL defenses, focusing specifically on reconstruction methods, target model architectures, target data, and evaluation metrics. Lastly, we discuss open problems and future research directions.
Authors: Sayan Bandyapadhyay, Eden Chlamt\'a\v{c}, Yury Makarychev, Ali Vakilian
Abstract: In this work, we study pairwise fair clustering with $\ell \ge 2$ groups, where for every cluster $C$ and every group $i \in [\ell]$, the number of points in $C$ from group $i$ must be at most $t$ times the number of points in $C$ from any other group $j \in [\ell]$, for a given integer $t$. To the best of our knowledge, only bi-criteria approximation and exponential-time algorithms follow for this problem from the prior work on fair clustering problems when $\ell > 2$. In our work, focusing on the $\ell > 2$ case, we design the first polynomial-time $(t^{\ell}\cdot \ell\cdot k)^{O(\ell)}$-approximation for this problem with $k$-median cost that does not violate the fairness constraints. We complement our algorithmic result by providing hardness of approximation results, which show that our problem even when $\ell=2$ is almost as hard as the popular uniform capacitated $k$-median, for which no polynomial-time algorithm with an approximation factor of $o(\log k)$ is known.
Authors: Mina Ghashami, Soumya Smruti Mishra
Abstract: The SemEval 2024 BRAINTEASER task represents a pioneering venture in Natural Language Processing (NLP) by focusing on lateral thinking, a dimension of cognitive reasoning that is often overlooked in traditional linguistic analyses. This challenge comprises of Sentence Puzzle and Word Puzzle subtasks and aims to test language models' capacity for divergent thinking. In this paper, we present our approach to the BRAINTEASER task. We employ a holistic strategy by leveraging cutting-edge pre-trained models in multiple choice architecture, and diversify the training data with Sentence and Word Puzzle datasets. To gain further improvement, we fine-tuned the model with synthetic humor/jokes dataset and the RiddleSense dataset which helped augmenting the model's lateral thinking abilities. Empirical results show that our approach achieve 92.5\% accuracy in Sentence Puzzle subtask and 80.2\% accuracy in Word Puzzle subtask.
Authors: Anish Bhattacharya, Nishanth Rao, Dhruv Parikh, Pratik Kunapuli, Nikolai Matni, Vijay Kumar
Abstract: We demonstrate the capabilities of an attention-based end-to-end approach for high-speed quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional vision-based navigation via independent mapping, planning, and control modules breaks down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end planning and control networks have shown to be effective for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer models for depth-based end-to-end control, in a photorealistic, high-physics-fidelity simulator as well as in hardware, and observe that the attention-based models are more effective as quadrotor speeds increase, while recurrent models with many layers provide smoother commands at lower speeds. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.
Authors: Pietro Farina, Subrata Biswas, Eren Y{\i}ld{\i}z, Khakim Akhunov, Saad Ahmed, Bashima Islam, Kas{\i}m Sinan Y{\i}ld{\i}r{\i}m
Abstract: Batteryless systems frequently face power failures, requiring extra runtime buffers to maintain inference progress and leaving only a memory space for storing ultra-tiny deep neural networks (DNNs). Besides, making these models responsive to stochastic energy harvesting dynamics during inference requires a balance between inference accuracy, latency, and energy overhead. Recent works on compression mostly focus on time and memory, but often ignore energy dynamics or significantly reduce the accuracy of pre-trained DNNs. Existing energy-adaptive inference works modify the architecture of pre-trained models and have significant memory overhead. Thus, energy-adaptive and accurate inference of pre-trained DNNs on batteryless devices with extreme memory constraints is more challenging than traditional microcontrollers. We combat these issues by proposing FreeML, a framework to optimize pre-trained DNN models for memory-efficient and energy-adaptive inference on batteryless systems. FreeML comprises (1) a novel compression technique to reduce the model footprint and runtime memory requirements simultaneously, making them executable on extremely memory-constrained batteryless platforms; and (2) the first early exit mechanism that uses a single exit branch for all exit points to terminate inference at any time, making models energy-adaptive with minimal memory overhead. Our experiments showed that FreeML reduces the model sizes by up to $95 \times$, supports adaptive inference with a $2.03-19.65 \times$ less memory overhead, and provides significant time and energy benefits with only a negligible accuracy drop compared to the state-of-the-art.
Authors: Alejo Lopez-Avila, Jinhua Du, Abbas Shimary, Ze Li
Abstract: The expansion of streaming media and e-commerce has led to a boom in recommendation systems, including Sequential recommendation systems, which consider the user's previous interactions with items. In recent years, research has focused on architectural improvements such as transformer blocks and feature extraction that can augment model information. Among these features are context and attributes. Of particular importance is the temporal footprint, which is often considered part of the context and seen in previous publications as interchangeable with positional information. Other publications use positional encodings with little attention to them. In this paper, we analyse positional encodings, showing that they provide relative information between items that are not inferable from the temporal footprint. Furthermore, we evaluate different encodings and how they affect metrics and stability using Amazon datasets. We added some new encodings to help with these problems along the way. We found that we can reach new state-of-the-art results by finding the correct positional encoding, but more importantly, certain encodings stabilise the training.
Authors: Anjana Wijekoon, David Corsar, Nirmalie Wiratunga, Kyle Martin, Pedram Salimi
Abstract: The evolution of Explainable Artificial Intelligence (XAI) has emphasised the significance of meeting diverse user needs. The approaches to identifying and addressing these needs must also advance, recognising that explanation experiences are subjective, user-centred processes that interact with users towards a better understanding of AI decision-making. This paper delves into the interrelations in multi-faceted XAI and examines how different types of explanations collaboratively meet users' XAI needs. We introduce the Intent Fulfilment Framework (IFF) for creating explanation experiences. The novelty of this paper lies in recognising the importance of "follow-up" on explanations for obtaining clarity, verification and/or substitution. Moreover, the Explanation Experience Dialogue Model integrates the IFF and "Explanation Followups" to provide users with a conversational interface for exploring their explanation needs, thereby creating explanation experiences. Quantitative and qualitative findings from our comparative user study demonstrate the impact of the IFF in improving user engagement, the utility of the AI system and the overall user experience. Overall, we reinforce the principle that "one explanation does not fit all" to create explanation experiences that guide the complex interaction through conversation.
Authors: Chinedu Ekuma
Abstract: The advent of natural language processing and large language models (LLMs) has revolutionized the extraction of data from unstructured scholarly papers. However, ensuring data trustworthiness remains a significant challenge. In this paper, we introduce PropertyExtractor, an open-source tool that leverages advanced conversational LLMs like Google Gemini-Pro and OpenAI GPT-4, blends zero-shot with few-shot in-context learning, and employs engineered prompts for the dynamic refinement of structured information hierarchies, enabling autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data. Our tests on material data demonstrate precision and recall exceeding 93% with an error rate of approximately 10%, highlighting the effectiveness and versatility of the toolkit. We apply PropertyExtractor to generate a database of 2D material thicknesses, a critical parameter for device integration. The rapid evolution of the field has outpaced both experimental measurements and computational methods, creating a significant data gap. Our work addresses this gap and showcases the potential of PropertyExtractor as a reliable and efficient tool for the autonomous generation of diverse material property databases, advancing the field.
Authors: David Ardia, Keven Bluteau
Abstract: We propose an approach to construct text-based time-series indices in an optimal way--typically, indices that maximize the contemporaneous relation or the predictive performance with respect to a target variable, such as inflation. We illustrate our methodology with a corpus of news articles from the Wall Street Journal by optimizing text-based indices focusing on tracking the VIX index and inflation expectations. Our results highlight the superior performance of our approach compared to existing indices.
Authors: Mohammad Amin Samadi, Spencer JaQuay, Jing Gu, Nia Nixon
Abstract: AI Collaborator, powered by OpenAI's GPT-4, is a groundbreaking tool designed for human-AI collaboration research. Its standout feature is the ability for researchers to create customized AI personas for diverse experimental setups using a user-friendly interface. This functionality is essential for simulating various interpersonal dynamics in team settings. AI Collaborator excels in mimicking different team behaviors, enabled by its advanced memory system and a sophisticated personality framework. Researchers can tailor AI personas along a spectrum from dominant to cooperative, enhancing the study of their impact on team processes. The tool's modular design facilitates integration with digital platforms like Slack, making it versatile for various research scenarios. AI Collaborator is thus a crucial resource for exploring human-AI team dynamics more profoundly.
Authors: Sanjaya Khanal, Shiva Raj Pokhrel
Abstract: This research analyzes, models and develops a novel Digital Learning Environment (DLE) fortified by the innovative Private Learning Intelligence (PLI) framework. The proposed PLI framework leverages federated machine learning (FL) techniques to autonomously construct and continuously refine personalized learning models for individual learners, ensuring robust privacy protection. Our approach is pivotal in advancing DLE capabilities, empowering learners to actively participate in personalized real-time learning experiences. The integration of PLI within a DLE also streamlines instructional design and development demands for personalized teaching/learning. We seek ways to establish a foundation for the seamless integration of FL into learning systems, offering a transformative approach to personalized learning in digital environments. Our implementation details and code are made public.
Authors: Yuqing Lan, Zhenghao Liu, Yu Gu, Xiaoyuan Yi, Xiaohua Li, Liner Yang, Ge Yu
Abstract: Fact verification tasks aim to identify the integrity of textual contents according to the truthful corpus. Existing fact verification models usually build a fully connected reasoning graph, which regards claim-evidence pairs as nodes and connects them with edges. They employ the graph to propagate the semantics of the nodes. Nevertheless, the noisy nodes usually propagate their semantics via the edges of the reasoning graph, which misleads the semantic representations of other nodes and amplifies the noise signals. To mitigate the propagation of noisy semantic information, we introduce a Confidential Graph Attention Network (CO-GAT), which proposes a node masking mechanism for modeling the nodes. Specifically, CO-GAT calculates the node confidence score by estimating the relevance between the claim and evidence pieces. Then, the node masking mechanism uses the node confidence scores to control the noise information flow from the vanilla node to the other graph nodes. CO-GAT achieves a 73.59% FEVER score on the FEVER dataset and shows the generalization ability by broadening the effectiveness to the science-specific domain.
Authors: Changshuai Wei, Benjamin Zelditch, Joyce Chen, Andre Assuncao Silva T Ribeiro, Jingyi Kenneth Tay, Borja Ocejo Elizondo, Keerthi Selvaraj, Aman Gupta, Licurgo Benemann De Almeida
Abstract: Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing.
Authors: Bo Wu, Peiye Liu, Wen-Huang Cheng, Bei Liu, Zhaoyang Zeng, Jia Wang, Qiushi Huang, Jiebo Luo
Abstract: Social Media Popularity Prediction (SMPP) is a crucial task that involves automatically predicting future popularity values of online posts, leveraging vast amounts of multimodal data available on social media platforms. Studying and investigating social media popularity becomes central to various online applications and requires novel methods of comprehensive analysis, multimodal comprehension, and accurate prediction. SMP Challenge is an annual research activity that has spurred academic exploration in this area. This paper summarizes the challenging task, data, and research progress. As a critical resource for evaluating and benchmarking predictive models, we have released a large-scale SMPD benchmark encompassing approximately half a million posts authored by around 70K users. The research progress analysis provides an overall analysis of the solutions and trends in recent years. The SMP Challenge website (www.smp-challenge.com) provides the latest information and news.
Authors: Tingyu Xia, Bowen Yu, Yuan Wu, Yi Chang, Chang Zhou
Abstract: In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their less skilled counterparts. Expanding on this foundational insight, we propose a new self-evaluation method ProbDiff for assessing the efficacy of various LLMs. This approach obviates the necessity for an additional evaluation model or the dependence on external, proprietary models like GPT-4 for judgment. It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. A higher discrepancy for a given query between two LLMs indicates a relatively weaker capability. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4, spanning a range of scenarios that include natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, and benchmarks for LLM evaluation like AlignBench, MT-Bench, and AlpacaEval, across LLMs of varying magnitudes.
Authors: Jiachen Sun, Changsheng Wang, Jiongxiao Wang, Yiwei Zhang, Chaowei Xiao
Abstract: Large language models have become increasingly prominent, also signaling a shift towards multimodality as the next frontier in artificial intelligence, where their embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and interaction. However, this integration also enlarges the attack surface. Patch-based adversarial attack is considered the most realistic threat model in physical vision applications, as demonstrated in many existing literature. In this paper, we propose to address patched visual prompt injection, where adversaries exploit adversarial patches to generate target content in VLMs. Our investigation reveals that patched adversarial prompts exhibit sensitivity to pixel-wise randomization, a trait that remains robust even against adaptive attacks designed to counteract such defenses. Leveraging this insight, we introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, specifically tailored to protect VLMs from the threat of patched visual prompt injectors. Our framework significantly lowers the attack success rate to a range between 0% and 5.0% on two leading VLMs, while achieving around 67.3% to 95.0% context recovery of the benign images, demonstrating a balance between security and usability.
Authors: Jie Zhu, Junhui Li, Yalong Wen, Lifan Guo
Abstract: In light of recent breakthroughs in large language models (LLMs) that have revolutionized natural language processing (NLP), there is an urgent need for new benchmarks to keep pace with the fast development of LLMs. In this paper, we propose CFLUE, the Chinese Financial Language Understanding Evaluation benchmark, designed to assess the capability of LLMs across various dimensions. Specifically, CFLUE provides datasets tailored for both knowledge assessment and application assessment. In knowledge assessment, it consists of 38K+ multiple-choice questions with associated solution explanations. These questions serve dual purposes: answer prediction and question reasoning. In application assessment, CFLUE features 16K+ test instances across distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation. Upon CFLUE, we conduct a thorough evaluation of representative LLMs. The results reveal that only GPT-4 and GPT-4-turbo achieve an accuracy exceeding 60\% in answer prediction for knowledge assessment, suggesting that there is still substantial room for improvement in current LLMs. In application assessment, although GPT-4 and GPT-4-turbo are the top two performers, their considerable advantage over lightweight LLMs is noticeably diminished. The datasets and scripts associated with CFLUE are openly accessible at https://github.com/aliyun/cflue.
Authors: Yirong Zhou, Chengyan Wang, Mengtian Lu, Kunyuan Guo, Zi Wang, Dan Ruan, Rui Guo, Peijun Zhao, Jianhua Wang, Naiming Wu, Jianzhong Lin, Yinyin Chen, Hang Jin, Lianxin Xie, Lilan Wu, Liuhong Zhu, Jianjun Zhou, Congbo Cai, He Wang, Xiaobo Qu
Abstract: In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features a T2-refine fusion decoder for quantitative analysis, leveraging global features from the Transformer, and a segmentation decoder with multiple local region supervision for enhanced accuracy. A tight coupling module aligns and fuses CNN and Transformer branch features, enabling SQNet to focus on myocardium regions. Evaluation on healthy controls (HC) and acute myocardial infarction patients (AMI) demonstrates superior segmentation dice scores (89.3/89.2) compared to state-of-the-art methods (87.7/87.9). T2 quantification yields strong linear correlations (Pearson coefficients: 0.84/0.93) with label values for HC/AMI, indicating accurate mapping. Radiologist evaluations confirm SQNet's superior image quality scores (4.60/4.58 for segmentation, 4.32/4.42 for T2 quantification) over state-of-the-art methods (4.50/4.44 for segmentation, 3.59/4.37 for T2 quantification). SQNet thus offers accurate simultaneous segmentation and quantification, enhancing cardiac disease diagnosis, such as AMI.
Authors: Markus Lange-Hegermann, Christoph Zimmer
Abstract: Experimental exploration of high-cost systems with safety constraints, common in engineering applications, is a challenging endeavor. Data-driven models offer a promising solution, but acquiring the requisite data remains expensive and is potentially unsafe. Safe active learning techniques prove essential, enabling the learning of high-quality models with minimal expensive data points and high safety. This paper introduces a safe active learning framework tailored for time-varying systems, addressing drift, seasonal changes, and complexities due to dynamic behavior. The proposed Time-aware Integrated Mean Squared Prediction Error (T-IMSPE) method minimizes posterior variance over current and future states, optimizing information gathering also in the time domain. Empirical results highlight T-IMSPE's advantages in model quality through toy and real-world examples. State of the art Gaussian processes are compatible with T-IMSPE. Our theoretical contributions include a clear delineation which Gaussian process kernels, domains, and weighting measures are suitable for T-IMSPE and even beyond for its non-time aware predecessor IMSPE.
Authors: I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Ming-Hsuan Yang, Sy-Yen Kuo
Abstract: Crowd counting and localization have become increasingly important in computer vision due to their wide-ranging applications. While point-based strategies have been widely used in crowd counting methods, they face a significant challenge, i.e., the lack of an effective learning strategy to guide the matching process. This deficiency leads to instability in matching point proposals to target points, adversely affecting overall performance. To address this issue, we introduce an effective approach to stabilize the proposal-target matching in point-based methods. We propose Auxiliary Point Guidance (APG) to provide clear and effective guidance for proposal selection and optimization, addressing the core issue of matching uncertainty. Additionally, we develop Implicit Feature Interpolation (IFI) to enable adaptive feature extraction in diverse crowd scenarios, further enhancing the model's robustness and accuracy. Extensive experiments demonstrate the effectiveness of our approach, showing significant improvements in crowd counting and localization performance, particularly under challenging conditions. The source codes and trained models will be made publicly available.
Authors: Jiawei Li, Jingshu Peng, Haoyang Li, Lei Chen
Abstract: Time-series analysis plays a pivotal role across a range of critical applications, from finance to healthcare, which involves various tasks, such as forecasting and classification. To handle the inherent complexities of time-series data, such as high dimensionality and noise, traditional supervised learning methods first annotate extensive labels for time-series data in each task, which is very costly and impractical in real-world applications. In contrast, pre-trained foundation models offer a promising alternative by leveraging unlabeled data to capture general time series patterns, which can then be fine-tuned for specific tasks. However, existing approaches to pre-training such models typically suffer from high-bias and low-generality issues due to the use of predefined and rigid augmentation operations and domain-specific data training. To overcome these limitations, this paper introduces UniCL, a universal and scalable contrastive learning framework designed for pretraining time-series foundation models across cross-domain datasets. Specifically, we propose a unified and trainable time-series augmentation operation to generate pattern-preserved, diverse, and low-bias time-series data by leveraging spectral information. Besides, we introduce a scalable augmentation algorithm capable of handling datasets with varying lengths, facilitating cross-domain pretraining. Extensive experiments on two benchmark datasets across eleven domains validate the effectiveness of UniCL, demonstrating its high generalization on time-series analysis across various fields.
Authors: Irene Ferfoglia, Gaia Saveri, Laura Nenzi, Luca Bortolussi
Abstract: Deep learning methods for time series have already reached excellent performances in both prediction and classification tasks, including anomaly detection. However, the complexity inherent in Cyber Physical Systems (CPS) creates a challenge when it comes to explainability methods. To overcome this inherent lack of interpretability, we propose ECATS, a concept-based neuro-symbolic architecture where concepts are represented as Signal Temporal Logic (STL) formulae. Leveraging kernel-based methods for STL, concept embeddings are learnt in an unsupervised manner through a cross-attention mechanism. The network makes class predictions through these concept embeddings, allowing for a meaningful explanation to be naturally extracted for each input. Our preliminary experiments with a simple CPS-based dataset show that our model is able to achieve great classification performance while ensuring local interpretability.
Authors: Remi Desmartin, Omri Isac, Ekaterina Komendantskaya, Kathrin Stark, Grant Passmore, Guy Katz
Abstract: Recent advances in the verification of deep neural networks (DNNs) have opened the way for broader usage of DNN verification technology in many application areas, including safety-critical ones. DNN verifiers are themselves complex programs that have been shown to be susceptible to errors and imprecisions; this in turn has raised the question of trust in DNN verifiers. One prominent attempt to address this issue is enhancing DNN verifiers with the capability of producing proofs of their results that are subject to independent algorithmic certification (proof checking). Formulations of proof production and proof checking already exist on top of the state-of-the-art Marabou DNN verifier. The native implementation of the proof checking algorithm for Marabou was done in C++ and itself raised the question of trust in the code (e.g., in the precision of floating point calculations or guarantees for implementation soundness). Here, we present an alternative implementation of the Marabou proof checking algorithm in Imandra -- an industrial functional programming language and prover -- that allows us to obtain an implementation with formal guarantees, including proofs of mathematical results underlying the algorithm, such as the use of the Farkas lemma.
Authors: Jinchuan Zhang, Bei Hui, Chong Mu, Ming Sun, Ling Tian
Abstract: Temporal Knowledge Graph (TKG) reasoning focuses on predicting events through historical information within snapshots distributed on a timeline. Existing studies mainly concentrate on two perspectives of leveraging the history of TKGs, including capturing evolution of each recent snapshot or correlations among global historical facts. Despite the achieved significant accomplishments, these models still fall short of (1) investigating the influences of multi-granularity interactions across recent snapshots and (2) harnessing the expressive semantics of significant links accorded with queries throughout the entire history, especially events exerting a profound impact on the future. These inadequacies restrict representation ability to reflect historical dependencies and future trends thoroughly. To overcome these drawbacks, we propose an innovative TKG reasoning approach towards \textbf{His}torically \textbf{R}elevant \textbf{E}vents \textbf{S}tructuring ($\mathsf{HisRES}$). Concretely, $\mathsf{HisRES}$ comprises two distinctive modules excelling in structuring historically relevant events within TKGs, including a multi-granularity evolutionary encoder that captures structural and temporal dependencies of the most recent snapshots, and a global relevance encoder that concentrates on crucial correlations among events relevant to queries from the entire history. Furthermore, $\mathsf{HisRES}$ incorporates a self-gating mechanism for adaptively merging multi-granularity recent and historically relevant structuring representations. Extensive experiments on four event-based benchmarks demonstrate the state-of-the-art performance of $\mathsf{HisRES}$ and indicate the superiority and effectiveness of structuring historical relevance for TKG reasoning.
Authors: Washim Uddin Mondal, Vaneet Aggarwal
Abstract: We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-based acceleration, we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that guarantees an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\mathcal{O}(\epsilon^{-3})$ sample complexity. This improves the state-of-the-art sample complexity in CMDP by a factor of $\mathcal{O}(\epsilon^{-1})$.
Authors: Jiayun Pang, Ivan Vuli\'c
Abstract: Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encode-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that although being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become `chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; The most efficient greedy decoding strategy is very competitive while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
Authors: Anastasia Voznyuk, Vasily Konovalov
Abstract: The Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection shared task in the SemEval-2024 competition aims to tackle the problem of misusing collaborative human-AI writing. Although there are a lot of existing detectors of AI content, they are often designed to give a binary answer and thus may not be suitable for more nuanced problem of finding the boundaries between human-written and machine-generated texts, while hybrid human-AI writing becomes more and more popular. In this paper, we address the boundary detection problem. Particularly, we present a pipeline for augmenting data for supervised fine-tuning of DeBERTaV3. We receive new best MAE score, according to the leaderboard of the competition, with this pipeline.
Authors: Xiaoming Shi, Zeming Liu, Li Du, Yuxuan Wang, Hongru Wang, Yuhang Guo, Tong Ruan, Jie Xu, Shaoting Zhang
Abstract: This paper surveys and organizes research works on medical dialog systems, which is an important yet challenging task. Although these systems have been surveyed in the medical community from an application perspective, a systematic review from a rigorous technical perspective has to date remained noticeably absent. As a result, an overview of the categories, methods, and evaluation of medical dialogue systems remain limited and underspecified, hindering the further improvement of this area. To fill this gap, we investigate an initial pool of 325 papers from well-known computer science, and natural language processing conferences and journals, and make an overview. Recently, large language models have shown strong model capacity on downstream tasks, which also reshaped medical dialog systems' foundation. Despite the alluring practical application value, current medical dialogue systems still suffer from problems. To this end, this paper lists the grand challenges of medical dialog systems, especially of large language models.
Authors: Lujain Ibrahim, Saffron Huang, Lama Ahmad, Markus Anderljung
Abstract: Model evaluations are central to understanding the safety, risks, and societal impacts of AI systems. While most real-world AI applications involve human-AI interaction, most current evaluations (e.g., common benchmarks) of AI models do not. Instead, they incorporate human factors in limited ways, assessing the safety of models in isolation, thereby falling short of capturing the complexity of human-model interactions. In this paper, we discuss and operationalize a definition of an emerging category of evaluations -- "human interaction evaluations" (HIEs) -- which focus on the assessment of human-model interactions or the process and the outcomes of humans using models. First, we argue that HIEs can be used to increase the validity of safety evaluations, assess direct human impact and interaction-specific harms, and guide future assessments of models' societal impact. Second, we propose a safety-focused HIE design framework -- containing a human-LLM interaction taxonomy -- with three stages: (1) identifying the risk or harm area, (2) characterizing the use context, and (3) choosing the evaluation parameters. Third, we apply our framework to two potential evaluations for overreliance and persuasion risks. Finally, we conclude with tangible recommendations for addressing concerns over costs, replicability, and unrepresentativeness of HIEs.
Authors: Harris Bin Munawar, Nikolaos Misirlis
Abstract: In the era of exponential technology growth, one unexpected guest has claimed a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as ChatGPT, promises a revolution in education, yet it arrives with a double-edged sword. Its potential for personalized learning is offset by issues of cheating, inaccuracies, and educators struggling to incorporate it effectively into their lesson design. We are standing on the brink of this educational frontier, and it is clear that we need to navigate this terrain with a lot of care. This is a major challenge that could undermine the integrity and value of our educational process. So, how can we turn these challenges into opportunities? When used inappropriately, AI tools can become the perfect tool for the cut copy paste mentality, and quickly begin to corrode critical thinking, creativity, and deep understanding, the most important skills in our rapidly changing world. Teachers feel that they are not equipped to leverage this technology, widening the digital divide among educators and institutions. Addressing these concerns calls for an in depth research approach. We will employ empirical research, drawing on the Technology Acceptance Model, to assess the attitudes toward generative AI among educators and students. Understanding their perceptions, usage patterns, and hurdles is the first crucial step in creating an effective solution. The present study will be used as a process manual for future researchers to apply, running their own data, based on the steps explained here
Authors: Haoyue Song, Jiacheng Wang, Liansheng Wang
Abstract: Federated Learning (FL) has gained attention for addressing data scarcity and privacy concerns. While parallel FL algorithms like FedAvg exhibit remarkable performance, they face challenges in scenarios with diverse network speeds and concerns about centralized control, especially in multi-institutional collaborations like the medical domain. Serial FL presents an alternative solution, circumventing these challenges by transferring model updates serially between devices in a cyclical manner. Nevertheless, it is deemed inferior to parallel FL in that (1) its performance shows undesirable fluctuations, and (2) it converges to a lower plateau, particularly when dealing with non-IID data. The observed phenomenon is attributed to catastrophic forgetting due to knowledge loss from previous sites. In this paper, to overcome fluctuation and low efficiency in the iterative learning and forgetting process, we introduce cyclical weight consolidation (CWC), a straightforward yet potent approach specifically tailored for serial FL. CWC employs a consolidation matrix to regulate local optimization. This matrix tracks the significance of each parameter on the overall federation throughout the entire training trajectory, preventing abrupt changes in significant weights. During revisitation, to maintain adaptability, old memory undergoes decay to incorporate new information. Our comprehensive evaluations demonstrate that in various non-IID settings, CWC mitigates the fluctuation behavior of the original serial FL approach and enhances the converged performance consistently and significantly. The improved performance is either comparable to or better than the parallel vanilla.
Authors: Tinh Son Luong, Thanh-Thien Le, Linh Ngo Van, Thien Huu Nguyen
Abstract: Large language models (LLMs) have become integral to our professional workflows and daily lives. Nevertheless, these machine companions of ours have a critical flaw: the huge amount of data which endows them with vast and diverse knowledge, also exposes them to the inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET in providing a rigorous benchmark for evaluation of toxicity awareness in several popular LLMs: it highlights the toxicity in the LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.
Authors: Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan
Abstract: With impressive achievements made, artificial intelligence is on the path forward to artificial general intelligence. Sora, developed by OpenAI, which is capable of minute-level world-simulative abilities can be considered as a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we embark from the perspective of disassembling Sora in text-to-video generation, and conducting a comprehensive review of literature, trying to answer the question, \textit{From Sora What We Can See}. Specifically, after basic preliminaries regarding the general algorithms are introduced, the literature is categorized from three mutually perpendicular dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Last but more importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development.
Authors: Theodoros Zafeiriou, Dimitris Kalles
Abstract: Our study focuses on comparing the performance and resource requirements between different Long Short-Term Memory (LSTM) neural network architectures and an ANN specialized architecture for forex market prediction. We analyze the execution time of the models as well as the resources consumed, such as memory and computational power. Our aim is to demonstrate that the specialized architecture not only achieves better results in forex market prediction but also executes using fewer resources and in a shorter time frame compared to LSTM architectures. This comparative analysis will provide significant insights into the suitability of these two types of architectures for time series prediction in the forex market environment.
Authors: Michael Shliselberg, Ashkan Kazemi, Scott A. Hale, Shiri Dori-Hacohen
Abstract: Diaspora communities are disproportionately impacted by off-the-radar misinformation and often neglected by mainstream fact-checking efforts, creating a critical need to scale-up efforts of nascent fact-checking initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic Dataset Generation to leverage the capabilities of the largest frontier Large Language Models (LLMs) to train local, specialized language models. To the best of our knowledge, SynDy is the first paper utilizing LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy utilizes LLMs and social media queries to automatically generate distantly-supervised, topically-focused datasets with synthetic labels on these three tasks, providing essential tools to scale up human-led fact-checking at a fraction of the cost of human-annotated data. Training on SynDy's generated labels shows improvement over a standard baseline and is not significantly worse compared to training on human labels (which may be infeasible to acquire). SynDy is being integrated into Meedan's chatbot tiplines that are used by over 50 organizations, serve over 230K users annually, and automatically distribute human-written fact-checks via messaging apps such as WhatsApp. SynDy will also be integrated into our deployed Co-Insights toolkit, enabling low-resource organizations to launch tiplines for their communities. Finally, we envision SynDy enabling additional fact-checking tools such as matching new misinformation claims to high-quality explainers on common misinformation topics.
Authors: A Akanbi
Abstract: Drought is a complex environmental phenomenon that affects millions of people and communities all over the globe and is too elusive to be accurately predicted. This is mostly due to the scalability and variability of the web of environmental parameters that directly/indirectly causes the onset of different categories of drought. Since the dawn of man, efforts have been made to uniquely understand the natural indicators that provide signs of likely environmental events. These indicators/signs in the form of indigenous knowledge system have been used for generations. The intricate complexity of drought has, however, always been a major stumbling block for accurate drought prediction and forecasting systems. Recently, scientists in the field of agriculture and environmental monitoring have been discussing the integration of indigenous knowledge and scientific knowledge for a more accurate environmental forecasting system in order to incorporate diverse environmental information for a reliable drought forecast. Hence, in this research, the core objective is the development of a semantics-based data integration middleware that encompasses and integrates heterogeneous data models of local indigenous knowledge and sensor data towards an accurate drought forecasting system for the study areas. The local indigenous knowledge on drought gathered from the domain experts is transformed into rules to be used for performing deductive inference in conjunction with sensors data for determining the onset of drought through an automated inference generation module of the middleware. The semantic middleware incorporates, inter alia, a distributed architecture that consists of a streaming data processing engine based on Apache Kafka for real-time stream processing; a rule-based reasoning module; an ontology module for semantic representation of the knowledge bases.
Authors: Hassan Haji Mohammadi, Alireza Talebpour, Ahmad Mahmoudi Aznaveh, Samaneh Yazdani
Abstract: Coreference resolution, critical for identifying textual entities referencing the same entity, faces challenges in pronoun resolution, particularly identifying pronoun antecedents. Existing methods often treat pronoun resolution as a separate task from mention detection, potentially missing valuable information. This study proposes the first end-to-end neural network system for Persian pronoun resolution, leveraging pre-trained Transformer models like ParsBERT. Our system jointly optimizes both mention detection and antecedent linking, achieving a 3.37 F1 score improvement over the previous state-of-the-art system (which relied on rule-based and statistical methods) on the Mehr corpus. This significant improvement demonstrates the effectiveness of combining neural networks with linguistic models, potentially marking a significant advancement in Persian pronoun resolution and paving the way for further research in this under-explored area.
Authors: Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Abstract: In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, research state of efficient structures and strategies, and the applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.
URLs: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.
Authors: Albert Sawczyn, Jakub Binkowski, Piotr Bielak, Tomasz Kajdanowicz
Abstract: Knowledge-intensive tasks pose a significant challenge for Machine Learning (ML) techniques. Commonly adopted methods, such as Large Language Models (LLMs), often exhibit limitations when applied to such tasks. Nevertheless, there have been notable endeavours to mitigate these challenges, with a significant emphasis on augmenting LLMs through Knowledge Graphs (KGs). While KGs provide many advantages for representing knowledge, their development costs can deter extensive research and applications. Addressing this limitation, we introduce a framework for enriching embeddings of small-scale domain-specific Knowledge Graphs with well-established general-purpose KGs. Adopting our method, a modest domain-specific KG can benefit from a performance boost in downstream tasks when linked to a substantial general-purpose KG. Experimental evaluations demonstrate a notable enhancement, with up to a 44% increase observed in the Hits@10 metric. This relatively unexplored research direction can catalyze more frequent incorporation of KGs in knowledge-intensive tasks, resulting in more robust, reliable ML implementations, which hallucinates less than prevalent LLM solutions. Keywords: knowledge graph, knowledge graph completion, entity alignment, representation learning, machine learning
Authors: Yicheng Qi, Ang Li
Abstract: Artificially sweetened beverages like Diet Coke are often considered healthier alternatives, but the debate over their impact on obesity persists. Previous research has predominantly relied on observational data or randomized controlled trials (RCTs), which may not accurately capture the causal relationship between Diet Coke consumption and obesity. This study uses causal inference methods, employing data from the National Health and Nutrition Examination Survey (NHANES) to examine this relationship across diverse demographics. Instead of relying on RCT data, we constructed a causal graph and applied the back-door criterion with its adjustment formula to estimate the RCT distributions. We then calculated the counterfactual quantity, the Probability of Necessity and Sufficiency (PNS), using both NHANES data and estimated RCT data. We propose that PNS is the essential metric for assessing the impact of Diet Coke on obesity. Our results indicate that between 20% to 50% of individuals, especially those with poor dietary habits, are more likely to gain weight from Diet Coke. Conversely, in groups like young females with healthier diets, only a small proportion experience weight gain due to Diet Coke. These findings highlight the influence of individual lifestyle and potential hormonal factors on the varied effects of Diet Coke, providing a new framework for understanding its nutritional impacts on health.
Authors: Yu Cheng, Qin Yang, Liyang Wang, Ao Xiang, Jingyu Zhang
Abstract: In the realm of globalized financial markets, commercial banks are confronted with an escalating magnitude of credit risk, thereby imposing heightened requisites upon the security of bank assets and financial stability. This study harnesses advanced neural network techniques, notably the Backpropagation (BP) neural network, to pioneer a novel model for preempting credit risk in commercial banks. The discourse initially scrutinizes conventional financial risk preemptive models, such as ARMA, ARCH, and Logistic regression models, critically analyzing their real-world applications. Subsequently, the exposition elaborates on the construction process of the BP neural network model, encompassing network architecture design, activation function selection, parameter initialization, and objective function construction. Through comparative analysis, the superiority of neural network models in preempting credit risk in commercial banks is elucidated. The experimental segment selects specific bank data, validating the model's predictive accuracy and practicality. Research findings evince that this model efficaciously enhances the foresight and precision of credit risk management.
Authors: Xiaotian Lu, Jiyi Li, Zhen Wan, Xiaofeng Lin, Koh Takeuchi, Hisashi Kashima
Abstract: Deep learning models have performed well on many NLP tasks. However, their internal mechanisms are typically difficult for humans to understand. The development of methods to explain models has become a key issue in the reliability of deep learning models in many important applications. Various saliency explanation methods, which give each feature of input a score proportional to the contribution of output, have been proposed to determine the part of the input which a model values most. Despite a considerable body of work on the evaluation of saliency methods, whether the results of various evaluation metrics agree with human cognition remains an open question. In this study, we propose a new human-based method to evaluate saliency methods in NLP by crowdsourcing. We recruited 800 crowd workers and empirically evaluated seven saliency methods on two datasets with the proposed method. We analyzed the performance of saliency methods, compared our results with existing automated evaluation methods, and identified notable differences between NLP and computer vision (CV) fields when using saliency methods. The instance-level data of our crowdsourced experiments and the code to reproduce the explanations are available at https://github.com/xtlu/lreccoling_evaluation.
Authors: Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li
Abstract: Similar to natural language models, pre-trained genome language models are proposed to capture the underlying intricacies within genomes with unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the \textit{hand-crafted} tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebook as \textit{learnable} vocabulary, VQDNA can adaptively tokenize genomes into \textit{pattern-aware} embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where varying scales of codebooks are designed in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
Authors: Fabian Fumagalli, Maximilian Muschalik, Patrick Kolpaczki, Eyke H\"ullermeier, Barbara Hammer
Abstract: The Shapley value (SV) is a prevalent approach of allocating credit to machine learning (ML) entities to understand black box ML models. Enriching such interpretations with higher-order interactions is inevitable for complex systems, where the Shapley Interaction Index (SII) is a direct axiomatic extension of the SV. While it is well-known that the SV yields an optimal approximation of any game via a weighted least square (WLS) objective, an extension of this result to SII has been a long-standing open problem, which even led to the proposal of an alternative index. In this work, we characterize higher-order SII as a solution to a WLS problem, which constructs an optimal approximation via SII and $k$-Shapley values ($k$-SII). We prove this representation for the SV and pairwise SII and give empirically validated conjectures for higher orders. As a result, we propose KernelSHAP-IQ, a direct extension of KernelSHAP for SII, and demonstrate state-of-the-art performance for feature interactions.
Authors: Lorenzo Sani, Alex Iacob, Zeyu Cao, Bill Marino, Yan Gao, Tomas Paulik, Wanru Zhao, William F. Shen, Preslav Aleksandrov, Xinchi Qiu, Nicholas D. Lane
Abstract: Generative pre-trained large language models (LLMs) have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvement depends on the amount of computing and data sources we can leverage for pre-training. Federated learning (FL) has the potential to unleash the majority of the planet's data and computational resources, which are underutilized by the data-center-focused training methodology of current LLM practice. Our work presents a robust, flexible, reproducible FL approach that enables large-scale collaboration across institutions to train LLMs. This would mobilize more computational and data resources while matching or potentially exceeding centralized performance. We further show the effectiveness of the federated training scales with model size and present our approach for training a billion-scale federated LLM using limited resources. This will help data-rich actors to become the protagonists of LLMs pre-training instead of leaving the stage to compute-rich actors alone.
Authors: Rickard Stureborg, Sanxing Chen, Ruoyu Xie, Aayushi Patel, Christopher Li, Chloe Qinyu Zhu, Tingnan Hu, Jun Yang, Bhuwan Dhingra
Abstract: One way to personalize chatbot interactions is by establishing common ground with the intended reader. A domain where establishing mutual understanding could be particularly impactful is vaccine concerns and misinformation. Vaccine interventions are forms of messaging which aim to answer concerns expressed about vaccination. Tailoring responses in this domain is difficult, since opinions often have seemingly little ideological overlap. We define the task of tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring responses to a CGO involves meaningfully improving the answer by relating it to an opinion or belief the reader holds. In this paper we introduce TAILOR-CGO, a dataset for evaluating how well responses are tailored to provided CGOs. We benchmark several major LLMs on this task; finding GPT-4-Turbo performs significantly better than others. We also build automatic evaluation metrics, including an efficient and accurate BERT model that outperforms finetuned LLMs, investigate how to successfully tailor vaccine messaging to CGOs, and provide actionable recommendations from this investigation. Code and model weights: https://github.com/rickardstureborg/tailor-cgo Dataset: https://huggingface.co/datasets/DukeNLP/tailor-cgo
URLs: https://github.com/rickardstureborg/tailor-cgo, https://huggingface.co/datasets/DukeNLP/tailor-cgo
Authors: Ziyou Guo, Yan Sun, Tieru Wu
Abstract: Time series (TS) forecasting has been an unprecedentedly popular problem in recent years, with ubiquitous applications in both scientific and business fields. Various approaches have been introduced to time series analysis, including both statistical approaches and deep neural networks. Although neural network approaches have illustrated stronger ability of representation than statistical methods, they struggle to provide sufficient interpretablility, and can be too complicated to optimize. In this paper, we present WEITS, a frequency-aware deep learning framework that is highly interpretable and computationally efficient. Through multi-level wavelet decomposition, WEITS novelly infuses frequency analysis into a highly deep learning framework. Combined with a forward-backward residual architecture, it enjoys both high representation capability and statistical interpretability. Extensive experiments on real-world datasets have demonstrated competitive performance of our model, along with its additional advantage of high computation efficiency. Furthermore, WEITS provides a general framework that can always seamlessly integrate with state-of-the-art approaches for time series forecast.
Authors: Mingxiang Fu, Yu Song, Jiameng Lv, Liang Cao, Peng Jia, Nan Li, Xiangru Li, Jifeng Liu, A-Li Luo, Bo Qiu, Shiyin Shen, Liangping Tu, Lili Wang, Shoulin Wei, Haifeng Yang, Zhenping Yi, Zhiqiang Zou
Abstract: The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. Astronomers are turning to deep learning techniques to address this, but the methods are limited by their specific training sets, leading to considerable duplicate workloads too. Hence, as an example to present how to overcome the issue, we built a framework for general analysis of galaxy images, based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratio of galaxy images and the imbalanced distribution of galaxy categories, we have incorporated a Human-in-the-loop (HITL) module into our large vision model, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability to all the abovementioned tasks on galaxy images in the DESI legacy imaging surveys. Expressly, for object detection, trained by 1000 data points, our DST upon the LVM achieves an accuracy of 96.7%, while ResNet50 plus Mask R-CNN gives an accuracy of 93.1%; for morphology classification, to obtain AUC ~0.9, LVM plus DST and HITL only requests 1/50 training sets compared to ResNet18. Expectedly, multimodal data can be integrated similarly, which opens up possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-message astronomy.
Authors: Dimitrios P. Panagoulias, Persephone Papatheodosiou, Anastasios P. Palamidas, Mattheos Sanoudos, Evridiki Tsoureli-Nikita, Maria Virvou, George A. Tsihrintzis
Abstract: Large Language Models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence (AI) technology which is rapidly evolving and promises to aid in medical diagnosis either by assisting doctors or by simulating a doctor's workflow in more advanced and complex implementations. In this technical paper, we outline Cognitive Network Evaluation Toolkit for Medical Domains (COGNET-MD), which constitutes a novel benchmark for LLM evaluation in the medical domain. Specifically, we propose a scoring-framework with increased difficulty to assess the ability of LLMs in interpreting medical text. The proposed framework is accompanied with a database of Multiple Choice Quizzes (MCQs). To ensure alignment with current medical trends and enhance safety, usefulness, and applicability, these MCQs have been constructed in collaboration with several associated medical experts in various medical domains and are characterized by varying degrees of difficulty. The current (first) version of the database includes the medical domains of Psychiatry, Dentistry, Pulmonology, Dermatology and Endocrinology, but it will be continuously extended and expanded to include additional medical domains.
Authors: D. Subhalingam, Keshav Kolluru, Mausam, Saurabh Singal
Abstract: In the e-commerce domain, the accurate extraction of attribute-value pairs from product listings (e.g., Brand: Apple) is crucial for enhancing search and recommendation systems. The automation of this extraction process is challenging due to the vast diversity of product categories and their respective attributes, compounded by the lack of extensive, accurately annotated training datasets and the demand for low latency to meet the real-time needs of e-commerce platforms. To address these challenges, we introduce GenToC, a novel two-stage model for extracting attribute-value pairs from product titles. GenToC is designed to train with partially-labeled data, leveraging incomplete attribute-value pairs and obviating the need for a fully annotated dataset. Moreover, we introduce a bootstrapping method that enables GenToC to progressively refine and expand its training dataset. This enhancement substantially improves the quality of data available for training other neural network models that are typically faster but are inherently less capable than GenToC in terms of their capacity to handle partially-labeled data. By supplying an enriched dataset for training, GenToC significantly advances the performance of these alternative models, making them more suitable for real-time deployment. Our results highlight the unique capability of GenToC to learn from a limited set of labeled data and to contribute to the training of more efficient models, marking a significant leap forward in the automated extraction of attribute-value pairs from product titles. GenToC has been successfully integrated into India's largest B2B e-commerce platform, IndiaMART.com, achieving a significant increase of 21.1% in recall over the existing deployed system while maintaining a high precision of 89.5% in this challenging task.
Authors: Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, Jos\'e J. Hern\'andez-Mu\~noz, Sebastian Schneeweiss, Rishi J. Desai
Abstract: Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
Authors: Kaiyu Huang, Fengran Mo, Hongliang Li, You Li, Yuanchi Zhang, Weijian Yi, Yulong Mao, Jinchen Liu, Yuzhuang Xu, Jinan Xu, Jian-Yun Nie, Yang Liu
Abstract: The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing, attracting global attention in both academia and industry. To mitigate potential discrimination and enhance the overall usability and accessibility for diverse language user groups, it is important for the development of language-fair technology. Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient, where a comprehensive survey to summarize recent approaches, developments, limitations, and potential solutions is desirable. To this end, we provide a survey with multiple perspectives on the utilization of LLMs in the multilingual scenario. We first rethink the transitions between previous and current research on pre-trained language models. Then we introduce several perspectives on the multilingualism of LLMs, including training and inference methods, model security, multi-domain with language culture, and usage of datasets. We also discuss the major challenges that arise in these aspects, along with possible solutions. Besides, we highlight future research directions that aim at further enhancing LLMs with multilingualism. The survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
Authors: Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
Abstract: Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~80 publically available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
Authors: Hariprasath Govindarajan, Per Sid\'en, Jacob Roll, Fredrik Lindsten
Abstract: Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method based on a cross-entropy loss between $K$-dimensional probability vectors, obtained by applying a softmax function to the dot product between representations and learnt prototypes. Given the fact that the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight we propose DINO-vMF, that adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable also for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs iBOT and thereby show the relevance of our proposed modification also for other methods derived from DINO.
Authors: Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
Abstract: Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks - such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, setting up the environment for each test scenario manually, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. To address these challenges, we introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios, without manual instantiation. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment.
Authors: Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu
Abstract: We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs), Unlike existing multimodal models that predominately depend on encoders like CLIP or ImageBind and need ample amounts of training data to bridge modalities,EasyGen leverages BiDiffuser,a bidirectional conditional diffusion model, to foster more efficient modality interactions. Easygen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilities image generation by training an adapter to align the LLM's text space with the BiDiffuser's image space, Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
Authors: Arifa Khan, P. Saravanan, S. K Venkatesan
Abstract: We provide a birds eye view of the rapid developments in AI and Deep Learning that has led to the path-breaking emergence of AI in Large Language Models. The aim of this study is to place all these developments in a pragmatic broader historical social perspective without any exaggerations while at the same time without any pessimism that created the AI winter in the 1970s to 1990s. We also at the same time point out toxicity, bias, memorization, sycophancy, logical inconsistencies, hallucinations that exist just as a warning to the overly optimistic. We note here that just as this emergence of AI seems to occur at a threshold point in the number of neural connections or weights, it has also been observed that human brain and especially the cortex region is nothing special or extraordinary but simply a case of scaled-up version of the primate brain and that even the human intelligence seems like an emergent phenomena of scale.
Authors: Jiaxin Zhang, Zhongzhi Li, Mingliang Zhang, Fei Yin, Chenglin Liu, Yashar Moshfeghi
Abstract: Recent advancements in large language models (LLMs) and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750 problems subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67\% accuracy rate on the main subset but only a 6.00\% accuracy on the hard subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.
Authors: Lenore Blum, Manuel Blum
Abstract: We look at consciousness through the lens of Theoretical Computer Science, a branch of mathematics that studies computation under resource limitations. From this perspective, we develop a formal machine model for consciousness. The model is inspired by Alan Turing's simple yet powerful model of computation and Bernard Baars' theater model of consciousness. Though extremely simple, the model aligns at a high level with many of the major scientific theories of human and animal consciousness, supporting our claim that machine consciousness is inevitable.
Authors: David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, Joshua Tenenbaum
Abstract: Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.
Authors: Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
Authors: Qinbo Bai, Amrit Singh Bedi, Vaneet Aggarwal
Abstract: We consider the problem of constrained Markov decision process (CMDP) in continuous state-actions spaces where the goal is to maximize the expected cumulative reward subject to some constraints. We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) to achieve zero constraint violation while achieving state of the art convergence results for the objective value function. For general policy parametrization, we prove convergence of value function to global optimal upto an approximation error due to restricted policy class. We even improve the sample complexity of existing constrained NPG-PD algorithm \cite{Ding2020} from $\mathcal{O}(1/\epsilon^6)$ to $\mathcal{O}(1/\epsilon^4)$. To the best of our knowledge, this is the first work to establish zero constraint violation with Natural policy gradient style algorithms for infinite horizon discounted CMDPs. We demonstrate the merits of proposed algorithm via experimental evaluations.
Authors: Shourav B. Rabbani, Ivan V. Medri, Manar D. Samad
Abstract: Deep learning methods are primarily proposed for supervised learning of images or text with limited applications to clustering problems. In contrast, tabular data with heterogeneous features pose unique challenges in representation learning, where deep learning has yet to replace traditional machine learning. This paper addresses these challenges in developing one of the first deep clustering methods for tabular data: Gaussian Cluster Embedding in Autoencoder Latent Space (G-CEALS). G-CEALS is an unsupervised deep clustering framework for learning the parameters of multivariate Gaussian cluster distributions by iteratively updating individual cluster weights. The G-CEALS method presents average rank orderings of 2.9(1.7) and 2.8(1.7) based on clustering accuracy and adjusted Rand index (ARI) scores on sixteen tabular data sets, respectively, and outperforms nine state-of-the-art clustering methods. G-CEALS substantially improves clustering performance compared to traditional K-means and GMM, which are still de facto methods for clustering tabular data. Similar computationally efficient and high-performing deep clustering frameworks are imperative to reap the myriad benefits of deep learning on tabular data over traditional machine learning.
Authors: Yanbiao Ma, Licheng Jiao, Fang Liu, Maoji Wen, Lingling Li, Wenping Ma, Shuyuan Yang, Xu Liu, Puhua Chen
Abstract: To address the challenges of long-tailed classification, researchers have proposed several approaches to reduce model bias, most of which assume that classes with few samples are weak classes. However, recent studies have shown that tail classes are not always hard to learn, and model bias has been observed on sample-balanced datasets, suggesting the existence of other factors that affect model bias. In this work, we first establish a geometric perspective for analyzing model fairness and then systematically propose a series of geometric measurements for perceptual manifolds in deep neural networks. Subsequently, we comprehensively explore the effect of the geometric characteristics of perceptual manifolds on classification difficulty and how learning shapes the geometric characteristics of perceptual manifolds. An unanticipated finding is that the correlation between the class accuracy and the separation degree of perceptual manifolds gradually decreases during training, while the negative correlation with the curvature gradually increases, implying that curvature imbalance leads to model bias.Building upon these observations, we propose curvature regularization to facilitate the model to learn curvature-balanced and flatter perceptual manifolds. Evaluations on multiple long-tailed and non-long-tailed datasets show the excellent performance and exciting generality of our approach, especially in achieving significant performance improvements based on current state-of-the-art techniques. Our work opens up a geometric analysis perspective on model bias and reminds researchers to pay attention to model bias on non-long-tailed and even sample-balanced datasets.
Authors: Jemin Lee, Yongin Kwon, Sihyeong Park, Misun Yu, Jeman Park, Hwanjun Song
Abstract: Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread implementation. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers with optimized attention computation of linear complexity. Additionally, post-training quantization has been proposed as a means of mitigating computational demands. For mobile devices, achieving optimal acceleration for ViTs necessitates the strategic integration of quantization techniques and efficient hybrid transformer structures. However, no prior investigation has applied quantization to efficient hybrid transformers. In this paper, we discover that applying existing post-training quantization (PTQ) methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, attributed to the four following challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters ($<$5M). To overcome these challenges, we propose a new post-training quantization method, which is the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, EfficientFormerV2). We achieve a significant improvement of 17.73% for 8-bit and 29.75% for 6-bit on average, respectively, compared with existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT)}. We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.
Authors: Siyu Yuan, Jiangjie Chen, Changzhi Sun, Jiaqing Liang, Yanghua Xiao, Deqing Yang
Abstract: Analogical reasoning is a fundamental cognitive ability of humans. However, current language models (LMs) still struggle to achieve human-like performance in analogical reasoning tasks due to a lack of resources for model training. In this work, we address this gap by proposing ANALOGYKB, a million-scale analogy knowledge base (KB) derived from existing knowledge graphs (KGs). ANALOGYKB identifies two types of analogies from the KGs: 1) analogies of the same relations, which can be directly extracted from the KGs, and 2) analogies of analogous relations, which are identified with a selection and filtering pipeline enabled by large language models (LLMs), followed by minor human efforts for data quality control. Evaluations on a series of datasets of two analogical reasoning tasks (analogy recognition and generation) demonstrate that ANALOGYKB successfully enables both smaller LMs and LLMs to gain better analogical reasoning capabilities.
Authors: Shuai Lyu, Dongmei Mo, Waikeung Wong
Abstract: Existing representation-based methods usually conduct industrial anomaly detection in two stages: obtain feature representations with a pre-trained model and perform distance measures for anomaly detection. Among them, K-nearest neighbor (KNN) retrieval-based anomaly detection methods show promising results. However, the features are not fully exploited as these methods ignore domain bias of pre-trained models and the difference of local density in feature space, which limits the detection performance. In this paper, we propose Reducing Biases (REB) in representation by considering the domain bias and building a self-supervised learning task for better domain adaption with a defect generation strategy (DefectMaker) that ensures a strong diversity in the synthetic defects. Additionally, we propose a local-density KNN (LDKNN) to reduce the local density bias in the feature space and obtain effective anomaly detection. The proposed REB method achieves a promising result of 99.5\% Im.AUROC on the widely used MVTec AD, with smaller backbone networks such as Vgg11 and Resnet18. The method also achieves an impressive 88.8\% Im.AUROC on the MVTec LOCO AD dataset and a remarkable 96.0\% on the BTAD dataset, outperforming other representation-based approaches. These results indicate the effectiveness and efficiency of REB for practical industrial applications. Code:https://github.com/ShuaiLYU/REB.
Authors: Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv, Yi Liu
Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks. However, their deployment presents significant challenges due to their substantial memory and storage requirements. To address this challenge, weight-only quantization has emerged as a promising solution. Previous research has indicated that fine-tuning through up and down rounding can enhance performance. In this study, we introduce SignRound, a method that utilizes signed gradient descent (SignSGD) to optimize rounding values and weight clipping within just 200 steps, combining the strengths of both Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). SignRound achieves outstanding results compared to recent methods across 2 to 4 bits, while maintaining low tuning costs and without introducing any additional inference overhead. For instance, SignRound led to absolute average accuracy improvements ranging from 6.91\% to 33.22\% at 2 bits. Furthermore, it demonstrates robust generalization to various recent models and achieves near-lossless quantization in most scenarios at 4 bits. The source code is publicly available at \url{https://github.com/intel/auto-round}.
Authors: Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra
Abstract: Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels are systematically disagreeing with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops an large language model prompt that agrees with that data. We present ideas and observations from deploying language models for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found large language models can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. To measure agreement with real searchers needs high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
Authors: Markus Frohmann, Carolin Holtermann, Shahed Masoudian, Anne Lauscher, Navid Rekabsaz
Abstract: Multi-task learning (MTL) has shown considerable practical benefits, particularly when using language models (LMs). While this is commonly achieved by learning $n$ tasks under a joint optimization procedure, some methods, such as AdapterFusion, divide the problem into two stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits (e.g., promoting reusability). However, current two-stage MTL introduces a substantial number of additional parameters. We address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn consistently outperforms strong baselines with a small number of transfer parameters (~ $0.35$% of those of AdapterFusion). Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters, achieving competitive results with only $8$ transfer parameters per target task. Our proposed approach thus demonstrates the power of simple scaling as a promise for more efficient task transfer.
Authors: Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang
Abstract: In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLMs and LLMs-based decision pipelines are good at different kinds of VCR problems. Pre-trained VLMs exhibit strong performance for problems involving understanding the literal visual content, which we noted as visual commonsense understanding (VCU). For problems where the goal is to infer conclusions beyond image content, which we noted as visual commonsense inference (VCI), VLMs face difficulties, while LLMs, given sufficient visual evidence, can use commonsense to infer the answer well. We empirically validate this by letting LLMs classify VCR problems into these two categories and show the significant difference between VLM and LLM with image caption decision pipelines on two subproblems. Moreover, we identify a challenge with VLMs' passive perception, which may miss crucial context information, leading to incorrect reasoning by LLMs. Based on these, we suggest a collaborative approach, named ViCor, where pre-trained LLMs serve as problem classifiers to analyze the problem category, then either use VLMs to answer the question directly or actively instruct VLMs to concentrate on and gather relevant visual elements to support potential commonsense inferences. We evaluate our framework on two VCR benchmark datasets and outperform all other methods that do not require in-domain fine-tuning.
Authors: Charles K. Assaad, Emilie Devijver, Eric Gaussier, Gregor G\"ossler, Anouar Meynaoui
Abstract: We study the problem of identifiability of the total effect of an intervention from observational time series in the situation, common in practice, where one only has access to abstractions of the true causal graph. We consider here two abstractions: the extended summary causal graph, which conflates all lagged causal relations but distinguishes between lagged and instantaneous relations, and the summary causal graph which does not give any indication about the lag between causal relations. We show that the total effect is always identifiable in extended summary causal graphs and provide sufficient conditions for identifiability in summary causal graphs. We furthermore provide adjustment sets allowing to estimate the total effect whenever it is identifiable.
Authors: Max Pascher, Felix Ferdinand Goldau, Kirill Kronhardt, Udo Frese, Jens Gerken
Abstract: With the ongoing efforts to empower people with mobility impairments and the increase in technological acceptance by the general public, assistive technologies, such as collaborative robotic arms, are gaining popularity. Yet, their widespread success is limited by usability issues, specifically the disparity between user input and software control along the autonomy continuum. To address this, shared control concepts provide opportunities to combine the targeted increase of user autonomy with a certain level of computer assistance. This paper presents the free and open-source AdaptiX XR framework for developing and evaluating shared control applications in a high-resolution simulation environment. The initial framework consists of a simulated robotic arm with an example scenario in Virtual Reality (VR), multiple standard control interfaces, and a specialized recording/replay system. AdaptiX can easily be extended for specific research needs, allowing Human-Robot Interaction (HRI) researchers to rapidly design and test novel interaction methods, intervention strategies, and multi-modal feedback techniques, without requiring an actual physical robotic arm during the early phases of ideation, prototyping, and evaluation. Also, a Robot Operating System (ROS) integration enables the controlling of a real robotic arm in a PhysicalTwin approach without any simulation-reality gap. Here, we review the capabilities and limitations of AdaptiX in detail and present three bodies of research based on the framework. AdaptiX can be accessed at https://adaptix.robot-research.de.
Authors: Ziyang Chen, Dongfang Li, Xiang Zhao, Baotian Hu, Min Zhang
Abstract: In this study, we address the challenge of enhancing temporal knowledge reasoning in Large Language Models (LLMs). LLMs often struggle with this task, leading to the generation of inaccurate or misleading responses. This issue mainly arises from their limited ability to handle evolving factual knowledge and complex temporal logic. To overcome these limitations, we propose Abstract Reasoning Induction (ARI) framework, which divides temporal reasoning into two distinct phases: Knowledge-agnostic and Knowledge-based. This framework offers factual knowledge support to LLMs while minimizing the incorporation of extraneous noisy data. Concurrently, informed by the principles of constructivism, ARI provides LLMs the capability to engage in proactive, self-directed learning from both correct and incorrect historical reasoning samples. By teaching LLMs to actively construct knowledge and methods, it can significantly boosting their temporal reasoning abilities. Our approach achieves remarkable improvements, with relative gains of 29.7% and 9.27% on two temporal QA datasets, underscoring its efficacy in advancing temporal reasoning in LLMs. The code can be found at https://github.com/czy1999/ARI-QA
Authors: Shida Wang, Qianxiao Li
Abstract: In this paper, we investigate the long-term memory learning capabilities of state-space models (SSMs) from the perspective of parameterization. We prove that state-space models without any reparameterization exhibit a memory limitation similar to that of traditional RNNs: the target relationships that can be stably approximated by state-space models must have an exponential decaying memory. Our analysis identifies this "curse of memory" as a result of the recurrent weights converging to a stability boundary, suggesting that a reparameterization technique can be effective. To this end, we introduce a class of reparameterization techniques for SSMs that effectively lift its memory limitations. Besides improving approximation capabilities, we further illustrate that a principled choice of reparameterization scheme can also enhance optimization stability. We validate our findings using synthetic datasets, language models and image classifications.
Authors: Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
Abstract: With the recent surge in popularity of LLMs has come an ever-increasing need for LLM safety training. In this paper, we investigate the fragility of SOTA open-source LLMs under simple, optimization-free attacks we refer to as $\textit{priming attacks}$, which are easy to execute and effectively bypass alignment from safety training. Our proposed attack improves the Attack Success Rate on Harmful Behaviors, as measured by Llama Guard, by up to $3.3\times$ compared to baselines. Source code and data are available at https://github.com/uiuc-focal-lab/llm-priming-attacks.
URLs: https://github.com/uiuc-focal-lab/llm-priming-attacks.
Authors: Andreas Madsen, Sarath Chandar, Siva Reddy
Abstract: Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.
Authors: Tairan He, Chong Zhang, Wenli Xiao, Guanqi He, Changliu Liu, Guanya Shi
Abstract: Legged robots navigating cluttered environments must be jointly agile for efficient task execution and safe to avoid collisions with obstacles or humans. Existing studies either develop conservative controllers (< 1.0 m/s) to ensure safety, or focus on agility without considering potentially fatal collisions. This paper introduces Agile But Safe (ABS), a learning-based control framework that enables agile and collision-free locomotion for quadrupedal robots. ABS involves an agile policy to execute agile motor skills amidst obstacles and a recovery policy to prevent failures, collaboratively achieving high-speed and collision-free navigation. The policy switch in ABS is governed by a learned control-theoretic reach-avoid value network, which also guides the recovery policy as an objective function, thereby safeguarding the robot in a closed loop. The training process involves the learning of the agile policy, the reach-avoid value network, the recovery policy, and an exteroception representation network, all in simulation. These trained modules can be directly deployed in the real world with onboard sensing and computation, leading to high-speed and collision-free navigation in confined indoor and outdoor spaces with both static and dynamic obstacles.
Authors: Chao Liang, Fan Ma, Linchao Zhu, Yingying Deng, Yi Yang
Abstract: We concentrate on a novel human-centric image synthesis task, that is, given only one reference facial photograph, it is expected to generate specific individual images with diverse head positions, poses, facial expressions, and illuminations in different contexts. To accomplish this goal, we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.
Authors: Zhuoyang Zhang, Han Cai, Song Han
Abstract: We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For the training, we begin with the knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.
Authors: Sovanmonynuth Heng, Myeongseong Go, Youngsun Han
Abstract: Current advancements in technology have focused the attention of the quantum computing community toward exploring the potential of near-term devices whose computing power surpasses that of classical computers in practical applications. An unresolved central question revolves around whether the inherent noise in these devices can be overcome or whether any potential quantum advantage would be limited. There is no doubt that crosstalk is one of the main sources of noise in noisy intermediate-scale quantum (NISQ) systems, and it poses a fundamental challenge to hardware designs. Crosstalk between parallel instructions can corrupt quantum states and cause incorrect program execution. In this study, we present a necessary analysis of the crosstalk error effect on NISQ devices. Our approach is extremely straightforward and practical to estimate the crosstalk error of various multi-qubit devices. In particular, we combine the randomized benchmarking (RB) and simultaneous randomized benchmarking (SRB) protocol to estimate the crosstalk error from the correlation controlled-NOT (CNOT) gate. We demonstrate this protocol experimentally on 5-, 7-, \& 16-qubit devices. Our results demonstrate the crosstalk error model of three different IBM quantum devices over the experimental week and compare the error variation against the machine, number of qubits, quantum volume, processor, and topology. We then confirm the improvement in the circuit fidelity on different benchmarks by up to 3.06x via inserting an instruction barrier, as compared with an IBM quantum noisy device which offers near-optimal crosstalk mitigation in practice. Finally, we discuss the current system limitation, its tradeoff on fidelity and depth, noise beyond the NISQ system, and mitigation opportunities to ensure that the quantum operation can perform its quantum magic undisturbed.
Authors: Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek
Abstract: Large Language Models (LLMS) have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of "jailbreaking", where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.
Authors: Zhenxi Lin, Ziheng Zhang, Xian Wu, Yefeng Zheng
Abstract: Although biomedical entity linking (BioEL) has made significant progress with pre-trained language models, challenges still exist for fine-grained and long-tailed entities. To address these challenges, we present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering. BioELQA first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity. This formulation enables explicit comparison of different candidate entities, thus capturing fine-grained interactions between mentions and entities, as well as among entities themselves. To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and concatenate the input with retrieved instances for the generator. Extensive experimental results show that BioELQA outperforms state-of-the-art baselines on several datasets.
Authors: Yosuke Fukuchi, Seiji Yamada
Abstract: This paper addresses the challenge of selecting explanations for XAI (Explainable AI)-based Intelligent Decision Support Systems (IDSSs). IDSSs have shown promise in improving user decisions through XAI-generated explanations along with AI predictions, and the development of XAI made it possible to generate a variety of such explanations. However, how IDSSs should select explanations to enhance user decision-making remains an open question. This paper proposes X-Selector, a method for selectively presenting XAI explanations. It enables IDSSs to strategically guide users to an AI-suggested decision by predicting the impact of different combinations of explanations on a user's decision and selecting the combination that is expected to minimize the discrepancy between an AI suggestion and a user decision. We compared the efficacy of X-Selector with two naive strategies (all possible explanations and explanations only for the most likely prediction) and two baselines (no explanation and no AI support). The results suggest the potential of X-Selector to guide users to AI-suggested decisions and improve task performance under the condition of a high AI accuracy.
Authors: Jongwook Choi, Taehoon Kim, Yonghyun Jeong, Seungryul Baek, Jongwon Choi
Abstract: This paper presents a new approach for the detection of fake videos, based on the analysis of style latent vectors and their abnormal behavior in temporal changes in the generated videos. We discovered that the generated facial videos suffer from the temporal distinctiveness in the temporal changes of style latent vectors, which are inevitable during the generation of temporally stable videos with various facial expressions and geometric transformations. Our framework utilizes the StyleGRU module, trained by contrastive learning, to represent the dynamic properties of style latent vectors. Additionally, we introduce a style attention module that integrates StyleGRU-generated features with content-based features, enabling the detection of visual and temporal artifacts. We demonstrate our approach across various benchmark scenarios in deepfake detection, showing its superiority in cross-dataset and cross-manipulation scenarios. Through further analysis, we also validate the importance of using temporal changes of style latent vectors to improve the generality of deepfake video detection.
Authors: H. A. Scheppink, S. Ahmadi, P. Desain, M. Tangermann, J. Thielen
Abstract: Auditory attention decoding (AAD) aims to extract from brain activity the attended speaker amidst candidate speakers, offering promising applications for neuro-steered hearing devices and brain-computer interfacing. This pilot study makes a first step towards AAD using the noise-tagging stimulus protocol, which evokes reliable code-modulated evoked potentials, but is minimally explored in the auditory modality. Participants were sequentially presented with two Dutch speech stimuli that were amplitude-modulated with a unique binary pseudo-random noise-code, effectively tagging these with additional decodable information. We compared the decoding of unmodulated audio against audio modulated with various modulation depths, and a conventional AAD method against a standard method to decode noise-codes. Our pilot study revealed higher performances for the conventional method with 70 to 100 percent modulation depths compared to unmodulated audio. The noise-code decoder did not further improve these results. These fundamental insights highlight the potential of integrating noise-codes in speech to enhance auditory speaker detection when multiple speakers are presented simultaneously.
Authors: Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li, Yongfeng Zhang
Abstract: Generative recommendation based on Large Language Models (LLMs) have transformed the traditional ranking-based recommendation style into a text-to-text generation paradigm. However, in contrast to standard NLP tasks that inherently operate on human vocabulary, current research in generative recommendations struggles to effectively encode recommendation items within the text-to-text framework using concise yet meaningful ID representations. To better align LLMs with recommendation needs, we propose IDGen, representing each item as a unique, concise, semantically rich, platform-agnostic textual ID using human language tokens. This is achieved by training a textual ID generator alongside the LLM-based recommender, enabling seamless integration of personalized recommendations into natural language generation. Notably, as user history is expressed in natural language and decoupled from the original dataset, our approach suggests the potential for a foundational generative recommendation model. Experiments show that our framework consistently surpasses existing models in sequential recommendation under standard experimental setting. Then, we explore the possibility of training a foundation recommendation model with the proposed method on data collected from 19 different datasets and tested its recommendation performance on 6 unseen datasets across different platforms under a completely zero-shot setting. The results show that the zero-shot performance of the pre-trained foundation model is comparable to or even better than some traditional recommendation models based on supervised training, showing the potential of the IDGen paradigm serving as the foundation model for generative recommendation. Code and data are open-sourced at https://github.com/agiresearch/IDGenRec.
Authors: Qi Zhang, Guang Wang, Li Lin, Kaiwen Xia, Shuai Wang
Abstract: With the advent of the era of big data, massive information, expert experience, and high-accuracy models bring great opportunities to the information cascade prediction of public emergencies. However, the involvement of specialist knowledge from various disciplines has resulted in a primarily application-specific focus (e.g., earthquakes, floods, infectious diseases) for information cascade prediction of public emergencies. The lack of a unified prediction framework poses a challenge for classifying intersectional prediction methods across different application fields. This survey paper offers a systematic classification and summary of information cascade modeling, prediction, and application. We aim to help researchers identify cutting-edge research and comprehend models and methods of information cascade prediction under public emergencies. By summarizing open issues and outlining future directions in this field, this paper has the potential to be a valuable resource for researchers conducting further studies on predicting information cascades.
Authors: Andre R Kuroswiski, Annie S Wu, Angelo Passaro
Abstract: In this paper, we introduce an alternative approach to enhancing Multi-Agent Reinforcement Learning (MARL) through the integration of domain knowledge and attention-based policy mechanisms. Our methodology focuses on the incorporation of domain-specific expertise into the learning process, which simplifies the development of collaborative behaviors. This approach aims to reduce the complexity and learning overhead typically associated with MARL by enabling agents to concentrate on essential aspects of complex tasks, thus optimizing the learning curve. The utilization of attention mechanisms plays a key role in our model. It allows for the effective processing of dynamic context data and nuanced agent interactions, leading to more refined decision-making. Applied in standard MARL scenarios, such as the Stanford Intelligent Systems Laboratory (SISL) Pursuit and Multi-Particle Environments (MPE) Simple Spread, our method has been shown to improve both learning efficiency and the effectiveness of collaborative behaviors. The results indicate that our attention-based approach can be a viable approach for improving the efficiency of MARL training process, integrating domain-specific knowledge at the action level.
Authors: Xiaomin Fang, Jie Gao, Jing Hu, Lihang Liu, Yang Xue, Xiaonan Zhang, Kunrui Zhu
Abstract: While monomer protein structure prediction tools boast impressive accuracy, the prediction of protein complex structures remains a daunting challenge in the field. This challenge is particularly pronounced in scenarios involving complexes with protein chains from different species, such as antigen-antibody interactions, where accuracy often falls short. Limited by the accuracy of complex prediction, tasks based on precise protein-protein interaction analysis also face obstacles. In this report, we highlight the ongoing advancements of our protein complex structure prediction model, HelixFold-Multimer, underscoring its enhanced performance. HelixFold-Multimer provides precise predictions for diverse protein complex structures, especially in therapeutic protein interactions. Notably, HelixFold-Multimer achieves remarkable success in antigen-antibody and peptide-protein structure prediction, greatly surpassing AlphaFold 3. HelixFold-Multimer is now available for public use on the PaddleHelix platform, offering both a general version and an antigen-antibody version. Researchers can conveniently access and utilize this service for their development needs.
Authors: Yanbiao Ma, Licheng Jiao, Fang Liu, Lingling Li, Wenping Ma, Shuyuan Yang, Xu Liu, Puhua Chen
Abstract: Building fair deep neural networks (DNNs) is a crucial step towards achieving trustworthy artificial intelligence. Delving into deeper factors that affect the fairness of DNNs is paramount and serves as the foundation for mitigating model biases. However, current methods are limited in accurately predicting DNN biases, relying solely on the number of training samples and lacking more precise measurement tools. Here, we establish a geometric perspective for analyzing the fairness of DNNs, comprehensively exploring how DNNs internally shape the intrinsic geometric characteristics of datasets-the intrinsic dimensions (IDs) of perceptual manifolds, and the impact of IDs on the fairness of DNNs. Based on multiple findings, we propose Intrinsic Dimension Regularization (IDR), which enhances the fairness and performance of models by promoting the learning of concise and ID-balanced class perceptual manifolds. In various image recognition benchmark tests, IDR significantly mitigates model bias while improving its performance.
Authors: Joachim Winther Pedersen, Erwan Plantec, Eleni Nisioti, Milton Montero, Sebastian Risi
Abstract: Artificial neural networks used for reinforcement learning are structurally rigid, meaning that each optimized parameter of the network is tied to its specific placement in the network structure. It also means that a network only works with pre-defined and fixed input- and output sizes. This is a consequence of having the number of optimized parameters being directly dependent on the structure of the network. Structural rigidity limits the ability to optimize parameters of policies across multiple environments that do not share input and output spaces. Here, we evolve a set of neurons and plastic synapses each represented by a gated recurrent unit (GRU). During optimization, the parameters of these fundamental units of a neural network are optimized in different random structural configurations. Earlier work has shown that parameter sharing between units is important for making structurally flexible neurons We show that it is possible to optimize a set of distinct neuron- and synapse types allowing for a mitigation of the symmetry dilemma. We demonstrate this by optimizing a single set of neurons and synapses to solve multiple reinforcement learning control tasks simultaneously.
Authors: Ao Xiang, Jingyu Zhang, Qin Yang, Liyang Wang, Yu Cheng
Abstract: With the development and widespread application of digital image processing technology, image splicing has become a common method of image manipulation, raising numerous security and legal issues. This paper introduces a new splicing image detection algorithm based on the statistical characteristics of natural images, aimed at improving the accuracy and efficiency of splicing image detection. By analyzing the limitations of traditional methods, we have developed a detection framework that integrates advanced statistical analysis techniques and machine learning methods. The algorithm has been validated using multiple public datasets, showing high accuracy in detecting spliced edges and locating tampered areas, as well as good robustness. Additionally, we explore the potential applications and challenges faced by the algorithm in real-world scenarios. This research not only provides an effective technological means for the field of image tampering detection but also offers new ideas and methods for future related research.
Authors: Xiangyu Peng, Jessica Quaye, Weijia Xu, Portia Botchway, Chris Brockett, Bill Dolan, Nebojsa Jojic, Gabriel DesGarennes, Ken Lobb, Michael Xu, Jorge Leandro, Claire Jin, Sudha Rao
Abstract: We explore how interaction with large language models (LLMs) can give rise to emergent behaviors, empowering players to participate in the evolution of game narratives. Our testbed is a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise, but can freely interact with non-player characters generated by GPT-4, a large language model. We recruit 28 gamers to play the game and use GPT-4 to automatically convert the game logs into a node-graph representing the narrative in the player's gameplay. We find that through their interactions with the non-deterministic behavior of the LLM, players are able to discover interesting new emergent nodes that were not a part of the original narrative but have potential for being fun and engaging. Players that created the most emergent nodes tended to be those that often enjoy games that facilitate discovery, exploration and experimentation.
Authors: Jianhua Wu, Bingzhao Gao, Jincheng Gao, Jianhao Yu, Hongqing Chu, Qiankun Yu, Xun Gong, Yi Chang, H. Eric Tseng, Hong Chen, Jie Chen
Abstract: With the development of artificial intelligence and breakthroughs in deep learning, large-scale Foundation Models (FMs), such as GPT, Sora, etc., have achieved remarkable results in many fields including natural language processing and computer vision. The application of FMs in autonomous driving holds considerable promise. For example, they can contribute to enhancing scene understanding and reasoning. By pre-training on rich linguistic and visual data, FMs can understand and interpret various elements in a driving scene, and provide cognitive reasoning to give linguistic and action instructions for driving decisions and planning. Furthermore, FMs can augment data based on the understanding of driving scenarios to provide feasible scenes of those rare occurrences in the long tail distribution that are unlikely to be encountered during routine driving and data collection. The enhancement can subsequently lead to improvement in the accuracy and reliability of autonomous driving systems. Another testament to the potential of FMs' applications lies in World Models, exemplified by the DREAMER series, which showcases the ability to comprehend physical laws and dynamics. Learning from massive data under the paradigm of self-supervised learning, World Model can generate unseen yet plausible driving environments, facilitating the enhancement in the prediction of road users' behaviors and the off-line training of driving strategies. In this paper, we synthesize the applications and future trends of FMs in autonomous driving. By utilizing the powerful capabilities of FMs, we strive to tackle the potential issues stemming from the long-tail distribution in autonomous driving, consequently advancing overall safety in this domain.
Authors: Laurent Orseau, Remi Munos
Abstract: We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $\Omega(D)$ exp terms) on the $D$-chain environment, and that a `polynomial' UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment -- the original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero) to also show $\exp_2(\exp_2(D - O(\log D)))$ regret.
Authors: Benjamin Sinclair, Lucy Vivash, Jasmine Moses, Miranda Lynch, William Pham, Karina Dorfman, Cassandra Marotta, Shaun Koh, Jacob Bunyamin, Ella Rowsthorn, Alex Jarema, Himashi Peiris, Zhaolin Chen, Sandy R Shultz, David K Wright, Dexiao Kong, Sharon L. Naismith, Terence J. O\'Brien, Meng Law
Abstract: Perivascular spaces(PVSs) form a central component of the brain\'s waste clearance system, the glymphatic system. These structures are visible on MRI images, and their morphology is associated with aging and neurological disease. Manual quantification of PVS is time consuming and subjective. Numerous deep learning methods for PVS segmentation have been developed, however the majority have been developed and evaluated on homogenous datasets and high resolution scans, perhaps limiting their applicability for the wide range of image qualities acquired in clinic and research. In this work we train a nnUNet, a top-performing biomedical image segmentation algorithm, on a heterogenous training sample of manually segmented MRI images of a range of different qualities and resolutions from 6 different datasets. These are compared to publicly available deep learning methods for 3D segmentation of PVS. The resulting model, PINGU (Perivascular space Identification Nnunet for Generalised Usage), achieved voxel and cluster level dice scores of 0.50(SD=0.15), 0.63(0.17) in the white matter(WM), and 0.54(0.11), 0.66(0.17) in the basal ganglia(BG). Performance on data from unseen sites was substantially lower for both PINGU(0.20-0.38(WM, voxel), 0.29-0.58(WM, cluster), 0.22-0.36(BG, voxel), 0.46-0.60(BG, cluster)) and the publicly available algorithms(0.18-0.30(WM, voxel), 0.29-0.38(WM cluster), 0.10-0.20(BG, voxel), 0.15-0.37(BG, cluster)), but PINGU strongly outperformed the publicly available algorithms, particularly in the BG. Finally, training PINGU on manual segmentations from a single site with homogenous scan properties gave marginally lower performances on internal cross-validation, but in some cases gave higher performance on external validation. PINGU stands out as broad-use PVS segmentation tool, with particular strength in the BG, an area of PVS related to vascular disease and pathology.
Authors: Zaitian Wang, Pengfei Wang, Kunpeng Liu, Pengyang Wang, Yanjie Fu, Chang-Tien Lu, Charu C. Aggarwal, Jian Pei, Yuanchun Zhou
Abstract: Data augmentation is a series of techniques that generate high-quality artificial data by manipulating existing data samples. By leveraging data augmentation techniques, AI models can achieve significantly improved applicability in tasks involving scarce or imbalanced datasets, thereby substantially enhancing AI models' generalization capabilities. Existing literature surveys only focus on a certain type of specific modality data, and categorize these methods from modality-specific and operation-centric perspectives, which lacks a consistent summary of data augmentation methods across multiple modalities and limits the comprehension of how existing data samples serve the data augmentation process. To bridge this gap, we propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities. Specifically, from a data-centric perspective, this survey proposes a modality-independent taxonomy by investigating how to take advantage of the intrinsic relationship between data samples, including single-wise, pair-wise, and population-wise sample data augmentation methods. Additionally, we categorize data augmentation methods across five data modalities through a unified inductive approach.
Authors: Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan
Abstract: Learning commonsense reasoning from visual contexts and scenes in real-world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve deeper into reasoning evaluations, specifically within dynamic, open-world, and structured context knowledge. We propose a new benchmark (SOK-Bench), consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. To create such a dataset, we propose an automatic and scalable generation method to generate question-answer pairs, knowledge graphs, and rationales by instructing the combinations of LLMs and MLLMs. Concretely, we first extract observable situated entities, relations, and processes from videos for situated knowledge and then extend to open-world knowledge beyond the visible content. The task generation is facilitated through multiple dialogues as iterations and subsequently corrected and refined by our designed self-promptings and demonstrations. With a corpus of both explicit situated facts and implicit commonsense, we generate associated question-answer pairs and reasoning processes, finally followed by manual reviews for quality assurance. We evaluated recent mainstream large vision-language models on the benchmark and found several insightful conclusions. For more information, please refer to our benchmark at www.bobbywu.com/SOKBench.
Authors: Xiaofei Wen, Bangzheng Li, Tenghao Huang, Muhao Chen
Abstract: Most language models currently available are prone to self-contradiction during dialogues. To mitigate this issue, this study explores a novel contradictory dialogue processing task that aims to detect and modify contradictory statements in a conversation. This task is inspired by research on context faithfulness and dialogue comprehension, which have demonstrated that the detection and understanding of contradictions often necessitate detailed explanations. We develop a dataset comprising contradictory dialogues, in which one side of the conversation contradicts itself. Each dialogue is accompanied by an explanatory label that highlights the location and details of the contradiction. With this dataset, we present a Red Teaming framework for contradictory dialogue processing. The framework detects and attempts to explain the dialogue, then modifies the existing contradictory content using the explanation. Our experiments demonstrate that the framework improves the ability to detect contradictory dialogues and provides valid explanations. Additionally, it showcases distinct capabilities for modifying such dialogues. Our study highlights the importance of the logical inconsistency problem in conversational AI.