An Evaluation of State-of-the-Art Large Language Models for Sarcasm Detection. (arXiv:2312.03706v1 [cs.CL])

Authors: Juliann Zhou

Sarcasm, as defined by Merriam-Webster, is the use of words by someone who means the opposite of what he is trying to say. In sentiment analysis, a subfield of Natural Language Processing (NLP), the ability to correctly identify sarcasm is necessary for understanding people's true opinions. Because the use of sarcasm is often context-based, previous research has used language representation models, such as Support Vector Machine (SVM) and Long Short-Term Memory (LSTM), to identify sarcasm with context-based information. Recent innovations in NLP have provided more possibilities for detecting sarcasm. In BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin et al. (2018) introduced a new language representation model and demonstrated higher precision in interpreting contextualized language. As proposed by Hazarika et al. (2018), CASCADE is a context-driven model that produces good results for detecting sarcasm. This study analyzes a Reddit corpus using these two state-of-the-art models and evaluates their performance against baseline models to find the ideal approach to sarcasm detection.

Abstraction via exemplars? A representational case study on lexical category inference in BERT. (arXiv:2312.03708v1 [cs.CL])

Authors: Kanishka Misra, Najoung Kim

Exemplar-based accounts are often considered to be in direct opposition to pure linguistic abstraction in explaining language learners' ability to generalize to novel expressions. However, the recent success of neural network language models on linguistically sensitive tasks suggests that perhaps abstractions can arise via the encoding of exemplars. We provide empirical evidence for this claim by adapting an existing experiment that studies how an LM (BERT) generalizes the usage of novel tokens that belong to lexical categories such as Noun/Verb/Adjective/Adverb from exposure to only a single instance of their usage. We analyze the representational behavior of the novel tokens in these experiments, and find that BERT's capacity to generalize to unseen expressions involving the use of these novel tokens constitutes the movement of novel token representations towards regions of known category exemplars in two-dimensional space. Our results suggest that learners' encoding of exemplars can indeed give rise to abstraction-like behavior.

Co-guiding for Multi-intent Spoken Language Understanding. (arXiv:2312.03716v1 [cs.CL])

Authors: Bowen Xing, Ivor W. Tsang

Recent graph-based models for multi-intent SLU have obtained promising results through modeling the guidance from the prediction of intents to the decoding of slot filling. However, existing methods (1) only model the unidirectional guidance from intent to slot, while there are bidirectional inter-correlations between intent and slot; (2) adopt homogeneous graphs to model the interactions between the slot semantics nodes and intent label nodes, which limit the performance. In this paper, we propose a novel model termed Co-guiding Net, which implements a two-stage framework achieving the mutual guidances between the two tasks. In the first stage, the initial estimated labels of both tasks are produced, and then they are leveraged in the second stage to model the mutual guidances. Specifically, we propose two heterogeneous graph attention networks working on the proposed two heterogeneous semantics label graphs, which effectively represent the relations among the semantics nodes and label nodes. Besides, we further propose Co-guiding-SCL Net, which exploits the single-task and dual-task semantics contrastive relations. For the first stage, we propose single-task supervised contrastive learning, and for the second stage, we propose co-guiding supervised contrastive learning, which considers the two tasks' mutual guidances in the contrastive learning procedure. Experiment results on multi-intent SLU show that our model outperforms existing models by a large margin, obtaining a relative improvement of 21.3% over the previous best model on MixATIS dataset in overall accuracy. We also evaluate our model on the zero-shot cross-lingual scenario and the results show that our model can relatively improve the state-of-the-art model by 33.5% on average in terms of overall accuracy for the total 9 languages.

Large Language Models in Law: A Survey. (arXiv:2312.03718v1 [cs.CL])

Authors: Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, Philip S. Yu

The advent of artificial intelligence (AI) has significantly impacted the traditional judicial industry. Moreover, recently, with the development of AI-generated content (AIGC), AI and law have found applications in various domains, including image recognition, automatic text generation, and interactive chat. With the rapid emergence and growing popularity of large models, it is evident that AI will drive transformation in the traditional judicial industry. However, the application of legal large language models (LLMs) is still in its nascent stage. Several challenges need to be addressed. In this paper, we aim to provide a comprehensive survey of legal LLMs. We not only conduct an extensive survey of LLMs, but also expose their applications in the judicial system. We first provide an overview of AI technologies in the legal field and showcase the recent research in LLMs. Then, we discuss the practical implementation presented by legal LLMs, such as providing legal advice to users and assisting judges during trials. In addition, we explore the limitations of legal LLMs, including data, algorithms, and judicial practice. Finally, we summarize practical recommendations and propose future development directions to address these challenges.

Assessing AI Chatbots Performance in Comprehensive Standardized Test Preparation; A Case Study with GRE. (arXiv:2312.03719v1 [cs.CL])

Authors: Mohammad Abu-Haifa, Bara'a Etawi, Huthaifa Alkhatatbeh, Ayman Ababneh

This research paper presents a comprehensive evaluation of the performance of three artificial intelligence chatbots, Bing, ChatGPT, and GPT-4, in addressing standardized test questions. The Graduate Record Examination (GRE) serves as a case study in this paper, encompassing both quantitative reasoning and verbal skills. A total of 137 quantitative reasoning questions featuring diverse styles and 157 verbal questions categorized into varying levels of difficulty (easy, medium, and hard) were administered to assess the chatbots' capabilities. This paper provides a detailed examination of the results and their implications for the utilization of artificial intelligence in standardized test preparation by presenting the performance of each chatbot across various skills and styles tested in the exam. Additionally, this paper explores the proficiency of artificial intelligence in addressing image-based questions and illustrates the uncertainty level of each chatbot. The results reveal varying degrees of success across the chatbots, demonstrating the influence of model sophistication and training data. GPT-4 emerged as the most proficient, especially in complex language understanding tasks, highlighting the evolution of artificial intelligence in language comprehension and its ability to pass the exam with a high score.

Negotiating with LLMS: Prompt Hacks, Skill Gaps, and Reasoning Deficits. (arXiv:2312.03720v1 [cs.CL])

Authors: Johannes Schneider, Steffi Haag, Leona Chandra Kruse

Large language models (LLMs) like ChatGPT have reached the 100 million user mark in record time and might increasingly enter all areas of our lives, leading to a diverse set of interactions between these Artificial Intelligence models and humans. While many studies have discussed governance and regulations deductively from first-order principles, few studies provide an inductive, data-driven lens based on observing dialogues between humans and LLMs, especially when it comes to non-collaborative, competitive situations that have the potential to pose a serious threat to people. In this work, we conduct a user study engaging over 40 individuals across all age groups in price negotiations with an LLM. We explore how people interact with an LLM, investigating differences in negotiation outcomes and strategies. Furthermore, we highlight shortcomings of LLMs with respect to their reasoning capabilities and, in turn, their susceptibility to prompt hacking, which aims to manipulate the LLM into making agreements that are against its instructions or beyond any rationality. We also show that the negotiated prices humans manage to achieve span a broad range, which points to a literacy gap in effectively interacting with LLMs.

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability. (arXiv:2312.03721v1 [cs.CL])

Authors: Simon Lermen, Ondřej Kvapil

There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval. These injections resemble direct communication between the testee and the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. Furthermore, similar injections can be used on automated interpretability frameworks to produce misleading model-written explanations. The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

Leveraging AI-derived Data for Carbon Accounting: Information Extraction from Alternative Sources. (arXiv:2312.03722v1 [cs.CL])

Authors: Olamide Oladeji, Seyed Shahabeddin Mousavi

Carbon accounting is a fundamental building block in our global path to emissions reduction and decarbonization, yet many challenges exist in achieving reliable and trusted carbon accounting measures. We motivate that carbon accounting not only needs to be more data-driven, but also more methodologically sound. We discuss the need for alternative, more diverse data sources that can play a significant role on our path to trusted carbon accounting procedures and elaborate on not only why, but how Artificial Intelligence (AI) in general and Natural Language Processing (NLP) in particular can unlock reasonable access to a treasure trove of alternative data sets in light of the recent advances in the field that better enable the utilization of unstructured data in this process. We present a case study of the recent developments on real-world data via an NLP-powered analysis using OpenAI's GPT API on financial and shipping data. We conclude the paper with a discussion on how these methods and approaches can be integrated into a broader framework for AI-enabled integrative carbon accounting.

ChatGPT Application In Summarizing An Evolution Of Deep Learning Techniques In Imaging: A Qualitative Study. (arXiv:2312.03723v1 [cs.CL])

Authors: Arman Sarraf, Amirabbas Abbaspour

The pursuit of article or text summarization has captured the attention of natural language processing (NLP) practitioners, presenting itself as a formidable challenge. ChatGPT 3.5 exhibits the capacity to condense the content of up to 3000 tokens into a single page, aiming to retain pivotal information from a given text across diverse themes. In a conducted qualitative research endeavor, we selected seven scientific articles and employed the publicly available ChatGPT service to generate summaries of these articles. Subsequently, we engaged six co-authors of the articles in a survey, presenting five questions to evaluate the quality of the summaries compared to the original content. The findings revealed that the summaries produced by ChatGPT effectively encapsulated the crucial information present in the articles, preserving the principal message of each manuscript. Nonetheless, there was a slight diminishment in the technical depth of the summaries as opposed to the original articles. As a result, our conclusion underscores ChatGPT's text summarization capability as a potent tool for extracting essential insights in a manner more aligned with reporting than purely scientific discourse.

DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer. (arXiv:2312.03724v1 [cs.CL])

Authors: Junyuan Hong, Jiachen T. Wang, Chenhui Zhang, Zhangheng Li, Bo Li, Zhangyang Wang

Large Language Models (LLMs) have emerged as dominant tools for various tasks, particularly when tailored for a specific target by prompt tuning. Nevertheless, concerns surrounding data privacy present obstacles due to the tuned prompts' dependency on sensitive private information. A practical solution is to host a local LLM and optimize a soft prompt privately using data. Yet, hosting a local model becomes problematic when model ownership is protected. Alternative methods, like sending data to the model's provider for training, intensify these privacy issues facing an untrusted provider. In this paper, we present a novel solution called Differentially-Private Offsite Prompt Tuning (DP-OPT) to address this challenge. Our approach involves tuning a discrete prompt on the client side and then applying it to the desired cloud models. We demonstrate that prompts suggested by LLMs themselves can be transferred without compromising performance significantly. To ensure that the prompts do not leak private information, we introduce the first private prompt generation mechanism, by a differentially-private (DP) ensemble of in-context learning with private demonstrations. With DP-OPT, generating privacy-preserving prompts by Vicuna-7b can yield competitive performance compared to non-private in-context learning on GPT3.5 or local private prompt tuning. Codes are available at .

SCStory: Self-supervised and Continual Online Story Discovery. (arXiv:2312.03725v1 [cs.CL])

Authors: Susik Yoon, Yu Meng, Dongha Lee, Jiawei Han

We present a framework SCStory for online story discovery, that helps people digest rapidly published news article streams in real-time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them based on representation similarity. However, these methods yield noisy and inaccurate story discovery results because the generic article embeddings do not effectively reflect the story-indicative semantics in an article and cannot adapt to the rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies story-relevant information of news articles and uses them to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed up by two unique techniques, confidence-aware memory replay and prioritized-augmentation, employed for label absence and data scarcity problems. Thorough experiments on real and the latest news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.

Interpretation modeling: Social grounding of sentences by reasoning over their implicit moral judgments. (arXiv:2312.03726v1 [cs.CL])

Authors: Liesbeth Allein, Maria Mihaela Truşcǎ, Marie-Francine Moens

The social and implicit nature of human communication ramifies readers' understandings of written sentences. Single gold-standard interpretations rarely exist, challenging conventional assumptions in natural language processing. This work introduces the interpretation modeling (IM) task which involves modeling several interpretations of a sentence's underlying semantics to unearth layers of implicit meaning. To obtain these, IM is guided by multiple annotations of social relation and common ground - in this work approximated by reader attitudes towards the author and their understanding of moral judgments subtly embedded in the sentence. We propose a number of modeling strategies that rely on one-to-one and one-to-many generation methods that take inspiration from the philosophical study of interpretation. A first-of-its-kind IM dataset is curated to support experiments and analyses. The modeling results, coupled with scrutiny of the dataset, underline the challenges of IM as conflicting and complex interpretations are socially plausible. This interplay of diverse readings is affirmed by automated and human evaluations on the generated interpretations. Finally, toxicity analyses in the generated interpretations demonstrate the importance of IM for refining filters of content and assisting content moderators in safeguarding the safety in online discourse.

Content-Localization based System for Analyzing Sentiment and Hate Behaviors in Low-Resource Dialectal Arabic: English to Levantine and Gulf. (arXiv:2312.03727v1 [cs.CL])

Authors: Fatimah Alzamzami, Abdulmotaleb El Saddik

Even though online social movements can quickly become viral on social media, languages can be a barrier to timely monitoring and analyzing the underlying online social behaviors (OSB). This is especially true for under-resourced languages on social media like dialectal Arabic, the primary language used by Arabs on social media. Therefore, it is crucial to provide solutions to efficiently exploit resources from high-resourced languages to solve language-dependent OSB analysis in under-resourced languages. This paper proposes to localize content of resources in high-resourced languages into under-resourced Arabic dialects. Content localization goes beyond content translation, which converts text from one language to another; content localization adapts culture, language nuances and regional preferences from one language to a specific language/dialect. Automating the understanding of natural and familiar day-to-day expressions in different regions is the key to achieving a wider analysis of OSB, especially for smart cities. In this paper, we utilize content-localization based neural machine translation to develop sentiment and hate classifiers for two low-resourced Arabic dialects: Levantine and Gulf. We also leverage unsupervised learning to facilitate the analysis of sentiment and hate predictions by inferring hidden topics from the corresponding data and providing coherent interpretations of those topics in their native language/dialects. The experimental evaluations and proof-of-concept COVID-19 case study on real data have validated the effectiveness of our proposed system in precisely distinguishing sentiments and accurately identifying hate content in both Levantine and Gulf Arabic dialects. Our findings shed light on the importance of considering the unique nature of dialects within the same language; ignoring the dialectal aspect would lead to misleading analysis.

Real Customization or Just Marketing: Are Customized Versions of Chat GPT Useful?. (arXiv:2312.03728v1 [cs.CL])

Authors: Eduardo C. Garrido-Merchán, Jose L. Arroyo-Barrigüete, Francisco Borrás-Pala, Leandro Escobar-Torres, Carlos Martínez de Ibarreta, Jose María Ortiz-Lozano, Antonio Rua-Vieites

Large Language Models (LLMs), such as OpenAI ChatGPT-4 Turbo, are revolutionizing several industries, including higher education. In this context, LLMs can be personalized through a fine-tuning process to meet student demands in each particular subject, such as statistics. Recently, OpenAI has made it possible to fine-tune their model through a natural language web interface, enabling the creation of customized GPT versions deliberately conditioned to meet the demands of a specific task. The objective of this research is to assess the potential of the customized GPTs that have recently been launched by OpenAI. After developing a Business Statistics Virtual Professor (BSVP), tailored for students at the Universidad Pontificia Comillas, its behavior was evaluated and compared with that of ChatGPT-4 Turbo. The results lead to several conclusions. Firstly, a substantial modification in the style of communication was observed. Following the instructions it was trained with, BSVP provided responses in a more relatable and friendly tone, even incorporating a few minor jokes. Secondly, and this is a matter of relevance, when explicitly asked for something like, "I would like to practice a programming exercise similar to those in R practice 4," BSVP was capable of providing a far superior response: having access to contextual documentation, it could fulfill the request, something beyond ChatGPT-4 Turbo's capabilities. On the downside, the response times were generally higher. Lastly, regarding overall performance, quality, depth, and alignment with the specific content of the course, no statistically significant differences were observed in the responses between BSVP and ChatGPT-4 Turbo. It appears that customized assistants trained with prompts present advantages as virtual aids for students, yet they do not constitute a substantial improvement over ChatGPT-4 Turbo.

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?. (arXiv:2312.03729v1 [cs.CL])

Authors: Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas

Neural language models (LMs) can be used to evaluate the truth of factual statements in two ways: they can be either queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents. Is this an accurate description of today's LMs, or can query-probe disagreement arise in other ways? We identify three different classes of disagreement, which we term confabulation, deception, and heterogeneity. In many cases, the superiority of probes is simply attributable to better calibration on uncertain answers rather than a greater fraction of correct, high-confidence answers. In some cases, queries and probes perform better on different subsets of inputs, and accuracy can further be improved by ensembling the two. Code is available at
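
The observation that queries and probes can be stronger on different subsets of inputs, so that combining them helps, can be illustrated with a small sketch. Everything below is an assumption for illustration: the probabilities are made-up numbers, and plain averaging is just one simple ensembling rule, not necessarily the one used in the paper.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of statements whose thresholded probability matches the label."""
    preds = [int(p > threshold) for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def ensemble(query_probs, probe_probs):
    """Average the two truthfulness estimates (an illustrative choice)."""
    return [(q + p) / 2 for q, p in zip(query_probs, probe_probs)]


# Hypothetical data: 1 = statement is true, 0 = statement is false.
labels = [1, 0, 1, 0]
query_probs = [0.9, 0.1, 0.4, 0.45]   # LM output probability; wrong on item 3
probe_probs = [0.45, 0.2, 0.9, 0.1]   # probe probability; wrong on item 1

combined = ensemble(query_probs, probe_probs)
print(accuracy(query_probs, labels), accuracy(probe_probs, labels),
      accuracy(combined, labels))
```

In this toy setup each method misclassifies a different statement, so the averaged estimate classifies all four correctly.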

FakeWatch ElectionShield: A Benchmarking Framework to Detect Fake News for Credible US Elections. (arXiv:2312.03730v1 [cs.CL])

Authors: Tahniat Khan, Mizanur Rahman, Veronica Chatrath, Oluwanifemi Bamgbose, Shaina Raza

In today's technologically driven world, the spread of fake news, particularly during crucial events such as elections, presents an increasing challenge to the integrity of information. To address this challenge, we introduce FakeWatch ElectionShield, an innovative framework carefully designed to detect fake news. We have created a novel dataset of North American election-related news articles through a blend of advanced language models (LMs) and thorough human verification, for precision and relevance. We propose a model hub of LMs for identifying fake news. Our goal is to provide the research community with adaptable and accurate classification models in recognizing the dynamic nature of misinformation. Extensive evaluation of fake news classifiers on our dataset and a benchmark dataset shows that while state-of-the-art LMs slightly outperform the traditional ML models, classical models are still competitive with their balance of accuracy, explainability, and computational efficiency. This research sets the foundation for future studies to address misinformation related to elections.

Methods to Estimate Large Language Model Confidence. (arXiv:2312.03733v1 [cs.CL])

Authors: Maia Kotelanski, Robert Gallo, Ashwin Nayak, Thomas Savage

Large Language Models (LLMs) have difficulty communicating uncertainty, which is a significant obstacle to applying LLMs to complex medical tasks. This study evaluates methods to measure LLM confidence when suggesting a diagnosis for challenging clinical vignettes. GPT4 was asked a series of challenging case questions using Chain of Thought (CoT) and Self Consistency (SC) prompting. Multiple methods were investigated to assess model confidence and evaluated on their ability to predict the model's observed accuracy. The methods evaluated were Intrinsic Confidence, SC Agreement Frequency and CoT Response Length. SC Agreement Frequency correlated with observed accuracy, yielding a higher Area under the Receiver Operating Characteristic Curve compared to Intrinsic Confidence and CoT Length analysis. SC agreement is the most useful proxy for model confidence, especially for medical diagnosis. Model Intrinsic Confidence and CoT Response Length exhibit a weaker ability to differentiate between correct and incorrect answers, preventing them from being reliable and interpretable markers for model confidence. We conclude GPT4 has a limited ability to assess its own diagnostic accuracy. SC Agreement Frequency is the most useful method to measure GPT4 confidence.
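
As a rough illustration of the idea behind SC Agreement Frequency, the sketch below treats the share of sampled answers that match the majority answer as a confidence score, then measures how well that score separates correct from incorrect cases using a pairwise AUROC. The sampled diagnoses, the ground truth, and both function names are hypothetical; the study's actual prompts and scoring details are not given in the abstract.

```python
from collections import Counter


def sc_agreement(answers):
    """Return (majority answer, fraction of samples that agree with it)."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def auroc(scores, labels):
    """AUROC as P(score of a correct case > score of an incorrect case)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical sampled diagnoses for three vignettes (illustrative only).
samples = [
    ["sarcoidosis", "sarcoidosis", "sarcoidosis", "lymphoma", "sarcoidosis"],
    ["lupus", "vasculitis", "lupus", "sepsis", "vasculitis"],
    ["gout", "gout", "gout", "gout", "gout"],
]
truth = ["sarcoidosis", "vasculitis", "gout"]

results = [sc_agreement(s) for s in samples]
confidences = [freq for _, freq in results]
correct = [int(answer == t) for (answer, _), t in zip(results, truth)]
print(confidences, correct, auroc(confidences, correct))
```

An AUROC near 1.0 means high-agreement cases tend to be the correct ones, while a value near 0.5 means agreement carries little information about correctness.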

Conditional Prompt Tuning for Multimodal Fusion. (arXiv:2312.03734v1 [cs.CL])

Authors: Ruixiang Jiang, Lingbo Liu, Changwen Chen

We show that the representation of one modality can effectively guide the prompting of another modality for parameter-efficient multimodal fusion. Specifically, we first encode one modality and use its representation as a prior to conditionally prompt all frozen layers of the other modality. This is achieved by disentangling the vanilla prompt vectors into three types of specialized prompts that adaptively capture global-level and instance-level features. To better produce the instance-wise prompt, we introduce the mixture of prompt experts (MoPE) to dynamically route each instance to the most suitable prompt experts for encoding. We further study a regularization term to avoid degenerated prompt expert routing. Thanks to our design, our method can effectively transfer the pretrained knowledge in unimodal encoders for downstream multimodal tasks. Compared with vanilla prompting, we show that our MoPE-based conditional prompting is more expressive, thereby scales better with training data and the total number of prompts. We also demonstrate that our prompt tuning is architecture-agnostic, thereby offering high modularity. Extensive experiments over three multimodal datasets demonstrate state-of-the-art results, matching or surpassing the performance achieved through fine-tuning, while only necessitating 0.7% of the trainable parameters. Code will be released:
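
The routing idea, where a representation from one modality decides how prompt experts are weighted for the other, might be sketched as follows. This is a simplified soft-routing illustration under assumptions: the dot-product scoring, the `expert_keys` parameter, and the toy vectors are all hypothetical, and the paper's actual MoPE design and regularization are not reproduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_prompt(cond_feature, expert_keys, expert_prompts):
    """Weight expert prompt vectors by how well each key matches the feature."""
    scores = [sum(c * k for c, k in zip(cond_feature, key))
              for key in expert_keys]
    weights = softmax(scores)
    dim = len(expert_prompts[0])
    mixed = [sum(w * prompt[d] for w, prompt in zip(weights, expert_prompts))
             for d in range(dim)]
    return weights, mixed


# Hypothetical feature from the conditioning modality and two prompt experts.
cond_feature = [1.0, 0.0]
expert_keys = [[2.0, 0.0], [0.0, 2.0]]
expert_prompts = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]

weights, mixed = route_prompt(cond_feature, expert_keys, expert_prompts)
print(weights, mixed)
```

Here the feature aligns with the first expert's key, so the instance-wise prompt is dominated by the first expert's vector.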

Advancing State of the Art in Language Modeling. (arXiv:2312.03735v1 [cs.CL])

Authors: David Herel, Tomas Mikolov

Generalization is arguably the most important goal of statistical language modeling research. Publicly available benchmarks and papers published with an open-source code have been critical to advancing the field. However, it is often very difficult, and sometimes even impossible, to reproduce the results fully as reported in publications. In this paper, we propose a simple framework that should help advance the state of the art in language modeling in terms of generalization. We propose to publish not just the code, but also probabilities on dev and test sets with future publications so that one can easily add the new model into an ensemble. This has crucial advantages: it is much easier to determine whether a newly proposed model is actually complementary to the current baseline. Therefore, instead of inventing new names for the old tricks, the scientific community can advance faster. Finally, this approach promotes diversity of ideas: one does not need to create an individual model that is the new state of the art to attract attention; it will be sufficient to develop a new model that learns patterns which other models do not. Thus, even a suboptimal model can be found to have value. Remarkably, our approach has yielded new state-of-the-art results across various language modeling benchmarks up to 10%.
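
To make the proposal concrete: if each publication released the probability its model assigned to every dev and test token, anyone could combine a new model with a baseline without rerunning either. The sketch below uses linear interpolation with a weight tuned on the dev set, which is one common way to ensemble language models and is assumed here rather than taken from the paper. The probability lists are invented for illustration.

```python
import math


def perplexity(probs):
    """Perplexity from the probability assigned to each reference token."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def interpolate(probs_a, probs_b, weight):
    """Linear interpolation of two models' per-token probabilities."""
    return [weight * a + (1 - weight) * b for a, b in zip(probs_a, probs_b)]


def best_weight(dev_a, dev_b, steps=20):
    """Grid-search the interpolation weight that minimizes dev perplexity."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda w: perplexity(interpolate(dev_a, dev_b, w)))


# Hypothetical published per-token probabilities (illustrative numbers only).
dev_base, dev_new = [0.5, 0.1, 0.4, 0.05], [0.1, 0.5, 0.05, 0.4]
test_base, test_new = [0.3, 0.05, 0.6, 0.1], [0.1, 0.4, 0.1, 0.5]

w = best_weight(dev_base, dev_new)
test_mix = interpolate(test_base, test_new, w)
print(w, perplexity(test_base), perplexity(test_new), perplexity(test_mix))
```

In this toy case neither model is better than the other on its own, yet the interpolated ensemble has lower test perplexity than both, which is the sense in which a model can be complementary without being the new state of the art.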

De-identification of clinical free text using natural language processing: A systematic review of current approaches. (arXiv:2312.03736v1 [cs.CL])

Authors: Aleksandar Kovačević, Bojana Bašaragin, Nikola Milošević, Goran Nenadić

Background: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process. Objectives: Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems. In addition, we aim to identify challenges and potential research opportunities in this field. Methods: A systematic search in PubMed, Web of Science and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance. Results: A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.

A Generic NLI approach for Classification of Sentiment Associated with Therapies. (arXiv:2312.03737v1 [cs.CL])

Authors: Rajaraman Kanagasabai, Anitha Veeramani

This paper describes our system for addressing SMM4H 2023 Shared Task 2 on "Classification of sentiment associated with therapies (aspect-oriented)". In our work, we adopt an approach based on Natural language inference (NLI) to formulate this task as a sentence pair classification problem, and train transformer models to predict sentiment associated with a therapy on a given text. Our best model achieved 75.22% F1-score which was 11% (4%) more than the mean (median) score of all teams' submissions.
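
One way to picture the sentence-pair formulation is to pair each text (the premise) with one hypothesis per candidate sentiment and pick the label whose hypothesis scores highest. The sketch below is only an assumption about how such pairs could be built: the label set, the hypothesis template, and `toy_entail_score` (a keyword stand-in for a trained transformer NLI model) are all hypothetical and not taken from the paper.

```python
LABELS = ("positive", "negative", "neutral")  # assumed label set

# Hypothetical cue words used only by the toy scorer below.
CUES = {"positive": ("helped", "relief"),
        "negative": ("worse", "nausea"),
        "neutral": ()}


def build_pairs(text, therapy, labels=LABELS):
    """Build one (premise, hypothesis, label) triple per candidate label."""
    return [(text, f"The sentiment towards {therapy} is {label}.", label)
            for label in labels]


def predict(text, therapy, entail_score):
    """Return the label whose hypothesis gets the highest entailment score."""
    pairs = build_pairs(text, therapy)
    return max(pairs, key=lambda pair: entail_score(pair[0], pair[1]))[2]


def toy_entail_score(premise, hypothesis):
    """Stand-in scorer: counts cue words for the label named in the hypothesis."""
    label = hypothesis.rstrip(".").split()[-1]
    return sum(cue in premise.lower() for cue in CUES[label])


print(predict("Ibuprofen really helped my back pain.", "ibuprofen",
              toy_entail_score))
```

In practice `entail_score` would be replaced by the entailment probability from a fine-tuned sentence-pair classifier; the toy scorer exists only so the example runs on its own.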

Comparing Generative Chatbots Based on Process Requirements. (arXiv:2312.03741v1 [cs.CL])

Authors: Luis Fernando Lins, Nathalia Nascimento, Paulo Alencar, Toacy Oliveira, Donald Cowan

Business processes are commonly represented by modelling languages, such as Event-driven Process Chain (EPC), Yet Another Workflow Language (YAWL), and the most popular standard notation for modelling business processes, the Business Process Model and Notation (BPMN). Most recently, chatbots, programs that allow users to interact with a machine using natural language, have been increasingly used for business process execution support. A recent category of chatbots worth mentioning is generative-based chatbots, powered by Large Language Models (LLMs) such as OpenAI's Generative Pre-Trained Transformer (GPT) model and Google's Pathways Language Model (PaLM), which have billions of parameters and support conversational intelligence. However, it is not clear whether generative-based chatbots are able to understand and meet the requirements of constructs such as those provided by BPMN for process execution support. This paper presents a case study to compare the performance of prominent generative models, GPT and PaLM, in the context of process execution support. The research sheds light on the challenging problem of using conversational approaches supported by generative chatbots as a means to understand process-aware modelling notations and support users to execute their tasks.

Evaluating Large Language Model Creativity from a Literary Perspective. (arXiv:2312.03746v1 [cs.CL])

Authors: Murray Shanahan, Catherine Clarke

This paper assesses the potential for large language models (LLMs) to serve as assistive tools in the creative writing process, by means of a single, in-depth case study. In the course of the study, we develop interactive and multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of the given samples. We qualitatively evaluate the results from a literary critical perspective, as well as from the standpoint of computational creativity (a sub-field of artificial intelligence). Our findings lend support to the view that the sophistication of the results that can be achieved with an LLM mirrors the sophistication of the prompting.

Classifying patient voice in social media data using neural networks: A comparison of AI models on different data sources and therapeutic domains. (arXiv:2312.03747v1 [cs.CL])

Authors: Giorgos Lysandrou, Roma English Owen, Vanja Popovic, Grant Le Brun, Beatrice Alex, Elizabeth A. L. Fairley

It is essential that healthcare professionals and members of the healthcare community can access and easily understand patient experiences in the real world, so that care standards can be improved and driven towards personalised drug treatment. Social media platforms and message boards are deemed suitable sources of patient experience information, as patients have been observed to discuss and exchange knowledge, look for and provide support online. This paper tests the hypothesis that not all online patient experience information can be treated and collected in the same way, as a result of the inherent differences in the way individuals talk about their journeys, in different therapeutic domains and or data sources.

We used linguistic analysis to understand and identify similarities between datasets, across patient language, between data sources (Reddit, SocialGist) and therapeutic domains (cardiovascular, oncology, immunology, neurology). We detected common vocabulary used by patients in the same therapeutic domain across data sources, except for immunology patients, who use unique vocabulary between the two data sources, and compared to all other datasets. We combined linguistically similar datasets to train classifiers (CNN, transformer) to accurately identify patient experience posts from social media, a task we refer to as patient voice classification. The cardiovascular and neurology transformer classifiers perform the best in their respective comparisons for the Reddit data source, achieving F1-scores of 0.865 and 1.0 respectively. The overall best performing classifier is the transformer classifier trained on all data collected for this experiment, achieving F1-scores ranging between 0.863 and 0.995 across all therapeutic domain and data source specific test datasets.

Applying Large Language Models and Chain-of-Thought for Automatic Scoring. (arXiv:2312.03748v1 [cs.CL])

Authors: Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai

This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of automatic assessment tools among researchers and educators. We used a testing dataset comprising six assessment tasks (three binomial and three trinomial) with 1,650 student responses. We employed six prompt engineering strategies, combining zero-shot or few-shot learning with CoT, either alone or alongside item stem and scoring rubrics. Results indicated that few-shot (acc = .67) outperformed zero-shot learning (acc = .60), with a 12.6% increase. CoT, when used without item stem and scoring rubrics, did not significantly affect scoring accuracy (acc = .60). However, CoT prompting paired with contextual item stems and rubrics proved to be a significant contributor to scoring accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). Using a novel approach, PPEAS, we found a more balanced accuracy across different proficiency categories, highlighting the importance of domain-specific reasoning in enhancing the effectiveness of LLMs in scoring tasks. We also found that GPT-4 demonstrated superior performance over GPT-3.5 in various scoring tasks, showing an 8.64% difference. The study revealed that the single-call strategy with GPT-4, particularly using greedy sampling, outperformed other approaches, including ensemble voting strategies. This study demonstrates the potential of LLMs in facilitating automatic scoring, emphasizing that CoT enhances accuracy, particularly when used with item stem and scoring rubrics.

Conceptual Engineering Using Large Language Models. (arXiv:2312.03749v1 [cs.CL])

Authors: Bradley P. Allen

We describe a method, based on Jennifer Nado's definition of classification procedures as targets of conceptual engineering, that implements such procedures using a large language model. We then apply this method using data from the Wikidata knowledge graph to evaluate concept definitions from two paradigmatic conceptual engineering projects: the International Astronomical Union's redefinition of PLANET and Haslanger's ameliorative analysis of WOMAN. We discuss implications of this work for the theory and practice of conceptual engineering. The code and data can be found on GitHub.

Near-real-time Earthquake-induced Fatality Estimation using Crowdsourced Data and Large-Language Models. (arXiv:2312.03755v1 [cs.CL])

Authors: Chenguang Wang, Davis Engler, Xuechun Li, James Hou, David J. Wald, Kishor Jaiswal, Susu Xu

When a damaging earthquake occurs, immediate information about casualties is critical for time-sensitive decision-making by emergency response and aid agencies in the first hours and days. Systems such as Prompt Assessment of Global Earthquakes for Response (PAGER) by the U.S. Geological Survey (USGS) were developed to provide a forecast within about 30 minutes of any significant earthquake globally. Traditional systems for estimating human loss in disasters often depend on manually collected early casualty reports from global media, a process that's labor-intensive and slow with notable time delays. Recently, some systems have employed keyword matching and topic modeling to extract relevant information from social media. However, these methods struggle with the complex semantics in multilingual texts and the challenge of interpreting ever-changing, often conflicting reports of death and injury numbers from various unverified sources on social media platforms. In this work, we introduce an end-to-end framework to significantly improve the timeliness and accuracy of global earthquake-induced human loss forecasting using multi-lingual, crowdsourced social media. Our framework integrates (1) a hierarchical casualty extraction model built upon large language models, prompt design, and few-shot learning to retrieve quantitative human loss claims from social media, (2) a physical constraint-aware, dynamic-truth discovery model that discovers the truthful human loss from massive noisy and potentially conflicting human loss claims, and (3) a Bayesian updating loss projection model that dynamically updates the final loss estimation using discovered truths. We test the framework in real-time on a series of global earthquake events in 2021 and 2022 and show that our framework streamlines casualty data retrieval, achieving speed and accuracy comparable to manual methods by USGS.

LineConGraphs: Line Conversation Graphs for Effective Emotion Recognition using Graph Neural Networks. (arXiv:2312.03756v1 [cs.CL])

Authors: Gokul S Krishnan, Sarala Padi, Craig S. Greenberg, Balaraman Ravindran, Dinesh Manoch, Ram D. Sriram

Emotion Recognition in Conversations (ERC) is a critical aspect of affective computing, and it has many practical applications in healthcare, education, chatbots, and social media platforms. Earlier approaches for ERC analysis involved modeling both speaker and long-term contextual information using graph neural network architectures. However, it is ideal to deploy speaker-independent models for real-world applications. Additionally, long context windows can potentially create confusion in recognizing the emotion of an utterance in a conversation. To overcome these limitations, we propose novel line conversation graph convolutional network (LineConGCN) and graph attention (LineConGAT) models for ERC analysis. These models are speaker-independent and built using a graph construction strategy for conversations -- line conversation graphs (LineConGraphs). The conversational context in LineConGraphs is short-term -- limited to one previous and future utterance, and speaker information is not part of the graph. We evaluate the performance of our proposed models on two benchmark datasets, IEMOCAP and MELD, and show that our LineConGAT model outperforms the state-of-the-art methods with an F1-score of 64.58% and 76.50%. Moreover, we demonstrate that embedding sentiment shift information into line conversation graphs further enhances the ERC performance in the case of GCN models.
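
The short-term graph construction described above (each utterance linked only to one previous and one future utterance, with no speaker information) can be sketched as an edge list. This is a minimal sketch under assumptions: the function name, the use of directed edges pointing into the current utterance, and the absence of self-loops are illustrative choices that the abstract does not confirm.

```python
from typing import List, Tuple


def line_conversation_edges(num_utterances: int) -> List[Tuple[int, int]]:
    """Build directed (source, target) edges for a line-shaped conversation graph.

    Each utterance i receives an edge from utterance i-1 (previous context)
    and from utterance i+1 (future context), when those utterances exist.
    No speaker identity is encoded anywhere in the graph.
    """
    edges = []
    for i in range(num_utterances):
        if i > 0:
            edges.append((i - 1, i))  # previous utterance -> current
        if i < num_utterances - 1:
            edges.append((i + 1, i))  # future utterance -> current
    return edges


edges = line_conversation_edges(4)
print(edges)
```

An edge list in this form could be handed to a GCN or GAT layer from a graph learning library; node features would be the utterance embeddings.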

Stock Movement and Volatility Prediction from Tweets, Macroeconomic Factors and Historical Prices. (arXiv:2312.03758v1 [cs.AI])

Authors: Shengkun Wang, YangXiao Bai, Taoran Ji, Kaiqun Fu, Linhan Wang, Chang-Tien Lu

Predicting the stock market is vital for investors and policymakers, as it acts as a barometer of economic health. We leverage social media data, a potent source of public sentiment, in tandem with macroeconomic indicators in the form of government-compiled statistics, to refine stock market predictions. However, prior research using tweet data for stock market prediction faces three challenges. First, the quality of tweets varies widely. While many are filled with noise and irrelevant details, only a few genuinely mirror the actual market scenario. Second, solely focusing on the historical data of a particular stock without considering its sector can lead to oversight. Stocks within the same industry often exhibit correlated price behaviors. Lastly, simply forecasting the direction of price movement without assessing its magnitude is of limited value, as the extent of the rise or fall truly determines profitability. In this paper, diverging from conventional methods, we pioneer ECON. The framework has the following advantages: First, ECON has an adept tweet filter that efficiently extracts and decodes the vast array of tweet data. Second, ECON discerns multi-level relationships among stocks, sectors, and macroeconomic factors through a self-aware mechanism in semantic space. Third, ECON offers enhanced accuracy in predicting substantial stock price fluctuations by capitalizing on stock price movement. We showcase the state-of-the-art performance of our proposed model using a dataset, specifically curated by us, for predicting stock market movements and volatility.

How should the advent of large language models affect the practice of science?. (arXiv:2312.03759v1 [cs.CL])

Authors: Marcel Binz, Stephan Alaniz, Adina Roskies, Balazs Aczel, Carl T. Bergstrom, Colin Allen, Daniel Schad, Dirk Wulff, Jevin D. West, Qiong Zhang, Richard M. Shiffrin, Samuel J. Gershman, Ven Popov, Emily M. Bender, Marco Marelli, Matthew M. Botvinick, Zeynep Akata, Eric Schulz

Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advent of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and over-hyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study. (arXiv:2312.03762v1 [cs.LG])

Authors: Karolis Ramanauskas, Özgür Şimşek

We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice, the agents seem to prefer generalization based on colour rather than shape. After training over 1,000 agents in a simplified version of the environment and evaluating them on over 10 million episodes, we conclude that the behaviour can be attributed to the agents learning to detect the goal object through a specific colour channel. This choice is arbitrary. Additionally, we show how, due to underspecification, the preferences can change when retraining the agents using exactly the same procedure except for using a different random seed for the training run. Finally, we demonstrate the existence of outliers in out-of-distribution behaviour based on training random seed alone.

Similarity-based Knowledge Transfer for Cross-Domain Reinforcement Learning. (arXiv:2312.03764v1 [cs.LG])

Authors: Sergio A. Serrano, Jose Martinez-Carranza, L. Enrique Sucar

Transferring knowledge in cross-domain reinforcement learning is a challenging setting in which learning is accelerated by reusing knowledge from a task with different observation and/or action space. However, it is often necessary to carefully select the source of knowledge for the receiving end to benefit from the transfer process. In this article, we study how to measure the similarity between cross-domain reinforcement learning tasks to select a source of knowledge that will improve the performance of the learning agent. We developed a semi-supervised alignment loss to match different spaces with a set of encoder-decoders, and use them to measure similarity and transfer policies across tasks. In comparison to prior works, our method does not require data to be aligned, paired or collected by expert policies. Experimental results, on a set of varied Mujoco control tasks, show the robustness of our method in effectively selecting and transferring knowledge, without the supervision of a tailored set of source tasks.

Unknown Sample Discovery for Source Free Open Set Domain Adaptation. (arXiv:2312.03767v1 [cs.CV])

Authors: Chowdhury Sadman Jahan, Andreas Savakis

Open Set Domain Adaptation (OSDA) aims to adapt a model trained on a source domain to a target domain that undergoes distribution shift and contains samples from novel classes outside the source domain. Source-free OSDA (SF-OSDA) techniques eliminate the need to access source domain samples, but current SF-OSDA methods utilize only the known classes in the target domain for adaptation, and require access to the entire target domain even during inference after adaptation, to make the distinction between known and unknown samples. In this paper, we introduce Unknown Sample Discovery (USD) as an SF-OSDA method that utilizes a temporally ensembled teacher model to conduct known-unknown target sample separation and adapts the student model to the target domain over all classes using co-training and temporal consistency between the teacher and the student. USD promotes Jensen-Shannon distance (JSD) as an effective measure for known-unknown sample separation. Our teacher-student framework significantly reduces error accumulation resulting from imperfect known-unknown sample separation, while curriculum guidance helps to reliably learn the distinction between target known and target unknown subspaces. USD appends the target model with an unknown class node, thus readily classifying a target sample into any of the known or unknown classes in subsequent post-adaptation inference stages. Empirical results show that USD is superior to existing SF-OSDA methods and is competitive with current OSDA models that utilize both source and target domains during adaptation.
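
For reference, the Jensen-Shannon distance mentioned above can be computed as the square root of the Jensen-Shannon divergence between two discrete distributions. The sketch below uses base-2 logarithms so the result lies in [0, 1]. How USD chooses the two distributions being compared is not stated in the abstract, so the inputs here are purely illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the Jensen-Shannon divergence between p and q."""
    # The mixture m is positive wherever p or q is positive, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Illustrative inputs only: two hypothetical class-probability vectors.
print(jensen_shannon_distance([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```

Unlike the KL divergence, this quantity is symmetric and bounded, which makes a fixed or learned threshold on it easier to interpret.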

GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science. (arXiv:2312.03769v1 [cs.CL])

Authors: Chenxi Wu, Alan John Varghese, Vivek Oommen, George Em Karniadakis

The new polymath Large Language Models (LLMs) can greatly speed up scientific reviews, possibly using more unbiased quantitative metrics, facilitating cross-disciplinary connections, and identifying emerging trends and research gaps by analyzing large volumes of data. However, at the present time, they lack the required deep understanding of complex methodologies, they have difficulty in evaluating innovative claims, and they are unable to assess ethical issues and conflicts of interest. Herein, we consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model, with the reviews evaluated by three distinct types of evaluators, namely GPT-3.5, a crowd panel, and GPT-4. We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer, with GPT-4 (informed evaluator) often rating the human reviewer higher in accuracy, and SciSpace higher in structure, clarity, and completeness. In subjective questions, the uninformed evaluators (GPT-3.5 and crowd panel) showed varying preferences between SciSpace and human responses, with the crowd panel showing a preference for the human responses. However, GPT-4 rated them equally in accuracy and structure but favored SciSpace for completeness.

Lite-Mind: Towards Efficient and Versatile Brain Representation Network. (arXiv:2312.03781v1 [cs.CV])

Authors: Zixuan Gong, Qi Zhang, Duoqian Miao, Guangyin Bao, Liang Hu

Research in decoding visual information from the brain, particularly through the non-invasive fMRI method, is rapidly progressing. The challenge arises from the limited data availability and the low signal-to-noise ratio of fMRI signals, leading to a low-precision task of fMRI-to-image retrieval. State-of-the-art MindEye remarkably improves fMRI-to-image retrieval performance by leveraging a deep MLP with a parameter count that is orders of magnitude higher, i.e., a 996M-parameter MLP backbone per subject, to align fMRI embeddings to the final hidden layer of CLIP's vision transformer. However, significant individual variations exist among subjects, even within identical experimental setups, mandating the training of subject-specific models. The substantial parameter count poses significant challenges for deploying fMRI decoding on practical devices, especially given the need for a specific model for each subject. To this end, we propose Lite-Mind, a lightweight, efficient, and versatile brain representation network based on the discrete Fourier transform that efficiently aligns fMRI voxels to fine-grained information of CLIP. Our experiments demonstrate that Lite-Mind achieves an impressive 94.3% fMRI-to-image retrieval accuracy on the NSD dataset for Subject 1, with 98.7% fewer parameters than MindEye. Lite-Mind is also shown to transfer to smaller brain datasets and establishes a new state-of-the-art for zero-shot classification on the GOD dataset. The code is available at

Sports Recommender Systems: Overview and Research Issues. (arXiv:2312.03785v1 [cs.IR])

Authors: Alexander Felfernig, Manfred Wundara, Thi Ngoc Trang Tran, Viet-Man Le, Sebastian Lubos, Seda Polat-Erdeniz

Sports recommender systems are receiving increasing attention due to their potential for fostering healthy living, improving personal well-being, and increasing performance in sport. These systems support people in sports, for example, by the recommendation of healthy and performance-boosting food items, the recommendation of training practices, talent and team recommendation, and the recommendation of specific tactics in competitions. With applications in the virtual world, for example, the recommendation of maps or opponents in e-sports, these systems already transcend conventional sports scenarios where physical presence is needed. On the basis of different working examples, we present an overview of sports recommender systems applications and techniques. Overall, we analyze the related state-of-the-art and discuss open research issues.

Multi-Scale and Multi-Modal Contrastive Learning Network for Biomedical Time Series. (arXiv:2312.03796v1 [cs.LG])

Authors: Hongbo Guo, Xinzi Xu, Hao Wu, Guoxing Wang

Multi-modal biomedical time series (MBTS) data offers a holistic view of the physiological state, holding significant importance in various bio-medical applications. Owing to inherent noise and distribution gaps across different modalities, MBTS can be complex to model. Various deep learning models have been developed to learn representations of MBTS but still fall short in robustness because they ignore modal-to-modal variations. This paper presents a multi-scale and multi-modal biomedical time series representation learning (MBSL) network with contrastive learning to mitigate these variations. Firstly, MBTS is grouped based on inter-modal distances, then each group with minimum intra-modal variations can be effectively modeled by individual encoders. Besides, to enhance the multi-scale feature extraction (encoder), various patch lengths and mask ratios are designed to generate tokens with semantic information at different scales and diverse contextual perspectives respectively. Finally, cross-modal contrastive learning is proposed to maximize consistency among inter-modal groups, maintaining useful information and eliminating noise. Experiments on four bio-medical applications show that MBSL outperforms state-of-the-art models by 33.9% mean average errors (MAE) in respiration rate, by 13.8% MAE in exercise heart rate, by 1.41% accuracy in human activity recognition, and by 1.14% F1-score in obstructive sleep apnea-hypopnea syndrome.

Low-power, Continuous Remote Behavioral Localization with Event Cameras. (arXiv:2312.03799v1 [cs.CV])

Authors: Friedhelm Hamann, Suman Ghosh, Ignacio Juarez Martinez, Tom Hart, Alex Kacelnik, Guillermo Gallego

Researchers in natural science need reliable methods for quantifying animal behavior. Recently, numerous computer vision methods have emerged to automate the process. However, observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage. Event cameras offer unique advantages for battery-dependent remote monitoring due to their low power consumption and high dynamic range capabilities. We use this novel sensor to quantify a behavior in Chinstrap penguins called ecstatic display. We formulate the problem as a temporal action detection task, determining the start and end times of the behavior. For this purpose, we recorded a colony of breeding penguins in Antarctica over several weeks and labeled event data on 16 nests. The developed method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them. The experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection, reaching a mean average precision (mAP) of 58% (which increases to 63% in good weather conditions). The results also demonstrate robustness against the various lighting conditions contained in the challenging dataset. The low-power capabilities of the event camera allow recording three times longer than with a conventional camera. This work pioneers the use of event cameras for remote wildlife observation, opening new interdisciplinary opportunities.

Generalization to New Sequential Decision Making Tasks with In-Context Learning. (arXiv:2312.03801v1 [cs.LG])

Authors: Sharath Chandra Raparthy, Eric Hambro, Robert Kirk, Mikael Henaff, Roberta Raileanu

Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.

Improving Activation Steering in Language Models with Mean-Centring. (arXiv:2312.03813v1 [cs.CL])

Authors: Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan

Recent work in activation steering has demonstrated the potential to better control the outputs of Large Language Models (LLMs), but it involves finding steering vectors. This is difficult because engineers do not typically know how features are represented in these models. We seek to address this issue by applying the idea of mean-centring to steering vectors. We find that taking the average of activations associated with a target dataset, and then subtracting the mean of all training activations, results in effective steering vectors. We test this method on a variety of models on natural language tasks by steering away from generating toxic text, and steering the completion of a story towards a target genre. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin (compared to previous baselines). This suggests that mean-centring can be used to easily improve the effectiveness of activation steering in a wide range of contexts.
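
The mean-centring step described above (average the activations for the target dataset, then subtract the mean of all training activations) can be sketched in a few lines. This is a minimal illustration using plain Python lists: the function names and the toy activation values are assumptions, and how activations are extracted from a model or applied during generation is outside the sketch.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean over a non-empty list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def mean_centred_steering_vector(
    target_acts: Sequence[Sequence[float]],
    training_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Mean of the target-dataset activations minus the mean of all training activations."""
    target_mean = mean_vector(target_acts)
    training_mean = mean_vector(training_acts)
    return [t - b for t, b in zip(target_mean, training_mean)]


# Toy 2-dimensional activations, made up purely for illustration.
target = [[2.0, 0.0], [4.0, 2.0]]
training = [[1.0, 1.0], [1.0, 3.0], [4.0, 2.0]]
steering = mean_centred_steering_vector(target, training)
print(steering)
```

Subtracting the training mean removes the component shared by all activations, so the remaining vector reflects what is specific to the target dataset.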

Pearl: A Production-ready Reinforcement Learning Agent. (arXiv:2312.03814v1 [cs.LG])

Authors: Zheqing Zhu, Rodrigo de Salvo Braz, Jalaj Bhandari, Daniel Jiang, Yi Wan, Yonathan Efroni, Liyuan Wang, Ruiyang Xu, Hongbo Guo, Alex Nikulkov, Dmytro Korenkevych, Urun Dogan, Frank Cheng, Zheng Wu, Wanqiao Xu

Reinforcement Learning (RL) offers a versatile framework for achieving long-term goals. Its generality allows us to formalize a wide range of problems that real-world intelligent systems encounter, such as dealing with delayed rewards, handling partial observability, addressing the exploration and exploitation dilemma, utilizing offline data to improve online performance, and ensuring safety constraints are met. Despite considerable progress made by the RL research community in addressing these issues, existing open-source RL libraries tend to focus on a narrow portion of the RL solution pipeline, leaving other aspects largely unattended. This paper introduces Pearl, a Production-ready RL agent software package explicitly designed to embrace these challenges in a modular fashion. In addition to presenting preliminary benchmark results, this paper highlights Pearl's industry adoptions to demonstrate its readiness for production usage. Pearl is open sourced on Github at and its official website is located at

LLM as OS (llmao), Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem. (arXiv:2312.03815v1 [cs.OS])

Authors: Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, Yongfeng Zhang

This paper envisions a revolutionary AIOS-Agent ecosystem, where a Large Language Model (LLM) serves as the (Artificial) Intelligent Operating System (IOS, or AIOS)--an operating system ``with soul''. Upon this foundation, a diverse range of LLM-based AI Agent Applications (Agents, or AAPs) are developed, enriching the AIOS-Agent ecosystem and signaling a paradigm shift from the traditional OS-APP ecosystem. We envision that the impact of LLMs will not be limited to the AI application level; instead, it will in turn revolutionize the design and implementation of computer systems, architectures, software, and programming languages, featuring several main concepts: LLM as OS (system-level), Agents as Applications (application-level), Natural Language as Programming Interface (user-level), and Tools as Devices/Libraries (hardware/middleware-level).

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want. (arXiv:2312.03818v1 [cs.CV])

Authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill the requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions and fine-tuned with constructed millions of RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation. It has a strong potential to serve as a versatile tool for image-related tasks.

Efficient Large Language Models: A Survey. (arXiv:2312.03863v1 [cs.CL])

Authors: Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang

Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding, language generation, and complex reasoning and have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspective, respectively. We have also created a GitHub repository where we compile the papers featured in this survey at,, and will actively maintain this repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

The BigCode Project Governance Card. (arXiv:2312.03872v1 [cs.CY])

Authors: BigCode collaboration: Sean Hughes, Harm de Vries, Jennifer Robinson, Carlos Muñoz Ferrandis, Loubna Ben Allal, Leandro von Werra, Jennifer Ding, Sebastien Paquet, Yacine Jernite

This document serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support transparency by providing relevant information about choices that were made during the project to the broader public, and to serve as an example of intentional governance of an open research project that future endeavors can leverage to shape their own approach. The first section, Project Structure, covers the project organization, its stated goals and values, its internal decision processes, and its funding and resources. The second section, Data and Model Governance, covers decisions relating to the questions of data subject consent, privacy, and model release.

Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. (arXiv:2312.03876v1 [])

Authors: Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Sandeep Madireddy, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster, Aditya Grover

Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies that are competitive with operational systems. However, those methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success. Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. We identify the key components of Stormer through careful empirical analyses, including weather-specific embedding, randomized dynamics forecast, and pressure-weighted loss. At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy. On WeatherBench 2, Stormer performs competitively at short to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders-of-magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy with increases in model size and training tokens. Code and checkpoints will be made publicly available.
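The inference-time combination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the interval set (6, 12, 24 hours), the use of homogeneous rollouts, the plain averaging, and the `toy_step` dynamics are all assumptions made for the example.

```python
from statistics import fmean


def toy_step(state, dt_hours):
    """Hypothetical stand-in for a learned model that advances the state by dt_hours."""
    # The quadratic term mimics an interval-dependent error; it is illustrative only.
    return state + 0.5 * dt_hours + 0.01 * dt_hours ** 2


def rollout(state, dt_hours, lead_hours, step=toy_step):
    """Reach lead_hours by repeatedly applying the same interval dt_hours."""
    for _ in range(lead_hours // dt_hours):
        state = step(state, dt_hours)
    return state


def ensemble_forecast(state, lead_hours, intervals=(6, 12, 24), step=toy_step):
    """Average the forecasts from every interval that evenly divides the lead time."""
    forecasts = [
        rollout(state, dt, lead_hours, step)
        for dt in intervals
        if lead_hours % dt == 0
    ]
    if not forecasts:
        raise ValueError("no interval divides the requested lead time")
    return fmean(forecasts), forecasts


mean_forecast, members = ensemble_forecast(0.0, lead_hours=24)
```

With a 24-hour lead time, all three assumed intervals contribute one ensemble member each; with a 12-hour lead time, only the 6-hour and 12-hour rollouts do.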

FoMo Rewards: Can we cast foundation models as reward functions?. (arXiv:2312.03881v1 [cs.LG])

Authors: Ekdeep Singh Lubana, Johann Brehmer, Pim de Haan, Taco Cohen

We explore the viability of casting foundation models as generic reward functions for reinforcement learning. To this end, we propose a simple pipeline that interfaces an off-the-shelf vision model with a large language model. Specifically, given a trajectory of observations, we infer the likelihood of an instruction describing the task that the user wants an agent to perform. We show that this generic likelihood function exhibits the characteristics ideally expected from a reward function: it associates high values with the desired behaviour and lower values for several similar, but incorrect policies. Overall, our work opens the possibility of designing open-ended agents for interactive tasks via foundation models.

On The Fairness Impacts of Hardware Selection in Machine Learning. (arXiv:2312.03886v1 [cs.LG])

Authors: Sree Harsha Nelaturu, Nishaanth Kanna Ravichandran, Cuong Tran, Sara Hooker, Ferdinando Fioretto

In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.

A Masked Pruning Approach for Dimensionality Reduction in Communication-Efficient Federated Learning Systems. (arXiv:2312.03889v1 [cs.LG])

Authors: Tamir L.S. Gez, Kobi Cohen

Federated Learning (FL) represents a growing machine learning (ML) paradigm designed for training models across numerous nodes that retain local datasets, all without directly exchanging the underlying private data with the parameter server (PS). Its increasing popularity is attributed to notable advantages in terms of training deep neural network (DNN) models under privacy aspects and efficient utilization of communication resources. Unfortunately, DNNs suffer from high computational and communication costs, as well as memory consumption in intricate tasks. These factors restrict the applicability of FL algorithms in communication-constrained systems with limited hardware resources.

In this paper, we develop a novel algorithm that overcomes these limitations by synergistically combining a pruning-based method with the FL process, resulting in low-dimensional representations of the model with minimal communication cost, dubbed Masked Pruning over FL (MPFL). The algorithm operates by initially distributing weights to the nodes through the PS. Subsequently, each node locally trains its model and computes pruning masks. These low-dimensional masks are then transmitted back to the PS, which generates a consensus pruning mask that is broadcast back to the nodes. This iterative process enhances the robustness and stability of the masked pruning model. The generated mask is used to train the FL model, achieving significant bandwidth savings. We present an extensive experimental study demonstrating the superior performance of MPFL compared to existing methods. Additionally, we have developed an open-source software package for the benefit of researchers and developers in related fields.
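One round of the mask exchange can be sketched as below. The abstract does not say how nodes compute their masks or how the PS forms the consensus, so magnitude pruning and majority voting are assumptions chosen for illustration, and the function names are hypothetical.

```python
def local_mask(weights, keep_ratio):
    """Binary mask that keeps the largest-magnitude weights (assumed pruning rule)."""
    k = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]


def consensus_mask(masks):
    """Majority vote (assumed rule): keep a weight if more than half of the nodes keep it."""
    n_nodes = len(masks)
    return [1 if 2 * sum(column) > n_nodes else 0 for column in zip(*masks)]


# Three nodes with locally trained weights for the same four parameters.
node_weights = [
    [0.9, -0.1, 0.5, 0.05],
    [0.8, 0.7, -0.2, 0.1],
    [-0.6, 0.1, 0.9, 0.2],
]
masks = [local_mask(w, keep_ratio=0.5) for w in node_weights]
shared = consensus_mask(masks)
```

Only the binary masks travel to the PS in this sketch, which is where the bandwidth saving would come from: one bit per parameter instead of a full-precision weight.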

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints. (arXiv:2312.03905v1 [cs.LG])

Authors: Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck

Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning. This often requires maximizing the likelihood of a symbolic constraint w.r.t. the neural network's output distribution. Such output distributions are typically assumed to be fully-factorized. This limits the applicability of neuro-symbolic learning to the more expressive autoregressive distributions, e.g., transformers. Under such distributions, computing the likelihood of even simple constraints is #P-hard. Instead of attempting to enforce the constraint on the entire output distribution, we propose to do so on a random, local approximation thereof. More precisely, we optimize the likelihood of the constraint under a pseudolikelihood-based approximation centered around a model sample. Our approximation is factorized, allowing the reuse of solutions to sub-problems, a main tenet for efficiently computing neuro-symbolic losses. Moreover, it is a local, high-fidelity approximation of the likelihood, exhibiting low entropy and KL-divergence around the model sample. We evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation, and observe that we greatly improve upon the base model's ability to predict logically-consistent outputs. We also evaluate on the task of detoxifying large language models. Using a simple constraint disallowing a list of toxic words, we are able to steer the model's outputs away from toxic generations, achieving SoTA detoxification compared to previous approaches.

Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models. (arXiv:2312.03970v1 [cs.CV])

Authors: Shibin Wu, Bang Yang, Zhiyu Ye, Haoqian Wang, Hairong Zheng, Tong Zhang

Medical report generation demands automatic creation of coherent and precise descriptions for medical images. However, the scarcity of labelled medical image-report pairs poses formidable challenges in developing large-scale neural networks capable of harnessing the potential of artificial intelligence, exemplified by large language models. This study builds upon the state-of-the-art vision-language pre-training and fine-tuning approach, BLIP-2, to customize general large-scale foundation models. Integrating adapter tuning and a medical knowledge enhancement loss, our model significantly improves accuracy and coherence. Validation on the dataset of ImageCLEFmedical 2023 demonstrates our model's prowess, achieving the best-averaged results against several state-of-the-art methods. Significant improvements in ROUGE and CIDEr underscore our method's efficacy, highlighting promising outcomes for the rapid medical-domain adaptation of the vision-language foundation models in addressing challenges posed by data scarcity.

Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration. (arXiv:2312.03987v1 [cs.CL])

Authors: Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, Guoliang Li, Xiaoyong Du

Entity resolution (ER) is an important data integration task with a wide spectrum of applications. The state-of-the-art solutions on ER rely on pre-trained language models (PLMs), which require fine-tuning on a lot of labeled matching/non-matching entity pairs. Recently, large language models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters. This ability, known as in-context learning (ICL), facilitates effective learning from a few labeled input context demonstrations. However, existing ICL approaches to ER typically necessitate providing a task description and a set of demonstrations for each entity pair and thus have limitations on the monetary cost of interfacing LLMs. To address the problem, in this paper, we provide a comprehensive study to investigate how to develop a cost-effective batch prompting approach to ER. We introduce a framework BATCHER consisting of demonstration selection and question batching and explore different design choices that support batch prompting for ER. We also devise a covering-based demonstration selection strategy that achieves an effective balance between matching accuracy and monetary cost. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Through extensive experiments, we find that batch prompting is very cost-effective for ER, compared with not only PLM-based methods fine-tuned with extensive labeled data but also LLM-based methods with manually designed prompting. We also provide guidance for selecting appropriate design choices for batch prompting.
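The cost argument for batch prompting can be made concrete with a small prompt builder. This is only a sketch of the general idea: the prompt wording, the `build_batch_prompts` helper, and the fixed-size batching are assumptions for illustration and are not BATCHER's actual design or its demonstration selection strategy.

```python
TASK = (
    "Decide whether each pair of records refers to the same real-world entity. "
    "Answer 'yes' or 'no' for every numbered item."
)


def format_pair(left, right):
    return f"Record A: {left}\nRecord B: {right}"


def build_batch_prompts(pairs, demonstrations, batch_size=4):
    """Pack several entity pairs into each prompt so that the task description
    and the demonstrations are paid for once per batch, not once per pair."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    demo_blocks = [
        f"{format_pair(left, right)}\nAnswer: {'yes' if label else 'no'}"
        for left, right, label in demonstrations
    ]
    prompts = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        questions = [
            f"Question {i}:\n{format_pair(left, right)}"
            for i, (left, right) in enumerate(batch, start=1)
        ]
        prompts.append("\n\n".join([TASK, *demo_blocks, *questions]))
    return prompts


demos = [("iPhone 13 128GB", "Apple iPhone 13 (128 GB)", True)]
pairs = [(f"left record {i}", f"right record {i}") for i in range(5)]
prompts = build_batch_prompts(pairs, demos, batch_size=2)
```

Five pairs with a batch size of two yield three prompts instead of five, so the shared preamble is sent three times rather than five.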

MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator. (arXiv:2312.03991v1 [cs.LG])

Authors: Xiao-Yin Liu, Xiao-Hu Zhou, Guo-Tao Li, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Zeng-Guang Hou

Offline reinforcement learning (RL) faces a significant challenge of distribution shift. Model-free offline RL penalizes the Q value for out-of-distribution (OOD) data or constrains the policy to stay close to the behavior policy to tackle this problem, but this inhibits the exploration of the OOD region. Model-based offline RL, which uses the trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective method for this problem. However, the current model-based algorithms rarely consider agent robustness when incorporating conservatism into policy. Therefore, a new model-based offline algorithm with a conservative Bellman operator (MICRO) is proposed. This method trades off performance and robustness by introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO can significantly reduce the computation cost by only choosing the minimal Q value in the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms on offline RL benchmarks and is considerably robust to adversarial perturbations.
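One plausible reading of "choosing the minimal Q value in the state uncertainty set" is a backup target that takes the worst case over perturbed next states. The sketch below assumes a one-dimensional state, an interval-shaped uncertainty set sampled on a grid, and a toy Q function; none of these details come from the abstract, and the actual MICRO operator may differ.

```python
def conservative_target(reward, next_state, q_fn, actions, radius, gamma=0.99, n_points=5):
    """Backup target r + gamma * min over s' in the uncertainty set of max over a of Q(s', a).

    The uncertainty set is assumed to be the interval [next_state - radius,
    next_state + radius], approximated by n_points evenly spaced samples.
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    candidates = [
        next_state - radius + 2 * radius * i / (n_points - 1)
        for i in range(n_points)
    ]
    worst_value = min(max(q_fn(s, a) for a in actions) for s in candidates)
    return reward + gamma * worst_value


def toy_q(state, action):
    """Hypothetical Q function: with actions -1 and +1, the greedy value is |state|."""
    return state * action


standard = 1.0 + 0.9 * max(toy_q(1.0, a) for a in (-1, 1))
robust = conservative_target(1.0, 1.0, toy_q, actions=(-1, 1), radius=0.5, gamma=0.9)
```

In this toy case the standard target is 1.9 while the conservative one is 1.45, because the worst next state in the set has a lower greedy value.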

Style Transfer to Calvin and Hobbes comics using Stable Diffusion. (arXiv:2312.03993v1 [cs.CV])

Authors: Sloke Shrestha, Sundar Sripada V. S., Asvin Venkataramanan

This project report summarizes our journey to perform stable diffusion fine-tuning on a dataset containing Calvin and Hobbes comics. The purpose is to convert any given input image into the comic style of Calvin and Hobbes, essentially performing style transfer. We train stable-diffusion-v1.5 using Low Rank Adaptation (LoRA) to efficiently speed up the fine-tuning process. The diffusion itself is handled by a Variational Autoencoder (VAE), which is a U-net. Our results were visually appealing for the amount of training time and the quality of input data that went into training.
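For readers unfamiliar with LoRA, the core idea is to keep a pretrained weight matrix frozen and learn a low-rank update to it. The sketch below uses plain Python lists and tiny matrices to show the arithmetic; the matrix values are made up and it is not the training code used in the report.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1, 2, 3], [4, 5, 6]]   # frozen pretrained weight, d_out=2, d_in=3
a = [[1, 0, 1]]              # trainable, rank r=1
b_init = [[0], [0]]          # B starts at zero, so the update starts at zero
b_trained = [[1], [2]]       # made-up values after some training

unchanged = lora_effective_weight(w, a, b_init, alpha=2)
adapted = lora_effective_weight(w, a, b_trained, alpha=2)
```

Only `a` and `b` are trained, which is r * (d_in + d_out) values instead of d_in * d_out. The saving is negligible at this toy size but large for the attention matrices of a diffusion model.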

KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis. (arXiv:2312.04005v1 [cs.CV])

Authors: Youngwan Lee, Kwanyong Park, Yoorhim Cho, Yong-Ju Lee, Sung Ju Hwang

Stable diffusion is the mainstay of the text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature. Recently, Stable Diffusion XL (SDXL), the successor of stable diffusion, has received a lot of attention due to its significant performance improvements with a higher resolution of 1024x1024 and a larger model. However, its increased computation cost and model size require higher-end hardware (e.g., GPUs with more VRAM) for end-users, incurring higher costs of operation. To address this problem, in this work, we propose an efficient latent diffusion model for text-to-image synthesis obtained by distilling the knowledge of SDXL. To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis. Secondly, we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and eventually identify four essential factors, the core of which is that self-attention is the most important part. With our efficient U-Net and self-attention-based knowledge distillation strategy, we build our efficient T2I models, called KOALA-1B & -700M, while reducing the model size up to 54% and 69% of the original SDXL model. In particular, the KOALA-700M is more than twice as fast as SDXL while still retaining a decent generation quality. We hope that due to its balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments.

Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models. (arXiv:2312.04019v1 [q-bio.BM])

Authors: Yijie Zhang, Zhangyang Gao, Cheng Tan, Stan Z. Li

Predicting protein stability changes induced by single-point mutations has been a persistent challenge over the years, attracting immense interest from numerous researchers. The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry, including drug development, protein evolution analysis, and enzyme synthesis. Despite the proposition of multiple methodologies aimed at addressing this issue, few approaches have successfully achieved optimal performance coupled with high computational efficiency. Two principal hurdles contribute to the existing challenges in this domain. The first is the complexity of extracting and aggregating sufficiently representative features from proteins. The second refers to the limited availability of experimental data for protein mutation analysis, further complicating the comprehensive evaluation of model performance on unseen data samples. With the advent of Large Language Models (LLMs), such as the ESM models in protein research, profound interpretation of protein features is now accessible, aided by enormous training data. LLMs can therefore facilitate a wide range of protein research. In our study, we introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations. Furthermore, we have curated a dataset meticulously designed to preclude data leakage, corresponding to two extensively employed test datasets, to facilitate a more equitable model comparison.

A Study on the Calibration of In-context Learning. (arXiv:2312.04021v1 [cs.CL])

Authors: Hanlin Zhang, Yi-Fan Zhang, Yaodong Yu, Dhruv Madeka, Dean Foster, Eric Xing, Hima Lakkaraju, Sham Kakade

Modern auto-regressive language models are trained to minimize log loss on broad data by predicting the next token so they are expected to get calibrated answers when framing a problem as a next-token prediction task. We study this for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) via crafting prompts, and investigate the trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. We conduct extensive experiments to show that such trade-offs may get worse as we increase model size, incorporate more ICL examples, and fine-tune models using instruction, dialog, or reinforcement learning from human feedback (RLHF) on carefully curated datasets. Furthermore, we find that common recalibration techniques that are widely effective such as temperature scaling provide limited gains in calibration errors, suggesting that new methods may be required for settings where models are expected to be reliable.
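Two quantities mentioned above, calibration error and temperature scaling, are easy to state in code. The sketch below computes the standard expected calibration error (ECE) with equal-width confidence bins and applies a temperature to logits before the softmax; the bin count and the example numbers are illustrative choices, not values from the study.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy within each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    n_total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(conf for conf, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n_total * abs(mean_conf - accuracy)
    return ece


# An overconfident model: 95% confidence but only half of the answers are right.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
sharp = max(softmax([2.0, 0.0], temperature=1.0))
softened = max(softmax([2.0, 0.0], temperature=2.0))
```

Temperature scaling fits a single scalar T on held-out data to shrink this gap; it changes confidences but not the predicted class.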

k* Distribution: Evaluating the Latent Space of Deep Neural Networks using Local Neighborhood Analysis. (arXiv:2312.04024v1 [cs.LG])

Authors: Shashank Kotyan, Ueda Tatsuya, Danilo Vasconcellos Vargas

Most examinations of neural networks' learned latent spaces typically employ dimensionality reduction techniques such as t-SNE or UMAP. While these methods effectively capture the overall sample distribution in the entire learned latent space, they tend to distort the structure of sample distributions within specific classes in the subset of the latent space. This distortion complicates the task of easily distinguishing classes identifiable by neural networks. In response to this challenge, we introduce the k* Distribution methodology. This approach focuses on capturing the characteristics and structure of sample distributions for individual classes within the subset of the learned latent space using local neighborhood analysis. The key concept is to facilitate easy comparison of different k* distributions, enabling analysis of how various classes are processed by the same neural network. This provides a more profound understanding of existing contemporary visualizations. Our study reveals three distinct distributions of samples within the learned latent space subset: a) Fractured, b) Overlapped, and c) Clustered. We note and demonstrate that the distribution of samples within the network's learned latent space significantly varies depending on the class. Furthermore, we illustrate that our analysis can be applied to explore the latent space of diverse neural network architectures, various layers within neural networks, transformations applied to input samples, and the distribution of training and testing data for neural networks. We anticipate that our approach will facilitate more targeted investigations into neural networks by collectively examining the distribution of different samples within the learned latent space.

Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices. (arXiv:2312.04025v1 [cs.DC])

Authors: Beibei Zhang, Hongwei Zhu, Feng Gao, Zhihui Yang, Sean Xiaoyang Wang

The escalating size of Deep Neural Networks (DNNs) has spurred a growing research interest in hosting and serving DNN models across multiple devices. A number of studies have been reported to partition a DNN model across devices, providing device placement solutions. The methods that have appeared in the literature, however, either suffer from poor placement performance due to the exponential search space or miss an optimal placement as a consequence of the reduced search space with limited heuristics. Moreover, these methods have ignored the runtime inter-operator optimization of a computation graph when coarsening the graph, which degrades the end-to-end inference performance. This paper presents Moirai, which better exploits runtime inter-operator fusion in a model to render a coarsened computation graph, reducing the search space while maintaining the inter-operator optimization provided by inference backends. Moirai also generalizes the device placement algorithm from multiple perspectives by considering inference constraints and device heterogeneity. Extensive experimental evaluation with 11 large DNNs demonstrates that Moirai outperforms the state-of-the-art counterparts, i.e., Placeto, m-SCT, and GETF, up to 4.28$\times$ in reduction of the end-to-end inference latency. Moirai code is anonymously released.

The sample complexity of multi-distribution learning. (arXiv:2312.04027v1 [cs.LG])

Authors: Binghui Peng

Multi-distribution learning generalizes the classic PAC learning to handle data coming from multiple distributions. Given a set of $k$ data distributions and a hypothesis class of VC dimension $d$, the goal is to learn a hypothesis that minimizes the maximum population loss over $k$ distributions, up to $\epsilon$ additive error. In this paper, we settle the sample complexity of multi-distribution learning by giving an algorithm of sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$. This matches the lower bound up to sub-polynomial factor and resolves the COLT 2023 open problem of Awasthi, Haghtalab and Zhao [AHZ23].
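The learning goal stated above can be written out explicitly. The symbols $D_1, \dots, D_k$ for the distributions, $\mathcal{H}$ for the hypothesis class, $L_{D_i}(h)$ for the population loss of $h$ on $D_i$, and $\hat{h}$ for the learner's output are notation assumed here for illustration; the abstract itself only names $k$, $d$, and $\epsilon$.

```latex
\[
\max_{i \in [k]} L_{D_i}(\hat{h})
\;\le\;
\min_{h \in \mathcal{H}} \, \max_{i \in [k]} L_{D_i}(h) + \epsilon
\]
```

Setting $k = 1$ recovers the usual agnostic PAC guarantee for a single distribution.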

Improved Face Representation via Joint Label Classification and Supervised Contrastive Clustering. (arXiv:2312.04029v1 [cs.CV])

Authors: Zhenduo Zhang

Face clustering tasks can learn hierarchical semantic information from large-scale data, which has the potential to help facilitate face recognition. However, there are few works on this problem. This paper explores it by proposing a joint optimization task of label classification and supervised contrastive clustering to introduce the cluster knowledge to the traditional face recognition task in two ways. We first extend ArcFace with a cluster-guided angular margin to adjust the within-class feature distribution according to the hard level of face clustering. Secondly, we propose a supervised contrastive clustering approach to pull the features to the cluster center and propose the cluster-aligning procedure to align the cluster center and the learnable class center in the classifier for joint training. Finally, extensive qualitative and quantitative experiments on popular facial benchmarks demonstrate the effectiveness of our paradigm and its superiority over the existing approaches to face recognition.

Modeling Boundedly Rational Agents with Latent Inference Budgets. (arXiv:2312.04030v1 [cs.AI])

Authors: Athul Paul Jacob, Abhishek Gupta, Jacob Andreas

We study the problem of modeling a population of agents pursuing unknown goals subject to unknown computational constraints. In standard models of bounded rationality, sub-optimal decision-making is simulated by adding homoscedastic noise to optimal decisions rather than explicitly simulating constrained inference. In this work, we introduce a latent inference budget model (L-IBM) that models agents' computational constraints explicitly, via a latent variable (inferred jointly with a model of agents' goals) that controls the runtime of an iterative inference algorithm. L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors. In three modeling tasks -- inferring navigation goals from routes, inferring communicative intents from human utterances, and predicting next moves in human chess games -- we show that L-IBMs match or outperform Boltzmann models of decision-making under uncertainty. Inferred inference budgets are themselves meaningful, efficient to compute, and correlated with measures of player skill, partner skill and task difficulty.
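The Boltzmann baseline mentioned above models sub-optimal choices by sampling actions with probability proportional to the exponentiated action value. A minimal sketch follows; the rationality parameter name `beta` and the example values are assumptions for illustration, and this is the baseline, not the L-IBM itself.

```python
import math


def boltzmann_policy(action_values, beta):
    """P(a) proportional to exp(beta * Q(a)); beta = 0 is uniform, large beta is near-greedy."""
    peak = max(action_values)
    weights = [math.exp(beta * (q - peak)) for q in action_values]
    total = sum(weights)
    return [w / total for w in weights]


values = [1.0, 2.0, 0.0]
random_agent = boltzmann_policy(values, beta=0.0)
skilled_agent = boltzmann_policy(values, beta=50.0)
```

Here a single scalar controls how noisy every decision is, independent of how hard the decision was to compute. That is the property the latent inference budget is meant to replace, by tying sub-optimality to truncated inference instead.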

Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes. (arXiv:2312.04043v1 [cs.CV])

Authors: Hmrishav Bandyopadhyay, Subhadeep Koley, Ayan Das, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Ayan Kumar Bhunia, Yi-Zhe Song

In this paper, we democratise 3D content creation, enabling precise generation of 3D shapes from abstract sketches while overcoming limitations tied to drawing skills. We introduce a novel part-level modelling and alignment framework that facilitates abstraction modelling and cross-modal correspondence. Leveraging the same part-level decoder, our approach seamlessly extends to sketch modelling by establishing correspondence between CLIPasso edgemaps and projected 3D part regions, eliminating the need for a dataset pairing human sketches and 3D shapes. Additionally, our method introduces a seamless in-position editing process as a byproduct of cross-modal part-aligned modelling. Operating in a low-dimensional implicit space, our approach significantly reduces computational demands and processing time.

A Low-Overhead Incorporation-Extrapolation based Few-Shot CSI Feedback Framework for Massive MIMO Systems. (arXiv:2312.04062v1 [cs.IT])

Authors: Binggui Zhou, Xi Yang, Jintao Wang, Shaodan Ma, Feifei Gao, Guanghua Yang

Accurate channel state information (CSI) is essential for downlink precoding at the base station (BS), especially for frequency division duplexing (FDD) wideband massive MIMO systems with OFDM. In FDD systems, CSI is attained through CSI feedback from the user equipment (UE). However, large-scale antennas and a large number of subcarriers significantly increase CSI feedback overhead. Deep learning-based CSI feedback methods have received tremendous attention in recent years due to their great capability of compressing CSI. Nonetheless, large amounts of collected samples are required to train deep learning models, which is severely challenging in practice. Besides, with the rapidly increasing number of antennas and subcarriers, the CSI feedback overhead of most of these deep learning methods also grows dramatically, owing to their focus on full-dimensional CSI feedback. To address this issue, in this paper, we propose a low-overhead Incorporation-Extrapolation based Few-Shot CSI feedback Framework (IEFSF) for massive MIMO systems. To further reduce the feedback overhead, a low-dimensional eigenvector-based CSI matrix is first formed with the incorporation process at the UE, and then recovered to the full-dimensional eigenvector-based CSI matrix at the BS via the extrapolation process. After that, to alleviate the necessity of the extensive collected samples and enable few-shot CSI feedback, we further propose a knowledge-driven data augmentation method and an artificial intelligence-generated content (AIGC)-based data augmentation method by exploiting the domain knowledge of wireless channels and by exploiting a novel generative model, respectively. Numerical results demonstrate that the proposed IEFSF can significantly reduce CSI feedback overhead by 16 times compared with existing CSI feedback methods while maintaining higher feedback accuracy using only several hundred collected samples.

Making Translators Privacy-aware on the User's Side. (arXiv:2312.04068v1 [cs.CR])

Authors: Ryoma Sato

We propose PRISM to enable users of machine translation systems to preserve the privacy of data on their own initiative. There is a growing demand to apply machine translation systems to data that require privacy protection. While several machine translation engines claim to prioritize privacy, the extent and specifics of such protection are largely ambiguous. First, there is often a lack of clarity on how and to what degree the data is protected. Even if service providers believe they have sufficient safeguards in place, sophisticated adversaries might still extract sensitive information. Second, vulnerabilities may exist outside of these protective measures, such as within communication channels, potentially leading to data leakage. As a result, users are hesitant to utilize machine translation engines for data demanding high levels of privacy protection, thereby missing out on their benefits. PRISM resolves this problem. Instead of relying on the translation service to keep data safe, PRISM provides the means to protect data on the user's side. This approach ensures that even machine translation engines with inadequate privacy measures can be used securely. For platforms already equipped with privacy safeguards, PRISM acts as an additional protection layer, further reinforcing their security. PRISM adds these privacy features without significantly compromising translation accuracy. Our experiments demonstrate the effectiveness of PRISM using real-world translators, T5 and ChatGPT (GPT-3.5-turbo), and datasets with two languages. PRISM effectively balances privacy protection with translation accuracy.

Synergistic Signals: Exploiting Co-Engagement and Semantic Links via Graph Neural Networks. (arXiv:2312.04071v1 [cs.IR])

Authors: Zijie Huang, Baolin Li, Hafez Asgharzadeh, Anne Cocos, Lingyi Liu, Evan Cox, Colby Wise, Sudarshan Lamkhede

Given a set of candidate entities (e.g. movie titles), the ability to identify similar entities is a core capability of many recommender systems. Most often this is achieved by collaborative filtering approaches, i.e. if users co-engage with a pair of entities frequently enough, the embeddings should be similar. However, relying on co-engagement data alone can result in lower-quality embeddings for new and unpopular entities. We study this problem in the context of recommender systems at Netflix. We observe that there is abundant semantic information such as genre, content maturity level, themes, etc. that complements co-engagement signals and provides interpretability in similarity models. To learn entity similarities from both data sources holistically, we propose a novel graph-based approach called SemanticGNN. SemanticGNN models entities, semantic concepts, collaborative edges, and semantic edges within a large-scale knowledge graph and conducts representation learning over it. Our key technical contributions are twofold: (1) we develop a novel relation-aware attention graph neural network (GNN) to handle the imbalanced distribution of relation types in our graph; (2) to handle web-scale graph data that has millions of nodes and billions of edges, we develop a novel distributed graph training paradigm. The proposed model is successfully deployed within Netflix and empirical experiments indicate it yields up to 35% improvement in performance on similarity judgment tasks.

Voice Recognition Robot with Real-Time Surveillance and Automation. (arXiv:2312.04072v1 [cs.RO])

Authors: Lochan Basyal

Voice recognition technology enables the execution of real-world operations through a single voice command. This paper introduces a voice recognition system that involves converting input voice signals into corresponding text using an Android application. The text messages are then transmitted through Bluetooth connectivity, serving as a communication platform. Simultaneously, a controller circuit, equipped with a Bluetooth module, receives the text signal and, following a coding mechanism, executes real-world operations. The paper extends the application of voice recognition to real-time surveillance and automation, incorporating obstacle detection and avoidance mechanisms, as well as control over lighting and horn functions through predefined voice commands. The proposed technique not only serves as an assistive tool for individuals with disabilities but also finds utility in industrial automation, enabling robots to perform specific tasks with precision.

VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models. (arXiv:2312.04087v1 [cs.CV])

Authors: Zongjie Li, Chaozheng Wang, Chaowei Liu, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao

With recent advancements in Large Multimodal Models (LMMs) across various domains, a novel prompting method called visual referring prompting has emerged, showing significant potential in enhancing human-computer interaction within multimodal systems. This method offers a more natural and flexible approach to human interaction with these systems compared to traditional text descriptions or coordinates. However, the categorization of visual referring prompting remains undefined, and its impact on the performance of LMMs has yet to be formally examined. In this study, we conduct the first comprehensive analysis of LMMs using a variety of visual referring prompting strategies. We introduce a benchmark dataset called VRPTEST, comprising 3 different visual tasks and 2,275 images, spanning diverse combinations of prompt strategies. Using VRPTEST, we conduct a comprehensive evaluation of eight versions of prominent open-source and proprietary foundation models, including two early versions of GPT-4V. We develop an automated assessment framework based on software metamorphic testing techniques to evaluate the accuracy of LMMs without the need for human intervention or manual labeling. We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%; however, there is still potential for improvement. Moreover, our quantitative analysis shows that the choice of prompt strategy significantly affects the accuracy of LMMs, with variations ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate visual referring prompting strategy can improve LMMs' understanding of context and location information, while an unsuitable one might lead to answer rejection. We also provide insights on minimizing the negative impact of visual referring prompting on LMMs.

Enhancing the Rationale-Input Alignment for Self-explaining Rationalization. (arXiv:2312.04103v1 [cs.AI])

Authors: Wei Liu, Haozhao Wang, Jun Wang, Zhiying Deng, YuanKai Zhang, Cheng Wang, Ruixuan Li

Rationalization empowers deep learning models with self-explaining capabilities through a cooperative game, where a generator selects a semantically consistent subset of the input as a rationale, and a subsequent predictor makes predictions based on the selected rationale. In this paper, we discover that rationalization is prone to a problem named rationale shift, which arises from the algorithmic bias of the cooperative game. Rationale shift refers to a situation where the semantics of the selected rationale may deviate from the original input, but the predictor still produces accurate predictions based on the deviation, resulting in a compromised generator with misleading feedback.

To address this issue, we first demonstrate the importance of the alignment between the rationale and the full input through both empirical observations and theoretical analysis. Subsequently, we introduce a novel approach called DAR (Discriminatively Aligned Rationalization), which utilizes an auxiliary module pretrained on the full input to discriminatively align the selected rationale and the original input. We theoretically illustrate how DAR accomplishes the desired alignment, thereby overcoming the rationale shift problem. The experiments on two widely used real-world benchmarks show that the proposed method significantly improves the explanation quality (measured by the overlap between the model-selected explanation and the human-annotated rationale) as compared to state-of-the-art techniques. Additionally, results on two synthetic settings further validate the effectiveness of DAR in addressing the rationale shift problem.

Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification. (arXiv:2312.04111v1 [cs.LG])

Authors: Henan Sun, Xunkai Li, Zhengyu Wu, Daohan Su, Rong-Hua Li, Guoren Wang

Recently, graph neural networks (GNNs) have shown prominent performance in semi-supervised node classification by leveraging knowledge from the graph database. However, most existing GNNs follow the homophily assumption, where connected nodes are more likely to exhibit similar feature distributions and the same labels, and such an assumption has proven to be vulnerable in a growing number of practical applications. As a supplement, heterophily reflects dissimilarity in connected nodes, which has gained significant attention in graph learning. To this end, data engineers aim to develop a powerful GNN model that can ensure performance under both homophily and heterophily. Despite numerous attempts, most existing GNNs struggle to achieve optimal node representations due to the constraints of undirected graphs. The neglect of directed edges results in sub-optimal graph representations, thereby hindering the capacity of GNNs. To address this issue, we introduce AMUD, which quantifies the relationship between node profiles and topology from a statistical perspective, offering valuable insights for Adaptively Modeling the natural directed graphs as the Undirected or Directed graph to maximize the benefits from subsequent graph learning. Furthermore, we propose Adaptive Directed Pattern Aggregation (ADPA) as a new directed graph learning paradigm for AMUD. Empirical studies have demonstrated that AMUD guides efficient graph learning. Meanwhile, extensive experiments on 14 benchmark datasets substantiate the impressive performance of ADPA, outperforming baselines by significant margins of 3.96%.
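
As a rough illustration of the homophily notion described above, the sketch below computes the commonly used edge-homophily ratio: the fraction of edges whose endpoints share a label. This is not the AMUD statistic, which the abstract does not specify; the function name, the toy graph, and the labels are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def edge_homophily(edges: Sequence[Tuple[int, int]], labels: Sequence[int]) -> float:
    """Fraction of edges (u, v) whose endpoints carry the same label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: 4 nodes, 5 directed edges, 2 classes.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
toy_labels = [0, 0, 1, 1]

# Values near 1 suggest homophily, values near 0 suggest heterophily.
ratio = edge_homophily(toy_edges, toy_labels)
print(ratio)
```

In this toy graph only two of the five edges join same-label nodes, so the graph leans towards heterophily.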

Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play. (arXiv:2312.04118v1 [cs.CV])

Authors: Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch

Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers' utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. For this we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.

Using a Large Language Model to generate a Design Structure Matrix. (arXiv:2312.04134v1 [cs.AI])

Authors: Edwin C. Y. Koh

The Design Structure Matrix (DSM) is an established method used in dependency modelling, especially in the design of complex engineering systems. The generation of DSM is traditionally carried out through manual means and can involve interviewing experts to elicit critical system elements and the relationships between them. Such manual approaches can be time-consuming and costly. This paper presents a workflow that uses a Large Language Model (LLM) to support the generation of DSM and improve productivity. A prototype of the workflow was developed in this work and applied on a diesel engine DSM published previously. It was found that the prototype could reproduce 357 out of 462 DSM entries published (i.e. 77.3%), suggesting that the work can aid DSM generation. A no-code version of the prototype is made available online to support future research.
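
The reported figure can be checked with a few lines of arithmetic, and the same cell-by-cell comparison can be sketched on a toy matrix. The 3x3 matrices and the function name below are invented for illustration and are not taken from the paper; only the counts 357 and 462 come from the abstract.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def off_diagonal_agreement(reference: Matrix, candidate: Matrix) -> Tuple[int, int]:
    """Count off-diagonal cells where two square DSMs agree, and the cell total."""
    n = len(reference)
    matches = 0
    total = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the diagonal relates an element to itself
            total += 1
            if reference[i][j] == candidate[i][j]:
                matches += 1
    return matches, total


# Hypothetical toy DSMs (1 = row element depends on column element).
published = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
generated = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
matches, total = off_diagonal_agreement(published, generated)
print(f"toy agreement: {matches}/{total}")

# Figures from the abstract: 357 of 462 entries reproduced.
abstract_rate = 100.0 * 357 / 462
print(f"{abstract_rate:.1f}%")
```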

TimeDRL: Disentangled Representation Learning for Multivariate Time-Series. (arXiv:2312.04142v1 [cs.LG])

Authors: Ching Chang, Chiao-Tung Chan, Wei-Yao Wang, Wen-Chih Peng, Tien-Fu Chen

Multivariate time-series data in numerous real-world applications (e.g., healthcare and industry) are informative but challenging due to the lack of labels and high dimensionality. Recent studies in self-supervised learning have shown their potential in learning rich representations without relying on labels, yet they fall short in learning disentangled embeddings and addressing issues of inductive bias (e.g., transformation-invariance). To tackle these challenges, we propose TimeDRL, a generic multivariate time-series representation learning framework with disentangled dual-level embeddings. TimeDRL is characterized by three novel features: (i) disentangled derivation of timestamp-level and instance-level embeddings from patched time-series data using a [CLS] token strategy; (ii) utilization of timestamp-predictive and instance-contrastive tasks for disentangled representation learning, with the former optimizing timestamp-level embeddings with predictive loss, and the latter optimizing instance-level embeddings with contrastive loss; and (iii) avoidance of augmentation methods to eliminate inductive biases, such as transformation-invariance from cropping and masking. Comprehensive experiments on 6 time-series forecasting datasets and 5 time-series classification datasets have shown that TimeDRL consistently surpasses existing representation learning approaches, achieving an average improvement of forecasting by 57.98% in MSE and classification by 1.25% in accuracy. Furthermore, extensive ablation studies confirmed the relative contribution of each component in TimeDRL's architecture, and semi-supervised learning evaluations demonstrated its effectiveness in real-world scenarios, even with limited labeled data.

Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation. (arXiv:2312.04168v1 [cs.CV])

Authors: Jiawei Fan, Chao Li, Xiaolong Liu, Meina Song, Anbang Yao

In recent years, knowledge distillation methods based on contrastive learning have achieved promising results on image classification and object detection tasks. However, in this line of research, we note that less attention is paid to semantic segmentation. Existing methods heavily rely on data augmentation and memory buffers, which entail high computational resource demands when applied to semantic segmentation, a task that requires preserving high-resolution feature maps for making dense pixel-wise predictions. In order to address this problem, we present Augmentation-free Dense Contrastive Knowledge Distillation (Af-DCD), a new contrastive distillation learning paradigm to train compact and accurate deep neural networks for semantic segmentation applications. Af-DCD leverages a masked feature mimicking strategy, and formulates a novel contrastive learning loss by taking advantage of tactful feature partitions across both channel and spatial dimensions, allowing it to effectively transfer dense and structured local knowledge learnt by the teacher model to a target student model while maintaining training efficiency. Extensive experiments on five mainstream benchmarks with various teacher-student network pairs demonstrate the effectiveness of our approach. For instance, the DeepLabV3-Res18|DeepLabV3-MBV2 model trained by Af-DCD reaches 77.03%|76.38% mIOU on the Cityscapes dataset when choosing DeepLabV3-Res101 as the teacher, setting new performance records. Besides that, Af-DCD achieves an absolute mIOU improvement of 3.26%|3.04%|2.75%|2.30%|1.42% compared with individually trained counterparts on Cityscapes|Pascal VOC|Camvid|ADE20K|COCO-Stuff-164K. Code is available at

AI and Jobs: Has the Inflection Point Arrived? Evidence from an Online Labor Platform. (arXiv:2312.04180v1 [cs.AI])

Authors: Dandan Qiao, Huaxia Rui, Qian Xiong

Artificial intelligence (AI) refers to the ability of machines or software to mimic or even surpass human intelligence in a given cognitive task. While humans learn by both induction and deduction, the success of current AI is rooted in induction, relying on its ability to detect statistical regularities in task input -- an ability learnt from a vast amount of training data using enormous computation resources. We examine the performance of such a statistical AI in a human task through the lens of four factors, including task learnability, statistical resource, computation resource, and learning techniques, and then propose a three-phase visual framework to understand the evolving relation between AI and jobs. Based on this conceptual framework, we develop a simple economic model of competition to show the existence of an inflection point for each occupation. Before AI performance crosses the inflection point, human workers always benefit from an improvement in AI performance, but after the inflection point, human workers become worse off whenever such an improvement occurs. To offer empirical evidence, we first argue that AI performance has passed the inflection point for the occupation of translation but not for the occupation of web development. We then study how the launch of ChatGPT, which led to significant improvement of AI performance on many tasks, has affected workers in these two occupations on a large online labor platform. Consistent with the inflection point conjecture, we find that translators are negatively affected by the shock both in terms of the number of accepted jobs and the earnings from those jobs, while web developers are positively affected by the very same shock. Given the potentially large disruption of AI on employment, more studies on more occupations using data from different platforms are urgently needed.

Joint-Individual Fusion Structure with Fusion Attention Module for Multi-Modal Skin Cancer Classification. (arXiv:2312.04189v1 [cs.CV])

Authors: Peng Tang, Xintong Yan, Yang Nan, Xiaobin Hu, Bjoern H Menze, Sebastian Krammer, Tobias Lasser

Most convolutional neural network (CNN) based methods for skin cancer classification obtain their results using only dermatological images. Although good classification results have been shown, more accurate results can be achieved by considering the patient's metadata, which is valuable clinical information for dermatologists. Current multi-modal classification methods use only a simple joint fusion structure (FS) and simple fusion modules (FMs), so there is still room to increase accuracy by exploring more advanced FS and FM. Therefore, in this paper, we design a new fusion method that combines dermatological images (dermoscopy images or clinical images) and patient metadata for skin cancer classification from the perspectives of FS and FM. First, we propose a joint-individual fusion (JIF) structure that learns the shared features of multi-modality data and preserves specific features simultaneously. Second, we introduce a fusion attention (FA) module that enhances the most relevant image and metadata features based on both the self and mutual attention mechanism to support the decision-making pipeline. We compare the proposed JIF-MMFA method with other state-of-the-art fusion methods on three different public datasets. The results show that our JIF-MMFA method improves the classification results for all tested CNN backbones and performs better than the other fusion methods on the three public datasets, demonstrating our method's effectiveness and robustness.

Constraint Model for the Satellite Image Mosaic Selection Problem. (arXiv:2312.04210v1 [cs.AI])

Authors: Manuel Combarro Simón, Pierre Talbot, Grégoire Danoy, Jedrzej Musial, Mohammed Alswaitti, Pascal Bouvry

Satellite imagery solutions are widely used to study and monitor different regions of the Earth. However, a single satellite image can cover only a limited area. In cases where a larger area of interest is studied, several images must be stitched together to create a single larger image, called a mosaic, that can cover the area. Today, with the increasing number of satellite images available for commercial use, selecting the images to build the mosaic is challenging, especially when the user wants to optimize one or more parameters, such as the total cost and the cloud coverage percentage in the mosaic. More precisely, for this problem the input is an area of interest, several satellite images intersecting the area, a list of requirements relative to the image and the mosaic, such as cloud coverage percentage and image resolution, and a list of objectives to optimize. We contribute constraint and mixed integer linear programming formulations of this new problem, which we call the satellite image mosaic selection problem and which is a multi-objective extension of the polygon cover problem. We propose a dataset of realistic and challenging instances, where the images were captured by the satellite constellations SPOT, Pléiades and Pléiades Neo. We evaluate and compare the two proposed models and show their efficiency for large instances, up to 200 images.

Dynamic Data-Driven Digital Twins for Blockchain Systems. (arXiv:2312.04226v1 [cs.CR])

Authors: Georgios Diamantopoulos, Nikos Tziritas, Rami Bahsoon, Georgios Theodoropoulos

In recent years, we have seen an increase in the adoption of blockchain-based systems in non-financial applications, looking to benefit from what the technology has to offer. Although many fields have managed to include blockchain in their core functionalities, the adoption of blockchain, in general, is constrained by the so-called trilemma trade-off between decentralization, scalability, and security. In our previous work, we have shown that using a digital twin for dynamically managing blockchain systems during runtime can be effective in managing the trilemma trade-off. Our Digital Twin leverages the DDDAS feedback loop, which is responsible for getting the data from the system to the digital twin, conducting optimisation, and updating the physical system. This paper examines how leveraging the DDDAS feedback loop can support the optimisation component of the trilemma, benefiting from Reinforcement Learning agents and a simulation component to augment the quality of the learned model while reducing the computational overhead required for decision-making.

Adventures of Trustworthy Vision-Language Models: A Survey. (arXiv:2312.04231v1 [cs.CV])

Authors: Mayank Vatsa, Anubhooti Jain, Richa Singh

Recently, transformers have become incredibly popular in computer vision and vision-language tasks. This notable rise in their usage can be primarily attributed to the capabilities offered by attention mechanisms and the outstanding ability of transformers to adapt and apply themselves to a variety of tasks and domains. Their versatility and state-of-the-art performance have established them as indispensable tools for a wide array of applications. However, in the constantly changing landscape of machine learning, the assurance of the trustworthiness of transformers holds utmost importance. This paper conducts a thorough examination of vision-language transformers, employing three fundamental principles of responsible AI: Bias, Robustness, and Interpretability. The primary objective of this paper is to delve into the intricacies and complexities associated with the practical use of transformers, with the overarching goal of advancing our comprehension of how to enhance their reliability and accountability.

Graph Convolutions Enrich the Self-Attention in Transformers!. (arXiv:2312.04234v1 [cs.LG])

Authors: Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.
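
To make the "self-attention as a graph filter" reading concrete, the sketch below treats a row-normalized attention matrix A as a graph adjacency matrix and applies a generic polynomial filter w_0 I + w_1 A + w_2 A^2 + ... to it. The exact form of GFSA is not given in the abstract, so the polynomial, the coefficients, and all function names here are illustrative assumptions rather than the authors' method. Plain self-attention corresponds to the coefficients [0, 1].

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax_rows(scores: Matrix) -> Matrix:
    """Row-wise softmax, giving a row-stochastic attention matrix."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        norm = sum(exps)
        out.append([e / norm for e in exps])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def polynomial_filter(attn: Matrix, coeffs: List[float]) -> Matrix:
    """Return sum_k coeffs[k] * attn^k (attn^0 is the identity)."""
    n = len(attn)
    power = identity(n)
    result = [[0.0] * n for _ in range(n)]
    for c in coeffs:
        for i in range(n):
            for j in range(n):
                result[i][j] += c * power[i][j]
        power = matmul(power, attn)
    return result


# Hypothetical attention scores and one-dimensional value vectors.
scores = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
values = [[1.0], [0.0], [-1.0]]

attn = softmax_rows(scores)
plain = matmul(attn, values)  # ordinary attention output
filtered = matmul(polynomial_filter(attn, [0.5, 1.0, -0.5]), values)
print(plain)
print(filtered)
```

Because repeated multiplication by a row-stochastic matrix averages the rows together, higher powers of A are one way to picture the oversmoothing effect that the abstract describes.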

Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images. (arXiv:2312.04236v1 [cs.CV])

Authors: Yiqun Zhang, Zhenyue Qin, Yang Liu, Dylan Campbell

We introduce a pipeline to address anatomical inaccuracies in Stable Diffusion generated hand images. The initial step involves constructing a specialized dataset, focusing on hand anomalies, to train our models effectively. A finetuned detection model is pivotal for precise identification of these anomalies, ensuring targeted correction. Body pose estimation aids in understanding hand orientation and positioning, crucial for accurate anomaly correction. The integration of ControlNet and InstructPix2Pix facilitates sophisticated inpainting and pixel-level transformation, respectively. This dual approach allows for high-fidelity image adjustments. This comprehensive approach ensures the generation of images with anatomically accurate hands, closely resembling real-world appearances. Our experimental results demonstrate the pipeline's efficacy in enhancing hand image realism in Stable Diffusion outputs. We provide an online demo at

Mastering Complex Coordination through Attention-based Dynamic Graph. (arXiv:2312.04245v1 [cs.MA])

Authors: Guangchong Zhou, Zhiwei Xu, Zeren Zhang, Guoliang Fan

The coordination between agents in multi-agent systems has become a popular topic in many fields. To capture the inner relationships between agents, graph structures have been combined with existing methods, improving the results. But in large-scale tasks with numerous agents, an overly complex graph would lead to a boost in computational cost and a decline in performance. Here we present DAGMIX, a novel graph-based value factorization method. Instead of a complete graph, DAGMIX generates a dynamic graph at each time step during training, on which it realizes a more interpretable and effective combining process through the attention mechanism. Experiments show that DAGMIX significantly outperforms previous SOTA methods in large-scale scenarios, as well as achieving promising results on other tasks.

Extending Answer Set Programming with Rational Numbers. (arXiv:2312.04249v1 [cs.AI])

Authors: Francesco Pacenza, Jessica Zangari

Answer Set Programming (ASP) is a widely used declarative programming paradigm that has shown great potential in solving complex computational problems. However, the inability to natively support non-integer arithmetic has been highlighted as a major drawback in real-world applications. This feature is crucial to accurately model and manage real-world data and information as emerged in various contexts, such as the smooth movement of video game characters, the 3D movement of mechanical arms, and data streamed by sensors. Nevertheless, extending ASP in this direction, without affecting its declarative nature and its well-defined semantics, poses non-trivial challenges; thus, no ASP system is able to reason natively with non-integer domains. Indeed, the widespread floating-point arithmetic is not applicable to the ASP case, as the reproducibility of results cannot be guaranteed and the semantics of an ASP program would not be uniquely and declaratively determined, regardless of the employed machine or solver. To overcome such limitations and in the realm of pure ASP, this paper proposes an extension of ASP in which non-integers are approximated to rational numbers, fully granting reproducibility and declarativity. We provide a well-defined semantics for the ASP-Core-2 standard extended with rational numbers and an implementation thereof. We hope this work could serve as a stepping stone towards a more expressive and versatile ASP language that can handle a broader range of real-world problems.
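
The reproducibility concern with floating-point arithmetic can be seen in a small example. The sketch below uses Python's standard `fractions` module purely as an analogy; it is not ASP code and says nothing about how the proposed extension is implemented.

```python
from fractions import Fraction

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly,
# so equality checks and even the order of additions can change the result.
float_equal = (0.1 + 0.2) == 0.3
float_associative = ((0.1 + 0.2) + 0.3) == (0.1 + (0.2 + 0.3))

# Exact rational arithmetic gives the same answer regardless of evaluation order.
a, b, c = Fraction(1, 10), Fraction(2, 10), Fraction(3, 10)
rational_equal = (a + b) == c
rational_associative = ((a + b) + c) == (a + (b + c))

print(float_equal, float_associative)
print(rational_equal, rational_associative)
```

With rationals, a program that derives a fact from a comparison reaches the same conclusion on every machine, which is the property the abstract is after.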

nerblackbox: A High-level Library for Named Entity Recognition in Python. (arXiv:2312.04306v1 [cs.CL])

Authors: Felix Stollenwerk

We present nerblackbox, a python library to facilitate the use of state-of-the-art transformer-based models for named entity recognition. It provides simple-to-use yet powerful methods to access data and models from a wide range of sources, for fully automated model training and evaluation as well as versatile model inference. While many technical challenges are solved and hidden from the user by default, nerblackbox also offers fine-grained control and a rich set of customizable features. It is thus targeted both at application-oriented developers as well as machine learning experts and researchers.

Towards Knowledge-driven Autonomous Driving. (arXiv:2312.04316v1 [cs.RO])

Authors: Xin Li, Yeqi Bai, Pinlong Cai, Licheng Wen, Daocheng Fu, Bo Zhang, Xuemeng Yang, Xinyu Cai, Tao Ma, Jianfei Guo, Xing Gao, Min Dou, Botian Shi, Yong Liu, Liang He, Yu Qiao

This paper explores the emerging knowledge-driven autonomous driving technologies. Our investigation highlights the limitations of current autonomous driving systems, in particular their sensitivity to data bias, difficulty in handling long-tail scenarios, and lack of interpretability. Conversely, knowledge-driven methods with the abilities of cognition, generalization and life-long learning emerge as a promising way to overcome these challenges. This paper delves into the essence of knowledge-driven autonomous driving and examines its core components: dataset & benchmark, environment, and driver agent. By leveraging large language models, world models, neural rendering, and other advanced artificial intelligence techniques, these components collectively contribute to a more holistic, adaptive, and intelligent autonomous driving system. The paper systematically organizes and reviews previous research efforts in this area, and provides insights and guidance for future research and practical applications of autonomous driving. We will continually share the latest updates on cutting-edge developments in knowledge-driven autonomous driving along with the relevant valuable open-source resources at: \url{}.

MIMo: A Multi-Modal Infant Model for Studying Cognitive Development. (arXiv:2312.04318v1 [cs.AI])

Authors: Dominik Mattern, Pierre Schumacher, Francisco M. López, Marcel C. Raabe, Markus R. Ernst, Arthur Aubret, Jochen Triesch

Human intelligence and human consciousness emerge gradually during the process of cognitive development. Understanding this development is an essential aspect of understanding the human mind and may facilitate the construction of artificial minds with similar properties. Importantly, human cognitive development relies on embodied interactions with the physical and social environment, which is perceived via complementary sensory modalities. These interactions allow the developing mind to probe the causal structure of the world. This is in stark contrast to common machine learning approaches, e.g., for large language models, which are merely passively "digesting" large amounts of training data, but are not in control of their sensory inputs. However, computational modeling of the kind of self-determined embodied interactions that lead to human intelligence and consciousness is a formidable challenge. Here we present MIMo, an open-source multi-modal infant model for studying early cognitive development through computer simulations. MIMo's body is modeled after an 18-month-old child with detailed five-fingered hands. MIMo perceives its surroundings via binocular vision, a vestibular system, proprioception, and touch perception through a full-body virtual skin, while two different actuation models allow control of its body. We describe the design and interfaces of MIMo and provide examples illustrating its use. All code is available at .

Surrogate Modelling for Sea Ice Concentration using Lightweight Neural Ensemble. (arXiv:2312.04330v1 [cs.LG])

Authors: Julia Borisova, Nikolay O. Nikitin

The modeling and forecasting of sea ice conditions in the Arctic region are important tasks for ship routing, offshore oil production, and environmental monitoring. We propose an adaptive surrogate modeling approach named LANE-SI (Lightweight Automated Neural Ensembling for Sea Ice) that uses an ensemble of relatively simple deep learning models with different loss functions to forecast the spatial distribution of sea ice concentration in a specified water area. Experimental studies confirm that the quality of a long-term forecast based on a deep learning model fitted to the specific water area is comparable to resource-intensive physical modeling, and for some periods of the year, it is superior. We achieved a 20% improvement against the state-of-the-art physics-based forecast system SEAS5 for the Kara Sea.

Causality and Explainability for Trustworthy Integrated Pest Management. (arXiv:2312.04343v1 [cs.LG])

Authors: Ilias Tsoumas, Vasileios Sitokonstantinou, Georgios Giannarakis, Evagelia Lampiri, Christos Athanassiou, Gustau Camps-Valls, Charalampos Kontoes, Ioannis Athanasiadis

Pesticides serve as a common tool in agricultural pest control but significantly contribute to the climate crisis. To combat this, Integrated Pest Management (IPM) stands as a climate-smart alternative. Despite its potential, IPM faces low adoption rates due to farmers' skepticism about its effectiveness. To address this challenge, we introduce an advanced data analysis framework tailored to enhance IPM adoption. Our framework provides i) robust pest population predictions across diverse environments with invariant and causal learning, ii) interpretable pest presence predictions using transparent models, iii) actionable advice through counterfactual explanations for in-season IPM interventions, iv) field-specific treatment effect estimations, and v) assessments of the effectiveness of our advice using causal inference. By incorporating these features, our framework aims to alleviate skepticism and encourage wider adoption of IPM practices among farmers.

Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on Prompt Engineering Strategies. (arXiv:2312.04344v1 [cs.CL])

Authors: Pengcheng Chen, Ziyan Huang, Zhongying Deng, Tianbin Li, Yanzhou Su, Haoyu Wang, Jin Ye, Yu Qiao, Junjun He

OpenAI's latest large vision-language model (LVLM), GPT-4V(ision), has piqued considerable interest for its potential in medical applications. Despite its promise, recent studies and internal reviews highlight its underperformance in specialized medical tasks. This paper explores the boundary of GPT-4V's capabilities in medicine, particularly in processing complex imaging data from endoscopies, CT scans, and MRIs etc. Leveraging open-source datasets, we assessed its foundational competencies, identifying substantial areas for enhancement. Our research emphasizes prompt engineering, an often-underutilized strategy for improving AI responsiveness. Through iterative testing, we refined the model's prompts, significantly improving its interpretative accuracy and relevance in medical imaging. From our comprehensive evaluations, we distilled 10 effective prompt engineering techniques, each fortifying GPT-4V's medical acumen. These methodical enhancements facilitate more reliable, precise, and clinically valuable insights from GPT-4V, advancing its operability in critical healthcare environments. Our findings are pivotal for those employing AI in medicine, providing clear, actionable guidance on harnessing GPT-4V's full diagnostic potential.

CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models. (arXiv:2312.04350v1 [cs.CL])

Authors: Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schölkopf

The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs. Our data is open-sourced at, and our code can be found at

PCoQA: Persian Conversational Question Answering Dataset. (arXiv:2312.04362v1 [cs.CL])

Authors: Hamed Hematian Hemati, Atousa Toghyani, Atena Souri, Sayed Hesam Alavian, Hossein Sameti, Hamid Beigy

Humans seek information on a specific topic by holding a conversation made up of a series of questions and answers. To support research on conversational question answering, we introduce PCoQA, the first Persian Conversational Question Answering dataset, a resource comprising information-seeking dialogs with a total of 9,026 contextually driven questions. Each dialog involves a questioner, a responder, and a document from Wikipedia; the questioner asks several interconnected questions about the text, and the responder provides a span of the document as the answer to each question. PCoQA is designed to present novel challenges compared to previous question answering datasets, including more open-ended non-factual answers, longer answers, and fewer lexical overlaps. This paper not only presents the comprehensive PCoQA dataset but also reports the performance of various benchmark models. Our models include baseline models and pre-trained models, which are leveraged to boost the performance of the model. The dataset and benchmarks are available at our GitHub page.

LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs. (arXiv:2312.04372v1 [cs.CL])

Authors: Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, Ziran Wang

We present LaMPilot, a novel framework for planning in the field of autonomous driving, rethinking the task as a code-generation process that leverages established behavioral primitives. This approach aims to address the challenge of interpreting and executing spontaneous user instructions such as "overtake the car ahead," which have typically posed difficulties for existing frameworks. We introduce the LaMPilot benchmark specifically designed to quantitatively evaluate the efficacy of Large Language Models (LLMs) in translating human directives into actionable driving policies. We then evaluate a wide range of state-of-the-art code generation language models on tasks from the LaMPilot Benchmark. The results of the experiments showed that GPT-4, with human feedback, achieved an impressive task completion rate of 92.7% and a minimal collision rate of 0.9%. To encourage further investigation in this area, our code and dataset will be made available.

Deep Dynamics: Vehicle Dynamics Modeling with a Physics-Informed Neural Network for Autonomous Racing. (arXiv:2312.04374v1 [cs.RO])

Authors: John Chrosniak, Jingyun Ning, Madhur Behl

Autonomous racing is a critical research area for autonomous driving, presenting significant challenges in vehicle dynamics modeling, such as balancing model precision and computational efficiency at high speeds (>280kmph), where minor errors in modeling have severe consequences. Existing physics-based models for vehicle dynamics require elaborate testing setups and tuning, which are hard to implement, time-intensive, and cost-prohibitive. Conversely, purely data-driven approaches do not generalize well and cannot adequately ensure physical constraints on predictions. This paper introduces Deep Dynamics, a physics-informed neural network (PINN) for vehicle dynamics modeling of an autonomous racecar. It combines physics coefficient estimation and dynamical equations to accurately predict vehicle states at high speeds and includes a unique Physics Guard layer to ensure internal coefficient estimates remain within their nominal physical ranges. Open-loop and closed-loop performance assessments, using a physics-based simulator and full-scale autonomous Indy racecar data, highlight Deep Dynamics as a promising approach for modeling racecar vehicle dynamics.

How much informative is your XAI? A decision-making assessment task to objectively measure the goodness of explanations. (arXiv:2312.04379v1 [cs.AI])

Authors: Marco Matarese, Francesco Rea, Alessandra Sciutti

There is an increasing consensus about the effectiveness of user-centred approaches in the explainable artificial intelligence (XAI) field. Indeed, the number and complexity of personalised and user-centred approaches to XAI have grown rapidly in recent years. Often, these works have a two-fold objective: (1) proposing novel XAI techniques able to consider the users and (2) assessing the goodness of such techniques with respect to others. These new works show that user-centred approaches to XAI positively affect the interaction between users and systems. However, so far, the goodness of XAI systems has been measured through indirect measures, such as performance. In this paper, we propose an assessment task to objectively and quantitatively measure the goodness of XAI systems in terms of their information power, which we define as the amount of information the system provides to the users during the interaction. Moreover, we plan to use our task to objectively compare two XAI techniques in a human-robot decision-making task to better understand whether user-centred approaches are more informative than classical ones.

Adversarial Denoising Diffusion Model for Unsupervised Anomaly Detection. (arXiv:2312.04382v1 [eess.IV])

Authors: Jongmin Yu, Hyeontaek Oh, Jinhong Yang

In this paper, we propose the Adversarial Denoising Diffusion Model (ADDM). The ADDM is based on the Denoising Diffusion Probabilistic Model (DDPM) but complementarily trained by adversarial learning. The proposed adversarial learning is achieved by classifying model-based denoised samples and samples to which random Gaussian noise is added to a specific sampling step. With the addition of explicit adversarial learning on data samples, ADDM can learn the semantic characteristics of the data more robustly during training, which achieves a similar data sampling performance with much fewer sampling steps than DDPM. We apply ADDM to anomaly detection in unsupervised MRI images. Experimental results show that the proposed ADDM outperformed existing generative model-based unsupervised anomaly detection methods. In particular, compared to other DDPM-based anomaly detection methods, the proposed ADDM shows better performance with the same number of sampling steps and similar performance with 50% fewer sampling steps.

Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization. (arXiv:2312.04386v1 [cs.LG])

Authors: Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters

We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges to apply the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied for either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.
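
The quantity studied above, the variance over values induced by a distribution over MDPs, can be illustrated with a naive sampling baseline. The sketch below is not the UBE method of the paper: it simply evaluates a fixed policy on each sampled model and takes the per-state variance. The function names, the list-based model format, and the discount factor are assumptions made for illustration.

```python
def evaluate(P, r, gamma=0.9, iters=500):
    """Iterative policy evaluation for a fixed policy.

    P[s][t] is the probability of moving from state s to state t under the
    policy, and r[s] is the expected reward collected in state s.
    """
    n = len(r)
    v = [0.0] * n
    for _ in range(iters):
        # One Bellman backup applied to every state.
        v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]
    return v


def value_variance(models, gamma=0.9):
    """Per-state population variance of values across sampled (P, r) models."""
    values = [evaluate(P, r, gamma) for P, r in models]
    n = len(values[0])
    means = [sum(v[s] for v in values) / len(values) for s in range(n)]
    return [sum((v[s] - means[s]) ** 2 for v in values) / len(values) for s in range(n)]
```

Each sampled model here requires a full policy evaluation, which is the cost that a Bellman-style recursion for the variance aims to avoid.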

Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning. (arXiv:2312.04398v1 [cs.CV])

Authors: Yongqi Dong, Xingmin Lu, Ruohan Li, Wei Song, Bart van Arem, Haneen Farah

The burgeoning navigation services using digital maps provide great convenience to drivers. Nevertheless, the presence of anomalies in lane rendering map images occasionally introduces potential hazards, as such anomalies can be misleading to human drivers and consequently contribute to unsafe driving conditions. In response to this concern and to accurately and effectively detect the anomalies, this paper transforms lane rendering image anomaly detection into a classification problem and proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing to tackle it leveraging state-of-the-art deep learning techniques, especially those involving Transformer models. Various experiments verify the effectiveness of the proposed pipeline. Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection, and notably, the self-supervised pre-training with MiM can greatly enhance the detection accuracy while significantly reducing the total training time. For instance, employing the Swin Transformer with Uniform Masking as self-supervised pretraining (Swin-Trans-UM) yielded a heightened accuracy at 94.77% and an improved Area Under The Curve (AUC) score of 0.9743 compared with the pure Swin Transformer without pre-training (Swin-Trans) with an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the original 280. In conclusion, the proposed pipeline, with its incorporation of self-supervised pre-training using MiM and other advanced deep learning techniques, emerges as a robust solution for enhancing the accuracy and efficiency of lane rendering image anomaly detection in digital navigation systems.
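
The fine-tuning loss named above, cross-entropy with label smoothing, can be sketched in its standard textbook form. This is an illustration only, not the authors' implementation; the function name and the default smoothing value of 0.1 are assumptions.

```python
import math


def label_smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy against a smoothed target distribution.

    With K classes, the true class receives probability
    1 - epsilon + epsilon / K and every other class receives epsilon / K.
    """
    k = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = epsilon / k + (1.0 - epsilon if i == target else 0.0)
        loss -= q * lp
    return loss
```

Setting epsilon to 0 recovers ordinary cross-entropy; a positive epsilon penalizes overconfident predictions even when they are correct.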

Temporal Fairness in Multiwinner Voting. (arXiv:2312.04417v1 [cs.GT])

Authors: Edith Elkind, Svetlana Obratzsova, Nicholas Teh

Multiwinner voting captures a wide variety of settings, from parliamentary elections in democratic systems to product placement in online shopping platforms. There is a large body of work dealing with axiomatic characterizations, computational complexity, and algorithmic analysis of multiwinner voting rules. Although many challenges remain, significant progress has been made in showing existence of fair and representative outcomes as well as efficient algorithmic solutions for many commonly studied settings. However, much of this work focuses on single-shot elections, even though in numerous real-world settings elections are held periodically and repeatedly. Hence, it is imperative to extend the study of multiwinner voting to temporal settings. Recently, there have been several efforts to address this challenge. However, these works are difficult to compare, as they model multi-period voting in very different ways. We propose a unified framework for studying temporal fairness in this domain, drawing connections with various existing bodies of work, and consolidating them within a general framework. We also identify gaps in existing literature, outline multiple opportunities for future work, and put forward a vision for the future of multiwinner voting in temporal settings.

Scalable Knowledge Graph Construction and Inference on Human Genome Variants. (arXiv:2312.04423v1 [cs.AI])

Authors: Shivika Prasanna, Deepthi Rao, Eduardo Simoes, Praveen Rao

Real-world knowledge can be represented as a graph consisting of entities and relationships between the entities. The need for efficient and scalable solutions arises when dealing with vast genomic data, like RNA-sequencing. Knowledge graphs offer a powerful approach for various tasks in such large-scale genomic data, such as analysis and inference. In this work, variant-level information extracted from the RNA-sequences of vaccine-naïve COVID-19 patients has been represented as a unified, large knowledge graph. Variant call format (VCF) files containing the variant-level information were annotated to include further information for each variant. The data records in the annotated files were then converted to Resource Description Framework (RDF) triples. Each VCF file obtained had an associated CADD scores file that contained the raw and Phred-scaled scores for each variant. An ontology was defined for the VCF and CADD scores files. Using this ontology and the extracted information, a large, scalable knowledge graph was created. Available graph storage was then leveraged to query and create datasets for further downstream tasks. We also present a case study using the knowledge graph and perform a classification task using graph machine learning. We also draw comparisons between different Graph Neural Networks (GNNs) for the case study.
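
The conversion of VCF data records to RDF triples described above might look roughly like the following. This is a minimal sketch, not the paper's ontology: the `http://example.org/` namespace and the property names are hypothetical, only the first five standard VCF columns (CHROM, POS, ID, REF, ALT) are used, and simple single-allele records are assumed.

```python
BASE = "http://example.org/"  # hypothetical namespace, not the paper's ontology


def vcf_line_to_triples(line):
    """Convert one tab-separated VCF data line into N-Triples strings."""
    chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
    # Assumes a simple single-allele record so the values are safe in an IRI.
    subject = f"<{BASE}variant/{chrom}-{pos}-{ref}-{alt}>"
    fields = {"chromosome": chrom, "position": pos, "reference": ref, "alternate": alt}
    if vid != ".":  # "." marks a missing ID in VCF
        fields["identifier"] = vid
    return [
        f'{subject} <{BASE}ontology/{name}> "{value}" .'
        for name, value in fields.items()
    ]
```

Each variant becomes one subject node with one triple per field, so annotation scores such as CADD could be attached later as further triples on the same subject.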

Adv-4-Adv: Thwarting Changing Adversarial Perturbations via Adversarial Domain Adaptation. (arXiv:2112.00428v3 [cs.CV] UPDATED)

Authors: Tianyue Zheng, Zhe Chen, Shuya Ding, Chao Cai, Jun Luo

Whereas adversarial training can be useful against specific adversarial perturbations, it has also proven ineffective in generalizing to attacks that deviate from those used for training. However, we observe that this ineffectiveness is intrinsically connected to domain adaptability, another crucial issue in deep learning for which adversarial domain adaptation appears to be a promising solution. Consequently, we propose Adv-4-Adv as a novel adversarial training method that aims to retain robustness against unseen adversarial perturbations. Essentially, Adv-4-Adv treats attacks incurring different perturbations as distinct domains, and by leveraging the power of adversarial domain adaptation, it aims to remove the domain/attack-specific features. This forces a trained model to learn a robust domain-invariant representation, which in turn enhances its generalization ability. Extensive evaluations on Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100 demonstrate that a model trained by Adv-4-Adv based on samples crafted by simple attacks (e.g., FGSM) can be generalized to more advanced attacks (e.g., PGD), and the performance exceeds state-of-the-art proposals on these datasets.

Deep Learning for Hate Speech Detection: A Comparative Study. (arXiv:2202.09517v2 [cs.CL] UPDATED)

Authors: Jitendra Singh Malik, Hezhe Qiao, Guansong Pang, Anton van den Hengel

Automated hate speech detection is an important tool in combating the spread of hate speech, particularly in social media. Numerous methods have been developed for the task, including a recent proliferation of deep-learning based approaches. A variety of datasets have also been developed, exemplifying various manifestations of the hate-speech detection problem. We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods, mediated through the three most commonly used datasets. Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art. We particularly focus our analysis on measures of practical performance, including detection accuracy, computational efficiency, capability in using pre-trained models, and domain generalization. In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions. Code and dataset are available at

Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition. (arXiv:2210.09943v3 [cs.CV] UPDATED)

Authors: Samuel Dooley, Rhea Sanjay Sukthanker, John P. Dickerson, Colin White, Frank Hutter, Micah Goldblum

Face recognition systems are widely deployed in safety-critical applications, including law enforcement, yet they exhibit bias across a range of socio-demographic dimensions, such as gender and race. Conventional wisdom dictates that model biases arise from biased training data. As a consequence, previous works on bias mitigation largely focused on pre-processing the training data, adding penalties to prevent bias from affecting the model during training, or post-processing predictions to debias them, yet these approaches have shown limited success on hard problems such as face recognition. In our work, we discover that biases are actually inherent to neural network architectures themselves. Following this reframing, we conduct the first neural architecture search for fairness, jointly with a search for hyperparameters. Our search outputs a suite of models which Pareto-dominate all other high-performance architectures and existing bias mitigation methods in terms of accuracy and fairness, often by large margins, on the two most widely used datasets for face identification, CelebA and VGGFace2. Furthermore, these models generalize to other datasets and sensitive attributes. We release our code, models and raw data files at

MAUVE Scores for Generative Models: Theory and Practice. (arXiv:2212.14578v2 [cs.LG] UPDATED)

Authors: Krishna Pillutla, Lang Liu, John Thickstun, Sean Welleck, Swabha Swayamdipta, Rowan Zellers, Sewoong Oh, Yejin Choi, Zaid Harchaoui

Generative artificial intelligence has made significant strides, producing text indistinguishable from human prose and remarkably photorealistic images. Automatically measuring how close the generated data distribution is to the target distribution is central to diagnosing existing models and developing better ones. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore three approaches to statistically estimate these scores: vector quantization, non-parametric estimation, and classifier-based estimation. We provide statistical bounds for the vector quantization approach.

Empirically, we find that the proposed scores paired with a range of $f$-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics. In conclusion, we present practical recommendations for using MAUVE effectively with language and image modalities.

Point Cloud-based Proactive Link Quality Prediction for Millimeter-wave Communications. (arXiv:2301.00752v4 [cs.NI] UPDATED)

Authors: Shoki Ohta, Takayuki Nishio, Riichi Kudo, Kahoko Takahashi, Hisashi Nagata

This study demonstrates the feasibility of point cloud-based proactive link quality prediction for millimeter-wave (mmWave) communications. Previous studies have proposed machine learning-based methods to predict received signal strength for future time periods using time series of depth images to mitigate the line-of-sight (LOS) path blockage by pedestrians in mmWave communication. However, these image-based methods have limited applicability due to privacy concerns as camera images may contain sensitive information. This study proposes a point cloud-based method for mmWave link quality prediction and demonstrates its feasibility through experiments. Point clouds represent three-dimensional (3D) spaces as a set of points and are sparser and less likely to contain sensitive information than camera images. Additionally, point clouds provide 3D position and motion information, which is necessary for understanding the radio propagation environment involving pedestrians. This study designs the mmWave link quality prediction method and conducts realistic indoor experiments, where the link quality fluctuates significantly due to human blockage, using commercially available IEEE 802.11ad-based 60 GHz wireless LAN devices and Kinect v2 RGB-D camera and Velodyne VLP-16 light detection and ranging (LiDAR) for point cloud acquisition. The experimental results showed that our proposed method can predict future large attenuation of mmWave received signal strength and throughput induced by the LOS path blockage by pedestrians with comparable or superior accuracy to image-based prediction methods. Hence, our point cloud-based method can serve as a viable alternative to image-based methods.

Sem@$K$: Is my knowledge graph embedding model semantic-aware?. (arXiv:2301.05601v2 [cs.LG] UPDATED)

Authors: Nicolas Hubert, Pierre Monnin, Armelle Brun, Davy Monticolo

Using knowledge graph embedding models (KGEMs) is a popular approach for predicting links in knowledge graphs (KGs). Traditionally, the performance of KGEMs for link prediction is assessed using rank-based metrics, which evaluate their ability to give high scores to ground-truth entities. However, the literature claims that the KGEM evaluation procedure would benefit from adding supplementary dimensions to assess. That is why, in this paper, we extend our previously introduced metric Sem@K that measures the capability of models to predict valid entities w.r.t. domain and range constraints. In particular, we consider a broad range of KGs and take their respective characteristics into account to propose different versions of Sem@K. We also perform an extensive study to qualify the abilities of KGEMs as measured by our metric. Our experiments show that Sem@K provides a new perspective on KGEM quality. Its joint analysis with rank-based metrics offers different conclusions on the predictive power of models. Regarding Sem@K, some KGEMs are inherently better than others, but this semantic superiority is not indicative of their performance w.r.t. rank-based metrics. In this work, we generalize conclusions about the relative performance of KGEMs w.r.t. rank-based and semantic-oriented metrics at the level of families of models. The joint analysis of the aforementioned metrics gives more insight into the peculiarities of each model. This work paves the way for a more comprehensive evaluation of KGEM adequacy for specific downstream tasks.
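
One simplified reading of the metric described above is the share of a model's top-K predicted entities that satisfy the relation's domain or range constraint. The sketch below follows that reading only; the paper proposes several versions of Sem@K that are not reproduced here, and the function name, the type dictionary, and the example entities are assumptions.

```python
def sem_at_k(ranked_entities, entity_types, expected_type, k):
    """Fraction of the top-k ranked entities whose types include expected_type.

    ranked_entities: candidate entities sorted by model score, best first.
    entity_types: mapping from entity to the set of its types.
    expected_type: the type required by the relation's domain or range.
    """
    top = ranked_entities[:k]
    if not top:
        return 0.0
    valid = sum(1 for e in top if expected_type in entity_types.get(e, set()))
    return valid / len(top)
```

For a relation whose range is City, a ranking of Paris, Einstein, Berlin would score 0.5 at K = 2, regardless of which city is the ground-truth answer. This is why the metric can disagree with rank-based metrics.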

A Stability Analysis of Fine-Tuning a Pre-Trained Model. (arXiv:2301.09820v2 [cs.LG] UPDATED)

Authors: Zihao Fu, Anthony Man-Cho So, Nigel Collier

Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure. In addition to being able to explain most of the observed empirical discoveries, our proposed theoretical analysis framework can also help in the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure, namely, Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate our proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experiment results show that our proposed methods significantly stabilize the fine-tuning procedure and also corroborate our theoretical analysis.

Temporal Robustness against Data Poisoning. (arXiv:2302.03684v3 [cs.LG] UPDATED)

Authors: Wenxiao Wang, Soheil Feizi

Data poisoning considers cases when an adversary manipulates the behavior of machine learning algorithms through malicious training data. Existing threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, if attackers can poison more samples than expected with affordable overhead, as in many practical scenarios, they may be able to render existing defenses ineffective in a short time. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning with two novel metrics, earliness and duration, which respectively measure how long an attack started in advance and how long an attack lasted. Using these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples when the attacks are temporally bounded. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal threat model for data poisoning.
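
The two temporal metrics above can be illustrated from the timestamps of the poisoned samples. This is a rough sketch based only on the one-sentence descriptions in the abstract; the paper's formal definitions may differ, and the function name and the `reference_time` parameter (for example, the time at which the model is deployed) are assumptions.

```python
def earliness_and_duration(poison_timestamps, reference_time):
    """Return (earliness, duration) for one attack.

    earliness: how far before reference_time the first poisoned sample appears.
    duration: time between the first and the last poisoned sample.
    """
    start = min(poison_timestamps)
    end = max(poison_timestamps)
    return reference_time - start, end - start
```

Under this reading, an attacker who injects many samples in a short recent window has small earliness and small duration, however large the number of poisoned samples.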

Trajectory-User Linking via Hierarchical Spatio-Temporal Attention Networks. (arXiv:2302.10903v2 [cs.LG] UPDATED)

Authors: Wei Chen, Chao Huang, Yanwei Yu, Yongguo Jiang, Junyu Dong

Trajectory-User Linking (TUL) is crucial for human mobility modeling by linking different trajectories to users with the exploration of complex mobility patterns. Existing works mainly rely on the recurrent neural framework to encode the temporal dependencies in trajectories, and fall short in capturing spatial-temporal global context for TUL prediction. To fill this gap, this work presents a new hierarchical spatio-temporal attention neural network, called AttnTUL, to jointly encode the local trajectory transitional patterns and global spatial dependencies for TUL. Specifically, our first model component is built over the graph neural architecture to preserve the local and global context and enhance the representation paradigm of geographical regions and user trajectories. Additionally, a hierarchically structured attention network is designed to simultaneously encode the intra-trajectory and inter-trajectory dependencies, with the integration of the temporal attention mechanism and global elastic attentional encoder. Extensive experiments demonstrate the superiority of our AttnTUL method as compared to state-of-the-art baselines on various trajectory datasets. The source code of our model is available at

Reinforcement Learning for Combining Search Methods in the Calibration of Economic ABMs. (arXiv:2302.11835v3 [cs.LG] UPDATED)

Authors: Aldo Glielmo, Marco Favorito, Debmallya Chanda, Domenico Delli Gatti

Calibrating agent-based models (ABMs) in economics and finance typically involves a derivative-free search in a very large parameter space. In this work, we benchmark a number of search methods in the calibration of a well-known macroeconomic ABM on real data, and further assess the performance of "mixed strategies" made by combining different methods. We find that methods based on random-forest surrogates are particularly efficient, and that combining search methods generally increases performance since the biases of any single method are mitigated. Moving from these observations, we propose a reinforcement learning (RL) scheme to automatically select and combine search methods on-the-fly during a calibration run. The RL agent keeps exploiting a specific method only as long as this keeps performing well, but explores new strategies when the specific method reaches a performance plateau. The resulting RL search scheme outperforms any other method or method combination tested, and does not rely on any prior information or trial and error procedure.

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation. (arXiv:2303.06458v2 [cs.CL] UPDATED)

Authors: Bang Yang, Fenglin Liu, Yuexian Zou, Xian Wu, Yaowei Wang, David A. Clifton

Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. To relax the dependency on labeled data of downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder to learn to generate text by reconstructing the input text given its coordinate in shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences across different languages given the coordinate of input data in the common space. Within this unified framework, given visual (imaging or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.

Plotting Behind the Scenes: Towards Learnable Game Engines. (arXiv:2303.13472v2 [cs.CV] UPDATED)

Authors: Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, Elisa Ricci

Neural video game simulators have emerged as powerful tools to generate and edit videos. Their idea is to represent games as the evolution of an environment's state driven by the actions of its agents. While such a paradigm enables users to play a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with prompts specified as a set of natural language actions and desired states. The result, a Promptable Game Model (PGM), makes it possible for a user to play the game by prompting it with high- and low-level action sequences. Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt. This requires learning "game AI", encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. To render the resulting state, we use a compositional NeRF representation encapsulated in our synthesis model. To foster future research, we present newly collected, annotated and calibrated Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art. Our framework, data, and models are available at

FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits. (arXiv:2304.10306v2 [cs.CV] UPDATED)

Authors: Polina Karpikova, Radionova Ekaterina, Anastasia Yaschenko, Andrei Spiridonov, Leonid Kostyushko, Riccardo Fabbricatore, Aleksei Ivakhnenko

Generative DNNs are a powerful tool for image synthesis, but they are limited by their computational load. On the other hand, given a trained model and a task, e.g. face generation within a range of characteristics, the output image quality will be unevenly distributed among images with different characteristics. It follows that we might restrain the model's complexity on some instances while maintaining high quality. We propose a method for reducing computation by adding so-called early exit branches to the original architecture, and dynamically switching the computational path depending on how difficult it will be to render the output. We apply our method to two different SOTA models performing generative tasks: generation from a semantic map, and cross-reenactment of face expressions, showing that it is able to output images with custom lower-quality thresholds. For a threshold of LPIPS <=0.1, we reduce their computations by up to a half. This is especially relevant for real-time applications such as synthesis of faces, when quality loss needs to be contained, but most of the inputs need fewer computations than the complex instances.

LAVA: Data Valuation without Pre-Specified Learning Algorithms. (arXiv:2305.00054v2 [cs.LG] UPDATED)

Authors: Hoang Anh Just, Feiyang Kang, Jiachen T. Wang, Yi Zeng, Myeongseob Ko, Ming Jin, Ruoxi Jia

Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined then. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be directly obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over SOTA performance while being orders of magnitude faster.

A Latent Diffusion Model for Protein Structure Generation. (arXiv:2305.04120v2 [q-bio.BM] UPDATED)

Authors: Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, Shuiwang Ji

Proteins are complex biomolecules that perform a variety of crucial functions within living organisms. Designing and generating novel proteins can pave the way for many future synthetic biology applications, including drug discovery. However, it remains a challenging computational task due to the large modeling space of protein structures. In this study, we propose a latent diffusion model that can reduce the complexity of protein modeling while flexibly capturing the distribution of natural protein structures in a condensed latent space. Specifically, we propose an equivariant protein autoencoder that embeds proteins into a latent space and then uses an equivariant diffusion model to learn the distribution of the latent protein representations. Experimental results demonstrate that our method can effectively generate novel protein backbone structures with high designability and efficiency. The code will be made publicly available at

Conversational Semantic Parsing using Dynamic Context Graphs. (arXiv:2305.06164v2 [cs.CL] UPDATED)

Authors: Parag Jain, Mirella Lapata

In this paper we consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities and thousands of relation types. We focus on models which are capable of interactively mapping user utterances into executable logical forms (e.g., SPARQL) in the context of the conversational history. Our key idea is to represent information about an utterance and its context via a subgraph which is created dynamically, i.e., the number of nodes varies per utterance. Rather than treating the subgraph as a sequence, we exploit its underlying structure and encode it with a graph neural network, which further allows us to represent a large number of (unseen) nodes. Experimental results show that dynamic context modeling is superior to static approaches, delivering performance improvements across the board (i.e., for simple and complex questions). Our results further confirm that modeling the structure of context is better at processing discourse information (i.e., at handling ellipsis and resolving coreference) and longer interactions.

Post Hoc Explanations of Language Models Can Improve Language Models. (arXiv:2305.11426v3 [cs.CL] UPDATED)

Authors: Satyapriya Krishna, Jiaqi Ma, Dylan Slack, Asma Ghandeharioun, Sameer Singh, Himabindu Lakkaraju

Large Language Models (LLMs) have demonstrated remarkable capabilities in performing complex tasks. Moreover, recent research has shown that incorporating human-annotated rationales (e.g., Chain-of-Thought prompting) during in-context learning can significantly enhance the performance of these models, particularly on tasks that require reasoning capabilities. However, incorporating such rationales poses challenges in terms of scalability as this requires a high degree of human involvement. In this work, we present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY), which addresses the aforementioned challenges by automating the process of rationale generation. To this end, we leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions. More specifically, we construct automated natural language rationales that embed insights from post hoc explanations to provide corrective signals to LLMs. Extensive experimentation with real-world datasets demonstrates that our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks, including those where prior approaches which rely on human-annotated rationales such as Chain-of-Thought prompting fall short. Our work makes one of the first attempts at highlighting the potential of post hoc explanations as valuable tools for enhancing the effectiveness of LLMs. Furthermore, we conduct additional empirical analyses and ablation studies to demonstrate the impact of each of the components of AMPLIFY, which, in turn, leads to critical insights for refining in-context learning.

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models. (arXiv:2305.13840v2 [cs.CV] UPDATED)

Authors: Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin

Recent advancements in diffusion models have unlocked unprecedented abilities in visual creation. However, current text-to-video generation models struggle with the trade-off among movement range, action coherence and object consistency. To mitigate this issue, we present a controllable text-to-video (T2V) diffusion model, called Control-A-Video, capable of maintaining consistency while supporting customizable video synthesis. Based on a pre-trained conditional text-to-image (T2I) diffusion model, our model aims to generate videos conditioned on a sequence of control signals, such as edge or depth maps. To improve object consistency, Control-A-Video integrates motion priors and content priors into video generation. We propose two motion-adaptive noise initialization strategies, which are based on pixel residual and optical flow, to introduce motion priors from input videos, producing more coherent videos. Moreover, a first-frame conditioned controller is proposed to generate videos from content priors of the first frame, which facilitates the semantic alignment with text and allows longer video generation in an auto-regressive manner. With the proposed architecture and strategies, our model achieves resource-efficient convergence and generates consistent and coherent videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality.

ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation. (arXiv:2306.00971v2 [cs.CV] UPDATED)

Authors: Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong

Personalized text-to-image generation using diffusion models has recently emerged and garnered significant interest. This task learns a novel concept (e.g., a unique toy), illustrated in a handful of images, into a generative model that captures fine visual details and generates photorealistic images based on textual embeddings. In this paper, we present ViCo, a novel lightweight plug-and-play method that seamlessly integrates visual condition into personalized text-to-image generation. ViCo stands out for its unique feature of not requiring any fine-tuning of the original diffusion model parameters, thereby facilitating more flexible and scalable model deployment. This key advantage distinguishes ViCo from most existing models that necessitate partial or full diffusion fine-tuning. ViCo incorporates an image attention module that conditions the diffusion process on patch-wise visual semantics, and an attention-based object mask that comes at no extra cost from the attention module. Despite only requiring light parameter training (~6% compared to the diffusion U-Net), ViCo delivers performance that is on par with, or even surpasses, all state-of-the-art models, both qualitatively and quantitatively. This underscores the efficacy of ViCo, making it a highly promising solution for personalized text-to-image generation without the need for diffusion model fine-tuning. Code:

DeepGraphDMD: Interpretable Spatio-Temporal Decomposition of Non-linear Functional Brain Network Dynamics. (arXiv:2306.03088v2 [cs.AI] UPDATED)

Authors: Md Asadullah Turja, Martin Styner, Guorong Wu

Functional brain dynamics is supported by parallel and overlapping functional network modes that are associated with specific neural circuits. Decomposing these network modes from fMRI data and finding their temporal characteristics is challenging due to their time-varying nature and the non-linearity of the functional dynamics. Dynamic Mode Decomposition (DMD) algorithms have been quite popular for solving this decomposition problem in recent years. In this work, we apply GraphDMD -- an extension of the DMD for network data -- to extract the dynamic network modes and their temporal characteristics from the fMRI time series in an interpretable manner. GraphDMD, however, regards the underlying system as a linear dynamical system that is sub-optimal for extracting the network modes from non-linear functional data. In this work, we develop a generalized version of the GraphDMD algorithm -- DeepGraphDMD -- applicable to arbitrary non-linear graph dynamical systems. DeepGraphDMD is an autoencoder-based deep learning model that learns Koopman eigenfunctions for graph data and embeds the non-linear graph dynamics into a latent linear space. We show the effectiveness of our method in both simulated data and the HCP resting-state fMRI data. In the HCP data, DeepGraphDMD provides novel insights into cognitive brain functions by discovering two major network modes related to fluid and crystallized intelligence.

Preserving privacy in domain transfer of medical AI models comes at no performance costs: The integral role of differential privacy. (arXiv:2306.06503v2 [cs.LG] UPDATED)

Authors: Soroosh Tayebi Arasteh, Mahshad Lotfinia, Teresa Nolte, Marwin Saehn, Peter Isfort, Christiane Kuhl, Sven Nebelung, Georgios Kaissis, Daniel Truhn

Developing robust and effective artificial intelligence (AI) models in medicine requires access to large amounts of patient data. The use of AI models solely trained on large multi-institutional datasets can help with this, yet the imperative to ensure data privacy remains, particularly as membership inference risks breaching patient confidentiality. As a proposed remedy, we advocate for the integration of differential privacy (DP). We specifically investigate the performance of models trained with DP as compared to models trained without DP on data from institutions that the model had not seen during its training (i.e., external validation) - the situation that is reflective of the clinical use of AI models. By leveraging more than 590,000 chest radiographs from five institutions, we evaluated the efficacy of DP-enhanced domain transfer (DP-DT) in diagnosing cardiomegaly, pleural effusion, pneumonia, atelectasis, and in identifying healthy subjects. We juxtaposed DP-DT with non-DP-DT and examined diagnostic accuracy and demographic fairness using the area under the receiver operating characteristic curve (AUC) as the main metric, as well as accuracy, sensitivity, and specificity. Our results show that DP-DT, even with exceptionally high privacy levels (epsilon around 1), performs comparably to non-DP-DT (P>0.119 across all domains). Furthermore, DP-DT led to marginal AUC differences - less than 1% - for nearly all subgroups, relative to non-DP-DT. Despite consistent evidence suggesting that DP models induce significant performance degradation for on-domain applications, we show that off-domain performance is almost not affected. Therefore, we ardently advocate for the adoption of DP in training diagnostic medical AI models, given its minimal impact on performance.

XAI-TRIS: Non-linear image benchmarks to quantify false positive post-hoc attribution of feature importance. (arXiv:2306.12816v2 [cs.LG] UPDATED)

Authors: Benedict Clark, Rick Wilming, Stefan Haufe

The field of 'explainable' artificial intelligence (XAI) has produced highly cited methods that seek to make the decisions of complex machine learning (ML) methods 'understandable' to humans, for example by attributing 'importance' scores to input features. Yet, a lack of formal underpinning leaves it unclear as to what conclusions can safely be drawn from the results of a given XAI method and has also so far hindered the theoretical verification and empirical validation of XAI methods. This means that challenging non-linear problems, typically solved by deep neural networks, presently lack appropriate remedies. Here, we craft benchmark datasets for three different non-linear classification scenarios, in which the important class-conditional features are known by design, serving as ground truth explanations. Using novel quantitative metrics, we benchmark the explanation performance of a wide set of XAI methods across three deep learning model architectures. We show that popular XAI methods are often unable to significantly outperform random performance baselines and edge detection methods. Moreover, we demonstrate that explanations derived from different model architectures can be vastly different; thus, prone to misinterpretation even under controlled conditions.

PAPR: Proximity Attention Point Rendering. (arXiv:2307.11086v2 [cs.CV] UPDATED)

Authors: Yanshu Zhang, Shichong Peng, Alireza Moazeni, Ke Li

Learning accurate and parsimonious point cloud representations of scene surfaces from scratch remains a challenge in 3D representation learning. Existing point-based methods often suffer from the vanishing gradient problem or require a large number of points to accurately model scene geometry and texture. To address these limitations, we propose Proximity Attention Point Rendering (PAPR), a novel method that consists of a point-based scene representation and a differentiable renderer. Our scene representation uses a point cloud where each point is characterized by its spatial position, influence score, and view-independent feature vector. The renderer selects the relevant points for each ray and produces accurate colours using their associated features. PAPR effectively learns point cloud positions to represent the correct scene geometry, even when the initialization drastically differs from the target geometry. Notably, our method captures fine texture details while using only a parsimonious set of points. We also demonstrate four practical applications of our method: zero-shot geometry editing, object manipulation, texture transfer, and exposure control. More results and code are available on our project website at

Relation-Oriented: Toward Causal Knowledge-Aligned AGI. (arXiv:2307.16387v12 [cs.AI] UPDATED)

Authors: Jia Li, Xiang Li

The current relationship modeling paradigm, grounded in the observational i.i.d. assumption, inherently misaligns with our causal knowledge comprehension due to two vital oversights: 1) the unobservable relations, which lead to undetectable hierarchical levels of knowledge, driving the need for model generalizability; 2) the counterfactual relative timings to support our structural causal reasoning, which lead to inherent biases in models under the current Observation-Oriented paradigm. This paper proposes a novel Relation-Oriented framework to reconsider these fundamental questions and unify various confusions surrounding AI-based causal learning, ranging from traditional causal inference to modern language models.

Also, relation-indexed representation learning (RIRL) is raised as a baseline implementation method of the proposed new paradigm, alongside comprehensive experiments demonstrating its efficacy in autonomously identifying dynamical effects in relationship modeling.

GEMRec: Towards Generative Model Recommendation. (arXiv:2308.02205v2 [cs.IR] UPDATED)

Authors: Yuanhe Guo, Haoming Liu, Hongyi Wen

Recommender Systems are built to retrieve relevant items to satisfy users' information needs. The candidate corpus usually consists of a finite set of items that are ready to be served, such as videos, products, or articles. With recent advances in Generative AI such as GPT and Diffusion models, a new form of recommendation task is yet to be explored where items are to be created by generative models with personalized prompts. Taking image generation as an example, with a single prompt from the user and access to a generative model, it is possible to generate hundreds of new images in a few minutes. How shall we attain personalization in the presence of "infinite" items? In this preliminary study, we propose a two-stage framework, namely Prompt-Model Retrieval and Generated Item Ranking, to approach this new task formulation. We release GEMRec-18K, a prompt-model interaction dataset with 18K images generated by 200 publicly-available generative models paired with a diverse set of 90 textual prompts. Our findings demonstrate the promise of generative model recommendation as a novel personalization problem and the limitations of existing evaluation metrics. We highlight future directions for the RecSys community to advance towards generative recommender systems. Our code and dataset are available at

Large Language Models as Optimizers. (arXiv:2309.03409v2 [cs.LG] UPDATED)

Authors: Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen

Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at
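The optimization step described above (build a prompt from previously scored solutions, ask for new candidates, score them, and feed them back) can be sketched as follows on a toy linear regression problem. This is only an illustrative sketch: `stub_llm` is a hypothetical stand-in that randomly perturbs solutions parsed from the prompt, not an actual LLM call, and the prompt wording, function names, and hyperparameters are assumptions rather than anything taken from the paper.

```python
import random
import re

def loss(w, b, data):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def build_meta_prompt(scored, top_k=5):
    """Describe the best solutions so far as text, lowest loss listed last."""
    best = sorted(scored, key=lambda s: s[2], reverse=True)[-top_k:]
    lines = [f"w={w:.4f}, b={b:.4f}, loss={l:.4f}" for w, b, l in best]
    return "Minimize the loss. Previous solutions:\n" + "\n".join(lines)

def stub_llm(prompt, rng, n=8, step=0.5):
    """Hypothetical stand-in for the LLM: perturb solutions read from the prompt."""
    pairs = re.findall(r"w=(-?\d+\.\d+), b=(-?\d+\.\d+)", prompt)
    candidates = []
    for _ in range(n):
        w, b = map(float, rng.choice(pairs))
        candidates.append((w + rng.gauss(0, step), b + rng.gauss(0, step)))
    return candidates

def opro_loop(data, steps=200, seed=0):
    rng = random.Random(seed)
    scored = [(0.0, 0.0, loss(0.0, 0.0, data))]  # initial solution
    for _ in range(steps):
        prompt = build_meta_prompt(scored)
        for w, b in stub_llm(prompt, rng):
            # Evaluate each new candidate and add it to the history
            scored.append((w, b, loss(w, b, data)))
    return min(scored, key=lambda s: s[2])

# Toy data generated from y = 3x + 1
data = [(x, 3.0 * x + 1.0) for x in range(-5, 6)]
initial_loss = loss(0.0, 0.0, data)
w, b, final_loss = opro_loop(data)
```

The only interface between the loop and the proposer is the prompt text, which mirrors the idea that the optimization task and its history are communicated in natural language.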

TSGBench: Time Series Generation Benchmark. (arXiv:2309.03755v2 [cs.LG] UPDATED)

Authors: Yihao Ang, Qiang Huang, Yifan Bao, Anthony K. H. Tung, Zhiyong Huang

Synthetic Time Series Generation (TSG) is crucial in a range of applications, including data augmentation, anomaly detection, and privacy preservation. Although significant strides have been made in this field, existing methods exhibit three key limitations: (1) They often benchmark against similar model types, constraining a holistic view of performance capabilities. (2) The use of specialized synthetic and private datasets introduces biases and hampers generalizability. (3) Ambiguous evaluation measures, often tied to custom networks or downstream tasks, hinder consistent and fair comparison.

To overcome these limitations, we introduce TSGBench, the inaugural Time Series Generation Benchmark, designed for a unified and comprehensive assessment of TSG methods. It comprises three modules: (1) a curated collection of publicly available, real-world datasets tailored for TSG, together with a standardized preprocessing pipeline; (2) a comprehensive evaluation measures suite including vanilla measures, new distance-based assessments, and visualization tools; (3) a pioneering generalization test rooted in Domain Adaptation (DA), compatible with all methods. We have conducted comprehensive experiments using TSGBench across a spectrum of ten real-world datasets from diverse domains, utilizing ten advanced TSG methods and twelve evaluation measures. The results highlight the reliability and efficacy of TSGBench in evaluating TSG methods. Crucially, TSGBench delivers a statistical analysis of the performance rankings of these methods, illuminating their varying performance across different datasets and measures and offering nuanced insights into the effectiveness of each method.

The Importance of Multimodal Emotion Conditioning and Affect Consistency for Embodied Conversational Agents. (arXiv:2309.15311v2 [cs.HC] UPDATED)

Authors: Che-Jui Chang, Samuel S. Sohn, Sen Zhang, Rajath Jayashankar, Muhammad Usman, Mubbasir Kapadia

Previous studies regarding the perception of emotions for embodied virtual agents have shown the effectiveness of using virtual characters in conveying emotions through interactions with humans. However, creating an autonomous embodied conversational agent with expressive behaviors presents two major challenges. The first challenge is the difficulty of synthesizing the conversational behaviors for each modality that are as expressive as real human behaviors. The second challenge is that the affects are modeled independently, which makes it difficult to generate multimodal responses with consistent emotions across all modalities. In this work, we propose a conceptual framework, ACTOR (Affect-Consistent mulTimodal behaviOR generation), that aims to increase the perception of affects by generating multimodal behaviors conditioned on a consistent driving affect. We have conducted a user study with 199 participants to assess how the average person judges the affects perceived from multimodal behaviors that are consistent and inconsistent with respect to a driving affect. The result shows that among all model conditions, our affect-consistent framework receives the highest Likert scores for the perception of driving affects. Our statistical analysis suggests that making a modality affect-inconsistent significantly decreases the perception of driving affects. We also observe that multimodal behaviors conditioned on consistent affects are more expressive compared to behaviors with inconsistent affects. Therefore, we conclude that multimodal emotion conditioning and affect consistency are vital to enhancing the perception of affects for embodied conversational agents.

UPAR: A Kantian-Inspired Prompting Framework for Enhancing Large Language Model Capabilities. (arXiv:2310.01441v2 [cs.CL] UPDATED)

Authors: Hejia Geng, Boxun Xu, Peng Li

Large Language Models (LLMs) have demonstrated impressive inferential capabilities, with numerous research endeavors devoted to enhancing this capacity through prompting. Despite these efforts, a unified epistemological foundation is still conspicuously absent. Drawing inspiration from Kant's a priori philosophy, we propose the UPAR prompting framework, designed to emulate the structure of human cognition within LLMs. The UPAR framework is delineated into four phases: "Understand", "Plan", "Act", and "Reflect", enabling the extraction of structured information from complex contexts, prior planning of solutions, execution according to plan, and self-reflection. This structure significantly augments the explainability and accuracy of LLM inference, producing a human-understandable and inspectable inferential trajectory. Furthermore, our work offers an epistemological foundation for existing prompting techniques, allowing for a possible systematic integration of these methods. With GPT-4, our approach elevates the accuracy from a CoT baseline of 22.92% to 58.33% on a challenging subset of GSM8K, and from 67.91% to 75.40% on the causal judgment task. Without using few-shot examples or external tools, UPAR significantly outperforms existing prompting methods on SCIBENCH, a challenging dataset containing collegiate-level mathematics, chemistry, and physics scientific problems.

Prompting Audios Using Acoustic Properties For Emotion Representation. (arXiv:2310.02298v3 [cs.SD] UPDATED)

Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh

Emotions lie on a continuum, but current models treat emotions as a finite-valued discrete variable. This representation does not capture the diversity in the expression of emotion. To better represent emotions we propose the use of natural language descriptions (or prompts). In this work, we address the challenge of automatically generating these prompts and training a model to better learn emotion representations from audio and prompt pairs. We use acoustic properties that are correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts, i.e., 'acoustic prompts'. We use a contrastive learning objective to map speech to its respective acoustic prompts. We evaluate our model on Emotion Audio Retrieval (EAR) and Speech Emotion Recognition (SER). Our results show that the acoustic prompts significantly improve the model's performance in EAR, in various Precision@K metrics. In SER, we observe a 3.8% relative accuracy improvement on the Ravdess dataset.
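As a rough illustration of how acoustic properties might be turned into a text prompt, the sketch below bins each property into low, medium, or high and joins the results into a sentence. The thresholds, the sentence template, and the names `THRESHOLDS`, `bucket`, and `acoustic_prompt` are all hypothetical choices made for this example and are not taken from the paper.

```python
# Hypothetical (low, high) cut-offs per acoustic property; illustrative only
THRESHOLDS = {
    "pitch": (120.0, 220.0),          # Hz
    "intensity": (55.0, 70.0),        # dB
    "speech rate": (3.0, 5.0),        # syllables per second
    "articulation rate": (4.0, 6.0),  # syllables per second, pauses excluded
}

def bucket(value, low, high):
    """Map a numeric measurement to a coarse verbal level."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"

def acoustic_prompt(features):
    """Build a natural language description from measured acoustic properties."""
    parts = [
        f"{bucket(features[name], *THRESHOLDS[name])} {name}"
        for name in THRESHOLDS
    ]
    return "this person is speaking with " + ", ".join(parts)

prompt = acoustic_prompt({
    "pitch": 250.0,
    "intensity": 60.0,
    "speech rate": 2.5,
    "articulation rate": 5.0,
})
```

Each audio clip paired with a prompt generated this way would then form one positive pair for a contrastive objective.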

Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements. (arXiv:2310.05140v3 [cs.CL] UPDATED)

Authors: Yushan Qian, Wei-Nan Zhang, Ting Liu

Empathetic dialogue is an indispensable part of building harmonious social relationships and contributes to the development of a helpful AI. Previous approaches are mainly based on fine-tuning small-scale language models. With the advent of ChatGPT, the application effect of large language models (LLMs) in this field has attracted great attention. This work empirically investigates the performance of LLMs in generating empathetic responses and proposes three improvement methods: semantically similar in-context learning, two-stage interactive generation, and combination with a knowledge base. Extensive experiments show that LLMs can significantly benefit from our proposed methods and are able to achieve state-of-the-art performance in both automatic and human evaluations. Additionally, we explore the possibility of GPT-4 simulating human evaluators.

Generative Judge for Evaluating Alignment. (arXiv:2310.05470v2 [cs.CL] UPDATED)

Authors: Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu

The rapid development of Large Language Models (LLMs) has substantially expanded the range of tasks they can address. In the field of Natural Language Processing (NLP), researchers have shifted their focus from conventional NLP tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). In this paper, we propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios and accommodates diverse evaluation protocols (e.g., pairwise response comparison and single-response evaluation) with well-structured natural language critiques. To demonstrate the efficacy of our approach, we construct a new testbed covering 58 different scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at

Optimizing K-means for Big Data: A Comparative Study. (arXiv:2310.09819v2 [cs.LG] UPDATED)

Authors: Ravil Mussabayev, Rustam Mussabayev

This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets. The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods. The authors evaluate the performance of these techniques on various benchmark datasets and compare them in terms of speed, quality of clustering, and scalability according to the LIMA dominance criterion. The results show that different techniques are more suitable for different types of datasets and provide insights into the trade-offs between speed and accuracy in K-means clustering for big data. Overall, the paper offers a comprehensive guide for practitioners and researchers on how to optimize K-means for big data applications.

Unveiling the General Intelligence Factor in Language Models: A Psychometric Approach. (arXiv:2310.11616v2 [cs.CL] UPDATED)

Authors: David Ilić

This study uncovers the factor of general intelligence, or g, in language models, extending the psychometric theory traditionally applied to humans and certain animal species. Utilizing factor analysis on two extensive datasets - Open LLM Leaderboard with 1,232 models and General Language Understanding Evaluation (GLUE) Leaderboard with 88 models - we find compelling evidence for a unidimensional, highly stable g factor that accounts for 85% of the variance in model performance. The study also finds a moderate correlation of .49 between model size and g. The discovery of g in language models offers a unified metric for model evaluation and opens new avenues for more robust, g-based model ability assessment. These findings lay the foundation for understanding and future research on artificial general intelligence from a psychometric perspective and have practical implications for model evaluation and development.

BM2CP: Efficient Collaborative Perception with LiDAR-Camera Modalities. (arXiv:2310.14702v2 [cs.CV] UPDATED)

Authors: Binyu Zhao, Wei Zhang, Zhaonian Zou

Collaborative perception enables agents to share complementary perceptual information with nearby agents. This improves perception performance and alleviates the issues of single-view perception, such as occlusion and sparsity. Most existing approaches focus on a single modality (especially LiDAR) and do not fully exploit the superiority of multi-modal perception. We propose a collaborative perception paradigm, BM2CP, which employs LiDAR and camera to achieve efficient multi-modal perception. It utilizes LiDAR-guided modal fusion, cooperative depth generation and modality-guided intermediate fusion to acquire deep interactions among modalities of different agents. Moreover, it can cope with the special case where one of the sensors, of the same or a different type, of any agent is missing. Extensive experiments validate that our approach outperforms the state-of-the-art methods with 50X lower communication volumes in both simulated and real-world autonomous driving scenarios. Our code is available at

CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models. (arXiv:2310.19784v2 [cs.CV] UPDATED)

Authors: Ziyang Yuan, Mingdeng Cao, Xintao Wang, Zhongang Qi, Chun Yuan, Ying Shan

Incorporating a customized object into image generation presents an attractive feature in text-to-image generation. However, existing optimization-based and encoder-based methods are hindered by drawbacks such as time-consuming optimization, insufficient identity preservation, and a prevalent copy-pasting effect. To overcome these limitations, we introduce CustomNet, a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process. This integration facilitates the adjustment of spatial position relationships and viewpoints, yielding diverse outputs while effectively preserving object identity. Moreover, we introduce delicate designs to enable location control and flexible background control through textual descriptions or specific user-defined images, overcoming the limitations of existing 3D novel view synthesis methods. We further leverage a dataset construction pipeline that can better handle real-world objects and complex backgrounds. Equipped with these designs, our method facilitates zero-shot object customization without test-time optimization, offering simultaneous control over the viewpoints, location, and background. As a result, our CustomNet ensures enhanced identity preservation and generates diverse, harmonious outputs.

Efficient LLM Inference on CPUs. (arXiv:2311.00502v2 [cs.LG] UPDATED)

Authors: Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical number of model parameters, which demands large memory capacity and high memory bandwidth. In this paper, we propose an effective approach that makes the deployment of LLMs more efficient. We support an automatic INT4 weight-only quantization flow and design a special LLM runtime with highly-optimized kernels to accelerate LLM inference on CPUs. We demonstrate the general applicability of our approach on popular LLMs including Llama2, Llama, and GPT-NeoX, and showcase the extreme inference efficiency on CPUs. The code is publicly available at:
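
To make the idea of INT4 weight-only quantization concrete, here is a minimal sketch in plain Python. It assumes a symmetric, group-wise scheme with one floating-point scale per group; the abstract does not specify the scheme, and the function names and the group size below are hypothetical choices for illustration, not the authors' implementation.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric group-wise quantization of floats to integers in [-8, 7].

    Assumption: one scale per group, chosen so the largest magnitude maps to 7.
    """
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the signed 4-bit range.
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def dequantize_int4(q_values, scales, group_size=4):
    """Recover approximate float weights from 4-bit integers and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


if __name__ == "__main__":
    w = [0.1, -0.7, 0.35, 0.0, 2.0, -1.0, 0.5, 0.25]
    q, s = quantize_int4(w)
    print(q, s)
    print(dequantize_int4(q, s))
```

Only the weights are quantized in this sketch; activations stay in floating point, which is what "weight-only" refers to. The per-element reconstruction error is bounded by half the group's scale.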

A New Fine-grained Alignment Method for Image-text Matching. (arXiv:2311.02183v2 [cs.CV] UPDATED)

Authors: Yang Zhang

Image-text retrieval is a widely studied topic in the field of computer vision due to the exponential growth of multimedia data, whose core concept is to measure the similarity between images and text. However, most existing retrieval methods heavily rely on cross-attention mechanisms for cross-modal fine-grained alignment, which takes into account excessive irrelevant regions and treats prominent and non-significant words equally, thereby limiting retrieval accuracy. This paper aims to investigate an alignment approach that reduces the involvement of non-significant fragments in images and text while enhancing the alignment of prominent segments. For this purpose, we introduce the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which achieves improved retrieval accuracy by diminishing the participation of irrelevant regions during alignment and relatively increasing the alignment similarity of prominent words. Additionally, we incorporate prior textual information into image regions to reduce misalignment occurrences. In practice, we first design a novel intra-modal fragments relationship reasoning method, and subsequently employ our proposed alignment mechanism to compute the similarity between images and text. Extensive quantitative comparative experiments on MS-COCO and Flickr30K datasets demonstrate that our approach outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.

ExpM+NF Tractable Exponential Mechanism via Normalizing Flow, A Path through the Accuracy-Privacy Ceiling Constraining Differentially Private ML. (arXiv:2311.09200v2 [stat.ML] UPDATED)

Authors: Robert A. Bridges, Vandy J. Tombs, Christopher B. Stanley

The Exponential Mechanism (ExpM), a differentially private optimization method, promises many advantages over Differentially Private Stochastic Gradient Descent (DPSGD), the state-of-the-art (SOTA) and de facto method for differentially private machine learning (ML). Yet, ExpM has been historically stymied from differentially private training of modern ML algorithms by two obstructions: ExpM requires a sensitivity bound for the given loss function; ExpM requires sampling from a historically intractable density. We prove a sensitivity bound for $\ell(2)$ loss, and investigate using Normalizing Flows (NFs), deep networks furnishing approximate sampling from the otherwise intractable ExpM distribution. We prove that as the NF output converges to the ExpM distribution, the privacy ($\varepsilon$) of an NF sample converges to that of the ExpM distribution. Under the assumption that the NF output distribution is the ExpM distribution, we empirically test ExpM+NF against DPSGD using the SOTA implementation (Opacus with PRV accounting) in multiple classification tasks on the Adult Dataset (census data) and MIMIC-III Dataset (healthcare records) using Logistic Regression and GRU-D, a deep learning recurrent neural network with $\sim$20K-100K parameters. In all experiments we find ExpM+NF achieves greater than 94% of the non-private training accuracy (AUC) with $\varepsilon$-DP for $\varepsilon$ as low as $1\mathrm{e}{-3}$ -- three orders of magnitude stronger privacy with similar accuracy. Further, performance results show ExpM+NF training time is comparable to (slightly less than) that of DPSGD. Limitations and future directions are provided; notably, research on NF approximation accuracy and its effect on privacy is a promising avenue to substantially advancing the field. Code for these experiments will be provided after review.
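
For readers unfamiliar with the exponential mechanism, the following minimal sketch shows its textbook form over a finite candidate set: a candidate is drawn with probability proportional to exp(epsilon * utility / (2 * sensitivity)). This is not the paper's ExpM+NF method, which targets a continuous density approximated by a normalizing flow; the function names are hypothetical, and treating utility as negative loss is an assumption made for illustration.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(epsilon * (u - top) / (2.0 * sensitivity)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def sample_exponential_mechanism(candidates, utilities, epsilon, sensitivity, rng):
    """Draw one candidate according to the exponential-mechanism distribution."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical candidates scored by negative loss (higher is better).
    candidates = ["theta_a", "theta_b", "theta_c"]
    utilities = [-3.0, -1.0, -2.0]
    rng = random.Random(0)
    print(exponential_mechanism_probs(utilities, epsilon=1.0, sensitivity=1.0))
    print(sample_exponential_mechanism(candidates, utilities, 1.0, 1.0, rng))
```

Smaller values of epsilon flatten the distribution toward uniform (more privacy, less accuracy), while larger values concentrate it on the best-scoring candidate.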

Video Face Re-Aging: Toward Temporally Consistent Face Re-Aging. (arXiv:2311.11642v2 [cs.CV] UPDATED)

Authors: Abdul Muqeet, Kyuchul Lee, Bumsoo Kim, Yohan Hong, Hyungrae Lee, Woonggon Kim, KwangHee Lee

Video face re-aging deals with altering the apparent age of a person to the target age in videos. This problem is challenging due to the lack of paired video datasets maintaining temporal consistency in identity and age. Most re-aging methods process each image individually without considering the temporal consistency of videos. While some existing works address the issue of temporal coherence through video facial attribute manipulation in latent space, they often fail to deliver satisfactory performance in age transformation. To tackle these issues, we propose (1) a novel synthetic video dataset that features subjects across a diverse range of age groups; (2) a baseline architecture designed to validate the effectiveness of our proposed dataset; and (3) three novel metrics tailored explicitly for evaluating the temporal consistency of video re-aging techniques. Our comprehensive experiments on public datasets, such as VFHQ and CelebV-HQ, show that our method outperforms the existing approaches in terms of both age transformation and temporal consistency.

RDF Stream Taxonomy: Systematizing RDF Stream Types in Research and Practice. (arXiv:2311.14540v2 [cs.DB] UPDATED)

Authors: Piotr Sowinski, Pawel Szmeja, Maria Ganzha, Marcin Paprzycki

Over the years, RDF streaming was explored in research and practice from many angles, resulting in a wide range of RDF stream definitions. This variety presents a major challenge in discussing and integrating streaming solutions, due to the lack of a common language. This work attempts to address this critical research gap, by systematizing RDF stream types present in the literature in a novel taxonomy. The proposed RDF Stream Taxonomy (RDF-STaX) is embodied in an OWL 2 DL ontology that follows the FAIR principles, making it readily applicable in practice. Extensive documentation and additional resources are provided, to foster the adoption of the ontology. Two realized use cases are presented, demonstrating the usefulness of the resource in discussing research works and annotating streaming datasets. Another result of this contribution is the novel nanopublications dataset, which serves as a collaborative, living state-of-the-art review of RDF streaming. The aim of RDF-STaX is to address a real need of the community for a better way to systematize and describe RDF streams. The resource is designed to help drive innovation in RDF streaming, by fostering scientific discussion, cooperation, and tool interoperability.

From Prediction to Action: Critical Role of Performance Estimation for Machine-Learning-Driven Materials Discovery. (arXiv:2311.15549v2 [cond-mat.mtrl-sci] UPDATED)

Authors: Mario Boley, Felix Luong, Simon Teshuva, Daniel F Schmidt, Lucas Foppa, Matthias Scheffler

Materials discovery driven by statistical property models is an iterative decision process, during which an initial data collection is extended with new data proposed by a model-informed acquisition function--with the goal to maximize a certain "reward" over time, such as the maximum property value discovered so far. While the materials science community achieved much progress in developing property models that predict well on average with respect to the training distribution, this form of in-distribution performance measurement is not directly coupled with the discovery reward. This is because an iterative discovery process has a shifting reward distribution that is over-proportionally determined by the model performance for exceptional materials. We demonstrate this problem using the example of bulk modulus maximization among double perovskite oxides. We find that the in-distribution predictive performance suggests random forests as superior to Gaussian process regression, while the results are reversed in terms of the discovery rewards. We argue that the lack of proper performance estimation methods from pre-computed data collections is a fundamental problem for improving data-driven materials discovery, and we propose a novel such estimator that, in contrast to naïve reward estimation, successfully predicts Gaussian processes with the "expected improvement" acquisition function as the best out of four options in our demonstrational study for double perovskites. Importantly, it does so without requiring the more than one thousand ab initio computations that were needed to confirm this prediction.
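
The "expected improvement" acquisition function mentioned above has a standard closed form when the model's prediction at a candidate is Gaussian. The sketch below implements that textbook formula for a maximization problem; the function name and arguments are hypothetical, and it does not reproduce the performance estimator proposed in the paper.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Closed-form expected improvement for maximization under a Gaussian prediction.

    EI = (mean - best) * Phi(z) + std * phi(z), with z = (mean - best) / std.
    """
    if std <= 0.0:
        # Without predictive uncertainty the improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


if __name__ == "__main__":
    # Two hypothetical candidates with the same predicted mean:
    # the more uncertain one has the larger expected improvement.
    print(expected_improvement(mean=0.0, std=1.0, best_so_far=0.0))
    print(expected_improvement(mean=0.0, std=2.0, best_so_far=0.0))
```

The formula rewards both a high predicted mean and a high predictive uncertainty, which is why a model's calibration on exceptional candidates matters more for discovery than its average in-distribution error.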

LLMs for Science: Usage for Code Generation and Data Analysis. (arXiv:2311.16733v3 [cs.SE] UPDATED)

Authors: Mohamed Nejjar, Luca Zacharias, Fabian Stiehle, Ingo Weber

Large language models (LLMs) have been touted to enable increased productivity in many areas of today's work life. Scientific research as an area of work is no exception: the potential of LLM-based tools to assist in the daily work of scientists has become a highly discussed topic across disciplines. However, we are only at the very onset of this subject of study. It is still unclear how the potential of LLMs will materialise in research practice. With this study, we give first empirical evidence on the use of LLMs in the research process. We have investigated a set of use cases for LLM-based tools in scientific research, and conducted a first study to assess to which degree current tools are helpful. In this paper we report specifically on use cases related to software engineering, such as generating application code and developing scripts for data analytics. While we studied seemingly simple use cases, results across tools differ significantly. Our results highlight the promise of LLM-based tools in general, yet we also observe various issues, particularly regarding the integrity of the output these tools provide.

A Case for Competent AI Systems - A Concept Note. (arXiv:2312.00052v2 [cs.CY] UPDATED)

Authors: Kamalakar Karlapalem

The efficiency of an AI system is contingent upon its ability to align with the specified requirements of a given task. However, the inherent complexity of tasks often introduces the potential for harmful implications or adverse actions. This note explores the critical concept of capability within AI systems, representing what the system is expected to deliver. The articulation of capability involves specifying well-defined outcomes. Yet, the achievement of this capability may be hindered by deficiencies in implementation and testing, reflecting a gap in the system's competency (what it can do vs. what it does successfully).

A central challenge arises in elucidating the competency of an AI system to execute tasks effectively. The exploration of system competency in AI remains in its early stages, occasionally manifesting as confidence intervals denoting the probability of success. Trust in an AI system hinges on the explicit modeling and detailed specification of its competency, connected intricately to the system's capability. This note explores this gap by proposing a framework for articulating the competency of AI systems.

Motivated by practical scenarios such as the Glass Door problem, where an individual inadvertently encounters a glass obstacle due to a failure in their competency, this research underscores the imperative of delving into competency dynamics. Bridging the gap between capability and competency at a detailed level, this note contributes to advancing the discourse on bolstering the reliability of AI systems in real-world applications.

On the Interplay Between Stepsize Tuning and Progressive Sharpening. (arXiv:2312.00209v2 [cs.LG] UPDATED)

Authors: Vincent Roulet, Atish Agarwala, Fabian Pedregosa

Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Cohen et al., 2022). We investigate empirically how the sharpness evolves when using stepsize-tuners, the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch may be well explained by its tendency to ever-increase the sharpness of the objective in the full or large batch regimes. On the other hand, we observe that Polyak stepsizes operate generally at the edge of stability or even slightly beyond, while outperforming their Armijo and constant-stepsize counterparts. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness.
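
The two stepsize tuners named above can be sketched on a toy problem. The block below uses the textbook definitions: the Polyak stepsize (f(x) - f*) / ||g||^2 and Armijo backtracking, which shrinks the stepsize until f(x - eta * g) <= f(x) - c * eta * ||g||^2. The diagonal quadratic objective, the constants, and the function names are assumptions chosen for illustration; the paper studies these tuners on deep networks.

```python
def objective(x, curvatures):
    """Toy diagonal quadratic 0.5 * sum(c_i * x_i^2), with minimum value 0."""
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))


def gradient(x, curvatures):
    return [c * xi for c, xi in zip(curvatures, x)]


def polyak_step(x, curvatures, f_star=0.0):
    """One gradient step with the Polyak stepsize (f(x) - f*) / ||g||^2."""
    g = gradient(x, curvatures)
    g_sq = sum(gi * gi for gi in g)
    if g_sq == 0.0:
        return list(x), 0.0
    eta = (objective(x, curvatures) - f_star) / g_sq
    return [xi - eta * gi for xi, gi in zip(x, g)], eta


def armijo_step(x, curvatures, eta0=1.0, c=1e-4, shrink=0.5):
    """One gradient step with backtracking until the Armijo condition holds."""
    g = gradient(x, curvatures)
    g_sq = sum(gi * gi for gi in g)
    fx = objective(x, curvatures)
    eta = eta0
    while objective([xi - eta * gi for xi, gi in zip(x, g)], curvatures) > fx - c * eta * g_sq:
        eta *= shrink
    return [xi - eta * gi for xi, gi in zip(x, g)], eta


if __name__ == "__main__":
    curvatures = [1.0, 10.0]  # the sharpness of this toy problem is 10
    x = [1.0, 1.0]
    for _ in range(5):
        x, eta = polyak_step(x, curvatures)
        print(eta, objective(x, curvatures))
```

On this quadratic the sharpness is fixed, so the sketch only shows how each rule picks a stepsize from local quantities; the joint dynamics of stepsize and sharpness discussed in the abstract arise only when the curvature itself changes during training.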

Mark My Words: Analyzing and Evaluating Language Model Watermarks. (arXiv:2312.00273v2 [cs.CR] UPDATED)

Authors: Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, David Wagner

The capabilities of large language models have grown significantly in recent years and so too have concerns about their misuse. In this context, the ability to distinguish machine-generated text from human-authored content becomes important. Prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on text watermarking techniques - as opposed to image watermarks - and proposes MARKMYWORDS, a comprehensive benchmark for them under different tasks as well as practical attacks. We focus on three main metrics: quality, size (e.g. the number of tokens needed to detect a watermark), and tamper-resistance. Current watermarking techniques are good enough to be deployed: Kirchenbauer et al. [1] can watermark Llama2-7B-chat with no perceivable loss in quality, the watermark can be detected with fewer than 100 tokens, and the scheme offers good tamper-resistance to simple attacks. We argue that watermark indistinguishability, a criterion emphasized in some prior works, is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. We publicly release our benchmark (
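
To illustrate why a watermark's "size" is measured in tokens, here is a simplified sketch of detection in the style of a green-list scheme such as that of Kirchenbauer et al. The detector counts how many tokens fall in a pseudo-random "green" set determined by the previous token and computes a z-score against the fraction gamma expected by chance. The hash construction and function names are assumptions for illustration; this is neither the published scheme's exact partitioning nor the benchmark's code.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5):
    """Pseudo-randomly assign `token` to the green list, keyed on the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return value < gamma


def watermark_z_score(tokens, gamma=0.5):
    """z = (green_count - gamma * n) / sqrt(n * gamma * (1 - gamma)) over n scored tokens."""
    n = len(tokens) - 1  # the first token has no predecessor and is not scored
    if n <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(tokens[i - 1], tokens[i], gamma) for i in range(1, len(tokens)))
    return (green - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


if __name__ == "__main__":
    # Hypothetical token ids; unwatermarked text should score near zero.
    print(watermark_z_score(list(range(200))))
```

Because the z-score grows roughly with the square root of the number of scored tokens when most tokens are green, a fixed detection threshold translates into a minimum text length, which is the quantity the size metric captures.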

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model. (arXiv:2312.02238v2 [cs.CV] UPDATED)

Authors: Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou

We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features are used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.

WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words. (arXiv:2312.02931v2 [cs.CL] UPDATED)

Authors: Lukas Wolf, Greta Tuckute, Klemen Kotar, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt

Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can improve the quality and efficiency of these systems as well. We focus on text--audio and introduce WhisBERT, which is inspired by the text--image approach of FLAVA (Singh et al., 2022). In accordance with BabyLM guidelines (Warstadt et al., 2023), we pretrain WhisBERT on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021). To assess the impact of multimodality, we compare versions of the model that are trained on text only and on both audio and text simultaneously. We find that while WhisBERT is able to perform well on multimodal masked modeling and surpasses the BabyLM baselines in most benchmark tasks, it struggles to optimize its complex objective and outperform its text-only WhisBERT baseline.

Evaluating Agents using Social Choice Theory. (arXiv:2312.03121v2 [cs.AI] UPDATED)

Authors: Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li, Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, Anna Koop

We argue that many general evaluation problems can be viewed through the lens of voting theory. Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation. By viewing the aggregator as a social welfare function, we are able to leverage centuries of research in social choice theory to derive principled evaluation frameworks with axiomatic foundations. These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation. We apply this Voting-as-Evaluation (VasE) framework across multiple settings, including reinforcement learning, large language models, and humans. In practice, we observe that VasE can be more robust than popular evaluation frameworks (Elo and Nash averaging), discovers properties in the evaluation data not evident from scores alone, and can predict outcomes better than Elo in a complex seven-player game. We identify one particular approach, maximal lotteries, that satisfies important consistency properties relevant to evaluation, is computationally efficient (polynomial in the size of the evaluation data), and identifies game-theoretic cycles.
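
The idea of treating each task as a voter can be sketched with a simple pairwise-majority rule. The block below uses Copeland scoring (one point per pairwise majority win, half a point per tie), which is a stand-in chosen for brevity: it is not the maximal lotteries method highlighted in the abstract, and the function names and scores are hypothetical.

```python
from itertools import combinations


def pairwise_wins(scores):
    """Count, for each ordered pair (a, b), the tasks on which a scores strictly above b.

    `scores` maps task name -> {agent name -> score}; only the ordering per task matters.
    """
    agents = sorted(next(iter(scores.values())))
    wins = {(a, b): 0 for a in agents for b in agents if a != b}
    for task_scores in scores.values():
        for a, b in combinations(agents, 2):
            if task_scores[a] > task_scores[b]:
                wins[(a, b)] += 1
            elif task_scores[b] > task_scores[a]:
                wins[(b, a)] += 1
    return agents, wins


def copeland_ranking(scores):
    """Rank agents by Copeland score: 1 per pairwise majority win, 0.5 per tie."""
    agents, wins = pairwise_wins(scores)
    points = {a: 0.0 for a in agents}
    for a, b in combinations(agents, 2):
        if wins[(a, b)] > wins[(b, a)]:
            points[a] += 1.0
        elif wins[(b, a)] > wins[(a, b)]:
            points[b] += 1.0
        else:
            points[a] += 0.5
            points[b] += 0.5
    return sorted(agents, key=lambda a: (-points[a], a)), points


if __name__ == "__main__":
    # Hypothetical per-task scores; each task acts as one voter.
    scores = {
        "task1": {"A": 0.9, "B": 0.5, "C": 0.1},
        "task2": {"A": 0.2, "B": 0.6, "C": 0.4},
        "task3": {"A": 0.8, "B": 0.7, "C": 0.6},
    }
    print(copeland_ranking(scores))
```

Because only per-task orderings are used, the result does not change if one task's scores are rescaled, which is one reason ordinal aggregation can be more robust than averaging raw scores across tasks.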

Intrinsic Harmonization for Illumination-Aware Compositing. (arXiv:2312.03698v2 [cs.CV] UPDATED)

Authors: Chris Careaga, S. Mahdi H. Miangoleh, Yağız Aksoy

Despite significant advancements in network-based image harmonization techniques, there still exists a domain disparity between typical training pairs and real-world composites encountered during inference. Most existing methods are trained to reverse global edits made on segmented image regions, and therefore fail to accurately capture the lighting inconsistencies between the foreground and background found in composited images. In this work, we introduce a self-supervised illumination harmonization approach formulated in the intrinsic image domain. First, we estimate a simple global lighting model from mid-level vision representations to generate a rough shading for the foreground region. A network then refines this inferred shading to generate a harmonious re-shading that aligns with the background scene. In order to match the color appearance of the foreground and background, we utilize ideas from prior harmonization approaches to perform parameterized image edits in the albedo domain. To validate the effectiveness of our approach, we present results from challenging real-world composites and conduct a user study to objectively measure the enhanced realism achieved compared to state-of-the-art harmonization methods.