new Building A Knowledge Graph to Enrich ChatGPT Responses in Manufacturing Service Discovery

Authors: Yunqing Li, Binil Starly

Abstract: Sourcing and identification of new manufacturing partners is crucial for manufacturing system integrators to enhance agility and reduce risk through supply chain diversification in the global economy. The advent of advanced large language models has captured significant interest, due to their ability to generate comprehensive and articulate responses across a wide range of knowledge domains. However, the system often falls short in accuracy and completeness when responding to domain-specific inquiries, particularly in areas like manufacturing service discovery. This research explores the potential of leveraging Knowledge Graphs in conjunction with ChatGPT to streamline the process for prospective clients in identifying small manufacturing enterprises. In this study, we propose a method that integrates bottom-up ontology with advanced machine learning models to develop a Manufacturing Service Knowledge Graph from an array of structured and unstructured data sources, including the digital footprints of small-scale manufacturers throughout North America. The Knowledge Graph and the learned graph embedding vectors are leveraged to tackle intricate queries within the digital supply chain network, responding with enhanced reliability and greater interpretability. The approach highlighted is scalable to millions of entities that can be distributed to form a global Manufacturing Service Knowledge Network Graph that can potentially interconnect multiple types of Knowledge Graphs that span industry sectors, geopolitical boundaries, and business domains. The dataset developed for this study, now publicly accessible, encompasses more than 13,000 manufacturers' weblinks, manufacturing services, certifications, and location entity types.

new GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Authors: Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi

Abstract: The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more effective user interaction with robots. To facilitate this goal, we propose GOAT-Bench, a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities, the role of explicit and implicit scene memories, their robustness to noise in goal specifications, and the impact of memory in lifelong scenarios.

new Causal Unit Selection using Tractable Arithmetic Circuits

Authors: Haiying Huang, Adnan Darwiche

Abstract: The unit selection problem aims to find objects, called units, that optimize a causal objective function which describes the objects' behavior in a causal context (e.g., selecting customers who are about to churn but would most likely change their mind if encouraged). While early studies focused mainly on bounding a specific class of counterfactual objective functions using data, more recent work allows one to find optimal units exactly by reducing the causal objective to a classical objective on a meta-model, and then applying a variant of the classical Variable Elimination (VE) algorithm to the meta-model -- assuming a fully specified causal model is available. In practice, however, finding optimal units using this approach can be very expensive because the used VE algorithm must be exponential in the constrained treewidth of the meta-model, which is larger and denser than the original model. We address this computational challenge by introducing a new approach for unit selection that is not necessarily limited by the constrained treewidth. This is done through compiling the meta-model into a special class of tractable arithmetic circuits that allows the computation of optimal units in time linear in the circuit size. We finally present empirical results on random causal models that show order-of-magnitude speedups based on the proposed method for solving unit selection.

new A Survey on the Integration of Generative AI for Critical Thinking in Mobile Networks

Authors: Athanasios Karapantelakis, Alexandros Nikou, Ajay Kattepur, Jean Martins, Leonid Mokrushin, Swarup Kumar Mohalik, Marin Orlic, Aneta Vulgarakis Feljan

Abstract: In the near future, mobile networks are expected to broaden their services and coverage to accommodate a larger user base and diverse user needs. Thus, they will increasingly rely on artificial intelligence (AI) to manage network operation and control costs, undertaking complex decision-making roles. This shift will necessitate the application of techniques that incorporate critical thinking abilities, including reasoning and planning. Symbolic AI techniques already facilitate critical thinking based on existing knowledge. Yet, their use in telecommunications is hindered by the high cost of mostly manual curation of this knowledge and high computational complexity of reasoning tasks. At the same time, there is a spurt of innovations in industries such as telecommunications due to Generative AI (GenAI) technologies, operating independently of human-curated knowledge. However, their capacity for critical thinking remains uncertain. This paper aims to address this gap by examining the current status of GenAI algorithms with critical thinking capabilities and investigating their potential applications in telecom networks. Specifically, the aim of this study is to offer an introduction to the potential utilization of GenAI for critical thinking techniques in mobile networks, while also establishing a foundation for future research.

new Towards a Game-theoretic Understanding of Explanation-based Membership Inference Attacks

Authors: Kavita Kumari, Murtuza Jadliwala, Sumit Kumar Jha, Anindya Maiti

Abstract: Model explanations improve the transparency of black-box machine learning (ML) models and their decisions; however, they can also be exploited to carry out privacy threats such as membership inference attacks (MIA). Existing works have only analyzed MIA in a single "what if" interaction scenario between an adversary and the target ML model; thus, it does not discern the factors impacting the capabilities of an adversary in launching MIA in repeated interaction settings. Additionally, these works rely on assumptions about the adversary's knowledge of the target model's structure and, thus, do not guarantee the optimality of the predefined threshold required to distinguish the members from non-members. In this paper, we delve into the domain of explanation-based threshold attacks, where the adversary endeavors to carry out MIA attacks by leveraging the variance of explanations through iterative interactions with the system comprising of the target ML model and its corresponding explanation method. We model such interactions by employing a continuous-time stochastic signaling game framework. In our framework, an adversary plays a stopping game, interacting with the system (having imperfect information about the type of an adversary, i.e., honest or malicious) to obtain explanation variance information and computing an optimal threshold to determine the membership of a datapoint accurately. First, we propose a sound mathematical formulation to prove that such an optimal threshold exists, which can be used to launch MIA. Then, we characterize the conditions under which a unique Markov perfect equilibrium (or steady state) exists in this dynamic system. By means of a comprehensive set of simulations of the proposed game model, we assess different factors that can impact the capability of an adversary to launch MIA in such repeated interaction settings.

new Zero-shot Logical Query Reasoning on any Knowledge Graph

Authors: Mikhail Galkin, Jincheng Zhou, Bruno Ribeiro, Jian Tang, Zhaocheng Zhu

Abstract: Complex logical query answering (CLQA) in knowledge graphs (KGs) goes beyond simple KG completion and aims at answering compositional queries comprised of multiple projections and logical operations. Existing CLQA methods that learn parameters bound to certain entity or relation vocabularies can only be applied to the graph they are trained on which requires substantial training time before being deployed on a new graph. Here we present UltraQuery, an inductive reasoning model that can zero-shot answer logical queries on any KG. The core idea of UltraQuery is to derive both projections and logical operations as vocabulary-independent functions which generalize to new entities and relations in any KG. With the projection operation initialized from a pre-trained inductive KG reasoning model, UltraQuery can solve CLQA on any KG even if it is only finetuned on a single dataset. Experimenting on 23 datasets, UltraQuery in the zero-shot inference mode shows competitive or better query answering performance than best available baselines and sets a new state of the art on 14 of them.

cross Best Response Shaping

Authors: Milad Aghajohari, Tim Cooijmans, Juan Agustin Duque, Shunichi Akatsuka, Aaron Courville

Abstract: We investigate the challenge of multi-agent deep reinforcement learning in partially competitive environments, where traditional methods struggle to foster reciprocity-based cooperation. LOLA and POLA agents learn reciprocity-based cooperative policies by differentiation through a few look-ahead optimization steps of their opponent. However, there is a key limitation in these techniques. Because they consider a few optimization steps, a learning opponent that takes many steps to optimize its return may exploit them. In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent approximating the best response, termed the "detective." To condition the detective on the agent's policy for complex games we propose a state-aware differentiable conditioning mechanism, facilitated by a question answering (QA) method that extracts a representation of the agent based on its behaviour on specific environment states. To empirically validate our method, we showcase its enhanced performance against a Monte Carlo Tree Search (MCTS) opponent, which serves as an approximation to the best response in the Coin Game. This work expands the applicability of multi-agent RL in partially competitive environments and provides a new pathway towards achieving improved social welfare in general sum games.

cross An Enhanced Grey Wolf Optimizer with Elite Inheritance and Balance Search Mechanisms

Authors: Jianhua Jiang, Ziying Zhao, Weihua Li, Keqin Li

Abstract: The Grey Wolf Optimizer (GWO) is recognized as a novel meta-heuristic algorithm inspired by the social leadership hierarchy and hunting mechanism of grey wolves. It is well-known for its simple parameter setting, fast convergence speed, and strong optimization capability. In the original GWO, there are two significant design flaws in its fundamental optimization mechanisms. Problem (1): the algorithm fails to inherit from elite positions from the last iteration when generating the next positions of the wolf population, potentially leading to suboptimal solutions. Problem (2): the positions of the population are updated based on the central position of the three leading wolves (alpha, beta, delta), without a balanced mechanism between local and global search. To tackle these problems, an enhanced Grey Wolf Optimizer with Elite Inheritance Mechanism and Balance Search Mechanism, named as EBGWO, is proposed to improve the effectiveness of the position updating and the quality of the convergence solutions. The IEEE CEC 2014 benchmark functions suite and a series of simulation tests are employed to evaluate the performance of the proposed algorithm. The simulation tests involve a comparative study between EBGWO, three GWO variants, GWO and two well-known meta-heuristic algorithms. The experimental results demonstrate that the proposed EBGWO algorithm outperforms other meta-heuristic algorithms in both accuracy and convergence speed. Three engineering optimization problems are adopted to prove its capability in processing real-world problems. The results indicate that the proposed EBGWO outperforms several popular algorithms.

cross Emergent Braitenberg-style Behaviours for Navigating the ViZDoom `My Way Home' Labyrinth

Authors: Caleidgh Bayer, Robert J. Smith, Malcolm I. Heywood

Abstract: The navigation of complex labyrinths with tens of rooms under visual partially observable state is typically addressed using recurrent deep reinforcement learning architectures. In this work, we show that navigation can be achieved through the emergent evolution of a simple Braitentberg-style heuristic that structures the interaction between agent and labyrinth, i.e. complex behaviour from simple heuristics. To do so, the approach of tangled program graphs is assumed in which programs cooperatively coevolve to develop a modular indexing scheme that only employs 0.8\% of the state space. We attribute this simplicity to several biases implicit in the representation, such as the use of pixel indexing as opposed to deploying a convolutional kernel or image processing operators.

cross Learning Strategies For Successful Crowd Navigation

Authors: Rajshree Daulatabad, Serena Nath

Abstract: Teaching autonomous mobile robots to successfully navigate human crowds is a challenging task. Not only does it require planning, but it requires maintaining social norms which may differ from one context to another. Here we focus on crowd navigation, using a neural network to learn specific strategies in-situ with a robot. This allows us to take into account human behavior and reactions toward a real robot as well as learn strategies that are specific to various scenarios in that context. A CNN takes a top-down image of the scene as input and outputs the next action for the robot to take in terms of speed and angle. Here we present the method, experimental results, and quantitatively evaluate our approach.

cross Less is More for Improving Automatic Evaluation of Factual Consistency

Authors: Tong Wang, Ninad Kulkarni, Yanjun Qi

Abstract: Assessing the factual consistency of automatically generated texts in relation to source context is crucial for developing reliable natural language generation applications. Recent literature proposes AlignScore which uses a unified alignment model to evaluate factual consistency and substantially outperforms previous methods across many benchmark tasks. In this paper, we take a closer look of datasets used in AlignScore and uncover an unexpected finding: utilizing a smaller number of data points can actually improve performance. We process the original AlignScore training dataset to remove noise, augment with robustness-enhanced samples, and utilize a subset comprising 10\% of the data to train an improved factual consistency evaluation model, we call LIM-RA (Less Is More for Robust AlignScore). LIM-RA demonstrates superior performance, consistently outperforming AlignScore and other strong baselines like ChatGPT across four benchmarks (two utilizing traditional natural language generation datasets and two focused on large language model outputs). Our experiments show that LIM-RA achieves the highest score on 24 of the 33 test datasets, while staying competitive on the rest, establishing the new state-of-the-art benchmarks.

cross Spatially Optimized Compact Deep Metric Learning Model for Similarity Search

Authors: Md. Farhadul Islam, Md. Tanzim Reza, Meem Arafat Manab, Mohammad Rakibul Hasan Mahin, Sarah Zabeen, Jannatun Noor

Abstract: Spatial optimization is often overlooked in many computer vision tasks. Filters should be able to recognize the features of an object regardless of where it is in the image. Similarity search is a crucial task where spatial features decide an important output. The capacity of convolution to capture visual patterns across various locations is limited. In contrast to convolution, the involution kernel is dynamically created at each pixel based on the pixel value and parameters that have been learned. This study demonstrates that utilizing a single layer of involution feature extractor alongside a compact convolution model significantly enhances the performance of similarity search. Additionally, we improve predictions by using the GELU activation function rather than the ReLU. The negligible amount of weight parameters in involution with a compact model with better performance makes the model very useful in real-world implementations. Our proposed model is below 1 megabyte in size. We have experimented with our proposed methodology and other models on CIFAR-10, FashionMNIST, and MNIST datasets. Our proposed method outperforms across all three datasets.

cross FMDA-OT: Federated Multi-source Domain Adaptation Through Optimal Transport

Authors: Omar Ghannou, Youn\`es Bennani

Abstract: Multi-source Domain Adaptation (MDA) aims to adapt models trained on multiple labeled source domains to an unlabeled target domain. In this paper, we introduce our approach as a collaborative MDA framework, which comprises two adaptation phases. Firstly, we conduct domain adaptation for each source individually with the target, utilizing optimal transport. Then, in the second phase, which constitutes the final part of the framework, we design the architecture of centralized federated learning to collaborate the N models representing the N sources. This architecture offers the advantage of using the sources without accessing their data, thus resolving data privacy issues inherent in domain adaptation. Additionally, during this phase, the server guides and fine-tunes the adaptation using a small number of pseudo-labeled samples available in the target domain, referred to as the target validation subset of the dataset.

cross Counting Objects in a Robotic Hand

Authors: Francis Tsow, Tianze Chen, Yu Sun

Abstract: A robot performing multi-object grasping needs to sense the number of objects in the hand after grasping. The count plays an important role in determining the robot's next move and the outcome and efficiency of the whole pick-place process. This paper presents a data-driven contrastive learning-based counting classifier with a modified loss function as a simple and effective approach for object counting despite significant occlusion challenges caused by robotic fingers and objects. The model was validated against other models with three different common shapes (spheres, cylinders, and cubes) in simulation and in a real setup. The proposed contrastive learning-based counting approach achieved above 96\% accuracy for all three objects in the real setup.

cross Evolving Loss Functions for Specific Image Augmentation Techniques

Authors: Brandon Morgan, Dean Hougen

Abstract: Previous work in Neural Loss Function Search (NLFS) has shown a lack of correlation between smaller surrogate functions and large convolutional neural networks with massive regularization. We expand upon this research by revealing another disparity that exists, correlation between different types of image augmentation techniques. We show that different loss functions can perform well on certain image augmentation techniques, while performing poorly on others. We exploit this disparity by performing an evolutionary search on five types of image augmentation techniques in the hopes of finding image augmentation specific loss functions. The best loss functions from each evolution were then taken and transferred to WideResNet-28-10 on CIFAR-10 and CIFAR-100 across each of the five image augmentation techniques. The best from that were then taken and evaluated by fine-tuning EfficientNetV2Small on the CARS, Oxford-Flowers, and Caltech datasets across each of the five image augmentation techniques. Multiple loss functions were found that outperformed cross-entropy across multiple experiments. In the end, we found a single loss function, which we called the inverse bessel logarithm loss, that was able to outperform cross-entropy across the majority of experiments.

cross Federated learning model for predicting major postoperative complications

Authors: Yonggi Park, Yuanfang Ren, Benjamin Shickel, Ziyuan Guan, Ayush Patela, Yingbo Ma, Zhenhong Hu, Tyler J. Loftus, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac

Abstract: Background: The accurate prediction of postoperative complication risk using Electronic Health Records (EHR) and artificial intelligence shows great potential. Training a robust artificial intelligence model typically requires large-scale and diverse datasets. In reality, collecting medical data often encounters challenges surrounding privacy protection. Methods: This retrospective cohort study includes adult patients who were admitted to UFH Gainesville (GNV) (n = 79,850) and Jacksonville (JAX) (n = 28,636) for any type of inpatient surgical procedure. Using perioperative and intraoperative features, we developed federated learning models to predict nine major postoperative complications (i.e., prolonged intensive care unit stay and mechanical ventilation). We compared federated learning models with local learning models trained on a single site and central learning models trained on pooled dataset from two centers. Results: Our federated learning models achieved the area under the receiver operating characteristics curve (AUROC) values ranged from 0.81 for wound complications to 0.92 for prolonged ICU stay at UFH GNV center. At UFH JAX center, these values ranged from 0.73-0.74 for wound complications to 0.92-0.93 for hospital mortality. Federated learning models achieved comparable AUROC performance to central learning models, except for prolonged ICU stay, where the performance of federated learning models was slightly higher than central learning models at UFH GNV center, but slightly lower at UFH JAX center. In addition, our federated learning model obtained comparable performance to the best local learning model at each center, demonstrating strong generalizability. Conclusion: Federated learning is shown to be a useful tool to train robust and generalizable models from large scale data across multiple institutions where data protection barriers are high.

cross Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?

Authors: Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Abstract: Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.

cross GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks

Authors: Kaylee Burns, Ajinkya Jain, Keegan Go, Fei Xia, Michael Stark, Stefan Schaal, Karol Hausman

Abstract: Large Language Models (LLMs) have been successful at generating robot policy code, but so far these results have been limited to high-level tasks that do not require precise movement. It is an open question how well such approaches work for tasks that require reasoning over contact forces and working within tight success tolerances. We find that, with the right action space, LLMs are capable of successfully generating policies for a variety of contact-rich and high-precision manipulation tasks, even under noisy conditions, such as perceptual errors or grasping inaccuracies. Specifically, we reparameterize the action space to include compliance with constraints on the interaction forces and stiffnesses involved in reaching a target pose. We validate this approach on subtasks derived from the Functional Manipulation Benchmark (FMB) and NIST Task Board Benchmarks. Exposing this action space alongside methods for estimating object poses improves policy generation with an LLM by greater than 3x and 4x when compared to non-compliant action spaces

cross From Protoscience to Epistemic Monoculture: How Benchmarking Set the Stage for the Deep Learning Revolution

Authors: Bernard J. Koch, David Peterson

Abstract: Over the past decade, AI research has focused heavily on building ever-larger deep learning models. This approach has simultaneously unlocked incredible achievements in science and technology, and hindered AI from overcoming long-standing limitations with respect to explainability, ethical harms, and environmental efficiency. Drawing on qualitative interviews and computational analyses, our three-part history of AI research traces the creation of this "epistemic monoculture" back to a radical reconceptualization of scientific progress that occurred in the 1990s. In the first era of AI research (1950s-late 1980s), researchers and patrons approached AI as a "basic" science that would advance through autonomous exploration and organic assessments of progress (e.g., peer-review, theoretical consensus). The failure of this approach led to a retrenchment of funding in the 1980s. Amid this "AI Winter," an intervention by the U.S. government reoriented the field towards measurable progress on tasks of military and commercial interest. A new evaluation system called "benchmarking" provided an objective way to quantify progress on tasks by focusing exclusively on increasing predictive accuracy on example datasets. Distilling science down to verifiable metrics clarified the roles of scientists, allowed the field to rapidly integrate talent, and provided clear signals of significance and progress. But history has also revealed a tradeoff to this streamlined approach to science: the consolidation around external interests and inherent conservatism of benchmarking has disincentivized exploration beyond scaling monoculture. In the discussion, we explain how AI's monoculture offers a compelling challenge to the belief that basic, exploration-driven research is needed for scientific progress. Implications for the spread of AI monoculture to other sciences in the era of generative AI are also discussed.

cross CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge

Authors: Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi

Abstract: Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To synergize the creativity and expert cultural knowledge of human annotators and the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions, that modern LLMs fail at, in a gamified manner. Importantly, the increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions with enhanced perceived creativity of themselves, shedding light on the promises of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset with users' red-teaming attempts, that different families of modern LLMs perform with accuracy ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.

cross SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

Authors: Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu

Abstract: Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or suppressing improper text embeddings, which can block explicit NSFW-related content (e.g., naked or sexy) but may still be vulnerable to adversarial prompts inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate unsafe visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts since unsafe visual representations are obstructed from within. Extensive experiments conducted on four datasets demonstrate SafeGen's effectiveness in mitigating unsafe content generation while preserving the high-fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.1% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.

cross Forecasting the Future with Future Technologies: Advancements in Large Meteorological Models

Authors: Hailong Shu, Yue Wang, Weiwei Song, Huichuang Guo, Zhen Song

Abstract: The field of meteorological forecasting has undergone a significant transformation with the integration of large models, especially those employing deep learning techniques. This paper reviews the advancements and applications of these models in weather prediction, emphasizing their role in transforming traditional forecasting methods. Models like FourCastNet, Pangu-Weather, GraphCast, ClimaX, and FengWu have made notable contributions by providing accurate, high-resolution forecasts, surpassing the capabilities of traditional Numerical Weather Prediction (NWP) models. These models utilize advanced neural network architectures, such as Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), and Transformers, to process diverse meteorological data, enhancing predictive accuracy across various time scales and spatial resolutions. The paper addresses challenges in this domain, including data acquisition and computational demands, and explores future opportunities for model optimization and hardware advancements. It underscores the integration of artificial intelligence with conventional meteorological techniques, promising improved weather prediction accuracy and a significant contribution to addressing climate-related challenges. This synergy positions large models as pivotal in the evolving landscape of meteorological forecasting.

cross VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

Authors: Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit timbre leakage which changes the speaker's perceived identity. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io

URLs: https://voiceshopai.github.io

cross Neural Optimizer Equation, Decay Function, and Learning Rate Schedule Joint Evolution

Authors: Brandon Morgan, Dean Hougen

Abstract: A major contributor to the quality of a deep learning model is the selection of the optimizer. We propose a new dual-joint search space in the realm of neural optimizer search (NOS), along with an integrity check, to automate the process of finding deep learning optimizers. Our dual-joint search space simultaneously allows for the optimization of not only the update equation, but also internal decay functions and learning rate schedules for optimizers. We search the space using our proposed mutation-only, particle-based genetic algorithm able to be massively parallelized for our domain-specific problem. We evaluate our candidate optimizers on the CIFAR-10 dataset using a small ConvNet. To assess generalization, the final optimizers were then transferred to large-scale image classification on CIFAR- 100 and TinyImageNet, while also being fine-tuned on Flowers102, Cars196, and Caltech101 using EfficientNetV2Small. We found multiple optimizers, learning rate schedules, and Adam variants that outperformed Adam, as well as other standard deep learning optimizers, across the image classification tasks.

cross CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

Abstract: Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge in the field. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix is capable of first converting dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. These dialogues, generated within a single channel, are characterized by seamless speech transitions, including overlapping speech, and appropriate paralinguistic behaviors such as laughter. Audio samples are available at https://aka.ms/covomix.

URLs: https://aka.ms/covomix.

cross How to Craft Backdoors with Unlabeled Data Alone?

Authors: Yifei Wang, Wenhan Ma, Yisen Wang

Abstract: Relying only on unlabeled data, Self-supervised learning (SSL) can learn rich features in an economical and scalable way. As the drive-horse for building foundation models, SSL has received a lot of attention recently with wide applications, which also raises security concerns where backdoor attack is a major type of threat: if the released dataset is maliciously poisoned, backdoored SSL models can behave badly when triggers are injected to test samples. The goal of this work is to investigate this potential risk. We notice that existing backdoors all require a considerable amount of \emph{labeled} data that may not be available for SSL. To circumvent this limitation, we explore a more restrictive setting called no-label backdoors, where we only have access to the unlabeled data alone, where the key challenge is how to select the proper poison set without using label information. We propose two strategies for poison selection: clustering-based selection using pseudolabels, and contrastive selection derived from the mutual information principle. Experiments on CIFAR-10 and ImageNet-100 show that both no-label backdoors are effective on many SSL methods and outperform random poisoning by a large margin. Code will be available at https://github.com/PKU-ML/nlb.

URLs: https://github.com/PKU-ML/nlb.

cross Convolution-based Probability Gradient Loss for Semantic Segmentation

Authors: Guohang Shan, Shuangcheng Jia

Abstract: In this paper, we introduce a novel Convolution-based Probability Gradient (CPG) loss for semantic segmentation. It employs convolution kernels similar to the Sobel operator, capable of computing the gradient of pixel intensity in an image. This enables the computation of gradients for both ground-truth and predicted category-wise probabilities. It enhances network performance by maximizing the similarity between these two probability gradients. Moreover, to specifically enhance accuracy near the object's boundary, we extract the object boundary based on the ground-truth probability gradient and exclusively apply the CPG loss to pixels belonging to boundaries. CPG loss proves to be highly convenient and effective. It establishes pixel relationships through convolution, calculating errors from a distinct dimension compared to pixel-wise loss functions such as cross-entropy loss. We conduct qualitative and quantitative analyses to evaluate the impact of the CPG loss on three well-established networks (DeepLabv3-Resnet50, HRNetV2-OCR, and LRASPP_MobileNet_V3_Large) across three standard segmentation datasets (Cityscapes, COCO-Stuff, ADE20K). Our extensive experimental results consistently and significantly demonstrate that the CPG loss enhances the mean Intersection over Union.

cross SpikeNVS: Enhancing Novel View Synthesis from Blurry Images via Spike Camera

Authors: Gaole Dai, Zhenyu Wang, Qinwen Xu, Wen Cheng, Ming Lu, Boxing Shi, Shanghang Zhang, Tiejun Huang

Abstract: One of the most critical factors in achieving sharp Novel View Synthesis (NVS) using neural field methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) is the quality of the training images. However, Conventional RGB cameras are susceptible to motion blur. In contrast, neuromorphic cameras like event and spike cameras inherently capture more comprehensive temporal information, which can provide a sharp representation of the scene as additional training data. Recent methods have explored the integration of event cameras to improve the quality of NVS. The event-RGB approaches have some limitations, such as high training costs and the inability to work effectively in the background. Instead, our study introduces a new method that uses the spike camera to overcome these limitations. By considering texture reconstruction from spike streams as ground truth, we design the Texture from Spike (TfS) loss. Since the spike camera relies on temporal integration instead of temporal differentiation used by event cameras, our proposed TfS loss maintains manageable training costs. It handles foreground objects with backgrounds simultaneously. We also provide a real-world dataset captured with our spike-RGB camera system to facilitate future research endeavors. We conduct extensive experiments using synthetic and real-world datasets to demonstrate that our design can enhance novel view synthesis across NeRF and 3DGS. The code and dataset will be made available for public access.

cross Racial/Ethnic Categories in AI and Algorithmic Fairness: Why They Matter and What They Represent

Authors: Jennifer Mickel

Abstract: Racial diversity has become increasingly discussed within the AI and algorithmic fairness literature, yet little attention is focused on justifying the choices of racial categories and understanding how people are racialized into these chosen racial categories. Even less attention is given to how racial categories shift and how the racialization process changes depending on the context of a dataset or model. An unclear understanding of \textit{who} comprises the racial categories chosen and \textit{how} people are racialized into these categories can lead to varying interpretations of these categories. These varying interpretations can lead to harm when the understanding of racial categories and the racialization process is misaligned from the actual racialization process and racial categories used. Harm can also arise if the racialization process and racial categories used are irrelevant or do not exist in the context they are applied. In this paper, we make two contributions. First, we demonstrate how racial categories with unclear assumptions and little justification can lead to varying datasets that poorly represent groups obfuscated or unrepresented by the given racial categories and models that perform poorly on these groups. Second, we develop a framework, CIRCSheets, for documenting the choices and assumptions in choosing racial categories and the process of racialization into these categories to facilitate transparency in understanding the processes and assumptions made by dataset or model developers when selecting or using these racial categories.

cross Accuracy of a Large Language Model in Distinguishing Anti- And Pro-vaccination Messages on Social Media: The Case of Human Papillomavirus Vaccination

Authors: Soojong Kim, Kwanho Kim, Claire Wonjeong Jo

Abstract: Objective. Vaccination has engendered a spectrum of public opinions, with social media acting as a crucial platform for health-related discussions. The emergence of artificial intelligence technologies, such as large language models (LLMs), offers a novel opportunity to efficiently investigate public discourses. This research assesses the accuracy of ChatGPT, a widely used and freely available service built upon an LLM, for sentiment analysis to discern different stances toward Human Papillomavirus (HPV) vaccination. Methods. Messages related to HPV vaccination were collected from social media supporting different message formats: Facebook (long format) and Twitter (short format). A selection of 1,000 human-evaluated messages was input into the LLM, which generated multiple response instances containing its classification results. Accuracy was measured for each message as the level of concurrence between human and machine decisions, ranging between 0 and 1. Results. Average accuracy was notably high when 20 response instances were used to determine the machine decision of each message: .882 (SE = .021) and .750 (SE = .029) for anti- and pro-vaccination long-form; .773 (SE = .027) and .723 (SE = .029) for anti- and pro-vaccination short-form, respectively. Using only three or even one instance did not lead to a severe decrease in accuracy. However, for long-form messages, the language model exhibited significantly lower accuracy in categorizing pro-vaccination messages than anti-vaccination ones. Conclusions. ChatGPT shows potential in analyzing public opinions on HPV vaccination using social media content. However, understanding the characteristics and limitations of a language model within specific public health contexts remains imperative.

cross Incremental XAI: Memorable Understanding of AI with Incremental Explanations

Authors: Jessica Y. Bo, Pan Hao, Brian Y. Lim

Abstract: Many explainable AI (XAI) techniques strive for interpretability by providing concise salient information, such as sparse linear factors. However, users either only see inaccurate global explanations, or highly-varying local explanations. We propose to provide more detailed explanations by leveraging the human cognitive capacity to accumulate knowledge by incrementally receiving more details. Focusing on linear factor explanations (factors $\times$ values = outcome), we introduce Incremental XAI to automatically partition explanations for general and atypical instances by providing Base + Incremental factors to help users read and remember more faithful explanations. Memorability is improved by reusing base factors and reducing the number of factors shown in atypical cases. In modeling, formative, and summative user studies, we evaluated the faithfulness, memorability and understandability of Incremental XAI against baseline explanation methods. This work contributes towards more usable explanation that users can better ingrain to facilitate intuitive engagement with AI.

cross Frontier AI Ethics: Anticipating and Evaluating the Societal Impacts of Generative Agents

Authors: Seth Lazar

Abstract: Some have criticised Generative AI Systems for replicating the familiar pathologies of already widely-deployed AI systems. Other critics highlight how they foreshadow vastly more powerful future systems, which might threaten humanity's survival. The first group says there is nothing new here; the other looks through the present to a perhaps distant horizon. In this paper, I instead pay attention to what makes these particular systems distinctive: both their remarkable scientific achievement, and the most likely and consequential ways in which they will change society over the next five to ten years. In particular, I explore the potential societal impacts and normative questions raised by the looming prospect of 'Generative Agents', in which multimodal large language models (LLMs) form the executive centre of complex, tool-using AI systems that can take unsupervised sequences of actions towards some goal.

cross CrimeAlarm: Towards Intensive Intent Dynamics in Fine-grained Crime Prediction

Authors: Kaixi Hu, Lin Li, Qing Xie, Xiaohui Tao, Guandong Xu

Abstract: Granularity and accuracy are two crucial factors for crime event prediction. Within fine-grained event classification, multiple criminal intents may alternately exhibit in preceding sequential events, and progress differently in next. Such intensive intent dynamics makes training models hard to capture unobserved intents, and thus leads to sub-optimal generalization performance, especially in the intertwining of numerous potential events. To capture comprehensive criminal intents, this paper proposes a fine-grained sequential crime prediction framework, CrimeAlarm, that equips with a novel mutual distillation strategy inspired by curriculum learning. During the early training phase, spot-shared criminal intents are captured through high-confidence sequence samples. In the later phase, spot-specific intents are gradually learned by increasing the contribution of low-confidence sequences. Meanwhile, the output probability distributions are reciprocally learned between prediction networks to model unobserved criminal intents. Extensive experiments show that CrimeAlarm outperforms state-of-the-art methods in terms of NDCG@5, with improvements of 4.51% for the NYC16 and 7.73% for the CHI18 in accuracy measures.

cross Language Generation in the Limit

Authors: Jon Kleinberg, Sendhil Mullainathan

Abstract: Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language L that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from L in the limit if after some finite point in the enumeration of L, the agent is able to produce new elements that come exclusively from L and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it.

cross DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space

Authors: Jianxiang Xiang, Zhenhua Liu, Haodong Liu, Yin Bai, Jia Cheng, Wenliang Chen

Abstract: In real-life conversations, the content is diverse, and there exists the one-to-many problem that requires diverse generation. Previous studies attempted to introduce discrete or Gaussian-based continuous latent variables to address the one-to-many problem, but the diversity is limited. Recently, diffusion models have made breakthroughs in computer vision, and some attempts have been made in natural language processing. In this paper, we propose DiffusionDialog, a novel approach to enhance the diversity of dialogue generation with the help of diffusion model. In our approach, we introduce continuous latent variables into the diffusion model. The problem of using latent variables in the dialog task is how to build both an effective prior of the latent space and an inferring process to obtain the proper latent given the context. By combining the encoder and latent-based diffusion model, we encode the response's latent representation in a continuous space as the prior, instead of fixed Gaussian distribution or simply discrete ones. We then infer the latent by denoising step by step with the diffusion model. The experimental results show that our model greatly enhances the diversity of dialog responses while maintaining coherence. Furthermore, in further analysis, we find that our diffusion model achieves high inference efficiency, which is the main challenge of applying diffusion models in natural language processing.

cross Logit Calibration and Feature Contrast for Robust Federated Learning on Non-IID Data

Authors: Yu Qiao, Chaoning Zhang, Apurba Adhikary, Choong Seon Hong

Abstract: Federated learning (FL) is a privacy-preserving distributed framework for collaborative model training on devices in edge networks. However, challenges arise due to vulnerability to adversarial examples (AEs) and the non-independent and identically distributed (non-IID) nature of data distribution among devices, hindering the deployment of adversarially robust and accurate learning models at the edge. While adversarial training (AT) is commonly acknowledged as an effective defense strategy against adversarial attacks in centralized training, we shed light on the adverse effects of directly applying AT in FL that can severely compromise accuracy, especially in non-IID challenges. Given this limitation, this paper proposes FatCC, which incorporates local logit \underline{C}alibration and global feature \underline{C}ontrast into the vanilla federated adversarial training (\underline{FAT}) process from both logit and feature perspectives. This approach can effectively enhance the federated system's robust accuracy (RA) and clean accuracy (CA). First, we propose logit calibration, where the logits are calibrated during local adversarial updates, thereby improving adversarial robustness. Second, FatCC introduces feature contrast, which involves a global alignment term that aligns each local representation with unbiased global features, thus further enhancing robustness and accuracy in federated adversarial environments. Extensive experiments across multiple datasets demonstrate that FatCC achieves comparable or superior performance gains in both CA and RA compared to other baselines.

cross Private Wasserstein Distance with Random Noises

Authors: Wenqian Li, Haozhi Wang, Zhe Huang, Yan Pang

Abstract: Wasserstein distance is a principle measure of data divergence from a distributional standpoint. However, its application becomes challenging in the context of data privacy, where sharing raw data is restricted. Prior attempts have employed techniques like Differential Privacy or Federated optimization to approximate Wasserstein distance. Nevertheless, these approaches often lack accuracy and robustness against potential attack. In this study, we investigate the underlying triangular properties within the Wasserstein space, leading to a straightforward solution named TriangleWad. This approach enables the computation of Wasserstein distance between datasets stored across different entities. Notably, TriangleWad is 20 times faster, making raw data information truly invisible, enhancing resilience against attacks, and without sacrificing estimation accuracy. Through comprehensive experimentation across various tasks involving both image and text data, we demonstrate its superior performance and generalizations.

cross Proposed modified computational model for the amoeba-inspired combinatorial optimization machine

Authors: Yusuke Miyajima, Masahito Mochizuki

Abstract: A single-celled amoeba can solve the traveling salesman problem through its shape-changing dynamics. In this paper, we examine roles of several elements in a previously proposed computational model of the solution-search process of amoeba and three modifications towards enhancing the solution-search preformance. We find that appropriate modifications can indeed significantly improve the quality of solutions. It is also found that a condition associated with the volume conservation can also be modified in contrast to the naive belief that it is indispensable for the solution-search ability of amoeba. A proposed modified model shows much better performance.

cross Multi-Label Continual Learning for the Medical Domain: A Novel Benchmark

Authors: Marina Ceccon, Davide Dalle Pezze, Alessandro Fabris, Gian Antonio Susto

Abstract: Multi-label image classification in dynamic environments is a problem that poses significant challenges. Previous studies have primarily focused on scenarios such as Domain Incremental Learning and Class Incremental Learning, which do not fully capture the complexity of real-world applications. In this paper, we study the problem of classification of medical imaging in the scenario termed New Instances \& New Classes, which combines the challenges of both new class arrivals and domain shifts in a single framework. Unlike traditional scenarios, it reflects the realistic nature of CL in domains such as medical imaging, where updates may introduce both new classes and changes in domain characteristics. To address the unique challenges posed by this complex scenario, we introduce a novel approach called Pseudo-Label Replay. This method aims to mitigate forgetting while adapting to new classes and domain shifts by combining the advantages of the Replay and Pseudo-Label methods and solving their limitations in the proposed scenario. % part3 We evaluate our proposed approach on a challenging benchmark consisting of two datasets, seven tasks, and nineteen classes, modeling a realistic Continual Learning scenario. Our experimental findings demonstrate the effectiveness of Pseudo-Label Replay in addressing the challenges posed by the complex scenario proposed. Our method surpasses existing approaches, exhibiting superior performance while showing minimal forgetting.

cross SleepPPG-Net2: Deep learning generalization for sleep staging from photoplethysmography

Authors: Shirel Attia, Revital Shani Hershkovich, Alissa Tabakhov, Angeleene Ang, Sharon Haimov, Riva Tauman, Joachim A. Behar

Abstract: Background: Sleep staging is a fundamental component in the diagnosis of sleep disorders and the management of sleep health. Traditionally, this analysis is conducted in clinical settings and involves a time-consuming scoring procedure. Recent data-driven algorithms for sleep staging, using the photoplethysmogram (PPG) time series, have shown high performance on local test sets but lower performance on external datasets due to data drift. Methods: This study aimed to develop a generalizable deep learning model for the task of four class (wake, light, deep, and rapid eye movement (REM)) sleep staging from raw PPG physiological time-series. Six sleep datasets, totaling 2,574 patients recordings, were used. In order to create a more generalizable representation, we developed and evaluated a deep learning model called SleepPPG-Net2, which employs a multi-source domain training approach.SleepPPG-Net2 was benchmarked against two state-of-the-art models. Results: SleepPPG-Net2 showed consistently higher performance over benchmark approaches, with generalization performance (Cohen's kappa) improving by up to 19%. Performance disparities were observed in relation to age, sex, and sleep apnea severity. Conclusion: SleepPPG-Net2 sets a new standard for staging sleep from raw PPG time-series.

cross Research on Detection of Floating Objects in River and Lake Based on AI Intelligent Image Recognition

Authors: Jingyu Zhang, Ao Xiang, Yu Cheng, Qin Yang, Liyang Wang

Abstract: With the rapid advancement of artificial intelligence technology, AI-enabled image recognition has emerged as a potent tool for addressing challenges in traditional environmental monitoring. This study focuses on the detection of floating objects in river and lake environments, exploring an innovative approach based on deep learning. By intricately analyzing the technical pathways for detecting static and dynamic features and considering the characteristics of river and lake debris, a comprehensive image acquisition and processing workflow has been developed. The study highlights the application and performance comparison of three mainstream deep learning models -SSD, Faster-RCNN, and YOLOv5- in debris identification. Additionally, a detection system for floating objects has been designed and implemented, encompassing both hardware platform construction and software framework development. Through rigorous experimental validation, the proposed system has demonstrated its ability to significantly enhance the accuracy and efficiency of debris detection, thus offering a new technological avenue for water quality monitoring in rivers and lakes

cross DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Authors: Shijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Tejas Bharadwaj, Suya You, Zhangyang Wang, Achuta Kadambi

Abstract: The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: http://dreamscene360.github.io/

URLs: http://dreamscene360.github.io/

cross Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation

Authors: Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi

Abstract: Despite the successes of large language models (LLMs), they exhibit significant drawbacks, particularly when processing long contexts. Their inference cost scales quadratically with respect to sequence length, making it expensive for deployment in some real-world text processing applications, such as retrieval-augmented generation (RAG). Additionally, LLMs also exhibit the "distraction phenomenon," where irrelevant context in the prompt degrades output quality. To address these drawbacks, we propose a novel RAG prompting methodology, superposition prompting, which can be directly applied to pre-trained transformer-based LLMs without the need for fine-tuning. At a high level, superposition prompting allows the LLM to process input documents in parallel prompt paths, discarding paths once they are deemed irrelevant. We demonstrate the capability of our method to simultaneously enhance time efficiency across a variety of question-answering benchmarks using multiple pre-trained LLMs. Furthermore, our technique significantly improves accuracy when the retrieved context is large relative the context the model was trained on. For example, our approach facilitates an 93x reduction in compute time while improving accuracy by 43\% on the NaturalQuestions-Open dataset with the MPT-7B instruction-tuned model over naive RAG.

cross GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications

Authors: Shishir G. Patil, Tianjun Zhang, Vivian Fang, Noppapon C., Roy Huang, Aaron Hao, Martin Casado, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica

Abstract: Large Language Models (LLMs) are evolving beyond their classical role of providing information within dialogue systems to actively engaging with tools and performing actions on real-world applications and services. Today, humans verify the correctness and appropriateness of the LLM-generated outputs (e.g., code, functions, or actions) before putting them into real-world execution. This poses significant challenges as code comprehension is well known to be notoriously difficult. In this paper, we study how humans can efficiently collaborate with, delegate to, and supervise autonomous LLMs in the future. We argue that in many cases, "post-facto validation" - verifying the correctness of a proposed action after seeing the output - is much easier than the aforementioned "pre-facto validation" setting. The core concept behind enabling a post-facto validation system is the integration of an intuitive undo feature, and establishing a damage confinement for the LLM-generated actions as effective strategies to mitigate the associated risks. Using this, a human can now either revert the effect of an LLM-generated output or be confident that the potential risk is bounded. We believe this is critical to unlock the potential for LLM agents to interact with applications and services with limited (post-facto) human involvement. We describe the design and implementation of our open-source runtime for executing LLM actions, Gorilla Execution Engine (GoEX), and present open research questions towards realizing the goal of LLMs and applications interacting with each other with minimal human supervision. We release GoEX at https://github.com/ShishirPatil/gorilla/.

URLs: https://github.com/ShishirPatil/gorilla/.

cross Fast System Technology Co-Optimization Framework for Emerging Technology Based on Graph Neural Networks

Authors: Tianliang Ma, Guangxi Fan, Xuguang Sun, Zhihui Deng, Kainlu Low, Leilai Shao

Abstract: This paper proposes a fast system technology co-optimization (STCO) framework that optimizes power, performance, and area (PPA) for next-generation IC design, addressing the challenges and opportunities presented by novel materials and device architectures. We focus on accelerating the technology level of STCO using AI techniques, by employing graph neural network (GNN)-based approaches for both TCAD simulation and cell library characterization, which are interconnected through a unified compact model, collectively achieving over a 100X speedup over traditional methods. These advancements enable comprehensive STCO iterations with runtime speedups ranging from 1.9X to 14.1X and supports both emerging and traditional technologies.

cross MetaCheckGPT -- A Multi-task Hallucination Detection Using LLM Uncertainty and Meta-models

Authors: Rahul Mehta, Andrew Hoblitzell, Jack O'Keefe, Hyeju Jang, Vasudeva Varma

Abstract: This paper presents our winning solution for the SemEval-2024 Task 6 competition. We propose a meta-regressor framework of large language models (LLMs) for model evaluation and integration that achieves the highest scores on the leader board. Our approach leverages uncertainty signals present in a diverse basket of LLMs to detect hallucinations more robustly.

cross Untangling Critical Interaction with AI in Students Written Assessment

Authors: Antonette Shibani, Simon Knight, Kirsty Kitto, Ajanie Karunanayake, Simon Buckingham Shum

Abstract: Artificial Intelligence (AI) has become a ubiquitous part of society, but a key challenge exists in ensuring that humans are equipped with the required critical thinking and AI literacy skills to interact with machines effectively by understanding their capabilities and limitations. These skills are particularly important for learners to develop in the age of generative AI where AI tools can demonstrate complex knowledge and ability previously thought to be uniquely human. To activate effective human-AI partnerships in writing, this paper provides a first step toward conceptualizing the notion of critical learner interaction with AI. Using both theoretical models and empirical data, our preliminary findings suggest a general lack of Deep interaction with AI during the writing process. We believe that the outcomes can lead to better task and tool design in the future for learners to develop deep, critical thinking when interacting with AI.

cross Adversarial purification for no-reference image-quality metrics: applicability study and new methods

Authors: Aleksandr Gushchin, Anna Chistyakova, Vladislav Minashkin, Anastasia Antsiferova, Dmitriy Vatolin

Abstract: Recently, the area of adversarial attacks on image quality metrics has begun to be explored, whereas the area of defences remains under-researched. In this study, we aim to cover that case and check the transferability of adversarial purification defences from image classifiers to IQA methods. In this paper, we apply several widespread attacks on IQA models and examine the success of the defences against them. The purification methodologies covered different preprocessing techniques, including geometrical transformations, compression, denoising, and modern neural network-based methods. Also, we address the challenge of assessing the efficacy of a defensive methodology by proposing ways to estimate output visual quality and the success of neutralizing attacks. Defences were tested against attack on three IQA metrics -- Linearity, MetaIQA and SPAQ. The code for attacks and defences is available at: (link is hidden for a blind review).

cross Advancing Real-time Pandemic Forecasting Using Large Language Models: A COVID-19 Case Study

Authors: Hongru Du (Frank), Jianan Zhao (Frank), Yang Zhao (Frank), Shaochong Xu (Frank), Xihong Lin (Frank), Yiran Chen (Frank), Lauren M. Gardner (Frank), Hao (Frank), Yang

Abstract: Forecasting the short-term spread of an ongoing disease outbreak is a formidable challenge due to the complexity of contributing factors, some of which can be characterized through interlinked, multi-modality variables such as epidemiological time series data, viral biology, population demographics, and the intersection of public policy and human behavior. Existing forecasting model frameworks struggle with the multifaceted nature of relevant data and robust results translation, which hinders their performances and the provision of actionable insights for public health decision-makers. Our work introduces PandemicLLM, a novel framework with multi-modal Large Language Models (LLMs) that reformulates real-time forecasting of disease spread as a text reasoning problem, with the ability to incorporate real-time, complex, non-numerical information that previously unattainable in traditional forecasting models. This approach, through a unique AI-human cooperative prompt design and time series representation learning, encodes multi-modal data for LLMs. The model is applied to the COVID-19 pandemic, and trained to utilize textual public health policies, genomic surveillance, spatial, and epidemiological time series data, and is subsequently tested across all 50 states of the U.S. Empirically, PandemicLLM is shown to be a high-performing pandemic forecasting framework that effectively captures the impact of emerging variants and can provide timely and accurate predictions. The proposed PandemicLLM opens avenues for incorporating various pandemic-related data in heterogeneous formats and exhibits performance benefits over existing models. This study illuminates the potential of adapting LLMs and representation learning to enhance pandemic forecasting, illustrating how AI innovations can strengthen pandemic responses and crisis management in the future.

cross TrajPRed: Trajectory Prediction with Region-based Relation Learning

Authors: Chen Zhou, Ghassan AlRegib, Armin Parchami, Kunjan Singh

Abstract: Forecasting human trajectories in traffic scenes is critical for safety within mixed or fully autonomous systems. Human future trajectories are driven by two major stimuli, social interactions, and stochastic goals. Thus, reliable forecasting needs to capture these two stimuli. Edge-based relation modeling represents social interactions using pairwise correlations from precise individual states. Nevertheless, edge-based relations can be vulnerable under perturbations. To alleviate these issues, we propose a region-based relation learning paradigm that models social interactions via region-wise dynamics of joint states, i.e., the changes in the density of crowds. In particular, region-wise agent joint information is encoded within convolutional feature grids. Social relations are modeled by relating the temporal changes of local joint information from a global perspective. We show that region-based relations are less susceptible to perturbations. In order to account for the stochastic individual goals, we exploit a conditional variational autoencoder to realize multi-goal estimation and diverse future prediction. Specifically, we perform variational inference via the latent distribution, which is conditioned on the correlation between input states and associated target goals. Sampling from the latent distribution enables the framework to reliably capture the stochastic behavior in test data. We integrate multi-goal estimation and region-based relation learning to model the two stimuli, social interactions, and stochastic goals, in a prediction framework. We evaluate our framework on the ETH-UCY dataset and Stanford Drone Dataset (SDD). We show that the diverse prediction better fits the ground truth when incorporating the relation module. Our framework outperforms the state-of-the-art models on SDD by $27.61\%$/$18.20\%$ of ADE/FDE metrics.

cross Toward industrial use of continual learning : new metrics proposal for class incremental learning

Authors: Konat\'e Mohamed Abbas, Anne-Fran\c{c}oise Yao, Thierry Chateau, Pierre Bouges

Abstract: In this paper, we investigate continual learning performance metrics used in class incremental learning strategies for continual learning (CL) using some high performing methods. We investigate especially mean task accuracy. First, we show that it lacks of expressiveness through some simple experiments to capture performance. We show that monitoring average tasks performance is over optimistic and can lead to misleading conclusions for future real life industrial uses. Then, we propose first a simple metric, Minimal Incremental Class Accuracy (MICA) which gives a fair and more useful evaluation of different continual learning methods. Moreover, in order to provide a simple way to easily compare different methods performance in continual learning, we derive another single scalar metric that take into account the learning performance variation as well as our newly introduced metric.

cross XNLIeu: a dataset for cross-lingual NLI in Basque

Authors: Maite Heredia, Julen Etxaniz, Muitze Zulaika, Xabier Saralegi, Jeremy Barnes, Aitor Soroa

Abstract: XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.

cross Event Grounded Criminal Court View Generation withCooperative (Large) Language Models

Authors: Linan Yue, Qi Liu, Lili Zhao, Li Wang, Weibo Gao, Yanqing An

Abstract: With the development of legal intelligence, Criminal Court View Generation has attracted much attention as a crucial task of legal intelligence, which aims to generate concise and coherent texts that summarize case facts and provide explanations for verdicts. Existing researches explore the key information in case facts to yield the court views. Most of them employ a coarse-grained approach that partitions the facts into broad segments (e.g., verdict-related sentences) to make predictions. However, this approach fails to capture the complex details present in the case facts, such as various criminal elements and legal events. To this end, in this paper, we propose an Event Grounded Generation (EGG) method for criminal court view generation with cooperative (Large) Language Models, which introduces the fine-grained event information into the generation. Specifically, we first design a LLMs-based extraction method that can extract events in case facts without massive annotated events. Then, we incorporate the extracted events into court view generation by merging case facts and events. Besides, considering the computational burden posed by the use of LLMs in the extraction phase of EGG, we propose a LLMs-free EGG method that can eliminate the requirement for event extraction using LLMs in the inference phase. Extensive experimental results on a real-world dataset clearly validate the effectiveness of our proposed method.

cross WordDecipher: Enhancing Digital Workspace Communication with Explainable AI for Non-native English Speakers

Authors: Yuexi Chen, Zhicheng Liu

Abstract: Non-native English speakers (NNES) face challenges in digital workspace communication (e.g., emails, Slack messages), often inadvertently translating expressions from their native languages, which can lead to awkward or incorrect usage. Current AI-assisted writing tools are equipped with fluency enhancement and rewriting suggestions; however, NNES may struggle to grasp the subtleties among various expressions, making it challenging to choose the one that accurately reflects their intent. Such challenges are exacerbated in high-stake text-based communications, where the absence of non-verbal cues heightens the risk of misinterpretation. By leveraging the latest advancements in large language models (LLM) and word embeddings, we propose WordDecipher, an explainable AI-assisted writing tool to enhance digital workspace communication for NNES. WordDecipher not only identifies the perceived social intentions detected in users' writing, but also generates rewriting suggestions aligned with users' intended messages, either numerically or by inferring from users' writing in their native language. Then, WordDecipher provides an overview of nuances to help NNES make selections. Through a usage scenario, we demonstrate how WordDecipher can significantly enhance an NNES's ability to communicate her request, showcasing its potential to transform workspace communication for NNES.

cross Knowledge graphs for empirical concept retrieval

Authors: Lenka T\v{e}tkov\'a, Teresa Karen Scheidt, Maria Mandrup Fogh, Ellen Marie Gaunby J{\o}rgensen, Finn {\AA}rup Nielsen, Lars Kai Hansen

Abstract: Concept-based explainable AI is promising as a tool to improve the understanding of complex models at the premises of a given user, viz.\ as a tool for personalized explainability. An important class of concept-based explainability methods is constructed with empirically defined concepts, indirectly defined through a set of positive and negative examples, as in the TCAV approach (Kim et al., 2018). While it is appealing to the user to avoid formal definitions of concepts and their operationalization, it can be challenging to establish relevant concept datasets. Here, we address this challenge using general knowledge graphs (such as, e.g., Wikidata or WordNet) for comprehensive concept definition and present a workflow for user-driven data collection in both text and image domains. The concepts derived from knowledge graphs are defined interactively, providing an opportunity for personalization and ensuring that the concepts reflect the user's intentions. We test the retrieved concept datasets on two concept-based explainability methods, namely concept activation vectors (CAVs) and concept activation regions (CARs) (Crabbe and van der Schaar, 2022). We show that CAVs and CARs based on these empirical concept datasets provide robust and accurate explanations. Importantly, we also find good alignment between the models' representations of concepts and the structure of knowledge graphs, i.e., human representations. This supports our conclusion that knowledge graph-based concepts are relevant for XAI.

cross Improving Language Model Reasoning with Self-motivated Learning

Authors: Yunlong Feng, Yang Xu, Libo Qin, Yasheng Wang, Wanxiang Che

Abstract: Large-scale high-quality training data is important for improving the performance of models. After trained with data that has rationales (reasoning steps), models gain reasoning capability. However, the dataset with high-quality rationales is relatively scarce due to the high annotation cost. To address this issue, we propose \textit{Self-motivated Learning} framework. The framework motivates the model itself to automatically generate rationales on existing datasets. Based on the inherent rank from correctness across multiple rationales, the model learns to generate better rationales, leading to higher reasoning capability. Specifically, we train a reward model with the rank to evaluate the quality of rationales, and improve the performance of reasoning through reinforcement learning. Experiment results of Llama2 7B on multiple reasoning datasets show that our method significantly improves the reasoning ability of models, even outperforming text-davinci-002 in some datasets.

cross Comparison of decision trees with Local Interpretable Model-Agnostic Explanations (LIME) technique and multi-linear regression for explaining support vector regression model in terms of root mean square error (RMSE) values

Authors: Amit Thombre

Abstract: In this work the decision trees are used for explanation of support vector regression model. The decision trees act as a global technique as well as a local technique. They are compared against the popular technique of LIME which is a local explanatory technique and with multi linear regression. It is observed that decision trees give a lower RMSE value when fitted to support vector regression as compared to LIME in 87% of the runs over 5 datasets. The comparison of results is statistically significant. Multi linear regression also gives a lower RMSE value when fitted to support vector regression model as compared to LIME in 73% of the runs over 5 datasets but the comparison of results is not statistically significant. Also, when used as a local explanatory technique, decision trees give better performance than LIME and the comparison of results is statistically significant.

cross Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

Authors: Elisa Sanchez-Bayona, Rodrigo Agerri

Abstract: Metaphors, although occasionally unperceived, are ubiquitous in our everyday language. Thus, it is crucial for Language Models to be able to grasp the underlying meaning of this kind of figurative language. In this work, we present Meta4XNLI, a novel parallel dataset for the tasks of metaphor detection and interpretation that contains metaphor annotations in both Spanish and English. We investigate language models' metaphor identification and understanding abilities through a series of monolingual and cross-lingual experiments by leveraging our proposed corpus. In order to comprehend how these non-literal expressions affect models' performance, we look over the results and perform an error analysis. Additionally, parallel data offers many potential opportunities to investigate metaphor transferability between these languages and the impact of translation on the development of multilingual annotated resources.

cross LaPlaSS: Latent Space Planning for Stochastic Systems

Authors: Marlyse Reeves, Brian C. Williams

Abstract: Autonomous mobile agents often operate in hazardous environments, necessitating an awareness of safety. These agents can have non-linear, stochastic dynamics that must be considered during planning to guarantee bounded risk. Most state of the art methods require closed-form dynamics to verify plan correctness and safety however modern robotic systems often have dynamics that are learned from data. Thus, there is a need to perform efficient trajectory planning with guarantees on risk for agents without known dynamics models. We propose a "generate-and-test" approach to risk-bounded planning in which a planner generates a candidate trajectory using an approximate linear dynamics model and a validator assesses the risk of the trajectory, computing additional safety constraints for the planner if the candidate does not satisfy the desired risk bound. To acquire the approximate model, we use a variational autoencoder to learn a latent linear dynamics model and encode the planning problem into the latent space to generate the candidate trajectory. The VAE also serves to sample trajectories around the candidate to use in the validator. We demonstrate that our algorithm, LaPlaSS, is able to generate trajectory plans with bounded risk for a real-world agent with learned dynamics and is an order of magnitude more efficient than the state of the art.

cross Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

Authors: Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du, Yongfeng Zhang

Abstract: This paper studies the phenomenon that different concepts are learned in different layers of large language models, i.e. more difficult concepts are fully acquired with deeper layers. We define the difficulty of concepts by the level of abstraction, and here it is crudely categorized by factual, emotional, and inferential. Each category contains a spectrum of tasks, arranged from simple to complex. For example, within the factual dimension, tasks range from lie detection to categorizing mathematical problems. We employ a probing technique to extract representations from different layers of the model and apply these to classification tasks. Our findings reveal that models tend to efficiently classify simpler tasks, indicating that these concepts are learned in shallower layers. Conversely, more complex tasks may only be discernible at deeper layers, if at all. This paper explores the implications of these findings for our understanding of model learning processes and internal representations. Our implementation is available at \url{https://github.com/Luckfort/CD}.

URLs: https://github.com/Luckfort/CD

cross Dynamic Generation of Personalities with Large Language Models

Authors: Jianzhi Liu, Hexiang Gu, Tianyu Zheng, Liuyu Xiang, Huijia Wu, Jie Fu, Zhaofeng He

Abstract: In the realm of mimicking human deliberation, large language models (LLMs) show promising performance, thereby amplifying the importance of this research area. Deliberation is influenced by both logic and personality. However, previous studies predominantly focused on the logic of LLMs, neglecting the exploration of personality aspects. In this work, we introduce Dynamic Personality Generation (DPG), a dynamic personality generation method based on Hypernetworks. Initially, we embed the Big Five personality theory into GPT-4 to form a personality assessment machine, enabling it to evaluate characters' personality traits from dialogues automatically. We propose a new metric to assess personality generation capability based on this evaluation method. Then, we use this personality assessment machine to evaluate dialogues in script data, resulting in a personality-dialogue dataset. Finally, we fine-tune DPG on the personality-dialogue dataset. Experiments prove that DPG's personality generation capability is stronger after fine-tuning on this dataset than traditional fine-tuning methods, surpassing prompt-based GPT-4.

cross LaTiM: Longitudinal representation learning in continuous-time models to predict disease progression

Authors: Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Hugo Le Boit\'e, Ramin Tadayoni, Pascal Massin, B\'eatrice Cochener, Alireza Rezaei, Ikram Brahim, Gwenol\'e Quellec, Mathieu Lamard

Abstract: This work proposes a novel framework for analyzing disease progression using time-aware neural ordinary differential equations (NODE). We introduce a "time-aware head" in a framework trained through self-supervised learning (SSL) to leverage temporal information in latent space for data augmentation. This approach effectively integrates NODEs with SSL, offering significant performance improvements compared to traditional methods that lack explicit temporal integration. We demonstrate the effectiveness of our strategy for diabetic retinopathy progression prediction using the OPHDIAT database. Compared to the baseline, all NODE architectures achieve statistically significant improvements in area under the ROC curve (AUC) and Kappa metrics, highlighting the efficacy of pre-training with SSL-inspired approaches. Additionally, our framework promotes stable training for NODEs, a commonly encountered challenge in time-aware modeling.

cross TransTARec: Time-Adaptive Translating Embedding Model for Next POI Recommendation

Authors: Yiping Sun

Abstract: The rapid growth of location acquisition technologies makes Point-of-Interest(POI) recommendation possible due to redundant user check-in records. In this paper, we focus on next POI recommendation in which next POI is based on previous POI. We observe that time plays an important role in next POI recommendation but is neglected in the recent proposed translating embedding methods. To tackle this shortage, we propose a time-adaptive translating embedding model (TransTARec) for next POI recommendation that naturally incorporates temporal influence, sequential dynamics, and user preference within a single component. Methodologically, we treat a (previous timestamp, user, next timestamp) triplet as a union translation vector and develop a neural-based fusion operation to fuse user preference and temporal influence. The superiority of TransTARec, which is confirmed by extensive experiments on real-world datasets, comes from not only the introduction of temporal influence but also the direct unification with user preference and sequential dynamics.

cross Rethinking Out-of-Distribution Detection for Reinforcement Learning: Advancing Methods for Evaluation and Detection

Authors: Linas Nasvytis, Kai Sandbrink, Jakob Foerster, Tim Franzmeyer, Christian Schroeder de Witt

Abstract: While reinforcement learning (RL) algorithms have been successfully applied across numerous sequential decision-making problems, their generalization to unforeseen testing environments remains a significant concern. In this paper, we study the problem of out-of-distribution (OOD) detection in RL, which focuses on identifying situations at test time that RL agents have not encountered in their training environments. We first propose a clarification of terminology for OOD detection in RL, which aligns it with the literature from other machine learning domains. We then present new benchmark scenarios for OOD detection, which introduce anomalies with temporal autocorrelation into different components of the agent-environment loop. We argue that such scenarios have been understudied in the current literature, despite their relevance to real-world situations. Confirming our theoretical predictions, our experimental results suggest that state-of-the-art OOD detectors are not able to identify such anomalies. To address this problem, we propose a novel method for OOD detection, which we call DEXTER (Detection via Extraction of Time Series Representations). By treating environment observations as time series data, DEXTER extracts salient time series features, and then leverages an ensemble of isolation forest algorithms to detect anomalies. We find that DEXTER can reliably identify anomalies across benchmark scenarios, exhibiting superior performance compared to both state-of-the-art OOD detectors and high-dimensional changepoint detectors adopted from statistics.

cross Semantically-correlated memories in a dense associative model

Authors: Thomas F Burns

Abstract: I introduce a novel associative memory model named Correlated Dense Associative Memory (CDAM), which integrates both auto- and hetero-association in a unified framework for continuous-valued memory patterns. Employing an arbitrary graph structure to semantically link memory patterns, CDAM is theoretically and numerically analysed, revealing four distinct dynamical modes: auto-association, narrow hetero-association, wide hetero-association, and neutral quiescence. Drawing inspiration from inhibitory modulation studies, I employ anti-Hebbian learning rules to control the range of hetero-association, extract multi-scale representations of community structures in graphs, and stabilise the recall of temporal sequences. Experimental demonstrations showcase CDAM's efficacy in handling real-world data, replicating a classical neuroscience experiment, performing image retrieval, and simulating arbitrary finite automata.

cross Measuring proximity to standard planes during fetal brain ultrasound scanning

Authors: Chiara Di Vece, Antonio Cirigliano, Meala Le Lous, Raffaele Napolitano, Anna L. David, Donald Peebles, Pierre Jannin, Francisco Vasconcelos, Danail Stoyanov

Abstract: This paper introduces a novel pipeline designed to bring ultrasound (US) plane pose estimation closer to clinical use for more effective navigation to the standard planes (SPs) in the fetal brain. We propose a semi-supervised segmentation model utilizing both labeled SPs and unlabeled 3D US volume slices. Our model enables reliable segmentation across a diverse set of fetal brain images. Furthermore, the model incorporates a classification mechanism to identify the fetal brain precisely. Our model not only filters out frames lacking the brain but also generates masks for those containing it, enhancing the relevance of plane pose regression in clinical settings. We focus on fetal brain navigation from 2D ultrasound (US) video analysis and combine this model with a US plane pose regression network to provide sensorless proximity detection to SPs and non-SPs planes; we emphasize the importance of proximity detection to SPs for guiding sonographers, offering a substantial advantage over traditional methods by allowing earlier and more precise adjustments during scanning. We demonstrate the practical applicability of our approach through validation on real fetal scan videos obtained from sonographers of varying expertise levels. Our findings demonstrate the potential of our approach to complement existing fetal US technologies and advance prenatal diagnostic practices.

cross Towards Robustness of Text-to-Visualization Translation against Lexical and Phrasal Variability

Authors: Jinwei Lu, Yuanfeng Song, Haodi Zhang, Chen Zhang, Raymond Chi-Wing Wong

Abstract: Text-to-Vis is an emerging task in the natural language processing (NLP) area that aims to automatically generate data visualizations from natural language questions (NLQs). Despite their progress, existing text-to-vis models often heavily rely on lexical matching between words in the questions and tokens in data schemas. This overreliance on lexical matching may lead to a diminished level of model robustness against input variations. In this study, we thoroughly examine the robustness of current text-to-vis models, an area that has not previously been explored. In particular, we construct the first robustness dataset nvBench-Rob, which contains diverse lexical and phrasal variations based on the original text-to-vis benchmark nvBench. Then, we found that the performance of existing text-to-vis models on this new dataset dramatically drops, implying that these methods exhibit inadequate robustness overall. Finally, we propose a novel framework based on Retrieval-Augmented Generation (RAG) technique, named GRED, specifically designed to address input perturbations in these two variants. The framework consists of three parts: NLQ-Retrieval Generator, Visualization Query-Retrieval Retuner and Annotation-based Debugger, which are used to tackle the challenges posed by natural language variants, programming style differences and data schema variants, respectively. Extensive experimental evaluations show that, compared to the state-of-the-art model RGVisNet in the Text-to-Vis field, RGDR performs better in terms of model robustness, with a 32% increase in accuracy on the proposed nvBench-Rob dataset.

cross Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Authors: Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal

Abstract: This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.

cross Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory System

Authors: Steve Rhyner, Haocong Luo, Juan G\'omez-Luna, Mohammad Sadrosadati, Jiawei Jiang, Ataberk Olgun, Harshita Gupta, Ce Zhang, Onur Mutlu

Abstract: Machine Learning (ML) training on large-scale datasets is a very expensive and time-consuming workload. Processor-centric architectures (e.g., CPU, GPU) commonly used for modern ML training workloads are limited by the data movement bottleneck, i.e., due to repeatedly accessing the training dataset. As a result, processor-centric systems suffer from performance degradation and high energy consumption. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Our goal is to understand the capabilities and characteristics of popular distributed optimization algorithms on real-world PIM architectures to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized distributed optimization algorithms on UPMEM's real-world general-purpose PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware and the need to shift to an algorithm-hardware codesign perspective to accommodate decentralized distributed optimization algorithms. Our results demonstrate three major findings: 1) Modern general-purpose PIM architectures can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, when operations and datatypes are natively supported by PIM hardware, 2) the importance of carefully choosing the optimization algorithm that best fit PIM, and 3) contrary to popular belief, contemporary PIM architectures do not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. To facilitate future research, we aim to open-source our complete codebase.

cross Using Neural Networks to Model Hysteretic Kinematics in Tendon-Actuated Continuum Robots

Authors: Yuan Wang, Max McCandless, Abdulhamit Donder, Giovanni Pittiglio, Behnam Moradkhani, Yash Chitalia, Pierre E. Dupont

Abstract: The ability to accurately model mechanical hysteretic behavior in tendon-actuated continuum robots using deep learning approaches is a growing area of interest. In this paper, we investigate the hysteretic response of two types of tendon-actuated continuum robots and, ultimately, compare three types of neural network modeling approaches with both forward and inverse kinematic mappings: feedforward neural network (FNN), FNN with a history input buffer, and long short-term memory (LSTM) network. We seek to determine which model best captures temporal dependent behavior. We find that, depending on the robot's design, choosing different kinematic inputs can alter whether hysteresis is exhibited by the system. Furthermore, we present the results of the model fittings, revealing that, in contrast to the standard FNN, both FNN with a history input buffer and the LSTM model exhibit the capacity to model historical dependence with comparable performance in capturing rate-dependent hysteresis.

cross Worst-Case Convergence Time of ML Algorithms via Extreme Value Theory

Authors: Saeid Tizpaz-Niari, Sriram Sankaranarayanan

Abstract: This paper leverages the statistics of extreme values to predict the worst-case convergence times of machine learning algorithms. Timing is a critical non-functional property of ML systems, and providing the worst-case converge times is essential to guarantee the availability of ML and its services. However, timing properties such as worst-case convergence times (WCCT) are difficult to verify since (1) they are not encoded in the syntax or semantics of underlying programming languages of AI, (2) their evaluations depend on both algorithmic implementations and underlying systems, and (3) their measurements involve uncertainty and noise. Therefore, prevalent formal methods and statistical models fail to provide rich information on the amounts and likelihood of WCCT. Our key observation is that the timing information we seek represents the extreme tail of execution times. Therefore, extreme value theory (EVT), a statistical discipline that focuses on understanding and predicting the distribution of extreme values in the tail of outcomes, provides an ideal framework to model and analyze WCCT in the training and inference phases of ML paradigm. Building upon the mathematical tools from EVT, we propose a practical framework to predict the worst-case timing properties of ML. Over a set of linear ML training algorithms, we show that EVT achieves a better accuracy for predicting WCCTs than relevant statistical methods such as the Bayesian factor. On the set of larger machine learning training algorithms and deep neural network inference, we show the feasibility and usefulness of EVT models to accurately predict WCCTs, their expected return periods, and their likelihood.

cross Reward Learning from Suboptimal Demonstrations with Applications in Surgical Electrocautery

Authors: Zohre Karimi, Shing-Hei Ho, Bao Thach, Alan Kuntz, Daniel S. Brown

Abstract: Automating robotic surgery via learning from demonstration (LfD) techniques is extremely challenging. This is because surgical tasks often involve sequential decision-making processes with complex interactions of physical objects and have low tolerance for mistakes. Prior works assume that all demonstrations are fully observable and optimal, which might not be practical in the real world. This paper introduces a sample-efficient method that learns a robust reward function from a limited amount of ranked suboptimal demonstrations consisting of partial-view point cloud observations. The method then learns a policy by optimizing the learned reward function using reinforcement learning (RL). We show that using a learned reward function to obtain a policy is more robust than pure imitation learning. We apply our approach on a physical surgical electrocautery task and demonstrate that our method can perform well even when the provided demonstrations are suboptimal and the observations are high-dimensional point clouds.

cross VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification

Authors: Florian Sestak, Lisa Schneckenreiter, Johannes Brandstetter, Sepp Hochreiter, Andreas Mayr, G\"unter Klambauer

Abstract: Being able to identify regions within or around proteins, to which ligands can potentially bind, is an essential step to develop new drugs. Binding site identification methods can now profit from the availability of large amounts of 3D structures in protein structure databases or from AlphaFold predictions. Current binding site identification methods heavily rely on graph neural networks (GNNs), usually designed to output E(3)-equivariant predictions. Such methods turned out to be very beneficial for physics-related tasks like binding energy or motion trajectory prediction. However, the performance of GNNs at binding site identification is still limited potentially due to the lack of dedicated nodes that model hidden geometric entities, such as binding pockets. In this work, we extend E(n)-Equivariant Graph Neural Networks (EGNNs) by adding virtual nodes and applying an extended message passing scheme. The virtual nodes in these graphs are dedicated quantities to learn representations of binding sites, which leads to improved predictive performance. In our experiments, we show that our proposed method VN-EGNN sets a new state-of-the-art at locating binding site centers on COACH420, HOLO4K and PDBbind2020.

cross RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

Authors: Jaidev Shriram, Alex Trevithick, Lingjie Liu, Ravi Ramamoorthi

Abstract: We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.

cross UMBRAE: Unified Multimodal Decoding of Brain Signals

Authors: Weihao Xia, Raoul de Charette, Cengiz \"Oztireli, Jing-Hao Xue

Abstract: We address prevailing challenges of the brain-powered research, departing from the observation that the literature hardly recover accurate spatial information and require subject-specific models. To address these challenges, we propose UMBRAE, a unified multimodal decoding of brain signals. First, to extract instance-level conceptual and spatial details from neural signals, we introduce an efficient universal brain encoder for multimodal-brain alignment and recover object descriptions at multiple levels of granularity from subsequent multimodal large language model (MLLM). Second, we introduce a cross-subject training strategy mapping subject-specific features to a common feature space. This allows a model to be trained on multiple subjects without extra resources, even yielding superior results compared to subject-specific models. Further, we demonstrate this supports weakly-supervised adaptation to new subjects, with only a fraction of the total training data. Experiments demonstrate that UMBRAE not only achieves superior results in the newly introduced tasks but also outperforms methods in well established tasks. To assess our method, we construct and share with the community a comprehensive brain understanding benchmark BrainHub. Our code and benchmark are available at https://weihaox.github.io/UMBRAE.

URLs: https://weihaox.github.io/UMBRAE.

cross BRAVE: Broadening the visual encoding of vision-language models

Authors: O\u{g}uzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

Abstract: Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a more broad and contextualized visual understanding of VLMs.

cross GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models

Authors: Zewei Zhang, Huan Liu, Jun Chen, Xiangyu Xu

Abstract: In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose an information-preserving motion supervision operation that maintains the original features of the starting point for precise manipulation and artifact reduction. In addition, we contribute to the benchmarking of drag editing by introducing a new dataset, Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy Index and Gemini Score, utilizing Large Multimodal Models. Extensive experiments demonstrate that the proposed GoodDrag compares favorably against the state-of-the-art approaches both qualitatively and quantitatively. The project page is https://gooddrag.github.io.

URLs: https://gooddrag.github.io.

replace CityNet: A Comprehensive Multi-Modal Urban Dataset for Advanced Research in Urban Computing

Authors: Zhengfei Zheng, Xu Geng, Hai Yang

Abstract: Data-driven approaches have emerged as a popular tool for addressing challenges in urban computing. However, current research efforts have primarily focused on limited data sources, which fail to capture the complexity of urban data arising from multiple entities and their interconnections. Therefore, a comprehensive and multifaceted dataset is required to enable more extensive studies in urban computing. In this paper, we present CityNet, a multi-modal urban dataset that incorporates various data, including taxi trajectory, traffic speed, point of interest (POI), road network, wind, rain, temperature, and more, from seven cities. We categorize this comprehensive data into three streams: mobility data, geographical data, and meteorological data. We begin by detailing the generation process and basic properties of CityNet. Additionally, we conduct extensive data mining and machine learning experiments, including spatio-temporal predictions, transfer learning, and reinforcement learning, to facilitate the use of CityNet. Our experimental results provide benchmarks for various tasks and methods, and also reveal internal correlations among cities and tasks within CityNet that can be leveraged to improve spatiotemporal forecasting performance. Based on our benchmarking results and the correlations uncovered, we believe that CityNet can significantly contribute to the field of urban computing by enabling research on advanced topics.

replace Unsupervised Learning for Solving the Travelling Salesman Problem

Authors: Yimeng Min, Yiwei Bai, Carla P. Gomes

Abstract: We propose UTSP, an unsupervised learning (UL) framework for solving the Travelling Salesman Problem (TSP). We train a Graph Neural Network (GNN) using a surrogate loss. The GNN outputs a heat map representing the probability for each edge to be part of the optimal path. We then apply local search to generate our final prediction based on the heat map. Our loss function consists of two parts: one pushes the model to find the shortest path and the other serves as a surrogate for the constraint that the route should form a Hamiltonian Cycle. Experimental results show that UTSP outperforms the existing data-driven TSP heuristics. Our approach is parameter efficient as well as data efficient: the model takes $\sim$ 10\% of the number of parameters and $\sim$ 0.2\% of training samples compared with reinforcement learning or supervised learning methods.

replace YAGO 4.5: A Large and Clean Knowledge Base with a Rich Taxonomy

Authors: Fabian Suchanek, Mehwish Alam, Thomas Bonald, Lihu Chen, Pierre-Henri Paris, Jules Soria

Abstract: Knowledge Bases (KBs) find applications in many knowledge-intensive tasks and, most notably, in information retrieval. Wikidata is one of the largest public general-purpose KBs. Yet, its collaborative nature has led to a convoluted schema and taxonomy. The YAGO 4 KB cleaned up the taxonomy by incorporating the ontology of Schema.org, resulting in a cleaner structure amenable to automated reasoning. However, it also cut away large parts of the Wikidata taxonomy, which is essential for information retrieval. In this paper, we extend YAGO 4 with a large part of the Wikidata taxonomy - while respecting logical constraints and the distinction between classes and instances. This yields YAGO 4.5, a new, logically consistent version of YAGO that adds a rich layer of informative classes. An intrinsic and an extrinsic evaluation show the value of the new resource.

replace Designing Interpretable ML System to Enhance Trust in Healthcare: A Systematic Review to Proposed Responsible Clinician-AI-Collaboration Framework

Authors: Elham Nasarian, Roohallah Alizadehsani, U. Rajendra Acharya, Kwok-Leung Tsui

Abstract: This paper explores the significant impact of AI-based medical devices, including wearables, telemedicine, large language models, and digital twins, on clinical decision support systems. It emphasizes the importance of producing outcomes that are not only accurate but also interpretable and understandable to clinicians, addressing the risk that lack of interpretability poses in terms of mistrust and reluctance to adopt these technologies in healthcare. The paper reviews interpretable AI processes, methods, applications, and the challenges of implementation in healthcare, focusing on quality control to facilitate responsible communication between AI systems and clinicians. It breaks down the interpretability process into data pre-processing, model selection, and post-processing, aiming to foster a comprehensive understanding of the crucial role of a robust interpretability approach in healthcare and to guide future research in this area. with insights for creating responsible clinician-AI tools for healthcare, as well as to offer a deeper understanding of the challenges they might face. Our research questions, eligibility criteria and primary goals were identified using Preferred Reporting Items for Systematic reviews and Meta-Analyses guideline and PICO method; PubMed, Scopus and Web of Science databases were systematically searched using sensitive and specific search strings. In the end, 52 publications were selected for data extraction which included 8 existing reviews and 44 related experimental studies. The paper offers general concepts of interpretable AI in healthcare and discuss three-levels interpretability process. Additionally, it provides a comprehensive discussion of evaluating robust interpretability AI in healthcare. Moreover, this survey introduces a step-by-step roadmap for implementing responsible AI in healthcare.

replace What AIs are not Learning (and Why): Bio-Inspired Foundation Models for Robots

Authors: Mark Stefik

Abstract: What applications is AI ready for? Advances in deep learning and generative approaches have produced AIs that learn from massive online data and outperform manually built AIs. Some of these AIs outperform people. It is easy (but misleading) to conclude that today's AI technologies are learning to do anything and everything. Conversely, it is striking that big data, deep learning, and generative AI have had so little impact on robotics. For example, today's autonomous robots do not learn to provide home care or to be nursing assistants. Current robot applications are created using manual programming, mathematical models, planning frameworks, and reinforcement learning. These methods do not lead to the leaps in performance and generality seen with deep learning and generative AI. Better approaches to train robots for service applications would greatly expand their social roles and economic impact. AI research is now extending "big data" approaches to train robots by combining multimodal sensing and effector technology from robotics with deep learning technology adapted for embodied systems. These approaches create robotic (or "experiential") foundation models (FMs) for AIs that perceive and act in the world. Robotic FM approaches differ in their expectations, sources, and timing of training data. Like mainstream FM approaches, some robotic FM approaches use vast data to create adult expert-level robots. In contrast, developmental robotic approaches would create progressive FMs that learn continuously and experientially. Aspirationally, these would progress from child-level to student-level, apprentice-level, and expert levels. They would acquire self-developed and socially developed competences. These AIs would model the goals of people around them. Like people, they would learn to coordinate, communicate, and collaborate.

replace Autonomous Evaluation and Refinement of Digital Agents

Authors: Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr

Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve a 75% relative improvement in a challenging domain transfer scenario.

replace-cross Universal Prompt Tuning for Graph Neural Networks

Authors: Taoran Fang, Yunchao Zhang, Yang Yang, Chunping Wang, Lei Chen

Abstract: In recent years, prompt tuning has sparked a research surge in adapting pre-trained models. Unlike the unified pre-training strategy employed in the language field, the graph field exhibits diverse pre-training strategies, posing challenges in designing appropriate prompt-based tuning methods for graph neural networks. While some pioneering work has devised specialized prompting functions for models that employ edge prediction as their pre-training tasks, these methods are limited to specific pre-trained GNN models and lack broader applicability. In this paper, we introduce a universal prompt-based tuning method called Graph Prompt Feature (GPF) for pre-trained GNN models under any pre-training strategy. GPF operates on the input graph's feature space and can theoretically achieve an equivalent effect to any form of prompting function. Consequently, we no longer need to illustrate the prompting function corresponding to each pre-training strategy explicitly. Instead, we employ GPF to obtain the prompted graph for the downstream task in an adaptive manner. We provide rigorous derivations to demonstrate the universality of GPF and make guarantee of its effectiveness. The experimental results under various pre-training strategies indicate that our method performs better than fine-tuning, with an average improvement of about 1.4% in full-shot scenarios and about 3.2% in few-shot scenarios. Moreover, our method significantly outperforms existing specialized prompt-based tuning methods when applied to models utilizing the pre-training strategy they specialize in. These numerous advantages position our method as a compelling alternative to fine-tuning for downstream adaptations.

replace-cross A Generic Shared Attention Mechanism for Various Backbone Neural Networks

Authors: Zhongzhan Huang, Senwei Liang, Mingfu Liang, Liang Lin

Abstract: The self-attention mechanism has emerged as a critical component for improving the performance of various backbone neural networks. However, current mainstream approaches individually incorporate newly designed self-attention modules (SAMs) into each layer of the network for granted without fully exploiting their parameters' potential. This leads to suboptimal performance and increased parameter consumption as the network depth increases. To improve this paradigm, in this paper, we first present a counterintuitive but inherent phenomenon: SAMs tend to produce strongly correlated attention maps across different layers, with an average Pearson correlation coefficient of up to 0.85. Inspired by this inherent observation, we propose Dense-and-Implicit Attention (DIA), which directly shares SAMs across layers and employs a long short-term memory module to calibrate and bridge the highly correlated attention maps of different layers, thus improving the parameter utilization efficiency of SAMs. This design of DIA is also consistent with the neural network's dynamical system perspective. Through extensive experiments, we demonstrate that our simple yet effective DIA can consistently enhance various network backbones, including ResNet, Transformer, and UNet, across tasks such as image classification, object detection, and image generation using diffusion models.

replace-cross Discovering Closed-Loop Failures of Vision-Based Controllers via Reachability Analysis

Authors: Kaustav Chakraborty, Somil Bansal

Abstract: Machine learning driven image-based controllers allow robotic systems to take intelligent actions based on the visual feedback from their environment. Understanding when these controllers might lead to system safety violations is important for their integration in safety-critical applications and engineering corrective safety measures for the system. Existing methods leverage simulation-based testing (or falsification) to find the failures of vision-based controllers, i.e., the visual inputs that lead to closed-loop safety violations. However, these techniques do not scale well to the scenarios involving high-dimensional and complex visual inputs, such as RGB images. In this work, we cast the problem of finding closed-loop vision failures as a Hamilton-Jacobi (HJ) reachability problem. Our approach blends simulation-based analysis with HJ reachability methods to compute an approximation of the backward reachable tube (BRT) of the system, i.e., the set of unsafe states for the system under vision-based controllers. Utilizing the BRT, we can tractably and systematically find the system states and corresponding visual inputs that lead to closed-loop failures. These visual inputs can be subsequently analyzed to find the input characteristics that might have caused the failure. Besides its scalability to high-dimensional visual inputs, an explicit computation of BRT allows the proposed approach to capture non-trivial system failures that are difficult to expose via random simulations. We demonstrate our framework on two case studies involving an RGB image-based neural network controller for (a) autonomous indoor navigation, and (b) autonomous aircraft taxiing.

replace-cross Using Persuasive Writing Strategies to Explain and Detect Health Misinformation

Authors: Danial Kamali, Joseph Romain, Huiyi Liu, Wei Peng, Jingbo Meng, Parisa Kordjamshidi

Abstract: Nowadays, the spread of misinformation is a prominent problem in society. Our research focuses on aiding the automatic identification of misinformation by analyzing the persuasive strategies employed in textual documents. We introduce a novel annotation scheme encompassing common persuasive writing tactics to achieve our objective. Additionally, we provide a dataset on health misinformation, thoroughly annotated by experts utilizing our proposed scheme. Our contribution includes proposing a new task of annotating pieces of text with their persuasive writing strategy types. We evaluate fine-tuning and prompt-engineering techniques with pre-trained language models of the BERT family and the generative large language models of the GPT family using persuasive strategies as an additional source of information. We evaluate the effects of employing persuasive strategies as intermediate labels in the context of misinformation detection. Our results show that those strategies enhance accuracy and improve the explainability of misinformation detection models. The persuasive strategies can serve as valuable insights and explanations, enabling other models or even humans to make more informed decisions regarding the trustworthiness of the information.

replace-cross Disentangled Explanations of Neural Network Predictions by Finding Relevant Subspaces

Authors: Pattarawat Chormai, Jan Herrmann, Klaus-Robert M\"uller, Gr\'egoire Montavon

Abstract: Explainable AI aims to overcome the black-box nature of complex ML models like neural networks by generating explanations for their predictions. Explanations often take the form of a heatmap identifying input features (e.g. pixels) that are relevant to the model's decision. These explanations, however, entangle the potentially multiple factors that enter into the overall complex decision strategy. We propose to disentangle explanations by extracting at some intermediate layer of a neural network, subspaces that capture the multiple and distinct activation patterns (e.g. visual concepts) that are relevant to the prediction. To automatically extract these subspaces, we propose two new analyses, extending principles found in PCA or ICA to explanations. These novel analyses, which we call principal relevant component analysis (PRCA) and disentangled relevant subspace analysis (DRSA), maximize relevance instead of e.g. variance or kurtosis. This allows for a much stronger focus of the analysis on what the ML model actually uses for predicting, ignoring activations or concepts to which the model is invariant. Our approach is general enough to work alongside common attribution techniques such as Shapley Value, Integrated Gradients, or LRP. Our proposed methods show to be practically useful and compare favorably to the state of the art as demonstrated on benchmarks and three use cases.

replace-cross Neural Architecture Search via Two Constant Shared Weights Initialisations

Authors: Ekaterina Gracheva

Abstract: In recent years, zero-cost metrics are gaining ground in neural architecture search (NAS). There metrics allow finding the optimal neural network for a given task faster and with a lesser computational load than conventional NAS methods. Equally important is that they also shed some light on the internal workings of neural architectures. This paper presents a zero-cost metric that highly correlated with the train set accuracy across the NAS-Bench-101, NAS-Bench-201 and NAS-Bench-NLP benchmark datasets. We evaluate a neural achitecture's potential based on the outputs' statistics after two constant shared weights initialisations. For this, we only use an unlabelled mini-batch of data. We observe that the dispersion of the outputs between two initialisations positively correlates with trained accuracy. The correlation further improves when we normalise dispersion by average output magnitude. The resulting metric, epsilon, does not require gradients computation and unbinds the NAS procedure from training hyperparameters, loss metrics and human-labelled data. Our method is easy to integrate within existing NAS algorithms and takes a fraction of a second to evaluate a single network. The code supporting this study can be found on GitHub at https://github.com/egracheva/epsinas.

URLs: https://github.com/egracheva/epsinas.

replace-cross Explainable Anomaly Detection in Images and Videos: A Survey

Authors: Yizhou Wang, Dongliang Guo, Sheng Li, Octavia Camps, Yun Fu

Abstract: Anomaly detection and localization of visual data, including images and videos, are of great significance in both machine learning academia and applied real-world scenarios. Despite the rapid development of visual anomaly detection techniques in recent years, the interpretations of these black-box models and reasonable explanations of why anomalies can be distinguished out are scarce. This paper provides the first survey concentrated on explainable visual anomaly detection methods. We first introduce the basic background of image-level and video-level anomaly detection. Then, as the main content of this survey, a comprehensive and exhaustive literature review of explainable anomaly detection methods for both images and videos is presented. Next, we analyze why some explainable anomaly detection methods can be applied to both images and videos and why others can be only applied to one modality. Additionally, we provide summaries of current 2D visual anomaly detection datasets and evaluation metrics. Finally, we discuss several promising future directions and open problems to explore the explainability of 2D visual anomaly detection. The related resource collection is given at https://github.com/wyzjack/Awesome-XAD.

URLs: https://github.com/wyzjack/Awesome-XAD.

replace-cross Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Authors: Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao

Abstract: With the urgent demand for generalized deep models, many pre-trained big models are proposed, such as BERT, ViT, GPT, etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), the multi-modal pre-trained big models have also drawn more and more attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper could provide new insights and helps fresh researchers to track the most cutting-edge works. Specifically, we firstly introduce the background of multi-modal pre-training by reviewing the conventional deep learning, pre-training works in natural language process, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network architectures, and knowledge enhanced pre-training. After that, we introduce the downstream tasks used for the validation of large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey. This paper has been published by the journal Machine Intelligence Research (MIR), https://link.springer.com/article/10.1007/s11633-022-1410-8, DOI: 10.1007/s11633-022-1410-8, vol. 20, no. 4, pp. 447-482, 2023.

URLs: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey., https://link.springer.com/article/10.1007/s11633-022-1410-8,

replace-cross Understanding Expressivity of GNN in Rule Learning

Authors: Haiquan Qiu, Yongqi Zhang, Yong Li, Quanming Yao

Abstract: Rule learning is critical to improving knowledge graph (KG) reasoning due to their ability to provide logical and interpretable explanations. Recently, Graph Neural Networks (GNNs) with tail entity scoring achieve the state-of-the-art performance on KG reasoning. However, the theoretical understandings for these GNNs are either lacking or focusing on single-relational graphs, leaving what the kind of rules these GNNs can learn an open problem. We propose to fill the above gap in this paper. Specifically, GNNs with tail entity scoring are unified into a common framework. Then, we analyze their expressivity by formally describing the rule structures they can learn and theoretically demonstrating their superiority. These results further inspire us to propose a novel labeling strategy to learn more rules in KG reasoning. Experimental results are consistent with our theoretical findings and verify the effectiveness of our proposed method. The code is publicly available at https://github.com/LARS-research/Rule-learning-expressivity.

URLs: https://github.com/LARS-research/Rule-learning-expressivity.

replace-cross A Two-Stage Framework with Self-Supervised Distillation For Cross-Domain Text Classification

Authors: Yunlong Feng, Bohan Li, Libo Qin, Xiao Xu, Wanxiang Che

Abstract: Cross-domain text classification aims to adapt models to a target domain that lacks labeled data. It leverages or reuses rich labeled data from the different but related source domain(s) and unlabeled data from the target domain. To this end, previous work focuses on either extracting domain-invariant features or task-agnostic features, ignoring domain-aware features that may be present in the target domain and could be useful for the downstream task. In this paper, we propose a two-stage framework for cross-domain text classification. In the first stage, we finetune the model with mask language modeling (MLM) and labeled data from the source domain. In the second stage, we further fine-tune the model with self-supervised distillation (SSD) and unlabeled data from the target domain. We evaluate its performance on a public cross-domain text classification benchmark and the experiment results show that our method achieves new state-of-the-art results for both single-source domain adaptations (94.17% $\uparrow$1.03%) and multi-source domain adaptations (95.09% $\uparrow$1.34%).

replace-cross Ripple Knowledge Graph Convolutional Networks For Recommendation Systems

Authors: Chen Li, Yang Cao, Ye Zhu, Debo Cheng, Chengyuan Li, Yasuhiko Morimoto

Abstract: Using knowledge graphs to assist deep learning models in making recommendation decisions has recently been proven to effectively improve the model's interpretability and accuracy. This paper introduces an end-to-end deep learning model, named RKGCN, which dynamically analyses each user's preferences and makes a recommendation of suitable items. It combines knowledge graphs on both the item side and user side to enrich their representations to maximize the utilization of the abundant information in knowledge graphs. RKGCN is able to offer more personalized and relevant recommendations in three different scenarios. The experimental results show the superior effectiveness of our model over 5 baseline models on three real-world datasets including movies, books, and music.

replace-cross Federated Learning Model Aggregation in Heterogenous Aerial and Space Networks

Authors: Fan Dong, Ali Abbasi, Steve Drew, Henry Leung, Xin Wang, Jiayu Zhou

Abstract: Federated learning offers a promising approach under the constraints of networking and data privacy constraints in aerial and space networks (ASNs), utilizing large-scale private edge data from drones, balloons, and satellites. Existing research has extensively studied the optimization of the learning process, computing efficiency, and communication overhead. An important yet often overlooked aspect is that participants contribute predictive knowledge with varying diversity of knowledge, affecting the quality of the learned federated models. In this paper, we propose a novel approach to address this issue by introducing a Weighted Averaging and Client Selection (WeiAvgCS) framework that emphasizes updates from high-diversity clients and diminishes the influence of those from low-diversity clients. Direct sharing of the data distribution may be prohibitive due to the additional private information that is sent from the clients. As such, we introduce an estimation for the diversity using a projection-based method. Extensive experiments have been performed to show WeiAvgCS's effectiveness. WeiAvgCS could converge 46% faster on FashionMNIST and 38% faster on CIFAR10 than its benchmarks on average in our experiments.

replace-cross MiniLLM: Knowledge Distillation of Large Language Models

Authors: Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm.

URLs: https://github.com/microsoft/LMOps/tree/main/minillm.

replace-cross A Shift In Artistic Practices through Artificial Intelligence

Authors: K{\i}van\c{c} Tatar, Petter Ericson, Kelsey Cotton, Paola Torres N\'u\~nez del Prado, Roser Batlle-Roca, Beatriz Cabrero-Daniel, Sara Ljungblad, Georgios Diapoulis, Jabbar Hussain

Abstract: The explosion of content generated by artificial intelligence (AI) models has initiated a cultural shift in arts, music, and media, whereby roles are changing, values are shifting, and conventions are challenged. The vast, readily available dataset of the Internet has created an environment for AI models to be trained on any content on the Web. With AI models shared openly and used by many globally, how does this new paradigm shift challenge the status quo in artistic practices? What kind of changes will AI technology bring to music, arts, and new media?

replace-cross Expediting Building Footprint Extraction from High-resolution Remote Sensing Images via progressive lenient supervision

Authors: Haonan Guo, Bo Du, Chen Wu, Xin Su, Liangpei Zhang

Abstract: The efficacy of building footprint segmentation from remotely sensed images has been hindered by model transfer effectiveness. Many existing building segmentation methods were developed upon the encoder-decoder architecture of U-Net, in which the encoder is finetuned from the newly developed backbone networks that are pre-trained on ImageNet. However, the heavy computational burden of the existing decoder designs hampers the successful transfer of these modern encoder networks to remote sensing tasks. Even the widely-adopted deep supervision strategy fails to mitigate these challenges due to its invalid loss in hybrid regions where foreground and background pixels are intermixed. In this paper, we conduct a comprehensive evaluation of existing decoder network designs for building footprint segmentation and propose an efficient framework denoted as BFSeg to enhance learning efficiency and effectiveness. Specifically, a densely-connected coarse-to-fine feature fusion decoder network that facilitates easy and fast feature fusion across scales is proposed. Moreover, considering the invalidity of hybrid regions in the down-sampled ground truth during the deep supervision process, we present a lenient deep supervision and distillation strategy that enables the network to learn proper knowledge from deep supervision. Building upon these advancements, we have developed a new family of building segmentation networks, which consistently surpass prior works with outstanding performance and efficiency across a wide range of newly developed encoder networks.

replace-cross Building-road Collaborative Extraction from Remotely Sensed Images via Cross-Interaction

Authors: Haonan Guo, Xin Su, Chen Wu, Bo Du, Liangpei Zhang

Abstract: Buildings are the basic carrier of social production and human life; roads are the links that interconnect social networks. Building and road information has important application value in the frontier fields of regional coordinated development, disaster prevention, auto-driving, etc. Mapping buildings and roads from very high-resolution (VHR) remote sensing images have become a hot research topic. However, the existing methods often ignore the strong spatial correlation between roads and buildings and extract them in isolation. To fully utilize the complementary advantages between buildings and roads, we propose a building-road collaborative extraction method based on multi-task and cross-scale feature interaction to improve the accuracy of both tasks in a complementary way. A multi-task interaction module is proposed to interact information across tasks and preserve the unique information of each task, which tackle the seesaw phenomenon in multitask learning. By considering the variation in appearance and structure between buildings and roads, a cross-scale interaction module is designed to automatically learn the optimal reception field for different tasks. Compared with many existing methods that train each task individually, the proposed collaborative extraction method can utilize the complementary advantages between buildings and roads by the proposed inter-task and inter-scale feature interactions, and automatically select the optimal reception field for different tasks. Experiments on a wide range of urban and rural scenarios show that the proposed algorithm can achieve building-road extraction with outstanding performance and efficiency.

replace-cross L2MAC: Large Language Model Automatic Computer for Extensive Code Generation

Authors: Samuel Holt, Max Ruiz Luyten, Mihaela van der Schaar

Abstract: Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and coherent outputs. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long output generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based general-purpose stored-program automatic computer (von Neumann architecture) framework, an LLM-based multi-agent system, for long and consistent output generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and intermediate outputs. Each instruction in turn is executed by a separate LLM agent, whose context is managed by a control unit capable of precise memory reading and writing to ensure effective interaction with the file store. These components enable L2MAC to generate extensive outputs, bypassing the constraints of the finite context window while producing outputs that fulfill a complex user-specified task. We empirically demonstrate that L2MAC achieves state-of-the-art performance in generating large codebases for system design tasks, significantly outperforming other coding methods in implementing the detailed user-specified task; we show that L2MAC works for general-purpose extensive text-based tasks, such as writing an entire book; and we provide valuable insights into L2MAC's performance improvement over existing methods.

replace-cross Towards Foundation Models for Knowledge Graph Reasoning

Authors: Mikhail Galkin, Xinyu Yuan, Hesham Mostafa, Jian Tang, Zhaocheng Zhu

Abstract: Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to the transferable representations such as a vocabulary of tokens in language. Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap. The key challenge of designing foundation models on KGs is to learn such transferable representations that enable inference on any graph with arbitrary entity and relation vocabularies. In this work, we make a step towards such foundation models and present ULTRA, an approach for learning universal and transferable graph representations. ULTRA builds relational representations as a function conditioned on their interactions. Such a conditioning strategy allows a pre-trained ULTRA model to inductively generalize to any unseen KG with any relation vocabulary and to be fine-tuned on any graph. Conducting link prediction experiments on 57 different KGs, we find that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par or better than strong baselines trained on specific graphs. Fine-tuning further boosts the performance.

replace-cross SALMON: Self-Alignment with Instructable Reward Models

Authors: Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Abstract: Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents. However, a significant limitation of such an approach is its dependency on high-quality human annotations, making its application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and in-distribution response preferences. This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision, using only a small set of human-defined principles, yet achieving superior performance. Central to our approach is an instructable reward model. Trained on synthetic preference data, this model can generate reward scores based on arbitrary human-defined principles. By merely adjusting these principles during the RL training phase, we gain full control over the preferences with the instructable reward model, subsequently influencing the behavior of the RL-trained policy models, and reducing the reliance on the collection of online human preferences. Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. We have open-sourced the code and model weights to encourage further research into aligning LLM-based AI agents with enhanced supervision efficiency, improved controllability, and scalable oversight.

replace-cross MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification

Authors: Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chlo\'e Clavel, Fabian Suchanek

Abstract: We introduce MAFALDA, a benchmark for fallacy classification that merges and unites previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models under a zero-shot learning setting and human performances on MAFALDA to assess their capability to detect and classify fallacies.

replace-cross Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning

Authors: Thomas Chen

Abstract: We consider the scenario of supervised learning in Deep Learning (DL) networks, and exploit the arbitrariness of choice in the Riemannian metric relative to which the gradient descent flow can be defined (a general fact of differential geometry). In the standard approach to DL, the gradient flow on the space of parameters (weights and biases) is defined with respect to the Euclidean metric. Here instead, we choose the gradient flow with respect to the Euclidean metric in the output layer of the DL network. This naturally induces two modified versions of the gradient descent flow in the parameter space, one adapted for the overparametrized setting, and the other for the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the ${\mathcal L}^2$ cost to its global minimum at a uniform exponential convergence rate; one thereby obtains an a priori stopping time for any prescribed proximity to the global minimum. We point out relations of the latter to sub-Riemannian geometry. Moreover, we generalize the above framework to the situation in which the rank condition does not hold; in particular, we show that local equilibria can only exist if a rank loss occurs, and that generically, they are not isolated points, but elements of a critical submanifold of parameter space.

replace-cross SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples

Authors: Phillip Howard, Avinash Madasu, Tiep Le, Gustavo Lujan Moreno, Anahita Bhiwandiwalla, Vasudev Lal

Abstract: While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also posses harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually while ignoring biases associated with intersections between social attributes. This could be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes. To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender). Through our over-generate-then-filter methodology, we produce SocialCounterfactuals, a high-quality dataset containing 171k image-text pairs for probing intersectional biases related to gender, race, and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs.

replace-cross Data-driven Semi-supervised Machine Learning with Surrogate Safety Measures for Abnormal Driving Behavior Detection

Authors: Lanxin Zhang, Yongqi Dong, Haneen Farah, Arkady Zgonnikov, Bart van Arem

Abstract: Detecting abnormal driving behavior is critical for road traffic safety and the evaluation of drivers' behavior. With the advancement of machine learning (ML) algorithms and the accumulation of naturalistic driving data, many ML models have been adopted for abnormal driving behavior detection. Most existing ML-based detectors rely on (fully) supervised ML methods, which require substantial labeled data. However, ground truth labels are not always available in the real world, and labeling large amounts of data is tedious. Thus, there is a need to explore unsupervised or semi-supervised methods to make the anomaly detection process more feasible and efficient. To fill this research gap, this study analyzes large-scale real-world data revealing several abnormal driving behaviors (e.g., sudden acceleration, rapid lane-changing) and develops a Hierarchical Extreme Learning Machines (HELM) based semi-supervised ML method using partly labeled data to accurately detect the identified abnormal driving behaviors. Moreover, previous ML-based approaches predominantly utilize basic vehicle motion features (such as velocity and acceleration) to label and detect abnormal driving behaviors, while this study seeks to introduce Surrogate Safety Measures (SSMs) as the input features for ML models to improve the detection performance. Results from extensive experiments demonstrate the effectiveness of the proposed semi-supervised ML model with the introduced SSMs serving as important features. The proposed semi-supervised ML method outperforms other baseline semi-supervised or unsupervised methods regarding various metrics, e.g., delivering the best accuracy at 99.58% and the best F-1 measure at 0.9913. The ablation study further highlights the significance of SSMs for advancing detection performance.

replace-cross Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos

Authors: Mehmet Saygin Seyfioglu, Wisdom O. Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, Linda Shapiro

Abstract: Diagnosis in histopathology requires a global whole slide images (WSIs) analysis, requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-model models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches, without a spatial grounding of the concepts within each patch and without a wider view of the WSI. Therefore, they lack sufficient diagnostic capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provides spatial localization of narrations by automatically extracting the narrators' cursor positions. Quilt-Instruct supports contextual reasoning by extracting diagnosis and supporting facts from the entire WSI. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1283 human-generated question-answers. We also thoroughly evaluate Quilt-LLaVA using public histopathology datasets, where Quilt-LLaVA significantly outperforms SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set VQA. Our code, data, and model are publicly accessible at quilt-llava.github.io.

replace-cross Verification of Neural Reachable Tubes via Scenario Optimization and Conformal Prediction

Authors: Albert Lin, Somil Bansal

Abstract: Learning-based approaches for controlling safety-critical systems are rapidly growing in popularity; thus, it is important to assure their performance and safety. Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for providing such guarantees, since it can handle general nonlinear system dynamics, bounded adversarial system disturbances, and state and input constraints. However, its computational and memory complexity scales exponentially with the state dimension, making it intractable for large-scale systems. To overcome this challenge, neural approaches, such as DeepReach, have been used to synthesize reachable tubes and safety controllers for high-dimensional systems. However, verifying these neural reachable tubes remains challenging. In this work, we propose two verification methods, based on robust scenario optimization and conformal prediction, to provide probabilistic safety guarantees for neural reachable tubes. Our methods allow a direct trade-off between resilience to outlier errors in the neural tube, which are inevitable in a learning-based approach, and the strength of the probabilistic safety guarantee. Furthermore, we show that split conformal prediction, a widely used method in the machine learning community for uncertainty quantification, reduces to a scenario-based approach, making the two methods equivalent not only for verification of neural reachable tubes but also more generally. To our knowledge, our proof is the first in the literature to show a strong relationship between conformal prediction and scenario optimization. Finally, we propose an outlier-adjusted verification approach that uses the error distribution in neural reachable tubes to recover greater safe volumes. We demonstrate the efficacy of the proposed approaches for the high-dimensional problems of multi-vehicle collision avoidance and rocket landing with no-go zones.

replace-cross Data-Efficient Multimodal Fusion on a Single GPU

Authors: No\"el Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, Maksims Volkovs

Abstract: The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance -- and in certain cases outperform state-of-the art methods -- in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! 600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.

URLs: https://github.com/layer6ai-labs/fusemix.

replace-cross Understanding Video Transformers via Universal Concept Discovery

Authors: Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov

Abstract: This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations - concepts, and ranking their importance to the output of a model. The resulting concepts are highly interpretable, revealing spatio-temporal reasoning mechanisms and object-centric representations in unstructured video models. Performing this analysis jointly over a diverse set of supervised and self-supervised representations, we discover that some of these mechanism are universal in video transformers. Finally, we show that VTCD can be used for fine-grained action recognition and video object segmentation.

replace-cross Visibility into AI Agents

Authors: Alan Chan, Carson Ezell, Max Kaufmann, Kevin Wei, Lewis Hammond, Herbie Bradley, Emma Bluemke, Nitarshan Rajkumar, David Krueger, Noam Kolt, Lennart Heim, Markus Anderljung

Abstract: Increased delegation of commercial, scientific, governmental, and personal activities to AI agents -- systems capable of pursuing complex goals with limited supervision -- may exacerbate existing societal risks and introduce new risks. Understanding and mitigating these risks involves critically evaluating existing governance structures, revising and adapting these structures where needed, and ensuring accountability of key stakeholders. Information about where, why, how, and by whom certain AI agents are used, which we refer to as visibility, is critical to these objectives. In this paper, we assess three categories of measures to increase visibility into AI agents: agent identifiers, real-time monitoring, and activity logging. For each, we outline potential implementations that vary in intrusiveness and informativeness. We analyze how the measures apply across a spectrum of centralized through decentralized deployment contexts, accounting for various actors in the supply chain including hardware and software service providers. Finally, we discuss the implications of our measures for privacy and concentration of power. Further work into understanding the measures and mitigating their negative impacts can help to build a foundation for the governance of AI agents.

replace-cross MixedNUTS: Training-Free Accuracy-Robustness Balance via Nonlinearly Mixed Classifiers

Authors: Yatong Bai, Mo Zhou, Vishal M. Patel, Somayeh Sojoudi

Abstract: Adversarial robustness often comes at the cost of degraded accuracy, impeding the real-life application of robust classification models. Training-based solutions for better trade-offs are limited by incompatibilities with already-trained high-performance large models, necessitating the exploration of training-free ensemble approaches. Observing that robust models are more confident in correct predictions than in incorrect ones on clean and adversarial data alike, we speculate amplifying this "benign confidence property" can reconcile accuracy and robustness in an ensemble setting. To achieve so, we propose "MixedNUTS", a training-free method where the output logits of a robust classifier and a standard non-robust classifier are processed by nonlinear transformations with only three parameters, which are optimized through an efficient algorithm. MixedNUTS then converts the transformed logits into probabilities and mixes them as the overall output. On CIFAR-10, CIFAR-100, and ImageNet datasets, experimental results with custom strong adaptive attacks demonstrate MixedNUTS's vastly improved accuracy and near-SOTA robustness -- it boosts CIFAR-100 clean accuracy by 7.86 points, sacrificing merely 0.87 points in robust accuracy.

replace-cross Zero-Shot Clinical Trial Patient Matching with LLMs

Authors: Michael Wornow, Alejandro Lozano, Dev Dash, Jenelle Jindal, Kenneth W. Mahaffey, Nigam H. Shah

Abstract: Matching patients to clinical trials is a key unsolved challenge in bringing new drugs to market. Today, identifying patients who meet a trial's eligibility criteria is highly manual, taking up to 1 hour per patient. Automated screening is challenging, however, as it requires understanding unstructured clinical text. Large language models (LLMs) offer a promising solution. In this work, we explore their application to trial matching. First, we design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria (also specified as free text). Our zero-shot system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark. Second, we improve the data and cost efficiency of our method by identifying a prompting strategy which matches patients an order of magnitude faster and more cheaply than the status quo, and develop a two-stage retrieval pipeline that reduces the number of tokens processed by up to a third while retaining high performance. Third, we evaluate the interpretability of our system by having clinicians evaluate the natural language justifications generated by the LLM for each eligibility decision, and show that it can output coherent explanations for 97% of its correct decisions and 75% of its incorrect ones. Our results establish the feasibility of using LLMs to accelerate clinical trial operations.

replace-cross Hysteresis Compensation of Flexible Continuum Manipulator using RGBD Sensing and Temporal Convolutional Network

Authors: Junhyun Park, Seonghyeok Jang, Hyojae Park, Seongjun Bae, Minho Hwang

Abstract: Flexible continuum manipulators are valued for minimally invasive surgery, offering access to confined spaces through nonlinear paths. However, cable-driven manipulators face control difficulties due to hysteresis from cabling effects such as friction, elongation, and coupling. These effects are difficult to model due to nonlinearity and the difficulties become even more evident when dealing with long and coupled, multi-segmented manipulator. This paper proposes a data-driven approach based on Deep Neural Networks (DNN) to capture these nonlinear and previous states-dependent characteristics of cable actuation. We collect physical joint configurations according to command joint configurations using RGBD sensing and 7 fiducial markers to model the hysteresis of the proposed manipulator. Result on a study comparing the estimation performance of four DNN models show that the Temporal Convolution Network (TCN) demonstrates the highest predictive capability. Leveraging trained TCNs, we build a control algorithm to compensate for hysteresis. Tracking tests in task space using unseen trajectories show that the proposed control algorithm reduces the average position and orientation error by 61.39% (from 13.7mm to 5.29 mm) and 64.04% (from 31.17{\deg} to 11.21{\deg}), respectively. This result implies that the proposed calibrated controller effectively reaches the desired configurations by estimating the hysteresis of the manipulator. Applying this method in real surgical scenarios has the potential to enhance control precision and improve surgical performance.

replace-cross Location-guided Head Pose Estimation for Fisheye Image

Authors: Bing Li, Dong Zhang, Cheng Huang, Yun Xian, Ming Li, Dah-Jye Lee

Abstract: Camera with a fisheye or ultra-wide lens covers a wide field of view that cannot be modeled by the perspective projection. Serious fisheye lens distortion in the peripheral region of the image leads to degraded performance of the existing head pose estimation models trained on undistorted images. This paper presents a new approach for head pose estimation that uses the knowledge of head location in the image to reduce the negative effect of fisheye distortion. We develop an end-to-end convolutional neural network to estimate the head pose with the multi-task learning of head pose and head location. Our proposed network estimates the head pose directly from the fisheye image without the operation of rectification or calibration. We also created a fisheye-distorted version of the three popular head pose estimation datasets, BIWI, 300W-LP, and AFLW2000 for our experiments. Experiments results show that our network remarkably improves the accuracy of head pose estimation compared with other state-of-the-art one-stage and two-stage methods.

replace-cross ChatASU: Evoking LLM's Reflexion to Truly Understand Aspect Sentiment in Dialogues

Authors: Yiding Liu, Jingjing Wang, Jiamin Luo, Tao Zeng, Guodong Zhou

Abstract: Aspect Sentiment Understanding (ASU) in interactive scenarios (e.g., Question-Answering and Dialogue) has attracted ever-more interest in recent years and achieved important progresses. However, existing studies on interactive ASU largely ignore the coreference issue for opinion targets (i.e., aspects), while this phenomenon is ubiquitous in interactive scenarios especially dialogues, limiting the ASU performance. Recently, large language models (LLMs) shows the powerful ability to integrate various NLP tasks with the chat paradigm. In this way, this paper proposes a new Chat-based Aspect Sentiment Understanding (ChatASU) task, aiming to explore LLMs' ability in understanding aspect sentiments in dialogue scenarios. Particularly, this ChatASU task introduces a sub-task, i.e., Aspect Chain Reasoning (ACR) task, to address the aspect coreference issue. On this basis, we propose a Trusted Self-reflexion Approach (TSA) with ChatGLM as backbone to ChatASU. Specifically, this TSA treats the ACR task as an auxiliary task to boost the performance of the primary ASU task, and further integrates trusted learning into reflexion mechanisms to alleviate the LLMs-intrinsic factual hallucination problem in TSA. Furthermore, a high-quality ChatASU dataset is annotated to evaluate TSA, and extensive experiments show that our proposed TSA can significantly outperform several state-of-the-art baselines, justifying the effectiveness of TSA to ChatASU and the importance of considering the coreference and hallucination issues in ChatASU.

replace-cross GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing

Authors: Hao Lu, Xuesong Niu, Jiyao Wang, Yin Wang, Qingyong Hu, Jiaqi Tang, Yuting Zhang, Kaishen Yuan, Bin Huang, Zitong Yu, Dengbo He, Shuiguang Deng, Hao Chen, Yingcong Chen, Shiguang Shan

Abstract: Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite its success in language understanding, it is critical to evaluate the performance of downstream tasks for better human-centric applications. This paper assesses the application of MLLMs with 5 crucial abilities for affective computing, spanning from visual affective tasks and reasoning tasks. The results show that \gpt has high accuracy in facial action unit recognition and micro-expression detection while its general facial expression recognition performance is not accurate. We also highlight the challenges of achieving fine-grained micro-expression recognition and the potential for further study and demonstrate the versatility and potential of \gpt for handling advanced tasks in emotion recognition and related fields by integrating with task-related agents for more complex tasks, such as heart rate estimation through signal processing. In conclusion, this paper provides valuable insights into the potential applications and challenges of MLLMs in human-centric computing. Our interesting examples are at https://github.com/EnVision-Research/GPT4Affectivity.

URLs: https://github.com/EnVision-Research/GPT4Affectivity.

replace-cross Gemma: Open Models Based on Gemini Research and Technology

Authors: Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi\`ere, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, L\'eonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Am\'elie H\'eliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Cl\'ement Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Miku{\l}a, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu-hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Cl\'ement Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, Kathleen Kenealy

Abstract: This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.

replace-cross GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting

Authors: Xinjie Zhang, Xingtong Ge, Tongda Xu, Dailan He, Yan Wang, Hongwei Qin, Guo Lu, Jing Geng, Jun Zhang

Abstract: Implicit neural representations (INRs) recently achieved great success in image representation and compression, offering high visual quality and fast rendering speeds with 10-1000 FPS, assuming sufficient GPU resources are available. However, this requirement often hinders their use on low-end devices with limited memory. In response, we propose a groundbreaking paradigm of image representation and compression by 2D Gaussian Splatting, named GaussianImage. We first introduce 2D Gaussian to represent the image, where each Gaussian has 8 parameters including position, covariance and color. Subsequently, we unveil a novel rendering algorithm based on accumulated summation. Remarkably, our method with a minimum of 3$\times$ lower GPU memory usage and 5$\times$ faster fitting time not only rivals INRs (e.g., WIRE, I-NGP) in representation performance, but also delivers a faster rendering speed of 1500-2000 FPS regardless of parameter size. Furthermore, we integrate existing vector quantization technique to build an image codec. Experimental results demonstrate that our codec attains rate-distortion performance comparable to compression-based INRs such as COIN and COIN++, while facilitating decoding speeds of approximately 1000 FPS. Additionally, preliminary proof of concept shows that our codec surpasses COIN and COIN++ in performance when using partial bits-back coding. Code will be available at https://github.com/Xinjie-Q/GaussianImage.

URLs: https://github.com/Xinjie-Q/GaussianImage.

replace-cross M-HOF-Opt: Multi-Objective Hierarchical Output Feedback Optimization via Multiplier Induced Loss Landscape Scheduling

Authors: Xudong Sun, Nutan Chen, Alexej Gossmann, Yu Xing, Carla Feistner, Emilio Dorigatt, Felix Drost, Daniele Scarcella, Lisa Beer, Carsten Marr

Abstract: We address the online combinatorial choice of weight multipliers for multi-objective optimization of many loss terms parameterized by neural works via a probabilistic graphical model (PGM) for the joint model parameter and multiplier evolution process, with a hypervolume based likelihood promoting multi-objective descent. The corresponding parameter and multiplier estimation as a sequential decision process is then cast into an optimal control problem, where the multi-objective descent goal is dispatched hierarchically into a series of constraint optimization sub-problems. The subproblem constraint automatically adapts itself according to Pareto dominance and serves as the setpoint for the low level multiplier controller to schedule loss landscapes via output feedback of each loss term. Our method is multiplier-free and operates at the timescale of epochs, thus saves tremendous computational resources compared to full training cycle multiplier tuning. It also circumvents the excessive memory requirements and heavy computational burden of existing multi-objective deep learning methods. We applied it to domain invariant variational auto-encoding with 6 loss terms on the PACS domain generalization task, and observed robust performance across a range of controller hyperparameters, as well as different multiplier initial conditions, outperforming other multiplier scheduling methods. We offered modular implementation of our method, admitting extension to custom definition of many loss terms.

replace-cross Multi-role Consensus through LLMs Discussions for Vulnerability Detection

Authors: Zhenyu Mao, Jialong Li, Munan Li, Kenji Tei

Abstract: Recent advancements in large language models (LLMs) have highlighted the potential for vulnerability detection, a crucial component of software quality assurance. Despite this progress, most studies have been limited to the perspective of a single role, usually testers, lacking diverse viewpoints from different roles in a typical software development life-cycle, including both developers and testers. To this end, this paper introduces a multi-role approach to employ LLMs to act as different roles to simulate real-life code review process, engaging in discussions towards a consensus on the existence and classification of vulnerabilities in the code. Preliminary evaluation of the proposed approach indicates a 4.73% increase in the precision rate, 58.9% increase in the recall rate, and a 28.1% increase in the F1 score.

replace-cross StockGPT: A GenAI Model for Stock Prediction and Trading

Authors: Dat Mai

Abstract: This paper introduces StockGPT, an autoregressive ``number'' model trained and tested on 70 million daily U.S. stock returns over nearly 100 years. Treating each return series as a sequence of tokens, StockGPT automatically learns the hidden patterns predictive of future returns via its attention mechanism. On a held-out test sample from 2001 to 2023, a daily rebalanced long-short portfolio formed from StockGPT predictions earns an annual return of 119% with a Sharpe ratio of 6.5. The StockGPT-based portfolio completely spans momentum and long-/short-term reversals, eliminating the need for manually crafted price-based strategies, and also encompasses most leading stock market factors. This highlights the immense promise of generative AI in surpassing human in making complex financial investment decisions.

replace-cross IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse Workloads

Authors: Taiyi Wang, Eiko Yoneki

Abstract: This study introduces the Instance-Aware Index Advisor (IA2), a novel deep reinforcement learning (DRL)-based approach for optimizing index selection in databases facing large action spaces of potential candidates. IA2 introduces the Twin Delayed Deep Deterministic Policy Gradient - Temporal Difference State-Wise Action Refinery (TD3-TD-SWAR) model, enabling efficient index selection by understanding workload-index dependencies and employing adaptive action masking. This method includes a comprehensive workload model, enhancing its ability to adapt to unseen workloads and ensuring robust performance across diverse database environments. Evaluation on benchmarks such as TPC-H reveals IA2's suggested indexes' performance in enhancing runtime, securing a 40% reduction in runtime for complex TPC-H workloads compared to scenarios without indexes, and delivering a 20% improvement over existing state-of-the-art DRL-based index advisors.

replace-cross The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

Authors: Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu

Abstract: Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge. Notably, we achieved 1st rank on the leaderboard in the TTS track both with the whole training set and only 1h training data, with the highest UTMOS score and lowest bitrate among all submissions.

replace-cross MuPT: A Generative Symbolic Music Pretrained Transformer

Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, Stephen W. Huang, Wenhu Chen, Jie Fu, Ge Zhang

Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

replace-cross Text-Based Reasoning About Vector Graphics

Authors: Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, Heng Ji

Abstract: While large multimodal models excel in broad vision-language benchmarks, they often struggle with tasks requiring precise perception of low-level visual details, such as comparing line lengths or solving simple mazes. In particular, this failure mode persists in question-answering tasks about vector graphics -- images composed purely of 2D objects and shapes. To address this challenge, we propose the Visually Descriptive Language Model (VDLM), which performs text-based reasoning about vector graphics. VDLM leverages Scalable Vector Graphics (SVG) for a more precise visual description and first uses an off-the-shelf raster-to-SVG algorithm for encoding. Since existing language models cannot understand raw SVGs in a zero-shot setting, VDLM then bridges SVG with pretrained language models through a newly introduced intermediate symbolic representation, Primal Visual Description (PVD), comprising primitive attributes (e.g., shape, position, measurement) with their corresponding predicted values. PVD is task-agnostic and represents visual primitives that are universal across all vector graphics. It can be learned with procedurally generated (SVG, PVD) pairs and also enables the direct use of LLMs for generalization to complex reasoning tasks. By casting an image to a text-based representation, we can leverage the power of language models to learn alignment from SVG to visual primitives and generalize to unseen question-answering tasks. Empirical results show that VDLM achieves stronger zero-shot performance compared to state-of-the-art LMMs, such as GPT-4V, in various low-level multimodal perception and reasoning tasks on vector graphics. We additionally present extensive analyses on VDLM's performance, demonstrating that our framework offers better interpretability due to its disentangled perception and reasoning processes. Project page: https://mikewangwzhl.github.io/VDLM/

URLs: https://mikewangwzhl.github.io/VDLM/

replace-cross Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

Authors: Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

Abstract: Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at https://github.com/open-compass/Ada-LEval.

URLs: https://github.com/open-compass/Ada-LEval.

replace-cross Public-private funding models in open source software development: A case study on scikit-learn

Authors: Cailean Osborne

Abstract: Governments are increasingly allocating funding for open source software (OSS) development to address concerns related to software security, digital sovereignty, and national competitiveness in science and innovation, amongst others. While announcements of governmental funding are generally well-received by OSS developers, we still have a limited understanding of OSS developers evaluate the relative benefits and drawbacks of such funding compared to other types of funding. This paper explores this question through a case study on scikit-learn, a Python library for machine learning, whose funding model combines research grants, commercial sponsorship, community donations, and a 32 million euro grant from the France's artificial intelligence strategy. Through 25 interviews with scikit-learn's maintainers and funders, this study makes two key contributions to research and practice. First, the study illustrates how the maintainers have weaved public and private funding into their project to ensure the continued provision of scikit-learn as a digital public good, as well as the importance of diversified funding and governance protocols for funding to safeguard the community ethos of the project. Second, it offers practical recommendations to various stakeholders. For OSS developer communities, it illustrates the benefits of a diversified funding model in balancing the merits and drawbacks of different funding sources. For companies, it serves as a reminder that sponsoring developers or OSS projects can significantly support OSS maintainers, who often struggle with limited resources and towering workloads. For governments, it emphasises the importance of funding the maintenance of existing OSS in addition to or exclusively funding the development of new OSS libraries or features. The paper concludes with suggestions for future research directions.