Spectral Temporal Graph Neural Network for massive MIMO CSI Prediction. (arXiv:2312.02159v1 [cs.IT])

Authors: Sharan Mourya, Pavan Reddy, SaiDhiraj Amuru, Kiran Kumar Kuchi

In the realm of 5G communication systems, the accuracy of Channel State Information (CSI) prediction is vital for optimizing performance. This letter introduces a pioneering approach: the Spectral-Temporal Graph Neural Network (STEM GNN), which fuses spatial relationships and temporal dynamics of the wireless channel using the Graph Fourier Transform. We compare the STEM GNN approach with conventional Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models for CSI prediction. Our findings reveal a significant enhancement in overall communication system performance through STEM GNNs. For instance, in one scenario, STEM GNN achieves a sum rate of 5.009 bps/Hz which is $11.9\%$ higher than that of LSTM and $35\%$ higher than that of RNN. The spectral-temporal analysis capabilities of STEM GNNs capture intricate patterns often overlooked by traditional models, offering improvements in beamforming, interference mitigation, and ultra-reliable low-latency communication (URLLC).

Training Chain-of-Thought via Latent-Variable Inference. (arXiv:2312.02179v1 [cs.LG])

Authors: Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous

Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a ``chain-of-thought'' (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the \emph{marginal} log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.

How Generative-AI can be Effectively used in Government Chatbots. (arXiv:2312.02181v1 [cs.CL])

Authors: Zeteng Lin

With the rapid development of artificial intelligence and breakthroughs in machine learning and natural language processing, intelligent question-answering robots have become widely used in government affairs. This paper conducts a horizontal comparison between Guangdong Province's government chatbots, ChatGPT, and Wenxin Ernie, two large language models, to analyze the strengths and weaknesses of existing government chatbots and AIGC technology. The study finds significant differences between government chatbots and large language models. China's government chatbots are still in an exploratory stage and have a gap to close to achieve "intelligence." To explore the future direction of government chatbots more deeply, this research proposes targeted optimization paths to help generative AI be effectively applied in government chatbot conversations.

Virtual Fusion with Contrastive Learning for Single Sensor-based Activity Recognition. (arXiv:2312.02185v1 [cs.LG])

Authors: Duc-Anh Nguyen, Cuong Pham, Nhien-An Le-Khac

Various types of sensors can be used for Human Activity Recognition (HAR), and each of them has different strengths and weaknesses. Sometimes a single sensor cannot fully observe the user's motions from its perspective, which causes wrong predictions. While sensor fusion provides more information for HAR, it comes with many inherent drawbacks like user privacy and acceptance, costly set-up, operation, and maintenance. To deal with this problem, we propose Virtual Fusion - a new method that takes advantage of unlabeled data from multiple time-synchronized sensors during training, but only needs one sensor for inference. Contrastive learning is adopted to exploit the correlation among sensors. Virtual Fusion gives significantly better accuracy than training with the same single sensor, and in some cases, it even surpasses actual fusion using multiple sensors at test time. We also extend this method to a more general version called Actual Fusion within Virtual Fusion (AFVF), which uses a subset of training sensors during inference. Our method achieves state-of-the-art accuracy and F1-score on UCI-HAR and PAMAP2 benchmark datasets. Implementation is available upon request.

Identifying Spurious Correlations using Counterfactual Alignment. (arXiv:2312.02186v1 [cs.CV])

Authors: Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari

Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual alignment method to detect and explore spurious correlations of black box classifiers. Counterfactual images generated with respect to one classifier can be input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists as well as compute aggregate statistics over a dataset. Our work demonstrates the ability to detect spurious correlations in face attribute classifiers. This is validated by observing intuitive trends in a face attribute classifier as well as fabricating spurious correlations and detecting their presence, both visually and quantitatively. Further, utilizing the CF alignment method, we demonstrate that we can rectify spurious correlations identified in classifiers.

Video Summarization: Towards Entity-Aware Captions. (arXiv:2312.02188v1 [cs.CV])

Authors: Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang

Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to existing news image captions dataset. With all the extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.

StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D. (arXiv:2312.02189v1 [cs.CV])

Authors: Pengsheng Guo, Hans Hao, Adam Caccavale, Zhongzheng Ren, Edward Zhang, Qi Shan, Aditya Sankar, Alexander G. Schwing, Alex Colburn, Fangchang Ma

In the realm of text-to-3D generation, utilizing 2D diffusion models through score distillation sampling (SDS) frequently leads to issues such as blurred appearances and multi-faced geometry, primarily due to the intrinsically noisy nature of the SDS loss. Our analysis identifies the core of these challenges as the interaction among noise levels in the 2D diffusion process, the architecture of the diffusion network, and the 3D model representation. To overcome these limitations, we present StableDreamer, a methodology incorporating three advances. First, inspired by InstructNeRF2NeRF, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. This finding provides a novel tool to debug SDS, which we use to show the impact of time-annealing noise levels on reducing multi-faced geometries. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition. Based on this observation, StableDreamer introduces a two-stage training strategy that effectively combines these aspects, resulting in high-fidelity 3D models. Third, we adopt an anisotropic 3D Gaussians representation, replacing Neural Radiance Fields (NeRFs), to enhance the overall quality, reduce memory usage during training, and accelerate rendering speeds, and better capture semi-transparent objects. StableDreamer reduces multi-face geometries, generates fine details, and converges stably.

Prompt Tuning for Zero-shot Compositional Learning. (arXiv:2312.02191v1 [cs.CV])

Authors: Lingyu Zhang, Ting Hua, Yilin Shen, Hongxia Jin

Open World Compositional Zero-Shot Learning (OW-CZSL) is known to be an extremely challenging task, which aims to recognize unseen compositions formed from seen attributes and objects without any prior assumption of the output space. In order to achieve this goal, a model has to be "smart" and "knowledgeable". To be smart, a model should be good at reasoning the interactions between attributes and objects from the seen compositions. While "knowledgeable" means the model owns "common sense" to the open world that can "foresee" some features of the unseen compositions. Most previous work focuses on the "smart" part, while few of them provided an effective solution to achieve the "knowledgeable" goal. In this paper, we proposed a framework named Multi-Modal Prompt Tuning (MMPT) to inherit the "knowledgeable" property from the large pre-trained vision-language model. Extensive experiments show that our proposed MMPT obtains new state-of-the-art results in OW-CZSL task. On the UT-Zappos dataset, MMPT pushes the AUC score to $29.8$, while the previous best score is $26.5$. On the more challenging MIT-States dataset, the AUC score of MMPT is 1.5 times better than the current state-of-the-art.

USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery. (arXiv:2312.02199v1 [cs.CV])

Authors: Jeremy Irvin, Lucas Tao, Joanne Zhou, Yuntao Ma, Langston Nashold, Benjamin Liu, Andrew Y. Ng

Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat .

An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets. (arXiv:2312.02200v1 [cs.CV])

Authors: Maya Srikanth, Jeremy Irvin, Brian Wesley Hill, Felipe Godoy, Ishan Sabane, Andrew Y. Ng

Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often results in errors which can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but developing strategies to effectively implement them in real world datasets has been sparsely explored. Towards improved data-centric methods for cleaning real world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads per-class performance improvements of up to 8% of a retrained classifier in smaller data regimes.

Axiomatic Preference Modeling for Longform Question Answering. (arXiv:2312.02206v1 [cs.AI])

Authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett

The remarkable abilities of large language models (LLMs) like GPT-4 partially stem from post-training processes like Reinforcement Learning from Human Feedback (RLHF) involving human preferences encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preferences annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals to uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on huggingface: https://huggingface.co/corbyrosset/axiomatic_preference_model

Low-Precision Mixed-Computation Models for Inference on Edge. (arXiv:2312.02210v1 [cs.LG])

Authors: Seyedarmin Azizi, Mahdi Nazemi, Mehdi Kamal, Massoud Pedram

This paper presents a mixed-computation neural network processing approach for edge applications that incorporates low-precision (low-width) Posit and low-precision fixed point (FixP) number systems. This mixed-computation approach employs 4-bit Posit (Posit4), which has higher precision around zero, for representing weights with high sensitivity, while it uses 4-bit FixP (FixP4) for representing other weights. A heuristic for analyzing the importance and the quantization error of the weights is presented to assign the proper number system to different weights. Additionally, a gradient approximation for Posit representation is introduced to improve the quality of weight updates in the backpropagation process. Due to the high energy consumption of the fully Posit-based computations, neural network operations are carried out in FixP or Posit/FixP. An efficient hardware implementation of a MAC operation with a first Posit operand and FixP for a second operand and accumulator is presented. The efficacy of the proposed low-precision mixed-computation approach is extensively assessed on vision and language models. The results show that, on average, the accuracy of the mixed-computation is about 1.5% higher than that of FixP with a cost of 0.19% energy overhead.

JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization. (arXiv:2312.02213v1 [cs.LG])

Authors: Shang-Ching Liu, ShengKun Wang, Wenqi Lin, Chung-Wei Hsiung, Yi-Chen Hsieh, Yu-Ping Cheng, Sian-Hong Luo, Tsungyao Chang, Jianwei Zhang

In this study, we introduce JarviX, a sophisticated data analytics framework. JarviX is designed to employ Large Language Models (LLMs) to facilitate an automated guide and execute high-precision data analyzes on tabular datasets. This framework emphasizes the significance of varying column types, capitalizing on state-of-the-art LLMs to generate concise data insight summaries, propose relevant analysis inquiries, visualize data effectively, and provide comprehensive explanations for results drawn from an extensive data analysis pipeline. Moreover, JarviX incorporates an automated machine learning (AutoML) pipeline for predictive modeling. This integration forms a comprehensive and automated optimization cycle, which proves particularly advantageous for optimizing machine configuration. The efficacy and adaptability of JarviX are substantiated through a series of practical use case studies.

Synthetic Data Generation Techniques for Developing AI-based Speech Assessments for Parkinson's Disease (A Comparative Study). (arXiv:2312.02229v1 [cs.SD])

Authors: Mahboobeh Parsapoor

Changes in speech and language are among the first signs of Parkinson's disease (PD). Thus, clinicians have tried to identify individuals with PD from their voices for years. Doctors can leverage AI-based speech assessments to spot PD thanks to advancements in artificial intelligence (AI). Such AI systems can be developed using machine learning classifiers that have been trained using individuals' voices. Although several studies have shown reasonable results in developing such AI systems, these systems would need more data samples to achieve promising performance. This paper explores using deep learning-based data generation techniques on the accuracy of machine learning classifiers that are the core of such systems.

A Simple and Scalable Representation for Graph Generation. (arXiv:2312.02230v1 [cs.LG])

Authors: Yunhui Jang, Seul Lee, Sungsoo Ahn

Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL.

Quality Diversity in the Amorphous Fortress (QD-AF): Evolving for Complexity in 0-Player Games. (arXiv:2312.02231v1 [cs.AI])

Authors: Sam Earle, M Charity, Dipika Rajesh, Mayu Wilson, Julian Togelius

We explore the generation of diverse environments using the Amorphous Fortress (AF) simulation framework. AF defines a set of Finite State Machine (FSM) nodes and edges that can be recombined to control the behavior of agents in the `fortress' grid-world. The behaviors and conditions of the agents within the framework are designed to capture the common building blocks of multi-agent artificial life and reinforcement learning environments. Using quality diversity evolutionary search, we generate diverse sets of environments. These environments exhibit certain types of complexity according to measures of agents' FSM architectures and activations, and collective behaviors. Our approach, Quality Diversity in Amorphous Fortress (QD-AF) generates families of 0-player games akin to simplistic ecological models, and we identify the emergence of both competitive and co-operative multi-agent and multi-species survival dynamics. We argue that these generated worlds can collectively serve as training and testing grounds for learning algorithms.

Rethinking Adversarial Training with Neural Tangent Kernel. (arXiv:2312.02236v1 [cs.LG])

Authors: Guanlin Li, Han Qiu, Shangwei Guo, Jiwei Li, Tianwei Zhang

Adversarial training (AT) is an important and attractive topic in deep learning security, exhibiting mysteries and odd properties. Recent studies of neural network training dynamics based on Neural Tangent Kernel (NTK) make it possible to reacquaint AT and deeply analyze its properties. In this paper, we perform an in-depth investigation of AT process and properties with NTK, such as NTK evolution. We uncover three new findings that are missed in previous works. First, we disclose the impact of data normalization on AT and the importance of unbiased estimators in batch normalization layers. Second, we experimentally explore the kernel dynamics and propose more time-saving AT methods. Third, we study the spectrum feature inside the kernel to address the catastrophic overfitting problem. To the best of our knowledge, it is the first work leveraging the observations of kernel dynamics to improve existing AT methods.

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model. (arXiv:2312.02238v1 [cs.CV])

Authors: Lingmin Ran, Xiaodong Cun, JiaWei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou

We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.

Model-based Deep Learning for Beam Prediction based on a Channel Chart. (arXiv:2312.02239v1 [cs.IT])

Authors: Taha Yassine (IETR, INSA Rennes), Baptiste Chatelier (IETR, MERCE-France, INSA Rennes), Vincent Corlay (MERCE-France), Matthieu Crussière (IETR, INSA Rennes), Stephane Paquelet, Olav Tirkkonen, Luc Le Magoarou (INSA Rennes, IETR)

Channel charting builds a map of the radio environment in an unsupervised way. The obtained chart locations can be seen as low-dimensional compressed versions of channel state information that can be used for a wide variety of applications, including beam prediction. In non-standalone or cell-free systems, chart locations computed at a given base station can be transmitted to several other base stations (possibly operating at different frequency bands) for them to predict which beams to use. This potentially yields a dramatic reduction of the overhead due to channel estimation or beam management, since only the base station performing charting requires channel state information, the others directly predicting the beam from the chart location. In this paper, advanced model-based neural network architectures are proposed for both channel charting and beam prediction. The proposed methods are assessed on realistic synthetic channels, yielding promising results.

Contrastive Learning-Based Spectral Knowledge Distillation for Multi-Modality and Missing Modality Scenarios in Semantic Segmentation. (arXiv:2312.02240v1 [cs.CV])

Authors: Aniruddh Sikdar, Jayant Teotia, Suresh Sundaram

Improving the performance of semantic segmentation models using multispectral information is crucial, especially for environments with low-light and adverse conditions. Multi-modal fusion techniques pursue either the learning of cross-modality features to generate a fused image or engage in knowledge distillation but address multimodal and missing modality scenarios as distinct issues, which is not an optimal approach for multi-sensor models. To address this, a novel multi-modal fusion approach called CSK-Net is proposed, which uses a contrastive learning-based spectral knowledge distillation technique along with an automatic mixed feature exchange mechanism for semantic segmentation in optical (EO) and infrared (IR) images. The distillation scheme extracts detailed textures from the optical images and distills them into the optical branch of CSK-Net. The model encoder consists of shared convolution weights with separate batch norm (BN) layers for both modalities, to capture the multi-spectral information from different modalities of the same objects. A Novel Gated Spectral Unit (GSU) and mixed feature exchange strategy are proposed to increase the correlation of modality-shared information and decrease the modality-specific information during the distillation process. Comprehensive experiments show that CSK-Net surpasses state-of-the-art models in multi-modal tasks and for missing modalities when exclusively utilizing IR data for inference across three public benchmarking datasets. For missing modality scenarios, the performance increase is achieved without additional computational costs compared to the baseline segmentation models.

Conditional Variational Diffusion Models. (arXiv:2312.02246v1 [cs.CV])

Authors: Gabriel della Maggiora, Luis Alberto Croquevielle, Nikita Desphande, Harry Horsley, Thomas Heinis, Artur Yakimovich

Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results.

Federated Active Learning for Target Domain Generalisation. (arXiv:2312.02247v1 [cs.LG])

Authors: Razvan Caramalau, Binod Bhattarai, Danail Stoyanov

In this paper, we introduce Active Learning framework in Federated Learning for Target Domain Generalisation, harnessing the strength from both learning paradigms. Our framework, FEDALV, composed of Active Learning (AL) and Federated Domain Generalisation (FDG), enables generalisation of an image classification model trained from limited source domain client's data without sharing images to an unseen target domain. To this end, our FDG, FEDA, consists of two optimisation updates during training, one at the client and another at the server level. For the client, the introduced losses aim to reduce feature complexity and condition alignment, while in the server, the regularisation limits free energy biases between source and target obtained by the global model. The remaining component of FEDAL is AL with variable budgets, which queries the server to retrieve and sample the most informative local data for the targeted client. We performed multiple experiments on FDG w/ and w/o AL and compared with both conventional FDG baselines and Federated Active Learning baselines. Our extensive quantitative experiments demonstrate the superiority of our method in accuracy and efficiency compared to the multiple contemporary methods. FEDALV manages to obtain the performance of the full training target accuracy while sampling as little as 5% of the source client's data.

Fine-Tuning Language Models for Context-Specific SQL Query Generation. (arXiv:2312.02251v1 [cs.DB])

Authors: Amine Rebei

The ability to generate SQL queries from natural language has significant implications for making data accessible to non-specialists. This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for the task of transforming natural language into SQL queries within the retail domain. We introduce models specialized in generating SQL queries, trained on synthetic datasets tailored to the Snowflake SQL and GoogleSQL dialects. Our methodology involves generating a context-specific dataset using GPT-4, then fine-tuning three open-source LLMs(Starcoder Plus, Code-Llama, and Mistral) employing the LoRa technique to optimize for resource constraints. The fine-tuned models demonstrate superior performance in zero-shot settings compared to the baseline GPT-4, with Code-Llama achieving the highest accuracy rates, at 81.58% for Snowflake SQL and 82.66% for GoogleSQL. These results underscore the effectiveness of fine-tuning LLMs on domain-specific tasks and suggest a promising direction for enhancing the accessibility of relational databases through natural language interfaces.

Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images. (arXiv:2312.02253v1 [cs.CV])

Authors: Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the finetuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x of original ImageNet size showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.

Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction. (arXiv:2312.02254v1 [cs.LG])

Authors: Ishaan Gupta, Samyutha Ayalasomayajula, Yashas Shashidhara, Anish Kataria, Shreyas Shashidhara, Krishita Kataria, Aditya Undurti

The prediction of crop yields internationally is a crucial objective in agricultural research. Thus, this study implements 6 regression models (Linear, Tree, Gradient Descent, Gradient Boosting, K- Nearest Neighbors, and Random Forest) to predict crop yields in 196 countries. Given 4 key training parameters, pesticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), it was found that our Random Forest Regression model achieved a determination coefficient (r^2) of 0.94, with a margin of error (ME) of .03. The models were trained and tested using the Food and Agricultural Organization of the United Nations data, along with the World Bank Climate Change Data Catalog. Furthermore, each parameter was analyzed to understand how varying factors could impact overall yield. We used unconventional models, contrary to generally used Deep Learning (DL) and Machine Learning (ML) models, combined with recently collected data to implement a unique approach in our research. Existing scholarship would benefit from understanding the most optimal model for agricultural research, specifically using the United Nations data.

EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Motion Generation. (arXiv:2312.02256v1 [cs.CV])

Authors: Wenyang Zhou, Zhiyang Dou, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, Yuan Liu, Taku Komura, Wenping Wang, Lingjie Liu

We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Although previous motion diffusion models have shown impressive results, they struggle to achieve fast generation while maintaining high-quality human motions. Motion latent diffusion has been proposed for efficient motion generation. However, effectively learning a latent space can be non-trivial in such a two-stage manner. Meanwhile, accelerating motion sampling by increasing the step size, e.g., DDIM, typically leads to a decline in motion quality due to the inapproximation of complex data distributions when naively increasing the step size. In this paper, we propose EMDM that allows for much fewer sample steps for fast motion generation by modeling the complex denoising distribution during multiple sampling steps. Specifically, we develop a Conditional Denoising Diffusion GAN to capture multimodal data distributions conditioned on both control signals, i.e., textual description and denoising time step. By modeling the complex data distribution, a larger sampling step size and fewer steps are achieved during motion synthesis, significantly accelerating the generation process. To effectively capture the human dynamics and reduce undesired artifacts, we employ motion geometric loss during network training, which improves the motion quality and training efficiency. As a result, EMDM achieves a remarkable speed-up at the generation stage while maintaining high-quality motion generation in terms of fidelity and diversity.

LLMs Accelerate Annotation for Medical Information Extraction. (arXiv:2312.02296v1 [cs.CL])

Authors: Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xiaohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, Jean Steiner, Itay Laish, Amir Feder

The unstructured nature of clinical notes within electronic health records often conceals vital patient-related information, making it challenging to access or interpret. To uncover this hidden information, specialized Natural Language Processing (NLP) models are required. However, training these models necessitates large amounts of labeled data, a process that is both time-consuming and costly when relying solely on human experts for annotation. In this paper, we propose an approach that combines Large Language Models (LLMs) with human expertise to create an efficient method for generating ground truth labels for medical text annotation. By utilizing LLMs in conjunction with human annotators, we significantly reduce the human annotation burden, enabling the rapid creation of labeled datasets. We rigorously evaluate our method on a medical information extraction task, demonstrating that our approach not only substantially cuts down on human intervention but also maintains high accuracy. The results highlight the potential of using LLMs to improve the utilization of unstructured clinical data, allowing for the swift deployment of tailored NLP solutions in healthcare.

AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design. (arXiv:2312.02308v1 [cs.LG])

Authors: Romain Lacombe, Lucas Hendren, Khalid El-Awady

A central challenge of the clean energy transition is the development of catalysts for low-emissions technologies. Recent advances in Machine Learning for quantum chemistry drastically accelerate the computation of catalytic activity descriptors such as adsorption energies. Here we introduce AdsorbRL, a Deep Reinforcement Learning agent aiming to identify potential catalysts given a multi-objective binding energy target, trained using offline learning on the Open Catalyst 2020 and Materials Project data sets. We experiment with Deep Q-Network agents to traverse the space of all ~160,000 possible unary, binary and ternary compounds of 55 chemical elements, with very sparse rewards based on adsorption energy known for only between 2,000 and 3,000 catalysts per adsorbate. To constrain the actions space, we introduce Random Edge Traversal and train a single-objective DQN agent on the known states subgraph, which we find strengthens target binding energy by an average of 4.1 eV. We extend this approach to multi-objective, goal-conditioned learning, and train a DQN agent to identify materials with the highest (respectively lowest) adsorption energies for multiple simultaneous target adsorbates. We experiment with Objective Sub-Sampling, a novel training scheme aimed at encouraging exploration in the multi-objective setup, and demonstrate simultaneous adsorption energy improvement across all target adsorbates, by an average of 0.8 eV. Overall, our results suggest strong potential for Deep Reinforcement Learning applied to the inverse catalysts design problem.

Training Reinforcement Learning Agents and Humans With Difficulty-Conditioned Generators. (arXiv:2312.02309v1 [cs.AI])

Authors: Sidney Tio, Jimmy Ho, Pradeep Varakantham

We adapt Parameterized Environment Response Model (PERM), a method for training both Reinforcement Learning (RL) Agents and human learners in parameterized environments by directly modeling difficulty and ability. Inspired by Item Response Theory (IRT), PERM aligns environment difficulty with individual ability, creating a Zone of Proximal Development-based curriculum. Remarkably, PERM operates without real-time RL updates and allows for offline training, ensuring its adaptability across diverse students. We present a two-stage training process that capitalizes on PERM's adaptability, and demonstrate its effectiveness in training RL agents and humans in an empirical study.

VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding. (arXiv:2312.02310v1 [cs.CV])

Authors: Yizhou Wang, Ruiyi Zhang, Haoliang Wang, Uttaran Bhattacharya, Yun Fu, Gang Wu

Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.

Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games. (arXiv:2312.02312v1 [cs.LG])

Authors: Lukas Schäfer, Logan Jones, Anssi Kanervisto, Yuhan Cao, Tabish Rashid, Raluca Georgescu, Dave Bignell, Siddhartha Sen, Andrea Treviño Gavito, Sam Devlin

Video games have served as useful benchmarks for the decision making community, but going beyond Atari games towards training agents in modern games has been prohibitively expensive for the vast majority of the research community. Recent progress in the research, development and open release of large vision models has the potential to amortize some of these costs across the community. However, it is currently unclear which of these models have learnt representations that retain information critical for sequential decision making. Towards enabling wider participation in the research of gameplaying agents in modern games, we present a systematic study of imitation learning with publicly available visual encoders compared to the typical, task-specific, end-to-end training approach in Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive.

Fine-tuning pre-trained extractive QA models for clinical document parsing. (arXiv:2312.02314v1 [cs.CL])

Authors: Ashwyn Sharma, David I. Feldman, Aneesh Jain

Electronic health records (EHRs) contain a vast amount of high-dimensional multi-modal data that can accurately represent a patient's medical history. Unfortunately, most of this data is either unstructured or semi-structured, rendering it unsuitable for real-time and retrospective analyses. A remote patient monitoring (RPM) program for Heart Failure (HF) patients needs to have access to clinical markers like EF (Ejection Fraction) or LVEF (Left Ventricular Ejection Fraction) in order to ascertain eligibility and appropriateness for the program. This paper explains a system that can parse echocardiogram reports and verify EF values. This system helps identify eligible HF patients who can be enrolled in such a program. At the heart of this system is a pre-trained extractive QA transformer model that is fine-tuned on custom-labeled data. The methods used to prepare such a model for deployment are illustrated by running experiments on a public clinical dataset like MIMIC-IV-Note. The pipeline can be used to generalize solutions to similar problems in a low-resource setting. We found that the system saved over 1500 hours for our clinicians over 12 months by automating the task at scale.

GNN2R: Weakly-Supervised Rationale-Providing Question Answering over Knowledge Graphs. (arXiv:2312.02317v1 [cs.CL])

Authors: Ruijie Wang, Luca Rossetto, Michael Cochez, Abraham Bernstein

Most current methods for multi-hop question answering (QA) over knowledge graphs (KGs) only provide final conclusive answers without explanations, such as a set of KG entities that is difficult for normal users to review and comprehend. This issue severely limits the application of KG-based QA in real-world scenarios. However, it is non-trivial to solve due to two challenges: First, annotations of reasoning chains of multi-hop questions, which could serve as supervision for explanation generation, are usually lacking. Second, it is difficult to maintain high efficiency when explicit KG triples need to be retrieved to generate explanations. In this paper, we propose a novel Graph Neural Network-based Two-Step Reasoning model (GNN2R) to solve this issue. GNN2R can provide both final answers and reasoning subgraphs as a rationale behind final answers efficiently with only weak supervision that is available through question-final answer pairs. We extensively evaluated GNN2R with detailed analyses in experiments. The results demonstrate that, in terms of effectiveness, efficiency, and quality of generated explanations, GNN2R outperforms existing state-of-the-art methods that are applicable to this task. Our code and pre-trained models are available at https://github.com/ruijie-wang-uzh/GNN2R.

An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph. (arXiv:2312.02334v1 [cs.CL])

Authors: Steve Fonin Mbouadeu, Martin Lorenzo, Ken Barker, Oktie Hassanzadeh

Mapping ongoing news headlines to event-related classes in a rich knowledge base can be an important component in a knowledge-based event analysis and forecasting solution. In this paper, we present a methodology for creating a benchmark dataset of news headlines mapped to event classes in Wikidata, and resources for the evaluation of methods that perform the mapping. We use the dataset to study two classes of unsupervised methods for this task: 1) adaptations of classic entity linking methods, and 2) methods that treat the problem as a zero-shot text classification problem. For the first approach, we evaluate off-the-shelf entity linking systems. For the second approach, we explore a) pre-trained natural language inference (NLI) models, and b) pre-trained large generative language models. We present the results of our evaluation, lessons learned, and directions for future work. The dataset and scripts for evaluation are made publicly available.

A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics. (arXiv:2312.02338v1 [cs.CV])

Authors: Xiangru Zhu, Penglei Sun, Chengyu Wang, Jingping Liu, Zhixu Li, Yanghua Xiao, Jun Huang

Text-to-image (T2I) synthesis has recently achieved significant advancements. However, challenges remain in the model's compositionality, which is the ability to create new combinations from known components. We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models. This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories. These contrastive sentence pairs with subtle differences enable fine-grained evaluations of T2I synthesis models. Additionally, to address the inconsistency across different metrics, we propose a strategy that evaluates the reliability of various metrics by using comparative sentence pairs. We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation. Finally, we provide insights into the strengths and weaknesses of these metrics and the capabilities of current T2I models in tackling challenges across a range of complex compositional categories. Our benchmark is publicly available at https://github.com/zhuxiangru/Winoground-T2I .

Expressive Sign Equivariant Networks for Spectral Geometric Learning. (arXiv:2312.02339v1 [cs.LG])

Authors: Derek Lim, Joshua Robinson, Stefanie Jegelka, Haggai Maron

Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation -v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models. Code is available at https://github.com/cptq/Sign-Equivariant-Nets.

Working Backwards: Learning to Place by Picking. (arXiv:2312.02352v1 [cs.RO])

Authors: Oliver Limoyo, Abhisek Konar, Trevor Ablett, Jonathan Kelly, Francois R. Hogan, Gregory Dudek

We present Learning to Place by Picking (LPP), a method capable of autonomously collecting demonstrations for a family of placing tasks in which objects must be manipulated to specific locations. With LPP, we approach the learning of robotic object placement policies by reversing the grasping process and exploiting the inherent symmetry of the pick and place problems. Specifically, we obtain placing demonstrations from a set of grasp sequences of objects that are initially located at their target placement locations. Our system is capable of collecting hundreds of demonstrations without human intervention by using a combination of tactile sensing and compliant control for grasps. We train a policy directly from visual observations through behaviour cloning, using the autonomously-collected demonstrations. By doing so, the policy can generalize to object placement scenarios outside of the training environment without privileged information (e.g., placing a plate picked up from a table and not at the original placement location). We validate our approach on home robotic scenarios that include dishwasher loading and table setting. Our approach yields robotic placing policies that outperform policies trained with kinesthetic teaching, both in terms of performance and data efficiency, while requiring no human supervision.

When is Offline Policy Selection Sample Efficient for Reinforcement Learning?. (arXiv:2312.02355v1 [cs.LG])

Authors: Vincent Liu, Prabhat Nagarajan, Andrew Patterson, Martha White

Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.

Peer attention enhances student learning. (arXiv:2312.02358v1 [cs.HC])

Authors: Songlin Xu, Dongyin Hu, Ru Wang, Xinyu Zhang

Human visual attention is susceptible to social influences. In education, peer effects impact student learning, but their precise role in modulating attention remains unclear. Our experiment (N=311) demonstrates that displaying peer visual attention regions when students watch online course videos enhances their focus and engagement. However, students retain adaptability in following peer attention cues. Overall, guided peer attention improves learning experiences and outcomes. These findings elucidate how peer visual attention shapes students' gaze patterns, deepening understanding of peer influence on learning. They also offer insights into designing adaptive online learning interventions leveraging peer attention modelling to optimize student attentiveness and success.

Class-Discriminative Attention Maps for Vision Transformers. (arXiv:2312.02364v1 [cs.CV])

Authors: Lennart Brocki, Neo Christopher Chung

Interpretability methods are critical components for examining and exploring deep neural networks (DNN), as well as increasing our understanding of and trust in them. Vision transformers (ViT), which can be trained to state-of-the-art performance with a self-supervised learning (SSL) training method, provide built-in attention maps (AM). While AMs can provide high-quality semantic segmentation of input images, they do not account for any signal coming from a downstream classifier. We introduce class-discriminative attention maps (CDAM), a novel post-hoc explanation method that is highly sensitive to the target class. Our method essentially scales attention scores by how relevant the corresponding tokens are for the predictions of a classifier head. Alternative to classifier outputs, CDAM can also explain a user-defined concept by targeting similarity measures in the latent space of the ViT. This allows for explanations of arbitrary concepts, defined by the user through a few sample images. We investigate the operating characteristics of CDAM in comparison with relevance propagation (RP) and token ablation maps (TAM), an alternative to pixel occlusion methods. CDAM is highly class-discriminative and semantically relevant, while providing implicit regularization of relevance scores.

PyTorch implementation: \url{https://github.com/lenbrocki/CDAM}

Web live demo: \url{https://cdam.informatism.com/}

Towards General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks. (arXiv:2312.02366v1 [cs.CV])

Authors: Mohammed Baharoon, Waseem Qureshi, Jiahong Ouyang, Yanwu Xu, Kilian Phol, Abdulrhman Aljouie, Wei Peng

The integration of deep learning systems into the medical domain has been hindered by the resource-intensive process of data annotation and the inability of these systems to generalize to different data distributions. Foundation models, which are models pre-trained on large datasets, have emerged as a solution to reduce reliance on annotated data and enhance model generalizability and robustness. DINOv2, an open-source foundation model pre-trained with self-supervised learning on 142 million curated natural images, excels in extracting general-purpose visual representations, exhibiting promising capabilities across various vision tasks. Nevertheless, a critical question remains unanswered regarding DINOv2's adaptability to radiological imaging, and the clarity on whether its features are sufficiently general to benefit radiology image analysis is yet to be established. Therefore, this study comprehensively evaluates DINOv2 for radiology, conducting over 100 experiments across diverse modalities (X-ray, CT, and MRI). Tasks include disease classification and organ segmentation on both 2D and 3D images, evaluated under different settings like kNN, few-shot learning, linear-probing, end-to-end fine-tuning, and parameter-efficient fine-tuning, to measure the effectiveness and generalizability of the DINOv2 feature embeddings. Comparative analyses with established medical image analysis models, U-Net and TransUnet for segmentation, and CNN and ViT models pre-trained via supervised, weakly supervised, and self-supervised learning for classification, reveal DINOv2's superior performance in segmentation tasks and competitive results in disease classification. The findings contribute insights to potential avenues for optimizing pre-training strategies for medical imaging and enhancing the broader understanding of DINOv2's role in bridging the gap between natural and radiological image analysis.

CityTFT: Temporal Fusion Transformer for Urban Building Energy Modeling. (arXiv:2312.02375v1 [stat.ML])

Authors: Ting-Yu Dai, Dev Niyogi, Zoltan Nagy

Urban Building Energy Modeling (UBEM) is an emerging method to investigate urban design and energy systems against the increasing energy demand at urban and neighborhood levels. However, current UBEM methods are mostly physic-based and time-consuming in multiple climate change scenarios. This work proposes CityTFT, a data-driven UBEM framework, to accurately model the energy demands in urban environments. With the empowerment of the underlying TFT framework and an augmented loss function, CityTFT could predict heating and cooling triggers in unseen climate dynamics with an F1 score of 99.98 \% while RMSE of loads of 13.57 kWh.

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks. (arXiv:2312.02405v1 [cs.AI])

Authors: Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah

The MineRL BASALT competition has served to catalyze advances in learning from human feedback through four hard-to-specify tasks in Minecraft, such as create and photograph a waterfall. Given the completion of two years of BASALT competitions, we offer to the community a formalized benchmark through the BASALT Evaluation and Demonstrations Dataset (BEDD), which serves as a resource for algorithm development and performance assessment. BEDD consists of a collection of 26 million image-action pairs from nearly 14,000 videos of human players completing the BASALT tasks in Minecraft. It also includes over 3,000 dense pairwise human evaluations of human and algorithmic agents. These comparisons serve as a fixed, preliminary leaderboard for evaluating newly-developed algorithms. To enable this comparison, we present a streamlined codebase for benchmarking new algorithms against the leaderboard. In addition to presenting these datasets, we conduct a detailed analysis of the data from both datasets to guide algorithm development and evaluation. The released code and data are available at https://github.com/minerllabs/basalt-benchmark .

Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor. (arXiv:2312.02416v1 [cs.LG])

Authors: Jinqian Chen, Jihua Zhu, Qinghai Zheng

Federated learning encounters a critical challenge of data heterogeneity, adversely affecting the performance and convergence of the federated model. Various approaches have been proposed to address this issue, yet their effectiveness is still limited. Recent studies have revealed that the federated model suffers severe forgetting in local training, leading to global forgetting and performance degradation. Although the analysis provides valuable insights, a comprehensive understanding of the vulnerable classes and their impact factors is yet to be established. In this paper, we aim to bridge this gap by systematically analyzing the forgetting degree of each class during local training across different communication rounds. Our observations are: (1) Both missing and non-dominant classes suffer similar severe forgetting during local training, while dominant classes show improvement in performance. (2) When dynamically reducing the sample size of a dominant class, catastrophic forgetting occurs abruptly when the proportion of its samples is below a certain threshold, indicating that the local model struggles to leverage a few samples of a specific class effectively to prevent forgetting. Motivated by these findings, we propose a novel and straightforward algorithm called Federated Knowledge Anchor (FedKA). Assuming that all clients have a single shared sample for each class, the knowledge anchor is constructed before each local training stage by extracting shared samples for missing classes and randomly selecting one sample per class for non-dominant classes. The knowledge anchor is then utilized to correct the gradient of each mini-batch towards the direction of preserving the knowledge of the missing and non-dominant classes. Extensive experimental results demonstrate that our proposed FedKA achieves fast and stable convergence, significantly improving accuracy on popular benchmarks.

Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data. (arXiv:2312.02418v1 [cs.CL])

Authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos, Newsha Ardalani

Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models. (arXiv:2312.02431v1 [cs.CL])

Authors: Alessandro Suglia, Ioannis Konstas, Oliver Lemon

In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by ``listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of `language games' to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.

MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following. (arXiv:2312.02436v1 [cs.CL])

Authors: Renze Lou, Kai Zhang, Jian Xie, Yuxuan Sun, Janice Ahn, Hanzi Xu, Yu Su, Wenpeng Yin

In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes: i) Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence. ii) Scaling Input-Free Tasks: Enlarging tasks, each composed of an (instruction, output) pair (without requiring a separate input anymore). However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in instruction following when dealing with instances in Scaling-Inputs. This work introduces MUFFIN, a new scheme of instruction-following dataset curation. Specifically, we automatically Scale Tasks per Input by diversifying these tasks with various input facets. Experimental results across four zero-shot benchmarks, spanning both Scaling-Inputs and Scaling Input-Free Tasks schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.

Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation. (arXiv:2312.02439v1 [cs.AI])

Authors: Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, Pan Zhou

Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step, and can motivate their logical reasoning ability. While effective for logical tasks, CoT is not conducive to creative problem-solving which often requires out-of-box thoughts and is crucial for innovation advancements. In this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a non-sequential, creative paradigm involving strong associations and knowledge leaps. To this end, we study LLMs on the popular Oogiri game which needs participants to have good creativity and strong associative thinking for responding unexpectedly and humorously to the given image, text, or both, and thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset which contains over 130,000 samples from the Oogiri game, and observe the insufficient LoT ability or failures of most existing LLMs on the Oogiri game. Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve LLM's LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train pretrained LLM for achieving certain LoT humor generation and discrimination abilities. Then CLoT designs an explorative self-refinement that encourages the LLM to generate more creative LoT data via exploring parallels between seemingly unrelated concepts and selects high-quality data to train itself for self-refinement. CLoT not only excels in humor generation in the Oogiri game but also boosts creative abilities in various tasks like cloud guessing game and divergent association task. These findings advance our understanding and offer a pathway to improve LLMs' creative capacities for innovative applications across domains. The dataset, code, and models will be released online. https://github.com/sail-sg/CLoT.

E4SRec: An Elegant Effective Efficient Extensible Solution of Large Language Models for Sequential Recommendation. (arXiv:2312.02443v1 [cs.IR])

Authors: Xinhang Li, Chong Chen, Xiangyu Zhao, Yong Zhang, Chunxiao Xing

The recent advancements in Large Language Models (LLMs) have sparked interest in harnessing their potential within recommender systems. Since LLMs are designed for natural language tasks, existing recommendation approaches have predominantly transformed recommendation tasks into open-domain natural language generation tasks. However, this approach necessitates items to possess rich semantic information, often generates out-of-range results, and suffers from notably low efficiency and limited extensibility. Furthermore, practical ID-based recommendation strategies, reliant on a huge number of unique identities (IDs) to represent users and items, have gained prominence in real-world recommender systems due to their effectiveness and efficiency. Nevertheless, the incapacity of LLMs to model IDs presents a formidable challenge when seeking to leverage LLMs for personalized recommendations. In this paper, we introduce an Elegant Effective Efficient Extensible solution for large language models for Sequential Recommendation (E4SRec), which seamlessly integrates LLMs with traditional recommender systems that exclusively utilize IDs to represent items. Specifically, E4SRec takes ID sequences as inputs, ensuring that the generated outputs fall within the candidate lists. Furthermore, E4SRec possesses the capability to generate the entire ranking list in a single forward process, and demands only a minimal set of pluggable parameters, which are trained for each dataset while keeping the entire LLM frozen. We substantiate the effectiveness, efficiency, and extensibility of our proposed E4SRec through comprehensive experiments conducted on four widely-used real-world datasets. The implementation code is accessible at https://github.com/HestiaSky/E4SRec/.

Learning to Holistically Detect Bridges from Large-Size VHR Remote Sensing Imagery. (arXiv:2312.02481v1 [cs.CV])

Authors: Yansheng Li, Junwei Luo, Yongjun Zhang, Yihua Tan, Jin-Gang Yu, Song Bai

Bridge detection in remote sensing images (RSIs) plays a crucial role in various applications, but it poses unique challenges compared to the detection of other objects. In RSIs, bridges exhibit considerable variations in terms of their spatial scales and aspect ratios. Therefore, to ensure the visibility and integrity of bridges, it is essential to perform holistic bridge detection in large-size very-high-resolution (VHR) RSIs. However, the lack of datasets with large-size VHR RSIs limits the deep learning algorithms' performance on bridge detection. Due to the limitation of GPU memory in tackling large-size images, deep learning-based object detection methods commonly adopt the cropping strategy, which inevitably results in label fragmentation and discontinuous prediction. To ameliorate the scarcity of datasets, this paper proposes a large-scale dataset named GLH-Bridge comprising 6,000 VHR RSIs sampled from diverse geographic locations across the globe. These images encompass a wide range of sizes, varying from 2,048*2,048 to 16,38*16,384 pixels, and collectively feature 59,737 bridges. Furthermore, we present an efficient network for holistic bridge detection (HBD-Net) in large-size RSIs. The HBD-Net presents a separate detector-based feature fusion (SDFF) architecture and is optimized via a shape-sensitive sample re-weighting (SSRW) strategy. Based on the proposed GLH-Bridge dataset, we establish a bridge detection benchmark including the OBB and HBB tasks, and validate the effectiveness of the proposed HBD-Net. Additionally, cross-dataset generalization experiments on two publicly available datasets illustrate the strong generalization capability of the GLH-Bridge dataset.

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks. (arXiv:2312.02493v1 [cs.DC])

Authors: Sahil Tyagi, Martin Swany

Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and its corresponding indices, typically via Allgather (AG). Training with high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling due to high communication cost (i.e., parallel efficiency). Using lower CRs improves parallel efficiency by lowering synchronization cost, but degrades model accuracy as well (statistical efficiency). Further, speedup attained with different models and CRs also varies with network latency, effective bandwidth and collective op used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the pareto-relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.

MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks. (arXiv:2312.02496v1 [cs.CL])

Authors: Ke Liang, Sifan Wu, Jiayi Gu

Using natural language processing (NLP) technologies to develop medical chatbots makes the diagnosis of the patient more convenient and efficient, which is a typical application in healthcare AI. Because of its importance, lots of research have been come out. Recently, the neural generative models have shown their impressive ability as the core of chatbot, while it cannot scale well when directly applied to medical conversation due to the lack of medical-specific knowledge. To address the limitation, a scalable Medical Knowledge Assisted mechanism, MKA, is proposed in this paper. The mechanism aims to assist general neural generative models to achieve better performance on the medical conversation task. The medical-specific knowledge graph is designed within the mechanism, which contains 6 types of medical-related information, including department, drug, check, symptom, disease, food. Besides, the specific token concatenation policy is defined to effectively inject medical information into the input data. Evaluation of our method is carried out on two typical medical datasets, MedDG and MedDialog-CN. The evaluation results demonstrate that models combined with our mechanism outperform original methods in multiple automatic evaluation metrics. Besides, MKA-Bert-GPT achieves state-of-the-art performance. The open-sourced codes are public: https://github.com/LIANGKE23/Knowledge_Assisted_Medical_Dialogue_Generation_Mechanism

Inspecting Model Fairness in Ultrasound Segmentation Tasks. (arXiv:2312.02501v1 [cs.CV])

Authors: Zikang Xu, Fenghe Tang, Quan Quan, Jianrui Ding, Chunping Ning, S. Kevin Zhou

With the rapid expansion of machine learning and deep learning (DL), researchers are increasingly employing learning-based algorithms to alleviate diagnostic challenges across diverse medical tasks and applications. While advancements in diagnostic precision are notable, some researchers have identified a concerning trend: their models exhibit biased performance across subgroups characterized by different sensitive attributes. This bias not only infringes upon the rights of patients but also has the potential to lead to life-altering consequences. In this paper, we inspect a series of DL segmentation models using two ultrasound datasets, aiming to assess the presence of model unfairness in these specific tasks. Our findings reveal that even state-of-the-art DL algorithms demonstrate unfair behavior in ultrasound segmentation tasks. These results serve as a crucial warning, underscoring the necessity for careful model evaluation before their deployment in real-world scenarios. Such assessments are imperative to ensure ethical considerations and mitigate the risk of adverse impacts on patient outcomes.

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. (arXiv:2312.02512v1 [cs.CV])

Authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. The demo page is available on https://choijeongsoo.github.io/av2av.

ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU. (arXiv:2312.02515v1 [cs.LG])

Authors: Zhengmao Ye, Dengchun Li, Jingqi Tian, Tingfeng Lan, Jie Zuo, Lei Duan, Hui Lu, Yexi Jiang, Jian Sha, Ke Zhang, Mingjie Tang

Transformer-based large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly when fine-turned for specific domains. Recent studies suggest that the resources required for fine-tuning LLMs can be economized through parameter-efficient methods such as Low-Rank Adaptation (LoRA). While LoRA effectively reduces computational burdens and resource demands, it currently supports only a single-job fine-tuning setup.

In this paper, we present ASPEN, a high-throughput framework for fine-tuning LLMs. ASPEN efficiently trains multiple jobs on a single GPU using the LoRA method, leveraging shared pre-trained model and adaptive scheduling. ASPEN is compatible with transformer-based language models like LLaMA and ChatGLM, etc. Experiments show that ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on NVIDIA A100 80GB GPU and boosts training throughput by about 17% compared to existing methods when training with various pre-trained models on different GPUs. The adaptive scheduling algorithm reduces turnaround time by 24%, end-to-end training latency by 12%, prioritizing jobs and preventing out-of-memory issues.

Simplifying Neural Network Training Under Class Imbalance. (arXiv:2312.02517v1 [cs.LG])

Authors: Ravid Shwartz-Ziv, Micah Goldblum, Yucen Lily Li, C. Bayan Bruss, Andrew Gordon Wilson

Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models. The majority of research on training neural networks under class imbalance has focused on specialized loss functions, sampling techniques, or two-stage training procedures. Notably, we demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, optimizer, and label smoothing, can achieve state-of-the-art performance without any such specialized class imbalance methods. We also provide key prescriptions and considerations for training under class imbalance, and an understanding of why imbalance methods succeed or fail.

Creative Agents: Empowering Agents with Imagination for Creative Tasks. (arXiv:2312.02519v1 [cs.AI])

Authors: Chi Zhang, Penglin Cai, Yuhui Fu, Haoqi Yuan, Zongqing Lu

We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity -- the ability to give novel and diverse task solutions implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete task goals in the environment and perform long-horizon planning for such complicated goals. Given the observation that humans perform creative tasks with the help of imagination, we propose a class of solutions for creative agents, where the controller is enhanced with an imaginator that generates detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. We implement the imaginator with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can either be a behavior-cloning policy learned from data or a pre-trained foundation model generating executable codes in the environment. We benchmark creative tasks with the challenging open-world game Minecraft, where the agents are asked to create diverse buildings given free-form language instructions. In addition, we propose novel evaluation metrics for open-ended creative tasks utilizing GPT-4V, which holds many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that creative agents are the first AI agents accomplishing diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).

Retrieving Conditions from Reference Images for Diffusion Models. (arXiv:2312.02521v1 [cs.CV])

Authors: Haoran Tang, Xin Zhou, Jieren Deng, Zhihong Pan, Hao Tian, Pratik Chaudhari

Recent diffusion-based subject driven generative methods have enabled image generations with good fidelity for specific objects or human portraits. However, to achieve better versatility for applications, we argue that not only improved datasets and evaluations are desired, but also more careful methods to retrieve only relevant information from conditional images are anticipated. To this end, we propose an anime figures dataset RetriBooru-V1, with enhanced identity and clothing labels. We state new tasks enabled by this dataset, and introduce a new diversity metric to measure success in completing these tasks, quantifying the flexibility of image generations. We establish an RAG-inspired baseline method, designed to retrieve precise conditional information from reference images. Then, we compare with current methods on existing task to demonstrate the capability of the proposed method. Finally, we provide baseline experiment results on new tasks, and conduct ablation studies on the possible structural choices.

MASP: Scalable GNN-based Planning for Multi-Agent Navigation. (arXiv:2312.02522v1 [cs.LG])

Authors: Xinyi Yang, Xinting Yang, Chao Yu, Jiayu Chen, Huazhong Yang, Yu Wang

We investigate the problem of decentralized multi-agent navigation tasks, where multiple agents need to reach initially unassigned targets in a limited time. Classical planning-based methods suffer from expensive computation overhead at each step and offer limited expressiveness for complex cooperation strategies. In contrast, reinforcement learning (RL) has recently become a popular paradigm for addressing this issue. However, RL struggles with low data efficiency and cooperation when directly exploring (nearly) optimal policies in the large search space, especially with an increased agent number (e.g., 10+ agents) or in complex environments (e.g., 3D simulators). In this paper, we propose Multi-Agent Scalable GNN-based P lanner (MASP), a goal-conditioned hierarchical planner for navigation tasks with a substantial number of agents. MASP adopts a hierarchical framework to divide a large search space into multiple smaller spaces, thereby reducing the space complexity and accelerating training convergence. We also leverage graph neural networks (GNN) to model the interaction between agents and goals, improving goal achievement. Besides, to enhance generalization capabilities in scenarios with unseen team sizes, we divide agents into multiple groups, each with a previously trained number of agents. The results demonstrate that MASP outperforms classical planning-based competitors and RL baselines, achieving a nearly 100% success rate with minimal training data in both multi-agent particle environments (MPE) with 50 agents and a quadrotor 3-dimensional environment (OmniDrones) with 20 agents. Furthermore, the learned policy showcases zero-shot generalization across unseen team sizes.

MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection. (arXiv:2312.02530v1 [cs.LG])

Authors: Junho Song, Keonwoo Kim, Jeonglyul Oh, Sungzoon Cho

Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations. Recently, reconstruction-based deep models have been widely used to solve the problem. However, these methods still suffer from an over-generalization issue and fail to deliver consistently high performance. To address this issue, we propose the MEMTO, a memory-guided Transformer using a reconstruction-based approach. It is designed to incorporate a novel memory module that can learn the degree to which each memory item should be updated in response to the input data. To stabilize the training procedure, we use a two-phase training paradigm which involves using K-means clustering for initializing memory items. Additionally, we introduce a bi-dimensional deviation-based detection criterion that calculates anomaly scores considering both input space and latent space. We evaluate our proposed method on five real-world datasets from diverse domains, and it achieves an average anomaly detection F1-score of 95.74%, significantly outperforming the previous state-of-the-art methods. We also conduct extensive experiments to empirically validate the effectiveness of our proposed model's key components.

PolyFit: A Peg-in-hole Assembly Framework for Unseen Polygon Shapes via Sim-to-real Adaptation. (arXiv:2312.02531v1 [cs.RO])

Authors: Geonhyup Lee, Joosoon Lee, Sangjun Noh, Minhwan Ko, Kangmin Kim, Kyoobin Lee

The study addresses the foundational and challenging task of peg-in-hole assembly in robotics, where misalignments caused by sensor inaccuracies and mechanical errors often result in insertion failures or jamming. This research introduces PolyFit, representing a paradigm shift by transitioning from a reinforcement learning approach to a supervised learning methodology. PolyFit is a Force/Torque (F/T)-based supervised learning framework designed for 5-DoF peg-in-hole assembly. It utilizes F/T data for accurate extrinsic pose estimation and adjusts the peg pose to rectify misalignments. Extensive training in a simulated environment involves a dataset encompassing a diverse range of peg-hole shapes, extrinsic poses, and their corresponding contact F/T readings. To enhance extrinsic pose estimation, a multi-point contact strategy is integrated into the model input, recognizing that identical F/T readings can indicate different poses. The study proposes a sim-to-real adaptation method for real-world application, using a sim-real paired dataset to enable effective generalization to complex and unseen polygon shapes. PolyFit achieves impressive peg-in-hole success rates of 97.3% and 96.3% for seen and unseen shapes in simulations, respectively. Real-world evaluations further demonstrate substantial success rates of 86.7% and 85.0%, highlighting the robustness and adaptability of the proposed method.

Graph Information Bottleneck for Remote Sensing Segmentation. (arXiv:2312.02545v1 [cs.CV])

Authors: Yuntao Shou, Wei Ai, Tao Meng

Remote sensing segmentation has a wide range of applications in environmental protection, and urban change detection, etc. Despite the success of deep learning-based remote sensing segmentation methods (e.g., CNN and Transformer), they are not flexible enough to model irregular objects. In addition, existing graph contrastive learning methods usually adopt the way of maximizing mutual information to keep the node representations consistent between different graph views, which may cause the model to learn task-independent redundant information. To tackle the above problems, this paper treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation. Specifically, we construct a node-masked and edge-masked graph view to obtain an optimal graph structure representation, which can adaptively learn whether to mask nodes and edges. Furthermore, this paper innovatively introduces information bottleneck theory into graph contrastive learning to maximize task-related information while minimizing task-independent redundant information. Finally, we replace the convolutional module in UNet with the SC-ViG module to complete the segmentation and classification tasks of remote sensing images. Extensive experiments on publicly available real datasets demonstrate that our method outperforms state-of-the-art remote sensing image segmentation methods.

DanZero+: Dominating the GuanDan Game through Reinforcement Learning. (arXiv:2312.02561v1 [cs.AI])

Authors: Youpeng Zhao, Yudong Lu, Jian Zhao, Wengang Zhou, Houqiang Li

The utilization of artificial intelligence (AI) in card games has been a well-explored subject within AI research for an extensive period. Recent advancements have propelled AI programs to showcase expertise in intricate card games such as Mahjong, DouDizhu, and Texas Hold'em. In this work, we aim to develop an AI program for an exceptionally complex and popular card game called GuanDan. This game involves four players engaging in both competitive and cooperative play throughout a long process to upgrade their level, posing great challenges for AI due to its expansive state and action space, long episode length, and complex rules. Employing reinforcement learning techniques, specifically Deep Monte Carlo (DMC), and a distributed training framework, we first put forward an AI program named DanZero for this game. Evaluation against baseline AI programs based on heuristic rules highlights the outstanding performance of our bot. Besides, in order to further enhance the AI's capabilities, we apply policy-based reinforcement learning algorithm to GuanDan. To address the challenges arising from the huge action space, which will significantly impact the performance of policy-based algorithms, we adopt the pre-trained model to facilitate the training process and the achieved AI program manages to achieve a superior performance.

Structured World Representations in Maze-Solving Transformers. (arXiv:2312.02566v1 [cs.LG])

Authors: Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung

Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that of solving mazes. In this work, we focus on the abstractions formed by these models and find evidence for the consistent emergence of structured internal representations of maze topology and valid paths. We demonstrate this by showing that the residual stream of only a single token can be linearly decoded to faithfully reconstruct the entire maze. We also find that the learned embeddings of individual tokens have spatial structure. Furthermore, we take steps towards deciphering the circuity of path-following by identifying attention heads (dubbed $\textit{adjacency heads}$), which are implicated in finding valid subsequent tokens.

UTBoost: A Tree-boosting based System for Uplift Modeling. (arXiv:2312.02573v1 [cs.LG])

Authors: Junjie Gao, Xiangyu Zheng, DongDong Wang, Zhixiang Huang, Bangqi Zheng, Kai Yang

Uplift modeling refers to the set of machine learning techniques that a manager may use to estimate customer uplift, that is, the net effect of an action on some customer outcome. By identifying the subset of customers for whom a treatment will have the greatest effect, uplift models assist decision-makers in optimizing resource allocations and maximizing overall returns. Accurately estimating customer uplift poses practical challenges, as it requires assessing the difference between two mutually exclusive outcomes for each individual. In this paper, we propose two innovative adaptations of the well-established Gradient Boosting Decision Trees (GBDT) algorithm, which learn the causal effect in a sequential way and overcome the counter-factual nature. Both approaches innovate existing techniques in terms of ensemble learning method and learning objectives, respectively. Experiments on large-scale datasets demonstrate the usefulness of the proposed methods, which often yielding remarkable improvements over base models. To facilitate the application, we develop the UTBoost, an end-to-end tree boosting system specifically designed for uplift modeling. The package is open source and has been optimized for training speed to meet the needs of real industrial applications.

Impact of Tokenization on LLaMa Russian Adaptation. (arXiv:2312.02598v1 [cs.CL])

Authors: Mikhail Tikhomirov, Daniil Chernyshev

Latest instruction-tuned large language models (LLM) show great results on various tasks, however, they often face performance degradation for non-English input. There is evidence that the reason lies in inefficient tokenization caused by low language representation in pre-training data which hinders the comprehension of non-English instructions, limiting the potential of target language instruction-tuning. In this work we investigate the possibility of addressing the issue with vocabulary substitution in the context of LLaMa Russian language adaptation. We explore three variants of vocabulary adaptation and test their performance on Saiga instruction-tuning and fine-tuning on Russian Super Glue benchmark. The results of automatic evaluation show that vocabulary substitution not only improves the model's quality in Russian but also accelerates fine-tuning (35%) and inference (up to 60%) while reducing memory consumption. Additional human evaluation of the instruction-tuned models demonstrates that models with Russian-adapted vocabulary generate answers with higher user preference than the original Saiga-LLaMa model.

Panoptica -- instance-wise evaluation of 3D semantic and instance segmentation maps. (arXiv:2312.02608v1 [cs.CV])

Authors: Florian Kofler, Hendrik Möller, Josef A. Buchner, Ezequiel de la Rosa, Ivan Ezhov, Marcel Rosier, Isra Mekki, Suprosanna Shit, Moritz Negwer, Rami Al-Maskari, Ali Ertürk, Shankeeth Vinayahalingam, Fabian Isensee, Sarthak Pati, Daniel Rueckert, Jan S. Kirschke, Stefan K. Ehrlich, Annika Reinke, Bjoern Menze, Benedikt Wiestler, Marie Piraud

This paper introduces panoptica, a versatile and performance-optimized package designed for computing instance-wise segmentation quality metrics from 2D and 3D segmentation maps. panoptica addresses the limitations of existing metrics and provides a modular framework that complements the original intersection over union-based panoptic quality with other metrics, such as the distance metric Average Symmetric Surface Distance. The package is open-source, implemented in Python, and accompanied by comprehensive documentation and tutorials. panoptica employs a three-step metrics computation process to cover diverse use cases. The efficacy of panoptica is demonstrated on various real-world biomedical datasets, where an instance-wise evaluation is instrumental for an accurate representation of the underlying clinical task. Overall, we envision panoptica as a valuable tool facilitating in-depth evaluation of segmentation methods.

On the Initialization of Graph Neural Networks. (arXiv:2312.02622v1 [cs.LG])

Authors: Jiahang Li, Yakun Song, Xiang Song, David Paul Wipf

Graph Neural Networks (GNNs) have displayed considerable promise in graph representation learning across various applications. The core learning process requires the initialization of model weight matrices within each GNN layer, which is typically accomplished via classic initialization methods such as Xavier initialization. However, these methods were originally motivated to stabilize the variance of hidden embeddings and gradients across layers of Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) to avoid vanishing gradients and maintain steady information flow. In contrast, within the GNN context classical initializations disregard the impact of the input graph structure and message passing on variance. In this paper, we analyze the variance of forward and backward propagation across GNN layers and show that the variance instability of GNN initializations comes from the combined effect of the activation function, hidden dimension, graph structure and message passing. To better account for these influence factors, we propose a new initialization method for Variance Instability Reduction within GNN Optimization (Virgo), which naturally tends to equate forward and backward variances across successive layers. We conduct comprehensive experiments on 15 datasets to show that Virgo can lead to superior model performance and more stable variance at initialization on node classification, link prediction and graph classification tasks. Codes are in https://github.com/LspongebobJH/virgo_icml2023.

SAMSGL: Series-Aligned Multi-Scale Graph Learning for Spatio-Temporal Forecasting. (arXiv:2312.02646v1 [cs.LG])

Authors: Xiaobei Zou, Luolin Xiong, Yang Tang, Jurgen Kurths

Spatio-temporal forecasting in various domains, like traffic prediction and weather forecasting, is a challenging endeavor, primarily due to the difficulties in modeling propagation dynamics and capturing high-dimensional interactions among nodes. Despite the significant strides made by graph-based networks in spatio-temporal forecasting, there remain two pivotal factors closely related to forecasting performance that need further consideration: time delays in propagation dynamics and multi-scale high-dimensional interactions. In this work, we present a Series-Aligned Multi-Scale Graph Learning (SAMSGL) framework, aiming to enhance forecasting performance. In order to handle time delays in spatial interactions, we propose a series-aligned graph convolution layer to facilitate the aggregation of non-delayed graph signals, thereby mitigating the influence of time delays for the improvement in accuracy. To understand global and local spatio-temporal interactions, we develop a spatio-temporal architecture via multi-scale graph learning, which encompasses two essential components: multi-scale graph structure learning and graph-fully connected (Graph-FC) blocks. The multi-scale graph structure learning includes a global graph structure to learn both delayed and non-delayed node embeddings, as well as a local one to learn node variations influenced by neighboring factors. The Graph-FC blocks synergistically fuse spatial and temporal information to boost prediction accuracy. To evaluate the performance of SAMSGL, we conduct experiments on meteorological and traffic forecasting datasets, which demonstrate its effectiveness and superiority.

Supervised learning of spatial features with STDP and homeostasis using Spiking Neural Networks on SpiNNaker. (arXiv:2312.02659v1 [cs.NE])

Authors: Sergio Davies, Andrew Gait, Andrew Rowley, Alessandro Di Nuovo

Artificial Neural Networks (ANN) have gained large popularity thanks to their ability to learn using the well-known backpropagation algorithm. On the other hand, Spiking Neural Networks (SNNs), despite having wider abilities than ANNs, have always presented a challenge in the training phase. This paper shows a new method to perform supervised learning on SNNs, using Spike Timing Dependent Plasticity (STDP) and homeostasis, aiming at training the network to identify spatial patterns. The method is tested using the SpiNNaker digital architecture. A SNN is trained to recognise one or multiple patterns and performance metrics are extracted to measure the performance of the network. Some considerations are drawn from the results showing that, in the case of a single trained pattern, the network behaves as the ideal detector, with 100% accuracy in detecting the trained pattern. However, as the number of trained patterns on a single network increases, the accuracy of the identification is linked to the similarities between these patterns. This method of training an SNN to detect spatial patterns may be applied on pattern recognition in static images or traffic analysis in computer networks, where each network packet represents a spatial pattern. It will be stipulated that the homeostatic factor may enable the network to detect patterns with some degree of similarities, rather than only perfectly matching patterns.

FaceStudio: Put Your Face Everywhere in Seconds. (arXiv:2312.02663v1 [cs.CV])

Authors: Yuxuan Yan, Chi Zhang, Rui Wang, Pei Cheng, Gang Yu, Bin Fu

This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning, thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to produce a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject's identity with high fidelity.

Lights out: training RL agents robust to temporary blindness. (arXiv:2312.02665v1 [cs.AI])

Authors: N. Ordonez, M. Tromp, P. M. Julbe, W. Böhmer

Agents trained with DQN rely on an observation at each timestep to decide what action to take next. However, in real world applications observations can change or be missing entirely. Examples of this could be a light bulb breaking down, or the wallpaper in a certain room changing. While these situations change the actual observation, the underlying optimal policy does not change. Because of this we want our agent to continue taking actions until it receives a (recognized) observation again. To achieve this we introduce a combination of a neural network architecture that uses hidden representations of the observations and a novel n-step loss function. Our implementation is able to withstand location based blindness stretches longer than the ones it was trained on, and therefore shows robustness to temporary blindness. For access to our implementation, please email Nathan, Marije, or Pau.

Amortized Bayesian Decision Making for simulation-based models. (arXiv:2312.02674v1 [cs.LG])

Authors: Mila Gorecki, Jakob H. Macke, Michael Deistler

Simulation-based inference (SBI) provides a powerful framework for inferring posterior distributions of stochastic simulators in a wide range of domains. In many settings, however, the posterior distribution is not the end goal itself -- rather, the derived parameter values and their uncertainties are used as a basis for deciding what actions to take. Unfortunately, because posterior distributions provided by SBI are (potentially crude) approximations of the true posterior, the resulting decisions can be suboptimal. Here, we address the question of how to perform Bayesian decision making on stochastic simulators, and how one can circumvent the need to compute an explicit approximation to the posterior. Our method trains a neural network on simulated data and can predict the expected cost given any data and action, and can, thus, be directly used to infer the action with lowest cost. We apply our method to several benchmark problems and demonstrate that it induces similar cost as the true posterior distribution. We then apply the method to infer optimal actions in a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, and demonstrate that it allows to infer actions associated with low cost after few simulations.

Contact Energy Based Hindsight Experience Prioritization. (arXiv:2312.02677v1 [cs.RO])

Authors: Erdi Sayar, Zhenshan Bing, Carlo D'Eramo, Ozgur S. Oguz, Alois Knoll

Multi-goal robot manipulation tasks with sparse rewards are difficult for reinforcement learning (RL) algorithms due to the inefficiency in collecting successful experiences. Recent algorithms such as Hindsight Experience Replay (HER) expedite learning by taking advantage of failed trajectories and replacing the desired goal with one of the achieved states so that any failed trajectory can be utilized as a contribution to learning. However, HER uniformly chooses failed trajectories, without taking into account which ones might be the most valuable for learning. In this paper, we address this problem and propose a novel approach Contact Energy Based Prioritization~(CEBP) to select the samples from the replay buffer based on rich information due to contact, leveraging the touch sensors in the gripper of the robot and object displacement. Our prioritization scheme favors sampling of contact-rich experiences, which are arguably the ones providing the largest amount of information. We evaluate our proposed approach on various sparse reward robotic tasks and compare them with the state-of-the-art methods. We show that our method surpasses or performs on par with those methods on robot manipulation tasks. Finally, we deploy the trained policy from our method to a real Franka robot for a pick-and-place task. We observe that the robot can solve the task successfully. The videos and code are publicly available at: https://erdiphd.github.io/HER_force

H-GAP: Humanoid Control with a Generalist Planner. (arXiv:2312.02682v1 [cs.LG])

Authors: Zhengyao Jiang, Yingchen Xu, Nolan Wagener, Yicheng Luo, Michael Janner, Edward Grefenstette, Tim Rocktäschel, Yuandong Tian

Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, paves the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For 56 degrees of freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours. Further, without any learning from online interactions, it can also flexibly transfer these behaviors to solve novel downstream control tasks via planning. Notably, H-GAP excels established MPC baselines that have access to the ground truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we do a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not computing. Code and videos are available at https://ycxuyingchen.github.io/hgap/.

Analyzing and Improving the Training Dynamics of Diffusion Models. (arXiv:2312.02696v1 [cs.CV])

Authors: Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, Samuli Laine

Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling.

As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.

Enhancing Vehicle Entrance and Parking Management: Deep Learning Solutions for Efficiency and Security. (arXiv:2312.02699v1 [cs.CV])

Authors: Muhammad Umer Ramzan, Usman Ali, Syed Haider Abbas Naqvi, Zeeshan Aslam, Tehseen, Husnain Ali, Muhammad Faheem

The auto-management of vehicle entrance and parking in any organization is a complex challenge encompassing record-keeping, efficiency, and security concerns. Manual methods for tracking vehicles and finding parking spaces are slow and a waste of time. To solve the problem of auto management of vehicle entrance and parking, we have utilized state-of-the-art deep learning models and automated the process of vehicle entrance and parking into any organization. To ensure security, our system integrated vehicle detection, license number plate verification, and face detection and recognition models to ensure that the person and vehicle are registered with the organization. We have trained multiple deep-learning models for vehicle detection, license number plate detection, face detection, and recognition, however, the YOLOv8n model outperformed all the other models. Furthermore, License plate recognition is facilitated by Google's Tesseract-OCR Engine. By integrating these technologies, the system offers efficient vehicle detection, precise identification, streamlined record keeping, and optimized parking slot allocation in buildings, thereby enhancing convenience, accuracy, and security. Future research opportunities lie in fine-tuning system performance for a wide range of real-world applications.

Unified learning-based lossy and lossless JPEG recompression. (arXiv:2312.02705v1 [cs.CV])

Authors: Jianghui Zhang, Yuanyuan Wang, Lina Guo, Jixiang Luo, Tongda Xu, Yan Wang, Zhi Wang, Hongwei Qin

JPEG is still the most widely used image compression algorithm. Most image compression algorithms only consider uncompressed original image, while ignoring a large number of already existing JPEG images. Recently, JPEG recompression approaches have been proposed to further reduce the size of JPEG files. However, those methods only consider JPEG lossless recompression, which is just a special case of the rate-distortion theorem. In this paper, we propose a unified lossly and lossless JPEG recompression framework, which consists of learned quantization table and Markovian hierarchical variational autoencoders. Experiments show that our method can achieve arbitrarily low distortion when the bitrate is close to the upper bound, namely the bitrate of the lossless compression model. To the best of our knowledge, this is the first learned method that bridges the gap between lossy and lossless recompression of JPEG images.

Large Knowledge Model: Perspectives and Challenges. (arXiv:2312.02706v1 [cs.AI])

Authors: Huajun Chen

Humankind's understanding of the world is fundamentally linked to our perception and cognition, with \emph{human languages} serving as one of the major carriers of \emph{world knowledge}. In this vein, \emph{Large Language Models} (LLMs) like ChatGPT epitomize the pre-training of extensive, sequence-based world knowledge into neural networks, facilitating the processing and manipulation of this knowledge in a parametric space. This article explores large models through the lens of ``knowledge''. We initially investigate the role of symbolic knowledge such as Knowledge Graphs (KGs) in enhancing LLMs, covering aspects like knowledge-augmented language model, structure-inducing pre-training, knowledgeable prompts, structured CoT, knowledge editing, semantic tools for LLM and knowledgeable AI agents. Subsequently, we examine how LLMs can amplify traditional symbolic knowledge bases, encompassing aspects like using LLM as KG builder and controller, structured knowledge pretraining, LLM-enhanced symbolic reasoning, and the amalgamation of perception with cognition. Considering the intricate nature of human knowledge, we advocate for the creation of \emph{Large Knowledge Models} (LKM), specifically engineered to manage diversified spectrum of knowledge structures. This ambitious undertaking could entail several key challenges, such as disentangling knowledge representation from language models, restructuring pre-training with structured knowledge, and building large commonsense models, among others. We finally propose a five-``A'' principle to distinguish the concept of LKM.

Towards the Inferrence of Structural Similarity of Combinatorial Landscapes. (arXiv:2312.02720v1 [cs.LG])

Authors: Mingyu Huang, Ke Li

One of the most common problem-solving heuristics is by analogy. For a given problem, a solver can be viewed as a strategic walk on its fitness landscape. Thus if a solver works for one problem instance, we expect it will also be effective for other instances whose fitness landscapes essentially share structural similarities with each other. However, due to the black-box nature of combinatorial optimization, it is far from trivial to infer such similarity in real-world scenarios. To bridge this gap, by using local optima network as a proxy of fitness landscapes, this paper proposed to leverage graph data mining techniques to conduct qualitative and quantitative analyses to explore the latent topological structural information embedded in those landscapes. By conducting large-scale empirical experiments on three classic combinatorial optimization problems, we gain concrete evidence to support the existence of structural similarity between landscapes of the same classes within neighboring dimensions. We also interrogated the relationship between landscapes of different problem classes.

PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features. (arXiv:2312.02781v1 [cs.CV])

Authors: Tianshun Han, Shengnan Gui, Yiqing Huang, Baihui Li, Lijian Liu, Benjia Zhou, Ning Jiang, Quan Lu, Ruicong Zhi, Yanyan Liang, Du Zhang, Jun Wan

Speech-driven 3D facial animation has improved a lot recently while most related works only utilize acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, using complementary Pseudo Multi-Modal features for improving the accuracy of facial animation. The framework entails three modules: PMMTalk encoder, cross-modal alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder employs the off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. Subsequently, the cross-modal alignment module aligns the audio-image-text features at temporal and semantic levels. Then PMMTalk decoder is employed to predict lip-syncing facial blendshape coefficients. Contrary to prior methods, PMMTalk only requires an additional random reference face image but yields more accurate results. Additionally, it is artist-friendly as it seamlessly integrates into standard animation production workflows by introducing facial blendshape coefficients. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. We recommend watching the supplementary video.

Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic IR in English and Arabic. (arXiv:2312.02803v1 [cs.CL])

Authors: Vera Pavlova

In this work, we approach the problem of Qur'anic information retrieval (IR) in Arabic and English. Using the latest state-of-the-art methods in neural IR, we research what helps to tackle this task more efficiently. Training retrieval models requires a lot of data, which is difficult to obtain for training in-domain. Therefore, we commence with training on a large amount of general domain data and then continue training on in-domain data. To handle the lack of in-domain data, we employed a data augmentation technique, which considerably improved results in MRR@10 and NDCG@5 metrics, setting the state-of-the-art in Qur'anic IR for both English and Arabic. The absence of an Islamic corpus and domain-specific model for IR task in English motivated us to address this lack of resources and take preliminary steps of the Islamic corpus compilation and domain-specific language model (LM) pre-training, which helped to improve the performance of the retrieval models that use the domain-specific LM as the shared backbone. We examined several language models (LMs) in Arabic to select one that efficiently deals with the Qur'anic IR task. Besides transferring successful experiments from English to Arabic, we conducted additional experiments with retrieval task in Arabic to amortize the scarcity of general domain datasets used to train the retrieval models. Handling Qur'anic IR task combining English and Arabic allowed us to enhance the comparison and share valuable insights across models and languages.

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models. (arXiv:2312.02813v1 [cs.CV])

Authors: Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang

Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks are less explored for several reasons. First, it requires huge memory and compute overhead to train a video generation foundation model. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models into videos in a training-free manner, temporal consistency cannot be well kept. Finally, these adaption methods are specifically designed for one task and fail to generalize to different downstream video synthesis tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined as BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use an image diffusion model (like ControlNet, Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion model for temporal smoothing. Decoupling image and video models enables flexible image model selection for different purposes, which endows the framework with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff, we perform a wide range of video generation tasks, including controllable video generation video editing, video inpainting and outpainting. Our project page is available at https://bivdiff.github.io.

Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix. (arXiv:2312.02820v1 [cs.CL])

Authors: Xinyu Ma, Xuebo Liu, Min Zhang

In multilingual translation research, the comprehension and utilization of language families are of paramount importance. Nevertheless, clustering languages based solely on their ancestral families can yield suboptimal results due to variations in the datasets employed during the model's training phase. To mitigate this challenge, we introduce an innovative method that leverages the fisher information matrix (FIM) to cluster language families, anchored on the multilingual translation model's characteristics. We hypothesize that language pairs with similar effects on model parameters exhibit a considerable degree of linguistic congruence and should thus be grouped cohesively. This concept has led us to define pseudo language families. We provide an in-depth discussion regarding the inception and application of these pseudo language families. Empirical evaluations reveal that employing these pseudo language families enhances performance over conventional language families in adapting a multilingual translation model to unfamiliar language pairs. The proposed methodology may also be extended to scenarios requiring language similarity measurements. The source code and associated scripts can be accessed at https://github.com/ecoli-hit/PseudoFamily.

Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis. (arXiv:2312.02826v1 [cs.LG])

Authors: Florent Forest, Olga Fink

Intelligent Fault Diagnosis (IFD) based on deep learning has proven to be an effective and flexible solution, attracting extensive research. Deep neural networks can learn rich representations from vast amounts of representative labeled data for various applications. In IFD, they achieve high classification performance from signals in an end-to-end manner, without requiring extensive domain knowledge. However, deep learning models usually only perform well on the data distribution they have been trained on. When applied to a different distribution, they may experience performance drops. This is also observed in IFD, where assets are often operated in working conditions different from those in which labeled data have been collected. Unsupervised domain adaptation (UDA) deals with the scenario where labeled data are available in a source domain, and only unlabeled data are available in a target domain, where domains may correspond to operating conditions. Recent methods rely on training with confident pseudo-labels for target samples. However, the confidence-based selection of pseudo-labels is hindered by poorly calibrated confidence estimates in the target domain, primarily due to over-confident predictions, which limits the quality of pseudo-labels and leads to error accumulation. In this paper, we propose a novel UDA method called Calibrated Adaptive Teacher (CAT), where we propose to calibrate the predictions of the teacher network throughout the self-training process, leveraging post-hoc calibration techniques. We evaluate CAT on domain-adaptive IFD and perform extensive experiments on the Paderborn benchmark for bearing fault diagnosis under varying operating conditions. Our proposed method achieves state-of-the-art performance on most transfer tasks.

MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition. (arXiv:2312.02829v1 [cs.LG])

Authors: Nicolas Menet (1 and 2), Michael Hersche (1 and 2), Geethan Karunaratne (1), Luca Benini (2), Abu Sebastian (1), Abbas Rahimi (1) ((1) IBM Research - Zurich, (2) ETH Zurich)

With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4 x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.

Are Vision Transformers More Data Hungry Than Newborn Visual Systems?. (arXiv:2312.02843v1 [cs.CV])

Authors: Lalit Pandey, Samantha M. W. Wood, Justin N. Wood

Vision transformers (ViTs) are top performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more data hungry than brains, with ViTs requiring more training data to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals, by performing parallel controlled rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained through the eyes of newborn chicks, the ViTs solved the same view invariant object recognition tasks as the chicks. Thus, ViTs were not more data hungry than newborn visual systems: both learned view invariant object representations in impoverished visual environments. The flexible and generic attention based learning mechanism in ViTs combined with the embodied data streams available to newborn animals appears sufficient to drive the development of animal-like object recognition.

Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study. (arXiv:2312.02855v1 [cs.AR])

Authors: Qiao Yu, Wengui Zhang, Jorge Cardoso, Odej Kao

In large-scale datacenters, memory failure is a common cause of server crashes, with uncorrectable errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using correctable errors (CEs), without fully considering the information provided by error bits. However, error bit patterns have a strong correlation with the occurrence of uncorrectable errors (UEs). In this paper, we present a comprehensive study on the correlation between CEs and UEs, specifically emphasizing the importance of spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Through evaluations using real-world datasets, we demonstrate that our approach significantly improves prediction performance by 15% in F1-score compared to the state-of-the-art algorithms. Overall, our approach effectively reduces the number of virtual machine interruptions caused by UEs by approximately 59%.

Towards Causal Representations of Climate Model Data. (arXiv:2312.02858v1 [cs.LG])

Authors: Julien Boussard, Chandni Nagda, Julia Kaltenborn, Charlotte Emilie Elektra Lange, Philippe Brouillard, Yaniv Gurwicz, Peer Nowack, David Rolnick

Climate models, such as Earth system models (ESMs), are crucial for simulating future climate change based on projected Shared Socioeconomic Pathways (SSP) greenhouse gas emissions scenarios. While ESMs are sophisticated and invaluable, machine learning-based emulators trained on existing simulation data can project additional climate scenarios much faster and are computationally efficient. However, they often lack generalizability and interpretability. This work delves into the potential of causal representation learning, specifically the \emph{Causal Discovery with Single-parent Decoding} (CDSD) method, which could render climate model emulation efficient \textit{and} interpretable. We evaluate CDSD on multiple climate datasets, focusing on emissions, temperature, and precipitation. Our findings shed light on the challenges, limitations, and promise of using CDSD as a stepping stone towards more interpretable and robust climate model emulation.

Experimental Insights Towards Explainable and Interpretable Pedestrian Crossing Prediction. (arXiv:2312.02872v1 [cs.LG])

Authors: Angie Nataly Melo, Carlota Salinas, Miguel Angel Sotelo

In the context of autonomous driving, pedestrian crossing prediction is a key component for improving road safety. Presently, the focus of these predictions extends beyond achieving trustworthy results; it is shifting towards the explainability and interpretability of these predictions. This research introduces a novel neuro-symbolic approach that combines deep learning and fuzzy logic for an explainable and interpretable pedestrian crossing prediction. We have developed an explainable predictor (ExPedCross), which utilizes a set of explainable features and employs a fuzzy inference system to predict whether the pedestrian will cross or not. Our approach was evaluated on both the PIE and JAAD datasets. The results offer experimental insights into achieving explainability and interpretability in the pedestrian crossing prediction task. Furthermore, the testing results yield a set of guidelines and recommendations regarding the process of dataset selection, feature selection, and explainability.

Toward autocorrection of chemical process flowsheets using large language models. (arXiv:2312.02873v1 [cs.LG])

Authors: Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann

The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model are suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.

Traffic Signal Control with Communicative Deep Reinforcement Learning Agents: a Case Study. (arXiv:2107.01347v4 [cs.MA] UPDATED)

Authors: Paolo Fazzini, Isaac Wheeler, Francesco Petracchini

In this work we analyze Multi-Agent Advantage Actor-Critic (MA2C) a recently proposed multi-agent reinforcement learning algorithm that can be applied to adaptive traffic signal control (ATSC) problems. To evaluate its potential we compare MA2C with Independent Advantage Actor-Critic (IA2C) and other Reinforcement Learning or heuristic based algorithms. Specifically, we analyze MA2C theoretically with the framework provided by non-Markov decision processes, which allows a deeper insight of the algorithm, and we critically examine the effectiveness and the robustness of the method by testing it in two traffic areas located in Bologna (Italy) simulated in SUMO, a software modeling tool for ATSC problems. Our results indicate that MA2C, trained with pseudo-random vehicle flows, is a promising technique able to outperform the alternative methods.

Source Free Unsupervised Graph Domain Adaptation. (arXiv:2112.00955v4 [cs.LG] UPDATED)

Authors: Haitao Mao, Lun Du, Yujia Zheng, Qiang Fu, Zelin Li, Xu Chen, Shi Han, Dongmei Zhang

Graph Neural Networks (GNNs) have achieved great success on a variety of tasks with graph-structural data, among which node classification is an essential one. Unsupervised Graph Domain Adaptation (UGDA) shows its practical value of reducing the labeling cost for node classification. It leverages knowledge from a labeled graph (i.e., source domain) to tackle the same task on another unlabeled graph (i.e., target domain). Most existing UGDA methods heavily rely on the labeled graph in the source domain. They utilize labels from the source domain as the supervision signal and are jointly trained on both the source graph and the target graph. However, in some real-world scenarios, the source graph is inaccessible because of privacy issues. Therefore, we propose a novel scenario named Source Free Unsupervised Graph Domain Adaptation (SFUGDA). In this scenario, the only information we can leverage from the source domain is the well-trained source model, without any exposure to the source graph and its labels. As a result, existing UGDA methods are not feasible anymore. To address the non-trivial adaptation challenges in this practical scenario, we propose a model-agnostic algorithm called SOGA for domain adaptation to fully exploit the discriminative ability of the source model while preserving the consistency of structural proximity on the target graph. We prove the effectiveness of the proposed algorithm both theoretically and empirically. The experimental results on four cross-domain tasks show consistent improvements in the Macro-F1 score and Macro-AUC.

Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization. (arXiv:2207.02016v4 [cs.LG] UPDATED)

Authors: Yuan Zhang, Jianhong Wang, Joschka Boedecker

Reinforcement learning (RL) is recognized as lacking generalization and robustness under environmental perturbations, which excessively restricts its application for real-world robotics. Prior work claimed that adding regularization to the value function is equivalent to learning a robust policy with uncertain transitions. Although the regularization-robustness transformation is appealing for its simplicity and efficiency, it is still lacking in continuous control tasks. In this paper, we propose a new regularizer named $\textbf{U}$ncertainty $\textbf{S}$et $\textbf{R}$egularizer (USR), by formulating the uncertainty set on the parameter space of the transition function. In particular, USR is flexible enough to be plugged into any existing RL framework. To deal with unknown uncertainty sets, we further propose a novel adversarial approach to generate them based on the value function. We evaluate USR on the Real-world Reinforcement Learning (RWRL) benchmark, demonstrating improvements in the robust performance for perturbed testing environments.

Bayesian Soft Actor-Critic: A Directed Acyclic Strategy Graph Based Deep Reinforcement Learning. (arXiv:2208.06033v2 [cs.AI] UPDATED)

Authors: Qin Yang, Ramviyas Parasuraman

Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamic environments to improve the system's utility, decrease the overall cost, and increase mission success probability. This paper proposes a novel directed acyclic strategy graph decomposition approach based on Bayesian chaining to separate an intricate policy into several simple sub-policies and organize their relationships as Bayesian strategy networks (BSN). We integrate this approach into the state-of-the-art DRL method -- soft actor-critic (SAC), and build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy. We compare our method against the state-of-the-art deep reinforcement learning algorithms on the standard continuous control benchmarks in the OpenAI Gym environment. The results demonstrate that the promising potential of the BSAC method significantly improves training efficiency.

A Policy Resonance Approach to Solve the Problem of Responsibility Diffusion in Multiagent Reinforcement Learning. (arXiv:2208.07753v3 [cs.AI] UPDATED)

Authors: Qingxu Fu, Tenghai Qiu, Jianqiang Yi, Zhiqiang Pu, Xiaolin Ai, Wanmai Yuan

SOTA multiagent reinforcement algorithms distinguish themselves in many ways from their single-agent equivalences. However, most of them still totally inherit the single-agent exploration-exploitation strategy. Naively inheriting this strategy from single-agent algorithms causes potential collaboration failures, in which the agents blindly follow mainstream behaviors and reject taking minority responsibility. We name this problem the Responsibility Diffusion (RD) as it shares similarities with a same-name social psychology effect. In this work, we start by theoretically analyzing the cause of this RD problem, which can be traced back to the exploration-exploitation dilemma of multiagent systems (especially large-scale multiagent systems). We address this RD problem by proposing a Policy Resonance (PR) approach which modifies the collaborative exploration strategy of agents by refactoring the joint agent policy while keeping individual policies approximately invariant. Next, we show that SOTA algorithms can equip this approach to promote the collaborative performance of agents in complex cooperative tasks. Experiments are performed in multiple test benchmark tasks to illustrate the effectiveness of this approach.

Iterative autoregression: a novel trick to improve your low-latency speech enhancement model. (arXiv:2211.01751v4 [cs.SD] UPDATED)

Authors: Pavel Andreev, Nicholas Babaev, Azat Saginbaev, Ivan Shchekotov, Aibek Alanov

Streaming models are an essential component of real-time speech enhancement tools. The streaming regime constrains speech enhancement models to use only a tiny context of future information. As a result, the low-latency streaming setup is generally considered a challenging task and has a significant negative impact on the model's quality. However, the sequential nature of streaming generation offers a natural possibility for autoregression, that is, utilizing previous predictions while making current ones. The conventional method for training autoregressive models is teacher forcing, but its primary drawback lies in the training-inference mismatch that can lead to a substantial degradation in quality. In this study, we propose a straightforward yet effective alternative technique for training autoregressive low-latency speech enhancement models. We demonstrate that the proposed approach leads to stable improvement across diverse architectures and training scenarios.

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing. (arXiv:2211.16934v2 [cs.CL] UPDATED)

Authors: Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei He, Sheng Zhao, Arul Menezes, Jiang Bian

Video dubbing aims to translate the original speech in a film or television program into the speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure the translated speech to be well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech as the speech duration of words/characters in different languages varies. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech. Specifically, we control the speech length of generated sentence by guiding the prediction of each word with the duration information, including the speech duration of itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control ability on the generated speech than baseline methods. To make up the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.

Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction. (arXiv:2301.03573v2 [cs.LG] UPDATED)

Authors: Bowen Lei, Dongkuan Xu, Ruqi Zhang, Shuren He, Bani K. Mallick

Despite impressive performance, deep neural networks require significant memory and computation costs, prohibiting their application in resource-constrained scenarios. Sparse training is one of the most common techniques to reduce these costs, however, the sparsity constraints add difficulty to the optimization, resulting in an increase in training time and instability. In this work, we aim to overcome this problem and achieve space-time co-efficiency. To accelerate and stabilize the convergence of sparse training, we analyze the gradient changes and develop an adaptive gradient correction method. Specifically, we approximate the correlation between the current and previous gradients, which is used to balance the two gradients to obtain a corrected gradient. Our method can be used with the most popular sparse training pipelines under both standard and adversarial setups. Theoretically, we prove that our method can accelerate the convergence rate of sparse training. Extensive experiments on multiple datasets, model architectures, and sparsities demonstrate that our method outperforms leading sparse training methods by up to \textbf{5.0\%} in accuracy given the same number of training epochs, and reduces the number of training epochs by up to \textbf{52.1\%} to achieve the same accuracy. Our code is available on: \url{https://github.com/StevenBoys/AGENT}.

A Unified Theory of Diversity in Ensemble Learning. (arXiv:2301.03962v2 [cs.LG] UPDATED)

Authors: Danny Wood, Tingting Mu, Andrew Webb, Henry Reeve, Mikel Lujan, Gavin Brown

We present a theory of ensemble diversity, explaining the nature of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the "holy grail" of ensemble learning, an open research issue for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of the ensemble loss. We prove a family of exact bias-variance-diversity decompositions, for both regression and classification, e.g., squared, cross-entropy, and Poisson losses. For losses where an additive bias-variance decomposition is not available (e.g., 0/1 loss) we present an alternative approach, which precisely quantifies the effects of diversity, turning out to be dependent on the label distribution. Experiments show how we can use our framework to understand the diversity-encouraging mechanisms of popular methods: Bagging, Boosting, and Random Forests.

Modelling human logical reasoning process in dynamic environmental stress with cognitive agents. (arXiv:2301.06216v3 [cs.AI] UPDATED)

Authors: Songlin Xu, Xinyu Zhang

Modelling human cognition can provide key insights into behavioral dynamics under changing conditions. This enables synthetic data generation and guides adaptive interventions for cognitive regulation. Challenges arise when environments are highly dynamic, obscuring stimulus-behavior relationships. We propose a cognitive agent integrating drift-diffusion with deep reinforcement learning to simulate granular stress effects on logical reasoning process. Leveraging a large dataset of 21,157 logical responses, we investigate performance impacts of dynamic stress. This prior knowledge informed model design and evaluation. Quantitatively, the framework improves cognition modelling by capturing both subject-specific and stimuli-specific behavioural differences. Qualitatively, it captures general trends in human logical reasoning under stress. Our approach is extensible to examining diverse environmental influences on cognition and behavior. Overall, this work demonstrates a powerful, data-driven methodology to simulate and understand the vagaries of human logical reasoning process in dynamic contexts.

A Framework for Neurosymbolic Robot Action Planning using Large Language Models. (arXiv:2303.00438v2 [cs.AI] UPDATED)

Authors: Alessio Capitanelli, Fulvio Mastrogiovanni

Symbolic task planning is a widely used approach to enforce robot autonomy due to its ease of understanding and deployment. However, symbolic task planning is difficult to scale in real-world when frequent re-planning is needed, for example, due to human-robot interactions or unforeseen events. Plan length and planning time can hinder the robot's efficiency and negatively affect the overall human-robot interaction's fluency. We present a framework, Teriyaki, designed to bridge the gap between symbolic task planning and machine learning approaches, by training Large Language Models (LLMs), namely GPT-3, into neurosymbolic task planners compatible with the Planning Domain Definition Language (PDDL). Potential benefits include: (i) better scalability in so far as the planning domain complexity increases, since LLMs' response time linearly scales with the combined length of the input and the output, instead of super-linearly as in the case of symbolic task planners, and (ii) the ability to synthesize a plan action-by-action instead of end-to-end, and to make each action available for execution as soon as it is generated, which in turn enables concurrent planning and execution. In the past year, significant efforts have been devoted by the research community to evaluate the overall cognitive abilities of LLMs, with alternate successes. Instead, with Teriyaki we aim to providing an overall planning performance comparable to traditional planners in specific planning domains, while leveraging LLMs capabilities in other metrics which are used to build a look-ahead predictive planning model. Preliminary results in selected domains show that our method can: (i) solve 95.5% of problems in a test data set of 1000 samples; (ii) produce plans up to 13.5% shorter than a traditional symbolic planner; (iii) reduce average overall waiting times for a plan availability by up to 61.4%.

ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance. (arXiv:2303.16894v4 [cs.CV] UPDATED)

Authors: Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li

Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding exploring how to grasp the view knowledge from both text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text to multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views, and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer. Code is released at https://github.com/Ivan-Tang-3D/ViewRefer3D.

Qualitative Failures of Image Generation Models and Their Application in Detecting Deepfakes. (arXiv:2304.06470v4 [cs.CV] UPDATED)

Authors: Ali Borji

The ability of image and video generation models to create photorealistic images has reached unprecedented heights, making it difficult to distinguish between real and fake images in many cases. However, despite this progress, a gap remains between the quality of generated images and those found in the real world. To address this, we have reviewed a vast body of literature from both academic publications and social media to identify qualitative shortcomings in image generation models, which we have classified into five categories. By understanding these failures, we can identify areas where these models need improvement, as well as develop strategies for detecting deep fakes. The prevalence of deep fakes in today's society is a serious concern, and our findings can help mitigate their negative impact.

Scaling laws for language encoding models in fMRI. (arXiv:2305.11863v3 [cs.CL] UPDATED)

Authors: Richard Antonello, Aditya Vaidya, Alexander G. Huth

Representations from transformer-based unidirectional language models are known to be effective at predicting brain responses to natural language. However, most studies comparing language models to brains have used GPT-2 or similarly sized language models. Here we tested whether larger open-source models such as those from the OPT and LLaMA families are better at predicting brain responses recorded using fMRI. Mirroring scaling results from other contexts, we found that brain prediction performance scales logarithmically with model size from 125M to 30B parameter models, with ~15% increased encoding performance as measured by correlation with a held-out test set across 3 subjects. Similar logarithmic behavior was observed when scaling the size of the fMRI training set. We also characterized scaling for acoustic encoding models that use HuBERT, WavLM, and Whisper, and we found comparable improvements with model size. A noise ceiling analysis of these large, high-performance encoding models showed that performance is nearing the theoretical maximum for brain areas such as the precuneus and higher auditory cortex. These results suggest that increasing scale in both models and data will yield incredibly effective models of language processing in the brain, enabling better scientific understanding as well as applications such as decoding.

Tweetorial Hooks: Generative AI Tools to Motivate Science on Social Media. (arXiv:2305.12265v2 [cs.HC] UPDATED)

Authors: Tao Long, Dorothy Zhang, Grace Li, Batool Taraif, Samia Menon, Kynnedy Simone Smith, Sitong Wang, Katy Ilonka Gero, Lydia B. Chilton

Communicating science and technology is essential for the public to understand and engage in a rapidly changing world. Tweetorials are an emerging phenomenon where experts explain STEM topics on social media in creative and engaging ways. However, STEM experts struggle to write an engaging "hook" in the first tweet that captures the reader's attention. We propose methods to use large language models (LLMs) to help users scaffold their process of writing a relatable hook for complex scientific topics. We demonstrate that LLMs can help writers find everyday experiences that are relatable and interesting to the public, avoid jargon, and spark curiosity. Our evaluation shows that the system reduces cognitive load and helps people write better hooks. Lastly, we discuss the importance of interactivity with LLMs to preserve the correctness, effectiveness, and authenticity of the writing.

Large Language Models, scientific knowledge and factuality: A systematic analysis in antibiotic discovery. (arXiv:2305.17819v2 [cs.CL] UPDATED)

Authors: Magdalena Wysocka, Oskar Wysocki, Maxime Delmas, Vincent Mutel, Andre Freitas

Inferring over and extracting information from Large Language Models (LLMs) trained on a large corpus of scientific literature can potentially drive a new era in biomedical research, reducing the barriers for accessing existing medical evidence. This work examines the potential of LLMs for dialoguing with biomedical background knowledge, using the context of antibiotic discovery. The systematic analysis is applied to ten state-of-the-art models, from models specialised on biomedical scientific corpora to general models such as ChatGPT, GPT-4 and Llama 2 in two prompting-based tasks: chemical compound definition generation and chemical compound-fungus relation determination. The work provides a systematic assessment on the ability of LLMs to encode and express these relations, verifying for fluency, prompt-alignment, semantic coherence, factual knowledge and specificity of generated responses. Results show that while recent models have improved in fluency, factual accuracy is still low and models are biased towards over-represented entities. The ability of LLMs to serve as biomedical knowledge bases is questioned, and the need for additional systematic evaluation frameworks is highlighted. The best performing GPT-4 produced a factual definition for 70% of chemical compounds and 43.6% factual relations to fungi, whereas the best open source model BioGPT-large 30% of the compounds and 30% of the relations for the best-performing prompt. The results show that while LLMs are currently not fit for purpose to be used as biomedical factual knowledge bases, there is a promising emerging property in the direction of factuality as the models become domain specialised, scale-up in size and level of human feedback.

Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation. (arXiv:2305.19798v2 [cs.LG] UPDATED)

Authors: Yingyi Chen, Qinghua Tao, Francesco Tonin, Johan A.K. Suykens

Recently, a new line of works has emerged to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply the methods for symmetric kernels to the asymmetric self-attention, resulting in a nontrivial gap between the analytical understanding and numerical implementation. In this paper, we provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention normally observed in deep layers. Through asymmetric KSVD, $i$) a primal-dual representation of self-attention is formulated, where the optimization objective is cast to maximize the projection variances in the attention outputs; $ii$) a novel attention mechanism, i.e., Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; $iii$) with KKT conditions, we prove that the stationary solution to the KSVD optimization in Primal-Attention yields a zero-value objective. In this manner, KSVD optimization can be implemented by simply minimizing a regularization loss, so that low-rank property is promoted without extra decomposition. Numerical experiments show state-of-the-art performance of our Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of the canonical self-attention, further verifying the great potential of our method. To the best of our knowledge, this is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.

Value Functions are Control Barrier Functions: Verification of Safe Policies using Control Theory. (arXiv:2306.04026v4 [cs.LG] UPDATED)

Authors: Daniel C.H. Tan, Fernando Acero, Robert McCarthy, Dimitrios Kanoulas, Zhibin Li

Guaranteeing safe behaviour of reinforcement learning (RL) policies poses significant challenges for safety-critical applications, despite RL's generality and scalability. To address this, we propose a new approach to apply verification methods from control theory to learned value functions. By analyzing task structures for safety preservation, we formalize original theorems that establish links between value functions and control barrier functions. Further, we propose novel metrics for verifying value functions in safe control tasks and practical implementation details to improve learning. Our work presents a novel method for certificate learning, which unlocks a diversity of verification techniques from control theory for RL policies, and marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems. Code and videos are available at this https url: https://rl-cbf.github.io/

Quantitative Analysis of Primary Attribution Explainable Artificial Intelligence Methods for Remote Sensing Image Classification. (arXiv:2306.04037v2 [cs.LG] UPDATED)

Authors: Akshatha Mohan, Joshua Peeples

We present a comprehensive analysis of quantitatively evaluating explainable artificial intelligence (XAI) techniques for remote sensing image classification. Our approach leverages state-of-the-art machine learning approaches to perform remote sensing image classification across multiple modalities. We investigate the results of the models qualitatively through XAI methods. Additionally, we compare the XAI methods quantitatively through various categories of desired properties. Through our analysis, we offer insights and recommendations for selecting the most appropriate XAI method(s) to gain a deeper understanding of the models' decision-making processes. The code for this work is publicly available.

ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems. (arXiv:2306.04743v2 [cs.DB] UPDATED)

Authors: Yi Zhang, Jan Deriu, George Katsogiannis-Meimarakis, Catherine Kosten, Georgia Koutrika, Kurt Stockinger

Natural Language to SQL systems (NL-to-SQL) have recently shown a significant increase in accuracy for natural language to SQL query translation. This improvement is due to the emergence of transformer-based language models, and the popularity of the Spider benchmark - the de-facto standard for evaluating NL-to-SQL systems. The top NL-to-SQL systems reach accuracies of up to 85\%. However, Spider mainly contains simple databases with few tables, columns, and entries, which does not reflect a realistic setting. Moreover, complex real-world databases with domain-specific content have little to no training data available in the form of NL/SQL-pairs leading to poor performance of existing NL-to-SQL systems.

In this paper, we introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases. For this new benchmark, SQL experts and domain experts created high-quality NL/SQL-pairs for each domain. To garner more data, we extended the small amount of human-generated data with synthetic data generated using GPT-3. We show that our benchmark is highly challenging, as the top performing systems on Spider achieve a very low performance on our benchmark. Thus, the challenge is many-fold: creating NL-to-SQL systems for highly complex domains with a small amount of hand-made training data augmented with synthetic data. To our knowledge, ScienceBenchmark is the first NL-to-SQL benchmark designed with complex real-world scientific databases, containing challenging training and test data carefully validated by domain experts.

Fedstellar: A Platform for Decentralized Federated Learning. (arXiv:2306.09750v3 [cs.LG] UPDATED)

Authors: Enrique Tomás Martínez Beltrán, Ángel Luis Perales Gómez, Chao Feng, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, Alberto Huertas Celdrán

In 2016, Google proposed Federated Learning (FL) as a novel paradigm to train Machine Learning (ML) models across the participants of a federation while preserving data privacy. Since its birth, Centralized FL (CFL) has been the most used approach, where a central entity aggregates participants' models to create a global one. However, CFL presents limitations such as communication bottlenecks, single point of failure, and reliance on a central server. Decentralized Federated Learning (DFL) addresses these issues by enabling decentralized model aggregation and minimizing dependency on a central entity. Despite these advances, current platforms training DFL models struggle with key issues such as managing heterogeneous federation network topologies. To overcome these challenges, this paper presents Fedstellar, a novel platform designed to train FL models in a decentralized, semi-decentralized, and centralized fashion across diverse federations of physical or virtualized devices. The Fedstellar implementation encompasses a web application with an interactive graphical interface, a controller for deploying federations of nodes using physical or virtual devices, and a core deployed on each device which provides the logic needed to train, aggregate, and communicate in the network. The effectiveness of the platform has been demonstrated in two scenarios: a physical deployment involving single-board devices such as Raspberry Pis for detecting cyberattacks, and a virtualized deployment comparing various FL approaches in a controlled environment using MNIST and CIFAR-10 datasets. In both scenarios, Fedstellar demonstrated consistent performance and adaptability, achieving F1 scores of 91%, 98%, and 91.2% using DFL for detecting cyberattacks and classifying MNIST and CIFAR-10, respectively, reducing training time by 32% compared to centralized approaches.

Neural Priming for Sample-Efficient Adaptation. (arXiv:2306.10191v3 [cs.LG] UPDATED)

Authors: Matthew Wallingford, Vivek Ramanujan, Alex Fang, Aditya Kusupati, Roozbeh Mottaghi, Aniruddha Kembhavi, Ludwig Schmidt, Ali Farhadi

We propose Neural Priming, a technique for adapting large pretrained models to distribution shifts and downstream tasks given few or no labeled examples. Presented with class names or unlabeled test samples, Neural Priming enables the model to recall and conditions its parameters on relevant data seen throughout pretraining, thereby priming it for the test distribution. Neural Priming can be performed at test time, even for pretraining datasets as large as LAION-2B. Performing lightweight updates on the recalled data significantly improves accuracy across a variety of distribution shift and transfer learning benchmarks. Concretely, in the zero-shot setting, we see a 2.45% improvement in accuracy on ImageNet and 3.81% accuracy improvement on average across standard transfer learning benchmarks. Further, using Neural Priming at inference to adapt to distribution shift, we see a 1.41% accuracy improvement on ImageNetV2. These results demonstrate the effectiveness of Neural Priming in addressing the challenge of limited labeled data and changing distributions. Code is available at github.com/RAIVNLab/neural-priming.

RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model. (arXiv:2306.11300v4 [cs.CV] UPDATED)

Authors: Zilun Zhang, Tiancheng Zhao, Yulong Guo, Jianwei Yin

Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform the domain-specific transfer for accomplishing domain-related downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform the domain-specific transfer for accomplishing domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained from filtering publicly available image-text paired datasets and captioning label-only RS datasets with pre-trained VLM. These constitute the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by $3\%\sim20\%$ in Zero-shot Classification (ZSC), $3\%\sim6\%$ in Remote Sensing Cross-Modal Text-Image Retrieval (RSCTIR) and $4\%\sim5\%$ in Semantic Localization (SeLo) tasks. Dataset and models have been released in: \url{https://github.com/om-ai-lab/RS5M}.

Understanding Social Reasoning in Language Models with Language Models. (arXiv:2306.15448v2 [cs.CL] UPDATED)

Authors: Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman

As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite the recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models can align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) the presence of inconsistent results from previous evaluations, and (2) concerns surrounding the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. Using BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and compare model performances with human performance. Our results suggest that GPT4 has ToM capabilities that mirror human inference patterns, though less reliable, while other LLMs struggle.

On the Identifiability of Quantized Factors. (arXiv:2306.16334v2 [cs.LG] UPDATED)

Authors: Vitória Barin-Pacela, Kartik Ahuja, Simon Lacoste-Julien, Pascal Vincent

Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.

A Multi-Task Perspective for Link Prediction with New Relation Types and Nodes. (arXiv:2307.06046v2 [cs.LG] UPDATED)

Authors: Jincheng Zhou, Beatrice Bevilacqua, Bruno Ribeiro

The task of inductive link prediction in (discrete) attributed multigraphs infers missing attributed links (relations) between nodes in new test multigraphs. Traditional relational learning methods face the challenge of limited generalization to test multigraphs containing both novel nodes and novel relation types not seen in training. Recently, under the only assumption that all relation types share the same structural predictive patterns (single task), Gao et al. (2023) proposed a link prediction method using the theoretical concept of double equivariance (equivariance for nodes & relation types), in contrast to the (single) equivariance (only for nodes) used to design Graph Neural Networks (GNNs). In this work we further extend the double equivariance concept to multi-task double equivariance, where we define link prediction in attributed multigraphs that can have distinct and potentially conflicting predictive patterns for different sets of relation types (multiple tasks). Our empirical results on real-world datasets demonstrate that our approach can effectively generalize to test graphs with multi-task structures without access to additional information.

Learning beyond sensations: how dreams organize neuronal representations. (arXiv:2308.01830v2 [q-bio.NC] UPDATED)

Authors: Nicolas Deperrois, Mihai A. Petrovici, Walter Senn, Jakob Jordan

Semantic representations in higher sensory cortices form the basis for robust, yet flexible behavior. These representations are acquired over the course of development in an unsupervised fashion and continuously maintained over an organism's lifespan. Predictive learning theories propose that these representations emerge from predicting or reconstructing sensory inputs. However, brains are known to generate virtual experiences, such as during imagination and dreaming, that go beyond previously experienced inputs. Here, we suggest that virtual experiences may be just as relevant as actual sensory inputs in shaping cortical representations. In particular, we discuss two complementary learning principles that organize representations through the generation of virtual experiences. First, "adversarial dreaming" proposes that creative dreams support a cortical implementation of adversarial learning in which feedback and feedforward pathways engage in a productive game of trying to fool each other. Second, "contrastive dreaming" proposes that the invariance of neuronal representations to irrelevant factors of variation is acquired by trying to map similar virtual experiences together via a contrastive learning process. These principles are compatible with known cortical structure and dynamics and the phenomenology of sleep thus providing promising directions to explain cortical learning beyond the classical predictive learning paradigm.

Cost-effective On-device Continual Learning over Memory Hierarchy with Miro. (arXiv:2308.06053v4 [cs.LG] UPDATED)

Authors: Xinyue Ma, Suyeon Jeong, Minjia Zhang, Di Wang, Jonghyun Choi, Myeongjae Jeon

Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy while not compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework by enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms baseline systems we build for comparison, consistently achieving higher cost-effectiveness.

Topological Graph Signal Compression. (arXiv:2308.11068v2 [cs.LG] UPDATED)

Authors: Guillermo Bernárdez, Lev Telyatnikov, Eduard Alarcón, Albert Cabellos-Aparicio, Pere Barlet-Ros, Pietro Liò

Recently emerged Topological Deep Learning (TDL) methods aim to extend current Graph Neural Networks (GNN) by naturally processing higher-order interactions, going beyond the pairwise relations and local neighborhoods defined by graph representations. In this paper we propose a novel TDL-based method for compressing signals over graphs, consisting in two main steps: first, disjoint sets of higher-order structures are inferred based on the original signal --by clustering $N$ datapoints into $K\ll N$ collections; then, a topological-inspired message passing gets a compressed representation of the signal within those multi-element sets. Our results show that our framework improves both standard GNN and feed-forward architectures in compressing temporal link-based signals from two real-word Internet Service Provider Networks' datasets --from $30\%$ up to $90\%$ better reconstruction errors across all evaluation scenarios--, suggesting that it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.

FIMO: A Challenge Formal Dataset for Automated Theorem Proving. (arXiv:2309.04295v2 [cs.AI] UPDATED)

Authors: Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, Qun Liu

We present FIMO, an innovative dataset comprising formal mathematical problem statements sourced from the International Mathematical Olympiad (IMO) Shortlisted Problems. Designed to facilitate advanced automated theorem proving at the IMO level, FIMO is currently tailored for the Lean formal language. It comprises 149 formal problem statements, accompanied by both informal problem descriptions and their corresponding LaTeX-based informal proofs. Through initial experiments involving GPT-4, our findings underscore the existing limitations in current methodologies, indicating a substantial journey ahead before achieving satisfactory IMO-level automated theorem proving outcomes.

Investigating the Catastrophic Forgetting in Multimodal Large Language Models. (arXiv:2309.10313v4 [cs.CL] UPDATED)

Authors: Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still remains an inherent problem in multimodal LLMs (MLLM). In this paper, we introduce EMT: Evaluating MulTimodality for evaluating the catastrophic forgetting in MLLMs, by treating each MLLM as an image classifier. We first apply EMT to evaluate several open-source fine-tuned MLLMs and we discover that almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks. Moreover, we continue fine-tuning LLaVA, an MLLM and utilize EMT to assess performance throughout the fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on an image dataset improves performance across other image datasets, by enhancing the alignment of text and visual features. However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to demonstrate performance on par with their vision models on standard image classification tasks and the current MLLM fine-tuning procedure still has room for improvement.

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. (arXiv:2309.12307v2 [cs.CL] UPDATED)

Authors: Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shifted sparse attention (S$^2$-Attn) effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA combines this improved LoRA with S$^2$-Attn. LongLoRA demonstrates strong empirical results on various tasks on Llama2 models from 7B/13B to 70B. LongLoRA adopts Llama2 7B from 4k context to 100k, or Llama2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.

Data Upcycling Knowledge Distillation for Image Super-Resolution. (arXiv:2309.14162v2 [cs.CV] UPDATED)

Authors: Yun Zhang, Wei Li, Simiao Li, Jie Hu, Hanting Chen, Hailing Wang, Zhijun Tu, Wenjia Wang, Bingyi Jing, Yunhe Wang

Knowledge distillation (KD) emerges as a promising yet challenging technique for compressing deep neural networks, aiming to transfer extensive learning representations from proficient and computationally intensive teacher models to compact student models. However, current KD methods for super-resolution (SR) models have limited performance and restricted applications, since the characteristics of SR tasks are overlooked. In this paper, we put forth an approach from the perspective of effective data utilization, namely, the Data Upcycling Knowledge Distillation (DUKD), which facilitates the student model by the prior knowledge the teacher provided through the upcycled in-domain data derived from the input images. Besides, for the first time, we realize the label consistency regularization in KD for SR models, which is implemented by the paired invertible data augmentations. It constrains the training process of KD and leads to better generalization capability of the student model. The DUKD, due to its versatility, can be applied across a broad spectrum of teacher-student architectures (e.g., CNN and Transformer models) and SR tasks, such as single image SR, real-world SR, and SR quantization, and is in parallel with other compression techniques. Comprehensive experiments on diverse benchmarks demonstrate that the DUKD method significantly outperforms previous art.

Stackelberg Driver Model for Continual Policy Improvement in Scenario-Based Closed-Loop Autonomous Driving. (arXiv:2309.14235v3 [cs.LG] UPDATED)

Authors: Haoyi Niu, Qimao Chen, Yingyue Li, Yi Zhang, Jianming Hu

The deployment of autonomous vehicles (AVs) has faced hurdles due to the dominance of rare but critical corner cases within the long-tail distribution of driving scenarios, which negatively affects their overall performance. To address this challenge, adversarial generation methods have emerged as a class of efficient approaches to synthesize safety-critical scenarios for AV testing. However, these generated scenarios are often underutilized for AV training, resulting in the potential for continual AV policy improvement remaining untapped, along with a deficiency in the closed-loop design needed to achieve it. Therefore, we tailor the Stackelberg Driver Model (SDM) to accurately characterize the hierarchical nature of vehicle interaction dynamics, facilitating iterative improvement by engaging background vehicles (BVs) and AV in a sequential game-like interaction paradigm. With AV acting as the leader and BVs as followers, this leader-follower modeling ensures that AV would consistently refine its policy, always taking into account the additional information that BVs play the best response to challenge AV. Extensive experiments have shown that our algorithm exhibits superior performance compared to several baselines especially in higher dimensional scenarios, leading to substantial advancements in AV capabilities while continually generating progressively challenging scenarios. Code is available at https://github.com/BlueCat-de/SDM.

Denoising Diffusion Bridge Models. (arXiv:2309.16948v3 [cs.CV] UPDATED)

Authors: Linqi Zhou, Aaron Lou, Samar Khanna, Stefano Ermon

Diffusion models are powerful generative models that map noise to data using stochastic processes. However, for many applications such as image editing, the model input comes from a distribution that is not random noise. As such, diffusion models must rely on cumbersome methods like guidance or projected sampling to incorporate this information in the generative process. In our work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural alternative to this paradigm based on diffusion bridges, a family of processes that interpolate between two paired distributions given as endpoints. Our method learns the score of the diffusion bridge from data and maps from one endpoint distribution to the other by solving a (stochastic) differential equation based on the learned score. Our method naturally unifies several classes of generative models, such as score-based diffusion models and OT-Flow-Matching, allowing us to adapt existing design and architectural choices to our more general problem. Empirically, we apply DDBMs to challenging image datasets in both pixel and latent space. On standard image translation problems, DDBMs achieve significant improvement over baseline methods, and, when we reduce the problem to image generation by setting the source distribution to random noise, DDBMs achieve comparable FID scores to state-of-the-art methods despite being built for a more general task.

GPT-Driver: Learning to Drive with GPT. (arXiv:2310.01415v3 [cs.CV] UPDATED)

Authors: Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, Yue Wang

We present a simple yet effective approach that can transform the OpenAI GPT-3.5 model into a reliable motion planner for autonomous vehicles. Motion planning is a core challenge in autonomous driving, aiming to plan a driving trajectory that is safe and comfortable. Existing motion planners predominantly leverage heuristic methods to forecast driving trajectories, yet these approaches demonstrate insufficient generalization capabilities in the face of novel and unseen driving scenarios. In this paper, we propose a novel approach to motion planning that capitalizes on the strong reasoning capabilities and generalization potential inherent to Large Language Models (LLMs). The fundamental insight of our approach is the reformulation of motion planning as a language modeling problem, a perspective not previously explored. Specifically, we represent the planner inputs and outputs as language tokens, and leverage the LLM to generate driving trajectories through a language description of coordinate positions. Furthermore, we propose a novel prompting-reasoning-finetuning strategy to stimulate the numerical reasoning potential of the LLM. With this strategy, the LLM can describe highly precise trajectory coordinates and also its internal decision-making process in natural language. We evaluate our approach on the large-scale nuScenes dataset, and extensive experiments substantiate the effectiveness, generalization ability, and interpretability of our GPT-based motion planner. Code is now available at https://github.com/PointsCoder/GPT-Driver.

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models. (arXiv:2310.04406v2 [cs.AI] UPDATED)

Authors: Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, Yu-Xiong Wang

While large language models (LLMs) have demonstrated impressive performance on a range of decision-making tasks, they rely on simple acting processes and fall short of broad deployment as autonomous agents. We introduce LATS (Language Agent Tree Search), a general framework that synergizes the capabilities of LLMs in planning, acting, and reasoning. Drawing inspiration from Monte Carlo tree search in model-based reinforcement learning, LATS employs LLMs as agents, value functions, and optimizers, repurposing their latent strengths for enhanced decision-making. What is crucial in this method is the use of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism that moves beyond the limitations of existing techniques. Our experimental evaluation across diverse domains, such as programming, HotPotQA, and WebShop, illustrates the applicability of LATS for both reasoning and acting. In particular, LATS achieves 94.4% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.

Warfare:Breaking the Watermark Protection of AI-Generated Content. (arXiv:2310.07726v2 [cs.CV] UPDATED)

Authors: Guanlin Li, Yifei Chen, Jie Zhang, Jiwei Li, Shangwei Guo, Tianwei Zhang

AI-Generated Content (AIGC) is gaining great popularity, with many emerging commercial services and applications. These services leverage advanced generative models, such as latent diffusion models and large language models, to generate creative content (e.g., realistic images and fluent sentences) for users. The usage of such generated content needs to be highly regulated, as the service providers need to ensure the users do not violate the usage policies (e.g., abuse for commercialization, generating and distributing unsafe content). A promising solution to achieve this goal is watermarking, which adds unique and imperceptible watermarks on the content for service verification and attribution. Numerous watermarking approaches have been proposed recently. However, in this paper, we show that an adversary can easily break these watermarking mechanisms. Specifically, we consider two possible attacks. (1) Watermark removal: the adversary can easily erase the embedded watermark from the generated content and then use it freely bypassing the regulation of the service provider. (2) Watermark forging: the adversary can create illegal content with forged watermarks from another user, causing the service provider to make wrong attributions. We propose Warfare, a unified methodology to achieve both attacks in a holistic way. The key idea is to leverage a pre-trained diffusion model for content processing and a generative adversarial network for watermark removal or forging. We evaluate Warfare on different datasets and embedding setups. The results prove that it can achieve high success rates while maintaining the quality of the generated content. Compared to existing diffusion model-based attacks, Warfare is 5,050~11,000x faster.

Policy Gradient with Kernel Quadrature. (arXiv:2310.14768v2 [cs.LG] UPDATED)

Authors: Satoshi Hayakawa, Tetsuro Morimura

Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, only on which we actually compute rewards for more efficient policy gradient iterations. We build a Gaussian process modeling of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an ``episodic" kernel quadrature method to compress the information of sample episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as its numerical illustrations in MuJoCo tasks.

Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion. (arXiv:2310.16546v3 [cs.LG] UPDATED)

Authors: Taehyun Cho, Seungyub Han, Heesoo Lee, Kyungjae Lee, Jungwoo Lee

Distributional reinforcement learning algorithms have attempted to utilize estimated uncertainty for exploration, such as optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may cause biased data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning algorithm that selects actions by randomizing risk criterion to avoid one-sided tendency on risk. We provide a perturbed distributional Bellman optimality operator by distorting the risk measure and prove the convergence and optimality of the proposed method with the weaker contraction property. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments including Atari 55 games.

Understanding Parameter Saliency via Extreme Value Theory. (arXiv:2310.17951v2 [cs.CV] UPDATED)

Authors: Shuo Wang, Issei Sato

Deep neural networks are being increasingly implemented throughout society in recent years. It is useful to identify which parameters trigger misclassification in diagnosing undesirable model behaviors. The concept of parameter saliency is proposed and used to diagnose convolutional neural networks (CNNs) by ranking convolution filters that may have caused misclassification on the basis of parameter saliency. It is also shown that fine-tuning the top ranking salient filters efficiently corrects misidentification on ImageNet. However, there is still a knowledge gap in terms of understanding why parameter saliency ranking can find the filters inducing misidentification. In this work, we attempt to bridge the gap by analyzing parameter saliency ranking from a statistical viewpoint, namely, extreme value theory. We first show that the existing work implicitly assumes that the gradient norm computed for each filter follows a normal distribution. Then, we clarify the relationship between parameter saliency and the score based on the peaks-over-threshold (POT) method, which is often used to model extreme values. Finally, we reformulate parameter saliency in terms of the POT method, where this reformulation is regarded as statistical anomaly detection and does not require the implicit assumptions of the existing parameter-saliency formulation. Our experimental results demonstrate that our reformulation can detect malicious filters as well. Furthermore, we show that the existing parameter saliency method exhibits a bias against the depth of layers in deep neural networks. In particular, this bias has the potential to inhibit the discovery of filters that cause misidentification in situations where domain shift occurs. In contrast, parameter saliency based on POT shows less of this bias.

BioImage.IO Chatbot: A Personalized Assistant for BioImage Analysis Augmented by Community Knowledge Base. (arXiv:2310.18351v3 [cs.AI] UPDATED)

Authors: Wanlu Lei, Caterina Fuster-Barceló, Arrate Muñoz-Barrutia, Wei Ouyang

The rapidly expanding landscape of bioimage analysis tools presents a navigational challenge for both experts and newcomers. Traditional search methods often fall short in assisting users in this complex environment. To address this, we introduce the BioImage$.$IO Chatbot, an AI-driven conversational assistant tailored for the bioimage community. Built upon large language models, this chatbot provides personalized, context-aware answers by aggregating and interpreting information from diverse databases, tool-specific documentation, and structured data sources. Enhanced by a community-contributed knowledge base and fine-tuned retrieval methods, the BioImage$.$IO Chatbot offers not just a personalized interaction but also a knowledge-enriched, context-aware experience. It fundamentally transforms the way biologists, bioimage analysts, and developers navigate and utilize advanced bioimage analysis tools, setting a new standard for community-driven, accessible scientific research.

Is one brick enough to break the wall of spoken dialogue state tracking?. (arXiv:2311.04923v2 [cs.CL] UPDATED)

Authors: Lucas Druart (LIA), Valentin Vielzeuf, Yannick Estève (LIA)

In Task-Oriented Dialogue (TOD) systems, correctly updating the system's understanding of the user's needs (a.k.a dialogue state tracking) is key to a smooth interaction. Traditionally, TOD systems perform this update in three steps: transcription of the user's utterance, semantic extraction of the key concepts, and contextualization with the previously identified concepts. Such cascade approaches suffer from cascading errors and separate optimization. End-to-End approaches have been proved helpful up to the semantic extraction step. This paper goes one step further paving the path towards completely neural spoken dialogue state tracking by comparing three approaches: (1) a state of the art cascade approach, (2) a locally E2E approach with rule-based contextualization and (3) a completely neural approach.

Deep Learning in Computed Tomography Pulmonary Angiography Imaging: A Dual-Pronged Approach for Pulmonary Embolism Detection. (arXiv:2311.05197v2 [cs.CV] UPDATED)

Authors: Fabiha Bushra, Muhammad E. H. Chowdhury, Rusab Sarmun, Saidul Kabir, Menatalla Said, Sohaib Bassam Zoghoul, Adam Mushtak, Israa Al-Hashimi, Abdulrahman Alqahtani, Anwarul Hasan

The increasing reliance on Computed Tomography Pulmonary Angiography for Pulmonary Embolism (PE) diagnosis presents challenges and a pressing need for improved diagnostic solutions. The primary objective of this study is to leverage deep learning techniques to enhance the Computer Assisted Diagnosis of PE. In this study, we propose a classifier-guided detection approach that effectively leverages the classifier's probabilistic inference to direct the detection predictions, marking a novel contribution in the domain of automated PE diagnosis. Our end-to-end classification framework introduces an Attention-Guided Convolutional Neural Network (AG-CNN) that leverages local context by utilizing an attention mechanism. This approach emulates the attention of a human expert by looking at both global appearances and local lesion regions before forming a conclusive decision. The classifier achieves a notable AUROC, sensitivity, specificity and F1-score of 0.927, 0.862, 0.879 and 0.805 respectively on the FUMPE dataset with Inception-v3 backbone architecture. Moreover, AG-CNN outperforms the baseline DenseNet-121 model, achieving an 8.1% AUROC gain. While prior studies have primarily focused on PE detection in main arteries, our utilization of state-of-the-art object detection models and ensembling techniques significantly enhances detection accuracy for small embolisms in the peripheral arteries. Finally, our proposed classifier-guided detection approach further refines the detection metrics contributing new state-of-the-art to the community: mAP$_{50}$, sensitivity and F1-score of 0.846, 0.901 and 0.779 respectively outperforming the former benchmark with a significant 3.7% improvement in mAP$_{50}$. Our research aims to elevate PE patient care by integrating AI solutions into clinical workflows, highlighting the potential of human-AI collaboration in medical diagnostics.

A Central Motor System Inspired Pre-training Reinforcement Learning for Robotic Control. (arXiv:2311.07822v2 [cs.RO] UPDATED)

Authors: Pei Zhang, Zhaobo Hua, Jinliang Ding

Designing controllers to achieve natural motor capabilities for multi-joint robots is a significant challenge. However, animals in nature are naturally with basic motor abilities and can master various complex motor skills through acquired learning. On the basis of analyzing the mechanism of the central motor system in mammals, we propose a novel pre-training reinforcement learning algorithm that enables robots to learn rich motor skills and apply them to complex task environments without relying on external data. We first design a skill based network similar to the cerebellum by utilizing the selection mechanism of voluntary movements in the basal ganglia and the basic motor regulation ability of the cerebellum. Subsequently, by imitating the structure of advanced centers in the central motor system, we propose a high-level policy to generate different skill combinations, thereby enabling the robot to acquire natural motor abilities. We conduct experiments on 4 types of robots and 22 task environments, and the results show that the proposed method can enable different types of robots to achieve flexible motor skills. Overall, our research provides a promising framework for the design of neural network motor controllers.

Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code. (arXiv:2311.07989v3 [cs.CL] UPDATED)

Authors: Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, Rui Wang

In this work we systematically review the recent advancements in code processing with language models, covering 50+ models, 30+ evaluation tasks, 170+ datasets, and 700 related works. We break down code processing models into general language models represented by the GPT family and specialized models that are specifically pretrained on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the same course that had been taken by NLP. We also discuss code-specific features such as AST, CFG, and unit tests, along with their application in training code language models, and identify key challenges and potential future directions in this domain. We keep the survey open and updated on GitHub at https://github.com/codefuse-ai/Awesome-Code-LLM.

A Multi-In-Single-Out Network for Video Frame Interpolation without Optical Flow. (arXiv:2311.11602v3 [cs.CV] UPDATED)

Authors: Jaemin Lee, Minseok Seo, Sangwoo Lee, Hyobin Park, Dong-Geol Choi

In general, deep learning-based video frame interpolation (VFI) methods have predominantly focused on estimating motion vectors between two input frames and warping them to the target time. While this approach has shown impressive performance for linear motion between two input frames, it exhibits limitations when dealing with occlusions and nonlinear movements. Recently, generative models have been applied to VFI to address these issues. However, as VFI is not a task focused on generating plausible images, but rather on predicting accurate intermediate frames between two given frames, performance limitations still persist. In this paper, we propose a multi-in-single-out (MISO) based VFI method that does not rely on motion vector estimation, allowing it to effectively model occlusions and nonlinear motion. Additionally, we introduce a novel motion perceptual loss that enables MISO-VFI to better capture the spatio-temporal correlations within the video frames. Our MISO-VFI method achieves state-of-the-art results on VFI benchmarks Vimeo90K, Middlebury, and UCF101, with a significant performance gap compared to existing approaches.

ChatGPT and post-test probability. (arXiv:2311.12188v3 [cs.AI] UPDATED)

Authors: Samuel J. Weisenthal

Reinforcement learning-based large language models, such as ChatGPT, are believed to have potential to aid human experts in many domains, including healthcare. There is, however, little work on ChatGPT's ability to perform a key task in healthcare: formal, probabilistic medical diagnostic reasoning. This type of reasoning is used, for example, to update a pre-test probability to a post-test probability. In this work, we probe ChatGPT's ability to perform this task. In particular, we ask ChatGPT to give examples of how to use Bayes rule for medical diagnosis. Our prompts range from queries that use terminology from pure probability (e.g., requests for a "posterior probability") to queries that use terminology from the medical diagnosis literature (e.g., requests for a "post-test probability"). We show how the introduction of medical variable names leads to an increase in the number of errors that ChatGPT makes. Given our results, we also show how one can use prompt engineering to facilitate ChatGPT's partial avoidance of these errors. We discuss our results in light of recent commentaries on sensitivity and specificity. We also discuss how our results might inform new research directions for large language models.

CoSeR: Bridging Image and Language for Cognitive Super-Resolution. (arXiv:2311.16512v3 [cs.CV] UPDATED)

Authors: Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, Yujiu Yang

Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks. Code: https://github.com/VINHYU/CoSeR

Grounding Foundation Models through Federated Transfer Learning: A General Framework. (arXiv:2311.17431v4 [cs.LG] UPDATED)

Authors: Yan Kang, Tao Fan, Hanlin Gu, Lixin Fan, Qiang Yang

Foundation Models (FMs) such as GPT-4 encoded with vast knowledge and powerful emergent abilities have achieved remarkable success in various natural language processing and computer vision tasks. Grounding FMs by adapting them to domain-specific tasks or augmenting them with domain-specific knowledge enables us to exploit the full potential of FMs. However, grounding FMs faces several challenges, stemming primarily from constrained computing resources, data privacy, model heterogeneity, and model ownership. Federated Transfer Learning (FTL), the combination of federated learning and transfer learning, provides promising solutions to address these challenges. In recent years, the need for grounding FMs leveraging FTL, coined FTL-FM, has arisen strongly in both academia and industry. Motivated by the strong growth in FTL-FM research and the potential impact of FTL-FM on industrial applications, we propose an FTL-FM framework that formulates problems of grounding FMs in the federated learning setting, construct a detailed taxonomy based on the FTL-FM framework to categorize state-of-the-art FTL-FM works, and comprehensively overview FTL-FM works based on the proposed taxonomy. We also establish correspondences between FTL-FM and conventional phases of adapting FM so that FM practitioners can align their research works with FTL-FM. In addition, we overview advanced efficiency-improving and privacy-preserving techniques because efficiency and privacy are critical concerns in FTL-FM. Last, we discuss opportunities and future research directions of FTL-FM.

AlignBench: Benchmarking Chinese Alignment of Large Language Models. (arXiv:2311.18743v3 [cs.CL] UPDATED)

Authors: Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun, Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, Jie Tang

Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. Equipped with a human-in-the-loop data curation pipeline, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. Furthermore, we report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability. We will provide public APIs for evaluating AlignBench with CritiqueLLM to facilitate the evaluation of LLMs' Chinese alignment. All evaluation codes, data, and LLM generations are available at \url{https://github.com/THUDM/AlignBench}.

PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction. (arXiv:2312.00839v2 [cs.LG] UPDATED)

Authors: Lei Guan, Dongsheng Li, Jiye Liang, Wenjian Wang, Xicheng Lu

Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead and always provides quite a high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches across GPUs. To simultaneously address these two problems, in this paper, we propose an optimizer-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training. The key insight of our proposal is that we employ a weight prediction strategy in the forward pass to ensure that each mini-batch uses consistent and staleness-free weights to compute the forward pass. To be concrete, we first construct the weight prediction scheme based on the update rule of the used optimizer when training the deep neural network models. Then throughout the "1F1B" pipelined training, each mini-batch is mandated to execute weight prediction ahead of the forward pass, subsequently employing the predicted weights to perform the forward pass. As a result, PipeOptim 1) inherits the advantage of the "1F1B" schedule and generates pretty high throughput, and 2) can ensure effective parameter learning regardless of the type of the used optimizer. To verify the effectiveness of our proposal, we conducted extensive experimental evaluations using eight different deep-learning models spanning three machine-learning tasks including image classification, sentiment analysis, and machine translation. The experiment results demonstrate that PipeOptim outperforms the popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim can be accessible at https://github.com/guanleics/PipeOptim.

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers. (arXiv:2312.00878v2 [cs.CV] UPDATED)

Authors: Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne

Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. It shows that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.

Nash Learning from Human Feedback. (arXiv:2312.00886v2 [stat.ML] UPDATED)

Authors: Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution.

In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF).

In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.

Harnessing Discrete Representations For Continual Reinforcement Learning. (arXiv:2312.01203v2 [cs.LG] UPDATED)

Authors: Edan Meyer, Adam White, Marlos C. Machado

Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.

A Comprehensive Study of Vision Transformers in Image Classification Tasks. (arXiv:2312.01232v2 [cs.CV] UPDATED)

Authors: Mahmoud Khalil, Ahmad Khalil, Alioune Ngom

Image Classification is a fundamental task in the field of computer vision that frequently serves as a benchmark for gauging advancements in Computer Vision. Over the past few years, significant progress has been made in image classification due to the emergence of deep learning. However, challenges still exist, such as modeling fine-grained visual information, high computation costs, the parallelism of the model, and inconsistent evaluation protocols across datasets. In this paper, we conduct a comprehensive survey of existing papers on Vision Transformers for image classification. We first introduce the popular image classification datasets that influenced the design of models. Then, we present Vision Transformers models in chronological order, starting with early attempts at adapting attention mechanism to vision tasks followed by the adoption of vision transformers, as they have demonstrated success in capturing intricate patterns and long-range dependencies within images. Finally, we discuss open problems and shed light on opportunities for image classification to facilitate new research ideas.

The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model. (arXiv:2312.01656v2 [cs.IR] UPDATED)

Authors: Yilin Ye, Qian Zhu, Shishi Xiao, Kang Zhang, Wei Zeng

Image search is an essential and user-friendly method to explore vast galleries of digital images. However, existing image search methods heavily rely on proximity measurements like tag matching or image similarity, requiring precise user inputs for satisfactory results. To meet the growing demand for a contemporary image search engine that enables accurate comprehension of users' search intentions, we introduce an innovative user intent expansion framework. Our framework leverages visual-language models to parse and compose multi-modal user inputs to provide more accurate and satisfying results. It comprises two-stage processes: 1) a parsing stage that incorporates a language parsing module with large language models to enhance the comprehension of textual inputs, along with a visual parsing module that integrates an interactive segmentation module to swiftly identify detailed visual elements within images; and 2) a logic composition stage that combines multiple user search intents into a unified logic expression for more sophisticated operations in complex searching scenarios. Moreover, the intent expansion framework enables users to perform flexible contextualized interactions with the search results to further specify or adjust their detailed search intents iteratively. We implemented the framework into an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. Particularly the parsing and contextualized interactions prove useful in allowing users to express their search intents more accurately and engage in a more enjoyable iterative search experience.

Jellyfish: A Large Language Model for Data Preprocessing. (arXiv:2312.01678v2 [cs.AI] UPDATED)

Authors: Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

In this paper, we present Jellyfish, an open-source LLM as a universal task solver for DP. Built on the Llama 2 13B model, Jellyfish is instruction-tuned with the datasets of several typical DP tasks including error detection, data imputation, schema matching, and entity matching, and delivers generalizability to other tasks. Remarkably, Jellyfish can operate on a local, single, and low-priced GPU with its 13 billion parameters, ensuring data security and enabling further tuning. Its proficiency in understanding natural language allows users to manually craft instructions for DP tasks. Unlike many existing methods that heavily rely on prior knowledge, Jellyfish acquires domain knowledge during its tuning process and integrates optional knowledge injection during inference. A distinctive feature of Jellyfish is its interpreter, which elucidates its output decisions. To construct Jellyfish, we develop a series of pre-tuning and DP-tuning techniques. Jellyfish is equipped with an instance serializer, which automatically translates raw data into model prompts, and a knowledge injector, which optionally introduces task- and dataset-specific knowledge to enhance DP performance. Our evaluation of Jellyfish, using a range of real datasets, shows its competitiveness compared to state-of-the-art methods and its strong generalizability to unseen tasks. Jellyfish's performance rivals that of GPT series models, and its interpreter offers enhanced reasoning capabilities compared to GPT-3.5. Furthermore, our evaluation highlights the effectiveness of the techniques employed in constructing Jellyfish. Our model is available at Hugging Face: https://huggingface.co/NECOUDBFM/Jellyfish .

Hulk: A Universal Knowledge Translator for Human-Centric Tasks. (arXiv:2312.01697v2 [cs.CV] UPDATED)

Authors: Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang

Human-centric perception tasks, e.g., human mesh recovery, pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, most of them only excel in 2D vision tasks or require extensive fine-tuning for practical deployment in real-world scenarios. These limitations severely restrict their usability across various downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing most of the mainstream tasks simultaneously without task-specific finetuning, covering 2D vision, 3D vision, skeleton-based, and vision-language tasks. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. To validate the effectiveness of our proposed method, we conduct comprehensive experiments on 11 benchmarks across 8 human-centric tasks. Experimental results surpass previous methods substantially, demonstrating the superiority of our proposed method. The code will be available on https://github.com/OpenGVLab/HumanBench.

TriDeNT: Triple Deep Network Training for Privileged Knowledge Distillation in Histopathology. (arXiv:2312.02111v2 [cs.CV] UPDATED)

Authors: Lucas Farndale, Robert Insall, Ke Yuan

Computational pathology models rarely utilise data that will not be available for inference. This means most models cannot learn from highly informative data such as additional immunohistochemical (IHC) stains and spatial transcriptomics. We present TriDeNT, a novel self-supervised method for utilising privileged data that is not available during inference to improve performance. We demonstrate the efficacy of this method for a range of different paired data including immunohistochemistry, spatial transcriptomics and expert nuclei annotations. In all settings, TriDeNT outperforms other state-of-the-art methods in downstream tasks, with observed improvements of up to 101%. Furthermore, we provide qualitative and quantitative measurements of the features learned by these models and how they differ from baselines. TriDeNT offers a novel method to distil knowledge from scarce or costly data during training, to create significantly better models for routine inputs.

Competition-Level Problems are Effective LLM Evaluators. (arXiv:2312.02143v2 [cs.CL] UPDATED)

Authors: Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan Duan, Weizhu Chen

Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet there is ongoing debate about these abilities and the potential data contamination problem recently. This paper aims to evaluate the reasoning capacities of LLMs, specifically in solving recent competition-level programming problems in Codeforces, which are expert-crafted and unique, requiring deep understanding and robust reasoning skills. We first provide a comprehensive evaluation of GPT-4's peiceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered. Surprisingly, the peiceived performance of GPT-4 has experienced a cliff like decline in problems after September 2021 consistently across all the difficulties and types of problems, which shows the potential data contamination, as well as the challenges for any existing LLM to solve unseen complex reasoning problems. We further explore various approaches such as fine-tuning, Chain-of-Thought prompting and problem description simplification, unfortunately none of them is able to consistently mitigate the challenges. Through our work, we emphasis the importance of this excellent data source for assessing the genuine reasoning capabilities of LLMs, and foster the development of LLMs with stronger reasoning abilities and better generalization in the future.