Nonparametric Spatio-Temporal Joint Probabilistic Data Association Coupled Filter and Interfering Extended Target Tracking. (arXiv:2311.16106v1 [cs.CV])

Authors: Behzad Akbari, Haibin Zhu, Ya-Jun Pan, R.Tharmarasa

Extended target tracking estimates the centroid and shape of the target in space and time. In various situations where extended target tracking is applicable, the presence of multiple targets can lead to interference, particularly when they maneuver behind one another in a sensor like a camera. Nonetheless, when dealing with multiple extended targets, there's a tendency for them to share similar shapes within a group, which can enhance their detectability. For instance, the coordinated movement of a cluster of aerial vehicles might cause radar misdetections during their convergence or divergence. Similarly, in the context of a self-driving car, lane markings might split or converge, resulting in inaccurate lane tracking detections. A well-known joint probabilistic data association coupled (JPDAC) filter can address this problem in only a single-point target tracking. A variation of JPDACF was developed by introducing a nonparametric Spatio-Temporal Joint Probabilistic Data Association Coupled Filter (ST-JPDACF) to address the problem for extended targets. Using different kernel functions, we manage the dependency of measurements in space (inside a frame) and time (between frames). Kernel functions are able to be learned using a limited number of training data. This extension can be used for tracking the shape and dynamics of nonparametric dependent extended targets in clutter when targets share measurements. The proposed algorithm was compared with other well-known supervised methods in the interfering case and achieved promising results.

Transfer Learning between Motor Imagery Datasets using Deep Learning -- Validation of Framework and Comparison of Datasets. (arXiv:2311.16109v1 [cs.CV])

Authors: Pierre Guetschel, Michael Tangermann

We present a simple deep learning-based framework commonly used in computer vision and demonstrate its effectiveness for cross-dataset transfer learning in mental imagery decoding tasks that are common in the field of Brain-Computer Interfaces (BCI). We investigate, on a large selection of 12 motor-imagery datasets, which ones are well suited for transfer, both as donors and as receivers. Challenges. Deep learning models typically require long training times and are data-hungry, which impedes their use for BCI systems that have to minimize the recording time for (training) examples and are subject to constraints induced by experiments involving human subjects. A solution to both issues is transfer learning, but it comes with its own challenge, i.e., substantial data distribution shifts between datasets, subjects and even between subsequent sessions of the same subject. Approach. For every pair of pre-training (donor) and test (receiver) dataset, we first train a model on the donor before training merely an additional new linear classification layer based on a few receiver trials. Performance of this transfer approach is then tested on other trials of the receiver dataset. Significance. First, we lower the threshold to use transfer learning between motor imagery datasets: the overall framework is extremely simple and nevertheless obtains decent classification scores. Second, we demonstrate that deep learning models are a good option for motor imagery cross-dataset transfer both for the reasons outlined in the first point and because the framework presented is viable in online scenarios. Finally, analysing which datasets are best suited for transfer learning can be used as a reference for future researchers to determine which to use for pre-training or benchmarking.

Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Realistic Incomplete Data Scenarios. (arXiv:2311.16114v1 [cs.CV])

Authors: Qi Fan (1), Haolin Zuo (1), Rui Liu (1), Zheng Lian (2), Guanglai Gao (1) ((1) Inner Mongolia University, Hohhot, China, (2) Institute of Automation, Chinese Academy of Sciences, Beijing, China)

Multimodal emotion recognition (MER) in practical scenarios presents a significant challenge due to the presence of incomplete data, such as missing or noisy data. Traditional methods often discard missing data or replace it with a zero vector, neglecting the availability issue of noisy data. Consequently, these approaches are not fully applicable to realistic scenarios, where both missing and noisy data are prevalent. To address this problem, we propose a novel noise-robust MER model, named NMER, which effectively learns robust multimodal joint representations from incomplete data containing noise. Our approach incorporates two key components. First, we introduce a noise scheduler that adjusts the type and level of noise in the training data, emulating the characteristics of incomplete data in realistic scenarios. Second, we employ a Variational AutoEncoder (VAE)-based NMER model to generate robust multimodal joint representations from the noisy data, leveraging the modality invariant feature. The experimental results on the benchmark dataset IEMOCAP indicate the proposed NMER outperforms state-of-the-art MER systems. The ablation results also confirm the effectiveness of the VAE structure. We release our code at \href{https://github.com/WooyoohL/Noise-robust_MER.

Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models. (arXiv:2311.16117v1 [cs.CV])

Authors: Kota Sueyoshi, Takashi Matsubara

Diffusion models have achieved remarkable results in generating high-quality, diverse, and creative images. However, when it comes to text-based image generation, they often fail to capture the intended meaning presented in the text. For instance, a specified object may not be generated, an unnecessary object may be generated, and an adjective may alter objects it was not intended to modify. Moreover, we found that relationships indicating possession between objects are often overlooked. While users' intentions in text are diverse, existing methods tend to specialize in only some aspects of these. In this paper, we propose Predicated Diffusion, a unified framework to express users' intentions. We consider that the root of the above issues lies in the text encoder, which often focuses only on individual words and neglects the logical relationships between them. The proposed method does not solely rely on the text encoder, but instead, represents the intended meaning in the text as propositions using predicate logic and treats the pixels in the attention maps as the fuzzy predicates. This enables us to obtain a differentiable loss function that makes the image fulfill the proposition by minimizing it. When compared to several existing methods, we demonstrated that Predicated Diffusion can generate images that are more faithful to various text prompts, as verified by human evaluators and pretrained image-text models.

Sanity checks for patch visualisation in prototype-based image classification. (arXiv:2311.16120v1 [cs.CV])

Authors: Romain Xu-Darme (LSL, LIG), Georges Quénot (LIG), Zakaria Chihani (LSL), Marie-Christine Rousset (LIG)

In this work, we perform an analysis of the visualisation methods implemented in ProtoPNet and ProtoTree, two self-explaining visual classifiers based on prototypes. We show that such methods do not correctly identify the regions of interest inside of the images, and therefore do not reflect the model behaviour, which can create a false sense of bias in the model. We also demonstrate quantitatively that this issue can be mitigated by using other saliency methods that provide more faithful image patches.

Real-Time Neural Materials using Block-Compressed Features. (arXiv:2311.16121v1 [cs.CV])

Authors: Clément Weinreich, Louis de Oliveira, Antoine Houdard, Georges Nader

Neural materials typically consist of a collection of neural features along with a decoder network. The main challenge in integrating such models in real-time rendering pipelines lies in the large size required to store their features in GPU memory and the complexity of evaluating the network efficiently. We present a neural material model whose features and decoder are specifically designed to be used in real-time rendering pipelines. Our framework leverages hardware-based block compression (BC) texture formats to store the learned features and trains the model to output the material information continuously in space and scale. To achieve this, we organize the features in a block-based manner and emulate BC6 decompression during training, making it possible to export them as regular BC6 textures. This structure allows us to use high resolution features while maintaining a low memory footprint. Consequently, this enhances our model's overall capability, enabling the use of a lightweight and simple decoder architecture that can be evaluated directly in a shader. Furthermore, since the learned features can be decoded continuously, it allows for random uv sampling and smooth transition between scales without needing any subsequent filtering. As a result, our neural material has a small memory footprint, can be decoded extremely fast adding a minimal computational overhead to the rendering pipeline.

Semantic Generative Augmentations for Few-Shot Counting. (arXiv:2311.16122v1 [cs.CV])

Authors: Perla Doubinsky (CEDRIC - VERTIGO, CNAM), Nicolas Audebert (CEDRIC - VERTIGO, CNAM), Michel Crucianu (CEDRIC - VERTIGO), Hervé Le Borgne (CEA)

With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK.

Vision-Based Incoming Traffic Estimator Using Deep Neural Network on General Purpose Embedded Hardware. (arXiv:2311.16125v1 [cs.CV])

Authors: K. G. Zoysa, S. R. Munasinghe

Traffic management is a serious problem in many cities around the world. Even the suburban areas are now experiencing regular traffic congestion. Inappropriate traffic control wastes fuel, time, and the productivity of nations. Though traffic signals are used to improve traffic flow, they often cause problems due to inappropriate or obsolete timing that does not tally with the actual traffic intensity at the intersection. Traffic intensity determination based on statistical methods only gives the average intensity expected at any given time. However, to control traffic accurately, it is required to know the real-time traffic intensity. In this research, image processing and machine learning have been used to estimate actual traffic intensity in real time. General-purpose electronic hardware has been used for in-situ image processing based on the edge-detection method. A deep neural network (DNN) was trained to infer traffic intensity in each image in real time. The trained DNN estimated traffic intensity accurately in 90% of the real-time images during road tests. The electronic system was implemented on a Raspberry Pi single-board computer; hence, it is cost-effective for large-scale deployment.

SeamlessNeRF: Stitching Part NeRFs with Gradient Propagation. (arXiv:2311.16127v1 [cs.CV])

Authors: Bingchen Gong, Yuehao Wang, Xiaoguang Han, Qi Dou

Neural Radiance Fields (NeRFs) have emerged as promising digital mediums of 3D objects and scenes, sparking a surge in research to extend the editing capabilities in this domain. The task of seamless editing and merging of multiple NeRFs, resembling the ``Poisson blending'' in 2D image editing, remains a critical operation that is under-explored by existing work. To fill this gap, we propose SeamlessNeRF, a novel approach for seamless appearance blending of multiple NeRFs. In specific, we aim to optimize the appearance of a target radiance field in order to harmonize its merge with a source field. We propose a well-tailored optimization procedure for blending, which is constrained by 1) pinning the radiance color in the intersecting boundary area between the source and target fields and 2) maintaining the original gradient of the target. Extensive experiments validate that our approach can effectively propagate the source appearance from the boundary area to the entire target field through the gradients. To the best of our knowledge, SeamlessNeRF is the first work that introduces gradient-guided appearance editing to radiance fields, offering solutions for seamless stitching of 3D objects represented in NeRFs.

Effective Quantization for Diffusion Models on CPUs. (arXiv:2311.16133v1 [cs.CV])

Authors: Hanwen Chang, Haihao Shen, Yiyang Cai, Xinyu Ye, Zhenzhong Xu, Wenhua Cheng, Kaokao Lv, Weiwei Zhang, Yintong Lu, Heng Guo

Diffusion models have gained popularity for generating images from textual descriptions. Nonetheless, the substantial need for computational resources continues to present a noteworthy challenge, contributing to time-consuming processes. Quantization, a technique employed to compress deep learning models for enhanced efficiency, presents challenges when applied to diffusion models. These models are notably more sensitive to quantization compared to other model types, potentially resulting in a degradation of image quality. In this paper, we introduce a novel approach to quantize the diffusion models by leveraging both quantization-aware training and distillation. Our results show the quantized models can maintain the high image quality while demonstrating the inference efficiency on CPUs.

After-Stroke Arm Paresis Detection using Kinematic Data. (arXiv:2311.16138v1 [cs.CV])

Authors: Kenneth Lai, Mohammed Almekhlafi, Svetlana Yanushkevich

This paper presents an approach for detecting unilateral arm paralysis/weakness using kinematic data. Our method employs temporal convolution networks and recurrent neural networks, guided by knowledge distillation, where we use inertial measurement units attached to the body to capture kinematic information such as acceleration, rotation, and flexion of body joints during an action. This information is then analyzed to recognize body actions and patterns. Our proposed network achieves a high paretic detection accuracy of 97.99\%, with an action classification accuracy of 77.69\%, through knowledge sharing. Furthermore, by incorporating causal reasoning, we can gain additional insights into the patient's condition, such as their Fugl-Meyer assessment score or impairment level based on the machine learning result. Overall, our approach demonstrates the potential of using kinematic data and machine learning for detecting arm paralysis/weakness. The results suggest that our method could be a useful tool for clinicians and healthcare professionals working with patients with this condition.

Adapting Segment Anything Model (SAM) through Prompt-based Learning for Enhanced Protein Identification in Cryo-EM Micrographs. (arXiv:2311.16140v1 [cs.CV])

Authors: Fei He, Zhiyuan Yang, Mingyue Gao, Biplab Poudel, Newgin Sam Ebin Sam Dhas, Rajan Gyawali, Ashwin Dhakal, Jianlin Cheng, Dong Xu

Cryo-electron microscopy (cryo-EM) remains pivotal in structural biology, yet the task of protein particle picking, integral for 3D protein structure construction, is laden with manual inefficiencies. While recent AI tools such as Topaz and crYOLO are advancing the field, they do not fully address the challenges of cryo-EM images, including low contrast, complex shapes, and heterogeneous conformations. This study explored prompt-based learning to adapt the state-of-the-art image segmentation foundation model Segment Anything Model (SAM) for cryo-EM. This focus was driven by the desire to optimize model performance with a small number of labeled data without altering pre-trained parameters, aiming for a balance between adaptability and foundational knowledge retention. Through trials with three prompt-based learning strategies, namely head prompt, prefix prompt, and encoder prompt, we observed enhanced performance and reduced computational requirements compared to the fine-tuning approach. This work not only highlights the potential of prompting SAM in protein identification from cryo-EM micrographs but also suggests its broader promise in biomedical image segmentation and object detection.

Brain-Inspired Efficient Pruning: Exploiting Criticality in Spiking Neural Networks. (arXiv:2311.16141v1 [cs.NE])

Authors: Shuo Chen, Boxiao Liu, Haihang You

Spiking Neural Networks (SNNs) have been an attractive option for deployment on devices with limited computing resources and lower power consumption because of the event-driven computing characteristic. As such devices have limited computing and storage resources, pruning for SNNs has been widely focused recently. However, the binary and non-differentiable property of spike signals make pruning deep SNNs challenging, so existing methods require high time overhead to make pruning decisions. In this paper, inspired by critical brain hypothesis in neuroscience, we design a regeneration mechanism based on criticality to efficiently obtain the critical pruned networks. Firstly, we propose a low-cost metric for the criticality of pruning structures. Then we re-rank the pruned structures after pruning and regenerate those with higher criticality. We evaluate our method using VGG-16 and ResNet-19 for both unstructured pruning and structured pruning. Our method achieves higher performance compared to current state-of-the-art (SOTA) method with the same time overhead. We also achieve comparable performances (even better on VGG-16) compared to the SOTA method with 11.3x and 15.5x acceleration. Moreover, we investigate underlying mechanism of our method and find that it efficiently selects potential structures, learns the consistent feature representations and reduces the overfitting during the recovery phase.

Ransomware Detection and Classification using Machine Learning. (arXiv:2311.16143v1 [cs.CR])

Authors: Kavitha Kunku, ANK Zaman, Kaushik Roy

Vicious assaults, malware, and various ransomware pose a cybersecurity threat, causing considerable damage to computer structures, servers, and mobile and web apps across various industries and businesses. These safety concerns are important and must be addressed immediately. Ransomware detection and classification are critical for guaranteeing rapid reaction and prevention. This study uses the XGBoost classifier and Random Forest (RF) algorithms to detect and classify ransomware attacks. This approach involves analyzing the behaviour of ransomware and extracting relevant features that can help distinguish between different ransomware families. The models are evaluated on a dataset of ransomware attacks and demonstrate their effectiveness in accurately detecting and classifying ransomware. The results show that the XGBoost classifier, Random Forest Classifiers, can effectively detect and classify different ransomware attacks with high accuracy, thereby providing a valuable tool for enhancing cybersecurity.

Dual-Stream Attention Transformers for Sewer Defect Classification. (arXiv:2311.16145v1 [cs.CV])

Authors: Abdullah Al Redwan Newaz, Mahdi Abdeldguerfi, Kendall N. Niles, Joe Tom

We propose a dual-stream multi-scale vision transformer (DS-MSHViT) architecture that processes RGB and optical flow inputs for efficient sewer defect classification. Unlike existing methods that combine the predictions of two separate networks trained on each modality, we jointly train a single network with two branches for RGB and motion. Our key idea is to use self-attention regularization to harness the complementary strengths of the RGB and motion streams. The motion stream alone struggles to generate accurate attention maps, as motion images lack the rich visual features present in RGB images. To facilitate this, we introduce an attention consistency loss between the dual streams. By leveraging motion cues through a self-attention regularizer, we align and enhance RGB attention maps, enabling the network to concentrate on pertinent input regions. We evaluate our data on a public dataset as well as cross-validate our model performance in a novel dataset. Our method outperforms existing models that utilize either convolutional neural networks (CNNs) or multi-scale hybrid vision transformers (MSHViTs) without employing attention regularization between the two streams.

GeoTop: Advancing Image Classification with Geometric-Topological Analysis. (arXiv:2311.16157v1 [cs.CV])

Authors: Mariem Abaach, Ian Morilla

In this study, we explore the application of Topological Data Analysis (TDA) and Lipschitz-Killing Curvatures (LKCs) as powerful tools for feature extraction and classification in the context of biomedical multiomics problems. TDA allows us to capture topological features and patterns within complex datasets, while LKCs provide essential geometric insights. We investigate the potential of combining both methods to improve classification accuracy. Using a dataset of biomedical images, we demonstrate that TDA and LKCs can effectively extract topological and geometrical features, respectively. The combination of these features results in enhanced classification performance compared to using each method individually. This approach offers promising results and has the potential to advance our understanding of complex biological processes in various biomedical applications. Our findings highlight the value of integrating topological and geometrical information in biomedical data analysis. As we continue to delve into the intricacies of multiomics problems, the fusion of these insights holds great promise for unraveling the underlying biological complexities.

Vision Encoder-Decoder Models for AI Coaching. (arXiv:2311.16161v1 [cs.CV])

Authors: Jyothi S Nayak, Afifah Khan Mohammed Ajmal Khan, Chirag Manjeshwar, Imadh Ajaz Banday

This research paper introduces an innovative AI coaching approach by integrating vision-encoder-decoder models. The feasibility of this method is demonstrated using a Vision Transformer as the encoder and GPT-2 as the decoder, achieving a seamless integration of visual input and textual interaction. Departing from conventional practices of employing distinct models for image recognition and text-based coaching, our integrated architecture directly processes input images, enabling natural question-and-answer dialogues with the AI coach. This unique strategy simplifies model architecture while enhancing the overall user experience in human-AI interactions. We showcase sample results to demonstrate the capability of the model. The results underscore the methodology's potential as a promising paradigm for creating efficient AI coach models in various domains involving visual inputs. Importantly, this potential holds true regardless of the particular visual encoder or text decoder chosen. Additionally, we conducted experiments with different sizes of GPT-2 to assess the impact on AI coach performance, providing valuable insights into the scalability and versatility of our proposed methodology.

IODeep: an IOD for the introduction of deep learning in the DICOM standard. (arXiv:2311.16163v1 [eess.IV])

Authors: Salvatore Contino, Luca Cruciata, Orazio Gambino, Roberto Pirrone

In recent years, Artificial Intelligence (AI) and in particular Deep Neural Networks (DNN) became a relevant research topic in biomedical image segmentation due to the availability of more and more data sets along with the establishment of well known competitions. Despite the popularity of DNN based segmentation on the research side, these techniques are almost unused in the daily clinical practice even if they could support effectively the physician during the diagnostic process. Apart from the issues related to the explainability of the predictions of a neural model, such systems are not integrated in the diagnostic workflow, and a standardization of their use is needed to achieve this goal. This paper presents \textit{IODeep} a new DICOM Information Object Definition (IOD) aimed at storing both the weights and the architecture of a DNN already trained on a particular image dataset that is labeled as regards the acquisition modality, the anatomical region, and the disease under investigation. The IOD architecture is presented along with a DNN selection algorithm from the PACS server based on the labels outlined above, and a simple PACS viewer purposely designed for demonstrating the effectiveness of the DICOM integration, while no modifications are required on the PACS server side. The source code are freely available at https://github.com/CHILab1/IODeep.git

Shortcut Bias Mitigation via Ensemble Diversity Using Diffusion Probabilistic Models. (arXiv:2311.16176v1 [cs.LG])

Authors: Luca Scimeca, Alexander Rubinstein, Damien Teney, Seong Joon Oh, Armand Mihai Nicolicioiu, Yoshua Bengio

Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as simplicity bias, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs) for shortcut bias mitigation. We show that at particular training intervals, DPMs can generate images with novel feature combinations, even when trained on images displaying correlated input features. We leverage this crucial property to generate synthetic counterfactuals to increase model diversity via ensemble disagreement. We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals. We further empirically quantify its efficacy on several diversification objectives, and finally show improved generalization and diversification performance on par with prior work that relies on auxiliary data collection.

Next-gen traffic surveillance: AI-assisted mobile traffic violation detection system. (arXiv:2311.16179v1 [cs.CV])

Authors: Dila Dede, Mehmet Ali Sarsıl, Ata Shaker, Olgu Altıntaş, Onur Ergen

Road traffic accidents pose a significant global public health concern, leading to injuries, fatalities, and vehicle damage. Approximately 1,3 million people lose their lives daily due to traffic accidents [World Health Organization, 2022]. Addressing this issue requires accurate traffic law violation detection systems to ensure adherence to regulations. The integration of Artificial Intelligence algorithms, leveraging machine learning and computer vision, has facilitated the development of precise traffic rule enforcement. This paper illustrates how computer vision and machine learning enable the creation of robust algorithms for detecting various traffic violations. Our model, capable of identifying six common traffic infractions, detects red light violations, illegal use of breakdown lanes, violations of vehicle following distance, breaches of marked crosswalk laws, illegal parking, and parking on marked crosswalks. Utilizing online traffic footage and a self-mounted on-dash camera, we apply the YOLOv5 algorithm's detection module to identify traffic agents such as cars, pedestrians, and traffic signs, and the strongSORT algorithm for continuous interframe tracking. Six discrete algorithms analyze agents' behavior and trajectory to detect violations. Subsequently, an Identification Module extracts vehicle ID information, such as the license plate, to generate violation notices sent to relevant authorities.

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP. (arXiv:2311.16194v1 [cs.CV])

Authors: Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, Wei Liu

Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, one victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor, existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model, which makes them inapplicable to data-limited scenarios. In this work, motivated by the recent success of learnable prompts, we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method named BadCLIP is built on a novel and effective mechanism in backdoor attacks on CLIP, i.e., influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator, such that the trigger can change text features via trigger-aware prompts, resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to those of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes, and shows a strong generalization capability under cross-dataset and cross-domain settings.

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. (arXiv:2311.16201v1 [cs.CV])

Authors: Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.

Continual Instruction Tuning for Large Multimodal Models. (arXiv:2311.16206v1 [cs.LG])

Authors: Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang

Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks are constantly being created in practice. Instead of always re-training LMMs when new tasks arrive, continual learning offers flexibility for models to continually and efficiently exploit the evolving data. This work aims to explore the following two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the existing three classes of continual learning methods still applicable to the continual instruction tuning of LMMs? An extensive study is conducted to address the above questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when continually instruction-tuning LMMs. However, the multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting. Second, we integrate and adapt classic continual learning methods to our context, demonstrating the efficacy of data replay and model expansion strategies across diverse scenarios. In contrast, regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Third, we delve into the correlation and forgetting dynamics between vision-language task pairs and propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs. Experimental results show that our approach consistently boosts the model's performance.

Seeing Beyond Cancer: Multi-Institutional Validation of Object Localization and 3D Semantic Segmentation using Deep Learning for Breast MRI. (arXiv:2311.16213v1 [eess.IV])

Authors: Arda Pekis, Vignesh Kannan, Evandros Kaklamanos, Anu Antony, Snehal Patel, Tyler Earnest

The clinical management of breast cancer depends on an accurate understanding of the tumor and its anatomical context to adjacent tissues and landmark structures. This context may be provided by semantic segmentation methods; however, previous works have been largely limited to a singular focus on the tumor alone and rarely other tissue types. In contrast, we present a method that exploits tissue-tissue interactions to accurately segment every major tissue type in the breast including: chest wall, skin, adipose tissue, fibroglandular tissue, vasculature and tumor via standard-of-care Dynamic Contrast Enhanced MRI. Comparing our method to prior state-of-the-art, we achieved a superior Dice score on tumor segmentation while maintaining competitive performance on other studied tissues across multiple institutions. Briefly, our method proceeds by localizing the tumor using 2D object detectors, then segmenting the tumor and surrounding tissues independently using two 3D U-nets, and finally integrating these results while mitigating false positives by checking for anatomically plausible tissue-tissue contacts. The object detection models were pre-trained on ImageNet and COCO, and operated on MIP (maximum intensity projection) images in the axial and sagittal planes, establishing a 3D tumor bounding box. By integrating multiple relevant peri-tumoral tissues, our work enables clinical applications in breast cancer staging, prognosis and surgical planning.

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance. (arXiv:2311.16241v1 [cs.CV])

Authors: Lukas Hoyer, David Joseph Tan, Muhammad Ferjad Naeem, Luc Van Gool, Federico Tombari

In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by providing the model with language guidance in the form of class definitions. We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state-of-the-art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 labels. Project page: https://github.com/google-research/semivl

Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation. (arXiv:2311.16254v1 [cs.CV])

Authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Vision-and-Language models such as CLIP have demonstrated remarkable effectiveness across a wide range of tasks. However, these models are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concern in their adoption. To overcome these limitations, we introduce a methodology to make Vision-and-Language models safer by removing their sensitivity to not-safe-for-work concepts. We show how this can be done by distilling from a large language model which converts between safe and unsafe sentences and which is fine-tuned starting from just 100 manually-curated pairs. We conduct extensive experiments on the resulting embedding space for both retrieval and text-to-image generation, where we show that our model can also be properly employed with pre-trained image generators. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.

RelVAE: Generative Pretraining for few-shot Visual Relationship Detection. (arXiv:2311.16261v1 [cs.CV])

Authors: Sotiris Karapiperis, Markos Diomataris, Vassilis Pitsikalis

Visual relations are complex, multimodal concepts that play an important role in the way humans perceive the world. As a result of their complexity, high-quality, diverse and large scale datasets for visual relations are still absent. In an attempt to overcome this data barrier, we choose to focus on the problem of few-shot Visual Relationship Detection (VRD), a setting that has been so far neglected by the community. In this work we present the first pretraining method for few-shot predicate classification that does not require any annotated relations. We achieve this by introducing a generative model that is able to capture the variation of semantic, visual and spatial information of relations inside a latent space and later exploiting its representations in order to achieve efficient few-shot classification. We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets where our model outperforms the baselines. Lastly we attempt to interpret the decisions of the model by conducting various qualitative experiments.

VehicleGAN: Pair-flexible Pose Guided Image Synthesis for Vehicle Re-identification. (arXiv:2311.16278v1 [cs.CV])

Authors: Baolu Li, Ping Liu, Lan Fu, Jinlong Li, Jianwu Fang, Zhigang Xu, Hongkai Yu

Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, the different camera view angle leading to confused discrimination in the feature subspace for the vehicles of various poses, is still challenging for the Vehicle Re-ID models in the real world. To promote the Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in the target pose, whose idea is to project the vehicles of diverse poses into the unified target pose so as to enhance feature discrimination. Considering that the paired data of the same vehicles in different traffic surveillance cameras might be not available in the real world, we propose the first Pair-flexible Pose Guided Image Synthesis method for Vehicle Re-ID, named as VehicleGAN in this paper, which works for both supervised and unsupervised settings without the knowledge of geometric 3D models. Because of the feature distribution difference between real and synthetic data, simply training a traditional metric learning based Re-ID model with data-level fusion (i.e., data augmentation) is not satisfactory, therefore we propose a new Joint Metric Learning (JML) via effective feature-level fusion from both real and synthetic data. Intensive experimental results on the public VeRi-776 and VehicleID datasets prove the accuracy and effectiveness of our proposed VehicleGAN and JML.

Aligning Non-Causal Factors for Transformer-Based Source-Free Domain Adaptation. (arXiv:2311.16294v1 [cs.CV])

Authors: Sunandini Sanyal, Ashish Ramayee Asokan, Suvaansh Bhambri, Pradyumna YM, Akshay Kulkarni, Jogendra Nath Kundu, R Venkatesh Babu

Conventional domain adaptation algorithms aim to achieve better generalization by aligning only the task-discriminative causal factors between a source and target domain. However, we find that retaining the spurious correlation between causal and non-causal factors plays a vital role in bridging the domain gap and improving target adaptation. Therefore, we propose to build a framework that disentangles and supports causal factor alignment by aligning the non-causal factors first. We also investigate and find that the strong shape bias of vision transformers, coupled with its multi-head attention, make it a suitable architecture for realizing our proposed disentanglement. Hence, we propose to build a Causality-enforcing Source-Free Transformer framework (C-SFTrans) to achieve disentanglement via a novel two-stage alignment approach: a) non-causal factor alignment: non-causal factors are aligned using a style classification task which leads to an overall global alignment, b) task-discriminative causal factor alignment: causal factors are aligned via target adaptation. We are the first to investigate the role of vision transformers (ViTs) in a privacy-preserving source-free setting. Our approach achieves state-of-the-art results in several DA benchmarks.

Robust Self-calibration of Focal Lengths from the Fundamental Matrix. (arXiv:2311.16304v1 [cs.CV])

Authors: Viktor Kocur, Daniel Kyselica, Zuzana Kúkelová

The problem of self-calibration of two cameras from a given fundamental matrix is one of the basic problems in geometric computer vision. Under the assumption of known principal points and square pixels, the well-known Bougnoux formula offers a means to compute the two unknown focal lengths. However, in many practical situations, the formula yields inaccurate results due to commonly occurring singularities. Moreover, the estimates are sensitive to noise in the computed fundamental matrix and to the assumed positions of the principal points. In this paper, we therefore propose an efficient and robust iterative method to estimate the focal lengths along with the principal points of the cameras given a fundamental matrix and priors for the estimated camera parameters. In addition, we study a computationally efficient check of models generated within RANSAC that improves the accuracy of the estimated models while reducing the total computational time. Extensive experiments on real and synthetic data show that our iterative method brings significant improvements in terms of the accuracy of the estimated focal lengths over the Bougnoux formula and other state-of-the-art methods, even when relying on inaccurate priors.

Characterizing Video Question Answering with Sparsified Inputs. (arXiv:2311.16311v1 [cs.CV])

Authors: Shiyuan Huang, Robinson Piramuthu, Vicente Ordonez, Shih-Fu Chang, Gunnar A. Sigurdsson

In Video Question Answering, videos are often processed as a full-length sequence of frames to ensure minimal loss of information. Recent works have demonstrated evidence that sparse video inputs are sufficient to maintain high performance. However, they usually discuss the case of single frame selection. In our work, we extend the setting to multiple number of inputs and other modalities. We characterize the task with different input sparsity and provide a tool for doing that. Specifically, we use a Gumbel-based learnable selection module to adaptively select the best inputs for the final task. In this way, we experiment over public VideoQA benchmarks and provide analysis on how sparsified inputs affect the performance. From our experiments, we have observed only 5.2%-5.8% loss of performance with only 10% of video lengths, which corresponds to 2-4 frames selected from each video. Meanwhile, we also observed the complimentary behaviour between visual and textual inputs, even under highly sparsified settings, suggesting the potential of improving data efficiency for video-and-language tasks.

Domain-Specific Deep Learning Feature Extractor for Diabetic Foot Ulcer Detection. (arXiv:2311.16312v1 [cs.CV])

Authors: Reza Basiri, Milos R. Popovic, Shehroz S. Khan

Diabetic Foot Ulcer (DFU) is a condition requiring constant monitoring and evaluations for treatment. DFU patient population is on the rise and will soon outpace the available health resources. Autonomous monitoring and evaluation of DFU wounds is a much-needed area in health care. In this paper, we evaluate and identify the most accurate feature extractor that is the core basis for developing a deep-learning wound detection network. For the evaluation, we used mAP and F1-score on the publicly available DFU2020 dataset. A combination of UNet and EfficientNetb3 feature extractor resulted in the best evaluation among the 14 networks compared. UNet and Efficientnetb3 can be used as the classifier in the development of a comprehensive DFU domain-specific autonomous wound detection pipeline.

Multi-3D-Models Registration-Based Augmented Reality (AR) Instructions for Assembly. (arXiv:2311.16337v1 [cs.HC])

Authors: Seda Tuzun Canadinc, Wei Yan

This paper introduces a novel, markerless, step-by-step, in-situ 3D Augmented Reality (AR) instruction method and its application - BRICKxAR (Multi 3D Models/M3D) - for small parts assembly. BRICKxAR (M3D) realistically visualizes rendered 3D assembly parts at the assembly location of the physical assembly model (Figure 1). The user controls the assembly process through a user interface. BRICKxAR (M3D) utilizes deep learning-trained 3D model-based registration. Object recognition and tracking become challenging as the assembly model updates at each step. Additionally, not every part in a 3D assembly may be visible to the camera during the assembly. BRICKxAR (M3D) combines multiple assembly phases with a step count to address these challenges. Thus, using fewer phases simplifies the complex assembly process while step count facilitates accurate object recognition and precise visualization of each step. A testing and heuristic evaluation of the BRICKxAR (M3D) prototype and qualitative analysis were conducted with users and experts in visualization and human-computer interaction. Providing robust 3D AR instructions and allowing the handling of the assembly model, BRICKxAR (M3D) has the potential to be used at different scales ranging from manufacturing assembly to construction.

Spatially Adaptive Cloth Regression with Implicit Neural Representations. (arXiv:2311.16344v1 [cs.CV])

Authors: Lei Shu, Vinicius Azevedo, Barbara Solenthaler, Markus Gross

The accurate representation of fine-detailed cloth wrinkles poses significant challenges in computer graphics. The inherently non-uniform structure of cloth wrinkles mandates the employment of intricate discretization strategies, which are frequently characterized by high computational demands and complex methodologies. Addressing this, the research introduced in this paper elucidates a novel anisotropic cloth regression technique that capitalizes on the potential of implicit neural representations of surfaces. Our first core contribution is an innovative mesh-free sampling approach, crafted to reduce the reliance on traditional mesh structures, thereby offering greater flexibility and accuracy in capturing fine cloth details. Our second contribution is a novel adversarial training scheme, which is designed meticulously to strike a harmonious balance between the sampling and simulation objectives. The adversarial approach ensures that the wrinkles are represented with high fidelity, while also maintaining computational efficiency. Our results showcase through various cloth-object interaction scenarios that our method, given the same memory constraints, consistently surpasses traditional discrete representations, particularly when modelling highly-detailed localized wrinkles.

Small and Dim Target Detection in IR Imagery: A Review. (arXiv:2311.16346v1 [cs.CV])

Authors: Nikhil Kumar, Pravendra Singh

While there has been significant progress in object detection using conventional image processing and machine learning algorithms, exploring small and dim target detection in the IR domain is a relatively new area of study. The majority of small and dim target detection methods are derived from conventional object detection algorithms, albeit with some alterations. The task of detecting small and dim targets in IR imagery is complex. This is because these targets often need distinct features, the background is cluttered with unclear details, and the IR signatures of the scene can change over time due to fluctuations in thermodynamics. The primary objective of this review is to highlight the progress made in this field. This is the first review in the field of small and dim target detection in infrared imagery, encompassing various methodologies ranging from conventional image processing to cutting-edge deep learning-based approaches. The authors have also introduced a taxonomy of such approaches. There are two main types of approaches: methodologies using several frames for detection, and single-frame-based detection techniques. Single frame-based detection techniques encompass a diverse range of methods, spanning from traditional image processing-based approaches to more advanced deep learning methodologies. Our findings indicate that deep learning approaches perform better than traditional image processing-based approaches. In addition, a comprehensive compilation of various available datasets has also been provided. Furthermore, this review identifies the gaps and limitations in existing techniques, paving the way for future research and development in this area.

Improving Denoising Diffusion Probabilistic Models via Exploiting Shared Representations. (arXiv:2311.16353v1 [cs.LG])

Authors: Delaram Pirhayatifard, Mohammad Taha Toghani, Guha Balakrishnan, César A. Uribe

In this work, we address the challenge of multi-task image generation with limited data for denoising diffusion probabilistic models (DDPM), a class of generative models that produce high-quality images by reversing a noisy diffusion process. We propose a novel method, SR-DDPM, that leverages representation-based techniques from few-shot learning to effectively learn from fewer samples across different tasks. Our method consists of a core meta architecture with shared parameters, i.e., task-specific layers with exclusive parameters. By exploiting the similarity between diverse data distributions, our method can scale to multiple tasks without compromising the image quality. We evaluate our method on standard image datasets and show that it outperforms both unconditional and conditional DDPM in terms of FID and SSIM metrics.

Model-free Test Time Adaptation for Out-Of-Distribution Detection. (arXiv:2311.16420v1 [cs.LG])

Authors: YiFan Zhang, Xue Wang, Tian Zhou, Kun Yuan, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan

Out-of-distribution (OOD) detection is essential for the reliability of ML models. Most existing methods for OOD detection learn a fixed decision criterion from a given in-distribution dataset and apply it universally to decide if a data point is OOD. Recent work~\cite{fang2022is} shows that given only in-distribution data, it is impossible to reliably detect OOD data without extra assumptions. Motivated by the theoretical result and recent exploration of test-time adaptation methods, we propose a Non-Parametric Test Time \textbf{Ada}ptation framework for \textbf{O}ut-Of-\textbf{D}istribution \textbf{D}etection (\abbr). Unlike conventional methods, \abbr utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions. The framework incorporates detected OOD instances into decision-making, reducing false positive rates, particularly when ID and OOD distributions overlap significantly. We demonstrate the effectiveness of \abbr through comprehensive experiments on multiple OOD detection benchmarks, extensive empirical studies show that \abbr significantly improves the performance of OOD detection over state-of-the-art methods. Specifically, \abbr reduces the false positive rate (FPR95) by $23.23\%$ on the CIFAR-10 benchmarks and $38\%$ on the ImageNet-1k benchmarks compared to the advanced methods. Lastly, we theoretically verify the effectiveness of \abbr.

Manifold Preserving Guided Diffusion. (arXiv:2311.16424v1 [cs.LG])

Authors: Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon

Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8x speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.

Text-Driven Image Editing via Learnable Regions. (arXiv:2311.16432v1 [cs.CV])

Authors: Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, Ming-Hsuan Yang

Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided. Our project webpage: https://yuanze-lin.me/LearnableRegions_page.

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos. (arXiv:2311.16444v1 [cs.CV])

Authors: Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato

We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view is constantly shifting as the camera wearer moves. This necessitates the in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between these datasets assuming their ground-truth is accessible. To bridge the view gaps, we propose a view-invariant learning method using adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn invariant features against the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gaps between both datasets. We validate our proposed method by studying how effectively it overcomes the view change problem and efficiently transfers the knowledge to the egocentric domain. Our benchmark pushes the study of the cross-view transfer into a new task domain of dense video captioning and will envision methodologies to describe egocentric videos in natural language.

CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models. (arXiv:2311.16445v1 [cs.CV])

Authors: Yichao Cai, Yuhang Liu, Zhen Zhang, Javen Qinfeng Shi

Contrastive vision-language models, e.g., CLIP, have garnered substantial attention for their exceptional generalization capabilities. However, their robustness to perturbations has ignited concerns. Existing strategies typically reinforce their resilience against adversarial examples by enabling the image encoder to "see" these perturbed examples, often necessitating a complete retraining of the image encoder on both natural and adversarial samples. In this study, we propose a new method to enhance robustness solely through text augmentation, eliminating the need for retraining the image encoder on adversarial examples. Our motivation arises from the realization that text and image data inherently occupy a shared latent space, comprising latent content variables and style variables. This insight suggests the feasibility of learning to disentangle these latent content variables using text data exclusively. To accomplish this, we introduce an effective text augmentation method that focuses on modifying the style while preserving the content in the text data. By changing the style part of the text data, we empower the text encoder to emphasize latent content variables, ultimately enhancing the robustness of vision-language models. Our experiments across various datasets demonstrate substantial improvements in the robustness of the pre-trained CLIP model.

Centre Stage: Centricity-based Audio-Visual Temporal Action Detection. (arXiv:2311.16446v1 [cs.CV])

Authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett

Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries. Thus, we propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score. This leads to increased confidence for proposals that exhibit more precise boundaries. Our method can be integrated with other one-stage anchor-free architectures and we demonstrate this on three recent baselines on the EPIC-Kitchens-100 action detection benchmark where we achieve state-of-the-art performance. Detailed ablation studies showcase the benefits of fusing audio and our proposed centricity scores. Code and models for our proposed method are publicly available at https://github.com/hanielwang/Audio-Visual-TAD.git

TopoSemiSeg: Enforcing Topological Consistency for Semi-Supervised Segmentation of Histopathology Images. (arXiv:2311.16447v1 [eess.IV])

Authors: Meilong Xu, Xiaoling Hu, Saumya Gupta, Shahira Abousamra, Chao Chen

In computational pathology, segmenting densely distributed objects like glands and nuclei is crucial for downstream analysis. To alleviate the burden of obtaining pixel-wise annotations, semi-supervised learning methods learn from large amounts of unlabeled data. Nevertheless, existing semi-supervised methods overlook the topological information hidden in the unlabeled images and are thus prone to topological errors, e.g., missing or incorrectly merged/separated glands or nuclei. To address this issue, we propose TopoSemiSeg, the first semi-supervised method that learns the topological representation from unlabeled data. In particular, we propose a topology-aware teacher-student approach in which the teacher and student networks learn shared topological representations. To achieve this, we introduce topological consistency loss, which contains signal consistency and noise removal losses to ensure the learned representation is robust and focuses on true topological signals. Extensive experiments on public pathology image datasets show the superiority of our method, especially on topology-wise evaluation metrics. Code is available at https://github.com/Melon-Xu/TopoSemiSeg.

Typhoon Intensity Prediction with Vision Transformer. (arXiv:2311.16450v1 [cs.CV])

Authors: Huanxin Chen, Pengshuai Yin, Huichou Huang, Qingyao Wu, Ruirui Liu, Xiatian Zhu

Predicting typhoon intensity accurately across space and time is crucial for issuing timely disaster warnings and facilitating emergency response. This has vast potential for minimizing life losses and property damages as well as reducing economic and environmental impacts. Leveraging satellite imagery for scenario analysis is effective but also introduces additional challenges due to the complex relations among clouds and the highly dynamic context. Existing deep learning methods in this domain rely on convolutional neural networks (CNNs), which suffer from limited per-layer receptive fields. This limitation hinders their ability to capture long-range dependencies and global contextual knowledge during inference. In response, we introduce a novel approach, namely "Typhoon Intensity Transformer" (Tint), which leverages self-attention mechanisms with global receptive fields per layer. Tint adopts a sequence-to-sequence feature representation learning perspective. It begins by cutting a given satellite image into a sequence of patches and recursively employs self-attention operations to extract both local and global contextual relations between all patch pairs simultaneously, thereby enhancing per-patch feature representation learning. Extensive experiments on a publicly available typhoon benchmark validate the efficacy of Tint in comparison with both state-of-the-art deep learning and conventional meteorological methods. Our code is available at https://github.com/chen-huanxin/Tint.

Spiking Neural Networks with Dynamic Time Steps for Vision Transformers. (arXiv:2311.16456v1 [cs.CV])

Authors: Gourav Datta, Zeyu Liu, Anni Li, Peter A. Beerel

Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved latency and energy efficiency, however, they target only convolutional neural networks (CNN). These algorithms, when applied on the recently spotlighted vision transformers (ViT), either require a large number of time steps or fail to converge. Based on analysis of the histograms of the ANN and SNN activation maps, we hypothesize that each ViT block has a different sensitivity to the number of time steps. We propose a novel training framework that dynamically allocates the number of time steps to each ViT module depending on a trainable score assigned to each timestep. In particular, we generate a scalar binary time step mask that filters spikes emitted by each neuron in a leaky-integrate-and-fire (LIF) layer. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), except for the input embedding layer, in contrast to expensive multiply-and-accumulates (MAC) needed in traditional ViTs. This yields significant improvements in energy efficiency. We evaluate our training framework and resulting SNNs on image recognition tasks including CIFAR10, CIFAR100, and ImageNet with different ViT architectures. We obtain a test accuracy of 95.97% with 4.97 time steps with direct encoding on CIFAR10.

Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information. (arXiv:2311.16462v1 [cs.CV])

Authors: Jie Li, Zhixin Li, Zhi Liu, Pengyuan Zhou, Richang Hong, Qiyue Li, Han Hu

Volumetric video, also known as hologram video, is a novel medium that portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). It is expected to be the next-gen video technology and a prevalent use case for 5G and beyond wireless communication. Considering that each user typically only watches a section of the volumetric video, known as the viewport, it is essential to have precise viewport prediction for optimal performance. However, research on this topic is still in its infancy. In the end, this paper presents and proposes a novel approach, named Saliency and Trajectory Viewport Prediction (STVP), which aims to improve the precision of viewport prediction in volumetric video streaming. The STVP extensively utilizes video saliency information and viewport trajectory. To our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity while still preserving video features in an efficient manner. Then we present a saliency detection technique that incorporates both spatial and temporal information for detecting static, dynamic geometric, and color salient regions. Finally, we intelligently fuse saliency and trajectory information to achieve more accurate viewport prediction. We conduct extensive simulations to evaluate the effectiveness of our proposed viewport prediction methods using state-of-the-art volumetric video sequences. The experimental results show the superiority of the proposed method over existing schemes. The dataset and source code will be publicly accessible after acceptance.

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection. (arXiv:2311.16464v1 [cs.CV])

Authors: Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, Xiu Li

Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architecture. However, we observe that the emphasis of MR and HD differs, with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently, the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialty of two tasks. To tackle the issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra and inter-modality across multi-granularity, UVCOM achieves the comprehensive understanding in processing a video. Moreover, we present multi-aspect contrastive learning to consolidate the local relation modeling and global knowledge accumulation via well aligned multi-modal space. Extensive experiments on QVHighlights, Charades-STA, TACoS , YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which outperforms the state-of-the-art methods by a remarkable margin.

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering. (arXiv:2311.16465v1 [cs.CV])

Authors: Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

The diffusion model has been proven a powerful generative model in recent years, yet remains a challenge in generating visual text. Several methods alleviated this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained capability of layout prediction, and restricted style diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering. Firstly, we fine-tune a large language model for layout planning. The large language model is capable of automatically generating keywords for text rendering and also supports layout modification through chatting. Secondly, we utilize the language model within the diffusion model to encode the position and texts at the line level. Unlike previous methods that employed tight character-level guidance, this approach generates more diverse text images. We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V, validating TextDiffuser-2's capacity to achieve a more rational text layout and generation with enhanced diversity. The code and model will be available at \url{https://aka.ms/textdiffuser-2}.

AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond. (arXiv:2311.16468v1 [cs.CV])

Authors: Zixiang Zhou, Yu Wan, Baoyuan Wang

Large Language Models(LLMs) have shown remarkable emergent abilities in unifying almost all (if not every) NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. Inspired by InstuctGPT, and the generalist concept behind Gato, we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, generations as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the universal interface, constituting a closed-loop within the framework. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vocabulary of LLM. Then, an unsupervised pipeline to generate natural language descriptions of human action sequences from in-the-wild videos is developed. Finally, all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks, and promising results on high-level tasks, demonstrating the effectiveness of our proposed All-in-One framework. Moreover, for the first time, AvatarGPT enables a principled approach by iterative traversal of the tasks within the closed-loop for unlimited long-motion synthesis.

A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis. (arXiv:2311.16471v1 [cs.CV])

Authors: Zixiang Zhou, Yu Wan, Baoyuan Wang

The field has made significant progress in synthesizing realistic human motion driven by various modalities. Yet, the need for different methods to animate various body parts according to different control signals limits the scalability of these techniques in practical scenarios. In this paper, we introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation. Our methodology unfolds in several steps: We begin by quantizing the motions of diverse body parts into separate codebooks tailored to their respective domains. Next, we harness the robust capabilities of pre-trained models to transcode multimodal signals into a shared latent space. We then translate these signals into discrete motion tokens by iteratively predicting subsequent tokens to form a complete sequence. Finally, we reconstruct the continuous actual motion from this tokenized sequence. Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal. This approach is inherently scalable, allowing for the easy integration of new modalities. Extensive experiments demonstrated the effectiveness of our design, emphasizing its potential for broad application.

GS-IR: 3D Gaussian Splatting for Inverse Rendering. (arXiv:2311.16473v1 [cs.CV])

Authors: Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia

We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results. Unlike previous works that use implicit neural representations and volume rendering (e.g. NeRF), which suffer from low expressive power and high computational complexity, we extend GS, a top-performance representation for novel view synthesis, to estimate scene geometry, surface material, and environment illumination from multi-view images captured under unknown lighting conditions. There are two main problems when introducing GS to inverse rendering: 1) GS does not support producing plausible normal natively; 2) forward mapping (e.g. rasterization and splatting) cannot trace the occlusion like backward mapping (e.g. ray tracing). To address these challenges, our GS-IR proposes an efficient optimization scheme that incorporates a depth-derivation-based regularization for normal estimation and a baking-based occlusion to model indirect lighting. The flexible and expressive GS representation allows us to achieve fast and compact geometry reconstruction, photorealistic novel view synthesis, and effective physically-based rendering. We demonstrate the superiority of our method over baseline methods through qualitative and quantitative evaluations on various challenging scenes.

Progressive Target-Styled Feature Augmentation for Unsupervised Domain Adaptation on Point Clouds. (arXiv:2311.16474v1 [cs.CV])

Authors: Zicheng Wang, Zhen Zhao, Yiming Wu, Luping Zhou, Dong Xu

Unsupervised domain adaptation is a critical challenge in the field of point cloud analysis, as models trained on one set of data often struggle to perform well in new scenarios due to domain shifts. Previous works tackle the problem by using adversarial training or self-supervised learning for feature extractor adaptation, but ensuring that features extracted from the target domain can be distinguished by the source-supervised classifier remains challenging. In this work, we propose a novel approach called progressive target-styled feature augmentation (PTSFA). Unlike previous works that focus on feature extractor adaptation, our PTSFA approach focuses on classifier adaptation. It aims to empower the classifier to recognize target-styled source features and progressively adapt to the target domain. To enhance the reliability of predictions within the PTSFA framework and encourage discriminative feature extraction, we further introduce a new intermediate domain approaching (IDA) strategy. We validate our method on the benchmark datasets, where our method achieves new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/PTSFA.

Generating Human-Centric Visual Cues for Human-Object Interaction Detection via Large Vision-Language Models. (arXiv:2311.16475v1 [cs.CV])

Authors: Yu-Wei Zhan, Fan Liu, Xin Luo, Liqiang Nie, Xin-Shun Xu, Mohan Kankanhalli

Human-object interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, the complexity of human behavior and the diverse contexts in which these interactions occur make it challenging. Intuitively, human-centric visual cues, such as the involved participants, the body language, and the surrounding environment, play crucial roles in shaping these interactions. These cues are particularly vital in interpreting unseen interactions. In this paper, we propose three prompts with VLM to generate human-centric visual cues within an image from multiple perspectives of humans. To capitalize on these rich Human-Centric Visual Cues, we propose a novel approach named HCVC for HOI detection. Particularly, we develop a transformer-based multimodal fusion module with multitower architecture to integrate visual cue features into the instance and interaction decoders. Our extensive experiments and analysis validate the efficacy of leveraging the generated human-centric visual cues for HOI detection. Notably, the experimental results indicate the superiority of the proposed model over the existing state-of-the-art methods on two widely used datasets.

LANS: A Layout-Aware Neural Solver for Plane Geometry Problem. (arXiv:2311.16476v1 [cs.CV])

Authors: Ming-Liang Zhang, Zhong-Zhi Li, Fei Yin, Cheng-Lin Liu

Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion and reasoning. Existing neural solvers take GPS as a vision-language task but be short in the representation of geometry diagrams which carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, integrated with two new modules: multimodal layout-aware pre-trained language model (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural and semantic pre-training (SSP) to implement global relationship modeling, and point matching pre-training (PMP) to achieve alignment between visual points and textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion for further boosting layout awareness of LANS. Extensive experiments on datasets Geometry3K and PGPS9K validate the effectiveness of the layout-aware modules and superior problem solving performance of our LANS solver, over existing symbolic solvers and neural solvers. The code will make public available soon.

UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning. (arXiv:2311.16477v1 [cs.CV])

Authors: Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang

In recent times, there has been a growing interest in developing effective perception techniques for combining information from multiple modalities. This involves aligning features obtained from diverse sources to enable more efficient training with larger datasets and constraints, as well as leveraging the wealth of information contained in each modality. 2D and 3D Human Pose Estimation (HPE) are two critical perceptual tasks in computer vision, which have numerous downstream applications, such as Action Recognition, Human-Computer Interaction, Object tracking, etc. Yet, there are limited instances where the correlation between Image and 2D/3D human pose has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPE, a unified Human Pose Estimation pipeline, which aligns features from all three modalities, i.e., 2D human pose estimation, lifting-based and image-based 3D human pose estimation, in the same pipeline. To align more than two modalities at the same time, we propose a novel singular value based contrastive learning loss, which better aligns different modalities and further boosts the performance. In our evaluation, UniHPE achieves remarkable performance metrics: MPJPE $50.5$mm on the Human3.6M dataset and PAMPJPE $51.6$mm on the 3DPW dataset. Our proposed method holds immense potential to advance the field of computer vision and contribute to various applications.

RetouchUAA: Unconstrained Adversarial Attack via Image Retouching. (arXiv:2311.16478v1 [cs.CV])

Authors: Mengda Xie, Yiling He, Meie Fang

Deep Neural Networks (DNNs) are susceptible to adversarial examples. Conventional attacks generate controlled noise-like perturbations that fail to reflect real-world scenarios and hard to interpretable. In contrast, recent unconstrained attacks mimic natural image transformations occurring in the real world for perceptible but inconspicuous attacks, yet compromise realism due to neglect of image post-processing and uncontrolled attack direction. In this paper, we propose RetouchUAA, an unconstrained attack that exploits a real-life perturbation: image retouching styles, highlighting its potential threat to DNNs. Compared to existing attacks, RetouchUAA offers several notable advantages. Firstly, RetouchUAA excels in generating interpretable and realistic perturbations through two key designs: the image retouching attack framework and the retouching style guidance module. The former custom-designed human-interpretability retouching framework for adversarial attack by linearizing images while modelling the local processing and retouching decision-making in human retouching behaviour, provides an explicit and reasonable pipeline for understanding the robustness of DNNs against retouching. The latter guides the adversarial image towards standard retouching styles, thereby ensuring its realism. Secondly, attributed to the design of the retouching decision regularization and the persistent attack strategy, RetouchUAA also exhibits outstanding attack capability and defense robustness, posing a heavy threat to DNNs. Experiments on ImageNet and Place365 reveal that RetouchUAA achieves nearly 100\% white-box attack success against three DNNs, while achieving a better trade-off between image naturalness, transferability and defense robustness than baseline attacks.

Mitigating Hallucination in Visual Language Models with Visual Supervision. (arXiv:2311.16479v1 [cs.CV])

Authors: Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao, Jinqiao Wang, Ming Tang

Large vision-language models (LVLMs) suffer from hallucination a lot, generating responses that apparently contradict to the image content occasionally. The key problem lies in its weak ability to comprehend detailed content in a multi-modal context, which can be mainly attributed to two factors in training data and loss function. The vision instruction dataset primarily focuses on global description, and the auto-regressive loss function favors text modeling rather than image understanding. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs, so that they can generate more precise responses without encounter hallucination. On one hand, we generate image-text pairs with detailed relationship annotations in panoptic scene graph dataset (PSG). These conversations pay more attention on detailed facts in the image, encouraging the model to answer questions based on multi-modal contexts. On the other hand, we integrate SAM and mask prediction loss as auxiliary supervision, forcing the LVLMs to have the capacity to identify context-related objects, so that they can generate more accurate responses, mitigating hallucination. Moreover, to provide a deeper evaluation on the hallucination in LVLMs, we propose a new benchmark, RAH-Bench. It divides vision hallucination into three different types that contradicts the image with wrong categories, attributes or relations, and introduces False Positive Rate as detailed sub-metric for each type. In this benchmark, our approach demonstrates an +8.4% enhancement compared to original LLaVA and achieves widespread performance improvements across other models.

MI-Gen: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images. (arXiv:2311.16480v1 [cs.CV])

Authors: Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Lin Yang

Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (TCGA-PathoText). In specific, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues. Furthermore, WSI-text prediction can be seen as an approach of visual-language pre-training, which enables our model to be transferred to downstream diagnostic tasks like carcinoma grading and phenotyping. We observe that simple semantic extraction from the pathology reports can achieve the best performance (0.838 of F1 score) on BRCA subtyping without adding extra parameters or tricky fine-tuning. Our collected dataset and related code will all be publicly available.

Elucidating and Overcoming the Challenges of Label Noise in Supervised Contrastive Learning. (arXiv:2311.16481v1 [cs.CV])

Authors: Zijun Long, George Killick, Lipeng Zhuang, Richard McCreadie, Gerardo Aragon Camarasa, Paul Henderson

Image classification datasets exhibit a non-negligible fraction of mislabeled examples, often due to human error when one class superficially resembles another. This issue poses challenges in supervised contrastive learning (SCL), where the goal is to cluster together data points of the same class in the embedding space while distancing those of disparate classes. While such methods outperform those based on cross-entropy, they are not immune to labeling errors. However, while the detrimental effects of noisy labels in supervised learning are well-researched, their influence on SCL remains largely unexplored. Hence, we analyse the effect of label errors and examine how they disrupt the SCL algorithm's ability to distinguish between positive and negative sample pairs. Our analysis reveals that human labeling errors manifest as easy positive samples in around 99% of cases. We, therefore, propose D-SCL, a novel Debiased Supervised Contrastive Learning objective designed to mitigate the bias introduced by labeling errors. We demonstrate that D-SCL consistently outperforms state-of-the-art techniques for representation learning across diverse vision benchmarks, offering improved robustness to label errors.

Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars. (arXiv:2311.16482v1 [cs.CV])

Authors: Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, Haoqian Wang

Neural radiance fields are capable of reconstructing high-quality drivable human avatars but are expensive to train and render. To reduce consumption, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming 3D Gaussians to posed space according to the input poses. We introduce hash-encoded shape and appearance to speed up training and propose time-dependent ambient occlusion to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method outperforms existing methods in terms of training time, rendering speed, and reconstruction quality. Our method can be easily extended to multi-human scenes and achieve comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.

ChartLlama: A Multimodal LLM for Chart Understanding and Generation. (arXiv:2311.16483v1 [cs.CV])

Authors: Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang

Multi-modal large language models have demonstrated impressive performances on most vision-language tasks. However, the model generally lacks the understanding capabilities for specific domain data, particularly when it comes to interpreting chart figures. This is mainly due to the lack of relevant multi-modal instruction tuning datasets. In this article, we create a high-quality instruction-tuning dataset leveraging GPT-4. We develop a multi-step data generation process in which different steps are responsible for generating tabular data, creating chart figures, and designing instruction tuning data separately. Our method's flexibility enables us to generate diverse, high-quality instruction-tuning data consistently and efficiently while maintaining a low resource expenditure. Additionally, it allows us to incorporate a wider variety of chart and task types not yet featured in existing datasets. Next, we introduce ChartLlama, a multi-modal large language model that we've trained using our created dataset. ChartLlama outperforms all prior methods in ChartQA, Chart-to-text, and Chart-extraction evaluation benchmarks. Additionally, ChartLlama significantly improves upon the baseline in our specially compiled chart dataset, which includes new chart and task types. The results of ChartLlama confirm the value and huge potential of our proposed data generation method in enhancing chart comprehension.

Eye vs. AI: Human Gaze and Model Attention in Video Memorability. (arXiv:2311.16484v1 [cs.CV])

Authors: Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar

Understanding the factors that determine video memorability has important applications in areas such as educational technology and advertising. Towards this goal, we investigate the semantic and temporal attention mechanisms underlying video memorability. We propose a Transformer-based model with spatio-temporal attention that matches SoTA performance on video memorability prediction on a large naturalistic video dataset. More importantly, the self-attention patterns show us where the model looks to predict memorability. We compare model attention against human gaze fixation density maps collected through a small-scale eye-tracking experiment where humans perform a video memory task. Quantitative saliency metrics show that the model attention and human gaze follow similar patterns. Furthermore, while panoptic segmentation confirms that the model and humans attend more to thing classes, stuff classes that receive increased/decreased attention tend to have higher memorability scores. We also observe that the model assigns greater importance to the initial frames, mimicking temporal attention patterns found in humans.

Class-Adaptive Sampling Policy for Efficient Continual Learning. (arXiv:2311.16485v1 [cs.LG])

Authors: Hossein Rezaei, Mohammad Sabokrou

Continual learning (CL) aims to acquire new knowledge while preserving information from previous experiences without forgetting. Though buffer-based methods (i.e., retaining samples from previous tasks) have achieved acceptable performance, determining how to allocate the buffer remains a critical challenge. Most recent research focuses on refining these methods but often fails to sufficiently consider the varying influence of samples on the learning process, and frequently overlooks the complexity of the classes/concepts being learned. Generally, these methods do not directly take into account the contribution of individual classes. However, our investigation indicates that more challenging classes necessitate preserving a larger number of samples compared to less challenging ones. To address this issue, we propose a novel method and policy named 'Class-Adaptive Sampling Policy' (CASP), which dynamically allocates storage space within the buffer. By utilizing concepts of class contribution and difficulty, CASP adaptively manages buffer space, allowing certain classes to occupy a larger portion of the buffer while reducing storage for others. This approach significantly improves the efficiency of knowledge retention and utilization. CASP provides a versatile solution to boost the performance and efficiency of CL. It meets the demand for dynamic buffer allocation, accommodating the varying contributions of different classes and their learning complexities over time.

Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net. (arXiv:2311.16488v1 [cs.CV])

Authors: Zizhao Hu, Shaochong Jia, Mohammad Rostami

Recently, diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling, overlooking the inefficiency and interference between modalities. We develop Partially Shared U-Net (PS-U-Net) architecture which is an efficient multimodal diffusion model that allows text and image inputs to pass through dedicated layers and skip-connections for preserving modality-specific fine-grained details. Inspired by image inpainting, we also propose a new efficient multimodal sampling method that introduces new scenarios for conditional generation while only requiring a simple joint distribution to be learned. Our empirical exploration of the MS-COCO dataset demonstrates that our method generates multimodal text and image data with higher quality compared to existing multimodal diffusion models while having a comparable size, faster training, faster multimodal sampling, and more flexible generation.

SIRAN: Sinkhorn Distance Regularized Adversarial Network for DEM Super-resolution using Discriminative Spatial Self-attention. (arXiv:2311.16490v1 [eess.IV])

Authors: Subhajit Paul, Ashutosh Gupta

Digital Elevation Model (DEM) is an essential aspect in the remote sensing domain to analyze and explore different applications related to surface elevation information. In this study, we intend to address the generation of high-resolution DEMs using high-resolution multi-spectral (MX) satellite imagery by incorporating adversarial learning. To promptly regulate this process, we utilize the notion of polarized self-attention of discriminator spatial maps as well as introduce a Densely connected Multi-Residual Block (DMRB) module to assist in efficient gradient flow. Further, we present an objective function related to optimizing Sinkhorn distance with traditional GAN to improve the stability of adversarial learning. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. We demonstrate both qualitative and quantitative outcomes with available state-of-the-art methods. Based on our experiments on DEM datasets of Shuttle Radar Topographic Mission (SRTM) and Cartosat-1, we show that the proposed model performs preferably against other learning-based state-of-the-art methods. We also generate and visualize several high-resolution DEMs covering terrains with diverse signatures to show the performance of our model.

$Z^*$: Zero-shot Style Transfer via Attention Rearrangement. (arXiv:2311.16491v1 [cs.CV])

Authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong

Despite the remarkable progress in image style transfer, formulating style in the context of art is inherently subjective and challenging. In contrast to existing learning/tuning methods, this study shows that vanilla diffusion models can directly extract style information and seamlessly integrate the generative prior into the content image without retraining. Specifically, we adopt dual denoising paths to represent content/style references in latent space and then guide the content image denoising process with style latent codes. We further reveal that the cross-attention mechanism in latent diffusion models tends to blend the content and style images, resulting in stylized outputs that deviate from the original content image. To overcome this limitation, we introduce a cross-attention rearrangement strategy. Through theoretical analysis and experiments, we demonstrate the effectiveness and superiority of the diffusion-based $\underline{Z}$ero-shot $\underline{S}$tyle $\underline{T}$ransfer via $\underline{A}$ttention $\underline{R}$earrangement, Z-STAR.

VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation. (arXiv:2311.16492v1 [cs.CV])

Authors: Zijian Zhou, Miaojing Shi, Holger Caesar

Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations.

Mip-Splatting: Alias-free 3D Gaussian Splatting. (arXiv:2311.16493v1 [cs.CV])

Authors: Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, Andreas Geiger

Recently, 3D Gaussian Splatting has demonstrated impressive novel view synthesis results, reaching high fidelity and efficiency. However, strong artifacts can be observed when changing the sampling rate, \eg, by changing focal length or camera distance. We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem, we introduce a 3D smoothing filter which constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views, eliminating high-frequency artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip filter, which simulates a 2D box filter, effectively mitigates aliasing and dilation issues. Our evaluation, including scenarios such a training on single-scale images and testing on multiple scales, validates the effectiveness of our approach.

ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models. (arXiv:2311.16494v1 [cs.CV])

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang

Although soft prompt tuning is effective in efficiently adapting Vision-Language (V&L) models for downstream tasks, it shows limitations in dealing with distribution shifts. We address this issue with Attribute-Guided Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the conventional approach of directly appending soft prompts preceding class names, we align the model with primitive visual attributes generated by Large Language Models (LLMs). We posit that a model's ability to express high confidence in these attributes signifies its capacity to discern the correct class rationales. 2) We introduce attribute sampling to eliminate disadvantageous attributes, thus only semantically meaningful attributes are preserved. 3) We propose negative prompting, explicitly enumerating class-agnostic attributes to activate spurious correlations and encourage the model to generate highly orthogonal probability distributions in relation to these negative features. In experiments, our method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.

Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement. (arXiv:2311.16495v1 [cs.CV])

Authors: Jian Wang, Zhe Cao, Diogo Luvizon, Lingjie Liu, Kripasindhu Sarkar, Danhang Tang, Thabo Beeler, Christian Theobalt

In this work, we explore egocentric whole-body motion capture using a single fisheye camera, which simultaneously estimates human body and hand motion. This task presents significant challenges due to three factors: the lack of high-quality datasets, fisheye camera distortion, and human body self-occlusion. To address these challenges, we propose a novel approach that leverages FisheyeViT to extract fisheye image features, which are subsequently converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction. For hand tracking, we incorporate dedicated hand detection and hand pose estimation networks for regressing 3D hand poses. Finally, we develop a diffusion-based whole-body motion prior model to refine the estimated whole-body motion while accounting for joint uncertainties. To train these networks, we collect a large synthetic dataset, EgoWholeBody, comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in producing high-quality whole-body motion estimates from a single egocentric camera.

GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation. (arXiv:2311.16497v1 [cs.CV])

Authors: Yuxiang Guo, Anshul Shah, Jiang Liu, Rama Chellappa, Cheng Peng

Gait recognition holds the promise to robustly identify subjects based on walking patterns instead of appearance information. In recent years, this field has been dominated by learning methods based on two principal input representations: dense silhouette masks or sparse pose keypoints. In this work, we propose a novel, point-based Contour-Pose representation, which compactly expresses both body shape and body parts information. We further propose a local-to-global architecture, called GaitContour, to leverage this novel representation and efficiently compute subject embedding in two stages. The first stage consists of a local transformer that extracts features from five different body regions. The second stage then aggregates the regional features to estimate a global human gait representation. Such a design significantly reduces the complexity of the attention operation and improves efficiency and performance simultaneously. Through large scale experiments, GaitContour is shown to perform significantly better than previous point-based methods, while also being significantly more efficient than silhouette-based methods. On challenging datasets with significant distractors, GaitContour can even outperform silhouette-based methods.

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model. (arXiv:2311.16498v1 [cs.CV])

Authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou

This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.

Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images. (arXiv:2311.16499v1 [cs.CV])

Authors: Shiu-hong Kao, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

This paper presents Deceptive-Human, a novel Prompt-to-NeRF framework capitalizing state-of-the-art control diffusion models (e.g., ControlNet) to generate a high-quality controllable 3D human NeRF. Different from direct 3D generative approaches, e.g., DreamFusion and DreamHuman, Deceptive-Human employs a progressive refinement technique to elevate the reconstruction quality. This is achieved by utilizing high-quality synthetic human images generated through the ControlNet with view-consistent loss. Our method is versatile and readily extensible, accommodating multimodal inputs, including a text prompt and additional data such as 3D mesh, poses, and seed images. The resulting 3D human NeRF model empowers the synthesis of highly photorealistic novel views from 360-degree perspectives. The key to our Deceptive-Human for hallucinating multi-view consistent synthetic human images lies in our progressive finetuning strategy. This strategy involves iteratively enhancing views using the provided multimodal inputs at each intermediate step to improve the human NeRF model. Within this iterative refinement process, view-dependent appearances are systematically eliminated to prevent interference with the underlying density estimation. Extensive qualitative and quantitative experimental comparison shows that our deceptive human models achieve state-of-the-art application quality.

LLMGA: Multimodal Large Language Model based Generation Assistant. (arXiv:2311.16500v1 [cs.CV])

Authors: Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia

In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting $\&$ outpainting, and visual question answering. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during image editing. Extensive results show that LLMGA has promising generative capabilities and can enable wider applications in an interactive manner.

PISA: Point-cloud-based Instructed Scene Augmentation. (arXiv:2311.16501v1 [cs.CV])

Authors: Yiyang Luo, Ke Lin

Indoor scene augmentation has become an emerging topic in the field of computer vision with applications in augmented and virtual reality. However, existing scene augmentation methods mostly require a pre-built object database with a given position as the desired location. In this paper, we propose the first end-to-end multi-modal deep neural network that can generate point cloud objects consistent with their surroundings, conditioned on text instructions. Our model generates a seemly object in the appropriate position based on the inputs of a query and point clouds, thereby enabling the creation of new scenarios involving previously unseen layouts of objects. Database of pre-stored CAD models is no longer needed. We use Point-E as our generative model and introduce methods including quantified position prediction and Top-K estimation to mitigate the false negative problems caused by ambiguous language description. Moreover, we evaluate the ability of our model by demonstrating the diversity of generated objects, the effectiveness of instruction, and quantitative metric results, which collectively indicate that our model is capable of generating realistic in-door objects. For a more thorough evaluation, we also incorporate visual grounding as a metric to assess the quality of the scenes generated by our model.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. (arXiv:2311.16502v1 [cs.CL])

Authors: Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models. (arXiv:2311.16503v1 [cs.CV])

Authors: Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, Xianglong Liu

The Diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a few modules totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory, as well as a low compression efficiency. To solve these, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block which is just related to the time-step $t$ and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can maintain the most temporal information and ensure the end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ compared to previous works.

Rethinking Directional Integration in Neural Radiance Fields. (arXiv:2311.16504v1 [cs.CV])

Authors: Congyue Deng, Jiawei Yang, Leonidas Guibas, Yue Wang

Recent works use the Neural radiance field (NeRF) to perform multi-view 3D reconstruction, providing a significant leap in rendering photorealistic scenes. However, despite its efficacy, NeRF exhibits limited capability of learning view-dependent effects compared to light field rendering or image-based view synthesis. To that end, we introduce a modification to the NeRF rendering equation which is as simple as a few lines of code change for any NeRF variations, while greatly improving the rendering quality of view-dependent effects. By swapping the integration operator and the direction decoder network, we only integrate the positional features along the ray and move the directional terms out of the integration, resulting in a disentanglement of the view-dependent and independent components. The modified equation is equivalent to the classical volumetric rendering in ideal cases on object surfaces with Dirac densities. Furthermore, we prove that with the errors caused by network approximation and numerical integration, our rendering equation exhibits better convergence properties with lower error accumulations compared to the classical NeRF. We also show that the modified equation can be interpreted as light field rendering with learned ray embeddings. Experiments on different NeRF variations show consistent improvements in the quality of view-dependent effects with our simple modification.

Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance. (arXiv:2311.16507v1 [cs.CV])

Authors: Siyu Xing, Jie Cao, Huaibo Huang, Xiao-Yu Zhang, Ran He

Flow matching as a paradigm of generative model achieves notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straight trajectories. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with the coupling strategy guided by diffusion model from entire distribution level. First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance. Second, StraightFM also integrates real data to enhance training, employing a neural network to parameterize another coupling process from images to noise samples. StraightFM is jointly optimized with couplings from above two mutually complementary directions, resulting in straighter trajectories and enabling both one-step and few-step generation. Extensive experiments demonstrate that StraightFM yields high quality samples with fewer step. StraightFM generates visually appealing images with a lower FID among diffusion and traditional flow matching methods within 5 sampling steps when trained on pixel space. In the latent space (i.e., Latent Diffusion), StraightFM achieves a lower KID value compared to existing methods on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.

RandMSAugment: A Mixed-Sample Augmentation for Limited-Data Scenarios. (arXiv:2311.16508v1 [cs.CV])

Authors: Swarna Kamlam Ravindran, Carlo Tomasi

The high costs of annotating large datasets suggests a need for effectively training CNNs with limited data, and data augmentation is a promising direction. We study foundational augmentation techniques, including Mixed Sample Data Augmentations (MSDAs) and a no-parameter variant of RandAugment termed Preset-RandAugment, in the fully supervised scenario. We observe that Preset-RandAugment excels in limited-data contexts while MSDAs are moderately effective. We show that low-level feature transforms play a pivotal role in this performance difference, postulate a new property of augmentations related to their data efficiency, and propose new ways to measure the diversity and realism of augmentations. Building on these insights, we introduce a novel augmentation technique called RandMSAugment that integrates complementary strengths of existing methods. RandMSAugment significantly outperforms the competition on CIFAR-100, STL-10, and Tiny-Imagenet. With very small training sets (4, 25, 100 samples/class), RandMSAugment achieves compelling performance gains between 4.1% and 6.75%. Even with more training data (500 samples/class) we improve performance by 1.03% to 2.47%. RandMSAugment does not require hyperparameter tuning, extra validation data, or cumbersome optimizations.

Source-Free Domain Adaptation with Frozen Multimodal Foundation Model. (arXiv:2311.16510v1 [cs.CV])

Authors: Song Tang, Wenxin Su, Mao Ye, Xiatian Zhu

Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a target domain, with only access to unlabeled target training data and the source model pre-trained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we for the first time explore the potentials of off-the-shelf vision-language (ViL) multimodal models (e.g.,CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is not specialized for this particular task but largely generic. To make it task specific, we propose a novel Distilling multimodal Foundation model(DIFO)approach. Specifically, DIFO alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, (ii) Distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation, we further introduce two effective regularization terms, namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Our source code will be released.

GPT4Video: A Unified Multimodal Large Language Model for lnstruction-Followed Understanding and Safety-Aware Generation. (arXiv:2311.16511v1 [cs.CV])

Authors: Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, Zhaopeng Tu

While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multi-model framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has demonstrated to effectively and securely handle video generation scenarios. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8\% on the Video Question Answering task, and surpasses NExt-GPT by 2.3\% on the Text to Video generation task. 2) it endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) it maintains a safe and healthy conversation not only in output-side but also the input side in an end-to-end manner. Qualitative and qualitative experiments demonstrate that GPT4Video holds the potential to function as a effective, safe and Humanoid-like video assistant that can handle both video understanding and generation scenarios.

CoSeR: Bridging Image and Language for Cognitive Super-Resolution. (arXiv:2311.16512v1 [cs.CV])

Authors: Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, Yujiu Yang

Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks.

Fine-grained Appearance Transfer with Diffusion Models. (arXiv:2311.16513v1 [cs.CV])

Authors: Yuteng Ye, Guanwen Li, Hang Zhou, Cai Jiale, Junqing Yu, Yawei Luo, Zikai Song, Qilong Xing, Youjia Zhang, Wei Yang

Image-to-image translation (I2I), and particularly its subfield of appearance transfer, which seeks to alter the visual appearance between images while maintaining structural coherence, presents formidable challenges. Despite significant advancements brought by diffusion models, achieving fine-grained transfer remains complex, particularly in terms of retaining detailed structural elements and ensuring information fidelity. This paper proposes an innovative framework designed to surmount these challenges by integrating various aspects of semantic matching, appearance transfer, and latent deviation. A pivotal aspect of our approach is the strategic use of the predicted $x_0$ space by diffusion models within the latent space of diffusion processes. This is identified as a crucial element for the precise and natural transfer of fine-grained details. Our framework exploits this space to accomplish semantic alignment between source and target images, facilitating mask-wise appearance transfer for improved feature acquisition. A significant advancement of our method is the seamless integration of these features into the latent space, enabling more nuanced latent deviations without necessitating extensive model retraining or fine-tuning. The effectiveness of our approach is demonstrated through extensive experiments, which showcase its ability to adeptly handle fine-grained appearance transfers across a wide range of categories and domains. We provide our code at https://github.com/babahui/Fine-grained-Appearance-Transfer

Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach. (arXiv:2311.16514v1 [cs.CV])

Authors: Ayush K. Rai, Tarun Krishna, Feiyan Hu, Alexandru Drimbarean, Kevin McGuinness, Alan F. Smeaton, Noel E. O'Connor

Video Anomaly Detection (VAD) is an open-set recognition task, which is usually formulated as a one-class classification (OCC) problem, where training data is comprised of videos with normal instances while test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data and making strong assumptions about real-world anomalies with regards to abnormality of objects and speed of motion to inject prior information about anomalies in an autoencoder (AE) based reconstruction model during training. This work proposes a novel method for generating generic spatio-temporal PAs by inpainting a masked out region of an image using a pre-trained Latent Diffusion Model and further perturbing the optical flow using mixup to emulate spatio-temporal distortions in the data. In addition, we present a simple unified framework to detect real-world anomalies under the OCC setting by learning three types of anomaly indicators, namely reconstruction quality, temporal irregularity and semantic inconsistency. Extensive experiments on four VAD benchmark datasets namely Ped2, Avenue, ShanghaiTech and UBnormal demonstrate that our method performs on par with other existing state-of-the-art PAs generation and reconstruction based methods under the OCC setting. Our analysis also examines the transferability and generalisation of PAs across these datasets, offering valuable insights by identifying real-world anomalies through PAs.

Word for Person: Zero-shot Composed Person Retrieval. (arXiv:2311.16515v1 [cs.CV])

Authors: Delong Liu, Haiwen Li, Zhicheng Zhao, Fei Su, Hongying Meng

Searching for specific person has great security value and social benefits, and it often involves a combination of visual and textual information. Conventional person retrieval methods, whether image-based or text-based, usually fall short in effectively harnessing both types of information, leading to the loss of accuracy. In this paper, a whole new task called Composed Person Retrieval (CPR) is proposed to jointly utilize both image and text information for target person retrieval. However, the supervised CPR must depend on very costly manual annotation dataset, while there are currently no available resources. To mitigate this issue, we firstly introduce the Zero-shot Composed Person Retrieval (ZS-CPR), which leverages existing domain-related data to resolve the CPR problem without reliance on expensive annotations. Secondly, to learn ZS-CPR model, we propose a two-stage learning framework, Word4Per, where a lightweight Textual Inversion Network (TINet) and a text-based person retrieval model based on fine-tuned Contrastive Language-Image Pre-training (CLIP) network are learned without utilizing any CPR data. Thirdly, a finely annotated Image-Text Composed Person Retrieval dataset (ITCPR) is built as the benchmark to assess the performance of the proposed Word4Per framework. Extensive experiments under both Rank-1 and mAP demonstrate the effectiveness of Word4Per for the ZS-CPR task, surpassing the comparative methods by over 10%. The code and ITCPR dataset will be publicly available at https://github.com/Delong-liu-bupt/Word4Per.

Segment Every Out-of-Distribution Object. (arXiv:2311.16516v1 [cs.CV])

Authors: Wenjie Zhao, Jia Li, Xin Dong, Yu Xiang, Yunhui Guo

Semantic segmentation models, while effective for in-distribution categories, face challenges in real-world deployment due to encountering out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for safety-critical applications. Existing methods rely on anomaly scores, but choosing a suitable threshold for generating masks presents difficulties and can lead to fragmentation and inaccuracy. This paper introduces a method to convert anomaly Score To segmentation Mask, called S2M, a simple and effective framework for OoD detection in semantic segmentation. Unlike assigning anomaly scores to pixels, S2M directly segments the entire OoD object. By transforming anomaly scores into prompts for a promptable segmentation model, S2M eliminates the need for threshold selection. Extensive experiments demonstrate that S2M outperforms the state-of-the-art by approximately 10\% in IoU and 30\% in mean F1 score, on average, across various benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly datasets.

LFSRDiff: Light Field Image Super-Resolution via Diffusion Models. (arXiv:2311.16517v1 [eess.IV])

Authors: Wentao Chao, Fuqing Duan, Xuechun Wang, Yingqian Wang, Guanghui Wang

Light field (LF) image super-resolution (SR) is a challenging problem due to its inherent ill-posed nature, where a single low-resolution (LR) input LF image can correspond to multiple potential super-resolved outcomes. Despite this complexity, mainstream LF image SR methods typically adopt a deterministic approach, generating only a single output supervised by pixel-wise loss functions. This tendency often results in blurry and unrealistic results. Although diffusion models can capture the distribution of potential SR results by iteratively predicting Gaussian noise during the denoising process, they are primarily designed for general images and struggle to effectively handle the unique characteristics and information present in LF images. To address these limitations, we introduce LFSRDiff, the first diffusion-based LF image SR model, by incorporating the LF disentanglement mechanism. Our novel contribution includes the introduction of a disentangled U-Net for diffusion models, enabling more effective extraction and fusion of both spatial and angular information within LF images. Through comprehensive experimental evaluations and comparisons with the state-of-the-art LF image SR methods, the proposed approach consistently produces diverse and realistic SR results. It achieves the highest perceptual metric in terms of LPIPS. It also demonstrates the ability to effectively control the trade-off between perception and distortion. The code is available at \url{https://github.com/chaowentao/LFSRDiff}.

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution. (arXiv:2311.16518v1 [cs.CV])

Authors: Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, Lei Zhang

Owe to the powerful generative priors, the pre-trained text-to-image (T2I) diffusion models have become increasingly popular in solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of reproduced high-resolution image may have semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts can encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and hold better the semantics.

3D Teeth Reconstruction from Panoramic Radiographs using Neural Implicit Functions. (arXiv:2311.16524v1 [cs.CV])

Authors: Sihwa Park, Seongjun Kim, In-Seok Song, Seung Jun Baek

Panoramic radiography is a widely used imaging modality in dental practice and research. However, it only provides flattened 2D images, which limits the detailed assessment of dental structures. In this paper, we propose Occudent, a framework for 3D teeth reconstruction from panoramic radiographs using neural implicit functions, which, to the best of our knowledge, is the first work to do so. For a given point in 3D space, the implicit function estimates whether the point is occupied by a tooth, and thus implicitly determines the boundaries of 3D tooth shapes. Firstly, Occudent applies multi-label segmentation to the input panoramic radiograph. Next, tooth shape embeddings as well as tooth class embeddings are generated from the segmentation outputs, which are fed to the reconstruction network. A novel module called Conditional eXcitation (CX) is proposed in order to effectively incorporate the combined shape and class embeddings into the implicit function. The performance of Occudent is evaluated using both quantitative and qualitative measures. Importantly, Occudent is trained and validated with actual panoramic radiographs as input, distinct from recent works which used synthesized images. Experiments demonstrate the superiority of Occudent over state-of-the-art methods.

Agents meet OKR: An Object and Key Results Driven Agent System with Hierarchical Self-Collaboration and Self-Evaluation. (arXiv:2311.16542v1 [cs.CV])

Authors: Yi Zheng, Chongyang Ma, Kanle Shi, Haibin Huang

In this study, we introduce the concept of OKR-Agent designed to enhance the capabilities of Large Language Models (LLMs) in task-solving. Our approach utilizes both self-collaboration and self-correction mechanism, facilitated by hierarchical agents, to address the inherent complexities in task-solving. Our key observations are two-fold: first, effective task-solving demands in-depth domain knowledge and intricate reasoning, for which deploying specialized agents for individual sub-tasks can markedly enhance LLM performance. Second, task-solving intrinsically adheres to a hierarchical execution structure, comprising both high-level strategic planning and detailed task execution. Towards this end, our OKR-Agent paradigm aligns closely with this hierarchical structure, promising enhanced efficacy and adaptability across a range of scenarios. Specifically, our framework includes two novel modules: hierarchical Objects and Key Results generation and multi-level evaluation, each contributing to more efficient and robust task-solving. In practical, hierarchical OKR generation decomposes Objects into multiple sub-Objects and assigns new agents based on key results and agent responsibilities. These agents subsequently elaborate on their designated tasks and may further decompose them as necessary. Such generation operates recursively and hierarchically, culminating in a comprehensive set of detailed solutions. The multi-level evaluation module of OKR-Agent refines solution by leveraging feedback from all associated agents, optimizing each step of the process. This ensures solution is accurate, practical, and effectively address intricate task requirements, enhancing the overall reliability and quality of the outcome. Experimental results also show our method outperforms the previous methods on several tasks. Code and demo are available at https://okr-agent.github.io/

Multi-Irreducible Spectral Synchronization for Robust Rotation Averaging. (arXiv:2311.16544v1 [cs.CV])

Authors: Owen Howell, Haoen Huang, David Rosen

Rotation averaging (RA) is a fundamental problem in robotics and computer vision. In RA, the goal is to estimate a set of $N$ unknown orientations $R_{1}, ..., R_{N} \in SO(3)$, given noisy measurements $R_{ij} \sim R^{-1}_{i} R_{j}$ of a subset of their pairwise relative rotations. This problem is both nonconvex and NP-hard, and thus difficult to solve in the general case. We apply harmonic analysis on compact groups to derive a (convex) spectral relaxation constructed from truncated Fourier decompositions of the individual summands appearing in the RA objective; we then recover an estimate of the RA solution by computing a few extremal eigenpairs of this relaxation, and (approximately) solving a consensus problem. Our approach affords several notable advantages versus prior RA methods: it can be used in conjunction with \emph{any} smooth loss function (including, but not limited to, robust M-estimators), does not require any initialization, and is implemented using only simple (and highly scalable) linear-algebraic computations and parallelizable optimizations over band-limited functions of individual rotational states. Moreover, under the (physically well-motivated) assumption of multiplicative Langevin measurement noise, we derive explicit performance guarantees for our spectral estimator (in the form of probabilistic tail bounds on the estimation error) that are parameterized in terms of graph-theoretic quantities of the underlying measurement network. By concretely linking estimator performance with properties of the underlying measurement graph, our results also indicate how to devise measurement networks that are \emph{guaranteed} to achieve accurate estimation, enabling such downstream tasks as sensor placement, network compression, and active sensing.

HandyPriors: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors. (arXiv:2311.16552v1 [cs.CV])

Authors: Shutong Zhang, Yi-Ling Qiao, Guanglei Zhu, Eric Heiden, Dylan Turpin, Jingzhou Liu, Ming Lin, Miles Macklin, Animesh Garg

Various heuristic objectives for modeling hand-object interaction have been proposed in past work. However, due to the lack of a cohesive framework, these objectives often possess a narrow scope of applicability and are limited by their efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes by leveraging recent advances in differentiable physics and rendering. Our approach employs rendering priors to align with input images and segmentation masks along with physics priors to mitigate penetration and relative-sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves higher accuracy, while the filtering-based tracking, which utilizes the differentiable priors as dynamics and observation models, executes faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.

Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models. (arXiv:2311.16555v1 [cs.CV])

Authors: Ling Fu, Zijie Wu, Yingying Zhu, Yuliang Liu, Xiang Bai

Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser. (arXiv:2311.16565v1 [cs.CV])

Authors: Peng Chen, Xiaobao Wei, Ming Lu, Yitong Zhu, Naiming Yao, Xingyu Xiao, Hui Chen

Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the non-deterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. However, personalizing facial animation and accelerating animation generation are still two major limitations of existing diffusion-based methods. To address the above limitations, we propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity to aggregate knowledge in audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation based on input audio, reflecting a specific talking style. With a trained diffusion model with hundreds of steps, we distill it into a lightweight model with 8 steps for acceleration. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released.

MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices. (arXiv:2311.16567v1 [cs.CV])

Authors: Yang Zhao, Yanwu Xu, Zhisheng Xiao, Tingbo Hou

The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose \textbf{MobileDiffusion}, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable \textbf{sub-second} inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art.

Efficient Key-Based Adversarial Defense for ImageNet by Using Pre-trained Model. (arXiv:2311.16577v1 [cs.CV])

Authors: AprilPyone MaungMaung, Isao Echizen, Hitoshi Kiya

In this paper, we propose key-based defense model proliferation by leveraging pre-trained models and utilizing recent efficient fine-tuning techniques on ImageNet-1k classification. First, we stress that deploying key-based models on edge devices is feasible with the latest model deployment advancements, such as Apple CoreML, although the mainstream enterprise edge artificial intelligence (Edge AI) has been focused on the Cloud. Then, we point out that the previous key-based defense on on-device image classification is impractical for two reasons: (1) training many classifiers from scratch is not feasible, and (2) key-based defenses still need to be thoroughly tested on large datasets like ImageNet. To this end, we propose to leverage pre-trained models and utilize efficient fine-tuning techniques to proliferate key-based models even on limited computing resources. Experiments were carried out on the ImageNet-1k dataset using adaptive and non-adaptive attacks. The results show that our proposed fine-tuned key-based models achieve a superior classification accuracy (more than 10% increase) compared to the previous key-based models on classifying clean and adversarial examples.

Clean Label Disentangling for Medical Image Segmentation with Noisy Labels. (arXiv:2311.16580v1 [cs.CV])

Authors: Zicheng Wang, Zhen Zhao, Erjian Guo, Luping Zhou

Current methods focusing on medical image segmentation suffer from incorrect annotations, which is known as the noisy label issue. Most medical image segmentation with noisy labels methods utilize either noise transition matrix, noise-robust loss functions or pseudo-labeling methods, while none of the current research focuses on clean label disentanglement. We argue that the main reason is that the severe class-imbalanced issue will lead to the inaccuracy of the selected ``clean'' labels, thus influencing the robustness of the model against the noises. In this work, we come up with a simple but efficient class-balanced sampling strategy to tackle the class-imbalanced problem, which enables our newly proposed clean label disentangling framework to successfully select clean labels from the given label sets and encourages the model to learn from the correct annotations. However, such a method will filter out too many annotations which may also contain useful information. Therefore, we further extend our clean label disentangling framework to a new noisy feature-aided clean label disentangling framework, which takes the full annotations into utilization to learn more semantics. Extensive experiments have validated the effectiveness of our methods, where our methods achieve new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/2BDenoise.

GeoScaler: Geometry and Rendering-Aware Downsampling of 3D Mesh Textures. (arXiv:2311.16581v1 [cs.GR])

Authors: Sai Karthikey Pentapati, Anshul Rai, Arkady Ten, Chaitanya Atluru, Alan Bovik

High-resolution texture maps are necessary for representing real-world objects accurately with 3D meshes. The large sizes of textures can bottleneck the real-time rendering of high-quality virtual 3D scenes on devices having low computational budgets and limited memory. Downsampling the texture maps directly addresses the issue, albeit at the cost of visual fidelity. Traditionally, downsampling of texture maps is performed using methods like bicubic interpolation and the Lanczos algorithm. These methods ignore the geometric layout of the mesh and its UV parametrization and also do not account for the rendering process used to obtain the final visualization that the users will experience. Towards filling these gaps, we introduce GeoScaler, which is a method of downsampling texture maps of 3D meshes while incorporating geometric cues, and by maximizing the visual fidelity of the rendered views of the textured meshes. We show that the textures generated by GeoScaler deliver significantly better quality rendered images compared to those generated by traditional downsampling methods

Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity. (arXiv:2311.16589v1 [cs.CV])

Authors: Daeun Lee, Minhyeok Heo, Jiwon Kim

Lane detection is a vital task for vehicles to navigate and localize their position on the road. To ensure reliable results, lane detection algorithms must have robust generalization performance in various road environments. However, despite the significant performance improvement of deep learning-based lane detection algorithms, their generalization performance in response to changes in road environments still falls short of expectations. In this paper, we present a novel framework for single-source domain generalization (SSDG) in lane detection. By decomposing data into lane structures and surroundings, we enhance diversity using High-Definition (HD) maps and generative models. Rather than expanding data volume, we strategically select a core subset of data, maximizing diversity and optimizing performance. Our extensive experiments demonstrate that our framework enhances the generalization performance of lane detection, comparable to the domain adaptation-based method.

Empowering COVID-19 Detection: Optimizing Performance Through Fine-Tuned EfficientNet Deep Learning Architecture. (arXiv:2311.16593v1 [eess.IV])

Authors: Md. Alamin Talukder, Md. Abu Layek, Mohsin Kazi, Md Ashraf Uddin, Sunil Aryal

The worldwide COVID-19 pandemic has profoundly influenced the health and everyday experiences of individuals across the planet. It is a highly contagious respiratory disease requiring early and accurate detection to curb its rapid transmission. Initial testing methods primarily revolved around identifying the genetic composition of the coronavirus, exhibiting a relatively low detection rate and requiring a time-intensive procedure. To address this challenge, experts have suggested using radiological imagery, particularly chest X-rays, as a valuable approach within the diagnostic protocol. This study investigates the potential of leveraging radiographic imaging (X-rays) with deep learning algorithms to swiftly and precisely identify COVID-19 patients. The proposed approach elevates the detection accuracy by fine-tuning with appropriate layers on various established transfer learning models. The experimentation was conducted on a COVID-19 X-ray dataset containing 2000 images. The accuracy rates achieved were impressive of 100% for EfficientNetB4 model. The fine-tuned EfficientNetB4 achieved an excellent accuracy score, showcasing its potential as a robust COVID-19 detection model. Furthermore, EfficientNetB4 excelled in identifying Lung disease using Chest X-ray dataset containing 4,350 Images, achieving remarkable performance with an accuracy of 99.17%, precision of 99.13%, recall of 99.16%, and f1-score of 99.14%. These results highlight the promise of fine-tuned transfer learning for efficient lung detection through medical imaging, especially with X-ray images. This research offers radiologists an effective means of aiding rapid and precise COVID-19 diagnosis and contributes valuable assistance for healthcare professionals in accurately identifying affected patients.

Filter-Pruning of Lightweight Face Detectors Using a Geometric Median Criterion. (arXiv:2311.16613v1 [cs.CV])

Authors: Konstantinos Gkrispanis, Nikolaos Gkalelis, Vasileios Mezaris

Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there's a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven't been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.

Cross-level Attention with Overlapped Windows for Camouflaged Object Detection. (arXiv:2311.16618v1 [cs.CV])

Authors: Jiepan Li, Fangxiao Lu, Nan Xue, Zhuohong Li, Hongyan Zhang, Wei He

Camouflaged objects adaptively fit their color and texture with the environment, which makes them indistinguishable from the surroundings. Current methods revealed that high-level semantic features can highlight the differences between camouflaged objects and the backgrounds. Consequently, they integrate high-level semantic features with low-level detailed features for accurate camouflaged object detection (COD). Unlike previous designs for multi-level feature fusion, we state that enhancing low-level features is more impending for COD. In this paper, we propose an overlapped window cross-level attention (OWinCA) to achieve the low-level feature enhancement guided by the highest-level features. By sliding an aligned window pair on both the highest- and low-level feature maps, the high-level semantics are explicitly integrated into the low-level details via cross-level attention. Additionally, it employs an overlapped window partition strategy to alleviate the incoherence among windows, which prevents the loss of global information. These adoptions enable the proposed OWinCA to enhance low-level features by promoting the separability of camouflaged objects. The associated proposed OWinCANet fuses these enhanced multi-level features by simple convolution operation to achieve the final COD. Experiments conducted on three large-scale COD datasets demonstrate that our OWinCANet significantly surpasses the current state-of-the-art COD methods.

Visual Semantic Navigation with Real Robots. (arXiv:2311.16623v1 [cs.RO])

Authors: Carlos Gutiérrez-Álvarez, Pablo Ríos-Navarro, Rafael Flor-Rodríguez, Francisco Javier Acevedo-Rodríguez, Roberto J. López-Sastre

Visual Semantic Navigation (VSN) is the ability of a robot to learn visual semantic information for navigating in unseen environments. These VSN models are typically tested in those virtual environments where they are trained, mainly using reinforcement learning based approaches. Therefore, we do not yet have an in-depth analysis of how these models would behave in the real world. In this work, we propose a new solution to integrate VSN models into real robots, so that we have true embodied agents. We also release a novel ROS-based framework for VSN, ROS4VSN, so that any VSN-model can be easily deployed in any ROS-compatible robot and tested in a real setting. Our experiments with two different robots, where we have embedded two state-of-the-art VSN agents, confirm that there is a noticeable performance difference of these VSN solutions when tested in real-world and simulation environments. We hope that this research will endeavor to provide a foundation for addressing this consequential issue, with the ultimate aim of advancing the performance and efficiency of embodied agents within authentic real-world scenarios. Code to reproduce all our experiments can be found at https://github.com/gramuah/ros4vsn.

MotionZero:Exploiting Motion Priors for Zero-shot Text-to-Video Generation. (arXiv:2311.16635v1 [cs.CV])

Authors: Sitong Su, Litao Guo, Lianli Gao, Hengtao Shen, Jingkuan Song

Zero-shot Text-to-Video synthesis generates videos based on prompts without any videos. Without motion information from videos, motion priors implied in prompts are vital guidance. For example, the prompt "airplane landing on the runway" indicates motion priors that the "airplane" moves downwards while the "runway" stays static. Whereas the motion priors are not fully exploited in previous approaches, thus leading to two nontrivial issues: 1) the motion variation pattern remains unaltered and prompt-agnostic for disregarding motion priors; 2) the motion control of different objects is inaccurate and entangled without considering the independent motion priors of different objects. To tackle the two issues, we propose a prompt-adaptive and disentangled motion control strategy coined as MotionZero, which derives motion priors from prompts of different objects by Large-Language-Models and accordingly applies motion control of different objects to corresponding regions in disentanglement. Furthermore, to facilitate videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme which adjusts attention among frames by motion amplitude. Extensive experiments demonstrate that our strategy could correctly control motion of different objects and support versatile applications including zero-shot video edit.

Parallax-Tolerant Image Stitching with Epipolar Displacement Field. (arXiv:2311.16637v1 [cs.CV])

Authors: Jian Yu, Yi Yu, Feipeng Da

Large parallax image stitching is a challenging task. Existing methods often struggle to maintain both the local and global structures of the image while reducing alignment artifacts and warping distortions. In this paper, we propose a novel approach that utilizes epipolar geometry to establish a warping technique based on the epipolar displacement field. Initially, the warping rule for pixels in the epipolar geometry is established through the infinite homography. Subsequently, Subsequently, the epipolar displacement field, which represents the sliding distance of the warped pixel along the epipolar line, is formulated by thin plate splines based on the principle of local elastic deformation. The stitching result can be generated by inversely warping the pixels according to the epipolar displacement field. This method incorporates the epipolar constraints in the warping rule, which ensures high-quality alignment and maintains the projectivity of the panorama. Qualitative and quantitative comparative experiments demonstrate the competitiveness of the proposed method in stitching images large parallax.

Augmenting x-ray single particle imaging reconstruction with self-supervised machine learning. (arXiv:2311.16652v1 [cs.CV])

Authors: Zhantao Chen, Cong Wang, Mingye Gao, Chun Hong Yoon, Jana B. Thayer, Joshua J. Turner

The development of X-ray Free Electron Lasers (XFELs) has opened numerous opportunities to probe atomic structure and ultrafast dynamics of various materials. Single Particle Imaging (SPI) with XFELs enables the investigation of biological particles in their natural physiological states with unparalleled temporal resolution, while circumventing the need for cryogenic conditions or crystallization. However, reconstructing real-space structures from reciprocal-space x-ray diffraction data is highly challenging due to the absence of phase and orientation information, which is further complicated by weak scattering signals and considerable fluctuations in the number of photons per pulse. In this work, we present an end-to-end, self-supervised machine learning approach to recover particle orientations and estimate reciprocal space intensities from diffraction images only. Our method demonstrates great robustness under demanding experimental conditions with significantly enhanced reconstruction capabilities compared with conventional algorithms, and signifies a paradigm shift in SPI as currently practiced at XFELs.

SCALAR-NeRF: SCAlable LARge-scale Neural Radiance Fields for Scene Reconstruction. (arXiv:2311.16657v1 [cs.CV])

Authors: Yu Chen, Gim Hee Lee

In this work, we introduce SCALAR-NeRF, a novel framework tailored for scalable large-scale neural scene reconstruction. We structure the neural representation as an encoder-decoder architecture, where the encoder processes 3D point coordinates to produce encoded features, and the decoder generates geometric values that include volume densities of signed distances and colors. Our approach first trains a coarse global model on the entire image dataset. Subsequently, we partition the images into smaller blocks using KMeans with each block being modeled by a dedicated local model. We enhance the overlapping regions across different blocks by scaling up the bounding boxes of each local block. Notably, the decoder from the global model is shared across distinct blocks and therefore promoting alignment in the feature space of local encoders. We propose an effective and efficient methodology to fuse the outputs from these local models to attain the final reconstruction. Employing this refined coarse-to-fine strategy, our method outperforms state-of-the-art NeRF methods and demonstrates scalability for large-scale scene reconstruction. The code will be available on our project page at https://aibluefisher.github.io/SCALAR-NeRF/

DGNR: Density-Guided Neural Point Rendering of Large Driving Scenes. (arXiv:2311.16664v1 [cs.CV])

Authors: Zhuopeng Li, Chenming Wu, Liangjun Zhang, Jianke Zhu

Despite the recent success of Neural Radiance Field (NeRF), it is still challenging to render large-scale driving scenes with long trajectories, particularly when the rendering quality and efficiency are in high demand. Existing methods for such scenes usually involve with spatial warping, geometric supervision from zero-shot normal or depth estimation, or scene division strategies, where the synthesized views are often blurry or fail to meet the requirement of efficient rendering. To address the above challenges, this paper presents a novel framework that learns a density space from the scenes to guide the construction of a point-based renderer, dubbed as DGNR (Density-Guided Neural Rendering). In DGNR, geometric priors are no longer needed, which can be intrinsically learned from the density space through volumetric rendering. Specifically, we make use of a differentiable renderer to synthesize images from the neural density features obtained from the learned density space. A density-based fusion module and geometric regularization are proposed to optimize the density space. By conducting experiments on a widely used autonomous driving dataset, we have validated the effectiveness of DGNR in synthesizing photorealistic driving scenes and achieving real-time capable rendering.

LiveNVS: Neural View Synthesis on Live RGB-D Streams. (arXiv:2311.16668v1 [cs.CV])

Authors: Laura Fink, Darius Rückert, Linus Franke, Joachim Keinert, Marc Stamminger

Existing real-time RGB-D reconstruction approaches, like Kinect Fusion, lack real-time photo-realistic visualization. This is due to noisy, oversmoothed or incomplete geometry and blurry textures which are fused from imperfect depth maps and camera poses. Recent neural rendering methods can overcome many of such artifacts but are mostly optimized for offline usage, hindering the integration into a live reconstruction pipeline.

In this paper, we present LiveNVS, a system that allows for neural novel view synthesis on a live RGB-D input stream with very low latency and real-time rendering. Based on the RGB-D input stream, novel views are rendered by projecting neural features into the target view via a densely fused depth map and aggregating the features in image-space to a target feature map. A generalizable neural network then translates the target feature map into a high-quality RGB image. LiveNVS achieves state-of-the-art neural rendering quality of unknown scenes during capturing, allowing users to virtually explore the scene and assess reconstruction quality in real-time.

SplitNeRF: Split Sum Approximation Neural Field for Joint Geometry, Illumination, and Material Estimation. (arXiv:2311.16671v1 [cs.CV])

Authors: Jesus Zarzar, Bernard Ghanem

We present a novel approach for digitizing real-world objects by estimating their geometry, material properties, and environmental lighting from a set of posed images with fixed lighting. Our method incorporates into Neural Radiance Field (NeRF) pipelines the split sum approximation used with image-based lighting for real-time physical-based rendering. We propose modeling the scene's lighting with a single scene-specific MLP representing pre-integrated image-based lighting at arbitrary resolutions. We achieve accurate modeling of pre-integrated lighting by exploiting a novel regularizer based on efficient Monte Carlo sampling. Additionally, we propose a new method of supervising self-occlusion predictions by exploiting a similar regularizer based on Monte Carlo sampling. Experimental results demonstrate the efficiency and effectiveness of our approach in estimating scene geometry, material properties, and lighting. Our method is capable of attaining state-of-the-art relighting quality after only ${\sim}1$ hour of training in a single NVIDIA A100 GPU.

Large Language Models Meet Computer Vision: A Brief Survey. (arXiv:2311.16673v1 [cs.CV])

Authors: Raby Hamadi

Recently, the intersection of Large Language Models (LLMs) and Computer Vision (CV) has emerged as a pivotal area of research, driving significant advancements in the field of Artificial Intelligence (AI). As transformers have become the backbone of many state-of-the-art models in both Natural Language Processing (NLP) and CV, understanding their evolution and potential enhancements is crucial. This survey paper delves into the latest progressions in the domain of transformers and their subsequent successors, emphasizing their potential to revolutionize Vision Transformers (ViTs) and LLMs. This survey also presents a comparative analysis, juxtaposing the performance metrics of several leading paid and open-source LLMs, shedding light on their strengths and areas of improvement as well as a literature review on how LLMs are being used to tackle vision related tasks. Furthermore, the survey presents a comprehensive collection of datasets employed to train LLMs, offering insights into the diverse data available to achieve high performance in various pre-training and downstream tasks of LLMs. The survey is concluded by highlighting open directions in the field, suggesting potential venues for future research and development. This survey aims to underscores the profound intersection of LLMs on CV, leading to a new era of integrated and advanced AI models.

Understanding the (Extra-)Ordinary: Validating Deep Model Decisions with Prototypical Concept-based Explanations. (arXiv:2311.16681v1 [cs.CV])

Authors: Maximilian Dreyer, Reduan Achtibat, Wojciech Samek, Sebastian Lapuschkin

Ensuring both transparency and safety is critical when deploying Deep Neural Networks (DNNs) in high-risk applications, such as medicine. The field of explainable AI (XAI) has proposed various methods to comprehend the decision-making processes of opaque DNNs. However, only few XAI methods are suitable of ensuring safety in practice as they heavily rely on repeated labor-intensive and possibly biased human assessment. In this work, we present a novel post-hoc concept-based XAI framework that conveys besides instance-wise (local) also class-wise (global) decision-making strategies via prototypes. What sets our approach apart is the combination of local and global strategies, enabling a clearer understanding of the (dis-)similarities in model decisions compared to the expected (prototypical) concept use, ultimately reducing the dependence on human long-term assessment. Quantifying the deviation from prototypical behavior not only allows to associate predictions with specific model sub-strategies but also to detect outlier behavior. As such, our approach constitutes an intuitive and explainable tool for model validation. We demonstrate the effectiveness of our approach in identifying out-of-distribution samples, spurious model behavior and data quality issues across three datasets (ImageNet, CUB-200, and CIFAR-10) utilizing VGG, ResNet, and EfficientNet architectures. Code is available on https://github.com/maxdreyer/pcx.

ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention. (arXiv:2311.16682v1 [cs.CV])

Authors: Jiawei Wang, Changjian Li

Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of pre-defined part labels to individual strokes. This paper presents ContextSeg - a simple yet highly effective approach to tackling this problem with two stages. In the first stage, to better encode the shape and positional information of strokes, we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism. By group-based labeling, our method can fully leverage the context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets and has been extensively evaluated demonstrating its superior performance. Additionally, we offer insights into solving part imbalance in training data and the preliminary experiment on cross-category training, which can inspire future research in this field.

Rethinking Intermediate Layers design in Knowledge Distillation for Kidney and Liver Tumor Segmentation. (arXiv:2311.16700v1 [cs.CV])

Authors: Vandan Gorade, Sparsh Mittal, Debesh Jha, Ulas Bagci

Knowledge distillation(KD) has demonstrated remarkable success across various domains, but its application to medical imaging tasks, such as kidney and liver tumor segmentation, has encountered challenges. Many existing KD methods are not specifically tailored for these tasks. Moreover, prevalent KD methods often lack a careful consideration of what and from where to distill knowledge from the teacher to the student. This oversight may lead to issues like the accumulation of training bias within shallower student layers, potentially compromising the effectiveness of KD. To address these challenges, we propose Hierarchical Layer-selective Feedback Distillation (HLFD). HLFD strategically distills knowledge from a combination of middle layers to earlier layers and transfers final layer knowledge to intermediate layers at both the feature and pixel levels. This design allows the model to learn higher-quality representations from earlier layers, resulting in a robust and compact student model. Extensive quantitative evaluations reveal that HLFD outperforms existing methods by a significant margin. For example, in the kidney segmentation task, HLFD surpasses the student model (without KD) by over 10pp, significantly improving its focus on tumor-specific features. From a qualitative standpoint, the student model trained using HLFD excels at suppressing irrelevant information and can focus sharply on tumor-specific details, which opens a new pathway for more efficient and accurate diagnostic tools.

CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs. (arXiv:2311.16703v1 [cs.CV])

Authors: Haocheng Yuan, Jing Xu, Hao Pan, Adrien Bousseau, Niloy Mitra, Changjian Li

CAD programs are a popular way to compactly encode shapes as a sequence of operations that are easy to parametrically modify. However, without sufficient semantic comments and structure, such programs can be challenging to understand, let alone modify. We introduce the problem of semantic commenting CAD programs, wherein the goal is to segment the input program into code blocks corresponding to semantically meaningful shape parts and assign a semantic label to each block. We solve the problem by combining program parsing with visual-semantic analysis afforded by recent advances in foundational language and vision models. Specifically, by executing the input programs, we create shapes, which we use to generate conditional photorealistic images to make use of semantic annotators for such images. We then distill the information across the images and link back to the original programs to semantically comment on them. Additionally, we collected and annotated a benchmark dataset, CADTalk, consisting of 5,280 machine-made programs and 45 human-made programs with ground truth semantic comments to foster future research. We extensively evaluated our approach, compared to a GPT-based baseline approach, and an open-set shape segmentation baseline, i.e., PartSLIP, and reported an 83.24% accuracy on the new CADTalk dataset. Project page: https://enigma-li.github.io/CADTalk/.

Full-resolution MLPs Empower Medical Dense Prediction. (arXiv:2311.16707v1 [eess.IV])

Authors: Mingyuan Meng, Yuxin Xue, Dagan Feng, Lei Bi, Jinman Kim

Dense prediction is a fundamental requirement for many medical vision tasks such as medical image restoration, registration, and segmentation. The most popular vision model, Convolutional Neural Networks (CNNs), has reached bottlenecks due to the intrinsic locality of convolution operations. Recently, transformers have been widely adopted for dense prediction for their capability to capture long-range visual dependence. However, due to the high computational complexity and large memory consumption of self-attention operations, transformers are usually used at downsampled feature resolutions. Such usage cannot effectively leverage the tissue-level textural information available only at the full image resolution. This textural information is crucial for medical dense prediction as it can differentiate the subtle human anatomy in medical images. In this study, we hypothesize that Multi-layer Perceptrons (MLPs) are superior alternatives to transformers in medical dense prediction where tissue-level details dominate the performance, as MLPs enable long-range dependence at the full image resolution. To validate our hypothesis, we develop a full-resolution hierarchical MLP framework that uses MLPs beginning from the full image resolution. We evaluate this framework with various MLP blocks on a wide range of medical dense prediction tasks including restoration, registration, and segmentation. Extensive experiments on six public well-benchmarked datasets show that, by simply using MLPs at full resolution, our framework outperforms its CNN and transformer counterparts and achieves state-of-the-art performance on various medical dense prediction tasks.

LEDITS++: Limitless Image Editing using Text-to-Image Models. (arXiv:2311.16711v1 [cs.CV])

Authors: Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos

Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld. (arXiv:2311.16714v1 [cs.CV])

Authors: Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, Yuhui Shi

While large language models (LLMs) excel in a simulated world of texts, they struggle to interact with the more realistic world without perceptions of other modalities such as visual or audio signals. Although vision-language models (VLMs) integrate LLM modules (1) aligned with static image features, and (2) may possess prior knowledge of world dynamics (as demonstrated in the text world), they have not been trained in an embodied visual world and thus cannot align with its dynamics. On the other hand, training an embodied agent in a noisy visual world without expert guidance is often challenging and inefficient. In this paper, we train a VLM agent living in a visual world using an LLM agent excelling in a parallel text world (but inapplicable to the visual world). Specifically, we distill LLM's reflection outcomes (improved actions by analyzing mistakes) in a text world's tasks to finetune the VLM on the same tasks of the visual world, resulting in an Embodied Multi-Modal Agent (EMMA) quickly adapting to the visual world dynamics. Such cross-modality imitation learning between the two parallel worlds enables EMMA to generalize to a broad scope of new tasks without any further guidance from the LLM expert. Extensive evaluations on the ALFWorld benchmark highlight EMMA's superior performance to SOTA VLM-based agents across diverse tasks, e.g., 20%-70% improvement in the success rate.

Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras. (arXiv:2311.16728v1 [cs.CV])

Authors: Huajian Huang, Longwei Li, Hui Cheng, Sai-Kit Yeung

The integration of neural rendering and the SLAM system recently showed promising results in joint localization and photorealistic view reconstruction. However, existing methods, fully relying on implicit representations, are so resource-hungry that they cannot run on portable devices, which deviates from the original intention of SLAM. In this paper, we present Photo-SLAM, a novel SLAM framework with a hyper primitives map. Specifically, we simultaneously exploit explicit geometric features for localization and learn implicit photometric features to represent the texture information of the observed environment. In addition to actively densifying hyper primitives based on geometric features, we further introduce a Gaussian-Pyramid-based training method to progressively learn multi-level features, enhancing photorealistic mapping performance. The extensive experiments with monocular, stereo, and RGB-D datasets prove that our proposed system Photo-SLAM significantly outperforms current state-of-the-art SLAM systems for online photorealistic mapping, e.g., PSNR is 30% higher and rendering speed is hundreds of times faster in the Replica dataset. Moreover, the Photo-SLAM can run at real-time speed using an embedded platform such as Jetson AGX Orin, showing the potential of robotics applications.

Point'n Move: Interactive Scene Object Manipulation on Gaussian Splatting Radiance Fields. (arXiv:2311.16737v1 [cs.CV])

Authors: Jiajun Huang, Hongchuan Yu

We propose Point'n Move, a method that achieves interactive scene object manipulation with exposed region inpainting. Interactivity here further comes from intuitive object selection and real-time editing. To achieve this, we adopt Gaussian Splatting Radiance Field as the scene representation and fully leverage its explicit nature and speed advantage. Its explicit representation formulation allows us to devise a 2D prompt points to 3D mask dual-stage self-prompting segmentation algorithm, perform mask refinement and merging, minimize change as well as provide good initialization for scene inpainting and perform editing in real-time without per-editing training, all leads to superior quality and performance. We test our method by performing editing on both forward-facing and 360 scenes. We also compare our method against existing scene object removal methods, showing superior quality despite being more capable and having a speed advantage.

Riemannian Self-Attention Mechanism for SPD Networks. (arXiv:2311.16738v1 [cs.CV])

Authors: Rui Wang, Xiao-Jun Wu, Hui Li, Josef Kittler

Symmetric positive definite (SPD) matrix has been demonstrated to be an effective feature descriptor in many scientific areas, as it can encode spatiotemporal statistics of the data adequately on a curved Riemannian manifold, i.e., SPD manifold. Although there are many different ways to design network architectures for SPD matrix nonlinear learning, very few solutions explicitly mine the geometrical dependencies of features at different layers. Motivated by the great success of self-attention mechanism in capturing long-range relationships, an SPD manifold self-attention mechanism (SMSA) is proposed in this paper using some manifold-valued geometric operations, mainly the Riemannian metric, Riemannian mean, and Riemannian optimization. Then, an SMSA-based geometric learning module (SMSA-GLM) is designed for the sake of improving the discrimination of the generated deep structured representations. Extensive experimental results achieved on three benchmarking datasets show that our modification against the baseline network further alleviates the information degradation problem and leads to improved accuracy.

As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors. (arXiv:2311.16739v1 [cs.CV])

Authors: Seungwoo Yoo, Kunho Kim, Vladimir G. Kim, Minhyuk Sung

We present As-Plausible-as-Possible (APAP) mesh deformation technique that leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations, where mesh vertex coordinates are computed via a differentiable Poisson Solve. The deformed mesh is rendered, and the resulting 2D image is used in the Score Distillation Sampling (SDS) process, which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh, we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians, and we use iterative gradient descent to compute the final deformation that balances between the user edit and the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over geometry-preservation or distortion-minimization priors used by previous techniques.

Towards Full-scene Domain Generalization in Multi-agent Collaborative Bird's Eye View Segmentation for Connected and Autonomous Driving. (arXiv:2311.16754v1 [cs.CV])

Authors: Senkang Hu, Zhengru Fang, Xianhao Chen, Yuguang Fang, Sam Kwong

Collaborative perception has recently gained significant attention in autonomous driving, improving perception quality by enabling the exchange of additional information among vehicles. However, deploying collaborative perception systems can lead to domain shifts due to diverse environmental conditions and data heterogeneity among connected and autonomous vehicles (CAVs). To address these challenges, we propose a unified domain generalization framework applicable in both training and inference stages of collaborative perception. In the training phase, we introduce an Amplitude Augmentation (AmpAug) method to augment low-frequency image variations, broadening the model's ability to learn across various domains. We also employ a meta-consistency training scheme to simulate domain shifts, optimizing the model with a carefully designed consistency loss to encourage domain-invariant representations. In the inference phase, we introduce an intra-system domain alignment mechanism to reduce or potentially eliminate the domain discrepancy among CAVs prior to inference. Comprehensive experiments substantiate the effectiveness of our method in comparison with the existing state-of-the-art works. Code will be released at https://github.com/DG-CAVs/DG-CoPerception.git.

Gradient-based Local Next-best-view Planning for Improved Perception of Targeted Plant Nodes. (arXiv:2311.16759v1 [cs.RO])

Authors: Akshay K. Burusa, Eldert J. van Henten, Gert Kootstra

Robots are increasingly used in tomato greenhouses to automate labour-intensive tasks such as selective harvesting and de-leafing. To perform these tasks, robots must be able to accurately and efficiently perceive the plant nodes that need to be cut, despite the high levels of occlusion from other plant parts. We formulate this problem as a local next-best-view (NBV) planning task where the robot has to plan an efficient set of camera viewpoints to overcome occlusion and improve the quality of perception. Our formulation focuses on quickly improving the perception accuracy of a single target node to maximise its chances of being cut. Previous methods of NBV planning mostly focused on global view planning and used random sampling of candidate viewpoints for exploration, which could suffer from high computational costs, ineffective view selection due to poor candidates, or non-smooth trajectories due to inefficient sampling. We propose a gradient-based NBV planner using differential ray sampling, which directly estimates the local gradient direction for viewpoint planning to overcome occlusion and improve perception. Through simulation experiments, we showed that our planner can handle occlusions and improve the 3D reconstruction and position estimation of nodes equally well as a sampling-based NBV planner, while taking ten times less computation and generating 28% more efficient trajectories.

Rescuing referral failures during automated diagnosis of domain-shifted medical images. (arXiv:2311.16766v1 [cs.CV])

Authors: Anuj Srivastava, Karm Patel, Pradeep Shenoy, Devarajan Sridharan

The success of deep learning models deployed in the real world depends critically on their ability to generalize well across diverse data domains. Here, we address a fundamental challenge with selective classification during automated diagnosis with domain-shifted medical images. In this scenario, models must learn to avoid making predictions when label confidence is low, especially when tested with samples far removed from the training set (covariate shift). Such uncertain cases are typically referred to the clinician for further analysis and evaluation. Yet, we show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or using a different technology. We examine two benchmark diagnostic medical imaging datasets exhibiting strong covariate shifts: i) diabetic retinopathy prediction with retinal fundus images and ii) multilabel disease prediction with chest X-ray images. We show that predictive uncertainty estimates do not generalize well under covariate shifts leading to non-monotonic referral curves, and severe drops in performance (up to 50%) at high referral rates (>70%). We evaluate novel combinations of robust generalization and post hoc referral approaches, that rescue these failures and achieve significant performance improvements, typically >10%, over baseline methods. Our study identifies a critical challenge with referral in domain-shifted medical images and finds key applications in reliable, automated disease diagnosis.

Monocular Camera Localization for Automated Vehicles Using Image Retrieval. (arXiv:2109.06296v2 [cs.CV] UPDATED)

Authors: Eunhyek Joa, Francesco Borrelli

We address the problem of finding the current position and heading angle of an autonomous vehicle in real-time using a single camera. Compared to methods which require LiDARs and high definition (HD) 3D maps in real-time, the proposed approach is easily scalable and computationally efficient, at the price of lower precision.

The new method combines and adapts existing algorithms in three different fields: image retrieval, mapping database, and particle filtering. The result is a simple, real-time localization method using an image retrieval method whose performance is comparable to other monocular camera localization methods which use a map built with LiDARs.

We evaluate the proposed method using the KITTI odometry dataset and via closed-loop experiments with an indoor 1:10 autonomous vehicle. The tests demonstrate real-time capability and a 10cm level accuracy. Also, experimental results of the closed-loop indoor tests show the presence of a positive feedback loop between the localization error and the control error. Such phenomena is analysed in details at the end of the article.

Rethinking Dilated Convolution for Real-time Semantic Segmentation. (arXiv:2111.09957v3 [cs.CV] UPDATED)

Authors: Roland Gao

The field-of-view is an important metric when designing a model for semantic segmentation. To obtain a large field-of-view, previous approaches generally choose to rapidly downsample the resolution, usually with average poolings or stride 2 convolutions. We take a different approach by using dilated convolutions with large dilation rates throughout the backbone, allowing the backbone to easily tune its field-of-view by adjusting its dilation rates, and show that it's competitive with existing approaches. To effectively use the dilated convolution, we show a simple upper bound on the dilation rate in order to not leave gaps in between the convolutional weights, and design an SE-ResNeXt inspired block structure that uses two parallel $3\times 3$ convolutions with different dilation rates to preserve the local details. Manually tuning the dilation rates for every block can be difficult, so we also introduce a differentiable neural architecture search method that uses gradient descent to optimize the dilation rates. In addition, we propose a lightweight decoder that restores local information better than common alternatives. To demonstrate the effectiveness of our approach, our model RegSeg achieves competitive results on real-time Cityscapes and CamVid datasets. Using a T4 GPU with mixed precision, RegSeg achieves 78.3 mIOU on Cityscapes test set at $37$ FPS, and 80.9 mIOU on CamVid test set at $112$ FPS, both without ImageNet pretraining.

Towards Improving the Generation Quality of Autoregressive Slot VAEs. (arXiv:2206.01370v3 [cs.CV] UPDATED)

Authors: Patrick Emami, Pan He, Sanjay Ranka, Anand Rangarajan

Unconditional scene inference and generation are challenging to learn jointly with a single compositional model. Despite encouraging progress on models that extract object-centric representations (''slots'') from images, unconditional generation of scenes from slots has received less attention. This is primarily because learning the multi-object relations necessary to imagine coherent scenes is difficult. We hypothesize that most existing slot-based models have a limited ability to learn object correlations. We propose two improvements that strengthen object correlation learning. The first is to condition the slots on a global, scene-level variable that captures higher-order correlations between slots. Second, we address the fundamental lack of a canonical order for objects in images by proposing to learn a consistent order to use for the autoregressive generation of scene objects. Specifically, we train an autoregressive slot prior to sequentially generate scene objects following a learned order. Ordered slot inference entails first estimating a randomly ordered set of slots using existing approaches for extracting slots from images, then aligning those slots to ordered slots generated autoregressively with the slot prior. Our experiments across three multi-object environments demonstrate clear gains in unconditional scene generation quality. Detailed ablation studies are also provided that validate the two proposed improvements.

360Roam: Real-Time Indoor Roaming Using Geometry-Aware 360$^\circ$ Radiance Fields. (arXiv:2208.02705v2 [cs.CV] UPDATED)

Authors: Huajian Huang, Yingshu Chen, Tianjia Zhang, Sai-Kit Yeung

Virtual tour among sparse 360$^\circ$ images is widely used while hindering smooth and immersive roaming experiences. The emergence of Neural Radiance Field (NeRF) has showcased significant progress in synthesizing novel views, unlocking the potential for immersive scene exploration. Nevertheless, previous NeRF works primarily focused on object-centric scenarios, resulting in noticeable performance degradation when applied to outward-facing and large-scale scenes due to limitations in scene parameterization. To achieve seamless and real-time indoor roaming, we propose a novel approach using geometry-aware radiance fields with adaptively assigned local radiance fields. Initially, we employ multiple 360$^\circ$ images of an indoor scene to progressively reconstruct explicit geometry in the form of a probabilistic occupancy map, derived from a global omnidirectional radiance field. Subsequently, we assign local radiance fields through an adaptive divide-and-conquer strategy based on the recovered geometry. By incorporating geometry-aware sampling and decomposition of the global radiance field, our system effectively utilizes positional encoding and compact neural networks to enhance rendering quality and speed. Additionally, the extracted floorplan of the scene aids in providing visual guidance, contributing to a realistic roaming experience. To demonstrate the effectiveness of our system, we curated a diverse dataset of 360$^\circ$ images encompassing various real-life scenes, on which we conducted extensive experiments. Quantitative and qualitative comparisons against baseline approaches illustrated the superior performance of our system in large-scale indoor scene roaming.

Exploring Semantic Attributes from A Foundation Model for Federated Learning of Disjoint Label Spaces. (arXiv:2208.13465v2 [cs.CV] UPDATED)

Authors: Shitong Sun, Chenyang Si, Guile Wu, Shaogang Gong

Conventional centralised deep learning paradigms are not feasible when data from different sources cannot be shared due to data privacy or transmission limitation. To resolve this problem, federated learning has been introduced to transfer knowledge across multiple sources (clients) with non-shared data while optimising a globally generalised central model (server). Existing federated learning paradigms mostly focus on transferring holistic high-level knowledge (such as class) across models, which are closely related to specific objects of interest so may suffer from inverse attack. In contrast, in this work, we consider transferring mid-level semantic knowledge (such as attribute) which is not sensitive to specific objects of interest and therefore is more privacy-preserving and scalable. To this end, we formulate a new Federated Zero-Shot Learning (FZSL) paradigm to learn mid-level semantic knowledge at multiple local clients with non-shared local data and cumulatively aggregate a globally generalised central model for deployment. To improve model discriminative ability, we propose to explore semantic knowledge augmentation from external knowledge for enriching the mid-level semantic space in FZSL. Extensive experiments on five zeroshot learning benchmark datasets validate the effectiveness of our approach for optimising a generalisable federated learning model with mid-level semantic knowledge transfer.

Continuously Controllable Facial Expression Editing in Talking Face Videos. (arXiv:2209.08289v2 [cs.CV] UPDATED)

Authors: Zhiyao Sun, Yu-Hui Wen, Tian Lv, Yanan Sun, Ziyang Zhang, Yaoyuan Wang, Yong-Jin Liu

Recently audio-driven talking face video generation has attracted considerable attention. However, very few researches address the issue of emotional editing of these talking face videos with continuously controllable expressions, which is a strong demand in the industry. The challenge is that speech-related expressions and emotion-related expressions are often highly coupled. Meanwhile, traditional image-to-image translation methods cannot work well in our application due to the coupling of expressions with other attributes such as poses, i.e., translating the expression of the character in each frame may simultaneously change the head pose due to the bias of the training data distribution. In this paper, we propose a high-quality facial expression editing method for talking face videos, allowing the user to control the target emotion in the edited video continuously. We present a new perspective for this task as a special case of motion information editing, where we use a 3DMM to capture major facial movements and an associated texture map modeled by a StyleGAN to capture appearance details. Both representations (3DMM and texture map) contain emotional information and can be continuously modified by neural networks and easily smoothed by averaging in coefficient/latent spaces, making our method simple yet effective. We also introduce a mouth shape preservation loss to control the trade-off between lip synchronization and the degree of exaggeration of the edited expression. Extensive experiments and a user study show that our method achieves state-of-the-art performance across various evaluation criteria.

Development and evaluation of automated localisation and reconstruction of all fruits on tomato plants in a greenhouse based on multi-view perception and 3D multi-object tracking. (arXiv:2211.02760v3 [cs.RO] UPDATED)

Authors: David Rapado Rincon, Eldert J. van Henten, Gert Kootstra

The ability to accurately represent and localise relevant objects is essential for robots to carry out tasks effectively. Traditional approaches, where robots simply capture an image, process that image to take an action, and then forget the information, have proven to struggle in the presence of occlusions. Methods using multi-view perception, which have the potential to address some of these problems, require a world model that guides the collection, integration and extraction of information from multiple viewpoints. Furthermore, constructing a generic representation that can be applied in various environments and tasks is a difficult challenge. In this paper, a novel approach for building generic representations in occluded agro-food environments using multi-view perception and 3D multi-object tracking is introduced. The method is based on a detection algorithm that generates partial point clouds for each detected object, followed by a 3D multi-object tracking algorithm that updates the representation over time. The accuracy of the representation was evaluated in a real-world environment, where successful representation and localisation of tomatoes in tomato plants were achieved, despite high levels of occlusion, with the total count of tomatoes estimated with a maximum error of 5.08% and the tomatoes tracked with an accuracy up to 71.47%. Novel tracking metrics were introduced, demonstrating that valuable insight into the errors in localising and representing the fruits can be provided by their use. This approach presents a novel solution for building representations in occluded agro-food environments, demonstrating potential to enable robots to perform tasks effectively in these challenging environments.

FIXED: Frustratingly Easy Domain Generalization with Mixup. (arXiv:2211.05228v2 [cs.CV] UPDATED)

Authors: Wang Lu, Jindong Wang, Han Yu, Lei Huang, Xiang Zhang, Yiqiang Chen, Xing Xie

Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains. A popular strategy is to augment training data to benefit generalization through methods such as Mixup~\cite{zhang2018mixup}. While the vanilla Mixup can be directly applied, theoretical and empirical investigations uncover several shortcomings that limit its performance. Firstly, Mixup cannot effectively identify the domain and class information that can be used for learning invariant representations. Secondly, Mixup may introduce synthetic noisy data points via random interpolation, which lowers its discrimination capability. Based on the analysis, we propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX). It learns domain-invariant representations for Mixup. To further enhance discrimination, we leverage existing techniques to enlarge margins among classes to further propose the domain-invariant Feature MIXup with Enhanced Discrimination (FIXED) approach. We present theoretical insights about guarantees on its effectiveness. Extensive experiments on seven public datasets across two modalities including image classification (Digits-DG, PACS, Office-Home) and time series (DSADS, PAMAP2, UCI-HAR, and USC-HAD) demonstrate that our approach significantly outperforms nine state-of-the-art related methods, beating the best performing baseline by 6.5\% on average in terms of test accuracy. Code is available at: https://github.com/jindongwang/transferlearning/tree/master/code/deep/fixed.

FeTrIL: Feature Translation for Exemplar-Free Class-Incremental Learning. (arXiv:2211.13131v2 [cs.CV] UPDATED)

Authors: Grégoire Petit, Adrian Popescu, Hugo Schindler, David Picard, Bertrand Delezoide

Exemplar-free class-incremental learning is very challenging due to the negative effect of catastrophic forgetting. A balance between stability and plasticity of the incremental process is needed in order to obtain good accuracy for past as well as new classes. Existing exemplar-free class-incremental methods focus either on successive fine tuning of the model, thus favoring plasticity, or on using a feature extractor fixed after the initial incremental state, thus favoring stability. We introduce a method which combines a fixed feature extractor and a pseudo-features generator to improve the stability-plasticity balance. The generator uses a simple yet effective geometric translation of new class features to create representations of past classes, made of pseudo-features. The translation of features only requires the storage of the centroid representations of past classes to produce their pseudo-features. Actual features of new classes and pseudo-features of past classes are fed into a linear classifier which is trained incrementally to discriminate between all classes. The incremental process is much faster with the proposed method compared to mainstream ones which update the entire deep model. Experiments are performed with three challenging datasets, and different incremental settings. A comparison with ten existing methods shows that our method outperforms the others in most cases.

Deep Planar Parallax for Monocular Depth Estimation. (arXiv:2301.03178v2 [cs.CV] UPDATED)

Authors: Haoqian Liang, Zhichao Li, Ya Yang, Naiyan Wang

Recent research has highlighted the utility of Planar Parallax Geometry in monocular depth estimation. However, its potential has yet to be fully realized because networks rely heavily on appearance for depth prediction. Our in-depth analysis reveals that utilizing flow-pretrain can optimize the network's usage of consecutive frame modeling, leading to substantial performance enhancement. Additionally, we propose Planar Position Embedding (PPE) to handle dynamic objects that defy static scene assumptions and to tackle slope variations that are challenging to differentiate. Comprehensive experiments on autonomous driving datasets, namely KITTI and the Waymo Open Dataset (WOD), prove that our Planar Parallax Network (PPNet) significantly surpasses existing learning-based methods in performance.

Trainable Loss Weights in Super-Resolution. (arXiv:2301.10575v2 [cs.CV] UPDATED)

Authors: Arash Chaichi Mellatshahi, Shohreh Kasaei

In recent years, limited research has discussed the loss function in the super-resolution process. The majority of those studies have only used perceptual similarity conventionally. This is while the development of appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network. In addition, in this article, the expectation-maximization method is used for the simultaneous estimation super-resolution network and weighting network. In addition, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of vector constants while keeping the output components between zero and one. As experimental results shows, weighted loss by the proposed method leads to better results than the unweighted loss and weighted loss based on uncertainty in both signal-to-noise and perceptual similarity senses on the state-of-the-art networks. Code is available online.

CHeart: A Conditional Spatio-Temporal Generative Model for Cardiac Anatomy. (arXiv:2301.13098v2 [eess.IV] UPDATED)

Authors: Mengyun Qiao, Shuo Wang, Huaqi Qiu, Antonio de Marvao, Declan P. O'Regan, Daniel Rueckert, Wenjia Bai

Two key questions in cardiac image analysis are to assess the anatomy and motion of the heart from images; and to understand how they are associated with non-imaging clinical factors such as gender, age and diseases. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and to answer the second question is still limited. In this work, we propose a novel conditional generative model to describe the 4D spatio-temporal anatomy of the heart and its interaction with non-imaging clinical factors. The clinical factors are integrated as the conditions of the generative modelling, which allows us to investigate how these factors influence the cardiac anatomy. We evaluate the model performance in mainly two tasks, anatomical sequence completion and sequence generation. The model achieves a high performance in anatomical sequence completion, comparable to or outperforming other state-of-the-art generative models. In terms of sequence generation, given clinical conditions, the model can generate realistic synthetic 4D sequential anatomies that share similar distributions with the real data.

Mixed Hierarchy Network for Image Restoration. (arXiv:2302.09554v4 [cs.CV] UPDATED)

Authors: Hu Gao, Depeng Dang

Image restoration is a long-standing low-level vision problem, e.g., deblurring and deraining. In the process of image restoration, it is necessary to consider not only the spatial details and contextual information of restoration to ensure the quality, but also the system complexity. Although many methods have been able to guarantee the quality of image restoration, the system complexity of the state-of-the-art (SOTA) methods is increasing as well. Motivated by this, we present a mixed hierarchy network that can balance these competing goals. Our main proposal is a mixed hierarchy architecture, that progressively recovers contextual information and spatial details from degraded images while we design intra-blocks to reduce system complexity. Specifically, our model first learns the contextual information using encoder-decoder architectures, and then combines them with high-resolution branches that preserve spatial detail. In order to reduce the system complexity of this architecture for convenient analysis and comparison, we replace or remove the nonlinear activation function with multiplication and use a simple network structure. In addition, we replace spatial convolution with global self-attention for the middle block of encoder-decoder. The resulting tightly interlinked hierarchy architecture, named as MHNet, delivers strong performance gains on several image restoration tasks, including image deraining, and deblurring.

Hierarchical Relationships: A New Perspective to Enhance Scene Graph Generation. (arXiv:2303.06842v4 [cs.CV] UPDATED)

Authors: Bowen Jiang, Camillo J. Taylor

This paper presents a finding that leveraging the hierarchical structures among labels for relationships and objects can substantially improve the performance of scene graph generation systems. The focus of this work is to create an informative hierarchical structure that can divide object and relationship categories into disjoint super-categories in a systematic way. Specifically, we introduce a Bayesian prediction head to jointly predict the super-category of relationships between a pair of object instances, as well as the detailed relationship within that super-category simultaneously, facilitating more informative predictions. The resulting model exhibits the capability to produce a more extensive set of predicates beyond the dataset annotations, and to tackle the prevalent issue of low annotation quality. While our paper presents preliminary findings, experiments on the Visual Genome dataset show its strong performance, particularly in predicate classifications and zero-shot settings, that demonstrates the promise of our approach.

MAPSeg: Unified Unsupervised Domain Adaptation for Heterogeneous Medical Image Segmentation Based on 3D Masked Autoencoding and Pseudo-Labeling. (arXiv:2303.09373v2 [cs.CV] UPDATED)

Authors: Xuzhe Zhang, Yuhao Wu, Elsa Angelini, Ang Li, Jia Guo, Jerod M. Rasmussen, Thomas G. O'Connor, Pathik D. Wadhwa, Andrea Parolin Jackowski, Hai Li, Jonathan Posner, Andrew F. Laine, Yun Wang

Robust segmentation is critical for deriving quantitative measures from large-scale, multi-center, and longitudinal medical scans. Manually annotating medical scans, however, is expensive and labor-intensive and may not always be available in every domain. Unsupervised domain adaptation (UDA) is a well-studied technique that alleviates this label-scarcity problem by leveraging available labels from another domain. In this study, we introduce Masked Autoencoding and Pseudo-Labeling Segmentation (MAPSeg), a $\textbf{unified}$ UDA framework with great versatility and superior performance for heterogeneous and volumetric medical image segmentation. To the best of our knowledge, this is the first study that systematically reviews and develops a framework to tackle four different domain shifts in medical image segmentation. More importantly, MAPSeg is the first framework that can be applied to $\textbf{centralized}$, $\textbf{federated}$, and $\textbf{test-time}$ UDA while maintaining comparable performance. We compare MAPSeg with previous state-of-the-art methods on a private infant brain MRI dataset and a public cardiac CT-MRI dataset, and MAPSeg outperforms others by a large margin (10.5 Dice improvement on the private MRI dataset and 5.7 on the public CT-MRI dataset). MAPSeg poses great practical value and can be applied to real-world problems. Our code and pretrained model will be available later.

Unleashing the Potential of Spiking Neural Networks by Dynamic Confidence. (arXiv:2303.10276v3 [cs.CV] UPDATED)

Authors: Chen Li, Edward Jones, Steve Furber

This paper presents a new methodology to alleviate the fundamental trade-off between accuracy and latency in spiking neural networks (SNNs). The approach involves decoding confidence information over time from the SNN outputs and using it to develop a decision-making agent that can dynamically determine when to terminate each inference.

The proposed method, Dynamic Confidence, provides several significant benefits to SNNs. 1. It can effectively optimize latency dynamically at runtime, setting it apart from many existing low-latency SNN algorithms. Our experiments on CIFAR-10 and ImageNet datasets have demonstrated an average 40% speedup across eight different settings after applying Dynamic Confidence. 2. The decision-making agent in Dynamic Confidence is straightforward to construct and highly robust in parameter space, making it extremely easy to implement. 3. The proposed method enables visualizing the potential of any given SNN, which sets a target for current SNNs to approach. For instance, if an SNN can terminate at the most appropriate time point for each input sample, a ResNet-50 SNN can achieve an accuracy as high as 82.47% on ImageNet within just 4.71 time steps on average. Unlocking the potential of SNNs needs a highly-reliable decision-making agent to be constructed and fed with a high-quality estimation of ground truth. In this regard, Dynamic Confidence represents a meaningful step toward realizing the potential of SNNs.

NVAutoNet: Fast and Accurate 360$^{\circ}$ 3D Visual Perception For Self Driving. (arXiv:2303.12976v4 [cs.CV] UPDATED)

Authors: Trung Pham, Mehran Maghoumi, Wanli Jiang, Bala Siva Sashank Jujjavarapu, Mehdi Sajjadi, Xin Liu, Hsuan-Chu Lin, Bor-Jeng Chen, Giang Truong, Chao Fang, Junghyun Kwon, Minwoo Park

Achieving robust and real-time 3D perception is fundamental for autonomous vehicles. While most existing 3D perception methods prioritize detection accuracy, they often overlook critical aspects such as computational efficiency, onboard chip deployment friendliness, resilience to sensor mounting deviations, and adaptability to various vehicle types. To address these challenges, we present NVAutoNet: a specialized Bird's-Eye-View (BEV) perception network tailored explicitly for automated vehicles. NVAutoNet takes synchronized camera images as input and predicts 3D signals like obstacles, freespaces, and parking spaces. The core of NVAutoNet's architecture (image and BEV backbones) relies on efficient convolutional networks, optimized for high performance using TensorRT. More importantly, our image-to-BEV transformation employs simple linear layers and BEV look-up tables, ensuring rapid inference speed. Trained on an extensive proprietary dataset, NVAutoNet consistently achieves elevated perception accuracy, operating remarkably at 53 frames per second on the NVIDIA DRIVE Orin SoC. Notably, NVAutoNet demonstrates resilience to sensor mounting deviations arising from diverse car models. Moreover, NVAutoNet excels in adapting to varied vehicle types, facilitated by inexpensive model fine-tuning procedures that expedite compatibility adjustments.

Segment Anything in 3D with NeRFs. (arXiv:2304.12308v4 [cs.CV] UPDATED)

Authors: Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Chen Yang, Wei Shen, Lingxi Xie, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian

Recently, the Segment Anything Model (SAM) emerged as a powerful vision foundation model which is capable to segment anything in 2D images. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the Neural Radiance Field (NeRF) as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, for Segment Anything in 3D. It is only required to provide a manual segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its 2D mask in this view with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively complete the 3D mask of the target object constructed with voxel grids. The former projects the 2D mask obtained by SAM in the current view onto 3D mask with guidance of the density distribution learned by the NeRF; The latter extracts reliable prompts automatically as the input to SAM from the NeRF-rendered 2D mask in another view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within minutes. Our research reveals a potential methodology to lift the ability of a 2D vision foundation model to 3D, as long as the 2D model can steadily address promptable segmentation across multiple views. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.

Event-Free Moving Object Segmentation from Moving Ego Vehicle. (arXiv:2305.00126v2 [cs.CV] UPDATED)

Authors: Zhuyun Zhou, Zongwei Wu, Danda Pani Paudel, Rémi Boutteau, Fan Yang, Luc Van Gool, Radu Timofte, Dominique Ginhac

Moving object segmentation (MOS) in dynamic scenes is challenging for autonomous driving, especially for sequences obtained from moving ego vehicles. Most state-of-the-art methods leverage motion cues obtained from optical flow maps. However, since these methods are often based on optical flows that are pre-computed from successive RGB frames, this neglects the temporal consideration of events occurring within inter-frame and limits the practicality of these methods in real-life situations. To address these limitations, we propose to exploit event cameras for better video understanding, which provide rich motion cues without relying on optical flow. To foster research in this area, we first introduce a novel large-scale dataset called DSEC-MOS for moving object segmentation from moving ego vehicles. Subsequently, we devise EmoFormer, a novel network able to exploit the event data. For this purpose, we fuse the event prior with spatial semantic maps to distinguish moving objects from the static background, adding another level of dense supervision around our object of interest - moving ones. Our proposed network relies only on event data for training but does not require event input during inference, making it directly comparable to frame-only methods in terms of efficiency and more widely usable in many application cases. An exhaustive comparison with 8 state-of-the-art video object segmentation methods highlights a significant performance improvement of our method over all other methods. Project Page: https://github.com/ZZY-Zhou/DSEC-MOS.

One-Prompt to Segment All Medical Images. (arXiv:2305.10300v2 [eess.IV] UPDATED)

Authors: Junde Wu, Min Xu

Large foundation models, known for their strong zero-shot generalization, have excelled in visual and language applications. However, applying them to medical image segmentation, a domain with diverse imaging types and target labels, remains an open challenge. Current approaches, such as adapting interactive segmentation models like Segment Anything Model (SAM), require user prompts for each sample during inference. Alternatively, transfer learning methods like few/one-shot models demand labeled samples, leading to high costs. This paper introduces a new paradigm toward the universal medical image segmentation, termed 'One-Prompt Segmentation.' One-Prompt Segmentation combines the strengths of one-shot and interactive methods. In the inference stage, with just \textbf{one prompted sample}, it can adeptly handle the unseen task in a single forward pass. We train One-Prompt Model on 64 open-source medical datasets, accompanied by the collection of over 3,000 clinician-labeled prompts. Tested on 14 previously unseen tasks, the One-Prompt Model showcases superior zero-shot segmentation capabilities, outperforming a wide range of related methods. The code and annotated data will be publicly released.

Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models. (arXiv:2305.12476v4 [cs.CV] UPDATED)

Authors: Lin Li, Jun Xiao, Guikun Chen, Jian Shao, Yueting Zhuang, Long Chen

Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence. (arXiv:2305.15347v2 [cs.CV] UPDATED)

Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.

GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes. (arXiv:2305.16037v3 [cs.CV] UPDATED)

Authors: Ibrahim Ethem Hamamci, Sezgin Er, Enis Simsar, Anjany Sekuboyina, Chinmay Prabhakar, Alperen Tezcan, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Furkan Almas, Irem Doğan, Muhammed Furkan Dasdelen, Hadrien Reynaud, Sarthak Pati, Christian Bluethgen, Mehmet Kemal Ozdemir, Bjoern Menze

In this paper, we introduce GenerateCT, a novel approach for generating CT volumes conditioned on free-form medical text prompts. GenerateCT includes a text encoder and three key components: a novel causal vision transformer for encoding CT volumes, a text-image transformer for aligning CT and text tokens, and a text-conditional super-resolution diffusion model. GenerateCT can produce realistic, high-resolution, and high-fidelity 3D chest CT volumes, validated by low FID and FVD scores. To explore GenerateCT's clinical applications, we evaluated its utility in a multi-abnormality classification task. First, we established a baseline by training a multi-abnormality classifier on our real dataset. To further assess the model's generalization to external datasets and its performance with unseen prompts in a zero-shot scenario, we employed an external dataset to train the classifier, setting an additional benchmark. We conducted two experiments in which we doubled the training datasets by synthesizing an equal number of volumes for each set using GenerateCT. The first experiment demonstrated an 11% improvement in the AP score when training the classifier jointly on real and generated volumes. The second experiment showed a 7% improvement when training on both real and generated volumes based on unseen prompts. Moreover, GenerateCT enables the scaling of synthetic training datasets to arbitrary sizes. As an example, we generated 100,000 CT volumes, fivefold the number in our real dataset, and trained the classifier exclusively on these synthetic volumes. Impressively, this classifier surpassed the performance of the one trained on all available real data by a margin of 8%. Lastly, domain experts evaluated the generated volumes, confirming a high degree of alignment with the text prompt. Our code and pre-trained models are available at: https://github.com/ibrahimethemhamamci/GenerateCT

ControlVideo: Conditional Control for One-shot Text-driven Video Editing and Beyond. (arXiv:2305.17098v2 [cs.CV] UPDATED)

Authors: Min Zhao, Rongzhen Wang, Fan Bao, Chongxuan Li, Jun Zhu

This paper presents \emph{ControlVideo} for text-driven video editing -- generating a video that aligns with a given text while preserving the structure of the source video. Building on a pre-trained text-to-image diffusion model, ControlVideo enhances the fidelity and temporal consistency by incorporating additional conditions (such as edge maps), and fine-tuning the key-frame and temporal attention on the source video-text pair via an in-depth exploration of the design space. Extensive experimental results demonstrate that ControlVideo outperforms various competitive baselines by delivering videos that exhibit high fidelity w.r.t. the source content, and temporal consistency, all while aligning with the text. By incorporating Low-rank adaptation layers into the model before training, ControlVideo is further empowered to generate videos that align seamlessly with reference images. More importantly, ControlVideo can be readily extended to the more challenging task of long video editing (e.g., with hundreds of frames), where maintaining long-range temporal consistency is crucial. To achieve this, we propose to construct a fused ControlVideo by applying basic ControlVideo to overlapping short video segments and key frame videos and then merging them by pre-defined weight functions. Empirical results validate its capability to create videos across 140 frames, which is approximately 5.83 to 17.5 times more than what previous works achieved. The code is available at \href{https://github.com/thu-ml/controlvideo}{https://github.com/thu-ml/controlvideo} and the visualization results are available at \href{https://drive.google.com/file/d/1wEgc2io3UwmoC5vTPbkccFvTkwVqsZlK/view?usp=drive_link}{HERE}.

HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance. (arXiv:2305.18766v3 [cs.CV] UPDATED)

Authors: Junzhe Zhu, Peiye Zhuang

The advancements in automatic text-to-3D generation have been remarkable. Most existing methods use pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. Yet, these methods often result in artifacts and inconsistencies across different views due to their suboptimal optimization approaches and limited understanding of 3D geometry. Moreover, the inherent constraints of NeRFs in rendering crisp geometry and stable textures usually lead to a two-stage optimization to attain high-resolution details. This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation, all in a single-stage optimization. We compute denoising scores in the text-to-image diffusion model's latent and image spaces. Instead of randomly sampling timesteps (also referred to as noise levels in denoising score matching), we introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization. To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel smoothing technique that refines importance sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling the generation of highly detailed and view-consistent 3D assets through a single-stage training process.

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation. (arXiv:2306.00519v3 [cs.CV] UPDATED)

Authors: Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, Hongsheng Li

We present DiffInDScene, a novel framework for tackling the problem of high-quality 3D indoor scene generation, which is challenging due to the complexity and diversity of the indoor scene geometry. Although diffusion-based generative models have previously demonstrated impressive performance in image generation and object-level 3D generation, they have not yet been applied to room-level 3D generation due to their computationally intensive costs. In DiffInDScene, we propose a cascaded 3D diffusion pipeline that is efficient and possesses strong generative performance for Truncated Signed Distance Function (TSDF). The whole pipeline is designed to run on a sparse occupancy space in a coarse-to-fine fashion. Inspired by KinectFusion's incremental alignment and fusion of local TSDF volumes, we propose a diffusion-based SDF fusion approach that iteratively diffuses and fuses local TSDF volumes, facilitating the generation of an entire room environment. The generated results demonstrate that our work is capable to achieve high-quality room generation directly in three-dimensional space, starting from scratch. In addition to the scene generation, the final part of DiffInDScene can be used as a post-processing module to refine the 3D reconstruction results from multi-view stereo. According to the user study, the mesh quality generated by our DiffInDScene can even outperform the ground truth mesh provided by ScanNet. Please visit our project page for the latest progress and demonstrations: https://github.com/AkiraHero/diffindscene.

Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models. (arXiv:2306.03089v2 [cs.CV] UPDATED)

Authors: Andrew F. Luo, Margaret M. Henderson, Leila Wehbe, Michael J. Tarr

A long standing goal in neuroscience has been to elucidate the functional organization of the brain. Within higher visual cortex, functional accounts have remained relatively coarse, focusing on regions of interest (ROIs) and taking the form of selectivity for broad categories such as faces, places, bodies, food, or words. Because the identification of such ROIs has typically relied on manually assembled stimulus sets consisting of isolated objects in non-ecological contexts, exploring functional organization without robust a priori hypotheses has been challenging. To overcome these limitations, we introduce a data-driven approach in which we synthesize images predicted to activate a given brain region using paired natural images and fMRI recordings, bypassing the need for category-specific stimuli. Our approach -- Brain Diffusion for Visual Exploration ("BrainDiVE") -- builds on recent generative methods by combining large-scale diffusion models with brain-guided image synthesis. Validating our method, we demonstrate the ability to synthesize preferred images with appropriate semantic specificity for well-characterized category-selective ROIs. We then show that BrainDiVE can characterize differences between ROIs selective for the same high-level category. Finally we identify novel functional subdivisions within these ROIs, validated with behavioral data. These results advance our understanding of the fine-grained functional organization of human visual cortex, and provide well-specified constraints for further examination of cortical organization using hypothesis-driven methods.

Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions. (arXiv:2306.09729v2 [cs.CV] UPDATED)

Authors: Dongshuo Yin, Xueting Han, Bin Li, Hao Feng, Jing Bai

Pre-training & fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable parameters. Despite their success, the existing PETL methods in CV can be computationally expensive and require large amounts of memory and time cost during training, which limits low-resource users from conducting research and applications on large models. In this work, we propose Parameter, Memory, and Time Efficient Visual Adapter ($\mathrm{E^3VA}$) tuning to address this issue. We provide a gradient backpropagation highway for low-rank adapters which eliminates the need for expensive backpropagation through the frozen pre-trained model, resulting in substantial savings of training memory and training time. Furthermore, we optimise the $\mathrm{E^3VA}$ structure for CV tasks to promote model performance. Extensive experiments on COCO, ADE20K, and Pascal VOC benchmarks show that $\mathrm{E^3VA}$ can save up to 62.2% training memory and 26.2% training time on average, while achieving comparable performance to full fine-tuning and better performance than most PETL methods. Note that we can even train the Swin-Large-based Cascade Mask RCNN on GTX 1080Ti GPUs with less than 1.5% trainable parameters.

Masked Diffusion Models Are Fast Distribution Learners. (arXiv:2306.11363v4 [cs.CV] UPDATED)

Authors: Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo Wang, Zhenguang Liu, Kui Ren

Diffusion model has emerged as the \emph{de-facto} model for image generation, yet the heavy training overhead hinders its broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may cause unnecessary training costs hence requiring in-depth investigation. In this work, we show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution that loosely characterizes the unknown real image distribution. Then the pre-trained model can be fine-tuned for various generation tasks efficiently. In the pre-training stage, we propose to mask a high proportion (e.g., up to 90\%) of input images to approximately represent the primer distribution and introduce a masked denoising score matching objective to train a model to denoise visible areas. In subsequent fine-tuning stage, we efficiently train diffusion model without masking. Utilizing the two-stage training framework, we achieves significant training acceleration and a new FID score record of 6.27 on CelebA-HQ $256 \times 256$ for ViT-based diffusion models. The generalizability of a pre-trained model further helps building models that perform better than ones trained from scratch on different downstream datasets. For instance, a diffusion model pre-trained on VGGFace2 attains a 46\% quality improvement when fine-tuned on a different dataset that contains only 3000 images. Our code is available at \url{https://github.com/jiachenlei/maskdm}.

DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data. (arXiv:2306.14153v3 [cs.CV] UPDATED)

Authors: Jingyuan Zhu, Huimin Ma, Jiansheng Chen, Jian Yuan

Denoising diffusion probabilistic models (DDPMs) have been proven capable of synthesizing high-quality images with remarkable diversity when trained on large amounts of data. Typical diffusion models and modern large-scale conditional generative models like text-to-image generative models are vulnerable to overfitting when fine-tuned on extremely limited data. Existing works have explored subject-driven generation using a reference set containing a few images. However, few prior works explore DDPM-based domain-driven generation, which aims to learn the common features of target domains while maintaining diversity. This paper proposes a novel DomainStudio approach to adapt DDPMs pre-trained on large-scale source datasets to target domains using limited data. It is designed to keep the diversity of subjects provided by source domains and get high-quality and diverse adapted samples in target domains. We propose to keep the relative distances between adapted samples to achieve considerable generation diversity. In addition, we further enhance the learning of high-frequency details for better generation quality. Our approach is compatible with both unconditional and conditional diffusion models. This work makes the first attempt to realize unconditional few-shot image generation with diffusion models, achieving better quality and greater diversity than current state-of-the-art GAN-based approaches. Moreover, this work also significantly relieves overfitting for conditional generation and realizes high-quality domain-driven generation, further expanding the applicable scenarios of modern large-scale text-to-image models.

GraSS: Contrastive Learning with Gradient Guided Sampling Strategy for Remote Sensing Image Semantic Segmentation. (arXiv:2306.15868v3 [cs.LG] UPDATED)

Authors: Zhaoyang Zhang, Zhen Ren, Chao Tao, Yunsheng Zhang, Chengli Peng, Haifeng Li

Self-supervised contrastive learning (SSCL) has achieved significant milestones in remote sensing image (RSI) understanding. Its essence lies in designing an unsupervised instance discrimination pretext task to extract image features from a large number of unlabeled images that are beneficial for downstream tasks. However, existing instance discrimination based SSCL suffer from two limitations when applied to the RSI semantic segmentation task: 1) Positive sample confounding issue; 2) Feature adaptation bias. It introduces a feature adaptation bias when applied to semantic segmentation tasks that require pixel-level or object-level features. In this study, We observed that the discrimination information can be mapped to specific regions in RSI through the gradient of unsupervised contrastive loss, these specific regions tend to contain singular ground objects. Based on this, we propose contrastive learning with Gradient guided Sampling Strategy (GraSS) for RSI semantic segmentation. GraSS consists of two stages: Instance Discrimination warm-up (ID warm-up) and Gradient guided Sampling contrastive training (GS training). The ID warm-up aims to provide initial discrimination information to the contrastive loss gradients. The GS training stage aims to utilize the discrimination information contained in the contrastive loss gradients and adaptively select regions in RSI patches that contain more singular ground objects, in order to construct new positive and negative samples. Experimental results on three open datasets demonstrate that GraSS effectively enhances the performance of SSCL in high-resolution RSI semantic segmentation. Compared to seven baseline methods from five different types of SSCL, GraSS achieves an average improvement of 1.57\% and a maximum improvement of 3.58\% in terms of mean intersection over the union. The source code is available at https://github.com/GeoX-Lab/GraSS

Stitched ViTs are Flexible Vision Backbones. (arXiv:2307.00154v2 [cs.CV] UPDATED)

Authors: Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, Bohan Zhuang

Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works utilizing off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs with individual sizes requires separate trainings and is restricted by fixed performance-efficiency trade-offs. In this paper, we are inspired by stitchable neural networks (SN-Net), which is a new framework that cheaply produces a single model that covers rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling strategy that takes into account the underlying FLOPs distributions in the space for better sampling. Finally, we observe that learning stitching layers as a low-rank update plays an essential role on downstream tasks to stabilize training and ensure a good Pareto frontier. With extensive experiments on ImageNet-1K, ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 demonstrates superior performance over SN-Netv1 on downstream dense predictions and shows strong ability as a flexible vision backbone, achieving great advantages in both training efficiency and deployment flexibility. Code is available at https://github.com/ziplab/SN-Netv2.

High-performance real-world optical computing trained by in situ model-free optimization. (arXiv:2307.11957v4 [physics.optics] UPDATED)

Authors: Guangyuan Zhao, Xin Shu

Optical computing systems provide high-speed and low-energy data processing but face deficiencies in computationally demanding training and simulation-to-reality gaps. We propose a model-free optimization (MFO) method based on a score gradient estimation algorithm for computationally efficient in situ training of optical computing systems. This approach treats an optical computing system as a black box and back-propagates the loss directly to the optical computing weights' probability distributions, circumventing the need for a computationally heavy and biased system simulation. Our experiments on a single-layer diffractive optical computing system show that MFO outperforms hybrid training on the MNIST and FMNIST datasets. Furthermore, we demonstrate image-free and high-speed classification of cells from their phase maps. Our method's model-free and high-performance nature, combined with its low demand for computational resources, expedites the transition of optical computing from laboratory demonstrations to real-world applications.

Unsupervised Image Outlier Detection using RANSAC. (arXiv:2307.12301v2 [cs.CV] UPDATED)

Authors: Chen-Han Tsai, Yu-Shao Peng

Image outlier detection (OD) is an essential tool to ensure the quality and accuracy of image datasets used in computer vision tasks. Most existing approaches, however, require a set of in-distribution data for training prior to outlier prediction. The quality and quantity of the data can influence the resulting performance. Thus, selecting a suitable in-distribution set often requires considerable effort. In this work, we propose RANSAC-NN, an unsupervised image OD algorithm designed to detect outliers within contaminated sets in a one-class classification fashion. Without any training, RANSAC-NN performs favorably in comparison to other well-established methods in a variety of OD benchmarks. Furthermore, we show that our method can enhance the robustness of existing OD methods by simply applying RANSAC-NN during pre-processing.

Towards Discriminative Representation with Meta-learning for Colonoscopic Polyp Re-Identification. (arXiv:2308.00929v2 [cs.CV] UPDATED)

Authors: Suncheng Xiang, Qingzhong Chen, Shilun Cai, Chengfeng Zhou, Crystal Cai, Sijia Du, Zhengjie Zhang, Yunshi Zhong, Dahong Qian

Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras and plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Additionally, these methods neglect to explore the potential of self-discrepancy among intra-class relations in the colonoscopic polyp dataset, which remains an open research problem in the medical community. To solve this dilemma, we propose a simple but effective training method named Colo-ReID, which can help our model learn more general and discriminative knowledge based on the meta-learning strategy in scenarios with fewer samples. Based on this, a dynamic Meta-Learning Regulation mechanism called MLR is introduced to further boost the performance of polyp re-identification. To the best of our knowledge, this is the first attempt to leverage the meta-learning paradigm instead of traditional machine learning algorithm to effectively train deep models in the task of colonoscopic polyp re-identification. Empirical results show that our method significantly outperforms current state-of-the-art methods by a clear margin.

FedSOL: Stabilized Orthogonal Learning in Federated Learning. (arXiv:2308.12532v3 [cs.LG] UPDATED)

Authors: Gihun Lee, Minchan Jeong, Sangmook Kim, Jaehoon Oh, Se-Young Yun

Federated Learning (FL) aggregates locally trained models from individual clients to construct a global model. While FL enables learning a model with data privacy, it often suffers from significant performance degradation when client data distributions are heterogeneous. Many previous FL algorithms have addressed this issue by introducing various proximal restrictions. These restrictions aim to encourage global alignment by constraining the deviation of local learning from the global objective. However, they inherently limit local learning by interfering with the original local objectives. Recently, an alternative approach has emerged to improve local learning generality. By obtaining local models within a smooth loss landscape, this approach mitigates conflicts among different local objectives of the clients. Yet, it does not ensure stable global alignment, as local learning does not take the global objective into account. In this study, we propose Federated Stability on Learning (FedSoL), which combines both the concepts of global alignment and local generality. In FedSoL, the local learning seeks a parameter region robust against proximal perturbations. This strategy introduces an implicit proximal restriction effect in local learning while maintaining the original local objective for parameter update. Our experiments show that FedSoL consistently achieves state-of-the-art performance on various setups.

PolarRec: Radio Interferometric Data Reconstruction with Polar Coordinate Representation. (arXiv:2308.14610v2 [astro-ph.IM] UPDATED)

Authors: Ruoqi Wang, Zhuoyang Chen, Jiayi Zhu, Qiong Luo, Feng Wang

In radio astronomy, visibility data, which are measurements of wave signals from radio telescopes, are transformed into images for observation of distant celestial objects. However, these resultant images usually contain both real sources and artifacts, due to signal sparsity and other factors. One way to obtain cleaner images is to reconstruct samples into dense forms before imaging. Unfortunately, existing reconstruction methods often miss some components of visibility in frequency domain, so blurred object edges and persistent artifacts remain in the images. Furthermore, the computation overhead is high on irregular visibility samples due to the data skew. To address these problems, we propose PolarRec, a transformer-encoder-conditioned reconstruction pipeline with visibility samples converted into the polar coordinate representation. This representation matches the way in which radio telescopes observe a celestial area as the Earth rotates. As a result, visibility samples distribute in the polar system more uniformly than in the Cartesian space. Therefore, we propose to use radial distance in the loss function, to help reconstruct complete visibility effectively. Also, we group visibility samples by their polar angles and propose a group-based encoding scheme to improve the efficiency. Our experiments demonstrate that PolarRec markedly improves imaging results by faithfully reconstructing all frequency components in the visibility domain while significantly reducing the computation cost in visibility data encoding. We believe this high-quality and high-efficiency imaging of PolarRec will better facilitate astronomers to conduct their research.

Diffusion Models for Interferometric Satellite Aperture Radar. (arXiv:2308.16847v2 [cs.CV] UPDATED)

Authors: Alexandre Tuel, Thomas Kerdreux, Claudia Hulbert, Bertrand Rouet-Leduc

Probabilistic Diffusion Models (PDMs) have recently emerged as a very promising class of generative models, achieving high performance in natural image generation. However, their performance relative to non-natural images, like radar-based satellite data, remains largely unknown. Generating large amounts of synthetic (and especially labelled) satellite data is crucial to implement deep-learning approaches for the processing and analysis of (interferometric) satellite aperture radar data. Here, we leverage PDMs to generate several radar-based satellite image datasets. We show that PDMs succeed in generating images with complex and realistic structures, but that sampling time remains an issue. Indeed, accelerated sampling strategies, which work well on simple image datasets like MNIST, fail on our radar datasets. We provide a simple and versatile open-source https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation to train, sample and evaluate PDMs using any dataset on a single GPU.

Exploiting Causality Signals in Medical Images: A Pilot Study with Empirical Results. (arXiv:2309.10399v2 [cs.CV] UPDATED)

Authors: Gianluca Carloni, Sara Colantonio

We present a novel technique to discover and exploit weak causal signals directly from images via neural networks for classification purposes. This way, we model how the presence of a feature in one part of the image affects the appearance of another feature in a different part of the image. Our method consists of a convolutional neural network backbone and a causality-factors extractor module, which computes weights to enhance each feature map according to its causal influence in the scene. We developed different architecture variants and empirically evaluated all of our models on two public datasets of prostate MRI images and breast histopathology slides for cancer diagnosis. To confirm our quantitative results, we conduct ablation studies and investigate the explainability of our models via class activation maps. Our findings show that our lightweight block extracts meaningful information and improves the overall classification, together with producing more robust predictions that focus on relevant parts of the image. That is crucial in medical imaging, where accurate and reliable classifications are essential for effective diagnosis and treatment planning.

Towards Attributions of Input Variables in a Coalition. (arXiv:2309.13411v2 [cs.LG] UPDATED)

Authors: Xinhao Zheng, Huiqi Deng, Bo Fan, Quanshi Zhang

This paper aims to develop a new attribution method to explain the conflict between individual variables' attributions and their coalition's attribution from a fully new perspective. First, we find that the Shapley value can be reformulated as the allocation of Harsanyi interactions encoded by the AI model. Second, based the re-alloction of interactions, we extend the Shapley value to the attribution of coalitions. Third we ective. We derive the fundamental mechanism behind the conflict. This conflict come from the interaction containing partial variables in their coalition.

MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field. (arXiv:2309.13607v2 [cs.CV] UPDATED)

Authors: Zijiang Yang, Zhongwei Qiu, Chang Xu, Dongmei Fu

3D style transfer aims to generate stylized views of 3D scenes with specified styles, which requires high-quality generating and keeping multi-view consistency. Existing methods still suffer the challenges of high-quality stylization with texture details and stylization with multimodal guidance. In this paper, we reveal that the common training method of stylization with NeRF, which generates stylized multi-view supervision by 2D style transfer models, causes the same object in supervision to show various states (color tone, details, etc.) in different views, leading NeRF to tend to smooth the texture details, further resulting in low-quality rendering for 3D multi-style transfer. To tackle these problems, we propose a novel Multimodal-guided 3D Multi-style transfer of NeRF, termed MM-NeRF. First, MM-NeRF projects multimodal guidance into a unified space to keep the multimodal styles consistency and extracts multimodal features to guide the 3D stylization. Second, a novel multi-head learning scheme is proposed to relieve the difficulty of learning multi-style transfer, and a multi-view style consistent loss is proposed to track the inconsistency of multi-view supervision data. Finally, a novel incremental learning mechanism to generalize MM-NeRF to any new style with small costs. Extensive experiments on several real-world datasets show that MM-NeRF achieves high-quality 3D multi-style stylization with multimodal guidance, and keeps multi-view consistency and style consistency between multimodal guidance. Codes will be released.

PRIS: Practical robust invertible network for image steganography. (arXiv:2309.13620v2 [cs.CV] UPDATED)

Authors: Hang Yang, Yitian Xu, Xuhua Liu, Xiaodong Ma

Image steganography is a technique of hiding secret information inside another image, so that the secret is not visible to human eyes and can be recovered when needed. Most of the existing image steganography methods have low hiding robustness when the container images affected by distortion. Such as Gaussian noise and lossy compression. This paper proposed PRIS to improve the robustness of image steganography, it based on invertible neural networks, and put two enhance modules before and after the extraction process with a 3-step training strategy. Moreover, rounding error is considered which is always ignored by existing methods, but actually it is unavoidable in practical. A gradient approximation function (GAF) is also proposed to overcome the undifferentiable issue of rounding distortion. Experimental results show that our PRIS outperforms the state-of-the-art robust image steganography method in both robustness and practicability. Codes are available at https://github.com/yanghangAI/PRIS, demonstration of our model in practical at this http URL

CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free. (arXiv:2309.14289v2 [cs.CV] UPDATED)

Authors: Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni

The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO. The code is available at this http URL

Directional Texture Editing for 3D Models. (arXiv:2309.14872v3 [cs.CV] UPDATED)

Authors: Shengqi Liu, Zhuo Chen, Jingnan Gao, Yichao Yan, Wenhan Zhu, Ke Gao, Jiangjing Lyu, Xiaokang Yang

Texture editing is a crucial task in 3D modeling that allows users to automatically manipulate the surface materials of 3D models. However, the inherent complexity of 3D models and the ambiguous text description lead to the challenge in this task. To address this challenge, we propose ITEM3D, a \textbf{T}exture \textbf{E}diting \textbf{M}odel designed for automatic \textbf{3D} object editing according to the text \textbf{I}nstructions. Leveraging the diffusion models and the differentiable rendering, ITEM3D takes the rendered images as the bridge of text and 3D representation, and further optimizes the disentangled texture and environment map. Previous methods adopted the absolute editing direction namely score distillation sampling (SDS) as the optimization objective, which unfortunately results in the noisy appearance and text inconsistency. To solve the problem caused by the ambiguous text, we introduce a relative editing direction, an optimization objective defined by the noise difference between the source and target texts, to release the semantic ambiguity between the texts and images. Additionally, we gradually adjust the direction during optimization to further address the unexpected deviation in the texture domain. Qualitative and quantitative experiments show that our ITEM3D outperforms the state-of-the-art methods on various 3D objects. We also perform text-guided relighting to show explicit control over lighting. Our project page: \href{https://shengqiliu1.github.io/ITEM3D}{https://shengqiliu1.github.io/ITEM3D}.

Masked Autoencoders are Scalable Learners of Cellular Morphology. (arXiv:2309.16064v2 [cs.CV] UPDATED)

Authors: Oren Kraus, Kian Kenyon-Dean, Saber Saberian, Maryam Fallah, Peter McLean, Jess Leung, Vasudev Sharma, Ayla Khan, Jia Balakrishnan, Safiye Celik, Maciej Sypetkowski, Chi Vicky Cheng, Kristen Morse, Maureen Makes, Ben Mabey, Berton Earnshaw

Inferring biological relationships from cellular phenotypes in high-content microscopy screens provides significant opportunity and challenge in biological research. Prior results have shown that deep vision models can capture biological signal better than hand-crafted features. This work explores how self-supervised deep learning approaches scale when training larger models on larger microscopy datasets. Our results show that both CNN- and ViT-based masked autoencoders significantly outperform weakly supervised baselines. At the high-end of our scale, a ViT-L/8 trained on over 3.5-billion unique crops sampled from 93-million microscopy images achieves relative improvements as high as 28% over our best weakly supervised baseline at inferring known biological relationships curated from public databases. Relevant code and select models released with this work can be found at: https://github.com/recursionpharma/maes_microscopy.

Large Scale Masked Autoencoding for Reducing Label Requirements on SAR Data. (arXiv:2310.00826v2 [cs.CV] UPDATED)

Authors: Matt Allen, Francisco Dorr, Joseph A. Gallego-Mejia, Laura Martínez-Ferrer, Anna Jungbluth, Freddie Kalaitzis, Raúl Ramos-Pollán

Satellite-based remote sensing is instrumental in the monitoring and mitigation of the effects of anthropogenic climate change. Large scale, high resolution data derived from these sensors can be used to inform intervention and policy decision making, but the timeliness and accuracy of these interventions is limited by use of optical data, which cannot operate at night and is affected by adverse weather conditions. Synthetic Aperture Radar (SAR) offers a robust alternative to optical data, but its associated complexities limit the scope of labelled data generation for traditional deep learning. In this work, we apply a self-supervised pretraining scheme, masked autoencoding, to SAR amplitude data covering 8.7\% of the Earth's land surface area, and tune the pretrained weights on two downstream tasks crucial to monitoring climate change - vegetation cover prediction and land cover classification. We show that the use of this pretraining scheme reduces labelling requirements for the downstream tasks by more than an order of magnitude, and that this pretraining generalises geographically, with the performance gain increasing when tuned downstream on regions outside the pretraining set. Our findings significantly advance climate change mitigation by facilitating the development of task and region-specific SAR models, allowing local communities and organizations to deploy tailored solutions for rapid, accurate monitoring of climate change effects.

Offline Tracking with Object Permanence. (arXiv:2310.01288v3 [cs.CV] UPDATED)

Authors: Xianzhong Liu, Holger Caesar

To reduce the expensive labor cost for manual labeling autonomous driving datasets, an alternative is to automatically label the datasets using an offline perception system. However, objects might be temporally occluded. Such occlusion scenarios in the datasets are common yet underexplored in offline auto labeling. In this work, we propose an offline tracking model that focuses on occluded object tracks. It leverages the concept of object permanence which means objects continue to exist even if they are not observed anymore. The model contains three parts: a standard online tracker, a re-identification (Re-ID) module that associates tracklets before and after occlusion, and a track completion module that completes the fragmented tracks. The Re-ID module and the track completion module use the vectorized map as one of the inputs to refine the tracking results with occlusion. The model can effectively recover the occluded object trajectories. It achieves state-of-the-art performance in 3D multi-object tracking by significantly improving the original online tracking result, showing its potential to be applied in offline auto labeling as a useful plugin to improve tracking by recovering occlusions.

Extending CAM-based XAI methods for Remote Sensing Imagery Segmentation. (arXiv:2310.01837v2 [cs.CV] UPDATED)

Authors: Abdul Karim Gizzini, Mustafa Shukor, Ali J. Ghandour

Current AI-based methods do not provide comprehensible physical interpretations of the utilized data, extracted features, and predictions/inference operations. As a result, deep learning models trained using high-resolution satellite imagery lack transparency and explainability and can be merely seen as a black box, which limits their wide-level adoption. Experts need help understanding the complex behavior of AI models and the underlying decision-making process. The explainable artificial intelligence (XAI) field is an emerging field providing means for robust, practical, and trustworthy deployment of AI models. Several XAI techniques have been proposed for image classification tasks, whereas the interpretation of image segmentation remains largely unexplored. This paper offers to bridge this gap by adapting the recent XAI classification algorithms and making them usable for muti-class image segmentation, where we mainly focus on buildings' segmentation from high-resolution satellite images. To benchmark and compare the performance of the proposed approaches, we introduce a new XAI evaluation methodology and metric based on "Entropy" to measure the model uncertainty. Conventional XAI evaluation methods rely mainly on feeding area-of-interest regions from the image back to the pre-trained (utility) model and then calculating the average change in the probability of the target class. Those evaluation metrics lack the needed robustness, and we show that using Entropy to monitor the model uncertainty in segmenting the pixels within the target class is more suitable. We hope this work will pave the way for additional XAI research for image segmentation and applications in the remote sensing discipline.

Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond. (arXiv:2310.02071v4 [cs.AI] UPDATED)

Authors: Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu Liu, Baobao Chang

In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used due to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaborations between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. Additionally, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making. We compare end-to-end embodied decision-making and HOLMES on our benchmark and find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in terms of average decision accuracy (+3%). However, this performance is exclusive to the latest GPT4-Vision model, surpassing the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents, offering new avenues for MLLM research. Code and data are open at https://github.com/pkunlp-icler/PCA-EVAL/.

Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models. (arXiv:2310.03059v3 [cs.CV] UPDATED)

Authors: Ivan Tang, Ray Zhang, Zoey Guo, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li

The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaption cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code will be released at https://github.com/Even-JK/PEFT-3D.

On the Performance of Multimodal Language Models. (arXiv:2310.03211v2 [cs.CL] UPDATED)

Authors: Utsav Garg, Erhan Bas

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently pretrained vision encoders through model grafting. These multimodal variants undergo instruction tuning, similar to LLMs, enabling effective zero-shot generalization for multimodal tasks. This study conducts a comparative analysis of different multimodal instruction tuning approaches and evaluates their performance across a range of tasks, including complex reasoning, conversation, image captioning, multiple-choice questions (MCQs), and binary classification. Through rigorous benchmarking and ablation experiments, we reveal key insights for guiding architectural choices when incorporating multimodal capabilities into LLMs. However, current approaches have limitations; they do not sufficiently address the need for a diverse multimodal instruction dataset, which is crucial for enhancing task generalization. Additionally, they overlook issues related to truthfulness and factuality when generating responses. These findings illuminate current methodological constraints in adapting language models for image comprehension and provide valuable guidance for researchers and practitioners seeking to harness multimodal versions of LLMs.

What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models. (arXiv:2310.06627v3 [cs.CL] UPDATED)

Authors: Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Yongshuo Zong, Xin Wen, Bingchen Zhao

Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating alternatives to established facts or past events, significantly enhancing our abilities in planning and decision-making. In light of the advancements in current multi-modal large language models, we explore their effectiveness in counterfactual reasoning. To facilitate this investigation, we introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern multi-modal large language models. This dataset is constructed by infusing original questions with counterfactual presuppositions, spanning various types such as numerical and boolean queries. It encompasses a mix of real and synthetic data, representing a wide range of difficulty levels. Our thorough evaluations of contemporary vision-language models using this dataset have revealed substantial performance drops, with some models showing up to a 40% decrease, highlighting a significant gap between current models and human-like vision reasoning capabilities. We hope our dataset will serve as a vital benchmark for evaluating the counterfactual reasoning capabilities of models. Code and dataset are publicly available at https://bzhao.me/C-VQA/.

COVID-19 detection using ViT transformer-based approach from Computed Tomography Images. (arXiv:2310.08165v2 [eess.IV] UPDATED)

Authors: Kenan Morani

In here, we introduce a novel approach to enhance the accuracy and efficiency of COVID-19 diagnosis using CT images. Leveraging state-of-the-art Transformer models in computer vision, we employed the base ViT Transformer configured for 224x224-sized input images, modifying the output to suit the binary classification task. Notably, input images were resized from the standard CT scan size of 512x512 to match the model's expectations. Our method implements a systematic patient-level prediction strategy, classifying individual CT slices as COVID-19 or non-COVID. To determine the overall diagnosis for each patient, a majority voting approach as well as other thresholding approaches were employed. This method involves evaluating all CT slices for a given patient and assigning the patient the diagnosis that relates to the thresholding for the CT scan. This meticulous patient-level prediction process contributes to the robustness of our solution as it starts from 2D-slices to 3D-patient level. Throughout the evaluation process, our approach resulted in 0.7 macro F1 score on the COV19-CT -DB validation set. To ensure the reliability and effectiveness of our model, we rigorously validate it on the extensive COV-19 CT dataset, which is meticulously annotated for the task. This dataset, with its comprehensive annotations, reinforces the overall robustness of our solution.

LLM4SGG: Large Language Model for Weakly Supervised Scene Graph Generation. (arXiv:2310.10404v5 [cs.CV] UPDATED)

Authors: Kibum Kim, Kanghoon Yoon, Jaehyeong Jeon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park

Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.

Training Image Derivatives: Increased Accuracy and Universal Robustness. (arXiv:2310.14045v2 [cs.LG] UPDATED)

Authors: Vsevolod I. Avrutskiy

Derivative training is a known method that significantly improves the accuracy of neural networks in some low-dimensional applications. In this paper, a similar improvement is obtained for an image analysis problem: reconstructing the vertices of a cube from its image. By training the derivatives with respect to the 6 degrees of freedom of the cube, we obtain 25 times more accurate results for noiseless inputs. The derivatives also offer insight into the robustness problem, which is currently understood in terms of two types of network vulnerabilities. The first type involves small perturbations that dramatically change the output, and the second type relates to substantial image changes that the network erroneously ignores. Defense against each is possible, but safeguarding against both while maintaining the accuracy defies conventional training methods. The first type is analyzed using the network's gradient, while the second relies on human input evaluation, serving as an oracle substitute. For the task at hand, the nearest neighbor oracle can be defined and expanded into Taylor series using image derivatives. This allows for a robustness analysis that unifies both types of vulnerabilities and enables training where accuracy and universal robustness are limited only by network capacity.

SemanticBoost: Elevating Motion Generation with Augmented Textual Cues. (arXiv:2310.20323v2 [cs.CV] UPDATED)

Authors: Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan

Current techniques face difficulties in generating motions from intricate semantic descriptions, primarily due to insufficient semantic annotations in datasets and weak contextual understanding. To address these issues, we present SemanticBoost, a novel framework that tackles both challenges simultaneously. Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary semantics from motion data, enriching the dataset's textual description and ensuring precise alignment between text and motion data without depending on large language models. On the other hand, the CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences by effectively capturing context information and aligning the generated motion with the given textual descriptions. Distinct from existing methods, our approach can synthesize accurate orientational movements, combined motions based on specific body part descriptions, and motions generated from complex, extended sentences. Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques, achieving cutting-edge performance on the Humanml3D dataset while maintaining realistic and smooth motion generation quality.

Enhanced Synthetic MRI Generation from CT Scans Using CycleGAN with Feature Extraction. (arXiv:2310.20604v2 [eess.IV] UPDATED)

Authors: Saba Nikbakhsh, Lachin Naghashyar, Morteza Valizadeh, Mehdi Chehel Amirani

In the field of radiotherapy, accurate imaging and image registration are of utmost importance for precise treatment planning. Magnetic Resonance Imaging (MRI) offers detailed imaging without being invasive and excels in soft-tissue contrast, making it a preferred modality for radiotherapy planning. However, the high cost of MRI, longer acquisition time, and certain health considerations for patients pose challenges. Conversely, Computed Tomography (CT) scans offer a quicker and less expensive imaging solution. To bridge these modalities and address multimodal alignment challenges, we introduce an approach for enhanced monomodal registration using synthetic MRI images. Utilizing unpaired data, this paper proposes a novel method to produce these synthetic MRI images from CT scans, leveraging CycleGANs and feature extractors. By building upon the foundational work on Cycle-Consistent Adversarial Networks and incorporating advancements from related literature, our methodology shows promising results, outperforming several state-of-the-art methods. The efficacy of our approach is validated by multiple comparison metrics.

CapST: An Enhanced and Lightweight Model Attribution Approach for Synthetic Videos. (arXiv:2311.03782v2 [cs.CV] UPDATED)

Authors: Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang, Gaddisa Olani Ganfure, Sarwar Khan, Sahibzada Adil Shahzad

Deepfake videos, generated through AI faceswapping techniques, have garnered considerable attention due to their potential for powerful impersonation attacks. While existing research primarily focuses on binary classification to discern between real and fake videos, however determining the specific generation model for a fake video is crucial for forensic investigation. Addressing this gap, this paper investigates the model attribution problem of Deepfake videos from a recently proposed dataset, Deepfakes from Different Models (DFDM), derived from various Autoencoder models. The dataset comprises 6,450 Deepfake videos generated by five distinct models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio. This study formulates Deepfakes model attribution as a multiclass classification task, proposing a segment of VGG19 as a feature extraction backbone, known for its effectiveness in imagerelated tasks, while integrated a Capsule Network with a Spatio-Temporal attention mechanism. The Capsule module captures intricate hierarchies among features for robust identification of deepfake attributes. Additionally, the video-level fusion technique leverages temporal attention mechanisms to handle concatenated feature vectors, capitalizing on inherent temporal dependencies in deepfake videos. By aggregating insights across frames, our model gains a comprehensive understanding of video content, resulting in more precise predictions. Experimental results on the deepfake benchmark dataset (DFDM) demonstrate the efficacy of our proposed method, achieving up to a 4% improvement in accurately categorizing deepfake videos compared to baseline models while demanding fewer computational resources.

CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation. (arXiv:2311.04942v2 [eess.IV] UPDATED)

Authors: Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao, Xiaoxi Du, Kaifeng Pang, Qi Miao, Steven S. Raman, Demetri Terzopoulos, Kyunghyun Sung

A large portion of volumetric medical data, especially magnetic resonance imaging (MRI) data, is anisotropic, as the through-plane resolution is typically much lower than the in-plane resolution. Both 3D and purely 2D deep learning-based segmentation methods are deficient in dealing with such volumetric data since the performance of 3D methods suffers when confronting anisotropic data, and 2D methods disregard crucial volumetric information. Insufficient work has been done on 2.5D methods, in which 2D convolution is mainly used in concert with volumetric information. These models focus on learning the relationship across slices, but typically have many parameters to train. We offer a Cross-Slice Attention Module (CSAM) with minimal trainable parameters, which captures information across all the slices in the volume by applying semantic, positional, and slice attention on deep feature maps at different scales. Our extensive experiments using different network architectures and tasks demonstrate the usefulness and generalizability of CSAM. Associated code is available at https://github.com/aL3x-O-o-Hung/CSAM.

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving. (arXiv:2311.05332v2 [cs.CV] UPDATED)

Authors: Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao

The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuance of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in the development of common sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLM) represents a novel frontier in realizing fully autonomous vehicle driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests span from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It showcases the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. Project is now available on GitHub for interested parties to access and utilize: \url{https://github.com/PJLab-ADG/GPT4V-AD-Exploration}

BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis. (arXiv:2311.05521v2 [cs.GR] UPDATED)

Authors: Hao-Bin Duan, Miao Wang, Jin-Chuan Shi, Xu-Chuan Chen, Yan-Pei Cao

Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avatar synthesis, deployable in a standard polygon rasterization pipeline. Our approach extracts deformable multi-layer meshes from learned isosurfaces of the head and computes expression-, pose-, and view-dependent appearances that can be baked into static textures for efficient rasterization. We thus propose a three-stage pipeline for neural head avatar synthesis, which includes learning continuous deformation, manifold, and radiance fields, extracting layered meshes and textures, and fine-tuning texture details with differential rasterization. Experimental results demonstrate that our representation generates synthesis results of comparable quality to other state-of-the-art methods while significantly reducing the inference time required. We further showcase various head avatar synthesis results from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates.

Generation Of Colors using Bidirectional Long Short Term Memory Networks. (arXiv:2311.06542v2 [cs.CV] UPDATED)

Authors: A. Sinha

Human vision can distinguish between a vast spectrum of colours, estimated to be between 2 to 7 million discernible shades. However, this impressive range does not inherently imply that all these colours have been precisely named and described within our lexicon. We often associate colours with familiar objects and concepts in our daily lives. This research endeavors to bridge the gap between our visual perception of countless shades and our ability to articulate and name them accurately. A novel model has been developed to achieve this goal, leveraging Bidirectional Long Short-Term Memory (BiLSTM) networks with Active learning. This model operates on a proprietary dataset meticulously curated for this study. The primary objective of this research is to create a versatile tool for categorizing and naming previously unnamed colours or identifying intermediate shades that elude traditional colour terminology. The findings underscore the potential of this innovative approach in revolutionizing our understanding of colour perception and language. Through rigorous experimentation and analysis, this study illuminates a promising avenue for Natural Language Processing (NLP) applications in diverse industries. By facilitating the exploration of the vast colour spectrum the potential applications of NLP are extended beyond conventional boundaries.

Improving Image Captioning via Predicting Structured Concepts. (arXiv:2311.08223v2 [cs.CV] UPDATED)

Authors: Ting Wang, Weidong Chen, Yuanhe Tian, Yan Song, Zhendong Mao

Having the difficulty of solving the semantic gap between images and texts for the image captioning task, conventional studies in this area paid some attention to treating semantic concepts as a bridge between the two modalities and improved captioning performance accordingly. Although promising results on concept prediction were obtained, the aforementioned studies normally ignore the relationship among concepts, which relies on not only objects in the image, but also word dependencies in the text, so that offers a considerable potential for improving the process of generating good descriptions. In this paper, we propose a structured concept predictor (SCP) to predict concepts and their structures, then we integrate them into captioning, so as to enhance the contribution of visual signals in this task via concepts and further use their relations to distinguish cross-modal semantics for better description generation. Particularly, we design weighted graph convolutional networks (W-GCN) to depict concept relations driven by word dependencies, and then learns differentiated contributions from these concepts for following decoding process. Therefore, our approach captures potential relations among concepts and discriminatively learns different concepts, so that effectively facilitates image captioning with inherited information across modalities. Extensive experiments and their results demonstrate the effectiveness of our approach as well as each proposed module in this work.

Defining the boundaries: challenges and advances in identifying cells in microscopy images. (arXiv:2311.08269v2 [q-bio.QM] UPDATED)

Authors: Nodar Gogoberidze, Beth A. Cimini

Segmentation, or the outlining of objects within images, is a critical step in the measurement and analysis of cells within microscopy images. While improvements continue to be made in tools that rely on classical methods for segmentation, deep learning-based tools increasingly dominate advances in the technology. Specialist models such as Cellpose continue to improve in accuracy and user-friendliness, and segmentation challenges such as the Multi-Modality Cell Segmentation Challenge continue to push innovation in accuracy across widely-varying test data as well as efficiency and usability. Increased attention on documentation, sharing, and evaluation standards are leading to increased user-friendliness and acceleration towards the goal of a truly universal method.

CORE-MM: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models. (arXiv:2311.11567v2 [cs.CV] UPDATED)

Authors: Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang, Hongxia Yang

Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. These models not only excel in traditional vision-language tasks but also demonstrate impressive performance in contemporary multi-modal benchmarks. Although many of these benchmarks attempt to holistically evaluate MLLMs, they typically concentrate on basic reasoning tasks, often yielding only simple yes/no or multi-choice responses. These methods naturally lead to confusion and difficulties in conclusively determining the reasoning capabilities of MLLMs. To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. The queries in our dataset are intentionally constructed to engage the reasoning capabilities of MLLMs in the process of generating answers. For a fair comparison across various MLLMs, we incorporate intermediate reasoning steps into our evaluation criteria. In instances where an MLLM is unable to produce a definitive answer, its reasoning ability is evaluated by requesting intermediate reasoning steps. If these steps align with our manual annotations, appropriate scores are assigned. This evaluation scheme resembles methods commonly used in human assessments, such as exams or assignments, and represents what we consider a more effective assessment technique compared with existing benchmarks. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark, designed to challenge and accurately measure their reasoning capabilities. The code and data will be released at https://core-mm.github.io/

Applications of Large Scale Foundation Models for Autonomous Driving. (arXiv:2311.12144v5 [cs.CV] UPDATED)

Authors: Yu Huang, Yue Chen, Zhu Li

Since DARPA Grand Challenges (rural) in 2004/05 and Urban Challenges in 2007, autonomous driving has been the most active field of AI applications. Recently powered by large language models (LLMs), chat systems, such as chatGPT and PaLM, emerge and rapidly become a promising direction to achieve artificial general intelligence (AGI) in natural language processing (NLP). There comes a natural thinking that we could employ these abilities to reformulate autonomous driving. By combining LLM with foundation models, it is possible to utilize the human knowledge, commonsense and reasoning to rebuild autonomous driving systems from the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied for autonomous driving, categorized as simulation, world model, data annotation and planning or E2E solutions etc.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. (arXiv:2311.12793v2 [cs.CV] UPDATED)

Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.

De-fine: Decomposing and Refining Visual Programs with Auto-Feedback. (arXiv:2311.12890v2 [cs.CV] UPDATED)

Authors: Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Wenqiao Zhang, Siliang Tang, Yueting Zhuang

Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it advances in performing visual processing and reasoning in an unsupervised manner. Current visual programming methods generate programs in a single pass for each task where the ability to evaluate and optimize based on feedback, unfortunately, is lacking, which consequentially limits their effectiveness for complex, multi-step problems. Drawing inspiration from benders decomposition, we introduce De-fine, a general framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more accurate and robust programs, setting new benchmarks in the field.

CompenHR: Efficient Full Compensation for High-resolution Projector. (arXiv:2311.13409v2 [cs.CV] UPDATED)

Authors: Yuxi Wang, Haibin Ling, Bingyao Huang

Full projector compensation is a practical task of projector-camera systems. It aims to find a projector input image, named compensation image, such that when projected it cancels the geometric and photometric distortions due to the physical environment and hardware. State-of-the-art methods use deep learning to address this problem and show promising performance for low-resolution setups. However, directly applying deep learning to high-resolution setups is impractical due to the long training time and high memory cost. To address this issue, this paper proposes a practical full compensation solution. Firstly, we design an attention-based grid refinement network to improve geometric correction quality. Secondly, we integrate a novel sampling scheme into an end-to-end compensation network to alleviate computation and introduce attention blocks to preserve key features. Finally, we construct a benchmark dataset for high-resolution projector full compensation. In experiments, our method demonstrates clear advantages in both efficiency and quality.

Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression. (arXiv:2311.13846v2 [cs.CV] UPDATED)

Authors: Shiyu Qin, Yimin Zhou, Jinpeng Wang, Bin Chen, Baoyi An, Tao Dai, Shu-Tao Xia

In this paper, we propose a progressive learning paradigm for transformer-based variable-rate image compression. Our approach covers a wide range of compression rates with the assistance of the Layer-adaptive Prompt Module (LPM). Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively, which are fed as additional information into the Swin Transformer layer of a pre-trained transformer-based image compression model to affect the allocation of attention region and the bits, which in turn changes the target compression ratio of the model. To ensure the network is more lightweight, we involves the integration of prompt networks with less convolutional layers. Exhaustive experiments show that compared to methods based on multiple models, which are optimized separately for different target rates, the proposed method arrives at the same performance with 80% savings in parameter storage and 90% savings in datasets. Meanwhile, our model outperforms all current variable bitrate image methods in terms of rate-distortion performance and approaches the state-of-the-art fixed bitrate image compression methods trained from scratch.

Perceptual Image Compression with Cooperative Cross-Modal Side Information. (arXiv:2311.13847v2 [cs.CV] UPDATED)

Authors: Shiyu Qin, Bin Chen, Yujun Huang, Baoyi An, Tao Dai, Shu-Tao Xia

The explosion of data has resulted in more and more associated text being transmitted along with images. Inspired by from distributed source coding, many works utilize image side information to enhance image compression. However, existing methods generally do not consider using text as side information to enhance perceptual compression of images, even though the benefits of multimodal synergy have been widely demonstrated in research. This begs the following question: How can we effectively transfer text-level semantic dependencies to help image compression, which is only available to the decoder? In this work, we propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff. Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features. This is done by predicting a semantic mask to guide the learned text-adaptive affine transformation at the pixel level. Furthermore, we design a text-conditional generative adversarial networks to improve the perceptual quality of reconstructed images. Extensive experiments involving four datasets and ten image quality assessment metrics demonstrate that the proposed approach achieves superior results in terms of rate-perception trade-off and semantic distortion.

MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation. (arXiv:2311.14494v2 [cs.CV] UPDATED)

Authors: Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu

We introduce MVControl, a novel neural network architecture that enhances existing pre-trained multi-view 2D diffusion models by incorporating additional input conditions, e.g. edge maps. Our approach enables the generation of controllable multi-view images and view-consistent 3D content. To achieve controllable multi-view image generation, we leverage MVDream as our base model, and train a new neural network module as additional plugin for end-to-end task-specific condition learning. To precisely control the shapes and views of generated images, we innovatively propose a new conditioning mechanism that predicts an embedding encapsulating the input spatial and view conditions, which is then injected to the network globally. Once MVControl is trained, score-distillation (SDS) loss based optimization can be performed to generate 3D content, in which process we propose to use a hybrid diffusion prior. The hybrid prior relies on a pre-trained Stable-Diffusion network and our trained MVControl for additional guidance. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content. Code available at https://github.com/WU-CVGL/MVControl/.

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models. (arXiv:2311.14552v2 [cs.CV] UPDATED)

Authors: Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang

Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Vision-Language models. Current Large Vision Language Models (LVLMs) are predominantly constrained to grounding a single, pre-existing object, relying solely on data from Referring Expression Comprehension tasks. The limitation leads to a compromise in model design, necessitating the introduction of visual expert models or the integration of customized head structures. Beyond these constraints, our research delves into the untapped potential of LVLMs and uncover their inherent capability for basic object perception, allowing them to accurately identify and locate objects of interest. Building on this insight, we introduce a novel language-prompted localization dataset designed to fully unleash the capabilities of LVLMs in integrating fine-grained object perception with precise location awareness. More importantly, we present $\textbf{Griffon}$, a purely LVLM-based baseline, which does not require the introduction of any special tokens, expert models, or additional detection modules. It simply maintains a consistent structure with popular LVLMs by unifying data formats across various localization-related scenarios and is trained end-to-end through a well-designed pipeline. Comprehensive experiments demonstrate that $\textbf{Griffon}$ not only achieves state-of-the-art performance on the fine-grained RefCOCO series but also approaches the capabilities of the expert model Faster RCNN on the detection benchmark MSCOCO.

Proximal Algorithms for Accelerated Langevin Dynamics. (arXiv:2311.14829v2 [cs.CE] UPDATED)

Authors: Duy H. Thai, Alexander L. Young, David B. Dunson

We develop a novel class of MCMC algorithms based on a stochastized Nesterov scheme. With an appropriate addition of noise, the result is a time-inhomogeneous underdamped Langevin equation, which we prove emits a specified target distribution as its invariant measure. Convergence rates to stationarity under Wasserstein-2 distance are established as well. Metropolis-adjusted and stochastic gradient versions of the proposed Langevin dynamics are also provided. Experimental illustrations show superior performance of the proposed method over typical Langevin samplers for different models in statistics and image processing including better mixing of the resulting Markov chains.

Uncertainty Aware AI for 2D MRI Segmentation. (arXiv:2311.14875v2 [eess.IV] UPDATED)

Authors: Lohith Konathala

Robust uncertainty estimations are necessary in safety-critical applications of Deep Learning. One such example is the semantic segmentation of medical images, whilst deep-learning approaches have high performance in such tasks they lack interpretability as they give no indication of their confidence when making classification decisions. Robust and interpretable segmentation is a critical first stage in automatically screening for pathologies hence the optimal solution is one which can provide high accuracy but also capture the underlying uncertainty. In this work, we present an uncertainty-aware segmentation model, BA U-Net, for use on MRI data that incorporates Bayesian Neural Networks and Attention Mechanisms to provide accurate and interpretable segmentations. We evaluated our model on the publicly available BraTS 2020 dataset using F1 Score and Intersection Over Union (IoU) as evaluation metrics.

Segmentation of diagnostic tissue compartments on whole slide images with renal thrombotic microangiopathies (TMAs). (arXiv:2311.14971v2 [cs.CV] UPDATED)

Authors: Huy Q. Vo, Pietro A. Cicalese, Surya Seshan, Syed A. Rizvi, Aneesh Vathul, Gloria Bueno, Anibal Pedraza Dorado, Niels Grabe, Katharina Stolle, Francesco Pesce, Joris J.T.H. Roelofs, Jesper Kers, Vitoantonio Bevilacqua, Nicola Altini, Bernd Schröppel, Dario Roccatello, Antonella Barreca, Savino Sciascia, Chandra Mohan, Hien V. Nguyen, Jan U. Becker

The thrombotic microangiopathies (TMAs) manifest in renal biopsy histology with a broad spectrum of acute and chronic findings. Precise diagnostic criteria for a renal biopsy diagnosis of TMA are missing. As a first step towards a machine learning- and computer vision-based analysis of wholes slide images from renal biopsies, we trained a segmentation model for the decisive diagnostic kidney tissue compartments artery, arteriole, glomerulus on a set of whole slide images from renal biopsies with TMAs and Mimickers (distinct diseases with a similar nephropathological appearance as TMA like severe benign nephrosclerosis, various vasculitides, Bevacizumab-plug glomerulopathy, arteriolar light chain deposition disease). Our segmentation model combines a U-Net-based tissue detection with a Shifted windows-transformer architecture to reach excellent segmentation results for even the most severely altered glomeruli, arterioles and arteries, even on unseen staining domains from a different nephropathology lab. With accurate automatic segmentation of the decisive renal biopsy compartments in human renal vasculopathies, we have laid the foundation for large-scale compartment-specific machine learning and computer vision analysis of renal biopsy repositories with TMAs.

SAMv2: A Unified Framework for Learning Appearance, Semantic and Cross-Modality Anatomical Embeddings. (arXiv:2311.15111v2 [cs.CV] UPDATED)

Authors: Xiaoyu Bai, Fan Bai, Xiaofei Huo, Jia Ge, Jingjing Lu, Xianghua Ye, Ke Yan, Yong Xia

Identifying anatomical structures (e.g., lesions or landmarks) in medical images plays a fundamental role in medical image analysis. As an exemplar-based landmark detection method, Self-supervised Anatomical eMbedding (SAM) learns a discriminative embedding for each voxel in the image and has shown promising results on various tasks. However, SAM still faces challenges in: (1) differentiating voxels with similar appearance but different semantic meanings (\textit{e.g.}, two adjacent structures without clear borders); (2) matching voxels with similar semantics but markedly different appearance (e.g., the same vessel before and after contrast injection); and (3) cross-modality matching (e.g., CT-MRI registration). To overcome these challenges, we propose SAMv2, which is a unified framework designed to learn appearance, semantic, and cross-modality anatomical embeddings. Specifically, SAMv2 incorporates three key innovations: (1) semantic embedding learning with prototypical contrastive loss; (2) a fixed-point-based matching strategy; and (3) an iterative approach for cross-modality embedding learning. We thoroughly evaluated SAMv2 across three tasks, including one-shot landmark detection, lesion tracking on longitudinal CT scans, and CT-MRI affine/rigid registration with varying field of view. Our results suggest that SAMv2 outperforms SAM and other state-of-the-art methods, offering a robust and versatile approach for landmark based medical image analysis tasks. Code and trained models are available at: https://github.com/alibaba-damo-academy/self-supervised-anatomical-embedding-v2

ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection. (arXiv:2311.15243v2 [cs.CV] UPDATED)

Authors: Yichen Bai, Zongbo Han, Changqing Zhang, Bing Cao, Xiaoheng Jiang, Qinghua Hu

Out-of-distribution (OOD) detection methods often exploit auxiliary outliers to train model identifying OOD samples, especially discovering challenging outliers from auxiliary outliers dataset to improve OOD detection. However, they may still face limitations in effectively distinguishing between the most challenging OOD samples that are much like in-distribution (ID) data, i.e., ID-like samples. To this end, we propose a novel OOD detection framework that discovers ID-like outliers using CLIP from the vicinity space of the ID samples, thus helping to identify these most challenging OOD samples. Then a prompt learning framework is proposed that utilizes the identified ID-like outliers to further leverage the capabilities of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a small number of ID samples to learn the prompts of the model without exposing other auxiliary outlier datasets. By focusing on the most challenging ID-like OOD samples and elegantly exploiting the capabilities of CLIP, our method achieves superior few-shot learning performance on various real-world image datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method reduces the average FPR95 by 12.16% and improves the average AUROC by 2.76%, compared to state-of-the-art methods).

A-JEPA: Joint-Embedding Predictive Architecture Can Listen. (arXiv:2311.15830v2 [cs.SD] UPDATED)

Authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang

This paper presents that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum. Following the design of I-JEPA, our A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via context encoder, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted by the exponential moving average of context encoder, \emph{i.e.}, target encoder, on the whole spectrogram. We find it beneficial to transfer random block masking into time-frequency aware masking in a curriculum manner, considering the complexity of highly correlated in local time and frequency in audio spectrograms. To enhance contextual semantic understanding and robustness, we fine-tune the encoder with a regularized masking on target datasets, instead of input dropping or zero. Empirically, when built with Vision Transformers structure, we find A-JEPA to be highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models. (arXiv:2311.16103v2 [cs.CV] UPDATED)

Authors: Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan

Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM model should not only see and understand the surroundings, but also possess human-level commonsense, and make well-informed decisions for the users. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes \textit{Video-Bench}, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for various tasks, facilitating the calculation of metrics and generating convenient final scores. We evaluate 8 representative Video-LLMs using \textit{Video-Bench}. The findings reveal that current Video-LLMs still fall considerably short of achieving human-like comprehension and analysis of real-world videos, offering valuable insights for future research directions. The benchmark and toolkit are available at: \url{https://github.com/PKU-YuanGroup/Video-Bench}.

TransONet: Automatic Segmentation of Vasculature in Computed Tomographic Angiograms Using Deep Learning. (arXiv:2311.10328v1 [eess.IV] CROSS LISTED)

Authors: Alireza Bagheri Rajeoni, Breanna Pederson, Ali Firooz, Hamed Abdollahi, Andrew K. Smith, Daniel G. Clair, Susan M. Lessner, Homayoun Valafar

Pathological alterations in the human vascular system underlie many chronic diseases, such as atherosclerosis and aneurysms. However, manually analyzing diagnostic images of the vascular system, such as computed tomographic angiograms (CTAs) is a time-consuming and tedious process. To address this issue, we propose a deep learning model to segment the vascular system in CTA images of patients undergoing surgery for peripheral arterial disease (PAD). Our study focused on accurately segmenting the vascular system (1) from the descending thoracic aorta to the iliac bifurcation and (2) from the descending thoracic aorta to the knees in CTA images using deep learning techniques. Our approach achieved average Dice accuracies of 93.5% and 80.64% in test dataset for (1) and (2), respectively, highlighting its high accuracy and potential clinical utility. These findings demonstrate the use of deep learning techniques as a valuable tool for medical professionals to analyze the health of the vascular system efficiently and accurately. Please visit the GitHub page for this paper at https://github.com/pip-alireza/TransOnet.