new Temporal-Spatial Processing of Event Camera Data via Delay-Loop Reservoir Neural Network

Authors: Richard Lau, Anthony Tylan-Tyler, Lihan Yao, Rey de Castro Roberto, Robert Taylor, Isaiah Jones

Abstract: This paper describes a temporal-spatial model for video processing with special applications to processing event camera videos. We propose to study a conjecture motivated by our previous study of video processing with delay loop reservoir (DLR) neural network, which we call Temporal-Spatial Conjecture (TSC). The TSC postulates that there is significant information content carried in the temporal representation of a video signal and that machine learning algorithms would benefit from separate optimization of the spatial and temporal components for intelligent processing. To verify or refute the TSC, we propose a Visual Markov Model (VMM) which decompose the video into spatial and temporal components and estimate the mutual information (MI) of these components. Since computation of video mutual information is complex and time consuming, we use a Mutual Information Neural Network to estimate the bounds of the mutual information. Our result shows that the temporal component carries significant MI compared to that of the spatial component. This finding has often been overlooked in neural network literature. In this paper, we will exploit this new finding to guide our design of a delay-loop reservoir neural network for event camera classification, which results in a 18% improvement on classification accuracy.

new Contrastive Learning for Regression on Hyperspectral Data

Authors: Mohamad Dhaini, Maxime Berar, Paul Honeine, Antonin Van Exem

Abstract: Contrastive learning has demonstrated great effectiveness in representation learning especially for image classification tasks. However, there is still a shortage in the studies targeting regression tasks, and more specifically applications on hyperspectral data. In this paper, we propose a contrastive learning framework for the regression tasks for hyperspectral data. To this end, we provide a collection of transformations relevant for augmenting hyperspectral data, and investigate contrastive learning for regression. Experiments on synthetic and real hyperspectral datasets show that the proposed framework and transformations significantly improve the performance of regression models, achieving better scores than other state-of-the-art transformations.

new HEAL-ViT: Vision Transformers on a spherical mesh for medium-range weather forecasting

Authors: Vivek Ramavajjala

Abstract: In recent years, a variety of ML architectures and techniques have seen success in producing skillful medium range weather forecasts. In particular, Vision Transformer (ViT)-based models (e.g. Pangu-Weather, FuXi) have shown strong performance, working nearly "out-of-the-box" by treating weather data as a multi-channel image on a rectilinear grid. While a rectilinear grid is appropriate for 2D images, weather data is inherently spherical and thus heavily distorted at the poles on a rectilinear grid, leading to disproportionate compute being used to model data near the poles. Graph-based methods (e.g. GraphCast) do not suffer from this problem, as they map the longitude-latitude grid to a spherical mesh, but are generally more memory intensive and tend to need more compute resources for training and inference. While spatially homogeneous, the spherical mesh does not lend itself readily to be modeled by ViT-based models that implicitly rely on the rectilinear grid structure. We present HEAL-ViT, a novel architecture that uses ViT models on a spherical mesh, thus benefiting from both the spatial homogeneity enjoyed by graph-based models and efficient attention-based mechanisms exploited by transformers. HEAL-ViT produces weather forecasts that outperform the ECMWF IFS on key metrics, and demonstrate better bias accumulation and blurring than other ML weather prediction models. Further, the lowered compute footprint of HEAL-ViT makes it attractive for operational use as well, where other models in addition to a 6-hourly prediction model may be needed to produce the full set of operational forecasts required.

new Boosting Few-Shot Learning via Attentive Feature Regularization

Authors: Xingyu Zhu, Shuo Wang, Jinda Lu, Yanbin Hao, Haifeng Liu, Xiangnan He

Abstract: Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor. However, this mixing operation weakens the feature representation due to the linear interpolation and the overlooking of the importance of specific channels. To solve these issues, this paper proposes attentive feature regularization (AFR) which aims to improve the feature representativeness and discriminability. In our approach, we first calculate the relations between different categories of semantic labels to pick out the related features used for regularization. Then, we design two attention-based calculations at both the instance and channel levels. These calculations enable the regularization procedure to focus on two crucial aspects: the feature complementarity through adaptive interpolation in related categories and the emphasis on specific feature channels. Finally, we combine these regularization strategies to significantly improve the classifier performance. Empirical studies on several popular FSL benchmarks demonstrate the effectiveness of AFR, which improves the recognition accuracy of novel categories without the need to retrain any feature extractor, especially in the 1-shot setting. Furthermore, the proposed AFR can seamlessly integrate into other FSL methods to improve classification performance.

new Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

Authors: Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Vincent Tao Hu, Bj\"orn Ommer

Abstract: In recent years, advances in text-to-image (T2I) diffusion models have substantially elevated the quality of their generated images. However, achieving fine-grained control over attributes remains a challenge due to the limitations of natural language prompts (such as no continuous set of intermediate descriptions existing between ``person'' and ``old person''). Even though many methods were introduced that augment the model or generation process to enable such control, methods that do not require a fixed reference image are limited to either enabling global fine-grained attribute expression control or coarse attribute expression control localized to specific subjects, not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model. Project page: https://compvis.github.io/attribute-control. Code is available at https://github.com/CompVis/attribute-control.

URLs: https://compvis.github.io/attribute-control., https://github.com/CompVis/attribute-control.

new SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Authors: Yiming Xie, Henglu Wei, Zhenyi Liu, Xiaoyu Wang, Xiangyang Ji

Abstract: To advance research in learning-based defogging algorithms, various synthetic fog datasets have been developed. However, existing datasets created using the Atmospheric Scattering Model (ASM) or real-time rendering engines often struggle to produce photo-realistic foggy images that accurately mimic the actual imaging process. This limitation hinders the effective generalization of models from synthetic to real data. In this paper, we introduce an end-to-end simulation pipeline designed to generate photo-realistic foggy images. This pipeline comprehensively considers the entire physically-based foggy scene imaging process, closely aligning with real-world image capture methods. Based on this pipeline, we present a new synthetic fog dataset named SynFog, which features both sky light and active lighting conditions, as well as three levels of fog density. Experimental results demonstrate that models trained on SynFog exhibit superior performance in visual perception and detection accuracy compared to others when applied to real-world foggy images.

new Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

Authors: Remy Sabathier, Niloy J. Mitra, David Novotny

Abstract: We present a method to build animatable dog avatars from monocular videos. This is challenging as animals display a range of (unpredictable) non-rigid movements and have a variety of appearance details (e.g., fur, spots, tails). We develop an approach that links the video frames via a 4D solution that jointly solves for animal's pose variation, and its appearance (in a canonical pose). To this end, we significantly improve the quality of template-based shape fitting by endowing the SMAL parametric model with Continuous Surface Embeddings, which brings image-to-mesh reprojection constaints that are denser, and thus stronger, than the previously used sparse semantic keypoint correspondences. To model appearance, we propose an implicit duplex-mesh texture that is defined in the canonical pose, but can be deformed using SMAL pose coefficients and later rendered to enforce a photometric compatibility with the input video frames. On the challenging CoP3D and APTv2 datasets, we demonstrate superior results (both in terms of pose estimates and predicted appearance) to existing template-free (RAC) and template-based approaches (BARC, BITE).

new Benchmarking Video Frame Interpolation

Authors: Simon Kiefhaber, Simon Niklaus, Feng Liu, Simone Schaub-Meyer

Abstract: Video frame interpolation, the task of synthesizing new frames in between two or more given ones, is becoming an increasingly popular research target. However, the current evaluation of frame interpolation techniques is not ideal. Due to the plethora of test datasets available and inconsistent computation of error metrics, a coherent and fair comparison across papers is very challenging. Furthermore, new test sets have been proposed as part of method papers so they are unable to provide the in-depth evaluation of a dedicated benchmarking paper. Another severe downside is that these test sets violate the assumption of linearity when given two input frames, making it impossible to solve without an oracle. We hence strongly believe that the community would greatly benefit from a benchmarking paper, which is what we propose. Specifically, we present a benchmark which establishes consistent error metrics by utilizing a submission website that computes them, provides insights by analyzing the interpolation quality with respect to various per-pixel attributes such as the motion magnitude, contains a carefully designed test set adhering to the assumption of linearity by utilizing synthetic data, and evaluates the computational efficiency in a coherent manner.

new Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships

Authors: Rangel Daroya, Aaron Sun, Subhransu Maji

Abstract: Modeling and visualizing relationships between tasks or datasets is an important step towards solving various meta-tasks such as dataset discovery, multi-tasking, and transfer learning. However, many relationships, such as containment and transferability, are naturally asymmetric and current approaches for representation and visualization (e.g., t-SNE do not readily support this. We propose Task2Box, an approach to represent tasks using box embeddings -- axis-aligned hyperrectangles in low dimensional spaces -- that can capture asymmetric relationships between them through volumetric overlaps. We show that Task2Box accurately predicts unseen hierarchical relationships between nodes in ImageNet and iNaturalist datasets, as well as transferability between tasks in the Taskonomy benchmark. We also show that box embeddings estimated from task representations (e.g., CLIP, Task2Vec, or attribute based) can be used to predict relationships between unseen tasks more accurately than classifiers trained on the same representations, as well as handcrafted asymmetric distances (e.g., KL divergence). This suggests that low-dimensional box embeddings can effectively capture these task relationships and have the added advantage of being interpretable. We use the approach to visualize relationships among publicly available image classification datasets on popular dataset hosting platform called Hugging Face.

new Engagement Measurement Based on Facial Landmarks and Spatial-Temporal Graph Convolutional Networks

Authors: Ali Abedi, Shehroz S. Khan

Abstract: Engagement in virtual learning is crucial for a variety of factors including learner satisfaction, performance, and compliance with learning programs, but measuring it is a challenging task. There is therefore considerable interest in utilizing artificial intelligence and affective computing to measure engagement in natural settings as well as on a large scale. This paper introduces a novel, privacy-preserving method for engagement measurement from videos. It uses facial landmarks, which carry no personally identifiable information, extracted from videos via the MediaPipe deep learning solution. The extracted facial landmarks are fed to a Spatial-Temporal Graph Convolutional Network (ST-GCN) to output the engagement level of the learner in the video. To integrate the ordinal nature of the engagement variable into the training process, ST-GCNs undergo training in a novel ordinal learning framework based on transfer learning. Experimental results on two video student engagement measurement datasets show the superiority of the proposed method compared to previous methods with improved state-of-the-art on the EngageNet dataset with a %3.1 improvement in four-class engagement level classification accuracy and on the Online Student Engagement dataset with a %1.5 improvement in binary engagement classification accuracy. The relatively lightweight ST-GCN and its integration with the real-time MediaPipe deep learning solution make the proposed approach capable of being deployed on virtual learning platforms and measuring engagement in real time.

new Histogram Layers for Neural Engineered Features

Authors: Joshua Peeples, Salim Al Kharsa, Luke Saleh, Alina Zare

Abstract: In the computer vision literature, many effective histogram-based features have been developed. These engineered features include local binary patterns and edge histogram descriptors among others and they have been shown to be informative features for a variety of computer vision tasks. In this paper, we explore whether these features can be learned through histogram layers embedded in a neural network and, therefore, be leveraged within deep learning frameworks. By using histogram features, local statistics of the feature maps from the convolution neural networks can be used to better represent the data. We present neural versions of local binary pattern and edge histogram descriptors that jointly improve the feature representation and perform image classification. Experiments are presented on benchmark and real-world datasets.

new LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Authors: Siyuan Cheng, Guanhong Tao, Yingqi Liu, Guangyu Shen, Shengwei An, Shiwei Feng, Xiangzhe Xu, Kaiyuan Zhang, Shiqing Ma, Xiangyu Zhang

Abstract: Backdoor attack poses a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response to this, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience, we introduce a novel backdoor attack LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve high attack success rate across 4 datasets and 7 model structures, and effectively evading 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS.

URLs: https://github.com/Megum1/LOTUS.

new Strategies to Improve Real-World Applicability of Laparoscopic Anatomy Segmentation Models

Authors: Fiona R. Kolbinger, Jiangpeng He, Jinge Ma, Fengqing Zhu

Abstract: Accurate identification and localization of anatomical structures of varying size and appearance in laparoscopic imaging are necessary to leverage the potential of computer vision techniques for surgical decision support. Segmentation performance of such models is traditionally reported using metrics of overlap such as IoU. However, imbalanced and unrealistic representation of classes in the training data and suboptimal selection of reported metrics have the potential to skew nominal segmentation performance and thereby ultimately limit clinical translation. In this work, we systematically analyze the impact of class characteristics (i.e., organ size differences), training and test data composition (i.e., representation of positive and negative examples), and modeling parameters (i.e., foreground-to-background class weight) on eight segmentation metrics: accuracy, precision, recall, IoU, F1 score, specificity, Hausdorff Distance, and Average Symmetric Surface Distance. Based on our findings, we propose two simple yet effective strategies to improve real-world applicability of image segmentation models in laparoscopic surgical data: (1) inclusion of negative examples in the training process and (2) adaptation of foreground-background weights in segmentation models to maximize model performance with respect to specific metrics of interest, depending on the clinical use case.

new AnimateMe: 4D Facial Expressions via Diffusion Models

Authors: Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Stefanos Zafeiriou

Abstract: The field of photorealistic 3D avatar reconstruction and generation has garnered significant attention in recent years; however, animating such avatars remains challenging. Recent advances in diffusion models have notably enhanced the capabilities of generative models in 2D animation. In this work, we directly utilize these models within the 3D domain to achieve controllable and high-fidelity 4D facial animation. By integrating the strengths of diffusion processes and geometric deep learning, we employ Graph Neural Networks (GNNs) as denoising diffusion models in a novel approach, formulating the diffusion process directly on the mesh space and enabling the generation of 3D facial expressions. This facilitates the generation of facial deformations through a mesh-diffusion-based model. Additionally, to ensure temporal coherence in our animations, we propose a consistent noise sampling method. Under a series of both quantitative and qualitative experiments, we showcase that the proposed method outperforms prior work in 4D expression synthesis by generating high-fidelity extreme expressions. Furthermore, we applied our method to textured 4D facial expression generation, implementing a straightforward extension that involves training on a large-scale textured 4D facial expression database.

new DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment

Authors: Stella Bounareli, Christos Tzelepis, Vasileios Argyriou, Ioannis Patras, Georgios Tzimiropoulos

Abstract: Video-driven neural face reenactment aims to synthesize realistic facial images that successfully preserve the identity and appearance of a source face, while transferring the target head pose and facial expressions. Existing GAN-based methods suffer from either distortions and visual artifacts or poor reconstruction quality, i.e., the background and several important appearance details, such as hair style/color, glasses and accessories, are not faithfully reconstructed. Recent advances in Diffusion Probabilistic Models (DPMs) enable the generation of high-quality realistic images. To this end, in this paper we present DiffusionAct, a novel method that leverages the photo-realistic image generation of diffusion models to perform neural face reenactment. Specifically, we propose to control the semantic space of a Diffusion Autoencoder (DiffAE), in order to edit the facial pose of the input images, defined as the head pose orientation and the facial expressions. Our method allows one-shot, self, and cross-subject reenactment, without requiring subject-specific fine-tuning. We compare against state-of-the-art GAN-, StyleGAN2-, and diffusion-based methods, showing better or on-par reenactment performance.

new Co-Occurring of Object Detection and Identification towards unlabeled object discovery

Authors: Binay Kumar Singh, Niels Da Vitoria Lobo

Abstract: In this paper, we propose a novel deep learning based approach for identifying co-occurring objects in conjunction with base objects in multilabel object categories. Nowadays, with the advancement in computer vision based techniques we need to know about co-occurring objects with respect to base object for various purposes. The pipeline of the proposed work is composed of two stages: in the first stage of the proposed model we detect all the bounding boxes present in the image and their corresponding labels, then in the second stage we perform co-occurrence matrix analysis. In co-occurrence matrix analysis, we set base classes based on the maximum occurrences of the labels and build association rules and generate frequent patterns. These frequent patterns will show base classes and their corresponding co-occurring classes. We performed our experiments on two publicly available datasets: Pascal VOC and MS-COCO. The experimental results on public benchmark dataset is reported in Sec 4. Further we extend this work by considering all frequently objects as unlabeled and what if they are occluded as well.

new DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion

Authors: Yuanze Lin, Ronald Clark, Philip Torr

Abstract: We present DreamPolisher, a novel Gaussian Splatting based method with geometric guidance, tailored to learn cross-view consistency and intricate detail from textual descriptions. While recent progress on text-to-3D generation methods have been promising, prevailing methods often fail to ensure view-consistency and textural richness. This problem becomes particularly noticeable for methods that work with text input alone. To address this, we propose a two-stage Gaussian Splatting based approach that enforces geometric consistency among views. Initially, a coarse 3D generation undergoes refinement via geometric optimization. Subsequently, we use a ControlNet driven refiner coupled with the geometric consistency term to improve both texture fidelity and overall consistency of the generated 3D asset. Empirical evaluations across diverse textual prompts spanning various object categories demonstrate the efficacy of DreamPolisher in generating consistent and realistic 3D objects, aligning closely with the semantics of the textual instructions.

new Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving

Authors: Junhao Zheng, Chenhao Lin, Jiahao Sun, Zhengyu Zhao, Qian Li, Chao Shen

Abstract: Deep learning-based monocular depth estimation (MDE), extensively applied in autonomous driving, is known to be vulnerable to adversarial attacks. Previous physical attacks against MDE models rely on 2D adversarial patches, so they only affect a small, localized region in the MDE map but fail under various viewpoints. To address these limitations, we propose 3D Depth Fool (3D$^2$Fool), the first 3D texture-based adversarial attack against MDE models. 3D$^2$Fool is specifically optimized to generate 3D adversarial textures agnostic to model types of vehicles and to have improved robustness in bad weather conditions, such as rain and fog. Experimental results validate the superior performance of our 3D$^2$Fool across various scenarios, including vehicles, MDE models, weather conditions, and viewpoints. Real-world experiments with printed 3D textures on physical vehicle models further demonstrate that our 3D$^2$Fool can cause an MDE error of over 10 meters.

new Staircase Localization for Autonomous Exploration in Urban Environments

Authors: Jinrae Kim, Sunggoo Jung, Sung-Kyun Kim, Youdan Kim, Ali-akbar Agha-mohammadi

Abstract: A staircase localization method is proposed for robots to explore urban environments autonomously. The proposed method employs a modular design in the form of a cascade pipeline consisting of three modules of stair detection, line segment detection, and stair localization modules. The stair detection module utilizes an object detection algorithm based on deep learning to generate a region of interest (ROI). From the ROI, line segment features are extracted using a deep line segment detection algorithm. The extracted line segments are used to localize a staircase in terms of position, orientation, and stair direction. The stair detection and localization are performed only with a single RGB-D camera. Each component of the proposed pipeline does not need to be designed particularly for staircases, which makes it easy to maintain the whole pipeline and replace each component with state-of-the-art deep learning detection techniques. The results of real-world experiments show that the proposed method can perform accurate stair detection and localization during autonomous exploration for various structured and unstructured upstairs and downstairs with shadows, dirt, and occlusions by artificial and natural objects.

new OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation

Authors: Ganlong Zhao, Guanbin Li, Weikai Chen, Yizhou Yu

Abstract: Recent advances in Iterative Vision-and-Language Navigation (IVLN) introduce a more meaningful and practical paradigm of VLN by maintaining the agent's memory across tours of scenes. Although the long-term memory aligns better with the persistent nature of the VLN task, it poses more challenges on how to utilize the highly unstructured navigation memory with extremely sparse supervision. Towards this end, we propose OVER-NAV, which aims to go over and beyond the current arts of IVLN techniques. In particular, we propose to incorporate LLMs and open-vocabulary detectors to distill key information and establish correspondence between multi-modal signals. Such a mechanism introduces reliable cross-modal supervision and enables on-the-fly generalization to unseen scenes without the need of extra annotation and re-training. To fully exploit the interpreted navigation data, we further introduce a structured representation, coded Omnigraph, to effectively integrate multi-modal information along the tour. Accompanied with a novel omnigraph fusion mechanism, OVER-NAV is able to extract the most relevant knowledge from omnigraph for a more accurate navigating action. In addition, OVER-NAV seamlessly supports both discrete and continuous environments under a unified framework. We demonstrate the superiority of OVER-NAV in extensive experiments.

new The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

Authors: Dian Chao, Xin Song, Shupeng Zhong, Boyuan Wang, Xiangyu Wu, Chen Zhu, Yang Yang

Abstract: In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49.

new Language Models are Free Boosters for Biomedical Imaging Tasks

Authors: Zhixin Lai, Jing Wu, Suiyao Chen, Yucheng Zhou, Anna Hovakimyan, Naira Hovakimyan

Abstract: In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. The approach diverges from established methodologies by utilizing a frozen transformer block, extracted from pre-trained LLMs, as an innovative encoder layer for the direct processing of visual tokens. This strategy represents a significant departure from the standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. More interestingly, as a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and enriching the understanding of their potential in this specialized domain.

new TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Authors: Yufu Wang, Ziyun Wang, Lingjie Liu, Kostas Daniilidis

Abstract: We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by 60% from prior work. https://yufu-wang.github.io/tram4d/

URLs: https://yufu-wang.github.io/tram4d/

new Activity-Biometrics: Person Identification from Daily Activities

Authors: Shehreen Azad, Yogesh Singh Rawat

Abstract: In this work, we study a novel problem which focuses on person identification while performing daily activities. Learning biometric features from RGB videos is challenging due to spatio-temporal complexity and presence of appearance biases such as clothing color and background. We propose ABNet, a novel framework which leverages disentanglement of biometric and non-biometric features to perform effective person identification from daily activities. ABNet relies on a bias-less teacher to learn biometric features from RGB videos and explicitly disentangle non-biometric features with the help of biometric distortion. In addition, ABNet also exploits activity prior for biometrics which is enabled by joint biometric and activity learning. We perform comprehensive evaluation of the proposed approach across five different datasets which are derived from existing activity recognition benchmarks. Furthermore, we extensively compare ABNet with existing works in person identification and demonstrate its effectiveness for activity-based biometrics across all five datasets. The code and dataset can be accessed at: \url{https://github.com/sacrcv/Activity-Biometrics/}

URLs: https://github.com/sacrcv/Activity-Biometrics/

new CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning

Authors: Ziyang Gong, Fuhao Li, Yupeng Deng, Deblina Bhattacharjee, Xiangwei Zhu, Zhenming Ji

Abstract: Unsupervised Domain Adaptation (UDA) aims to adapt models from labeled source domains to unlabeled target domains. When adapting to adverse scenes, existing UDA methods fail to perform well due to the lack of instructions, leading their models to overlook discrepancies within all adverse scenes. To tackle this, we propose CoDA which instructs models to distinguish, focus, and learn from these discrepancies at scene and image levels. Specifically, CoDA consists of a Chain-of-Domain (CoD) strategy and a Severity-Aware Visual Prompt Tuning (SAVPT) mechanism. CoD focuses on scene-level instructions to divide all adverse scenes into easy and hard scenes, guiding models to adapt from source to easy domains with easy scene images, and then to hard domains with hard scene images, thereby laying a solid foundation for whole adaptations. Building upon this foundation, we employ SAVPT to dive into more detailed image-level instructions to boost performance. SAVPT features a novel metric Severity that divides all adverse scene images into low-severity and high-severity images. Then Severity directs visual prompts and adapters, instructing models to concentrate on unified severity features instead of scene-specific features, without adding complexity to the model architecture. CoDA achieves SOTA performances on widely-used benchmarks under all adverse scenes. Notably, CoDA outperforms the existing ones by 4.6%, and 10.3% mIoU on the Foggy Driving, and Foggy Zurich benchmarks, respectively. Our code is available at https://github.com/Cuzyoung/CoDA

URLs: https://github.com/Cuzyoung/CoDA

new AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

Authors: Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, Manmohan Chandraker

Abstract: Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.

new Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance

Authors: Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, Seungryong Kim

Abstract: Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.

new Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection

Authors: Jiacheng Zhang, Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

Abstract: We delve into pseudo-labeling for semi-supervised monocular 3D object detection (SSM3OD) and discover two primary issues: a misalignment between the prediction quality of 3D and 2D attributes and the tendency of depth supervision derived from pseudo-labels to be noisy, leading to significant optimization conflicts with other reliable forms of supervision. We introduce a novel decoupled pseudo-labeling (DPL) approach for SSM3OD. Our approach features a Decoupled Pseudo-label Generation (DPG) module, designed to efficiently generate pseudo-labels by separately processing 2D and 3D attributes. This module incorporates a unique homography-based method for identifying dependable pseudo-labels in BEV space, specifically for 3D attributes. Additionally, we present a DepthGradient Projection (DGP) module to mitigate optimization conflicts caused by noisy depth supervision of pseudo-labels, effectively decoupling the depth gradient and removing conflicting gradients. This dual decoupling strategy-at both the pseudo-label generation and gradient levels-significantly improves the utilization of pseudo-labels in SSM3OD. Our comprehensive experiments on the KITTI benchmark demonstrate the superiority of our method over existing approaches.

new SSF3D: Strict Semi-Supervised 3D Object Detection with Switching Filter

Authors: Songbur Wong

Abstract: SSF3D modified the semi-supervised 3D object detection (SS3DOD) framework, which designed specifically for point cloud data. Leveraging the characteristics of non-coincidence and weak correlation of target objects in point cloud, we adopt a strategy of retaining only the truth-determining pseudo labels and trimming the other fuzzy labels with points, instead of pursuing a balance between the quantity and quality of pseudo labels. Besides, we notice that changing the filter will make the model meet different distributed targets, which is beneficial to break the training bottleneck. Two mechanism are introduced to achieve above ideas: strict threshold and filter switching. The experiments are conducted to analyze the effectiveness of above approaches and their impact on the overall performance of the system. Evaluating on the KITTI dataset, SSF3D exhibits superior performance compared to the current state-of-the-art methods. The code will be released here.

new Neural Clustering based Visual Representation Learning

Authors: Guikun Chen, Xia Li, Yi Yang, Wenguan Wang

Abstract: We investigate a fundamental aspect of machine vision: the measurement of features, by revisiting clustering, one of the most classic approaches in machine learning and data analysis. Existing visual feature extractors, including ConvNets, ViTs, and MLPs, represent an image as rectangular regions. Though prevalent, such a grid-style paradigm is built upon engineering practice and lacks explicit modeling of data distribution. In this work, we propose feature extraction with clustering (FEC), a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework, which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution. Given an image, FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives. Such an iterative working mechanism is implemented in the form of several neural layers and the final representatives can be used for downstream tasks. The cluster assignments across layers, which can be viewed and inspected by humans, make the forward process of FEC fully transparent and empower it with promising ad-hoc interpretability. Extensive experiments on various visual recognition models and tasks verify the effectiveness, generality, and interpretability of FEC. We expect this work will provoke a rethink of the current de facto grid-style paradigm.

new InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion

Authors: Jihyun Lee, Shunsuke Saito, Giljoo Nam, Minhyuk Sung, Tae-Kyun Kim

Abstract: We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction. Sampling from our model yields plausible and diverse two-hand shapes in close interaction with or without an object. Our prior can be incorporated into any optimization or learning methods to reduce ambiguity in an ill-posed setup. Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus, we propose to decompose the modeling of joint distribution into the modeling of factored unconditional and conditional single instance distribution. In particular, we introduce a diffusion model that learns the single-hand distribution unconditional and conditional to another hand via conditioning dropout. For sampling, we combine anti-penetration and classifier-free guidance to enable plausible generation. Furthermore, we establish the rigorous evaluation protocol of two-hand synthesis, where our method significantly outperforms baseline generative models in terms of plausibility and diversity. We also demonstrate that our diffusion prior can boost the performance of two-hand reconstruction from monocular in-the-wild images, achieving new state-of-the-art accuracy.

new Test-time Adaptation Meets Image Enhancement: Improving Accuracy via Uncertainty-aware Logit Switching

Authors: Shohei Enomoto, Naoya Hasegawa, Kazuki Adachi, Taku Sasaki, Shin'ya Yamaguchi, Satoshi Suzuki, Takeharu Eda

Abstract: Deep neural networks have achieved remarkable success in a variety of computer vision applications. However, there is a problem of degrading accuracy when the data distribution shifts between training and testing. As a solution of this problem, Test-time Adaptation~(TTA) has been well studied because of its practicality. Although TTA methods increase accuracy under distribution shift by updating the model at test time, using high-uncertainty predictions is known to degrade accuracy. Since the input image is the root of the distribution shift, we incorporate a new perspective on enhancing the input image into TTA methods to reduce the prediction's uncertainty. We hypothesize that enhancing the input image reduces prediction's uncertainty and increase the accuracy of TTA methods. On the basis of our hypothesis, we propose a novel method: Test-time Enhancer and Classifier Adaptation~(TECA). In TECA, the classification model is combined with the image enhancement model that transforms input images into recognition-friendly ones, and these models are updated by existing TTA methods. Furthermore, we found that the prediction from the enhanced image does not always have lower uncertainty than the prediction from the original image. Thus, we propose logit switching, which compares the uncertainty measure of these predictions and outputs the lower one. In our experiments, we evaluate TECA with various TTA methods and show that TECA reduces prediction's uncertainty and increases accuracy of TTA methods despite having no hyperparameters and little parameter overhead.

new LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection

Authors: Yunpeng Luo, Junlong Du, Ke Yan, Shouhong Ding

Abstract: The evolution of Diffusion Models has dramatically improved image generation quality, making it increasingly difficult to differentiate between real and generated images. This development, while impressive, also raises significant privacy and security concerns. In response to this, we propose a novel Latent REconstruction error guided feature REfinement method (LaRE^2) for detecting the diffusion-generated images. We come up with the Latent Reconstruction Error (LaRE), the first reconstruction-error based feature in the latent space for generated image detection. LaRE surpasses existing methods in terms of feature extraction efficiency while preserving crucial cues required to differentiate between the real and the fake. To exploit LaRE, we propose an Error-Guided feature REfinement module (EGRE), which can refine the image feature guided by LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an align-then-refine mechanism, which effectively refines the image feature for generated-image detection from both spatial and channel perspectives. Extensive experiments on the large-scale GenImage benchmark demonstrate the superiority of our LaRE^2, which surpasses the best SoTA method by up to 11.9%/12.1% average ACC/AP across 8 different image generators. LaRE also surpasses existing methods in terms of feature extraction cost, delivering an impressive speed enhancement of 8 times.

new DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360{\deg} Images

Authors: Chuhan Jiao, Yao Wang, Guanhua Zhang, Mihai B\^ace, Zhiming Hu, Andreas Bulling

Abstract: We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360{\deg} images based on a conditional score-based denoising diffusion model. Generating human gaze on 360{\deg} images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360{\deg} images as condition and uses two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360{\deg} image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences.

new Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-training via Differentiable Rendering of Line Segments

Authors: Yusuke Takimoto, Hikari Takehara, Hiroyuki Sato, Zihao Zhu, Bo Zheng

Abstract: In the film and gaming industries, achieving a realistic hair appearance typically involves the use of strands originating from the scalp. However, reconstructing these strands from observed surface images of hair presents significant challenges. The difficulty in acquiring Ground Truth (GT) data has led state-of-the-art learning-based methods to rely on pre-training with manually prepared synthetic CG data. This process is not only labor-intensive and costly but also introduces complications due to the domain gap when compared to real-world data. In this study, we propose an optimization-based approach that eliminates the need for pre-training. Our method represents hair strands as line segments growing from the scalp and optimizes them using a novel differentiable rendering algorithm. To robustly optimize a substantial number of slender explicit geometries, we introduce 3D orientation estimation utilizing global optimization, strand initialization based on Laplace's equation, and reparameterization that leverages geometric connectivity and spatial proximity. Unlike existing optimization-based methods, our method is capable of reconstructing internal hair flow in an absolute direction. Our method exhibits robust and accurate inverse rendering, surpassing the quality of existing methods and significantly improving processing speed.

new SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder

Authors: Dihan Zheng, Yihang Zou, Xiaowen Zhang, Chenglong Bao

Abstract: The data bottleneck has emerged as a fundamental challenge in learning based image restoration methods. Researchers have attempted to generate synthesized training data using paired or unpaired samples to address this challenge. This study proposes SeNM-VAE, a semi-supervised noise modeling method that leverages both paired and unpaired datasets to generate realistic degraded data. Our approach is based on modeling the conditional distribution of degraded and clean images with a specially designed graphical model. Under the variational inference framework, we develop an objective function for handling both paired and unpaired data. We employ our method to generate paired training samples for real-world image denoising and super-resolution tasks. Our approach excels in the quality of synthetic degraded images compared to other unpaired and paired noise modeling methods. Furthermore, our approach demonstrates remarkable performance in downstream image restoration tasks, even with limited paired data. With more paired data, our method achieves the best performance on the SIDD dataset.

new Random-coupled Neural Network

Authors: Haoran Liu, Mingzhe Liu, Peng Li, Jiahui Wu, Xin Jiang, Zhuo Zuo, Bingqi Liu

Abstract: Improving the efficiency of current neural networks and modeling them in biological neural systems have become popular research directions in recent years. Pulse-coupled neural network (PCNN) is a well applicated model for imitating the computation characteristics of the human brain in computer vision and neural network fields. However, differences between the PCNN and biological neural systems remain: limited neural connection, high computational cost, and lack of stochastic property. In this study, random-coupled neural network (RCNN) is proposed. It overcomes these difficulties in PCNN's neuromorphic computing via a random inactivation process. This process randomly closes some neural connections in the RCNN model, realized by the random inactivation weight matrix of link input. This releases the computational burden of PCNN, making it affordable to achieve vast neural connections. Furthermore, the image and video processing mechanisms of RCNN are researched. It encodes constant stimuli as periodic spike trains and periodic stimuli as chaotic spike trains, the same as biological neural information encoding characteristics. Finally, the RCNN is applicated to image segmentation, fusion, and pulse shape discrimination subtasks. It is demonstrated to be robust, efficient, and highly anti-noised, with outstanding performance in all applications mentioned above.

new Equipping Sketch Patches with Context-Aware Positional Encoding for Graphic Sketch Representation

Authors: Sicong Zang, Zhijun Fang

Abstract: The drawing order of a sketch records how it is created stroke-by-stroke by a human being. For graphic sketch representation learning, recent studies have injected sketch drawing orders into graph edge construction by linking each patch to another in accordance to a temporal-based nearest neighboring strategy. However, such constructed graph edges may be unreliable, since a sketch could have variants of drawings. In this paper, we propose a variant-drawing-protected method by equipping sketch patches with context-aware positional encoding (PE) to make better use of drawing orders for learning graphic sketch representation. Instead of injecting sketch drawings into graph edges, we embed these sequential information into graph nodes only. More specifically, each patch embedding is equipped with a sinusoidal absolute PE to highlight the sequential position in the drawing order. And its neighboring patches, ranked by the values of self-attention scores between patch embeddings, are equipped with learnable relative PEs to restore the contextual positions within a neighborhood. During message aggregation via graph convolutional networks, a node receives both semantic contents from patch embeddings and contextual patterns from PEs by its neighbors, arriving at drawing-order-enhanced sketch representations. Experimental results indicate that our method significantly improves sketch healing and controllable sketch synthesis.

new Boosting Few-Shot Learning with Disentangled Self-Supervised Learning and Meta-Learning for Medical Image Classification

Authors: Eva Pachetti, Sotirios A. Tsaftaris, Sara Colantonio

Abstract: Background and objective: Employing deep learning models in critical domains such as medical imaging poses challenges associated with the limited availability of training data. We present a strategy for improving the performance and generalization capabilities of models trained in low-data regimes. Methods: The proposed method starts with a pre-training phase, where features learned in a self-supervised learning setting are disentangled to improve the robustness of the representations for downstream tasks. We then introduce a meta-fine-tuning step, leveraging related classes between meta-training and meta-testing phases but varying the granularity level. This approach aims to enhance the model's generalization capabilities by exposing it to more challenging classification tasks during meta-training and evaluating it on easier tasks but holding greater clinical relevance during meta-testing. We demonstrate the effectiveness of the proposed approach through a series of experiments exploring several backbones, as well as diverse pre-training and fine-tuning schemes, on two distinct medical tasks, i.e., classification of prostate cancer aggressiveness from MRI data and classification of breast cancer malignity from microscopic images. Results: Our results indicate that the proposed approach consistently yields superior performance w.r.t. ablation experiments, maintaining competitiveness even when a distribution shift between training and evaluation data occurs. Conclusion: Extensive experiments demonstrate the effectiveness and wide applicability of the proposed approach. We hope that this work will add another solution to the arsenal of addressing learning issues in data-scarce imaging domains.

new NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation

Authors: Jiahao Chen, Yipeng Qin, Lingjie Liu, Jiangbo Lu, Guanbin Li

Abstract: Neural Radiance Field (NeRF) has been widely recognized for its excellence in novel view synthesis and 3D scene reconstruction. However, their effectiveness is inherently tied to the assumption of static scenes, rendering them susceptible to undesirable artifacts when confronted with transient distractors such as moving objects or shadows. In this work, we propose a novel paradigm, namely "Heuristics-Guided Segmentation" (HuGS), which significantly enhances the separation of static scenes from transient distractors by harmoniously combining the strengths of hand-crafted heuristics and state-of-the-art segmentation models, thus significantly transcending the limitations of previous solutions. Furthermore, we delve into the meticulous design of heuristics, introducing a seamless fusion of Structure-from-Motion (SfM)-based heuristics and color residual heuristics, catering to a diverse range of texture profiles. Extensive experiments demonstrate the superiority and robustness of our method in mitigating transient distractors for NeRFs trained in non-static scenes. Project page: https://cnhaox.github.io/NeRF-HuGS/.

URLs: https://cnhaox.github.io/NeRF-HuGS/.

new WordRobe: Text-Guided Generation of Textured 3D Garments

Authors: Astitva Srivastava, Pranav Manu, Amit Raj, Varun Jampani, Avinash Sharma

Abstract: In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time as compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation & animation pipelines without any post-processing.

new DeepMIF: Deep Monotonic Implicit Fields for Large-Scale LiDAR 3D Mapping

Authors: Kutay Y{\i}lmaz, Matthias Nie{\ss}ner, Anastasiia Kornilova, Alexey Artemov

Abstract: Recently, significant progress has been achieved in sensing real large-scale outdoor 3D environments, particularly by using modern acquisition equipment such as LiDAR sensors. Unfortunately, they are fundamentally limited in their ability to produce dense, complete 3D scenes. To address this issue, recent learning-based methods integrate neural implicit representations and optimizable feature grids to approximate surfaces of 3D scenes. However, naively fitting samples along raw LiDAR rays leads to noisy 3D mapping results due to the nature of sparse, conflicting LiDAR measurements. Instead, in this work we depart from fitting LiDAR data exactly, instead letting the network optimize a non-metric monotonic implicit field defined in 3D space. To fit our field, we design a learning system integrating a monotonicity loss that enables optimizing neural monotonic fields and leverages recent progress in large-scale 3D mapping. Our algorithm achieves high-quality dense 3D mapping performance as captured by multiple quantitative and perceptual measures and visual results obtained for Mai City, Newer College, and KITTI benchmarks. The code of our approach will be made publicly available.

new Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

Authors: Yabin Zhang, Wenjie Zhu, Hui Tang, Zhiyuan Ma, Kaiyang Zhou, Lei Zhang

Abstract: With the emergence of pre-trained vision-language models like CLIP, how to adapt them to various downstream classification tasks has garnered significant attention in recent research. The adaptation strategies can be typically categorized into three paradigms: zero-shot adaptation, few-shot adaptation, and the recently-proposed training-free few-shot adaptation. Most existing approaches are tailored for a specific setting and can only cater to one or two of these paradigms. In this paper, we introduce a versatile adaptation approach that can effectively work under all three settings. Specifically, we propose the dual memory networks that comprise dynamic and static memory components. The static memory caches training data knowledge, enabling training-free few-shot adaptation, while the dynamic memory preserves historical test features online during the testing process, allowing for the exploration of additional data insights beyond the training set. This novel capability enhances model performance in the few-shot setting and enables model usability in the absence of training data. The two memory networks employ the same flexible memory interactive strategy, which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. Our approach is tested across 11 datasets under the three task settings. Remarkably, in the zero-shot scenario, it outperforms existing methods by over 3\% and even shows superior results against methods utilizing external training data. Additionally, our method exhibits robust performance against natural distribution shifts. Codes are available at \url{https://github.com/YBZh/DMN}.

URLs: https://github.com/YBZh/DMN

new Fake or JPEG? Revealing Common Biases in Generated Image Detection Datasets

Authors: Patrick Grommelt, Louis Weiss, Franz-Josef Pfreundt, Janis Keuper

Abstract: The widespread adoption of generative image models has highlighted the urgent need to detect artificial content, which is a crucial step in combating widespread manipulation and misinformation. Consequently, numerous detectors and associated datasets have emerged. However, many of these datasets inadvertently introduce undesirable biases, thereby impacting the effectiveness and evaluation of detectors. In this paper, we emphasize that many datasets for AI-generated image detection contain biases related to JPEG compression and image size. Using the GenImage dataset, we demonstrate that detectors indeed learn from these undesired factors. Furthermore, we show that removing the named biases substantially increases robustness to JPEG compression and significantly alters the cross-generator performance of evaluated detectors. Specifically, it leads to more than 11 percentage points increase in cross-generator performance for ResNet50 and Swin-T detectors on the GenImage dataset, achieving state-of-the-art results. We provide the dataset and source codes of this paper on the anonymous website: https://www.unbiased-genimage.org

URLs: https://www.unbiased-genimage.org

new MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors

Authors: He Zhang, Shenghao Ren, Haolei Yuan, Jianhui Zhao, Fan Li, Shuangpeng Sun, Zhenghao Liang, Tao Yu, Qiu Shen, Xun Cao

Abstract: Foot contact is an important cue not only for human motion capture but also for motion understanding and physically plausible motion generation. However, most of the foot-contact annotations in existing datasets are estimated by purely visual matching and distance thresholding, which results in low accuracy and coarse granularity. Even though existing multimodal datasets synergistically capture plantar pressure (foot contact) and visual signals, they are specifically designed for small-range and slow motion such as Taiji Quan and Yoga. Therefore, there is still a lack of a vision-pressure multimodal dataset with large-range and fast human motion, as well as accurate and dense foot-contact annotation. To fill this gap, we propose a Multimodal MoCap Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate and dense plantar pressure signals synchronized with RGBD observations, which is especially useful for both plausible shape estimation, robust pose fitting without foot drifting, and accurate global translation tracking. To validate the dataset, we propose an RGBD-P SMPL fitting method and also a monocular-video-based baseline framework, VP-MoCap, for human motion capture. Experiments demonstrate that our RGBD-P SMPL Fitting results significantly outperform pure visual motion capture. Moreover, VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate the research in this direction and also provide a good reference for MoCap applications in various domains. Project page: https://haolyuan.github.io/MMVP-Dataset/.

URLs: https://haolyuan.github.io/MMVP-Dataset/.

new AniArtAvatar: Animatable 3D Art Avatar from a Single Image

Authors: Shaoxu Li

Abstract: We present a novel approach for generating animatable 3D-aware art avatars from a single image, with controllable facial expressions, head poses, and shoulder movements. Unlike previous reenactment methods, our approach utilizes a view-conditioned 2D diffusion model to synthesize multi-view images from a single art portrait with a neutral expression. With the generated colors and normals, we synthesize a static avatar using an SDF-based neural surface. For avatar animation, we extract control points, transfer the motion with these points, and deform the implicit canonical space. Firstly, we render the front image of the avatar, extract the 2D landmarks, and project them to the 3D space using a trained SDF network. We extract 3D driving landmarks using 3DMM and transfer the motion to the avatar landmarks. To animate the avatar pose, we manually set the body height and bound the head and torso of an avatar with two cages. The head and torso can be animated by transforming the two cages. Our approach is a one-shot pipeline that can be applied to various styles. Experiments demonstrate that our method can generate high-quality 3D art avatars with desired control over different motions.

new UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps

Authors: Maciej K Wozniak, Mattias Hansson, Marko Thiel, Patric Jensfelt

Abstract: In this study, we address a gap in existing unsupervised domain adaptation approaches on LiDAR-based 3D object detection, which have predominantly concentrated on adapting between established, high-density autonomous driving datasets. We focus on sparser point clouds, capturing scenarios from different perspectives: not just from vehicles on the road but also from mobile robots on sidewalks, which encounter significantly different environmental conditions and sensor configurations. We introduce Unsupervised Adversarial Domain Adaptation for 3D Object Detection (UADA3D). UADA3D does not depend on pre-trained source models or teacher-student architectures. Instead, it uses an adversarial approach to directly learn domain-invariant features. We demonstrate its efficacy in various adaptation scenarios, showing significant improvements in both self-driving car and mobile robot domains. Our code is open-source and will be available soon.

new Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency

Authors: Yingjie Xu, Bangzhen Liu, Hao Tang, Bailin Deng, Shengfeng He

Abstract: We propose a voxel-based optimization framework, ReVoRF, for few-shot radiance fields that strategically address the unreliability in pseudo novel view synthesis. Our method pivots on the insight that relative depth relationships within neighboring regions are more reliable than the absolute color values in disoccluded areas. Consequently, we devise a bilateral geometric consistency loss that carefully navigates the trade-off between color fidelity and geometric accuracy in the context of depth consistency for uncertain regions. Moreover, we present a reliability-guided learning strategy to discern and utilize the variable quality across synthesized views, complemented by a reliability-aware voxel smoothing algorithm that smoothens the transition between reliable and unreliable data patches. Our approach allows for a more nuanced use of all available data, promoting enhanced learning from regions previously considered unsuitable for high-quality reconstruction. Extensive experiments across diverse datasets reveal that our approach attains significant gains in efficiency and accuracy, delivering rendering speeds of 3 FPS, 7 mins to train a $360^\circ$ scene, and a 5\% improvement in PSNR over existing few-shot methods. Code is available at https://github.com/HKCLynn/ReVoRF.

URLs: https://github.com/HKCLynn/ReVoRF.

new Exploring Dynamic Transformer for Efficient Object Tracking

Authors: Jiawen Zhu, Xin Chen, Haiwen Diao, Shuai Li, Jun-Yan He, Chenyang Li, Bin Luo, Dong Wang, Huchuan Lu

Abstract: The speed-precision trade-off is a critical problem for visual object tracking which usually requires low latency and deployment on constrained resources. Existing solutions for efficient tracking mainly focus on adopting light-weight backbones or modules, which nevertheless come at the cost of a sacrifice in precision. In this paper, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit diverse levels of complexity. We argue that a simple network is sufficient for easy frames in video sequences, while more computation could be assigned to difficult ones. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Thus, it can achieve higher performance with the same running speed. We formulate instance-specific tracking as a sequential decision problem and attach terminating branches to intermediate layers of the entire model. Especially, to fully utilize the computations, we introduce the feature recycling mechanism to reuse the outputs of predecessors. Furthermore, a target-aware self-distillation strategy is designed to enhance the discriminating capabilities of early predictions by effectively mimicking the representation pattern of the deep model. Extensive experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model. For instance, DyTrack obtains 64.9% AUC on LaSOT with a speed of 256 fps.

new DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

Authors: Qilin Wang, Jiangning Zhang, Chengming Xu, Weijian Cao, Ying Tai, Yue Han, Yanhao Ge, Hong Gu, Chengjie Wang, Yanwei Fu

Abstract: Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, showing great importance in photograph. In spite of the great progress in this area, current researches generally meet three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome above challenges, this paper presents DiffFAE, a one-stage and highly-efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity query attributes transfer, we adopt Space-sensitive Physical Customization (SPC), which ensures the fidelity and generalization ability by utilizing rendering texture derived from 3D Morphable Model (3DMM). In order to preserve source attributes, we introduce the Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-regarding features, thereby better preserving the identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.

new Hierarchical Light Transformer Ensembles for Multimodal Trajectory Forecasting

Authors: Adrien Lafage, Mathieu Barbier, Gianni Franchi, David Filliat

Abstract: Accurate trajectory forecasting is crucial for the performance of various systems, such as advanced driver-assistance systems and self-driving vehicles. These forecasts allow to anticipate events leading to collisions and, therefore, to mitigate them. Deep Neural Networks have excelled in motion forecasting, but issues like overconfidence and uncertainty quantification persist. Deep Ensembles address these concerns, yet applying them to multimodal distributions remains challenging. In this paper, we propose a novel approach named Hierarchical Light Transformer Ensembles (HLT-Ens), aimed at efficiently training an ensemble of Transformer architectures using a novel hierarchical loss function. HLT-Ens leverages grouped fully connected layers, inspired by grouped convolution techniques, to capture multimodal distributions, effectively. Through extensive experimentation, we demonstrate that HLT-Ens achieves state-of-the-art performance levels, offering a promising avenue for improving trajectory forecasting techniques.

new Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to Inform GenAI Copyright Disputes

Authors: Uri Hacohen, Adi Haviv, Shahar Sarfaty, Bruria Friedman, Niva Elkin-Koren, Roi Livni, Amit H Bermano

Abstract: The advent of Generative Artificial Intelligence (GenAI) models, including GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content creation, enabling non-professionals to produce high-quality content across various domains. This transformative technology has led to a surge of synthetic content and sparked legal disputes over copyright infringement. To address these challenges, this paper introduces a novel approach that leverages the learning capacity of GenAI models for copyright legal analysis, demonstrated with GPT2 and Stable Diffusion models. Copyright law distinguishes between original expressions and generic ones (Sc\`enes \`a faire), protecting the former and permitting reproduction of the latter. However, this distinction has historically been challenging to make consistently, leading to over-protection of copyrighted works. GenAI offers an unprecedented opportunity to enhance this legal analysis by revealing shared patterns in preexisting works. We propose a data-driven approach to identify the genericity of works created by GenAI, employing "data-driven bias" to assess the genericity of expressive compositions. This approach aids in copyright scope determination by utilizing the capabilities of GenAI to identify and prioritize expressive elements and rank them according to their frequency in the model's dataset. The potential implications of measuring expressive genericity for copyright law are profound. Such scoring could assist courts in determining copyright scope during litigation, inform the registration practices of Copyright Offices, allowing registration of only highly original synthetic works, and help copyright owners signal the value of their works and facilitate fairer licensing deals. More generally, this approach offers valuable insights to policymakers grappling with adapting copyright law to the challenges posed by the era of GenAI.

new Manifold-Guided Lyapunov Control with Diffusion Models

Authors: Amartya Mukherjee, Thanin Quartz, Jun Liu

Abstract: This paper presents a novel approach to generating stabilizing controllers for a large class of dynamical systems using diffusion models. The core objective is to develop stabilizing control functions by identifying the closest asymptotically stable vector field relative to a predetermined manifold and adjusting the control function based on this finding. To achieve this, we employ a diffusion model trained on pairs consisting of asymptotically stable vector fields and their corresponding Lyapunov functions. Our numerical results demonstrate that this pre-trained model can achieve stabilization over previously unseen systems efficiently and rapidly, showcasing the potential of our approach in fast zero-shot control and generalizability.

new AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

Authors: Huawei Wei, Zejun Yang, Zhisheng Wang

Abstract: In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at https://github.com/scutzzj/AniPortrait

URLs: https://github.com/scutzzj/AniPortrait

new PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

Authors: Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, Elliot J. Crowley

Abstract: We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba

URLs: https://github.com/ChenhongyiYang/PlainMamba

new The Solution for the CVPR 2023 1st foundation model challenge-Track2

Authors: Haonan Xu, Yurui Huang, Sishun Pan, Zhihao Guan, Yi Xu, Yang Yang

Abstract: In this paper, we propose a solution for cross-modal transportation retrieval. Due to the cross-domain problem of traffic images, we divide the problem into two sub-tasks of pedestrian retrieval and vehicle retrieval through a simple strategy. In pedestrian retrieval tasks, we use IRRA as the base model and specifically design an Attribute Classification to mine the knowledge implied by attribute labels. More importantly, We use the strategy of Inclusion Relation Matching to make the image-text pairs with inclusion relation have similar representation in the feature space. For the vehicle retrieval task, we use BLIP as the base model. Since aligning the color attributes of vehicles is challenging, we introduce attribute-based object detection techniques to add color patch blocks to vehicle images for color data augmentation. This serves as strong prior information, helping the model perform the image-text alignment. At the same time, we incorporate labeled attributes into the image-text alignment loss to learn fine-grained alignment and prevent similar images and texts from being incorrectly separated. Our approach ranked first in the final B-board test with a score of 70.9.

new Panonut360: A Head and Eye Tracking Dataset for Panoramic Video

Authors: Yutong Xu, Junhao Du, Jiahe Wang, Yuwei Ning, Sihan Zhou Yang Cao

Abstract: With the rapid development and widespread application of VR/AR technology, maximizing the quality of immersive panoramic video services that match users' personal preferences and habits has become a long-standing challenge. Understanding the saliency region where users focus, based on data collected with HMDs, can promote multimedia encoding, transmission, and quality assessment. At the same time, large-scale datasets are essential for researchers and developers to explore short/long-term user behavior patterns and train AI models related to panoramic videos. However, existing panoramic video datasets often include low-frequency user head or eye movement data through short-term videos only, lacking sufficient data for analyzing users' Field of View (FoV) and generating video saliency regions. Driven by these practical factors, in this paper, we present a head and eye tracking dataset involving 50 users (25 males and 25 females) watching 15 panoramic videos. The dataset provides details on the viewport and gaze attention locations of users. Besides, we present some statistics samples extracted from the dataset. For example, the deviation between head and eye movements challenges the widely held assumption that gaze attention decreases from the center of the FoV following a Gaussian distribution. Our analysis reveals a consistent downward offset in gaze fixations relative to the FoV in experimental settings involving multiple users and videos. That's why we name the dataset Panonut, a saliency weighting shaped like a donut. Finally, we also provide a script that generates saliency distributions based on given head or eye coordinates and pre-generated saliency distribution map sets of each video from the collected eye tracking data. The dataset is available on website: https://dianvrlab.github.io/Panonut360/.

URLs: https://dianvrlab.github.io/Panonut360/.

new Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection

Authors: Jongha Kim, Jihwan Park, Jinyoung Park, Jinyoung Kim, Sehyung Kim, Hyunwoo J. Kim

Abstract: Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architectures recently. However, we identify two key limitations in a conventional label assignment for training Transformer-based VRD models, which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an unspecialized query is trained since a query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, a query is also insufficiently trained since a GT is assigned only to a single prediction, therefore near-correct or even correct predictions are suppressed by being assigned no relation as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a specialized query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject, an object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at https://github.com/mlvlab/SpeaQ.

URLs: https://github.com/mlvlab/SpeaQ.

new Invisible Gas Detection: An RGB-Thermal Cross Attention Network and A New Benchmark

Authors: Jue Wang, Yuxiang Lin, Qi Zhao, Dong Luo, Shuaibao Chen, Wei Chen, Xiaojiang Peng

Abstract: The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identify gas leakage areas. However, the development of high-quality algorithms has been challenging due to the low texture in thermal images and the lack of open-source datasets. In this paper, we present the RGB-Thermal Cross Attention Network (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas area information from thermal images. Additionally, to facilitate the research of invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database including about 1.3K well-annotated RGB-thermal images with eight variant collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods, surpassing single-stream SOTA models in terms of accuracy, Intersection of Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively. The code and data will be made available soon.

new Deep Learning for Segmentation of Cracks in High-Resolution Images of Steel Bridges

Authors: Andrii Kompanets, Gautam Pai, Remco Duits, Davide Leonetti, Bert Snijder

Abstract: Automating the current bridge visual inspection practices using drones and image processing techniques is a prominent way to make these inspections more effective, robust, and less expensive. In this paper, we investigate the development of a novel deep-learning method for the detection of fatigue cracks in high-resolution images of steel bridges. First, we present a novel and challenging dataset comprising of images of cracks in steel bridges. Secondly, we integrate the ConvNext neural network with a previous state- of-the-art encoder-decoder network for crack segmentation. We study and report, the effects of the use of background patches on the network performance when applied to high-resolution images of cracks in steel bridges. Finally, we introduce a loss function that allows the use of more background patches for the training process, which yields a significant reduction in false positive rates.

new FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts

Authors: Kazuki Kawamura, Jun Rekimoto

Abstract: Quickly understanding lengthy lecture videos is essential for learners with limited time and interest in various topics to improve their learning efficiency. To this end, video summarization has been actively researched to enable users to view only important scenes from a video. However, these studies focus on either the visual or audio information of a video and extract important segments in the video. Therefore, there is a risk of missing important information when both the teacher's speech and visual information on the blackboard or slides are important, such as in a lecture video. To tackle this issue, we propose FastPerson, a video summarization approach that considers both the visual and auditory information in lecture videos. FastPerson creates summary videos by utilizing audio transcriptions along with on-screen images and text, minimizing the risk of overlooking crucial information for learners. Further, it provides a feature that allows learners to switch between the summary and original videos for each chapter of the video, enabling them to adjust the pace of learning based on their interests and level of understanding. We conducted an evaluation with 40 participants to assess the effectiveness of our method and confirmed that it reduced viewing time by 53\% at the same level of comprehension as that when using traditional video playback methods.

new Multi-Task Dense Prediction via Mixture of Low-Rank Experts

Authors: Yuqi Yang, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, Bo Li

Abstract: Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have received great performance but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing. Furthermore, to control the parameters and computational cost brought by the increase in the number of experts, we take inspiration from LoRA and propose to leverage the low-rank format of a vanilla convolution in the expert network. Since the low-rank experts have fewer parameters and can be dynamically parameterized into the generic convolution, the parameters and computational cost do not change much with the increase of experts. Benefiting from this design, we increase the number of experts and its reception field to enlarge the representation capacity, facilitating multiple dense tasks learning in a unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 benchmarks show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics. Our code is available at https://github.com/YuqiYang213/MLoRE.

URLs: https://github.com/YuqiYang213/MLoRE.

new Noise2Noise Denoising of CRISM Hyperspectral Data

Authors: Robert Platt, Rossella Arcucci, C\'edric John

Abstract: Hyperspectral data acquired by the Compact Reconnaissance Imaging Spectrometer for Mars (CRISM) have allowed for unparalleled mapping of the surface mineralogy of Mars. Due to sensor degradation over time, a significant portion of the recently acquired data is considered unusable. Here a new data-driven model architecture, Noise2Noise4Mars (N2N4M), is introduced to remove noise from CRISM images. Our model is self-supervised and does not require zero-noise target data, making it well suited for use in Planetary Science applications where high quality labelled data is scarce. We demonstrate its strong performance on synthetic-noise data and CRISM images, and its impact on downstream classification performance, outperforming benchmark methods on most metrics. This allows for detailed analysis for critical sites of interest on the Martian surface, including proposed lander sites.

new Makeup Prior Models for 3D Facial Makeup Estimation and Applications

Authors: Xingchao Yang, Takafumi Taketomi, Yuki Endo, Yoshihiro Kanamori

Abstract: In this work, we introduce two types of makeup prior models to extend existing 3D face prior models: PCA-based and StyleGAN2-based priors. The PCA-based prior model is a linear model that is easy to construct and is computationally efficient. However, it retains only low-frequency information. Conversely, the StyleGAN2-based model can represent high-frequency information with relatively higher computational cost than the PCA-based model. Although there is a trade-off between the two models, both are applicable to 3D facial makeup estimation and related applications. By leveraging makeup prior models and designing a makeup consistency module, we effectively address the challenges that previous methods faced in robustly estimating makeup, particularly in the context of handling self-occluded faces. In experiments, we demonstrate that our approach reduces computational costs by several orders of magnitude, achieving speeds up to 180 times faster. In addition, by improving the accuracy of the estimated makeup, we confirm that our methods are highly advantageous for various 3D facial makeup applications such as 3D makeup face reconstruction, user-friendly makeup editing, makeup transfer, and interpolation.

new MUTE-SLAM: Real-Time Neural SLAM with Multiple Tri-Plane Hash Representations

Authors: Yifan Yan, Ruomin He, Zhenghua Liu

Abstract: We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM effectively tracks camera positions and incrementally builds a scalable multi-map representation for both small and large indoor environments. It dynamically allocates sub-maps for newly observed local regions, enabling constraint-free mapping without prior scene information. Unlike traditional grid-based methods, we use three orthogonal axis-aligned planes for hash-encoding scene properties, significantly reducing hash collisions and the number of trainable parameters. This hybrid approach not only speeds up convergence but also enhances the fidelity of surface reconstruction. Furthermore, our optimization strategy concurrently optimizes all sub-maps intersecting with the current camera frustum, ensuring global consistency. Extensive testing on both real-world and synthetic datasets has shown that MUTE-SLAM delivers state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings. The code will be made public upon acceptance of the paper.

new GenesisTex: Adapting Image Denoising Diffusion to Texture Space

Authors: Chenjian Gao, Boyan Jiang, Xinghui Li, Yingpeng Zhang, Qian Yu

Abstract: We present GenesisTex, a novel method for synthesizing textures for 3D geometries from text descriptions. GenesisTex adapts the pretrained image diffusion model to texture space by texture space sampling. Specifically, we maintain a latent texture map for each viewpoint, which is updated with predicted noise on the rendering of the corresponding viewpoint. The sampled latent texture maps are then decoded into a final texture map. During the sampling process, we focus on both global and local consistency across multiple viewpoints: global consistency is achieved through the integration of style consistency mechanisms within the noise prediction network, and low-level consistency is achieved by dynamically aligning latent textures. Finally, we apply reference-based inpainting and img2img on denser views for texture refinement. Our approach overcomes the limitations of slow optimization in distillation-based methods and instability in inpainting-based methods. Experiments on meshes from various sources demonstrate that our method surpasses the baseline methods quantitatively and qualitatively.

new Towards 3D Vision with Low-Cost Single-Photon Cameras

Authors: Fangzhou Mu, Carter Sifferman, Sacha Jungerman, Yiquan Li, Mark Han, Michael Gleicher, Mohit Gupta, Yin Li

Abstract: We present a method for reconstructing 3D shape of arbitrary Lambertian objects based on measurements by miniature, energy-efficient, low-cost single-photon cameras. These cameras, operating as time resolved image sensors, illuminate the scene with a very fast pulse of diffuse light and record the shape of that pulse as it returns back from the scene at a high temporal resolution. We propose to model this image formation process, account for its non-idealities, and adapt neural rendering to reconstruct 3D geometry from a set of spatially distributed sensors with known poses. We show that our approach can successfully recover complex 3D shapes from simulated data. We further demonstrate 3D object reconstruction from real-world captures, utilizing measurements from a commodity proximity sensor. Our work draws a connection between image-based modeling and active range scanning and is a step towards 3D vision with single-photon cameras.

new Improving Text-to-Image Consistency via Automatic Prompt Optimization

Authors: Oscar Ma\~nas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal

Abstract: Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.

new DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing

Authors: Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, Juho Kannala

Abstract: 3D Gaussian splatting, a novel differentiable rendering technique, has achieved state-of-the-art novel view synthesis results with high rendering speeds and relatively low training times. However, its performance on scenes commonly seen in indoor datasets is poor due to the lack of geometric constraints during optimization. We extend 3D Gaussian splatting with depth and normal cues to tackle challenging indoor datasets and showcase techniques for efficient mesh extraction, an important downstream application. Specifically, we regularize the optimization procedure with depth information, enforce local smoothness of nearby Gaussians, and use the geometry of the 3D Gaussians supervised by normal cues to achieve better alignment with the true scene geometry. We improve depth estimation and novel view synthesis results over baselines and show how this simple yet effective regularization technique can be used to directly extract meshes from the Gaussian representation yielding more physically accurate reconstructions on indoor scenes. Our code will be released in https://github.com/maturk/dn-splatter.

URLs: https://github.com/maturk/dn-splatter.

new Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders

Authors: Alexandre Eyma\"el, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

Abstract: Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method specifically differs by exclusively considering pairs of cropped images sourced from the same image but cropped differently, deviating from the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performances and drastically reducing pre-training time. Furthermore, we demonstrate that CropMAE learns similar object-centric representations without explicit motion, showing that current self-supervised learning methods do not learn objects from motion, but rather thanks to the Siamese architecture. Finally, CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches. Our code is available at https://github.com/alexandre-eymael/CropMAE.

URLs: https://github.com/alexandre-eymael/CropMAE.

new DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

Authors: Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin

Abstract: Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one or two-handed object interactions from provided text prompts and geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based interaction stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the interaction phase both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the interaction phase. For textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by utilizing a hand pose estimate from an off-the-shelf pose estimator for guidance, and then sampling multiple different actions in the interaction stage.

new Assessment of Multimodal Large Language Models in Alignment with Human Values

Authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

Abstract: Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). However, in terms of Multimodal Large Language Models (MLLMs), despite their commendable performance in perception and reasoning tasks, their alignment with human values remains largely unexplored, given the complexity of defining hhh dimensions in the visual world and the difficulty in collecting relevant data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh principle. We also present a unified evaluation strategy supporting assessment across various scenarios and different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities, limitations, and the dynamic relationships between evaluation levels, guiding future advancements in the field.

new A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities

Authors: Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar, Emine Bensu Erdemir, Abdullah Alanbay, Anjany Sekuboyina, Berkan Lafci, Mehmet K. Ozdemir, Bjoern Menze

Abstract: A major challenge in computational research in 3D medical imaging is the lack of comprehensive datasets. Addressing this issue, our study introduces CT-RATE, the first 3D medical imaging dataset that pairs images with textual reports. CT-RATE consists of 25,692 non-contrast chest CT volumes, expanded to 50,188 through various reconstructions, from 21,304 unique patients, along with corresponding radiology text reports. Leveraging CT-RATE, we developed CT-CLIP, a CT-focused contrastive language-image pre-training framework. As a versatile, self-supervised model, CT-CLIP is designed for broad application and does not require task-specific training. Remarkably, CT-CLIP outperforms state-of-the-art, fully supervised methods in multi-abnormality detection across all key metrics, thus eliminating the need for manual annotation. We also demonstrate its utility in case retrieval, whether using imagery or textual queries, thereby advancing knowledge dissemination. The open-source release of CT-RATE and CT-CLIP marks a significant advancement in medical AI, enhancing 3D imaging analysis and fostering innovation in healthcare.

new GTA-HDR: A Large-Scale Synthetic Dataset for HDR Image Reconstruction

Authors: Hrishav Bakul Barua, Kalin Stefanov, KokSheik Wong, Abhinav Dhall, Ganesh Krishnasamy

Abstract: High Dynamic Range (HDR) content (i.e., images and videos) has a broad range of applications. However, capturing HDR content from real-world scenes is expensive and time- consuming. Therefore, the challenging task of reconstructing visually accurate HDR images from their Low Dynamic Range (LDR) counterparts is gaining attention in the vision research community. A major challenge in this research problem is the lack of datasets, which capture diverse scene conditions (e.g., lighting, shadows, weather, locations, landscapes, objects, humans, buildings) and various image features (e.g., color, contrast, saturation, hue, luminance, brightness, radiance). To address this gap, in this paper, we introduce GTA-HDR, a large-scale synthetic dataset of photo-realistic HDR images sampled from the GTA-V video game. We perform thorough evaluation of the proposed dataset, which demonstrates significant qualitative and quantitative improvements of the state-of-the-art HDR image reconstruction methods. Furthermore, we demonstrate the effectiveness of the proposed dataset and its impact on additional computer vision tasks including 3D human pose estimation, human body part segmentation, and holistic scene segmentation. The dataset, data collection pipeline, and evaluation code are available at: https://github.com/HrishavBakulBarua/GTA-HDR.

URLs: https://github.com/HrishavBakulBarua/GTA-HDR.

new ReMamber: Referring Image Segmentation with Mamba Twister

Authors: Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong, Ya Zhang, Yanfeng Wang

Abstract: Referring Image Segmentation (RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic computation cost makes it resource-consuming in capturing long-range visual-language dependencies. Fortunately, Mamba addresses this with efficient linear complexity in processing. However, directly applying Mamba to multi-modal interactions presents challenges, primarily due to inadequate channel interactions for the effective fusion of multi-modal data. In this paper, we propose ReMamber, a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism. We achieve the state-of-the-art on three challenging benchmarks. Moreover, we conduct thorough analyses of ReMamber and discuss other fusion designs using Mamba. These provide valuable perspectives for future research.

new To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of 3D Transfer Learning

Authors: Souhail Hadgi, Lei Li, Maks Ovsjanikov

Abstract: Transfer learning has long been a key factor in the advancement of many fields including 2D image analysis. Unfortunately, its applicability in 3D data processing has been relatively limited. While several approaches for 3D transfer learning have been proposed in recent literature, with contrastive learning gaining particular prominence, most existing methods in this domain have only been studied and evaluated in limited scenarios. Most importantly, there is currently a lack of principled understanding of both when and why 3D transfer learning methods are applicable. Remarkably, even the applicability of standard supervised pre-training is poorly understood. In this work, we conduct the first in-depth quantitative and qualitative investigation of supervised and contrastive pre-training strategies and their utility in downstream 3D tasks. We demonstrate that layer-wise analysis of learned features provides significant insight into the downstream utility of trained networks. Informed by this analysis, we propose a simple geometric regularization strategy, which improves the transferability of supervised pre-training. Our work thus sheds light onto both the specific challenges of 3D transfer learning, as well as strategies to overcome them.

new Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Authors: Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei

Abstract: Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Instead of simply applying moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform moving average to avoid distribution shift across timesteps. In view that diffusion models evolve the recovery from low-frequency components to high-frequency details, we further decompose the samples into different frequency components and execute moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF could be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performances compared to the baselines, with almost negligible additional complexity cost.

new Low-Latency Neural Stereo Streaming

Authors: Qiqi Hou, Farzad Farhadzadeh, Amir Said, Guillaume Sautiere, Hoang Le

Abstract: The rise of new video modalities like virtual reality or autonomous driving has increased the demand for efficient multi-view video compression methods, both in terms of rate-distortion (R-D) performance and in terms of delay and runtime. While most recent stereo video compression approaches have shown promising performance, they compress left and right views sequentially, leading to poor parallelization and runtime performance. This work presents Low-Latency neural codec for Stereo video Streaming (LLSS), a novel parallel stereo video coding method designed for fast and efficient low-latency stereo video streaming. Instead of using a sequential cross-view motion compensation like existing methods, LLSS introduces a bidirectional feature shifting module to directly exploit mutual information among views and encode them effectively with a joint cross-view prior model for entropy coding. Thanks to this design, LLSS processes left and right views in parallel, minimizing latency; all while substantially improving R-D performance compared to both existing neural and conventional codecs.

new Deepfake Generation and Detection: A Benchmark and Survey

Authors: Gan Pei, Jiangning Zhang, Menghan Hu, Guangtao Zhai, Chengjie Wang, Zhenyu Zhang, Jian Yang, Chunhua Shen, Dacheng Tao

Abstract: In addition to the advancements in deepfake generation, corresponding detection technologies need to continuously evolve to regulate the potential misuse of deepfakes, such as for privacy invasion and phishing attacks. This survey comprehensively reviews the latest developments in deepfake generation and detection, summarizing and analyzing the current state of the art in this rapidly evolving field. We first unify task definitions, comprehensively introduce datasets and metrics, and discuss the development of generation and detection technology frameworks. Then, we discuss the development of several related sub-fields and focus on researching four mainstream deepfake fields: popular face swap, face reenactment, talking face generation, and facial attribute editing, as well as foreign detection. Subsequently, we comprehensively benchmark representative methods on popular datasets for each field, fully evaluating the latest and influential works published in top conferences/journals. Finally, we analyze the challenges and future research directions of the discussed fields. We closely follow the latest developments in https://github.com/flyingby/Awesome-Deepfake-Generation-and-Detection.

URLs: https://github.com/flyingby/Awesome-Deepfake-Generation-and-Detection.

new Superior and Pragmatic Talking Face Generation with Teacher-Student Framework

Authors: Chao Liang, Jianwen Jiang, Tianyun Zhong, Gaojie Lin, Zhengkun Rong, Jiaqi Yang, Yongming Zhu

Abstract: Talking face generation technology creates talking videos from arbitrary appearance and motion signal, with the "arbitrary" offering ease of use but also introducing challenges in practical applications. Existing methods work well with standard inputs but suffer serious performance degradation with intricate real-world ones. Moreover, efficiency is also an important concern in deployment. To comprehensively address these issues, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost and editability. We first propose a simple but effective teacher model capable of handling inputs of varying qualities to generate high-quality results. Building on this, we devise an efficient distillation strategy to acquire an identity-specific student model that maintains quality with significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods for the four mentioned objectives, especially in reducing FLOPs by 99\% with the student model. SuperFace can be driven by both video and audio and allows for localized facial attributes editing.

new Sen2Fire: A Challenging Benchmark Dataset for Wildfire Detection using Sentinel Data

Authors: Yonghao Xu, Amanda Berg, Leif Haglund

Abstract: Utilizing satellite imagery for wildfire detection presents substantial potential for practical applications. To advance the development of machine learning algorithms in this domain, our study introduces the \textit{Sen2Fire} dataset--a challenging satellite remote sensing dataset tailored for wildfire detection. This dataset is curated from Sentinel-2 multi-spectral data and Sentinel-5P aerosol product, comprising a total of 2466 image patches. Each patch has a size of 512$\times$512 pixels with 13 bands. Given the distinctive sensitivities of various wavebands to wildfire responses, our research focuses on optimizing wildfire detection by evaluating different wavebands and employing a combination of spectral indices, such as normalized burn ratio (NBR) and normalized difference vegetation index (NDVI). The results suggest that, in contrast to using all bands for wildfire detection, selecting specific band combinations yields superior performance. Additionally, our study underscores the positive impact of integrating Sentinel-5 aerosol data for wildfire detection. The code and dataset are available online (https://zenodo.org/records/10881058).

URLs: https://zenodo.org/records/10881058).

new 2D Gaussian Splatting for Geometrically Accurate Radiance Fields

Authors: Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, Shenghua Gao

Abstract: 3D Gaussian Splatting (3DGS) has recently revolutionized radiance field reconstruction, achieving high quality novel view synthesis and fast rendering speed without baking. However, 3DGS fails to accurately represent surfaces due to the multi-view inconsistent nature of 3D Gaussians. We present 2D Gaussian Splatting (2DGS), a novel approach to model and reconstruct geometrically accurate radiance fields from multi-view images. Our key idea is to collapse the 3D volume into a set of 2D oriented planar Gaussian disks. Unlike 3D Gaussians, 2D Gaussians provide view-consistent geometry while modeling surfaces intrinsically. To accurately recover thin surfaces and achieve stable optimization, we introduce a perspective-accurate 2D splatting process utilizing ray-splat intersection and rasterization. Additionally, we incorporate depth distortion and normal consistency terms to further enhance the quality of the reconstructions. We demonstrate that our differentiable renderer allows for noise-free and detailed geometry reconstruction while maintaining competitive appearance quality, fast training speed, and real-time rendering. Our code will be made publicly available.

new A Survey on 3D Egocentric Human Pose Estimation

Authors: Md Mushfiqur Azam, Kevin Desai

Abstract: Egocentric human pose estimation aims to estimate human body poses and develop body representations from a first-person camera perspective. It has gained vast popularity in recent years because of its wide range of applications in sectors like XR-technologies, human-computer interaction, and fitness tracking. However, to the best of our knowledge, there is no systematic literature review based on the proposed solutions regarding egocentric 3D human pose estimation. To that end, the aim of this survey paper is to provide an extensive overview of the current state of egocentric pose estimation research. In this paper, we categorize and discuss the popular datasets and the different pose estimation models, highlighting the strengths and weaknesses of different methods by comparative analysis. This survey can be a valuable resource for both researchers and practitioners in the field, offering insights into key concepts and cutting-edge solutions in egocentric pose estimation, its wide-ranging applications, as well as the open problems with future scope.

new Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians

Authors: Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, Bo Dai

Abstract: The recent 3D Gaussian splatting (3D-GS) has shown remarkable rendering fidelity and efficiency compared to NeRF-based neural scene representations. While demonstrating the potential for real-time rendering, 3D-GS encounters rendering bottlenecks in large scenes with complex details due to an excessive number of Gaussian primitives located within the viewing frustum. This limitation is particularly noticeable in zoom-out views and can lead to inconsistent rendering speeds in scenes with varying details. Moreover, it often struggles to capture the corresponding level of details at different scales with its heuristic density control operation. Inspired by the Level-of-Detail (LOD) techniques, we introduce Octree-GS, featuring an LOD-structured 3D Gaussian approach supporting level-of-detail decomposition for scene representation that contributes to the final rendering results. Our model dynamically selects the appropriate level from the set of multi-resolution anchor points, ensuring consistent rendering performance with adaptive LOD adjustments while maintaining high-fidelity rendering results.

new ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection

Authors: Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Salman Khan, Fahad Shahbaz Khan

Abstract: Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformers-based methods with standard self-attention suffer from quadratic computational complexity with respect to the image resolution, making them less practical for CD tasks with limited training data. To address these issues, we propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions while reducing the model size. Our ELGC-Net comprises a Siamese encoder, fusion modules, and a decoder. The focus of our design is the introduction of an Efficient Local-Global Context Aggregator module within the encoder, capturing enhanced global context and local spatial information through a novel pooled-transpose (PT) attention and depthwise convolution, respectively. The PT attention employs pooling operations for robust feature extraction and minimizes computational cost with transposed attention. Extensive experiments on three challenging CD datasets demonstrate that ELGC-Net outperforms existing methods. Compared to the recent transformer-based CD approach (ChangeFormer), ELGC-Net achieves a 1.4% gain in intersection over union metric on the LEVIR-CD dataset, while significantly reducing trainable parameters. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. Finally, we also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings, while achieving comparable performance. Project url https://github.com/techmn/elgcnet.

URLs: https://github.com/techmn/elgcnet.

new Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

Authors: Akshay Paruchuri, Samuel Ehrenstein, Shuxian Wang, Inbar Fried, Stephen M. Pizer, Marc Niethammer, Roni Sengupta

Abstract: Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and detection of various health issues. Despite promising progress on mainstream, natural image depth estimation, techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects. In this paper, we utilize the photometric cues, i.e., the light emitted from an endoscope and reflected by the surface, to improve monocular depth estimation. We first create two novel loss functions with supervised and self-supervised variants that utilize a per-pixel shading representation. We then propose a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation. Finally, we introduce teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. We achieve state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data. Our code, pre-trained models, and supplementary materials can be found on our project page: https://ppsnet.github.io/

URLs: https://ppsnet.github.io/

new TC4D: Trajectory-Conditioned Text-to-4D Generation

Authors: Sherwin Bahmani, Xian Liu, Yifan Wang, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, Andrea Tagliasacchi, David B. Lindell

Abstract: Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate-they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: https://sherwinbahmani.github.io/tc4d.

URLs: https://sherwinbahmani.github.io/tc4d.

new AID: Attention Interpolation of Text-to-Image Diffusion

Authors: Qiyuan He, Jinghao Wang, Ziwei Liu, Angela Yao

Abstract: Conditional diffusion models can create unseen images in various settings, aiding image interpolation. Interpolation in latent spaces is well-studied, but interpolation with specific conditions like text or poses is less understood. Simple approaches, such as linear interpolation in the space of conditions, often result in images that lack consistency, smoothness, and fidelity. To that end, we introduce a novel training-free technique named Attention Interpolation via Diffusion (AID). Our key contributions include 1) proposing an inner/outer interpolated attention layer; 2) fusing the interpolated attention with self-attention to boost fidelity; and 3) applying beta distribution to selection to increase smoothness. We also present a variant, Prompt-guided Attention Interpolation via Diffusion (PAID), that considers interpolation as a condition-dependent generative process. This method enables the creation of new images with greater consistency, smoothness, and efficiency, and offers control over the exact path of interpolation. Our approach demonstrates effectiveness for conceptual and spatial interpolation. Code and demo are available at https://github.com/QY-H00/attention-interpolation-diffusion.

URLs: https://github.com/QY-H00/attention-interpolation-diffusion.

new FastCAR: Fast Classification And Regression Multi-Task Learning via Task Consolidation for Modelling a Continuous Property Variable of Object Classes

Authors: Anoop Kini, Andreas Jansche, Timo Bernthaler, Gerhard Schneider

Abstract: FastCAR is a novel task consolidation approach in Multi-Task Learning (MTL) for a classification and a regression task, despite task heterogeneity with only subtle correlation. It addresses object classification and continuous property variable regression, a crucial use case in science and engineering. FastCAR involves a labeling transformation approach that can be used with a single-task regression network architecture. FastCAR outperforms traditional MTL model families, parametrized in the landscape of architecture and loss weighting schemes, when learning of both tasks are collectively considered (classification accuracy of 99.54%, regression mean absolute percentage error of 2.3%). The experiments performed used an Advanced Steel Property dataset contributed by us. The dataset comprises 4536 images of 224x224 pixels, annotated with object classes and hardness properties that take continuous values. With the labeling transformation and single-task regression network architecture, FastCAR achieves reduced latency and time efficiency.

new Towards Explaining Hypercomplex Neural Networks

Authors: Eleonora Lopez, Eleonora Grassucci, Debora Capriotti, Danilo Comminiello

Abstract: Hypercomplex neural networks are gaining increasing interest in the deep learning community. The attention directed towards hypercomplex models originates from several aspects, spanning from purely theoretical and mathematical characteristics to the practical advantage of lightweight models over conventional networks, and their unique properties to capture both global and local relations. In particular, a branch of these architectures, parameterized hypercomplex neural networks (PHNNs), has also gained popularity due to their versatility across a multitude of application domains. Nonetheless, only few attempts have been made to explain or interpret their intricacies. In this paper, we propose inherently interpretable PHNNs and quaternion-like networks, thus without the need for any post-hoc method. To achieve this, we define a type of cosine-similarity transform within the parameterized hypercomplex domain. This PHB-cos transform induces weight alignment with relevant input features and allows to reduce the model into a single linear transform, rendering it directly interpretable. In this work, we start to draw insights into how this unique branch of neural models operates. We observe that hypercomplex networks exhibit a tendency to concentrate on the shape around the main object of interest, in addition to the shape of the object itself. We provide a thorough analysis, studying single neurons of different layers and comparing them against how real-valued networks learn. The code of the paper is available at https://github.com/ispamm/HxAI.

URLs: https://github.com/ispamm/HxAI.

new Track Everything Everywhere Fast and Robustly

Authors: Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, Kostas Daniilidis

Abstract: We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. OmniMotion is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by the vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than \textbf{10 times} faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion.

new AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Authors: Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, Zhongang Cai

Abstract: Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves the human body, hand, and expression estimation. Most existing methods have tackled this task in a two-stage manner, first detecting the human body part with an off-the-shelf detection model and inferring the different human body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introducing distractions, and 3) lacking inter-association among different persons and body parts, inevitably causing performance degradation, especially for crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats multi-person whole-body mesh recovery task as a progressive set prediction problem with various sequential detection. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer block. Then, we introduce a joint-related token to probe the human joint in the image and encoder a fine-grained local feature, which collaborates with the global feature to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods by a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.

new OmniVid: A Generative Framework for Universal Video Understanding

Authors: Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang

Abstract: The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, which simplifies the training of powerful foundational language models, such as GPT-3, with extensive training corpora. Inspired by this, we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way, a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture, following a generative framework. Through comprehensive experiments, we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks, providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid.

URLs: https://github.com/wangjk666/OmniVid.

new ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

Authors: Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt

Abstract: Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle generating gestures that are semantically aligned with the utterance. Compared to beat gestures that align naturally to the audio signal, semantically coherent gestures require modeling the complex interactions between the language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs, but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities (e.g. audio vs text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures. To further advance the research on multi-party interactive gestures, the DnD Group Gesture dataset is released, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video at our website.

new Efficient Video Object Segmentation via Modulated Cross-Attention Memory

Authors: Abdelrahman Shaker, Syed Talal Wasim, Martin Danelljan, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Abstract: Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.

URLs: https://github.com/Amshaker/MAVOS.

cross GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot

Authors: Wenxuan Song, Han Zhao, Pengxiang Ding, Can Cui, Shangke Lyu, Yaning Fan, Donglin Wang

Abstract: Multi-task robot learning holds significant importance in tackling diverse and complex scenarios. However, current approaches are hindered by performance issues and difficulties in collecting training datasets. In this paper, we propose GeRM (Generalist Robotic Model). We utilize offline reinforcement learning to optimize data utilization strategies to learn from both demonstrations and sub-optimal data, thus surpassing the limitations of human demonstrations. Thereafter, we employ a transformer-based VLA network to process multi-modal inputs and output actions. By introducing the Mixture-of-Experts structure, GeRM allows faster inference speed with higher whole model capacity, and thus resolves the issue of limited RL parameters, enhancing model performance in multi-task learning while controlling computational costs. Through a series of experiments, we demonstrate that GeRM outperforms other methods across all tasks, while also validating its efficiency in both training and inference processes. Additionally, we uncover its potential to acquire emergent skills. Additionally, we contribute the QUARD-Auto dataset, collected automatically to support our training approach and foster advancements in multi-task quadruped robot learning. This work presents a new paradigm for reducing the cost of collecting robot data and driving progress in the multi-task learning community.

cross Provably Robust Score-Based Diffusion Posterior Sampling for Plug-and-Play Image Reconstruction

Authors: Xingyu Xu, Yuejie Chi

Abstract: In a great number of tasks in science and engineering, the goal is to infer an unknown image from a small number of measurements collected from a known forward model describing certain sensing or imaging modality. Due to resource constraints, this task is often extremely ill-posed, which necessitates the adoption of expressive prior information to regularize the solution space. Score-based diffusion models, due to its impressive empirical success, have emerged as an appealing candidate of an expressive prior in image reconstruction. In order to accommodate diverse tasks at once, it is of great interest to develop efficient, consistent and robust algorithms that incorporate {\em unconditional} score functions of an image prior distribution in conjunction with flexible choices of forward models. This work develops an algorithmic framework for employing score-based diffusion models as an expressive data prior in general nonlinear inverse problems. Motivated by the plug-and-play framework in the imaging community, we introduce a diffusion plug-and-play method (\textsf{DPnP}) that alternatively calls two samplers, a proximal consistency sampler based solely on the likelihood function of the forward model, and a denoising diffusion sampler based solely on the score functions of the image prior. The key insight is that denoising under white Gaussian noise can be solved {\em rigorously} via both stochastic (i.e., DDPM-type) and deterministic (i.e., DDIM-type) samplers using the unconditional score functions. We establish both asymptotic and non-asymptotic performance guarantees of \textsf{DPnP}, and provide numerical experiments to illustrate its promise in solving both linear and nonlinear image reconstruction tasks. To the best of our knowledge, \textsf{DPnP} is the first provably-robust posterior sampling method for nonlinear inverse problems using unconditional diffusion priors.

cross A Study in Dataset Pruning for Image Super-Resolution

Authors: Brian B. Moser, Federico Raue, Andreas Dengel

Abstract: In image Super-Resolution (SR), relying on large datasets for training is a double-edged sword. While offering rich training material, they also demand substantial computational and storage resources. In this work, we analyze dataset pruning as a solution to these challenges. We introduce a novel approach that reduces a dataset to a core-set of training samples, selected based on their loss values as determined by a simple pre-trained SR model. By focusing the training on just 50% of the original dataset, specifically on the samples characterized by the highest loss values, we achieve results comparable to or even surpassing those obtained from training on the entire dataset. Interestingly, our analysis reveals that the top 5% of samples with the highest loss values negatively affect the training process. Excluding these samples and adjusting the selection to favor easier samples further enhances training outcomes. Our work opens new perspectives to the untapped potential of dataset pruning in image SR. It suggests that careful selection of training data based on loss-value metrics can lead to better SR models, challenging the conventional wisdom that more data inevitably leads to better performance.

cross A Comparative Analysis of Visual Odometry in Virtual and Real-World Railways Environments

Authors: Gianluca D'Amico, Mauro Marinoni, Giorgio Buttazzo

Abstract: Perception tasks play a crucial role in the development of automated operations and systems across multiple application fields. In the railway transportation domain, these tasks can improve the safety, reliability, and efficiency of various perations, including train localization, signal recognition, and track discrimination. However, collecting considerable and precisely labeled datasets for testing such novel algorithms poses extreme challenges in the railway environment due to the severe restrictions in accessing the infrastructures and the practical difficulties associated with properly equipping trains with the required sensors, such as cameras and LiDARs. The remarkable innovations of graphic engine tools offer new solutions to craft realistic synthetic datasets. To illustrate the advantages of employing graphic simulation for early-stage testing of perception tasks in the railway domain, this paper presents a comparative analysis of the performance of a SLAM algorithm applied both in a virtual synthetic environment and a real-world scenario. The analysis leverages virtual railway environments created with the latest version of Unreal Engine, facilitating data collection and allowing the examination of challenging scenarios, including low-visibility, dangerous operational modes, and complex environments. The results highlight the feasibility and potentiality of graphic simulation to advance perception tasks in the railway domain.

cross Brain Stroke Segmentation Using Deep Learning Models: A Comparative Study

Authors: Ahmed Soliman, Yousif Yousif, Ahmed Ibrahim, Yalda Zafari-Ghadim, Essam A. Rashed, Mohamed Mabrok

Abstract: Stroke segmentation plays a crucial role in the diagnosis and treatment of stroke patients by providing spatial information about affected brain regions and the extent of damage. Segmenting stroke lesions accurately is a challenging task, given that conventional manual techniques are time consuming and prone to errors. Recently, advanced deep models have been introduced for general medical image segmentation, demonstrating promising results that surpass many state of the art networks when evaluated on specific datasets. With the advent of the vision Transformers, several models have been introduced based on them, while others have aimed to design better modules based on traditional convolutional layers to extract long-range dependencies like Transformers. The question of whether such high-level designs are necessary for all segmentation cases to achieve the best results remains unanswered. In this study, we selected four types of deep models that were recently proposed and evaluated their performance for stroke segmentation: a pure Transformer-based architecture (DAE-Former), two advanced CNN-based models (LKA and DLKA) with attention mechanisms in their design, an advanced hybrid model that incorporates CNNs with Transformers (FCT), and the well- known self-adaptive nnUNet framework with its configuration based on given data. We examined their performance on two publicly available datasets, and found that the nnUNet achieved the best results with the simplest design among all. Revealing the robustness issue of Transformers to such variabilities serves as a potential reason for their weaker performance. Furthermore, nnUNet's success underscores the significant impact of preprocessing and postprocessing techniques in enhancing segmentation results, surpassing the focus solely on architectural designs

cross Decoding the visual attention of pathologists to reveal their level of expertise

Authors: Souradeep Chakraborty, Dana Perez, Paul Friedman, Natallia Sheuka, Constantin Friedman, Oksana Yaskiv, Rajarsi Gupta, Gregory J. Zelinsky, Joel H. Saltz, Dimitris Samaras

Abstract: We present a method for classifying the expertise of a pathologist based on how they allocated their attention during a cancer reading. We engage this decoding task by developing a novel method for predicting the attention of pathologists as they read whole-slide Images (WSIs) of prostate and make cancer grade classifications. Our ground truth measure of a pathologists' attention is the x, y and z (magnification) movement of their viewport as they navigated through WSIs during readings, and to date we have the attention behavior of 43 pathologists reading 123 WSIs. These data revealed that specialists have higher agreement in both their attention and cancer grades compared to general pathologists and residents, suggesting that sufficient information may exist in their attention behavior to classify their expertise level. To attempt this, we trained a transformer-based model to predict the visual attention heatmaps of resident, general, and specialist (GU) pathologists during Gleason grading. Based solely on a pathologist's attention during a reading, our model was able to predict their level of expertise with 75.3%, 56.1%, and 77.2% accuracy, respectively, better than chance and baseline models. Our model therefore enables a pathologist's expertise level to be easily and objectively evaluated, important for pathology training and competency assessment. Tools developed from our model could also be used to help pathology trainees learn how to read WSIs like an expert.

cross Tracing and segmentation of molecular patterns in 3-dimensional cryo-et/em density maps through algorithmic image processing and deep learning-based techniques

Authors: Salim Sazzed

Abstract: Understanding the structures of biological macromolecules is highly important as they are closely associated with cellular functionalities. Comprehending the precise organization actin filaments is crucial because they form the dynamic cytoskeleton, which offers structural support to cells and connects the cell's interior with its surroundings. However, determining the precise organization of actin filaments is challenging due to the poor quality of cryo-electron tomography (cryo-ET) images, which suffer from low signal-to-noise (SNR) ratios and the presence of missing wedge, as well as diverse shape characteristics of actin filaments. To address these formidable challenges, the primary component of this dissertation focuses on developing sophisticated computational techniques for tracing actin filaments. In particular, three novel methodologies have been developed: i) BundleTrac, for tracing bundle-like actin filaments found in Stereocilium, ii) Spaghetti Tracer, for tracing filaments that move individually with loosely cohesive movements, and iii) Struwwel Tracer, for tracing randomly orientated actin filaments in the actin network. The second component of the dissertation introduces a convolutional neural network (CNN) based segmentation model to determine the location of protein secondary structures, such as helices and beta-sheets, in medium-resolution (5-10 Angstrom) 3-dimensional cryo-electron microscopy (cryo-EM) images. This methodology later evolved into a tool named DeepSSETracer. The final component of the dissertation presents a novel algorithm, cylindrical fit measure, to estimate image structure match at helix regions in medium-resolution cryo-EM images. Overall, my dissertation has made significant contributions to addressing critical research challenges in structural biology by introducing various computational methods and tools.

cross Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer

Authors: Jeong-Yoon Kim, Seung-Ho Lee

Abstract: In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using vision transformer (ViT) to attend to the correlation of frequency (y-axis) with time (x-axis) in spectrogram and transferring positional information between ViT through knowledge transfer. The proposed method has the following originality i) We use vertically segmented patches of log-Mel spectrogram to analyze the correlation of frequencies over time. This type of patch allows us to correlate the most relevant frequencies for a particular emotion with the time they were uttered. ii) We propose the use of image coordinate encoding, an absolute positional encoding suitable for ViT. By normalizing the x, y coordinates of the image to -1 to 1 and concatenating them to the image, we can effectively provide valid absolute positional information for ViT. iii) Through feature map matching, the locality and location information of the teacher network is effectively transmitted to the student network. Teacher network is a ViT that contains locality of convolutional stem and absolute position information through image coordinate encoding, and student network is a structure that lacks positional encoding in the basic ViT structure. In feature map matching stage, we train through the mean absolute error (L1 loss) to minimize the difference between the feature maps of the two networks. To validate the proposed method, three emotion datasets (SAVEE, EmoDB, and CREMA-D) consisting of speech were converted into log-Mel spectrograms for comparison experiments. The experimental results show that the proposed method significantly outperforms the state-of-the-art methods in terms of weighted accuracy while requiring significantly fewer floating point operations (FLOPs). Overall, the proposed method offers an promising solution for SER by providing improved efficiency and performance.

cross Labeling subtypes in a Parkinson's Cohort using Multifeatures in MRI - Integrating Grey and White Matter Information

Authors: Tanmayee Samantaray, Jitender Saini, Pramod Kumar Pal, Bithiah Grace Jaganathan, Vijaya V Saradhi, Gupta CN

Abstract: Thresholding of networks has long posed a challenge in brain connectivity analysis. Weighted networks are typically binarized using threshold measures to facilitate network analysis. Previous studies on MRI-based brain networks have predominantly utilized density or sparsity-based thresholding techniques, optimized within specific ranges derived from network metrics such as path length, clustering coefficient, and small-world index. Thus, determination of a single threshold value for facilitating comparative analysis of networks remains elusive. To address this, our study introduces Mutual K-Nearest Neighbor (MKNN)-based thresholding for brain network analysis. Here, nearest neighbor selection is based on the highest correlation between features of brain regions. Construction of brain networks was accomplished by computing Pearson correlations between grey matter volume and white matter volume for each pair of brain regions. Structural MRI data from 180 Parkinsons patients and 70 controls from the NIMHANS, India were analyzed. Subtypes within Parkinsons disease were identified based on grey and white matter volume atrophy using source-based morphometric decomposition. The loading coefficients were correlated with clinical features to discern clinical relationship with the deciphered subtypes. Our data-mining approach revealed: Subtype A (N = 51, intermediate type), Subtype B (N = 57, mild-severe type with mild motor symptoms), and Subtype AB (N = 36, most-severe type with predominance in motor impairment). Subtype-specific weighted matrices were binarized using MKNN-based thresholding for brain network analysis. Permutation tests on network metrics of resulting bipartite graphs demonstrated significant group differences in betweenness centrality and participation coefficient. The identified hubs were specific to each subtype, with some hubs conserved across different subtypes.

cross Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Authors: Dongjin Kim, Sung Jin Um, Sangmin Lee, Jung Uk Kim

Abstract: The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance, they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper, to overcome this limitation, we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal, we propose an iterative object identification (IOI) module, which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects, we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL

URLs: https://github.com/VisualAIKHU/NoPrior_MultiSSL

cross Integrating Mamba Sequence Model and Hierarchical Upsampling Network for Accurate Semantic Segmentation of Multiple Sclerosis Legion

Authors: Kazi Shahriar Sanjid, Md. Tanzim Hossain, Md. Shakib Shahariar Junayed, Dr. Mohammad Monir Uddin

Abstract: Integrating components from convolutional neural networks and state space models in medical image segmentation presents a compelling approach to enhance accuracy and efficiency. We introduce Mamba HUNet, a novel architecture tailored for robust and efficient segmentation tasks. Leveraging strengths from Mamba UNet and the lighter version of Hierarchical Upsampling Network (HUNet), Mamba HUNet combines convolutional neural networks local feature extraction power with state space models long range dependency modeling capabilities. We first converted HUNet into a lighter version, maintaining performance parity and then integrated this lighter HUNet into Mamba HUNet, further enhancing its efficiency. The architecture partitions input grayscale images into patches, transforming them into 1D sequences for processing efficiency akin to Vision Transformers and Mamba models. Through Visual State Space blocks and patch merging layers, hierarchical features are extracted while preserving spatial information. Experimental results on publicly available Magnetic Resonance Imaging scans, notably in Multiple Sclerosis lesion segmentation, demonstrate Mamba HUNet's effectiveness across diverse segmentation tasks. The model's robustness and flexibility underscore its potential in handling complex anatomical structures. These findings establish Mamba HUNet as a promising solution in advancing medical image segmentation, with implications for improving clinical decision making processes.

cross Chain of Compression: A Systematic Approach to Combinationally Compress Convolutional Neural Networks

Authors: Yingtao Shen, Minqing Sun, Jie Zhao, An Zou

Abstract: Convolutional neural networks (CNNs) have achieved significant popularity, but their computational and memory intensity poses challenges for resource-constrained computing systems, particularly with the prerequisite of real-time performance. To release this burden, model compression has become an important research focus. Many approaches like quantization, pruning, early exit, and knowledge distillation have demonstrated the effect of reducing redundancy in neural networks. Upon closer examination, it becomes apparent that each approach capitalizes on its unique features to compress the neural network, and they can also exhibit complementary behavior when combined. To explore the interactions and reap the benefits from the complementary features, we propose the Chain of Compression, which works on the combinational sequence to apply these common techniques to compress the neural network. Validated on the image-based regression and classification networks across different data sets, our proposed Chain of Compression can significantly compress the computation cost by 100-1000 times with ignorable accuracy loss compared with the baseline model.

cross Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model

Authors: Runmin Dong, Shuai Yuan, Bin Luo, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Weijia Li, Juepeng Zheng, Haohuan Fu

Abstract: Reference-based super-resolution (RefSR) has the potential to build bridges across spatial and temporal resolutions of remote sensing images. However, existing RefSR methods are limited by the faithfulness of content reconstruction and the effectiveness of texture transfer in large scaling factors. Conditional diffusion models have opened up new opportunities for generating realistic high-resolution images, but effectively utilizing reference images within these models remains an area for further exploration. Furthermore, content fidelity is difficult to guarantee in areas without relevant reference information. To solve these issues, we propose a change-aware diffusion model named Ref-Diff for RefSR, using the land cover change priors to guide the denoising process explicitly. Specifically, we inject the priors into the denoising model to improve the utilization of reference information in unchanged areas and regulate the reconstruction of semantically relevant content in changed areas. With this powerful guidance, we decouple the semantics-guided denoising and reference texture-guided denoising processes to improve the model performance. Extensive experiments demonstrate the superior effectiveness and robustness of the proposed method compared with state-of-the-art RefSR methods in both quantitative and qualitative evaluations. The code and data are available at https://github.com/dongrunmin/RefDiff.

URLs: https://github.com/dongrunmin/RefDiff.

cross Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies

Authors: Philipp Sadler, Sherzod Hakimov, David Schlangen

Abstract: In collaborative goal-oriented settings, the participants are not only interested in achieving a successful outcome, but do also implicitly negotiate the effort they put into the interaction (by adapting to each other). In this work, we propose a challenging interactive reference game that requires two players to coordinate on vision and language observations. The learning signal in this game is a score (given after playing) that takes into account the achieved goal and the players' assumed efforts during the interaction. We show that a standard Proximal Policy Optimization (PPO) setup achieves a high success rate when bootstrapped with heuristic partner behaviors that implement insights from the analysis of human-human interactions. And we find that a pairing of neural partners indeed reduces the measured joint effort when playing together repeatedly. However, we observe that in comparison to a reasonable heuristic pairing there is still room for improvement -- which invites further research in the direction of cost-sharing in collaborative interactions.

cross DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning

Authors: Huiping Zhuang, Run He, Kai Tong, Ziqian Zeng, Cen Chen, Zhiping Lin

Abstract: Class-incremental learning (CIL) under an exemplar-free constraint has presented a significant challenge. Existing methods adhering to this constraint are prone to catastrophic forgetting, far more so than replay-based techniques that retain access to past samples. In this paper, to solve the exemplar-free CIL problem, we propose a Dual-Stream Analytic Learning (DS-AL) approach. The DS-AL contains a main stream offering an analytical (i.e., closed-form) linear solution, and a compensation stream improving the inherent under-fitting limitation due to adopting linear mapping. The main stream redefines the CIL problem into a Concatenated Recursive Least Squares (C-RLS) task, allowing an equivalence between the CIL and its joint-learning counterpart. The compensation stream is governed by a Dual-Activation Compensation (DAC) module. This module re-activates the embedding with a different activation function from the main stream one, and seeks fitting compensation by projecting the embedding to the null space of the main stream's linear mapping. Empirical results demonstrate that the DS-AL, despite being an exemplar-free technique, delivers performance comparable with or better than that of replay-based methods across various datasets, including CIFAR-100, ImageNet-100 and ImageNet-Full. Additionally, the C-RLS' equivalent property allows the DS-AL to execute CIL in a phase-invariant manner. This is evidenced by a never-before-seen 500-phase CIL ImageNet task, which performs on a level identical to a 5-phase one. Our codes are available at https://github.com/ZHUANGHP/Analytic-continual-learning.

URLs: https://github.com/ZHUANGHP/Analytic-continual-learning.

cross Boosting Adversarial Training via Fisher-Rao Norm-based Regularization

Authors: Xiangyu Yin, Wenjie Ruan

Abstract: Adversarial training is extensively utilized to improve the adversarial robustness of deep neural networks. Yet, mitigating the degradation of standard generalization performance in adversarial-trained models remains an open problem. This paper attempts to resolve this issue through the lens of model complexity. First, We leverage the Fisher-Rao norm, a geometrically invariant metric for model complexity, to establish the non-trivial bounds of the Cross-Entropy Loss-based Rademacher complexity for a ReLU-activated Multi-Layer Perceptron. Then we generalize a complexity-related variable, which is sensitive to the changes in model width and the trade-off factors in adversarial training. Moreover, intensive empirical evidence validates that this variable highly correlates with the generalization gap of Cross-Entropy loss between adversarial-trained and standard-trained models, especially during the initial and final phases of the training process. Building upon this observation, we propose a novel regularization framework, called Logit-Oriented Adversarial Training (LOAT), which can mitigate the trade-off between robustness and accuracy while imposing only a negligible increase in computational overhead. Our extensive experiments demonstrate that the proposed regularization strategy can boost the performance of the prevalent adversarial training algorithms, including PGD-AT, TRADES, TRADES (LSE), MART, and DM-AT, across various network architectures. Our code will be available at https://github.com/TrustAI/LOAT.

URLs: https://github.com/TrustAI/LOAT.

cross A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions

Authors: Shun Inadumi, Seiya Kawano, Akishige Yuguchi, Yasutomo Kawanishi, Koichiro Yoshino

Abstract: Situated conversations, which refer to visual information as visual question answering (VQA), often contain ambiguities caused by reliance on directive information. This problem is exacerbated because some languages, such as Japanese, often omit subjective or objective terms. Such ambiguities in questions are often clarified by the contexts in conversational situations, such as joint attention with a user or user gaze information. In this study, we propose the Gaze-grounded VQA dataset (GazeVQA) that clarifies ambiguous questions using gaze information by focusing on a clarification process complemented by gaze information. We also propose a method that utilizes gaze target estimation results to improve the accuracy of GazeVQA tasks. Our experimental results showed that the proposed method improved the performance in some cases of a VQA system on GazeVQA and identified some typical problems of GazeVQA tasks that need to be improved.

cross Practical Applications of Advanced Cloud Services and Generative AI Systems in Medical Image Analysis

Authors: Jingyu Xu, Binbin Wu, Jiaxin Huang, Yulu Gong, Yifan Zhang, Bo Liu

Abstract: The medical field is one of the important fields in the application of artificial intelligence technology. With the explosive growth and diversification of medical data, as well as the continuous improvement of medical needs and challenges, artificial intelligence technology is playing an increasingly important role in the medical field. Artificial intelligence technologies represented by computer vision, natural language processing, and machine learning have been widely penetrated into diverse scenarios such as medical imaging, health management, medical information, and drug research and development, and have become an important driving force for improving the level and quality of medical services.The article explores the transformative potential of generative AI in medical imaging, emphasizing its ability to generate syntheticACM-2 data, enhance images, aid in anomaly detection, and facilitate image-to-image translation. Despite challenges like model complexity, the applications of generative models in healthcare, including Med-PaLM 2 technology, show promising results. By addressing limitations in dataset size and diversity, these models contribute to more accurate diagnoses and improved patient outcomes. However, ethical considerations and collaboration among stakeholders are essential for responsible implementation. Through experiments leveraging GANs to augment brain tumor MRI datasets, the study demonstrates how generative AI can enhance image quality and diversity, ultimately advancing medical diagnostics and patient care.

cross Grad-CAMO: Learning Interpretable Single-Cell Morphological Profiles from 3D Cell Painting Images

Authors: Vivek Gopalakrishnan, Jingzhe Ma, Zhiyong Xie

Abstract: Despite their black-box nature, deep learning models are extensively used in image-based drug discovery to extract feature vectors from single cells in microscopy images. To better understand how these networks perform representation learning, we employ visual explainability techniques (e.g., Grad-CAM). Our analyses reveal several mechanisms by which supervised models cheat, exploiting biologically irrelevant pixels when extracting morphological features from images, such as noise in the background. This raises doubts regarding the fidelity of learned single-cell representations and their relevance when investigating downstream biological questions. To address this misalignment between researcher expectations and machine behavior, we introduce Grad-CAMO, a novel single-cell interpretability score for supervised feature extractors. Grad-CAMO measures the proportion of a model's attention that is concentrated on the cell of interest versus the background. This metric can be assessed per-cell or averaged across a validation set, offering a tool to audit individual features vectors or guide the improved design of deep learning architectures. Importantly, Grad-CAMO seamlessly integrates into existing workflows, requiring no dataset or model modifications, and is compatible with both 2D and 3D Cell Painting data. Additional results are available at https://github.com/eigenvivek/Grad-CAMO.

URLs: https://github.com/eigenvivek/Grad-CAMO.

cross High-Resolution Image Translation Model Based on Grayscale Redefinition

Authors: Xixian Wu, Dian Chao, Yang Yang

Abstract: Image-to-image translation is a technique that focuses on transferring images from one domain to another while maintaining the essential content representations. In recent years, image-to-image translation has gained significant attention and achieved remarkable advancements due to its diverse applications in computer vision and image processing tasks. In this work, we propose an innovative method for image translation between different domains. For high-resolution image translation tasks, we use a grayscale adjustment method to achieve pixel-level translation. For other tasks, we utilize the Pix2PixHD model with a coarse-to-fine generator, multi-scale discriminator, and improved loss to enhance the image translation performance. On the other hand, to tackle the issue of sparse training data, we adopt model weight initialization from other task to optimize the performance of the current task.

cross Predicting Perceived Gloss: Do Weak Labels Suffice?

Authors: Julia Guerrero-Viu, J. Daniel Subias, Ana Serrano, Katherine R. Storrs, Roland W. Fleming, Belen Masia, Diego Gutierrez

Abstract: Estimating perceptual attributes of materials directly from images is a challenging task due to their complex, not fully-understood interactions with external factors, such as geometry and lighting. Supervised deep learning models have recently been shown to outperform traditional approaches, but rely on large datasets of human-annotated images for accurate perception predictions. Obtaining reliable annotations is a costly endeavor, aggravated by the limited ability of these models to generalise to different aspects of appearance. In this work, we show how a much smaller set of human annotations ("strong labels") can be effectively augmented with automatically derived "weak labels" in the context of learning a low-dimensional image-computable gloss metric. We evaluate three alternative weak labels for predicting human gloss perception from limited annotated data. Incorporating weak labels enhances our gloss prediction beyond the current state of the art. Moreover, it enables a substantial reduction in human annotation costs without sacrificing accuracy, whether working with rendered images or real photographs.

cross Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation

Authors: Hao Tang, Lianglun Cheng, Guoheng Huang, Zhengguang Tan, Junhao Lu, Kaihong Wu

Abstract: Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its variants, have demonstrated notable performance in the field of vision. However, their feature extraction methods may not be sufficiently effective and retain some redundant structures, leaving room for parameter reduction. Motivated by previous spatial and channel attention methods, we propose Triplet Mamba-UNet. The method leverages residual VSS Blocks to extract intensive contextual features, while Triplet SSM is employed to fuse features across spatial and channel dimensions. We conducted experiments on ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, and Kvasir-Instrument datasets, demonstrating the superior segmentation performance of our proposed TM-UNet. Additionally, compared to the previous VM-UNet, our model achieves a one-third reduction in parameters.

cross Resolution Limit of Single-Photon LiDAR

Authors: Stanley H. Chan, Hashan K. Weerasooriya, Weijian Zhang, Pamela Abshire, Istvan Gyongy, Robert K. Henderson

Abstract: Single-photon Light Detection and Ranging (LiDAR) systems are often equipped with an array of detectors for improved spatial resolution and sensing speed. However, given a fixed amount of flux produced by the laser transmitter across the scene, the per-pixel Signal-to-Noise Ratio (SNR) will decrease when more pixels are packed in a unit space. This presents a fundamental trade-off between the spatial resolution of the sensor array and the SNR received at each pixel. Theoretical characterization of this fundamental limit is explored. By deriving the photon arrival statistics and introducing a series of new approximation techniques, the Mean Squared Error (MSE) of the maximum-likelihood estimator of the time delay is derived. The theoretical predictions align well with simulations and real data.

cross Paired Diffusion: Generation of related, synthetic PET-CT-Segmentation scans using Linked Denoising Diffusion Probabilistic Models

Authors: Rowan Bradbury, Katherine A. Vallis, Bartlomiej W. Papiez

Abstract: The rapid advancement of Artificial Intelligence (AI) in biomedical imaging and radiotherapy is hindered by the limited availability of large imaging data repositories. With recent research and improvements in denoising diffusion probabilistic models (DDPM), high quality synthetic medical scans are now possible. Despite this, there is currently no way of generating multiple related images, such as a corresponding ground truth which can be used to train models, so synthetic scans are often manually annotated before use. This research introduces a novel architecture that is able to generate multiple, related PET-CT-tumour mask pairs using paired networks and conditional encoders. Our approach includes innovative, time step-controlled mechanisms and a `noise-seeding' strategy to improve DDPM sampling consistency. While our model requires a modified perceptual loss function to ensure accurate feature alignment we show generation of clearly aligned synthetic images and improvement in segmentation accuracy with generated images.

cross DataCook: Crafting Anti-Adversarial Examples for Healthcare Data Copyright Protection

Authors: Sihan Shang, Jiancheng Yang, Zhenglong Sun, Pascal Fua

Abstract: In the realm of healthcare, the challenges of copyright protection and unauthorized third-party misuse are increasingly significant. Traditional methods for data copyright protection are applied prior to data distribution, implying that models trained on these data become uncontrollable. This paper introduces a novel approach, named DataCook, designed to safeguard the copyright of healthcare data during the deployment phase. DataCook operates by "cooking" the raw data before distribution, enabling the development of models that perform normally on this processed data. However, during the deployment phase, the original test data must be also "cooked" through DataCook to ensure normal model performance. This process grants copyright holders control over authorization during the deployment phase. The mechanism behind DataCook is by crafting anti-adversarial examples (AntiAdv), which are designed to enhance model confidence, as opposed to standard adversarial examples (Adv) that aim to confuse models. Similar to Adv, AntiAdv introduces imperceptible perturbations, ensuring that the data processed by DataCook remains easily understandable. We conducted extensive experiments on MedMNIST datasets, encompassing both 2D/3D data and the high-resolution variants. The outcomes indicate that DataCook effectively meets its objectives, preventing models trained on AntiAdv from analyzing unauthorized data effectively, without compromising the validity and accuracy of the data in legitimate scenarios. Code and data are available at https://github.com/MedMNIST/DataCook.

URLs: https://github.com/MedMNIST/DataCook.

cross CT Synthesis with Conditional Diffusion Models for Abdominal Lymph Node Segmentation

Authors: Yongrui Yu, Hanyu Chen, Zitian Zhang, Qiong Xiao, Wenhui Lei, Linrui Dai, Yu Fu, Hui Tan, Guan Wang, Peng Gao, Xiaofan Zhang

Abstract: Despite the significant success achieved by deep learning methods in medical image segmentation, researchers still struggle in the computer-aided diagnosis of abdominal lymph nodes due to the complex abdominal environment, small and indistinguishable lesions, and limited annotated data. To address these problems, we present a pipeline that integrates the conditional diffusion model for lymph node generation and the nnU-Net model for lymph node segmentation to improve the segmentation performance of abdominal lymph nodes through synthesizing a diversity of realistic abdominal lymph node data. We propose LN-DDPM, a conditional denoising diffusion probabilistic model (DDPM) for lymph node (LN) generation. LN-DDPM utilizes lymph node masks and anatomical structure masks as model conditions. These conditions work in two conditioning mechanisms: global structure conditioning and local detail conditioning, to distinguish between lymph nodes and their surroundings and better capture lymph node characteristics. The obtained paired abdominal lymph node images and masks are used for the downstream segmentation task. Experimental results on the abdominal lymph node datasets demonstrate that LN-DDPM outperforms other generative methods in the abdominal lymph node image synthesis and better assists the downstream abdominal lymph node segmentation task.

cross Evaluating the Efficacy of Prompt-Engineered Large Multimodal Models Versus Fine-Tuned Vision Transformers in Image-Based Security Applications

Authors: Fouad Trad, Ali Chehab

Abstract: The success of Large Language Models (LLMs) has led to a parallel rise in the development of Large Multimodal Models (LMMs), such as Gemini-pro, which have begun to transform a variety of applications. These sophisticated multimodal models are designed to interpret and analyze complex data, integrating both textual and visual information on a scale previously unattainable, opening new avenues for a range of applications. This paper investigates the applicability and effectiveness of prompt-engineered Gemini-pro LMMs versus fine-tuned Vision Transformer (ViT) models in addressing critical security challenges. We focus on two distinct tasks: a visually evident task of detecting simple triggers, such as small squares in images, indicative of potential backdoors, and a non-visually evident task of malware classification through visual representations. Our results highlight a significant divergence in performance, with Gemini-pro falling short in accuracy and reliability when compared to fine-tuned ViT models. The ViT models, on the other hand, demonstrate exceptional accuracy, achieving near-perfect performance on both tasks. This study not only showcases the strengths and limitations of prompt-engineered LMMs in cybersecurity applications but also emphasizes the unmatched efficacy of fine-tuned ViT models for precise and dependable tasks.

cross Annotated Biomedical Video Generation using Denoising Diffusion Probabilistic Models and Flow Fields

Authors: R\"uveyda Yilmaz, Dennis Eschweiler, Johannes Stegmaier

Abstract: The segmentation and tracking of living cells play a vital role within the biomedical domain, particularly in cancer research, drug development, and developmental biology. These are usually tedious and time-consuming tasks that are traditionally done by biomedical experts. Recently, to automatize these processes, deep learning based segmentation and tracking methods have been proposed. These methods require large-scale datasets and their full potential is constrained by the scarcity of annotated data in the biomedical imaging domain. To address this limitation, we propose Biomedical Video Diffusion Model (BVDM), capable of generating realistic-looking synthetic microscopy videos. Trained only on a single real video, BVDM can generate videos of arbitrary length with pixel-level annotations that can be used for training data-hungry models. It is composed of a denoising diffusion probabilistic model (DDPM) generating high-fidelity synthetic cell microscopy images and a flow prediction model (FPM) predicting the non-rigid transformation between consecutive video frames. During inference, initially, the DDPM imposes realistic cell textures on synthetic cell masks which are generated based on real data statistics. The flow prediction model predicts the flow field between consecutive masks and applies that to the DDPM output from the previous time frame to create the next one while keeping temporal consistency. BVDM outperforms state-of-the-art synthetic live cell microscopy video generation models. Furthermore, we demonstrate that a sufficiently large synthetic dataset enhances the performance of cell segmentation and tracking models compared to using a limited amount of available real data.

cross Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Authors: Abdelrhman Werby, Chenguang Huang, Martin B\"uchner, Abhinav Valada, Wolfram Burgard

Abstract: Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-storage environments. We provide code and trial video data at http://hovsg.github.io/.

URLs: http://hovsg.github.io/.

cross Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models

Authors: Mohammad Shahab Sepehri, Zalan Fabian, Mahdi Soltanolkotabi

Abstract: The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters are inherently local and therefore struggle at modeling long-range dependencies in images. On the other hand, attention excels at capturing global interactions between arbitrary image regions, however at a quadratic cost in image dimension. In this work, we propose Serpent, an architecture that leverages recent advances in state space models (SSMs) in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. Our preliminary results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to $150$ fold reduction in FLOPS) and a factor of up to $5\times$ less GPU memory while maintaining a compact model size.

cross Scalable Non-Cartesian Magnetic Resonance Imaging with R2D2

Authors: Chen Yiwei, Tang Chao, Aghabiglou Amir, Chu Chung San, Wiaux Yves

Abstract: We propose a new approach for non-Cartesian magnetic resonance image reconstruction. While unrolled architectures provide robustness via data-consistency layers, embedding measurement operators in Deep Neural Network (DNN) can become impractical at large scale. Alternative Plug-and-Play (PnP) approaches, where the denoising DNNs are blind to the measurement setting, are not affected by this limitation and have also proven effective, but their highly iterative nature also affects scalability. To address this scalability challenge, we leverage the "Residual-to-Residual DNN series for high-Dynamic range imaging (R2D2)" approach recently introduced in astronomical imaging. R2D2's reconstruction is formed as a series of residual images, iteratively estimated as outputs of DNNs taking the previous iteration's image estimate and associated data residual as inputs. The method can be interpreted as a learned version of the Matching Pursuit algorithm. We demonstrate R2D2 in simulation, considering radial k-space sampling acquisition sequences. Our preliminary results suggest that R2D2 achieves: (i) suboptimal performance compared to its unrolled incarnation R2D2-Net, which is however non-scalable due to the necessary embedding of NUFFT-based data-consistency layers; (ii) superior reconstruction quality to a scalable version of R2D2-Net embedding an FFT-based approximation for data consistency; (iii) superior reconstruction quality to PnP, while only requiring few iterations.

cross CMP: Cooperative Motion Prediction with Multi-Agent Communication

Authors: Zhuoyuan Wu, Yuping Wang, Hengbo Ma, Zhaowei Li, Hang Qiu, Jiachen Li

Abstract: The confluence of the advancement of Autonomous Vehicles (AVs) and the maturity of Vehicle-to-Everything (V2X) communication has enabled the capability of cooperative connected and automated vehicles (CAVs). Building on top of cooperative perception, this paper explores the feasibility and effectiveness of cooperative motion prediction. Our method, CMP, takes LiDAR signals as input to enhance tracking and prediction capabilities. Unlike previous work that focuses separately on either cooperative perception or motion prediction, our framework, to the best of our knowledge, is the first to address the unified problem where CAVs share information in both perception and prediction modules. Incorporated into our design is the unique capability to tolerate realistic V2X bandwidth limitations and transmission delays, while dealing with bulky perception representations. We also propose a prediction aggregation module, which unifies the predictions obtained by different CAVs and generates the final prediction. Through extensive experiments and ablation studies, we demonstrate the effectiveness of our method in cooperative perception, tracking, and motion prediction tasks. In particular, CMP reduces the average prediction error by 17.2\% with fewer missing detections compared with the no cooperation setting. Our work marks a significant step forward in the cooperative capabilities of CAVs, showcasing enhanced performance in complex scenarios.

cross SLEDGE: Synthesizing Simulation Environments for Driving Agents with Generative Models

Authors: Kashyap Chitta, Daniel Dauner, Andreas Geiger

Abstract: SLEDGE is the first generative simulator for vehicle motion planning trained on real-world driving logs. Its core component is a learned model that is able to generate agent bounding boxes and lane graphs. The model's outputs serve as an initial state for traffic simulation. The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder (RVAE). It encodes agents and the lane graph into distinct channels in a rasterized latent map. This facilitates both lane-conditioned agent generation and combined generation of lanes and agents with a Diffusion Transformer. Using generated entities in SLEDGE enables greater control over the simulation, e.g. upsampling turns or increasing traffic density. Further, SLEDGE can support 500m long routes, a capability not found in existing data-driven simulators like nuPlan. It presents new challenges for planning algorithms, evidenced by failure rates of over 40% for PDM, the winner of the 2023 nuPlan challenge, when tested on hard routes and dense traffic generated by our model. Compared to nuPlan, SLEDGE requires 500$\times$ less storage to set up (<4GB), making it a more accessible option and helping with democratizing future research in this field.

replace Confidence-Triggered Detection: Accelerating Real-time Tracking-by-detection Systems

Authors: Zhicheng Ding, Zhixin Lai, Siyang Li, Edward Wong

Abstract: Real-time object tracking necessitates a delicate balance between speed and accuracy, a challenge exacerbated by the computational demands of deep learning methods. In this paper, we propose Confidence-Triggered Detection (CTD), an innovative approach that strategically bypasses object detection for frames closely resembling intermediate states, leveraging tracker confidence scores. CTD not only enhances tracking speed but also preserves accuracy, surpassing existing tracking algorithms. Through extensive evaluation across various tracker confidence thresholds, we identify an optimal trade-off between tracking speed and accuracy, providing crucial insights for parameter fine-tuning and enhancing CTD's practicality in real-world scenarios. Our experiments across diverse detection models underscore the robustness and versatility of the CTD framework, demonstrating its potential to enable real-time tracking in resource-constrained environments.

replace DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction

Authors: Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, Ulrich Neumann

Abstract: Reconstructing 3D shapes from single-view images has been a long-standing research problem. In this paper, we present DISN, a Deep Implicit Surface Network which can generate a high-quality detail-rich 3D mesh from an 2D image by predicting the underlying signed distance fields. In addition to utilizing global image features, DISN predicts the projected location for each 3D point on the 2D image, and extracts local features from the image feature maps. Combining global and local features significantly improves the accuracy of the signed distance field prediction, especially for the detail-rich areas. To the best of our knowledge, DISN is the first method that constantly captures details such as holes and thin structures present in 3D shapes from single-view images. DISN achieves the state-of-the-art single-view reconstruction performance on a variety of shape categories reconstructed from both synthetic and real images. Code is available at https://github.com/xharlie/DISN The supplementary can be found at https://xharlie.github.io/images/neurips_2019_supp.pdf

URLs: https://github.com/xharlie/DISN, https://xharlie.github.io/images/neurips_2019_supp.pdf

replace Vision Transformers with Hierarchical Attention

Authors: Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, Ajad Chhatkuli, Luc Van Gool

Abstract: This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Then, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of the merged tokens. At last, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Therefore, HAT-Net provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.

URLs: https://github.com/yun-liu/HAT-Net.

replace Semi-Supervised Crowd Counting from Unlabeled Data

Authors: Haoran Duan, Fan Wan, Rui Sun, Zeyu Wang, Varun Ojha, Yu Guan, Hubert P. H. Shum, Bingzhang Hu, Yang Long

Abstract: Automatic Crowd behavior analysis can be applied to effectively help the daily transportation statistics and planning, which helps the smart city construction. As one of the most important keys, crowd counting has drawn increasing attention. Recent works achieved promising performance but relied on the supervised paradigm with expensive crowd annotations. To alleviate the annotation cost in real-world transportation scenarios, in this work we proposed a semi-supervised learning framework $S^{4}\textit{Crowd}$, which can leverage both unlabeled/labeled data for robust crowd counting. In the unsupervised pathway, two \textit{self-supervised losses} were proposed to simulate the crowd variations such as scale, illumination, based on which supervised information pseudo labels were generated and gradually refined. We also proposed a crowd-driven recurrent unit \textit{Gated-Crowd-Recurrent-Unit (GCRU)}, which can preserve discriminant crowd information by extracting second-order statistics, yielding pseudo labels with improved quality. A joint loss including both unsupervised/supervised information was proposed, and a dynamic weighting strategy was employed to balance the importance of the unsupervised loss and supervised loss at different training stages. We conducted extensive experiments on four popular crowd counting datasets in semi-supervised settings. Experimental results supported the effectiveness of each proposed component in our $S^{4}$Crowd framework. Our method achieved competitive performance in semi-supervised learning approaches on these crowd counting datasets.

replace P2ANet: A Dataset and Benchmark for Dense Action Detection from Table Tennis Match Broadcasting Videos

Authors: Jiang Bian, Xuhong Li, Tao Wang, Qingzhong Wang, Jun Huang, Chen Liu, Jun Zhao, Feixiang Lu, Dejing Dou, Haoyi Xiong

Abstract: While deep learning has been widely used for video analytics, such as video classification and action detection, dense action detection with fast-moving subjects from sports videos is still challenging. In this work, we release yet another sports video benchmark \TheName{} for \emph{\underline{P}}ing \emph{\underline{P}}ong-\emph{\underline{A}}ction detection, which consists of 2,721 video clips collected from the broadcasting videos of professional table tennis matches in World Table Tennis Championships and Olympiads. We work with a crew of table tennis professionals and referees on a specially designed annotation toolbox to obtain fine-grained action labels (in 14 classes) for every ping-pong action that appeared in the dataset, and formulate two sets of action detection problems -- \emph{action localization} and \emph{action recognition}. We evaluate a number of commonly-seen action recognition (e.g., TSM, TSN, Video SwinTransformer, and Slowfast) and action localization models (e.g., BSN, BSN++, BMN, TCANet), using \TheName{} for both problems, under various settings. These models can only achieve 48\% area under the AR-AN curve for localization and 82\% top-one accuracy for recognition since the ping-pong actions are dense with fast-moving subjects but broadcasting videos are with only 25 FPS. The results confirm that \TheName{} is still a challenging task and can be used as a special benchmark for dense action detection from videos.

replace Dual Prototype Attention for Unsupervised Video Object Segmentation

Authors: Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Dogyoon Lee, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee

Abstract: Unsupervised video object segmentation (VOS) aims to detect and segment the most salient object in videos. The primary techniques used in unsupervised VOS are 1) the collaboration of appearance and motion information; and 2) temporal fusion between different frames. This paper proposes two novel prototype-based attention mechanisms, inter-modality attention (IMA) and inter-frame attention (IFA), to incorporate these techniques via dense propagation across different modalities and frames. IMA densely integrates context information from different modalities based on a mutual refinement. IFA injects global context of a video to the query frame, enabling a full utilization of useful properties from multiple frames. Experimental results on public benchmark datasets demonstrate that our proposed approach outperforms all existing methods by a substantial margin. The proposed two components are also thoroughly validated via ablative study.

replace Diffusion Models Generate Images Like Painters: an Analytical Theory of Outline First, Details Later

Authors: Binxu Wang, John J. Vastola

Abstract: How do diffusion generative models convert pure noise into meaningful images? In a variety of pretrained diffusion models (including conditional latent space models like Stable Diffusion), we observe that the reverse diffusion process that underlies image generation has the following properties: (i) individual trajectories tend to be low-dimensional and resemble 2D `rotations'; (ii) high-variance scene features like layout tend to emerge earlier, while low-variance details tend to emerge later; and (iii) early perturbations tend to have a greater impact on image content than later perturbations. To understand these phenomena, we derive and study a closed-form solution to the probability flow ODE for a Gaussian distribution, which shows that the reverse diffusion state rotates towards a gradually-specified target on the image manifold. It also shows that generation involves first committing to an outline, and then to finer and finer details. We find that this solution accurately describes the initial phase of image generation for pretrained models, and can in principle be used to make image generation more efficient by skipping reverse diffusion steps. Finally, we use our solution to characterize the image manifold in Stable Diffusion. Our viewpoint reveals an unexpected similarity between generation by GANs and diffusion and provides a conceptual link between diffusion and image retrieval.

replace Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning

Authors: Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang

Abstract: Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Relying on learning the joint representation of seen compositions, these methods ignore the explicit modeling of the state and object, thus limiting the exploitation of pre-trained knowledge and generalization to unseen compositions. With a particular focus on the universality of the solution, in this work, we propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. The presented Troika is our implementation that aligns the branch-specific prompt representations with decomposed visual features. To calibrate the bias between semantically similar multi-modal representations, we further devise a Cross-Modal Traction module into Troika that shifts the prompt representation towards the current visual content. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings. The code will be available at https://github.com/bighuang624/Troika.

URLs: https://github.com/bighuang624/Troika.

replace SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation

Authors: Nikos Athanasiou, Mathis Petrovich, Michael J. Black, G\"ul Varol

Abstract: Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action ?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). In our experiments, that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.

URLs: https://sinc.is.tue.mpg.de/.

replace Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Authors: Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji

Abstract: Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.

replace InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields

Authors: Dongqing Wang, Tong Zhang, Alaa Abboud, Sabine S\"usstrunk

Abstract: We propose InNeRF360, an automatic system that accurately removes text-specified objects from 360-degree Neural Radiance Fields (NeRF). The challenge is to effectively remove objects while inpainting perceptually consistent content for the missing regions, which is particularly demanding for existing NeRF models due to their implicit volumetric representation. Moreover, unbounded scenes are more prone to floater artifacts in the inpainted region than frontal-facing scenes, as the change of object appearance and background across views is more sensitive to inaccurate segmentations and inconsistent inpainting. With a trained NeRF and a text description, our method efficiently removes specified objects and inpaints visually consistent content without artifacts. We apply depth-space warping to enforce consistency across multiview text-encoded segmentations, and then refine the inpainted NeRF model using perceptual priors and 3D diffusion-based geometric priors to ensure visual plausibility. Through extensive experiments in segmentation and inpainting on 360-degree and frontal-facing NeRFs, we show that our approach is effective and enhances NeRF's editability. Project page: https://ivrl.github.io/InNeRF360.

URLs: https://ivrl.github.io/InNeRF360.

replace Towards Source-free Domain Adaptive Semantic Segmentation via Importance-aware and Prototype-contrast Learning

Authors: Yihong Cao, Hui Zhang, Xiao Lu, Zheng Xiao, Kailun Yang, Yaonan Wang

Abstract: Domain adaptive semantic segmentation enables robust pixel-wise understanding in real-world driving scenes. Source-free domain adaptation, as a more practical technique, addresses the concerns of data privacy and storage limitations in typical unsupervised domain adaptation methods, making it especially relevant in the context of intelligent vehicles. It utilizes a well-trained source model and unlabeled target data to achieve adaptation in the target domain. However, in the absence of source data and target labels, current solutions cannot sufficiently reduce the impact of domain shift and fully leverage the information from the target data. In this paper, we propose an end-to-end source-free domain adaptation semantic segmentation method via Importance-Aware and Prototype-Contrast (IAPC) learning. The proposed IAPC framework effectively extracts domain-invariant knowledge from the well-trained source model and learns domain-specific knowledge from the unlabeled target domain. Specifically, considering the problem of domain shift in the prediction of the target domain by the source model, we put forward an importance-aware mechanism for the biased target prediction probability distribution to extract domain-invariant knowledge from the source model. We further introduce a prototype-contrast strategy, which includes a prototype-symmetric cross-entropy loss and a prototype-enhanced cross-entropy loss, to learn target intra-domain knowledge without relying on labels. A comprehensive variety of experiments on two domain adaptive semantic segmentation benchmarks demonstrates that the proposed end-to-end IAPC solution outperforms existing state-of-the-art methods. The source code is publicly available at https://github.com/yihong-97/Source-free-IAPC.

URLs: https://github.com/yihong-97/Source-free-IAPC.

replace Fault Localization for Buggy Deep Learning Framework Conversions in Image Recognition

Authors: Nikolaos Louloudakis, Perry Gibson, Jos\'e Cano, Ajitha Rajan

Abstract: When deploying Deep Neural Networks (DNNs), developers often convert models from one deep learning framework to another (e.g., TensorFlow to PyTorch). However, this process is error-prone and can impact target model accuracy. To identify the extent of such impact, we perform and briefly present a differential analysis against three DNNs widely used for image recognition (MobileNetV2, ResNet101, and InceptionV3) converted across four well-known deep learning frameworks (PyTorch, Keras, TensorFlow (TF), and TFLite), which revealed numerous model crashes and output label discrepancies of up to 100%. To mitigate such errors, we present a novel approach towards fault localization and repair of buggy deep learning framework conversions, focusing on pre-trained image recognition models. Our technique consists of four stages of analysis: 1) conversion tools, 2) model parameters, 3) model hyperparameters, and 4) graph representation. In addition, we propose various strategies towards fault repair of the faults detected. We implement our technique on top of the Apache TVM deep learning compiler, and we test it by conducting a preliminary fault localization analysis for the conversion of InceptionV3 from TF to TFLite. Our approach detected a fault in a common DNN converter tool, which introduced precision errors in weights, reducing model accuracy. After our fault localization, we repaired the issue, reducing our conversion error to zero.

replace DeltaNN: Assessing the Impact of Computational Environment Parameters on the Performance of Image Recognition Models

Authors: Nikolaos Louloudakis, Perry Gibson, Jos\'e Cano, Ajitha Rajan

Abstract: Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and TPUs for fast, timely processing. Failure in real-time image recognition tasks can occur due to sub-optimal mapping on hardware accelerators during model deployment, which may lead to timing uncertainty and erroneous behavior. Mapping on hardware accelerators is done using multiple software components like deep learning frameworks, compilers, and device libraries, that we refer to as the computational environment. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment, as the impact of parameters like deep learning frameworks, compiler optimizations, and hardware devices on model performance and correctness is not yet well understood. In this paper we present a differential testing framework, DeltaNN, that allows us to assess the impact of different computational environment parameters on the performance of image recognition models during deployment, post training. DeltaNN generates different implementations of a given image recognition model for variations in environment parameters, namely, deep learning frameworks, compiler optimizations and hardware devices and analyzes differences in model performance as a result. Using DeltaNN, we conduct an empirical study of robustness analysis of three popular image recognition models using the ImageNet dataset. We report the impact in terms of misclassifications and inference time differences across different settings. In total, we observed up to 100% output label differences across deep learning frameworks, and up to 81% unexpected performance degradation in terms of inference time, when applying compiler optimizations.

replace NeuS-PIR: Learning Relightable Neural Surface using Pre-Integrated Rendering

Authors: Shi Mao, Chenming Wu, Zhelun Shen, Yifan Wang, Dayan Wu, Liangjun Zhang

Abstract: This paper presents a method, namely NeuS-PIR, for recovering relightable neural surfaces using pre-integrated rendering from multi-view images or video. Unlike methods based on NeRF and discrete meshes, our method utilizes implicit neural surface representation to reconstruct high-quality geometry, which facilitates the factorization of the radiance field into two components: a spatially varying material field and an all-frequency lighting representation. This factorization, jointly optimized using an adapted differentiable pre-integrated rendering framework with material encoding regularization, in turn addresses the ambiguity of geometry reconstruction and leads to better disentanglement and refinement of each scene property. Additionally, we introduced a method to distil indirect illumination fields from the learned representations, further recovering the complex illumination effect like inter-reflection. Consequently, our method enables advanced applications such as relighting, which can be seamlessly integrated with modern graphics engines. Qualitative and quantitative experiments have shown that NeuS-PIR outperforms existing methods across various tasks on both synthetic and real datasets. Source code is available at https://github.com/Sheldonmao/NeuSPIR

URLs: https://github.com/Sheldonmao/NeuSPIR

replace DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields

Authors: Cheng-You Lu, Peisen Zhou, Angela Xing, Chandradeep Pokhariya, Arnab Dey, Ishaan Shah, Rugved Mavidipalli, Dylan Hu, Andrew Comport, Kefan Chen, Srinath Sridhar

Abstract: Advances in neural fields are enabling high-fidelity capture of the shape and appearance of dynamic 3D scenes. However, their capabilities lag behind those offered by conventional representations such as 2D videos because of algorithmic challenges and the lack of large-scale multi-view real-world datasets. We address the dataset limitation with DiVa-360, a real-world 360 dynamic visual dataset that contains synchronized high-resolution and long-duration multi-view video sequences of table-scale scenes captured using a customized low-cost system with 53 cameras. It contains 21 object-centric sequences categorized by different motion types, 25 intricate hand-object interaction sequences, and 8 long-duration sequences for a total of 17.4 M image frames. In addition, we provide foreground-background segmentation masks, synchronized audio, and text descriptions. We benchmark the state-of-the-art dynamic neural field methods on DiVa-360 and provide insights about existing methods and future challenges on long-duration neural field capture.

replace ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

Authors: Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

Abstract: Though the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited to large-scale data, which is expensive or even inapplicable for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space. Specifically, the modality-specific lens is tuned to project multimodal signals to the shared embedding space, which are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing benefits: (i) Exploiting the pretrained ViT across tasks and domains effectively with efficient data regime; (ii) Emergent downstream capabilities of novel modalities are demonstrated due to the modality alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art, showing 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future.

replace $\texttt{NePhi}$: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration

Authors: Lin Tian, Hastings Greer, Ra\'ul San Jos\'e Est\'epar, Soumyadip Sengupta, Marc Niethammer

Abstract: This work proposes NePhi, a generalizable neural deformation model which results in approximately diffeomorphic transformations. In contrast to the predominant voxel-based transformation fields used in learning-based registration approaches, NePhi represents deformations functionally, leading to great flexibility within the design space of memory consumption during training and inference, inference time, registration accuracy, as well as transformation regularity. Specifically, NePhi 1) requires less memory compared to voxel-based learning approaches, 2) improves inference speed by predicting latent codes, compared to current existing neural deformation based registration approaches that \emph{only} rely on optimization, 3) improves accuracy via instance optimization, and 4) shows excellent deformation regularity which is highly desirable for medical image registration. We demonstrate the performance of NePhi on a 2D synthetic dataset as well as for real 3D lung registration. Our results show that NePhi can match the accuracy of voxel-based representations in a single-resolution registration setting. For multi-resolution registration, our method matches the accuracy of current SOTA learning-based registration approaches with instance optimization while reducing memory requirements by a factor of five.

replace Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling

Authors: Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski

Abstract: Traditionally, training neural networks to perform semantic segmentation required expensive human-made annotations. But more recently, advances in the field of unsupervised learning have made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this, semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation by spatially correlate the feature maps with the depth maps to induce knowledge about the structure of the scene and (2) implementing farthest-point sampling to more effectively select relevant features by utilizing 3D sampling techniques on depth information of the scene. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.

replace SimLVSeg: Simplifying Left Ventricular Segmentation in 2D+Time Echocardiograms with Self- and Weakly-Supervised Learning

Authors: Fadillah Maani, Asim Ukaye, Nada Saadi, Numan Saeed, Mohammad Yaqub

Abstract: Echocardiography has become an indispensable clinical imaging modality for general heart health assessment. From calculating biomarkers such as ejection fraction to the probability of a patient's heart failure, accurate segmentation of the heart structures allows doctors to assess the heart's condition and devise treatments with greater precision and accuracy. However, achieving accurate and reliable left ventricle segmentation is time-consuming and challenging due to different reasons. Hence, clinicians often rely on segmenting the left ventricular (LV) in two specific echocardiogram frames to make a diagnosis. This limited coverage in manual LV segmentation poses a challenge for developing automatic LV segmentation with high temporal consistency, as the resulting dataset is typically annotated sparsely. In response to this challenge, this work introduces SimLVSeg, a novel paradigm that enables video-based networks for consistent LV segmentation from sparsely annotated echocardiogram videos. SimLVSeg consists of self-supervised pre-training with temporal masking, followed by weakly supervised learning tailored for LV segmentation from sparse annotations. We demonstrate how SimLVSeg outperforms the state-of-the-art solutions by achieving a 93.32% (95%CI 93.21-93.43%) dice score on the largest 2D+time echocardiography dataset (EchoNet-Dynamic) while being more efficient. SimLVSeg is compatible with two types of video segmentation networks: 2D super image and 3D segmentation. To show the effectiveness of our approach, we provide extensive ablation studies, including pre-training settings and various deep learning backbones. We further conduct an out-of-distribution test to showcase SimLVSeg's generalizability on unseen distribution (CAMUS dataset). The code is publicly available at https://github.com/fadamsyah/SimLVSeg.

URLs: https://github.com/fadamsyah/SimLVSeg.

replace TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling

Authors: Jun Li, Zedong Zhang, Jian Yang

Abstract: Generating creative combinatorial objects from two seemingly unrelated object texts is a challenging task in text-to-image synthesis, often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method, called \textbf{balance swap-sampling}. First, we propose a swapping mechanism that generates a novel combinatorial object image set by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing CLIP distances between the new images and their original generations, increasing the likelihood of accepting the high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Extensive experiments demonstrate that our approach outperforms recent SOTA T2I methods. Surprisingly, our results even rival those of human artists, such as frog-broccoli.

replace SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction

Authors: Conghao Wong, Beihao Xia, Ziqian Zou, Yulong Wang, Xinge You

Abstract: Analyzing and forecasting trajectories of agents like pedestrians and cars in complex scenes has become more and more significant in many intelligent systems and applications. The diversity and uncertainty in socially interactive behaviors among a rich variety of agents make this task more challenging than other deterministic computer vision tasks. Researchers have made a lot of efforts to quantify the effects of these interactions on future trajectories through different mathematical models and network structures, but this problem has not been well solved. Inspired by marine animals that localize the positions of their companions underwater through echoes, we build a new anglebased trainable social interaction representation, named SocialCircle, for continuously reflecting the context of social interactions at different angular orientations relative to the target agent. We validate the effect of the proposed SocialCircle by training it along with several newly released trajectory prediction models, and experiments show that the SocialCircle not only quantitatively improves the prediction performance, but also qualitatively helps better simulate social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuitions.

replace SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching

Authors: Xinghui Li, Jingyi Lu, Kai Han, Victor Prisacariu

Abstract: In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within the Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. Particularly, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset.

replace From Pretext to Purpose: Batch-Adaptive Self-Supervised Learning

Authors: Jiansong Zhang, Linlin Shen, Peizhong Liu

Abstract: In recent years, self-supervised contrastive learning has emerged as a distinguished paradigm in the artificial intelligence landscape. It facilitates unsupervised feature learning through contrastive delineations at the instance level. However, crafting an effective self-supervised paradigm remains a pivotal challenge within this field. This paper delves into two crucial factors impacting self-supervised contrastive learning-bach size and pretext tasks, and from a data processing standpoint, proposes an adaptive technique of batch fusion. The proposed method, via dimensionality reduction and reconstruction of batch data, enables formerly isolated individual data to partake in intra-batch communication through the Embedding Layer. Moreover, it adaptively amplifies the self-supervised feature encoding capability as the training progresses. We conducted a linear classification test of this method based on the classic contrastive learning framework on ImageNet-1k. The empirical findings illustrate that our approach achieves state-of-the-art performance under equitable comparisons. Benefiting from its "plug-and-play" characteristics, we further explored other contrastive learning methods. On the ImageNet-100, compared to the original performance, the top1 has seen a maximum increase of 1.25%. We suggest that the proposed method may contribute to the advancement of data-driven self-supervised learning research, bringing a fresh perspective to this community.

replace LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

Authors: Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou

Abstract: Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in precise control of image compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image Synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions. Specifically, we introduce a Localized Attention Constraint (LAC), leveraging semantic affinity between pixels in self-attention maps to create precise representations of desired objects and effectively ensure the accurate placement of objects in designated regions. We further propose a Padding Token Constraint (PTC) to leverage the semantic information embedded in previously neglected padding tokens, improving the consistency between object appearance and layout instructions. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods. Extensive experiments showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.

replace SegVol: Universal and Interactive Volumetric Medical Image Segmentation

Authors: Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao

Abstract: Precise image segmentation provides clinical study with instructive information. Despite the remarkable progress achieved in medical image segmentation, there is still an absence of 3D foundation segmentation model that can segment a wide range of anatomical categories with easy user interaction. In this paper, we propose a 3D foundation segmentation model, named SegVol, supporting universal and interactive volumetric medical image segmentation. By scaling up training data to 90K unlabeled Computed Tomography (CT) volumes and 6K labeled CT volumes, this foundation model supports the segmentation of over 200 anatomical categories using semantic and spatial prompts. Extensive experiments on 10 internal validation tasks and 18 external validation tasks verify that SegVol outperforms the state of the art by a large margin. Through its capacity to provide precise volumetric segmentation across various anatomical categories, SegVol has the potential to accelerate advancements in medical imaging diagnosis and facilitate treatment optimization. The model and code are publicly available at: https://github.com/BAAI-DCAI/SegVol.

URLs: https://github.com/BAAI-DCAI/SegVol.

replace Efficient Pre-training for Localized Instruction Generation of Videos

Authors: Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

Abstract: Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance. We complement our Sieve-\&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and instruction generation for procedural videos. When this model is pre-trained on our curated dataset, it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and Tasty, while using a fraction of the computational resources.

replace ViT-Lens: Towards Omni-modal Representations

Authors: Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

Abstract: Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) Unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) Enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, tactile and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.

URLs: https://github.com/TencentARC/ViT-Lens.

replace ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention

Authors: Jiawei Wang, Changjian Li

Abstract: Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of pre-defined part labels to individual strokes. This paper presents ContextSeg - a simple yet highly effective approach to tackling this problem with two stages. In the first stage, to better encode the shape and positional information of strokes, we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism. By group-based labeling, our method can fully leverage the context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets and has been extensively evaluated demonstrating its superior performance. Additionally, we offer insights into solving part imbalance in training data and the preliminary experiment on cross-category training, which can inspire future research in this field.

replace CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs

Authors: Haocheng Yuan, Jing Xu, Hao Pan, Adrien Bousseau, Niloy J. Mitra, Changjian Li

Abstract: CAD programs are a popular way to compactly encode shapes as a sequence of operations that are easy to parametrically modify. However, without sufficient semantic comments and structure, such programs can be challenging to understand, let alone modify. We introduce the problem of semantic commenting CAD programs, wherein the goal is to segment the input program into code blocks corresponding to semantically meaningful shape parts and assign a semantic label to each block. We solve the problem by combining program parsing with visual-semantic analysis afforded by recent advances in foundational language and vision models. Specifically, by executing the input programs, we create shapes, which we use to generate conditional photorealistic images to make use of semantic annotators for such images. We then distill the information across the images and link back to the original programs to semantically comment on them. Additionally, we collected and annotated a benchmark dataset, CADTalk, consisting of 5,288 machine-made programs and 45 human-made programs with ground truth semantic comments. We extensively evaluated our approach, compared it to a GPT-based baseline, and an open-set shape segmentation baseline, and reported an 83.24% accuracy on the new CADTalk dataset. Code and data: https://enigma-li.github.io/CADTalk/.

URLs: https://enigma-li.github.io/CADTalk/.

replace LLaFS: When Large Language Models Meet Few-Shot Segmentation

Authors: Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, Jun Liu

Abstract: This paper proposes LLaFS, the first attempt to leverage large language models (LLMs) in few-shot segmentation. In contrast to the conventional few-shot segmentation methods that only rely on the limited and biased information from the annotated support images, LLaFS leverages the vast prior knowledge gained by LLM as an effective supplement and directly uses the LLM to segment images in a few-shot manner. To enable the text-based LLM to handle image-related tasks, we carefully design an input instruction that allows the LLM to produce segmentation results represented as polygons, and propose a region-attribute table to simulate the human visual mechanism and provide multi-modal guidance. We also synthesize pseudo samples and use curriculum learning for pretraining to augment data and achieve better optimization. LLaFS achieves state-of-the-art results on multiple datasets, showing the potential of using LLMs for few-shot computer vision tasks.

replace ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Authors: Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek

Abstract: Current approaches for 3D human motion synthesis generate high-quality animations of digital humans performing a wide variety of actions and gestures. However, a notable technological gap exists in addressing the complex dynamics of multi-human interactions within this paradigm. In this work, we present ReMoS, a denoising diffusion-based model that synthesizes full-body reactive motion of a person in a two-person interaction scenario. Assuming the motion of one person is given, we employ a combined spatio-temporal cross-attention mechanism to synthesize the reactive body and hand motion of the second person, thereby completing the interactions between the two. We demonstrate ReMoS across challenging two-person scenarios such as pair-dancing, Ninjutsu, kickboxing, and acrobatics, where one person's movements have complex and diverse influences on the other. We also contribute the ReMoCap dataset for two-person interactions containing full-body and finger motions. We evaluate ReMoS through multiple quantitative metrics, qualitative visualizations, and a user study, and also indicate usability in interactive motion editing applications.

replace LEOD: Label-Efficient Object Detection for Event Cameras

Authors: Ziyi Wu, Mathias Gehrig, Qing Lyu, Xudong Liu, Igor Gilitschenski

Abstract: Object detection with event cameras benefits from the sensor's low latency and high dynamic range. However, it is costly to fully label event streams for supervised training due to their high temporal resolution. To reduce this cost, we present LEOD, the first method for label-efficient event-based detection. Our approach unifies weakly- and semi-supervised object detection with a self-training mechanism. We first utilize a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events. Then, the detector is re-trained with both real and generated labels. Leveraging the temporal consistency of events, we run bi-directional inference and apply tracking-based post-processing to enhance the quality of pseudo labels. To stabilize training against label noise, we further design a soft anchor assignment strategy. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example, on Gen1, it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx, RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD maintains its effectiveness even when all labeled data are available, reaching new state-of-the-art results. Finally, we show that our method readily scales to improve larger detectors as well. Code is released at https://github.com/Wuziyi616/LEOD

URLs: https://github.com/Wuziyi616/LEOD

replace SCHEME: Scalable Channer Mixer for Vision Transformers

Authors: Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos

Abstract: Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth albeit it accounts for a bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training whose contribution decays to zero as the training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models that is shown to establish new Pareto frontiers for accuracy vs FLOPS, accuracy vs model size, and accuracy vs throughput, especially for fast transformers of small model size. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.

replace Segment and Caption Anything

Authors: Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, Zicheng Liu

Abstract: We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM presents strong generalizability to segment anything while is short for semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically in the order of tens of millions), it costs less computation, less memory usage, and less communication bandwidth, resulting in both fast and scalable training. To address the scarcity problem of regional caption data, we propose to first pre-train our model on objection detection and segmentation tasks. We call this step weak supervision pretraining since the pre-training data only contains category names instead of full-sentence descriptions. The weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via https://xk-huang.github.io/segment-caption-anything/.

URLs: https://xk-huang.github.io/segment-caption-anything/.

replace ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation

Authors: Dar-Yen Chen, Hamish Tennent, Ching-Wen Hsu

Abstract: This work introduces ArtAdapter, a transformative text-to-image (T2I) style transfer framework that transcends traditional limitations of color, brushstrokes, and object shape, capturing high-level style elements such as composition and distinctive artistic expression. The integration of a multi-level style encoder with our proposed explicit adaptation mechanism enables ArtAdapter to achieve unprecedented fidelity in style transfer, ensuring close alignment with textual descriptions. Additionally, the incorporation of an Auxiliary Content Adapter (ACA) effectively separates content from style, alleviating the borrowing of content from style references. Moreover, our novel fast finetuning approach could further enhance zero-shot style representation while mitigating the risk of overfitting. Comprehensive evaluations confirm that ArtAdapter surpasses current state-of-the-art methods.

replace AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. Demo page is available on https://choijeongsoo.github.io/av2av.

URLs: https://choijeongsoo.github.io/av2av.

replace DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

Authors: Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, Xihui Liu

Abstract: Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.

replace Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance

Authors: Yuto Enyo, Ko Nishino

Abstract: Reflectance bounds the frequency spectrum of illumination in the object appearance. In this paper, we introduce the first stochastic inverse rendering method, which recovers the attenuated frequency spectrum of an illumination jointly with the reflectance of an object of known geometry from a single image. Our key idea is to solve this blind inverse problem in the reflectance map, an appearance representation invariant to the underlying geometry, by learning to reverse the image formation with a novel diffusion model which we refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed reflectance map converted and completed from the single input image, DRMNet generates a reflectance map corresponding to a perfect mirror sphere while jointly estimating the reflectance. The forward process can be understood as gradually filtering a natural illumination with lower and lower frequency reflectance and additive Gaussian noise. DRMNet learns to invert this process with two subnetworks, IllNet and RefNet, which work in concert towards this joint estimation. The network is trained on an extensive synthetic dataset and is demonstrated to generalize to real images, showing state-of-the-art accuracy on established datasets.

replace DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting

Authors: Demin Yu, Xutao Li, Yunming Ye, Baoquan Zhang, Chuyao Luo, Kuai Dai, Rui Wang, Xunlai Chen

Abstract: Precipitation nowcasting is an important spatio-temporal prediction task to predict the radar echoes sequences based on current observations, which can serve both meteorological science and smart city applications. Due to the chaotic evolution nature of the precipitation systems, it is a very challenging problem. Previous studies address the problem either from the perspectives of deterministic modeling or probabilistic modeling. However, their predictions suffer from the blurry, high-value echoes fading away and position inaccurate issues. The root reason of these issues is that the chaotic evolutionary precipitation systems are not appropriately modeled. Inspired by the nature of the systems, we propose to decompose and model them from the perspective of global deterministic motion and local stochastic variations with residual mechanism. A unified and flexible framework that can equip any type of spatio-temporal models is proposed based on residual diffusion, which effectively tackles the shortcomings of previous methods. Extensive experimental results on four publicly available radar datasets demonstrate the effectiveness and superiority of the proposed framework, compared to state-of-the-art techniques. Our code is publicly available at https://github.com/DeminYu98/DiffCast.

URLs: https://github.com/DeminYu98/DiffCast.

replace Regularizing Self-supervised 3D Scene Flows with Surface Awareness and Cyclic Consistency

Authors: Patrik Vacek, David Hurych, Karel Zimmermann, Patrick Perez, Tomas Svoboda

Abstract: Learning without supervision how to predict 3D scene flows from point clouds is essential to many perception systems. We propose a novel learning framework for this task which improves the necessary regularization. Relying on the assumption that scene elements are mostly rigid, current smoothness losses are built on the definition of ``rigid clusters" in the input point clouds. The definition of these clusters is challenging and has a significant impact on the quality of predicted flows. We introduce two new consistency losses that enlarge clusters while preventing them from spreading over distinct objects. In particular, we enforce \emph{temporal} consistency with a forward-backward cyclic loss and \emph{spatial} consistency by considering surface orientation similarity in addition to spatial proximity. The proposed losses are model-independent and can thus be used in a plug-and-play fashion to significantly improve the performance of existing models, as demonstrated on two most widely used architectures. We also showcase the effectiveness and generalization capability of our framework on four standard sensor-unique driving datasets, achieving state-of-the-art performance in 3D scene flow estimation. Our codes are available on https://github.com/ctu-vras/sac-flow.

URLs: https://github.com/ctu-vras/sac-flow.

replace A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Authors: Julio Silva-Rodr\'iguez, Sina Hajimiri, Ismail Ben Ayed, Jose Dolz

Abstract: Efficient transfer learning (ETL) is receiving increasing attention to adapt large pre-trained language-vision models on downstream tasks with a few labeled samples. While significant progress has been made, we reveal that state-of-the-art ETL approaches exhibit strong performance only in narrowly-defined experimental setups, and with a careful adjustment of hyperparameters based on a large corpus of labeled samples. In particular, we make two interesting, and surprising empirical observations. First, to outperform a simple Linear Probing baseline, these methods require to optimize their hyper-parameters on each target task. And second, they typically underperform -- sometimes dramatically -- standard zero-shot predictions in the presence of distributional drifts. Motivated by the unrealistic assumptions made in the existing literature, i.e., access to a large validation set and case-specific grid-search for optimal hyperparameters, we propose a novel approach that meets the requirements of real-world scenarios. More concretely, we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing term is optimized via an adaptation of the general Augmented Lagrangian method tailored to this context. We comprehensively evaluate CLAP on a broad span of datasets and scenarios, demonstrating that it consistently outperforms SoTA approaches, while yet being a much more efficient alternative.

replace MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images

Authors: Libo Wang, Sijun Dong, Ying Chen, Xiaoliang Meng, Shenghui Fang, Ayman Habib, Songlin Fei

Abstract: Semantic segmentation of remote sensing images plays a vital role in a wide range of Earth Observation (EO) applications, such as land use land cover mapping, environment monitoring, and sustainable development. Driven by rapid developments in Artificial Intelligence (AI), deep learning (DL) has emerged as the mainstream tool for semantic segmentation and has achieved many breakthroughs in the field of remote sensing. However, the existing DL-based methods mainly focus on unimodal visual data while ignoring the rich multimodal information involved in the real world, usually demonstrating weak reliability and generlization. Inspired by the success of Vision Transformers and large language models, we propose a novel metadata-collaborative multimodal segmentation network (MetaSegNet) that applies vision-language representation learning for semantic segmentation of remote sensing images. Unlike the common model structure that only uses unimodal visual data, we extract the key characteristic (e.g. the climate zone) from freely available remote sensing image metadata and transfer it into knowledge-based text prompts via the generic ChatGPT. Then, we construct an image encoder, a text encoder and a cross-modal attention fusion subnetwork to extract the image and text feature and apply image-text interaction. Benefiting from such a design, the proposed MetaSegNet demonstrates superior generalization and achieves competitive accuracy with the state-of-the-art semantic segmentation methods on the large-scale OpenEarthMap dataset (68.6% mIoU) and Potsdam dataset (93.3% mean F1 score) as well as LoveDA dataset (52.2% mIoU).

replace TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Authors: Qinying Liu, Wei Wu, Kecheng Zheng, Zhan Tong, Jiawei Liu, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen

Abstract: The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 5.2\% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. Project page can be found at https://qinying-liu.github.io/Tag-Align.

URLs: https://qinying-liu.github.io/Tag-Align.

replace Passive Non-Line-of-Sight Imaging with Light Transport Modulation

Authors: Jiarui Zhang, Ruixu Geng, Xiaolong Du, Yan Chen, Houqiang Li, Yang Hu

Abstract: Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/JerryOctopus/NLOS-LTM.

URLs: https://github.com/JerryOctopus/NLOS-LTM.

replace TRIPS: Trilinear Point Splatting for Real-Time Radiance Field Rendering

Authors: Linus Franke, Darius R\"uckert, Laura Fink, Marc Stamminger

Abstract: Point-based radiance field rendering has demonstrated impressive results for novel view synthesis, offering a compelling blend of rendering quality and computational efficiency. However, also latest approaches in this domain are not without their shortcomings. 3D Gaussian Splatting [Kerbl and Kopanas et al. 2023] struggles when tasked with rendering highly detailed scenes, due to blurring and cloudy artifacts. On the other hand, ADOP [R\"uckert et al. 2022] can accommodate crisper images, but the neural reconstruction network decreases performance, it grapples with temporal instability and it is unable to effectively address large gaps in the point cloud. In this paper, we present TRIPS (Trilinear Point Splatting), an approach that combines ideas from both Gaussian Splatting and ADOP. The fundamental concept behind our novel technique involves rasterizing points into a screen-space image pyramid, with the selection of the pyramid layer determined by the projected point size. This approach allows rendering arbitrarily large points using a single trilinear write. A lightweight neural network is then used to reconstruct a hole-free image including detail beyond splat resolution. Importantly, our render pipeline is entirely differentiable, allowing for automatic optimization of both point sizes and positions. Our evaluation demonstrate that TRIPS surpasses existing state-of-the-art methods in terms of rendering quality while maintaining a real-time frame rate of 60 frames per second on readily available hardware. This performance extends to challenging scenarios, such as scenes featuring intricate geometry, expansive landscapes, and auto-exposed footage. The project page is located at: https://lfranke.github.io/trips/

URLs: https://lfranke.github.io/trips/

replace SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

Authors: Mingrui Li, Shuhong Liu, Heng Zhou, Guohao Zhu, Na Cheng, Tianchen Deng, Hongyu Wang

Abstract: We present SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. It incorporates appearance, geometry, and semantic features through multi-channel optimization, addressing the oversmoothing limitations of neural implicit SLAM systems in high-quality rendering, scene understanding, and object-level geometry. We introduce a unique semantic feature loss that effectively compensates for the shortcomings of traditional depth and color losses in object optimization. Through a semantic-guided keyframe selection strategy, we prevent erroneous reconstructions caused by cumulative errors. Extensive experiments demonstrate that SGS-SLAM delivers state-of-the-art performance in camera pose estimation, map reconstruction, precise semantic segmentation, and object-level geometric accuracy, while ensuring real-time rendering capabilities.

replace Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

Authors: Xuelin Qian, Yu Wang, Simian Luo, Yinda Zhang, Ying Tai, Zhenyu Zhang, Chengjie Wang, Xiangyang Xue, Bo Zhao, Tiejun Huang, Yunsheng Wu, Yanwei Fu

Abstract: Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously. Firstly, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. It consists of a comprehensive collection of approximately 900,000 objects, with multiple properties of meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, empowers our model to learn from a wide range of object variations. However, directly applying 3D auto-regression encounters critical challenges of high computational demands on volumetric grids and ambiguous auto-regressive order along grid dimensions, resulting in inferior quality of 3D shapes. To this end, we then present a novel framework Argus3D in terms of capacity. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order. The capacity of conditional generation can thus be realized by simply concatenating various conditioning inputs to the latent vector, such as point clouds, categories, images, and texts. In addition, thanks to the simplicity of our model architecture, we naturally scale up our approach to a larger model with an impressive 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving remarkable performance.

replace Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing

Authors: Bi'an Du, Xiang Gao, Wei Hu, Renjie Liao

Abstract: Generative 3D part assembly involves understanding part relationships and predicting their 6-DoF poses for assembling a realistic 3D shape. Prior work often focus on the geometry of individual parts, neglecting part-whole hierarchies of objects. Leveraging two key observations: 1) super-part poses provide strong hints about part poses, and 2) predicting super-part poses is easier due to fewer superparts, we propose a part-whole-hierarchy message passing network for efficient 3D part assembly. We first introduce super-parts by grouping geometrically similar parts without any semantic labels. Then we employ a part-whole hierarchical encoder, wherein a super-part encoder predicts latent super-part poses based on input parts. Subsequently, we transform the point cloud using the latent poses, feeding it to the part encoder for aggregating super-part information and reasoning about part relationships to predict all part poses. In training, only ground-truth part poses are required. During inference, the predicted latent poses of super-parts enhance interpretability. Experimental results on the PartNet dataset show that our method achieves state-of-the-art performance in part and connectivity accuracy and enables an interpretable hierarchical part assembly.

replace VRP-SAM: SAM with Visual Reference Prompt

Authors: Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li

Abstract: In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in target image. It is note that the VRP encoder can support a variety of annotation formats for reference images, including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to perform segmentation of unseen objects and enabling cross-domain segmentation. The source code and models will be available at \url{https://github.com/syp2ysy/VRP-SAM}

URLs: https://github.com/syp2ysy/VRP-SAM

replace A Novel Approach to Industrial Defect Generation through Blended Latent Diffusion Model with Online Adaptation

Authors: Hanxi Li, Zhengxun Zhang, Hao Chen, Lin Wu, Bo Li, Deyin Liu, Mingwen Wang

Abstract: Effectively addressing the challenge of industrial Anomaly Detection (AD) necessitates an ample supply of defective samples, a constraint often hindered by their scarcity in industrial contexts. This paper introduces a novel algorithm designed to augment defective samples, thereby enhancing AD performance. The proposed method tailors the blended latent diffusion model for defect sample generation, employing a diffusion model to generate defective samples in the latent space. A feature editing process, controlled by a ``trimap" mask and text prompts, refines the generated samples. The image generation inference process is structured into three stages: a free diffusion stage, an editing diffusion stage, and an online decoder adaptation stage. This sophisticated inference strategy yields high-quality synthetic defective samples with diverse pattern variations, leading to significantly improved AD accuracies based on the augmented training set. Specifically, on the widely recognized MVTec AD dataset, the proposed method elevates the state-of-the-art (SOTA) performance of AD with augmented data by 1.5%, 1.9%, and 3.1% for AD metrics AP, IAP, and IAP90, respectively. The implementation code of this work can be found at the GitHub repository https://github.com/GrandpaXun242/AdaBLDM.git

URLs: https://github.com/GrandpaXun242/AdaBLDM.git

replace HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative

Authors: Cong Ma, Lei Qiao, Chengkai Zhu, Kai Liu, Zelong Kong, Qing Li, Xueqi Zhou, Yuheng Kan, Wei Wu

Abstract: Vehicle-to-everything (V2X) is a popular topic in the field of Autonomous Driving in recent years. Vehicle-infrastructure cooperation (VIC) becomes one of the important research area. Due to the complexity of traffic conditions such as blind spots and occlusion, it greatly limits the perception capabilities of single-view roadside sensing systems. To further enhance the accuracy of roadside perception and provide better information to the vehicle side, in this paper, we constructed holographic intersections with various layouts to build a large-scale multi-sensor holographic vehicle-infrastructure cooperation dataset, called HoloVIC. Our dataset includes 3 different types of sensors (Camera, Lidar, Fisheye) and employs 4 sensor-layouts based on the different intersections. Each intersection is equipped with 6-18 sensors to capture synchronous data. While autonomous vehicles pass through these intersections for collecting VIC data. HoloVIC contains in total on 100k+ synchronous frames from different sensors. Additionally, we annotated 3D bounding boxes based on Camera, Fisheye, and Lidar. We also associate the IDs of the same objects across different devices and consecutive frames in sequence. Based on HoloVIC, we formulated four tasks to facilitate the development of related research. We also provide benchmarks for these tasks.

replace Doubly Abductive Counterfactual Inference for Text-based Image Editing

Authors: Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, Yu-Gang Jiang

Abstract: We study text-based image editing (TBIE) of a single image by counterfactual inference because it is an elegant formulation to precisely address the requirement: the edited image should retain the fidelity of the original one. Through the lens of the formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of the single-image fine-tuning. To this end, we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA, whose abduction can encode all the image details. Second, we abduct another exogenous variable parameterized by a text encoder LoRA, which recovers the lost editability caused by the overfitted first abduction. Thanks to the second abduction, which exclusively encodes the visual transition from post-edit to pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit back to post-edit, thereby accomplishing the edit. Through extensive experiments, our DAC achieves a good trade-off between editability and fidelity. Thus, we can support a wide spectrum of user editing intents, including addition, removal, manipulation, replacement, style transfer, and facial change, which are extensively validated in both qualitative and quantitative evaluations. Codes are in https://github.com/xuesong39/DAC.

URLs: https://github.com/xuesong39/DAC.

replace ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes

Authors: Hashmat Shadab Malik, Muhammad Huzaifa, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Abstract: Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthetic datasets to induce changes to object characteristics (viewpoints, scale, color) or utilized image transformation techniques (adversarial changes, common corruptions) on real images to simulate shifts in distributions. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either lack in offering control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embedding of text-to-image models. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiment to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks. Code https://github.com/Muhammad-Huzaifaa/ObjectCompose.git

URLs: https://github.com/Muhammad-Huzaifaa/ObjectCompose.git

replace Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation

Authors: Mingyu Lee, Jongwon Choi

Abstract: We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.

replace Transferring Relative Monocular Depth to Surgical Vision with Temporal Consistency

Authors: Charlie Budd, Tom Vercauteren

Abstract: Relative monocular depth, inferring depth up to shift and scale from a single image, is an active research topic. Recent deep learning models, trained on large and varied meta-datasets, now provide excellent performance in the domain of natural images. However, few datasets exist which provide ground truth depth for endoscopic images, making training such models from scratch unfeasible. This work investigates the transfer of these models into the surgical domain, and presents an effective and simple way to improve on standard supervision through the use of temporal consistency self-supervision. We show temporal consistency significantly improves supervised training alone when transferring to the low-data regime of endoscopy, and outperforms the prevalent self-supervision technique for this task. In addition we show our method drastically outperforms the state-of-the-art method from within the domain of endoscopy. We also release our code, model and ensembled meta-dataset, Meta-MED, establishing a strong benchmark for future work.

replace FPT: Fine-grained Prompt Tuning for Parameter and Memory Efficient Fine Tuning in High-resolution Medical Image Classification

Authors: Yijin Huang, Pujin Cheng, Roger Tam, Xiaoying Tang

Abstract: Parameter-efficient fine-tuning (PEFT) is proposed as a cost-effective way to transfer pre-trained models to downstream tasks, avoiding the high cost of updating entire large-scale pre-trained models (LPMs). In this work, we present Fine-grained Prompt Tuning (FPT), a novel PEFT method for medical image classification. FPT significantly reduces memory consumption compared to other PEFT methods, especially in high-resolution contexts. To achieve this, we first freeze the weights of the LPM and construct a learnable lightweight side network. The frozen LPM takes high-resolution images as input to extract fine-grained features, while the side network is fed low-resolution images to reduce memory usage. To allow the side network to access pre-trained knowledge, we introduce fine-grained prompts that summarize information from the LPM through a fusion module. Important tokens selection and preloading techniques are employed to further reduce training cost and memory requirements. We evaluate FPT on four medical datasets with varying sizes, modalities, and complexities. Experimental results demonstrate that FPT achieves comparable performance to fine-tuning the entire LPM while using only 1.8% of the learnable parameters and 13% of the memory costs of an encoder ViT-B model with a 512 x 512 input resolution.

replace Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework

Authors: Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, Johan W. Verjans

Abstract: Medical vision language pre-training (VLP) has emerged as a frontier of research, enabling zero-shot pathological recognition by comparing the query image with the textual descriptions for each disease. Due to the complex semantics of biomedical texts, current methods struggle to align medical images with key pathological findings in unstructured reports. This leads to the misalignment with the target disease's textual representation. In this paper, we introduce a novel VLP framework designed to dissect disease descriptions into their fundamental aspects, leveraging prior knowledge about the visual manifestations of pathologies. This is achieved by consulting a large language model and medical experts. Integrating a Transformer module, our approach aligns an input image with the diverse elements of a disease, generating aspect-centric image representations. By consolidating the matches from each aspect, we improve the compatibility between an image and its associated disease. Additionally, capitalizing on the aspect-oriented representations, we present a dual-head Transformer tailored to process known and unknown diseases, optimizing the comprehensive detection efficacy. Conducting experiments on seven downstream datasets, ours improves the accuracy of recent methods by up to 8.56% and 17.0% for seen and unseen categories, respectively. Our code is released at https://github.com/HieuPhan33/MAVL.

URLs: https://github.com/HieuPhan33/MAVL.

replace Identity-aware Dual-constraint Network for Cloth-Changing Person Re-identification

Authors: Peini Guo, Mengyuan Liu, Hong Liu, Ruijia Fan, Guoquan Wang, Bin He

Abstract: Cloth-Changing Person Re-Identification (CC-ReID) aims to accurately identify the target person in more realistic surveillance scenarios, where pedestrians usually change their clothing. Despite great progress, limited cloth-changing training samples in existing CC-ReID datasets still prevent the model from adequately learning cloth-irrelevant features. In addition, due to the absence of explicit supervision to keep the model constantly focused on cloth-irrelevant areas, existing methods are still hampered by the disruption of clothing variations. To solve the above issues, we propose an Identity-aware Dual-constraint Network (IDNet) for the CC-ReID task. Specifically, to help the model extract cloth-irrelevant clues, we propose a Clothes Diversity Augmentation (CDA), which generates more realistic cloth-changing samples by enriching the clothing color while preserving the texture. In addition, a Multi-scale Constraint Block (MCB) is designed, which extracts fine-grained identity-related features and effectively transfers cloth-irrelevant knowledge. Moreover, a Counterfactual-guided Attention Module (CAM) is presented, which learns cloth-irrelevant features from channel and space dimensions and utilizes the counterfactual intervention for supervising the attention map to highlight identity-related regions. Finally, a Semantic Alignment Constraint (SAC) is designed to facilitate high-level semantic feature interaction. Comprehensive experiments on four CC-ReID datasets indicate that our method outperforms prior state-of-the-art approaches.

replace HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction

Authors: Yi Zhou, Hui Zhang, Jiaqian Yu, Yifan Yang, Sangil Jung, Seung-In Park, ByungIn Yoo

Abstract: Vectorized High-Definition (HD) map construction requires predictions of the category and point coordinates of map elements (e.g. road boundary, lane divider, pedestrian crossing, etc.). State-of-the-art methods are mainly based on point-level representation learning for regressing accurate point coordinates. However, this pipeline has limitations in obtaining element-level information and handling element-level failures, e.g. erroneous element shape or entanglement between elements. To tackle the above issues, we propose a simple yet effective HybrId framework named HIMap to sufficiently learn and interact both point-level and element-level information. Concretely, we introduce a hybrid representation called HIQuery to represent all map elements, and propose a point-element interactor to interactively extract and encode the hybrid information of elements, e.g. point position and element shape, into the HIQuery. Additionally, we present a point-element consistency constraint to enhance the consistency between the point-level and element-level information. Finally, the output point-element integrated HIQuery can be directly converted into map elements' class, point coordinates, and mask. We conduct extensive experiments and consistently outperform previous methods on both nuScenes and Argoverse2 datasets. Notably, our method achieves $77.8$ mAP on the nuScenes dataset, remarkably superior to previous SOTAs by $8.3$ mAP at least.

replace Testing MediaPipe Holistic for Linguistic Analysis of Nonmanual Markers in Sign Languages

Authors: Anna Kuznetsova, Vadim Kimmelman

Abstract: Advances in Deep Learning have made possible reliable landmark tracking of human bodies and faces that can be used for a variety of tasks. We test a recent Computer Vision solution, MediaPipe Holistic (MPH), to find out if its tracking of the facial features is reliable enough for a linguistic analysis of data from sign languages, and compare it to an older solution (OpenFace, OF). We use an existing data set of sentences in Kazakh-Russian Sign Language and a newly created small data set of videos with head tilts and eyebrow movements. We find that MPH does not perform well enough for linguistic analysis of eyebrow movement - but in a different way from OF, which is also performing poorly without correction. We reiterate a previous proposal to train additional correction models to overcome these limitations.

replace Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives

Authors: Ronghui Li, YuXiang Zhang, Yachao Zhang, Hongwen Zhang, Jie Guo, Yan Zhang, Yebin Liu, Xiu Li

Abstract: We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse to fine diffusion architecture, and propose the characteristic dance primitives that possess significant expressiveness as intermediate representations between two diffusion models. The first stage is global diffusion, which focuses on comprehending the coarse-level music-dance correlation and production characteristic dance primitives. In contrast, the second-stage is the local diffusion, which parallelly generates detailed motion sequences under the guidance of the dance primitives and choreographic rules. In addition, we propose a Foot Refine Block to optimize the contact between the feet and the ground, enhancing the physical realism of the motion. Our approach can parallelly generate dance sequences of extremely long length, striking a balance between global choreographic patterns and local motion quality and expressiveness. Extensive experiments validate the efficacy of our method.

replace Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification

Authors: Kaijie Ren, Lei Zhang

Abstract: Visible-Infrared Person Re-identification (VI-ReID) is a challenging cross-modal pedestrian retrieval task, due to significant intra-class variations and cross-modal discrepancies among different cameras. Existing works mainly focus on embedding images of different modalities into a unified space to mine modality-shared features. They only seek distinctive information within these shared features, while ignoring the identity-aware useful information that is implicit in the modality-specific features. To address this issue, we propose a novel Implicit Discriminative Knowledge Learning (IDKL) network to uncover and leverage the implicit discriminative information contained within the modality-specific. First, we extract modality-specific and modality-shared features using a novel dual-stream network. Then, the modality-specific features undergo purification to reduce their modality style discrepancies while preserving identity-aware discriminative knowledge. Subsequently, this kind of implicit knowledge is distilled into the modality-shared feature to enhance its distinctiveness. Finally, an alignment loss is proposed to minimize modality discrepancy on enhanced modality-shared features. Extensive experiments on multiple public datasets demonstrate the superiority of IDKL network over the state-of-the-art methods. Code is available at https://github.com/1KK077/IDKL.

URLs: https://github.com/1KK077/IDKL.

replace Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition

Authors: Bach Nguyen-Xuan, Thien Nguyen-Hoang, Nhu Tai-Do

Abstract: Facial Expression Recognition (FER) is a critical task within computer vision with diverse applications across various domains. Addressing the challenge of limited FER datasets, which hampers the generalization capability of expression recognition models, is imperative for enhancing performance. Our paper presents an innovative approach integrating the MAE-Face self-supervised learning (SSL) method and Fusion Attention mechanism for expression classification, particularly showcased in the 6th Affective Behavior 32 pages harvmac; added references for section 5Analysis in-the-wild (ABAW) competition. Additionally, we propose preprocessing techniques to emphasize essential facial features, thereby enhancing model performance on both training and validation sets, notably demonstrated on the Aff-wild2 dataset.

replace Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

Authors: Florian Strohm, Mihai B\^ace, Andreas Bulling

Abstract: Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two public saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model's ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.

replace ProMamba: Prompt-Mamba for polyp segmentation

Authors: Jianhao Xie, Ruofan Liao, Ziang Zhang, Sida Yi, Yuesheng Zhu, Guibo Luo

Abstract: Detecting polyps through colonoscopy is an important task in medical image segmentation, which provides significant assistance and reference value for clinical surgery. However, accurate segmentation of polyps is a challenging task due to two main reasons. Firstly, polyps exhibit various shapes and colors. Secondly, the boundaries between polyps and their normal surroundings are often unclear. Additionally, significant differences between different datasets lead to limited generalization capabilities of existing methods. To address these issues, we propose a segmentation model based on Prompt-Mamba, which incorporates the latest Vision-Mamba and prompt technologies. Compared to previous models trained on the same dataset, our model not only maintains high segmentation accuracy on the validation part of the same dataset but also demonstrates superior accuracy on unseen datasets, exhibiting excellent generalization capabilities. Notably, we are the first to apply the Vision-Mamba architecture to polyp segmentation and the first to utilize prompt technology in a polyp segmentation model. Our model efficiently accomplishes segmentation tasks, surpassing previous state-of-the-art methods by an average of 5% across six datasets. Furthermore, we have developed multiple versions of our model with scaled parameter counts, achieving better performance than previous models even with fewer parameters. Our code and trained weights will be released soon.

replace SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing

Authors: Florian Strohm, Mihai B\^ace, Markus Kaltenecker, Andreas Bulling

Abstract: We propose Semantic Facial Feature Control (SeFFeC) - a novel method for fine-grained face shape editing. Our method enables the manipulation of human-understandable, semantic face features, such as nose length or mouth width, which are defined by different groups of facial landmarks. In contrast to existing methods, the use of facial landmarks enables precise measurement of the facial features, which then enables training SeFFeC without any manually annotated labels. SeFFeC consists of a transformer-based encoder network that takes a latent vector of a pre-trained generative model and a facial feature embedding as input, and learns to modify the latent vector to perform the desired face edit operation. To ensure that the desired feature measurement is changed towards the target value without altering uncorrelated features, we introduced a novel semantic face feature loss. Qualitative and quantitative results show that SeFFeC enables precise and fine-grained control of 23 facial features, some of which could not previously be controlled by other methods, without requiring manual annotations. Unlike existing methods, SeFFeC also provides deterministic control over the exact values of the facial features and more localised and disentangled face edits.

replace EcoSense: Energy-Efficient Intelligent Sensing for In-Shore Ship Detection through Edge-Cloud Collaboration

Authors: Wenjun Huang, Hanning Chen, Yang Ni, Arghavan Rezvani, Sanggeon Yun, Sungheon Jeon, Eric Pedley, Mohsen Imani

Abstract: Detecting marine objects inshore presents challenges owing to algorithmic intricacies and complexities in system deployment. We propose a difficulty-aware edge-cloud collaborative sensing system that splits the task into object localization and fine-grained classification. Objects are classified either at the edge or within the cloud, based on their estimated difficulty. The framework comprises a low-power device-tailored front-end model for object localization, classification, and difficulty estimation, along with a transformer-graph convolutional network-based back-end model for fine-grained classification. Our system demonstrates superior performance (mAP@0.5 +4.3%}) on widely used marine object detection datasets, significantly reducing both data transmission volume (by 95.43%) and energy consumption (by 72.7%}) at the system level. We validate the proposed system across various embedded system platforms and in real-world scenarios involving drone deployment.

replace Clean-image Backdoor Attacks

Authors: Dazhong Rong, Guoyao Yu, Shuheng Shen, Xinyi Fu, Peng Qian, Jianhai Chen, Qinming He, Xing Fu, Weiqiang Wang

Abstract: To gather a significant quantity of annotated training data for high-performance image classification models, numerous companies opt to enlist third-party providers to label their unlabeled data. This practice is widely regarded as secure, even in cases where some annotated errors occur, as the impact of these minor inaccuracies on the final performance of the models is negligible and existing backdoor attacks require attacker's ability to poison the training images. Nevertheless, in this paper, we propose clean-image backdoor attacks which uncover that backdoors can still be injected via a fraction of incorrect labels without modifying the training images. Specifically, in our attacks, the attacker first seeks a trigger feature to divide the training images into two parts: those with the feature and those without it. Subsequently, the attacker falsifies the labels of the former part to a backdoor class. The backdoor will be finally implanted into the target model after it is trained on the poisoned data. During the inference phase, the attacker can activate the backdoor in two ways: slightly modifying the input image to obtain the trigger feature, or taking an image that naturally has the trigger feature as input. We conduct extensive experiments to demonstrate the effectiveness and practicality of our attacks. According to the experimental results, we conclude that our attacks seriously jeopardize the fairness and robustness of image classification models, and it is necessary to be vigilant about the incorrect labels in outsourced labeling.

replace MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

Authors: Mai A. Shaaban, Adnan Khan, Mohammad Yaqub

Abstract: Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR). This paper introduces \textbf{MedPromptX}, the first model to integrate multimodal large language models (MLLMs), few-shot prompting (FP) and visual grounding (VG) to combine imagery with EHR data for chest X-ray diagnosis. A pre-trained MLLM is utilized to complement the missing EHR information, providing a comprehensive understanding of patients' medical history. Additionally, FP reduces the necessity for extensive training of MLLMs while effectively tackling the issue of hallucination. Nevertheless, the process of determining the optimal number of few-shot examples and selecting high-quality candidates can be burdensome, yet it profoundly influences model performance. Hence, we propose a new technique that dynamically refines few-shot data for real-time adjustment to new patient scenarios. Moreover, VG aids in focusing the model's attention on relevant regions of interest in X-ray images, enhancing the identification of abnormalities. We release MedPromptX-VQA, a new in-context visual question answering dataset encompassing interleaved image and EHR data derived from MIMIC-IV and MIMIC-CXR databases. Results demonstrate the SOTA performance of MedPromptX, achieving an 11% improvement in F1-score compared to the baselines. Code and data are available at https://github.com/BioMedIA-MBZUAI/MedPromptX

URLs: https://github.com/BioMedIA-MBZUAI/MedPromptX

replace X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

Authors: You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo

Abstract: We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. As its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieve fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module is learned to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that effectively enhance the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.

replace PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling

Authors: Xiaoyun Zheng, Liwei Liao, Xufeng Li, Jianbo Jiao, Rongjie Wang, Feng Gao, Shiqi Wang, Ronggang Wang

Abstract: High-quality human reconstruction and photo-realistic rendering of a dynamic scene is a long-standing problem in computer vision and graphics. Despite considerable efforts invested in developing various capture systems and reconstruction algorithms, recent advancements still struggle with loose or oversized clothing and overly complex poses. In part, this is due to the challenges of acquiring high-quality human datasets. To facilitate the development of these fields, in this paper, we present PKU-DyMVHumans, a versatile human-centric dataset for high-fidelity reconstruction and rendering of dynamic human scenarios from dense multi-view videos. It comprises 8.2 million frames captured by more than 56 synchronized cameras across diverse scenarios. These sequences comprise 32 human subjects across 45 different scenarios, each with a high-detailed appearance and realistic human motion. Inspired by recent advancements in neural radiance field (NeRF)-based scene representations, we carefully set up an off-the-shelf framework that is easy to provide those state-of-the-art NeRF-based implementations and benchmark on PKU-DyMVHumans dataset. It is paving the way for various applications like fine-grained foreground/background decomposition, high-quality human reconstruction and photo-realistic novel view synthesis of a dynamic scene. Extensive studies are performed on the benchmark, demonstrating new observations and challenges that emerge from using such high-fidelity dynamic data. The dataset is available at: https://pku-dymvhumans.github.io.

URLs: https://pku-dymvhumans.github.io.

replace Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

Authors: Minchan Kim, Minyeong Kim, Junik Bae, Suhwan Choi, Sungkyung Kim, Buru Chang

Abstract: Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.

replace Gaze-guided Hand-Object Interaction Synthesis: Benchmark and Method

Authors: Jie Tian, Lingxiao Yang, Ran Ji, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, Jingya Wang

Abstract: Gaze plays a crucial role in revealing human attention and intention, shedding light on the cognitive processes behind human actions. The integration of gaze guidance with the dynamics of hand-object interactions boosts the accuracy of human motion prediction. However, the lack of datasets that capture the intricate relationship and consistency among gaze, hand, and object movements remains a substantial hurdle. In this paper, we introduce the first Gaze-guided Hand-Object Interaction dataset, GazeHOI, and present a novel task for synthesizing gaze-guided hand-object interactions. Our dataset, GazeHOI, features simultaneous 3D modeling of gaze, hand, and object interactions, comprising 479 sequences with an average duration of 19.1 seconds, 812 sub-sequences, and 33 objects of various sizes. We propose a hierarchical framework centered on a gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. In the pre-diffusion phase, we separate gaze conditions into spatial-temporal features and goal pose conditions at different levels of information granularity. During the diffusion phase, two gaze-conditioned diffusion models are stacked to simplify the complex synthesis of hand-object motions. Here, the object motion diffusion model generates sequences of object motions based on gaze conditions, while the hand motion diffusion model produces hand motions based on the generated object motion. To improve fine-grained goal pose alignment, we introduce a Spherical Gaussian constraint to guide the denoising step. In the subsequent post-diffusion phase, we optimize the generated hand motions using contact consistency. Our extensive experiments highlight the uniqueness of our dataset and the effectiveness of our approach.

replace Image Captioning in news report scenario

Authors: Tianrui Liu, Qi Cai, Changxin Xu, Bo Hong, Jize Xiong, Yuxin Qiao, Tsungwei Yang

Abstract: Image captioning strives to generate pertinent captions for specified images, situating itself at the crossroads of Computer Vision (CV) and Natural Language Processing (NLP). This endeavor is of paramount importance with far-reaching applications in recommendation systems, news outlets, social media, and beyond. Particularly within the realm of news reporting, captions are expected to encompass detailed information, such as the identities of celebrities captured in the images. However, much of the existing body of work primarily centers around understanding scenes and actions. In this paper, we explore the realm of image captioning specifically tailored for celebrity photographs, illustrating its broad potential for enhancing news industry practices. This exploration aims to augment automated news content generation, thereby facilitating a more nuanced dissemination of information. Our endeavor shows a broader horizon, enriching the narrative in news reporting through a more intuitive image captioning framework.

replace VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting

Authors: Yujin Tang, Peijie Dong, Zhenheng Tang, Xiaowen Chu, Junwei Liang

Abstract: Combining CNNs or ViTs, with RNNs for spatiotemporal forecasting, has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge; CNNs are limited by their narrow receptive fields, and ViTs struggle with the intensive computational demands of their attention mechanisms. The emergence of recent Mamba-based architectures has been met with enthusiasm for their exceptional long-sequence modeling capabilities, surpassing established vision models in efficiency and accuracy, which motivates us to develop an innovative architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks effectively. Our extensive evaluations show that our proposed approach secures competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyujintang/VMRNN-PyTorch.

URLs: https://github.com/yyyujintang/VMRNN-PyTorch.

replace-cross Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability

Authors: Seokhyeon Ha, Sunbeom Jung, Jungwoo Lee

Abstract: Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can lead to the distortion of pre-trained feature extractors that already possess strong generalization capabilities. Mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, our DAFT significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baseline methods, demonstrating its effectiveness in not only improving performance but also mitigating feature distortion.

replace-cross Unveiling the Pitfalls of Knowledge Editing for Large Language Models

Authors: Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, Huajun Chen

Abstract: As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code and data are available at https://github.com/zjunlp/PitfallsKnowledgeEditing.

URLs: https://github.com/zjunlp/PitfallsKnowledgeEditing.

replace-cross Pre-Trained Masked Image Model for Mobile Robot Navigation

Authors: Vishnu Dutt Sharma, Anukriti Singh, Pratap Tokekar

Abstract: 2D top-down maps are commonly used for the navigation and exploration of mobile robots through unknown areas. Typically, the robot builds the navigation maps incrementally from local observations using onboard sensors. Recent works have shown that predicting the structural patterns in the environment through learning-based approaches can greatly enhance task efficiency. While many such works build task-specific networks using limited datasets, we show that the existing foundational vision networks can accomplish the same without any fine-tuning. Specifically, we use Masked Autoencoders, pre-trained on street images, to present novel applications for field-of-view expansion, single-agent topological exploration, and multi-agent exploration for indoor mapping, across different input modalities. Our work motivates the use of foundational vision models for generalized structure prediction-driven applications, especially in the dearth of training data. For more qualitative results see https://raaslab.org/projects/MIM4Robots.

URLs: https://raaslab.org/projects/MIM4Robots.

replace-cross Context-Aware Meta-Learning

Authors: Christopher Fifty, Dennis Duan, Ronald G. Junkins, Ehsan Amid, Jure Leskovec, Christopher Re, Sebastian Thrun

Abstract: Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts visual meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks. Our code is available at https://github.com/cfifty/CAML.

URLs: https://github.com/cfifty/CAML.

replace-cross In Search of a Data Transformation That Accelerates Neural Field Training

Authors: Junwon Seo, Sangyoon Lee, Kwang In Kim, Jaeho Lee

Abstract: Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed-generating neural fields requires an overfitting of a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affect the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.

replace-cross MEDPSeg: Hierarchical polymorphic multitask learning for the segmentation of ground-glass opacities, consolidation, and pulmonary structures on computed tomography

Authors: Diedre S. Carmo, Jean A. Ribeiro, Alejandro P. Comellas, Joseph M. Reinhardt, Sarah E. Gerard, Let\'icia Rittner, Roberto A. Lotufo

Abstract: The COVID-19 pandemic response highlighted the potential of deep learning methods in facilitating the diagnosis, prognosis and understanding of lung diseases through automated segmentation of pulmonary structures and lesions in chest computed tomography (CT). Automated separation of lung lesion into ground-glass opacity (GGO) and consolidation is hindered due to the labor-intensive and subjective nature of this task, resulting in scarce availability of ground truth for supervised learning. To tackle this problem, we propose MEDPSeg. MEDPSeg learns from heterogeneous chest CT targets through hierarchical polymorphic multitask learning (HPML). HPML explores the hierarchical nature of GGO and consolidation, lung lesions, and the lungs, with further benefits achieved through multitasking airway and pulmonary artery segmentation. Over 6000 volumetric CT scans from different partially labeled sources were used for training and testing. Experiments show PML enabling new state-of-the-art performance for GGO and consolidation segmentation tasks. In addition, MEDPSeg simultaneously performs segmentation of the lung parenchyma, airways, pulmonary artery, and lung lesions, all in a single forward prediction, with performance comparable to state-of-the-art methods specialized in each of those targets. Finally, we provide an open-source implementation with a graphical user interface at https://github.com/MICLab-Unicamp/medpseg.

URLs: https://github.com/MICLab-Unicamp/medpseg.

replace-cross Learning-based Axial Video Motion Magnification

Authors: Kwon Byung-Ki, Oh Hyun-Bin, Kim Jun-Seong, Hyunwoo Ha, Tae-Hyun Oh

Abstract: Video motion magnification amplifies invisible small motions to be perceptible, which provides humans with a spatially dense and holistic understanding of small motions in the scene of interest. This is based on the premise that magnifying small motions enhances the legibility of motions. In the real world, however, vibrating objects often possess convoluted systems that have complex natural frequencies, modes, and directions. Existing motion magnification often fails to improve legibility since the intricate motions still retain complex characteristics even after being magnified, which may distract us from analyzing them. In this work, we focus on improving legibility by proposing a new concept, axial motion magnification, which magnifies decomposed motions along the user-specified direction. Axial motion magnification can be applied to various applications where motions of specific axes are critical, by providing simplified and easily readable motion information. To achieve this, we propose a novel Motion Separation Module that enables to disentangle and magnify the motion representation along axes of interest. Furthermore, we build a new synthetic training dataset for the axial motion magnification task. Our proposed method improves the legibility of resulting motions along certain axes by adding a new feature: user controllability. Axial motion magnification is a more generalized concept; thus, our method can be directly adapted to the generic motion magnification and achieves favorable performance against competing methods.

replace-cross Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification

Authors: Minghui Liao, Guojia Wan, Bo Du

Abstract: Determining the types of neurons within a nervous system plays a significant role in the analysis of brain connectomics and the investigation of neurological diseases. However, the efficiency of utilizing anatomical, physiological, or molecular characteristics of neurons is relatively low and costly. With the advancements in electron microscopy imaging and analysis techniques for brain tissue, we are able to obtain whole-brain connectome consisting neuronal high-resolution morphology and connectivity information. However, few models are built based on such data for automated neuron classification. In this paper, we propose NeuNet, a framework that combines morphological information of neurons obtained from skeleton and topological information between neurons obtained from neural circuit. Specifically, NeuNet consists of three components, namely Skeleton Encoder, Connectome Encoder, and Readout Layer. Skeleton Encoder integrates the local information of neurons in a bottom-up manner, with a one-dimensional convolution in neural skeleton's point data; Connectome Encoder uses a graph neural network to capture the topological information of neural circuit; finally, Readout Layer fuses the above two information and outputs classification results. We reprocess and release two new datasets for neuron classification task from volume electron microscopy(VEM) images of human brain cortex and Drosophila brain. Experiments on these two datasets demonstrated the effectiveness of our model with accuracy of 0.9169 and 0.9363, respectively. Code and data are available at: https://github.com/WHUminghui/NeuNet.

URLs: https://github.com/WHUminghui/NeuNet.

replace-cross Fix-Con: Automatic Fault Localization and Repair of Deep Learning Model Conversions between Frameworks

Authors: Nikolaos Louloudakis, Perry Gibson, Jos\'e Cano, Ajitha Rajan

Abstract: Converting deep learning models between frameworks is a common step to maximize model compatibility across devices and leverage optimization features that may be exclusively provided in one deep learning framework. However, this conversion process may be riddled with bugs, making the converted models either undeployable or problematic, considerably degrading their prediction correctness. In this paper we propose an automated approach for fault localization and repair, Fix-Con, during model conversion between deep learning frameworks. Fix-Con is capable of detecting and fixing faults introduced in model input, parameters, hyperparameters, and the model graph during conversion. Fix-Con uses a set of fault types (mined from surveying conversion issues reported \nick{in code repositories and forums}) to localize potential conversion faults in the converted target model and then repair them appropriately, e.g., replacing the parameters of the target model with those from the source model. This is done iteratively for every image in the dataset, comparing output label differences between the source model and the converted target model until all differences are resolved. We evaluate the effectiveness of Fix-Con in fixing model conversion bugs of three widely used image recognition models converted across four different deep learning frameworks. Overall, Fix-Con was able to fix $462$ out of $755$ detected conversion faults, either completely repairing or significantly improving the performance of $14$ out of the $15$ erroneous conversion cases.

replace-cross Few-Shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt

Authors: Chenxi Liu, Zhenyi Wang, Tianyi Xiong, Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang

Abstract: Few-Shot Class-Incremental Learning (FSCIL) models aim to incrementally learn new classes with scarce samples while preserving knowledge of old ones. Existing FSCIL methods usually fine-tune the entire backbone, leading to overfitting and hindering the potential to learn new classes. On the other hand, recent prompt-based CIL approaches alleviate forgetting by training prompts with sufficient data in each task. In this work, we propose a novel framework named Attention-aware Self-adaptive Prompt (ASP). ASP encourages task-invariant prompts to capture shared knowledge by reducing specific information from the attention aspect. Additionally, self-adaptive task-specific prompts in ASP provide specific information and transfer knowledge from old classes to new classes with an Information Bottleneck learning objective. In summary, ASP prevents overfitting on base task and does not require enormous data in few-shot incremental tasks. Extensive experiments on three benchmark datasets validate that ASP consistently outperforms state-of-the-art FSCIL and prompt-based CIL methods in terms of both learning new classes and mitigating forgetting.

replace-cross Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Authors: Filippos Gouidis, Katerina Papantoniou, Konstantinos Papoutsakis Theodore Patkos, Antonis Argyros, Dimitris Plexousakis

Abstract: Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the Vision-based Zero-shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.

replace-cross Motion Generation from Fine-grained Textual Descriptions

Authors: Kunhang Li, Yansong Feng

Abstract: The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with step-by-step instructions with pseudo-code compulsory checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.

URLs: https://github.com/KunhangL/finemotiondiffuse.

replace-cross Powerful Lossy Compression for Noisy Images

Authors: Shilv Cai, Xiaoguo Liang, Shuning Cao, Luxin Yan, Sheng Zhong, Liqun Chen, Xu Zou

Abstract: Image compression and denoising represent fundamental challenges in image processing with many real-world applications. To address practical demands, current solutions can be categorized into two main strategies: 1) sequential method; and 2) joint method. However, sequential methods have the disadvantage of error accumulation as there is information loss between multiple individual models. Recently, the academic community began to make some attempts to tackle this problem through end-to-end joint methods. Most of them ignore that different regions of noisy images have different characteristics. To solve these problems, in this paper, our proposed signal-to-noise ratio~(SNR) aware joint solution exploits local and non-local features for image compression and denoising simultaneously. We design an end-to-end trainable network, which includes the main encoder branch, the guidance branch, and the signal-to-noise ratio~(SNR) aware branch. We conducted extensive experiments on both synthetic and real-world datasets, demonstrating that our joint solution outperforms existing state-of-the-art methods.

replace-cross Towards Low-Energy Adaptive Personalization for Resource-Constrained Devices

Authors: Yushan Huang, Josh Millar, Yuxuan Long, Yuchen Zhao, Hamed Hadaddi

Abstract: The personalization of machine learning (ML) models to address data drift is a significant challenge in the context of Internet of Things (IoT) applications. Presently, most approaches focus on fine-tuning either the full base model or its last few layers to adapt to new data, while often neglecting energy costs. However, various types of data drift exist, and fine-tuning the full base model or the last few layers may not result in optimal performance in certain scenarios. We propose Target Block Fine-Tuning (TBFT), a low-energy adaptive personalization framework designed for resource-constrained devices. We categorize data drift and personalization into three types: input-level, feature-level, and output-level. For each type, we fine-tune different blocks of the model to achieve optimal performance with reduced energy costs. Specifically, input-, feature-, and output-level correspond to fine-tuning the front, middle, and rear blocks of the model. We evaluate TBFT on a ResNet model, three datasets, three different training sizes, and a Raspberry Pi. Compared with the $Block Avg$, where each block is fine-tuned individually and their performance improvements are averaged, TBFT exhibits an improvement in model accuracy by an average of 15.30% whilst saving 41.57% energy consumption on average compared with full fine-tuning.