new Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions

Authors: Kai Liu, Zhihang Fu, Chao Chen, Sheng Jin, Ze Chen, Mingyuan Tao, Rongxin Jiang, Jieping Ye

Abstract: The key to OOD detection has two aspects: generalized feature representation and precise category description. Recently, vision-language models such as CLIP have provided significant advances on both fronts, but constructing precise category descriptions is still in its infancy due to the absence of unseen categories. This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary through automatic prompt tuning. Specifically, perceptual contexts perceive the inter-category difference (e.g., cats vs. apples) for the current classification task, while spurious contexts further identify spurious (similar but not identical) OOD samples for every single category (e.g., cats vs. panthers, apples vs. peaches). The two contexts hierarchically construct the precise description for a certain category: a sample is first roughly classified to a predicted category and then delicately identified as truly ID or actually OOD. Moreover, the precise descriptions for those categories within the vision-language framework present a novel application: CATegory-EXtensible OOD detection (CATEX). One can efficiently extend the set of recognizable categories by simply merging the hierarchical contexts learned under different sub-task settings. Extensive experiments demonstrate CATEX's effectiveness, robustness, and category-extensibility. For instance, CATEX consistently surpasses its rivals by a large margin under several protocols on the challenging ImageNet-1K dataset. In addition, we offer new insights on how to efficiently scale up prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models (like GPT-3) to boost zero-shot applications. Code will be made public soon.
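
The abstract's two-stage decision rule and its category-extension trick can be pictured with a short sketch. The following is only our reading of the abstract, not the authors' code: text features produced by the learned perceptual and spurious contexts are assumed to be precomputed and L2-normalised, and the zero threshold is purely illustrative.

    import torch
    import torch.nn.functional as F

    def catex_predict(image_feat, perceptual_text_feats, spurious_text_feats):
        """image_feat: (D,); *_text_feats: (C, D) L2-normalised per-category text features."""
        image_feat = F.normalize(image_feat, dim=-1)
        percep_sim = image_feat @ perceptual_text_feats.T            # (C,)
        pred_class = percep_sim.argmax().item()                      # rough classification
        # Delicate ID/OOD check: is the sample closer to the predicted category's
        # perceptual description than to its spurious (near-OOD) description?
        id_score = percep_sim[pred_class] - image_feat @ spurious_text_feats[pred_class]
        return pred_class, bool(id_score > 0)                        # threshold 0 is illustrative

    # Category extension, as we read it: merge the hierarchical contexts learned on
    # two sub-tasks by concatenating their per-category text features.
    def merge_subtasks(percep_a, spur_a, percep_b, spur_b):
        return torch.cat([percep_a, percep_b]), torch.cat([spur_a, spur_b])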

new A study of animal action segmentation algorithms across supervised, unsupervised, and semi-supervised learning paradigms

Authors: Ari Blau, Evan S Schaffer, Neeli Mishra, Nathaniel J Miska, The International Brain Laboratory, Liam Paninski, Matthew R Whiteway

Abstract: Action segmentation of behavioral videos is the process of labeling each frame as belonging to one or more discrete classes, and is a crucial component of many studies that investigate animal behavior. A wide range of algorithms exist to automatically parse discrete animal behavior, encompassing supervised, unsupervised, and semi-supervised learning paradigms. These algorithms -- which include tree-based models, deep neural networks, and graphical models -- differ widely in their structure and assumptions on the data. Using four datasets spanning multiple species -- fly, mouse, and human -- we systematically study how the outputs of these various algorithms align with manually annotated behaviors of interest. Along the way, we introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models. We find that fully supervised temporal convolutional networks with the addition of temporal information in the observations perform the best on our supervised metrics across all datasets.

new VisMin: Visual Minimal-Change Understanding

Authors: Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal

Abstract: Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar \textit{captions} given an image. In this paper, we introduce a new, challenging benchmark termed \textbf{Vis}ual \textbf{Min}imal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: \textit{object}, \textit{attribute}, \textit{count}, and \textit{spatial relation}. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at \url{https://vismin.net/}.

URLs: https://vismin.net/

new A Dataset for Crucial Object Recognition in Blind and Low-Vision Individuals' Navigation

Authors: Md Touhidul Islam, Imran Kabir, Elena Ariel Pearce, Md Alimoor Reza, Syed Masum Billah

Abstract: This paper introduces a dataset for improving real-time object recognition systems to aid blind and low-vision (BLV) individuals in navigation tasks. The dataset comprises 21 videos of BLV individuals navigating outdoor spaces, and a taxonomy of 90 objects crucial for BLV navigation, refined through a focus group study. We also provide object labeling for the 90 objects across 31 video segments created from the 21 videos. A deeper analysis reveals that most contemporary datasets used in training computer vision models contain only a small subset of the taxonomy in our dataset. Preliminary evaluation of state-of-the-art computer vision models on our dataset highlights shortcomings in accurately detecting key objects relevant to BLV navigation, emphasizing the need for specialized datasets. We make our dataset publicly available, offering valuable resources for developing more inclusive navigation systems for BLV individuals.

new Occlusion-Aware 3D Motion Interpretation for Abnormal Behavior Detection

Authors: Su Li, Wang Liang, Jianye Wang, Ziheng Zhang, Lei Zhang

Abstract: Estimating abnormal posture based on 3D pose is vital in human pose analysis, yet it presents challenges, especially when reconstructing 3D human poses from monocular datasets with occlusions. Accurate reconstructions enable the restoration of 3D movements, which assist in extracting the semantic details necessary for analyzing abnormal behaviors. However, most existing methods depend on predefined key points as a basis for estimating the coordinates of occluded joints, and variations in data quality have adversely affected the performance of these models. In this paper, we present OAD2D, which detects motion abnormalities by reconstructing 3D coordinates of mesh vertices and human joints from monocular videos. OAD2D employs optical flow to capture motion prior information in video streams, enriching the information on occluded human movements and ensuring temporal-spatial alignment of poses. Moreover, we reformulate abnormal posture estimation by coupling it with a Motion to Text (M2T) model, in which VQVAE is employed to quantize motion features. This approach maps motion tokens to text tokens, allowing for a semantically interpretable analysis of motion and enhancing the generalization of abnormal posture detection with a language model. Our approach demonstrates the robustness of abnormal behavior detection against severe occlusions and self-occlusions, as it reconstructs human motion trajectories in global coordinates to effectively mitigate occlusion issues. Our method, validated on the Human3.6M, 3DPW, and NTU RGB+D datasets, achieves a high $F_1$-score of 0.94 on the NTU RGB+D dataset for medical condition detection. We will release all of our code and data.

new What Matters in Range View 3D Object Detection

Authors: Benjamin Wilson, Nicholas Autio Mitchell, Jhony Kaesemodel Pontes, James Hays

Abstract: Lidar-based perception pipelines rely on 3D object detection models to interpret complex scenes. While multiple representations for lidar exist, the range-view is enticing since it losslessly encodes the entire lidar sensor output. In this work, we achieve state-of-the-art amongst range-view 3D object detection models without using multiple techniques proposed in past range-view literature. We explore range-view 3D object detection across two modern datasets with substantially different properties: Argoverse 2 and Waymo Open. Our investigation reveals key insights: (1) input feature dimensionality significantly influences the overall performance, (2) surprisingly, employing a classification loss grounded in 3D spatial proximity works as well or better compared to more elaborate IoU-based losses, and (3) addressing non-uniform lidar density via a straightforward range subsampling technique outperforms existing multi-resolution, range-conditioned networks. Our experiments reveal that techniques proposed in recent range-view literature are not needed to achieve state-of-the-art performance. Combining the above findings, we establish a new state-of-the-art model for range-view 3D object detection -- improving AP by 2.2% on the Waymo Open dataset while maintaining a runtime of 10 Hz. We establish the first range-view model on the Argoverse 2 dataset and outperform strong voxel-based baselines. All models are multi-class and open-source. Code is available at https://github.com/benjaminrwilson/range-view-3d-detection.

URLs: https://github.com/benjaminrwilson/range-view-3d-detection.

new Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels

Authors: Jae Soon Baik, In Young Yoon, Kun Hoon Kim, Jun Won Choi

Abstract: Deep neural networks have demonstrated remarkable advancements in various fields using large, well-annotated datasets. However, real-world data often exhibit long-tailed distributions and label noise, significantly degrading generalization performance. Recent studies addressing these issues have focused on noisy sample selection methods that estimate the centroid of each class based on high-confidence samples within each target class. The performance of these methods is limited because they use only the training samples within each class for class centroid estimation, making the quality of centroids susceptible to long-tailed distributions and noisy labels. In this study, we present a robust training framework called Distribution-aware Sample Selection and Contrastive Learning (DaSC). Specifically, DaSC introduces a Distribution-aware Class Centroid Estimation (DaCC) to generate enhanced class centroids. DaCC performs weighted averaging of the features from all samples, with weights determined based on model predictions. Additionally, we propose a confidence-aware contrastive learning strategy to obtain balanced and robust representations. The training samples are categorized into high-confidence and low-confidence samples. Our method then applies Semi-supervised Balanced Contrastive Loss (SBCL) using high-confidence samples, leveraging reliable label information to mitigate class bias. For the low-confidence samples, our method computes Mixup-enhanced Instance Discrimination Loss (MIDL) to improve their representations in a self-supervised manner. Our experimental results on CIFAR and real-world noisy-label datasets demonstrate the superior performance of the proposed DaSC compared to previous approaches.
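
As a concrete picture of the DaCC step described above, here is a minimal sketch (our reading of the abstract, not the released code): class centroids are weighted averages over all training samples, with weights taken from the model's softmax predictions rather than from the possibly noisy labels.

    import torch
    import torch.nn.functional as F

    def estimate_centroids(features, logits):
        """features: (N, D) sample features; logits: (N, C) model outputs -> (C, D) centroids."""
        probs = logits.softmax(dim=1)                                  # prediction-based weights
        weighted_sum = probs.T @ features                              # (C, D)
        weights = probs.sum(dim=0, keepdim=True).T.clamp_min(1e-8)     # (C, 1) normalisers
        return F.normalize(weighted_sum / weights, dim=1)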

new Fusion and Cross-Modal Transfer for Zero-Shot Human Action Recognition

Authors: Abhi Kamboj, Anh Duy Nguyen, Minh Do

Abstract: Despite living in a multi-sensory world, most AI models are limited to textual and visual interpretations of human motion and behavior. Inertial measurement units (IMUs) provide a salient signal to understand human motion; however, they are challenging to use due to their uninterpretability and the scarcity of their data. We investigate a method to transfer knowledge between visual and inertial modalities using the structure of an informative joint representation space designed for human action recognition (HAR). We apply the resulting Fusion and Cross-modal Transfer (FACT) method to a novel setup, where the model does not have access to labeled IMU data during training and is able to perform HAR with only IMU data during testing. Extensive experiments on a wide range of RGB-IMU datasets demonstrate that FACT significantly outperforms existing methods in zero-shot cross-modal transfer.

new AI-Enhanced 7-Point Checklist for Melanoma Detection Using Clinical Knowledge Graphs and Data-Driven Quantification

Authors: Yuheng Wang, Tianze Yu, Jiayue Cai, Sunil Kalia, Harvey Lui, Z. Jane Wang, Tim K. Lee

Abstract: The 7-point checklist (7PCL) is widely used in dermoscopy to identify malignant melanoma lesions needing urgent medical attention. It assigns point values to seven attributes: major attributes are worth two points each, and minor ones are worth one point each. A total score of three or higher prompts further evaluation, often including a biopsy. However, a significant limitation of current methods is the uniform weighting of attributes, which leads to imprecision and neglects their interconnections. Previous deep learning studies have treated the prediction of each attribute with the same importance as predicting melanoma, which fails to recognize the clinical significance of the attributes for melanoma. To address these limitations, we introduce a novel diagnostic method that integrates two innovative elements: a Clinical Knowledge-Based Topological Graph (CKTG) and a Gradient Diagnostic Strategy with Data-Driven Weighting Standards (GD-DDW). The CKTG integrates 7PCL attributes with diagnostic information, revealing both internal and external associations. By employing adaptive receptive domains and weighted edges, we establish connections among melanoma's relevant features. Concurrently, GD-DDW emulates dermatologists' diagnostic processes, who first observe the visual characteristics associated with melanoma and then make predictions. Our model uses two imaging modalities for the same lesion, ensuring comprehensive feature acquisition. Our method shows outstanding performance in predicting malignant melanoma and its features, achieving an average AUC value of 85%. This was validated on the EDRA dataset, the largest publicly available dataset for the 7-point checklist algorithm. Specifically, the integrated weighting system can provide clinicians with valuable data-driven benchmarks for their evaluations.
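
For reference, the clinical scoring rule that the abstract summarises (two points per major attribute, one per minor attribute, refer at a total of three or more) is trivial to state in code; the attribute names follow the usual 7PCL convention and are not taken from the paper's implementation.

    MAJOR = {"atypical_pigment_network", "blue_white_veil", "atypical_vascular_pattern"}
    MINOR = {"irregular_streaks", "irregular_pigmentation",
             "irregular_dots_globules", "regression_structures"}

    def seven_point_score(present_attributes):
        """present_attributes: set of attribute names detected for a lesion."""
        score = 2 * len(MAJOR & present_attributes) + len(MINOR & present_attributes)
        return score, score >= 3  # True -> refer for further evaluation (e.g., biopsy)

    print(seven_point_score({"blue_white_veil", "irregular_streaks"}))  # (3, True)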

new SINDER: Repairing the Singular Defects of DINOv2

Authors: Haoqi Wang, Tong Zhang, Mathieu Salzmann

Abstract: Vision Transformer models trained on large-scale datasets, although effective, often exhibit artifacts in the patch tokens they extract. While such defects can be alleviated by re-training the entire model with additional classification tokens, the underlying reasons for the presence of these artifacts remain unclear. In this paper, we conduct a thorough investigation of this phenomenon, combining theoretical analysis with empirical observations. Our findings reveal that these artifacts originate from the pre-trained network itself, specifically stemming from the leading left singular vector of the network's weights. Furthermore, to mitigate these defects, we propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset, thereby avoiding the need for complete re-training. We validate our method on various downstream tasks, including unsupervised segmentation, classification, supervised segmentation, and depth estimation, demonstrating its effectiveness in improving model performance. Codes and checkpoints are available at https://github.com/haoqiwang/sinder.

URLs: https://github.com/haoqiwang/sinder.

new A Multi-Level Hierarchical Framework for the Classification of Weather Conditions and Hazard Prediction

Authors: Harish Neelam

Abstract: This paper presents a multi-level hierarchical framework for the classification of weather conditions and hazard prediction. In recent years, the importance of data has grown significantly, with various types like text, numbers, images, audio, and videos playing a key role. Among these, images make up a large portion of the data available. This application shows promise for various purposes, especially when combined with decision support systems for traffic management, afforestation, and weather forecasting. It is particularly useful in situations where traditional weather predictions are not very accurate, such as ensuring the safe operation of self-driving cars in dangerous weather. While previous studies have looked at this topic with fewer categories, this paper focuses on eleven specific types of weather images. The goal is to create a model that can accurately predict weather conditions after being trained on a large dataset of images. Accuracy is crucial in real-life situations to prevent accidents, making it the top priority for this paper. This work lays the groundwork for future applications in weather prediction, especially in situations where human expertise is not available or may be biased. The framework, capable of classifying images into eleven weather categories, including dew, frost, glaze, rime, snow, hail, rain, lightning, rainbow, and sandstorm, provides real-time weather information with an accuracy of 0.9329. The proposed framework addresses the growing need for accurate weather classification and hazard prediction, offering a robust solution for various applications in the field.

new CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Authors: Jihyung Kil, Zheda Mai, Justin Lee, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, Arpita Chowdhury, Wei-Lun Chao

Abstract: The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and CLIP similarity scores. These image pairs span a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance. We use CompBench to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6. Our results reveal notable shortcomings in their comparative abilities. We believe CompBench not only sheds light on these limitations but also establishes a solid foundation for future enhancements in the comparative capability of MLLMs.

new PathwayBench: Assessing Routability of Pedestrian Pathway Networks Inferred from Multi-City Imagery

Authors: Yuxiang Zhang, Bill Howe, Sachin Mehta, Nicholas-J Bolten, Anat Caspi

Abstract: Applications to support pedestrian mobility in urban areas require a complete and routable graph representation of the built environment. Globally available information, including aerial imagery, provides a scalable source for constructing these path networks, but the associated learning problem is challenging: relative to road network pathways, pedestrian network pathways are narrower, more frequently disconnected, often visually and materially variable over smaller areas, and their boundaries are broken up by driveway incursions, alleyways, and marked or unmarked crossings through roadways. Existing algorithms to extract pedestrian pathway network graphs are inconsistently evaluated and tend to ignore routability, making it difficult to assess utility for mobility applications: even if all path segments are available, discontinuities could dramatically and arbitrarily shift the overall path taken by a pedestrian. In this paper, we describe a first standard benchmark for the pedestrian pathway graph extraction problem, comprising the largest available dataset equipped with manually vetted ground truth annotations (covering $3,000 km^2$ of land area in regions from 8 cities), and a family of evaluation metrics centering routability and downstream utility. By partitioning the data into polygons at the scale of individual intersections, we compute local routability as an efficient proxy for global routability. We consider multiple measures of polygon-level routability and compare predicted measures with ground truth to construct evaluation metrics. Using these metrics, we show that this benchmark can surface strengths and weaknesses of existing methods that are hidden by simple edge-counting metrics over single-region datasets used in prior work, representing a challenging, high-impact problem in computer vision and machine learning.

new SAR to Optical Image Translation with Color Supervised Diffusion Model

Authors: Xinyu Bai, Feng Xu

Abstract: Synthetic Aperture Radar (SAR) offers all-weather, high-resolution imaging capabilities, but its complex imaging mechanism often poses challenges for interpretation. In response to these limitations, this paper introduces an innovative generative model designed to transform SAR images into more intelligible optical images, thereby enhancing the interpretability of SAR images. Specifically, our model backbone is based on recent diffusion models, which have powerful generative capabilities. We employ SAR images as conditional guides in the sampling process and integrate color supervision to counteract color shift issues effectively. We conducted experiments on the SEN12 dataset and employed quantitative evaluations using peak signal-to-noise ratio, structural similarity, and Fr\'echet inception distance. The results demonstrate that our model not only surpasses previous methods in quantitative assessments but also significantly enhances the visual quality of the generated images.

new McGAN: Generating Manufacturable Designs by Embedding Manufacturing Rules into Conditional Generative Adversarial Network

Authors: Zhichao Wang, Xiaoliang Yan, Shreyes Melkote, David Rosen

Abstract: Generative design (GD) methods aim to automatically generate a wide variety of designs that satisfy functional or aesthetic design requirements. However, research to date generally lacks consideration of the manufacturability of the generated designs. To this end, we propose a novel GD approach that uses deep neural networks to encode design for manufacturing (DFM) rules, thereby modifying part designs to make them manufacturable by a given manufacturing process. Specifically, a three-step approach is proposed: first, an instance segmentation method, Mask R-CNN, is used to decompose a part design into subregions. Second, a conditional generative adversarial network (cGAN), Pix2Pix, transforms unmanufacturable decomposed subregions into manufacturable subregions. Third, the transformed subregions are reintegrated into a unified manufacturable design. These three steps, Mask R-CNN, Pix2Pix, and reintegration, form the basis of the proposed Manufacturable conditional GAN (McGAN) framework. Experimental results show that McGAN can automatically transform existing unmanufacturable designs into manufacturable counterparts that realize the specified manufacturing rules in an efficient and robust manner. The effectiveness of McGAN is demonstrated through two-dimensional design case studies of an injection molding process.

new Affective Behaviour Analysis via Progressive Learning

Authors: Chen Liu, Wei Zhang, Feng Qiu, Lincheng Li, Xin Yu

Abstract: Affective Behavior Analysis aims to develop emotionally intelligent technology that can recognize and respond to human emotions. To advance this, the 7th Affective Behavior Analysis in-the-wild (ABAW) competition establishes two tracks: the Multi-task Learning (MTL) Challenge and the Compound Expression (CE) Challenge, based on the Aff-Wild2 and C-EXPR-DB datasets. In this paper, we present our methods and experimental results for the two competition tracks. Our approach can be summarized in the following four aspects: 1) To attain high-quality facial features, we train a Masked Autoencoder in a self-supervised manner. 2) We devise a temporal convergence module to capture the temporal information between video frames and explore the impact of window size and sequence length on each sub-task. 3) To facilitate the joint optimization of various sub-tasks, we explore the impact of sub-task joint training and of fusing features from individual tasks on each task's performance. 4) We utilize curriculum learning to transition the model from recognizing single expressions to recognizing compound expressions, thereby improving the accuracy of compound expression recognition. Extensive experiments demonstrate the superiority of our designs.

new Open Challenges on Fairness of Artificial Intelligence in Medical Imaging Applications

Authors: Enzo Ferrante, Rodrigo Echeveste

Abstract: Recently, the research community of computerized medical imaging has started to discuss and address potential fairness issues that may emerge when developing and deploying AI systems for medical image analysis. This chapter covers some of the pressing challenges encountered when doing research in this area, and it is intended to raise questions and provide food for thought for those aiming to enter this research field. The chapter first discusses various sources of bias, including data collection, model training, and clinical deployment, and their impact on the fairness of machine learning algorithms in medical image computing. We then turn to discussing open challenges that we believe require attention from researchers and practitioners, as well as potential pitfalls of naive application of common methods in the field. We cover a variety of topics including the impact of biased metrics when auditing for fairness, the leveling down effect, task difficulty variations among subgroups, discovering biases in unseen populations, and explaining biases beyond standard demographic attributes.

new DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

Authors: Jiasen Wang, Zhenglin Li, Ke Sun, Xianyuan Liu, Yang Zhou

Abstract: Sparse query-based paradigms have achieved significant success in multi-view 3D detection for autonomous vehicles. Current research faces challenges in balancing between enlarging receptive fields and reducing interference when aggregating multi-view features. Moreover, different camera poses present challenges in training global attention models. To address these problems, this paper proposes a divided view method, in which features are modeled globally via a visibility cross-attention mechanism but interact only with partial features in a divided local virtual space. This effectively reduces interference from other irrelevant features and alleviates the training difficulties of the transformer by decoupling the position embedding from camera poses. Additionally, 2D historical RoI features are incorporated into the object-centric temporal modeling to utilize high-level visual semantic information. The model is trained using a one-to-many assignment strategy to facilitate stability. Our framework, named DVPE, achieves state-of-the-art performance (57.2% mAP and 64.5% NDS) on the nuScenes test set. Codes will be available at https://github.com/dop0/DVPE.

URLs: https://github.com/dop0/DVPE.

new Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal

Authors: Yeying Jin, Xin Li, Jiadong Wang, Yan Zhang, Malu Zhang

Abstract: Existing raindrop removal datasets have two shortcomings. First, they consist of images captured by cameras focused on the background, leading to the presence of blurry raindrops. To our knowledge, none of these datasets include images where the focus is specifically on raindrops, which results in a blurry background. Second, these datasets predominantly consist of daytime images, thereby lacking nighttime raindrop scenarios. Consequently, algorithms trained on these datasets may struggle to perform effectively in raindrop-focused or nighttime scenarios. The absence of datasets specifically designed for raindrop-focused and nighttime raindrops constrains research in this area. In this paper, we introduce a large-scale, real-world raindrop removal dataset called Raindrop Clarity. Raindrop Clarity comprises 15,186 high-quality pairs/triplets (raindrops, blur, and background) of images with raindrops and the corresponding clear background images. There are 5,442 daytime raindrop images and 9,744 nighttime raindrop images. Specifically, the 5,442 daytime images include 3,606 raindrop-focused and 1,836 background-focused images, while the 9,744 nighttime images contain 4,838 raindrop-focused and 4,906 background-focused images. Our dataset will enable the community to explore background-focused and raindrop-focused images, including challenges unique to daytime and nighttime conditions. Our data and code are available at: \url{https://github.com/jinyeying/RaindropClarity}

URLs: https://github.com/jinyeying/RaindropClarity

new Pose Estimation from Camera Images for Underwater Inspection

Authors: Luyuan Peng, Hari Vishnu, Mandar Chitre, Yuen Min Too, Bharath Kalyan, Rajat Mishra, Soo Pieng Tan

Abstract: High-precision localization is pivotal in underwater reinspection missions. Traditional localization methods like inertial navigation systems, Doppler velocity loggers, and acoustic positioning face significant challenges and are not cost-effective for some applications. Visual localization is a cost-effective alternative in such cases, leveraging the cameras already equipped on inspection vehicles to estimate poses from images of the surrounding scene. Amongst these, machine learning-based pose estimation from images shows promise in underwater environments, performing efficient relocalization using models trained based on previously mapped scenes. We explore the efficacy of learning-based pose estimators in both clear and turbid water inspection missions, assessing the impact of image formats, model architectures and training data diversity. We innovate by employing novel view synthesis models to generate augmented training data, significantly enhancing pose estimation in unexplored regions. Moreover, we enhance localization accuracy by integrating pose estimator outputs with sensor data via an extended Kalman filter, demonstrating improved trajectory smoothness and accuracy.

new Selective Vision-Language Subspace Projection for Few-shot CLIP

Authors: Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang

Abstract: Vision-language models such as CLIP are capable of mapping the different modality data into a unified feature space, enabling zero/few-shot inference by measuring the similarity of given images and texts. However, most existing methods overlook the modality gap in CLIP's encoded features, manifested as the text and image features lying far apart from each other, which results in limited classification performance. To tackle this issue, we introduce a method called Selective Vision-Language Subspace Projection (SSP), which incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs. Specifically, our SSP framework comprises two parallel modules: a vision projector and a language projector. Both projectors utilize local image features to span the respective subspaces for images and texts, thereby projecting the image and text features into their respective subspaces to achieve alignment. Moreover, our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks. Extensive experiments on 11 datasets demonstrate SSP's superior text-image alignment capabilities, outperforming the state-of-the-art alignment methods. The code is available at https://github.com/zhuhsingyuu/SSP

URLs: https://github.com/zhuhsingyuu/SSP
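
To make the training-free projection idea concrete, here is a hedged sketch of one way local image features can span a subspace into which both image and text features are projected before scoring. This is our reading of the abstract, not the released SSP code; CLIP feature extraction is assumed to happen elsewhere.

    import torch
    import torch.nn.functional as F

    def subspace_projector(local_feats):
        """local_feats: (M, D) selected local image features spanning the subspace."""
        V = local_feats.T                                  # (D, M)
        return V @ torch.linalg.pinv(V)                    # (D, D) orthogonal projector onto span(V)

    def project_and_score(image_feat, text_feats, local_feats):
        """image_feat: (D,); text_feats: (C, D) class text features -> (C,) similarity scores."""
        P = subspace_projector(local_feats)
        img_p = F.normalize(P @ image_feat, dim=-1)
        txt_p = F.normalize(text_feats @ P, dim=-1)        # P is symmetric, so P == P.T
        return txt_p @ img_p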

new Case-Enhanced Vision Transformer: Improving Explanations of Image Similarity with a ViT-based Similarity Metric

Authors: Ziwei Zhao, David Leake, Xiaomeng Ye, David Crandall

Abstract: This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases, to illustrate aspects of similarity relevant to those cases.

new Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Authors: Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji

Abstract: This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, an exquisite synthetic dataset by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and object descriptions. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from only text. Extensive experiments demonstrate that Diffree excels in adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.

new DreamCar: Leveraging Car-specific Prior for in-the-wild 3D Car Reconstruction

Authors: Xiaobiao Du, Haiyang Sun, Ming Lu, Tianqing Zhu, Xin Yu

Abstract: Self-driving industries usually employ professional artists to build exquisite 3D cars. However, it is expensive to craft large-scale digital assets. Since there are already numerous datasets available that contain a vast number of images of cars, we focus on reconstructing high-quality 3D car models from these datasets. However, these datasets contain only one side of each car, captured in forward-moving scenes. We try to use existing generative models to provide more supervision information, but they struggle to generalize well to cars since they are trained on synthetic datasets that are not car-specific. In addition, the reconstructed 3D car texture misaligns due to a large error in camera pose estimation when dealing with in-the-wild images. These restrictions make it challenging for previous methods to reconstruct complete 3D cars. To address these problems, we propose a novel method, named DreamCar, which can reconstruct high-quality 3D cars given a few images or even a single image. To generalize the generative model, we collect a car dataset, named Car360, with over 5,600 vehicles. With this dataset, we make the generative model more robust to cars. We use this car-specific generative prior to guide reconstruction via Score Distillation Sampling. To further complement the supervision information, we utilize the geometric and appearance symmetry of cars. Finally, we propose a pose optimization method that rectifies poses to tackle texture misalignment. Extensive experiments demonstrate that our method significantly outperforms existing methods in reconstructing high-quality 3D cars. \href{https://xiaobiaodu.github.io/dreamcar-project/}{Our code is available.}

URLs: https://xiaobiaodu.github.io/dreamcar-project/

new LoFormer: Local Frequency Transformer for Image Deblurring

Authors: Xintian Mao, Jiansheng Wang, Xingran Xie, Qingli Li, Yan Wang

Abstract: Due to the computational complexity of self-attention (SA), prevalent techniques for image deblurring often resort to either adopting localized SA or employing coarse-grained global SA methods, both of which exhibit drawbacks such as compromising global modeling or lacking fine-grained correlation. In order to address this issue by effectively modeling long-range dependencies without sacrificing fine-grained details, we introduce a novel approach termed Local Frequency Transformer (LoFormer). Within each unit of LoFormer, we incorporate a Local Channel-wise SA in the frequency domain (Freq-LC) to simultaneously capture cross-covariance within low- and high-frequency local windows. These operations offer the advantage of (1) ensuring equitable learning opportunities for both coarse-grained structures and fine-grained details, and (2) exploring a broader range of representational properties compared to coarse-grained global SA methods. Additionally, we introduce an MLP Gating mechanism complementary to Freq-LC, which serves to filter out irrelevant features while enhancing global learning capabilities. Our experiments demonstrate that LoFormer significantly improves performance in the image deblurring task, achieving a PSNR of 34.09 dB on the GoPro dataset with 126G FLOPs. https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur

URLs: https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur

new Progressive Query Refinement Framework for Bird's-Eye-View Semantic Segmentation from Surrounding Images

Authors: Dooseop Choi, Jungyu Kang, Taeghyun An, Kyounghwan Ahn, KyoungWook Min

Abstract: Expressing images with Multi-Resolution (MR) features has been widely adopted in many computer vision tasks. In this paper, we introduce the MR concept into Bird's-Eye-View (BEV) semantic segmentation for autonomous driving. This introduction enhances our model's ability to capture both global and local characteristics of driving scenes through our proposed residual learning. Specifically, given a set of MR BEV query maps, the lowest-resolution query map is initially updated using a View Transformation (VT) encoder. This updated query map is then upscaled and merged with a higher-resolution query map to undergo further updates in a subsequent VT encoder. This process is repeated until the resolution of the updated query map reaches the target. Finally, the lowest-resolution map is added to the target-resolution map to generate the final query map. During training, we enforce both the lowest and final query maps to align with the ground-truth BEV semantic map to help our model effectively capture the global and local characteristics. We also propose a visual feature interaction network that promotes interactions between features across images and across feature levels, contributing substantially to the performance improvement. We evaluate our model on a large-scale real-world dataset. The experimental results show that our model outperforms the SOTA models in terms of the IoU metric. Codes are available at https://github.com/d1024choi/ProgressiveQueryRefineNet

URLs: https://github.com/d1024choi/ProgressiveQueryRefineNet
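
The coarse-to-fine refinement loop described above can be sketched as follows. This is only our reading of the abstract: vt_encoders stands in for the paper's view-transformation encoders, the merge is shown as a plain residual addition, and the second return value is the lowest-resolution map that is also supervised during training.

    import torch
    import torch.nn.functional as F

    def progressive_refine(query_maps, vt_encoders, image_feats):
        """query_maps: list of (B, C, H_i, W_i) BEV query maps, lowest resolution first."""
        q = vt_encoders[0](query_maps[0], image_feats)          # update the coarsest map
        lowest = q
        for level in range(1, len(query_maps)):
            up = F.interpolate(q, size=query_maps[level].shape[-2:],
                               mode="bilinear", align_corners=False)
            q = vt_encoders[level](query_maps[level] + up, image_feats)   # merge, then update
        # Residual connection from the (upsampled) lowest-resolution map to the target map.
        up_lowest = F.interpolate(lowest, size=q.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return q + up_lowest, lowest

    # Usage with identity "encoders" as stand-ins:
    # maps = [torch.zeros(1, 64, s, s) for s in (25, 50, 100)]
    # final_map, aux_map = progressive_refine(maps, [lambda q, f: q] * 3, image_feats=None)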

new EAFormer: Scene Text Segmentation with Edge-Aware Transformers

Authors: Haiyang Yu, Teng Fu, Bin Li, Xiangyang Xue

Abstract: Scene text segmentation aims at cropping texts from scene images, which is usually used to help generative models edit or remove texts. The existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edge of texts. Specifically, we first design a text edge extractor to detect edges and filter out edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly-used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method can perform better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our methods, we have relabeled these datasets. Through experiments, we observe that our method can achieve a higher performance improvement when more accurate annotations are used for training.

new Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside Waste

Authors: Qinfeng Zhu, Ningxin Weng, Lei Fan, Yuanzhi Cai

Abstract: Environmental monitoring of lakeside green areas is crucial for environmental protection. Compared to manual inspections, computer vision technologies offer a more efficient solution when deployed on-site. Multispectral imaging provides diverse information about objects under different spectra, aiding in the differentiation between waste and lakeside lawn environments. This study introduces WasteMS, the first multispectral dataset established for the semantic segmentation of lakeside waste. WasteMS includes a diverse range of waste types in lawn environments, captured under various lighting conditions. We implemented a rigorous annotation process to label waste in images. Representative semantic segmentation frameworks were used to evaluate segmentation accuracy on WasteMS. Challenges encountered when using WasteMS for segmenting waste on lakeside lawns are discussed. The WasteMS dataset is available at https://github.com/zhuqinfeng1999/WasteMS.

URLs: https://github.com/zhuqinfeng1999/WasteMS.

new Q-Ground: Image Quality Grounding with Large Multi-modality Models

Authors: Chaofeng Chen, Sensen Yang, Haoning Wu, Liang Liao, Zicheng Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

Abstract: Recent advances in large multi-modality models (LMMs) have greatly improved the ability of image quality assessment (IQA) methods to evaluate and explain the quality of visual content. However, these advancements are mostly focused on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, remains largely unexplored. In this work, we introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding by combining large multi-modality models with detailed visual quality analysis. Central to our contribution is the introduction of the QGround-100K dataset, a novel resource containing 100K triplets of (image, quality text, distortion segmentation) to facilitate deep investigations into visual quality. The dataset comprises two parts: one with human-labeled annotations for accurate quality assessment, and another labeled automatically by LMMs such as GPT-4V, which helps improve the robustness of model training while also reducing the costs of data collection. With the QGround-100K dataset, we propose an LMM-based method equipped with multi-scale feature learning to learn models capable of performing both image quality answering and distortion segmentation based on text prompts. This dual-capability approach not only refines the model's understanding of region-aware image quality but also enables it to interactively respond to complex, text-based queries about image quality and specific distortions. Q-Ground takes a step towards sophisticated visual quality analysis at a finer scale, establishing a new benchmark for future research in the area. Codes and dataset are available at https://github.com/Q-Future/Q-Ground.

URLs: https://github.com/Q-Future/Q-Ground.

new DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting

Authors: Linus H\"arenstam-Nielsen, Lu Sang, Abhishek Saroha, Nikita Araslanov, Daniel Cremers

Abstract: Neural implicit surfaces can be used to recover accurate 3D geometry from imperfect point clouds. In this work, we show that state-of-the-art techniques work by minimizing an approximation of a one-sided Chamfer distance. This shape metric is not symmetric, as it only ensures that the point cloud is near the surface but not vice versa. As a consequence, existing methods can produce inaccurate reconstructions with spurious surfaces. Although one approach against spurious surfaces has been widely used in the literature, we show theoretically and experimentally that it is equivalent to regularizing the surface area, resulting in over-smoothing. As a more appealing alternative, we propose DiffCD, a novel loss function corresponding to the symmetric Chamfer distance. In contrast to previous work, DiffCD also ensures that the surface is near the point cloud, which eliminates spurious surfaces without the need for additional regularization. We experimentally show that DiffCD reliably recovers a high degree of shape detail, substantially outperforming existing work across varying surface complexity and noise levels. Project code is available at https://github.com/linusnie/diffcd.

URLs: https://github.com/linusnie/diffcd.
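
For readers unfamiliar with the metric, the symmetric Chamfer distance that the DiffCD loss corresponds to is shown below for two point sets. This is a plain reference implementation of the metric, not the paper's implicit-surface loss: the point is that both directions are kept, so neither set may drift away from the other, whereas the one-sided variant keeps only one of the two terms.

    import torch

    def symmetric_chamfer(points_a, points_b):
        """points_a: (N, 3); points_b: (M, 3) -> scalar symmetric Chamfer distance."""
        d = torch.cdist(points_a, points_b)          # (N, M) pairwise Euclidean distances
        a_to_b = d.min(dim=1).values.mean()          # each point in A to its nearest in B
        b_to_a = d.min(dim=0).values.mean()          # each point in B to its nearest in A
        return 0.5 * (a_to_b + b_to_a)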

new High Efficiency Image Compression for Large Visual-Language Models

Authors: Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye

Abstract: In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed based on their representation and discrimination capability using token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained with losses based on semantic tokens of the large model, which introduces enhanced generalization capability for various data and tasks. Experimental results demonstrate that the proposed framework efficiently achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks have revealed the robustness and generalization capability of the proposed framework.

new AI-based Density Recognition

Authors: Simone M\"uller, Daniel Kolb, Matthias M\"uller, Dieter Kranzlm\"uller

Abstract: Learning-based analysis of images is commonly used in the fields of mobility and robotics for safe environmental motion and interaction. This requires not only object recognition but also the assignment of certain properties to them. With the help of this information, causally related actions can be adapted to different circumstances. Such logical interactions can be optimized by recognizing object-assigned properties. Density as a physical property offers the possibility to recognize how heavy an object is, which material it is made of, which forces are at work, and consequently which influence it has on its environment. Our approach introduces an AI-based concept for assigning physical properties to objects through the use of associated images. Based on synthesized data, we derive specific patterns from 2D images using a neural network to extract further information such as volume, material, or density. Accordingly, we discuss the possibilities of property-based feature extraction to improve causally related logics.

new When Text and Images Don't Mix: Bias-Correcting Language-Image Similarity Scores for Anomaly Detection

Authors: Adam Goodge, Bryan Hooi, Wee Siong Ng

Abstract: Contrastive Language-Image Pre-training (CLIP) achieves remarkable performance in various downstream tasks through the alignment of image and text input embeddings and holds great promise for anomaly detection. However, our empirical experiments show that the embeddings of text inputs unexpectedly cluster tightly together, far away from image embeddings, contrary to the model's contrastive training objective to align image-text input pairs. We show that this phenomenon induces a `similarity bias' - in which false negative and false positive errors occur due to bias in the similarities between images and the normal label text embeddings. To address this bias, we propose a novel methodology called BLISS which directly accounts for this similarity bias through the use of an auxiliary, external set of text inputs. BLISS is simple: it does not require strong inductive biases about anomalous behaviour nor an expensive training process, and it significantly outperforms baseline methods on benchmark image datasets, even when access to normal data is extremely limited.
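
The abstract does not spell out the exact correction, so the following is only an illustrative sketch of the general idea: an image's similarity to the normal-label text is judged relative to its similarities with an auxiliary, external set of text embeddings rather than taken at face value.

    import torch
    import torch.nn.functional as F

    def bias_corrected_anomaly_score(image_feat, normal_text_feat, auxiliary_text_feats):
        """image_feat: (D,); normal_text_feat: (D,); auxiliary_text_feats: (K, D)."""
        image_feat = F.normalize(image_feat, dim=-1)
        sim_normal = image_feat @ F.normalize(normal_text_feat, dim=-1)
        sim_aux = F.normalize(auxiliary_text_feats, dim=-1) @ image_feat     # (K,)
        # Higher value = more anomalous: low similarity to the normal label once the
        # image's overall affinity to text embeddings is accounted for.
        return (sim_aux.mean() - sim_normal).item()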

new OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos

Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman

Abstract: We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced as over), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end time of the repetitions, and also a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a huge variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance assessed with and without using text to specify the target class to count. The performance is also compared to a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/

URLs: https://sites.google.com/view/openvocabreps/

new MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models

Authors: Chunsan Hong, Tae-Hyun Oh, Minhyuk Sung

Abstract: Diffusion models have achieved remarkable success in Text-to-Image generation tasks, leading to the development of many commercial models. However, recent studies have reported that diffusion models often replicate images from their training data when triggered by specific prompts, potentially raising social issues ranging from copyright to privacy concerns. To sidestep such memorization, recent studies have developed memorization mitigation methods for diffusion models. Nevertheless, the lack of benchmarks impedes the assessment of the true effectiveness of these methods. In this work, we present MemBench, the first benchmark for evaluating image memorization mitigation methods. Our benchmark includes a large number of memorized image trigger prompts in Stable Diffusion, the most widely used model today. Furthermore, in contrast to prior work that evaluates mitigation performance only on trigger prompts, we present metrics that evaluate both trigger prompts and general prompts, so that we can see whether mitigation methods address the memorization issue while maintaining performance on general prompts. This is an important development considering the practical applications that previous works have overlooked. Through evaluation on MemBench, we verify that the performance of existing image memorization mitigation methods is still insufficient for application to diffusion models.

new PiPa++: Towards Unification of Domain Adaptive Semantic Segmentation via Self-supervised Learning

Authors: Mu Chen, Zhedong Zheng, Yi Yang

Abstract: Unsupervised domain adaptive segmentation aims to improve the segmentation accuracy of models on target domains without relying on labeled data from those domains. This approach is crucial when labeled target domain data is scarce or unavailable. It seeks to align the feature representations of the source domain (where labeled data is available) and the target domain (where only unlabeled data is present), thus enabling the model to generalize well to the target domain. Current image- and video-level domain adaptation has been addressed using different and specialized frameworks, training strategies, and optimizations despite their underlying connections. In this paper, we propose a unified framework, PiPa++, which leverages the core idea of ``comparing'' to (1) explicitly encourage learning of discriminative pixel-wise features with intra-class compactness and inter-class separability, (2) promote the robust feature learning of the identical patch against different contexts or fluctuations, and (3) enable the learning of temporal continuity under dynamic environments. With the designed task-smart contrastive sampling strategy, PiPa++ enables the mining of more informative training samples according to the task demand. Extensive experiments demonstrate the effectiveness of our method on both image-level and video-level domain adaptation benchmarks. Moreover, the proposed method is compatible with other UDA approaches to further improve the performance without introducing extra parameters.

new A Self-Supervised Image Registration Approach for Measuring Local Response Patterns in Metastatic Ovarian Cancer

Authors: In\^es P. Machado, Anna Reithmeir, Fryderyk Kogl, Leonardo Rundo, Gabriel Funingana, Marika Reinius, Gift Mungmeeprued, Zeyu Gao, Cathal McCague, Eric Kerfoot, Ramona Woitek, Evis Sala, Yangming Ou, James Brenton, Julia Schnabel, Mireia Crispin

Abstract: High-grade serous ovarian carcinoma (HGSOC) is characterised by significant spatial and temporal heterogeneity, typically manifesting at an advanced metastatic stage. A major challenge in treating advanced HGSOC is effectively monitoring localised change in tumour burden across multiple sites during neoadjuvant chemotherapy (NACT) and predicting long-term pathological response and overall patient survival. In this work, we propose a self-supervised deformable image registration algorithm that utilises a general-purpose image encoder for image feature extraction to co-register contrast-enhanced computerised tomography scan images acquired before and after neoadjuvant chemotherapy. This approach addresses challenges posed by highly complex tumour deformations and longitudinal lesion matching during treatment. Localised tumour changes are calculated using the Jacobian determinant maps of the registration deformation at multiple disease sites and their macroscopic areas, including hypo-dense (i.e., cystic/necrotic), hyper-dense (i.e., calcified), and intermediate-density (i.e., soft tissue) portions. A series of experiments is conducted to understand the role of a general-purpose image encoder and its application in quantifying change in tumour burden during neoadjuvant chemotherapy in HGSOC. This work is the first to demonstrate the feasibility of a self-supervised image registration approach in quantifying NACT-induced localised tumour changes across the whole disease burden of patients with complex multi-site HGSOC, which could be used as a potential marker for ovarian cancer patients' long-term pathological response and survival.
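
The Jacobian-determinant maps mentioned above are a standard way to quantify local volume change under a deformation; a minimal sketch (not the authors' pipeline) is shown below. Values below 1 indicate local shrinkage of the registered region and values above 1 indicate expansion.

    import numpy as np

    def jacobian_determinant_map(displacement):
        """displacement: (3, D, H, W) voxel displacement field -> (D, H, W) determinant map."""
        # grads[i, j] = d(displacement_i) / d(axis_j)
        grads = np.stack([np.stack(np.gradient(displacement[i]), axis=0) for i in range(3)])
        jac = grads + np.eye(3).reshape(3, 3, 1, 1, 1)   # add identity: full deformation Jacobian
        jac = np.moveaxis(jac, (0, 1), (-2, -1))         # (D, H, W, 3, 3)
        return np.linalg.det(jac)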

new RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer

Authors: Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu

Abstract: In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and opens up a set of bag-of-freebies for flexibility and practicality, as well as optimizing the training strategy to achieve enhanced performance. To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention to achieve selective multi-scale feature extraction by the decoder. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator that is specific to RT-DETR compared to YOLOs. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameter customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.

URLs: https://github.com/lyuwenyu/RT-DETR.
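
The "discrete sampling operator" above can be pictured as replacing bilinear grid_sample with a nearest-index gather, which many inference runtimes support more readily. The PyTorch sketch below illustrates that idea under assumed tensor shapes; it is not the repository's actual operator.

```python
import torch
import torch.nn.functional as F

def bilinear_sample(feat, locs):
    """feat: (N, C, H, W); locs: (N, P, 2) normalized to [-1, 1] in (x, y) order."""
    grid = locs.unsqueeze(2)                                   # (N, P, 1, 2)
    out = F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
    return out.squeeze(-1)                                     # (N, C, P)

def discrete_sample(feat, locs):
    """Deployment-friendly variant: round to the nearest pixel and gather."""
    N, C, H, W = feat.shape
    x = ((locs[..., 0] + 1) * 0.5 * (W - 1)).round().clamp(0, W - 1).long()
    y = ((locs[..., 1] + 1) * 0.5 * (H - 1)).round().clamp(0, H - 1).long()
    idx = y * W + x                                            # (N, P) flat indices
    flat = feat.flatten(2)                                     # (N, C, H*W)
    return flat.gather(2, idx.unsqueeze(1).expand(-1, C, -1))  # (N, C, P)

feat = torch.randn(2, 8, 16, 16)
locs = torch.rand(2, 5, 2) * 2 - 1
print(bilinear_sample(feat, locs).shape, discrete_sample(feat, locs).shape)
```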

new XMeCap: Meme Caption Generation with Sub-Image Adaptability

Authors: Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao

Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. We then introduce the \textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, manifest a marked improvement in caption generation for both single-image and multi-image memes, as well as different meme categories. \textsc{XMeCap} achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71\% and 4.82\%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.

new FIIH: Fully Invertible Image Hiding for Secure and Robust

Authors: Lang Huang, Lin Huo, Zheng Gan, Xinrong He

Abstract: Image hiding is the study of techniques for covert storage and transmission: a secret image is embedded into a container image to generate a stego image that is similar in appearance to a normal image. However, existing image hiding methods share a serious problem: the hiding and revealing process is not fully invertible, so the revealing network cannot recover the secret image losslessly, making it impossible to simultaneously achieve high fidelity and secure transmission of the secret image in an insecure network environment. To solve this problem, this paper proposes a fully invertible image hiding architecture based on an invertible neural network, aiming to realize invertible hiding of secret images; the architecture is invertible with respect to both the data and the network. Based on this architecture, the method can withstand deep-learning-based image steganalysis. In addition, we propose a new method for enhancing the robustness of stego images against interference during transmission. Experiments demonstrate that the proposed FIIH significantly outperforms other state-of-the-art image hiding methods in hiding a single image, and also significantly outperforms other state-of-the-art methods in robustness and security.
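
As background on what "invertible on both data and network" typically means, the sketch below shows a standard additive coupling layer, the building block of many invertible neural networks: the forward pass embeds information and the inverse recovers it exactly up to floating-point error. It is a generic illustration, not the FIIH architecture.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """y1 = x1; y2 = x2 + f(x1). Exactly invertible for any sub-network f."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x1, x2):
        return x1, x2 + self.f(x1)

    def inverse(self, y1, y2):
        return y1, y2 - self.f(y1)

# Round-trip check: the "hidden" branch is recovered up to numerical precision.
layer = AdditiveCoupling(channels=3)
cover, secret = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
y1, y2 = layer(cover, secret)
x1, x2 = layer.inverse(y1, y2)
print(torch.allclose(x2, secret, atol=1e-5))  # True
```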

new Establishing Truly Causal Relationship Between Whole Slide Image Predictions and Diagnostic Evidence Subregions in Deep Learning

Authors: Tianhang Nan, Yong Ding, Hao Quan, Deliang Li, Mingchen Zou, Xiaoyu Cui

Abstract: In the field of deep learning-driven Whole Slide Image (WSI) classification, Multiple Instance Learning (MIL) has gained significant attention due to its ability to be trained using only slide-level diagnostic labels. Previous MIL research has primarily focused on enhancing feature aggregators for globally analyzing WSIs, but overlooks a causal relationship in diagnosis: a model's prediction should ideally stem solely from the regions of the image that contain diagnostic evidence (such as tumor cells), which usually occupy relatively small areas. To address this limitation and establish a truly causal relationship between model predictions and diagnostic evidence regions, we propose Causal Inference Multiple Instance Learning (CI-MIL). CI-MIL integrates feature distillation with a novel patch decorrelation mechanism, employing a two-stage causal inference approach to distill and process patches with high diagnostic value. Initially, CI-MIL leverages feature distillation to identify patches likely containing tumor cells and extracts their corresponding feature representations. These features are then mapped to random Fourier feature space, where a learnable weighting scheme is employed to minimize inter-feature correlations, effectively reducing redundancy from homogeneous patches and mitigating data bias. These processes strengthen the causal relationship between model predictions and diagnostically relevant regions, making the prediction more direct and reliable. Experimental results demonstrate that CI-MIL outperforms state-of-the-art methods. Additionally, CI-MIL exhibits superior interpretability, as its selected regions demonstrate high consistency with ground truth annotations, promising more reliable diagnostic assistance for pathologists.
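
A rough intuition for the "random Fourier feature space with a learnable weighting scheme" step: map patch features through a fixed random Fourier projection, then learn per-sample weights that penalize off-diagonal entries of the weighted feature covariance. The snippet below is a generic sketch of that idea with assumed shapes and hyperparameters, not the CI-MIL implementation.

```python
import math
import torch

def random_fourier_features(x, num_features=128, sigma=1.0, seed=0):
    """x: (N, D) -> (N, num_features), classic RFF approximation of an RBF kernel."""
    g = torch.Generator().manual_seed(seed)
    W = torch.randn(x.shape[1], num_features, generator=g) / sigma
    b = 2 * math.pi * torch.rand(num_features, generator=g)
    return (2.0 / num_features) ** 0.5 * torch.cos(x @ W + b)

def decorrelation_loss(z, sample_logits):
    """Penalize cross-feature covariance under learnable per-sample weights."""
    w = torch.softmax(sample_logits, dim=0).unsqueeze(1)   # (N, 1), sums to 1
    mean = (w * z).sum(0, keepdim=True)
    cov = (w * (z - mean)).t() @ (z - mean)                # weighted covariance (F, F)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum()

patches = torch.randn(256, 512)                 # N patch features of dim 512 (assumed)
z = random_fourier_features(patches)
logits = torch.zeros(256, requires_grad=True)   # learnable per-patch weights
loss = decorrelation_loss(z, logits)
loss.backward()
print(z.shape, loss.item())
```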

new Context-aware Multi-task Learning for Pedestrian Intent and Trajectory Prediction

Authors: Farzeen Munir, Tomasz Piotr Kucner

Abstract: The advancement of socially-aware autonomous vehicles hinges on precise modeling of human behavior. Within this broad paradigm, the specific challenge lies in accurately predicting pedestrians' trajectories and intentions. Traditional methodologies have leaned heavily on historical trajectory data, frequently overlooking vital contextual cues such as pedestrian-specific traits and environmental factors. Furthermore, there is a notable knowledge gap as trajectory and intention prediction have largely been approached as separate problems, despite their mutual dependence. To bridge this gap, we introduce PTINet (Pedestrian Trajectory and Intention Prediction Network), which jointly learns trajectory and intention prediction by combining past trajectory observations, local contextual features (individual pedestrian behaviors), and global features (signs, markings, etc.). The efficacy of our approach is evaluated on widely used public datasets: JAAD and PIE, where it has demonstrated superior performance over existing state-of-the-art models in trajectory and intention prediction. The results from our experiments and ablation studies robustly validate PTINet's effectiveness in jointly exploring intention and trajectory prediction for pedestrian behaviour modelling. The experimental evaluation indicates the advantage of using global and local contextual features for pedestrian trajectory and intention prediction. The effectiveness of PTINet in predicting pedestrian behavior paves the way for the development of automated systems capable of seamlessly interacting with pedestrians in urban settings.

new Domain Generalized Recaptured Screen Image Identification Using SWIN Transformer

Authors: Preeti Mehta, Aman Sagar, Suchi Kumari

Abstract: An increasing number of classification approaches have been developed to address the issue of image rebroadcast and recapturing, a standard attack strategy in insurance frauds, face spoofing, and video piracy. However, most of them neglected scale variations and domain generalization scenarios, performing poorly in instances involving domain shifts, typically made worse by inter-domain and cross-domain scale variances. To overcome these issues, we propose a cascaded data augmentation and SWIN transformer domain generalization framework (DAST-DG) in the current research work. Initially, we examine the disparity in dataset representation. A feature generator is trained to make authentic images from various domains indistinguishable. This process is then applied to recaptured images, creating a dual adversarial learning setup. Extensive experiments demonstrate that our approach is practical and surpasses state-of-the-art methods across different databases. Our model achieves an accuracy of approximately 82\% with a precision of 95\% on high-variance datasets.

new Unpaired Photo-realistic Image Deraining with Energy-informed Diffusion Model

Authors: Yuanbo Wen, Tao Gao, Ting Chen

Abstract: Existing unpaired image deraining approaches face challenges in accurately capturing the distinguishing characteristics between the rainy and clean domains, resulting in residual degradation and color distortion within the reconstructed images. To this end, we propose an energy-informed diffusion model for unpaired photo-realistic image deraining (UPID-EDM). Initially, we delve into the intricate visual-language priors embedded within the contrastive language-image pre-training model (CLIP), and demonstrate that the CLIP priors aid in the discrimination of rainy and clean images. Furthermore, we introduce a dual-consistent energy function (DEF) that retains the rain-irrelevant characteristics while eliminating the rain-relevant features. This energy function is trained on non-corresponding rainy and clean images. In addition, we employ the rain-relevance discarding energy function (RDEF) and the rain-irrelevance preserving energy function (RPEF) to direct the reverse sampling procedure of a pre-trained diffusion model, effectively removing the rain streaks while preserving the image contents. Extensive experiments demonstrate that our energy-informed model surpasses the existing unpaired learning approaches in terms of both supervised and no-reference metrics.

new ALPI: Auto-Labeller with Proxy Injection for 3D Object Detection using 2D Labels Only

Authors: Saad Lahlali, Nicolas Granger, Herv\'e Le Borgne, Quoc-Cuong Pham

Abstract: 3D object detection plays a crucial role in various applications such as autonomous vehicles, robotics and augmented reality. However, training 3D detectors requires costly and precise annotations, which is a hindrance to scaling annotation to large datasets. To address this challenge, we propose a weakly supervised 3D annotator that relies solely on 2D bounding box annotations from images, along with size priors. One major problem is that supervising a 3D detection model using only 2D boxes is not reliable due to ambiguities between different 3D poses and their identical 2D projection. We introduce a simple yet effective and generic solution: we build 3D proxy objects with annotations by construction and add them to the training dataset. Our method requires only size priors to adapt to new classes. To better align 2D supervision with 3D detection, our method ensures depth invariance with a novel expression of the 2D losses. Finally, to detect more challenging instances, our annotator follows an offline pseudo-labelling scheme which gradually improves its 3D pseudo-labels. Extensive experiments on the KITTI dataset demonstrate that our method not only performs on par with or above previous works on the Car category, but also achieves performance close to fully supervised methods on more challenging classes. We further demonstrate the effectiveness and robustness of our method by being the first to experiment on the more challenging nuScenes dataset. We additionally propose a setting where weak labels are obtained from a 2D detector pre-trained on MS-COCO instead of human annotations.

new Nonverbal Immediacy Analysis in Education: A Multimodal Computational Model

Authors: Uro\v{s} Petkovi\'c, Jonas Frenkel, Olaf Hellwich, Rebecca Lazarides

Abstract: This paper introduces a novel computational approach for analyzing nonverbal social behavior in educational settings. Integrating multimodal behavioral cues, including facial expressions, gesture intensity, and spatial dynamics, the model assesses the nonverbal immediacy (NVI) of teachers from RGB classroom videos. A dataset of 400 30-second video segments from German classrooms was constructed for model training and validation. The gesture intensity regressor achieved a correlation of 0.84, the perceived distance regressor 0.55, and the NVI model 0.44 with median human ratings. The model demonstrates the potential to provide valuable support in nonverbal behavior assessment, approximating the accuracy of individual human raters. Validated against both questionnaire data and trained observer ratings, our models show moderate to strong correlations with relevant educational outcomes, indicating their efficacy in reflecting effective teaching behaviors. This research advances the objective assessment of nonverbal communication behaviors, opening new pathways for educational research.

new Graph Neural Networks: A suitable Alternative to MLPs in Latent 3D Medical Image Classification?

Authors: Johannes Kiechle, Daniel M. Lang, Stefan M. Fischer, Lina Felsner, Jan C. Peeken, Julia A. Schnabel

Abstract: Recent studies have underscored the capabilities of natural imaging foundation models to serve as powerful feature extractors, even in a zero-shot setting for medical imaging data. Most commonly, a shallow multi-layer perceptron (MLP) is appended to the feature extractor to facilitate end-to-end learning and downstream prediction tasks such as classification, thus representing the de facto standard. However, as graph neural networks (GNNs) have become a practicable choice for various tasks in medical research in the recent past, we direct attention to the question of how effective GNNs are compared to MLP prediction heads for the task of 3D medical image classification, proposing them as a potential alternative. In our experiments, we devise a subject-level graph for each volumetric dataset instance. Therein latent representations of all slices in the volume, encoded through a DINOv2 pretrained vision transformer (ViT), constitute the nodes and their respective node features. We use public datasets to compare the classification heads numerically and evaluate various graph construction and graph convolution methods in our experiments. Our findings show enhancements of the GNN in classification performance and substantial improvements in runtime compared to an MLP prediction head. Additional robustness evaluations further validate the promising performance of the GNN, promoting them as a suitable alternative to traditional MLP classification heads. Our code is publicly available at: https://github.com/compai-lab/2024-miccai-grail-kiechle

URLs: https://github.com/compai-lab/2024-miccai-grail-kiechle
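
To illustrate the "subject-level graph" idea, the sketch below links consecutive slice embeddings of one volume in a chain graph and applies a simple graph convolution (symmetric-normalized adjacency) followed by mean pooling. The shapes, chain connectivity, and tiny GCN head are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SimpleGCNHead(nn.Module):
    """One graph-convolution layer + mean pooling over nodes + linear classifier."""
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, x, adj):
        # x: (num_slices, in_dim) slice embeddings; adj: (num_slices, num_slices)
        a = adj + torch.eye(adj.shape[0])                   # add self-loops
        d_inv_sqrt = a.sum(1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
        h = torch.relu(self.lin1(a_norm @ x))               # message passing
        return self.cls(h.mean(0))                          # graph-level logits

num_slices, feat_dim = 40, 768                              # e.g. ViT slice features
x = torch.randn(num_slices, feat_dim)
adj = torch.zeros(num_slices, num_slices)
idx = torch.arange(num_slices - 1)
adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0                 # chain over adjacent slices
print(SimpleGCNHead(feat_dim, 256, num_classes=2)(x, adj).shape)  # torch.Size([2])
```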

new LPGen: Enhancing High-Fidelity Landscape Painting Generation through Diffusion Model

Authors: Wanggong Yang, Xiaona Wang, Yingrui Qiu, Yifei Zhao

Abstract: Generating landscape paintings expands the possibilities of artistic creativity and imagination. Traditional landscape painting methods involve using ink or colored ink on rice paper, which requires substantial time and effort. These methods are susceptible to errors and inconsistencies and lack precise control over lines and colors. This paper presents LPGen, a high-fidelity, controllable model for landscape painting generation, introducing a novel multi-modal framework that integrates image prompts into the diffusion model. We extract the edges and contours of the target landscape image by computing its Canny edges. These, along with natural language text prompts and drawing style references, are fed into the latent diffusion model as conditions. We implement a decoupled cross-attention strategy to ensure compatibility between image and text prompts, facilitating multi-modal image generation. A decoder generates the final image. Quantitative and qualitative analyses demonstrate that our method outperforms existing approaches in landscape painting generation and exceeds the current state-of-the-art. The LPGen network effectively controls the composition and color of landscape paintings, generates more accurate images, and supports further research in deep learning-based landscape painting generation.
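
The Canny-edge conditioning step can be reproduced with OpenCV; the thresholds and resizing below are illustrative choices rather than the paper's settings.

```python
import cv2
import numpy as np

def canny_condition(image_path, size=(512, 512), low=100, high=200):
    """Return a single-channel Canny edge map to condition a diffusion model on."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, size)
    edges = cv2.Canny(img, low, high)                 # uint8 map with values 0 or 255
    return (edges.astype(np.float32) / 255.0)[None]   # (1, H, W) in [0, 1]

# Usage (the path is a placeholder):
# cond = canny_condition("landscape.jpg")
# print(cond.shape)  # (1, 512, 512)
```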

new Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation

Authors: Hyunwoo Yu, Yubin Cho, Beoungwoo Kang, Seunghun Moon, Kyeongbo Kong, Suk-Ju Kang

Abstract: We present an Encoder-Decoder Attention Transformer, EDAFormer, which consists of the Embedding-Free Transformer (EFT) encoder and the all-attention decoder leveraging our Embedding-Free Attention (EFA) structure. The proposed EFA is a novel global context modeling mechanism that focuses on the global non-linearity itself rather than on the specific roles of the query, key and value. For the decoder, we explore an optimized structure for considering globality, which can improve the semantic segmentation performance. In addition, we propose a novel Inference Spatial Reduction (ISR) method for computational efficiency. Different from previous spatial-reduction attention methods, our ISR method further reduces the key-value resolution at the inference phase, which can mitigate the computation-performance trade-off gap for efficient semantic segmentation. Our EDAFormer shows state-of-the-art performance with efficient computation compared to existing transformer-based semantic segmentation models on three public benchmarks, including ADE20K, Cityscapes and COCO-Stuff. Furthermore, our ISR method reduces the computational cost by up to 61% with minimal mIoU performance degradation on the Cityscapes dataset. The code is available at https://github.com/hyunwoo137/EDAFormer.

URLs: https://github.com/hyunwoo137/EDAFormer.
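
The core of "reducing key-value resolution at inference" can be pictured as pooling the key/value tokens before attention, which shrinks the attention matrix from N x N to N x (N / r^2). The PyTorch sketch below is a generic illustration under assumed shapes, not the EDAFormer code.

```python
import torch
import torch.nn.functional as F

def attention_with_spatial_reduction(q, kv, hw, reduction=2):
    """q, kv: (B, N, C) token sequences over an H x W grid; pool only keys/values."""
    B, N, C = kv.shape
    H, W = hw
    kv_map = kv.transpose(1, 2).reshape(B, C, H, W)
    kv_red = F.avg_pool2d(kv_map, reduction)             # (B, C, H/r, W/r)
    kv_red = kv_red.flatten(2).transpose(1, 2)           # (B, N/r^2, C)
    attn = torch.softmax(q @ kv_red.transpose(1, 2) / C ** 0.5, dim=-1)
    return attn @ kv_red                                 # (B, N, C)

x = torch.randn(1, 32 * 32, 64)
out_full = attention_with_spatial_reduction(x, x, (32, 32), reduction=1)
out_isr = attention_with_spatial_reduction(x, x, (32, 32), reduction=4)
print(out_full.shape, out_isr.shape)  # same output shape; ISR attends to 16x fewer keys
```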

new SCIsegV2: A Universal Tool for Segmentation of Intramedullary Lesions in Spinal Cord Injury

Authors: Enamundram Naga Karthik, Jan Valo\v{s}ek, Lynn Farner, Dario Pfyffer, Simon Schading-Sassenhausen, Anna Lebret, Gergely David, Andrew C. Smith, Kenneth A. Weber II, Maryam Seif, RHSCIR Network Imaging Group, Patrick Freund, Julien Cohen-Adad

Abstract: Spinal cord injury (SCI) is a devastating condition that can lead to permanent paralysis and loss of sensory-motor function, potentially resulting in the formation of lesions within the spinal cord. Imaging biomarkers obtained from magnetic resonance imaging (MRI) scans can predict the functional recovery of individuals with SCI and help choose the optimal treatment strategy. Currently, most studies employ manual quantification of these MRI-derived biomarkers, which is a subjective and tedious task. In this work, we propose (i) a universal tool for the automatic segmentation of intramedullary SCI lesions, dubbed \texttt{SCIsegV2}, and (ii) a method to automatically compute the width of the tissue bridges from the segmented lesion. Tissue bridges represent the spared spinal tissue adjacent to the lesion, which is associated with functional recovery in SCI patients. The tool was trained and validated on a heterogeneous dataset from 7 sites comprising patients from different SCI phases (acute, sub-acute, and chronic) and etiologies (traumatic SCI, ischemic SCI, and degenerative cervical myelopathy). Tissue bridges quantified automatically did not significantly differ from those computed manually, suggesting that the proposed automatic tool can be used to derive relevant MRI biomarkers. \texttt{SCIsegV2} and the automatic tissue bridges computation are open-source and available in Spinal Cord Toolbox (v6.4 and above) via the \texttt{sct\_deepseg -task seg\_sc\_lesion\_t2w\_sci} and \texttt{sct\_analyze\_lesion} functions, respectively.

new M4: Multi-Proxy Multi-Gate Mixture of Experts Network for Multiple Instance Learning in Histopathology Image Analysis

Authors: Junyu Li, Ye Zhang, Wen Shu, Xiaobing Feng, Yingchun Wang, Pengju Yan, Xiaolin Li, Chulin Sha, Min He

Abstract: Multiple instance learning (MIL) has been successfully applied for whole slide images (WSIs) analysis in computational pathology, enabling a wide range of prediction tasks from tumor subtyping to inferring genetic mutations and multi-omics biomarkers. However, existing MIL methods predominantly focus on single-task learning, resulting not only in low overall efficiency but also in the neglect of inter-task relatedness. To address these issues, we propose an adapted architecture of Multi-gate Mixture-of-experts with Multi-proxy for Multiple instance learning (M4), and apply this framework for simultaneous prediction of multiple genetic mutations from WSIs. The proposed M4 model has two main innovations: (1) utilizing a mixture of experts with multiple gating strategies for multi-genetic mutation prediction on a single pathological slide; (2) constructing multi-proxy expert network and gate network for comprehensive and effective modeling of pathological image information. Our model achieved significant improvements across five tested TCGA datasets in comparison to current state-of-the-art single-task methods. The code is available at: https://github.com/Bigyehahaha/M4.

URLs: https://github.com/Bigyehahaha/M4.

new DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy

Authors: Yi Lei, Huilin Zhu, Jingling Yuan, Guangli Xiang, Xian Zhong, Shengfeng He

Abstract: Drone-based crowd tracking faces difficulties in accurately identifying and monitoring objects from an aerial perspective, largely due to their small size and close proximity to each other, which complicates both localization and tracking. To address these challenges, we present the Density-aware Tracking (DenseTrack) framework. DenseTrack capitalizes on crowd counting to precisely determine object locations, blending visual and motion cues to improve the tracking of small-scale objects. It specifically addresses the problem of cross-frame motion to enhance tracking accuracy and dependability. DenseTrack employs crowd density estimates as anchors for exact object localization within video frames. These estimates are merged with motion and position information from the tracking network, with motion offsets serving as key tracking cues. Moreover, DenseTrack enhances the ability to distinguish small-scale objects using insights from the visual-language model, integrating appearance with motion cues. The framework utilizes the Hungarian algorithm to ensure the accurate matching of individuals across frames. Demonstrated on the DroneCrowd dataset, our approach exhibits superior performance, confirming its effectiveness in scenarios captured by drones.
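
The Hungarian matching step corresponds to a standard linear assignment over a cost matrix combining appearance and motion distances. The sketch below uses SciPy's solver with an illustrative cost combination; the weighting and features are assumptions, not the DenseTrack formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(track_feats, det_feats, track_pos, det_pos, w_app=0.5):
    """Assign current detections to existing tracks by minimizing a combined cost."""
    # Appearance cost: cosine distance between embeddings.
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    app_cost = 1.0 - t @ d.T
    # Motion cost: Euclidean distance between predicted and detected positions.
    motion_cost = np.linalg.norm(track_pos[:, None] - det_pos[None, :], axis=-1)
    cost = w_app * app_cost + (1 - w_app) * motion_cost / (motion_cost.max() + 1e-6)
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

rng = np.random.default_rng(0)
pairs = match_detections(rng.normal(size=(4, 16)), rng.normal(size=(5, 16)),
                         rng.uniform(0, 100, (4, 2)), rng.uniform(0, 100, (5, 2)))
print(pairs)  # list of (track_index, detection_index) assignments
```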

new LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering

Authors: Simon Boeder, Fabian Gigengack, Benjamin Risse

Abstract: Semantic occupancy has recently gained significant traction as a prominent method for 3D scene representation. However, most existing camera-based methods rely on costly datasets with fine-grained 3D voxel labels or LiDAR scans for training, which limits their practicality and scalability, raising the need for self-supervised approaches in this domain. Moreover, most methods are tied to a predefined set of classes which they can detect. In this work, we present a novel approach for open vocabulary occupancy estimation called \textit{LangOcc}, that is trained only via camera images, and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, allowing for a straightforward and powerful training method without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy by a large margin, solely relying on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, thus demonstrating the effectiveness of our proposed vision-language training.

new Physical Adversarial Attack on Monocular Depth Estimation via Shape-Varying Patches

Authors: Chenxing Zhao, Yang Li, Shihao Wu, Wenyi Tan, Shuangju Zhou, Quan Pan

Abstract: Adversarial attacks against monocular depth estimation (MDE) systems pose significant challenges, particularly in safety-critical applications such as autonomous driving. Existing patch-based adversarial attacks for MDE are confined to the vicinity of the patch, making it difficult to affect the entire target. To address this limitation, we propose a physics-based adversarial attack on monocular depth estimation, employing a framework called Attack with Shape-Varying Patches (ASP), aiming to optimize patch content, shape, and position to maximize effectiveness. We introduce various mask shapes, including quadrilateral, rectangular, and circular masks, to enhance the flexibility and efficiency of the attack. Furthermore, we propose a new loss function to extend the influence of the patch beyond the overlapping regions. Experimental results demonstrate that our attack method generates an average depth error of 18 meters on the target car with a patch area of 1/9, affecting over 98\% of the target area.

new DarSwin-Unet: Distortion Aware Encoder-Decoder Architecture

Authors: Akshaya Athwale, Ichrak Shili, \'Emile Bergeron, Ola Ahmad, Jean-Fran\c{c}ois Lalonde

Abstract: Wide-angle fisheye images are becoming increasingly common for perception tasks in applications such as robotics, security, and mobility (e.g. drones, avionics). However, current models often either ignore the distortions in wide-angle images or are not suitable to perform pixel-level tasks. In this paper, we present an encoder-decoder model based on a radial transformer architecture that adapts to distortions in wide-angle lenses by leveraging the physical characteristics defined by the radial distortion profile. In contrast to the original model, which only performs classification tasks, we introduce a U-Net architecture, DarSwin-Unet, designed for pixel-level tasks. Furthermore, we propose a novel strategy that minimizes sparsity when sampling the image for creating its input tokens. Our approach enhances the model's capability to handle pixel-level tasks in wide-angle fisheye images, making it more effective for real-world applications. Compared to other baselines, DarSwin-Unet achieves the best results across different datasets, with significant gains when trained on bounded levels of distortions (very low, low, medium, and high) and tested on all, including out-of-distribution distortions. We demonstrate its performance on depth estimation and show through extensive experiments that DarSwin-Unet can perform zero-shot adaptation to unseen distortions of different wide-angle lenses.

new Multi-label Cluster Discrimination for Visual Representation Learning

Authors: Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jiankang Deng

Abstract: Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by CLIP can hardly encode the semantic structure of training data. To handle this limitation, cluster discrimination has been proposed through iterative cluster assignment and classification. Nevertheless, most cluster discrimination approaches only define a single pseudo-label for each image, neglecting multi-label signals in the image. In this paper, we propose a novel Multi-Label Cluster Discrimination method named MLCD to enhance representation learning. In the clustering step, we first cluster the large-scale LAION-400M dataset into one million centers based on off-the-shelf embedding features. Considering that natural images frequently contain multiple visual objects or attributes, we select the multiple closest centers as auxiliary class labels. In the discrimination step, we design a novel multi-label classification loss, which elegantly separates losses from positive classes and negative classes, and alleviates ambiguity on decision boundary. We validate the proposed multi-label cluster discrimination method with experiments on different scales of models and pre-training datasets. Experimental results show that our method achieves state-of-the-art performance on multiple downstream tasks including linear probe, zero-shot classification, and image-text retrieval.

new Cascaded Light Propagation Volumes using Spherical Radial Basis Functions

Authors: Ludovic Silvestre, Jo\~ao Pereira

Abstract: This paper introduces a contribution to one of the newest methods for simulating indirect lighting in dynamic scenes, the cascaded light propagation volumes. Our contribution consists of using Spherical Radial Basis Functions instead of Spherical Harmonics, since the former achieve much better results when many coefficients are used. We explain how to integrate Spherical Radial Basis Functions with the cascaded light propagation volumes, and evaluate our technique against the same implementation using Spherical Harmonics.

new Preliminary study on artificial intelligence methods for cybersecurity threat detection in computer networks based on raw data packets

Authors: Aleksander Ogonowski, Micha{\l} \.Zebrowski, Arkadiusz \'Cwiek, Tobiasz Jarosiewicz, Konrad Klimaszewski, Adam Padee, Piotr Wasiuk, Micha{\l} W\'ojcik

Abstract: Most of the intrusion detection methods in computer networks are based on traffic flow characteristics. However, this approach may not fully exploit the potential of deep learning algorithms to directly extract features and patterns from raw packets. Moreover, it impedes real-time monitoring due to the necessity of waiting for the processing pipeline to complete and introduces dependencies on additional software components. In this paper, we investigate deep learning methodologies capable of detecting attacks in real-time directly from raw packet data within network traffic. We propose a novel approach where packets are stacked into windows and separately recognised, with a 2D image representation suitable for processing with computer vision models. Our investigation utilizes the CIC IDS-2017 dataset, which includes both benign traffic and prevalent real-world attacks, providing a comprehensive foundation for our research.
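
To illustrate "stacking packets into windows" as a 2D image: pad or truncate each packet's raw bytes to a fixed length and stack a fixed number of consecutive packets into one matrix. The dimensions below are arbitrary examples, not the paper's configuration.

```python
import numpy as np

def packets_to_window(packets, window_size=16, packet_len=256):
    """packets: list of bytes objects -> (window_size, packet_len) float image in [0, 1]."""
    window = np.zeros((window_size, packet_len), dtype=np.float32)
    for row, pkt in enumerate(packets[:window_size]):
        data = np.frombuffer(pkt[:packet_len], dtype=np.uint8)
        window[row, :len(data)] = data / 255.0      # truncate or zero-pad each packet
    return window

# Toy traffic: a few raw packets of different lengths.
fake_packets = [bytes(np.random.randint(0, 256, n, dtype=np.uint8)) for n in (60, 1500, 300)]
img = packets_to_window(fake_packets)
print(img.shape, img.min(), img.max())  # (16, 256), values in [0, 1]
```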

new Deep Spherical Superpixels

Authors: R\'emi Giraud, Micha\"el Cl\'ement

Abstract: Over the years, the use of superpixel segmentation has become very popular in various applications, serving as a preprocessing step to reduce data size by adapting to the content of the image, regardless of its semantic content. While the superpixel segmentation of standard planar images, captured with a 90{\deg} field of view, has been extensively studied, there has been limited focus on dedicated methods for omnidirectional or spherical images, captured with a 360{\deg} field of view. In this study, we introduce the first deep learning-based superpixel segmentation approach tailored for omnidirectional images called DSS (for Deep Spherical Superpixels). Our methodology leverages spherical CNN architectures and the differentiable K-means clustering paradigm for superpixels to generate superpixels that follow the spherical geometry. Additionally, we propose to use data augmentation techniques specifically designed for 360{\deg} images, enabling our model to efficiently learn from a limited set of annotated omnidirectional data. Our extensive validation across two datasets demonstrates that incorporating the inherent spherical geometry of such images into our framework improves the segmentation performance over traditional and deep learning-based superpixel methods. Our code is available online.

new MuST: Multi-Scale Transformers for Surgical Phase Recognition

Authors: Alejandra P\'erez, Santiago Rodr\'iguez, Nicol\'as Ayobi, Nicol\'as Aparicio, Eug\'enie Dessevres, Pablo Arbel\'aez

Abstract: Phase recognition in surgical videos is crucial for enhancing computer-aided surgical systems as it enables automated understanding of sequential procedural stages. Existing methods often rely on fixed temporal windows for video analysis to identify dynamic surgical phases. Thus, they struggle to simultaneously capture short-, mid-, and long-term information necessary to fully understand complex surgical procedures. To address these issues, we propose Multi-Scale Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based approach that combines a Multi-Term Frame encoder with a Temporal Consistency Module to capture information across multiple temporal scales of a surgical video. Our Multi-Term Frame Encoder computes interdependencies across a hierarchy of temporal scales by sampling sequences at increasing strides around the frame of interest. Furthermore, we employ a long-term Transformer encoder over the frame embeddings to further enhance long-term reasoning. MuST achieves higher performance than previous state-of-the-art methods on three different public benchmarks.
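
The "sampling sequences at increasing strides around the frame of interest" idea can be written as a small index-sampling helper; the window length and strides below are placeholders, not the paper's settings.

```python
def multi_scale_indices(center, num_frames, strides=(1, 4, 16), window=8):
    """Return one frame-index sequence per temporal scale, centered on `center`.

    Larger strides cover longer temporal context with the same number of frames.
    Indices are clamped to the valid range [0, num_frames - 1].
    """
    sequences = []
    for s in strides:
        offsets = range(-(window // 2) * s, (window // 2) * s, s)
        seq = [min(max(center + o, 0), num_frames - 1) for o in offsets]
        sequences.append(seq)
    return sequences

for stride, seq in zip((1, 4, 16), multi_scale_indices(center=100, num_frames=3000)):
    print(f"stride {stride:>2}: {seq}")
# stride  1 covers frames 96-103 (short-term); stride 16 covers 36-148 (long-term)
```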

new ViPer: Visual Personalization of Generative Models via Individual Preference Learning

Authors: Sogand Salehi, Mahdi Shafiei, Teresa Yeo, Roman Bachmann, Amir Zamir

Abstract: Different users find different images generated for the same prompt desirable. This gives rise to personalized image generation which involves creating images aligned with an individual's visual preference. Current generative models are, however, unpersonalized, as they are tuned to produce outputs that appeal to a broad audience. Using them to generate images aligned with individual users relies on iterative manual prompt engineering by the user which is inefficient and undesirable. We propose to personalize the image generation process by first capturing the generic preferences of the user in a one-time process by inviting them to comment on a small selection of images, explaining why they like or dislike each. Based on these comments, we infer a user's structured liked and disliked visual attributes, i.e., their visual preference, using a large language model. These attributes are used to guide a text-to-image model toward producing images that are tuned towards the individual user's visual preference. Through a series of user studies and large language model guided evaluations, we demonstrate that the proposed method results in generations that are well aligned with individual users' visual preferences.

new PrevPredMap: Exploring Temporal Modeling with Previous Predictions for Online Vectorized HD Map Construction

Authors: Nan Peng, Xun Zhou, Mingming Wang, Xiaojun Yang, Songming Chen, Guisong Chen

Abstract: Temporal information is crucial for detecting occluded instances. Existing temporal representations have progressed from BEV or PV features to more compact query features. Compared to these aforementioned features, predictions offer the highest level of abstraction, providing explicit information. In the context of online vectorized HD map construction, this unique characteristic of predictions is potentially advantageous for long-term temporal modeling and the integration of map priors. This paper introduces PrevPredMap, a pioneering temporal modeling framework that leverages previous predictions for constructing online vectorized HD maps. We have meticulously crafted two essential modules for PrevPredMap: the previous-predictions-based query generator and the dynamic-position-query decoder. Specifically, the previous-predictions-based query generator is designed to separately encode different types of information from previous predictions, which are then effectively utilized by the dynamic-position-query decoder to generate current predictions. Furthermore, we have developed a dual-mode strategy to ensure PrevPredMap's robust performance across both single-frame and temporal modes. Extensive experiments demonstrate that PrevPredMap achieves state-of-the-art performance on the nuScenes and Argoverse2 datasets. Code will be available at https://github.com/pnnnnnnn/PrevPredMap.

URLs: https://github.com/pnnnnnnn/PrevPredMap.

new MMRA: A Benchmark for Multi-granularity Multi-image Relational Association

Authors: Siwei Wu, Kang Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, Jiaheng Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, Ge Zhang, Wenhao Huang, Chenghua Lin

Abstract: Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVLMs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks mainly focus on objective facts or topic-related knowledge within a single image, but overlook the associative relations between multiple images. Therefore, we define a multi-image relation association task, and meticulously curate the \textbf{MMRA} benchmark, a \textbf{M}ulti-granularity \textbf{M}ulti-image \textbf{R}elational \textbf{A}ssociation benchmark, consisting of \textbf{1026} samples. In order to systematically and comprehensively evaluate mainstream LVLMs, we establish an associational relation system among images that contains \textbf{11 subtasks} (e.g., UsageSimilarity, SubEvent, etc.) at two granularity levels (i.e., "\textbf{image}" and "\textbf{entity}") according to the relations in ConceptNet. Our experiments demonstrate that, on our MMRA benchmark, current mainstream LVLMs all have their own advantages and disadvantages across different subtasks. It is worth noting that, at the entity level, the performance of all models is worse than at the image level, indicating that the fine-grained multi-image perception task is still challenging for LVLMs. The tasks related to spatial perception are relatively difficult for LVLMs to handle. Furthermore, we find that LVLMs exhibit a good ability to perceive image details, and the key to enhancing their multi-image association capability is to strengthen the reasoning ability of their language model component. All our code and data are released at \url{https://github.com/Wusiwei0410/MMRA}.

URLs: https://github.com/Wusiwei0410/MMRA

new 3D Question Answering for City Scene Understanding

Authors: Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, Xiaowen Chu

Abstract: 3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes, due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantic and human-environment interactive tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce spatial semantics. A new benchmark is reported and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot methods using advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.

new Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising

Authors: S\'ebastien Herbreteau, Michael Unser

Abstract: Supervised deep learning has become the method of choice for image denoising. It involves the training of neural networks on large datasets composed of pairs of noisy and clean images. However, the necessity of training data that are specific to the targeted application constrains the widespread use of denoising networks. Recently, several approaches have been developed to overcome this difficulty, either by artificially generating realistic clean/noisy image pairs or by training exclusively on noisy images. In this paper, we show that, contrary to popular belief, denoising networks specialized in the removal of Gaussian noise can be efficiently leveraged in favor of real-world image denoising, even without additional training. For this to happen, an appropriate variance-stabilizing transform (VST) has to be applied beforehand. We propose an algorithm termed Noise2VST for the learning of such a model-free VST. Our approach requires only the input noisy image and an off-the-shelf Gaussian denoiser. We demonstrate through extensive experiments the efficiency and superiority of Noise2VST in comparison to existing methods trained in the absence of specific clean/noisy pairs.
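
For intuition about why a variance-stabilizing transform lets a Gaussian denoiser handle signal-dependent noise, the sketch below wraps an off-the-shelf denoiser between the classical (fixed) Anscombe transform and its inverse. Noise2VST instead learns the transform from the noisy input itself, so this is background, not the paper's method; the box filter is a stand-in for a trained Gaussian denoising network.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def anscombe(x):
    """Classical VST: maps Poisson-like noise to approximately unit-variance Gaussian noise."""
    return 2.0 * np.sqrt(np.maximum(x + 3.0 / 8.0, 0.0))

def inverse_anscombe(y):
    """Simple algebraic inverse (bias-corrected inverses exist but are omitted here)."""
    return (y / 2.0) ** 2 - 3.0 / 8.0

def denoise_with_vst(noisy, gaussian_denoiser):
    """Wrap any off-the-shelf Gaussian denoiser between a forward and inverse VST."""
    return inverse_anscombe(gaussian_denoiser(anscombe(noisy)))

clean = np.full((64, 64), 20.0)
noisy = np.random.poisson(clean).astype(float)
# A box filter stands in for a pretrained Gaussian denoising network here.
restored = denoise_with_vst(noisy, lambda x: uniform_filter(x, size=5))
print(np.abs(noisy - clean).mean(), np.abs(restored - clean).mean())  # error drops
```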

new Generation of Training Data from HD Maps in the Lanelet2 Framework

Authors: Fabian Immel, Richard Fehler, Frank Bieder, Christoph Stiller

Abstract: Using HD maps directly as training data for machine learning tasks has seen a massive surge in popularity and shown promising results, e.g. in the field of map perception. Despite that, a standardized HD map framework supporting all parts of map-based automated driving and training label generation from map data does not exist. Furthermore, feeding map perception models with map data as part of the input during real-time inference is not addressed by the research community. In order to fill this gap, we present lanelet2_ml_converter, an integrated extension to the HD map framework Lanelet2, widely used in automated driving systems by academia and industry. With this addition Lanelet2 unifies map based automated driving, machine learning inference and training, all from a single source of map data and format. Requirements for a unified framework are analyzed and the implementation of these requirements is described. The usability of labels in state-of-the-art machine learning is demonstrated with application examples from the field of map perception. The source code is available embedded in the Lanelet2 framework under https://github.com/fzi-forschungszentrum-informatik/Lanelet2/tree/feature_ml_converter

URLs: https://github.com/fzi-forschungszentrum-informatik/Lanelet2/tree/feature_ml_converter

new (PASS) Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork

Authors: Tianjin Huang, Fang Meng, Li Shen, Fan Liu, Yulong Pei, Mykola Pechenizkiy, Shiwei Liu, Tianlong Chen

Abstract: Large-scale neural networks have demonstrated remarkable performance in different domains like vision and language processing, although at the cost of massive computation resources. As illustrated by compression literature, structural model pruning is a prominent algorithm to encourage model efficiency, thanks to its acceleration-friendly sparsity patterns. One of the key questions of structural pruning is how to estimate the channel significance. In parallel, work on data-centric AI has shown that prompting-based techniques enable impressive generalization of large language models across diverse downstream tasks. In this paper, we investigate a charming possibility - \textit{leveraging visual prompts to capture the channel importance and derive high-quality structural sparsity}. To this end, we propose a novel algorithmic framework, namely \texttt{PASS}. It is a tailored hyper-network to take both visual prompts and network weight statistics as input, and output layer-wise channel sparsity in a recurrent manner. Such designs consider the intrinsic channel dependency between layers. Comprehensive experiments across multiple network architectures and six datasets demonstrate the superiority of \texttt{PASS} in locating good structural sparsity. For example, at the same FLOPs level, \texttt{PASS} subnetworks achieve $1\%\sim 3\%$ better accuracy on Food101 dataset; or with a similar performance of $80\%$ accuracy, \texttt{PASS} subnetworks obtain $0.35\times$ more speedup than the baselines.

new 3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities

Authors: Yanqi Bao, Tianyu Ding, Jing Huo, Yaoli Liu, Yuxin Li, Wenbin Li, Yang Gao, Jiebo Luo

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussian representations through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing them based on their focuses or motivations. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks, proposing potential research opportunities.

new On selection of centroids of fuzzy clusters for color classification

Authors: Dae-Won Kim, Kwang H. Lee

Abstract: A novel initialization method in the fuzzy c-means (FCM) algorithm is proposed for the color clustering problem. Given a set of color points, the proposed initialization extracts dominant colors that are the most vivid and distinguishable colors. Color points closest to the dominant colors are selected as initial centroids in the FCM. To obtain the dominant colors and their closest color points, we introduce reference colors and define a fuzzy membership model between a color point and a reference color.
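
The initialization described above can be summarized as: score candidate reference colors by the fuzzy membership of the data, keep the most dominant ones, and seed FCM with the data points nearest to them. The NumPy sketch below is one plausible reading of that procedure, with an assumed Gaussian membership model and arbitrary parameters rather than the paper's definitions.

```python
import numpy as np

def select_initial_centroids(colors, reference_colors, num_clusters, sigma=30.0):
    """colors: (N, 3) RGB points; reference_colors: (R, 3) candidate dominant colors."""
    # Fuzzy membership of every color point to every reference color (Gaussian kernel).
    dist = np.linalg.norm(colors[:, None] - reference_colors[None, :], axis=-1)
    membership = np.exp(-(dist ** 2) / (2 * sigma ** 2))
    # Dominance of a reference color = total membership mass it attracts.
    dominant = np.argsort(membership.sum(axis=0))[::-1][:num_clusters]
    # Initial centroid = actual color point closest to each dominant reference color.
    return np.array([colors[np.argmin(dist[:, r])] for r in dominant])

rng = np.random.default_rng(1)
colors = rng.integers(0, 256, size=(1000, 3)).astype(float)
reference = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255],
                      [255, 255, 0], [0, 0, 0], [255, 255, 255]], dtype=float)
print(select_initial_centroids(colors, reference, num_clusters=3))
```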

new Vision Language Model-Empowered Contract Theory for AIGC Task Allocation in Teleoperation

Authors: Zijun Zhan, Yaxian Dong, Yuqing Hu, Shuai Li, Shaohua Cao, Zhu Han

Abstract: Integrating low-light image enhancement techniques, in which diffusion-based AI-generated content (AIGC) models are promising, is necessary to enhance nighttime teleoperation. Remarkably, the AIGC model is computation-intensive, thus necessitating the allocation of AIGC tasks to edge servers with ample computational resources. Given the distinct costs of AIGC models trained on datasets of varying size and the disparate demands of AIGC tasks, it is imperative to formulate a differential pricing strategy to optimize the utility of teleoperators and edge servers concurrently. Nonetheless, the pricing strategy formulation is under information asymmetry, i.e., the demand (e.g., the difficulty level of AIGC tasks and their distribution) of AIGC tasks is hidden information to edge servers. Additionally, manually assessing the difficulty level of AIGC tasks is tedious and unnecessary for teleoperators. To this end, we devise a framework of AIGC task allocation assisted by the Vision Language Model (VLM)-empowered contract theory, which includes two components: VLM-empowered difficulty assessment and contract theory-assisted AIGC task allocation. The first component enables automatic and accurate AIGC task difficulty assessment. The second component is capable of formulating the pricing strategy for edge servers under information asymmetry, thereby optimizing the utility of both edge servers and teleoperators. Simulation results demonstrate that our proposed framework can improve the average utility of teleoperators and edge servers by 10.88-12.43% and 1.4-2.17%, respectively. Code and data are available at https://github.com/ZiJun0819/VLM-Contract-Theory.

URLs: https://github.com/ZiJun0819/VLM-Contract-Theory.

new HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Authors: Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin

Abstract: Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at \url{https://github.com/zhenzhiwang/HumanVid/}.

URLs: https://github.com/zhenzhiwang/HumanVid/

new AHMF: Adaptive Hybrid-Memory-Fusion Model for Driver Attention Prediction

Authors: Dongyang Xu, Qingfan Wang, Ji Ma, Xiangyun Zeng, Lei Chen

Abstract: Accurate driver attention prediction can serve as a critical reference for intelligent vehicles in understanding traffic scenes and making informed driving decisions. Though existing studies on driver attention prediction improved performance by incorporating advanced saliency detection techniques, they overlooked the opportunity to achieve human-inspired prediction by analyzing driving tasks from a cognitive science perspective. During driving, drivers' working memory and long-term memory play crucial roles in scene comprehension and experience retrieval, respectively. Together, they form situational awareness, facilitating drivers to quickly understand the current traffic situation and make optimal decisions based on past driving experiences. To explicitly integrate these two types of memory, this paper proposes an Adaptive Hybrid-Memory-Fusion (AHMF) driver attention prediction model to achieve more human-like predictions. Specifically, the model first encodes information about specific hazardous stimuli in the current scene to form working memories. Then, it adaptively retrieves similar situational experiences from the long-term memory for final prediction. Utilizing domain adaptation techniques, the model performs parallel training across multiple datasets, thereby enriching the accumulated driving experience within the long-term memory module. Compared to existing models, our model demonstrates significant improvements across various metrics on multiple public datasets, proving the effectiveness of integrating hybrid memories in driver attention prediction.

new $VILA^2$: VILA Augmented VILA

Authors: Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun Cho, Marco Pavone, Song Han, Hongxu Yin

Abstract: Visual language models (VLMs) have rapidly progressed, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance rapidly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either directly crawls more raw data from the Internet, with no guarantee of data quality, or distills from black-box commercial models (e.g., GPT-4V / Gemini), causing the performance to be upper-bounded by that model. In this work, we introduce a novel approach that includes a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality, and then retrains from scratch using this refined dataset to improve model performance. This process can iterate for several rounds. Once self-augmentation saturates, we employ several specialist VLMs finetuned from the self-augmented VLM with domain-specific expertise, to further infuse specialist knowledge into the generalist VLM through task-oriented recaptioning and retraining. With the combined self-augmented and specialist-augmented training, we introduce $VILA^2$ (VILA-augmented-VILA), a VLM family that consistently improves the accuracy on a wide range of tasks over prior art, and achieves new state-of-the-art results on the MMMU leaderboard among open-sourced models.

new CSCPR: Cross-Source-Context Indoor RGB-D Place Recognition

Authors: Jing Liang, Zhuo Deng, Zheming Zhou, Min Sun, Omid Ghasemalizadeh, Cheng-Hao Kuo, Arnie Sen, Dinesh Manocha

Abstract: We present a new algorithm, Cross-Source-Context Place Recognition (CSCPR), for RGB-D indoor place recognition that integrates global retrieval and reranking into a single end-to-end model. Unlike prior approaches that primarily focus on the RGB domain, CSCPR is designed to handle RGB-D data. We extend the Context-of-Clusters (CoCs) for handling noisy colorized point clouds and introduce two novel modules for reranking: the Self-Context Cluster (SCC) and Cross Source Context Cluster (CSCC), which enhance feature representation and match query-database pairs based on local features, respectively. We also present two new datasets, ScanNetIPR and ARKitIPR. Our experiments demonstrate that CSCPR significantly outperforms state-of-the-art models on these datasets, by at least 36.5% in Recall@1 on the ScanNet-PR dataset and 44% on the new datasets. Code and datasets will be released.

new SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

Authors: Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, Varun Jampani

Abstract: We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation. Unlike previous methods that rely on separately trained generative models for video generation and novel view synthesis, we design a unified diffusion model to generate novel view videos of dynamic 3D objects. Specifically, given a monocular reference video, SV4D generates novel views for each video frame that are temporally consistent. We then use the generated novel view videos to optimize an implicit 4D representation (dynamic NeRF) efficiently, without the need for cumbersome SDS-based optimization used in most prior works. To train our unified novel view video generation model, we curated a dynamic 3D object dataset from the existing Objaverse dataset. Extensive experimental results on multiple datasets and user studies demonstrate SV4D's state-of-the-art performance on novel-view video synthesis as well as 4D generation compared to prior works.

cross Exploring The Neural Burden In Pruned Models: An Insight Inspired By Neuroscience

Authors: Zeyu Wang, Weichen Dai, Xiangyu Zhou, Ji Qi, Yi Zhou

Abstract: Vision Transformer and its variants have been adopted in many visual tasks due to their powerful capabilities, which also bring significant challenges in computation and storage. Consequently, researchers have introduced various compression methods in recent years, among which pruning techniques are widely used to remove a significant fraction of the network. These methods can reduce FLOPs by a significant percentage, but often lead to a decrease in model performance. To investigate the underlying causes, we focus on pruning methods belonging to the pruning-during-training category, draw inspiration from neuroscience, and propose a new concept for artificial neural network models named Neural Burden. We investigate its impact on the model pruning process, and subsequently explore a simple yet effective approach to mitigate the decline in model performance, which can be applied to any pruning-during-training technique. Extensive experiments indicate that the neural burden phenomenon indeed exists and show the potential of our method. We hope that our findings can provide valuable insights for future research. Code will be made publicly available after this paper is published.

cross PlantTrack: Task-Driven Plant Keypoint Tracking with Zero-Shot Sim2Real Transfer

Authors: Samhita Marri, Arun N. Sivakumar, Naveen K. Uppalapati, Girish Chowdhary

Abstract: Tracking plant features is crucial for various agricultural tasks like phenotyping, pruning, or harvesting, but the unstructured, cluttered, and deformable nature of plant environments makes it a challenging task. In this context, recent advancements in foundational models show promise in addressing this challenge. In our work, we propose PlantTrack, which utilizes the high-dimensional features provided by DINOv2 to train a keypoint heatmap predictor network that identifies the locations of semantic features such as fruits and leaves; these keypoints are then used as prompts for point tracking across video frames using TAPIR. We show that with as few as 20 synthetic images for training the keypoint predictor, we achieve zero-shot Sim2Real transfer, enabling effective tracking of plant features in real environments.

cross Vision-Based Adaptive Robotics for Autonomous Surface Crack Repair

Authors: Joshua Genova, Eric Cabrera, Vedhus Hoskere

Abstract: Surface cracks in infrastructure can lead to significant deterioration and costly maintenance if not efficiently repaired. Manual repair methods are labor-intensive, time-consuming, and imprecise and thus difficult to scale to large areas. Breakthroughs in robotic perception and manipulation have advanced autonomous crack repair, but proposed methods lack end-to-end testing and adaptability to changing crack size. This paper presents an adaptive, autonomous system for surface crack detection and repair using robotics with advanced sensing technologies. The system uses an RGB-D camera for crack detection, a laser scanner for precise measurement, and an extruder and pump for material deposition. A novel validation procedure with 3D-printed crack specimens simulates real-world cracks and ensures testing repeatability. Our study shows that an adaptive system for crack filling is more efficient and effective than a fixed-speed approach, with experimental results confirming precision and consistency. This research paves the way for versatile, reliable robotic infrastructure maintenance.

cross Exploring Fusion Techniques in Multimodal AI-Based Recruitment: Insights from FairCVdb

Authors: Swati Swati, Arjun Roy, Eirini Ntoutsi

Abstract: Despite the large body of work on fairness-aware learning for individual modalities like tabular data, images, and text, less work has been done on multimodal data, which fuses various modalities for a comprehensive analysis. In this work, we investigate the fairness and bias implications of multimodal fusion techniques in the context of multimodal AI-based recruitment systems using the FairCVdb dataset. Our results show that early-fusion closely matches the ground truth for both demographics, achieving the lowest MAEs by integrating each modality's unique characteristics. In contrast, late-fusion leads to highly generalized mean scores and higher MAEs. Our findings emphasise the significant potential of early-fusion for accurate and fair applications, even in the presence of demographic biases, compared to late-fusion. Future research could explore alternative fusion strategies and incorporate modality-related fairness constraints to improve fairness. For code and additional insights, visit: https://github.com/Swati17293/Multimodal-AI-Based-Recruitment-FairCVdb

URLs: https://github.com/Swati17293/Multimodal-AI-Based-Recruitment-FairCVdb

cross Toward an Integrated Decision Making Framework for Optimized Stroke Diagnosis with DSA and Treatment under Uncertainty

Authors: Nur Ahmad Khatim, Ahmad Azmul Asmar Irfan, Amaliya Mata'ul Hayah, Mansur M. Arief

Abstract: This study addresses the challenge of stroke diagnosis and treatment under uncertainty, a critical issue given the rapid progression and severe consequences of stroke conditions such as aneurysms, arteriovenous malformations (AVM), and occlusions. Current diagnostic methods, including Digital Subtraction Angiography (DSA), face limitations due to high costs and their invasive nature. To overcome these challenges, we propose a novel approach using a Partially Observable Markov Decision Process (POMDP) framework. Our model integrates advanced diagnostic tools and treatment approaches with a decision-making algorithm that accounts for the inherent uncertainties in stroke diagnosis. Our approach combines noisy observations from CT scans, Siriraj scores, and DSA reports to inform the subsequent treatment options. We utilize the online solver DESPOT, which employs tree-search methods and particle filters, to simulate potential future scenarios and guide our strategies. The results indicate that our POMDP framework balances diagnostic and treatment objectives, striking a tradeoff between the need for precise stroke identification via invasive procedures like DSA and the constraints of limited healthcare resources that necessitate more cost-effective strategies, such as in-hospital or at-home observation, by relying only on simulation rollouts without imposing any prior knowledge. Our study offers a significant contribution by presenting a systematic framework that optimally integrates diagnostic and treatment processes for stroke while accounting for various uncertainties, thereby improving care and outcomes in stroke management.
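For readers unfamiliar with POMDPs, the toy Python sketch below shows the kind of particle-filter belief update such a framework performs after a noisy test result; the state names, observation model, and probabilities are invented for illustration and are not taken from the study.

```python
import random

# Toy particle-filter belief update, in the spirit of DESPOT-style solvers.
# States, observation model, and probabilities are illustrative placeholders only.
STATES = ["aneurysm", "avm", "occlusion"]

def likelihood(observation, state):
    # Hypothetical noisy observation model for a CT finding; not from the paper.
    table = {"ct_positive": {"aneurysm": 0.7, "avm": 0.5, "occlusion": 0.2},
             "ct_negative": {"aneurysm": 0.3, "avm": 0.5, "occlusion": 0.8}}
    return table[observation][state]

def update_belief(particles, observation):
    # Reweight particles by how well each hidden state explains the observation,
    # then resample to obtain the posterior belief.
    weights = [likelihood(observation, s) for s in particles]
    total = sum(weights)
    return random.choices(particles, weights=[w / total for w in weights], k=len(particles))

belief = [random.choice(STATES) for _ in range(1000)]   # uniform prior
belief = update_belief(belief, "ct_positive")
print({s: belief.count(s) / len(belief) for s in STATES})
```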

cross Trans2Unet: Neural fusion for Nuclei Semantic Segmentation

Authors: Dinh-Phu Tran, Quoc-Anh Nguyen, Van-Truong Pham, Thi-Thao Tran

Abstract: Nuclei segmentation, despite its fundamental role in histopathological image analysis, is still a challenging task. The main challenge of this task is the existence of overlapping areas, which makes separating independent nuclei more complicated. In this paper, we propose a new two-branch architecture that combines the Unet and TransUnet networks for the nuclei segmentation task. In the proposed architecture, namely Trans2Unet, the input image is first sent into the Unet branch, whose last convolution layer is removed. This branch makes the network combine features from different spatial regions of the input image and localize the regions of interest more precisely. The input image is also fed into the second branch, called the TransUnet branch, where it is divided into image patches. With a Vision Transformer (ViT) in its architecture, TransUnet can serve as a powerful encoder for medical image segmentation tasks and enhance image details by recovering localized spatial information. To boost Trans2Unet's efficiency and performance, we propose to augment TransUnet with a computationally efficient variation called the "Waterfall" Atrous Spatial Pooling with Skip Connection (WASP-KC) module, which is inspired by the "Waterfall" Atrous Spatial Pooling (WASP) module. Experimental results on the 2018 Data Science Bowl benchmark show the effectiveness and performance of the proposed architecture when compared with previous segmentation models.

cross Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Authors: Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua

Abstract: Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line, which assigns images with unique string identifiers and generates the target identifier as the retrieval target. Despite its great potential, existing generative approaches are limited due to the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and learning gap towards the retrieval target. To address the above issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens as the identifier of the image, while maintaining the alignment with both the visual information and high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval target, we incorporate discriminative training to modify the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.

cross How Good (Or Bad) Are LLMs at Detecting Misleading Visualizations?

Authors: Leo Yu-Ho Lo, Huamin Qu

Abstract: In this study, we address the growing issue of misleading charts, a prevalent problem that undermines the integrity of information dissemination. Misleading charts can distort the viewer's perception of data, leading to misinterpretations and decisions based on false information. The development of effective automatic detection methods for misleading charts is an urgent field of research. The recent advancement of multimodal Large Language Models (LLMs) has introduced a promising direction for addressing this challenge. We explored the capabilities of these models in analyzing complex charts and assessing the impact of different prompting strategies on the models' analyses. We utilized a dataset of misleading charts collected from the internet by prior research and crafted nine distinct prompts, ranging from simple to complex, to test the ability of four different multimodal LLMs in detecting over 21 different chart issues. Through three experiments--from initial exploration to detailed analysis--we progressively gained insights into how to effectively prompt LLMs to identify misleading charts and developed strategies to address the scalability challenges encountered as we expanded our detection range from the initial five issues to 21 issues in the final experiment. Our findings reveal that multimodal LLMs possess a strong capability for chart comprehension and critical thinking in data interpretation. There is significant potential in employing multimodal LLMs to counter misleading information by supporting critical thinking and enhancing visualization literacy. This study demonstrates the applicability of LLMs in addressing the pressing concern of misleading charts.

cross Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population

Authors: Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis

Abstract: Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce a novel methodology for the classification of demented and non-demented elderly patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and Dem3D EfficientNet. These models work synergistically to enhance decision-making accuracy, leveraging their collective strengths. Tested on the Open Access Series of Imaging Studies (OASIS) dataset, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
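The abstract does not spell out the committee's combination rule; the sketch below assumes a simple rule (the most confident member decides, with confidence taken as the maximum softmax probability) purely to illustrate the idea of a confidence-based committee. The dummy outputs stand in for the three models named above.

```python
import numpy as np

# Illustrative sketch of a confidence-based committee of three classifiers.
# The exact combination rule is an assumption: "confidence" is the maximum
# softmax probability and the most confident member decides each case.

def committee_predict(probs_per_model):
    """probs_per_model: list of (n_samples, n_classes) softmax outputs."""
    stacked = np.stack(probs_per_model)               # (n_models, n_samples, n_classes)
    confidences = stacked.max(axis=2)                 # (n_models, n_samples)
    winner = confidences.argmax(axis=0)               # most confident model per sample
    return stacked[winner, np.arange(stacked.shape[1])].argmax(axis=1)

# Dummy outputs standing in for Dem3D ResNet, Dem3D CNN, and Dem3D EfficientNet.
rng = np.random.default_rng(0)
dummy = [rng.dirichlet(np.ones(2), size=5) for _ in range(3)]
print(committee_predict(dummy))                       # 0 = non-demented, 1 = demented
```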

cross 2D and 3D Deep Learning Models for MRI-based Parkinson's Disease Classification: A Comparative Analysis of Convolutional Kolmogorov-Arnold Networks, Convolutional Neural Networks, and Graph Convolutional Networks

Authors: Salil B Patel, Vicky Goh, James F FitzGerald, Chrystalina A Antoniades

Abstract: Early and accurate diagnosis of Parkinson's Disease (PD) remains challenging. This study compares deep learning architectures for MRI-based PD classification, introducing the first three-dimensional (3D) implementation of Convolutional Kolmogorov-Arnold Networks (ConvKANs), a new approach that combines convolution layers with adaptive, spline-based activations. We evaluated Convolutional Neural Networks (CNNs), ConvKANs, and Graph Convolutional Networks (GCNs) using three open-source datasets; a total of 142 participants (75 with PD and 67 age-matched healthy controls). For 2D analysis, we extracted 100 axial slices centred on the midbrain from each T1-weighted scan. For 3D analysis, we used the entire volumetric scans. ConvKANs integrate learnable B-spline functions with convolutional layers. GCNs represent MRI data as graphs, theoretically capturing structural relationships that may be overlooked by traditional approaches. We present interpretability visualizations, including the first ConvKAN spline activation maps and projections of graph node embeddings. ConvKANs demonstrated high performance across datasets and dimensionalities, achieving the highest 2D AUROC (0.98) in one dataset and matching CNN peak 3D performance (1.00). CNN models performed well, while GCN models improved in 3D analyses, reaching up to 0.97 AUROC. 3D implementations yielded higher AUROC values compared to 2D counterparts across all models. The ConvKAN implementation shows promise for MRI analysis in PD classification, particularly in the context of early diagnosis. The improvement in 3D analyses highlights the value of volumetric data in capturing subtle PD-related changes. While MRI is not currently used for PD diagnosis, these findings suggest its potential as a component of a multimodal diagnostic approach, especially for early detection.

cross Looking at Model Debiasing through the Lens of Anomaly Detection

Authors: Vito Paolo Pastore, Massimiliano Ciranni, Davide Marinelli, Francesca Odone, Vittorio Murino

Abstract: It is widely recognized that deep neural networks are sensitive to bias in the data. This means that during training these models are likely to learn spurious correlations between data and labels, resulting in limited generalization abilities and low performance. In this context, model debiasing approaches can be devised aiming at reducing the model's dependency on such unwanted correlations, either leveraging the knowledge of bias information or not. In this work, we focus on the latter and more realistic scenario, showing the importance of accurately predicting the bias-conflicting and bias-aligned samples to obtain compelling performance in bias mitigation. On this ground, we propose to conceive the problem of model bias from an out-of-distribution perspective, introducing a new bias identification method based on anomaly detection. We claim that when data is mostly biased, bias-conflicting samples can be regarded as outliers with respect to the bias-aligned distribution in the feature space of a biased model, thus allowing for precisely detecting them with an anomaly detection method. Coupling the proposed bias identification approach with bias-conflicting data upsampling and augmentation in a two-step strategy, we reach state-of-the-art performance on synthetic and real benchmark datasets. Ultimately, our proposed approach shows that the data bias issue does not necessarily require complex debiasing methods, given that an accurate bias identification procedure is defined.
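As a rough illustration of the bias-identification step, the sketch below flags outliers in a biased model's feature space with scikit-learn's IsolationForest, used here only as a generic stand-in for "an anomaly detection method"; the features, cluster structure, and upsampling factor are synthetic assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Sketch of the idea: bias-conflicting samples are treated as outliers in the
# feature space of a biased model. IsolationForest is a generic stand-in for
# "an anomaly detection method"; features below are synthetic placeholders.
rng = np.random.default_rng(0)
bias_aligned = rng.normal(0.0, 1.0, size=(950, 16))      # majority, bias-aligned
bias_conflicting = rng.normal(4.0, 1.0, size=(50, 16))    # minority, bias-conflicting
features = np.vstack([bias_aligned, bias_conflicting])

detector = IsolationForest(contamination=0.05, random_state=0).fit(features)
is_outlier = detector.predict(features) == -1             # -1 marks anomalies

# Upsample the detected bias-conflicting subset for the second training step.
conflict_idx = np.where(is_outlier)[0]
upsampled = np.concatenate([np.arange(len(features)),
                            rng.choice(conflict_idx, size=5 * len(conflict_idx))])
print(f"flagged {is_outlier.sum()} candidates; training set grows to {len(upsampled)}")
```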

cross SoNIC: Safe Social Navigation with Adaptive Conformal Inference and Constrained Reinforcement Learning

Authors: Jianpeng Yao, Xiaopan Zhang, Yu Xia, Zejin Wang, Amit K. Roy-Chowdhury, Jiachen Li

Abstract: Reinforcement Learning (RL) has enabled social robots to generate trajectories without human-designed rules or interventions, which makes it more effective than hard-coded systems for generalizing to complex real-world scenarios. However, social navigation is a safety-critical task that requires robots to avoid collisions with pedestrians, and previous RL-based solutions fall short in safety performance in complex environments. To enhance the safety of RL policies, we propose SoNIC, to the best of our knowledge the first algorithm that integrates adaptive conformal inference (ACI) with constrained reinforcement learning (CRL) to learn safe policies for social navigation. More specifically, our method augments RL observations with ACI-generated nonconformity scores and provides explicit guidance for agents to leverage the uncertainty metrics to avoid safety-critical areas by incorporating safety constraints with spatial relaxation. Our method outperforms state-of-the-art baselines in terms of both safety and adherence to social norms by a large margin and demonstrates much stronger robustness to out-of-distribution scenarios. Our code and video demos are available on our project website: https://sonic-social-nav.github.io/.

URLs: https://sonic-social-nav.github.io/.
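For context, adaptive conformal inference (in the style of Gibbs and Candes, 2021) maintains a miscoverage level alpha_t online via alpha_{t+1} = alpha_t + gamma (alpha - err_t). The sketch below shows that update and how a resulting nonconformity threshold could be appended to an RL observation; the scores, window, and constants are placeholders, not SoNIC's implementation.

```python
import numpy as np

# Minimal adaptive conformal inference (ACI) update, shown only to illustrate
# how a nonconformity threshold could be maintained online and appended to an
# RL observation. All prediction details are placeholders, not SoNIC's code.

def aci_step(alpha_t, miscovered, target=0.1, gamma=0.01):
    # Grow alpha (tighter sets) after covered steps, shrink it (wider sets)
    # after miscovered steps, so empirical coverage tracks 1 - target.
    return alpha_t + gamma * (target - float(miscovered))

rng = np.random.default_rng(0)
scores, alpha = [], 0.1
for t in range(200):
    score = rng.exponential(0.5)               # stand-in nonconformity score
    if len(scores) > 20:
        q = np.quantile(scores, 1.0 - np.clip(alpha, 0.0, 1.0))
        alpha = aci_step(alpha, miscovered=score > q)
        augmented_obs = np.array([q])          # would be appended to the RL observation
    scores.append(score)
print(f"final alpha = {alpha:.3f}")
```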

replace Multimodal Query-guided Object Localization

Authors: Aditay Tripathi, Rajath R Dani, Anand Mishra, Anirban Chakraborty

Abstract: Consider a scenario in one-shot query-guided object localization where neither an image of the object nor the object category name is available as a query. In such a scenario, a hand-drawn sketch of the object could be a choice for a query. However, hand-drawn crude sketches alone, when used as queries, might be ambiguous for object localization, e.g., a sketch of a laptop could be confused for a sofa. On the other hand, a linguistic definition of the category, e.g., "a small portable computer small enough to use in your lap", along with the sketch query, gives better visual and semantic cues for object localization. In this work, we present a multimodal query-guided object localization approach under the challenging open-set setting. In particular, we use queries from two modalities, namely, hand-drawn sketch and description of the object (also known as gloss), to perform object localization. Multimodal query-guided object localization is a challenging task, especially when a large domain gap exists between the queries and the natural images, as well as due to the challenge of combining the complementary and minimal information present across the queries. For example, hand-drawn crude sketches contain abstract shape information of an object, while the text descriptions often capture partial semantic information about a given object category. To address the aforementioned challenges, we present a novel cross-modal attention scheme that guides the region proposal network to generate object proposals relevant to the input queries and a novel orthogonal projection-based proposal scoring technique that scores each proposal with respect to the queries, thereby yielding the final localization results. ...

replace IRGen: Generative Modeling for Image Retrieval

Authors: Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Jingdong Wang, Baining Guo

Abstract: While generative modeling has become prevalent across numerous research fields, its integration into the realm of image retrieval remains largely unexplored and underjustified. In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling and employing a sequence-to-sequence model. This approach is harmoniously aligned with the current trend towards unification in research, presenting a cohesive framework that allows for end-to-end differentiable searching. This, in turn, facilitates superior performance via direct optimization techniques. The development of our model, dubbed IRGen, addresses the critical technical challenge of converting an image into a concise sequence of semantic units, which is pivotal for enabling efficient and effective search. Extensive experiments demonstrate that our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks as well as two million-scale datasets, yielding significant improvement compared to prior competitive retrieval methods. In addition, the notable surge in precision scores facilitated by generative modeling presents the potential to bypass the reranking phase, which is traditionally indispensable in practical retrieval workflows.

replace DarSwin: Distortion Aware Radial Swin Transformer

Authors: Akshaya Athwale, Arman Afrasiyabi, Justin Lagüe, Ichrak Shili, Ola Ahmad, Jean-François Lalonde

Abstract: Wide-angle lenses are commonly used in perception tasks requiring a large field of view. Unfortunately, these lenses produce significant distortions, making conventional models that ignore the distortion effects unable to adapt to wide-angle images. In this paper, we present a novel transformer-based model that automatically adapts to the distortion produced by wide-angle lenses. Our proposed image encoder architecture, dubbed DarSwin, leverages the physical characteristics of such lenses analytically defined by the radial distortion profile. In contrast to conventional transformer-based architectures, DarSwin comprises a radial patch partitioning, a distortion-based sampling technique for creating token embeddings, and an angular position encoding for radial patch merging. Compared to other baselines, DarSwin achieves the best results on different datasets with significant gains when trained on bounded levels of distortions (very low, low, medium, and high) and tested on all, including out-of-distribution distortions. While the base DarSwin architecture requires knowledge of the radial distortion profile, we show it can be combined with a self-calibration network that estimates such a profile from the input image itself, resulting in a completely uncalibrated pipeline. Finally, we also present DarSwin-Unet, which extends DarSwin, to an encoder-decoder architecture suitable for pixel-level tasks. We demonstrate its performance on depth estimation and show through extensive experiments that DarSwin-Unet can perform zero-shot adaptation to unseen distortions of different wide-angle lenses. The code and models are publicly available at https://lvsn.github.io/darswin/

URLs: https://lvsn.github.io/darswin/

replace PhenoBench -- A Large Dataset and Benchmarks for Semantic Image Interpretation in the Agricultural Domain

Authors: Jan Weyler, Federico Magistri, Elias Marks, Yue Linn Chong, Matteo Sodano, Gianmarco Roggiolani, Nived Chebrolu, Cyrill Stachniss, Jens Behley

Abstract: The production of food, feed, fiber, and fuel is a key task of agriculture, which has to cope with many challenges in the upcoming decades, e.g., a higher demand, climate change, lack of workers, and the availability of arable land. Vision systems can support making better and more sustainable field management decisions, but also support the breeding of new crop varieties by allowing temporally dense and reproducible measurements. Recently, agricultural robotics has attracted increasing interest in the vision and robotics communities, since it is a promising avenue for coping with the aforementioned lack of workers and enabling more sustainable production. While large datasets and benchmarks in other domains are readily available and enable significant progress, agricultural datasets and benchmarks are comparably rare. We present an annotated dataset and benchmarks for the semantic interpretation of real agricultural fields. Our dataset, recorded with a UAV, provides high-quality, pixel-wise annotations of crops and weeds, but also crop leaf instances at the same time. Furthermore, we provide benchmarks for various tasks on a hidden test set comprised of different fields: known fields covered by the training data and a completely unseen field. Our dataset, benchmarks, and code are available at https://www.phenobench.org.

URLs: https://www.phenobench.org

replace Early Detection of Late Blight Tomato Disease using Histogram Oriented Gradient based Support Vector Machine

Authors: Yousef Alhwaiti, Muhammad Ishaq, Muhammad Hameed Siddiqi, Muhammad Waqas, Madallah Alruwaili, Saad Alanazi, Asfandyar Khan, Faheem Khan

Abstract: The tomato is one of the most important fruits on earth and plays an important role in the agricultural production of any country. This research proposes a novel smart technique for the early detection of late blight disease in tomatoes. The work expands the dataset (the Plant Village dataset) with additional images from the field and proposes a hybrid algorithm composed of support vector machines (SVM) and histogram-oriented gradients (HOG) for real-time detection of late blight tomato disease. The goals are to propose a HOG-based SVM model for early detection of late blight tomato leaf disease and to evaluate its performance in terms of MSE, accuracy, precision, and recall against Decision Tree and KNN. The integration of advanced technology in agriculture has the potential to revolutionize the industry, making it more efficient, sustainable, and profitable. This research on the early detection of tomato diseases contributes to the growing importance of smart farming, the need for climate-smart agriculture, the rising need to utilize natural resources more efficiently, and the demand for higher crop yields. The proposed hybrid algorithm of SVM and HOG has significant potential for the early detection of late blight disease in tomato plants, and the comparison against Decision Tree and KNN may assist in selecting the best algorithm for future applications. This work can help farmers make data-driven decisions to optimize crop yield and quality while also reducing the environmental impact of farming practices.
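A generic HOG + SVM pipeline of the kind described can be assembled from scikit-image and scikit-learn; the sketch below runs on random stand-in images and labels rather than the actual leaf data, and all HOG and SVM parameters are assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generic HOG + SVM pipeline sketch; images and labels below are random
# placeholders standing in for healthy vs late-blight leaf crops.
rng = np.random.default_rng(0)
images = rng.random((60, 64, 64))                 # placeholder grayscale leaf crops
labels = rng.integers(0, 2, size=60)              # 0 = healthy, 1 = late blight

features = np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)    # the HOG-based SVM classifier
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```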

replace Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation

Authors: Renjie Liang, Yiming Yang, Hui Lu, Li Li

Abstract: Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos. This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches exquisitely design complex architectures to improve accuracy with extra layers and losses, suffering from inefficiency and heaviness. Although some works have noticed this, they only address the feature fusion layers, which can hardly bring high-speed benefits to the whole clunky network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify the different outputs of the heterogeneous models into one single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers; the KAU module leverages multi-scale video and global query information to adaptively determine the weights of different teachers. A Shared Encoder strategy is then proposed to solve the problem that the student's shallow layers hardly benefit from teachers, in which an isomorphic teacher is collaboratively trained with the student to align their hidden states. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.
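A hedged sketch of multi-teacher distillation with adaptive per-sample teacher weights is given below; shapes, the weighting scheme, and the temperature are assumptions for illustration, not EMTM's exact design (in the paper the weights come from the KAU itself).

```python
import torch
import torch.nn.functional as F

# Illustrative multi-teacher distillation with adaptive per-sample teacher
# weights. All shapes, the weighting scheme, and the temperature are assumed.

def aggregate_soft_labels(teacher_logits, weights, temperature=2.0):
    """teacher_logits: (n_teachers, batch, n_bins); weights: (batch, n_teachers)."""
    probs = torch.softmax(teacher_logits / temperature, dim=-1)
    return torch.einsum("bt,tbn->bn", weights, probs)       # integrated soft labels

def kd_loss(student_logits, soft_labels, temperature=2.0):
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p, soft_labels, reduction="batchmean") * temperature ** 2

n_teachers, batch, n_bins = 3, 8, 100
teacher_logits = torch.randn(n_teachers, batch, n_bins)      # unified teacher outputs
student_logits = torch.randn(batch, n_bins, requires_grad=True)
weights = torch.softmax(torch.randn(batch, n_teachers), dim=-1)  # produced by the KAU in practice

loss = kd_loss(student_logits, aggregate_soft_labels(teacher_logits, weights))
loss.backward()
print(float(loss))
```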

replace Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

Authors: Jiawei Yao, Tong Wu, Xiaofeng Zhang

Abstract: Monocular depth estimation is an ongoing challenge in computer vision. Recent progress with Transformer models has demonstrated notable advantages over conventional CNNs in this area. However, there's still a gap in understanding how these models prioritize different regions in 2D images and how these regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we employ a sparse pixel approach to contrastively analyze the distinctions between the two. Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To further enhance the performance of Transformer models in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration. Additionally, we leverage optimal transport theory, treating depth maps as spatial probability distributions, and employ the optimal transport distance as a loss function to optimize our model. Experimental results demonstrate that models integrated with the plug-and-play Depth Gradient Refinement (DGR) module and the proposed loss function enhance performance without increasing complexity and computational costs on both outdoor KITTI and indoor NYU-Depth-v2 datasets. This research not only offers fresh insights into the distinctions between Transformers and CNNs in depth estimation but also paves the way for novel depth estimation methodologies.

replace MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices

Authors: Dongyang Yu, Haoyue Zhang, Ruisheng Zhao, Guoqi Chen, Wangpeng An, Yanhong Yang

Abstract: We present MovePose, an optimized lightweight convolutional neural network designed specifically for real-time body pose estimation on CPU-based mobile devices. Current solutions do not provide satisfactory accuracy and speed for human posture estimation, and MovePose addresses this gap, aiming to maintain real-time performance while improving the accuracy of human posture estimation for mobile devices. Our MovePose algorithm attains a Mean Average Precision (mAP) score of 68.0 on the COCO validation dataset. It runs at 69+ frames per second (fps) on an Intel i9-10920x CPU and at 452+ fps on an NVIDIA RTX3090 GPU; on an Android phone equipped with a Snapdragon 8 + 4G processor, the fps reaches above 11. To enhance accuracy, we incorporate three techniques: deconvolution, large kernel convolution, and coordinate classification methods. Compared to basic upsampling, deconvolution is trainable, improves model capacity, and enhances the receptive field. Large kernel convolution strengthens these properties at a decreased computational cost. In summary, MovePose provides high accuracy and real-time performance, making it a potential tool for a variety of applications, including those focused on mobile-side human posture estimation. The code and models for this algorithm will be made publicly accessible.
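The three accuracy ingredients (deconvolution, large-kernel convolution, and coordinate classification) can be illustrated with a small PyTorch head; every channel count, kernel size, and bin count below is an assumption for the sketch, not MovePose's configuration.

```python
import torch
import torch.nn as nn

# Illustrative head combining a trainable deconvolution, a large-kernel
# convolution, and coordinate classification (per-keypoint 1-D bin logits
# over x and y). Sizes are assumptions, not MovePose's exact design.

class CoordClassificationHead(nn.Module):
    def __init__(self, in_ch=256, n_kpts=17, feat_hw=(8, 6), x_bins=192, y_bins=256):
        super().__init__()
        h, w = feat_hw
        self.deconv = nn.ConvTranspose2d(in_ch, 128, kernel_size=4, stride=2, padding=1)
        self.large_kernel = nn.Conv2d(128, n_kpts, kernel_size=7, padding=3)
        flat = (2 * h) * (2 * w)                      # spatial size after the deconv
        self.to_x = nn.Linear(flat, x_bins)           # classify the x coordinate
        self.to_y = nn.Linear(flat, y_bins)           # classify the y coordinate

    def forward(self, feats):
        x = self.large_kernel(torch.relu(self.deconv(feats)))
        x = x.flatten(2)                              # (batch, n_kpts, h*w)
        return self.to_x(x), self.to_y(x)             # per-keypoint bin logits

head = CoordClassificationHead()
x_logits, y_logits = head(torch.randn(1, 256, 8, 6))
print(x_logits.shape, y_logits.shape)                 # (1, 17, 192) and (1, 17, 256)
```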

replace SRFNet: Monocular Depth Estimation with Fine-grained Structure via Spatial Reliability-oriented Fusion of Frames and Events

Authors: Tianbo Pan, Zidong Cao, Lin Wang

Abstract: Monocular depth estimation is a crucial task to measure distance relative to a camera, which is important for applications, such as robot navigation and self-driving. Traditional frame-based methods suffer from performance drops due to the limited dynamic range and motion blur. Therefore, recent works leverage novel event cameras to complement or guide the frame modality via frame-event feature fusion. However, event streams exhibit spatial sparsity, leaving some areas unperceived, especially in regions with marginal light changes. Therefore, direct fusion methods, e.g., RAMNet, often ignore the contribution of the most confident regions of each modality. This leads to structural ambiguity in the modality fusion process, thus degrading the depth estimation performance. In this paper, we propose a novel Spatial Reliability-oriented Fusion Network (SRFNet) that can estimate depth with fine-grained structure in both daytime and nighttime. Our method consists of two key technical components. Firstly, we propose an attention-based interactive fusion (AIF) module that applies spatial priors of events and frames as the initial masks and learns the consensus regions to guide the inter-modal feature fusion. The fused features are then fed back to enhance the frame and event feature learning. Meanwhile, it utilizes an output head to generate a fused mask, which is iteratively updated for learning consensual spatial priors. Secondly, we propose the Reliability-oriented Depth Refinement (RDR) module to estimate dense depth with the fine-grained structure based on the fused features and masks. We evaluate the effectiveness of our method on the synthetic and real-world datasets, which shows that, even without pretraining, our method outperforms the prior methods, e.g., RAMNet, especially in night scenes. Our project homepage: https://vlislab22.github.io/SRFNet.

URLs: https://vlislab22.github.io/SRFNet.

replace GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

Authors: Junjie Wang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, Qi Tian

Abstract: Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e. within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours).

replace Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.

URLs: https://github.com/aimagelab/safe-clip.

replace Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models

Authors: Zhengming Yu, Zhiyang Dou, Xiaoxiao Long, Cheng Lin, Zekun Li, Yuan Liu, Norman Müller, Taku Komura, Marc Habermann, Christian Theobalt, Xin Li, Wenping Wang

Abstract: We present Surf-D, a novel method for generating high-quality 3D shapes as Surfaces with arbitrary topologies using Diffusion models. Previous methods explored shape generation with different representations, but they suffer from limited topologies and poor geometry details. To generate high-quality surfaces of arbitrary topologies, we use the Unsigned Distance Field (UDF) as our surface representation to accommodate arbitrary topologies. Furthermore, we propose a new pipeline that employs a point-based AutoEncoder to learn a compact and continuous latent space for accurately encoding UDF and supporting high-resolution mesh extraction. We further show that our new pipeline significantly outperforms the prior approaches to learning the distance fields, such as the grid-based AutoEncoder, which is not scalable and incapable of learning accurate UDF. In addition, we adopt a curriculum learning strategy to efficiently embed various surfaces. With the pretrained shape latent space, we employ a latent diffusion model to acquire the distribution of various shapes. Extensive experiments are presented on using Surf-D for unconditional generation, category conditional generation, image conditional generation, and text-to-shape tasks. The experiments demonstrate the superior performance of Surf-D in shape generation across multiple modalities as conditions. Visit our project page at https://yzmblog.github.io/projects/SurfD/.

URLs: https://yzmblog.github.io/projects/SurfD/.

replace PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

Authors: Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu

Abstract: Text-to-image diffusion models are well-known for their ability to generate realistic images based on textual prompts. However, the existing works have predominantly focused on English, lacking support for non-English text-to-image models. The most commonly used translation methods cannot solve the generation problem related to language culture, while training from scratch on a specific language dataset is prohibitively expensive. In this paper, we propose a simple plug-and-play language transfer method based on knowledge distillation. All we need to do is train a lightweight MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation along with a small parallel data corpus. We are surprised to find that freezing the parameters of UNet can still achieve remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream tasks in cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion

URLs: https://github.com/OPPO-Mente-Lab/PEA-Diffusion

replace TLControl: Trajectory and Language Control for Human Motion Synthesis

Authors: Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, Lingjie Liu

Abstract: Controllable human motion synthesis is essential for applications in AR/VR, gaming and embodied AI. Existing methods often focus solely on either language or full trajectory control, lacking precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a novel method for realistic human motion synthesis, incorporating both low-level Trajectory and high-level Language semantics controls, through the integration of neural-based and optimization-based techniques. Specifically, we begin with training a VQ-VAE for a compact and well-structured latent motion space organized by body parts. We then propose a Masked Trajectories Transformer (MTT) for predicting a motion distribution conditioned on language and trajectory. Once trained, we use MTT to sample initial motion predictions given user-specified partial trajectories and text descriptions as conditioning. Finally, we introduce a test-time optimization to refine these coarse predictions for precise trajectory control, which offers flexibility by allowing users to specify various optimization goals and ensures high runtime efficiency. Comprehensive experiments show that TLControl significantly outperforms the state-of-the-art in trajectory accuracy and time efficiency, making it practical for interactive and high-quality animation generation.

replace Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching

Authors: Lennart Bastian, Yizheng Xie, Nassir Navab, Zorah Lähner

Abstract: Non-isometric shape correspondence remains a fundamental challenge in computer vision. Traditional methods using Laplace-Beltrami operator (LBO) eigenmodes face limitations in characterizing high-frequency extrinsic shape changes like bending and creases. We propose a novel approach of combining the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell hessian with the intrinsic ones of the LBO, creating a hybrid spectral space in which we construct functional maps. To this end, we present a theoretical framework to effectively integrate non-orthogonal basis functions into descriptor- and learning-based functional map methods. Our approach can be incorporated easily into existing functional map pipelines across varying applications and is able to handle complex deformations beyond isometries. We show extensive evaluations across various supervised and unsupervised settings and demonstrate significant improvements. Notably, our approach achieves up to 15% better mean geodesic error for non-isometric correspondence settings and up to 45% improvement in scenarios with topological noise.
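For orientation, a functional map in any spectral basis can be estimated as the least-squares solution of C A ≈ B, where A and B are descriptor functions projected onto the source and target bases. The sketch below does exactly that with random placeholder bases and descriptors standing in for the stacked LBO and elastic eigenmodes (the real elastic basis is non-orthogonal and needs the paper's careful treatment, which this sketch does not attempt).

```python
import numpy as np

# Minimal functional-map estimation in a "hybrid" spectral basis, as a sketch:
# descriptors are projected onto a basis that stacks intrinsic (LBO-like) and
# extrinsic (elastic-like) modes, and the map C solves C A ~= B in least squares.
# Bases and descriptors here are random placeholders, not real eigenmodes.
rng = np.random.default_rng(0)
n_vertices, k_lbo, k_elastic, n_descr = 500, 30, 20, 40

def hybrid_basis(n, k_int, k_ext):
    # Placeholder orthonormal columns standing in for the stacked eigenmodes.
    return np.linalg.qr(rng.normal(size=(n, k_int + k_ext)))[0]

phi_src = hybrid_basis(n_vertices, k_lbo, k_elastic)
phi_tgt = hybrid_basis(n_vertices, k_lbo, k_elastic)
descr_src = rng.normal(size=(n_vertices, n_descr))
descr_tgt = rng.normal(size=(n_vertices, n_descr))

A = phi_src.T @ descr_src        # descriptors in the source spectral basis
B = phi_tgt.T @ descr_tgt        # descriptors in the target spectral basis
C = B @ np.linalg.pinv(A)        # functional map: least-squares solution of C A = B
print(C.shape)                   # (k_lbo + k_elastic, k_lbo + k_elastic)
```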

replace DisControlFace: Adding Disentangled Control to Diffusion Autoencoder for One-shot Explicit Facial Image Editing

Authors: Haozhe Jia, Yan Li, Hengfei Cui, Di Xu, Yuwang Wang, Tao Yu

Abstract: In this work, we focus on exploring explicit fine-grained control of generative facial image editing, all while generating faithful facial appearances and consistent semantic details, which however, is quite challenging and has not been extensively explored, especially under a one-shot scenario. We identify the key challenge as the exploration of disentangled conditional control between high-level semantics and explicit parameters (e.g., 3DMM) in the generation process, and accordingly propose a novel diffusion-based editing framework, named DisControlFace. Specifically, we leverage a Diffusion Autoencoder (Diff-AE) as the semantic reconstruction backbone. To enable explicit face editing, we construct an Exp-FaceNet that is compatible with Diff-AE to generate spatial-wise explicit control conditions based on estimated 3DMM parameters. Different from current diffusion-based editing methods that train the whole conditional generative model from scratch, we freeze the pre-trained weights of the Diff-AE to maintain its semantically deterministic conditioning capability and accordingly propose a random semantic masking (RSM) strategy to effectively achieve an independent training of Exp-FaceNet. This setting endows the model with disentangled face control while reducing semantic information shift in editing. Our model can be trained using 2D in-the-wild portrait images without requiring 3D or video data and perform robust editing on any new facial image through a simple one-shot fine-tuning. Comprehensive experiments demonstrate that DisControlFace can generate realistic facial images with better editing accuracy and identity preservation over state-of-the-art methods. Project page: https://discontrolface.github.io/

URLs: https://discontrolface.github.io/

replace Video Understanding with Large Language Models: A Survey

Authors: Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

Abstract: With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

URLs: https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

replace Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Authors: Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas

Abstract: We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448{\times}448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.

replace Edge Detectors Can Make Deep Convolutional Neural Networks More Robust

Authors: Jin Ding, Jie-Chao Zhao, Yong-Zhi Sun, Ping Tan, Jia-Wei Wang, Ji-En Ma, You-Tong Fang

Abstract: Deep convolutional neural networks (DCNN for short) are vulnerable to examples with small perturbations. Improving the robustness of DCNNs is of great significance for safety-critical applications, such as autonomous driving and industry automation. Inspired by the principal way that human eyes recognize objects, i.e., largely relying on shape features, this paper first employs edge detectors as layer kernels and designs a binary edge feature branch (BEFB for short) to learn binary edge features, which can be easily integrated into any popular backbone. The four edge detectors can learn the horizontal, vertical, positive diagonal, and negative diagonal edge features, respectively, and the branch is stacked by multiple Sobel layers (using edge detectors as kernels) and one threshold layer. The binary edge features learned by the branch, concatenated with the texture features learned by the backbone, are fed into the fully connected layers for classification. We integrate the proposed branch into VGG16 and ResNet34, respectively, and conduct experiments on multiple datasets. Experimental results demonstrate that the BEFB is lightweight and has no side effects on training, and that the accuracy of BEFB-integrated models is better than that of the original ones on all datasets when facing FGSM, PGD, and C&W attacks. Besides, BEFB-integrated models equipped with robustness-enhancing techniques can achieve better classification accuracy compared to the original models. The work in this paper shows for the first time that it is feasible to enhance the robustness of DCNNs by combining both shape-like features and texture features.
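A binary edge feature branch of this kind is easy to sketch: four fixed Sobel-style kernels followed by a hard threshold, whose output would be concatenated with the backbone's texture features. The kernel values, threshold, and integration point below are assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn

# Sketch of a binary edge feature branch: four Sobel-style kernels (horizontal,
# vertical, and the two diagonals) followed by a hard threshold. Kernel values,
# the threshold, and where the output is concatenated are assumptions.

class BinaryEdgeBranch(nn.Module):
    def __init__(self, threshold=0.5):
        super().__init__()
        sobel_h = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]])
        sobel_v = sobel_h.t()
        diag_p = torch.tensor([[0., 1., 2.], [-1., 0., 1.], [-2., -1., 0.]])
        diag_n = torch.tensor([[-2., -1., 0.], [-1., 0., 1.], [0., 1., 2.]])
        kernels = torch.stack([sobel_h, sobel_v, diag_p, diag_n]).unsqueeze(1)  # (4,1,3,3)
        self.conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)
        self.conv.weight = nn.Parameter(kernels)          # edge detectors as kernels
        self.threshold = threshold

    def forward(self, gray):                               # gray: (batch, 1, H, W)
        edges = self.conv(gray).abs()
        return (edges > self.threshold).float()            # binary edge features

branch = BinaryEdgeBranch()
binary_edges = branch(torch.rand(2, 1, 32, 32))
# In the described setup, these features are concatenated with the backbone's
# texture features before the fully connected classifier.
print(binary_edges.shape)                                  # (2, 4, 32, 32)
```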

replace Continuous Memory Representation for Anomaly Detection

Authors: Joo Chan Lee, Taejune Kim, Eunbyung Park, Simon S. Woo, Jong Hwan Ko

Abstract: There have been significant advancements in anomaly detection in an unsupervised manner, where only normal images are available for training. Several recent methods aim to detect anomalies based on a memory, comparing or reconstructing the input with directly stored normal features (or trained features with normal images). However, such memory-based approaches operate on a discrete feature space implemented by the nearest neighbor or attention mechanism, suffering from poor generalization or an identity shortcut issue outputting the same as input, respectively. Furthermore, the majority of existing methods are designed to detect single-class anomalies, resulting in unsatisfactory performance when presented with multiple classes of objects. To tackle all of the above challenges, we propose CRAD, a novel anomaly detection method for representing normal features within a "continuous" memory, enabled by transforming spatial features into coordinates and mapping them to continuous grids. Furthermore, we carefully design the grids tailored for anomaly detection, representing both local and global normal features and fusing them effectively. Our extensive experiments demonstrate that CRAD successfully generalizes the normal features and mitigates the identity shortcut; furthermore, CRAD effectively handles diverse classes in a single model thanks to the high-granularity continuous representation. In an evaluation using the MVTec AD dataset, CRAD significantly outperforms the previous state-of-the-art method by reducing the error by 65.0% for multi-class unified anomaly detection. The project page is available at https://tae-mo.github.io/crad/.

URLs: https://tae-mo.github.io/crad/.

replace PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning

Authors: Qifeng Zhou, Wenliang Zhong, Yuzhi Guo, Michael Xiao, Hehuan Ma, Junzhou Huang

Abstract: In the field of computational histopathology, both whole slide images (WSIs) and diagnostic captions provide valuable insights for making diagnostic decisions. However, aligning WSIs with diagnostic captions presents a significant challenge. This difficulty arises from two main factors: 1) Gigapixel WSIs are unsuitable for direct input into deep learning models, and the redundancy and correlation among the patches demand more attention; and 2) Authentic WSI diagnostic captions are extremely limited, making it difficult to train an effective model. To overcome these obstacles, we present PathM3, a multimodal, multi-task, multiple instance learning (MIL) framework for WSI classification and captioning. PathM3 adapts a query-based transformer to effectively align WSIs with diagnostic captions. Given that histopathology visual patterns are redundantly distributed across WSIs, we aggregate each patch feature with a MIL method that considers the correlations among instances. Furthermore, our PathM3 overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data in the manner of multi-task joint learning. Extensive experiments with improved classification accuracy and caption generation demonstrate the effectiveness of our method on both WSI classification and captioning tasks.
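As a rough illustration of instance aggregation over a WSI's patches, the sketch below uses a generic attention-based MIL pooling layer; PathM3's actual aggregator additionally models inter-instance correlation and feeds a query-based transformer, which this sketch does not attempt. Dimensions are assumed.

```python
import torch
import torch.nn as nn

# Generic attention-based MIL pooling over patch features, shown only to
# illustrate instance aggregation for a whole slide image. This is not
# PathM3's aggregator, which also models correlations among instances.

class AttentionMILPool(nn.Module):
    def __init__(self, dim=512, hidden=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, patch_feats):                         # (n_patches, dim), one slide
        attn = torch.softmax(self.score(patch_feats), dim=0)  # (n_patches, 1)
        return (attn * patch_feats).sum(dim=0)              # slide-level embedding (dim,)

pool = AttentionMILPool()
slide_embedding = pool(torch.randn(1000, 512))              # 1000 patch features from one WSI
print(slide_embedding.shape)                                # torch.Size([512])
```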

replace MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Authors: Guoxing Sun, Rishabh Dabral, Pascal Fua, Christian Theobalt, Marc Habermann

Abstract: Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in Vision and Graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups, perform poorly when naively supervising them on sparse camera views, as the field simply overfits to the sparse-view inputs. To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. Our key idea is to meta-learn the radiance field weights solely from potentially sparse multi-view videos, which can serve as a prior when fine-tuning them on sparse imagery depicting the human. This prior provides a good network weight initialization, thereby effectively addressing ambiguities in sparse-view capture. Due to the articulated structure of the human body and motion-induced surface deformations, learning such a prior is non-trivial. Therefore, we propose to meta-learn the field weights in a pose-canonicalized space, which reduces the spatial feature range and makes feature learning more effective. Consequently, one can fine-tune our field parameters to quickly generalize to unseen poses, novel illumination conditions as well as novel and sparse (even monocular) camera views. For evaluating our method under different scenarios, we collect a new dataset, WildDynaCap, which contains subjects captured both in a dense camera dome and in in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both the public and WildDynaCap datasets.

replace In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition

Authors: Wiktor Mucha, Martin Kampel

Abstract: Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort. Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor. In contrast, there has been insufficient research in understanding 2D hand pose for egocentric action recognition, despite the availability of user-friendly smart glasses in the market capable of capturing a single RGB image. Our study aims to fill this research gap by exploring the field of 2D hand pose estimation for egocentric action recognition, making two contributions. Firstly, we introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective, capturing interactions between hands and objects. Both methods outperform state-of-the-art models on H2O and FPHA public benchmarks. Secondly, we present a robust action recognition architecture from 2D hand and object poses. This method incorporates EffHandEgoNet and a transformer-based action recognition method. Evaluated on H2O and FPHA datasets, our architecture has a faster inference time and achieves an accuracy of 91.32% and 94.43%, respectively, surpassing the state of the art, including 3D-based methods. Our work demonstrates that using 2D skeletal data is a robust approach for egocentric action understanding. Extensive evaluation and ablation studies show the impact of the hand pose estimation approach, and how each input affects the overall performance.

replace Leveraging Temporal Contextualization for Video Action Recognition

Authors: Minji Kim, Dongyoon Han, Taekyung Kim, Bohyung Han

Abstract: We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global interactions in a spatio-temporal domain within a video. To be specific, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames for the summarization into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies for TC and VP support our design choices. Our project page with the source code is available at https://github.com/naver-ai/tc-clip

URLs: https://github.com/naver-ai/tc-clip

replace Magic Clothing: Controllable Garment-Driven Image Synthesis

Authors: Weifeng Chen, Tao Gu, Yuhao Xu, Chengcai Chen

Abstract: We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for an unexplored garment-driven image synthesis task. Aiming at generating customized characters wearing the target garments with diverse text prompts, the image controllability is the most critical issue, i.e., to preserve the garment details and maintain faithfulness to the text prompts. To this end, we introduce a garment extractor to capture the detailed garment features, and employ self-attention fusion to incorporate them into the pretrained LDMs, ensuring that the garment details remain unchanged on the target character. Then, we leverage the joint classifier-free guidance to balance the control of garment features and text prompts over the generated results. Meanwhile, the proposed garment extractor is a plug-in module applicable to various finetuned LDMs, and it can be combined with other extensions like ControlNet and IP-Adapter to enhance the diversity and controllability of the generated characters. Furthermore, we design Matched-Points-LPIPS (MP-LPIPS), a robust metric for evaluating the consistency of the target image to the source garment. Extensive experiments demonstrate that our Magic Clothing achieves state-of-the-art results under various conditional controls for garment-driven image synthesis. Our source code is available at https://github.com/ShineChen1024/MagicClothing.

URLs: https://github.com/ShineChen1024/MagicClothing.
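
Joint classifier-free guidance amounts to combining denoiser outputs obtained under different conditioning drop-outs. The helper below shows one common way to compose a garment-guidance scale and a text-guidance scale; the exact formulation and scale values used by Magic Clothing are assumptions for illustration:

    import torch

    def joint_cfg(eps_uncond, eps_garment, eps_full, s_garment=2.5, s_text=7.5):
        """Combine denoiser outputs for garment-and-text guided sampling.

        eps_uncond:  prediction with both conditions dropped
        eps_garment: prediction with only the garment condition
        eps_full:    prediction with garment + text prompt
        One illustrative composition: guide from unconditional toward the
        garment, then further toward the full (garment + text) condition.
        """
        return (eps_uncond
                + s_garment * (eps_garment - eps_uncond)
                + s_text * (eps_full - eps_garment))

    # Usage inside a diffusion sampling step (shapes: latent noise maps).
    shape = (1, 4, 64, 64)
    eps = joint_cfg(torch.randn(shape), torch.randn(shape), torch.randn(shape))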

replace AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception

Authors: Yipo Huang, Xiangfei Sheng, Zhichao Yang, Quan Yuan, Zhichao Duan, Pengfei Chen, Leida Li, Weisi Lin, Guangming Shi

Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The lack of human-annotated multi-modality aesthetic data further exacerbates this dilemma, resulting in MLLMs falling short of aesthetics perception capabilities. To address the above challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K human natural language feedback entries, which are collected via progressive questions, ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction tuning dataset, i.e. AesMMIT, which consists of 409K multi-typed instructions to activate stronger aesthetic capabilities. Based on the AesMMIT database, we fine-tune the open-sourced general foundation models, achieving multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performances than the state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project homepage: https://yipoh.github.io/aes-expert/.

URLs: https://yipoh.github.io/aes-expert/.

replace FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving

Authors: Xingtai Gui, Tengteng Huang, Haonan Shao, Haotian Yao, Chi Zhang

Abstract: Future instance prediction from a Bird's Eye View (BEV) perspective is a vital component in autonomous driving, involving future instance segmentation and instance motion prediction. Existing methods usually rely on a redundant and complex pipeline that requires multiple auxiliary outputs and post-processing procedures, and estimation errors in each auxiliary prediction degrade the final prediction performance. In this paper, we propose a simple yet effective fully end-to-end framework named Future Instance Prediction Transformer (FipTR), which views the task as BEV instance segmentation and prediction for future frames. We propose to adopt instance queries representing specific traffic participants to directly estimate the corresponding future occupied masks, and thus get rid of complex post-processing procedures. Besides, we devise a flow-aware BEV predictor for future BEV feature prediction, composed of a flow-aware deformable attention in which backward flow guides the offset sampling. A novel future instance matching strategy is also proposed to further improve the temporal coherence. Extensive experiments demonstrate the superiority of FipTR and its effectiveness under different temporal BEV encoders. The code is available at https://github.com/TabGuigui/FipTR.

URLs: https://github.com/TabGuigui/FipTR

replace On the Federated Learning Framework for Cooperative Perception

Authors: Zhenrong Zhang, Jianan Liu, Xi Zhou, Tao Huang, Qing-Long Han, Jingxin Liu, Hongbin Liu

Abstract: Cooperative perception is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for cooperative perception (CP), termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by a dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.
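
The two ingredients, dynamic client weighting during aggregation and a KLD-based regularizer in the local objective, can be sketched as follows. The softmax-over-validation-loss weighting rule and the loss mixing coefficient are illustrative assumptions rather than the exact FedDWA and DALoss definitions:

    import torch
    import torch.nn.functional as F

    def dynamic_weighted_aggregate(client_states, client_val_losses, temperature=1.0):
        """Average client model states with weights favoring better clients.

        Illustrative rule: softmax over negative validation losses, so clients
        whose local models validate better contribute more to the global model.
        """
        losses = torch.tensor(client_val_losses, dtype=torch.float32)
        weights = torch.softmax(-losses / temperature, dim=0)
        global_state = {}
        for key in client_states[0]:
            global_state[key] = sum(w * s[key].float()
                                    for w, s in zip(weights, client_states))
        return global_state

    def local_loss(logits, targets, global_logits, kl_weight=0.1):
        """Task loss plus a KL term pulling local predictions toward the
        global model's predictions, to counteract non-IID drift (sketch)."""
        task = F.cross_entropy(logits, targets)
        kl = F.kl_div(F.log_softmax(logits, dim=-1),
                      F.softmax(global_logits, dim=-1),
                      reduction="batchmean")
        return task + kl_weight * kl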

replace LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Authors: Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, Shanghang Zhang

Abstract: The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution within datasets provides a comprehensive understanding of the datasets, standing as a powerful tool beneficial to various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, there has been no work that systematically explores the subpopulation distribution of datasets to our knowledge. To address the limitation and solve all the mentioned tasks in a unified way, we introduce a novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.

replace Rasterized Edge Gradients: Handling Discontinuities Differentiably

Authors: Stanislav Pidhorskyi, Tomas Simon, Gabriel Schwartz, He Wen, Yaser Sheikh, Jason Saragih

Abstract: Computing the gradients of a rendering process is paramount for diverse applications in computer vision and graphics. However, accurate computation of these gradients is challenging due to discontinuities and rendering approximations, particularly for surface-based representations and rasterization-based rendering. We present a novel method for computing gradients at visibility discontinuities for rasterization-based differentiable renderers. Our method elegantly simplifies the traditionally complex problem through a carefully designed approximation strategy, allowing for a straightforward, effective, and performant solution. We introduce a novel concept of micro-edges, which allows us to treat the rasterized images as outcomes of a differentiable, continuous process aligned with the inherently non-differentiable, discrete-pixel rasterization. This technique eliminates the necessity for rendering approximations or other modifications to the forward pass, preserving the integrity of the rendered image, which makes it applicable to rasterized masks, depth, and normals images where filtering is prohibitive. Utilizing micro-edges simplifies gradient interpretation at discontinuities and enables handling of geometry intersections, offering an advantage over the prior art. We showcase our method in dynamic human head scene reconstruction, demonstrating effective handling of camera images and segmentation masks.

replace Unifying 3D Vision-Language Understanding via Promptable Queries

Authors: Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, Qing Li

Abstract: A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates impressive performance on these tasks, setting new records on most benchmarks. Particularly, PQ3D improves the state-of-the-art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% (acc@0.5), Multi3DRefer by 11.7% (F1@0.5), and Scan2Cap by 13.4% (CIDEr@0.5). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., solely voxel input.

replace Learning to Transform Dynamically for Better Adversarial Transferability

Authors: Rongyi Zhu, Zeliang Zhang, Susan Liang, Zhuo Liu, Chenliang Xu

Abstract: Adversarial examples, crafted by adding perturbations imperceptible to humans, can deceive neural networks. Recent studies identify the adversarial transferability across various models, \textit{i.e.}, the cross-model attack ability of adversarial samples. To enhance such adversarial transferability, existing input transformation-based methods diversify input data with transformation augmentation. However, their effectiveness is limited by the finite number of available transformations. In our study, we introduce a novel approach named Learning to Transform (L2T). L2T increases the diversity of transformed images by selecting the optimal combination of operations from a pool of candidates, consequently improving adversarial transferability. We conceptualize the selection of optimal transformation combinations as a trajectory optimization problem and employ a reinforcement learning strategy to effectively solve the problem. Comprehensive experiments on the ImageNet dataset, as well as practical tests with Google Vision and GPT-4V, reveal that L2T surpasses current methodologies in enhancing adversarial transferability, thereby confirming its effectiveness and practical significance. The code is available at https://github.com/RongyiZhu/L2T.

URLs: https://github.com/RongyiZhu/L2T.

replace Trackastra: Transformer-based cell tracking for live-cell microscopy

Authors: Benjamin Gallusser, Martin Weigert

Abstract: Cell tracking is a ubiquitous image analysis task in live-cell microscopy. Unlike multiple object tracking (MOT) for natural images, cell tracking typically involves hundreds of similar-looking objects that can divide in each frame, making it a particularly challenging problem. Current state-of-the-art approaches follow the tracking-by-detection paradigm, i.e. first all cells are detected per frame and successively linked in a second step to form biologically consistent cell tracks. Linking is commonly solved via discrete optimization methods, which require manual tuning of hyperparameters for each dataset and are therefore cumbersome to use in practice. Here we propose Trackastra, a general purpose cell tracking approach that uses a simple transformer architecture to directly learn pairwise associations of cells within a temporal window from annotated data. Importantly, unlike existing transformer-based MOT pipelines, our learning architecture also accounts for dividing objects such as cells and allows for accurate tracking even with simple greedy linking, thus making strides towards removing the requirement for a complex linking step. The proposed architecture operates on the full spatio-temporal context of detections within a time window by avoiding the computational burden of processing dense images. We show that our tracking approach performs on par with or better than highly tuned state-of-the-art cell tracking algorithms for various biological datasets, such as bacteria, cell cultures and fluorescent particles. We provide code at https://github.com/weigertlab/trackastra.

URLs: https://github.com/weigertlab/trackastra.
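
The greedy linking step can be illustrated in a few lines of NumPy: given predicted association scores between detections in consecutive frames, repeatedly take the highest-scoring pair, allowing a parent to be linked to up to two children to accommodate divisions. The threshold and the two-children rule are illustrative choices, not Trackastra's exact procedure:

    import numpy as np

    def greedy_link(assoc, threshold=0.5, max_children=2):
        """Link detections at frame t to frame t+1 from a score matrix.

        assoc[i, j] is the predicted association score between detection i
        (frame t) and detection j (frame t+1). Each child gets one parent;
        a parent may take up to `max_children` children (cell division).
        """
        scores = assoc.copy().astype(float)
        children_taken = np.zeros(assoc.shape[0], dtype=int)
        links = []
        while True:
            i, j = np.unravel_index(np.argmax(scores), scores.shape)
            if scores[i, j] < threshold:
                break
            links.append((int(i), int(j)))
            scores[:, j] = -np.inf                 # child j is now assigned
            children_taken[i] += 1
            if children_taken[i] >= max_children:  # parent i is saturated
                scores[i, :] = -np.inf
        return links

    # Usage: 3 cells in frame t, 4 detections in frame t+1.
    rng = np.random.default_rng(0)
    print(greedy_link(rng.random((3, 4))))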

replace VAAD: Visual Attention Analysis Dashboard applied to e-Learning

Authors: Miriam Navarro, \'Alvaro Becerra, Roberto Daza, Ruth Cobos, Aythami Morales, Julian Fierrez

Abstract: In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD, an acronym for Visual Attention Analysis Dashboard. These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.

replace P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

Authors: Qi Zhang, Guohua Geng, Longquan Yan, Pengbo Zhou, Zhaodi Li, Kang Li, Qinglin Liu

Abstract: Diffusion models and multi-scale features are essential components in semantic segmentation tasks that deal with remote-sensing images. They contribute to improved segmentation boundaries and offer significant contextual information. U-net-like architectures are frequently employed in diffusion models for segmentation tasks. These architectural designs include dense skip connections that may pose challenges for interpreting intermediate features. Consequently, they might not efficiently convey semantic information throughout various layers of the encoder-decoder architecture. To address these challenges, we propose a new model for semantic segmentation known as the diffusion model with parallel multi-scale branches. This model consists of Parallel Multiscale Diffusion modules (P-MSDiff) and a Cross-Bridge Linear Attention mechanism (CBLA). P-MSDiff enhances the understanding of semantic information across multiple levels of granularity and detects repetitive distribution data through the integration of recursive denoising branches. It further facilitates the amalgamation of data by connecting relevant branches to the primary framework to enable concurrent denoising. Furthermore, within the interconnected transformer architecture, the LA module has been substituted with the CBLA module. This module integrates a semidefinite matrix linked to the query into the dot product computation of keys and values. This integration enables the adaptation of queries within the LA framework. This adjustment enhances the structure for multi-head attention computation, leading to enhanced network performance; moreover, CBLA is a plug-and-play module. Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets, showing improvements of 1.60% and 1.40% over strong baseline models, respectively.

replace MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

Authors: Gabriele Berton, Lorenz Junglas, Riccardo Zaccone, Thomas Pollok, Barbara Caputo, Carlo Masone

Abstract: Mesh-based scene representation offers a promising direction for simplifying large-scale hierarchical visual localization pipelines, combining a visual place recognition step based on global features (retrieval) and a visual localization step based on local features. While existing work demonstrates the viability of meshes for visual localization, the impact of using synthetic databases rendered from them in visual place recognition remains largely unexplored. In this work, we investigate using dense 3D textured meshes for large-scale Visual Place Recognition (VPR). We identify a significant performance drop when using synthetic mesh-based image databases compared to real-world images for retrieval. To address this, we propose MeshVPR, a novel VPR pipeline that utilizes a lightweight feature alignment framework to bridge the gap between real-world and synthetic domains. MeshVPR leverages pre-trained VPR models and is efficient and scalable for city-wide deployments. We introduce novel datasets with freely available 3D meshes and manually collected queries from Berlin, Paris, and Melbourne. Extensive evaluations demonstrate that MeshVPR achieves competitive performance with standard VPR pipelines, paving the way for mesh-based localization systems. Data, code, and interactive visualizations are available at https://meshvpr.github.io/

URLs: https://meshvpr.github.io/

replace Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

Authors: Amandeep Kumar, Muhammad Awais, Sanath Narayan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

Abstract: Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers, learning separate model weights for each attribute, and are not scalable for novel attributes. In this work, we propose an efficient, plug-and-play, 3D-aware face editing framework based on attribute-specific prompt learning, enabling the generation of facial images with controllable attributes across various target poses. To this end, we introduce a text-driven learnable style token-based latent attribute editor (LAE). The LAE harnesses a pre-trained vision-language model to find text-guided attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN. It utilizes learnable style tokens and style mappers to learn and transform this editing direction to 3D latent space. To train LAE with multiple attributes, we use directional contrastive loss and style token loss. Furthermore, to ensure view consistency and identity preservation across different poses and attributes, we employ several 3D-aware identity and pose preservation losses. Our experiments show that our proposed framework generates high-quality images with 3D awareness and view consistency while maintaining attribute-specific features. We demonstrate the effectiveness of our method on different facial attributes, including hair color and style, expression, and others.

replace MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Authors: Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt

Abstract: Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. Our data and code will be released at https://github.com/mlfoundations/MINT-1T.

URLs: https://github.com/mlfoundations/MINT-1T.

replace The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval

Authors: Boris Meinardus, Anil Batra, Anna Rohrbach, Marcus Rohrbach

Abstract: Recent studies have shown promising results in utilizing multimodal large language models (MLLMs) for computer vision tasks such as object detection and semantic segmentation. However, many challenging video tasks remain under-explored. Video-language tasks necessitate spatial and temporal comprehension and require significant compute. Therefore, prior works have developed complex, highly specialized architectures or leveraged additional input signals such as video transcripts to best encode contextual and temporal information, which limits their generality and can be impractical. One particularly challenging task is video moment retrieval, which requires precise temporal and contextual grounding. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We introduce Mr. BLIP (Mr. as in Moment Retrieval), a multimodal, single-stage model that requires no expensive video-language pretraining, no additional input signal (e.g., no transcript or audio), and has a simpler and more versatile design than prior state-of-the-art methods. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions. Notably, we attain over 9% (absolute) higher Recall (at 0.5 and 0.7 IoU) on the challenging long-video multi-moment QVHighlights benchmark. Our code is publicly available.

replace FDS: Feedback-guided Domain Synthesis with Multi-Source Conditional Diffusion Models for Domain Generalization

Authors: Mehrdad Noori, Milad Cheraghalikhani, Ali Bahri, Gustavo Adolfo Vargas Hakim, David Osowiechi, Moslem Yazdanpanah, Ismail Ben Ayed, Christian Desrosiers

Abstract: Domain Generalization techniques aim to enhance model robustness by simulating novel data distributions during training, typically through various augmentation or stylization strategies. However, these methods frequently suffer from limited control over the diversity of generated images and lack assurance that these images span distinct distributions. To address these challenges, we propose FDS, Feedback-guided Domain Synthesis, a novel strategy that employs diffusion models to synthesize novel, pseudo-domains by training a single model on all source domains and performing domain mixing based on learned features. By incorporating images that pose classification challenges to models trained on original samples, alongside the original dataset, we ensure the generation of a training set that spans a broad distribution spectrum. Our comprehensive evaluations demonstrate that this methodology sets new benchmarks in domain generalization performance across a range of challenging datasets, effectively managing diverse types of domain shifts. The implementation is available at: \url{https://github.com/Mehrdad-Noori/FDS.git}.

URLs: https://github.com/Mehrdad-Noori/FDS.git

replace 3D Adaptive Structural Convolution Network for Domain-Invariant Point Cloud Recognition

Authors: Younggun Kim (Department of Civil Engineering, University of Central Florida, Florida, USA), Beomsik Cho (Department of AI Mobility Engineering, Suwon, South Korea), Seonghoon Ryoo (Department of AI Mobility Engineering, Suwon, South Korea), Soomok Lee (Department of AI Mobility Engineering, Suwon, South Korea)

Abstract: Adapting deep learning networks for point cloud data recognition in self-driving vehicles faces challenges due to the variability in datasets and sensor technologies, emphasizing the need for adaptive techniques to maintain accuracy across different conditions. In this paper, we introduce the 3D Adaptive Structural Convolution Network (3D-ASCN), a cutting-edge framework for 3D point cloud recognition. It combines 3D convolution kernels, a structural tree structure, and adaptive neighborhood sampling for effective geometric feature extraction. This method obtains domain-invariant features and demonstrates robust, adaptable performance on a variety of point cloud datasets, ensuring compatibility across diverse sensor configurations without the need for parameter adjustments. This highlights its potential to significantly enhance the reliability and efficiency of self-driving vehicle technology.

replace Inter and Intra Prior Learning-based Hyperspectral Image Reconstruction Using Snapshot SWIR Metasurface

Authors: Linqiang Li, Jinglei Hao, Yongqiang Zhao, Pan Liu, Haofang Yan, Ziqin Zhang, Seong G. Kong

Abstract: Shortwave-infrared (SWIR) spectral information, ranging from 1 {\mu}m to 2.5 {\mu}m, overcomes the limitations of traditional color cameras in acquiring scene information. However, conventional SWIR hyperspectral imaging systems face challenges due to their bulky setups and low acquisition speeds. This work introduces a snapshot SWIR hyperspectral imaging system based on a metasurface filter and a corresponding filter selection method to achieve the lowest correlation coefficient among these filters. This system offers the advantages of compact size and snapshot imaging. We propose a novel inter and intra prior learning unfolding framework to achieve high-quality SWIR hyperspectral image reconstruction, which bridges the gap between prior learning and cross-stage information interaction. Additionally, we design an adaptive feature transfer mechanism to adaptively transfer the contextual correlation of multi-scale encoder features to prevent detailed information loss in the decoder. Experimental results demonstrate that our method can reconstruct hyperspectral images with high speed and superior performance over existing methods.

replace MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection

Authors: Ziyue Huang, Yongchao Feng, Qingjie Liu, Yunhong Wang

Abstract: Detection pre-training methods for the DETR series detector have been extensively studied in natural scenes, e.g., DETReg. However, detection pre-training remains unexplored in remote sensing scenes. In existing pre-training methods, alignment between object embeddings extracted from a pre-trained backbone and detector features is significant. However, due to differences in feature extraction methods, a pronounced feature discrepancy still exists and hinders the pre-training performance. Remote sensing images with complex environments and more densely distributed objects exacerbate this discrepancy. In this work, we propose a novel Mutually optimizing pre-training framework for remote sensing object Detection, dubbed MutDet. In MutDet, we propose a systematic solution to this challenge. Firstly, we propose a mutual enhancement module, which fuses the object embeddings and detector features bidirectionally in the last encoder layer, enhancing their information interaction. Secondly, a contrastive alignment loss is employed to guide this alignment process softly and simultaneously enhance the discriminability of detector features. Finally, we design an auxiliary siamese head to mitigate the task gap arising from the introduction of the enhancement module. Comprehensive experiments on various settings show new state-of-the-art transfer performance. The improvement is particularly pronounced when data quantity is limited. When using 10% of the DIOR-R data, MutDet improves DETReg by 6.1% in AP50. Codes and models are available at: https://github.com/floatingstarZ/MutDet.

URLs: https://github.com/floatingstarZ/MutDet.
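
The contrastive alignment between detector features and pre-extracted object embeddings can be sketched as a symmetric InfoNCE loss over matched pairs; the temperature and the symmetric form are common defaults assumed here, not necessarily MutDet's exact loss:

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(detector_feats, object_embeds, temperature=0.07):
        """Pull matched (detector feature, object embedding) pairs together.

        Both inputs: (num_objects, dim); row i of each side corresponds to the
        same object. Symmetric InfoNCE over the in-batch similarity matrix.
        """
        d = F.normalize(detector_feats, dim=-1)
        o = F.normalize(object_embeds, dim=-1)
        logits = d @ o.t() / temperature            # (N, N) cosine similarities
        targets = torch.arange(d.size(0), device=d.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Usage with random features for 16 objects of dimension 256.
    loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))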

replace MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes

Authors: Casper van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn, Yuntae Jeon, Michael Franzen, Matthias Standfest, Jan van Gemert, Seyran Khademi

Abstract: Diverse and realistic floor plan data are essential for the development of useful computer-aided methods in architectural design. Today's large-scale floor plan datasets predominantly feature simple floor plan layouts, typically representing single-apartment dwellings only. To compensate for the mismatch between current datasets and the real world, we develop \textbf{Modified Swiss Dwellings} (MSD) -- the first large-scale floor plan dataset that contains a significant share of layouts of multi-apartment dwellings. MSD features over 5.3K floor plans of medium- to large-scale building complexes, covering over 18.9K distinct apartments. We validate that existing approaches for floor plan generation, while effective in simpler scenarios, cannot yet seamlessly address the challenges posed by MSD. Our benchmark calls for new research in floor plan machine understanding. Code and data are open.

replace TLRN: Temporal Latent Residual Networks For Large Deformation Image Registration

Authors: Nian Wu, Jiarui Xing, Miaomiao Zhang

Abstract: This paper presents a novel approach, termed {\em Temporal Latent Residual Network (TLRN)}, to predict a sequence of deformation fields in time-series image registration. The challenge of registering time-series images often lies in the occurrence of large motions, especially when images differ significantly from a reference (e.g., the start of a cardiac cycle compared to the peak stretching phase). To achieve accurate and robust registration results, we leverage the nature of motion continuity and exploit the temporal smoothness in consecutive image frames. Our proposed TLRN highlights a temporal residual network with residual blocks carefully designed in latent deformation spaces, which are parameterized by time-sequential initial velocity fields. We treat a sequence of residual blocks over time as a dynamic training system, where each block is designed to learn the residual function between desired deformation features and current input accumulated from previous time frames. We validate the effectiveness of TLRN on both synthetic data and real-world cine cardiac magnetic resonance (CMR) image videos. Our experimental results show that TLRN achieves substantially improved registration accuracy compared to the state-of-the-art. Our code is publicly available at https://github.com/nellie689/TLRN.

URLs: https://github.com/nellie689/TLRN.

replace Video-Language Alignment via Spatio-Temporal Graph Transformer

Authors: Shi-Xue Zhang, Hongfa Wang, Xiaobin Zhu, Weibo Gu, Tianjin Zhang, Chun Yang, Wei Liu, Xu-Cheng Yin

Abstract: Video-language alignment is a crucial multi-modal task that benefits various downstream applications, e.g., video-text retrieval and video question answering. Existing methods either utilize multi-modal information in video-text pairs or apply global and local alignment techniques to promote alignment precision. However, these methods often fail to fully explore the spatio-temporal relationships among vision tokens within video and across different video-text pairs. In this paper, we propose a novel Spatio-Temporal Graph Transformer module to uniformly learn spatial and temporal contexts for video-language alignment pre-training (dubbed STGT). Specifically, our STGT combines spatio-temporal graph structure information with attention in the transformer block, effectively utilizing the spatio-temporal contexts. In this way, we can model the relationships between vision tokens, promoting video-text alignment precision for benefiting downstream tasks. In addition, we propose a self-similarity alignment loss to explore the inherent self-similarity in the video and text. With the initial optimization achieved by contrastive learning, it can further promote the alignment accuracy between video and text. Experimental results on challenging downstream tasks, including video-text retrieval and video question answering, verify the superior performance of our method.
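
The self-similarity alignment idea can be written compactly: compute a cosine self-similarity matrix within each modality's token sequence and penalize the discrepancy between the two matrices. The MSE penalty and the equal-length assumption below are illustrative simplifications rather than the paper's exact loss:

    import torch
    import torch.nn.functional as F

    def self_similarity(tokens):
        """Cosine self-similarity matrix of a token sequence (N, dim) -> (N, N)."""
        t = F.normalize(tokens, dim=-1)
        return t @ t.t()

    def self_similarity_alignment_loss(video_tokens, text_tokens):
        """Align the internal similarity structure of video and text tokens.

        Assumes both sequences have the same length N (e.g., after pooling to a
        common number of segments) -- an illustrative simplification.
        """
        return F.mse_loss(self_similarity(video_tokens),
                          self_similarity(text_tokens))

    loss = self_similarity_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))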

replace CycleMix: Mixing Source Domains for Domain Generalization in Style-Dependent Data

Authors: Aristotelis Ballas, Christos Diou

Abstract: As deep learning-based systems have become an integral part of everyday life, limitations in their generalization ability have begun to emerge. Machine learning algorithms typically rely on the i.i.d. assumption, meaning that their training and validation data are expected to follow the same distribution, which does not necessarily hold in practice. In the case of image classification, one frequent reason that algorithms fail to generalize is that they rely on spurious correlations present in training data, such as associating image styles with target classes. These associations may not be present in the unseen test data, leading to significant degradation of their effectiveness. In this work, we attempt to mitigate this Domain Generalization (DG) problem by training a robust feature extractor which disregards features attributed to image-style but infers based on style-invariant image representations. To achieve this, we train CycleGAN models to learn the different styles present in the training data and randomly mix them together to create samples with novel style attributes to improve generalization. Experimental results on the PACS DG benchmark validate the proposed method.

replace GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding

Authors: Changshuo Wang, Meiqing Wu, Siew-Kei Lam, Xin Ning, Shangshu Yu, Ruiping Wang, Weijun Li, Thambipillai Srikanthan

Abstract: Despite the significant advancements in pre-training methods for point cloud understanding, directly capturing intricate shape information from irregular point clouds without reliance on external data remains a formidable challenge. To address this problem, we propose GPSFormer, an innovative Global Perception and Local Structure Fitting-based Transformer, which learns detailed shape information from point clouds with remarkable precision. The core of GPSFormer is the Global Perception Module (GPM) and the Local Structure Fitting Convolution (LSFConv). Specifically, GPM utilizes Adaptive Deformable Graph Convolution (ADGConv) to identify short-range dependencies among similar features in the feature space and employs Multi-Head Attention (MHA) to learn long-range dependencies across all positions within the feature space, ultimately enabling flexible learning of contextual representations. Inspired by Taylor series, we design LSFConv, which learns both low-order fundamental and high-order refinement information from explicitly encoded local geometric structures. Integrating the GPM and LSFConv as fundamental components, we construct GPSFormer, a cutting-edge Transformer that effectively captures global and local structures of point clouds. Extensive experiments validate GPSFormer's effectiveness in three point cloud tasks: shape classification, part segmentation, and few-shot learning. The code of GPSFormer is available at \url{https://github.com/changshuowang/GPSFormer}.

URLs: https://github.com/changshuowang/GPSFormer

replace Differentiable Product Quantization for Memory Efficient Camera Relocalization

Authors: Zakaria Laskar, Iaroslav Melekhov, Assia Benbihi, Shuzhe Wang, Juho Kannala

Abstract: Camera relocalization relies on 3D models of the scene with a large memory footprint that is incompatible with the memory budget of several applications. One solution to reduce the scene memory size is map compression by removing certain 3D points and descriptor quantization. This achieves high compression but leads to performance drop due to information loss. To address the memory-performance trade-off, we train a light-weight scene-specific auto-encoder network that performs descriptor quantization-dequantization in an end-to-end differentiable manner, updating both product quantization centroids and network parameters through back-propagation. In addition to optimizing the network for descriptor reconstruction, we encourage it to preserve the descriptor-matching performance with margin-based metric loss functions. Results show that for a local descriptor memory of only 1MB, the synergistic combination of the proposed network and map compression achieves the best performance on the Aachen Day-Night dataset compared to existing compression methods.
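
End-to-end differentiable product quantization is commonly implemented with a straight-through estimator: each sub-vector is snapped to its nearest centroid in the forward pass while gradients flow both to the encoder input and to the centroids. The sketch below shows this mechanism in isolation; the paper's auto-encoder, margin-based matching losses, and codebook sizes are not reproduced:

    import torch
    import torch.nn as nn

    class DifferentiablePQ(nn.Module):
        """Straight-through product quantization of descriptors (sketch)."""
        def __init__(self, dim=128, num_subvectors=8, codebook_size=256):
            super().__init__()
            assert dim % num_subvectors == 0
            self.m = num_subvectors
            self.sub_dim = dim // num_subvectors
            # One codebook of centroids per sub-space, learned by backprop.
            self.codebooks = nn.Parameter(
                torch.randn(num_subvectors, codebook_size, self.sub_dim))

        def forward(self, x):                                  # x: (batch, dim)
            subs = x.view(x.size(0), self.m, self.sub_dim)     # (B, m, d/m)
            dists = torch.cdist(subs.transpose(0, 1), self.codebooks)  # (m, B, K)
            idx = dists.argmin(dim=-1)                                  # (m, B)
            quantized = torch.stack(
                [self.codebooks[i][idx[i]] for i in range(self.m)], dim=1)
            quantized = quantized.view_as(x)
            # Straight-through: the forward value is the quantized descriptor,
            # the gradient w.r.t. x passes through unchanged, and the centroids
            # receive gradients via the quantized output.
            return x + (quantized - x.detach())

    pq = DifferentiablePQ()
    recon = pq(torch.randn(4, 128))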

replace SwinSF: Image Reconstruction from Spatial-Temporal Spike Streams

Authors: Liangyan Jiang, Chuang Zhu, Yanxu Chen

Abstract: The spike camera, with its high temporal resolution, low latency, and high dynamic range, addresses high-speed imaging challenges like motion blur. It captures photons at each pixel independently, creating binary spike streams rich in temporal information but challenging for image reconstruction. Current algorithms, both traditional and deep learning-based, still fall short in exploiting this rich temporal detail and in restoring fine details in the reconstructed image. To overcome this, we introduce Swin Spikeformer (SwinSF), a novel model for dynamic scene reconstruction from spike streams. SwinSF is composed of Spike Feature Extraction, Spatial-Temporal Feature Extraction, and Final Reconstruction Module. It combines shifted window self-attention and proposed temporal spike attention, ensuring a comprehensive feature extraction that encapsulates both spatial and temporal dynamics, leading to a more robust and accurate reconstruction of spike streams. Furthermore, we build a new synthesized dataset for spike image reconstruction which matches the resolution of the latest spike camera, ensuring its relevance and applicability to the latest developments in spike camera imaging. Experimental results demonstrate that the proposed network SwinSF sets a new benchmark, achieving state-of-the-art performance across a series of datasets, including both real-world and synthesized data across various resolutions. Our codes and proposed dataset will be available soon.

replace Robust Facial Reactions Generation: An Emotion-Aware Framework with Modality Compensation

Authors: Guanyu Hu, Jie Wei, Siyang Song, Dimitrios Kollias, Xinyu Yang, Zhonglin Sun, Odysseus Kaloidas

Abstract: The objective of the Multiple Appropriate Facial Reaction Generation (MAFRG) task is to produce contextually appropriate and diverse listener facial behavioural responses based on the multimodal behavioural data of the conversational partner (i.e., the speaker). Current methodologies typically assume continuous availability of speech and facial modality data, neglecting real-world scenarios where these data may be intermittently unavailable, which often results in model failures. Furthermore, despite utilising advanced deep learning models to extract information from the speaker's multimodal inputs, these models fail to adequately leverage the speaker's emotional context, which is vital for eliciting appropriate facial reactions from human listeners. To address these limitations, we propose an Emotion-aware Modality Compensatory (EMC) framework. This versatile solution can be seamlessly integrated into existing models, thereby preserving their advantages while significantly enhancing performance and robustness in scenarios with missing modalities. Our framework ensures resilience when faced with missing modality data through the Compensatory Modality Alignment (CMA) module. It also generates more appropriate emotion-aware reactions via the Emotion-aware Attention (EA) module, which incorporates the speaker's emotional information throughout the entire encoding and decoding process. Experimental results demonstrate that our framework improves the appropriateness metric FRCorr by an average of 57.2\% compared to the original model structure. In scenarios where speech modality data is missing, the performance of appropriate generation shows an improvement, and when facial data is missing, it only exhibits minimal degradation.

replace Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval

Authors: Xiaowan Hu, Yiyi Chen, Yan Li, Minquan Wang, Haoqian Wang, Quan Chen, Han Li, Peng Jiang

Abstract: With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) the recognition of intended products from distractor products present in the background; 2) the video-image heterogeneity that the appearance of products showcased in live streams often deviates substantially from standardized product images in stores; 3) there are numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus toward intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching, solving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining, assisting the model in distinguishing highly similar products with fine-grained features across the video-image-text domain. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.

URLs: https://github.com/Huxiaowan/SGMN.

replace SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Authors: Wenbo Huang, Jinghui Zhang, Xuwei Qian, Zhen Wu, Meng Wang, Lei Zhang

Abstract: High frame-rate (HFR) videos of action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, promoting few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we design with this architecture is referred to as SOAP-Net. Temporal connections between different feature channels and spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.

URLs: https://github.com/wenbohuang1002/SOAP.

replace FCNR: Fast Compressive Neural Representation of Visualization Images

Authors: Yunfei Lu, Pengfei Gu, Chaoli Wang

Abstract: We present FCNR, a fast compressive neural representation for tens of thousands of visualization images under varying viewpoints and timesteps. The existing NeRVI solution, albeit enjoying a high compression ratio, incurs slow speeds in encoding and decoding. Built on the recent advances in stereo image compression, FCNR assimilates stereo context modules and joint context transfer modules to compress image pairs. Our solution significantly improves encoding and decoding speed while maintaining high reconstruction quality and satisfying compression ratio. To demonstrate its effectiveness, we compare FCNR with state-of-the-art neural compression methods, including E-NeRV, HNeRV, NeRVI, and ECSIC. The source code can be found at https://github.com/YunfeiLu0112/FCNR.

URLs: https://github.com/YunfeiLu0112/FCNR.

replace MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues

Authors: Liyun Zhang

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal emotion recognition capabilities, integrating multimodal cues from visual, acoustic, and linguistic contexts in the video to recognize human emotional states. However, existing methods ignore capturing local facial features of temporal dynamics of micro-expressions and do not leverage the contextual dependencies of the utterance-aware temporal segments in the video, thereby limiting their expected effectiveness to a certain extent. In this work, we propose MicroEmo, a time-sensitive MLLM aimed at directing attention to the local facial micro-expression dynamics and the contextual dependencies of utterance-aware video clips. Our model incorporates two key architectural contributions: (1) a global-local attention visual encoder that integrates global frame-level timestamp-bound image features with local facial features of temporal dynamics of micro-expressions; (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video, and then combining them. Preliminary qualitative experiments demonstrate that in a new Explainable Multimodal Emotion Recognition (EMER) task that exploits multi-modal and multi-faceted clues to predict emotions in an open-vocabulary (OV) manner, MicroEmo demonstrates its effectiveness compared with the latest methods.

replace Aggregated Attributions for Explanatory Analysis of 3D Segmentation Models

Authors: Maciej Chrabaszcz, Hubert Baniecki, Piotr Komorowski, Szymon P{\l}otka, Przemyslaw Biecek

Abstract: Analysis of 3D segmentation models, especially in the context of medical imaging, is often limited to segmentation performance metrics that overlook the crucial aspect of explainability and bias. Currently, effectively explaining these models with saliency maps is challenging due to the high dimensions of input images multiplied by the ever-growing number of segmented class labels. To this end, we introduce Agg^2Exp, a methodology for aggregating fine-grained voxel attributions of the segmentation model's predictions. Unlike classical explanation methods that primarily focus on the local feature attribution, Agg^2Exp enables a more comprehensive global view on the importance of predicted segments in 3D images. Our benchmarking experiments show that gradient-based voxel attributions are more faithful to the model's predictions than perturbation-based explanations. As a concrete use-case, we apply Agg^2Exp to discover knowledge acquired by the Swin UNEt TRansformer model trained on the TotalSegmentator v2 dataset for segmenting anatomical structures in computed tomography medical images. Agg^2Exp facilitates the explanatory analysis of large segmentation models beyond their predictive performance.
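
The aggregation step can be shown in a few lines: reduce a dense voxel attribution map to one importance score per predicted segment by averaging absolute attributions inside each segment. The mean-of-absolute-values reduction is an illustrative choice; Agg^2Exp defines its own aggregation measures:

    import numpy as np

    def aggregate_attributions(attributions, predicted_labels, num_classes):
        """Aggregate voxel attributions into per-segment importance scores.

        attributions:     (D, H, W) float saliency map for one target class
        predicted_labels: (D, H, W) int map of the predicted class per voxel
        Returns a vector of length num_classes with the mean |attribution|
        inside each predicted segment (empty segments get 0).
        """
        scores = np.zeros(num_classes)
        for c in range(num_classes):
            mask = predicted_labels == c
            if mask.any():
                scores[c] = np.abs(attributions[mask]).mean()
        return scores

    # Toy usage on a small random volume with 5 classes.
    rng = np.random.default_rng(0)
    vol_attr = rng.normal(size=(32, 64, 64))
    vol_pred = rng.integers(0, 5, size=(32, 64, 64))
    print(aggregate_attributions(vol_attr, vol_pred, num_classes=5))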

replace-cross EXACT: How to Train Your Accuracy

Authors: Ivan Karpukhin, Stanislav Dereka, Sergey Kolesnikov

Abstract: Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework by introducing stochasticity to a model's output and optimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive experiments on linear models and deep image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.
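
The expected-accuracy objective can be illustrated with a Monte Carlo estimate: perturb the model's scores with Gaussian noise and measure how often the true class wins. To keep this sketch differentiable, the hard argmax indicator is relaxed with a low-temperature softmax; the paper derives its own estimator, so this relaxation is purely illustrative:

    import torch

    def expected_accuracy(scores, targets, noise_std=1.0, num_samples=32, tau=0.1):
        """Differentiable Monte Carlo surrogate of the stochastic model's accuracy.

        scores:  (batch, num_classes) deterministic model outputs
        targets: (batch,) ground-truth class indices
        Samples noisy scores and averages a softmax-relaxed "true class wins"
        indicator over samples and over the batch.
        """
        noisy = scores.unsqueeze(0) + noise_std * torch.randn(
            num_samples, *scores.shape, device=scores.device)        # (S, B, C)
        win_prob = torch.softmax(noisy / tau, dim=-1)                 # relaxed argmax
        true_win = win_prob.gather(
            -1, targets.view(1, -1, 1).expand(num_samples, -1, 1)).squeeze(-1)
        return true_win.mean()                                        # maximize this

    # Usage as a training objective: minimize the negative expected accuracy.
    scores = torch.randn(8, 10, requires_grad=True)
    loss = -expected_accuracy(scores, torch.randint(0, 10, (8,)))
    loss.backward()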

replace-cross Towards contrast-agnostic soft segmentation of the spinal cord

Authors: Sandrine B\'edard, Enamundram Naga Karthik, Charidimos Tsagkas, Emanuele Pravat\`a, Cristina Granziera, Andrew Smith, Kenneth Arnold Weber II, Julien Cohen-Adad

Abstract: Spinal cord segmentation is clinically relevant and is notably used to compute spinal cord cross-sectional area (CSA) for the diagnosis and monitoring of cord compression or neurodegenerative diseases such as multiple sclerosis. While several semi and automatic methods exist, one key limitation remains: the segmentation depends on the MRI contrast, resulting in different CSA across contrasts. This is partly due to the varying appearance of the boundary between the spinal cord and the cerebrospinal fluid that depends on the sequence and acquisition parameters. This contrast-sensitive CSA adds variability in multi-center studies where protocols can vary, reducing the sensitivity to detect subtle atrophies. Moreover, existing methods enhance the CSA variability by training one model per contrast, while also producing binary masks that do not account for partial volume effects. In this work, we present a deep learning-based method that produces soft segmentations of the spinal cord. Using the Spine Generic Public Database of healthy participants ($\text{n}=267$; $\text{contrasts}=6$), we first generated participant-wise soft ground truth (GT) by averaging the binary segmentations across all 6 contrasts. These soft GT, along with aggressive data augmentation and a regression-based loss function, were used to train a U-Net model for spinal cord segmentation. We evaluated our model against state-of-the-art methods and performed ablation studies involving different loss functions and domain generalization methods. Our results show that using the soft segmentations along with a regression loss function reduces CSA variability ($p < 0.05$, Wilcoxon signed-rank test). The proposed spinal cord segmentation model generalizes better than the state-of-the-art methods amongst unseen datasets, vendors, contrasts, and pathologies (compression, lesions), while accounting for partial volume effects.
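
The soft ground truth construction and the regression objective can be summarized in a short sketch: average the per-contrast binary masks into a soft mask and train against it with a voxel-wise regression loss. The plain MSE below is a generic stand-in for the paper's regression-based loss function:

    import torch
    import torch.nn.functional as F

    def make_soft_ground_truth(binary_masks):
        """Average per-contrast binary segmentations into one soft mask.

        binary_masks: (num_contrasts, D, H, W) tensor of {0, 1} values
        Returns values in [0, 1] that encode boundary / partial volume uncertainty.
        """
        return binary_masks.float().mean(dim=0)

    def soft_segmentation_loss(predicted_logits, soft_gt):
        """Voxel-wise regression loss against the soft ground truth (sketch:
        plain MSE; the paper uses its own regression-based loss function)."""
        return F.mse_loss(torch.sigmoid(predicted_logits), soft_gt)

    # Toy usage: 6 contrasts, one small volume.
    masks = (torch.rand(6, 16, 48, 48) > 0.5).float()
    soft_gt = make_soft_ground_truth(masks)
    loss = soft_segmentation_loss(torch.randn(16, 48, 48), soft_gt)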

replace-cross Copyright Protection in Generative AI: A Technical Perspective

Authors: Jie Ren, Han Xu, Pengfei He, Yingqian Cui, Shenglai Zeng, Jiankun Zhang, Hongzhi Wen, Jiayuan Ding, Pei Huang, Lingjuan Lyu, Hui Liu, Yi Chang, Jiliang Tang

Abstract: Generative AI has witnessed rapid advancement in recent years, expanding its capabilities to create synthesized content such as text, images, audio, and code. The high fidelity and authenticity of content generated by these Deep Generative Models (DGMs) have sparked significant copyright concerns. There have been various legal debates on how to effectively safeguard copyrights in DGMs. This work delves into this issue by providing a comprehensive overview of copyright protection from a technical perspective. We examine from two distinct viewpoints: the copyrights pertaining to the source data held by the data owners and those of the generative models maintained by the model builders. For data copyright, we delve into methods by which data owners can protect their content and DGMs can be utilized without infringing upon these rights. For model copyright, our discussion extends to strategies for preventing model theft and identifying outputs generated by specific models. Finally, we highlight the limitations of existing techniques and identify areas that remain unexplored. Furthermore, we discuss prospective directions for the future of copyright protection, underscoring its importance for the sustainable and ethical development of Generative AI.

replace-cross MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms

Authors: Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, Srijan Kumar

Abstract: Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehend the information or emotions associated with interactions in online spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising solution to these challenges, yet they struggle to accurately interpret human emotions and complex content such as misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, targeting tasks ranging from misinformation detection and hate speech detection to social context generation. Through our exhaustive evaluation on ten size-variants of four open-source MLLMs, we have identified significant performance disparities, highlighting the need for advancements in models' social understanding capabilities. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks. However, MLLMs demonstrate performance improvements after fine-tuning, suggesting potential pathways for improvement. Our code and data are available at https://github.com/claws-lab/MMSoc.git.

URLs: https://github.com/claws-lab/MMSoc.git.

replace-cross MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

Authors: Ali Behrouz, Michele Santacatterina, Ramin Zabih

Abstract: Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space complexity in input size, limiting their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data-independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models, with efficient hardware-aware implementation, have shown promising potential for long-sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while demonstrating significantly reduced computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.
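As a loose illustration of the dual token-and-channel mixing and the weighted-averaging connections described above, the sketch below stands in generic linear mixers for the paper's selective SSM (Mamba) blocks; the module names and the softmax-weighted skip scheme are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyDualMixerBlock(nn.Module):
    """Illustrative stand-in for a Selective Token and Channel Mixer:
    one module mixes along the token (sequence) axis, another along the
    channel axis. The real MambaMixer uses selective SSMs for both."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.token_mixer = nn.Linear(seq_len, seq_len)   # mixes across tokens
        self.channel_mixer = nn.Linear(dim, dim)         # mixes across channels
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                # x: (batch, seq_len, dim)
        x = x + self.token_mixer(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mixer(self.norm2(x))
        return x

class ToyMambaMixer(nn.Module):
    """Stacks blocks and, echoing the paper's weighted-averaging mechanism,
    feeds each layer a learned weighted average of all earlier outputs."""
    def __init__(self, dim, seq_len, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ToyDualMixerBlock(dim, seq_len) for _ in range(depth)])
        self.skip_weights = nn.Parameter(torch.ones(depth, depth))

    def forward(self, x):
        outputs = [x]
        for i, block in enumerate(self.blocks):
            w = torch.softmax(self.skip_weights[i, : len(outputs)], dim=0)
            mixed_input = sum(wi * o for wi, o in zip(w, outputs))
            outputs.append(block(mixed_input))
        return outputs[-1]
```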

replace-cross The Platonic Representation Hypothesis

Authors: Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola

Abstract: We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in increasingly similar ways. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

replace-cross Quantum Implicit Neural Representations

Authors: Jiaming Zhao, Wenbo Qiao, Peng Zhang, Hui Gao

Abstract: Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose the Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conduct experiments on signal representation, image super-resolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.
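For intuition, the sketch below shows a purely classical Fourier-feature network of the kind that QIREN generalizes; the quantum circuits themselves are not shown, and the layer sizes and random-feature scale are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FourierFeatureMLP(nn.Module):
    """Classical Fourier-feature implicit representation: coordinates are
    mapped through random sinusoidal features before a small MLP, which
    helps fit high-frequency signal content that plain ReLU MLPs miss."""
    def __init__(self, in_dim=2, n_features=64, hidden=128, out_dim=3, scale=10.0):
        super().__init__()
        self.register_buffer("B", torch.randn(in_dim, n_features) * scale)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):            # coords: (batch, in_dim) in [0, 1]
        proj = 2 * torch.pi * coords @ self.B
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.mlp(feats)            # e.g., RGB values at each coordinate
```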

replace-cross Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

Authors: Yuan Chen, Zi-han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, Si Liu

Abstract: Despite real-time planners exhibiting remarkable performance in autonomous driving, the growing exploration of Large Language Models (LLMs) has opened avenues for enhancing the interpretability and controllability of motion planning. Nevertheless, LLM-based planners continue to encounter significant challenges, including elevated resource consumption and extended inference times, which pose substantial obstacles to practical deployment. In light of these challenges, we introduce AsyncDriver, a new asynchronous LLM-enhanced closed-loop framework designed to leverage scene-associated instruction features produced by the LLM to guide real-time planners in making precise and controllable trajectory predictions. On one hand, our method highlights the prowess of LLMs in comprehending and reasoning over vectorized scene data and a series of routing instructions, demonstrating their effective assistance to real-time planners. On the other hand, the proposed framework decouples the inference processes of the LLM and the real-time planner. By capitalizing on the asynchronous nature of their inference frequencies, our approach successfully reduces the computational cost introduced by the LLM while maintaining comparable performance. Experiments show that our approach achieves superior closed-loop evaluation performance on nuPlan's challenging scenarios.
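The decoupling of inference frequencies can be illustrated with a toy loop in which a slow LLM worker refreshes guidance in the background while the real-time planner runs at its own rate; the class names and rates below are hypothetical stand-ins, not the AsyncDriver implementation.

```python
import time
import threading

class SlowLLMGuidance:
    """Hypothetical stand-in for slow LLM inference over scene data."""
    def infer(self, scene):
        time.sleep(0.5)                                  # simulate long LLM latency
        return {"instruction_features": scene["route_hint"]}

class RealTimePlanner:
    """Hypothetical stand-in for a fast trajectory planner."""
    def plan(self, scene, guidance):
        return f"trajectory(t={scene['t']}, guidance={guidance['instruction_features']})"

def driving_loop(steps=20, planner_hz=10):
    llm, planner = SlowLLMGuidance(), RealTimePlanner()
    latest_guidance = {"instruction_features": None}
    lock = threading.Lock()

    def llm_worker(scene_snapshot):
        result = llm.infer(scene_snapshot)
        with lock:
            latest_guidance.update(result)               # refresh guidance when ready

    for t in range(steps):
        scene = {"t": t, "route_hint": "keep_lane"}
        if t % 5 == 0:                                   # trigger LLM occasionally,
            threading.Thread(target=llm_worker,          # never blocking the planner
                             args=(scene,), daemon=True).start()
        with lock:
            guidance = dict(latest_guidance)             # use the latest available result
        planner.plan(scene, guidance)                    # runs every tick at planner rate
        time.sleep(1.0 / planner_hz)
```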

replace-cross SvANet: A Scale-variant Attention-based Network for Small Medical Object Segmentation

Authors: Wei Dai

Abstract: Early detection and accurate diagnosis can predict the risk of malignant disease transformation, thereby increasing the probability of effective treatment. A mild syndrome with small affected regions is an early warning sign and is central to the early diagnosis of disease. Deep learning algorithms, such as convolutional neural networks (CNNs), have been used to segment natural or medical objects, showing promising results. However, analyzing medical objects that occupy small areas in images remains a challenge due to information losses and compression defects caused by convolution and pooling operations in CNNs. These losses and defects become increasingly significant as the network deepens, particularly for small medical objects. To address these challenges, we propose a novel scale-variant attention-based network (SvANet) for accurate small-scale object segmentation in medical images. SvANet consists of Monte Carlo attention, scale-variant attention, and a vision transformer, which together incorporate cross-scale features and alleviate compression artifacts, enhancing the discrimination of small medical objects. Quantitative experimental results demonstrate the superior performance of SvANet, achieving 96.12%, 96.11%, 89.79%, 84.15%, 80.25%, 73.05%, and 72.58% mean Dice coefficients for segmenting kidney tumors, skin lesions, hepatic tumors, polyps, surgical excision cells, retinal vasculature, and sperm, which occupy less than 1% of the image areas in the KiTS23, ISIC 2018, ATLAS, PolypGen, TissueNet, FIVES, and SpermHealth datasets, respectively.
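The abstract only summarizes the architecture, so the snippet below is a deliberately generic illustration of the cross-scale idea (upsampling coarse features and gating the fine ones so small-object detail is not washed out by pooling); it is not the paper's Monte Carlo or scale-variant attention module, and all names and layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCrossScaleGate(nn.Module):
    """Illustrative only: upsample coarse features, concatenate with fine
    features, and predict per-pixel attention that re-weights the fine map."""
    def __init__(self, fine_ch, coarse_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(fine_ch + coarse_ch, fine_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, fine, coarse):          # fine: high-res map, coarse: low-res map
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:],
                                  mode="bilinear", align_corners=False)
        attn = self.gate(torch.cat([fine, coarse_up], dim=1))
        return fine * (1.0 + attn)            # emphasize small-object detail
```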

replace-cross SAM2CLIP2SAM: Vision Language Model for Segmentation of 3D CT Scans for Covid-19 Detection

Authors: Dimitrios Kollias, Anastasios Arsenos, James Wingate, Stefanos Kollias

Abstract: This paper presents a new approach for effective segmentation of images that can be integrated into any model and methodology; the paradigm that we choose is classification of medical images (3-D chest CT scans) for COVID-19 detection. Our approach includes a combination of vision-language models that segment the CT scans, which are then fed to a deep neural architecture, named RACNet, for COVID-19 detection. In particular, a novel framework, named SAM2CLIP2SAM, is introduced for segmentation that leverages the strengths of both the Segment Anything Model (SAM) and Contrastive Language-Image Pre-Training (CLIP) to accurately segment the right and left lungs in CT scans, subsequently feeding these segmented outputs into RACNet for classification of COVID-19 and non-COVID-19 cases. First, SAM produces multiple part-based segmentation masks for each slice in the CT scan; then CLIP selects only the masks that are associated with the regions of interest (ROIs), i.e., the right and left lungs; finally, SAM is given these ROIs as prompts and generates the final segmentation mask for the lungs. Experiments on two annotated COVID-19 databases illustrate the improved performance obtained when our method is used to segment the CT scans.
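The three-stage pipeline can be sketched as follows, using placeholder callables (`sam_segment`, `clip_score`, `sam_refine_with_rois`) rather than the actual Segment Anything or CLIP APIs; the text prompts and the argmax selection rule are illustrative assumptions.

```python
# Illustrative pipeline sketch; the three callables passed in are placeholders,
# not the real segment-anything / CLIP library interfaces.

LUNG_PROMPTS = ["right lung in a chest CT slice", "left lung in a chest CT slice"]

def sam2clip2sam(ct_slices, sam_segment, clip_score, sam_refine_with_rois):
    lung_masks = []
    for ct_slice in ct_slices:
        # Stage 1: SAM proposes many part-based candidate masks for the slice.
        candidate_masks = sam_segment(ct_slice)

        # Stage 2: CLIP scores each candidate against lung text prompts and
        # keeps the best-matching region for each lung.
        rois = []
        for prompt in LUNG_PROMPTS:
            scores = [clip_score(ct_slice, mask, prompt) for mask in candidate_masks]
            rois.append(candidate_masks[scores.index(max(scores))])

        # Stage 3: the selected ROIs are fed back to SAM as prompts to produce
        # the final, refined lung segmentation for this slice.
        lung_masks.append(sam_refine_with_rois(ct_slice, rois))
    return lung_masks  # later passed to RACNet for COVID-19 classification
```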

replace-cross Adaptive Extensions of Unbiased Risk Estimators for Unsupervised Magnetic Resonance Image Denoising

Authors: Reeshad Khan, Dr. John Gauch, Dr. Ukash Nakarmi

Abstract: The application of Deep Neural Networks (DNNs) to image denoising has notably challenged traditional denoising methods, particularly within complex noise scenarios prevalent in medical imaging. Despite the effectiveness of traditional and some DNN-based methods, their reliance on high-quality, noiseless ground truth images limits their practical utility. In response to this, our work introduces and benchmarks innovative unsupervised learning strategies, notably Stein's Unbiased Risk Estimator (SURE), its extension (eSURE), and our novel implementation, the Extended Poisson Unbiased Risk Estimator (ePURE), within medical imaging frameworks. This paper presents a comprehensive evaluation of these methods on MRI data afflicted with Gaussian and Poisson noise types, a scenario typical in medical imaging but challenging for most denoising algorithms. Our main contribution lies in the effective adaptation and implementation of the SURE, eSURE, and particularly the ePURE frameworks for medical images, showcasing their robustness and efficacy in environments where traditional noiseless ground truth cannot be obtained.
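For reference, a standard Monte Carlo estimate of SURE for Gaussian noise, of the kind such frameworks build on, can be written compactly; the function below is a generic sketch (single Rademacher probe for the divergence term), not the paper's eSURE or ePURE implementation.

```python
import torch

def mc_sure_loss(denoiser, noisy, sigma, eps=1e-3):
    """Monte Carlo SURE for i.i.d. Gaussian noise of known std `sigma`:
    SURE = ||f(y) - y||^2 / N - sigma^2 + (2 sigma^2 / N) * div_y f(y),
    with the divergence estimated via a single random perturbation."""
    n = noisy.numel()
    out = denoiser(noisy)

    # Data-fidelity term.
    fidelity = torch.sum((out - noisy) ** 2) / n

    # Monte Carlo divergence estimate with a Rademacher probe.
    b = torch.randint_like(noisy, low=0, high=2) * 2.0 - 1.0
    out_perturbed = denoiser(noisy + eps * b)
    divergence = torch.sum(b * (out_perturbed - out)) / (eps * n)

    return fidelity - sigma ** 2 + 2 * sigma ** 2 * divergence
```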

replace-cross Velocity Driven Vision: Asynchronous Sensor Fusion Birds Eye View Models for Autonomous Vehicles

Authors: Seamie Hayes, Sushil Sharma, Ciar\'an Eising

Abstract: Fusing different sensor modalities can be a difficult task, particularly if they are asynchronous. Asynchronicity may arise from long processing times or improper synchronisation during calibration, yet this older information must still be usable for safe driving, object detection, and ego-vehicle/multi-agent trajectory prediction. The difficulty is that the sensor modalities capture information at different times and at different positions in space, so they are neither spatially nor temporally aligned. This paper investigates the challenge of radar and LiDAR sensors being asynchronous relative to the camera sensors, for various time latencies. Spatial alignment is resolved before lifting into bird's-eye-view (BEV) space by transforming the radar/LiDAR point clouds into the new ego-frame coordinate system; only then can the radar/LiDAR point clouds be concatenated with the lifted camera features. Temporal alignment is remedied for radar data only: we implement a novel method of inferring future radar point positions using the velocity information. Our approach to resolving the issue of sensor asynchrony yields promising results. We demonstrate that velocity information can drastically improve IoU for asynchronous datasets: for a time latency of 360 milliseconds (ms), IoU improves from 49.54 to 53.63. Additionally, for a time latency of 550 ms, the camera+radar (C+R) model outperforms the camera+LiDAR (C+L) model by 0.18 IoU. This is an advancement in utilising the often-neglected radar sensor modality, which is less favoured than LiDAR for autonomous driving purposes.
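The velocity-based temporal compensation described above amounts to propagating each radar detection forward by its measured velocity over the latency, after which points can be re-expressed in the current ego frame; the snippet below is an illustrative sketch with assumed variable names and ego-pose interface.

```python
import numpy as np

def compensate_radar_latency(points_xy, velocities_xy, latency_s,
                             ego_rotation=None, ego_translation=None):
    """Propagate radar detections forward by their measured velocity
    (p' = p + v * dt), then optionally re-express them in the current
    ego frame. The ego-pose arguments are an assumed interface."""
    propagated = points_xy + velocities_xy * latency_s

    if ego_rotation is not None and ego_translation is not None:
        # Rigid transform from the old ego frame into the current ego frame.
        propagated = propagated @ ego_rotation.T + ego_translation
    return propagated

# Example: compensating a 360 ms latency for five dummy detections.
pts = compensate_radar_latency(np.zeros((5, 2)), np.ones((5, 2)), latency_s=0.36)
```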