Authors: Daehwan Kim, Hyungmin Kim, Daun Jeong, Sungho Suh, Hansang Cho
Abstract: In this paper, we propose SPACE, a novel anomaly detection methodology that integrates a Feature Encoder (FE) into the structure of the Student-Teacher method. The proposed method has two key elements: Spatial Consistency regularization Loss (SCL) and Feature converter Module (FM). SCL prevents overfitting in student models by avoiding excessive imitation of the teacher model. Simultaneously, it facilitates the expansion of normal data features by steering clear of abnormal areas generated through data augmentation. This dual functionality ensures a robust boundary between normal and abnormal data. The FM prevents the learning of ambiguous information from the FE. This protects the learned features and enables more effective detection of structural and logical anomalies. Through these elements, SPACE is available to minimize the influence of the FE while integrating various data augmentations.In this study, we evaluated the proposed method on the MVTec LOCO, MVTec AD, and VisA datasets. Experimental results, through qualitative evaluation, demonstrate the superiority of detection and efficiency of each module compared to state-of-the-art methods.
Authors: Zhanwei Zhang, Shizhao Sun, Wenxiao Wang, Deng Cai, Jiang Bian
Abstract: Recently, there is a growing interest in creating computer-aided design (CAD) models based on user intent, known as controllable CAD generation. Existing work offers limited controllability and needs separate models for different types of control, reducing efficiency and practicality. To achieve controllable generation across all CAD construction hierarchies, such as sketch-extrusion, extrusion, sketch, face, loop and curve, we propose FlexCAD, a unified model by fine-tuning large language models (LLMs). First, to enhance comprehension by LLMs, we represent a CAD model as a structured text by abstracting each hierarchy as a sequence of text tokens. Second, to address various controllable generation tasks in a unified model, we introduce a hierarchy-aware masking strategy. Specifically, during training, we mask a hierarchy-aware field in the CAD text with a mask token. This field, composed of a sequence of tokens, can be set flexibly to represent various hierarchies. Subsequently, we ask LLMs to predict this masked field. During inference, the user intent is converted into a CAD text with a mask token replacing the part the user wants to modify, which is then fed into FlexCAD to generate new CAD models. Comprehensive experiments on public dataset demonstrate the effectiveness of FlexCAD in both generation quality and controllability. Code will be available at https://github.com/microsoft/CADGeneration/FlexCAD.
Authors: Xintian Sun, Benji Peng, Charles Zhang, Fei Jin, Qian Niu, Junyu Liu, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Ming Liu, Yichao Zhang
Abstract: Remote sensing has evolved from simple image acquisition to complex systems capable of integrating and processing visual and textual data. This review examines the development and application of multi-modal language models (MLLMs) in remote sensing, focusing on their ability to interpret and describe satellite imagery using natural language. We cover the technical underpinnings of MLLMs, including dual-encoder architectures, Transformer models, self-supervised and contrastive learning, and cross-modal integration. The unique challenges of remote sensing data--varying spatial resolutions, spectral richness, and temporal changes--are analyzed for their impact on MLLM performance. Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed to demonstrate their relevance in environmental monitoring, urban planning, and disaster response. We review significant datasets and resources supporting the training and evaluation of these models. Challenges related to computational demands, scalability, data quality, and domain adaptation are highlighted. We conclude by proposing future research directions and technological advancements to further enhance MLLM utility in remote sensing.
Authors: Mathieu Lamarre, Patrick Anderson, \'Etienne Danvoye
Abstract: Accurate stabilization of facial motion is essential for applications in photoreal avatar construction for 3D games, virtual reality, movies, and training data collection. For the latter, stabilization must work automatically for the general population with people of varying morphology. Distinguishing rigid skull motion from facial expressions is critical since misalignment between skull motion and facial expressions can lead to animation models that are hard to control and can not fit natural motion. Existing methods struggle to work with sparse sets of very different expressions, such as when combining multiple units from the Facial Action Coding System (FACS). Certain approaches are not robust enough, some depend on motion data to find stable points, while others make one-for-all invalid physiological assumptions. In this paper, we leverage recent advances in neural signed distance fields and differentiable isosurface meshing to compute skull stabilization rigid transforms directly on unstructured triangle meshes or point clouds, significantly enhancing accuracy and robustness. We introduce the concept of a stable hull as the surface of the boolean intersection of stabilized scans, analogous to the visual hull in shape-from-silhouette and the photo hull from space carving. This hull resembles a skull overlaid with minimal soft tissue thickness, upper teeth are automatically included. Our skull carving algorithm simultaneously optimizes the stable hull shape and rigid transforms to get accurate stabilization of complex expressions for large diverse sets of people, outperforming existing methods.
Authors: Jun-Hyuk Kim, Seungeon Kim, Won-Hee Lee, Dokwan Oh
Abstract: Designing a fast and effective entropy model is challenging but essential for practical application of neural codecs. Beyond spatial autoregressive entropy models, more efficient backward adaptation-based entropy models have been recently developed. They not only reduce decoding time by using smaller number of modeling steps but also maintain or even improve rate--distortion performance by leveraging more diverse contexts for backward adaptation. Despite their significant progress, we argue that their performance has been limited by the simple adoption of the design convention for forward adaptation: using only a single type of hyper latent representation, which does not provide sufficient contextual information, especially in the first modeling step. In this paper, we propose a simple yet effective entropy modeling framework that leverages sufficient contexts for forward adaptation without compromising on bit-rate. Specifically, we introduce a strategy of diversifying hyper latent representations for forward adaptation, i.e., using two additional types of contexts along with the existing single type of context. In addition, we present a method to effectively use the diverse contexts for contextualizing the current elements to be encoded/decoded. By addressing the limitation of the previous approach, our proposed framework leads to significant performance improvements. Experimental results on popular datasets show that our proposed framework consistently improves rate--distortion performance across various bit-rate regions, e.g., 3.73% BD-rate gain over the state-of-the-art baseline on the Kodak dataset.
Authors: Abhishek Sebastian, Pragna R
Abstract: Fiber Specklegram Sensors (FSS) are widely used in environmental monitoring due to their high sensitivity to temperature fluctuations, yet the complex, nonlinear nature of specklegram data poses significant challenges for conventional predictive models. This study introduces a novel Prion-Vision Transformer (Prion-ViT) model, inspired by biological prion memory mechanisms, to enhance long-term dependency modeling for accurate temperature prediction using FSS data. By leveraging a persistent memory state, the Prion-ViT effectively retains and propagates essential features across multiple layers, thereby improving prediction accuracy and reducing mean absolute error (MAE) to "0.52 Degree Celsius" outperforming traditional models like ResNet, Inception Net V2, and existing transformer-based architectures. The study addresses the specific challenges of applying Vision Transformers (ViTs) to FSS data and demonstrates that the prion-inspired memory mechanism offers a robust solution for capturing complex optical interference patterns in specklegrams. These findings establish Prion-ViT as a promising advancement for real-time industrial temperature monitoring applications, with potential applicability to other optical sensing domains.
Authors: Zhuorui Ye, Farzan Farnia
Abstract: Gradient-based saliency maps have been widely used to interpret the decisions of neural network classifiers and discover phenomena from their learned functions. Standard gradient-based maps are frequently observed to be highly sensitive to the randomness of training data and the stochasticity in the training process. In this work, we study the role of randomized smoothing in the well-known Smooth-Grad algorithm in the stability of the gradient-based maps to the randomness of training samples. We extend the algorithmic stability framework to gradient-based saliency maps and prove bounds on the stability error of standard Simple-Grad, Integrated-Gradients, and Smooth-Grad saliency maps. Our theoretical results suggest the role of Gaussian smoothing in boosting the stability of gradient-based maps to the randomness of training settings. On the other hand, we analyze the faithfulness of the Smooth-Grad maps to the original Simple-Grad and show the lower fidelity under a more intense Gaussian smoothing. We support our theoretical results by performing several numerical experiments on standard image datasets. Our empirical results confirm our hypothesis on the fidelity-stability trade-off in the application of Gaussian smoothing to gradient-based interpretation maps.
Authors: Abhinav Kumar, Pratham Singla, Aayan Yadav
Abstract: Image steganography is the technique of embedding secret information within images. The development of deep learning has led to significant advances in this field. However, existing methods often struggle to balance image quality, embedding capacity, and security. This paper proposes a novel approach to image steganography by enhancing an encoder-decoder architecture with attention mechanisms, specifically focusing on channel and spatial attention modules. We systematically investigate five configurations: (1) channel attention, (2) spatial attention, (3) sequential channel followed by spatial attention, (4) spatial attention followed by channel attention and (5) parallel channel and spatial attention. Our experiments show that adding attention mechanisms improves the ability to embed hidden information while maintaining the visual quality of the images. The increase in the PSNR and SSIM scores shows that using a parallel combination of channel and spatial attention improves image quality and hiding capacity simultaneously. This is in contrast to previous works where there is a tradeoff between them. This study shows that attention mechanisms in image steganography lead to better hiding of secret information. Our code is available at https://github.com/vlgiitr/StegaVision.
Authors: Elmira Mousa Rezabeyk, Salar Beigzad, Yasin Hamzavi, Mohsen Bagheritabar, Seyedeh Sogol Mirikhoozani
Abstract: Deep learning methods have established a significant place in image classification. While prior research has focused on enhancing final outcomes, the opaque nature of the decision-making process in these models remains a concern for experts. Additionally, the deployment of these methods can be problematic in resource-limited environments. This paper tackles the inherent black-box nature of these models by providing real-time explanations during the training phase, compelling the model to concentrate on the most distinctive and crucial aspects of the input. Furthermore, we employ established quantization techniques to address resource constraints. To assess the effectiveness of our approach, we explore how quantization influences the interpretability and accuracy of Convolutional Neural Networks through a comparative analysis of saliency maps from standard and quantized models. Quantization is implemented during the training phase using the Parameterized Clipping Activation method, with a focus on the MNIST and FashionMNIST benchmark datasets. We evaluated three bit-width configurations (2-bit, 4-bit, and mixed 4/2-bit) to explore the trade-off between efficiency and interpretability, with each configuration designed to highlight varying impacts on saliency map clarity and model accuracy. The results indicate that while quantization is crucial for implementing models on resource-limited devices, it necessitates a trade-off between accuracy and interpretability. Lower bit-widths result in more pronounced reductions in both metrics, highlighting the necessity of meticulous quantization parameter selection in applications where model transparency is paramount. The study underscores the importance of achieving a balance between efficiency and interpretability in the deployment of neural networks.
Authors: Duy-Phuong Dao, Hyung-Jeong Yang, Jahae Kim
Abstract: Alzheimers disease progresses slowly and involves complex interaction between various biological factors. Longitudinal medical imaging data can capture this progression over time. However, longitudinal data frequently encounter issues such as missing data due to patient dropouts, irregular follow-up intervals, and varying lengths of observation periods. To address these issues, we designed a diffusion-based model for 3D longitudinal medical imaging generation using single magnetic resonance imaging (MRI). This involves the injection of a conditioning MRI and time-visit encoding to the model, enabling control in change between source and target images. The experimental results indicate that the proposed method generates higher-quality images compared to other competing methods.
Authors: Shuchang Lyu, Qi Zhaoa, Guangliang Cheng, Yiwei He, Zheng Zhou, Guangbiao Wang, Zhenwei Shi
Abstract: Unsupervised Domain Adaptation for Remote Sensing Semantic Segmentation (UDA-RSSeg) addresses the challenge of adapting a model trained on source domain data to target domain samples, thereby minimizing the need for annotated data across diverse remote sensing scenes. This task presents two principal challenges: (1) severe inconsistencies in feature representation across different remote sensing domains, and (2) a domain gap that emerges due to the representation bias of source domain patterns when translating features to predictive logits. To tackle these issues, we propose a joint-optimized adversarial network incorporating the "Segment Anything Model (SAM) (SAM-JOANet)" for UDA-RSSeg. Our approach integrates SAM to leverage its robust generalized representation capabilities, thereby alleviating feature inconsistencies. We introduce a finetuning decoder designed to convert SAM-Encoder features into predictive logits. Additionally, a feature-level adversarial-based prompted segmentor is employed to generate class-agnostic maps, which guide the finetuning decoder's feature representations. The network is optimized end-to-end, combining the prompted segmentor and the finetuning decoder. Extensive evaluations on benchmark datasets, including ISPRS (Potsdam/Vaihingen) and CITY-OSM (Paris/Chicago), demonstrate the effectiveness of our method. The results, supported by visualization and analysis, confirm the method's interpretability and robustness. The code of this paper is available at https://github.com/CV-ShuchangLyu/SAM-JOANet.
Authors: Yuanyuan Liu, Lin Wei, Kejun Liu, Yibing Zhan, Zijing Chen, Zhe Chen, Shiguang Shan
Abstract: Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field heavily relies on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, it is important to note that facial expressions may not always precisely reflect genuine emotions and FER-based results may yield misleading ER. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cues for the creation of a new Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. Different from existing multimodal ER datasets, the EMER dataset employs a stimulus material-induced spontaneous emotion generation method to integrate non-invasive eye behavior data, like eye movements and eye fixation maps, with facial videos, aiming to obtain natural and accurate human emotions. Notably, for the first time, we provide annotations for both ER and FER in the EMER, enabling a comprehensive analysis to better illustrate the gap between both tasks. Furthermore, we specifically design a new EMERT architecture to concurrently enhance performance in both ER and FER by efficiently identifying and bridging the emotion gap between the two.Specifically, our EMERT employs modality-adversarial feature decoupling and multi-task Transformer to augment the modeling of eye behaviors, thus providing an effective complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance.
Authors: Mohammed Aledhari, Mohamed Rahouti, Ali Alfatemi
Abstract: Autism Spectrum Disorder (ASD) is often underdiagnosed in females due to gender-specific symptom differences overlooked by conventional diagnostics. This study evaluates machine learning models, particularly Random Forest and convolutional neural networks, for enhancing ASD diagnosis through structured data and facial image analysis. Random Forest achieved 100% validation accuracy across datasets, highlighting its ability to manage complex relationships and reduce false negatives, which is crucial for early intervention and addressing gender biases. In image-based analysis, MobileNet outperformed the baseline CNN, achieving 87% accuracy, though a 30% validation loss suggests possible overfitting, requiring further optimization for robustness in clinical settings. Future work will emphasize hyperparameter tuning, regularization, and transfer learning. Integrating behavioral data with facial analysis could improve diagnosis for underdiagnosed groups. These findings suggest Random Forest's high accuracy and balanced precision-recall metrics could enhance clinical workflows. MobileNet's lightweight structure also shows promise for resource-limited environments, enabling accessible ASD screening. Addressing model explainability and clinician trust will be vital.
Authors: Daniel Menges, Florian Stadtmann, Henrik Jordheim, Adil Rasheed
Abstract: This paper explores the development and practical application of a predictive digital twin specifically designed for condition monitoring, using advanced mathematical models and thermal imaging techniques. Our work presents a comprehensive approach to integrating Proper Orthogonal Decomposition (POD), Robust Principal Component Analysis (RPCA), and Dynamic Mode Decomposition (DMD) to establish a robust predictive digital twin framework. We employ these methods in a real-time experimental setup involving a heated plate monitored through thermal imaging. This system effectively demonstrates the digital twin's capabilities in real-time predictions, condition monitoring, and anomaly detection. Additionally, we introduce the use of a human-machine interface that includes virtual reality, enhancing user interaction and system understanding. The primary contributions of our research lie in the demonstration of these advanced techniques in a tangible setup, showcasing the potential of digital twins to transform industry practices by enabling more proactive and strategic asset management.
Authors: Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, Xiaowei Huang
Abstract: In this paper, we propose a novel framework for enhancing visual comprehension in autonomous driving systems by integrating visual language models (VLMs) with additional visual perception module specialised in object detection. We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation. Our approach introduces camera ID-separators to improve multi-view processing, crucial for comprehensive environmental awareness. Experiments on the DriveLM visual question answering challenge demonstrate significant improvements over baseline models, with enhanced performance in ChatGPT scores, BLEU scores, and CIDEr metrics, indicating closeness of model answer to ground truth. Our method represents a promising step towards more capable and interpretable autonomous driving systems. Possible safety enhancement enabled by detection modality is also discussed.
Authors: Francesco Girlanda, Olga Demler, Bjoern Menze, Neda Davoudi
Abstract: Accurate prediction of cardiovascular diseases remains imperative for early diagnosis and intervention, necessitating robust and precise predictive models. Recently, there has been a growing interest in multi-modal learning for uncovering novel insights not available through uni-modal datasets alone. By combining cardiac magnetic resonance images, electrocardiogram signals, and available medical information, our approach enables the capture of holistic status about individuals' cardiovascular health by leveraging shared information across modalities. Integrating information from multiple modalities and benefiting from self-supervised learning techniques, our model provides a comprehensive framework for enhancing cardiovascular disease prediction with limited annotated datasets. We employ a masked autoencoder to pre-train the electrocardiogram ECG encoder, enabling it to extract relevant features from raw electrocardiogram data, and an image encoder to extract relevant features from cardiac magnetic resonance images. Subsequently, we utilize a multi-modal contrastive learning objective to transfer knowledge from expensive and complex modality, cardiac magnetic resonance image, to cheap and simple modalities such as electrocardiograms and medical information. Finally, we fine-tuned the pre-trained encoders on specific predictive tasks, such as myocardial infarction. Our proposed method enhanced the image information by leveraging different available modalities and outperformed the supervised approach by 7.6% in balanced accuracy.
Authors: Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong
Abstract: Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary in different levels, \textit{i.e.}, pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a Github repository to organize the papers included in this survey at: \url{https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey}.
URLs: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey
Authors: Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, Thomas Kipf
Abstract: Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at a specific spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move "off-the-grid" to better enable them to represent scene elements consistently, even as they move across the image plane through time. By using a combination of cross-attention and positional embeddings we disentangle the representation structure and image structure. We find that a simple self-supervised objective--next frame prediction--trained on video data, results in a set of latent tokens which bind to specific scene structures and track them as they move. We demonstrate the usefulness of MooG's learned representation both qualitatively and quantitatively by training readouts on top of the learned representation on a variety of downstream tasks. We show that MooG can provide a strong foundation for different vision tasks when compared to "on-the-grid" baselines.
Authors: Moseli Mots'oehli, kyungim Baek
Abstract: Active learning aims to train accurate classifiers while minimizing labeling costs by strategically selecting informative samples for annotation. This study focuses on image classification tasks, comparing AL methods on CIFAR10, CIFAR100, Food101, and the Chest X-ray datasets under varying label noise rates. We investigate the impact of model architecture by comparing Convolutional Neural Networks (CNNs) and Vision Transformer (ViT)-based models. Additionally, we propose a novel deep active learning algorithm, GCI-ViTAL, designed to be robust to label noise. GCI-ViTAL utilizes prediction entropy and the Frobenius norm of last-layer attention vectors compared to class-centric clean set attention vectors. Our method identifies samples that are both uncertain and semantically divergent from typical images in their assigned class. This allows GCI-ViTAL to select informative data points even in the presence of label noise while flagging potentially mislabeled candidates. Label smoothing is applied to train a model that is not overly confident about potentially noisy labels. We evaluate GCI-ViTAL under varying levels of symmetric label noise and compare it to five other AL strategies. Our results demonstrate that using ViTs leads to significant performance improvements over CNNs across all AL strategies, particularly in noisy label settings. We also find that using the semantic information of images as label grounding helps in training a more robust model under label noise. Notably, we do not perform extensive hyperparameter tuning, providing an out-of-the-box comparison that addresses the common challenge practitioners face in selecting models and active learning strategies without an exhaustive literature review on training and fine-tuning vision models on real-world application data.
Authors: Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan
Abstract: Vision Language Models (VLMs) are central to Visual Question Answering (VQA) systems and are typically deployed in the cloud due to their high computational demands. However, this cloud-only approach underutilizes edge computational resources and requires significant bandwidth for transmitting raw images. In this paper, we introduce an edge-cloud collaborative VQA system, called LLaVA-AlignedVQ, which features a novel Aligned Vector Quantization algorithm (AlignedVQ) that efficiently compress intermediate features without compromising accuracy to support partitioned execution. Our experiments demonstrate that LLaVA-AlignedVQ achieves approximately 1365x compression rate of intermediate features, reducing data transmission overhead by 96.8% compared to transmitting JPEG90-compressed images to the cloud. LLaVA-AlignedVQ achieves an inference speedup of 2-15x while maintaining high accuracy, remaining within -2.23% to +1.6% of the original model's accuracy performance across eight VQA datasets, compared to the cloud-only solution.
Authors: Mateusz Wasala, Krzysztof Blachut, Hubert Szolc, Marcin Kowalczyk, Michal Danilowicz, Tomasz Kryjak
Abstract: Nowadays, the increasing demand for maintaining high cleanliness standards in public spaces results in the search for innovative solutions. The deployment of CCTV systems equipped with modern cameras and software enables not only real-time monitoring of the cleanliness status but also automatic detection of impurities and optimisation of cleaning schedules. The Digital Twin technology allows for the creation of a virtual model of the space, facilitating the simulation, training, and testing of cleanliness management strategies before implementation in the real world. In this paper, we present the utilisation of advanced vision surveillance systems and the Digital Twin technology in cleanliness management, using a railway station as an example. The Digital Twin was created based on an actual 3D model in the Nvidia Omniverse Isaac Sim simulator. A litter detector, bin occupancy level detector, stain segmentation, and a human detector (including the cleaning crew) along with their movement analysis were implemented. A preliminary assessment was conducted, and potential modifications for further enhancement and future development of the system were identified.
Authors: Magauiya Zhussip, Iaroslav Koshelev, Stamatis Lefkimmiatis
Abstract: Diffusion Probabilistic Models (DPMs) have been recently utilized to deal with various blind image restoration (IR) tasks, where they have demonstrated outstanding performance in terms of perceptual quality. However, the task-specific nature of existing solutions and the excessive computational costs related to their training, make such models impractical and challenging to use for different IR tasks than those that were initially trained for. This hinders their wider adoption, especially by those who lack access to powerful computational resources and vast amount of training data. In this work we aim to address the above issues and enable the successful adoption of DPMs in practical IR-related applications. Towards this goal, we propose a modular diffusion probabilistic IR framework (DP-IR), which allows us to combine the performance benefits of existing pre-trained state-of-the-art IR networks and generative DPMs, while it requires only the additional training of a relatively small module (0.7M params) related to the particular IR task of interest. Moreover, the architecture of the proposed framework allows for a sampling strategy that leads to at least four times reduction of neural function evaluations without suffering any performance loss, while it can also be combined with existing acceleration techniques such as DDIM. We evaluate our model on four benchmarks for the tasks of burst JDD-SR, dynamic scene deblurring, and super-resolution. Our method outperforms existing approaches in terms of perceptual quality while it retains a competitive performance with respect to fidelity metrics.
Authors: Yangming Zhang, Wenqi Jia, Wei Niu, Miao Yin
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a mainstream for novel view synthesis, leveraging continuous aggregations of Gaussian functions to model scene geometry. However, 3DGS suffers from substantial memory requirements to store the multitude of Gaussians, hindering its practicality. To address this challenge, we introduce GaussianSpa, an optimization-based simplification framework for compact and high-quality 3DGS. Specifically, we formulate the simplification as an optimization problem associated with the 3DGS training. Correspondingly, we propose an efficient "optimizing-sparsifying" solution that alternately solves two independent sub-problems, gradually imposing strong sparsity onto the Gaussians in the training process. Our comprehensive evaluations on various datasets show the superiority of GaussianSpa over existing state-of-the-art approaches. Notably, GaussianSpa achieves an average PSNR improvement of 0.9 dB on the real-world Deep Blending dataset with 10$\times$ fewer Gaussians compared to the vanilla 3DGS. Our project page is available at https://gaussianspa.github.io/.
Authors: Hongyu Chen, Bingliang Jiao, Wenxuan Wang, Peng Wang
Abstract: Lifelong person re-identification attempts to recognize people across cameras and integrate new knowledge from continuous data streams. Key challenges involve addressing catastrophic forgetting caused by parameter updating and domain shift, and maintaining performance in seen and unseen domains. Many previous works rely on data memories to retain prior samples. However, the amount of retained data increases linearly with the number of training domains, leading to continually increasing memory consumption. Additionally, these methods may suffer significant performance degradation when data preservation is prohibited due to privacy concerns. To address these limitations, we propose using textual descriptions as guidance to encourage the ReID model to learn cross-domain invariant features without retaining samples. The key insight is that natural language can describe pedestrian instances with an invariant style, suggesting a shared textual space for any pedestrian images. By leveraging this shared textual space as an anchor, we can prompt the ReID model to embed images from various domains into a unified semantic space, thereby alleviating catastrophic forgetting caused by domain shifts. To achieve this, we introduce a task-driven dynamic textual prompt framework in this paper. This model features a dynamic prompt fusion module, which adaptively constructs and fuses two different textual prompts as anchors. This effectively guides the ReID model to embed images into a unified semantic space. Additionally, we design a text-visual feature alignment module to learn a more precise mapping between fine-grained visual and textual features. We also developed a learnable knowledge distillation module that allows our model to dynamically balance retaining existing knowledge with acquiring new knowledge. Extensive experiments demonstrate that our method outperforms SOTAs under various settings.
Authors: Yun Liu, Peng Li, Xuefeng Yan, Liangliang Nan, Bing Wang, Honghua Chen, Lina Gong, Wei Zhao, Mingqiang Wei
Abstract: The core of self-supervised point cloud learning lies in setting up appropriate pretext tasks, to construct a pre-training framework that enables the encoder to perceive 3D objects effectively. In this paper, we integrate two prevalent methods, masked point modeling (MPM) and 3D-to-2D generation, as pretext tasks within a pre-training framework. We leverage the spatial awareness and precise supervision offered by these two methods to address their respective limitations: ambiguous supervision signals and insensitivity to geometric information. Specifically, the proposed framework, abbreviated as PointCG, consists of a Hidden Point Completion (HPC) module and an Arbitrary-view Image Generation (AIG) module. We first capture visible points from arbitrary views as inputs by removing hidden points. Then, HPC extracts representations of the inputs with an encoder and completes the entire shape with a decoder, while AIG is used to generate rendered images based on the visible points' representations. Extensive experiments demonstrate the superiority of the proposed method over the baselines in various downstream tasks. Our code will be made available upon acceptance.
Authors: Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li
Abstract: Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities. Our analyses on object-relationship and multi-hop reasoning reveal several important findings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs' spatial reasoning. Secondly, LMMs struggle more with questions posed from the human perspective than the camera perspective about the image. Thirdly, chain of thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. % Moreover, spatial reasoning steps are much less accurate than non-spatial ones across MLLMs. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are much stronger at basic object detection than complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on LMMs spatial reasoning. Spatial-MM benchmark is available at: https://github.com/FatemehShiri/Spatial-MM
Authors: Yuanbo Chen, Yixiao Kang, Yukun Song, Cyrus Vachha, Sining Huang
Abstract: In this system, we discuss methods to stylize a scene of 3D primitive objects into a higher fidelity 3D scene using novel 3D representations like NeRFs and 3D Gaussian Splatting. Our approach leverages existing image stylization systems and image-to-3D generative models to create a pipeline that iteratively stylizes and composites 3D objects into scenes. We show our results on adding generated objects into a scene and discuss limitations.
Authors: Jiyul Ham, Yonggon Jung, Jun-Geol Baek
Abstract: Zero-shot anomaly detection (ZSAD) is crucial for detecting abnormal patterns in target datasets without using training samples, specifically in scenarios where there are distributional differences between the target domain and training data or where data scarcity arises because of restricted access. Although recently pretrained vision-language models demonstrate strong zero-shot performance across various visual tasks, they focus on learning class semantics, which makes their direct application to ZSAD challenging. To address this scenario, we propose GlocalCLIP, which uniquely separates global and local prompts and jointly optimizes them. This approach enables the object-agnostic glocal semantic prompt design to effectively capture general normal and anomalous patterns without dependency on specific objects in the image. We refine the text prompts for more precise adjustments by utilizing deep-text prompt tuning in the text encoder. In the vision encoder, we apply V-V attention layers to capture detailed local image features. Finally, we introduce glocal contrastive learning to improve the complementary learning of global and local prompts, effectively detecting abnormal patterns across various domains. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets from both the industrial and medical domains, achieving superior performance compared to existing methods.
Authors: Kaixuan Lu, Ruiqian Zhang, Xiao Huang, Yuxing Xie
Abstract: Recently, large vision language models (VLMs) have made significant strides in visual language capabilities through visual instruction tuning, showing great promise in the field of remote sensing image interpretation. However, existing remote sensing vision language models (RSVLMs) often fall short in capturing the complex characteristics of remote sensing scenes, as they typically rely on low resolution, single scale visual features and simplistic methods to map visual features to language features. In this paper, we present Aquila, an advanced visual language foundation model designed to enable richer visual feature representation and more precise visual-language feature alignment for remote sensing images. Our approach introduces a learnable Hierarchical Spatial Feature Integration (SFI) module that supports high resolution image inputs and aggregates multi scale visual features, allowing for the detailed representation of complex visual information. Additionally, the SFI module is repeatedly integrated into the layers of the large language model (LLM) to achieve deep visual language feature alignment, without compromising the model's performance in natural language processing tasks. These innovations, capturing detailed visual effects through higher resolution and multi scale input, and enhancing feature alignment significantly improve the model's ability to learn from image text data. We validate the effectiveness of Aquila through extensive quantitative experiments and qualitative analyses, demonstrating its superior performance.
Authors: Kaixuan Lu, Ruiqian Zhang, Xiao Huang, Yuxing Xie, Xiaogang Ning, Hanchao Zhang, Mengke Yuan, Pan Zhang, Tao Wang, Tongkui Liao
Abstract: Recent self-supervised learning (SSL) methods have demonstrated impressive results in learning visual representations from unlabeled remote sensing images. However, most remote sensing images predominantly consist of scenographic scenes containing multiple ground objects without explicit foreground targets, which limits the performance of existing SSL methods that focus on foreground targets. This raises the question: Is there a method that can automatically aggregate similar objects within scenographic remote sensing images, thereby enabling models to differentiate knowledge embedded in various geospatial patterns for improved feature representation? In this work, we present the Pattern Integration and Enhancement Vision Transformer (PIEViT), a novel self-supervised learning framework designed specifically for remote sensing imagery. PIEViT utilizes a teacher-student architecture to address both image-level and patch-level tasks. It employs the Geospatial Pattern Cohesion (GPC) module to explore the natural clustering of patches, enhancing the differentiation of individual features. The Feature Integration Projection (FIP) module further refines masked token reconstruction using geospatially clustered patches. We validated PIEViT across multiple downstream tasks, including object detection, semantic segmentation, and change detection. Experiments demonstrated that PIEViT enhances the representation of internal patch features, providing significant improvements over existing self-supervised baselines. It achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for remote sensing image interpretation tasks.
Authors: Yuhan Pan, Yanan Sun, Wei Gong
Abstract: Deep long-tailed recognition has been widely studied to address the issue of imbalanced data distributions in real-world scenarios. However, there has been insufficient focus on the design of neural architectures, despite empirical evidence suggesting that architecture can significantly impact performance. In this paper, we attempt to mitigate long-tailed issues through architectural improvements. To simplify the design process, we utilize Differential Architecture Search (DARTS) to achieve this goal. Unfortunately, existing DARTS methods struggle to perform well in long-tailed scenarios. To tackle this challenge, we introduce Long-Tailed Differential Architecture Search (LT-DARTS). Specifically, we conduct extensive experiments to explore architectural components that demonstrate better performance on long-tailed data and propose a new search space based on our observations. This ensures that the architecture obtained through our search process incorporates superior components. Additionally, we propose replacing the learnable linear classifier with an Equiangular Tight Frame (ETF) classifier to further enhance our method. This classifier effectively alleviates the biased search process and prevents performance collapse. Extensive experimental evaluations demonstrate that our approach consistently improves upon existing methods from an orthogonal perspective and achieves state-of-the-art results with simple enhancements.
Authors: Zhaorui Tan, Xi Yang, Tan Pan, Tianyi Liu, Chen Jiang, Xin Guo, Qiufeng Wang, Anh Nguyen, Yuan Qi, Kaizhu Huang, Yuan Cheng
Abstract: Personalized medicine is a groundbreaking healthcare framework for the $21^{st}$ century, tailoring medical treatments to individuals based on unique clinical characteristics, including diverse medical imaging modalities. Given the significant differences among these modalities due to distinct underlying imaging principles, generalization in multi-modal medical image tasks becomes substantially challenging. Previous methods addressing multi-modal generalization rarely consider personalization, primarily focusing on common anatomical information. This paper aims to bridge multi-modal generalization with the concept of personalized medicine. Specifically, we propose a novel approach to derive a tractable form of the underlying personalized invariant representation $\mathbb{X}_h$ by leveraging individual-level constraints and a learnable biological prior. We demonstrate the feasibility and benefits of learning a personalized $\mathbb{X}_h$, showing that this representation is highly generalizable and transferable across various multi-modal medical tasks. Our method is rigorously validated on medical imaging modalities emphasizing both physical structure and functional information, encompassing a range of tasks that require generalization. Extensive experimental results consistently show that our approach significantly improves performance across diverse scenarios, confirming its effectiveness.
Authors: Sanchar Palit, Sathya Veera Reddy Dendi, Mallikarjuna Talluri, Raj Narayana Gadde
Abstract: Vision Transformers and U-Net architectures have been widely adopted in the implementation of Diffusion Models. However, each architecture presents specific challenges while realizing them on-device. Vision Transformers require positional embedding to maintain correspondence between the tokens processed by the transformer, although they offer the advantage of using fixed-size, reusable repetitive blocks following tokenization. The U-Net architecture lacks these attributes, as it utilizes variable-sized intermediate blocks for down-convolution and up-convolution in the noise estimation backbone for the diffusion process. To address these issues, we propose an architecture that utilizes a fixed-size, reusable transformer block as a core structure, making it more suitable for hardware implementation. Our architecture is characterized by low complexity, token-free design, absence of positional embeddings, uniformity, and scalability, making it highly suitable for deployment on mobile and resource-constrained devices. The proposed model exhibit competitive and consistent performance across both unconditional and conditional image generation tasks. The model achieved a state-of-the-art FID score of 1.6 on unconditional image generation with the CelebA.
Authors: Kaixuan Lu
Abstract: The recent development of vision language models (VLMs) has led to significant advances in visual-language integration through visual instruction tuning, and they have rapidly evolved in the field of remote sensing image understanding, demonstrating their powerful capabilities. However, existing RSVLMs mainly focus on image-level or frame-level understanding, making it difficult to achieve fine-grained pixel-level visual-language alignment. Additionally, the lack of mask-based instructional data limits their further development. In this paper, we propose a mask-text instruction tuning method called Aquila-plus, which extends the capabilities of RSVLMs to achieve pixel-level visual understanding by incorporating fine-grained mask regions into language instructions. To achieve this, we first meticulously constructed a mask region-text dataset containing 100K samples, and then designed a visual-language model by injecting pixel-level representations into a large language model (LLM). Specifically, Aquila-plus uses a convolutional CLIP as the visual encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution inputs. Experimental results demonstrate that Aquila-plus outperforms existing methods in various region understanding tasks, showcasing its novel capabilities in pixel-level instruction tuning.
Authors: Weijie Ma, Jingwei Jiang, Yang Yang, Zehui Chen, Hao Chen
Abstract: With the attention gained by camera-only 3D object detection in autonomous driving, methods based on Bird-Eye-View (BEV) representation especially derived from the forward view transformation paradigm, i.e., lift-splat-shoot (LSS), have recently seen significant progress. The BEV representation formulated by the frustum based on depth distribution prediction is ideal for learning the road structure and scene layout from multi-view images. However, to retain computational efficiency, the compressed BEV representation such as in resolution and axis is inevitably weak in retaining the individual geometric details, undermining the methodological generality and applicability. With this in mind, to compensate for the missing details and utilize multi-view geometry constraints, we propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem. The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks. Having said that, due to the inherent gap between two representation spaces, we design the instance adaptor for the BEV-to-instance semantic coherence rather than pass the proposal naively. Extensive experiments demonstrated that our proposed framework is of excellent generalization ability and performance, which boosts the performances of modern LSS-based BEV perception methods without bells and whistles and outperforms current LSS-based state-of-the-art works on the large-scale nuScenes benchmark.
Authors: Shukun Jia, Yichao Cao, Feng Yang, Xin Lu, Xiaobo Lu
Abstract: Multi-object tracking is advancing through two dominant paradigms: traditional tracking by detection and newly emerging tracking by query. In this work, we fuse them together and propose the tracking-by-detection-and-query paradigm, which is achieved by a Learnable Associator. Specifically, the basic information interaction module and the content-position alignment module are proposed for thorough information Interaction among object queries. Tracking results are directly Decoded from these queries. Hence, we name the method as LAID. Compared to tracking-by-query models, LAID achieves competitive tracking accuracy with notably higher training efficiency. With regard to tracking-by-detection methods, experimental results on DanceTrack show that LAID significantly surpasses the state-of-the-art heuristic method by 3.9% on HOTA metric and 6.1% on IDF1 metric. On SportsMOT, LAID also achieves the best score on HOTA metric. By holding low training cost, strong tracking capabilities, and an elegant end-to-end approach all at once, LAID presents a forward-looking direction for the field.
Authors: Mohsen Yavartanoo, Sangmin Hong, Reyhaneh Neshatavar, Kyoung Mu Lee
Abstract: The generation of industrial Computer-Aided Design (CAD) models from user requests and specifications is crucial to enhancing efficiency in modern manufacturing. Traditional methods of CAD generation rely heavily on manual inputs and struggle with complex or non-standard designs, making them less suited for dynamic industrial needs. To overcome these challenges, we introduce Text2CAD, a novel framework that employs stable diffusion models tailored to automate the generation process and efficiently bridge the gap between user specifications in text and functional CAD models. This approach directly translates the user's textural descriptions into detailed isometric images, which are then precisely converted into orthographic views, e.g., top, front, and side, providing sufficient information to reconstruct 3D CAD models. This process not only streamlines the creation of CAD models from textual descriptions but also ensures that the resulting models uphold physical and dimensional consistency essential for practical engineering applications. Our experimental results show that Text2CAD effectively generates technical drawings that are accurately translated into high-quality 3D CAD models, showing substantial potential to revolutionize CAD automation in response to user demands.
Authors: Jing Huang, Hao Wen, Tianyi Zhou, Haozhe Lin, Yu-Kun Lai, Kun Li
Abstract: This paper aims to reconstruct hundreds of people's 3D poses, shapes, and locations from a single image with unknown camera parameters. Due to the small and highly varying 2D human scales, depth ambiguity, and perspective distortion, no existing methods can achieve globally consistent reconstruction and accurate reprojection. To address these challenges, we first propose Crowd3D, which leverages a new concept, Human-scene Virtual Interaction Point (HVIP), to convert the complex 3D human localization into 2D-pixel localization with robust camera and ground estimation to achieve globally consistent reconstruction. To achieve stable generalization on different camera FoVs without test-time optimization, we propose an extended version, Crowd3D++, which eliminates the influence of camera parameters and the cropping operation by the proposed canonical upright space and ground-aware normalization transform. In the defined upright space, Crowd3D++ also designs an HVIPNet to regress 2D HVIP and infer the depths. Besides, we contribute two benchmark datasets, LargeCrowd and SyntheticCrowd, for evaluating crowd reconstruction in large scenes. The source code and data will be made publicly available after acceptance.
Authors: Arshia Hemmat, Adam Davies, Tom A. Lamb, Jianhao Yuan, Philip Torr, Ashkan Khakzar, Francesco Pinto
Abstract: Despite the importance of shape perception in human vision, early neural image classifiers relied less on shape information for object recognition than other (often spurious) features. While recent research suggests that current large Vision-Language Models (VLMs) exhibit more reliance on shape, we find them to still be seriously limited in this regard. To quantify such limitations, we introduce IllusionBench, a dataset that challenges current cutting-edge VLMs to decipher shape information when the shape is represented by an arrangement of visual elements in a scene. Our extensive evaluations reveal that, while these shapes are easily detectable by human annotators, current VLMs struggle to recognize them, indicating important avenues for future work in developing more robust visual perception systems. The full dataset and codebase are available at: \url{https://arshiahemmat.github.io/illusionbench/}
Authors: Mei Qiu, Lauren Ann Christopher, Stanley Chien, Lingxi Li
Abstract: Vision Transformers (ViTs) have shown exceptional performance in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy. To address this challenge, we propose a novel, human perception driven, and general ViT-based ReID framework that fuses models trained on various aspect ratios. Our key contributions are threefold: (i) We analyze the impact of aspect ratios on performance using the VeRi-776 and VehicleID datasets, providing guidance for input settings based on the distribution of original image aspect ratios. (ii) We introduce patch-wise mixup strategy during ViT patchification (guided by spatial attention scores) and implement uneven stride for better alignment with object aspect ratios. (iii) We propose a dynamic feature fusion ReID network to enhance model robustness. Our method outperforms state-of-the-art transformer-based approaches on both datasets, with only a minimal increase in inference time per image.
Authors: Taha Razzaq, Asim Iqbal
Abstract: Medical brain imaging relies heavily on image registration to accurately curate structural boundaries of brain features for various healthcare applications. Deep learning models have shown remarkable performance in image registration in recent years. Still, they often struggle to handle the diversity of 3D brain volumes, challenged by their structural and contrastive variations and their imaging domains. In this work, we present NeuReg, a Neuro-inspired 3D image registration architecture with the feature of domain invariance. NeuReg generates domain-agnostic representations of imaging features and incorporates a shifting window-based Swin Transformer block as the encoder. This enables our model to capture the variations across brain imaging modalities and species. We demonstrate a new benchmark in multi-domain publicly available datasets comprising human and mouse 3D brain volumes. Extensive experiments reveal that our model (NeuReg) outperforms the existing baseline deep learning-based image registration models and provides a high-performance boost on cross-domain datasets, where models are trained on 'source-only' domain and tested on completely 'unseen' target domains. Our work establishes a new state-of-the-art for domain-agnostic 3D brain image registration, underpinned by Neuro-inspired Transformer-based architecture.
Authors: Shuang Chen, Haozheng Zhang, Amir Atapour-Abarghouei, Hubert P. H. Shum
Abstract: Image inpainting aims to repair a partially damaged image based on the information from known regions of the images. \revise{Achieving semantically plausible inpainting results is particularly challenging because it requires the reconstructed regions to exhibit similar patterns to the semanticly consistent regions}. This requires a model with a strong capacity to capture long-range dependencies. Existing models struggle in this regard due to the slow growth of receptive field for Convolutional Neural Networks (CNNs) based methods and patch-level interactions in Transformer-based methods, which are ineffective for capturing long-range dependencies. Motivated by this, we propose SEM-Net, a novel visual State Space model (SSM) vision network, modelling corrupted images at the pixel level while capturing long-range dependencies (LRDs) in state space, achieving a linear computational complexity. To address the inherent lack of spatial awareness in SSM, we introduce the Snake Mamba Block (SMB) and Spatially-Enhanced Feedforward Network. These innovations enable SEM-Net to outperform state-of-the-art inpainting methods on two distinct datasets, showing significant improvements in capturing LRDs and enhancement in spatial consistency. Additionally, SEM-Net achieves state-of-the-art performance on motion deblurring, demonstrating its generalizability. Our source code will be released in https://github.com/ChrisChen1023/SEM-Net.
Authors: Parth Parag Kulkarni, Gaurav Kumar Nayak, Mubarak Shah
Abstract: Video geolocalization is a crucial problem in current times. Given just a video, ascertaining where it was captured from can have a plethora of advantages. The problem of worldwide geolocalization has been tackled before, but only using the image modality. Its video counterpart remains relatively unexplored. Meanwhile, video geolocalization has also garnered some attention in the recent past, but the existing methods are all restricted to specific regions. This motivates us to explore the problem of video geolocalization at a global scale. Hence, we propose a novel problem of worldwide video geolocalization with the objective of hierarchically predicting the correct city, state/province, country, and continent, given a video. However, no large scale video datasets that have extensive worldwide coverage exist, to train models for solving this problem. To this end, we introduce a new dataset, CityGuessr68k comprising of 68,269 videos from 166 cities all over the world. We also propose a novel baseline approach to this problem, by designing a transformer-based architecture comprising of an elegant Self-Cross Attention module for incorporating scenes as well as a TextLabel Alignment strategy for distilling knowledge from textlabels in feature space. To further enhance our location prediction, we also utilize soft-scene labels. Finally we demonstrate the performance of our method on our new dataset as well as Mapillary(MSLS). Our code and datasets are available at: https://github.com/ParthPK/CityGuessr
Authors: Yui Tatsumi, Shoko Tanaka, Shunsuke Akamatsu, Takahiro Shindo, Hiroshi Watanabe
Abstract: Sign language is a visual language expressed through hand movements and non-manual markers. Non-manual markers include facial expressions and head movements. These expressions vary across different nations. Therefore, specialized analysis methods for each sign language are necessary. However, research on Japanese Sign Language (JSL) recognition is limited due to a lack of datasets. The development of recognition models that consider both manual and non-manual features of JSL is crucial for precise and smooth communication with deaf individuals. In JSL, sentence types such as affirmative statements and questions are distinguished by facial expressions. In this paper, we propose a JSL recognition method that focuses on facial expressions. Our proposed method utilizes a neural network to analyze facial features and classify sentence types. Through the experiments, we confirm our method's effectiveness by achieving a classification accuracy of 96.05%.
Authors: Hao Tang, Junhao Lu, Guoheng Huang, Ming Li, Xuhang Chen, Guo Zhong, Zhengguang Tan, Zinuo Li
Abstract: In Few-Shot Learning (FSL), traditional metric-based approaches often rely on global metrics to compute similarity. However, in natural scenes, the spatial arrangement of key instances is often inconsistent across images. This spatial misalignment can result in mismatched semantic pixels, leading to inaccurate similarity measurements. To address this issue, we propose a novel method called the Layer-Wise Features Metric of Semantic-Pixel Matching (LWFM-SPM) to make finer comparisons. Our method enhances model performance through two key modules: (1) the Layer-Wise Embedding (LWE) Module, which refines the cross-correlation of image pairs to generate well-focused feature maps for each layer; (2)the Semantic-Pixel Matching (SPM) Module, which aligns critical pixels based on semantic embeddings using an assignment algorithm. We conducted extensive experiments to evaluate our method on four widely used few-shot classification benchmarks: miniImageNet, tieredImageNet, CUB-200-2011, and CIFAR-FS. The results indicate that LWFM-SPM achieves competitive performance across these benchmarks. Our code will be publicly available on https://github.com/Halo2Tang/Code-for-LWFM-SPM.
Authors: Liuyue Xie, Jiancong Guo, Laszlo A. Jeni, Zhiheng Jia, Mingyang Li, Yunwen Zhou, Chao Guo
Abstract: Recent extended reality headsets and field robots have adopted covers to protect the front-facing cameras from environmental hazards and falls. The surface irregularities on the cover can lead to optical aberrations like blurring and non-parametric distortions. Novel view synthesis methods like NeRF and 3D Gaussian Splatting are ill-equipped to synthesize from sequences with optical aberrations. To address this challenge, we introduce SynthCover to enable novel view synthesis through protective covers for downstream extended reality applications. SynthCover employs a Refractive Field that estimates the cover's geometry, enabling precise analytical calculation of refracted rays. Experiments on synthetic and real-world scenes demonstrate our method's ability to accurately model scenes viewed through protective covers, achieving a significant improvement in rendering quality compared to prior methods. We also show that the model can adjust well to various cover geometries with synthetic sequences captured with covers of different surface curvatures. To motivate further studies on this problem, we provide the benchmarked dataset containing real and synthetic walkable scenes captured with protective cover optical aberrations.
Authors: Hanwen Cao, George J. Pappas, Nikolay Atanasov
Abstract: In this paper, we derive a new Kalman filter with probabilistic data association between measurements and states. We formulate a variational inference problem to approximate the posterior density of the state conditioned on the measurement data. We view the unknown data association as a latent variable and apply Expectation Maximization (EM) to obtain a filter with update step in the same form as the Kalman filter but with expanded measurement vector of all potential associations. We show that the association probabilities can be computed as permanents of matrices with measurement likelihood entries. We also propose an ambiguity check that associates only a subset of ambiguous measurements and states probabilistically, thus reducing the association time and preventing low-probability measurements from harming the estimation accuracy. Experiments in simulation show that our filter achieves lower tracking errors than the well-established joint probabilistic data association filter (JPDAF), while running at comparable rate. We also demonstrate the effectiveness of our filter in multi-object tracking (MOT) on multiple real-world datasets, including MOT17, MOT20, and DanceTrack. We achieve better higher order tracking accuracy (HOTA) than previous Kalman-filter methods and remain real-time. Associating only bounding boxes without deep features or velocities, our method ranks top-10 on both MOT17 and MOT20 in terms of HOTA. Given offline detections, our algorithm tracks at 250+ fps on a single laptop CPU. Code is available at https://github.com/hwcao17/pkf.
Authors: Junyi Zhang, Chang Liu, Chun Yang
Abstract: In text recognition, complex glyphs and tail classes have always been factors affecting model performance. Specifically for Chinese text recognition, the lack of shape-awareness can lead to confusion among close complex characters. Since such characters are often tail classes that appear less frequently in the training-set, making it harder for the model to capture its shape information. Hence in this work, we propose a structure-aware network utilizing the hierarchical composition information to improve the recognition performance of complex characters. Implementation-wise, we first propose an auxiliary radical branch and integrate it into the base recognition network as a regularization term, which distills hierarchical composition information into the feature extractor. A Tree-Similarity-based weighting mechanism is then proposed to further utilize the depth information in the hierarchical representation. Experiments demonstrate that the proposed approach can significantly improve the performances of complex characters and tail characters, yielding a better overall performance. Code is available at https://github.com/Levi-ZJY/SAN.
Authors: Yutong Chen, Marko Mihajlovic, Xiyi Chen, Yiming Wang, Sergey Prokudin, Siyu Tang
Abstract: 3D Gaussian Splatting (3DGS) has recently transformed photorealistic reconstruction, achieving high visual fidelity and real-time performance. However, rendering quality significantly deteriorates when test views deviate from the camera angles used during training, posing a major challenge for applications in immersive free-viewpoint rendering and navigation. In this work, we conduct a comprehensive evaluation of 3DGS and related novel view synthesis methods under out-of-distribution (OOD) test camera scenarios. By creating diverse test cases with synthetic and real-world datasets, we demonstrate that most existing methods, including those incorporating various regularization techniques and data-driven priors, struggle to generalize effectively to OOD views. To address this limitation, we introduce SplatFormer, the first point transformer model specifically designed to operate on Gaussian splats. SplatFormer takes as input an initial 3DGS set optimized under limited training views and refines it in a single forward pass, effectively removing potential artifacts in OOD test views. To our knowledge, this is the first successful application of point transformers directly on 3DGS sets, surpassing the limitations of previous multi-scene training methods, which could handle only a restricted number of input views during inference. Our model significantly improves rendering quality under extreme novel views, achieving state-of-the-art performance in these challenging scenarios and outperforming various 3DGS regularization techniques, multi-scene models tailored for sparse view synthesis, and diffusion-based frameworks.
Authors: Dmitry Vesnin, Dmitry Levshun, Andrey Chechulin
Abstract: In recent years, diffusion models have become one of the main methods for generating images. However, detecting images generated by these models remains a challenging task. This paper proposes a novel method for detecting images generated by Latent Diffusion Models (LDM) by identifying artifacts introduced by their autoencoders. By training a detector to distinguish between real images and those reconstructed by the LDM autoencoder, the method enables detection of generated images without directly training on them. The novelty of this research lies in the fact that, unlike similar approaches, this method does not require training on synthesized data, significantly reducing computational costs and enhancing generalization ability. Experimental results show high detection accuracy with minimal false positives, making this approach a promising tool for combating fake images.
Authors: Minghong Duan, Linhao Qu, Shaolei Liu, Manning Wang
Abstract: Implicit neural representations have recently demonstrated promising potential in arbitrary-scale Super-Resolution (SR) of images. Most existing methods predict the pixel in the SR image based on the queried coordinate and ensemble nearby features, overlooking the importance of incorporating high-frequency prior information in images, which results in limited performance in reconstructing high-frequency texture details in images. To address this issue, we propose the Local Implicit Wavelet Transformer (LIWT) to enhance the restoration of high-frequency texture details. Specifically, we decompose the features extracted by an encoder into four sub-bands containing different frequency information using Discrete Wavelet Transform (DWT). We then introduce the Wavelet Enhanced Residual Module (WERM) to transform these four sub-bands into high-frequency priors, followed by utilizing the Wavelet Mutual Projected Fusion (WMPF) and the Wavelet-aware Implicit Attention (WIA) to fully exploit the high-frequency prior information for recovering high-frequency details in images. We conducted extensive experiments on benchmark datasets to validate the effectiveness of LIWT. Both qualitative and quantitative results demonstrate that LIWT achieves promising performance in arbitrary-scale SR tasks, outperforming other state-of-the-art methods. The code is available at https://github.com/dmhdmhdmh/LIWT.
Authors: Taohui Xiao, Jian Cheng, Wenxin Fan, Enqing Dong, Hairong Zheng, Shanshan Wang
Abstract: Neurite Orientation Dispersion and Density Imaging (NODDI) microstructure estimation from diffusion magnetic resonance imaging (dMRI) is of great significance for the discovery and treatment of various neurological diseases. Current deep learning-based methods accelerate the speed of NODDI parameter estimation and improve the accuracy. However, most methods require the number and coordinates of gradient directions during testing and training to remain strictly consistent, significantly limiting the generalization and robustness of these models in NODDI parameter estimation. In this paper, we propose a q-space sampling augmentation-based continuous representation learning framework (SamRobNODDI) to achieve robust and generalized NODDI. Specifically, a continuous representation learning method based on q-space sampling augmentation is introduced to fully explore the information between different gradient directions in q-space. Furthermore, we design a sampling consistency loss to constrain the outputs of different sampling schemes, ensuring that the outputs remain as consistent as possible, thereby further enhancing performance and robustness to varying q-space sampling schemes. SamRobNODDI is also a flexible framework that can be applied to different backbone networks. To validate the effectiveness of the proposed method, we compared it with 7 state-of-the-art methods across 18 different q-space sampling schemes, demonstrating that the proposed SamRobNODDI has better performance, robustness, generalization, and flexibility.
Authors: Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, Zheng-Jun Zha
Abstract: Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI's Sora and other latent video diffusion generation models. While most of existing video VAEs inflate a pretrained image VAE into the 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from a lower-dimension image VAE while the other half involves temporal compression through 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame video. Extensive experiments on five benchmarks demonstrate the SOTA video reconstruction and generation capabilities of the proposed IV-VAE (https://wpy1999.github.io/IV-VAE/).
Authors: Chen Wu, Ling Wang, Long Peng, Dianjie Lu, Zhuoran Zheng
Abstract: With the popularization of high-end mobile devices, Ultra-high-definition (UHD) images have become ubiquitous in our lives. The restoration of UHD images is a highly challenging problem due to the exaggerated pixel count, which often leads to memory overflow during processing. Existing methods either downsample UHD images at a high rate before processing or split them into multiple patches for separate processing. However, high-rate downsampling leads to significant information loss, while patch-based approaches inevitably introduce boundary artifacts. In this paper, we propose a novel design paradigm to solve the UHD image restoration problem, called D2Net. D2Net enables direct full-resolution inference on UHD images without the need for high-rate downsampling or dividing the images into several patches. Specifically, we ingeniously utilize the characteristics of the frequency domain to establish long-range dependencies of features. Taking into account the richer local patterns in UHD images, we also design a multi-scale convolutional group to capture local features. Additionally, during the decoding stage, we dynamically incorporate features from the encoding stage to reduce the flow of irrelevant information. Extensive experiments on three UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring, show that our model achieves better quantitative and qualitative results than state-of-the-art methods.
Authors: Boyao Wang, Volodymyr Kindratenko
Abstract: Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in recent years. Compressing these models not only reduces storage requirements, making deployment to edge devices feasible, but also accelerates inference, thereby reducing latency and computational costs. Structured pruning, which removes filters at the layer level, directly modifies the model architecture. This approach achieves a more compact architecture while maintaining target accuracy, ensuring that the compressed model retains good compatibility and hardware efficiency. Our method is based on a key observation: filters in different layers of a neural network have varying importance to the model's performance. When the number of filters to prune is fixed, the optimal pruning distribution across different layers is uneven to minimize performance loss. Layers that are more sensitive to pruning should account for a smaller proportion of the pruning distribution. To leverage this insight, we propose RL-Pruner, which uses reinforcement learning to learn the optimal pruning distribution. RL-Pruner can automatically extract dependencies between filters in the input model and perform pruning, without requiring model-specific pruning implementations. We conducted experiments on models such as GoogleNet, ResNet, and MobileNet, comparing our approach to other structured pruning methods to validate its effectiveness. Our code is available at https://github.com/Beryex/RLPruner-CNN.
Authors: R\'emi Giraud, Micha\"el Cl\'ement
Abstract: For many years, image over-segmentation into superpixels has been essential to computer vision pipelines, by creating homogeneous and identifiable regions of similar sizes. Such constrained segmentation problem would require a clear definition and specific evaluation criteria. However, the validation framework for superpixel methods, typically viewed as standard object segmentation, has rarely been thoroughly studied. In this work, we first take a step back to show that superpixel segmentation is fundamentally an ill-posed problem, due to the implicit regularity constraint on the shape and size of superpixels. We also demonstrate through a novel comprehensive study that the literature suffers from only evaluating certain aspects, sometimes incorrectly and with inappropriate metrics. Concurrently, recent deep learning-based superpixel methods mainly focus on the object segmentation task at the expense of regularity. In this ill-posed context, we show that we can achieve competitive results using a recent architecture like the Segment Anything Model (SAM), without dedicated training for the superpixel segmentation task. This leads to rethinking superpixel segmentation and the necessary properties depending on the targeted downstream task.
Authors: Zeyu Zhang, Hang Gao, Akide Liu, Qi Chen, Feng Chen, Yiran Wang, Danning Li, Hao Tang
Abstract: Human motion generation is a cut-edge area of research in generative computer vision, with promising applications in video creation, game development, and robotic manipulation. The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain: Firstly, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of the implicit memory leads to memory decay. Secondly, Mamba struggles with multimodal fusion compared to Transformers, and lack alignment with textual queries, often confusing directions (left or right) or omitting parts of longer text queries. To address these challenges, our paper presents three key contributions: Firstly, we introduce KMM, a novel architecture featuring Key frame Masking Modeling, designed to enhance Mamba's focus on key actions in motion segments. This approach addresses the memory decay problem and represents a pioneering method in customizing strategic frame-level masking in SSMs. Additionally, we designed a contrastive learning paradigm for addressing the multimodal fusion problem in Mamba and improving the motion-text alignment. Finally, we conducted extensive experiments on the go-to dataset, BABEL, achieving state-of-the-art performance with a reduction of more than 57% in FID and 70% parameters compared to previous state-of-the-art methods. See project website: https://steve-zeyu-zhang.github.io/KMM
Authors: Kecia G. de Moura, Rafael M. O. Cruz, Robert Sabourin
Abstract: Handwritten Signature Verification (HSV) systems distinguish between genuine and forged signatures. Traditional HSV development involves a static batch configuration, constraining the system's ability to model signatures to the limited data available. Signatures exhibit high intra-class variability and are sensitive to various factors, including time and external influences, imparting them a dynamic nature. This paper investigates the signature learning process within a data stream context. We propose a novel HSV approach with an adaptive system that receives an infinite sequence of signatures and is updated over time. Experiments were carried out on GPDS Synthetic, CEDAR, and MCYT datasets. Results demonstrate the superior performance of the proposed method compared to standard approaches that use a Support Vector Machine as a classifier. Implementation of the method is available at https://github.com/kdMoura/stream_hsv.
Authors: Wanquan Feng, Jiawei Liu, Pengqi Tu, Tianhao Qi, Mingzhen Sun, Tianxiang Ma, Songtao Zhao, Siyu Zhou, Qian He
Abstract: Video generation technologies are developing rapidly and have broad potential applications. Among these technologies, camera control is crucial for generating professional-quality videos that accurately meet user expectations. However, existing camera control methods still suffer from several limitations, including control precision and the neglect of the control for subject motion dynamics. In this work, we propose I2VControl-Camera, a novel camera control method that significantly enhances controllability while providing adjustability over the strength of subject motion. To improve control precision, we employ point trajectory in the camera coordinate system instead of only extrinsic matrix information as our control signal. To accurately control and adjust the strength of subject motion, we explicitly model the higher-order components of the video trajectory expansion, not merely the linear terms, and design an operator that effectively represents the motion strength. We use an adapter architecture that is independent of the base model structure. Experiments on static and dynamic scenes show that our framework outperformances previous methods both quantitatively and qualitatively.
Authors: Moritz Heep, Eduard Zell
Abstract: Image segmentation in RGB space is a notoriously difficult task where state-of-the-art methods are trained on thousands or even millions of annotated images. While the performance is impressive, it is still not perfect. We propose a novel image segmentation method, achieving similar segmentation quality but without training. Instead, we require an image sequence with a static camera and a single light source at varying positions, as used in for photometric stereo, for example.
Authors: Faisal Mehmood, Xin Guo, Enqing Chen, Muhammad Azeem Akbar, Arif Ali Khan, Sami Ullah
Abstract: Graph convolutional networks (GCNs) are an effective skeleton-based human action recognition (HAR) technique. GCNs enable the specification of CNNs to a non-Euclidean frame that is more flexible. The previous GCN-based models still have a lot of issues: (I) The graph structure is the same for all model layers and input data.
Authors: Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, Ying Tai
Abstract: In this paper, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, thus only applicable to specific models, or manipulate on score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we decouple the multi-region generation into two sub-tasks, the construction of individual region (Regional Hard Binding) that ensures the regional prompt is properly executed, and the overall detail refinement (Regional Soft Refinement) over regions that dismiss the visual boundaries and enhance adjacent interactions. Furthermore, RAG novelly makes repainting feasible, where users can modify specific unsatisfied regions in the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement to the prompt following property. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance over attribute binding and object relationship than previous tuning-free methods.
Authors: Hadeel Awwad, Eloy Garc\'ia, Robert Mart\'i
Abstract: Breast compression simulation is essential for accurate image registration from 3D modalities to X-ray procedures like mammography. It accounts for tissue shape and position changes due to compression, ensuring precise alignment and improved analysis. Although Finite Element Analysis (FEA) is reliable for approximating soft tissue deformation, it struggles with balancing accuracy and computational efficiency. Recent studies have used data-driven models trained on FEA results to speed up tissue deformation predictions. We propose to explore Physics-based Graph Neural Networks (PhysGNN) for breast compression simulation. PhysGNN has been used for data-driven modelling in other domains, and this work presents the first investigation of their potential in predicting breast deformation during mammographic compression. Unlike conventional data-driven models, PhysGNN, which incorporates mesh structural information and enables inductive learning on unstructured grids, is well-suited for capturing complex breast tissue geometries. Trained on deformations from incremental FEA simulations, PhysGNN's performance is evaluated by comparing predicted nodal displacements with those from finite element (FE) simulations. This deep learning (DL) framework shows promise for accurate, rapid breast deformation approximations, offering enhanced computational efficiency for real-world scenarios.
Authors: Decai Chen, Brianne Oberson, Ingo Feldmann, Oliver Schreer, Anna Hilsmann, Peter Eisert
Abstract: 3D Gaussian Splatting has recently achieved notable success in novel view synthesis for dynamic scenes and geometry reconstruction in static scenes. Building on these advancements, early methods have been developed for dynamic surface reconstruction by globally optimizing entire sequences. However, reconstructing dynamic scenes with significant topology changes, emerging or disappearing objects, and rapid movements remains a substantial challenge, particularly for long sequences. To address these issues, we propose AT-GS, a novel method for reconstructing high-quality dynamic surfaces from multi-view videos through per-frame incremental optimization. To avoid local minima across frames, we introduce a unified and adaptive gradient-aware densification strategy that integrates the strengths of conventional cloning and splitting techniques. Additionally, we reduce temporal jittering in dynamic surfaces by ensuring consistency in curvature maps across consecutive frames. Our method achieves superior accuracy and temporal coherence in dynamic surface reconstruction, delivering high-fidelity space-time novel view synthesis, even in complex and challenging scenes. Extensive experiments on diverse multi-view video datasets demonstrate the effectiveness of our approach, showing clear advantages over baseline methods. Project page: \url{https://fraunhoferhhi.github.io/AT-GS}
Authors: Deegan Atha, Xianmei Lei, Shehryar Khattak, Anna Sabel, Elle Miller, Aurelio Noca, Grace Lim, Jeffrey Edlund, Curtis Padgett, Patrick Spieler
Abstract: Off-road environments pose significant perception challenges for high-speed autonomous navigation due to unstructured terrain, degraded sensing conditions, and domain-shifts among biomes. Learning semantic information across these conditions and biomes can be challenging when a large amount of ground truth data is required. In this work, we propose an approach that leverages a pre-trained Vision Transformer (ViT) with fine-tuning on a small (<500 images), sparse and coarsely labeled (<30% pixels) multi-biome dataset to predict 2D semantic segmentation classes. These classes are fused over time via a novel range-based metric and aggregated into a 3D semantic voxel map. We demonstrate zero-shot out-of-biome 2D semantic segmentation on the Yamaha (52.9 mIoU) and Rellis (55.5 mIoU) datasets along with few-shot coarse sparse labeling with existing data for improved segmentation performance on Yamaha (66.6 mIoU) and Rellis (67.2 mIoU). We further illustrate the feasibility of using a voxel map with a range-based semantic fusion approach to handle common off-road hazards like pop-up hazards, overhangs, and water features.
Authors: Zhengyi Liu, Longzhen Wang, Xianyong Fang, Zhengzheng Tu, Linbo Wang
Abstract: A light field camera can reconstruct 3D scenes using captured multi-focus images that contain rich spatial geometric information, enhancing applications in stereoscopic photography, virtual reality, and robotic vision. In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced to emphasize four main insights: (a) Efficient feature extraction, where SAM is used to extract modality-aware discriminative features; (b) Inter-slice relation modeling, leveraging Mamba to capture long-range dependencies across multiple focal slices, thus extracting implicit depth cues; (c) Inter-modal relation modeling, utilizing Mamba to integrate all-focus and multi-focus images, enabling mutual enhancement; (d) Weakly supervised learning capability, developing a scribble annotation dataset from an existing pixel-level mask dataset, establishing the first scribble-supervised baseline for light field salient object detection.https://github.com/liuzywen/LFScribble
Authors: Clayton Fields, Casey Kennington
Abstract: In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance, by freezing large parts of vision-language models during pretraining. In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.
Authors: Xinyang Huang, Chuang Zhu, Bowen Zhang, Shanghang Zhang
Abstract: Semi-supervised domain adaptation (SSDA) has been widely studied due to its ability to utilize a few labeled target data to improve the generalization ability of the model. However, existing methods only consider designing certain strategies for target samples to adapt, ignoring the exploration of customized learning for different target samples. When the model encounters complex target distribution, existing methods will perform limited due to the inability to clearly and comprehensively learn the knowledge of multiple types of target samples. To fill this gap, this paper focuses on designing a framework to use different strategies for comprehensively mining different target samples. We propose a novel source-free framework (SOUF) to achieve semi-supervised fine-tuning of the source pre-trained model on the target domain. Different from existing SSDA methods, SOUF decouples SSDA from the perspectives of different target samples, specifically designing robust learning techniques for unlabeled, reliably labeled, and noisy pseudo-labeled target samples. For unlabeled target samples, probability-based weighted contrastive learning (PWC) helps the model learn more discriminative feature representations. To mine the latent knowledge of labeled target samples, reliability-based mixup contrastive learning (RMC) learns complex knowledge from the constructed reliable sample set. Finally, predictive regularization learning (PR) further mitigates the misleading effect of noisy pseudo-labeled samples on the model. Extensive experiments on benchmark datasets demonstrate the superiority of our framework over state-of-the-art methods.
Authors: Li Yu, Zhihui Li, Jimin Xiao, Moncef Gabbouj
Abstract: Neural Representations for Videos (NeRV) have simplified the video codec process and achieved swift decoding speeds by encoding video content into a neural network, presenting a promising solution for video compression. However, existing work overlooks the crucial issue that videos reconstructed by these methods lack high-frequency details. To address this problem, this paper introduces a High-Frequency Enhanced Hybrid Neural Representation Network. Our method focuses on leveraging high-frequency information to improve the synthesis of fine details by the network. Specifically, we design a wavelet high-frequency encoder that incorporates Wavelet Frequency Decomposer (WFD) blocks to generate high-frequency feature embeddings. Next, we design the High-Frequency Feature Modulation (HFM) block, which leverages the extracted high-frequency embeddings to enhance the fitting process of the decoder. Finally, with the refined Harmonic decoder block and a Dynamic Weighted Frequency Loss, we further reduce the potential loss of high-frequency information. Experiments on the Bunny and UVG datasets demonstrate that our method outperforms other methods, showing notable improvements in detail preservation and compression performance.
Authors: Yichun Shi, Peng Wang, Weilin Huang
Abstract: We introduce SeedEdit, a diffusion model that is able to revise a given image with any text prompt. In our perspective, the key to such a task is to obtain an optimal balance between maintaining the original image, i.e. image reconstruction, and generating a new image, i.e. image re-generation. To this end, we start from a weak generator (text-to-image model) that creates diverse pairs between such two directions and gradually align it into a strong image editor that well balances between the two tasks. SeedEdit can achieve more diverse and stable editing capability over prior image editing methods, enabling sequential revision over images generated by diffusion models.
Authors: Guandong Li
Abstract: Controllable image generation has always been one of the core demands in image generation, aiming to create images that are both creative and logical while satisfying additional specified conditions. In the post-AIGC era, controllable generation relies on diffusion models and is accomplished by maintaining certain components or introducing inference interferences. This paper addresses key challenges in controllable generation: 1. mismatched object attributes during generation and poor prompt-following effects; 2. inadequate completion of controllable layouts. We propose a train-free method based on attention loss backward, cleverly controlling the cross attention map. By utilizing external conditions such as prompts that can reasonably map onto the attention map, we can control image generation without any training or fine-tuning. This method addresses issues like attribute mismatch and poor prompt-following while introducing explicit layout constraints for controllable image generation. Our approach has achieved excellent practical applications in production, and we hope it can serve as an inspiring technical report in this field.
Authors: Xiaolong Wang, Lei Yu, Yingying Zhang, Jiangwei Lao, Lixiang Ru, Liheng Zhong, Jingdong Chen, Yu Zhang, Ming Yang
Abstract: Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semi-dense matching approaches have achieved substantial performance enhancements and established a widely-accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in a higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.
Authors: Jia Syuen Lim, Yadan Luo, Zhi Chen, Tianqi Wei, Scott Chapman, Zi Huang
Abstract: In the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP) - a weakly supervised ensemble technique for sweet peppers tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models like Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These pseudo-labels, refined when necessary, are used to train a YOLOv8 segmentation network. To enhance detection accuracy under challenging conditions, we incorporate pre-processing techniques such as relighting adjustments and apply depth-based filtering during post-inference. For object tracking, we integrate the Matching by Segment Anything (MASA) adapter with the BoT-SORT algorithm. Our approach achieves a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%, demonstrating effective tracking of sweet peppers without extensive manual effort. This work highlights the potential of foundation models for efficient and accurate object detection and tracking in agricultural settings.
Authors: Yanguang Sun, Jian Yang, Lei Luo
Abstract: Recently, deep learning-based salient object detection (SOD) in optical remote sensing images (ORSIs) have achieved significant breakthroughs. We observe that existing ORSIs-SOD methods consistently center around optimizing pixel features in the spatial domain, progressively distinguishing between backgrounds and objects. However, pixel information represents local attributes, which are often correlated with their surrounding context. Even with strategies expanding the local region, spatial features remain biased towards local characteristics, lacking the ability of global perception. To address this problem, we introduce the Fourier transform that generate global frequency features and achieve an image-size receptive field. To be specific, we propose a novel United Domain Cognition Network (UDCNet) to jointly explore the global-local information in the frequency and spatial domains. Technically, we first design a frequency-spatial domain transformer block that mutually amalgamates the complementary local spatial and global frequency features to strength the capability of initial input features. Furthermore, a dense semantic excavation module is constructed to capture higher-level semantic for guiding the positioning of remote sensing objects. Finally, we devise a dual-branch joint optimization decoder that applies the saliency and edge branches to generate high-quality representations for predicting salient objects. Experimental results demonstrate the superiority of the proposed UDCNet method over 24 state-of-the-art models, through extensive quantitative and qualitative comparisons in three widely-used ORSIs-SOD datasets. The source code is available at: \href{https://github.com/CSYSI/UDCNet}{\color{blue} https://github.com/CSYSI/UDCNet}.
URLs: https://github.com/CSYSI/UDCNet, https://github.com/CSYSI/UDCNet
Authors: Osman Akar, Yushan Han, Yizhou Chen, Weixian Lan, Benn Gallagher, Ronald Fedkiw, Joseph Teran
Abstract: We present learning-based implicit shape representations designed for real-time avatar collision queries arising in the simulation of clothing. Signed distance functions (SDFs) have been used for such queries for many years due to their computational efficiency. Recently deep neural networks have been used for implicit shape representations (DeepSDFs) due to their ability to represent multiple shapes with modest memory requirements compared to traditional representations over dense grids. However, the computational expense of DeepSDFs prevents their use in real-time clothing simulation applications. We design a learning-based representation of SDFs for human avatars whoes bodies change shape kinematically due to joint-based skinning. Rather than using a single DeepSDF for the entire avatar, we use a collection of extremely computationally efficient (shallow) neural networks that represent localized deformations arising from changes in body shape induced by the variation of a single joint. This requires a stitching process to combine each shallow SDF in the collection together into one SDF representing the signed closest distance to the boundary of the entire body. To achieve this we augment each shallow SDF with an additional output that resolves whether or not the individual shallow SDF value is referring to a closest point on the boundary of the body, or to a point on the interior of the body (but on the boundary of the individual shallow SDF). Our model is extremely fast and accurate and we demonstrate its applicability with real-time simulation of garments driven by animated characters.
Authors: Shizhe Yuan, Li Zhou
Abstract: With the advancement of artificial intelligence, 3D human pose estimation-based systems for sports training and posture correction have gained significant attention in adolescent sports. However, existing methods face challenges in handling complex movements, providing real-time feedback, and accommodating diverse postures, particularly with occlusions, rapid movements, and the resource constraints of Internet of Things (IoT) devices, making it difficult to balance accuracy and real-time performance. To address these issues, we propose GTA-Net, an intelligent system for posture correction and real-time feedback in adolescent sports, integrated within an IoT-enabled environment. This model enhances pose estimation in dynamic scenes by incorporating Graph Convolutional Networks (GCN), Temporal Convolutional Networks (TCN), and Hierarchical Attention mechanisms, achieving real-time correction through IoT devices. Experimental results show GTA-Net's superior performance on Human3.6M, HumanEva-I, and MPI-INF-3DHP datasets, with Mean Per Joint Position Error (MPJPE) values of 32.2mm, 15.0mm, and 48.0mm, respectively, significantly outperforming existing methods. The model also demonstrates strong robustness in complex scenarios, maintaining high accuracy even with occlusions and rapid movements. This system enhances real-time posture correction and offers broad applications in intelligent sports and health management.
Authors: Yueyang Cang, Yu hang liu, Li Shi
Abstract: Kolmogorov-Arnold Networks(KANs), as a theoretically efficient neural network architecture, have garnered attention for their potential in capturing complex patterns. However, their application in computer vision remains relatively unexplored. This study first analyzes the potential of KAN in computer vision tasks, evaluating the performance of KAN and its convolutional variants in image classification and semantic segmentation. The focus is placed on examining their characteristics across varying data scales and noise levels. Results indicate that while KAN exhibits stronger fitting capabilities, it is highly sensitive to noise, limiting its robustness. To address this challenge, we propose a smoothness regularization method and introduce a Segment Deactivation technique. Both approaches enhance KAN's stability and generalization, demonstrating its potential in handling complex visual data tasks.
Authors: Zefan Qu, Ke Xu, Gerhard Petrus Hancke, Rynson W. H. Lau
Abstract: Neural Radiance Fields (NeRFs) have shown remarkable performances in producing novel-view images from high-quality scene images. However, hand-held low-light photography challenges NeRFs as the captured images may simultaneously suffer from low visibility, noise, and camera shakes. While existing NeRF methods may handle either low light or motion, directly combining them or incorporating additional image-based enhancement methods does not work as these degradation factors are highly coupled. We observe that noise in low-light images is always sharp regardless of camera shakes, which implies an implicit order of these degradation factors within the image formation process. To this end, we propose in this paper a novel model, named LuSh-NeRF, which can reconstruct a clean and sharp NeRF from a group of hand-held low-light images. The key idea of LuSh-NeRF is to sequentially model noise and blur in the images via multi-view feature consistency and frequency information of NeRF, respectively. Specifically, LuSh-NeRF includes a novel Scene-Noise Decomposition (SND) module for decoupling the noise from the scene representation and a novel Camera Trajectory Prediction (CTP) module for the estimation of camera motions based on low-frequency scene information. To facilitate training and evaluations, we construct a new dataset containing both synthetic and real images. Experiments show that LuSh-NeRF outperforms existing approaches. Our code and dataset can be found here: https://github.com/quzefan/LuSh-NeRF.
Authors: Hongsheng Zhang, Zhong Ji, Jingren Liu, Yanwei Pang, Jungong Han
Abstract: Vision Language Models (VLMs), pre-trained on large-scale image-text datasets, enable zero-shot predictions for unseen data but may underperform on specific unseen tasks. Continual learning (CL) can help VLMs effectively adapt to new data distributions without joint training, but faces challenges of catastrophic forgetting and generalization forgetting. Although significant progress has been achieved by distillation-based methods, they exhibit two severe limitations. One is the popularly adopted single-teacher paradigm fails to impart comprehensive knowledge, The other is the existing methods inadequately leverage the multimodal information in the original training dataset, instead they rely on additional data for distillation, which increases computational and storage overhead. To mitigate both limitations, by drawing on Knowledge Integration Theory (KIT), we propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods. MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections. During the four stages, we first leverage prototypes to align across modalities, eliciting cross-modal knowledge, then adding new knowledge by constructing fine-grained intra- and inter-modality relationships with prototypes. After that, knowledge from two teacher models is adaptively distinguished and re-weighted. Finally, we connect between models from intra- and inter-task, integrating preceding and new knowledge. Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks, showcasing its potential in adapting VLMs to evolving data distributions.
Authors: Mikhail Dremin (Lomonosov Moscow State University), Konstantin Kozhemyakov (Lomonosov Moscow State University), Ivan Molodetskikh (Lomonosov Moscow State University), Malakhov Kirill (Huawei Technologies Co., Ltd), Artur Sagitov (Huawei Technologies Co., Ltd, Independent Researcher Linjianping), Dmitriy Vatolin (Lomonosov Moscow State University)
Abstract: A main goal in developing video-compression algorithms is to enhance human-perceived visual quality while maintaining file size. But modern video-analysis efforts such as detection and recognition, which are integral to video surveillance and autonomous vehicles, involve so much data that they necessitate machine-vision processing with minimal human intervention. In such cases, the video codec must be optimized for machine vision. This paper explores the effects of compression on detection and recognition algorithms (objects, faces, and license plates) and introduces novel full-reference image/video-quality metrics for each task, tailored to machine vision. Experimental results indicate our proposed metrics correlate better with the machine-vision results for the respective tasks than do existing image/video-quality metrics.
Authors: Shubo Lin, Yutong Kou, Bing Li, Weiming Hu, Jin Gao
Abstract: In camera-based 3D multi-object tracking (MOT), the prevailing methods follow the tracking-by-query-propagation paradigm, which employs track queries to manage the lifecycle of identity-consistent tracklets while object queries handle the detection of new-born tracklets. However, this intertwined paradigm leads the inter-temporal tracking task and the single-frame detection task utilize the same model parameters, complicating training optimization. Drawing inspiration from studies on the roles of attention components in transformer-based decoders, we identify that the dispersing effect of self-attention necessitates object queries to match with new-born tracklets. This matching strategy diverges from the detection pre-training phase, where object queries align with all ground-truth targets, resulting in insufficient supervision signals. To address these issues, we present HSTrack, a novel plug-and-play method designed to co-facilitate multi-task learning for detection and tracking. HSTrack constructs a parallel weight-share decoder devoid of self-attention layers, circumventing competition between different types of queries. Considering the characteristics of cross-attention layer and distinct query types, our parallel decoder adopt one-to-one and one-to-many label assignment strategies for track queries and object queries, respectively. Leveraging the shared architecture, HSTrack further improve trackers for spatio-temporal modeling and quality candidates generation. Extensive experiments demonstrate that HSTrack consistently delivers improvements when integrated with various query-based 3D MOT trackers. For example, HSTrack improves the state-of-the-art PF-Track method by $+2.3\%$ AMOTA and $+1.7\%$ mAP on the nuScenes dataset.
Authors: Jiawei Fan, Chao Li, Xiaolong Liu, Anbang Yao
Abstract: In this paper, we question if well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross architecture knowledge distillation (KD) research, in the context of using large-scale datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components namely cross attention projector, dual-view feature mimicking and teacher parameter perception tailored to address the above problems, we present a simple and effective KD method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method gets 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains to the individually trained counterparts. Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models. The student backbones trained by our method transfer well on downstream MS-COCO and ADE20K datasets. More importantly, our method could be used as a more efficient alternative to the time-intensive pre-training paradigm for any target student model if a strong pre-trained ViT is available, reducing the amount of viewed training samples up to 195x.
Authors: Miguel Antunes-Garc\'ia, Luis M. Bergasa, Santiago Montiel-Mar\'in, Rafael Barea, Fabio S\'anchez-Garc\'ia, \'Angel Llamazares
Abstract: Accurate object detection and prediction are critical to ensure the safety and efficiency of self-driving architectures. Predicting object trajectories and occupancy enables autonomous vehicles to anticipate movements and make decisions with future information, increasing their adaptability and reducing the risk of accidents. Current State-Of-The-Art (SOTA) approaches often isolate the detection, tracking, and prediction stages, which can lead to significant prediction errors due to accumulated inaccuracies between stages. Recent advances have improved the feature representation of multi-camera perception systems through Bird's-Eye View (BEV) transformations, boosting the development of end-to-end systems capable of predicting environmental elements directly from vehicle sensor data. These systems, however, often suffer from high processing times and number of parameters, creating challenges for real-world deployment. To address these issues, this paper introduces a novel BEV instance prediction architecture based on a simplified paradigm that relies only on instance segmentation and flow prediction. The proposed system prioritizes speed, aiming at reduced parameter counts and inference times compared to existing SOTA architectures, thanks to the incorporation of an efficient transformer-based architecture. Furthermore, the implementation of the proposed architecture is optimized for performance improvements in PyTorch version 2.1. Code and trained models are available at https://github.com/miguelag99/Efficient-Instance-Prediction
URLs: https://github.com/miguelag99/Efficient-Instance-Prediction
Authors: Andr\'es Mu\~noz, Nancy Thomas, Annita Vapsi, Daciel Borrajo
Abstract: Many industrial and service sectors require tools to extract vehicle characteristics from images. This is a complex task not only by the variety of noise, and large number of classes, but also by the constant introduction of new vehicle models to the market. In this paper, we present Veri-Car, an information retrieval integrated approach designed to help on this task. It leverages supervised learning techniques to accurately identify the make, type, model, year, color, and license plate of cars. The approach also addresses the challenge of handling open-world problems, where new car models and variations frequently emerge, by employing a sophisticated combination of pre-trained models, and a hierarchical multi-similarity loss. Veri-Car demonstrates robust performance, achieving high precision and accuracy in classifying both seen and unseen data. Additionally, it integrates an ensemble license plate detection, and an OCR model to extract license plate numbers with impressive accuracy.
Authors: Junho Kim, Hyungjin Chung, Byung-Hoon Kim
Abstract: Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, where the need for support keypoints is eliminated. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method only employs query image and detailed text descriptions as an input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as choosing the optimal description for keypoints, neural network architectures, and training strategies. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state-of-the-art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.
Authors: Antoine Hanna-Asaad, Decky Aspandi, Titus Zaharia
Abstract: Video captioning aims to describe video contents using natural language format that involves understanding and interpreting scenes, actions and events that occurs simultaneously on the view. Current approaches have mainly concentrated on visual cues, often neglecting the rich information available from other important modality of audio information, including their inter-dependencies. In this work, we introduce a novel video captioning method trained with multi-modal contrastive loss that emphasizes both multi-modal integration and interpretability. Our approach is designed to capture the dependency between these modalities, resulting in more accurate, thus pertinent captions. Furthermore, we highlight the importance of interpretability, employing multiple attention mechanisms that provide explanation into the model's decision-making process. Our experimental results demonstrate that our proposed method performs favorably against the state-of the-art models on commonly used benchmark datasets of MSR-VTT and VATEX.
Authors: Jai G Singla
Abstract: Accurate classification of buildings into residential and non-residential categories is crucial for urban planning, infrastructure development, population estimation and resource allocation. It is a complex job to carry out automatic classification of residential and nonresidential buildings manually using satellite data. In this paper, we are proposing a novel deep learning approach that combines high-resolution satellite data (50 cm resolution Image + 1m grid interval DEM) and vector data to achieve high-performance building classification. Our architecture leverages LeakyReLU and ReLU activations to capture nonlinearities in the data and employs feature-engineering techniques to eliminate highly correlated features, resulting in improved computational efficiency. Experimental results on a large-scale dataset demonstrate the effectiveness of our model, achieving an impressive overall F1 -score of 0.9936. The proposed approach offers a scalable and accurate solution for building classification, enabling informed decision-making in urban planning and resource allocation. This research contributes to the field of urban analysis by providing a valuable tool for understanding the built environment and optimizing resource utilization.
Authors: Yawen Xiang, Heng Zhou, Chengyang Li, Zhongbo Li, Yongqiang Xie
Abstract: Image deblurring is an essential image preprocessing technique, aiming to recover clear and detailed images form blurry ones. However, existing algorithms often fail to effectively integrate multi-scale feature extraction with frequency enhancement, limiting their ability to reconstruct fine textures. Additionally, non-uniform blur in images also restricts the effectiveness of image restoration. To address these issues, we propose a multi-scale frequency enhancement network (MFENet) for blind image deblurring. To capture the multi-scale spatial and channel information of blurred images, we introduce a multi-scale feature extraction module (MS-FE) based on depthwise separable convolutions, which provides rich target features for deblurring. We propose a frequency enhanced blur perception module (FEBP) that employs wavelet transforms to extract high-frequency details and utilizes multi-strip pooling to perceive non-uniform blur, combining multi-scale information with frequency enhancement to improve the restoration of image texture details. Experimental results on the GoPro and HIDE datasets demonstrate that the proposed method achieves superior deblurring performance in both visual quality and objective evaluation metrics. Furthermore, in downstream object detection tasks, the proposed blind image deblurring algorithm significantly improves detection accuracy, further validating its effectiveness androbustness in the field of image deblurring.
Authors: Hemal Naik, Junran Yang, Dipin Das, Margaret C Crofoot, Akanksha Rathore, Vivek Hari Sridhar
Abstract: Understanding animal behaviour is central to predicting, understanding, and mitigating impacts of natural and anthropogenic changes on animal populations and ecosystems. However, the challenges of acquiring and processing long-term, ecologically relevant data in wild settings have constrained the scope of behavioural research. The increasing availability of Unmanned Aerial Vehicles (UAVs), coupled with advances in machine learning, has opened new opportunities for wildlife monitoring using aerial tracking. However, limited availability of datasets with wild animals in natural habitats has hindered progress in automated computer vision solutions for long-term animal tracking. Here we introduce BuckTales, the first large-scale UAV dataset designed to solve multi-object tracking (MOT) and re-identification (Re-ID) problem in wild animals, specifically the mating behaviour (or lekking) of blackbuck antelopes. Collected in collaboration with biologists, the MOT dataset includes over 1.2 million annotations including 680 tracks across 12 high-resolution (5.4K) videos, each averaging 66 seconds and featuring 30 to 130 individuals. The Re-ID dataset includes 730 individuals captured with two UAVs simultaneously. The dataset is designed to drive scalable, long-term animal behaviour tracking using multiple camera sensors. By providing baseline performance with two detectors, and benchmarking several state-of-the-art tracking methods, our dataset reflects the real-world challenges of tracking wild animals in socially and ecologically relevant contexts. In making these data widely available, we hope to catalyze progress in MOT and Re-ID for wild animals, fostering insights into animal behaviour, conservation efforts, and ecosystem dynamics through automated, long-term monitoring.
Authors: Hao Liang, Zirong Chen, Wentao Zhang
Abstract: Video question-answering (QA) is a core task in video understanding. Evaluating the quality of video QA and video caption data quality for training video large language models (VideoLLMs) is an essential challenge. Although various methods have been proposed for assessing video caption quality, there remains a lack of dedicated evaluation methods for Video QA. To address this gap, we introduce EVQAScore, a reference-free method that leverages keyword extraction to assess both video caption and video QA data quality. Additionally, we incorporate frame sampling and rescaling techniques to enhance the efficiency and robustness of our evaluation, this enables our score to evaluate the quality of extremely long videos. Our approach achieves state-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for Spearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on the VATEX-EVAL benchmark for video caption evaluation. Furthermore, by using EVQAScore for data selection, we achieved SOTA results with only 12.5\% of the original data volume, outperforming the previous SOTA method PAC-S and 100\% of data.
Authors: Bruno Viti, Franz Thaler, Kathrin Lisa Kapper, Martin Urschler, Martin Holler, Elias Karabelas
Abstract: Segmentation of cardiac magnetic resonance images (MRI) is crucial for the analysis and assessment of cardiac function, helping to diagnose and treat various cardiovascular diseases. Most recent techniques rely on deep learning and usually require an extensive amount of labeled data. To overcome this problem, few-shot learning has the capability of reducing data dependency on labeled data. In this work, we introduce a new method that merges few-shot learning with a U-Net architecture and Gaussian Process Emulators (GPEs), enhancing data integration from a support set for improved performance. GPEs are trained to learn the relation between the support images and the corresponding masks in latent space, facilitating the segmentation of unseen query images given only a small labeled support set at inference. We test our model with the M&Ms-2 public dataset to assess its ability to segment the heart in cardiac magnetic resonance imaging from different orientations, and compare it with state-of-the-art unsupervised and few-shot methods. Our architecture shows higher DICE coefficients compared to these methods, especially in the more challenging setups where the size of the support set is considerably small.
Authors: Jiachen Liang, Ruibing Hou, Minyang Hu, Hong Chang, Shiguang Shan, Xilin Chen
Abstract: Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities. But they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which could be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models. Under this unsupervised multi-domain setting, we have identified inherent model bias within CLIP, notably in its visual and text encoders. Specifically, we observe that CLIP's visual encoder tends to prioritize encoding domain over discriminative category information, meanwhile its text encoder exhibits a preference for domain-relevant classes. To mitigate this model bias, we propose a training-free and label-free feature calibration method, Unsupervised Multi-domain Feature Calibration (UMFC). UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition. These biases are subsequently subtracted from original image and text features separately, to render them domain-invariant. We evaluate our method on multiple settings including transductive learning and test-time adaptation. Extensive experiments show that our method outperforms CLIP and performs on par with the state-of-the-arts that need additional annotations or optimization. Our code is available at https://github.com/GIT-LJc/UMFC.
Authors: Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang
Abstract: Recently, token-based generation have demonstrated their effectiveness in image synthesis. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in a few steps. NATs perform generation in a progressive manner, where the latent tokens of a resulting image are incrementally revealed. At each step, the unrevealed image regions are padded with mask tokens and inferred by NAT. In this paper, we delve into the mechanisms behind the effectiveness of NATs and uncover two important patterns that naturally emerge from NATs: Spatially (within a step), although mask and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric. In specific, mask tokens mainly gather information for decoding, while visible tokens tend to primarily provide information, and their deep representations can be built only upon themselves. Temporally (across steps), the interactions between adjacent generation steps mostly concentrate on updating the representations of a few critical tokens, while the computation for the majority of tokens is generally repetitive. Driven by these findings, we propose EfficientNAT (ENAT), a NAT model that explicitly encourages these critical interactions inherent in NATs. At the spatial level, we disentangle the computations of visible and mask tokens by encoding visible tokens independently, while decoding mask tokens conditioned on the fully encoded visible tokens. At the temporal level, we prioritize the computation of the critical tokens at each step, while maximally reusing previously computed token representations to supplement necessary information. ENAT improves the performance of NATs notably with significantly reduced computational cost. Experiments on ImageNet-256, ImageNet-512 and MS-COCO validate the effectiveness of ENAT. Code is available at https://github.com/LeapLabTHU/ENAT.
Authors: Beier Zhu, Jiequan Cui, Hanwang Zhang
Abstract: When fine-tuning zero-shot models like CLIP, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD). Recently, ensemble-based models (ESM) have been shown to offer significant robustness improvement, while preserving high ID accuracy. However, our study finds that ESMs do not solve the ID-OOD trade-offs: they achieve peak performance for ID and OOD accuracy at different mixing coefficients. When optimized for OOD accuracy, the ensemble model exhibits a noticeable decline in ID accuracy, and vice versa. In contrast, we propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs. Specifically, we construct a Zero-Shot Failure (ZSF) set containing training samples incorrectly predicted by the zero-shot model. For each test sample, we calculate its distance to the ZSF set and assign a higher weight to the fine-tuned model in the ensemble if the distance is small. We term our method Variance Reduction Fine-tuning (VRF), as it effectively reduces the variance in ensemble predictions, thereby decreasing residual error. On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5 - 2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similar large robustness gains (0.9 - 3.1 pp) on other distribution shifts benchmarks. Codes are available in https://github.com/BeierZhu/VRF.
Authors: Xue Xia, Daiwei Zhang, Wenxuan Song, Wei Huang, Lorenz Hurni
Abstract: Automated feature detection in historical maps can significantly accelerate the reconstruction of the geospatial past. However, this process is often constrained by the time-consuming task of manually digitizing sufficient high-quality training data. The emergence of visual foundation models, such as the Segment Anything Model (SAM), offers a promising solution due to their remarkable generalization capabilities and rapid adaptation to new data distributions. Despite this, directly applying SAM in a zero-shot manner to historical map segmentation poses significant challenges, including poor recognition of certain geospatial features and a reliance on input prompts, which limits its ability to be fully automated. To address these challenges, we introduce MapSAM, a parameter-efficient fine-tuning strategy that adapts SAM into a prompt-free and versatile solution for various downstream historical map segmentation tasks. Specifically, we employ Weight-Decomposed Low-Rank Adaptation (DoRA) to integrate domain-specific knowledge into the image encoder. Additionally, we develop an automatic prompt generation process, eliminating the need for manual input. We further enhance the positional prompt in SAM, transforming it into a higher-level positional-semantic prompt, and modify the cross-attention mechanism in the mask decoder with masked attention for more effective feature aggregation. The proposed MapSAM framework demonstrates promising performance across two distinct historical map segmentation tasks: one focused on linear features and the other on areal features. Experimental results show that it adapts well to various features, even when fine-tuned with extremely limited data (e.g. 10 shots).
Authors: He Huang, Wenjie Huang, Qi Yang, Yiling Xu, Zhu li
Abstract: 3D Gaussian Splatting (GS) demonstrates excellent rendering quality and generation speed in novel view synthesis. However, substantial data size poses challenges for storage and transmission, making 3D GS compression an essential technology. Current 3D GS compression research primarily focuses on developing more compact scene representations, such as converting explicit 3D GS data into implicit forms. In contrast, compression of the GS data itself has hardly been explored. To address this gap, we propose a Hierarchical GS Compression (HGSC) technique. Initially, we prune unimportant Gaussians based on importance scores derived from both global and local significance, effectively reducing redundancy while maintaining visual quality. An Octree structure is used to compress 3D positions. Based on the 3D GS Octree, we implement a hierarchical attribute compression strategy by employing a KD-tree to partition the 3D GS into multiple blocks. We apply farthest point sampling to select anchor primitives within each block and others as non-anchor primitives with varying Levels of Details (LoDs). Anchor primitives serve as reference points for predicting non-anchor primitives across different LoDs to reduce spatial redundancy. For anchor primitives, we use the region adaptive hierarchical transform to achieve near-lossless compression of various attributes. For non-anchor primitives, each is predicted based on the k-nearest anchor primitives. To further minimize prediction errors, the reconstructed LoD and anchor primitives are combined to form new anchor primitives to predict the next LoD. Our method notably achieves superior compression quality and a significant data size reduction of over 4.5 times compared to the state-of-the-art compression method on small scenes datasets.
Authors: Jiale Chen, Fei Xia, Jianliang Mao, Haoping Wang, Chuanlin Zhang
Abstract: The ambiguity at the boundaries of different semantic classes in point cloud semantic segmentation often leads to incorrect decisions in intelligent perception systems, such as autonomous driving. Hence, accurate delineation of the boundaries is crucial for improving safety in autonomous driving. A novel spatial inter-correlation enhancement and spatially-embedded feature fusion network (SIESEF-FusionNet) is proposed in this paper, enhancing spatial inter-correlation by combining inverse distance weighting and angular compensation to extract more beneficial spatial information without causing redundancy. Meanwhile, a new spatial adaptive pooling module is also designed, embedding enhanced spatial information into semantic features for strengthening the context-awareness of semantic features. Experimental results demonstrate that 83.7% mIoU and 97.8% OA are achieved by SIESEF-FusionNet on the Toronto3D dataset, with performance superior to other baseline methods. A value of 61.1% mIoU is reached on the semanticKITTI dataset, where a marked improvement in segmentation performance is observed. In addition, the effectiveness and plug-and-play capability of the proposed modules are further verified through ablation studies.
Authors: Dominik Eckert, Ludwig Ritschl, Christopher Syben, Christian H\"ummer, Julia Wicklein, Marcel Beister, Steffen Kappler, Sebastian Stober
Abstract: Radiologists have preferred visual impressions or 'styles' of X-ray images that are manually adjusted to their needs to support their diagnostic performance. In this work, we propose an automatic and interpretable X-ray style transfer by introducing a trainable version of the Local Laplacian Filter (LLF). From the shape of the LLF's optimized remap function, the characteristics of the style transfer can be inferred and reliability of the algorithm can be ensured. Moreover, we enable the LLF to capture complex X-ray style features by replacing the remap function with a Multi-Layer Perceptron (MLP) and adding a trainable normalization layer. We demonstrate the effectiveness of the proposed method by transforming unprocessed mammographic X-ray images into images that match the style of target mammograms and achieve a Structural Similarity Index (SSIM) of 0.94 compared to 0.82 of the baseline LLF style transfer method from Aubry et al.
Authors: Chengyu Yang, Chengjun Liu
Abstract: Approximately 16 million Americans suffer from rosacea according to the National Rosacea Society. To increase rosacea awareness, automatic rosacea detection methods using deep learning and explainable statistical approaches are presented in this paper. The deep learning method applies the ResNet-18 for rosacea detection, and the statistical approaches utilize the means of the two classes, namely, the rosacea class vs. the normal class, and the principal component analysis to extract features from the facial images for automatic rosacea detection. The contributions of the proposed methods are three-fold. First, the proposed methods are able to automatically distinguish patients who are suffering from rosacea from people who are clean of this disease. Second, the statistical approaches address the explainability issue that allows doctors and patients to understand and trust the results. And finally, the proposed methods will not only help increase rosacea awareness in the general population but also help remind the patients who suffer from this disease of possible early treatment since rosacea is more treatable at its early stages. The code and data are available at https://github.com/cyang322/rosacea_detection.git.
Authors: Yichen He, Yuan Lin, Jianchao Wu, Hanchong Zhang, Yuchen Zhang, Ruicheng Le
Abstract: Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle with generating coherent descriptions for extended video spanning minutes or more. Long video description introduces new challenges, such as plot-level consistency across descriptions. To address these, we figure out audio-visual character identification, matching character names to each dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos, incorporating both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into a LVLM to enhance consistency of video description. We validate our approach on movie description tasks and introduce MovieStory101, a dataset with dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create MovieQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess descriptions by inputting them into GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open and closed-source baselines on MovieQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating audio-visual character identification from StoryTeller improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative improvement of 5.5% and 13.0%, respectively, in accuracy on MovieQA.
Authors: Hana Bezalel, Dotan Ankri, Ruojin Cai, Hadar Averbuch-Elor
Abstract: We present a technique and benchmark dataset for estimating the relative 3D orientation between a pair of Internet images captured in an extreme setting, where the images have limited or non-overlapping field of views. Prior work targeting extreme rotation estimation assume constrained 3D environments and emulate perspective images by cropping regions from panoramic views. However, real images captured in the wild are highly diverse, exhibiting variation in both appearance and camera intrinsics. In this work, we propose a Transformer-based method for estimating relative rotations in extreme real-world settings, and contribute the ExtremeLandmarkPairs dataset, assembled from scene-level Internet photo collections. Our evaluation demonstrates that our approach succeeds in estimating the relative rotations in a wide variety of extremeview Internet image pairs, outperforming various baselines, including dedicated rotation estimation techniques and contemporary 3D reconstruction methods.
Authors: Jannik Franzen, Claudia Winklmayr, Vanessa E. Guarino, Christoph Karg, Xiaoyan Yu, Nora Koreuber, Jan P. Albrecht, Philip Bischoff, Dagmar Kainmueller
Abstract: Uncertainty Quantification (UQ) is crucial for reliable image segmentation. Yet, while the field sees continual development of novel methods, a lack of agreed-upon benchmarks limits their systematic comparison and evaluation: Current UQ methods are typically tested either on overly simplistic toy datasets or on complex real-world datasets that do not allow to discern true uncertainty. To unify both controllability and complexity, we introduce Arctique, a procedurally generated dataset modeled after histopathological colon images. We chose histopathological images for two reasons: 1) their complexity in terms of intricate object structures and highly variable appearance, which yields challenging segmentation problems, and 2) their broad prevalence for medical diagnosis and respective relevance of high-quality UQ. To generate Arctique, we established a Blender-based framework for 3D scene creation with intrinsic noise manipulation. Arctique contains 50,000 rendered images with precise masks as well as noisy label simulations. We show that by independently controlling the uncertainty in both images and labels, we can effectively study the performance of several commonly used UQ methods. Hence, Arctique serves as a critical resource for benchmarking and advancing UQ techniques and other methodologies in complex, multi-object environments, bridging the gap between realism and controllability. All code is publicly available, allowing re-creation and controlled manipulations of our shipped images as well as creation and rendering of new scenes.
Authors: Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
Abstract: Transformer models have demonstrated remarkable success in many domains such as natural language processing (NLP) and computer vision. With the growing interest in transformer-based architectures, they are now utilized for gesture recognition. So, we also explore and devise a novel ConvMixFormer architecture for dynamic hand gestures. The transformers use quadratic scaling of the attention features with the sequential data, due to which these models are computationally complex and heavy. We have considered this drawback of the transformer and designed a resource-efficient model that replaces the self-attention in the transformer with the simple convolutional layer-based token mixer. The computational cost and the parameters used for the convolution-based mixer are comparatively less than the quadratic self-attention. Convolution-mixer helps the model capture the local spatial features that self-attention struggles to capture due to their sequential processing nature. Further, an efficient gate mechanism is employed instead of a conventional feed-forward network in the transformer to help the model control the flow of features within different stages of the proposed model. This design uses fewer learnable parameters which is nearly half the vanilla transformer that helps in fast and efficient training. The proposed method is evaluated on NVidia Dynamic Hand Gesture and Briareo datasets and our model has achieved state-of-the-art results on single and multimodal inputs. We have also shown the parameter efficiency of the proposed ConvMixFormer model compared to other methods. The source code is available at https://github.com/mallikagarg/ConvMixFormer.
Authors: Yanchen Wang, Adam Turnbull, Tiange Xiang, Yunlong Xu, Sa Zhou, Adnan Masoud, Shekoofeh Azizi, Feng Vankee Lin, Ehsan Adeli
Abstract: Neural decoding, the process of understanding how brain activity corresponds to different stimuli, has been a primary objective in cognitive sciences. Over the past three decades, advancements in functional Magnetic Resonance Imaging and machine learning have greatly improved our ability to map visual stimuli to brain activity, especially in the visual cortex. Concurrently, research has expanded into decoding more complex processes like language and memory across the whole brain, utilizing techniques to handle greater variability and improve signal accuracy. We argue that "seeing" involves more than just mapping visual stimuli onto the visual cortex; it engages the entire brain, as various emotions and cognitive states can emerge from observing different scenes. In this paper, we develop algorithms to enhance our understanding of visual processes by incorporating whole-brain activation maps while individuals are exposed to visual stimuli. We utilize large-scale fMRI encoders and Image generative models pre-trained on large public datasets, which are then fine-tuned through Image-fMRI contrastive learning. Our models hence can decode visual experience across the entire cerebral cortex, surpassing the traditional confines of the visual cortex. We first compare our method with state-of-the-art approaches to decoding visual processing and show improved predictive semantic accuracy by 43%. A network ablation analysis suggests that beyond the visual cortex, the default mode network contributes most to decoding stimuli, in line with the proposed role of this network in sense-making and semantic processing. Additionally, we implemented zero-shot imagination decoding on an extra validation dataset, achieving a p-value of 0.0206 for mapping the reconstructed images and ground-truth text stimuli, which substantiates the model's capability to capture semantic meanings across various scenarios.
Authors: NVIDIA, :, Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang
Abstract: We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.
Authors: Taihang Hu, Linxuan Li, Joost van de Weijer, Hongcheng Gao, Fahad Shahbaz Khan, Jian Yang, Ming-Ming Cheng, Kai Wang, Yaxing Wang
Abstract: Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts; a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects with complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token to improve the generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on the T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at \url{https://github.com/hutaihang/ToMe}.
Authors: NVIDIA, :, Maciej Bala, Yin Cui, Yifan Ding, Yunhao Ge, Zekun Hao, Jon Hasselgren, Jacob Huffman, Jingyi Jin, J. P. Lewis, Zhaoshuo Li, Chen-Hsuan Lin, Yen-Chen Lin, Tsung-Yi Lin, Ming-Yu Liu, Alice Luo, Qianli Ma, Jacob Munkberg, Stella Shi, Fangyin Wei, Donglai Xiang, Jiashu Xu, Xiaohui Zeng, Qinsheng Zhang
Abstract: We introduce Edify 3D, an advanced solution designed for high-quality 3D asset generation. Our method first synthesizes RGB and surface normal images of the described object at multiple viewpoints using a diffusion model. The multi-view observations are then used to reconstruct the shape, texture, and PBR materials of the object. Our method can generate high-quality 3D assets with detailed geometry, clean shape topologies, high-resolution textures, and materials within 2 minutes of runtime.
Authors: Martin Mayr, Julian Krenz, Katharina Neumeier, Anna Bub, Simon B\"urcky, Nina Brolich, Klaus Herbers, Mechthild Habermann, Peter Fleischmann, Andreas Maier, Vincent Christlein
Abstract: Most datasets in the field of document analysis utilize highly standardized labels, which, while simplifying specific tasks, often produce outputs that are not directly applicable to humanities research. In contrast, the Nuremberg Letterbooks dataset, which comprises historical documents from the early 15th century, addresses this gap by providing multiple types of transcriptions and accompanying metadata. This approach allows for developing methods that are more closely aligned with the needs of the humanities. The dataset includes 4 books containing 1711 labeled pages written by 10 scribes. Three types of transcriptions are provided for handwritten text recognition: Basic, diplomatic, and regularized. For the latter two, versions with and without expanded abbreviations are also available. A combination of letter ID and writer ID supports writer identification due to changing writers within pages. In the technical validation, we established baselines for various tasks, demonstrating data consistency and providing benchmarks for future research to build upon.
Authors: Ziqiang Dang, Jianfang Li, Lin Liu
Abstract: Facial landmark detection is a fundamental problem in computer vision for many downstream applications. This paper introduces a new facial landmark detector based on vision transformers, which consists of two unique designs: Dual Vision Transformer (D-ViT) and Long Skip Connections (LSC). Based on the observation that the channel dimension of feature maps essentially represents the linear bases of the heatmap space, we propose learning the interconnections between these linear bases to model the inherent geometric relations among landmarks via Channel-split ViT. We integrate such channel-split ViT into the standard vision transformer (i.e., spatial-split ViT), forming our Dual Vision Transformer to constitute the prediction blocks. We also suggest using long skip connections to deliver low-level image features to all prediction blocks, thereby preventing useful information from being discarded by intermediate supervision. Extensive experiments are conducted to evaluate the performance of our proposal on the widely used benchmarks, i.e., WFLW, COFW, and 300W, demonstrating that our model outperforms the previous SOTAs across all three benchmarks.
Authors: Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y. Lam, Yan-Pei Cao, Xihui Liu
Abstract: 3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillation, achieving zero-shot 3D part segmentation. However, these methods are limited by their reliance on text prompts, which restricts the scalability to large-scale unlabeled datasets and the flexibility in handling part ambiguities. In this work, we introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments any 3D object into semantic parts at multiple granularities, without requiring predefined part label sets as text prompts. For scalability, we use text-agnostic vision foundation models to distill a 3D feature extraction backbone, allowing scaling to large unlabeled 3D datasets to learn rich 3D priors. For flexibility, we distill scale-conditioned part-aware 3D features for 3D part segmentation at multiple granularities. Once the segmented parts are obtained from the scale-conditioned part-aware 3D features, we use VLMs to assign semantic labels to each part based on the multi-view renderings. Compared to previous methods, our SAMPart3D can scale to the recent large-scale 3D object dataset Objaverse and handle complex, non-ordinary objects. Additionally, we contribute a new 3D part segmentation benchmark to address the lack of diversity and complexity of objects and parts in existing benchmarks. Experiments show that our SAMPart3D significantly outperforms existing zero-shot 3D part segmentation methods, and can facilitate various applications such as part-level editing and interactive segmentation.
Authors: Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen
Abstract: Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained with datasets with a high volume of noise and artifacts. This is due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting the versatility to handle real-world use cases. In this paper, we present \omniedit, which is an omnipotent editor to handle seven different image editing tasks with any aspect ratio seamlessly. Our contribution is in four folds: (1) \omniedit is trained by utilizing the supervision from seven different specialist models to ensure task coverage. (2) we utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve the data quality. (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate, (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic evaluation and human evaluations demonstrate that \omniedit can significantly outperform all the existing models. Our code, dataset and model will be available at \url{https://tiger-ai-lab.github.io/OmniEdit/}
Authors: Nyle Siddiqui, Florinel Alin Croitoru, Gaurav Kumar Nayak, Radu Tudor Ionescu, Mubarak Shah
Abstract: With the recent exhibited strength of generative diffusion models, an open research question is \textit{if images generated by these models can be used to learn better visual representations}. While this generative data expansion may suffice for easier visual tasks, we explore its efficacy on a more difficult discriminative task: clothes-changing person re-identification (CC-ReID). CC-ReID aims to match people appearing in non-overlapping cameras, even when they change their clothes across cameras. Not only are current CC-ReID models constrained by the limited diversity of clothing in current CC-ReID datasets, but generating additional data that retains important personal features for accurate identification is a current challenge. To address this issue we propose DLCR, a novel data expansion framework that leverages pre-trained diffusion and large language models (LLMs) to accurately generate diverse images of individuals in varied attire. We generate additional data for five benchmark CC-ReID datasets (PRCC, CCVID, LaST, VC-Clothes, and LTCC) and \textbf{increase their clothing diversity by \boldmath{$10$}x, totaling over \boldmath{$2.1$}M images generated}. DLCR employs diffusion-based text-guided inpainting, conditioned on clothing prompts constructed using LLMs, to generate synthetic data that only modifies a subject's clothes while preserving their personally identifiable features. With this massive increase in data, we introduce two novel strategies - progressive learning and test-time prediction refinement - that respectively reduce training time and further boosts CC-ReID performance. On the PRCC dataset, we obtain a large top-1 accuracy improvement of $11.3\%$ by training CAL, a previous state of the art (SOTA) method, with DLCR-generated data. We publicly release our code and generated data for each dataset here: \url{https://github.com/CroitoruAlin/dlcr}.
Authors: Harsh Rangwani
Abstract: The datasets used for Deep Neural Network training (e.g., ImageNet, MSCOCO, etc.) are often manually balanced across categories (classes) to facilitate learning of all the categories. This curation process is often expensive and requires throwing away precious annotated data to balance the frequency across classes. This is because the distribution of data in the world (e.g., internet, etc.) significantly differs from the well-curated datasets and is often over-populated with samples from common categories. The algorithms designed for well-curated datasets perform suboptimally when used to learn from imperfect datasets with long-tailed imbalances and distribution shifts. For deep models to be widely used, getting away with the costly curation process by developing robust algorithms that can learn from real-world data distribution is necessary. Toward this goal, we develop practical algorithms for Deep Neural Networks that can learn from limited and imperfect data present in the real world. These works are divided into four segments, each covering a scenario of learning from limited or imperfect data. The first part of the works focuses on Learning Generative Models for Long-Tail Data, where we mitigate the mode-collapse for tail (minority) classes and enable diverse aesthetic image generations as head (majority) classes. In the second part, we enable effective generalization on tail classes through Inductive Regularization schemes, which allow tail classes to generalize as the head classes without enforcing explicit generation of images. In the third part, we develop algorithms for Optimizing Relevant Metrics compared to the average accuracy for learning from long-tailed data with limited annotation (semi-supervised), followed by the fourth part, which focuses on the effective domain adaptation of the model to various domains with zero to very few labeled samples.
Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, Matthijs Douze
Abstract: Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions - no larger than 10% of the image surface - even for small $256\times 256$ images.
Authors: Yoad Tewel, Rinon Gal, Dvir Samuel Yuval Atzmon, Lior Wolf, Gal Chechik
Abstract: Adding Object into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.
Authors: Pranav Guruprasad, Harshvardhan Sikka, Jaewoo Song, Yangyue Wang, Paul Pu Liang
Abstract: Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems, demonstrating the ability to combine visual understanding, language comprehension, and action generation. However, systematic evaluation of these models across diverse robotic tasks remains limited. In this work, we present a comprehensive evaluation framework and benchmark suite for assessing VLA models. We profile three state-of-the-art VLM and VLAs - GPT-4o, OpenVLA, and JAT - across 20 diverse datasets from the Open-X-Embodiment collection, evaluating their performance on various manipulation tasks. Our analysis reveals several key insights: 1. current VLA models show significant variation in performance across different tasks and robot platforms, with GPT-4o demonstrating the most consistent performance through sophisticated prompt engineering, 2. all models struggle with complex manipulation tasks requiring multi-step planning, and 3. model performance is notably sensitive to action space characteristics and environmental factors. We release our evaluation framework and findings to facilitate systematic assessment of future VLA models and identify critical areas for improvement in the development of general purpose robotic systems.
Authors: Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Frans Coenen, Kaizhu Huang
Abstract: Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges due to distribution shifts, where models trained on specific datasets underperform across others from varying hospitals, regions, or patient populations. To navigate this issue, researchers have been actively developing strategies to increase the adaptability and robustness of DL models, enabling their effective use in unfamiliar and diverse environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Unlike traditional categorizations based on technical specifications, our approach is grounded in the real-world operational constraints faced by healthcare institutions. Specifically, we categorize the existing body of work into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, with each method tailored to distinct scenarios caused by Data Accessibility, Privacy Concerns, and Collaborative Protocols. This perspective equips researchers with a nuanced understanding of how DL can be strategically deployed to address distribution shifts in MedIA, ensuring diverse and robust medical applications. By delving deeper into these topics, we highlight potential pathways for future research that not only address existing limitations but also push the boundaries of deployable MedIA technologies.
Authors: Zhuoshuo Li (Department of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China), Jiong Zhang (Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, China), Youbing Zeng (Department of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China), Jiaying Lin (Department of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China), Dan Zhang (School of Cyber Science and Engineering, Ningbo University of Technology, Ningbo, China), Jianjia Zhang (Department of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China), Duan Xu (Department of Radiology, School of Medicine, University of California San Francisco, San Francisco, CA, USA), Hosung Kim (USC Mark and Mary Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, USA), Bingguang Liu (Department of Radiology, Shenzhen Maternity and Child Healthcare Hospital, Shenzhen, China), Mengting Liu (Department of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China)
Abstract: Current brain surface-based prediction models often overlook the variability of regional attributes at the cortical feature level. While graph neural networks (GNNs) excel at capturing regional differences, they encounter challenges when dealing with complex, high-density graph structures. In this work, we consider the cortical surface mesh as a sparse graph and propose an interpretable prediction model-Surface Graph Neural Network (SurfGNN). SurfGNN employs topology-sampling learning (TSL) and region-specific learning (RSL) structures to manage individual cortical features at both lower and higher scales of the surface mesh, effectively tackling the challenges posed by the overly abundant mesh nodes and addressing the issue of heterogeneity in cortical regions. Building on this, a novel score-weighted fusion (SWF) method is implemented to merge nodal representations associated with each cortical feature for prediction. We apply our model to a neonatal brain age prediction task using a dataset of harmonized MR images from 481 subjects (503 scans). SurfGNN outperforms all existing state-of-the-art methods, demonstrating an improvement of at least 9.0% and achieving a mean absolute error (MAE) of 0.827+0.056 in postmenstrual weeks. Furthermore, it generates feature-level activation maps, indicating its capability to identify robust regional variations in different morphometric contributions for prediction.
Authors: Savitha Sam Abraham, Sourav Garg, Feras Dayoub
Abstract: Recent research in Vision Language Navigation (VLN) has overlooked the development of agents' inquisitive abilities, which allow them to ask clarifying questions when instructions are incomplete. This paper addresses how agents can recognize "when" they lack sufficient information, without focusing on "what" is missing, particularly in VLN tasks with vague instructions. Equipping agents with this ability enhances efficiency by reducing potential digressions and seeking timely assistance. The challenge in identifying such uncertain points is balancing between being overly cautious (high recall) and overly confident (high precision). We propose an attention-based instruction-vagueness estimation module that learns associations between instructions and the agent's trajectory. By leveraging instruction-to-path alignment information during training, the module's vagueness estimation performance improves by around 52% in terms of precision-recall balance. In our ablative experiments, we also demonstrate the effectiveness of incorporating this additional instruction-to-path attention network alongside the cross-modal attention networks within the navigator module. Our results show that the attention scores from the instruction-to-path attention network serve as better indicators for estimating vagueness.
Authors: Young Jo Choi, Min Kyoon Yoo, Yu Rang Park
Abstract: Deep learning models generally display catastrophic forgetting when learning new data continuously. Many incremental learning approaches address this problem by reusing data from previous tasks while learning new tasks. However, the direct access to past data generates privacy and security concerns. To address these issues, we present a novel method that preserves previous knowledge without storing previous data. This method is inspired by the architecture of a vision transformer and employs a unique token capable of encapsulating the compressed knowledge of each task. This approach generates task-specific embeddings by directing attention differently based on the task associated with the data, thereby effectively mimicking the impact of having multiple models through tokens. Our method incorporates a distillation process that ensures efficient interactions even after multiple additional learning steps, thereby optimizing the model against forgetting. We measured the performance of our model in terms of accuracy and backward transfer using a benchmark dataset for different task-incremental learning scenarios. Our results demonstrate the superiority of our approach, which achieved the highest accuracy and lowest backward transfer among the compared methods. In addition to presenting a new model, our approach lays the foundation for various extensions within the spectrum of vision-transformer architectures.
Authors: Claire Wonjeong Jo, Miki Weso{\l}owska, Magdalena Wojcieszak
Abstract: Short video platforms, such as YouTube, Instagram, or TikTok, are used by billions of users globally. These platforms expose users to harmful content, ranging from clickbait or physical harms to misinformation or online hate. Yet, detecting harmful videos remains challenging due to an inconsistent understanding of what constitutes harm and limited resources and mental tolls involved in human annotation. As such, this study advances measures and methods to detect harm in video content. First, we develop a comprehensive taxonomy for online harm on video platforms, categorizing it into six categories: Information, Hate and harassment, Addictive, Clickbait, Sexual, and Physical harms. Next, we establish multimodal large language models as reliable annotators of harmful videos. We analyze 19,422 YouTube videos using 14 image frames, 1 thumbnail, and text metadata, comparing the accuracy of crowdworkers (Mturk) and GPT-4-Turbo with domain expert annotations serving as the gold standard. Our results demonstrate that GPT-4-Turbo outperforms crowdworkers in both binary classification (harmful vs. harmless) and multi-label harm categorization tasks. Methodologically, this study extends the application of LLMs to multi-label and multi-modal contexts beyond text annotation and binary classification. Practically, our study contributes to online harm mitigation by guiding the definitions and identification of harmful content on video platforms.
Authors: Neal Lawton, Aram Galstyan, Greg Ver Steeg
Abstract: A popular method for Neural Architecture Search (NAS) is based on growing networks via small local changes to the network's architecture called network morphisms. These methods start with a small seed network and progressively grow the network by adding new neurons in an automated way. However, it remains a challenge to efficiently determine which parts of the network are best to grow. Here we propose a NAS method for growing a network by using a Gauss-Newton approximation of the loss function to efficiently learn and evaluate candidate network morphisms. We compare our method with state of the art NAS methods for CIFAR-10 and CIFAR-100 classification tasks, and conclude our method learns similar quality or better architectures at a smaller computational cost.
Authors: Md Junayed Hasan, Somasundar Kannan, Ali Rohan, Mohd Asif Shah
Abstract: This study explores the potential of the Ping 360 sonar device, primarily used for navigation, in detecting complex underwater obstacles. The key motivation behind this research is the device's affordability and open-source nature, offering a cost-effective alternative to more expensive imaging sonar systems. The investigation focuses on understanding the behaviour of the Ping 360 in controlled environments and assessing its suitability for object detection, particularly in scenarios where human operators are unavailable for inspecting offshore structures in shallow waters. Through a series of carefully designed experiments, we examined the effects of surface reflections and object shadows in shallow underwater environments. Additionally, we developed a manually annotated sonar image dataset to train a U-Net segmentation model. Our findings indicate that while the Ping 360 sonar demonstrates potential in simpler settings, its performance is limited in more cluttered or reflective environments unless extensive data pre-processing and annotation are applied. To our knowledge, this is the first study to evaluate the Ping 360's capabilities for complex object detection. By investigating the feasibility of low-cost sonar devices, this research provides valuable insights into their limitations and potential for future AI-based interpretation, marking a unique contribution to the field.
Authors: Yequan Zhao, Hai Li, Ian Young, Zheng Zhang
Abstract: Back propagation (BP) is the default solution for gradient computation in neural network training. However, implementing BP-based training on various edge devices such as FPGA, microcontrollers (MCUs), and analog computing platforms face multiple major challenges, such as the lack of hardware resources, long time-to-market, and dramatic errors in a low-precision setting. This paper presents a simple BP-free training scheme on an MCU, which makes edge training hardware design as easy as inference hardware design. We adopt a quantized zeroth-order method to estimate the gradients of quantized model parameters, which can overcome the error of a straight-through estimator in a low-precision BP scheme. We further employ a few dimension reduction methods (e.g., node perturbation, sparse training) to improve the convergence of zeroth-order training. Experiment results show that our BP-free training achieves comparable performance as BP-based training on adapting a pre-trained image classifier to various corrupted data on resource-constrained edge devices (e.g., an MCU with 1024-KB SRAM for dense full-model training, or an MCU with 256-KB SRAM for sparse training). This method is most suitable for application scenarios where memory cost and time-to-market are the major concerns, but longer latency can be tolerated.
Authors: Mahtab Faraji, Homa Rashidisabet, George R. Nahass, RV Paul Chan, Thasarat S Vajaranant, Darvin Yi
Abstract: Here, we examine the latest advances in glaucoma detection through Deep Learning (DL) algorithms using Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). This study focuses on three aspects of DL-based glaucoma detection frameworks: input data modalities, processing strategies, and model architectures and applications. Moreover, we analyze trends in employing each aspect since the onset of DL in this field. Finally, we address current challenges and suggest future research directions.
Authors: Asma Tanabene (NEUROSPIN, MIND), Chaithya Giliyar Radhakrishna (NEUROSPIN, MIND), Aur\'elien Massire (NEUROSPIN, MIND), Mariappan S. Nadar (NEUROSPIN, MIND), Philippe Ciuciu (NEUROSPIN, MIND)
Abstract: Deep learning has shown great promise for MRI reconstruction from undersampled data, yet there is a lack of research on validating its performance in 3D parallel imaging acquisitions with non-Cartesian undersampling. In addition, the artifacts and the resulting image quality depend on the under-sampling pattern. To address this uncharted territory, we extend the Non-Cartesian Primal-Dual Network (NC-PDNet), a state-of-the-art unrolled neural network, to a 3D multi-coil setting. We evaluated the impact of channel-specific versus channel-agnostic training configurations and examined the effect of coil compression. Finally, we benchmark four distinct non-Cartesian undersampling patterns, with an acceleration factor of six, using the publicly available Calgary-Campinas dataset. Our results show that NC-PDNet trained on compressed data with varying input channel numbers achieves an average PSNR of 42.98 dB for 1 mm isotropic 32 channel whole-brain 3D reconstruction. With an inference time of 4.95sec and a GPU memory usage of 5.49 GB, our approach demonstrates significant potential for clinical research application.
Authors: Elisabeth Pfaehler, Daniel Pflugfelder, Hanno Scharr
Abstract: In the acquisition of Magnetic Resonance (MR) images shorter scan times lead to higher image noise. Therefore, automatic image denoising using deep learning methods is of high interest. MR images containing line-like structures such as roots or vessels yield special characteristics as they display connected structures and yield sparse information. For this kind of data, it is important to consider voxel neighborhoods when training a denoising network. In this paper, we translate the Perceptual Loss to 3D data by comparing feature maps of untrained networks in the loss function as done previously for 2D data. We tested the performance of untrained Perceptual Loss (uPL) on 3D image denoising of MR images displaying brain vessels (MR angiograms - MRA) and images of plant roots in soil. We investigate the impact of various uPL characteristics such as weight initialization, network depth, kernel size, and pooling operations on the results. We tested the performance of the uPL loss on four Rician noise levels using evaluation metrics such as the Structural Similarity Index Metric (SSIM). We observe, that our uPL outperforms conventional loss functions such as the L1 loss or a loss based on the Structural Similarity Index Metric (SSIM). The uPL network's initialization is not important, while network depth and pooling operations impact denoising performance. E.g. for both datasets a network with five convolutional layers led to the best performance while a network with more layers led to a performance drop. We also find that small uPL networks led to better or comparable results than using large networks such as VGG. We observe superior performance of our loss for both datasets, all noise levels, and three network architectures. In conclusion, for images containing line-like structures, uPL is an alternative to other loss functions for 3D image denoising.
Authors: Ahmed Karam Eldaly, Matteo Figini, Daniel C. Alexander
Abstract: Image Quality Transfer (IQT) aims to enhance the contrast and resolution of low-quality medical images, e.g. obtained from low-power devices, with rich information learned from higher quality images. In contrast to existing IQT methods which adopt supervised learning frameworks, in this work, we propose two novel formulations of the IQT problem. The first approach uses an unsupervised learning framework, whereas the second is a combination of both supervised and unsupervised learning. The unsupervised learning approach considers a sparse representation (SRep) and dictionary learning model, which we call IQT-SRep, whereas the combination of supervised and unsupervised learning approach is based on deep dictionary learning (DDL), which we call IQT-DDL. The IQT-SRep approach trains two dictionaries using a SRep model using pairs of low- and high-quality volumes. Subsequently, the SRep of a low-quality block, in terms of the low-quality dictionary, can be directly used to recover the corresponding high-quality block using the high-quality dictionary. On the other hand, the IQT-DDL approach explicitly learns a high-resolution dictionary to upscale the input volume, while the entire network, including high dictionary generator, is simultaneously optimised to take full advantage of deep learning methods. The two models are evaluated using a low-field magnetic resonance imaging (MRI) application aiming to recover high-quality images akin to those obtained from high-field scanners. Experiments comparing the proposed approaches against state-of-the-art supervised deep learning IQT method (IQT-DL) identify that the two novel formulations of the IQT problem can avoid bias associated with supervised methods when tested using out-of-distribution data that differs from the distribution of the data the model was trained on. This highlights the potential benefit of these novel paradigms for IQT.
Authors: Suhas Srinath, Aditya Chandrasekar, Hemang Jamadagni, Rajiv Soundararajan, Prathosh A P
Abstract: With the rise of marine exploration, underwater imaging has gained significant attention as a research topic. Underwater video enhancement has become crucial for real-time computer vision tasks in marine exploration. However, most existing methods focus on enhancing individual frames and neglect video temporal dynamics, leading to visually poor enhancements. Furthermore, the lack of ground-truth references limits the use of abundant available underwater video data in many applications. To address these issues, we propose a two-stage framework for enhancing underwater videos. The first stage uses a denoising diffusion probabilistic model to learn a generative prior from unlabeled data, capturing robust and descriptive feature representations. In the second stage, this prior is incorporated into a physics-based image formulation for spatial enhancement, while also enforcing temporal consistency between video frames. Our method enables real-time and computationally-efficient processing of high-resolution underwater videos at lower resolutions, and offers efficient enhancement in the presence of diverse water-types. Extensive experiments on four datasets show that our approach generalizes well and outperforms existing enhancement methods. Our code is available at github.com/suhas-srinath/undive.
Authors: Al Amin, Kamrul Hasan, Sharif Ullah, M. Shamim Hossain
Abstract: Privacy-preserving and secure data sharing are critical for medical image analysis while maintaining accuracy and minimizing computational overhead are also crucial. Applying existing deep neural networks (DNNs) to encrypted medical data is not always easy and often compromises performance and security. To address these limitations, this research introduces a secure framework consisting of a learnable encryption method based on the block-pixel operation to encrypt the data and subsequently integrate it with the Vision Transformer (ViT). The proposed framework ensures data privacy and security by creating unique scrambling patterns per key, providing robust performance against leading bit attacks and minimum difference attacks.
Authors: Ben Koska, Mojm\'ir Horv\'ath
Abstract: We present a novel 4.5B parameter small language model that can handle multiple input and output modalities, including text, images, videos, and audio. Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks, demonstrating the potential of multi-modal models to tackle complex real-world problems. Our approach leverages recent advancements in language modeling and multi-task learning to create a versatile and high-performing model that can even be deployed for edge inference. Experimental results show the model's strong performance across multiple benchmarks, paving the way for further progress in multi-modal artificial intelligence.
Authors: Jacob Anderson, Georgios Fainekos, Bardh Hoxha, Hideki Okamoto, Danil Prokhorov
Abstract: Perception in fields like robotics, manufacturing, and data analysis generates large volumes of temporal and spatial data to effectively capture their environments. However, sorting through this data for specific scenarios is a meticulous and error-prone process, often dependent on the application, and lacks generality and reproducibility. In this work, we introduce SpREs as a novel querying language for pattern matching over perception streams containing spatial and temporal data derived from multi-modal dynamic environments. To highlight the capabilities of SpREs, we developed the STREM tool as both an offline and online pattern matching framework for perception data. We demonstrate the offline capabilities of STREM through a case study on a publicly available AV dataset (Woven Planet Perception) and its online capabilities through a case study integrating STREM in ROS with the CARLA simulator. We also conduct performance benchmark experiments on various SpRE queries. Using our matching framework, we are able to find over 20,000 matches within 296 ms making STREM applicable in runtime monitoring applications.
Authors: Cassandre Notton, Vasudev Sharma, Vincent Quoc-Huy Trinh, Lina Chen, Minqi Xu, Sonal Varma, Mahdi S. Hosseini
Abstract: Colorectal cancer (CRC) is one of the few cancers that have an established dysplasia-carcinoma sequence that benefits from screening. Everyone over 50 years of age in Canada is eligible for CRC screening. About 20\% of those people will undergo a biopsy for a pre-neoplastic polyp and, in many cases, multiple polyps. As such, these polyp biopsies make up the bulk of a pathologist's workload. Developing an efficient computational model to help screen these polyp biopsies can improve the pathologist's workflow and help guide their attention to critical areas on the slide. DL models face significant challenges in computational pathology (CPath) because of the gigapixel image size of whole-slide images and the scarcity of detailed annotated datasets. It is, therefore, crucial to leverage self-supervised learning (SSL) methods to alleviate the burden and cost of data annotation. However, current research lacks methods to apply SSL frameworks to analyze pathology data effectively. This paper aims to propose an optimized Barlow Twins framework for colorectal polyps screening. We adapt its hyperparameters, augmentation strategy and encoder to the specificity of the pathology data to enhance performance. Additionally, we investigate the best Field of View (FoV) for colorectal polyps screening and propose a new benchmark dataset for CRC screening, made of four types of colorectal polyps and normal tissue, by performing downstream tasking on MHIST and NCT-CRC-7K datasets. Furthermore, we show that the SSL representations are more meaningful and qualitative than the supervised ones and that Barlow Twins benefits from the Swin Transformer when applied to pathology data. Codes are avaialble from https://github.com/AtlasAnalyticsLab/PathBT.
Authors: Mehri Mehrnia, Mohamed Elbayumi, Mohammed S. M. Elbaz
Abstract: Atrial fibrillation (AF), the most common cardiac arrhythmia, is associated with heart failure and stroke. Accurate segmentation of the left atrium (LA) in 3D late gadolinium-enhanced (LGE) MRI is helpful for evaluating AF, as fibrotic remodeling in the LA myocardium contributes to arrhythmia and serves as a key determinant of therapeutic strategies. However, manual LA segmentation is labor-intensive and challenging. Recent foundational deep learning models, such as the Segment Anything Model (SAM), pre-trained on diverse datasets, have demonstrated promise in generic segmentation tasks. MedSAM, a fine-tuned version of SAM for medical applications, enables efficient, zero-shot segmentation without domain-specific training. Despite the potential of MedSAM model, it has not yet been evaluated for the complex task of LA segmentation in 3D LGE-MRI. This study aims to (1) evaluate the performance of MedSAM in automating LA segmentation, (2) compare the performance of the MedSAM2 model, which uses a single prompt with automated tracking, with the MedSAM1 model, which requires separate prompt for each slice, and (3) analyze the performance of MedSAM1 in terms of Dice score(i.e., segmentation accuracy) by varying the size and location of the box prompt.
Authors: Maneet Mehta, Cody Buntain
Abstract: This paper examines potential biases and inconsistencies in emotional evocation of images produced by generative artificial intelligence (AI) models and their potential bias toward negative emotions. In particular, we assess this bias by comparing the emotions evoked by an AI-produced image to the emotions evoked by prompts used to create those images. As a first step, the study evaluates three approaches for identifying emotions in images -- traditional supervised learning, zero-shot learning with vision-language models, and cross-modal auto-captioning -- using EmoSet, a large dataset of image-emotion annotations that categorizes images across eight emotional types. Results show fine-tuned models, particularly Google's Vision Transformer (ViT), significantly outperform zero-shot and caption-based methods in recognizing emotions in images. For a cross-modality comparison, we then analyze the differences between emotions in text prompts -- via existing text-based emotion-recognition models -- and the emotions evoked in the resulting images. Findings indicate that AI-generated images frequently lean toward negative emotional content, regardless of the original prompt. This emotional skew in generative models could amplify negative affective content in digital spaces, perpetuating its prevalence and impact. The study advocates for a multidisciplinary approach to better align AI emotion recognition with psychological insights and address potential biases in generative AI outputs across digital media.
Authors: Jingyi Zhou, Senlin Luo, Haofan Chen
Abstract: Text emotion detection constitutes a crucial foundation for advancing artificial intelligence from basic comprehension to the exploration of emotional reasoning. Most existing emotion detection datasets rely on manual annotations, which are associated with high costs, substantial subjectivity, and severe label imbalances. This is particularly evident in the inadequate annotation of micro-emotions and the absence of emotional intensity representation, which fail to capture the rich emotions embedded in sentences and adversely affect the quality of downstream task completion. By proposing an all-labels and training-set label regression method, we map label values to energy intensity levels, thereby fully leveraging the learning capabilities of machine models and the interdependencies among labels to uncover multiple emotions within samples. This led to the establishment of the Emotion Quantization Network (EQN) framework for micro-emotion detection and annotation. Using five commonly employed sentiment datasets, we conducted comparative experiments with various models, validating the broad applicability of our framework within NLP machine learning models. Based on the EQN framework, emotion detection and annotation are conducted on the GoEmotions dataset. A comprehensive comparison with the results from Google literature demonstrates that the EQN framework possesses a high capability for automatic detection and annotation of micro-emotions. The EQN framework is the first to achieve automatic micro-emotion annotation with energy-level scores, providing strong support for further emotion detection analysis and the quantitative research of emotion computing.
Authors: Daniel Gilo, Tzofi Klinghoffer, Or Litany
Abstract: Neural field methods, initially successful in the inverse rendering domain, have recently been extended to CT reconstruction, marking a paradigm shift from traditional techniques. While these approaches deliver state-of-the-art results in sparse-view CT reconstruction, they struggle in limited-angle settings, where input projections are captured over a restricted angle range. We present a novel loss term based on consistency conditions between corresponding epipolar lines in X-ray projection images, aimed at regularizing neural attenuation field optimization. By enforcing these consistency conditions, our approach, Epi-NAF, propagates supervision from input views within the limited-angle range to predicted projections over the full cone-beam CT range. This loss results in both qualitative and quantitative improvements in reconstruction compared to baseline methods.
Authors: Wenhao Chi, Haiping Liu, Hongqiao Dong, Wenhua Liang, Bo Liu
Abstract: In the field of non-invasive medical imaging, radiomic features are utilized to measure tumor characteristics. However, these features can be affected by the techniques used to discretize the images, ultimately impacting the accuracy of diagnosis. To investigate the influence of various image discretization methods on diagnosis, it is common practice to evaluate multiple discretization strategies individually. This approach often leads to redundant and time-consuming tasks such as training predictive models and fine-tuning hyperparameters separately. This study examines the feasibility of employing multi-task Bayesian optimization to accelerate the hyperparameters search for classifying benign and malignant pulmonary nodules using RBF SVM. Our findings suggest that multi-task Bayesian optimization significantly accelerates the search for hyperparameters in comparison to a single-task approach. To the best of our knowledge, this is the first investigation to utilize multi-task Bayesian optimization in a critical medical context.
Authors: Ning Wu, Han Huang, Yueting Xu, Zhifeng Hao
Abstract: Architecture performance evaluation is the most time-consuming part of neural architecture search (NAS). Zero-Shot NAS accelerates the evaluation by utilizing zero-cost proxies instead of training. Though effective, existing zero-cost proxies require invoking backpropagations or running networks on input data, making it difficult to further accelerate the computation of proxies. To alleviate this issue, architecture topologies are used to evaluate the performance of networks in this study. We prove that particular architectural topologies decrease the local entropy of feature maps, which degrades specific features to a bias, thereby reducing network performance. Based on this proof, architectural topologies are utilized to quantify the suppression of local entropy decrease (SED) as a data-free and running-free proxy. Experimental results show that SED outperforms most state-of-the-art proxies in terms of architecture selection on five benchmarks, with computation time reduced by three orders of magnitude. We further compare the SED-based NAS with state-of-the-art proxies. SED-based NAS selects the architecture with higher accuracy and fewer parameters in only one second. The theoretical analyses of local entropy and experimental results demonstrate that the suppression of local entropy decrease facilitates selecting optimal architectures in Zero-Shot NAS.
Authors: Ezgi Demircan-Tureyen, Felix Lucka, Tristan van Leeuwen
Abstract: Recent works demonstrate the effectiveness of diffusion models as unsupervised solvers for inverse imaging problems. Sparse-view computed tomography (CT) has greatly benefited from these advancements, achieving improved generalization without reliance on measurement parameters. However, this comes at the cost of potential hallucinations, especially when handling out-of-distribution (OOD) data. To ensure reliability, it is essential to study OOD detection for CT reconstruction across both clinical and industrial applications. This need further extends to enabling the OOD detector to function effectively as an anomaly inspection tool. In this paper, we explore the use of a diffusion model, trained to capture the target distribution for CT reconstruction, as an in-distribution prior. Building on recent research, we employ the model to reconstruct partially diffused input images and assess OOD-ness through multiple reconstruction errors. Adapting this approach for sparse-view CT requires redefining the notions of "input" and "reconstruction error". Here, we use filtered backprojection (FBP) reconstructions as input and investigate various definitions of reconstruction error. Our proof-of-concept experiments on the MNIST dataset highlight both successes and failures, demonstrating the potential and limitations of integrating such an OOD detector into a CT reconstruction system. Our findings suggest that effective OOD detection can be achieved by comparing measurements with forward-projected reconstructions, provided that reconstructions from noisy FBP inputs are conditioned on the measurements. However, conditioning can sometimes lead the OOD detector to inadvertently reconstruct OOD images well. To counter this, we introduce a weighting approach that improves robustness against highly informative OOD measurements, albeit with a trade-off in performance in certain cases.
Authors: Hailin Xu, Hongxia Wang, Huanshui Zhang
Abstract: The Bundle Adjustment (BA) model is commonly optimized using a nonlinear least squares method, with the Levenberg-Marquardt (L-M) algorithm being a typical choice. However, despite the L-M algorithm's effectiveness, its sensitivity to initial conditions often results in slower convergence when applied to poorly conditioned datasets, motivating the exploration of alternative optimization strategies. This paper introduces a novel algorithm for optimizing the BA model in the context of image sequence alignment for cryo-electron tomography, utilizing optimal control theory to directly optimize general nonlinear functions. The proposed Optimal Control Algorithm (OCA) exhibits superior convergence rates and effectively mitigates the oscillatory behavior frequently observed in L-M algorithm. Extensive experiments on both synthetic and real-world datasets were conducted to evaluate the algorithm's performance. The results demonstrate that the OCA achieves faster convergence compared to the L-M algorithm. Moreover, the incorporation of a bisection-based update procedure significantly enhances the OCA's performance, particularly in poorly initialized datasets. These findings indicate that the OCA can substantially improve the efficiency of 3D reconstructions in cryo-electron tomography.
Authors: Le-Trung Nguyen, A\"el Qu\'elennec, Enzo Tartaglione, Samuel Tardieu, Van-Tam Nguyen
Abstract: Internet of Things and Deep Learning are synergetically and exponentially growing industrial fields with a massive call for their unification into a common framework called Edge AI. While on-device inference is a well-explored topic in recent research, backpropagation remains an open challenge due to its prohibitive computational and memory costs compared to the extreme resource constraints of embedded devices. Drawing on tensor decomposition research, we tackle the main bottleneck of backpropagation, namely the memory footprint of activation map storage. We investigate and compare the effects of activation compression using Singular Value Decomposition and its tensor variant, High-Order Singular Value Decomposition. The application of low-order decomposition results in considerable memory savings while preserving the features essential for learning, and also offers theoretical guarantees to convergence. Experimental results obtained on main-stream architectures and tasks demonstrate Pareto-superiority over other state-of-the-art solutions, in terms of the trade-off between generalization and memory footprint.
Authors: Tian Xie, Jifan Zhang, Haoyue Bai, Robert Nowak
Abstract: Machine learning models deployed in open-world scenarios often encounter unfamiliar conditions and perform poorly in unanticipated situations. As AI systems advance and find application in safety-critical domains, effectively handling out-of-distribution (OOD) data is crucial to building open-world learning systems. In this work, we introduce ALOE, a novel active learning algorithm for open-world environments designed to enhance model adaptation by incorporating new OOD classes via a two-stage approach. First, diversity sampling selects a representative set of examples, followed by energy-based OOD detection to prioritize likely unknown classes for annotation. This strategy accelerates class discovery and learning, even under constrained annotation budgets. Evaluations on three long-tailed image classification benchmarks demonstrate that ALOE outperforms traditional active learning baselines, effectively expanding known categories while balancing annotation cost. Our findings reveal a crucial tradeoff between enhancing known-class performance and discovering new classes, setting the stage for future advancements in open-world machine learning.
Authors: Sumera Rounaq, Shahid Munir Shah, Mahmoud Aljawarneh, Sarah Khan, Ghulam Muhammad
Abstract: COVID-19 is extremely contagious and its rapid growth has drawn attention towards its early diagnosis. Early diagnosis of COVID-19 enables healthcare professionals and government authorities to break the chain of transition and flatten the epidemic curve. With the number of cases accelerating across the developed world, COVID-19 induced Viral Pneumonia cases is a big challenge. Overlapping of COVID-19 cases with Viral Pneumonia and other lung infections with limited dataset and long training hours is a serious problem to cater. Limited amount of data often results in over-fitting models and due to this reason, model does not predict generalized results. To fill this gap, we proposed GAN-based approach to synthesize images which later fed into the deep learning models to classify images of COVID-19, Normal, and Viral Pneumonia. Specifically, customized Wasserstein GAN is proposed to generate 19% more Chest X-ray images as compare to the real images. This expanded dataset is then used to train four proposed deep learning models: VGG-16, ResNet-50, GoogLeNet and MNAST. The result showed that expanded dataset utilized deep learning models to deliver high classification accuracies. In particular, VGG-16 achieved highest accuracy of 99.17% among all four proposed schemes. Rest of the models like ResNet-50, GoogLeNet and MNAST delivered 93.9%, 94.49% and 97.75% testing accuracies respectively. Later, the efficiency of these models is compared with the state of art models on the basis of accuracy. Further, our proposed models can be applied to address the issue of scant datasets for any problem of image analysis.
Authors: Mingyu Yu, Haonan Miao, Zhengping Jin, Sujuan Qing
Abstract: Typical steganography embeds secret information into images by exploiting their redundancy. Since the visual imperceptibility of secret information is a key factor in scheme evaluation, conventional methods aim to balance this requirement with embedding capacity. Consequently, integrating emerging image generation models and secret transmission has been extensively explored to achieve a higher embedding capacity. Previous works mostly focus on generating stego-images with Generative Adversarial Networks (GANs) and usually rely on pseudo-keys, namely conditions or parameters involved in the generation process, which are related to secret images. However, studies on diffusion-based coverless steganography remain insufficient. In this work, we leverage the Denoising Diffusion Implicit Model (DDIM) to generate high-quality stego-images without introducing pseudo-keys, instead employing real keys to enhance security. Furthermore, our method offers low-image-correlation real-key protection by incorporating chaotic encryption. Another core innovation is that our method requires only one-time negotiation for multiple communications, unlike prior methods that necessitate negotiation for each interaction.
Authors: Behraj Khan, Behroz Mirza, Nouman Durrani, Tahir Syed
Abstract: When training data are distributed across{ time or space,} covariate shift across fragments of training data biases cross-validation, compromising model selection and assessment. We present \textit{Fragmentation-Induced covariate-shift Remediation} ($FIcsR$), which minimizes an $f$-divergence between a fragment's covariate distribution and that of the standard cross-validation baseline. We s{how} an equivalence with popular importance-weighting methods. {The method}'s numerical solution poses a computational challenge owing to the overparametrized nature of a neural network, and we derive a Fisher Information approximation. When accumulated over fragments, this provides a global estimate of the amount of shift remediation thus far needed, and we incorporate that as a prior via the minimization objective. In the paper, we run extensive classification experiments on multiple data classes, over $40$ datasets, and with data batched over multiple sequence lengths. We extend the study to the $k$-fold cross-validation setting through a similar set of experiments. An ablation study exposes the method to varying amounts of shift and demonstrates slower degradation with $FIcsR$ in place. The results are promising under all these conditions; with improved accuracy against batch and fold state-of-the-art by more than $5\%$ and $10\%$, respectively.
Authors: Guangyi Wang, Wei Peng, Lijiang Li, Wenyu Chen, Yuren Cai, Songzhi Su
Abstract: Diffusion Probabilistic Models (DPMs) have demonstrated exceptional performance in generative tasks, but this comes at the expense of sampling efficiency. To enhance sampling speed without sacrificing quality, various distillation-based accelerated sampling algorithms have been recently proposed. However, they typically require significant additional training costs and model parameter storage, which limit their practical application. In this work, we propose PCA-based Adaptive Search (PAS), which optimizes existing solvers for DPMs with minimal learnable parameters and training costs. Specifically, we first employ PCA to obtain a few orthogonal unit basis vectors to span the high-dimensional sampling space, which enables us to learn just a set of coordinates to correct the sampling direction; furthermore, based on the observation that the cumulative truncation error exhibits an ``S''-shape, we design an adaptive search strategy that further enhances the sampling efficiency and reduces the number of stored parameters to approximately 10. Extensive experiments demonstrate that PAS can significantly enhance existing fast solvers in a plug-and-play manner with negligible costs. For instance, on CIFAR10, PAS requires only 12 parameters and less than 1 minute of training on a single NVIDIA A100 GPU to optimize the DDIM from 15.69 FID (NFE=10) to 4.37.
Authors: Yifei Wang, Kaiwen Hu, Sharut Gupta, Ziyu Ye, Yisen Wang, Stefanie Jegelka
Abstract: Contrastive learning has been a leading paradigm for self-supervised learning, but it is widely observed that it comes at the price of sacrificing useful features (\eg colors) by being invariant to data augmentations. Given this limitation, there has been a surge of interest in equivariant self-supervised learning (E-SSL) that learns features to be augmentation-aware. However, even for the simplest rotation prediction method, there is a lack of rigorous understanding of why, when, and how E-SSL learns useful features for downstream tasks. To bridge this gap between practice and theory, we establish an information-theoretic perspective to understand the generalization ability of E-SSL. In particular, we identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks. This synergy effect encourages models to extract class-relevant features to improve its equivariant prediction, which, in turn, benefits downstream tasks requiring semantic features. Based on this perspective, we theoretically analyze the influence of data transformations and reveal several principles for practical designs of E-SSL. Our theory not only aligns well with existing E-SSL methods but also sheds light on new directions by exploring the benefits of model equivariance. We believe that a theoretically grounded understanding on the role of equivariance would inspire more principled and advanced designs in this field. Code is available at https://github.com/kaotty/Understanding-ESSL.
Authors: Sarang Galada, Tanurima Halder, Kunal Deo, Ram P Krish, Kshitij Jadhav
Abstract: Multi-site MRI studies often suffer from site-specific variations arising from differences in methodology, hardware, and acquisition protocols, thereby compromising accuracy and reliability in clinical AI/ML tasks. We present PRISM (Privacy-preserving Inter-Site MRI Harmonization), a novel Deep Learning framework for harmonizing structural brain MRI across multiple sites while preserving data privacy. PRISM employs a dual-branch autoencoder with contrastive learning and variational inference to disentangle anatomical features from style and site-specific variations, enabling unpaired image translation without traveling subjects or multiple MRI modalities. Our modular design allows harmonization to any target site and seamless integration of new sites without the need for retraining or fine-tuning. Using multi-site structural MRI data, we demonstrate PRISM's effectiveness in downstream tasks such as brain tissue segmentation and validate its harmonization performance through multiple experiments. Our framework addresses key challenges in medical AI/ML, including data privacy, distribution shifts, model generalizability and interpretability. Code is available at https://github.com/saranggalada/PRISM
Authors: Elad Yoshai, Gil Goldinger, Miki Haifler, Natan T. Shaked
Abstract: In histological pathology, frozen sections are often used for rapid diagnosis during surgeries, as they can be produced within minutes. However, they suffer from artifacts and often lack crucial diagnostic details, particularly within the cell nuclei region. Permanent sections, on the other hand, contain more diagnostic detail but require a time-intensive preparation process. Here, we present a generative deep learning approach to enhance frozen section images by leveraging guidance from permanent sections. Our method places a strong emphasis on the nuclei region, which contains critical information in both frozen and permanent sections. Importantly, our approach avoids generating artificial data in blank regions, ensuring that the network only enhances existing features without introducing potentially unreliable information. We achieve this through a segmented attention network, incorporating nuclei-segmented images during training and adding an additional loss function to refine the nuclei details in the generated permanent images. We validated our method across various tissues, including kidney, breast, and colon. This approach significantly improves histological efficiency and diagnostic accuracy, enhancing frozen section images within seconds, and seamlessly integrating into existing laboratory workflows.
Authors: Rafael Orozco, Huseyin Tuna Erdinc, Yunlin Zeng, Mathias Louboutin, Felix J. Herrmann
Abstract: Accurately characterizing migration velocity models is crucial for a wide range of geophysical applications, from hydrocarbon exploration to monitoring of CO2 sequestration projects. Traditional velocity model building methods such as Full-Waveform Inversion (FWI) are powerful but often struggle with the inherent complexities of the inverse problem, including noise, limited bandwidth, receiver aperture and computational constraints. To address these challenges, we propose a scalable methodology that integrates generative modeling, in the form of Diffusion networks, with physics-informed summary statistics, making it suitable for complicated imaging problems including field datasets. By defining these summary statistics in terms of subsurface-offset image volumes for poor initial velocity models, our approach allows for computationally efficient generation of Bayesian posterior samples for migration velocity models that offer a useful assessment of uncertainty. To validate our approach, we introduce a battery of tests that measure the quality of the inferred velocity models, as well as the quality of the inferred uncertainties. With modern synthetic datasets, we reconfirm gains from using subsurface-image gathers as the conditioning observable. For complex velocity model building involving salt, we propose a new iterative workflow that refines amortized posterior approximations with salt flooding and demonstrate how the uncertainty in the velocity model can be propagated to the final product reverse time migrated images. Finally, we present a proof of concept on field datasets to show that our method can scale to industry-sized problems.
Authors: J\'er\^ome Gilles, St\'ephane Landeau, Tristan Dagobert, Philippe Chevalier, Eric Sti\'ee, Damien Diaz, Jean-Luc Maillart
Abstract: In this communication, we deal with the question of automatic target detection, recognition and tracking (ATD/R/T) algorithms performance assessment. We propose a complete methodology of evaluation which approaches objective image datasets development and adapted metrics definition for the different tasks (detection, recognition and tracking). We present some performance results which are currently processed in a French-MoD program called 2ACI (``Acquisition Automatique de Cibles par Imagerie``).
Authors: Jerome Gilles
Abstract: In this paper, we propose to improve image decomposition algorithms in the case of noisy images. In \cite{gilles1,aujoluvw}, the authors propose to separate structures, textures and noise from an image. Unfortunately, the use of separable wavelets shows some artefacts. In this paper, we propose to replace the wavelet transform by the contourlet transform which better approximate geometry in images. For that, we define contourlet spaces and their associated norms. Then, we get an iterative algorithm which we test on two noisy textured images.
Authors: Xuming He, Zhiwang Zhou, Wenlong Zhang, Xiangyu Zhao, Hao Chen, Shiqi Chen, Lei Bai
Abstract: Weather radar data synthesis can fill in data for areas where ground observations are missing. Existing methods often employ reconstruction-based approaches with MSE loss to reconstruct radar data from satellite observation. However, such methods lead to over-smoothing, which hinders the generation of high-frequency details or high-value observation areas associated with convective weather. To address this issue, we propose a two-stage diffusion-based method called DiffSR. We first pre-train a reconstruction model on global-scale data to obtain radar estimation and then synthesize radar reflectivity by combining radar estimation results with satellite data as conditions for the diffusion model. Extensive experiments show that our method achieves state-of-the-art (SOTA) results, demonstrating the ability to generate high-frequency details and high-value areas.
Authors: Xing Yao, Runxuan Yu, Dewei Hu, Hao Yang, Ange Lou, Jiacheng Wang, Daiwei Lu, Gabriel Arenas, Baris Oguz, Alison Pouch, Nadav Schwartz, Brett C Byram, Ipek Oguz
Abstract: Ultrasound (US) image stitching can expand the field-of-view (FOV) by combining multiple US images from varied probe positions. However, registering US images with only partially overlapping anatomical contents is a challenging task. In this work, we introduce SynStitch, a self-supervised framework designed for 2DUS stitching. SynStitch consists of a synthetic stitching pair generation module (SSPGM) and an image stitching module (ISM). SSPGM utilizes a patch-conditioned ControlNet to generate realistic 2DUS stitching pairs with known affine matrix from a single input image. ISM then utilizes this synthetic paired data to learn 2DUS stitching in a supervised manner. Our framework was evaluated against multiple leading methods on a kidney ultrasound dataset, demonstrating superior 2DUS stitching performance through both qualitative and quantitative analyses. The code will be made public upon acceptance of the paper.
Authors: Daria Tsereh, Mark Mirgaleev, Ivan Molodetskikh, Roman Kazantsev, Dmitriy Vatolin
Abstract: Learning-based image compression methods have improved in recent years and started to outperform traditional codecs. However, neural-network approaches can unexpectedly introduce visual artifacts in some images. We therefore propose methods to separately detect three types of artifacts (texture and boundary degradation, color change, and text corruption), to localize the affected regions, and to quantify the artifact strength. We consider only those regions that exhibit distortion due solely to the neural compression but that a traditional codec recovers successfully at a comparable bitrate. We employed our methods to collect artifacts for the JPEG AI verification model with respect to HM-18.0, the H.265 reference software. We processed about 350,000 unique images from the Open Images dataset using different compression-quality parameters; the result is a dataset of 46,440 artifacts validated through crowd-sourced subjective assessment. Our proposed dataset and methods are valuable for testing neural-network-based image codecs, identifying bugs in these codecs, and enhancing their performance. We make source code of the methods and the dataset publicly available.
Authors: Vladyslav Zalevskyi, Thomas Sanchez, Margaux Roulet, H\'el\`ene Lajous, Jordina Aviles Verdera, Jana Hutter, Hamza Kebiri, Meritxell Bach Cuadra
Abstract: Fetal brain tissue segmentation in magnetic resonance imaging (MRI) is a crucial tool that supports the understanding of neurodevelopment, yet it faces challenges due to the heterogeneity of data coming from different scanners and settings, and due to data scarcity. Recent approaches based on domain randomization, like SynthSeg, have shown a great potential for single source domain generalization, by simulating images with randomized contrast and image resolution from the label maps. In this work, we investigate how to maximize the out-of-domain (OOD) generalization potential of SynthSeg-based methods in fetal brain MRI. Specifically, when studying data generation, we demonstrate that the simple Gaussian mixture models used in SynthSeg enable more robust OOD generalization than physics-informed generation methods. We also investigate how intensity clustering can help create more faithful synthetic images, and observe that it is key to achieving a non-trivial OOD generalization capability when few label classes are available. Finally, by combining for the first time SynthSeg with modern fine-tuning approaches based on weight averaging, we show that fine-tuning a model pre-trained on synthetic data on a few real image-segmentation pairs in a new domain can lead to improvements in the target domain, but also in other domains. We summarize our findings as five key recommendations that we believe can guide practitioners who would like to develop SynthSeg-based approaches in other organs or modalities.
Authors: Pascal Janetzky, Tobias Schlagenhauf, Stefan Feuerriegel
Abstract: A common challenge in continual learning (CL) is catastrophic forgetting, where the performance on old tasks drops after new, additional tasks are learned. In this paper, we propose a novel framework called ReCL to slow down forgetting in CL. Our framework exploits an implicit bias of gradient-based neural networks due to which these converge to margin maximization points. Such convergence points allow us to reconstruct old data from previous tasks, which we then combine with the current training data. Our framework is flexible and can be applied on top of existing, state-of-the-art CL methods to slow down forgetting. We further demonstrate the performance gain from our framework across a large series of experiments, including different CL scenarios (class incremental, domain incremental, task incremental learning) different datasets (MNIST, CIFAR10), and different network architectures. Across all experiments, we find large performance gains through ReCL. To the best of our knowledge, our framework is the first to address catastrophic forgetting by leveraging models in CL as their own memory buffers.
Authors: Ivica Kopriva, Dario Sitnik, Laura-Isabelle Dion-Bertrand, Marija Milkovi\'c Peri\v{s}a, Mirko Had\v{z}ija, Marijana Popovi\'c Had\v{z}ija
Abstract: Hyperspectral imaging (HSI) holds significant potential for transforming the field of computational pathology. However, there is currently a shortage of pixel-wise annotated HSI data necessary for training deep learning (DL) models. Additionally, the number of HSI-based research studies remains limited, and in many cases, the advantages of HSI over traditional RGB imaging have not been conclusively demonstrated, particularly for specimens collected intraoperatively. To address these challenges we present a database consisted of 27 HSIs of hematoxylin-eosin stained frozen sections, collected from 14 patients with colon adenocarcinoma metastasized to the liver. It is aimed to validate pixel-wise classification for intraoperative tumor resection. The HSIs were acquired in the spectral range of 450 to 800 nm, with a resolution of 1 nm, resulting in images of 1384x1035 pixels. Pixel-wise annotations were performed by three pathologists. To overcome challenges such as experimental variability and the lack of annotated data, we combined label-propagation-based semi-supervised learning (SSL) with spectral-spatial features extracted by: the multiscale principle of relevant information (MPRI) method and tensor singular spectrum analysis method. Using only 1% of labeled pixels per class the SSL-MPRI method achieved a micro balanced accuracy (BACC) of 0.9313 and a micro F1-score of 0.9235 on the HSI dataset. The performance on corresponding RGB images was lower, with a micro BACC of 0.8809 and a micro F1-score of 0.8688. These improvements are statistically significant. The SSL-MPRI approach outperformed six DL architectures trained with 63% of labeled pixels. Data and code are available at: https://github.com/ikopriva/ColonCancerHSI.
Authors: Haohan Weng, Zibo Zhao, Biwen Lei, Xianghui Yang, Jian Liu, Zeqiang Lai, Zhuo Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, Tong Zhang, Shenghua Gao, C. L. Philip Chen
Abstract: We propose a compressive yet effective mesh representation, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75\% compared to the original sequences. This compression milestone unlocks the potential to utilize mesh data with significantly more faces, thereby enhancing detail richness and improving generation robustness. Empowered with the BPT, we have built a foundation mesh generative model training on scaled mesh data to support flexible control for point clouds and images. Our model demonstrates the capability to generate meshes with intricate details and accurate topology, achieving SoTA performance on mesh generation and reaching the level for direct product usage.
Authors: Minah Lee, Uday Kamal, Saibal Mukhopadhyay
Abstract: This paper proposes a novel problem: vision-based perception to learn and predict the collective dynamics of multi-agent systems, specifically focusing on interaction strength and convergence time. Multi-agent systems are defined as collections of more than ten interacting agents that exhibit complex group behaviors. Unlike prior studies that assume knowledge of agent positions, we focus on deep learning models to directly predict collective dynamics from visual data, captured as frames or events. Due to the lack of relevant datasets, we create a simulated dataset using a state-of-the-art flocking simulator, coupled with a vision-to-event conversion framework. We empirically demonstrate the effectiveness of event-based representation over traditional frame-based methods in predicting these collective behaviors. Based on our analysis, we present event-based vision for Multi-Agent dynamic Prediction (evMAP), a deep learning architecture designed for real-time, accurate understanding of interaction strength and collective behavior emergence in multi-agent systems.
Authors: Yasra Chandio, Khotso Selialia, Joseph DeGol, Luis Garcia, Fatima M. Anwar
Abstract: Advancements in tracking algorithms have empowered nascent applications across various domains, from steering autonomous vehicles to guiding robots to enhancing augmented reality experiences for users. However, these algorithms are application-specific and do not work across applications with different types of motion; even a tracking algorithm designed for a given application does not work in scenarios deviating from highly standard conditions. For example, a tracking algorithm designed for robot navigation inside a building will not work for tracking the same robot in an outdoor environment. To demonstrate this problem, we evaluate the performance of the state-of-the-art tracking methods across various applications and scenarios. To inform our analysis, we first categorize algorithmic, environmental, and locomotion-related challenges faced by tracking algorithms. We quantitatively evaluate the performance using multiple tracking algorithms and representative datasets for a wide range of Internet of Things (IoT) and Extended Reality (XR) applications, including autonomous vehicles, drones, and humans. Our analysis shows that no tracking algorithm works across different applications and scenarios within applications. Ultimately, using the insights generated from our analysis, we discuss multiple approaches to improving the tracking performance using input data characterization, leveraging intermediate information, and output evaluation.
Authors: Yusuke Oumi, Yuto Shibata, Go Irie, Akisato Kimura, Yoshimitsu Aoki, Mariko Isogawa
Abstract: This paper explores the problem of 3D human pose estimation from only low-level acoustic signals. The existing active acoustic sensing-based approach for 3D human pose estimation implicitly assumes that the target user is positioned along a line between loudspeakers and a microphone. Because reflection and diffraction of sound by the human body cause subtle acoustic signal changes compared to sound obstruction, the existing model degrades its accuracy significantly when subjects deviate from this line, limiting its practicality in real-world scenarios. To overcome this limitation, we propose a novel method composed of a position discriminator and reverberation-resistant model. The former predicts the standing positions of subjects and applies adversarial learning to extract subject position-invariant features. The latter utilizes acoustic signals before the estimation target time as references to enhance robustness against the variations in sound arrival times due to diffraction and reflection. We construct an acoustic pose estimation dataset that covers diverse human locations and demonstrate through experiments that our proposed method outperforms existing approaches.
Authors: Yunhao Luo, Yilun Du
Abstract: Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamic model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that uses trajectory level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show how our approach is on par with or even surpasses multiple behavior cloning baselines trained on expert demonstrations while without requiring any action annotations.
Authors: Kirill Vishniakov, Eric Xing, Zhiqiang Shen
Abstract: The recent progress in self-supervised learning has successfully combined Masked Image Modeling (MIM) with Siamese Networks, harnessing the strengths of both methodologies. Nonetheless, certain challenges persist when integrating conventional erase-based masking within Siamese ConvNets. Two primary concerns are: (1) The continuous data processing nature of ConvNets, which doesn't allow for the exclusion of non-informative masked regions, leading to reduced training efficiency compared to ViT architecture; (2) The misalignment between erase-based masking and the contrastive-based objective, distinguishing it from the MIM technique. To address these challenges, this work introduces a novel filling-based masking approach, termed \textbf{MixMask}. The proposed method replaces erased areas with content from a different image, effectively countering the information depletion seen in traditional masking methods. Additionally, we unveil an adaptive loss function that captures the semantics of the newly patched views, ensuring seamless integration within the architectural framework. We empirically validate the effectiveness of our approach through comprehensive experiments across various datasets and application scenarios. The findings underscore our framework's enhanced performance in areas such as linear probing, semi-supervised and supervised finetuning, object detection and segmentation. Notably, our method surpasses the MSCN, establishing MixMask as a more advantageous masking solution for Siamese ConvNets. Our code and models are publicly available at https://github.com/kirill-vish/MixMask.
Authors: Kennard Yanting Chan, Guosheng Lin, Haiyu Zhao, Weisi Lin
Abstract: We propose IntegratedPIFu, a new pixel aligned implicit model that builds on the foundation set by PIFuHD. IntegratedPIFu shows how depth and human parsing information can be predicted and capitalised upon in a pixel-aligned implicit model. In addition, IntegratedPIFu introduces depth oriented sampling, a novel training scheme that improve any pixel aligned implicit model ability to reconstruct important human features without noisy artefacts. Lastly, IntegratedPIFu presents a new architecture that, despite using less model parameters than PIFuHD, is able to improves the structural correctness of reconstructed meshes. Our results show that IntegratedPIFu significantly outperforms existing state of the arts methods on single view human reconstruction. Our code has been made available online.
Authors: Krzysztof Lis, Matthias Rottmann, Annika M\"utze, Sina Honari, Pascal Fua, Mathieu Salzmann
Abstract: In addition to impressive performance, vision transformers have demonstrated remarkable abilities to encode information they were not trained to extract. For example, this information can be used to perform segmentation or single-view depth estimation even though the networks were only trained for image recognition. We show that a similar phenomenon occurs when explicitly training transformers for semantic segmentation in a supervised manner for a set of categories: Once trained, they provide valuable information even about categories absent from the training set. This information can be used to segment objects from these never-seen-before classes in domains as varied as road obstacles, aircraft parked at a terminal, lunar rocks, and maritime hazards.
Authors: Shuo Ye, Shiming Chen, Ruxin Wang, Tianxu Wu, Jiamiao Xu, Salman Khan, Fahad Shahbaz Khan, Ling Shao
Abstract: Data is the foundation for the development of computer vision, and the establishment of datasets plays an important role in advancing the techniques of fine-grained visual categorization~(FGVC). In the existing FGVC datasets used in computer vision, it is generally assumed that each collected instance has fixed characteristics and the distribution of different categories is relatively balanced. In contrast, the real world scenario reveals the fact that the characteristics of instances tend to vary with time and exhibit a long-tailed distribution. Hence, the collected datasets may mislead the optimization of the fine-grained classifiers, resulting in unpleasant performance in real applications. Starting from the real-world conditions and to promote the practical progress of fine-grained visual categorization, we present a Concept Drift and Long-Tailed Distribution dataset. Specifically, the dataset is collected by gathering 11195 images of 250 instances in different species for 47 consecutive months in their natural contexts. The collection process involves dozens of crowd workers for photographing and domain experts for labeling. Meanwhile, we propose a feature recombination framework to address the learning challenges associated with CDLT. Experimental results validate the efficacy of our method while also highlighting the limitations of popular large vision-language models (e.g., CLIP) in the context of long-tailed distributions. This emphasizes the significance of CDLT as a benchmark for investigating these challenges.
Authors: Vladislav Li, Barbara Villarini, Jean-Christophe Nebel, Thomas Lagkas, Panagiotis Sarigiannidis, Vasileios Argyriou
Abstract: The objective of augmented reality (AR) is to add digital content to natural images and videos to create an interactive experience between the user and the environment. Scene analysis and object recognition play a crucial role in AR, as they must be performed quickly and accurately. In this study, a new approach is proposed that involves using oriented bounding boxes with a detection and recognition deep network to improve performance and processing time. The approach is evaluated using two datasets: a real image dataset (DOTA dataset) commonly used for computer vision tasks, and a synthetic dataset that simulates different environmental, lighting, and acquisition conditions. The focus of the evaluation is on small objects, which are difficult to detect and recognise. The results indicate that the proposed approach tends to produce better Average Precision and greater accuracy for small objects in most of the tested conditions.
Authors: Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee. K. Wong, Zhenguo Li, Hengshuang Zhao
Abstract: Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. Capable of processing multi-frame video inputs and textual queries, DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users. Furthermore, DriveGPT4 predicts low-level vehicle control signals in an end-to-end fashion.These advanced capabilities are achieved through the utilization of a bespoke visual instruction tuning dataset, specifically tailored for autonomous driving applications, in conjunction with a mix-finetuning training strategy. DriveGPT4 represents the pioneering effort to leverage LLMs for the development of an interpretable end-to-end autonomous driving solution. Evaluations conducted on the BDD-X dataset showcase the superior qualitative and quantitative performance of DriveGPT4. Additionally, the fine-tuning of domain-specific data enables DriveGPT4 to yield close or even improved results in terms of autonomous driving grounding when contrasted with GPT4-V.
Authors: Yanqi Cheng, Lipei Zhang, Zhenda Shen, Shujun Wang, Lequan Yu, Raymond H. Chan, Carola-Bibiane Sch\"onlieb, Angelica I Aviles-Rivero
Abstract: The utilisation of Plug-and-Play (PnP) priors in inverse problems has become increasingly prominent in recent years. This preference is based on the mathematical equivalence between the general proximal operator and the regularised denoiser, facilitating the adaptation of various off-the-shelf denoiser priors to a wide range of inverse problems. However, existing PnP models predominantly rely on pre-trained denoisers using large datasets. In this work, we introduce Single-Shot PnP methods (SS-PnP), shifting the focus to solving inverse problems with minimal data. First, we integrate Single-Shot proximal denoisers into iterative methods, enabling training with single instances. Second, we propose implicit neural priors based on a novel function that preserves relevant frequencies to capture fine details while avoiding the issue of vanishing gradients. We demonstrate, through extensive numerical and visual experiments, that our method leads to better approximations.
Authors: Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang
Abstract: Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to existing news image captions dataset. With all the extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.
Authors: Songsong Tian, Lusi Li, Weijun Li, Hang Ran, Li Li, Xin Ning
Abstract: Few-Shot Class-Incremental Learning (FSCIL) aims to enable deep neural networks to learn new tasks incrementally from a small number of labeled samples without forgetting previously learned tasks, closely mimicking human learning patterns. In this paper, we propose a novel approach called Prompt Learning for FSCIL (PL-FSCIL), which harnesses the power of prompts in conjunction with a pre-trained Vision Transformer (ViT) model to address the challenges of FSCIL effectively. Our work pioneers the use of visual prompts in FSCIL, which is characterized by its notable simplicity. PL-FSCIL consists of two distinct prompts: the Domain Prompt and the FSCIL Prompt. Both are vectors that augment the model by embedding themselves into the attention layer of the ViT model. Specifically, the Domain Prompt assists the ViT model in adapting to new data domains. The task-specific FSCIL Prompt, coupled with a prototype classifier, amplifies the model's ability to effectively handle FSCIL tasks. We validate the efficacy of PL-FSCIL on widely used benchmark datasets such as CIFAR-100 and CUB-200. The results showcase competitive performance, underscoring its promising potential for real-world applications where high-quality data is often scarce. The source code is available at: https://github.com/TianSongS/PL-FSCIL.
Authors: Xilai Li, Wuyang Liu, Xiaosong Li, Fuqiang Zhou, Huafeng Li, Feiping Nie
Abstract: Multi-modality image fusion (MMIF) combines complementary information from different image modalities to provide a more comprehensive and objective interpretation of scenes. However, existing MMIF methods lack the ability to resist different weather interferences in real-world scenes, preventing them from being useful in practical applications such as autonomous driving. To bridge this research gap, we proposed an all-weather MMIF model. Achieving effective multi-tasking in this context is particularly challenging due to the complex and diverse nature of weather conditions. A key obstacle lies in the 'black box' nature of current deep learning architectures, which restricts their multi-tasking capabilities. To overcome this, we decompose the network into two modules: a fusion module and a restoration module. For the fusion module, we introduce a learnable low-rank representation model to decompose images into low-rank and sparse components. This interpretable feature separation allows us to better observe and understand images. For the restoration module, we propose a physically-aware clear feature prediction module based on an atmospheric scattering model that can deduce variations in light transmittance from both scene illumination and reflectance. We also construct a large-scale multi-modality dataset with 100,000 image pairs across rain, haze, and snow conditions, covering various degradation levels and diverse scenes to thoroughly evaluate image fusion methods in adverse weather. Experimental results in both real-world and synthetic scenes show that the proposed algorithm excels in detail recovery and multi-modality feature extraction. The code is available at https://github.com/ixilai/AWFusion.
Authors: Kennard Yanting Chan, Fayao Liu, Guosheng Lin, Chuan Sheng Foo, Weisi Lin
Abstract: Pixel-aligned implicit models, such as PIFu, PIFuHD, and ICON, are used for single-view clothed human reconstruction. These models need to be trained using a sampling training scheme. Existing sampling training schemes either fail to capture thin surfaces (e.g. ears, fingers) or cause noisy artefacts in reconstructed meshes. To address these problems, we introduce Fine Structured-Aware Sampling (FSS), a new sampling training scheme to train pixel-aligned implicit models for single-view human reconstruction. FSS resolves the aforementioned problems by proactively adapting to the thickness and complexity of surfaces. In addition, unlike existing sampling training schemes, FSS shows how normals of sample points can be capitalized in the training process to improve results. Lastly, to further improve the training process, FSS proposes a mesh thickness loss signal for pixel-aligned implicit models. It becomes computationally feasible to introduce this loss once a slight reworking of the pixel-aligned implicit function framework is carried out. Our results show that our methods significantly outperform SOTA methods qualitatively and quantitatively. Our code is publicly available at https://github.com/kcyt/FSS.
Authors: Tian Xia, Xuweiyi Chen, Sihan Xu
Abstract: Video Diffusion Models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite the progress, ensuring consistency across frames remains a challenge, particularly when using text prompts as control conditions. To address this problem, we introduce UniCtrl, a novel, plug-and-play method that is universally applicable to improve the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl ensures semantic consistency across different frames through cross-frame self-attention control, and meanwhile, enhances the motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy in enhancing various text-to-video models, confirming its effectiveness and universality.
Authors: Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li
Abstract: Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging applications across various fields like film, robotics, and virtual reality. Recent advancements have utilized the diffusion model and attention mechanisms to improve gesture synthesis. However, due to the high computational complexity of these techniques, generating long and diverse sequences with low latency remains a challenge. We explore the potential of state space models (SSMs) to address the challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance the quality of gestures. Leveraging the foundational Mamba block, we introduce MambaTalk, enhancing gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models.
Authors: Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo
Abstract: Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance. Surprisingly, we discover that LLM-based approaches can yield surprisingly good accuracy on long-video tasks with limited video information, sometimes even with no video specific information. Building on this, we exploring injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance also on robotics domain tasks establish its strong generality. Our code will be released publicly.
Authors: Jakaria Rabbi, Johannes Kiechle, Christian Beaulieu, Nilanjan Ray, Dana Cobzas
Abstract: This paper presents a comprehensive study focused on disentangling hippocampal shape variations from diffusion tensor imaging (DTI) datasets within the context of neurological disorders. Leveraging a Mesh Variational Autoencoder (VAE) enhanced with Supervised Contrastive Learning, our approach aims to improve interpretability by disentangling two distinct latent variables corresponding to age and the presence of diseases. In our ablation study, we investigate a range of VAE architectures and contrastive loss functions, showcasing the enhanced disentanglement capabilities of our approach. This evaluation uses synthetic 3D torus mesh data and real 3D hippocampal mesh datasets derived from the DTI hippocampal dataset. Our supervised disentanglement model outperforms several state-of-the-art (SOTA) methods like attribute and guided VAEs in terms of disentanglement scores. Our model distinguishes between age groups and disease status in patients with Multiple Sclerosis (MS) using the hippocampus data. Our Mesh VAE with Supervised Contrastive Learning shows the volume changes of the hippocampus of MS populations at different ages, and the result is consistent with the current neuroimaging literature. This research provides valuable insights into the relationship between neurological disorder and hippocampal shape changes in different age groups of MS populations using a Mesh VAE with Supervised Contrastive loss. Our code is available at https://github.com/Jakaria08/Explaining_Shape_Variability
URLs: https://github.com/Jakaria08/Explaining_Shape_Variability
Authors: Rui Xu, Shu Yang, Yihui Wang, Yu Cai, Bo Du, Hao Chen
Abstract: Mamba, a recent selective structured state space model, excels in long sequence modeling, which is vital in the large model era. Long sequence modeling poses significant challenges, including capturing long-range dependencies within the data and handling the computational demands caused by their extensive length. Mamba addresses these challenges by overcoming the local perception limitations of convolutional neural networks and the quadratic computational complexity of Transformers. Given its advantages over these mainstream foundation architectures, Mamba exhibits great potential to be a visual foundation architecture. Since January 2024, Mamba has been actively applied to diverse computer vision tasks, yielding numerous contributions. To help keep pace with the rapid advancements, this paper reviews visual Mamba approaches, analyzing over 200 papers. This paper begins by delineating the formulation of the original Mamba model. Subsequently, it delves into representative backbone networks, and applications categorized using different modalities, including image, video, point cloud, and multi-modal data. Particularly, we identify scanning techniques as critical for adapting Mamba to vision tasks, and decouple these scanning techniques to clarify their functionality and enhance their flexibility across various applications. Finally, we discuss the challenges and future directions, providing insights into new outlooks in this fast evolving area. A comprehensive list of visual Mamba models reviewed in this work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models.
URLs: https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models.
Authors: Zhilin Zhang
Abstract: Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, with increasing research focused on enhancing VQA accuracy through advanced models such as Transformers. Despite this growing interest, limited work has examined the comparative effectiveness of textual encoders in VQA, particularly considering model complexity and computational efficiency. In this work, we conduct a comprehensive comparison between complex textual models that leverage long-range dependencies and simpler models focusing on local textual features within a well-established VQA framework. Our findings reveal that employing complex textual encoders is not invariably the optimal approach for the VQA-v2 dataset. Motivated by this insight, we propose ConvGRU, a model that incorporates convolutional layers to improve text feature representation without substantially increasing model complexity. Tested on the VQA-v2 dataset, ConvGRU demonstrates a modest yet consistent improvement over baselines for question types such as Number and Count, which highlights the potential of lightweight architectures for VQA tasks, especially when computational resources are limited.
Authors: Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, Yu Tsao, Hsin-Min Wang
Abstract: The emergence of contemporary deepfakes has attracted significant attention in machine learning research, as artificial intelligence (AI) generated synthetic media increases the incidence of misinterpretation and is difficult to distinguish from genuine content. Currently, machine learning techniques have been extensively studied for automatically detecting deepfakes. However, human perception has been less explored. Malicious deepfakes could ultimately cause public and social problems. Can we humans correctly perceive the authenticity of the content of the videos we watch? The answer is obviously uncertain; therefore, this paper aims to evaluate the human ability to discern deepfake videos through a subjective study. We present our findings by comparing human observers to five state-ofthe-art audiovisual deepfake detection models. To this end, we used gamification concepts to provide 110 participants (55 native English speakers and 55 non-native English speakers) with a webbased platform where they could access a series of 40 videos (20 real and 20 fake) to determine their authenticity. Each participant performed the experiment twice with the same 40 videos in different random orders. The videos are manually selected from the FakeAVCeleb dataset. We found that all AI models performed better than humans when evaluated on the same 40 videos. The study also reveals that while deception is not impossible, humans tend to overestimate their detection capabilities. Our experimental results may help benchmark human versus machine performance, advance forensics analysis, and enable adaptive countermeasures.
Authors: Rong Liu, Rui Xu, Yue Hu, Meida Chen, Andrew Feng
Abstract: 3D Gaussian Splatting (3DGS) has recently advanced radiance field reconstruction by offering superior capabilities for novel view synthesis and real-time rendering speed. However, its strategy of blending optimization and adaptive density control might lead to sub-optimal results; it can sometimes yield noisy geometry and blurry artifacts due to prioritizing optimizing large Gaussians at the cost of adequately densifying smaller ones. To address this, we introduce AtomGS, consisting of Atomized Proliferation and Geometry-Guided Optimization. The Atomized Proliferation constrains ellipsoid Gaussians of various sizes into more uniform-sized Atom Gaussians. The strategy enhances the representation of areas with fine features by placing greater emphasis on densification in accordance with scene details. In addition, we proposed a Geometry-Guided Optimization approach that incorporates an Edge-Aware Normal Loss. This optimization method effectively smooths flat surfaces while preserving intricate details. Our evaluation shows that AtomGS outperforms existing state-of-the-art methods in rendering quality. Additionally, it achieves competitive accuracy in geometry reconstruction and offers a significant improvement in training speed over other SDF-based methods. More interactive demos can be found in our website (https://rongliu-leo.github.io/AtomGS/).
Authors: Hao Fu, Naman Patel, Prashanth Krishnamurthy, Farshad Khorrami
Abstract: Detection of out-of-distribution (OOD) samples is crucial for safe real-world deployment of machine learning models. Recent advances in vision language foundation models have made them capable of detecting OOD samples without requiring in-distribution (ID) images. However, these zero-shot methods often underperform as they do not adequately consider ID class likelihoods in their detection confidence scoring. Hence, we introduce CLIPScope, a zero-shot OOD detection approach that normalizes the confidence score of a sample by class likelihoods, akin to a Bayesian posterior update. Furthermore, CLIPScope incorporates a novel strategy to mine OOD classes from a large lexical database. It selects class labels that are farthest and nearest to ID classes in terms of CLIP embedding distance to maximize coverage of OOD samples. We conduct extensive ablation studies and empirical evaluations, demonstrating state of the art performance of CLIPScope across various OOD detection benchmarks.
Authors: Juntae Kim, Sungwon Woo, Jongho Nang
Abstract: Image copy detection is the task of detecting edited copies of any image within a reference database. While previous approaches have shown remarkable progress, the large size of their networks and descriptors remains a disadvantage, complicating their practical application. In this paper, we propose a novel method that achieves competitive performance by using a lightweight network and compact descriptors. By utilizing relational self-supervised distillation to transfer knowledge from a large network to a small network, we enable the training of lightweight networks with smaller descriptor sizes. We introduce relational self-supervised distillation for flexible representation in a smaller feature space and apply contrastive learning with a hard negative loss to prevent dimensional collapse. For the DISC2021 benchmark, ResNet-50 and EfficientNet-B0 are used as the teacher and student models, respectively, with micro average precision improving by 5.0\%/4.9\%/5.9\% for 64/128/256 descriptor sizes compared to the baseline method. The code is available at \href{https://github.com/juntae9926/RDCD}{https://github.com/juntae9926/RDCD}.
URLs: https://github.com/juntae9926/RDCD, https://github.com/juntae9926/RDCD
Authors: Alfreds Lapkovskis, Natalia Nefedova, Ali Beikmohammadi
Abstract: Plant classification is vital for ecological conservation and agricultural productivity, enhancing our understanding of plant growth dynamics and aiding species preservation. The advent of deep learning (DL) techniques has revolutionized this field by enabling autonomous feature extraction, significantly reducing the dependence on manual expertise. However, conventional DL models often rely solely on single data sources, failing to capture the full biological diversity of plant species comprehensively. Recent research has turned to multimodal learning to overcome this limitation by integrating multiple data types, which enriches the representation of plant characteristics. This shift introduces the challenge of determining the optimal point for modality fusion. In this paper, we introduce a pioneering multimodal DL-based approach for plant classification with automatic modality fusion. Utilizing the multimodal fusion architecture search, our method integrates images from multiple plant organs--flowers, leaves, fruits, and stems--into a cohesive model. Our method achieves 82.61% accuracy on 979 classes of the PlantCLEF2015 dataset, surpassing state-of-the-art methods and outperforming late fusion by 10.33%. Through the incorporation of multimodal dropout, our approach demonstrates strong robustness to missing modalities. We validate our model against established benchmarks using standard performance metrics and McNemar's test, further underscoring its superiority.
Authors: Tanvir Mahmud, Mustafa Munir, Radu Marculescu, Diana Marculescu
Abstract: Video-to-video synthesis poses significant challenges in maintaining character consistency, smooth temporal transitions, and preserving visual quality during fast motion. While recent fully cross-frame self-attention mechanisms have improved character consistency across multiple frames, they come with high computational costs and often include redundant operations, especially for videos with higher frame rates. To address these inefficiencies, we propose an adaptive motion-guided cross-frame attention mechanism that selectively reduces redundant computations. This enables a greater number of cross-frame attentions over more frames within the same computational budget, thereby enhancing both video quality and temporal coherence. Our method leverages optical flow to focus on moving regions while sparsely attending to stationary areas, allowing for the joint editing of more frames without increasing computational demands. Traditional frame interpolation techniques struggle with motion blur and flickering in intermediate frames, which compromises visual fidelity. To mitigate this, we introduce KV-caching for jointly edited frames, reusing keys and values across intermediate frames to preserve visual quality and maintain temporal consistency throughout the video. With our adaptive cross-frame self-attention approach, we achieve a threefold increase in the number of keyframes processed compared to existing methods, all within the same computational budget as fully cross-frame attention baselines. This results in significant improvements in prediction accuracy and temporal consistency, outperforming state-of-the-art approaches. Code will be made publicly available at https://github.com/tanvir-utexas/AdaVE/tree/main
Authors: Haian Jin, Yuan Li, Fujun Luan, Yuanbo Xiangli, Sai Bi, Kai Zhang, Zexiang Xu, Jin Sun, Noah Snavely
Abstract: Single-image relighting is a challenging task that involves reasoning about the complex interplay between geometry, materials, and lighting. Many prior methods either support only specific categories of images, such as portraits, or require special capture conditions, like using a flashlight. Alternatively, some methods explicitly decompose a scene into intrinsic components, such as normals and BRDFs, which can be inaccurate or under-expressive. In this work, we propose a novel end-to-end 2D relighting diffusion model, called Neural Gaffer, that takes a single image of any object and can synthesize an accurate, high-quality relit image under any novel environmental lighting condition, simply by conditioning an image generator on a target environment map, without an explicit scene decomposition. Our method builds on a pre-trained diffusion model, and fine-tunes it on a synthetic relighting dataset, revealing and harnessing the inherent understanding of lighting present in the diffusion model. We evaluate our model on both synthetic and in-the-wild Internet imagery and demonstrate its advantages in terms of generalization and accuracy. Moreover, by combining with other generative methods, our model enables many downstream 2D tasks, such as text-based relighting and object insertion. Our model can also operate as a strong relighting prior for 3D tasks, such as relighting a radiance field.
Authors: Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman
Abstract: We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, these models often fall short in handling the intricacies of hand-object interactions (HOI), failing to produce realistic edits -- especially when object swapping results in object shape or functionality changes. To bridge this gap, we present HOI-Swap, a novel diffusion-based video editing framework trained in a self-supervised manner. Designed in two stages, the first stage focuses on object swapping in a single frame with HOI awareness; the model learns to adjust the interaction patterns, such as the hand grasp, based on changes in the object's properties. The second stage extends the single-frame edit across the entire sequence; we achieve controllable motion alignment with the original video by: (1) warping a new sequence from the stage-I edited frame based on sampled motion points and (2) conditioning video generation on the warped sequence. Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs.
Authors: Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuhene, Trevor Darrel, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky
Abstract: Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce `hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.
Authors: Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna
Abstract: Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in https://visualsketchpad.github.io/.
Authors: Zhikai Zhang, Yitang Li, Haofeng Huang, Mingxian Lin, Li Yi
Abstract: Human motion synthesis is a fundamental task in computer animation. Despite recent progress in this field utilizing deep learning and motion capture data, existing methods are always limited to specific motion categories, environments, and styles. This poor generalizability can be partially attributed to the difficulty and expense of collecting large-scale and high-quality motion data. At the same time, foundation models trained with internet-scale image and text data have demonstrated surprising world knowledge and reasoning ability for various downstream tasks. Utilizing these foundation models may help with human motion synthesis, which some recent works have superficially explored. However, these methods didn't fully unveil the foundation models' potential for this task and only support several simple actions and environments. In this paper, we for the first time, without any motion data, explore open-set human motion synthesis using natural language instructions as user control signals based on MLLMs across any motion task and environment. Our framework can be split into two stages: 1) sequential keyframe generation by utilizing MLLMs as a keyframe designer and animator; 2) motion filling between keyframes through interpolation and motion tracking. Our method can achieve general human motion synthesis for many downstream tasks. The promising results demonstrate the worth of mocap-free human motion synthesis aided by MLLMs and pave the way for future research.
Authors: Daiqing Qi, Handong Zhao, Zijun Wei, Sheng Li
Abstract: Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of object's attributed details. Intuitive solutions include improving the size and quality of data or using larger foundation models. They show effectiveness in mitigating these issues, but at an expensive cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process by the multimodal connector. In this paper, we first identify the limitations of multimodal connectors stemming from insufficient training data. Driven by this, we propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information such as object names and attributes. With our Tag-grounded visual instruction tuning with retrieval Augmentation (TUNA), we outperform baselines that share the same language model and training data on 12 benchmarks. Furthermore, we show the zero-shot capability of TUNA when provided with specific datastores.
Authors: Xiang Li, Jian Ding, Mohamed Elhoseiny
Abstract: We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. Exploring these improvement opportunities, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to significantly contribute to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://github.com/lx709/VRSBench.
Authors: Neng Wang, Ruibin Guo, Chenghao Shi, Ziyue Wang, Hui Zhang, Huimin Lu, Zhiqiang Zheng, Xieyuanli Chen
Abstract: 4D LiDAR semantic segmentation, also referred to as multi-scan semantic segmentation, plays a crucial role in enhancing the environmental understanding capabilities of autonomous robots. It classifies the semantic category of each LiDAR point and detects whether it is dynamic, a critical ability for tasks like obstacle avoidance and autonomous navigation. Existing approaches often rely on computationally heavy 4D convolutions or recursive networks, which result in poor real-time performance, making them unsuitable for online robotics and autonomous driving applications. In this paper, we introduce SegNet4D, a novel real-time 4D semantic segmentation network offering both efficiency and strong semantic understanding. SegNet4D addresses 4D segmentation as two tasks: single-scan semantic segmentation and moving object segmentation, each tackled by a separate network head. Both results are combined in a motion-semantic fusion module to achieve comprehensive 4D segmentation. Additionally, instance information is extracted from the current scan and exploited for instance-wise segmentation consistency. Our approach surpasses state-of-the-art in both multi-scan semantic segmentation and moving object segmentation while offering greater efficiency, enabling real-time operation. Besides, its effectiveness and efficiency have also been validated on a real-world robotic platform. Our code will be released at https://github.com/nubot-nudt/SegNet4D.
Authors: Markus Krimmel, Jan Achterhold, Joerg Stueckler
Abstract: Object-centric scene decompositions are important representations for downstream tasks in fields such as computer vision and robotics. The recently proposed Slot Attention module, already leveraged by several derivative works for image segmentation and object tracking in videos, is a deep learning component which performs unsupervised object-centric scene decomposition on input images. It is based on an attention architecture, in which latent slot vectors, which hold compressed information on objects, attend to localized perceptual features from the input image. In this paper, we demonstrate that design decisions on normalizing the aggregated values in the attention architecture have considerable impact on the capabilities of Slot Attention to generalize to a higher number of slots and objects as seen during training. We propose and investigate alternatives to the original normalization scheme which increase the generalization capabilities of Slot Attention to varying slot and object counts, resulting in performance gains on the task of unsupervised image segmentation. The newly proposed normalizations represent minimal and easy to implement modifications of the usual Slot Attention module, changing the value aggregation mechanism from a weighted mean operation to a scaled weighted sum operation.
Authors: Felix Fent, Fabian Kuttenreich, Florian Ruch, Farija Rizwin, Stefan Juergens, Lorenz Lechermann, Christian Nissler, Andrea Perl, Ulrich Voll, Min Yan, Markus Lienkamp
Abstract: Autonomous trucking is a promising technology that can greatly impact modern logistics and the environment. Ensuring its safety on public roads is one of the main duties that requires an accurate perception of the environment. To achieve this, machine learning methods rely on large datasets, but to this day, no such datasets are available for autonomous trucks. In this work, we present MAN TruckScenes, the first multimodal dataset for autonomous trucking. MAN TruckScenes allows the research community to come into contact with truck-specific challenges, such as trailer occlusions, novel sensor perspectives, and terminal environments for the first time. It comprises more than 740 scenes of 20s each within a multitude of different environmental conditions. The sensor set includes 4 cameras, 6 lidar, 6 radar sensors, 2 IMUs, and a high-precision GNSS. The dataset's 3D bounding boxes were manually annotated and carefully reviewed to achieve a high quality standard. Bounding boxes are available for 27 object classes, 15 attributes, and a range of more than 230m. The scenes are tagged according to 34 distinct scene tags, and all objects are tracked throughout the scene to promote a wide range of applications. Additionally, MAN TruckScenes is the first dataset to provide 4D radar data with 360{\deg} coverage and is thereby the largest radar dataset with annotated 3D bounding boxes. Finally, we provide extensive dataset analysis and baseline results. The dataset, development kit, and more are available online.
Authors: Zujin Guo, Wei Li, Chen Change Loy
Abstract: Motion modeling is critical in flow-based Video Frame Interpolation (VFI). Existing paradigms either consider linear combinations of bidirectional flows or directly predict bilateral flows for given timestamps without exploring favorable motion priors, thus lacking the capability of effectively modeling spatiotemporal dynamics in real-world videos. To address this limitation, in this study, we introduce Generalizable Implicit Motion Modeling (GIMM), a novel and effective approach to motion modeling for VFI. Specifically, to enable GIMM as an effective motion modeling paradigm, we design a motion encoding pipeline to model spatiotemporal motion latent from bidirectional flows extracted from pre-trained flow estimators, effectively representing input-specific motion priors. Then, we implicitly predict arbitrary-timestep optical flows within two adjacent input frames via an adaptive coordinate-based neural network, with spatiotemporal coordinates and motion latent as inputs. Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion. We show that GIMM performs better than the current state of the art on standard VFI benchmarks.
Authors: Hui Xian Grace Lim, Xuanming Cui, Ser-Nam Lim, Yogesh S Rawat
Abstract: Illustration is a fundamental mode of human expression and communication. Certain types of motion that accompany speech can provide this illustrative mode of communication. While Augmented and Virtual Reality technologies (AR/VR) have introduced tools for producing drawings with hand motions (air drawing), they typically require costly hardware and additional digital markers, thereby limiting their accessibility and portability. Furthermore, air drawing demands considerable skill to achieve aesthetic results. To address these challenges, we introduce the concept of AirSketch, aimed at generating faithful and visually coherent sketches directly from hand motions, eliminating the need for complicated headsets or markers. We devise a simple augmentation-based self-supervised training procedure, enabling a controllable image diffusion model to learn to translate from highly noisy hand tracking images to clean, aesthetically pleasing sketches, while preserving the essential visual cues from the original tracking data. We present two air drawing datasets to study this problem. Our findings demonstrate that beyond producing photo-realistic images from precise spatial inputs, controllable image diffusion can effectively produce a refined, clear sketch from a noisy input. Our work serves as an initial step towards marker-less air drawing and reveals distinct applications of controllable diffusion models to AirSketch and AR/VR in general.
Authors: Sehwan Choi, Jungho Kim, Hongjae Shin, Jun Won Choi
Abstract: In this paper, we introduce Mask2Map, a novel end-to-end online HD map construction method designed for autonomous driving applications. Our approach focuses on predicting the class and ordered point set of map instances within a scene, represented in the bird's eye view (BEV). Mask2Map consists of two primary components: the Instance-Level Mask Prediction Network (IMPNet) and the Mask-Driven Map Prediction Network (MMPNet). IMPNet generates Mask-Aware Queries and BEV Segmentation Masks to capture comprehensive semantic information globally. Subsequently, MMPNet enhances these query features using local contextual information through two submodules: the Positional Query Generator (PQG) and the Geometric Feature Extractor (GFE). PQG extracts instance-level positional queries by embedding BEV positional information into Mask-Aware Queries, while GFE utilizes BEV Segmentation Masks to generate point-level geometric features. However, we observed limited performance in Mask2Map due to inter-network inconsistency stemming from different predictions to Ground Truth (GT) matching between IMPNet and MMPNet. To tackle this challenge, we propose the Inter-network Denoising Training method, which guides the model to denoise the output affected by both noisy GT queries and perturbed GT Segmentation Masks. Our evaluation conducted on nuScenes and Argoverse2 benchmarks demonstrates that Mask2Map achieves remarkable performance improvements over previous state-of-the-art methods, with gains of 10.1% mAP and 4.1 mAP, respectively. Our code can be found at https://github.com/SehwanChoi0307/Mask2Map.
Authors: Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He
Abstract: Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words. However, lexical representations are difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on the modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA. Codes are available at https://github.com/Clementine24/LexVLA.
Authors: Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji
Abstract: In this work, we propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs) through learnable visual token optimization. We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them. Our approach involves adjusting visual tokens from the MLP output during inference, controlling which text prompt tokens attend to which visual tokens. We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referential abilities into MLLMs. Our method support referring with box, mask, scribble and point. The results demonstrate that our method exhibits controllability and interpretability.
Authors: Yibin Wang, Weizhong Zhang, Cheng Jin
Abstract: Current human image customization methods leverage Stable Diffusion (SD) for its rich semantic prior. However, since SD is not specifically designed for human-oriented generation, these methods often require extensive fine-tuning on large-scale datasets, which renders them susceptible to overfitting and hinders their ability to personalize individuals with previously unseen styles. Moreover, these methods extensively focus on single-concept human image synthesis and lack the flexibility to customize individuals using multiple given concepts, thereby impeding their broader practical application. This paper proposes MagicFace, a novel training-free method for multi-concept universal-style human image personalized synthesis. Our core idea is to simulate how humans create images given specific concepts, i.e., first establish a semantic layout considering factors such as concepts' shape and posture, then optimize details by comparing with concepts at the pixel level. To implement this process, we introduce a coarse-to-fine generation pipeline, involving two sequential stages: semantic layout construction and concept feature injection. This is achieved by our Reference-aware Self-Attention (RSA) and Region-grouped Blend Attention (RBA) mechanisms. In the first stage, RSA enables the latent image to query features from all reference concepts simultaneously, extracting the overall semantic understanding to facilitate the initial semantic layout establishment. In the second stage, we employ an attention-based semantic segmentation method to pinpoint the latent generated regions of all concepts at each step. Following this, RBA divides the pixels of the latent image into semantic groups, with each group querying fine-grained features from the corresponding reference concept. Extensive experiments demonstrate the superiority of our MagicFace.
Authors: Minye Wu, Tinne Tuytelaars
Abstract: Recent advancements in photo-realistic novel view synthesis have been significantly driven by Gaussian Splatting (3DGS). Nevertheless, the explicit nature of 3DGS data entails considerable storage requirements, highlighting a pressing need for more efficient data representations. To address this, we present Implicit Gaussian Splatting (IGS), an innovative hybrid model that integrates explicit point clouds with implicit feature embeddings through a multi-level tri-plane architecture. This architecture features 2D feature grids at various resolutions across different levels, facilitating continuous spatial domain representation and enhancing spatial correlations among Gaussian primitives. Building upon this foundation, we introduce a level-based progressive training scheme, which incorporates explicit spatial regularization. This method capitalizes on spatial correlations to enhance both the rendering quality and the compactness of the IGS representation. Furthermore, we propose a novel compression pipeline tailored for both point clouds and 2D feature grids, considering the entropy variations across different levels. Extensive experimental evaluations demonstrate that our algorithm can deliver high-quality rendering using only a few MBs, effectively balancing storage efficiency and rendering fidelity, and yielding results that are competitive with the state-of-the-art.
Authors: Diandian Guo, Weixin Si, Zhixi Li, Jialun Pei, Pheng-Ann Heng
Abstract: Pringle maneuver (PM) in laparoscopic liver resection aims to reduce blood loss and provide a clear surgical view by intermittently blocking blood inflow of the liver, whereas prolonged PM may cause ischemic injury. To comprehensively monitor this surgical procedure and provide timely warnings of ineffective and prolonged blocking, we suggest two complementary AI-assisted surgical monitoring tasks: workflow recognition and blocking effectiveness detection in liver resections. The former presents challenges in real-time capturing of short-term PM, while the latter involves the intraoperative discrimination of long-term liver ischemia states. To address these challenges, we meticulously collect a novel dataset, called PmLR50, consisting of 25,037 video frames covering various surgical phases from 50 laparoscopic liver resection procedures. Additionally, we develop an online baseline for PmLR50, termed PmNet. This model embraces Masked Temporal Encoding (MTE) and Compressed Sequence Modeling (CSM) for efficient short-term and long-term temporal information modeling, and embeds Contrastive Prototype Separation (CPS) to enhance action discrimination between similar intraoperative operations. Experimental results demonstrate that PmNet outperforms existing state-of-the-art surgical workflow recognition methods on the PmLR50 benchmark. Our research offers potential clinical applications for the laparoscopic liver surgery community. Source code and data will be publicly available.
Authors: Weihan Luo, Anagh Malik, David B. Lindell
Abstract: We consider the problem of few-viewpoint 3D surface reconstruction using raw measurements from a lidar system. Lidar captures 3D scene geometry by emitting pulses of light to a target and recording the speed-of-light time delay of the reflected light. However, conventional lidar systems do not output the raw, captured waveforms of backscattered light; instead, they pre-process these data into a 3D point cloud. Since this procedure typically does not accurately model the noise statistics of the system, exploit spatial priors, or incorporate information about downstream tasks, it ultimately discards useful information that is encoded in raw measurements of backscattered light. Here, we propose to leverage raw measurements captured with a single-photon lidar system from multiple viewpoints to optimize a neural surface representation of a scene. The measurements consist of time-resolved photon count histograms, or transients, which capture information about backscattered light at picosecond time scales. Additionally, we develop new regularization strategies that improve robustness to photon noise, enabling accurate surface reconstruction with as few as 10 photons per pixel. Our method outperforms other techniques for few-viewpoint 3D reconstruction based on depth maps, point clouds, or conventional lidar as demonstrated in simulation and with captured data.
Authors: Sarah Rastegar, Mohammadreza Salehi, Yuki M. Asano, Hazel Doughty, Cees G. M. Snoek
Abstract: In this paper, we address Generalized Category Discovery, aiming to simultaneously uncover novel categories and accurately classify known ones. Traditional methods, which lean heavily on self-supervision and contrastive learning, often fall short when distinguishing between fine-grained categories. To address this, we introduce a novel concept called `self-expertise', which enhances the model's ability to recognize subtle differences and uncover unknown categories. Our approach combines unsupervised and supervised self-expertise strategies to refine the model's discernment and generalization. Initially, hierarchical pseudo-labeling is used to provide `soft supervision', improving the effectiveness of self-expertise. Our supervised technique differs from traditional methods by utilizing more abstract positive and negative samples, aiding in the formation of clusters that can generalize to novel categories. Meanwhile, our unsupervised strategy encourages the model to sharpen its category distinctions by considering within-category examples as `hard' negatives. Supported by theoretical insights, our empirical results showcase that our method outperforms existing state-of-the-art techniques in Generalized Category Discovery across several fine-grained datasets. Our code is available at: https://github.com/SarahRastegar/SelEx.
Authors: Jiayu Liu, Shancong Mou, Nathan Gaw, Yinan Wang
Abstract: Anomaly detection is a long-standing challenge in manufacturing systems. Traditionally, anomaly detection has relied on human inspectors. However, 3D point clouds have gained attention due to their robustness to environmental factors and their ability to represent geometric data. Existing 3D anomaly detection methods generally fall into two categories. One compares scanned 3D point clouds with design files, assuming these files are always available. However, such assumptions are often violated in many real-world applications where model-free products exist, such as fresh produce (i.e., ``Cookie", ``Potato", etc.), dentures, bone, etc. The other category compares patches of scanned 3D point clouds with a library of normal patches named memory bank. However, those methods usually fail to detect incomplete shapes, which is a fairly common defect type (i.e., missing pieces of different products). The main challenge is that missing areas in 3D point clouds represent the absence of scanned points. This makes it infeasible to compare the missing region with existing point cloud patches in the memory bank. To address these two challenges, we proposed a unified, unsupervised 3D anomaly detection framework capable of identifying all types of defects on model-free products. Our method integrates two detection modules: a feature-based detection module and a reconstruction-based detection module. Feature-based detection covers geometric defects, such as dents, holes, and cracks, while the reconstruction-based method detects missing regions. Additionally, we employ a One-class Support Vector Machine (OCSVM) to fuse the detection results from both modules. The results demonstrate that (1) our proposed method outperforms the state-of-the-art methods in identifying incomplete shapes and (2) it still maintains comparable performance with the SOTA methods in detecting all other types of anomalies.
Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu
Abstract: Existing research often treats long-form videos as extended short videos, leading to several limitations: inadequate capture of long-range dependencies, inefficient processing of redundant information, and failure to extract high-level semantic concepts. To address these issues, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels, overcoming the challenge of long-range dependencies. Second, we propose a Semantics ReTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. This addresses the issues of redundancy and lack of high-level concept extraction. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
Authors: Jiaxin Guo, Jiangliu Wang, Ruofeng Wei, Di Kang, Qi Dou, Yun-hui Liu
Abstract: Visualizing surgical scenes is crucial for revealing internal anatomical structures during minimally invasive procedures. Novel View Synthesis is a vital technique that offers geometry and appearance reconstruction, enhancing understanding, planning, and decision-making in surgical scenes. Despite the impressive achievements of Neural Radiance Field (NeRF), its direct application to surgical scenes produces unsatisfying results due to two challenges: endoscopic sparse views and significant photometric inconsistencies. In this paper, we propose uncertainty-aware conditional NeRF for novel view synthesis to tackle the severe shape-radiance ambiguity from sparse surgical views. The core of UC-NeRF is to incorporate the multi-view uncertainty estimation to condition the neural radiance field for modeling the severe photometric inconsistencies adaptively. Specifically, our UC-NeRF first builds a consistency learner in the form of multi-view stereo network, to establish the geometric correspondence from sparse views and generate uncertainty estimation and feature priors. In neural rendering, we design a base-adaptive NeRF network to exploit the uncertainty estimation for explicitly handling the photometric inconsistencies. Furthermore, an uncertainty-guided geometry distillation is employed to enhance geometry learning. Experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in rendering appearance and geometry, consistently outperforming the current state-of-the-art approaches. Our code will be released at https://github.com/wrld/UC-NeRF.
Authors: Yangguang Chen, Tong Wang, Guanzhou Chen, Kun Zhu, Xiaoliang Tan, Jiaqi Wang, Wenchao Guo, Qing Wang, Xiaolong Luo, Xiaodong Zhang
Abstract: The detection of fa\c{c}ade elements on buildings, such as doors, windows, balconies, air conditioning units, billboards, and glass curtain walls, is a critical step in automating the creation of Building Information Modeling (BIM). Yet, this field faces significant challenges, including the uneven distribution of fa\c{c}ade elements, the presence of small objects, and substantial background noise, which hamper detection accuracy. To address these issues, we develop the BFA-YOLO model and the BFA-3D dataset in this study. The BFA-YOLO model is an advanced architecture designed specifically for analyzing multi-view images of fa\c{c}ade attachments. It integrates three novel components: the Feature Balanced Spindle Module (FBSM) that tackles the issue of uneven object distribution; the Target Dynamic Alignment Task Detection Head (TDATH) that enhances the detection of small objects; and the Position Memory Enhanced Self-Attention Mechanism (PMESA), aimed at reducing the impact of background noise. These elements collectively enable BFA-YOLO to effectively address each challenge, thereby improving model robustness and detection precision. The BFA-3D dataset, offers multi-view images with precise annotations across a wide range of fa\c{c}ade attachment categories. This dataset is developed to address the limitations present in existing fa\c{c}ade detection datasets, which often feature a single perspective and insufficient category coverage. Through comparative analysis, BFA-YOLO demonstrated improvements of 1.8\% and 2.9\% in mAP$_{50}$ on the BFA-3D dataset and the public Fa\c{c}ade-WHU dataset, respectively, when compared to the baseline YOLOv8 model. These results highlight the superior performance of BFA-YOLO in fa\c{c}ade element detection and the advancement of intelligent BIM technologies.
Authors: Diego Porres
Abstract: The latent space of many generative models are rich in unexplored valleys and mountains. The majority of tools used for exploring them are so far limited to Graphical User Interfaces (GUIs). While specialized hardware can be used for this task, we show that a simple feature extraction of pre-trained Convolutional Neural Networks (CNNs) from a live RGB camera feed does a very good job at manipulating the latent space with simple changes in the scene, with vast room for improvement. We name this new paradigm Visual-reactive Interpolation, and the full code can be found at https://github.com/PDillis/stylegan3-fun.
Authors: Boyan Li, Shengyi Ding, Deen Ma, Yixuan Wu, Hongjie Liao, Kaiyuan Hu
Abstract: Millimeter wave sensing provides people with the capability of sensing the surrounding crowds in a non-invasive and privacy-preserving manner, which holds huge application potential. However, detecting stationary crowds remains challenging due to several factors such as minimal movements (like breathing or casual fidgets), which can be easily treated as noise clusters during data collection and consequently filtered in the following processing procedures. Additionally, the uneven distribution of signal power due to signal power attenuation and interferences resulting from external reflectors or absorbers further complicates accurate detection. To address these challenges and enable stationary crowd detection across various application scenarios requiring specialized domain adaption, we introduce LLMCount, the first system to harness the capabilities of large-language models (LLMs) to enhance crowd detection performance. By exploiting the decision-making capability of LLM, we can successfully compensate the signal power to acquire a uniform distribution and thereby achieve a detection with higher accuracy. To assess the system's performance, comprehensive evaluations are conducted under diversified scenarios like hall, meeting room, and cinema. The evaluation results show that our proposed approach reaches high detection accuracy with lower overall latency compared with previous methods.
Authors: Rwiddhi Chakraborty, Yinong Wang, Jialu Gao, Runkai Zheng, Cheng Zhang, Fernando De la Torre
Abstract: The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present ConBias, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. ConBias represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by Conbias improves generalization performance across multiple datasets compared to state-of-the-art methods.
Authors: Weihan Xu, Paul Pu Liang, Haven Kim, Julian McAuley, Taylor Berg-Kirkpatrick, Hao-Wen Dong
Abstract: Teasers are an effective tool for promoting content in entertainment, commercial and educational fields. However, creating an effective teaser for long videos is challenging for it requires long-range multimodal modeling on the input videos, while necessitating maintaining audiovisual alignments, managing scene changes and preserving factual accuracy for the output teasers. Due to the lack of a publicly-available dataset, progress along this research direction has been hindered. In this work, we present DocumentaryNet, a collection of 1,269 documentaries paired with their teasers, featuring multimodal data streams of video, speech, music, sound effects and narrations. With DocumentaryNet, we propose a new two-stage system for generating teasers from long documentaries. The proposed TeaserGen system first generates the teaser narration from the transcribed narration of the documentary using a pretrained large language model, and then selects the most relevant visual content to accompany the generated narration through language-vision models. For narration-video matching, we explore two approaches: a pretraining-based model using pretrained contrastive language-vision models and a deep sequential model that learns the mapping between the narrations and visuals. Our experimental results show that the pretraining-based approach is more effective at identifying relevant visual content than directly trained deep autoregressive models.
Authors: Mohammadreza Salehi, Jae Sung Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi
Abstract: Our world is full of varied actions and moves across specialized domains that we, as humans, strive to identify and understand. Within any single domain, actions can often appear quite similar, making it challenging for deep models to distinguish them accurately. To evaluate the effectiveness of multimodal foundation models in helping us recognize such actions, we present ActionAtlas v1.0, a multiple-choice video question answering benchmark featuring short videos across various sports. Each video in the dataset is paired with a question and four or five choices. The question pinpoints specific individuals, asking which choice "best" describes their action within a certain temporal context. Overall, the dataset includes 934 videos showcasing 580 unique actions across 56 sports, with a total of 1896 actions within choices. Unlike most existing video question answering benchmarks that only cover simplistic actions, often identifiable from a single frame, ActionAtlas focuses on intricate movements and rigorously tests the model's capability to discern subtle differences between moves that look similar within each domain. We evaluate open and proprietary foundation models on this benchmark, finding that the best model, GPT-4o, achieves a maximum accuracy of 45.52%. Meanwhile, Non-expert crowd workers, provided with action description for each choice, achieve 61.64% accuracy, where random chance is approximately 21%. Our findings with state-of-the-art models indicate that having a high frame sampling rate is important for accurately recognizing actions in ActionAtlas, a feature that some leading proprietary video models, such as Gemini, do not include in their default configuration.
Authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Abstract: In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance. For instance, we can alter the background environment of a sounding object while keeping its appearance unchanged, or we can add new sounds contextualized to the visual content. To address this task, we propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas. Firstly, we propose a one-shot adaptation approach to tailor generative diffusion models for audio-visual content editing. With as few as one audio-visual sample, we jointly transfer the audio and vision diffusion models to the target domain. After fine-tuning, our model enables consistent generation of this audio-visual sample. Secondly, we introduce a cross-modal semantic enhancement approach. We observe that when using language as content editing guidance, the vision branch may overlook editing requirements. This phenomenon, termed catastrophic neglect, hampers audio-visual alignment during content editing. We therefore enhance semantic consistency between language and vision to mitigate this issue. Extensive experiments validate the effectiveness of our method in language-based audio-visual editing and highlight its superiority over several baseline approaches. We recommend that readers visit our project page for more details: https://liangsusan-git.github.io/project/avedit/.
Authors: Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han
Abstract: We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at https://github.com/mit-han-lab/efficientvit.
Authors: Xiaoxuan Ma, Yutang Lin, Yuan Xu, Stephan P. Kaufhold, Jack Terwilliger, Andres Meza, Yixin Zhu, Federico Rossano, Yizhou Wang
Abstract: Understanding non-human primate behavior is crucial for improving animal welfare, modeling social behavior, and gaining insights into both distinctly human and shared behaviors. Despite recent advances in computer vision, automated analysis of primate behavior remains challenging due to the complexity of their social interactions and the lack of specialized algorithms. Existing methods often struggle with the nuanced behaviors and frequent occlusions characteristic of primate social dynamics. This study aims to develop an effective method for automated detection, tracking, and recognition of chimpanzee behaviors in video footage. Here we show that our proposed method, AlphaChimp, an end-to-end approach that simultaneously detects chimpanzee positions and estimates behavior categories from videos, significantly outperforms existing methods in behavior recognition. AlphaChimp achieves approximately 10% higher tracking accuracy and a 20% improvement in behavior recognition compared to state-of-the-art methods, particularly excelling in the recognition of social behaviors. This superior performance stems from AlphaChimp's innovative architecture, which integrates temporal feature fusion with a Transformer-based self-attention mechanism, enabling more effective capture and interpretation of complex social interactions among chimpanzees. Our approach bridges the gap between computer vision and primatology, enhancing technical capabilities and deepening our understanding of primate communication and sociality. We release our code and models and hope this will facilitate future research in animal social dynamics. This work contributes to ethology, cognitive science, and artificial intelligence, offering new perspectives on social intelligence.
Authors: Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo, Francesc Moreno-Noguer
Abstract: Understanding trajectories in multi-agent scenarios requires addressing various tasks, including predicting future movements, imputing missing observations, inferring the status of unseen agents, and classifying different global states. Traditional data-driven approaches often handle these tasks separately with specialized models. We introduce TranSPORTmer, a unified transformer-based framework capable of addressing all these tasks, showcasing its application to the intricate dynamics of multi-agent sports scenarios like soccer and basketball. Using Set Attention Blocks, TranSPORTmer effectively captures temporal dynamics and social interactions in an equivariant manner. The model's tasks are guided by an input mask that conceals missing or yet-to-be-predicted observations. Additionally, we introduce a CLS extra agent to classify states along soccer trajectories, including passes, possessions, uncontrolled states, and out-of-play intervals, contributing to an enhancement in modeling trajectories. Evaluations on soccer and basketball datasets show that TranSPORTmer outperforms state-of-the-art task-specific models in player forecasting, player forecasting-imputation, ball inference, and ball imputation. https://youtu.be/8VtSRm8oGoE
Authors: Nagarajan Ganapathy, Podakanti Satyajith Chary, Teja Venkata Ramana Kumar Pithani, Pavan Kavati, Arun Kumar S
Abstract: This Paper presents an advanced approach for fine-tuning BiomedCLIP PubMedBERT, a multimodal model, to classify abnormalities in Video Capsule Endoscopy (VCE) frames, aiming to enhance diagnostic efficiency in gastrointestinal healthcare. By integrating the PubMedBERT language model with a Vision Transformer (ViT) to process endoscopic images, our method categorizes images into ten specific classes: angioectasia, bleeding, erosion, erythema, foreign body, lymphangiectasia, polyp, ulcer, worms, and normal. Our workflow incorporates image preprocessing and fine-tunes the BiomedCLIP model to generate high-quality embeddings for both visual and textual inputs, aligning them through similarity scoring for classification. Performance metrics, including classification, accuracy, recall, and F1 score, indicate the models strong ability to accurately identify abnormalities in endoscopic frames, showing promise for practical use in clinical diagnostics.
Authors: Bohang Sun (School of Information,Software Engineering, University of Electronic Science,Technology of China, Chengdu, China)
Abstract: In this study, we introduce an enhanced version of ViT that conducts attention-based QKV operations during the initial stages of downsampling. Performing attention directly on high-resolution feature maps is computationally demanding due to the large size and numerous tokens. To mitigate this, we propose a filter attention mechanism that uses a Filter Block to create a salient mask (Filter Mask) for selecting the most informative pixels for attention. The Filter Block scores the pixels of the feature map, and we sort these scores to retain only the top K pixels (with K varying across layers). This approach effectively decreases the number of tokens involved in the attention computation, reducing computational complexity and boosting processing speed. Furthermore, the salient mask provides interpretability, as the model focuses on regions of the image most critical to the outcome. Our experimental results show that this model improves parameter efficiency and computational speed while enhancing accuracy. Compared to existing models, our approach significantly reduces resource consumption while maintaining high performance.
Authors: Zheng Yu, Yaohua Wang, Siying Cui, Aixi Zhang, Wei-Long Zheng, Senzhang Wang
Abstract: Facial parts swapping aims to selectively transfer regions of interest from the source image onto the target image while maintaining the rest of the target image unchanged. Most studies on face swapping designed specifically for full-face swapping, are either unable or significantly limited when it comes to swapping individual facial parts, which hinders fine-grained and customized character designs. However, designing such an approach specifically for facial parts swapping is challenged by a reasonable multiple reference feature fusion, which needs to be both efficient and effective. To overcome this challenge, FuseAnyPart is proposed to facilitate the seamless "fuse-any-part" customization of the face. In FuseAnyPart, facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module. Subsequently, the consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters. Extensive experiments qualitatively and quantitatively validate the superiority and robustness of FuseAnyPart. Source codes are available at https://github.com/Thomas-wyh/FuseAnyPart.
Authors: Hongyu Chen, Chengcheng Chen, Fei Wang, Yuhu Shi, Weiming Zeng
Abstract: Recent advancements in synthetic aperture radar (SAR) ship detection using deep learning have significantly improved accuracy and speed, yet effectively detecting small objects in complex backgrounds with fewer parameters remains a challenge. This letter introduces RSNet, a lightweight framework constructed to enhance ship detection in SAR imagery. To ensure accuracy with fewer parameters, we proposed Waveletpool-ContextGuided (WCG) as its backbone, guiding global context understanding through multi-scale wavelet features for effective detection in complex scenes. Additionally, Waveletpool-StarFusion (WSF) is introduced as the neck, employing a residual wavelet element-wise multiplication structure to achieve higher dimensional nonlinear features without increasing network width. The Lightweight-Shared (LS) module is designed as detect components to achieve efficient detection through lightweight shared convolutional structure and multi-format compatibility. Experiments on the SAR Ship Detection Dataset (SSDD) and High-Resolution SAR Image Dataset (HRSID) demonstrate that RSNet achieves a strong balance between lightweight design and detection performance, surpassing many state-of-the-art detectors, reaching 72.5\% and 67.6\% in \textbf{\(\mathbf{mAP_{.50:.95}}\) }respectively with 1.49M parameters. Our code will be released soon.
Authors: Mingkun Zhang, Keping Bi, Wei Chen, Quanrun Chen, Jiafeng Guo, Xueqi Cheng
Abstract: Despite ongoing efforts to defend neural classifiers from adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are difficult to be cheated by subtle manipulations, since we make judgments only based on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate the perturbations as non-causative factors and make predictions only based on the label-causative factors. Concretely, we propose a casual diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of casual factors by learning towards a novel casual information bottleneck objective. Empirically, CausalDiff has significantly outperformed state-of-the-art defense methods on various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark).
Authors: Duc Dang Trung Tran, Byeongkeun Kang, Yeejin Lee
Abstract: Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representation and introduces a twin-attention mechanism to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200 and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.
Authors: Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E. Jones, Oisin Mac Aodha, Sara Beery, Grant Van Horn
Abstract: We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io
Authors: Meng Cao, Yuyang Liu, Yingfei Liu, Tiancai Wang, Jiahua Dong, Henghui Ding, Xiangyu Zhang, Ian Reid, Xiaodan Liang
Abstract: Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language Models (LVLMs) to meet individual task requirements. To date, most of the existing approaches are confined to single-task adaptation, whereas the requirements in real-world scenarios are inherently varied and continually evolving. Thus an ideal LVLM should sustain continual instruction tuning in the face of stream-task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To achieve this, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses the aforementioned domain-incremental, capability-incremental, and dataset-incremental configurations. In terms of methodology, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs. To circumvent the additional overhead associated with experience replay, we freeze LVLMs and construct the dual increment embeddings for each input instruction to facilitate parameter-efficient tuning. Specifically, the increment embeddings can be decomposed into two principal components: 1) intrinsic increment embeddings to encode task-specific characteristics. To achieve this, we set up a low-rank pool containing candidate embeddings, from which we select the relevant ones based on their similarity with the user instructions; 2) contextual increment embeddings to investigate the inter-dependencies across tasks. In this regard, the low-rank embeddings chosen in the previous tasks are aggregated via learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.
Authors: Cangxiong Chen, Vinay P. Namboodiri, Julia E. Sero
Abstract: The spatio-temporal nature of live-cell microscopy data poses challenges in the analysis of cell states which is fundamental in bioimaging. Deep-learning based segmentation or tracking methods rely on large amount of high quality annotations to work effectively. In this work, we explore an alternative solution: using feature maps obtained from self-supervised representation learning (SSRL) on time arrow prediction (TAP) for the downstream supervised task of cell event recognition. We demonstrate through extensive experiments and analysis that this approach can achieve better performance with limited annotation compared to models trained from end to end using fully supervised approach. Our analysis also provides insight into applications of the SSRL using TAP in live-cell microscopy.
Authors: Siddharth Seth, Rishabh Dabral, Diogo Luvizon, Marc Habermann, Ming-Hsuan Yang, Christian Theobalt, Adam Kortylewski
Abstract: Modeling a human avatar that can plausibly deform to articulations is an active area of research. We present PocoLoco -- the first template-free, point-based, pose-conditioned generative model for 3D humans in loose clothing. We motivate our work by noting that most methods require a parametric model of the human body to ground pose-dependent deformations. Consequently, they are restricted to modeling clothing that is topologically similar to the naked body and do not extend well to loose clothing. The few methods that attempt to model loose clothing typically require either canonicalization or a UV-parameterization and need to address the challenging problem of explicitly estimating correspondences for the deforming clothes. In this work, we formulate avatar clothing deformation as a conditional point-cloud generation task within the denoising diffusion framework. Crucially, our framework operates directly on unordered point clouds, eliminating the need for a parametric model or a clothing template. This also enables a variety of practical applications, such as point-cloud completion and pose-based editing -- important features for virtual human animation. As current datasets for human avatars in loose clothing are far too small for training diffusion models, we release a dataset of two subjects performing various poses in loose clothing with a total of 75K point clouds. By contributing towards tackling the challenging task of effectively modeling loose clothing and expanding the available data for training these models, we aim to set the stage for further innovation in digital humans. The source code is available at https://github.com/sidsunny/pocoloco .
Authors: Zhenyue Qin, Yiqun Zhang, Yang Liu, Dylan Campbell
Abstract: Generative text-to-image models, such as Stable Diffusion, have demonstrated a remarkable ability to generate diverse, high-quality images. However, they are surprisingly inept when it comes to rendering human hands, which are often anatomically incorrect or reside in the "uncanny valley". In this paper, we propose a method HandCraft for restoring such malformed hands. This is achieved by automatically constructing masks and depth images for hands as conditioning signals using a parametric model, allowing a diffusion-based image editor to fix the hand's anatomy and adjust its pose while seamlessly integrating the changes into the original image, preserving pose, color, and style. Our plug-and-play hand restoration solution is compatible with existing pretrained diffusion models, and the restoration process facilitates adoption by eschewing any fine-tuning or training requirements for the diffusion models. We also contribute MalHand datasets that contain generated images with a wide variety of malformed hands in several styles for hand detector training and hand restoration benchmarking, and demonstrate through qualitative and quantitative evaluation that HandCraft not only restores anatomical correctness but also maintains the integrity of the overall image.
Authors: Olaf Wysocki, Yue Tan, Thomas Froech, Yan Xia, Magdalena Wysocki, Ludwig Hoegner, Daniel Cremers, Christoph Holst
Abstract: Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed the influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA, we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with real-world challenging classes and uniform methods' comparison. Realizing the LoFG, we present to date the largest semantic 3D facade segmentation dataset, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.
Authors: Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, Xiaodan Liang
Abstract: The advent of AI-Generated Content (AIGC) has spurred research into automated video generation to streamline conventional processes. However, automating storytelling video production, particularly for customized narratives, remains challenging due to the complexity of maintaining subject consistency across shots. While existing approaches like Mora and AesopAgent integrate multiple agents for Story-to-Video (S2V) generation, they fall short in preserving protagonist consistency and supporting Customized Storytelling Video Generation (CSVG). To address these limitations, we propose StoryAgent, a multi-agent framework designed for CSVG. StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. Notably, our framework includes agents for story design, storyboard generation, video creation, agent coordination, and result evaluation. Leveraging the strengths of different models, StoryAgent enhances control over the generation process, significantly improving character consistency. Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency, while a novel storyboard generation pipeline is proposed to maintain subject consistency across shots. Extensive experiments demonstrate the effectiveness of our approach in synthesizing highly consistent storytelling videos, outperforming state-of-the-art methods. Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
Authors: Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang
Abstract: Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual information in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources, and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to identify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
Authors: David Willmes, Nick Krall, James Tanis, Zachary Terner, Fernando Tavares, Chris Miller, Joe Haberlin III, Matt Crichton, Alexander Schlichting
Abstract: With billions of people facing moderate or severe food insecurity, the resilience of the global food supply will be of increasing concern due to the effects of climate change and geopolitical events. In this paper we describe a framework to better identify food security hotspots using a combination of remote sensing, deep learning, crop yield modeling, and causal modeling of the food distribution system. While we feel that the methods are adaptable to other regions of the world, we focus our analysis on the wheat breadbasket of northern India, which supplies a large percentage of the world's population. We present a quantitative analysis of deep learning domain adaptation methods for wheat farm identification based on curated remote sensing data from France. We model climate change impacts on crop yields using the existing crop yield modeling tool WOFOST and we identify key drivers of crop simulation error using a longitudinal penalized functional regression. A description of a system dynamics model of the food distribution system in India is also presented, along with results of food insecurity identification based on seeding this model with the predicted crop yields.
Authors: Corwin Grant Jeon MacMillan, K. Andrea Scott, Zhao Pan
Abstract: Rapid ice recession in the Arctic Ocean, with predictions of ice-free summers by 2060, opens new maritime routes but requires reliable navigation solutions. Current approaches rely heavily on subjective expert judgment, underscoring the need for automated, data-driven solutions. This study leverages machine learning to assess ice conditions using ship-borne optical data, introducing a finely annotated dataset of 946 images, and a semi-manual, region-based annotation technique. The proposed video segmentation model, UPerFlow, advances the SegFlow architecture by incorporating a six-channel ResNet encoder, two UPerNet-based segmentation decoders for each image, PWCNet as the optical flow encoder, and cross-connections that integrate bi-directional flow features without loss of latent information. The proposed architecture outperforms baseline image segmentation networks by an average 38\% in occluded regions, demonstrating the robustness of video segmentation in addressing challenging Arctic conditions.
Authors: Hankyeol Lee, Gawon Seo, Wonseok Choi, Geunyoung Jung, Kyungwoo Song, Jiyoung Jung
Abstract: The performance of vision-language models (VLMs), such as CLIP, in visual classification tasks, has been enhanced by leveraging semantic knowledge from large language models (LLMs), including GPT. Recent studies have shown that in zero-shot classification tasks, descriptors incorporating additional cues, high-level concepts, or even random characters often outperform those using only the category name. In many classification tasks, while the top-1 accuracy may be relatively low, the top-5 accuracy is often significantly higher. This gap implies that most misclassifications occur among a few similar classes, highlighting the model's difficulty in distinguishing between classes with subtle differences. To address this challenge, we introduce a novel concept of comparative descriptors. These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation. By generating and integrating these comparative descriptors into the classification framework, we refine the semantic focus and improve classification accuracy. An additional filtering process ensures that these descriptors are closer to the image embeddings in the CLIP space, further enhancing performance. Our approach demonstrates improved accuracy and robustness in visual classification tasks by addressing the specific challenge of subtle inter-class differences.
Authors: Shreyans Jain, Aadya Arora, Viraj Vekaria, Karan Gandhi
Abstract: Shadow removal and segmentation remain challenging tasks in computer vision, particularly in complex real world scenarios. This study presents a novel approach that enhances the ShadowFormer model by incorporating Masked Autoencoder (MAE) priors and Fast Fourier Convolution (FFC) blocks, leading to significantly faster convergence and improved performance. We introduce key innovations: (1) integration of MAE priors trained on Places2 dataset for better context understanding, (2) adoption of Haar wavelet features for enhanced edge detection and multiscale analysis, and (3) implementation of a modified SAM Adapter for robust shadow segmentation. Extensive experiments on the challenging DESOBA dataset demonstrate that our approach achieves state of the art results, with notable improvements in both convergence speed and shadow removal quality.
Authors: Rickard Br\"uel-Gabrielsson, Tongzhou Wang, Manel Baradad, Justin Solomon
Abstract: We introduce Deep Augmentation, an approach to implicit data augmentation using dropout or PCA to transform a targeted layer within a neural network to improve performance and generalization. We demonstrate Deep Augmentation through extensive experiments on contrastive learning tasks in NLP, computer vision, and graph learning. We observe substantial performance gains with Transformers, ResNets, and Graph Neural Networks as the underlying models in contrastive learning, but observe inverse effects on the corresponding supervised problems. Our analysis suggests that Deep Augmentation alleviates co-adaptation between layers, a problem exhibited by self-supervised learning where ground truth labels are not available. We use this observation to formulate a method for selecting which layer to target; in particular, our experimentation reveals that targeting deeper layers with Deep Augmentation outperforms augmenting the input data. The simple network- and modality-agnostic nature of this approach enables its integration into various machine learning pipelines.
Authors: Yongquan Yang, Fengling Li, Yani Wei, Yuanyuan Zhao, Jing Fu, Xiuli Xiao, Hong Bu
Abstract: In breast cancer, neoadjuvant chemotherapy (NAC) provides a standard treatment option for patients who have locally advanced cancer and some large operable tumors. A patient will have better prognosis when he has achieved a pathological complete response (pCR) with the treatment of NAC. There has been a trend to directly predict pCR to NAC from histological images based on deep learning (DL). However, the DL-based predictive models numerically have better performances in internal validation than in external validation. In this paper, we aim to alleviate this situation with an intrinsic approach. We propose an experts' cognition-driven ensemble deep learning (ECDEDL) approach. Taking the cognition of both pathology and artificial intelligence experts into consideration to improve the generalization of the predictive model to the external validation, ECDEDL can intrinsically approximate the working paradigm of a human being which will refer to his various working experiences to make decisions. ECDEDL was validated with 695 WSIs collected from the same center as the primary dataset to develop the predictive model and perform the internal validation, and was also validated with 340 WSIs collected from other three centers as the external dataset to perform the external validation. In external validation, ECDEDL improves the AUCs of pCR prediction from 61.52(59.80-63.26) to 67.75(66.74-68.80) and the Accuracies of pCR prediction from 56.09(49.39-62.79) to 71.01(69.44-72.58). ECDEDL was quite effective for external validation of predicting pCR to NAC from histological images in breast cancer, numerically approximating the internal validation.
Authors: Yan Scholten, Jan Schuchardt, Aleksandar Bojchevski, Stephan G\"unnemann
Abstract: Real-world data is complex and often consists of objects that can be decomposed into multiple entities (e.g. images into pixels, graphs into interconnected nodes). Randomized smoothing is a powerful framework for making models provably robust against small changes to their inputs - by guaranteeing robustness of the majority vote when randomly adding noise before classification. Yet, certifying robustness on such complex data via randomized smoothing is challenging when adversaries do not arbitrarily perturb entire objects (e.g. images) but only a subset of their entities (e.g. pixels). As a solution, we introduce hierarchical randomized smoothing: We partially smooth objects by adding random noise only on a randomly selected subset of their entities. By adding noise in a more targeted manner than existing methods we obtain stronger robustness guarantees while maintaining high accuracy. We initialize hierarchical smoothing using different noising distributions, yielding novel robustness certificates for discrete and continuous domains. We experimentally demonstrate the importance of hierarchical smoothing in image and node classification, where it yields superior robustness-accuracy trade-offs. Overall, hierarchical smoothing is an important contribution towards models that are both - certifiably robust to perturbations and accurate.
Authors: Subham Sekhar Sahoo, Aaron Gokaslan, Chris De Sa, Volodymyr Kuleshov
Abstract: Diffusion models have gained traction as powerful algorithms for synthesizing high-quality images. Central to these algorithms is the diffusion process, a set of equations which maps data to noise in a way that can significantly affect performance. In this paper, we explore whether the diffusion process can be learned from data. Our work is grounded in Bayesian inference and seeks to improve log-likelihood estimation by casting the learned diffusion process as an approximate variational posterior that yields a tighter lower bound (ELBO) on the likelihood. A widely held assumption is that the ELBO is invariant to the noise process: our work dispels this assumption and proposes multivariate learned adaptive noise (MULAN), a learned diffusion process that applies noise at different rates across an image. Specifically, our method relies on a multivariate noise schedule that is a function of the data to ensure that the ELBO is no longer invariant to the choice of the noise schedule as in previous works. Empirically, MULAN sets a new state-of-the-art in density estimation on CIFAR-10 and ImageNet and reduces the number of training steps by 50%. We provide the code, along with a blog post and video tutorial on the project page: https://s-sahoo.com/MuLAN
Authors: Liam Chalcroft, Ioannis Pappas, Cathy J. Price, John Ashburner
Abstract: Current deep learning-based approaches to lesion segmentation in neuroimaging often depend on high-resolution images and extensive annotated data, limiting clinical applicability. This paper introduces a novel synthetic data framework tailored for stroke lesion segmentation, expanding the SynthSeg methodology to incorporate lesion-specific augmentations that simulate diverse pathological features. Using a modified nnUNet architecture, our approach trains models with label maps from healthy and stroke datasets, facilitating segmentation across both normal and pathological tissue without reliance on specific sequence-based training. Evaluation across in-domain and out-of-domain (OOD) datasets reveals that our method matches state-of-the-art performance within the training domain and significantly outperforms existing methods on OOD data. By minimizing dependence on large annotated datasets and allowing for cross-sequence applicability, our framework holds potential to improve clinical neuroimaging workflows, particularly in stroke pathology. PyTorch training code and weights are publicly available at https://github.com/liamchalcroft/SynthStroke, along with an SPM toolbox featuring a plug-and-play model at https://github.com/liamchalcroft/SynthStrokeSPM.
URLs: https://github.com/liamchalcroft/SynthStroke,, https://github.com/liamchalcroft/SynthStrokeSPM.
Authors: Manas Gogoi, Sambhavi Tiwari, Shekhar Verma
Abstract: The reason for Meta Overfitting can be attributed to two factors: Mutual Non-exclusivity and the Lack of diversity, consequent to which a single global function can fit the support set data of all the meta-training tasks and fail to generalize to new unseen tasks. This issue is evidenced by low error rates on the meta-training tasks, but high error rates on new tasks. However, there can be a number of novel solutions to this problem keeping in mind any of the two objectives to be attained, i.e. to increase diversity in the tasks and to reduce the confidence of the model for some of the tasks. In light of the above, this paper proposes a number of solutions to tackle meta-overfitting on few-shot learning settings, such as few-shot sinusoid regression and few shot classification. Our proposed approaches demonstrate improved generalization performance compared to state-of-the-art baselines for learning in a non-mutually exclusive task setting. Overall, this paper aims to provide insights into tackling overfitting in meta-learning and to advance the field towards more robust and generalizable models.
Authors: Kleanthis Marios Papadopoulos
Abstract: The abundance of information present in Whole Slide Images (WSIs) renders them an essential tool for survival analysis. Several Multiple Instance Learning frameworks proposed for this task utilize a ResNet50 backbone pre-trained on natural images. By leveraging recenetly released histopathological foundation models such as UNI and Hibou, the predictive prowess of existing MIL networks can be enhanced. Furthermore, deploying an ensemble of digital pathology foundation models yields higher baseline accuracy, although the benefits appear to diminish with more complex MIL architectures. Our code will be made publicly available upon acceptance.
Authors: Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen
Abstract: Generative AI has made remarkable strides to revolutionize fields such as image and video generation. These advancements are driven by innovative algorithms, architecture, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, FVD, etc often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes an open platform GenAI-Arena to evaluate different image and video generative models, where users can actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three tasks of text-to-image generation, text-to-video generation, and image editing respectively. Currently, we cover a total of 35 open-source generative models. GenAI-Arena has been operating for seven months, amassing over 9000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote the research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt the existing multi-modal models like Gemini, and GPT-4o to mimic human voting. We compute the accuracy by comparing the model voting with the human voting to understand their judging abilities. Our results show existing multimodal models are still lagging in assessing the generated visual content, even the best model GPT-4o only achieves an average accuracy of 49.19 across the three generative tasks. Open-source MLLMs perform even worse due to the lack of instruction-following and reasoning ability in complex vision scenarios.
Authors: Pandiyaraju V, Shravan Venkatraman, Pavan Kumar S, Santhosh Malarvannan, Kannan A
Abstract: Breast cancer is a leading cause of mortality worldwide, and demands the critical need for early and accurate diagnostic tools. Ultrasound imaging is a widely used modality for breast cancer screening, yet the precise segmentation and classification of tumors in these images are challenging due to variations in tumor morphology and image quality. To address these challenges, we propose novel deep learning (DL) frameworks leveraging a precision mapping mechanism (PMM) along with a component-specific feature enhancement module (CSFEM) to improve breast cancer lesion segmentation and identification. Our PPM ensures that the segmentation accurately reflects the true shape and extent of the tumor by meticulously delineating their boundaries. The CSFEM focuses on extracting and amplifying features unique to different tumor types, enabling the model to effectively distinguish between benign, malignant, and normal tissues. Integrating PMM and CSFEM into our segmentation model yielded an accuracy of 98.1%, an IoU of 96.9%, and a Dice Coefficient of 97.2%. Similarly, our classification model achieved an accuracy of 99.2%, with F1-score, precision, and recall values of 99.1%, 99.3%, and 99.1%, respectively. Our results indicate significant improvement in evaluation metrics in comparison to state-of-the-art (SOTA) models, demonstrating the effectiveness of precision mapping and component-specific feature enhancement in advancing breast cancer lesion analysis.
Authors: Nick John Eliopoulos, Purvish Jajal, James C. Davis, Gaowen Liu, George K. Thiravathukal, Yung-Hsiang Lu
Abstract: This paper investigates how to efficiently deploy vision transformers on edge devices for small workloads. Recent methods reduce the latency of transformer neural networks by removing or merging tokens, with small accuracy degradation. However, these methods are not designed with edge device deployment in mind: they do not leverage information about the latency-workload trends to improve efficiency. We address this shortcoming in our work. First, we identify factors that affect ViT latency-workload relationships. Second, we determine token pruning schedule by leveraging non-linear latency-workload relationships. Third, we demonstrate a training-free, token pruning method utilizing this schedule. We show other methods may increase latency by 2-30%, while we reduce latency by 9-26%. For similar latency (within 5.2% or 7ms) across devices we achieve 78.6%-84.5% ImageNet1K accuracy, while the state-of-the-art, Token Merging, achieves 45.8%-85.4%.
Authors: Sina Ghorbani Kolahi, Seyed Kamal Chaharsooghi, Toktam Khatibi, Afshin Bozorgpour, Reza Azad, Moein Heidari, Ilker Hacihaliloglu, Dorit Merhof
Abstract: Medical image segmentation involves identifying and separating object instances in a medical image to delineate various tissues and structures, a task complicated by the significant variations in size, shape, and density of these features. Convolutional neural networks (CNNs) have traditionally been used for this task but have limitations in capturing long-range dependencies. Transformers, equipped with self-attention mechanisms, aim to address this problem. However, in medical image segmentation it is beneficial to merge both local and global features to effectively integrate feature maps across various scales, capturing both detailed features and broader semantic elements for dealing with variations in structures. In this paper, we introduce MSA$^2$Net, a new deep segmentation framework featuring an expedient design of skip-connections. These connections facilitate feature fusion by dynamically weighting and combining coarse-grained encoder features with fine-grained decoder feature maps. Specifically, we propose a Multi-Scale Adaptive Spatial Attention Gate (MASAG), which dynamically adjusts the receptive field (Local and Global contextual information) to ensure that spatially relevant features are selectively highlighted while minimizing background distractions. Extensive evaluations involving dermatology, and radiological datasets demonstrate that our MSA$^2$Net outperforms state-of-the-art (SOTA) works or matches their performance. The source code is publicly available at https://github.com/xmindflow/MSA-2Net.
Authors: Ziyuan Zhang, Han Qiu, Maosen Zhang, Jun Liu, Bin Chen, Tianwei Zhang, Hewu Li
Abstract: With the rapidly increasing number of satellites in space and their enhanced capabilities, the amount of earth observation images collected by satellites is exceeding the transmission limits of satellite-to-ground links. Although existing learned image compression solutions achieve remarkable performance by using a sophisticated encoder to extract fruitful features as compression and using a decoder to reconstruct, it is still hard to directly deploy those complex encoders on current satellites' embedded GPUs with limited computing capability and power supply to compress images in orbit. In this paper, we propose COSMIC, a simple yet effective learned compression solution to transmit satellite images. We first design a lightweight encoder (i.e. reducing FLOPs by 2.6~5x) on satellite to achieve a high image compression ratio to save satellite-to-ground links. Then, for reconstructions on the ground, to deal with the feature extraction ability degradation due to simplifying encoders, we propose a diffusion-based model to compensate image details when decoding. Our insight is that satellite's earth observation photos are not just images but indeed multi-modal data with a nature of Text-to-Image pairing since they are collected with rich sensor data (e.g. coordinates, timestamp, etc.) that can be used as the condition for diffusion generation. Extensive experiments show that COSMIC outperforms state-of-the-art baselines on both perceptual and distortion metrics.
Authors: Vansh Agrawal, Pratham Singla, Amitoj Singh Miglani, Shivank Garg, Ayush Mangal
Abstract: While state-of-the-art LLMs have shown poor logical and basic mathematical reasoning, recent works try to improve their problem-solving abilities using prompting techniques. We propose giving "hints" to improve the language model's performance on advanced mathematical problems, taking inspiration from how humans approach math pedagogically. We also test robustness to adversarial hints and demonstrate their sensitivity to them. We demonstrate the effectiveness of our approach by evaluating various diverse LLMs, presenting them with a broad set of problems of different difficulties and topics from the MATH dataset and comparing against techniques such as one-shot, few-shot, and chain of thought prompting.
Authors: Florian Hahlbohm, Fabian Friederichs, Tim Weyrich, Linus Franke, Moritz Kappel, Susana Castillo, Marc Stamminger, Martin Eisemann, Marcus Magnor
Abstract: 3D Gaussian Splats (3DGS) have proven a versatile rendering primitive, both for inverse rendering as well as real-time exploration of scenes. In these applications, coherence across camera frames and multiple views is crucial, be it for robust convergence of a scene reconstruction or for artifact-free fly-throughs. Recent work started mitigating artifacts that break multi-view coherence, including popping artifacts due to inconsistent transparency sorting and perspective-correct outlines of (2D) splats. At the same time, real-time requirements forced such implementations to accept compromises in how transparency of large assemblies of 3D Gaussians is resolved, in turn breaking coherence in other ways. In our work, we aim at achieving maximum coherence, by rendering fully perspective-correct 3D Gaussians while using a high-quality approximation of accurate blending, hybrid transparency, on a per-pixel level, in order to retain real-time frame rates. Our fast and perspectively accurate approach for evaluation of 3D Gaussians does not require matrix inversions, thereby ensuring numerical stability and eliminating the need for special handling of degenerate splats, and the hybrid transparency formulation for blending maintains similar quality as fully resolved per-pixel transparencies at a fraction of the rendering costs. We further show that each of these two components can be independently integrated into Gaussian splatting systems. In combination, they achieve up to 2$\times$ higher frame rates, 2$\times$ faster optimization, and equal or better image quality with fewer rendering artifacts compared to traditional 3DGS on common benchmarks.
Authors: Bingyu Yang, Huai Liao, Xinyan Huang, Qingyao Tian, Jinlin Wu, Jingdi Hu, Hongbin Liu
Abstract: Accurate and complete segmentation of airways in chest CT images is essential for the quantitative assessment of lung diseases and the facilitation of pulmonary interventional procedures. Although deep learning has led to significant advancements in medical image segmentation, maintaining airway continuity remains particularly challenging. This difficulty arises primarily from the small and dispersed nature of airway structures, as well as class imbalance in CT scans. To address these challenges, we designed a Multi-scale Nested Residual U-Net (MNR-UNet), incorporating multi-scale inputs and Residual Multi-scale Modules (RMM) into a nested residual framework to enhance information flow, effectively capturing the intricate details of small airways and mitigating gradient vanishing. Building on this, we developed a three-stage segmentation pipeline to optimize the training of the MNR-UNet. The first two stages prioritize high accuracy and sensitivity, while the third stage focuses on repairing airway breakages to balance topological completeness and correctness. To further address class imbalance, we introduced a weighted Breakage-Aware Loss (wBAL) to heighten focus on challenging samples, penalizing breakages and thereby extending the length of the airway tree. Additionally, we proposed a hierarchical evaluation framework to offer more clinically meaningful analysis. Validation on both in-house and public datasets demonstrates that our approach achieves superior performance in detecting more accurate airway voxels and identifying additional branches, significantly improving airway topological completeness. The code will be released publicly following the publication of the paper.