Authors: Jiahao Qin, Feng Liu
Abstract: Electroencephalogram (EEG) signals play a pivotal role in biomedical research and clinical applications, including epilepsy diagnosis, sleep disorder analysis, and brain-computer interfaces. However, the effective analysis and interpretation of these complex signals often present significant challenges. This paper presents a novel approach that integrates computer graphics techniques with biological signal pattern recognition, specifically using Markov Transfer Fields (MTFs) for EEG time series imaging. The proposed framework (STEAM-EEG) employs the capabilities of MTFs to capture the spatiotemporal dynamics of EEG signals, transforming them into visually informative images. These images are then rendered, visualised, and modelled using state-of-the-art computer graphics techniques, thereby facilitating enhanced data exploration, pattern recognition, and decision-making. The code could be accessed from GitHub.
Authors: Jiahao Qin, Feng Liu
Abstract: Electrocardiogram (ECG) analysis plays a crucial role in diagnosing cardiovascular diseases, but accurate interpretation of these complex signals remains challenging. This paper introduces a novel multimodal framework(GAF-FusionNet) for ECG classification that integrates time-series analysis with image-based representation using Gramian Angular Fields (GAF). Our approach employs a dual-layer cross-channel split attention module to adaptively fuse temporal and spatial features, enabling nuanced integration of complementary information. We evaluate GAF-FusionNet on three diverse ECG datasets: ECG200, ECG5000, and the MIT-BIH Arrhythmia Database. Results demonstrate significant improvements over state-of-the-art methods, with our model achieving 94.5\%, 96.9\%, and 99.6\% accuracy on the respective datasets. Our code will soon be available at https://github.com/Cross-Innovation-Lab/GAF-FusionNet.git.
URLs: https://github.com/Cross-Innovation-Lab/GAF-FusionNet.git.
Authors: Di Jin, Xing Liu, Yu Liu, Jia Qing Yap, Andrea Wong, Adriana Crespo, Qi Lin, Zhiyuan Yin, Qiang Yan, Ryan Ye
Abstract: The rapid development of large language models (LLMs) and large vision models (LVMs) have propelled the evolution of multi-modal AI systems, which have demonstrated the remarkable potential for industrial applications by emulating human-like cognition. However, they also pose significant ethical challenges, including amplifying harmful content and reinforcing societal biases. For instance, biases in some industrial image generation models highlighted the urgent need for robust fairness assessments. Most existing evaluation frameworks focus on the comprehensiveness of various aspects of the models, but they exhibit critical limitations, including insufficient attention to content generation alignment and social bias-sensitive domains. More importantly, their reliance on pixel-detection techniques is prone to inaccuracies. To address these issues, this paper presents INFELM, an in-depth fairness evaluation on widely-used text-to-image models. Our key contributions are: (1) an advanced skintone classifier incorporating facial topology and refined skin pixel representation to enhance classification precision by at least 16.04%, (2) a bias-sensitive content alignment measurement for understanding societal impacts, (3) a generalizable representation bias evaluation for diverse demographic groups, and (4) extensive experiments analyzing large-scale text-to-image model outputs across six social-bias-sensitive domains. We find that existing models in the study generally do not meet the empirical fairness criteria, and representation bias is generally more pronounced than alignment errors. INFELM establishes a robust benchmark for fairness assessment, supporting the development of multi-modal AI systems that align with ethical and human-centric principles.
Authors: Hyunho Ha, Inseung Hwang, Nestor Monzon, Jaemin Cho, Donggun Kim, Seung-Hwan Baek, Adolfo Mu\~noz, Diego Gutierrez, Min H. Kim
Abstract: Acquisition and modeling of polarized light reflection and scattering help reveal the shape, structure, and physical characteristics of an object, which is increasingly important in computer graphics. However, current polarimetric acquisition systems are limited to static and opaque objects. Human faces, on the other hand, present a particularly difficult challenge, given their complex structure and reflectance properties, the strong presence of spatially-varying subsurface scattering, and their dynamic nature. We present a new polarimetric acquisition method for dynamic human faces, which focuses on capturing spatially varying appearance and precise geometry, across a wide spectrum of skin tones and facial expressions. It includes both single and heterogeneous subsurface scattering, index of refraction, and specular roughness and intensity, among other parameters, while revealing biophysically-based components such as inner- and outer-layer hemoglobin, eumelanin and pheomelanin. Our method leverages such components' unique multispectral absorption profiles to quantify their concentrations, which in turn inform our model about the complex interactions occurring within the skin layers. To our knowledge, our work is the first to simultaneously acquire polarimetric and spectral reflectance information alongside biophysically-based skin parameters and geometry of dynamic human faces. Moreover, our polarimetric skin model integrates seamlessly into various rendering pipelines.
Authors: Yash Agrawal, Srinidhi Balasubramanian, Rahul Meena, Rohail Alam, Himanshu Malviya, Rohini P
Abstract: This research paper delves into the development of an Optical Character Recognition (OCR) system for the recognition of Ashokan Brahmi characters using Convolutional Neural Networks. It utilizes a comprehensive dataset of character images to train the models, along with data augmentation techniques to optimize the training process. Furthermore, the paper incorporates image preprocessing to remove noise, as well as image segmentation to facilitate line and character segmentation. The study mainly focuses on three pre-trained CNNs, namely LeNet, VGG-16, and MobileNet and compares their accuracy. Transfer learning was employed to adapt the pre-trained models to the Ashokan Brahmi character dataset. The findings reveal that MobileNet outperforms the other two models in terms of accuracy, achieving a validation accuracy of 95.94% and validation loss of 0.129. The paper provides an in-depth analysis of the implementation process using MobileNet and discusses the implications of the findings. The use of OCR for character recognition is of significant importance in the field of epigraphy, specifically for the preservation and digitization of ancient scripts. The results of this research paper demonstrate the effectiveness of using pre-trained CNNs for the recognition of Ashokan Brahmi characters.
Authors: Xiujie Song, Xiaoyi Pang, Haifeng Tang, Mengyue Wu, Kenny Q. Zhu
Abstract: Quantifying image complexity at the entity level is straightforward, but the assessment of semantic complexity has been largely overlooked. In fact, there are differences in semantic complexity across images. Images with richer semantics can tell vivid and engaging stories and offer a wide range of application scenarios. For example, the Cookie Theft picture is such a kind of image and is widely used to assess human language and cognitive abilities due to its higher semantic complexity. Additionally, semantically rich images can benefit the development of vision models, as images with limited semantics are becoming less challenging for them. However, such images are scarce, highlighting the need for a greater number of them. For instance, there is a need for more images like Cookie Theft to cater to people from different cultural backgrounds and eras. Assessing semantic complexity requires human experts and empirical evidence. Automatic evaluation of how semantically rich an image will be the first step of mining or generating more images with rich semantics, and benefit human cognitive assessment, Artificial Intelligence, and various other applications. In response, we propose the Image Semantic Assessment (ISA) task to address this problem. We introduce the first ISA dataset and a novel method that leverages language to solve this vision problem. Experiments on our dataset demonstrate the effectiveness of our approach. Our data and code are available at: https://github.com/xiujiesong/ISA.
Authors: Riling Wei, Hanjie Chen, Kelu Yao, Chuanguang Yang, Jun Wang, Chao Li
Abstract: Photoplethsmography (PPG)-based individual identification aiming at recognizing humans via intrinsic cardiovascular activities has raised extensive attention due to its high security and resistance to mimicry. However, this kind of technology witnesses unpromising results due to the limitation of low information density. To this end, electrocardiogram (ECG) signals have been introduced as a novel modality to enhance the density of input information. Specifically, a novel cross-modal knowledge distillation framework is implemented to propagate discriminate knowledge from ECG modality to PPG modality without incurring additional computational demands at the inference phase. Furthermore, to ensure efficient knowledge propagation, Contrastive Language-Image Pre-training (CLIP)-based knowledge alignment and cross-knowledge assessment modules are proposed respectively. Comprehensive experiments are conducted and results show our framework outperforms the baseline model with the improvement of 2.8% and 3.0% in terms of overall accuracy on seen- and unseen individual recognitions.
Authors: Pinar Yozgatli, Yavuz Acar, Mehmet Tulumen, Selman Minga, Salih Selamet, Beytullah Nalbant, Mustafa Talha Toru, Berna Koca, Tevfik Keles, Mehmet Selcok
Abstract: Computer vision technology, which involves analyzing images and videos captured by cameras through deep learning algorithms, has significantly advanced the field of human fall detection. This study focuses on the application of the YoloV8 Nano model in identifying fall incidents within passenger elevators, a context that presents unique challenges due to the enclosed environment and varying lighting conditions. By training the model on a robust dataset comprising over 10,000 images across diverse elevator types, we aim to enhance the detection precision and recall rates. The model's performance, with an 85% precision and 82% recall in fall detection, underscores its potential for integration into existing elevator safety systems to enable rapid intervention.
Authors: Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
Abstract: The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily focus on importance-based token pruning, which overlooks the redundancy caused by frame resemblance and repetitive visual elements. In this paper, we analyze the high vision token similarities in LVLMs. We reveal that token similarity distribution condenses as layers deepen while maintaining ranking consistency. Leveraging the unique properties of similarity over importance, we introduce FrameFusion, a novel approach that combines similarity-based merging with importance-based pruning for better token reduction in LVLMs. FrameFusion identifies and merges similar tokens before pruning, opening up a new perspective for token reduction. We evaluate FrameFusion on diverse LVLMs, including Llava-Video-{7B,32B,72B}, and MiniCPM-V-8B, on video understanding, question-answering, and retrieval benchmarks. Experiments show that FrameFusion reduces vision tokens by 70$\%$, achieving 3.4-4.4x LLM speedups and 1.6-1.9x end-to-end speedups, with an average performance impact of less than 3$\%$. Our code is available at https://github.com/thu-nics/FrameFusion.
Authors: Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Bj\"orn W. Schuller, Amir Hussain
Abstract: The advent of text-to-video generation models has revolutionized content creation as it produces high-quality videos from textual prompts. However, concerns regarding inherent biases in such models have prompted scrutiny, particularly regarding gender representation. Our study investigates the presence of gender bias in OpenAI's Sora, a state-of-the-art text-to-video generation model. We uncover significant evidence of bias by analyzing the generated videos from a diverse set of gender-neutral and stereotypical prompts. The results indicate that Sora disproportionately associates specific genders with stereotypical behaviors and professions, which reflects societal prejudices embedded in its training data.
Authors: Jianfei Xu, Thanet Markchom, Huizhi Liang
Abstract: The complexity of stacked imaging and the massive number of radiographs make writing radiology reports complex and inefficient. Even highly experienced radiologists struggle to maintain accuracy and consistency in interpreting radiographs under prolonged high-intensity work. To address these issues, this work proposes the CRRG-CLIP Model (Chest Radiology Report Generation and Radiograph Classification Model), an end-to-end model for automated report generation and radiograph classification. The model consists of two modules: the radiology report generation module and the radiograph classification module. The generation module uses Faster R-CNN to identify anatomical regions in radiographs, a binary classifier to select key regions, and GPT-2 to generate semantically coherent reports. The classification module uses the unsupervised Contrastive Language Image Pretraining (CLIP) model, addressing the challenges of high-cost labelled datasets and insufficient features. The results show that the generation module performs comparably to high-performance baseline models on BLEU, METEOR, and ROUGE-L metrics, and outperformed the GPT-4o model on BLEU-2, BLEU-3, BLEU-4, and ROUGE-L metrics. The classification module significantly surpasses the state-of-the-art model in AUC and Accuracy. This demonstrates that the proposed model achieves high accuracy, readability, and fluency in report generation, while multimodal contrastive training with unlabelled radiograph-report pairs enhances classification performance.
Authors: Lahcen El Fatimi, Elhoucine Elfatimi, Hanifa Bouchaneb
Abstract: Model checking, a formal verification technique, ensures systems meet predefined requirements, playing a crucial role in minimizing errors and enhancing quality during development. This paper introduces a novel hybrid framework integrating model checking with deep learning for brain tumor detection and validation in medical imaging. By combining model-checking principles with CNN-based feature extraction and K-FCM clustering for segmentation, the proposed approach enhances the reliability of tumor detection and segmentation. Experimental results highlight the framework's effectiveness, achieving 98\% accuracy, 96.15\% precision, and 100\% recall, demonstrating its potential as a robust tool for advanced medical image analysis.
Authors: Alexander Du, Yingwu Zhu
Abstract: Estimating 6D object poses from RGB images is challenging because the lack of depth information requires inferring a three dimensional structure from 2D projections. Traditional methods often rely on deep learning with grid based data structures but struggle to capture complex dependencies among extracted features. To overcome this, we introduce a graph based representation derived directly from images, where spatial temporal features of each pixel serve as nodes, and relationships between them are defined through node connectivity and spatial interactions. We also employ feature selection mechanisms that use spatial attention and self attention distillation, along with a Legendre convolution layer leveraging the orthogonality of Legendre polynomials for numerical stability. Experiments on the LINEMOD, Occluded LINEMOD, and YCB Video datasets demonstrate that our method outperforms nine existing approaches and achieves state of the art benchmark in object pose estimation.
Authors: Mao Xun Huang, Hen-Hsen Huang
Abstract: Stable Diffusion models have made remarkable strides in generating photorealistic images from text prompts but often falter when tasked with accurately representing complex spatial arrangements, particularly involving intricate 3D relationships. To address this limitation, we introduce SmartSpatial, an innovative approach that enhances the spatial arrangement capabilities of Stable Diffusion models through 3D-aware conditioning and attention-guided mechanisms. SmartSpatial incorporates depth information and employs cross-attention control to ensure precise object placement, delivering notable improvements in spatial accuracy metrics. In conjunction with SmartSpatial, we present SmartSpatialEval, a comprehensive evaluation framework designed to assess spatial relationships. This framework utilizes vision-language models and graph-based dependency parsing for performance analysis. Experimental results on the COCO and SpatialPrompts datasets show that SmartSpatial significantly outperforms existing methods, setting new benchmarks for spatial arrangement accuracy in image generation.
Authors: Sharvaree Vadgama, Mohammad Mohaiminul Islam, Domas Buracus, Christian Shewmake, Erik Bekkers
Abstract: This paper explores the key factors that influence the performance of models working with point clouds, across different tasks of varying geometric complexity. In this work, we explore the trade-offs between flexibility and weight-sharing introduced by equivariant layers, assessing when equivariance boosts or detracts from performance. It is often argued that providing more information as input improves a model's performance. However, if this additional information breaks certain properties, such as $\SE(3)$ equivariance, does it remain beneficial? We identify the key aspects of equivariant and non-equivariant architectures that drive success in different tasks by benchmarking them on segmentation, regression, and generation tasks across multiple datasets with increasing complexity. We observe a positive impact of equivariance, which becomes more pronounced with increasing task complexity, even when strict equivariance is not required.
Authors: Axel Schaffland
Abstract: Since the beginning of rephotography in the middle of the 19th century, techniques in registration, conservation, presentation, and sharing of rephotographs have come a long way. Here, we will present existing digital approaches to rephotography and discuss future approaches and requirements for digital mass rephotography. We present re.photos, an existing web portal for rephotography, featuring methods for collaborative rephotography, interactive image registration, as well as retrieval, organization, and sharing of rephotographs. For mass rephotography additional requirements must be met. Batches of template images and rephotographs must be handled simultaneously, image registration must be automated, and intuitive smartphone apps for rephotography must be available. Long--term storage with persistent identifiers, automatic or mass georeferencing, as well as gamification and social media integration are further requirements we will discuss in this paper.
Authors: Elhoucine Elfatimi, Lahcen El fatimi
Abstract: Recent advancements in model checking have demonstrated significant potential across diverse applications, particularly in signal and image analysis. Medical imaging stands out as a critical domain where model checking can be effectively applied to design and evaluate robust frameworks. These frameworks facilitate automatic and semi-automatic delineation of regions of interest within images, aiding in accurate segmentation. This paper provides a comprehensive analysis of recent works leveraging spatial logic to develop operators and tools for identifying regions of interest, including tumorous and non-tumorous areas. Additionally, we examine the challenges inherent to spatial model-checking techniques, such as variability in ground truth data and the need for streamlined procedures suitable for routine clinical practice.
Authors: Stella Girtsou, Emiliano Diaz Salas-Porras, Lilli Freischem, Joppe Massant, Kyriaki-Margarita Bintsi, Guiseppe Castiglione, William Jones, Michael Eisinger, Emmanuel Johnson, Anna Jungbluth
Abstract: Clouds play a key role in Earth's radiation balance with complex effects that introduce large uncertainties into climate models. Real-time 3D cloud data is essential for improving climate predictions. This study leverages geostationary imagery from MSG/SEVIRI and radar reflectivity measurements of cloud profiles from CloudSat/CPR to reconstruct 3D cloud structures. We first apply self-supervised learning (SSL) methods-Masked Autoencoders (MAE) and geospatially-aware SatMAE on unlabelled MSG images, and then fine-tune our models on matched image-profile pairs. Our approach outperforms state-of-the-art methods like U-Nets, and our geospatial encoding further improves prediction results, demonstrating the potential of SSL for cloud reconstruction.
Authors: Juntao Zhang, Shaogeng Liu, Kun Bian, You Zhou, Pei Zhang, Jianning Liu, Jun Zhou, Bingyan Liu
Abstract: Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive results on image classification and high-resolution dense prediction tasks.Code is available at: \url{https://github.com/yws-wxs/VMINet}.
Authors: Songjie Han, Yinhua Liu, Yanzheng Li, Hua Chen, Dongmei Yang
Abstract: A high-fidelity digital simulation environment is crucial for accurately replicating physical operational processes. However, inconsistencies between simulation and physical environments result in low confidence in simulation outcomes, limiting their effectiveness in guiding real-world production. Unlike the traditional step-by-step point cloud "segmentation-registration" generation method, this paper introduces, for the first time, a novel Multi-Robot Manufacturing Digital Scene Generation (MRG) method that leverages multi-instance point cloud registration, specifically within manufacturing scenes. Tailored to the characteristics of industrial robots and manufacturing settings, an instance-focused transformer module is developed to delineate instance boundaries and capture correlations between local regions. Additionally, a hypothesis generation module is proposed to extract target instances while preserving key features. Finally, an efficient screening and optimization algorithm is designed to refine the final registration results. Experimental evaluations on the Scan2CAD and Welding-Station datasets demonstrate that: (1) the proposed method outperforms existing multi-instance point cloud registration techniques; (2) compared to state-of-the-art methods, the Scan2CAD dataset achieves improvements in MR and MP by 12.15% and 17.79%, respectively; and (3) on the Welding-Station dataset, MR and MP are enhanced by 16.95% and 24.15%, respectively. This work marks the first application of multi-instance point cloud registration in manufacturing scenes, significantly advancing the precision and reliability of digital simulation environments for industrial applications.
Authors: Yuanpeng Tu, Xi Chen, Ser-Nam Lim, Hengshuang Zhao
Abstract: Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world. Despite claims of robust generalization, we find that the advancements of previous works are attributed mainly on trained categories, exposing a lack of generalization to novel classes. In this paper, we explore boosting existing models from a data-centric perspective. We propose DreamMask, which systematically explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. For the first part, we propose an automatic data generation pipeline with off-the-shelf models. We propose crucial designs for vocabulary expansion, layout arrangement, data filtering, etc. Equipped with these techniques, our generated data could significantly outperform the manually collected web data. To train the model with generated data, a synthetic-real alignment loss is designed to bridge the representation gap, bringing noticeable improvements across multiple benchmarks. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
Authors: Nisha Huang, Kaer Huang, Yifan Pu, Jiangshan Wang, Jie Guo, Yiqiang Yan, Xiu Li
Abstract: Recent years have witnessed significant advancements in text-guided style transfer, primarily attributed to innovations in diffusion models. These models excel in conditional guidance, utilizing text or images to direct the sampling process. However, despite their capabilities, direct conditional guidance approaches often face challenges in balancing the expressiveness of textual semantics with the diversity of output results while capturing stylistic features. To address these challenges, we introduce ArtCrafter, a novel framework for text-to-image style transfer. Specifically, we introduce an attention-based style extraction module, meticulously engineered to capture the subtle stylistic elements within an image. This module features a multi-layer architecture that leverages the capabilities of perceiver attention mechanisms to integrate fine-grained information. Additionally, we present a novel text-image aligning augmentation component that adeptly balances control over both modalities, enabling the model to efficiently map image and text embeddings into a shared feature space. We achieve this through attention operations that enable smooth information flow between modalities. Lastly, we incorporate an explicit modulation that seamlessly blends multimodal enhanced embeddings with original embeddings through an embedding reframing design, empowering the model to generate diverse outputs. Extensive experiments demonstrate that ArtCrafter yields impressive results in visual stylization, exhibiting exceptional levels of stylistic intensity, controllability, and diversity.
Authors: Vasileios Magoulianitis, Jiaxin Yang, Catherine A. Alexander, C. -C. Jay Kuo
Abstract: Clinically significant prostate cancer (csPCa) is a leading cause of cancer death in men, yet it has a high survival rate if diagnosed early. Bi-parametric MRI (bpMRI) reading has become a prominent screening test for csPCa. However, this process has a high false positive (FP) rate, incurring higher diagnostic costs and patient discomfort. This paper introduces RadHop-Net, a novel and lightweight CNN for FP reduction. The pipeline consists of two stages: Stage 1 employs data driven radiomics to extract candidate ROIs. In contrast, Stage 2 expands the receptive field about each ROI using RadHop-Net to compensate for the predicted error from Stage 1. Moreover, a novel loss function for regression problems is introduced to balance the influence between FPs and true positives (TPs). RadHop-Net is trained in a radiomics-to-error manner, thus decoupling from the common voxel-to-label approach. The proposed Stage 2 improves the average precision (AP) in lesion detection from 0.407 to 0.468 in the publicly available pi-cai dataset, also maintaining a significantly smaller model size than the state-of-the-art.
Authors: Voncarlos, Ines, Thomas, Sebastien, Elsa, Marjorie, Abdoulaye
Abstract: Animal welfare has become a critical issue in contemporary society, emphasizing our ethical responsibilities toward animals, particularly within livestock farming. The advent of Artificial Intelligence (AI) technologies, specifically computer vision, offers an innovative approach to monitoring and enhancing animal welfare. Cows, as essential contributors to sustainable agriculture, are central to this effort. However, existing cow detection algorithms face challenges in real-world farming environments, such as complex lighting, occlusions, pose variations, and background interference, hindering detection. Model generalization is crucial for adaptation across contexts beyond the training dataset. This study addresses these challenges using a diverse cow dataset from six environments, including indoor and outdoor scenarios. We propose a detection model combining YOLOv8 with the CBAM (Convolutional Block Attention Module) and assess its performance against baseline models, including Mask R-CNN, YOLOv5, and YOLOv8. Our findings show baseline models degrade in complex conditions, while our approach improves using CBAM. YOLOv8-CBAM outperformed YOLOv8 by 2.3% in mAP, achieving 95.2% precision and an mAP@0.5:0.95 of 82.6%, demonstrating superior accuracy. Contributions include (1) analyzing detection limitations, (2) proposing a robust model, and (3) benchmarking state-of-the-art algorithms. Applications include health monitoring, behavioral analysis, and tracking in smart farms, enabling precise detection in challenging settings. This study advances AI-driven livestock monitoring, improving animal welfare and smart agriculture.
Authors: Tobias Trein, Luan Fonseca Garcia
Abstract: Street cats in urban areas often rely on human intervention for survival, leading to challenges in population control and welfare management. In April 2023, Hello Inc., a Chinese urban mobility company, launched the Hello Street Cat initiative to address these issues. The project deployed over 21,000 smart feeding stations across 14 cities in China, integrating livestreaming cameras and treat dispensers activated through user donations. It also promotes the Trap-Neuter-Return (TNR) method, supported by a community-driven platform, HelloStreetCatWiki, where volunteers catalog and identify cats. However, manual identification is inefficient and unsustainable, creating a need for automated solutions. This study explores Deep Learning-based models for re-identifying street cats in the Hello Street Cat initiative. A dataset of 2,796 images of 69 cats was used to train Siamese Networks with EfficientNetB0, MobileNet and VGG16 as base models, evaluated under contrastive and triplet loss functions. VGG16 paired with contrastive loss emerged as the most effective configuration, achieving up to 97% accuracy and an F1 score of 0.9344 during testing. The approach leverages image augmentation and dataset refinement to overcome challenges posed by limited data and diverse visual variations. These findings underscore the potential of automated cat re-identification to streamline population monitoring and welfare efforts. By reducing reliance on manual processes, the method offers a scalable and reliable solution for communitydriven initiatives. Future research will focus on expanding datasets and developing real-time implementations to enhance practicality in large-scale deployments.
Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
Abstract: With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine the holistic audio-visual (AV) understanding. Moreover, currently, there are no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in the existing approaches, we further propose a robust, model-agnostic calibrated audio-visual preference optimization based training strategy CAVPref, obtaining a gain up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.
Authors: Zhaobin Mo, Yunlong Li, Xuan Di
Abstract: Safety-critical driving data is crucial for developing safe and trustworthy self-driving algorithms. Due to the scarcity of safety-critical data in naturalistic datasets, current approaches primarily utilize simulated or artificially generated images. However, there remains a gap in authenticity between these generated images and naturalistic ones. We propose a novel framework to augment the safety-critical driving data from the naturalistic dataset to address this issue. In this framework, we first detect vehicles using YOLOv5, followed by depth estimation and 3D transformation to simulate vehicle proximity and critical driving scenarios better. This allows for targeted modification of vehicle dynamics data to reflect potentially hazardous situations. Compared to the simulated or artificially generated data, our augmentation methods can generate safety-critical driving data with minimal compromise on image authenticity. Experiments using KITTI datasets demonstrate that a downstream self-driving algorithm trained on this augmented dataset performs superiorly compared to the baselines, which include SMOGN and importance sampling.
Authors: Yanxi Chen, Yi Su, Celine Dumitrascu, Kewei Chen, David Weidman, Richard J Caselli, Nicholas Ashton, Eric M Reiman, Yalin Wang
Abstract: Cross-modality translation between MRI and PET imaging is challenging due to the distinct mechanisms underlying these modalities. Blood-based biomarkers (BBBMs) are revolutionizing Alzheimer's disease (AD) detection by identifying patients and quantifying brain amyloid levels. However, the potential of BBBMs to enhance PET image synthesis remains unexplored. In this paper, we performed a thorough study on the effect of incorporating BBBM into deep generative models. By evaluating three widely used cross-modality translation models, we found that BBBMs integration consistently enhances the generative quality across all models. By visual inspection of the generated results, we observed that PET images generated by CycleGAN exhibit the best visual fidelity. Based on these findings, we propose Plasma-CycleGAN, a novel generative model based on CycleGAN, to synthesize PET images from MRI using BBBMs as conditions. This is the first approach to integrate BBBMs in conditional cross-modality translation between MRI and PET.
Authors: Yilin Li, Weining Shen
Abstract: Bloodstain Pattern Analysis (BPA) helps us understand how bloodstains form, with a focus on their size, shape, and distribution. This aids in crime scene reconstruction and provides insight into victim positions and crime investigation. One challenge in BPA is distinguishing between different types of bloodstains, such as those from firearms, impacts, or other mechanisms. Our study focuses on differentiating impact spatter bloodstain patterns from gunshot bloodstain patterns. We distinguish patterns by extracting well-designed individual stain features, applying effective data consolidation methods, and selecting boosting classifiers. As a result, we have developed a model that excels in both accuracy and efficiency. In addition, we use outside data sources from previous studies to discuss the challenges and future directions for BPA.
Authors: Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou
Abstract: Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods by only training with labels predicted from JOSH, further demonstrating its accuracy and generalization ability.
Authors: Chaoyi Tan, Wenqing Zhang, Zhen Qi, Kowei Shih, Xinshi Li, Ao Xiang
Abstract: In the field of computer vision, multimodal image generation has become a research hotspot, especially the task of integrating text, image, and style. In this study, we propose a multimodal image generation method based on Generative Adversarial Networks (GAN), capable of effectively combining text descriptions, reference images, and style information to generate images that meet multimodal requirements. This method involves the design of a text encoder, an image feature extractor, and a style integration module, ensuring that the generated images maintain high quality in terms of visual content and style consistency. We also introduce multiple loss functions, including adversarial loss, text-image consistency loss, and style matching loss, to optimize the generation process. Experimental results show that our method produces images with high clarity and consistency across multiple public datasets, demonstrating significant performance improvements compared to existing methods. The outcomes of this study provide new insights into multimodal image generation and present broad application prospects.
Authors: Ren Hu, Pan Lian
Abstract: Quaternionic signal processing provides powerful tools for efficiently managing color signals by preserving the intrinsic correlations among signal dimensions through quaternion algebra. In this paper, we address the quaternionic phase retrieval problem by systematically developing novel algorithms based on an amplitude-based model. Specifically, we propose the Quaternionic Reweighted Amplitude Flow (QRAF) algorithm, which is further enhanced by three of its variants: incremental, accelerated, and adapted QRAF algorithms. In addition, we introduce the Quaternionic Perturbed Amplitude Flow (QPAF) algorithm, which has linear convergence. Extensive numerical experiments on both synthetic data and real images, demonstrate that our proposed methods significantly improve recovery performance and computational efficiency compared to state-of-the-art approaches.
Authors: Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, Guangyao Shi
Abstract: Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.
URLs: https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.
Authors: Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li, Anush Venkatesh, Danna Gurari
Abstract: No existing work on visual question answering explicitly accounts for ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each region described in the question that is necessary to arrive at the answer. We then provide an analysis showing how our dataset for visually grounding `questions' is distinct from visually grounding `answers', and characterize the properties of the questions and segmentations provided in our dataset. Finally, we benchmark modern models for two novel tasks: recognizing whether a visual question has focus ambiguity and localizing all plausible focus regions within the image. Results show that the dataset is challenging for modern models. To facilitate future progress on these tasks, we publicly share the dataset with an evaluation server at https://focusambiguity.github.io/.
Authors: Mian Zou, Baosheng Yu, Yibing Zhan, Kede Ma
Abstract: The detection of AI-generated faces is commonly approached as a binary classification task. Nevertheless, the resulting detectors frequently struggle to adapt to novel AI face generators, which evolve rapidly. In this paper, we describe an anomaly detection method for AI-generated faces by leveraging self-supervised learning of camera-intrinsic and face-specific features purely from photographic face images. The success of our method lies in designing a pretext task that trains a feature extractor to rank four ordinal exchangeable image file format (EXIF) tags and classify artificially manipulated face images. Subsequently, we model the learned feature distribution of photographic face images using a Gaussian mixture model. Faces with low likelihoods are flagged as AI-generated. Both quantitative and qualitative experiments validate the effectiveness of our method. Our code is available at \url{https://github.com/MZMMSEC/AIGFD_EXIF.git}.
Authors: Messi H. J. Lee
Abstract: Vision-Language Models trained on massive collections of human-generated data often reproduce and amplify societal stereotypes. One critical form of stereotyping reproduced by these models is homogeneity bias-the tendency to represent certain groups as more homogeneous than others. We investigate how this bias responds to hyperparameter adjustments in GPT-4, specifically examining sampling temperature and top p which control the randomness of model outputs. By generating stories about individuals from different racial and gender groups and comparing their similarities using vector representations, we assess both bias robustness and its relationship with hyperparameter values. We find that (1) homogeneity bias persists across most hyperparameter configurations, with Black Americans and women being represented more homogeneously than White Americans and men, (2) the relationship between hyperparameters and group representations shows unexpected non-linear patterns, particularly at extreme values, and (3) hyperparameter adjustments affect racial and gender homogeneity bias differently-while increasing temperature or decreasing top p can reduce racial homogeneity bias, these changes show different effects on gender homogeneity bias. Our findings suggest that while hyperparameter tuning may mitigate certain biases to some extent, it cannot serve as a universal solution for addressing homogeneity bias across different social group dimensions.
Authors: Wei Liu, Yonglin Wu, Chaoqun Li, Zhuodong Liu, Huanqian Yan
Abstract: The study of physical adversarial patches is crucial for identifying vulnerabilities in AI-based recognition systems and developing more robust deep learning models. While recent research has focused on improving patch stealthiness for greater practical applicability, achieving an effective balance between stealth and attack performance remains a significant challenge. To address this issue, we propose a novel physical adversarial attack method that leverages knowledge distillation. Specifically, we first define a stealthy color space tailored to the target environment to ensure smooth blending. Then, we optimize an adversarial patch in an unconstrained color space, which serves as the 'teacher' patch. Finally, we use an adversarial knowledge distillation module to transfer the teacher patch's knowledge to the 'student' patch, guiding the optimization of the stealthy patch. Experimental results show that our approach improves attack performance by 20%, while maintaining stealth, highlighting its practical value.
Authors: Mengting Wei, Tuomas Varanka, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Abstract: We address the problem of facial expression editing by controling the relative variation of facial action-unit (AU) from the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.
Authors: Javier Montalvo, \'Alvaro Garc\'ia-Mart\'in, Pablo Carballeira, Juan C. SanMiguel
Abstract: Semantic segmentation is a computer vision task where classification is performed at a pixel level. Due to this, the process of labeling images for semantic segmentation is time-consuming and expensive. To mitigate this cost there has been a surge in the use of synthetically generated data -- usually created using simulators or videogames -- which, in combination with domain adaptation methods, can effectively learn how to segment real data. Still, these datasets have a particular limitation: due to their closed-set nature, it is not possible to include novel classes without modifying the tool used to generate them, which is often not public. Concurrently, generative models have made remarkable progress, particularly with the introduction of diffusion models, enabling the creation of high-quality images from text prompts without additional supervision. In this work, we propose an unsupervised pipeline that leverages Stable Diffusion and Segment Anything Module to generate class examples with an associated segmentation mask, and a method to integrate generated cutouts for novel classes in semantic segmentation datasets, all with minimal user input. Our approach aims to improve the performance of unsupervised domain adaptation methods by introducing novel samples into the training data without modifications to the underlying algorithms. With our methods, we show how models can not only effectively learn how to segment novel classes, with an average performance of 51% IoU, but also reduce errors for other, already existing classes, reaching a higher performance level overall.
Authors: Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou
Abstract: Recent Multimodal Large Language Models(MLLMs) often use a large number of visual tokens to compensate their visual shortcoming, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method towards training-free visual token pruning, termed G-Prune.In particular, G-Prune regards visual tokens as nodes, and construct their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be front or background.To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of benchmarks.The experiment results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce 63.57\% FLOPs of LLaVA-NeXT on VQA2.0 and TextVQA with only 0.95\% and 2.34\% accuracy drops, respectively.
Authors: Yizhou Li, Zihua Liu, Yusuke Monno, Masatoshi Okutomi
Abstract: In this paper, we propose the first diffusion-based all-in-one video restoration method that utilizes the power of a pre-trained Stable Diffusion and a fine-tuned ControlNet. Our method can restore various types of video degradation with a single unified model, overcoming the limitation of standard methods that require specific models for each restoration task. Our contributions include an efficient training strategy with Task Prompt Guidance (TPG) for diverse restoration tasks, an inference strategy that combines Denoising Diffusion Implicit Models~(DDIM) inversion with a novel Sliding Window Cross-Frame Attention (SW-CFA) mechanism for enhanced content preservation and temporal consistency, and a scalable pipeline that makes our method all-in-one to adapt to different video restoration tasks. Through extensive experiments on five video restoration tasks, we demonstrate the superiority of our method in generalization capability to real-world videos and temporal consistency preservation over existing state-of-the-art methods. Our method advances the video restoration task by providing a unified solution that enhances video quality across multiple applications.
Authors: Victor Nascimento Ribeiro, Nina S. T. Hirata
Abstract: Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate information from image or a video capture. These systems have gained popularity due to the wide availability of low-cost surveillance cameras and advances in Deep Learning. Typically, video-based ALPR systems rely on multiple frames to detect the vehicle and recognize the license plates. Therefore, we propose a system capable of extracting exactly one frame per vehicle and recognizing its license plate characters from this singular image using an Optical Character Recognition (OCR) model. Early experiments show that this methodology is viable.
Authors: Yingjie Liu, Pengyu Zhang, Ziyao He, Mingsong Chen, Xuan Tang, Xian Wei
Abstract: Hyperbolic spaces allow for more efficient modeling of complex, hierarchical structures, which is particularly beneficial in tasks involving multi-modal data. Although hyperbolic geometries have been proven effective for language-image pre-training, their capabilities to unify language, image, and 3D Point Cloud modalities are under-explored. We extend the 3D Point Cloud modality in hyperbolic multi-modal contrastive pre-training. Additionally, we explore the entailment, modality gap, and alignment regularizers for learning hierarchical 3D embeddings and facilitating the transfer of knowledge from both Text and Image modalities. These regularizers enable the learning of intra-modal hierarchy within each modality and inter-modal hierarchy across text, 2D images, and 3D Point Clouds.Experimental results demonstrate that our proposed training strategy yields an outstanding 3D Point Cloud encoder, and the obtained 3D Point Cloud hierarchical embeddings significantly improve performance on various downstream tasks.
Authors: Liye Jia, Runwei Guan, Haocheng Zhao, Qiuchi Zhao, Ka Lok Man, Jeremy Smith, Limin Yu, Yutao Yue
Abstract: 3D object detection is crucial for Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). However, most 3D detectors prioritize detection accuracy, often overlooking network inference speed in practical applications. In this paper, we propose RadarNeXt, a real-time and reliable 3D object detector based on the 4D mmWave radar point clouds. It leverages the re-parameterizable neural networks to catch multi-scale features, reduce memory cost and accelerate the inference. Moreover, to highlight the irregular foreground features of radar point clouds and suppress background clutter, we propose a Multi-path Deformable Foreground Enhancement Network (MDFEN), ensuring detection accuracy while minimizing the sacrifice of speed and excessive number of parameters. Experimental results on View-of-Delft and TJ4DRadSet datasets validate the exceptional performance and efficiency of RadarNeXt, achieving 50.48 and 32.30 mAPs with the variant using our proposed MDFEN. Notably, our RadarNeXt variants achieve inference speeds of over 67.10 FPS on the RTX A4000 GPU and 28.40 FPS on the Jetson AGX Orin. This research demonstrates that RadarNeXt brings a novel and effective paradigm for 3D perception based on 4D mmWave radar.
Authors: Hieu D. Nguyen, Brandon McHenry, Thanh Nguyen, Harper Zappone, Anthony Thompson, Chau Tran, Anthony Segrest, Luke Tonon
Abstract: We present an AI pipeline that involves using smart drones equipped with computer vision to obtain a more accurate fruit count and yield estimation of the number of blueberries in a field. The core components are two object-detection models based on the YOLO deep learning architecture: a Bush Model that is able to detect blueberry bushes from images captured at low altitudes and at different angles, and a Berry Model that can detect individual berries that are visible on a bush. Together, both models allow for more accurate crop yield estimation by allowing intelligent control of the drone's position and camera to safely capture side-view images of bushes up close. In addition to providing experimental results for our models, which show good accuracy in terms of precision and recall when captured images are cropped around the foreground center bush, we also describe how to deploy our models to map out blueberry fields using different sampling strategies, and discuss the challenges of annotating very small objects (blueberries) and difficulties in evaluating the effectiveness of our models.
Authors: Kuan-Hung Liu, Cheng-Kun Yang, Min-Hung Chen, Yu-Lun Liu, Yen-Yu Lin
Abstract: In the task of reference-based image inpainting, an additional reference image is provided to restore a damaged target image to its original state. The advancement of diffusion models, particularly Stable Diffusion, allows for simple formulations in this task. However, existing diffusion-based methods often lack explicit constraints on the correlation between the reference and damaged images, resulting in lower faithfulness to the reference images in the inpainting results. In this work, we propose CorrFill, a training-free module designed to enhance the awareness of geometric correlations between the reference and target images. This enhancement is achieved by guiding the inpainting process with correspondence constraints estimated during inpainting, utilizing attention masking in self-attention layers and an objective function to update the input tensor according to the constraints. Experimental results demonstrate that CorrFill significantly enhances the performance of multiple baseline diffusion-based methods, including state-of-the-art approaches, by emphasizing faithfulness to the reference images.
Authors: Sichao Wang, Chuang Zhang, Ming Yuan, Qing Xu, Lei He, Jianqiang Wang
Abstract: In V2X collaborative perception, the domain gaps between heterogeneous nodes pose a significant challenge for effective information fusion. Pose errors arising from latency and GPS localization noise further exacerbate the issue by leading to feature misalignment. To overcome these challenges, we propose V2X-DGPE, a high-accuracy and robust V2X feature-level collaborative perception framework. V2X-DGPE employs a Knowledge Distillation Framework and a Feature Compensation Module to learn domain-invariant representations from multi-source data, effectively reducing the feature distribution gap between vehicles and roadside infrastructure. Historical information is utilized to provide the model with a more comprehensive understanding of the current scene. Furthermore, a Collaborative Fusion Module leverages a heterogeneous self-attention mechanism to extract and integrate heterogeneous representations from vehicles and infrastructure. To address pose errors, V2X-DGPE introduces a deformable attention mechanism, enabling the model to adaptively focus on critical parts of the input features by dynamically offsetting sampling points. Extensive experiments on the real-world DAIR-V2X dataset demonstrate that the proposed method outperforms existing approaches, achieving state-of-the-art detection performance. The code is available at https://github.com/wangsch10/V2X-DGPE.
Authors: Wenhao Wang, Yifan Sun, Zongxin Yang, Zhentao Tan, Zhengdong Hu, Yi Yang
Abstract: Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for spreading misinformation, infringing on copyrights, and evading content tracing. This motivates us to introduce the task of origin IDentification for text-guided Image-to-image Diffusion models (ID$^2$), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to visual discrepancy across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, OriPID, contains abundant Origins and guided Prompts, which can be used to train and test potential IDentification models across various diffusion models. In the method section, we first prove the existence of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder (VAE) embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be generalized across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods ($+31.6\%$ mAP), even those with generalization designs.
Authors: Kangyu Zhu, Ziyuan Qin, Huahui Yi, Zekun Jiang, Qicheng Lao, Shaoting Zhang, Kang Li
Abstract: With the recent advancements in vision-language models (VLMs) driven by large language models (LLMs), many researchers have focused on models that comprised of an image encoder, an image-to-language projection layer, and a text decoder architectures, leading to the emergence of works like LLava-Med. However, these works primarily operate at the whole-image level, aligning general information from 2D medical images without attending to finer details. As a result, these models often provide irrelevant or non-clinically valuable information while missing critical details. Medical vision-language tasks differ significantly from general images, particularly in their focus on fine-grained details, while excluding irrelevant content. General domain VLMs tend to prioritize global information due to their design, which compresses the entire image into a multi-token representation that is passed into the LLM decoder. Therefore, current VLMs all lack the capability to restrict their attention to particular areas. To address this critical issue in the medical domain, we introduce MedVP, an visual prompt generation and fine-tuning framework, which involves extract medical entities, generate visual prompts, and adapt datasets for visual prompt guided fine-tuning. To the best of our knowledge, this is the first work to explicitly introduce visual prompt into medical VLMs, and we successfully outperform recent state-of-the-art large models across multiple medical VQA datasets. Extensive experiments are conducted to analyze the impact of different visual prompt forms and how they contribute to performance improvement. The results demonstrate both the effectiveness and clinical significance of our approach
Authors: Bingjie Lu (Central South University), Han-Cheng Dan (Central South University), Yichen Zhang (Central South University), Zhetao Huang (Central South University)
Abstract: Mean texture depth (MTD) is pivotal in assessing the skid resistance of asphalt pavements and ensuring road safety. This study focuses on developing an automated system for extracting texture features and evaluating MTD based on pavement images. The contributions of this work are threefold: firstly, it proposes an economical method to acquire three-dimensional (3D) pavement texture data; secondly, it enhances 3D image processing techniques and formulates features that represent various aspects of texture; thirdly, it establishes multivariate prediction models that link these features with MTD values. Validation results demonstrate that the Gradient Boosting Tree (GBT) model achieves remarkable prediction stability and accuracy (R2 = 0.9858), and field tests indicate the superiority of the proposed method over other techniques, with relative errors below 10%. This method offers a comprehensive end-to-end solution for pavement quality evaluation, from images input to MTD predictions output.
Authors: Jialong Guo, Ke liu, Jiangchao Yao, Zhihua Wang, Jiajun Bu, Haishuai Wang
Abstract: Neural Representations for Videos (NeRV) has emerged as a promising implicit neural representation (INR) approach for video analysis, which represents videos as neural networks with frame indexes as inputs. However, NeRV-based methods are time-consuming when adapting to a large number of diverse videos, as each video requires a separate NeRV model to be trained from scratch. In addition, NeRV-based methods spatially require generating a high-dimension signal (i.e., an entire image) from the input of a low-dimension timestamp, and a video typically consists of tens of frames temporally that have a minor change between adjacent frames. To improve the efficiency of video representation, we propose Meta Neural Representations for Videos, named MetaNeRV, a novel framework for fast NeRV representation for unseen videos. MetaNeRV leverages a meta-learning framework to learn an optimal parameter initialization, which serves as a good starting point for adapting to new videos. To address the unique spatial and temporal characteristics of video modality, we further introduce spatial-temporal guidance to improve the representation capabilities of MetaNeRV. Specifically, the spatial guidance with a multi-resolution loss aims to capture the information from different resolution stages, and the temporal guidance with an effective progressive learning strategy could gradually refine the number of fitted frames during the meta-learning process. Extensive experiments conducted on multiple datasets demonstrate the superiority of MetaNeRV for video representations and video compression.
Authors: Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Qu\'etu, Enzo Tartaglione
Abstract: Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness for multi-modal tasks due to their abilities to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating both computational and memory demands during training and inference. Through a comprehensive analysis of the token reduction process, we analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We showcase the effectiveness of FOLDER by integrating it into the visual backbone of several MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator or even performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
Authors: Yixiao Chen, Yue Yao, Ruining Yang, Md Zakir Hossain, Ashu Gupta, Tom Gedeon
Abstract: This article investigates the critical issue of dataset bias in medical imaging, with a particular emphasis on racial disparities caused by uneven population distribution in dataset collection. Our analysis reveals that medical segmentation datasets are significantly biased, primarily influenced by the demographic composition of their collection sites. For instance, Scanning Laser Ophthalmoscopy (SLO) fundus datasets collected in the United States predominantly feature images of White individuals, with minority racial groups underrepresented. This imbalance can result in biased model performance and inequitable clinical outcomes, particularly for minority populations. To address this challenge, we propose a novel training set search strategy aimed at reducing these biases by focusing on underrepresented racial groups. Our approach utilizes existing datasets and employs a simple greedy algorithm to identify source images that closely match the target domain distribution. By selecting training data that aligns more closely with the characteristics of minority populations, our strategy improves the accuracy of medical segmentation models on specific minorities, i.e., Black. Our experimental results demonstrate the effectiveness of this approach in mitigating bias. We also discuss the broader societal implications, highlighting how addressing these disparities can contribute to more equitable healthcare outcomes.
Authors: Avni Mittal, John Kalkhof, Anirban Mukhopadhyay, Arnav Bhavsar
Abstract: Denoising Diffusion Models (DDMs) are widely used for high-quality image generation and medical image segmentation but often rely on Unet-based architectures, leading to high computational overhead, especially with high-resolution images. This work proposes three NCA-based improvements for diffusion-based medical image segmentation. First, Multi-MedSegDiffNCA uses a multilevel NCA framework to refine rough noise estimates generated by lower level NCA models. Second, CBAM-MedSegDiffNCA incorporates channel and spatial attention for improved segmentation. Third, MultiCBAM-MedSegDiffNCA combines these methods with a new RGB channel loss for semantic guidance. Evaluations on Lesion segmentation show that MultiCBAM-MedSegDiffNCA matches Unet-based model performance with dice score of 87.84% while using 60-110 times fewer parameters, offering a more efficient solution for low resource medical settings.
Authors: Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang
Abstract: Collaborative perception significantly enhances autonomous driving safety by extending each vehicle's perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect malicious agents based on single-shot outliers, they overlook temporal message correlations, which can be circumvented by subtle yet harmful perturbations in model input and output spaces. This paper reveals a novel blind area confusion (BAC) attack that compromises existing single-shot outlier-based detection methods. As a countermeasure, we propose GCP, a Guarded Collaborative Perception framework based on spatial-temporal aware malicious agent detection, which maintains single-shot spatial consistency through a confidence-scaled spatial concordance loss, while simultaneously examining temporal anomalies by reconstructing historical bird's eye view motion flows in low-confidence regions. We also employ a joint spatial-temporal Benjamini-Hochberg test to synthesize dual-domain anomaly results for reliable malicious agent detection. Extensive experiments demonstrate GCP's superior performance under diverse attack scenarios, achieving up to 34.69% improvements in AP@0.5 compared to the state-of-the-art CP defense strategies under BAC attacks, while maintaining consistent 5-8% improvements under other typical attacks. Code will be released at https://github.com/CP-Security/GCP.git.
Authors: Zijie Cheng, Boxuan Li, Andr\'e Altmann, Pearse A Keane, Yukun Zhou
Abstract: Contrastive learning, a prominent approach within self-supervised learning, has demonstrated significant effectiveness in developing generalizable models for various applications involving natural images. However, recent research indicates that these successes do not necessarily extend to the medical imaging domain. In this paper, we investigate the reasons for this suboptimal performance and hypothesize that the dense distribution of medical images poses challenges to the pretext tasks in contrastive learning, particularly in constructing positive and negative pairs. We explore model performance under different augmentation strategies and compare the results to those achieved with strong augmentations. Our study includes six publicly available datasets covering multiple clinically relevant tasks. We further assess the model's generalizability through external evaluations. The model pre-trained with weak augmentation outperforms those with strong augmentation, improving AUROC from 0.838 to 0.848 and AUPR from 0.523 to 0.597 on MESSIDOR2, and showing similar enhancements across other datasets. Our findings suggest that optimizing the scale of augmentation is critical for enhancing the efficacy of contrastive learning in medical imaging.
Authors: Haifeng Jia, Xinyi Chen, Yichen Wei, Yifei Sun, Yibo Pi
Abstract: Ray tracing is widely employed to model the propagation of radio-frequency (RF) signal in complex environment. The modelling performance greatly depends on how accurately the target scene can be depicted, including the scene geometry and surface material properties. The advances in computer vision and LiDAR make scene geometry estimation increasingly accurate, but there still lacks scalable and efficient approaches to estimate the material reflectivity in real-world environment. In this work, we tackle this problem by learning the material reflectivity efficiently from the path loss of the RF signal from the transmitters to receivers. Specifically, we want the learned material reflection coefficients to minimize the gap between the predicted and measured powers of the receivers. We achieve this by translating the neural reflectance field from optics to RF domain by modelling both the amplitude and phase of RF signals to account for the multipath effects. We further propose a differentiable RF ray tracing framework that optimizes the neural reflectance field to match the signal strength measurements. We simulate a complex real-world environment for experiments and our simulation results show that the neural reflectance field can successfully learn the reflection coefficients for all incident angles. As a result, our approach achieves better accuracy in predicting the powers of receivers with significantly less training data compared to existing approaches.
Authors: Hui Lin, Chao Zhang, Danfeng Hong, Kexin Dong, Congcong Wen
Abstract: Remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they would incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework introduces a dual-prompt mechanism, comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results demonstrate the effectiveness and superiority of FedRSCLIP in remote sensing image classification.
Authors: Yuliang Guo, Sparsh Garg, S. Mahdi H. Miangoleh, Xinyu Huang, Liu Ren
Abstract: While recent depth estimation methods exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types-particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras-remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its key components include a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space, a FoV alignment operation to support effective training across a wide range of FoVs, and multi-resolution data augmentation to address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving delta-1 ($\delta_1$) accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
Authors: Xinyu Zhou, Jinglun Li, Lingyi Hong, Kaixun Jiang, Pinxue Guo, Weifeng Ge, Wenqiang Zhang
Abstract: Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods heavily depend on matching results and do not utilize positional prior, while the autoregressive approach can only be trained using bounding boxes available in the training set, potentially resulting in suboptimal performance during testing with unseen data. Inspired by the diffusion model, denoising learning enhances the model's robustness to unseen data. Therefore, We introduce noise to bounding boxes, generating noisy boxes for training, thus enhancing model robustness on testing data. We propose a new paradigm to formulate the visual object tracking problem as a denoising learning process. However, tracking algorithms are usually asked to run in real-time, directly applying the diffusion model to object tracking would severely impair tracking speed. Therefore, we decompose the denoising learning process into every denoising block within a model, not by running the model multiple times, and thus we summarize the proposed paradigm as an in-model latent denoising learning process. Specifically, we propose a denoising Vision Transformer (ViT), which is composed of multiple denoising blocks. In the denoising block, template and search embeddings are projected into every denoising block as conditions. A denoising block is responsible for removing the noise in a predicted bounding box, and multiple stacked denoising blocks cooperate to accomplish the whole denoising process. Subsequently, we utilize image features and trajectory information to refine the denoised bounding box. Besides, we also utilize trajectory memory and visual memory to improve tracking stability. Experimental results validate the effectiveness of our approach, achieving competitive performance on several challenging datasets.
Authors: Hui Lin, Nan Li, Pengjuan Yao, Kexin Dong, Yuhan Guo, Danfeng Hong, Ying Zhang, Congcong Wen
Abstract: Remote sensing object detection is particularly challenging due to the high resolution, multi-scale features, and diverse ground object characteristics inherent in satellite and UAV imagery. These challenges necessitate more advanced approaches for effective object detection in such environments. While deep learning methods have achieved remarkable success in remote sensing object detection, they typically rely on large amounts of labeled data. Acquiring sufficient labeled data, particularly for novel or rare objects, is both challenging and time-consuming in remote sensing scenarios, limiting the generalization capabilities of existing models. To address these challenges, few-shot learning (FSL) has emerged as a promising approach, aiming to enable models to learn new classes from limited labeled examples. Building on this concept, few-shot object detection (FSOD) specifically targets object detection challenges in data-limited conditions. However, the generalization capability of FSOD models, particularly in remote sensing, is often constrained by the complex and diverse characteristics of the objects present in such environments. In this paper, we propose the Generalization-Enhanced Few-Shot Object Detection (GE-FSOD) model to improve the generalization capability in remote sensing FSOD tasks. Our model introduces three key innovations: the Cross-Level Fusion Pyramid Attention Network (CFPAN) for enhanced multi-scale feature representation, the Multi-Stage Refinement Region Proposal Network (MRRPN) for more accurate region proposals, and the Generalized Classification Loss (GCL) for improved classification performance in few-shot scenarios. Extensive experiments on the DIOR and NWPU VHR-10 datasets show that our model achieves state-of-the-art performance for few-shot object detection in remote sensing.
Authors: Chao Liang, Linchao Zhu, Zongxin Yang, Wei Chen, Yi Yang
Abstract: We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it reduces the expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used to classify or identify other images. However, in the few clean and many noisy scenarios, the class prototype can be severely biased due to the presence of irrelevant noisy images. The resulting prototypes are less compact and discriminative, as previous methods do not take into account the diverse range of images in the noisy web image collections. On the other hand, the relation modeling between noisy and clean images is not learned for the class prototype generation in an end-to-end manner, which results in a suboptimal class prototype. In this article, we introduce a similarity maximization loss named SimNoiPro. Our SimNoiPro first generates noise-tolerant hybrid prototypes composed of clean and noise-tolerant prototypes and then pulls them closer to each other. Our approach considers the diversity of noisy images by explicit division and overcomes the optimization discrepancy issue. This enables better relation modeling between clean and noisy images and helps extract judicious information from the noisy image set. The evaluation results on two extended few-shot classification benchmarks confirm that our SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.
Authors: Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, Jingren Zhou
Abstract: We report ACE++, an instruction-based diffusion framework that tackles various image generation and editing tasks. Inspired by the input format for the inpainting task proposed by FLUX.1-Fill-dev, we improve the Long-context Condition Unit (LCU) introduced in ACE and extend this input paradigm to any editing and generation tasks. To take full advantage of image generative priors, we develop a two-stage training scheme to minimize the efforts of finetuning powerful text-to-image diffusion models like FLUX.1-dev. In the first stage, we pre-train the model using task data with the 0-ref tasks from the text-to-image model. There are many models in the community based on the post-training of text-to-image foundational models that meet this training paradigm of the first stage. For example, FLUX.1-Fill-dev deals primarily with painting tasks and can be used as an initialization to accelerate the training process. In the second stage, we finetune the above model to support the general instructions using all tasks defined in ACE. To promote the widespread application of ACE++ in different scenarios, we provide a comprehensive set of models that cover both full finetuning and lightweight finetuning, while considering general applicability and applicability in vertical scenarios. The qualitative analysis showcases the superiority of ACE++ in terms of generating image quality and prompt following ability.
Authors: Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim
Abstract: The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR
Authors: Hui Li, Xiaoyu Ren, Hongjiu Yu, Huiyu Duan, Kai Li, Ying Chen, Libo Wang, Xiongkuo Min, Guangtao Zhai, Xu Liu
Abstract: Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in this paper we present LiveBeauty, the first large-scale live-specific FAP dataset, in a more challenging application scenario, i.e., live streaming. 10,000 face images are collected from a live streaming platform directly, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset in the challenging live scenario. Furthermore, a multi-modal FAP method is proposed to measure the facial attractiveness in live streaming. Specifically, we first extract holistic facial prior knowledge and multi-modal aesthetic semantic features via a Personalized Attractiveness Prior Module (PAPM) and a Multi-modal Attractiveness Encoder Module (MAEM), respectively, then integrate the extracted features through a Cross-Modal Fusion Module (CMFM). Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. Dataset will be available soon.
Authors: Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, Yulan Guo
Abstract: 3D scene generation conditioned on text prompts has significantly progressed due to the development of 2D diffusion generation models. However, the textual description of 3D scenes is inherently inaccurate and lacks fine-grained control during training, leading to implausible scene generation. As an intuitive and feasible solution, the 3D layout allows for precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) using additional semantic layout as the prompt to inject precise control of 3D object positions. Specifically, we first introduce a scene hybrid representation to decouple objects and backgrounds, which is initialized via a pre-trained text-to-3D model. Then, we propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model which are finetuned on a scene dataset. Extensive experiments demonstrate that our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches. Furthermore, the generated scene allows for flexible yet precise editing, thereby facilitating multiple downstream applications.
Authors: Dawei Dai, Mingming Jia, Yinxiu Zhou, Hang Xing, Chenghang Li
Abstract: Facial images have extensive practical applications. Although the current large-scale text-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only text prompt. Image prompts are a logical choice. However, current methods of this type generally focus on general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we built a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features for the facial image, integrating these into the diffusion model to enhance the preservation of facial identity features for diffusion models. Validation on two face-related test datasets demonstrates that our Face-MakeUp can achieve the best comprehensive performance.All codes are available at:https://github.com/ddw2AIGROUP2CQUPT/Face-MakeUp
Authors: Leo Franklin, Apiradee Boonmee, Kritsada Wongsuwan
Abstract: Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile solution for in-domain and out-of-domain tasks. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.
Authors: Hao Shu
Abstract: Although deep convolutional neutral networks (CNNs) have significantly enhanced performance in image edge detection (ED), current models remain highly dependent on post-processing techniques such as non-maximum suppression (NMS), and often fail to deliver satisfactory perceptual results, while the performance will deteriorate significantly if the allowed error toleration distance decreases. These limitations arise from the uniform fusion of features across all pixels, regardless of their specific characteristics, such as the distinction between textural and edge areas. If the features extracted by the ED models are selected more meticulously and encompass greater diversity, the resulting predictions are expected to be more accurate and perceptually meaningful. Motivated by this observation, this paper proposes a novel feature selection paradigm for deep networks that facilitates the differential selection of features and can be seamlessly integrated into existing ED models. By incorporating this additional structure, the performance of conventional ED models is substantially enhanced without post-processing, while simultaneously enhancing the perceptual quality of the predictions. Extensive experimental evaluations validate the effectiveness of the proposed model.
Authors: Lijun Zhang, Yifan Zhang, Weicheng Tang, Xinzhi Sun, Xiaomeng Wang, Zhanshan Li
Abstract: Micro-expression recognition (MER) presents a significant challenge due to the transient and subtle nature of the motion changes involved. In recent years, deep learning methods based on attention mechanisms have made some breakthroughs in MER. However, these methods still suffer from the limitations of insufficient feature capture and poor dynamic adaptation when coping with the instantaneous subtle movement changes of micro-expressions. Therefore, in this paper, we design an Adaptive Hierarchical Multi-Scale Attention Network (AHMSA-Net) for MER. Specifically, we first utilize the onset and apex frames of the micro-expression sequence to extract three-dimensional (3D) optical flow maps, including horizontal optical flow, vertical optical flow, and optical flow strain. Subsequently, the optical flow feature maps are inputted into AHMSA-Net, which consists of two parts: an adaptive hierarchical framework and a multi-scale attention mechanism. Based on the adaptive downsampling hierarchical attention framework, AHMSA-Net captures the subtle changes of micro-expressions from different granularities (fine and coarse) by dynamically adjusting the size of the optical flow feature map at each layer. Based on the multi-scale attention mechanism, AHMSA-Net learns micro-expression action information by fusing features from different scales (channel and spatial). These two modules work together to comprehensively improve the accuracy of MER. Additionally, rigorous experiments demonstrate that the proposed method achieves competitive results on major micro-expression databases, with AHMSA-Net achieving recognition accuracy of up to 78.21% on composite databases (SMIC, SAMM, CASMEII) and 77.08% on the CASME^{}3 database.
Authors: Zhenglai Li, Jun Wang, Chang Tang, Xinzhong Zhu, Wei Zhang, Xinwang Liu
Abstract: Multi-view clustering (MvC) aims to integrate information from different views to enhance the capability of the model in capturing the underlying data structures. The widely used joint training paradigm in MvC is potentially not fully leverage the multi-view information, since the imbalanced and under-optimized view-specific features caused by the uniform learning objective for all views. For instance, particular views with more discriminative information could dominate the learning process in the joint training paradigm, leading to other views being under-optimized. To alleviate this issue, we first analyze the imbalanced phenomenon in the joint-training paradigm of multi-view clustering from the perspective of gradient descent for each view-specific feature extractor. Then, we propose a novel balanced multi-view clustering (BMvC) method, which introduces a view-specific contrastive regularization (VCR) to modulate the optimization of each view. Concretely, VCR preserves the sample similarities captured from the joint features and view-specific ones into the clustering distributions corresponding to view-specific features to enhance the learning process of view-specific feature extractors. Additionally, a theoretical analysis is provided to illustrate that VCR adaptively modulates the magnitudes of gradients for updating the parameters of view-specific feature extractors to achieve a balanced multi-view learning procedure. In such a manner, BMvC achieves a better trade-off between the exploitation of view-specific patterns and the exploration of view-invariance patterns to fully learn the multi-view information for the clustering task. Finally, a set of experiments are conducted to verify the superiority of the proposed method compared with state-of-the-art approaches both on eight benchmark MvC datasets and two spatially resolved transcriptomics datasets.
Authors: Vyacheslav Shen, Kassymzhomart Kunanbayev, Dae-Shik Kim
Abstract: With the advancements in Large Language and Latent Diffusion models, brain decoding has achieved remarkable results in recent years. The works on the NSD dataset, with stimuli images from the COCO dataset, leverage the embeddings from the CLIP model for image reconstruction and GIT for captioning. However, the current captioning approach introduces the challenge of potential data contamination given that the GIT model was trained on the COCO dataset. In this work, we present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model's embedding of an image from the corresponding fMRI signal and then providing its [CLS] token as the prefix to the GPT-2 language model which decreases computational requirements considerably. Additionally, instead of commonly used Linear Regression, we explore 3D Convolutional Neural Network mapping of fMRI signals to image embedding space for better accounting positional information of voxels.
Authors: Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang
Abstract: Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
Authors: Miguel Carvalho, Bruno Martins
Abstract: Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves a high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.
Authors: Jushang Qiu, Lei Wang
Abstract: Skeleton-based action recognition has gained significant attention for its ability to efficiently represent spatiotemporal information in a lightweight format. Most existing approaches use graph-based models to process skeleton sequences, where each pose is represented as a skeletal graph structured around human physical connectivity. Among these, the Spatiotemporal Graph Convolutional Network (ST-GCN) has become a widely used framework. Alternatively, hypergraph-based models, such as the Hyperformer, capture higher-order correlations, offering a more expressive representation of complex joint interactions. A recent advancement, termed Taylor Videos, introduces motion-enhanced skeleton sequences by embedding motion concepts, providing a fresh perspective on interpreting human actions in skeleton-based action recognition. In this paper, we conduct a comprehensive evaluation of both traditional skeleton sequences and Taylor-transformed skeletons using ST-GCN and Hyperformer models on the NTU-60 and NTU-120 datasets. We compare skeletal graph and hypergraph representations, analyzing static poses against motion-injected poses. Our findings highlight the strengths and limitations of Taylor-transformed skeletons, demonstrating their potential to enhance motion dynamics while exposing current challenges in fully using their benefits. This study underscores the need for innovative skeletal modelling techniques to effectively handle motion-rich data and advance the field of action recognition.
Authors: Sanya Sinha, Michal Balazia, Francois Bremond
Abstract: Instructional cataract surgery videos are crucial for ophthalmologists and trainees to observe surgical details repeatedly. This paper presents a deep learning model for real-time identification of surgical instruments in these videos, using a custom dataset scraped from open-access sources. Inspired by the architecture of YOLOV9, the model employs a Programmable Gradient Information (PGI) mechanism and a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN) to address the information bottleneck problem, enhancing Minimum Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores. The Go-ELAN YOLOV9 model, evaluated against YOLO v5, v7, v8, v9 vanilla, Laptool and DETR, achieves a superior mAP of 73.74 at IoU 0.5 on a dataset of 615 images with 10 instrument classes, demonstrating the effectiveness of the proposed model.
Authors: Chan Lee, Seungho Shin, Gyeong-Moon Park, Jung Uk Kim
Abstract: Although existing Sparsely Annotated Object Detection (SAOD) approches have made progress in handling sparsely annotated environments in multispectral domain, where only some pedestrians are annotated, they still have the following limitations: (i) they lack considerations for improving the quality of pseudo-labels for missing annotations, and (ii) they rely on fixed ground truth annotations, which leads to learning only a limited range of pedestrian visual appearances in the multispectral domain. To address these issues, we propose a novel framework called Sparsely Annotated Multispectral Pedestrian Detection (SAMPD). For limitation (i), we introduce Multispectral Pedestrian-aware Adaptive Weight (MPAW) and Positive Pseudo-label Enhancement (PPE) module. Utilizing multispectral knowledge, these modules ensure the generation of high-quality pseudo-labels and enable effective learning by increasing weights for high-quality pseudo-labels based on modality characteristics. To address limitation (ii), we propose an Adaptive Pedestrian Retrieval Augmentation (APRA) module, which adaptively incorporates pedestrian patches from ground-truth and dynamically integrates high-quality pseudo-labels with the ground-truth, facilitating a more diverse learning pool of pedestrians. Extensive experimental results demonstrate that our SAMPD significantly enhances performance in sparsely annotated environments within the multispectral domain.
Authors: Haixu Liu, Penghao Jiang, Zerui Tao, Muyan Wan, Qiuzhuang Sun
Abstract: Predicting plant species composition in specific spatiotemporal contexts plays an important role in biodiversity management and conservation, as well as in improving species identification tools. Our work utilizes 88,987 plant survey records conducted in specific spatiotemporal contexts across Europe. We also use the corresponding satellite images, time series data, climate time series, and other rasterized environmental data such as land cover, human footprint, bioclimatic, and soil variables as training data to train the model to predict the outcomes of 4,716 plant surveys. We propose a feature construction and result correction method based on the graph structure. Through comparative experiments, we select the best-performing backbone networks for feature extraction in both temporal and image modalities. In this process, we built a backbone network based on the Swin-Transformer Block for extracting temporal Cubes features. We then design a hierarchical cross-attention mechanism capable of robustly fusing features from multiple modalities. During training, we adopt a 10-fold cross-fusion method based on fine-tuning and use a Threshold Top-K method for post-processing. Ablation experiments demonstrate the improvements in model performance brought by our proposed solution pipeline.
Authors: Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Abstract: While Vision Language Models (VLMs) are impressive in tasks such as visual question answering (VQA) and image captioning, their ability to apply multi-step reasoning to images has lagged, giving rise to perceptions of modality imbalance or brittleness. Towards systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning (AVR), comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We seek strategies for training on the SIMPLE version of the tasks that improve performance on the corresponding HARD task, i.e., S2H generalization. This synthetic framework, where each task also has a text-only version, allows a quantification of the modality imbalance, and how it is impacted by training strategy. Ablations highlight the importance of explicit image-to-text conversion in promoting S2H generalization when using auto-regressive training. We also report results of mechanistic study of this phenomenon, including a measure of gradient alignment that seems to identify training strategies that promote better S2H generalization.
Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
Abstract: 4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.
Authors: Andr\'es Villa, Juan Le\'on Alc\'azar, Motasem Alfarra, Vladimir Araujo, Alvaro Soto, Bernard Ghanem
Abstract: Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM or fusion module and works as a post-pretraining approach that improves the grounding and language alignment of the visual encoder. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training. As a result, EAGLE achieves a significant reduction in hallucinations across multiple challenging benchmarks and tasks.
Authors: Xiaojiao Guo, Xuhang Chen, Shuqiang Wang, Chi-Man Pun
Abstract: Underwater imaging grapples with challenges from light-water interactions, leading to color distortions and reduced clarity. In response to these challenges, we propose a novel Color Balance Prior \textbf{Guided} \textbf{Hyb}rid \textbf{Sens}e \textbf{U}nderwater \textbf{I}mage \textbf{R}estoration framework (\textbf{GuidedHybSensUIR}). This framework operates on multiple scales, employing the proposed \textbf{Detail Restorer} module to restore low-level detailed features at finer scales and utilizing the proposed \textbf{Feature Contextualizer} module to capture long-range contextual relations of high-level general features at a broader scale. The hybridization of these different scales of sensing results effectively addresses color casts and restores blurry details. In order to effectively point out the evolutionary direction for the model, we propose a novel \textbf{Color Balance Prior} as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase. We construct a comprehensive benchmark using paired training data from three real-world underwater datasets and evaluate on six test sets, including three paired and three unpaired, sourced from four real-world underwater datasets. Subsequently, we tested 14 traditional and retrained 23 deep learning existing underwater image restoration methods on this benchmark, obtaining metric results for each approach. This effort aims to furnish a valuable benchmarking dataset for standard basis for comparison. The extensive experiment results demonstrate that our method outperforms 37 other state-of-the-art methods overall on various benchmark datasets and metrics, despite not achieving the best results in certain individual cases. The code and dataset are available at \href{https://github.com/CXH-Research/GuidedHybSensUIR}{https://github.com/CXH-Research/GuidedHybSensUIR}.
URLs: https://github.com/CXH-Research/GuidedHybSensUIR, https://github.com/CXH-Research/GuidedHybSensUIR
Authors: Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang
Abstract: The rapid development of diffusion models has greatly advanced AI-generated videos in terms of length and consistency recently, yet assessing AI-generated videos still remains challenging. Previous approaches have often focused on User-Generated Content(UGC), but few have targeted AI-Generated Video Quality Assessment methods. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module using text encoder of CLIP to ensure semantic consistency between videos and conditional prompts. Additionally, we propose the Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate our method achieves state-of-the-art results.
Authors: Ji Cao, Tongya Zheng, Qinghong Guo, Yu Wang, Junshu Dai, Shunyu Liu, Jie Yang, Jie Song, Mingli Song
Abstract: Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.
Authors: Li Weitao, Zhang Xinru, Wang Dianhui, Tong Qianqian, Chai Tianyou
Abstract: To address the issues of a weak generalization capability and interpretability in working condition recognition model of a fused magnesium furnace, this paper proposes an interpretable working condition recognition method based on deep convolutional stochastic configuration networks (DCSCNs). Firstly, a supervised learning mechanism is employed to generate physically meaningful Gaussian differential convolution kernels. An incremental method is utilized to construct a DCSCNs model, ensuring the convergence of recognition errors in a hierarchical manner and avoiding the iterative optimization process of convolutional kernel parameters using the widely used backpropagation algorithm. The independent coefficient of channel feature maps is defined to obtain the visualization results of feature class activation maps for the fused magnesium furnace. A joint reward function is constructed based on the recognition accuracy, the interpretable trustworthiness evaluation metrics, and the model parameter quantity. Reinforcement learning (RL) is applied to adaptively prune the convolutional kernels of the DCSCNs model, aiming to build a compact, highly performed and interpretable network. The experimental results demonstrate that the proposed method outperforms the other deep learning approaches in terms of recognition accuracy and interpretability.
Authors: Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Hang Xu, Li Zhang
Abstract: Recent advances in diffusion models have greatly improved text-driven video generation. However, training models for long video generation demands significant computational power and extensive data, leading most video diffusion models to be limited to a small number of frames. Existing training-free methods that attempt to generate long videos using pre-trained short video diffusion models often struggle with issues such as insufficient motion dynamics and degraded video fidelity. In this paper, we present Brick-Diffusion, a novel, training-free approach capable of generating long videos of arbitrary length. Our method introduces a brick-to-wall denoising strategy, where the latent is denoised in segments, with a stride applied in subsequent iterations. This process mimics the construction of a staggered brick wall, where each brick represents a denoised segment, enabling communication between frames and improving overall video quality. Through quantitative and qualitative evaluations, we demonstrate that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity videos.
Authors: Deguo Xia, Weiming Zhang, Xiyan Liu, Wei Zhang, Chenting Gong, Xiao Tan, Jizhou Huang, Mengmeng Yang, Diange Yang
Abstract: An up-to-date city-scale lane-level map is an indispensable infrastructure and a key enabling technology for ensuring the safety and user experience of autonomous driving systems. In industrial scenarios, reliance on manual annotation for map updates creates a critical bottleneck. Lane-level updates require precise change information and must ensure consistency with adjacent data while adhering to strict standards. Traditional methods utilize a three-stage approach-construction, change detection, and updating-which often necessitates manual verification due to accuracy limitations. This results in labor-intensive processes and hampers timely updates. To address these challenges, we propose LDMapNet-U, which implements a new end-to-end paradigm for city-scale lane-level map updating. By reconceptualizing the update task as an end-to-end map generation process grounded in historical map data, we introduce a paradigm shift in map updating that simultaneously generates vectorized maps and change information. To achieve this, a Prior-Map Encoding (PME) module is introduced to effectively encode historical maps, serving as a critical reference for detecting changes. Additionally, we incorporate a novel Instance Change Prediction (ICP) module that learns to predict associations with historical maps. Consequently, LDMapNet-U simultaneously achieves vectorized map element generation and change detection. To demonstrate the superiority and effectiveness of LDMapNet-U, extensive experiments are conducted using large-scale real-world datasets. In addition, LDMapNet-U has been successfully deployed in production at Baidu Maps since April 2024, supporting map updating for over 360 cities and significantly shortening the update cycle from quarterly to weekly. The updated maps serve hundreds of millions of users and are integrated into the autonomous driving systems of several leading vehicle companies.
Authors: Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong
Abstract: Visual-language models (VLM) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining increasing attention for building general-purpose VLMs. Despite the significant progress made in VLLMs, the related literature remains limited, particularly from a comprehensive application perspective, encompassing generalized and specialized applications across vision (image, video, depth), action, and language modalities. In this survey, we focus on the diverse applications of VLLMs, examining their using scenarios, identifying ethics consideration and challenges, and discussing future directions for their development. By synthesizing these contents, we aim to provide a comprehensive guide that will pave the way for future innovations and broader applications of VLLMs. The paper list repository is available: https://github.com/JackYFL/awesome-VLLMs.
Authors: Tianjian Jiang, Johsan Billingham, Sebastian M\"uksch, Juan Zarate, Nicolas Evans, Martin R. Oswald, Marc Polleyfeys, Otmar Hilliges, Manuel Kaufmann, Jie Song
Abstract: We present WorldPose, a novel dataset for advancing research in multi-person global pose estimation in the wild, featuring footage from the 2022 FIFA World Cup. While previous datasets have primarily focused on local poses, often limited to a single person or in constrained, indoor settings, the infrastructure deployed for this sporting event allows access to multiple fixed and moving cameras in different stadiums. We exploit the static multi-view setup of HD cameras to recover the 3D player poses and motions with unprecedented accuracy given capture areas of more than 1.75 acres. We then leverage the captured players' motions and field markings to calibrate a moving broadcasting camera. The resulting dataset comprises more than 80 sequences with approx 2.5 million 3D poses and a total traveling distance of over 120 km. Subsequently, we conduct an in-depth analysis of the SOTA methods for global pose estimation. Our experiments demonstrate that WorldPose challenges existing multi-person techniques, supporting the potential for new research in this area and others, such as sports analysis. All pose annotations (in SMPL format), broadcasting camera parameters and footage will be released for academic research purposes.
Authors: Arindam Dutta, Sarosij Bose, Saketh Bachu, Calvin-Khang Ta, Konstantinos Karydis, Amit K. Roy-Chowdhury
Abstract: Occlusions are a significant challenge to human pose estimation algorithms, often resulting in inaccurate and anatomically implausible poses. Although current occlusion-robust human pose estimation algorithms exhibit impressive performance on existing datasets, their success is largely attributed to supervised training and the availability of additional information, such as multiple views or temporal continuity. Furthermore, these algorithms typically suffer from performance degradation under distribution shifts. While existing domain adaptive human pose estimation algorithms address this bottleneck, they tend to perform suboptimally when the target domain images are occluded, a common occurrence in real-life scenarios. To address these challenges, we propose OR-POSE: Unsupervised Domain Adaptation for Occlusion Resilient Human POSE Estimation. OR-POSE is an innovative unsupervised domain adaptation algorithm which effectively mitigates domain shifts and overcomes occlusion challenges by employing the mean teacher framework for iterative pseudo-label refinement. Additionally, OR-POSE reinforces realistic pose prediction by leveraging a learned human pose prior which incorporates the anatomical constraints of humans in the adaptation process. Lastly, OR-POSE avoids overfitting to inaccurate pseudo labels generated from heavily occluded images by employing a novel visibility-based curriculum learning approach. This enables the model to gradually transition from training samples with relatively less occlusion to more challenging, heavily occluded samples. Extensive experiments show that OR-POSE outperforms existing analogous state-of-the-art algorithms by $\sim$ 7% on challenging occluded human pose estimation datasets.
Authors: Sugandha Saxena, S. N. Prasad, Ashwin M Polnaya, Shweta Agarwala
Abstract: Advances in healthcare research have significantly enhanced our understanding of disease mechanisms, diagnostic precision, and therapeutic options. Yet, lung cancer remains one of the leading causes of cancer-related mortality worldwide due to challenges in early and accurate diagnosis. While current lung cancer detection models show promise, there is considerable potential for further improving the accuracy for timely intervention. To address this challenge, we introduce a hybrid deep convolution model leveraging transfer learning, named the Maximum Sensitivity Neural Network (MSNN). MSNN is designed to improve the precision of lung cancer detection by refining sensitivity and specificity. This model has surpassed existing deep learning approaches through experimental validation, achieving an accuracy of 98% and a sensitivity of 97%. By overlaying sensitivity maps onto lung Computed Tomography (CT) scans, it enables the visualization of regions most indicative of malignant or benign classifications. This innovative method demonstrates exceptional performance in distinguishing lung cancer with minimal false positives, thereby enhancing the accuracy of medical diagnoses.
Authors: Niloufar Eghbali, Hassan Bagher-Ebadian, Tuka Alhanai, Mohammad M. Ghassemi
Abstract: Vision Transformers (ViTs) have shown promise in medical image semantic segmentation (MISS) by capturing long-range correlations. However, ViTs often struggle to model local spatial information effectively, which is essential for accurately segmenting fine anatomical details, particularly when applied to small datasets without extensive pre-training. We introduce Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture enhancing Transformer-based models by incorporating learnable radiomic features. This approach integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, enhancing the feature representation processed by the Transformer model. Our method uniquely combines the long-range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features. Evaluated on the Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet demonstrates significant improvements over state-of-the-art models, achieving a 1.14\% increase in Dice score for Synapse and 0.99\% for ACDC, with minimal computational overhead (only 15 and 30 additional parameters, respectively). GLoG-CSUnet's flexible design allows integration with various base models, offering a promising approach for incorporating radiomics-inspired feature extraction in Transformer architectures for medical image analysis. The code implementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.
Authors: Yan Hu, Mingdao Gong, Zhongxi Qiu, Jiabao Liu, Hongli Shen, Mingzhen Yuan, Xiaoqing Zhang, Heng Li, Hai Lu, Jiang Liu
Abstract: Retinal image registration is vital for diagnostic therapeutic applications within the field of ophthalmology. Existing public datasets, focusing on adult retinal pathologies with high-quality images, have limited number of image pairs and neglect clinical challenges. To address this gap, we introduce COph100, a novel and challenging dataset known as the Comprehensive Ophthalmology Retinal Image Registration dataset for infants with a wide range of image quality issues constituting the public "RIDIRP" database. COph100 consists of 100 eyes, each with 2 to 9 examination sessions, amounting to a total of 491 image pairs carefully selected from the publicly available dataset. We manually labeled the corresponding ground truth image points and provided automatic vessel segmentation masks for each image. We have assessed COph100 in terms of image quality and registration outcomes using state-of-the-art algorithms. This resource enables a robust comparison of retinal registration methodologies and aids in the analysis of disease progression in infants, thereby deepening our understanding of pediatric ophthalmic conditions.
Authors: Chaoran Feng, Wangbo Yu, Xinhua Cheng, Zhenyu Tang, Junwu Zhang, Li Yuan, Yonghong Tian
Abstract: Compared to frame-based methods, computational neuromorphic imaging using event cameras offers significant advantages, such as minimal motion blur, enhanced temporal resolution, and high dynamic range. The multi-view consistency of Neural Radiance Fields combined with the unique benefits of event cameras, has spurred recent research into reconstructing NeRF from data captured by moving event cameras. While showing impressive performance, existing methods rely on ideal conditions with the availability of uniform and high-quality event sequences and accurate camera poses, and mainly focus on the object level reconstruction, thus limiting their practical applications. In this work, we propose AE-NeRF to address the challenges of learning event-based NeRF from non-ideal conditions, including non-uniform event sequences, noisy poses, and various scales of scenes. Our method exploits the density of event streams and jointly learn a pose correction module with an event-based NeRF (e-NeRF) framework for robust 3D reconstruction from inaccurate camera poses. To generalize to larger scenes, we propose hierarchical event distillation with a proposal e-NeRF network and a vanilla e-NeRF network to resample and refine the reconstruction process. We further propose an event reconstruction loss and a temporal loss to improve the view consistency of the reconstructed scene. We established a comprehensive benchmark that includes large-scale scenes to simulate practical non-ideal conditions, incorporating both synthetic and challenging real-world event datasets. The experimental results show that our method achieves a new state-of-the-art in event-based 3D reconstruction.
Authors: Bin Wang, Li Jing
Abstract: Text recognition technology applied to street-view storefront signs is increasingly utilized across various practical domains, including map navigation, smart city planning analysis, and business value assessments in commercial districts. This technology holds significant research and commercial potential. Nevertheless, it faces numerous challenges. Street view images often contain signboards with complex designs and diverse text styles, complicating the text recognition process. A notable advancement in this field was introduced by our team in a recent competition. We developed a novel multistage approach that integrates multimodal feature fusion, extensive self-supervised training, and a Transformer-based large model. Furthermore, innovative techniques such as BoxDQN, which relies on reinforcement learning, and text rectification methods were employed, leading to impressive outcomes. Comprehensive experiments have validated the effectiveness of these methods, showcasing our potential to enhance text recognition capabilities in complex urban environments.
Authors: Kai Wang, Shaozhang Niu, Qixian Hao, Jiwei Zhang
Abstract: As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, the accuracy of Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, leading to incorrect predictions; and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose an innovative Dual-stream Multi-scale Feature Extractor (DMFE) for extracting multi-scale features, enhancing feature representation by considering both semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that the InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showcasing excellent generalization capabilities and robustness.
Authors: Asma Alkalbani, Muhammad Saqib, Ahmed Salim Alrawahi, Abbas Anwar, Chandarnath Adak, Saeed Anwar
Abstract: Road damage detection and assessment are crucial components of infrastructure maintenance. However, current methods often struggle with detecting multiple types of road damage in a single image, particularly at varying scales. This is due to the lack of road datasets with various damage types having varying scales. To overcome this deficiency, first, we present a novel dataset called Diverse Road Damage Dataset (DRDD) for road damage detection that captures the diverse road damage types in individual images, addressing a crucial gap in existing datasets. Then, we provide our model, RDD4D, that exploits Attention4D blocks, enabling better feature refinement across multiple scales. The Attention4D module processes feature maps through an attention mechanism combining positional encoding and "Talking Head" components to capture local and global contextual information. In our comprehensive experimental analysis comparing various state-of-the-art models on our proposed, our enhanced model demonstrated superior performance in detecting large-sized road cracks with an Average Precision (AP) of 0.458 and maintained competitive performance with an overall AP of 0.445. Moreover, we also provide results on the CrackTinyNet dataset; our model achieved around a 0.21 increase in performance. The code, model weights, dataset, and our results are available on \href{https://github.com/msaqib17/Road_Damage_Detection}{https://github.com/msaqib17/Road\_Damage\_Detection}.
URLs: https://github.com/msaqib17/Road_Damage_Detection, https://github.com/msaqib17/Road\_Damage\_Detection
Authors: Wentian Qu, Chenyu Meng, Heng Li, Jian Cheng, Cuixia Ma, Hongan Wang, Xiao Zhou, Xiaoming Deng, Ping Tan
Abstract: Object pose estimation, crucial in computer vision and robotics applications, faces challenges with the diversity of unseen categories. We propose a zero-shot method to achieve category-level 6-DOF object pose estimation, which exploits both 2D and 3D universal features of input RGB-D image to establish semantic similarity-based correspondences and can be extended to unseen categories without additional model fine-tuning. Our method begins with combining efficient 2D universal features to find sparse correspondences between intra-category objects and gets initial coarse pose. To handle the correspondence degradation of 2D universal features if the pose deviates much from the target pose, we use an iterative strategy to optimize the pose. Subsequently, to resolve pose ambiguities due to shape differences between intra-category objects, the coarse pose is refined by optimizing with dense alignment constraint of 3D universal features. Our method outperforms previous methods on the REAL275 and Wild6D benchmarks for unseen categories.
Authors: Kuldeep Kurte, Kedar Kulkarni
Abstract: In this paper, we present an enhanced Convolutional Neural Network (CNN)-based rooftop solar photovoltaic (PV) panel detection approach using satellite images. We propose to use pre-trained CNN-based model to extract the local convolutional features of rooftops. These local features are then combined using the Vectors of Locally Aggregated Descriptors (VLAD) technique to obtain rooftop-level global features, which are then used to train traditional Machine Learning (ML) models to identify rooftop images that do and do not contain PV panels. On the dataset used in this study, the proposed approach achieved rooftop-PV classification scores exceeding the predefined threshold of 0.9 across all three cities for each of the feature extractor networks evaluated. Moreover, we propose a 3-phase approach to enable efficient utilization of the previously trained models on a new city or region with limited labelled data. We illustrate the effectiveness of this 3-phase approach for multi-city rooftop-PV detection task.
Authors: Wentian Qu, Jiahe Li, Jian Cheng, Jian Shi, Chenyu Meng, Cuixia Ma, Hongan Wang, Xiaoming Deng, Yinda Zhang
Abstract: Understanding of bimanual hand-object interaction plays an important role in robotics and virtual reality. However, due to significant occlusions between hands and object as well as the high degree-of-freedom motions, it is challenging to collect and annotate a high-quality, large-scale dataset, which prevents further improvement of bimanual hand-object interaction-related baselines. In this work, we propose a new 3D Gaussian Splatting based data augmentation framework for bimanual hand-object interaction, which is capable of augmenting existing dataset to large-scale photorealistic data with various hand-object pose and viewpoints. First, we use mesh-based 3DGS to model objects and hands, and to deal with the rendering blur problem due to multi-resolution input images used, we design a super-resolution module. Second, we extend the single hand grasping pose optimization module for the bimanual hand object to generate various poses of bimanual hand-object interaction, which can significantly expand the pose distribution of the dataset. Third, we conduct an analysis for the impact of different aspects of the proposed data augmentation on the understanding of the bimanual hand-object interaction. We perform our data augmentation on two benchmarks, H2O and Arctic, and verify that our method can improve the performance of the baselines.
Authors: Ulindu De Silva, Leon Fernando, Billy Lau Pik Lik, Zann Koh, Sam Conrad Joyce, Belinda Yuen, Chau Yuen
Abstract: The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored for an indefinite time in a very small storage space compared to videos, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations result in 80% and 70% accuracy in temporal and spatial quality and consistency of the pipeline respectively.
Authors: A. Rani, D. O. Arroyo, P. Durdevic
Abstract: Fungi undergo dynamic morphological transformations throughout their lifecycle, forming intricate networks as they transition from spores to mature mycelium structures. To support the study of these time-dependent processes, we present a synthetic, time-aligned image dataset that models key stages of fungal growth. This dataset systematically captures phenomena such as spore size reduction, branching dynamics, and the emergence of complex mycelium networks. The controlled generation process ensures temporal consistency, scalability, and structural alignment, addressing the limitations of real-world fungal datasets. Optimized for deep learning (DL) applications, this dataset facilitates the development of models for classifying growth stages, predicting fungal development, and analyzing morphological patterns over time. With applications spanning agriculture, medicine, and industrial mycology, this resource provides a robust foundation for automating fungal analysis, enhancing disease monitoring, and advancing fungal biology research through artificial intelligence.
Authors: Toomas Tahves, Junyi Gu, Mauro Bellone, Raivo Sell
Abstract: This paper presents Camera-LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which leverage the fusion of camera and LiDAR data using vision transformers. Building on the methodology of visual transformers that exploit the self-attention mechanism, we extend segmentation capabilities with additional classification options to a diverse class of objects including cyclists, traffic signs, and pedestrians across diverse weather conditions. Despite good performance, the models face challenges under adverse conditions which underscores the need for further optimization to enhance performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state-of-the-art in multimodal fusion and object segmentation, with ongoing efforts required to address existing limitations and fully harness their potential in practical deployments.
Authors: Kaishva Chintan Shah, Karthik S. Gurumoorthy, Ajit Rajwade
Abstract: This study presents a technique for 2D tomography under unknown viewing angles when the distribution of the viewing angles is also unknown. Unknown view tomography (UVT) is a problem encountered in cryo-electron microscopy and in the geometric calibration of CT systems. There exists a moderate-sized literature on the 2D UVT problem, but most existing 2D UVT algorithms assume knowledge of the angle distribution which is not available usually. Our proposed methodology formulates the problem as an optimization task based on cross-validation error, to estimate the angle distribution jointly with the underlying 2D structure in an alternating fashion. We explore the algorithm's capabilities for the case of two probability distribution models: a semi-parametric mixture of von Mises densities and a probability mass function model. We evaluate our algorithm's performance under noisy projections using a PCA-based denoising technique and Graph Laplacian Tomography (GLT) driven by order statistics of the estimated distribution, to ensure near-perfect ordering, and compare our algorithm to intuitive baselines.
Authors: Xu Ma, Mengsheng Chen, Junhui Zhang, Lijuan Song, Fang Du, Zhenhua Yu
Abstract: Convolutional neural networks (CNNs) excel in local feature extraction while Transformers are superior in processing global semantic information. By leveraging the strengths of both, hybrid Transformer-CNN networks have become the major architectures in medical image segmentation tasks. However, existing hybrid methods still suffer deficient learning of local semantic features due to the fixed receptive fields of convolutions, and also fall short in effectively integrating local and long-range dependencies. To address these issues, we develop a new method PARF-Net to integrate convolutions of Pixel-wise Adaptive Receptive Fields (Conv-PARF) into hybrid Network for medical image segmentation. The Conv-PARF is introduced to cope with inter-pixel semantic differences and dynamically adjust convolutional receptive fields for each pixel, thus providing distinguishable features to disentangle the lesions with varying shapes and scales from the background. The features derived from the Conv-PARF layers are further processed using hybrid Transformer-CNN blocks under a lightweight manner, to effectively capture local and long-range dependencies, thus boosting the segmentation performance. By assessing PARF-Net on four widely used medical image datasets including MoNuSeg, GlaS, DSB2018 and multi-organ Synapse, we showcase the advantages of our method over the state-of-the-arts. For instance, PARF-Net achieves 84.27% mean Dice on the Synapse dataset, surpassing existing methods by a large margin.
Authors: Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming Li
Abstract: Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, presenting challenges such as the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space~(RKHS). We then apply the determinantal point process~(DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process~(MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a \((1 - 1/e)\)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
Authors: Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Fadi Boutros, Raghavendra Ramachandra, Naser Damer
Abstract: Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown cenarios and large training data requirements. Foundation models (FM) are pre-trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task-specific adaption even when little training data are available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification header. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at https://github.com/gurayozgur/FoundPAD .
Authors: Daisuke Komura, Maki Takao, Mieko Ochi, Takumi Onoyama, Hiroto Katoh, Hiroyuki Abe, Hiroyuki Sano, Teppei Konishi, Toshio Kumasaka, Tomoyuki Yokose, Yohei Miyagi, Tetsuo Ushiku, Shumpei Ishikawa
Abstract: The tumor microenvironment (TME) plays a crucial role in cancer progression and treatment response, yet current methods for its comprehensive analysis in H&E-stained tissue slides face significant limitations in the diversity of tissue cell types and accuracy. Here, we present PAGET (Pathological image segmentation via AGgrEgated Teachers), a new knowledge distillation approach that integrates multiple segmentation models while considering the hierarchical nature of cell types in the TME. By leveraging a unique dataset created through immunohistochemical restaining techniques and existing segmentation models, PAGET enables simultaneous identification and classification of 14 key TME components. We demonstrate PAGET's ability to perform rapid, comprehensive TME segmentation across various tissue types and medical institutions, advancing the quantitative analysis of tumor microenvironments. This method represents a significant step forward in enhancing our understanding of cancer biology and supporting precise clinical decision-making from large-scale histopathology images.
Authors: Thang-Anh-Quan Nguyen, Nathan Piasco, Luis Rold\~ao, Moussab Bennehar, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Br\'emond
Abstract: In this paper, we present PointmapDiffusion, a novel framework for single-image novel view synthesis (NVS) that utilizes pre-trained 2D diffusion models. Our method is the first to leverage pointmaps (i.e. rasterized 3D scene coordinates) as a conditioning signal, capturing geometric prior from the reference images to guide the diffusion process. By embedding reference attention blocks and a ControlNet for pointmap features, our model balances between generative capability and geometric consistency, enabling accurate view synthesis across varying viewpoints. Extensive experiments on diverse real-world datasets demonstrate that PointmapDiffusion achieves high-quality, multi-view consistent results with significantly fewer trainable parameters compared to other baselines for single-image NVS tasks.
Authors: Jonathan Courtois, Beno\^it Miramond, Alain Pegatoquet
Abstract: With the growing interest in on On-orbit servicing (OOS) and Active Debris Removal (ADR) missions, spacecraft poses estimation algorithms are being developed using deep learning to improve the precision of this complex task and find the most efficient solution. With the advances of bio-inspired low-power solutions, such a spiking neural networks and event-based processing and cameras, and their recent work for space applications, we propose to investigate the feasibility of a fully event-based solution to improve event-based pose estimation for spacecraft. In this paper, we address the first event-based dataset SEENIC with real event frames captured by an event-based camera on a testbed. We show the methods and results of the first event-based solution for this use case, where our small spiking end-to-end network (S2E2) solution achieves interesting results over 21cm position error and 14degree rotation error, which is the first step towards fully event-based processing for embedded spacecraft pose estimation.
Authors: Mahmoud Abdulsalam, Usman Zahidi, Bradley Hurst, Simon Pearson, Grzegorz Cielniak, James Brown
Abstract: Tomato anomalies/damages pose a significant challenge in greenhouse farming. While this method of cultivation benefits from efficient resource utilization, anomalies can significantly degrade the quality of farm produce. A common anomaly associated with tomatoes is splitting, characterized by the development of cracks on the tomato skin, which degrades its quality. Detecting this type of anomaly is challenging due to dynamic variations in appearance and sizes, compounded by dataset scarcity. We address this problem in an unsupervised manner by utilizing a tailored variational autoencoder (VAE) with hyperspectral input. Preliminary analysis of the dataset enabled us to select the optimal range of wavelengths for detecting this anomaly. Our findings indicate that the 530nm - 550nm range is suitable for identifying tomato dry splits. The analysis on reconstruction loss allow us to not only detect the anomalies but also to some degree estimate the anomalous regions.
Authors: Susu Sun, Leslie Tessier, Fr\'ed\'erique Meeuwsen, Cl\'ement Grisi, Dominique van Midden, Geert Litjens, Christian F. Baumgartner
Abstract: Multiple Instance Learning (MIL) methods allow for gigapixel Whole-Slide Image (WSI) analysis with only slide-level annotations. Interpretability is crucial for safely deploying such algorithms in high-stakes medical domains. Traditional MIL methods offer explanations by highlighting salient regions. However, such spatial heatmaps provide limited insights for end users. To address this, we propose a novel inherently interpretable WSI-classification approach that uses human-understandable pathology concepts to generate explanations. Our proposed Concept MIL model leverages recent advances in vision-language models to directly predict pathology concepts based on image features. The model's predictions are obtained through a linear combination of the concepts identified on the top-K patches of a WSI, enabling inherent explanations by tracing each concept's influence on the prediction. In contrast to traditional concept-based interpretable models, our approach eliminates the need for costly human annotations by leveraging the vision-language model. We validate our method on two widely used pathology datasets: Camelyon16 and PANDA. On both datasets, Concept MIL achieves AUC and accuracy scores over 0.9, putting it on par with state-of-the-art models. We further find that 87.1\% (Camelyon16) and 85.3\% (PANDA) of the top 20 patches fall within the tumor region. A user study shows that the concepts identified by our model align with the concepts used by pathologists, making it a promising strategy for human-interpretable WSI classification.
Authors: Jiexi Zhong, Zhiheng Li, Yubo Cui, Zheng Fang
Abstract: Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches explore spatio-temporal information of multi-scan to identify the semantic classes and motion states for each point. However, these methods often overlook the segmentation consistency in space and time, which may result in point clouds within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that can reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current feature through temporal fusion on multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point-wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore missing features due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to optimize segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on the multi-scan semantic and moving object segmentation on SemanticKITTI and nuScenes datasets. The code will be available at https://github.com/NEU-REAL/4D-CS.git.
Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
Abstract: In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .
Authors: Jiawei Liu, Yuanzhi Zhu, Feiyu Gao, Zhibo Yang, Peng Wang, Junyang Lin, Xinggang Wang, Wenyu Liu
Abstract: Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text can facilitate to the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: The attribute of the text (such as font and color) should be controllable as needed.In this paper, we propose a two stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multi modal large language models to find reasonable text areas and recommend text content according to the nature scene background images, while the latter generates controllable multilingual text based on the diffusion model. Through extensive experiments, we respectively verified the effectiveness of TLCG and CLTD, and demonstrated the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks like text detection and text recognition. Codes and datasets will be available.
Authors: Wanpeng Hu, Haodi Liu, Lin Chen, Feng Zhou, Changming Xiao, Qi Yang, Changshui Zhang
Abstract: Complex visual reasoning remains a key challenge today. Typically, the challenge is tackled using methodologies such as Chain of Thought (COT) and visual instruction tuning. However, how to organically combine these two methodologies for greater success remains unexplored. Also, issues like hallucinations and high training cost still need to be addressed. In this work, we devise an innovative multi-round training and reasoning framework suitable for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning approach heuristically guides MLLMs to focus on visual clues relevant to the target problem, reducing hallucinations and enhancing the model's ability to describe fine-grained image details. This ultimately enables the model to perform well in complex visual reasoning and question-answering tasks. We have named this framework Socratic Questioning(SQ). To facilitate future research, we create a multimodal mini-dataset named CapQA, which includes 1k images of fine-grained activities, for visual instruction tuning and evaluation, our proposed SQ method leads to a 31.2% improvement in the hallucination score. Our extensive experiments on various benchmarks demonstrate SQ's remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning and hallucination mitigation. Our model and code will be publicly available.
Authors: Timothy Schauml\"offel, Arthur Aubret, Gemma Roig, Jochen Triesch
Abstract: Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform on image recognition tasks compared to humans. These models train on raw, uniform visual inputs collected from head-mounted cameras. This is different from humans, as the anatomical structure of the retina and visual cortex relatively amplifies the central visual information, i.e. around humans' gaze location. This selective amplification in humans likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate 5-months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of the gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.
Authors: Jinglei Zhang, Jiankang Deng, Chao Ma, Rolandos Alexandros Potamias
Abstract: Despite the advent in 3D hand pose estimation, current methods predominantly focus on single-image 3D hand reconstruction in the camera frame, overlooking the world-space motion of the hands. Such limitation prohibits their direct use in egocentric video settings, where hands and camera are continuously in motion. In this work, we propose HaWoR, a high-fidelity method for hand motion reconstruction in world coordinates from egocentric videos. We propose to decouple the task by reconstructing the hand motion in the camera space and estimating the camera trajectory in the world coordinate system. To achieve precise camera trajectory estimation, we propose an adaptive egocentric SLAM framework that addresses the shortcomings of traditional SLAM methods, providing robust performance under challenging camera dynamics. To ensure robust hand motion trajectories, even when the hands move out of view frustum, we devise a novel motion infiller network that effectively completes the missing frames of the sequence. Through extensive quantitative and qualitative evaluations, we demonstrate that HaWoR achieves state-of-the-art performance on both hand motion reconstruction and world-frame camera trajectory estimation under different egocentric benchmark datasets. Code and models are available on https://hawor-project.github.io/ .
Authors: Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai
Abstract: Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (\textit{e.g.}, CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce\textbf{~\name} (\textbf{S}patial-\textbf{T}emporal \textbf{A}ugmentation with T2V models for \textbf{R}eal-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate\textbf{~\name}~outperforms state-of-the-art methods on both synthetic and real-world datasets.
Authors: Haozheng Xu, Alistair Weld, Chi Xu, Alfie Roddan, Joao Cartucho, Mert Asim Karaoglu, Alexander Ladikos, Yangke Li, Yiping Li, Daiyun Shen, Shoujie Yang, Geonhee Lee, Seyeon Park, Jongho Shin, Young-Gon Kim, Lucy Fothergill, Dominic Jones, Pietro Valdastri, Duygu Sarikaya, Stamatia Giannarou
Abstract: Accurate instrument pose estimation is a crucial step towards the future of robotic surgery, enabling applications such as autonomous surgical task execution. Vision-based methods for surgical instrument pose estimation provide a practical approach to tool tracking, but they often require markers to be attached to the instruments. Recently, more research has focused on the development of marker-less methods based on deep learning. However, acquiring realistic surgical data, with ground truth instrument poses, required for deep learning training, is challenging. To address the issues in surgical instrument pose estimation, we introduce the Surgical Robot Instrument Pose Estimation (SurgRIPE) challenge, hosted at the 26th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2023. The objectives of this challenge are: (1) to provide the surgical vision community with realistic surgical video data paired with ground truth instrument poses, and (2) to establish a benchmark for evaluating markerless pose estimation methods. The challenge led to the development of several novel algorithms that showcased improved accuracy and robustness over existing methods. The performance evaluation study on the SurgRIPE dataset highlights the potential of these advanced algorithms to be integrated into robotic surgery systems, paving the way for more precise and autonomous surgical procedures. The SurgRIPE challenge has successfully established a new benchmark for the field, encouraging further research and development in surgical robot instrument pose estimation.
Authors: Junmyeong Lee, Eui Jun Hwang, Sukmin Cho, Jong C. Park
Abstract: In Masked Image Modeling (MIM), two primary methods exist: Pixel MIM and Latent MIM, each utilizing different reconstruction targets, raw pixels and latent representations, respectively. Pixel MIM tends to capture low-level visual details such as color and texture, while Latent MIM focuses on high-level semantics of an object. However, these distinct strengths of each method can lead to suboptimal performance in tasks that rely on a particular level of visual features. To address this limitation, we propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths. Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features. We further integrate the CLS token into the reconstruction process to aggregate global context, enabling the model to capture more semantic information. Extensive experiments demonstrate that PiLaMIM outperforms key baselines such as MAE, I-JEPA and BootMAE in most cases, proving its effectiveness in extracting richer visual representations.
Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
Authors: Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak
Abstract: We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation, that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked-cross attention objective, integrating object-specific prompts into corresponding latent space regions and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce \benchmark, a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at https://guyyariv.github.io/TTM/.
Authors: Haojin Li, Heng Li, Jianyu Chen, Rihan Zhong, Ke Niu, Huazhu Fu, Jiang Liu
Abstract: Decoupling domain-variant information (DVI) from domain-invariant information (DII) serves as a prominent strategy for mitigating domain shifts in the practical implementation of deep learning algorithms. However, in medical settings, concerns surrounding data collection and privacy often restrict access to both training and test data, hindering the empirical decoupling of information by existing methods. To tackle this issue, we propose an Autonomous Information Filter-driven Source-free Domain Adaptation (AIF-SFDA) algorithm, which leverages a frequency-based learnable information filter to autonomously decouple DVI and DII. Information Bottleneck (IB) and Self-supervision (SS) are incorporated to optimize the learnable frequency filter. The IB governs the information flow within the filter to diminish redundant DVI, while SS preserves DII in alignment with the specific task and image modality. Thus, the autonomous information filter can overcome domain shifts relying solely on target data. A series of experiments covering various medical image modalities and segmentation tasks were conducted to demonstrate the benefits of AIF-SFDA through comparisons with leading algorithms and ablation studies. The code is available at https://github.com/JingHuaMan/AIF-SFDA.
Authors: Rasa Khosrowshahli, Shahryar Rahnamayan, Beatrice Ombuki-Berman
Abstract: Deep neural networks suffer from storing millions and billions of weights in memory post-training, making challenging memory-intensive models to deploy on embedded devices. The weight-sharing technique is one of the popular compression approaches that use fewer weight values and share across specific connections in the network. In this paper, we propose a multi-objective evolutionary algorithm (MOEA) based compression framework independent of neural network architecture, dimension, task, and dataset. We use uniformly sized bins to quantize network weights into a single codebook (lookup table) for efficient weight representation. Using MOEA, we search for Pareto optimal $k$ bins by optimizing two objectives. Then, we apply the iterative merge technique to non-dominated Pareto frontier solutions by combining neighboring bins without degrading performance to decrease the number of bins and increase the compression ratio. Our approach is model- and layer-independent, meaning the weights are mixed in the clusters from any layer, and the uniform quantization method used in this work has $O(N)$ complexity instead of non-uniform quantization methods such as k-means with $O(Nkt)$ complexity. In addition, we use the center of clusters as the shared weight values instead of retraining shared weights, which is computationally expensive. The advantage of using evolutionary multi-objective optimization is that it can obtain non-dominated Pareto frontier solutions with respect to performance and shared weights. The experimental results show that we can reduce the neural network memory by $13.72 \sim14.98 \times$ on CIFAR-10, $11.61 \sim 12.99\times$ on CIFAR-100, and $7.44 \sim 8.58\times$ on ImageNet showcasing the effectiveness of the proposed deep neural network compression framework.
Authors: Valeriya Strizhkova, Hadi Kachmar, Hava Chaptoukaev, Raphael Kalandadze, Natia Kukhilava, Tatia Tsmindashvili, Nibras Abo-Alzahab, Maria A. Zuluaga, Michal Balazia, Antitza Dantcheva, Fran\c{c}ois Br\'emond, Laura Ferrari
Abstract: Human emotions entail a complex set of behavioral, physiological and cognitive changes. Current state-of-the-art models fuse the behavioral and physiological components using classic machine learning, rather than recent deep learning techniques. We propose to fill this gap, designing the Multimodal for Video and Physio (MVP) architecture, streamlined to fuse video and physiological signals. Differently then others approaches, MVP exploits the benefits of attention to enable the use of long input sequences (1-2 minutes). We have studied video and physiological backbones for inputting long sequences and evaluated our method with respect to the state-of-the-art. Our results show that MVP outperforms former methods for emotion recognition based on facial videos, EDA, and ECG/PPG.
Authors: Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
Abstract: Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same flops and boosts the inference throughput by 18.5%.
Authors: Yuxiang Bao, Guoliang Kang, Linlin Yang, Xiaoyue Duan, Bo Zhao, Baochang Zhang
Abstract: In real-world scenarios, the number of training samples across classes usually subjects to a long-tailed distribution. The conventionally trained network may achieve unexpected inferior performance on the rare class compared to the frequent class. Most previous works attempt to rectify the network bias from the data-level or from the classifier-level. Differently, in this paper, we identify that the bias towards the frequent class may be encoded into features, i.e., the rare-specific features which play a key role in discriminating the rare class are much weaker than the frequent-specific features. Based on such an observation, we introduce a simple yet effective approach, normalizing the parameters of Batch Normalization (BN) layer to explicitly rectify the feature bias. To achieve this end, we represent the Weight/Bias parameters of a BN layer as a vector, normalize it into a unit one and multiply the unit vector by a scalar learnable parameter. Through decoupling the direction and magnitude of parameters in BN layer to learn, the Weight/Bias exhibits a more balanced distribution and thus the strength of features becomes more even. Extensive experiments on various long-tailed recognition benchmarks (i.e., CIFAR-10/100-LT, ImageNet-LT and iNaturalist 2018) show that our method outperforms previous state-of-the-arts remarkably. The code and checkpoints are available at https://github.com/yuxiangbao/NBN.
Authors: Valery Istomin, Oleg Pereziabov, Ilya Afanasyev
Abstract: This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
Authors: Risha Goel, Zain Shabeeb, Isabel Panicker, Vida Jamali
Abstract: Liquid phase transmission electron microscopy (LPTEM) offers an unparalleled combination of spatial and temporal resolution, making it a promising tool for single particle tracking at the nanoscale. However, the absence of a standardized framework for identifying and tracking nanoparticles in noisy LPTEM videos has impeded progress in the field to develop this technique as a single particle tracking tool. To address this, we leveraged Segment Anything Model 2 (SAM 2), released by Meta, which is a foundation model developed for segmenting videos and images. Here, we demonstrate that SAM 2 can successfully segment LPTEM videos in a zero-shot manner and without requiring fine-tuning. Building on this capability, we introduce SAM4EM, a comprehensive framework that integrates promptable video segmentation with particle tracking and statistical analysis, providing an end-to-end LPTEM analysis framework for single particle tracking. SAM4EM achieves nearly 50-fold higher accuracy in segmenting and analyzing LPTEM videos compared to state-of-the-art methods, paving the way for broader applications of LPTEM in nanoscale imaging.
Authors: Alexandru Buburuzan, Anuj Sharma, John Redford, Puneet K. Dokania, Romain Mueller
Abstract: Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.
Authors: Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Abstract: Active Real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing. 2) Decision: raising proactive interaction in proper situations, 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicted capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at \url{https://github.com/Mark12Ding/Dispider}.
Authors: Tingyang Zhang, Chen Wang, Zhiyang Dou, Qingzhe Gao, Jiahui Lei, Baoquan Chen, Lingjie Liu
Abstract: In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves the state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.
Authors: Haosheng Zhang, Hao Huang
Abstract: In the domain of 3D object classification, a fundamental challenge lies in addressing the scarcity of labeled data, which limits the applicability of traditional data-intensive learning paradigms. This challenge is particularly pronounced in few-shot learning scenarios, where the objective is to achieve robust generalization from minimal annotated samples. To overcome these limitations, it is crucial to identify and leverage the most salient and discriminative features of 3D objects, thereby enhancing learning efficiency and reducing dependency on large-scale labeled datasets. This work introduces RW-Net, a novel framework designed to address the challenges above by integrating Rate-Distortion Explanation (RDE) and wavelet transform into a state-of-the-art projection-based 3D object classification architecture. The proposed method capitalizes on RDE to extract critical features by identifying and preserving the most informative data components while reducing redundancy. This process ensures the retention of essential information for effective decision-making, optimizing the model's ability to learn from limited data. Complementing RDE, incorporating the wavelet transform further enhances the framework's capability to generalize in low-data regimes. By emphasizing low-frequency components of the input data, the wavelet transform captures fundamental geometric and structural attributes of 3D objects. These attributes are instrumental in mitigating overfitting and improving the robustness of the learned representations across diverse tasks and domains. To validate the effectiveness of our RW-Net, we conduct extensive experiments on three datasets: ModelNet40, ModelNet40-C, and ScanObjectNN for few-shot 3D object classification. The results demonstrate that our approach achieves state-of-the-art performance and exhibits superior generalization and robustness in few-shot learning scenarios.
Authors: Xiaoxiao He, Haizhou Shi, Ligong Han, Chaowei Tan, Bo Liu, Zihao Xu, Meng Ye, Leon Axel, Kang Li, Dimitris Metaxas
Abstract: Cardiovascular disease (CVD) and cardiac dyssynchrony are major public health problems in the United States. Precise cardiac image segmentation is crucial for extracting quantitative measures that help categorize cardiac dyssynchrony. However, achieving high accuracy often depends on centralizing large datasets from different hospitals, which can be challenging due to privacy concerns. To solve this problem, Federated Learning (FL) is proposed to enable decentralized model training on such data without exchanging sensitive information. However, bandwidth limitations and data heterogeneity remain as significant challenges in conventional FL algorithms. In this paper, we propose a novel efficient and adaptive federate learning method for cardiac segmentation that improves model performance while reducing the bandwidth requirement. Our method leverages the low-rank adaptation (LoRA) to regularize model weight update and reduce communication overhead. We also propose a \mymethod{} aggregation technique to address data heterogeneity among clients. This technique adaptively penalizes the aggregated weights from different clients by comparing the validation accuracy in each client, allowing better generalization performance and fast local adaptation. In-client and cross-client evaluations on public cardiac MR datasets demonstrate the superiority of our method over other LoRA-based federate learning approaches.
Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy
Abstract: The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
Authors: Jathushan Rajasegaran, Xinlei Chen, Rulilong Li, Christoph Feichtenhofer, Jitendra Malik, Shiry Ginosar
Abstract: This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learns good semantic abstractions, it is not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at https://brjathu.github.io/gmae
Authors: Atharva Divekar, Atharva Sonawane
Abstract: The AUTO-PCOS Classification Challenge seeks to advance the diagnostic capabilities of artificial intelligence (AI) in identifying Polycystic Ovary Syndrome (PCOS) through automated classification of healthy and unhealthy ultrasound frames. This report outlines our methodology for building a robust AI pipeline utilizing transfer learning with the InceptionV3 architecture to achieve high accuracy in binary classification. Preprocessing steps ensured the dataset was optimized for training, validation, and testing, while interpretability methods like LIME and saliency maps provided valuable insights into the model's decision-making. Our approach achieved an accuracy of 90.52%, with precision, recall, and F1-score metrics exceeding 90% on validation data, demonstrating its efficacy. The project underscores the transformative potential of AI in healthcare, particularly in addressing diagnostic challenges like PCOS. Key findings, challenges, and recommendations for future enhancements are discussed, highlighting the pathway for creating reliable, interpretable, and scalable AI-driven medical diagnostic tools.
Authors: Yang Qi, Jiaxin Cai, Jing Lu, Runqing Xiong, Rongshang Chen, Liping Zheng, Duo Ma
Abstract: Prenatal ultrasound evaluates fetal growth and detects congenital abnormalities during pregnancy, but the examination of ultrasound images by radiologists requires expertise and sophisticated equipment, which would otherwise fail to improve the rate of identifying specific types of fetal central nervous system (CNS) abnormalities and result in unnecessary patient examinations. We construct a deep learning model to improve the overall accuracy of the diagnosis of fetal cranial anomalies to aid prenatal diagnosis. In our collected multi-center dataset of fetal craniocerebral anomalies covering four typical anomalies of the fetal central nervous system (CNS): anencephaly, encephalocele (including meningocele), holoprosencephaly, and rachischisis, patient-level prediction accuracy reaches 94.5%, with an AUROC value of 99.3%. In the subgroup analyzes, our model is applicable to the entire gestational period, with good identification of fetal anomaly types for any gestational period. Heatmaps superimposed on the ultrasound images not only provide a visual interpretation for the algorithm but also provide an intuitive visual aid to the physician by highlighting key areas that need to be reviewed, helping the physician to quickly identify and validate key areas. Finally, the retrospective reader study demonstrates that by combining the automatic prediction of the DL system with the professional judgment of the radiologist, the diagnostic accuracy and efficiency can be effectively improved and the misdiagnosis rate can be reduced, which has an important clinical application prospect.
Authors: Delin An, Chaoli Wang
Abstract: Unlike their line-based counterparts, surface-based techniques have yet to be thoroughly investigated in flow visualization due to their significant placement, speed, perception, and evaluation challenges. This paper presents SurfPatch, a novel framework supporting exploratory stream surface visualization. To begin with, we translate the issue of surface placement to surface selection and trace a large number of stream surfaces from a given flow field dataset. Then, we introduce a three-stage process: vertex-level classification, patch-level matching, and surface-level clustering that hierarchically builds the connection between vertices and patches and between patches and surfaces. This bottom-up approach enables fine-grained, multiscale patch-level matching, sharply contrasts surface-level matching offered by existing works, and provides previously unavailable flexibility during querying. We design an intuitive visual interface for users to conveniently visualize and analyze the underlying collection of stream surfaces in an exploratory manner. SurfPatch is not limited to stream surfaces traced from steady flow datasets. We demonstrate its effectiveness through experiments on stream surfaces produced from steady and unsteady flows as well as isosurfaces extracted from scalar fields. The code is available at https://github.com/adlsn/SurfPatch.
Authors: Aashish Cheruvu, Nathaniel Rigoni
Abstract: This paper presents a novel deep learning-based approach named RealDiffFusionNet incorporating Neural Controlled Differential Equations (Neural CDE) - time series models that are robust in handling irregularly sampled data - and multi-head attention to align relevant multimodal context (image data, time invariant data, etc.) at each time point. Long short-term memory (LSTM) models were also used as a baseline. Two different datasets were used: a data from the Open-Source Imaging Consortium (OSIC) containing structured time series data of demographics and lung function with a baseline CT scan of the lungs and the second from the Alzheimer's Disease Neuroimaging Initiative (ADNI) containing a series of MRI scans along with demographics, physical examinations, and cognitive assessment data. An ablation study was performed to understand the role of CDEs, multimodal data, attention fusion, and interpolation strategies on model performance. When the baseline models were evaluated, the use of multimodal data resulted in an improvement in Neural CDE performance, with a lower test RMSE. Additionally, the performance of multimodal Neural CDE was also superior to multimodal LSTM. In the attention-based architectures, fusion through concatenation and rectilinear interpolation were found to improve model performance. The performance of the proposed RealDiffFusionNet was found to be superior (0.2570) to all models. For the ADNI dataset, between the Neural-CDE and LSTM models trained only on the structured data, the test RMSE were comparable (0.471 for LSTM vs. 0.4581 Neural-CDE). Furthermore, the addition of image features from patients' MRI series resulted in an improvement in performance, with a lower test RMSE (0.4372 with multimodal vs 0.4581 with structured data). RealDiffFusionNet has shown promise in utilizing CDEs and multimodal data to accurately predict disease progression.
Authors: Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li
Abstract: With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term ``safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at \url{https://github.com/Ziwei-Zheng/SAHs}.
Authors: Orhan Demirci, Bulent Yilmaz
Abstract: This paper introduces Tree-NET, a novel framework for medical image segmentation that leverages bottleneck feature supervision to enhance both segmentation accuracy and computational efficiency. While previous studies have employed bottleneck feature supervision, their applications have largely been limited to the training phase, offering no computational benefits during training or evaluation. To the best of our knowledge, this study is the first to propose a framework that incorporates two additional training phases for segmentation models, utilizing bottleneck features at both input and output stages. This approach significantly improves computational performance by reducing input and output dimensions with a negligible addition to parameter count, without compromising accuracy. Tree-NET features a three-layer architecture comprising Encoder-Net and Decoder-Net, which are autoencoders designed to compress input and label data, respectively, and Bridge-Net, a segmentation framework that supervises the bottleneck features. By focusing on dense, compressed representations, Tree-NET enhances operational efficiency and can be seamlessly integrated into existing segmentation models without altering their internal structures or increasing model size. We evaluate Tree-NET on two critical segmentation tasks -- skin lesion and polyp segmentation -- using various backbone models, including U-NET variants and Polyp-PVT. Experimental results demonstrate that Tree-NET reduces FLOPs by a factor of 4 to 13 and decreases memory usage, while achieving comparable or superior accuracy compared to the original architectures. These findings underscore Tree-NET's potential as a robust and efficient solution for medical image segmentation.
Authors: Yinchuan Wang, Bin Ren, Xiang Zhang, Pengyu Wang, Chaoqun Wang, Rui Song, Yibin Li, Max Q. -H. Meng
Abstract: LiDAR-based SLAM is recognized as one effective method to offer localization guidance in rough environments. However, off-the-shelf LiDAR-based SLAM methods suffer from significant pose estimation drifts, particularly components relevant to the vertical direction, when passing to uneven terrains. This deficiency typically leads to a conspicuously distorted global map. In this article, a LiDAR-based SLAM method is presented to improve the accuracy of pose estimations for ground vehicles in rough terrains, which is termed Rotation-Optimized LiDAR-Only (ROLO) SLAM. The method exploits a forward location prediction to coarsely eliminate the location difference of consecutive scans, thereby enabling separate and accurate determination of the location and orientation at the front-end. Furthermore, we adopt a parallel-capable spatial voxelization for correspondence-matching. We develop a spherical alignment-guided rotation registration within each voxel to estimate the rotation of vehicle. By incorporating geometric alignment, we introduce the motion constraint into the optimization formulation to enhance the rapid and effective estimation of LiDAR's translation. Subsequently, we extract several keyframes to construct the submap and exploit an alignment from the current scan to the submap for precise pose estimation. Meanwhile, a global-scale factor graph is established to aid in the reduction of cumulative errors. In various scenes, diverse experiments have been conducted to evaluate our method. The results demonstrate that ROLO-SLAM excels in pose estimation of ground vehicles and outperforms existing state-of-the-art LiDAR SLAM frameworks.
Authors: Zhongyi Zhou, Yaxin Peng, Pin Yi, Minjie Zhu, Chaomin Shen
Abstract: Continual Learning enables models to learn and adapt to new tasks while retaining prior knowledge.Introducing new tasks, however, can naturally lead to feature entanglement across tasks, limiting the model's capability to distinguish between new domain data.In this work, we propose a method called Feature Realignment through Experts on hyperSpHere in Continual Learning (Fresh-CL). By leveraging predefined and fixed simplex equiangular tight frame (ETF) classifiers on a hypersphere, our model improves feature separation both intra and inter tasks.However, the projection to a simplex ETF shifts with new tasks, disrupting structured feature representation of previous tasks and degrading performance. Therefore, we propose a dynamic extension of ETF through mixture of experts, enabling adaptive projections onto diverse subspaces to enhance feature representation.Experiments on 11 datasets demonstrate a 2\% improvement in accuracy compared to the strongest baseline, particularly in fine-grained datasets, confirming the efficacy of combining ETF and MoE to improve feature distinction in continual learning scenarios.
Authors: Chao Wang, Licheng Jiao, Jiaxuan Zhao, Lingling Li, Fang Liu, Shuyuan Yang
Abstract: Evolutionary algorithms (EAs) maintain populations through evolutionary operators to discover diverse solutions for complex tasks while gathering valuable knowledge, such as historical population data and fitness evaluations. However, traditional EAs face challenges in dynamically adapting to expanding knowledge bases, hindering the efficient exploitation of accumulated information and limiting adaptability to new situations. To address these issues, we introduce an Optimization Knowledge Adaptation Evolutionary Model (OKAEM), which features dynamic parameter adjustment using accumulated knowledge to enhance its optimization capabilities. OKAEM employs attention mechanisms to model the interactions among individuals, fitness landscapes, and genetic components separately, thereby parameterizing the evolutionary operators of selection, crossover, and mutation. These powerful learnable operators enable OKAEM to benefit from pre-learned extensive prior knowledge and self-tune with real-time evolutionary insights. Experimental results demonstrate that OKAEM: 1) exploits prior knowledge for significant performance gains across various knowledge transfer settings; 2) achieves competitive performance through self-tuning alone, even without prior knowledge; 3) outperforms state-of-the-art black-box baselines in a vision-language model tuning case; 4) can improve its optimization capabilities with growing knowledge; 5) is capable of emulating principles of natural selection and genetic recombination.
Authors: Guanghua He, Wangang Cheng, Hancan Zhu, Xiaohao Cai, Gaohang Yu
Abstract: Transfer learning, by leveraging knowledge from pre-trained models, has significantly enhanced the performance of target tasks. However, as deep neural networks scale up, full fine-tuning introduces substantial computational and storage challenges in resource-constrained environments, limiting its widespread adoption. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed to reduce computational complexity and storage requirements by minimizing the number of updated parameters. While matrix decomposition-based PEFT methods, such as LoRA, show promise, they struggle to fully capture the high-dimensional structural characteristics of model weights. In contrast, high-dimensional tensors offer a more natural representation of neural network weights, allowing for a more comprehensive capture of higher-order features and multi-dimensional interactions. In this paper, we propose tCURLoRA, a novel fine-tuning method based on tensor CUR decomposition. By concatenating pre-trained weight matrices into a three-dimensional tensor and applying tensor CUR decomposition, we update only the lower-order tensor components during fine-tuning, effectively reducing computational and storage overhead. Experimental results demonstrate that tCURLoRA outperforms existing PEFT methods in medical image segmentation tasks.
Authors: Ashiqur Rahman, Muhammad E. H. Chowdhury, Md Sharjis Ibne Wadud, Rusab Sarmun, Adam Mushtak, Sohaib Bassam Zoghoul, Israa Al-Hashimi
Abstract: Ischemic stroke, caused by cerebral vessel occlusion, presents substantial challenges in medical imaging due to the variability and subtlety of stroke lesions. Magnetic Resonance Imaging (MRI) plays a crucial role in diagnosing and managing ischemic stroke, yet existing segmentation techniques often fail to accurately delineate lesions. This study introduces a novel deep learning-based method for segmenting ischemic stroke lesions using multi-channel MRI modalities, including Diffusion Weighted Imaging (DWI), Apparent Diffusion Coefficient (ADC), and enhanced Diffusion Weighted Imaging (eDWI). The proposed architecture integrates DenseNet121 as the encoder with Self-Organized Operational Neural Networks (SelfONN) in the decoder, enhanced by Channel and Space Compound Attention (CSCA) and Double Squeeze-and-Excitation (DSE) blocks. Additionally, a custom loss function combining Dice Loss and Jaccard Loss with weighted averages is introduced to improve model performance. Trained and evaluated on the ISLES 2022 dataset, the model achieved Dice Similarity Coefficients (DSC) of 83.88% using DWI alone, 85.86% with DWI and ADC, and 87.49% with the integration of DWI, ADC, and eDWI. This approach not only outperforms existing methods but also addresses key limitations in current segmentation practices. These advancements significantly enhance diagnostic precision and treatment planning for ischemic stroke, providing valuable support for clinical decision-making.
Authors: Debjany Ghosh Aronno, Sumaiya Saeha
Abstract: Diabetic Retinopathy (DR) is a major cause of blindness worldwide, caused by damage to the blood vessels in the retina due to diabetes. Early detection and classification of DR are crucial for timely intervention and preventing vision loss. This work proposes an automated system for DR detection using Convolutional Neural Networks (CNNs) with a residual block architecture, which enhances feature extraction and model performance. To further improve the model's robustness, we incorporate advanced data augmentation techniques, specifically leveraging a Deep Convolutional Generative Adversarial Network (DCGAN) for generating diverse retinal images. This approach increases the variability of training data, making the model more generalizable and capable of handling real-world variations in retinal images. The system is designed to classify retinal images into five distinct categories, from No DR to Proliferative DR, providing an efficient and scalable solution for early diagnosis and monitoring of DR progression. The proposed model aims to support healthcare professionals in large-scale DR screening, especially in resource-constrained settings.
Authors: Kristopher Tapp
Abstract: Modern sampling methods create ensembles of district maps that score well on discrete compactness scores, whereas the Polsby-Popper and other shape-based scores remain highly relevant for building fair maps and litigating unfair ones. The aim of this paper is twofold. First, we introduce population-weighted versions of shape-based scores and show a precise sense in which this interpolates between shape-based and discrete scores. Second, we introduce a modification of the ReCom sampling method that produces ensembles of maps with improved shape-based compactness scores.
Authors: Yahya Sowti Khiabani, Farris Atif, Chieh Hsu, Sven Stahlmann, Tobias Michels, Sebastian Kramer, Benedikt Heidrich, M. Saquib Sarfraz, Julian Merten, Faezeh Tafazzoli
Abstract: We propose a holistic approach for deploying Small Language Models (SLMs) as function-calling agents within vehicles as edge devices, offering a more flexible and robust alternative to traditional rule-based systems. By leveraging SLMs, we simplify vehicle control mechanisms and enhance the user experience. Given the in-vehicle hardware constraints, we apply state-of-the-art model compression techniques, including structured pruning, healing, and quantization, ensuring that the model fits within the resource limitations while maintaining acceptable performance. Our work focuses on optimizing a representative SLM, Microsoft's Phi-3 mini, and outlines best practices for enabling embedded models, including compression, task-specific fine-tuning, and vehicle integration. We demonstrate that, despite significant reduction in model size which removes up to 2 billion parameters from the original model, our approach preserves the model's ability to handle complex in-vehicle tasks accurately and efficiently. Furthermore, by executing the model in a lightweight runtime environment, we achieve a generation speed of 11 tokens per second, making real-time, on-device inference feasible without hardware acceleration. Our results demonstrate the potential of SLMs to transform vehicle control systems, enabling more intuitive interactions between users and their vehicles for an enhanced driving experience.
Authors: Abbaas Alif Mohamed Nishar, Shrinivas Kudekar, Bernard Kintzing, Ashwin Ashok
Abstract: We present `Revelio', a real-world screen-camera communication system leveraging temporal flicker fusion in the OKLAB color space. Using spatially-adaptive flickering and encoding information in pixel region shapes, Revelio achieves visually imperceptible data embedding while remaining robust against noise, asynchronicity, and distortions in screen-camera channels, ensuring reliable decoding by standard smartphone cameras. The decoder, driven by a two-stage neural network, uses a weighted differential accumulator for precise frame detection and symbol recognition. Initial experiments demonstrate Revelio's effectiveness in interactive television, offering an unobtrusive method for meta-information transmission.
Authors: Ali Ghanbarzade, Hossein Soleimani
Abstract: The increasing reliance on Global Navigation Satellite Systems (GNSS), particularly the Global Positioning System (GPS), underscores the urgent need to safeguard these technologies against malicious threats such as spoofing and jamming. As the backbone for positioning, navigation, and timing (PNT) across various applications including transportation, telecommunications, and emergency services GNSS is vulnerable to deliberate interference that poses significant risks. Spoofing attacks, which involve transmitting counterfeit GNSS signals to mislead receivers into calculating incorrect positions, can result in serious consequences, from navigational errors in civilian aviation to security breaches in military operations. Furthermore, the lack of inherent security measures within GNSS systems makes them attractive targets for adversaries. While GNSS/GPS jamming and spoofing systems consist of numerous components, the ability to distinguish authentic signals from malicious ones is essential for maintaining system integrity. Recent advancements in machine learning and deep learning provide promising avenues for enhancing detection and mitigation strategies against these threats. This paper addresses both spoofing and jamming by tackling real-world challenges through machine learning, deep learning, and computer vision techniques. Through extensive experiments on two real-world datasets related to spoofing and jamming detection using advanced algorithms, we achieved state of the art results. In the GNSS/GPS jamming detection task, we attained approximately 99% accuracy, improving performance by around 5% compared to previous studies. Additionally, we addressed a challenging tasks related to spoofing detection, yielding results that underscore the potential of machine learning and deep learning in this domain.
Authors: Alec S. Xu, Can Yaras, Peng Wang, Qing Qu
Abstract: Deep neural networks have attained remarkable success across diverse classification tasks. Recent empirical studies have shown that deep networks learn features that are linearly separable across classes. However, these findings often lack rigorous justifications, even under relatively simple settings. In this work, we address this gap by examining the linear separation capabilities of shallow nonlinear networks. Specifically, inspired by the low intrinsic dimensionality of image data, we model inputs as a union of low-dimensional subspaces (UoS) and demonstrate that a single nonlinear layer can transform such data into linearly separable sets. Theoretically, we show that this transformation occurs with high probability when using random weights and quadratic activations. Notably, we prove this can be achieved when the network width scales polynomially with the intrinsic dimension of the data rather than the ambient dimension. Experimental results corroborate these theoretical findings and demonstrate that similar linear separation properties hold in practical scenarios beyond our analytical scope. This work bridges the gap between empirical observations and theoretical understanding of the separation capacity of nonlinear networks, offering deeper insights into model interpretability and generalization.
Authors: Hao Ziang, Jingsi Zhang, Lixian Li
Abstract: Recently, the state-of-art models for medical image segmentation is U-Net and their variants. These networks, though succeeding in deriving notable results, ignore the practical problem hanging over the medical segmentation field: overfitting and small dataset. The over-complicated deep neural networks unnecessarily extract meaningless information, and a majority of them are not suitable for lung slice CT image segmentation task. To overcome the two limitations, we proposed a new whole-process network merging advanced UNet++ model. The network comprises three main modules: data augmentation, optimized neural network, parameter fine-tuning. By incorporating diverse methods, the training results demonstrate a significant advantage over similar works, achieving leading accuracy of 98.03% with the lowest overfitting. potential. Our network is remarkable as one of the first to target on lung slice CT images.
Authors: Ramin Zarei-Sabzevar, Ahad Harati
Abstract: This paper proposes a novel Deep Positive-Negative Prototype (DPNP) model that combines prototype-based learning (PbL) with discriminative methods to improve class compactness and separability in deep neural networks. While PbL traditionally emphasizes interpretability by classifying samples based on their similarity to representative prototypes, it struggles with creating optimal decision boundaries in complex scenarios. Conversely, discriminative methods effectively separate classes but often lack intuitive interpretability. Toward exploiting advantages of these two approaches, the suggested DPNP model bridges between them by unifying class prototypes with weight vectors, thereby establishing a structured latent space that enables accurate classification using interpretable prototypes alongside a properly learned feature representation. Based on this central idea of unified prototype-weight representation, Deep Positive Prototype (DPP) is formed in the latent space as a representative for each class using off-the-shelf deep networks as feature extractors. Then, rival neighboring class DPPs are treated as implicit negative prototypes with repulsive force in DPNP, which push away DPPs from each other. This helps to enhance inter-class separation without the need for any extra parameters. Hence, through a novel loss function that integrates cross-entropy, prototype alignment, and separation terms, DPNP achieves well-organized feature space geometry, maximizing intra-class compactness and inter-class margins. We show that DPNP can organize prototypes in nearly regular positions within feature space, such that it is possible to achieve competitive classification accuracy even in much lower-dimensional feature spaces. Experimental results on several datasets demonstrate that DPNP outperforms state-of-the-art models, while using smaller networks.
Authors: Assaf Lahiany, Yehudit Aperstein
Abstract: For many practical applications, a high computational cost of inference over deep network architectures might be unacceptable. A small degradation in the overall inference accuracy might be a reasonable price to pay for a significant reduction in the required computational resources. In this work, we describe a method for introducing "shortcuts" into the DNN feedforward inference process by skipping costly feedforward computations whenever possible. The proposed method is based on the previously described BranchyNet (Teerapittayanon et al., 2016) and the EEnet (Demir, 2019) architectures that jointly train the main network and early exit branches. We extend those methods by attaching branches to pre-trained models and, thus, eliminating the need to alter the original weights of the network. We also suggest a new branch architecture based on convolutional building blocks to allow enough training capacity when applied on large DNNs. The proposed architecture includes confidence heads that are used for predicting the confidence level in the corresponding early exits. By defining adjusted thresholds on these confidence extensions, we can control in real-time the amount of data exiting from each branch and the overall tradeoff between speed and accuracy of our model. In our experiments, we evaluate our method using image datasets (SVHN and CIFAR10) and several DNN architectures (ResNet, DenseNet, VGG) with varied depth. Our results demonstrate that the proposed method enables us to reduce the average inference computational cost and further controlling the tradeoff between the model accuracy and the computation cost.
Authors: Takashi Harada, Takehiro Motomitsu, Katsuhiko Hayashi, Yusuke Sakai, Hidetaka Kamigaito
Abstract: In recent years, there has been a notable increase in research on machine learning models for music retrieval and generation systems that are capable of taking natural language sentences as inputs. However, there is a scarcity of large-scale publicly available datasets, consisting of music data and their corresponding natural language descriptions known as music captions. In particular, non-musical information such as suitable situations for listening to a track and the emotions elicited upon listening is crucial for describing music. This type of information is underrepresented in existing music caption datasets due to the challenges associated with extracting it directly from music data. To address this issue, we propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images, and validated the effectiveness of our approach through human evaluations. Additionally, we created a dataset with approximately 360,000 captions containing non-musical aspects. Leveraging this dataset, we trained a music retrieval model and demonstrated its effectiveness in music retrieval tasks through evaluation.
Authors: Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang, Sungchul Choi
Abstract: Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
Authors: Minoo Dolatabadi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Abstract: Autonomous vehicles have gained significant attention due to technological advancements and their potential to transform transportation. A critical challenge in this domain is precise localization, particularly in LiDAR-based map matching, which is prone to errors due to degeneracy in the data. Most sensor fusion techniques, such as the Kalman filter, rely on accurate error covariance estimates for each sensor to improve localization accuracy. However, obtaining reliable covariance values for map matching remains a complex task. To address this challenge, we propose a neural network-based framework for predicting localization error covariance in LiDAR map matching. To achieve this, we introduce a novel dataset generation method specifically designed for error covariance estimation. In our evaluation using a Kalman filter, we achieved a 2 cm improvement in localization accuracy, a significant enhancement in this domain.
Authors: Yibo Zhang
Abstract: Medical image segmentation is a critical task in medical imaging analysis. Traditional CNN-based methods struggle with modeling long-range dependencies, while Transformer-based models, despite their success, suffer from quadratic computational complexity. To address these limitations, we propose KM-UNet, a novel U-shaped network architecture that combines the strengths of Kolmogorov-Arnold Networks (KANs) and state-space models (SSMs). KM-UNet leverages the Kolmogorov-Arnold representation theorem for efficient feature representation and SSMs for scalable long-range modeling, achieving a balance between accuracy and computational efficiency. We evaluate KM-UNet on five benchmark datasets: ISIC17, ISIC18, CVC, BUSI, and GLAS. Experimental results demonstrate that KM-UNet achieves competitive performance compared to state-of-the-art methods in medical image segmentation tasks. To the best of our knowledge, KM-UNet is the first medical image segmentation framework integrating KANs and SSMs. This work provides a valuable baseline and new insights for the development of more efficient and interpretable medical image segmentation systems. The code is open source at https://github.com/2760613195/KM_UNet Keywords:KAN,Manba, state-space models,UNet, Medical image segmentation, Deep learning
Authors: Rebecca Ramnauth, Frederick Shic, Brian Scassellati
Abstract: Atypical gaze behavior is a diagnostic hallmark of Autism Spectrum Disorder (ASD), playing a substantial role in the social and communicative challenges that individuals with ASD face. This study explores the impacts of a month-long, in-home intervention designed to promote triadic interactions between a social robot, a child with ASD, and their caregiver. Our results indicate that the intervention successfully promoted appropriate gaze behavior, encouraging children with ASD to follow the robot's gaze, resulting in more frequent and prolonged instances of spontaneous eye contact and joint attention with their caregivers. Additionally, we observed specific timelines for behavioral variability and novelty effects among users. Furthermore, diagnostic measures for ASD emerged as strong predictors of gaze patterns for both caregivers and children. These results deepen our understanding of ASD gaze patterns and highlight the potential for clinical relevance of robot-assisted interventions.
Authors: Iustin S\^irbu, Iulia-Renata S\^irbu, Jasmina Bogojeska, Traian Rebedea
Abstract: Medical imaging is crucial for diagnosing, monitoring, and treating medical conditions. The medical reports of radiology images are the primary medium through which medical professionals attest their findings, but their writing is time consuming and requires specialized clinical expertise. The automated generation of radiography reports has thus the potential to improve and standardize patient care and significantly reduce clinicians workload. Through our work, we have designed and evaluated an end-to-end transformer-based method to generate accurate and factually complete radiology reports for X-ray images. Additionally, we are the first to introduce curriculum learning for end-to-end transformers in medical imaging and demonstrate its impact in obtaining improved performance. The experiments have been conducted using the MIMIC-CXR-JPG database, the largest available chest X-ray dataset. The results obtained are comparable with the current state-of-the-art on the natural language generation (NLG) metrics BLEU and ROUGE-L, while setting new state-of-the-art results on F1 examples-averaged, F1-macro and F1-micro metrics for clinical accuracy and on the METEOR metric widely used for NLG.
Authors: Amol Khanna, Chenyi Ling, Derek Everett, Edward Raff, Nathan Inkawhich
Abstract: Existing methods for out-of-distribution (OOD) detection use various techniques to produce a score, separate from classification, that determines how ``OOD'' an input is. Our insight is that OOD detection can be simplified by using a neural network architecture which can effectively merge classification and OOD detection into a single step. Radial basis function networks (RBFNs) inherently link classification confidence and OOD detection; however, these networks have lost popularity due to the difficult of training them in a multi-layer fashion. In this work, we develop a multi-layer radial basis function network (MLRBFN) which can be easily trained. To ensure that these networks are also effective for OOD detection, we develop a novel depression mechanism. We apply MLRBFNs as standalone classifiers and as heads on top of pretrained feature extractors, and find that they are competitive with commonly used methods for OOD detection. Our MLRBFN architecture demonstrates a promising new direction for OOD detection methods.
Authors: Hongyi Miao, Jun Jia, Yankun Cao, Yingjie Zhou, Yanwei Jiang, Zhi Liu, Guangtao Zhai
Abstract: With the dramatic upsurge in the volume of ultrasound examinations, low-quality ultrasound imaging has gradually increased due to variations in operator proficiency and imaging circumstances, imposing a severe burden on diagnosis accuracy and even entailing the risk of restarting the diagnosis in critical cases. To assist clinicians in selecting high-quality ultrasound images and ensuring accurate diagnoses, we introduce Ultrasound-QBench, a comprehensive benchmark that systematically evaluates multimodal large language models (MLLMs) on quality assessment tasks of ultrasound images. Ultrasound-QBench establishes two datasets collected from diverse sources: IVUSQA, consisting of 7,709 images, and CardiacUltraQA, containing 3,863 images. These images encompassing common ultrasound imaging artifacts are annotated by professional ultrasound experts and classified into three quality levels: high, medium, and low. To better evaluate MLLMs, we decompose the quality assessment task into three dimensionalities: qualitative classification, quantitative scoring, and comparative assessment. The evaluation of 7 open-source MLLMs as well as 1 proprietary MLLMs demonstrates that MLLMs possess preliminary capabilities for low-level visual tasks in ultrasound image quality classification. We hope this benchmark will inspire the research community to delve deeper into uncovering and enhancing the untapped potential of MLLMs for medical imaging tasks.
Authors: Binyu Zhang, Zhu Meng, Junhao Dong, Fei Su, Zhicheng Zhao
Abstract: Survival prediction is a crucial task in the medical field and is essential for optimizing treatment options and resource allocation. However, current methods often rely on limited data modalities, resulting in suboptimal performance. In this paper, we propose an Integrated Cross-modal Fusion Network (ICFNet) that integrates histopathology whole slide images, genomic expression profiles, patient demographics, and treatment protocols. Specifically, three types of encoders, a residual orthogonal decomposition module and a unification fusion module are employed to merge multi-modal features to enhance prediction accuracy. Additionally, a balanced negative log-likelihood loss function is designed to ensure fair training across different patients. Extensive experiments demonstrate that our ICFNet outperforms state-of-the-art algorithms on five public TCGA datasets, including BLCA, BRCA, GBMLGG, LUAD, and UCEC, and shows its potential to support clinical decision-making and advance precision medicine. The codes are available at: https://github.com/binging512/ICFNet.
Authors: Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji
Abstract: Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
Authors: Zhaoyi Yan, Zhijie Sang, Yiming Zhang, Yuhao Fu, Baoyi He, Qi Zhou, Yining Di, Chunlin Ji, Shengyu Zhang, Fei Wu, Hongxia Yang
Abstract: Large Language Models (LLMs) have demonstrated strong performance across various reasoning tasks, yet building a single model that consistently excels across all domains remains challenging. This paper addresses this problem by exploring strategies to integrate multiple domain-specialized models into an efficient pivot model.We propose two fusion strategies to combine the strengths of multiple LLMs: (1) a pairwise, multi-step fusion approach that sequentially distills each source model into the pivot model, followed by a weight merging step to integrate the distilled models into the final model. This method achieves strong performance but requires substantial training effort; and (2) a unified fusion approach that aggregates all source models' outputs simultaneously.To improve the fusion process, we introduce a novel Rate-Skewness Adaptive Fusion (RSAF) technique, which dynamically adjusts top-K ratios during parameter merging for enhanced flexibility and stability.Furthermore, we propose an uncertainty-based weighting method for the unified approach, which dynamically balances the contributions of source models and outperforms other logits/distribution ensemble methods.We achieved accuracy improvements of 9.27%, 8.80%, and 8.89% on the GSM8K, MATH, and HumanEval tasks, respectively.
Authors: Arthur Aubret, C\'eline Teuli\`ere, Jochen Triesch
Abstract: Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods and show that it outperforms previous methods on several datasets, including ImageNet-1K where it achieves 71.5% of Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.
Authors: Rezkellah Noureddine Khiati, Pierre-Yves Brillet, Radu Ispas, Catalin Fetita
Abstract: Accurate quantification of the extent of lung pathological patterns (fibrosis, ground-glass opacity, emphysema, consolidation) is prerequisite for diagnosis and follow-up of interstitial lung diseases. However, segmentation is challenging due to the significant class imbalance between healthy and pathological tissues. This paper addresses this issue by leveraging a diffusion model for data augmentation applied during training an AI model. Our approach generates synthetic pathological tissue patches while preserving essential shape characteristics and intricate details specific to each tissue type. This method enhances the segmentation process by increasing the occurence of underrepresented classes in the training data. We demonstrate that our diffusion-based augmentation technique improves segmentation accuracy across all pathological tissue types, particularly for the less common patterns. This advancement contributes to more reliable automated analysis of lung CT scans, potentially improving clinical decision-making and patient outcomes
Authors: Utkarsh Prakash Srivastava, Toshiaki Fujii
Abstract: The vast volume of medical image data necessitates efficient compression techniques to support remote healthcare services. This paper explores Region of Interest (ROI) coding to address the balance between compression rate and image quality. By leveraging UNET segmentation on the Brats 2020 dataset, we accurately identify tumor regions, which are critical for diagnosis. These regions are then subjected to High Efficiency Video Coding (HEVC) for compression, enhancing compression rates while preserving essential diagnostic information. This approach ensures that critical image regions maintain their quality, while non-essential areas are compressed more. Our method optimizes storage space and transmission bandwidth, meeting the demands of telemedicine and large-scale medical imaging. Through this technique, we provide a robust solution that maintains the integrity of vital data and improves the efficiency of medical image handling.
Authors: Xianhao Zhou, Jianghao Wu, Huangxuan Zhao, Lei Chen, Shaoting Zhang, Guotai Wang, Guotai Wang
Abstract: Generating synthetic Computed Tomography (CT) images from Cone Beam Computed Tomography (CBCT) is desirable for improving the image quality of CBCT. Existing synthetic CT (sCT) generation methods using Convolutional Neural Networks (CNN) and Transformers often face difficulties in effectively capturing both global and local features and contrasts for high-quality sCT generation. In this work, we propose a Global-Local Feature and Contrast learning (GLFC) framework for sCT generation. First, a Mamba-Enhanced UNet (MEUNet) is introduced by integrating Mamba blocks into the skip connections of a high-resolution UNet for effective global and local feature learning. Second, we propose a Multiple Contrast Loss (MCL) that calculates synthetic loss at different intensity windows to improve quality for both soft tissues and bone regions. Experiments on the SynthRAD2023 dataset demonstrate that GLFC improved the SSIM of sCT from 77.91% to 91.50% compared with the original CBCT, and significantly outperformed several existing methods for sCT generation. The code is available at https://github.com/intelland/GLFC
Authors: Pegah Khayatan, Mustafa Shukor, Jayneel Parekh, Matthieu Cord
Abstract: Multimodal LLMs have reached remarkable levels of proficiency in understanding multimodal inputs, driving extensive research to develop increasingly powerful models. However, much less attention has been paid to understanding and explaining the underlying mechanisms of these models. Most existing explainability research examines these models only in their final states, overlooking the dynamic representational shifts that occur during training. In this work, we systematically analyze the evolution of hidden state representations to reveal how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace changes in encoded concepts across modalities as training progresses. We also demonstrate the use of shift vectors to capture these concepts changes. These shift vectors allow us to recover fine-tuned concepts by shifting those in the original model. Finally, we explore the practical impact of our findings on model steering, showing that we can adjust multimodal LLMs behaviors without any training, such as modifying answer types, captions style, or biasing the model toward specific responses. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks. The code for this project is publicly available at https://github.com/mshukor/xl-vlms.
Authors: Arda Atal{\i}k, Sumit Chopra, Daniel K. Sodickson
Abstract: Reducing MRI scan times can improve patient care and lower healthcare costs. Many acceleration methods are designed to reconstruct diagnostic-quality images from limited sets of acquired $\textit{k}$-space data. This task can be framed as a linear inverse problem (LIP), where, as a result of undersampling, the forward operator may become rank-deficient or exhibit small singular values. This results in ambiguities in reconstruction, in which multiple generally incorrect or non-diagnostic images can map to the same acquired data. To address such ambiguities, it is crucial to incorporate prior knowledge, for example in the form of regularization. Another form of prior knowledge less commonly used in medical imaging is contextual side information garnered from other sources than the current acquisition. Here, we propose the $\textbf{T}$rust-$\textbf{G}$uided $\textbf{V}$ariational $\textbf{N}$etwork $\textbf{(TGVN)}$, a novel end-to-end deep learning framework that effectively integrates side information into LIPs. TGVN eliminates undesirable solutions from the ambiguous space of the forward operator while remaining faithful to the acquired data. We demonstrate its effectiveness in multi-coil, multi-contrast MR image reconstruction, where incomplete or low-quality measurements from one contrast are used as side information to reconstruct high-quality images of another contrast from heavily under-sampled data. Our method is robust across different contrasts, anatomies, and field strengths. Compared to baselines that also utilize side information, TGVN achieves superior image quality at challenging under-sampling levels, drastically speeding up acquisition while minimizing hallucinations. Our approach is also versatile enough to incorporate many different types of side information (including previous scans or even text) into any LIP.
Authors: Mehmet Onurcan Kaya, Figen S. Oktem
Abstract: Diffusion models have demonstrated their utility as learned priors for solving various inverse problems. However, most existing approaches are limited to linear inverse problems. This paper exploits the efficient and unsupervised posterior sampling framework of Denoising Diffusion Restoration Models (DDRM) for the solution of nonlinear phase retrieval problem, which requires reconstructing an image from its noisy intensity-only measurements such as Fourier intensity. The approach combines the model-based alternating-projection methods with the DDRM to utilize pretrained unconditional diffusion priors for phase retrieval. The performance is demonstrated through both simulations and experimental data. Results demonstrate the potential of this approach for improving the alternating-projection methods as well as its limitations.
Authors: Yiliang Chen, Steven SC Ho, Cheng Xu, Yao Jie Xie, Wing-Fai Yeung, Shengfeng He, Jing Qin
Abstract: Tongue diagnosis is a vital tool in Western and Traditional Chinese Medicine, providing key insights into a patient's health by analyzing tongue attributes. The COVID-19 pandemic has heightened the need for accurate remote medical assessments, emphasizing the importance of precise tongue attribute recognition via telehealth. To address this, we propose a Sign-Oriented multi-label Attributes Detection framework. Our approach begins with an adaptive tongue feature extraction module that standardizes tongue images and mitigates environmental factors. This is followed by a Sign-oriented Network (SignNet) that identifies specific tongue attributes, emulating the diagnostic process of experienced practitioners and enabling comprehensive health evaluations. To validate our methodology, we developed an extensive tongue image dataset specifically designed for telemedicine. Unlike existing datasets, ours is tailored for remote diagnosis, with a comprehensive set of attribute labels. This dataset will be openly available, providing a valuable resource for research. Initial tests have shown improved accuracy in detecting various tongue attributes, highlighting our framework's potential as an essential tool for remote medical assessments.
Authors: Alhassan Mumuni, Fuseini Mumuni
Abstract: Generative artificial intelligence (AI) systems based on large-scale pretrained foundation models (PFMs) such as vision-language models, large language models (LLMs), diffusion models and vision-language-action (VLA) models have demonstrated the ability to solve complex and truly non-trivial AI problems in a wide variety of domains and contexts. Multimodal large language models (MLLMs), in particular, learn from vast and diverse data sources, allowing rich and nuanced representations of the world and, thereby, providing extensive capabilities, including the ability to reason, engage in meaningful dialog; collaborate with humans and other agents to jointly solve complex problems; and understand social and emotional aspects of humans. Despite this impressive feat, the cognitive abilities of state-of-the-art LLMs trained on large-scale datasets are still superficial and brittle. Consequently, generic LLMs are severely limited in their generalist capabilities. A number of foundational problems -- embodiment, symbol grounding, causality and memory -- are required to be addressed for LLMs to attain human-level general intelligence. These concepts are more aligned with human cognition and provide LLMs with inherent human-like cognitive properties that support the realization of physically-plausible, semantically meaningful, flexible and more generalizable knowledge and intelligence. In this work, we discuss the aforementioned foundational issues and survey state-of-the art approaches for implementing these concepts in LLMs. Specifically, we discuss how the principles of embodiment, symbol grounding, causality and memory can be leveraged toward the attainment of artificial general intelligence (AGI) in an organic manner.
Authors: Yonggan Fu, Wuyang Chen, Haotao Wang, Haoran Li, Yingyan Celine Lin, Zhangyang Wang
Abstract: The compression of Generative Adversarial Networks (GANs) has lately drawn attention, due to the increasing demand for deploying GANs into mobile devices for numerous applications such as image translation, enhancement and editing. However, compared to the substantial efforts to compressing other deep models, the research on compressing GANs (usually the generators) remains at its infancy stage. Existing GAN compression algorithms are limited to handling specific GAN architectures and losses. Inspired by the recent success of AutoML in deep compression, we introduce AutoML to GAN compression and develop an AutoGAN-Distiller (AGD) framework. Starting with a specifically designed efficient search space, AGD performs an end-to-end discovery for new efficient generators, given the target computational resource constraints. The search is guided by the original GAN model via knowledge distillation, therefore fulfilling the compression. AGD is fully automatic, standalone (i.e., needing no trained discriminators), and generically applicable to various GAN models. We evaluate AGD in two representative GAN tasks: image translation and super resolution. Without bells and whistles, AGD yields remarkably lightweight yet more competitive compressed models, that largely outperform existing alternatives. Our codes and pretrained models are available at https://github.com/TAMU-VITA/AGD.
Authors: Yonggan Fu, Haoran You, Yang Zhao, Yue Wang, Chaojian Li, Kailash Gopalakrishnan, Zhangyang Wang, Yingyan Celine Lin
Abstract: Recent breakthroughs in deep neural networks (DNNs) have fueled a tremendous demand for intelligent edge devices featuring on-site learning, while the practical realization of such systems remains a challenge due to the limited resources available at the edge and the required massive training costs for state-of-the-art (SOTA) DNNs. As reducing precision is one of the most effective knobs for boosting training time/energy efficiency, there has been a growing interest in low-precision DNN training. In this paper, we explore from an orthogonal direction: how to fractionally squeeze out more training cost savings from the most redundant bit level, progressively along the training trajectory and dynamically per input. Specifically, we propose FracTrain that integrates (i) progressive fractional quantization which gradually increases the precision of activations, weights, and gradients that will not reach the precision of SOTA static quantized DNN training until the final training stage, and (ii) dynamic fractional quantization which assigns precisions to both the activations and gradients of each layer in an input-adaptive manner, for only "fractionally" updating layer parameters. Extensive simulations and ablation studies (six models, four datasets, and three training settings including standard, adaptation, and fine-tuning) validate the effectiveness of FracTrain in reducing computational cost and hardware-quantified energy/latency of DNN training while achieving a comparable or better (-0.12%~+1.87%) accuracy. For example, when training ResNet-74 on CIFAR-10, FracTrain achieves 77.6% and 53.5% computational cost and training latency savings, respectively, compared with the best SOTA baseline, while achieving a comparable (-0.07%) accuracy. Our codes are available at: https://github.com/RICE-EIC/FracTrain.
Authors: Valter Estevam, Rayson Laroca, Helio Pedrini, David Menotti
Abstract: We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a clustering method to group representations producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation leverages the performance of the dense video captioning task in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in the captioning subtask compared to the methods that explore only visual features, as well as a competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
Authors: Yonggan Fu, Shunyao Zhang, Shang Wu, Cheng Wan, Yingyan Celine Lin
Abstract: Vision transformers (ViTs) have recently set off a new wave in neural architecture design thanks to their record-breaking performance in various vision tasks. In parallel, to fulfill the goal of deploying ViTs into real-world vision applications, their robustness against potential malicious attacks has gained increasing attention. In particular, recent works show that ViTs are more robust against adversarial attacks as compared with convolutional neural networks (CNNs), and conjecture that this is because ViTs focus more on capturing global interactions among different input/feature patches, leading to their improved robustness to local perturbations imposed by adversarial attacks. In this work, we ask an intriguing question: "Under what kinds of perturbations do ViTs become more vulnerable learners compared to CNNs?" Driven by this question, we first conduct a comprehensive experiment regarding the robustness of both ViTs and CNNs under various existing adversarial attacks to understand the underlying reason favoring their robustness. Based on the drawn insights, we then propose a dedicated attack framework, dubbed Patch-Fool, that fools the self-attention mechanism by attacking its basic component (i.e., a single patch) with a series of attention-aware optimization techniques. Interestingly, our Patch-Fool framework shows for the first time that ViTs are not necessarily more robust than CNNs against adversarial perturbations. In particular, we find that ViTs are more vulnerable learners compared with CNNs against our Patch-Fool attack which is consistent across extensive experiments, and the observations from Sparse/Mild Patch-Fool, two variants of Patch-Fool, indicate an intriguing insight that the perturbation density and strength on each patch seem to be the key factors that influence the robustness ranking between ViTs and CNNs.
Authors: Yonggan Fu, Yuecheng Li, Chenghui Li, Jason Saragih, Peizhao Zhang, Xiaoliang Dai, Yingyan Celine Lin
Abstract: Real-time and robust photorealistic avatars for telepresence in AR/VR have been highly desired for enabling immersive photorealistic telepresence. However, there still exists one key bottleneck: the considerable computational expense needed to accurately infer facial expressions captured from headset-mounted cameras with a quality level that can match the realism of the avatar's human appearance. To this end, we propose a framework called Auto-CARD, which for the first time enables real-time and robust driving of Codec Avatars when exclusively using merely on-device computing resources. This is achieved by minimizing two sources of redundancy. First, we develop a dedicated neural architecture search technique called AVE-NAS for avatar encoding in AR/VR, which explicitly boosts both the searched architectures' robustness in the presence of extreme facial expressions and hardware friendliness on fast evolving AR/VR headsets. Second, we leverage the temporal redundancy in consecutively captured images during continuous rendering and develop a mechanism dubbed LATEX to skip the computation of redundant frames. Specifically, we first identify an opportunity from the linearity of the latent space derived by the avatar decoder and then propose to perform adaptive latent extrapolation for redundant frames. For evaluation, we demonstrate the efficacy of our Auto-CARD framework in real-time Codec Avatar driving settings, where we achieve a 5.05x speed-up on Meta Quest 2 while maintaining a comparable or even better animation quality than state-of-the-art avatar encoder designs.
Authors: Yonggan Fu, Zhifan Ye, Jiayi Yuan, Shunyao Zhang, Sixu Li, Haoran You, Yingyan Celine Lin
Abstract: Novel view synthesis is an essential functionality for enabling immersive experiences in various Augmented- and Virtual-Reality (AR/VR) applications, for which generalizable Neural Radiance Fields (NeRFs) have gained increasing popularity thanks to their cross-scene generalization capability. Despite their promise, the real-device deployment of generalizable NeRFs is bottlenecked by their prohibitive complexity due to the required massive memory accesses to acquire scene features, causing their ray marching process to be memory-bounded. To this end, we propose Gen-NeRF, an algorithm-hardware co-design framework dedicated to generalizable NeRF acceleration, which for the first time enables real-time generalizable NeRFs. On the algorithm side, Gen-NeRF integrates a coarse-then-focus sampling strategy, leveraging the fact that different regions of a 3D scene contribute differently to the rendered pixel, to enable sparse yet effective sampling. On the hardware side, Gen-NeRF highlights an accelerator micro-architecture to maximize the data reuse opportunities among different rays by making use of their epipolar geometric relationship. Furthermore, our Gen-NeRF accelerator features a customized dataflow to enhance data locality during point-to-hardware mapping and an optimized scene feature storage strategy to minimize memory bank conflicts. Extensive experiments validate the effectiveness of our proposed Gen-NeRF framework in enabling real-time and generalizable novel view synthesis.
Authors: Chuanxin Song, Hanbo Wu, Xin Ma
Abstract: Exploring the semantic context in scene images is essential for indoor scene recognition. However, due to the diverse intra-class spatial layouts and the coexisting inter-class objects, modeling contextual relationships to adapt various image characteristics is a great challenge. Existing contextual modeling methods for scene recognition exhibit two limitations: 1) They typically model only one type of spatial relationship (order or metric) among objects within scenes, with limited exploration of diverse spatial layouts. 2) They often overlook the differences in coexisting objects across different scenes, suppressing scene recognition performance. To overcome these limitations, we propose SpaCoNet, which simultaneously models Spatial relation and Co-occurrence of objects guided by semantic segmentation. Firstly, the Semantic Spatial Relation Module (SSRM) is constructed to model scene spatial features. With the help of semantic segmentation, this module decouples spatial information from the scene image and thoroughly explores all spatial relationships among objects in an implicit manner, thereby obtaining semantic-based spatial features. Secondly, both spatial features from the SSRM and deep features from the Image Feature Extraction Module are allocated to each object, so as to distinguish the coexisting object across different scenes. Finally, utilizing the discriminative features above, we design a Global-Local Dependency Module to explore the long-range co-occurrence among objects, and further generate a semantic-guided feature representation for indoor scene recognition. Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
Authors: Yonggan Fu, Ye Yuan, Souvik Kundu, Shang Wu, Shunyao Zhang, Yingyan Celine Lin
Abstract: Generalizable Neural Radiance Fields (GNeRF) are one of the most promising real-world solutions for novel view synthesis, thanks to their cross-scene generalization capability and thus the possibility of instant rendering on new scenes. While adversarial robustness is essential for real-world applications, little study has been devoted to understanding its implication on GNeRF. We hypothesize that because GNeRF is implemented by conditioning on the source views from new scenes, which are often acquired from the Internet or third-party providers, there are potential new security concerns regarding its real-world applications. Meanwhile, existing understanding and solutions for neural networks' adversarial robustness may not be applicable to GNeRF, due to its 3D nature and uniquely diverse operations. To this end, we present NeRFool, which to the best of our knowledge is the first work that sets out to understand the adversarial robustness of GNeRF. Specifically, NeRFool unveils the vulnerability patterns and important insights regarding GNeRF's adversarial robustness. Built upon the above insights gained from NeRFool, we further develop NeRFool+, which integrates two techniques capable of effectively attacking GNeRF across a wide range of target views, and provide guidelines for defending against our proposed attacks. We believe that our NeRFool/NeRFool+ lays the initial foundation for future innovations in developing robust real-world GNeRF solutions. Our codes are available at: https://github.com/GATECH-EIC/NeRFool.
Authors: Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
Abstract: Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2x and 1.5x for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses strictly more information, training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on bias-variance trade-off.
Authors: Kaiwen Cai, Chris Xiaoxuan Lu, Xingyu Zhao, Xiaowei Huang
Abstract: Most image retrieval research prioritizes improving predictive performance, often overlooking situations where the reliability of predictions is equally important. The gap between model performance and reliability requirements highlights the need for a systematic approach to analyze and address the risks associated with image retrieval. Uncertainty quantification technique can be applied to mitigate this issue by assessing uncertainty for retrieval sets, but it provides only a heuristic estimate of uncertainty rather than a guarantee. To address these limitations, we present Risk Controlled Image Retrieval (RCIR), which generates retrieval sets with coverage guarantee, i.e., retrieval sets that are guaranteed to contain the true nearest neighbors with a predefined probability. RCIR can be easily integrated with existing uncertainty-aware image retrieval systems, agnostic to data distribution and model selection. To the best of our knowledge, this is the first work that provides coverage guarantees to image retrieval. The validity and efficiency of RCIR are demonstrated on four real-world datasets: CAR-196, CUB-200, Pittsburgh, and ChestX-Det.
Authors: Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang, Robert C. Qiu
Abstract: The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a \textit{frozen} pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.
Authors: Jonathan Henrich, Jan van Delden, Dominik Seidel, Thomas Kneib, Alexander Ecker
Abstract: Laser-scanned point clouds of forests make it possible to extract valuable information for forest management. To consider single trees, a forest point cloud needs to be segmented into individual tree point clouds. Existing segmentation methods are usually based on hand-crafted algorithms, such as identifying trunks and growing trees from them, and face difficulties in dense forests with overlapping tree crowns. In this study, we propose TreeLearn, a deep learning-based approach for tree instance segmentation of forest point clouds. TreeLearn is trained on already segmented point clouds in a data-driven manner, making it less reliant on predefined features and algorithms. Furthermore, TreeLearn is implemented as a fully automatic pipeline and does not rely on extensive hyperparameter tuning, which makes it easy to use. Additionally, we introduce a new manually segmented benchmark forest dataset containing 156 full trees. The data is generated by mobile laser scanning and contributes to create a larger and more diverse data basis for model development and fine-grained instance segmentation evaluation. We trained TreeLearn on forest point clouds of 6665 trees, labeled using the Lidar360 software. An evaluation on the benchmark dataset shows that TreeLearn performs as well as the algorithm used to generate its training data. Furthermore, the performance can be vastly improved by fine-tuning the model using manually annotated datasets. We evaluate TreeLearn on our benchmark dataset and the Wytham Woods dataset, outperforming the recent SegmentAnyTree, ForAINet and TLS2Trees methods. The TreeLearn code and all datasets that were created in the course of this work are made publicly available.
Authors: Vullnet Useini, Stephanie Tanadini-Lang, Quentin Lohmeyer, Mirko Meboldt, Nicolaus Andratschke, Ralph P. Braun, Javier Barranco Garc\'ia
Abstract: Melanoma, the deadliest form of skin cancer, has seen a steady increase in incidence rates worldwide, posing a significant challenge to dermatologists. Early detection is crucial for improving patient survival rates. However, performing total body screening (TBS), i.e., identifying suspicious lesions or ugly ducklings (UDs) by visual inspection, can be challenging and often requires sound expertise in pigmented lesions. To assist users of varying expertise levels, an artificial intelligence (AI) decision support tool was developed. Our solution identifies and characterizes UDs from real-world wide-field patient images. It employs a state-of-the-art object detection algorithm to locate and isolate all skin lesions present in a patient's total body images. These lesions are then sorted based on their level of suspiciousness using a self-supervised AI approach, tailored to the specific context of the patient under examination. A clinical validation study was conducted to evaluate the tool's performance. The results demonstrated an average sensitivity of 95% for the top-10 AI-identified UDs on skin lesions selected by the majority of experts in pigmented skin lesions. The study also found that the tool increased dermatologists' confidence when formulating a diagnosis, and the average majority agreement with the top-10 AI-identified UDs reached 100% when assisted by our tool. With the development of this AI-based decision support tool, we aim to address the shortage of specialists, enable faster consultation times for patients, and demonstrate the impact and usability of AI-assisted screening. Future developments will include expanding the dataset to include histologically confirmed melanoma and validating the tool for additional body regions.
Authors: William Heyden, Habib Ullah, M. Salman Siddiqui, Fadi Al Machot
Abstract: Zero-Shot Learning (ZSL) presents the challenge of identifying categories not seen during training. This task is crucial in domains where it is costly, prohibited, or simply not feasible to collect training data. ZSL depends on a mapping between the visual space and available semantic information. Prior works learn a mapping between spaces that can be exploited during inference. We contend, however, that the disparity between meticulously curated semantic spaces and the inherently noisy nature of real-world data remains a substantial and unresolved challenge. In this paper, we address this by introducing a Semantic Encoder-Enhanced Representations for Zero-Shot Learning (SEER-ZSL). We propose a hybrid strategy to address the generalization gap. First, we aim to distill meaningful semantic information using a probabilistic encoder, enhancing the semantic consistency and robustness. Second, we distill the visual space by exploiting the learned data distribution through an adversarially trained generator. Finally, we align the distilled information, enabling a mapping of unseen categories onto the true data manifold. We demonstrate empirically that this approach yields a model that outperforms the state-of-the-art benchmarks in terms of both generalization and benchmarks across diverse settings with small, medium, and large datasets. The complete code is available on GitHub.
Authors: Andong Lu, Chenglong Li, Tianrui Zha, Jin Tang, Xiaofeng Wang, Bin Luo
Abstract: Prevalent nighttime person re-identification (ReID) methods typically combine image relighting and ReID networks in a sequential manner. However, their performance (recognition accuracy) is limited by the quality of relighting images and insufficient collaboration between image relighting and ReID tasks. To handle these problems, we propose a novel Collaborative Enhancement Network called CENet, which performs the multilevel feature interactions in a parallel framework, for nighttime person ReID. In particular, the designed parallel structure of CENet can not only avoid the impact of the quality of relighting images on ReID performance, but also allow us to mine the collaborative relations between image relighting and person ReID tasks. To this end, we integrate the multilevel feature interactions in CENet, where we first share the Transformer encoder to build the low-level feature interaction, and then perform the feature distillation that transfers the high-level features from image relighting to ReID, thereby alleviating the severe image degradation issue caused by the nighttime scenario while avoiding the impact of relighting images. In addition, the sizes of existing real-world nighttime person ReID datasets are limited, and large-scale synthetic ones exhibit substantial domain gaps with real-world data. To leverage both small-scale real-world and large-scale synthetic training data, we develop a multi-domain learning algorithm, which alternately utilizes both kinds of data to reduce the inter-domain difference in training procedure. Extensive experiments on two real nighttime datasets, \textit{Night600} and \textit{RGBNT201$_{rgb}$}, and a synthetic nighttime ReID dataset are conducted to validate the effectiveness of CENet. We release the code and synthetic dataset at: \hyperlink{https://github.com/Alexadlu/CENet}{\color{red} https://github.com/Alexadlu/CENet}.
URLs: https://github.com/Alexadlu/CENet, https://github.com/Alexadlu/CENet
Authors: Zolnamar Dorjsembe, Hsing-Kuo Pao, Furen Xiao
Abstract: This study introduces Polyp-DDPM, a diffusion-based method for generating realistic images of polyps conditioned on masks, aimed at enhancing the segmentation of gastrointestinal (GI) tract polyps. Our approach addresses the challenges of data limitations, high annotation costs, and privacy concerns associated with medical images. By conditioning the diffusion model on segmentation masks-binary masks that represent abnormal areas-Polyp-DDPM outperforms state-of-the-art methods in terms of image quality (achieving a Frechet Inception Distance (FID) score of 78.47, compared to scores above 83.79) and segmentation performance (achieving an Intersection over Union (IoU) of 0.7156, versus less than 0.6694 for synthetic images from baseline models and 0.7067 for real data). Our method generates a high-quality, diverse synthetic dataset for training, thereby enhancing polyp segmentation models to be comparable with real images and offering greater data augmentation capabilities to improve segmentation models. The source code and pretrained weights for Polyp-DDPM are made publicly available at https://github.com/mobaidoctor/polyp-ddpm.
Authors: Yuhao Lin, Haiming Xu, Lingqiao Liu, Javen Qinfeng Shi
Abstract: Class-Agnostic Counting (CAC) seeks to accurately count objects in a given image with only a few reference examples. While previous methods achieving this relied on additional training, recent efforts have shown that it's possible to accomplish this without training by utilizing pre-existing foundation models, particularly the Segment Anything Model (SAM), for counting via instance-level segmentation. Although promising, current training-free methods still lag behind their training-based counterparts in terms of performance. In this research, we present a straightforward training-free solution that effectively bridges this performance gap, serving as a strong baseline. The primary contribution of our work lies in the discovery of four key technologies that can enhance performance. Specifically, we suggest employing a superpixel algorithm to generate more precise initial point prompts, utilizing an image encoder with richer semantic knowledge to replace the SAM encoder for representing candidate objects, and adopting a multiscale mechanism and a transductive prototype scheme to update the representation of reference examples. By combining these four technologies, our approach achieves significant improvements over existing training-free methods and delivers performance on par with training-based ones.
Authors: Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Fangxun Shu, Hao Jiang, Linchao Zhu
Abstract: The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still face the hallucination phenomena where the generated texts do not align with the given contexts, significantly restricting the usages of LVLMs. Most previous work detects and mitigates hallucination at the coarse-grained level or requires expensive annotation (e.g., labeling by proprietary models or human experts). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we generate a small-size sentence-level hallucination annotation dataset by proprietary models, whereby we train a hallucination detection model which can perform sentence-level hallucination detection, covering primary hallucination types (i.e., object, attribute, and relationship). Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model. Furthermore, we propose differentiating the severity of hallucinations, and introducing a Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO) for mitigating hallucination in LVLMs by incorporating the severity of hallucinations into preference learning. Extensive experiments demonstrate the effectiveness of our method.
Authors: Xiang Gao, Yuqi Zhang
Abstract: Recent style transfer problems are still largely dominated by Generative Adversarial Network (GAN) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (\romannumeral1) we propose saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (\romannumeral2) we propose saliency adaptive normalization (SANorm) which implicitly enhances object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide stylization process. Besides, we also propose saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.
Authors: Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo
Abstract: The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. Moreover, we adopt diffusion models that have demonstrated superior abilities to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest joint attention blocks to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets.
Authors: Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull
Abstract: This paper offers the first comprehensive review of artificial intelligence (AI) research in the context of real camera content acquisition for entertainment purposes and is aimed at both researchers and cinematographers. Addressing the lack of review papers in the field of intelligent cinematography} (IC) and the breadth of related computer vision research, we present a holistic view of the IC landscape while providing technical insight, important for experts across disciplines. We provide technical background on generative AI, object detection, automated camera calibration and 3-D content acquisition, with references to assist non-technical readers. The application sections categorize work in terms of four production types: General Production, Virtual Production, Live Production and Aerial Production. Within each application section, we (1) sub-classify work according to research topic and (2) describe the trends and challenges relevant to each type of production. In the final chapter, we address the greater scope of IC research and summarize the significant potential of this area to influence the creative industries sector. We suggest that work relating to virtual production has the greatest potential to impact other mediums of production, driven by the growing interest in LED volumes/stages for in-camera virtual effects (ICVFX) and automated 3-D capture for virtual modeling of real world scenes and actors. We also address ethical and legal concerns regarding the use of creative AI that impact on artists, actors, technologists and the general public.
Authors: Yunchuan Ma, Laiyun Qing, Guorong Li, Yuankai Qi, Amin Beheshti, Quan Z. Sheng, Qingming Huang
Abstract: Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE. Different from the conventional way that trains these tokens with training data, we propose to learn these tokens with soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered for GPT-2. This procedure can be efficiently done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute 5.1%-32.4% improvements in terms of the main metric CIDEr compared to several state-of-the-art zero-shot video captioning methods.
Authors: Yachan Guo, Yi Xiao, Danna Xue, Jose Luis Gomez Zurita, Antonio M. Lopez
Abstract: Instance segmentation is crucial for autonomous driving but is hindered by the lack of annotated real-world data due to expensive labeling costs. Unsupervised Domain Adaptation (UDA) offers a solution by transferring knowledge from labeled synthetic data to unlabeled real-world data. While UDA methods for synthetic to real-world domains (synth-to-real) show remarkable performance in tasks such as semantic segmentation and object detection, very few have been proposed for instance segmentation in vision-based autonomous driving. Moreover, existing methods rely on suboptimal baselines, which severely limits performance. We introduce \textbf{UDA4Inst}, a powerful framework for synth-to-real UDA in instance segmentation. Our framework enhances instance segmentation through \textit{Semantic Category Training} and \textit{Bidirectional Mixing Training}. With the Semantic Category Training method, semantically related classes are grouped and trained separately, enabling the generation of higher-quality pseudo-labels and improved segmentation performance. We further propose a bidirectional cross-domain data mixing strategy that combines instance-wise and patch-wise mixing techniques to effectively utilize data from both source and target domains, producing realistic composite images that improve the model's generalization performance. Extensive experiments demonstrate the effectiveness of our methods. Our approach establishes a new state-of-the-art on the SYNTHIA->Cityscapes benchmark with mAP 31.3. Notably, we are the first to report results on multiple novel synth-to-real instance segmentation datasets, using UrbanSyn and Synscapes as source domains while Cityscapes and KITTI360 serve as target domains. Our code will be released soon.
Authors: Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee
Abstract: With the utilization of Transformer architecture, large Vision and Language (V&L) models have shown promising performance in even zero-shot settings. Several studies, however, indicate a lack of robustness of the models when dealing with complex linguistics and visual attributes. In this work, we introduce a novel V&L benchmark - ColorFoil, by creating color-related foils to assess the models' perception ability to detect colors like red, white, green, etc. We evaluate seven state-of-the-art V&L models including CLIP, ViLT, GroupViT, and BridgeTower, etc. in a zero-shot setting and present intriguing findings from the V&L models. The experimental evaluation indicates that ViLT and BridgeTower demonstrate much better color perception capabilities compared to CLIP and its variants and GroupViT. Moreover, CLIP-based models and GroupViT struggle to distinguish colors that are visually distinct to humans with normal color perception ability.
Authors: Weize Li, Zhicheng Zhao, Haochen Bai, Fei Su
Abstract: Referring Expression Segmentation (RES) has attracted rising attention, aiming to identify and segment objects based on natural language expressions. While substantial progress has been made in RES, the emergence of Generalized Referring Expression Segmentation (GRES) introduces new challenges by allowing expressions to describe multiple objects or lack specific object references. Existing RES methods, usually rely on sophisticated encoder-decoder and feature fusion modules, and are difficult to generate class prototypes that match each instance individually when confronted with the complex referent and binary labels of GRES. In this paper, reevaluating the differences between RES and GRES, we propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region. It enables different query vectors to match instances of different categories or different parts of the same instance, significantly expanding the decoder's flexibility, dispersing global pressure across all queries, and easing the demands on the encoder. Experimental results demonstrate that MABP significantly outperforms state-of-the-art methods in all three splits on gRefCOCO dataset. Meanwhile, MABP also surpasses state-of-the-art methods on RefCOCO+ and G-Ref datasets, and achieves very competitive results on RefCOCO. Code is available at https://github.com/buptLwz/MABP
Authors: Silky Singh, Surgan Jandial, Simra Shahid, Abhinav Java
Abstract: Text-conditioned style transfer enables users to communicate their desired artistic styles through text descriptions, offering a new and expressive means of achieving stylization. In this work, we evaluate the text-conditioned image editing and style transfer techniques on their fine-grained understanding of user prompts for precise "local" style transfer. We find that current methods fail to accomplish localized style transfers effectively, either failing to localize style transfer to certain regions in the image, or distorting the content and structure of the input image. To this end, we develop an end-to-end pipeline for "local" style transfer tailored to align with users' intent. Further, we substantiate the effectiveness of our approach through quantitative and qualitative analysis. The project code is available at: https://github.com/silky1708/local-style-transfer.
Authors: Zhengtao Yao, Hong Nguyen, Ajitesh Srivastava, Jose Luis Ambite
Abstract: In the realm of medical imaging, leveraging large-scale datasets from various institutions is crucial for developing precise deep learning models, yet privacy concerns frequently impede data sharing. federated learning (FL) emerges as a prominent solution for preserving privacy while facilitating collaborative learning. However, its application in real-world scenarios faces several obstacles, such as task & data heterogeneity, label scarcity, non-identically distributed (non-IID) data, computational vaiation, etc. In real-world, medical institutions may not want to disclose their tasks to FL server and generalization challenge of out-of-network institutions with un-seen task want to join the on-going federated system. This study address task-agnostic and generalization problem on un-seen tasks by adapting self-supervised FL framework. Utilizing Vision Transformer (ViT) as consensus feature encoder for self-supervised pre-training, no initial labels required, the framework enabling effective representation learning across diverse datasets and tasks. Our extensive evaluations, using various real-world non-IID medical imaging datasets, validate our approach's efficacy, retaining 90\% of F1 accuracy with only 5\% of the training data typically required for centralized approaches and exhibiting superior adaptability to out-of-distribution task. The result indicate that federated learning architecture can be a potential approach toward multi-task foundation modeling.
Authors: Zhuoyuan Li, Yubo Ai, Jiahao Lu, ChuXin Wang, Jiacheng Deng, Hanzhi Chang, Yanzhe Liang, Wenfei Yang, Shifeng Zhang, Tianzhu Zhang
Abstract: Transformers have demonstrated impressive results for 3D point cloud semantic segmentation. However, the quadratic complexity of transformer makes computation costs high, limiting the number of points that can be processed simultaneously and impeding the modeling of long-range dependencies between objects in a single scene. Drawing inspiration from the great potential of recent state space models (SSM) for long sequence modeling, we introduce Mamba, an SSM-based architecture, to the point cloud domain and propose Pamba, a novel architecture with strong global modeling capability under linear complexity. Specifically, to make the disorderness of point clouds fit in with the causal nature of Mamba, we propose a multi-path serialization strategy applicable to point clouds. Besides, we propose the ConvMamba block to compensate for the shortcomings of Mamba in modeling local geometries and in unidirectional modeling. Pamba obtains state-of-the-art results on several 3D point cloud segmentation tasks, including ScanNet v2, ScanNet200, S3DIS and nuScenes, while its effectiveness is validated by extensive experiments.
Authors: Xu Yin, Fei Pan, Guoyuan An, Yuchi Huo, Zixuan Xie, Sung-Eui Yoon
Abstract: Existing open-set recognition (OSR) studies typically assume that each image contains only one class label, with the unknown test set (negative) having a disjoint label space from the known test set (positive), a scenario referred to as full-label shift. This paper introduces the mixed OSR problem, where test images contain multiple class semantics, with both known and unknown classes co-occurring in the negatives, leading to a more complex super-label shift that better reflects real-world scenarios. To tackle this challenge, we propose the OpenSlot framework, based on object-centric learning, which uses slot features to represent diverse class semantics and generate class predictions. The proposed anti-noise slot (ANS) technique helps mitigate the impact of noise (invalid or background) slots during classification training, addressing the semantic misalignment between class predictions and ground truth. We evaluate OpenSlot on both mixed and conventional OSR benchmarks. Without elaborate designs, our method not only excels existing approaches in detecting super-label shifts across OSR tasks, but also achieves state-of-the-art performance on conventional benchmarks. Meanwhile, OpenSlot can localize class objects without using bounding boxes during training, demonstrating competitive performance in open-set object detection and potential for generalization.
Authors: Yueyang Li, Weiming Zeng, Wenhao Dong, Luhui Cai, Lei Wang, Hongyu Chen, Hongjie Yan, Lingbin Bian, Nizhuan Wang
Abstract: Background: Deep learning models have shown promise in diagnosing neurodevelopmental disorders (NDD) like ASD and ADHD. However, many models either use graph neural networks (GNN) to construct single-level brain functional networks (BFNs) or employ spatial convolution filtering for local information extraction from rs-fMRI data, often neglecting high-order features crucial for NDD classification. Methods: We introduce a Multi-view High-order Network (MHNet) to capture hierarchical and high-order features from multi-view BFNs derived from rs-fMRI data for NDD prediction. MHNet has two branches: the Euclidean Space Features Extraction (ESFE) module and the Non-Euclidean Space Features Extraction (Non-ESFE) module, followed by a Feature Fusion-based Classification (FFC) module for NDD identification. ESFE includes a Functional Connectivity Generation (FCG) module and a High-order Convolutional Neural Network (HCNN) module to extract local and high-order features from BFNs in Euclidean space. Non-ESFE comprises a Generic Internet-like Brain Hierarchical Network Generation (G-IBHN-G) module and a High-order Graph Neural Network (HGNN) module to capture topological and high-order features in non-Euclidean space. Results: Experiments on three public datasets show that MHNet outperforms state-of-the-art methods using both AAL1 and Brainnetome Atlas templates. Extensive ablation studies confirm the superiority of MHNet and the effectiveness of using multi-view fMRI information and high-order features. Our study also offers atlas options for constructing more sophisticated hierarchical networks and explains the association between key brain regions and NDD. Conclusion: MHNet leverages multi-view feature learning from both Euclidean and non-Euclidean spaces, incorporating high-order information from BFNs to enhance NDD classification performance.
Authors: Pranshu Pandya, Vatsal Gupta, Agney S Talwarr, Tushar Kataria, Dan Roth, Vivek Gupta
Abstract: Cognitive textual and visual reasoning tasks, including puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. Due to extensive training on vast amounts of human-curated data, LLMs and VLMs excel in common-sense reasoning tasks, however still struggle with more complex reasoning that demands deeper cognitive understanding. We introduce NTSEBench, a new dataset designed to evaluate cognitive multi-modal reasoning and problem-solving skills of large models. The dataset contains 2728 multiple-choice questions, accompanied by a total of 4,642 images, categorized into 26 different types. These questions are drawn from the nationwide NTSE examination in India and feature a mix of visual and textual general aptitude challenges, designed to assess intelligence and critical thinking skills beyond mere rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open source and propriety models, we propose four distinct modeling strategies to handle different modalities -- text and images -- in the dataset instances.
Authors: Chenfan Qu, Yiwu Zhong, Fengjun Guo, Lianwen Jin
Abstract: The rapid advancements of generative AI have fueled the potential of generative text image editing, meanwhile escalating the threat of misinformation spreading. However, existing forensics methods struggle to detect unseen forgery types that they have not been trained on, underscoring the need for a model capable of generalized detection of tampered scene text. To tackle this, we propose a novel task: open-set tampered scene text detection, which evaluates forensics models on their ability to identify both seen and previously unseen forgery types. We have curated a comprehensive, high-quality dataset, featuring the texts tampered by eight text editing models, to thoroughly assess the open-set generalization capabilities. Further, we introduce a novel and effective training paradigm that subtly alters the texture of selected texts within an image and trains the model to identify these regions. This approach not only mitigates the scarcity of high-quality training data but also enhances models' fine-grained perception and open-set generalization abilities. Additionally, we present DAF, a novel framework that improves open-set generalization by distinguishing between the features of authentic and tampered text, rather than focusing solely on the tampered text's features. Our extensive experiments validate the remarkable efficacy of our methods. For example, our zero-shot performance can even beat the previous state-of-the-art full-shot model by a large margin. Our dataset and code are available at https://github.com/qcf-568/OSTF.
Authors: Vito Mengers, Nicolas Roth, Oliver Brock, Klaus Obermayer, Martin Rolfs
Abstract: The objects we perceive guide our eye movements when observing real-world dynamic scenes. Yet, gaze shifts and selective attention are critical for perceiving details and refining object boundaries. Object segmentation and gaze behavior are, however, typically treated as two independent processes. Here, we present a computational model that simulates these processes in an interconnected manner and allows for hypothesis-driven investigations of distinct attentional mechanisms. Drawing on an information processing pattern from robotics, we use a Bayesian filter to recursively segment the scene, which also provides an uncertainty estimate for the object boundaries that we use to guide active scene exploration. We demonstrate that this model closely resembles observers' free viewing behavior on a dataset of dynamic real-world scenes, measured by scanpath statistics, including foveation duration and saccade amplitude distributions used for parameter fitting and higher-level statistics not used for fitting. These include how object detections, inspections, and returns are balanced and a delay of returning saccades without an explicit implementation of such temporal inhibition of return. Extensive simulations and ablation studies show that uncertainty promotes balanced exploration and that semantic object cues are crucial to forming the perceptual units used in object-based attention. Moreover, we show how our model's modular design allows for extensions, such as incorporating saccadic momentum or pre-saccadic attention, to further align its output with human scanpaths.
Authors: Chuyun Shen, Wenhao Li, Yuhang Shi, Xiangfeng Wang
Abstract: Interactive medical image segmentation (IMIS) has shown significant potential in enhancing segmentation accuracy by integrating iterative feedback from medical professionals. However, the limited availability of enough 3D medical data restricts the generalization and robustness of most IMIS methods. The Segment Anything Model (SAM), though effective for 2D images, requires expensive semi-auto slice-by-slice annotations for 3D medical images. In this paper, we explore the zero-shot capabilities of SAM 2, the next-generation Meta SAM model trained on videos, for 3D medical image segmentation. By treating sequential 2D slices of 3D images as video frames, SAM 2 can fully automatically propagate annotations from a single frame to the entire 3D volume. We propose a practical pipeline for using SAM 2 in 3D medical image segmentation and present key findings highlighting its efficiency and potential for further optimization. Concretely, numerical experiments on the BraTS2020 and the medical segmentation decathlon datasets demonstrate that SAM 2 still has a gap with supervised methods but can narrow the gap in specific settings and organ types, significantly reducing the annotation burden on medical professionals. Our code will be open-sourced and available at https://github.com/Chuyun-Shen/SAM_2_Medical_3D.
Authors: Yiming Zhong, Xiaolin Zhang, Yao Zhao, Yunchao Wei
Abstract: Recently, the text-to-3D task has developed rapidly due to the appearance of the SDS method. However, the SDS method always generates 3D objects with poor quality due to the over-smooth issue. This issue is attributed to two factors: 1) the DDPM single-step inference produces poor guidance gradients; 2) the randomness from the input noises and timesteps averages the details of the 3D contents. In this paper, to address the issue, we propose DreamLCM which incorporates the Latent Consistency Model (LCM). DreamLCM leverages the powerful image generation capabilities inherent in LCM, enabling generating consistent and high-quality guidance, i.e., predicted noises or images. Powered by the improved guidance, the proposed method can provide accurate and detailed gradients to optimize the target 3D models. In addition, we propose two strategies to enhance the generation quality further. Firstly, we propose a guidance calibration strategy, utilizing Euler Solver to calibrate the guidance distribution to accelerate 3D models to converge. Secondly, we propose a dual timestep strategy, increasing the consistency of guidance and optimizing 3D models from geometry to appearance in DreamLCM. Experiments show that DreamLCM achieves state-of-the-art results in both generation quality and training efficiency. The code is available at https://github.com/1YimingZhong/DreamLCM.
Authors: Amirhosein Chahe, Lifeng Zhou
Abstract: This paper introduces a novel method for open-vocabulary 3D scene querying in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs). We propose utilizing LLMs to generate both contextually canonical phrases and helping positive words for enhanced segmentation and scene interpretation. Our method leverages GPT-3.5 Turbo as an expert model to create a high-quality text dataset, which we then use to fine-tune smaller, more efficient LLMs for on-device deployment. Our comprehensive evaluation on the WayveScenes101 dataset demonstrates that LLM-guided segmentation significantly outperforms traditional approaches based on predefined canonical phrases. Notably, our fine-tuned smaller models achieve performance comparable to larger expert models while maintaining faster inference times. Through ablation studies, we discover that the effectiveness of helping positive words correlates with model scale, with larger models better equipped to leverage additional semantic information. This work represents a significant advancement towards more efficient, context-aware autonomous driving systems, effectively bridging 3D scene representation with high-level semantic querying while maintaining practical deployment considerations.
Authors: Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, Fuli Feng
Abstract: With recent generative models facilitating photo-realistic image synthesis, the proliferation of synthetic images has also engendered certain negative impacts on social platforms, thereby raising an urgent imperative to develop effective detectors. Current synthetic image detection (SID) pipelines are primarily dedicated to crafting universal artifact features, accompanied by an oversight about SID training paradigm. In this paper, we re-examine the SID problem and identify two prevalent biases in current training paradigms, i.e., weakened artifact features and overfitted artifact features. Meanwhile, we discover that the imaging mechanism of synthetic images contributes to heightened local correlations among pixels, suggesting that detectors should be equipped with local awareness. In this light, we propose SAFE, a lightweight and effective detector with three simple image transformations. Firstly, for weakened artifact features, we substitute the down-sampling operator with the crop operator in image pre-processing to help circumvent artifact distortion. Secondly, for overfitted artifact features, we include ColorJitter and RandomRotation as additional data augmentations, to help alleviate irrelevant biases from color discrepancies and semantic differences in limited training samples. Thirdly, for local awareness, we propose a patch-based random masking strategy tailored for SID, forcing the detector to focus on local regions at training. Comparative experiments are conducted on an open-world dataset, comprising synthetic images generated by 26 distinct generative models. Our pipeline achieves a new state-of-the-art performance, with remarkable improvements of 4.5% in accuracy and 2.9% in average precision against existing methods. Our code is available at: https://github.com/Ouxiang-Li/SAFE.
Authors: Haitao Zhou, Chuang Wang, Rui Nie, Jinlin Liu, Dongdong Yu, Qian Yu, Changhu Wang
Abstract: Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users with a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention map of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC scores.
Authors: Debesh Jha, Nikhil Kumar Tomar, Vanshali Sharma, Quoc-Huy Trinh, Koushik Biswas, Hongyi Pan, Ritika K. Jha, Gorkem Durak, Alexander Hann, Jonas Varkey, Hang Viet Dao, Long Van Dao, Binh Phuc Nguyen, Nikolaos Papachrysos, Brandon Rieders, Peter Thelin Schmidt, Enrik Geissler, Tyler Berzin, P{\aa}l Halvorsen, Michael A. Riegler, Thomas de Lange, Ulas Bagci
Abstract: Colonoscopy is the primary method for examination, detection, and removal of polyps. However, challenges such as variations among the endoscopists' skills, bowel quality preparation, and the complex nature of the large intestine contribute to high polyp miss-rate. These missed polyps can develop into cancer later, underscoring the importance of improving the detection methods. To address this gap of lack of publicly available, multi-center large and diverse datasets for developing automatic methods for polyp detection and segmentation, we introduce PolypDB, a large scale publicly available dataset that contains 3934 still polyp images and their corresponding ground truth from real colonoscopy videos. PolypDB comprises images from five modalities: Blue Light Imaging (BLI), Flexible Imaging Color Enhancement (FICE), Linked Color Imaging (LCI), Narrow Band Imaging (NBI), and White Light Imaging (WLI) from three medical centers in Norway, Sweden, and Vietnam. We provide a benchmark on each modality and center, including federated learning settings using popular segmentation and detection benchmarks. PolypDB is public and can be downloaded at \url{https://osf.io/pr7ms/}. More information about the dataset, segmentation, detection, federated learning benchmark and train-test split can be found at \url{https://github.com/DebeshJha/PolypDB}.
URLs: https://osf.io/pr7ms/, https://github.com/DebeshJha/PolypDB
Authors: Yan Chen, Di Huang, Zhichao Liao, Xi Cheng, Xinghui Li, Lone Zeng
Abstract: The trend of employing training-free methods for point cloud recognition is becoming increasingly popular due to its significant reduction in computational resources and time costs. However, existing approaches are limited as they typically extract either geometric or semantic features. To address this limitation, we are the first to propose a novel training-free method that integrates both geometric and semantic features. For the geometric branch, we adopt a non-parametric strategy to extract geometric features. In the semantic branch, we leverage a model aligned with text features to obtain semantic features. Additionally, we introduce the GFE module to complement the geometric information of point clouds and the MFF module to improve performance in few-shot settings. Experimental results demonstrate that our method outperforms existing state-of-the-art training-free approaches on mainstream benchmark datasets, including ModelNet and ScanObiectNN.
Authors: Qingyao Tian, Zhen Chen, Huai Liao, Xinyan Huang, Lujie Li, Sebastien Ourselin, Hongbin Liu
Abstract: Single-image depth estimation is essential for endoscopy tasks such as localization, reconstruction, and augmented reality. Most existing methods in surgical scenes focus on in-domain depth estimation, limiting their real-world applicability. This constraint stems from the scarcity and inferior labeling quality of medical data for training. In this work, we present EndoOmni, the first foundation model for zero-shot cross-domain depth estimation for endoscopy. To harness the potential of diverse training data, we refine the advanced self-learning paradigm that employs a teacher model to generate pseudo-labels, guiding a student model trained on large-scale labeled and unlabeled data. To address training disturbance caused by inherent noise in depth labels, we propose a robust training framework that leverages both depth labels and estimated confidence from the teacher model to jointly guide the student model training. Moreover, we propose a weighted scale-and-shift invariant loss to adaptively adjust learning weights based on label confidence, thus imposing learning bias towards cleaner label pixels while reducing the influence of highly noisy pixels. Experiments on zero-shot relative depth estimation show that our EndoOmni improves state-of-the-art methods in medical imaging for 33\% and existing foundation models for 34\% in terms of absolute relative error on specific datasets. Furthermore, our model provides strong initialization for fine-tuning metric depth estimation, maintaining superior performance in both in-domain and out-of-domain scenarios. The source code is publicly available at https://github.com/TianCuteQY/EndoOmni.
Authors: Qianyi Deng, Oishi Deb, Amir Patel, Christian Rupprecht, Philip Torr, Niki Trigoni, Andrew Markham
Abstract: Animal pose estimation (APE) aims to locate the animal body parts using a diverse array of sensor and modality inputs (e.g. RGB cameras, LiDAR, infrared, IMU, acoustic and language cues), which is crucial for research across neuroscience, biomechanics, and veterinary medicine. By evaluating 176 papers since 2011, APE methods are categorised by their input sensor and modality types, output forms, learning paradigms, experimental setup, and application domains, presenting detailed analyses of current trends, challenges, and future directions in single- and multi-modality APE systems. The analysis also highlights the transition between human and animal pose estimation, and how innovations in APE can reciprocally enrich human pose estimation and the broader machine learning paradigm. Additionally, 2D and 3D APE datasets and evaluation metrics based on different sensors and modalities are provided. A regularly updated project page is provided here: https://github.com/ChennyDeng/MM-APE.
Authors: Jiawei Liu, Fanrui Zhang, Jiaying Zhu, Esther Sun, Qiang Zhang, Zheng-Jun Zha
Abstract: Multimodal Large Language Models (MLLMs), such as GPT4o, have shown strong capabilities in visual reasoning and explanation generation. However, despite these strengths, they face significant challenges in the increasingly critical task of Image Forgery Detection and Localization (IFDL). Moreover, existing IFDL methods are typically limited to the learning of low-level semantic-agnostic clues and merely provide a single outcome judgment. To tackle these issues, we propose ForgeryGPT, a novel framework that advances the IFDL task by capturing high-order forensics knowledge correlations of forged images from diverse linguistic feature spaces, while enabling explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture. Specifically, ForgeryGPT enhances traditional LLMs by integrating the Mask-Aware Forgery Extractor, which enables the excavating of precise forgery mask information from input images and facilitating pixel-level understanding of tampering artifacts. The Mask-Aware Forgery Extractor consists of a Forgery Localization Expert (FL-Expert) and a Mask Encoder, where the FL-Expert is augmented with an Object-agnostic Forgery Prompt and a Vocabulary-enhanced Vision Encoder, allowing for effectively capturing of multi-scale fine-grained forgery details. To enhance its performance, we implement a three-stage training strategy, supported by our designed Mask-Text Alignment and IFDL Task-Specific Instruction Tuning datasets, which align vision-language modalities and improve forgery detection and instruction-following capabilities. Extensive experiments demonstrate the effectiveness of the proposed method.
Authors: Shaozhe Hao, Xuantong Liu, Xianbiao Qi, Shihao Zhao, Bojia Zi, Rong Xiao, Kai Han, Kwan-Yee K. Wong
Abstract: We introduce BiGR, a novel conditional image generation model using compact binary latent codes for generative training, focusing on enhancing both generation and representation capabilities. BiGR is the first conditional generative model that unifies generation and discrimination within the same framework. BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction. Additionally, we introduce a novel entropy-ordered sampling method to enable efficient image generation. Extensive experiments validate BiGR's superior performance in generation quality, as measured by FID-50k, and representation capabilities, as evidenced by linear-probe accuracy. Moreover, BiGR showcases zero-shot generalization across various vision tasks, enabling applications such as image inpainting, outpainting, editing, interpolation, and enrichment, without the need for structural modifications. Our findings suggest that BiGR unifies generative and discriminative tasks effectively, paving the way for further advancements in the field. We further enable BiGR to perform text-to-image generation, showcasing its potential for broader applications.
Authors: Yang Yu, Yuezun Li, Xin Sun, Junyu Dong
Abstract: Phytoplankton are a crucial component of aquatic ecosystems, and effective monitoring of them can provide valuable insights into ocean environments and ecosystem changes. Traditional phytoplankton monitoring methods are often complex and lack timely analysis. Therefore, deep learning algorithms offer a promising approach for automated phytoplankton monitoring. However, the lack of large-scale, high-quality training samples has become a major bottleneck in advancing phytoplankton tracking. In this paper, we propose a challenging benchmark dataset, Multiple Phytoplankton Tracking (MPT), which covers diverse background information and variations in motion during observation. The dataset includes 27 species of phytoplankton and zooplankton, 14 different backgrounds to simulate diverse and complex underwater environments, and a total of 140 videos. To enable accurate real-time observation of phytoplankton, we introduce a multi-object tracking method, Deviation-Corrected Multi-Scale Feature Fusion Tracker(DSFT), which addresses issues such as focus shifts during tracking and the loss of small target information when computing frame-to-frame similarity. Specifically, we introduce an additional feature extractor to predict the residuals of the standard feature extractor's output, and compute multi-scale frame-to-frame similarity based on features from different layers of the extractor. Extensive experiments on the MPT have demonstrated the validity of the dataset and the superiority of DSFT in tracking phytoplankton, providing an effective solution for phytoplankton monitoring.
Authors: Shrey Vishen, Jatin Sarabu, Saurav Kumar, Chinmay Bharathulwar, Rithwick Lakshmanan, Vishnu Srinivas
Abstract: We present a novel method for diffusion-guided frameworks for view-consistent super-resolution (SR) in neural rendering. Our approach leverages existing 2D SR models in conjunction with advanced techniques such as Variational Score Distilling (VSD) and a LoRA fine-tuning helper, with spatial training to significantly boost the quality and consistency of upscaled 2D images compared to the previous methods in the literature, such as Renoised Score Distillation (RSD) proposed in DiSR-NeRF (1), or SDS proposed in DreamFusion. The VSD score facilitates precise fine-tuning of SR models, resulting in high-quality, view-consistent images. To address the common challenge of inconsistencies among independent SR 2D images, we integrate Iterative 3D Synchronization (I3DS) from the DiSR-NeRF framework. Our quantitative benchmarks and qualitative results on the LLFF dataset demonstrate the superior performance of our system compared to existing methods such as DiSR-NeRF.
Authors: HyunJun Jung, Weihang Li, Shun-Cheng Wu, William Bittner, Nikolas Brasch, Jifei Song, Eduardo P\'erez-Pellitero, Zhensong Zhang, Arthur Moreau, Nassir Navab, Benjamin Busam
Abstract: Traditionally, 3d indoor datasets have generally prioritized scale over ground-truth accuracy in order to obtain improved generalization. However, using these datasets to evaluate dense geometry tasks, such as depth rendering, can be problematic as the meshes of the dataset are often incomplete and may produce wrong ground truth to evaluate the details. In this paper, we propose SCRREAM, a dataset annotation framework that allows annotation of fully dense meshes of objects in the scene and registers camera poses on the real image sequence, which can produce accurate ground truth for both sparse 3D as well as dense 3D tasks. We show the details of the dataset annotation pipeline and showcase four possible variants of datasets that can be obtained from our framework with example scenes, such as indoor reconstruction and SLAM, scene editing & object removal, human reconstruction and 6d pose estimation. Recent pipelines for indoor reconstruction and SLAM serve as new benchmarks. In contrast to previous indoor dataset, our design allows to evaluate dense geometry tasks on eleven sample scenes against accurately rendered ground truth depth maps.
Authors: Junjie Ni, Guofeng Zhang, Guanglin Li, Yijin Li, Xinyang Liu, Zhaoyang Huang, Hujun Bao
Abstract: We tackle the efficiency problem of learning local feature matching. Recent advancements have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. While CNN-based methods often excel in matching speed, transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. This technique is built on constructing multiple homography hypotheses to approximate the continuous correspondence in the real world and uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while the inference speed is boosted to 4 times, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as Megadepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.
Authors: Xinyuan Chang, Maixuan Xue, Xinran Liu, Zheng Pan, Xing Wei
Abstract: Ensuring adherence to traffic sign regulations is essential for both human and autonomous vehicle navigation. While current online mapping solutions often prioritize the construction of the geometric and connectivity layers of HD maps, overlooking the construction of the traffic regulation layer within HD maps. Addressing this gap, we introduce MapDR, a novel dataset designed for the extraction of Driving Rules from traffic signs and their association with vectorized, locally perceived HD Maps. MapDR features over $10,000$ annotated video clips that capture the intricate correlation between traffic sign regulations and lanes. Built upon this benchmark and the newly defined task of integrating traffic regulations into online HD maps, we provide modular and end-to-end solutions: VLE-MEE and RuleVLM, offering a strong baseline for advancing autonomous driving technology. It fills a critical gap in the integration of traffic sign rules, contributing to the development of reliable autonomous driving systems.
Authors: Liyuan Zhang, Le Hui, Qi Liu, Bo Li, Yuchao Dai
Abstract: Multi-instance point cloud registration aims to estimate the pose of all instances of a model point cloud in the whole scene. Existing methods all adopt the strategy of first obtaining the global correspondence and then clustering to obtain the pose of each instance. However, due to the cluttered and occluded objects in the scene, it is difficult to obtain an accurate correspondence between the model point cloud and all instances in the scene. To this end, we propose a simple yet powerful 3D focusing-and-matching network for multi-instance point cloud registration by learning the multiple pair-wise point cloud registration. Specifically, we first present a 3D multi-object focusing module to locate the center of each object and generate object proposals. By using self-attention and cross-attention to associate the model point cloud with structurally similar objects, we can locate potential matching instances by regressing object centers. Then, we propose a 3D dual masking instance matching module to estimate the pose between the model point cloud and each object proposal. It performs instance mask and overlap mask masks to accurately predict the pair-wise correspondence. Extensive experiments on two public benchmarks, Scan2CAD and ROBI, show that our method achieves a new state-of-the-art performance on the multi-instance point cloud registration task. Code is available at https://github.com/zlynpu/3DFMNet.
Authors: Xiaobin Deng, Changyu Diao, Min Li, Ruohan Yu, Duanqing Xu
Abstract: 3D Gaussian Splatting (3DGS) excels in novel view synthesis, balancing advanced rendering quality with real-time performance. However, in trained scenes, a large number of Gaussians with low opacity significantly increase rendering costs. This issue arises due to flaws in the split and clone operations during the densification process, which lead to extensive Gaussian overlap and subsequent opacity reduction. To enhance the efficiency of Gaussian utilization, we improve the adaptive density control of 3DGS. First, we introduce a more efficient long-axis split operation to replace the original clone and split, which mitigates Gaussian overlap and improves densification efficiency.Second, we propose a simple adaptive pruning technique to reduce the number of low-opacity Gaussians. Finally, by dynamically lowering the splitting threshold and applying importance weighting, the efficiency of Gaussian utilization is further improved. We evaluate our proposed method on various challenging real-world datasets. Experimental results show that our Efficient Density Control (EDC) can enhance both the rendering speed and quality. Code is available at https://github.com/XiaoBin2001/EDC.
Authors: Yifan Xie, Jingge Wang, Tao Feng, Fei Ma, Yang Li
Abstract: Colonoscopy is crucial for identifying adenomatous polyps and preventing colorectal cancer. However, developing robust models for polyp detection is challenging by the limited size and accessibility of existing colonoscopy datasets. While previous efforts have attempted to synthesize colonoscopy images, current methods suffer from instability and insufficient data diversity. Moreover, these approaches lack precise control over the generation process, resulting in images that fail to meet clinical quality standards. To address these challenges, we propose CCIS-DIFF, a Controlled generative model for high-quality Colonoscopy Image Synthesis based on a Diffusion architecture. Our method offers precise control over both the spatial attributes (polyp location and shape) and clinical characteristics of polyps that align with clinical descriptions. Specifically, we introduce a blur mask weighting strategy to seamlessly blend synthesized polyps with the colonic mucosa, and a text-aware attention mechanism to guide the generated images to reflect clinical characteristics. Notably, to achieve this, we construct a new multi-modal colonoscopy dataset that integrates images, mask annotations, and corresponding clinical text descriptions. Experimental results demonstrate that our method generates high-quality, diverse colonoscopy images with fine control over both spatial constraints and clinical consistency, offering valuable support for downstream segmentation and diagnostic tasks.
Authors: Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yong Zhao, C. L. Philip Chen
Abstract: Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as ``motifs" within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through wavelet inverse transformation. The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at https://github.com/ZYangChen/MoCha-Stereo.
Authors: Xiwei Deng, Xianchun He, Jiangfeng Bao, Yudan Zhou, Shuhui Cai, Congbo Cai, Zhong Chen
Abstract: CT report generation (CTRG) aims to automatically generate diagnostic reports for 3D volumes, relieving clinicians' workload and improving patient care. Despite clinical value, existing works fail to effectively incorporate diagnostic information from multiple anatomical views and lack related clinical expertise essential for accurate and reliable diagnosis. To resolve these limitations, we propose a novel Multi-view perception Knowledge-enhanced Transformer (MvKeTR) to mimic the diagnostic workflow of clinicians. Just as radiologists first examine CT scans from multiple planes, a Multi-View Perception Aggregator (MVPA) with view-aware attention effectively synthesizes diagnostic information from multiple anatomical views. Then, inspired by how radiologists further refer to relevant clinical records to guide diagnostic decision-making, a Cross-Modal Knowledge Enhancer (CMKE) retrieves the most similar reports based on the query volume to incorporate domain knowledge into the diagnosis procedure. Furthermore, instead of traditional MLPs, we employ Kolmogorov-Arnold Networks (KANs) with learnable nonlinear activation functions as the fundamental building blocks of both modules to better capture intricate diagnostic patterns in CT interpretation. Extensive experiments on the public CTRG-Chest-548K dataset demonstrate that our method outpaces prior state-of-the-art (SOTA) models across almost all metrics. The code will be made publicly available.
Authors: Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu
Abstract: Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos. Project page https://fudan-generative-vision.github.io/OpenHumanVid
URLs: https://fudan-generative-vision.github.io/OpenHumanVid
Authors: Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
Abstract: Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://fudan-generative-vision.github.io/hallo3/.
Authors: Xinghui Li, Qichao Sun, Pengze Zhang, Fulong Ye, Zhichao Liao, Wanquan Feng, Songtao Zhao, Qian He
Abstract: Recent advances in garment-centric image generation from text and image prompts based on diffusion models are impressive. However, existing methods lack support for various combinations of attire, and struggle to preserve the garment details while maintaining faithfulness to the text prompts, limiting their performance across diverse scenarios. In this paper, we focus on a new task, i.e., Multi-Garment Virtual Dressing, and we propose a novel AnyDressing method for customizing characters conditioned on any combination of garments and any personalized text prompts. AnyDressing comprises two primary networks named GarmentsNet and DressingNet, which are respectively dedicated to extracting detailed clothing features and generating customized images. Specifically, we propose an efficient and scalable module called Garment-Specific Feature Extractor in GarmentsNet to individually encode garment textures in parallel. This design prevents garment confusion while ensuring network efficiency. Meanwhile, we design an adaptive Dressing-Attention mechanism and a novel Instance-Level Garment Localization Learning strategy in DressingNet to accurately inject multi-garment features into their corresponding regions. This approach efficiently integrates multi-garment texture cues into generated images and further enhances text-image consistency. Additionally, we introduce a Garment-Enhanced Texture Learning strategy to improve the fine-grained texture details of garments. Thanks to our well-craft design, AnyDressing can serve as a plug-in module to easily integrate with any community control extensions for diffusion models, improving the diversity and controllability of synthesized images. Extensive experiments show that AnyDressing achieves state-of-the-art results.
Authors: Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang
Abstract: Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models. It enables fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future.
Authors: MinHyuk Jang, Youngdong Jang, JaeHyeok Lee, Kodai Kawamura, Feng Yang, Sangpil Kim
Abstract: Rapid advancements in generative models have made it possible to create hyper-realistic videos. As their applicability increases, their unauthorized use has raised significant concerns, leading to the growing demand for techniques to protect the ownership of the generative model itself. While existing watermarking methods effectively embed watermarks into image-generative models, they fail to account for temporal information, resulting in poor performance when applied to video-generative models. To address this issue, we introduce a novel watermarking method called LVMark, which embeds watermarks into video diffusion models. A key component of LVMark is a selective weight modulation strategy that efficiently embeds watermark messages into the video diffusion model while preserving the quality of the generated videos. To accurately decode messages in the presence of malicious attacks, we design a watermark decoder that leverages spatio-temporal information in the 3D wavelet domain through a cross-attention module. To the best of our knowledge, our approach is the first to highlight the potential of video-generative model watermarking as a valuable tool for enhancing the effectiveness of ownership protection in video-generative models.
Authors: Yu Chen, Shuai Zheng, Nianyi Wang, Menglong Jin, Yan Chang
Abstract: Fluid simulation is an important research topic in computer graphics (CG) and animation in video games. Traditional methods based on Navier-Stokes equations are computationally expensive. In this paper, we treat fluid motion as point cloud transformation and propose the first neural network method specifically designed for efficient and robust fluid simulation in complex environments. This model is also the deep learning model that is the first to be capable of stably modeling fluid particle dynamics in such complex scenarios. Our triangle feature fusion design achieves an optimal balance among fluid dynamics modeling, momentum conservation constraints, and global stability control. We conducted comprehensive experiments on datasets. Compared to existing neural network-based fluid simulation algorithms, we significantly enhanced accuracy while maintaining high computational speed. Compared to traditional SPH methods, our speed improved approximately 10 times. Furthermore, compared to traditional fluid simulation software such as Flow3D, our computation speed increased by more than 300 times.
Authors: Chao Wang, Zhihao Xia, Thomas Leimkuehler, Karol Myszkowski, Xuaner Zhang
Abstract: While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically-plausible dynamic range in the clipped areas. We introduce LEDiff, a method that enables a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.
Authors: Helia Mohamadi, Mohammad Ali Keyvanrad, Mohammad Reza Mohammadi
Abstract: Domain adaptation, a pivotal branch of transfer learning, aims to enhance the performance of machine learning models when deployed in target domains with distinct data distributions. This is particularly critical for object detection tasks, where domain shifts (caused by factors such as lighting conditions, viewing angles, and environmental variations) can lead to significant performance degradation. This review delves into advanced methodologies for domain adaptation, including adversarial learning, discrepancy-based, multi-domain, teacher-student, ensemble, and Vision Language Models techniques, emphasizing their efficacy in reducing domain gaps and enhancing model robustness. Feature-based methods have emerged as powerful tools for addressing these challenges by harmonizing feature representations across domains. These techniques, such as Feature Alignment, Feature Augmentation/Reconstruction, and Feature Transformation, are employed alongside or as integral parts of other domain adaptation strategies to minimize domain gaps and improve model performance. Special attention is given to strategies that minimize the reliance on extensive labeled data and using unlabeled data, particularly in scenarios involving synthetic-to-real domain shifts. Applications in fields such as autonomous driving and medical imaging are explored, showcasing the potential of these methods to ensure reliable object detection in diverse and complex settings. By providing a thorough analysis of state-of-the-art techniques, challenges, and future directions, this work offers a valuable reference for researchers striving to develop resilient and adaptable object detection frameworks, advancing the seamless deployment of artificial intelligence in dynamic environments.
Authors: Haoyuan Zhang, Xiangyu Zhu, Li Gao, Guoying Zhao, Zhen Lei
Abstract: With the rapid growth usage of face recognition in people's daily life, face anti-spoofing becomes increasingly important to avoid malicious attacks. Recent face anti-spoofing models can reach a high classification accuracy on multiple datasets but these models can only tell people "this face is fake" while lacking the explanation to answer "why it is fake". Such a system undermines trustworthiness and causes user confusion, as it denies their requests without providing any explanations. In this paper, we incorporate XAI into face anti-spoofing and propose a new problem termed X-FAS (eXplainable Face Anti-Spoofing) empowering face anti-spoofing models to provide an explanation. We propose SPED (SPoofing Evidence Discovery), an X-FAS method which can discover spoof concepts and provide reliable explanations on the basis of discovered concepts. To evaluate the quality of X-FAS methods, we propose an X-FAS benchmark with annotated spoofing evidence by experts. We analyze SPED explanations on face anti-spoofing dataset and compare SPED quantitatively and qualitatively with previous XAI methods on proposed X-FAS benchmark. Experimental results demonstrate SPED's ability to generate reliable explanations.
Authors: Zixi Liang, Guowei Xu, Haifeng Wu, Ye Huang, Wen Li, Lixin Duan
Abstract: Learning-based methods have become increasingly popular in 3D indoor scene synthesis (ISS), showing superior performance over traditional optimization-based approaches. These learning-based methods typically model distributions on simple yet explicit scene representations using generative models. However, due to the oversimplified explicit representations that overlook detailed information and the lack of guidance from multimodal relationships within the scene, most learning-based methods struggle to generate indoor scenes with realistic object arrangements and styles. In this paper, we introduce a new method, Scene Implicit Neural Field (S-INF), for indoor scene synthesis, aiming to learn meaningful representations of multimodal relationships, to enhance the realism of indoor scene synthesis. S-INF assumes that the scene layout is often related to the object-detailed information. It disentangles the multimodal relationships into scene layout relationships and detailed object relationships, fusing them later through implicit neural fields (INFs). By learning specialized scene layout relationships and projecting them into S-INF, we achieve a realistic generation of scene layout. Additionally, S-INF captures dense and detailed object relationships through differentiable rendering, ensuring stylistic consistency across objects. Through extensive experiments on the benchmark 3D-FRONT dataset, we demonstrate that our method consistently achieves state-of-the-art performance under different types of ISS.
Authors: Gopi Raju Matta, Reddypalli Trisha, Kaushik Mitra
Abstract: Novel view synthesis has been greatly enhanced by the development of radiance field methods. The introduction of 3D Gaussian Splatting (3DGS) has effectively addressed key challenges, such as long training times and slow rendering speeds, typically associated with Neural Radiance Fields (NeRF), while maintaining high-quality reconstructions. In this work (BeSplat), we demonstrate the recovery of sharp radiance field (Gaussian splats) from a single motion-blurred image and its corresponding event stream. Our method jointly learns the scene representation via Gaussian Splatting and recovers the camera motion through Bezier SE(3) formulation effectively, minimizing discrepancies between synthesized and real-world measurements of both blurry image and corresponding event stream. We evaluate our approach on both synthetic and real datasets, showcasing its ability to render view-consistent, sharp images from the learned radiance field and the estimated camera trajectory. To the best of our knowledge, ours is the first work to address this highly challenging ill-posed problem in a Gaussian Splatting framework with the effective incorporation of temporal information captured using the event stream.
Authors: Junjie Hu, Shuyong Gao, Lingyi Hong, Qishan Wang, Yuzhou Zhao, Yan Wang, Wenqiang Zhang
Abstract: Recent research in subject-driven generation increasingly emphasizes the importance of selective subject features. Nevertheless, accurately selecting the content in a given reference image still poses challenges, especially when selecting the similar subjects in an image (e.g., two different dogs). Some methods attempt to use text prompts or pixel masks to isolate specific elements. However, text prompts often fall short in precisely describing specific content, and pixel masks are often expensive. To address this, we introduce P3S-Diffusion, a novel architecture designed for context-selected subject-driven generation via point supervision. P3S-Diffusion leverages minimal cost label (e.g., points) to generate subject-driven images. During fine-tuning, it can generate an expanded base mask from these points, obviating the need for additional segmentation models. The mask is employed for inpainting and aligning with subject representation. The P3S-Diffusion preserves fine features of the subjects through Multi-layers Condition Injection. Enhanced by the Attention Consistency Loss for improved training, extensive experiments demonstrate its excellent feature preservation and image generation capabilities.
Authors: Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Abstract: Video Large Language Models (VideoLLMs) have achieved remarkable progress in video understanding. However, existing VideoLLMs often inherit the limitations of their backbone LLMs in handling long sequences, leading to challenges for long video understanding. Common solutions either simply uniformly sample videos' frames or compress visual tokens, which focus primarily on low-level temporal visual redundancy, overlooking high-level knowledge redundancy. This limits the achievable compression rate with minimal loss. To this end. we introduce a training-free method, $\textbf{ReTaKe}$, containing two novel modules DPSelect and PivotKV, to jointly model and reduce both temporal visual redundancy and knowledge redundancy for long video understanding. Specifically, DPSelect identifies keyframes with local maximum peak distance based on their visual features, which are closely aligned with human video perception. PivotKV employs the obtained keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores, which are derived from the learned prior knowledge of LLMs. Experiments on benchmarks VideoMME, MLVU, and LVBench, show that ReTaKe can support 4x longer video sequences with minimal performance loss (<1%) and outperform all similar-size VideoLLMs with 3%-5%, even surpassing or on par with much larger ones. Our code is available at https://github.com/SCZwangxiao/video-ReTaKe
Authors: Zhiqiang Yuan, Ting Zhang, Jiapei Zhang, Jie Zhou, Jinchao Zhang
Abstract: Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people. With the recent progress of vision-language models (VLMs), employing VLMs to improve this field has emerged as a popular research topic. However, most existing methods are studied on self-built question-answering datasets, lacking a unified training and testing benchmark for walk guidance. Moreover, in blind walking task, it is necessary to perform real-time streaming video parsing and generate concise yet informative reminders, which poses a great challenge for VLMs that suffer from redundant responses and low inference efficiency. In this paper, we firstly release a diverse, extensive, and unbiased walking awareness dataset, containing 12k video-manual annotation pairs from Europe and Asia to provide a fair training and testing benchmark for blind walking task. Furthermore, a WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we have established a solid benchmark for blind walking task and verified the advantages of WalkVLM in stream video processing for this task compared to other VLMs. Our dataset and code will be released at anonymous link https://walkvlm2024.github.io.
Authors: Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang
Abstract: Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.
Authors: Suttisak Wizadwongsa, Jinfan Zhou, Edward Li, Jeong Joon Park
Abstract: Recent AI-based 3D content creation has largely evolved along two paths: feed-forward image-to-3D reconstruction approaches and 3D generative models trained with 2D or 3D supervision. In this work, we show that existing feed-forward reconstruction methods can serve as effective latent encoders for training 3D generative models, thereby bridging these two paradigms. By reusing powerful pre-trained reconstruction models, we avoid computationally expensive encoder network training and obtain rich 3D latent features for generative modeling for free. However, the latent spaces of reconstruction models are not well-suited for generative modeling due to their unstructured nature. To enable flow-based model training on these latent features, we develop post-processing pipelines, including protocols to standardize the features and spatial weighting to concentrate on important regions. We further incorporate a 2D image space perceptual rendering loss to handle the high-dimensional latent spaces. Finally, we propose a multi-stream transformer-based rectified flow architecture to achieve linear scaling and high-quality text-conditioned 3D generation. Our framework leverages the advancements of feed-forward reconstruction models to enhance the scalability of 3D generative modeling, achieving both high computational efficiency and state-of-the-art performance in text-to-3D generation.
Authors: Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky
Abstract: Visual Instruction Tuning typically requires a large amount of vision-language training data. This data often containing redundant information that increases computational costs without proportional performance gains. In this work, we introduce ICONS, a gradient-driven Influence CONsensus approach for vision-language data Selection that selects a compact training dataset for efficient multi-task training. The key element of our approach is cross-task influence consensus, which uses majority voting across task-specific influence matrices to identify samples that are consistently valuable across multiple tasks, allowing us to effectively prioritize data that optimizes for overall performance. Experiments show that models trained on our selected data (20% of LLaVA-665K) achieve 98.6% of the relative performance obtained using the full dataset. Additionally, we release this subset, LLaVA-ICONS-133K, a compact yet highly informative subset of LLaVA-665K visual instruction tuning data, preserving high impact training data for efficient vision-language model development.
Authors: Kun Li, George Vosselman, Michael Ying Yang
Abstract: The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. Recent advancements, particularly Transformer-based fusion designs, have demonstrated remarkable progress in this domain. However, existing methods primarily focus on refining visual features using language-aware guidance during the cross-modal fusion stage, neglecting the complementary vision-to-language flow. This limitation often leads to irrelevant or suboptimal representations. In addition, the diverse spatial scales of ground objects in aerial images pose significant challenges to the visual perception capabilities of existing models when conditioned on textual inputs. In this paper, we propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges for RRSIS. Specifically, we design a Bidirectional Alignment Module (BAM) with learnable query tokens to selectively and effectively represent visual and linguistic features, emphasizing regions associated with key tokens. BAM is further enhanced with a dynamic feature selection block, designed to provide both macro- and micro-level visual features, preserving global context and local details to facilitate more effective cross-modal interaction. Furthermore, SBANet incorporates a text-conditioned channel and spatial aggregator to bridge the gap between the encoder and decoder, enhancing cross-scale information exchange in complex aerial scenarios. Extensive experiments demonstrate that our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets, both quantitatively and qualitatively. The code will be released after publication.
Authors: Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu
Abstract: Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench comprises a development set with 900 problems and a test set with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Moreover, inspired by multi-modal agents, we also explore which abilities of MLLMs need to be supplemented by specialist models.
Authors: Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Chen Change Loy, Lu Jiang
Abstract: Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.
Authors: Jingfeng Yao, Xinggang Wang
Abstract: Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs--representing an over 21 times convergence speedup compared to the original DiT. Models and codes are available at: https://github.com/hustvl/LightningDiT.
Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Abstract: Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance and meanwhile support fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a weighted loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.
Authors: Jinyung Hong, Theodore P. Pavlic
Abstract: Fruit flies are established model systems for studying olfactory learning as they will readily learn to associate odors with both electric shock or sugar rewards. The mechanisms of the insect brain apparently responsible for odor learning form a relatively shallow neuronal architecture. Olfactory inputs are received by the antennal lobe (AL) of the brain, which produces an encoding of each odor mixture across ~50 sub-units known as glomeruli. Each of these glomeruli then projects its component of this feature vector to several of ~2000 so-called Kenyon Cells (KCs) in a region of the brain known as the mushroom body (MB). Fly responses to odors are generated by small downstream neutrophils that decode the higher-order representation from the MB. Research has shown that there is no recognizable pattern in the glomeruli--KC connections (and thus the particular higher-order representations); they are akin to fingerprints--even isogenic flies have different projections. Leveraging insights from this architecture, we propose KCNet, a single-hidden-layer neural network that contains sparse, randomized, binary weights between the input layer and the hidden layer and analytically learned weights between the hidden layer and the output layer. Furthermore, we also propose a dynamic optimization algorithm that enables the KCNet to increase performance beyond its structural limits by searching for a more efficient set of inputs. For odorant-perception tasks that predict the perceptual properties of an odorant, we show that KCNet outperforms existing data-driven approaches, such as XGBoost. For image classification tasks, KCNet achieves reasonable performance on benchmark datasets (MNIST, Fashion-MNIST, and EMNIST) without any data-augmentation methods or convolutional layers and shows a particularly fast running time.
Authors: Yonggan Fu, Haichuan Yang, Jiayi Yuan, Meng Li, Cheng Wan, Raghuraman Krishnamoorthi, Vikas Chandra, Yingyan Celine Lin
Abstract: Efficient deep neural network (DNN) models equipped with compact operators (e.g., depthwise convolutions) have shown great potential in reducing DNNs' theoretical complexity (e.g., the total number of weights/operations) while maintaining a decent model accuracy. However, existing efficient DNNs are still limited in fulfilling their promise in boosting real-hardware efficiency, due to their commonly adopted compact operators' low hardware utilization. In this work, we open up a new compression paradigm for developing real-hardware efficient DNNs, leading to boosted hardware efficiency while maintaining model accuracy. Interestingly, we observe that while some DNN layers' activation functions help DNNs' training optimization and achievable accuracy, they can be properly removed after training without compromising the model accuracy. Inspired by this observation, we propose a framework dubbed DepthShrinker, which develops hardware-friendly compact networks via shrinking the basic building blocks of existing efficient DNNs that feature irregular computation patterns into dense ones with much improved hardware utilization and thus real-hardware efficiency. Excitingly, our DepthShrinker framework delivers hardware-friendly compact networks that outperform both state-of-the-art efficient DNNs and compression techniques, e.g., a 3.06% higher accuracy and 1.53$\times$ throughput on Tesla V100 over SOTA channel-wise pruning method MetaPruning. Our codes are available at: https://github.com/facebookresearch/DepthShrinker.
Authors: Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang
Abstract: In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation. Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner. It contains three separate encoders for single modality representations, and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR model, including Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language and audio to the same common space, building vision-language, audio-language and audiovisual-language alignment simultaneously. MGC learns how to generate text tokens in conditions of vision, audio or their both. To promote vision-audio-language pretraining research, we construct a large-scale high-quality tri-modality dataset named VALOR-1M, which contains 1M audiable videos with human annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and be generalized to various downstream tasks (e.g., retrieval, captioning and question answering), with different input modalities (e.g., vision-language, audio-language and audiovisual-language). VALOR achieves new state-of-the-art performances on series of public cross-modality benchmarks. Code and data are available at project page https://casia-iva-group.github.io/projects/VALOR.
Authors: Zhili He, Yu-Hsing Wang, Jian Zhang
Abstract: AI-based structural design represents a transformative approach that addresses the inefficiencies inherent in traditional structural design practices. This paper innovates the existing AI-based design frameworks from four aspects and proposes Generative AIBIM: an intelligent design pipeline that integrates BIM and generative AI. First, the proposed pipeline not only broadens the application scope of BIM, which aligns with BIM's growing relevance in civil engineering, but also marks a significant supplement to previous methods that relied on CAD drawings. Second, a two-stage generation framework incorporating generative AI (TGAI), inspired by the human drawing process, is designed to simplify the complexity of the design problem. Third, for the AI model in TGAI, this paper pioneers to fuse physical conditions into diffusion models (DMs) to build a novel physics-based conditional diffusion model (PCDM). In contrast to conventional DMs, on the one hand, PCDM directly predicts shear wall drawings to focus on similarity, and on the other hand, PCDM effectively fuses cross-domain information, i.e., design drawings, timesteps, and physical conditions, by integrating well-designed attention modules. Additionally, a new evaluation system including objective and subjective measures is designed to evaluate models' performance, complementing the evaluation system in the traditional methods. The quantitative results demonstrate that PCDM significantly surpasses recent SOTA techniques across both measures. The qualitative results highlight PCDM's superior capabilities in generating high-perceptual-quality drawings adhering to essential design criteria. In addition, benefiting from the fusion of physical conditions, PCDM effectively supports diverse and creative designs tailored to building heights and seismic precautionary intensities, showcasing its unique generation and generalization abilities.
Authors: Xi Zhang, Xiaolin Wu
Abstract: Recently, DNN models for lossless image coding have surpassed their traditional counterparts in compression performance, reducing the previous lossless bit rate by about ten percent for natural color images. But even with these advances, mathematically lossless image compression (MLLIC) ratios for natural images still fall short of the bandwidth and cost-effectiveness requirements of most practical imaging and vision systems at present and beyond. To overcome the performance barrier of MLLIC, we question the very necessity of MLLIC. Considering that all digital imaging sensors suffer from acquisition noises, why should we insist on mathematically lossless coding, i.e., wasting bits to preserve noises? Instead, we propose a new paradigm of joint denoising and compression called functionally lossless image compression (FLLIC), which performs lossless compression of optimally denoised images (the optimality may be task-specific). Although not literally lossless with respect to the noisy input, FLLIC aims to achieve the best possible reconstruction of the latent noise-free original image. Extensive experiments show that FLLIC achieves state-of-the-art performance in joint denoising and compression of noisy images and does so at a lower computational cost.
Authors: Alhassan Mumuni, Fuseini Mumuni
Abstract: Data augmentation is arguably the most important regularization technique commonly used to improve generalization performance of machine learning models. It primarily involves the application of appropriate data transformation operations to create new data samples with desired properties. Despite its effectiveness, the process is often challenging because of the time-consuming trial and error procedures for creating and testing different candidate augmentations and their hyperparameters manually. Automated data augmentation methods aim to automate the process. State-of-the-art approaches typically rely on automated machine learning (AutoML) principles. This work presents a comprehensive survey of AutoML-based data augmentation techniques. We discuss various approaches for accomplishing data augmentation with AutoML, including data manipulation, data integration and data synthesis techniques. We present extensive discussion of techniques for realizing each of the major subtasks of the data augmentation process: search space design, hyperparameter optimization and model evaluation. Finally, we carried out an extensive comparison and analysis of the performance of automated data augmentation techniques and state-of-the-art methods based on classical augmentation approaches. The results show that AutoML methods for data augmentation currently outperform state-of-the-art techniques based on conventional approaches.
Authors: Shani Goren, Ido Galil, Ran El-Yaniv
Abstract: Deploying deep neural networks for risk-sensitive tasks necessitates an uncertainty estimation mechanism. This paper introduces hierarchical selective classification, extending selective classification to a hierarchical setting. Our approach leverages the inherent structure of class relationships, enabling models to reduce the specificity of their predictions when faced with uncertainty. In this paper, we first formalize hierarchical risk and coverage, and introduce hierarchical risk-coverage curves. Next, we develop algorithms for hierarchical selective classification (which we refer to as "inference rules"), and propose an efficient algorithm that guarantees a target accuracy constraint with high probability. Lastly, we conduct extensive empirical studies on over a thousand ImageNet classifiers, revealing that training regimes such as CLIP, pretraining on ImageNet21k and knowledge distillation boost hierarchical selective performance.
Authors: Guanxiong Luo, Shoujin Huang, Martin Uecker
Abstract: Magnetic resonance imaging (MRI) is a widely used non-invasive imaging modality. However, a persistent challenge lies in balancing image quality with imaging speed. This trade-off is primarily constrained by k-space measurements, which traverse specific trajectories in the spatial Fourier domain (k-space). These measurements are often undersampled to shorten acquisition times, resulting in image artifacts and compromised quality. Generative models learn image distributions and can be used to reconstruct high-quality images from undersampled k-space data. In this work, we present the autoregressive image diffusion (AID) model for image sequences and use it to sample the posterior for accelerated MRI reconstruction. The algorithm incorporates both undersampled k-space and pre-existing information. Models trained with fastMRI dataset are evaluated comprehensively. The results show that the AID model can robustly generate sequentially coherent image sequences. In MRI applications, the AID can outperform the standard diffusion model and reduce hallucinations, due to the learned inter-image dependencies. The project code is available at https://github.com/mrirecon/aid.
Authors: Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao
Abstract: Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent with straight paths and conducts sampling by solving ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with guided vector field, our model can generate decent audio in a few, or even only one sampling step. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22%, and 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
Authors: Jifei Luo, Hantao Yao, Changsheng Xu
Abstract: Diffusion-based re-ranking is a common method used for retrieving instances by performing similarity propagation in a nearest neighbor graph. However, existing techniques that construct the affinity graph based on pairwise instances can lead to the propagation of misinformation from outliers and other manifolds, resulting in inaccurate results. To overcome this issue, we propose a novel Cluster-Aware Similarity (CAS) diffusion for instance retrieval. The primary concept of CAS is to conduct similarity diffusion within local clusters, which can reduce the influence from other manifolds explicitly. To obtain a symmetrical and smooth similarity matrix, our Bidirectional Similarity Diffusion strategy introduces an inverse constraint term to the optimization objective of local cluster diffusion. Additionally, we have optimized a Neighbor-guided Similarity Smoothing approach to ensure similarity consistency among the local neighbors of each instance. Evaluations in instance retrieval and object re-identification validate the effectiveness of the proposed CAS, our code is publicly available.
Authors: Malitha Gunawardhana, Anuradha Kulathilaka, Jichao Zhao
Abstract: Atrial fibrillation (AFib) is the prominent cardiac arrhythmia in the world. It affects mostly the elderly population, with potential consequences such as stroke and heart failure in the absence of necessary treatments as soon as possible. The importance of atrial scarring in the development and progression of AFib has gained recognition, positioning late gadolinium-enhanced magnetic resonance imaging (LGE-MRI) as a crucial technique for the non-invasive evaluation of atrial scar tissue. This review delves into the recent progress in segmenting atrial scars using LGE-MRIs, emphasizing the importance of precise scar measurement in the treatment and management of AFib. Initially, it provides a detailed examination of AFib. Subsequently, it explores the application of deep learning in this domain. The review culminates in a discussion of the latest research advancements in atrial scar segmentation using deep learning methods. By offering a thorough analysis of current technologies and their impact on AFib management strategies, this review highlights the integral role of deep learning in enhancing atrial scar segmentation and its implications for future therapeutic approaches.
Authors: Zhiwen Yang, Jiayin Li, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
Abstract: Transformers have revolutionized medical image restoration, but the quadratic complexity still poses limitations for their application to high-resolution medical images. The recent advent of the Receptance Weighted Key Value (RWKV) model in the natural language processing field has attracted much attention due to its ability to process long sequences efficiently. To leverage its advanced design, we propose Restore-RWKV, the first RWKV-based model for medical image restoration. Since the original RWKV model is designed for 1D sequences, we make two necessary modifications for modeling spatial relations in 2D medical images. First, we present a recurrent WKV (Re-WKV) attention mechanism that captures global dependencies with linear computational complexity. Re-WKV incorporates bidirectional attention as basic for a global receptive field and recurrent attention to effectively model 2D dependencies from various scan directions. Second, we develop an omnidirectional token shift (Omni-Shift) layer that enhances local dependencies by shifting tokens from all directions and across a wide context range. These adaptations make the proposed Restore-RWKV an efficient and effective model for medical image restoration. Even a lightweight variant of Restore-RWKV, with only 1.16 million parameters, achieves comparable or even superior results compared to existing state-of-the-art (SOTA) methods. Extensive experiments demonstrate that the resulting Restore-RWKV achieves SOTA performance across a range of medical image restoration tasks, including PET image synthesis, CT image denoising, MRI image super-resolution, and all-in-one medical image restoration. Code is available at: https://github.com/Yaziwel/Restore-RWKV.
Authors: Xiaoxuan Zhu, Zhouhong Gu, Sihang Jiang, Zhixu Li, Hongwei Feng, Yanghua Xiao
Abstract: Online courses have significantly lowered the barrier to accessing education, yet the varying content quality of these videos poses challenges. In this work, we focus on the task of automatically evaluating the quality of video course content. We have constructed a dataset with a substantial collection of video courses and teaching materials. We propose three evaluation principles and design a new evaluation framework, \textit{VCEval}, based on these principles. The task is modeled as a multiple-choice question-answering task, with a language model serving as the evaluator. Our method effectively distinguishes video courses of different content quality and produces a range of interpretable results.
Authors: Johnny Jingze Li, Vivek Kurien George, Gabriel A. Silva
Abstract: Emergence in machine learning refers to the spontaneous appearance of complex behaviors or capabilities that arise from the scale and structure of training data and model architectures, despite not being explicitly programmed. We introduce a novel yet straightforward neural network initialization scheme that aims at achieving greater potential for emergence. Measuring emergence as a kind of structural nonlinearity, our method adjusts the layer-wise weight scaling factors to achieve higher emergence values. This enhancement is easy to implement, requiring no additional optimization steps for initialization compared to GradInit. We evaluate our approach across various architectures, including MLP and convolutional architectures for image recognition and transformers for machine translation. We demonstrate substantial improvements in both model accuracy and training speed, with and without batch normalization. The simplicity, theoretical innovation, and demonstrable empirical advantages of our method make it a potent enhancement to neural network initialization practices. These results suggest a promising direction for leveraging emergence to improve neural network training methodologies. Code is available at: https://github.com/johnnyjingzeli/EmergenceInit.
Authors: Feng Zhou, Yanjie Zhou, Longjie Wang, Yun Peng, David E. Carlson, Liyun Tu
Abstract: Traditional one-shot medical image segmentation (MIS) methods use registration networks to propagate labels from a reference atlas or rely on comprehensive sampling strategies to generate synthetic labeled data for training. However, these methods often struggle with registration errors and low-quality synthetic images, leading to poor performance and generalization. To overcome this, we introduce a novel one-shot MIS framework based on knowledge distillation, which allows the network to directly 'see' real images through a distillation process guided by image reconstruction. It focuses on anatomical structures in a single labeled image and a few unlabeled ones. A registration-based data augmentation network creates realistic, labeled samples, while a feature distillation module helps the student network learn segmentation from these samples, guided by the teacher network. During inference, the streamlined student network accurately segments new images. Evaluations on three public datasets (OASIS for T1 brain MRI, BCV for abdomen CT, and VerSe for vertebrae CT) show superior segmentation performance and generalization across different medical image datasets and modalities compared to leading methods. Our code is available at https://github.com/NoviceFodder/OS-MedSeg.
Authors: Ruiquan Ge, Xiao Yu, Yifei Chen, Guanyu Zhou, Fan Jia, Shenghao Zhu, Junhao Jia, Chenyan Zhang, Yifei Sun, Dong Zeng, Changmiao Wang, Qiegen Liu, Shanzhou Niu
Abstract: Magnetic Resonance Imaging (MRI) has become essential in clinical diagnosis due to its high resolution and multiple contrast mechanisms. However, the relatively long acquisition time limits its broader application. To address this issue, this study presents an innovative conditional guided diffusion model, named as TC-KANRecon, which incorporates the Multi-Free U-KAN (MF-UKAN) module and a dynamic clipping strategy. TC-KANRecon model aims to accelerate the MRI reconstruction process through deep learning methods while maintaining the quality of the reconstructed images. The MF-UKAN module can effectively balance the tradeoff between image denoising and structure preservation. Specifically, it presents the multi-head attention mechanisms and scalar modulation factors, which significantly enhances the model's robustness and structure preservation capabilities in complex noise environments. Moreover, the dynamic clipping strategy in TC-KANRecon adjusts the cropping interval according to the sampling steps, thereby mitigating image detail loss typicalching the visual features of the images. Furthermore, the MC-Model incorporates full-sampling k-space information, realizing efficient fusion of conditional information, enhancing the model's ability to process complex data, and improving the realism and detail richness of reconstructed images. Experimental results demonstrate that the proposed method outperforms other MRI reconstruction methods in both qualitative and quantitative evaluations. Notably, TC-KANRecon method exhibits excellent reconstruction results when processing high-noise, low-sampling-rate MRI data. Our source code is available at https://github.com/lcbkmm/TC-KANRecon.
Authors: Malitha Gunawardhana, Fangqiang Xu, Jichao Zhao
Abstract: Cardiac segmentation is a critical task in medical imaging, essential for detailed analysis of heart structures, which is crucial for diagnosing and treating various cardiovascular diseases. With the advent of deep learning, automated segmentation techniques have demonstrated remarkable progress, achieving high accuracy and efficiency compared to traditional manual methods. Among these techniques, the nnU-Net framework stands out as a robust and versatile tool for medical image segmentation. In this study, we evaluate the performance of nnU-Net in segmenting cardiac magnetic resonance images (MRIs). Utilizing five cardiac segmentation datasets, we employ various nnU-Net configurations, including 2D, 3D full resolution, 3D low resolution, 3D cascade, and ensemble models. Our study benchmarks the capabilities of these configurations and examines the necessity of developing new models for specific cardiac segmentation tasks.
Authors: Chenjun Li, Dian Yang, Shun Yao, Shuyue Wang, Ye Wu, Le Zhang, Qiannuo Li, Kang Ik Kevin Cho, Johanna Seitz-Holland, Lipeng Ning, Jon Haitz Legarreta, Yogesh Rathi, Carl-Fredrik Westin, Lauren J. O'Donnell, Nir A. Sochen, Ofer Pasternak, Fan Zhang
Abstract: In this study, we developed an Evidence-based Ensemble Neural Network, namely EVENet, for anatomical brain parcellation using diffusion MRI. The key innovation of EVENet is the design of an evidential deep learning framework to quantify predictive uncertainty at each voxel during a single inference. To do so, we design an evidence-based ensemble learning framework for uncertainty-aware parcellation to leverage the multiple dMRI parameters derived from diffusion MRI. Using EVENet, we obtained accurate parcellation and uncertainty estimates across different datasets from healthy and clinical populations and with different imaging acquisitions. The overall network includes five parallel subnetworks, where each is dedicated to learning the FreeSurfer parcellation for a certain diffusion MRI parameter. An evidence-based ensemble methodology is then proposed to fuse the individual outputs. We perform experimental evaluations on large-scale datasets from multiple imaging sources, including high-quality diffusion MRI data from healthy adults and clinically diffusion MRI data from participants with various brain diseases (schizophrenia, bipolar disorder, attention-deficit/hyperactivity disorder, Parkinson's disease, cerebral small vessel disease, and neurosurgical patients with brain tumors). Compared to several state-of-the-art methods, our experimental results demonstrate highly improved parcellation accuracy across the multiple testing datasets despite the differences in dMRI acquisition protocols and health conditions. Furthermore, thanks to the uncertainty estimation, our EVENet approach demonstrates a good ability to detect abnormal brain regions in patients with lesions, enhancing the interpretability and reliability of the segmentation results.
Authors: Naren Sengodan
Abstract: Breast cancer histopathology image classification is critical for early detection and improved patient outcomes. 1 This study introduces a novel approach leveraging EfficientNetV2 models, to improve feature extraction and focus on relevant tissue regions. The proposed models were evaluated on the BreakHis dataset across multiple magnification scales (40X, 100X, 200X, and 400X). 2 Among them, the EfficientNetV2-XL with CBAM achieved outstanding performance, reaching a peak accuracy of 98.96 percent and an F1-score of 98.31 percent at 400X magnification, outperforming state-of-the-art methods. 3 By integrating Contrast Limited Adaptive Histogram Equalization (CLAHE) for preprocessing and optimizing computational efficiency, this method demonstrates its suitability for real-time clinical deployment. 3 The results underscore the potential of attention-enhanced scalable architectures in advancing diagnostic precision for breast cancer detection.
Authors: Isaac Baglin, Xiatian Zhu, Simon Hadfield
Abstract: Federated Learning is a privacy preserving decentralized machine learning paradigm designed to collaboratively train models across multiple clients by exchanging gradients to the server and keeping private data local. Nevertheless, recent research has revealed that the security of Federated Learning is compromised, as private ground truth data can be recovered through a gradient inversion technique known as Deep Leakage. While these attacks are crafted with a focus on applications in Federated Learning, they generally are not evaluated in realistic scenarios. This paper introduces the FEDLAD Framework (Federated Evaluation of Deep Leakage Attacks and Defenses), a comprehensive benchmark for evaluating Deep Leakage attacks and defenses within a realistic Federated context. By implementing a unified benchmark that encompasses multiple state-of-the-art Deep Leakage techniques and various defense strategies, our framework facilitates the evaluation and comparison of the efficacy of these methods across different datasets and training states. This work highlights a crucial trade-off between privacy and model accuracy in Federated Learning and aims to advance the understanding of security challenges in decentralized machine learning systems, stimulate future research, and enhance reproducibility in evaluating Deep Leakage attacks and defenses.
Authors: Shengqi Chen, Zilin Wang, Jianrong Dai, Shirui Qin, Ying Cao, Ruiao Zhao, Jiayun Chen, Guohua Wu, Yuan Tang
Abstract: Background and Purpose: Accurate motion tracking in MRI-guided Radiotherapy (MRIgRT) is essential for effective treatment delivery. This study aimed to enhance motion tracking precision in MRIgRT through an automatic real-time markerless tracking method using an enhanced Tracking-Learning-Detection (ETLD) framework with automatic segmentation. Materials and Methods: We developed a novel MRIgRT motion tracking and segmentation method by integrating the ETLD framework with an improved Chan-Vese model (ICV), named ETLD+ICV. The ETLD framework was upgraded for real-time cine MRI, including advanced image preprocessing, no-reference image quality assessment, an enhanced median-flow tracker, and a refined detector with dynamic search region adjustments. ICV was used for precise target volume coverage, refining the segmented region frame by frame using tracking results, with key parameters optimized. The method was tested on 3.5D MRI scans from 10 patients with liver metastases. Results: Evaluation of 106,000 frames across 77 treatment fractions showed sub-millimeter tracking errors of less than 0.8mm, with over 99% precision and 98% recall for all subjects in the Beam Eye View(BEV)/Beam Path View(BPV) orientation. The ETLD+ICV method achieved a dice global score of more than 82% for all subjects, demonstrating the method's extensibility and precise target volume coverage. Conclusion: This study successfully developed an automatic real-time markerless motion tracking method for MRIgRT that significantly outperforms current methods. The novel method not only delivers exceptional precision in tracking and segmentation but also shows enhanced adaptability to clinical demands, making it an indispensable asset in improving the efficacy of radiotherapy treatments.
Authors: Changqing Ji
Abstract: In the application of brain-computer interface (BCI), being able to accurately decode brain signals is a critical task. For the multi-class classification task of brain signal ECoG, how to improve the classification accuracy is one of the current research hotspots. ECoG acquisition uses a high-density electrode array and a high sampling frequency, which makes ECoG data have a certain high similarity and data redundancy in the temporal domain, and also unique spatial pattern in spatial domain. How to effectively extract features is both exciting and challenging. Previous work found that visual-related ECoG can carry visual information via frequency and spatial domain. Based on this finding, we focused on using deep learning to design frequency and spatial feature extraction modules, and proposed a Bi-Band ECoGNet model based on deep learning. The main contributions of this paper are: 1) The Bi-BCWT (Bi-Band Channel-Wise Transform) neural network module is designed to replace the time-consume method MST, this module greatly improves the model calculation and data storage efficiency, and effectively increases the training speed; 2) The Bi-BCWT module can effectively take into account the information both in low-frequency and high-frequency domain, which is more conducive to ECoG multi-classification tasks; 3) ECoG is acquired using 2D electrode array, the newly designed 2D Spatial-Temporal feature encoder can extract the 2D spatial feature better. Experiments have shown that the unique 2D spatial data structure can effectively improve classification accuracy; 3) Compared with previous work, the Bi-Band ECoGNet model is smaller and has higher performance, with an accuracy increase of 1.24%, and the model training speed is increased by 6 times, which is more suitable for BCI applications.
Authors: Yu Zhang, Shuang Li, Yibing Wang, Yu Sun, Wenyi Xiang
Abstract: Photoacoustic imaging (PAI) suffers from inherent limitations that can degrade the quality of reconstructed results, such as noise, artifacts and incomplete data acquisition caused by sparse sampling or partial array detection. In this study, we proposed a new optimization method for both two-dimensional (2D) and three-dimensional (3D) PAI reconstruction results, called the regularized iteration method with shape prior. The shape prior is a probability matrix derived from the reconstruction results of multiple sets of random partial array signals in a computational imaging system using any reconstruction algorithm, such as Delay-and-Sum (DAS) and Back-Projection (BP). In the probability matrix, high-probability locations indicate high consistency among multiple reconstruction results at those positions, suggesting a high likelihood of representing the true imaging results. In contrast, low-probability locations indicate higher randomness, leaning more towards noise or artifacts. As a shape prior, this probability matrix guides the iteration and regularization of the entire array signal reconstruction results using the original reconstruction algorithm (the same algorithm for processing random partial array signals). The method takes advantage of the property that the similarity of the object to be imitated is higher than that of noise or artifact in the results reconstructed by multiple sets of random partial array signals of the entire imaging system. The probability matrix is taken as a prerequisite for improving the original reconstruction results, and the optimizer is used to further iterate the imaging results to remove noise and artifacts and improve the imaging fidelity. Especially in the case involving sparse view which brings more artifacts, the effect is remarkable. Simulation and real experiments have both demonstrated the superiority of this method.
Authors: Ao Liu, Cuihong Wen, Jieci Wang
Abstract: The integration of algorithms from quantum information with neural networks has enabled unprecedented advancements in various domains. Nonetheless, the application of quantum machine learning algorithms for image classiffcation predominantly relies on traditional architectures such as variational quantum circuits. The performance of these models is closely tied to the scale of their parameters, with the substantial demand for parameters potentially leading to limitations in computational resources and a signiffcant increase in computation time. In this paper, we introduce a Lean Classical-Quantum Hybrid Neural Network (LCQHNN), which achieves efffcient classiffcation performance with only four layers of variational circuits, thereby substantially reducing computational costs. We apply the LCQHNN to image classiffcation tasks on public datasets and achieve a classiffcation accuracy of 99.02% on the dataset, marking a 5.07% improvement over traditional deep learning methods. Under the same parameter conditions, this method shows a 75% and 70.59% improvement in training convergence speed on two datasets. Furthermore, through visualization studies, it is found that the model effectively captures key data features during training and establishes a clear association between these features and their corresponding categories. This study conffrms that the employment of quantum algorithms enhances the model's ability to handle complex classiffcation problems.