Authors: Leo Thomas Ramos, Angel D. Sappa
Abstract: This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions, highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework's continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO's trajectory and ongoing transformation.
Authors: Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual question answering. However, current benchmarks typically focus on knowledge-centric evaluations that assess domain-specific expertise, often neglecting the core ability to reason about fundamental mathematical elements and visual concepts. We identify a gap in evaluating elementary-level math problems, which rely on explicit visual dependencies, requiring models to discern, integrate, and reason across multiple images while incorporating commonsense knowledge; all of these abilities are crucial for advancing toward broader AGI capabilities. To address this gap, we introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy. Our findings highlight the ongoing challenges in visual-mathematical integration and suggest avenues for future LVLM advancements.
Authors: David Aparco-Cardenas, Jancarlo F. Gomes, Alexandre X. Falc\~ao, Pedro J. de Rezende
Abstract: A major challenge that prevents the training of deep learning (DL) models is the limited availability of accurately labeled data. This shortcoming is highlighted in areas where data annotation becomes a time-consuming and error-prone task. In this regard, semi-supervised learning (SSL) tackles this challenge by capitalizing on scarce labeled and abundant unlabeled data; however, state-of-the-art (SoTA) methods typically depend on pre-trained features and large validation sets to learn effective representations for classification tasks. In addition, the reduced set of labeled data is often randomly sampled, neglecting the selection of more informative samples. Here, we present active-DeepFA, a method that effectively combines contrastive learning (CL), teacher-student-based meta-pseudo-labeling, and active learning (AL) to train non-pretrained CNN architectures for image classification in scenarios with scarce labeled and abundant unlabeled data. It integrates DeepFA into a co-training setup that implements two cooperative networks to mitigate confirmation bias from pseudo-labels. The method starts with a reduced set of labeled samples by warming up the networks with supervised CL. Afterward, at regular epoch intervals, label propagation is performed on the 2D projections of the networks' deep features. Next, the most reliable pseudo-labels are exchanged between networks in a cross-training fashion, while the most meaningful samples are annotated and added to the labeled set. The networks independently minimize an objective loss function comprising supervised contrastive, supervised, and semi-supervised loss components, enhancing the representations towards image classification. Our approach is evaluated on three challenging biological image datasets using only 5% of labeled samples, improving baselines and outperforming six other SoTA methods. In addition, it reduces annotation effort by achieving results comparable to those of its counterparts with only 3% of labeled data.
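A minimal illustration of the label-propagation step described above, assuming off-the-shelf scikit-learn components rather than the authors' DeepFA implementation; the feature matrix, label ratio, and confidence threshold are placeholders.

```python
# Illustrative sketch (not the authors' DeepFA implementation): propagate labels
# on a 2D projection of deep features using off-the-shelf scikit-learn tools.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 128))          # placeholder deep features
labels = np.full(500, -1)                       # -1 marks unlabeled samples
labels[:25] = rng.integers(0, 3, size=25)       # roughly 5% labeled

# Project deep features to 2D, then spread labels over the projection.
proj_2d = TSNE(n_components=2, random_state=0).fit_transform(features)
propagator = LabelSpreading(kernel="knn", n_neighbors=10)
propagator.fit(proj_2d, labels)

pseudo_labels = propagator.transduction_                    # labels for all samples
confidence = propagator.label_distributions_.max(axis=1)    # keep the most reliable
reliable = confidence > 0.9
print(f"{reliable.sum()} reliable pseudo-labels out of {len(labels)}")
```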
Authors: Nader Zantout, Haochen Zhang, Pujith Kachana, Jinkai Qiu, Ji Zhang, Wenshan Wang
Abstract: Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run in real time on an autonomous vehicle and demonstrate that our approach can be used for object-goal navigation in previously unseen real-world environments. All source code for the system pipeline is publicly released at https://github.com/nzantout/SORT3D .
Authors: Apoorva Beedu, Irfan Essa
Abstract: Video summarization creates an abridged version (i.e., a summary) that provides a quick overview of the video while retaining pertinent information. In this work, we focus on summarizing instructional videos and propose a method for breaking down a video into meaningful segments, each corresponding to essential steps in the video. We propose \textbf{HierSum}, a hierarchical approach that integrates fine-grained local cues from subtitles with global contextual information provided by video-level instructions. Our approach utilizes the ``most replayed'' statistic as a supervisory signal to identify critical segments, thereby improving the effectiveness of the summary. We evaluate on benchmark datasets such as TVSum, BLiSS, Mr.HiSum, and the WikiHow test set, and show that HierSum consistently outperforms existing methods in key metrics such as F1-score and rank correlation. We also curate a new multi-modal dataset using WikiHow and EHow videos and associated articles containing step-by-step instructions. Through extensive ablation studies, we demonstrate that training on this dataset significantly enhances summarization on the target datasets.
Authors: Ranjan Sapkota, Konstantinos I Roumeliotis, Rahul Harsha Cheppally, Marco Flores Calero, Manoj Karkee
Abstract: This review provides a comprehensive and systematic survey of 3D object detection with vision-language models (VLMs), a rapidly advancing area at the intersection of 3D vision and multimodal AI. By examining over 100 research papers, we provide the first systematic analysis dedicated to 3D object detection with vision-language models. We begin by outlining the unique challenges of 3D object detection with vision-language models, emphasizing differences from 2D detection in spatial reasoning and data complexity. Traditional approaches using point clouds and voxel grids are compared to modern vision-language frameworks like CLIP and 3D LLMs, which enable open-vocabulary detection and zero-shot generalization. We review key architectures, pretraining strategies, and prompt engineering methods that align textual and 3D features for effective 3D object detection with vision-language models. Visualization examples and evaluation benchmarks are discussed to illustrate performance and behavior. Finally, we highlight current challenges, such as limited 3D-language datasets and computational demands, and propose future research directions to advance 3D object detection with vision-language models. Keywords: Object Detection, Vision-Language Models, Agents, VLMs, LLMs, AI
Authors: Brian K. S. Isaac-Medina, Toby P. Breckon
Abstract: Deep neural networks have demonstrated great generalization capabilities for tasks whose training and test sets are drawn from the same distribution. Nevertheless, out-of-distribution (OOD) detection remains a challenging task that has received significant attention in recent years. Specifically, OOD detection refers to the detection of instances that do not belong to the training distribution, while still having good performance on the in-distribution task (e.g., classification or object detection). Recent work has focused on generating synthetic outliers and using them to train an outlier detector, generally achieving better OOD detection than traditional OOD methods. In this regard, outliers can be generated either in feature or pixel space. Feature-space-driven methods have shown strong performance on both the classification and object detection tasks, at the expense that the training outliers cannot be visualized, making further analysis of OOD failure modes challenging. On the other hand, pixel-space outlier generation techniques enabled by diffusion models have been used for image classification, providing improved OOD detection performance and outlier visualization, although their adaptation to the object detection task is as yet unexplored. We therefore introduce Dream-Box, a method that provides a link to object-wise outlier generation in the pixel space for OOD detection. Specifically, we use diffusion models to generate object-wise outliers that are used to train an object detector for an in-distribution task and OOD detection. Our method achieves comparable performance to previous traditional methods while being the first technique to provide concrete visualization of generated OOD objects.
Authors: Rezowan Shuvo, M S Mekala, Eyad Elyan
Abstract: Understanding actions within surgical workflows is essential for evaluating post-operative outcomes. However, capturing long sequences of actions performed in surgical settings poses challenges, as individual surgeons have their unique approaches shaped by their expertise, leading to significant variability. To tackle this complex problem, we focused on segmentation with precise boundaries, a demanding task due to the inherent variability in action durations and the subtle transitions often observed in untrimmed videos. These transitions, marked by ambiguous starting and ending points, complicate the segmentation process. Traditional models, such as MS-TCN, which depend on large receptive fields, frequently face challenges of over-segmentation (resulting in fragmented segments) or under-segmentation (merging distinct actions). Both of these issues negatively impact the quality of segmentation. To overcome these challenges, we present the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention, designed to enhance action segmentation. Our proposed approach incorporates a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks. Unlike traditional binary boundary detection methods, our boundary voting mechanism accurately identifies start and end points by leveraging contextual information. Extensive experiments using three challenging surgical datasets demonstrate the superior performance of the proposed method, achieving state-of-the-art results in F1 scores at thresholds of 25% and 50%, while also delivering comparable performance in other metrics.
Authors: Manuel Weber, Carly Beneke
Abstract: We propose PyViT-FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualizing the attention scores, and demonstrate the model's applicability to downstream tasks.
Authors: Zhiheng Tu, Xinjian Huang, Yong He, Ruiyang Zhou, Bo Du, Weitao Wu
Abstract: The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection and imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Leveraging this method, we construct the virtual depth estimation dataset VirDepth, a large-scale, multi-task autonomous driving dataset. Subsequently, we propose CenterDepth, a lightweight architecture for monocular depth estimation that ensures high operational efficiency and exhibits superior performance in depth estimation tasks with highly imbalanced height-scale distributions. CenterDepth integrates global semantic information through the innovative Center FC-CRFs algorithm, aggregates multi-scale features based on object key points, and enables detection-based depth estimation of targets. Experiments demonstrate that our proposed method achieves superior performance in terms of both computational speed and prediction accuracy.
Authors: Hassan Wasswa, Timothy Lynar, Aziida Nanyonga, Hussein Abbass
Abstract: Despite the demonstrated effectiveness of transformer models in NLP and in image and video classification, the available tools for extracting features from captured IoT network flow packets capture neither sequential nor spatial patterns, limiting the application of transformer models. This work introduces a novel preprocessing method to adapt transformer models, the vision transformer (ViT) in particular, for IoT botnet attack detection using network flow packets. The approach involves extracting features from .pcap files and transforming each instance into a 1-channel 2D image, enabling ViT-based classification. In addition, the ViT model was enhanced to allow the use of any classifier besides the Multilayer Perceptron (MLP) deployed in the original ViT paper. Models including a conventional feed-forward Deep Neural Network (DNN), LSTM, and Bidirectional LSTM (BLSTM) demonstrated competitive performance in terms of precision, recall, and F1-score for multiclass attack detection when evaluated on two IoT attack datasets.
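A rough sketch of the described preprocessing idea, turning a per-flow feature vector into a 1-channel 2D image for a ViT; the feature count and image side length are assumptions, not values from the paper.

```python
# Hedged sketch: pad each flow's feature vector to a square length and reshape it
# into a 1-channel 2D "image" suitable for a ViT patch embedding.
import numpy as np

def flow_features_to_image(features: np.ndarray, side: int = 16) -> np.ndarray:
    """Pad/truncate a 1D feature vector and reshape it to (1, side, side)."""
    flat = np.zeros(side * side, dtype=np.float32)
    n = min(features.size, side * side)
    flat[:n] = features[:n]
    # Min-max scale to [0, 1] so the tensor behaves like pixel intensities.
    lo, hi = flat.min(), flat.max()
    if hi > lo:
        flat = (flat - lo) / (hi - lo)
    return flat.reshape(1, side, side)

# Example: 115 statistical features extracted from one .pcap flow record (placeholder).
dummy_flow = np.random.rand(115).astype(np.float32)
image = flow_features_to_image(dummy_flow)
print(image.shape)  # (1, 16, 16)
```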
Authors: Hang Yu, Jiahao Wen, Zhedong Zheng
Abstract: Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability during pretraining to facilitate the subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history of errors encountered across multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at https://github.com/Jahawn-Wen/CAMeL-reID.
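One plausible reading of the adaptive dual-speed update, sketched as fast inner-loop adaptation followed by a small interpolation of the slow weights toward the adapted ones (Reptile-style); the model, learning rates, and step counts are illustrative and not CAMeL's actual configuration.

```python
# Minimal sketch of a dual-speed update in PyTorch: fast inner-loop steps adapt
# to a new task, and slow weights take a small step toward the adapted weights.
# This illustrates the idea only; it is not the authors' exact CAMeL update rule.
import copy
import torch
import torch.nn as nn

model = nn.Linear(256, 128)                 # placeholder retrieval encoder head
slow_lr, fast_lr, inner_steps = 0.05, 0.1, 5

def dual_speed_update(model, task_batch, loss_fn):
    fast_model = copy.deepcopy(model)
    opt = torch.optim.SGD(fast_model.parameters(), lr=fast_lr)
    for _ in range(inner_steps):            # fast adaptation to the new task
        x, y = task_batch
        opt.zero_grad()
        loss_fn(fast_model(x), y).backward()
        opt.step()
    with torch.no_grad():                    # slow update preserves history
        for p_slow, p_fast in zip(model.parameters(), fast_model.parameters()):
            p_slow.add_(slow_lr * (p_fast - p_slow))

x, y = torch.randn(32, 256), torch.randn(32, 128)
dual_speed_update(model, (x, y), nn.MSELoss())
```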
Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda
Abstract: Echocardiography involves recording videos of the heart using ultrasound, enabling clinicians to evaluate its condition. Recent advances in large-scale vision-language models (VLMs) have garnered attention for automating the interpretation of echocardiographic videos. However, most existing VLMs proposed for medical interpretation thus far rely on single-frame (i.e., image) inputs. Consequently, these image-based models often exhibit lower diagnostic accuracy for conditions identifiable through cardiac motion. Moreover, echocardiographic videos are recorded from various views that depend on the direction of ultrasound emission, and certain views are more suitable than others for interpreting specific conditions. Incorporating multiple views could potentially yield further improvements in accuracy. In this study, we developed a video-language model that takes five different views and full video sequences as input, training it on pairs of echocardiographic videos and clinical reports from 60,747 cases. Our experiments demonstrate that this expanded approach achieves higher interpretation accuracy than models trained with only single-view videos or with still images.
Authors: Yifan Xie, Fei Ma, Yi Bin, Ying He, Fei Yu
Abstract: Talking face video generation with arbitrary speech audio is a significant challenge within the realm of digital human technology. The previous studies have emphasized the significance of audio-lip synchronization and visual quality. Currently, limited attention has been given to the learning of visual uncertainty, which creates several issues in existing systems, including inconsistent visual quality and unreliable performance across different input conditions. To address the problem, we propose a Joint Uncertainty Learning Network (JULNet) for high-quality talking face video generation, which incorporates a representation of uncertainty that is directly related to visual error. Specifically, we first design an uncertainty module to individually predict the error map and uncertainty map after obtaining the generated image. The error map represents the difference between the generated image and the ground truth image, while the uncertainty map is used to predict the probability of incorrect estimates. Furthermore, to match the uncertainty distribution with the error distribution through a KL divergence term, we introduce a histogram technique to approximate the distributions. By jointly optimizing error and uncertainty, the performance and robustness of our model can be enhanced. Extensive experiments demonstrate that our method achieves superior high-fidelity and audio-lip synchronization in talking face video generation compared to previous methods.
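A hedged sketch of matching the uncertainty distribution to the error distribution with histograms and a KL term; the binning here uses torch.histc and is not differentiable, whereas the paper's histogram technique is presumably a differentiable approximation.

```python
# Illustrative sketch: build histograms of the error map and uncertainty map and
# measure their mismatch with a KL divergence.
import torch
import torch.nn.functional as F

def histogram_kl(error_map, uncertainty_map, bins=32, eps=1e-8):
    def norm(x):                                   # scale each map to [0, 1]
        return (x - x.min()) / (x.max() - x.min() + eps)
    e_hist = torch.histc(norm(error_map), bins=bins, min=0.0, max=1.0)
    u_hist = torch.histc(norm(uncertainty_map), bins=bins, min=0.0, max=1.0)
    p = e_hist / (e_hist.sum() + eps)              # target: error distribution
    q = u_hist / (u_hist.sum() + eps)              # prediction: uncertainty distribution
    return F.kl_div((q + eps).log(), p, reduction="sum")

generated = torch.rand(1, 3, 128, 128)             # placeholder generated frame
ground_truth = torch.rand(1, 3, 128, 128)
error_map = (generated - ground_truth).abs().mean(dim=1)   # per-pixel error
uncertainty_map = torch.rand(1, 128, 128)                  # stand-in for the model's prediction
print(histogram_kl(error_map, uncertainty_map).item())
```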
Authors: Shahad Albastaki, Anabia Sohail, Iyyakutti Iyappan Ganapathi, Basit Alawode, Asim Khan, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood
Abstract: In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through an advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Our model, supported by novel loss functions, captures a broader range of information, enriches feature representation, improves discriminative ability, and enhances generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP
Authors: Yunzhong Zhang, Bo Xiong, You Zhou, Changqing Su, Zhen Cheng, Zhaofei Yu, Xun Cao, Tiejun Huang
Abstract: The need for accurate and non-intrusive flow measurement methods has led to the widespread adoption of Particle Image Velocimetry (PIV), a powerful diagnostic tool in fluid motion estimation. This study investigates the tremendous potential of spike cameras (a type of ultra-high-speed, high-dynamic-range camera) in PIV. We propose a deep learning framework, Spike Imaging Velocimetry (SIV), designed specifically for highly turbulent and intricate flow fields. To aggregate motion features from the spike stream while minimizing information loss, we incorporate a Detail-Preserving Hierarchical Transform (DPHT) module. Additionally, we introduce a Graph Encoder (GE) to extract contextual features from highly complex fluid flows. Furthermore, we present a spike-based PIV dataset, Particle Scenes with Spike and Displacement (PSSD), which provides labeled data for three challenging fluid dynamics scenarios. Our proposed method achieves superior performance compared to existing baseline methods on PSSD. The datasets and our implementation of SIV are open-sourced in the supplementary materials.
Authors: Jiaxu Leng, Zhanjie Wu, Mingpi Tan, Mengjingcheng Mo, Jiankang Zheng, Qingqing Li, Ji Gan, Xinbo Gao
Abstract: Existing weakly supervised video violence detection (VVD) methods primarily rely on Euclidean representation learning, which often struggles to distinguish visually similar yet semantically distinct events due to limited hierarchical modeling and insufficient ambiguous training samples. To address this challenge, we propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to progressively model event hierarchies, and a cross-space attention mechanism to facilitate complementary feature interactions between Euclidean and hyperbolic spaces. Furthermore, to mitigate the scarcity of ambiguous samples, we leverage large language models to generate logic-guided ambiguous event descriptions, enabling explicit supervision through a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples via dynamic similarity-aware weighting. Extensive experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance, with particularly strong results on a newly curated ambiguous event subset, validating its superior capability in fine-grained violence detection.
Authors: Guodong Sun, Mingjing Li, Dingjie Liu, Mingxuan Liu, Bo Wu, Yang Zhang
Abstract: As an essential component of logistics automation, the automated loading system is becoming a critical technology for enhancing operational efficiency and safety. Precise automatic positioning of the truck compartment, which serves as the loading area, is the primary step in automated loading. However, existing methods have difficulty adapting to truck compartments of various sizes, do not establish a unified coordinate system for LiDAR and mobile manipulators, and often exhibit reliability issues in cluttered environments. To address these limitations, our study focuses on achieving precise automatic positioning of key points in large, medium, and small fence-style truck compartments in cluttered scenarios. We propose an innovative wide field-of-view 3-D LiDAR vehicle compartment automatic localization system. For vehicles of various sizes, this system leverages the LiDAR to generate high-density point clouds within an extensive field-of-view range. By incorporating parking area constraints, our vehicle point cloud segmentation method more effectively segments vehicle point clouds within the scene. Our compartment key point positioning algorithm utilizes the geometric features of the compartments to accurately locate the corner points, providing stackable spatial regions. Extensive experiments on our collected data and public datasets demonstrate that this system offers reliable positioning accuracy and reduced computational resource consumption, leading to its application and promotion in relevant fields.
Authors: Simone Maurizio La Cava, Roberto Casula, Sara Concas, Giulia Orr\`u, Ruben Tolosana, Martin Drahansky, Julian Fierrez, Gian Luca Marcialis
Abstract: 3D face reconstruction (3DFR) algorithms are based on specific assumptions tailored to the limits and characteristics of the different application scenarios. In this study, we investigate how multiple state-of-the-art 3DFR algorithms can be used to generate a better representation of subjects, with the final goal of improving the performance of face recognition systems in challenging uncontrolled scenarios. We also explore how different parametric and non-parametric score-level fusion methods can exploit the unique strengths of multiple 3DFR algorithms to enhance biometric recognition robustness. With this goal, we propose a comprehensive analysis of several face recognition systems across diverse conditions, such as varying distances and camera setups, intra-dataset and cross-dataset, to assess the robustness of the proposed ensemble method. The results demonstrate that the distinct information provided by different 3DFR algorithms can alleviate the problem of generalizing over multiple application scenarios. In addition, the present study highlights the potential of advanced fusion strategies to enhance the reliability of 3DFR-based face recognition systems, providing the research community with key insights to exploit them in real-world applications effectively. Although the experiments are carried out in a specific face verification setup, our proposed fusion-based 3DFR methods may be applied to other tasks around face biometrics that are not strictly related to identity recognition.
Authors: Yufeng Wu, Xin Liao, Baowei Wang, Han Fang, Xiaoshuai Wu, Guiling Wang
Abstract: Unauthorized screen capturing and dissemination pose severe security threats such as data leakage and information theft. Several studies propose robust watermarking methods to track the copyright of Screen-Camera (SC) images, facilitating post-hoc certification against infringement. These techniques typically employ heuristic mathematical modeling or supervised neural network fitting as the noise layer, to enhance watermarking robustness against SC. However, both strategies cannot fundamentally achieve an effective approximation of SC noise. Mathematical simulation suffers from biased approximations due to the incomplete decomposition of the noise and the absence of interdependence among the noise components. Supervised networks require paired data to train the noise-fitting model, and it is difficult for the model to learn all the features of the noise. To address the above issues, we propose Simulation-to-Real (S2R). Specifically, an unsupervised noise layer employs unpaired data to learn the discrepancy between the modeling simulated noise distribution and the real-world SC noise distribution, rather than directly learning the mapping from sharp images to real-world images. Learning this transformation from simulation to reality is inherently simpler, as it primarily involves bridging the gap in noise distributions, instead of the complex task of reconstructing fine-grained image details. Extensive experimental results validate the efficacy of the proposed method, demonstrating superior watermark robustness and generalization compared to those of state-of-the-art methods.
Authors: Ali Nazari, Mohsen Ebrahimi Moghaddam, Omidreza Borzoei
Abstract: Early methods used face representations in kinship verification, which are less accurate than joint representations of parents' and children's facial images learned from scratch. We propose an approach featuring graph neural network concepts to utilize face representations and achieve results comparable to joint-representation algorithms. Moreover, we designed the structure of the classification module and introduced a new combination of losses to engage the center loss gradually in training our network. Additionally, we conducted experiments on KinFaceW-I and II, demonstrating the effectiveness of our approach. We achieved the best result on KinFaceW-II, an average improvement of nearly 1.6 across all kinship types, and were close to the best on KinFaceW-I. The code is available at https://github.com/ali-nazari/Kinship-Verification
Authors: Kamirul Kamirul, Odysseas Pappas, Alin Achim
Abstract: We introduce R-Sparse R-CNN, a novel pipeline for oriented ship detection in Synthetic Aperture Radar (SAR) images that leverages sparse learnable proposals enriched with background contextual information, termed background-aware proposals (BAPs). The adoption of sparse proposals streamlines the pipeline by eliminating the need for proposal generators and post-processing for overlapping predictions. The proposed BAPs enrich object representation by integrating ship and background features, allowing the model to learn their contextual relationships for more accurate distinction of ships in complex environments. To complement BAPs, we propose Dual-Context Pooling (DCP), a novel strategy that jointly extracts ship and background features in a single unified operation. This unified design improves efficiency by eliminating redundant computation inherent in separate pooling. Moreover, by ensuring that ship and background features are pooled from the same feature map level, DCP provides aligned features that improve contextual relationship learning. Finally, as a core component of contextual relationship learning in R-Sparse R-CNN, we design a dedicated transformer-based Interaction Module. This module lets the pooled ship and background features interact with the corresponding proposal features and models their relationships. Experimental results show that R-Sparse R-CNN delivers outstanding accuracy, surpassing state-of-the-art models by margins of up to 12.8% and 11.9% on the SSDD and RSDD-SAR inshore datasets, respectively. These results demonstrate the effectiveness and competitiveness of R-Sparse R-CNN as a robust framework for oriented ship detection in SAR imagery. The code is available at: www.github.com/ka-mirul/R-Sparse-R-CNN.
Authors: Ihsan Ullah, Alfredo Petrosino
Abstract: A convolutional neural network (CNN) slides a kernel over the whole image to produce an output map. This kernel scheme reduces the number of parameters with respect to a fully connected neural network (NN). While CNNs have proven effective for recognizing handwritten characters, traffic signs, and similar inputs, their deep variants have recently proven effective in similar as well as more challenging applications such as object, scene, and action recognition. Deep CNNs add more layers and kernels to the classical CNN, increasing the number of parameters and partly eroding the main advantage of CNNs, which is having fewer parameters. In this paper, a 3D pyramidal neural network called 3DPyraNet and a discriminative approach for spatio-temporal feature learning based on it, called 3DPyraNet-F, are proposed. 3DPyraNet introduces a new weighting scheme that learns features from both the spatial and temporal dimensions, analyzing multiple adjacent frames while keeping a biologically plausible structure. It preserves the spatial topology of the input image and has fewer parameters and lower computational and memory costs compared to both fully connected NNs and recent deep CNNs. 3DPyraNet-F extracts the feature maps of the highest layer of the learned network, fuses them into a single vector, and provides it as input to a linear-SVM classifier, which enhances the recognition of human actions and dynamic scenes from videos. Encouraging results are reported with 3DPyraNet in real-world environments, especially in the presence of camera-induced motion. Further, 3DPyraNet-F clearly outperforms the state-of-the-art on three benchmark datasets and shows comparable results on the fourth.
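A small sketch of the 3DPyraNet-F recipe as described: flatten the highest-layer feature maps into one vector per clip and train a linear SVM on top; the feature shapes and class count are placeholders standing in for the learned network's outputs.

```python
# Hedged sketch of the fuse-then-SVM step; the feature maps here are random
# placeholders where the actual maps would come from the trained 3DPyraNet.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_clips, n_maps, h, w = 200, 16, 8, 8
feature_maps = rng.normal(size=(n_clips, n_maps, h, w))   # highest-layer maps per clip
labels = rng.integers(0, 5, size=n_clips)                  # placeholder action classes

fused = feature_maps.reshape(n_clips, -1)                  # fuse maps into one vector
clf = LinearSVC(C=1.0, max_iter=5000).fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))
```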
Authors: Xuyin Qi, Zeyu Zhang, Canxuan Gang, Hao Zhang, Lei Zhang, Zhiwei Zhang, Yang Zhao
Abstract: Data augmentation is essential in medical imaging for improving classification accuracy, lesion detection, and organ segmentation under limited data conditions. However, two significant challenges remain. First, a pronounced domain gap between natural photographs and medical images can distort critical disease features. Second, augmentation studies in medical imaging are fragmented and limited to single tasks or architectures, leaving the benefits of advanced mix-based strategies unclear. To address these challenges, we propose a unified evaluation framework with six mix-based augmentation methods integrated with both convolutional and transformer backbones on brain tumour MRI and eye disease fundus datasets. Our contributions are threefold. (1) We introduce MediAug, a comprehensive and reproducible benchmark for advanced data augmentation in medical imaging. (2) We systematically evaluate MixUp, YOCO, CropMix, CutMix, AugMix, and SnapMix with ResNet-50 and ViT-B backbones. (3) We demonstrate through extensive experiments that MixUp yields the greatest improvement on the brain tumor classification task for ResNet-50 with 79.19% accuracy and SnapMix yields the greatest improvement for ViT-B with 99.44% accuracy, and that YOCO yields the greatest improvement on the eye disease classification task for ResNet-50 with 91.60% accuracy and CutMix yields the greatest improvement for ViT-B with 97.94% accuracy. Code will be available at https://github.com/AIGeeksGroup/MediAug.
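For concreteness, a minimal MixUp sketch, one of the six evaluated strategies; the Beta parameter, batch size, and label dimensionality are illustrative defaults rather than the benchmark's exact settings.

```python
# Minimal MixUp sketch: blend image pairs and their one-hot labels with a
# Beta-sampled coefficient. Hyperparameters are illustrative defaults.
import torch

def mixup(images, one_hot_labels, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[perm]
    return mixed_images, mixed_labels

images = torch.rand(8, 3, 224, 224)                        # e.g., fundus images
labels = torch.eye(4)[torch.randint(0, 4, (8,))]           # 4 placeholder classes
mixed_images, mixed_labels = mixup(images, labels)
print(mixed_images.shape, mixed_labels.shape)
```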
Authors: Niaz Ahmad, Youngmoon Lee, Guanghui Wang
Abstract: We introduce VISUALCENT, a unified human pose and instance segmentation framework to address generalizability and scalability limitations to multi person visual human analysis. VISUALCENT leverages centroid based bottom up keypoint detection paradigm and uses Keypoint Heatmap incorporating Disk Representation and KeyCentroid to identify the optimal keypoint coordinates. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid called MaskCentroid to swiftly cluster pixels to specific human instance during rapid changes in human body movement or significantly occluded environment. Experimental results on COCO and OCHuman datasets demonstrate VISUALCENTs accuracy and real time performance advantages, outperforming existing methods in mAP scores and execution frame rate per second. The implementation is available on the project page.
Authors: Mohammad Mahdi Abootorabi, Omid Ghahroodi, Pardis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, Erfan Sadraiye, Kiarash Kiani Feriz, Mahdi Teymouri Nahad, Ali Moghadasi, Abolfazl Eshagh Abianeh, Nizi Nazar, Hamid R. Rabiee, Mahdieh Soleymani Baghshah, Meisam Ahmadi, Ehsaneddin Asgari
Abstract: Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: https://github.com/llm-lab-org/Generative-AI-for-Character-Animation-Survey.
Authors: Anyong Qin, Chaoqi Yuan, Qiang Li, Feng Yang, Tiecheng Song, Chenqiang Gao
Abstract: Convolutional neural networks (CNNs) are effective for hyperspectral image (HSI) classification, but their 3D convolutional structures introduce high computational costs and limited generalization in few-shot scenarios. Domain shifts caused by sensor differences and environmental variations further hinder cross-dataset adaptability. Metric-based few-shot learning (FSL) prototype networks mitigate this problem, yet their performance is sensitive to prototype quality, especially with limited samples. To overcome these challenges, a dual-branch residual network that integrates spatial and spectral features via parallel branches is proposed in this letter. Additionally, more robust refined prototypes are obtained through a regulation term. Furthermore, a kernel probability matching strategy aligns source and target domain features, alleviating domain shift. Experiments on four publicly available HSI datasets show that the proposed method achieves superior performance compared to other methods.
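A generic prototypical-network sketch showing how class prototypes are formed from support embeddings and how queries are scored by distance; the letter's prototype refinement (regulation term) and kernel probability matching are not reproduced here.

```python
# Generic prototypical-network sketch: class prototypes are support-set means of
# embeddings, and queries are classified by negative Euclidean distance.
import torch

def prototypes(support_emb, support_labels, n_classes):
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])

def classify(query_emb, protos):
    dists = torch.cdist(query_emb, protos)     # (n_query, n_classes)
    return (-dists).softmax(dim=1)             # class probabilities

support_emb = torch.randn(9 * 5, 64)           # 9-way, 5-shot placeholder embeddings
support_labels = torch.arange(9).repeat_interleave(5)
query_emb = torch.randn(20, 64)
probs = classify(query_emb, prototypes(support_emb, support_labels, 9))
print(probs.argmax(dim=1))
```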
Authors: Qiuhui Chen, Jintao Wang, Gang Wang, Yi Hong
Abstract: Accurate diagnosis of Alzheimer's disease (AD) requires effectively integrating multimodal data and clinical expertise. However, existing methods often struggle to fully utilize multimodal information and lack structured mechanisms to incorporate dynamic domain knowledge. To address these limitations, we propose HoloDx, a knowledge- and data-driven framework that enhances AD diagnosis by aligning domain knowledge with multimodal clinical data. HoloDx incorporates a knowledge injection module with a knowledge-aware gated cross-attention, allowing the model to dynamically integrate domain-specific insights from both large language models (LLMs) and clinical expertise. Also, a memory injection module with a designed prototypical memory attention enables the model to retain and retrieve subject-specific information, ensuring consistency in decision-making. By jointly leveraging these mechanisms, HoloDx enhances interpretability, improves robustness, and effectively aligns prior knowledge with current subject data. Evaluations on five AD datasets demonstrate that HoloDx outperforms state-of-the-art methods, achieving superior diagnostic accuracy and strong generalization across diverse cohorts. The source code will be released upon publication acceptance.
Authors: Mitchell Goff, Greg Hogan, George Hotz, Armand du Parc Locmaria, Kacper Raczy, Harald Sch\"afer, Adeeb Shihadeh, Weixing Zhang, Yassine Yousfi
Abstract: Most self-driving systems rely on hand-coded perception outputs and engineered driving rules. Learning directly from human driving data with an end-to-end method can allow for a training architecture that is simpler and scales well with compute and data. In this work, we propose an end-to-end training architecture that uses real driving data to train a driving policy in an on-policy simulator. We show two different methods of simulation, one with reprojective simulation and one with a learned world model. We show that both methods can be used to train a policy that learns driving behavior without any hand-coded driving rules. We evaluate the performance of these policies in a closed-loop simulation and when deployed in a real-world advanced driver-assistance system.
Authors: Zhenkai Qin, Jiaquan Liang, Qiao Fang
Abstract: Attention mechanisms have significantly advanced deep learning by enhancing feature representation through selective focus. However, existing approaches often independently model channel importance and spatial saliency, overlooking their inherent interdependence and limiting their effectiveness. To address this limitation, we propose MIA-Mind, a lightweight and modular Multidimensional Interactive Attention Mechanism, built upon the MindSpore framework. MIA-Mind jointly models spatial and channel features through a unified cross-attentive fusion strategy, enabling fine-grained feature recalibration with minimal computational overhead. Extensive experiments are conducted on three representative datasets: on CIFAR-10, MIA-Mind achieves an accuracy of 82.9\%; on ISBI2012, it achieves an accuracy of 78.7\%; and on CIC-IDS2017, it achieves an accuracy of 91.9\%. These results validate the versatility, lightweight design, and generalization ability of MIA-Mind across heterogeneous tasks. Future work will explore the extension of MIA-Mind to large-scale datasets, the development of adaptive attention fusion strategies, and distributed deployment to further enhance scalability and robustness.
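A rough interpretation of joint channel-spatial recalibration, written in PyTorch for brevity even though MIA-Mind is built on MindSpore; the layer sizes, module name, and multiplicative fusion are assumptions, not the paper's exact cross-attentive design.

```python
# Sketch of jointly applying channel and spatial attention so the two dimensions
# interact, rather than being modeled independently.
import torch
import torch.nn as nn

class JointChannelSpatialAttention(nn.Module):     # hypothetical illustrative module
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention from globally pooled features.
        ch_att = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        # Spatial attention from channel-pooled feature maps.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        sp_att = torch.sigmoid(self.spatial_conv(pooled))
        return x * ch_att * sp_att                  # joint recalibration

x = torch.randn(2, 64, 32, 32)
print(JointChannelSpatialAttention(64)(x).shape)
```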
Authors: Xiaoran Xu, Jiangang Yang, Wenyue Chong, Wenhui Shi, Shichu Sun, Jing Xing, Jian Liu
Abstract: Single-Domain Generalized Object Detection~(S-DGOD) aims to train an object detector on a single source domain while generalizing well to diverse unseen target domains, making it suitable for multimedia applications that involve various domain shifts, such as intelligent video surveillance and VR/AR technologies. With the success of large-scale Vision-Language Models, recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains. However, the utilized knowledge remains at a coarse-grained level~(e.g., the textual description of adverse weather paired with the image) and serves as an implicit regularization for guidance, struggling to learn accurate region- and object-level features in varying domains. In this work, we propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks. The core of our method is the mechanism of Cross-modal and Region-aware Feature Interaction, which simultaneously learns both inter-modal and intra-modal regional invariance through dynamic interactions between fine-grained textual and visual features. Moreover, we design a simple but effective strategy called Cross-domain Proposal Refining and Mixing, which aligns the position of region proposals across multiple domains and diversifies them, enhancing the localization ability of detectors in unseen scenarios. Our method achieves new state-of-the-art results on S-DGOD benchmark datasets, with improvements of +8.8\%~mPC on Cityscapes-C and +7.9\%~mPC on DWD over baselines, demonstrating its efficacy.
Authors: Jiaqi Peng, Tai Wang, Jiangmiao Pang, Yuan Shen
Abstract: Although existing 3D perception algorithms have demonstrated significant improvements in performance, their deployment on edge devices continues to encounter critical challenges due to substantial runtime latency. We propose a new benchmark tailored for online evaluation by considering runtime latency. Based on the benchmark, we build a Latency-Aware 3D Streaming Perception (LASP) framework that addresses the latency issue through two primary components: 1) latency-aware history integration, which extends query propagation into a continuous process, ensuring the integration of historical features regardless of varying latency; 2) latency-aware predictive detection, a module that compensates the detection results with the predicted trajectory and the posterior accessed latency. By incorporating the latency-aware mechanism, our method shows generalization across various latency levels, achieving an online performance that closely aligns with 80\% of its offline evaluation on the Jetson AGX Orin without any acceleration techniques.
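A minimal sketch of the latency-compensation idea behind latency-aware predictive detection: shift each predicted box center along its estimated velocity by the measured latency; the box and velocity formats are assumptions for illustration.

```python
# Simple latency-compensation sketch: advance detections along estimated
# velocities by the elapsed latency before reporting them.
import numpy as np

def compensate_detections(centers_xyz, velocities_xyz, latency_s):
    """centers_xyz: (N, 3) box centers in meters; velocities_xyz: (N, 3) in m/s."""
    return centers_xyz + velocities_xyz * latency_s

centers = np.array([[12.0, 3.5, 0.8], [30.2, -1.0, 0.9]])
velocities = np.array([[5.0, 0.0, 0.0], [-2.0, 1.0, 0.0]])
print(compensate_detections(centers, velocities, latency_s=0.15))
```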
Authors: Zhongxuan Li
Abstract: Blind source separation (BSS) is a key technique in array processing and data analysis, aiming to recover unknown sources from observed mixtures without knowledge of the mixing matrix. Classical independent component analysis (ICA) methods rely on the assumption that sources are mutually independent. To address limitations of ICA, sparsity-based methods have been introduced, which decompose source signals sparsely in a predefined dictionary. Morphological Component Analysis (MCA), based on sparse representation theory, assumes that a signal is a linear combination of components with distinct geometries, each sparsely representable in one dictionary and not in others. This approach has recently been applied to BSS with promising results. This report reviews key approaches derived from classical ICA and explores sparsity-based methods for BSS. It introduces the theory of sparse representation and decomposition, followed by a block coordinate relaxation MCA algorithm, whose variants are used in Multichannel MCA (MMCA) and Generalized MCA (GMCA). A local dictionary learning method using K-SVD is then presented. Finally, we propose an improved algorithm, SAC+BK-SVD, which enhances K-SVD by learning a block-sparsifying dictionary that clusters and updates similar atoms in blocks. The implementation includes experiments on image segmentation and blind image source separation using the discussed techniques. We also compare the proposed block-sparse dictionary learning algorithm with K-SVD. Simulation results demonstrate that our method yields improved blind image separation quality.
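A toy block-coordinate-relaxation MCA sketch on a 1D signal that mixes a smooth oscillation (sparse in the DCT) with a few spikes (sparse in the identity dictionary); the dictionaries, signal, and threshold schedule are illustrative choices, not the report's exact setup.

```python
# Toy MCA via block coordinate relaxation: alternately update each morphological
# component by soft-thresholding the residual's coefficients in its dictionary,
# while decreasing the threshold.
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
n = 256
t = np.arange(n)
smooth = np.cos(2 * np.pi * 5 * t / n)          # component sparse in the DCT
spikes = np.zeros(n)
spikes[[40, 120, 200]] = 3.0                    # component sparse in the identity basis
y = smooth + spikes                             # observed mixture

def soft(x, thr):
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

x_dct, x_spk = np.zeros(n), np.zeros(n)
for thr in np.linspace(2.0, 0.05, 30):          # decreasing threshold schedule
    r = y - x_spk                               # update DCT-sparse component
    x_dct = idct(soft(dct(r, norm="ortho"), thr), norm="ortho")
    r = y - x_dct                               # update spike (identity) component
    x_spk = soft(r, thr)

print("smooth part error:", np.linalg.norm(x_dct - smooth))
print("spike part error:", np.linalg.norm(x_spk - spikes))
```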
Authors: Jialang Lu, Huayu Zhao, Huiyu Zhai, Xingxing Yang, Shini Han
Abstract: There has long been a belief that high-level semantics learning can benefit various downstream computer vision tasks. However, in the low-light image enhancement (LLIE) community, existing methods learn a brute-force mapping between low-light and normal-light domains without considering the semantic information of different regions, especially in those extremely dark regions that suffer from severe information loss. To address this issue, we propose a new deep semantic prior-guided framework (DeepSPG) based on Retinex image decomposition for LLIE to explore informative semantic knowledge via a pre-trained semantic segmentation model and multimodal learning. Notably, we incorporate both image-level semantic prior and text-level semantic prior and thus formulate a multimodal learning framework with combinatorial deep semantic prior guidance for LLIE. Specifically, we incorporate semantic knowledge to guide the enhancement process via three designs: an image-level semantic prior guidance by leveraging hierarchical semantic features from a pre-trained semantic segmentation model; a text-level semantic prior guidance by integrating natural language semantic constraints via a pre-trained vision-language model; a multi-scale semantic-aware structure that facilitates effective semantic feature incorporation. Eventually, our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets. The implementation details and code are publicly available at https://github.com/Wenyuzhy/DeepSPG.
Authors: Huiling Zheng, Xian Zhong, Bin Liu, Yi Xiao, Bihan Wen, Xiaofeng Li
Abstract: The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and the underutilization of spectral complementarity. Existing methods often fail to decouple shared structural features from modality-specific radiometric attributes, leading to feature conflicts and information loss. To address this issue, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-specific) components in the Fourier domain. Specifically, PAD consists of two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features through convolution-guided scaling to enhance geometric consistency, and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high-frequency details and low-frequency structures using frequency-adaptive multilayer perceptrons. This approach leverages SAR's sensitivity to morphological features and RGB's spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK datasets demonstrate state-of-the-art performance. Our work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at https://github.com/RanFeng2/PAD.
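A minimal sketch of the phase-amplitude split in the Fourier domain and a naive recombination (e.g., SAR phase with RGB-derived amplitude); the PSC and ASF modules themselves are not reproduced, and the single-band RGB proxy is an assumption.

```python
# Split an image into Fourier phase and amplitude, then recombine components
# from different modalities. This only illustrates the decoupling idea.
import torch

def split_phase_amplitude(x):
    spec = torch.fft.fft2(x, dim=(-2, -1))
    return torch.angle(spec), torch.abs(spec)

def recombine(phase, amplitude):
    spec = torch.polar(amplitude, phase)                   # amplitude * exp(i * phase)
    return torch.fft.ifft2(spec, dim=(-2, -1)).real

sar = torch.rand(1, 1, 64, 64)                             # placeholder SAR patch
rgb = torch.rand(1, 3, 64, 64).mean(dim=1, keepdim=True)   # single-band RGB proxy
phase_sar, _ = split_phase_amplitude(sar)
_, amp_rgb = split_phase_amplitude(rgb)
fused = recombine(phase_sar, amp_rgb)
print(fused.shape)
```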
Authors: Zheng Fang, Kangjun Liu, Ke Chen, Qingyu Liu, Jianguo Zhang, Lingyang Song, Yaowei Wang
Abstract: The task of radio map estimation aims to generate a dense representation of electromagnetic spectrum quantities, such as the received signal strength at each grid point within a geographic region, based on measurements from a subset of spatially distributed nodes (represented as pixels). Recently, deep vision models such as the U-Net have been adapted to radio map estimation, whose effectiveness can be guaranteed with sufficient spatial observations (typically 0.01% to 1% of pixels) in each map, to model local dependency of observed signal power. However, such a setting of sufficient measurements can be less practical in real-world scenarios, where extreme sparsity in spatial sampling can be widely encountered. To address this challenge, we propose RadioFormer, a novel multiple-granularity transformer designed to handle the constraints posed by spatially sparse observations. Our RadioFormer, through a dual-stream self-attention (DSA) module, can respectively discover the correlation of pixel-wise observed signal power and also learn patch-wise buildings' geometries at multiple granularities, which are integrated into multi-scale representations of radio maps by a cross-stream cross-attention (CCA) module. Extensive experiments on the public RadioMapSeer dataset demonstrate that RadioFormer outperforms state-of-the-art methods in radio map estimation while maintaining the lowest computational cost. Furthermore, the proposed approach exhibits exceptional generalization capabilities and robust zero-shot performance, underscoring its potential to advance radio map estimation in a more practical setting with very limited observation nodes.
Authors: Yuan Li, Ziqian Bai, Feitong Tan, Zhaopeng Cui, Sean Fanello, Yinda Zhang
Abstract: We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
Authors: Mi Zheng, Guanglei Yang, Zitong Huang, Zhenhua Guo, Kevin Han, Wangmeng Zuo
Abstract: With the emergence of transformer-based architectures and large language models (LLMs), the accuracy of road scene perception has substantially advanced. Nonetheless, current road scene segmentation approaches are predominantly trained on closed-set data, resulting in insufficient detection capabilities for out-of-distribution (OOD) objects. To overcome this limitation, road anomaly detection methods have been proposed. However, existing methods primarily depend on image inpainting and OOD distribution detection techniques, facing two critical issues: (1) inadequate consideration of the objectiveness attributes of anomalous regions, causing incomplete segmentation when anomalous objects share similarities with known classes, and (2) insufficient attention to environmental constraints, leading to the detection of anomalies irrelevant to autonomous driving tasks. In this paper, we propose a novel framework termed Segmenting Objectiveness and Task-Awareness (SOTA) for autonomous driving scenes. Specifically, SOTA enhances the segmentation of objectiveness through a Semantic Fusion Block (SFB) and filters anomalies irrelevant to road navigation tasks using a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA). Extensive empirical evaluations on multiple benchmark datasets, including Fishyscapes Lost and Found, Segment-Me-If-You-Can, and RoadAnomaly, demonstrate that the proposed SOTA consistently improves OOD detection performance across diverse detectors, achieving robust and accurate segmentation outcomes.
Authors: Zhangshuo Qi, Luqi Cheng, Zijie Zhou, Guangming Xiong
Abstract: In autonomous driving, place recognition is critical for global localization in GPS-denied environments. LiDAR and radar-based place recognition methods have garnered increasing attention, as LiDAR provides precise ranging, whereas radar excels in adverse weather resilience. However, effectively leveraging LiDAR-radar fusion for place recognition remains challenging. The noisy and sparse nature of radar data limits its potential to further improve recognition accuracy. In addition, heterogeneous radar configurations complicate the development of unified cross-modality fusion frameworks. In this paper, we propose LRFusionPR, which improves recognition accuracy and robustness by fusing LiDAR with either single-chip or scanning radar. Technically, a dual-branch network is proposed to fuse different modalities within the unified polar coordinate bird's eye view (BEV) representation. In the fusion branch, cross-attention is utilized to perform cross-modality feature interactions. The knowledge from the fusion branch is simultaneously transferred to the distillation branch, which takes radar as its only input to further improve the robustness. Ultimately, the descriptors from both branches are concatenated, producing the multimodal global descriptor for place retrieval. Extensive evaluations on multiple datasets demonstrate that our LRFusionPR achieves accurate place recognition, while maintaining robustness under varying weather conditions. Our open-source code will be released at https://github.com/QiZS-BIT/LRFusionPR.
Authors: Lingtao Peng, Liheng Bian
Abstract: Recently, learning-based Underwater Image Enhancement (UIE) methods have demonstrated promising performance. However, existing learning-based methods still face two challenges. 1) They rarely consider the inconsistent degradation levels in different spatial regions and spectral bands simultaneously. 2) They treat all regions equally, ignoring that the regions with high-frequency details are more difficult to reconstruct. To address these challenges, we propose a novel UIE method based on spatial-spectral dual-domain adaptive learning, termed SS-UIE. Specifically, we first introduce a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module, both with linear complexity, and combine them in parallel to form a basic Spatial-Spectral block (SS-block). Benefiting from the global receptive field of MCSS and SWSA, SS-block can effectively model the degradation levels of different spatial regions and spectral bands, thereby enabling degradation level-based dual-domain adaptive UIE. By stacking multiple SS-blocks, we build our SS-UIE network. Additionally, a Frequency-Wise Loss (FWL) is introduced to narrow the frequency-wise discrepancy and reinforce the model's attention on the regions with high-frequency details. Extensive experiments validate that the SS-UIE technique outperforms state-of-the-art UIE methods while requiring cheaper computational and memory costs.
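A hedged sketch of a frequency-wise loss that compares prediction and target amplitudes in the Fourier domain so high-frequency discrepancies are penalized explicitly; the exact weighting used by SS-UIE's FWL is not reproduced here.

```python
# Frequency-domain L1 sketch: penalize amplitude differences between the
# predicted and reference images.
import torch

def frequency_wise_loss(pred, target):
    pred_amp = torch.fft.rfft2(pred, dim=(-2, -1)).abs()
    target_amp = torch.fft.rfft2(target, dim=(-2, -1)).abs()
    return (pred_amp - target_amp).abs().mean()

pred = torch.rand(2, 3, 128, 128, requires_grad=True)   # placeholder enhanced images
target = torch.rand(2, 3, 128, 128)                     # placeholder references
loss = frequency_wise_loss(pred, target)
loss.backward()
print(loss.item())
```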
Authors: Yuming Zhao, Qijian Zhang, Junhui Hou, Jiazhi Xia, Wenping Wang, Ying He
Abstract: Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm. The code will be publicly available at https://github.com/AidenZhao/FlexPara
Authors: Tuan Nguyen, Naseem Khan, Issa Khalil
Abstract: The rapid evolution of deepfake technology, particularly in instruction-guided image editing, threatens the integrity of digital images by enabling subtle, context-aware manipulations. Generated conditionally from real images and textual prompts, these edits are often imperceptible to both humans and existing detection systems, revealing significant limitations in current defenses. We propose a novel multimodal capsule network, CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision. Evaluated on diverse datasets, including MagicBrush, Unsplash Edits, Open Images Edits, and Multi-turn Edits, CapsFake outperforms state-of-the-art methods by up to 20% in detection accuracy. Ablation studies validate its robustness, achieving detection rates above 94% under natural perturbations and 96% against adversarial attacks, with excellent generalization to unseen editing scenarios. This approach establishes a powerful framework for countering sophisticated image manipulations.
Authors: Alexander Baumann, Leonardo Ayala, Silvia Seidlitz, Jan Sellner, Alexander Studier-Fischer, Berkin \"Ozdemir, Lena Maier-Hein, Slobodan Ilic
Abstract: Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce $\textbf{CARL}$, a model for $\textbf{C}$amera-$\textbf{A}$gnostic $\textbf{R}$epresentation $\textbf{L}$earning across RGB, multispectral, and hyperspectral imaging modalities. To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic embedding, we introduce wavelength positional encoding and a self-attention-cross-attention mechanism to compress spectral information into learned query representations. Spectral-spatial pre-training is achieved with a novel spectral self-supervised JEPA-inspired strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model's unique robustness to spectral heterogeneity, outperforming existing approaches on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
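The wavelength positional encoding and query-based compression can be illustrated with a small sketch. Everything here (the `SpectralCompressor` name, sinusoidal encoding of band-center wavelengths, query count, and dimensions) is an assumption for illustration, not CARL's actual architecture.

```python
# Illustrative sketch (assumptions throughout) of compressing a spectral signal
# with an arbitrary channel count into a fixed-size embedding: each channel is
# tagged with a sinusoidal encoding of its wavelength, and learned queries
# attend over the channels via cross-attention.
import math
import torch
import torch.nn as nn

def wavelength_encoding(wavelengths_nm, dim):
    # wavelengths_nm: (C,) tensor of band-center wavelengths in nanometres
    pos = wavelengths_nm.unsqueeze(1)                      # (C, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    div = torch.exp(-math.log(10000.0) * i / dim)
    enc = torch.zeros(len(wavelengths_nm), dim)
    enc[:, 0::2] = torch.sin(pos * div)
    enc[:, 1::2] = torch.cos(pos * div)
    return enc                                             # (C, dim)

class SpectralCompressor(nn.Module):
    def __init__(self, dim=128, num_queries=16, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.proj = nn.Linear(1, dim)                      # lift scalar band values
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pixel_spectrum, wavelengths_nm):
        # pixel_spectrum: (B, C) per-pixel reflectances; C varies per camera
        tokens = self.proj(pixel_spectrum.unsqueeze(-1))   # (B, C, dim)
        tokens = tokens + wavelength_encoding(wavelengths_nm, tokens.shape[-1])
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        out, _ = self.cross(q, tokens, tokens)             # (B, num_queries, dim)
        return out

model = SpectralCompressor()
emb = model(torch.rand(4, 31), torch.linspace(400.0, 700.0, 31))
print(emb.shape)  # torch.Size([4, 16, 128]) regardless of the channel count
```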
Authors: Shalini Maiti, Lourdes Agapito, Benjamin Graham
Abstract: For non-rigid objects, predicting the 3D shape from 2D keypoint observations is ill-posed due to occlusions, and the need to disentangle changes in viewpoint and changes in shape. This challenge has often been addressed by embedding low-rank constraints into specialized models. These models can be hard to train, as they depend on finding a canonical way of aligning observations, before they can learn detailed geometry. These constraints have limited the reconstruction quality. We show that generic, high capacity models, trained with an unsupervised loss, allow for more accurate predicted shapes. In particular, applying low-rank constraints to localized subsets of the full shape allows the high capacity to be suitably constrained. We reduce the state-of-the-art reconstruction error on the S-Up3D dataset by over 70%.
Authors: De Cheng, Lingfeng He, Nannan Wang, Dingwen Zhang, Xinbo Gao
Abstract: Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This oversight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up an optimization objective for the specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is carried out to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. The optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side-effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to mine reliable positive sample sets for the global and part features dynamically and optimize the inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves performance superior to state-of-the-art methods. Our code is available at https://github.com/FranklinLingfeng/code-for-SALCR.
Authors: Loc Phuc Truong Nguyen, Hung Truong Thanh Nguyen, Hung Cao
Abstract: Explainable Artificial Intelligence (XAI) techniques for interpreting object detection models remain in an early stage, with no established standards for systematic evaluation. This absence of consensus hinders both the comparative analysis of methods and the informed selection of suitable approaches. To address this gap, we introduce the Object Detection Explainable AI Evaluation (ODExAI), a comprehensive framework designed to assess XAI methods in object detection based on three core dimensions: localization accuracy, faithfulness to model behavior, and computational complexity. We benchmark a set of XAI methods across two widely used object detectors (YOLOX and Faster R-CNN) and standard datasets (MS-COCO and PASCAL VOC). Empirical results demonstrate that region-based methods (e.g., D-CLOSE) achieve strong localization (PG = 88.49%) and high model faithfulness (OA = 0.863), though with substantial computational overhead (Time = 71.42s). On the other hand, CAM-based methods (e.g., G-CAME) achieve superior localization (PG = 96.13%) and significantly lower runtime (Time = 0.54s), but at the expense of reduced faithfulness (OA = 0.549). These findings demonstrate critical trade-offs among existing XAI approaches and reinforce the need for task-specific evaluation when deploying them in object detection pipelines. Our implementation and evaluation benchmarks are publicly available at: https://github.com/Analytics-Everywhere-Lab/odexai.
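Assuming the reported PG metric is the standard Pointing Game (a hit when the saliency maximum lands inside the ground-truth box of the detected object), it can be computed as in the toy sketch below; this interpretation is an assumption, not confirmed by the abstract.

```python
# Minimal sketch of the Pointing Game metric, assumed to be the "PG" above:
# an explanation counts as a hit if the saliency map's maximum lies inside
# the ground-truth box of the detected object.
import numpy as np

def pointing_game_hit(saliency, gt_box):
    # saliency: (H, W) attribution map; gt_box: (x1, y1, x2, y2) in pixels
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

hits = [pointing_game_hit(np.random.rand(480, 640), (100, 120, 300, 360))]
pg_score = 100.0 * sum(hits) / len(hits)
```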
Authors: Songsong Xiong, Hamidreza Kasaei
Abstract: In human-centered environments such as restaurants, homes, and warehouses, robots often face challenges in accurately recognizing 3D objects. These challenges stem from the complexity and variability of these environments, including diverse object shapes. In this paper, we propose a novel Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to enhance 3D object recognition in robotic applications. Our approach leverages the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate multi-views efficiently. The LM-MCVT architecture incorporates pre- and mid-level convolutional encoders and local and global transformers to enhance feature extraction and recognition accuracy. We evaluate our method on the synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using a four-view setup, surpassing existing state-of-the-art methods. To further validate its effectiveness, we conduct 5-fold cross-validation on the real-world OmniObject3D dataset using the same configuration. Results consistently show superior performance, demonstrating the method's robustness in 3D object recognition across synthetic and real-world 3D data.
Authors: Shuhao Kang, Martin Y. Liao, Yan Xia, Olaf Wysocki, Boris Jutzi, Daniel Cremers
Abstract: LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components: a cross-modal visibility mask that identifies maximal observable regions from both modalities to guide feature learning, and an adaptive radial fusion module that dynamically consolidates multiscale radial features into discriminative global descriptors. Extensive experiments on the augmented KITTI and KITTI-360 datasets demonstrate OPAL's superiority, achieving 15.98% higher recall at the 1 m threshold for top-1 retrieved matches while operating at 12x faster inference speed compared to state-of-the-art approaches. Code and datasets are publicly available at: https://github.com/WHU-USI3DV/OPAL.
Authors: Xiaofeng Jin, Yan Fang, Matteo Frosi, Jianfei Ge, Jiangjian Xiao, Matteo Matteucci
Abstract: Scene view synthesis, which generates novel views from limited perspectives, is increasingly vital for applications like virtual reality, augmented reality, and robotics. Unlike object-based tasks, such as generating 360{\deg} views of a car, scene view synthesis handles entire environments where non-uniform observations pose unique challenges for stable rendering quality. To address this issue, we propose a novel approach: renderability field-guided Gaussian splatting (RF-GS). This method quantifies input inhomogeneity through a renderability field, guiding pseudo-view sampling to enhance visual consistency. To ensure the quality of wide-baseline pseudo-views, we train an image restoration model to map point projections to visible-light styles. Additionally, our validated hybrid data optimization strategy effectively fuses information from pseudo-view angles and source-view textures. Comparative experiments on simulated and real-world data show that our method outperforms existing approaches in rendering stability.
Authors: Xiaofeng Jin, Matteo Frosi, Matteo Matteucci
Abstract: Real-time open-vocabulary scene understanding is essential for efficient 3D perception in applications such as vision-language navigation, embodied intelligence, and augmented reality. However, existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. To address these issues, we present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system. Our approach refines 3D point clouds by fusing confidence maps from foundational models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework that integrates object attributes with environmental context for precise query responses. Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.
Authors: Chamin Hewa Koneputugodage, Yizhak Ben-Shabat, Sameera Ramasinghe, Stephen Gould
Abstract: Implicit Neural Representations (INRs) are a versatile and powerful tool for encoding various forms of data, including images, videos, sound, and 3D shapes. A critical factor in the success of INRs is the initialization of the network, which can significantly impact the convergence and accuracy of the learned model. Unfortunately, commonly used neural network initializations are not widely applicable for many activation functions, especially those used by INRs. In this paper, we improve upon previous initialization methods by deriving an initialization that has stable variance across layers, and applies to any activation function. We show that this generalizes many previous initialization methods, and has even better stability for well studied activations. We also show that our initialization leads to improved results with INR activation functions in multiple signal modalities. Our approach is particularly effective for Gaussian INRs, where we demonstrate that the theory of our initialization matches with task performance in multiple experiments, allowing us to achieve improvements in image, audio, and 3D surface reconstruction.
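As a rough numerical analogue of the idea (not the paper's derivation), one can estimate the second moment of an arbitrary activation under a unit-Gaussian pre-activation and scale the weights so that pre-activation variance is approximately preserved from layer to layer. The sketch below illustrates this for a Gaussian activation; the helper names and sampling scheme are assumptions.

```python
# Illustrative numerical analogue (not the paper's derivation): estimate the
# second moment of an arbitrary activation under unit-Gaussian pre-activations
# and rescale fan-in-normalized weights so signal variance stays roughly
# constant across layers.
import torch
import torch.nn as nn

def empirical_gain(activation, samples=100_000):
    z = torch.randn(samples)
    return (1.0 / activation(z).pow(2).mean().sqrt()).item()

def init_stable(linear: nn.Linear, activation):
    fan_in = linear.in_features
    std = empirical_gain(activation) / fan_in ** 0.5
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    nn.init.zeros_(linear.bias)

# Example with a Gaussian activation, as used by Gaussian INRs.
gaussian = lambda x: torch.exp(-x.pow(2))
layer = nn.Linear(256, 256)
init_stable(layer, gaussian)
z = torch.randn(4096, 256)                               # unit-variance pre-activations
print(z.var().item(), layer(gaussian(z)).var().item())   # both roughly 1
```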
Authors: Athul M. Mathew, Arshad Ali Khan, Thariq Khalid, Faroq AL-Tam, Riad Souissi
Abstract: Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency map, which highlights the most salient (\textit{attention-grabbing}) regions in the image for the subject under consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including an ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach for GTD.
Authors: Vita V. Vlasova, Vladimir G. Kuzmin, Maria S. Varetsa, Natalia A. Ibragimova, Oleg Y. Rogov, Elena V. Lyapuntsova
Abstract: Hyperspectral image (HSI) classification presents significant challenges due to the high dimensionality, spectral redundancy, and limited labeled data typically available in real-world applications. To address these issues and optimize classification performance, we propose a novel band selection strategy known as Iterative Wavelet-based Gradient Sampling (IWGS). This method incrementally selects the most informative spectral bands by analyzing gradients within the wavelet-transformed domain, enabling efficient and targeted dimensionality reduction. Unlike traditional selection methods, IWGS leverages the multi-resolution properties of wavelets to better capture subtle spectral variations relevant for classification. The iterative nature of the approach ensures that redundant or noisy bands are systematically excluded while maximizing the retention of discriminative features. We conduct comprehensive experiments on two widely-used benchmark HSI datasets: Houston 2013 and Indian Pines. Results demonstrate that IWGS consistently outperforms state-of-the-art band selection and classification techniques in terms of both accuracy and computational efficiency. These improvements make our method especially suitable for deployment in edge devices or other resource-constrained environments, where memory and processing power are limited. In particular, IWGS achieved an overall accuracy up to 97.8% on Indian Pines for selected classes, confirming its effectiveness and generalizability across different HSI scenarios.
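A heavily simplified illustration of the general idea (scoring bands by gradient-like energy in the wavelet domain and selecting them iteratively while suppressing redundancy) is sketched below. The scoring, redundancy penalty, and stopping rule are assumptions and do not reproduce the authors' IWGS procedure.

```python
# Simplified, assumption-laden illustration of wavelet-based band scoring (not
# the authors' exact IWGS): each band is scored by the energy of its wavelet
# detail coefficients, and bands are picked greedily while penalizing
# correlation with bands already selected.
import numpy as np
import pywt

def select_bands(cube, k=10, wavelet="haar", penalty=0.5):
    # cube: (H, W, B) hyperspectral image
    H, W, B = cube.shape
    scores = np.empty(B)
    for b in range(B):
        _, (cH, cV, cD) = pywt.dwt2(cube[:, :, b], wavelet)
        scores[b] = np.mean(cH**2 + cV**2 + cD**2)   # detail (gradient-like) energy
    flat = cube.reshape(-1, B)
    corr = np.abs(np.corrcoef(flat, rowvar=False))   # (B, B) band correlations
    selected = []
    for _ in range(k):
        adjusted = scores.copy()
        if selected:
            adjusted -= penalty * scores * corr[:, selected].max(axis=1)
        adjusted[selected] = -np.inf                 # never re-pick a band
        selected.append(int(np.argmax(adjusted)))
    return selected

bands = select_bands(np.random.rand(64, 64, 200), k=10)
```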
Authors: Alexandra Malyugina, Guoxi Huang, Eduardo Ruiz, Benjamin Leslie, Nantheera Anantrasirichai
Abstract: Underwater videos often suffer from degraded quality due to light absorption, scattering, and various noise sources. Among these, marine snow, suspended organic particles that appear as bright spots or noise, significantly impacts machine vision tasks, particularly those involving feature matching. Existing methods for removing marine snow are ineffective due to the lack of paired training data. To address this challenge, this paper proposes a novel enhancement framework that introduces a new approach for generating paired datasets from raw underwater videos. The resulting dataset consists of paired frames from generated snowy and snow-free underwater videos, enabling supervised training for video enhancement. We describe the dataset creation process, highlight its key characteristics, and demonstrate its effectiveness in enhancing underwater image restoration in the absence of ground truth.
Authors: Kangbiao Shi, Yixu Feng, Tao Hu, Yu Cao, Peng Wu, Yijin Liang, Yanning Zhang, Qingsen Yan
Abstract: The advent of Deep Neural Networks (DNNs) has driven remarkable progress in low-light image enhancement (LLIE), with diverse architectures (e.g., CNNs and Transformers) and color spaces (e.g., sRGB, HSV, HVI) yielding impressive results. Recent efforts have sought to leverage the complementary strengths of these paradigms, offering promising solutions to enhance performance across varying degradation scenarios. However, existing fusion strategies are hindered by challenges such as parameter explosion, optimization instability, and feature misalignment, limiting further improvements. To overcome these issues, we introduce FusionNet, a novel multi-model linear fusion framework that operates in parallel to effectively capture global and local features across diverse color spaces. By incorporating a linear fusion strategy underpinned by Hilbert space theoretical guarantees, FusionNet mitigates network collapse and reduces excessive training costs. Our method achieved 1st place in the CVPR2025 NTIRE Low Light Enhancement Challenge. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods in terms of both quantitative and qualitative results, delivering robust enhancement under diverse low-light conditions.
Authors: Ni Yao, Xiangyu Liu, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chengyang Li, Fubao Zhu, Weihua Zhou, Chen Zhao
Abstract: Coronary artery disease (CAD) remains a leading cause of mortality worldwide, requiring accurate segmentation and stenosis detection using Coronary Computed Tomography Angiography (CCTA). Existing methods struggle with challenges such as low contrast, morphological variability and small vessel segmentation. To address these limitations, we propose the Myocardial Region-guided Feature Aggregation Net, a novel U-shaped dual-encoder architecture that integrates anatomical prior knowledge to enhance robustness in coronary artery segmentation. Our framework incorporates three key innovations: (1) a Myocardial Region-guided Module that directs attention to coronary regions via myocardial contour expansion and multi-scale feature fusion, (2) a Residual Feature Extraction Encoding Module that combines parallel spatial channel attention with residual blocks to enhance local-global feature discrimination, and (3) a Multi-scale Feature Fusion Module for adaptive aggregation of hierarchical vascular features. Additionally, Monte Carlo dropout quantifies prediction uncertainty, supporting clinical interpretability. For stenosis detection, a morphology-based centerline extraction algorithm separates the vascular tree into anatomical branches, enabling cross-sectional area quantification and stenosis grading. The superiority of MGFA-Net was demonstrated by achieving a Dice score of 85.04%, an accuracy of 84.24%, an HD95 of 6.1294 mm, and an improvement of 5.46% in true positive rate for stenosis detection compared to 3D U-Net. The integrated segmentation-to-stenosis pipeline provides automated, clinically interpretable CAD assessment, bridging deep learning with anatomical prior knowledge for precision medicine. Our code is publicly available at http://github.com/chenzhao2023/MGFA_CCTA
Authors: Moulik Choraria, Xinbo Wu, Akhil Bhimaraju, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav R. Varshney
Abstract: The hyperscaling of data and parameter count in Transformer-based models is yielding diminishing performance improvement, especially when weighed against training costs. Such plateauing indicates the importance of methods for more efficient finetuning and inference, while retaining similar performance. This is especially relevant for multimodal learning paradigms, where inference costs of processing multimodal tokens can determine the model's practical viability. At the same time, research on representations and mechanistic interpretability has improved our understanding of the inner workings of Transformer-based models; one such line of work reveals an implicit alignment in the deeper layers of pretrained models, across modalities. Taking inspiration from this, we motivate and propose a simple modification to existing multimodal frameworks that rely on aligning pretrained models. We demonstrate that our approach maintains and, in some cases, even improves performance of baseline methods while achieving significant gains in both training and inference-time compute. Our work also has implications for combining pretrained models into larger systems efficiently.
Authors: Sidharth Rai, Aryan Dalal, Riley Slichter, Ajay Sharda
Abstract: Effective seed sowing in precision agriculture is hindered by challenges such as residue accumulation, low soil temperatures, and hair pinning (crop residue pushed into the trench by the furrow opener), which obstruct optimal trench formation. Row cleaners are employed to mitigate these issues, but there is a lack of quantitative methods to assess trench cleanliness. In this study, a novel computer vision-based method was developed to evaluate row cleaner performance. Multiple air seeders were equipped with a video acquisition system to capture trench conditions after row cleaner operation, enabling an effective comparison of the performance of each row cleaner. The captured data were used to develop a segmentation model that analyzed key elements such as soil, straw, and machinery. Using the results from the segmentation model, an objective method was developed to quantify row cleaner performance. The results demonstrated the potential of this method to improve row cleaner selection and enhance seeding efficiency in precision agriculture.
Authors: Rayson Laroca, Marcelo dos Santos, David Menotti
Abstract: Detecting small drones, often indistinguishable from birds, is crucial for modern surveillance. This work introduces a drone detection methodology built upon the medium-sized YOLOv11 object detection model. To enhance its performance on small targets, we implemented a multi-scale approach in which the input image is processed both as a whole and in segmented parts, with subsequent prediction aggregation. We also utilized a copy-paste data augmentation technique to enrich the training dataset with diverse drone and bird examples. Finally, we implemented a post-processing technique that leverages frame-to-frame consistency to mitigate missed detections. The proposed approach attained a top-3 ranking in the 8th WOSDETC Drone-vs-Bird Detection Grand Challenge, held at the 2025 International Joint Conference on Neural Networks (IJCNN), showcasing its capability to detect drones in complex environments effectively.
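The whole-image-plus-tiles strategy with prediction aggregation can be sketched as follows. The `detector` call is a placeholder standing in for the trained model (its return signature is assumed), and the tile size, overlap, and NMS threshold are assumed values.

```python
# Sketch of the multi-scale tiling idea (the detector call is a placeholder,
# not the authors' model): run detection on the full frame and on overlapping
# tiles, shift tile detections back to frame coordinates, then merge with NMS.
import torch
from torchvision.ops import nms

def detect_multiscale(frame, detector, tile=640, overlap=128, iou_thr=0.5):
    H, W = frame.shape[-2:]
    boxes, scores = [], []
    regions = [(0, 0, W, H)]                       # whole frame first
    step = tile - overlap
    for y0 in range(0, max(H - overlap, 1), step):
        for x0 in range(0, max(W - overlap, 1), step):
            regions.append((x0, y0, min(x0 + tile, W), min(y0 + tile, H)))
    for x0, y0, x1, y1 in regions:
        b, s = detector(frame[..., y0:y1, x0:x1])  # hypothetical detector API
        b = b + torch.tensor([x0, y0, x0, y0], dtype=b.dtype)
        boxes.append(b)
        scores.append(s)
    boxes, scores = torch.cat(boxes), torch.cat(scores)
    keep = nms(boxes, scores, iou_thr)             # aggregate overlapping predictions
    return boxes[keep], scores[keep]
```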
Authors: Jiahao Lu, Chong Yin, Silvia Ingala, Kenny Erleben, Michael Bachmann Nielsen, Sune Darkner
Abstract: Lung cancer, a leading cause of cancer-related deaths globally, emphasises the importance of early detection for better patient outcomes. Pulmonary nodules, often early indicators of lung cancer, necessitate accurate, timely diagnosis. Despite advances in Explainable Artificial Intelligence (XAI), many existing systems struggle to provide clear, comprehensive explanations, especially with limited labelled data. This study introduces MERA, a Multimodal and Multiscale self-Explanatory model designed for lung nodule diagnosis with considerably Reduced Annotation requirements. MERA integrates unsupervised and weakly supervised learning strategies (self-supervised learning techniques and a Vision Transformer architecture for unsupervised feature extraction) and a hierarchical prediction mechanism leveraging sparse annotations via semi-supervised active learning in the learned latent space. MERA explains its decisions on multiple levels: model-level global explanations via semantic latent space clustering, instance-level case-based explanations showing similar instances, local visual explanations via attention maps, and concept explanations using critical nodule attributes. Evaluations on the public LIDC dataset show MERA's superior diagnostic accuracy and self-explainability. With only 1% annotated samples, MERA achieves diagnostic accuracy comparable to or exceeding state-of-the-art methods requiring full annotation. The model's inherent design delivers comprehensive, robust, multilevel explanations aligned closely with clinical practice, enhancing trustworthiness and transparency. The demonstrated viability of unsupervised and weakly supervised learning lowers the barrier to deploying diagnostic AI in broader medical domains. Our complete code is openly available at: https://github.com/diku-dk/credanno.
Authors: Jean-R\'emy Conti, St\'ephan Cl\'emen\c{c}on
Abstract: The growing societal demand for fair AI systems has put pressure on the research community to develop predictive models that are not only globally accurate but also meet new fairness criteria, reflecting the lack of disparate mistreatment with respect to sensitive attributes ($\textit{e.g.}$ gender, ethnicity, age). In particular, the variability of the errors made by certain Facial Recognition (FR) systems across specific segments of the population compromises the deployment of the latter, and was judged unacceptable by regulatory authorities. Designing fair FR systems is a very challenging problem, mainly due to the complex and functional nature of the performance measure used in this domain ($\textit{i.e.}$ ROC curves) and because of the huge heterogeneity of the face image datasets usually available for training. In this paper, we propose a novel post-processing approach to improve the fairness of pre-trained FR models by optimizing a regression loss which acts on centroid-based scores. Beyond the computational advantages of the method, we present numerical experiments providing strong empirical evidence of the gain in fairness and of the ability to preserve global accuracy.
Authors: Jakub Zadro\.zny, Hakan Bilen
Abstract: We introduce HumMorph, a novel generalized approach to free-viewpoint rendering of dynamic human bodies with explicit pose control. HumMorph renders a human actor in any specified pose given a few observed views (starting from just one) in arbitrary poses. Our method enables fast inference as it relies only on feed-forward passes through the model. We first construct a coarse representation of the actor in the canonical T-pose, which combines visual features from individual partial observations and fills missing information using learned prior knowledge. The coarse representation is complemented by fine-grained pixel-aligned features extracted directly from the observed views, which provide high-resolution appearance information. We show that HumMorph is competitive with the state-of-the-art when only a single input view is available, however, we achieve results with significantly better visual quality given just 2 monocular observations. Moreover, previous generalized methods assume access to accurate body shape and pose parameters obtained using synchronized multi-camera setups. In contrast, we consider a more practical scenario where these body parameters are noisily estimated directly from the observed views. Our experimental results demonstrate that our architecture is more robust to errors in the noisy parameters and clearly outperforms the state of the art in this setting.
Authors: Shuo Wang, Weili Shi, Shuai Yang, Jiahao Cui, Qinwei Guo
Abstract: This paper presents a dynamic arthroscopic navigation system based on multi-level memory architecture for anterior cruciate ligament (ACL) reconstruction surgery. The system extends our previously proposed markerless navigation method from static image matching to dynamic video sequence tracking. By integrating the Atkinson-Shiffrin memory model's three-level architecture (sensory memory, working memory, and long-term memory), our system maintains continuous tracking of the femoral condyle throughout the surgical procedure, providing stable navigation support even in complex situations involving viewpoint changes, instrument occlusion, and tissue deformation. Unlike existing methods, our system operates in real-time on standard arthroscopic equipment without requiring additional tracking hardware, achieving 25.3 FPS with a latency of only 39.5 ms, representing a 3.5-fold improvement over our previous static system. For extended sequences (1000 frames), the dynamic system maintained an error of 5.3 ± 1.5 pixels, compared to the static system's 12.6 ± 3.7 pixels, an improvement of approximately 45%. For medium-length sequences (500 frames) and short sequences (100 frames), the system achieved approximately 35% and 19% accuracy improvements, respectively. Experimental results demonstrate the system overcomes limitations of traditional static matching methods, providing new technical support for improving surgical precision in ACL reconstruction.
Authors: Khoa Tuan Nguyen, Francesca Tozzi, Wouter Willaert, Joris Vankerschaver, Nikdokht Rashidian, Wesley De Neve
Abstract: While the availability of open 3D medical shape datasets is increasing, offering substantial benefits to the research community, we have found that many of these datasets are, unfortunately, disorganized and contain artifacts. These issues limit the development and training of robust models, particularly for accurate 3D reconstruction tasks. In this paper, we examine the current state of available 3D liver shape datasets and propose a solution using diffusion models combined with implicit neural representations (INRs) to augment and expand existing datasets. Our approach utilizes the generative capabilities of diffusion models to create realistic, diverse 3D liver shapes, capturing a wide range of anatomical variations and addressing the problem of data scarcity. Experimental results indicate that our method enhances dataset diversity, providing a scalable solution to improve the accuracy and reliability of 3D liver reconstruction and generation in medical applications. Finally, we suggest that diffusion models can also be applied to other downstream tasks in 3D medical imaging.
Authors: Sehyeong Jo, Gangjae Jang, Haesol Park
Abstract: The Vision Transformer (ViT) has made significant advancements in computer vision, utilizing self-attention mechanisms to achieve state-of-the-art performance across various tasks, including image classification, object detection, and segmentation. Its architectural flexibility and capabilities have made it a preferred choice among researchers and practitioners. However, the intricate multi-head attention mechanism of ViT presents significant challenges to interpretability, as the underlying prediction process remains opaque. A critical limitation arises from an observation commonly noted in transformer architectures: "Not all attention heads are equally meaningful." Overlooking the relative importance of specific heads highlights the limitations of existing interpretability methods. To address these challenges, we introduce Gradient-Driven Multi-Head Attention Rollout (GMAR), a novel method that quantifies the importance of each attention head using gradient-based scores. These scores are normalized to derive a weighted aggregate attention score, effectively capturing the relative contributions of individual heads. GMAR clarifies the role of each head in the prediction process, enabling more precise interpretability at the head level. Experimental results demonstrate that GMAR consistently outperforms traditional attention rollout techniques. This work provides a practical contribution to transformer-based architectures, establishing a robust framework for enhancing the interpretability of Vision Transformer models.
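The gradient-driven head weighting followed by attention rollout can be sketched roughly as below. The exact head-importance score and normalization used by GMAR are assumptions here; the code only illustrates the overall recipe.

```python
# Hedged sketch of gradient-weighted head aggregation followed by attention
# rollout. The specific weighting/normalization is an assumption.
import torch

def gmar_rollout(attentions, attn_grads):
    # attentions, attn_grads: lists of (B, heads, N, N) tensors, one per layer,
    # where attn_grads are d(loss)/d(attention) from a backward pass.
    B, _, N, _ = attentions[0].shape
    rollout = torch.eye(N).expand(B, N, N).clone()
    for attn, grad in zip(attentions, attn_grads):
        head_score = grad.abs().mean(dim=(-2, -1))          # (B, heads) importance
        w = head_score / head_score.sum(dim=-1, keepdim=True)
        fused = (w[..., None, None] * attn).sum(dim=1)      # weighted head average
        fused = fused + torch.eye(N)                        # residual connection
        fused = fused / fused.sum(dim=-1, keepdim=True)     # row-normalize
        rollout = torch.bmm(fused, rollout)                 # accumulate across layers
    return rollout  # (B, N, N) token-to-token relevance

layers = 4
attn = [torch.softmax(torch.randn(1, 8, 197, 197), dim=-1) for _ in range(layers)]
grads = [torch.randn(1, 8, 197, 197) for _ in range(layers)]
relevance = gmar_rollout(attn, grads)[:, 0, 1:]  # CLS-token relevance over patches
```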
Authors: Dehao Yuan, Cornelia Ferm\"uller
Abstract: This paper presents a real-time, asynchronous, event-based normal flow estimator. It follows the same algorithm as Learning Normal Flow Directly From Event Neighborhoods, but with a more optimized implementation. The original method treats event slices as 3D point clouds, encodes each event's local geometry into a fixed-length vector, and uses a multi-layer perceptron to predict normal flow. It constructs representations by multiplying an adjacency matrix with a feature matrix, resulting in quadratic time complexity with respect to the number of events. In contrast, we leverage the fact that event coordinates are integers and reformulate the representation step as a pooling operation. This achieves the same effect as the adjacency matrix but with much lower computational cost. As a result, our method supports real-time normal flow prediction on event cameras. Our estimator uses 1 GB of CUDA memory and runs at 4 million normal flows per second on an RTX 3070, or 6 million per second on an RTX A5000. We release the CUDA implementation along with a Python interface at https://github.com/dhyuan99/VecKM_flow_cpp.
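The quadratic-to-linear reformulation can be illustrated in a few lines: because event coordinates are integers, neighbor aggregation reduces to scattering features onto a pixel grid and box-filtering it, rather than forming an N x N adjacency matrix. The sketch below is a simplified NumPy analogue, not the released CUDA implementation, and the neighborhood shape and feature size are assumptions.

```python
# Sketch of the pooling idea described above: integer event coordinates let
# neighbor sums be computed by scattering features onto a grid and using an
# integral image, avoiding any explicit adjacency matrix.
import numpy as np

def neighborhood_sum(xs, ys, feats, H, W, radius=2):
    # xs, ys: (N,) integer event coordinates; feats: (N, D) per-event features
    N, D = feats.shape
    grid = np.zeros((H, W, D))
    np.add.at(grid, (ys, xs), feats)          # scatter (pool) features onto the grid
    # box filter = sum over a (2*radius+1)^2 neighborhood via a prefix sum
    pad = np.pad(grid, ((1, 0), (1, 0), (0, 0))).cumsum(0).cumsum(1)
    r = radius
    y0, y1 = np.clip(ys - r, 0, H - 1), np.clip(ys + r, 0, H - 1)
    x0, x1 = np.clip(xs - r, 0, W - 1), np.clip(xs + r, 0, W - 1)
    return (pad[y1 + 1, x1 + 1] - pad[y0, x1 + 1]
            - pad[y1 + 1, x0] + pad[y0, x0])  # (N, D) summed neighbor features

xs = np.random.randint(0, 640, 10_000)
ys = np.random.randint(0, 480, 10_000)
out = neighborhood_sum(xs, ys, np.random.rand(10_000, 8), H=480, W=640)
```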
Authors: Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, Yanfeng Gu
Abstract: Satellite imagery and maps, as two fundamental data modalities in remote sensing, offer direct observations of the Earth's surface and human-interpretable geographic abstractions, respectively. The task of bidirectional translation between satellite images and maps (BSMT) holds significant potential for applications in urban planning and disaster response. However, this task presents two major challenges: first, the absence of precise pixel-wise alignment between the two modalities substantially complicates the translation process; second, it requires achieving both high-level abstraction of geographic features and high-quality visual synthesis, which further elevates the technical complexity. To address these limitations, we introduce EarthMapper, a novel autoregressive framework for controllable bidirectional satellite-map translation. EarthMapper employs geographic coordinate embeddings to anchor generation, ensuring region-specific adaptability, and leverages multi-scale feature alignment within a geo-conditioned joint scale autoregression (GJSA) process to unify bidirectional translation in a single training cycle. A semantic infusion (SI) mechanism is introduced to enhance feature-level consistency, while a key point adaptive guidance (KPAG) mechanism is proposed to dynamically balance diversity and precision during inference. We further contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities, enabling robust benchmarking. Extensive experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance, achieving significant improvements in visual realism, semantic consistency, and structural fidelity over state-of-the-art methods. Additionally, EarthMapper excels in zero-shot tasks like in-painting, out-painting and coordinate-conditional generation, underscoring its versatility.
Authors: Yejin Jeong, Donghun Lee
Abstract: Knee osteoarthritis (KOA) is a common chronic musculoskeletal disorder worldwide, making early diagnosis crucial. Currently, the Kellgren and Lawrence (KL) grading system is widely used to assess KOA severity. However, its high inter-observer variability and subjectivity hinder diagnostic consistency. To address these limitations, automated diagnostic techniques using deep learning have been actively explored in recent years. In this study, we propose a CLIP-based framework (CLIP-KOA) to enhance the consistency and reliability of KOA grade prediction. To achieve this, we introduce a learning approach that integrates image and text information and incorporate Symmetry Loss and Consistency Loss to ensure prediction consistency between the original and flipped images. CLIP-KOA achieves state-of-the-art accuracy of 71.86\% on the KOA severity prediction task, and ablation studies show that CLIP-KOA achieves a 2.36\% improvement in accuracy over the standard CLIP model. This study shows a novel direction for data-driven medical prediction, not only improving the reliability of fine-grained diagnosis but also exploring multimodal methods for medical image analysis. Our code is available at https://github.com/anonymized-link.
Authors: Yuki Hirakawa, Ryotaro Shimizu
Abstract: Constructing datasets for fashion style recognition is challenging due to the inherent subjectivity and ambiguity of style concepts. Recent advances in text-to-image models have facilitated generative data augmentation by synthesizing images from labeled data, yet existing methods based solely on class names or reference captions often fail to balance visual diversity and style consistency. In this work, we propose \textbf{Masked Language Prompting (MLP)}, a novel prompting strategy that masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions. This approach preserves the structural semantics of the original caption while introducing attribute-level variations aligned with the intended style, enabling style-consistent and diverse image generation without fine-tuning. Experimental results on the FashionStyle14 dataset demonstrate that our MLP-based augmentation consistently outperforms class-name and caption-based baselines, validating its effectiveness for fashion style recognition under limited supervision.
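The masked-caption prompting step can be illustrated with a toy sketch; the word-selection policy, the mask token, and the `complete` call standing in for the LLM are all placeholders, not the paper's exact setup.

```python
# Toy sketch of the masked-caption idea: mask selected attribute words in a
# reference caption and ask a language model to fill them in. The masking
# policy and the `complete` call are placeholders.
import random

def masked_prompt(caption, maskable, n_masks=2, token="[MASK]"):
    words = caption.split()
    idx = [i for i, w in enumerate(words) if w.lower().strip(",.") in maskable]
    for i in random.sample(idx, min(n_masks, len(idx))):
        words[i] = token
    masked = " ".join(words)
    return (f"Fill every {token} in the caption with a plausible fashion "
            f"attribute, keeping the original style:\n{masked}")

prompt = masked_prompt(
    "a woman wearing a red floral dress with white sneakers",
    maskable={"red", "floral", "white", "sneakers", "dress"},
)
# completion = complete(prompt)   # hypothetical LLM call
```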
Authors: Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake Aaron Richards
Abstract: Robust tooling and publicly available pre-trained models have helped drive recent advances in mechanistic interpretability for language models. However, similar progress in vision mechanistic interpretability has been hindered by the lack of accessible frameworks and pre-trained weights. We present Prisma (Access the codebase here: https://github.com/Prisma-Multimodal/ViT-Prisma), an open-source framework designed to accelerate vision mechanistic interpretability research, providing a unified toolkit for accessing 75+ vision and video transformers; support for sparse autoencoder (SAE), transcoder, and crosscoder training; a suite of 80+ pre-trained SAE weights; activation caching, circuit analysis tools, and visualization tools; and educational resources. Our analysis reveals surprising findings, including that effective vision SAEs can exhibit substantially lower sparsity patterns than language SAEs, and that in some instances, SAE reconstructions can decrease model loss. Prisma enables new research directions for understanding vision model internals while lowering barriers to entry in this emerging field.
Authors: Weitao Feng, Hang Zhou, Jing Liao, Li Cheng, Wenbo Zhou
Abstract: We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.
Authors: Yan Wang, Baoxiong Jia, Ziyu Zhu, Siyuan Huang
Abstract: Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks. Project website: https://mpec-3d.github.io/
Authors: Xinyang Li, Chengjie Yi, Jiawei Lai, Mingbao Lin, Yansong Qu, Shengchuan Zhang, Liujuan Cao
Abstract: Image deocclusion (or amodal completion) aims to recover the invisible regions (\ie, shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at https://github.com/imlixinyang/SynergyAmodal.
Authors: Rong Gao, Xin Liu, Zhuozhao Hu, Bohao Xing, Baiqiang Xia, Zitong Yu, Heikki K\"alvi\"ainen
Abstract: Figure skating, known as the "Art on Ice," is among the most artistic sports, challenging to understand due to its blend of technical elements (like jumps and spins) and overall artistic expression. Existing figure skating datasets mainly focus on single tasks, such as action recognition or scoring, lacking comprehensive annotations for both technical and artistic evaluation. Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. FSAnno includes an open-access training and test dataset, alongside a benchmark dataset, FSBench, for fair model evaluation. FSBench consists of FSBench-Text, with multiple-choice questions and explanations, and FSBench-Motion, containing multimodal data and Question and Answer (QA) pairs, supporting tasks from technical analysis to performance commentary. Initial tests on FSBench reveal significant limitations in existing models' understanding of artistic sports. We hope FSBench will become a key tool for evaluating and enhancing model comprehension of figure skating.
Authors: Peijian Zeng, Feiyan Pang, Zhanbo Wang, Aimin Yang
Abstract: Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8% and 11.1% of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing. Code to reproduce the experiment is available at https://github.com/LilaKen/LR-IAD.
Authors: Guobiao Li, Lei Tan, Yuliang Xue, Gaozhi Liu, Zhenxing Qian, Sheng Li, Xinpeng Zhang
Abstract: Recent advances in digital watermarking make use of deep neural networks for message embedding and extraction. They typically follow the ``encoder-noise layer-decoder''-based architecture. By deliberately establishing a differentiable noise layer to simulate the distortion of the watermarked signal, they jointly train the deep encoder and decoder to fit the noise layer to guarantee robustness. As a result, they are usually weak against unknown distortions that are not used in their training pipeline. In this paper, we propose a novel watermarking framework to resist unknown distortions, namely Adversarial Shallow Watermarking (ASW). ASW utilizes only a shallow decoder that is randomly parameterized and designed to be insensitive to distortions for watermarking extraction. During the watermark embedding, ASW freezes the shallow decoder and adversarially optimizes a host image until its updated version (i.e., the watermarked image) stably triggers the shallow decoder to output the watermark message. During the watermark extraction, it accurately recovers the message from the watermarked image by leveraging the insensitive nature of the shallow decoder against arbitrary distortions. Our ASW is training-free, encoder-free, and noise layer-free. Experiments indicate that the watermarked images created by ASW have strong robustness against various unknown distortions. Compared to the existing ``encoder-noise layer-decoder'' approaches, ASW achieves comparable results on known distortions and better robustness on unknown distortions.
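The embedding loop can be sketched as follows: a shallow, randomly parameterized decoder is frozen, and only the host image (via an additive perturbation) is optimized until the decoder emits the target bits. The decoder architecture, message length, loss weighting, and iteration budget below are assumptions, not the paper's configuration.

```python
# Hedged sketch of the adversarial embedding loop: the randomly initialized
# shallow decoder is frozen; the image perturbation is optimized so the decoder
# outputs the target message. Architecture and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
decoder = nn.Sequential(                      # shallow, random, frozen decoder
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, 32),
)
for p in decoder.parameters():
    p.requires_grad_(False)

host = torch.rand(1, 3, 128, 128)
message = torch.randint(0, 2, (1, 32)).float()
delta = torch.zeros_like(host, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(300):
    watermarked = (host + delta).clamp(0, 1)
    logits = decoder(watermarked)
    loss = F.binary_cross_entropy_with_logits(logits, message) \
         + 0.1 * delta.pow(2).mean()          # keep the perturbation small
    opt.zero_grad()
    loss.backward()
    opt.step()

bits = (decoder((host + delta).clamp(0, 1)) > 0).float()
print((bits == message).float().mean().item())  # bit accuracy after embedding
```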
Authors: Zezeng Li, Zhihui Qi, Weimin Wang, Ziliang Wang, Junyi Duan, Na Lei
Abstract: Quad meshes are essential in geometric modeling and computational mechanics. Although learning-based methods for triangle meshes have demonstrated considerable advances, quad mesh generation remains less explored due to the challenge of ensuring coplanarity, convexity, and quad-only meshes. In this paper, we present Point2Quad, the first learning-based method for quad-only mesh generation from point clouds. The key idea is to learn to identify quad meshes from fused pointwise and facewise features. Specifically, Point2Quad begins with k-NN-based candidate generation that accounts for coplanarity and squareness. Two encoders then extract geometric and topological features that address the challenge of quad-related constraints, in particular by combining in-depth quadrilateral-specific characteristics. Subsequently, the extracted features are fused to train the classifier with a designed compound loss. The final results are derived after refinement by a quad-specific post-processing step. Extensive experiments on both clean and noisy data demonstrate the effectiveness and superiority of Point2Quad compared to baseline methods under comprehensive metrics.
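The coplanarity and squareness checks used during candidate generation can be pictured with a small sketch; the specific scores below (distance to a best-fit plane and worst corner-angle deviation from 90 degrees) are assumed for illustration rather than taken from the paper.

```python
# Small sketch (assumed scoring, not the authors' exact criteria) of checking a
# quad candidate formed by four points: coplanarity via distance to the
# best-fit plane, and squareness via corner-angle deviation from 90 degrees.
import numpy as np

def quad_scores(p):
    # p: (4, 3) corner coordinates, assumed ordered around the quad
    centered = p - p.mean(axis=0)
    _, _, vh = np.linalg.svd(centered)
    normal = vh[-1]                                     # best-fit plane normal
    coplanarity = np.abs(centered @ normal).max()       # max deviation from plane
    angles = []
    for i in range(4):
        a = p[(i - 1) % 4] - p[i]
        b = p[(i + 1) % 4] - p[i]
        cosang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.degrees(np.arccos(np.clip(cosang, -1, 1))))
    squareness = max(abs(ang - 90.0) for ang in angles)  # worst corner-angle error
    return coplanarity, squareness

print(quad_scores(np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0.02], [0, 1, 0]])))
```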
Authors: Tong Xiao, Qunming Wang, Ping Lu, Tenghai Huang, Xiaohua Tong, Peter M. Atkinson
Abstract: Accurate crowd detection (CD) is critical for public safety and historical pattern analysis, yet existing methods relying on ground and aerial imagery suffer from limited spatio-temporal coverage. The development of very-fine-resolution (VFR) satellite sensor imagery (e.g., ~0.3 m spatial resolution) provides unprecedented opportunities for large-scale crowd activity analysis, but it has never been considered for this task. To address this gap, we proposed CrowdSat-Net, a novel point-based convolutional neural network, which features two innovative components: Dual-Context Progressive Attention Network (DCPAN) to improve feature representation of individuals by aggregating scene context and local individual characteristics, and High-Frequency Guided Deformable Upsampler (HFGDU) that recovers high-frequency information during upsampling through frequency-domain guided deformable convolutions. To validate the effectiveness of CrowdSat-Net, we developed CrowdSat, the first VFR satellite imagery dataset designed specifically for CD tasks, comprising over 120k manually labeled individuals from multi-source satellite platforms (Beijing-3N, Jilin-1 Gaofen-04A and Google Earth) across China. In the experiments, CrowdSat-Net was compared with five state-of-the-art point-based CD methods (originally designed for ground or aerial imagery) using CrowdSat and achieved the highest F1-score of 66.12% and Precision of 73.23%, surpassing the second-best method by 1.71% and 2.42%, respectively. Moreover, extensive ablation experiments validated the importance of the DCPAN and HFGDU modules. Furthermore, cross-regional evaluation further demonstrated the spatial generalizability of CrowdSat-Net. This research advances CD capability by providing both a newly developed network architecture for CD and a pioneering benchmark dataset to facilitate future CD development.
Authors: Deng Li, Bohao Xing, Xin Liu, Baiqiang Xia, Bihan Wen, Heikki K\"alvi\"ainen
Abstract: Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce the De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and audio inputs. The DEEMO dataset consists of two subsets: DEEMO-NFBL, which includes rich annotations of Non-Facial Body Language (NFBL), and DEEMO-MER, an instruction dataset for Multimodal Emotion Recognition and Reasoning using identity-free cues. This design supports emotion understanding without compromising identity privacy. In addition, we propose DEEMO-LLaMA, a Multimodal Large Language Model (MLLM) that integrates de-identified audio, video, and textual information to enhance both emotion recognition and reasoning. Extensive experiments show that DEEMO-LLaMA achieves state-of-the-art performance on both tasks, outperforming existing MLLMs by a significant margin, achieving 74.49% accuracy and 74.45% F1-score in de-identity emotion recognition, and 6.20 clue overlap and 7.66 label overlap in de-identity emotion reasoning. Our work contributes to ethical AI by advancing privacy-preserving emotion understanding and promoting responsible affective computing.
Authors: Mohammad Altillawi, Fengyi Shen, Liudi Yang, Sai Manoj Prakhya, Ziyuan Liu
Abstract: Current point-based approaches encounter limitations in scalability and rendering quality when using large 3D point cloud maps because using them directly for novel view synthesis (NVS) leads to degraded visualizations. We identify the primary issue behind these low-quality renderings as a visibility mismatch between geometry and appearance, stemming from using these two modalities together. To address this problem, we present CE-NPBG, a new approach for novel view synthesis (NVS) in large-scale autonomous driving scenes. Our method is a neural point-based technique that leverages two modalities: posed images (cameras) and synchronized raw 3D point clouds (LiDAR). We first employ a connectivity relationship graph between appearance and geometry, which retrieves points from a large 3D point cloud map observed from the current camera perspective and uses them for rendering. By leveraging this connectivity, our method significantly improves rendering quality and enhances run-time and scalability by using only a small subset of points from the large 3D point cloud map. Our approach associates neural descriptors with the points and uses them to synthesize views. To enhance the encoding of these descriptors and elevate rendering quality, we propose a joint adversarial and point rasterization training. During training, we pair an image-synthesizer network with a multi-resolution discriminator. At inference, we decouple them and use the image-synthesizer to generate novel views. We also integrate our proposal into the recent 3D Gaussian Splatting work to highlight its benefits for improved rendering and scalability.
Authors: Peter H\"onig, Matthias Hirschmanner, Markus Vincze
Abstract: Object pose estimation enables a variety of tasks in computer vision and robotics, including scene understanding and robotic grasping. The complexity of a pose estimation task depends on the unknown variables related to the target object. While instance-level methods already excel for opaque and Lambertian objects, category-level and open-set methods, where texture, shape, and size are partially or entirely unknown, still struggle with these basic material properties. Since texture is unknown in these scenarios, it cannot be used for disambiguating object symmetries, another core challenge of 6D object pose estimation. The complexity of estimating 6D poses with such a manifold of unknowns led to various datasets, accuracy metrics, and algorithmic solutions. This paper compares datasets, accuracy metrics, and algorithms for solving 6D pose estimation on the category-level. Based on this comparison, we analyze how to bridge category-level and open-set object pose estimation to reach generalization and provide actionable recommendations.
Authors: Seongmin Hwang, Daeyoung Han, Moongu Jeon
Abstract: End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-and-play method that improves out-of-distribution (OOD) robustness for DETRs. Specifically, we propose a novel domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space. Additionally, we leverage a wavelet decomposition to disentangle features into domain-invariant and domain-specific components, enabling synthesis of diverse latent styles while preserving the semantic features of objects. Experimental results validate the effectiveness of DG-DETR. Our code is available at https://github.com/sminhwang/DG-DETR.
Authors: Chengzhi Wu, Yuxin Wan, Hao Fu, Julius Pfrommer, Zeyun Zhong, Junwei Zheng, Jiaming Zhang, J\"urgen Beyerer
Abstract: Driven by the increasing demand for accurate and efficient representation of 3D data in various domains, point cloud sampling has emerged as a pivotal research topic in 3D computer vision. Recently, learning-to-sample methods have garnered growing interest from the community, particularly for their ability to be jointly trained with downstream tasks. However, previous learning-based sampling methods either produce unrecognizable sampling patterns by generating a new point cloud or yield biased results by focusing excessively on sharp edge details. Moreover, they all overlook the natural variations in point distribution across different shapes, applying a similar sampling strategy to all point clouds. In this paper, we propose a Sparse Attention Map and Bin-based Learning method (termed SAMBLE) to learn shape-specific sampling strategies for point cloud shapes. SAMBLE effectively achieves an improved balance between sampling edge points for local details and preserving uniformity in the global shape, resulting in superior performance across multiple common point cloud downstream tasks, even in scenarios with few-point sampling.
Authors: Sangmin Kim, Seunguk Do, Jaesik Park
Abstract: Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. The reconstruction is difficult due to (1) actors occluding each other and having diverse facial expressions, (2) cluttered stages, and (3) small-baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows scenes to be edited much as video clips are assembled in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using a depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on the Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page: https://nstar1125.github.io/showmak3r
Authors: Daniele Rege Cambrin, Luca Colomba, Paolo Garza
Abstract: In crisis management and remote sensing, image segmentation plays a crucial role, enabling tasks like disaster response and emergency planning by analyzing visual data. Neural networks are able to analyze satellite acquisitions and determine which areas were affected by a catastrophic event. The main obstacle to their development in this context is data scarcity and the lack of extensive benchmark datasets, which limit the ability to train large neural network models. In this paper, we propose a novel methodology, namely Magnifier, to improve segmentation performance with limited data availability. The Magnifier methodology is applicable to any existing encoder-decoder architecture, as it extends a model by merging information at different contextual levels through a dual-encoder approach: a local and a global encoder. Magnifier analyzes the input data twice using the dual-encoder approach. In particular, the local and global encoders extract information from the same input at different granularities. This allows Magnifier to extract more information than the other approaches given the same set of input images. Magnifier improves results by +2.65% average IoU while introducing only a restrained increase in the number of trainable parameters compared to the original model. We evaluated our proposed approach against state-of-the-art burned area segmentation models, demonstrating, on average, comparable or better performance with less than half of the GFLOPs.
Authors: Roman Malashin, Daniil Ilyukhin
Abstract: This paper introduces the concept of neural network specialization via task-specific domain constraining, aimed at enhancing network performance on the data subspace in which the network operates. The study presents experiments on training specialists for image classification and object detection tasks. The results demonstrate that specialization can enhance a generalist's accuracy even without additional data or changes to the training regime: solely by constraining the class label space in which the network performs. Theoretical and experimental analyses indicate that effective specialization requires modifying traditional fine-tuning methods and constraining the data space to semantically coherent subsets. A specialist extraction phase before tuning the network is proposed for maximal performance gains. We also provide an analysis of the evolution of the feature space during specialization. This study paves the way for future research on developing more advanced, dynamically configurable image analysis systems, where computations depend on the specific input. Additionally, the proposed methods can help improve system performance in scenarios where certain data domains should be excluded from consideration by the generalist network.
Authors: Dou Quan, Rufan Zhou, Shuang Wang, Ning Huyan, Dong Zhao, Yunan Li, Licheng Jiao
Abstract: Deep learning methods have shown promising performance in remote sensing image change detection (CD). However, existing methods usually train a dataset-specific deep network for each dataset. Due to the significant differences in data distribution and labeling between various datasets, the trained dataset-specific deep network generalizes poorly to other datasets. To solve this problem, this paper proposes a change adapter network (CANet) for more universal and generalized CD. CANet contains dataset-shared and dataset-specific learning modules. The former explores the discriminative features of images, and the latter designs a lightweight adapter model to deal with the characteristics of different datasets in data distribution and labeling. The lightweight adapter can quickly generalize the deep network to new CD tasks at a small computational cost. Specifically, this paper proposes an interesting change region mask (ICM) in the adapter, which can adaptively focus on change objects of interest and decrease the influence of labeling differences across datasets. Moreover, CANet adopts a separate batch normalization layer for each dataset to deal with data distribution differences. Compared with existing deep learning methods, CANet can achieve satisfactory CD performance on various datasets simultaneously. Experimental results on several public datasets have verified the effectiveness and advantages of the proposed CANet for CD. CANet has stronger generalization ability, smaller training costs (merely updating 4.1%-7.7% of parameters), and better performance under limited training data than other deep learning methods, and it can also be flexibly integrated into existing deep models.
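As a rough illustration of the dataset-shared/dataset-specific split described above, the sketch below keeps a shared convolutional block but swaps in a per-dataset BatchNorm layer and a lightweight bottleneck adapter. The module and parameter names (e.g., SharedBlockWithAdapters, bottleneck=16) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a dataset-shared block with per-dataset BatchNorm and a
# lightweight residual adapter, loosely following the CANet description.
import torch
import torch.nn as nn

class SharedBlockWithAdapters(nn.Module):
    def __init__(self, channels: int, num_datasets: int, bottleneck: int = 16):
        super().__init__()
        # Dataset-shared feature extractor.
        self.shared_conv = nn.Conv2d(channels, channels, 3, padding=1)
        # One BatchNorm per dataset to absorb distribution differences.
        self.dataset_bns = nn.ModuleList(
            [nn.BatchNorm2d(channels) for _ in range(num_datasets)]
        )
        # Lightweight bottleneck adapters (few parameters per dataset).
        self.adapters = nn.ModuleList(
            [nn.Sequential(
                nn.Conv2d(channels, bottleneck, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(bottleneck, channels, 1),
            ) for _ in range(num_datasets)]
        )

    def forward(self, x: torch.Tensor, dataset_id: int) -> torch.Tensor:
        h = self.shared_conv(x)
        h = self.dataset_bns[dataset_id](h)
        # Residual adapter: only this branch (and its BN) would be updated
        # when adapting to a new change-detection dataset.
        return h + self.adapters[dataset_id](h)

block = SharedBlockWithAdapters(channels=64, num_datasets=3)
out = block(torch.randn(2, 64, 128, 128), dataset_id=1)
```

The point of the design is that generalizing to a new dataset only requires adding and training one small adapter branch plus its normalization statistics, while the shared convolution stays frozen.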
Authors: Pengfei Zhang, Shouqing Jia
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) achieve high-quality image generation without adversarial training, but they process images as a whole. Since adjacent pixels are highly likely to belong to the same object, we propose the Heat Diffusion Model (HDM) to further preserve image details and generate more realistic images. HDM is a model that incorporates pixel-level operations while maintaining the same training process as DDPM. In HDM, the discrete form of the two-dimensional heat equation is integrated into the diffusion and generation formulas of DDPM, enabling the model to compute relationships between neighboring pixels during image processing. Our experiments demonstrate that HDM can generate higher-quality samples compared to models such as DDPM, Consistency Diffusion Models (CDM), Latent Diffusion Models (LDM), and Vector Quantized Generative Adversarial Networks (VQGAN).
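To make the pixel-level operation concrete, the snippet below applies one explicit step of the discrete 2D heat equation (a Laplacian update) to an image before Gaussian corruption. The step size alpha and the way the heat step is interleaved with the DDPM noise schedule are assumptions for illustration; the paper's exact coupling is not reproduced here.

```python
# Illustrative discrete 2D heat-equation step on an image tensor, the kind of
# neighboring-pixel operation HDM injects into the diffusion process.
import torch
import torch.nn.functional as F

def heat_step(img: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """One explicit Euler step of u_{t+1} = u_t + alpha * laplacian(u_t)."""
    lap_kernel = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]], device=img.device)
    lap_kernel = lap_kernel.view(1, 1, 3, 3).repeat(img.shape[1], 1, 1, 1)
    lap = F.conv2d(img, lap_kernel, padding=1, groups=img.shape[1])
    return img + alpha * lap

x = torch.rand(4, 3, 64, 64)          # a batch of images in [0, 1]
x_smoothed = heat_step(x)             # neighboring pixels exchange "heat"
noisy = x_smoothed + 0.1 * torch.randn_like(x_smoothed)  # Gaussian corruption as in DDPM
```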
Authors: Junpeng Jiang, Gangyi Hong, Miao Zhang, Hengtong Hu, Kun Zhan, Rui Shao, Liqiang Nie
Abstract: Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks presents significant challenges and incurs substantial costs, making generative models for realistic data an appealing alternative. Yet, the videos generated by recent works suffer from poor quality and spatiotemporal consistency, undermining their utility in advancing perception tasks under driving scenarios. To address this gap, we propose DiVE, a diffusion transformer-based generative framework meticulously engineered to produce high-fidelity, temporally coherent, and cross-view consistent multi-view videos, aligning seamlessly with bird's-eye view layouts and textual descriptions. DiVE leverages a unified cross-attention and a SketchFormer to exert precise control over multimodal data, while incorporating a view-inflated attention mechanism that adds no extra parameters, thereby guaranteeing consistency across views. Despite these advancements, synthesizing high-resolution videos under multimodal constraints introduces dual challenges: identifying the optimal classifier-free guidance configuration under intricate multi-condition inputs and mitigating excessive computational latency in high-resolution rendering, both of which remain underexplored in prior research. To resolve these limitations, we introduce two innovations: Multi-Control Auxiliary Branch Distillation, which streamlines multi-condition CFG selection while circumventing high computational overhead, and Resolution Progressive Sampling, a training-free acceleration strategy that staggers resolution scaling to reduce the latency incurred at high resolutions. These innovations collectively achieve a 2.62x speedup with minimal quality degradation. Evaluated on the nuScenes dataset, DiVE achieves SOTA performance in multi-view video generation, yielding photorealistic outputs with exceptional temporal and cross-view coherence.
Authors: Yechan Kim, DongHo Yoon, SooYeon Kim, Moongu Jeon
Abstract: Labeling errors in remote sensing (RS) image segmentation datasets often remain implicit and subtle due to ambiguous class boundaries, mixed pixels, shadows, complex terrain features, and subjective annotator bias. Furthermore, the scarcity of annotated RS data, owing to high image acquisition and labeling costs, complicates training noise-robust models. While sophisticated mechanisms such as label selection or noise correction might address this issue, they tend to increase training time and add implementation complexity. In this letter, we propose NSegment, a simple yet effective data augmentation solution to mitigate this issue. Unlike traditional methods, it applies elastic transformations only to segmentation labels, varying the deformation intensity per sample in each training epoch to address annotation inconsistencies. Experimental results demonstrate that our approach improves the performance of RS image segmentation across various state-of-the-art models.
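A label-only elastic deformation of this kind can be sketched as below: the image is left untouched and only the mask is warped with a per-sample intensity. The parameter names and ranges (max_alpha, sigma) are illustrative assumptions, not the values used in the letter.

```python
# Minimal sketch of label-only elastic deformation in the spirit of NSegment.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_label_augment(label: np.ndarray, max_alpha: float = 30.0,
                          sigma: float = 8.0, rng=None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    alpha = rng.uniform(0.0, max_alpha)          # per-sample deformation intensity
    h, w = label.shape
    # Smooth random displacement fields.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])
    # Nearest-neighbor interpolation keeps labels categorical.
    return map_coordinates(label, coords, order=0, mode="nearest")

mask = np.random.randint(0, 5, (256, 256))       # toy segmentation label map
augmented_mask = elastic_label_augment(mask)     # the image itself stays unchanged
```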
Authors: Junlong Ren, Gangjian Zhang, Yu Hu, Jian Shu, Hao Wang
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant video moment features and treating them as hard negative samples, thereby encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances feature discrimination by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments on three datasets demonstrate the superiority of our approach compared to previous methods, achieving state-of-the-art results.
Authors: Pin-Chi Pan, Soo-Chang Pei
Abstract: Underwater instance segmentation is challenging due to adverse visual conditions such as light attenuation, scattering, and color distortion, which degrade model performance. In this work, we propose BARIS-Decoder (Boundary-Aware Refinement Decoder for Instance Segmentation), a framework that enhances segmentation accuracy through feature refinement. To address underwater degradations, we introduce the Environmental Robust Adapter (ERA), which efficiently models underwater degradation patterns while reducing trainable parameters by over 90% compared to full fine-tuning. The integration of BARIS-Decoder with ERA-tuning, referred to as BARIS-ERA, achieves state-of-the-art performance, surpassing Mask R-CNN by 3.4 mAP with a Swin-B backbone and 3.8 mAP with ConvNeXt V2. Our findings demonstrate the effectiveness of BARIS-ERA in advancing underwater instance segmentation, providing a robust and efficient solution.
Authors: Anjith George, Sebastien Marcel
Abstract: Heterogeneous Face Recognition (HFR) addresses the challenge of matching face images across different sensing modalities, such as thermal to visible or near-infrared to visible, expanding the applicability of face recognition systems in real-world, unconstrained environments. While recent HFR methods have shown promising results, many rely on computation-intensive architectures, limiting their practicality for deployment on resource-constrained edge devices. In this work, we present a lightweight yet effective HFR framework by adapting a hybrid CNN-Transformer architecture originally designed for face recognition. Our approach enables efficient end-to-end training with minimal paired heterogeneous data while preserving strong performance on standard RGB face recognition tasks. This makes it a compelling solution for both homogeneous and heterogeneous scenarios. Extensive experiments across multiple challenging HFR and face recognition benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches while maintaining a low computational overhead.
Authors: Nikolaos Chaidos, Angeliki Dimitriou, Nikolaos Spanos, Athanasios Voulodimos, Giorgos Stamou
Abstract: Graph Neural Networks (GNNs) have emerged as an efficient alternative to convolutional approaches for vision tasks such as image classification, leveraging patch-based representations instead of raw pixels. These methods construct graphs where image patches serve as nodes, and edges are established based on patch similarity or classification relevance. Despite their efficiency, the explainability of GNN-based vision models remains underexplored, even though graphs are naturally interpretable. In this work, we analyze the semantic consistency of the graphs formed at different layers of GNN-based image classifiers, focusing on how well they preserve object structures and meaningful relationships. A comprehensive analysis is presented by quantifying the extent to which inter-layer graph connections reflect semantic similarity and spatial coherence. Explanations from standard and adversarial settings are also compared to assess whether they reflect the classifiers' robustness. Additionally, we visualize the flow of information across layers through heatmap-based visualization techniques, thereby highlighting the models' explainability. Our findings demonstrate that the decision-making processes of these models can be effectively explained, while also revealing that their reasoning does not necessarily align with human perception, especially in deeper layers.
Authors: Anush Lakshman Sivaraman, Kojo Adu-Gyamfi, Ibne Farabi Shihab, Anuj Sharma
Abstract: Accurate weather classification from low-quality traffic camera imagery remains a challenging task, particularly under adverse nighttime conditions. In this study, we propose a scalable framework that combines generative domain adaptation with efficient contrastive learning to enhance classification performance. Using CycleGAN-based domain translation, we improve the quality of nighttime images, enabling better feature extraction by downstream models. While the baseline EVA-02 model employing CLIP-based contrastive loss achieves an overall accuracy of 96.55%, it exhibits a significant performance gap between daytime (97.21%) and nighttime conditions (63.40%). Replacing CLIP with the lightweight SigLIP-2 (sigmoid contrastive loss) achieves a competitive overall accuracy of 94.00%, with substantial improvements in nighttime performance (85.90% accuracy). The combination of Vision-SigLIP-2, Text-SigLIP-2, CycleGAN, and contrastive training achieves the best nighttime accuracy (85.90%) among all models tested, while EVA-02 with CycleGAN maintains the highest overall accuracy (97.01%) and per-class accuracies. These findings demonstrate the potential of combining domain adaptation and efficient contrastive learning to build practical, resource-efficient weather classification systems for intelligent transportation infrastructure.
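For readers unfamiliar with the difference between the two objectives mentioned above, the sketch below contrasts a SigLIP-style pairwise sigmoid contrastive loss with CLIP's softmax (InfoNCE-style) loss. The embedding shapes and the temperature/bias values are illustrative assumptions, not the study's training configuration.

```python
# Sketch of a SigLIP-style sigmoid contrastive loss versus CLIP's softmax loss.
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = t * img_emb @ txt_emb.T + b            # (N, N) pairwise logits
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1  # +1 diag, -1 off-diag
    # Every image-text pair becomes an independent binary classification problem.
    return -F.logsigmoid(labels * logits).mean()

def clip_softmax_loss(img_emb, txt_emb, t=100.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = t * img_emb @ txt_emb.T
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(sigmoid_contrastive_loss(img, txt), clip_softmax_loss(img, txt))
```

The sigmoid form avoids the batch-wide softmax normalization, which is one reason it is often described as the more lightweight of the two objectives.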
Authors: Baoshun Shi, Bing Chen, Shaolei Zhang, Huazhu Fu, Zhanli Hu
Abstract: Low-dose CT (LDCT) reduces X-ray radiation exposure, but it can degrade image quality and even produce metal artifacts in the presence of metallic implants. For simultaneous LDCT reconstruction and metal artifact reduction (LDMAR), existing deep learning-based efforts face two main limitations: i) the network design neglects multi-scale and within-scale information; ii) training a distinct model for each dose necessitates significant storage space for multiple doses. To fill these gaps, we propose a prompt guiding multi-scale adaptive sparse representation-driven network, abbreviated as PMSRNet, for the LDMAR task. Specifically, we construct PMSRNet from multi-scale sparsifying frames, and it can simultaneously exploit within-scale characteristics and cross-scale complementarity owing to a carefully designed prompt guiding scale-adaptive threshold generator (PSATG) and a multi-scale coefficient fusion module (MSFuM). The PSATG adaptively captures contextual information at multiple levels to generate more faithful thresholds, achieved by fusing features from local, regional, and global levels. Furthermore, we develop an interpretable dual-domain LDMAR framework called PDuMSRNet and train a single model with a prompt guiding strategy for multiple dose levels. We build a prompt guiding module, whose input contains the dose level, the metal mask, and the input instance, to provide various guiding information, allowing a single model to accommodate various CT dose settings. Extensive experiments at various dose levels demonstrate that the proposed methods outperform state-of-the-art LDMAR methods.
Authors: Lucas Morin, Gerhard Ingmar Meijer, Val\'ery Weber, Luc Van Gool, Peter W. J. Staar
Abstract: Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code will be made publicly available.
Authors: Song Xia, Yi Yu, Henghui Ding, Wenhan Yang, Shifei Liu, Alex C. Kot, Xudong Jiang
Abstract: Precise segmentation of out-of-distribution (OoD) objects, herein referred to as anomalies, is crucial for the reliable deployment of semantic segmentation models in open-set, safety-critical applications, such as autonomous driving. Current anomaly segmentation benchmarks predominantly focus on favorable weather conditions, resulting in untrustworthy evaluations that overlook the risks posed by diverse meteorological conditions in open-set environments, such as low illumination, dense fog, and heavy rain. To bridge this gap, this paper introduces ComsAmy, a challenging benchmark specifically designed for open-set anomaly segmentation in complex scenarios. ComsAmy encompasses a wide spectrum of adverse weather conditions, dynamic driving environments, and diverse anomaly types to comprehensively evaluate model performance in realistic open-world scenarios. Our extensive evaluation of several state-of-the-art anomaly segmentation models reveals that existing methods demonstrate significant deficiencies in such challenging scenarios, highlighting serious safety risks for real-world deployment. To address this, we propose a novel energy-entropy learning (EEL) strategy that integrates the complementary information from energy and entropy to bolster the robustness of anomaly segmentation under complex open-world environments. Additionally, a diffusion-based anomalous training data synthesizer is proposed to generate diverse and high-quality anomalous images, enhancing the existing copy-paste training data synthesizer. Extensive experimental results on both public and ComsAmy benchmarks demonstrate that our proposed diffusion-based synthesizer with energy and entropy learning (DiffEEL) serves as an effective and generalizable plug-and-play method to enhance existing models, yielding an average improvement of around 4.96% in AUPRC and 9.87% in FPR95.
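To illustrate the kind of complementary signal an energy-entropy combination provides, the sketch below scores each pixel with a free-energy term plus the softmax entropy of the segmentation logits. The temperature, mixing weight, and thresholding rule are assumptions for illustration and are not the EEL formulation from the paper.

```python
# Illustrative per-pixel anomaly scoring combining free energy and entropy.
import torch

def energy_entropy_score(logits: torch.Tensor, T: float = 1.0, lam: float = 0.5):
    """logits: (B, C, H, W) segmentation logits -> (B, H, W) anomaly scores."""
    energy = -T * torch.logsumexp(logits / T, dim=1)        # higher = more anomalous
    probs = torch.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1)
    return lam * energy + (1.0 - lam) * entropy

scores = energy_entropy_score(torch.randn(1, 19, 64, 128))
anomaly_mask = scores > scores.mean() + 2 * scores.std()    # toy thresholding
```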
Authors: Lukas Folkman, Quynh LK Vo, Colin Johnston, Bela Stantic, Kylie A Pitt
Abstract: The increasing demand for aquaculture production necessitates the development of innovative, intelligent tools to effectively monitor and manage fish health and welfare. While non-invasive video monitoring has become a common practice in finfish aquaculture, existing intelligent monitoring methods predominantly focus on assessing body condition or fish swimming patterns and are often developed and evaluated in controlled tank environments, without demonstrating their applicability to real-world aquaculture settings in open sea farms. This underscores the necessity for methods that can monitor physiological traits directly within the production environment of sea fish farms. To this end, we have developed a computer vision method for monitoring ventilation rates of Atlantic salmon (Salmo salar), which was specifically designed for videos recorded in the production environment of commercial sea fish farms using the existing infrastructure. Our approach uses a fish head detection model, which classifies the mouth state as either open or closed using a convolutional neural network. This is followed with multiple object tracking to create temporal sequences of fish swimming across the field of view of the underwater video camera to estimate ventilation rates. The method demonstrated high efficiency, achieving a Pearson correlation coefficient of 0.82 between ground truth and predicted ventilation rates in a test set of 100 fish collected independently of the training data. By accurately identifying pens where fish exhibit signs of respiratory distress, our method offers broad applicability and the potential to transform fish health and welfare monitoring in finfish aquaculture.
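Once per-frame mouth states are available for a tracked fish, a ventilation rate can be obtained by counting closed-to-open transitions over the track duration. The toy sketch below shows this final step; the smoothing window, frame rate, and function name are illustrative assumptions rather than the paper's implementation.

```python
# Toy estimation of ventilation rate (breaths per minute) from a tracked fish's
# per-frame mouth-state predictions (1 = open, 0 = closed).
import numpy as np

def ventilation_rate(mouth_states: np.ndarray, fps: float, smooth: int = 3) -> float:
    # Moving-average smoothing to suppress single-frame classifier flips.
    kernel = np.ones(smooth) / smooth
    smoothed = (np.convolve(mouth_states, kernel, mode="same") > 0.5).astype(int)
    # Each closed -> open transition counts as one ventilation cycle.
    openings = np.sum(np.diff(smoothed) == 1)
    duration_min = len(mouth_states) / fps / 60.0
    return openings / duration_min if duration_min > 0 else 0.0

states = np.array([0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0])
print(f"{ventilation_rate(states, fps=2.0):.1f} breaths/min")
```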
Authors: Rupert Polley, Nikolai Polley, Dominik Heid, Marc Heinrich, Sven Ochs, J. Marius Z\"ollner
Abstract: Traffic light perception is an essential component of the camera-based perception system for autonomous vehicles, enabling accurate detection and interpretation of traffic lights to ensure safe navigation through complex urban environments. In this work, we propose a modularized perception framework that integrates state-of-the-art detection models with a novel real-time association and decision framework, enabling seamless deployment into an autonomous driving stack. To address the limitations of existing public datasets, we introduce the ATLAS dataset, which provides comprehensive annotations of traffic light states and pictograms across diverse environmental conditions and camera setups. This dataset is publicly available at https://url.fzi.de/ATLAS. We train and evaluate several state-of-the-art traffic light detection architectures on ATLAS, demonstrating significant performance improvements in both accuracy and robustness. Finally, we evaluate the framework in real-world scenarios by deploying it in an autonomous vehicle to make decisions at traffic light-controlled intersections, highlighting its reliability and effectiveness for real-time operation.
Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen
Abstract: Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from the naive assumption that text understanding is only a sufficient condition for text rendering, not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to truly understand it. Specifically, we adopt the setting from ControlNet and additionally integrate language-agnostic glyph and position information for the rendered text to enable the generation of harmonized visual text, allowing users to customize text content, font, and position as needed. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process, at the inference phase we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict feature injection to the text region only, avoiding distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works; our approach outperforms existing open-source methods and achieves results comparable to native multilingual closed-source models. To be fair, we also discuss its limitations at the end.
Authors: Rustam Tagiew (German Centre for Rail Traffic Research at the Federal Railway Authority), Prasannavenkatesh Balaji (German Centre for Rail Traffic Research at the Federal Railway Authority)
Abstract: Points 2.1.4(b), 2.4.2(b) and 2.4.3(b) in Annex I of Implementing Regulation (EU) No. 402/2013 allow a simplified approach for the safety approval of computer vision systems for driverless trains, provided they have 'similar' functions and interfaces to the replaced human driver. The human driver is not replaced one-to-one by a technical system: only a limited set of cognitive functions is replaced. However, performance in the most challenging function, obstacle detection, is difficult to quantify due to the scarcity of published measurement results. This article summarizes the data published so far. It also goes a long way toward remedying this situation by providing a new public and anonymized dataset of 711 train driver performance measurements from controlled experiments. The measurements were made for different speeds, obstacle sizes, train protection systems, and obstacle color contrasts. The measured values are the reaction time and the distance to the obstacle. The goal of this paper is an unbiased and exhaustive description of the presented dataset for research, standardization, and regulation. Further project-related information, including the dataset and source code, is available at https://atosense-02371c.usercontent.opencode.de/
Authors: Abhishek Kuriyal, Elliot Vincent, Mathieu Aubry, Loic Landrieu
Abstract: Global variations in terrain appearance raise a major challenge for satellite image analysis, leading to poor model performance when training on locations that differ from those encountered at test time. This remains true even with recent large global datasets. To address this challenge, we propose a novel domain-generalization framework for satellite images. Instead of trying to learn a single generalizable model, we train one expert model per training domain, while learning experts' similarity and encouraging similar experts to be consistent. A model selection module then identifies the most suitable experts for a given test sample and aggregates their predictions. Experiments on four datasets (DynamicEarthNet, MUDS, OSCD, and FMoW) demonstrate consistent gains over existing domain generalization and adaptation methods. Our code is publicly available at https://github.com/Abhishek19009/CoDEx.
Authors: Muzammil Behzad, Guoying Zhao
Abstract: In this paper, we introduce AffectVLM, a vision-language model designed to integrate multiview information for a semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data. To effectively capture visual features, we propose a joint representation learning framework paired with a novel gradient-friendly loss function that accelerates model convergence towards optimal feature representation. Additionally, we introduce augmented textual prompts to enhance the model's linguistic capabilities and employ mixed view augmentation to expand the visual dataset. We also develop a Streamlit app for real-time interactive inference and enable the model for distributed learning. Extensive experiments validate the superior performance of AffectVLM across multiple benchmarks.
Authors: Valerie Zermatten, Javiera Castillo-Navarro, Pallavi Jain, Devis Tuia, Diego Marcos
Abstract: The presence of species provides key insights into the ecological properties of a location, such as land cover, climatic conditions, or even soil properties. We propose a method to predict such ecological properties directly from remote sensing (RS) images by aligning them with species habitat descriptions. We introduce the EcoWikiRS dataset, consisting of high-resolution aerial images, the corresponding geolocated species observations, and, for each species, the textual description of its habitat from Wikipedia. EcoWikiRS offers a scalable form of supervision for RS vision-language models (RS-VLMs) for ecology. This is a setting with weak and noisy supervision, where, for instance, some text may describe properties that are specific only to part of the species' niche or may be irrelevant to a specific image. We tackle this by proposing WINCEL, a weighted version of the InfoNCE loss. We evaluate our model on the task of ecosystem zero-shot classification by following the habitat definitions from the European Nature Information System (EUNIS). Our results show that our approach helps in understanding RS images in a more ecologically meaningful manner. The code and the dataset are available at https://github.com/eceo-epfl/EcoWikiRS.
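A weighted InfoNCE objective of the general kind mentioned above can be sketched as follows: each positive image-text pair carries a weight in [0, 1] that can down-weight noisy or partially relevant habitat sentences. The weighting scheme shown is a placeholder, not the paper's WINCEL definition.

```python
# Sketch of a weighted InfoNCE loss with per-pair weights.
import torch
import torch.nn.functional as F

def weighted_infonce(img_emb, txt_emb, weights, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature                      # (N, N) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")   # (N,) per-pair losses
    # Weighted average: low-weight pairs contribute less to the gradient.
    return (weights * per_pair).sum() / weights.sum().clamp_min(1e-8)

img, txt = torch.randn(16, 256), torch.randn(16, 256)
w = torch.rand(16)          # e.g. estimated relevance of each habitat sentence to its image
loss = weighted_infonce(img, txt, w)
```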
Authors: Zhimin Liao, Ping Wei, Shuaijia Chen, Haoxuan Wang, Ziyang Ren
Abstract: 3D occupancy and scene flow offer a detailed and dynamic representation of a 3D scene. Recognizing the sparsity and complexity of 3D space, previous vision-centric methods have employed implicit learning-based approaches to model spatial and temporal information. However, these approaches struggle to capture local details and diminish the model's spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. Specifically, we propose a sparse occlusion-aware attention mechanism, integrated with a cascade refinement strategy, which accurately renovates 3D features with the guidance of occupied-state information. Additionally, we introduce a novel method for modeling long-term dynamic interactions, which reduces computational costs and preserves spatial information. Compared to previous state-of-the-art methods, our efficient explicit renovation strategy not only delivers superior performance in terms of RayIoU and mAVE for occupancy and scene flow prediction but also markedly reduces GPU memory usage during training, bringing it down to 8.7 GB. Our code is available at https://github.com/lzzzzzm/STCOcc
Authors: Kapil Kashyap, Sean Fargose, Chrisil Dabre, Fatema Dolaria, Nilesh Patil, Aniket Kore
Abstract: Liver cirrhosis is an insidious condition involving the substitution of normal liver tissue with fibrous scar tissue, causing major health complications. The conventional method of diagnosis, liver biopsy, is invasive and therefore inconvenient for regular screening. In this paper, we present a hybrid model that combines machine learning techniques with clinical data and ultrasound scans to improve the accuracy of liver fibrosis and cirrhosis detection. The model integrates fixed blood test probabilities with deep learning model predictions (DenseNet-201) for ultrasound images. The combined hybrid model achieved an accuracy of 92.5%. The findings establish the viability of the combined model in enhancing diagnostic accuracy and supporting early intervention in liver disease care.
Authors: Hoang Chuong Nguyen, Wei Mao, Jose M. Alvarez, Miaomiao Liu
Abstract: Neural Radiance Fields (NeRF) have demonstrated a superior capability to represent 3D geometry but require accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialization or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motion as time-dependent angular velocity and velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training on neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and ScanNet show that our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods. Our code is available at https://github.com/HoangChuongNguyen/cope-nerf.
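The velocity-integration idea can be illustrated with a simple Euler scheme: given sampled angular and linear velocities, incremental rotations and translations are chained to obtain poses relative to the first time step. The constant-velocity toy inputs and step size below are assumptions for illustration, not the paper's learned quantities.

```python
# Sketch of recovering relative camera poses by integrating time-dependent
# angular velocity and linear velocity with simple Euler steps.
import numpy as np
from scipy.spatial.transform import Rotation as R

def integrate_motion(omegas, velocities, dt):
    """omegas, velocities: (T, 3) angular / linear velocity samples."""
    Rs, ts = [np.eye(3)], [np.zeros(3)]
    for w, v in zip(omegas, velocities):
        dR = R.from_rotvec(w * dt).as_matrix()   # incremental rotation over one step
        Rs.append(Rs[-1] @ dR)                   # accumulate relative rotation
        ts.append(ts[-1] + Rs[-2] @ (v * dt))    # translate in the current frame
    return np.stack(Rs), np.stack(ts)            # poses relative to time step 0

T, dt = 30, 1.0 / 30.0
omegas = np.tile([0.0, 0.5, 0.0], (T, 1))        # slow yaw (rad/s)
velocities = np.tile([0.2, 0.0, 0.0], (T, 1))    # forward motion (m/s)
rotations, translations = integrate_motion(omegas, velocities, dt)
```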
Authors: Mohamed Hassan, Mohammad Wasil, Sebastian Houben
Abstract: Contrastive learning (CL) approaches have gained great recognition as a very successful subset of self-supervised learning (SSL) methods. SSL enables learning from unlabeled data, a crucial step in the advancement of deep learning, particularly in computer vision (CV), given the plethora of unlabeled image data. CL works by comparing different random augmentations (e.g., different crops) of the same image, thus achieving self-labeling. Nevertheless, randomly augmenting images, and especially random cropping, can result in an image that is semantically very distant from the original and therefore leads to false labeling, undermining the efficacy of these methods. In this research, two novel parameterized cropping methods are introduced that increase the robustness of self-labeling and consequently the efficacy. The results show that the use of these methods significantly improves the accuracy of the model by between 2.7% and 12.4% on the downstream task of classifying CIFAR-10, depending on the crop size, compared to the non-parameterized random cropping method.
Authors: Zhiming Hu, Daniel Haeufle, Syn Schmitt, Andreas Bulling
Abstract: We present HOIGaze - a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: The eye, hand, and head movements are closely coordinated during HOIs and this coordination can be exploited to identify samples that are most useful for gaze estimator training - as such, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted using a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upgrades training samples belonging to the coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria digital twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement of 15.6% on HOT3D and 6.0% on ADT in mean angular error. To demonstrate the potential of our method, we further report significant performance improvements for the sample downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.
Authors: Xiaoyu Liu, Mingshuai Yao, Yabo Zhang, Xianhui Lin, Peiran Ren, Xiaoming Li, Ming Liu, Wangmeng Zuo
Abstract: Human image animation aims to generate human videos of given characters and backgrounds that adhere to the desired pose sequence. However, existing methods focus more on human actions while neglecting the generation of background, which typically leads to static results or inharmonious movements. The community has explored camera pose-guided animation tasks, yet preparing the camera trajectory is impractical for most entertainment applications and ordinary users. As a remedy, we present an AnimateAnywhere framework, rousing the background in human image animation without requirements on camera trajectories. In particular, based on our key insight that the movement of the human body often reflects the motion of the background, we introduce a background motion learner (BML) to learn background motions from human pose sequences. To encourage the model to learn more accurate cross-frame correspondences, we further deploy an epipolar constraint on the 3D attention map. Specifically, the mask used to suppress geometrically unreasonable attention is carefully constructed by combining an epipolar mask and the current 3D attention map. Extensive experiments demonstrate that our AnimateAnywhere effectively learns the background motion from human pose sequences, achieving state-of-the-art performance in generating human animation results with vivid and realistic backgrounds. The source code and model will be available at https://github.com/liuxiaoyu1104/AnimateAnywhere.
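The epipolar constraint mentioned above can be illustrated in its simplest form: for a query pixel in one frame, only target pixels close to its epipolar line in the other frame are allowed to attend. The sketch below assumes a known fundamental matrix and a fixed distance threshold; it is not the paper's exact mask construction, which combines this geometric mask with the current 3D attention map.

```python
# Illustrative construction of an epipolar attention mask.
import numpy as np

def epipolar_mask(query_xy, target_xy, F, thresh=4.0):
    """query_xy: (2,), target_xy: (M, 2) pixel coords, F: (3, 3) fundamental matrix."""
    x = np.array([query_xy[0], query_xy[1], 1.0])
    line = F @ x                                   # epipolar line a*x + b*y + c = 0 in frame B
    a, b, _ = line
    xt = np.concatenate([target_xy, np.ones((target_xy.shape[0], 1))], axis=1)
    dist = np.abs(xt @ line) / np.sqrt(a * a + b * b + 1e-12)
    return dist < thresh                           # boolean mask over target pixels

F = np.random.randn(3, 3)                          # stand-in fundamental matrix
targets = np.stack(np.meshgrid(np.arange(64), np.arange(64)), -1).reshape(-1, 2)
mask = epipolar_mask((10.0, 20.0), targets.astype(float), F)
```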
Authors: Yulong Guo, Zilun Zhang, Yongheng Shang, Tiancheng Zhao, Shuiguang Deng, Yingchun Yang, Jianwei Yin
Abstract: The long-tail problem presents a significant challenge to the advancement of semantic segmentation in ultra-high-resolution (UHR) satellite imagery. While previous efforts in UHR semantic segmentation have largely focused on multi-branch network architectures that emphasize multi-scale feature extraction and fusion, they have often overlooked the importance of addressing the long-tail issue. In contrast to prior UHR methods that focused on independent feature extraction, we emphasize data augmentation and multimodal feature fusion to alleviate the long-tail problem. In this paper, we introduce SRMF, a novel framework for semantic segmentation in UHR satellite imagery. Our approach addresses the long-tail class distribution by incorporating a multi-scale cropping technique alongside a data augmentation strategy based on semantic reordering and resampling. To further enhance model performance, we propose a multimodal fusion-based general representation knowledge injection method, which, for the first time, fuses text and visual features without the need for individual region text descriptions, extracting more robust features. Extensive experiments on the URUR, GID, and FBP datasets demonstrate that our method improves mIoU by 3.33\%, 0.66\%, and 0.98\%, respectively, achieving state-of-the-art performance. Code is available at: https://github.com/BinSpa/SRMF.git.
Authors: Juhan Park, Kyungjae Lee, Hyung Jin Chang, Jungchan Cho
Abstract: In this work, we introduce the Segmentation to Human-Object Interaction (Seg2HOI) approach, a novel framework that integrates segmentation-based vision foundation models with the human-object interaction task, distinguishing it from traditional detection-based Human-Object Interaction (HOI) methods. Our approach enhances HOI detection by not only predicting the standard triplets but also introducing quadruplets, which extend HOI triplets by including segmentation masks for human-object pairs. More specifically, Seg2HOI inherits the properties of the vision foundation model (e.g., promptable and interactive mechanisms) and incorporates a decoder that applies these attributes to the HOI task. Despite being trained only for HOI, without additional training mechanisms for these properties, the framework demonstrates that such features still operate efficiently. Extensive experiments on two public benchmark datasets demonstrate that Seg2HOI achieves performance comparable to state-of-the-art methods, even in zero-shot scenarios. Lastly, we propose that Seg2HOI can generate HOI quadruplets and interactive HOI segmentation from novel text and visual prompts that were not used during training, making it versatile for a wide range of applications by leveraging this flexibility.
Authors: Chenhan Jiang, Yihan Zeng, Hang Xu, Dit-Yan Yeung
Abstract: Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. However, SDS-based methods struggle to maintain semantic fidelity for user prompts, particularly when involving multiple objects with intricate interactions. While existing approaches often address 3D consistency through multiview diffusion model fine-tuning on 3D datasets, this strategy inadvertently exacerbates text-3D alignment degradation. The limitation stems from SDS's inherent accumulation of view-independent biases during optimization, which progressively diverges from the ideal text alignment direction. To alleviate this limitation, we propose a novel SDS objective, dubbed as Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). Our TCSD leverages cross-modal understanding capabilities of MLLMs to assess and guide the text-3D correspondence during the optimization. We further develop 3DLLaVA-CRITIC - a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations. Additionally, we introduce an LLM-layout initialization that significantly accelerates optimization convergence through semantic-aware spatial configuration. Comprehensive evaluations demonstrate that our framework, CoherenDream, establishes state-of-the-art performance in text-aligned 3D generation across multiple benchmarks, including T$^3$Bench and TIFA subset. Qualitative results showcase the superior performance of CoherenDream in preserving textual consistency and semantic interactions. As the first study to incorporate MLLMs into SDS optimization, we also conduct extensive ablation studies to explore optimal MLLM adaptations for 3D generation tasks.
Authors: Daniel Kienzle, Robin Sch\"on, Rainer Lienhart, Shin'Ichi Satoh
Abstract: Analyzing a player's technique in table tennis requires knowledge of the ball's 3D trajectory and spin. While the spin is not directly observable in standard broadcast videos, we show that it can be inferred from the ball's trajectory in the video. We present a novel method to infer the initial spin and 3D trajectory from the corresponding 2D trajectory in a video. Without ground truth labels for broadcast videos, we train a neural network solely on synthetic data. Due to the choice of our input data representation, physically correct synthetic training data, and targeted augmentations, the network naturally generalizes to real data. Notably, these simple techniques are sufficient to achieve generalization; no real data at all is required for training. To the best of our knowledge, we are the first to present a method for spin and trajectory prediction from simple monocular broadcast videos, achieving an accuracy of 92.0% in spin classification and a 2D reprojection error of 0.19% of the image diagonal.
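Physically based synthetic data of the kind described can be produced by simulating a ball under gravity, air drag, and the spin-induced Magnus force, then projecting the 3D trajectory into the image. The constants below are rough table-tennis values and the pinhole camera setup is an assumption; this is a toy sketch, not the authors' simulator.

```python
# Toy generation of a spinning table-tennis trajectory and its 2D projection.
import numpy as np

def simulate_trajectory(p0, v0, spin, steps=120, dt=1 / 120):
    g = np.array([0.0, 0.0, -9.81])
    k_drag, k_magnus, m = 4e-4, 1e-5, 0.0027       # rough table-tennis constants
    p, v, traj = np.array(p0, float), np.array(v0, float), []
    for _ in range(steps):
        drag = -k_drag * np.linalg.norm(v) * v
        magnus = k_magnus * np.cross(spin, v)      # force induced by spin (rad/s)
        v = v + dt * (g + (drag + magnus) / m)
        p = p + dt * v
        traj.append(p.copy())
    return np.array(traj)

def project(points_3d, f=1200.0, cx=960.0, cy=540.0, cam_y=8.0):
    # Simple pinhole camera looking along +y from y = -cam_y (illustrative).
    depth = points_3d[:, 1] + cam_y
    u = f * points_3d[:, 0] / depth + cx
    v = -f * points_3d[:, 2] / depth + cy
    return np.stack([u, v], axis=1)

traj3d = simulate_trajectory(p0=[0, 0, 0.3], v0=[0.5, 4.0, 1.0], spin=[80.0, 0.0, 0.0])
traj2d = project(traj3d)                           # 2D trajectory fed to the network
```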
Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Abstract: This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model's ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.
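The triplet objective used to separate real from synthetic content can be sketched directly with PyTorch's built-in triplet margin loss. The margin value and the way anchor/positive/negative embeddings are sampled below are illustrative assumptions, not the paper's training recipe.

```python
# Sketch of a triplet objective pulling real-image embeddings together and
# pushing AI-generated embeddings away.
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=0.2)

anchor   = F.normalize(torch.randn(32, 768), dim=-1)  # real image embeddings
positive = F.normalize(torch.randn(32, 768), dim=-1)  # other real images
negative = F.normalize(torch.randn(32, 768), dim=-1)  # AI-generated images

loss = triplet(anchor, positive, negative)  # small d(anchor, positive), large d(anchor, negative)
```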
Authors: Claire Warwick, Andrew Beresford, Soazig Casteau, Hubert P. H. Shum, Dan Smith, Francis Xiatian Zhang
Abstract: The following paper describes a collaborative project involving researchers at Durham University, and professionals at the Bowes Museum, Barnard Castle, County Durham, UK, during which we used fixed and mobile eye tracking to understand how visitors view art. Our study took place during summer 2024 and builds on work presented at DH2017 (Bailey-Ross et al., 2017). Our interdisciplinary team included researchers from digital humanities, psychology, art history and computer science, working in collaboration with professionals from the museum. We used fixed and mobile eye tracking to understand how museum visitors view art in a physical gallery setting. This research will enable us to make recommendations about how the Museum's collections could be more effectively displayed, encouraging visitors to engage with them more fully.
Authors: Runhui Zhang, Sijin Zhou, Zhuang Qi
Abstract: Federated learning aims to collaboratively train a model by integrating multi-source information so that it generalizes across all client data. Existing methods often leverage knowledge distillation or data augmentation to mitigate the negative impact of data bias across clients. However, the limited performance of teacher models on out-of-distribution samples and the inherent quality gap between augmented and original data hinder their effectiveness, and they typically fail to leverage the advantages of incorporating rich contextual information. To address these limitations, this paper proposes a Federated Causal Augmentation method, termed FedCAug, which employs causality-inspired data augmentation to break the spurious correlation between attributes and categories. Specifically, it designs a causal region localization module to accurately identify and decouple the background and objects in the image, providing rich contextual information for causal data augmentation. Additionally, it designs a causality-inspired data augmentation module that integrates causal features and within-client context to generate counterfactual samples. This significantly enhances data diversity, and the entire process does not require any information sharing between clients, thereby contributing to the protection of data privacy. Extensive experiments conducted on three datasets reveal that FedCAug markedly reduces the model's reliance on background cues for predicting sample labels, achieving superior performance compared to state-of-the-art methods.
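The core counterfactual idea of decoupling object and background can be illustrated as pasting a masked object onto a background from another image of the same client, breaking the spurious object-background correlation. The mask source and names below are assumptions; the paper's localization module is not reproduced here.

```python
# Illustrative counterfactual augmentation: recombine an object with a
# different within-client background.
import numpy as np

def counterfactual_sample(object_img, object_mask, background_img):
    """object_img, background_img: (H, W, 3); object_mask: (H, W) in {0, 1}."""
    m = object_mask[..., None].astype(object_img.dtype)
    return m * object_img + (1.0 - m) * background_img

img_a = np.random.rand(224, 224, 3)        # image containing the object
mask_a = (np.random.rand(224, 224) > 0.7)  # stand-in object mask from a localization step
img_b = np.random.rand(224, 224, 3)        # background context from the same client
new_sample = counterfactual_sample(img_a, mask_a, img_b)
```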
Authors: Han Chen, Anne L. Martel
Abstract: Purpose: The scarcity of high-quality curated labeled medical training data remains one of the major limitations in applying artificial intelligence (AI) systems to breast cancer diagnosis. Deep models for mammogram analysis and mass (or micro-calcification) detection require training with a large volume of labeled images, which are often expensive and time-consuming to collect. To mitigate this challenge, we propose a novel method that leverages self-supervised learning (SSL) and a deep hybrid model, named HybMNet, which combines local self-attention and fine-grained feature extraction to enhance breast cancer detection on screening mammograms. Approach: Our method employs a two-stage learning process: (1) SSL Pretraining: We utilize EsViT, an SSL technique, to pretrain a Swin Transformer (Swin-T) using a limited set of mammograms. The pretrained Swin-T then serves as the backbone for the downstream task. (2) Downstream Training: The proposed HybMNet combines the Swin-T backbone with a CNN-based network and a novel fusion strategy. The Swin-T employs local self-attention to identify informative patch regions from the high-resolution mammogram, while the CNN-based network extracts fine-grained local features from the selected patches. A fusion module then integrates global and local information from both networks to generate robust predictions. The HybMNet is trained end-to-end, with the loss function combining the outputs of the Swin-T and CNN modules to optimize feature extraction and classification performance. Results: The proposed method was evaluated for its ability to detect breast cancer by distinguishing between benign (normal) and malignant mammograms. Leveraging SSL pretraining and the HybMNet model, it achieved an AUC of 0.864 (95% CI: 0.852, 0.875) on the CMMD dataset and 0.889 (95% CI: 0.875, 0.903) on the INbreast dataset, highlighting its effectiveness.
Authors: Quynh Phung, Long Mai, Fabian David Caba Heilbron, Feng Liu, Jia-Bin Huang, Cusuh Ham
Abstract: We present CineVerse, a novel framework for the task of cinematic scene composition. Similar to traditional multi-shot generation, our task emphasizes the need for consistency and continuity across frames. However, our task also focuses on addressing challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects. In order to learn to generate such content, we first create the CineVerse dataset. We use this dataset to train our proposed two-stage approach. First, we prompt a large language model (LLM) with task-specific instructions to take in a high-level scene description and generate a detailed plan for the overall setting and characters, as well as the individual shots. Then, we fine-tune a text-to-image generation model to synthesize high-quality visual keyframes. Experimental results demonstrate that CineVerse yields promising improvements in generating visually coherent and contextually rich movie scenes, paving the way for further exploration in cinematic video synthesis.
Authors: Han Chen, Anne L. Martel
Abstract: Accurate detection of breast cancer from high-resolution mammograms is crucial for early diagnosis and effective treatment planning. Previous studies have shown the potential of using single-view mammograms for breast cancer detection. However, incorporating multi-view data can provide more comprehensive insights. Multi-view classification, especially in medical imaging, presents unique challenges, particularly when dealing with large-scale, high-resolution data. In this work, we propose a novel Multi-view Visual Prompt Tuning Network (MVPT-NET) for analyzing multiple screening mammograms. We first pretrain a robust single-view classification model on high-resolution mammograms and then innovatively adapt multi-view feature learning into a task-specific prompt tuning process. This technique selectively tunes a minimal set of trainable parameters (7%) while retaining the robustness of the pre-trained single-view model, enabling efficient integration of multi-view data without the need for aggressive downsampling. Our approach offers an efficient alternative to traditional feature fusion methods, providing a more robust, scalable, and efficient solution for high-resolution mammogram analysis. Experimental results on a large multi-institution dataset demonstrate that our method outperforms conventional approaches while maintaining detection efficiency, achieving an AUROC of 0.852 for distinguishing between Benign, DCIS, and Invasive classes. This work highlights the potential of MVPT-NET for medical imaging tasks and provides a scalable solution for integrating multi-view data in breast cancer detection.
Authors: Hugo Georgenthum, Cristian Cosentino, Fabrizio Marozzo, Pietro Li\`o
Abstract: The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis. This paper presents a novel method at the intersection of artificial intelligence and medicine, aiming to develop machine learning models with direct real-world applications in surgical contexts. We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries. The approach is structured in three key stages. First, surgical videos are divided into clips, and visual features are extracted at the frame level using visual transformers. This step focuses on detecting tools, tissues, organs, and surgical actions. Second, the extracted features are transformed into frame-level captions via large language models. These are then combined with temporal features, captured using a ViViT-based encoder, to produce clip-level summaries that reflect the broader context of each video segment. Finally, the clip-level descriptions are aggregated into a full surgical report using a dedicated LLM tailored for the summarization task. We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos. The results show strong performance, achieving 96% precision in tool detection and a BERT score of 0.74 for temporal context summarization. This work contributes to the advancement of AI-assisted tools for surgical reporting, offering a step toward more intelligent and reliable clinical documentation.
Authors: Xiem HoangVan, Hieu Bui Minh, Sang NguyenQuang, Wen-Hsiao Peng
Abstract: The latest video coding standard, H.266/VVC, offers substantially improved compression performance compared to its predecessor, HEVC. Although VVC incorporates many advanced techniques, it still faces the same challenges as its predecessor: ever-higher perceptual quality demands at the decoder side and compression performance demands at the encoder side. Advances in Artificial Intelligence (AI), notably deep learning-based video quality enhancement methods, have shown promise for improving the perceptual quality experience. In this paper, we propose a novel Omniscient video quality enhancement Network for VVC compressed Videos. The Omniscient Network for compressed video quality enhancement was originally designed for HEVC compressed videos, in which not only spatial-temporal features but also cross-frequency information were employed to augment the visual quality. Inspired by this work, we propose a modification of the OVQE model and integrate it into the latest STD-VVC (Standard Versatile Video Coding) decoder architecture. As assessed under a rich set of test conditions, the proposed OVQE-VVC solution achieves significant PSNR improvements, around 0.74 dB and up to 1.2 dB, with respect to the original STD-VVC codec. This corresponds to approximately 19.6% bitrate savings while maintaining similar perceptual quality.
Authors: Yunfei Wan, Jianheng Liu, Jiarong Lin, Fu Zhang
Abstract: In this paper, we present a 3D reconstruction and rendering framework termed Mesh-Learner that is natively compatible with traditional rasterization pipelines. It integrates mesh and spherical harmonic (SH) texture (i.e., texture filled with SH coefficients) into the learning process to learn each mesh's view-dependent radiance end-to-end. Images are rendered by interpolating surrounding SH texels at each pixel's sampling point using a novel interpolation method. Conversely, gradients from each pixel are back-propagated to the related SH texels in SH textures. Mesh-Learner exploits graphics features of the rasterization pipeline (texture sampling, deferred rendering) for rendering, which makes Mesh-Learner naturally compatible with tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics) that are based on rasterization pipelines. Our system can train vast, unlimited scenes because we transfer only the SH textures within the frustum to the GPU for training. At other times, the SH textures are stored in CPU RAM, which results in moderate GPU memory usage. The rendering results on interpolation and extrapolation sequences in the Replica and FAST-LIVO2 datasets achieve state-of-the-art performance compared to existing state-of-the-art methods (e.g., 3D Gaussian Splatting and M2-Mapping). To benefit the community, the code will be available at https://github.com/hku-mars/Mesh-Learner.
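As background for how SH texels can encode view-dependent radiance, here is a minimal sketch that evaluates degree-1 real spherical harmonics stored per texel into an RGB value for a given viewing direction; the coefficient layout and constants follow the standard real SH basis and are assumptions, not the paper's interpolation scheme.

# Sketch: view-dependent color from degree-1 real spherical-harmonic coefficients
# stored per texel (layout and constants are the standard real SH basis, not
# necessarily the paper's exact convention).
import torch

C0 = 0.28209479177387814   # Y_0^0
C1 = 0.4886025119029199    # |Y_1^m|

def sh_to_color(sh: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
    # sh: (..., 3, 4) RGB coefficients for the 4 basis functions
    # view_dir: (..., 3) unit viewing directions
    x, y, z = view_dir[..., 0:1], view_dir[..., 1:2], view_dir[..., 2:3]
    basis = torch.cat([torch.full_like(x, C0), -C1 * y, C1 * z, -C1 * x], dim=-1)
    return (sh * basis.unsqueeze(-2)).sum(dim=-1)    # (..., 3) radiance per channel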
Authors: Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi
Abstract: Shoplifting remains a costly issue for the retail sector, but traditional surveillance systems, which are mostly based on human monitoring, are still largely ineffective, with only about 2% of shoplifters being arrested. Existing AI-based approaches rely on pixel-level video analysis, which raises privacy concerns, is sensitive to environmental variations, and demands significant computational resources. To address these limitations, we introduce Shopformer, a novel transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video. We propose a custom tokenization strategy that converts pose sequences into compact embeddings for efficient transformer processing. To the best of our knowledge, this is the first pose-sequence-based transformer model for shoplifting detection. Evaluated on real-world pose data, our method outperforms state-of-the-art anomaly detection models, offering a privacy-preserving and scalable solution for real-time retail surveillance. The code base for this work is available at https://github.com/TeCSAR-UNCC/Shopformer.
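The pose-tokenization idea can be illustrated with a small sketch: per-frame keypoints are flattened into compact tokens and fed to a transformer encoder that outputs an anomaly score. The layer sizes and joint count are illustrative assumptions and do not reproduce the Shopformer architecture.

# Sketch: embedding pose sequences as tokens for a transformer encoder
# (dimensions and layer choices are illustrative, not the Shopformer design).
import torch
import torch.nn as nn

class PoseSequenceModel(nn.Module):
    def __init__(self, num_joints: int = 17, coord_dim: int = 2, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(num_joints * coord_dim, d_model)  # one token per frame
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.score = nn.Linear(d_model, 1)                       # anomaly score head

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, frames, joints, coords) keypoint sequences
        b, t = poses.shape[:2]
        tokens = self.embed(poses.reshape(b, t, -1))             # (b, t, d_model)
        return self.score(self.encoder(tokens).mean(dim=1))      # (b, 1)

scores = PoseSequenceModel()(torch.randn(4, 24, 17, 2))          # (4, 1)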
Authors: Ioannis Kontogiorgakis, Iason Tsardanidis, Dimitrios Bormpoudakis, Ilias Tsoumas, Dimitra A. Loka, Christos Noulas, Alexandros Tsitouras, Charalampos Kontoes
Abstract: Effective weed management is crucial for improving agricultural productivity, as weeds compete with crops for vital resources like nutrients and water. Accurate maps of weed management methods are essential for policymakers to assess farmer practices, evaluate impacts on vegetation health, biodiversity, and climate, as well as ensure compliance with policies and subsidies. However, monitoring weed management methods is challenging, as it commonly relies on on-the-ground field surveys, which are often costly, time-consuming, and subject to delays. To tackle this problem, we leverage Earth Observation (EO) data and Machine Learning (ML). Specifically, we developed an ML approach for mapping four distinct weed management methods (Mowing, Tillage, Chemical-spraying, and No practice) in orchards using satellite image time series (SITS) data from two different sources: Sentinel-2 (S2) and PlanetScope (PS). The findings demonstrate the potential of ML-driven remote sensing to enhance the efficiency and accuracy of weed management mapping in orchards.
Authors: Andreas Kalogeras, Dimitrios Bormpoudakis, Iason Tsardanidis, Dimitra A. Loka, Charalampos Kontoes
Abstract: The widespread use of Exogenous Organic Matter (EOM) in agriculture necessitates monitoring to assess its effects on soil and crop health. This study evaluates optical Sentinel-2 satellite imagery for detecting digestate application, a practice that enhances soil fertility but poses environmental risks like microplastic contamination and nitrogen losses. In the first instance, Sentinel-2 satellite image time series (SITS) analysis of specific indices (EOMI, NDVI, EVI) was used to characterize EOM's spectral behavior after application on the soils of four different crop types in Thessaly, Greece. Furthermore, Machine Learning (ML) models (namely Random Forest, k-NN, Gradient Boosting and a Feed-Forward Neural Network) were used to investigate digestate presence detection, achieving F1-scores up to 0.85. The findings highlight the potential of combining remote sensing and ML for scalable and cost-effective monitoring of EOM applications, supporting precision agriculture and sustainability.
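For readers unfamiliar with this index-based workflow, the sketch below computes the standard NDVI and EVI formulas from Sentinel-2-like reflectances and fits a Random Forest on their time series; the EOMI index, band preprocessing, and labels are omitted or synthetic here, so this is an illustration of the general pipeline rather than the study's code.

# Sketch: vegetation indices from surface reflectances and a Random Forest
# classifier over their time series (data and labels below are synthetic).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    return (nir - red) / (nir + red + 1e-6)

def evi(nir: np.ndarray, red: np.ndarray, blue: np.ndarray) -> np.ndarray:
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

# toy example: per-parcel time series of (nir, red, blue) reflectances in [0, 1]
rng = np.random.default_rng(0)
nir, red, blue = rng.random((3, 200, 12))            # 200 parcels, 12 acquisition dates
features = np.concatenate([ndvi(nir, red), evi(nir, red, blue)], axis=1)
labels = rng.integers(0, 2, 200)                     # e.g., digestate applied or not
clf = RandomForestClassifier(n_estimators=200).fit(features, labels)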
Authors: Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jieneng Chen, Jianwen Xie, Alan Yuille
Abstract: Recent studies in 3D spatial reasoning explore data-driven approaches and achieve enhanced spatial reasoning performance with reinforcement learning (RL). However, these methods typically perform spatial reasoning in an implicit manner, and it remains underexplored whether the acquired 3D knowledge generalizes to unseen question types at any stage of the training. In this work we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between stages -- 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and enable us to study the factual errors made by LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks and generalizes better when evaluated on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.
Authors: Zhengqin Li, Dilin Wang, Ka Chen, Zhaoyang Lv, Thu Nguyen-Phuoc, Milim Lee, Jia-Bin Huang, Lei Xiao, Cheng Zhang, Yufeng Zhu, Carl S. Marshall, Yufeng Ren, Richard Newcombe, Zhao Dong
Abstract: We present Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon the recent Large Reconstruction Models (LRMs) that achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately and cannot recover glossy appearance or generate relightable 3D content that can be consumed by standard graphics engines. To address these limitations, we make three key technical contributions to build a more practical multi-view 3D reconstruction framework. First, we introduce an update model that allows us to progressively add more input views to improve our reconstruction. Second, we propose a hexa-plane neural SDF representation to better recover detailed textures, geometry and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.
Authors: Kai Ye, Haidi Tang, Bowen Liu, Pingyang Dai, Liujuan Cao, Rongrong Ji
Abstract: Applications of unmanned aerial vehicles (UAVs) in logistics, agricultural automation, urban management, and emergency response are highly dependent on oriented object detection (OOD) to enhance visual perception. Although existing datasets for OOD in UAVs provide valuable resources, they are often designed for specific downstream tasks. Consequently, they exhibit limited generalization performance in real flight scenarios and fail to thoroughly demonstrate algorithm effectiveness in practical environments. To bridge this critical gap, we introduce CODrone, a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements, ensuring greater applicability and robustness in UAV-based OOD. Based on application requirements, we identify four key limitations in current UAV OOD datasets (low image resolution, limited object categories, single-view imaging, and restricted flight altitudes) and propose corresponding improvements to enhance their applicability and robustness. Furthermore, CODrone contains a broad spectrum of annotated images collected from multiple cities under various lighting conditions, enhancing the realism of the benchmark. To rigorously evaluate CODrone as a new benchmark and gain deeper insights into the novel challenges it presents, we conduct a series of experiments based on 22 classical or SOTA methods. Our evaluation not only assesses the effectiveness of CODrone in real-world scenarios but also highlights key bottlenecks and opportunities to advance OOD in UAV applications. Overall, CODrone fills the data gap in OOD from the UAV perspective and provides a benchmark with enhanced generalization capability, better aligning with practical applications and future algorithm development.
Authors: Sara Yavari, Jacob Furst
Abstract: This paper proposes an Incremental Learning (IL) approach to enhance the accuracy and efficiency of deep learning models for prostate cancer detection in T2-weighted (T2w) MRI images using the PI-CAI dataset. We used artificial intelligence and radiology data from multiple health centers, focusing on different tasks related to prostate cancer detection using MRI (PI-CAI). We utilized Knowledge Distillation (KD), as it employs generated images from past tasks to guide the training of models for subsequent tasks. The approach yielded improved performance and faster convergence of the models. To demonstrate the versatility and robustness of our approach, we evaluated it on the PI-CAI dataset, a diverse set of medical imaging modalities including OCT and PathMNIST, and the benchmark continual learning dataset CIFAR-10. Our results indicate that KD can be a promising technique for IL in medical image analysis in which data is sourced from individual health centers and the storage of large datasets is not feasible. By using generated images from prior tasks, our method enables the model to retain and apply previously acquired knowledge without direct access to the original data.
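A generic form of the knowledge-distillation objective mentioned above is sketched below: the student is trained on hard labels plus a temperature-softened KL term against teacher predictions. The temperature and mixing weight are illustrative, and the paper's use of generated images from prior tasks is not reproduced here.

# Sketch: standard knowledge-distillation loss combining hard-label cross-entropy
# with KL divergence to a teacher's softened predictions (weights are illustrative).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 2.0, alpha: float = 0.5):
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl

loss = kd_loss(torch.randn(8, 3), torch.randn(8, 3), torch.randint(0, 3, (8,)))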
Authors: Zador Pataki, Paul-Edouard Sarlin, Johannes L. Sch\"onberger, Marc Pollefeys
Abstract: While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap, low-parallax or high-symmetry scenarios. Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. Thanks to a tight integration of monocular and multi-view constraints, our approach significantly outperforms existing ones under extreme viewpoint changes, while maintaining strong performance in standard conditions. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. This makes our approach the first capable of reliably reconstructing challenging indoor environments from few images. Through principled uncertainty propagation, it is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation. Our code is publicly available at https://github.com/cvg/mpsfm.
Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, Weidi Xie
Abstract: Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
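The causal temporal attention component can be illustrated with a short sketch in which a lower-triangular mask prevents each frame from attending to future frames; it relies on PyTorch's built-in multi-head attention and is a simplification under those assumptions, not the StreamFormer code.

# Sketch: causal temporal attention over frame features, so each frame attends
# only to itself and earlier frames (illustrative; not the StreamFormer code).
import torch
import torch.nn as nn

def causal_temporal_attention(frame_feats: torch.Tensor, attn: nn.MultiheadAttention):
    # frame_feats: (batch, time, dim) per-frame features
    t = frame_feats.size(1)
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)  # True = blocked
    out, _ = attn(frame_feats, frame_feats, frame_feats, attn_mask=mask)
    return out

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
video = torch.randn(2, 16, 256)
print(causal_temporal_attention(video, attn).shape)   # torch.Size([2, 16, 256])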
Authors: Yu-Ju Tsai, Brian Price, Qing Liu, Luis Figueroa, Daniil Pakhomov, Zhihong Ding, Scott Cohen, Ming-Hsuan Yang
Abstract: Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model's attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques. Project page: https://liagm.github.io/CompleteMe/
Authors: Gon\c{c}alo Hora de Carvalho
Abstract: This study investigated visual impairment complaints in a sample of 948 Acquired Brain Injury (ABI) patients using the DiaNAH dataset, emphasizing advanced machine learning techniques for managing missing data. Patients completed a CVS questionnaire capturing eight types of visual symptoms, including blurred vision and altered contrast perception. Due to incomplete data, 181 patients were excluded, resulting in an analytical subset of 767 individuals. To address the challenge of missing data, an automated machine learning (AutoML) approach was employed for data imputation, preserving the distributional characteristics of the original dataset. Patients were grouped according to singular and combined complaint clusters derived from the 40,320 potential combinations identified through the CVS questionnaire. A linear correlation analysis revealed minimal to no direct relationship between patient-reported visual complaints and standard visual perceptual function tests. This study represents an initial systematic attempt to understand the complex relationship between subjective visual complaints and objective visual perceptual assessments in ABI patients. Given the limitations of sample size and variability, further studies with larger populations are recommended to robustly explore these complaint clusters and their implications for visual perception following brain injury.
Authors: Ching-Yi Lin, Sahil Shah
Abstract: Pre-trained vision transformers have achieved remarkable performance across various visual tasks but suffer from expensive computational and memory costs. While model quantization reduces memory usage by lowering precision, these models still incur significant computational overhead due to the dequantization before matrix operations. In this work, we analyze the computation graph and propose an integerization process based on operation reordering. Specifically, the process delays dequantization until after matrix operations. This enables integerized matrix multiplication and linear modules by directly processing the quantized input. To validate our approach, we synthesize the self-attention module of ViT on systolic array-based hardware. Experimental results show that our low-bit inference reduces per-PE power consumption for linear layers and matrix multiplication, bridging the gap between quantized models and efficient inference.
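The reordering idea, delaying dequantization until after the matrix operation, can be illustrated numerically: multiply the quantized integer tensors first and apply the combined scale once afterwards. The sketch below assumes symmetric per-tensor quantization, which is an illustrative choice rather than the paper's exact scheme.

# Sketch of delayed dequantization: accumulate the matmul in integers and apply
# the combined scale once afterwards (symmetric per-tensor quantization assumed).
import numpy as np

def quantize(x: np.ndarray, bits: int = 8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int32), scale

x = np.random.randn(4, 16).astype(np.float32)
w = np.random.randn(16, 8).astype(np.float32)
xq, sx = quantize(x)
wq, sw = quantize(w)

y_int = xq @ wq                            # pure integer accumulation (hardware-friendly)
y = y_int.astype(np.float32) * (sx * sw)   # single dequantization after the matmul
print(np.abs(y - x @ w).max())             # small quantization error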
Authors: Boyuan Peng, Jiaju Chen, Yiwei Zhang, Cuiyi Peng, Junyang Li, Jiaming Deng, Peiwu Qin
Abstract: The growing burden of myopia and retinal diseases necessitates more accessible and efficient eye screening solutions. This study presents a compact, dual-function optical device that integrates fundus photography and refractive error detection into a unified platform. The system features a coaxial optical design using dichroic mirrors to separate wavelength-dependent imaging paths, enabling simultaneous alignment of fundus and refraction modules. A Dense-U-Net-based algorithm with customized loss functions is employed for accurate pupil segmentation, facilitating automated alignment and focusing. Experimental evaluations demonstrate the system's capability to achieve high-precision pupil localization (EDE = 2.8 px, mIoU = 0.931) and reliable refractive estimation with a mean absolute error below 5%. Despite limitations due to commercial lens components, the proposed framework offers a promising solution for rapid, intelligent, and scalable ophthalmic screening, particularly suitable for community health settings.
Authors: Abha Jha, Ashwath Vaithinathan Aravindan, Matthew Salaway, Atharva Sandeep Bhide, Duygu Nur Yaldiz
Abstract: Text-to-image diffusion models are increasingly vulnerable to backdoor attacks, where malicious modifications to the training data cause the model to generate unintended outputs when specific triggers are present. While classification models have seen extensive development of defense mechanisms, generative models remain largely unprotected due to their high-dimensional output space, which complicates the detection and mitigation of subtle perturbations. Defense strategies for diffusion models, in particular, remain under-explored. In this work, we propose Spatial Attention Unlearning (SAU), a novel technique for mitigating backdoor attacks in diffusion models. SAU leverages latent space manipulation and spatial attention mechanisms to isolate and remove the latent representation of backdoor triggers, ensuring precise and efficient removal of malicious effects. We evaluate SAU across various types of backdoor attacks, including pixel-based and style-based triggers, and demonstrate its effectiveness in achieving 100% trigger removal accuracy. Furthermore, SAU achieves a CLIP score of 0.7023, outperforming existing methods while preserving the model's ability to generate high-quality, semantically aligned images. Our results show that SAU is a robust, scalable, and practical solution for securing text-to-image diffusion models against backdoor attacks.
Authors: Giovanni Catalani, Michael Bauerheim, Fr\'ed\'eric Tost, Xavier Bertrand, Joseph Morlier
Abstract: Recent advances in Neural Fields have enabled powerful, discretization-invariant methods for learning neural operators that approximate solutions of Partial Differential Equations (PDEs) on general geometries. Building on these developments, we introduce enf2enf, an encoder--decoder methodology for predicting steady-state Partial Differential Equations with non-parameterized geometric variability, based on recently proposed Equivariant Neural Field architectures. In enf2enf, input geometries are encoded into latent point cloud embeddings that inherently preserve geometric grounding and capture local phenomena. The resulting representations are then combined with global parameters and directly decoded into continuous output fields, thus efficiently modeling the coupling between geometry and physics. By leveraging the inductive biases of locality and translation invariance, our approach is able to capture fine-scale physical features as well as complex shape variations, thereby enhancing generalization and physical compliance. Extensive experiments on a high-fidelity aerodynamic dataset, a hyper-elastic material benchmark, and multi-element airfoil geometries, demonstrate that the proposed model achieves superior or competitive performance compared to state-of-the-art graph based, operator learning, and neural field methods. Notably, our method supports real time inference and zero-shot super-resolution, enabling efficient training on low-resolution meshes while maintaining high accuracy on full-scale discretizations.
Authors: Ali SaraerToosi, Avery Broderick
Abstract: The Event Horizon Telescope (EHT) enables the exploration of black hole accretion flows at event-horizon scales. Fitting ray-traced physical models to EHT observations requires the generation of synthetic images, a task that is computationally demanding. This study leverages \alinet, a generative machine learning model, to efficiently produce radiatively inefficient accretion flow (RIAF) images as a function of the specified physical parameters. \alinet has previously been shown to be able to interpolate black hole images and their associated physical parameters after training on a computationally tractable set of library images. We utilize this model to estimate the uncertainty introduced by a number of anticipated unmodeled physical effects, including interstellar scattering and intrinsic source variability. We then use this to calibrate physical parameter estimates and their associated uncertainties from RIAF model fits to mock EHT data via a library of general relativistic magnetohydrodynamics models.
Authors: Letian Huang, Dongwei Ye, Jialin Dan, Chengzhi Tao, Huiwen Liu, Kun Zhou, Bo Ren, Yuanqi Li, Yanwen Guo, Jie Guo
Abstract: The emergence of neural and Gaussian-based radiance field methods has led to considerable advancements in novel view synthesis and 3D object reconstruction. Nonetheless, specular reflection and refraction continue to pose significant challenges due to the instability and incorrect overfitting of radiance fields to high-frequency light variations. Currently, even 3D Gaussian Splatting (3D-GS), as a powerful and efficient tool, falls short in recovering transparent objects with nearby contents due to the existence of apparent secondary ray effects. To address this issue, we propose TransparentGS, a fast inverse rendering pipeline for transparent objects based on 3D-GS. The main contributions are three-fold. Firstly, an efficient representation of transparent objects, transparent Gaussian primitives, is designed to enable specular refraction through a deferred refraction strategy. Secondly, we leverage Gaussian light field probes (GaussProbe) to encode both ambient light and nearby contents in a unified framework. Thirdly, a depth-based iterative probes query (IterQuery) algorithm is proposed to reduce the parallax errors in our probe-based framework. Experiments demonstrate the speed and accuracy of our approach in recovering transparent objects from complex environments, as well as several applications in computer graphics and vision.
Authors: Xiren Zhou, Shikang Liu, Xinyu Yan, Yizhan Fan, Xiangyu Wang, Yu Kang, Jian Cheng, Huanhuan Chen
Abstract: Urban roads and infrastructure, vital to city operations, face growing threats from subsurface anomalies like cracks and cavities. Ground Penetrating Radar (GPR) effectively visualizes underground conditions by employing electromagnetic (EM) waves; however, accurate anomaly detection via GPR remains challenging due to limited labeled data, varying subsurface conditions, and indistinct target boundaries. Although visually image-like, GPR data fundamentally represent EM waves, with variations within and between waves critical for identifying anomalies. To address these challenges, we propose the Reservoir-enhanced Segment Anything Model (Res-SAM), an innovative framework exploiting both the visual discernibility and the wave-changing properties of GPR data. Res-SAM initially identifies apparent candidate anomaly regions given minimal prompts, and further refines them by analyzing anomaly-induced changes within and between EM waves in local GPR data, enabling precise and complete anomaly region extraction and category determination. Real-world experiments demonstrate that Res-SAM achieves high detection accuracy (>85%) and outperforms state-of-the-art methods. Notably, Res-SAM requires only minimal accessible non-target data, avoids intensive training, and incorporates simple human interaction to enhance reliability. Our research provides a scalable, resource-efficient solution for rapid subsurface anomaly detection across diverse environments, improving urban safety monitoring while reducing manual effort and computational cost.
Authors: Jiayi Chen, Yubin Ke, Lin Peng, He Wang
Abstract: Generalizable dexterous grasping with suitable grasp types is a fundamental skill for intelligent robots. Developing such skills requires a large-scale and high-quality dataset that covers numerous grasp types (i.e., at least those categorized by the GRASP taxonomy), but collecting such data is extremely challenging. Existing automatic grasp synthesis methods are often limited to specific grasp types or object categories, hindering scalability. This work proposes an efficient pipeline capable of synthesizing contact-rich, penetration-free, and physically plausible grasps for any grasp type, object, and articulated hand. Starting from a single human-annotated template for each hand and grasp type, our pipeline tackles the complicated synthesis problem with two stages: optimize the object to fit the hand template first, and then locally refine the hand to fit the object in simulation. To validate the synthesized grasps, we introduce a contact-aware control strategy that allows the hand to apply the appropriate force at each contact point to the object. Those validated grasps can also be used as new grasp templates to facilitate future synthesis. Experiments show that our method significantly outperforms previous type-unaware grasp synthesis baselines in simulation. Using our algorithm, we construct a dataset containing 10.7k objects and 9.5M grasps, covering 31 grasp types in the GRASP taxonomy. Finally, we train a type-conditional generative model that successfully performs the desired grasp type from single-view object point clouds, achieving an 82.3% success rate in real-world experiments. Project page: https://pku-epic.github.io/Dexonomy.
Authors: Marco Mezzina, Pieter De Backer, Tom Vercauteren, Matthew Blaschko, Alexandre Mottrie, Tinne Tuytelaars
Abstract: Purpose: Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events, functioning as a building block for efficient video review, surgical education, as well as skill assessment. Previous research has focused on short and linear surgical procedures and has not explored whether temporal context influences experts' ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure. Methods: Urologists of varying expertise were grouped and tasked to indicate the surgical phase for RAPN on both single frames and video snippets using a custom-made web platform. Participants reported their confidence levels and the visual landmarks used in their decision-making. AI architectures with and without temporal context, as trained and benchmarked on the Cholec80 dataset, were subsequently trained on this RAPN dataset. Results: Video snippets and the presence of specific visual landmarks improved phase classification accuracy across all groups. Surgeons displayed high confidence in their classifications and outperformed novices, who struggled to discriminate between phases. The performance of the AI models is comparable to that of the surgeons in the survey, with improvements in both cases when temporal context was incorporated. Conclusion: SPR is an inherently complex task for expert surgeons and computer vision, where both perform equally well when given the same context. Performance increases when temporal information is provided. Surgical tools and organs form the key landmarks for human interpretation and are expected to shape the future of automated SPR.
Authors: Gal Almog, Ariel Shamir, Ohad Fried
Abstract: While latent diffusion models achieve impressive image editing results, their application to iterative editing of the same image is severely restricted. When trying to apply consecutive edit operations using current models, they accumulate artifacts and noise due to repeated transitions between pixel and latent spaces. Some methods have attempted to address this limitation by performing the entire edit chain within the latent space, sacrificing flexibility by supporting only a limited, predetermined set of diffusion editing operations. We present a RE-encode decode (REED) training scheme for variational autoencoders (VAEs), which promotes image quality preservation even after many iterations. Our work enables multi-method iterative image editing: users can perform a variety of iterative edit operations, with each operation building on the output of the previous one using both diffusion-based operations and conventional editing techniques. We demonstrate the advantage of REED-VAE across a range of image editing scenarios, including text-based and mask-based editing frameworks. In addition, we show how REED-VAE enhances the overall editability of images, increasing the likelihood of successful and precise edit operations. We hope that this work will serve as a benchmark for the newly introduced task of multi-method image editing. Our code and models will be available at https://github.com/galmog/REED-VAE
Authors: Delun Lai, Yeyubei Zhang, Yunchong Liu, Chaojie Li, Huadong Mo
Abstract: This paper introduces a novel deep learning-based multimodal fusion architecture aimed at enhancing the perception capabilities of autonomous navigation robots in complex environments. By utilizing innovative feature extraction modules, adaptive fusion strategies, and time-series modeling mechanisms, the system effectively integrates RGB images and LiDAR data. The key contributions of this work are as follows: a. the design of a lightweight feature extraction network to enhance feature representation; b. the development of an adaptive weighted cross-modal fusion strategy to improve system robustness; and c. the incorporation of time-series information modeling to boost dynamic scene perception accuracy. Experimental results on the KITTI dataset demonstrate that the proposed approach increases navigation and positioning accuracy by 3.5% and 2.2%, respectively, while maintaining real-time performance. This work provides a novel solution for autonomous robot navigation in complex environments.
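The adaptive weighted cross-modal fusion strategy can be sketched generically: a small gating network predicts per-sample weights for the RGB and LiDAR features before combining them. Dimensions and the gate design are illustrative assumptions, not the paper's architecture.

# Sketch: adaptive weighted fusion of RGB and LiDAR features via a learned gate
# (layer sizes are illustrative; not the paper's architecture).
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, rgb_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, lidar_feat: (batch, dim) per-frame features from each modality
        w = self.gate(torch.cat([rgb_feat, lidar_feat], dim=-1))   # (batch, 2)
        return w[:, :1] * rgb_feat + w[:, 1:] * lidar_feat

fused = AdaptiveFusion()(torch.randn(4, 128), torch.randn(4, 128))  # (4, 128)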
Authors: Xueqi Ma, Yilin Liu, Tianlong Gao, Qirui Huang, Hui Huang
Abstract: We introduce CLR-Wire, a novel framework for 3D curve-based wireframe generation that integrates geometry and topology into a unified Continuous Latent Representation. Unlike conventional methods that decouple vertices, edges, and faces, CLR-Wire encodes curves as Neural Parametric Curves along with their topological connectivity into a continuous and fixed-length latent space using an attention-driven variational autoencoder (VAE). This unified approach facilitates joint learning and generation of both geometry and topology. To generate wireframes, we employ a flow matching model to progressively map Gaussian noise to these latents, which are subsequently decoded into complete 3D wireframes. Our method provides fine-grained modeling of complex shapes and irregular topologies, and supports both unconditional generation and generation conditioned on point cloud or image inputs. Experimental results demonstrate that, compared with state-of-the-art generative approaches, our method achieves substantial improvements in accuracy, novelty, and diversity, offering an efficient and comprehensive solution for CAD design, geometric reconstruction, and 3D content creation.
Authors: Lei Zhong, Chuan Guo, Yiming Xie, Jiawei Wang, Changjian Li
Abstract: Storyboarding is widely used for creating 3D animations. Animators use the 2D sketches in storyboards as references to craft the desired 3D animations through a trial-and-error process. The traditional approach requires exceptional expertise and is both labor-intensive and time-consuming. Consequently, there is a high demand for automated methods that can directly translate 2D storyboard sketches into 3D animations. This task is under-explored to date; inspired by the significant advancements of motion diffusion models, we propose to address it from the perspective of conditional motion synthesis. We thus present Sketch2Anim, composed of two key modules for sketch constraint understanding and motion generation. Specifically, due to the large domain gap between the 2D sketch and 3D motion, instead of directly conditioning on 2D inputs, we design a 3D conditional motion generator that simultaneously leverages 3D keyposes, joint trajectories, and action words, to achieve precise and fine-grained motion control. Then, we introduce a neural mapper dedicated to aligning user-provided 2D sketches with their corresponding 3D keyposes and trajectories in a shared embedding space, enabling, for the first time, direct 2D control of motion generation. Our approach successfully transfers storyboards into high-quality 3D motions and inherently supports direct 3D animation editing, thanks to the flexibility of our multi-conditional motion generator. Comprehensive experiments, evaluations, and a user perceptual study demonstrate the effectiveness of our approach.
Authors: Tristan Manchester, Adam Anders, Julio Spadotto, Hannah Eccleston, William Beavan, Hugues Arcis, Brian J. Connolly
Abstract: In situ synchrotron X-ray computed tomography enables dynamic material studies, but automated segmentation remains challenging due to complex imaging artefacts and limited training data. We present a methodology for deep learning-based segmentation by transforming high-quality ex situ laboratory data to train models for binary segmentation of in situ synchrotron data, demonstrated through copper oxide dissolution studies. Using a modified SegFormer architecture, our approach achieves high segmentation performance on unseen data while reducing processing time from hours to seconds per 3D dataset. The method maintains consistent performance over significant morphological changes during experiments, despite training only on static specimens. This methodology can be readily applied to diverse materials systems, accelerating the analysis of time-resolved tomographic data across scientific disciplines.
Authors: Ehsan Karami, Hamid Soltanian-Zadeh
Abstract: Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we have shown that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves model generalization in a baseline deep learning model for knee osteoarthritis (KOA) prediction. We trained and evaluated our model using MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrate a statistically significant improvement in classification accuracy across both domains, with our approach outperforming the baseline model.
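One of the reported changes, replacing batch normalization with instance normalization, can be applied generically with a utility like the sketch below; this recursive layer swap is a common pattern and is not taken from the authors' code.

# Sketch: generic utility that swaps BatchNorm2d layers for InstanceNorm2d in a
# CNN, one of the modifications described in the abstract (not the authors' code).
import torch.nn as nn

def batchnorm_to_instancenorm(module: nn.Module) -> nn.Module:
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.InstanceNorm2d(child.num_features, affine=True))
        else:
            batchnorm_to_instancenorm(child)   # recurse into nested submodules
    return module

# usage (assuming torchvision is installed):
# model = batchnorm_to_instancenorm(torchvision.models.resnet18())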
Authors: Taoyi Wang, Lijian Wang, Yihan Lin, Mingtao Ou, Yuguo Chen, Xinglong Ji, Rong Zhao
Abstract: Perception systems in robotics encounter significant challenges in high-speed and dynamic conditions when relying on traditional cameras, where motion blur can compromise spatial feature integrity and task performance. Brain-inspired vision sensors (BVS) have recently gained attention as an alternative, offering high temporal resolution with reduced bandwidth and power requirements. Here, we present the first quantitative evaluation framework for two representative classes of BVSs in variable-speed robotic sensing, including event-based vision sensors (EVS) that detect asynchronous temporal contrasts, and the primitive-based sensor Tianmouc that employs a complementary mechanism to encode both spatiotemporal changes and intensity. A unified testing protocol is established, including crosssensor calibrations, standardized testing platforms, and quality metrics to address differences in data modality. From an imaging standpoint, we evaluate the effects of sensor non-idealities, such as motion-induced distortion, on the capture of structural information. For functional benchmarking, we examine task performance in corner detection and motion estimation under different rotational speeds. Results indicate that EVS performs well in highspeed, sparse scenarios and in modestly fast, complex scenes, but exhibits performance limitations in high-speed, cluttered settings due to pixel-level bandwidth variations and event rate saturation. In comparison, Tianmouc demonstrates consistent performance across sparse and complex scenarios at various speeds, supported by its global, precise, high-speed spatiotemporal gradient samplings. These findings offer valuable insights into the applicationdependent suitability of BVS technologies and support further advancement in this area.
Authors: Mohamed Gado, Towhid Taliee, Muhammad Memon, Dmitry Ignatov, Radu Timofte
Abstract: Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.
Authors: Yunxuan Wang, Ray Yin, Yumei Tan, Hao Chen, Haiying Xia
Abstract: Diabetic retinopathy (DR), a serious ocular complication of diabetes, is one of the primary causes of vision loss among retinal vascular diseases. Deep learning methods have been extensively applied in the grading of diabetic retinopathy (DR). However, their performance declines significantly when applied to data outside the training distribution due to domain shifts. Domain generalization (DG) has emerged as a solution to this challenge. However, most existing DG methods overlook lesion-specific features, resulting in insufficient accuracy. In this paper, we propose a novel approach that enhances existing DG methods by incorporating structural priors, inspired by the observation that DR grading is heavily dependent on vessel and lesion structures. We introduce Low-rank Adaptive Structural Priors (LoASP), a plug-and-play framework designed for seamless integration with existing DG models. LoASP improves generalization by learning adaptive structural representations that are finely tuned to the complexities of DR diagnosis. Extensive experiments on eight diverse datasets validate its effectiveness in both single-source and multi-source domain scenarios. Furthermore, visualizations reveal that the learned structural priors intuitively align with the intricate architecture of the vessels and lesions, providing compelling insights into their interpretability and diagnostic relevance.
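The low-rank, plug-and-play flavor of the method can be related to the generic low-rank adapter mechanism sketched below, where a frozen linear layer is augmented with a trainable low-rank update; the rank, scaling, and initialization are illustrative assumptions, and this is not the LoASP implementation.

# Sketch: a low-rank adapter on top of a frozen linear layer, the generic
# mechanism behind low-rank adaptive plug-in modules (not the LoASP code).
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # keep the pretrained weights fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)            # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

adapted = LowRankAdapter(nn.Linear(64, 64))
out = adapted(torch.randn(2, 64))                 # (2, 64)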
Authors: Shuo Wang, Tong Ren, Nan Cheng, Li Zhang, Rong Wang
Abstract: Background: Coronary artery bypass grafting (CABG) planning requires advanced spatial visualization and consideration of coronary artery depth, calcification, and pericardial adhesions. Objective: To develop and evaluate a dynamic cardiovascular holographic visualization tool for preoperative CABG planning. Methods: Using 4D cardiac computed tomography angiography data from 14 CABG candidates, we developed a semi-automated workflow for time-resolved segmentation of cardiac structures, epicardial adipose tissue (EAT), and coronary arteries with calcium scoring. The workflow incorporated methods for cardiac segmentation, coronary calcification quantification, visualization of coronary depth within EAT, and pericardial adhesion assessment through motion analysis. Dynamic cardiovascular holograms were displayed using the Looking Glass platform. Thirteen cardiac surgeons evaluated the tool using a Likert scale. Additionally, pericardial adhesion scores from holograms of 21 patients (including seven undergoing secondary cardiac surgeries) were compared with intraoperative findings. Results: Surgeons rated the visualization tool highly for preoperative planning utility (mean Likert score: 4.57/5.0). Hologram-based pericardial adhesion scoring strongly correlated with intraoperative findings (r=0.786, P<0.001). Conclusion: This study establishes a visualization framework for CABG planning that produces clinically relevant dynamic holograms from patient-specific data, with clinical feedback confirming its effectiveness for preoperative planning.
Authors: Maitreya Sonawane, Sumit Mamtani
Abstract: Making accurate weather predictions can be particularly challenging for localized storms or events that evolve on hourly timescales, such as thunderstorms. Hence, our goal for this project was to model weather nowcasting for making highly localized and accurate predictions that apply to the immediate future, replacing current numerical weather models and data assimilation systems with deep learning approaches. A significant advantage of machine learning is that inference is computationally cheap given an already-trained model, allowing forecasts that are nearly instantaneous and in the native high resolution of the input data. In this work we developed a novel method that employs Transformer-based machine learning models to forecast precipitation. This approach works by leveraging axial attention mechanisms to learn complex patterns and dynamics from time series frames. Moreover, it is a generic framework and can be applied to univariate and multivariate time series data, as well as time series embeddings data. This paper presents initial research on the dataset used in the domain of next-frame prediction; we demonstrate state-of-the-art results in terms of the metrics (PSNR = 47.67, SSIM = 0.9943) used for the given dataset using a UNet with Axial Transformer.
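Axial attention, the core mechanism referenced above, can be sketched as attention applied along the height and width axes separately instead of over all pixels jointly; the module below is a simplified illustration with assumed dimensions, not the paper's model.

# Sketch: axial attention over a feature map, attending along height and width
# separately instead of over all pixels jointly (simplified illustration).
import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, h, w, d = x.shape                                   # (batch, H, W, dim)
        rows = x.reshape(b * h, w, d)                          # attend along width
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, d)
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, d)      # attend along height
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, d).permute(0, 2, 1, 3)

out = AxialAttention2D(64)(torch.randn(2, 16, 16, 64))         # (2, 16, 16, 64)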
Authors: Lingrui Zhang, Liang Guo, Xiao An, Feng Lin, Binlong Zheng, Jiankun Wang, Zhirui Li
Abstract: Lumbar disc herniation (LDH) is a common musculoskeletal disease that requires magnetic resonance imaging (MRI) for effective clinical management. However, the interpretation of MRI images heavily relies on the expertise of radiologists, leading to delayed diagnosis and high costs for training physicians. Therefore, this paper proposes an innovative automated LDH classification framework. To address these key issues, the framework utilizes T1-weighted and T2-weighted MRI images from 205 individuals. The framework extracts clinically actionable LDH features and generates standardized diagnostic outputs by leveraging data augmentation and channel and spatial attention mechanisms. These outputs can help physicians make confident and time-effective care decisions when needed. The proposed framework achieves an area under the receiver operating characteristic curve (AUC-ROC) of 0.969 and an accuracy of 0.9486 for LDH detection. The experimental results demonstrate the performance of the proposed framework. Our framework requires only a small amount of training data to achieve high diagnostic accuracy, and is expected to serve as a solution for enhancing the LDH detection capabilities of primary hospitals.
Authors: Pietro Bongini, Sara Mandelli, Andrea Montibeller, Mirko Casu, Orazio Pontorno, Claudio Ragaglia, Luca Zanchetta, Mattia Aquilina, Taiba Majid Wani, Luca Guarnera, Benedetta Tondi, Paolo Bestagini, Irene Amerini, Francesco Denatale, Sebastiano Battiato, Mauro Barni
Abstract: Synthetic image source attribution is an open challenge, with an increasing number of image generators being released yearly. The complexity and the sheer number of available generative techniques, as well as the scarcity of high-quality open source datasets of diverse nature for this task, make training and benchmarking synthetic image source attribution models very challenging. WILD is a new in-the-Wild Image Linkage Dataset designed to provide a powerful training and benchmarking tool for synthetic image attribution models. The dataset is built out of a closed set of 10 popular commercial generators, which constitutes the training base of attribution models, and an open set of 10 additional generators, simulating a real-world in-the-wild scenario. Each generator is represented by 1,000 images, for a total of 10,000 images in the closed set and 10,000 images in the open set. Half of the images are post-processed with a wide range of operators. WILD allows benchmarking attribution models in a wide range of tasks, including closed and open set identification and verification, and robust attribution with respect to post-processing and adversarial attacks. Models trained on WILD are expected to benefit from the challenging scenario represented by the dataset itself. Moreover, an assessment of seven baseline methodologies on closed and open set attribution is presented, including robustness tests with respect to post-processing.
Authors: Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia
Abstract: Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs' usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85\% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders' capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.
Authors: Victoria Yue Chen, Daoye Wang, Stephan Garbin, Sebastian Winberg, Timo Bolkart, Thabo Beeler
Abstract: Face registration deforms a template mesh to closely fit a 3D face scan, the quality of which commonly degrades in non-skin regions (e.g., hair, beard, accessories), because the optimized template-to-scan distance pulls the template mesh towards the noisy scan surface. Improving registration quality requires a clean separation of skin and non-skin regions on the scan mesh. Existing image-based (2D) or scan-based (3D) segmentation methods however perform poorly. Image-based segmentation outputs multi-view inconsistent masks, and they cannot account for scan inaccuracies or scan-image misalignment, while scan-based methods suffer from lower spatial resolution compared to images. In this work, we introduce a novel method that accurately separates skin from non-skin geometry on 3D human head scans. For this, our method extracts features from multi-view images using a frozen image foundation model and aggregates these features in 3D. These lifted 2D features are then fused with 3D geometric features extracted from the scan mesh, to then predict a segmentation mask directly on the scan mesh. We show that our segmentations improve the registration accuracy over pure 2D or 3D segmentation methods by 8.89% and 14.3%, respectively. Although trained only on synthetic data, our model generalizes well to real data.
Authors: Claudia Drygala, Hanno Gottschalk, Thomas Kruse, S\'egol\`ene Martin, Annika M\"utze
Abstract: Brenier proved that under certain conditions on a source and a target probability measure there exists a strictly convex function such that its gradient is a transport map from the source to the target distribution. This function is called the Brenier potential. Furthermore, detailed information on the H\"older regularity of the Brenier potential is available. In this work we develop the statistical learning theory of generative adversarial neural networks that learn the Brenier potential. Since, by the transformation-of-densities formula, the density of the generated measure depends on the second derivative of the Brenier potential, we develop the universal approximation theory of ReCU networks with cubic activation $\mathtt{ReCU}(x)=\max\{0,x\}^3$ that combines the favorable approximation properties of H\"older functions with a Lipschitz continuous density. In order to assure the convexity of such general networks, we introduce an adversarial training procedure for a potential function represented by the ReCU networks that combines the classical discriminator cross entropy loss with a penalty term that enforces (strict) convexity. We give a detailed decomposition of learning errors and show that for a suitably high penalty parameter all networks chosen in the adversarial min-max optimization problem are strictly convex. This is further exploited to prove the consistency of the learning procedure for (slowly) expanding network capacity. We also implement the described learning algorithm and apply it to a number of standard test cases from Gaussian mixture to image data as target distributions. As predicted in theory, we observe that the convexity loss becomes inactive during the training process and the potentials represented by the neural networks have learned convexity.
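To make the construction more tangible, the sketch below defines the cubic ReCU activation from the abstract and a simple midpoint-convexity penalty on sampled point pairs; the penalty is an illustrative stand-in for the paper's convexity-enforcing term, not its exact form.

# Sketch: the ReCU activation (max{0, x}^3) and a midpoint-convexity penalty for
# a scalar potential network; the penalty is an illustrative stand-in, not the
# paper's exact convexity-enforcing term.
import torch
import torch.nn as nn

class ReCU(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(x, min=0.0) ** 3

potential = nn.Sequential(nn.Linear(2, 64), ReCU(), nn.Linear(64, 1))

def convexity_penalty(f: nn.Module, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # penalize violations of f((x + y) / 2) <= (f(x) + f(y)) / 2 on sampled pairs
    mid = f(0.5 * (x + y))
    chord = 0.5 * (f(x) + f(y))
    return torch.relu(mid - chord).mean()

x, y = torch.randn(128, 2), torch.randn(128, 2)
loss = convexity_penalty(potential, x, y)          # add to the adversarial objective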
Authors: Minjong Cheon
Abstract: Recent advances in AI-based weather forecasting models, such as FourCastNet, Pangu-Weather, and GraphCast, have demonstrated the remarkable ability of deep learning to emulate complex atmospheric dynamics. Building on this momentum, we propose Mj\"olnir, a novel deep learning-based framework for global lightning flash density parameterization. Trained on ERA5 atmospheric predictors and World Wide Lightning Location Network (WWLLN) observations at a daily temporal resolution and 1 degree spatial resolution, Mj\"olnir captures the nonlinear mapping between large-scale environmental conditions and lightning activity. The model architecture is based on the InceptionNeXt backbone with SENet, and a multi-task learning strategy to simultaneously predict lightning occurrence and magnitude. Extensive evaluations show that Mj\"olnir accurately reproduces the global distribution, seasonal variability, and regional characteristics of lightning activity, achieving a global Pearson correlation coefficient of 0.96 for annual mean fields. These results suggest that Mj\"olnir serves not only as an effective data-driven global lightning parameterization but also as a promising AI-based scheme for next-generation Earth system models (AI-ESMs).
Authors: Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, Soujanya Poria
Abstract: Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, our model is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.
Authors: Thanuja Uruththirakodeeswaran, Harald Becher, Michelle Noga, Lawrence H. Le, Pierre Boulanger, Jonathan Windram, Kumaradevan Punithakumar
Abstract: The perfect alignment of 3D echocardiographic images captured from various angles has improved image quality and broadened the field of view. This study proposes an accelerated sequential Monte Carlo (SMC) algorithm for 3D-3D rigid registration of transthoracic echocardiographic images with significant and limited overlap taken from apical window that is robust to the noise and intensity variation in ultrasound images. The algorithm estimates the translational and rotational components of the rigid transform through an iterative process and requires an initial approximation of the rotation and translation limits. We perform registration in two ways: the image-based registration computes the transform to align the end-diastolic frame of the apical nonstandard image to the apical standard image and applies the same transform to all frames of the cardiac cycle, whereas the mask-based registration approach uses the binary masks of the left ventricle in the same way. The SMC and exhaustive search (EX) algorithms were evaluated for 4D temporal sequences recorded from 7 volunteers who participated in a study conducted at the Mazankowski Alberta Heart Institute. The evaluations demonstrate that the mask-based approach of the accelerated SMC yielded a Dice score value of 0.819 +/- 0.045 for the left ventricle and gained 16.7x speedup compared to the CPU version of the SMC algorithm.
Authors: Sima Soltanpour, Rachel Utama, Arnold Chang, Md Taufiq Nasseef, Dan Madularu, Praveen Kulkarni, Craig Ferris, Chris Joslin
Abstract: Skull stripping is a common preprocessing step that is often performed manually in Magnetic Resonance Imaging (MRI) pipelines, including functional MRI (fMRI). This manual process is time-consuming and operator dependent. Automating this process is challenging for preclinical data due to variations in brain geometry, resolution, and tissue contrast. While existing methods for MRI skull stripping exist, they often struggle with the low resolution and varying slice sizes in preclinical fMRI data. This study proposes a novel method called SST-DUNet, which integrates a dense UNet-based architecture with a feature extractor based on the Smart Swin Transformer (SST) for fMRI skull stripping. The Smart Shifted Window Multi-Head Self-Attention (SSW-MSA) module in SST is adapted to replace the mask-based module in the Swin Transformer (ST), enabling the learning of distinct channel-wise features while focusing on relevant dependencies within brain structures. This modification allows the model to better handle the complexities of fMRI skull stripping, such as low resolution and variable slice sizes. To address the issue of class imbalance in preclinical data, a combined loss function using Focal and Dice loss is utilized. The model was trained on rat fMRI images and evaluated across three in-house datasets, achieving Dice similarity scores of 98.65%, 97.86%, and 98.04%. The fMRI results obtained through automatic skull stripping using the SST-DUNet model closely align with those from manual skull stripping for both seed-based and independent component analyses. These results indicate that the SST-DUNet can effectively substitute manual brain extraction in rat fMRI analysis.
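A minimal PyTorch sketch of the combined Focal + Dice loss mentioned above follows. The focal gamma/alpha values, the equal weighting of the two terms, and the smoothing constant are assumptions rather than the paper's exact settings.

```python
# Combined Focal + Dice loss for binary (brain vs. background) segmentation.
import torch
import torch.nn.functional as F

def focal_dice_loss(logits, target, gamma=2.0, alpha=0.25, smooth=1.0):
    """logits, target: (B, 1, H, W); target is a binary brain mask."""
    prob = torch.sigmoid(logits)
    # Focal term: down-weight easy, well-classified pixels.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    focal = (alpha_t * (1 - p_t) ** gamma * bce).mean()
    # Dice term: overlap between predicted and ground-truth masks.
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1 - ((2 * inter + smooth) / (denom + smooth)).mean()
    return focal + dice

if __name__ == "__main__":
    logits = torch.randn(2, 1, 64, 64)
    mask = (torch.rand(2, 1, 64, 64) > 0.7).float()
    print(focal_dice_loss(logits, mask).item())
```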
Authors: Anita Srbinovska, Angela Srbinovska, Vivek Senthil, Adrian Martin, John McCluskey, Ernest Fokou\'e
Abstract: This paper proposes a novel interdisciplinary framework for analyzing police body-worn camera (BWC) footage from the Rochester Police Department (RPD) using advanced artificial intelligence (AI) and statistical machine learning (ML) techniques. Our goal is to detect, classify, and analyze patterns of interaction between police officers and civilians to identify key behavioral dynamics, such as respect, disrespect, escalation, and de-escalation. We apply multimodal data analysis by integrating video, audio, and natural language processing (NLP) techniques to extract meaningful insights from BWC footage. We present our methodology, computational techniques, and findings, outlining a practical approach for law enforcement while advancing the frontiers of knowledge discovery from police BWC data.
Authors: Feng Chen, Fei Wu, Qi Wu, Zhiguo Wan
Abstract: The domain shift, coming from the non-negligible modality gap and non-overlapping identity classes between training and test sets, is a major issue in RGB-Infrared person re-identification. A key to tackling the inherent issue -- domain shift -- is to enforce the data distributions of the two domains to be similar. However, RGB-IR ReID always demands discriminative features, leading to over-reliance on the feature sensitivity of seen classes, \textit{e.g.}, via attention-based feature alignment or metric learning. Therefore, predicting the unseen query category from predefined training classes may not be accurate and leads to a sub-optimal adversarial gradient. In this paper, we examine this issue in a more explainable way and propose a novel multi-granularity memory regulation and alignment module (MG-MRA) to solve it. By explicitly incorporating a latent variable attribute, from fine-grained to coarse semantic granularity, into intermediate features, our method alleviates the model's over-confidence about discriminative features of seen classes. Moreover, instead of matching discriminative features by traversing nearest neighbors, sparse attributes, \textit{i.e.}, global structural patterns, are recollected with respect to features and assigned to measure pair-wise image similarity in hashing. Extensive experiments on RegDB \cite{RegDB} and SYSU-MM01 \cite{SYSU} show the superiority of the proposed method, which outperforms existing state-of-the-art methods. Our code is available at https://github.com/Chenfeng1271/MGMRA.
Authors: Zilun Zhang, Cuifeng Shen, Yuan Shen, Huixin Xiong, Xinyu Zhou
Abstract: Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, due to the limitation of the CLIP-like model's image input size (e.g., 224), subtle details are lost in the feature representation if we input high-resolution images (e.g., 2240). In this work, we introduce an efficient framework that can produce a single feature representation for a high-resolution image that injects image details and shares the same semantic space as the original CLIP. In the framework, we train a feature fusing model based on CLIP features extracted from a carefully designed image patch method that can cover objects of any scale, weakly supervised by image-agnostic class prompted queries. We validate our framework by retrieving images from class prompted queries on real-world and synthetic datasets, showing significant performance improvement on these tasks. Furthermore, to fully demonstrate our framework's detail retrieval ability, we construct a CLEVR-like synthetic dataset called CLVER-DS, which is fully annotated and has a controllable object scale.
Authors: Dongsheng Xu, Qingbao Huang, Xingmao Zhang, Haonan Cheng, Feng Shuang, Yi Cai
Abstract: OCR-based image captioning is an important but under-explored task, aiming to generate descriptions containing visual objects and scene text. Recent studies have made encouraging progress, but they still suffer from a lack of overall scene understanding and generate inaccurate captions. One possible reason is that current studies mainly focus on constructing the plane-level geometric relationship of scene text without depth information. This leads to insufficient scene text relational reasoning, so that models may describe scene text inaccurately. The other possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. In addition, they may ignore essential visual objects, leading to the scene text belonging to these ignored objects not being utilized. To address the above issues, we propose a Depth and Visual Concepts Aware Transformer (DEVICE) for OCR-based image captioning. Concretely, to construct three-dimensional geometric relations, we introduce depth information and propose a depth-enhanced feature updating module to ameliorate OCR token features. To generate more precise and comprehensive captions, we introduce semantic features of detected visual concepts as auxiliary information, and propose a semantic-guided alignment module to improve the model's ability to utilize visual concepts. Our DEVICE is capable of comprehending scenes more comprehensively and boosting the accuracy of described visual entities. Sufficient experiments demonstrate the effectiveness of our proposed DEVICE, which outperforms state-of-the-art models on the TextCaps test set.
Authors: Yi Xiao, Qiangqiang Yuan, Kui Jiang, Xianyu Jin, Jiang He, Liangpei Zhang, Chia-Wen Lin
Abstract: Optical-flow-based and kernel-based approaches have been extensively explored for temporal compensation in satellite Video Super-Resolution (VSR). However, these techniques are less generalized in large-scale or complex scenarios, especially in satellite videos. In this paper, we propose to exploit the well-defined temporal difference for efficient and effective temporal compensation. To fully utilize the local and global temporal information within frames, we systematically modeled the short-term and long-term temporal discrepancies since we observed that these discrepancies offer distinct and mutually complementary properties. Specifically, we devise a Short-term Temporal Difference Module (S-TDM) to extract local motion representations from RGB difference maps between adjacent frames, which yields more clues for accurate texture representation. To explore the global dependency in the entire frame sequence, a Long-term Temporal Difference Module (L-TDM) is proposed, where the differences between forward and backward segments are incorporated and activated to guide the modulation of the temporal feature, leading to a holistic global compensation. Moreover, we further propose a Difference Compensation Unit (DCU) to enrich the interaction between the spatial distribution of the target frame and temporal compensated results, which helps maintain spatial consistency while refining the features to avoid misalignment. Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches. Code will be available at https://github.com/XY-boy/LGTD
Authors: Amogh Joshi, Adarsh Kosta, Wachirawit Ponghiran, Manish Nagaraj, Kaushik Roy
Abstract: The ability of resource-constrained biological systems such as fruitflies to perform complex and high-speed maneuvers in cluttered environments has been one of the prime sources of inspiration for developing vision-based autonomous systems. To emulate this capability, the perception pipeline of such systems must integrate information cues from tasks including optical flow and depth estimation, object detection and tracking, and segmentation, among others. However, the conventional approach of employing slow, synchronous inputs from standard frame-based cameras constrains these perception capabilities, particularly during high-speed maneuvers. Recently, event-based sensors have emerged as low latency and low energy alternatives to standard frame-based cameras for capturing high-speed motion, effectively speeding up perception and hence navigation. For coherence, all the perception tasks must be trained on the same input data. However, present-day datasets are curated mainly for a single or a handful of tasks and are limited in the rate of the provided ground truths. To address these limitations, we present Flying Event Dataset fOr Reactive behAviour (FEDORA) - a fully synthetic dataset for perception tasks, with raw data from frame-based cameras, event-based cameras, and Inertial Measurement Units (IMU), along with ground truths for depth, pose, and optical flow at a rate much higher than existing datasets.
Authors: Jiangpeng He, Xiaoyan Zhang, Luotao Lin, Jack Ma, Heather A. Eicher-Miller, Fengqing Zhu
Abstract: Deep learning-based food recognition has made significant progress in predicting food types from eating occasion images. However, two key challenges hinder real-world deployment: (1) continuously learning new food classes without forgetting previously learned ones, and (2) handling the long-tailed distribution of food images, where a few classes are common and many more are rare. To address these, food recognition methods should focus on long-tailed continual learning. In this work, we introduce a dataset that encompasses 186 American foods along with comprehensive annotations. We also introduce three new benchmark datasets, VFN186-LT, VFN186-INSULIN and VFN186-T2D, which reflect real-world food consumption for healthy populations, insulin takers and individuals with type 2 diabetes without taking insulin. We propose a novel end-to-end framework that improves the generalization ability for instance-rare food classes using a knowledge distillation-based predictor to avoid misalignment of representation during continual learning. Additionally, we introduce an augmentation technique by integrating class-activation-map (CAM) and CutMix to improve generalization on instance-rare food classes. Our method, evaluated on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2DM, shows significant improvements over existing methods. An ablation study highlights further performance enhancements, demonstrating its potential for real-world food recognition applications.
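The sketch below shows the standard CutMix building block referenced above. The paper additionally guides the pasted region with class-activation maps to target instance-rare classes; that CAM guidance is omitted here, so this is only vanilla CutMix under illustrative assumptions.

```python
# Vanilla CutMix: paste a random rectangle from a shuffled batch and mix labels.
import numpy as np
import torch

def cutmix(images, labels, alpha=1.0, rng=np.random):
    """images: (B, C, H, W); labels: (B,). Returns mixed images and the label pair."""
    lam = rng.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    H, W = images.shape[2:]
    # Sample a rectangular patch whose area is roughly (1 - lam) of the image.
    cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = rng.randint(H), rng.randint(W)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)
    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)   # adjust for clipping at borders
    return mixed, labels, labels[perm], lam      # mix the two losses with weight lam

if __name__ == "__main__":
    imgs, lbls = torch.randn(4, 3, 32, 32), torch.tensor([0, 1, 2, 3])
    mixed, la, lb, lam = cutmix(imgs, lbls)
    print(mixed.shape, la.tolist(), lb.tolist(), round(float(lam), 3))
```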
Authors: Jianjun Gao, Yi Wang, Kim-Hui Yap, Kratika Garg, Boon Siew Han
Abstract: Multiple pedestrian tracking is crucial for enhancing safety and efficiency in intelligent transport and autonomous driving systems by predicting movements and enabling adaptive decision-making in dynamic environments. It optimizes traffic flow, facilitates human interaction, and ensures compliance with regulations. However, it faces the challenge of tracking pedestrians in the presence of occlusion. Existing methods overlook effects caused by abnormal detections during partial occlusion. Subsequently, these abnormal detections can lead to inaccurate motion estimation, unreliable appearance features, and unfair association. To address these issues, we propose an adaptive occlusion-aware multiple pedestrian tracker, OccluTrack, to mitigate the effects caused by partial occlusion. Specifically, we first introduce a plug-and-play abnormal motion suppression mechanism into the Kalman Filter to adaptively detect and suppress outlier motions caused by partial occlusion. Second, we develop a pose-guided re-identification (Re-ID) module to extract discriminative part features for partially occluded pedestrians. Last, we develop a new occlusion-aware association method towards fair Intersection over Union (IoU) and appearance embedding distance measurement for occluded pedestrians. Extensive evaluation results demonstrate that our method outperforms state-of-the-art methods on MOTChallenge and DanceTrack datasets. Particularly, the performance improvements on IDF1 and ID Switches, as well as visualized results, demonstrate the effectiveness of our method in multiple pedestrian tracking.
Authors: Shuang Li, Jiaxu Leng, Ji Gan, Mengjingcheng Mo, Xinbo Gao
Abstract: Visible-Infrared Person Re-Identification (VI-ReID) plays a critical role in all-day surveillance systems. However, existing methods primarily focus on learning appearance features while overlooking body shape features, which not only complement appearance features but also exhibit inherent robustness to modality variations. Despite their potential, effectively integrating shape and appearance features remains challenging. Appearance features are highly susceptible to modality variations and background noise, while shape features often suffer from inaccurate infrared shape estimation due to the limitations of auxiliary models. To address these challenges, we propose the Shape-centered Representation Learning (ScRL) framework, which enhances VI-ReID performance by innovatively integrating shape and appearance features. Specifically, we introduce Infrared Shape Restoration (ISR) to restore inaccuracies in infrared body shape representations at the feature level by leveraging infrared appearance features. In addition, we propose Shape Feature Propagation (SFP), which enables the direct extraction of shape features from original images during inference with minimal computational complexity. Furthermore, we design Appearance Feature Enhancement (AFE), which utilizes shape features to emphasize shape-related appearance features while effectively suppressing identity-unrelated noise. Benefiting from the effective integration of shape and appearance features, ScRL demonstrates superior performance through extensive experiments. On the SYSU-MM01, HITSZ-VCM, and RegDB datasets, it achieves Rank-1 (mAP) accuracies of 76.1% (72.6%), 71.2% (52.9%), and 92.4% (86.7%), respectively, surpassing existing state-of-the-art methods.
Authors: Colton R. Crum, Adam Czajka
Abstract: Leveraging human perception in the training of convolutional neural networks (CNNs) has boosted the generalization capabilities of such models in open-set recognition tasks. One of the active research questions is where (in the model architecture or training pipeline) and how to efficiently incorporate the always limited human perceptual data into training strategies of models. In this paper, we introduce MENTOR (huMan pErceptioN-guided preTraining fOr increased geneRalization), which addresses this question through two unique rounds of training CNNs tasked with open-set anomaly detection. First, we train an autoencoder to learn human saliency maps given an input image, without any class labels. The autoencoder is thus tasked with discovering domain-specific salient features which mimic human perception. Second, we remove the decoder part, add a classification layer on top of the encoder, and train this new model conventionally, now using class labels. We show that MENTOR successfully raises the generalization performance across three different CNN backbones in a variety of anomaly detection tasks (demonstrated for detection of unknown iris presentation attacks, synthetically-generated faces, and anomalies in chest X-ray images) compared to traditional pretraining methods (e.g., sourcing the weights from ImageNet), as well as state-of-the-art methods that incorporate human perception guidance into training. In addition, we demonstrate that MENTOR can be flexibly applied to existing human perception-guided methods and subsequently increase their generalization with no architectural modifications.
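The two-round recipe described above can be sketched compactly: first regress human saliency maps with an autoencoder, then discard the decoder and fine-tune the encoder with a classification head. The tiny CNN, layer sizes, and losses below are illustrative assumptions rather than the backbones used in the paper.

```python
# Two-stage sketch: saliency-map autoencoding, then encoder reuse for classification.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
)

def stage1_saliency_step(images, saliency, opt):
    """Round 1: regress human saliency maps (no class labels)."""
    pred = decoder(encoder(images))
    loss = nn.functional.mse_loss(pred, saliency)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def build_stage2_classifier(num_classes=2):
    """Round 2: discard the decoder, add a classification head on the encoder."""
    return nn.Sequential(encoder, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(32, num_classes))

if __name__ == "__main__":
    imgs, sal = torch.randn(4, 3, 64, 64), torch.rand(4, 1, 64, 64)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    print("stage-1 loss:", stage1_saliency_step(imgs, sal, opt))
    print("stage-2 logits:", build_stage2_classifier()(imgs).shape)
```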
Authors: Nicolas Cherel, Andr\'es Almansa, Yann Gousseau, Alasdair Newson
Abstract: Video inpainting is the task of filling a region in a video in a visually convincing manner. It is very challenging due to the high dimensionality of the data and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Such models remain nonetheless very expensive to train and to perform inference with, which strongly reduces their applicability to videos and yields unreasonable computational loads. We show that in the case of video inpainting, thanks to the highly auto-similar nature of videos, the training data of a diffusion model can be restricted to the input video and still produce very satisfying results. With this internal learning approach, where the training data is limited to a single video, our lightweight models perform very well with only half a million parameters, in contrast to the very large networks with billions of parameters typically found in the literature. We also introduce a new method for efficient training and inference of diffusion models in the context of internal learning, by splitting the diffusion process into different learning intervals corresponding to different noise levels of the diffusion process. We show qualitative and quantitative results, demonstrating that our method reaches or exceeds state of the art performance in the case of dynamic textures and complex dynamic backgrounds.
Authors: Chun Tao, Timur Ibrayev, Kaushik Roy
Abstract: Despite considerable progress in image classification tasks, classification models seem unaffected by the images that significantly deviate from those that appear natural to human eyes. Specifically, while human perception can easily identify abnormal appearances or compositions in images, classification models overlook any alterations in the arrangement of object parts as long as they are present in any order, even if unnatural. Hence, this work exposes the vulnerability of having semantic and syntactic discrepancy in images (SSDI) in the form of corruptions that remove or shuffle image patches or present images in the form of puzzles. To address this vulnerability, we propose the concept of "image grammar", comprising "image semantics" and "image syntax". Image semantics pertains to the interpretation of parts or patches within an image, whereas image syntax refers to the arrangement of these parts to form a coherent object. We present a semi-supervised two-stage method for learning the image grammar of visual elements and environments solely from natural images. While the first stage learns the semantic meaning of individual object parts, the second stage learns how their relative arrangement constitutes an entire object. The efficacy of the proposed approach is then demonstrated by achieving SSDI detection rates ranging from 70% to 90% on corruptions generated from CelebA and SUN-RGBD datasets. Code is publicly available at: https://github.com/ChunTao1999/SSDI/
Authors: Xiaokang Wei, Zhuoman Liu, Ping Li, Yan Luximon
Abstract: We propose SIR, an efficient method to decompose differentiable shadows for inverse rendering on indoor scenes using multi-view data, addressing the challenges in accurately decomposing the materials and lighting conditions. Unlike previous methods that struggle with shadow fidelity in complex lighting environments, our approach explicitly learns shadows for enhanced realism in material estimation under unknown light positions. Utilizing posed HDR images as input, SIR employs an SDF-based neural radiance field for comprehensive scene representation. Then, SIR integrates a shadow term with a three-stage material estimation approach to improve SVBRDF quality. Specifically, SIR is designed to learn a differentiable shadow, complemented by BRDF regularization, to optimize inverse rendering accuracy. Extensive experiments on both synthetic and real-world indoor scenes demonstrate the superior performance of SIR over existing methods in both quantitative metrics and qualitative analysis. The significant decomposing ability of SIR enables sophisticated editing capabilities like free-view relighting, object insertion, and material replacement. The code and data are available at https://xiaokangwei.github.io/SIR/.
Authors: Xin Zhang, Tao Xiao, Gepeng Ji, Xuan Wu, Keren Fu, Qijun Zhao
Abstract: Camouflage poses challenges in distinguishing a static target, whereas any movement of the target can break this disguise. Existing video camouflaged object detection (VCOD) approaches take noisy motion estimation as input or model motion implicitly, restricting detection performance in complex dynamic scenes. In this paper, we propose a novel Explicit Motion handling and Interactive Prompting framework for VCOD, dubbed EMIP, which handles motion cues explicitly using a frozen pre-trained optical flow fundamental model. EMIP is characterized by a two-stream architecture for simultaneously conducting camouflaged segmentation and optical flow estimation. Interactions across the dual streams are realized through an interactive prompting scheme inspired by emerging visual prompt learning. Two learnable modules, i.e., the camouflaged feeder and motion collector, are designed to incorporate segmentation-to-motion and motion-to-segmentation prompts, respectively, and enhance the outputs of both streams. The prompt fed to the motion stream is learned by supervising optical flow in a self-supervised manner. Furthermore, we show that long-term historical information can also be incorporated as a prompt into EMIP and achieve more robust results with temporal consistency. Experimental results demonstrate that our EMIP achieves new state-of-the-art records on popular VCOD benchmarks. Our code is made publicly available at https://github.com/zhangxin06/EMIP.
Authors: Hongyuan Yu, Cheng Wan, Xiyang Dai, Mengchen Liu, Dongdong Chen, Bin Xiao, Yan Huang, Yuan Lu, Liang Wang
Abstract: Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require numerous trials by human experts. In this paper, we address the challenge of integrating multi-head self-attention into high-resolution representation CNNs efficiently by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly overhead in memory to maintain high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features but also finds the proper location for placing the multi-head self-attention module. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method that searches for the best hybrid combination of light-weight convolution layers and memory-efficient self-attention layers between branches from different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods in both semantic segmentation and panoptic segmentation tasks. Code and models are available at https://github.com/MarvinYu1995/HyCTAS.
Authors: Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter
Abstract: Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically produces human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution built upon the reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
Authors: Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, Kiyoharu Aizawa
Abstract: This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed $\textbf{Unsolvable Problem Detection (UPD)}$. Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM's ability to withhold answers when encountering unsolvable problems of MCQA, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases like answer-lacking or incompatible choices and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that most LMMs, even those that demonstrate adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked. A detailed analysis shows that LMMs have different bottlenecks, and that chain-of-thought and self-reflection improve performance for LMMs whose bottleneck lies in their LLM capability. We hope our insights will enhance the broader understanding and development of more reliable LMMs.
Authors: Ming Nie, Xinyue Cai, Hang Xu, Li Zhang
Abstract: Lane detection enables highly functional autonomous driving systems to understand driving scenes even in complex environments. In this paper, we work towards developing a generalized computer vision system able to detect lanes without using any annotation. We make the following contributions: (i) We illustrate how to perform unsupervised 3D lane segmentation by leveraging the distinctive intensity of lanes on the LiDAR point cloud frames, and then obtain the noisy lane labels in the 2D plane by projecting the 3D points; (ii) We propose a novel self-supervised training scheme, dubbed LaneCorrect, that automatically corrects the lane label by learning geometric consistency and instance awareness from the adversarial augmentations; (iii) With the self-supervised pre-trained model, we distill to train a student network for arbitrary target lane (e.g., TuSimple) detection without any human labels; (iv) We thoroughly evaluate our self-supervised method on four major lane detection benchmarks (including TuSimple, CULane, CurveLanes and LLAMAS) and demonstrate excellent performance compared with existing supervised counterparts, whilst showing more effective results on alleviating the domain gap, i.e., training on CULane and testing on TuSimple.
Authors: Yusuke Akamatsu, Terumi Umematsu, Hitoshi Imaoka, Shizuko Gomi, Hideo Tsurushima
Abstract: Daily monitoring of intra-personal facial changes associated with health and emotional conditions has great potential to be useful for medical, healthcare, and emotion recognition fields. However, the approach for capturing intra-personal facial changes is relatively unexplored due to the difficulty of collecting temporally changing face images. In this paper, we propose a facial representation learning method using synthetic images for comparing faces, called ComFace, which is designed to capture intra-personal facial changes. For effective representation learning, ComFace aims to acquire two feature representations, i.e., inter-personal facial differences and intra-personal facial changes. The key point of our method is the use of synthetic face images to overcome the limitations of collecting real intra-personal face images. Facial representations learned by ComFace are transferred to three extensive downstream tasks for comparing faces: estimating facial expression changes, weight changes, and age changes from two face images of the same individual. Our ComFace, trained using only synthetic data, achieves transfer performance comparable to or better than that of general pre-training and state-of-the-art representation learning methods trained using real images.
Authors: Fang Chen, Gourav Datta, Mujahid Al Rafi, Hyeran Jeon, Meng Tang
Abstract: The expansion of neural network sizes and the enhanced resolution of modern image sensors result in heightened memory and power demands to process modern computer vision models. In order to deploy these models in extremely resource-constrained edge devices, it is crucial to reduce their peak memory, which is the maximum memory consumed during the execution of a model. A naive approach to reducing peak memory is aggressive down-sampling of feature maps via pooling with large stride, which often results in unacceptable degradation in network performance. To mitigate this problem, we propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling. We apply our distillation method to multiple problems in computer vision, including image classification and diffusion-based image generation. For image classification, our method yields 4x-5x theoretical peak memory reduction with less degradation in accuracy for most CNN-based architectures. For diffusion-based image generation, our proposed distillation method yields a denoising network with 4x lower theoretical peak memory while maintaining decent diversity and fidelity for image generation. Experiments demonstrate our method's superior performance compared to other feature-based and response-based distillation methods when applied to the same student network. The code is available at https://github.com/mengtang-lab/ReDistill.
Authors: Daniel Bogdoll, Jan Imhof, Tim Joseph, Svetlana Pavlitska, J. Marius Z\"ollner
Abstract: In autonomous driving, the most challenging scenarios can only be detected within their temporal context. Most video anomaly detection approaches focus either on surveillance or traffic accidents, which are only a subfield of autonomous driving. We present HF$^2$-VAD$_{AD}$, a variation of the HF$^2$-VAD surveillance video anomaly detection method for autonomous driving. We learn a representation of normality from a vehicle's ego perspective and evaluate pixel-wise anomaly detections in rare and critical scenarios.
Authors: Jun Bai, Di Wu, Tristan Shelley, Peter Schubel, David Twine, John Russell, Xuesen Zeng, Ji Zhang
Abstract: Material defects (MD) represent a primary challenge affecting product performance and giving rise to safety issues in related products. The rapid and accurate identification and localization of MD constitute crucial research endeavors in addressing contemporary challenges associated with MD. In recent years, propelled by the swift advancement of machine learning (ML) technologies, particularly exemplified by deep learning, ML has swiftly emerged as the core technology and a prominent research direction for material defect detection (MDD). Through a comprehensive review of the latest literature, we systematically categorize the ML techniques applied in MDD into five groups: unsupervised learning, supervised learning, semi-supervised learning, reinforcement learning, and generative learning. We provide a detailed analysis of the main principles and techniques used, together with the advantages and potential challenges associated with these techniques. Furthermore, the survey focuses on the techniques for defect detection in composite materials, which are important types of materials enjoying increasingly wide application in various industries such as aerospace, automotive, construction, and renewable energy. Finally, the survey explores potential future directions in MDD utilizing ML technologies. This survey consolidates ML-based MDD literature and provides a foundation for future research and practice.
Authors: Md Saiful Islam, Tariq Adnan, Jan Freyberg, Sangwu Lee, Abdelrahman Abdelkader, Meghan Pawlik, Cathe Schwartz, Karen Jaffe, Ruth B. Schneider, E Ray Dorsey, Ehsan Hoque
Abstract: Limited accessibility to neurological care leads to underdiagnosed Parkinson's Disease (PD), preventing early intervention. Existing AI-based PD detection methods primarily focus on unimodal analysis of motor or speech tasks, overlooking the multifaceted nature of the disease. To address this, we introduce a large-scale, multi-task video dataset consisting of 1102 sessions (each containing videos of finger tapping, facial expression, and speech tasks captured via webcam) from 845 participants (272 with PD). We propose a novel Uncertainty-calibrated Fusion Network (UFNet) that leverages this multimodal data to enhance diagnostic accuracy. UFNet employs independent task-specific networks, trained with Monte Carlo Dropout for uncertainty quantification, followed by self-attended fusion of features, with attention weights dynamically adjusted based on task-specific uncertainties. To ensure patient-centered evaluation, the participants were randomly split into three sets: 60% for training, 20% for model selection, and 20% for final performance evaluation. UFNet significantly outperformed single-task models in terms of accuracy, area under the ROC curve (AUROC), and sensitivity while maintaining non-inferior specificity. Withholding uncertain predictions further boosted the performance, achieving 88.0+-0.3% accuracy, 93.0+-0.2% AUROC, 79.3+-0.9% sensitivity, and 92.6+-0.3% specificity, at the expense of not being able to predict for 2.3+-0.3% of the data (+- denotes the 95% confidence interval). Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. Requiring only a webcam and microphone, our approach facilitates accessible home-based PD screening, especially in regions with limited healthcare resources.
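A toy sketch of the two ingredients named above, Monte Carlo Dropout uncertainty estimation and an uncertainty-aware fusion of task-specific scores, is given below. The inverse-variance weighting rule and the tiny per-task networks are illustrative assumptions, not the exact UFNet attention mechanism.

```python
# MC Dropout uncertainty estimation + uncertainty-weighted fusion of task scores.
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=20):
    """Keep dropout active at inference and return the mean and variance of outputs."""
    model.train()                      # enables dropout layers at test time
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)

def uncertainty_weighted_fusion(means, variances, eps=1e-6):
    """Fuse per-task probabilities with inverse-variance weights."""
    weights = torch.stack([1.0 / (v + eps) for v in variances])
    weights = weights / weights.sum(0)
    return (weights * torch.stack(means)).sum(0)

if __name__ == "__main__":
    tasks = {name: nn.Sequential(nn.Linear(8, 16), nn.ReLU(),
                                 nn.Dropout(0.3), nn.Linear(16, 1))
             for name in ["tapping", "face", "speech"]}       # hypothetical task heads
    features = {name: torch.randn(5, 8) for name in tasks}
    means, variances = zip(*[mc_dropout_predict(m, features[n])
                             for n, m in tasks.items()])
    print(uncertainty_weighted_fusion(list(means), list(variances)).squeeze(-1))
```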
Authors: Zuo-Liang Zhu, Beibei Wang, Jian Yang
Abstract: 3D Gaussian Splatting (3DGS) has shown a powerful capability for novel view synthesis due to its detailed expressive ability and highly efficient rendering speed. Unfortunately, creating relightable 3D assets and reconstructing faithful geometry with 3DGS is still problematic, particularly for reflective objects, as its discontinuous representation raises difficulties in constraining geometries. Volumetric signed distance field (SDF) methods provide robust geometry reconstruction, while the expensive ray marching hinders its real-time application and slows the training. Besides, these methods struggle to capture sharp geometric details. To this end, we propose to guide 3DGS and SDF bidirectionally in a complementary manner, including an SDF-aided Gaussian splatting for efficient optimization of the relighting model and a GS-guided SDF enhancement for high-quality geometry reconstruction. At the core of our SDF-aided Gaussian splatting is the mutual supervision of the depth and normal between blended Gaussians and SDF, which avoids the expensive volume rendering of SDF. Thanks to this mutual supervision, the learned blended Gaussians are well-constrained with a minimal time cost. As the Gaussians are rendered in a deferred shading mode, the alpha-blended Gaussians are smooth, while individual Gaussians may still be outliers, yielding floater artifacts. Therefore, we introduce an SDF-aware pruning strategy to remove Gaussian outliers located distant from the surface defined by SDF, avoiding floater issue. This way, our GS framework provides reasonable normal and achieves realistic relighting, while the mesh from depth is still problematic. Therefore, we design a GS-guided SDF refinement, which utilizes the blended normal from Gaussians to finetune SDF. With this enhancement, our method can further provide high-quality meshes for reflective objects at the cost of 17% extra training time.
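The SDF-aware pruning step described above can be illustrated with a few lines: discard Gaussians whose centers lie far from the zero level set of the signed distance field. The analytic sphere SDF and the distance threshold below are arbitrary assumptions standing in for the learned SDF.

```python
# SDF-aware pruning sketch: keep Gaussians near the surface defined by an SDF.
import numpy as np

def sphere_sdf(points, center=np.zeros(3), radius=1.0):
    """Analytic stand-in for a learned SDF: distance to a unit sphere."""
    return np.linalg.norm(points - center, axis=-1) - radius

def prune_gaussians(centers, sdf_fn, tau=0.05):
    """Keep only Gaussians within distance tau of the zero level set."""
    keep = np.abs(sdf_fn(centers)) < tau
    return centers[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(-1.5, 1.5, size=(10000, 3))   # candidate Gaussian centers
    kept, mask = prune_gaussians(centers, sphere_sdf)
    print(f"kept {mask.sum()} of {len(centers)} Gaussians near the surface")
```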
Authors: Pei-Kai Huang, Tzu-Hsien Chen, Ya-Ting Chan, Kuan-Wen Chen, Chiou-Ting Hsu
Abstract: Remote Photoplethysmography (rPPG) aims to measure physiological signals and Heart Rate (HR) from facial videos. Recent unsupervised rPPG estimation methods have shown promising potential in estimating rPPG signals from facial regions without relying on ground truth rPPG signals. However, these methods seem oblivious to interference existing in rPPG signals and still result in unsatisfactory performance. In this paper, we propose a novel De-interfered and Descriptive rPPG Estimation Network (DD-rPPGNet) to eliminate the interference within rPPG features for learning genuine rPPG signals. First, we investigate the characteristics of local spatial-temporal similarities of interference and design a novel unsupervised model to estimate the interference. Next, we propose an unsupervised de-interfered method to learn genuine rPPG signals with two stages. In the first stage, we estimate the initial rPPG signals by contrastive learning from both the training data and their augmented counterparts. In the second stage, we use the estimated interference features to derive de-interfered rPPG features and encourage the rPPG signals to be distinct from the interference. In addition, we propose an effective descriptive rPPG feature learning by developing a strong 3D Learnable Descriptive Convolution (3DLDC) to capture the subtle chrominance changes for enhancing rPPG estimation. Extensive experiments conducted on five rPPG benchmark datasets demonstrate that the proposed DD-rPPGNet outperforms previous unsupervised rPPG estimation methods and achieves competitive performances with state-of-the-art supervised rPPG methods. The code is available at: https://github.com/Pei-KaiHuang/TIFS2025-DD-rPPGNet
Authors: Xin Liu, Weijia Zhang, Min-Ling Zhang
Abstract: Although attention-based multi-instance learning (MIL) algorithms have achieved impressive performance on slide-level whole slide image (WSI) classification tasks, they are prone to mistakenly focusing on irrelevant patterns such as staining conditions and tissue morphology, leading to incorrect patch-level predictions and unreliable interpretability. In this paper, we analyze why attention-based methods tend to rely on spurious correlations in their predictions. Furthermore, we revisit max-pooling-based approaches and examine the reasons behind the underperformance of existing methods. We argue that well-trained max-pooling-based MIL models can make predictions based on causal factors and avoid relying on spurious correlations. Building on these insights, we propose a simple yet effective max-pooling-based MIL method (FocusMIL) that outperforms existing mainstream attention-based methods on two datasets. In this position paper, we advocate renewed attention to max-pooling-based methods to achieve more robust and interpretable predictions.
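A minimal PyTorch sketch of max-pooling multi-instance learning for WSI bags follows: every patch is scored independently and the highest-scoring patch drives the bag prediction. The feature dimension and the linear instance classifier are illustrative assumptions, not the full FocusMIL model.

```python
# Max-pooling MIL: bag prediction = maximum of per-instance (patch) scores.
import torch
import torch.nn as nn

class MaxPoolMIL(nn.Module):
    def __init__(self, feat_dim=512, num_classes=2):
        super().__init__()
        self.instance_classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, bag_feats):
        """bag_feats: (num_patches, feat_dim) for a single slide."""
        instance_logits = self.instance_classifier(bag_feats)   # per-patch scores
        bag_logits, top_idx = instance_logits.max(dim=0)         # max-pooling over instances
        return bag_logits, instance_logits, top_idx

if __name__ == "__main__":
    model = MaxPoolMIL()
    bag = torch.randn(300, 512)                  # 300 patch embeddings from one WSI
    bag_logits, inst_logits, top_idx = model(bag)
    loss = nn.functional.cross_entropy(bag_logits.unsqueeze(0), torch.tensor([1]))
    print(bag_logits.shape, top_idx.tolist(), round(loss.item(), 3))
```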
Authors: Andrew Kiruluta, Eric Lundy, Andreas Lemos
Abstract: We present a unified change detection pipeline that combines instance-level masking, multi-scale attention within a denoising diffusion model, and per-pixel semantic classification, all refined via SSIM to match human perception. By first isolating only temporally novel objects with Mask R-CNN, then guiding diffusion updates through hierarchical cross attention to object and global contexts, and finally categorizing each pixel into one of C change types, our method delivers detailed, interpretable multi-class maps. It outperforms traditional differencing, Siamese CNNs, and GAN-based detectors by 10-25 points in F1 and IoU on both synthetic and real-world benchmarks, marking a new state of the art in remote sensing change detection.
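The SSIM-based refinement mentioned above can be sketched as keeping candidate change pixels only where local structural similarity between the two dates is low. The threshold, the grayscale inputs, and the simple AND rule are assumptions; skimage supplies the per-pixel SSIM map.

```python
# SSIM-guided refinement of a raw binary change map.
import numpy as np
from skimage.metrics import structural_similarity

def ssim_refine_change_map(img_t0, img_t1, raw_change, ssim_thresh=0.6):
    """img_t0, img_t1: float grayscale images in [0, 1]; raw_change: binary mask."""
    _, ssim_map = structural_similarity(img_t0, img_t1, data_range=1.0, full=True)
    # Keep a pixel as "changed" only if the detector flagged it AND the
    # local structure actually differs between the two acquisitions.
    return raw_change & (ssim_map < ssim_thresh)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t0 = rng.random((128, 128))
    t1 = t0.copy()
    t1[40:60, 40:60] = rng.random((20, 20))          # simulated new object
    raw = np.zeros_like(t0, dtype=bool)
    raw[30:70, 30:70] = True                         # over-detected candidate region
    refined = ssim_refine_change_map(t0, t1, raw)
    print("raw pixels:", int(raw.sum()), "refined pixels:", int(refined.sum()))
```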
Authors: Xujie Wan, Wenjie Li, Guangwei Gao, Huimin Lu, Jian Yang, Chia-Wen Lin
Abstract: Recently, CNN and Transformer hybrid networks have demonstrated excellent performance in face super-resolution (FSR) tasks. Since hybrid networks contain numerous features at different scales, how to fuse these multi-scale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this and simply combine the Transformer and CNN. To address this issue, we propose an attention-guided Multi-scale interaction network (AMINet), which contains local and global feature interactions and encoder-decoder phase feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote fusions of global features and different receptive fields' local features extracted by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within LGFI and encoder-decoder phases. Our above design allows the free flow of multi-scale features within modules and between the encoder and decoder, which can promote the complementarity of different scale features to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference.
Authors: Qian Li, Shuojue Yang, Daiyun Shen, Jimmy Bok Yan So, Jing Qin, Yueming Jin
Abstract: High-fidelity reconstruction of surgical scene is a fundamentally crucial task to support many applications, such as intra-operative navigation and surgical education. However, most existing methods assume the ideal surgical scenarios - either focus on dynamic reconstruction with deforming tissue yet assuming a given fixed camera pose, or allow endoscope movement yet reconstructing the static scenes. In this paper, we target a more realistic yet challenging setup - free-pose reconstruction with a moving camera for highly dynamic surgical scenes. Meanwhile, we take the first step to introduce the Gaussian Splatting (GS) technique to tackle this challenging setting and propose a novel GS-based framework for fast reconstruction, termed \textit{Free-DyGS}. Concretely, our model embraces a novel scene initialization in which a pre-trained Sparse Gaussian Regressor (SGR) can efficiently parameterize the initial attributes. For each subsequent frame, we propose to jointly optimize the deformation model and 6D camera poses in a frame-by-frame manner, easing training given the limited deformation differences between consecutive frames. A Scene Expansion scheme is followed to expand the GS model for the unseen regions introduced by the moving camera. Moreover, the framework is equipped with a novel Retrospective Deformation Recapitulation (RDR) strategy to preserve the entire-clip deformations throughout the frame-by-frame training scheme. The efficacy of the proposed Free-DyGS is substantiated through extensive experiments on two datasets: StereoMIS and Hamlyn datasets. The experimental outcomes underscore that Free-DyGS surpasses other advanced methods in both rendering accuracy and efficiency. Code will be available.
Authors: Guanglei Zhou, Bhargav Korrapati, Gaurav Rajavendra Reddy, Chen-Chia Chang, Jingyu Pan, Jiang Hu, Yiran Chen, Dipto G. Thakurta
Abstract: Generating diverse VLSI layout patterns is essential for various downstream tasks in design for manufacturing, as design rules continually evolve during the development of new technology nodes. However, existing training-based methods for layout pattern generation rely on large datasets. In practical scenarios, especially when developing a new technology node, obtaining such extensive layout data is challenging. Consequently, training models with large datasets becomes impractical, limiting the scalability and adaptability of prior approaches. To this end, we propose PatternPaint, a diffusion-based framework capable of generating legal patterns with limited design-rule-compliant training samples. PatternPaint simplifies complex layout pattern generation into a series of inpainting processes with a template-based denoising scheme. Furthermore, we perform few-shot finetuning on a pretrained image foundation model with only 20 design-rule-compliant samples. Experimental results show that using a sub-3nm technology node (Intel 18A), our model is the only one that can generate legal patterns in complex 2D metal interconnect design rule settings among all previous works and achieves a high diversity score. Additionally, our few-shot finetuning can boost the legality rate with 1.87X improvement compared to the original pretrained model. As a result, we demonstrate a production-ready approach for layout pattern generation in developing new technology nodes.
Authors: Hugo Porta, Emanuele Dalsasso, Diego Marcos, Devis Tuia
Abstract: Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity between parts of the test image and the prototypes. This improves interpretability since the user can inspect the link between the predicted output and the patterns learned by the model in terms of prototypical information. In this paper, we propose a method for interpretable semantic segmentation that leverages multi-scale image representation for prototypical part learning. First, we introduce a prototype layer that explicitly learns diverse prototypical parts at several scales, leading to multi-scale representations in the prototype activation output. Then, we propose a sparse grouping mechanism that produces multi-scale sparse groups of these scale-specific prototypical parts. This provides a deeper understanding of the interactions between multi-scale object representations while enhancing the interpretability of the segmentation model. The experiments conducted on Pascal VOC, Cityscapes, and ADE20K demonstrate that the proposed method increases model sparsity, improves interpretability over existing prototype-based methods, and narrows the performance gap with the non-interpretable counterpart models. Code is available at github.com/eceo-epfl/ScaleProtoSeg.
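The core prototype activation described above can be sketched as a similarity map between pixel embeddings and a bank of learned prototype vectors. The cosine-similarity formulation, dimensions, and the omission of the multi-scale sparse grouping are illustrative assumptions rather than the paper's exact layer.

```python
# Prototype layer sketch: per-pixel cosine similarity to learned prototype vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeLayer(nn.Module):
    def __init__(self, feat_dim=64, num_prototypes=10):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))

    def forward(self, feats):
        """feats: (B, C, H, W) -> prototype activation maps (B, num_prototypes, H, W)."""
        f = F.normalize(feats, dim=1)
        p = F.normalize(self.prototypes, dim=1)
        return torch.einsum("bchw,pc->bphw", f, p)   # cosine similarity per pixel

if __name__ == "__main__":
    layer = PrototypeLayer()
    acts = layer(torch.randn(2, 64, 32, 32))
    print(acts.shape, round(acts.max().item(), 3))   # strongest prototype match
```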
Authors: Jiashuo Li, Shaokun Wang, Bo Qian, Yuhang He, Xing Wei, Qiang Wang, Yihong Gong
Abstract: Non-exemplar class Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch and storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, which provides a more flexible compositional solution while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. The PFR facilitates accurate classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on benchmark datasets in the NECIL setting, maintaining an optimal balance between computational complexity and accuracy.
Authors: Junchen Yu, Si-Yuan Cao, Runmin Zhang, Chenghao Zhang, Zhu Yu, Shujie Chen, Bailin Yang, Hui-liang Shen
Abstract: We propose a novel unsupervised cross-modal homography estimation learning framework, named Split Supervised Homography estimation Network (SSHNet). SSHNet reformulates the unsupervised cross-modal homography estimation into two supervised sub-problems, each addressed by its specialized network: a homography estimation network and a modality transfer network. To realize stable training, we introduce an effective split optimization strategy to train each network separately within its respective sub-problem. We also formulate an extra homography feature space supervision to enhance feature consistency, further boosting the estimation accuracy. Moreover, we employ a simple yet effective distillation training technique to reduce model parameters and improve cross-domain generalization ability while maintaining comparable performance. The training stability of SSHNet enables its cooperation with various homography estimation architectures. Experiments reveal that the SSHNet using IHN as the homography estimation network, namely SSHNet-IHN, outperforms previous unsupervised approaches by a significant margin. Even compared to the supervised approaches MHN and LocalTrans, SSHNet-IHN achieves 47.4% and 85.8% reductions in mean average corner error (MACE) on the challenging OPT-SAR dataset.
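For reference, the mean average corner error (MACE) used above can be computed by warping the four image corners with the ground-truth and estimated homographies and averaging the corner displacement; the sketch below shows this under an assumed batch layout.

```python
# Mean average corner error (MACE) between ground-truth and estimated homographies.
import numpy as np

def warp_corners(H, corners):
    pts = np.concatenate([corners, np.ones((4, 1))], axis=1) @ H.T
    return pts[:, :2] / pts[:, 2:3]

def mace(H_gt_batch, H_est_batch, width, height):
    corners = np.array([[0, 0], [width, 0], [width, height], [0, height]], float)
    errors = []
    for H_gt, H_est in zip(H_gt_batch, H_est_batch):
        err = np.linalg.norm(warp_corners(H_gt, corners) - warp_corners(H_est, corners), axis=1)
        errors.append(err.mean())              # average corner error for one image pair
    return float(np.mean(errors))              # mean over the evaluation set

if __name__ == "__main__":
    H_gt = np.array([[1.0, 0.02, 5.0], [0.01, 1.0, -3.0], [0.0, 0.0, 1.0]])
    H_est = H_gt + np.array([[0, 0, 1.0], [0, 0, 0.5], [0, 0, 0]])
    print(mace([H_gt], [H_est], 640, 480))
```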
Authors: Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, Xihui Liu
Abstract: Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize the 3D position embeddings to enhance the 2D CLIP Patches with 3D spatial context information and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D visual understanding and 3D scene understanding. In contrast to previous 3D LMMs, LLaVA-3D supports decoding accurate 3D spatial perception outputs, e.g., 3D bounding boxes, directly from these 3D patches, without relying on the time-consuming off-the-shelf 3D segmentors. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D visual understanding and vision-language conversation capabilities with LLaVA.
Authors: Shiben Liu, Qiang Wang, Huijie Fan, Weihong Ren, Baojie Fan, Yandong Tang
Abstract: Lifelong person re-identification (LReID) exhibits a contradictory relationship between intra-domain discrimination and inter-domain gaps when learning from continuous data. Intra-domain discrimination focuses on individual nuances (e.g., clothing type, accessories), while inter-domain gaps emphasize domain consistency. Achieving a trade-off between maximizing intra-domain discrimination and minimizing inter-domain gaps is a crucial challenge for improving LReID performance. Most existing methods strive to reduce inter-domain gaps through knowledge distillation to maintain domain consistency. However, they often ignore intra-domain discrimination. To address this challenge, we propose a novel domain consistency representation learning (DCR) model that explores global and attribute-wise representations as a bridge to balance intra-domain discrimination and inter-domain gaps. At the intra-domain level, we explore the complementary relationship between global and attribute-wise representations to improve discrimination among similar identities. Excessively learning intra-domain discrimination can lead to catastrophic forgetting. We further develop an attribute-oriented anti-forgetting (AF) strategy that explores attribute-wise representations to enhance inter-domain consistency, and propose a knowledge consolidation (KC) strategy to facilitate knowledge transfer. Extensive experiments show that our DCR model achieves superior performance compared to state-of-the-art LReID methods. Our code is publicly available at https://github.com/LiuShiBen/DCR.
Authors: Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez
Abstract: Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions. Through ablation studies, we demonstrated that object identification accuracy drops by over 70\% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in text-only language models for factual association tasks. These findings provide crucial insights into how VLMs process and integrate visual information, bridging the gap between our understanding of language and vision models, and paving the way for more interpretable and controllable multimodal systems.
Authors: Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, Di Zhang
Abstract: With the continuous progress of visual generation technologies, the scale of video datasets has grown exponentially. The quality of these datasets plays a pivotal role in the performance of video generation models. We assert that temporal splitting, detailed captions, and video quality filtering are three crucial determinants of dataset quality. However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. The essence of our approach lies in improving the consistency between fine-grained conditions and video content. Specifically, we employ a linear classifier on probability distributions to enhance the accuracy of transition detection, ensuring better temporal consistency. We then provide structured captions for the split videos, with an average length of 200 words, to improve text-video alignment. Additionally, we develop a Video Training Suitability Score (VTSS) that integrates multiple sub-metrics, allowing us to filter high-quality videos from the original corpus. Finally, we incorporate several metrics into the training process of the generation model, further refining the fine-grained conditions. Our experiments demonstrate the effectiveness of our data processing pipeline and the quality of the proposed Koala-36M dataset. Our dataset and code have been released at https://koala36m.github.io/.
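A toy sketch of transition (shot-boundary) detection with a linear classifier is given below. Here each frame is summarized by a normalized intensity histogram and the classifier scores adjacent-frame histogram differences; the histogram features and synthetic data are assumptions standing in for the probability distributions used in the paper.

```python
# Linear classifier over per-frame distribution differences for cut detection.
import numpy as np
from sklearn.linear_model import LogisticRegression

def frame_histogram(frame, bins=16):
    hist, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def boundary_features(frames):
    hists = np.stack([frame_histogram(f) for f in frames])
    return np.abs(hists[1:] - hists[:-1])       # one feature vector per adjacent pair

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic clip: a cut at index 50 switches the frame statistics.
    frames = np.concatenate([rng.beta(2, 5, (50, 64, 64)), rng.beta(5, 2, (50, 64, 64))])
    X = boundary_features(frames)
    y = np.zeros(len(X), dtype=int)
    y[49] = 1                                    # the pair (49, 50) spans the cut
    clf = LogisticRegression(class_weight="balanced").fit(X, y)
    print("predicted boundary at pair:", int(np.argmax(clf.predict_proba(X)[:, 1])))
```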
Authors: Evans Xu Han, Linghao Jin, Xiaofeng Liu, Paul Pu Liang
Abstract: Despite the impressive text-to-image (T2I) synthesis capabilities of diffusion models, they often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges by optimizing the cross-attention mechanism or learning from the caption pairs with minimal semantic changes. However, can we generate high-quality complex contrastive images that diffusion models can directly discriminate based on visual representations? In this work, we leverage large-language models (LLMs) to compose realistic, complex scenarios and harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair, consisting of 15k pairs of high-quality contrastive images. These pairs feature minimal visual discrepancies and cover a wide range of attribute categories, especially complex and natural scenarios. To learn effectively from these error cases, i.e., hard negative images, we propose EvoGen, a new multi-stage curriculum for contrastive learning of diffusion models. Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks.
Authors: Elena Sierra, Lauren E. Gillespie, Salim Soltani, Moises Exposito-Alonso, Teja Kattenborn
Abstract: Large-scale, volunteer-collected datasets of community-identified natural world imagery like iNaturalist have enabled marked performance gains for fine-grained visual classification of species using machine learning methods. However, such data -- sometimes referred to as citizen science data -- are opportunistic and lack a structured sampling strategy. This volunteer-collected biodiversity data contains geographic, temporal, taxonomic, observer, and sociopolitical biases that can have significant effects on biodiversity model performance, but whose impacts on fine-grained species recognition performance are unclear. Here we introduce Diversity Shift (DivShift), a framework for quantifying the effects of domain-specific distribution shifts on machine learning model performance. To diagnose the performance effects of biases specific to volunteer-collected biodiversity data, we also introduce DivShift - North American West Coast (DivShift-NAWC), a curated dataset of almost 7.5 million iNaturalist images across the western coast of North America partitioned across five types of expert-verified bias. We compare species recognition performance across these bias partitions using a variety of species- and ecosystem-focused accuracy metrics. We observe that these biases confound model performance less than expected from the underlying label distribution shift, and that more data leads to better model performance but the magnitude of these improvements is bias-specific. These findings imply that while the structure within natural world images provides generalization improvements for biodiversity monitoring tasks, the biases present in volunteer-collected biodiversity data can also affect model performance; thus these models should be used with caution in downstream biodiversity monitoring tasks.
Authors: Bo Gao, Jianhui Wang, Xinyuan Song, Yangfan He, Fangxu Xing, Tianyu Shi
Abstract: Current semantic segmentation models typically require a substantial amount of manually annotated data, a process that is both time-consuming and resource-intensive. Alternatively, leveraging advanced text-to-image models such as Midjourney and Stable Diffusion has emerged as an efficient strategy, enabling the automatic generation of synthetic data in place of manual annotations. However, previous methods have been limited to generating single-instance images, as the generation of multiple instances with Stable Diffusion has proven unstable. To address this limitation and expand the scope and diversity of synthetic datasets, we propose a framework \textbf{Free-Mask} that combines a Diffusion Model for segmentation with advanced image editing capabilities, allowing for the integration of multiple objects into images via text-to-image models. Our method facilitates the creation of highly realistic datasets that closely emulate open-world environments while generating accurate segmentation masks. It reduces the labor associated with manual annotation and also ensures precise mask generation. Experimental results demonstrate that synthetic data generated by \textbf{Free-Mask} enables segmentation models to outperform those trained on real data, especially in zero-shot settings. Notably, \textbf{Free-Mask} achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.
Authors: Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen
Abstract: Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained with datasets with a high volume of noise and artifacts. This is due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting the versatility to handle real-world use cases. In this paper, we present \omniedit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contributions are fourfold: (1) \omniedit is trained by utilizing the supervision from seven different specialist models to ensure task coverage; (2) we utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve the data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; and (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic evaluation and human evaluations demonstrate that \omniedit can significantly outperform all the existing models. Our code, dataset and model will be available at https://tiger-ai-lab.github.io/OmniEdit/
Authors: Anjia Cao, Xing Wei, Zhiheng Ma
Abstract: Language-image pre-training faces significant challenges due to limited data in specific formats and the constrained capacities of text encoders. While prevailing methods attempt to address these issues through data augmentation and architecture modifications, they continue to struggle with processing long-form text inputs, and the inherent limitations of traditional CLIP text encoders lead to suboptimal downstream generalization. In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. FLAME comprises two key components: 1) a multifaceted prompt distillation technique for extracting diverse semantic representations from long captions, which better aligns with the multifaceted nature of images, and 2) a facet-decoupled attention mechanism, complemented by an offline embedding strategy, to ensure efficient computation. Extensive empirical evaluations demonstrate FLAME's superior performance. When trained on CC3M, FLAME surpasses the previous state-of-the-art by 4.9% in ImageNet top-1 accuracy. On YFCC15M, FLAME surpasses the WIT-400M-trained CLIP by 44.4% in average image-to-text recall@1 across 36 languages, and by 34.6% in text-to-image recall@1 for long-context retrieval on Urban-1k. Code is available at https://github.com/MIV-XJTU/FLAME.
Authors: Zhen Lv, Yangqi Long, Congzhentao Huang, Cao Li, Chengfei Lv, Hao Ren, Dian Zheng
Abstract: Stereo video synthesis from a monocular input is a demanding task in the fields of spatial computing and virtual reality. The main challenges of this task lie in the insufficiency of high-quality paired stereo videos for training and the difficulty of maintaining spatio-temporal consistency between frames. Existing methods primarily address these issues by directly applying novel view synthesis (NVS) techniques to video, while facing limitations such as the inability to effectively represent dynamic scenes and the requirement for large amounts of training data. In this paper, we introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer, which meets the challenges head-on. Firstly, to address the stereo video data insufficiency, we propose a Depth-based Video Generation module (DVG), which employs a forward-backward rendering mechanism to generate paired videos with geometric and temporal priors. Leveraging data generated by DVG, we propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training. More importantly, we devise a consistency control module, which consists of a metric of stereo deviation strength and a Temporal Interaction Learning module (TIL), to ensure geometric and temporal consistency, respectively. We evaluated the proposed method against various benchmark methods, with the results showcasing its superior performance.
Authors: Jiaman Li, C. Karen Liu, Jiajun Wu
Abstract: Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.
Authors: Nardiena A. Pratama, Shaoyang Fan, Gianluca Demartini
Abstract: Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study explores the similarity between human-generated and ML-generated annotations of images across diverse socio-economic contexts (RQ1) and their impact on ML model performance and bias (RQ2). We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels, covering various daily activities and home environments. ML captions and human labels show the highest similarity at a low level, i.e., in the types of words that appear and in sentence structure, but all annotations are consistent in how they perceive images across regions. ML Captions resulted in the best overall region classification performance, while ML Objects and ML Captions performed best overall for income regression. ML annotations worked best for action categories, while human input was more effective for non-action categories. These findings highlight that both human and machine annotations are important, and that human-generated annotations are not yet replaceable.
Authors: Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen
Abstract: As the quality of image generators continues to improve, deepfakes have become a topic of considerable societal debate. Image watermarking allows responsible model owners to detect and label their AI-generated content, which can mitigate the harm. Yet, current state-of-the-art methods in image watermarking remain vulnerable to forgery and removal attacks. This vulnerability occurs in part because watermarks distort the distribution of generated images, unintentionally revealing information about the watermarking techniques. In this work, we first demonstrate a distortion-free watermarking method for images, based on a diffusion model's initial noise. However, detecting the watermark requires comparing the initial noise reconstructed for an image to all previously used initial noises. To address this limitation, we propose a two-stage watermarking framework for efficient detection. During generation, we augment the initial noise with generated Fourier patterns to embed information about the group of initial noises we used. For detection, we (i) retrieve the relevant group of noises, and (ii) search within the given group for an initial noise that might match our image. This watermarking approach achieves state-of-the-art robustness to forgery and removal against a large battery of attacks.
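The abstract above describes a two-stage detection pipeline (retrieve the noise group, then search within it). The following is a minimal sketch of that logic under assumptions about how matching is scored; the function and variable names (detect_watermark, group_patterns, noise_groups) are hypothetical and the similarity measures are stand-ins, not the paper's method.

```python
import numpy as np

def detect_watermark(recon_noise, group_patterns, noise_groups, match_thresh=0.8):
    """Two-stage detection sketch: (i) pick the most likely noise group by
    Fourier-domain correlation with per-group patterns, (ii) search that group
    for an initial noise that matches the reconstructed noise."""
    spectrum = np.fft.fft2(recon_noise)
    # Stage 1: retrieve the group whose embedded pattern best matches the spectrum.
    scores = [np.abs(np.vdot(spectrum, p)) /
              (np.linalg.norm(spectrum) * np.linalg.norm(p) + 1e-8)
              for p in group_patterns]
    g = int(np.argmax(scores))
    # Stage 2: search only within the retrieved group for a matching initial noise.
    for idx, z0 in enumerate(noise_groups[g]):
        corr = np.vdot(recon_noise, z0) / (
            np.linalg.norm(recon_noise) * np.linalg.norm(z0) + 1e-8)
        if corr > match_thresh:
            return g, idx   # watermark found
    return g, None          # group retrieved, but no matching noise
```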
Authors: Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
Abstract: How can one efficiently generate high-quality, wide-scope 3D scenes from arbitrary single images? Existing methods suffer several drawbacks, such as requiring multi-view data, time-consuming per-scene optimization, distorted geometry in occluded areas, and low visual quality in backgrounds. Our novel 3D scene reconstruction pipeline overcomes these limitations to tackle the aforesaid challenge. Specifically, we introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that encode multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets affirm that our model significantly outperforms existing single-view 3D scene generation methods, especially with out-of-domain images. Thus, we demonstrate for the first time that a 3D reconstruction model can effectively be built upon the latent space of a diffusion model in order to realize efficient 3D scene generation.
Authors: Huiyu Duan, Qiang Hu, Jiarui Wang, Liu Yang, Zitong Xu, Lu Liu, Xiongkuo Min, Chunlei Cai, Tianxiao Ye, Xiaoyun Zhang, Guangtao Zhai
Abstract: The rapid growth of user-generated content (UGC) videos has produced an urgent need for effective video quality assessment (VQA) algorithms to monitor video quality and guide optimization and recommendation procedures. However, current VQA models generally only give an overall rating for a UGC video, which lacks fine-grained labels for serving video processing and recommendation applications. To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. Extensive experimental results demonstrate that our proposed FineVQ can produce fine-grained video-quality results and achieve state-of-the-art performance on FineVD and other commonly used UGC-VQA datasets.
Authors: Phi Vu Tran
Abstract: While modern visual recognition systems have made significant advancements, many continue to struggle with the open problem of learning from few exemplars. This paper focuses on the task of object detection in the setting where object classes follow a natural long-tailed distribution. Existing methods for long-tailed detection resort to external ImageNet labels to augment the low-shot training instances. However, such dependency on a large labeled database has limited utility in practical scenarios. We propose a versatile and scalable approach to leverage optional unlabeled images, which are easy to collect without the burden of human annotations. Our SimLTD framework is straightforward and intuitive, and consists of three simple steps: (1) pre-training on abundant head classes; (2) transfer learning on scarce tail classes; and (3) fine-tuning on a sampled set of both head and tail classes. Our approach can be viewed as an improved head-to-tail model transfer paradigm without the added complexities of meta-learning or knowledge distillation, as was required in past research. By harnessing supplementary unlabeled images, without extra image labels, SimLTD establishes new record results on the challenging LVIS v1 benchmark across both supervised and semi-supervised settings.
Authors: Chenhao Zhang, Yuanping Cao, Lei Zhang
Abstract: 3D Gaussian Splatting (3DGS) leverages densely distributed Gaussian primitives for high-quality scene representation and reconstruction. While existing 3DGS methods perform well in scenes with minor view variation, large view changes from cross-view data pose optimization challenges for these methods. To address these issues, we propose a novel cross-view Gaussian Splatting method for large-scale scene reconstruction based on multi-branch construction and fusion. Our method independently reconstructs models from different sets of views as multiple independent branches to establish the baselines of Gaussian distribution, providing reliable priors for cross-view reconstruction during initialization and densification. Specifically, a gradient-aware regularization strategy is introduced to mitigate smoothing issues caused by significant view disparities. Additionally, a unique Gaussian supplementation strategy is utilized to incorporate complementary information of multi-branch into the cross-view model. Extensive experiments on benchmark datasets demonstrate that our method achieves superior performance in novel view synthesis compared to state-of-the-art methods.
Authors: Mahmoud Abdulsalam, Usman Zahidi, Bradley Hurst, Simon Pearson, Grzegorz Cielniak, James Brown
Abstract: Tomato anomalies/damages pose a significant challenge in greenhouse farming. While this method of cultivation benefits from efficient resource utilization, anomalies can significantly degrade the quality of farm produce. A common anomaly associated with tomatoes is splitting, characterized by the development of cracks on the tomato skin, which degrades its quality. Detecting this type of anomaly is challenging due to dynamic variations in appearance and size, compounded by dataset scarcity. We address this problem in an unsupervised manner by utilizing a tailored variational autoencoder (VAE) with hyperspectral input. Preliminary analysis of the dataset enabled us to select the optimal range of wavelengths for detecting this anomaly. Our findings indicate that the 530-550 nm range is suitable for identifying tomato dry splits. The proposed VAE model achieved a 97% detection accuracy for tomato split anomalies in the test data. The analysis of the reconstruction loss allows us not only to detect the anomalies but also, to some degree, to estimate the anomalous regions.
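Since the detection decision rests on the VAE's reconstruction loss over the selected wavelength band, a short sketch of that decision step may help. This assumes a trained reconstruction model is available as a callable; the thresholds, band handling, and function names are hypothetical, not the paper's implementation.

```python
import numpy as np

def detect_split_anomaly(hsi_patch, vae_reconstruct, wavelengths,
                         band_range=(530, 550), image_thresh=0.05, pixel_thresh=0.1):
    """Reconstruction-error anomaly sketch: select the 530-550 nm bands, reconstruct
    them with a trained VAE (`vae_reconstruct` is a hypothetical callable), then
    threshold the error for an image-level flag and a rough per-pixel localisation."""
    band_mask = (wavelengths >= band_range[0]) & (wavelengths <= band_range[1])
    bands = hsi_patch[..., band_mask]
    recon = vae_reconstruct(bands)
    err_map = np.mean((bands - recon) ** 2, axis=-1)   # per-pixel squared error
    is_anomalous = err_map.mean() > image_thresh       # image-level decision
    anomaly_mask = err_map > pixel_thresh              # approximate anomalous regions
    return is_anomalous, anomaly_mask
```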
Authors: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
Abstract: The clinical adoption of artificial intelligence (AI) in medical imaging requires models that are both diagnostically accurate and interpretable to clinicians. While current multimodal biomedical foundation models prioritize performance, their black-box nature hinders explaining the decision-making process in clinically meaningful concepts. Here, we present ConceptCLIP, the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy while delivering human-interpretable explanations across diverse imaging modalities. We curate MedConcept-23M, the largest pre-training dataset comprising 23 million image-text-concept triplets across diverse medical modalities, where clinical concepts are derived from the Unified Medical Language System. Leveraging this dataset, we develop ConceptCLIP through a novel dual-alignment approach that simultaneously learns global image-text representations and fine-grained region-concept associations for precise and interpretable medical image analysis. We curate the most extensive evaluation benchmark for multimodal biomedical foundation models, covering 52 clinical tasks spanning 10 imaging modalities. Extensive experiments demonstrate that ConceptCLIP outperforms existing state-of-the-art multimodal biomedical foundation models. Importantly, ConceptCLIP demonstrates superior diagnostic performance while providing human-understandable explanations validated by clinical experts. As the first precise and interpretable biomedical foundation model, ConceptCLIP represents a critical milestone toward the widespread clinical adoption of AI, thereby advancing trustworthy AI in medicine.
Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
Abstract: Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code is open-sourced and is available at https://github.com/svg-project/Sparse-VideoGen
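The key mechanism above is profiling each attention head online and labelling it spatial or temporal. Below is a minimal sketch of one way such a profiler could work on a few sampled queries; the masks, sampling, and decision rule are assumptions for illustration, not SVG's actual kernels or layout transformation.

```python
import torch

def classify_head(q, k, frame_len, num_frames, sample=64):
    """Online-profiling sketch: on a small set of sampled queries, compare how much
    attention mass a 'spatial' mask (same-frame tokens) vs. a 'temporal' mask
    (same spatial position across frames) would retain, and keep the larger one."""
    n, d = q.shape
    assert frame_len * num_frames == n
    idx = torch.randperm(n)[:sample]
    attn = torch.softmax(q[idx] @ k.T / d ** 0.5, dim=-1)      # (sample, n)
    pos = torch.arange(n)
    frame_id, offset = pos // frame_len, pos % frame_len
    spatial_mask = frame_id[None, :] == frame_id[idx][:, None]   # same frame
    temporal_mask = offset[None, :] == offset[idx][:, None]      # same spatial position
    spatial_mass = (attn * spatial_mask).sum(dim=-1).mean()
    temporal_mass = (attn * temporal_mask).sum(dim=-1).mean()
    return "spatial" if spatial_mass >= temporal_mass else "temporal"

# Toy usage: 4 frames of 256 tokens each
q, k = torch.randn(1024, 64), torch.randn(1024, 64)
print(classify_head(q, k, frame_len=256, num_frames=4))
```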
Authors: Zaid Ilyas, Arooba Maqsood, Afsah Saleem, Erchuan Zhang, David Suter, Parminder Raina, Jonathan M. Hodgson, John T. Schousboe, William D. Leslie, Joshua R. Lewis, Syed Zulqarnain Gilani
Abstract: Lateral Spine Image (LSI) analysis is important for medical diagnosis, treatment planning, and detailed spinal health assessments. Although modalities like Computed Tomography and Digital X-ray Imaging are commonly used, Dual Energy X-ray Absorptiometry (DXA) is often preferred due to lower radiation exposure, seamless capture, and cost-effectiveness. Accurate Vertebral Landmark Localization (VLL) on LSIs is important to detect spinal conditions like kyphosis and lordosis, as well as assessing Abdominal Aortic Calcification (AAC) using Inter-Vertebral Guides (IVGs). Nonetheless, few automated VLL methodologies have concentrated on DXA LSIs. We present VerteNet, a hybrid CNN-Transformer model featuring a novel dual-resolution attention mechanism in self and cross-attention domains, referred to as Dual Resolution Self-Attention (DRSA) and Dual Resolution Cross-Attention (DRCA). These mechanisms capture the diverse frequencies in DXA images by operating at two different feature map resolutions. Additionally, we design a Multi-Context Feature Fusion Block (MCFB) that efficiently integrates the features using DRSA and DRCA. We train VerteNet on 620 DXA LSIs from various machines and achieve superior results compared to existing methods. We also design an algorithm that utilizes VerteNet's predictions in estimating the Region of Interest (ROI) to detect potential abdominal aorta cropping, where inadequate soft tissue hinders calcification assessment. Additionally, we present a small proof-of-concept study to show that IVGs generated from VLL information can improve inter-reader correlation in AAC scoring, addressing two key areas of disagreement in expert AAC-24 scoring: IVG placement and quality control for full abdominal aorta assessment. The code for this work can be found at https://github.com/zaidilyas89/VerteNet.
Authors: Rupesh Dulal, Rabin Dulal
Abstract: Identifying the extent of brain tumors is a significant challenge in brain cancer treatment. The main difficulty is in the approximate detection of tumor size. Magnetic resonance imaging (MRI) has become a critical diagnostic tool. However, manually detecting the boundaries of brain tumors from MRI scans is a labor-intensive task that requires extensive expertise. Deep learning and computer-aided detection techniques have led to notable advances in machine learning for this purpose. In this paper, we propose a modified You Only Look Once (YOLOv8) model to accurately detect the tumors within MRI images. The proposed model replaces the Non-Maximum Suppression (NMS) algorithm with a Real-Time Detection Transformer (RT-DETR) in the detection head. NMS filters out redundant or overlapping bounding boxes among the detected tumors, but it is hand-designed and relies on pre-set parameters; RT-DETR removes these hand-designed components. The second improvement was made by replacing the normal convolution block with ghost convolution, which reduces computational and memory costs while maintaining high accuracy and enabling faster inference, making it ideal for resource-constrained environments and real-time applications. The third improvement was made by introducing a vision transformer block in the backbone of YOLOv8 to extract context-aware features. We used a publicly available dataset of brain tumors in the proposed model. The proposed model performed better than the original YOLOv8 model and also performed better than other object detectors (Faster R-CNN, Mask R-CNN, YOLO, YOLOv3, YOLOv4, YOLOv5, SSD, RetinaNet, EfficientDet, and DETR), achieving a mean Average Precision (mAP@0.5) of 0.91.
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Abstract: While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce \textbf{VideoRoPE}, with a \textit{3D structure} designed to preserve spatio-temporal relationships. VideoRoPE features \textit{low-frequency temporal allocation} to mitigate periodic oscillations, a \textit{diagonal layout} to maintain spatial symmetry, and \textit{adjustable temporal spacing} to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at \href{https://github.com/Wiselnn570/VideoRoPE}{https://github.com/Wiselnn570/VideoRoPE}.
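To make the frequency-allocation idea more concrete, here is a generic sketch of splitting rotary frequency bands across (time, height, width), with the lowest-frequency bands assigned to the temporal axis and a scaling factor controlling temporal spacing. This is an assumption in the spirit of the description, not VideoRoPE's exact diagonal layout; all parameter names are illustrative.

```python
import torch

def rope_3d_angles(t, h, w, dim, temporal_frac=0.25, temporal_scale=2.0, base=10000.0):
    """3D RoPE allocation sketch: split inverse-frequency bands across (t, h, w),
    giving the lowest-frequency bands to time and scaling the temporal index."""
    half = dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)  # high -> low
    n_t = int(half * temporal_frac)
    freq_t = inv_freq[-n_t:]                        # lowest-frequency bands -> time
    freq_hw = inv_freq[:-n_t]
    freq_h, freq_w = freq_hw[0::2], freq_hw[1::2]   # interleave the rest over h and w
    tt, hh, ww = torch.meshgrid(torch.arange(t), torch.arange(h), torch.arange(w),
                                indexing="ij")
    ang_t = (tt[..., None] / temporal_scale) * freq_t   # adjustable temporal spacing
    ang_h = hh[..., None] * freq_h
    ang_w = ww[..., None] * freq_w
    return torch.cat([ang_t, ang_h, ang_w], dim=-1)     # (t, h, w, dim // 2)

print(rope_3d_angles(t=8, h=4, w=4, dim=64).shape)      # torch.Size([8, 4, 4, 32])
```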
URLs: https://github.com/Wiselnn570/VideoRoPE
Authors: Kristofer Grover Roos, Atsushi Fukuda, Quan Huu Cap
Abstract: While functional magnetic resonance imaging (fMRI) offers valuable insights into brain activity, it is limited by high operational costs and significant infrastructural demands. In contrast, electroencephalography (EEG) provides millisecond-level precision in capturing electrical activity but lacks the spatial fidelity necessary for precise neural localization. To bridge these gaps, we propose E2fNet, a simple yet effective deep learning model for synthesizing fMRI images from low-cost EEG data. E2fNet is an encoder-decoder network specifically designed to capture and translate meaningful multi-scale features from EEG across electrode channels into accurate fMRI representations. Extensive evaluations across three public datasets demonstrate that E2fNet consistently outperforms existing CNN- and transformer-based methods, achieving state-of-the-art results in terms of the structural similarity index measure (SSIM). These results demonstrate that E2fNet is a promising, cost-effective solution for enhancing neuroimaging capabilities. The code is available at https://github.com/kgr20/E2fNet.
Authors: Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
Abstract: In this study, we tackle industry challenges in video content classification by exploring and optimizing GPT-based models for zero-shot classification across seven critical categories of video quality. We contribute a novel approach to improving GPT's performance through prompt optimization and policy refinement, demonstrating that simplifying complex policies significantly reduces false negatives. Additionally, we introduce a new decomposition-aggregation-based prompt engineering technique, which outperforms traditional single-prompt methods. These experiments, conducted on real industry problems, show that thoughtful prompt design can substantially enhance GPT's performance without additional finetuning, offering an effective and scalable solution for improving video classification.
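The decomposition-aggregation idea above lends itself to a short sketch: split a complex moderation policy into simple per-criterion prompts, query the model once per criterion, and aggregate the answers. The `ask_model` stub, category names, and aggregation rule below are all hypothetical placeholders, not the authors' prompts or policies.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a GPT API call; replace with a real client."""
    return "no"

def classify_video(description: str, policy_checks: dict) -> dict:
    """Decomposition-aggregation sketch: one simple yes/no prompt per sub-policy,
    then an OR-style aggregation into the final flag."""
    answers = {}
    for name, check in policy_checks.items():
        prompt = (f"Video description:\n{description}\n\n"
                  f"Question: {check} Answer strictly 'yes' or 'no'.")
        answers[name] = ask_model(prompt).strip().lower().startswith("y")
    return {"violations": [n for n, hit in answers.items() if hit],
            "flagged": any(answers.values())}

# Toy usage with made-up quality criteria
checks = {"low_resolution": "Is the footage blurry or low resolution?",
          "static_image": "Is the video just a static image with no motion?"}
print(classify_video("A slideshow of vacation photos with background music.", checks))
```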
Authors: Liyuan Zhu, Shengqu Cai, Shengyu Huang, Gordon Wetzstein, Naji Khosravan, Iro Armeni
Abstract: We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style and real-world images. This ensures that each object is stylized with semantically matched textures. It first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released, to support new applications in interior design, virtual staging, and 3D-consistent stylization.
Authors: Sheng-Yu Huang, Zi-Ting Chou, Yu-Chiang Frank Wang
Abstract: When performing 3D inpainting using novel-view rendering methods like Neural Radiance Field (NeRF) or 3D Gaussian Splatting (3DGS), how to achieve texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views for updating the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods quantitatively and qualitatively.
Authors: Meng Lou, Yizhou Yu
Abstract: Top-down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer-grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub-networks: 1) a Base-Net that encodes low/mid-level features; 2) a lightweight Overview-Net that generates dynamic top-down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus-Net that performs finer-grained perception guided by top-down attention (i.e., look closely next). To fully unleash the power of top-down attention, we further propose a novel context-mixing dynamic convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection, our OverLoCK-S clearly surpasses MogaNet-B by 1% in AP^b. On semantic segmentation, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
Authors: Qingwen Zhang, Ajinkya Khoche, Yi Yang, Li Ling, Sina Sharif Mansouri, Olov Andersson, Patric Jensfelt
Abstract: LiDAR point cloud is essential for autonomous vehicles, but motion distortions from dynamic objects degrade the data quality. While previous work has considered distortions caused by ego motion, distortions caused by other moving objects remain largely overlooked, leading to errors in object shape and position. This distortion is particularly pronounced in high-speed environments such as highways and in multi-LiDAR configurations, a common setup for heavy vehicles. To address this challenge, we introduce HiMo, a pipeline that repurposes scene flow estimation for non-ego motion compensation, correcting the representation of dynamic objects in point clouds. During the development of HiMo, we observed that existing self-supervised scene flow estimators often produce degenerate or inconsistent estimates under high-speed distortion. We further propose SeFlow++, a real-time scene flow estimator that achieves state-of-the-art performance on both scene flow and motion compensation. Since well-established motion distortion metrics are absent in the literature, we introduce two evaluation metrics: compensation accuracy at a point level and shape similarity of objects. We validate HiMo through extensive experiments on Argoverse 2, ZOD, and a newly collected real-world dataset featuring highway driving and multi-LiDAR-equipped heavy vehicles. Our findings show that HiMo improves the geometric consistency and visual fidelity of dynamic objects in LiDAR point clouds, benefiting downstream tasks such as semantic segmentation and 3D detection. See https://kin-zhang.github.io/HiMo for more details.
Authors: Mojtaba Safari, Shansong Wang, Zach Eidex, Qiang Li, Erik H. Middlebrooks, David S. Yu, Xiaofeng Yang
Abstract: Objective: This study introduces a residual error-shifting mechanism that drastically reduces sampling steps while preserving critical anatomical details, thus accelerating MRI reconstruction. Approach: We propose a novel diffusion-based SR framework called Res-SRDiff, which integrates residual error shifting into the forward diffusion process. This enables efficient HR image reconstruction by aligning the degraded HR and LR distributions. We evaluated Res-SRDiff on ultra-high-field brain T1 MP2RAGE maps and T2-weighted prostate images, comparing it with Bicubic, Pix2pix, CycleGAN, and a conventional denoising diffusion probabilistic model with vision transformer backbone (TM-DDPM), using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), gradient magnitude similarity deviation (GMSD), and learned perceptual image patch similarity (LPIPS). Main results: Res-SRDiff significantly outperformed all comparative methods in terms of PSNR, SSIM, and GMSD across both datasets, with statistically significant improvements (p-values < 0.05). The model achieved high-fidelity image restoration with only four sampling steps, drastically reducing computational time to under one second per slice, which is substantially faster than the conventional TM-DDPM with around 20 seconds per slice. Qualitative analyses further demonstrated that Res-SRDiff effectively preserved fine anatomical details and lesion morphology in both brain and pelvic MRI images. Significance: Our findings show that Res-SRDiff is an efficient and accurate MRI SR method, markedly improving computational efficiency and image quality. Integrating residual error shifting into the diffusion process allows for rapid and robust HR image reconstruction, enhancing clinical MRI workflows and advancing medical imaging research. The source code is available at: https://github.com/mosaf/Res-SRDiff
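To illustrate the residual-shifting idea, here is a minimal sketch of a forward step that gradually shifts a sample from the HR image toward the LR image via their residual, plus noise. The linear schedule and the kappa noise scale are assumptions in the spirit of the abstract, not the paper's exact formulation.

```python
import torch

def residual_shift_forward(x_hr, x_lr, t, T, kappa=1.0):
    """Residual-shifting forward process sketch: as t grows, the sample moves from
    x_hr toward x_lr by the residual e = x_lr - x_hr, with noise added on top."""
    eta = t / T                                  # monotone shifting schedule in [0, 1]
    residual = x_lr - x_hr
    noise = torch.randn_like(x_hr)
    return x_hr + eta * residual + kappa * (eta ** 0.5) * noise

# Toy usage: a synthetic LR image obtained by downsampling and upsampling
x_hr = torch.rand(1, 1, 64, 64)
x_lr = torch.nn.functional.interpolate(
    torch.nn.functional.avg_pool2d(x_hr, 4), scale_factor=4, mode="bilinear")
print(residual_shift_forward(x_hr, x_lr, t=2, T=4).shape)
```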
Authors: Dujun Nie, Xianda Guo, Yiqun Duan, Ruijun Zhang, Long Chen
Abstract: Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes the online maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for navigation policy. By decomposing according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model plan and observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
Authors: Hongze Sun, Jun Wang, Wuque Cai, Duo Chen, Qianqian Liao, Jiayi He, Yan Cui, Dezhong Yao, Daqing Guo
Abstract: Spiking Neural Networks (SNNs) have emerged as a promising tool for event-based optical flow estimation tasks due to their ability to leverage spatio-temporal information and low-power capabilities. However, the performance of SNN models is often constrained, limiting their application in real-world scenarios. In this work, we address this gap by proposing a novel neural network architecture, ST-FlowNet, specifically tailored for optical flow estimation from event-based data. The ST-FlowNet architecture integrates ConvGRU modules to facilitate cross-modal feature augmentation and temporal alignment of the predicted optical flow, improving the network's ability to capture complex motion dynamics. Additionally, to overcome the challenges associated with training SNNs, we introduce a novel approach to derive SNN models from pre-trained artificial neural networks (ANNs) through ANN-to-SNN conversion or our proposed BISNN method. Notably, the BISNN method alleviates the complexities involved in biological parameter selection, further enhancing the robustness of SNNs in optical flow estimation tasks. Extensive evaluations on three benchmark event-based datasets demonstrate that the SNN-based ST-FlowNet model outperforms state-of-the-art methods, delivering superior performance in accurate optical flow estimation across a diverse range of dynamic visual scenes. Furthermore, the inherent energy efficiency of SNN models is highlighted, establishing a compelling advantage for their practical deployment. Overall, our work presents a novel framework for optical flow estimation using SNNs and event-based data, contributing to the advancement of neuromorphic vision applications.
Authors: Hongyu Sun, Qiuhong Ke, Ming Cheng, Yongcai Wang, Deying Li, Chenhui Gou, Jianfei Cai
Abstract: This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. Unlike prior methods, which rely heavily on training data (often inaccessible during online inference) and are limited to recognizing a fixed set of point cloud classes predefined during training, we explore a more practical and challenging scenario: adapting the model solely based on online test data to recognize both previously seen classes and novel, unseen classes at test time. To this end, we develop \textbf{Point-Cache}, a hierarchical cache model that captures essential clues of online test samples, particularly focusing on the global structure of point clouds and their local-part details. Point-Cache, which serves as a rich 3D knowledge base, is dynamically managed to prioritize the inclusion of high-quality samples. Designed as a plug-and-play module, our method can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. Notably, our solution operates with efficiency comparable to zero-shot inference, as it is entirely training-free. Point-Cache demonstrates substantial gains across 8 challenging benchmarks and 4 representative large 3D models, highlighting its effectiveness. Code is available at https://github.com/auniquesun/Point-Cache.
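A training-free, confidence-prioritized cache is the core of the approach above. The class below is a heavily simplified sketch of what such a cache could look like (keep the most confident test features per predicted class, blend cache similarity with zero-shot logits at inference); it is an illustrative assumption, not the paper's hierarchical global/local design, and all names are hypothetical.

```python
import numpy as np
from collections import defaultdict

class PointCacheSketch:
    """Training-free cache sketch for open-vocabulary recognition."""
    def __init__(self, capacity=8, alpha=0.5):
        self.capacity, self.alpha = capacity, alpha
        self.store = defaultdict(list)                 # class -> [(entropy, feature)]

    def update(self, feature, zero_shot_probs):
        cls = int(np.argmax(zero_shot_probs))
        ent = -np.sum(zero_shot_probs * np.log(zero_shot_probs + 1e-8))
        self.store[cls].append((ent, feature))
        # keep only the most confident (lowest-entropy) samples per class
        self.store[cls] = sorted(self.store[cls], key=lambda x: x[0])[: self.capacity]

    def predict(self, feature, zero_shot_logits, num_classes):
        cache_logits = np.zeros(num_classes)
        for cls, items in self.store.items():
            feats = np.stack([f for _, f in items])
            sims = feats @ feature / (np.linalg.norm(feats, axis=1)
                                      * np.linalg.norm(feature) + 1e-8)
            cache_logits[cls] = sims.max()
        return self.alpha * cache_logits + (1 - self.alpha) * zero_shot_logits
```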
Authors: Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Shaohui Jiao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng
Abstract: Novel view synthesis (NVS) is a cornerstone for image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.
Authors: Xuyang Fang, Sion Hannuna, Neill Campbell
Abstract: We introduce the 8-Calves dataset, a benchmark for evaluating object detection and identity preservation in occlusion-rich, temporally consistent environments. Comprising a 1-hour video (67,760 frames) of eight Holstein Friesian calves with unique coat patterns and 900 static frames, the dataset emphasizes real-world challenges like prolonged occlusions, motion blur, and pose variation. By fine-tuning 28 object detectors (YOLO variants, transformers) and evaluating 23 pretrained backbones (ResNet, ConvNextV2, ViTs), we expose critical architectural trade-offs: smaller models (e.g., ConvNextV2 Nano, 15.6M parameters) excel in efficiency and retrieval accuracy, while pure vision transformers lag in occlusion-heavy settings. The dataset's structured design (fixed camera views, natural motion, and verified identities) provides a reproducible testbed for object detection challenges (mAP50:95: 56.5-66.4%), bridging synthetic simplicity and domain-specific complexity. The dataset and benchmark code are all publicly available at https://huggingface.co/datasets/tonyFang04/8-calves. Limitations include partial labeling and detector bias, addressed in later sections.
Authors: Dongkwan Lee, Kyomin Hwang, Nojun Kwak
Abstract: We address the problem of semi-supervised domain generalization (SSDG), where the distributions of train and test data differ, and only a small amount of labeled data along with a larger amount of unlabeled data are available during training. Existing SSDG methods that leverage only the unlabeled samples for which the model's predictions are highly confident (confident-unlabeled samples), limit the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in SSDG setting. To this end, we propose UPCSC to utilize these unconfident-unlabeled samples in SSDG that consists of two modules: 1) Unlabeled Proxy-based Contrastive learning (UPC) module, treating unconfident-unlabeled samples as additional negative pairs and 2) Surrogate Class learning (SC) module, generating positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play and do not require any domain labels, which can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability and mitigates domain gaps. The code is available at https://github.com/dongkwani/UPCSC.
Authors: Ketan Suhaas Saichandran, Xavier Thomas, Prakhar Kaushik, Deepti Ghadiyaram
Abstract: Text-to-image generative models often struggle with long prompts detailing complex scenes, diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method to improve text-to-image alignment by progressively refining the input prompt in a coarse-to-fine-grained manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts which evolve from describing broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free plug-and-play approach significantly enhances prompt alignment, achieving an average improvement of up to +4% in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 85% of the prompts from the GenAI-Bench dataset.
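The scheduled interpolation can be sketched in a few lines: given sub-prompt embeddings ordered coarse to fine, pick a blend of neighbouring sub-prompts based on the current denoising step. The linear schedule below is an assumption for illustration, not SCoPE's exact schedule; the function name and shapes are hypothetical.

```python
import torch

def scheduled_prompt_embedding(sub_prompt_embs, step, total_steps):
    """Coarse-to-fine scheduling sketch: early steps see the coarse scene layout,
    later steps see increasingly fine-grained sub-prompts, with linear interpolation
    between neighbouring sub-prompt embeddings."""
    n = len(sub_prompt_embs)
    pos = (step / max(total_steps - 1, 1)) * (n - 1)   # continuous position in [0, n-1]
    lo, frac = int(pos), pos - int(pos)
    hi = min(lo + 1, n - 1)
    return (1 - frac) * sub_prompt_embs[lo] + frac * sub_prompt_embs[hi]

# Toy usage: three sub-prompts (coarse scene, mid-level objects, fine details)
embs = [torch.randn(77, 768) for _ in range(3)]
for t in range(0, 50, 10):
    print(t, scheduled_prompt_embedding(embs, t, 50).shape)
```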
Authors: Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Sanghyun Park
Abstract: Zero-shot quantization (ZSQ) enables neural network compression without training data, which is crucial in restricted data access environments. However, existing ZSQ methods suffer from significant activation loss in low-bit environments owing to their coarse-grained scaling strategy. To address this issue, we propose GranQ, a novel ZSQ approach that leverages layer-channel awareness to minimize the quantization error. Unlike conventional layer- or channel-wise quantization, GranQ dynamically adjusts quantization granularity by considering both layer- and channel-level activation distributions. This enables fine-grained quantization while minimizing activation distortion. Additionally, we introduce vectorized activation quantization, which enables efficient parallel computation and reduces computational overhead while preserving accuracy. GranQ achieves superior performance compared with state-of-the-art ZSQ methods that employ quantization-aware training. With these findings, we anticipate that GranQ will inspire novel research directions beyond conventional ZSQ approaches focused on data generation and model training.
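For readers unfamiliar with quantization granularity, the sketch below shows plain per-channel activation quantization computed in a single vectorized pass. It illustrates the granularity and vectorization ideas only; it is not GranQ's layer-channel-aware scheme, and the parameter names are illustrative.

```python
import torch

def channelwise_quantize(activation, n_bits=4):
    """Fine-grained activation quantization sketch: one scale and zero-point per
    channel, computed and applied without Python-level loops."""
    x = activation                                     # (N, C, H, W)
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=(0, 2, 3), keepdim=True)        # per-channel minimum
    x_max = x.amax(dim=(0, 2, 3), keepdim=True)        # per-channel maximum
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = ((x - x_min) / scale).round().clamp(0, qmax)   # integer codes
    return q * scale + x_min                           # dequantized activation

x = torch.randn(8, 16, 32, 32)
x_q = channelwise_quantize(x, n_bits=4)
print((x - x_q).abs().mean())                          # average quantization error
```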
Authors: Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, Bo Zhao
Abstract: Despite advanced token compression techniques, existing multimodal large language models (MLLMs) still struggle with hour-long video understanding. In this work, we propose Video-XL-Pro, an efficient method for extremely long video understanding, built upon Reconstructive Compression of Tokens (ReCoT), a learnable module that leverages self-supervised learning to generate comprehensive and compact video tokens. ReCoT introduces two key components: (i) Dynamic Token Synthesizer (DTS): DTS generates pseudo-video tokens from static image tokens by learning intra-token relationships, which are then used in masked video modeling. (ii) Semantic-Guided Masking (SGM): SGM adaptively masks redundant visual tokens to facilitate more effective reconstructive learning. To improve training efficiency in MLLMs fine-tuning, we introduce a video-specific dataset pruning strategy and design a simple yet effective Query-aware Selector that enables the model to precisely locate query-relevant video tokens. With only 3B parameters, Video-XL-Pro outperforms most 7B models trained on larger datasets across multiple long video understanding benchmarks. Moreover, it can process over 8K frames on a single A100 GPU while maintaining high-quality performance.
Authors: Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha
Abstract: The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase the model's capabilities, fixed evaluation operators struggling with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by more than 20 models, among which 8 cutting-edge models are selected for quantitative evaluation by both the agent and human raters. Extensive experiments validate that our proposed agent-based evaluation system demonstrates strong alignment with human preferences and reliably completes the evaluation, and they also confirm the diversity and richness of the benchmark.
Authors: Yasmine Mustafa, Mohamed Elmahallawy, Tie Luo
Abstract: Alzheimer's disease (AD) leads to progressive cognitive decline, making early detection crucial for effective intervention. While deep learning models have shown high accuracy in AD diagnosis, their lack of interpretability limits clinical trust and adoption. This paper introduces a novel pre-model approach leveraging Jacobian Maps (JMs) within a multi-modal framework to enhance explainability and trustworthiness in AD detection. By capturing localized brain volume changes, JMs establish meaningful correlations between model predictions and well-known neuroanatomical biomarkers of AD. We validate JMs through experiments comparing a 3D CNN trained on JMs versus on traditional preprocessed data, which demonstrates superior accuracy. We also employ 3D Grad-CAM analysis to provide both visual and quantitative insights, further showcasing improved interpretability and diagnostic reliability.
Authors: Peng Gao, Yujian Lee, Zailong Chen, Hui zhang, Xubo Liu, Yiyang Hu, Guquang Jing
Abstract: Composed Image Retrieval (CIR) seeks to find a target image using a multi-modal query, which combines an image with modification text to pinpoint the target. While recent CIR methods have shown promise, they mainly focus on exploring relationships between the query pairs (image and text) through data augmentation or model design. These methods often assume perfect alignment between queries and target images, an idealized scenario rarely encountered in practice. In reality, pairs are often partially or completely mismatched due to issues like inaccurate modification texts, low-quality target images, and annotation errors. Ignoring these mismatches leads to numerous False Positive Pairs (FPPs), denoted as noise pairs in the dataset, causing the model to overfit and ultimately reducing its performance. To address this problem, we propose the Noise-aware Contrastive Learning for CIR (NCL-CIR), comprising two key components: the Weight Compensation Block (WCB) and the Noise-pair Filter Block (NFB). The WCB coupled with diverse weight maps can ensure more stable token representations of multi-modal queries and target images. Meanwhile, the NFB, in conjunction with the Gaussian Mixture Model (GMM), predicts noise pairs by evaluating loss distributions and generates soft labels correspondingly, allowing for the design of the soft-label based Noise Contrastive Estimation (NCE) loss function. Consequently, the overall architecture helps to mitigate the influence of mismatched and partially matched samples, with experimental results demonstrating that NCL-CIR achieves exceptional performance on the benchmark datasets.
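The GMM-based noise-pair filtering admits a compact sketch: fit a two-component mixture to the per-pair loss values and use the posterior of the low-loss component as a soft label. Treating the lower-mean component as "clean" and the function name are assumptions about the recipe, not NCL-CIR's exact NFB.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_labels_from_losses(per_pair_losses):
    """Noise-pair filtering sketch: fit a 2-component GMM to the loss distribution
    and return the posterior probability of the low-loss (clean) component as a
    soft label in [0, 1] for each query-target pair."""
    losses = np.asarray(per_pair_losses).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))      # low-loss component = clean
    return gmm.predict_proba(losses)[:, clean_comp]

# Toy usage: mostly clean pairs (low loss) with a few mismatched pairs (high loss)
losses = np.concatenate([np.random.normal(0.3, 0.05, 90),
                         np.random.normal(1.5, 0.2, 10)])
print(soft_labels_from_losses(losses)[:5])
```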
Authors: Mengxia Dai, Wenqian Luo, Tianyang Li
Abstract: Instance segmentation plays a pivotal role in medical image analysis by enabling precise localization and delineation of lesions, tumors, and anatomical structures. Although deep learning models such as Mask R-CNN and BlendMask have achieved remarkable progress, their application in high-risk medical scenarios remains constrained by confidence calibration issues, which may lead to misdiagnosis. To address this challenge, we propose a robust quality control framework based on conformal prediction theory. This framework innovatively constructs a risk-aware dynamic threshold mechanism that adaptively adjusts segmentation decision boundaries according to clinical requirements. Specifically, we design a \textbf{calibration-aware loss function} that dynamically tunes the segmentation threshold based on a user-defined risk level $\alpha$. Utilizing exchangeable calibration data, this method ensures that the expected false negative rate (FNR) or false discovery rate (FDR) on test data remains below $\alpha$ with high probability. The framework maintains compatibility with mainstream segmentation models (e.g., Mask R-CNN, BlendMask+ResNet-50-FPN) and datasets (PASCAL VOC format) without requiring architectural modifications. Empirical results demonstrate that our calibration framework rigorously bounds the FDR metric marginally over the test set.
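The threshold-calibration idea can be illustrated with a simplified conformal-style procedure: scan segmentation thresholds on an exchangeable calibration set and keep the strictest one whose corrected empirical FNR stays below the user-defined risk level. The scan, grid, and finite-sample correction below are assumptions in the spirit of conformal risk control, not the paper's calibration-aware loss.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_masks, alpha=0.1,
                        grid=np.linspace(0.99, 0.01, 99)):
    """Risk-calibration sketch: lower thresholds produce larger predicted masks and
    hence lower FNR, so scan from strict to permissive and return the first threshold
    whose corrected empirical FNR on the calibration set is <= alpha."""
    n = len(cal_probs)
    for lam in grid:
        fnrs = []
        for p, m in zip(cal_probs, cal_masks):
            pred = p >= lam
            pos = m.sum()
            fnrs.append(((m & ~pred).sum() / pos) if pos > 0 else 0.0)
        # finite-sample correction in the spirit of conformal risk control
        if (np.mean(fnrs) * n + 1) / (n + 1) <= alpha:
            return lam
    return grid[-1]

# Toy usage with random calibration data
rng = np.random.default_rng(0)
probs = [rng.random((64, 64)) for _ in range(20)]
masks = [rng.random((64, 64)) > 0.7 for _ in range(20)]
print(calibrate_threshold(probs, masks, alpha=0.2))
```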
Authors: Zhenteng Li, Sheng Lian, Dengfeng Pan, Youlin Wang, Wei Liu
Abstract: Object detection in Unmanned Aerial Vehicle (UAV) images poses significant challenges due to complex scale variations and class imbalance among objects. Existing methods often address these challenges separately, overlooking the intricate nature of UAV images and the potential synergy between them. In response, this paper proposes AD-Det, a novel framework employing a coherent coarse-to-fine strategy that seamlessly integrates two pivotal components: Adaptive Small Object Enhancement (ASOE) and Dynamic Class-balanced Copy-paste (DCC). ASOE utilizes a high-resolution feature map to identify and cluster regions containing small objects. These regions are subsequently enlarged and processed by a fine-grained detector. On the other hand, DCC conducts object-level resampling by dynamically pasting tail classes around the cluster centers obtained by ASOE, maintaining a dynamic memory bank for each tail class. This approach enables AD-Det to not only extract regions with small objects for precise detection but also dynamically perform reasonable resampling for tail-class objects. Consequently, AD-Det enhances the overall detection performance by addressing the challenges of scale variations and class imbalance in UAV images through a synergistic and adaptive framework. We extensively evaluate our approach on two public datasets, i.e., VisDrone and UAVDT, and demonstrate that AD-Det significantly outperforms existing competitive alternatives. Notably, AD-Det achieves a 37.5% Average Precision (AP) on the VisDrone dataset, surpassing its counterparts by at least 3.1%.
Authors: Thomas Kerdreux, Alexandre Tuel, Quentin Febvre, Alexis Mouche, Bertrand Chapron
Abstract: Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO), demonstrating strong transferability across diverse remote sensing tasks. While prior work has focused on network architectures and training strategies, the role of dataset curation, especially in balancing and diversifying pre-training datasets, remains underexplored. In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery, which can lead to biased representations and inefficient training. In this work, we propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance. Our method iteratively refines the training set without requiring a pre-existing feature extractor, making it well-suited for domains where curated datasets are limited or unavailable. We demonstrate our approach on the Sentinel-1 Wave Mode (WV) Synthetic Aperture Radar (SAR) archive, a challenging dataset dominated by ocean observations. We train models from scratch on the entire Sentinel-1 WV archive spanning 10 years. Across three downstream tasks, our results show that dynamic pruning improves both computational efficiency and representation quality, leading to stronger transferability. We also release the weights of OceanSAR-1, the first model in the OceanSAR family, a series of foundation models for ocean observation and analysis using SAR imagery, at github.com/galeio-research/OceanSAR-models/.
Authors: Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Tianshuo Peng, Shufei Zhang, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Bo Zhang, Peng Gao
Abstract: We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
Authors: Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Abstract: The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2$\%$), MIA (+7.6$\%$), and IFEval (+12.3$\%$). We have fully open-sourced the datasets (both SFT and DPO), evaluation code and training scripts at https://github.com/SYuan03/MM-IFEngine.
Authors: Changhao Li, Yu Xin, Xiaowei Zhou, Ariel Shamir, Hao Zhang, Ligang Liu, Ruizhen Hu
Abstract: We introduce Masked Anchored SpHerical Distances (MASH), a novel multi-view and parametrized representation of 3D shapes. Inspired by multi-view geometry and motivated by the importance of perceptual shape understanding for learning 3D shapes, MASH represents a 3D shape as a collection of observable local surface patches, each defined by a spherical distance function emanating from an anchor point. We further leverage the compactness of spherical harmonics to encode the MASH functions, combined with a generalized view cone with a parameterized base that masks the spatial extent of the spherical function to attain locality. We develop a differentiable optimization algorithm capable of converting any point cloud into a MASH representation accurately approximating ground-truth surfaces with arbitrary geometry and topology. Extensive experiments demonstrate that MASH is versatile for multiple applications including surface reconstruction, shape generation, completion, and blending, achieving superior performance thanks to its unique representation encompassing both implicit and explicit features.
Authors: Hyejin Lee, Seokjun Hong, Jeonghoon Song, Haechan Cho, Zhixiong Jin, Byeonghun Kim, Joobin Jin, Jaegyun Im, Byeongjoon Noh, Hwasoo Yeo
Abstract: Reliable traffic data are essential for understanding urban mobility and developing effective traffic management strategies. This study introduces the DRone-derived Intelligence For Traffic analysis (DRIFT) dataset, a large-scale urban traffic dataset collected systematically from synchronized drone videos at approximately 250 meters altitude, covering nine interconnected intersections in Daejeon, South Korea. DRIFT provides high-resolution vehicle trajectories that include directional information, processed through video synchronization and orthomap alignment, resulting in a comprehensive dataset of 81,699 vehicle trajectories. Through our DRIFT dataset, researchers can simultaneously analyze traffic at multiple scales, from individual vehicle maneuvers such as lane changes and safety metrics such as time-to-collision to aggregate network flow dynamics across interconnected urban intersections. The DRIFT dataset is structured to enable immediate use without additional preprocessing, complemented by open-source models for object detection and trajectory extraction, as well as associated analytical tools. DRIFT is expected to significantly contribute to academic research and practical applications, such as traffic flow analysis and simulation studies. The dataset and related resources are publicly accessible at https://github.com/AIxMobility/The-DRIFT.
Authors: Jose Francisco Diez-Pastor, Francisco Javier Gonzalez-Moya, Pedro Latorre-Carmona, Francisco Javier Perez-Barber\'ia, Ludmila I. Kuncheva, Antonio Canepa-Oneto, Alvar Arnaiz-Gonz\'alez, Cesar Garcia-Osorio
Abstract: Identifying spatial regions where biodiversity is threatened is crucial for effective ecosystem conservation and monitoring. In this study, we assessed various machine learning methods to detect grazing trails automatically. We tested five semantic segmentation models combined with 14 different encoder networks. The best combination was UNet with a MambaOut encoder. The proposed solution could serve as the basis for tools aimed at mapping and tracking changes in grazing trails on a continuous temporal basis.
Authors: Alex Ergasti, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Abstract: Diffusion models have become the most popular approach for high-quality image generation, but their high computational cost still remains a significant challenge. To address this problem, we propose U-Shape Mamba (USM), a novel diffusion model that leverages Mamba-based layers within a U-Net-like hierarchical structure. By progressively reducing sequence length in the encoder and restoring it in the decoder through Mamba blocks, USM significantly lowers computational overhead while maintaining strong generative capabilities. Experimental results against Zigma, which is currently the most efficient Mamba-based diffusion model, demonstrate that USM achieves one-third the GFlops, requires less memory and is faster, while outperforming Zigma in image quality. Frechet Inception Distance (FID) is improved by 15.3, 0.84 and 2.7 points on AFHQ, CelebAHQ and COCO datasets, respectively. These findings highlight USM as a highly efficient and scalable solution for diffusion-based generative models, making high-quality image synthesis more accessible to the research community while reducing computational costs.
Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, Chang Wen Chen
Abstract: Next token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. As an effective way to model language, image, video, and other modalities, the use of LLMs for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. It requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset and subsequently refined using reinforcement learning to enhance its ability to generate scene graphs in an end-to-end manner. The SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. Given the structured nature of scene graphs, we design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward. Our experiments demonstrate that rule-based RL substantially enhances model performance in the SGG task, achieving a zero failure rate--unlike supervised fine-tuning (SFT), which struggles to generalize effectively. Our code is available at https://github.com/gpt4vision/R1-SGG.
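A toy illustration of a graph-centric reward combining node-level, edge-level, and format terms, in the spirit of the description above (the exact reward used by R1-SGG is not reproduced here; weights and field names are assumptions):

def scene_graph_reward(pred, gold, w_node=0.4, w_edge=0.4, w_format=0.2):
    """pred/gold: dicts with "objects" (list of labels) and "triplets"
    (list of (subject, predicate, object) tuples)."""
    # Format reward: the prediction parses into the expected structure.
    ok = isinstance(pred, dict) and "objects" in pred and "triplets" in pred
    r_format = 1.0 if ok else 0.0
    if not ok:
        return w_format * r_format

    def f1(p, g):
        p, g = set(p), set(g)
        if not p or not g:
            return 0.0
        prec, rec = len(p & g) / len(p), len(p & g) / len(g)
        return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

    r_node = f1(pred["objects"], gold["objects"])
    r_edge = f1(map(tuple, pred["triplets"]), map(tuple, gold["triplets"]))
    return w_node * r_node + w_edge * r_edge + w_format * r_format

gold = {"objects": ["man", "horse", "hat"],
        "triplets": [("man", "riding", "horse"), ("man", "wearing", "hat")]}
pred = {"objects": ["man", "horse"],
        "triplets": [("man", "riding", "horse")]}
print(scene_graph_reward(pred, gold))   # partial credit for a partially correct graph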
Authors: Hai Pham-Ngoc, De Nguyen-Van, Dung Vu-Tien, Phuong Le-Hong
Abstract: Background: Automated classification of thyroid Fine Needle Aspiration Biopsy (FNAB) images faces challenges in limited data, inter-observer variability, and computational cost. Efficient, interpretable models are crucial for clinical support. Objective: To develop and externally validate a deep learning system for multi-class thyroid FNAB image classification into three key categories directly guiding post-biopsy treatment in Vietnam: Benign (Bethesda II), Indeterminate/Suspicious (BI, III, IV, V), and Malignant (BVI), achieving high diagnostic accuracy with low computational overhead. Methods: Our pipeline features: (1) YOLOv10 cell cluster detection for informative sub-region extraction/noise reduction; (2) curriculum learning sequencing localized crops to full images for multi-scale capture; (3) adaptive lightweight EfficientNetB0 (4M parameters) balancing performance/efficiency; and (4) a Transformer-inspired module for multi-scale/multi-region analysis. External validation used 1,015 independent FNAB images. Results: ThyroidEffi Basic achieved macro F1 of 89.19% and AUCs of 0.98 (Benign), 0.95 (Indeterminate/Suspicious), 0.96 (Malignant) on the internal test set. External validation yielded AUCs of 0.9495 (Benign), 0.7436 (Indeterminate/Suspicious), 0.8396 (Malignant). ThyroidEffi Premium improved macro F1 to 89.77%. Grad-CAM highlighted key diagnostic regions, confirming interpretability. The system processed 1000 cases in 30 seconds, demonstrating feasibility on widely accessible hardware. Conclusions: This work demonstrates that high-accuracy, interpretable thyroid FNAB image classification is achievable with minimal computational demands.
Authors: Mingxia Zhan, Li Zhang, Xiaomeng Chu, Beibei Wang
Abstract: Monocular depth estimation (MDE) aims to predict per-pixel depth values from a single RGB image. Recent advancements have positioned diffusion models as effective MDE tools by framing the challenge as a conditional image generation task. Despite their progress, these methods often struggle with accurately reconstructing distant depths, due largely to the imbalanced distribution of depth values and an over-reliance on spatial-domain features. To overcome these limitations, we introduce VistaDepth, a novel framework that integrates adaptive frequency-domain feature enhancements with an adaptive weight-balancing mechanism into the diffusion process. Central to our approach is the Latent Frequency Modulation (LFM) module, which dynamically refines spectral responses in the latent feature space, thereby improving the preservation of structural details and reducing noisy artifacts. Furthermore, we implement an adaptive weighting strategy that modulates the diffusion loss in real-time, enhancing the model's sensitivity towards distant depth reconstruction. These innovations collectively result in superior depth perception performance across both distance and detail. Experimental evaluations confirm that VistaDepth achieves state-of-the-art performance among diffusion-based MDE techniques, particularly excelling in the accurate reconstruction of distant regions.
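To make the frequency-domain idea concrete, here is a minimal sketch of spectral gating of latent features: transform latent maps to the frequency domain, rescale the spectrum with a learned per-channel response, and transform back. This is an assumption-laden illustration of the concept, not the paper's LFM module.

import torch
import torch.nn as nn

class LatentFrequencyModulation(nn.Module):
    def __init__(self, channels, size):
        super().__init__()
        # one complex-valued gain per channel and rfft2 frequency bin
        self.gain = nn.Parameter(torch.ones(channels, size, size // 2 + 1, 2))

    def forward(self, x):                                   # x: (B, C, H, W) latents
        spec = torch.fft.rfft2(x, norm="ortho")             # complex spectrum
        gain = torch.view_as_complex(self.gain)             # learned spectral response
        out = torch.fft.irfft2(spec * gain, s=x.shape[-2:], norm="ortho")
        return x + out                                      # residual refinement

lfm = LatentFrequencyModulation(channels=4, size=32)
y = lfm(torch.randn(2, 4, 32, 32))
print(y.shape)   # torch.Size([2, 4, 32, 32])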
Authors: Xiao Zhang, Lu Zou, Tao Lu, Yuan Yao, Zhangjin Huang, Guoping Wang
Abstract: Category-level object pose estimation aims to predict the 6D pose and size of previously unseen instances from predefined categories, requiring strong generalization across diverse object instances. Although many previous methods attempt to mitigate intra-class variations, they often struggle with instances exhibiting complex geometries or significant deviations from canonical shapes. To address this challenge, we propose INKL-Pose, a novel category-level object pose estimation framework that enables INstance-adaptive Keypoint Learning with local-to-global geometric aggregation. Specifically, our approach first predicts semantically consistent and geometrically informative keypoints through an Instance-Adaptive Keypoint Generator, then refines them with: (1) a Local Keypoint Feature Aggregator capturing fine-grained geometries, and (2) a Global Keypoint Feature Aggregator using bidirectional Mamba for structural consistency. To enable bidirectional modeling in Mamba, we introduce a Feature Sequence Flipping strategy that preserves spatial coherence while constructing backward feature sequences. Additionally, we design a surface loss and a separation loss to enforce uniform coverage and spatial diversity in the keypoint distribution. The generated keypoints are finally mapped to a canonical space for regressing the object's 6D pose and size. Extensive experiments on CAMERA25, REAL275, and HouseCat6D demonstrate that INKL-Pose achieves state-of-the-art performance and significantly outperforms existing methods.
Authors: Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma
Abstract: Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and its capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.
Authors: Jingzhi Li, Changjiang Luo, Ruoyu Chen, Hua Zhang, Wenqi Ren, Jianhou Gan, Xiaochun Cao
Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated strong capabilities in understanding general visual content. However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, a versatile face perception MLLM that provides fine-grained facial information. Our approach introduces visual-textual alignment of facial knowledge to model both uncertain dependencies and deterministic relationships among facial information, mitigating the limitations of language-driven reasoning. Additionally, we incorporate face segmentation maps as an auxiliary perceptual modality, enriching the visual input with localized structural cues to enhance semantic understanding. Comprehensive experiments and analyses across three face perception tasks demonstrate that FaceInsight consistently outperforms nine compared MLLMs under both training-free and fine-tuned settings.
Authors: Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng
Abstract: Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance, yet existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image. Additionally, our framework incorporates a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
Authors: Qi Yang, Weichen Bi, Haiyang Shen, Yaoqi Guo, Yun Ma
Abstract: Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets often generate annotation information through automatic labeling, which commonly results in inaccurate GUI element BBox annotations, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, existing GUI datasets only provide BBox annotations visually, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation ensures accurate localization of GUI elements in cases of occlusion and overlapping elements by extracting BGRA four-channel bitmap annotations. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that PixelWeb achieves performance on the mAP95 metric that is 3-7 times better than existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.
Authors: Haosheng Chen, Lian Luo, Mengjingcheng Mo, Zhanjie Wu, Guobao Xiao, Ji Gan, Jiaxu Leng, Xinbo Gao
Abstract: Event cameras, with microsecond temporal resolution and high dynamic range (HDR) characteristics, emit high-speed event streams for perception tasks. Despite recent advances in GNN-based perception methods, these methods tend to rely on straightforward pairwise connectivity in purely Euclidean space, where they struggle to capture long-range dependencies and fail to effectively characterize the inherent hierarchical structures of non-uniformly distributed event streams. To this end, in this paper we propose EHGCN, a novel approach that, to our knowledge, is the first to perceive event streams in both Euclidean and hyperbolic spaces for event vision. In EHGCN, we introduce an adaptive sampling strategy to dynamically regulate sampling rates, retaining discriminative events while attenuating chaotic noise. Then we present a Markov Vector Field (MVF)-driven motion-aware hyperedge generation method based on motion state transition probabilities, thereby eliminating cross-target spurious associations and providing critical topological priors while capturing long-range dependencies between events. Finally, we propose a Euclidean-Hyperbolic GCN to fuse the information locally aggregated and globally hierarchically modeled in Euclidean and hyperbolic spaces, respectively, to achieve hybrid event perception. Experimental results on event perception tasks such as object detection and recognition validate the effectiveness of our approach.
Authors: Ning Li, Antai Andy Liu, Jingran Zhang, Justin Cui
Abstract: Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves a new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a 2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance increase. Our code is available at https://github.com/liningresearch/Latent_Video_Dataset_Distillation.
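One common diversity-aware selection heuristic that fits the description above is k-center greedy over latent embeddings; the sketch below shows that heuristic as an illustration, not necessarily the exact strategy used in the paper.

import numpy as np

def k_center_greedy(features, k, seed=0):
    """Select k diverse samples from latent features (N, D) by greedily adding
    the point farthest from the current selection."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                      # farthest from the selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected

latents = np.random.default_rng(1).normal(size=(1000, 64))   # e.g., per-clip VAE latents
print(k_center_greedy(latents, k=16))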
Authors: Shiwen Cao, Zhaoxing Zhang, Junming Jiao, Juyi Qiao, Guowen Song, Rong Shen, Xiangbing Meng
Abstract: Even in the era of rapid advances in large models, video understanding remains a highly challenging task. Compared to texts or images, videos commonly contain more information with redundancy, requiring large models to properly allocate attention at a global level for comprehensive and accurate understanding. To address this, we propose a Multimodal hierarchical Attention focusing Self-reflective Reasoning (MASR) framework for agent-based video understanding. The key innovation lies in its ability to detect and prioritize segments of videos that are highly relevant to the query. Firstly, MASR realizes Multimodal Coarse-to-fine Relevance Sensing (MCRS) which enhances the correlation between the acquired contextual information and the query. Secondly, MASR employs Dilated Temporal Expansion (DTE) to mitigate the risk of missing crucial details when extracting semantic information from the focused frames selected through MCRS. By iteratively applying MCRS and DTE in the self-reflective reasoning process, MASR is able to adaptively adjust the attention to extract highly query-relevant context and therefore improve the response accuracy. In the EgoSchema dataset, MASR achieves a remarkable 5% performance gain over previous leading approaches. In the Next-QA and IntentQA datasets, it outperforms the state-of-the-art standards by 0.2% and 0.3% respectively. In the Video-MME dataset that contains long-term videos, MASR also performs better than other agent-based methods.
Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang
Abstract: In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling the vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
Authors: Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
Abstract: Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction manner while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
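The shuffle/unshuffle operations described above are essentially a pixel-shuffle applied to a token grid. Below is a small self-contained sketch of that idea (shapes and defaults are assumptions, not the paper's code):

import torch

def token_shuffle(tokens, h, w, s=2):
    """Merge each s x s window of visual tokens into one token by concatenating
    along the channel dimension (token count drops by s*s, channels grow by s*s).
    tokens: (B, h*w, C) laid out row-major on an h x w grid."""
    B, N, C = tokens.shape
    x = tokens.view(B, h, w, C)
    x = x.view(B, h // s, s, w // s, s, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, (h // s) * (w // s), s * s * C)

def token_unshuffle(tokens, h, w, s=2):
    """Inverse of token_shuffle: split channels back into an s x s spatial window."""
    B, N, C = tokens.shape
    c = C // (s * s)
    x = tokens.view(B, h // s, w // s, s, s, c).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, h * w, c)

x = torch.randn(2, 16 * 16, 64)
y = token_shuffle(x, 16, 16, s=2)                      # (2, 64, 256): 4x fewer tokens
assert torch.allclose(token_unshuffle(y, 16, 16, s=2), x)
print(y.shape)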
Authors: Georgios Kouros, Minye Wu, Tinne Tuytelaars
Abstract: We introduce RGS-DR, a novel inverse rendering method for reconstructing and rendering glossy and reflective objects with support for flexible relighting and scene editing. Unlike existing methods (e.g., NeRF and 3D Gaussian Splatting), which struggle with view-dependent effects, RGS-DR utilizes a 2D Gaussian surfel representation to accurately estimate geometry and surface normals, an essential property for high-quality inverse rendering. Our approach explicitly models geometric and material properties through learnable primitives rasterized into a deferred shading pipeline, effectively reducing rendering artifacts and preserving sharp reflections. By employing a multi-level cube mipmap, RGS-DR accurately approximates environment lighting integrals, facilitating high-quality reconstruction and relighting. A residual pass with spherical-mipmap-based directional encoding further refines the appearance modeling. Experiments demonstrate that RGS-DR achieves high-quality reconstruction and rendering quality for shiny objects, often outperforming reconstruction-exclusive state-of-the-art methods incapable of relighting.
Authors: Nikolaos Louloudakis, Perry Gibson, Jos\'e Cano, Ajitha Rajan
Abstract: Converting deep learning models between frameworks is a common step to maximize model compatibility across devices and leverage optimization features that may be exclusively provided in one deep learning framework. However, this conversion process may be riddled with bugs, making the converted models either undeployable or problematic, considerably degrading their prediction correctness. In this paper, we propose an automated approach for fault localization and repair, FetaFix, during model conversion between deep learning frameworks. FetaFix is capable of detecting and fixing faults introduced in model input, parameters, hyperparameters, and the model graph during conversion. FetaFix uses a set of fault types (mined from surveying common conversion issues reported in code repositories and forums) to localize potential conversion faults in the converted target model and then repair them appropriately, e.g., replacing the parameters of the target model with those from the source model. This is done iteratively for every image in the dataset, comparing output label differences between the source model and the converted target model until all differences are resolved. We evaluate the effectiveness of FetaFix in fixing model conversion bugs of three widely used image recognition models converted across four different deep learning frameworks. Overall, FetaFix was able to fix $462$ out of $755$ detected conversion faults, either completely repairing or significantly improving the performance of $14$ out of the $15$ erroneous conversion cases.
Authors: Vladyslav Gapyak, Corinna Rentschler, Thomas M\"arz, Andreas Weinmann
Abstract: Objective: Magnetic particle imaging (MPI) is an emerging medical imaging modality which has gained increasing interest in recent years. Among the benefits of MPI are its high temporal resolution, and that the technique does not expose the specimen to any kind of ionizing radiation. It is based on the non-linear response of magnetic nanoparticles to an applied magnetic field. From the electric signal measured in receive coils, the particle concentration has to be reconstructed. Due to the ill-posedness of the reconstruction problem, various regularization methods have been proposed for reconstruction ranging from early stopping methods, via classical Tikhonov regularization and iterative methods to modern machine learning approaches. In this work, we contribute to the latter class: we propose a plug-and-play approach based on a generic zero-shot denoiser with an $\ell^1$-prior. Approach: We validate the reconstruction parameters of the method on a hybrid dataset and compare it with the baseline Tikhonov, DIP and the previous PP-MPI, which is a plug-and-play method with denoiser trained on MPI-friendly data. Main results: We offer a quantitative and qualitative evaluation of the zero-shot plug-and-play approach on the 3D Open MPI dataset. Moreover, we show the quality of the approach with different levels of preprocessing of the data. Significance: The proposed method employs a zero-shot denoiser which has not been trained for the MPI task and therefore saves the cost for training. Moreover, it offers a method that can be potentially applied in future MPI contexts.
Authors: Sungmin Cha, Kyunghyun Cho
Abstract: Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP) consisting of hyperparameter tuning and evaluation phases. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. Our implementation can be found in https://github.com/csm9493/GTEP.
Authors: Tianyu Zhu, Myong Chol Jung, Jesse Clark
Abstract: Contrastive learning has gained widespread adoption for retrieval tasks due to its minimal requirement for manual annotations. However, popular training frameworks typically learn from binary (positive/negative) relevance, making them ineffective at incorporating desired rankings. As a result, the poor ranking performance of these models forces systems to employ a re-ranker, which increases complexity, maintenance effort and inference time. To address this, we introduce Generalized Contrastive Learning (GCL), a training framework designed to learn from continuous ranking scores beyond binary relevance. GCL encodes both relevance and ranking information into a unified embedding space by applying ranking scores to the loss function. This enables a single-stage retrieval system. In addition, during our research, we identified a lack of public multi-modal datasets that benchmark both retrieval and ranking capabilities. To facilitate this and future research for ranked retrieval, we curated a large-scale MarqoGS-10M dataset using GPT-4 and Google Shopping, providing ranking scores for each of the 10 million query-document pairs. Our results show that GCL achieves a 29.3% increase in NDCG@10 for in-domain evaluations and 6.0% to 10.0% increases for cold-start evaluations compared to the finetuned CLIP baseline with MarqoGS-10M. Additionally, we evaluated GCL offline on a proprietary user interaction data. GCL shows an 11.2% gain for in-domain evaluations. The dataset and the method are available at: https://github.com/marqo-ai/GCL.
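A minimal sketch of the core idea of applying continuous ranking scores to a contrastive objective; this is an illustrative simplification, not the paper's exact GCL loss.

import torch
import torch.nn.functional as F

def gcl_loss(q, d, scores, temperature=0.07):
    """q, d: L2-normalized query and document embeddings, shape (B, dim).
    scores: relevance/ranking score in [0, 1] for each (q_i, d_i) pair."""
    logits = q @ d.t() / temperature                        # in-batch similarities
    targets = torch.arange(q.size(0), device=q.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    return (scores * per_pair).mean()                       # high-scoring pairs dominate

q = F.normalize(torch.randn(8, 128), dim=-1)
d = F.normalize(torch.randn(8, 128), dim=-1)
scores = torch.rand(8)
print(gcl_loss(q, d, scores))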
Authors: Mohammad Mohammadi, Sreejita Ghosh
Abstract: Classification of sets of inputs (e.g., images and texts) is an active area of research within both computer vision (CV) and natural language processing (NLP). A common way to represent a set of vectors is to model them as linear subspaces. In this contribution, we present a prototype-based approach for learning on the manifold formed from such linear subspaces, the Grassmann manifold. Our proposed method learns a set of subspace prototypes capturing the representative characteristics of classes and a set of relevance factors automating the selection of the dimensionality of the subspaces. This leads to a transparent classifier model which presents the computed impact of each input vector on its decision. Through experiments on benchmark image and text datasets, we have demonstrated the efficiency of our proposed classifier, compared to the transformer-based models in terms of not only performance and explainability but also computational resource requirements.
Authors: Yuran Wang, Zhijing Wan, Yansheng Qiu, Zheng Wang
Abstract: In the realm of medical image analysis, self-supervised learning (SSL) techniques have emerged to alleviate labeling demands, while still facing the challenge of training data scarcity owing to escalating resource requirements and privacy constraints. Numerous efforts employ generative models to generate high-fidelity, unlabeled 3D volumes across diverse modalities and anatomical regions. However, the intricate and indistinguishable anatomical structures within the abdomen pose a unique challenge to abdominal CT volume generation compared to other anatomical regions. To address this overlooked challenge, we introduce Locality-Aware Diffusion (Lad), a novel method tailored for exquisite 3D abdominal CT volume generation. We design a locality loss to refine crucial anatomical regions and devise a condition extractor to integrate abdominal priors into generation, thereby enabling the generation of large quantities of high-quality abdominal CT volumes essential for SSL tasks without the need for additional data such as labels or radiology reports. Volumes generated through our method demonstrate remarkable fidelity in reproducing abdominal structures, achieving a decrease in FID score from 0.0034 to 0.0002 on the AbdomenCT-1K dataset, closely mirroring authentic data and surpassing current methods. Extensive experiments demonstrate the effectiveness of our method in self-supervised organ segmentation tasks, yielding improved mean Dice scores on two abdominal datasets. These results underscore the potential of synthetic data to advance self-supervised learning in medical image analysis.
Authors: Barak Gahtan, Robert J. Shahla, Reuven Cohen, Alex M. Bronstein
Abstract: QUIC, a new and increasingly used transport protocol, enhances TCP by offering improved security, performance, and stream multiplexing. These features, however, also impose challenges for network middleboxes that need to monitor and analyze web traffic. This paper proposes a novel method to estimate the number of HTTP/3 responses in a given QUIC connection by an observer. This estimation reveals server behavior, client-server interactions, and data transmission efficiency, which is crucial for various applications such as designing a load balancing solution and detecting HTTP/3 flood attacks. The proposed scheme transforms QUIC connection traces into image sequences and uses machine learning (ML) models, guided by a tailored loss function, to predict response counts. Evaluations on more than seven million images, derived from 100,000 traces collected across 44,000 websites over four months, achieve up to 97% accuracy in both known and unknown server settings and 92% accuracy on previously unseen complete QUIC traces.
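To illustrate the trace-to-image step in general terms (the paper's exact image construction may differ), a connection trace of packet arrival times and sizes can be rendered as a 2D histogram where rows are size bins and columns are time windows:

import numpy as np

def trace_to_image(arrival_times, sizes, window=0.1, n_windows=32, n_bins=32, max_size=1500):
    """Render a (possibly partial) trace as an n_bins x n_windows count image."""
    t = np.asarray(arrival_times, dtype=float)
    t = t - t[0]
    col = np.minimum((t / window).astype(int), n_windows - 1)
    row = np.minimum((np.asarray(sizes) / max_size * n_bins).astype(int), n_bins - 1)
    img = np.zeros((n_bins, n_windows), dtype=np.float32)
    np.add.at(img, (row, col), 1.0)                  # count packets per cell
    return img / max(img.max(), 1.0)                 # normalize to [0, 1]

rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0, 3.0, 500))            # 500 packets over ~3 s
sizes = rng.integers(60, 1500, 500)
print(trace_to_image(times, sizes).shape)            # (32, 32)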
Authors: Tao Song, Yicheng Wu, Minhao Hu, Xiangde Luo, Linda Wei, Guotai Wang, Yi Guo, Feng Xu, Shaoting Zhang
Abstract: Multimodal MR image synthesis aims to generate missing modality images by effectively fusing and mapping from a subset of available MRI modalities. Most existing methods adopt an image-to-image translation paradigm, treating multiple modalities as input channels. However, these approaches often yield sub-optimal results due to the inherent difficulty in achieving precise feature- or semantic-level alignment across modalities. To address these challenges, we propose an Adaptive Group-wise Interaction Network (AGI-Net) that explicitly models both inter-modality and intra-modality relationships for multimodal MR image synthesis. Specifically, feature channels are first partitioned into predefined groups, after which an adaptive rolling mechanism is applied to conventional convolutional kernels to better capture feature and semantic correspondences between different modalities. In parallel, a cross-group attention module is introduced to enable effective feature fusion across groups, thereby enhancing the network's representational capacity. We validate the proposed AGI-Net on the publicly available IXI and BraTS2023 datasets. Experimental results demonstrate that AGI-Net achieves state-of-the-art performance in multimodal MR image synthesis tasks, confirming the effectiveness of its modality-aware interaction design. We release the relevant code at: https://github.com/zunzhumu/Adaptive-Group-wise-Interaction-Network-for-Multimodal-MRI-Synthesis.git.
Authors: Tao Song, Yicheng Wu, Minhao Hu, Xiangde Luo, Guoting Luo, Guotai Wang, Yi Guo, Feng Xu, Shaoting Zhang
Abstract: Accelerated MRI reconstruction plays a vital role in reducing scan time while preserving image quality. While most existing methods rely on complex-valued image-space or k-space data, these formats are often inaccessible in clinical practice due to proprietary reconstruction pipelines, leaving only magnitude images stored in DICOM files. To address this gap, we focus on the underexplored task of magnitude-image-based MRI reconstruction. Recent advancements in diffusion models, particularly denoising diffusion probabilistic models (DDPMs), have demonstrated strong capabilities in modeling image priors. However, their task-agnostic denoising nature limits performance in source-to-target image translation tasks, such as MRI reconstruction. In this work, we propose a novel Self-Consistent Nested Diffusion Bridge (SC-NDB) framework that models accelerated MRI reconstruction as a bi-directional image translation process between under-sampled and fully-sampled magnitude MRI images. SC-NDB introduces a nested diffusion architecture with a self-consistency constraint and reverse bridge diffusion pathways to improve intermediate prediction fidelity and better capture the explicit priors of source images. Furthermore, we incorporate a Contour Decomposition Embedding Module (CDEM) to inject structural and textural knowledge by leveraging Laplacian pyramids and directional filter banks. Extensive experiments on the fastMRI and IXI datasets demonstrate that our method achieves state-of-the-art performance compared to both magnitude-based and non-magnitude-based diffusion models, confirming the effectiveness and clinical relevance of SC-NDB.
Authors: Huayi Zhou, Ruixiang Wang, Yunxin Tai, Yueci Deng, Guiliang Liu, Kui Jia
Abstract: Bimanual robotic manipulation is a long-standing challenge of embodied intelligence due to its characteristics of dual-arm spatial-temporal coordination and high-dimensional action spaces. Previous studies rely on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, often making them lack simplicity, versatility and scalability. Differently, we believe that the most effective and efficient way for teaching bimanual manipulation is learning from human demonstrated videos, where rich features such as spatial-temporal positions, dynamic postures, interaction states and dexterous transitions are available almost for free. In this work, we propose the YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as few as a single binocular observation of hand movements, and teach dual robot arms various complex tasks. Furthermore, based on keyframes-based motion trajectories, we devise a subtle solution for rapidly generating training demonstrations with diverse variations of manipulated objects and their locations. These data can then be used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. In experiments, YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks, possesses strong generalization under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency. Our project link is https://hnuzhy.github.io/projects/YOTO.
Authors: Saikat Roy, Mahmoud Mostapha, Radu Miron, Matt Holbrook, Mariappan Nadar
Abstract: Plug-and-play approaches to solving inverse problems such as restoration and super-resolution have recently benefited from Diffusion-based generative priors for natural as well as medical images. However, solutions often use the standard albeit computationally intensive route of training and inferring with the whole image on the diffusion prior. While patch-based approaches to evaluating diffusion priors in plug-and-play methods have received some interest, they remain an open area of study. In this work, we explore the feasibility of the usage of patches for training and inference of a diffusion prior on MRI images. We explore the minor adaptation necessary for artifact avoidance, the performance and the efficiency of memory usage of patch-based methods as well as the adaptability of whole image training to patch-based evaluation - evaluating across multiple plug-and-play methods, tasks and datasets.
Authors: Amogh Joshi, Sourav Sanyal, Kaushik Roy
Abstract: The integration of human-intuitive interactions into autonomous systems has been limited. Traditional Natural Language Processing (NLP) systems struggle with context and intent understanding, severely restricting human-robot interaction. Recent advancements in Large Language Models (LLMs) have transformed this dynamic, allowing for intuitive and high-level communication through speech and text, and bridging the gap between human commands and robotic actions. Additionally, autonomous navigation has emerged as a central focus in robotics research, with artificial intelligence (AI) increasingly being leveraged to enhance these systems. However, existing AI-based navigation algorithms face significant challenges in latency-critical tasks where rapid decision-making is critical. Traditional frame-based vision systems, while effective for high-level decision-making, suffer from high energy consumption and latency, limiting their applicability in real-time scenarios. Neuromorphic vision systems, combining event-based cameras and spiking neural networks (SNNs), offer a promising alternative by enabling energy-efficient, low-latency navigation. Despite their potential, real-world implementations of these systems, particularly on physical platforms such as drones, remain scarce. In this work, we present Neuro-LIFT, a real-time neuromorphic navigation framework implemented on a Parrot Bebop2 quadrotor. Leveraging an LLM for natural language processing, Neuro-LIFT translates human speech into high-level planning commands which are then autonomously executed using event-based neuromorphic vision and physics-driven planning. Our framework demonstrates its capabilities in navigating in a dynamic environment, avoiding obstacles, and adapting to human instructions in real-time.
Authors: Eirini Schoinas, Adyah Rastogi, Anissa Carter, Jacob Granley, Michael Beyeler
Abstract: Human-in-the-loop optimization (HILO) is a promising approach for personalizing visual prostheses by iteratively refining stimulus parameters based on user feedback. Previous work demonstrated HILO's efficacy in simulation, but its performance with human participants remains untested. Here we evaluate HILO using sighted participants viewing simulated prosthetic vision to assess its ability to optimize stimulation strategies under realistic conditions. Participants selected between phosphenes generated by competing encoders to iteratively refine a deep stimulus encoder (DSE). We tested HILO in three conditions: standard optimization, threshold misspecifications, and out-of-distribution parameter sampling. Participants consistently preferred HILO-generated stimuli over both a naive encoder and the DSE alone, with log odds favoring HILO across all conditions. We also observed key differences between human and simulated decision-making, highlighting the importance of validating optimization strategies with human participants. These findings support HILO as a viable approach for adapting visual prostheses to individuals. Clinical relevance: Validating HILO with sighted participants viewing simulated prosthetic vision is an important step toward personalized calibration of future visual prostheses.
Authors: Haoyu Han, Heng Yang
Abstract: Global bundle adjustment is made easy by depth prediction and convex optimization. We (i) propose a scaled bundle adjustment (SBA) formulation that lifts 2D keypoint measurements to 3D with learned depth, (ii) design an empirically tight convex semidefinite program (SDP) relaxation that solves SBA to certifiable global optimality, (iii) solve the SDP relaxations at extreme scale with Burer-Monteiro factorization and a CUDA-based trust-region Riemannian optimizer (dubbed XM), and (iv) build a structure from motion (SfM) pipeline with XM as the optimization engine, showing that XM-SfM compares favorably with existing pipelines in terms of reconstruction quality while being significantly faster, more scalable, and initialization-free.
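The basic lifting step in (i) is standard back-projection with predicted depth; the sketch below shows only that step (the SBA objective and SDP relaxation are not reproduced).

import numpy as np

def lift_keypoints(uv, depth, K):
    """Lift pixel keypoints to 3D camera coordinates: X = d * K^{-1} [u, v, 1]^T.
    uv: (N, 2) pixel coordinates, depth: (N,) predicted depths, K: 3x3 intrinsics."""
    ones = np.ones((uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T     # normalized rays, 3 x N
    return (rays * depth).T                               # (N, 3) camera-frame points

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
uv = np.array([[320.0, 240.0], [400.0, 300.0]])
depth = np.array([2.0, 3.5])
print(lift_keypoints(uv, depth, K))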
Authors: Moo Jin Kim, Chelsea Finn, Percy Liang
Abstract: Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($\pi_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.
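As a rough illustration of the "parallel decoding + action chunking + L1 regression" combination (the OFT head itself is more involved, and hidden/action dimensions here are assumptions), a chunked continuous action head could look like this:

import torch
import torch.nn as nn

class ChunkedActionHead(nn.Module):
    """Predict a whole chunk of future actions in one parallel pass and train it
    with a plain L1 regression objective."""

    def __init__(self, hidden_dim=512, action_dim=7, chunk_len=8):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        self.head = nn.Linear(hidden_dim, chunk_len * action_dim)

    def forward(self, h):                              # h: (B, hidden_dim) VLM feature
        return self.head(h).view(-1, self.chunk_len, self.action_dim)

    def loss(self, h, target_actions):                 # target: (B, chunk_len, action_dim)
        return torch.abs(self.forward(h) - target_actions).mean()   # L1 objective

head = ChunkedActionHead()
h = torch.randn(4, 512)
targets = torch.randn(4, 8, 7)
print(head.loss(h, targets))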
Authors: Kai Uwe Barthel, Florian Barthel, Peter Eisert
Abstract: Sorting and permutation learning are key concepts in optimization and machine learning, especially when organizing high-dimensional data into meaningful spatial layouts. The Gumbel-Sinkhorn method, while effective, requires N*N parameters to determine a full permutation matrix, making it computationally expensive for large datasets. Low-rank matrix factorization approximations reduce memory requirements to 2NM (with M << N), but they still struggle with very large problems. SoftSort, by providing a continuous relaxation of the argsort operator, allows differentiable 1D sorting, but it faces challenges with multidimensional data and complex permutations. In this paper, we present a novel method for learning permutations using only N parameters, which dramatically reduces storage costs. Our method extends SoftSort by iteratively shuffling the N indices of the elements and applying a few SoftSort optimization steps per iteration. This modification significantly improves sorting quality, especially for multidimensional data and complex optimization criteria, and outperforms pure SoftSort. Our method offers improved memory efficiency and scalability compared to existing approaches, while maintaining high-quality permutation learning. Its dramatically reduced memory requirements make it particularly well-suited for large-scale optimization tasks, such as "Self-Organizing Gaussians", where efficient and scalable permutation learning is critical.
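For reference, the SoftSort relaxation that the abstract builds on can be written in a few lines; the iterative index-shuffling extension described above is not reproduced here.

import torch

def soft_sort(s, tau=1.0):
    """SoftSort relaxation of argsort: returns a row-stochastic matrix P that
    approaches the exact (descending) sorting permutation of s as tau -> 0."""
    s = s.unsqueeze(-1)                                 # (N, 1)
    sorted_s, _ = torch.sort(s, dim=0, descending=True)
    pairwise = -(sorted_s - s.transpose(0, 1)).abs()    # (N, N) negative distances
    return torch.softmax(pairwise / tau, dim=-1)

scores = torch.tensor([0.3, 2.0, -1.0, 0.7])
P = soft_sort(scores, tau=0.05)
print(P.argmax(dim=-1))     # tensor([1, 3, 0, 2]): indices that sort scores descending
print(P @ scores)           # approximately the scores sorted in descending order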
Authors: Ivan Sviridov, Konstantin Egorov
Abstract: Cardiovascular diseases (CVDs) are disorders affecting the heart and circulatory system, and they remain the foremost and continuously escalating cause of mortality worldwide. One of the main tasks when working with CVDs is analyzing and identifying pathologies on a 12-lead electrocardiogram (ECG) with a standard 10-second duration. Using machine learning (ML) for automatic ECG analysis increases the availability, speed, and accuracy of CVD diagnostics. However, the most significant difficulty in developing ML models is obtaining a sufficient training dataset. Because of the limitations of medical data usage, such as cost, labeling errors and ambiguity, class imbalance, and privacy issues, utilizing synthetic samples conditioned on specific pathologies bypasses these restrictions and improves algorithm quality. Existing solutions for the conditional generation of ECG signals are mainly built on Generative Adversarial Networks (GANs), and only a few papers consider architectures based on Variational Autoencoders (VAEs), which have shown comparable results in recent work. This paper proposes a publicly available conditional Nouveau VAE model for ECG signal generation (cNVAE-ECG), which produces high-resolution ECGs with multiple pathologies. We provide an extensive comparison of the proposed model on various practical downstream tasks, including transfer learning scenarios, showing an increase in the area under the receiver operating characteristic curve (AUROC) of up to 2% over GAN-like competitors.
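For readers unfamiliar with the setup, a minimal conditional-VAE objective for fixed-length 1D signals looks like the sketch below. This is a flat cVAE illustration only; the paper's cNVAE-ECG is a hierarchical Nouveau VAE, and all layer sizes, signal lengths, and label encodings here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConditionalVAE(nn.Module):
    """Flat conditional VAE for fixed-length 1D signals; the label vector y
    (e.g. a multi-hot pathology code) conditions both encoder and decoder."""
    def __init__(self, sig_len=1000, n_labels=5, latent=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(sig_len + n_labels, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent + n_labels, 256), nn.ReLU(),
                                 nn.Linear(256, sig_len))

    def forward(self, x, y):
        mu, logvar = self.enc(torch.cat([x, y], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterization trick
        recon = self.dec(torch.cat([z, y], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return F.mse_loss(recon, x) + 1e-3 * kl                     # ELBO-style objective

# Toy training step: condition generation on a chosen pathology label
model = TinyConditionalVAE()
x = torch.randn(4, 1000)                        # stand-in for normalized ECG segments
y = torch.zeros(4, 5); y[:, 2] = 1.0            # multi-hot pathology condition
loss = model(x, y)
loss.backward()
```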
Authors: Youyi Zhan, Tianjia Shao, Yin Yang, Kun Zhou
Abstract: Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance but with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that can reconstruct high-fidelity pose-dependent appearance with fine details and can still be rendered in real time. Our Gaussian avatar is empowered by spatially distributed MLPs that are explicitly located at different positions on the human body. The parameters stored in each Gaussian are obtained by interpolating the outputs of its nearby MLPs based on their distances. To avoid undesirably smooth changes in Gaussian properties during interpolation, for each Gaussian we define a set of Gaussian offset basis vectors, and a linear combination of these basis vectors represents the Gaussian property offsets relative to the neutral properties. We then let the MLPs output a set of coefficients corresponding to the basis vectors. In this way, although the Gaussian coefficients are derived from interpolation and change smoothly, the basis vectors are learned freely without constraints. The smoothly varying coefficients combined with the freely learned basis vectors can still produce distinctly different Gaussian property offsets, enabling the model to learn high-frequency spatial signals. We further use control points to constrain the Gaussians to a surface layer, rather than allowing them to be irregularly distributed inside the body, which helps the avatar generalize better when animated under novel poses. Compared to the state-of-the-art method, our method achieves better appearance quality with finer details while rendering significantly faster under novel views and novel poses.
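The interpolation-plus-basis idea can be summarized in a short schematic: each Gaussian's coefficients are distance-weighted interpolations of nearby MLP outputs, and its property offset is a linear combination of its own freely learned basis vectors. The sketch below uses assumed shapes and simple inverse-distance weights purely for illustration; it is not the authors' implementation.

```python
import torch

def gaussian_property_offsets(mlp_coeffs, mlp_positions, gaussian_positions, offset_basis):
    """Schematic of coefficient interpolation + offset basis.
    mlp_coeffs:         (M, K)    coefficients output by M spatially distributed MLPs
    mlp_positions:      (M, 3)    body-surface locations of those MLPs
    gaussian_positions: (G, 3)    centers of the Gaussians
    offset_basis:       (G, K, D) per-Gaussian learnable basis of property offsets
    returns:            (G, D)    property offsets relative to the neutral properties
    """
    dist = torch.cdist(gaussian_positions, mlp_positions)      # (G, M) distances to MLPs
    w = 1.0 / (dist + 1e-6)
    w = w / w.sum(dim=-1, keepdim=True)                        # normalized inverse-distance weights
    coeffs = w @ mlp_coeffs                                    # (G, K) smoothly varying coefficients
    return torch.einsum('gk,gkd->gd', coeffs, offset_basis)    # (G, D) per-Gaussian offsets

# Toy shapes: 4 MLPs, 1000 Gaussians, 8 basis vectors, 11-D property offsets
offsets = gaussian_property_offsets(torch.randn(4, 8), torch.rand(4, 3),
                                    torch.rand(1000, 3), torch.randn(1000, 8, 11))
print(offsets.shape)  # torch.Size([1000, 11])
```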
Authors: Md Abdul Baset Sarker, Art Nguyen, Sigmond Kukla, Kevin Fite, Masudul H. Imtiaz
Abstract: This paper introduces a novel AI vision-enabled pediatric prosthetic hand designed to assist children aged 10-12 with upper limb disabilities. The prosthesis features an anthropomorphic appearance, multi-articulating functionality, and a lightweight design that mimics a natural hand, making it both accessible and affordable for low-income families. Using 3D printing technology and integrating advanced machine vision, sensing, and embedded computing, the prosthetic hand offers a low-cost, customizable solution that addresses the limitations of current myoelectric prostheses. A micro camera is interfaced with a low-power FPGA for real-time object detection and assists with precise grasping. The onboard DL-based object detection and grasp classification models achieved accuracies of 96% and 100%, respectively, and force prediction achieved a mean absolute error of 0.018. The features of the proposed prosthetic hand can thus be summarized as: a) a wrist-mounted micro camera for artificial sensing, enabling a wide range of hand-based tasks; b) real-time object detection and distance estimation for precise grasping; and c) ultra-low-power operation that delivers high performance within constrained power and resource limits.
Authors: Drew Linsley, Pinyuan Feng, Thomas Serre
Abstract: Deep neural networks (DNNs) once showed increasing alignment with primate perception and neural responses as they improved on vision benchmarks, raising hopes that advances in AI would yield better models of biological vision. However, we show across three benchmarks that this alignment is now plateauing - and in some cases worsening - as DNNs scale to human or superhuman accuracy. This divergence may reflect the adoption of visual strategies that differ from those used by primates. These findings challenge the view that progress in artificial intelligence will naturally translate to neuroscience. We argue that vision science must chart its own course, developing algorithms grounded in biological visual systems rather than optimizing for benchmarks based on internet-scale datasets.
Authors: Chuyu Wang, Huiting Deng, Dong Liu
Abstract: Electrical Impedance Tomography (EIT) provides a non-invasive, portable imaging modality with significant potential in medical and industrial applications. Despite its advantages, EIT encounters two primary challenges: the ill-posed nature of its inverse problem and the spatially variable, location-dependent sensitivity distribution. Traditional model-based methods mitigate ill-posedness through regularization but overlook sensitivity variability, while supervised deep learning approaches require extensive training data and lack generalization. Recent developments in neural fields have introduced implicit regularization techniques for image reconstruction, but these methods typically neglect the physical principles underlying EIT, thus limiting their effectiveness. In this study, we propose PhyNC (Physics-driven Neural Compensation), an unsupervised deep learning framework that incorporates the physical principles of EIT. PhyNC addresses both the ill-posed inverse problem and the sensitivity distribution by dynamically allocating neural representational capacity to regions with lower sensitivity, ensuring accurate and balanced conductivity reconstructions. Extensive evaluations on both simulated and experimental data demonstrate that PhyNC outperforms existing methods in terms of detail preservation and artifact resistance, particularly in low-sensitivity regions. Our approach enhances the robustness of EIT reconstructions and provides a flexible framework that can be adapted to other imaging modalities with similar challenges.
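As a point of reference, an unsupervised neural-field reconstruction loop for EIT (without PhyNC's sensitivity-aware compensation) might look like the sketch below. The linearized forward model v ≈ J·σ, the matrix sizes, and all names are assumptions made here for illustration; the paper's full framework adds physics-driven terms that reallocate representational capacity toward low-sensitivity regions.

```python
import torch
import torch.nn as nn

class ConductivityField(nn.Module):
    """Coordinate MLP sigma(x, y): an implicit (neural-field) representation of the
    conductivity image, fitted to measurements without any training dataset."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, coords):                     # (P, 2) pixel coordinates in [-1, 1]
        return self.net(coords).squeeze(-1)        # (P,) conductivity values

# Illustrative unsupervised fitting loop with a linearized forward model v ≈ J @ sigma,
# where J is an EIT sensitivity (Jacobian) matrix and v_meas the boundary voltages.
coords = torch.stack(torch.meshgrid(torch.linspace(-1, 1, 32),
                                    torch.linspace(-1, 1, 32), indexing='ij'), -1).reshape(-1, 2)
J = torch.randn(208, coords.shape[0])              # stand-in sensitivity matrix
v_meas = torch.randn(208)                          # stand-in measurement vector
field = ConductivityField()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    sigma = field(coords)
    loss = ((J @ sigma - v_meas) ** 2).mean()      # data fidelity only; PhyNC adds compensation terms
    loss.backward()
    opt.step()
```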