Authors: Haoxiang Chen, Wei Zhao, Rufei Zhang, Nannan Li, Dongjin Li
Abstract: In the context of multi-object tracking using video synthetic aperture radar (Video SAR), Doppler shifts induced by target motion result in artifacts that are easily mistaken for shadows caused by static occlusions. Moreover, appearance changes of the target caused by Doppler mismatch may lead to association failures and disrupt trajectory continuity. A major limitation in this field is the lack of public benchmark datasets for standardized algorithm evaluation. To address the above challenges, we collected and annotated 45 Video SAR sequences containing moving targets and named the resulting dataset the Video SAR MOT Benchmark (VSMB). Specifically, to mitigate the effects of trailing and defocusing in moving targets, we introduce a line feature enhancement mechanism that emphasizes the positive role of motion shadows and reduces false alarms induced by static occlusions. In addition, to mitigate the adverse effects of target appearance variations, we propose a motion-aware clue discarding mechanism that substantially improves tracking robustness in Video SAR. The proposed model achieves state-of-the-art performance on the VSMB, and the dataset and model are released at https://github.com/softwarePupil/VSMB.
Authors: Naomi Fridman, Bubby Solway, Tomer Fridman, Itamar Barnea, Anat Goldshtein
Abstract: Breast cancer remains a leading cause of cancer-related mortality worldwide, making early detection and accurate treatment response monitoring critical priorities. We present BreastDCEDL, a curated, deep learning-ready dataset comprising pre-treatment 3D Dynamic Contrast-Enhanced MRI (DCE-MRI) scans from 2,070 breast cancer patients drawn from the I-SPY1, I-SPY2, and Duke cohorts, all sourced from The Cancer Imaging Archive. The raw DICOM imaging data were rigorously converted into standardized 3D NIfTI volumes with preserved signal integrity, accompanied by unified tumor annotations and harmonized clinical metadata including pathologic complete response (pCR), hormone receptor (HR), and HER2 status. Although DCE-MRI provides essential diagnostic information and deep learning offers tremendous potential for analyzing such complex data, progress has been limited by lack of accessible, public, multicenter datasets. BreastDCEDL addresses this gap by enabling development of advanced models, including state-of-the-art transformer architectures that require substantial training data. To demonstrate its capacity for robust modeling, we developed the first transformer-based model for breast DCE-MRI, leveraging Vision Transformer (ViT) architecture trained on RGB-fused images from three contrast phases (pre-contrast, early post-contrast, and late post-contrast). Our ViT model achieved state-of-the-art pCR prediction performance in HR+/HER2- patients (AUC 0.94, accuracy 0.93). BreastDCEDL includes predefined benchmark splits, offering a framework for reproducible research and enabling clinically meaningful modeling in breast cancer imaging.
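The RGB-fusion step described for the ViT input can be sketched in a few lines: assuming three co-registered 2D slices (one per contrast phase), each phase is normalized and stacked as a color channel so a standard 3-channel ViT can consume it. The function names and the min-max normalization are illustrative assumptions, not the authors' exact preprocessing.

```python
import numpy as np

def minmax(x, eps=1e-8):
    """Scale a slice to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def fuse_phases(pre, early, late):
    """Stack pre-, early post-, and late post-contrast slices into a
    3-channel (RGB-like) image, mirroring the described ViT input fusion."""
    channels = [minmax(pre), minmax(early), minmax(late)]
    return np.stack(channels, axis=-1).astype(np.float32)  # H x W x 3

# Toy example: random 224x224 arrays standing in for DCE-MRI phase slices.
pre, early, late = (np.random.rand(224, 224) for _ in range(3))
rgb = fuse_phases(pre, early, late)
print(rgb.shape)  # (224, 224, 3), ready for a standard 3-channel ViT
```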
Authors: Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal
Abstract: Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, \textbf{ViSTA}. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ the Visual Question Answering-based metric TIFA to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model generates image sequences that are not only consistent across frames but also well aligned with the narrative text descriptions.
Authors: Daniya Najiha Abdul Kareem, Abdul Hannan, Mubashir Noman, Jean Lahoud, Mustansar Fiaz, Hisham Cholakkal
Abstract: Accurate microscopic medical image segmentation plays a crucial role in diagnosing various cancerous cells and identifying tumors. Driven by advancements in deep learning, convolutional neural networks (CNNs) and transformer-based models have been extensively studied to enhance receptive fields and improve medical image segmentation. However, they often struggle to capture complex cellular and tissue structures in challenging scenarios such as background clutter and object overlap. Moreover, their reliance on the availability of large datasets for improved performance, along with the high computational cost, limits their practicality. To address these issues, we propose an efficient framework for the segmentation task, named InceptionMamba, which encodes multi-stage rich features and offers both performance and computational efficiency. Specifically, we exploit semantic cues to capture both low-frequency and high-frequency regions to enrich the multi-stage features to handle the blurred region boundaries (e.g., cell boundaries). These enriched features are input to a hybrid model that combines an Inception depth-wise convolution with a Mamba block, to maintain high efficiency and capture inherent variations in the scales and shapes of the regions of interest. These enriched features along with low-resolution features are fused to get the final segmentation mask. Our model achieves state-of-the-art performance on two challenging microscopic segmentation datasets (SegPC21 and GlaS) and two skin lesion segmentation datasets (ISIC2017 and ISIC2018), while reducing computational cost by about five times compared to the previous best-performing method.
Authors: Ilya Ilyankou, Natchapon Jongwiriyanurak, Tao Cheng, James Haworth
Abstract: We present a CLIP-based, multi-modal, multi-label classifier for predicting geographical context tags from landscape photos in the Geograph dataset--a crowdsourced image archive spanning the British Isles, including remote regions lacking POIs and street-level imagery. Our approach addresses a Kaggle competition\footnote{https://www.kaggle.com/competitions/predict-geographic-context-from-landscape-photos} task based on a subset of Geograph's 8M images, with strict evaluation: exact match accuracy is required across 49 possible tags. We show that combining location and title embeddings with image features improves accuracy over using image embeddings alone. We release a lightweight pipeline\footnote{https://github.com/SpaceTimeLab/ClipTheLandscape} that trains on a modest laptop, using pre-trained CLIP image and text embeddings and a simple classification head. Predicted tags can support downstream tasks such as building location embedders for GeoAI applications, enriching spatial understanding in data-sparse regions.
URLs: https://www.kaggle.com/competitions/predict-geographic-context-from-landscape-photos, https://github.com/SpaceTimeLab/ClipTheLandscape
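A minimal sketch of the kind of pipeline described above, assuming precomputed CLIP embeddings for the photo, its title, and a textual location string: the embeddings are concatenated, passed through a small classification head trained with a multi-label loss, and scored with exact-match accuracy over all 49 tags. Dimensions, layer sizes, and function names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Classification head over concatenated CLIP embeddings
    (image + title text + location text), predicting 49 tags."""
    def __init__(self, dim=512, num_tags=49):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim, 512), nn.ReLU(), nn.Linear(512, num_tags)
        )

    def forward(self, img_emb, title_emb, loc_emb):
        return self.net(torch.cat([img_emb, title_emb, loc_emb], dim=-1))

def exact_match(logits, targets, thresh=0.5):
    """Strict metric: every one of the 49 tags must be predicted correctly."""
    preds = (torch.sigmoid(logits) > thresh).float()
    return (preds == targets).all(dim=-1).float().mean()

head = MultiLabelHead()
img, title, loc = (torch.randn(8, 512) for _ in range(3))   # toy batch of 8
targets = torch.randint(0, 2, (8, 49)).float()
logits = head(img, title, loc)
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(float(loss), float(exact_match(logits, targets)))
```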
Authors: Mohammed Elhenawy, Shadi Jaradat, Taqwa I. Alhadidi, Huthaifa I. Ashqar, Ahmed Jaber, Andry Rakotonirainy, Mohammad Abu Tami
Abstract: Scene understanding is critical for various downstream tasks in autonomous driving, including facilitating driver-agent communication and enhancing human-centered explainability of autonomous vehicle (AV) decisions. This paper evaluates the capability of four multimodal large language models (MLLMs), including relatively small models, to understand scenes in a zero-shot, in-context learning setting. Additionally, we explore whether combining these models using an ensemble approach with majority voting can enhance scene understanding performance. Our experiments demonstrate that GPT-4o, the largest model, outperforms the others in scene understanding. However, the performance gap between GPT-4o and the smaller models is relatively modest, suggesting that advanced techniques such as improved in-context learning, retrieval-augmented generation (RAG), or fine-tuning could further optimize the smaller models' performance. We also observe mixed results with the ensemble approach: while some scene attributes show improvement in performance metrics such as F1-score, others experience a decline. These findings highlight the need for more sophisticated ensemble techniques to achieve consistent gains across all scene attributes. This study underscores the potential of leveraging MLLMs for scene understanding and provides insights into optimizing their performance for autonomous driving applications.
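Majority voting over per-attribute MLLM outputs, as explored in the ensemble experiments, reduces to a few lines; the attribute and label values below are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions for one scene attribute by majority vote.
    `predictions` is a list of labels, one per MLLM in the ensemble."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical per-model outputs for the "weather" attribute of one scene.
votes = ["clear", "clear", "overcast", "clear"]
print(majority_vote(votes))  # "clear"
```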
Authors: Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, Marco Pavone
Abstract: Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.
Authors: Yijiang Li, Genpei Zhang, Jiacheng Cheng, Yi Li, Xiaojun Shan, Dashan Gao, Jiancheng Lyu, Yuan Li, Ning Bi, Nuno Vasconcelos
Abstract: While the rapid proliferation of wearable cameras has raised significant concerns about egocentric video privacy, prior work has largely overlooked the unique privacy threats posed to the camera wearer. This work investigates the core question: How much privacy information about the camera wearer can be inferred from their first-person view videos? We introduce EgoPrivacy, the first large-scale benchmark for the comprehensive evaluation of privacy risks in egocentric vision. EgoPrivacy covers three types of privacy (demographic, individual, and situational), defining seven tasks that aim to recover private information ranging from fine-grained (e.g., wearer's identity) to coarse-grained (e.g., age group). To further emphasize the privacy threats inherent to egocentric vision, we propose Retrieval-Augmented Attack, a novel attack strategy that leverages ego-to-exo retrieval from an external pool of exocentric videos to boost the effectiveness of demographic privacy attacks. An extensive comparison of the different attacks possible under all threat models is presented, showing that private information of the wearer is highly susceptible to leakage. For instance, our findings indicate that foundation models can effectively compromise wearer privacy even in zero-shot settings by recovering attributes such as identity, scene, gender, and race with 70-80% accuracy. Our code and data are available at https://github.com/williamium3000/ego-privacy.
Authors: Worasit Sangjan, Piyush Pandey, Norman B. Best, Jacob D. Washburn
Abstract: Accurate identification of individual plants from unmanned aerial vehicle (UAV) images is essential for advancing high-throughput phenotyping and supporting data-driven decision-making in plant breeding. This study presents MatchPlant, a modular, graphical user interface-supported, open-source Python pipeline for UAV-based single-plant detection and geospatial trait extraction. MatchPlant enables end-to-end workflows by integrating UAV image processing, user-guided annotation, Convolutional Neural Network model training for object detection, forward projection of bounding boxes onto an orthomosaic, and shapefile generation for spatial phenotypic analysis. In an early-season maize case study, MatchPlant achieved reliable detection performance (validation AP: 89.6%, test AP: 85.9%) and effectively projected bounding boxes, covering 89.8% of manually annotated boxes with 87.5% of projections achieving an Intersection over Union (IoU) greater than 0.5. Trait values extracted from predicted bounding instances showed high agreement with manual annotations (r = 0.87-0.97, IoU >= 0.4). Detection outputs were reused across time points to extract plant height and Normalized Difference Vegetation Index with minimal additional annotation, facilitating efficient temporal phenotyping. By combining modular design, reproducibility, and geospatial precision, MatchPlant offers a scalable framework for UAV-based plant-level analysis with broad applicability in agricultural and environmental monitoring.
Authors: Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm
Abstract: Paucity of medical data severely limits the generalizability of diagnostic ML models, as the full spectrum of disease variability cannot be represented by a small clinical dataset. To address this, diffusion models (DMs) have been considered as a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, degrading model performance. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals their strong visual reasoning capabilities, making them adept candidates as evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method creatively translates expert-defined criteria into actionable feedback for image synthesis of DMs, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.
Authors: Yuantao Wang, Haowei Yang, Wei Zhang, Shijian Lu
Abstract: Real-world object detection is a challenging task where the captured images/videos often suffer from complex degradations due to various adverse weather conditions such as rain, fog, snow, low-light, etc. Despite extensive prior efforts, most existing methods are designed for one specific type of adverse weather, and suffer from poor generalization and under-utilization of visual features when handling various image degradations. Leveraging a theoretical analysis on how critical visual details are lost in adverse-weather images, we design UniDet-D, a unified framework that tackles the challenge of object detection under various adverse weather conditions, and achieves object detection and image restoration within a single network. Specifically, the proposed UniDet-D incorporates a dynamic spectral attention mechanism that adaptively emphasizes informative spectral components while suppressing irrelevant ones, enabling more robust and discriminative feature representation across various degradation types. Extensive experiments show that UniDet-D achieves superior detection accuracy across different types of adverse-weather degradation. Furthermore, UniDet-D demonstrates superior generalization towards unseen adverse weather conditions such as sandstorms and rain-fog mixtures, highlighting its great potential for real-world deployment.
Authors: Yongmin Kwon, Namwoo Kang
Abstract: Generative models have attracted considerable attention for their ability to produce novel shapes. However, their application in mechanical design remains constrained due to the limited size and variability of available datasets. This study proposes a deep learning-based optimization framework specifically tailored for shape optimization with limited datasets, leveraging positional encoding and a Lipschitz regularization term to robustly learn geometric characteristics and maintain a meaningful latent space. Through extensive experiments, the proposed approach demonstrates robustness, generalizability and effectiveness in addressing typical limitations of conventional optimization frameworks. The validity of the methodology is confirmed through multi-objective shape optimization experiments conducted on diverse three-dimensional datasets, including wheels and cars, highlighting the model's versatility in producing practical and high-quality design outcomes even under data-constrained conditions.
Authors: Chuntao Ding, Jianhang Xie, Junna Zhang, Salman Raza, Shangguang Wang, Jiannong Cao
Abstract: It has become mainstream to deploy Convolutional Neural Network (CNN) models on ubiquitous Internet of Things (IoT) devices with the help of the cloud to provide users with a variety of high-quality services. Most existing methods have two limitations: (i) low robustness in handling corrupted image data collected by IoT devices; and (ii) high consumption of computational and transmission resources. To this end, we propose the Grouped NonLinear transformation generation method (GroupNL), which generates diversified feature maps by utilizing data-agnostic Nonlinear Transformation Functions (NLFs) to improve the robustness of the CNN model. Specifically, partial convolution filters are designated as seed filters in a convolutional layer, and a small set of feature maps, i.e., seed feature maps, are first generated by a vanilla convolution operation. Then, we split the seed feature maps into several groups, each with a set of different NLFs, to generate corresponding diverse feature maps with in-place nonlinear processing. Moreover, GroupNL effectively reduces the parameter transmission between multiple nodes during model training by setting the hyperparameters of NLFs to random initialization and not updating them during model training, and reduces computing resources by generating most feature maps with NLFs rather than with sliding-window convolutions. Experimental results on the CIFAR-10, GTSRB, CIFAR-10-C, Icons-50, and ImageNet-1K datasets on NVIDIA RTX GPU platforms show that the proposed GroupNL outperforms other state-of-the-art methods in model robustness and training acceleration. Specifically, on the Icons-50 dataset, the accuracy of GroupNL-ResNet-18 is approximately 2.86% higher than that of the vanilla ResNet-18. GroupNL improves training speed by about 53% compared to the vanilla CNN when trained on a cluster of 8 NVIDIA RTX 4090 GPUs on the ImageNet-1K dataset.
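A rough sketch of the GroupNL idea under stated assumptions: a small seed convolution produces seed feature maps, and frozen, randomly initialized nonlinear transformations expand them into the remaining channels without adding trainable parameters. The sinusoidal NLFs and channel counts below are placeholders, since the paper's exact NLFs are not given in the abstract.

```python
import torch
import torch.nn as nn

class GroupNLConv(nn.Module):
    """Sketch of a GroupNL-style layer: a seed convolution produces seed
    feature maps, and frozen data-agnostic nonlinear transformations expand
    them into the remaining (cheap) feature maps."""
    def __init__(self, in_ch, out_ch, seed_ch=8, groups=4):
        super().__init__()
        self.seed_conv = nn.Conv2d(in_ch, seed_ch, 3, padding=1)
        # Random NLF hyperparameters: frozen buffers, never updated in training.
        self.register_buffer("scales", torch.rand(groups) * 2 + 0.5)
        self.register_buffer("shifts", torch.rand(groups))
        self.expand = out_ch // seed_ch  # must divide evenly in this sketch
        self.groups = groups

    def forward(self, x):
        seeds = self.seed_conv(x)                      # B x seed_ch x H x W
        outs = []
        for i in range(self.expand):
            g = i % self.groups
            # Data-agnostic nonlinear transformation of the seed maps.
            outs.append(torch.sin(self.scales[g] * seeds + self.shifts[g]))
        return torch.cat(outs, dim=1)                  # B x out_ch x H x W

layer = GroupNLConv(3, 32)
print(layer(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 32, 32, 32])
```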
Authors: Youze Wang, Zijun Chen, Ruoyu Chen, Shishen Gu, Yinpeng Dong, Hang Su, Jun Zhu, Meng Wang, Richang Hong, Wenbo Hu
Abstract: Recent advancements in multimodal large language models for video understanding (videoLLMs) have improved their ability to process dynamic multimodal data. However, trustworthiness challenges, including factual inaccuracies, harmful content, biases, hallucinations, and privacy risks, undermine reliability due to video data's spatiotemporal complexities. This study introduces Trust-videoLLMs, a comprehensive benchmark evaluating videoLLMs across five dimensions: truthfulness, safety, robustness, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses dynamic visual scenarios, cross-modal interactions, and real-world safety concerns. Our evaluation of 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) reveals significant limitations in dynamic visual scene understanding and cross-modal perturbation resilience. Open-source videoLLMs show occasional truthfulness advantages but inferior overall credibility compared to commercial models, with data diversity outperforming scale effects. These findings highlight the need for advanced safety alignment to enhance capabilities. Trust-videoLLMs provides a publicly available, extensible toolbox for standardized trustworthiness assessments, bridging the gap between accuracy-focused benchmarks and critical demands for robustness, safety, fairness, and privacy.
Authors: Zongyu Wu, Minhua Lin, Zhiwei Zhang, Fali Wang, Xianren Zhang, Xiang Zhang, Suhang Wang
Abstract: Large vision-language models (LVLMs) have demonstrated outstanding performance in many downstream tasks. However, LVLMs are trained on large-scale datasets, which can pose privacy risks if training images contain sensitive information. Therefore, it is important to detect whether an image is used to train the LVLM. Recent studies have investigated membership inference attacks (MIAs) against LVLMs, including detecting image-text pairs and single-modality content. In this work, we focus on detecting whether a target image is used to train the target LVLM. We design simple yet effective Image Corruption-Inspired Membership Inference Attacks (ICIMIA) against LVLMs, which exploit LVLMs' differing sensitivity to image corruption between member and non-member images. We first present an MIA method for the white-box setting, where we can obtain the embeddings of the image through the vision part of the target LVLM. The attacks are based on the embedding similarity between the image and its corrupted version. We further explore a more practical scenario where we have no knowledge about target LVLMs and we can only query the target LVLMs with an image and a question. We then conduct the attack by utilizing the output text embeddings' similarity. Experiments on existing datasets validate the effectiveness of our proposed attack methods under those two different settings.
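The white-box attack described above can be sketched as follows: embed an image and a corrupted copy with the target model's vision encoder, and use their cosine similarity as the membership score to be thresholded. The toy encoder and Gaussian-noise corruption below are stand-ins for illustration, not the attack's actual components.

```python
import torch
import torch.nn.functional as F

def corruption_mia_score(vision_encoder, image, corrupt_fn):
    """White-box membership score: cosine similarity between the embeddings
    of an image and its corrupted copy. Lower sensitivity to corruption
    (higher similarity) is taken as evidence of membership."""
    with torch.no_grad():
        e_clean = vision_encoder(image)
        e_corrupt = vision_encoder(corrupt_fn(image))
    return F.cosine_similarity(e_clean, e_corrupt, dim=-1)

# Toy stand-ins: a linear "encoder" and an additive Gaussian-noise corruption.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
corrupt = lambda x: x + 0.1 * torch.randn_like(x)
score = corruption_mia_score(encoder, torch.randn(1, 3, 32, 32), corrupt)
print(float(score))  # thresholding this score yields the membership decision
```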
Authors: Huaijie Wang, De Cheng, Lingfeng He, Yan Li, Jie Li, Nannan Wang, Xinbo Gao
Abstract: Class-Incremental Learning (CIL) aims to enable AI models to continuously learn from sequentially arriving data of different classes over time while retaining previously acquired knowledge. Recently, Parameter-Efficient Fine-Tuning (PEFT) methods, like prompt pool-based approaches and adapter tuning, have attracted considerable interest in CIL. However, these methods either introduce additional parameters that increase memory usage, or rely on rigid regularization techniques which reduce forgetting but compromise model flexibility. To overcome these limitations, we propose the Elastic Knowledge Preservation and Compensation (EKPC) method, integrating Importance-aware Parameter Regularization (IPR) and Trainable Semantic Drift Compensation (TSDC) for CIL. Specifically, the IPR method assesses the sensitivity of network parameters to prior tasks using a novel parameter-importance algorithm. It then selectively constrains updates within the shared adapter according to these importance values, thereby preserving previously acquired knowledge while maintaining the model's flexibility. However, the shared adapter still exhibits slight semantic differences in previously learned knowledge when accommodating new incremental tasks, leading to confusion of decision boundaries in the classifier. To eliminate this confusion, TSDC trains a unified classifier by compensating prototypes with trainable semantic drift. Extensive experiments on five CIL benchmarks demonstrate the effectiveness of the proposed method, showing superior performance compared to existing state-of-the-art methods.
Authors: Zahid Ullah, Jihie Kim
Abstract: Accurate brain tumor classification is crucial in medical imaging to ensure reliable diagnosis and effective treatment planning. This study introduces a novel double ensembling framework that synergistically combines pre-trained deep learning (DL) models for feature extraction with optimized machine learning (ML) classifiers for robust classification. The framework incorporates comprehensive preprocessing and data augmentation of brain magnetic resonance images (MRI), followed by deep feature extraction using transfer learning with pre-trained Vision Transformer (ViT) networks. The novelty lies in the dual-level ensembling strategy: feature-level ensembling, which integrates deep features from the top-performing ViT models, and classifier-level ensembling, which aggregates predictions from hyperparameter-optimized ML classifiers. Experiments on two public Kaggle MRI brain tumor datasets demonstrate that this approach significantly surpasses state-of-the-art methods, underscoring the importance of feature and classifier fusion. The proposed methodology also highlights the critical roles of hyperparameter optimization (HPO) and advanced preprocessing techniques in improving diagnostic accuracy and reliability, advancing the integration of DL and ML for clinically relevant medical image analysis.
Authors: Haotian Zhang, Liu Liu, Baosheng Yu, Jiayan Qiu, Yanwei Ren, Xianglong Liu
Abstract: The advent of parameter-efficient fine-tuning methods has significantly reduced the computational burden of adapting large-scale pretrained models to diverse downstream tasks. However, existing approaches often struggle to achieve robust performance under domain shifts while maintaining computational efficiency. To address this challenge, we propose the Low-rAnk Regulated Gradient Projection (LARGO) algorithm, which integrates dynamic constraints into low-rank adaptation methods. Specifically, LARGO incorporates parallel trainable gradient projections to dynamically regulate layer-wise updates, retaining the Out-Of-Distribution robustness of the pretrained model while preserving inter-layer independence. Additionally, it ensures computational efficiency by mitigating the influence of gradient dependencies across layers during weight updates. Furthermore, we incorporate an SVD-based initialization strategy that leverages the singular value decomposition of pretrained weights for structured initialization, minimizing deviation from pretrained knowledge. Through extensive experiments on diverse benchmarks, LARGO achieves state-of-the-art performance across in-domain and out-of-distribution scenarios, demonstrating improved robustness under domain shifts with significantly lower computational overhead compared to existing PEFT methods. The source code will be released soon.
Authors: Hongbi Zhou, Zhangkai Ni
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis. However, existing methods struggle to adaptively optimize the distribution of Gaussian primitives based on scene characteristics, making it challenging to balance reconstruction quality and efficiency. Inspired by human perception, we propose scene-adaptive perceptual densification for Gaussian Splatting (Perceptual-GS), a novel framework that integrates perceptual sensitivity into the 3DGS training process to address this challenge. We first introduce a perception-aware representation that models human visual sensitivity while constraining the number of Gaussian primitives. Building on this foundation, we develop a perceptual sensitivity-adaptive distribution to allocate finer Gaussian granularity to visually critical regions, enhancing reconstruction quality and robustness. Extensive evaluations on multiple datasets, including BungeeNeRF for large-scale scenes, demonstrate that Perceptual-GS achieves state-of-the-art performance in reconstruction quality, efficiency, and robustness. The code is publicly available at: https://github.com/eezkni/Perceptual-GS
Authors: Weiwei Wang, Meijia Wang, Haoyi Wang, Wenqiang Guo, Jiapan Guo, Changming Sun, Lingkun Ma, Weichuan Zhang
Abstract: Visual place recognition (VPR) plays a crucial role in robotic localization and navigation. The key challenge lies in constructing feature representations that are robust to environmental changes. Existing methods typically adopt convolutional neural networks (CNNs) or vision Transformers (ViTs) as feature extractors. However, these architectures excel in different aspects: CNNs are effective at capturing local details, while ViTs are better suited for modeling global context, making it difficult to leverage the strengths of both. To address this issue, we propose a local-global feature complementation network (LGCN) for VPR which integrates a parallel CNN-ViT hybrid architecture with a dynamic feature fusion module (DFM). The DFM performs dynamic feature fusion through joint modeling of spatial and channel-wise dependencies. Furthermore, to enhance the expressiveness and adaptability of the ViT branch for VPR tasks, we introduce lightweight frequency-to-spatial fusion adapters into the frozen ViT backbone. These adapters enable task-specific adaptation with controlled parameter overhead. Extensive experiments on multiple VPR benchmark datasets demonstrate that the proposed LGCN consistently outperforms existing approaches in terms of localization accuracy and robustness, validating its effectiveness and generalizability.
Authors: Ziwei Liu, Borui Kang, Wei Li, Hangjie Yuan, Yanbing Yang, Wenbin Li, Jun Luo, Yifan Zhu, Tao Feng
Abstract: Continual learning in vision-language models (VLMs) faces critical challenges in balancing parameter efficiency, memory consumption, and optimization stability. While First-Order (FO) optimization (e.g., SGD) dominates current approaches, its deterministic gradients often trap models in suboptimal local minima and incur substantial memory overhead. This paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for vision-language continual learning (VLCL). We first identify the incompatibility of naive full-ZO adoption in VLCL due to modality-specific instability. To resolve this, we selectively apply ZO to either the vision or language modality while retaining FO in the complementary branch. Furthermore, we develop a layer-wise optimization paradigm that interleaves ZO and FO across network layers, capitalizing on the heterogeneous learning dynamics of shallow versus deep representations. A key theoretical insight reveals that ZO perturbations in vision branches exhibit higher variance than language counterparts, prompting a gradient sign normalization mechanism with modality-specific perturbation constraints. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art performance, reducing memory consumption by 89.1% compared to baselines. Code will be available upon publication.
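For readers unfamiliar with ZO optimization, the sketch below shows a generic two-point zeroth-order gradient estimate followed by sign normalization; it illustrates the mechanism only and is not the paper's modality-specific scheme (perturbation scale, step size, and the toy objective are assumptions).

```python
import torch

def zo_sign_grad(loss_fn, params, mu=1e-3):
    """Two-point zeroth-order gradient estimate with sign normalization.
    Only forward evaluations of `loss_fn` are needed (no backpropagation)."""
    u = torch.randn_like(params)                 # random perturbation direction
    loss_plus = loss_fn(params + mu * u)
    loss_minus = loss_fn(params - mu * u)
    g_hat = (loss_plus - loss_minus) / (2 * mu) * u
    return torch.sign(g_hat)                     # sign normalization step

# Toy quadratic objective standing in for one branch's training loss.
w = torch.zeros(10)
loss = lambda p: ((p - 1.0) ** 2).sum()
for _ in range(200):
    w = w - 0.05 * zo_sign_grad(loss, w)
print(float(loss(w)))  # drops well below its initial value of 10.0
```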
Authors: Hyeonseo Lee, Juhyun Park, Jihyong Oh, Chanho Eom
Abstract: Person Re-identification (ReID) aims to retrieve images of the same individual captured across non-overlapping camera views, making it a critical component of intelligent surveillance systems. Traditional ReID methods assume that the training and test domains share similar characteristics and primarily focus on learning discriminative features within a given domain. However, they often fail to generalize to unseen domains due to domain shifts caused by variations in viewpoint, background, and lighting conditions. To address this issue, Domain-Adaptive ReID (DA-ReID) methods have been proposed. These approaches incorporate unlabeled target domain data during training and improve performance by aligning feature distributions between source and target domains. Domain-Generalizable ReID (DG-ReID) tackles a more realistic and challenging setting by aiming to learn domain-invariant features without relying on any target domain data. Recent methods have explored various strategies to enhance generalization across diverse environments, but the field remains relatively underexplored. In this paper, we present a comprehensive survey of DG-ReID. We first review the architectural components of DG-ReID including the overall setting, commonly used backbone networks and multi-source input configurations. Then, we categorize and analyze domain generalization modules that explicitly aim to learn domain-invariant and identity-discriminative representations. To examine the broader applicability of these techniques, we further conduct a case study on a related task that also involves distribution shifts. Finally, we discuss recent trends, open challenges, and promising directions for future research in DG-ReID. To the best of our knowledge, this is the first systematic survey dedicated to DG-ReID.
Authors: Caixu Xu, Junming Wei, Huizhen Chen, Pengchen Liang, Bocheng Liang, Ying Tan, Xintong Wei
Abstract: Recently, Mamba-based methods have become popular in medical image segmentation due to their lightweight design and long-range dependency modeling capabilities. However, current segmentation methods frequently encounter challenges in fetal ultrasound images, such as enclosed anatomical structures, blurred boundaries, and small anatomical structures. To address the need for balancing local feature extraction and global context modeling, we propose MS-UMamba, a novel hybrid convolutional-Mamba model for fetal ultrasound image segmentation. Specifically, we design a visual state space block integrated with a CNN branch (SS-MCAT-SSM), which leverages Mamba's global modeling strengths and convolutional layers' local representation advantages to enhance feature learning. In addition, we propose an efficient multi-scale feature fusion module that integrates spatial attention mechanisms, combining feature information from different layers to enhance the model's feature representation ability. Finally, we conduct extensive experiments on a non-public dataset; the results demonstrate that MS-UMamba achieves excellent segmentation performance.
Authors: Nathanael L. Baisa, Babu Pallam, Amudhavel Jayavel
Abstract: This paper introduces a new approach to person identification based on hand images, designed specifically for criminal investigations. The method is particularly valuable in serious crimes like sexual abuse, where hand images are often the sole identifiable evidence available. Our proposed method, CLIP-HandID, leverages a pre-trained foundational vision-language model, CLIP, to efficiently learn discriminative deep feature representations from hand images, which are given as input to the image encoder of CLIP, using textual prompts as semantic guidance. We propose to learn pseudo-tokens that represent specific visual contexts or appearance attributes using a textual inversion network, since the labels of hand images are indices rather than text descriptions. The learned pseudo-tokens are incorporated into textual prompts, which are fed to the text encoder of CLIP to leverage its multi-modal reasoning and enhance generalization for identification. Through extensive evaluations on two large, publicly available hand datasets with multi-ethnic representation, we show that our method substantially surpasses existing approaches.
Authors: Eugene Kofi Okrah Denteh, Andrews Danyo, Joshua Kofi Asamoah, Blessing Agyei Kyem, Armstrong Aboah
Abstract: This study presents a novel demographics informed deep learning framework designed to forecast urban spatial transformations by jointly modeling geographic satellite imagery, socio-demographics, and travel behavior dynamics. The proposed model employs an encoder-decoder architecture with temporal gated residual connections, integrating satellite imagery and demographic data to accurately forecast future spatial transformations. The study also introduces a demographics prediction component which ensures that predicted satellite imagery are consistent with demographic features, significantly enhancing physiological realism and socioeconomic accuracy. The framework is enhanced by a proposed multi-objective loss function complemented by a semantic loss function that balances visual realism with temporal coherence. The experimental results from this study demonstrate the superior performance of the proposed model compared to state-of-the-art models, achieving higher structural similarity (SSIM: 0.8342) and significantly improved demographic consistency (Demo-loss: 0.14 versus 0.95 and 0.96 for baseline models). Additionally, the study validates co-evolutionary theories of urban development, demonstrating quantifiable bidirectional influences between built environment characteristics and population patterns. The study also contributes a comprehensive multimodal dataset pairing satellite imagery sequences (2012-2023) with corresponding demographic and travel behavior attributes, addressing existing gaps in urban and transportation planning resources by explicitly connecting physical landscape evolution with socio-demographic patterns.
Authors: Hao Shu
Abstract: Image edge detection (ED) faces a fundamental mismatch between training and inference: models are trained using continuous-valued outputs but evaluated using binary predictions. This misalignment, caused by the non-differentiability of binarization, weakens the link between learning objectives and actual task performance. In this paper, we propose a theoretical method to design a Binarization-Aware Adjuster (BAA), which explicitly incorporates binarization behavior into gradient-based optimization. At the core of BAA is a novel loss adjustment mechanism based on a Distance Weight Function (DWF), which reweights pixel-wise contributions according to their correctness and proximity to the decision boundary. This emphasizes decision-critical regions while down-weighting less influential ones. We also introduce a self-adaptive procedure to estimate the optimal binarization threshold for BAA, further aligning training dynamics with inference behavior. Extensive experiments across various architectures and datasets demonstrate the effectiveness of our approach. Beyond ED, BAA offers a generalizable strategy for bridging the gap between continuous optimization and discrete evaluation in structured prediction tasks.
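One plausible shape of the loss adjustment described above is sketched below: a distance weight function that up-weights pixels that are misbinarized or close to the binarization threshold, multiplied into a per-pixel BCE loss. The exact DWF form and hyperparameters are assumptions, not the paper's formulation.

```python
import torch

def dwf(prob, target, thresh=0.5, alpha=4.0):
    """Illustrative distance weight: pixels near the binarization threshold,
    and incorrectly binarized pixels, receive larger weights."""
    pred_bin = (prob > thresh).float()
    correct = (pred_bin == target).float()
    proximity = 1.0 - (prob - thresh).abs() / max(thresh, 1 - thresh)
    return (1.0 + alpha * (1.0 - correct)) * (0.5 + proximity)

def baa_bce(prob, target, thresh=0.5):
    """Binarization-aware adjusted BCE: per-pixel loss reweighted by the DWF."""
    bce = torch.nn.functional.binary_cross_entropy(prob, target, reduction="none")
    return (dwf(prob, target, thresh) * bce).mean()

prob = torch.rand(1, 1, 64, 64)                     # continuous edge map
target = (torch.rand(1, 1, 64, 64) > 0.9).float()   # sparse binary edge labels
print(float(baa_bce(prob, target)))
```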
Authors: Runhao Zeng, Qi Deng, Ronghao Zhang, Shuaicheng Niu, Jian Chen, Xiping Hu, Victor C. M. Leung
Abstract: Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. While most existing TTA methods for video primarily utilize visual supervisory signals, they often overlook the potential contribution of inherent audio data. To address this gap, we propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels, a new concept in the context of video TTA. Specifically, we propose an audio-to-video label mapping method by first employing pre-trained audio models to classify audio signals extracted from videos and then mapping the audio-based predictions to video label spaces through large language models, thereby establishing a connection between the audio categories and video labels. To effectively leverage the generated pseudo-labels, we present a flexible adaptation cycle that determines the optimal number of adaptation iterations for each sample, based on changes in loss and consistency across different views. This enables a customized adaptation process for each sample. Experimental results on two widely used datasets (UCF101-C and Kinetics-Sounds-C), as well as on two newly constructed audio-video TTA datasets (AVE-C and AVMIT-C) with various corruption types, demonstrate the superiority of our approach. Our method consistently improves adaptation performance across different video classification models and represents a significant step forward in integrating audio information into video TTA. Code: https://github.com/keikeiqi/Audio-Assisted-TTA.
Authors: Yanqiao Zhu
Abstract: This paper presents a comparative analysis of deep learning strategies for detecting hypertensive retinopathy from fundus images, a central task in the HRDC challenge~\cite{qian2025hrdc}. We investigate three distinct approaches: a custom CNN, a suite of pre-trained transformer-based models, and an AutoML solution. Our findings reveal a stark, architecture-dependent response to data augmentation. Augmentation significantly boosts the performance of pure Vision Transformers (ViTs), which we hypothesize is due to their weaker inductive biases, forcing them to learn robust spatial and structural features. Conversely, the same augmentation strategy degrades the performance of hybrid ViT-CNN models, whose stronger, pre-existing biases from the CNN component may be "confused" by the transformations. We show that smaller patch sizes (ViT-B/8) excel on augmented data, enhancing fine-grained detail capture. Furthermore, we demonstrate that a powerful self-supervised model like DINOv2 fails on the original, limited dataset but is "rescued" by augmentation, highlighting the critical need for data diversity to unlock its potential. Preliminary tests with a ViT-Large model show poor performance, underscoring the risk of using over-capacity models on specialized, smaller datasets. This work provides critical insights into the interplay between model architecture, data augmentation, and dataset size for medical image classification.
Authors: Mohsen Jenadeleh, Jon Sneyers, Davi Lazzarotto, Shima Mohammadi, Dominik Keller, Atanas Boev, Rakesh Rao Ramachandra Rao, António Pinheiro, Thomas Richter, Alexander Raake, Touradj Ebrahimi, João Ascenso, Dietmar Saupe
Abstract: High dynamic range (HDR) and wide color gamut (WCG) technologies significantly improve color reproduction compared to standard dynamic range (SDR) and standard color gamuts, resulting in more accurate, richer, and more immersive images. However, HDR increases data demands, posing challenges for bandwidth efficiency and compression techniques. Advances in compression and display technologies require more precise image quality assessment, particularly in the high-fidelity range where perceptual differences are subtle. To address this gap, we introduce AIC-HDR2025, the first HDR dataset of its kind, comprising 100 test images generated from five HDR sources, each compressed using four codecs at five compression levels. It covers the high-fidelity range, from visible distortions to compression levels below the visually lossless threshold. A subjective study was conducted using the JPEG AIC-3 test methodology, combining plain and boosted triplet comparisons. In total, 34,560 ratings were collected from 151 participants across four fully controlled labs. The results confirm that AIC-3 enables precise HDR quality estimation, with 95% confidence intervals averaging a width of 0.27 at 1 JND. In addition, several recently proposed objective metrics were evaluated based on their correlation with subjective ratings. The dataset is publicly available.
Authors: Bingchen Zhao, Oisin Mac Aodha
Abstract: Traditional clustering methods aim to group unlabeled data points based on their similarity to each other. However, clustering, in the absence of additional information, is an ill-posed problem as there may be many different, yet equally valid, ways to partition a dataset. Distinct users may want to use different criteria to form clusters in the same data, e.g., shape vs. color. Recently introduced text-guided image clustering methods aim to address this ambiguity by allowing users to specify the criteria of interest using natural language instructions. This instruction provides the necessary context and control needed to obtain clusters that are more aligned with the users' intent. We propose a new text-guided clustering approach named ITGC that uses an iterative discovery process, guided by an unsupervised clustering objective, to generate interpretable visual concepts that better capture the criteria expressed in a user's instructions. We report superior performance compared to existing methods across a wide variety of image clustering and fine-grained classification benchmarks.
Authors: Bingchen Zhao, Kai Han
Abstract: This paper addresses the problem of Generalized Category Discovery (GCD) under a long-tailed distribution, which involves discovering novel categories in an unlabelled dataset using knowledge from a set of labelled categories. Existing works assume a uniform distribution for both datasets, but real-world data often exhibits a long-tailed distribution, where a few categories contain most examples, while others have only a few. While the long-tailed distribution is well-studied in supervised and semi-supervised settings, it remains unexplored in the GCD context. We identify two challenges in this setting - balancing classifier learning and estimating category numbers - and propose a framework based on confident sample selection and density-based clustering to tackle them. Our experiments on both long-tailed and conventional GCD datasets demonstrate the effectiveness of our method.
Authors: Yunhao Shui, Xuekuan Wang, Feng Qiu, Yuqiu Huang, Jinzhu Li, Haoyu Zheng, Jinru Han, Zhuo Zeng, Pengpeng Zhang, Jiarui Han, Keqiang Sun
Abstract: We present RaCig, a novel system for generating comic-style image sequences with consistent characters and expressive gestures. RaCig addresses two key challenges: (1) maintaining character identity and costume consistency across frames, and (2) producing diverse and vivid character gestures. Our approach integrates a retrieval-based character assignment module, which aligns characters in textual prompts with reference images, and a regional character injection mechanism that embeds character features into specified image regions. Experimental results demonstrate that RaCig effectively generates engaging comic narratives with coherent characters and dynamic interactions. The source code will be publicly available to support further research in this area.
Authors: Saemee Choi, Sohyun Jeong, Jaegul Choo, Jinhee Kim
Abstract: We propose ImEdit, the first zero-shot, training-free video editing method conditioned on both images and text. The proposed method introduces $\rho$-start sampling and dilated dual masking to construct well-structured noise maps for coherent and accurate edits. We further present zero image guidance, a controllable negative prompt strategy, for visual fidelity. Both quantitative and qualitative evaluations show that our method outperforms state-of-the-art methods across all metrics.
Authors: Nuwan Bandara, Thivya Kandappu, Archan Misra
Abstract: Event-based eye tracking holds significant promise for fine-grained cognitive state inference, offering high temporal resolution and robustness to motion artifacts, critical features for decoding subtle mental states such as attention, confusion, or fatigue. In this work, we introduce a model-agnostic, inference-time refinement framework designed to enhance the output of existing event-based gaze estimation models without modifying their architecture or requiring retraining. Our method comprises two key post-processing modules: (i) Motion-Aware Median Filtering, which suppresses blink-induced spikes while preserving natural gaze dynamics, and (ii) Optical Flow-Based Local Refinement, which aligns gaze predictions with cumulative event motion to reduce spatial jitter and temporal discontinuities. To complement traditional spatial accuracy metrics, we propose a novel Jitter Metric that captures the temporal smoothness of predicted gaze trajectories based on velocity regularity and local signal complexity. Together, these contributions significantly improve the consistency of event-based gaze signals, making them better suited for downstream tasks such as micro-expression analysis and mind-state decoding. Our results demonstrate consistent improvements across multiple baseline models on controlled datasets, laying the groundwork for future integration with multimodal affect recognition systems in real-world environments.
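One plausible instantiation of such a jitter metric, assuming the score is based on higher-order differences of the gaze trajectory; the paper's exact definition may differ.

```python
import numpy as np

def jitter_metric(gaze_xy):
    """Temporal-smoothness score for a gaze trajectory (N x 2 array of x, y):
    here, the mean magnitude of acceleration changes (jerk). Lower is smoother."""
    vel = np.diff(gaze_xy, axis=0)     # velocity between consecutive samples
    acc = np.diff(vel, axis=0)         # acceleration
    jerk = np.diff(acc, axis=0)        # change of acceleration
    return float(np.linalg.norm(jerk, axis=1).mean())

t = np.linspace(0, 1, 500)
smooth = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
noisy = smooth + 0.01 * np.random.randn(*smooth.shape)
print(jitter_metric(smooth), jitter_metric(noisy))  # noisy trajectory scores higher
```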
Authors: Xingzhong Hou, Jie Wu, Boxiao Liu, Yi Zhang, Guanglu Song, Yunpeng Liu, Yu Liu, Haihang You
Abstract: Image inpainting is the task of reconstructing missing or damaged parts of an image in a way that seamlessly blends with the surrounding content. With the advent of advanced generative models, especially diffusion models and generative adversarial networks, inpainting has achieved remarkable improvements in visual quality and coherence. However, achieving seamless continuity remains a significant challenge. In this work, we propose two novel methods to address discrepancy issues in diffusion-based inpainting models. First, we introduce a modified Variational Autoencoder that corrects color imbalances, ensuring that the final inpainted results are free of color mismatches. Second, we propose a two-step training strategy that improves the blending of generated and existing image content during the diffusion process. Through extensive experiments, we demonstrate that our methods effectively reduce discontinuity and produce high-quality inpainting results that are coherent and visually appealing.
Authors: Mahmudul Hasan
Abstract: Freezing of gait (FoG) is a special symptom found in patients with Parkinson's disease (PD). Patients who have FoG abruptly lose the capacity to walk as they normally would. Accelerometers worn by patients can record movement data during these episodes, and machine learning algorithms can be useful to categorize this information. Thus, the combination may be able to identify FoG in real time. In order to identify FoG events in accelerometer data, we introduce the Transformer Encoder-Bi-LSTM fusion model in this paper. The model's performance was evaluated by its capability to differentiate between FoG episodes and normal movement; on the Kaggle Parkinson's Freezing of Gait dataset, the proposed Transformer Encoder-Bi-LSTM fusion model achieved 92.6% accuracy, an 80.9% F1 score, and a mean average precision of 52.06%. The findings highlight how Deep Learning-based approaches may advance the field of FoG identification and help PD patients receive better treatments and management plans.
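An architectural sketch of a Transformer-encoder plus Bi-LSTM fusion classifier for accelerometer windows; the layer sizes, mean pooling, and fusion-by-concatenation choice are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransformerBiLSTMFusion(nn.Module):
    """Sketch of a Transformer-encoder + Bi-LSTM fusion classifier for
    windows of 3-axis accelerometer data (binary FoG vs. normal gait)."""
    def __init__(self, in_dim=3, d_model=64, num_classes=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.bilstm = nn.LSTM(in_dim, d_model // 2, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * d_model, num_classes)

    def forward(self, x):                                     # x: B x T x 3
        t_feat = self.transformer(self.proj(x)).mean(dim=1)   # B x d_model
        l_out, _ = self.bilstm(x)
        l_feat = l_out.mean(dim=1)                            # B x d_model
        return self.head(torch.cat([t_feat, l_feat], dim=-1))

model = TransformerBiLSTMFusion()
print(model(torch.randn(4, 128, 3)).shape)   # torch.Size([4, 2])
```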
Authors: Charith Wickrema, Sara Leary, Shivangi Sarkar, Mark Giglio, Eric Bianchi, Eliza Mace, Michael Twardowski
Abstract: Traditional image similarity metrics are ineffective at evaluating the similarity between a real image of a scene and an artificially generated version of that viewpoint [6, 9, 13, 14]. Our research evaluates the effectiveness of a new, perceptual-based similarity metric, DreamSim [2], and three popular image similarity metrics: Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Learned Perceptual Image Patch Similarity (LPIPS) [18, 19] in novel view synthesis (NVS) applications. We create a corpus of artificially corrupted images to quantify the sensitivity and discriminative power of each of the image similarity metrics. These tests reveal that traditional metrics are unable to effectively differentiate between images with minor pixel-level changes and those with substantial corruption, whereas DreamSim is more robust to minor defects and can effectively evaluate the high-level similarity of the image. Additionally, our results demonstrate that DreamSim provides a more effective and useful evaluation of render quality, especially for evaluating NVS renders in real-world use cases where slight rendering corruptions are common, but do not affect image utility for human tasks.
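The corruption-sensitivity test can be reproduced in spirit with scikit-image's SSIM and PSNR (DreamSim and LPIPS require their own packages and are omitted here); the Gaussian-noise corruption below is a simple stand-in for the paper's corruption corpus.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def corrupt(img, sigma=0.05, rng=None):
    """Add Gaussian pixel noise as a simple corruption of adjustable severity."""
    rng = rng or np.random.default_rng(0)
    return np.clip(img + rng.normal(0, sigma, img.shape), 0, 1)

ref = np.random.default_rng(1).random((128, 128))   # stand-in grayscale image
minor = corrupt(ref, sigma=0.02)
severe = corrupt(ref, sigma=0.3)

for name, img in [("minor", minor), ("severe", severe)]:
    ssim = structural_similarity(ref, img, data_range=1.0)
    psnr = peak_signal_noise_ratio(ref, img, data_range=1.0)
    print(f"{name}: SSIM={ssim:.3f}  PSNR={psnr:.1f} dB")
```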
Authors: Chunjiang Wang, Kun Zhang, Yandong Liu, Zhiyang He, Xiaodong Tao, S. Kevin Zhou
Abstract: The concept bottleneck model (CBM), as a technique improving interpretability via linking predictions to human-understandable concepts, makes high-risk and life-critical medical image classification credible. Typically, existing CBM methods associate the final layer of visual encoders with concepts to explain the model's predictions. However, we empirically discover the phenomenon of concept preference variation, that is, the concepts are preferably associated with the features at different layers than those only at the final layer; yet a blind last-layer-based association neglects such a preference variation and thus weakens the accurate correspondences between features and concepts, impairing model interpretability. To address this issue, we propose a novel Multi-layer Visual Preference-enhanced Concept Bottleneck Model (MVP-CBM), which comprises two key novel modules: (1) intra-layer concept preference modeling, which captures the preferred association of different concepts with features at various visual layers, and (2) multi-layer concept sparse activation fusion, which sparsely aggregates concept activations from multiple layers to enhance performance. Thus, by explicitly modeling concept preferences, MVP-CBM can comprehensively leverage multi-layer visual information to provide a more nuanced and accurate explanation of model decisions. Extensive experiments on several public medical classification benchmarks demonstrate that MVP-CBM achieves state-of-the-art accuracy and interpretability, verifying its superiority. Code is available at https://github.com/wcj6/MVP-CBM.
Authors: Darryl Ho, Samuel Madden
Abstract: In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a variable-length temporal sequence of embeddings, which we call a multivariate time series (MTS). An MTS naturally preserves temporal order and accommodates variable video durations. We then learn per-timestep, per-feature weights over the encoded MTS frames, allowing us to account for variations in feature importance over time. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder, achieving leading Top-1 accuracy of 77.2% on Something-Something V2, 89.1% on Kinetics-400, and 88.6% on HMDB51, while adding fewer than 1.8% additional learnable parameters and requiring less than 3 hours of training time. Our code is available at https://github.com/darrylho/DejaVid.
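A minimal sketch of the core idea, assuming a video has already been encoded into a variable-length sequence of clip embeddings (the MTS): per-timestep, per-feature weights are learned and used to pool the sequence before classification. DejaVid's actual alignment-inspired architecture is more involved; the 174-class head matches Something-Something V2 but is otherwise illustrative.

```python
import torch
import torch.nn as nn

class WeightedMTSPooling(nn.Module):
    """Sketch: learn per-timestep, per-feature weights over a multivariate
    time series of clip embeddings and pool them into one video vector."""
    def __init__(self, max_len=32, dim=768, num_classes=174):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(max_len, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, mts):                       # mts: B x T x dim, T <= max_len
        w = torch.softmax(self.weights[: mts.size(1)], dim=0)  # normalize over time
        pooled = (w * mts).sum(dim=1)             # weighted temporal pooling
        return self.head(pooled)

model = WeightedMTSPooling()
clip_embeddings = torch.randn(2, 16, 768)         # 16 clip embeddings per video
print(model(clip_embeddings).shape)               # torch.Size([2, 174])
```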
Authors: Lexiang Tang, Xianwei Zhuang, Bang Yang, Zhiyuan Hu, Hongxiang Li, Lu Ma, Jinghan Ru, Yuexian Zou
Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, they remain prone to visual hallucination (VH), often producing confident but incorrect descriptions of visual content. We present VisFlow, an efficient and training-free framework designed to mitigate VH by directly manipulating attention patterns during inference. Through systematic analysis, we identify three key pathological attention behaviors in LVLMs: (1) weak visual grounding, where attention to visual tokens is insufficient or misallocated, over-focusing on uninformative regions; (2) language prior dominance, where excessive attention to prior response tokens reinforces autoregressive patterns and impairs multimodal alignment; (3) prompt redundancy, where many attention heads fixate on system prompt tokens, disrupting the integration of image, instruction, and response content. To address these issues, we introduce two inference-time interventions: token-level attention intervention (TAI), which enhances focus on salient visual content, and head-level attention intervention (HAI), which suppresses over-attention to prompt and nearby text tokens. VisFlow operates without additional training or model modifications. Extensive experiments across models and benchmarks show that VisFlow effectively reduces hallucinations and improves visual factuality, with negligible computational cost.
Authors: Wenxiao Cai, Zongru Li, Iris Wang, Yu-Neng Wang, Thomas H. Lee
Abstract: Machine learning has achieved remarkable advancements but at the cost of significant computational resources. This has created an urgent need for a novel and energy-efficient computational fabric. CMOS Oscillator Networks (OscNet) is a brain-inspired hardware platform specially designed for low energy consumption. In this paper, we propose a Hopfield Network-based machine learning algorithm that can be implemented on OscNet. The network is trained using forward propagation alone to learn sparsely connected weights, yet achieves an 8% improvement in accuracy compared to conventional deep learning models on the MNIST dataset. OscNet v1.5 achieves competitive accuracy on MNIST and is well-suited for implementation using CMOS-compatible ring oscillator arrays with SHIL. In the oscillator-based implementation, we utilize only 24% of the connections used in a fully connected Hopfield network, with merely a 0.1% drop in accuracy. OscNet v1.5 relies solely on forward propagation and employs sparse connections, making it an energy-efficient machine learning pipeline designed for CMOS oscillator computing. The repository for the OscNet family is: https://github.com/RussRobin/OscNet.
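A toy sketch of forward-only Hebbian learning with sparse, symmetric connections, in the general spirit of the Hopfield-style training described above. The random connectivity mask, pattern count, and recall loop are chosen for illustration and are not OscNet's connectivity or training pipeline.

```python
# Toy Hopfield network: Hebbian (forward-only) learning on a sparse connection mask.
import numpy as np

rng = np.random.default_rng(0)
n, density = 64, 0.24                          # keep roughly 24% of possible connections
patterns = rng.choice([-1.0, 1.0], size=(5, n))

mask = (rng.random((n, n)) < density).astype(float)
mask = np.triu(mask, 1)
mask = mask + mask.T                           # symmetric, no self-connections

W = np.zeros((n, n))
for p in patterns:                             # Hebbian update: no backpropagation needed
    W += np.outer(p, p)
W *= mask / len(patterns)

state = patterns[0].copy()
state[: n // 4] *= -1                          # corrupt a quarter of the bits
for _ in range(10):                            # synchronous recall updates
    state = np.sign(W @ state)
    state[state == 0] = 1.0
print("recovered:", np.array_equal(state, patterns[0]))
```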
Authors: Yuan Zang, Hao Tan, Seunghyun Yoon, Franck Dernoncourt, Jiuxiang Gu, Kushal Kafle, Chen Sun, Trung Bui
Abstract: We study multi-modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization, and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which spans over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enable the comprehensive evaluations for concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.
Authors: Changhyun Choi, Sungha Kim, H. Jin Kim
Abstract: Recently, it has been shown that investing computing resources in searching for good initial noise for a text-to-image diffusion model helps improve performance. However, previous studies required external models to evaluate the resulting images, which is impossible on GPUs with small VRAM. For these reasons, we apply Best-of-N inference-time scaling to algorithms that optimize the initial noise of a diffusion model without external models across multiple datasets and backbones. We demonstrate that inference-time scaling for text-to-image diffusion models in this setting quickly reaches a performance plateau, and a relatively small number of optimization steps suffices to achieve the maximum achievable performance with each algorithm.
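A minimal Best-of-N sketch over initial diffusion noise. The scorer below is a placeholder stand-in for whatever internal, external-model-free criterion a noise-optimization algorithm uses; it is not one of the methods evaluated in the paper.

```python
# Best-of-N over initial noise: sample N candidates, keep the highest-scoring one.
import torch

def best_of_n_noise(n_candidates, shape, score_fn, generator=None):
    best_noise, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape, generator=generator)
        s = score_fn(noise)
        if s > best_score:
            best_noise, best_score = noise, s
    return best_noise, best_score

# Placeholder score: prefer noise whose empirical statistics are closest to N(0, 1).
score = lambda z: -(z.mean().abs() + (z.std() - 1).abs()).item()
noise, s = best_of_n_noise(8, (1, 4, 64, 64), score)
print(noise.shape, round(s, 4))
```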
Authors: Chen-Bin Feng, Kangdao Liu, Jian Sun, Jiping Jin, Yiguo Jiang, Chi-Man Vong
Abstract: Malformed hands in AI-generated images seriously affect the authenticity of the images. To refine malformed hands, existing depth-based approaches use a hand depth estimator to guide the refinement of malformed hands. Due to the performance limitations of the hand depth estimator, many hand details cannot be represented, resulting in errors in the generated hands, such as confusing the palm and the back of the hand. To solve this problem, we propose a 3D mesh-guided refinement framework using a diffusion pipeline. We use a state-of-the-art 3D hand mesh estimator, which provides more details of the hands. For training, we collect and reannotate a dataset consisting of RGB images and 3D hand meshes. Then we design a diffusion inpainting model to generate refined outputs guided by 3D hand meshes. For inference, we propose a double-check algorithm that helps the 3D hand mesh estimator produce robust hand mesh guidance for our refined results. Beyond malformed hand refinement, we propose a novel hand pose transformation method. It increases the flexibility and diversity of the malformed hand refinement task, enabling the restored images to mimic the hand poses of reference images. The pose transformation requires no additional training. Extensive experimental results demonstrate the superior performance of our proposed method.
Authors: Samarth Singhal, Sandeep Singhal
Abstract: Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs). This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, accessed via APIs, for histopathology image classification tasks, including cell typing. Using diverse datasets from public and private sources, we apply zero-shot and one-shot prompting methods to assess VLM performance, comparing them against custom-trained Convolutional Neural Networks (CNNs). Our findings demonstrate that while one-shot prompting significantly improves VLM performance over zero-shot ($p \approx 1.005 \times 10^{-5}$ based on Kappa scores), these general-purpose VLMs currently underperform supervised CNNs on most tasks. This work underscores both the promise and limitations of applying current VLMs to specialized domains like pathology via in-context learning. All code and instructions for reproducing the study can be accessed from the repository https://www.github.com/a12dongithub/VLMCCE.
Authors: Yuxiang Wang, Xuecheng Bai, Boyu Hu, Chuanzhi Xu, Haodong Chen, Vera Chung, Tingxue Li
Abstract: Small object detection in UAV imagery is crucial for applications such as search-and-rescue, traffic monitoring, and environmental surveillance, but it is hampered by tiny object size, low signal-to-noise ratios, and limited feature extraction. Existing multi-scale fusion methods help, but add computational burden and blur fine details, making small object detection in cluttered scenes difficult. To overcome these challenges, we propose the Multi-scale Global-detail Feature Integration Strategy (MGDFIS), a unified fusion framework that tightly couples global context with local detail to boost detection performance while maintaining efficiency. MGDFIS comprises three synergistic modules: the FusionLock-TSS Attention Module, which marries token-statistics self-attention with DynamicTanh normalization to highlight spectral and spatial cues at minimal cost; the Global-detail Integration Module, which fuses multi-scale context via directional convolution and parallel attention while preserving subtle shape and texture variations; and the Dynamic Pixel Attention Module, which generates pixel-wise weighting maps to rebalance uneven foreground and background distributions and sharpen responses to true object regions. Extensive experiments on the VisDrone benchmark demonstrate that MGDFIS consistently outperforms state-of-the-art methods across diverse backbone architectures and detection frameworks, achieving superior precision and recall with low inference time. By striking an optimal balance between accuracy and resource usage, MGDFIS provides a practical solution for small-object detection on resource-constrained UAV platforms.
Authors: Cuong Manh Hoang, Yeejin Lee, Byeongkeun Kang
Abstract: This work addresses the task of self-supervised learning (SSL) on a long-tailed dataset that aims to learn balanced and well-separated representations for downstream tasks such as image classification. This task is crucial because the real world contains numerous object categories, and their distributions are inherently imbalanced. Towards robust SSL on a class-imbalanced dataset, we investigate leveraging a network trained using unlabeled out-of-distribution (OOD) data that are prevalently available online. We first train a network using both in-domain (ID) and sampled OOD data by back-propagating the proposed pseudo semantic discrimination loss alongside a domain discrimination loss. The OOD data sampling and loss functions are designed to learn a balanced and well-separated embedding space. Subsequently, we further optimize the network on ID data by unsupervised contrastive learning while using the previously trained network as a guiding network. The guiding network is utilized to select positive/negative samples and to control the strengths of attractive/repulsive forces in contrastive learning. We also distil and transfer its embedding space to the training network to maintain balancedness and separability. Through experiments on four publicly available long-tailed datasets, we demonstrate that the proposed method outperforms previous state-of-the-art methods.
Authors: Jiaming Zhang, Xin Wang, Xingjun Ma, Lingyu Qiu, Yu-Gang Jiang, Jitao Sang
Abstract: Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality, posing significant security concerns. Building upon our previous work on Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to enhance adversarial robustness in VLMs without extensive parameter training, we present a significant extension by introducing the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). Our key innovations include: (1) extending AdvPT from text-only to multi-modal prompting across both text and visual modalities, (2) expanding from single-layer to multi-layer prompt architectures, and (3) proposing a novel architecture-level redesign through our Neural Augmentor approach, which implements feature purification to directly address the distortions introduced by adversarial attacks in feature space. Our NAP-Tuning approach incorporates token refiners that learn to reconstruct purified features through residual connections, allowing for modality-specific and layer-specific feature correction. Comprehensive experiments demonstrate that NAP-Tuning significantly outperforms existing methods across various datasets and attack types. Notably, our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures while maintaining competitive clean accuracy.
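A hedged sketch of a residual "token refiner" for feature purification, loosely following the description above; the layer sizes, normalization, and placement are assumptions, not the NAP-Tuning implementation.

```python
# Hypothetical residual token refiner: predicts a correction term for perturbed tokens.
import torch
import torch.nn as nn

class TokenRefiner(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, tokens):                         # tokens: (B, N, dim), possibly perturbed
        return tokens + self.mlp(self.norm(tokens))    # residual correction keeps clean content

refiner = TokenRefiner(dim=512)
purified = refiner(torch.randn(2, 197, 512))
print(purified.shape)                                  # torch.Size([2, 197, 512])
```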
Authors: Zhenghao Xi, Zhengnan Lv, Yang Zheng, Xiang Liu, Zhuang Yu, Junran Chen, Jing Hu, Yaqi Liu
Abstract: The segmentation of coal maceral groups can be described as semantic segmentation of coal maceral group images, which is of great significance for studying the chemical properties of coal. Existing semantic segmentation models for coal maceral groups generally stack parameters to achieve higher accuracy, which increases computational requirements and reduces training efficiency. At the same time, because sampling coal maceral group images is specialized and diverse, collecting enough training samples requires a long time and skilled personnel. To address these issues, we develop an IoT-based DA-VIT parallel network model. With this model, the dataset can be continuously broadened through IoT, yielding sustained improvement in the accuracy of coal maceral group segmentation. In addition, we decouple the parallel network from the backbone network so that the backbone remains usable during model data updates. We also introduce the DCSA mechanism of DA-VIT to enhance the local feature information of coal microscopic images; DCSA decomposes the large kernels of convolutional attention into multiple scales and reduces parameters by 81.18%. Finally, we conduct comparison and ablation experiments between DA-VIT and state-of-the-art methods across numerous evaluation metrics. Experimental results show that DA-VIT-Base achieves 92.14% pixel accuracy and 63.18% mIoU, while the Params and FLOPs of DA-VIT-Tiny are 4.95M and 8.99G, respectively. All evaluation metrics of the proposed DA-VIT surpass those of other state-of-the-art methods.
Authors: Wen-Hsuan Chu, Lei Ke, Jianmeng Liu, Mingxiao Huo, Pavel Tokmakov, Katerina Fragkiadaki
Abstract: We tackle the challenge of generating dynamic 4D scenes from monocular, multi-object videos with heavy occlusions, and introduce GenMOJO, a novel approach that integrates rendering-based deformable 3D Gaussian optimization with generative priors for view synthesis. While existing models perform well on novel view synthesis for isolated objects, they struggle to generalize to complex, cluttered scenes. To address this, GenMOJO decomposes the scene into individual objects, optimizing a differentiable set of deformable Gaussians per object. This object-wise decomposition allows leveraging object-centric diffusion models to infer unobserved regions in novel viewpoints. It performs joint Gaussian splatting to render the full scene, capturing cross-object occlusions, and enabling occlusion-aware supervision. To bridge the gap between object-centric priors and the global frame-centric coordinate system of videos, GenMOJO uses differentiable transformations that align generative and rendering constraints within a unified framework. The resulting model generates 4D object reconstructions over space and time, and produces accurate 2D and 3D point tracks from monocular input. Quantitative evaluations and perceptual human studies confirm that GenMOJO generates more realistic novel views of scenes and produces more accurate point tracks compared to existing approaches.
Authors: Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, Wenwu Zhu
Abstract: Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between the VLA model and a lightweight generator. Inspired by the human motion pattern of focusing on key decision points while relying on intuition for other actions, we categorize VLA actions into deliberative and intuitive, assigning the former to the VLA model and the latter to the lightweight generator, enabling frequency-adaptive execution through collaborative model scheduling. To address spatial redundancy, we further develop a spatio-semantic dual-aware token pruning method. Tokens are classified into spatial and semantic types and pruned based on their dual-aware importance to accelerate VLA inference. These two mechanisms work jointly to guide the VLA in focusing on critical actions and salient visual information, achieving effective acceleration while maintaining high accuracy. Experimental results demonstrate that our method achieves up to 1.5$\times$ acceleration with less than 3% drop in accuracy, outperforming existing approaches in multiple tasks.
Authors: Hiroshi Tanaka, Anika Rao, Hana Satou, Michael Johnson, Sofia Garc\'ia
Abstract: Multimodal Large Models (MLLMs) have achieved remarkable progress in vision-language understanding and generation tasks. However, existing MLLMs typically rely on static modality fusion strategies, which treat all modalities equally regardless of their instance-level reliability or semantic contribution. This often leads to suboptimal performance, especially in scenarios with noisy, missing, or misaligned modalities. In this paper, we propose Dynamic Modality Scheduling (DMS), a novel framework that adaptively adjusts the contribution of each modality at a per-sample level. DMS evaluates each modality based on three key factors: (1) \textit{confidence}, estimated from predictive entropy; (2) \textit{uncertainty}, obtained via Monte Carlo dropout; and (3) \textit{semantic consistency}, computed through inter-modal similarity. These signals are combined through a learnable or rule-based scheduler to generate soft modality weights used in downstream fusion. To ensure stable training, we further introduce a \textit{Modality Weight Consistency Loss}, which regularizes the fused representation to stay close to unimodal embeddings proportionally to their assigned weights. Our method is model-agnostic and can be integrated into existing MLLMs such as BLIP-2 and LLaVA. Experimental results on VQA, image-text retrieval, and captioning tasks show that DMS significantly improves both clean and robust performance, especially under modality corruption or dropout conditions. This work provides a general and effective mechanism to enable instance-aware and robustness-enhanced multimodal modeling.
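A rule-based sketch of how the three per-sample signals could be turned into soft modality weights. The combination rule, temperature, and tensor shapes are assumptions made for illustration; the paper also describes a learnable scheduler.

```python
# Hypothetical rule-based scheduler: entropy, MC-dropout variance, cross-modal agreement.
import torch
import torch.nn.functional as F

def modality_weights(logits_per_mod, mc_logits_per_mod, embeds_per_mod, temp=1.0):
    """logits_per_mod: list of (B, K) logits; mc_logits_per_mod: list of (S, B, K)
       Monte Carlo dropout logits; embeds_per_mod: list of (B, D) embeddings."""
    mean_emb = F.normalize(torch.stack(embeds_per_mod).mean(0), dim=-1)
    scores = []
    for logits, mc, emb in zip(logits_per_mod, mc_logits_per_mod, embeds_per_mod):
        p = logits.softmax(-1)
        entropy = -(p * p.clamp_min(1e-8).log()).sum(-1)              # low = confident
        uncertainty = mc.softmax(-1).var(dim=0).mean(-1)              # MC-dropout spread
        consistency = (F.normalize(emb, dim=-1) * mean_emb).sum(-1)   # agreement with others
        scores.append(-entropy - uncertainty + consistency)           # higher = more reliable
    return torch.stack(scores, dim=-1).div(temp).softmax(-1)          # (B, M) soft weights

B, K, D, S = 3, 10, 64, 5
logits = [torch.randn(B, K) for _ in range(2)]
mc = [torch.randn(S, B, K) for _ in range(2)]
emb = [torch.randn(B, D) for _ in range(2)]
print(modality_weights(logits, mc, emb))   # each row sums to 1 over the two modalities
```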
Authors: Minhyuk Choi, Injae Kim, Hyunwoo J. Kim
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a preferred choice alongside Neural Radiance Fields (NeRF) in inverse rendering due to its superior rendering speed. Currently, the common approach in 3DGS is to utilize "single-view" mini-batch training, where only one image is processed per iteration, in contrast to NeRF's "multi-view" mini-batch training, which leverages multiple images. We observe that such single-view training can lead to suboptimal optimization due to increased variance in mini-batch stochastic gradients, highlighting the necessity for multi-view training. However, implementing multi-view training in 3DGS poses challenges. Simply rendering multiple images per iteration incurs considerable overhead and may result in suboptimal Gaussian densification due to its reliance on single-view assumptions. To address these issues, we modify the rasterization process to minimize the overhead associated with multi-view training and propose a 3D distance-aware D-SSIM loss and multi-view adaptive density control that better suits multi-view scenarios. Our experiments demonstrate that the proposed methods significantly enhance the performance of 3DGS and its variants, freeing 3DGS from the constraints of single-view training.
Authors: Liam Bennett, Mason Clark, Lucas Anderson, Hana Satou, Olivia Martinez
Abstract: Multimodal foundation models have achieved impressive progress across a wide range of vision-language tasks. However, existing approaches often adopt fixed or task-specific fusion strategies, neglecting the intrinsic variability of modality reliability and sample complexity. In this paper, we propose Modality-Aware Adaptive Fusion Scheduling (MA-AFS), a general framework that learns to dynamically modulate the contribution of each modality on a per-instance basis. MA-AFS introduces a lightweight neural scheduler that predicts modality fusion weights by integrating visual and textual entropy signals along with cross-modal agreement cues. This enables the model to adaptively emphasize more reliable modalities, especially under noisy, missing, or misaligned inputs. We formulate the fusion process as a differentiable scheduling mechanism, analyze its theoretical consistency and regularization effect, and demonstrate that it improves robustness without increasing model capacity significantly. Extensive experiments on image-text retrieval, captioning, and visual question answering show that MA-AFS achieves consistent performance gains over strong baselines such as CLIP, ALBEF, and BLIP. Moreover, MA-AFS exhibits improved robustness under modality corruption and enhanced generalization under domain shifts. Our work highlights the importance of adaptive fusion and opens a promising direction toward reliable and uncertainty-aware multimodal learning.
Authors: Changsheng Gao, Shan Liu, Feng Wu, Weisi Lin
Abstract: Feature coding has become increasingly important in scenarios where semantic representations rather than raw pixels are transmitted and stored. However, most existing methods are architecture-specific, targeting either CNNs or Transformers. This design limits their applicability in real-world scenarios where features from both architectures coexist. To address this gap, we introduce a new research problem: cross-architecture universal feature coding (CAUFC), which seeks to build a unified codec that can effectively compress features from heterogeneous architectures. To tackle this challenge, we propose a two-step distribution alignment method. First, we design the format alignment method that unifies CNN and Transformer features into a consistent 2D token format. Second, we propose the feature value alignment method that harmonizes statistical distributions via truncation and normalization. As a first attempt to study CAUFC, we evaluate our method on the image classification task. Experimental results demonstrate that our method achieves superior rate-accuracy trade-offs compared to the architecture-specific baseline. This work marks an initial step toward universal feature compression across heterogeneous model architectures.
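A sketch of the two alignment steps described above, under my own simplifying assumptions: percentile truncation and min-max normalization stand in for the paper's exact value-alignment procedure.

```python
# Hypothetical alignment sketch: unify feature formats, then align value ranges.
import torch

def format_align(feat):
    """Unify CNN (B, C, H, W) and Transformer (B, N, D) features into (B, tokens, dim)."""
    if feat.dim() == 4:                                   # CNN feature map
        return feat.flatten(2).transpose(1, 2)            # (B, H*W, C)
    return feat                                           # already token-shaped

def value_align(tokens, lo=0.01, hi=0.99):
    """Truncate outliers to the [lo, hi] quantile range, then min-max normalize."""
    q_lo = torch.quantile(tokens, lo).item()
    q_hi = torch.quantile(tokens, hi).item()
    tokens = tokens.clamp(q_lo, q_hi)
    return (tokens - q_lo) / (q_hi - q_lo + 1e-8)

cnn_feat = torch.randn(2, 256, 14, 14)
vit_feat = torch.randn(2, 196, 768)
for f in (cnn_feat, vit_feat):
    t = value_align(format_align(f))
    print(t.shape, round(float(t.min()), 3), round(float(t.max()), 3))
```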
Authors: Hang Xu, Wei Yu, Jiangtong Tan, Zhen Zou, Feng Zhao
Abstract: Blind Super-Resolution (blind SR) aims to enhance the model's generalization ability with unknown degradation, yet it still encounters severe overfitting issues. Some previous methods inspired by dropout, which enhances generalization by regularizing features, have shown promising results in blind SR. Nevertheless, these methods focus solely on regularizing features before the final layer and overlook the need for generalization in features at intermediate layers. Without explicit regularization of features at intermediate layers, the blind SR network struggles to obtain well-generalized feature representations. However, the key challenge is that directly applying dropout to intermediate layers leads to a significant performance drop, which we attribute to the training-testing inconsistency and the cross-layer inconsistency it introduces. Therefore, we propose Adaptive Dropout, a new regularization method for blind SR models, which mitigates the inconsistency and facilitates application across intermediate layers of networks. Specifically, for training-testing inconsistency, we re-design the form of dropout and integrate the features before and after dropout adaptively. For inconsistency in generalization requirements across different layers, we innovatively design an adaptive training strategy to strengthen feature propagation by layer-wise annealing. Experimental results show that our method outperforms all past regularization methods on both synthetic and real-world benchmark datasets, and is also highly effective in other image restoration tasks. Code is available at \href{https://github.com/xuhang07/Adpative-Dropout}{https://github.com/xuhang07/Adpative-Dropout}.
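A hedged sketch of the adaptive blending idea: mix the feature before and after dropout, with a layer- and epoch-dependent coefficient. The blend coefficient and annealing schedule here are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical adaptive dropout: blend pre- and post-dropout features, anneal per layer.
import torch
import torch.nn.functional as F

def adaptive_dropout(feat, p, blend):
    """blend=0 keeps the original feature; blend=1 is plain dropout."""
    dropped = F.dropout(feat, p=p, training=True)
    return (1 - blend) * feat + blend * dropped

def layerwise_blend(layer_idx, num_layers, epoch, max_epoch):
    # anneal: deeper layers and later epochs receive stronger regularization
    return (layer_idx + 1) / num_layers * min(1.0, epoch / max_epoch)

x = torch.randn(4, 64, 32, 32)
for layer in range(3):
    b = layerwise_blend(layer, num_layers=3, epoch=10, max_epoch=40)
    x = adaptive_dropout(x, p=0.3, blend=b)
print(x.shape)
```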
URLs: https://github.com/xuhang07/Adpative-Dropout
Authors: Rong Wu, Ziqi Chen, Liming Zhong, Heng Li, Hai Shu
Abstract: Existing segmentation models trained on a single medical imaging dataset often lack robustness when encountering unseen organs or tumors. Developing a robust model capable of identifying rare or novel tumor categories not present during training is crucial for advancing medical imaging applications. We propose DSM, a novel framework that leverages diffusion and state space models to segment unseen tumor categories beyond the training data. DSM utilizes two sets of object queries trained within modified attention decoders to enhance classification accuracy. Initially, the model learns organ queries using an object-aware feature grouping strategy to capture organ-level visual features. It then refines tumor queries by focusing on diffusion-based visual prompts, enabling precise segmentation of previously unseen tumors. Furthermore, we incorporate diffusion-guided feature fusion to improve semantic segmentation performance. By integrating CLIP text embeddings, DSM captures category-sensitive classes to improve linguistic transfer knowledge, thereby enhancing the model's robustness across diverse scenarios and multi-label tasks. Extensive experiments demonstrate the superior performance of DSM in various tumor segmentation tasks. Code is available at https://github.com/Rows21/KMax-Mamba.
Authors: Ruojing Li, Wei An, Xinyi Ying, Yingqian Wang, Yimian Dai, Longguang Wang, Miao Li, Yulan Guo, Li Liu
Abstract: Infrared small target (IRST) detection is challenging in simultaneously achieving precise, universal, robust and efficient performance due to extremely dim targets and strong interference. Current learning-based methods attempt to leverage ``more" information from both the spatial and the short-term temporal domains, but suffer from unreliable performance under complex conditions while incurring computational redundancy. In this paper, we explore the ``more essential" information from a more crucial domain for the detection. Through theoretical analysis, we reveal that the global temporal saliency and correlation information in the temporal profile demonstrate significant superiority in distinguishing target signals from other signals. To investigate whether such superiority is preferentially leveraged by well-trained networks, we built the first prediction attribution tool in this field and verified the importance of the temporal profile information. Inspired by the above conclusions, we remodel the IRST detection task as a one-dimensional signal anomaly detection task, and propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection. We conducted extensive experiments to fully validate the effectiveness of our method. The experimental results are exciting, as our DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency, and achieves a significant improvement on dim targets and in complex scenarios. We provide a new modeling domain, a new insight, a new method, and a new level of performance, which can promote the development of IRST detection. Codes are available at https://github.com/TinaLRJ/DeepPro.
Authors: Han Ke, Xiao Ke, Ye Yan, Rui Liu, Jinpeng Yang, Tianwen Zhang, Xu Zhan, Xiaowo Xu
Abstract: DL-based Synthetic Aperture Radar (SAR) ship detection has tremendous advantages in numerous areas. However, it still faces some problems, such as the lack of prior knowledge, which seriously affects detection accuracy. In order to solve this problem, we propose a scene-aware SAR ship detection method based on unsupervised sea-land segmentation. This method follows a classical two-stage framework and is enhanced by two models: the unsupervised land and sea segmentation module (ULSM) and the land attention suppression module (LASM). ULSM and LASM can adaptively guide the network to reduce attention on land according to the type of scene (inshore or offshore) and add prior knowledge (sea-land segmentation information) to the network, thereby directly reducing the network's attention to land and improving offshore detection performance. This increases the accuracy of ship detection and enhances the interpretability of the model. Specifically, considering the lack of land-sea segmentation labels in existing deep learning-based SAR ship detection datasets, ULSM uses an unsupervised approach to classify the input data scene into inshore and offshore types and performs sea-land segmentation for inshore scenes. LASM uses the sea-land segmentation information as prior knowledge to reduce the network's attention to land. We conducted our experiments using the publicly available SSDD dataset, which demonstrated the effectiveness of our network.
Authors: Junbo Niu, Yuanhong Zheng, Ziyang Miao, Hejun Dong, Chunjiang Ge, Hao Liang, Ma Lu, Bohan Zeng, Qiahao Zheng, Conghui He, Wentao Zhang
Abstract: Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks. Code is available at https://github.com/Niujunbo2002/NativeRes-LLaVA.
Authors: Szabolcs Velkei, Csaba Goldschmidt, K\'aroly Vass
Abstract: The Deep Learning Visual Space Simulation System (DLVS3) introduces a novel synthetic dataset generator and a simulation pipeline specifically designed for training and testing satellite pose estimation solutions. This work introduces the DLVS3-HST-V1 dataset, which focuses on the Hubble Space Telescope (HST) as a complex, articulated target. The dataset is generated using advanced real-time and offline rendering technologies, integrating high-fidelity 3D models, dynamic lighting (including secondary sources like Earth reflection), and physically accurate material properties. The pipeline supports the creation of large-scale, richly annotated image sets with ground-truth 6-DoF pose and keypoint data, semantic segmentation, depth, and normal maps. This enables the training and benchmarking of deep learning-based pose estimation solutions under realistic, diverse, and challenging visual conditions. The paper details the dataset generation process, the simulation architecture, and the integration with deep learning frameworks, and positions DLVS3 as a significant step toward closing the domain gap for autonomous spacecraft operations in proximity and servicing missions.
Authors: Chen Zhu, Kang Liang, Jianrong Bao, Zhouxiang Zhao, Zhaohui Yang, Zhaoyang Zhang, Mohammad Shikh-Bahaei
Abstract: The advent of 6G networks demands unprecedented levels of intelligence, adaptability, and efficiency to address challenges such as ultra-high-speed data transmission, ultra-low latency, and massive connectivity in dynamic environments. Traditional wireless image transmission frameworks, reliant on static configurations and isolated source-channel coding, struggle to balance computational efficiency, robustness, and quality under fluctuating channel conditions. To bridge this gap, this paper proposes an AI-native deep joint source-channel coding (JSCC) framework tailored for resource-constrained 6G networks. Our approach integrates key information extraction and adaptive background synthesis to enable intelligent, semantic-aware transmission. Leveraging AI-driven tools (Mediapipe for human pose detection and Rembg for background removal), the model dynamically isolates foreground features and matches backgrounds from a pre-trained library, reducing data payloads while preserving visual fidelity. Experimental results demonstrate significant improvements in peak signal-to-noise ratio (PSNR) compared with the traditional JSCC method, especially under low-SNR conditions. This approach offers a practical solution for multimedia services in resource-constrained mobile communications.
Authors: Mufan Liu, Cixiao Zhang, Qi Yang, Yujie Cao, Yiling Xu, Yin Xu, Shu Sun, Mingzeng Dai, Yunfeng Guan
Abstract: Modeling the wireless radiance field (WRF) is fundamental to modern communication systems, enabling key tasks such as localization, sensing, and channel estimation. Traditional approaches, which rely on empirical formulas or physical simulations, often suffer from limited accuracy or require strong scene priors. Recent neural radiance field (NeRF-based) methods improve reconstruction fidelity through differentiable volumetric rendering, but their reliance on computationally expensive multilayer perceptron (MLP) queries hinders real-time deployment. To overcome these challenges, we introduce Gaussian splatting (GS) to the wireless domain, leveraging its efficiency in modeling optical radiance fields to enable compact and accurate WRF reconstruction. Specifically, we propose SwiftWRF, a deformable 2D Gaussian splatting framework that synthesizes WRF spectra at arbitrary positions under single-sided transceiver mobility. SwiftWRF employs CUDA-accelerated rasterization to render spectra at over 100000 fps and uses a lightweight MLP to model the deformation of 2D Gaussians, effectively capturing mobility-induced WRF variations. In addition to novel spectrum synthesis, the efficacy of SwiftWRF is further underscored in its applications in angle-of-arrival (AoA) and received signal strength indicator (RSSI) prediction. Experiments conducted on both real-world and synthetic indoor scenes demonstrate that SwiftWRF can reconstruct WRF spectra up to 500x faster than existing state-of-the-art methods, while significantly enhancing its signal quality. Code and datasets will be released.
Authors: Wenhao Shen, Gangjian Zhang, Jianfeng Zhang, Yu Feng, Nanjie Yao, Xuanmeng Zhang, Hao Wang
Abstract: Single-view textured human reconstruction aims to reconstruct a clothed 3D digital human by inputting a monocular 2D image. Existing approaches include feed-forward methods, limited by scarce 3D human data, and diffusion-based methods, prone to erroneous 2D hallucinations. To address these issues, we propose a novel SMPL normal map Equipped 3D Human Reconstruction (SEHR) framework, integrating a pretrained large 3D reconstruction model with a human geometry prior. SEHR performs single-view human reconstruction without using a preset diffusion model in one forward propagation. Concretely, SEHR consists of two key components: SMPL Normal Map Guidance (SNMG) and SMPL Normal Map Constraint (SNMC). SNMG incorporates SMPL normal maps into an auxiliary network to provide improved body shape guidance. SNMC enhances invisible body parts by constraining the model to predict an extra set of SMPL normal Gaussians. Extensive experiments on two benchmark datasets demonstrate that SEHR outperforms existing state-of-the-art methods.
Authors: Afifa Khaled, Mohammed Sabir, Rizwan Qureshi, Camillo Maria Caruso, Valerio Guarrasi, Suncheng Xiang, S Kevin Zhou
Abstract: The Medical Information Mart for Intensive Care (MIMIC) datasets have become the Kernel of Digital Health Research by providing freely accessible, deidentified records from tens of thousands of critical care admissions, enabling a broad spectrum of applications in clinical decision support, outcome prediction, and healthcare analytics. Although numerous studies and surveys have explored the predictive power and clinical utility of MIMIC based models, critical challenges in data integration, representation, and interoperability remain underexplored. This paper presents a comprehensive survey that focuses uniquely on open problems. We identify persistent issues such as data granularity, cardinality limitations, heterogeneous coding schemes, and ethical constraints that hinder the generalizability and real-time implementation of machine learning models. We highlight key progress in dimensionality reduction, temporal modelling, causal inference, and privacy preserving analytics, while also outlining promising directions including hybrid modelling, federated learning, and standardized preprocessing pipelines. By critically examining these structural limitations and their implications, this survey offers actionable insights to guide the next generation of MIMIC powered digital health innovations.
Authors: Haoyou Deng, Zhiqiang Li, Feng Zhang, Qingbo Lu, Zisheng Cao, Yuanjie Shao, Shuhang Gu, Changxin Gao, Nong Sang
Abstract: Overfitting to synthetic training pairs remains a critical challenge in image dehazing, leading to poor generalization capability to real-world scenarios. To address this issue, existing approaches utilize unpaired realistic data for training, employing CycleGAN or contrastive learning frameworks. Despite their progress, these methods often suffer from training instability, resulting in limited dehazing performance. In this paper, we propose a novel training strategy for unpaired image dehazing, termed Rehazy, to improve both dehazing performance and training stability. This strategy explores the consistency of the underlying clean images across hazy images and utilizes hazy-rehazy pairs for effective learning of real haze characteristics. To favorably construct hazy-rehazy pairs, we develop a physics-based rehazy generation pipeline, which is theoretically validated to reliably produce high-quality rehazy images. Additionally, leveraging the rehazy strategy, we introduce a dual-branch framework for dehazing network training, where a clean branch provides a basic dehazing capability in a synthetic manner, and a hazy branch enhances the generalization ability with hazy-rehazy pairs. Moreover, we design a new dehazing network within these branches to improve the efficiency, which progressively restores clean scenes from coarse to fine. Extensive experiments on four benchmarks demonstrate the superior performance of our approach, exceeding the previous state-of-the-art methods by 3.58 dB on the SOTS-Indoor dataset and by 1.85 dB on the SOTS-Outdoor dataset in PSNR. Our code will be publicly available.
Authors: Zhihan Zhang, Xiang Pan, Hongchen Wei, Zhenzhong Chen
Abstract: Structural pruning techniques are essential for deploying multimodal large language models (MLLMs) across various hardware platforms, from edge devices to cloud servers. However, current pruning methods typically determine optimal strategies through iterative search processes, resulting in substantial computational overhead for on-demand MLLM adaptation. To address this challenge, we propose LOP, an efficient neural pruning framework that learns optimal pruning strategies from the target pruning constraint, eliminating the need for computationally expensive search-based methods. The LOP approach trains autoregressive neural networks (NNs) to directly predict layer-wise pruning strategies adaptive to the target pruning constraint, eliminating the time-consuming iterative searches. Experimental results across multiple tasks show that LOP outperforms state-of-the-art pruning methods in various metrics while achieving up to three orders of magnitude speedup.
Authors: Chenglin Wang, Yucheng Zhou, Qianning Wang, Zhe Wang, Kai Zhang
Abstract: Text-driven image editing has achieved remarkable success in following single instructions. However, real-world scenarios often involve complex, multi-step instructions, particularly ``chain'' instructions where operations are interdependent. Current models struggle with these intricate directives, and existing benchmarks inadequately evaluate such capabilities. Specifically, they often overlook multi-instruction and chain-instruction complexities, and common consistency metrics are flawed. To address this, we introduce ComplexBench-Edit, a novel benchmark designed to systematically assess model performance on complex, multi-instruction, and chain-dependent image editing tasks. ComplexBench-Edit also features a new vision consistency evaluation method that accurately assesses non-modified regions by excluding edited areas. Furthermore, we propose a simple yet powerful Chain-of-Thought (CoT)-based approach that significantly enhances the ability of existing models to follow complex instructions. Our extensive experiments demonstrate ComplexBench-Edit's efficacy in differentiating model capabilities and highlight the superior performance of our CoT-based method in handling complex edits. The data and code are released at https://github.com/llllly26/ComplexBench-Edit.
Authors: Di Kong, Qianhui Wan
Abstract: Reconstructing a 3D point cloud from a given conditional sketch is challenging. Existing methods often work directly in 3D space, but domain variability and difficulty in reconstructing accurate 3D structures from 2D sketches remain significant obstacles. Moreover, ideal models should also accept prompts for control in addition to the sparse sketch, posing challenges in multi-modal fusion. We propose DiffS-NOCS (Diffusion-based Sketch-to-NOCS Map), which leverages ControlNet with a modified multi-view decoder to generate NOCS maps with embedded 3D structure and position information in 2D space from sketches. The 3D point cloud is reconstructed by combining multiple NOCS maps from different views. To enhance sketch understanding, we integrate a viewpoint encoder for extracting viewpoint features. Additionally, we design a feature-level multi-view aggregation network as the denoising module, facilitating cross-view information exchange and improving 3D consistency in NOCS map generation. Experiments on ShapeNet demonstrate that DiffS-NOCS achieves controllable and fine-grained point cloud reconstruction aligned with sketches.
Authors: Mustansar Fiaz, Mubashir Noman, Hiyam Debary, Kamran Ali, Hisham Cholakkal
Abstract: Recently, convolution- and transformer-based change detection (CD) methods have provided promising performance. However, it remains unclear how the local and global dependencies interact to effectively alleviate the pseudo changes. Moreover, directly utilizing standard self-attention presents intrinsic limitations, including global feature representations that are limited in capturing subtle changes, quadratic complexity, and restricted training parallelism. To address these limitations, we propose a Siamese-based framework, called HyRet-Change, which can seamlessly integrate the merits of convolution and retention mechanisms at multi-scale features to preserve critical information and enhance adaptability in complex scenes. Specifically, we introduce a novel feature difference module to exploit both convolutions and multi-head retention mechanisms in a parallel manner to capture complementary information. Furthermore, we propose an adaptive local-global interactive context awareness mechanism that enables mutual learning and enhances discrimination capability through information exchange. We perform experiments on three challenging CD datasets and achieve state-of-the-art performance compared to existing methods. Our source code is publicly available at https://github.com/mustansarfiaz/HyRect-Change.
Authors: Hao Xu, Lechao Cheng, Yaxiong Wang, Shengeng Tang, Zhun Zhong
Abstract: We present our solution to the MiGA Challenge at IJCAI 2025, which aims to recognize micro-gestures (MGs) from skeleton sequences for the purpose of hidden emotion understanding. MGs are characterized by their subtlety, short duration, and low motion amplitude, making them particularly challenging to model and classify. We adopt PoseC3D as the baseline framework and introduce three key enhancements: (1) a topology-aware skeleton representation specifically designed for the iMiGUE dataset to better capture fine-grained motion patterns; (2) an improved temporal processing strategy that facilitates smoother and more temporally consistent motion modeling; and (3) the incorporation of semantic label embeddings as auxiliary supervision to improve the model generalization. Our method achieves a Top-1 accuracy of 67.01\% on the iMiGUE test set. As a result of these contributions, our approach ranks third on the official MiGA Challenge leaderboard. The source code is available at \href{https://github.com/EGO-False-Sleep/Miga25_track1}{https://github.com/EGO-False-Sleep/Miga25\_track1}.
URLs: https://github.com/EGO-False-Sleep/Miga25_track1
Authors: Songtao Jiang, Yuan Wang, Ruizhe Chen, Yan Zhang, Ruilin Luo, Bohan Lei, Sibo Song, Yang Feng, Jimeng Sun, Jian Wu, Zuozhu Liu
Abstract: In medical visual question answering (Med-VQA), achieving accurate responses relies on three critical steps: precise perception of medical imaging data, logical reasoning grounded in visual input and textual questions, and coherent answer derivation from the reasoning process. Recent advances in general vision-language models (VLMs) show that large-scale reinforcement learning (RL) could significantly enhance both reasoning capabilities and overall model performance. However, their application in medical domains is hindered by two fundamental challenges: 1) misalignment between perceptual understanding and reasoning stages, and 2) inconsistency between reasoning pathways and answer generation, both compounded by the scarcity of high-quality medical datasets for effective large-scale RL. In this paper, we first introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks. Moreover, we propose a novel large-scale RL framework for Med-VLMs, Consistency-Aware Preference Optimization (CAPO), which integrates rewards to ensure fidelity between perception and reasoning, consistency in reasoning-to-answer derivation, and rule-based accuracy for final responses. Extensive experiments on both in-domain and out-of-domain scenarios demonstrate the superiority of our method over strong VLM baselines, showcasing strong generalization capability to 3D Med-VQA benchmarks and R1-like training paradigms.
Authors: Jie Liu, Zheng Hui
Abstract: Video object removal and inpainting are critical tasks in the fields of computer vision and multimedia processing, aimed at restoring missing or corrupted regions in video sequences. Traditional methods predominantly rely on flow-based propagation and spatio-temporal Transformers, but these approaches face limitations in effectively leveraging long-term temporal features and ensuring temporal consistency in the completion results, particularly when dealing with large masks. Consequently, performance on extensive masked areas remains suboptimal. To address these challenges, this paper introduces a novel video inpainting approach leveraging the Diffusion Transformer (DiT). DiT synergistically combines the advantages of diffusion models and transformer architectures to maintain long-term temporal consistency while ensuring high-quality inpainting results. We propose a Circular Position-Shift strategy to further enhance long-term temporal consistency during the inference stage. Additionally, the proposed method automatically detects objects within videos, interactively removes specified objects, and generates corresponding prompts. In terms of processing speed, it takes only 180 seconds (testing on one NVIDIA A100 GPU) to complete a video with a resolution of $1080 \times 1920$ with 121 frames without any acceleration method. Experimental results indicate that the proposed method demonstrates superior performance in content fidelity, texture restoration, and temporal consistency. Project page: https://jieliu95.github.io/EraserDiT_demo.
Authors: Rongxuan Peng, Shunquan Tan, Xianbo Mo, Alex C. Kot, Jiwu Huang
Abstract: Recent advances in deep learning have significantly propelled the development of image forgery localization. However, existing models remain highly vulnerable to adversarial attacks: imperceptible noise added to forged images can severely mislead these models. In this paper, we address this challenge with an Adversarial Noise Suppression Module (ANSM) that generates a defensive perturbation to suppress the attack effect of adversarial noise. We observe that forgery-relevant features extracted from adversarial and original forged images exhibit distinct distributions. To bridge this gap, we introduce Forgery-relevant Features Alignment (FFA) as a first-stage training strategy, which reduces distributional discrepancies by minimizing the channel-wise Kullback-Leibler divergence between these features. To further refine the defensive perturbation, we design a second-stage training strategy, termed Mask-guided Refinement (MgR), which incorporates a dual-mask constraint. MgR ensures that the perturbation remains effective for both adversarial and original forged images, recovering forgery localization accuracy to its original level. Extensive experiments across various attack algorithms demonstrate that our method significantly restores the forgery localization model's performance on adversarial images. Notably, when ANSM is applied to original forged images, the performance remains nearly unaffected. To the best of our knowledge, this is the first report of adversarial defense in image forgery localization tasks. We have released the source code and anti-forensics dataset.
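A sketch of a channel-wise KL alignment term between forgery-relevant features of adversarial and original forged images. Treating each channel's spatial map as a distribution via softmax is my assumption about how the distributions could be formed, not the paper's exact formulation.

```python
# Hypothetical channel-wise KL alignment between two feature maps.
import torch
import torch.nn.functional as F

def channelwise_kl(feat_adv, feat_orig):
    """feat_*: (B, C, H, W). Treat each channel's spatial map as a distribution and
       average the KL divergence over channels and the batch."""
    c = feat_adv.size(1)
    log_p_adv = F.log_softmax(feat_adv.flatten(2), dim=-1)   # (B, C, H*W) log-probs
    p_org = F.softmax(feat_orig.flatten(2), dim=-1)          # (B, C, H*W) target probs
    return F.kl_div(log_p_adv, p_org, reduction="batchmean") / c

loss = channelwise_kl(torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28))
print(float(loss))
```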
Authors: Lu Chen, Han Yang, Hu Wang, Yuxin Cao, Shaofeng Li, Yuan Luo
Abstract: Adversarial examples have attracted significant attention over the years, yet understanding their frequency-based characteristics remains insufficient. In this paper, we investigate the intriguing properties of adversarial examples in the frequency domain for the image classification task, with the following key findings. (1) As the high-frequency components increase, the performance gap between adversarial and natural examples becomes increasingly pronounced. (2) The model performance against filtered adversarial examples initially increases to a peak and declines to its inherent robustness. (3) In Convolutional Neural Networks, mid- and high-frequency components of adversarial examples exhibit their attack capabilities, while in Transformers, low- and mid-frequency components of adversarial examples are particularly effective. These results suggest that different network architectures have different frequency preferences and that differences in frequency components between adversarial and natural examples may directly influence model robustness. Based on our findings, we further conclude with three useful proposals that serve as a valuable reference to the AI model security community.
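A utility sketch for the kind of frequency filtering used in such analyses: keep only components below (or above) a radial cutoff with a centered FFT mask. The circular mask is a common choice for illustration, not necessarily the paper's exact filter.

```python
# Radial low-pass / high-pass filtering of images in the Fourier domain.
import torch

def radial_filter(img, cutoff, keep="low"):
    """img: (B, C, H, W); cutoff: radius in [0, 1] relative to the largest distance."""
    b, c, h, w = img.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    radius = cutoff * dist.max()
    mask = ((dist <= radius) if keep == "low" else (dist > radius)).float()
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    spec = spec * mask                               # zero out the unwanted band
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

x = torch.rand(1, 3, 64, 64)
low = radial_filter(x, cutoff=0.25, keep="low")
high = radial_filter(x, cutoff=0.25, keep="high")
print(torch.allclose(x, low + high, atol=1e-5))      # the two bands sum back to the image
```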
Authors: Mehmet Ozgur Turkoglu, Selene Ledain, Helge Aasen
Abstract: Conventional benchmarks for crop type classification from optical satellite time series typically assume access to labeled data from the same year and rely on fixed calendar-day sampling. This limits generalization across seasons, where crop phenology shifts due to interannual climate variability, and precludes real-time application when current-year labels are unavailable. Furthermore, uncertainty quantification is often neglected, making such approaches unreliable for crop monitoring applications. Inspired by ecophysiological principles of plant growth, we propose a simple, model-agnostic sampling strategy that leverages growing degree days (GDD), based on daily average temperature, to replace calendar time with thermal time. By uniformly subsampling time series in this biologically meaningful domain, the method emphasizes phenologically active growth stages while reducing temporal redundancy and noise. We evaluate the method on a multi-year Sentinel-2 dataset spanning all of Switzerland, training on one growing season and testing on other seasons. Compared to state-of-the-art baselines, our method delivers substantial gains in classification accuracy and, critically, produces more calibrated uncertainty estimates. Notably, our method excels in low-data regimes and enables significantly more accurate early-season classification. With only 10 percent of the training data, our method surpasses the state-of-the-art baseline in both predictive accuracy and uncertainty estimation, and by the end of June, it achieves performance similar to a baseline trained on the full season. These results demonstrate that leveraging temperature data not only improves predictive performance across seasons but also enhances the robustness and trustworthiness of crop-type mapping in real-world applications.
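A sketch of thermal-time resampling: accumulate growing degree days (GDD) from daily mean temperature and pick observation dates that are evenly spaced in GDD rather than in calendar days. The base temperature, synthetic season, and sample counts are illustrative assumptions, not the paper's configuration.

```python
# GDD-based subsampling of a satellite time series (model-agnostic preprocessing step).
import numpy as np

def gdd_subsample(dates_doy, t_mean, n_samples, t_base=0.0):
    """dates_doy: day-of-year of each observation; t_mean: daily mean temps for the
       whole season (indexed by day-of-year); returns indices of selected observations."""
    gdd_daily = np.maximum(t_mean - t_base, 0.0).cumsum()       # cumulative thermal time
    obs_gdd = gdd_daily[dates_doy]                              # thermal time of each obs
    targets = np.linspace(obs_gdd.min(), obs_gdd.max(), n_samples)
    # for each target thermal time, take the closest available observation
    return np.unique(np.abs(obs_gdd[None, :] - targets[:, None]).argmin(axis=1))

rng = np.random.default_rng(0)
t_mean = 10 + 12 * np.sin(np.linspace(0, np.pi, 365))           # synthetic season
dates = np.sort(rng.choice(365, size=60, replace=False))        # 60 acquisition dates
idx = gdd_subsample(dates, t_mean, n_samples=20)
print(len(idx), dates[idx][:5])
```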
Authors: Taiga Hayami, Kakeru Koizumi, Hiroshi Watanabe
Abstract: Implicit Neural Representations (INRs) have attracted significant interest for their ability to model complex signals by mapping spatial and temporal coordinates to signal values. In the context of neural video representation, several decoding strategies have been explored to balance compactness and reconstruction quality, including pixel-wise, frame-wise, and patch-wise methods. Patch-wise decoding aims to combine the flexibility of pixel-based models with the efficiency of frame-based approaches. However, conventional uniform patch division often leads to discontinuities at patch boundaries, as independently reconstructed regions may fail to form a coherent global structure. To address this limitation, we propose a neural video representation method based on Structure-Preserving Patches (SPPs). Our approach rearranges each frame into a set of spatially structured patch frames using a PixelUnshuffle-like operation. This rearrangement maintains the spatial coherence of the original frame while enabling patch-level decoding. The network learns to predict these rearranged patch frames, which supports a global-to-local fitting strategy and mitigates degradation caused by upsampling. Experiments on standard video datasets show that the proposed method improves reconstruction quality and compression performance compared to existing INR-based video representation methods.
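A sketch of the PixelUnshuffle-style rearrangement: each frame is split into spatially subsampled "patch frames" that each retain the global layout of the original frame. The factor r=2 and the toy frame are examples, not the paper's settings.

```python
# PixelUnshuffle rearrangement into structure-preserving patch frames (and its inverse).
import torch
import torch.nn.functional as F

frame = torch.arange(1 * 3 * 8 * 8, dtype=torch.float32).reshape(1, 3, 8, 8)
r = 2
patch_frames = F.pixel_unshuffle(frame, downscale_factor=r)    # (1, 3*r*r, 4, 4)
print(patch_frames.shape)                                       # torch.Size([1, 12, 4, 4])

# Every output channel is a spatially subsampled copy of one input channel, so a
# patch-wise decoder still sees the global layout of the frame in each patch frame.
recon = F.pixel_shuffle(patch_frames, upscale_factor=r)         # exact inverse rearrangement
print(torch.equal(recon, frame))                                # True
```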
Authors: Hyunjin Kim, Haebeom Jung, Jaesik Park
Abstract: We propose an adaptive sampling framework for 3D Gaussian Splatting (3DGS) that leverages comprehensive multi-view photometric error signals within a unified Metropolis-Hastings approach. Traditional 3DGS methods heavily rely on heuristic-based density-control mechanisms (e.g., cloning, splitting, and pruning), which can lead to redundant computations or the premature removal of beneficial Gaussians. Our framework overcomes these limitations by reformulating densification and pruning as a probabilistic sampling process, dynamically inserting and relocating Gaussians based on aggregated multi-view errors and opacity scores. Guided by Bayesian acceptance tests derived from these error-based importance scores, our method substantially reduces reliance on heuristics, offers greater flexibility, and adaptively infers Gaussian distributions without requiring predefined scene complexity. Experiments on benchmark datasets, including Mip-NeRF360, Tanks and Temples, and Deep Blending, show that our approach reduces the number of Gaussians needed, enhancing computational efficiency while matching or modestly surpassing the view-synthesis quality of state-of-the-art models.
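A toy sketch of a Metropolis-Hastings style acceptance test driven by error-based importance scores. The score definition (error times opacity), the proposal, and the temperature are assumptions used for illustration, not the paper's exact criteria.

```python
# Metropolis-Hastings acceptance over per-Gaussian importance scores.
import torch

def mh_accept(score_new, score_old, temperature=1.0):
    """Accept a proposed insertion/relocation with probability
       min(1, exp((score_new - score_old) / temperature))."""
    ratio = torch.exp((score_new - score_old) / temperature)
    accept_prob = torch.clamp(ratio, max=1.0)
    return torch.rand_like(accept_prob) < accept_prob

# importance score per Gaussian: aggregated multi-view photometric error times opacity
errors = torch.rand(1000)                        # stand-in for aggregated view errors
opacity = torch.rand(1000)
scores = errors * opacity
proposal = scores[torch.randperm(1000)]          # candidate relocations (illustrative)
accepted = mh_accept(proposal, scores, temperature=0.1)
print(f"accepted {accepted.float().mean().item():.2%} of proposals")
```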
Authors: Nabil Hezil, Suraj Singh, Vita Vlasova, Oleg Rogov, Ahmed Bouridane, Rifat Hamoudi
Abstract: Accurate segmentation of vascular structures in coronary angiography remains a core challenge in medical image analysis due to the complexity of elongated, thin, and low-contrast vessels. Classical convolutional neural networks (CNNs) often fail to preserve topological continuity, while recent Vision Transformer (ViT)-based models, although strong in global context modeling, lack precise boundary awareness. In this work, we introduce BAVT, a Boundary-Aware Vision Transformer: a ViT-based architecture enhanced with an edge-aware loss that explicitly guides the segmentation toward fine-grained vascular boundaries. Unlike hybrid transformer-CNN models, BAVT retains a minimal, scalable structure that is fully compatible with large-scale vision foundation model (VFM) pretraining. We validate our approach on the DCA-1 coronary angiography dataset, where BAVT achieves superior performance across medical image segmentation metrics, outperforming both CNN and hybrid baselines. These results demonstrate the effectiveness of combining plain ViT encoders with boundary-aware supervision for clinical-grade vascular segmentation.
Authors: Xiaoya Tang, Bodong Zhang, Man Minh Ho, Beatrice S. Knudsen, Tolga Tasdizen
Abstract: Despite the widespread adoption of transformers in medical applications, the exploration of multi-scale learning through transformers remains limited, while hierarchical representations are considered advantageous for computer-aided medical diagnosis. We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are adapted for transformer input through an innovative patch tokenization process, preserving the inherited multi-scale inductive biases. We also introduce a scale-wise attention mechanism that directly captures intra-scale and inter-scale associations. This mechanism complements patch-wise attention by enhancing spatial understanding and preserving global perception, which we refer to as local and global attention, respectively. Our model significantly outperforms baseline models in terms of classification accuracy, demonstrating its efficiency in bridging the gap between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
Authors: Xinyi Zhao, Congjing Zhang, Pei Guo, Wei Li, Lin Chen, Chaoyue Zhao, Shuai Huang
Abstract: Video anomaly detection (VAD) is essential for enhancing safety and security by identifying unusual events across different environments. Existing VAD benchmarks, however, are primarily designed for general-purpose scenarios, neglecting the specific characteristics of smart home applications. To bridge this gap, we introduce SmartHome-Bench, the first comprehensive benchmark specially designed for evaluating VAD in smart home scenarios, focusing on the capabilities of multi-modal large language models (MLLMs). Our newly proposed benchmark consists of 1,203 videos recorded by smart home cameras, organized according to a novel anomaly taxonomy that includes seven categories, such as Wildlife, Senior Care, and Baby Monitoring. Each video is meticulously annotated with anomaly tags, detailed descriptions, and reasoning. We further investigate adaptation methods for MLLMs in VAD, assessing state-of-the-art closed-source and open-source models with various prompting techniques. Results reveal significant limitations in the current models' ability to detect video anomalies accurately. To address these limitations, we introduce the Taxonomy-Driven Reflective LLM Chain (TRLC), a new LLM chaining framework that achieves a notable 11.62% improvement in detection accuracy. The benchmark dataset and code are publicly available at https://github.com/Xinyi-0724/SmartHome-Bench-LLM.
Authors: Sebastian Janampa, Marios Pattichis
Abstract: Multi-person pose estimation (MPPE) estimates keypoints for all individuals present in an image. MPPE is a fundamental task for several applications in computer vision and virtual reality. Unfortunately, there are currently no transformer-based models that can perform MPPE in real time. The paper presents a family of transformer-based models capable of performing multi-person 2D pose estimation in real-time. Our approach utilizes a modified decoder architecture and keypoint similarity metrics to generate both positive and negative queries, thereby enhancing the quality of the selected queries within the architecture. Compared to state-of-the-art models, our proposed models train much faster, using 5 to 10 times fewer epochs, with competitive inference times without requiring quantization libraries to speed up the model. Furthermore, our proposed models provide competitive results or outperform alternative models, often using significantly fewer parameters.
Authors: Morris Alper, David Novotny, Filippos Kokkinos, Hadar Averbuch-Elor, Tom Monnier
Abstract: Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets with limited diversity, camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly fewer data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.
Authors: Thanh Tran, Son T. Luu, Quan Bui, Shoshin Nomura
Abstract: This paper proposes a method for automatic GUI component detection for the IBM i system (formerly and still more commonly known as AS/400). We introduce a human-annotated dataset consisting of 1,050 system screen images, in which 381 images are screenshots of IBM i system screens in Japanese. Each image contains multiple components, including text labels, text boxes, options, tables, instructions, keyboards, and command lines. We then develop a detection system based on state-of-the-art deep learning models and evaluate different approaches using our dataset. The experimental results demonstrate the effectiveness of our dataset in constructing a system for component detection from GUI screens. By automatically detecting GUI components from the screen, AS400-DET has the potential to perform automated testing on systems that operate via GUI screens.
Authors: Zijian Zhang, Xuecheng Wu, Danlei Huang, Siyu Yan, Chong Peng, Xuezhi Cao
Abstract: Driven by the rapid progress in vision-language models (VLMs), the responsible behavior of large-scale multimodal models has become a prominent research area, particularly focusing on hallucination detection and factuality checking. In this paper, we present our solution for the two tracks of the Responsible AI challenge. Insights from the general domain show that a smaller distilled VLM can often outperform a larger VLM that is directly tuned on downstream tasks, while achieving higher efficiency. We thus jointly tackle the two tasks from the perspective of knowledge distillation and propose a progressive hybrid knowledge distillation framework termed HKD4VLM. Specifically, the overall framework can be decomposed into Pyramid-like Progressive Online Distillation and Ternary-Coupled Refinement Distillation, hierarchically moving from coarse-grained knowledge alignment to fine-grained refinement. In addition, we introduce mapping shift-enhanced inference and diverse augmentation strategies to enhance model performance and robustness. Extensive experimental results demonstrate the effectiveness of our HKD4VLM. Ablation studies provide insights into the critical design choices driving performance gains.
Authors: Amran Bhuiyan, Mizanur Rahman, Md Tahmid Rahman Laskar, Aijun An, Jimmy Xiangji Huang
Abstract: Person re-identification (ReID) has evolved from handcrafted feature-based methods to deep learning approaches and, more recently, to models incorporating large language models (LLMs). Early methods struggled with variations in lighting, pose, and viewpoint, but deep learning addressed these issues by learning robust visual features. Building on this, LLMs now enable ReID systems to integrate semantic and contextual information through natural language. This survey traces that full evolution and offers one of the first comprehensive reviews of ReID approaches that leverage LLMs, where textual descriptions are used as privileged information to improve visual matching. A key contribution is the use of dynamic, identity-specific prompts generated by GPT-4o, which enhance the alignment between images and text in vision-language ReID systems. Experimental results show that these descriptions improve accuracy, especially in complex or ambiguous cases. To support further research, we release a large set of GPT-4o-generated descriptions for standard ReID datasets. By bridging computer vision and natural language processing, this survey offers a unified perspective on the field's development and outlines key future directions such as better prompt design, cross-modal transfer learning, and real-world adaptability.
Authors: Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus H\"oschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Michael Black
Abstract: We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, benchmark, method, training code, and pre-trained model weights for research purposes.
Authors: Christian Hilaire, Sima Didari
Abstract: We propose a novel active learning framework for multi-view semantic segmentation. This framework relies on a new score that measures the discrepancy between point cloud distributions generated from the extra geometrical information derived from the model's prediction across different views. Our approach results in a data efficient and explainable active learning method. The source code is available at https://github.com/chilai235/viewpclAL.
Authors: Adhrith Vutukuri, Akash Awasthi, David Yang, Carol C. Wu, Hien Van Nguyen
Abstract: Chest radiography is widely used in diagnostic imaging. However, perceptual errors -- especially overlooked but visible abnormalities -- remain common and clinically significant. Current workflows and AI systems provide limited support for detecting such errors after interpretation and often lack meaningful human--AI collaboration. We introduce RADAR (Radiologist--AI Diagnostic Assistance and Review), a post-interpretation companion system. RADAR ingests finalized radiologist annotations and CXR images, then performs regional-level analysis to detect and refer potentially missed abnormal regions. The system supports a "second-look" workflow and offers suggested regions of interest (ROIs) rather than fixed labels to accommodate inter-observer variation. We evaluated RADAR on a simulated perceptual-error dataset derived from de-identified CXR cases, using F1 score and Intersection over Union (IoU) as primary metrics. RADAR achieved a recall of 0.78, precision of 0.44, and an F1 score of 0.56 in detecting missed abnormalities in the simulated perceptual-error dataset. Although precision is moderate, this reduces over-reliance on AI by encouraging radiologist oversight in human--AI collaboration. The median IoU was 0.78, with more than 90% of referrals exceeding 0.5 IoU, indicating accurate regional localization. RADAR effectively complements radiologist judgment, providing valuable post-read support for perceptual-error detection in CXR interpretation. Its flexible ROI suggestions and non-intrusive integration position it as a promising tool in real-world radiology workflows. To facilitate reproducibility and further evaluation, we release a fully open-source web implementation alongside a simulated error dataset. All code, data, demonstration videos, and the application are publicly available at https://github.com/avutukuri01/RADAR.
Authors: Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban
Abstract: Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision--language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
Authors: Hu Yu, Hao Luo, Fan Wang, Feng Zhao
Abstract: Diffusion probabilistic models (DPMs) have achieved impressive success in visual generation. However, they suffer from slow inference due to iterative sampling. Employing fewer sampling steps is an intuitive solution, but this also introduces discretization error. Existing fast samplers make inspiring efforts to reduce discretization error through the adoption of high-order solvers, potentially reaching a plateau in terms of optimization. This raises the question: can the sampling process be accelerated further? In this paper, we re-examine the nature of sampling errors, discerning that they comprise two distinct elements: the widely recognized discretization error and the less explored approximation error. Our research elucidates the dynamics between these errors and the sampling step through a dual-error disentanglement strategy. Building on these foundations, we introduce a unified and training-free acceleration framework, DualFast, designed to enhance the speed of DPM sampling by concurrently accounting for both error types, thereby minimizing the total sampling error. DualFast is seamlessly compatible with existing samplers and significantly boosts their sampling quality and speed, particularly with extremely few sampling steps. We substantiate the effectiveness of our framework through comprehensive experiments, spanning both unconditional and conditional sampling domains, across both pixel-space and latent-space DPMs.
Authors: George Shaikovski, Eugene Vorontsov, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H. Bernhard, Ran A. Godrich, Juan A. Retamero, Razik Yousfi, Nicolo Fusi, Thomas J. Fuchs, Kristen Severson, Siqi Liu
Abstract: Recent pathology foundation models can provide rich tile-level representations but fall short of delivering general-purpose clinical utility without further extensive model development. These models lack whole-slide image (WSI) understanding and are not trained with large-scale diagnostic data, limiting their performance on diverse downstream tasks. We introduce PRISM2, a multi-modal slide-level foundation model trained via clinical dialogue to enable scalable, generalizable pathology AI. PRISM2 is trained on nearly 700,000 specimens (2.3 million WSIs) paired with real-world clinical diagnostic reports in a two-stage process. In Stage 1, a vision-language model is trained using contrastive and captioning objectives to align whole slide embeddings with textual clinical diagnosis. In Stage 2, the language model is unfrozen to enable diagnostic conversation and extract more clinically meaningful representations from hidden states. PRISM2 achieves strong performance on diagnostic and biomarker prediction tasks, outperforming prior slide-level models including PRISM and TITAN. It also introduces a zero-shot yes/no classification approach that surpasses CLIP-style methods without prompt tuning or class enumeration. By aligning visual features with clinical reasoning, PRISM2 improves generalization on both data-rich and low-sample tasks, offering a scalable path forward for building general pathology AI agents capable of assisting diagnostic and prognostic decisions.
Authors: Xuhui Zhu, Jing Xu, Bingjie Wang, Huikang Dai, Hao Lu
Abstract: Video Individual Counting (VIC) is a recently introduced task that aims to estimate pedestrian flux from a video. It extends conventional Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC that only learns to count repeated pedestrian patterns across frames, the key problem of VIC is how to identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, mainly follow a one-to-one (O2O) matching strategy where the same pedestrian must be exactly matched between frames, leading to sensitivity to appearance variations or missing detections. In this work, we show that the O2O matching could be relaxed to a one-to-many (O2M) matching problem, which better fits the problem nature of VIC and can leverage the social grouping behavior of walking pedestrians. We therefore introduce OMAN, a simple but effective VIC model with implicit One-to-Many mAtchiNg, featuring an implicit context generator and a one-to-many pairwise matcher. Experiments on the SenseCrowd and CroHD benchmarks show that OMAN achieves the state-of-the-art performance. Code is available at \href{https://github.com/tiny-smart/OMAN}{OMAN}.
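A minimal sketch of relaxing one-to-one to one-to-many matching between pedestrian embeddings of two frames, as motivated above; thresholding cosine similarity is an illustrative stand-in for OMAN's learned implicit matcher, and the threshold value is an assumption.

```python
import torch

def one_to_many_match(feat_t0, feat_t1, sim_thresh=0.5):
    """Match each pedestrian in frame t0 to all sufficiently similar ones in t1.

    feat_t0: (N, D), feat_t1: (M, D) pedestrian embeddings.
    Returns a boolean (N, M) matrix; a row with no True entries marks an
    outflow pedestrian, a column with no True entries marks an inflow one.
    """
    a = torch.nn.functional.normalize(feat_t0, dim=1)
    b = torch.nn.functional.normalize(feat_t1, dim=1)
    sim = a @ b.t()                     # (N, M) cosine similarity
    return sim > sim_thresh

matches = one_to_many_match(torch.randn(8, 128), torch.randn(10, 128))
inflow = (~matches.any(dim=0)).sum()    # pedestrians in t1 unmatched to anyone in t0
print(int(inflow))
```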
Authors: Bingxi Liu, Pengju Zhang, Li He, Hao Chen, Shiyi Guo, Yihong Wu, Jinqiang Cui, Hong Zhang
Abstract: Recent visual place recognition (VPR) approaches have leveraged foundation models (FM) and introduced novel aggregation techniques. However, these methods have failed to fully exploit key concepts of FM, such as the effective utilization of extensive training sets, and they have overlooked the potential of classical aggregation methods, such as GeM and NetVLAD. Building on these insights, we revive classical feature aggregation methods and develop more fundamental VPR models, collectively termed SuperPlace. First, we introduce a supervised label alignment method that enables training across various VPR datasets within a unified framework. Second, we propose G$^2$M, a compact feature aggregation method utilizing two GeMs, where one GeM learns the principal components of feature maps along the channel dimension and calibrates the output of the other. Third, we propose the secondary fine-tuning (FT$^2$) strategy for NetVLAD-Linear (NVL). NetVLAD first learns feature vectors in a high-dimensional space and then compresses them into a lower-dimensional space via a single linear layer. Extensive experiments highlight our contributions and demonstrate the superiority of SuperPlace. Specifically, G$^2$M achieves promising results with only one-tenth of the feature dimensions compared to recent methods. Moreover, NVL-FT$^2$ ranks first on the MSLS leaderboard.
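A minimal sketch of a two-GeM aggregation in the spirit of G$^2$M, where one GeM pools a descriptor and the other drives a per-channel calibration; the sigmoid gating and the linear calibration layer are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized-mean pooling over the spatial dimensions."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):                                # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=(-2, -1)).pow(1.0 / self.p)    # (B, C)

class G2MSketch(nn.Module):
    """Two GeMs: one pools a global descriptor, the other produces a
    per-channel calibration signal from the same feature map (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.gem_desc = GeM()
        self.gem_calib = GeM()
        self.calib = nn.Linear(channels, channels)

    def forward(self, feat):                             # feat: (B, C, H, W)
        desc = self.gem_desc(feat)                       # (B, C) descriptor
        gate = torch.sigmoid(self.calib(self.gem_calib(feat)))
        return nn.functional.normalize(desc * gate, dim=1)

print(G2MSketch(256)(torch.randn(2, 256, 16, 16)).shape)  # torch.Size([2, 256])
```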
Authors: Shahram Najam Syed, Ishir Roongta, Kavin Ravie, Gangadhar Nageswar
Abstract: Visual simultaneous localization and mapping (SLAM) must remain accurate under extreme viewpoint, scale and illumination variations. The widely adopted ORB-SLAM3 falters in these regimes because it relies on hand-crafted ORB keypoints. We introduce SuperPoint-SLAM3, a drop-in upgrade that (i) replaces ORB with the self-supervised SuperPoint detector--descriptor, (ii) enforces spatially uniform keypoints via adaptive non-maximal suppression (ANMS), and (iii) integrates a lightweight NetVLAD place-recognition head for learning-based loop closure. On the KITTI Odometry benchmark SuperPoint-SLAM3 reduces mean translational error from 4.15% to 0.34% and mean rotational error from 0.0027 deg/m to 0.0010 deg/m. On the EuRoC MAV dataset it roughly halves both errors across every sequence (e.g., V2\_03: 1.58% -> 0.79%). These gains confirm that fusing modern deep features with a learned loop-closure module markedly improves ORB-SLAM3 accuracy while preserving its real-time operation. Implementation, pretrained weights and reproducibility scripts are available at https://github.com/shahram95/SuperPointSLAM3.
Authors: Yu Wang, Shiwei Chen
Abstract: Weakly supervised video anomaly detection (WS-VAD) is tasked with pinpointing temporal intervals containing anomalous events within untrimmed videos, utilizing only video-level annotations. However, a significant challenge arises due to the absence of dense frame-level annotations, often leading to incomplete localization in existing WS-VAD methods. To address this issue, we present a novel LEC-VAD, Learning Event Completeness for Weakly Supervised Video Anomaly Detection, which features a dual structure designed to encode both category-aware and category-agnostic semantics between vision and language. Within LEC-VAD, we devise semantic regularities that leverage an anomaly-aware Gaussian mixture to learn precise event boundaries, thereby yielding more complete event instances. Besides, we develop a novel memory bank-based prototype learning mechanism to enrich concise text descriptions associated with anomaly-event categories. This innovation bolsters the text's expressiveness, which is crucial for advancing WS-VAD. Our LEC-VAD demonstrates remarkable advancements over the current state-of-the-art methods on two benchmark datasets XD-Violence and UCF-Crime.
Authors: Ziqing Zhou, Binbin Gao, Yuri Pan, Lidong Wang, Wenbing Zhu, Yong Liu, Jun Liu, Mingmin Chi, Dong Wu, Bo Peng, Chengjie Wang
Abstract: Prototype-based reconstruction methods for unsupervised anomaly detection rely on a limited set of learnable prototypes that aggregate insufficient normal information, resulting in undesirable reconstructions. However, increasing the number of prototypes may lead to anomalies being well reconstructed through the attention mechanism, which we refer to as the "Soft Identity Mapping" problem. In this paper, we propose Pro-AD to address these issues and fully utilize the prototypes to boost the performance of anomaly detection. Specifically, we first introduce an expanded set of learnable prototypes to provide sufficient capacity for semantic information. Then we employ a Dynamic Bidirectional Decoder that integrates normal-information aggregation and prototype-based target feature reconstruction, allowing the prototypes to aggregate more comprehensive normal semantic information from different levels of the image features while the reconstruction exploits both contextual information and the learned prototypes. Additionally, to prevent anomalies from being well reconstructed using sufficient semantic information through the attention mechanism, Pro-AD introduces a Prototype-based Constraint applied within the target feature reconstruction process of the decoder, which further improves the performance of our approach. Extensive experiments on multiple challenging benchmarks demonstrate that Pro-AD achieves state-of-the-art performance, highlighting its superior robustness and practical effectiveness for the multi-class unsupervised anomaly detection task.
Authors: Jinguang Tong, Xuesong Li, Fahira Afzal Maken, Sundaram Muthu, Lars Petersson, Chuong Nguyen, Hongdong Li
Abstract: 3D modeling of highly reflective objects remains challenging due to strong view-dependent appearances. While previous SDF-based methods can recover high-quality meshes, they are often time-consuming and tend to produce over-smoothed surfaces. In contrast, 3D Gaussian Splatting (3DGS) offers the advantage of high speed and detailed real-time rendering, but extracting surfaces from the Gaussians can be noisy due to the lack of geometric constraints. To bridge the gap between these approaches, we propose a novel reconstruction method called GS-2DGS for reflective objects based on 2D Gaussian Splatting (2DGS). Our approach combines the rapid rendering capabilities of Gaussian Splatting with additional geometric information from foundation models. Experimental results on synthetic and real datasets demonstrate that our method significantly outperforms Gaussian-based techniques in terms of reconstruction and relighting and achieves performance comparable to SDF-based methods while being an order of magnitude faster. Code is available at https://github.com/hirotong/GS2DGS
Authors: Yuiga Wada, Kazuki Matsuda, Komei Sugiura, Graham Neubig
Abstract: Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.
Authors: Bingxi Liu, Hao Chen, Shiyi Guo, Yihong Wu, Jinqiang Cui, Hong Zhang
Abstract: Visual Place Recognition (VPR) is a scene-oriented image retrieval problem in computer vision in which re-ranking based on local features is commonly employed to improve performance. In robotics, VPR is also referred to as Loop Closure Detection, which emphasizes spatial-temporal verification within a sequence. However, designing local features specifically for VPR is impractical, and relying on motion sequences imposes limitations. Inspired by these observations, we propose a novel, simple re-ranking method that refines global features through a Mixture-of-Features (MoF) approach under embodied constraints. First, we analyze the practical feasibility of embodied constraints in VPR and categorize them according to existing datasets, which include GPS tags, sequential timestamps, local feature matching, and self-similarity matrices. We then propose a learning-based MoF weight-computation approach, utilizing a multi-metric loss function. Experiments demonstrate that our method improves the state-of-the-art (SOTA) performance on public datasets with minimal additional computational overhead. For instance, with only 25 KB of additional parameters and a processing time of 10 microseconds per frame, our method achieves a 0.9\% improvement over a DINOv2-based baseline on the Pitts-30k test set.
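A minimal sketch of re-ranking retrieval candidates with a weighted mixture of several embodied-constraint cues; the softmax-normalized weights and the particular cue list are assumptions standing in for the learned MoF weight computation described above.

```python
import numpy as np

def mixture_of_features_rerank(cue_scores, weights):
    """Re-rank retrieval candidates by mixing several similarity cues.

    cue_scores: (K, N) array, one row per cue (e.g., global descriptor
    similarity, GPS proximity, timestamp consistency, self-similarity),
    each assumed pre-normalized to a comparable range.
    weights: (K,) mixture weights, softmax-normalized here for illustration.
    Returns candidate indices sorted from best to worst.
    """
    w = np.exp(weights) / np.exp(weights).sum()
    fused = w @ cue_scores            # (N,) fused score per candidate
    return np.argsort(-fused)

scores = np.random.rand(3, 100)       # 3 cues, 100 database candidates
print(mixture_of_features_rerank(scores, np.array([1.0, 0.3, 0.5]))[:5])
```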
Authors: Jiamin Wang, Yichen Yao, Xiang Feng, Hang Wu, Yaming Wang, Qingqiu Huang, Yuexin Ma, Xinge Zhu
Abstract: The generation of temporally consistent, high-fidelity driving videos over extended horizons presents a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment due to inadequate decoupling of spatio-temporal dynamics and limited cross-frame feature propagation mechanisms. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto-regressive framework that pioneers hierarchical feature coordination and multi-phase optimization for sustainable video synthesis. To achieve high-quality long-horizon driving video generation, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi-stage training strategy. HTFT enhances temporal consistency between video frames throughout the video generation process by modeling the temporal and denoising processes separately and transferring denoising features between frames. The multi-stage training strategy divides training into three stages, using model decoupling and auto-regressive inference simulation to accelerate model convergence and reduce error accumulation. Experiments on the nuScenes dataset show that STAGE significantly surpasses existing methods on the long-horizon driving video generation task. In addition, we explored STAGE's ability to generate unlimited-length driving videos: we generated 600 frames of high-quality driving video on the nuScenes dataset, far exceeding the maximum length achievable by existing methods.
Authors: Jiashu He, Jiayi He, Shengeng Tang, Huixia Ben, Lechao Cheng, Richang Hong
Abstract: Sign language transition generation seeks to convert discrete sign language segments into continuous sign videos by synthesizing smooth transitions. However, most existing methods merely concatenate isolated signs, resulting in poor visual coherence and semantic accuracy in the generated videos. Unlike textual languages, sign language is inherently rich in spatial-temporal cues, making it more complex to model. To address this, we propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs by capturing the unique spatial-temporal dependencies of sign language. Specifically, we first train an encoder-decoder architecture to learn a structure-aware representation of spatial-temporal skeleton sequences. Next, we optimize a diffusion denoiser conditioned on the representations learned by the pre-trained encoder, which is tasked with predicting transition frames from noise. Additionally, we design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features. Extensive experiments conducted on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the superior performance of our method.
Authors: Ruiguang Pei, Weiqing Sun, Zhihui Fu, Jun Wang
Abstract: Although Large Vision Language Models (LVLMs) have demonstrated remarkable performance in image understanding tasks, their computational efficiency remains a significant challenge, particularly on resource-constrained devices due to the high cost of processing large numbers of visual tokens. Recently, training-free visual token pruning methods have gained popularity as a low-cost solution to this issue. However, existing approaches suffer from two key limitations: semantic saliency-based strategies primarily focus on high cross-attention visual tokens, often neglecting visual diversity, whereas visual diversity-based methods risk inadvertently discarding semantically important tokens, especially under high compression ratios. In this paper, we introduce GreedyPrune, a training-free plug-and-play visual token pruning algorithm designed to jointly optimize semantic saliency and visual diversity. We formalize the token pruning process as a combinatorial optimization problem and demonstrate that greedy algorithms effectively balance computational efficiency with model accuracy. Extensive experiments validate the effectiveness of our approach, showing that GreedyPrune achieves state-of-the-art accuracy across various multimodal tasks and models while significantly reducing end-to-end inference latency.
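A minimal sketch of greedily selecting visual tokens to jointly optimize saliency and diversity, as described above; the specific objective (cross-attention saliency minus a max-similarity redundancy penalty) and the balancing factor are assumptions, not GreedyPrune's exact scoring.

```python
import torch

def greedy_token_select(tokens, saliency, keep, alpha=0.5):
    """Greedily pick `keep` visual tokens balancing saliency and diversity.

    tokens:   (N, D) visual token embeddings.
    saliency: (N,) cross-attention-based importance scores.
    At each step the token maximizing
        alpha * saliency - (1 - alpha) * max_similarity_to_selected
    is added to the kept set.
    """
    sim = torch.nn.functional.normalize(tokens, dim=1)
    sim = sim @ sim.t()                                # (N, N) cosine similarity
    selected = [int(saliency.argmax())]                # seed with the most salient token
    for _ in range(keep - 1):
        redundancy = sim[:, selected].max(dim=1).values
        score = alpha * saliency - (1 - alpha) * redundancy
        score[selected] = float("-inf")                # never re-select a kept token
        selected.append(int(score.argmax()))
    return torch.tensor(selected)

idx = greedy_token_select(torch.randn(576, 1024), torch.rand(576), keep=144)
print(idx.shape)                                       # torch.Size([144])
```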
Authors: Bingxi Liu, An Liu, Hao Chen, Jinqiang Cui, Yiqun Wang, Hong Zhang
Abstract: Point cloud registration (PCR) is a fundamental task in 3D computer vision and robotics. Most existing learning-based PCR methods rely on Transformers, which suffer from quadratic computational complexity. This limitation restricts the resolution of point clouds that can be processed, inevitably leading to information loss. In contrast, Mamba, a recently proposed model based on state space models (SSMs), achieves linear computational complexity while maintaining strong long-range contextual modeling capabilities. However, directly applying Mamba to PCR tasks yields suboptimal performance due to the unordered and irregular nature of point cloud data. To address this challenge, we propose MT-PCR, the first point cloud registration framework that integrates both Mamba and Transformer modules. Specifically, we serialize point cloud features using Z-order space-filling curves to enforce spatial locality, enabling Mamba to better model the geometric structure of the input. Additionally, we remove the order indicator module commonly used in Mamba-based sequence modeling, which leads to improved performance in our setting. The serialized features are then processed by an optimized Mamba encoder, followed by a Transformer refinement stage. Extensive experiments on multiple benchmarks demonstrate that MT-PCR outperforms Transformer-based and concurrent state-of-the-art methods in both accuracy and efficiency, while significantly reducing GPU memory usage and FLOPs.
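A minimal sketch of Z-order (Morton) serialization of point coordinates, the kind of space-filling-curve ordering mentioned above; the 10-bit quantization and min-max normalization are assumptions for illustration.

```python
import numpy as np

def morton_encode(xyz, bits=10):
    """Interleave the bits of quantized x, y, z into a Z-order (Morton) code.

    xyz: (N, 3) float coordinates, normalized per axis to [0, 2^bits - 1]
    (an illustrative quantization) and bit-interleaved so that spatially
    nearby points tend to receive nearby codes.
    """
    mins, maxs = xyz.min(0), xyz.max(0)
    q = ((xyz - mins) / np.maximum(maxs - mins, 1e-9) * (2**bits - 1)).astype(np.int64)
    codes = np.zeros(len(xyz), dtype=np.int64)
    for b in range(bits):
        codes |= ((q[:, 0] >> b) & 1) << (3 * b)
        codes |= ((q[:, 1] >> b) & 1) << (3 * b + 1)
        codes |= ((q[:, 2] >> b) & 1) << (3 * b + 2)
    return codes

points = np.random.rand(2048, 3).astype(np.float32)
order = np.argsort(morton_encode(points))   # spatially local ordering for a sequence encoder
serialized = points[order]
```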
Authors: Wenfeng Jia, Bin Liang, Yuxi Liu, Muhammad Arif Khan, Lihong Zheng
Abstract: Flooding remains a major global challenge, worsened by climate change and urbanization, demanding advanced solutions for effective disaster management. While traditional 2D flood mapping techniques provide limited insights, 3D flood mapping, powered by deep learning (DL), offers enhanced capabilities by integrating flood extent and depth. This paper presents a comprehensive survey of deep learning-based 3D flood mapping, emphasizing its advancements over 2D maps by integrating flood extent and depth for effective disaster management and urban planning. The survey categorizes deep learning techniques into task decomposition and end-to-end approaches, applicable to both static and dynamic flood features. We compare key DL architectures, highlighting their respective roles in enhancing prediction accuracy and computational efficiency. Additionally, this work explores diverse data sources such as digital elevation models, satellite imagery, rainfall, and simulated data, outlining their roles in 3D flood mapping. The applications reviewed range from real-time flood prediction to long-term urban planning and risk assessment. However, significant challenges persist, including data scarcity, model interpretability, and integration with traditional hydrodynamic models. This survey concludes by suggesting future directions to address these limitations, focusing on enhanced datasets, improved models, and policy implications for flood management. This survey aims to guide researchers and practitioners in leveraging DL techniques for more robust and reliable 3D flood mapping, fostering improved flood management strategies.
Authors: Zhenlong Yuan, Dapeng Zhang, Zehao Li, Chengxuan Qian, Jianing Chen, Yinda Chen, Kehua Chen, Tianlu Mao, Zhaoxin Li, Hao Jiang, Zhaoqi Wang
Abstract: Recently, patch deformation-based methods have demonstrated significant effectiveness in multi-view stereo due to their incorporation of deformable and expandable perception for reconstructing textureless areas. However, these methods generally focus on identifying reliable pixel correlations to mitigate matching ambiguity of patch deformation, while neglecting the deformation instability caused by edge-skipping and visibility occlusions, which may cause potential estimation deviations. To address these issues, we propose DVP-MVS++, an innovative approach that synergizes both depth-normal-edge aligned and harmonized cross-view priors for robust and visibility-aware patch deformation. Specifically, to avoid edge-skipping, we first apply DepthPro, Metric3Dv2 and Roberts operator to generate coarse depth maps, normal maps and edge maps, respectively. These maps are then aligned via an erosion-dilation strategy to produce fine-grained homogeneous boundaries for facilitating robust patch deformation. Moreover, we reformulate view selection weights as visibility maps, and then implement both an enhanced cross-view depth reprojection and an area-maximization strategy to help reliably restore visible areas and effectively balance deformed patch, thus acquiring harmonized cross-view priors for visibility-aware patch deformation. Additionally, we obtain geometry consistency by adopting both aggregated normals via view selection and projection depth differences via epipolar lines, and then employ SHIQ for highlight correction to enable geometry consistency with highlight-aware perception, thus improving reconstruction quality during propagation and refinement stage. Evaluation results on ETH3D, Tanks & Temples and Strecha datasets exhibit the state-of-the-art performance and robust generalization capability of our proposed method.
Authors: Jinfeng Xu, Xianzhi Li, Yuan Tang, Xu Han, Qiao Yu, Yixue Hao, Long Hu, Min Chen
Abstract: Recent advancements in deep learning have greatly enhanced 3D object recognition, but most models are limited to closed-set scenarios, unable to handle unknown samples in real-world applications. Open-set recognition (OSR) addresses this limitation by enabling models to both classify known classes and identify novel classes. However, current OSR methods rely on global features to differentiate known and unknown classes, treating the entire object uniformly and overlooking the varying semantic importance of its different parts. To address this gap, we propose Salience-Aware Structured Separation (SASep), which includes (i) a tunable semantic decomposition (TSD) module to semantically decompose objects into important and unimportant parts, (ii) a geometric synthesis strategy (GSS) to generate pseudo-unknown objects by combining these unimportant parts, and (iii) a synth-aided margin separation (SMS) module to enhance feature-level separation by expanding the feature distributions between classes. Together, these components improve both geometric and feature representations, enhancing the model's ability to effectively distinguish known and unknown classes. Experimental results show that SASep achieves superior performance in 3D OSR, outperforming existing state-of-the-art methods.
Authors: Jiashu Dai, Along Wang, Binfan Ni, Tao Cao
Abstract: Facial texture generation is crucial for high-fidelity 3D face reconstruction from a single image. However, existing methods struggle to generate UV albedo maps with high-frequency details. To address this challenge, we propose a novel end-to-end coarse-to-fine approach for UV albedo map generation. Our method first utilizes a UV Albedo Parametric Model (UVAPM), driven by low-dimensional coefficients, to generate coarse albedo maps with skin tones and low-frequency texture details. To capture high-frequency details, we train a detail generator using a decoupled albedo map dataset, producing high-resolution albedo maps. Extensive experiments demonstrate that our method can generate high-fidelity textures from a single image, outperforming existing methods in terms of texture quality and realism. The code and pre-trained model are publicly available at https://github.com/MVIC-DAI/UVAPM, facilitating reproducibility and further research.
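A minimal sketch of how a low-dimensional parametric albedo model could produce a coarse UV albedo map; the linear mean-plus-basis form (e.g., PCA-style components) is an assumption about the UVAPM, and all tensor names and sizes are illustrative.

```python
import torch

def coarse_albedo(mean_albedo, basis, coeffs):
    """Generate a coarse UV albedo map from low-dimensional coefficients.

    mean_albedo: (3, H, W) mean UV albedo.
    basis:       (K, 3, H, W) albedo basis components.
    coeffs:      (K,) low-dimensional coefficients driving the model.
    """
    return mean_albedo + torch.einsum("k,kchw->chw", coeffs, basis)

albedo = coarse_albedo(torch.rand(3, 256, 256),
                       torch.randn(64, 3, 256, 256) * 0.01,
                       torch.randn(64))
print(albedo.shape)        # torch.Size([3, 256, 256])
```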
Authors: Yining Shi, Kun Jiang, Qiang Meng, Ke Wang, Jiabao Wang, Wenchao Sun, Tuopu Wen, Mengmeng Yang, Diange Yang
Abstract: World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolution (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego-motion by leveraging scene-centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene-centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego-irrelevant, spatially consistent future features through a scene-centric prediction branch, which are then converted into scene conditions using a tailored ControlNet. These condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the nuScenes-Occ3D dataset show that COME achieves consistent and significant improvements over state-of-the-art (SOTA) methods across diverse configurations, including different input sources (ground-truth, camera-based, fusion-based occupancy) and prediction horizons (3s and 8s). For example, under the same settings, COME achieves a 26.3% better mIoU metric than DOME and a 23.7% better mIoU metric than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio-temporal prediction fidelity for world models. Code and videos will be available at https://github.com/synsin0/COME.
Authors: Rohit Mohan, Julia Hindel, Florian Drews, Claudius Gl\"aser, Daniele Cattaneo, Abhinav Valada
Abstract: Autonomous vehicles that navigate in open-world environments may encounter previously unseen object classes. However, most existing LiDAR panoptic segmentation models rely on closed-set assumptions, failing to detect unknown object instances. In this work, we propose ULOPS, an uncertainty-guided open-set panoptic segmentation framework that leverages Dirichlet-based evidential learning to model predictive uncertainty. Our architecture incorporates separate decoders for semantic segmentation with uncertainty estimation, embedding with prototype association, and instance center prediction. During inference, we leverage uncertainty estimates to identify and segment unknown instances. To strengthen the model's ability to differentiate between known and unknown objects, we introduce three uncertainty-driven loss functions: a Uniform Evidence Loss that encourages high uncertainty in unknown regions, an Adaptive Uncertainty Separation Loss that ensures a consistent difference in uncertainty estimates between known and unknown objects at a global scale, and a Contrastive Uncertainty Loss that refines this separation at the fine-grained level. To evaluate open-set performance, we extend benchmark settings on KITTI-360 and introduce a new open-set evaluation for nuScenes. Extensive experiments demonstrate that ULOPS consistently outperforms existing open-set LiDAR panoptic segmentation methods.
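A minimal sketch of a loss that encourages high predictive uncertainty in regions assumed unknown, in the spirit of the Uniform Evidence Loss described above; with Dirichlet parameters alpha = evidence + 1, driving evidence toward zero pushes the prediction toward the uniform Dirichlet. Penalizing the squared evidence magnitude is an illustrative surrogate, not the paper's exact loss, and the tensor shapes are assumptions.

```python
import torch

def uniform_evidence_loss(evidence, unknown_mask):
    """Penalize evidence magnitude in unknown regions to raise uncertainty.

    evidence:     (B, K, N) non-negative per-class evidence per point
                  (e.g., softplus of segmentation logits).
    unknown_mask: (B, N) boolean mask of regions treated as unknown/void.
    """
    masked = evidence * unknown_mask.unsqueeze(1)
    return masked.pow(2).sum() / unknown_mask.sum().clamp(min=1)

ev = torch.rand(2, 20, 4096)
mask = torch.rand(2, 4096) > 0.9
print(uniform_evidence_loss(ev, mask))
```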
Authors: Daichi Tanaka, Takumi Karasawa, Shu Takenouchi, Rei Kawakami
Abstract: Recycling steel scrap can reduce carbon dioxide (CO2) emissions from the steel industry. However, a significant challenge in steel scrap recycling is the inclusion of impurities other than steel. To address this issue, we propose vision-language-model-based anomaly detection in which a model is finetuned in a supervised manner, enabling it to handle niche objects effectively. This model enables automated detection of anomalies at a fine-grained level within steel scrap. Specifically, we finetune the image encoder, equipped with a multi-scale mechanism and text prompts aligned with both normal and anomalous images. The finetuning process trains these modules using multiclass classification as the supervision.
Authors: Roman Flepp, Leon Nissen, Bastian Sigrist, Arend Nieuwland, Nicola Cavalcanti, Philipp F\"urnstahl, Thomas Dreher, Lilian Calvet
Abstract: Purpose: Accurate intraoperative X-ray/CT registration is essential for surgical navigation in orthopedic procedures. However, existing methods struggle to consistently achieve sub-millimeter accuracy and robustness under broad initial pose estimates, or require manual key-point annotations. This work aims to address these challenges by proposing a novel multi-view X-ray/CT registration method for intraoperative bone registration. Methods: The proposed registration method consists of a multi-view, contour-based iterative closest point (ICP) optimization. Unlike previous methods, which attempt to match bone contours across the entire silhouette in both imaging modalities, we focus on matching specific subcategories of contours corresponding to bone substructures. This leads to reduced ambiguity in the ICP matches, resulting in a more robust and accurate registration solution. This approach requires only two X-ray images and operates fully automatically. Additionally, we contribute a dataset of 5 cadaveric specimens, including real X-ray images, X-ray image poses, and the corresponding CT scans. Results: The proposed registration method is evaluated on real X-ray images using mean reprojection error (mRPD). The method consistently achieves sub-millimeter accuracy, with an mRPD of 0.67 mm compared to 5.35 mm for a commercial solution requiring manual intervention. Furthermore, the method offers improved practical applicability, being fully automatic. Conclusion: Our method offers a practical, accurate, and efficient solution for multi-view X-ray/CT registration in orthopedic surgeries, which can be easily combined with tracking systems. By improving registration accuracy and minimizing manual intervention, it enhances intraoperative navigation, contributing to more accurate and effective surgical outcomes in computer-assisted surgery (CAS).
Authors: Jeonghoon Park, Juyoung Lee, Chaeyeon Chung, Jaeseong Lee, Jaegul Choo, Jindong Gu
Abstract: Recent advancements in diffusion-based text-to-image (T2I) models have enabled the generation of high-quality and photorealistic images from text descriptions. However, they often exhibit societal biases related to gender, race, and socioeconomic status, thereby reinforcing harmful stereotypes and shaping public perception in unintended ways. While existing bias mitigation methods demonstrate effectiveness, they often encounter attribute entanglement, where adjustments to attributes relevant to the bias (i.e., target attributes) unintentionally alter attributes unassociated with the bias (i.e., non-target attributes), causing undesirable distribution shifts. To address this challenge, we introduce Entanglement-Free Attention (EFA), a method that accurately incorporates target attributes (e.g., White, Black, Asian, and Indian) while preserving non-target attributes (e.g., background details) during bias mitigation. At inference time, EFA randomly samples a target attribute with equal probability and adjusts the cross-attention in selected layers to incorporate the sampled attribute, achieving a fair distribution of target attributes. Extensive experiments demonstrate that EFA outperforms existing methods in mitigating bias while preserving non-target attributes, thereby maintaining the output distribution and generation capability of the original model.
Authors: Biao Yang, Muqi Huang, Yuhui Zhang, Yun Xiong, Kun Zhou, Xi Chen, Shiyang Zhou, Huishuai Bao, Chuan Li, Feng Shi, Hualei Liu
Abstract: Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient in their processing or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the inherent latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Specifically, we reutilize the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with user-friendly interaction. Our results demonstrate performance that surpasses most state-of-the-art methods with significantly faster speeds, showing a more efficient and semantically coherent solution for point-based image editing tasks.
Authors: Sol\`ene Debuys\`ere, Nicolas Trouv\'e, Nathan Letheule, Olivier L\'ev\^eque, Elise Colin
Abstract: This work investigates the adaptation of large pre-trained latent diffusion models to a radically new imaging domain: Synthetic Aperture Radar (SAR). While these generative models, originally trained on natural images, demonstrate impressive capabilities in text-to-image synthesis, they are not natively adapted to represent SAR data, which involves different physics, statistical distributions, and visual characteristics. Using a sizeable SAR dataset (on the order of 100,000 to 1 million images), we address the fundamental question of fine-tuning such models for this unseen modality. We explore and compare multiple fine-tuning strategies, including full model fine-tuning and parameter-efficient approaches like Low-Rank Adaptation (LoRA), focusing separately on the UNet diffusion backbone and the text encoder components. To evaluate generative quality, we combine several metrics: statistical distance from real SAR distributions, textural similarity via GLCM descriptors, and semantic alignment assessed with a CLIP model fine-tuned on SAR data. Our results show that a hybrid tuning strategy yields the best performance: full fine-tuning of the UNet is better at capturing low-level SAR-specific patterns, while LoRA-based partial tuning of the text encoder, combined with embedding learning of the
Authors: Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He
Abstract: We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization accuracy and simultaneously identifies sound sources within video frames. To support this task, we introduce a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 by removing non-essential vocalization subsets. Extensive experiments confirm the effectiveness of our approach on $Audible623$ and show strong generalizability to other domains, such as repetitive counting and sound source localization. Code and dataset are available at https://github.com/WenlongWan/Audible623.
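A minimal sketch of estimating inflectional flow as the second temporal derivative of motion, the idea underlying $TA^{2}Net$ as described above; using finite differences of a per-pixel motion magnitude and picking the peak frame is an illustrative approximation, not the paper's learned estimator.

```python
import torch

def inflectional_flow(motion, dt=1.0):
    """Score frames by the second temporal derivative of motion magnitude.

    motion: (T, H, W) per-frame motion magnitude (e.g., optical-flow norm).
    Abrupt changes such as collisions show up as large second differences;
    peaks over time are candidate audible-action timings.
    """
    accel = (motion[2:] - 2 * motion[1:-1] + motion[:-2]) / (dt * dt)  # (T-2, H, W)
    return accel.abs().mean(dim=(1, 2))                                # per-frame score

scores = inflectional_flow(torch.rand(64, 56, 56))
collision_frame = int(scores.argmax()) + 1   # +1 accounts for the central-difference offset
print(collision_frame)
```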
Authors: Weijia Feng, Yichen Zhu, Ruojia Zhang, Chenyang Wang, Fei Ma, Xiaobao Wang, Xiaobai Li
Abstract: Owing to its rapid progress and broad application prospects, few-shot action recognition has attracted considerable interest. However, current methods are predominantly based on limited single-modal data, which does not fully exploit the potential of multimodal information. This paper presents a novel framework that actively identifies reliable modalities for each sample using task-specific contextual cues, thus significantly improving recognition performance. Our framework integrates an Active Sample Inference (ASI) module, which utilizes active inference to predict reliable modalities based on posterior distributions and subsequently organizes them accordingly. Unlike reinforcement learning, active inference replaces rewards with evidence-based preferences, making more stable predictions. Additionally, we introduce an active mutual distillation module that enhances the representation learning of less reliable modalities by transferring knowledge from more reliable ones. Adaptive multimodal inference is employed during the meta-test to assign higher weights to reliable modalities. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing approaches.
Authors: Bo Pan, Yixiao Fu, Ke Wang, Junyu Lu, Lunke Pan, Ziyang Qian, Yuhan Chen, Guoliang Wang, Yitao Zhou, Li Zheng, Yinghao Tang, Zhen Wen, Yuchen Wu, Junhua Lu, Biao Zhu, Minfeng Zhu, Bo Zhang, Wei Chen
Abstract: Data visualization generation using Large Language Models (LLMs) has shown promising results but often produces suboptimal visualizations that require human intervention for improvement. In this work, we introduce VIS-Shepherd, a specialized Multimodal Large Language Model (MLLM)-based critic to evaluate and provide feedback for LLM-generated data visualizations. At the core of our approach is a framework to construct a high-quality visualization critique dataset, where we collect human-created visualization instances, synthesize corresponding LLM-generated instances, and construct high-quality critiques. We conduct both model-based automatic evaluation and human preference studies to evaluate the effectiveness of our approach. Our experiments show that even small (7B parameters) open-source MLLM models achieve substantial performance gains by leveraging our high-quality visualization critique dataset, reaching levels comparable to much larger open-source or even proprietary models. Our work demonstrates significant potential for MLLM-based automated visualization critique and indicates promising directions for enhancing LLM-based data visualization generation. Our project page: https://github.com/bopan3/VIS-Shepherd.
Authors: Andrea Bergamaschi, Abhinav Verma, Avik Bhattacharya, Fabio Dell'Acqua
Abstract: Multi-polarized Synthetic Aperture Radar (SAR) technology has gained increasing attention in agriculture, offering unique capabilities for monitoring vegetation dynamics thanks to its all-weather, day-and-night operation and high revisit frequency. This study presents, for the first time, a comprehensive analysis combining the dual-polarimetric radar vegetation index (DpRVI) with optical indices to characterize vineyard crops. Vineyards exhibit distinct non-isotropic scattering behavior due to their pronounced row orientation, making them particularly challenging and interesting targets for remote sensing. The research further investigates the relationship between DpRVI and optical vegetation indices, showing that the two provide complementary information: their low correlation suggests that they capture distinct vineyard features. Key findings reveal a parabolic trend in DpRVI over the growing season, potentially linked to biomass dynamics estimated via the Winkler Index. Unlike optical indices reflecting vegetation greenness, DpRVI appears more directly related to biomass growth, aligning with specific phenological phases. Preliminary results also highlight the potential of DpRVI for distinguishing vineyards from other crops. This research aligns with the objectives of the PNRR-NODES project, which promotes nature-based solutions (NbS) for sustainable vineyard management. The application of DpRVI for monitoring vineyards contributes to integrating remote sensing techniques into the broader set of strategies for climate-related change adaptation and risk reduction, emphasizing the role of innovative SAR-based monitoring in sustainable agriculture.
Authors: Gabriel A. Carneiro, Thierry J. Aubry, Ant\'onio Cunha, Petia Radeva, Joaquim Sousa
Abstract: Grapevine varieties are essential for the economies of many wine-producing countries, influencing the production of wine, juice, and the consumption of fruits and leaves. Traditional identification methods, such as ampelography and molecular analysis, have limitations: ampelography depends on expert knowledge and is inherently subjective, while molecular methods are costly and time-intensive. To address these limitations, recent studies have applied deep learning (DL) models to classify grapevine varieties using image data. However, due to the small dataset sizes, these methods often depend on transfer learning from datasets in other domains, e.g., ImageNet1K (IN1K), which can lead to performance degradation due to domain shift and supervision collapse. In this context, self-supervised learning (SSL) methods can be a good tool to avoid this performance degradation, since they can learn directly from data, without external labels. This study presents an evaluation of Masked Autoencoders (MAEs) for identifying grapevine varieties based on field-acquired images. The main contributions of this study include two benchmarks comprising 43 grapevine varieties collected across different seasons, an analysis of MAE's application in the agricultural context, and a performance comparison of trained models across seasons. Our results show that a ViT-B/16 model pre-trained with MAE and the unlabeled dataset achieved an F1 score of 0.7956, outperforming all other models. Additionally, we observed that pre-trained models benefit from long pre-training, perform well under low-data training regimes, and that simple data augmentation methods are more effective than complex ones. The study also found that the mask ratio in MAE impacts performance only marginally.
Authors: Yan Chen, Hanlin Shang, Ce Liu, Yuxuan Chen, Hui Li, Weihao Yuan, Hao Zhu, Zilong Dong, Siyu Zhu
Abstract: Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting intensive image priors, pretrained on high-quality images, to video restoration while addressing the critical challenge of flicker artifacts. The source code has been open-sourced and is available at https://github.com/fudan-generative-vision/DicFace.
Authors: Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren
Abstract: This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into monocular metric depth estimation (MMDE) and monocular relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but their uncertain scale hinders downstream applications. To this end, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods used language as input and estimated two factors for conducting rescaling. Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level. Features from two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance so that the model learns intrinsic knowledge aligned with the scale distribution. TR2M exploits only a small number of trainable parameters for training on datasets in various domains, and experiments not only demonstrate TR2M's strong performance on seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the great potential of pixel-wise relative-to-metric depth transfer with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
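To make the pixel-level transfer concrete, here is a minimal PyTorch sketch of applying two predicted rescale maps (a per-pixel scale and shift) to a relative depth map; the function and tensor names are illustrative assumptions, not TR2M's actual interface.

```python
import torch

def apply_rescale_maps(relative_depth, scale_map, shift_map):
    """Convert relative depth to metric depth with per-pixel rescale maps.

    relative_depth, scale_map, shift_map: tensors of shape (B, 1, H, W).
    In TR2M the two maps would come from the cross-modal network; here they
    are plain inputs to illustrate the pixel-level transfer itself.
    """
    return scale_map * relative_depth + shift_map

# Toy usage with random tensors standing in for network outputs.
rel = torch.rand(1, 1, 8, 8)            # relative depth in [0, 1]
scale = torch.full((1, 1, 8, 8), 10.0)  # per-pixel scale (metres per unit)
shift = torch.full((1, 1, 8, 8), 0.5)   # per-pixel offset (metres)
metric = apply_rescale_maps(rel, scale, shift)
print(metric.shape, metric.min().item(), metric.max().item())
```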
Authors: Zhen Wang, Hongyi Liu, Zhihui Wei
Abstract: Diffusion models have achieved remarkable success in imaging inverse problems owing to their powerful generative capabilities. However, existing approaches typically rely on models trained for specific degradation types, limiting their generalizability to various degradation scenarios. To address this limitation, we propose a zero-shot framework capable of handling various imaging inverse problems without model retraining. We introduce a likelihood-guided noise refinement mechanism that derives a closed-form approximation of the likelihood score, simplifying score estimation and avoiding expensive gradient computations. This estimated score is subsequently utilized to refine the model-predicted noise, thereby better aligning the restoration process with the generative framework of diffusion models. In addition, we integrate the Denoising Diffusion Implicit Models (DDIM) sampling strategy to further improve inference efficiency. The proposed mechanism can be applied to both optimization-based and sampling-based schemes, providing an effective and flexible zero-shot solution for imaging inverse problems. Extensive experiments demonstrate that our method achieves superior performance across multiple inverse problems, particularly in compressive sensing, delivering high-quality reconstructions even at an extremely low sampling rate (5%).
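The general flavour of a closed-form likelihood score for a linear degradation can be sketched as follows; this is a simplified, hedged illustration (assuming y = A x + n with known A and noise level), not the paper's exact derivation or update rule.

```python
import torch

def refine_noise(eps_pred, x0_pred, y, A, sigma_y=0.05, step=1.0):
    """Refine model-predicted noise with an approximate likelihood score.

    Assumes a linear degradation y = A @ x + n. The likelihood score w.r.t.
    the current clean estimate is approximated in closed form as
    A^T (y - A x0_pred) / sigma_y^2, avoiding gradients through the network.
    Shapes: x0_pred, eps_pred are flattened signals (N,), A is (M, N), y is (M,).
    """
    residual = y - A @ x0_pred
    likelihood_score = A.t() @ residual / (sigma_y ** 2)
    # Nudging the predicted noise against the likelihood score pulls the
    # denoising trajectory toward data consistency with the measurements.
    return eps_pred - step * likelihood_score

# Toy compressive-sensing style example (5% sampling of a 100-dim signal).
torch.manual_seed(0)
N, M = 100, 5
A = torch.randn(M, N) / M ** 0.5
x_true = torch.randn(N)
y = A @ x_true
x0_pred = x_true + 0.3 * torch.randn(N)   # stand-in for the diffusion estimate
eps_pred = torch.randn(N)                 # stand-in for predicted noise
eps_refined = refine_noise(eps_pred, x0_pred, y, A)
print(eps_refined.shape)
```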
Authors: Tristan Kenneweg, Philip Kenneweg, Barbara Hammer
Abstract: Predicting mortality-related outcomes from images offers the prospect of accessible, noninvasive, and scalable health screening. We present a method that leverages pretrained vision transformer foundation models to estimate remaining lifespan from facial and whole-body images, alongside robust uncertainty quantification. We show that predictive uncertainty varies systematically with the true remaining lifespan, and that this uncertainty can be effectively modeled by learning a Gaussian distribution for each sample. Our approach achieves state-of-the-art mean absolute error (MAE) of 7.48 years on an established dataset, and further improves to 4.79 and 5.07 years MAE on two new, higher-quality datasets curated and published in this work. Importantly, our models provide well-calibrated uncertainty estimates, as demonstrated by a bucketed expected calibration error of 0.62 years. While not intended for clinical deployment, these results highlight the potential of extracting medically relevant signals from images. We make all code and datasets available to facilitate further research.
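Learning a per-sample Gaussian over the prediction can be done with the standard Gaussian negative log-likelihood; below is a minimal sketch with a hypothetical two-output head on frozen foundation-model features (names and dimensions are assumptions, not the paper's architecture).

```python
import torch
import torch.nn as nn

class LifespanHead(nn.Module):
    """Illustrative head predicting a mean and a variance per sample."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.mean = nn.Linear(feat_dim, 1)
        self.log_var = nn.Linear(feat_dim, 1)  # predict log-variance for stability

    def forward(self, feats):
        return self.mean(feats), self.log_var(feats).exp()

head = LifespanHead()
criterion = nn.GaussianNLLLoss()          # per-sample Gaussian likelihood

feats = torch.randn(16, 768)              # stand-in for ViT embeddings
target = torch.rand(16, 1) * 60.0         # remaining lifespan in years (toy)
mean, var = head(feats)
loss = criterion(mean, target, var)
loss.backward()
print(loss.item())
```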
Authors: Shenqi Wang, Yingfu Xu, Amirreza Yousefzadeh, Sherif Eissa, Henk Corporaal, Federico Corradi, Guangzhi Tang
Abstract: Leveraging their high temporal resolution and dynamic range, object detection with event cameras can enhance the performance and safety of automotive and robotics applications in real-world scenarios. However, processing sparse event data requires compute-intensive convolutional recurrent units, complicating their integration into resource-constrained edge applications. Here, we propose the Sparse Event-based Efficient Detector (SEED) for efficient event-based object detection on neuromorphic processors. We introduce sparse convolutional recurrent learning, which achieves over 92% activation sparsity in recurrent processing, vastly reducing the cost of spatiotemporal reasoning on sparse event data. We validated our method on Prophesee's 1 Mpx and Gen1 event-based object detection datasets. Notably, SEED sets a new benchmark in computational efficiency for event-based object detection, which requires long-term temporal learning. Compared to state-of-the-art methods, SEED significantly reduces synaptic operations while delivering higher or comparable mAP. Our hardware simulations showcase the critical role of SEED's hardware-aware design in achieving energy-efficient and low-latency neuromorphic processing.
Authors: Laiyan Ding, Hualie Jiang, Jiwei Chen, Rui Huang
Abstract: Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires groundtruth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as inputs, design a new depth consistency loss, propose a scale-recovery module, and finally obtain a large performance boost. Furthermore, since the ToF signal sparsity varies in real-world applications, we upgrade SelfToF to SelfToF* with submanifold convolution and guided feature fusion. Consequently, SelfToF* maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code is available at \href{https://github.com/denyingmxd/selftof}{https://github.com/denyingmxd/selftof}.
Authors: Waqar Tanveer, Laura Fern\'andez-Robles, Eduardo Fidalgo, V\'ictor Gonz\'alez-Castro, Enrique Alegre
Abstract: Facial age estimation has achieved considerable success under controlled conditions. However, in unconstrained real-world scenarios, often referred to as 'in the wild', age estimation remains challenging, especially when faces are partially occluded. To address this limitation, we propose a new approach integrating generative adversarial networks (GANs) and transformer architectures to enable robust age estimation from occluded faces. We employ an SN-Patch GAN to effectively remove occlusions, while an Attentive Residual Convolution Module (ARCM), paired with a Swin Transformer, enhances feature representation. Additionally, we introduce a Multi-Task Age Head (MTAH) that combines regression and distribution learning, further improving age estimation under occlusion. Experimental results on the FG-NET, UTKFace, and MORPH datasets demonstrate that our proposed approach surpasses existing state-of-the-art techniques for occluded facial age estimation by achieving an MAE of $3.00$, $4.54$, and $2.53$ years, respectively.
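A head that combines regression with distribution learning can be sketched as follows; the bin count, the Gaussian label-smoothing, and the loss weighting are assumptions for illustration, not the MTAH's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAgeHead(nn.Module):
    """Joint distribution-learning + regression head (illustrative sketch)."""
    def __init__(self, feat_dim=1024, max_age=100):
        super().__init__()
        self.dist = nn.Linear(feat_dim, max_age + 1)   # logits over age bins
        self.reg = nn.Linear(feat_dim, 1)              # direct age regression
        self.register_buffer("ages", torch.arange(max_age + 1).float())

    def forward(self, feats):
        probs = F.softmax(self.dist(feats), dim=-1)
        expected_age = (probs * self.ages).sum(dim=-1, keepdim=True)
        return probs, expected_age, self.reg(feats)

def age_loss(probs, expected_age, reg_age, target_age, sigma=2.0):
    # Soft Gaussian label distribution centred on the ground-truth age.
    ages = torch.arange(probs.size(-1), device=probs.device).float()
    soft = torch.exp(-(ages - target_age) ** 2 / (2 * sigma ** 2))
    soft = soft / soft.sum(dim=-1, keepdim=True)
    kl = F.kl_div(probs.clamp_min(1e-8).log(), soft, reduction="batchmean")
    l1 = F.l1_loss(expected_age, target_age) + F.l1_loss(reg_age, target_age)
    return kl + l1

feats = torch.randn(8, 1024)                 # stand-in for backbone features
target = torch.randint(1, 90, (8, 1)).float()
head = MultiTaskAgeHead()
loss = age_loss(*head(feats), target)
loss.backward()
```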
Authors: Momir Ad\v{z}emovi\'c
Abstract: Multi-object tracking (MOT) is a core task in computer vision that involves detecting objects in video frames and associating them across time. The rise of deep learning has significantly advanced MOT, particularly within the tracking-by-detection paradigm, which remains the dominant approach. Advancements in modern deep learning-based methods accelerated in 2022 with the introduction of ByteTrack for tracking-by-detection and MOTR for end-to-end tracking. Our survey provides an in-depth analysis of deep learning-based MOT methods, systematically categorizing tracking-by-detection approaches into five groups: joint detection and embedding, heuristic-based, motion-based, affinity learning, and offline methods. In addition, we examine end-to-end tracking methods and compare them with existing alternative approaches. We evaluate the performance of recent trackers across multiple benchmarks and specifically assess their generality by comparing results across different domains. Our findings indicate that heuristic-based methods achieve state-of-the-art results on densely populated datasets with linear object motion, while deep learning-based association methods, in both tracking-by-detection and end-to-end approaches, excel in scenarios with complex motion patterns.
Authors: Cristina Mahanta, Gagan Bhatia
Abstract: Recognising human activity in a single photo enables indexing, safety, and assistive applications, yet a still image lacks motion cues. Using 285 MSCOCO images labelled as walking, running, sitting, and standing, CNNs trained from scratch scored 41% accuracy. Fine-tuning multimodal CLIP raised this to 76%, demonstrating that contrastive vision-language pre-training decisively improves still-image action recognition in real-world deployments.
Authors: Zerui Gong, Zhonghua Wu, Qingyi Tao, Qinyue Li, Chen Change Loy
Abstract: Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUT, which preserve structure but lack local adaptability. To bridge this gap, we propose Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability. SA-LUT features: (1) a Style-guided 4D LUT Generator that extracts multi-scale features from the style image to predict a 4D LUT, and (2) a Context Generator using content-style cross-attention to produce a context map. This context map enables spatially-adaptive adjustments, allowing our 4D LUT to apply precise color transformations while preserving structural integrity. To establish a rigorous evaluation framework for photorealistic style transfer, we introduce PST50, the first benchmark specifically designed for PST assessment. Experiments demonstrate that SA-LUT substantially outperforms state-of-the-art methods, achieving a 66.7% reduction in LPIPS score compared to 3D LUT approaches, while maintaining real-time performance at 16 FPS for video stylization. Our code and benchmark are available at https://github.com/Ry3nG/SA-LUT
Authors: Xiem HoangVan, Dang Bui Dinh, Thanh Nguyen Canh, Van-Truong Nguyen
Abstract: Printed Circuit Boards (PCBs) are critical components in modern electronics, which require stringent quality control to ensure proper functionality. However, the detection of defects in small-scale PCB images poses significant challenges as a result of the low resolution of the captured images, leading to potential confusion between defects and noise. To overcome these challenges, this paper proposes a novel framework, named ESRPCB (edge-guided super-resolution for PCB defect detection), which combines edge-guided super-resolution with ensemble learning to enhance PCB defect detection. The framework leverages the edge information to guide the EDSR (Enhanced Deep Super-Resolution) model with a novel ResCat (Residual Concatenation) structure, enabling it to reconstruct high-resolution images from small PCB inputs. By incorporating edge features, the super-resolution process preserves critical structural details, ensuring that tiny defects remain distinguishable in the enhanced image. Following this, a multi-modal defect detection model employs ensemble learning to analyze the super-resolved images.
Authors: Martina Pastorino, Michael Alibani, Nicola Acito, Gabriele Moser
Abstract: This paper presents a novel methodology for generating realistic abundance maps from hyperspectral imagery using an unsupervised, deep-learning-driven approach. Our framework integrates blind linear hyperspectral unmixing with state-of-the-art diffusion models to enhance the realism and diversity of synthetic abundance maps. First, we apply blind unmixing to extract endmembers and abundance maps directly from raw hyperspectral data. These abundance maps then serve as inputs to a diffusion model, which acts as a generative engine to synthesize highly realistic spatial distributions. Diffusion models have recently revolutionized image synthesis by offering superior performance, flexibility, and stability, making them well-suited for high-dimensional spectral data. By leveraging this combination of physically interpretable unmixing and deep generative modeling, our approach enables the simulation of hyperspectral sensor outputs under diverse imaging conditions--critical for data augmentation, algorithm benchmarking, and model evaluation in hyperspectral analysis. Notably, our method is entirely unsupervised, ensuring adaptability to different datasets without the need for labeled training data. We validate our approach using real hyperspectral imagery from the PRISMA space mission for Earth observation, demonstrating its effectiveness in producing realistic synthetic abundance maps that capture the spatial and spectral characteristics of natural scenes.
Authors: Chengrui Zhang, Maizhen Ning, Zihao Zhou, Jie Sun, Kaizhu Huang, Qiufeng Wang
Abstract: Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, we rely on computer tools (e.g., Matplotlib and GeoGebra) to manually generate precise diagrams, but this usually incurs a huge and complicated calculation cost. Recently, researchers have started to work on learning-based methods (e.g., Stable Diffusion and GPT4) to automatically generate diagrams, saving operational cost but usually suffering from limited realism and insufficient accuracy. In this paper, we propose a novel framework GeoSDF to automatically generate diagrams efficiently and accurately with Signed Distance Field (SDF). Specifically, we first represent geometric elements in the SDF, then construct a series of constraint functions to represent geometric relationships, next optimize these constraint functions to obtain an optimized field of both elements and constraints, and finally render the optimized field to obtain the synthesized diagram. In our GeoSDF, we define a symbolic language to easily represent geometric elements and those constraints, and our synthesized geometry diagrams can be self-verified in the SDF, ensuring both mathematical accuracy and visual plausibility. In experiments, our GeoSDF synthesized both normal high-school-level and IMO-level geometry diagrams. Through both qualitative and quantitative analysis, we can see that the synthesized diagrams are realistic and accurate, and our synthesis process is simple and efficient. Furthermore, we obtain a very high accuracy of solving geometry problems (over 95% while the current SOTA accuracy is around 75%) by leveraging our self-verification property. All of these demonstrate the advantage of GeoSDF, paving the way for more sophisticated, accurate, and flexible generation of geometric diagrams for a wide array of applications.
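The constraint-optimization idea can be illustrated with a toy example: represent elements by differentiable parameters (two points and a circle), write constraints such as "both points lie on the circle" and "the points are a fixed distance apart" as penalties on signed distances, and minimize them by gradient descent. This is a simplified sketch under those assumptions, not GeoSDF's actual symbolic language or rendering.

```python
import torch

# Free parameters: two points and a circle (centre + radius).
p1 = torch.tensor([0.0, 0.0], requires_grad=True)
p2 = torch.tensor([3.0, 1.0], requires_grad=True)
center = torch.tensor([1.0, 1.0], requires_grad=True)
radius = torch.tensor(1.0, requires_grad=True)

def circle_sdf(p, c, r):
    """Signed distance from point p to the circle boundary."""
    return torch.linalg.norm(p - c) - r

def constraints():
    # Both points must lie on the circle and be exactly 2 units apart.
    on_circle = circle_sdf(p1, center, radius) ** 2 + circle_sdf(p2, center, radius) ** 2
    distance = (torch.linalg.norm(p1 - p2) - 2.0) ** 2
    return on_circle + distance

opt = torch.optim.Adam([p1, p2, center, radius], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = constraints()
    loss.backward()
    opt.step()

# The residual doubles as a self-verification check of the synthesized figure.
print(f"residual={constraints().item():.2e}, radius={radius.item():.3f}")
```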
Authors: Kshitij Kavimandan, Angelos Nalmpantis, Emma Beauxis-Aussalet, Robert-Jan Sips
Abstract: Patent images are technical drawings that convey information about a patent's innovation. Patent image retrieval systems aim to search in vast collections and retrieve the most relevant images. Despite recent advances in information retrieval, patent images still pose significant challenges due to their technical intricacies and complex semantic information, requiring efficient fine-tuning for domain adaptation. Current methods neglect patents' hierarchical relationships, such as those defined by the Locarno International Classification (LIC) system, which groups broad categories (e.g., "furnishing") into subclasses (e.g., "seats" and "beds") and further into specific patent designs. In this work, we introduce a hierarchical multi-positive contrastive loss that leverages the LIC's taxonomy to induce such relations in the retrieval process. Our approach assigns multiple positive pairs to each patent image within a batch, with varying similarity scores based on the hierarchical taxonomy. Our experimental analysis with various vision and multimodal models on the DeepPatent2 dataset shows that the proposed method enhances the retrieval results. Notably, our method is effective with low-parameter models, which require fewer computational resources and can be deployed on environments with limited hardware.
Authors: Mingyuan Li, Tong Jia, Han Gu, Hui Lu, Hao Wang, Bowen Ma, Shuyang Lin, Shiyi Guo, Shizhuo Deng, Dongyue Chen
Abstract: Overlapping object perception aims to decouple the randomly overlapping foreground-background features, extracting foreground features while suppressing background features, which holds significant application value in fields such as security screening and medical auxiliary diagnosis. Despite some research efforts to tackle the challenge of overlapping object perception, most solutions are confined to the spatial domain. Through frequency domain analysis, we observe that the degradation of contours and textures due to the overlapping phenomenon can be intuitively reflected in the magnitude spectrum. Based on this observation, we propose a general Frequency-Optimized Anti-Overlapping Framework (FOAM) to assist the model in extracting more texture and contour information, thereby enhancing the ability for anti-overlapping object perception. Specifically, we design the Frequency Spatial Transformer Block (FSTB), which can simultaneously extract features from both the frequency and spatial domains, helping the network capture more texture features from the foreground. In addition, we introduce the Hierarchical De-Corrupting (HDC) mechanism, which aligns adjacent features in the separately constructed base branch and corruption branch using a specially designed consistent loss during the training phase. This mechanism suppresses the response to irrelevant background features of FSTBs, thereby improving the perception of foreground contour. We conduct extensive experiments to validate the effectiveness and generalization of the proposed FOAM, which further improves the accuracy of state-of-the-art models on four datasets, specifically for the three overlapping object perception tasks: Prohibited Item Detection, Prohibited Item Segmentation, and Pneumonia Detection. The code will be open source once the paper is accepted.
Authors: David W Arathorn, Josephine C. D'Angelo, Austin Roorda
Abstract: Even during fixation the human eye is constantly in low amplitude motion, jittering over small angles in random directions at up to 100Hz. This motion results in all features of the image on the retina constantly traversing a number of cones, yet objects which are stable in the world are perceived to be stable, and any object which is moving in the world is perceived to be moving. A series of experiments carried out over a dozen years revealed the psychophysics of visual stabilization to be more nuanced than might be assumed, say, from the mechanics of stabilization of camera images, or what might be assumed to be the simplest solution from an evolutionary perspective. The psychophysics revealed by the experiments strongly implies a specific set of operations on retinal signals resulting in the observed stabilization behavior. The presentation is in two levels. First is a functional description of the action of the mechanism that is very likely responsible for the experimentally observed behavior. Second is a more speculative proposal of circuit-level neural elements that might implement the functional behavior.
Authors: Jungeon Kim, Geonsoo Park, Seungyong Lee
Abstract: Recent methods, such as 2D Gaussian Splatting and Gaussian Opacity Fields, have aimed to address the geometric inaccuracies of 3D Gaussian Splatting while retaining its superior rendering quality. However, these approaches still struggle to reconstruct smooth and reliable geometry, particularly in scenes with significant color variation across viewpoints, due to their per-point appearance modeling and single-view optimization constraints. In this paper, we propose an effective multiview geometric regularization strategy that integrates multiview stereo (MVS) depth, RGB, and normal constraints into Gaussian Splatting initialization and optimization. Our key insight is the complementary relationship between MVS-derived depth points and Gaussian Splatting-optimized positions: MVS robustly estimates geometry in regions of high color variation through local patch-based matching and epipolar constraints, whereas Gaussian Splatting provides more reliable and less noisy depth estimates near object boundaries and regions with lower color variation. To leverage this insight, we introduce a median depth-based multiview relative depth loss with uncertainty estimation, effectively integrating MVS depth information into Gaussian Splatting optimization. We also propose an MVS-guided Gaussian Splatting initialization to avoid Gaussians falling into suboptimal positions. Extensive experiments validate that our approach successfully combines these strengths, enhancing both geometric accuracy and rendering quality across diverse indoor and outdoor scenes.
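One way to realize a median depth-based relative depth loss with uncertainty weighting is sketched below: normalize each depth map by its median before comparing, and down-weight unreliable pixels. The tensor shapes and the specific weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def median_relative_depth_loss(rendered_depth, mvs_depth, uncertainty, eps=1e-6):
    """Median-normalised relative depth loss with uncertainty weighting (sketch).

    rendered_depth, mvs_depth: (B, H, W) depth maps from Gaussian Splatting and
    multi-view stereo respectively; uncertainty: (B, H, W) in [0, 1], where 1
    means the MVS depth is unreliable (all names are illustrative).
    """
    def normalise(d):
        med = d.flatten(1).median(dim=1).values.view(-1, 1, 1)
        return d / (med + eps)

    diff = (normalise(rendered_depth) - normalise(mvs_depth)).abs()
    weights = 1.0 - uncertainty
    return (weights * diff).sum() / weights.sum().clamp_min(eps)

rendered = torch.rand(2, 32, 32) * 5 + 1
mvs = rendered + 0.1 * torch.randn_like(rendered)
unc = torch.rand(2, 32, 32) * 0.5
print(median_relative_depth_loss(rendered, mvs, unc).item())
```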
Authors: Xiaoyang Wei, Camille Kurtz, Florence Cloppet
Abstract: Performance evaluation for Content-Based Image Retrieval (CBIR) remains a crucial but unsolved problem today, especially in the medical domain. Various evaluation metrics have been discussed in the literature to solve this problem. Most of the existing metrics (e.g., precision, recall) are adapted from classification tasks which require manual labels as ground truth. However, such labels are often expensive and unavailable in specific thematic domains. Furthermore, medical images are usually associated with (radiological) case reports or annotated with descriptive captions in literature figures; such text contains information that can help to assess CBIR. Several researchers have argued that the medical concepts hidden in the text can serve as the basis for CBIR evaluation purposes. However, these works often treat medical concepts as independent and isolated labels, neglecting the subtle relationships between them. In this work, we introduce the use of knowledge graphs to measure the distance between various medical concepts and propose a novel relevance measure for the evaluation of CBIR by defining an approximate matching-based relevance score between two sets of medical concepts, which allows us to indirectly measure the similarity between medical images. We quantitatively demonstrate the effectiveness and feasibility of our relevance measure using a public dataset.
Authors: Yihui Li, Chengxin Lv, Hongyu Yang, Di Huang
Abstract: Reconstructing 3D scenes from unconstrained image collections poses significant challenges due to variations in appearance. In this paper, we propose Scalable Micro-macro Wavelet-based Gaussian Splatting (SMW-GS), a novel method that enhances 3D reconstruction across diverse scales by decomposing scene representations into global, refined, and intrinsic components. SMW-GS incorporates the following innovations: Micro-macro Projection, which enables Gaussian points to sample multi-scale details with improved diversity; and Wavelet-based Sampling, which refines feature representations using frequency-domain information to better capture complex scene appearances. To achieve scalability, we further propose a large-scale scene promotion strategy, which optimally assigns camera views to scene partitions by maximizing their contributions to Gaussian points, achieving consistent and high-quality reconstructions even in expansive environments. Extensive experiments demonstrate that SMW-GS significantly outperforms existing methods in both reconstruction quality and scalability, particularly excelling in large-scale urban environments with challenging illumination variations. Project is available at https://github.com/Kidleyh/SMW-GS.
Authors: Hugo Riffaud de Turckheim, Sylvain Lobry, Roberto Interdonato, Diego Marcos
Abstract: The growing number of Earth observation satellites has led to increasingly diverse remote sensing data, with varying spatial, spectral, and temporal configurations. Most existing models rely on fixed input formats and modality-specific encoders, which require retraining when new configurations are introduced, limiting their ability to generalize across modalities. We introduce Atomizer, a flexible architecture that represents remote sensing images as sets of scalars, each corresponding to a spectral band value of a pixel. Each scalar is enriched with contextual metadata (acquisition time, spatial resolution, wavelength, and bandwidth), producing an atomic representation that allows a single encoder to process arbitrary modalities without interpolation or resampling. Atomizer uses structured tokenization with Fourier features and non-uniform radial basis functions to encode content and context, and maps tokens into a latent space via cross-attention. Under modality-disjoint evaluations, Atomizer outperforms standard models and demonstrates robust performance across varying resolutions and spatial sizes.
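A minimal sketch of the atomic tokenization idea follows: each pixel-band scalar and its metadata fields are mapped through Fourier features and concatenated into one token. The metadata fields, frequency count, and dimensions are assumptions for illustration, not Atomizer's exact recipe (which also uses non-uniform radial basis functions).

```python
import math
import torch

def fourier_features(x, num_freqs=8):
    """Map scalars to sin/cos features at log-spaced frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs) * math.pi          # (F,)
    angles = x.unsqueeze(-1) * freqs                           # (..., F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)     # (..., 2F)

def atomize(band_values, wavelengths, resolutions, acq_times, num_freqs=8):
    """Build one token per scalar band value, enriched with its metadata.

    All inputs are 1D tensors of the same length N (one entry per pixel-band
    'atom'), assumed to be normalised to comparable ranges.
    """
    parts = [fourier_features(t, num_freqs)
             for t in (band_values, wavelengths, resolutions, acq_times)]
    return torch.cat(parts, dim=-1)                            # (N, 8F)

n = 16
tokens = atomize(torch.rand(n), torch.rand(n), torch.rand(n), torch.rand(n))
print(tokens.shape)   # torch.Size([16, 64])
```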
Authors: Yuan Gao, Shaoyan Pan, Mingzhe Hu, Huiqiao Xie, Jill Remick, Chih-Wei Chang, Justin Roper, Zhen Tian, Xiaofeng Yang
Abstract: Cone-beam CT (CBCT) is widely used in clinical radiotherapy for image-guided treatment, improving setup accuracy, adaptive planning, and motion management. However, slow gantry rotation limits performance by introducing motion artifacts, blurring, and increased dose. This work aims to develop a clinically feasible method for reconstructing high-quality CBCT volumes from consecutive limited-angle acquisitions, addressing imaging challenges in time- or dose-constrained settings. We propose a limited-angle (LA) geometry-integrated cycle-domain (LA-GICD) framework for CBCT reconstruction, comprising two denoising diffusion probabilistic models (DDPMs) connected via analytic cone-beam forward and back projectors. A Projection-DDPM completes missing projections, followed by back-projection, and an Image-DDPM refines the volume. This dual-domain design leverages complementary priors from projection and image spaces to achieve high-quality reconstructions from limited-angle (<= 90 degrees) scans. Performance was evaluated against full-angle reconstruction. Four board-certified medical physicists conducted assessments. A total of 78 planning CTs in common CBCT geometries were used for training and evaluation. The method achieved a mean absolute error of 35.5 HU, SSIM of 0.84, and PSNR of 29.8 dB, with visibly reduced artifacts and improved soft-tissue clarity. LA-GICD's geometry-aware dual-domain learning, embedded in analytic forward/backward operators, enabled artifact-free, high-contrast reconstructions from a single 90-degree scan, reducing acquisition time and dose four-fold. LA-GICD improves limited-angle CBCT reconstruction with strong data fidelity and anatomical realism. It offers a practical solution for short-arc acquisitions, enhancing CBCT use in radiotherapy by providing clinically applicable images with reduced scan time and dose for more accurate, personalized treatments.
Authors: Guohuan Xie, Syed Ariff Syed Hesham, Wenya Guo, Bing Li, Ming-Ming Cheng, Guolei Sun, Yun Liu
Abstract: Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes. In this survey, we present a holistic review of recent advances in VSP, covering a wide array of vision tasks, including Video Semantic Segmentation (VSS), Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), as well as Video Tracking and Segmentation (VTS), and Open-Vocabulary Video Segmentation (OVVS). We systematically analyze the evolution from traditional hand-crafted features to modern deep learning paradigms -- spanning from fully convolutional networks to the latest transformer-based architectures -- and assess their effectiveness in capturing both local and global temporal contexts. Furthermore, our review critically discusses the technical challenges, ranging from maintaining temporal consistency to handling complex scene dynamics, and offers a comprehensive comparative study of datasets and evaluation metrics that have shaped current benchmarking standards. By distilling the key contributions and shortcomings of state-of-the-art methodologies, this survey highlights emerging trends and prospective research directions that promise to further elevate the robustness and adaptability of VSP in real-world applications.
Authors: Yueru Luo, Changqing Zhou, Yiming Yang, Erlong Li, Chao Zheng, Shuqi Mei, Shuguang Cui, Zhen Li
Abstract: Accurate road topology reasoning is critical for autonomous driving, enabling effective navigation and adherence to traffic regulations. Central to this task are lane perception and topology reasoning. However, existing methods typically focus on either lane detection or Lane-to-Lane (L2L) topology reasoning, often \textit{neglecting} Lane-to-Traffic-element (L2T) relationships or \textit{failing} to optimize these tasks jointly. Furthermore, most approaches either overlook relational modeling or apply it in a limited scope, despite the inherent spatial relationships among road elements. We argue that relational modeling is beneficial for both perception and reasoning, as humans naturally leverage contextual relationships for road element recognition and their connectivity inference. To this end, we introduce relational modeling into both perception and reasoning, \textit{jointly} enhancing structural understanding. Specifically, we propose: 1) a relation-aware lane detector, where our geometry-biased self-attention and curve cross-attention refine lane representations by capturing relational dependencies; 2) relation-enhanced topology heads, including a geometry-enhanced L2L head and a cross-view L2T head, boosting reasoning with relational cues; and 3) a contrastive learning strategy with InfoNCE loss to regularize relationship embeddings. Extensive experiments on OpenLane-V2 demonstrate that our approach significantly improves both detection and topology reasoning metrics, achieving +3.1 in DET$_l$, +5.3 in TOP$_{ll}$, +4.9 in TOP$_{lt}$, and an overall +4.4 in OLS, setting a new state-of-the-art. Code will be released.
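The InfoNCE regularizer on relationship embeddings mentioned in item 3 follows the standard form, sketched below for paired embeddings (one row per matched lane/traffic-element pair, with other rows in the batch acting as negatives); the pairing convention here is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, tau=0.07):
    """Standard InfoNCE over a batch of paired embeddings (sketch).

    anchor, positive: (B, D); row i of `positive` is the matching embedding
    for row i of `anchor`, and every other row serves as a negative.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

a = torch.randn(32, 256, requires_grad=True)      # e.g., lane embeddings
p = torch.randn(32, 256, requires_grad=True)      # e.g., related element embeddings
loss = info_nce(a, p)
loss.backward()
```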
Authors: Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, Gim Hee Lee
Abstract: Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, the generation of large-scale 3D scenes that require spatial coherence remains underexplored. In this paper, we propose X-Scene, a novel framework for large-scale driving scene generation that achieves both geometric intricacy and appearance fidelity, while offering flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level conditions such as user-provided or text-driven layout for detailed scene composition and high-level semantic guidance such as user-intent and LLM-enriched text prompts for efficient customization. To enhance geometrical and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and the corresponding multiview images, while ensuring alignment between modalities. Additionally, we extend the generated local region into a large-scale scene through consistency-aware scene outpainting, which extrapolates new occupancy and images conditioned on the previously generated area, enhancing spatial continuity and preserving visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as scene exploration. Comprehensive experiments demonstrate that X-Scene significantly advances controllability and fidelity for large-scale driving scene generation, empowering data generation and simulation for autonomous driving.
Authors: Geewook Kim, Minjoon Seo
Abstract: We propose an efficient framework to compress multiple video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from long or dense videos. Our design leverages a bidirectional state-space-based block equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging long and dense video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing overall token budget. Notably, replacing our proposed state-space block with a conventional Transformer results in substantial performance degradation, highlighting the advantages of state-space modeling for effectively compressing multi-frame video data. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.
Authors: Bowen Zheng
Abstract: To address the challenges of 3D modeling and structural simulation in industrial environments, such as the difficulty of equipment deployment and of balancing accuracy with real-time performance, this paper proposes an integrated workflow that combines high-fidelity 3D reconstruction from monocular video, finite element simulation analysis, and mixed reality visualization, aiming to build an interactive digital twin system for industrial inspection, equipment maintenance, and other scenarios. Firstly, the Neuralangelo algorithm based on deep learning is used to reconstruct the 3D mesh model with rich details from the surround-shot video. Then, the QuadRemesh tool of Rhino is used to optimize the initial triangular mesh and generate a structured mesh suitable for finite element analysis. The optimized mesh is further discretized by HyperMesh, and the material parameter setting and stress simulation are carried out in Abaqus to obtain high-precision stress and deformation results. Finally, combined with Unity and the Vuforia engine, the real-time superposition and interactive operation of simulation results in the augmented reality environment are realized, which improves users' intuitive understanding of structural response. Experiments show that the method has good simulation efficiency and visualization quality while maintaining high geometric accuracy. It provides a practical solution for digital modeling, mechanical analysis and interactive display in complex industrial scenes, and lays a foundation for the deep integration of digital twin and mixed reality technology in industrial applications.
Authors: Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong Liu, Xiangtai Li, Dacheng Tao
Abstract: Multimodal Large Language Models (MLLMs) struggle with long videos due to fixed context windows and weak long-term dependency modeling. Existing Retrieval-Augmented Generation (RAG) methods for videos use static retrieval strategies, leading to inefficiencies for simple queries and information loss for complex tasks. To address this, we propose AdaVideoRAG, a novel framework that dynamically adapts retrieval granularity based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs, enabling optimal resource allocation across tasks. We also introduce the HiVU benchmark for comprehensive evaluation. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs. AdaVideoRAG establishes a new paradigm for adaptive retrieval in video analysis. Codes will be open-sourced at https://github.com/xzc-zju/AdaVideoRAG.
Authors: Weimin Bai, Yubo Li, Wenzheng Chen, Weijian Luo, He Sun
Abstract: Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves asymmetric KL divergence--a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. Such reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D also achieves strong results on quantitative metrics, including text-asset alignment, 3D plausibility, text-geometry consistency, texture quality, and geometric detail.
Authors: Chenlu Zhan, Gaoang Wang, Hongwei Wang
Abstract: Semantic querying in complex 3D scenes through free-form language presents a significant challenge. Existing 3D scene understanding methods use large-scale training data and CLIP to align text queries with 3D semantic features. However, their reliance on predefined vocabulary priors from training data hinders free-form semantic querying. Besides, recent advanced methods rely on LLMs for scene understanding but lack comprehensive 3D scene-level information and often overlook the potential inconsistencies in LLM-generated outputs. In our paper, we propose FreeQ-Graph, which enables Free-form Querying with a semantic consistent scene Graph for 3D scene understanding. The core idea is to encode free-form queries from a complete and accurate 3D scene graph without predefined vocabularies, and to align them with 3D consistent semantic labels, which is accomplished through three key steps. We begin by constructing a complete and accurate 3D scene graph that maps free-form objects and their relations through LLM and LVLM guidance, entirely free from training data or predefined priors. Most importantly, we align graph nodes with accurate semantic labels by leveraging 3D semantic aligned features from merged superpoints, enhancing 3D semantic consistency. To enable free-form semantic querying, we then design an LLM-based reasoning algorithm that combines scene-level and object-level information to perform intricate reasoning. We conducted extensive experiments on 3D semantic grounding, segmentation, and complex querying tasks, while also validating the accuracy of graph generation. Experiments on 6 datasets show that our model excels in both complex free-form semantic queries and intricate relational reasoning.
Authors: Zhiyi Shi, Binjie Wang, Chongjie Si, Yichen Wu, Junsik Kim, Hanspeter Pfister
Abstract: Model editing aims to efficiently update a pre-trained model's knowledge without the need for time-consuming full retraining. While existing pioneering editing methods achieve promising results, they primarily focus on editing single-modal language models (LLMs). However, for vision-language models (VLMs), which involve multiple modalities, the role and impact of each modality on editing performance remain largely unexplored. To address this gap, we explore the impact of textual and visual modalities on model editing and find that: (1) textual and visual representations reach peak sensitivity at different layers, reflecting their varying importance; and (2) editing both modalities can efficiently update knowledge, but this comes at the cost of compromising the model's original capabilities. Based on our findings, we propose DualEdit, an editor that modifies both textual and visual modalities at their respective key layers. Additionally, we introduce a gating module within the more sensitive textual modality, allowing DualEdit to efficiently update new knowledge while preserving the model's original information. We evaluate DualEdit across multiple VLM backbones and benchmark datasets, demonstrating its superiority over state-of-the-art VLM editing baselines as well as adapted LLM editing methods on different evaluation metrics.
Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
Abstract: We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
Authors: Dipayan Biswas, Shishir Shah, Jaspal Subhlok
Abstract: We introduce the Lecture Video Visual Objects (LVVO) dataset, a new benchmark for visual object detection in educational video content. The dataset consists of 4,000 frames extracted from 245 lecture videos spanning biology, computer science, and geosciences. A subset of 1,000 frames, referred to as LVVO_1k, has been manually annotated with bounding boxes for four visual categories: Table, Chart-Graph, Photographic-image, and Visual-illustration. Each frame was labeled independently by two annotators, resulting in an inter-annotator F1 score of 83.41%, indicating strong agreement. To ensure high-quality consensus annotations, a third expert reviewed and resolved all cases of disagreement through a conflict resolution process. To expand the dataset, a semi-supervised approach was employed to automatically annotate the remaining 3,000 frames, forming LVVO_3k. The complete dataset offers a valuable resource for developing and evaluating both supervised and semi-supervised methods for visual content detection in educational videos. The LVVO dataset is publicly available to support further research in this domain.
Authors: Zhucun Xue, Jiangning Zhang, Teng Hu, Haoyang He, Yinan Chen, Yuxuan Cai, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
Abstract: The quality of the video dataset (image quality, resolution, and fine-grained caption) greatly influences the performance of the video generation model. The growing demand for video applications sets higher requirements for high-quality video generation models. For example, the generation of movie-level Ultra-High Definition (UHD) videos and the creation of 4K short video content. However, the existing public datasets cannot support related research and applications. In this paper, we first propose a high-quality open-sourced UHD-4K (22.4\% of which are 8K) text-to-video dataset named UltraVideo, which contains a wide range of topics (more than 100 kinds), and each video has 9 structured captions with one summarized caption (average of 824 words). Specifically, we carefully design a highly automated curation process with four stages to obtain the final high-quality dataset: \textit{i)} collection of diverse and high-quality video clips. \textit{ii)} statistical data filtering. \textit{iii)} model-based data purification. \textit{iv)} generation of comprehensive, structured captions. In addition, we expand Wan to UltraWan-1K/-4K, which can natively generate high-quality 1K/4K videos with more consistent text controllability, demonstrating the effectiveness of our data curation. We believe that this work can make a significant contribution to future research on UHD video generation. UltraVideo dataset and UltraWan models are available at https://xzc-zju.github.io/projects/UltraVideo.
Authors: Junyoung Seo, Jisang Han, Jaewoo Jung, Siyoon Jin, Joungbin Lee, Takuya Narihira, Kazumi Fukuda, Takashi Shibuya, Donghoon Ahn, Shoukang Hu, Seungryong Kim, Yuki Mitsufuji
Abstract: We introduce Vid-CamEdit, a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. This task is challenging due to its ill-posed nature and the limited multi-view video data for training. Traditional reconstruction methods struggle with extreme trajectory changes, and existing generative models for dynamic novel view synthesis cannot handle in-the-wild videos. Our approach consists of two steps: estimating temporally consistent geometry, and generative rendering guided by this geometry. By integrating geometric priors, the generative model focuses on synthesizing realistic details where the estimated geometry is uncertain. We eliminate the need for extensive 4D training data through a factorized fine-tuning framework that separately trains spatial and temporal components using multi-view image and video data. Our method outperforms baselines in producing plausible videos from novel camera trajectories, especially in extreme extrapolation scenarios on real-world footage.
Authors: Kaiyuan Tan, Pavan Kumar B N, Bharatesh Chakravarthi
Abstract: Event cameras are gaining traction in traffic monitoring applications due to their low latency, high temporal resolution, and energy efficiency, which makes them well-suited for real-time object detection at traffic intersections. However, the development of robust event-based detection models is hindered by the limited availability of annotated real-world datasets. To address this, several simulation tools have been developed to generate synthetic event data. Among these, the CARLA driving simulator includes a built-in dynamic vision sensor (DVS) module that emulates event camera output. Despite its potential, the sim-to-real gap for event-based object detection remains insufficiently studied. In this work, we present a systematic evaluation of this gap by training a recurrent vision transformer model exclusively on synthetic data generated using CARLA's DVS and testing it on varying combinations of synthetic and real-world event streams. Our experiments show that models trained solely on synthetic data perform well on synthetic-heavy test sets but suffer significant performance degradation as the proportion of real-world data increases. In contrast, models trained on real-world data demonstrate stronger generalization across domains. This study offers the first quantifiable analysis of the sim-to-real gap in event-based object detection using CARLA's DVS. Our findings highlight limitations in current DVS simulation fidelity and underscore the need for improved domain adaptation techniques in neuromorphic vision for traffic monitoring.
Authors: Qiyu Xu, Wenyang Chen, Zhanxuan Hu, Huafeng Li, Yonghang Tai
Abstract: Transductive zero-shot learning (ZSL) aims to classify unseen categories by leveraging both semantic class descriptions and the distribution of unlabeled test data. While Vision-Language Models (VLMs) such as CLIP excel at aligning visual inputs with textual semantics, they often rely too heavily on class-level priors and fail to capture fine-grained visual cues. In contrast, Vision-only Foundation Models (VFMs) like DINOv2 provide rich perceptual features but lack semantic alignment. To exploit the complementary strengths of these models, we propose OTFusion, a simple yet effective training-free framework that bridges VLMs and VFMs via Optimal Transport. Specifically, OTFusion aims to learn a shared probabilistic representation that aligns visual and semantic information by minimizing the transport cost between their respective distributions. This unified distribution enables coherent class predictions that are both semantically meaningful and visually grounded. Extensive experiments on 11 benchmark datasets demonstrate that OTFusion consistently outperforms the original CLIP model, achieving an average accuracy improvement of nearly $10\%$, all without any fine-tuning or additional annotations. The code will be publicly released after the paper is accepted.
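The transport step can be illustrated with entropic optimal transport solved by Sinkhorn iterations between image features and class text embeddings; the cost definition, uniform marginals, and regularization value below are assumptions for a toy sketch, not OTFusion's exact configuration.

```python
import torch

def sinkhorn(cost, reg=0.05, n_iters=100):
    """Entropic OT between uniform marginals via Sinkhorn iterations.

    cost: (N, C) cost matrix, e.g. 1 - cosine similarity between N unlabeled
    image features (from a VFM like DINOv2) and C class text embeddings
    (from a VLM like CLIP). Returns a transport plan usable as soft labels.
    """
    K = torch.exp(-cost / reg)                     # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)     # (N, C) transport plan

# Toy example: 100 images, 10 classes, cosine-similarity based cost.
img = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(10, 512), dim=-1)
plan = sinkhorn(1.0 - img @ txt.t())
pred = plan.argmax(dim=1)                          # class with highest mass
print(plan.shape, pred.shape)
```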
Authors: Yuheng Yuan, Qiuhong Shen, Shizun Wang, Xingyi Yang, Xinchao Wang
Abstract: Dense matching methods like DUSt3R regress pairwise pointmaps for 3D reconstruction. However, the reliance on pairwise prediction and the limited generalization capability inherently restrict the global geometric consistency. In this work, we introduce Test3R, a surprisingly simple test-time learning technique that significantly boosts geometric accuracy. Using image triplets ($I_1,I_2,I_3$), Test3R generates reconstructions from pairs ($I_1,I_2$) and ($I_1,I_3$). The core idea is to optimize the network at test time via a self-supervised objective: maximizing the geometric consistency between these two reconstructions relative to the common image $I_1$. This ensures the model produces cross-pair consistent outputs, regardless of the inputs. Extensive experiments demonstrate that our technique significantly outperforms previous state-of-the-art methods on the 3D reconstruction and multi-view depth estimation tasks. Moreover, it is universally applicable and nearly cost-free, making it easily applied to other models and implemented with minimal test-time training overhead and parameter footprint. Code is available at https://github.com/nopQAQ/Test3R.
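The test-time learning loop can be sketched generically: run the pairwise model on (I1, I2) and (I1, I3), compare the two predictions for the shared image I1, and take a few gradient steps on that consistency objective. The tiny network below is a stand-in for a DUSt3R-style regressor, so the whole block is an illustrative assumption rather than Test3R's implementation.

```python
import torch
import torch.nn as nn

class TinyPairwiseNet(nn.Module):
    """Stand-in for a pairwise pointmap regressor (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 3, kernel_size=3, padding=1)

    def forward(self, img_a, img_b):
        # Returns a pointmap for img_a expressed in the pair's shared frame.
        return self.net(torch.cat([img_a, img_b], dim=1))

def test_time_adapt(model, i1, i2, i3, steps=10, lr=1e-4):
    """Optimize the model at test time for cross-pair consistency on I1."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pts_12 = model(i1, i2)        # pointmap of I1 from pair (I1, I2)
        pts_13 = model(i1, i3)        # pointmap of I1 from pair (I1, I3)
        loss = (pts_12 - pts_13).abs().mean()   # geometric consistency on I1
        loss.backward()
        opt.step()
    return loss.item()

model = TinyPairwiseNet()
i1, i2, i3 = (torch.rand(1, 3, 64, 64) for _ in range(3))
print(test_time_adapt(model, i1, i2, i3))
```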
Authors: Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, Jiaqi Ma
Abstract: Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
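Tokenizing continuous trajectories into discrete actions can be done by clamping and uniformly binning each coordinate so the resulting ids can extend a language model's vocabulary; the bin counts and coordinate ranges below are illustrative assumptions, not AutoVLA's actual action space.

```python
import torch

def tokenize_trajectory(xy, num_bins=128, x_range=(-50.0, 50.0), y_range=(-10.0, 70.0)):
    """Quantise a continuous (T, 2) trajectory into discrete token ids (sketch).

    Each coordinate is clamped to a fixed range and mapped to one of
    `num_bins` bins; x and y use disjoint id ranges.
    """
    def quantise(v, lo, hi, offset):
        idx = ((v.clamp(lo, hi) - lo) / (hi - lo) * (num_bins - 1)).round().long()
        return idx + offset

    x_tok = quantise(xy[:, 0], *x_range, offset=0)
    y_tok = quantise(xy[:, 1], *y_range, offset=num_bins)
    return torch.stack([x_tok, y_tok], dim=1).flatten()   # (2T,) interleaved ids

def detokenize(tokens, num_bins=128, x_range=(-50.0, 50.0), y_range=(-10.0, 70.0)):
    """Invert the quantisation back to approximate continuous waypoints."""
    tok = tokens.view(-1, 2).float()
    x = tok[:, 0] / (num_bins - 1) * (x_range[1] - x_range[0]) + x_range[0]
    y = (tok[:, 1] - num_bins) / (num_bins - 1) * (y_range[1] - y_range[0]) + y_range[0]
    return torch.stack([x, y], dim=1)

traj = torch.tensor([[0.0, 0.0], [0.5, 2.0], [1.2, 4.1]])
tokens = tokenize_trajectory(traj)
print(tokens.tolist(), detokenize(tokens))
```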
Authors: Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, Zilong Dong
Abstract: Reconstructing an animatable 3D human from casually captured images of an articulated subject without camera or human pose information is a practical yet challenging task due to view misalignment, occlusions, and the absence of structural priors. While optimization-based methods can produce high-fidelity results from monocular or multi-view videos, they require accurate pose estimation and slow iterative optimization, limiting scalability in unconstrained scenarios. Recent feed-forward approaches enable efficient single-image reconstruction but struggle to effectively leverage multiple input images to reduce ambiguity and improve reconstruction accuracy. To address these challenges, we propose PF-LHM, a large human reconstruction model that generates high-quality 3D avatars in seconds from one or multiple casually captured pose-free images. Our approach introduces an efficient Encoder-Decoder Point-Image Transformer architecture, which fuses hierarchical geometric point features and multi-view image features through multimodal attention. The fused features are decoded to recover detailed geometry and appearance, represented using 3D Gaussian splats. Extensive experiments on both real and synthetic datasets demonstrate that our method unifies single- and multi-image 3D human reconstruction, achieving high-fidelity and animatable 3D human avatars without requiring camera and human pose annotations. Code and models will be released to the public.
Authors: Hao Gu, Lujun Li, Zheyu Wang, Bei Liu, Qiyuan Zhu, Sirui Han, Yike Guo
Abstract: Binary quantization represents the most extreme form of large language model (LLM) compression, reducing weights to $\pm$1 for maximal memory and computational efficiency. While recent sparsity-aware binarization methods achieve sub-1-bit compression by pruning redundant binary weights, they suffer from three critical challenges: performance deterioration, computational complexity from sparse mask management, and limited hardware compatibility. In this paper, we present BTC-LLM, a novel sub-1-bit LLM quantization framework that leverages adaptive weight transformation and binary pattern clustering to overcome these limitations, delivering both superior accuracy and efficiency. Our approach incorporates two key innovations: (1) a Learnable Transformation that optimizes invertible scaling and rotation matrices to align binarized weights with full-precision distributions, enabling incoherence processing to enhance layer-wise representation quality; (2) a Flash and Accurate Binary Codebook that identifies recurring binary vector clusters, compressing them into compact indices with tailored distance metrics and sign-based centroid updates. This eliminates the need for sparse masks, enabling efficient inference on standard hardware. Our code is available at https://github.com/Chooovy/BTC-LLM.
Authors: Yewei Liu, Xiyuan Wang, Muhan Zhang
Abstract: Network pruning, aimed at reducing network size while preserving accuracy, has attracted significant research interest. Numerous pruning techniques have been proposed over time. They are becoming increasingly effective, but also more complex and harder to interpret. Given the inherent complexity of neural networks, we argue that manually designing pruning criteria has reached a bottleneck. To address this, we propose a novel approach in which we "use a neural network to prune neural networks". More specifically, we introduce the newly developed idea of the metanetwork from meta-learning into pruning. A metanetwork is a network that takes another network as input and produces a modified network as output. In this paper, we first establish a bijective mapping between neural networks and graphs, and then employ a graph neural network as our metanetwork. We train a metanetwork that learns the pruning strategy automatically and can transform a network that is hard to prune into another network that is much easier to prune. Once the metanetwork is trained, our pruning requires nothing more than a feedforward pass through the metanetwork followed by standard finetuning to reach state-of-the-art performance. Our method achieved outstanding results on many popular and representative pruning tasks (including ResNet56 on CIFAR10, VGG19 on CIFAR100, ResNet50 on ImageNet). Our code is available at https://github.com/Yewei-Liu/MetaPruning
Authors: Andr\'e Ferreira, Kunpeng Xie, Caroline Wilpert, Gustavo Correia, Felix Barajas Ordonez, Tiago Gil Oliveira, Maike Bode, Robert Siepmann, Frank H\"olzle, Rainer R\"ohrig, Jens Kleesiek, Daniel Truhn, Jan Egger, Victor Alves, Behrus Puladi
Abstract: AI requires extensive datasets, while medical data is subject to high data protection. Anonymization is essential, but poses a challenge for some regions, such as the head, as identifying structures overlap with regions of clinical interest. Synthetic data offers a potential solution, but studies often lack rigorous evaluation of realism and utility. Therefore, we investigate to what extent synthetic data can replace real data in segmentation tasks. We employed head and neck cancer CT scans and brain glioma MRI scans from two large datasets. Synthetic data were generated using generative adversarial networks and diffusion models. We evaluated the quality of the synthetic data using MAE, MS-SSIM, Radiomics and a Visual Turing Test (VTT) performed by 5 radiologists, and their usefulness in segmentation tasks using DSC. Radiomics indicates high fidelity of synthetic MRIs, but the models fall short in producing highly realistic CT tissue, with correlation coefficients of 0.8784 and 0.5461 for MRI and CT tumors, respectively. DSC results indicate limited utility of synthetic data: tumor segmentation achieved DSC=0.064 on CT and 0.834 on MRI, while bone segmentation reached a mean DSC=0.841. A relation between DSC and correlation is observed, but it is limited by the complexity of the task. VTT results show the utility of synthetic CTs, but with limited educational applications. Synthetic data can be used independently for the segmentation task, although limited by the complexity of the structures to segment. Advancing generative models to better tolerate heterogeneous inputs and learn subtle details is essential for enhancing their realism and expanding their application potential.
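For reference, the Dice similarity coefficient (DSC) quoted above is straightforward to compute for binary masks; the masks below are toy examples, not data from the study.

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks (values in {0, 1})."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy masks standing in for a predicted segmentation and its ground truth.
a = np.zeros((8, 8), dtype=int); a[2:6, 2:6] = 1
b = np.zeros((8, 8), dtype=int); b[3:7, 3:7] = 1
print(round(dice(a, b), 3))   # overlap of two shifted squares
```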
Authors: Phillipe R. Sampaio, Helene Maxcici
Abstract: This paper investigates a novel approach to unsupervised document clustering by leveraging multimodal embeddings as input to traditional clustering algorithms such as $k$-Means and DBSCAN. Our method aims to achieve a finer-grained document understanding by not only grouping documents at the type level (e.g., invoices, purchase orders), but also distinguishing between different templates within the same document category. This is achieved by using embeddings that capture textual content, layout information, and visual features of documents. We evaluated the effectiveness of this approach using embeddings generated by several state-of-the-art pretrained multimodal models, including SBERT, LayoutLMv1, LayoutLMv3, DiT, Donut, and ColPali. Our findings demonstrate the potential of multimodal embeddings to significantly enhance document clustering, offering benefits for various applications in intelligent document processing, document layout analysis, and unsupervised document classification. This work provides valuable insight into the advantages and limitations of different multimodal models for this task and opens new avenues for future research to understand and organize document collections.
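A minimal version of this pipeline only needs off-the-shelf clustering on precomputed embeddings. The random vectors below stand in for multimodal embeddings from a model such as SBERT or LayoutLMv3, and the cluster count and DBSCAN parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import normalize

# Stand-in document embeddings; in practice these would come from a multimodal encoder.
rng = np.random.default_rng(42)
embeddings = normalize(rng.normal(size=(300, 768)))   # L2-normalize for cosine geometry

kmeans_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embeddings)
dbscan_labels = DBSCAN(eps=0.9, min_samples=5, metric="cosine").fit_predict(embeddings)

print(np.bincount(kmeans_labels))                 # documents per k-Means cluster
print((dbscan_labels == -1).sum(), "noise docs")  # DBSCAN marks outliers as -1
```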
Authors: Shehroz S. Khan, Ali Abedi, Charlene H. Chu
Abstract: Interpreting large volumes of high-dimensional, unlabeled data in a manner that is comprehensible to humans remains a significant challenge across various domains. In unsupervised healthcare data analysis, interpreting clustered data can offer meaningful insights into patients' health outcomes, which hold direct implications for healthcare providers. This paper addresses the problem of interpreting clustered sensor data collected from older adult patients recovering from lower-limb fractures in the community. A total of 560 days of multimodal sensor data, including acceleration, step count, ambient motion, GPS location, heart rate, and sleep, alongside clinical scores, were remotely collected from patients at home. Clustering was first carried out separately for each data modality to assess the impact of feature sets extracted from each modality on patients' recovery trajectories. Then, using context-aware prompting, a large language model was employed to infer meaningful cluster labels for the clusters derived from each modality. The quality of these clusters and their corresponding labels was validated through rigorous statistical testing and visualization against clinical scores collected alongside the multimodal sensor data. The results demonstrated the statistical significance of most modality-specific cluster labels generated by the large language model with respect to clinical scores, confirming the efficacy of the proposed method for interpreting sensor data in an unsupervised manner. This unsupervised data analysis approach, relying solely on sensor data, enables clinicians to identify at-risk patients and take timely measures to improve health outcomes.
Authors: Stanley Lewis, Vishal Chandra, Tom Gao, Odest Chadwicke Jenkins
Abstract: Representing articulated objects remains a difficult problem within the field of robotics. Objects such as pliers, clamps, or cabinets require representations that capture not only geometry and color information, but also part separation, connectivity, and joint parametrization. Furthermore, learning these representations becomes even more difficult with each additional degree of freedom. Complex articulated objects such as robot arms may have seven or more degrees of freedom, and the depth of their kinematic tree may be notably greater than the tools, drawers, and cabinets that are the typical subjects of articulated object research. To address these concerns, we introduce SPLATART - a pipeline for learning Gaussian splat representations of articulated objects from posed images, of which a subset contains image space part segmentations. SPLATART disentangles the part separation task from the articulation estimation task, allowing for post-facto determination of joint estimation and representation of articulated objects with deeper kinematic trees than previously exhibited. In this work, we present results for the SPLATART pipeline as applied to the synthetic Paris dataset objects, and qualitative results on a real-world object under sparse segmentation supervision. We additionally present results on articulated serial chain manipulators to demonstrate usage on deeper kinematic tree structures.
Authors: Haoyu Dong, Yuwen Chen, Hanxue Gu, Nicholas Konz, Yaqian Chen, Qihang Li, Maciej A. Mazurowski
Abstract: The widespread use of Magnetic Resonance Imaging (MRI) and the rise of deep learning have enabled the development of powerful predictive models for a wide range of diagnostic tasks in MRI, such as image classification or object segmentation. However, training models for specific new tasks often requires large amounts of labeled data, which is difficult to obtain due to high annotation costs and data privacy concerns. To circumvent this issue, we introduce MRI-CORE (MRI COmprehensive Representation Encoder), a vision foundation model pre-trained using more than 6 million slices from over 110,000 MRI volumes across 18 main body locations. Experiments on five diverse object segmentation tasks in MRI demonstrate that MRI-CORE can significantly improve segmentation performance in realistic scenarios with limited labeled data availability, achieving an average gain of 6.97% 3D Dice Coefficient using only 10 annotated slices per task. We further demonstrate new model capabilities in MRI such as classification of image properties including body location, sequence type and institution, and zero-shot segmentation. These results highlight the value of MRI-CORE as a generalist vision foundation model for MRI, potentially lowering the data annotation resource barriers for many applications.
Authors: Jayjun Lee, Nima Fazeli
Abstract: Mastering dexterous, contact-rich object manipulation demands precise estimation of both in-hand object poses and external contact locations$\unicode{x2013}$tasks particularly challenging due to partial and noisy observations. We present ViTaSCOPE: Visuo-Tactile Simultaneous Contact and Object Pose Estimation, an object-centric neural implicit representation that fuses vision and high-resolution tactile feedback. By representing objects as signed distance fields and distributed tactile feedback as neural shear fields, ViTaSCOPE accurately localizes objects and registers extrinsic contacts onto their 3D geometry as contact fields. Our method enables seamless reasoning over complementary visuo-tactile cues by leveraging simulation for scalable training and zero-shot transfers to the real-world by bridging the sim-to-real gap. We evaluate our method through comprehensive simulated and real-world experiments, demonstrating its capabilities in dexterous manipulation scenarios.
Authors: Babak Naderi, Ross Cutler, Juhee Cho, Nabakumar Khongbantabam, Dejan Ivkovic
Abstract: Super-Resolution (SR) is a critical task in computer vision, focusing on reconstructing high-resolution (HR) images from low-resolution (LR) inputs. The field has seen significant progress through various challenges, particularly in single-image SR. Video Super-Resolution (VSR) extends this to the temporal domain, aiming to enhance video quality using methods like local, uni-, bi-directional propagation, or traditional upscaling followed by restoration. This challenge addresses VSR for conferencing, where LR videos are encoded with H.265 at fixed QPs. The goal is to upscale videos by a specific factor, providing HR outputs with enhanced perceptual quality under a low-delay scenario using causal models. The challenge included three tracks: general-purpose videos, talking head videos, and screen content videos, with separate datasets provided by the organizers for training, validation, and testing. We open-sourced a new screen content dataset for the SR task in this challenge. Submissions were evaluated through subjective tests using a crowdsourced implementation of the ITU-T Rec P.910.
Authors: Haoyu Zhai, Shuo Wang, Pirouz Naghavi, Qingying Hao, Gang Wang
Abstract: Gaussian blur is widely used to blur human faces in sensitive photos before the photos are posted on the Internet. However, it is unclear to what extent the blurred faces can be restored and used to re-identify the person, especially under a high-blurring setting. In this paper, we explore this question by developing a deblurring method called Revelio. The key intuition is to leverage a generative model's memorization effect and approximate the inverse function of Gaussian blur for face restoration. Compared with existing methods, we design the deblurring process to be identity-preserving. It uses a conditional Diffusion model for preliminary face restoration and then uses an identity retrieval model to retrieve related images to further enhance fidelity. We evaluate Revelio with large public face image datasets and show that it can effectively restore blurred faces, especially under a high-blurring setting. It has a re-identification accuracy of 95.9%, outperforming existing solutions. The result suggests that Gaussian blur should not be used for face anonymization purposes. We also demonstrate the robustness of this method against mismatched Gaussian kernel sizes and functions, and test preliminary countermeasures and adaptive attacks to inspire future work.
Authors: Zaiqiang Wu, I-Chao Shen, Takeo Igarashi
Abstract: Per-garment virtual try-on methods collect garment-specific datasets and train networks tailored to each garment to achieve superior results. However, these approaches often struggle with loose-fitting garments due to two key limitations: (1) They rely on human body semantic maps to align garments with the body, but these maps become unreliable when body contours are obscured by loose-fitting garments, resulting in degraded outcomes; (2) They train garment synthesis networks on a per-frame basis without utilizing temporal information, leading to noticeable jittering artifacts. To address these challenges, we propose a two-stage approach for robust semantic map estimation. First, we extract a garment-invariant representation from the raw input image. This representation is then passed through an auxiliary network to estimate the semantic map. This enhances the robustness of semantic map estimation under loose-fitting garments during garment-specific dataset generation. Furthermore, we introduce a recurrent garment synthesis framework that incorporates temporal dependencies to improve frame-to-frame coherence while maintaining real-time performance. We conducted qualitative and quantitative evaluations to demonstrate that our method outperforms existing approaches in both image quality and temporal coherence. Ablation studies further validate the effectiveness of the garment-invariant representation and the recurrent synthesis framework.
Authors: Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, Hengxing Cai
Abstract: Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, we propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline.
Authors: Stan Mu\~noz Guti\'errez, Franz Wotawa
Abstract: This paper introduces Spectral Fault Receptive Fields (SFRFs), a biologically inspired technique for degradation state assessment in bearing fault diagnosis and remaining useful life (RUL) estimation. Drawing on the center-surround organization of retinal ganglion cell receptive fields, we propose a frequency-domain feature extraction algorithm that enhances the detection of fault signatures in vibration signals. SFRFs are designed as antagonistic spectral filters centered on characteristic fault frequencies, with inhibitory surrounds that enable robust characterization of incipient faults under variable operating conditions. A multi-objective evolutionary optimization strategy based on NSGA-II algorithm is employed to tune the receptive field parameters by simultaneously minimizing RUL prediction error, maximizing feature monotonicity, and promoting smooth degradation trajectories. The method is demonstrated on the XJTU-SY bearing run-to-failure dataset, confirming its suitability for constructing condition indicators in health monitoring applications. Key contributions include: (i) the introduction of SFRFs, inspired by the biology of vision in the primate retina; (ii) an evolutionary optimization framework guided by condition monitoring and prognosis criteria; and (iii) experimental evidence supporting the detection of early-stage faults and their precursors. Furthermore, we confirm that our diagnosis-informed spectral representation achieves accurate RUL prediction using a bagging regressor. The results highlight the interpretability and principled design of SFRFs, bridging signal processing, biological sensing principles, and data-driven prognostics in rotating machinery.
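A center-surround spectral filter of the kind described can be sketched as a difference of two Gaussians over frequency bins, applied to a vibration spectrum. The fault frequency, bandwidths, and surround weight below are illustrative placeholders, not the evolutionarily optimized values from the paper.

```python
import numpy as np

def spectral_receptive_field(freqs, center_hz, sigma_c, sigma_s, w_s=0.6):
    """Center-surround (DoG-style) weighting over frequency bins; parameters are illustrative."""
    center   = np.exp(-0.5 * ((freqs - center_hz) / sigma_c) ** 2)
    surround = np.exp(-0.5 * ((freqs - center_hz) / sigma_s) ** 2)
    return center - w_s * surround      # excitatory center, inhibitory surround

fs, n = 25600, 8192                      # sampling rate and window length (toy values)
t = np.arange(n) / fs
signal = np.sin(2 * np.pi * 107.9 * t) + 0.5 * np.random.default_rng(0).normal(size=n)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(n, d=1 / fs)

# Response of a receptive field centered on an assumed characteristic fault frequency.
rf = spectral_receptive_field(freqs, center_hz=107.9, sigma_c=2.0, sigma_s=8.0)
condition_indicator = float(np.sum(rf * spectrum))
print(condition_indicator)
```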
Authors: Minghui Zhang, Yaoyu Liu, Xin You, Hanxiao Zhang, Yun Gu
Abstract: Accurate multi-class tubular modeling is critical for precise lesion localization and optimal treatment planning. Deep learning methods enable automated shape modeling by prioritizing volumetric overlap accuracy. However, the inherent complexity of fine-grained semantic tubular shapes is not fully emphasized by overlap accuracy, resulting in reduced topological preservation. To address this, we propose Shape-aware Sampling (SAS), which optimizes patchsize allocation for online sampling and extracts a topology-preserved skeletal representation for the objective function. Fractal Dimension-based Patchsize (FDPS) is first introduced to quantify semantic tubular shape complexity through axis-specific fractal dimension analysis. Axes with higher fractal complexity are then sampled with smaller patchsizes to capture fine-grained features and resolve structural intricacies. In addition, Minimum Path-Cost Skeletonization (MPC-Skel) is employed to sample topologically consistent skeletal representations of semantic tubular shapes for skeleton-weighted objective functions. MPC-Skel reduces artifacts from conventional skeletonization methods and directs the focus to critical topological regions, enhancing tubular topology preservation. SAS is computationally efficient and easily integrable into optimization pipelines. Evaluation on two semantic tubular datasets showed consistent improvements in both volumetric overlap and topological integrity metrics.
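The fractal-dimension idea can be illustrated with classic box counting on a binary mask. The mapping from estimated dimension to patch size at the end is a made-up example of the kind of allocation rule the abstract describes, not the paper's FDPS.

```python
import numpy as np

def box_counting_dimension(mask, sizes=(2, 4, 8, 16, 32)):
    """Classic box-counting estimate of fractal dimension for a 2D binary mask."""
    counts = []
    for s in sizes:
        h, w = (mask.shape[0] // s) * s, (mask.shape[1] // s) * s
        blocks = mask[:h, :w].reshape(h // s, s, w // s, s)
        counts.append(blocks.any(axis=(1, 3)).sum())          # boxes touching the structure
    # slope of log(count) vs log(size) is negative; its negation estimates the dimension
    return -np.polyfit(np.log(sizes), np.log(counts), 1)[0]

# Toy tubular-looking mask: a thin sinusoidal band.
yy, xx = np.mgrid[0:128, 0:128]
mask = np.abs(yy - (64 + 20 * np.sin(xx / 10.0))) < 2

fd = box_counting_dimension(mask)
# Illustrative rule: more complex structure -> smaller sampling patch along that plane.
patch = 96 if fd < 1.2 else 64 if fd < 1.5 else 48
print(round(fd, 3), patch)
```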
Authors: Mengyuan Sun, Yu Li, Yuchen Liu, Bo Du, Yunjie Ge
Abstract: Multimodal contrastive learning models like CLIP have demonstrated remarkable vision-language alignment capabilities, yet their vulnerability to backdoor attacks poses critical security risks. Attackers can implant latent triggers that persist through downstream tasks, enabling malicious control of model behavior upon trigger presentation. Despite great success in recent defense mechanisms, they remain impractical due to strong assumptions about attacker knowledge or excessive clean data requirements. In this paper, we introduce InverTune, the first backdoor defense framework for multimodal models under minimal attacker assumptions, requiring neither prior knowledge of attack targets nor access to the poisoned dataset. Unlike existing defense methods that rely on the same dataset used in the poisoning stage, InverTune effectively identifies and removes backdoor artifacts through three key components, achieving robust protection against backdoor attacks. Specifically, InverTune first exposes attack signatures through adversarial simulation, probabilistically identifying the target label by analyzing model response patterns. Building on this, we develop a gradient inversion technique to reconstruct latent triggers through activation pattern analysis. Finally, a clustering-guided fine-tuning strategy is employed to erase the backdoor function with only a small amount of arbitrary clean data, while preserving the original model capabilities. Experimental results show that InverTune reduces the average attack success rate (ASR) by 97.87% against the state-of-the-art (SOTA) attacks while limiting clean accuracy (CA) degradation to just 3.07%. This work establishes a new paradigm for securing multimodal systems, advancing security in foundation model deployment without compromising performance.
Authors: Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, Pengyu Chen, Junze Chen, Hao Du, Kaiwen Shen, Shangkun Wu, Jiwei Wei, Shiyuan He, Yang Yang, Xiaohai Xu, Ke Ma, Qianqian Xu, Qingming Huang, Shi Lin, Xun Wang, Changting Lin, Meng Han, Yilei Jiang, Siqi Lai, Yaozhi Zheng, Yifei Song, Xiangyu Yue, Zonglei Jing, Tianyuan Zhang, Zhilei Zhu, Aishan Liu, Jiakai Wang, Siyuan Liang, Xianglong Kong, Hainan Li, Junjie Mu, Haotong Qin, Yue Yu, Lei Chen, Felix Juefei-Xu, Qing Guo, Xinyun Chen, Yew Soon Ong, Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao
Abstract: Multimodal Large Language Models (MLLMs) have enabled transformative advancements across diverse applications but remain susceptible to safety threats, especially jailbreak attacks that induce harmful outputs. To systematically evaluate and improve their safety, we organized the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. This technical report presents findings from the competition, which involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The competition results highlight ongoing challenges in securing MLLMs and provide valuable guidance for developing stronger defense mechanisms. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer multimodal AI systems. The code and data for this challenge are openly available at https://github.com/NY1024/ATLAS_Challenge_2025.
Authors: Federico Simonetta
Abstract: This paper presents the first comprehensive systematic review of literature on style-based composer identification and authorship attribution in symbolic music scores. Addressing the critical need for improved reliability and reproducibility in this field, the review rigorously analyzes 58 peer-reviewed papers published across various historical periods, with the search adapted to evolving terminology. The analysis critically assesses prevailing repertoires, computational approaches, and evaluation methodologies, highlighting significant challenges. It reveals that a substantial portion of existing research suffers from inadequate validation protocols and an over-reliance on simple accuracy metrics for often imbalanced datasets, which can undermine the credibility of attribution claims. The crucial role of robust metrics like Balanced Accuracy and rigorous cross-validation in ensuring trustworthy results is emphasized. The survey also details diverse feature representations and the evolution of machine learning models employed. Notable real-world authorship attribution cases, such as those involving works attributed to Bach, Josquin Desprez, and Lennon-McCartney, are specifically discussed, illustrating the opportunities and pitfalls of applying computational techniques to resolve disputed musical provenance. Based on these insights, a set of actionable guidelines for future research is proposed. These recommendations are designed to significantly enhance the reliability, reproducibility, and musicological validity of composer identification and authorship attribution studies, fostering more robust and interpretable computational stylistic analysis.
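The point about metrics is easy to demonstrate: on an imbalanced toy dataset, plain accuracy and balanced accuracy under stratified cross-validation can tell quite different stories. The classifier and data below are illustrative only, not drawn from the surveyed studies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data standing in for per-piece feature vectors and composer labels.
X, y = make_classification(n_samples=400, n_features=30, weights=[0.85, 0.15], random_state=0)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

plain = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
balanced = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
print(plain.mean(), balanced.mean())   # plain accuracy flatters the majority class
```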
Authors: Hyoung Suk Park, Kiwan Jeon
Abstract: Implicit neural representation (INR), particularly in combination with hash encoding, has recently emerged as a promising approach for computed tomography (CT) image reconstruction. However, directly applying INR techniques to 3D dental cone-beam CT (CBCT) with a truncated field of view (FOV) is challenging. During the training process, if the FOV does not fully encompass the patient's head, a discrepancy arises between the measured projections and the forward projections computed within the truncated domain. This mismatch leads the network to estimate attenuation values inaccurately, producing severe artifacts in the reconstructed images. In this study, we propose a computationally efficient INR-based reconstruction framework that leverages multi-resolution hash encoding for 3D dental CBCT with a truncated FOV. To mitigate truncation artifacts, we train the network over an expanded reconstruction domain that fully encompasses the patient's head. For computational efficiency, we adopt an adaptive training strategy that uses a multi-resolution grid: finer resolution levels and denser sampling inside the truncated FOV, and coarser resolution levels with sparser sampling outside. To maintain consistent input dimensionality of the network across spatially varying resolutions, we introduce an adaptive hash encoder that selectively activates the lower-level features of the hash hierarchy for points outside the truncated FOV. The proposed method with an extended FOV effectively mitigates truncation artifacts. Compared with a naive domain extension using fixed resolution levels and a fixed sampling rate, the adaptive strategy reduces computational time by over 60% for an image volume of 800x800x600, while preserving the PSNR within the truncated FOV.
Authors: Fangwei Hao, Ji Du, Desheng Kong, Jiesheng Wu, Jing Xu, Ping Li
Abstract: In recent years, the performance of lightweight Single-Image Super-Resolution (SISR) has been improved significantly with the application of Convolutional Neural Networks (CNNs) and Large Kernel Attention (LKA). However, existing information distillation modules for lightweight SISR struggle to map inputs into High-Dimensional Non-Linear (HDNL) feature spaces, limiting their representation learning. And their LKA modules possess restricted ability to capture the multi-shape multi-scale information for long-range dependencies while encountering a quadratic increase in the computational burden with increasing convolutional kernel size of its depth-wise convolutional layer. To address these issues, we firstly propose a Star Distillation Module (SDM) to enhance the discriminative representation learning via information distillation in the HDNL feature spaces. Besides, we present a Multi-shape Multi-scale Large Kernel Attention (MM-LKA) module to learn representative long-range dependencies while incurring low computational and memory footprints, leading to improving the performance of CNN-based self-attention significantly. Integrating SDM and MM-LKA, we develop a Residual Star Distillation Attention Module (RSDAM) and take it as the building block of the proposed efficient Star Distillation Attention Network (SDAN) which possesses high reconstruction efficiency to recover a higher-quality image from the corresponding low-resolution (LR) counterpart. When compared with other lightweight state-of-the-art SISR methods, extensive experiments show that our SDAN with low model complexity yields superior performance quantitatively and visually.
Authors: Hongjun An, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Abstract: Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
Authors: Catalin E. Brita, Hieu Nguyen, Lohithsai Yadala Chanchu, Domonkos Nagy, Maksim Zhdanov
Abstract: Self-attention scales quadratically with input size, limiting its use for large-scale physical systems. Although sparse attention mechanisms provide a viable alternative, they are primarily designed for regular structures such as text or images, making them inapplicable for irregular geometries. In this work, we present Ball Sparse Attention (BSA), which adapts Native Sparse Attention (NSA) (Yuan et al., 2025) to unordered point sets by imposing regularity using the Ball Tree structure from the Erwin Transformer (Zhdanov et al., 2025). We modify NSA's components to work with ball-based neighborhoods, yielding a global receptive field at sub-quadratic cost. On an airflow pressure prediction task, we achieve accuracy comparable to Full Attention while significantly reducing the theoretical computational complexity. Our implementation is available at https://github.com/britacatalin/bsa.
Authors: Ejafa Bassam, Dawei Zhu, Kaigui Bian
Abstract: Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, augmenting cross-entropy with a distillation term has become the de facto approach. Typically this term is either a KL divergence matching marginal probabilities or a correlation-based loss capturing intra- and inter-class relationships, but in every case it sits as an add-on to cross-entropy with its own weight that must be carefully tuned. In this paper we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce Plackett-Luce Distillation (PLD), a weighted list-wise ranking loss in which the teacher model transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single teacher-optimal ranking of the true label first, followed by the remaining classes in descending teacher confidence, yielding a convex, translation-invariant surrogate that subsumes weighted cross-entropy. Empirically on standard image classification benchmarks, PLD improves Top-1 accuracy by an average of +0.42% over DIST (arXiv:2205.10536) and +1.04% over KD (arXiv:1503.02531) in homogeneous settings and by +0.48% and +1.09% over DIST and KD, respectively, in heterogeneous settings.
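A sketch of a weighted Plackett-Luce list-wise loss in this spirit is shown below. The per-position weighting by teacher softmax confidence and other details are assumptions inferred from the abstract, not the exact PLD definition.

```python
import torch

def pld_like_loss(student_logits, teacher_logits, labels):
    """Weighted Plackett-Luce list-wise loss sketch (weighting scheme is an assumption).

    Ranking: true label first, then remaining classes by descending teacher confidence.
    Each ranked choice is weighted by the teacher's softmax confidence in that class.
    """
    B, C = student_logits.shape
    t_prob = teacher_logits.softmax(dim=-1)

    # Build the teacher-optimal permutation per sample.
    order = teacher_logits.argsort(dim=-1, descending=True)      # classes by teacher score
    not_label = order != labels.unsqueeze(1)
    rest = order[not_label].view(B, C - 1)                       # drop the true label
    ranking = torch.cat([labels.unsqueeze(1), rest], dim=1)      # true label first

    s = student_logits.gather(1, ranking)                        # student scores in ranking order
    w = t_prob.gather(1, ranking)                                # per-position teacher confidence

    # -log P(ranking) under Plackett-Luce: at step k, item k competes with all later items;
    # logcumsumexp over the reversed list gives log-sum-exp of the remaining candidates.
    denom = torch.logcumsumexp(s.flip(-1), dim=-1).flip(-1)
    nll = w * (denom - s)                                        # weighted per-position NLL
    return nll.sum(dim=-1).mean()

# Toy usage
torch.manual_seed(0)
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = pld_like_loss(student, teacher, labels)
loss.backward()
print(float(loss))
```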
Authors: Pranay Gupta, Henny Admoni, Andrea Bajcsy
Abstract: End-to-end visuomotor policies trained using behavior cloning have shown a remarkable ability to generate complex, multi-modal low-level robot behaviors. However, at deployment time, these policies still struggle to act reliably when faced with out-of-distribution (OOD) visuals induced by objects, backgrounds, or environment changes. Prior works in interactive imitation learning solicit corrective expert demonstrations under the OOD conditions -- but this can be costly and inefficient. We observe that task success under OOD conditions does not always warrant novel robot behaviors. In-distribution (ID) behaviors can directly be transferred to OOD conditions that share functional similarities with ID conditions. For example, behaviors trained to interact with in-distribution (ID) pens can apply to interacting with a visually-OOD pencil. The key challenge lies in disambiguating which ID observations functionally correspond to the OOD observation for the task at hand. We propose that an expert can provide this OOD-to-ID functional correspondence. Thus, instead of collecting new demonstrations and re-training at every OOD encounter, our method: (1) detects the need for feedback by first checking if current observations are OOD and then identifying whether the most similar training observations show divergent behaviors, (2) solicits functional correspondence feedback to disambiguate between those behaviors, and (3) intervenes on the OOD observations with the functionally corresponding ID observations to perform deployment-time generalization. We validate our method across diverse real-world robotic manipulation tasks with a Franka Panda robotic manipulator. Our results show that test-time functional correspondences can improve the generalization of a vision-based diffusion policy to OOD objects and environment conditions with low feedback.
Authors: Ali Zafari, Xi Chen, Shirin Jalali
Abstract: Zero-shot denoising aims to denoise observations without access to training samples or clean reference images. This setting is particularly relevant in practical imaging scenarios involving specialized domains such as medical imaging or biology. In this work, we propose the Zero-Shot Neural Compression Denoiser (ZS-NCD), a novel denoising framework based on neural compression. ZS-NCD treats a neural compression network as an untrained model, optimized directly on patches extracted from a single noisy image. The final reconstruction is then obtained by aggregating the outputs of the trained model over overlapping patches. Thanks to the built-in entropy constraints of compression architectures, our method naturally avoids overfitting and does not require manual regularization or early stopping. Through extensive experiments, we show that ZS-NCD achieves state-of-the-art performance among zero-shot denoisers for both Gaussian and Poisson noise, and generalizes well to both natural and non-natural images. Additionally, we provide new finite-sample theoretical results that characterize upper bounds on the achievable reconstruction error of general maximum-likelihood compression-based denoisers. These results further establish the theoretical foundations of compression-based denoising. Our code is available at: github.com/Computational-Imaging-RU/ZS-NCDenoiser.
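One mechanical piece of such patch-wise pipelines, averaging per-patch outputs over overlapping patches, is sketched below with an identity model standing in for the trained compression network; the patch size and stride are arbitrary choices, not the paper's settings.

```python
import numpy as np

def aggregate_overlapping_patches(patches, coords, shape, patch):
    """Average per-pixel outputs of a patch-wise model over overlapping patches."""
    acc, cnt = np.zeros(shape), np.zeros(shape)
    for p, (i, j) in zip(patches, coords):
        acc[i:i + patch, j:j + patch] += p
        cnt[i:i + patch, j:j + patch] += 1
    return acc / np.maximum(cnt, 1)

# Toy run: identity "denoiser" on 16x16 patches with stride 8.
rng = np.random.default_rng(0)
noisy = rng.normal(size=(64, 64))
patch, stride = 16, 8
coords = [(i, j) for i in range(0, 64 - patch + 1, stride)
                 for j in range(0, 64 - patch + 1, stride)]
outs = [noisy[i:i + patch, j:j + patch] for (i, j) in coords]   # stand-in for model outputs
recon = aggregate_overlapping_patches(outs, coords, noisy.shape, patch)
print(np.allclose(recon, noisy))   # identity model reproduces the input
```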
Authors: Hu Xu, Yang Jingling, Jia Sihan, Bi Yuda, Calhoun Vince
Abstract: Generative models based on deep learning have shown significant potential in medical imaging, particularly for modality transformation and multimodal fusion in MRI-based brain imaging. This study introduces GM-LDM, a novel framework that leverages the latent diffusion model (LDM) to enhance the efficiency and precision of MRI generation tasks. GM-LDM integrates a 3D autoencoder, pre-trained on the large-scale ABCD MRI dataset, achieving statistical consistency through KL divergence loss. We employ a Vision Transformer (ViT)-based encoder-decoder as the denoising network to optimize generation quality. The framework flexibly incorporates conditional data, such as functional network connectivity (FNC) data, enabling personalized brain imaging, biomarker identification, and functional-to-structural information translation for brain diseases like schizophrenia.
Authors: Garima Jain, Ravi Kant Gupta, Priyansh Jain, Abhijeet Patil, Ardhendu Sekhar, Gajendra Smeeta, Sanghamitra Pati, Amit Sethi
Abstract: In this study, we propose a robust methodology for identification of myeloid blasts followed by prediction of genetic mutation in single-cell images of blasts, tackling challenges associated with label accuracy and data noise. We trained an initial binary classifier to distinguish between leukemic (blast) and non-leukemic cell images, achieving 90 percent accuracy. To evaluate the model's generalization, we applied this model to a separate large unlabeled dataset and validated the predictions with two haemato-pathologists, finding an approximate error rate of 20 percent in the leukemic and non-leukemic labels. Assuming this level of label noise, we further trained a four-class model on images predicted as blasts to classify specific mutations. The mutation labels were known for only a bag of cell images extracted from a single slide. Despite the tumor label noise, our mutation classification model achieved 85 percent accuracy across four mutation classes, demonstrating resilience to label inconsistencies. This study highlights the capability of machine learning models to work with noisy labels effectively while providing accurate, clinically relevant mutation predictions, which is promising for diagnostic applications in areas such as haemato-pathology.
Authors: Zhelun Shen, Chenming Wu, Junsheng Zhou, Chen Zhao, Kaisiyuan Wang, Hang Zhou, Yingying Li, Haocheng Feng, Wei He, Jingdong Wang
Abstract: Digital human video generation is gaining traction in fields like education and e-commerce, driven by advancements in head-body animation and lip-syncing technologies. However, realistic Hand-Object Interaction (HOI) - the complex dynamics between human hands and objects - continues to pose challenges. Generating natural and believable HOI reenactments is difficult due to issues such as occlusion between hands and objects, variations in object shapes and orientations, the necessity for precise physical interactions, and, importantly, the need to generalize to unseen humans and objects. This paper presents a novel framework iDiT-HOI that enables in-the-wild HOI reenactment generation. Specifically, we propose a unified inpainting-based token process method, called Inp-TPU, with a two-stage video diffusion transformer (DiT) model. The first stage generates a key frame by inserting the designated object into the hand region, providing a reference for subsequent frames. The second stage ensures temporal coherence and fluidity in hand-object interactions. The key contribution of our method is to reuse the pretrained model's context perception capabilities without introducing additional parameters, enabling strong generalization to unseen objects and scenarios, and our proposed paradigm naturally supports long video generation. Comprehensive evaluations demonstrate that our approach outperforms existing methods, particularly in challenging real-world scenes, offering enhanced realism and more seamless hand-object interactions.
Authors: Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Xu-Yao Zhang, Cheng-Lin Liu
Abstract: The rapid advancement of generative models has enabled modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models remain fundamentally constrained by catastrophic forgetting - a persistent challenge where adapting to new tasks typically leads to significant degradation in performance on previously learned tasks. To address this practical limitation, numerous approaches have been proposed to enhance the adaptability and scalability of generative models in real-world applications. In this work, we present a comprehensive survey of continual learning methods for mainstream generative models, including large language models, multimodal large language models, vision language action models, and diffusion models. Drawing inspiration from the memory mechanisms of the human brain, we systematically categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based methods, while elucidating their underlying methodologies and motivations. We further analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones, offering deeper insights into the field. The project page of this paper is available at https://github.com/Ghy0501/Awesome-Continual-Learning-in-Generative-Models.
Authors: Pengfei Wang, Qiujie Dong, Fangtian Liang, Hao Pan, Lei Yang, Congyi Zhang, Guying Lin, Caiming Zhang, Yuanfeng Zhou, Changhe Tu, Shiqing Xin, Alla Sheffer, Xin Li, Wenping Wang
Abstract: Neural implicit shape representation has drawn significant attention in recent years due to its smoothness, differentiability, and topological flexibility. However, directly modeling the shape of a neural implicit surface, especially as the zero-level set of a neural signed distance function (SDF), with sparse geometric control is still a challenging task. Sparse input shape control typically includes 3D curve networks or, more generally, 3D curve sketches, which are unstructured, cannot be connected to form a curve network, and are therefore more difficult to handle. While 3D curve networks or curve sketches provide intuitive shape control, their sparsity and varied topology pose challenges in generating high-quality surfaces to meet such curve constraints. In this paper, we propose NeuVAS, a variational approach to shape modeling using neural implicit surfaces constrained under sparse input shape control, including unstructured 3D curve sketches as well as connected 3D curve networks. Specifically, we introduce a smoothness term based on a functional of surface curvatures to minimize shape variation of the zero-level set surface of a neural SDF. We also develop a new technique to faithfully model G0 sharp feature curves as specified in the input curve sketches. Comprehensive comparisons with the state-of-the-art methods demonstrate the significant advantages of our method.
Authors: Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma
Abstract: Recent advancements in large language models (LLMs) have witnessed a surge in the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods solely employing reinforcement learning (RL) can struggle with sample inefficiency and activating entirely absent reasoning capabilities, while conventional pipelines that initiate with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and face suboptimal convergence. In this work, we introduce \textbf{Metis-RISE} (\textbf{R}L \textbf{I}ncentivizes and \textbf{S}FT \textbf{E}nhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE distinctively omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. Subsequently, the targeted SFT stage addresses two key challenges identified during RL: (1) \textit{inefficient trajectory sampling} for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) \textit{fundamental capability absence}, which we address by injecting expert-augmented knowledge for prompts where the model entirely fails. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall.
Authors: Zhanhua Xin, Zhihao Wang, Shenghao Zhang, Wanchao Chi, Yan Meng, Shihan Kong, Yan Xiong, Chong Zhang, Yuzhen Liu, Junzhi Yu
Abstract: In the field of multi-sensor fusion for simultaneous localization and mapping (SLAM), monocular cameras and IMUs are widely used to build simple and effective visual-inertial systems. However, limited research has explored the integration of motor-encoder devices to enhance SLAM performance. By incorporating such devices, it is possible to significantly improve active capability and field of view (FOV) with minimal additional cost and structural complexity. This paper proposes a novel visual-inertial-encoder tightly coupled odometry (VIEO) based on a ViDAR (Video Detection and Ranging) device. A ViDAR calibration method is introduced to ensure accurate initialization for VIEO. In addition, a platform motion decoupled active SLAM method based on deep reinforcement learning (DRL) is proposed. Experimental data demonstrate that the proposed ViDAR and the VIEO algorithm significantly increase cross-frame co-visibility relationships compared to its corresponding visual-inertial odometry (VIO) algorithm, improving state estimation accuracy. Additionally, the DRL-based active SLAM algorithm, with the ability to decouple from platform motion, can increase the diversity weight of the feature points and further enhance the VIEO algorithm's performance. The proposed methodology sheds fresh insights into both the updated platform design and decoupled approach of active SLAM systems in complex environments.
Authors: Ting Qiao, Yiming Li, Jianbin Li, Yingjia Wang, Leyi Qi, Junfeng Guo, Ruili Feng, Dacheng Tao
Abstract: Deep neural networks (DNNs) rely heavily on high-quality open-source datasets (e.g., ImageNet) for their success, making dataset ownership verification (DOV) crucial for protecting public dataset copyrights. In this paper, we find existing DOV methods (implicitly) assume that the verification process is faithful, where the suspicious model will directly verify ownership by using the verification samples as input and returning their results. However, this assumption may not necessarily hold in practice and their performance may degrade sharply when subjected to intentional or unintentional perturbations. To address this limitation, we propose the first certified dataset watermark (i.e., CertDW) and CertDW-based certified dataset ownership verification method that ensures reliable verification even under malicious attacks, under certain conditions (e.g., constrained pixel-level perturbation). Specifically, inspired by conformal prediction, we introduce two statistical measures, including principal probability (PP) and watermark robustness (WR), to assess model prediction stability on benign and watermarked samples under noise perturbations. We prove there exists a provable lower bound between PP and WR, enabling ownership verification when a suspicious model's WR value significantly exceeds the PP values of multiple benign models trained on watermark-free datasets. If the number of PP values smaller than WR exceeds a threshold, the suspicious model is regarded as having been trained on the protected dataset. Extensive experiments on benchmark datasets verify the effectiveness of our CertDW method and its resistance to potential adaptive attacks. Our codes are at \href{https://github.com/NcepuQiaoTing/CertDW}{GitHub}.
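The final decision rule described in the abstract reduces to a simple count. The sketch below encodes that comparison; the PP and WR numbers and the threshold are made up for illustration.

```python
import numpy as np

def verify_ownership(wr_suspicious, pp_benign, tau):
    """Count benign-model PP values that the suspicious model's WR exceeds;
    flag the model if the count reaches the threshold (values are illustrative)."""
    exceed = int(np.sum(np.asarray(pp_benign) < wr_suspicious))
    return exceed >= tau, exceed

# Toy numbers: PP of several watermark-free benign models vs. WR of a suspicious model.
pp_benign = [0.41, 0.38, 0.45, 0.40, 0.43]
flag, count = verify_ownership(wr_suspicious=0.78, pp_benign=pp_benign, tau=4)
print(flag, count)
```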
Authors: Yibo Yang, Sihao Liu, Chuan Rao, Bang An, Tiancheng Shen, Philip H. S. Torr, Ming-Hsuan Yang, Bernard Ghanem
Abstract: Conventional low-rank adaptation methods build adapters without considering data context, leading to sub-optimal fine-tuning performance and severe forgetting of inherent world knowledge. In this paper, we propose context-oriented decomposition adaptation (CorDA), a novel method that initializes adapters in a task-aware manner. Concretely, we develop context-oriented singular value decomposition, where we collect covariance matrices of input activations for each linear layer using sampled data from the target task, and apply SVD to the product of weight matrix and its corresponding covariance matrix. By doing so, the task-specific capability is compacted into the principal components. Thanks to the task awareness, our method enables two optional adaptation modes, knowledge-preserved mode (KPM) and instruction-previewed mode (IPM), providing flexibility to choose between freezing the principal components to preserve their associated knowledge or adapting them to better learn a new task. We further develop CorDA++ by deriving a metric that reflects the compactness of task-specific principal components, and then introducing dynamic covariance selection and dynamic rank allocation strategies based on the same metric. The two strategies provide each layer with the most representative covariance matrix and a proper rank allocation. Experimental results show that CorDA++ outperforms CorDA by a significant margin. CorDA++ in KPM not only achieves better fine-tuning performance than LoRA, but also mitigates the forgetting of pre-trained knowledge in both large language models and vision language models. For IPM, our method exhibits faster convergence, \emph{e.g.,} 4.5x speedup over QLoRA, and improves adaptation performance in various scenarios, outperforming strong baseline methods. Our method has been integrated into the PEFT library developed by Hugging Face.
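The context-oriented decomposition step can be sketched directly: form the covariance of sampled activations, take the SVD of its product with the weight, and keep the leading components. The rank and how the factors would be used downstream are illustrative assumptions, not the full CorDA/CorDA++ procedure.

```python
import torch

def context_oriented_svd(W, X, rank):
    """Sketch: SVD of W @ C, where C is the input covariance from task samples.

    W: (out_dim, in_dim) weight of a linear layer; X: (n, in_dim) sampled activations.
    Returns low-rank factors capturing task-specific principal components (details illustrative).
    """
    C = (X.T @ X) / X.shape[0]                 # input covariance estimated from task samples
    U, S, Vh = torch.linalg.svd(W @ C, full_matrices=False)
    A = U[:, :rank] * S[:rank]                 # (out_dim, rank) scaled principal directions
    B = Vh[:rank]                              # (rank, in_dim)
    return A, B

torch.manual_seed(0)
W = torch.randn(256, 512)
X = torch.randn(1024, 512)                     # activations sampled from the target task
A, B = context_oriented_svd(W, X, rank=16)
print(A.shape, B.shape)
```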
Authors: Bikram Keshari Parida, Anusree P. Sunilkumar, Abhijit Sen, Wonsang You
Abstract: Dental diagnosis relies on two primary imaging modalities: panoramic radiographs (PX) providing 2D oral cavity representations, and Cone-Beam Computed Tomography (CBCT) offering detailed 3D anatomical information. While PX images are cost-effective and accessible, their lack of depth information limits diagnostic accuracy. CBCT addresses this but presents drawbacks including higher costs, increased radiation exposure, and limited accessibility. Existing reconstruction models further complicate the process by requiring CBCT flattening or prior dental arch information, often unavailable clinically. We introduce ViT-NeBLa, a vision transformer-based Neural Beer-Lambert model enabling accurate 3D reconstruction directly from single PX. Our key innovations include: (1) enhancing the NeBLa framework with Vision Transformers for improved reconstruction capabilities without requiring CBCT flattening or prior dental arch information, (2) implementing a novel horseshoe-shaped point sampling strategy with non-intersecting rays that eliminates intermediate density aggregation required by existing models due to intersecting rays, reducing sampling point computations by $52 \%$, (3) replacing CNN-based U-Net with a hybrid ViT-CNN architecture for superior global and local feature extraction, and (4) implementing learnable hash positional encoding for better higher-dimensional representation of 3D sample points compared to existing Fourier-based dense positional encoding. Experiments demonstrate that ViT-NeBLa significantly outperforms prior state-of-the-art methods both quantitatively and qualitatively, offering a cost-effective, radiation-efficient alternative for enhanced dental diagnostics.
Authors: Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe
Abstract: Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE, mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy--particularly under context length extrapolation--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.
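A toy version of the core idea, representing an integer position as a digit sequence and encoding it with a small recurrent network, is sketched below. The base, digit count, and GRU encoder are assumptions for illustration, not SeqPE's actual architecture or training objectives.

```python
import torch
import torch.nn as nn

class SeqPosEncoder(nn.Module):
    """Toy sequential position encoder: a position index becomes a digit sequence,
    which a small GRU embeds (architecture and base are illustrative assumptions)."""
    def __init__(self, dim=64, base=10, max_digits=6):
        super().__init__()
        self.base, self.max_digits = base, max_digits
        self.digit_emb = nn.Embedding(base, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def digits(self, pos):
        out = []
        for _ in range(self.max_digits):        # fixed-length, least-significant digit first
            out.append(pos % self.base)
            pos = pos // self.base
        return torch.stack(out, dim=-1)

    def forward(self, positions):               # positions: (n,) integer indices
        seq = self.digit_emb(self.digits(positions))      # (n, max_digits, dim)
        _, h = self.rnn(seq)
        return h.squeeze(0)                               # (n, dim) position embeddings

enc = SeqPosEncoder()
emb = enc(torch.arange(0, 5000, 1000))          # handles indices far beyond typical training lengths
print(emb.shape)
```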
Authors: Salah Ghamizi, Georgia Kanli, Yu Deng, Magali Perquin, Olivier Keunen
Abstract: Foundation models (FMs), large neural networks pretrained on extensive and diverse datasets, have revolutionized artificial intelligence and shown significant promise in medical imaging by enabling robust performance with limited labeled data. Although numerous surveys have reviewed the application of FMs in healthcare, brain imaging remains underrepresented, despite its critical role in the diagnosis and treatment of neurological diseases using modalities such as MRI, CT, and PET. Existing reviews either marginalize brain imaging or lack depth on the unique challenges and requirements of FMs in this domain, such as multimodal data integration, support for diverse clinical tasks, and handling of heterogeneous, fragmented datasets. To address this gap, we present the first comprehensive and curated review of FMs for brain imaging. We systematically analyze 161 brain imaging datasets and 86 FM architectures, providing information on key design choices, training paradigms, and optimizations driving recent advances. Our review highlights the leading models for various brain imaging tasks, summarizes their innovations, and critically examines current limitations and blind spots in the literature. We conclude by outlining future research directions to advance FM applications in brain imaging, with the aim of fostering progress in both clinical and research settings.
Authors: Mae Younes, Adnane Boukhayma
Abstract: Gaussian Splatting has demonstrated remarkable novel view synthesis performance at high rendering frame rates. Optimization-based inverse rendering within complex capture scenarios, however, remains a challenging problem. A particular case is modelling complex surface light interactions for highly reflective scenes, which results in intricate high-frequency specular radiance components. We hypothesize that such challenging settings can benefit from increased representation power. We hence propose a method that tackles this issue through a geometrically and physically grounded Gaussian Splatting borne radiance field, where normals and material properties are spatially variable in the primitive's local space. Using per-primitive texture maps for this purpose, we also propose to harness the GPU hardware to accelerate rendering at test time via a unified material texture atlas.
Authors: Xiang Yu, Yayan Chen, Guannan He, Qing Zeng, Yue Qin, Meiling Liang, Dandan Luo, Yimei Liao, Zeyu Ren, Cheng Kang, Delong Yang, Bocheng Liang, Bin Pu, Ying Yuan, Shengli Li
Abstract: While modern segmentation models often prioritize performance over practicality, we advocate a design philosophy that prioritizes simplicity and efficiency, and pursue a high-performance segmentation model design under it. This paper presents SimpleUNet, a scalable ultra-lightweight medical image segmentation model with three key innovations: (1) A partial feature selection mechanism in skip connections for redundancy reduction while enhancing segmentation performance; (2) A fixed-width architecture that prevents exponential parameter growth across network stages; (3) An adaptive feature fusion module achieving enhanced representation with minimal computational overhead. With a record-breaking 16 KB parameter configuration, SimpleUNet outperforms LBUNet and other lightweight benchmarks across multiple public datasets. The 0.67 MB variant achieves superior efficiency (8.60 GFLOPs) and accuracy, attaining a mean DSC/IoU of 85.76%/75.60% on multi-center breast lesion datasets, surpassing both U-Net and TransUNet. Evaluations on skin lesion datasets (ISIC 2017/2018: mDice 84.86%/88.77%) and endoscopic polyp segmentation (KVASIR-SEG: 86.46%/76.48% mDice/mIoU) confirm consistent dominance over state-of-the-art models. This work demonstrates that extreme model compression need not compromise performance, providing new insights for efficient and accurate medical image segmentation. Codes can be found at https://github.com/Frankyu5666666/SimpleUNet.
Authors: Riku Takahashi, Ryugo Morita, Jinjia Zhou
Abstract: Talking head video compression has advanced with neural rendering and keypoint-based methods, but challenges remain, especially at low bit rates, including handling large head movements, suboptimal lip synchronization, and distorted facial reconstructions. To address these problems, we propose a novel audio-visual driven video codec that integrates compact 3D motion features and audio signals. This approach robustly models significant head rotations and aligns lip movements with speech, improving both compression efficiency and reconstruction quality. Experiments on the CelebV-HQ dataset show that our method reduces bitrate by 22% compared to VVC and by 8.5% over a state-of-the-art learning-based codec. Furthermore, it provides superior lip-sync accuracy and visual fidelity at comparable bitrates, highlighting its effectiveness in bandwidth-constrained scenarios.
Authors: Sai Srinivas Jeevanandam, Sandeep Inuganti, Shreedhar Govil, Didier Stricker, Jason Rambach
Abstract: Vision-based robotic object grasping is typically investigated in the context of isolated objects or unstructured object sets in bin picking scenarios. However, there are several settings, such as construction or warehouse automation, where a robot needs to interact with a structured object formation such as a stack. In this context, we define the problem of selecting suitable objects for grasping along with estimating an accurate 6DoF pose of these objects. To address this problem, we propose a camera-IMU based approach that prioritizes unobstructed objects on the higher layers of stacks and introduce a dataset for benchmarking and evaluation, along with a suitable evaluation metric that combines object selection with pose accuracy. Experimental results show that although our method can perform quite well, this is a challenging problem if a completely error-free solution is needed. Finally, we show results from the deployment of our method for a brick-picking application in a construction scenario.
Authors: Kang Chen, Bin Huang, Xuebin Yang, Junyan Zhang, Qiegen Liu
Abstract: Synthesizing high quality CT images remains a significant challenge due to the limited availability of annotated data and the complex nature of CT imaging. In this work, we present PRO, a novel framework that, to the best of our knowledge, is the first to perform CT image synthesis in the projection domain using latent diffusion models. Unlike previous approaches that operate in the image domain, PRO learns rich structural representations from raw projection data and leverages anatomical text prompts for controllable synthesis. This projection domain strategy enables more faithful modeling of underlying imaging physics and anatomical structures. Moreover, PRO functions as a foundation model, capable of generalizing across diverse downstream tasks by adjusting its generative behavior via prompt inputs. Experimental results demonstrated that incorporating our synthesized data significantly improves performance across multiple downstream tasks, including low-dose and sparse-view reconstruction, even with limited training data. These findings underscore the versatility and scalability of PRO in data generation for various CT applications. These results highlight the potential of projection domain synthesis as a powerful tool for data augmentation and robust CT imaging. Our source code is publicly available at: https://github.com/yqx7150/PRO.
Authors: Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Pål Halvorsen
Abstract: Dynamic facial emotion is essential for believable AI-generated avatars; however, most systems remain visually inert, limiting their utility in high-stakes simulations such as virtual training for investigative interviews with abused children. We introduce and evaluate a real-time architecture fusing Unreal Engine 5 MetaHuman rendering with NVIDIA Omniverse Audio2Face to translate vocal prosody into high-fidelity facial expressions on photorealistic child avatars. We implemented a distributed two-PC setup that decouples language processing and speech synthesis from GPU-intensive rendering, designed to support low-latency interaction in desktop and VR environments. A between-subjects study ($N=70$) using audio+visual and visual-only conditions assessed perceptual impacts as participants rated emotional clarity, facial realism, and empathy for two avatars expressing joy, sadness, and anger. Results demonstrate that avatars could express emotions recognizably, with sadness and joy achieving high identification rates. However, anger recognition significantly dropped without audio, highlighting the importance of congruent vocal cues for high-arousal emotions. Interestingly, removing audio boosted perceived facial realism, suggesting that audiovisual desynchrony remains a key design challenge. These findings confirm the technical feasibility of generating emotionally expressive avatars and provide guidance for improving non-verbal communication in sensitive training simulations.
Authors: Andrew Zhang, Anushka Sivakumar, Chiawei Tang, Chris Thomas
Abstract: Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce \textbf{DDOT} (\textbf{D}iscrete \textbf{D}iffusion with \textbf{O}ptimal \textbf{T}ransport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.
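The abstract does not spell out the coupling itself, but for scalar positions, optimal transport under a squared cost reduces to monotone (sorted) matching, which is one simple way a coupling can preserve relative token order. The NumPy toy below illustrates only that matching step and is not the authors' implementation.
```python
import numpy as np

def monotone_ot_coupling(noisy_pos: np.ndarray, target_pos: np.ndarray) -> np.ndarray:
    """Pair the i-th smallest noisy position with the i-th smallest target position."""
    order_noisy = np.argsort(noisy_pos)
    order_target = np.argsort(target_pos)
    return np.stack([order_noisy, order_target], axis=1)   # rows of (noisy_idx, target_idx)

noisy = np.random.randn(8)              # e.g. token positions at an intermediate diffusion step
target = np.sort(np.random.rand(8))     # clean token positions
pairs = monotone_ot_coupling(noisy, target)
```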
Authors: Gregory Bellchambers
Abstract: The success of diffusion models has driven interest in performing conditional sampling via training-free guidance of the denoising process to solve image restoration and other inverse problems. A popular class of methods, based on Diffusion Posterior Sampling (DPS), attempts to approximate the intractable posterior score function directly. In this work, we present a novel expression for the exact posterior score for purely denoising tasks that is tractable in terms of the unconditional score function. We leverage this result to analyze the time-dependent error in the DPS score for denoising tasks and compute step sizes on the fly to minimize the error at each time step. We demonstrate that these step sizes are transferable to related inverse problems such as colorization, random inpainting, and super resolution. Despite its simplicity, this approach is competitive with state-of-the-art techniques and enables sampling with fewer time steps than DPS.
Authors: Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
Abstract: The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
Authors: Bi Yuda, Jia Sihan, Gao Yutong, Abrol Anees, Fu Zening, Calhoun Vince
Abstract: Multimodal medical imaging integrates diverse data types, such as structural and functional neuroimaging, to provide complementary insights that enhance deep learning predictions and improve outcomes. This study focuses on a neuroimaging prediction framework based on both structural and functional neuroimaging data. We propose a next-generation prediction model, \textbf{MultiViT2}, which combines a pretrained representative learning base model with a vision transformer backbone for prediction output. Additionally, we developed a data augmentation module based on the latent diffusion model that enriches input data by generating augmented neuroimaging samples, thereby enhancing predictive performance through reduced overfitting and improved generalizability. We show that MultiViT2 significantly outperforms the first-generation model in schizophrenia classification accuracy and demonstrates strong scalability and portability.
Authors: Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiaoyan Sun
Abstract: Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using expert demonstrations. However, this strategy suffers from a spatio-temporal gap, resulting in considerable data inefficiency and heavy reliance on human labor. Spatially, VLMs operate within a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space; temporally, VLMs primarily interpret the present, while VLA models anticipate future actions. To overcome these challenges, we propose a novel training paradigm, ROSA, which leverages robot state estimation to improve alignment between vision-language and action spaces. By integrating robot state estimation data obtained via an automated process, ROSA enables the VLA model to gain enhanced spatial understanding and self-awareness, thereby boosting performance and generalization. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ROSA, particularly in low-data regimes.
Authors: Edward Li, Zichen Wang, Jiahe Huang, Jeong Joon Park
Abstract: We present a unified framework for solving partial differential equations (PDEs) using video-inpainting diffusion transformer models. Unlike existing methods that devise specialized strategies for either forward or inverse problems under full or partial observation, our approach unifies these tasks under a single, flexible generative framework. Specifically, we recast PDE-solving as a generalized inpainting problem, e.g., treating forward prediction as inferring missing spatiotemporal information of future states from initial conditions. To this end, we design a transformer-based architecture that conditions on arbitrary patterns of known data to infer missing values across time and space. Our method proposes pixel-space video diffusion models for fine-grained, high-fidelity inpainting and conditioning, while enhancing computational efficiency through hierarchical modeling. Extensive experiments show that our video inpainting-based diffusion model offers an accurate and versatile solution across a wide range of PDEs and problem setups, outperforming state-of-the-art baselines.
Authors: Jingwei Ma, Vivek Jayaram, Brian Curless, Ira Kemelmacher-Shlizerman, Steven M. Seitz
Abstract: We present UltraZoom, a system for generating gigapixel-resolution images of objects from casually captured inputs, such as handheld phone photos. Given a full-shot image (global, low-detail) and one or more close-ups (local, high-detail), UltraZoom upscales the full image to match the fine detail and scale of the close-up examples. To achieve this, we construct a per-instance paired dataset from the close-ups and adapt a pretrained generative model to learn object-specific low-to-high resolution mappings. At inference, we apply the model in a sliding window fashion over the full image. Constructing these pairs is non-trivial: it requires registering the close-ups within the full image for scale estimation and degradation alignment. We introduce a simple, robust method for getting registration on arbitrary materials in casual, in-the-wild captures. Together, these components form a system that enables seamless pan and zoom across the entire object, producing consistent, photorealistic gigapixel imagery from minimal input.
Authors: Zifan Zhao, Siddhant Haldar, Jinda Cui, Lerrel Pinto, Raunaq Bhirangi
Abstract: Data-driven approaches struggle with precise manipulation; imitation learning requires many hard-to-obtain demonstrations, while reinforcement learning yields brittle, non-generalizable policies. We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks by decomposing them into two phases: a reaching phase, where a vision-language model (VLM) enables scene-level reasoning to localize the object of interest, and a local interaction phase, where a reusable, scene-agnostic ViTaL policy performs contact-rich manipulation using egocentric vision and tactile sensing. This approach is motivated by the observation that while scene context varies, the low-level interaction remains consistent across task instances. By training local policies once in a canonical setting, they can generalize via a localize-then-execute strategy. ViTaL achieves around 90% success on contact-rich tasks in unseen environments and is robust to distractors. ViTaL's effectiveness stems from three key insights: (1) foundation models for segmentation enable training robust visual encoders via behavior cloning; (2) these encoders improve the generalizability of policies learned using residual RL; and (3) tactile sensing significantly boosts performance in contact-rich tasks. Ablation studies validate each of these insights, and we demonstrate that ViTaL integrates well with high-level VLMs, enabling robust, reusable low-level skills. Results and videos are available at https://vitalprecise.github.io.
Authors: Yixian Xu, Shengjie Luo, Liwei Wang, Di He, Chang Liu
Abstract: Diffusion models have achieved remarkable success in generative modeling. Despite more stable training, the loss of diffusion models is not indicative of absolute data-fitting quality, since its optimal value is typically not zero but unknown, leading to confusion between large optimal loss and insufficient model capacity. In this work, we advocate the need to estimate the optimal loss value for diagnosing and improving diffusion models. We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models.
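As a toy illustration of why the optimal loss is computable in principle: under the standard DDPM noise-prediction parameterization and an empirical (finite) data distribution, the MMSE noise predictor at a given timestep is a softmax-weighted average over the dataset, and the optimal loss is the residual error of that predictor. The Monte Carlo sketch below shows this for a small 2-D dataset; the paper's closed-form derivation and its variance/bias-controlled estimators are more general than this.
```python
import torch

def optimal_loss_at_t(data: torch.Tensor, abar_t: float, n_mc: int = 256) -> float:
    """Monte Carlo estimate of the optimal noise-prediction loss at one timestep.

    data: (N, D) training set treated as the (empirical) data distribution.
    abar_t: cumulative alpha-bar of the forward process at this timestep.
    """
    N, D = data.shape
    x0 = data[torch.randint(0, N, (n_mc,))]
    eps = torch.randn(n_mc, D)
    xt = abar_t ** 0.5 * x0 + (1 - abar_t) ** 0.5 * eps
    # Posterior over dataset points: p(x0_i | xt) ∝ N(xt; sqrt(abar) x0_i, (1 - abar) I)
    d2 = torch.cdist(xt, abar_t ** 0.5 * data) ** 2                # (n_mc, N)
    w = torch.softmax(-d2 / (2 * (1 - abar_t)), dim=1)
    x0_mmse = w @ data                                             # E[x0 | xt]
    eps_mmse = (xt - abar_t ** 0.5 * x0_mmse) / (1 - abar_t) ** 0.5
    return ((eps - eps_mmse) ** 2).sum(dim=1).mean().item()        # irreducible loss at this t

toy = torch.randn(512, 2)                                          # toy 2-D dataset
print(optimal_loss_at_t(toy, abar_t=0.5))
```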
Authors: Siyuan Li, Zedong Wang, Zicheng Liu, Cheng Tan, Haitao Lin, Di Wu, Zhiyuan Chen, Jiangbin Zheng, Stan Z. Li
Abstract: By contextualizing the kernel as globally as possible, modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals the representation bottleneck of modern ConvNets, where the expressive interactions have not been effectively encoded with the increased kernel size. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D&3D human pose estimation, and video prediction. Notably, MogaNet hits 80.0% and 87.8% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L, while saving 59% FLOPs and 17M parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.
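For intuition, a gated-aggregation block in the spirit described above can be as simple as depthwise context convolutions modulated by a learned gate, as in the illustrative module below; the exact MogaNet block (multi-order decomposition, channel aggregation, etc.) differs from this sketch.
```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Depthwise context convolutions modulated by a learned gate (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, kernel_size=1)
        self.context = nn.Sequential(                    # cheap multi-scale spatial context
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),
            nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim),
        )
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.act(self.gate(x)) * self.context(x))

block = GatedAggregation(64)
out = block(torch.randn(2, 64, 56, 56))                  # output keeps the input shape
```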
Authors: Mahdi Ghafourian, Ruben Vera-Rodriguez, Julian Fierrez, Bilgesu Sumer, Ruben Tolosana, Aythami Moralez, Els Kindt
Abstract: Biometric recognition as an efficient and hard-to-forge way of identification and verification has become an indispensable part of the current digital world. The fast evolution of this technology has been a strong incentive for integration into many applications. Meanwhile, blockchain, the decentralized ledger technology, has been widely received by both research and industry in the past few years, and it is being increasingly deployed today in many different applications, such as money transfer, IoT, healthcare, or logistics. Recently, researchers have started to speculate on the pros and cons and what the best applications would be when these two technologies cross paths. This paper provides a survey of the research literature on the combination of blockchain and biometrics and includes a first legal analysis of this integration based on GDPR to shed light on challenges and potentials. Although the integration of blockchain technology into the biometric sector is still in its infancy, with a growing body of literature discussing specific applications and advanced technological setups, this paper aims to provide a holistic understanding of blockchain applicability in biometrics. Based on published studies, this article discusses, among others, practical examples combining blockchain and biometrics for novel applications in PKI systems, distributed trusted services, and identity management. Challenges and limitations when combining blockchain and biometrics that motivate future work will also be discussed; e.g., blockchain networks at their current stage may not be efficient or economical for some real-time biometric applications. Finally, we also discuss key legal aspects of the EU General Data Protection Regulation (GDPR) related to this combination of technologies (blockchain and biometrics); for example, accountability, immutability, anonymity, and data protection elements.
Authors: Yingjun Du, Jiayi Shen, Xiantong Zhen, Cees G. M. Snoek
Abstract: Modern image classifiers perform well on populated classes, while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. The consistent state-of-the-art experiments on the long-tailed CIFAR-100, ImageNet, Places and iNaturalist demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions.
Authors: George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C. V. Jawahar
Abstract: Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, as textual cues typically appear only for a short time span and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of $3,222$ driving videos collected from multiple countries, annotated with $10,500$ questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset, highlighting the significant potential for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering. The dataset is available at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtextvqa
URLs: http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtextvqa
Authors: Filipos Gouidis, Konstantinos Papoutsakis, Theodore Patkos, Antonis Argyros, Dimitris Plexousakis
Abstract: We investigate the problem of Object State Classification (OSC) as a zero-shot learning problem. Specifically, we propose the first Object-agnostic State Classification (OaSC) method that infers the state of a certain object without relying on the knowledge or the estimation of the object class. In that direction, we capitalize on Knowledge Graphs (KGs) for structuring and organizing knowledge, which, in combination with visual information, enable the inference of the states of objects in object/state pairs that have not been encountered in the method's training set. A series of experiments investigate the performance of the proposed method in various settings, against several hypotheses and in comparison with state of the art approaches for object attribute classification. The experimental results demonstrate that the knowledge of an object class is not decisive for the prediction of its state. Moreover, the proposed OaSC method outperforms existing methods in all datasets and benchmarks by a great margin.
Authors: Hao Chen, Haoran Zhou, Yunshu Zhang, Zheng Lin, Yongjian Deng
Abstract: In the RGB-D vision community, extensive research has been focused on designing multi-modal learning strategies and fusion structures. However, the complementary and fusion mechanisms in RGB-D models remain a black box. In this paper, we present an analytical framework and a novel score to dissect RGB-D models. Our approach involves measuring the proposed semantic variance and feature similarity across modalities and levels, and conducting visual and quantitative analyses of multi-modal learning through comprehensive experiments. Specifically, we investigate the consistency and specialty of features across modalities, evolution rules within each modality, and the collaboration logic used when optimizing an RGB-D model. Our studies reveal/verify several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and specialty simultaneously for complementary inference. We also showcase the versatility of the proposed RGB-D dissection method and introduce a straightforward fusion strategy based on our findings, which delivers significant enhancements across various tasks and even other multi-modal data.
Authors: Liao Pan, Yang Feng, Zhao Wenhui, Yua Jinwen, Zhang Dingwen
Abstract: Decoder-only methods, such as GPT, have demonstrated superior performance in many areas compared to traditional encoder-decoder transformer methods. Over the years, end-to-end methods based on the traditional transformer structure, like MOTR, have achieved remarkable performance in multi-object tracking. However, the substantial computational resource consumption of these methods, coupled with the optimization challenges posed by dynamic data, results in less favorable inference speeds and training times. To address these issues, this paper optimizes the network architecture and proposes an effective training strategy to mitigate the problem of prolonged training times, thereby developing DecoderTrack, a novel end-to-end tracking method. Subsequently, to tackle the optimization challenges arising from dynamic data, this paper introduces DecoderTrack+ by incorporating a Fixed-Size Query Memory and refining certain attention layers. Our methods, without any bells and whistles, outperform MOTR on multiple benchmarks, with inference speeds 2.06 and 3.03 times faster than MOTR, respectively.
Authors: Xudong Li, Timin Gao, Runze Hu, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Jingyuan Zheng, Yunhang Shen, Ke Li, Yutao Liu, Pingyang Dai, Rongrong Ji
Abstract: The current state-of-the-art No-Reference Image Quality Assessment (NR-IQA) methods typically rely on feature extraction from upstream semantic backbone networks, assuming that all extracted features are relevant. However, we make a key observation that not all features are beneficial, and some may even be harmful, necessitating careful selection. Empirically, we find that many image pairs with small feature spatial distances can have vastly different quality scores, indicating that the extracted features may contain a significant amount of quality-irrelevant noise. To address this issue, we propose a Quality-Aware Feature Matching IQA Metric (QFM-IQM) that employs an adversarial perspective to remove harmful semantic noise features from the upstream task. Specifically, QFM-IQM enhances its ability to distinguish semantic noise by matching image pairs with similar quality scores but varying semantic features as adversarial semantic noise and adaptively adjusting the upstream task's features by reducing sensitivity to adversarial noise perturbation. Furthermore, we utilize a distillation framework to expand the dataset and improve the model's generalization ability. Our approach achieves superior performance to the state-of-the-art NR-IQA methods on eight standard IQA datasets.
Authors: Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu
Abstract: With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
URLs: https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
Authors: Songlin Fan, Wei Gao, Zhineng Chen, Ge Li, Guoqing Liu, Qicheng Wang
Abstract: The evolution of point cloud processing algorithms necessitates an accurate assessment for their quality. Previous works consistently regard point cloud quality assessment (PCQA) as a MOS regression problem and devise a deterministic mapping, ignoring the stochasticity in generating MOS from subjective tests. This work presents the first probabilistic architecture for no-reference PCQA, motivated by the labeling process of existing datasets. The proposed method can model the quality judging stochasticity of subjects through a tailored conditional variational autoencoder (CVAE) and produces multiple intermediate quality ratings. These intermediate ratings simulate the judgments from different subjects and are then integrated into an accurate quality prediction, mimicking the generation process of a ground truth MOS. Specifically, our method incorporates a Prior Module, a Posterior Module, and a Quality Rating Generator, where the former two modules are introduced to model the judging stochasticity in subjective tests, while the latter is developed to generate diverse quality ratings. Extensive experiments indicate that our approach outperforms previous cutting-edge methods by a large margin and exhibits gratifying cross-dataset robustness. Codes are available at https://git.openi.org.cn/OpenPointCloud/nrpcqa.
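A minimal CVAE sketch of the prior/posterior/rating-generator structure described above is given below, operating on a precomputed point-cloud feature vector. The module sizes, the rating head, and the inference-time averaging over sampled latents are illustrative assumptions rather than the paper's architecture.
```python
import torch
import torch.nn as nn

class RatingCVAE(nn.Module):
    """Prior/posterior over a latent 'subject' variable plus a quality rating generator."""
    def __init__(self, feat_dim: int = 256, z_dim: int = 16):
        super().__init__()
        self.prior = nn.Linear(feat_dim, 2 * z_dim)               # p(z | point-cloud features)
        self.posterior = nn.Linear(feat_dim + 1, 2 * z_dim)       # q(z | features, MOS), training only
        self.generator = nn.Sequential(                           # rating from (features, z)
            nn.Linear(feat_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feat, mos=None):
        params = self.posterior(torch.cat([feat, mos], dim=1)) if mos is not None else self.prior(feat)
        mu, logvar = params.chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization trick
        return self.generator(torch.cat([feat, z], dim=1)), mu, logvar

model = RatingCVAE()
feat = torch.randn(4, 256)                                        # precomputed features for 4 point clouds
ratings = torch.stack([model(feat)[0] for _ in range(8)])         # 8 simulated "subjects"
mos_pred = ratings.mean(dim=0)                                    # integrated quality prediction
```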
Authors: Song Tang, Wenxin Su, Mao Ye, Jianwei Zhang, Xiatian Zhu
Abstract: In the pursuit of transferring a source model to a target domain without access to the source training data, Source-Free Domain Adaptation (SFDA) has been extensively explored across various scenarios, including Closed-set, Open-set, Partial-set, and Generalized settings. Existing methods, focusing on specific scenarios, not only address a limited subset of challenges but also necessitate prior knowledge of the target domain, significantly limiting their practical utility and deployability. In light of these considerations, we introduce a more practical yet challenging problem, termed unified SFDA, which comprehensively incorporates all specific scenarios in a unified manner. In this paper, we propose a novel approach, latent Causal factors discovery for unified SFDA (CausalDA). In contrast to previous alternatives that emphasize learning the statistical description of reality, we formulate CausalDA from a causality perspective. The objective is to uncover the causal relationships between latent variables and model decisions, enhancing the reliability and robustness of the learned model against domain shifts. To integrate extensive world knowledge, we leverage a pre-trained vision-language model such as CLIP. This aids in the formation and discovery of latent causal factors in the absence of supervision in the variation of distribution and semantics, coupled with a newly designed information bottleneck with theoretical guarantees. Extensive experiments demonstrate that CausalDA can achieve new state-of-the-art results in distinct SFDA settings, as well as source-free out-of-distribution generalization.
Authors: Melanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia, Galvin Khara, Ben Glocker
Abstract: Contrastive pretraining is well-known to improve downstream task performance and model generalisation, especially in limited label settings. However, it is sensitive to the choice of augmentation pipeline. Positive pairs should preserve semantic information while destroying domain-specific information. Standard augmentation pipelines emulate domain-specific changes with pre-defined photometric transformations, but what if we could simulate realistic domain changes instead? In this work, we show how to utilise recent progress in counterfactual image generation to this effect. We propose CF-SimCLR, a counterfactual contrastive learning approach which leverages approximate counterfactual inference for positive pair creation. Comprehensive evaluation across five datasets, on chest radiography and mammography, demonstrates that CF-SimCLR substantially improves robustness to acquisition shift with higher downstream performance on both in- and out-of-distribution data, particularly for domains which are under-represented during training.
Authors: Qitong Yang, Mingtao Feng, Zijie Wu, Shijie Sun, Weisheng Dong, Yaonan Wang, Ajmal Mian
Abstract: Directly learning to model 4D content, including shape, color, and motion, is challenging. Existing methods rely on pose priors for motion control, resulting in limited motion diversity and continuity in details. To address this, we propose a framework that generates volumetric 4D sequences, where 3D shapes are animated under given conditions (text-image guidance) with dynamic evolution in shape and color across spatial and temporal dimensions, allowing for free navigation and rendering from any direction. We first use a coherent 3D shape and color modeling to encode the shape and color of each detailed 3D geometry frame into a latent space. Then we propose a matrixized 4D sequence representation allowing efficient diffusion model operation. Finally, we introduce spatio-temporal diffusion for 4D volumetric generation under given images and text prompts. Extensive experiments on the ShapeNet, 3DBiCar, DeformingThings4D and Objaverse datasets for several tasks demonstrate that our method effectively learns to generate high quality 3D shapes with consistent color and coherent mesh animations, improving over the current methods. Our code will be publicly available.
Authors: Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Jose Maria Buades Rubio, Prayag Tiwari, Josef Bigun
Abstract: The widespread use of mobile devices for all kinds of transactions makes reliable, real-time identity authentication necessary, leading to the adoption of face recognition (FR) via the cameras embedded in such devices. Progress in deep Convolutional Neural Networks (CNNs) has provided substantial advances in FR. Nonetheless, the size of state-of-the-art architectures is unsuitable for mobile deployment, since they often encompass hundreds of megabytes and millions of parameters. We address this by studying methods for deep network compression applied to FR. In particular, we apply network pruning based on Taylor scores, where less important filters are removed iteratively. The method is tested on three networks based on the small SqueezeNet (1.24M parameters) and the popular MobileNetv2 (3.5M) and ResNet50 (23.5M) architectures. These have been selected to showcase the method on CNNs with different complexities and sizes. We observe that a substantial percentage of filters can be removed with minimal performance loss. Also, filters with the highest amount of output channels tend to be removed first, suggesting that high-dimensional spaces within popular CNNs are over-dimensioned.
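For reference, the Taylor criterion referenced above scores each filter by a first-order estimate of the loss change if it were removed, i.e. |activation x gradient| aggregated over spatial positions and the batch. The sketch below computes such scores for one conv layer via a forward hook; iterative removal and fine-tuning are omitted, and the aggregation is a simplified variant rather than the paper's exact procedure.
```python
import torch
import torch.nn as nn

def taylor_filter_scores(model, conv: nn.Conv2d, x, target, loss_fn) -> torch.Tensor:
    """One-batch Taylor importance per filter of `conv` (simplified aggregation)."""
    acts = {}
    def hook(_, __, out):
        out.retain_grad()          # keep the gradient of this intermediate activation
        acts["a"] = out
    handle = conv.register_forward_hook(hook)
    loss_fn(model(x), target).backward()
    handle.remove()
    a = acts["a"]                  # (B, C, H, W) activations of the conv layer
    # |sum over spatial positions of activation * gradient|, averaged over the batch
    return (a * a.grad).sum(dim=(2, 3)).abs().mean(dim=0)

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 10))
scores = taylor_filter_scores(model, model[0], torch.randn(8, 3, 32, 32),
                              torch.randint(0, 10, (8,)), nn.CrossEntropyLoss())
prune_order = scores.argsort()     # least important filters come first
```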
Authors: Honghao Fu, Yufei Wang, Wenhan Yang, Alex C. Kot, Bihan Wen
Abstract: Blind image quality assessment (IQA) in the wild, which assesses the quality of images with complex authentic distortions and no reference images, presents significant challenges. Given the difficulty in collecting large-scale training data, leveraging limited data to develop a model with strong generalization remains an open problem. Motivated by the robust image perception capabilities of pre-trained text-to-image (T2I) diffusion models, we propose a novel IQA method, diffusion priors-based IQA (DP-IQA), to utilize the T2I model's prior for improved performance and generalization ability. Specifically, we utilize pre-trained Stable Diffusion as the backbone, extracting multi-level features from the denoising U-Net guided by prompt embeddings through a tunable text adapter. Simultaneously, an image adapter compensates for information loss introduced by the lossy pre-trained encoder. Unlike T2I models that require full image distribution modeling, our approach targets image quality assessment, which inherently requires fewer parameters. To improve applicability, we distill the knowledge into a lightweight CNN-based student model, significantly reducing parameters while maintaining or even enhancing generalization performance. Experimental results demonstrate that DP-IQA achieves state-of-the-art performance on various in-the-wild datasets, highlighting the superior generalization capability of T2I priors in blind IQA tasks. To our knowledge, DP-IQA is the first method to apply pre-trained diffusion priors in blind IQA. Codes and checkpoints are available at https://github.com/RomGai/DP-IQA.
Authors: Xiao Xu, Xizhe Zhang, Yan Zhang
Abstract: Mood disorders such as depression and anxiety often manifest through facial expressions, but existing machine learning algorithms designed to assess these disorders have been hindered by small datasets and limited real-world applicability. To address this gap, we analyzed facial videos of 11,427 participants - a dataset two orders of magnitude larger than those used in previous studies - including standardized facial expression videos and psychological assessments of depression, anxiety, and stress. However, scaling up the dataset introduces significant challenges due to increased symptom heterogeneity, making it difficult for models to learn accurate representations. To address this, we introduced the Symptom Discrepancy Index (SDI), a novel metric for quantifying dataset heterogeneity caused by variability in individual symptoms among samples with identical total scores. By removing the 10% most heterogeneous cases as identified by the SDI, we raised the F1 scores of all models from approximately 50% to 80%. Notably, comparable performance gains were observed within both the retained and excluded subsets. These findings identify symptom heterogeneity, not model capacity, as the principal bottleneck in large-scale automated assessment and provide a general solution that is readily applicable to other psychometric datasets.
Authors: Laura Zheng, Wenjie Wei, Tony Wu, Jacob Clements, Shreelekha Revankar, Andre Harrison, Yu Shen, Ming C. Lin
Abstract: Achieving robustness in image segmentation models is challenging due to the fine-grained nature of pixel-level classification. These models, which are crucial for many real-time perception applications, particularly struggle when faced with natural corruptions in the wild for autonomous systems. While sensitivity analysis can help us understand how input variables influence model outputs, its application to natural and uncontrollable corruptions in training data is computationally expensive. In this work, we present an adaptive, sensitivity-guided augmentation method to enhance robustness against natural corruptions. Our sensitivity analysis on average runs 10x faster and requires about 200x less storage than previous sensitivity analysis, enabling practical, on-the-fly estimation during training for a model-free augmentation policy. With minimal fine-tuning, our sensitivity-guided augmentation method achieves improved robustness on both real-world and synthetic datasets compared to state-of-the-art data augmentation techniques in image segmentation. Code implementation for this work can be found at: https://github.com/laurayuzheng/SensAug.
Authors: Changsong Liu, Yimeng Fan, Mingyang Li, Wei Zhang, Yanyan Liu, Yuming Li, Wenlin Li, Liang Zhang
Abstract: Edge detection is a fundamental task in computer vision. It has made great progress under the development of deep convolutional neural networks (DCNNs), some of which have achieved beyond human-level performance. However, recent top-performing edge detection methods tend to generate thick and noisy edge lines. In this work, we solve this problem from two aspects: (1) the lack of prior knowledge regarding image edges, and (2) the issue of imbalanced pixel distribution. We propose a second-order derivative-based multi-scale contextual enhancement module (SDMCM) to help the model locate true edge pixels accurately by introducing edge prior knowledge. We also construct a hybrid focal loss function (HFL) to alleviate the imbalanced distribution issue. In addition, we employ the conditionally parameterized convolution (CondConv) to develop a novel boundary refinement module (BRM), which can further refine the final output edge maps. Finally, we propose a U-shape network named LUS-Net, which is based on the SDMCM and BRM, for crisp edge detection. We perform extensive experiments on three standard benchmarks, and the results illustrate that our method can predict crisp and clean edge maps and achieves state-of-the-art performance on the BSDS500 dataset (ODS=0.829), NYUD-V2 dataset (ODS=0.768), and BIPED dataset (ODS=0.903).
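The exact HFL is defined in the paper; as a reference point only, the snippet below implements a standard class-balanced focal BCE often used for edge maps, where the positive class (edge pixels) is weighted by its rarity. It is not the authors' loss.
```python
import torch
import torch.nn.functional as F

def balanced_focal_edge_loss(logits: torch.Tensor, edges: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """logits, edges: (B, 1, H, W); edges is a {0,1} ground-truth edge map."""
    beta = 1.0 - edges.mean()                         # fraction of non-edge pixels
    weights = beta * edges + (1.0 - beta) * (1.0 - edges)
    p = torch.sigmoid(logits)
    pt = p * edges + (1.0 - p) * (1.0 - edges)        # probability assigned to the true class
    bce = F.binary_cross_entropy_with_logits(logits, edges, reduction="none")
    return (weights * (1.0 - pt) ** gamma * bce).mean()

loss = balanced_focal_edge_loss(torch.randn(2, 1, 64, 64),
                                (torch.rand(2, 1, 64, 64) > 0.9).float())
```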
Authors: Monika Wysoczańska, Antonin Vobecky, Amaia Cardiel, Tomasz Trzciński, Renaud Marlet, Andrei Bursuc, Oriane Siméoni
Abstract: Recent CLIP-like Vision-Language Models (VLMs), pre-trained on large amounts of image-text pairs to align both modalities with a simple contrastive objective, have paved the way to open-vocabulary semantic segmentation. Given an arbitrary set of textual queries, image pixels are assigned the closest query in feature space. However, this works well when a user exhaustively lists all possible visual concepts in an image that contrast against each other for the assignment. This corresponds to the current evaluation setup in the literature, which relies on having access to a list of in-domain relevant concepts, typically classes of a benchmark dataset. Here, we consider the more challenging (and realistic) scenario of segmenting a single concept, given a textual prompt and nothing else. To achieve good results, besides contrasting with the generic 'background' text, we propose two different approaches to automatically generate, at test time, query-specific textual contrastive concepts. We do so by leveraging the distribution of text in the VLM's training set or crafted LLM prompts. We also propose a metric designed to evaluate this scenario and show the relevance of our approach on commonly used datasets.
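A minimal sketch of the assignment step described above: each pixel feature is compared against the query, the automatically generated contrastive concepts, and a generic 'background' embedding, and the predicted mask keeps only the pixels won by the query. Random tensors stand in for CLIP-like image and text features here; the concept-generation strategies themselves are not reproduced.
```python
import torch
import torch.nn.functional as F

def segment_single_concept(pixel_feats: torch.Tensor, query_emb: torch.Tensor,
                           contrast_embs: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (H, W, D); query_emb: (D,); contrast_embs: (K, D), incl. a 'background' row."""
    texts = torch.cat([query_emb[None], contrast_embs], dim=0)            # (1+K, D)
    sims = F.normalize(pixel_feats, dim=-1) @ F.normalize(texts, dim=-1).T
    return sims.argmax(dim=-1) == 0                                       # True where the query wins

# Random stand-ins for VLM features; in practice these come from a CLIP-like model.
mask = segment_single_concept(torch.randn(32, 32, 512), torch.randn(512), torch.randn(4, 512))
```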
Authors: Seo-Bin Hwang, Han-Young Kim, Chae-Yeon Heo, Hie-Yong Jeong, Sung-Ju Jung, Yeong-Jun Cho
Abstract: The flatfish is a major farmed species consumed globally in large quantities. However, due to the densely populated farming environment, flatfish are susceptible to lesions and diseases, making early lesion detection crucial. Traditionally, lesions were detected through visual inspection, but observing large numbers of fish is challenging. Automated approaches based on deep learning technologies have been widely used to address this problem, but accurate detection remains difficult due to the diversity of the fish and the lack of a fish lesion and disease dataset. This study augments fish lesion images using generative adversarial networks and image harmonization methods. Next, lesion detectors are trained separately for three body parts (head, fins, and body) to address individual lesions properly. Additionally, a flatfish lesion and disease image dataset, called FlatIMG, is created and verified using the proposed methods on the dataset. A salmon lesion dataset is also tested to validate the generalizability of the proposed methods. The results achieved 12% higher performance than the baseline framework. This study is the first attempt to create a high-quality flatfish lesion image dataset with detailed annotations and propose an effective lesion detection framework. Automatic lesion and disease monitoring can be achieved in farming environments using the proposed methods and dataset.
Authors: Hao Xu, Xi Zhang, Xiaolin Wu
Abstract: Compressing a set of unordered points is far more challenging than compressing images/videos of regular sample grids, because of the difficulties in characterizing neighboring relations in an irregular layout of points. Many researchers resort to voxelization to introduce regularity, but this approach suffers from quantization loss. In this research, we use the KNN method to determine the neighborhoods of raw surface points. This gives us a means to determine the spatial context in which the latent features of 3D points are compressed by arithmetic coding. As such, the conditional probability model is adaptive to local geometry, leading to significant rate reduction. Additionally, we propose a dual-layer architecture where a non-learning base layer reconstructs the main structures of the point cloud at low complexity, while a learned refinement layer focuses on preserving fine details. This design leads to reductions in model complexity and coding latency by two orders of magnitude compared to SOTA methods. Moreover, we incorporate an implicit neural representation (INR) into the refinement layer, allowing the decoder to sample points on the underlying surface at arbitrary densities. This work is the first to effectively exploit content-aware local contexts for compressing irregular raw point clouds, achieving high rate-distortion performance, low complexity, and the ability to function as an arbitrary-scale upsampling network simultaneously.
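As a small sketch of the neighborhood step, the snippet below computes the K nearest neighbors of each raw point, which is the kind of local context in which a latent feature could then be entropy-coded; the conditional probability model, arithmetic coder, and the dual-layer/INR design are beyond this illustration, and the brute-force distance matrix is only practical for small point counts.
```python
import torch

def knn_context(points: torch.Tensor, k: int = 8) -> torch.Tensor:
    """points: (N, 3) raw coordinates -> (N, k) indices of each point's context neighbors."""
    d = torch.cdist(points, points)                   # (N, N) pairwise distances (fine for small N)
    d.fill_diagonal_(float("inf"))                    # a point is not its own neighbor
    return d.topk(k, largest=False).indices

pts = torch.rand(1024, 3)
neighbors = knn_context(pts)                          # local context for a conditional entropy model
```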
Authors: Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, Angela Yao
Abstract: Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays a pivotal role in Video-LLM development. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments. Moreover, the models behave unintuitively - they are unresponsive to adversarial video perturbations while being sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. The findings demonstrate Video-LLMs' QA capability in standard conditions yet highlight their severe deficiency in robustness and interpretability, suggesting the urgent need for rationales in Video-LLM development.
Authors: Varun Burde, Assia Benbihi, Pavel Burget, Torsten Sattler
Abstract: Object pose estimation is essential to many industrial applications involving robotic manipulation, navigation, and augmented reality. Current generalizable object pose estimators, i.e., approaches that do not need to be trained per object, rely on accurate 3D models. Predominantly, CAD models are used, which can be hard to obtain in practice. At the same time, it is often possible to acquire images of an object. Naturally, this leads to the question whether 3D models reconstructed from images are sufficient to facilitate accurate object pose estimation. We aim to answer this question by proposing a novel benchmark for measuring the impact of 3D reconstruction quality on pose estimation accuracy. Our benchmark provides calibrated images for object reconstruction registered with the test images of the YCB-V dataset for pose evaluation under the BOP benchmark format. Detailed experiments with multiple state-of-the-art 3D reconstruction and object pose estimation approaches show that the geometry produced by modern reconstruction methods is often sufficient for accurate pose estimation. Our experiments lead to interesting observations: (1) Standard metrics for measuring 3D reconstruction quality are not necessarily indicative of pose estimation accuracy, which shows the need for dedicated benchmarks such as ours. (2) Classical, non-learning-based approaches can perform on par with modern learning-based reconstruction techniques and can even offer a better reconstruction time-pose accuracy tradeoff. (3) There is still a sizable gap between performance with reconstructed and with CAD models. To foster research on closing this gap, our benchmark is publicly available at https://github.com/VarunBurde/reconstruction_pose_benchmark.
URLs: https://github.com/VarunBurde/reconstruction_pose_benchmark
Authors: Harish Babu Manogaran, M. Maruf, Arka Daw, Kazi Sajeed Mehrab, Caleb Patrick Charpentier, Josef C. Uyeda, Wasila Dahdul, Matthew J Thompson, Elizabeth G Campolongo, Kaiya L Provost, Wei-Lun Chao, Tanya Berger-Wolf, Paula M. Mabee, Hilmar Lapp, Anuj Karpatne
Abstract: A grand challenge in biology is to discover evolutionary traits - features of organisms common to a group of species with a shared ancestor in the tree of life (also referred to as phylogenetic tree). With the growing availability of image repositories in biology, there is a tremendous opportunity to discover evolutionary traits directly from images in the form of a hierarchy of prototypes. However, current prototype-based methods are mostly designed to operate over a flat structure of classes and face several challenges in discovering hierarchical prototypes, including the issue of learning over-specific prototypes at internal nodes. To overcome these challenges, we introduce the framework of Hierarchy aligned Commonality through Prototypical Networks (HComP-Net). The key novelties in HComP-Net include a novel over-specificity loss to avoid learning over-specific prototypes, a novel discriminative loss to ensure prototypes at an internal node are absent in the contrasting set of species with different ancestry, and a novel masking module to allow for the exclusion of over-specific prototypes at higher levels of the tree without hampering classification performance. We empirically show that HComP-Net learns prototypes that are accurate, semantically consistent, and generalizable to unseen species in comparison to baselines.
Authors: Yonghao Yu, Dongcheng Zhao, Guobin Shen, Yiting Dong, Yi Zeng
Abstract: The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM consists of Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE) striking a proper balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. Meanwhile, experiments show that combining SPM with different backbones can further improve performance. The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
Authors: Yuchen Guan, Runxi Cheng, Kang Liu, Chun Yuan
Abstract: Knowledge distillation typically minimizes the Kullback-Leibler (KL) divergence between teacher and student logits. However, optimizing the KL divergence can be challenging for the student and often leads to sub-optimal solutions. We further show that gradients induced by KL divergence scale with the magnitude of the teacher logits, thereby diminishing updates on low-probability channels. This imbalance weakens the transfer of inter-class information and in turn limits the performance improvements achievable by the student. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall's $\tau$ coefficient that can be seamlessly integrated into any logit-based distillation framework. It supplies inter-class relational information while rebalancing gradients toward low-probability channels. We demonstrate that the proposed ranking loss is largely invariant to channel scaling and optimizes an objective aligned with that of KL divergence, making it a natural complement rather than a replacement. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines. Code is available at https://github.com/OvernighTea/RankingLoss-KD
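A minimal PyTorch sketch of the idea above, assuming a standard temperature-scaled KL distillation term and a sigmoid-smoothed pairwise surrogate for Kendall's $\tau$; the temperature, loss weights, and surrogate form are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    # Standard logit distillation: KL between temperature-softened distributions.
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)

def ranking_loss(student_logits, teacher_logits, tau=1.0):
    # Differentiable pairwise surrogate in the spirit of Kendall's tau: encourage the
    # student to preserve the teacher's ordering over all channel pairs (hypothetical form).
    s_diff = student_logits.unsqueeze(-1) - student_logits.unsqueeze(-2)  # (B, C, C)
    t_diff = teacher_logits.unsqueeze(-1) - teacher_logits.unsqueeze(-2)
    # Reward concordant pairs: sigmoid-smoothed sign agreement between the two orderings.
    concordance = torch.sigmoid(s_diff / tau) * torch.sign(t_diff)
    return -concordance.mean()

def total_loss(student_logits, teacher_logits, labels, alpha=1.0, beta=0.5):
    ce = F.cross_entropy(student_logits, labels)
    return ce + alpha * kd_kl_loss(student_logits, teacher_logits) \
              + beta * ranking_loss(student_logits, teacher_logits)
```

Because the pairwise differences are unaffected by shifting all logits and only mildly affected by scaling, such a ranking term keeps supplying gradient to low-probability channels that the KL term tends to neglect.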
Authors: Tianyu Sun, Dingchang Hu, Yixiang Dai, Guijin Wang
Abstract: Transparent and reflective objects, which are common in our everyday lives, present a significant challenge to 3D imaging techniques due to their unique visual and optical properties. When faced with such objects, RGB-D cameras fail to capture real depth values and accurate spatial information. To address this issue, we propose DITR, a diffusion-based Depth Inpainting framework specifically designed for Transparent and Reflective objects. This network consists of two stages, including a Region Proposal stage and a Depth Inpainting stage. DITR dynamically analyzes the optical and geometric depth loss and inpaints them automatically. Furthermore, comprehensive experimental results demonstrate that DITR is highly effective in depth inpainting tasks of transparent and reflective objects with robust adaptability.
Authors: Amit Kumar Singh, Vrijendra Singh
Abstract: Deep learning and advancements in contactless sensors have significantly enhanced our ability to understand complex human activities in healthcare settings. In particular, deep learning models utilizing computer vision have been developed to enable detailed analysis of human gesture recognition, especially repetitive gestures which are commonly observed behaviors in children with autism. This research work aims to identify repetitive behaviors indicative of autism by analyzing videos captured in natural settings as children engage in daily activities. The focus is on accurately categorizing real-time repetitive gestures such as spinning, head banging, and arm flapping. To this end, we utilize the publicly accessible Self-Stimulatory Behavior Dataset (SSBD) to classify these stereotypical movements. A key component of the proposed methodology is the use of VideoMAE, a model designed to improve both spatial and temporal analysis of video data through a masking and reconstruction mechanism. This model significantly outperformed traditional methods, achieving an accuracy of 97.7%, a 14.7% improvement over the previous state-of-the-art.
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
Abstract: In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
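A minimal sketch of the mixture-of-head idea under simple assumptions (a linear router, top-k head selection per token, and softmax-normalized weights over the selected heads); the released MoH architecture may differ in routing details such as shared heads or load balancing.

```python
import torch
import torch.nn as nn

class MoHAttention(nn.Module):
    """Sketch of mixture-of-head attention: a router scores every head per token,
    only the top-k heads are kept, and head outputs are combined by a weighted sum."""
    def __init__(self, dim, num_heads=8, top_k=6):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh, self.k = num_heads, dim // num_heads, top_k
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, num_heads)   # per-token head scores
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.dh).transpose(1, 2)   # (B, h, N, dh)
        k = k.view(B, N, self.h, self.dh).transpose(1, 2)
        v = v.view(B, N, self.h, self.dh).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5
        heads = attn.softmax(dim=-1) @ v                     # (B, h, N, dh)

        scores = self.router(x)                              # (B, N, h)
        topk = scores.topk(self.k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, topk.indices,
                                                topk.values.softmax(dim=-1))
        heads = heads * gate.permute(0, 2, 1).unsqueeze(-1)  # weighted (not plain) summation
        out = heads.transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```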
Authors: Yuji Wang, Zehua Chen, Xiaoyu Chen, Yixiang Wei, Jun Zhu, Jianfei Chen
Abstract: Diffusion models have achieved remarkable progress on image-to-video (I2V) generation, while their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality. In this work, we present FrameBridge. By modeling the frame-to-frames generation process as a data-to-data generative process based on a bridge model, we are able to fully exploit the information contained in the given image and improve the consistency between the generation process and the I2V task. Moreover, we propose two novel techniques toward the two popular settings of training I2V models, respectively. Firstly, we propose SNR-Aligned Fine-tuning (SAF), making the first attempt to fine-tune a diffusion model to a bridge model and, therefore, allowing us to utilize the pre-trained diffusion-based text-to-video (T2V) models. Secondly, we propose neural prior, further improving the synthesis quality of FrameBridge when training from scratch. Experiments conducted on WebVid-2M and UCF-101 demonstrate the superior quality of FrameBridge in comparison with the diffusion counterpart (zero-shot FVD 95 vs. 192 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101), and the advantages of our proposed SAF and neural prior for bridge-based I2V models. The project page: https://framebridge-icml.github.io/.
Authors: Tianyu Sun, Jianhao Li, Xueqian Zhang, Zhongdao Wang, Bailan Feng, Hengshuang Zhao
Abstract: This paper studies point cloud perception within outdoor environments. Existing methods face limitations in recognizing objects located at a distance or occluded, due to the sparse nature of outdoor point clouds. In this work, we observe a significant mitigation of this problem by accumulating multiple temporally consecutive point cloud sweeps, resulting in a remarkable improvement in perception accuracy. However, the computation cost also increases, hindering previous approaches from utilizing a large number of point cloud sweeps. To tackle this challenge, we find that a considerable portion of points in the accumulated point cloud is redundant, and discarding these points has minimal impact on perception accuracy. We introduce a simple yet effective Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on a learned end-to-end sampling. The GSP layer is decoupled from other network components and thus can be seamlessly integrated into existing point cloud network architectures. Without incurring additional computational overhead, we increase the number of point cloud sweeps from 10, a common practice, to as many as 40. Consequently, there is a significant enhancement in perception performance. For instance, in nuScenes 3D object detection and BEV map segmentation tasks, our pruning strategy improves several 3D perception baseline methods.
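A rough sketch of the kind of end-to-end learned pruning described above, using a per-point keep/drop gate sampled with Gumbel-Softmax so the hard selection stays differentiable; the gating network, two-way logits, and zeroing-out strategy are illustrative assumptions, not the GSP layer's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelPointPruner(nn.Module):
    """Sketch of a learned keep/drop gate for points, trained end-to-end with
    Gumbel-Softmax sampling (illustrative only)."""
    def __init__(self, feat_dim, tau=1.0):
        super().__init__()
        self.score = nn.Linear(feat_dim, 2)   # logits for (drop, keep) per point
        self.tau = tau

    def forward(self, points, feats):          # points: (B, N, 3), feats: (B, N, C)
        logits = self.score(feats)                                       # (B, N, 2)
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1:]  # (B, N, 1)
        # Zero out pruned points; a downstream layer could also drop them explicitly
        # to realize the actual compute savings.
        return points * gate, feats * gate, gate
```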
Authors: Zhihang Lin, Mingbao Lin, Wengyi Zhan, Rongrong Ji
Abstract: Diffusion models suffer severe object repetition and local distortion when the inference resolution differs from its pre-trained resolution. We propose AccDiffusion v2, an accurate method for patch-wise higher-resolution diffusion extrapolation without training. Our in-depth analysis in this paper shows that using an identical text prompt for different patches leads to repetitive generation, while the absence of a prompt undermines image details. In response, our AccDiffusion v2 novelly decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of a patch. Further analysis reveals that local distortion arises from inaccurate descriptions in prompts about the local structure of higher-resolution images. To address this issue, AccDiffusion v2, for the first time, introduces auxiliary local structural information through ControlNet during higher-resolution diffusion extrapolation, aiming to mitigate local distortions. Finally, our analysis indicates that global semantic information is conducive to suppressing both repetitive generation and local distortion. Hence, our AccDiffusion v2 further proposes dilated sampling with window interaction for better global semantic information during higher-resolution diffusion extrapolation. We conduct extensive experiments, including both quantitative and qualitative comparisons, to demonstrate the efficacy of our AccDiffusion v2. The quantitative comparison shows that AccDiffusion v2 achieves state-of-the-art performance in image generation extrapolation without training. The qualitative comparison intuitively illustrates that AccDiffusion v2 effectively suppresses the issues of repetitive generation and local distortion in image generation extrapolation. Our code is available at https://github.com/lzhxmu/AccDiffusion_v2.
Authors: Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese
Abstract: While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 273K multi-modal reasoning traces over perceptual outputs of vision specialists. LATTE trained on this data achieves significant gains over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.
Authors: Qiao Jin, Xiaodong Chen, Wu Liu, Tao Mei, Yongdong Zhang
Abstract: The advent of stereoscopic videos has opened new horizons in multimedia, particularly in extended reality (XR) and virtual reality (VR) applications, where immersive content captivates audiences across various platforms. Despite its growing popularity, producing stereoscopic videos remains challenging due to the technical complexities involved in generating stereo parallax. This refers to the positional differences of objects viewed from two distinct perspectives and is crucial for creating depth perception. This complex process poses significant challenges for creators aiming to deliver convincing and engaging presentations. To address these challenges, this paper introduces the Text-driven Stereoscopic Video Generation (T-SVG) system. This innovative, model-agnostic, zero-shot approach streamlines video generation by using text prompts to create reference videos. These videos are transformed into 3D point cloud sequences, which are rendered from two perspectives with subtle parallax differences, achieving a natural stereoscopic effect. T-SVG represents a significant advancement in stereoscopic content creation by integrating state-of-the-art, training-free techniques in text-to-video generation, depth estimation, and video inpainting. Its flexible architecture ensures high efficiency and user-friendliness, allowing seamless updates with newer models without retraining. By simplifying the production pipeline, T-SVG makes stereoscopic video generation accessible to a broader audience, demonstrating its potential to revolutionize the field.
Authors: Tianqi Shen, Shaohua Liu, Jiaqi Feng, Ziye Ma, Ning An
Abstract: Gaussian Splatting (GS) has emerged as a crucial technique for representing discrete volumetric radiance fields. It leverages unique parametrization to mitigate computational demands in scene optimization. This work introduces Topology-Aware 3D Gaussian Splatting (Topology-GS), which addresses two key limitations in current approaches: compromised pixel-level structural integrity due to incomplete initial geometric coverage, and inadequate feature-level integrity from insufficient topological constraints during optimization. To overcome these limitations, Topology-GS incorporates a novel interpolation strategy, Local Persistent Voronoi Interpolation (LPVI), and a topology-focused regularization term based on persistent barcodes, named PersLoss. LPVI utilizes persistent homology to guide adaptive interpolation, enhancing point coverage in low-curvature areas while preserving topological structure. PersLoss aligns the visual perceptual similarity of rendered images with ground truth by constraining distances between their topological features. Comprehensive experiments on three novel-view synthesis benchmarks demonstrate that Topology-GS outperforms existing methods in terms of PSNR, SSIM, and LPIPS metrics, while maintaining efficient memory usage. This study pioneers the integration of topology with 3D-GS, laying the groundwork for future research in this area.
Authors: Maxence Boels, Yang Liu, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
Abstract: While existing approaches excel at recognising current surgical phases, they provide limited foresight into future procedural steps and limited intraoperative guidance. Similarly, current anticipation methods are constrained to predicting short-term and single events, neglecting the dense, repetitive, and long sequential nature of surgical workflows. To address these needs and limitations, we propose SWAG (Surgical Workflow Anticipative Generation), a framework that combines phase recognition and anticipation using a generative approach. This paper investigates two distinct decoding methods - single-pass (SP) and auto-regressive (AR) - to generate sequences of future surgical phases at minute intervals over long horizons. We propose a novel embedding approach using class transition probabilities to enhance the accuracy of phase anticipation. Additionally, we propose a generative framework using remaining time regression to classification (R2C). SWAG was evaluated on two publicly available datasets, Cholec80 and AutoLaparo21. Our single-pass model with class transition probability embeddings (SP*) achieves 32.1% and 41.3% F1 scores over 20 and 30 minutes on Cholec80 and AutoLaparo21, respectively. Moreover, our approach competes with existing methods on phase remaining time regression, achieving weighted mean absolute errors of 0.32 and 0.48 minutes for 2- and 3-minute horizons. SWAG demonstrates versatility across generative decoding frameworks and classification and regression tasks to create temporal continuity between surgical workflow recognition and anticipation. Our method provides steps towards intraoperative surgical workflow generation for anticipation. Project: https://maxboels.com/research/swag.
Authors: Yifei Liu, Zhihang Zhong, Yifan Zhan, Sheng Xu, Xiao Sun
Abstract: While 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in novel view synthesis and real-time rendering, the high memory consumption due to the use of millions of Gaussians limits its practicality. To mitigate this issue, improvements have been made by pruning unnecessary Gaussians, either through a hand-crafted criterion or by using learned masks. However, these methods deterministically remove Gaussians based on a snapshot of the pruning moment, leading to sub-optimized reconstruction performance from a long-term perspective. To address this issue, we introduce MaskGaussian, which models Gaussians as probabilistic entities rather than permanently removing them, and utilize them according to their probability of existence. To achieve this, we propose a masked-rasterization technique that enables unused yet probabilistically existing Gaussians to receive gradients, allowing for dynamic assessment of their contribution to the evolving scene and adjustment of their probability of existence. Hence, the importance of Gaussians iteratively changes and the pruned Gaussians are selected diversely. Extensive experiments demonstrate the superiority of the proposed method in achieving better rendering quality with fewer Gaussians than previous pruning methods, pruning over 60% of Gaussians on average with only a 0.02 PSNR decline. Our code can be found at: https://github.com/kaikai23/MaskGaussian
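A small sketch of modeling per-Gaussian existence probabilistically, assuming one learnable existence logit per Gaussian and a straight-through Bernoulli sample so pruned Gaussians still receive gradients; the paper's masked-rasterization couples this with the renderer, which is not reproduced here.

```python
import torch

def sample_existence_mask(logits, hard=True):
    """Sketch: treat each Gaussian's existence as a Bernoulli variable and sample a
    0/1 mask with a straight-through estimator so unused Gaussians keep getting
    gradients. `logits` is one learnable scalar per Gaussian (hypothetical parameterization)."""
    prob = torch.sigmoid(logits)                 # existence probability per Gaussian
    if hard:
        sample = torch.bernoulli(prob)
        # Straight-through: forward pass uses the hard sample,
        # backward pass flows through the probability.
        return sample + prob - prob.detach()
    return prob

# Usage sketch: weight each Gaussian's contribution during (masked) rasterization.
# opacities = base_opacities * sample_existence_mask(existence_logits)
```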
Authors: Chuanzhi Xu, Langyi Chen, Haodong Chen, Vera Chung, Qiang Qu
Abstract: Neuromorphic cameras, also known as event cameras, are asynchronous brightness-change sensors that can capture extremely fast motion without suffering from motion blur, making them particularly promising for 3D reconstruction in extreme environments. However, existing research on 3D reconstruction using monocular neuromorphic cameras is limited, and most of the methods rely on estimating physical priors and employ complex multi-step pipelines. In this work, we propose an end-to-end method for dense voxel 3D reconstruction using neuromorphic cameras that eliminates the need to estimate physical priors. Our method incorporates a novel event representation to enhance edge features, enabling the proposed feature-enhancement model to learn more effectively. Additionally, we introduce the Optimal Binarization Threshold Selection Principle as a guideline for future related work, using the optimal reconstruction results achieved with threshold optimization as the benchmark. Our method achieves a 54.6% improvement in reconstruction accuracy compared to the baseline method.
Authors: Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
Abstract: Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.
Authors: Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky
Abstract: The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions in utilizing multiple vision encoders for comprehensive video understanding.
Authors: Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
Abstract: Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
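A minimal sketch of a temporal consistency loss that constrains the temporal depth gradient between consecutive frames; the frame pairing, masking, and choice of L1 norm are illustrative assumptions rather than the paper's exact loss.

```python
import torch

def temporal_gradient_loss(pred_depth, prev_pred, gt_depth, prev_gt, mask=None):
    """Sketch: the frame-to-frame change of predicted depth should match the change
    in the reference depth (the paper's exact formulation may differ)."""
    pred_grad = pred_depth - prev_pred       # temporal gradient of the prediction
    gt_grad = gt_depth - prev_gt             # temporal gradient of the reference depth
    diff = (pred_grad - gt_grad).abs()
    if mask is not None:                     # optionally restrict to valid pixels
        diff = diff * mask
        return diff.sum() / mask.sum().clamp(min=1)
    return diff.mean()
```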
Authors: Xianrui Luo, Juewen Peng, Zhongang Cai, Lei Yang, Fan Yang, Zhiguo Cao, Guosheng Lin
Abstract: We introduce a novel framework for modeling high-fidelity, animatable 3D human avatars from motion-blurred monocular video inputs. Motion blur is prevalent in real-world dynamic video capture, especially due to human movements in 3D human avatar modeling. Existing methods either (1) assume sharp image inputs, failing to address the detail loss introduced by motion blur, or (2) mainly consider blur by camera movements, neglecting the human motion blur which is more common in animatable avatars. Our proposed approach integrates a human movement-based motion blur model into 3D Gaussian Splatting (3DGS). By explicitly modeling human motion trajectories during exposure time, we jointly optimize the trajectories and 3D Gaussians to reconstruct sharp, high-quality human avatars. We employ a pose-dependent fusion mechanism to distinguish moving body regions, optimizing both blurred and sharp areas effectively. Extensive experiments on synthetic and real-world datasets demonstrate that our method significantly outperforms existing methods in rendering quality and quantitative metrics, producing sharp avatar reconstructions and enabling real-time rendering under challenging motion blur conditions.
Authors: Mude Hui, Rui-Jie Zhu, Songlin Yang, Yu Zhang, Zirui Wang, Yuyin Zhou, Jason Eshraghian, Cihang Xie
Abstract: Flow models are effective at progressively generating realistic images, but they generally struggle to capture long-range dependencies during the generation process as they compress all the information from previous time steps into a single corrupted image. To address this limitation, we propose integrating autoregressive modeling -- known for its excellence in modeling complex, high-dimensional joint probability distributions -- into flow models. During training, at each step, we construct causally-ordered sequences by sampling multiple images from the same semantic category and applying different levels of noise, where images with higher noise levels serve as causal predecessors to those with lower noise levels. This design enables the model to learn broader category-level variations while maintaining proper causal relationships in the flow process. During generation, the model autoregressively conditions on the previously generated images from earlier denoising steps, forming a contextual and coherent generation trajectory. Additionally, we design a customized hybrid linear attention mechanism tailored to our modeling approach to enhance computational efficiency. Our approach, termed ARFlow, achieves an FID of 6.63 on ImageNet at 256x256 without classifier-free guidance, reaching an FID of 1.96 with a classifier-free guidance scale of 1.5, outperforming the previous flow-based model SiT's 2.06 FID. Extensive ablation studies demonstrate the effectiveness of our modeling strategy and chunk-wise attention design.
Authors: Shi Chen, Lefei Zhang, Liangpei Zhang
Abstract: Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential in hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges as transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate our HSRMamba outperforms the state-of-the-art methods in quantitative quality and visual results. Code is available at: https://github.com/Tomchenshi/HSRMamba.
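A rough sketch of global spectral reordering by similarity, using a greedy nearest-neighbour chain over batch-averaged band features; the similarity measure and greedy strategy are assumptions for illustration and need not match HSRMamba's reordering rule.

```python
import torch
import torch.nn.functional as F

def reorder_bands_by_similarity(x):
    """Sketch: greedily chain spectral bands so adjacent bands in the new order are
    maximally similar. x: (B, C, H, W) hyperspectral feature with C bands."""
    B, C, H, W = x.shape
    flat = x.flatten(2).mean(0)                          # (C, H*W), batch-averaged
    sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)  # (C, C)
    order, visited = [0], {0}
    for _ in range(C - 1):
        cand = sim[order[-1]].clone()
        cand[list(visited)] = -float("inf")              # exclude already-placed bands
        nxt = int(cand.argmax())
        order.append(nxt)
        visited.add(nxt)
    return x[:, order], order                            # reordered tensor and permutation
```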
Authors: Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch
Abstract: State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. To this end, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one with the NSFW term, and another where that term is replaced with a carefully crafted benign alternative while leaving the image unchanged otherwise. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses NSFW-ness as well as text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment. The benchmark is available online for download.
Authors: Óscar A. Martín, Javier Sánchez
Abstract: Neural networks have become the standard technique for medical diagnostics, especially in cancer detection and classification. This work evaluates the performance of Vision Transformer architectures, including Swin Transformer and MaxViT, on several datasets of magnetic resonance imaging (MRI) and computed tomography (CT) scans. We used three training sets of images with brain, lung, and kidney tumors. Each dataset includes different classification labels, from brain gliomas and meningiomas to benign and malignant lung conditions and kidney anomalies such as cysts and cancers. This work aims to analyze the behavior of the neural networks in each dataset and the benefits of combining different image modalities and tumor classes. We designed several experiments by fine-tuning the models on combined and individual datasets. The results revealed that the Swin Transformer provided high accuracy, achieving up to 99% on average for individual datasets and 99.4% accuracy for the combined dataset. This research highlights the adaptability of Transformer-based models to various image modalities and features. However, challenges persist, including limited annotated data and interpretability issues. Future work will expand this study by incorporating other image modalities and enhancing diagnostic capabilities. Integrating these models across diverse datasets could mark a significant advance in precision medicine, paving the way for more efficient and comprehensive healthcare solutions.
Authors: Sen Peng, Mingyue Wang, Jianfei He, Jijia Yang, Xiaohua Jia
Abstract: Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose the Contrastive Adversarial Training (CAT) utilizing lightweight adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization, urging the community to reconsider and improve the robustness of existing protective perturbations. The code is available at https://github.com/senp98/CAT.
Authors: Ai Chen, Yuxu Lu, Dong Yang, Junlin Zhou, Yan Fu, Duanbing Chen
Abstract: Salient object detection (SOD) plays a critical role in Intelligent Imaging, facilitating the detection and segmentation of key visual elements in an image. However, adverse imaging conditions such as haze during the day, low light, and haze at night severely degrade image quality and hinder reliable object detection in real-world scenarios. To address these challenges, we propose a multi-knowledge-oriented nighttime haze imaging enhancer (MKoIE), which integrates three tasks: daytime dehazing, low-light enhancement, and nighttime dehazing. The MKoIE incorporates two key innovative components: First, the network employs a task-oriented node learning mechanism to handle three specific degradation types: day-time haze, low light, and night-time haze conditions, with an embedded self-attention module enhancing its performance in nighttime imaging. In addition, a multi-receptive-field enhancement module efficiently extracts multi-scale features through three parallel depthwise separable convolution branches with different dilation rates, capturing comprehensive spatial information with minimal computational overhead to meet the requirements of real-time imaging deployment. To ensure optimal image reconstruction quality and visual characteristics, we suggest a hybrid loss function. Extensive experiments on different types of weather/imaging conditions illustrate that MKoIE surpasses existing methods, enhancing the reliability, accuracy, and operational efficiency of intelligent imaging.
Authors: Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, Jun Zhu
Abstract: Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch--achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3x extrapolation by minimal fine-tuning without long videos. Project page and codes: https://riflex-video.github.io/.
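A minimal sketch of the core intervention, assuming RoPE-style inverse frequencies and a known index of the intrinsic frequency component; identifying that component and wiring the modified frequencies into a video diffusion transformer are left out.

```python
import torch

def rotary_frequencies(dim, theta=10000.0):
    # Standard RoPE-style inverse frequencies, one per pair of channels.
    return 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

def riflex_style_frequencies(dim, intrinsic_idx, extrapolation=2.0, theta=10000.0):
    """Sketch of the idea: lower a single 'intrinsic' frequency component so its period
    covers the extrapolated length, leaving all other components untouched.
    `intrinsic_idx` (which component dominates repetition) is an assumption here."""
    freqs = rotary_frequencies(dim, theta)
    freqs[intrinsic_idx] = freqs[intrinsic_idx] / extrapolation
    return freqs
```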
Authors: Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, Jianke Zhu
Abstract: Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) that is capable of understanding and reasoning in complex 3D environments. Most previous methods typically encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of the 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. To obtain the fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning simultaneously without the subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks. Source code is available at https://github.com/hanxunyu/Inst3D-LMM
Authors: Tianjie Ju, Yi Hua, Hao Fei, Zhenyu Shao, Yubin Zheng, Haodong Zhao, Mong-Li Lee, Wynne Hsu, Zhuosheng Zhang, Gongshen Liu
Abstract: Multi-Modal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly generate task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at https://github.com/illusionhi/ProbingPrivacy.
Authors: Zicheng Zhang, Tengchuan Kou, Shushi Wang, Chunyi Li, Wei Sun, Wei Wang, Xiaoyu Li, Zongyu Wang, Xuezhi Cao, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai
Abstract: Evaluating text-to-vision content hinges on two crucial aspects: visual quality and alignment. While significant progress has been made in developing objective models to assess these dimensions, the performance of such models heavily relies on the scale and quality of human annotations. According to Scaling Law, increasing the number of human-labeled instances follows a predictable pattern that enhances the performance of evaluation models. Therefore, we introduce a comprehensive dataset designed to Evaluate Visual quality and Alignment Level for text-to-vision content (Q-EVAL-100K), featuring the largest collection of human-labeled Mean Opinion Scores (MOS) for the mentioned two aspects. The Q-EVAL-100K dataset encompasses both text-to-image and text-to-video models, with 960K human annotations specifically focused on visual quality and alignment for 100K instances (60K images and 40K videos). Leveraging this dataset with context prompt, we propose Q-Eval-Score, a unified model capable of evaluating both visual quality and alignment with special improvements for handling long-text prompt alignment. Experimental results indicate that the proposed Q-Eval-Score achieves superior performance on both visual quality and alignment, with strong generalization capabilities across other benchmarks. These findings highlight the significant value of the Q-EVAL-100K dataset. Data and codes will be available at https://github.com/zzc-1998/Q-Eval.
Authors: Chenglu Pan, Xiaogang Xu, Ganggui Ding, Yunke Zhang, Wenbo Li, Jiarong Xu, Qingbiao Wu
Abstract: Restoring low-resolution text images presents a significant challenge, as it requires maintaining both the fidelity and stylistic realism of the text in restored images. Existing text image restoration methods often fall short in hard situations, as the traditional super-resolution models cannot guarantee clarity, while diffusion-based methods fail to maintain fidelity. In this paper, we introduce a novel framework aimed at improving the generalization ability of diffusion models for text image super-resolution (SR), especially promoting fidelity. First, we propose a progressive data sampling strategy that incorporates diverse image types at different stages of training, stabilizing the convergence and improving the generalization. For the network architecture, we leverage a pre-trained SR prior to provide robust spatial reasoning capabilities, enhancing the model's ability to preserve textual information. Additionally, we employ a cross-attention mechanism to better integrate textual priors. To further reduce errors in textual priors, we utilize confidence scores to dynamically adjust the importance of textual features during training. Extensive experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearances but also improves the accuracy of text structure.
Authors: Yifei Deng, Chenglong Li, Zhenyu Chen, Zihen Xu, Jin Tang
Abstract: The performance of traditional text-image person retrieval task is easily affected by lighting variations due to imaging limitations of visible spectrum sensors. In recent years, cross-modal information fusion has emerged as an effective strategy to enhance retrieval robustness. By integrating complementary information from different spectral modalities, it becomes possible to achieve more stable person recognition and matching under complex real-world conditions. Motivated by this, we introduce a novel task: Text-RGBT Person Retrieval, which incorporates cross-spectrum information fusion by combining the complementary cues from visible and thermal modalities for robust person retrieval in challenging environments. The key challenge of Text-RGBT person retrieval lies in aligning text with multi-modal visual features. However, the inherent heterogeneity between visible and thermal modalities may interfere with the alignment between vision and language. To handle this problem, we propose a Decoupled Cross-modal Alignment network (DCAlign), which sufficiently mines the relationships between modality-specific and modality-collaborative visual features and the text for Text-RGBT person retrieval. To promote the research and development of this field, we create a high-quality Text-RGBT person retrieval dataset, RGBT-PEDES. RGBT-PEDES contains 1,822 identities from different age groups and genders with 4,723 pairs of calibrated RGB and T images, and covers highly diverse scenes from both daytime and nighttime with various challenges such as occlusion, weak alignment, and adverse lighting conditions. Additionally, we carefully annotate 7,987 fine-grained textual descriptions for all RGBT person image pairs. Extensive experiments on RGBT-PEDES demonstrate that our method outperforms existing text-image person retrieval methods.
Authors: Liewen Liao, Weihao Yan, Ming Yang, Songan Zhang
Abstract: Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of both dynamic and static environments through advanced neural representations. Beyond data augmentation, 3D reconstruction inspires pioneering solutions for vital tasks in the field of autonomous driving, such as scene understanding and closed-loop simulation. We investigate the details of 3D reconstruction and conduct a multi-perspective, in-depth analysis of recent advancements. Specifically, we first provide a systematic introduction to the preliminaries, including data modalities, benchmarks, and the technical foundations of learning-based 3D reconstruction, facilitating instant identification of suitable methods according to sensor suites. Then, we systematically review learning-based 3D reconstruction methods in autonomous driving, categorizing approaches by subtask and conducting multi-dimensional analysis and summary to establish a comprehensive technical reference. The development trends and existing challenges are summarized in the context of learning-based 3D reconstruction in autonomous driving. We hope that our review will inspire future research.
Authors: Boyuan Zheng, Shouyi Lu, Renbo Huang, Minqing Huang, Fan Lu, Wei Tian, Guirong Zhuo, Lu Xiong
Abstract: We introduce R2LDM, an innovative approach for generating dense and accurate 4D radar point clouds, guided by corresponding LiDAR point clouds. Instead of utilizing range images or bird's eye view (BEV) images, we represent both LiDAR and 4D radar point clouds using voxel features, which more effectively capture 3D shape information. Subsequently, we propose the Latent Voxel Diffusion Model (LVDM), which performs the diffusion process in the latent space. Additionally, a novel Latent Point Cloud Reconstruction (LPCR) module is utilized to reconstruct point clouds from high-dimensional latent voxel features. As a result, R2LDM effectively generates LiDAR-like point clouds from paired raw radar data. We evaluate our approach on two different datasets, and the experimental results demonstrate that our model achieves 6- to 10-fold densification of radar point clouds, outperforming state-of-the-art baselines in 4D radar point cloud super-resolution. Furthermore, the enhanced radar point clouds generated by our method significantly improve downstream tasks, achieving up to 31.7% improvement in point cloud registration recall rate and 24.9% improvement in object detection accuracy.
Authors: Zhuohao Li, Zhicheng Huang, Wenchao Liu, Zhuxin Zhang, Jianming Miao
Abstract: Few-Shot Semantic Segmentation (FSS), which focuses on segmenting new classes in images using only a limited number of annotated examples, has recently progressed in data-scarce domains. However, in this work, we show that the existing FSS methods often struggle to generalize to underwater environments. Specifically, the prior features extracted by pre-trained models used as feature extractors are fragile due to the unique challenges of underwater images. To address this, we propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement. FSSUWNet exploits the integration of complementary features, emphasizing both low-level and high-level image characteristics. In addition to employing a pre-trained model as the primary encoder, we propose an auxiliary encoder, called the Feature Enhanced Encoder, which extracts complementary features to better adapt to underwater scene characteristics. Furthermore, a simple and effective Feature Alignment Module provides global prior knowledge and aligns the dimensions of low-level features with those of high-level features. Given the scarcity of underwater images, we introduce a cross-validation dataset version based on the Segmentation of Underwater Imagery dataset. Extensive experiments on public underwater segmentation datasets demonstrate that our approach achieves state-of-the-art performance. For example, our method outperforms the previous best method by 2.8% and 2.6% in terms of the mean Intersection over Union metric for the 1-shot and 5-shot scenarios, respectively. Our implementation is available at https://github.com/lizhh268/FSSUWNet.
Authors: Yasmine Mustafa, Mohamed Elmahallawy, Tie Luo
Abstract: Alzheimer's disease (AD) leads to progressive cognitive decline, making early detection crucial for effective intervention. While deep learning models have shown high accuracy in AD diagnosis, their lack of interpretability limits clinical trust and adoption. This paper introduces a novel pre-model approach leveraging Jacobian Maps (JMs) within a multi-modal framework to enhance explainability and trustworthiness in AD detection. By capturing localized brain volume changes, JMs establish meaningful correlations between model predictions and well-known neuroanatomical biomarkers of AD. We validate JMs through experiments comparing a 3D CNN trained on JMs versus on traditional preprocessed data, which demonstrates superior accuracy. We also employ 3D Grad-CAM analysis to provide both visual and quantitative insights, further showcasing improved interpretability and diagnostic reliability.
Authors: Ronghui Zhang, Yuhang Ma, Tengfei Li, Ziyu Lin, Yueying Wu, Junzhou Chen, Lin Zhang, Jia Hu, Tony Z. Qiu, Konghui Guo
Abstract: Lane detection is a critical component of Advanced Driver Assistance Systems (ADAS). Existing lane detection algorithms generally perform well under favorable weather conditions. However, their performance degrades significantly in adverse conditions, such as fog, which increases the risk of traffic accidents. This challenge is compounded by the lack of specialized datasets and methods designed for foggy environments. To address this, we introduce the FoggyLane dataset, captured in real-world foggy scenarios, and synthesize two additional datasets, FoggyCULane and FoggyTusimple, from existing popular lane detection datasets. Furthermore, we propose a robust Fog-Enhanced Network for lane detection, incorporating a Global Feature Fusion Module (GFFM) to capture global relationships in foggy images, a Kernel Feature Fusion Module (KFFM) to model the structural and positional relationships of lane instances, and a Low-level Edge Enhanced Module (LEEM) to address missing edge details in foggy conditions. Comprehensive experiments demonstrate that our method achieves state-of-the-art performance, with F1-scores of 95.04 on FoggyLane, 79.85 on FoggyCULane, and 96.95 on FoggyTusimple. Additionally, with TensorRT acceleration, the method reaches a processing speed of 38.4 FPS on the NVIDIA Jetson AGX Orin, confirming its real-time capabilities and robustness in foggy environments.
Authors: Darian Tomašević, Fadi Boutros, Chenhao Lin, Naser Damer, Vitomir Štruc, Peter Peer
Abstract: Recent advances in generative modeling have enabled the generation of high-quality synthetic data that is applicable in a variety of domains, including face recognition. Here, state-of-the-art generative models typically rely on conditioning and fine-tuning of powerful pretrained diffusion models to facilitate the synthesis of realistic images of a desired identity. Yet, these models often do not consider the identity of subjects during training, leading to poor consistency between generated and intended identities. In contrast, methods that employ identity-based training objectives tend to overfit on various aspects of the identity, and in turn, lower the diversity of images that can be generated. To address these issues, we present in this paper a novel generative diffusion-based framework, called ID-Booth. ID-Booth consists of a denoising network responsible for data generation, a variational auto-encoder for mapping images to and from a lower-dimensional latent space and a text encoder that allows for prompt-based control over the generation procedure. The framework utilizes a novel triplet identity training objective and enables identity-consistent image generation while retaining the synthesis capabilities of pretrained diffusion models. Experiments with a state-of-the-art latent diffusion model and diverse prompts reveal that our method facilitates better intra-identity consistency and inter-identity separability than competing methods, while achieving higher image diversity. In turn, the produced data allows for effective augmentation of small-scale datasets and training of better-performing recognition models in a privacy-preserving manner. The source code for the ID-Booth framework is publicly available at https://github.com/dariant/ID-Booth.
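A small sketch of a triplet-style identity objective on face-recognition embeddings of generated, same-identity, and other-identity images; the embedding network, margin, and cosine-distance form are assumptions, not ID-Booth's exact objective.

```python
import torch
import torch.nn.functional as F

def triplet_identity_loss(gen_emb, same_id_emb, other_id_emb, margin=0.3):
    """Sketch: the generated face embedding should be closer to embeddings of the
    intended identity than to those of other identities (illustrative formulation)."""
    pos = 1.0 - F.cosine_similarity(gen_emb, same_id_emb, dim=-1)    # distance to own identity
    neg = 1.0 - F.cosine_similarity(gen_emb, other_id_emb, dim=-1)   # distance to other identity
    return F.relu(pos - neg + margin).mean()
```

Such a relative (triplet) constraint pushes generated images toward the intended identity without forcing them onto a single reference image, which is one plausible way to keep diversity while improving identity consistency.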
Authors: Michele Grimaldi, Nouf Alkaabi, Francesco Ruscio, Sebastian Realpe Rua, Rafael Garcia, Nuno Gracias
Abstract: Posidonia oceanica meadows are a species of seagrass highly dependent on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts.
Authors: Nanjie Yao, Gangjian Zhang, Wenhao Shen, Jian Shu, Hao Wang
Abstract: Monocular 3D clothed human reconstruction aims to create a complete 3D avatar from a single image. To compensate for the human geometry missing from a single RGB image, current methods typically resort to a preceding model for an explicit geometric representation, and the reconstruction itself then focuses on modeling both this representation and the input image. This routine is constrained by the preceding model and overlooks the integrity of the reconstruction task. To address this, this paper introduces a novel paradigm that treats human reconstruction as a holistic process, utilizing an end-to-end network for direct prediction from 2D image to 3D avatar and eliminating any explicit intermediate geometric representation. Based on this, we further propose a novel reconstruction framework consisting of two core components: the Anatomy Shaping Extraction module, which captures implicit shape features while taking into account the specifics of human anatomy, and the Twins Negotiating Reconstruction U-Net, which enhances reconstruction through feature interaction between two U-Nets of different modalities. Moreover, we propose a Comic Data Augmentation strategy and construct 15k+ 3D human scans to bolster model performance on more complex inputs. Extensive experiments on two test sets and many in-the-wild cases show the superiority of our method over SOTA methods. Our demos can be found at: https://e2e3dgsrecon.github.io/e2e3dgsrecon/.
Authors: Yechan Kim, DongHo Yoon, SooYeon Kim, Moongu Jeon
Abstract: Labeling errors in remote sensing (RS) image segmentation datasets often remain implicit and subtle due to ambiguous class boundaries, mixed pixels, shadows, complex terrain features, and subjective annotator bias. Furthermore, the scarcity of annotated RS data due to high image acquisition and labeling costs complicates training noise-robust models. While sophisticated mechanisms such as label selection or noise correction might address this issue, they tend to increase training time and add implementation complexity. In this letter, we propose NSegment, a simple yet effective data augmentation solution to mitigate this issue. Unlike traditional methods, it applies elastic transformations only to segmentation labels, varying deformation intensity per sample in each training epoch to address annotation inconsistencies. Experimental results demonstrate that our approach improves the performance of RS image segmentation on various state-of-the-art models.
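A minimal sketch of label-only elastic augmentation in the spirit described above: warp the segmentation mask with a random smooth displacement field while leaving the image untouched. The displacement parameters and per-epoch resampling policy are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_label_augment(label, alpha=30.0, sigma=6.0, rng=None):
    """Sketch: apply a random elastic deformation to a 2D segmentation mask only.
    `alpha` (deformation intensity) can be resampled per example and per epoch."""
    rng = rng or np.random.default_rng()
    h, w = label.shape
    # Smooth random displacement fields for rows (dy) and columns (dx).
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([ys + dy, xs + dx])
    # order=0 keeps labels categorical (nearest-neighbour sampling).
    return map_coordinates(label, coords, order=0, mode="nearest")
```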
Authors: Mahmudul Hasan, Sadia Ruhama, Sabrina Tajnim Sithi, Chowdhury Mohammad Mutamir Samit, Oindrila Saha
Abstract: Deepfake videos, now produced through advanced artificial intelligence methods, pose a new challenge to the truthfulness of digital media. As deepfakes become more convincing, detecting them requires advanced methods capable of identifying subtle inconsistencies. The primary goal of this paper is to recognize deepfake videos using deep learning techniques, specifically convolutional neural networks. Deep learning excels at pattern recognition, making it an ideal approach for detecting the intricate manipulations in deepfakes. In this paper, we use MTCNN as a face detector and EfficientNet-B5 as the encoder model to predict whether a video is a deepfake. We use the training and evaluation datasets from Kaggle DFDC. The results show that our deepfake detection model achieves a log loss of 42.78%, an AUC of 93.80%, and an F1 score of 86.82% on Kaggle's DFDC dataset.
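A rough sketch of the described pipeline using commonly available packages (facenet-pytorch for MTCNN and timm for EfficientNet-B5); the crop size, preprocessing, and single-frame scoring are simplifying assumptions rather than the authors' training setup, and the classifier head would still need to be fine-tuned on DFDC.

```python
import torch
from facenet_pytorch import MTCNN   # assumed face-detector package
import timm

device = "cuda" if torch.cuda.is_available() else "cpu"
mtcnn = MTCNN(image_size=456, device=device)            # detect and crop one face per frame
model = timm.create_model("efficientnet_b5", pretrained=True,
                          num_classes=1).to(device).eval()  # binary head, untrained here

def frame_score(pil_frame):
    """Return a fake-probability for a single video frame (illustrative only)."""
    face = mtcnn(pil_frame)                  # (3, 456, 456) tensor, or None if no face found
    if face is None:
        return None
    with torch.no_grad():
        logit = model(face.unsqueeze(0).to(device))
    return torch.sigmoid(logit).item()
```

Per-video predictions would then aggregate frame scores (e.g., by averaging), which is a common choice for frame-level deepfake classifiers.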
Authors: Jian Wang, Baoyuan Wu, Li Liu, Qingshan Liu
Abstract: The rapid evolution of generative AI has increased the threat of realistic audio-visual deepfakes, demanding robust detection methods. Existing solutions primarily address unimodal (audio or visual) forgeries but struggle with multimodal manipulations due to inadequate handling of heterogeneous modality features and poor generalization across datasets. To this end, we propose a novel framework called FauForensics that introduces biologically invariant facial action units (FAUs), which are quantitative descriptors of facial muscle activity linked to emotion physiology. They serve as forgery-resistant representations that reduce domain dependency while capturing subtle dynamics often disrupted in synthetic content. In addition, instead of comparing entire video clips as in prior works, our method computes fine-grained frame-wise audiovisual similarities via a dedicated fusion module augmented with learnable cross-modal queries. It dynamically aligns temporal-spatial lip-audio relationships while mitigating multi-modal feature heterogeneity issues. Experiments on FakeAVCeleb and LAV-DF show state-of-the-art (SOTA) performance and superior cross-dataset generalizability, with an average improvement of up to 4.83\% over existing methods.
Authors: Junjie Li, Jiawei Wang, Miyu Li, Yu Liu, Yumei Wang, Haitao Xu
Abstract: Depth estimation holds great potential for obstacle avoidance and navigation in future Mars exploration missions. Compared to traditional stereo matching, learning-based stereo depth estimation provides a data-driven approach to infer dense and precise depth maps from stereo image pairs. However, these methods often suffer performance degradation in environments with sparse textures and a lack of geometric constraints, such as the unstructured terrain of Mars. To address these challenges, we propose M3Depth, a depth estimation model tailored for Mars rovers. Considering the sparse and smooth texture of Martian terrain, which is primarily composed of low-frequency features, our model incorporates a convolutional kernel based on the wavelet transform that effectively captures low-frequency responses and expands the receptive field. Additionally, we introduce a consistency loss that explicitly models the complementary relationship between the depth map and the surface normal map, utilizing the surface normal as a geometric constraint to enhance the accuracy of depth estimation. Moreover, a pixel-wise refinement module with a mutual boosting mechanism is designed to iteratively refine both depth and surface normal predictions. Experimental results on synthetic Mars datasets with depth annotations show that M3Depth achieves a 16% improvement in depth estimation accuracy compared to other state-of-the-art depth estimation methods. Furthermore, the model demonstrates strong applicability in real-world Martian scenarios, offering a promising solution for future Mars exploration missions.
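A minimal sketch of a depth/surface-normal consistency term of the kind described: normals are derived from depth gradients and compared against the predicted normal map with a cosine penalty. An orthographic approximation is assumed, and this is not the paper's exact loss or camera model.

```python
# Sketch of a depth/normal consistency loss: normals derived from depth gradients are
# compared to predicted normals with a cosine penalty. Orthographic approximation assumed;
# not the paper's exact formulation.
import torch
import torch.nn.functional as F

def normals_from_depth(depth):                     # depth: (B, 1, H, W)
    dzdx = depth[..., :, 1:] - depth[..., :, :-1]  # horizontal gradient
    dzdy = depth[..., 1:, :] - depth[..., :-1, :]  # vertical gradient
    dzdx = F.pad(dzdx, (0, 1, 0, 0))               # pad back to (H, W)
    dzdy = F.pad(dzdy, (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)                   # (B, 3, H, W) unit normals

def consistency_loss(pred_depth, pred_normals):
    n_from_depth = normals_from_depth(pred_depth)
    cos = (n_from_depth * F.normalize(pred_normals, dim=1)).sum(dim=1)
    return (1.0 - cos).mean()

loss = consistency_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 3, 64, 64))
```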
Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Abstract: Despite recent advances in multimodal content generation enabled by vision-language models (VLMs), their ability to reason about and generate structured 3D scenes remains largely underexplored. This limitation constrains their utility in spatially grounded tasks such as embodied AI, immersive simulations, and interactive 3D applications. We introduce a new paradigm that enables VLMs to generate, understand, and edit complex 3D environments by injecting a continually evolving spatial context. Constructed from multimodal input, this context consists of three components: a scene portrait that provides a high-level semantic blueprint, a semantically labeled point cloud capturing object-level geometry, and a scene hypergraph that encodes rich spatial relationships, including unary, binary, and higher-order constraints. Together, these components provide the VLM with a structured, geometry-aware working memory that integrates its inherent multimodal reasoning capabilities with structured 3D understanding for effective spatial reasoning. Building on this foundation, we develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. The pipeline features high-quality asset generation with geometric restoration, environment setup with automatic verification, and ergonomic adjustment guided by the scene hypergraph. Experiments show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work. Further results demonstrate that injecting spatial context enables VLMs to perform downstream tasks such as interactive scene editing and path planning, suggesting strong potential for spatially intelligent systems in computer graphics, 3D vision, and embodied applications. Project page: https://spatctxvlm.github.io/project_page/.
Authors: Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji
Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
Authors: Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma
Abstract: We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions (just a fraction of CLIP's original dataset), DCLIP significantly improves image-text retrieval metrics (Recall@K, MAP), while retaining approximately 94% of CLIP's zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at https://anonymous.4open.science/r/DCLIP-B772/README.md.
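A minimal sketch of a hybrid distillation objective of the kind described: a symmetric InfoNCE-style contrastive term over the student's image/text embeddings plus a cosine term pulling student embeddings toward the teacher's. The temperature and mixing weight are illustrative assumptions, not DCLIP's actual values.

```python
# Sketch of a hybrid distillation loss: contrastive (InfoNCE-style) alignment between
# student image/text embeddings plus cosine alignment to teacher embeddings.
# tau and lam are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def hybrid_distill_loss(img_s, txt_s, img_t, txt_t, tau=0.07, lam=0.5):
    img_s, txt_s = F.normalize(img_s, dim=-1), F.normalize(txt_s, dim=-1)
    img_t, txt_t = F.normalize(img_t, dim=-1), F.normalize(txt_t, dim=-1)

    logits = img_s @ txt_s.t() / tau                 # (B, B) image-text similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    cosine = 1.0 - 0.5 * ((img_s * img_t).sum(-1).mean() +
                          (txt_s * txt_t).sum(-1).mean())
    return contrastive + lam * cosine

loss = hybrid_distill_loss(torch.randn(8, 512), torch.randn(8, 512),
                           torch.randn(8, 512), torch.randn(8, 512))
```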
URLs: https://anonymous.4open.science/r/DCLIP-B772/README.md.
Authors: Mika Feng, Koichi Ito, Takafumi Aoki, Tetsushi Ohki, Masakatsu Nishigaki
Abstract: Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may bypass the authentication process illegally. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which have a good balance between local and global features that are important for spoofing attack detection, for calculating loss in training and score in inference. The proposed method also introduces two data augmentation methods: face anti-spoofing data augmentation and patch-wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU-NPU and SiW datasets. The project page is available at: https://gsisaoki.github.io/FAS-ViT-CVPRW/ .
Authors: Junwen Huang, Jizhong Liang, Jiaqi Hu, Martin Sundermeyer, Peter KT Yu, Nassir Navab, Benjamin Busam
Abstract: We introduce XYZ-IBD, a bin-picking dataset for 6D pose estimation that captures real-world industrial complexity, including challenging object geometries, reflective materials, severe occlusions, and dense clutter. The dataset reflects authentic robotic manipulation scenarios with millimeter-accurate annotations. Unlike existing datasets that primarily focus on household objects, which approach saturation, XYZ-IBD represents unsolved, realistic industrial conditions. The dataset features 15 texture-less, metallic, and mostly symmetrical objects of varying shapes and sizes. These objects are heavily occluded and randomly arranged in bins with high density, replicating the challenges of real-world bin-picking. XYZ-IBD was collected using two high-precision industrial cameras and one commercially available camera, providing RGB, grayscale, and depth images. It contains 75 multi-view real-world scenes, along with a large-scale synthetic dataset rendered under simulated bin-picking conditions. We employ a meticulous annotation pipeline that includes anti-reflection spray, multi-view depth fusion, and semi-automatic annotation, achieving millimeter-level pose labeling accuracy required for industrial manipulation. Quantification in simulated environments confirms the reliability of the ground-truth annotations. We benchmark state-of-the-art methods on 2D detection, 6D pose estimation, and depth estimation tasks on our dataset, revealing significant performance degradation in our setups compared to current academic household benchmarks. By capturing the complexity of real-world bin-picking scenarios, XYZ-IBD introduces more realistic and challenging problems for future research. The dataset and benchmark are publicly available at https://xyz-ibd.github.io/XYZ-IBD/.
Authors: Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
Abstract: Existing large language models (LLMs) face challenges of following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Codes and data will be available later (under review). Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
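A minimal sketch of a rule-centric verifiable reward in this spirit: each constraint is a programmatic check on the model's response, and the reward is the fraction of constraints satisfied. The specific constraint types and weighting are illustrative assumptions, not the paper's reward design.

```python
# Sketch of a verifiable, rule-centric reward: each constraint is checked programmatically
# and the reward is the satisfied fraction. The constraint catalogue is illustrative.
import re

def rule_reward(response: str, constraints: list) -> float:
    checks = {
        "max_words":    lambda r, v: len(r.split()) <= v,
        "must_include": lambda r, v: v.lower() in r.lower(),
        "bullet_list":  lambda r, _: bool(re.search(r"^\s*[-*] ", r, re.MULTILINE)),
    }
    passed = sum(checks[name](response, value) for name, value in constraints)
    return passed / max(len(constraints), 1)

# Example: the response must stay under 50 words, mention "Python", and use bullets.
r = "- Python is great\n- It is concise"
print(rule_reward(r, [("max_words", 50), ("must_include", "Python"), ("bullet_list", None)]))
```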
Authors: Erhang Zhang, Junyi Ma, Yin-Dong Zheng, Yixuan Zhou, Hesheng Wang
Abstract: Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
Authors: Tan-Hanh Pham, Chris Ngo
Abstract: The growing integration of vision-language models (VLMs) in medical applications offers promising support for diagnostic reasoning. However, current medical VLMs often face limitations in generalization, transparency, and computational efficiency, barriers that hinder deployment in real-world, resource-constrained settings. To address these challenges, we propose a Reasoning-Aware Reinforcement Learning framework, \textbf{RARL}, that enhances the reasoning capabilities of medical VLMs while remaining efficient and adaptable to low-resource environments. Our approach fine-tunes a lightweight base model, Qwen2-VL-2B-Instruct, using Low-Rank Adaptation and custom reward functions that jointly consider diagnostic accuracy and reasoning quality. Training is performed on a single NVIDIA A100-PCIE-40GB GPU, demonstrating the feasibility of deploying such models in constrained environments. We evaluate the model using an LLM-as-judge framework that scores both correctness and explanation quality. Experimental results show that RARL significantly improves VLM performance in medical image analysis and clinical reasoning, outperforming supervised fine-tuning on reasoning-focused tasks by approximately 7.78%, while requiring fewer computational resources. Additionally, we demonstrate the generalization capabilities of our approach on unseen datasets, achieving around 27% improved performance compared to supervised fine-tuning and about 4% over traditional RL fine-tuning. Our experiments also illustrate that diversity prompting during training and reasoning prompting during inference are crucial for enhancing VLM performance. Our findings highlight the potential of reasoning-guided learning and reasoning prompting to steer medical VLMs toward more transparent, accurate, and resource-efficient clinical decision-making. Code and data are publicly available.
Authors: Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang
Abstract: We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level visual references, enabling the model to respond to evolving generation needs while avoiding limitations (e.g., over-copying, stylistic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.
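A minimal sketch of the training-free distribution-merging idea behind DAiD as described above: at each decoding step the model's distribution over the patch vocabulary is blended with an empirical distribution built from retrieved neighbour patches. The mixing weight, vocabulary size, and the way neighbours are turned into a distribution are assumptions for illustration.

```python
# Sketch of distribution merging at one decoding step: blend the model's patch
# distribution with an empirical distribution of retrieved neighbour patch tokens.
# alpha, vocab_size, and the neighbour handling are illustrative assumptions.
import torch

def merge_and_sample(model_logits, retrieved_patch_ids, vocab_size, alpha=0.3):
    model_probs = torch.softmax(model_logits, dim=-1)            # (vocab_size,)
    retrieval_probs = torch.zeros(vocab_size)
    retrieval_probs.scatter_add_(0, retrieved_patch_ids,
                                 torch.ones_like(retrieved_patch_ids, dtype=torch.float))
    retrieval_probs /= retrieval_probs.sum().clamp(min=1.0)      # empirical neighbour distribution
    merged = (1 - alpha) * model_probs + alpha * retrieval_probs
    return torch.multinomial(merged, 1)                          # sample the next patch token

next_patch = merge_and_sample(torch.randn(8192), torch.tensor([5, 5, 17]), vocab_size=8192)
```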
Authors: Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
Abstract: Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose a model-agnostic, multi-agent framework MAGNET to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between a ground-truth and a predicted step sequence, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance. Project: https://schowdhury671.github.io/magnet_project/
Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer
Abstract: Saliency methods are widely used to visualize which input features are deemed relevant to a model's prediction. However, their visual plausibility can obscure critical limitations. In this work, we propose a diagnostic test for class sensitivity: a method's ability to distinguish between competing class labels on the same input. Through extensive experiments, we show that many widely used saliency methods produce nearly identical explanations regardless of the class label, calling into question their reliability. We find that class-insensitive behavior persists across architectures and datasets, suggesting the failure mode is structural rather than model-specific. Motivated by these findings, we introduce CASE, a contrastive explanation method that isolates features uniquely discriminative for the predicted class. We evaluate CASE using the proposed diagnostic and a perturbation-based fidelity test, and show that it produces faithful and more class-specific explanations than existing methods.
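A minimal sketch of a class-sensitivity check of the kind described: compute a simple gradient saliency map for two competing class labels on the same input and measure how similar the two explanations are. The backbone, the two class indices, and the use of plain gradient saliency are assumptions for illustration, not the paper's protocol.

```python
# Sketch of a class-sensitivity diagnostic: gradient saliency for two competing labels
# on the same input, then their cosine similarity. High similarity across classes is the
# failure mode discussed above. Model and class indices are placeholders.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights="DEFAULT").eval()
x = torch.rand(1, 3, 224, 224, requires_grad=True)

def saliency(target_class):
    model.zero_grad()
    x.grad = None
    score = model(x)[0, target_class]
    score.backward()
    return x.grad.abs().max(dim=1).values.flatten()   # per-pixel saliency, flattened

s_a = saliency(target_class=207)                      # two competing ImageNet labels
s_b = saliency(target_class=281)
print("cosine similarity between class explanations:",
      F.cosine_similarity(s_a, s_b, dim=0).item())
```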
Authors: Jie He, Minglang Chen, Minying Lu, Bocheng Liang, Junming Wei, Guiyan Peng, Jiaxi Chen, Ying Tan
Abstract: Accurate ultrasound image segmentation is a prerequisite for precise biometrics and accurate assessment. Relying on manual delineation introduces significant errors and is time-consuming. However, existing segmentation models are designed based on objects in natural scenes, making them difficult to adapt to ultrasound objects with high noise and high similarity. This is particularly evident in small object segmentation, where a pronounced jagged effect occurs. Therefore, this paper proposes a fetal femur and cranial ultrasound image segmentation model based on feature perception and Mamba enhancement to address these challenges. Specifically, a longitudinal and transverse independent viewpoint scanning convolution block and a feature perception module were designed to enhance the ability to capture local detail information and improve the fusion of contextual information. Combined with the Mamba-optimized residual structure, this design suppresses the interference of raw noise and enhances local multi-dimensional scanning. The system builds global information and local feature dependencies, and is trained with a combination of different optimizers to achieve the optimal solution. After extensive experimental validation, the FAMSeg network achieved the fastest loss reduction and the best segmentation performance across images of varying sizes and orientations.
Authors: Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Abstract: We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
Authors: Haoxiang Wang, Zinan Lin, Da Yu, Huishuai Zhang
Abstract: Generating high-fidelity, differentially private (DP) synthetic images offers a promising route to share and analyze sensitive visual data without compromising individual privacy. However, existing DP image synthesis methods struggle to produce high-resolution outputs that faithfully capture the structure of the original data. In this paper, we introduce a novel method, referred to as Synthesis via Private Textual Intermediaries (SPTI), that can generate high-resolution DP images with easy adoption. The key idea is to shift the challenge of DP image synthesis from the image domain to the text domain by leveraging state-of-the-art DP text generation methods. SPTI first summarizes each private image into a concise textual description using image-to-text models, then applies a modified Private Evolution algorithm to generate DP text, and finally reconstructs images using text-to-image models. Notably, SPTI requires no model training, only inference with off-the-shelf models. Given a private dataset, SPTI produces synthetic images of substantially higher quality than prior DP approaches. On the LSUN Bedroom dataset, SPTI attains an FID $\leq$ 26.71 at $\epsilon = 1.0$, improving over the Private Evolution FID of 40.36. Similarly, on MM-CelebA-HQ, SPTI achieves an FID $\leq$ 33.27 at $\epsilon = 1.0$, compared to 57.01 from DP fine-tuning baselines. Overall, our results demonstrate that Synthesis via Private Textual Intermediaries provides a resource-efficient and proprietary-model-compatible framework for generating high-resolution DP synthetic images, greatly expanding access to private visual datasets.
Authors: Jianhui Wei, Zikai Xiao, Danyu Sun, Luqi Gong, Zongxin Yang, Zuozhu Liu, Jian Wu
Abstract: Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce \textbf{SurgBench}, a unified surgical video benchmarking framework comprising a pretraining dataset, \textbf{SurgBench-P}, and an evaluation benchmark, \textbf{SurgBench-E}. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.
Authors: Huixin Zhan, Jason H. Moore
Abstract: Surgeons exhibit distinct operating styles shaped by training, experience, and motor behavior, yet most surgical AI systems overlook this personalization signal. We propose a novel agentic modeling approach for surgeon-specific behavior prediction in robotic surgery, combining a discrete diffusion framework with a vision-language-action (VLA) pipeline. Gesture prediction is framed as a structured sequence denoising task, conditioned on multimodal inputs including surgical video, intent language, and personalized embeddings of surgeon identity and skill. These embeddings are encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings underscore the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.
URLs: https://github.com/huixin-zhan-ai/Surgeon_style_fingerprinting.
Authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
Abstract: We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: https://github.com/ananthu-aniraj/ifam
Authors: Chunming He, Kai Li, Yachao Zhang, Ziyun Yang, Youwei Pang, Longxiang Tang, Chengyu Fang, Yulun Zhang, Linghe Kong, Xiu Li, Sina Farsiu
Abstract: Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments, utilizing incompletely annotated data, such as weak and semi-annotations, for model training. This task remains highly challenging due to (1) the limited supervision provided by the incompletely annotated training data, and (2) the difficulty of distinguishing concealed objects from the background, which arises from the intrinsic similarities in concealed scenarios. In this paper, we introduce the first unified method for ISCOS to address these challenges. To tackle the issue of incomplete supervision, we propose a unified mean-teacher framework, SEE, that leverages the vision foundation model, ``\emph{Segment Anything Model (SAM)}'', to generate pseudo-labels using coarse masks produced by the teacher model as prompts. To mitigate the effect of low-quality segmentation masks, we introduce a series of strategies for pseudo-label generation, storage, and supervision. These strategies aim to produce informative pseudo-labels, store the best pseudo-labels generated, and select the most reliable components to guide the student model, thereby ensuring robust network training. Additionally, to tackle the issue of intrinsic similarity, we design a hybrid-granularity feature grouping module that groups features at different granularities and aggregates these results. By clustering similar features, this module promotes segmentation coherence, facilitating more complete segmentation for both single-object and multiple-object images. We validate the effectiveness of our approach across multiple ISCOS tasks, and experimental results demonstrate that our method achieves state-of-the-art performance. Furthermore, SEE can serve as a plug-and-play solution, enhancing the performance of existing models.
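A minimal sketch of the mean-teacher component of such a framework: the teacher is an exponential moving average (EMA) of the student, whose coarse masks would then prompt SAM for pseudo-labels. Only the EMA update is shown; the SAM prompting, pseudo-label storage, and selection strategies described above are omitted, and the stand-in network is hypothetical.

```python
# Sketch of the mean-teacher EMA update: the teacher's weights track an exponential
# moving average of the student's after each optimisation step. The tiny stand-in
# "segmentation network" below is only a placeholder.
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = torch.nn.Conv2d(3, 1, 3, padding=1)      # stand-in segmentation network
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

# after each student optimisation step:
ema_update(teacher, student)
```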
Authors: Dingcheng Zhen, Qian Qiao, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao
Abstract: We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Frechet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides 2x faster inference compared to state-of-the-art methods based on AR Transformer and 112x faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.
Authors: Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, Xingang Wang
Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
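A minimal sketch of the group-relative advantage computation at the core of Group Relative Policy Optimization as described above: several rollouts are sampled per prompt, and each rollout's reward is normalised against its own group's statistics. The reward values and group size are placeholders; the full policy-gradient update is not shown.

```python
# Sketch of GRPO's group-relative advantages: normalise each rollout's reward by the
# mean and std of its own group (rollouts sampled for the same prompt).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    # rewards: (num_prompts, group_size) rewards for sampled rollouts per prompt
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # advantage of each rollout within its group

rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],    # prompt 1: four sampled motions/rollouts
                        [0.1, 0.1, 0.4, 0.2]])   # prompt 2
print(group_relative_advantages(rewards))
```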
Authors: Guoyi Zhang, Guangsheng Xu, Siyang Chen, Han Wang, Xiaohu Zhang
Abstract: Infrared small target detection (IRSTD) remains a long-standing challenge in complex backgrounds due to low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. While recent deep learning approaches aim to learn discriminative representations, the intrinsic variability and weak priors of small targets often lead to unstable performance. In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet, which leverages the low-rank property of infrared image backgrounds. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression--reconstruction--subtraction (CRS) paradigm to directly model structure-aware low-rank background representations in the image domain, without relying on patch-based processing or explicit matrix decomposition. To the best of our knowledge, this is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency. Remarkably, it achieves real-time performance with an average speed of 82.34 FPS. Evaluations on the challenging NoisySIRST dataset further confirm the model's resilience to sensor noise. The source code will be made publicly available upon acceptance.
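A minimal sketch of the compression-reconstruction-subtraction (CRS) idea in its simplest form: a small autoencoder reconstructs an estimate of the low-rank background, and the residual between input and reconstruction highlights candidate small targets. The architecture, bottleneck size, and the omission of any training objective are assumptions for illustration, not LRRNet's design.

```python
# Sketch of compression-reconstruction-subtraction: an autoencoder estimates the
# low-rank background; subtracting it from the input yields a target response map.
# Layer sizes are illustrative; the training objective is omitted.
import torch
import torch.nn as nn

class BackgroundAE(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.encode = nn.Sequential(                     # compress to a low-dimensional background code
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.decode = nn.Sequential(                     # reconstruct the background estimate
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1))

    def forward(self, x):
        background = self.decode(self.encode(x))
        residual = torch.relu(x - background)            # subtract: candidate small-target response
        return background, residual

img = torch.rand(1, 1, 256, 256)                         # dummy infrared frame
bg, target_map = BackgroundAE()(img)
```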
Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Ji-Rong Wen, Zhicheng Dou
Abstract: Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
Authors: Jin Kim, Muhammad Wahi-Anwa, Sangyun Park, Shawn Shin, John M. Hoffman, Matthew S. Brown
Abstract: Agentic Artificial Intelligence (AI) systems leveraging Large Language Models (LLMs) exhibit significant potential for complex reasoning, planning, and tool utilization. We demonstrate that a specialized computer vision system can be built autonomously from a natural language prompt using Agentic AI methods. This involved extending SimpleMind (SM), an open-source Cognitive AI environment with configurable tools for medical image analysis, with an LLM-based agent, implemented using OpenManus, to automate the planning (tool configuration) for a particular computer vision task. We provide a proof-of-concept demonstration that an agentic system can interpret a computer vision task prompt, plan a corresponding SimpleMind workflow by decomposing the task and configuring appropriate tools. From the user input prompt, "provide sm (SimpleMind) config for lungs, heart, and ribs segmentation for cxr (chest x-ray)", the agent LLM was able to generate the plan (a tool configuration file in YAML format) and execute SM-Learn (training) and SM-Think (inference) scripts autonomously. The computer vision agent automatically configured, trained, and tested itself on 50 chest x-ray images, achieving mean Dice scores of 0.96, 0.82, and 0.83 for lungs, heart, and ribs, respectively. This work shows the potential for autonomous planning and tool configuration that has traditionally been performed by a data scientist in the development of computer vision applications.
Authors: Zikang Leng, Megha Thukral, Yaqi Liu, Hrudhai Rajasekhar, Shruthi K. Hiremath, Thomas Pl\"otz
Abstract: A major obstacle in developing robust and generalizable smart home-based Human Activity Recognition (HAR) systems is the lack of large-scale, diverse labeled datasets. Variability in home layouts, sensor configurations, and user behavior adds further complexity, as individuals follow varied routines and perform activities in distinct ways. Building HAR systems that generalize well requires training data that captures the diversity across users and environments. To address these challenges, we introduce AgentSense, a virtual data generation pipeline where diverse personas are generated by leveraging Large Language Models. These personas are used to create daily routines, which are then decomposed into low-level action sequences. Subsequently, the actions are executed in a simulated home environment called VirtualHome that we extended with virtual ambient sensors capable of recording the agents' activities as they unfold. Overall, AgentSense enables the generation of rich, virtual sensor datasets that represent a wide range of users and home settings. Across five benchmark HAR datasets, we show that leveraging our virtual sensor data substantially improves performance, particularly when real data are limited. Notably, models trained on a combination of virtual data and just a few days of real data achieve performance comparable to those trained on the entire real datasets. These results demonstrate the potential of virtual data to address one of the most pressing challenges in ambient sensing, the distinct lack of large-scale annotated datasets, without requiring any manual data collection effort.
Authors: Nishan Gunawardena, Gough Yumu Lui, Bahman Javadi, Jeewani Anupama Ginige
Abstract: This study evaluates a smartphone-based, deep-learning eye-tracking algorithm by comparing its performance against a commercial infrared-based eye tracker, the Tobii Pro Nano. The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions. Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were systematically analysed. The appearance-based algorithm integrates a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images. Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured using Euclidean distance. The deep learning model produced a mean error of 17.76 mm, compared to 16.53 mm for the Tobii Pro Nano. While overall accuracy differences were small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions among participants using glasses and in older age groups. Device-specific and positional factors also influenced tracking performance. These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.
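A minimal sketch of the described appearance-based architecture: a MobileNet-V3 backbone encodes each face frame and an LSTM aggregates the sequence before a linear head regresses (x, y) gaze coordinates. Hidden sizes, sequence length, and the grayscale-to-RGB handling are assumptions, not the evaluated model's exact configuration.

```python
# Sketch of a MobileNet-V3 + LSTM gaze regressor: per-frame CNN features, temporal LSTM,
# and a 2-D coordinate head. Hidden sizes and the channel replication are illustrative.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class GazeNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        backbone = mobilenet_v3_small(weights="DEFAULT")
        backbone.classifier = nn.Identity()              # keep the 576-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(576, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                 # (x, y) gaze coordinates

    def forward(self, frames):                           # frames: (B, T, 1, H, W) grayscale
        b, t = frames.shape[:2]
        x = frames.repeat(1, 1, 3, 1, 1).flatten(0, 1)   # replicate the gray channel for the backbone
        feats = self.backbone(x).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])                     # gaze for the last frame

pred = GazeNet()(torch.rand(2, 8, 1, 224, 224))          # (2, 2) predicted coordinates
# accuracy in mm would be the Euclidean distance to the ground-truth gaze point
```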
Authors: Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao
Abstract: In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer the question or reasoning solely on the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we conduct a large-scale SFT dataset called VGR -SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference and a replay stage is introduced to integrates the corresponding regions into the reasoning process, enhancing multimodel comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30\% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and a +12.9 improvement on ChartQA.
Authors: Hanxue Gu, Yaqian Chen, Jisoo Lee, Diego Schaps, Regina Woody, Roy Colglazier, Maciej A. Mazurowski, Christopher Mantyh
Abstract: Objective: To evaluate whether preoperative body composition metrics automatically extracted from CT scans can predict postoperative outcomes after colectomy, either alone or combined with clinical variables or existing risk predictors. Main outcomes and measures: The primary outcome was the predictive performance for 1-year all-cause mortality following colectomy. A Cox proportional hazards model with 1-year follow-up was used, and performance was evaluated using the concordance index (C-index) and Integrated Brier Score (IBS). Secondary outcomes included postoperative complications, unplanned readmission, blood transfusion, and severe infection, assessed using AUC and Brier Score from logistic regression. Odds ratios (OR) described associations between individual CT-derived body composition metrics and outcomes. Over 300 features were extracted from preoperative CTs across multiple vertebral levels, including skeletal muscle area, density, fat areas, and inter-tissue metrics. NSQIP scores were available for all surgeries after 2012.
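A minimal sketch of the primary analysis described above: a Cox proportional hazards model on CT-derived body-composition features with 1-year follow-up, evaluated by the concordance index. The column names and the synthetic data are hypothetical; this is not the study's cohort or feature set.

```python
# Sketch of the survival setup: Cox PH on synthetic body-composition features with
# 1-year follow-up, evaluated by the C-index. All columns and data are placeholders.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "muscle_area":    rng.normal(150, 25, 300),   # hypothetical CT-derived features
    "muscle_density": rng.normal(35, 8, 300),
    "visceral_fat":   rng.normal(120, 40, 300),
    "age":            rng.integers(40, 90, 300),
    "time_days":      rng.integers(30, 365, 300), # follow-up capped at 1 year
    "event":          rng.integers(0, 2, 300),    # 1 = death within follow-up
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time_days", event_col="event")
c_index = concordance_index(df["time_days"],
                            -cph.predict_partial_hazard(df),  # higher hazard = shorter survival
                            df["event"])
print(f"C-index: {c_index:.3f}")
```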
Authors: M. Alex O. Vasilescu
Abstract: We derive a set of causal deep neural networks whose architectures are a consequence of tensor (multilinear) factor analysis, a framework that facilitates causal inference. Forward causal questions are addressed with a neural network architecture composed of causal capsules and a tensor transformer. Causal capsules compute a set of invariant causal factor representations, whose interactions are governed by a tensor transformation. Inverse causal questions are addressed with a neural network that implements the multilinear projection algorithm. The architecture reverses the order of operations of a forward neural network and estimates the causes of effects. As an alternative to aggressive bottleneck dimension reduction or regularized regression that may camouflage an inherently underdetermined inverse problem, we prescribe modeling different aspects of the mechanism of data formation with piecewise tensor models whose multilinear projections produce multiple candidate solutions. Our forward and inverse questions may be addressed with shallow architectures, but for computationally scalable solutions, we derive a set of deep neural networks by taking advantage of block algebra. An interleaved kernel hierarchy results in doubly non-linear tensor factor models. The causal neural networks that are a consequence of tensor factor analysis are data agnostic, but are illustrated with facial images. Sequential, parallel and asynchronous parallel computation strategies are described.
Authors: Vivien van Veldhuizen, Sharvaree Vadgama, Onno J. de Boer, Sybren Meijer, Erik J. Bekkers
Abstract: Early detection of Barrett's Esophagus (BE), the only known precursor to Esophageal adenocarcinoma (EAC), is crucial for effectively preventing and treating esophageal cancer. In this work, we investigate the potential of geometric Variational Autoencoders (VAEs) to learn a meaningful latent representation that captures the progression of BE. We show that hyperspherical VAE (S-VAE) and Kendall Shape VAE show improved classification accuracy, reconstruction loss, and generative capacity. Additionally, we present a novel autoencoder architecture that can generate qualitative images without the need for a variational framework while retaining the benefits of an autoencoder, such as improved stability and reconstruction quality.
Authors: Hsiu-Hsuan Wang, Tan-Ha Mai, Nai-Xuan Ye, Wei-I Lin, Hsuan-Tien Lin
Abstract: Complementary-label learning (CLL) is a weakly-supervised learning paradigm that aims to train a multi-class classifier using only complementary labels, which indicate classes to which an instance does not belong. Despite numerous algorithmic proposals for CLL, their practical applicability remains unverified for two reasons. Firstly, these algorithms often rely on assumptions about the generation of complementary labels, and it is not clear how far the assumptions are from reality. Secondly, their evaluation has been limited to synthetically labeled datasets. To gain insights into the real-world performance of CLL algorithms, we developed a protocol to collect complementary labels from human annotators. Our efforts resulted in the creation of four datasets: CLCIFAR10, CLCIFAR20, CLMicroImageNet10, and CLMicroImageNet20, derived from well-known classification datasets CIFAR10, CIFAR100, and TinyImageNet200. These datasets represent the very first real-world CLL datasets, namely CLImage, which are publicly available at: https://github.com/ntucllab/CLImage_Dataset. Through extensive benchmark experiments, we discovered a notable decrease in performance when transitioning from synthetically labeled datasets to real-world datasets. We investigated the key factors contributing to the decrease with a thorough dataset-level ablation study. Our analyses highlight annotation noise as the most influential factor in the real-world datasets. In addition, we discover that the biased nature of human-annotated complementary labels and the difficulty of validating models with only complementary labels are two outstanding barriers to practical CLL. These findings suggest that the community should focus more research effort on developing CLL algorithms and validation schemes that are robust to noisy and biased complementary-label distributions.
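A minimal sketch of a classical complementary-label loss under the uniform generation assumption (one of the assumptions these datasets let you test empirically): the ordinary class posterior is mapped through the transition matrix with off-diagonal entries 1/(K-1), and cross-entropy is taken on the complementary label. This is a generic baseline, not the paper's own algorithm.

```python
# Sketch of a forward-corrected complementary-label loss under the uniform assumption:
# p_bar(ybar) = sum_y p(y) * Q[y, ybar], with Q[y, ybar] = 1/(K-1) for ybar != y.
import torch
import torch.nn.functional as F

def forward_cll_loss(logits, comp_labels, num_classes):
    p = torch.softmax(logits, dim=1)                              # ordinary class posterior
    Q = (1.0 - torch.eye(num_classes, device=logits.device)) / (num_classes - 1)
    p_bar = p @ Q                                                 # probability of each complementary label
    return F.nll_loss(torch.log(p_bar + 1e-12), comp_labels)

logits = torch.randn(4, 10)                                       # e.g. CLCIFAR10: 10 classes
comp_labels = torch.tensor([3, 7, 0, 9])                          # classes the samples do NOT belong to
loss = forward_cll_loss(logits, comp_labels, num_classes=10)
```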
Authors: Ronald B. Liu, Zhe Liu, Max G. A. Wolf, Krishna P. Purohit, Gregor Fritz, Yi Feng, Carsten G. Hansen, Pierre O. Bagnaninchi, Xavier Casadevall i Solvas, Yunjie Yang
Abstract: Advancements in high-throughput biomedical applications require real-time, large field-of-view (FOV) imaging. While current 2D lens-free imaging (LFI) systems improve FOV, they are often hindered by time-consuming multi-position measurements, extensive data pre-processing, and strict optical parameterization, limiting their application to static, thin samples. To overcome these limitations, we introduce GenLFI, combining a generative unsupervised physics-informed neural network (PINN) with a large FOV LFI setup for straightforward holographic image reconstruction, without multi-measurement. GenLFI enables real-time 2D imaging for 3D samples, such as droplet-based microfluidics and 3D cell models, in dynamic complex optical fields. Unlike previous methods, our approach decouples the reconstruction algorithm from optical setup parameters, enabling a large FOV limited only by hardware. We demonstrate a real-time FOV exceeding 550 mm$^2$, over 20 times larger than current real-time LFI systems. This framework unlocks the potential of LFI systems, providing a robust tool for advancing automated high-throughput biomedical applications.
Authors: Luzhe Huang, Yuzhu Li, Nir Pillar, Tal Keidar Haran, William Dean Wallace, Aydogan Ozcan
Abstract: Histopathological staining of human tissue is essential for disease diagnosis. Recent advances in virtual tissue staining technologies using artificial intelligence (AI) alleviate some of the costly and tedious steps involved in traditional histochemical staining processes, permitting multiplexed staining and tissue preservation. However, potential hallucinations and artifacts in these virtually stained tissue images pose concerns, especially for the clinical uses of these approaches. Quality assessment of histology images by experts can be subjective. Here, we present an autonomous quality and hallucination assessment method, AQuA, for virtual tissue staining and digital pathology. AQuA autonomously achieves 99.8% accuracy when detecting acceptable and unacceptable virtually stained tissue images without access to histochemically stained ground truth, and presents an agreement of 98.5% with the manual assessments made by board-certified pathologists, including identifying realistic-looking images that could mislead diagnosticians. We demonstrate the wide adaptability of AQuA across various virtually and histochemically stained human tissue images. This framework enhances the reliability of virtual tissue staining and provides autonomous quality assurance for image generation and transformation tasks in digital pathology and computational imaging.
Authors: Zhengsen Xu, Jonathan Li, Sibo Cheng, Xue Rui, Yu Zhao, Hongjie He, Haiyan Guan, Aryan Sharma, Matthew Erxleben, Ryan Chang, Linlin Xu
Abstract: Wildfires pose a significant threat to ecosystems, wildlife, and human communities, leading to habitat destruction, pollutant emissions, and biodiversity loss. Accurate wildfire risk prediction is crucial for mitigating these impacts and safeguarding both environmental and human health. This paper provides a comprehensive review of wildfire risk prediction methodologies, with a particular focus on deep learning approaches combined with remote sensing. We begin by defining wildfire risk and summarizing the geographical distribution of related studies. In terms of data, we analyze key predictive features, including fuel characteristics, meteorological and climatic conditions, socioeconomic factors, topography, and hydrology, while also reviewing publicly available wildfire prediction datasets derived from remote sensing. Additionally, we emphasize the importance of feature collinearity assessment and model interpretability to improve the understanding of prediction outcomes. Regarding methodology, we classify deep learning models into three primary categories: time-series forecasting, image segmentation, and spatiotemporal prediction, and further discuss methods for converting model outputs into risk classifications or probability-adjusted predictions. Finally, we identify the key challenges and limitations of current wildfire-risk prediction models and outline several research opportunities. These include integrating diverse remote sensing data, developing multimodal models, designing more computationally efficient architectures, and incorporating cross-disciplinary methods--such as coupling with numerical weather-prediction models--to enhance the accuracy and robustness of wildfire-risk assessments.
Authors: Tajamul Ashraf, Janibul Bashir
Abstract: Climate change stands as one of the most pressing global challenges of the twenty-first century, with far-reaching consequences such as rising sea levels, melting glaciers, and increasingly extreme weather patterns. Accurate forecasting is critical for monitoring these phenomena and supporting mitigation strategies. While recent data-driven models for time-series forecasting, including CNNs, RNNs, and attention-based transformers, have shown promise, they often struggle with sequential dependencies and limited parallelization, especially in long-horizon, multivariate meteorological datasets. In this work, we present Focal Modulated Attention Encoder (FATE), a novel transformer architecture designed for reliable multivariate time-series forecasting. Unlike conventional models, FATE introduces a tensorized focal modulation mechanism that explicitly captures spatiotemporal correlations in time-series data. We further propose two modulation scores that offer interpretability by highlighting critical environmental features influencing predictions. We benchmark FATE across seven diverse real-world datasets, including ETTh1, ETTm2, Traffic, Weather5k, USA-Canada, Europe, and LargeST, and show that it consistently outperforms all state-of-the-art methods, including on the temperature datasets. Our ablation studies also demonstrate that FATE generalizes well to broader multivariate time-series forecasting tasks. For reproducible research, code is released at https://github.com/Tajamul21/FATE.
Authors: Abhijeet Patil, Garima Jain, Harsh Diwakar, Jay Sawant, Tripti Bameta, Swapnil Rane, Amit Sethi
Abstract: We developed a software pipeline for quality control (QC) of histopathology whole slide images (WSIs) that segments various regions, such as blurs of different levels, tissue regions, tissue folds, and pen marks. Given the necessity and increasing availability of GPUs for processing WSIs, the proposed pipeline comprises multiple lightweight deep learning models to strike a balance between accuracy and speed. The pipeline was evaluated on all TCGA cohorts, the largest publicly available WSI collection, containing more than 11,000 histopathology images from 28 organs. It was compared to a previous work that was not based on deep learning and showed consistent improvement in segmentation results across organs. To minimize annotation effort for tissue and blur segmentation, annotated images were automatically prepared by mosaicking patches (sub-images) from various WSIs whose labels were identified using the patch classification tool HistoROI. Due to the generality of our trained QC pipeline and its extensive testing, the potential impact of this work is broad. It can be used for automated pre-processing of any WSI cohort to enhance the accuracy and reliability of large-scale histopathology image analysis for both research and clinical use. We have made the trained models, training scripts, training data, and inference results publicly available at https://github.com/abhijeetptl5/wsisegqc, which should enable the research community to use the pipeline right out of the box or further customize it to new datasets and applications in the future.
Authors: Yuetao Li, Zijia Kuang, Ting Li, Qun Hao, Zike Yan, Guyue Zhou, Shaohui Zhang
Abstract: We propose ActiveSplat, an autonomous high-fidelity reconstruction system leveraging Gaussian splatting. Taking advantage of efficient and realistic rendering, the system establishes a unified framework for online mapping, viewpoint selection, and path planning. The key to ActiveSplat is a hybrid map representation that integrates both dense information about the environment and a sparse abstraction of the workspace. Therefore, the system leverages sparse topology for efficient viewpoint sampling and path planning, while exploiting view-dependent dense prediction for viewpoint selection, facilitating efficient decision-making with promising accuracy and completeness. A hierarchical planning strategy based on the topological map is adopted to mitigate repetitive trajectories and improve local granularity given limited time budgets, ensuring high-fidelity reconstruction with photorealistic view synthesis. Extensive experiments and ablation studies validate the efficacy of the proposed method in terms of reconstruction accuracy, data coverage, and exploration efficiency. The released code will be available on our project page: https://li-yuetao.github.io/ActiveSplat/.
Authors: Tejaswini Medi, Steffen Jung, Margret Keuper
Abstract: Deep neural networks are susceptible to adversarial attacks and common corruptions, which undermine their robustness. In order to enhance model resilience against such challenges, Adversarial Training (AT) has emerged as a prominent solution. Nevertheless, adversarial robustness is often attained at the expense of model fairness during AT, i.e., disparity in the class-wise robustness of the model. While distinctive classes become more robust towards such adversaries, hard-to-detect classes suffer. Recently, research has focused on improving model fairness specifically for perturbed images, overlooking accuracy on the non-perturbed data that models are most likely to encounter. Additionally, despite their robustness against the adversaries encountered during model training, state-of-the-art adversarially trained models have difficulty maintaining robustness and fairness when confronted with diverse adversarial threats or common corruptions. In this work, we address the above concerns by introducing a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show that using targeted adversarial attacks for adversarial training (instead of untargeted attacks) can allow for more favorable trade-offs with respect to adversarial fairness. Empirical results validate the efficacy of our approach.
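To make the core ingredient concrete, the sketch below generates targeted (rather than untargeted) adversarial examples with PGD, the attack family that targeted adversarial training builds on. It is a generic sketch, not the authors' exact procedure: how FAIR-TAT selects target classes (e.g., from class-wise robustness statistics) is not specified here, and `model`, `eps`, `alpha`, and `steps` are illustrative.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target, eps=8/255, alpha=2/255, steps=10):
    """Craft adversarial examples that push inputs TOWARD a chosen target
    class by descending the cross-entropy w.r.t. that class (untargeted
    attacks instead ascend the loss on the true label)."""
    x_adv = (x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv - alpha * grad.sign()).detach()          # targeted step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv

# During adversarial training, the model would then be updated on
# (x_adv, y_true) with the usual classification loss.
```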
Authors: Lukas Kuhn, Sari Sadiya, Jorg Schlotterer, Florian Buettner, Christin Seifert, Gemma Roig
Abstract: Shortcut learning, i.e., a model's reliance on undesired features not directly relevant to the task, is a major challenge that severely limits the applications of machine learning algorithms, particularly when deploying them to assist in making sensitive decisions, such as in medical diagnostics. In this work, we leverage recent advancements in machine learning to create an unsupervised framework that is capable of both detecting and mitigating shortcut learning in transformers. We validate our method on multiple datasets. Results demonstrate that our framework significantly improves both worst-group accuracy (samples misclassified due to shortcuts) and average accuracy, while minimizing human annotation effort. Moreover, we demonstrate that the detected shortcuts are meaningful and informative to human experts, and that our framework is computationally efficient, allowing it to be run on consumer hardware.
Authors: Zekun Zhou, Tan Liu, Bing Yu, Yanru Gong, Liu Shi, Qiegen Liu
Abstract: Diffusion models show remarkable potential for sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, we propose a Sinogram-based Wavelet random decomposition And Random mask diffusion Model (SWARM) for SVCT reconstruction. Specifically, introducing a random mask strategy in the sinogram effectively expands the limited training sample space. This enables the model to learn a broader range of data distributions, enhancing its understanding and generalization of data uncertainty. In addition, applying a random training strategy to the high-frequency components of the sinogram wavelet enhances feature representation and improves the ability to capture details in different frequency bands, thereby improving performance and robustness. A two-stage iterative reconstruction method is adopted to ensure the global consistency of the reconstructed image while refining its details. Experimental results demonstrate that SWARM outperforms competing approaches in both quantitative and qualitative performance across various datasets.
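A minimal sketch of the random-mask idea on a sinogram is shown below, assuming masking is applied over projection angles; the actual granularity (angles vs. detector bins), mask ratio, and the wavelet-domain randomization used by SWARM are not reproduced here.

```python
import numpy as np

def random_mask_sinogram(sinogram, mask_ratio=0.25, seed=None):
    """Zero out a random subset of projection angles (rows) so each training
    pass sees a different effective sampling pattern."""
    rng = np.random.default_rng(seed)
    masked = sinogram.copy()
    n_angles = sinogram.shape[0]
    drop = rng.choice(n_angles, size=int(mask_ratio * n_angles), replace=False)
    masked[drop, :] = 0.0
    return masked
```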
Authors: Yuheng Qiu, Can Xu, Yutian Chen, Shibo Zhao, Junyi Geng, Sebastian Scherer
Abstract: Inertial odometry (IO) using only Inertial Measurement Units (IMUs) offers a lightweight and cost-effective solution for Unmanned Aerial Vehicle (UAV) applications, yet existing learning-based IO models often fail to generalize to UAVs due to the highly dynamic and non-linear flight patterns that differ from pedestrian motion. In this work, we identify that the conventional practice of transforming raw IMU data to global coordinates undermines the observability of critical kinematic information in UAVs. By preserving the body-frame representation, our method achieves substantial performance improvements, with a 66.7% average increase in accuracy across three datasets. Furthermore, explicitly encoding attitude information into the motion network results in an additional 23.8% improvement over prior results. Combined with a data-driven IMU correction model (AirIMU) and an uncertainty-aware Extended Kalman Filter (EKF), our approach ensures robust state estimation under aggressive UAV maneuvers without relying on external sensors or control inputs. Notably, our method also demonstrates strong generalizability to unseen data not included in the training set, underscoring its potential for real-world UAV applications.
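The preprocessing choice the paper argues against can be made explicit with a small sketch: the conventional pipeline rotates body-frame IMU readings into the world frame using an estimated attitude, whereas the proposed approach simply keeps the body-frame signal. The rotation matrix and readings below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def to_world_frame(acc_body, R_wb):
    """Conventional preprocessing: rotate a body-frame accelerometer reading
    into the world frame using the current attitude estimate R_wb."""
    return R_wb @ acc_body

acc_body = np.array([0.1, -0.2, 9.9])   # raw specific force in the body frame (m/s^2)
R_wb = np.eye(3)                         # placeholder attitude estimate
acc_world = to_world_frame(acc_body, R_wb)
# Keeping the body-frame representation skips this rotation entirely, so the
# network's input statistics depend on the vehicle's own dynamics rather than
# on a possibly noisy global attitude estimate.
```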
Authors: Oriol Barbany, Adri\`a Colom\'e, Carme Torras
Abstract: Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.
Authors: Ziyan Wang, Sizhe Wei, Xiaoming Huo, Hao Wang
Abstract: Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
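The Product-of-Gaussians target has a simple closed form: the normalized product of two Gaussians is again Gaussian, with summed precisions and a precision-weighted mean. The sketch below shows only this closed form; the variances and the way PoGDiff weights the ground-truth target against the neighboring-text prediction are not specified here, and all values are illustrative.

```python
import numpy as np

def product_of_gaussians(mu1, var1, mu2, var2):
    """Normalized product of two diagonal Gaussians: the result is Gaussian
    with summed precisions and a precision-weighted mean."""
    prec1, prec2 = 1.0 / var1, 1.0 / var2
    var = 1.0 / (prec1 + prec2)
    mu = var * (prec1 * mu1 + prec2 * mu2)
    return mu, var

# e.g., fuse the original ground-truth target with the prediction conditioned
# on a neighboring text embedding (all values illustrative).
mu_gt, var_gt = np.array([0.8, -0.1]), np.array([0.05, 0.05])
mu_nb, var_nb = np.array([0.6,  0.1]), np.array([0.20, 0.20])
mu_pog, var_pog = product_of_gaussians(mu_gt, var_gt, mu_nb, var_nb)
```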
Authors: Armando Fortes, Tianyi Wei, Shangchen Zhou, Xingang Pan
Abstract: Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts. However, while traditional photography offers precise control over camera settings that shape visual aesthetics, such as depth of field via the aperture, current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently alters the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations, providing diverse scenes and subjects as well as supervision to learn the separation of image content from lens blur. Central to our framework is a grounded self-attention mechanism, trained on image pairs with different bokeh levels of the same scene, which enables blur strength to be adjusted in both directions while preserving the underlying scene. Extensive experiments demonstrate that our approach enables flexible, lens-like blur control, supports downstream applications such as real image editing via inversion, and generalizes effectively across both Stable Diffusion and FLUX architectures.
Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Abstract: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
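Because the operation is fully specified by the formula above, a drop-in module is straightforward to sketch. The per-channel affine parameters and the initial value of alpha below are assumptions consistent with treating DyT as a replacement for a normalization layer; the paper's exact initialization is not reproduced here.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: an element-wise tanh(alpha * x) with a learnable scalar
    alpha, plus the per-channel scale/shift a normalization layer usually carries."""
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # x: (..., dim); note that no mean/variance statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```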
Authors: Andy Chu, Rashik Shrestha, Yu Gu, Jason N. Gross
Abstract: Monitoring flowers over time is essential for precision robotic pollination in agriculture. To accomplish this, a continuous spatial-temporal observation of plant growth can be done using stationary RGB-D cameras. However, image registration becomes a serious challenge due to changes in the visual appearance of the plant caused by the pollination process and occlusions from growth and camera angles. Plants flower in a manner that produces distinct clusters on branches. This paper presents a method for matching flower clusters using descriptors generated from RGB-D data while accounting for spatial uncertainty within each cluster. The proposed approach leverages the Unscented Transform to efficiently estimate plant descriptor uncertainty tolerances, enabling a robust image-registration process despite temporal changes. The Unscented Transform is used to handle the nonlinear transformations by propagating the uncertainty of flower positions to determine the variations in the descriptor domain. A Monte Carlo simulation is used to validate the Unscented Transform results, confirming our method's effectiveness for flower cluster matching. It can therefore facilitate improved robotic pollination in dynamic environments.
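For readers unfamiliar with the Unscented Transform, the generic sigma-point propagation it relies on looks as follows; the descriptor function `f` and the uncertainty parameters used for flower clusters are specific to the paper and are not reproduced here.

```python
import numpy as np

def unscented_transform(mean, cov, f, alpha=1.0, beta=2.0, kappa=0.0):
    """Propagate a Gaussian (mean, cov) through a nonlinear function f using
    2n+1 sigma points, returning the transformed mean and covariance."""
    n = mean.size
    lam = alpha**2 * (n + kappa) - n
    L = np.linalg.cholesky((n + lam) * cov)                  # matrix square root
    sigma = np.vstack([mean, mean + L.T, mean - L.T])        # 2n+1 sigma points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1.0 - alpha**2 + beta)
    y = np.array([f(s) for s in sigma])                      # propagate each point
    y_mean = wm @ y
    diff = y - y_mean
    y_cov = (wc[:, None] * diff).T @ diff
    return y_mean, y_cov
```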
Authors: Ehsan Karami, Hamid Soltanian-Zadeh
Abstract: Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we have shown that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves model generalization in a baseline deep learning model for knee osteoarthritis (KOA) prediction. We trained and evaluated our model using MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrate a statistically significant improvement in classification accuracy across both domains, with our approach outperforming the baseline model.
Authors: Wenhao Hu, Paul Henderson, Jos\'e Cano
Abstract: Pruning is a widely used method for compressing Deep Neural Networks (DNNs), where less relevant parameters are removed from a DNN model to reduce its size. However, removing parameters reduces model accuracy, so pruning is typically combined with fine-tuning, and sometimes other operations such as rewinding weights, to recover accuracy. A common approach is to repeatedly prune and then fine-tune, with increasing amounts of model parameters being removed in each step. While straightforward to implement, pruning pipelines that follow this approach are computationally expensive due to the need for repeated fine-tuning. In this paper we propose ICE-Pruning, an iterative pruning pipeline for DNNs that significantly decreases the time required for pruning by reducing the overall cost of fine-tuning, while maintaining a similar accuracy to existing pruning pipelines. ICE-Pruning is based on three main components: i) an automatic mechanism to determine after which pruning steps fine-tuning should be performed; ii) a freezing strategy for faster fine-tuning in each pruning step; and iii) a custom pruning-aware learning rate scheduler to further improve the accuracy of each pruning step and reduce the overall time consumption. We also propose an efficient auto-tuning stage for the hyperparameters (e.g., freezing percentage) introduced by the three components. We evaluate ICE-Pruning on several DNN models and datasets, showing that it can accelerate pruning by up to 9.61x. Code is available at https://github.com/gicLAB/ICE-Pruning
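The overall loop can be sketched as below. This is a simplified stand-in, assuming per-layer magnitude pruning and a plain accuracy-drop threshold as the fine-tuning trigger; ICE-Pruning's actual trigger, freezing strategy, and pruning-aware learning rate scheduler are not reproduced, and `evaluate`, `finetune`, `schedule`, and `drop_tol` are user-supplied placeholders.

```python
import torch

@torch.no_grad()
def magnitude_prune_(model, amount):
    """Zero out the smallest-magnitude fraction `amount` of each Linear/Conv
    weight tensor (a simple per-layer magnitude criterion)."""
    for m in model.modules():
        if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            w = m.weight
            k = int(amount * w.numel())
            if k == 0:
                continue
            thresh = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > thresh).float())

def iterative_prune(model, schedule, evaluate, finetune, drop_tol=0.01):
    """Prune in increasing steps and fine-tune only when the accuracy drop at
    a step exceeds `drop_tol` (a simple stand-in for the decision ICE-Pruning
    automates)."""
    acc_prev = evaluate(model)
    for amount in schedule:              # e.g. [0.2, 0.4, 0.6, 0.8], cumulative
        magnitude_prune_(model, amount)
        acc = evaluate(model)
        if acc_prev - acc > drop_tol:
            finetune(model)              # user-supplied short fine-tuning loop
            acc = evaluate(model)
        acc_prev = acc
    return model
```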
Authors: Seungjae Lee, Daniel Ekpo, Haowen Liu, Furong Huang, Abhinav Shrivastava, Jia-Bin Huang
Abstract: Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.
Authors: Badhan Kumar Das, Ajay Singh, Gengyan Zhao, Han Liu, Thomas J. Re, Dorin Comaniciu, Eli Gibson, Andreas Maier
Abstract: Self-supervised pretraining techniques have been widely used to improve downstream task performance. However, real-world magnetic resonance (MR) studies usually consist of different sets of contrasts due to different acquisition protocols, which poses challenges for current deep learning methods in large-scale pretraining and in downstream tasks with different input requirements, since these methods typically require a fixed set of input modalities or contrasts. To address this challenge, we propose the variable-input ViT (VIViT), a transformer-based framework designed for self-supervised pretraining and segmentation fine-tuning with variable contrasts in each study. With this ability, our approach can maximize data availability during pretraining and transfer the learned knowledge to downstream tasks despite variations in input requirements. We validate our method on brain infarct and brain tumor segmentation, where it outperforms current CNN- and ViT-based models with mean Dice scores of 0.624 and 0.883, respectively. These results highlight the efficacy of our design for better adaptability and performance on tasks with real-world heterogeneous MR data.
Authors: Junlin Song, Miguel Olivares-Mendez
Abstract: Visual odometry (VO) is typically considered a chicken-and-egg problem, as the localization and mapping modules are tightly coupled. The estimation of a visual map relies on accurate localization information. Meanwhile, localization requires precise map points to provide motion constraints. This classical design principle is naturally inherited by visual-inertial odometry (VIO). Efficient localization solutions that do not require a map have not been fully investigated. To this end, we propose a novel structureless VIO, where the visual map is removed from the odometry framework. Experimental results demonstrate that, compared to the structure-based VIO baseline, our structureless VIO not only substantially improves computational efficiency but also offers advantages in accuracy.
Authors: Zhengyang Fan, Wanru Li, Kuo-chu Chang, Ting Yuan
Abstract: Accurately estimating the remaining useful life (RUL) of degradation systems is crucial in modern prognostics and health management (PHM). Convolutional Neural Networks (CNNs), initially developed for tasks like image and video recognition, have proven highly effective in RUL prediction, demonstrating remarkable performance. However, with the emergence of the Vision Transformer (ViT), a Transformer model tailored for computer vision tasks such as image classification, and its demonstrated superiority over CNNs, there is a natural inclination to explore its potential in enhancing RUL prediction accuracy. Nonetheless, applying ViT directly to multivariate sensor data for RUL prediction poses challenges, primarily due to the ambiguous nature of spatial information in time series data. To address this issue, we introduce the PerFormer, a permutation-based vision transformer approach designed to permute multivariate time series data, mimicking spatial characteristics akin to image data, thereby making it suitable for ViT. To generate the desired permutation matrix, we introduce a novel permutation loss function aimed at guiding the convergence of any matrix towards a permutation matrix. Our experiments on NASA's C-MAPSS dataset demonstrate the PerFormer's superior performance in RUL prediction compared to state-of-the-art methods employing CNNs, Recurrent Neural Networks (RNNs), and various Transformer models. This underscores its effectiveness and potential in PHM applications.
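The abstract does not give the permutation loss itself, so the sketch below shows one plausible way to drive a soft assignment matrix toward a permutation matrix, purely to make the idea concrete; it should not be read as the authors' loss function.

```python
import torch
import torch.nn.functional as F

def permutation_loss(logits):
    """Encourage a square score matrix to converge to a permutation matrix:
    rows are made stochastic via softmax, then column sums are pushed to 1
    and P @ P.T is pushed toward the identity. For a row-stochastic,
    nonnegative P these penalties vanish exactly when P is a permutation."""
    P = F.softmax(logits, dim=-1)                       # row-stochastic scores
    n = P.shape[-1]
    loss_cols = ((P.sum(dim=-2) - 1.0) ** 2).mean()     # columns should sum to 1
    eye = torch.eye(n, device=P.device)
    loss_orth = ((P @ P.transpose(-1, -2) - eye) ** 2).mean()
    return loss_cols + loss_orth
```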
Authors: Parul Gupta, Shreya Ghosh, Tom Gedeon, Thanh-Toan Do, Abhinav Dhall
Abstract: The rapid advancement of GenAI technology over the past few years has significantly contributed towards highly realistic deepfake content generation. Despite ongoing efforts, the research community still lacks a large-scale, reasoning-driven deepfake benchmark dataset specifically tailored for person-centric object, context, and scene manipulations. In this paper, we address this gap by introducing MultiFakeVerse, a large-scale person-centric deepfake dataset comprising 845,286 images generated through manipulation suggestions and image manipulations, both derived from vision-language models (VLMs). The VLM instructions were specifically targeted towards modifications to individuals or contextual elements of a scene that influence human perception of importance, intent, or narrative. This VLM-driven approach enables semantic, context-aware alterations such as modifying actions, scenes, and human-object interactions, rather than the synthetic or low-level identity swaps and region-specific edits that are common in existing datasets. Our experiments reveal that current state-of-the-art deepfake detection models and human observers struggle to detect these subtle yet meaningful manipulations. The code and dataset are available on \href{https://github.com/Parul-Gupta/MultiFakeVerse}{GitHub}.
Authors: Isarun Chamveha, Supphanut Chaiyungyuen, Sasinun Worakriangkrai, Nattawadee Prasawang, Warasinee Chaisangmongkon, Pornpim Korpraphong, Voraparee Suvannarerg, Shanigarn Thiravit, Chalermdej Kannawat, Kewalin Rungsinaporn, Suwara Issaragrisil, Payia Chadbunchachai, Pattiya Gatechumpol, Chawiporn Muktabhant, Patarachai Sereerat
Abstract: This study presents a deep learning system for breast cancer detection in mammography, developed using a modified EfficientNetV2 architecture with enhanced attention mechanisms. The model was trained on mammograms from a major Thai medical center and validated on three distinct datasets: an in-domain test set (9,421 cases), a biopsy-confirmed set (883 cases), and an out-of-domain generalizability set (761 cases) collected from two different hospitals. For cancer detection, the model achieved AUROCs of 0.89, 0.96, and 0.94 on the respective datasets. The system's lesion localization capability, evaluated using metrics including Lesion Localization Fraction (LLF) and Non-Lesion Localization Fraction (NLF), demonstrated robust performance in identifying suspicious regions. Clinical validation through concordance tests showed strong agreement with radiologists: 83.5% classification and 84.0% localization concordance for biopsy-confirmed cases, and 78.1% classification and 79.6% localization concordance for out-of-domain cases. Expert radiologists' acceptance rate averaged 96.7% for biopsy-confirmed cases and 89.3% for out-of-domain cases. The system achieved a System Usability Scale score of 74.17 for the source hospital and 69.20 for the validation hospitals, indicating good clinical acceptance. These results demonstrate the model's effectiveness in assisting mammogram interpretation, with the potential to enhance breast cancer screening workflows in clinical practice.
Authors: Guy Gaziv, Sarah Goulding, Ani Ayvazian-Hancock, Yoon Bai, James J. DiCarlo
Abstract: Precise control of neural activity -- modulating target neurons deep in the brain while leaving nearby neurons unaffected -- is an outstanding challenge in neuroscience, generally approached using invasive techniques. This study investigates the possibility of precisely and noninvasively modulating neural activity in the high-level primate ventral visual stream via perturbations on one's natural visual feed. When tested on macaque inferior temporal (IT) neural populations, we found quantitative agreement between the model-predicted and biologically realized effect: strong modulation concentrated on targeted neural sites. We extended this to demonstrate accurate injection of experimenter-chosen neural population patterns via subtle perturbations applied on the background of typical natural visual feeds. These results highlight that current machine-executable models of the ventral stream can now design noninvasive, visually-delivered, possibly imperceptible neural interventions at the resolution of individual neurons.
Authors: Thien Nhan Vo
Abstract: This study addresses the classification of heartbeats from ECG signals through two distinct approaches: traditional machine learning utilizing hand-crafted features and deep learning via transformed images of ECG beats. The dataset underwent preprocessing steps, including downsampling, filtering, and normalization, to ensure consistency and relevance for subsequent analysis. In the first approach, features such as heart rate variability (HRV), mean, variance, and RR intervals were extracted to train various classifiers, including SVM, Random Forest, AdaBoost, LSTM, Bi-directional LSTM, and LightGBM. The second approach involved transforming ECG signals into images using Gramian Angular Field (GAF), Markov Transition Field (MTF), and Recurrence Plots (RP), with these images subsequently classified using CNN architectures like VGG and Inception. Experimental results demonstrate that the LightGBM model achieved the highest performance, with an accuracy of 99% and an F1 score of 0.94, outperforming the image-based CNN approach (F1 score of 0.85). Models such as SVM and AdaBoost yielded significantly lower scores, indicating limited suitability for this task. The findings underscore the superior ability of hand-crafted features to capture temporal and morphological variations in ECG signals compared to image-based representations of individual beats. Future investigations may benefit from incorporating multi-lead ECG signals and temporal dependencies across successive beats to enhance classification accuracy further.
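Of the three image encodings, the Gramian Angular Field is the most compact to sketch: samples are rescaled to [-1, 1], mapped to angles, and combined pairwise. Whether the study used the summation (GASF) or difference (GADF) variant, and at what image resolution, is not stated here, so both variants are shown.

```python
import numpy as np

def gramian_angular_field(x, summation=True):
    """Encode a 1-D ECG beat as a Gramian Angular Field image: rescale to
    [-1, 1], map samples to angles, then take pairwise trigonometric sums
    (GASF) or differences (GADF)."""
    x = np.asarray(x, dtype=float)
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1
    x_scaled = np.clip(x_scaled, -1.0, 1.0)
    phi = np.arccos(x_scaled)
    if summation:                                   # GASF: cos(phi_i + phi_j)
        return np.cos(phi[:, None] + phi[None, :])
    return np.sin(phi[:, None] - phi[None, :])      # GADF: sin(phi_i - phi_j)
```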
Authors: Arun Sharma, Mingzhou Yang, Majid Farhadloo, Subhankar Ghosh, Bharat Jayaprakash, Shashi Shekhar
Abstract: Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectories). The problem is societally important to curb illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI-generated deep fakes (e.g., additive noise, fake trajectories) and the lack of an adequate amount of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, these methods do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework achieves higher prediction accuracy and a lower estimation error rate for anomaly detection and trajectory generation methods, respectively. Our implementation is available at https://github.com/arunshar/Physics-Informed-Diffusion-Probabilistic-Model.
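The kinematic-constraint idea can be illustrated with a standalone feasibility check like the one below; the paper integrates such constraints into the diffusion model itself, which is not reproduced here, and the speed and acceleration limits are illustrative placeholders.

```python
import numpy as np

def kinematic_violations(xy, t, v_max=15.0, a_max=2.0):
    """Flag trajectory segments whose implied speed or acceleration exceeds
    vehicle limits; positions in metres, timestamps in seconds."""
    xy, t = np.asarray(xy, float), np.asarray(t, float)
    dt = np.diff(t)
    v = np.linalg.norm(np.diff(xy, axis=0), axis=1) / dt     # segment speeds
    a = np.abs(np.diff(v)) / dt[1:]                          # speed changes
    return (v > v_max), (a > a_max)
```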
URLs: https://github.com/arunshar/Physics-Informed-Diffusion-Probabilistic-Model.
Authors: Vivien van Veldhuizen, Vanessa Botha, Chunyao Lu, Melis Erdal Cesur, Kevin Groot Lipman, Edwin D. de Jong, Hugo Horlings, Cl\'arisa I. Sanchez, Cees G. M. Snoek, Lodewyk Wessels, Ritse Mann, Eric Marcus, Jonas Teuwen
Abstract: Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.
Authors: Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen
Abstract: The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. In addition, it introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon at https://github.com/AIDC-AI/LPO.
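A distance-based location reward can be illustrated with a simple decaying function of the physical gap between the predicted and reference interaction points; the exact functional form and scale used by LPO are not given here, so `sigma` and the exponential shape below are assumptions.

```python
import math

def location_reward(pred_xy, gt_xy, sigma=40.0):
    """Reward that decays smoothly with the distance between the predicted
    and reference interaction points (sigma in pixels, illustrative)."""
    d = math.dist(pred_xy, gt_xy)
    return math.exp(-d / sigma)
```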
Authors: Jacob Munkberg, Zian Wang, Ruofan Liang, Tianchang Shen, Jon Hasselgren
Abstract: We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high quality materials for 3D models given a text prompt or a single image. First, we condition a video diffusion model to respect the input geometry and lighting condition; this model produces multiple views of a given 3D model with coherent material properties. Second, we use a recent model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use these intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools.
Authors: Xinjie Wang, Liu Liu, Yu Cao, Ruiqi Wu, Wenkang Qin, Dehui Wang, Wei Sui, Zhizhong Su
Abstract: Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low-cost accessibility, and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets that are manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data-driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable, and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation, and Layout Generation. It generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the generalization and evaluation challenges of embodied-intelligence research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
URLs: https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
Authors: Wenlong Hou, Guangqian Yang, Ye Du, Yeung Lau, Lihao Liu, Junjun He, Ling Long, Shujun Wang
Abstract: Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disease. Early and precise diagnosis of AD is crucial for timely intervention and treatment planning to alleviate the progressive neurodegeneration. However, most existing methods rely on single-modality data, which contrasts with the multifaceted approach used by medical experts. While some deep learning approaches process multi-modal data, they are limited to specific tasks with a small set of input modalities and cannot handle arbitrary combinations. This highlights the need for a system that can address diverse AD-related tasks, process multi-modal or missing input, and integrate multiple advanced methods for improved performance. In this paper, we propose ADAgent, the first specialized AI agent for AD analysis, built on a large language model (LLM) to address user queries and support decision-making. ADAgent integrates a reasoning engine, specialized medical tools, and a collaborative outcome coordinator to facilitate multi-modal diagnosis and prognosis tasks in AD. Extensive experiments demonstrate that ADAgent outperforms SOTA methods, achieving significant improvements in accuracy, including a 2.7% increase in multi-modal diagnosis, a 0.7% improvement in multi-modal prognosis, and enhancements in MRI and PET diagnosis tasks.