new RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

Authors: Ruiqi Wang, Hao Zhang

Abstract: We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting input expressions that are more general than what prior works were designed to handle. Specifically, our inputs encompass both object- and part-level labels as well as implicit references pointing to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESAnything, leverages Chain-of-Thoughts (CoT) reasoning, where the key idea is attribute prompting. We generate detailed descriptions of object/part attributes including shape, color, and location for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundational image segmentation model. Our approach encourages deep reasoning about object or part attributes related to function, style, design, etc., enabling the system to handle implicit queries without any part annotations for training or fine-tuning. As the first zero-shot and LLM-based RES method, RESAnything achieves clearly superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on challenging scenarios involving implicit queries and complex part-level relations. Finally, we contribute a new benchmark dataset to offer ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.

new Gone With the Bits: Revealing Racial Bias in Low-Rate Neural Compression for Facial Images

Authors: Tian Qiu, Arjun Nichani, Rasta Tadayontahmasebi, Haewon Jeong

Abstract: Neural compression methods are gaining popularity due to their superior rate-distortion performance over traditional methods, even at extremely low bitrates below 0.1 bpp. As deep learning architectures, these models are prone to bias during the training process, potentially leading to unfair outcomes for individuals in different groups. In this paper, we present a general, structured, scalable framework for evaluating bias in neural image compression models. Using this framework, we investigate racial bias in neural compression algorithms by analyzing nine popular models and their variants. Through this investigation, we first demonstrate that traditional distortion metrics are ineffective in capturing bias in neural compression models. Next, we highlight that racial bias is present in all neural compression models and can be captured by examining facial phenotype degradation in image reconstructions. We then examine the relationship between bias and realism in the decoded images and demonstrate a trade-off across models. Finally, we show that utilizing a racially balanced training set can reduce bias but is not a sufficient bias mitigation strategy. We additionally show the bias can be attributed to compression model bias and classification model bias. We believe that this work is a first step towards evaluating and eliminating bias in neural image compression models.

new Generating Narrated Lecture Videos from Slides with Synchronized Highlights

Authors: Alexander Holmberg

Abstract: Turning static slides into engaging video lectures takes considerable time and effort, requiring presenters to record explanations and visually guide their audience through the material. We introduce an end-to-end system designed to automate this process entirely. Given a slide deck, this system synthesizes a video lecture featuring AI-generated narration synchronized precisely with dynamic visual highlights. These highlights automatically draw attention to the specific concept being discussed, much like an effective presenter would. The core technical contribution is a novel highlight alignment module. This module accurately maps spoken phrases to locations on a given slide using diverse strategies (e.g., Levenshtein distance, LLM-based semantic analysis) at selectable granularities (line or word level) and utilizes timestamp-providing Text-to-Speech (TTS) for timing synchronization. We demonstrate the system's effectiveness through a technical evaluation using a manually annotated slide dataset with 1000 samples, finding that LLM-based alignment achieves high location accuracy (F1 > 92%), significantly outperforming simpler methods, especially on complex, math-heavy content. Furthermore, the calculated generation cost averages under $1 per hour of video, offering potential savings of two orders of magnitude compared to conservative estimates of manual production costs. This combination of high accuracy and extremely low cost positions this approach as a practical and scalable tool for transforming static slides into effective, visually-guided video lectures.

new Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation

Authors: Anjila Budathoki, Manish Dhakal

Abstract: Adversarial attacks have been fairly explored for computer vision and vision-language models. However, the avenue of adversarial attack for the vision language segmentation models (VLSMs) is still under-explored, especially for medical image analysis. Thus, we have investigated the robustness of VLSMs against adversarial attacks for 2D medical images with different modalities with radiology, photography, and endoscopy. The main idea of this project was to assess the robustness of the fine-tuned VLSMs specially in the medical domain setting to address the high risk scenario. First, we have fine-tuned pre-trained VLSMs for medical image segmentation with adapters. Then, we have employed adversarial attacks -- projected gradient descent (PGD) and fast gradient sign method (FGSM) -- on that fine-tuned model to determine its robustness against adversaries. We have reported models' performance decline to analyze the adversaries' impact. The results exhibit significant drops in the DSC and IoU scores after the introduction of these adversaries. Furthermore, we also explored universal perturbation but were not able to find for the medical images. \footnote{https://github.com/anjilab/secure-private-ai}

URLs: https://github.com/anjilab/secure-private-ai

new Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking

Authors: Daniela Ruiz, Paula Cardenas, Leonardo Manrique, Daniela Vega, Gabriel Mejia, Pablo Arbelaez

Abstract: Spatial Transcriptomics is a groundbreaking technology that integrates histology images with spatially resolved gene expression profiles. Among the various Spatial Transcriptomics techniques available, Visium has emerged as the most widely adopted. However, its accessibility is limited by high costs, the need for specialized expertise, and slow clinical integration. Additionally, gene capture inefficiencies lead to significant dropout, corrupting acquired data. To address these challenges, the deep learning community has explored the gene expression prediction task directly from histology images. Yet, inconsistencies in datasets, preprocessing, and training protocols hinder fair comparisons between models. To bridge this gap, we introduce SpaRED, a systematically curated database comprising 26 public datasets, providing a standardized resource for model evaluation. We further propose SpaCKLE, a state-of-the-art transformer-based gene expression completion model that reduces mean squared error by over 82.5% compared to existing approaches. Finally, we establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data, demonstrating SpaCKLE substantially improves the results across all the gene expression prediction models. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on Spatial Transcriptomics.

new NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results

Authors: Nikolay Safonov, Alexey Bryncev, Andrey Moskalenko, Dmitry Kulikov, Dmitry Vatolin, Radu Timofte, Haibo Lei, Qifan Gao, Qing Luo, Yaqing Li, Jie Song, Shaozhe Hao, Meisong Zheng, Jingyi Xu, Chengbin Wu, Jiahui Liu, Ying Chen, Xin Deng, Mai Xu, Peipei Liang, Jie Ma, Junjie Jin, Yingxue Pang, Fangzhou Luo, Kai Chen, Shijie Zhao, Mingyang Wu, Renjie Li, Yushen Zuo, Shengyun Zhong, Zhengzhong Tu

Abstract: This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such videos. Given the widespread use of UGC on short-form video platforms, this task holds substantial practical importance. The evaluation was based on subjective quality assessment in crowdsourcing, obtaining votes from over 8000 assessors. The challenge attracted more than 25 teams submitting solutions, 7 of which passed the final phase with source code verification. The outcomes may provide insights into the state-of-the-art in UGC video enhancement and highlight emerging trends and effective strategies in this evolving research area. All data, including the processed videos and subjective comparison votes and scores, is made publicly available at https://github.com/msu-video-group/NTIRE25_UGC_Video_Enhancement.

URLs: https://github.com/msu-video-group/NTIRE25_UGC_Video_Enhancement.

new GIF: Generative Inspiration for Face Recognition at Scale

Authors: Saeed Ebrahimi, Sahar Rahimi, Ali Dabouei, Srinjoy Das, Jeremy M. Dawson, Nasser M. Nasrabadi

Abstract: Aiming to reduce the computational cost of Softmax in massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the association between the computation cost and the number of identities in the dataset remains linear only with a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, the input to label matching is through a dot product between the feature vector of the input and the Softmax centroids. Inspired by generative modeling, we present a simple yet effective method that substitutes scalar labels with structured identity code, i.e., a sequence of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic w.r.t. number of identities. We demonstrate the benefits of the proposed method by conducting experiments. In particular, our method outperforms its competitors by 1.52%, and 0.6% at TAR@FAR$=1e-4$ on IJB-B and IJB-C, respectively, while transforming the association between computational cost and the number of identities from linear to logarithmic. See code at https://github.com/msed-Ebrahimi/GIF

URLs: https://github.com/msed-Ebrahimi/GIF

new Lesion-Aware Generative Artificial Intelligence for Virtual Contrast-Enhanced Mammography in Breast Cancer

Authors: Aurora Rofena, Arianna Manchia, Claudia Lucia Piccolo, Bruno Beomonte Zobel, Paolo Soda, Valerio Guarrasi

Abstract: Contrast-Enhanced Spectral Mammography (CESM) is a dual-energy mammographic technique that improves lesion visibility through the administration of an iodinated contrast agent. It acquires both a low-energy image, comparable to standard mammography, and a high-energy image, which are then combined to produce a dual-energy subtracted image highlighting lesion contrast enhancement. While CESM offers superior diagnostic accuracy compared to standard mammography, its use entails higher radiation exposure and potential side effects associated with the contrast medium. To address these limitations, we propose Seg-CycleGAN, a generative deep learning framework for Virtual Contrast Enhancement in CESM. The model synthesizes high-fidelity dual-energy subtracted images from low-energy images, leveraging lesion segmentation maps to guide the generative process and improve lesion reconstruction. Building upon the standard CycleGAN architecture, Seg-CycleGAN introduces localized loss terms focused on lesion areas, enhancing the synthesis of diagnostically relevant regions. Experiments on the CESM@UCBM dataset demonstrate that Seg-CycleGAN outperforms the baseline in terms of PSNR and SSIM, while maintaining competitive MSE and VIF. Qualitative evaluations further confirm improved lesion fidelity in the generated images. These results suggest that segmentation-aware generative models offer a viable pathway toward contrast-free CESM alternatives.

new An Explainable Anomaly Detection Framework for Monitoring Depression and Anxiety Using Consumer Wearable Devices

Authors: Yuezhou Zhang, Amos A. Folarin, Callum Stewart, Heet Sankesara, Yatharth Ranjan, Pauline Conde, Akash Roy Choudhury, Shaoxiong Sun, Zulqarnain Rashid, Richard J. B. Dobson

Abstract: Continuous monitoring of behavior and physiology via wearable devices offers a novel, objective method for the early detection of worsening depression and anxiety. In this study, we present an explainable anomaly detection framework that identifies clinically meaningful increases in symptom severity using consumer-grade wearable data. Leveraging data from 2,023 participants with defined healthy baselines, our LSTM autoencoder model learned normal health patterns of sleep duration, step count, and resting heart rate. Anomalies were flagged when self-reported depression or anxiety scores increased by >=5 points (a threshold considered clinically significant). The model achieved an adjusted F1-score of 0.80 (precision = 0.73, recall = 0.88) in detecting 393 symptom-worsening episodes across 341 participants, with higher performance observed for episodes involving concurrent depression and anxiety escalation (F1 = 0.84) and for more pronounced symptom changes (>=10-point increases, F1 = 0.85). Model interpretability was supported by SHAP-based analysis, which identified resting heart rate as the most influential feature in 71.4 percentage of detected anomalies, followed by physical activity and sleep. Together, our findings highlight the potential of explainable anomaly detection to enable personalized, scalable, and proactive mental health monitoring in real-world settings.

new Estimating the Diameter at Breast Height of Trees in a Forest With a Single 360 Camera

Authors: Siming He, Zachary Osman, Fernando Cladera, Dexter Ong, Nitant Rai, Patrick Corey Green, Vijay Kumar, Pratik Chaudhari

Abstract: Forest inventories rely on accurate measurements of the diameter at breast height (DBH) for ecological monitoring, resource management, and carbon accounting. While LiDAR-based techniques can achieve centimeter-level precision, they are cost-prohibitive and operationally complex. We present a low-cost alternative that only needs a consumer-grade 360 video camera. Our semi-automated pipeline comprises of (i) a dense point cloud reconstruction using Structure from Motion (SfM) photogrammetry software called Agisoft Metashape, (ii) semantic trunk segmentation by projecting Grounded Segment Anything (SAM) masks onto the 3D cloud, and (iii) a robust RANSAC-based technique to estimate cross section shape and DBH. We introduce an interactive visualization tool for inspecting segmented trees and their estimated DBH. On 61 acquisitions of 43 trees under a variety of conditions, our method attains median absolute relative errors of 5-9% with respect to "ground-truth" manual measurements. This is only 2-4% higher than LiDAR-based estimates, while employing a single 360 camera that costs orders of magnitude less, requires minimal setup, and is widely available.

new Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability

Authors: Lei Wang, Senmao Li, Fei Yang, Jianye Wang, Ziheng Zhang, Yuhan Liu, Yaxing Wang, Jian Yang

Abstract: The diffusion models, in early stages focus on constructing basic image structures, while the refined details, including local features and textures, are generated in later stages. Thus the same network layers are forced to learn both structural and textural information simultaneously, significantly differing from the traditional deep learning architectures (e.g., ResNet or GANs) which captures or generates the image semantic information at different layers. This difference inspires us to explore the time-wise diffusion models. We initially investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large parameters) contributes to denoising, substantially improving the generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method-termed ``MaskUNet''- that enhances generation quality with negligible parameter numbers. Our method fully leverages timestep- and sample-dependent effective U-Net parameters. To optimize MaskUNet, we offer two fine-tuning strategies: a training-based approach and a training-free approach, including tailored networks and optimization functions. In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score and further demonstrates its effectiveness in downstream task evaluations. Project page: https://gudaochangsheng.github.io/MaskUnet-Page/

URLs: https://gudaochangsheng.github.io/MaskUnet-Page/

new Image Recognition with Online Lightweight Vision Transformer: A Survey

Authors: Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu

Abstract: The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit its real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT

URLs: https://github.com/ajxklo/Lightweight-VIT

new Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

Authors: Teng Zhou, Jax Luo, Yuping Sun, Yiheng Tan, Shun Yao, Nazim Haouchine, Scott Raymond

Abstract: Accurate MRI-to-CT translation promises the integration of complementary imaging information without the need for additional imaging sessions. Given the practical challenges associated with acquiring paired MRI and CT scans, the development of robust methods capable of leveraging unpaired datasets is essential for advancing the MRI-to-CT translation. Current unpaired MRI-to-CT translation methods, which predominantly rely on cycle consistency and contrastive learning frameworks, frequently encounter challenges in accurately translating anatomical features that are highly discernible on CT but less distinguishable on MRI, such as bone structures. This limitation renders these approaches less suitable for applications in radiation therapy, where precise bone representation is essential for accurate treatment planning. To address this challenge, we propose a path- and bone-contour regularized approach for unpaired MRI-to-CT translation. In our method, MRI and CT images are projected to a shared latent space, where the MRI-to-CT mapping is modeled as a continuous flow governed by neural ordinary differential equations. The optimal mapping is obtained by minimizing the transition path length of the flow. To enhance the accuracy of translated bone structures, we introduce a trainable neural network to generate bone contours from MRI and implement mechanisms to directly and indirectly encourage the model to focus on bone contours and their adjacent regions. Evaluations conducted on three datasets demonstrate that our method outperforms existing unpaired MRI-to-CT translation approaches, achieving lower overall error rates. Moreover, in a downstream bone segmentation task, our approach exhibits superior performance in preserving the fidelity of bone structures. Our code is available at: https://github.com/kennysyp/PaBoT.

URLs: https://github.com/kennysyp/PaBoT.

new TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion

Authors: Haoyue Liu, Jinghan Xu, Yi Chang, Hanyu Zhou, Haozhi Zhao, Lin Wang, Luxin Yan

Abstract: Video frame interpolation (VFI) that leverages the bio-inspired event cameras as guidance has recently shown better performance and memory efficiency than the frame-based methods, thanks to the event cameras' advantages, such as high temporal resolution. A hurdle for event-based VFI is how to effectively deal with non-linear motion, caused by the dynamic changes in motion direction and speed within the scene. Existing methods either use events to estimate sparse optical flow or fuse events with image features to estimate dense optical flow. Unfortunately, motion errors often degrade the VFI quality as the continuous motion cues from events do not align with the dense spatial information of images in the temporal dimension. In this paper, we find that object motion is continuous in space, tracking local regions over continuous time enables more accurate identification of spatiotemporal feature correlations. In light of this, we propose a novel continuous point tracking-based VFI framework, named TimeTracker. Specifically, we first design a Scene-Aware Region Segmentation (SARS) module to divide the scene into similar patches. Then, a Continuous Trajectory guided Motion Estimation (CTME) module is proposed to track the continuous motion trajectory of each patch through events. Finally, intermediate frames at any given time are generated through global motion optimization and frame refinement. Moreover, we collect a real-world dataset that features fast non-linear motion. Extensive experiments show that our method outperforms prior arts in both motion estimation and frame interpolation quality.

new VISLIX: An XAI Framework for Validating Vision Models with Slice Discovery and Analysis

Authors: Xinyuan Yan, Xiwei Xuan, Jorge Piazentin Ono, Jiajing Guo, Vikram Mohanty, Shekar Arvind Kumar, Liang Gou, Bei Wang, Liu Ren

Abstract: Real-world machine learning models require rigorous evaluation before deployment, especially in safety-critical domains like autonomous driving and surveillance. The evaluation of machine learning models often focuses on data slices, which are subsets of the data that share a set of characteristics. Data slice finding automatically identifies conditions or data subgroups where models underperform, aiding developers in mitigating performance issues. Despite its popularity and effectiveness, data slicing for vision model validation faces several challenges. First, data slicing often needs additional image metadata or visual concepts, and falls short in certain computer vision tasks, such as object detection. Second, understanding data slices is a labor-intensive and mentally demanding process that heavily relies on the expert's domain knowledge. Third, data slicing lacks a human-in-the-loop solution that allows experts to form hypothesis and test them interactively. To overcome these limitations and better support the machine learning operations lifecycle, we introduce VISLIX, a novel visual analytics framework that employs state-of-the-art foundation models to help domain experts analyze slices in computer vision models. Our approach does not require image metadata or visual concepts, automatically generates natural language insights, and allows users to test data slice hypothesis interactively. We evaluate VISLIX with an expert study and three use cases, that demonstrate the effectiveness of our tool in providing comprehensive insights for validating object detection models.

new Enhancing Glass Defect Detection with Diffusion Models: Addressing Imbalanced Datasets in Manufacturing Quality Control

Authors: Sajjad Rezvani Boroujeni, Hossein Abedi, Tom Bush

Abstract: Visual defect detection in industrial glass manufacturing remains a critical challenge due to the low frequency of defective products, leading to imbalanced datasets that limit the performance of deep learning models and computer vision systems. This paper presents a novel approach using Denoising Diffusion Probabilistic Models (DDPMs) to generate synthetic defective glass product images for data augmentation, effectively addressing class imbalance issues in manufacturing quality control and automated visual inspection. The methodology significantly enhances image classification performance of standard CNN architectures (ResNet50V2, EfficientNetB0, and MobileNetV2) in detecting anomalies by increasing the minority class representation. Experimental results demonstrate substantial improvements in key machine learning metrics, particularly in recall for defective samples across all tested deep neural network architectures while maintaining perfect precision. The most dramatic improvement was observed in ResNet50V2's overall classification accuracy, which increased from 78 percent to 93 percent when trained with the augmented data. This work provides a scalable, cost-effective approach to enhancing automated defect detection in glass manufacturing that can potentially be extended to other industrial quality assurance systems and industries with similar class imbalance challenges.

new Motion-compensated cardiac MRI using low-rank diffeomorphic flow (DMoCo)

Authors: Joseph William Kettelkamp, Ludovica Romanin, Sarv Priya, Mathews Jacob

Abstract: We introduce an unsupervised motion-compensated image reconstruction algorithm for free-breathing and ungated 3D cardiac magnetic resonance imaging (MRI). We express the image volume corresponding to each specific motion phase as the deformation of a single static image template. The main contribution of the work is the low-rank model for the compact joint representation of the family of diffeomorphisms, parameterized by the motion phases. The diffeomorphism at a specific motion phase is obtained by integrating a parametric velocity field along a path connecting the reference template phase to the motion phase. The velocity field at different phases is represented using a low-rank model. The static template and the low-rank motion model parameters are learned directly from the k-space data in an unsupervised fashion. The more constrained motion model is observed to offer improved recovery compared to current motion-resolved and motion-compensated algorithms for free-breathing 3D cine MRI.

new Robust Fairness Vision-Language Learning for Medical Image Analysis

Authors: Sparsh Bansal, Mingyang Wu, Xin Wang, Shu Hu

Abstract: The advent of Vision-Language Models (VLMs) in medical image analysis has the potential to help process multimodal inputs and increase performance over traditional inference methods. However, when considering the domain in which these models will be implemented, fairness and robustness are important to ensure the model stays true for any patient. In this paper, we introduce a framework for ensuring robustness and fairness of VLM models. This framework modifies the loss function at training by identifying and adjusting faulty image-text pairs through a Dynamic Bad Pair Mining algorithm and also utilizing Sinkhorn distance to ensure the loss distributions of protected groups do not deviate from the total loss. Experimental testing of our framework shows up to a 8.6\% improvement when looking at equity-scaled AUC.

new StableMotion: Training Motion Cleanup Models with Unpaired Corrupted Data

Authors: Yuxuan Mu, Hung Yu Ling, Yi Shi, Ismael Baira Ojeda, Pengcheng Xi, Chang Shu, Fabio Zinno, Xue Bin Peng

Abstract: Motion capture (mocap) data often exhibits visually jarring artifacts due to inaccurate sensors and post-processing. Cleaning this corrupted data can require substantial manual effort from human experts, which can be a costly and time-consuming process. Previous data-driven motion cleanup methods offer the promise of automating this cleanup process, but often require in-domain paired corrupted-to-clean training data. Constructing such paired datasets requires access to high-quality, relatively artifact-free motion clips, which often necessitates laborious manual cleanup. In this work, we present StableMotion, a simple yet effective method for training motion cleanup models directly from unpaired corrupted datasets that need cleanup. The core component of our method is the introduction of motion quality indicators, which can be easily annotated through manual labeling or heuristic algorithms and enable training of quality-aware motion generation models on raw motion data with mixed quality. At test time, the model can be prompted to generate high-quality motions using the quality indicators. Our method can be implemented through a simple diffusion-based framework, leading to a unified motion generate-discriminate model, which can be used to both identify and fix corrupted frames. We demonstrate that our proposed method is effective for training motion cleanup models on raw mocap data in production scenarios by applying StableMotion to SoccerMocap, a 245-hour soccer mocap dataset containing real-world motion artifacts. The trained model effectively corrects a wide range of motion artifacts, reducing motion pops and frozen frames by 68% and 81%, respectively. See https://youtu.be/3Y7MMAH02B4 for more results.

URLs: https://youtu.be/3Y7MMAH02B4

new RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

Authors: Sameer Malik, Moyuru Yamada, Ayush Singh, Dishank Aggarwal

Abstract: Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process even minutes to hours videos due to their lack of explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames. Our approach demonstrate superior performances with limited retrieved frames (5-10) compared with other SOTA methods and baselines on two major video QA datasets, NExT-QA and EgoSchema.

new seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

Authors: Hafez Ghaemi, Eilif Muller, Shahab Bakhtiari

Abstract: Current self-supervised algorithms mostly rely on transformations such as data augmentation and masking to learn visual representations. This is achieved by inducing invariance or equivariance with respect to these transformations after encoding two views of an image. This dominant two-view paradigm can limit the flexibility of learned representations for downstream adaptation by creating performance trade-offs between invariance-related tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we introduce \emph{seq-JEPA}, a world modeling paradigm based on joint-embedding predictive architecture that leverages architectural inductive biases to resolve this trade-off. Without requiring an additional equivariance predictor or loss term, seq-JEPA simultaneously learns two architecturally segregated representations: one equivariant to the specified transformations and another invariant to them and suited for tasks such as classification. To do so, our model processes a short sequence of different views (observations) of an input image. Each encoded view is concatenated with embeddings corresponding to the relative transformation (action) producing the next observation in the sequence. A transformer encoder outputs an aggregate representation of this sequence, which is subsequently conditioned on the action leading to the next observation to predict its representation. Empirically, seq-JEPA achieves strong performance on equivariant benchmarks and image classification without sacrificing one for the other. Additionally, our framework excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.

new Interactive Instance Annotation with Siamese Networks

Authors: Xiang Xu, Ruotong Li, Mengjun Yi, Baile XU, Furao Shen, Jian Zhao

Abstract: Annotating instance masks is time-consuming and labor-intensive. A promising solution is to predict contours using a deep learning model and then allow users to refine them. However, most existing methods focus on in-domain scenarios, limiting their effectiveness for cross-domain annotation tasks. In this paper, we propose SiamAnno, a framework inspired by the use of Siamese networks in object tracking. SiamAnno leverages one-shot learning to annotate previously unseen objects by taking a bounding box as input and predicting object boundaries, which can then be adjusted by annotators. Trained on one dataset and tested on another without fine-tuning, SiamAnno achieves state-of-the-art (SOTA) performance across multiple datasets, demonstrating its ability to handle domain and environment shifts in cross-domain tasks. We also provide more comprehensive results compared to previous work, establishing a strong baseline for future research. To our knowledge, SiamAnno is the first model to explore Siamese architecture for instance annotation.

new PiCo: Enhancing Text-Image Alignment with Improved Noise Selection and Precise Mask Control in Diffusion Models

Authors: Chang Xie, Chenyi Zhuang, Pan Gao

Abstract: Advanced diffusion models have made notable progress in text-to-image compositional generation. However, it is still a challenge for existing models to achieve text-image alignment when confronted with complex text prompts. In this work, we highlight two factors that affect this alignment: the quality of the randomly initialized noise and the reliability of the generated controlling mask. We then propose PiCo (Pick-and-Control), a novel training-free approach with two key components to tackle these two factors. First, we develop a noise selection module to assess the quality of the random noise and determine whether the noise is suitable for the target text. A fast sampling strategy is utilized to ensure efficiency in the noise selection stage. Second, we introduce a referring mask module to generate pixel-level masks and to precisely modulate the cross-attention maps. The referring mask is applied to the standard diffusion process to guide the reasonable interaction between text and image features. Extensive experiments have been conducted to verify the effectiveness of PiCo in liberating users from the tedious process of random generation and in enhancing the text-image alignment for diverse text descriptions.

new DCS-ST for Classification of Breast Cancer Histopathology Images with Limited Annotations

Authors: Liu Suxing, Byungwon Min

Abstract: Deep learning methods have shown promise in classifying breast cancer histopathology images, but their performance often declines with limited annotated data, a critical challenge in medical imaging due to the high cost and expertise required for annotations.

new Dual-Domain Masked Image Modeling: A Self-Supervised Pretraining Strategy Using Spatial and Frequency Domain Masking for Hyperspectral Data

Authors: Shaheer Mohamed, Tharindu Fernando, Sridha Sridharan, Peyman Moghadam, Clinton Fookes

Abstract: Hyperspectral images (HSIs) capture rich spectral signatures that reveal vital material properties, offering broad applicability across various domains. However, the scarcity of labeled HSI data limits the full potential of deep learning, especially for transformer-based architectures that require large-scale training. To address this constraint, we propose Spatial-Frequency Masked Image Modeling (SFMIM), a self-supervised pretraining strategy for hyperspectral data that utilizes the large portion of unlabeled data. Our method introduces a novel dual-domain masking mechanism that operates in both spatial and frequency domains. The input HSI cube is initially divided into non-overlapping patches along the spatial dimension, with each patch comprising the entire spectrum of its corresponding spatial location. In spatial masking, we randomly mask selected patches and train the model to reconstruct the masked inputs using the visible patches. Concurrently, in frequency masking, we remove portions of the frequency components of the input spectra and predict the missing frequencies. By learning to reconstruct these masked components, the transformer-based encoder captures higher-order spectral-spatial correlations. We evaluate our approach on three publicly available HSI classification benchmarks and demonstrate that it achieves state-of-the-art performance. Notably, our model shows rapid convergence during fine-tuning, highlighting the efficiency of our pretraining strategy.

new Seeing the Abstract: Translating the Abstract Language for Vision Language Models

Authors: Davide Talon, Federico Girella, Ziyue Liu, Marco Cristani, Yiming Wang

Abstract: Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.

new PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs

Authors: Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache

Abstract: Convolutional neural networks (CNNs) are crucial for computer vision tasks on resource-constrained devices. Quantization effectively compresses these models, reducing storage size and energy cost. However, in modern depthwise-separable architectures, the computational cost is distributed unevenly across its components, with pointwise operations being the most expensive. By applying a general quantization scheme to this imbalanced cost distribution, existing quantization approaches fail to fully exploit potential efficiency gains. To this end, we introduce PROM, a straightforward approach for quantizing modern depthwise-separable convolutional networks by selectively using two distinct bit-widths. Specifically, pointwise convolutions are quantized to ternary weights, while the remaining modules use 8-bit weights, which is achieved through a simple quantization-aware training procedure. Additionally, by quantizing activations to 8-bit, our method transforms pointwise convolutions with ternary weights into int8 additions, which enjoy broad support across hardware platforms and effectively eliminates the need for expensive multiplications. Applying PROM to MobileNetV2 reduces the model's energy cost by more than an order of magnitude (23.9x) and its storage size by 2.7x compared to the float16 baseline while retaining similar classification performance on ImageNet. Our method advances the Pareto frontier for energy consumption vs. top-1 accuracy for quantized convolutional models on ImageNet. PROM addresses the challenges of quantizing depthwise-separable convolutional networks to both ternary and 8-bit weights, offering a simple way to reduce energy cost and storage size.

new DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

Authors: Wei-Ting Chen, Yu-Jiet Vong, Yi-Tsung Lee, Sy-Yen Kuo, Qiang Gao, Sizhuo Ma, Jian Wang

Abstract: Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. Despite the promising performance of existing methods using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), they often struggle to align closely with human perceptions, particularly in diverse real-world scenarios. This challenge is exacerbated by the limited scale and diversity of available datasets. To address this limitation, we introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets. Our framework adapts these models to reconstruct identical input frames through a control module. The adapted diffusion model is then used to extract semantic and distortion features from a resizing branch and a cropping branch, respectively. To enhance the model's ability to handle long-term temporal dynamics, a parallel Mamba module is introduced, which extracts temporal coherence augmented features that are merged with the diffusion features to predict the final score. Experiments across multiple datasets demonstrate DiffVQA's superior performance on intra-dataset evaluations and its exceptional generalization across datasets. These results confirm that leveraging a diffusion model as a feature extractor can offer enhanced VQA performance compared to CNN and ViT backbones.

new OccCylindrical: Multi-Modal Fusion with Cylindrical Representation for 3D Semantic Occupancy Prediction

Authors: Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Yaoqi Huang, Hongyu Lyu, Nguyen Hoang Khoi Tran, Tzu-Yun Tseng, Stewart Worrall

Abstract: The safe operation of autonomous vehicles (AVs) is highly dependent on their understanding of the surroundings. For this, the task of 3D semantic occupancy prediction divides the space around the sensors into voxels, and labels each voxel with both occupancy and semantic information. Recent perception models have used multisensor fusion to perform this task. However, existing multisensor fusion-based approaches focus mainly on using sensor information in the Cartesian coordinate system. This ignores the distribution of the sensor readings, leading to a loss of fine-grained details and performance degradation. In this paper, we propose OccCylindrical that merges and refines the different modality features under cylindrical coordinates. Our method preserves more fine-grained geometry detail that leads to better performance. Extensive experiments conducted on the nuScenes dataset, including challenging rainy and nighttime scenarios, confirm our approach's effectiveness and state-of-the-art performance. The code will be available at: https://github.com/DanielMing123/OccCylindrical

URLs: https://github.com/DanielMing123/OccCylindrical

new Base-Detail Feature Learning Framework for Visible-Infrared Person Re-Identification

Authors: Zhihao Gong, Lian Wu, Yong Xu

Abstract: Visible-infrared person re-identification (VIReID) provides a solution for ReID tasks in 24-hour scenarios; however, significant challenges persist in achieving satisfactory performance due to the substantial discrepancies between visible (VIS) and infrared (IR) modalities. Existing methods inadequately leverage information from different modalities, primarily focusing on digging distinguishing features from modality-shared information while neglecting modality-specific details. To fully utilize differentiated minutiae, we propose a Base-Detail Feature Learning Framework (BDLF) that enhances the learning of both base and detail knowledge, thereby capitalizing on both modality-shared and modality-specific information. Specifically, the proposed BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, supported by a novel correlation restriction method that ensures the features gained by BDLF enrich both detail and base knowledge across VIS and IR features. Comprehensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF.

new Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach

Authors: Pierre Adorni, Minh-Tan Pham, St\'ephane May, S\'ebastien Lef\`evre

Abstract: Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate their comparison, we propose a cost-effective method for predicting a model's performance on multiple downstream tasks without the need for fine-tuning on each one. This method is based on what we call "capabilities encoding." The utility of this novel approach is twofold: we demonstrate its potential to simplify the selection of a foundation model for a given new task, and we employ it to offer a fresh perspective on the existing literature, suggesting avenues for future research. Codes are available at https://github.com/pierreadorni/capabilities-encoding.

URLs: https://github.com/pierreadorni/capabilities-encoding.

new 3D Can Be Explored In 2D: Pseudo-Label Generation for LiDAR Point Clouds Using Sensor-Intensity-Based 2D Semantic Segmentation

Authors: Andrew Caunes, Thierry Chateau, Vincent Fr\'emont

Abstract: Semantic segmentation of 3D LiDAR point clouds, essential for autonomous driving and infrastructure management, is best achieved by supervised learning, which demands extensive annotated datasets and faces the problem of domain shifts. We introduce a new 3D semantic segmentation pipeline that leverages aligned scenes and state-of-the-art 2D segmentation methods, avoiding the need for direct 3D annotation or reliance on additional modalities such as camera images at inference time. Our approach generates 2D views from LiDAR scans colored by sensor intensity and applies 2D semantic segmentation to these views using a camera-domain pretrained model. The segmented 2D outputs are then back-projected onto the 3D points, with a simple voting-based estimator that merges the labels associated to each 3D point. Our main contribution is a global pipeline for 3D semantic segmentation requiring no prior 3D annotation and not other modality for inference, which can be used for pseudo-label generation. We conduct a thorough ablation study and demonstrate the potential of the generated pseudo-labels for the Unsupervised Domain Adaptation task.

new Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices

Authors: Tasnim Shahriar

Abstract: This paper presents a comprehensive evaluation of lightweight deep learning models for image classification, emphasizing their suitability for deployment in resource-constrained environments such as low-memory devices. Five state-of-the-art architectures - MobileNetV3 Small, ResNet18, SqueezeNet, EfficientNetV2-S, and ShuffleNetV2 - are benchmarked across three diverse datasets: CIFAR-10, CIFAR-100, and Tiny ImageNet. The models are assessed using four key performance metrics: classification accuracy, inference time, floating-point operations (FLOPs), and model size. Additionally, we investigate the impact of hyperparameter tuning, data augmentation, and training paradigms by comparing pretrained models with scratch-trained counterparts, focusing on MobileNetV3 Small. Our findings reveal that transfer learning significantly enhances model accuracy and computational efficiency, particularly for complex datasets like Tiny ImageNet. EfficientNetV2 consistently achieves the highest accuracy, while MobileNetV3 offers the best balance between accuracy and efficiency, and SqueezeNet excels in inference speed and compactness. This study highlights critical trade-offs between accuracy and efficiency, offering actionable insights for deploying lightweight models in real-world applications where computational resources are limited. By addressing these challenges, this research contributes to optimizing deep learning systems for edge computing and mobile platforms.

new 3D Gaussian Splatting Data Compression with Mixture of Priors

Authors: Lei Liu, Zhenghao Chen, Dong Xu

Abstract: 3D Gaussian Splatting (3DGS) data compression is crucial for enabling efficient storage and transmission in 3D scene modeling. However, its development remains limited due to inadequate entropy models and suboptimal quantization strategies for both lossless and lossy compression scenarios, where existing methods have yet to 1) fully leverage hyperprior information to construct robust conditional entropy models, and 2) apply fine-grained, element-wise quantization strategies for improved compression granularity. In this work, we propose a novel Mixture of Priors (MoP) strategy to simultaneously address these two challenges. Specifically, inspired by the Mixture-of-Experts (MoE) paradigm, our MoP approach processes hyperprior information through multiple lightweight MLPs to generate diverse prior features, which are subsequently integrated into the MoP feature via a gating mechanism. To enhance lossless compression, the resulting MoP feature is utilized as a hyperprior to improve conditional entropy modeling. Meanwhile, for lossy compression, we employ the MoP feature as guidance information in an element-wise quantization procedure, leveraging a prior-guided Coarse-to-Fine Quantization (C2FQ) strategy with a predefined quantization step value. Specifically, we expand the quantization step value into a matrix and adaptively refine it from coarse to fine granularity, guided by the MoP feature, thereby obtaining a quantization step matrix that facilitates element-wise quantization. Extensive experiments demonstrate that our proposed 3DGS data compression framework achieves state-of-the-art performance across multiple benchmarks, including Mip-NeRF360, BungeeNeRF, DeepBlending, and Tank&Temples.

new Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model (3) while incorrect predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.

new SD-VSum: A Method and Dataset for Script-Driven Video Summarization

Authors: Manolis Mylonas, Evlampios Apostolidis, Vasileios Mezaris

Abstract: In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Following, we extend a recently-introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. In this way we make it compatible with the introduced task, since the available triplets of ``video, summary and summary description'' can be used for training a method that is able to produce different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum), that relies on the use of a cross-modal attention mechanism for aligning and fusing information from the visual and text modalities. Our experimental evaluations demonstrate the advanced performance of SD-VSum against state-of-the-art approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries that are adapted to each user's needs about their content.

new Very High-Resolution Forest Mapping with TanDEM-X InSAR Data and Self-Supervised Learning

Authors: Jos\'e-Luis Bueso-Bello, Benjamin Chauvel, Daniel Carcereri, Philipp Posovszky, Pietro Milillo, Jennifer Ruiz, Juan-Carlos Fern\'andez-Diaz, Carolina Gonz\'alez, Michele Martone, Ronny H\"ansch, Paola Rizzoli

Abstract: Deep learning models have shown encouraging capabilities for mapping accurately forests at medium resolution with TanDEM-X interferometric SAR data. Such models, as most of current state-of-the-art deep learning techniques in remote sensing, are trained in a fully-supervised way, which requires a large amount of labeled data for training and validation. In this work, our aim is to exploit the high-resolution capabilities of the TanDEM-X mission to map forests at 6 m. The goal is to overcome the intrinsic limitations posed by midresolution products, which affect, e.g., the detection of narrow roads within vegetated areas and the precise delineation of forested regions contours. To cope with the lack of extended reliable reference datasets at such a high resolution, we investigate self-supervised learning techniques for extracting highly informative representations from the input features, followed by a supervised training step with a significantly smaller number of reliable labels. A 1 m resolution forest/non-forest reference map over Pennsylvania, USA, allows for comparing different training approaches for the development of an effective forest mapping framework with limited labeled samples. We select the best-performing approach over this test region and apply it in a real-case forest mapping scenario over the Amazon rainforest, where only very few labeled data at high resolution are available. In this challenging scenario, the proposed self-supervised framework significantly enhances the classification accuracy with respect to fully-supervised methods, trained using the same amount of labeled data, representing an extremely promising starting point for large-scale, very high-resolution forest mapping with TanDEM-X data.

new FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Authors: Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Lei Sun, Xiangxiang Chu

Abstract: The task of scene text editing is to modify or add texts on images while maintaining the fidelity of newly generated text and visual coherence with the background. Recent works based on latent diffusion models (LDM) show improved text editing results, yet still face challenges and often generate inaccurate or unrecognizable characters, especially for non-Latin ones (\eg, Chinese), which have complex glyph structures. To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing framework based on FLUX-Fill. Specifically, we carefully investigate glyph conditioning, considering both visual and textual modalities. To retain the original generative capabilities of FLUX-Fill while enhancing its understanding and generation of glyphs, we propose lightweight glyph and text embedding modules. Owning to the lightweight design, FLUX-Text is trained only with $100K$ training examples compared to current popular methods trained with 2.9M ones. With no bells and whistles, our method achieves state-of-the-art performance on text editing tasks. Qualitative and quantitative experiments on the public datasets demonstrate that our method surpasses previous works in text fidelity.

new From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection

Authors: Guoting Wei, Yu Liu, Xia Yuan, Xizhe Xue, Linlin Guo, Yifan Yang, Chunxia Zhao, Zongwen Bai, Haokui Zhang, Rong Xiao

Abstract: In recent years, language-guided open-world aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary, which fails to meet the demands of more fine-grained open-world detection. To address this limitation, we propose constructing a large-scale language-guided open-set aerial detection dataset, encompassing three levels of language guidance: from words to phrases, and ultimately to sentences. Centered around an open-source large vision-language model and integrating image-operation-based preprocessing with BERT-based postprocessing, we present the OS-W2S Label Engine, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, called Multi-instance Open-set Aerial Dataset (MI-OAD), addressing the limitations of current remote sensing grounding data and enabling effective open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, approximately 40 times larger than comparable datasets. We also employ state-of-the-art open-set methods from the natural image domain, trained on our proposed dataset, to validate the model's open-set detection capabilities. For instance, when trained on our dataset, Grounding DINO achieves improvements of 29.5 AP_{50} and 33.7 Recall@10 for sentence inputs under zero-shot transfer conditions. Both the dataset and the label engine will be released publicly.

new A Vision-Language Model for Focal Liver Lesion Classification

Authors: Song Jian, Hu Yuchang, Wang Hui, Chen Yen-Wei

Abstract: Accurate classification of focal liver lesions is crucial for diagnosis and treatment in hepatology. However, traditional supervised deep learning models depend on large-scale annotated datasets, which are often limited in medical imaging. Recently, Vision-Language models (VLMs) such as Contrastive Language-Image Pre-training model (CLIP) has been applied to image classifications. Compared to the conventional convolutional neural network (CNN), which classifiers image based on visual information only, VLM leverages multimodal learning with text and images, allowing it to learn effectively even with a limited amount of labeled data. Inspired by CLIP, we pro-pose a Liver-VLM, a model specifically designed for focal liver lesions (FLLs) classification. First, Liver-VLM incorporates class information into the text encoder without introducing additional inference overhead. Second, by calculating the pairwise cosine similarities between image and text embeddings and optimizing the model with a cross-entropy loss, Liver-VLM ef-fectively aligns image features with class-level text features. Experimental results on MPCT-FLLs dataset demonstrate that the Liver-VLM model out-performs both the standard CLIP and MedCLIP models in terms of accuracy and Area Under the Curve (AUC). Further analysis shows that using a lightweight ResNet18 backbone enhances classification performance, particularly under data-constrained conditions.

new GUAVA: Generalizable Upper Body 3D Gaussian Avatar

Authors: Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Yang Li, Minghan Qin, Yu Li, Haoqian Wang

Abstract: Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training on individual IDs, which is both complex and time-consuming. Furthermore, limited by SMPLX's expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling techniques to infer Ubody (upper-body) Gaussians from a single image. The rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA significantly outperforms previous methods in rendering quality and offers significant speed improvements, with reconstruction times in the sub-second range (0.1s), and supports real-time animation and rendering.

new Interpretable Zero-shot Learning with Infinite Class Concepts

Authors: Zihan Ye, Shreyank N Gowda, Shiming Chen, Yaochu Jin, Kaizhu Huang, Xiaobo Jin

Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by aligning images with intermediate class semantics, like human-annotated concepts or class definitions. An emerging alternative leverages Large-scale Language Models (LLMs) to automatically generate class documents. However, these methods often face challenges with transparency in the classification process and may suffer from the notorious hallucination problem in LLMs, resulting in non-visual class semantics. This paper redefines class semantics in ZSL with a focus on transferability and discriminability, introducing a novel framework called Zero-shot Learning with Infinite Class Concepts (InfZSL). Our approach leverages the powerful capabilities of LLMs to dynamically generate an unlimited array of phrase-level class concepts. To address the hallucination challenge, we introduce an entropy-based scoring process that incorporates a ``goodness" concept selection mechanism, ensuring that only the most transferable and discriminative concepts are selected. Our InfZSL framework not only demonstrates significant improvements on three popular benchmark datasets but also generates highly interpretable, image-grounded concepts. Code will be released upon acceptance.

new 3D Surface Reconstruction with Enhanced High-Frequency Details

Authors: Shikun Zhang, Yiqun Wang, Cunjian Chen, Yong Li, Qiuhong Ke

Abstract: Neural implicit 3D reconstruction can reproduce shapes without 3D supervision, and it learns the 3D scene through volume rendering methods and neural implicit representations. Current neural surface reconstruction methods tend to randomly sample the entire image, making it difficult to learn high-frequency details on the surface, and thus the reconstruction results tend to be too smooth. We designed a method (FreNeuS) based on high-frequency information to solve the problem of insufficient surface detail. Specifically, FreNeuS uses pixel gradient changes to easily acquire high-frequency regions in an image and uses the obtained high-frequency information to guide surface detail reconstruction. High-frequency information is first used to guide the dynamic sampling of rays, applying different sampling strategies according to variations in high-frequency regions. To further enhance the focus on surface details, we have designed a high-frequency weighting method that constrains the representation of high-frequency details during the reconstruction process. Qualitative and quantitative results show that our method can reconstruct fine surface details and obtain better surface reconstruction quality compared to existing methods. In addition, our method is more applicable and can be generalized to any NeuS-based work.

new Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models

Authors: Abram Schonfeldt, Benjamin Maylor, Xiaofang Chen, Ronald Clark, Aiden Doherty

Abstract: Introduction: Data from wearable devices collected in free-living settings, and labelled with physical activity behaviours compatible with health research, are essential for both validating existing wearable-based measurement approaches and developing novel machine learning approaches. One common way of obtaining these labels relies on laborious annotation of sequences of images captured by cameras worn by participants through the course of a day. Methods: We compare the performance of three vision language models and two discriminative models on two free-living validation studies with 161 and 111 participants, collected in Oxfordshire, United Kingdom and Sichuan, China, respectively, using the Autographer (OMG Life, defunct) wearable camera. Results: We found that the best open-source vision-language model (VLM) and fine-tuned discriminative model (DM) achieved comparable performance when predicting sedentary behaviour from single images on unseen participants in the Oxfordshire study; median F1-scores: VLM = 0.89 (0.84, 0.92), DM = 0.91 (0.86, 0.95). Performance declined for light (VLM = 0.60 (0.56,0.67), DM = 0.70 (0.63, 0.79)), and moderate-to-vigorous intensity physical activity (VLM = 0.66 (0.53, 0.85); DM = 0.72 (0.58, 0.84)). When applied to the external Sichuan study, performance fell across all intensity categories, with median Cohen's kappa-scores falling from 0.54 (0.49, 0.64) to 0.26 (0.15, 0.37) for the VLM, and from 0.67 (0.60, 0.74) to 0.19 (0.10, 0.30) for the DM. Conclusion: Freely available computer vision models could help annotate sedentary behaviour, typically the most prevalent activity of daily living, from wearable camera images within similar populations to seen data, reducing the annotation burden.

new Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant

Authors: Haonan Wang, Jiaji Mao, Lehan Wang, Qixiang Zhang, Marawan Elbatel, Yi Qin, Huijun Hu, Baoxun Li, Wenhui Deng, Weifeng Qin, Hongrui Li, Jialin Liang, Jun Shen, Xiaomeng Li

Abstract: Medical AI assistants support doctors in disease diagnosis, medical image analysis, and report generation. However, they still face significant challenges in clinical use, including limited accuracy with multimodal content and insufficient validation in real-world settings. We propose RCMed, a full-stack AI assistant that improves multimodal alignment in both input and output, enabling precise anatomical delineation, accurate localization, and reliable diagnosis through hierarchical vision-language grounding. A self-reinforcing correlation mechanism allows visual features to inform language context, while language semantics guide pixel-wise attention, forming a closed loop that refines both modalities. This correlation is enhanced by a color region description strategy, translating anatomical structures into semantically rich text to learn shape-location-text relationships across scales. Trained on 20 million image-mask-description triplets, RCMed achieves state-of-the-art precision in contextualizing irregular lesions and subtle anatomical boundaries, excelling in 165 clinical tasks across 9 modalities. It achieved a 23.5% relative improvement in cell segmentation from microscopy images over prior methods. RCMed's strong vision-language alignment enables exceptional generalization, with state-of-the-art performance in external validation across 20 clinically significant cancer types, including novel tasks. This work demonstrates how integrated multimodal models capture fine-grained patterns, enabling human-level interpretation in complex scenarios and advancing human-centric AI healthcare.

new Attention-aggregated Attack for Boosting the Transferability of Facial Adversarial Examples

Authors: Jian-Wei Li, Wen-Ze Shao

Abstract: Adversarial examples have revealed the vulnerability of deep learning models and raised serious concerns about information security. The transfer-based attack is a hot topic in black-box attacks that are practical to real-world scenarios where the training datasets, parameters, and structure of the target model are unknown to the attacker. However, few methods consider the particularity of class-specific deep models for fine-grained vision tasks, such as face recognition (FR), giving rise to unsatisfactory attacking performance. In this work, we first investigate what in a face exactly contributes to the embedding learning of FR models and find that both decisive and auxiliary facial features are specific to each FR model, which is quite different from the biological mechanism of human visual system. Accordingly we then propose a novel attack method named Attention-aggregated Attack (AAA) to enhance the transferability of adversarial examples against FR, which is inspired by the attention divergence and aims to destroy the facial features that are critical for the decision-making of other FR models by imitating their attentions on the clean face images. Extensive experiments conducted on various FR models validate the superiority and robust effectiveness of the proposed method over existing methods.

new EOPose : Exemplar-based object reposing using Generalized Pose Correspondences

Authors: Sarthak Mehrotra, Rishabh Jain, Mayur Hemani, Balaji Krishnamurthy, Mausoom Sarkar

Abstract: Reposing objects in images has a myriad of applications, especially for e-commerce where several variants of product images need to be produced quickly. In this work, we leverage the recent advances in unsupervised keypoint correspondence detection between different object images of the same class to propose an end-to-end framework for generic object reposing. Our method, EOPose, takes a target pose-guidance image as input and uses its keypoint correspondence with the source object image to warp and re-render the latter into the target pose using a novel three-step approach. Unlike generative approaches, our method also preserves the fine-grained details of the object such as its exact colors, textures, and brand marks. We also prepare a new dataset of paired objects based on the Objaverse dataset to train and test our network. EOPose produces high-quality reposing output as evidenced by different image quality metrics (PSNR, SSIM and FID). Besides a description of the method and the dataset, the paper also includes detailed ablation and user studies to indicate the efficacy of the proposed method

new DDaTR: Dynamic Difference-aware Temporal Residual Network for Longitudinal Radiology Report Generation

Authors: Shanshan Song, Hui Tang, Honglong Yang, Xiaomeng Li

Abstract: Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a visual pre-trained encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the information of difference across exams and thus underrepresent the expected progressions, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the dynamic difference-aware module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.

new CXR-AD: Component X-ray Image Dataset for Industrial Anomaly Detection

Authors: Haoyu Bai, Jie Wang, Gaomin Li, Xuan Li, Xiaohu Zhang, Xia Yang

Abstract: Internal defect detection constitutes a critical process in ensuring component quality, for which anomaly detection serves as an effective solution. However, existing anomaly detection datasets predominantly focus on surface defects in visible-light images, lacking publicly available X-ray datasets targeting internal defects in components. To address this gap, we construct the first publicly accessible component X-ray anomaly detection (CXR-AD) dataset, comprising real-world X-ray images. The dataset covers five industrial component categories, including 653 normal samples and 561 defect samples with precise pixel-level mask annotations. We systematically analyze the dataset characteristics and identify three major technical challenges: (1) strong coupling between complex internal structures and defect regions, (2) inherent low contrast and high noise interference in X-ray imaging, and (3) significant variations in defect scales and morphologies. To evaluate dataset complexity, we benchmark three state-of-the-art anomaly detection frameworks (feature-based, reconstruction-based, and zero-shot learning methods). Experimental results demonstrate a 29.78% average performance degradation on CXR-AD compared to MVTec AD, highlighting the limitations of current algorithms in handling internal defect detection tasks. To the best of our knowledge, CXR-AD represents the first publicly available X-ray dataset for component anomaly detection, providing a real-world industrial benchmark to advance algorithm development and enhance precision in internal defect inspection technologies.

new Enhancing Target-unspecific Tasks through a Features Matrix

Authors: Fangming Cui, Yonggang Zhang, Xuan Wang, Xinmei Tian, Jun Yu

Abstract: Recent developments in prompt learning of large vision-language models have significantly improved performance in target-specific tasks. However, these prompt optimizing methods often struggle to tackle the target-unspecific or generalizable tasks effectively. It may be attributed to the fact that overfitting training causes the model to forget its general knowledge having strong promotion on target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) regularization approach designed to enhance these models on target-unspecific tasks. Our method extracts and leverages general knowledge, shaping a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs from a deep and fine perspective, preserving essential general knowledge, which mitigates the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM significantly showcases its effectiveness in enhancing target-unspecific tasks, achieving state-of-the-art performance.

new LiftFeat: 3D Geometry-Aware Local Feature Matching

Authors: Yepeng Liu, Wenpeng Lai, Zhou Zhao, Yuxuan Xiong, Jinchi Zhu, Jun Cheng, Yongchao Xu

Abstract: Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called \textit{LiftFeat}, which lifts the robustness of raw descriptor by aggregating 3D geometric feature. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal label, supervising the extraction of 3D geometric feature in terms of predicted surface normal. We then design a 3D geometry-aware feature lifting module to fuse surface normal feature with raw 2D descriptor feature. Integrating such 3D geometric feature enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks, demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at : https://github.com/lyp-deeplearning/LiftFeat.

URLs: https://github.com/lyp-deeplearning/LiftFeat.

new Phenotype-Guided Generative Model for High-Fidelity Cardiac MRI Synthesis: Advancing Pretraining and Clinical Applications

Authors: Ziyu Li, Yujian Hu, Zhengyao Ding, Yiheng Mao, Haitao Li, Fan Yi, Hongkun Zhang, Zhengxing Huang

Abstract: Cardiac Magnetic Resonance (CMR) imaging is a vital non-invasive tool for diagnosing heart diseases and evaluating cardiac health. However, the limited availability of large-scale, high-quality CMR datasets poses a major challenge to the effective application of artificial intelligence (AI) in this domain. Even the amount of unlabeled data and the health status it covers are difficult to meet the needs of model pretraining, which hinders the performance of AI models on downstream tasks. In this study, we present Cardiac Phenotype-Guided CMR Generation (CPGG), a novel approach for generating diverse CMR data that covers a wide spectrum of cardiac health status. The CPGG framework consists of two stages: in the first stage, a generative model is trained using cardiac phenotypes derived from CMR data; in the second stage, a masked autoregressive diffusion model, conditioned on these phenotypes, generates high-fidelity CMR cine sequences that capture both structural and functional features of the heart in a fine-grained manner. We synthesized a massive amount of CMR to expand the pretraining data. Experimental results show that CPGG generates high-quality synthetic CMR data, significantly improving performance on various downstream tasks, including diagnosis and cardiac phenotypes prediction. These gains are demonstrated across both public and private datasets, highlighting the effectiveness of our approach. Code is availabel at https://anonymous.4open.science/r/CPGG.

URLs: https://anonymous.4open.science/r/CPGG.

new A Fusion-Guided Inception Network for Hyperspectral Image Super-Resolution

Authors: Usman Muhammad, Jorma Laaksonen

Abstract: The fusion of low-spatial-resolution hyperspectral images (HSIs) with high-spatial-resolution conventional images (e.g., panchromatic or RGB) has played a significant role in recent advancements in HSI super-resolution. However, this fusion process relies on the availability of precise alignment between image pairs, which is often challenging in real-world scenarios. To mitigate this limitation, we propose a single-image super-resolution model called the Fusion-Guided Inception Network (FGIN). Specifically, we first employ a spectral-spatial fusion module to effectively integrate spectral and spatial information at an early stage. Next, an Inception-like hierarchical feature extraction strategy is used to capture multiscale spatial dependencies, followed by a dedicated multi-scale fusion block. To further enhance reconstruction quality, we incorporate an optimized upsampling module that combines bilinear interpolation with depthwise separable convolutions. Experimental evaluations on two publicly available hyperspectral datasets demonstrate the competitive performance of our method.

new Robustness in AI-Generated Detection: Enhancing Resistance to Adversarial Attacks

Authors: Sun Haoxuan, Hong Yan, Zhan Jiahui, Chen Haoxing, Lan Jun, Zhu Huijia, Wang Weiqiang, Zhang Liqing, Zhang Jianfu

Abstract: The rapid advancement of generative image technology has introduced significant security concerns, particularly in the domain of face generation detection. This paper investigates the vulnerabilities of current AI-generated face detection systems. Our study reveals that while existing detection methods often achieve high accuracy under standard conditions, they exhibit limited robustness against adversarial attacks. To address these challenges, we propose an approach that integrates adversarial training to mitigate the impact of adversarial examples. Furthermore, we utilize diffusion inversion and reconstruction to further enhance detection robustness. Experimental results demonstrate that minor adversarial perturbations can easily bypass existing detection systems, but our method significantly improves the robustness of these systems. Additionally, we provide an in-depth analysis of adversarial and benign examples, offering insights into the intrinsic characteristics of AI-generated content. All associated code will be made publicly available in a dedicated repository to facilitate further research and verification.

new Polar Coordinate-Based 2D Pose Prior with Neural Distance Field

Authors: Qi Gan, Sao Mai Nguyen, Eric Fenaux, Stephan Cl\'emen\c{c}on, Moun\^im El Yacoubi

Abstract: Human pose capture is essential for sports analysis, enabling precise evaluation of athletes' movements. While deep learning-based human pose estimation (HPE) models from RGB videos have achieved impressive performance on public datasets, their effectiveness in real-world sports scenarios is often hindered by motion blur, occlusions, and domain shifts across different pose representations. Fine-tuning these models can partially alleviate such challenges but typically requires large-scale annotated data and still struggles to generalize across diverse sports environments. To address these limitations, we propose a 2D pose prior-guided refinement approach based on Neural Distance Fields (NDF). Unlike existing approaches that rely solely on angular representations of human poses, we introduce a polar coordinate-based representation that explicitly incorporates joint connection lengths, enabling a more accurate correction of erroneous pose estimations. Additionally, we define a novel non-geodesic distance metric that separates angular and radial discrepancies, which we demonstrate is better suited for polar representations than traditional geodesic distances. To mitigate data scarcity, we develop a gradient-based batch-projection augmentation strategy, which synthesizes realistic pose samples through iterative refinement. Our method is evaluated on a long jump dataset, demonstrating its ability to improve 2D pose estimation across multiple pose representations, making it robust across different domains. Experimental results show that our approach enhances pose plausibility while requiring only limited training data. Code is available at: https://github.com/QGAN2019/polar-NDF.

URLs: https://github.com/QGAN2019/polar-NDF.

new Nonperiodic dynamic CT reconstruction using backward-warping INR with regularization of diffeomorphism (BIRD)

Authors: Muge Du, Zhuozhao Zheng, Wenying Wang, Guotao Quan, Wuliang Shi, Le Shen, Li Zhang, Liang Li, Yinong Liu, Yuxiang Xing

Abstract: Dynamic computed tomography (CT) reconstruction faces significant challenges in addressing motion artifacts, particularly for nonperiodic rapid movements such as cardiac imaging with fast heart rates. Traditional methods struggle with the extreme limited-angle problems inherent in nonperiodic cases. Deep learning methods have improved performance but face generalization challenges. Recent implicit neural representation (INR) techniques show promise through self-supervised deep learning, but have critical limitations: computational inefficiency due to forward-warping modeling, difficulty balancing DVF complexity with anatomical plausibility, and challenges in preserving fine details without additional patient-specific pre-scans. This paper presents a novel INR-based framework, BIRD, for nonperiodic dynamic CT reconstruction. It addresses these challenges through four key contributions: (1) backward-warping deformation that enables direct computation of each dynamic voxel with significantly reduced computational cost, (2) diffeomorphism-based DVF regularization that ensures anatomically plausible deformations while maintaining representational capacity, (3) motion-compensated analytical reconstruction that enhances fine details without requiring additional pre-scans, and (4) dimensional-reduction design for efficient 4D coordinate encoding. Through various simulations and practical studies, including digital and physical phantoms and retrospective patient data, we demonstrate the effectiveness of our approach for nonperiodic dynamic CT reconstruction with enhanced details and reduced motion artifacts. The proposed framework enables more accurate dynamic CT reconstruction with potential clinical applications, such as one-beat cardiac reconstruction, cinematic image sequences for functional imaging, and motion artifact reduction in conventional CT scans.

new Blending 3D Geometry and Machine Learning for Multi-View Stereopsis

Authors: Vibhas Vats, Md. Alimoor Reza, David Crandall, Soon-heung Jung

Abstract: Traditional multi-view stereo (MVS) methods primarily depend on photometric and geometric consistency constraints. In contrast, modern learning-based algorithms often rely on the plane sweep algorithm to infer 3D geometry, applying explicit geometric consistency (GC) checks only as a post-processing step, with no impact on the learning process itself. In this work, we introduce GC MVSNet plus plus, a novel approach that actively enforces geometric consistency of reference view depth maps across multiple source views (multi view) and at various scales (multi scale) during the learning phase (see Fig. 1). This integrated GC check significantly accelerates the learning process by directly penalizing geometrically inconsistent pixels, effectively halving the number of training iterations compared to other MVS methods. Furthermore, we introduce a densely connected cost regularization network with two distinct block designs simple and feature dense optimized to harness dense feature connections for enhanced regularization. Extensive experiments demonstrate that our approach achieves a new state of the art on the DTU and BlendedMVS datasets and secures second place on the Tanks and Temples benchmark. To our knowledge, GC MVSNet plus plus is the first method to enforce multi-view, multi-scale supervised geometric consistency during learning. Our code is available.

new UPMAD-Net: A Brain Tumor Segmentation Network with Uncertainty Guidance and Adaptive Multimodal Feature Fusion

Authors: Zhanyuan Jia, Ni Yao, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Fubao Zhu, Chen Zhao, Weihua Zhou

Abstract: Background: Brain tumor segmentation has a significant impact on the diagnosis and treatment of brain tumors. Accurate brain tumor segmentation remains challenging due to their irregular shapes, vague boundaries, and high variability. Objective: We propose a brain tumor segmentation method that combines deep learning with prior knowledge derived from a region-growing algorithm. Methods: The proposed method utilizes a multi-scale feature fusion (MSFF) module and adaptive attention mechanisms (AAM) to extract multi-scale features and capture global contextual information. To enhance the model's robustness in low-confidence regions, the Monte Carlo Dropout (MC Dropout) strategy is employed for uncertainty estimation. Results: Extensive experiments demonstrate that the proposed method achieves superior performance on Brain Tumor Segmentation (BraTS) datasets, significantly outperforming various state-of-the-art methods. On the BraTS2021 dataset, the test Dice scores are 89.18% for Enhancing Tumor (ET) segmentation, 93.67% for Whole Tumor (WT) segmentation, and 91.23% for Tumor Core (TC) segmentation. On the BraTS2019 validation set, the validation Dice scores are 87.43%, 90.92%, and 90.40% for ET, WT, and TC segmentation, respectively. Ablation studies further confirmed the contribution of each module to segmentation accuracy, indicating that each component played a vital role in overall performance improvement. Conclusion: This study proposed a novel 3D brain tumor segmentation network based on the U-Net architecture. By incorporating the prior knowledge and employing the uncertainty estimation method, the robustness and performance were improved. The code for the proposed method is available at https://github.com/chenzhao2023/UPMAD_Net_BrainSeg.

URLs: https://github.com/chenzhao2023/UPMAD_Net_BrainSeg.

new MRI motion correction via efficient residual-guided denoising diffusion probabilistic models

Authors: Mojtaba Safari, Shansong Wang, Qiang Li, Zach Eidex, Richard L. J. Qiu, Chih-Wei Chang, Hui Mao, Xiaofeng Yang

Abstract: Purpose: Motion artifacts in magnetic resonance imaging (MRI) significantly degrade image quality and impair quantitative analysis. Conventional mitigation strategies, such as repeated acquisitions or motion tracking, are costly and workflow-intensive. This study introduces Res-MoCoDiff, an efficient denoising diffusion probabilistic model tailored for MRI motion artifact correction. Methods: Res-MoCoDiff incorporates a novel residual error shifting mechanism in the forward diffusion process, aligning the noise distribution with motion-corrupted data and enabling an efficient four-step reverse diffusion. A U-net backbone enhanced with Swin-Transformer blocks conventional attention layers, improving adaptability across resolutions. Training employs a combined l1+l2 loss, which promotes image sharpness and reduces pixel-level errors. Res-MoCoDiff was evaluated on synthetic dataset generated using a realistic motion simulation framework and on an in-vivo dataset. Comparative analyses were conducted against established methods, including CycleGAN, Pix2pix, and MT-DDPM using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and normalized mean squared error (NMSE). Results: The proposed method demonstrated superior performance in removing motion artifacts across all motion severity levels. Res-MoCoDiff consistently achieved the highest SSIM and the lowest NMSE values, with a PSNR of up to 41.91+-2.94 dB for minor distortions. Notably, the average sampling time was reduced to 0.37 seconds per batch of two image slices, compared with 101.74 seconds for conventional approaches.

new Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking

Authors: Shenglan Li, Rui Yao, Yong Zhou, Hancheng Zhu, Kunyang Sun, Bing Liu, Zhiwen Shao, Jiaqi Zhao

Abstract: To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-label or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar object noise can further affect the tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object's coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, and thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.

URLs: https://github.com/LiShenglana/GDSTrack.

new Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks

Authors: Haotong Cheng, Zhiqi Zhang, Hao Li, Xinshang Zhang

Abstract: Deep learning has substantially advanced the Single Image Super-Resolution (SISR). However, existing researches have predominantly focused on raw performance gains, with little attention paid to quantifying the transferability of architectural components. In this paper, we introduce the concept of "Universality" and its associated definitions which extend the traditional notion of "Generalization" to encompass the modules' ease of transferability, thus revealing the relationships between module universality and model generalizability. Then we propose the Universality Assessment Equation (UAE), a metric for quantifying how readily a given module could be transplanted across models. Guided by the UAE results of standard residual blocks and other plug-and-play modules, we further design two optimized modules, Cycle Residual Block (CRB) and Depth-Wise Cycle Residual Block (DCRB). Through comprehensive experiments on natural-scene benchmarks, remote-sensing datasets, extreme-industrial imagery and on-device deployments, we demonstrate that networks embedded with the proposed plug-and-play modules outperform several state-of-the-arts, reaching a PSNR enhancement of up to 0.83dB or enabling a 71.3% reduction in parameters with negligible loss in reconstruction fidelity.

new Coop-WD: Cooperative Perception with Weighting and Denoising for Robust V2V Communication

Authors: Chenguang Liu, Jianjun Chen, Yunfei Chen, Yubei He, Zhuangkun Wei, Hongjian Sun, Haiyan Lu, Qi Hao

Abstract: Cooperative perception, leveraging shared information from multiple vehicles via vehicle-to-vehicle (V2V) communication, plays a vital role in autonomous driving to alleviate the limitation of single-vehicle perception. Existing works have explored the effects of V2V communication impairments on perception precision, but they lack generalization to different levels of impairments. In this work, we propose a joint weighting and denoising framework, Coop-WD, to enhance cooperative perception subject to V2V channel impairments. In this framework, the self-supervised contrastive model and the conditional diffusion probabilistic model are adopted hierarchically for vehicle-level and pixel-level feature enhancement. An efficient variant model, Coop-WD-eco, is proposed to selectively deactivate denoising to reduce processing overhead. Rician fading, non-stationarity, and time-varying distortion are considered. Simulation results demonstrate that the proposed Coop-WD outperforms conventional benchmarks in all types of channels. Qualitative analysis with visual examples further proves the superiority of our proposed method. The proposed Coop-WD-eco achieves up to 50% reduction in computational cost under severe distortion while maintaining comparable accuracy as channel conditions improve.

new RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT

Authors: Chuyu Zhao, Hao Huang, Jiashuo Guo, Ziyu Shen, Zhongwei Zhou, Jie Liu, Zekuan Yu

Abstract: Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group dual-student, semi-supervised framework. Each group contains two student models guided by a shared teacher network. By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. Specifically, RAIL introduces two instructive mechanisms. Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas. In the unsupervised phase, Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training. This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels. Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation. Our code will be available at https://github.com/Tournesol-Saturday/RAIL.

URLs: https://github.com/Tournesol-Saturday/RAIL.

new Panoramic Out-of-Distribution Segmentation

Authors: Mengfei Duan, Kailun Yang, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Zhiyong Li, Shutao Li

Abstract: Panoramic imaging enables capturing 360{\deg} images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), achieving OoS for panoramas. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.

URLs: https://github.com/MengfeiD/PanOoS.

new Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment

Authors: Jo\~ao Alves, Pia Haubro Andersen, Rikke Gade

Abstract: The Equine Facial Action Coding System (EquiFACS) enables the systematic annotation of facial movements through distinct Action Units (AUs). It serves as a crucial tool for assessing affective states in horses by identifying subtle facial expressions associated with discomfort. However, the field of horse affective state assessment is constrained by the scarcity of annotated data, as manually labelling facial AUs is both time-consuming and costly. To address this challenge, automated annotation systems are essential for leveraging existing datasets and improving affective states detection tools. In this work, we study different methods for specific ear AU detection and localization from horse videos. We leverage past works on deep learning-based video feature extraction combined with recurrent neural networks for the video classification task, as well as a classic optical flow based approach. We achieve 87.5% classification accuracy of ear movement presence on a public horse video dataset, demonstrating the potential of our approach. We discuss future directions to develop these systems, with the aim of bridging the gap between automated AU detection and practical applications in equine welfare and veterinary diagnostics. Our code will be made publicly available at https://github.com/jmalves5/read-my-ears.

URLs: https://github.com/jmalves5/read-my-ears.

new Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID

Authors: Koray Ulusan, Benjamin Kiefer

Abstract: The personalization of Stable Diffusion for generating professional portraits from amateur photographs is a burgeoning area, with applications in various downstream contexts. This paper investigates the impact of augmentations on improving facial resemblance when using two prominent personalization techniques: DreamBooth and InstantID. Through a series of experiments with diverse subject datasets, we assessed the effectiveness of various augmentation strategies on the generated headshots' fidelity to the original subject. We introduce FaceDistance, a wrapper around FaceNet, to rank the generations based on facial similarity, which aided in our assessment. Ultimately, this research provides insights into the role of augmentations in enhancing facial resemblance in SDXL-generated portraits, informing strategies for their effective deployment in downstream applications.

new Real-Time Person Image Synthesis Using a Flow Matching Model

Authors: Jiwoo Jeong, Kirok Kim, Wooju Kim, Nam-Joon Kim

Abstract: Pose-Guided Person Image Synthesis (PGPIS) generates realistic person images conditioned on a target pose and a source image. This task plays a key role in various real-world applications, such as sign language video generation, AR/VR, gaming, and live streaming. In these scenarios, real-time PGPIS is critical for providing immediate visual feedback and maintaining user immersion.However, achieving real-time performance remains a significant challenge due to the complexity of synthesizing high-fidelity images from diverse and dynamic human poses. Recent diffusion-based methods have shown impressive image quality in PGPIS, but their slow sampling speeds hinder deployment in time-sensitive applications. This latency is particularly problematic in tasks like generating sign language videos during live broadcasts, where rapid image updates are required. Therefore, developing a fast and reliable PGPIS model is a crucial step toward enabling real-time interactive systems. To address this challenge, we propose a generative model based on flow matching (FM). Our approach enables faster, more stable, and more efficient training and sampling. Furthermore, the proposed model supports conditional generation and can operate in latent space, making it especially suitable for real-time PGPIS applications where both speed and quality are critical. We evaluate our proposed method, Real-Time Person Image Synthesis Using a Flow Matching Model (RPFM), on the widely used DeepFashion dataset for PGPIS tasks. Our results show that RPFM achieves near-real-time sampling speeds while maintaining performance comparable to the state-of-the-art models. Our methodology trades off a slight, acceptable decrease in generated-image accuracy for over a twofold increase in generation speed, thereby ensuring real-time performance.

new Uncertainty-Aware Prototype Semantic Decoupling for Text-Based Person Search in Full Images

Authors: Zengli Luo, Canlong Zhang, Xiaochun Lu, Zhixin Li, Zhiwen Wang

Abstract: Text-based pedestrian search (TBPS) in full images aims to locate a target pedestrian in untrimmed images using natural language descriptions. However, in complex scenes with multiple pedestrians, existing methods are limited by uncertainties in detection and matching, leading to degraded performance. To address this, we propose UPD-TBPS, a novel framework comprising three modules: Multi-granularity Uncertainty Estimation (MUE), Prototype-based Uncertainty Decoupling (PUD), and Cross-modal Re-identification (ReID). MUE conducts multi-granularity queries to identify potential targets and assigns confidence scores to reduce early-stage uncertainty. PUD leverages visual context decoupling and prototype mining to extract features of the target pedestrian described in the query. It separates and learns pedestrian prototype representations at both the coarse-grained cluster level and the fine-grained individual level, thereby reducing matching uncertainty. ReID evaluates candidates with varying confidence levels, improving detection and retrieval accuracy. Experiments on CUHK-SYSU-TBPS and PRW-TBPS datasets validate the effectiveness of our framework.

new Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models

Authors: Mishal Fatima, Steffen Jung, Margret Keuper

Abstract: Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we also show that current methods that aim to mitigate harmful spurious features, do not take into account these factors, hence fail to achieve considerable performance gains for worst-group accuracies when the size and location of core features in an image change.

new Supervised and Unsupervised Textile Classification via Near-Infrared Hyperspectral Imaging and Deep Learning

Authors: Maria Kainz, Johannes K. Krondorfer, Malte Jaschik, Maria Jernej, Harald Ganster

Abstract: Recycling textile fibers is critical to reducing the environmental impact of the textile industry. Hyperspectral near-infrared (NIR) imaging combined with advanced deep learning algorithms offers a promising solution for efficient fiber classification and sorting. In this study, we investigate supervised and unsupervised deep learning models and test their generalization capabilities on different textile structures. We show that optimized convolutional neural networks (CNNs) and autoencoder networks achieve robust generalization under varying conditions. These results highlight the potential of hyperspectral imaging and deep learning to advance sustainable textile recycling through accurate and robust classification.

new DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes

Authors: Sergey Linok, Vadim Semenov, Anastasia Trunova, Oleg Bulichev, Dmitry Yudin

Abstract: The analysis of events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly utilize visual models. However, these methods often capture information implicitly from images, lacking interpretable spatial-temporal object representations. To address this issue we introduce DyGEnc - a novel method for Encoding a Dynamic Graph. This method integrates compressed spatial-temporal structural observation representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on a sequence of textual scene graphs. Extended evaluations on the STAR and AGQA datasets indicate that DyGEnc outperforms existing visual methods by a large margin of 15-25% in addressing queries regarding the history of human-to-object interactions. Furthermore, the proposed method can be seamlessly extended to process raw input images utilizing foundational models for extracting explicit textual scene graphs, as substantiated by the results of a robotic experiment conducted with a wheeled manipulator platform. We hope that these findings will contribute to the implementation of robust and compressed graph-based robotic memory for long-horizon reasoning. Code is available at github.com/linukc/DyGEnc.

new Fixed-Length Dense Fingerprint Representation

Authors: Zhiyu Pan, Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou

Abstract: Fixed-length fingerprint representations, which map each fingerprint to a compact and fixed-size feature vector, are computationally efficient and well-suited for large-scale matching. However, designing a robust representation that effectively handles diverse fingerprint modalities, pose variations, and noise interference remains a significant challenge. In this work, we propose a fixed-length dense descriptor of fingerprints, and introduce FLARE-a fingerprint matching framework that integrates the Fixed-Length dense descriptor with pose-based Alignment and Robust Enhancement. This fixed-length representation employs a three-dimensional dense descriptor to effectively capture spatial relationships among fingerprint ridge structures, enabling robust and locally discriminative representations. To ensure consistency within this dense feature space, FLARE incorporates pose-based alignment using complementary estimation methods, along with dual enhancement strategies that refine ridge clarity while preserving the original fingerprint modality. The proposed dense descriptor supports fixed-length representation while maintaining spatial correspondence, enabling fast and accurate similarity computation. Extensive experiments demonstrate that FLARE achieves superior performance across rolled, plain, latent, and contactless fingerprints, significantly outperforming existing methods in cross-modality and low-quality scenarios. Further analysis validates the effectiveness of the dense descriptor design, as well as the impact of alignment and enhancement modules on the accuracy of dense descriptor matching. Experimental results highlight the effectiveness and generalizability of FLARE as a unified and scalable solution for robust fingerprint representation and matching. The implementation and code will be publicly available at https://github.com/Yu-Yy/FLARE.

URLs: https://github.com/Yu-Yy/FLARE.

new From Pixels to Polygons: A Survey of Deep Learning Approaches for Medical Image-to-Mesh Reconstruction

Authors: Fengming Lin, Arezoo Zakeri, Yidan Xue, Michael MacRaild, Haoran Dou, Zherui Zhou, Ziwei Zou, Ali Sarrami-Foroushani, Jinming Duan, Alejandro F. Frangi

Abstract: Deep learning-based medical image-to-mesh reconstruction has rapidly evolved, enabling the transformation of medical imaging data into three-dimensional mesh models that are critical in computational medicine and in silico trials for advancing our understanding of disease mechanisms, and diagnostic and therapeutic techniques in modern medicine. This survey systematically categorizes existing approaches into four main categories: template models, statistical models, generative models, and implicit models. Each category is analysed in detail, examining their methodological foundations, strengths, limitations, and applicability to different anatomical structures and imaging modalities. We provide an extensive evaluation of these methods across various anatomical applications, from cardiac imaging to neurological studies, supported by quantitative comparisons using standard metrics. Additionally, we compile and analyze major public datasets available for medical mesh reconstruction tasks and discuss commonly used evaluation metrics and loss functions. The survey identifies current challenges in the field, including requirements for topological correctness, geometric accuracy, and multi-modality integration. Finally, we present promising future research directions in this domain. This systematic review aims to serve as a comprehensive reference for researchers and practitioners in medical image analysis and computational medicine.

new PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model

Authors: Y. B. Wang, S. Z. Zhou, J. F. Wu, T. Hu, J. N. Zhang, Y. Liu

Abstract: Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference time and issues with generation quality in specific foreground regions and audio-motion consistency. These shortcomings are primarily due to the lack of localized fine-grained supervised guidance. To address above challenges, we propose PAHA, an end-to-end audio-driven upper-body human animation framework with diffusion model. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains diffusion-based regional audio-visual classifiers to improve the consistency of motion and co-speech audio. Afterwards, we design two novel inference guidance methods for the foregoing classifiers, Sequential Guidance (SG) and Differential Guidance (DG), to balance efficiency and quality respectively. Additionally, we build CNAS, the first public Chinese News Anchor Speech dataset, to advance research and validation in this field. Extensive experimental results and user studies demonstrate that PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations. The codes and CNAS dataset will be released upon acceptance.

new Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection

Authors: Fangling Jiang, Qi Li, Bing Liu, Weining Wang, Caifeng Shan, Zhenan Sun, Ming-Hsuan Yang

Abstract: 3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we leverage causal graph theory insights into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.

new Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images

Authors: Fangling Jiang, Qi Li, Weining Wang, Wei Shen, Bing Liu, Zhenan Sun

Abstract: Face anti-spoofing is a critical technology for ensuring the security of face recognition systems. However, its ability to generalize across diverse scenarios remains a significant challenge. In this paper, we attribute the limited generalization ability to two key factors: covariate shift, which arises from external data collection variations, and semantic shift, which results from substantial differences in emerging attack types. To address both challenges, we propose a novel approach for learning unknown spoof prompts, relying solely on real face images from a single source domain. Our method generates textual prompts for real faces and potential unknown spoof attacks by leveraging the general knowledge embedded in vision-language models, thereby enhancing the model's ability to generalize to unseen target domains. Specifically, we introduce a diverse spoof prompt optimization framework to learn effective prompts. This framework constrains unknown spoof prompts within a relaxed prior knowledge space while maximizing their distance from real face images. Moreover, it enforces semantic independence among different spoof prompts to capture a broad range of spoof patterns. Experimental results on nine datasets demonstrate that the learned prompts effectively transfer the knowledge of vision-language models, enabling state-of-the-art generalization ability against diverse unknown attack types across unseen target domains without using any spoof face images.

new PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing

Authors: Yiping Xie, Bo Zhao, Mingtong Dai, Jian-Ping Zhou, Yue Sun, Tao Tan, Weicheng Xie, Linlin Shen, Zitong Yu

Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution but struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. Besides, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving signal instability through adaptive time-frequency domain feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental contextual answering, and task description, leveraging cross-modal learning to integrate both visual and textual information, enabling dynamic adaptation to challenging scenarios like variable illumination and subject movements. Evaluation on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.

new Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map

Authors: Alessandro Simoni, Francesco Pelosin

Abstract: Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models. The code is publicly available at https://github.com/covisionlab/diffusion_labeling.

URLs: https://github.com/covisionlab/diffusion_labeling.

new Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision

Authors: Linhan Cao, Wei Sun, Kaiwei Zhang, Yicong Peng, Guangtao Zhai, Xiongkuo Min

Abstract: Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, the reliance on manually annotated datasets -- a process that is labor-intensive, costly, and difficult to scale up -- has hindered further optimization of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a \textbf{learning-to-rank} paradigm to train a large multimodal model (LMM) on video pairs automatically labeled via two manners, including quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic distortion simulations. Furthermore, we introduce a novel \textbf{iterative self-improvement training strategy}, where the trained model acts an improved annotator to iteratively refine the annotation quality of training data. By training on a dataset $10\times$ larger than the existing VQA benchmarks, our model: (1) achieves zero-shot performance on in-domain VQA benchmarks that matches or surpasses supervised models; (2) demonstrates superior out-of-distribution (OOD) generalization across diverse video content and distortions; and (3) sets a new state-of-the-art when fine-tuned on human-labeled datasets. Extensive experimental results validate the effectiveness of our self-supervised approach in training generalized VQA models. The datasets and code will be publicly released to facilitate future research.

new Towards Smart Point-and-Shoot Photography

Authors: Jiawan Li, Fei Zhou, Zhipeng Zhong, Jiongzhi Lin, Guoping Qiu

Abstract: Hundreds of millions of people routinely take photos using their smartphones as point and shoot (PAS) cameras, yet very few would have the photography skills to compose a good shot of a scene. While traditional PAS cameras have built-in functions to ensure a photo is well focused and has the right brightness, they cannot tell the users how to compose the best shot of a scene. In this paper, we present a first of its kind smart point and shoot (SPAS) system to help users to take good photos. Our SPAS proposes to help users to compose a good shot of a scene by automatically guiding the users to adjust the camera pose live on the scene. We first constructed a large dataset containing 320K images with camera pose information from 4000 scenes. We then developed an innovative CLIP-based Composition Quality Assessment (CCQA) model to assign pseudo labels to these images. The CCQA introduces a unique learnable text embedding technique to learn continuous word embeddings capable of discerning subtle visual quality differences in the range covered by five levels of quality description words {bad, poor, fair, good, perfect}. And finally we have developed a camera pose adjustment model (CPAM) which first determines if the current view can be further improved and if so it outputs the adjust suggestion in the form of two camera pose adjustment angles. The two tasks of CPAM make decisions in a sequential manner and each involves different sets of training samples, we have developed a mixture-of-experts model with a gated loss function to train the CPAM in an end-to-end manner. We will present extensive results to demonstrate the performances of our SPAS system using publicly available image composition datasets.

new ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant

Authors: Yifan Xiang, Zhenxi Zhang, Bin Li, Yixuan Weng, Shoujun Zhou, Yangfan He, Keqin Li

Abstract: Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. To this end, existing methods may face three main limitations: Their training data lacks multi-object sets in which relations among objects are learnable. Building on the limited training data, their models overlook the relations between different personalized concepts and fail to reason over them. Their experiments mainly focus on a single personalized concept, where evaluations are limited to recognition and captioning tasks. To address the limitations, we present a new dataset named ReGraP, consisting of 120 sets of personalized knowledge. Each set includes images, KGs, and CoT QA pairs derived from the KGs, enabling more structured and sophisticated reasoning pathways. We propose ReGraP-LLaVA, an MLLM trained with the corresponding KGs and CoT QA pairs, where soft and hard graph prompting methods are designed to align KGs within the model's semantic space. We establish the ReGraP Benchmark, which contains diverse task types: multiple-choice, fill-in-the-blank, True/False, and descriptive questions in both open- and closed-ended settings. The proposed benchmark is designed to evaluate the relational reasoning and knowledge-connection capability of personalized MLLMs. We conduct experiments on the proposed ReGraP-LLaVA and other competitive MLLMs. Results show that the proposed model not only learns personalized knowledge but also performs relational reasoning in responses, achieving the SoTA performance compared with the competitive methods. All the codes and datasets are released at: https://github.com/xyfyyds/ReGraP.

URLs: https://github.com/xyfyyds/ReGraP.

new Revolutionizing Brain Tumor Imaging: Generating Synthetic 3D FA Maps from T1-Weighted MRI using CycleGAN Models

Authors: Xin Du, Francesca M. Cozzi, Rajesh Jena

Abstract: Fractional anisotropy (FA) and directionally encoded colour (DEC) maps are essential for evaluating white matter integrity and structural connectivity in neuroimaging. However, the spatial misalignment between FA maps and tractography atlases hinders their effective integration into predictive models. To address this issue, we propose a CycleGAN based approach for generating FA maps directly from T1-weighted MRI scans, representing the first application of this technique to both healthy and tumour-affected tissues. Our model, trained on unpaired data, produces high fidelity maps, which have been rigorously evaluated using Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), demonstrating particularly robust performance in tumour regions. Radiological assessments further underscore the model's potential to enhance clinical workflows by providing an AI-driven alternative that reduces the necessity for additional scans.

new Distribution-Conditional Generation: From Class Distribution to Creative Generation

Authors: Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng

Abstract: Text-to-image (T2I) diffusion models are effective at producing semantically aligned images, but their reliance on training data distributions limits their ability to synthesize truly novel, out-of-distribution concepts. Existing methods typically enhance creativity by combining pairs of known concepts, yielding compositions that, while out-of-distribution, remain linguistically describable and bounded within the existing semantic space. Inspired by the soft probabilistic outputs of classifiers on ambiguous inputs, we propose Distribution-Conditional Generation, a novel formulation that models creativity as image synthesis conditioned on class distributions, enabling semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder-decoder framework that maps class distributions into a latent space and decodes them into tokens of creative concept. DisTok maintains a dynamic concept pool and iteratively sampling and fusing concept pairs, enabling the generation of tokens aligned with increasingly complex class distributions. To enforce distributional consistency, latent vectors sampled from a Gaussian prior are decoded into tokens and rendered into images, whose class distributions-predicted by a vision-language model-supervise the alignment between input distributions and the visual semantics of generated tokens. The resulting tokens are added to the concept pool for subsequent composition. Extensive experiments demonstrate that DisTok, by unifying distribution-conditioned fusion and sampling-based synthesis, enables efficient and flexible token-level generation, achieving state-of-the-art performance with superior text-image alignment and human preference scores.

new CaRaFFusion: Improving 2D Semantic Segmentation with Camera-Radar Point Cloud Fusion and Zero-Shot Image Inpainting

Authors: Huawei Sun, Bora Kunter Sahin, Georg Stettinger, Maximilian Bernhard, Matthias Schubert, Robert Wille

Abstract: Segmenting objects in an environment is a crucial task for autonomous driving and robotics, as it enables a better understanding of the surroundings of each agent. Although camera sensors provide rich visual details, they are vulnerable to adverse weather conditions. In contrast, radar sensors remain robust under such conditions, but often produce sparse and noisy data. Therefore, a promising approach is to fuse information from both sensors. In this work, we propose a novel framework to enhance camera-only baselines by integrating a diffusion model into a camera-radar fusion architecture. We leverage radar point features to create pseudo-masks using the Segment-Anything model, treating the projected radar points as point prompts. Additionally, we propose a noise reduction unit to denoise these pseudo-masks, which are further used to generate inpainted images that complete the missing information in the original images. Our method improves the camera-only segmentation baseline by 2.63% in mIoU and enhances our camera-radar fusion architecture by 1.48% in mIoU on the Waterscenes dataset. This demonstrates the effectiveness of our approach for semantic segmentation using camera-radar fusion under adverse weather conditions.

new Matching Distance and Geometric Distribution Aided Learning Multiview Point Cloud Registration

Authors: Shiqi Li, Jihua Zhu, Yifan Xie, Naiwen Hu, Di Wang

Abstract: Multiview point cloud registration plays a crucial role in robotics, automation, and computer vision fields. This paper concentrates on pose graph construction and motion synchronization within multiview registration. Previous methods for pose graph construction often pruned fully connected graphs or constructed sparse graph using global feature aggregated from local descriptors, which may not consistently yield reliable results. To identify dependable pairs for pose graph construction, we design a network model that extracts information from the matching distance between point cloud pairs. For motion synchronization, we propose another neural network model to calculate the absolute pose in a data-driven manner, rather than optimizing inaccurate handcrafted loss functions. Our model takes into account geometric distribution information and employs a modified attention mechanism to facilitate flexible and reliable feature interaction. Experimental results on diverse indoor and outdoor datasets confirm the effectiveness and generalizability of our approach. The source code is available at https://github.com/Shi-Qi-Li/MDGD.

URLs: https://github.com/Shi-Qi-Li/MDGD.

new Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

Authors: Fran\c{c}ois Role, S\'ebastien Meyer, Victor Amblard

Abstract: Vision-language models (VLMs) allow to embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon meaning there exists a clear separation between the embeddings from one modality and another in the embedding space. While this misalignment is detrimental for downstream tasks such as multimodal retrieval, multimodal clustering or zero-shot classification, etc. no generic and practical methods have so far been proposed to assess it precisely and even reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and beneficial effects on downstream tasks. Our code is available at the URL provided in the paper's abstract.

new DISARM++: Beyond scanner-free harmonization

Authors: Luca Caldera, Lara Cavinato, Alessio Cirone, Isabella Cama, Sara Garbarino, Raffaele Lodi, Fabrizio Tagliavini, Anna Nigri, Silvia De Francesco, Andrea Cappozzo, Michele Piana, Francesca Ieva

Abstract: Harmonization of T1-weighted MR images across different scanners is crucial for ensuring consistency in neuroimaging studies. This study introduces a novel approach to direct image harmonization, moving beyond feature standardization to ensure that extracted features remain inherently reliable for downstream analysis. Our method enables image transfer in two ways: (1) mapping images to a scanner-free space for uniform appearance across all scanners, and (2) transforming images into the domain of a specific scanner used in model training, embedding its unique characteristics. Our approach presents strong generalization capability, even for unseen scanners not included in the training phase. We validated our method using MR images from diverse cohorts, including healthy controls, traveling subjects, and individuals with Alzheimer's disease (AD). The model's effectiveness is tested in multiple applications, such as brain age prediction (R2 = 0.60 \pm 0.05), biomarker extraction, AD classification (Test Accuracy = 0.86 \pm 0.03), and diagnosis prediction (AUC = 0.95). In all cases, our harmonization technique outperforms state-of-the-art methods, showing improvements in both reliability and predictive accuracy. Moreover, our approach eliminates the need for extensive preprocessing steps, such as skull-stripping, which can introduce errors by misclassifying brain and non-brain structures. This makes our method particularly suitable for applications that require full-head analysis, including research on head trauma and cranial deformities. Additionally, our harmonization model does not require retraining for new datasets, allowing smooth integration into various neuroimaging workflows. By ensuring scanner-invariant image quality, our approach provides a robust and efficient solution for improving neuroimaging studies across diverse settings. The code is available at this link.

new FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

Authors: Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang

Abstract: Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/

URLs: https://shiyi-zh0408.github.io/projectpages/FlexiAct/

new Multi-Agent System for Comprehensive Soccer Understanding

Authors: Jiayuan Rao, Zifeng Li, Haoning Wu, Ya Zhang, Yanfeng Wang, Weidi Xie

Abstract: Recent advancements in AI-driven soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Specifically, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge-driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K standardized multimodal (text, image, video) multi-choice QA pairs across 13 distinct understanding tasks, curated through automated pipelines and manual verification; (iii) we introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) extensive evaluations and ablations that benchmark state-of-the-art MLLMs on SoccerBench, highlighting the superiority of our proposed agentic system. All data and code are publicly available at: https://jyrao.github.io/SoccerAgent/.

URLs: https://jyrao.github.io/SoccerAgent/.

cross Physical foundations for trustworthy medical imaging: a review for artificial intelligence researchers

Authors: Miriam Cobo, David Corral Fontecha, Wilson Silva, Lara Lloret Iglesias

Abstract: Artificial intelligence in medical imaging has seen unprecedented growth in the last years, due to rapid advances in deep learning and computing resources. Applications cover the full range of existing medical imaging modalities, with unique characteristics driven by the physics of each technique. Yet, artificial intelligence professionals entering the field, and even experienced developers, often lack a comprehensive understanding of the physical principles underlying medical image acquisition, which hinders their ability to fully leverage its potential. The integration of physics knowledge into artificial intelligence algorithms enhances their trustworthiness and robustness in medical imaging, especially in scenarios with limited data availability. In this work, we review the fundamentals of physics in medical images and their impact on the latest advances in artificial intelligence, particularly, in generative models and reconstruction algorithms. Finally, we explore the integration of physics knowledge into physics-inspired machine learning models, which leverage physics-based constraints to enhance the learning of medical imaging features.

cross Floating Car Observers in Intelligent Transportation Systems: Detection Modeling and Temporal Insights

Authors: Jeremias Gerner, Klaus Bogenberger, Stefanie Schmidtner

Abstract: Floating Car Observers (FCOs) extend traditional Floating Car Data (FCD) by integrating onboard sensors to detect and localize other traffic participants, providing richer and more detailed traffic data. In this work, we explore various modeling approaches for FCO detections within microscopic traffic simulations to evaluate their potential for Intelligent Transportation System (ITS) applications. These approaches range from 2D raytracing to high-fidelity co-simulations that emulate real-world sensors and integrate 3D object detection algorithms to closely replicate FCO detections. Additionally, we introduce a neural network-based emulation technique that effectively approximates the results of high-fidelity co-simulations. This approach captures the unique characteristics of FCO detections while offering a fast and scalable solution for modeling. Using this emulation method, we investigate the impact of FCO data in a digital twin of a traffic network modeled in SUMO. Results demonstrate that even at a 20% penetration rate, FCOs using LiDAR-based detections can identify 65% of vehicles across various intersections and traffic demand scenarios. Further potential emerges when temporal insights are integrated, enabling the recovery of previously detected but currently unseen vehicles. By employing data-driven methods, we recover over 80% of these vehicles with minimal positional deviations. These findings underscore the potential of FCOs for ITS, particularly in enhancing traffic state estimation and monitoring under varying penetration rates and traffic conditions.

cross A Wireless Collaborated Inference Acceleration Framework for Plant Disease Recognition

Authors: Hele Zhu, Xinyi Huang, Haojia Gao, Mengfei Jiang, Haohua Que, Lei Mu

Abstract: Plant disease is a critical factor affecting agricultural production. Traditional manual recognition methods face significant drawbacks, including low accuracy, high costs, and inefficiency. Deep learning techniques have demonstrated significant benefits in identifying plant diseases, but they still face challenges such as inference delays and high energy consumption. Deep learning algorithms are difficult to run on resource-limited embedded devices. Offloading these models to cloud servers is confronted with the restriction of communication bandwidth, and all of these factors will influence the inference's efficiency. We propose a collaborative inference framework for recognizing plant diseases between edge devices and cloud servers to enhance inference speed. The DNN model for plant disease recognition is pruned through deep reinforcement learning to improve the inference speed and reduce energy consumption. Then the optimal split point is determined by a greedy strategy to achieve the best collaborated inference acceleration. Finally, the system for collaborative inference acceleration in plant disease recognition has been implemented using Gradio to facilitate friendly human-machine interaction. Experiments indicate that the proposed collaborative inference framework significantly increases inference speed while maintaining acceptable recognition accuracy, offering a novel solution for rapidly diagnosing and preventing plant diseases.

cross Dual Prompting for Diverse Count-level PET Denoising

Authors: Xiaofeng Liu, Yongsong Huang, Thibault Marin, Samira Vafay Eslahi, Tiss Amal, Yanis Chemli, Keith Johnson, Georges El Fakhri, Jinsong Ouyang

Abstract: The to-be-denoised positron emission tomography (PET) volumes are inherent with diverse count levels, which imposes challenges for a unified model to tackle varied cases. In this work, we resort to the recently flourished prompt learning to achieve generalizable PET denoising with different count levels. Specifically, we propose dual prompts to guide the PET denoising in a divide-and-conquer manner, i.e., an explicitly count-level prompt to provide the specific prior information and an implicitly general denoising prompt to encode the essential PET denoising knowledge. Then, a novel prompt fusion module is developed to unify the heterogeneous prompts, followed by a prompt-feature interaction module to inject prompts into the features. The prompts are able to dynamically guide the noise-conditioned denoising process. Therefore, we are able to efficiently train a unified denoising model for various count levels, and deploy it to different cases with personalized prompts. We evaluated on 1940 low-count PET 3D volumes with uniformly randomly selected 13-22\% fractions of events from 97 $^{18}$F-MK6240 tau PET studies. It shows our dual prompting can largely improve the performance with informed count-level and outperform the count-conditional model.

cross Sim2Real Transfer for Vision-Based Grasp Verification

Authors: Pau Amargant, Peter H\"onig, Markus Vincze

Abstract: The verification of successful grasps is a crucial aspect of robot manipulation, particularly when handling deformable objects. Traditional methods relying on force and tactile sensors often struggle with deformable and non-rigid objects. In this work, we present a vision-based approach for grasp verification to determine whether the robotic gripper has successfully grasped an object. Our method employs a two-stage architecture; first YOLO-based object detection model to detect and locate the robot's gripper and then a ResNet-based classifier determines the presence of an object. To address the limitations of real-world data capture, we introduce HSR-GraspSynth, a synthetic dataset designed to simulate diverse grasping scenarios. Furthermore, we explore the use of Visual Question Answering capabilities as a zero-shot baseline to which we compare our model. Experimental results demonstrate that our approach achieves high accuracy in real-world environments, with potential for integration into grasping pipelines. Code and datasets are publicly available at https://github.com/pauamargant/HSR-GraspSynth .

URLs: https://github.com/pauamargant/HSR-GraspSynth

cross STG: Spatiotemporal Graph Neural Network with Fusion and Spatiotemporal Decoupling Learning for Prognostic Prediction of Colorectal Cancer Liver Metastasis

Authors: Yiran Zhu, Wei Yang, Yan su, Zesheng Li, Chengchang Pan, Honggang Qi

Abstract: We propose a multimodal spatiotemporal graph neural network (STG) framework to predict colorectal cancer liver metastasis (CRLM) progression. Current clinical models do not effectively integrate the tumor's spatial heterogeneity, dynamic evolution, and complex multimodal data relationships, limiting their predictive accuracy. Our STG framework combines preoperative CT imaging and clinical data into a heterogeneous graph structure, enabling joint modeling of tumor distribution and temporal evolution through spatial topology and cross-modal edges. The framework uses GraphSAGE to aggregate spatiotemporal neighborhood information and leverages supervised and contrastive learning strategies to enhance the model's ability to capture temporal features and improve robustness. A lightweight version of the model reduces parameter count by 78.55%, maintaining near-state-of-the-art performance. The model jointly optimizes recurrence risk regression and survival analysis tasks, with contrastive loss improving feature representational discriminability and cross-modal consistency. Experimental results on the MSKCC CRLM dataset show a time-adjacent accuracy of 85% and a mean absolute error of 1.1005, significantly outperforming existing methods. The innovative heterogeneous graph construction and spatiotemporal decoupling mechanism effectively uncover the associations between dynamic tumor microenvironment changes and prognosis, providing reliable quantitative support for personalized treatment decisions.

cross Automated Data Curation Using GPS & NLP to Generate Instruction-Action Pairs for Autonomous Vehicle Vision-Language Navigation Datasets

Authors: Guillermo Roque, Erika Maquiling, Jose Giovanni Tapia Lopez, Ross Greer

Abstract: Instruction-Action (IA) data pairs are valuable for training robotic systems, especially autonomous vehicles (AVs), but having humans manually annotate this data is costly and time-inefficient. This paper explores the potential of using mobile application Global Positioning System (GPS) references and Natural Language Processing (NLP) to automatically generate large volumes of IA commands and responses without having a human generate or retroactively tag the data. In our pilot data collection, by driving to various destinations and collecting voice instructions from GPS applications, we demonstrate a means to collect and categorize the diverse sets of instructions, further accompanied by video data to form complete vision-language-action triads. We provide details on our completely automated data collection prototype system, ADVLAT-Engine. We characterize collected GPS voice instructions into eight different classifications, highlighting the breadth of commands and referentialities available for curation from freely available mobile applications. Through research and exploration into the automation of IA data pairs using GPS references, the potential to increase the speed and volume at which high-quality IA datasets are created, while minimizing cost, can pave the way for robust vision-language-action (VLA) models to serve tasks in vision-language navigation (VLN) and human-interactive autonomous systems.

cross CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

Authors: Detao Bai, Zhiheng Ma, Xihan Wei, Liefeng Bo

Abstract: The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony, contrastive feature alignment and generative text prediction, using only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When utilized for Audio-Visual Speech Recognition (AVSR) on LRS2, these representations contribute to achieving a state-of-the-art Word Error Rate (WER) of 1.27. They also enable strong performance in Visual Speech Recognition (VSR) with a WER of 22.0 on LRS2, and significantly improve performance in noisy environments by over 70%. Furthermore, CoGenAV representations benefit speech reconstruction tasks, boosting performance in Speech Enhancement and Separation, and achieve competitive results in audio-visual synchronization tasks like Active Speaker Detection (ASD). Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.

cross Mitigating Image Captioning Hallucinations in Vision-Language Models

Authors: Fei Zhao, Chengcui Zhang, Runlin Zhang, Tianyang Wang, Xi Li

Abstract: Hallucinations in vision-language models (VLMs) hinder reliability and real-world applicability, usually stemming from distribution shifts between pretraining data and test samples. Existing solutions, such as retraining or fine-tuning on additional data, demand significant computational resources and labor-intensive data collection, while ensemble-based methods incur additional costs by introducing auxiliary VLMs. To address these challenges, we propose a novel test-time adaptation framework using reinforcement learning to mitigate hallucinations during inference without retraining or any auxiliary VLMs. By updating only the learnable parameters in the layer normalization of the language model (approximately 0.003% of the model parameters), our method reduces distribution shifts between test samples and pretraining samples. A CLIP-based hallucination evaluation model is proposed to provide dual rewards to VLMs. Experimental results demonstrate a 15.4% and 17.3% reduction in hallucination rates on LLaVA and InstructBLIP, respectively. Our approach outperforms state-of-the-art baselines with a 68.3% improvement in hallucination mitigation, demonstrating its effectiveness.

cross From Neurons to Computation: Biological Reservoir Computing for Pattern Recognition

Authors: Ludovico Iannello, Luca Ciampi, Gabriele Lagani, Fabrizio Tonelli, Eleonora Crocco, Lucio Maria Calcagnile, Angelo Di Garbo, Federico Cremisi, Giuseppe Amato

Abstract: In this paper, we introduce a novel paradigm for reservoir computing (RC) that leverages a pool of cultured biological neurons as the reservoir substrate, creating a biological reservoir computing (BRC). This system operates similarly to an echo state network (ESN), with the key distinction that the neural activity is generated by a network of cultured neurons, rather than being modeled by traditional artificial computational units. The neuronal activity is recorded using a multi-electrode array (MEA), which enables high-throughput recording of neural signals. In our approach, inputs are introduced into the network through a subset of the MEA electrodes, while the remaining electrodes capture the resulting neural activity. This generates a nonlinear mapping of the input data to a high-dimensional biological feature space, where distinguishing between data becomes more efficient and straightforward, allowing a simple linear classifier to perform pattern recognition tasks effectively. To evaluate the performance of our proposed system, we present an experimental study that includes various input patterns, such as positional codes, bars with different orientations, and a digit recognition task. The results demonstrate the feasibility of using biological neural networks to perform tasks traditionally handled by artificial neural networks, paving the way for further exploration of biologically-inspired computing systems, with potential applications in neuromorphic engineering and bio-hybrid computing.

cross ALMA: Aggregated Lipschitz Maximization Attack on Auto-encoders

Authors: Chethan Krishnamurthy Ramanaik, Arjun Roy, Eirini Ntoutsi

Abstract: Despite the extensive use of deep autoencoders (AEs) in critical applications, their adversarial robustness remains relatively underexplored compared to classification models. AE robustness is characterized by the Lipschitz bounds of its components. Existing robustness evaluation frameworks based on white-box attacks do not fully exploit the vulnerabilities of intermediate ill-conditioned layers in AEs. In the context of optimizing imperceptible norm-bounded additive perturbations to maximize output damage, existing methods struggle to effectively propagate adversarial loss gradients throughout the network, often converging to less effective perturbations. To address this, we propose a novel layer-conditioning-based adversarial optimization objective that effectively guides the adversarial map toward regions of local Lipschitz bounds by enhancing loss gradient information propagation during attack optimization. We demonstrate through extensive experiments on state-of-the-art AEs that our adversarial objective results in stronger attacks, outperforming existing methods in both universal and sample-specific scenarios. As a defense method against this attack, we introduce an inference-time adversarially trained defense plugin that mitigates the effects of adversarial examples.

cross Self-Supervised Learning for Robotic Leaf Manipulation: A Hybrid Geometric-Neural Approach

Authors: Srecharan Selvam, Abhishesh Silwal, George Kanter

Abstract: Automating leaf manipulation in agricultural settings faces significant challenges, including the variability of plant morphologies and deformable leaves. We propose a novel hybrid geometric-neural approach for autonomous leaf grasping that combines traditional computer vision with neural networks through self-supervised learning. Our method integrates YOLOv8 for instance segmentation and RAFT-Stereo for 3D depth estimation to build rich leaf representations, which feed into both a geometric feature scoring pipeline and a neural refinement module (GraspPointCNN). The key innovation is our confidence-weighted fusion mechanism that dynamically balances the contribution of each approach based on prediction certainty. Our self-supervised framework uses the geometric pipeline as an expert teacher to automatically generate training data. Experiments demonstrate that our approach achieves an 88.0% success rate in controlled environments and 84.7% in real greenhouse conditions, significantly outperforming both purely geometric (75.3%) and neural (60.2%) methods. This work establishes a new paradigm for agricultural robotics where domain expertise is seamlessly integrated with machine learning capabilities, providing a foundation for fully automated crop monitoring systems.

cross Visual Imitation Enables Contextual Humanoid Control

Authors: Arthur Allshire, Hongsuk Choi, Junyi Zhang, David McAllister, Anthony Zhang, Chung Min Kim, Trevor Darrell, Pieter Abbeel, Jitendra Malik, Angjoo Kanazawa

Abstract: How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them-casually capture a human motion video and feed it to humanoids. We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills-all from a single policy, conditioned on the environment and global root commands. VIDEOMIMIC offers a scalable path towards teaching humanoids to operate in diverse real-world environments.

replace VecFontSDF: Learning to Reconstruct and Synthesize High-quality Vector Fonts via Signed Distance Functions

Authors: Zeqing Xia, Bojun Xiong, Zhouhui Lian

Abstract: Font design is of vital importance in the digital content design and modern printing industry. Developing algorithms capable of automatically synthesizing vector fonts can significantly facilitate the font design process. However, existing methods mainly concentrate on raster image generation, and only a few approaches can directly synthesize vector fonts. This paper proposes an end-to-end trainable method, VecFontSDF, to reconstruct and synthesize high-quality vector fonts using signed distance functions (SDFs). Specifically, based on the proposed SDF-based implicit shape representation, VecFontSDF learns to model each glyph as shape primitives enclosed by several parabolic curves, which can be precisely converted to quadratic B\'ezier curves that are widely used in vector font products. In this manner, most image generation methods can be easily extended to synthesize vector fonts. Qualitative and quantitative experiments conducted on a publicly-available dataset demonstrate that our method obtains high-quality results on several tasks, including vector font reconstruction, interpolation, and few-shot vector font synthesis, markedly outperforming the state of the art. Our code and trained models are available at https://xiazeqing.github.io/VecFontSDF.

URLs: https://xiazeqing.github.io/VecFontSDF.

replace Infrared Image Deturbulence Restoration Using Degradation Parameter-Assisted Wide & Deep Learning

Authors: Yi Lu, Yadong Wang, Xingbo Jiang, Xiangzhi Bai

Abstract: Infrared images captured under turbulent conditions are degraded by complex geometric distortions and blur. We address infrared deturbulence as an image restoration task, proposing DparNet, a parameter-assisted multi-frame network with a wide & deep architecture. DparNet learns a degradation prior (key parameter matrix) directly from degraded images without external knowledge. Its wide & deep architecture uses these learned parameters to directly modulate restoration, achieving spatially and intensity adaptive results. Evaluated on dedicated infrared deturbulence (49,744 images) and visible image denoising (109,536 images) datasets, DparNet significantly outperforms State-of-the-Art (SOTA) methods in restoration performance and efficiency. Notably, leveraging these parameters improves PSNR by 0.6-1.1 dB with less than 2% increase in model parameters and computational complexity. Our work demonstrates that degraded images hide key degradation information that can be learned and utilized to boost adaptive image restoration.

replace Semi-supervised Underwater Image Enhancement Using A Physics-Aware Triple-Stream Network

Authors: Hao Qi, Shixuan Xu, Xinghui Dong

Abstract: Underwater images normally suffer from degradation due to the transmission medium of water bodies. Both traditional prior-based approaches and deep learning-based methods have been used to address this problem. However, the inflexible assumption of the former often impairs their effectiveness in handling diverse underwater scenes, while the generalization of the latter to unseen images is usually weakened by insufficient data. In this study, we leverage both the physics-based Image Formation Model (IFM) and deep learning techniques for Underwater Image Enhancement (UIE). To this end, we propose a novel Physics-Aware Triple-Stream Underwater Image Enhancement Network, i.e., PATS-UIENet, which comprises a Direct Signal Transmission Estimation Steam (D-Stream), a Backscatter Signal Transmission Estimation Steam (B-Stream) and an Ambient Light Estimation Stream (A-Stream). This network fulfills the UIE task by explicitly estimating the degradation parameters of a revised IFM. We also adopt an IFM-inspired semi-supervised learning framework, which exploits both the labeled and unlabeled images, to address the issue of insufficient data. To our knowledge, such a physics-aware deep network and the IFM-inspired semi-supervised learning framework have not been used for the UIE task before. Our method performs better than, or at least comparably to, sixteen baselines across six testing sets in the degradation estimation and UIE tasks. These promising results should be due to the fact that the proposed method can not only model the degradation but also learn the characteristics of diverse underwater scenes.

replace Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning

Authors: Qing Zhu, Qirong Mao, Jialin Zhang, Xiaohua Huang, Wenming Zheng

Abstract: Group-level emotion recognition (GER) is an inseparable part of human behavior analysis, aiming to recognize an overall emotion in a multi-person scene. However, the existing methods are devoted to combing diverse emotion cues while ignoring the inherent uncertainties under unconstrained environments, such as congestion and occlusion occurring within a group. Additionally, since only group-level labels are available, inconsistent emotion predictions among individuals in one group can confuse the network. In this paper, we propose an uncertainty-aware learning (UAL) method to extract more robust representations for GER. By explicitly modeling the uncertainty of each individual, we utilize stochastic embedding drawn from a Gaussian distribution instead of deterministic point embedding. This representation captures the probabilities of different emotions and generates diverse predictions through this stochasticity during the inference stage. Furthermore, uncertainty-sensitive scores are adaptively assigned as the fusion weights of individuals' face within each group. Moreover, we develop an image enhancement module to enhance the model's robustness against severe noise. The overall three-branch model, encompassing face, object, and scene component, is guided by a proportional-weighted fusion strategy and integrates the proposed uncertainty-aware method to produce the final group-level output. Experimental results demonstrate the effectiveness and generalization ability of our method across three widely used databases.

replace Aligning Data Selection with Performance: Performance-driven Reinforcement Learning for Active Learning in Object Detection

Authors: Zhixuan Liang, Xingyu Zeng, Rui Zhao, Ping Luo

Abstract: Active learning strategies aim to train high-performance models with minimal labeled data by selecting the most informative instances for labeling. However, existing methods for assessing data informativeness often fail to align directly with task model performance metrics, such as mean average precision (mAP) in object detection. This paper introduces Mean-AP Guided Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness for deep detection networks, directly optimizing the sampling strategy using mAP. MGRAL employs a reinforcement learning agent based on LSTM architecture to efficiently navigate the combinatorial challenge of batch sample selection and the non-differentiable nature between performance and selected batches. The agent optimizes selection using policy gradient with mAP improvement as the reward signal. To address the computational intensity of mAP estimation with unlabeled samples, we implement fast look-up tables, ensuring real-world feasibility. We evaluate MGRAL on PASCAL VOC and MS COCO benchmarks across various backbone architectures. Our approach demonstrates strong performance, establishing a new paradigm in reinforcement learning-based active learning for object detection.

replace Paragraph-to-Image Generation with Information-Enriched Diffusion Model

Authors: Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang

Abstract: Text-to-image (T2I) models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is using a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LORA to alignthe text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data, and a large-scale synthetic dataset with long text descriptions being generated using a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45% human voting rate improvements for visual appeal and text faithfulness, respectively. The code and dataset will be released to foster community research on long-text alignment.

replace FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

Authors: Liqiang Jing, Xinya Du

Abstract: Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between text and image modalities which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) General feedback can not indicate the hallucination type contained in the response; (2) Sparse rewards only give the sequence-level reward for the whole response; and (3)Annotation cost is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through Fine-Grained Artificial Intelligence Feedback (FGAIF), which mainly consists of three steps: AI-based Feedback Collection, Fine-grained Reward Model Training, and Reinforcement Learning with Fine-grained Reward. Specifically, We first utilize AI tools to predict the types of hallucination for each segment in the response and obtain a collection of fine-grained feedback. Then, based on the collected reward data, three specialized reward models are trained to produce dense rewards. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm. Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with the RL-based aligning method, our proposed method is effective even with fewer parameters.

replace 3D-HGS: 3D Half-Gaussian Splatting

Authors: Haolin Li, Jinyang Liu, Mario Sznaier, Octavia Camps

Abstract: Photo-realistic image rendering from 3D scene reconstruction has advanced significantly with neural rendering techniques. Among these, 3D Gaussian Splatting (3D-GS) outperforms Neural Radiance Fields (NeRFs) in quality and speed but struggles with shape and color discontinuities. We propose 3D Half-Gaussian (3D-HGS) kernels as a plug-and-play solution to address these limitations. Our experiments show that 3D-HGS enhances existing 3D-GS methods, achieving state-of-the-art rendering quality without compromising speed.

replace Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

Authors: Sergey Linok, Tatiana Zemskova, Svetlana Ladanova, Roman Titkov, Dmitry Yudin, Maxim Monastyrny, Aleksei Valenkov

Abstract: Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs 3D scene graph representation with metric and semantic spatial edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe them as graph nodes. On the Replica and ScanNet datasets, we have demonstrated that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On challenging Sr3D+, Nr3D and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement, enabling objects grounding by complex queries compared to other state-of-the-art methods. The combination of our design choices and software implementation has resulted in significant data processing speed in experiments on the robot on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. We made the code publicly available at https://linukc.github.io/BeyondBareQueries/.

URLs: https://linukc.github.io/BeyondBareQueries/.

replace SMORE: Simultaneous Map and Object REconstruction

Authors: Nathaniel Chodosh, Anish Madan, Simon Lucey, Deva Ramanan

Abstract: We present a method for dynamic surface reconstruction of large-scale urban scenes from LiDAR. Depth-based reconstructions tend to focus on small-scale objects or large-scale SLAM reconstructions that treat moving objects as outliers. We take a holistic perspective and optimize a compositional model of a dynamic scene that decomposes the world into rigidly-moving objects and the background. To achieve this, we take inspiration from recent novel view synthesis methods and frame the reconstruction problem as a global optimization over neural surfaces, ego poses, and object poses, which minimizes the error between composed spacetime surfaces and input LiDAR scans. In contrast to view synthesis methods, which typically minimize 2D errors with gradient descent, we minimize a 3D point-to-surface error by coordinate descent, which we decompose into registration and surface reconstruction steps. Each step can be handled well by off-the-shelf methods without any re-training. We analyze the surface reconstruction step for rolling-shutter LiDARs, and show that deskewing operations common in continuous time SLAM can be applied to dynamic objects as well, improving results over prior art by an order of magnitude. Beyond pursuing dynamic reconstruction as a goal in and of itself, we propose that such a system can be used to auto-label partially annotated sequences and produce ground truth annotation for hard-to-label problems such as depth completion and scene flow. Please see https://anishmadan23.github.io/smore/ for more visual results.

URLs: https://anishmadan23.github.io/smore/

replace Tailored Design of Audio-Visual Speech Recognition Models using Branchformers

Authors: David Gimeno-G\'omez, Carlos-D. Mart\'inez-Hinarejos

Abstract: Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the design of models composed of two independent encoders, each dedicated to a specific modality. However, while recent works have explored unified audio-visual encoders, determining the optimal cross-modal architecture remains an ongoing challenge. Furthermore, such approaches often rely on models comprising vast amounts of parameters and high computational cost training processes. In this paper, we aim to bridge this research gap by introducing a novel audio-visual framework. Our proposed method constitutes, to the best of our knowledge, the first attempt to harness the flexibility and interpretability offered by encoder architectures, such as the Branchformer, in the design of parameter-efficient AVSR systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder based on the layer-level branch scores provided by the modality-specific models. Extensive experiments on English and Spanish AVSR benchmarks covering multiple data conditions and scenarios demonstrated the effectiveness of our proposed method. Even when trained on a moderate scale of data, our models achieve competitive word error rates (WER) of approximately 2.5\% for English and surpass existing approaches for Spanish, establishing a new benchmark with an average WER of around 9.1\%. These results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates while significantly reducing the model complexity w.r.t. the prevalent approach in the field. Code and pre-trained models are available at https://github.com/david-gimeno/tailored-avsr.

URLs: https://github.com/david-gimeno/tailored-avsr.

replace CardioSyntax: end-to-end SYNTAX score prediction -- dataset, benchmark and method

Authors: Alexander Ponomarchuk, Ivan Kruzhilov, Galina Zubkova, Artem Shadrin, Ruslan Utegenov, Ivan Bessonov, Pavel Blinov

Abstract: The SYNTAX score has become a widely used measure of coronary disease severity, crucial in selecting the optimal mode of the revascularization procedure. This paper introduces a new medical regression and classification problem - automatically estimating SYNTAX score from coronary angiography. Our study presents a comprehensive CardioSYNTAX dataset of 3,018 patients for the SYNTAX score estimation and coronary dominance classification. The dataset features a balanced distribution of individuals with zero and non-zero scores. This dataset includes a first-of-its-kind, complete coronary angiography samples captured through a multi-view X-ray video, allowing one to observe coronary arteries from multiple perspectives. Furthermore, we present a novel, fully automatic end-to-end method for estimating the SYNTAX. For such a difficult task, we have achieved a solid coefficient of determination R2 of 0.51 in score value prediction and 77.3% accuracy for zero score classification.

replace Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment

Authors: Jin Chen, Kaijing Ma, Haojian Huang, Han Fang, Hao Sun, Mehdi Hosseinzadeh, Zhe Liu

Abstract: The development of multi-modal models has been rapidly advancing, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive and insufficient. Take video question answering (VideoQA) tasks as an example, human annotated questions and answers often cover only part of the video, since the corresponding text is often short and monotonous, leading to underutilization of video. To address this, we propose a Bootstrapping Video-Language Alignment framework (BoViLA), a self-training method that augments question samples during training process through LLM-based self-questioning and answering, which help model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. However, low-quality self-generated questions may instead contaminate the performance, especially in the early stages of training, as we have observed in our experiments. To filter bad self-generated questions, we introduce Evidential Deep Learning (EDL) to estimate uncertainty and assess the quality of self-generated questions by evaluating the modality alignment within the context. To the best of our knowledge, this work is the first to explore LLM-based self-training frameworks for modality alignment. We evaluate BoViLA on five strong VideoQA benchmarks, where it outperforms several state-of-the-art methods and demonstrate its effectiveness and generality. Additionally, we provide extensive analyses of the self-training framework and the EDL-based uncertainty filtering mechanism. The code will be made available.

replace CleAR: Robust Context-Guided Generative Lighting Estimation for Mobile Augmented Reality

Authors: Yiqin Zhao, Mallesham Dasari, Tian Guo

Abstract: High-quality environment lighting is essential for creating immersive mobile augmented reality (AR) experiences. However, achieving visually coherent estimation for mobile AR is challenging due to several key limitations in AR device sensing capabilities, including low camera FoV and limited pixel dynamic ranges. Recent advancements in generative AI, which can generate high-quality images from different types of prompts, including texts and images, present a potential solution for high-quality lighting estimation. Still, to effectively use generative image diffusion models, we must address two key limitations of content quality and slow inference. In this work, we design and implement a generative lighting estimation system called CleAR that can produce high-quality, diverse environment maps in the format of 360{\deg} HDR images. Specifically, we design a two-step generation pipeline guided by AR environment context data to ensure the output aligns with the physical environment's visual context and color appearance. To improve the estimation robustness under different lighting conditions, we design a real-time refinement component to adjust lighting estimation results on AR devices. Through a combination of quantitative and qualitative evaluations, we show that CleAR outperforms state-of-the-art lighting estimation methods on both estimation accuracy, latency, and robustness, and is rated by 31 participants as producing better renderings for most virtual objects. For example, CleAR achieves 51% to 56% accuracy improvement on virtual object renderings across objects of three distinctive types of materials and reflective properties. CleAR produces lighting estimates of comparable or better quality in just 3.2 seconds -- over 110X faster than state-of-the-art methods.

replace Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution

Authors: Huan Zheng, Wencheng Han, Jianbing Shen

Abstract: Recovering high-quality depth maps from compressed sources has gained significant attention due to the limitations of consumer-grade depth cameras and the bandwidth restrictions during data transmission. However, current methods still suffer from two challenges. First, bit-depth compression produces a uniform depth representation in regions with subtle variations, hindering the recovery of detailed information. Second, densely distributed random noise reduces the accuracy of estimating the global geometric structure of the scene. To address these challenges, we propose a novel framework, termed geometry-decoupled network (GDNet), for compressed depth map super-resolution that decouples the high-quality depth map reconstruction process by handling global and detailed geometric features separately. To be specific, we propose the fine geometry detail encoder (FGDE), which is designed to aggregate fine geometry details in high-resolution low-level image features while simultaneously enriching them with complementary information from low-resolution context-level image features. In addition, we develop the global geometry encoder (GGE) that aims at suppressing noise and extracting global geometric information effectively via constructing compact feature representation in a low-rank space. We conduct experiments on multiple benchmark datasets, demonstrating that our GDNet significantly outperforms current methods in terms of geometric consistency and detail recovery. In the ECCV 2024 AIM Compressed Depth Upsampling Challenge, our solution won the 1st place award. Our codes are available at: https://github.com/Ian0926/GDNet.

URLs: https://github.com/Ian0926/GDNet.

replace About Time: Advances, Challenges, and Outlooks of Action Understanding

Authors: Alexandros Stergiou, Ronald Poppe

Abstract: We have witnessed impressive advances in video action understanding. Increased dataset sizes, variability, and computation availability have enabled leaps in performance and task diversification. Current systems can provide coarse- and fine-grained descriptions of video scenes, extract segments corresponding to queries, synthesize unobserved parts of videos, and predict context across multiple modalities. This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks. We focus on prevalent challenges, overview widely adopted datasets, and survey seminal works with an emphasis on recent advances. We broadly distinguish between three temporal scopes: (1) recognition tasks of actions observed in full, (2) prediction tasks for ongoing partially observed actions, and (3) forecasting tasks for subsequent unobserved action(s). This division allows us to identify specific action modeling and video representation challenges. Finally, we outline future directions to address current shortcomings.

replace AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov

Abstract: Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This suggested us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a 4x reduction of training parameters, improved training speed, and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse, dynamic videos with stationary cameras. This helps the model distinguish between camera and scene motion and improves the dynamics of generated pose-conditioned videos. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.

replace HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

Authors: Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Shuicheng Yan

Abstract: We present HumanEdit, a high-quality, human-rewarded dataset specifically designed for instruction-guided image editing, enabling precise and diverse image manipulations through open-form language instructions. Previous large-scale editing datasets often incorporate minimal human feedback, leading to challenges in aligning datasets with human preferences. HumanEdit bridges this gap by employing human annotators to construct data pairs and administrators to provide feedback. With meticulously curation, HumanEdit comprises 5,751 images and requires more than 2,500 hours of human effort across four stages, ensuring both accuracy and reliability for a wide range of image editing tasks. The dataset includes six distinct types of editing instructions: Action, Add, Counting, Relation, Remove, and Replace, encompassing a broad spectrum of real-world scenarios. All images in the dataset are accompanied by masks, and for a subset of the data, we ensure that the instructions are sufficiently detailed to support mask-free editing. Furthermore, HumanEdit offers comprehensive diversity and high-resolution $1024 \times 1024$ content sourced from various domains, setting a new versatile benchmark for instructional image editing datasets. With the aim of advancing future research and establishing evaluation benchmarks in the field of image editing, we release HumanEdit at https://huggingface.co/datasets/BryanW/HumanEdit.

URLs: https://huggingface.co/datasets/BryanW/HumanEdit.

replace A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs

Authors: Mohit Vaishnav, Tanel Tammet

Abstract: A fundamental challenge in artificial intelligence involves understanding the cognitive mechanisms underlying visual reasoning in sophisticated models like Vision-Language Models (VLMs). How do these models integrate visual perception with abstract thought, especially when reasoning across multiple images or requiring fine-grained compositional understanding? Drawing inspiration from cognitive science, this paper introduces a structured evaluation framework using diverse visual reasoning tasks-Bongard Problems (BPs) and Winoground-to dissect the perception-reasoning interface in VLMs. We propose three distinct evaluation paradigms, mirroring human problem-solving strategies: Direct Visual Rule Learning (DVRL; holistic processing), Deductive Rule Learning (DRL; rule extraction and application), and Componential Analysis (CA; analytical decomposition via task-agnostic textual descriptions). These paradigms systematically vary cognitive load and probe processing stages. Notably, CA enables multi-image reasoning evaluation even for single-image architectures and isolates reasoning from perception by operating on textual descriptions. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance on challenging benchmarks including Bongard-OpenWorld, Bongard-HOI, and Winoground. Ablation studies confirm reasoning improves significantly when perceptual challenges are mitigated, revealing a critical perception bottleneck. Our framework provides a valuable diagnostic tool and suggests that decoupling perception (via rich, task-agnostic description) from reasoning is a promising direction for robust and general visual intelligence.

replace Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data

Authors: Jiajie Li, Brian R Quaranto, Chenhui Xu, Ishan Mishra, Ruiyang Qin, Dancheng Liu, Peter C W Kim, Jinjun Xiong

Abstract: We present RASO, a foundation model designed to Recognize Any Surgical Object, offering robust open-set recognition capabilities across a broad range of surgical procedures and object classes, in both surgical images and videos. RASO leverages a novel weakly-supervised learning framework that generates tag-image-text pairs automatically from large-scale unannotated surgical lecture videos, significantly reducing the need for manual annotations. Our scalable data generation pipeline gathers 2,200 surgical procedures and produces 3.6 million tag annotations across 2,066 unique surgical tags. Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks, respectively, in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. Code, model, and demo are available at https://ntlm1686.github.io/raso.

URLs: https://ntlm1686.github.io/raso.

replace Deformable Beta Splatting

Authors: Rong Liu, Dylan Sun, Meida Chen, Yue Wang, Andrew Feng

Abstract: 3D Gaussian Splatting (3DGS) has advanced radiance field reconstruction by enabling real-time rendering. However, its reliance on Gaussian kernels for geometry and low-order Spherical Harmonics (SH) for color encoding limits its ability to capture complex geometries and diverse colors. We introduce Deformable Beta Splatting (DBS), a deformable and compact approach that enhances both geometry and color representation. DBS replaces Gaussian kernels with deformable Beta Kernels, which offer bounded support and adaptive frequency control to capture fine geometric details with higher fidelity while achieving better memory efficiency. In addition, we extended the Beta Kernel to color encoding, which facilitates improved representation of diffuse and specular components, yielding superior results compared to SH-based methods. Furthermore, Unlike prior densification techniques that depend on Gaussian properties, we mathematically prove that adjusting regularized opacity alone ensures distribution-preserved Markov chain Monte Carlo (MCMC), independent of the splatting kernel type. Experimental results demonstrate that DBS achieves state-of-the-art visual quality while utilizing only 45% of the parameters and rendering 1.5x faster than 3DGS-MCMC, highlighting the superior performance of DBS for real-time radiance field rendering. Interactive demonstrations and source code are available on our project website: https://rongliu-leo.github.io/beta-splatting/.

URLs: https://rongliu-leo.github.io/beta-splatting/.

replace Novel computational workflows for natural and biomedical image processing based on hypercomplex algebras

Authors: Nektarios A. Valous, Eckhard Hitzer, Drago\c{s} Du\c{s}e, Rodrigo Rojas Moraleda, Ferdinand Popp, Meggy Suarez-Carmona, Anna Berthel, Ismini Papageorgiou, Carlo Fremd, Alexander R\"olle, Christina C. Westhoff, B\'en\'edicte Lenoir, Niels Halama, Inka Z\"ornig, Dirk J\"ager

Abstract: Hypercomplex image processing extends conventional techniques in a unified paradigm encompassing algebraic and geometric principles. This work leverages quaternions and the two-dimensional orthogonal planes split framework (splitting of a quaternion - representing a pixel - into pairs of orthogonal 2D planes) for natural/biomedical image analysis through the following computational workflows and outcomes: natural/biomedical image re-colorization, natural image de-colorization, natural/biomedical image contrast enhancement, computational re-staining and stain separation in histological images, and performance gains in machine/deep learning pipelines for histological images. The workflows are analyzed separately for natural and biomedical images to showcase the effectiveness of the proposed approaches. The proposed workflows can regulate color appearance (e.g. with alternative renditions and grayscale conversion) and image contrast, be part of automated image processing pipelines (e.g. isolating stain components, boosting learning models), and assist in digital pathology applications (e.g. enhancing biomarker visibility, enabling colorblind-friendly renditions). Employing only basic arithmetic and matrix operations, this work offers a computationally accessible methodology - in the hypercomplex domain - that showcases versatility and consistency across image processing tasks and a range of computer vision and biomedical applications. The proposed non-data-driven methods achieve comparable or better results (particularly in cases involving well-known methods) to those reported in the literature, showcasing the potential of robust theoretical frameworks with practical effectiveness. Results, methods, and limitations are detailed alongside discussion of promising extensions, emphasizing the potential of feature-rich mathematical/computational frameworks for natural and biomedical images.

replace Instance Segmentation of Scene Sketches Using Natural Image Priors

Authors: Mia Tang, Yael Vinker, Chuan Yan, Lvmin Zhang, Maneesh Agrawala

Abstract: Sketch segmentation involves grouping pixels within a sketch that belong to the same object or instance. It serves as a valuable tool for sketch editing tasks, such as moving, scaling, or removing specific components. While image segmentation models have demonstrated remarkable capabilities in recent years, sketches present unique challenges for these models due to their sparse nature and wide variation in styles. We introduce InkLayer, a method for instance segmentation of raster scene sketches. Our approach adapts state-of-the-art image segmentation and object detection models to the sketch domain by employing class-agnostic fine-tuning and refining segmentation masks using depth cues. Furthermore, our method organizes sketches into sorted layers, where occluded instances are inpainted, enabling advanced sketch editing applications. As existing datasets in this domain lack variation in sketch styles, we construct a synthetic scene sketch segmentation dataset, InkScenes, featuring sketches with diverse brush strokes and varying levels of detail. We use this dataset to demonstrate the robustness of our approach.

replace CalibRefine: Deep Learning-Based Online Automatic Targetless LiDAR-Camera Calibration with Iterative and Attention-Driven Post-Refinement

Authors: Lei Cheng, Lihao Guo, Tianya Zhang, Tam Bang, Austin Harris, Mustafa Hajij, Mina Sartipi, Siyang Cao

Abstract: Accurate multi-sensor calibration is essential for deploying robust perception systems in applications such as autonomous driving and intelligent transportation. Existing LiDAR-camera calibration methods often rely on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, limiting their scalability and adaptability in real-world settings. In this work, we propose a fully automatic, targetless, and online calibration framework, CalibRefine, which directly processes raw LiDAR point clouds and camera images. Our approach is divided into four stages: (1) a Common Feature Discriminator that leverages relative spatial positions, visual appearance embeddings, and semantic class cues to identify and generate reliable LiDAR-camera correspondences, (2) a coarse homography-based calibration that uses the matched feature correspondences to estimate an initial transformation between the LiDAR and camera frames, serving as the foundation for further refinement, (3) an iterative refinement to incrementally improve alignment as additional data frames become available, and (4) an attention-based refinement that addresses non-planar distortions by leveraging a Vision Transformer and cross-attention mechanisms. Extensive experiments on two urban traffic datasets demonstrate that CalibRefine achieves high-precision calibration with minimal human input, outperforming state-of-the-art targetless methods and matching or surpassing manually tuned baselines. Our results show that robust object-level feature matching, combined with iterative refinement and self-supervised attention-based refinement, enables reliable sensor alignment in complex real-world conditions without ground-truth matrices or elaborate preprocessing. Code is available at https://github.com/radar-lab/Lidar_Camera_Automatic_Calibration

URLs: https://github.com/radar-lab/Lidar_Camera_Automatic_Calibration

replace Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers

Authors: Ji Zhao, Banglei Guan, Zibin Liu, Laurent Kneip

Abstract: For event cameras, current sparse geometric solvers for egomotion estimation assume that the rotational displacements are known, such as those provided by an IMU. Thus, they can only recover the translational motion parameters. Recovering full-DoF motion parameters using a sparse geometric solver is a more challenging task, and has not yet been investigated. In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. Our method leverages event manifolds induced by line segments. The problem formulations are based on either an incidence relation for lines or a novel coplanarity relation for normal vectors. We demonstrate the possibility of recovering full-DoF egomotion parameters for both angular and linear velocities without requiring extra sensor measurements or motion priors. To achieve efficient optimization, we exploit the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on both synthetic and real-world data demonstrate the effectiveness of our method. The code is available at https://github.com/jizhaox/relpose-event.

URLs: https://github.com/jizhaox/relpose-event.

replace UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Authors: Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer \"Ozsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar

Abstract: Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks-Element Grounding, Layout Grounding, and Action Prediction-with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.

replace The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

Authors: Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, Yaohui Wang

Abstract: The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. In order to address potential inaccuracies and ambiguous details generated by LLM-generated prompts. RAPO refines the naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts.

replace Cobra: Efficient Line Art COlorization with BRoAder References

Authors: Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, Chun Yuan, Ying Shan

Abstract: The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: https://zhuang2002.github.io/Cobra/.

URLs: https://zhuang2002.github.io/Cobra/.

replace Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization

Authors: Hongwei Ji, Wulian Yun, Mengshi Qi, Huadong Ma

Abstract: Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the localization task. Therefore, we propose a new few-shot temporal action localization method by Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level to assist action localization, we design a Chain of Thought (CoT)-like reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoT-like text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets. We introduce the first dataset named Human-related Anomaly Localization and explore the application of the TAL task in human anomaly detection. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. We will release our code, data and benchmark.

replace Towards Accurate and Interpretable Neuroblastoma Diagnosis via Contrastive Multi-scale Pathological Image Analysis

Authors: Zhu Zhu, Shuo Jiang, Jingyuan Zheng, Yawen Li, Yifei Chen, Manli Zhao, Weizhong Gu, Feiwei Qin, Jinhu Wang, Gang Yu

Abstract: Neuroblastoma, adrenal-derived, is among the most common pediatric solid malignancies, characterized by significant clinical heterogeneity. Timely and accurate pathological diagnosis from hematoxylin and eosin-stained whole-slide images is critical for patient prognosis. However, current diagnostic practices primarily rely on subjective manual examination by pathologists, leading to inconsistent accuracy. Existing automated whole-slide image classification methods encounter challenges such as poor interpretability, limited feature extraction capabilities, and high computational costs, restricting their practical clinical deployment. To overcome these limitations, we propose CMSwinKAN, a contrastive-learning-based multi-scale feature fusion model tailored for pathological image classification, which enhances the Swin Transformer architecture by integrating a Kernel Activation Network within its multilayer perceptron and classification head modules, significantly improving both interpretability and accuracy. By fusing multi-scale features and leveraging contrastive learning strategies, CMSwinKAN mimics clinicians' comprehensive approach, effectively capturing global and local tissue characteristics. Additionally, we introduce a heuristic soft voting mechanism guided by clinical insights to bridge patch-level predictions to whole-slide image-level classifications seamlessly. We verified the CMSwinKAN on the publicly available BreakHis dataset and the PpNTs dataset, which was established by our hospital. Results demonstrate that CMSwinKAN performs better than existing state-of-the-art pathology-specific models pre-trained on large datasets. Our source code is available at https://github.com/JSLiam94/CMSwinKAN.

URLs: https://github.com/JSLiam94/CMSwinKAN.

replace DreamO: A Unified Framework for Image Customization

Authors: Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu

Abstract: Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) demonstrates strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their generalizability to combine different types of condition. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process input of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

replace Step1X-Edit: A Practical Framework for General Image Editing

Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang

Abstract: In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between the open-source algorithm with these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against the closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

replace Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID

Authors: De Cheng, Lingfeng He, Nannan Wang, Dingwen Zhang, Xinbo Gao

Abstract: Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This insight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up optimization objective for specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is carried out to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. Optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side-effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to mine reliable positive sample sets for the global and part features dynamically and optimize the inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performances to state-of-the-art methods. Our code is available at \href{https://github.com/FranklinLingfeng/code-for-SALCR}.

URLs: https://github.com/FranklinLingfeng/code-for-SALCR

replace Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

Authors: Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

Abstract: Multimodal agents, which integrate a controller (e.g., a large language model) with external tools, have demonstrated remarkable capabilities in tackling complex tasks. However, existing agents need to collect a large number of expert data for fine-tuning to adapt to new environments. In this paper, we propose an online self-exploration method for multimodal agents, namely SPORT, via step-wise preference optimization to refine the trajectories of agents, which automatically generates tasks and learns from solving the generated tasks, without any expert annotation. SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning. First, we synthesize multi-modal tasks using language models. Then, we introduce a novel search scheme, where step sampling and step verification are executed alternately to solve each generated task. We employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller's policy through preference tuning, producing a SPORT Agent. By interacting with real environments, the SPORT Agent evolves into a more refined and capable system. Evaluation in the GTA and GAIA benchmarks show that the SPORT Agent achieves 6.41\% and 3.64\% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.

URLs: https://SPORT-Agents.github.io.

replace HeAL3D: Heuristical-enhanced Active Learning for 3D Object Detection

Authors: Esteban Rivera, Surya Prabhakaran, Markus Lienkamp

Abstract: Active Learning has proved to be a relevant approach to perform sample selection for training models for Autonomous Driving. Particularly, previous works on active learning for 3D object detection have shown that selection of samples in uncontrolled scenarios is challenging. Furthermore, current approaches focus exclusively on the theoretical aspects of the sample selection problem but neglect the practical insights that can be obtained from the extensive literature and application of 3D detection models. In this paper, we introduce HeAL (Heuristical-enhanced Active Learning for 3D Object Detection) which integrates those heuristical features together with Localization and Classification to deliver the most contributing samples to the model's training. In contrast to previous works, our approach integrates heuristical features such as object distance and point-quantity to estimate the uncertainty, which enhance the usefulness of selected samples to train detection models. Our quantitative evaluation on KITTI shows that HeAL presents competitive mAP with respect to the State-of-the-Art, and achieves the same mAP as the full-supervised baseline with only 24% of the samples.

replace Entropy Heat-Mapping: Localizing GPT-Based OCR Errors with Sliding-Window Shannon Analysis

Authors: Alexei Kaltchenko

Abstract: Vision-language models such as OpenAI GPT-4o can transcribe mathematical documents directly from images, yet their token-level confidence signals are seldom used to pinpoint local recognition mistakes. We present an entropy-heat-mapping proof-of-concept that turns per-token Shannon entropy into a visual ''uncertainty landscape''. By scanning the entropy sequence with a fixed-length sliding window, we obtain hotspots that are likely to contain OCR errors such as missing symbols, mismatched braces, or garbled prose. Using a small, curated set of scanned research pages rendered at several resolutions, we compare the highlighted hotspots with the actual transcription errors produced by GPT-4o. Our analysis shows that the vast majority of true errors are indeed concentrated inside the high-entropy regions. This study demonstrates--in a minimally engineered setting--that sliding-window entropy can serve as a practical, lightweight aid for post-editing GPT-based OCR. All code and annotation guidelines are released to encourage replication and further research.

replace RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Authors: Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, Xuming Hu

Abstract: Multimodal Large Language Models (MLLMs) increasingly excel at perception, understanding, and reasoning. However, current benchmarks inadequately evaluate their ability to perform these tasks continuously in dynamic, real-world environments. To bridge this gap, we introduce RTV-Bench, a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench uses three key principles: (1) Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes; (2) Hierarchical Question Structure, combining basic and advanced queries; and (3) Multi-dimensional Evaluation, assessing the ability of continuous perception, understanding, and reasoning. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary (GPT-4o, Gemini 2.0), open-source offline (Qwen2.5-VL, VideoLLaMA3), and open-source real-time (VITA-1.5, InternLM-XComposer2.5-OmniLive) models. Experiment results show open-source real-time models largely outperform offline ones but still trail top proprietary models. Our analysis also reveals that larger model size or higher frame sampling rates do not significantly boost RTV-Bench performance, sometimes causing slight decreases. This underscores the need for better model architectures optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs. Our benchmark toolkit is available at: https://github.com/LJungang/RTV-Bench.

URLs: https://github.com/LJungang/RTV-Bench.

replace Robust Duality Learning for Unsupervised Visible-Infrared Person Re-Identification

Authors: Yongxiang Li, Yuan Sun, Yang Qin, Dezhong Peng, Xi Peng, Peng Hu

Abstract: Unsupervised visible-infrared person re-identification (UVI-ReID) aims to retrieve pedestrian images across different modalities without costly annotations, but faces challenges due to the modality gap and lack of supervision. Existing methods often adopt self-training with clustering-generated pseudo-labels but implicitly assume these labels are always correct. In practice, however, this assumption fails due to inevitable pseudo-label noise, which hinders model learning. To address this, we introduce a new learning paradigm that explicitly considers Pseudo-Label Noise (PLN), characterized by three key challenges: noise overfitting, error accumulation, and noisy cluster correspondence. To this end, we propose a novel Robust Duality Learning framework (RoDE) for UVI-ReID to mitigate the effects of noisy pseudo-labels. First, to combat noise overfitting, a Robust Adaptive Learning mechanism (RAL) is proposed to dynamically emphasize clean samples while down-weighting noisy ones. Second, to alleviate error accumulation-where the model reinforces its own mistakes-RoDE employs dual distinct models that are alternately trained using pseudo-labels from each other, encouraging diversity and preventing collapse. However, this dual-model strategy introduces misalignment between clusters across models and modalities, creating noisy cluster correspondence. To resolve this, we introduce Cluster Consistency Matching (CCM), which aligns clusters across models and modalities by measuring cross-cluster similarity. Extensive experiments on three benchmarks demonstrate the effectiveness of RoDE.

replace MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Authors: Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, Lihua Zhang

Abstract: Diffusion models have shown excellent performance in text-to-image generation. Nevertheless, existing methods often suffer from performance bottlenecks when handling complex prompts that involve multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that generates an agent system comprising multiple agents with distinct tasks, utilizing MLLMs to extract various scene elements effectively. In addition, Hierarchical Compositional diffusion utilizes a Gaussian mask and filtering to refine bounding box regions and enhance objects through region enhancement, resulting in the accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, providing a substantial advantage in complex scene generation.

replace VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

Authors: Bojin Wu, Jing Chen

Abstract: We propose a robust method for monocular depth scale recovery. Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which involves recovering depth with absolute scale. To obtain absolute scale information for practical downstream tasks, utilizing textual information to recover the scale of a relative depth map is a highly promising approach. However, since a single image can have multiple descriptions from different perspectives or with varying styles, it has been shown that different textual descriptions can significantly affect the scale recovery process. To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description. This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map, ultimately generating depth predictions with metric-scale accuracy. We validate our method across several popular relative depth models(MiDas, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI). Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios. Code is available at: https://github.com/pakinwu/VGLD.

URLs: https://github.com/pakinwu/VGLD.

replace Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology

Authors: Alex Hoi Hang Chan, Otto Brookes, Urs Waldmann, Hemal Naik, Iain D. Couzin, Majid Mirmehdi, No\"el Adiko Houa, Emmanuelle Normand, Christophe Boesch, Lukas Boesch, Mimi Arandjelovic, Hjalmar K\"uhl, Tilo Burghardt, Fumihiro Kano

Abstract: Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of its final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that leads to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets, allowing for models to be benchmarked in the context of their downstream application and to facilitate better integration of models into application workflows.

replace-cross PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off

Authors: Sachit Kuhar, Yash Jain, Alexey Tumanov

Abstract: Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose PLUM, a unified co-design framework that integrates DNN inference systems and quantization (forward and backward pass) to leverage the repetition-sparsity trade-off to improve inference efficiency. Our results demonstrate that PLUM's quantization method is more accurate than binary quantization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters of latent full-precision weights for a DNN block. Finally, the proposed PLUM framework achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods while retaining top-1 accuracy when compared to prior-art methods for ResNets on ImageNet (by achieving 66.2% top-1 accuracy), presenting an alternative solution for deploying efficient models in resource-limited environments.

replace-cross Momentum-SAM: Sharpness Aware Minimization without Computational Overhead

Authors: Marlon Becker, Frederick Altrock, Benjamin Risse

Abstract: The recently proposed optimization algorithm for deep neural networks Sharpness Aware Minimization (SAM) suggests perturbing parameters before gradient calculation by a gradient ascent step to guide the optimization into parameter space regions of flat loss. While significant generalization improvements and thus reduction of overfitting could be demonstrated, the computational costs are doubled due to the additionally needed gradient calculation, making SAM unfeasible in case of limited computationally capacities. Motivated by Nesterov Accelerated Gradient (NAG) we propose Momentum-SAM (MSAM), which perturbs parameters in the direction of the accumulated momentum vector to achieve low sharpness without significant computational overhead or memory demands over SGD or Adam. We evaluate MSAM in detail and reveal insights on separable mechanisms of NAG, SAM and MSAM regarding training optimization and generalization. Code is available at https://github.com/MarlonBecker/MSAM.

URLs: https://github.com/MarlonBecker/MSAM.

replace-cross Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Authors: Daeun Kyung, Junu Kim, Tackeun Kim, Edward Choi

Abstract: Chest X-ray (CXR) is an important diagnostic tool widely used in hospitals to assess patient conditions and monitor changes over time. Recently, generative models, specifically diffusion-based models, have shown promise in generating realistic synthetic CXRs. However, these models mainly focus on conditional generation using single-time-point data, i.e., generating CXRs conditioned on their corresponding reports from a specific time. This limits their clinical utility, particularly for capturing temporal changes. To address this limitation, we propose a novel framework, EHRXDiff, which predicts future CXR images by integrating previous CXRs with subsequent medical events, e.g., prescriptions, lab measures, etc. Our framework dynamically tracks and predicts disease progression based on a latent diffusion model, conditioned on the previous CXR image and a history of medical events. We comprehensively evaluate the performance of our framework across three key aspects, including clinical consistency, demographic consistency, and visual realism. Results show that our framework generates high-quality, realistic future images that effectively capture potential temporal changes. This suggests that our framework could be further developed to support clinical decision-making and provide valuable insights for patient monitoring and treatment planning in the medical field. The code is available at https://github.com/dek924/EHRXDiff.

URLs: https://github.com/dek924/EHRXDiff.

replace-cross Rethinking Meta-Learning from a Learning Lens

Authors: Jingyao Wang, Wenwen Qiang, Changwen Zheng, Hui Xiong, Gang Hua

Abstract: Meta-learning seeks to learn a well-generalized model initialization from training tasks to solve unseen tasks. From the "learning to learn" perspective, the quality of the initialization is modeled with one-step gradient decent in the inner loop. However, contrary to theoretical expectations, our empirical analysis reveals that this may expose meta-learning to underfitting. To bridge the gap between theoretical understanding and practical implementation, we reconsider meta-learning from the "Learning" lens. We propose that the meta-learning model comprises two interrelated components: parameters for model initialization and a meta-layer for task-specific fine-tuning. These components will lead to the risks of overfitting and underfitting depending on tasks, and their solutions, fewer parameters vs. more meta-layer, are often in conflict. To address this, we aim to regulate the task information the model receives without modifying the data or model structure. Our theoretical analysis indicates that models adapted to different tasks can mutually reinforce each other, highlighting the effective information. Based on this insight, we propose TRLearner, a plug-and-play method that leverages task relation to calibrate meta-learning. It first extracts task relation matrices and then applies relation-aware consistency regularization to guide optimization. Extensive theoretical and empirical evaluations demonstrate its effectiveness.

replace-cross Modular Autonomous Virtualization System for Two-Dimensional Semiconductor Quantum Dot Arrays

Authors: Anantha S. Rao, Donovan Buterakos, Barnaby van Straaten, Valentin John, C\'ecile X. Yu, Stefan D. Oosterhout, Lucas Stehouwer, Giordano Scappucci, Menno Veldhorst, Francesco Borsoi, Justyna P. Zwolak

Abstract: Arrays of gate-defined semiconductor quantum dots are among the leading candidates for building scalable quantum processors. High-fidelity initialization, control, and readout of spin qubit registers require exquisite and targeted control over key Hamiltonian parameters that define the electrostatic environment. However, due to the tight gate pitch, capacitive crosstalk between gates hinders independent tuning of chemical potentials and interdot couplings. While virtual gates offer a practical solution, determining all the required cross-capacitance matrices accurately and efficiently in large quantum dot registers is an open challenge. Here, we establish a modular automated virtualization system (MAViS) -- a general and modular framework for autonomously constructing a complete stack of multilayer virtual gates in real time. Our method employs machine learning techniques to rapidly extract features from two-dimensional charge stability diagrams. We then utilize computer vision and regression models to self-consistently determine all relative capacitive couplings necessary for virtualizing plunger and barrier gates in both low- and high-tunnel-coupling regimes. Using MAViS, we successfully demonstrate accurate virtualization of a dense two-dimensional array comprising ten quantum dots defined in a high-quality Ge/SiGe heterostructure. Our work offers an elegant and practical solution for the efficient control of large-scale semiconductor quantum dot systems.

replace-cross OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction

Authors: Gehui Li, Bin Chen, Chen Zhao, Lei Zhang, Jian Zhang

Abstract: Exposure correction is a fundamental problem in computer vision and image processing. Recently, frequency domain-based methods have achieved impressive improvement, yet they still struggle with complex real-world scenarios under extreme exposure conditions. This is due to the local convolutional receptive fields failing to model long-range dependencies in the spectrum, and the non-generative learning paradigm being inadequate for retrieving lost details from severely degraded regions. In this paper, we propose Omnidirectional Spectral Mamba (OSMamba), a novel exposure correction network that incorporates the advantages of state space models and generative diffusion models to address these limitations. Specifically, OSMamba introduces an omnidirectional spectral scanning mechanism that adapts Mamba to the frequency domain to capture comprehensive long-range dependencies in both the amplitude and phase spectra of deep image features, hence enhancing illumination correction and structure recovery. Furthermore, we develop a dual-domain prior generator that learns from well-exposed images to generate a degradation-free diffusion prior containing correct information about severely under- and over-exposed regions for better detail restoration. Extensive experiments on multiple-exposure and mixed-exposure datasets demonstrate that the proposed OSMamba achieves state-of-the-art performance both quantitatively and qualitatively.

replace-cross Editable-DeepSC: Reliable Cross-Modal Semantic Communications for Facial Editing

Authors: Bin Chen, Wenbo Yu, Qinshan Zhang, Tianqu Zhuang, Yong Jiang, Shu-Tao Xia

Abstract: Real-time computer vision (CV) plays a crucial role in various real-world applications, whose performance is highly dependent on communication networks. Nonetheless, the data-oriented characteristics of conventional communications often do not align with the special needs of real-time CV tasks. To alleviate this issue, the recently emerged semantic communications only transmit task-related semantic information and exhibit a promising landscape to address this problem. However, the communication challenges associated with Semantic Facial Editing, one of the most important real-time CV applications on social media, still remain largely unexplored. In this paper, we fill this gap by proposing Editable-DeepSC, a novel cross-modal semantic communication approach for facial editing. Firstly, we theoretically discuss different transmission schemes that separately handle communications and editings, and emphasize the necessity of Joint Editing-Channel Coding (JECC) via iterative attributes matching, which integrates editings into the communication chain to preserve more semantic mutual information. To compactly represent the high-dimensional data, we leverage inversion methods via pre-trained StyleGAN priors for semantic coding. To tackle the dynamic channel noise conditions, we propose SNR-aware channel coding via model fine-tuning. Extensive experiments indicate that Editable-DeepSC can achieve superior editings while significantly saving the transmission bandwidth, even under high-resolution and out-of-distribution (OOD) settings.

replace-cross FedSynthCT-Brain: A Federated Learning Framework for Multi-Institutional Brain MRI-to-CT Synthesis

Authors: Ciro Benito Raggio, Mathias Krohmer Zabaleta, Nils Skupien, Oliver Blanck, Francesco Cicone, Giuseppe Lucio Cascini, Paolo Zaffino, Lucia Migliorelli, Maria Francesca Spadea

Abstract: The generation of Synthetic Computed Tomography (sCT) images has become a pivotal methodology in modern clinical practice, particularly in the context of Radiotherapy (RT) treatment planning. The use of sCT enables the calculation of doses, pushing towards Magnetic Resonance Imaging (MRI) guided radiotherapy treatments. Deep learning methods for MRI-to-sCT have shown promising results, but their reliance on single-centre training dataset limits generalisation capabilities to diverse clinical settings. Moreover, creating centralised multi-centre datasets may pose privacy concerns. To address the aforementioned issues, we introduced FedSynthCT-Brain, an approach based on the Federated Learning (FL) paradigm for MRI-to-sCT in brain imaging. This is among the first applications of FL for MRI-to-sCT, employing a cross-silo horizontal FL approach that allows multiple centres to collaboratively train a U-Net-based deep learning model. We validated our method using real multicentre data from four European and American centres, simulating heterogeneous scanner types and acquisition modalities, and tested its performance on an independent dataset from a centre outside the federation. In the case of the unseen centre, the federated model achieved a median Mean Absolute Error (MAE) of $102.0$ HU across 23 patients, with an interquartile range of $96.7-110.5$ HU. The median (interquartile range) for the Structural Similarity Index (SSIM) and the Peak Signal to Noise Ratio (PNSR) were $0.89 (0.86-0.89)$ and $26.58 (25.52-27.42)$, respectively. The analysis of the results showed acceptable performances of the federated approach, thus highlighting the potential of FL to enhance MRI-to-sCT to improve generalisability and advancing safe and equitable clinical applications while fostering collaboration and preserving data privacy.

replace-cross Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Authors: Zhen Liu, Tim Z. Xiao, Weiyang Liu, Yoshua Bengio, Dinghuai Zhang

Abstract: While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. In response to this challenge, we take inspiration from recent successes in generative flow networks (GFlowNets) and propose a reinforcement learning method for diffusion model finetuning, dubbed Nabla-GFlowNet (abbreviated as $\nabla$-GFlowNet), that leverages the rich signal in reward gradients for probabilistic diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.

replace-cross Liver Cirrhosis Stage Estimation from MRI with Deep Learning

Authors: Jun Zeng, Debesh Jha, Ertugrul Aktas, Elif Keles, Alpay Medetalibeyoglu, Matthew Antalek, Amir A. Borhani, Daniela P. Ladner, Gorkem Durak, Ulas Bagci

Abstract: We present an end-to-end deep learning framework for automated liver cirrhosis stage estimation from multi-sequence MRI. Cirrhosis is the severe scarring (fibrosis) of the liver and a common endpoint of various chronic liver diseases. Early diagnosis is vital to prevent complications such as decompensation and cancer, which significantly decreases life expectancy. However, diagnosing cirrhosis in its early stages is challenging, and patients often present with life-threatening complications. Our approach integrates multi-scale feature learning with sequence-specific attention mechanisms to capture subtle tissue variations across cirrhosis progression stages. Using CirrMRI600+, a large-scale publicly available dataset of 628 high-resolution MRI scans from 339 patients, we demonstrate state-of-the-art performance in three-stage cirrhosis classification. Our best model achieves 72.8% accuracy on T1W and 63.8% on T2W sequences, significantly outperforming traditional radiomics-based approaches. Through extensive ablation studies, we show that our architecture effectively learns stage-specific imaging biomarkers. We establish new benchmarks for automated cirrhosis staging and provide insights for developing clinically applicable deep learning systems. The source code will be available at https://github.com/JunZengz/CirrhosisStage.

URLs: https://github.com/JunZengz/CirrhosisStage.

replace-cross Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases

Authors: Andr\'es Bell-Navas, Mar\'ia Villalba-Orero, Enrique Lara-Pezzi, Jes\'us Garicano-Mena, Soledad Le Clainche

Abstract: Heart diseases constitute the main cause of international human defunction. According to the World Health Organization (WHO), approximately 18 million deaths happen each year due to precisely heart diseases. In particular, heart failures (HF) press the healthcare industry to develop systems for their early, rapid, and effective prediction. This work presents an automatic system based on a novel deep learning framework which analyses in real-time echocardiography video sequences for the challenging and more specific task of heart failure time prediction. This system works in two stages. The first one transforms the data from a database of echocardiography video sequences into a machine learning-compatible collection of annotated images which can be used in the training phase of any machine learning-based framework, including a deep learning-based one. This stage includes the use of the Higher Order Dynamic Mode Decomposition (HODMD) algorithm for both data augmentation and feature extraction. The second stage builds and trains a Vision Transformer (ViT). Self-supervised learning (SSL) methods, so far barely explored in the literature about heart failure prediction, are adopted to effectively train the ViT from scratch, even with scarce databases. The designed neural network analyses images from echocardiography sequences to estimate the time in which a heart failure will happen. The results obtained show the efficacy of the HODMD algorithm and the superiority of the proposed system with respect to several established ViT and Convolutional Neural Network (CNN) architectures. The source code will be incorporated into the next version release of the ModelFLOWs-app software (https://github.com/modelflows/ModelFLOWs-app).

URLs: https://github.com/modelflows/ModelFLOWs-app).

replace-cross Light Weight CNN for classification of Brain Tumors from MRI Images

Authors: Natnael Alemayehu

Abstract: This study presents a convolutional neural network (CNN)-based approach for the multi-class classification of brain tumors using magnetic resonance imaging (MRI) scans. We utilize a publicly available dataset containing MRI images categorized into four classes: glioma, meningioma, pituitary tumor, and no tumor. Our primary objective is to build a light weight deep learning model that can automatically classify brain tumor types with high accuracy. To achieve this goal, we incorporate image preprocessing steps, including normalization, data augmentation, and a cropping technique designed to reduce background noise and emphasize relevant regions. The CNN architecture is optimized through hyperparameter tuning using Keras Tuner, enabling systematic exploration of network parameters. To ensure reliable evaluation, we apply 5-fold cross-validation, where each hyperparameter configuration is evaluated across multiple data splits to mitigate overfitting. Experimental results demonstrate that the proposed model achieves a classification accuracy of 98.78%, indicating its potential as a diagnostic aid in clinical settings. The proposed method offers a low-complexity yet effective solution for assisting in early brain tumor diagnosis.

replace-cross Robotic Visual Instruction

Authors: Yanbang Li, Ziyang Gong, Haoyang Li, Xiaoqi Huang, Haolan Kang, Guangping Bai, Xianzheng Ma

Abstract: Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision introduces challenges for robotic task definition such as ambiguity and verbosity. Moreover, in some public settings where quiet is required, such as libraries or hospitals, verbal communication with robots is inappropriate. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment,enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Project website: https://robotic-visual-instruction.github.io/

URLs: https://robotic-visual-instruction.github.io/

replace-cross A Multi-Granularity Retrieval Framework for Visually-Rich Documents

Authors: Mingjun Xu, Zehui Wang, Hengxing Cai, Renxin Zhong

Abstract: Retrieval-augmented generation (RAG) systems have predominantly focused on text-based retrieval, limiting their effectiveness in handling visually-rich documents that encompass text, images, tables, and charts. To bridge this gap, we propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR. Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering to effectively capture and utilize the complex interdependencies between textual and visual modalities. By leveraging off-the-shelf vision-language models and implementing a training-free hybrid retrieval strategy, our framework demonstrates robust performance without the need for task-specific fine-tuning. Experimental evaluations reveal that incorporating layout-aware search and VLM-based candidate verification significantly enhances retrieval accuracy, achieving a top performance score of 65.56. This work underscores the potential of scalable and reproducible solutions in advancing multimodal document retrieval systems.

replace-cross Adversarial Robustness of Deep Learning Models for Inland Water Body Segmentation from SAR Images

Authors: Siddharth Kothari, Srinivasan Murali, Sankalp Kothari, Ujjwal Verma, Jaya Sreevalsan-Nair

Abstract: Inland water body segmentation from Synthetic Aperture Radar (SAR) images is an important task needed for several applications, such as flood mapping. While SAR sensors capture data in all-weather conditions as high-resolution images, differentiating water and water-like surfaces from SAR images is not straightforward. Inland water bodies, such as large river basins, have complex geometry, which adds to the challenge of segmentation. U-Net is a widely used deep learning model for land-water segmentation of SAR images. In practice, manual annotation is often used to generate the corresponding water masks as ground truth. Manual annotation of the images is prone to label noise owing to data poisoning attacks, especially due to complex geometry. In this work, we simulate manual errors in the form of adversarial attacks on the U-Net model and study the robustness of the model to human errors in annotation. Our results indicate that U-Net can tolerate a certain level of corruption before its performance drops significantly. This finding highlights the crucial role that the quality of manual annotations plays in determining the effectiveness of the segmentation model. The code and the new dataset, along with adversarial examples for robust training, are publicly available. (GitHub link - https://github.com/GVCL/IWSeg-SAR-Poison.git)

URLs: https://github.com/GVCL/IWSeg-SAR-Poison.git)

replace-cross Regression is all you need for medical image translation

Authors: Sebastian Rassmann, David K\"ugler, Christian Ewert, Martin Reuter

Abstract: The acquisition of information-rich images within a limited time budget is crucial in medical imaging. Medical image translation (MIT) can help enhance and supplement existing datasets by generating synthetic images from acquired data. While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved remarkable success in natural image generation, their benefits - creativity and image realism - do not necessarily transfer to medical applications where highly accurate anatomical information is required. In fact, the imitation of acquisition noise or content hallucination hinder clinical utility. Here, we introduce YODA (You Only Denoise once - or Average), a novel 2.5D diffusion-based framework for volumetric MIT. YODA unites diffusion and regression paradigms to produce realistic or noise-free outputs. Furthermore, we propose Expectation-Approximation (ExpA) DM sampling, which draws inspiration from MRI signal averaging. ExpA-sampling suppresses generated noise and, thus, eliminates noise from biasing the evaluation of image quality. Through extensive experiments on four diverse multi-modal datasets - comprising multi-contrast brain MRI and pelvic MRI-CT - we show that diffusion and regression sampling yield similar results in practice. As such, the computational overhead of diffusion sampling does not provide systematic benefits in medical information translation. Building on these insights, we demonstrate that YODA outperforms several state-of-the-art GAN and DM methods. Notably, YODA-generated images are shown to be interchangeable with, or even superior to, physical acquisitions for several downstream tasks. Our findings challenge the presumed advantages of DMs in MIT and pave the way for the practical application of MIT in medical imaging.

replace-cross Sharpness-Aware Minimization with Z-Score Gradient Filtering for Neural Networks

Authors: Juyoung Yun

Abstract: Generalizing well in deep neural networks remains a core challenge, particularly due to their tendency to converge to sharp minima that degrade robustness. Sharpness-Aware Minimization (SAM) mitigates this by seeking flatter minima but perturbs parameters using the full gradient, which can include statistically insignificant directions. We propose ZSharp, a simple yet effective extension to SAM that applies layer-wise Z-score normalization followed by percentile-based filtering to retain only statistically significant gradient components. This selective perturbation aligns updates with curvature-sensitive directions, enhancing generalization without requiring architectural changes. ZSharp introduces only one additional hyperparameter, the percentile threshold, and remains fully compatible with existing SAM variants. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet using ResNet, VGG, and Vision Transformers show that ZSharp consistently outperforms SAM and its variants in test accuracy, particularly on deeper and transformer-based models. These results demonstrate that ZSharp is a principled and lightweight improvement for sharpness-aware optimization.