new Towards Training-Free Underwater 3D Object Detection from Sonar Point Clouds: A Comparison of Traditional and Deep Learning Approaches

Authors: M. Salman Shaukat, Yannik Käckenmeister, Sebastian Bader, Thomas Kirste

Abstract: Underwater 3D object detection remains one of the most challenging frontiers in computer vision, where traditional approaches struggle with the harsh acoustic environment and scarcity of training data. While deep learning has revolutionized terrestrial 3D detection, its application underwater faces a critical bottleneck: obtaining sufficient annotated sonar data is prohibitively expensive and logistically complex, often requiring specialized vessels, expert surveyors, and favorable weather conditions. This work addresses a fundamental question: Can we achieve reliable underwater 3D object detection without real-world training data? We tackle this challenge by developing and comparing two paradigms for training-free detection of artificial structures in multibeam echo-sounder point clouds. Our dual approach combines a physics-based sonar simulation pipeline that generates synthetic training data for state-of-the-art neural networks, with a robust model-based template matching system that leverages geometric priors of target objects. Evaluation on real bathymetry surveys from the Baltic Sea reveals surprising insights: while neural networks trained on synthetic data achieve 98% mean Average Precision (mAP) on simulated scenes, they drop to 40% mAP on real sonar data due to domain shift. Conversely, our template matching approach maintains 83% mAP on real data without requiring any training, demonstrating remarkable robustness to acoustic noise and environmental variations. Our findings challenge conventional wisdom about data-hungry deep learning in underwater domains and establish the first large-scale benchmark for training-free underwater 3D detection. This work opens new possibilities for autonomous underwater vehicle navigation, marine archaeology, and offshore infrastructure monitoring in data-scarce environments where traditional machine learning approaches fail.
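
As a concrete illustration of the template-matching paradigm described above, here is a minimal sketch (not the authors' pipeline) that rasterizes a sonar point cloud into a height grid and scores a geometric template by normalized cross-correlation; the function names, cell size, and scoring scheme are assumptions.

import numpy as np

def rasterize(points, cell=0.5):
    """Convert an (N, 3) sonar point cloud to a 2D max-height grid."""
    xy = ((points[:, :2] - points[:, :2].min(0)) / cell).astype(int)
    grid = np.full(xy.max(0) + 1, np.nan)
    for (i, j), z in zip(xy, points[:, 2]):
        grid[i, j] = np.nanmax([grid[i, j], z])
    return np.nan_to_num(grid, nan=np.nanmin(points[:, 2]))

def match_template(grid, template):
    """Slide the template over the grid and return a normalized cross-correlation score map."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    scores = np.full((grid.shape[0] - th + 1, grid.shape[1] - tw + 1), -np.inf)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            patch = grid[i:i + th, j:j + tw]
            p = (patch - patch.mean()) / (patch.std() + 1e-8)
            scores[i, j] = (p * t).mean()
    return scores  # peaks above a threshold are candidate detections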

new MobileDenseAttn: A Dual-Stream Architecture for Accurate and Interpretable Brain Tumor Detection

Authors: Shudipta Banik, Muna Das, Trapa Banik, Md. Ehsanul Haque

Abstract: The detection of brain tumors in MRI is essential for timely diagnosis and treatment; however, manual analysis is often time-consuming and error-prone. Current approaches are not universal because they generalize poorly to heterogeneous tumors, are computationally inefficient, and lack interpretability and transparency, which limits their trustworthiness. To overcome these issues, we introduce MobileDenseAttn, a dual-stream fusion model of MobileNetV2 and DenseNet201 that progressively improves feature representation, computational efficiency, and visual explanation via GradCAM. Our model uses feature-level fusion and is trained on an augmented dataset of 6,020 MRI scans representing glioma, meningioma, pituitary tumors, and normal samples. Measured under strict 5-fold cross-validation protocols, MobileDenseAttn achieves a training accuracy of 99.75%, a testing accuracy of 98.35%, and a stable F1 score of 0.9835 (95% CI: 0.9743 to 0.9920). Extensive validation shows the stability of the model, and comparative analysis demonstrates a clear advance over the baseline models (VGG19, DenseNet201, MobileNetV2), with a +3.67% accuracy increase and a 39.3% decrease in training time compared to VGG19. The GradCAM heatmaps clearly highlight tumor-affected areas, offering clinically significant localization and improved interpretability. These findings position MobileDenseAttn as an efficient, high-performance, interpretable model with strong potential to become a clinically practical tool for real-world brain tumor identification.
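
A minimal sketch of the dual-stream, feature-level fusion idea behind MobileDenseAttn, assuming torchvision backbones; the paper's attention and GradCAM components are omitted and the head sizes are illustrative.

import torch
import torch.nn as nn
from torchvision import models

class DualStreamFusion(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # Pretrained weights could be loaded instead of None.
        self.mobile = models.mobilenet_v2(weights=None).features   # -> 1280 channels
        self.dense = models.densenet201(weights=None).features     # -> 1920 channels
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(
            nn.Linear(1280 + 1920, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        f1 = self.pool(self.mobile(x)).flatten(1)        # MobileNetV2 stream
        f2 = self.pool(self.dense(x)).flatten(1)         # DenseNet201 stream
        return self.head(torch.cat([f1, f2], dim=1))     # feature-level fusion

logits = DualStreamFusion()(torch.randn(2, 3, 224, 224))  # output shape (2, 4)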

new Can VLMs Recall Factual Associations From Visual References?

Authors: Dhananjay Ashok, Ashutosh Chaubey, Hirona J. Arai, Jonathan May, Jesse Thomason

Abstract: Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when given a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing this systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
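
The probing setup can be pictured with a small sketch: a logistic-regression probe trained on pre-extracted hidden-state vectors to flag unreliable answers. The file names and the 0.5 threshold are hypothetical, not taken from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

H = np.load("hidden_states.npy")   # shape (N, d): one hidden-state vector per example (hypothetical file)
y = np.load("correctness.npy")     # shape (N,): 1 if the VLM answered correctly (hypothetical file)

H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)

# Selective prediction: only keep answers the probe deems reliable.
reliable = probe.predict_proba(H_te)[:, 1] > 0.5
print("coverage:", reliable.mean(), "probe accuracy:", probe.score(H_te, y_te))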

new SERES: Semantic-aware neural reconstruction from sparse views

Authors: Bo Xu, Yuhu Guo, Yuchao Wang, Wenting Wang, Yeung Yam, Charlie C. L. Wang, Xinyi Le

Abstract: We propose a semantic-aware neural reconstruction method to generate 3D high-fidelity models from sparse images. To tackle the challenge of severe radiance ambiguity caused by mismatched features in sparse input, we enrich neural implicit representations by adding patch-based semantic logits that are optimized together with the signed distance field and the radiance field. A novel regularization based on geometric primitive masks is introduced to mitigate shape ambiguity. The performance of our approach has been verified in experimental evaluation. The average Chamfer distance of our reconstruction on the DTU dataset can be reduced by 44% for SparseNeuS and 20% for VolRecon. When working as a plugin for dense reconstruction baselines such as NeuS and Neuralangelo, the average error on the DTU dataset can be reduced by 69% and 68%, respectively.

new Automated Landfill Detection Using Deep Learning: A Comparative Study of Lightweight and Custom Architectures with the AerialWaste Dataset

Authors: Nowshin Sharmily, Rusab Sarmun, Muhammad E. H. Chowdhury, Mir Hamidul Hussain, Saad Bin Abul Kashem, Molla E Majid, Amith Khandakar

Abstract: Illegal landfills pose a hazardous threat to people all over the world. Because manually identifying landfill locations is arduous, many landfills go unnoticed by authorities and later cause serious harm to people and the environment. Deep learning can play a significant role in identifying these landfills while saving valuable time, manpower, and resources. Despite being a pressing concern, good-quality publicly released datasets for illegal landfill detection are hard to find due to security concerns. The AerialWaste dataset, however, is a large collection of 10,434 images of the Lombardy region of Italy. The images are of varying quality and were collected from three different sources: AGEA Orthophotos, WorldView-3, and Google Earth. The dataset contains professionally curated, diverse, and high-quality images, which makes it particularly suitable for scalable and impactful research. When training several models to compare results, we found complex and heavy models to be prone to overfitting and memorizing training data instead of learning patterns. We therefore chose lightweight, simpler models that could leverage general features from the dataset. In this study, MobileNetV2, GoogLeNet, DenseNet, MobileViT, and other lightweight deep learning models were trained and validated on the dataset and achieved significant success with less overfitting. Having seen substantial performance improvements with some of these models, we combined the best-performing ones into an ensemble model. With this ensemble and fusion technique, binary classification could be performed on this dataset with 92.33% accuracy, 92.67% precision, 92.33% sensitivity, 92.41% F1 score, and 92.71% specificity.
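
The ensemble step can be illustrated with a short soft-voting sketch that averages class probabilities across the lightweight member models; this is a generic construction, not the paper's exact fusion technique.

import torch

@torch.no_grad()
def ensemble_predict(models, images):
    """Average softmax outputs of all member models (soft voting)."""
    probs = [torch.softmax(m(images), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)  # 0 = no landfill, 1 = landfill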

new Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

Authors: Jiangfeng Sun, Sihao He, Zhonghong Ou, Meina Song

Abstract: Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU's interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.

new FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

Authors: Hao Liang, Zhixuan Ge, Ashish Tiwari, Soumendu Majee, G. M. Dilshan Godaliyadda, Ashok Veeraraghavan, Guha Balakrishnan

Abstract: We present FastAvatar, a pose-invariant, feed-forward framework that can generate a 3D Gaussian Splatting (3DGS) model from a single face image from an arbitrary pose in near-instant time (<10ms). FastAvatar uses a novel encoder-decoder neural network design to achieve both fast fitting and identity preservation regardless of input pose. First, FastAvatar constructs a 3DGS face "template" model from a training dataset of faces with multi-view captures. Second, FastAvatar encodes the input face image into an identity-specific and pose-invariant latent embedding, and decodes this embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template 3DGS model. By only inferring residuals in a feed-forward fashion, model inference is fast and robust. FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, and runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars and GASP). In addition, FastAvatar's novel latent space design supports real-time identity interpolation and attribute editing which is not possible with any existing feed-forward 3DGS face generation framework. FastAvatar's combination of excellent reconstruction quality and speed expands the scope of 3DGS for photorealistic avatar applications in consumer and interactive systems.

new Securing Face and Fingerprint Templates in Humanitarian Biometric Systems

Authors: Giuseppe Stragapede, Sam Merrick, Vedrana Krivokuća Hahn, Justin Sukaitis, Vincent Graf Narbel

Abstract: In humanitarian and emergency scenarios, the use of biometrics can dramatically improve the efficiency of operations, but it poses risks for the data subjects, which are exacerbated in contexts of vulnerability. To address this, we present a mobile biometric system implementing a biometric template protection (BTP) scheme suitable for these scenarios. After rigorously formulating the functional, operational, and security and privacy requirements of these contexts, we perform a broad comparative analysis of the BTP landscape. PolyProtect, a method designed to operate on neural network face embeddings, is identified as the most suitable method due to its effectiveness, modularity, and lightweight computational burden. We evaluate PolyProtect in terms of verification and identification accuracy, irreversibility, and unlinkability, when this BTP method is applied to face embeddings extracted using EdgeFace, a novel state-of-the-art efficient feature extractor, on a real-world face dataset from a humanitarian field project in Ethiopia. Moreover, as PolyProtect promises to be modality-independent, we extend its evaluation to fingerprints. To the best of our knowledge, this is the first time that PolyProtect has been evaluated for the identification scenario and for fingerprint biometrics. Our experimental results are promising, and we plan to release our code.

new Why Relational Graphs Will Save the Next Generation of Vision Foundation Models?

Authors: Fatemeh Ziaeetabar

Abstract: Vision foundation models (FMs) have become the predominant architecture in computer vision, providing highly transferable representations learned from large-scale, multimodal corpora. Nonetheless, they exhibit persistent limitations on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations. Such relational competence is indispensable for fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. We advance the position that next-generation FMs should incorporate explicit relational interfaces, instantiated as dynamic relational graphs (graphs whose topology and edge semantics are inferred from the input and task context). We illustrate this position with cross-domain evidence from recent systems in human manipulation action recognition and brain tumor segmentation, showing that augmenting FMs with lightweight, context-adaptive graph-reasoning modules improves fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency relative to FM-only baselines. Importantly, by reasoning sparsely over semantic nodes, such hybrids also achieve favorable memory and hardware efficiency, enabling deployment under practical resource constraints. We conclude with a targeted research agenda for FM-graph hybrids, prioritizing learned dynamic graph construction, multi-level relational reasoning (e.g., part-object-scene in activity understanding, or region-organ in medical imaging), cross-modal fusion, and evaluation protocols that directly probe relational competence in structured vision tasks.

new LPLC: A Dataset for License Plate Legibility Classification

Authors: Lucas Wojcik, Gabriel E. Lima, Valfride Nascimento, Eduil Nascimento Jr., Rayson Laroca, David Menotti

Abstract: Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.

URLs: https://github.com/lmlwojcik/lplc-dataset.

new CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering

Authors: Aranya Saha, Tanvir Ahmed Khan, Ismam Nur Swapnil, Mohammad Ariful Haque

Abstract: Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist's predictions directly guide the Generalist's reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist's responses in factual dermatological knowledge, ensuring both accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.

new VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results

Authors: Sizhuo Ma, Wei-Ting Chen, Qiang Gao, Jian Wang, Chris Wei Zhou, Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu, Xiongkuo Min, Guangtao Zhai, Baoying Chen, Xiongwei Xiao, Jishen Zeng, Wei Wu, Tiexuan Lou, Yuchen Tan, Chunyi Song, Zhiwei Xu, MohammadAli Hamidi, Hadi Amirpour, Mingyin Bai, Jiawang Du, Zhenyu Jiang, Zilong Lu, Ziguan Cui, Zongliang Gan, Xinpeng Li, Shiqi Jiang, Chenhui Li, Changbo Wang, Weijun Yuan, Zhan Li, Yihang Chen, Yifan Deng, Ruting Deng, Zhanglu Chen, Boyang Yao, Shuling Zheng, Feng Zhang, Zhiheng Fu, Abhishek Joshi, Aman Agarwal, Rakhil Immidisetti, Ajay Narasimha Mopidevi, Vishwajeet Shukla, Hao Yang, Ruikun Zhang, Liyuan Pan, Kaixin Deng, Hang Ouyang, Fan yang, Zhizun Luo, Zhuohang Shi, Songning Lai, Weilin Ruan, Yutao Yue

Abstract: Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.

new Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling

Authors: Md. Rashid Shahriar Khan, Md. Abrar Hasan, Mohammod Tareq Aziz Justice

Abstract: Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-Zero-Shot-Anomaly-Detection-in-Surveillance.

URLs: https://github.com/NK-II/Context-Aware-Zero-Shot-Anomaly-Detection-in-Surveillance.
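
The CLIP-based semantic stream can be approximated with a short sketch that scores a frame against normal and anomalous text prompts; the prompts and file path below are illustrative assumptions, and the paper's gating and predictive components are not shown.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Context-specific prompts (illustrative, not the paper's prompt set).
prompts = ["a normal street scene", "people walking calmly",
           "a person fighting", "a car accident", "someone stealing a bag"]
text = clip.tokenize(prompts).to(device)

frame = Image.open("frame.jpg")                       # hypothetical path to one video frame
image = preprocess(frame).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)[0]

anomaly_score = probs[2:].sum().item()  # probability mass assigned to anomalous prompts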

new DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance

Authors: Ajinkya Khoche, Qingwen Zhang, Yixi Cai, Sina Sharif Mansouri, Patric Jensfelt

Abstract: Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit https://ajinkyakhoche.github.io/DogFlow/

URLs: https://ajinkyakhoche.github.io/DogFlow/

new SAT-SKYLINES: 3D Building Generation from Satellite Imagery and Coarse Geometric Priors

Authors: Zhangyu Jin, Andrew Feng

Abstract: We present SatSkylines, a 3D building generation approach that takes satellite imagery and coarse geometric priors as input. Without proper geometric guidance, existing image-based 3D generation methods struggle to recover accurate building structures from the top-down views of satellite images alone. On the other hand, 3D detailization methods tend to rely heavily on highly detailed voxel inputs and fail to produce satisfying results from simple priors such as cuboids. To address these issues, our key idea is to model the transformation from interpolated noisy coarse priors to detailed geometries, enabling flexible geometric control without additional computational cost. We have further developed Skylines-50K, a large-scale dataset of over 50,000 unique and stylized 3D building assets, in order to support the generation of detailed building models. Extensive evaluations demonstrate the effectiveness of our model and its strong generalization ability.

new Adaptive Visual Navigation Assistant in 3D RPGs

Authors: Kaijie Xu, Clark Verbrugge

Abstract: In complex 3D game environments, players rely on visual affordances to spot map transition points. Efficient identification of such points is important to client-side auto-mapping, and provides an objective basis for evaluating map cue presentation. In this work, we formalize the task of detecting traversable Spatial Transition Points (STPs), connectors between two sub-regions, and selecting the singular Main STP (MSTP), the unique STP that lies on the designer-intended critical path toward the player's current macro-objective, from a single game frame, proposing this as a new research focus. We introduce a two-stage deep-learning pipeline that first detects potential STPs using Faster R-CNN and then ranks them with a lightweight MSTP selector that fuses local and global visual features. Both stages benefit from parameter-efficient adapters, and we further introduce an optional retrieval-augmented fusion step. Our primary goal is to establish the feasibility of this problem and set baseline performance metrics. We validate our approach on a custom-built, diverse dataset collected from five Action RPG titles. Our experiments reveal a key trade-off: while full-network fine-tuning produces superior STP detection with sufficient data, adapter-only transfer is significantly more robust and effective in low-data scenarios and for the MSTP selection task. By defining this novel problem, providing a baseline pipeline and dataset, and offering initial insights into efficient model adaptation, we aim to contribute to future AI-driven navigation aids and data-informed level-design tools.

new Wan-S2V: Audio-Driven Cinematic Video Generation

Authors: Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, Lian Zhuo

Abstract: Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.

new Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction

Authors: Huayi Wang, Haochao Ying, Yuyang Xu, Qibo Qiu, Cheng Zhang, Danny Z. Chen, Ying Sun, Jian Wu

Abstract: Cancer survival analysis commonly integrates information across diverse medical modalities to make survival-time predictions. Existing methods primarily focus on extracting different decoupled features of modalities and performing fusion operations such as concatenation, attention, and MoE-based (Mixture-of-Experts) fusion. However, these methods still face two key challenges: i) fixed fusion schemes (concatenation and attention) can lead to model over-reliance on predefined feature combinations, limiting the dynamic fusion of decoupled features; ii) in MoE-based fusion methods, each expert network handles separate decoupled features, which limits information interaction among the decoupled features. To address these challenges, we propose a novel Decoupling-Reorganization-Fusion framework (DeReF), which devises a random feature reorganization strategy between the modality decoupling and dynamic MoE fusion modules. Its advantages are: i) it increases the diversity of feature combinations and granularity, enhancing the generalization ability of the subsequent expert networks; ii) it overcomes the problem of information closure and helps expert networks better capture information among decoupled features. Additionally, we incorporate a regional cross-attention network within the modality decoupling module to improve the representation quality of decoupled features. Extensive experimental results on our in-house Liver Cancer (LC) and three widely used TCGA public datasets confirm the effectiveness of our proposed method. The code will be made publicly available.
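
A conceptual sketch of the reorganize-then-fuse idea (not the released code): decoupled per-modality features are concatenated, their channels randomly reshuffled during training, and the mixed groups routed through a small mixture-of-experts with dynamic gating. Dimensions and expert counts are illustrative.

import torch
import torch.nn as nn

class ReorganizeMoE(nn.Module):
    def __init__(self, n_modal=3, dim=256, n_experts=4):
        super().__init__()
        chunk = n_modal * dim // n_experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(chunk, dim), nn.ReLU()) for _ in range(n_experts))
        self.gate = nn.Linear(n_modal * dim, n_experts)

    def forward(self, decoupled_feats):               # list of (B, dim) per-modality features
        x = torch.cat(decoupled_feats, dim=1)         # (B, n_modal * dim)
        if self.training:                             # random feature reorganization
            x = x[:, torch.randperm(x.size(1))]
        chunks = x.chunk(len(self.experts), dim=1)    # each expert sees a mixed feature group
        w = torch.softmax(self.gate(x), dim=-1)       # dynamic gating weights (B, n_experts)
        out = torch.stack([e(c) for e, c in zip(self.experts, chunks)], dim=1)
        return (w.unsqueeze(-1) * out).sum(dim=1)     # fused representation for survival prediction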

new ROSE: Remove Objects with Side Effects in Videos

Authors: Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao

Abstract: Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object's effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model performance on various side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.

URLs: https://rose2025-inpaint.github.io/.

new OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

Authors: Chunlin Zhong, Qiuxia Hou, Zhangjun Zhou, Shuang Hao, Haonan Lu, Yanhao Zhang, He Tang, Xiang Bai

Abstract: Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on the HMD-270K supervised fine-tuning and GRPO post-training with CSER, we developed OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements compared to baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate video captioning research community advancements.

new Clustering-based Feature Representation Learning for Oracle Bone Inscriptions Detection

Authors: Ye Tao, Xinran Fu, Honglin Pang, Xi Yang, Chuntao Li

Abstract: Oracle Bone Inscriptions (OBIs) play a crucial role in understanding ancient Chinese civilization. The automated detection of OBIs from rubbing images represents a fundamental yet challenging task in digital archaeology, primarily due to various degradation factors, including noise and cracks, that limit the effectiveness of conventional detection networks. To address these challenges, we propose a novel clustering-based feature space representation learning method. Our approach uniquely leverages the Oracle Bones Character (OBC) font library dataset as prior knowledge to enhance feature extraction in the detection network through clustering-based representation learning. The method incorporates a specialized loss function derived from clustering results to optimize feature representation, which is then integrated into the total network loss. We validate the effectiveness of our method by conducting experiments on two OBI detection datasets using three mainstream detection frameworks: Faster R-CNN, DETR, and Sparse R-CNN. Through extensive experimentation, all frameworks demonstrate significant performance improvements.

new SFormer: SNR-guided Transformer for Underwater Image Enhancement from the Frequency Domain

Authors: Xin Tian, Yingtie Lei, Xiujun Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen

Abstract: Recent learning-based underwater image enhancement (UIE) methods have advanced by incorporating physical priors into deep neural networks, particularly using the signal-to-noise ratio (SNR) prior to reduce wavelength-dependent attenuation. However, spatial domain SNR priors have two limitations: (i) they cannot effectively separate cross-channel interference, and (ii) they provide limited help in amplifying informative structures while suppressing noise. To overcome these, we propose using the SNR prior in the frequency domain, decomposing features into amplitude and phase spectra for better channel modulation. We introduce the Fourier Attention SNR-prior Transformer (FAST), combining spectral interactions with SNR cues to highlight key spectral components. Additionally, the Frequency Adaptive Transformer (FAT) bottleneck merges low- and high-frequency branches using a gated attention mechanism to enhance perceptual quality. Embedded in a unified U-shaped architecture, these modules integrate a conventional RGB stream with an SNR-guided branch, forming SFormer. Trained on 4,800 paired images from UIEB, EUVP, and LSUI, SFormer surpasses recent methods with a 3.1 dB gain in PSNR and 0.08 in SSIM, successfully restoring colors, textures, and contrast in underwater scenes.
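
The frequency-domain use of the SNR prior can be pictured with a minimal sketch that splits features into amplitude and phase spectra and modulates the amplitude with an SNR-derived weight; the weighting function below is a placeholder, not the paper's FAST module.

import torch

def frequency_modulate(feat, snr_map):
    """feat: (B, C, H, W) features; snr_map: (B, 1, H, W) SNR prior."""
    spec = torch.fft.fft2(feat, norm="ortho")
    amp, phase = spec.abs(), spec.angle()             # amplitude / phase decomposition
    weight = torch.sigmoid(torch.fft.fft2(snr_map, norm="ortho").abs())
    amp = amp * weight                                # emphasize informative spectral components
    spec = torch.polar(amp, phase)                    # recombine amplitude and phase
    return torch.fft.ifft2(spec, norm="ortho").real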

new Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos

Authors: Dongfang Wang, Jian Yang, Yizhe Zhang, Tao Zhou

Abstract: Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their results do not perform well in EF estimation. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network for echocardiography video, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors.

new Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency

Authors: Zhitong Cheng, Yiran Jiang, Yulong Ge, Yufeng Li, Zhongheng Qin, Rongzhi Lin, Jianwei Ma

Abstract: Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors - an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Unlike fine-tuning entire pre-trained models - which risks introducing unpredictable feature distortions - we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.
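
The core recipe, freezing the encoder and optimizing only the decision boundaries on offline features, can be sketched as follows; the entropy term on target features is a stand-in for the paper's boundary-search procedure, and all hyperparameters are assumptions.

import torch
import torch.nn as nn

def fit_planes(src_feats, src_labels, tgt_feats, num_classes, epochs=100):
    """src_feats/tgt_feats: feature tensors extracted once by a frozen encoder."""
    clf = nn.Linear(src_feats.size(1), num_classes)   # only the decision planes are learned
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(epochs):
        ce = nn.functional.cross_entropy(clf(src_feats), src_labels)
        p = torch.softmax(clf(tgt_feats), dim=1)
        ent = -(p * p.clamp_min(1e-8).log()).sum(1).mean()  # sharpen boundaries on target data
        loss = ce + 0.1 * ent
        opt.zero_grad(); loss.backward(); opt.step()
    return clf  # the frozen feature encoder is never touched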

new A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition

Authors: Wasi Ullah, Yasir Noman Khalid, Saddam Hussain Khan

Abstract: Human Activity Recognition (HAR) plays a pivotal role in various applications, including smart surveillance, healthcare, assistive technologies, sports analytics, etc. However, HAR systems still face critical challenges, including high computational costs, redundant features, and limited scalability in real-time scenarios. An optimized hybrid deep learning framework is introduced that integrates a customized InceptionV3, an LSTM architecture, and a novel ensemble-based feature selection strategy. The proposed framework first extracts spatial descriptors using the customized InceptionV3 model, which captures multilevel contextual patterns, region homogeneity, and fine-grained localization cues. The temporal dependencies across frames are then modeled using LSTMs to effectively encode motion dynamics. Finally, an ensemble-based genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA) is employed to select a compact and optimized feature set by dynamically balancing objectives such as accuracy, redundancy, uniqueness, and complexity reduction. Consequently, the selected feature subsets, which are both diverse and discriminative, enable various lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. Experimental results on the challenging UCF-YouTube dataset, which presents difficulties such as occlusion, cluttered backgrounds, motion dynamics, and poor illumination, demonstrate good performance. The proposed approach achieves 99.65% recognition accuracy, reduces features to as few as 7, and improves inference time. The lightweight and scalable nature of the HAR system supports real-time deployment on edge devices such as Raspberry Pi, enabling practical applications in intelligent, resource-aware environments, including public safety, assistive technology, and autonomous monitoring systems.

new ColorGS: High-fidelity Surgical Scene Reconstruction with Colored Gaussian Splatting

Authors: Qun Ji, Peng Li, Mingqiang Wei

Abstract: High-fidelity reconstruction of deformable tissues from endoscopic videos remains challenging due to the limitations of existing methods in capturing subtle color variations and modeling global deformations. While 3D Gaussian Splatting (3DGS) enables efficient dynamic reconstruction, its fixed per-Gaussian color assignment struggles with intricate textures, and linear deformation modeling fails to model consistent global deformation. To address these issues, we propose ColorGS, a novel framework that integrates spatially adaptive color encoding and enhanced deformation modeling for surgical scene reconstruction. First, we introduce Colored Gaussian Primitives, which employ dynamic anchors with learnable color parameters to adaptively encode spatially varying textures, significantly improving color expressiveness under complex lighting and tissue similarity. Second, we design an Enhanced Deformation Model (EDM) that combines time-aware Gaussian basis functions with learnable time-independent deformations, enabling precise capture of both localized tissue deformations and global motion consistency caused by surgical interactions. Extensive experiments on DaVinci robotic surgery videos and benchmark datasets (EndoNeRF, StereoMIS) demonstrate that ColorGS achieves state-of-the-art performance, attaining a PSNR of 39.85 (1.5 higher than prior 3DGS-based methods) and superior SSIM (97.25%) while maintaining real-time rendering efficiency. Our work advances surgical scene reconstruction by balancing high fidelity with computational practicality, critical for intraoperative guidance and AR/VR applications.

new Class-wise Flooding Regularization for Imbalanced Image Classification

Authors: Hiroaki Aizawa, Yuta Naito, Kohei Fukuda

Abstract: The purpose of training neural networks is to achieve high generalization performance on unseen inputs. However, when trained on imbalanced datasets, a model's prediction tends to favor majority classes over minority classes, leading to significant degradation in the recognition performance of minority classes. To address this issue, we propose class-wise flooding regularization, an extension of flooding regularization applied at the class level. Flooding is a regularization technique that mitigates overfitting by preventing the training loss from falling below a predefined threshold, known as the flooding level, thereby discouraging memorization. Our proposed method assigns a class-specific flooding level based on class frequencies. By doing so, it suppresses overfitting in majority classes while allowing sufficient learning for minority classes. We validate our approach on imbalanced image classification. Compared to conventional flooding regularizations, our method improves the classification performance of minority classes and achieves better overall generalization.
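
A minimal sketch of class-wise flooding: the flooding operation |L_c - b_c| + b_c is applied to each class's mean loss, with a class-specific flooding level derived from class frequency. The schedule for b_c below is an assumption, not the paper's exact formula.

import torch
import torch.nn.functional as F

def classwise_flooding_loss(logits, targets, class_freq, base_level=0.05):
    """class_freq: 1-D tensor of per-class sample counts in the training set."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    levels = base_level * class_freq / class_freq.max()   # larger flooding level b_c for majority classes
    loss = 0.0
    for c in targets.unique():
        mask = targets == c
        l_c = per_sample[mask].mean()                      # mean loss of class c in this batch
        b_c = levels[c]
        loss = loss + ((l_c - b_c).abs() + b_c) * mask.float().mean()  # flooding: |L_c - b_c| + b_c
    return loss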

new Flatness-aware Curriculum Learning via Adversarial Difficulty

Authors: Hiroaki Aizawa, Yoshikazu Hayashi

Abstract: Neural networks trained by empirical risk minimization often suffer from overfitting, especially to specific samples or domains, which leads to poor generalization. Curriculum Learning (CL) addresses this issue by selecting training samples based on the difficulty. From the optimization perspective, methods such as Sharpness-Aware Minimization (SAM) improve robustness and generalization by seeking flat minima. However, combining CL with SAM is not straightforward. In flat regions, both the loss values and the gradient norms tend to become uniformly small, which makes it difficult to evaluate sample difficulty and design an effective curriculum. To overcome this problem, we propose the Adversarial Difficulty Measure (ADM), which quantifies adversarial vulnerability by leveraging the robustness properties of models trained toward flat minima. Unlike loss- or gradient-based measures, which become ineffective as training progresses into flatter regions, ADM remains informative by measuring the normalized loss gap between original and adversarial examples. We incorporate ADM into CL-based training with SAM to dynamically assess sample difficulty. We evaluated our approach on image classification tasks, fine-grained recognition, and domain generalization. The results demonstrate that our method preserves the strengths of both CL and SAM while outperforming existing curriculum-based and flatness-aware training strategies.
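
The adversarial difficulty idea can be sketched with a one-step FGSM perturbation and a normalized gap between adversarial and clean losses; epsilon and the normalization are assumptions rather than the paper's exact definition.

import torch
import torch.nn.functional as F

def adversarial_difficulty(model, x, y, eps=4 / 255):
    """Higher scores indicate samples more vulnerable to adversarial perturbation."""
    x = x.clone().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y, reduction="none")
    grad = torch.autograd.grad(clean_loss.sum(), x)[0]
    x_adv = (x + eps * grad.sign()).detach()              # one-step FGSM perturbation
    with torch.no_grad():
        adv_loss = F.cross_entropy(model(x_adv), y, reduction="none")
    clean = clean_loss.detach()
    return (adv_loss - clean) / (clean + 1e-8)             # normalized loss gap per sample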

new Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

Authors: Melanie Wille, Tobias Fischer, Scarlett Raine

Abstract: Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO dataset to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.

new Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vectorized Drawings

Authors: Feiwei Qin, Shichao Lu, Junhao Hou, Changmiao Wang, Meie Fang, Ligang Liu

Abstract: Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.

URLs: https://github.com/lllssc/Drawing2CAD.

new Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

Authors: DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang

Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
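
A conceptual sketch of router-gated fusion: a small router produces per-audio-token reliability gates that decide how much visual context, injected via cross-attention, replaces the audio stream. Layer sizes and the pooling of visual features are illustrative.

import torch
import torch.nn as nn

class RouterGatedFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 1), nn.Sigmoid())
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):                 # audio: (B, Ta, D), visual: (B, Tv, D)
        pooled_v = visual.mean(dim=1, keepdim=True).expand(-1, audio.size(1), -1)
        gate = self.router(torch.cat([audio, pooled_v], dim=-1))   # per-token reliability (B, Ta, 1)
        vis_ctx, _ = self.cross_attn(audio, visual, visual)        # visual cues gathered per audio token
        return gate * audio + (1 - gate) * vis_ctx    # lean on vision when audio is unreliable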

new Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

Authors: Qinqian Lei, Bo Wang, Robby T. Tan

Abstract: Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.

new Beyond the Textual: Generating Coherent Visual Options for MCQs

Authors: Wanqiang Wang, Longzhu He, Wei Zheng

Abstract: Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options and has largely overlooked visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates a Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answers and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation and visual option generation over existing methods across various subjects and educational levels.

new Design, Implementation and Evaluation of a Real-Time Remote Photoplethysmography (rPPG) Acquisition System for Non-Invasive Vital Sign Monitoring

Authors: Constantino Álvarez Casado, Sasan Sharifipour, Manuel Lage Cañellas, Nhi Nguyen, Le Nguyen, Miguel Bordallo López

Abstract: The growing integration of smart environments and low-power computing devices, coupled with mass-market sensor technologies, is driving advancements in remote and non-contact physiological monitoring. However, deploying these systems in real-time on resource-constrained platforms introduces significant challenges related to scalability, interoperability, and performance. This paper presents a real-time remote photoplethysmography (rPPG) system optimized for low-power devices, designed to extract physiological signals, such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2), from facial video streams. The system is built on the Face2PPG pipeline, which processes video frames sequentially for rPPG signal extraction and analysis, while leveraging a multithreaded architecture to manage video capture, real-time processing, network communication, and graphical user interface (GUI) updates concurrently. This design ensures continuous, reliable operation at 30 frames per second (fps), with adaptive feedback through a collaborative user interface to guide optimal signal capture conditions. The network interface includes both an HTTP server for continuous video streaming and a RESTful API for on-demand vital sign retrieval. To ensure accurate performance despite the limitations of low-power devices, we use a hybrid programming model combining Functional Reactive Programming (FRP) and the Actor Model, allowing event-driven processing and efficient task parallelization. The system is evaluated under real-time constraints, demonstrating robustness while minimizing computational overhead. Our work addresses key challenges in real-time biosignal monitoring, offering practical solutions for optimizing performance in modern healthcare and human-computer interaction applications.
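
As an illustration of the final signal-analysis step in such an rPPG pipeline, the sketch below estimates heart rate from a cleaned pulse signal by locating the dominant frequency in the physiological band; the 30 fps sampling rate and band limits are assumptions, not details of the Face2PPG implementation.

import numpy as np

def heart_rate_bpm(rppg_signal, fps=30.0):
    """Estimate heart rate (beats per minute) from a 1-D rPPG signal."""
    signal = rppg_signal - np.mean(rppg_signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)            # roughly 42-240 bpm
    return 60.0 * freqs[band][np.argmax(power[band])]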

new PseudoMapTrainer: Learning Online Mapping without HD Maps

Authors: Christian Löwens, Thorben Funke, Jingchao Xie, Alexandru Paul Condurache

Abstract: Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.

new Robust and Label-Efficient Deep Waste Detection

Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan

Abstract: Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.

URLs: https://github.com/h-abid97/robust-waste-detection.

new Embedding Font Impression Word Tags Based on Co-occurrence

Authors: Yugo Kubota, Seiichi Uchida

Abstract: Different font styles (i.e., font shapes) convey distinct impressions, indicating a close relationship between font shapes and word tags describing those impressions. This paper proposes a novel embedding method for impression tags that leverages these shape-impression relationships. For instance, our method assigns similar vectors to impression tags that frequently co-occur in order to represent impressions of fonts, whereas standard word embedding methods (e.g., BERT and CLIP) yield very different vectors. This property is particularly useful for impression-based font generation and font retrieval. Technically, we construct a graph whose nodes represent impression tags and whose edges encode co-occurrence relationships. Then, we apply spectral embedding to obtain the impression vectors for each tag. We compare our method with BERT and CLIP in qualitative and quantitative evaluations, demonstrating that our approach performs better in impression-guided font generation.
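
To make the co-occurrence-based embedding concrete, here is a minimal spectral-embedding sketch on a toy tag graph; the tags, co-occurrence counts, and 2-dimensional embedding size are illustrative assumptions, not the paper's data or code.

```python
# Minimal sketch: spectral embedding of an impression-tag co-occurrence graph.
import numpy as np
from scipy.linalg import eigh

tags = ["elegant", "formal", "playful", "bold"]
# Symmetric co-occurrence counts (how often two tags describe the same font).
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 4],
              [0, 1, 4, 0]], dtype=float)

# Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
L = np.eye(len(tags)) - D_inv_sqrt @ W @ D_inv_sqrt

# Eigenvectors of the smallest eigenvalues give the embedding:
# tags that co-occur often end up with nearby vectors.
eigvals, eigvecs = eigh(L)
k = 2                                   # embedding dimension
embedding = eigvecs[:, 1:k + 1]         # skip the first (trivial) eigenvector
for tag, vec in zip(tags, embedding):
    print(tag, vec)
```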

new Deep Pre-trained Time Series Features for Tree Species Classification in the Dutch Forest Inventory

Authors: Takayuki Ishikawa, Carmelo Bonannella, Bas J. W. Lerink, Marc Ru{\ss}wurm

Abstract: National Forest Inventories (NFIs) serve as the primary source of forest information, providing crucial tree species distribution data. However, maintaining these inventories requires labor-intensive on-site campaigns. Remote sensing approaches, particularly when combined with machine learning, offer opportunities to update NFIs more frequently and at larger scales. While the use of Satellite Image Time Series has proven effective for distinguishing tree species through seasonal canopy reflectance patterns, current approaches rely primarily on Random Forest classifiers with hand-designed features and phenology-based metrics. Using deep features from available pre-trained remote sensing foundation models offers a complementary strategy. These pre-trained models leverage unannotated global data, are intended for general-purpose applications, and can then be efficiently fine-tuned with smaller labeled datasets for specific classification tasks. This work systematically investigates how deep features improve tree species classification accuracy in the Netherlands when few annotated data are available. We extracted time series from Sentinel-1, Sentinel-2 and ERA5 data, together with SRTM elevation data, using Google Earth Engine. Our results demonstrate that fine-tuning a publicly available remote sensing time series foundation model outperforms the current state-of-the-art in NFI classification in the Netherlands by a large margin of up to 10% across all datasets. This demonstrates that classic hand-defined harmonic features are too simple for this task and highlights the potential of deep AI features for data-limited applications like NFI classification. By leveraging openly available satellite data and pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.

new Automated Classification of Normal and Atypical Mitotic Figures Using ConvNeXt V2: MIDOG 2025 Track 2

Authors: Yosuke Yamagishi, Shouhei Hanaoka

Abstract: This paper presents our solution for the MIDOG 2025 Challenge Track 2, which focuses on binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs) in histopathological images. Our approach leverages a ConvNeXt V2 base model with center-cropping preprocessing and a 5-fold cross-validation ensemble strategy. The method addresses key challenges including severe class imbalance, high morphological variability, and domain heterogeneity across different tumor types, species, and scanners. Through strategic preprocessing with 60% center cropping and mixed precision training, our model achieved robust performance on the diverse MIDOG 2025 dataset. The solution demonstrates the effectiveness of modern convolutional architectures for mitotic figure subtyping while maintaining computational efficiency through careful architectural choices and training optimizations.

new Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression

Authors: Zizheng Guo, Bochao Zou, Yinuo Jia, Xiangyu Li, Huimin Ma

Abstract: Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual's genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model's capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)$^3$ and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.

URLs: https://github.com/zizheng-guo/BoostingVRME.

new Quantitative Outcome-Oriented Assessment of Microsurgical Anastomosis

Authors: Luyin Hu, Soheil Gholami, George Dindelegan, Torstein R. Meling, Aude Billard

Abstract: Microsurgical anastomosis demands exceptional dexterity and visuospatial skills, underscoring the importance of comprehensive training and precise outcome assessment. Currently, methods such as the outcome-oriented anastomosis lapse index are used to evaluate this procedure. However, they often rely on subjective judgment, which can introduce biases that affect the reliability and efficiency of the assessment of competence. Leveraging three datasets from hospitals with participants at various levels, we introduce a quantitative framework that uses image-processing techniques for objective assessment of microsurgical anastomoses. The approach uses geometric modeling of errors along with a detection and scoring mechanism, enhancing the efficiency and reliability of microsurgical proficiency assessment and advancing training protocols. The results show that the geometric metrics effectively replicate expert raters' scoring for the errors considered in this work.

new Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization

Authors: Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta, Haonan Luo, Tianrui Li

Abstract: Video stabilization remains a fundamental problem in computer vision; pixel-level synthesis solutions, which synthesize full-frame outputs, add to the complexity of this task. These methods aim to enhance stability while synthesizing full-frame videos, but the inherent diversity in motion profiles and visual content present in each video sequence makes robust generalization with fixed parameters difficult. To address this, we present a novel method that improves pixel-level synthesis video stabilization methods by rapidly adapting models to each input video at test time. The proposed approach takes advantage of low-level visual cues available during inference to improve both the stability and visual quality of the output. Notably, the proposed rapid adaptation achieves significant performance gains even with a single adaptation pass. We further propose a jerk localization module and a targeted adaptation strategy that focuses the adaptation on high-jerk segments, maximizing stability with fewer adaptation steps. The proposed methodology enables modern stabilizers to surpass the longstanding SOTA approaches while retaining their full-frame nature and offering users control mechanisms akin to classical approaches. Extensive experiments on diverse real-world datasets demonstrate the versatility of the proposed method. Our approach consistently improves the performance of various full-frame synthesis models in both qualitative and quantitative terms, including results on downstream applications.
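
For intuition, jerk is the third temporal derivative of the estimated camera trajectory; the toy sketch below flags frames whose jerk magnitude exceeds a threshold. The trajectory, frame rate, and threshold are made up, and this is only a finite-difference illustration, not the paper's jerk localization module.

```python
# Toy sketch of jerk-based segment localization via finite differences.
import numpy as np

def high_jerk_frames(positions: np.ndarray, fps: float, threshold: float) -> np.ndarray:
    dt = 1.0 / fps
    velocity = np.gradient(positions, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    jerk = np.gradient(acceleration, dt, axis=0)          # third temporal derivative
    jerk_mag = np.linalg.norm(jerk, axis=-1)
    return np.flatnonzero(jerk_mag > threshold)            # frame indices to adapt on

traj = np.cumsum(np.random.default_rng(0).normal(size=(120, 2)), axis=0)  # fake 2D path
print(high_jerk_frames(traj, fps=30.0, threshold=500.0))
```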

new Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models

Authors: Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou, Yong Xia

Abstract: Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.

new DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

Authors: Zhehao Li, Chong Wang, Yi Chen, Yinghao Lu, Jiangbo Qian, Jiong Wang, Jiafei Wu

Abstract: Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model's effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model's ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.

URLs: https://github.com/lzzhhh1019/DQEN.

new Interpretable Decision-Making for End-to-End Autonomous Driving

Authors: Mona Mirzaie, Bodo Rosenhahn

Abstract: Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To this end, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

new Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025

Authors: Thien-Phuc Tran, Minh-Quang Nguyen, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do, Duy-Nam Ly, Viet-Tham Huynh, Khanh-Duy Le, Mai-Khiem Tran, Trung-Nghia Le

Abstract: The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.

URLs: https://ltnghia.github.io/eventa/eventa-2025.

new Preliminary Study on Space Utilization and Emergent Behaviors of Group vs. Single Pedestrians in Real-World Trajectories

Authors: Amartaivan Sanjjamts, Morita Hiroshi

Abstract: This study presents an initial framework for distinguishing group and single pedestrians based on real-world trajectory data, with the aim of analyzing their differences in space utilization and emergent behavioral patterns. By segmenting pedestrian trajectories into fixed time bins and applying a Transformer-based pair classification model, we identify cohesive groups and isolate single pedestrians through a structured sequence-based filtering process. To prepare for deeper analysis, we establish a comprehensive metric framework incorporating both spatial and behavioral dimensions. Spatial utilization metrics include convex hull area, smallest enclosing circle radius, and heatmap-based spatial densities to characterize how different pedestrian types occupy and interact with space. Behavioral metrics such as velocity change, motion angle deviation, clearance radius, and trajectory straightness are designed to capture local adaptations and responses during interactions. Furthermore, we introduce a typology of encounter types (single-to-single, single-to-group, and group-to-group) to categorize and later quantify different interaction scenarios. Although this version focuses primarily on the classification pipeline and dataset structuring, it establishes the groundwork for scalable analysis across different sequence lengths (60, 100, and 200 frames). Future versions will incorporate complete quantitative analysis of the proposed metrics and their implications for pedestrian simulation and space design validation in crowd dynamics research.
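
Two of the listed metrics can be computed directly from trajectory coordinates; the sketch below shows convex hull area and trajectory straightness on toy data. The scipy dependency and the exact metric definitions are assumptions and may differ from the paper's.

```python
# Illustrative computation of convex hull area (space occupied within a time bin)
# and trajectory straightness (net displacement / path length). Toy data only.
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_area(points: np.ndarray) -> float:
    # For 2D inputs, ConvexHull.volume is the enclosed area
    # (ConvexHull.area would be the perimeter).
    return ConvexHull(points).volume

def straightness(traj: np.ndarray) -> float:
    net = np.linalg.norm(traj[-1] - traj[0])
    path = np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()
    return net / path if path > 0 else 1.0

group_positions = np.array([[0.0, 0.0], [1.0, 0.2], [0.5, 1.0], [1.2, 1.1]])
trajectory = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.3], [1.6, 0.2]])
print(convex_hull_area(group_positions), straightness(trajectory))
```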

new The point is the mask: scaling coral reef segmentation with weak supervision

Authors: Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly

Abstract: Monitoring coral reefs at large spatial scales remains an open challenge, essential for assessing ecosystem health and informing conservation efforts. While drone-based aerial imagery offers broad spatial coverage, its limited resolution makes it difficult to reliably distinguish fine-scale classes, such as coral morphotypes. At the same time, obtaining pixel-level annotations over large spatial extents is costly and labor-intensive, limiting the scalability of deep learning-based segmentation methods for aerial imagery. We present a multi-scale weakly supervised semantic segmentation framework that addresses this challenge by transferring fine-scale ecological information from underwater imagery to aerial data. Our method enables large-scale coral reef mapping from drone imagery with minimal manual annotation, combining classification-based supervision, spatial interpolation and self-distillation techniques. We demonstrate the efficacy of the approach, enabling large-area segmentation of coral morphotypes and showing flexibility for integrating new classes. This study presents a scalable, cost-effective methodology for high-resolution reef monitoring, combining low-cost data collection, weakly supervised deep learning and multi-scale remote sensing.

new Generative AI in Map-Making: A Technical Exploration and Its Implications for Cartographers

Authors: Claudio Affolter, Sidi Wu, Yizi Chen, Lorenz Hurni

Abstract: Traditional map-making relies heavily on Geographic Information Systems (GIS), requiring domain expertise and being time-consuming, especially for repetitive tasks. Recent advances in generative AI (GenAI), particularly image diffusion models, offer new opportunities for automating and democratizing the map-making process. However, these models struggle with accurate map creation due to limited control over spatial composition and semantic layout. To address this, we integrate vector data to guide map generation in different styles, specified by textual prompts. Our model is the first to generate accurate maps in controlled styles, and we have integrated it into a web application to improve its usability and accessibility. We conducted a user study with professional cartographers to assess the fidelity of generated maps, the usability of the web application, and the implications of ever-emerging GenAI in map-making. The findings suggest the potential of our application and, more generally, of GenAI models to help both non-expert users and professionals create maps more efficiently. We also outline further technical improvements and emphasize the new role of cartographers in advancing the paradigm of AI-assisted map-making.

new Enhancing compact convolutional transformers with super attention

Authors: Simpenzwe Honore Leandre, Natenaile Asmamaw Shiferaw, Dillip Rout

Abstract: In this paper, we propose a vision model that adopts token mixing, sequence pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. On the CIFAR100 benchmark, our model significantly improves the baseline top-1 and top-5 validation accuracy from 36.50% to 46.29% and from 66.33% to 76.31%, while being more efficient than Scaled Dot Product Attention (SDPA) transformers when the context length is less than the embedding dimension, at only 60% of their size. In addition, the architecture demonstrates high training stability and does not rely on techniques such as mixup data augmentation, positional embeddings, or learning rate scheduling. We make our code available on GitHub.
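
As an illustration of one named component, the sketch below implements sequence pooling in the sense used by compact convolutional transformers: a learned softmax weighting over token embeddings replaces a class token. PyTorch and the dimensions are assumptions; this is not the authors' code.

```python
# Sketch of sequence pooling: a learned softmax-weighted average over tokens.
import torch
import torch.nn as nn

class SequencePooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)   # scores one weight per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim)
        weights = torch.softmax(self.attn(tokens), dim=1)   # (batch, seq_len, 1)
        return (weights * tokens).sum(dim=1)                 # (batch, dim)

pooled = SequencePooling(dim=128)(torch.randn(2, 49, 128))
print(pooled.shape)  # torch.Size([2, 128])
```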

new USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Authors: Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He

Abstract: Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO

URLs: https://github.com/bytedance/USO

new Can we make NeRF-based visual localization privacy-preserving?

Authors: Maxime Pietrantoni, Martin Humenberger, Torsten Sattler, Gabriela Csurka

Abstract: Visual localization (VL) is the task of estimating the camera pose in a known scene. VL methods can be distinguished, among other things, by how they represent the scene, e.g., explicitly through a (sparse) point cloud or a collection of images, or implicitly through the weights of a neural network. Recently, NeRF-based methods have become popular for VL. While NeRFs offer high-quality novel view synthesis, they inadvertently encode fine scene details, raising privacy concerns when deployed in cloud-based localization services as sensitive information could be recovered. In this paper, we tackle this challenge on two ends. We first propose a new protocol to assess the privacy preservation of NeRF-based representations. We show that NeRFs trained with photometric losses store fine-grained details in their geometry representations, making them vulnerable to privacy attacks, even if the head that predicts colors is removed. Second, we propose ppNeSF (Privacy-Preserving Neural Segmentation Field), a NeRF variant trained with segmentation supervision instead of RGB images. These segmentation labels are learned in a self-supervised manner, ensuring they are coarse enough to obscure identifiable scene details while remaining discriminative in 3D. The segmentation space of ppNeSF can be used for accurate visual localization, yielding state-of-the-art results.

new Enhancing Document VQA Models via Retrieval-Augmented Generation

Authors: Eric L\'opez, Artemis Llabr\'es, Ernest Valveny

Abstract: Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
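
A minimal sketch of the text-based retrieval idea: pages are ranked by similarity between the question and their OCR text, and only the top-k pages are passed to the answer generator. TF-IDF, scikit-learn, and the toy pages are assumptions for illustration; the paper's retriever and reranker are not reproduced here.

```python
# Rank pages by TF-IDF similarity to the question, keep the top-k for generation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_pages(question: str, page_ocr_texts: list[str], k: int = 3) -> list[int]:
    vectorizer = TfidfVectorizer()
    page_vecs = vectorizer.fit_transform(page_ocr_texts)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, page_vecs).ravel()
    return scores.argsort()[::-1][:k].tolist()

pages = ["invoice total 42 euros ...", "terms and conditions ...", "shipping address ..."]
top = retrieve_pages("What is the total amount of the invoice?", pages, k=1)
print(top)  # indices of the pages passed on to the answer generator
```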

new Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone

Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth

Abstract: As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

new RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation

Authors: Siyuan You, Guozheng Xu, Pengwei Zhou, Qiwen Jin, Jian Yao, Li Li

Abstract: Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches for roof plane segmentation rely on manually designed or learned features followed by some specifically designed geometric clustering strategies. Because learned features are more powerful than manually designed features, deep learning-based approaches usually perform better than traditional approaches. However, current deep learning-based approaches have three unsolved problems. The first is that most of them are not truly end-to-end, so the plane segmentation results may not be optimal. The second is that the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. The third is that the planar geometric characteristics are not sufficiently considered to constrain the network training. To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict the plane instance masks with the use of a set of learnable plane queries. To further improve the segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module (EAMM) that sufficiently incorporates planar geometric priors of edges to enhance its discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and also propose a new plane geometric loss to constrain the network training.

new MicroDetect-Net (MDN): Leveraging Deep Learning to Detect Microplastics in Clam Blood, a Step Towards Human Blood Analysis

Authors: Riju Marwah, Riya Arora, Navneet Yadav, Himank Arora

Abstract: With the prevalence of plastics exceeding 368 million tons yearly, microplastic pollution has grown to an extent where air, water, soil, and living organisms have all tested positive for microplastic presence. These particles, which are smaller than 5 millimeters in size, are no less harmful to humans than to the environment. Toxicity research on microplastics has shown that exposure may cause liver infection, intestinal injuries, and gut flora imbalance, leading to numerous potential health hazards. This paper presents a new model, MicroDetect-Net (MDN), which applies fluorescence microscopy with Nile Red dye staining and deep learning to scan blood samples for microplastics. Although clam blood has certain limitations in replicating real human blood, this study opens avenues for applying the approach to human samples, which are more consistent for preliminary data collection. The MDN model integrates dataset preparation, fluorescence imaging, and segmentation using a convolutional neural network to localize and count microplastic fragments. The combination of convolutional networks and Nile Red dye staining yielded accurate segmentation and detection. MDN was evaluated on a dataset of 276 Nile Red-stained fluorescent blood images and achieved an accuracy of 92 percent. Robust performance was observed with an Intersection over Union of 87.4 percent, F1 score of 92.1 percent, precision of 90.6 percent, and recall of 93.7 percent. These metrics demonstrate the effectiveness of MDN in the detection of microplastics.
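
Downstream of the CNN segmentation, localizing and counting fragments reduces to connected-component labeling of the predicted binary mask; a toy sketch follows. The synthetic mask and the scipy dependency are assumptions, not the authors' pipeline.

```python
# Connected-component labeling of a binary microplastic mask to count fragments.
import numpy as np
from scipy import ndimage

mask = np.zeros((64, 64), dtype=bool)
mask[5:10, 5:9] = True       # one synthetic fragment
mask[40:43, 50:60] = True    # another synthetic fragment

labels, n_fragments = ndimage.label(mask)
sizes = ndimage.sum(mask, labels, index=range(1, n_fragments + 1))
print(n_fragments, sizes)    # fragment count and per-fragment pixel areas
```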

new ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao

Abstract: Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.

URLs: https://github.com/BUAAPY/ProPy.

new GReAT: leveraging geometric artery data to improve wall shear stress assessment

Authors: Julian Suk, Jolanda J. Wentzel, Patryk Rygiel, Joost Daemen, Daniel Rueckert, Jelmer M. Wolterink

Abstract: Leveraging big data for patient care is promising in many medical fields such as cardiovascular health. For example, hemodynamic biomarkers like wall shear stress could be assessed from patient-specific medical images via machine learning algorithms, bypassing the need for time-intensive computational fluid simulation. However, it is extremely challenging to amass large-enough datasets to effectively train such models. We could address this data scarcity by means of self-supervised pre-training and foundation models given large datasets of geometric artery models. In the context of coronary arteries, leveraging learned representations to improve hemodynamic biomarker assessment has not yet been well studied. In this work, we address this gap by investigating whether a large dataset (8449 shapes) consisting of geometric models of 3D blood vessels can benefit wall shear stress assessment in coronary artery models from a small-scale clinical trial (49 patients). We create a self-supervised target for the 3D blood vessels by computing the heat kernel signature, a quantity obtained via Laplacian eigenvectors, which captures the very essence of the shapes. We show how geometric representations learned from this dataset can boost segmentation of coronary arteries into regions of low, mid and high (time-averaged) wall shear stress even when trained on limited data.
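
The heat kernel signature used as the self-supervised target has a standard closed form given the Laplacian eigenpairs, HKS(x, t) = sum_i exp(-lambda_i t) phi_i(x)^2; a small sketch with toy eigenpairs (not a real vessel mesh) is shown below.

```python
# Minimal sketch of the heat kernel signature (HKS) from Laplacian eigenpairs.
# evals (k,) and evecs (n_vertices, k) are assumed to come from an
# eigendecomposition of the shape's Laplace-Beltrami operator.
import numpy as np

def heat_kernel_signature(evals, evecs, t_values):
    """HKS(x, t) = sum_i exp(-lambda_i * t) * phi_i(x)^2."""
    evals = np.asarray(evals)                       # (k,)
    phi_sq = np.asarray(evecs) ** 2                 # (n, k)
    decay = np.exp(-np.outer(t_values, evals))      # (T, k): exp(-lambda_i * t_j)
    return phi_sq @ decay.T                         # (n, T): one descriptor row per vertex

evals = np.array([0.0, 0.8, 1.5, 3.2])              # toy eigenvalues
evecs = np.random.default_rng(0).normal(size=(10, 4))
hks = heat_kernel_signature(evals, evecs, t_values=np.geomspace(0.1, 10.0, 5))
print(hks.shape)  # (10, 5)
```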

new No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes

Authors: Bla\v{z} Rolih, Matic Fu\v{c}ka, Danijel Sko\v{c}aj

Abstract: Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: https://github.com/blaz-r/SuperSimpleNet

URLs: https://github.com/blaz-r/SuperSimpleNet

new Learning Binary Sampling Patterns for Single-Pixel Imaging using Bilevel Optimisation

Authors: Serban C. Tudosie, Alexander Denker, Zeljko Kereta, Simon Arridge

Abstract: Single-Pixel Imaging enables reconstructing objects using a single detector through sequential illuminations with structured light patterns. We propose a bilevel optimisation method for learning task-specific, binary illumination patterns, optimised for applications like single-pixel fluorescence microscopy. We address the non-differentiable nature of binary pattern optimisation using the Straight-Through Estimator and leveraging a Total Deep Variation regulariser in the bilevel formulation. We demonstrate our method on the CytoImageNet microscopy dataset and show that learned patterns achieve superior reconstruction performance compared to baseline methods, especially in highly undersampled regimes.
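
A minimal sketch of the Straight-Through Estimator for binary patterns: the forward pass thresholds continuous pattern weights to 0/1, while the backward pass treats the binarization as the identity. PyTorch, the measurement model, and the toy objective are assumptions; the Total Deep Variation regulariser and the bilevel structure are not shown.

```python
# Straight-Through Estimator (STE) for learning binary illumination patterns.
import torch

def binarize_ste(logits: torch.Tensor) -> torch.Tensor:
    """Forward: hard 0/1 pattern. Backward: gradient of the identity (STE)."""
    hard = (logits > 0).float()
    # (hard - logits) is detached, so d(output)/d(logits) = 1 in the backward pass.
    return logits + (hard - logits).detach()

# Toy lower-level use: measurements y = P x with a learnable binary pattern P.
n_pixels, n_patterns = 64, 8
pattern_logits = torch.randn(n_patterns, n_pixels, requires_grad=True)
x = torch.randn(n_pixels)              # unknown object (toy stand-in)

P = binarize_ste(pattern_logits)       # binary sampling matrix, still differentiable
y = P @ x                              # simulated single-pixel measurements
loss = y.pow(2).mean()                 # placeholder objective
loss.backward()
print(pattern_logits.grad.shape)       # gradients flow back to the continuous logits
```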

new VibES: Induced Vibration for Persistent Event-Based Sensing

Authors: Vincenzo Polizzi, Stephen Yang, Quentin Clark, Jonathan Kelly, Igor Gilitschenski, David B. Lindell

Abstract: Event cameras are a bio-inspired class of sensors that asynchronously measure per-pixel intensity changes. Under fixed illumination conditions in static or low-motion scenes, rigidly mounted event cameras are unable to generate any events, becoming unsuitable for most computer vision tasks. To address this limitation, recent work has investigated motion-induced event stimulation that often requires complex hardware or additional optical components. In contrast, we introduce a lightweight approach to sustain persistent event generation by employing a simple rotating unbalanced mass to induce periodic vibrational motion. This is combined with a motion-compensation pipeline that removes the injected motion and yields clean, motion-corrected events for downstream perception tasks. We demonstrate our approach with a hardware prototype and evaluate it on real-world captured datasets. Our method reliably recovers motion parameters and improves both image reconstruction and edge detection over event-based sensing without motion induction.

new Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents

Authors: Rafael Sterzinger, Tingyu Lin, Robert Sablatnig

Abstract: A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To expand our limited training data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.

URLs: https://github.com/RafaelSterzinger/acpr_few_shot_hist.

new Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding

Authors: Yuzhen Li, Min Liu, Yuan Bian, Xueping Wang, Zhaoyang Li, Gen Li, Yaonan Wang

Abstract: Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply rewriting a length given in meters as the equivalent number of decimeters or centimeters leads to severe performance degradation, even though the physical length remains equivalent. This observation signifies the weak 3D comprehension of the pre-trained language model, which generates misguiding text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. First, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94\% in the "Far" scenario. Our code will be made publicly available.
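
The 3DTE idea of diversifying distance descriptors can be illustrated by rewriting metre-valued lengths in a query into equivalent decimeters or centimeters; the regex and phrasing below are illustrative assumptions, not the paper's exact augmentation procedure.

```python
# Illustrative unit-diversity augmentation: the same physical length is expressed
# in a randomly chosen alternative unit so the text encoder sees varied descriptors.
import random
import re

def augment_units(query: str) -> str:
    def repl(match: re.Match) -> str:
        meters = float(match.group(1))
        unit, factor = random.choice([("decimeters", 10.0), ("centimeters", 100.0)])
        return f"{meters * factor:g} {unit}"
    return re.sub(r"(\d+(?:\.\d+)?)\s*meters?", repl, query)

random.seed(0)
print(augment_units("the car about 2.5 meters away on the left"))
# e.g. "the car about 250 centimeters away on the left"
```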

new Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions

Authors: Zhihang Xin, Xitong Hu, Rui Wang

Abstract: Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model's capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data. Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78\% accuracy on CIFAR-100 from-scratch training with ViT-Tiny architecture, 93.28\% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.
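
For reference, the standard Weierstrass function facts that the encoding relies on, namely double periodicity over a lattice $\Lambda$ and the algebraic addition formula that lets relative quantities be derived from absolute ones, are (standard definitions, with notation that may differ from the paper's):

$$\wp(z;\Lambda) = \frac{1}{z^{2}} + \sum_{\omega \in \Lambda \setminus \{0\}} \left( \frac{1}{(z-\omega)^{2}} - \frac{1}{\omega^{2}} \right), \qquad \wp(z+\omega) = \wp(z) \quad \text{for all } \omega \in \Lambda,$$

$$\wp(z_{1}+z_{2}) = \frac{1}{4}\left( \frac{\wp'(z_{1})-\wp'(z_{2})}{\wp(z_{1})-\wp(z_{2})} \right)^{2} - \wp(z_{1}) - \wp(z_{2}).$$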

new SoccerNet 2025 Challenges Results

Authors: Silvio Giancola, Anthony Cioppa, Marc Guti\'errez-P\'erez, Jan Held, Carlos Hinojosa, Victor Joos, Arnaud Leduc, Floriane Magera, Karen Sanchez, Vladimir Somers, Artur Xarles, Antonio Agudo, Alexandre Alahi, Olivier Barnich, Albert Clap\'es, Christophe De Vleeschouwer, Sergio Escalera, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck, Tomoki Abe, Saad Alotaibi, Faisal Altawijri, Steven Araujo, Xiang Bai, Xiaoyang Bi, Jiawang Cao, Vanyi Chao, Kamil Czarnog\'orski, Fabian Deuser, Mingyang Du, Tianrui Feng, Patrick Frenzel, Mirco Fuchs, Jorge Garc\'ia, Konrad Habel, Takaya Hashiguchi, Sadao Hirose, Xinting Hu, Yewon Hwang, Ririko Inoue, Riku Itsuji, Kazuto Iwai, Hongwei Ji, Yangguang Ji, Licheng Jiao, Yuto Kageyama, Yuta Kamikawa, Yuuki Kanasugi, Hyungjung Kim, Jinwook Kim, Takuya Kurihara, Bozheng Li, Lingling Li, Xian Li, Youxing Lian, Dingkang Liang, Hongkai Lin, Jiadong Lin, Jian Liu, Liang Liu, Shuaikun Liu, Zhaohong Liu, Yi Lu, Federico M\'endez, Huadong Ma, Wenping Ma, Jacek Maksymiuk, Henry Mantilla, Ismail Mathkour, Daniel Matthes, Ayaha Motomochi, Amrulloh Robbani Muhammad, Haruto Nakayama, Joohyung Oh, Yin May Oo, Marcelo Ortega, Norbert Oswald, Rintaro Otsubo, Fabian Perez, Mengshi Qi, Cristian Rey, Abel Reyes-Angulo, Oliver Rose, Hoover Rueda-Chac\'on, Hideo Saito, Jose Sarmiento, Kanta Sawafuji, Atom Scott, Xi Shen, Pragyan Shrestha, Jae-Young Sim, Long Sun, Yuyang Sun, Tomohiro Suzuki, Licheng Tang, Masato Tonouchi, Ikuma Uchida, Henry O. Velesaca, Tiancheng Wang, Rio Watanabe, Jay Wu, Yongliang Wu, Shunzo Yamagishi, Di Yang, Xu Yang, Yuxin Yang, Hao Ye, Xinyu Ye, Calvin Yeung, Xuanlong Yu, Chao Zhang, Dingyuan Zhang, Kexing Zhang, Zhe Zhao, Xin Zhou, Wenbo Zhu, Julian Ziegler

Abstract: The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year's challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.

URLs: https://www.soccer-net.org, https://github.com/SoccerNet.

new FastMesh: Efficient Artistic Mesh Generation via Component Decoupling

Authors: Jeonghwan Kim, Yushi Lan, Armando Fortes, Yongwei Chen, Xingang Pan

Abstract: Recent mesh generation approaches typically tokenize triangle meshes into sequences of tokens and train autoregressive models to generate these tokens sequentially. Despite substantial progress, such token sequences inevitably reuse vertices multiple times to fully represent manifold meshes, as each vertex is shared by multiple faces. This redundancy leads to excessively long token sequences and inefficient generation processes. In this paper, we propose an efficient framework that generates artistic meshes by treating vertices and faces separately, significantly reducing redundancy. We employ an autoregressive model solely for vertex generation, decreasing the token count to approximately 23\% of that required by the most compact existing tokenizer. Next, we leverage a bidirectional transformer to complete the mesh in a single step by capturing inter-vertex relationships and constructing the adjacency matrix that defines the mesh faces. To further improve the generation quality, we introduce a fidelity enhancer to refine vertex positioning into more natural arrangements and propose a post-processing framework to remove undesirable edge connections. Experimental results show that our method achieves more than 8$\times$ faster speed on mesh generation compared to state-of-the-art approaches, while producing higher mesh quality.

new All-in-One Slider for Attribute Manipulation in Diffusion Models

Authors: Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang

Abstract: Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.

URLs: https://github.com/ywxsuperstar/KSAE-FaceSteer.

new LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding

Authors: Julian Ost, Andrea Ramazzina, Amogh Joshi, Maximilian B\"omer, Mario Bijelic, Felix Heide

Abstract: Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large physically-grounded outdoor scenes from captured sensor data. However, these methods have baked-in static environments and only allow for limited scene control -- they are functionally constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, however, at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts -- producing realistic and geometrically consistent 3D generations of complex driving scenes.

new OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation

Authors: Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao

Abstract: Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, \textbf{we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive.} Our model, \textbf{OmniHuman-1.5}, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/

URLs: https://omnihuman-lab.github.io/v1_5/

new Automated Feature Tracking for Real-Time Kinematic Analysis and Shape Estimation of Carbon Nanotube Growth

Authors: Kaveh Safavigerdini, Ramakrishna Surya, Jaired Collins, Prasad Calyam, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan

Abstract: Carbon nanotubes (CNTs) are critical building blocks in nanotechnology, yet the characterization of their dynamic growth is limited by the experimental challenges in nanoscale motion measurement using scanning electron microscopy (SEM) imaging. Existing ex situ methods offer only static analysis, while in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. We present Visual Feature Tracking (VFTrack), an in-situ real-time particle tracking framework that automatically detects and tracks individual CNT particles in SEM image sequences. VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework to enable kinematic analysis of CNT micropillar growth. A systematic evaluation using 13,540 manually annotated trajectories identifies the ALIKED detector with the LightGlue matcher as an optimal combination (F1-score of 0.78, $\alpha$-score of 0.89). VFTrack motion vectors, decomposed into axial growth, lateral drift, and oscillations, facilitate the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies. This work enables advancement in automated nano-material characterization, bridging the gap between physics-based models and experimental observation to enable real-time optimization of CNT synthesis.

new Autoregressive Universal Video Segmentation Model

Authors: Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu, Seon Joo Kim, Ryo Hachiuma

Abstract: Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.

new Style4D-Bench: A Benchmark Suite for 4D Stylization

Authors: Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, Guangcong Wang

Abstract: We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, 2) a strong baseline that makes an initial attempt at 4D stylization, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes. Project page: https://becky-catherine.github.io/Style4D. Code: https://github.com/Becky-catherine/Style4D-Bench.

URLs: https://becky-catherine.github.io/Style4D, https://github.com/Becky-catherine/Style4D-Bench

new Articulate3D: Zero-Shot Text-Driven 3D Object Posing

Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht

Abstract: We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85\% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/

URLs: https://odeb1.github.io/articulate3d_page_deb/

new VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng

Abstract: 3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.

URLs: https://huanngzh.github.io/VoxHammer-Page/.
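The core preserve-and-edit step lends itself to a small illustration. The sketch below is a toy numpy version of the replacement idea only (overwriting latents of preserved regions with cached inversion latents at each denoising step); the structured 3D latent model and key-value caching of VoxHammer are not reproduced, and all shapes are placeholders.

    # Toy preserve-and-edit blend: at each denoising step, latents inside the
    # preserved region are overwritten with the cached inversion latents for that
    # step, so unedited areas reconstruct exactly. Shapes are placeholders.
    import numpy as np

    def blend_step(denoised_latent, inverted_latent, preserve_mask):
        """preserve_mask == 1 marks voxels that must stay identical to the input."""
        return preserve_mask * inverted_latent + (1.0 - preserve_mask) * denoised_latent

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(16, 16, 16, 8))            # toy 3D latent grid
    cached = rng.normal(size=(16, 16, 16, 8))            # stands in for the inversion trajectory
    mask = np.zeros((16, 16, 16, 1))
    mask[:, :8] = 1.0                                     # preserve half of the volume

    blended = blend_step(latent, cached, mask)
    assert np.allclose(blended[:, :8], cached[:, :8])     # preserved region left untouched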

cross Federative ischemic stroke segmentation as alternative to overcome domain-shift multi-institution challenges

Authors: Edgar Rangel, Fabio Martinez

Abstract: Stroke is the second leading cause of death and the third leading cause of disability worldwide. Clinical guidelines establish diffusion resonance imaging (DWI, ADC) as the standard for localizing, characterizing, and measuring infarct volume, enabling treatment support and prognosis. Nonetheless, such lesion analysis is highly variable due to different patient demographics, scanner vendors, and expert annotations. Computational support approaches have been key to helping with the localization and segmentation of lesions. However, these strategies are dedicated solutions that learn patterns from only one institution, lacking the variability needed to generalize geometric lesion shape models. Even worse, many clinical centers lack sufficient labeled samples to adjust these dedicated solutions. This work developed a collaborative framework for segmenting ischemic stroke lesions in DWI sequences by sharing knowledge from deep center-independent representations. From 14 emulated healthcare centers with 2031 studies, the FedAvg model achieved a general DSC of $0.71 \pm 0.24$, AVD of $5.29 \pm 22.74$, ALD of $2.16 \pm 3.60$ and LF1 of $0.70 \pm 0.26$ over all centers, outperforming both the centralized and other federated rules. Interestingly, the model demonstrated strong generalization properties, showing uniform performance across different lesion categories and reliable performance in out-of-distribution centers (with DSC of $0.64 \pm 0.29$ and AVD of $4.44 \pm 8.74$ without any additional training).
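The federated rule reported above is FedAvg; a minimal sketch of the sample-weighted parameter averaging it performs across centers is shown below. The center counts and the toy single-layer "models" are hypothetical, not the 14 emulated centers of the study.

    # Minimal FedAvg round: each center trains locally and the server averages
    # parameters weighted by the number of local studies. Centers and counts are
    # illustrative, not the 14 emulated centers of the study.
    import numpy as np

    def fedavg(center_weights, center_sizes):
        total = float(sum(center_sizes))
        keys = center_weights[0].keys()
        return {k: sum(w[k] * (n / total) for w, n in zip(center_weights, center_sizes))
                for k in keys}

    w_a = {"conv1": np.ones((3, 3)), "bias": np.zeros(1)}      # toy single-layer "models"
    w_b = {"conv1": 3 * np.ones((3, 3)), "bias": np.ones(1)}
    global_w = fedavg([w_a, w_b], center_sizes=[100, 300])
    print(global_w["conv1"][0, 0])                             # 2.5 = 1*0.25 + 3*0.75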

cross Impact of Target and Tool Visualization on Depth Perception and Usability in Optical See-Through AR

Authors: Yue Yang, Xue Xie, Xinkai Wang, Hui Zhang, Chiming Yu, Xiaoxian Xiong, Lifeng Zhu, Yuanyi Zheng, Jue Cen, Bruce Daniel, Fred Baik

Abstract: Optical see-through augmented reality (OST-AR) systems like Microsoft HoloLens 2 hold promise for arm's distance guidance (e.g., surgery), but depth perception of the hologram and occlusion of real instruments remain challenging. We present an evaluation of how visualizing the target object with different transparencies and visualizing a tracked tool (virtual proxy vs. real tool vs. no tool tracking) affects depth perception and system usability. Ten participants performed two experiments on HoloLens 2. In Experiment 1, we compared high-transparency vs. low-transparency target rendering in a depth matching task at arm's length. In Experiment 2, participants performed a simulated surgical pinpoint task on a frontal bone target under six visualization conditions ($2 \times 3$: two target transparencies and three tool visualization modes: virtual tool hologram, real tool, or no tool tracking). We collected data on depth matching error, target localization error, system usability, task workload, and qualitative feedback. Results show that a more opaque target yields significantly lower depth estimation error than a highly transparent target at arm's distance. Moreover, showing the real tool (occluding the virtual target) led to the highest accuracy and usability with the lowest workload, while not tracking the tool yielded the worst performance and user ratings. However, making the target highly transparent, while allowing the real tool to remain visible, slightly impaired depth cues and did not improve usability. Our findings underscore that correct occlusion cues, rendering virtual content opaque and occluding it with real tools in real time, are critical for depth perception and precision in OST-AR. Designers of arm-distance AR systems should prioritize robust tool tracking and occlusion handling; if unavailable, cautiously use transparency to balance depth perception and tool visibility.

cross Analysis of Machine Unlearning in Medical Image Classification Models

Authors: Andreza M. C. Falcao, Filipe R. Cordeiro

Abstract: Machine unlearning aims to remove private or sensitive data from a pre-trained model while preserving the model's robustness. Despite recent advances, this technique has not been explored in medical image classification. This work evaluates the SalUn unlearning model by conducting experiments on the PathMNIST, OrganAMNIST, and BloodMNIST datasets. We also analyse the impact of data augmentation on the quality of unlearning. Results show that SalUn achieves performance close to full retraining, indicating an efficient solution for use in medical applications.

cross Controllable Single-shot Animation Blending with Temporal Conditioning

Authors: Eleni Tselepi, Spyridon Thermos, Gerasimos Potamianos

Abstract: Training a generative model on a single human skeletal motion sequence without being bound to a specific kinematic tree has drawn significant attention from the animation community. Unlike text-to-motion generation, single-shot models allow animators to controllably generate variations of existing motion patterns without requiring additional data or extensive retraining. However, existing single-shot methods do not explicitly offer a controllable framework for blending two or more motions within a single generative pass. In this paper, we present the first single-shot motion blending framework that enables seamless blending by temporally conditioning the generation process. Our method introduces a skeleton-aware normalization mechanism to guide the transition between motions, allowing smooth, data-driven control over when and how motions blend. We perform extensive quantitative and qualitative evaluations across various animation styles and different kinematic skeletons, demonstrating that our approach produces plausible, smooth, and controllable motion blends in a unified and efficient manner.

cross A Deep Learning Application for Psoriasis Detection

Authors: Anna Milani, F\'abio S. da Silva, Ello\'a B. Guedes, Ricardo Rios

Abstract: In this paper, we present a comparative study of the performance of three Convolutional Neural Network models, ResNet50, Inception v3, and VGG19, for the classification of skin images with lesions affected by psoriasis. The images used for training and validation of the models were obtained from specialized platforms. Some techniques were used to adjust the evaluation metrics of the neural networks. The results suggest Inception v3 as a valuable tool for supporting the diagnosis of psoriasis, owing to its satisfactory performance with respect to accuracy and F1-Score (97.5% ${\pm}$ 0.2).

cross The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation

Authors: Girish A. Koushik, Fatemeh Nazarieh, Katherine Birch, Shenbin Qian, Diptesh Kanojia

Abstract: Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework's output in a user study and observe that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.

cross SemLayoutDiff: Semantic Layout Generation with Diffusion Model for Indoor Scene Synthesis

Authors: Xiaohao Sun, Divyam Goel, Angel X. Chang

Abstract: We present SemLayoutDiff, a unified model for synthesizing diverse 3D indoor scenes across multiple room types. The model introduces a scene layout representation combining a top-down semantic map and attributes for each object. Unlike prior approaches, which cannot condition on architectural constraints, SemLayoutDiff employs a categorical diffusion model capable of conditioning scene synthesis explicitly on room masks. It first generates a coherent semantic map, followed by a cross-attention-based network to predict furniture placements that respect the synthesized layout. Our method also accounts for architectural elements such as doors and windows, ensuring that generated furniture arrangements remain practical and unobstructed. Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces spatially coherent, realistic, and varied scenes, outperforming previous methods.

cross Uncertainty Awareness on Unsupervised Domain Adaptation for Time Series Data

Authors: Weide Liu, Xiaoyang Zhong, Lu Wang, Jingwen Hou, Yuemei Luo, Jiebin Yan, Yuming Fang

Abstract: Unsupervised domain adaptation methods seek to generalize effectively on unlabeled test data, especially when encountering the common challenge in time series data that distribution shifts occur between training and testing datasets. In this paper, we propose incorporating multi-scale feature extraction and uncertainty estimation to improve the model's generalization and robustness across domains. Our approach begins with a multi-scale mixed input architecture that captures features at different scales, increasing training diversity and reducing feature discrepancies between the training and testing domains. Based on the mixed input architecture, we further introduce an uncertainty awareness mechanism based on evidential learning by imposing a Dirichlet prior on the labels to facilitate both target prediction and uncertainty estimation. The uncertainty awareness mechanism enhances domain adaptation by aligning features with the same labels across different domains, which leads to significant performance improvements in the target domain. Additionally, our uncertainty-aware model demonstrates a much lower Expected Calibration Error (ECE), indicating better-calibrated prediction confidence. Our experimental results show that this combined approach of mixed input architecture with the uncertainty awareness mechanism achieves state-of-the-art performance across multiple benchmark datasets, underscoring its effectiveness in unsupervised domain adaptation for time series data.

cross Enhancing Video-Based Robot Failure Detection Using Task Knowledge

Authors: Santosh Thoduka, Sebastian Houben, Juergen Gall, Paul G. Pl\"oger

Abstract: Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real-world scenarios. In this paper, we propose a video-based failure detection approach that uses spatio-temporal knowledge in the form of the actions the robot performs and task-relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets that we amend, in part, with additional annotations of the aforementioned task-relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense and an additional increase to 81.4 with test-time augmentation. The results emphasize the importance of spatio-temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.

cross Natural Image Classification via Quasi-Cyclic Graph Ensembles and Random-Bond Ising Models at the Nishimori Temperature

Authors: V. S. Usatyuk, D. A. Sapoznikov, S. I. Egorov

Abstract: We present a unified framework combining statistical physics, coding theory, and algebraic topology for efficient multi-class image classification. High-dimensional feature vectors from a frozen MobileNetV2 backbone are interpreted as spins on a sparse Multi-Edge Type quasi-cyclic LDPC (MET-QC-LDPC) graph, forming a Random-Bond Ising Model (RBIM). We operate this RBIM at its Nishimori temperature, $\beta_N$, where the smallest eigenvalue of the Bethe-Hessian matrix vanishes, maximizing class separability. Our theoretical contribution establishes a correspondence between local trapping sets in the code's graph and topological invariants (Betti numbers, bordism classes) of the feature manifold. A practical algorithm estimates $\beta_N$ efficiently with a quadratic interpolant and Newton correction, achieving a six-fold speed-up over bisection. Guided by topology, we design spherical and toroidal MET-QC-LDPC graph ensembles, using permanent bounds to suppress harmful trapping sets. This compresses 1280-dimensional features to 32 or 64 dimensions for ImageNet-10 and -100 subsets. Despite massive compression (40x fewer parameters), we achieve 98.7% accuracy on ImageNet-10 and 82.7% on ImageNet-100, demonstrating that topology-guided graph design yields highly efficient, physics-inspired embeddings with state-of-the-art performance.
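A rough sketch of the $\beta_N$ estimation idea: find the inverse temperature at which the smallest eigenvalue of a weighted Bethe-Hessian crosses zero. The weighted form used below (edge weights tanh(beta * J_ij)) follows one common convention and is an assumption, the couplings are a toy random-bond instance rather than backbone features, and a plain bracketing root-finder replaces the paper's quadratic-interpolant-plus-Newton scheme.

    # Sketch: estimate the Nishimori temperature as the beta where the smallest
    # eigenvalue of a weighted Bethe-Hessian crosses zero. The weighted form
    # (w_ij = tanh(beta * J_ij)) is one common convention and an assumption here.
    import numpy as np
    from scipy.optimize import brentq

    def bethe_hessian_min_eig(beta, J):
        w = np.tanh(beta * J)
        denom = np.clip(1.0 - w ** 2, 1e-12, None)
        H = -w / denom                                   # off-diagonal terms
        np.fill_diagonal(H, 1.0 + ((w ** 2) / denom).sum(axis=1))
        return np.linalg.eigvalsh(H)[0]                  # smallest eigenvalue

    rng = np.random.default_rng(0)
    J = rng.normal(scale=0.4, size=(60, 60))
    J = np.triu(J, 1)
    J = J + J.T                                          # symmetric couplings, zero diagonal
    beta_n = brentq(lambda b: bethe_hessian_min_eig(b, J), 0.05, 3.0)
    print("estimated Nishimori temperature:", round(beta_n, 3))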

cross Stabilizing Open-Set Test-Time Adaptation via Primary-Auxiliary Filtering and Knowledge-Integrated Prediction

Authors: Byung-Joon Lee, Jin-Seop Lee, Jee-Hyong Lee

Abstract: Deep neural networks demonstrate strong performance under aligned training-test distributions. However, real-world test data often exhibit domain shifts. Test-Time Adaptation (TTA) addresses this challenge by adapting the model to test data during inference. While most TTA studies assume that the training and test data share the same class set (closed-set TTA), real-world scenarios often involve open-set data (open-set TTA), which can degrade closed-set accuracy. A recent study showed that identifying open-set data during adaptation and maximizing its entropy is an effective solution. However, the previous method relies on the source model for filtering, resulting in suboptimal filtering accuracy on domain-shifted test data. In contrast, we found that the adapting model, which learns domain knowledge from noisy test streams, tends to be unstable and leads to error accumulation when used for filtering. To address this problem, we propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. Furthermore, we propose Knowledge-Integrated Prediction (KIP), which calibrates the outputs of the adapting model, EMA model, and source model to integrate their complementary knowledge for OSTTA. We validate our approach across diverse closed-set and open-set datasets. Our method enhances both closed-set accuracy and open-set discrimination over existing methods. The code is available at https://github.com/powerpowe/PAF-KIP-OSTTA .

URLs: https://github.com/powerpowe/PAF-KIP-OSTTA

cross EMind: A Foundation Model for Multi-task Electromagnetic Signals Understanding

Authors: Luqing Luo, Wenjin Gui, Yunfei Liu, Ziyue Zhang, Yunxi Zhang, Fengxiang Wang, Zonghao Guo, Zizhi Ma, Xinzhu Liu, Hanxiang He, Jinhai Li, Xin Qiu, Wupeng Xie, Yangang Sun

Abstract: Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images, showing high heterogeneity, strong background noise and complex joint time-frequency structure, which prevents existing general models from being applied directly. Electromagnetic communication and sensing tasks are diverse; current methods lack cross-task generalization and transfer efficiency, and the scarcity of large, high-quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signals foundation model that bridges large-scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task-specific models to a unified framework for electromagnetic intelligence. The code is available at: https://github.com/GabrielleTse/EMind.

URLs: https://github.com/GabrielleTse/EMind.

cross A Closer Look at Edema Area Segmentation in SD-OCT Images Using Adversarial Framework

Authors: Yuhui Tao, Yizhe Zhang, Qiang Chen

Abstract: The development of artificial intelligence models for macular edema (ME) analysis always relies on expert-annotated pixel-level image datasets which are expensive to collect prospectively. While anomaly-detection-based weakly-supervised methods have shown promise in edema area (EA) segmentation task, their performance still lags behind fully-supervised approaches. In this paper, we leverage the strong correlation between EA and retinal layers in spectral-domain optical coherence tomography (SD-OCT) images, along with the update characteristics of weakly-supervised learning, to enhance an off-the-shelf adversarial framework for EA segmentation with a novel layer-structure-guided post-processing step and a test-time-adaptation (TTA) strategy. By incorporating additional retinal layer information, our framework reframes the dense EA prediction task as one of confirming intersection points between the EA contour and retinal layers, resulting in predictions that better align with the shape prior of EA. Besides, the TTA framework further helps address discrepancies in the manifestations and presentations of EA between training and test sets. Extensive experiments on two publicly available datasets demonstrate that these two proposed ingredients can improve the accuracy and robustness of EA segmentation, bridging the gap between weakly-supervised and fully-supervised models.

cross Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models

Authors: Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu

Abstract: Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose \textit{Hidden Tail}, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that \textit{Hidden Tail} outperforms existing attacks, increasing output length by up to 19.2$\times$ and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.

URLs: https://github.com/zhangrui4041/Hidden_Tail.

cross SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation

Authors: Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh

Abstract: Recent studies have shown that Machine Learning (ML) models can exhibit bias in real-world scenarios, posing significant challenges in ethically sensitive domains such as healthcare. Such bias can negatively affect model fairness and model generalization abilities, and it risks further amplifying social discrimination. There is a need to remove biases from trained models. Existing debiasing approaches often necessitate access to original training data and need extensive model retraining; they also typically exhibit trade-offs between model fairness and discriminative performance. To address these challenges, we propose Soft-Mask Weight Fine-Tuning (SWiFT), a debiasing framework that efficiently improves fairness while preserving discriminative performance at a much lower debiasing cost. Notably, SWiFT requires only a small external dataset and only a few epochs of model fine-tuning. The idea behind SWiFT is to first find the relative, and yet distinct, contributions of model parameters to both bias and predictive performance. Then, a two-step fine-tuning process updates each parameter with different gradient flows defined by its contribution. Extensive experiments with three bias-sensitive attributes (gender, skin tone, and age) across four dermatological and two chest X-ray datasets demonstrate that SWiFT can consistently reduce model bias while achieving competitive or even superior diagnostic accuracy under common fairness and accuracy metrics, compared to the state-of-the-art. Specifically, we demonstrate improved model generalization ability as evidenced by superior performance on several out-of-distribution (OOD) datasets.

cross Quantum-Circuit-Based Visual Fractal Image Generation in Qiskit and Analytics

Authors: Hillol Biswas

Abstract: Nature is often described as quantum, and fractal structure appears in many observable entities and phenomena at both micro and macro scales. Fractals show self-similarity across scales: zooming in reveals structures that resemble the whole. In quantum systems, the probability density or wavefunction may exhibit recurring interference patterns at various energy or length scales. Fractals are produced by simple iterative rules (such as the Mandelbrot or Julia sets), yet they provide limitless complexity. Likewise, despite its simplicity, the Schr\"odinger equation in quantum mechanics produces incredibly intricate patterns of interference and entanglement, particularly in chaotic quantum systems. What outcomes can be expected when quantum computing, which operates on the principles of quantum-mechanical phenomena, is applied to fractal image generation? This paper outlines the generation of a Julia set dataset using an approach coupled with building a quantum circuit, highlighting the concepts of superposition, randomness, and entanglement as foundational elements to manipulate the generated dataset patterns. As quantum computing finds many application areas, the possibility of using quantum circuits for fractal Julia image generation suggests a unique direction for future research, where it can be applied to quantum generative art across various ecosystems with a customised approach, such as producing an exciting landscape based on a quantum art theme.
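A minimal sketch of the kind of pipeline the abstract describes: a small Qiskit circuit in uniform superposition is measured once, and the resulting bitstring perturbs the complex parameter of a classically iterated Julia set. The circuit, the bit-to-parameter mapping, and the resolution are illustrative assumptions (and require qiskit plus qiskit-aer), not the paper's construction.

    # Sketch: quantum randomness from a 3-qubit Hadamard circuit perturbs the
    # Julia parameter c; the set itself is iterated classically with numpy.
    # Circuit, bit-to-parameter mapping, and image size are illustrative.
    import numpy as np
    from qiskit import QuantumCircuit
    from qiskit_aer import AerSimulator

    qc = QuantumCircuit(3, 3)
    qc.h(range(3))                           # uniform superposition
    qc.measure(range(3), range(3))           # measurement yields random bits
    counts = AerSimulator().run(qc, shots=1).result().get_counts()
    bits = next(iter(counts))                # e.g. '101'
    offset = (int(bits, 2) / 7.0 - 0.5) * 0.1
    c = complex(-0.8, 0.156) + complex(offset, offset)

    # Classical escape-time iteration z <- z^2 + c on a grid.
    n, max_iter = 256, 100
    y, x = np.mgrid[-1.2:1.2:n * 1j, -1.6:1.6:n * 1j]
    z = x + 1j * y
    image = np.zeros(z.shape, dtype=int)
    for i in range(max_iter):
        inside = np.abs(z) <= 2.0
        z[inside] = z[inside] ** 2 + c
        image[inside] = i
    print("Julia set for c =", c, "| escape-time range:", image.min(), image.max())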

cross C-Flat++: Towards a More Efficient and Powerful Framework for Continual Learning

Authors: Wei Li, Hangjie Yuan, Zixiang Zhao, Yifan Zhu, Aojun Lu, Tao Feng, Yanan Sun

Abstract: Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in continual learning (CL) to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose \textbf{C}ontinual \textbf{Flat}ness (\textbf{C-Flat}), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. Besides, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.

URLs: https://github.com/WanNaa/C-Flat.
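C-Flat itself is not reproduced here; the sketch below shows the generic sharpness-aware update style it builds on: perturb the weights toward the locally worst-case direction, take the gradient there, and apply it at the original point. The model, data, and radius rho are toy placeholders.

    # Generic sharpness-aware step (not C-Flat itself): ascend to a worst-case
    # weight perturbation of radius rho, take the gradient there, then update
    # the original weights. Model, data, and rho are toy placeholders.
    import torch

    def sharpness_aware_step(model, loss_fn, x, y, opt, rho=0.05):
        loss = loss_fn(model(x), y)
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12

        eps = []
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                e = rho * g / norm               # ascent direction
                p.add_(e)
                eps.append(e)
        opt.zero_grad()

        loss_fn(model(x), y).backward()          # gradient at the perturbed point
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)                        # restore original weights
        opt.step()
        opt.zero_grad()
        return loss.item()

    model = torch.nn.Linear(10, 3)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
    print(sharpness_aware_step(model, torch.nn.functional.cross_entropy, x, y, opt))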

cross PanoHair: Detailed Hair Strand Synthesis on Volumetric Heads

Authors: Shashikant Verma, Shanmuganathan Raman

Abstract: Achieving realistic hair strand synthesis is essential for creating lifelike digital humans, but producing high-fidelity hair strand geometry remains a significant challenge. Existing methods require a complex setup for data acquisition, involving multi-view images captured in constrained studio environments. Additionally, these methods have longer hair volume estimation and strand synthesis times, which hinder efficiency. We introduce PanoHair, a model that estimates head geometry as signed distance fields using knowledge distillation from a pre-trained generative teacher model for head synthesis. Our approach enables the prediction of semantic segmentation masks and 3D orientations specifically for the hair region of the estimated geometry. Our method is generative and can generate diverse hairstyles with latent space manipulations. For real images, our approach involves an inversion process to infer latent codes and produces visually appealing hair strands, offering a streamlined alternative to complex multi-view data acquisition setups. Given the latent code, PanoHair generates a clean manifold mesh for the hair region in under 5 seconds, along with semantic and orientation maps, marking a significant improvement over existing methods, as demonstrated in our experiments.

cross Enhanced UAV Path Planning Using the Tangent Intersection Guidance (TIG) Algorithm

Authors: Hichem Cheriet, Khellat Kihel Badra, Chouraqui Samira

Abstract: Efficient and safe navigation of Unmanned Aerial Vehicles (UAVs) is critical for various applications, including combat support, package delivery and Search and Rescue Operations. This paper introduces the Tangent Intersection Guidance (TIG) algorithm, an advanced approach for UAV path planning in both static and dynamic environments. The algorithm uses the elliptic tangent intersection method to generate feasible paths. It generates two sub-paths for each threat, selects the optimal route based on a heuristic rule, and iteratively refines the path until the target is reached. Considering the UAV kinematic and dynamic constraints, a modified smoothing technique based on quadratic B\'ezier curves is adopted to generate a smooth and efficient route. Experimental results show that the TIG algorithm generates the shortest path in less time (as little as 0.01 seconds) and with fewer turning angles than the A*, PRM, RRT*, Tangent Graph, and Static APPATT algorithms in static environments. Furthermore, in completely unknown and partially known environments, TIG demonstrates efficient real-time path planning capabilities for collision avoidance, outperforming the APF and Dynamic APPATT algorithms.
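A minimal sketch of quadratic B\'ezier corner smoothing of a piecewise-linear path such as the one TIG produces: each interior waypoint becomes the control point of a curve joining the midpoints of its adjacent segments. The waypoints and sampling density are illustrative.

    # Smooth a piecewise-linear path with quadratic Bezier segments: each
    # interior waypoint acts as the control point of a curve joining the
    # midpoints of its two adjacent segments. Waypoints are illustrative.
    import numpy as np

    def quadratic_bezier(p0, p1, p2, n=20):
        t = np.linspace(0.0, 1.0, n)[:, None]
        return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

    def smooth_path(waypoints, n=20):
        pts = np.asarray(waypoints, dtype=float)
        out = [pts[0]]
        for i in range(1, len(pts) - 1):
            a = (pts[i - 1] + pts[i]) / 2.0      # midpoint entering the corner
            b = (pts[i] + pts[i + 1]) / 2.0      # midpoint leaving the corner
            out.extend(quadratic_bezier(a, pts[i], b, n))
        out.append(pts[-1])
        return np.vstack(out)

    path = [(0, 0), (4, 0), (4, 3), (8, 3)]      # corners from a planner such as TIG
    print(smooth_path(path, n=5).round(2))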

cross Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data

Authors: Jan Nikolas Morshuis, Matthias Hein, Christian F. Baumgartner

Abstract: MR imaging is a valuable diagnostic tool allowing to non-invasively visualize patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly, when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing \textit{one-stage approaches}, that combine reconstruction and segmentation into a unified model, with \textit{two-stage approaches}, that utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.

cross Time Series Analysis of Spiking Neural Systems via Transfer Entropy and Directed Persistent Homology

Authors: Dylan Peek, Siddharth Pritam, Matthew P. Skerritt, Stephan Chalup

Abstract: We present a topological framework for analysing neural time series that integrates Transfer Entropy (TE) with directed Persistent Homology (PH) to characterize information flow in spiking neural systems. TE quantifies directional influence between neurons, producing weighted, directed graphs that reflect dynamic interactions. These graphs are then analyzed using PH, enabling assessment of topological complexity across multiple structural scales and dimensions. We apply this TE+PH pipeline to synthetic spiking networks trained on logic gate tasks, image-classification networks exposed to structured and perturbed inputs, and mouse cortical recordings annotated with behavioral events. Across all settings, the resulting topological signatures reveal distinctions in task complexity, stimulus structure, and behavioral regime. Higher-dimensional features become more prominent in complex or noisy conditions, reflecting interaction patterns that extend beyond pairwise connectivity. Our findings offer a principled approach to mapping directed information flow onto global organizational patterns in both artificial and biological neural systems. The framework is generalizable and interpretable, making it well suited for neural systems with time-resolved and binary spiking data.
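A minimal sketch of the first stage of the pipeline: a binned, history-length-one transfer entropy estimate between binary spike trains, assembled into a directed, weighted adjacency matrix that a persistent homology library could consume downstream. The estimator settings (binary states, one step of history) are simplifying assumptions.

    # History-1 transfer entropy between binary spike trains, assembled into a
    # directed, weighted adjacency matrix; a persistent-homology library would
    # consume this matrix downstream. Binary states and one-step history are
    # simplifying assumptions.
    import numpy as np

    def transfer_entropy(x, y):
        """TE(x -> y) in bits for binary series with one step of history."""
        xt, yt, y1 = x[:-1], y[:-1], y[1:]
        p = np.zeros((2, 2, 2))
        for a, b, c in zip(y1, yt, xt):
            p[a, b, c] += 1
        p /= p.sum()
        p_yt_xt = p.sum(axis=0, keepdims=True)
        p_y1_yt = p.sum(axis=2, keepdims=True)
        p_yt = p.sum(axis=(0, 2), keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = (p * p_yt) / (p_yt_xt * p_y1_yt)
            terms = np.where(p > 0, p * np.log2(ratio), 0.0)
        return float(terms.sum())

    rng = np.random.default_rng(0)
    spikes = (rng.random((5, 2000)) < 0.2).astype(int)     # 5 toy neurons
    spikes[1, 1:] = spikes[0, :-1] | spikes[1, 1:]         # neuron 0 drives neuron 1
    te = np.array([[0.0 if i == j else transfer_entropy(spikes[i], spikes[j])
                    for j in range(5)] for i in range(5)])
    print("strongest directed edge (source, target):", np.unravel_index(te.argmax(), te.shape))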

cross Random forest-based out-of-distribution detection for robust lung cancer segmentation

Authors: Aneesh Rangnekar, Harini Veeraraghavan

Abstract: Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from a pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, with a convolution decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 3D CT public datasets that included one ID dataset and four OOD datasets comprising chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with a FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance reliability of cancer segmentation in ID and OOD scenarios.
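A minimal sketch of the gating stage: a scikit-learn random forest trained on pooled encoder features to separate ID from OOD scans, with its probability used to decide whether to trust the segmentation. The random features below stand in for the frozen Swin encoder outputs described above.

    # Sketch of the OOD gate: a random forest on pooled encoder features decides
    # whether a scan is in-distribution before trusting the segmentation. The
    # random features stand in for frozen Swin encoder outputs.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    feat_id = rng.normal(loc=0.0, size=(400, 256))    # pooled features, ID scans
    feat_ood = rng.normal(loc=1.5, size=(400, 256))   # shifted features, OOD scans
    X = np.vstack([feat_id, feat_ood])
    y = np.array([0] * 400 + [1] * 400)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    new_scan = rng.normal(loc=1.5, size=(1, 256))
    p_ood = clf.predict_proba(new_scan)[0, 1]
    print("OOD probability:", round(p_ood, 2), "-> flag scan" if p_ood > 0.5 else "-> segment")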

cross ZeST: an LLM-based Zero-Shot Traversability Navigation for Unknown Environments

Authors: Shreya Gummadi, Mateus V. Gasparino, Gianluca Capezzuto, Marcelo Becker, Girish Chowdhary

Abstract: The advancement of robotics and autonomous navigation systems hinges on the ability to accurately predict terrain traversability. Traditional methods for generating datasets to train these prediction models often involve putting robots into potentially hazardous environments, posing risks to equipment and safety. To solve this problem, we present ZeST, a novel approach leveraging visual reasoning capabilities of Large Language Models (LLMs) to create a traversability map in real-time without exposing robots to danger. Our approach not only performs zero-shot traversability estimation and mitigates the risks associated with real-world data collection but also accelerates the development of advanced navigation systems, offering a cost-effective and scalable solution. To support our findings, we present navigation results in both controlled indoor and unstructured outdoor environments. As shown in the experiments, our method provides safer navigation when compared to other state-of-the-art methods, consistently reaching the final goal.

cross A Bag of Tricks for Efficient Implicit Neural Point Clouds

Authors: Florian Hahlbohm, Linus Franke, Leon Overk\"amping, Paula Wespe, Susana Castillo, Martin Eisemann, Marcus Magnor

Abstract: Implicit Neural Point Cloud (INPC) is a recent hybrid representation that combines the expressiveness of neural fields with the efficiency of point-based rendering, achieving state-of-the-art image quality in novel view synthesis. However, as with other high-quality approaches that query neural networks during rendering, the practical usability of INPC is limited by comparatively slow rendering. In this work, we present a collection of optimizations that significantly improve both the training and inference performance of INPC without sacrificing visual fidelity. The most significant modifications are an improved rasterizer implementation, more effective sampling techniques, and the incorporation of pre-training for the convolutional neural network used for hole-filling. Furthermore, we demonstrate that points can be modeled as small Gaussians during inference to further improve quality in extrapolated, e.g., close-up views of the scene. We design our implementations to be broadly applicable beyond INPC and systematically evaluate each modification in a series of experiments. Our optimized INPC pipeline achieves up to 25% faster training, 2x faster rendering, and 20% reduced VRAM usage paired with slight image quality improvements.

cross RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration

Authors: Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo, Jie Hu, Xinghao Chen

Abstract: We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation. These models process lossy sRGB inputs and neglect the availability of sensor RAW images in many scenarios, e.g., image and video capture on edge devices, resulting in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + IR pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain confronts out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) learning optimal latent representations, (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the scarcity of RAW training data, we develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB, BGGR, etc. Extensive experiments demonstrate RDDM's superiority over state-of-the-art sRGB diffusion methods, yielding higher-fidelity results with fewer artifacts.

cross MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Authors: Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang

Abstract: Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA

URLs: https://shihao1895.github.io/MemoryVLA

replace PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation

Authors: Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, Kwanghoon Sohn

Abstract: Online stereo adaptation tackles the domain shift problem, caused by different environments between synthetic (training) and real (test) datasets, to promptly adapt stereo models in dynamic real-world applications such as autonomous driving. However, previous methods often fail to counteract particular regions related to dynamic objects with more severe environmental changes. To mitigate this issue, we propose to incorporate an auxiliary point-selective network into a meta-learning framework, called PointFix, to provide a robust initialization of stereo models for online stereo adaptation. In a nutshell, our auxiliary network learns to fix local variants intensively by effectively back-propagating local information through the meta-gradient for the robust initialization of the baseline model. This network is model-agnostic, so it can be used with any kind of architecture in a plug-and-play manner. We conduct extensive experiments to verify the effectiveness of our method under three adaptation settings: short-, mid-, and long-term sequences. Experimental results show that the proper initialization of the base stereo model by the auxiliary network enables our learning paradigm to achieve state-of-the-art performance at inference.

replace DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models

Authors: Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn

Abstract: In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose a multimodal T2I diffusion model, which is capable of processing all three modalities within a single architecture without modifying the parameters of the pre-trained diffusion model, as only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis. The code and demo are available at https://github.com/sungnyun/diffblender.

URLs: https://github.com/sungnyun/diffblender.

replace Memory augment is All You Need for image restoration

Authors: Xiao Feng Zhang, Chao Chen Gu, Shan Ying Zhu

Abstract: Image restoration is a low-level vision task, and most CNN methods are designed as black boxes, lacking transparency and interpretability. Although some methods combining traditional optimization algorithms with DNNs have been proposed, they all have some limitations. In this paper, we propose MemoryNet, which combines a three-granularity memory layer with contrastive learning: samples are divided into positive, negative, and actual (anchor) samples, the memory layer preserves the deep features of the image, and the contrastive learning drives the learned features toward a balanced representation. Experiments on Derain/Deshadow/Deblur tasks demonstrate that these methods are effective in improving restoration performance. In addition, our model obtains significant PSNR and SSIM gains on three datasets with different degradation types, which is strong evidence that the recovered images are perceptually realistic. The source code of MemoryNet can be obtained from https://github.com/zhangbaijin/MemoryNet

URLs: https://github.com/zhangbaijin/MemoryNet

replace Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal

Authors: Xiao Feng Zhang, Tian Yi Song, Jia Wei Yao

Abstract: Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faces challenges in distinguishing shadows from their backgrounds. To address this, we developed Deshadow-Anything: taking advantage of the generalization afforded by large-scale datasets, we fine-tune the model on them to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, helping to remove shadows while preserving the details of the image. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training speed of diffusion. Experiments on shadow removal tasks demonstrate that these methods can effectively improve image restoration performance.

replace Learning county from pixels: corn yield prediction with attention-weighted multiple instance learning

Authors: Xiaoyu Wang, Yuchi Ma, Qunying Huang, Zhengwei Yang, Zhou Zhang

Abstract: Remote sensing technology has become a promising tool in yield prediction. Most prior work employs satellite imagery for county-level corn yield prediction by spatially aggregating all pixels within a county into a single value, potentially overlooking the detailed information and valuable insights offered by more granular data. To this end, this research examines each county at the pixel level and applies multiple instance learning to leverage detailed information within a county. In addition, our method addresses the "mixed pixel" issue caused by the inconsistent resolution between feature datasets and crop mask, which may introduce noise into the model and therefore hinder accurate yield prediction. Specifically, the attention mechanism is employed to automatically assign weights to different pixels, which can mitigate the influence of mixed pixels. The experimental results show that the developed model outperforms four other machine learning models over the past five years in the U.S. corn belt and demonstrates its best performance in 2022, achieving a coefficient of determination (R2) value of 0.84 and a root mean square error (RMSE) of 0.83. This paper demonstrates the advantages of our approach from both spatial and temporal perspectives. Furthermore, through an in-depth study of the relationship between mixed pixels and attention, it is verified that our approach can capture critical feature information while filtering out noise from mixed pixels.
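A minimal sketch of attention-weighted MIL pooling over a county's pixel-level features, in the spirit of the attention mechanism described above; the feature dimensionality, two-layer attention scorer, and linear yield head are illustrative choices, not the paper's exact architecture.

    # Attention-weighted MIL pooling: each pixel (instance) in a county bag gets
    # a learned weight; the weighted sum feeds a yield regressor. Dimensions and
    # the two-layer scorer are illustrative, not the paper's exact architecture.
    import torch
    import torch.nn as nn

    class AttentionMILRegressor(nn.Module):
        def __init__(self, feat_dim=32, hidden=64):
            super().__init__()
            self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            self.head = nn.Linear(feat_dim, 1)

        def forward(self, bag):                              # bag: (num_pixels, feat_dim)
            weights = torch.softmax(self.attn(bag), dim=0)   # mixed pixels can be down-weighted
            county_embedding = (weights * bag).sum(dim=0)
            return self.head(county_embedding), weights.squeeze(-1)

    model = AttentionMILRegressor()
    pixels = torch.randn(500, 32)                            # one county's pixel-level features
    pred_yield, pixel_weights = model(pixels)
    print(pred_yield.item(), pixel_weights.shape)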

replace Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment

Authors: Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang

Abstract: Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.

replace Meta-Learned Modality-Weighted Knowledge Distillation for Robust Multi-Modal Learning with Missing Data

Authors: Hu Wang, Salma Hassan, Yuyuan Liu, Congbo Ma, Yuanhong Chen, Qing Li, Jiahui Geng, Bingjie Wang, Yu Tian, Yutong Xie, Jodie Avery, Louise Hull, Ian Reid, Mohammad Yaqub, Gustavo Carneiro

Abstract: In multi-modal learning, some modalities are more influential than others, and their absence can have a significant impact on classification/segmentation accuracy. Addressing this challenge, we propose a novel approach called Meta-learned Modality-weighted Knowledge Distillation (MetaKD), which enables multi-modal models to maintain high accuracy even when key modalities are missing. MetaKD adaptively estimates the importance weight of each modality through a meta-learning process. These learned importance weights guide a pairwise modality-weighted knowledge distillation process, allowing high-importance modalities to transfer knowledge to lower-importance ones, resulting in robust performance despite missing inputs. Unlike previous methods in the field, which are often task-specific and require significant modifications, our approach is designed to work in multiple tasks (e.g., segmentation and classification) with minimal adaptation. Experimental results on five prevalent datasets, including three Brain Tumor Segmentation datasets (BraTS2018, BraTS2019 and BraTS2020), the Alzheimer's Disease Neuroimaging Initiative (ADNI) classification dataset and the Audiovision-MNIST classification dataset, demonstrate the proposed model is able to outperform the compared models by a large margin. The code is available at https://github.com/billhhh/MetaKD.

URLs: https://github.com/billhhh/MetaKD.

replace MicroMIL: Graph-Based Multiple Instance Learning for Context-Aware Diagnosis with Microscopic Images

Authors: Jongwoo Kim, Bryan Wong, Huazhu Fu, Willmer Rafell Qui\~nones, Youngsin Ko, Mun Yong Yi

Abstract: Cancer diagnosis has greatly benefited from the integration of whole-slide images (WSIs) with multiple instance learning (MIL), enabling high-resolution analysis of tissue morphology. Graph-based MIL (GNN-MIL) approaches have emerged as powerful solutions for capturing contextual information in WSIs, thereby improving diagnostic accuracy. However, WSIs require significant computational and infrastructural resources, limiting accessibility in resource-constrained settings. Conventional light microscopes offer a cost-effective alternative, but applying GNN-MIL to such data is challenging due to extensive redundant images and missing spatial coordinates, which hinder contextual learning. To address these issues, we introduce MicroMIL, the first weakly-supervised MIL framework specifically designed for images acquired from conventional light microscopes. MicroMIL leverages a representative image extractor (RIE) that employs deep cluster embedding (DCE) and hard Gumbel-Softmax to dynamically reduce redundancy and select representative images. These images serve as graph nodes, with edges computed via cosine similarity, eliminating the need for spatial coordinates while preserving contextual information. Extensive experiments on a real-world colon cancer dataset and the BreakHis dataset demonstrate that MicroMIL achieves state-of-the-art performance, improving both diagnostic accuracy and robustness to redundancy. The code is available at https://github.com/kimjongwoo-cell/MicroMIL

URLs: https://github.com/kimjongwoo-cell/MicroMIL

replace FUSELOC: Fusing Global and Local Descriptors to Disambiguate 2D-3D Matching in Visual Localization

Authors: Son Tung Nguyen, Alejandro Fontan, Michael Milford, Tobias Fischer

Abstract: Hierarchical visual localization methods achieve state-of-the-art accuracy but require substantial memory as they need to store all database images. Direct 2D-3D matching requires significantly less memory but suffers from lower accuracy due to the larger and more ambiguous search space. We address this ambiguity by fusing local and global descriptors using a weighted average operator. This operator rearranges the local descriptor space so that geographically nearby local descriptors are closer in the feature space according to the global descriptors. This decreases the number of irrelevant competing descriptors, especially if they are geographically distant, thus increasing the correct matching likelihood. We consistently improve the accuracy over local-only systems, and we achieve performance close to hierarchical methods while using 43\% less memory and running 1.6 times faster. Extensive experiments on four challenging datasets -- Cambridge Landmarks, Aachen Day/Night, RobotCar Seasons, and Extended CMU Seasons -- demonstrate that, for the first time, direct matching algorithms can benefit from global descriptors without compromising computational efficiency. Our code is available at \href{https://github.com/sontung/descriptor-disambiguation}{https://github.com/sontung/descriptor-disambiguation}.

URLs: https://github.com/sontung/descriptor-disambiguation
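A minimal sketch of the weighted-average fusion of local and global descriptors described above, assuming both descriptor types have been projected to a common dimensionality D; the weight alpha and the re-normalization step are illustrative choices, not the paper's exact operator.

```python
import numpy as np

def fuse_descriptors(local_desc, global_desc, alpha=0.3):
    """Weighted-average fusion of local and global descriptors (sketch).

    local_desc:  [N, D] local descriptors from one image.
    global_desc: [D]    global (image-level) descriptor of the same image.
    alpha:       weight given to the global component (assumed value).
    Broadcasting the global descriptor onto every local descriptor pulls
    descriptors from geographically nearby images closer in feature space.
    """
    fused = (1.0 - alpha) * local_desc + alpha * global_desc[None, :]
    # re-normalize so nearest-neighbour matching can keep using cosine/L2 distance
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)

local = np.random.randn(500, 128)
glob = np.random.randn(128)
print(fuse_descriptors(local, glob).shape)
```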

replace MCGS: Multiview Consistency Enhancement for Sparse-View 3D Gaussian Radiance Fields

Authors: Yuru Xiao, Deming Zhai, Wenbo Zhao, Kui Jiang, Junjun Jiang, Xianming Liu

Abstract: Radiance fields represented by 3D Gaussians excel at synthesizing novel views, offering both high training efficiency and fast rendering. However, with sparse input views, the lack of multi-view consistency constraints results in poorly initialized Gaussians and unreliable heuristics for optimization, leading to suboptimal performance. Existing methods often incorporate depth priors from dense estimation networks but overlook the inherent multi-view consistency in input images. Additionally, they rely on dense initialization, which limits the efficiency of scene representation. To overcome these challenges, we propose a view synthesis framework based on 3D Gaussian Splatting, named MCGS, enabling photorealistic scene reconstruction from sparse views. The key innovations of MCGS in enhancing multi-view consistency are as follows: i) We leverage matching priors from a sparse matcher to initialize Gaussians primarily on textured regions, while low-texture areas are populated with randomly distributed Gaussians. This yields a compact yet sufficient set of initial Gaussians. ii) We propose a multi-view consistency-guided progressive pruning strategy to dynamically eliminate inconsistent Gaussians. This approach confines their optimization to a consistency-constrained space, which ensures robust and coherent scene reconstruction. These strategies enhance robustness to sparse views, accelerate rendering, and reduce memory consumption, making MCGS a practical framework for 3D Gaussian Splatting.

replace Benchmarking XAI Explanations with Human-Aligned Evaluations

Authors: R\'emi Kazmierczak, Steve Azzolin, Elo\"ise Berthier, Anna Hedstr\"om, Patricia Delhomme, David Filliat, Nicolas Bousquet, Goran Frehse, Massimiliano Mancini, Baptiste Caramiaux, Andrea Passerini, Gianni Franchi

Abstract: We introduce PASTA (Perceptual Assessment System for explanaTion of Artificial Intelligence), a novel human-centric framework for evaluating eXplainable AI (XAI) techniques in computer vision. Our first contribution is the creation of the PASTA-dataset, the first large-scale benchmark that spans a diverse set of models and both saliency-based and concept-based explanation methods. This dataset enables robust, comparative analysis of XAI techniques based on human judgment. Our second contribution is an automated, data-driven benchmark that predicts human preferences using the PASTA-dataset. This scoring method, called the PASTA-score, offers scalable, reliable, and consistent evaluation aligned with human perception. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. We then propose to apply our scoring method to probe the interpretability of existing models and to build more human-interpretable XAI methods.

replace VAGUE: Visual Contexts Clarify Ambiguous Expressions

Authors: Heejeong Nam, Jinwoo Ahn, Keummin Ka, Jiwan Chung, Youngjae Yu

Abstract: Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent. While performance consistently improves with the introduction of more visual cues, overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them. We release our code and data at https://hazel-heejeong-nam.github.io/vague/.

URLs: https://hazel-heejeong-nam.github.io/vague/.

replace ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding. Recently, with the integration of test-time scaling techniques, these models have also shown strong potential in visual reasoning. However, most existing reasoning approaches remain text-level in nature: MLLMs are prompted to explore various combinations of textual tokens via their underlying language model, while the visual input remains fixed throughout the reasoning process. This paradigm limits the model's ability to fully exploit rich visual information, particularly when dealing with images containing numerous fine-grained elements. In such cases, vision-level reasoning becomes crucial - where models dynamically zoom into specific regions of the image to gather detailed visual cues necessary for accurate decision-making. In this paper, we propose Zoom Eye, a training-free, model-agnostic tree search algorithm tailored for vision-level reasoning. Zoom Eye treats an image as a hierarchical tree structure, where each child node represents a zoomed-in sub-region of its parent, and the root corresponds to the full image. The algorithm enables MLLMs to simulate human-like zooming behavior by navigating from root to leaf nodes in search of task-relevant visual evidence. We experiment on a series of high-resolution benchmarks and the results demonstrate that Zoom Eye consistently improves the performance of multiple MLLMs by a large margin (e.g., InternVL2.5-8B increases by 15.71% and 17.69% on HR-Bench) and also enables small 3-8B MLLMs to outperform strong large models such as GPT-4o. Code: https://github.com/om-ai-lab/ZoomEye

URLs: https://github.com/om-ai-lab/ZoomEye
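A simplified, greedy sketch of the zooming idea: an image region is recursively split into quadrants and the most task-relevant child is explored. The actual Zoom Eye algorithm performs a tree search guided by MLLM confidence signals; the quadrant split, the greedy selection, and the placeholder score_fn are assumptions for illustration only.

```python
def zoom_search(box, score_fn, max_depth=3, min_size=224):
    """Greedy zoom-in sketch over a hierarchical image tree.

    box:      (x0, y0, x1, y1) pixel coordinates of the current region.
    score_fn: callable mapping a box to a task-relevance score (a stand-in
              for the model-provided confidence used by the real algorithm).
    """
    x0, y0, x1, y1 = box
    if max_depth == 0 or min(x1 - x0, y1 - y0) <= min_size:
        return box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    children = [(x0, y0, mx, my), (mx, y0, x1, my),
                (x0, my, mx, y1), (mx, my, x1, y1)]
    best = max(children, key=score_fn)          # zoom into the most relevant quadrant
    return zoom_search(best, score_fn, max_depth - 1, min_size)

# toy usage with a dummy relevance scorer that prefers the top-left region
best_region = zoom_search((0, 0, 2048, 2048), score_fn=lambda b: -(b[0] + b[1]))
print(best_region)
```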

replace Human Vision Constrained Super-Resolution

Authors: Volodymyr Karpenko, Taimoor Tariq, Jorge Condor, Piotr Didyk

Abstract: Modern deep-learning super-resolution (SR) techniques process images and videos independently of the underlying content and viewing conditions. However, the sensitivity of the human visual system (HVS) to image details changes depending on the underlying image characteristics, such as spatial frequency, luminance, color, contrast, or motion, as well as viewing-condition aspects such as ambient lighting and distance to the display. This observation suggests that computational resources spent on up-sampling images/videos may be wasted whenever a viewer cannot resolve the synthesized details, i.e., when the resolution of details exceeds the resolving capability of human vision. Motivated by this observation, we propose a human-vision-inspired and architecture-agnostic approach for controlling SR techniques to deliver visually optimal results while limiting computational complexity. Its core is an explicit Human Visual Processing Framework (HVPF) that dynamically and locally guides SR methods according to human sensitivity to specific image details and viewing conditions. We demonstrate the application of our framework in combination with network branching to improve the computational efficiency of SR methods. Quantitative and qualitative evaluations, including user studies, demonstrate the effectiveness of our approach in reducing FLOPS by factors of 2$\times$ and greater, without sacrificing perceived quality.

replace Incremental Multi-Scene Modeling via Continual Neural Graphics Primitives

Authors: Prajwal Singh, Ashish Tiwari, Gautam Vashishtha, Shanmuganathan Raman

Abstract: Neural radiance fields (NeRF) have revolutionized photorealistic rendering of novel views for 3D scenes. Despite their growing popularity and efficiency as 3D resources, NeRFs face scalability challenges due to the need for separate models per scene and the cumulative increase in training time for multiple scenes. The potential for incrementally encoding multiple 3D scenes into a single NeRF model remains largely unexplored. To address this, we introduce Continual-Neural Graphics Primitives (C-NGP), a novel continual learning framework that integrates multiple scenes incrementally into a single neural radiance field. Using a generative replay approach, C-NGP adapts to new scenes without requiring access to old data. We demonstrate that C-NGP can accommodate multiple scenes without increasing the parameter count, producing high-quality novel-view renderings on synthetic and real datasets. Notably, C-NGP models all $8$ scenes from the Real-LLFF dataset together, with only a $2.2\%$ drop in PSNR compared to vanilla NeRF, which models each scene independently. Further, C-NGP allows multiple style edits in the same network.

replace Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

Authors: Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Chenyang Yu, Hanqi Liu, Jiaming Zuo, Jinsong Ouyang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu

Abstract: As large-scale foundation models trained on billions of image--mask pairs covering a vast diversity of scenes, objects, and contexts, SAM and its upgraded version, SAM~2, have significantly influenced multiple fields within computer vision. Leveraging such unprecedented data diversity, they exhibit strong open-world segmentation capabilities, with SAM~2 further enhancing these capabilities to support high-quality video segmentation. While SAMs (SAM and SAM~2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, industrial defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM~2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM~2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks.

replace StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models

Authors: Yunzhi Yan, Zhen Xu, Haotong Lin, Haian Jin, Haoyu Guo, Yida Wang, Kun Zhan, Xianpeng Lang, Hujun Bao, Xiaowei Zhou, Sida Peng

Abstract: This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis while preserving precise camera control. Moreover, the utilization of pixel-level LiDAR conditions allows us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on the Waymo Open Dataset and PandaSet demonstrate that our model enables flexible control over viewpoint changes and enlarges the view synthesis regions to produce satisfactory renderings, outperforming existing methods.

replace CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers

Authors: Dimitrios Mallis, Ahmet Serdar Karadeniz, Sebastian Cavada, Danila Rukhovich, Niki Foteinopoulou, Kseniya Cherenkova, Anis Kacem, Djamila Aouada

Abstract: We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design. Our approach is based on a powerful Vision and Large Language Model (VLLM) as a planner and a tool-augmentation paradigm using CAD-specific tools. CAD-Assistant addresses multimodal user queries by generating actions that are iteratively executed on a Python interpreter equipped with the FreeCAD software, accessed via its Python API. Our framework is able to assess the impact of generated CAD commands on geometry and adapts subsequent actions based on the evolving state of the CAD design. We consider a wide range of CAD-specific tools including a sketch image parameterizer, rendering modules, a 2D cross-section generator, and other specialized routines. CAD-Assistant is evaluated on multiple CAD benchmarks, where it outperforms VLLM baselines and supervised task-specific methods. Beyond existing benchmarks, we qualitatively demonstrate the potential of tool-augmented VLLMs as general-purpose CAD solvers across diverse workflows.

replace Survey on Monocular Metric Depth Estimation

Authors: Jiuling Zhang

Abstract: Monocular Depth Estimation (MDE) enables spatial understanding, 3D reconstruction, and autonomous navigation, yet deep learning approaches often predict only relative depth without a consistent metric scale. This limitation reduces reliability in applications such as visual SLAM, precise 3D modeling, and view synthesis. Monocular Metric Depth Estimation (MMDE) overcomes this challenge by producing depth maps with absolute scale, ensuring geometric consistency and enabling deployment without additional calibration. This survey reviews the evolution of MMDE, from geometry-based methods to state-of-the-art deep models, with emphasis on the datasets that drive progress. Key benchmarks, including KITTI, NYU-D, ApolloScape, and TartanAir, are examined in terms of modality, scene type, and application domain. Methodological advances are analyzed, covering domain generalization, boundary preservation, and the integration of synthetic and real data. Techniques such as unsupervised and semi-supervised learning, patch-based inference, architectural innovations, and generative modeling are evaluated for their strengths and limitations. By synthesizing current progress, highlighting the importance of high-quality datasets, and identifying open challenges, this survey provides a structured reference for advancing MMDE and supporting its adoption in real-world computer vision systems.

replace Single-Domain Generalized Object Detection by Balancing Domain Diversity and Invariance

Authors: Zhenwei He, Hongsu Ni

Abstract: Single-domain generalization for object detection (S-DGOD) seeks to transfer learned representations from a single source domain to unseen target domains. While recent approaches have primarily focused on achieving feature invariance, they ignore that domain diversity also presents significant challenges for the task. First, such invariance-driven strategies often lead to the loss of domain-specific information, resulting in incomplete feature representations. Second, cross-domain feature alignment forces the model to overlook domain-specific discrepancies, thereby increasing the complexity of the training process. To address these limitations, this paper proposes the Diversity Invariant Detection Model (DIDM), which achieves a harmonious integration of domain-specific diversity and domain invariance. Our key idea is to learn the invariant representations by keeping the inherent domain-specific features. Specifically, we introduce the Diversity Learning Module (DLM). This module limits the invariant semantics while explicitly enhancing domain-specific feature representation through a proposed feature diversity loss. Furthermore, to ensure cross-domain invariance without sacrificing diversity, we incorporate the Weighted Aligning Module (WAM) to enable feature alignment while maintaining the discriminative domain-specific information. Extensive experiments on multiple diverse datasets demonstrate the effectiveness of the proposed model, achieving superior performance compared to existing methods.

replace PromptGAR: Flexible Promptive Group Activity Recognition

Authors: Zhangyu Jin, Andrew Feng, Ankur Chemburkar, Celso M. De Melo

Abstract: We present PromptGAR, a novel framework for Group Activity Recognition (GAR) that offers both input flexibility and high recognition accuracy. Existing approaches suffer from limited real-world applicability due to their reliance on full prompt annotations, a fixed number of frames and instances, and a lack of actor consistency. To bridge this gap, we propose PromptGAR, the first GAR model to provide input flexibility across prompts, frames, and instances without the need for retraining. We leverage diverse visual prompts, such as bounding boxes, skeletal keypoints, and instance identities, by unifying them as point prompts. A recognition decoder then cross-updates class and prompt tokens for enhanced performance. To ensure actor consistency over extended activity durations, we also introduce a relative instance attention mechanism that directly encodes instance identities. Comprehensive evaluations demonstrate that PromptGAR achieves competitive performance on both full and partial prompt inputs, establishing its effectiveness in input flexibility and generalization for real-world applications.

replace Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness

Authors: Beier Zhu, Jiequan Cui, Hanwang Zhang, Chi Zhang

Abstract: While image-text foundation models have succeeded across diverse downstream tasks, they still face challenges in the presence of spurious correlations between the input and label. To address this issue, we propose a simple three-step approach, Project-Probe-Aggregate (PPA), that enables parameter-efficient fine-tuning for foundation models without relying on group annotations. Building upon the failure-based debiasing scheme, our method, PPA, improves its two key components: minority sample identification and the robust training algorithm. Specifically, we first train biased classifiers by projecting image features onto the nullspace of class proxies from text encoders. Next, we infer group labels using the biased classifier and probe group targets with prior correction. Finally, we aggregate group weights of each class to produce the debiased classifier. Our theoretical analysis shows that PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error, mitigating spurious correlations. Extensive experimental results confirm the effectiveness of PPA: it outperforms the state-of-the-art in average worst-group accuracy while requiring less than 0.01% tunable parameters and no training group labels.
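A minimal sketch of the first step described above, projecting image features onto the nullspace of the text-encoder class proxies before training the biased classifier; the QR-based construction of the projector is one standard way to realize this and is an assumption, not necessarily the authors' implementation.

```python
import numpy as np

def nullspace_projection(features, class_proxies):
    """Project image features onto the nullspace of the class-proxy directions.

    features:      [N, D] image embeddings.
    class_proxies: [C, D] text-encoder class embeddings (one proxy per class).
    Removing the class-proxy subspace leaves mostly class-irrelevant (potentially
    spurious) directions, on which a biased classifier can then be trained.
    """
    # orthonormal basis of the subspace spanned by the class proxies
    q, _ = np.linalg.qr(class_proxies.T)           # [D, C]
    projector = np.eye(features.shape[1]) - q @ q.T  # nullspace projector
    return features @ projector

feats = np.random.randn(8, 512)
proxies = np.random.randn(10, 512)
print(nullspace_projection(feats, proxies).shape)
```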

replace Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering

Authors: Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, Basura Fernando

Abstract: We introduce PKR-QA (Procedural Knowledge Reasoning Question Answering), a new benchmark for question answering over procedural tasks that require structured reasoning. PKR-QA is constructed semi-automatically using a procedural knowledge graph (PKG), which encodes task-specific knowledge across diverse domains. The PKG is built by curating and linking information from the COIN instructional video dataset and the ontology, enriched with commonsense knowledge from ConceptNet and structured outputs from Large Language Models (LLMs), followed by manual verification. To generate question-answer pairs, we design graph traversal templates where each template is applied systematically over PKG. To enable interpretable reasoning, we propose a neurosymbolic approach called Knowledge Module Learning (KML), which learns procedural relations via neural modules and composes them for structured reasoning with LLMs. Experiments demonstrate that this paradigm improves reasoning performance on PKR-QA and enables step-by-step reasoning traces that facilitate interpretability. Code and dataset will be released soon https://github.com/LUNAProject22/KML.

URLs: https://github.com/LUNAProject22/KML.

replace Faster Parameter-Efficient Tuning with Token Redundancy Reduction

Authors: Kwonyoung Kim, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn

Abstract: Parameter-efficient tuning (PET) aims to transfer pre-trained foundation models to downstream tasks by learning a small number of parameters. Compared to traditional fine-tuning, which updates the entire model, PET significantly reduces storage and transfer costs for each task regardless of exponentially increasing pre-trained model capacity. However, most PET methods inherit the inference latency of their large backbone models and often introduce extra computational overhead from added modules (e.g., adapters), limiting their practicality for compute-intensive applications. In this paper, we propose Faster Parameter-Efficient Tuning (FPET), a novel approach that enhances inference speed and training efficiency while maintaining high storage efficiency. Specifically, we introduce a plug-and-play token redundancy reduction module delicately designed for PET. This module refines tokens from the self-attention layer using an adapter to learn the accurate similarity between tokens and cuts off the tokens through a fully-differentiable token merging strategy, which uses a straight-through estimator for optimal token reduction. Experimental results prove that our FPET achieves faster inference and higher memory efficiency than the pre-trained backbone while keeping competitive performance on par with state-of-the-art PET methods.

replace LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

Authors: Xiang Hu, Yuhao Wang, Pingping Zhang, Huchuan Lu

Abstract: As an important task in intelligent transportation systems, Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different viewpoints. Previous methods typically adopt deep learning-based models, focusing on extracting view-invariant features. However, they usually overlook the semantic information in person attributes. In addition, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract both global semantic features and attribute-aware features from input images. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to predict person attributes and obtain attribute representations. Finally, we design a Coupled Prompt Template (CPT) to transform attribute representations and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed methods. The source code will be available.

replace M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Authors: Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang

Abstract: Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose \textbf{M$^2$IV}, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce \textbf{VLibrary}, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74\% with substantial improvements in overall efficiency.

replace A Hybrid Fully Convolutional CNN-Transformer Model for Inherently Interpretable Disease Detection from Retinal Fundus Images

Authors: Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens

Abstract: In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for retinal disease detection. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model's decision process. We evaluated our method on two medical tasks focused on disease detection using color fundus images. Our model achieves state-of-the-art predictive performance compared to black-box and interpretable models and provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.

URLs: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.

replace ForgetMe: Evaluating Selective Forgetting in Generative Models

Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang

Abstract: The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.

replace WMKA-Net: A Weighted Multi-Kernel Attention Network for Retinal Vessel Segmentation

Authors: Xinran Xu, Yuliang Ma, Sifu Cai, Ruoyan Shi

Abstract: Retinal vessel segmentation is crucial for intelligent ophthalmic diagnosis, yet it faces three major challenges: insufficient multi-scale feature fusion, disruption of contextual continuity, and noise interference. This study proposes a dual-stage solution to address these issues. The first stage employs a Reversible Multi-Scale Fusion Module (RMS) that uses hierarchical adaptive convolution to dynamically merge cross-scale features from capillaries to main vessels, self-adaptively calibrating feature biases. The second stage introduces a Vascular-Oriented Attention Mechanism, which models long-distance vascular continuity through an axial pathway and enhances the capture of topological key nodes, such as bifurcation points, via a dedicated bifurcation attention pathway. The synergistic operation of these two pathways effectively restores the continuity of vascular structures and improves the segmentation accuracy of complex vascular networks. Systematic experiments on the DRIVE, STARE, and CHASE-DB1 datasets demonstrate that WMKA-Net achieves an accuracy of 0.9909, sensitivity of 0.9198, and specificity of 0.9953, significantly outperforming existing methods. This model provides an efficient, precise, and robust intelligent solution for the early screening of diabetic retinopathy.

replace Decoupled Global-Local Alignment for Improving Compositional Understanding

Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang

Abstract: Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, it effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA

URLs: https://github.com/xiaoxing2001/DeGLA
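A minimal sketch of the frozen exponential-moving-average teacher used for self-distillation in the global alignment stage; the momentum value and the placement of the update after each optimizer step are assumptions, not the paper's exact setting.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Update a frozen EMA teacher from the learnable student encoder (sketch).

    teacher, student: nn.Module instances with identical architectures.
    momentum: EMA decay; 0.999 is an assumed value.
    """
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# usage inside a training loop (after the optimizer update of the student):
# ema_update(teacher_encoder, student_encoder)
```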

replace PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition

Authors: Kai Cui, Jia Li, Yu Liu, Xuesong Zhang, Zhenzhen Hu, Meng Wang

Abstract: Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.

replace Video CLIP Model for Multi-View Echocardiography Interpretation

Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda

Abstract: Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each varying in suitability for detecting specific conditions. Leveraging multiple views may therefore improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models.

replace RAFT: Robust Augmentation of FeaTures for Image Segmentation

Authors: Edward Humes, Xiaomin Lin, Uttej Kallakuri, Tinoosh Mohsenin

Abstract: Image segmentation is a powerful computer vision technique for scene understanding. However, real-world deployment is stymied by the need for high-quality, meticulously labeled datasets. Synthetic data provides high-quality labels while reducing the need for manual data collection and annotation. However, deep neural networks trained on synthetic data often face the Syn2Real problem, leading to poor performance in real-world deployments. To mitigate the aforementioned gap in image segmentation, we propose RAFT, a novel framework for adapting image segmentation models using minimal labeled real-world data through data and feature augmentations, as well as active learning. To validate RAFT, we perform experiments on the synthetic-to-real "SYNTHIA->Cityscapes" and "GTAV->Cityscapes" benchmarks. We managed to surpass the previous state of the art, HALO. SYNTHIA->Cityscapes experiences an improvement in mIoU* upon domain adaptation of 2.1%/79.9%, and GTAV->Cityscapes experiences a 0.4%/78.2% improvement in mIoU. Furthermore, we test our approach on the real-to-real benchmark of "Cityscapes->ACDC", and again surpass HALO, with a gain in mIoU upon adaptation of 1.3%/73.2%. Finally, we examine the effect of the allocated annotation budget and various components of RAFT upon the final transfer mIoU.

replace MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection

Authors: Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, Xiaoming Liu

Abstract: Accurately predicting 3D attributes is crucial for monocular 3D object detection (Mono3D), with depth estimation posing the greatest challenge due to the inherent ambiguity in mapping 2D images to 3D space. While existing methods leverage multiple depth cues (e.g., estimating depth uncertainty, modeling depth error) to improve depth accuracy, they overlook that accurate depth prediction requires conditioning on other 3D attributes, as these attributes are intrinsically inter-correlated through the 3D to 2D projection, which ultimately limits overall accuracy and stability. Inspired by Chain-of-Thought (CoT) in large language models (LLMs), this paper proposes MonoCoP, which leverages a Chain-of-Prediction (CoP) to predict attributes sequentially and conditionally via three key designs. First, it employs a lightweight AttributeNet (AN) for each 3D attribute to learn attribute-specific features. Next, MonoCoP constructs an explicit chain to propagate these learned features from one attribute to the next. Finally, MonoCoP uses a residual connection to aggregate features for each attribute along the chain, ensuring that later attribute predictions are conditioned on all previously processed attributes without forgetting the features of earlier ones. Experimental results show that our MonoCoP achieves state-of-the-art (SoTA) performance on the KITTI leaderboard without requiring additional data and further surpasses existing methods on the Waymo and nuScenes frontal datasets.
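A minimal sketch of the chain-of-prediction idea: a lightweight per-attribute network, explicit propagation of learned features along the chain, and a residual aggregation so that each prediction is conditioned on all earlier attributes. Layer sizes, the attribute order, and the scalar heads are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ChainOfPrediction(nn.Module):
    """Sequential, conditioned attribute prediction with residual aggregation."""

    def __init__(self, dim=256, attributes=("size", "rotation", "depth")):
        super().__init__()
        self.attributes = attributes
        self.attr_nets = nn.ModuleDict(
            {a: nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for a in attributes})
        self.heads = nn.ModuleDict({a: nn.Linear(dim, 1) for a in attributes})

    def forward(self, x):
        preds, chain = {}, torch.zeros_like(x)
        for a in self.attributes:
            feat = self.attr_nets[a](x + chain)   # condition on earlier attribute features
            chain = chain + feat                  # residual aggregation along the chain
            preds[a] = self.heads[a](feat)
        return preds

model = ChainOfPrediction()
out = model(torch.randn(2, 256))
print({k: v.shape for k, v in out.items()})
```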

replace EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion

Authors: Zichuan Yang, Yongzhi Wang

Abstract: Medical image classification is critical for clinical decision-making, yet demands for accuracy, interpretability, and generalizability remain challenging. This paper introduces EVM-Fusion, an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for multi-organ medical image classification. EVM-Fusion leverages a multipath design, where DenseNet and U-Net based pathways, enhanced by Vision Mamba (Vim) modules, operate in parallel with a traditional feature pathway. These diverse features are dynamically integrated via a two-stage fusion process: cross-modal attention followed by the iterative NAF block, which learns an adaptive fusion algorithm. Intrinsic explainability is embedded through path-specific spatial attention, Vim {\Delta}-value maps, traditional feature SE-attention, and cross-modal attention weights. Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion's strong classification performance, achieving 99.75% test accuracy, and provide multi-faceted insights into its decision-making process, highlighting its potential for trustworthy AI in medical diagnostics.

replace Generative Data Augmentation for Object Point Cloud Segmentation

Authors: Dekai Zhu, Stefan Gavranovic, Flavien Boussuge, Benjamin Busam, Slobodan Ilic

Abstract: Data augmentation is widely used to train deep learning models to address data scarcity. However, traditional data augmentation (TDA) typically relies on simple geometric transformation, such as random rotation and rescaling, resulting in minimal data diversity enrichment and limited model performance improvement. State-of-the-art generative models for 3D shape generation rely on the denoising diffusion probabilistic models and manage to generate realistic novel point clouds for 3D content creation and manipulation. Nevertheless, the generated 3D shapes lack associated point-wise semantic labels, restricting their usage in enlarging the training data for point cloud segmentation tasks. To bridge the gap between data augmentation techniques and the advanced diffusion models, we extend the state-of-the-art 3D diffusion model, Lion, to a part-aware generative model that can generate high-quality point clouds conditioned on given segmentation masks. Leveraging the novel generative model, we introduce a 3-step generative data augmentation (GDA) pipeline for point cloud segmentation training. Our GDA approach requires only a small amount of labeled samples but enriches the training data with generated variants and pseudo-labeled samples, which are validated by a novel diffusion-based pseudo-label filtering method. Extensive experiments on two large-scale synthetic datasets and a real-world medical dataset demonstrate that our GDA method outperforms the TDA approach and related semi-supervised and self-supervised methods.

replace WetCat: Enabling Automated Skill Assessment in Wet-Lab Cataract Surgery Videos

Authors: Negin Ghamsarian, Raphael Sznitman, Klaus Schoeffmann, Jens Kowal

Abstract: To meet the growing demand for systematic surgical training, wetlab environments have become indispensable platforms for hands-on practice in ophthalmology. Yet, traditional wetlab training depends heavily on manual performance evaluations, which are labor-intensive, time-consuming, and often subject to variability. Recent advances in computer vision offer promising avenues for automated skill assessment, enhancing both the efficiency and objectivity of surgical education. Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wetlab settings. To address these limitations, we introduce WetCat, the first dataset of wetlab cataract surgery videos specifically curated for automated skill assessment. WetCat comprises high-resolution recordings of surgeries performed by trainees on artificial eyes, featuring comprehensive phase annotations and semantic segmentations of key anatomical structures. These annotations are meticulously designed to facilitate skill assessment during the critical capsulorhexis and phacoemulsification phases, adhering to standardized surgical skill assessment frameworks. By focusing on these essential phases, WetCat enables the development of interpretable, AI-driven evaluation tools aligned with established clinical metrics. This dataset lays a strong foundation for advancing objective, scalable surgical education and sets a new benchmark for automated workflow analysis and skill assessment in ophthalmology training. The dataset and annotations are publicly available in Synapse https://www.synapse.org/Synapse:syn66401174/files.

URLs: https://www.synapse.org/Synapse:syn66401174/files.

replace Egocentric Human-Object Interaction Detection: A New Benchmark and Method

Authors: Kunyuan Deng, Yi Wang, Lap-Pui Chau

Abstract: Egocentric human-object interaction (Ego-HOI) detection is crucial for intelligent agents to understand and assist human activities from a first-person perspective. However, progress has been hindered by the lack of benchmarks and methods tailored to egocentric challenges such as severe hand-object occlusion. In this paper, we introduce the real-world Ego-HOI detection task and the accompanying Ego-HOIBench, a new dataset with over 27K egocentric images and explicit, fine-grained hand-verb-object triplet annotations across 123 categories. Ego-HOIBench covers diverse daily scenarios, object types, and both single- and two-hand interactions, offering a comprehensive testbed for Ego-HOI research. Benchmarking existing third-person HOI detectors on Ego-HOIBench reveals significant performance gaps, highlighting the need for egocentric-specific solutions. To this end, we propose Hand Geometry and Interactivity Refinement (HGIR), a lightweight, plug-and-play scheme that leverages hand pose and geometric cues to enhance interaction representations. Specifically, HGIR explicitly extracts global hand geometric features from the estimated hand pose proposals, and further refines interaction features through pose-interaction attention, enabling the model to focus on subtle hand-object relationship differences even under severe occlusion. HGIR significantly improves Ego-HOI detection performance across multiple baselines, achieving new state-of-the-art results on Ego-HOIBench. Our dataset and method establish a solid foundation for future research in egocentric vision and human-object interaction understanding. Project page: https://dengkunyuan.github.io/EgoHOIBench/

URLs: https://dengkunyuan.github.io/EgoHOIBench/

replace Prompt-based Dynamic Token Pruning for Efficient Segmentation of Medical Images

Authors: Pallabi Dutta, Anubhab Maity, Sushmita Mitra

Abstract: The high computational demands of Vision Transformers (ViTs) in processing a large number of tokens often constrain their practical application in analyzing medical images. This research proposes a Prompt-driven Adaptive Token ({\it PrATo}) pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The prompt-based spatial prior helps to rank the tokens according to their relevance. Tokens with low relevance scores are down-weighted, ensuring that only the relevant ones are propagated for processing across subsequent stages. This data-driven pruning strategy improves segmentation accuracy and inference speed by allocating computational resources to essential regions. The proposed framework is integrated with several state-of-the-art models to facilitate the elimination of irrelevant tokens, thereby enhancing computational efficiency while preserving segmentation accuracy. The experimental results show a reduction of $\sim$35-55% in tokens, thus lowering computational costs relative to baselines. By reducing the cost of medical image processing, our framework facilitates real-time diagnosis and broadens applicability in resource-constrained environments.
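A minimal sketch of prompt-driven token pruning: tokens are ranked by similarity to a prompt-derived embedding and only the top fraction is kept. The dot-product scoring, the keep ratio, and the hard top-k selection are assumptions; the paper describes down-weighting low-relevance tokens rather than this exact rule.

```python
import torch

def prune_tokens(tokens, prompt_embedding, keep_ratio=0.55):
    """Rank ViT patch tokens by a prompt-derived relevance score and keep the top ones.

    tokens:           [B, N, D] patch tokens.
    prompt_embedding: [B, D] embedding of the user prompt / spatial prior.
    """
    scores = torch.einsum("bnd,bd->bn", tokens, prompt_embedding)   # relevance per token
    k = max(1, int(keep_ratio * tokens.shape[1]))
    keep = scores.topk(k, dim=1).indices.sort(dim=1).values          # preserve token order
    batch_idx = torch.arange(tokens.shape[0])[:, None]
    return tokens[batch_idx, keep]                                   # [B, k, D]

x = torch.randn(2, 196, 64)
p = torch.randn(2, 64)
print(prune_tokens(x, p).shape)   # torch.Size([2, 107, 64])
```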

replace OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding

Authors: Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Abstract: We introduce OpenEvents V1, a large-scale benchmark dataset designed to advance event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that focus on surface-level descriptions, the OpenEvents V1 dataset emphasizes contextual and temporal grounding through three primary tasks: (1) generating rich, event-aware image captions, (2) retrieving event-relevant news articles from image queries, and (3) retrieving event-relevant images from narrative-style textual queries. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for all tasks. OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep reasoning over complex real-world events. The dataset is publicly available at https://ltnghia.github.io/eventa/openevents-v1.

URLs: https://ltnghia.github.io/eventa/openevents-v1.

replace Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach

Authors: Panpan Ji, Junni Song, Yifan Lu, Hang Xiao, Hanyu Liu, Chao Li

Abstract: Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.
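A minimal sketch of confidence-driven gradient modulation: after backpropagation, the gradients of each modality branch are rescaled according to its prediction confidence so that dominant modalities do not crowd out weaker ones. The softmax-over-negative-confidence rule and the temperature tau are assumptions, not the paper's exact strategy.

```python
import torch

def modulate_gradients(branches, confidences, tau=1.0):
    """Confidence-driven gradient modulation (sketch).

    branches:    dict {modality: nn.Module} of per-modality sub-networks.
    confidences: dict {modality: float} of current prediction confidences.
    Over-confident (dominant) branches get their gradients scaled down,
    giving weaker modalities more room during backpropagation.
    """
    vals = torch.tensor([confidences[m] for m in branches])
    # lower confidence -> larger scale (softmax over negated confidences)
    scales = torch.softmax(-vals / tau, dim=0) * len(vals)
    for scale, module in zip(scales, branches.values()):
        for p in module.parameters():
            if p.grad is not None:
                p.grad.mul_(scale)

# call after loss.backward() and before optimizer.step()
```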

replace DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin

Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
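A minimal sketch of a block-wise structured attention mask that restricts attention to within each knowledge block (dynamic, spatial, semantic), preventing information leakage between them; the block sizes and the boolean-mask convention are assumptions for illustration.

```python
import torch

def blockwise_attention_mask(block_sizes):
    """Build a block-diagonal attention mask (sketch).

    block_sizes: list of token counts per knowledge block, e.g. [16, 16, 16].
    Returns a boolean [N, N] mask where True marks positions to be masked out,
    so tokens only attend within their own block.
    """
    n = sum(block_sizes)
    mask = torch.ones(n, n, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        mask[start:start + size, start:start + size] = False  # allow intra-block attention
        start += size
    return mask

print(blockwise_attention_mask([4, 4, 4]).int())
```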

replace Solar Altitude Guided Scene Illumination

Authors: Samed Do\u{g}an, Maximilian Hoh, Nico Leuze, Nicolas Rodriguez Pe\~na, Alfred Sch\"ottl

Abstract: The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-world data acquisition requires extensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present solar altitude as global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.

replace Online Micro-gesture Recognition Using Data Augmentation and Spatial-Temporal Attention

Authors: Pengyu Liu, Kun Li, Fei Wang, Yanyan Wei, Junhui She, Dan Guo

Abstract: In this paper, we introduce the latest solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track of the IJCAI 2025 MiGA Challenge. The Micro-gesture Online Recognition task is a highly challenging problem that aims to locate the temporal positions and recognize the categories of multiple micro-gesture instances in untrimmed videos. Compared to traditional temporal action detection, this task places greater emphasis on distinguishing between micro-gesture categories and precisely identifying the start and end times of each instance. Moreover, micro-gestures are typically spontaneous human actions, with greater differences than those found in other human actions. To address these challenges, we propose hand-crafted data augmentation and spatial-temporal attention to enhance the model's ability to classify and localize micro-gestures more accurately. Our solution achieved an F1 score of 38.03, outperforming the previous state-of-the-art by 37.9%. As a result, our method ranked first in the Micro-gesture Online Recognition track.

replace Demographic-aware fine-grained classification of pediatric wrist fractures

Authors: Ammar Ahmed, Ali Shariq Imran, Zenun Kastrati, Sher Muhammad Daudpota

Abstract: Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task. Secondly, we enhance network performance by fusing patient metadata with X-rays. Thirdly, we improve the performance further by utilizing weights trained on a separate fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies.
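A minimal sketch of fusing patient metadata with X-ray image features by concatenation before classification, in the spirit of the second step above; the feature sizes, the metadata MLP, and the two-class head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MetadataFusionHead(nn.Module):
    """Concatenate image features with an embedding of patient metadata (sketch)."""

    def __init__(self, img_dim=512, meta_dim=4, hidden=64, num_classes=2):
        super().__init__()
        self.meta_mlp = nn.Sequential(nn.Linear(meta_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(img_dim + hidden, num_classes)

    def forward(self, img_feat, metadata):
        meta_feat = self.meta_mlp(metadata)                  # embed age, sex, etc.
        return self.classifier(torch.cat([img_feat, meta_feat], dim=-1))

head = MetadataFusionHead()
print(head(torch.randn(2, 512), torch.randn(2, 4)).shape)
```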

replace SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Authors: Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu

Abstract: Recent advances in vision-language navigation (VLN) are mainly attributed to emerging large language models (LLMs). These methods exhibit excellent generalization capabilities in instruction understanding and task reasoning. However, they are constrained by the fixed knowledge bases and reasoning abilities of LLMs, which prevents them from fully incorporating experiential knowledge and thus leaves them without an efficient capacity to evolve. To address this, we draw inspiration from the evolution capabilities of natural agents and propose a self-evolving VLN framework (SE-VLN) that endows VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, this is the first multimodal LLM-powered self-evolving VLN framework. Specifically, SE-VLN comprises three core modules: a hierarchical memory module that transfers successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module that retrieves experience and enables multi-step decision-making, and a reflection module that realizes continual evolution. Comprehensive tests show that SE-VLN achieves navigation success rates of 57% and 35.2% in unseen environments, absolute improvements of 23.9% and 15.0% over current state-of-the-art methods on the R2R and REVERIE datasets, respectively. Moreover, SE-VLN shows performance improvement with an increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

replace DriveIndia: An Object Detection Dataset for Diverse Indian Traffic Scenes

Authors: Rishav Kumar, D. Santhosh Reddy, P. Rajalakshmi

Abstract: We introduce DriveIndia, a large-scale object detection dataset purpose-built to capture the complexity and unpredictability of Indian traffic environments. The dataset contains 66,986 high-resolution images annotated in YOLO format across 24 traffic-relevant object categories, encompassing diverse conditions such as varied weather (fog, rain), illumination changes, heterogeneous road infrastructure, and dense, mixed traffic patterns. The data were collected over 120+ hours and cover 3,400+ kilometers across urban, rural, and highway routes. DriveIndia offers a comprehensive benchmark for real-world autonomous driving challenges. We provide baseline results using state-of-the-art YOLO family models, with the top-performing variant achieving a mAP50 of 78.7%. Designed to support research in robust, generalizable object detection under uncertain road conditions, DriveIndia will be publicly available via the TiHAN-IIT Hyderabad dataset repository https://tihan.iith.ac.in/TiAND.html (Terrestrial Datasets -> Camera Dataset).

URLs: https://tihan.iith.ac.in/TiAND.html

replace MergeSAM: Unsupervised change detection of remote sensing images based on the Segment Anything Model

Authors: Meiqi Hu, Lingzhi Lu, Chengxi Han, Xiaoping Liu

Abstract: Recently, large foundation models trained on vast datasets have demonstrated exceptional capabilities in feature extraction and general feature representation. The ongoing advancements in deep learning-driven large models have shown great promise in accelerating unsupervised change detection methods, thereby enhancing the practical applicability of change detection technologies. Building on this progress, this paper introduces MergeSAM, an innovative unsupervised change detection method for high-resolution remote sensing imagery, based on the Segment Anything Model (SAM). Two novel strategies, MaskMatching and MaskSplitting, are designed to address real-world complexities such as object splitting, merging, and other intricate changes. The proposed method fully leverages SAM's object segmentation capabilities to construct multitemporal masks that capture complex changes, embedding the spatial structure of land cover into the change detection process.

replace Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization

Authors: Maxime Pietrantoni, Gabriela Csurka, Torsten Sattler

Abstract: Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Pose refinement, which involves aligning either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation, is used to achieve localization. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performances.

replace Is Uncertainty Quantification a Viable Alternative to Learned Deferral?

Authors: Anna M. Wundram, Christian F. Baumgartner

Abstract: Artificial Intelligence (AI) holds the potential to dramatically improve patient care. However, it is not infallible, necessitating human-AI-collaboration to ensure safe implementation. One aspect of AI safety is the models' ability to defer decisions to a human expert when they are likely to misclassify autonomously. Recent research has focused on methods that learn to defer by optimising a surrogate loss function that finds the optimal trade-off between predicting a class label or deferring. However, during clinical translation, models often face challenges such as data shift. Uncertainty quantification methods aim to estimate a model's confidence in its predictions. However, they may also be used as a deferral strategy which does not rely on learning from a specific training distribution. We hypothesise that models developed to quantify uncertainty are more robust to out-of-distribution (OOD) input than learned deferral models that have been trained in a supervised fashion. To investigate this hypothesis, we constructed an extensive evaluation study on a large ophthalmology dataset, examining both learned deferral models and established uncertainty quantification methods, assessing their performance in- and out-of-distribution. Specifically, we evaluate their ability to accurately classify glaucoma from fundus images while deferring cases with a high likelihood of error. We find that uncertainty quantification methods may be a promising choice for AI deferral.

replace Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation

Authors: Hyebin Cho, Jaehyup Lee

Abstract: Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we constructed CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git

URLs: https://github.com/hyebin-c/FaceMat.git
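
For readers unfamiliar with uncertainty-aware matting losses, the sketch below illustrates, in PyTorch, a per-pixel Gaussian negative log-likelihood of the kind the abstract describes, together with an uncertainty-weighted distillation term. The function names and the exact weighting scheme are illustrative assumptions, not the paper's implementation.

    import torch

    def matting_nll_loss(pred_alpha, pred_logvar, gt_alpha):
        # Per-pixel Gaussian NLL: errors are penalized more where the predicted
        # variance is small, so the teacher learns to report high variance
        # (uncertainty) in ambiguous or occluded regions.
        return 0.5 * (pred_logvar + (gt_alpha - pred_alpha) ** 2 / pred_logvar.exp()).mean()

    def uncertainty_weighted_distillation(student_alpha, teacher_alpha, teacher_logvar):
        # Spatially adaptive distillation: weight the student-teacher discrepancy
        # by the teacher's normalized uncertainty so ambiguous pixels get more attention.
        w = teacher_logvar.exp()
        w = w / (w.mean() + 1e-8)
        return (w * (student_alpha - teacher_alpha.detach()) ** 2).mean()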

replace Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Authors: Shaoguang Wang, Ziyang Chen, Yijie Xu, Weiyu Guo, Hui Xiong

Abstract: The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a "less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term 'visual echoes'. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. In extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach reduces the required frames by up to 86.9% and the total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
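
The pruning step can be pictured as agglomerative clustering over fused per-frame features followed by keeping one representative per cluster. The Python/SciPy sketch below illustrates this idea under the assumption that L2-normalized ResNet-50 and CLIP features have already been extracted; the threshold, fusion scheme, and representative-selection rule are placeholders rather than the authors' settings.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def prune_keyframes(resnet_feats, clip_feats, distance_threshold=0.3):
        # resnet_feats, clip_feats: (N, D1), (N, D2) per-frame features.
        # Merge temporally redundant keyframes ("visual echoes") into single
        # representatives via agglomerative clustering on a fused feature space.
        fused = np.concatenate([resnet_feats, clip_feats], axis=1)
        fused /= np.linalg.norm(fused, axis=1, keepdims=True) + 1e-8
        Z = linkage(fused, method="average", metric="cosine")
        labels = fcluster(Z, t=distance_threshold, criterion="distance")
        keep = []
        for c in np.unique(labels):
            members = np.where(labels == c)[0]
            keep.append(int(members.min()))   # keep the earliest frame as representative
        return sorted(keep)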

replace Fast Motion Estimation and Context-Aware Refinement for Efficient Bayer-Domain Video Vision

Authors: Haichao Wang, Xinyue Xi, Jiangtao Wen, Yuxing Han

Abstract: The efficiency of video computer vision systems remains challenging due to the high temporal redundancy within videos. Existing works have been proposed for efficient video computer vision; however, they do not fully reduce temporal redundancy and neglect the front-end computation overhead. In this paper, we propose an efficient video computer vision system. First, the image signal processor is removed and Bayer-format data is fed directly into video computer vision models, saving front-end computation. Second, instead of optical flow models and video codecs, a fast block matching-based motion estimation algorithm with a motion vector (MV) refinement module is proposed specifically for efficient video computer vision. To correct residual errors, a context-aware block refinement network is introduced to refine regions with large error. To further balance accuracy and efficiency, a frame selection strategy is employed. Experiments on multiple video computer vision tasks demonstrate that our method achieves significant acceleration with only slight performance loss.
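
As a rough illustration of block matching-based motion estimation, the sketch below performs an exhaustive SAD search on single-channel frames. It omits the paper's MV refinement module, context-aware refinement network, and frame selection; a practical Bayer-domain variant would also restrict displacements to multiples of 2 to preserve the CFA phase.

    import numpy as np

    def block_matching_mv(ref, cur, block=16, search=8):
        # For each block in the current frame, search a +/-`search` window in the
        # reference frame and return the motion vector (dy, dx) minimizing the
        # sum of absolute differences (SAD).
        H, W = cur.shape
        mvs = np.zeros((H // block, W // block, 2), dtype=np.int32)
        for by in range(0, H - block + 1, block):
            for bx in range(0, W - block + 1, block):
                cur_blk = cur[by:by + block, bx:bx + block].astype(np.int32)
                best, best_mv = None, (0, 0)
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        y, x = by + dy, bx + dx
                        if y < 0 or x < 0 or y + block > H or x + block > W:
                            continue
                        ref_blk = ref[y:y + block, x:x + block].astype(np.int32)
                        sad = np.abs(cur_blk - ref_blk).sum()
                        if best is None or sad < best:
                            best, best_mv = sad, (dy, dx)
                mvs[by // block, bx // block] = best_mv
        return mvs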

replace MultiRef: Controllable Image Generation with Multiple Visual References

Authors: Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna

Abstract: Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% on synthetic samples and 79.0% on real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.

URLs: https://multiref.github.io/.

replace Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Authors: Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang

Abstract: The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100\% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining. The project page is here: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/

URLs: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/

replace VFM-Guided Semi-Supervised Detection Transformer under Source-Free Constraints for Remote Sensing Object Detection

Authors: Jianhong Han, Yupei Wang, Liang Chen

Abstract: Unsupervised domain adaptation methods have been widely explored to bridge domain gaps. However, in real-world remote-sensing scenarios, privacy and transmission constraints often preclude access to source domain data, which limits their practical applicability. Recently, Source-Free Object Detection (SFOD) has emerged as a promising alternative, aiming at cross-domain adaptation without relying on source data, primarily through a self-training paradigm. Despite its potential, SFOD frequently suffers from training collapse caused by noisy pseudo-labels, especially in remote sensing imagery with dense objects and complex backgrounds. Considering that limited target domain annotations are often feasible in practice, we propose a Vision foundation-Guided DEtection TRansformer (VG-DETR), built upon a semi-supervised framework for SFOD in remote sensing images. VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a "free lunch" manner, leveraging a small amount of labeled target data to mitigate pseudo-label noise while improving the detector's feature-extraction capability. Specifically, we introduce a VFM-guided pseudo-label mining strategy that leverages the VFM's semantic priors to further assess the reliability of the generated pseudo-labels. By recovering potentially correct predictions from low-confidence outputs, our strategy improves pseudo-label quality and quantity. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels. Through contrastive learning among fine-grained prototypes and similarity matching between feature maps, this dual-level alignment further enhances the robustness of feature representations against domain gaps. Extensive experiments demonstrate that VG-DETR achieves superior performance in source-free remote sensing detection tasks.

replace MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

Authors: Qian Liang, Yujia Wu, Kuncheng Li, Jiwei Wei, Shiyuan He, Jinyu Guo, Ning Xie

Abstract: Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.

replace Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting

Authors: Kangjie Chen, Yingji Zhong, Zhihao Li, Jiaqi Lin, Youyu Chen, Minghan Qin, Haoqian Wang

Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis under dense-view settings. However, in sparse-view scenarios, despite the realistic renderings in training views, 3DGS occasionally manifests appearance artifacts in novel views. This paper investigates the appearance artifacts in sparse-view 3DGS and uncovers a core limitation of current approaches: the optimized Gaussians are overly entangled with one another to aggressively fit the training views, which leads to a neglect of the real appearance distribution of the underlying scene and results in appearance artifacts in novel views. The analysis is based on a proposed metric, termed Co-Adaptation Score (CA), which quantifies the entanglement among Gaussians, i.e., co-adaptation, by computing the pixel-wise variance across multiple renderings of the same viewpoint with different random subsets of Gaussians. The analysis reveals that the degree of co-adaptation is naturally alleviated as the number of training views increases. Based on the analysis, we propose two lightweight strategies to explicitly mitigate the co-adaptation in sparse-view 3DGS: (1) random Gaussian dropout and (2) multiplicative noise injection into the opacity. Both strategies are designed to be plug-and-play, and their effectiveness is validated across various methods and benchmarks. We hope that our insights into the co-adaptation effect will inspire the community to achieve a more comprehensive understanding of sparse-view 3DGS.
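
The Co-Adaptation Score can be thought of as the average pixel-wise variance over renderings of one viewpoint produced with different random subsets of Gaussians. The NumPy sketch below expresses that idea, assuming a hypothetical `render_fn(mask, view)` rasterizer; the subset ratio and number of subsets are illustrative choices, not the paper's settings.

    import numpy as np

    def co_adaptation_score(render_fn, num_gaussians, view, num_subsets=8,
                            keep_ratio=0.5, seed=0):
        # Render the same view several times, each with a different random subset
        # of Gaussians, and measure the pixel-wise variance across the renderings.
        # `render_fn(mask, view)` is assumed to rasterize only the Gaussians
        # selected by the boolean mask and return an (H, W, 3) image in [0, 1].
        rng = np.random.default_rng(seed)
        renders = []
        for _ in range(num_subsets):
            mask = rng.random(num_gaussians) < keep_ratio   # random Gaussian subset
            renders.append(render_fn(mask, view))
        stack = np.stack(renders, axis=0)                   # (S, H, W, 3)
        pixel_var = stack.var(axis=0)                       # variance over subsets
        return float(pixel_var.mean())                      # scalar score for this view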

replace SocialTrack: Multi-Object Tracking in Complex Urban Traffic Scenes Inspired by Social Behavior

Authors: Wenguang Tao, Xiaotian Wang, Tian Yan, Jie Yan, Guodong Li, Kun Bai

Abstract: As a key research direction in the field of multi-object tracking (MOT), UAV-based multi-object tracking has significant application value in the analysis and understanding of urban intelligent transportation systems. However, in complex UAV perspectives, challenges such as small target scale variations, occlusions, nonlinear crossing motions, and motion blur severely hinder the stability of multi-object tracking. To address these challenges, this paper proposes a novel multi-object tracking framework, SocialTrack, aimed at enhancing the tracking accuracy and robustness of small targets in complex urban traffic environments. The specialized small-target detector enhances the detection performance by employing a multi-scale feature enhancement mechanism. The Velocity Adaptive Cubature Kalman Filter (VACKF) improves the accuracy of trajectory prediction by incorporating a velocity dynamic modeling mechanism. The Group Motion Compensation Strategy (GMCS) models social group motion priors to provide stable state update references for low-quality tracks, significantly improving the target association accuracy in complex dynamic environments. Furthermore, the Spatio-Temporal Memory Prediction (STMP) leverages historical trajectory information to predict the future state of low-quality tracks, effectively mitigating identity switching issues. Extensive experiments on the UAVDT and MOT17 datasets demonstrate that SocialTrack outperforms existing state-of-the-art (SOTA) methods across several key metrics. Significant improvements in MOTA and IDF1, among other core performance indicators, highlight its superior robustness and adaptability. Additionally, SocialTrack is highly modular and compatible, allowing for seamless integration with existing trackers to further enhance performance.

replace FastTracker: Real-Time and Accurate Visual Tracking

Authors: Hamidreza Hashempoor, Yu Dong Hwang

Abstract: Conventional multi-object tracking (MOT) systems are predominantly designed for pedestrian tracking and often exhibit limited generalization to other object categories. This paper presents a generalized tracking framework capable of handling multiple object types, with a particular emphasis on vehicle tracking in complex traffic scenes. The proposed method incorporates two key components: (1) an occlusion-aware re-identification mechanism that enhances identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that utilizes semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. In addition, we introduce a new benchmark dataset comprising diverse vehicle classes with frame-level tracking annotations, specifically curated to support evaluation of vehicle-focused tracking methods. Extensive experimental results demonstrate that the proposed approach achieves robust performance on both the newly introduced dataset and several public benchmarks, highlighting its effectiveness in general-purpose object tracking. While our framework is designed for generalized multi-class tracking, it also achieves strong performance on conventional benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets. Code and Benchmark are available: github.com/Hamidreza-Hashempoor/FastTracker, huggingface.co/datasets/Hamidreza-Hashemp/FastTracker-Benchmark.

replace Waver: Wave Your Way to Lifelike Video Generation

Authors: Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, Zehuan Yuan

Abstract: We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.

URLs: https://github.com/FoundationVision/Waver.

replace PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation

Authors: Xiaoyang Hao, Han Li

Abstract: Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image varies significantly from image to image, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 mm, 7.54% lower than the previous SOTA approach. Code is available at: https://github.com/KenAdamsJoseph/PersPose.

URLs: https://github.com/KenAdamsJoseph/PersPose.
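
One way to read "Perspective Rotation" is as a rotation that maps the viewing ray through the subject's pixel onto the optical axis, after which the image can be warped with the homography K R K^{-1}. The NumPy sketch below computes such a rotation with Rodrigues' formula; it is an illustrative interpretation, not the authors' code, and `subject_px` is an assumed input (e.g. the center of the person's bounding box).

    import numpy as np

    def perspective_rotation(K, subject_px):
        # Rotation R that maps the viewing ray through pixel `subject_px` onto the
        # optical axis; warping with H = K @ R @ inv(K) then centers the subject
        # and reduces off-center perspective distortion.
        u, v = subject_px
        d = np.linalg.inv(K) @ np.array([u, v, 1.0])
        d /= np.linalg.norm(d)                      # ray through the subject
        z = np.array([0.0, 0.0, 1.0])               # optical axis
        axis = np.cross(d, z)
        s, c = np.linalg.norm(axis), float(d @ z)   # sin and cos of the rotation angle
        if s < 1e-8:                                # subject already centered
            return np.eye(3)
        axis /= s
        Kx = np.array([[0, -axis[2], axis[1]],
                       [axis[2], 0, -axis[0]],
                       [-axis[1], axis[0], 0]])
        return np.eye(3) + s * Kx + (1 - c) * (Kx @ Kx)   # Rodrigues' formula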

replace Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Authors: Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit

Abstract: We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.

URLs: https://huggingface.co/datasets/DIALab/MMCricBench,

replace Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches

Authors: Aoqi Li, Yanghui Song, Jichao Dao, Chengfu Yang

Abstract: This paper addresses the challenging problem of image enhancement in complex underwater scenes by proposing a solution based on deep learning. The proposed method skillfully integrates two deep convolutional neural network models, VGG19 and ResNet50, leveraging their powerful feature extraction capabilities to perform multi-scale and multi-level deep feature analysis of underwater images. By constructing a unified model, the complementary advantages of the two models are effectively integrated, achieving a more comprehensive and accurate image enhancement effect. To objectively evaluate the enhancement effect, this paper introduces image quality assessment metrics such as PSNR, UCIQE, and UIQM to quantitatively compare images before and after enhancement and deeply analyzes the performance of different models in different scenarios. Furthermore, to improve the practicality and stability of the underwater visual enhancement system, this paper also provides practical suggestions from aspects such as model optimization, multi-model fusion, and hardware selection, aiming to provide strong technical support for visual enhancement tasks in complex underwater environments.

replace Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels

Authors: Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Jayaraman, Eric Eaton, Lingjie Liu

Abstract: Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features purely using supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which, coupled with a learned static scene representation like Gaussian Splatting, enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known datasets of paired 3D assets and physics material annotations. Extensive evaluations demonstrate that PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite only ever being trained on synthetic data. https://pixie-3d.github.io/

URLs: https://pixie-3d.github.io/

replace Towards Optimal Convolutional Transfer Learning Architectures for Breast Lesion Classification and ACL Tear Detection

Authors: Daniel Frees, Moritz Bolling, Aditri Bhagirath

Abstract: Modern computer vision models have proven to be highly useful for medical imaging classification and segmentation tasks, but the scarcity of medical imaging data often limits the efficacy of models trained from scratch. Transfer learning has emerged as a pivotal solution to this, enabling the fine-tuning of high-performance models on small data. Mei et al. (2022) found that pre-training CNNs on a large dataset of radiologist-labeled images (RadImageNet) enhanced model performance on downstream tasks compared to ImageNet pretraining. The present work extends Mei et al. (2022) by conducting a comprehensive investigation to determine optimal CNN architectures for breast lesion malignancy detection and ACL tear detection, as well as performing statistical analysis to compare the effect of RadImageNet and ImageNet pre-training on downstream model performance. Our findings suggest that 1-dimensional convolutional classifiers with skip connections, ResNet50 pre-trained backbones, and partial backbone unfreezing yield optimal downstream medical classification performance. Our best models achieve AUCs of 0.9969 for ACL tear detection and 0.9641 for breast nodule malignancy detection, competitive with the results reported by Mei et al. (2022) and surpassing other previous works. We do not find evidence confirming RadImageNet pre-training to provide superior downstream performance for ACL tear and breast lesion classification tasks.
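
The sketch below gives one possible PyTorch reading of "1-dimensional convolutional classifiers with skip connections" on top of a partially unfrozen ResNet50 backbone. The layer sizes, kernel widths, and the choice of which backbone stage to unfreeze are assumptions for illustration; RadImageNet weights would be loaded in place of the ImageNet weights shown.

    import torch.nn as nn
    from torchvision.models import resnet50

    class Conv1DSkipHead(nn.Module):
        # 1-D convolutional classifier head with a skip connection, applied to the
        # backbone's pooled feature vector treated as a 1-D signal.
        def __init__(self, in_dim=2048, hidden=64, num_classes=2):
            super().__init__()
            self.conv1 = nn.Conv1d(1, hidden, kernel_size=7, padding=3)
            self.conv2 = nn.Conv1d(hidden, 1, kernel_size=7, padding=3)
            self.act = nn.ReLU()
            self.fc = nn.Linear(in_dim, num_classes)

        def forward(self, feats):                      # feats: (B, in_dim)
            x = feats.unsqueeze(1)                     # (B, 1, in_dim)
            h = self.conv2(self.act(self.conv1(x)))    # (B, 1, in_dim)
            h = h.squeeze(1) + feats                   # skip connection around the convs
            return self.fc(h)

    backbone = resnet50(weights="IMAGENET1K_V2")       # swap in RadImageNet weights as needed
    backbone.fc = nn.Identity()                        # expose the 2048-d pooled features
    for name, p in backbone.named_parameters():
        p.requires_grad = name.startswith("layer4")    # partial unfreezing of the last stage
    head = Conv1DSkipHead()
    # usage: logits = head(backbone(images)) for a batch of 3-channel images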

replace Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes

Authors: Ryan Faulkner, Luke Haub, Simon Ratcliffe, Tat-Jun Chin

Abstract: LiDAR scanning in outdoor scenes acquires accurate distance measurements over wide areas, producing large-scale point clouds. Application examples for this data include robotics, automotive vehicles, and land surveillance. During such applications, outlier objects from outside the training data will inevitably appear. Our research contributes a novel approach to open-set segmentation, leveraging the learnings of object defect-detection research. We also draw on the Mamba architecture's strong performance in utilising long-range dependencies and scalability to large data. Combining both, we create a reconstruction-based approach for the task of outdoor scene open-set segmentation. We show that our approach improves performance not only when applied to our own open-set segmentation method, but also when applied to existing methods. Furthermore, we contribute a Mamba-based architecture which is competitive with existing voxel-convolution-based methods on challenging, large-scale point clouds.

replace Rethinking the Detail-Preserved Completion of Complex Tubular Structures based on Point Cloud: a Dataset and a Benchmark

Authors: Yaolei Qi, Yikai Yang, Wenbo Peng, Shumei Miao, Yutao Hu, Guanyu Yang

Abstract: Complex tubular structures are essential in medical imaging and computer-assisted diagnosis, where their integrity enhances anatomical visualization and lesion detection. However, existing segmentation algorithms struggle with structural discontinuities, particularly in severe clinical cases such as coronary artery stenosis and vessel occlusions, which leads to undesired discontinuities and compromises downstream diagnostic accuracy. Therefore, it is imperative to reconnect discontinuous structures to ensure their completeness. In this study, we explore tubular structure completion based on point clouds for the first time and establish a Point Cloud-based Coronary Artery Completion (PC-CAC) dataset, which is derived from real clinical data. This dataset provides a novel benchmark for tubular structure completion. Additionally, we propose TSRNet, a Tubular Structure Reconnection Network that integrates a detail-preserved feature extractor, a multiple dense refinement strategy, and a global-to-local loss function to ensure accurate reconnection while maintaining structural integrity. Comprehensive experiments on our PC-CAC and two additional public datasets (PC-ImageCAS and PC-PTR) demonstrate that our method consistently outperforms state-of-the-art approaches across multiple evaluation metrics, setting a new benchmark for point cloud-based tubular structure reconstruction. Our benchmark is available at https://github.com/YaoleiQi/PCCAC.

URLs: https://github.com/YaoleiQi/PCCAC.

replace Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Authors: Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song, Bo Zheng

Abstract: Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.

replace Emerging Semantic Segmentation from Positive and Negative Coarse Label Learning

Authors: Le Zhang, Fuping Wu, Arun Thirunavukarasu, Kevin Bronik, Thomas Nichols, Bartlomiej W. Papiez

Abstract: Large annotated datasets are vital for training segmentation models, but pixel-level labeling is time-consuming, error-prone, and often requires scarce expert annotators, especially in medical imaging. In contrast, coarse annotations are quicker, cheaper, and easier to produce, even by non-experts. In this paper, we propose to use coarse drawings from both positive (target) and negative (background) classes in the image, even with noisy pixels, to train a convolutional neural network (CNN) for semantic segmentation. We present a method for learning the true segmentation label distributions from purely noisy coarse annotations using two coupled CNNs. The separation of the two CNNs is achieved through high fidelity to the characteristics of the noisy training annotations. We further add complementary label learning, which encourages estimating the negative label distribution. To illustrate the properties of our method, we first use a toy segmentation dataset based on MNIST. We then present the quantitative results of experiments using publicly available datasets: the Cityscapes dataset for multi-class segmentation and retinal images for medical applications. In all experiments, our method outperforms state-of-the-art methods, particularly in the cases where the ratio of coarse annotations is small compared to the given dense annotations.
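
A minimal way to picture training from positive and negative coarse drawings is a loss with an ordinary term on positively scribbled pixels and a complementary term on negatively scribbled ones, as sketched below in PyTorch. This simplified binary version ignores the paper's coupled-CNN architecture and label-distribution modeling and is only an illustration of the positive/complementary-label idea.

    import torch

    def coarse_pos_neg_loss(logits, pos_scribble, neg_scribble):
        # logits, pos_scribble, neg_scribble: (B, 1, H, W); the scribbles are binary
        # masks of coarsely drawn target (positive) and background (negative) pixels.
        p = torch.sigmoid(logits)
        eps = 1e-6
        # positive drawings: push marked pixels towards the foreground class
        pos_term = -(torch.log(p + eps) * pos_scribble).sum() / (pos_scribble.sum() + eps)
        # complementary (negative) drawings: "this pixel is NOT foreground"
        neg_term = -(torch.log(1 - p + eps) * neg_scribble).sum() / (neg_scribble.sum() + eps)
        return pos_term + neg_term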

replace-cross MOSformer: Momentum encoder-based inter-slice fusion transformer for medical image segmentation

Authors: De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zhi-Chao Lai, Zeng-Guang Hou

Abstract: Medical image segmentation plays an important role in various clinical applications. 2.5D-based segmentation models bridge the computational efficiency of 2D-based models with the spatial perception capabilities of 3D-based models. However, existing 2.5D-based models primarily adopt a single encoder to extract features of target and neighborhood slices, failing to effectively fuse inter-slice information, which results in suboptimal segmentation performance. In this study, a novel momentum encoder-based inter-slice fusion transformer (MOSformer) is proposed to overcome this issue by leveraging inter-slice information from multi-scale feature maps extracted by different encoders. Specifically, dual encoders are employed to enhance feature distinguishability among different slices. One of the encoders is moving-averaged to maintain consistent slice representations. Moreover, an inter-slice fusion transformer (IF-Trans) module is developed to fuse inter-slice multi-scale features. MOSformer is evaluated on three benchmark datasets (Synapse, ACDC, and AMOS), achieving a new state-of-the-art with 85.63%, 92.19%, and 85.43% DSC, respectively. These results demonstrate MOSformer's competitiveness in medical image segmentation.
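
A momentum (moving-averaged) encoder of this kind is typically maintained as an exponential moving average of the trainable encoder, in the spirit of MoCo-style training. The PyTorch sketch below shows that generic update; the momentum value and the exact pairing of encoders are assumptions rather than the paper's settings.

    import torch

    @torch.no_grad()
    def momentum_update(query_encoder, momentum_encoder, m=0.999):
        # EMA update: the momentum encoder slowly tracks the trainable encoder,
        # keeping slice representations consistent across training steps.
        for q_param, k_param in zip(query_encoder.parameters(),
                                    momentum_encoder.parameters()):
            k_param.data.mul_(m).add_(q_param.data, alpha=1 - m)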

replace-cross TimeFlow: Temporal Conditioning for Longitudinal Brain MRI Registration and Aging Analysis

Authors: Bailiang Jian, Jiazhen Pan, Yitong Li, Fabian Bongratz, Ruochen Li, Daniel Rueckert, Benedikt Wiestler, Christian Wachinger

Abstract: Longitudinal brain analysis is essential for understanding healthy aging and identifying pathological deviations. Longitudinal registration of sequential brain MRI underpins such analyses. However, existing methods are limited by reliance on densely sampled time series, a trade-off between accuracy and temporal smoothness, and an inability to prospectively forecast future brain states. To overcome these challenges, we introduce \emph{TimeFlow}, a learning-based framework for longitudinal brain MRI registration. TimeFlow uses a U-Net backbone with temporal conditioning to model neuroanatomy as a continuous function of age. Given only two scans from an individual, TimeFlow estimates accurate and temporally coherent deformation fields, enabling non-linear extrapolation to predict future brain states. This is achieved by our proposed inter-/extra-polation consistency constraints applied to both the deformation fields and deformed images. Remarkably, these constraints preserve temporal consistency and continuity without requiring explicit smoothness regularizers or densely sampled sequential data. Extensive experiments demonstrate that TimeFlow outperforms state-of-the-art methods in terms of both future timepoint forecasting and registration accuracy. Moreover, TimeFlow supports novel biological brain aging analyses by differentiating neurodegenerative trajectories from normal aging without requiring segmentation, thereby eliminating the need for labor-intensive annotations and mitigating segmentation inconsistency. TimeFlow offers an accurate, data-efficient, and annotation-free framework for longitudinal analysis of brain aging and chronic diseases, capable of forecasting brain changes beyond the observed study period.

replace-cross Enhancing Reusability of Learned Skills for Robot Manipulation via Gaze Information and Motion Bottlenecks

Authors: Ryo Takizawa, Izumi Karino, Koki Nakagawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi

Abstract: Autonomous agents capable of diverse object manipulations should be able to acquire a wide range of manipulation skills with high reusability. Although advances in deep learning have made it increasingly feasible to replicate the dexterity of human teleoperation in robots, generalizing these acquired skills to previously unseen scenarios remains a significant challenge. In this study, we propose a novel algorithm, Gaze-based Bottleneck-aware Robot Manipulation (GazeBot), which enables high reusability of learned motions without sacrificing dexterity or reactivity. By leveraging gaze information and motion bottlenecks, both crucial features for object manipulation, GazeBot achieves high success rates compared with state-of-the-art imitation learning methods, particularly when the object positions and end-effector poses differ from those in the provided demonstrations. Furthermore, the training process of GazeBot is entirely data-driven once a demonstration dataset with gaze data is provided. Videos and code are available at https://crumbyrobotics.github.io/gazebot.

URLs: https://crumbyrobotics.github.io/gazebot.

replace-cross Image Coding for Machines via Feature-Preserving Rate-Distortion Optimization

Authors: Samuel Fern\'andez-Mendui\~na, Eduardo Pavez, Antonio Ortega

Abstract: Many images and videos are primarily processed by computer vision algorithms, involving only occasional human inspection. When this content requires compression before processing, e.g., in distributed applications, coding methods must optimize for both visual quality and downstream task performance. We first show theoretically that an approach to reduce the effect of compression for a given task loss is to perform rate-distortion optimization (RDO) using the distance between features, obtained from the original and the decoded images, as a distortion metric. However, directly optimizing such a rate-distortion objective is computationally impractical because it requires iteratively encoding and decoding the entire image, plus feature evaluation, for each possible coding configuration. We address this problem by simplifying the RDO formulation to make the distortion term computable using block-based encoders. We first apply Taylor's expansion to the feature extractor, recasting the feature distance as a quadratic metric involving the Jacobian matrix of the neural network. Then, we replace the linearized metric with a block-wise approximation, which we call input-dependent squared error (IDSE). To make the metric computable, we approximate IDSE using sketches of the Jacobian. The resulting loss can be evaluated block-wise in the transform domain and combined with the sum of squared errors (SSE) to address both visual quality and computer vision performance. Simulations with AVC and HEVC across multiple feature extractors and downstream networks show up to 17% bit-rate savings for the same task accuracy compared to RDO based on SSE, with no decoder complexity overhead and a small (7.86%) encoder complexity increase.
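
A minimal sketch of the linearized feature distortion is shown below: the feature difference is approximated by a Jacobian-vector product J(x_hat - x), optionally compressed with a random sketch so only a low-dimensional projection is evaluated. It assumes `feature_fn` maps an image tensor to a feature tensor, and it does not reproduce the paper's block-wise, transform-domain IDSE formulation.

    import torch
    from torch.autograd.functional import jvp

    def linearized_feature_distortion(feature_fn, x, x_hat, sketch_dim=None, seed=0):
        # Approximates ||f(x_hat) - f(x)||^2 by ||J e||^2 with e = x_hat - x and J
        # the Jacobian of the feature extractor f at x, computed via one
        # Jacobian-vector product. Optionally projects J e with a random Gaussian
        # sketch S so only the low-dimensional quantity ||S J e||^2 is evaluated.
        e = (x_hat - x).detach()
        _, je = jvp(feature_fn, (x,), (e,))
        je = je.flatten()
        if sketch_dim is not None:
            g = torch.Generator().manual_seed(seed)
            S = torch.randn(sketch_dim, je.numel(), generator=g) / sketch_dim ** 0.5
            je = S @ je
        return je.pow(2).sum()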

replace-cross Uni-AIMS: AI-Powered Microscopy Image Analysis

Authors: Yanhui Hong, Nan Wang, Zhiyi Xia, Haoyi Tao, Xi Fang, Yiming Li, Jiankun Wang, Peng Jin, Xiaochen Cai, Shengyu Li, Ziqi Chen, Zezhong Zhang, Guolin Ke, Linfeng Zhang

Abstract: This paper presents a systematic solution for the intelligent recognition and automatic analysis of microscopy images. We developed a data engine that generates high-quality annotated datasets through a combination of collecting diverse microscopy images from experiments, synthetic data generation, and a human-in-the-loop annotation process. To address the unique challenges of microscopy images, we propose a segmentation model capable of robustly detecting both small and large objects. The model effectively identifies and separates thousands of closely situated targets, even in cluttered visual environments. Furthermore, our solution supports the precise automatic recognition of image scale bars, an essential feature in quantitative microscopic analysis. Building upon these components, we have constructed a comprehensive intelligent analysis platform and validated its effectiveness and practicality in real-world applications. This study not only advances automatic recognition in microscopy imaging but also ensures scalability and generalizability across multiple application domains, offering a powerful tool for automated microscopic analysis in interdisciplinary research. An online application is made available for researchers to access and evaluate the proposed automated analysis service.

replace-cross Concept-Guided Interpretability via Neural Chunking

Authors: Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata

Abstract: Neural networks are often described as black boxes, reflecting the significant challenge of understanding their internal workings and interactions. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage our cognitive tendency of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract recurring chunks on a neural population level, complementing each other based on label availability and neural data dimensionality. Discrete sequence chunking (DSC) learns a dictionary of entities in a lower-dimensional neural space; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting concept-encoding entities agnostic to model architectures. These concepts can be concrete (words), abstract (POS tags), or structural (narrative schema). Additionally, we show that extracted chunks play a causal role in network behavior, as grafting them leads to controlled and predictable changes in the model's behavior. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.
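
Population averaging (PA), as described above, can be pictured as averaging population states over occurrences of a known label and then matching new states against the resulting prototypes. The NumPy sketch below illustrates that reading; the array shapes, cosine-similarity matching, and threshold are assumptions for illustration, not the paper's exact procedure.

    import numpy as np

    def population_average_concepts(activations, labels):
        # activations: (T, D) neural population states (e.g. hidden states per token).
        # labels: length-T list of concept labels (None where no label applies).
        # Returns a dict mapping each label to the mean population vector over its
        # occurrences; these averaged "chunks" act as concept prototypes.
        concepts = {}
        for lab in set(l for l in labels if l is not None):
            idx = [t for t, l in enumerate(labels) if l == lab]
            concepts[lab] = activations[idx].mean(axis=0)
        return concepts

    def detect_concept(state, concepts, threshold=0.8):
        # Return the best-matching concept for one population state, or None.
        best_lab, best_sim = None, threshold
        for lab, proto in concepts.items():
            sim = float(state @ proto) / (np.linalg.norm(state) * np.linalg.norm(proto) + 1e-8)
            if sim > best_sim:
                best_lab, best_sim = lab, sim
        return best_lab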

replace-cross mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation

Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu

Abstract: Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.

replace-cross UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation

Authors: Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei

Abstract: Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods for visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed $<$instruction, visual affordance$>$ pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite only being trained on rendered objects in simulation. Using affordances provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/

URLs: https://unsup-affordance.github.io/
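
A "lightweight task-conditioned decoder atop frozen features" can be pictured as a small head that modulates per-patch visual features with an instruction embedding (here via FiLM-style scale and shift) and predicts an affordance score per patch. The PyTorch sketch below is a hedged illustration; the dimensions and fusion scheme are assumptions, not the paper's design.

    import torch.nn as nn

    class TaskConditionedAffordanceDecoder(nn.Module):
        # Maps frozen per-patch visual features plus a task (instruction) embedding
        # to a per-patch affordance logit.
        def __init__(self, vis_dim=1024, txt_dim=512, hidden=256):
            super().__init__()
            self.film = nn.Linear(txt_dim, 2 * vis_dim)       # per-task scale and shift
            self.head = nn.Sequential(
                nn.Linear(vis_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

        def forward(self, patch_feats, task_emb):
            # patch_feats: (B, N, vis_dim) frozen features; task_emb: (B, txt_dim)
            scale, shift = self.film(task_emb).chunk(2, dim=-1)
            x = patch_feats * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
            return self.head(x).squeeze(-1)                   # (B, N) affordance logits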

replace-cross An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

Authors: Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie

Abstract: Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence. DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks integrating over 40 specialized tools and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1,013 diseases. In HPO-based evaluations, DeepRare significantly outperforms 15 other methods, such as traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at Recall@1 compared to Exomiser's 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts achieves 95.40% agreement. Furthermore, the DeepRare system has been implemented as a user-friendly web application http://raredx.cn/doctor.

URLs: http://raredx.cn/doctor.

replace-cross Objective Task-based Evaluation of Quantitative Medical Imaging Methods: Emerging Frameworks and Future Directions

Authors: Yan Liu, Huitian Xia, Nancy A. Obuchowski, Richard Laforest, Arman Rahmim, Barry A. Siegel, Abhinav K. Jha

Abstract: Quantitative imaging (QI) is demonstrating strong promise across multiple clinical applications. For clinical translation of QI methods, objective evaluation on clinically relevant tasks is essential. To address this need, multiple evaluation strategies are being developed. In this paper, based on previous literature, we outline four emerging frameworks to perform evaluation studies of QI methods. We first discuss the use of virtual imaging trials (VITs) to evaluate QI methods. Next, we outline a no-gold-standard evaluation framework to clinically evaluate QI methods without ground truth. Third, a framework to evaluate QI methods for joint detection and quantification tasks is outlined. Finally, we outline a framework to evaluate QI methods that output multi-dimensional parameters, such as radiomic features. We review these frameworks, discussing their utilities and limitations. Further, we examine future research areas in evaluation of QI methods. Given the recent advancements in PET, including long axial field-of-view scanners and the development of artificial-intelligence algorithms, we present these frameworks in the context of PET.

replace-cross TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update

Authors: Jeongyun Kim, Seunghoon Jeong, Giseop Kim, Myung-Hwan Jeon, Eunji Jun, Ayoung Kim

Abstract: Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce TRAN-D, a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. Our key insight lies in separating transparent objects from the background, enabling focused optimization of Gaussians corresponding to the object. We mitigate artifacts with an object-aware loss that places Gaussians in obscured regions, ensuring coverage of invisible surfaces while reducing overfitting. Furthermore, we incorporate a physics-based simulation that refines the reconstruction in just a few seconds, effectively handling object removal and chain-reaction movement of remaining objects without the need for rescanning. TRAN-D is evaluated on both synthetic and real-world sequences, and it consistently demonstrates robust improvements over existing GS-based state-of-the-art methods. In comparison with baselines, TRAN-D reduces the mean absolute error by over 39% for the synthetic TRansPose sequences. Furthermore, despite being updated using only one image, TRAN-D reaches a {\delta} < 2.5 cm accuracy of 48.46%, over 1.5 times that of baselines, which use six images. Code and more results are available at https://jeongyun0609.github.io/TRAN-D/.

URLs: https://jeongyun0609.github.io/TRAN-D/.

replace-cross Handcrafted vs. Deep Radiomics vs. Fusion vs. Deep Learning: A Comprehensive Review of Machine Learning-Based Cancer Outcome Prediction in PET and SPECT Imaging

Authors: Mohammad R. Salmanpour, Somayeh Sadat Mehrnia, Sajad Jabarzadeh Ghandilu, Sonya Falahati, Shahram Taeb, Ghazal Mousavi, Mehdi Maghsoudi, Ahmad Shariftabrizi, Ilker Hacihaliloglu, Arman Rahmim

Abstract: Machine learning (ML), including deep learning (DL) and radiomics-based methods, is increasingly used for cancer outcome prediction with PET and SPECT imaging. However, the comparative performance of handcrafted radiomics features (HRF), deep radiomics features (DRF), DL models, and hybrid fusion approaches remains inconsistent across clinical applications. This systematic review analyzed 226 studies published from 2020 to 2025 that applied ML to PET or SPECT imaging for outcome prediction. Each study was evaluated using a 59-item framework covering dataset construction, feature extraction, validation methods, interpretability, and risk of bias. We extracted key details, including model type, cancer site, imaging modality, and performance metrics such as accuracy and area under the curve (AUC). PET-based studies (95% of those reviewed) generally outperformed those using SPECT, likely due to higher spatial resolution and sensitivity. DRF models achieved the highest mean accuracy (0.862), while fusion models yielded the highest AUC (0.861). ANOVA confirmed significant differences in performance (accuracy: p=0.0006, AUC: p=0.0027). Common limitations included inadequate handling of class imbalance (59%), missing data (29%), and low population diversity (19%). Only 48% of studies adhered to IBSI standards. These findings highlight the need for standardized pipelines, improved data quality, and explainable AI to support clinical integration.
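
To illustrate the type of comparison reported above (this is not the review's actual analysis), a one-way ANOVA across model families can be run on per-study accuracy values; the group values below are placeholders.

    from scipy.stats import f_oneway

    # Placeholder per-study accuracies for three model families (illustrative values only).
    hrf_acc = [0.81, 0.84, 0.79, 0.83]
    drf_acc = [0.86, 0.88, 0.85, 0.87]
    fusion_acc = [0.85, 0.87, 0.86, 0.88]

    f_stat, p_value = f_oneway(hrf_acc, drf_acc, fusion_acc)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 would indicate a group difference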

replace-cross adder-viz: Real-Time Visualization Software for Transcoding Event Video

Authors: Andrew C. Freeman, Luke Reinkensmeyer

Abstract: Recent years have brought about a surge in neuromorphic ``event'' video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADDER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcoding processes and in-the-loop applications. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.

URLs: https://github.com/ac-freeman/adder-codec-rs.

replace-cross Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation

Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud

Abstract: Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, which we use to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks and (ii) on long-tail visual concepts, since Bengali SL is unlikely to appear in VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, such as hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than the free-form natural text produced by vanilla prompting. We envision that our work will promote inclusivity and advancement in SL learning systems for under-resourced communities.
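
As a rough illustration of parameter-infused prompting (the exact BdSLIG prompt template is not reproduced here), standard sign parameters can be serialized into the instruction-generation prompt roughly as follows; the function name, field wording, and example values are assumptions.

    def build_spi_prompt(sign_name, hand_shape, motion, orientation):
        """Assemble a hypothetical Sign Parameter-Infused (SPI) prompt for a VLM."""
        return (
            f"Generate step-by-step learner instructions for the sign '{sign_name}'.\n"
            f"Use these sign parameters:\n"
            f"- Hand shape: {hand_shape}\n"
            f"- Motion: {motion}\n"
            f"- Orientation: {orientation}\n"
            "Write numbered instructions that a non-signer can follow to reproduce the sign."
        )

    prompt = build_spi_prompt("water", "flat open hand", "tap twice", "palm facing inward")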

replace-cross Curvature Learning for Generalization of Hyperbolic Neural Networks

Authors: Xiaomeng Fan, Yuwei Wu, Zhi Gao, Mehrtash Harandi, Yunde Jia

Abstract: Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures by exploiting the geometric properties of hyperbolic spaces, which are characterized by negative curvature. Curvature plays a crucial role in optimizing HNNs: inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, however, a theoretical foundation for the effect of curvature on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound for HNNs, highlighting the role of curvature in the generalization of HNNs via its effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method that smooths the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. We then introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating the gradients with respect to the curvatures. We present approximation-error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded and that the method converges, by bounding the gradients of HNNs. Experiments on four settings (classification, learning from long-tailed data, learning from noisy data, and few-shot learning) show that our method improves the performance of HNNs.
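
For a concrete picture of the bi-level formulation sketched above, a generic sharpness-aware objective over a curvature parameter $c$ can be written as below; this is an illustrative form consistent with the description, not necessarily the paper's exact objective.

    \min_{c}\; \max_{\|\epsilon\| \le \rho}\; \mathcal{L}\!\left(\theta^{*}(c),\, c + \epsilon\right)
    \quad \text{s.t.} \quad
    \theta^{*}(c) \in \arg\min_{\theta}\, \mathcal{L}(\theta, c)

Here the inner maximization measures how sharply the loss changes with respect to curvature within a radius $\rho$, and implicit differentiation approximates the gradient of the outer objective through $\theta^{*}(c)$ without unrolling the inner training loop.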

replace-cross Generative Feature Imputing -- A Technique for Error-resilient Semantic Communication

Authors: Jianhao Huang, Qunsong Zeng, Hongyang Du, Kaibin Huang

Abstract: Semantic communication (SemCom) has emerged as a promising paradigm for achieving unprecedented communication efficiency in sixth-generation (6G) networks by leveraging artificial intelligence (AI) to extract and transmit the underlying meanings of source data. However, deploying SemCom over digital systems presents new challenges, particularly in ensuring robustness against transmission errors that may distort semantically critical content. To address this issue, this paper proposes a novel framework, termed generative feature imputing, which comprises three key techniques. First, we introduce a spatial error concentration packetization strategy that spatially concentrates feature distortions by encoding feature elements based on their channel mappings, a property crucial for both the effectiveness and reduced complexity of the subsequent techniques. Second, building on this strategy, we propose a generative feature imputing method that utilizes a diffusion model to efficiently reconstruct missing features caused by packet losses. Finally, we develop a semantic-aware power allocation scheme that enables unequal error protection by allocating transmission power according to the semantic importance of each packet. Experimental results demonstrate that the proposed framework outperforms conventional approaches, such as Deep Joint Source-Channel Coding (DJSCC) and JPEG2000, under block fading conditions, achieving higher semantic accuracy and lower Learned Perceptual Image Patch Similarity (LPIPS) scores.
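
As a toy illustration of unequal error protection driven by semantic importance (the paper's actual allocation rule is not reproduced here), a transmission power budget can be split across packets in proportion to importance weights; the proportional rule and uniform fallback below are assumptions.

    import numpy as np

    def semantic_power_allocation(importance, total_power):
        """Split a power budget across packets proportionally to semantic importance.

        Toy allocation rule for illustration; a real scheme would also account for
        channel state and target error rates.
        """
        w = np.clip(np.asarray(importance, dtype=np.float64), 0.0, None)
        if w.sum() == 0.0:
            return np.full_like(w, total_power / len(w))  # uniform fallback
        return total_power * w / w.sum()

    powers = semantic_power_allocation([0.7, 0.2, 0.1], total_power=1.0)  # -> [0.7, 0.2, 0.1]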