Authors: Ali K. AlShami, Ryan Rabinowitz, Maged Shoman, Jianwu Fang, Lukas Picek, Shao-Yuan Lo, Steve Cruz, Khang Nhut Lam, Nachiket Kamod, Lei-Lei Li, Jugal Kalita, Terrance E. Boult
Abstract: As the computer vision community advances autonomous driving algorithms, integrating vision-based insights with sensor data remains essential for improving perception, decision making, planning, prediction, simulation, and control. Yet we must ask: Why don't we have entirely safe self-driving cars yet? A key part of the answer lies in addressing novel scenarios, one of the most critical barriers to real-world deployment. Our 2COOOL workshop provides a dedicated forum for researchers and industry experts to push the state of the art in novelty handling, including out-of-distribution hazard detection, vision-language models for hazard understanding, new benchmarking and methodologies, and safe autonomous driving practices. The 2nd Workshop on the Challenge of Out-of-Label Hazards in Autonomous Driving (2COOOL) will be held at the International Conference on Computer Vision (ICCV) 2025 in Honolulu, Hawaii, on October 19, 2025. We aim to inspire the development of new algorithms and systems for hazard avoidance, drawing on ideas from anomaly detection, open-set recognition, open-vocabulary modeling, domain adaptation, and related fields. Building on the success of its inaugural edition at the Winter Conference on Applications of Computer Vision (WACV) 2025, the workshop will feature a mix of academic and industry participation.
Authors: Alireza Golkarieh, Kiana Kiashemshaki, Sajjad Rezvani Boroujeni
Abstract: This study investigates deep learning methods for automated classification of dental conditions in panoramic X-ray images. A dataset of 1,512 radiographs with 11,137 expert-verified annotations across four conditions (fillings, cavities, implants, and impacted teeth) was used. After preprocessing and class balancing, three approaches were evaluated: a custom convolutional neural network (CNN), hybrid models combining CNN feature extraction with traditional classifiers, and fine-tuned pre-trained architectures. Experiments employed 5-fold cross-validation with accuracy, precision, recall, and F1-score as evaluation metrics. The hybrid CNN-Random Forest model achieved the highest performance with 85.4% accuracy, surpassing the custom CNN baseline of 74.3%. Among pre-trained models, VGG16 performed best at 82.3% accuracy, followed by Xception and ResNet50. Results show that hybrid models improve discrimination of morphologically similar conditions and provide efficient, reliable performance. These findings suggest that combining CNN-based feature extraction with ensemble classifiers offers a practical path toward automated dental diagnostic support, while also highlighting the need for larger datasets and further clinical validation.
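To make the hybrid setup concrete, the following is a minimal sketch (not the authors' code) of CNN feature extraction feeding a Random Forest classifier; the architecture, image size, and four-class labels are placeholder assumptions.

```python
# Sketch only: a small CNN as a frozen feature extractor, with a Random Forest
# performing the final 4-way classification, as the abstract describes.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class TinyCNN(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.proj(self.backbone(x).flatten(1))

def extract_features(model, images):
    model.eval()
    with torch.no_grad():
        return model(images).cpu().numpy()

# Dummy grayscale crops standing in for preprocessed panoramic X-rays.
images = torch.randn(64, 1, 96, 96)
labels = np.random.randint(0, 4, size=64)   # fillings / cavities / implants / impacted

cnn = TinyCNN()
features = extract_features(cnn, images)    # CNN features ...
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(features, labels)                   # ... classified by a Random Forest
print("train accuracy:", clf.score(features, labels))
```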
Authors: Namu Kim, Wonbin Kweon, Minsoo Kim, Hwanjo Yu
Abstract: We observe that zero-shot appearance transfer with large-scale image generation models faces a significant challenge: Attention Leakage. This challenge arises when the semantic mapping between two images is captured by the Query-Key alignment. To tackle this issue, we introduce Q-Align, utilizing Query-Query alignment to mitigate attention leakage and improve the semantic alignment in zero-shot appearance transfer. Q-Align incorporates three core contributions: (1) Query-Query alignment, facilitating the sophisticated spatial semantic mapping between two images; (2) Key-Value rearrangement, enhancing feature correspondence through realignment; and (3) Attention refinement using rearranged keys and values to maintain semantic consistency. We validate the effectiveness of Q-Align through extensive experiments and analysis, and Q-Align outperforms state-of-the-art methods in appearance fidelity while maintaining competitive structure preservation.
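A minimal sketch of the Query-Query alignment idea, assuming standard scaled dot-product attention; tensor names, shapes, and the omission of the Key-Value rearrangement step are simplifications, not Q-Align's implementation.

```python
import torch

def qq_aligned_attention(q_tgt, q_ref, v_ref, scale=None):
    """Toy Query-Query alignment: the semantic mapping between target and
    reference images is computed from the two queries (q_tgt @ q_ref^T) rather
    than the usual query-key product, and the weights attend over reference
    values. Shapes: (batch, tokens, dim)."""
    d = q_tgt.shape[-1]
    scale = scale or d ** -0.5
    attn = torch.softmax(q_tgt @ q_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref   # appearance features pulled from the reference image

b, n, d = 1, 16, 64
q_tgt, q_ref, v_ref = torch.randn(b, n, d), torch.randn(b, n, d), torch.randn(b, n, d)
print(qq_aligned_attention(q_tgt, q_ref, v_ref).shape)  # torch.Size([1, 16, 64])
```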
Authors: Xurui Peng, Hong Liu, Chenqian Yan, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, Mingbao Lin
Abstract: Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.
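For intuition, a toy sketch of cached diffusion sampling with a correction coefficient, in the spirit of the abstract; the model, the reusable-step schedule, and the correction rule are placeholders, not ERTACache's actual components.

```python
import torch

def cached_denoising_loop(model, x, timesteps, reusable, correction=0.95):
    cache = None
    for i, t in enumerate(timesteps):
        if reusable[i] and cache is not None:
            eps = correction * cache          # reuse (rectified) cached output
        else:
            eps = model(x, t)                 # full forward pass
            cache = eps
        x = x - 0.1 * eps                     # toy update step
    return x

# Dummy "model": predicts noise from the current sample and timestep.
model = lambda x, t: 0.01 * x
x0 = torch.randn(1, 4, 8, 8)
steps = list(range(10, 0, -1))
reusable = [i % 2 == 1 for i in range(len(steps))]   # cache every other step
print(cached_denoising_loop(model, x0, steps, reusable).shape)
```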
Authors: Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. (May) Fung, Manling Li, Heng Ji
Abstract: Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) preserving the invariance and consistency of any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load: during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline that outperforms prior approaches on seemingly similar tasks by 0.47 in F1 score on video trimming while achieving competitive query rewriting performance. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
Authors: Zhe Han, Charlie Budd, Gongyu Zhang, Huanyu Tian, Christos Bergeles, Tom Vercauteren
Abstract: Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head-to-head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high-quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.
Authors: Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo
Abstract: Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.
Authors: Kei Katsumata, Yui Iioka, Naoki Hosomi, Teruhisa Misu, Kentaro Yamada, Komei Sugiura
Abstract: We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobile platform. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform in handling stuff-type target regions, in addition to absent or multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at https://gennav.vercel.app/.
Authors: Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then the model undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
Authors: Harris Song, Tuan-Anh Vu, Sanjith Menon, Sriram Narasimhan, M. Khalid Jawed
Abstract: Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naïve fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.
Authors: Ao Shen, Xueming Fu, Junfeng Jiang, Qiang Zeng, Ye Tang, Zhengming Chen, Luming Nong, Feng Wang, S. Kevin Zhou
Abstract: Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional "render and compare" methods, relying on iterative projection and comparison, suffer from spatial information loss and domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-ray vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate state-of-the-art performance on both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.
Authors: Kevin Mayer, Alex Vesel, Xinyi Zhao, Martin Fischer
Abstract: 3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at https://github.com/kdmayer/SYNBUILD-3D.
Authors: Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt
Abstract: Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion, particularly the radial component (called radial distortion), simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks, including images taken with fisheye cameras. The source code for our solvers will be made available in the event our paper is accepted for publication.
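For context, the commonly used division model and the resulting distorted-homography constraint can be written as follows (our notation, not necessarily the paper's):

```latex
% Division model: an observed (distorted) point (x_d, y_d) lifts to
\hat{\mathbf{x}}_d(\lambda) =
  \begin{pmatrix} x_d \\ y_d \\ 1 + \lambda\,(x_d^2 + y_d^2) \end{pmatrix},
% and for the general case (iii), with distortion parameters \lambda_1, \lambda_2
% in the two images, the homography H relates the lifted points up to scale:
\hat{\mathbf{x}}'_d(\lambda_2) \;\simeq\; H\,\hat{\mathbf{x}}_d(\lambda_1).
% Case (i) corresponds to fixing one of the parameters to zero, and case (ii)
% to enforcing \lambda_1 = \lambda_2.
```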
Authors: Zhenghao He, Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Abstract: Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
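A minimal sketch of attention-based fusion of per-layer CAVs into one global vector, which is the general idea the abstract describes; the projection sizes, learned query, and fusion rule are illustrative assumptions, not the GCAV code.

```python
import torch
import torch.nn as nn

class CAVFusion(nn.Module):
    def __init__(self, layer_dims, common_dim=128):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, common_dim) for d in layer_dims)
        self.query = nn.Parameter(torch.randn(common_dim))

    def forward(self, cavs):
        # cavs: list of per-layer concept vectors, cavs[i] has shape (layer_dims[i],)
        z = torch.stack([p(c) for p, c in zip(self.projs, cavs)])   # (layers, common_dim)
        w = torch.softmax(z @ self.query, dim=0)                    # attention over layers
        return (w.unsqueeze(-1) * z).sum(0)                         # global CAV

fusion = CAVFusion(layer_dims=[256, 512, 1024])
cavs = [torch.randn(256), torch.randn(512), torch.randn(1024)]
print(fusion(cavs).shape)  # torch.Size([128])
```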
Authors: Zhizhong Huang, Xiaoming Liu
Abstract: Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
Authors: Keshav Jha, William Sharp, Dominic LaBella
Abstract: Accurate delineation of pancreatic tumors is critical for diagnosis, treatment planning, and outcome assessment, yet automated segmentation remains challenging due to anatomical variability and limited dataset availability. In this study, SegResNet models, as part of the Auto3DSeg architecture, were trained and evaluated on two MRI-based pancreatic tumor segmentation tasks as part of the 2025 PANTHER Challenge. Algorithm methodology included 5-fold cross-validation with STAPLE ensembling after focusing on an anatomically relevant region-of-interest. The training set for Task 1 (Pancreatic Tumor Segmentation on Diagnostic MRI) included 91 T1-weighted arterial contrast-enhanced MRIs with expert-annotated pancreas and tumor labels. The training set for Task 2 (Pancreatic Tumor Segmentation on MR-Linac) comprised 50 T2-weighted MR-Linac cases with expert-annotated pancreas and tumor labels. Algorithm-automated segmentation performance for the pancreatic tumor was assessed using Dice Similarity Coefficient (DSC), 5 mm DSC, 95th percentile Hausdorff Distance (HD95), Mean Average Surface Distance (MASD), and Root Mean Square Error (RMSE). For Task 1, the algorithm achieved a DSC of 0.56, 5 mm DSC of 0.73, HD95 of 41.1 mm, MASD of 26.0 mm, and RMSE of 5,164 mm. For Task 2, performance decreased, with a DSC of 0.33, 5 mm DSC of 0.50, HD95 of 20.1 mm, MASD of 7.2 mm, and RMSE of 17,203 mm. These findings illustrate the challenges of MRI-based pancreatic tumor segmentation with small datasets, highlighting variability introduced by different MRI sequences. Despite modest performance, the results demonstrate potential for automated delineation and emphasize the need for larger, standardized MRI datasets to improve model robustness and clinical utility.
Authors: Yidong Zhao, Peter Kellman, Hui Xue, Tongyun Yang, Yi Zhang, Yuchi Han, Orlando Simonetti, Qian Tao
Abstract: Pretrained segmentation models for cardiac magnetic resonance imaging (MRI) struggle to generalize across different imaging sequences due to significant variations in image contrast. These variations arise from changes in imaging protocols, yet the same fundamental spin properties, including proton density, T1, and T2 values, govern all acquired images. With this core principle, we introduce Reverse Imaging, a novel physics-driven method for cardiac MRI data augmentation and domain adaptation to fundamentally solve the generalization problem. Our method reversely infers the underlying spin properties from observed cardiac MRI images, by solving ill-posed nonlinear inverse problems regularized by the prior distribution of spin properties. We acquire this "spin prior" by learning a generative diffusion model from the multiparametric SAturation-recovery single-SHot acquisition sequence (mSASHA) dataset, which offers joint cardiac T1 and T2 maps. Our method enables approximate but meaningful spin-property estimates from MR images, which provide an interpretable "latent variable" that leads to highly flexible image synthesis of arbitrary novel sequences. We show that Reverse Imaging enables highly accurate segmentation across vastly different image contrasts and imaging protocols, realizing wide-spectrum generalization of cardiac MRI segmentation.
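One common way to write such a prior-regularized inverse problem is sketched below; the exact objective and solver used in the paper may differ.

```latex
% Sketch (our notation): given an observed image y, a sequence-specific forward
% (signal) model A, and spin properties \theta = (\rho, T_1, T_2),
\hat{\theta} \;=\; \arg\min_{\theta}\;
  \big\| A(\theta) - y \big\|_2^2 \;-\; \log p_{\phi}(\theta),
% where p_{\phi} is the generative "spin prior" learned from mSASHA data; novel
% sequences can then be synthesized by applying their own forward models to \hat{\theta}.
```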
Authors: Hsuan-I Ho, Chen Guo, Po-Chen Wu, Ivan Shugurov, Chengcheng Tang, Abhay Mittal, Sizhe An, Manuel Kaufmann, Linguang Zhang
Abstract: We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Traditional HMR methods are designed to be user-agnostic and optimized for generalization. While these methods often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy by failing to jointly account for person-specific body shapes and the plausibility of 3D poses. In contrast, our pipeline decouples this process by first calibrating the user's body shape and then employing a personalized pose fitting process conditioned on that shape. To achieve this, we develop a body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, which iteratively guides the pose fitting via a Point Distillation Sampling loss. This learned 3D pose prior effectively mitigates errors arising from an over-reliance on 2D constraints. Consequently, our approach improves not only pelvis-aligned pose accuracy but also absolute pose accuracy, an important metric often overlooked by prior work. Furthermore, our method is highly data-efficient, requiring only synthetic data for training, and serves as a versatile plug-and-play module that can be seamlessly integrated with existing 3D pose estimators to enhance their performance. Project page: https://phd-pose.github.io/
Authors: Yuquan Bi, Hongsong Wang, Xinli Shi, Zhipeng Gui, Jie Gui, Yuan Yan Tang
Abstract: Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
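A toy illustration of frame-level pruning by temporal redundancy (not TCEP's adaptive temporal graph): frames whose pose tokens barely change relative to the last kept frame are dropped; the threshold and token shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def prune_redundant_frames(tokens, sim_thresh=0.995):
    # tokens: (frames, joints, dim) pose tokens for a sequence
    keep = [0]
    for t in range(1, tokens.shape[0]):
        sim = F.cosine_similarity(tokens[t].flatten(),
                                  tokens[keep[-1]].flatten(), dim=0)
        if sim < sim_thresh:          # enough motion -> keep the frame
            keep.append(t)
    return tokens[keep], keep

seq = torch.randn(81, 17, 64)         # dummy sequence of 81 frames, 17 joints
pruned, kept_idx = prune_redundant_frames(seq)
print(pruned.shape, len(kept_idx))
```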
Authors: Qingran Miao, Haixia Wang, Haohao Sun, Yilong Zhang
Abstract: Optical Coherence Tomography (OCT) enables the acquisition of high-resolution, three-dimensional fingerprint data, capturing rich subsurface structures for robust biometric recognition. However, the high cost and time-consuming nature of OCT data acquisition have led to a scarcity of large-scale public datasets, significantly hindering the development of advanced algorithms, particularly data-hungry deep learning models. To address this critical bottleneck, this paper introduces Print2Volume, a novel framework for generating realistic, synthetic OCT-based 3D fingerprints from a 2D fingerprint image. Our framework operates in three sequential stages: (1) a 2D style transfer module that converts a binary fingerprint into a grayscale image mimicking the style of a Z-direction mean-projected OCT scan; (2) a 3D Structure Expansion Network that extrapolates the 2D image into a plausible 3D anatomical volume; and (3) an OCT Realism Refiner, based on a 3D GAN, that renders the structural volume with authentic textures, speckle noise, and other imaging characteristics. Using Print2Volume, we generated a large-scale synthetic dataset of 420,000 samples. Quantitative experiments demonstrate the high quality of our synthetic data and its significant impact on recognition performance. By pre-training a recognition model on our synthetic data and fine-tuning it on a small real-world dataset, we achieved a remarkable reduction in the Equal Error Rate (EER) from 15.62% to 2.50% on the ZJUT-EIFD benchmark, proving the effectiveness of our approach in overcoming data scarcity.
Authors: Andreas Leibetseder, Sabrina Kletz, Klaus Schoeffmann, Simon Keckstein, Jörg Keckstein
Abstract: Gynecologic laparoscopy, a type of minimally invasive surgery (MIS), is performed via a live feed of a patient's abdomen that shows the insertion and handling of various instruments for conducting treatment. This kind of surgical intervention not only facilitates a great variety of treatments; the possibility of recording the video streams is also essential for numerous post-surgical activities, such as treatment planning, case documentation and education. Nonetheless, manually analyzing surgical recordings, as is done in current practice, usually proves tediously time-consuming. To improve upon this situation, more sophisticated computer vision and machine learning approaches are actively being developed. Since most such approaches rely heavily on sample data, which is only sparsely available in the medical field in particular, with this work we publish the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA) - an image dataset containing region-based annotations of a common medical condition named endometriosis, i.e. the dislocation of uterine-like tissue. The dataset is the first of its kind and has been created in collaboration with leading medical experts in the field.
Authors: Sabrina Kletz, Klaus Schoeffmann, Jenny Benois-Pineau, Heinrich Husslein
Abstract: Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing - the basis for content-based search in a medical video archive - is still a great challenge due to the very special video content. In this work, we investigate segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background by using a region-based fully convolutional network for instance-aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument or background), we also investigate multi-class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.
Authors: Jakub Straka, Ivan Gruber
Abstract: Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. We also provide a rigorous ablation study evaluating SatDINO's individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/SatDINO.
Authors: Gernot Fiala, Markus Plass, Robert Harb, Peter Regitnig, Kristijan Skok, Wael Al Zoughbi, Carmen Zerner, Paul Torke, Michaela Kargl, Heimo Müller, Tomas Brazdil, Matej Gallo, Jaroslav Kubín, Roman Stoklasa, Rudolf Nenutil, Norman Zerbe, Andreas Holzinger, Petr Holub
Abstract: A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images can be viewed, analyzed, shared digitally, and are used today for Artificial Intelligence (AI) algorithm development. WSIs are used in a variety of fields, including pathology for diagnosing diseases and oncology for cancer research. They are also utilized in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for the training or validation of an AI algorithm, it is essential to know what is present on such a WSI. However, there is currently no standard for this metadata, so such selection has mainly been done through manual inspection, which is not suitable for large collections with several million objects. We propose a general framework to generate a 2D index map for WSI and a profiling mechanism for specific application domains. We demonstrate this approach in the field of clinical pathology, using common syntax and semantics to achieve interoperability between different catalogs. Our approach augments each WSI collection with a detailed tissue map that provides fine-grained information about the WSI content. The tissue map is organized into three layers: source, tissue type, and pathological alterations, with each layer assigning segments of the WSI to specific classes. We illustrate the advantages and applicability of the proposed standard through specific examples in WSI catalogs, Machine Learning (ML), and graph-based WSI representations.
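Purely for illustration, a single tissue-map entry with the three layers named in the abstract might be organized as below; the field names and class labels are our own examples, not the proposed schema.

```python
# Hypothetical tissue-map entry for one WSI segment (illustrative only).
tissue_map_entry = {
    "wsi_id": "example-slide-0001",
    "segment": {"x": 1024, "y": 2048, "width": 512, "height": 512},
    "layers": {
        "source": "human",                                # layer 1: specimen source
        "tissue_type": "colon mucosa",                    # layer 2: tissue type
        "pathological_alterations": ["adenocarcinoma"],   # layer 3: alterations
    },
}
print(tissue_map_entry["layers"]["tissue_type"])
```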
Authors: Lucas Rakotoarivony
Abstract: Deep learning models have achieved state-of-the-art performance in many computer vision tasks. However, in real-world scenarios, novel classes that were unseen during training often emerge, requiring models to acquire new knowledge incrementally. Class-Incremental Learning (CIL) methods enable a model to learn novel classes while retaining knowledge of previous classes. However, these methods make the strong assumption that the incremental dataset is fully labeled, which is unrealistic in practice. In this work, we propose an unsupervised Incremental Learning method using Confidence-based Pseudo-labels (ICPL), which replaces human annotations with pseudo-labels, enabling incremental learning from unlabeled datasets. We integrate these pseudo-labels into various CIL methods with confidence-based selection and evaluate performance degradation on CIFAR100 and ImageNet100. Then, we compare our approach to popular Class Incremental Novel Category Discovery (class-iNCD) methods addressing similar challenges. Additionally, we apply our method to fine-grained datasets to demonstrate its real-world practicality and measure its computational complexity to validate its suitability for resource-constrained environments. ICPL achieves competitive results compared to supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy.
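A minimal sketch of confidence-based pseudo-labelling on an unlabeled split, assuming a softmax classifier; the threshold and the downstream CIL integration are placeholders, not the ICPL recipe.

```python
import torch
import torch.nn.functional as F

def pseudo_label(model, unlabeled_loader, threshold=0.9, device="cpu"):
    model.eval()
    samples, labels = [], []
    with torch.no_grad():
        for x in unlabeled_loader:
            probs = F.softmax(model(x.to(device)), dim=1)
            conf, pred = probs.max(dim=1)
            keep = conf >= threshold          # keep only confident predictions
            samples.append(x[keep.cpu()])
            labels.append(pred[keep].cpu())
    return torch.cat(samples), torch.cat(labels)

# Toy usage: a linear "model" over flattened 8x8 inputs and two dummy batches.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
loader = [torch.randn(32, 1, 8, 8), torch.randn(32, 1, 8, 8)]
x_pl, y_pl = pseudo_label(model, loader, threshold=0.15)  # toy threshold
print(x_pl.shape, y_pl.shape)
```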
Authors: Francisco Caetano, Christiaan Viviers, Peter H. H. de With, Fons van der Sommen
Abstract: Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html
Authors: Dimitri Gominski, Martin Brandt, Xiaoye Tong, Siyu Liu, Maurice Mugabowindekwe, Sizhuo Li, Florian Reiner, Andrew Davies, Rasmus Fensholt
Abstract: Trees are key components of the terrestrial biosphere, playing vital roles in ecosystem function, climate regulation, and the bioeconomy. However, large-scale monitoring of individual trees remains limited by inadequate modelling. Available global products have focused on binary tree cover or canopy height, which do not explicitly identify trees at the individual level. In this study, we present a deep learning approach for detecting large individual trees in 3-m resolution PlanetScope imagery at a global scale. We simulate tree crowns with Gaussian kernels of scalable size, allowing the extraction of crown centers and the generation of binary tree cover maps. Training is based on billions of points automatically extracted from airborne lidar data, enabling the model to successfully identify trees both inside and outside forests. We compare against existing tree cover maps and airborne lidar with state-of-the-art performance (fractional cover R$^2 = 0.81$ against aerial lidar), report balanced detection metrics across biomes, and demonstrate how detection can be further improved through fine-tuning with manual labels. Our method offers a scalable framework for global, high-resolution tree monitoring, and is adaptable to future satellite missions offering improved imagery.
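A small sketch of the supervision target described above: crown centres rendered as Gaussian kernels of scalable size, from which a binary cover map can be thresholded; image size and sigmas are placeholder values, not the paper's settings.

```python
import numpy as np

def render_crown_heatmap(points, sigmas, shape=(256, 256)):
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    heat = np.zeros(shape, dtype=np.float32)
    for (cy, cx), s in zip(points, sigmas):
        g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * s ** 2))
        heat = np.maximum(heat, g)            # overlapping crowns keep the max
    return heat

points = [(64, 80), (150, 200), (200, 40)]    # crown centres (row, col) in pixels
sigmas = [4.0, 8.0, 6.0]                      # scalable kernel size per tree
heatmap = render_crown_heatmap(points, sigmas)
binary_cover = heatmap > 0.5                  # thresholding gives a tree cover map
print(heatmap.shape, int(binary_cover.sum()))
```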
Authors: Jiayu Yang, Weijian Su, Songqian Zhang, Yuqi Han, Jinli Suo, Qiang Zhang
Abstract: 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, a key requirement for immersive applications. However, extending 3DGS to dynamic scenes remains limited by the substantial data volume of dense Gaussians and the prolonged training time required for each frame. This paper presents a scalable Gaussian Splatting framework designed for efficient training in streaming tasks. Specifically, Gaussian spheres are hierarchically organized by scale within an anchor-based structure. Coarser-level Gaussians represent the low-resolution structure of the scene, while finer-level Gaussians, responsible for detailed high-fidelity rendering, are selectively activated by the coarser-level Gaussians. To further reduce computational overhead, we introduce a hybrid deformation and spawning strategy that models inter-frame motion through Gaussian deformation and triggers Gaussian spawning to characterize wide-range motion. Additionally, a bidirectional adaptive masking mechanism enhances training efficiency by removing static regions and prioritizing informative viewpoints. Extensive experiments demonstrate that our framework achieves superior visual quality while significantly reducing training time compared to state-of-the-art methods.
Authors: Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo
Abstract: Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.
Authors: Kaouther Mouheb, Marawan Elbatel, Janne Papma, Geert Jan Biessels, Jurgen Claassen, Huub Middelkoop, Barbara van Munster, Wiesje van der Flier, Inez Ramakers, Stefan Klein, Esther E. Bron
Abstract: While foundation models (FMs) offer strong potential for AI-based dementia diagnosis, their integration into federated learning (FL) systems remains underexplored. In this benchmarking study, we systematically evaluate the impact of key design choices (classification head architecture, fine-tuning strategy, and aggregation method) on the performance and efficiency of federated FM tuning using brain MRI data. Using a large multi-cohort dataset, we find that the architecture of the classification head substantially influences performance, that freezing the FM encoder achieves results comparable to full fine-tuning, and that advanced aggregation methods outperform standard federated averaging. Our results offer practical insights for deploying FMs in decentralized clinical settings and highlight trade-offs that should guide future method development.
Authors: Lucas Rakotoarivony
Abstract: Detecting out-of-distribution (OOD) samples is essential for neural networks operating in open-world settings, particularly in safety-critical applications. Existing methods have improved OOD detection by leveraging two main techniques: feature truncation, which increases the separation between in-distribution (ID) and OOD samples, and scoring functions, which assign scores to distinguish between ID and OOD data. However, most approaches either focus on a single family of techniques or evaluate their effectiveness on a specific type of OOD dataset, overlooking the potential of combining multiple existing solutions. Motivated by this observation, we theoretically and empirically demonstrate that state-of-the-art feature truncation and scoring functions can be effectively combined. Moreover, we show that aggregating multiple scoring functions enhances robustness against various types of OOD samples. Based on these insights, we propose the Multi-Method Ensemble (MME) score, which unifies state-of-the-art OOD detectors into a single, more effective scoring function. Extensive experiments on both large-scale and small-scale benchmarks, covering near-OOD and far-OOD scenarios, show that MME significantly outperforms recent state-of-the-art methods across all benchmarks. Notably, using the BiT model, our method achieves an average FPR95 of 27.57% on the challenging ImageNet-1K benchmark, improving performance by 6% over the best existing baseline.
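A hedged sketch of the general recipe the abstract describes (feature truncation followed by an aggregate of several scoring functions); the clipping rule, the two example scores, and the z-score aggregation are illustrative, not the exact MME formulation.

```python
import torch

def truncate_features(feat, clip=1.0):
    return feat.clamp(max=clip)                       # ReAct-style activation clipping

def energy_score(logits, T=1.0):
    return T * torch.logsumexp(logits / T, dim=1)     # higher = more in-distribution

def msp_score(logits):
    return torch.softmax(logits, dim=1).max(dim=1).values

def ensemble_ood_score(feat, classifier_head, clip=1.0):
    logits = classifier_head(truncate_features(feat, clip))
    scores = torch.stack([energy_score(logits), msp_score(logits)], dim=1)
    z = (scores - scores.mean(0)) / (scores.std(0) + 1e-6)   # normalize each score
    return z.sum(dim=1)                                      # aggregate into one score

head = torch.nn.Linear(512, 1000)
feats = torch.randn(16, 512).abs()
print(ensemble_ood_score(feats, head).shape)   # torch.Size([16])
```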
Authors: Chun Liu, Panpan Ding, Zheng Zheng, Hailong Wang, Bingqian Zhu, Tao Xu, Zhigang Han, Jiayao Wang
Abstract: Current ship detection techniques based on remote sensing imagery primarily rely on the object detection capabilities of deep neural networks (DNNs). However, DNNs are vulnerable to adversarial patch attacks, which can lead to misclassification by the detection model or complete evasion of the targets. Numerous studies have demonstrated that data transformation-based methods can improve the transferability of adversarial examples. However, excessive augmentation of image backgrounds or irrelevant regions may introduce unnecessary interference, resulting in false detections by the object detection model. These errors are not caused by the adversarial patches themselves but rather by the over-augmentation of background and non-target areas. This paper proposes a localized augmentation method that applies augmentation only to the target regions, avoiding any influence on non-target areas. By reducing background interference, this approach enables the loss function to focus more directly on the impact of the adversarial patch on the detection model, thereby improving the attack success rate. Experiments conducted on the HRSC2016 dataset demonstrate that the proposed method effectively increases the success rate of adversarial patch attacks and enhances their transferability.
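A minimal sketch of localized augmentation, assuming axis-aligned target boxes and a simple brightness jitter; the actual augmentations used in the paper may differ.

```python
import torch

def augment_targets_only(image, boxes, brightness_range=(0.8, 1.2)):
    """image: (C, H, W) in [0, 1]; boxes: list of (x1, y1, x2, y2) pixel coords."""
    out = image.clone()
    for (x1, y1, x2, y2) in boxes:
        factor = torch.empty(1).uniform_(*brightness_range)
        out[:, y1:y2, x1:x2] = (out[:, y1:y2, x1:x2] * factor).clamp(0, 1)
    return out

img = torch.rand(3, 256, 256)
ship_boxes = [(40, 60, 120, 110), (150, 180, 220, 230)]   # hypothetical target boxes
aug = augment_targets_only(img, ship_boxes)
print(torch.allclose(img[:, :60, :], aug[:, :60, :]))     # background rows unchanged -> True
```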
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content that is inconsistent with or unrelated to the video input. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the causes of hallucination. Sometimes, models generate incorrect outputs despite correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises when frame-level semantics are aggregated into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
Authors: Ronan Docherty, Antonis Vamvakeros, Samuel J. Cooper
Abstract: Feature foundation models - usually vision transformers - offer rich semantic descriptors of images, useful for downstream tasks such as (interactive) segmentation and object detection. For computational efficiency these descriptors are often patch-based, and so struggle to represent the fine features often present in micrographs; they also struggle with the large image sizes present in materials and biological image analysis. In this work, we train a convolutional neural network to upsample low-resolution (i.e., large patch size) foundation model features with reference to the input image. We apply this upsampler network (without any further training) to efficiently featurise and then segment a variety of microscopy images, including plant cells, a lithium-ion battery cathode and organic crystals. The richness of these upsampled features admits separation of hard-to-segment phases, like hairline cracks. We demonstrate that interactive segmentation with these deep features produces high-quality segmentations far faster and with far fewer labels than training or finetuning a more traditional convolutional network.
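A minimal sketch of the upsampler idea, assuming ViT-style patch features and a small, arbitrary CNN; channel counts and depth are illustrative, not the trained network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureUpsampler(nn.Module):
    def __init__(self, feat_dim=384, img_channels=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim + img_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, feat_dim, 1),
        )

    def forward(self, image, lowres_feats):
        # Bilinearly upsample the patch features, then refine them with image guidance.
        up = F.interpolate(lowres_feats, size=image.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.net(torch.cat([up, image], dim=1))

img = torch.randn(1, 3, 224, 224)
patch_feats = torch.randn(1, 384, 16, 16)    # e.g. ViT patch tokens reshaped to a grid
print(FeatureUpsampler()(img, patch_feats).shape)  # torch.Size([1, 384, 224, 224])
```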
Authors: Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li
Abstract: Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
Authors: Ziwei Liao, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, Michael Firman
Abstract: Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single "mode" for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.
Authors: Yujin Park, Haejun Chung, Ikbeom Jang
Abstract: Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
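A toy version of the hybrid comparison oracle: pairs with a wide prior-score gap are decided automatically, and only ambiguous pairs go to a human; the margin and the score source are assumptions for illustration.

```python
def compare(item_a, item_b, prior_score, ask_human, margin=100.0):
    """Return 1 if item_a ranks above item_b, -1 otherwise."""
    gap = prior_score[item_a] - prior_score[item_b]
    if abs(gap) >= margin:              # easy, obvious pair -> automated decision
        return 1 if gap > 0 else -1
    return ask_human(item_a, item_b)    # uncertain pair -> human annotation

# Toy usage with Elo-like prior scores and a stub "human".
prior = {"img1": 1500.0, "img2": 1320.0, "img3": 1490.0}
human = lambda a, b: 1                  # stands in for a real annotation interface
print(compare("img1", "img2", prior, human))   # wide gap: decided automatically
print(compare("img1", "img3", prior, human))   # close scores: falls back to human
```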
Authors: Ilya A. Petrov, Vladimir Guzov, Riccardo Marin, Emre Aksan, Xu Chen, Daniel Cremers, Thabo Beeler, Gerard Pons-Moll
Abstract: Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrists tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observation. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference, which progressively increases the diffusion timestamp with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a state-of-the-art in egocentric HOI reconstruction.
Authors: Juneyoung Ro, Namwoo Kim, Yoonjin Yoon
Abstract: Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions on street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
Authors: Nico Albert Disch, Yannick Kirchhoff, Robin Peretzke, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, David Zimmerer, Klaus Maier-Hein
Abstract: Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability to make fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints or specific diseases, or are subject to other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest-image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports $3D$ volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for $4D$ medical image prediction.
Authors: Dani\"el Boeke, Cedrik Blommestijn, Rebecca N. Wray, Kalina Chupetlovska, Shangqi Gao, Zeyu Gao, Regina G. H. Beets-Tan, Mireia Crispin-Ortuzar, James O. Jones, Wilson Silva, Ines P. Machado
Abstract: Recurrence risk estimation in clear cell renal cell carcinoma (ccRCC) is essential for guiding postoperative surveillance and treatment. The Leibovich score remains widely used for stratifying distant recurrence risk but offers limited patient-level resolution and excludes imaging information. This study evaluates multimodal recurrence prediction by integrating preoperative computed tomography (CT) and postoperative histopathology whole-slide images (WSIs). A modular deep learning framework with pretrained encoders and Cox-based survival modeling was tested across unimodal, late fusion, and intermediate fusion setups. In a real-world ccRCC cohort, WSI-based models consistently outperformed CT-only models, underscoring the prognostic strength of pathology. Intermediate fusion further improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. Random tie-breaking narrowed the gap between the clinical baseline and learned models, suggesting discretization may overstate individualized performance. Using simple embedding concatenation, radiology added value primarily through fusion. These findings demonstrate the feasibility of foundation model-based multimodal integration for personalized ccRCC risk prediction. Future work should explore more expressive fusion strategies, larger multimodal datasets, and general-purpose CT encoders to better match pathology modeling capacity.
Authors: Haomiao Zhang, Zhangyuan Li, Yanling Piao, Zhi Li, Xiaodong Wang, Miao Cao, Xiongfei Su, Qiang Song, Xin Yuan
Abstract: Computer-generated holography (CGH) has gained wide attention with deep learning-based algorithms. However, due to its nonlinear and ill-posed nature, challenges remain in achieving accurate and stable reconstruction. Specifically, ($i$) the widely used end-to-end networks treat the reconstruction model as a black box, ignoring underlying physical relationships, which reduces interpretability and flexibility. ($ii$) CNN-based CGH algorithms have limited receptive fields, hindering their ability to capture long-range dependencies and global context. ($iii$) Angular spectrum method (ASM)-based models are constrained to finite near-fields. In this paper, we propose a Deep Unfolding Network (DUN) that decomposes gradient descent into two modules: an adaptive bandwidth-preserving model (ABPM) and a phase-domain complex-valued denoiser (PCD), providing more flexibility. ABPM allows for wider working distances compared to ASM-based methods. At the same time, PCD leverages its complex-valued deformable self-attention module to capture global features and enhance performance, achieving a PSNR over 35 dB. Experiments on simulated and real data show state-of-the-art results.
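As a rough illustration of the deep-unfolding idea, the sketch below alternates a gradient step on a generic data-fidelity term with a learned denoiser; the `forward_op`/`adjoint_op` placeholders and the simple CNN denoiser stand in for the paper's ABPM and complex-valued PCD modules, which are not reproduced here.

```python
# Generic deep-unfolding sketch (PyTorch), assuming gradient-descent steps on a
# data-fidelity term alternated with a learned denoiser.

import torch
import torch.nn as nn

class Denoiser(nn.Module):                     # stand-in for the PCD module
    def __init__(self, ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, ch, 3, padding=1))
    def forward(self, x):
        return x + self.net(x)                 # residual refinement

class UnfoldedNet(nn.Module):
    def __init__(self, forward_op, adjoint_op, stages=5):
        super().__init__()
        self.A, self.At = forward_op, adjoint_op  # stand-ins for the propagation model
        self.steps = nn.Parameter(torch.full((stages,), 0.5))  # learned step sizes
        self.denoisers = nn.ModuleList(Denoiser() for _ in range(stages))

    def forward(self, y, x0):
        x = x0
        for k, denoise in enumerate(self.denoisers):
            grad = self.At(self.A(x) - y)      # gradient of 0.5*||A(x) - y||^2
            x = x - self.steps[k] * grad       # data-fidelity descent step
            x = denoise(x)                     # learned prior / denoising step
        return x

# Toy usage with an identity "propagation" operator.
net = UnfoldedNet(forward_op=lambda x: x, adjoint_op=lambda r: r)
y = torch.rand(1, 1, 64, 64)
print(net(y, torch.zeros_like(y)).shape)
```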
Authors: Maximilian Rokuss, Yannick Kirchhoff, Fabian Isensee, Klaus H. Maier-Hein
Abstract: Whole-body PET/CT is a cornerstone of oncological imaging, yet accurate lesion segmentation remains challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. While fully automated methods have advanced substantially, clinical practice benefits from approaches that keep humans in the loop to efficiently refine predicted masks. The autoPET/CT IV challenge addresses this need by introducing interactive segmentation tasks based on simulated user prompts. In this work, we present our submission to Task 1. Building on the winning autoPET III nnU-Net pipeline, we extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. We systematically investigate representations for spatial prompts and demonstrate that Euclidean Distance Transform (EDT) encodings consistently outperform Gaussian kernels. Furthermore, we propose online simulation of user interactions and a custom point sampling strategy to improve robustness under realistic prompting conditions. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. These results highlight the potential of promptable models to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Code is publicly available at https://github.com/MIC-DKFZ/autoPET-interactive
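The click-encoding step can be illustrated with a short sketch, assuming clicks arrive as voxel coordinates; `clicks_to_edt` is a hypothetical helper, not the challenge pipeline itself.

```python
# Minimal sketch of encoding user clicks as Euclidean Distance Transform (EDT)
# channels that are stacked onto the image as additional model inputs.

import numpy as np
from scipy.ndimage import distance_transform_edt

def clicks_to_edt(shape, clicks):
    """Distance (in voxels) from every voxel to the nearest click."""
    seeds = np.ones(shape, dtype=bool)
    for z, y, x in clicks:
        seeds[z, y, x] = False          # EDT measures distance to the zeros
    return distance_transform_edt(seeds)

volume = np.random.rand(32, 64, 64)                 # stand-in PET/CT volume
fg = clicks_to_edt(volume.shape, [(16, 30, 30)])    # foreground-click channel
bg = clicks_to_edt(volume.shape, [(5, 10, 50)])     # background-click channel
model_input = np.stack([volume, fg, bg])            # image + prompt channels
print(model_input.shape)                            # (3, 32, 64, 64)
```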
Authors: Fatih Erdo\u{g}an, Merve Rabia Bar{\i}n, Fatma G\"uney
Abstract: Constructing high-definition (HD) maps from sensory input requires accurately mapping the road elements in image space to the Bird's Eye View (BEV) space. The precision of this mapping directly impacts the quality of the final vectorized HD map. Existing HD mapping approaches outsource the projection to standard mapping techniques, such as attention-based ones. However, these methods struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. Our key idea is to start with a geometric mapping based on camera parameters and adapt it to the scene to extract relevant map information from camera images. To implement this, we propose a novel probabilistic projection mechanism with confidence scores to (i) refine the mapping to better align with the scene and (ii) filter out irrelevant elements that should not influence HD map generation. In addition, we improve temporal processing by using confidence scores to selectively accumulate reliable information over time. Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, indicating better generalization. The improvements are particularly pronounced on nuScenes and in the challenging long perception range. Our code and model checkpoints are available at https://github.com/Fatih-Erdogan/mapping-like-skeptic .
Authors: Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora
Abstract: Conventional optical character recognition (OCR) techniques segmented each character and then recognized it, making them prone to errors in character segmentation and leaving them without context to exploit language models. Advances in sequence-to-sequence translation over the last decade led to modern techniques that first detect words and then input one word at a time to a model that directly outputs full words as sequences of characters. This allowed better utilization of language models and bypassed the error-prone character segmentation step. We observe that this transition in style has moved the accuracy bottleneck to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal bypasses errors in word detection and provides larger sentence context for better utilization of language models. We show that the proposed technique improves not only the accuracy but also the efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such a shift from word- to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a fourfold improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
Authors: Alvaro Patricio, Atabak Dehban, Rodrigo Ventura
Abstract: Recent advances in diffusion-based generative models have demonstrated significant potential in augmenting scarce datasets for object detection tasks. Nevertheless, most recent models rely on resource-intensive full fine-tuning of large-scale diffusion models, requiring enterprise-grade GPUs (e.g., NVIDIA V100) and thousands of synthetic images. To address these limitations, we propose Flux LoRA Augmentation (FLORA), a lightweight synthetic data generation pipeline. Our approach uses the Flux 1.1 Dev diffusion model, fine-tuned exclusively through Low-Rank Adaptation (LoRA). This dramatically reduces computational requirements, enabling synthetic dataset generation with a consumer-grade GPU (e.g., NVIDIA RTX 4090). We empirically evaluate our approach on seven diverse object detection datasets. Our results demonstrate that training object detectors with just 500 synthetic images generated by our approach yields superior detection performance compared to models trained on 5000 synthetic images from the ODGEN baseline, achieving improvements of up to 21.3% in mAP@.50:.95. This work demonstrates that it is possible to surpass state-of-the-art performance with far greater efficiency, as FLORA achieves superior results using only 10% of the data and a fraction of the computational cost, showing that a quality- and efficiency-focused approach is more effective than brute-force generation and making advanced synthetic data creation more practical and accessible for real-world scenarios.
Authors: Amirhossein Nazeri, Wael Hafez
Abstract: Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations: imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
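Since the abstract does not specify the exact entropy estimator, the sketch below illustrates one plausible reading: a forward hook captures an early VGG-16 convolutional layer, its activation magnitudes are normalized into a distribution, and the resulting Shannon entropy is compared against a threshold calibrated on clean data.

```python
# Hedged sketch of activation-entropy monitoring on VGG-16 (illustrative, not
# the authors' implementation).

import torch
import torchvision.models as models

vgg = models.vgg16(weights=None).eval()   # use pretrained weights in practice
captured = {}

def hook(_module, _inp, out):
    captured["act"] = out.detach()

vgg.features[2].register_forward_hook(hook)   # an early convolutional layer

def activation_entropy(x):
    with torch.no_grad():
        vgg(x)
    a = captured["act"].abs().flatten(1)       # (batch, units)
    p = a / a.sum(dim=1, keepdim=True).clamp_min(1e-12)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=1)   # Shannon entropy per sample

clean_entropy = activation_entropy(torch.rand(8, 3, 224, 224))
threshold = clean_entropy.mean() * 0.93        # flag ~7% shifts (direction assumed)
suspect = activation_entropy(torch.rand(1, 3, 224, 224)) < threshold
print(suspect)
```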
Authors: Jo\~ao Valente, Atabak Dehban, Rodrigo Ventura
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, conditions common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRAs of these models with CAD2DMD-SET's generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
Authors: Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro
Abstract: Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available on the https://xavijuanola.github.io/SSL-SaN/.
Authors: Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma
Abstract: GUI agents aim to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains challenging due to the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limited initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systematic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. As a result, UItron achieves superior performance on benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.
Authors: Ha Min Son, Zhe Zhao, Shahbaz Rezaei, Xin Liu
Abstract: Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP `forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.
Authors: Qiyue Sun, Qiming Huang, Yang Yang, Hongjun Wang, Jianbo Jiao
Abstract: Humans usually show exceptional generalisation and discovery ability in the open world when shown uncommon new concepts. Whereas most existing studies in the literature focus on common, typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: "What if atypical, unusual videos are exposed in the learning process?" To this end, we collect a new video dataset consisting of various types of unusual, atypical data (e.g. sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning. Focusing on three key tasks in open-world learning, out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes. These observations from our extensive experimental evaluations, together with the newly proposed dataset, reveal the benefits of atypical videos for visual representation learning in the open world and encourage further studies in this direction.
Authors: Nattapong Kurpukdee, Adrian G. Bors
Abstract: We propose a realistic scenario for unsupervised video learning where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos represent complex and rich spatio-temporal information, widely used in many applications, yet they have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, while labeled data is costly and not practical to obtain. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage transfer learning from previous tasks as an initial state for knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-Something V2, without using any labels or class boundaries.
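A minimal sketch of the KDE-based novelty criterion, under the assumption that incoming task embeddings are compared by log-density against a memory of earlier-task embeddings; the quantile threshold and `is_novel` interface are illustrative only.

```python
# KDE novelty sketch: expand memory when incoming embeddings fall in
# low-density regions of the existing representation.

import numpy as np
from sklearn.neighbors import KernelDensity

def is_novel(new_embeddings, memory_embeddings, bandwidth=0.5, quantile=0.05):
    """Flag a new batch as novel if its median log-density is below the
    low-density quantile of the existing memory (hypothetical criterion)."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(memory_embeddings)
    mem_scores = kde.score_samples(memory_embeddings)
    new_scores = kde.score_samples(new_embeddings)
    return np.median(new_scores) < np.quantile(mem_scores, quantile)

memory = np.random.randn(500, 128)             # embeddings of earlier tasks
same_task = np.random.randn(50, 128)
shifted = np.random.randn(50, 128) + 3.0       # a distribution shift
print(is_novel(same_task, memory), is_novel(shifted, memory))
```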
Authors: Omer Faruk Durugol, Maximilian Rokuss, Yannick Kirchhoff, Klaus H. Maier-Hein
Abstract: Automated segmentation of Pancreatic Ductal Adenocarcinoma (PDAC) from MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) segmentation. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, we systematically evaluated data augmentation schemes and training schedules. Our analysis revealed a critical trade-off, where aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (achieving a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). For our final submission, we exploited this finding by constructing custom, heterogeneous ensembles of specialist models, essentially creating a mix of experts. This metric-aware ensembling strategy proved highly effective, achieving a top cross-validation Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2. Our work presents a robust methodology for developing specialized, high-performance models in the context of limited data and complex medical imaging tasks (Team MIC-DKFZ).
Authors: Ugur Dinc, Jibak Sarkar, Philipp Schubert, Sabine Semrau, Thomas Weissmann, Andre Karius, Johann Brand, Bernd-Niklas Axer, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Sogand Beirami, Annette Schwarz, Udo Gaipl, Benjamin Frey, Christoph Bert, Stefanie Corradini, Rainer Fietkau, Florian Putz
Abstract: Introduction: Large language models (LLM) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' kappa. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' kappa 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
Authors: Jiawei Liu, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan
Abstract: Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
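The retrieve-and-compare step can be sketched as follows, assuming precomputed feature banks and a simple weighted fusion of per-level cosine distances; the encoders and fusion weights here are placeholders rather than the released TMUAD components.

```python
# Sketch: anomaly scores against class-level text, object-level and patch-level
# memory banks, fused into a single score.

import numpy as np

def cosine_distance(q, bank):
    q = q / np.linalg.norm(q)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return 1.0 - (bank @ q).max()        # distance to the closest normal entry

def anomaly_score(text_feat, object_feat, patch_feat, banks, w=(1/3, 1/3, 1/3)):
    s_logic  = cosine_distance(text_feat,   banks["text"])    # logical anomalies
    s_object = cosine_distance(object_feat, banks["object"])
    s_struct = cosine_distance(patch_feat,  banks["patch"])   # structural anomalies
    return w[0] * s_logic + w[1] * s_object + w[2] * s_struct

banks = {k: np.random.randn(200, 256) for k in ("text", "object", "patch")}
print(anomaly_score(*(np.random.randn(256) for _ in range(3)), banks))
```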
Authors: Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid
Abstract: Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such our model addresses simultaneously the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
Authors: Yiming Lin, Yuchen Niu, Shang Wang, Kaizhu Huang, Qiufeng Wang, Xiao-Bo Jin
Abstract: Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem, a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR that is carefully constructed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% mAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
Authors: Maolin Wei, Wanzhou Liu, Eshed Ohn-Bar
Abstract: If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.
Authors: Allen Wang, Gavin Tao
Abstract: We address vision-guided quadruped motion control with reinforcement learning (RL) and highlight the necessity of combining proprioception with vision for robust control. We propose QuadKAN, a spline-parameterized cross-modal policy instantiated with Kolmogorov-Arnold Networks (KANs). The framework incorporates a spline encoder for proprioception and a spline fusion head for proprioception-vision inputs. This structured function class aligns the state-to-action mapping with the piecewise-smooth nature of gait, improving sample efficiency, reducing action jitter and energy consumption, and providing interpretable posture-action sensitivities. We adopt Multi-Modal Delay Randomization (MMDR) and perform end-to-end training with Proximal Policy Optimization (PPO). Evaluations across diverse terrains, including both even and uneven surfaces and scenarios with static or dynamic obstacles, demonstrate that QuadKAN achieves consistently higher returns, greater distances, and fewer collisions than state-of-the-art (SOTA) baselines. These results show that spline-parameterized policies offer a simple, effective, and interpretable alternative for robust vision-guided locomotion. A repository will be made available upon acceptance.
Authors: Thomas Besnier, Sylvain Arguill\`ere, Mohamed Daoudi
Abstract: Unregistered surface meshes, especially raw 3D scans, present significant challenges for automatic computation of plausible deformations due to the lack of established point-wise correspondences and the presence of noise in the data. In this paper, we propose a new, rig-free, data-driven framework for motion prediction and transfer on such body meshes. Our method couples a robust motion embedding network with a learned per-vertex feature field to generate a spatio-temporal deformation field, which drives the mesh deformation. Extensive evaluations, including quantitative benchmarks and qualitative visuals on tasks such as walking and running, demonstrate the effectiveness and versatility of our approach on challenging unregistered meshes.
Authors: Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla
Abstract: The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered a lot of attention in recent times, with advances made in frontiers like coding, mathematics, and science. However, very limited experiments have been done to assess their performance on simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this issue, the paper introduces a dataset, Percept-V, containing a total of 7200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. This dataset is then used to evaluate state-of-the-art MLLMs like GPT-4o, Gemini, and Claude, as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1, to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models' performance with increasing problem complexity across all categories. An analysis of the performances also reveals that the tested MLLMs exhibit similar accuracy trends across categories testing a particular cognitive skill, and that some skills are more difficult than others.
Authors: Pablo Moraes, Monica Rodriguez, Kristofer S. Kappel, Hiago Sodre, Santiago Fernandez, Igor Nunes, Bruna Guterres, Ricardo Grando
Abstract: Autonomous driving applications have become increasingly relevant in the automotive industry due to their potential to enhance vehicle safety, efficiency, and user experience, thereby meeting the growing demand for sophisticated driving assistance features. However, the development of reliable and trustworthy autonomous systems poses challenges such as high complexity, prolonged training periods, and intrinsic levels of uncertainty. Mini Autonomous Cars (MACs) are used as a practical testbed, enabling validation of autonomous control methodologies on small-scale setups. This simplified and cost-effective environment facilitates rapid evaluation and comparison of machine learning models, which is particularly useful for algorithms requiring online training. To address these challenges, this work presents a methodology based on RGB-D information and three-dimensional convolutional neural networks (3D CNNs) for MAC autonomous driving in simulated environments. We evaluate the proposed approach against recurrent neural networks (RNNs), with architectures trained and tested on two simulated tracks with distinct environmental features. Performance was assessed using task completion success, lap-time metrics, and driving consistency. Results highlight how architectural modifications and track complexity influence the models' generalization capability and vehicle control performance. The proposed 3D CNN demonstrated promising results when compared with RNNs.
Authors: Jeong Uk Lee, Sung Hee Choi
Abstract: Reconstructing high-quality 3D meshes and visuals from 3D Gaussian Splatting (3DGS) remains a central challenge in computer graphics. Although existing models such as SuGaR offer effective solutions for rendering, there is still room to improve both visual fidelity and scene consistency. This work builds upon SuGaR by introducing two complementary regularization strategies that address common limitations in both the shape of individual Gaussians and the coherence of the overall surface. The first strategy introduces an effective rank regularization, motivated by recent studies on Gaussian primitive structures. This regularization discourages extreme anisotropy (specifically, "needle-like" shapes) by favoring more balanced, "disk-like" forms that are better suited for stable surface reconstruction. The second strategy integrates a neural Signed Distance Function (SDF) into the optimization process. The SDF is regularized with an Eikonal loss to maintain proper distance properties and provides a continuous global surface prior, guiding Gaussians toward better alignment with the underlying geometry. These two regularizations aim to improve both the fidelity of individual Gaussian primitives and their collective surface behavior. The final model produces more accurate and coherent visuals from 3DGS data.
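A hedged sketch of the two regularizers follows: an effective-rank penalty on per-Gaussian scale vectors (entropy of the normalized scales, exponentiated) that discourages needle-like primitives, and a standard Eikonal term that keeps the SDF gradient norm near one. The exact formulations in the paper may differ.

```python
# Illustrative regularizers, not the paper's exact losses.

import torch

def effective_rank_loss(scales, target=2.0):
    """scales: (N, 3) positive Gaussian scales; effective rank = exp(entropy)."""
    p = scales / scales.sum(dim=1, keepdim=True)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
    erank = entropy.exp()                     # in [1, 3]; ~2 for disk-like shapes
    return ((target - erank).clamp_min(0) ** 2).mean()   # penalize needle-like

def eikonal_loss(sdf_net, points):
    points = points.requires_grad_(True)
    grad = torch.autograd.grad(sdf_net(points).sum(), points, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

# Toy usage with a linear stand-in "SDF" and random Gaussian scales.
scales = torch.rand(1000, 3) + 1e-3
sdf = torch.nn.Linear(3, 1)
print(effective_rank_loss(scales).item(),
      eikonal_loss(sdf, torch.rand(256, 3)).item())
```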
Authors: Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen
Abstract: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
Authors: Yi-Hao Peng, Dingzeyu Li, Jeffrey P. Bigham, Amy Pavel
Abstract: User interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low-vision (BLV) users. However, current UI agents typically perform tasks end-to-end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real-world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed-initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.
Authors: Nicolas Soncini, Javier Cremona, Erica Vidal, Maximiliano Garc\'ia, Gast\'on Castro, Taih\'u Pire
Abstract: We present a multi-modal dataset collected in a soybean crop field, comprising over two hours of recorded data from sensors such as stereo infrared camera, color camera, accelerometer, gyroscope, magnetometer, GNSS (Single Point Positioning, Real-Time Kinematic and Post-Processed Kinematic), and wheel odometry. This dataset captures key challenges inherent to robotics in agricultural environments, including variations in natural lighting, motion blur, rough terrain, and long, perceptually aliased sequences. By addressing these complexities, the dataset aims to support the development and benchmarking of advanced algorithms for localization, mapping, perception, and navigation in agricultural robotics. The platform and data collection system are designed to meet the key requirements for evaluating multi-modal SLAM systems, including hardware synchronization of sensors, 6-DOF ground truth, and loops on long trajectories. We run multi-modal state-of-the-art SLAM methods on the dataset, showcasing their existing limitations when applied to agricultural settings. The dataset and utilities to work with it are released at https://cifasis.github.io/rosariov2/.
Authors: Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
Authors: Bar{\i}\c{s} Z\"ong\"ur, Robin Hesse, Stefan Roth
Abstract: To ensure the reliability of deep models in real-world applications, out-of-distribution (OOD) detection methods aim to distinguish samples close to the training distribution (in-distribution, ID) from those farther away (OOD). In this work, we propose a novel OOD detection method that utilizes singular value decomposition of the weight matrix of the classification head to decompose the model's activations into decisive and insignificant components, which contribute maximally and minimally, respectively, to the final classifier output. We find that the subspace of insignificant components more effectively distinguishes ID from OOD data than raw activations in regimes of large distribution shifts (Far-OOD). This occurs because the classification objective leaves the insignificant subspace largely unaffected, yielding features that are "untainted" by the target classification task. Conversely, in regimes of smaller distribution shifts (Near-OOD), we find that activation shaping methods profit from only considering the decisive subspace, as the insignificant component can cause interference in the activation space. By combining the two findings into a single approach, termed ActSub, we achieve state-of-the-art results in various standard OOD benchmarks.
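The subspace construction can be illustrated with a short sketch, assuming access to the classification-head weight matrix and penultimate-layer features; the split index and the use of the residual-subspace norm as a Far-OOD score are illustrative assumptions, not the official ActSub recipe.

```python
# SVD of the classifier head W splits feature space into a decisive subspace
# (top right-singular vectors) and an insignificant one; the energy of a
# feature in the insignificant subspace can serve as an OOD-related score.

import torch

def split_subspaces(W, k):
    """W: (num_classes, feat_dim) classification head; keep top-k directions."""
    _, _, Vh = torch.linalg.svd(W, full_matrices=True)
    return Vh[:k], Vh[k:]                     # decisive, insignificant bases

def insignificant_energy(features, V_insig):
    proj = features @ V_insig.T               # coordinates in the residual subspace
    return proj.norm(dim=1)                   # score in the insignificant subspace

W = torch.randn(1000, 768)                    # e.g. an ImageNet classifier head
V_dec, V_ins = split_subspaces(W, k=512)
feats = torch.randn(16, 768)                  # penultimate-layer activations
print(insignificant_energy(feats, V_ins).shape)
```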
Authors: Weihuan Deng, Yaofu Huang, Luan Chen, Xun Li, Yao Yao
Abstract: With the deepening of poverty alleviation and rural revitalization strategies, improving the rural living environment and enhancing the quality of life have become key priorities. Rural livability is a key indicator for measuring the effectiveness of these efforts. Current measurement approaches face significant limitations, as questionnaire-based methods are difficult to scale, while urban-oriented visual perception methods are poorly suited for rural contexts. In this paper, a rural-specific livability assessment framework was proposed based on drone imagery and multimodal large language models (MLLMs). To comprehensively assess village livability, this study first used a top-down approach to collect large-scale drone imagery of 1,766 villages in 146 counties across China. In terms of the model framework, an efficient image comparison mechanism was developed, incorporating binary search interpolation to determine effective image pairs while reducing comparison iterations. Building on expert knowledge, a chain-of-thought prompting suitable for nationwide rural livability measurement was constructed, considering both living quality and ecological habitability dimensions. This approach enhanced the rationality and reliability of the livability assessment. Finally, this study characterized the spatial heterogeneity of rural livability across China and thoroughly analyzed its influential factors. The results show that: (1) The rural livability in China demonstrates a dual-core-periphery spatial pattern, radiating outward from Sichuan and Zhejiang provinces with declining gradients; (2) Among various influential factors, government fiscal expenditure emerged as the core determinant, with each unit increase corresponding to a 3.9 - 4.9 unit enhancement in livability. The findings provide valuable insights for rural construction policy-making.
Authors: Peng Yang, Zhengdong Huang, Zicheng Xie, Wentao Tian, Jingyu Liu, Lunhong Dong
Abstract: Heart rate prediction is vital for personalized health monitoring and fitness, yet it frequently faces a critical challenge when deployed in the real world: data heterogeneity. We classify it along two key dimensions: source heterogeneity from fragmented device markets with varying feature sets, and user heterogeneity reflecting distinct physiological patterns across individuals and activities. Existing methods either discard device-specific information or fail to model user-specific differences, limiting their real-world performance. To address this, we propose a framework that learns latent representations agnostic to both kinds of heterogeneity, enabling downstream predictors to work consistently under heterogeneous data patterns. Specifically, we introduce a random feature dropout strategy to handle source heterogeneity, making the model robust to varying feature sets. To manage user heterogeneity, we employ a time-aware attention module to capture long-term physiological traits and use a contrastive learning objective to build a discriminative representation space. To reflect the heterogeneous nature of real-world data, we created and publicly released a new benchmark dataset, ParroTao. Evaluations on both ParroTao and the public FitRec dataset show that our model significantly outperforms existing baselines by 17% and 15%, respectively. Furthermore, analysis of the learned representations demonstrates their strong discriminative power, and a downstream application task confirms the practical value of our model.
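The random feature dropout strategy for source heterogeneity can be sketched in a few lines, assuming inputs shaped (batch, time, features); the dropout probability and per-sample channel masking are illustrative choices, not the paper's exact configuration.

```python
# Random feature dropout sketch: zero out whole input feature channels during
# training so the encoder copes with devices reporting different feature sets.

import torch

def random_feature_dropout(x, drop_prob=0.3):
    """x: (batch, time, features); drop whole feature channels per sample."""
    keep = (torch.rand(x.size(0), 1, x.size(2), device=x.device) > drop_prob)
    return x * keep

batch = torch.randn(8, 120, 12)    # e.g. 12 wearable-sensor channels over time
masked = random_feature_dropout(batch)
print((masked.abs().sum(dim=1) == 0).float().mean())  # fraction of dropped channels
```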
Authors: Kaouther Mouheb, Mobina Ghojogh Nejad, Lavsen Dahal, Ehsan Samei, Kyle J. Lafata, W. Paul Segars, Joseph Y. Lo
Abstract: Accurate 3D modeling of human organs is critical for constructing digital phantoms in virtual imaging trials. However, organs such as the large intestine remain particularly challenging due to their complex geometry and shape variability. We propose CLAP, a novel Conditional LAtent Point-diffusion model that combines geometric deep learning with denoising diffusion models to enhance 3D representations of the large intestine. Given point clouds sampled from segmentation masks, we employ a hierarchical variational autoencoder to learn both global and local latent shape representations. Two conditional diffusion models operate within this latent space to refine the organ shape. A pretrained surface reconstruction model is then used to convert the refined point clouds into meshes. CLAP achieves substantial improvements in shape modeling accuracy, reducing Chamfer distance by 26% and Hausdorff distance by 36% relative to the initial suboptimal shapes. This approach offers a robust and extensible solution for high-fidelity organ modeling, with potential applicability to a wide range of anatomical structures.
Authors: Zhicheng Wang, Wensheng Liang, Ruiyan Zhuang, Shuai Li, Jianwei Tan, Xiaoguang Ma
Abstract: Action recognition (AR) in industrial environments -- particularly for identifying actions and operational gestures -- faces persistent challenges due to high deployment costs, poor cross-scenario generalization, and limited real-time performance. To address these issues, we propose a low-cost real-time framework for industrial action recognition using foundation models, denoted as LRIAR, to enhance recognition accuracy and transferability while minimizing human annotation and computational overhead. The proposed framework constructs an automatically labeled dataset by coupling Grounding DINO with the pretrained BLIP-2 image encoder, enabling efficient and scalable action labeling. Leveraging the constructed dataset, we train YOLOv5 for real-time action detection, and a Vision Transformer (ViT) classifier is developed via LoRA-based fine-tuning for action classification. Extensive experiments conducted in real-world industrial settings validate the effectiveness of LRIAR, demonstrating consistent improvements over state-of-the-art methods in recognition accuracy, scenario generalization, and deployment efficiency.
Authors: Farzaneh Jafari, Stefano Berretti, Anup Basu
Abstract: In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high-quality video. However, no single model has yet achieved equivalence across all quantitative and qualitative metrics. We introduce Jamba, a hybrid Transformer-Mamba model, to animate a 3D face. Mamba, a pioneering Structured State Space Model (SSM) architecture, was developed to overcome the limitations of conventional Transformer architectures, particularly in handling long sequences, a challenge that has constrained traditional models. Jamba combines the advantages of both the Transformer and Mamba approaches, offering a comprehensive solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and lip sync through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.
Authors: Nikolas Adaloglou, Tim Kaiser, Damir Iagudin, Markus Kollmann
Abstract: Guidance is a widely used technique for diffusion models to enhance sample quality. Technically, guidance is realised by using an auxiliary model that generalises more broadly than the primary model. Using a 2D toy example, we first show that it is highly beneficial when the auxiliary model exhibits similar but stronger generalisation errors than the primary model. Based on this insight, we introduce masked sliding window guidance (M-SWG), a novel, training-free method. M-SWG upweights long-range spatial dependencies by guiding the primary model with itself by selectively restricting its receptive field. M-SWG requires neither access to model weights from previous iterations, additional training, nor class conditioning. M-SWG achieves a superior Inception score (IS) compared to previous state-of-the-art training-free approaches, without introducing sample oversaturation. In conjunction with existing guidance methods, M-SWG reaches state-of-the-art Frechet DINOv2 distance on ImageNet using EDM2-XXL and DiT-XL. The code is available at https://github.com/HHU-MMBS/swg_bmvc2025_official.
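The guidance combination itself follows the usual pattern, sketched below with a hypothetical `restricted` call standing in for the model evaluated with a masked, restricted receptive field; constructing that restricted prediction is the paper's actual contribution and is not reproduced here.

```python
# Generic guidance update (a sketch, not the official implementation): the
# denoiser is guided by a weaker prediction from itself.

import torch

def guided_denoise(model, restricted, x_t, t, w=1.5):
    """Standard guidance combination: eps_guided = eps_weak + w * (eps - eps_weak)."""
    eps = model(x_t, t)            # full receptive field (primary prediction)
    eps_weak = restricted(x_t, t)  # self-prediction with a restricted window
    return eps_weak + w * (eps - eps_weak)

# Toy usage with stand-in callables.
model = lambda x, t: 0.9 * x
restricted = lambda x, t: 0.7 * x
x_t = torch.randn(1, 3, 32, 32)
print(guided_denoise(model, restricted, x_t, t=10).shape)
```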
Authors: Zeeshan Nisar, Thomas Lampert
Abstract: Histopathology, the microscopic examination of tissue samples, is essential for disease diagnosis and prognosis. Accurate segmentation and identification of key regions in histopathology images are crucial for developing automated solutions. However, state-of-the-art deep learning segmentation methods like UNet require extensive labels, which is both costly and time-consuming, particularly when dealing with multiple stainings. To mitigate this, various multi-stain segmentation methods such as UDA-GAN have been developed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging, and segmentation models fail when they are unavailable. This article shows that through self-supervised pre-training, including SimCLR, BYOL, and a novel approach, HR-CS-CO, the performance of these segmentation methods (UNet and UDAGAN) can be retained even with 95% fewer labels. Notably, with self-supervised pre-training and using only 5% labels, the performance drops are minimal: 5.9% for UNet and 6.2% for UDAGAN, compared to their respective fully supervised counterparts (without pre-training, using 100% labels). Furthermore, these findings are shown to generalise beyond their training distribution to public benchmark datasets.
Authors: Jos\'e Rodr\'iguez-Ortega (Nemotec, Madrid, Spain, Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain), Francisco P\'erez-Hern\'andez (Nemotec, Madrid, Spain), Siham Tabik (Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain)
Abstract: Identifying anatomical landmarks in 3D dental models is essential for orthodontic treatment, yet manual placement is labor-intensive and requires expert knowledge. While machine learning methods have been proposed for automatic landmark detection in 3D Intraoral Scans (IOS), none provide a fully end-to-end solution that avoids costly tooth segmentation. We present CHaRM (Conditioned Heatmap Regression Methodology), the first fully end-to-end deep learning approach for tooth landmark detection in 3D IOS. CHaRM integrates four components: a point cloud encoder, a decoder with a heatmap regression head, a teeth-presence classification head, and the novel CHaR module. The CHaR module leverages teeth-presence information to adapt to missing teeth, improving detection accuracy in complex dental cases. Unlike two-stage workflows that segment teeth before landmarking, CHaRM operates directly on IOS point clouds, reducing complexity, avoiding error propagation, and lowering computational cost. We evaluated CHaRM with five point cloud learning backbones on IOSLandmarks-1k, a new dataset of 1,214 annotated 3D dental models. Both the dataset and code will be publicly released to address the scarcity of open data in orthodontics and foster reproducible research. CHaRM with PointMLP, named CHaRNet, achieved the best accuracy and efficiency. Compared to state-of-the-art methods (TSMDL and ALIIOS), CHaRNet reduced the mean Euclidean distance error to 0.56 mm on standard dental models and 1.12 mm across all dentition types, while delivering up to 14.8x faster inference on GPU. This end-to-end approach streamlines orthodontic workflows, enhances the precision of 3D IOS analysis, and enables efficient computer-assisted treatment planning.
Authors: Katie Z Luo, Minh-Quan Dao, Zhenzhen Liu, Mark Campbell, Wei-Lun Chao, Kilian Q. Weinberger, Ezio Malis, Vincent Fremont, Bharath Hariharan, Mao Shan, Stewart Worrall, Julie Stephany Berrio Perez
Abstract: Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. The Mixed Signals dataset is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. Dataset website is available at https://mixedsignalsdataset.cs.cornell.edu/.
Authors: Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone
Abstract: In this paper, we introduce a novel spatial attention module that can be easily integrated into any convolutional network. This module guides the model to pay attention to the most discriminative part of an image, enabling it to attain better performance through end-to-end training. In conventional approaches, a spatial attention map is typically generated in a position-wise manner, which often results in irregular boundaries and can therefore hamper generalization to new samples. In our method, the attention region is constrained to be rectangular. This rectangle is parametrized by only 5 parameters, allowing for better stability and generalization to new samples. In our experiments, our method systematically outperforms the position-wise counterpart, providing a novel and useful spatial attention mechanism for convolutional models. Besides, our module also provides interpretability regarding the \textit{where to look} question, as it helps to identify the part of the input on which the model focuses to produce its prediction.
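To make the 5-parameter rectangle concrete, here is a minimal differentiable sketch; the specific parametrization (center, half-sizes, and an edge-sharpness term) is an assumption, since the abstract does not spell it out.

```python
import torch

def soft_rect_mask(params, H, W):
    """Differentiable rectangular attention mask from 5 learned parameters.

    Assumed parametrization (illustrative, not necessarily the paper's):
    center (cx, cy) and half-sizes (w, h) in [0, 1], plus an edge sharpness k.
    """
    cx, cy, w, h, k = params
    ys = torch.linspace(0, 1, H).view(H, 1)
    xs = torch.linspace(0, 1, W).view(1, W)
    # Product of sigmoids approximates the indicator of |x-cx| < w and |y-cy| < h.
    mask_x = torch.sigmoid(k * (w - (xs - cx).abs()))
    mask_y = torch.sigmoid(k * (h - (ys - cy).abs()))
    return mask_y * mask_x  # shape (H, W), values in (0, 1)

attn = soft_rect_mask(torch.tensor([0.5, 0.5, 0.2, 0.3, 50.0]), 32, 32)
```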
Authors: Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu
Abstract: Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
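A minimal sketch of dimension-wise post-training quantization, assuming uniform bins over a fixed range; the bin count and range below are illustrative placeholders rather than the paper's settings.

```python
import torch

def dimwise_quantize(z, num_bins=64, lo=-1.0, hi=1.0):
    """Post-training, dimension-wise quantization of continuous tokens.

    Each feature dimension of `z` (shape [N, D]) is discretized independently
    into `num_bins` uniform bins; the bin edges here are assumptions.
    """
    z_clamped = z.clamp(lo, hi)
    scale = (num_bins - 1) / (hi - lo)
    indices = torch.round((z_clamped - lo) * scale).long()   # discrete tokens, [N, D]
    dequant = indices.float() / scale + lo                   # continuous reconstruction
    return indices, dequant

tokens, approx = dimwise_quantize(torch.randn(4, 16) * 0.5)
```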
Authors: Lihua Liu, Jiehong Lin, Zhenxin Liu, Kui Jia
Abstract: RGB-based novel object pose estimation is critical for rapid deployment in robotic applications, yet zero-shot generalization remains a key challenge. In this paper, we introduce PicoPose, a novel framework designed to tackle this task using a three-stage pixel-to-pixel correspondence learning process. Firstly, PicoPose matches features from the RGB observation with those from rendered object templates, identifying the best-matched template and establishing coarse correspondences. Secondly, PicoPose smooths the correspondences by globally regressing a 2D affine transformation, including in-plane rotation, scale, and 2D translation, from the coarse correspondence map. Thirdly, PicoPose applies the affine transformation to the feature map of the best-matched template and learns correspondence offsets within local regions to achieve fine-grained correspondences. By progressively refining the correspondences, PicoPose significantly improves the accuracy of object poses computed via PnP/RANSAC. PicoPose achieves state-of-the-art performance on the seven core datasets of the BOP benchmark, demonstrating exceptional generalization to novel objects. Code and trained models are available at https://github.com/foollh/PicoPose.
Authors: Kerol Djoumessi, Samuel Ofosu Mensah, Philipp Berens
Abstract: In many medical imaging tasks, convolutional neural networks (CNNs) efficiently extract local features hierarchically. More recently, vision transformers (ViTs) have gained popularity, using self-attention mechanisms to capture global dependencies, but lacking the inherent spatial localization of convolutions. Therefore, hybrid models combining CNNs and ViTs have been developed to combine the strengths of both architectures. However, such hybrid models are difficult to interpret, which hinders their application in medical imaging. In this work, we introduce an interpretable-by-design hybrid fully convolutional CNN-Transformer architecture for retinal disease detection. Unlike widely used post-hoc saliency methods for ViTs, our approach generates faithful and localized evidence maps that directly reflect the model's decision process. We evaluated our method on two medical tasks focused on disease detection using color fundus images. Our model achieves state-of-the-art predictive performance compared to black-box and interpretable models and provides class-specific sparse evidence maps in a single forward pass. The code is available at: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.
URLs: https://github.com/kdjoumessi/Self-Explainable-CNN-Transformer.
Authors: Agnieszka Anna Tomaka, Leszek Luchowski, Micha{\l} Tarnawski, Dariusz Pojda
Abstract: Digital technology plays a crucial role in designing customized medical devices, such as occlusal splints, commonly used in the management of disorders of the stomatognathic system. This methodological proof-of-concept study presents a computer-aided approach for designing and evaluating occlusal positioning splints. The primary aim is to demonstrate the feasibility and geometric accuracy of the proposed method at the preclinical stage. In this approach, a three-dimensional splint is generated using a transformation matrix to represent the therapeutic mandibular position. An experienced operator defines this position using a virtual patient model reconstructed from intraoral scans, CBCT, 3D facial scans, and a digitized plaster model. We introduce a novel method for generating splints that reproduces occlusal conditions in the therapeutic position and resolves surface conflicts through virtual embossing. The process for obtaining transformation matrices using dental tools and intraoral devices commonly employed in dental and laboratory workflows is described, and the geometric accuracy of both designed and printed splints is evaluated using profile and surface deviation analysis. The method supports reproducible, patient-specific splint fabrication and provides a transparent foundation for future validation studies, supporting multimodal image registration and quantification of occlusal discrepancies in research settings.
Authors: Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
Abstract: We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
Authors: Samuel Webster, Adam Czajka
Abstract: Saliency-guided training, which directs model learning to important regions of images, has demonstrated generalization improvements across various biometric presentation attack detection (PAD) tasks. This paper presents its first application to fingerprint PAD. We conducted a 50-participant study to create a dataset of 800 human-annotated fingerprint perceptually-important maps, explored alongside algorithmically-generated "pseudosaliency," including minutiae-based, image quality-based, and autoencoder-based saliency maps. Evaluating on the 2021 Fingerprint Liveness Detection Competition testing set, we explore various configurations within five distinct training scenarios to assess the impact of saliency-guided training on accuracy and generalization. Our findings demonstrate the effectiveness of saliency-guided training for fingerprint PAD in both limited and large data contexts, and we present a configuration capable of earning the first place on the LivDet-2021 benchmark. Our results highlight saliency-guided training's promise for increased model generalization capabilities, its effectiveness when data is limited, and its potential to scale to larger datasets in fingerprint PAD. All collected saliency data and trained models are released with the paper to support reproducible research.
Authors: Shanshan Song, Hui Tang, Honglong Yang, Xiaomeng Li
Abstract: Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a pre-trained visual encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the differences across exams and thus underrepresent the expected progression, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the dynamic difference-aware module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrate superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.
Authors: Radha Kodali, Venkata Rao Dhulipalla, Venkata Siva Kishor Tatavarty, Madhavi Nadakuditi, Bharadwaj Thiruveedhula, Suryanarayana Gunnam, Durga Prasad Bavirisetti, Gogulamudi Pradeep Reddy
Abstract: Infertility has a considerable impact on individuals' quality of life, affecting them socially and psychologically, with projections indicating a rise in the upcoming years. In vitro fertilization (IVF) emerges as one of the primary techniques within economically developed nations, employed to address the rising problem of low fertility. Expert embryologists conventionally grade embryos by reviewing blastocyst images to select the optimal embryo for transfer, yet this process is time-consuming and inefficient. Blastocyst images provide a valuable resource for assessing embryo viability. In this study, we introduce an explainable artificial intelligence (XAI) framework for classifying embryos, employing a fusion of convolutional neural network (CNN) and long short-term memory (LSTM) architecture, referred to as CNN-LSTM. Utilizing deep learning, our model achieves high accuracy in embryo classification while maintaining interpretability through XAI.
Authors: Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify the misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we provide deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audio). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We involve the cross-modal feature interpolation to map multiple modalities into a shared space and the interpolation distillation to synchronize multi-modal learning; (2) We design a diffusion model that adds noise to retain core multi-modal features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.
Authors: Jia-Xuan Jiang, Jiashuai Liu, Hongtao Wu, Yifeng Wu, Zhong Wang, Qi Bi, Yefeng Zheng
Abstract: Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at https://github.com/HopkinsKwong/MCCSDG
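A minimal reading of the Bernoulli-based sparsification idea in SDIR is sketched below; the drop probability, the inverted-dropout rescaling, and the omission of the Dirac-inspired stabilization are assumptions made for illustration.

```python
import torch

def bernoulli_sparsify(strong_feat, p_drop=0.3, training=True):
    """Randomly zero elements of the dominant modality's features.

    A minimal interpretation of Bernoulli-based sparsification; the actual
    SDIR module (drop rate, rescaling, stabilization) may differ.
    """
    if not training:
        return strong_feat
    mask = torch.bernoulli(torch.full_like(strong_feat, 1.0 - p_drop))
    return strong_feat * mask / (1.0 - p_drop)  # inverted-dropout style rescaling
```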
Authors: Haoyuan Liu, Hiroshi Watanabe
Abstract: Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU's non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
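The interpolation idea lends itself to a short sketch: boxes are linearly interpolated between prediction and target, and the loss averages (1 - IoU) against the target over a few coefficients. The fixed coefficient set below is an assumption; the paper's dynamic variant adapts the coefficients based on IoU.

```python
import torch

def box_iou(a, b, eps=1e-7):
    """IoU for boxes in (x1, y1, x2, y2) format; shapes [N, 4]."""
    lt = torch.maximum(a[:, :2], b[:, :2])
    rb = torch.minimum(a[:, 2:], b[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)

def interp_iou_loss(pred, target, ts=(0.25, 0.5, 0.75)):
    """Illustrative InterpIoU-style loss: average (1 - IoU) between the target
    and boxes linearly interpolated between prediction and target."""
    losses = []
    for t in ts:
        interp = pred + t * (target - pred)   # gradients flow through `pred`
        losses.append(1.0 - box_iou(interp, target))
    return torch.stack(losses).mean()
```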
Authors: Shree Mitra, Ritabrata Chakraborty, Nilkanta Sahu
Abstract: Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
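A hedged sketch of a progressive spatial masking curriculum, masking an increasing fraction of image patches as training proceeds; the linear schedule, patch size, and maximum ratio are illustrative assumptions rather than the paper's choices.

```python
import torch

def progressive_mask(images, epoch, max_epochs, patch=16, max_ratio=0.5):
    """Mask an increasing fraction of image patches as training progresses.

    `images`: tensor of shape [B, C, H, W] with H and W divisible by `patch`.
    """
    ratio = max_ratio * min(1.0, epoch / max_epochs)   # linear ramp-up of masking
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(B, 1, gh, gw, device=images.device) > ratio).float()
    mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return images * mask
```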
Authors: Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, Houqiang Li
Abstract: Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
Authors: Hongliang Wei, Xianqi Zhang, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Abstract: Despite significant progress, existing research on Multimodal Large Language Models (MLLMs) mainly focuses on general visual understanding, overlooking the ability to integrate textual context associated with objects for a more context-aware multimodal understanding -- an ability we refer to as Region-level Context-aware Multimodal Understanding (RCMU). To address this limitation, we first formulate the RCMU task, which requires models to respond to user instructions by integrating both image content and textual information of regions or objects. To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT), which incorporates object information into the model input and enables the model to utilize bounding box coordinates to effectively associate objects' visual content with their textual information. To address the lack of datasets, we introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks. We also propose RC\&P-Bench, a comprehensive benchmark that can evaluate the performance of MLLMs in RCMU and multimodal personalized understanding tasks. Additionally, we propose a reference-free evaluation metric to perform a comprehensive and fine-grained evaluation of the region-level context-aware image descriptions. By performing RCVIT on Qwen2-VL models with the RCMU dataset, we developed RC-Qwen2-VL models. Experimental results indicate that RC-Qwen2-VL models not only achieve outstanding performance on multiple RCMU tasks but also demonstrate successful applications in multimodal RAG and personalized conversation. Our data, model and benchmark are available at https://github.com/hongliang-wei/RC-MLLM
Authors: Syed Nazmus Sakib, Nafiul Haque, Mohammad Zabed Hossain, Shifat E. Arman
Abstract: PlantVillageVQA is a large-scale visual question answering (VQA) dataset derived from the widely used PlantVillage image corpus. It was designed to advance the development and evaluation of vision-language models for agricultural decision-making and analysis. The PlantVillageVQA dataset comprises 193,609 high-quality question-answer (QA) pairs grounded over 55,448 images spanning 14 crop species and 38 disease conditions. Questions are organised into 3 levels of cognitive complexity and 9 distinct categories. Each question category was phrased manually following expert guidance and generated via an automated two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevancy. The final dataset was evaluated using three state-of-the-art models for quality assessment. Our objective remains to provide a publicly available, standardised and expert-verified database to enhance diagnostic accuracy for plant disease identifications and advance scientific research in the agricultural domain. Our dataset will be open-sourced at https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.
URLs: https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.
Authors: Mirza Mumtaz Zahoor (Faculty of Computer Sciences, Ibadat International University, Islamabad, Pakistan), Saddam Hussain Khan (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences)
Abstract: Brain tumors remain among the most lethal human diseases, where early detection and accurate classification are critical for effective diagnosis and treatment planning. Although deep learning-based computer-aided diagnostic (CADx) systems have shown remarkable progress, conventional convolutional neural networks (CNNs) and Transformers still face persistent challenges, including high computational cost, sensitivity to minor contrast variations, structural heterogeneity, and texture inconsistencies in MRI data. Therefore, a novel hybrid framework, CE-RS-SBCIT, is introduced, integrating residual and spatial learning-based CNNs with transformer-driven modules. The proposed framework exploits local fine-grained and global contextual cues through four core innovations: (i) a smoothing and boundary-based CNN-integrated Transformer (SBCIT), (ii) tailored residual and spatial learning CNNs, (iii) a channel enhancement (CE) strategy, and (iv) a novel spatial attention mechanism. The developed SBCIT employs stem convolution and contextual interaction transformer blocks with systematic smoothing and boundary operations, enabling efficient global feature modeling. Moreover, residual and spatial CNNs, enhanced by auxiliary transfer-learned feature maps, enrich the representation space, while the CE module amplifies discriminative channels and mitigates redundancy. Furthermore, the spatial attention mechanism selectively emphasizes subtle contrast and textural variations across tumor classes. Extensive evaluation on challenging MRI datasets from Kaggle and Figshare, encompassing glioma, meningioma, pituitary tumors, and healthy controls, demonstrates superior performance, achieving 98.30% accuracy, 98.08% sensitivity, 98.25% F1-score, and 98.43% precision.
Authors: Kaijie Xu, Clark Verbrugge
Abstract: In complex 3D game environments, players rely on visual affordances to spot map transition points. Efficient identification of such points is important to client-side auto-mapping, and provides an objective basis for evaluating map cue presentation. In this work, we formalize the task of detecting traversable Spatial Transition Points (STPs), connectors between two sub-regions, and selecting the singular Main STP (MSTP), the unique STP that lies on the designer-intended critical path toward the player's current macro-objective, from a single game frame, proposing this as a new research focus. We introduce a two-stage deep-learning pipeline that first detects potential STPs using Faster R-CNN and then ranks them with a lightweight MSTP selector that fuses local and global visual features. Both stages benefit from parameter-efficient adapters, and we further introduce an optional retrieval-augmented fusion step. Our primary goal is to establish the feasibility of this problem and set baseline performance metrics. We validate our approach on a custom-built, diverse dataset collected from five Action RPG titles. Our experiments reveal a key trade-off: while full-network fine-tuning produces superior STP detection with sufficient data, adapter-only transfer is significantly more robust and effective in low-data scenarios and for the MSTP selection task. By defining this novel problem, providing a baseline pipeline and dataset, and offering initial insights into efficient model adaptation, we aim to contribute to future AI-driven navigation aids and data-informed level-design tools.
Authors: Hou Xia, Zheren Fu, Fangcan Ling, Jiajun Li, Yi Tu, Zhendong Mao, Yongdong Zhang
Abstract: Large video language models (LVLMs) have made notable progress in video understanding, spurring the development of corresponding evaluation benchmarks. However, existing benchmarks generally assess overall performance across entire video sequences, overlooking nuanced behaviors such as contextual positional bias, a critical yet under-explored aspect of LVLM performance. We present Video-LevelGauge, a dedicated benchmark designed to systematically assess positional bias in LVLMs. We employ standardized probes and customized contextual setups, allowing flexible control over context length, probe position, and contextual types to simulate diverse real-world scenarios. In addition, we introduce a comprehensive analysis method that combines statistical measures with morphological pattern recognition to characterize bias. Our benchmark comprises 438 manually curated videos spanning multiple types, yielding 1,177 high-quality multiple-choice questions and 120 open-ended questions, validated for their effectiveness in exposing positional bias. Based on these, we evaluate 27 state-of-the-art LVLMs, including both commercial and open-source models. Our findings reveal significant positional biases in many leading open-source models, typically exhibiting head or neighbor-content preferences. In contrast, commercial models such as Gemini2.5-Pro show impressive, consistent performance across entire video sequences. Further analyses on context length, context variation, and model scale provide actionable insights for mitigating bias and guiding model enhancement. https://github.com/Cola-any/Video-LevelGauge
Authors: Ahmed Emam, Mohamed Elbassiouny, Julius Miller, Patrick Donworth, Sabine Seidel, Ribana Roscher
Abstract: Pollinator insects such as honeybees and bumblebees are vital to global food production and ecosystem stability, yet their populations are declining due to anthropogenic and environmental stressors. Scalable, automated monitoring in agricultural environments remains an open challenge due to the difficulty of detecting small, fast-moving, and often camouflaged insects. To address this, we present BuzzSet v1.0, a large-scale dataset of high-resolution pollinator images collected under real field conditions. BuzzSet contains 7,856 manually verified images with more than 8,000 annotated instances across three classes: honeybees, bumblebees, and unidentified insects. Initial annotations were produced using a YOLOv12 model trained on external data and refined through human verification with open-source tools. All images were preprocessed into 256 x 256 tiles to improve the detection of small insects. We provide baselines using the RF-DETR transformer-based object detector. The model achieves strong classification accuracy with F1 scores of 0.94 and 0.92 for honeybees and bumblebees, with minimal confusion between these categories. The unidentified class remains more difficult due to label ambiguity and fewer samples, yet still contributes insights for robustness evaluation. Overall detection performance (mAP at 0.50 of 0.559) illustrates the challenging nature of the dataset and its potential to drive advances in small object detection under realistic ecological conditions. Future work focuses on expanding the dataset to version 2.0 with additional annotations and evaluating further detection strategies. BuzzSet establishes a benchmark for ecological computer vision, with the primary challenge being reliable detection of insects frequently camouflaged within natural vegetation, highlighting an open problem for future research.
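A minimal sketch of the 256 x 256 tiling preprocessing mentioned above, assuming non-overlapping tiles with zero padding at the borders; the BuzzSet pipeline's exact stride and padding strategy may differ.

```python
import numpy as np

def tile_image(img, tile=256):
    """Split an H x W x C array into non-overlapping tiles, zero-padding the
    borders so the image dimensions become multiples of `tile`."""
    H, W = img.shape[:2]
    ph, pw = (-H) % tile, (-W) % tile
    padded = np.pad(img, ((0, ph), (0, pw), (0, 0)))
    tiles = []
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            tiles.append(padded[y:y + tile, x:x + tile])
    return tiles
```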
Authors: Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Xuequan Lu, Lizhuang Ma, Shuicheng Yan
Abstract: Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV's fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV's linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.
Authors: Jianguo Huang, Jianqing Song, Xuanning Zhou, Bingyi Jing, Hongxin Wei
Abstract: Conformal prediction (CP) is a powerful statistical framework that generates prediction intervals or sets with guaranteed coverage probability. While CP algorithms have evolved beyond traditional classifiers and regressors to sophisticated deep learning models like deep neural networks (DNNs), graph neural networks (GNNs), and large language models (LLMs), existing CP libraries often lack the model support and scalability for large-scale DL scenarios. This paper introduces TorchCP, a PyTorch-native library designed to integrate state-of-the-art CP algorithms into deep learning techniques, including DNN-based classifier/regressor, GNN, and LLM. Released under the LGPL-3.0 license, TorchCP comprises about 16k lines of code, validated with 100% unit test coverage and detailed documentation. Notably, TorchCP enables CP-specific training algorithms, online prediction, and GPU-accelerated batch processing, achieving up to 90% reduction in inference time on large datasets. With its low-coupling design, comprehensive suite of advanced methods, and full GPU scalability, TorchCP empowers researchers and practitioners to enhance uncertainty quantification across cutting-edge applications.
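For readers unfamiliar with conformal prediction, the sketch below implements plain split-conformal classification from scratch (it deliberately does not use TorchCP's API): a score threshold is calibrated on held-out data, and every label whose score falls under it is kept in the prediction set.

```python
import torch

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification (illustrative, not TorchCP).

    Nonconformity score: 1 - softmax probability of the true class. The
    returned sets contain the true label with probability >= 1 - alpha
    (marginally), under the usual exchangeability assumption.
    """
    n = cal_labels.numel()
    scores = 1.0 - cal_probs[torch.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = min(1.0, (n + 1) * (1 - alpha) / n)
    qhat = torch.quantile(scores, q_level)
    return (1.0 - test_probs) <= qhat   # boolean [m, K] mask of labels kept per test point
```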
Authors: Nebiyou Yismaw, Ulugbek S. Kamilov, M. Salman Asif
Abstract: Diffusion models can generate a variety of high-quality images by modeling complex data distributions. Trained diffusion models can also be very effective image priors for solving inverse problems. Most of the existing diffusion-based methods integrate data consistency steps by approximating the likelihood function within the diffusion reverse sampling process. In this paper, we show that the existing approximations are either insufficient or computationally inefficient. To address these issues, we propose a unified likelihood approximation method that incorporates a covariance correction term to enhance the performance and avoids propagating gradients through the diffusion model. The correction term, when integrated into the reverse diffusion sampling process, achieves better convergence towards the true data posterior for selected distributions and improves performance on real-world natural image datasets. Furthermore, we present an efficient way to factorize and invert the covariance matrix of the likelihood function for several inverse problems. Our comprehensive experiments demonstrate the effectiveness of our method over several existing approaches. Code available at https://github.com/CSIPlab/CoDPS.
Authors: Qiang Zou, Lizhen Zhu
Abstract: The recent rise of generative artificial intelligence (AI), powered by Transformer networks, has achieved remarkable success in natural language processing, computer vision, and graphics. However, the application of Transformers in computer-aided design (CAD), particularly for processing boundary representation (B-rep) models, remains largely unexplored. To bridge this gap, we propose a novel approach for adapting Transformers to B-rep learning, called the Boundary Representation Transformer (BRT). B-rep models pose unique challenges due to their irregular topology and continuous geometric definitions, which are fundamentally different from the structured and discrete data Transformers are designed for. To address this, BRT proposes a continuous geometric embedding method that encodes B-rep surfaces (trimmed and untrimmed) into Bezier triangles, preserving their shape and continuity without discretization. Additionally, BRT employs a topology-aware embedding method that organizes these geometric embeddings into a sequence of discrete tokens suitable for Transformers, capturing both geometric and topological characteristics within B-rep models. This enables the Transformer's attention mechanism to effectively learn shape patterns and contextual semantics of boundary elements in a B-rep model. Extensive experiments demonstrate that BRT achieves state-of-the-art performance in part classification and feature recognition tasks.
Authors: Arthur Allshire, Hongsuk Choi, Junyi Zhang, David McAllister, Anthony Zhang, Chung Min Kim, Trevor Darrell, Pieter Abbeel, Jitendra Malik, Angjoo Kanazawa
Abstract: How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them: casually capture a human motion video and feed it to humanoids. We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills, all from a single policy conditioned on the environment and global root commands. VIDEOMIMIC offers a scalable path towards teaching humanoids to operate in diverse real-world environments.
Authors: Abdulaziz Almuzairee, Rohan Patil, Dwait Bhatt, Henrik I. Christensen
Abstract: Vision is well-known for its use in manipulation, especially using visual servoing. Due to the 3D nature of the world, using multiple camera views and merging them creates better representations for Q-learning and in turn, trains more sample efficient policies. Nevertheless, these multi-view policies are sensitive to failing cameras and can be burdensome to deploy. To mitigate these issues, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while simultaneously disentangling views by augmenting multi-view feature inputs with single-view features. This produces robust policies and allows lightweight deployment. We demonstrate the efficiency and robustness of our approach using Meta-World and ManiSkill3. For project website and code, see https://aalmuzairee.github.io/mad
Authors: Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong
Abstract: Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. In this work, we propose WebInject, a prompt injection attack that manipulates the webpage environment to induce a web agent to perform an attacker-specified action. Our attack adds a perturbation to the raw pixel values of the rendered webpage. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the attacker-specified action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple datasets shows that WebInject is highly effective and significantly outperforms baselines.
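The optimization loop can be sketched as standard projected gradient descent through a differentiable surrogate of the pixels-to-screenshot mapping; the surrogate network, the agent loss, and the L-infinity budget below are placeholders rather than the paper's exact setup.

```python
import torch

def pgd_webpage_perturbation(raw_pixels, surrogate, agent_loss, steps=50,
                             eps=8 / 255, alpha=1 / 255):
    """Projected gradient descent on a webpage perturbation (illustrative).

    `surrogate` approximates the non-differentiable raw-pixels-to-screenshot
    mapping; `agent_loss` measures how far the agent's behavior is from the
    attacker-specified action.
    """
    delta = torch.zeros_like(raw_pixels, requires_grad=True)
    for _ in range(steps):
        screenshot = surrogate(raw_pixels + delta)
        loss = agent_loss(screenshot)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend the attack loss
            delta.clamp_(-eps, eps)              # project onto the L-inf ball
        delta.grad.zero_()
    return (raw_pixels + delta).detach()
```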
Authors: Marien Renaud, Valentin De Bortoli, Arthur Leclaire, Nicolas Papadakis
Abstract: We consider the problem of sampling distributions stemming from non-convex potentials with Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g., imaging inverse problems, potentials are non-convex and non-smooth. Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result combined with properties of the Moreau envelope allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
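For reference, a commonly stated form of the PSGLA update combines a (stochastic) gradient step on the smooth term, Gaussian noise, and a proximal step on the non-smooth term; the sketch below assumes this standard formulation and is not taken from the paper.

```python
import torch

def psgla_step(x, grad_f, prox_g, gamma):
    """One PSGLA iteration (illustrative):
        x_{k+1} = prox_{gamma * g}( x_k - gamma * grad_f(x_k) + sqrt(2 * gamma) * z ),
    with z ~ N(0, I). `grad_f` may be a stochastic gradient of the smooth
    potential and `prox_g(v, gamma)` the proximal operator of the non-smooth part.
    """
    noise = torch.randn_like(x) * (2.0 * gamma) ** 0.5
    return prox_g(x - gamma * grad_f(x) + noise, gamma)
```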
Authors: Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
Abstract: Scalable image compression is a technique that progressively reconstructs multiple versions of an image for different requirements. In recent years, images have increasingly been consumed not only by humans but also by image recognition models. This shift has drawn growing attention to scalable image compression methods that serve both machine and human vision (ICMH). Many existing models employ neural network-based codecs, known as learned image compression, and have made significant strides in this field by carefully designing the loss functions. In some cases, however, models are overly reliant on their learning capacity, and their architectural design is not sufficiently considered. In this paper, we enhance the coding efficiency and interpretability of ICMH framework by integrating an explicit residual compression mechanism, which is commonly employed in resolution scalable coding methods such as JPEG2000. Specifically, we propose two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH). These proposed methods are applicable to various machine vision tasks. Moreover, they provide flexibility to choose between encoder complexity and compression performance, making it adaptable to diverse application requirements. Experimental results demonstrate the effectiveness of our proposed methods, with PR-ICMH achieving up to 29.57% BD-rate savings over the previous work.
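A conceptual sketch of the pixel-residual scalability described above: a base layer serves machine vision, and the pixel residual between the original and the base reconstruction is coded as an enhancement layer for human viewing. Both codecs below are placeholders, not the proposed FR-ICMH/PR-ICMH modules.

```python
def pixel_residual_scalable_encode(image, machine_codec, residual_codec):
    """Conceptual pixel-residual scalable coding with placeholder codecs.

    Base layer: reconstruction optimized for machine vision tasks.
    Enhancement layer: pixel residual coded for human viewing.
    Each codec is assumed to return (bitstream, reconstruction).
    """
    base_bits, base_recon = machine_codec(image)
    residual = image - base_recon
    enh_bits, residual_recon = residual_codec(residual)
    human_recon = base_recon + residual_recon
    return base_bits, enh_bits, base_recon, human_recon
```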
Authors: Anurag Pallaprolu, Winston Hurst, Yasamin Mostofi
Abstract: In this paper, we present mmFlux: a novel framework for extracting underlying crowd motion patterns and inferring crowd semantics using mmWave radar. First, our proposed signal processing pipeline combines optical flow estimation concepts from vision with novel statistical and morphological noise filtering. This approach generates high-fidelity mmWave flow fields: compact 2D vector representations of crowd motion. We then introduce a novel approach that transforms these fields into directed geometric graphs. In these graphs, edges capture dominant flow currents, vertices mark crowd splitting or merging, and flow distribution is quantified across edges. Finally, we show that analyzing the local Jacobian and computing the corresponding curl and divergence enables extraction of key crowd semantics for both structured and diffused crowds. We conduct 21 experiments on crowds of up to 20 people across 3 areas, using commodity mmWave radar. Our framework achieves high-fidelity graph reconstruction of the underlying flow structure, even for complex crowd patterns, demonstrating strong spatial alignment and precise quantitative characterization of flow split ratios. Finally, our curl and divergence analysis accurately infers key crowd semantics, e.g., abrupt turns, boundaries where flow directions shift, dispersions, and gatherings. Overall, these findings validate mmFlux, underscoring its potential for various crowd analytics applications.
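The curl and divergence computation on a gridded flow field reduces to finite differences of the local Jacobian, as in the short sketch below (the grid spacing is assumed uniform; the paper's graph-based pipeline is not reproduced here).

```python
import numpy as np

def flow_curl_divergence(u, v, spacing=1.0):
    """Divergence and scalar curl of a 2D flow field sampled on a grid.

    u, v: arrays of shape [H, W] with the x- and y-components of the flow.
    Positive divergence indicates dispersion, negative indicates gathering;
    large-magnitude curl indicates turning motion.
    """
    du_dy, du_dx = np.gradient(u, spacing)
    dv_dy, dv_dx = np.gradient(v, spacing)
    divergence = du_dx + dv_dy
    curl = dv_dx - du_dy
    return curl, divergence
```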
Authors: Sheng Wu, Fei Teng, Hao Shi, Qi Jiang, Kai Luo, Kaiwei Wang, Kailun Yang
Abstract: Panoramic cameras, capturing comprehensive 360-degree environmental data, are suitable for quadruped robots in surrounding perception and interaction with complex environments. However, the scarcity of high-quality panoramic training data, caused by inherent kinematic constraints and complex sensor calibration challenges, fundamentally limits the development of robust perception systems tailored to these embodied platforms. To address this issue, we propose QuaDreamer, the first panoramic data generation engine specifically designed for quadruped robots. QuaDreamer focuses on mimicking the motion paradigm of quadruped robots to generate highly controllable, realistic panoramic videos, providing a data source for downstream tasks. Specifically, to effectively capture the unique vertical vibration characteristics exhibited during quadruped locomotion, we introduce Vertical Jitter Encoding (VJE). VJE extracts controllable vertical signals through frequency-domain feature filtering and provides high-quality prompts. To facilitate high-quality panoramic video generation under jitter signal control, we propose a Scene-Object Controller (SOC) that effectively manages object motion and boosts background jitter control through the attention mechanism. To address panoramic distortions in wide-FoV video generation, we propose the Panoramic Enhancer (PE), a dual-stream architecture that synergizes frequency-texture refinement for local detail enhancement with spatial-structure correction for global geometric consistency. We further demonstrate that the generated video sequences can serve as training data for the quadruped robot's panoramic visual perception model, enhancing the performance of multi-object tracking in 360-degree scenes. The source code and model weights will be publicly available at https://github.com/losehu/QuaDreamer.
Authors: Wei Li, Hangjie Yuan, Zixiang Zhao, Yifan Zhu, Aojun Lu, Tao Feng, Yanan Sun
Abstract: Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in CL to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose \textbf{C}ontinual \textbf{Flat}ness (\textbf{C-Flat}), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. Besides, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.
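Since C-Flat builds on sharpness-aware minimization, a generic SAM step is sketched below for orientation; it is not the C-Flat or C-Flat++ selective variant, and the perturbation radius is a typical default rather than the paper's setting.

```python
import torch

def sam_step(model, loss_fn, optimizer, rho=0.05):
    """Generic sharpness-aware minimization step (illustrative, not C-Flat).

    1) Perturb the weights by rho * g / ||g|| toward higher loss.
    2) Compute the gradient at the perturbed weights.
    3) Restore the weights and apply that gradient with the base optimizer.
    `loss_fn(model)` is assumed to compute the training loss on a batch.
    """
    loss_fn(model).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # step toward the nearby high-loss point
            perturbations.append((p, e))
    optimizer.zero_grad()
    loss_fn(model).backward()              # sharpness-aware gradient
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                      # restore the original weights
    optimizer.step()                       # update with the sharpness-aware gradient
    optimizer.zero_grad()
```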