new Separated Inter/Intra-Modal Fusion Prompts for Compositional Zero-Shot Learning

Authors: Sua Jung

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize subtle differences in meaning or the combination of states and objects through the use of known and unknown concepts during training. Existing methods either focused on prompt configuration or on using prompts to tune the pre-trained Vision-Language model. However, these methods faced challenges in accurately identifying subtle differences in meaning or combining states with objects. To jointly eradicate the above issues and construct an efficient and effective CZSL technique, we suggest a method to improve attribute recognition performance by utilizing diverse Prompt Learning with an Inter/Intra-Modality Fusion Synthesizer in scene understanding involving subtle semantic differences and multiple objects.

new ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification

Authors: Mohammadreza Saraei, Igor Kozak, Eung-Joo Lee

Abstract: Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, developing OCT-based diagnostic tools faces challenges, such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has made progress in automating OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining (SSP), and Supervised Fine-Tuning. The pretraining phase leverages the OCTMNIST dataset (97,477 unlabeled images across four disease classes) with data augmentation to create dual-augmented views. A Vision Transformer (ViT-Base) backbone extracts features, while a negative cosine similarity loss aligns feature representations. Pretraining is conducted over 50 epochs with a learning rate of 0.0001 and momentum of 0.999. Fine-tuning is performed on a stratified 5.129% subset of OCTMNIST using 10-fold cross-validation. ViT-2SPN achieves a mean AUC of 0.93, accuracy of 0.77, precision of 0.81, recall of 0.75, and an F1 score of 0.76, outperforming existing SSP-based methods.
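
Note: the negative cosine similarity loss mentioned in the abstract can be illustrated with a minimal sketch. The function name and the plain-list vector representation are assumptions made for illustration; the abstract does not say how the loss is combined across the two augmented views, so only the per-pair term is shown.

```python
import math


def negative_cosine_similarity(p, z, eps=1e-8):
    """Return -cos(p, z) for two equal-length feature vectors.

    The value is -1.0 when the two views are perfectly aligned,
    0.0 when they are orthogonal, and +1.0 when they point in
    opposite directions, so minimizing it pulls the views together.
    """
    if len(p) != len(z):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    # eps guards against division by zero for all-zero vectors.
    return -dot / max(norm_p * norm_z, eps)


# Hypothetical features from two augmented views of the same image.
view_a = [1.0, 2.0]
view_b = [2.0, 4.0]
loss = negative_cosine_similarity(view_a, view_b)
```

In a real training loop the same quantity would be computed on batched tensors and averaged over the batch.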

new A Contrastive Teacher-Student Framework for Novelty Detection under Style Shifts

Authors: Hossein Mirzaei, Mojtaba Nafez, Moein Madadi, Arad Maleki, Mahdi Hajialilue, Zeinab Sadat Taghavi, Sepehr Rezaee, Ali Ansari, Bahar Dibaei Nia, Kian Shamsaie, Mohammadreza Salehi, Mackenzie W. Mathis, Mahdieh Soleymani Baghshah, Mohammad Sabokrou, Mohammad Hossein Rohban

Abstract: There have been several efforts to improve Novelty Detection (ND) performance. However, ND methods often suffer significant performance drops under minor distribution shifts caused by changes in the environment, known as style shifts. This challenge arises from the ND setup, where the absence of out-of-distribution (OOD) samples during training causes the detector to be biased toward the dominant style features in the in-distribution (ID) data. As a result, the model mistakenly learns to correlate style with core features, using this shortcut for detection. Robust ND is crucial for real-world applications like autonomous driving and medical imaging, where test samples may have different styles than the training data. Motivated by this, we propose a robust ND method that crafts an auxiliary OOD set with style features similar to the ID set but with different core features. Then, a task-based knowledge distillation strategy is utilized to distinguish core features from style features and help our model rely on core features for discriminating crafted OOD and ID sets. We verified the effectiveness of our method through extensive experimental evaluations on several datasets, including synthetic and real-world benchmarks, against nine different ND methods.

new WASUP: Interpretable Classification with Weight-Input Alignment and Class-Discriminative SUPports Vectors

Authors: Tom Nuno Wolf, Christian Wachinger

Abstract: The deployment of deep learning models in critical domains necessitates a balance between high accuracy and interpretability. We introduce WASUP, an inherently interpretable neural network that provides local and global explanations of its decision-making process. We prove that these explanations are faithful by fulfilling established axioms for explanations. Leveraging the concept of case-based reasoning, WASUP extracts class-representative support vectors from training images, ensuring they capture relevant features while suppressing irrelevant ones. Classification decisions are made by calculating and aggregating similarity scores between these support vectors and the input's latent feature vector. We employ B-Cos transformations, which align model weights with inputs to enable faithful mappings of latent features back to the input space, facilitating local explanations in addition to global explanations of case-based reasoning. We evaluate WASUP on three tasks: fine-grained classification on Stanford Dogs, multi-label classification on Pascal VOC, and pathology detection on the RSNA dataset. Results indicate that WASUP not only achieves competitive accuracy compared to state-of-the-art black-box models but also offers insightful explanations verified through theoretical analysis. Our findings underscore WASUP's potential for applications where understanding model decisions is as critical as the decisions themselves.

new Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines

Authors: Chongyu Qu, Ritchie Zhao, Ye Yu, Bin Liu, Tianyuan Yao, Junchao Zhu, Bennett A. Landman, Yucheng Tang, Yuankai Huo

Abstract: Quantizing deep neural networks, reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods studied "fake quantization", which simulates lower precision operations during inference, but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real 3D low-bit quantization on modern GPUs is still unexplored. In this study, we introduce a real post-training quantization (PTQ) framework that successfully implements true 8-bit quantization on state-of-the-art (SOTA) 3D medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, ST-UNet, and VISTA3D. Our approach involves two main steps. First, we use TensorRT to perform fake quantization for both weights and activations with an unlabeled calibration dataset. Second, we convert this fake quantization into real quantization via the TensorRT engine on real GPUs, resulting in real-world reductions in model size and inference latency. Extensive experiments demonstrate that our framework effectively performs 8-bit quantization on GPUs without sacrificing model performance. This advancement enables the deployment of efficient deep learning models in medical imaging applications where computational resources are constrained. The code and models have been released, including U-Net and TransUNet pretrained on the BTCV dataset for abdominal (13-label) segmentation, UNesT pretrained on the Whole Brain Dataset for whole brain (133-label) segmentation, and nnU-Net, SegResNet, SwinUNETR and VISTA3D pretrained on TotalSegmentator V2 for full body (104-label) segmentation. https://github.com/hrlblab/PTQ.

URLs: https://github.com/hrlblab/PTQ
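
Note: the difference between fake and real quantization described in the abstract can be sketched in a few lines. This is a pure-Python illustration of symmetric per-tensor 8-bit quantization, not the TensorRT workflow the authors use; all function names and the max-abs calibration rule are assumptions chosen for clarity.

```python
def calibrate_scale(values, num_bits=8):
    """Pick a symmetric per-tensor scale from calibration values (max-abs rule)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Real quantization: map floats to clamped signed integers."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map stored integers back to approximate floats."""
    return [q * scale for q in q_values]


def fake_quantize(values, scale, num_bits=8):
    """Fake quantization: round-trip through the integer grid but return floats.

    The rounding error is simulated, yet the result is still stored and
    computed in floating point, so neither memory nor latency improves.
    """
    return dequantize(quantize(values, scale, num_bits), scale)


weights = [-1.27, 0.0, 0.5, 1.27]
scale = calibrate_scale(weights)
q_weights = quantize(weights, scale)        # integers that fit in one byte each
simulated = fake_quantize(weights, scale)   # floats carrying the rounding error
```

Only the integer list corresponds to what a real low-bit engine would store; the fake-quantized list is useful for measuring accuracy impact before conversion.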

new On the Coexistence and Ensembling of Watermarks

Authors: Aleksandar Petrov, Shruti Agarwal, Philip H. S. Torr, Adel Bibi, John Collomosse

Abstract: Watermarking, the practice of embedding imperceptible information into media such as images, videos, audio, and text, is essential for intellectual property protection, content provenance and attribution. The growing complexity of digital ecosystems necessitates watermarks for different uses to be embedded in the same media. However, to detect and decode all watermarks, they need to coexist well with one another. We perform the first study of coexistence of deep image watermarking methods and, contrary to intuition, we find that various open-source watermarks can coexist with only minor impacts on image quality and decoding robustness. The coexistence of watermarks also opens the avenue for ensembling watermarking methods. We show how ensembling can increase the overall message capacity and enable new trade-offs between capacity, accuracy, robustness and image quality, without needing to retrain the base models.

new Assessing the Capability of YOLO- and Transformer-based Object Detectors for Real-time Weed Detection

Authors: Alicia Allmendinger, Ahmet Oğuz Saltık, Gerassimos G. Peteinatos, Anthony Stein, Roland Gerhards

Abstract: Spot spraying represents an efficient and sustainable method for reducing the amount of pesticides, particularly herbicides, used in agricultural fields. To achieve this, it is of utmost importance to reliably differentiate between crops and weeds, and even between individual weed species in situ and under real-time conditions. To assess suitability for real-time application, different object detection models that are currently state-of-the-art are compared. All available models of YOLOv8, YOLOv9, YOLOv10, and RT-DETR are trained and evaluated with images from a real field situation. The images are separated into two distinct datasets: in the initial dataset, each species of plants is trained individually; in the subsequent dataset, a distinction is made between monocotyledonous weeds, dicotyledonous weeds, and three chosen crops. The results demonstrate that while all models perform equally well in the metrics evaluated, the YOLOv9 models, particularly the YOLOv9s and YOLOv9e, stand out in terms of their strong recall scores (66.58% and 72.36%), as well as mAP50 (73.52% and 79.86%), and mAP50-95 (43.82% and 47.00%) in dataset 2. However, the RT-DETR models, especially RT-DETR-l, excel in precision, reaching 82.44% on dataset 1 and 81.46% on dataset 2, making them particularly suitable for scenarios where minimizing false positives is critical. In particular, the smallest variants of the YOLO models (YOLOv8n, YOLOv9t, and YOLOv10n) achieve substantially faster inference times down to 7.58 ms for dataset 2 on the NVIDIA GeForce RTX 4090 GPU for analyzing one frame, while maintaining competitive accuracy, highlighting their potential for deployment in resource-constrained embedded computing devices as typically used in productive setups.

new Learning Free Token Reduction for Multi-Modal LLM

Authors: Zihui Zhao, Yingxin Li, Yang Li

Abstract: Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks; however, their practical deployment is often constrained by high computational costs and prolonged inference times. Since the vision modality typically carries more information than the text modality, compressing visual prompts offers a promising solution to alleviate these challenges. Existing approaches predominantly focus on refining model architectures or directly reducing the number of visual tokens. However, these methods often compromise inference performance due to a lack of consideration for the unique spatial and temporal characteristics of visual data. In this work, we propose a token compression paradigm that operates on both spatial and temporal dimensions. Our approach includes a learning-free, plug-and-play compression pipeline that can be seamlessly integrated into most Multimodal Large Language Model (MLLM) frameworks. By leveraging this method, we enhance the model inference capability while simultaneously reducing its computational cost. Experimental results on the Video-QA task demonstrate the effectiveness of the proposed approach, showcasing significant improvements in efficiency without sacrificing performance.

new General Scene Adaptation for Vision-and-Language Navigation

Authors: Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu

Abstract: Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in persistent environments with relatively consistent physical layouts, visual observations, and language styles from instructors. Such a gap in the task setting presents an opportunity to improve VLN agents by incorporating continuous adaptation to specific environments. To better reflect these real-world conditions, we introduce GSA-VLN, a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time. To evaluate the proposed task, one has to address two challenges in existing VLN datasets: the lack of OOD data, and the limited number and style diversity of instructions for each scene. Therefore, we propose a new dataset, GSA-R2R, which significantly expands the diversity and quantity of environments and instructions for the R2R dataset to evaluate agent adaptability in both ID and OOD contexts. Furthermore, we design a three-stage instruction orchestration pipeline that leverages LLMs to refine speaker-generated instructions and apply role-playing techniques to rephrase instructions into different speaking styles. This is motivated by the observation that each individual user often has consistent signatures or preferences in their instructions. We conducted extensive experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various methods. Based on our findings, we propose a novel method, GR-DUET, which incorporates memory-based navigation graphs with an environment-specific training strategy, achieving state-of-the-art results on all GSA-R2R splits.

new SIGN: A Statistically-Informed Gaze Network for Gaze Time Prediction

Authors: Jianping Ye, Michel Wedel

Abstract: We propose a first version of SIGN, a Statistically-Informed Gaze Network, to predict aggregate gaze times on images. We develop a foundational statistical model for which we derive a deep learning implementation involving CNNs and Visual Transformers, which enables the prediction of overall gaze times. The model enables us to derive from the aggregate gaze times the underlying gaze pattern as a probability map over all regions in the image, where each region's probability represents the likelihood of being gazed at across all possible scan-paths. We test SIGN's performance on AdGaze3500, a dataset of images of ads with aggregate gaze times, and on COCO-Search18, a dataset with individual-level fixation patterns collected during search. We demonstrate that SIGN (1) improves gaze duration prediction significantly over state-of-the-art deep learning benchmarks on both datasets, and (2) can deliver plausible gaze patterns that correspond to empirical fixation patterns in COCO-Search18. These results suggest that the first version of SIGN holds promise for gaze-time predictions and deserves further development.

new Towards Making Flowchart Images Machine Interpretable

Authors: Shreya Shukla, Prajwal Gatti, Yogesh Kumar, Vikash Yadav, Anand Mishra

Abstract: Computer programming textbooks and software documentation often contain flowcharts to illustrate the flow of an algorithm or procedure. Modern OCR engines often tag these flowcharts as graphics and ignore them in further processing. In this paper, we work towards making flowchart images machine-interpretable by converting them to executable Python code. To this end, inspired by the recent success in natural language to code generation literature, we present a novel transformer-based framework, namely FloCo-T5. Our model is well-suited for this task, as it can effectively learn semantics, structure, and patterns of programming languages, which it leverages to generate syntactically correct code. We also used a task-specific pre-training objective to pre-train FloCo-T5 using a large number of logic-preserving augmented code samples. Further, to perform a rigorous study of this problem, we introduce the FloCo dataset that contains 11,884 flowchart images and their corresponding Python code. Our experiments show promising results, and FloCo-T5 clearly outperforms related competitive baselines on code generation metrics. We make our dataset and implementation publicly available.

new Solving Inverse Problems using Diffusion with Fast Iterative Renoising

Authors: Matt C. Bendel, Saurav K. Shastri, Rizwan Ahmad, Philip Schniter

Abstract: Imaging inverse problems can be solved in an unsupervised manner using pre-trained diffusion models. In most cases, that involves approximating the gradient of the measurement-conditional score function in the reverse process. Since the approximations produced by existing methods are quite poor, especially early in the reverse process, we propose a new approach that re-estimates and renoises the image several times per diffusion step. Renoising adds carefully shaped colored noise that ensures the pre-trained diffusion model sees white-Gaussian error, in accordance with how it was trained. We demonstrate the effectiveness of our "DDfire" method at 20, 100, and 1000 neural function evaluations on linear inverse problems and phase retrieval.

new 3DSES: an indoor Lidar point cloud segmentation dataset with real and pseudo-labels from a 3D model

Authors: Maxime Mérizette (GeF, CEDRIC - VERTIGO), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, LaSTIG, IGN), Pierre Kervella (GeF), Jérôme Verdun (GeF)

Abstract: Semantic segmentation of indoor point clouds has found various applications in the creation of digital twins for robotics, navigation and building information modeling (BIM). However, most existing datasets of labeled indoor point clouds have been acquired by photogrammetry. In contrast, Terrestrial Laser Scanning (TLS) can acquire dense sub-centimeter point clouds and has become the standard for surveyors. We present 3DSES (3D Segmentation of ESGT point clouds), a new dataset of indoor dense TLS colorized point clouds covering 427 m² of an engineering school. 3DSES has a unique double annotation format: semantic labels annotated at the point level alongside a full 3D CAD model of the building. We introduce a model-to-cloud algorithm for automated labeling of indoor point clouds using an existing 3D CAD model. 3DSES has 3 variants of various semantic and geometrical complexities. We show that our model-to-cloud alignment can produce pseudo-labels on our point clouds with > 95% accuracy, allowing us to train deep models with significant time savings compared to manual labeling. First baselines on 3DSES show the difficulties encountered by existing models when segmenting objects relevant to BIM, such as light and safety utilities. We show that segmentation accuracy can be improved by leveraging pseudo-labels and Lidar intensity, an information rarely considered in current datasets. Code and data will be open sourced.

new Towards Training-Free Open-World Classification with 3D Generative Models

Authors: Xinzhe Xia, Weiguang Zhao, Yuyao Yan, Guanyu Yang, Rui Zhang, Kaizhu Huang, Xi Yang

Abstract: 3D open-world classification is a challenging yet essential task in dynamic and unstructured real-world scenarios, requiring both open-category and open-pose recognition. To address these challenges, recent wisdom often takes sophisticated 2D pre-trained models to provide enriched and stable representations. However, these methods largely rely on how 3D objects can be projected into 2D space, which is unfortunately not well solved, and thus significantly limits their performance. Unlike these present efforts, in this paper we make a pioneering exploration of 3D generative models for 3D open-world classification. Drawing on abundant prior knowledge from 3D generative models, we additionally craft a rotation-invariant feature extractor. This innovative synergy endows our pipeline with the advantages of being training-free, open-category, and pose-invariant, thus well suited to 3D open-world classification. Extensive experiments on benchmark datasets demonstrate the potential of generative models in 3D open-world classification, achieving state-of-the-art performance on ModelNet10 and McGill with 32.0% and 8.7% overall accuracy improvement, respectively.

new Action Recognition Using Temporal Shift Module and Ensemble Learning

Authors: Anh-Kiet Duong, Petra Gomez-Krämer

Abstract: This paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the International Conference on Pattern Recognition (ICPR) 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. The proposed approach is built upon the Temporal Shift Module (TSM), a technique aimed at efficiently capturing temporal dynamics in video data, incorporating multiple data input types. Our strategy included transfer learning to leverage pre-trained models, followed by meticulous fine-tuning on the challenge's specific dataset to optimize performance for the 20 action classes. We carefully selected a backbone network to balance computational efficiency and recognition accuracy and further refined the model using an ensemble technique that integrates outputs from different modalities. This ensemble approach proved crucial in boosting the overall performance. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes. Our code is available online at https://github.com/ffyyytt/TSM-MMVPR.

URLs: https://github.com/ffyyytt/TSM-MMVPR

new An Exceptional Dataset For Rare Pancreatic Tumor Segmentation

Authors: Wenqi Li, Yingli Chen, Keyang Zhou, Xiaoxiao Hu, Zilu Zheng, Yue Yan, Xinpeng Zhang, Wei Tang, Zhenxing Qian

Abstract: Pancreatic NEuroendocrine Tumors (pNETs) are very rare endocrine neoplasms that account for less than 5% of all pancreatic malignancies, with an incidence of only 1-1.5 cases per 100,000. Early detection of pNETs is critical for improving patient survival, but the rarity of pNETs makes segmenting them from CT a very challenging problem. So far, there has not been a dataset specifically for pNETs available to researchers. To address this issue, we propose a pNETs dataset, a well-annotated Contrast-Enhanced Computed Tomography (CECT) dataset focused exclusively on Pancreatic Neuroendocrine Tumors, containing data from 469 patients. This is the first dataset solely dedicated to pNETs, distinguishing it from previous collections. Additionally, we provide baseline detection networks with a new slice-wise weight loss function designed for the UNet-based model, improving the overall pNET segmentation performance. We hope that our dataset can enhance the understanding and diagnosis of pNETs within the medical community, facilitate the development of more accurate diagnostic tools, and ultimately improve patient outcomes and advance the field of oncology.

new Boosting Weak Positives for Text Based Person Search

Authors: Akshay Modi, Ashhar Aziz, Nilanjana Chatterjee, A V Subramanyam

Abstract: Large vision-language models have revolutionized cross-modal object retrieval, but text-based person search (TBPS) remains a challenging task due to limited data and the fine-grained nature of the task. Existing methods primarily focus on aligning image-text pairs into a common representation space, often disregarding the fact that real-world positive image-text pairs share varying degrees of similarity. This leads models to prioritize easy pairs, and in some recent approaches, challenging samples are discarded as noise during training. In this work, we introduce a boosting technique that dynamically identifies and emphasizes these challenging samples during training. Our approach is motivated by classical boosting techniques and dynamically updates the weights of the weak positives, wherein the rank-1 match does not share the identity of the query. The weight allows these misranked pairs to contribute more towards the loss, so the network has to pay more attention to such samples. Our method achieves improved performance across four pedestrian datasets, demonstrating the effectiveness of our proposed module.

new Technical report on label-informed logit redistribution for better domain generalization in low-shot classification with foundation models

Authors: Behraj Khan, Tahir Syed

Abstract: Confidence calibration is an emerging challenge in real-world decision systems based on foundation models when used for downstream vision classification tasks. Due to various reasons exposed, logit scores on the CLIP head remain large irrespective of whether the image-language pairs reconcile. It is difficult to address in data space, given the few-shot regime. We propose a penalty incorporated into the loss objective that penalizes incorrect classifications whenever one is made during finetuning, by moving an amount of log-likelihood to the true class commensurate to the relative amplitudes of the two likelihoods. We refer to it as confidence misalignment penalty (CMP). Extensive experiments on 12 vision datasets and 5 domain generalization datasets support the calibration performance of our method against the state-of-the-art. CMP outperforms the benchmarked prompt learning methods, demonstrating an average improvement in Expected Calibration Error (ECE) of 6.01%, with a minimum of 4.01% and a maximum of 9.72%. Anonymized sample source code for this paper can be found at: https://anonymous.4open.science/r/icml25-C5CB/readme.txt

URLs: https://anonymous.4open.science/r/icml25-C5CB/readme.txt
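
Note: the Expected Calibration Error (ECE) reported above is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin. The sketch below follows that common definition with equal-width bins; the bin count and the function name are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# An overconfident toy model: 90% confidence but only 75% accuracy.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False])
```

A lower value means the reported confidences track the observed accuracy more closely.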

new Efficient Interactive 3D Multi-Object Removal

Authors: Jingcheng Ni, Weiguang Zhao, Daniel Wang, Ziyao Zeng, Chenyu You, Alex Wong, Kaizhu Huang

Abstract: Object removal is of great significance to 3D scene understanding, essential for applications in content filtering and scene editing. Current mainstream methods primarily focus on removing individual objects, with a few methods dedicated to eliminating an entire area or all objects of a certain category. However, they offer insufficient granularity and flexibility for real-world applications, where users demand tailored excision and preservation of objects within defined zones. In addition, most current methods require various kinds of priors when addressing multi-view inpainting, which is time-consuming. To address these limitations, we propose an efficient and user-friendly pipeline for 3D multi-object removal, enabling users to flexibly select areas and define objects for removal or preservation. Concretely, to ensure object consistency and correspondence across multiple views, we propose a novel mask matching and refinement module, which integrates homography-based warping with high-confidence anchor points for segmentation. By leveraging the IoU joint shape context distance loss, we enhance the accuracy of warped masks and improve subsequent inpainting processes. Considering the current immaturity of 3D multi-object removal, we provide a new evaluation dataset to bridge the developmental void. Experimental results demonstrate that our method significantly reduces computational costs, achieving processing speeds more than 80% faster than state-of-the-art methods while maintaining equivalent or higher reconstruction quality.

new Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

Authors: Lin Chen, Qi Yang, Kun Ding, Zhihao Li, Gang Shen, Fei Li, Qiyuan Cao, Shiming Xiang

Abstract: Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities, significantly facilitating the development of OVSS. However, most existing methods suffer from either suboptimal performance or long latency. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency. ERR-Seg incorporates a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models like CLIP to identify the most relevant classes while discarding others. Moreover, it incorporates Efficient Semantic Context Fusion (ESCF) with spatial-level and class-level sequence reduction strategies. CRM and ESCF result in substantial memory and computational savings without compromising accuracy. Additionally, recognizing the significance of hierarchical semantics extracted from middle-layer features for closed-set semantic segmentation, ERR-Seg introduces the Hierarchical Semantic Module (HSM) to exploit hierarchical semantics in the context of OVSS. Compared to previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves a +5.6% mIoU improvement and reduces latency by 67.3%.

new FeatureGS: Eigenvalue-Feature Optimization in 3D Gaussian Splatting for Geometrically Accurate and Artifact-Reduced Reconstruction

Authors: Miriam Jäger, Markus Hillemann, Boris Jutzi

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful approach for 3D scene reconstruction using 3D Gaussians. However, neither the centers nor surfaces of the Gaussians are accurately aligned to the object surface, complicating their direct use in point cloud and mesh reconstruction. Additionally, 3DGS typically produces floater artifacts, increasing the number of Gaussians and storage requirements. To address these issues, we present FeatureGS, which incorporates an additional geometric loss term based on an eigenvalue-derived 3D shape feature into the optimization process of 3DGS. The goal is to improve geometric accuracy and enhance properties of planar surfaces with reduced structural entropy in local 3D neighborhoods. We present four alternative formulations for the geometric loss term based on 'planarity' of Gaussians, as well as 'planarity', 'omnivariance', and 'eigenentropy' of Gaussian neighborhoods. We provide quantitative and qualitative evaluations on 15 scenes of the DTU benchmark dataset focusing on the following key aspects: geometric accuracy and artifact reduction, measured by the Chamfer distance, and memory efficiency, evaluated by the total number of Gaussians. Additionally, rendering quality is monitored by Peak Signal-to-Noise Ratio. FeatureGS achieves a 30% improvement in geometric accuracy, reduces the number of Gaussians by 90%, and suppresses floater artifacts, while maintaining comparable photometric rendering quality. The geometric loss with 'planarity' from Gaussians provides the highest geometric accuracy, while 'omnivariance' in Gaussian neighborhoods reduces floater artifacts and the number of Gaussians the most. This makes FeatureGS a strong method for geometrically accurate, artifact-reduced and memory-efficient 3D scene reconstruction, enabling the direct use of Gaussian centers for geometric representation.
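
Note: 'planarity', 'omnivariance', and 'eigenentropy' are eigenvalue-derived shape features that are widely used in point cloud analysis. The sketch below uses their commonly cited definitions on sum-normalized eigenvalues of a 3x3 covariance matrix; whether FeatureGS uses exactly these forms, and how it turns them into a loss, is not stated in the abstract, so the function names and the normalization are assumptions.

```python
import math


def normalized_eigenvalues(eigs):
    """Sort three non-negative eigenvalues descending and scale them to sum to 1."""
    l1, l2, l3 = sorted(eigs, reverse=True)
    total = l1 + l2 + l3
    if l3 < 0 or total <= 0:
        raise ValueError("expected non-negative eigenvalues with a positive sum")
    return l1 / total, l2 / total, l3 / total


def planarity(eigs):
    """(e2 - e3) / e1: close to 1 for flat, disc-like structures."""
    e1, e2, e3 = normalized_eigenvalues(eigs)
    return (e2 - e3) / e1


def omnivariance(eigs):
    """Cube root of e1 * e2 * e3: large when the spread is volumetric."""
    e1, e2, e3 = normalized_eigenvalues(eigs)
    return (e1 * e2 * e3) ** (1.0 / 3.0)


def eigenentropy(eigs):
    """-sum(e * ln e): low for ordered structures, ln(3) for an isotropic blob."""
    # Terms with e == 0 are skipped, following the convention 0 * ln 0 = 0.
    return -sum(e * math.log(e) for e in normalized_eigenvalues(eigs) if e > 0)


flat_disc = (1.0, 1.0, 0.0)   # hypothetical eigenvalues of a planar neighborhood
round_blob = (1.0, 1.0, 1.0)  # hypothetical eigenvalues of an isotropic one
```

For the flat example, planarity is 1 and omnivariance is 0, while the isotropic example has planarity 0 and the maximum eigenentropy of ln 3.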

new ContourFormer: Real-Time Contour-Based End-to-End Instance Segmentation Transformer

Authors: Weiwei Yao, Chen Li, Minjun Xiong, Wenbo Dong, Hao Chen, Xiong Xiao

Abstract: This paper presents Contourformer, a real-time contour-based instance segmentation algorithm. The method is fully based on the DETR paradigm and achieves end-to-end inference through iterative and progressive mechanisms to optimize contours. To improve efficiency and accuracy, we develop two novel techniques: sub-contour decoupling mechanisms and contour fine-grained distribution refinement. In the sub-contour decoupling mechanism, we propose a deformable attention-based module that adaptively selects sampling regions based on the current predicted contour, enabling more effective capturing of object boundary information. Additionally, we design a multi-stage optimization process to enhance segmentation precision by progressively refining sub-contours. The contour fine-grained distribution refinement technique aims to further improve the ability to express fine details of contours. These innovations enable Contourformer to achieve stable and precise segmentation for each instance while maintaining real-time performance. Extensive experiments demonstrate the superior performance of Contourformer on multiple benchmark datasets, including SBD, COCO, and KINS. We conduct comprehensive evaluations and comparisons with existing state-of-the-art methods, showing significant improvements in both accuracy and inference speed. This work provides a new solution for contour-based instance segmentation tasks and lays a foundation for future research, with the potential to become a strong baseline method in this field.

new Segmentation-Aware Generative Reinforcement Network (GRN) for Tissue Layer Segmentation in 3-D Ultrasound Images for Chronic Low-back Pain (cLBP) Assessment

Authors: Zixue Zeng, Xiaoyan Zhao, Matthew Cartier, Tong Yu, Jing Wang, Xin Meng, Zhiyu Sheng, Maryam Satarpour, John M Cormack, Allison Bean, Ryan Nussbaum, Maya Maurer, Emily Landis-Walkenhorst, Dinesh Kumbhare, Kang Kim, Ajay Wasan, Jiantao Pu

Abstract: We introduce a novel segmentation-aware joint training framework called generative reinforcement network (GRN) that integrates segmentation loss feedback to optimize both image generation and segmentation performance in a single stage. An image enhancement technique called segmentation-guided enhancement (SGE) is also developed, where the generator produces images tailored specifically for the segmentation model. Two variants of GRN were also developed, including GRN for sample-efficient learning (GRN-SEL) and GRN for semi-supervised learning (GRN-SSL). GRN's performance was evaluated using a dataset of 69 fully annotated 3D ultrasound scans from 29 subjects. The annotations included six anatomical structures: dermis, superficial fat, superficial fascial membrane (SFM), deep fat, deep fascial membrane (DFM), and muscle. Our results show that GRN-SEL with SGE reduces labeling efforts by up to 70% while achieving a 1.98% improvement in the Dice Similarity Coefficient (DSC) compared to models trained on fully labeled datasets. GRN-SEL alone reduces labeling efforts by 60%, GRN-SSL with SGE decreases labeling requirements by 70%, and GRN-SSL alone by 60%, all while maintaining performance comparable to fully supervised models. These findings suggest the effectiveness of the GRN framework in optimizing segmentation performance with significantly less labeled data, offering a scalable and efficient solution for ultrasound image analysis and reducing the burdens associated with data annotation.
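
The Dice Similarity Coefficient reported in this abstract can be illustrated with a short sketch. This is the standard definition for two binary masks, 2|A∩B| / (|A| + |B|); the flat 0/1 list representation and the convention of returning 1.0 for two empty masks are assumptions, and the code is not from the GRN implementation.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient for two binary masks given as flat
    sequences of 0/1 values (illustrative helper)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * intersection / size_sum
```

For example, the masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one foreground element out of four in total, giving a DSC of 0.5.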

new Learning Semantic Facial Descriptors for Accurate Face Animation

Authors: Lei Zhu, Yuanqi Chen, Xiaohang Liu, Thomas H. Li, Ge Li

Abstract: Face animation is a challenging task. Existing model-based methods (utilizing 3DMMs or landmarks) often result in a model-like reconstruction effect, which doesn't effectively preserve identity. Conversely, model-free approaches face challenges in attaining a decoupled and semantically rich feature space, thereby making accurate motion transfer difficult to achieve. We introduce the semantic facial descriptors in learnable disentangled vector space to address the dilemma. The approach involves decoupling the facial space into identity and motion subspaces while endowing each of them with semantics by learning complete orthogonal basis vectors. We obtain basis vector coefficients by employing an encoder on the source and driving faces, leading to effective facial descriptors in the identity and motion subspaces. Ultimately, these descriptors can be recombined as latent codes to animate faces. Our approach successfully addresses the issue of model-based methods' limitations in high-fidelity identity and the challenges faced by model-free methods in accurate motion transfer. Extensive experiments are conducted on three challenging benchmarks (i.e. VoxCeleb, HDTF, CelebV). Comprehensive quantitative and qualitative results demonstrate that our model outperforms SOTA methods with superior identity preservation and motion transfer.

new VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback

Authors: Sayeh Gholipour Picha, Dawood Al Chanti, Alice Caplier

Abstract: As artificial intelligence (AI) becomes increasingly central to healthcare, the demand for explainable and trustworthy models is paramount. Current report generation systems for chest X-rays (CXR) often lack mechanisms for validating outputs without expert oversight, raising concerns about reliability and interpretability. To address these challenges, we propose a novel multimodal framework designed to enhance the semantic alignment and localization accuracy of AI-generated medical reports. Our framework integrates two key modules: a Phrase Grounding Model, which identifies and localizes pathologies in CXR images based on textual prompts, and a Text-to-Image Diffusion Module, which generates synthetic CXR images from prompts while preserving anatomical fidelity. By comparing features between the original and generated images, we introduce a dual-scoring system: one score quantifies localization accuracy, while the other evaluates semantic consistency. This approach significantly outperforms existing methods, achieving state-of-the-art results in pathology localization and text-to-image alignment. The integration of phrase grounding with diffusion models, coupled with the dual-scoring evaluation system, provides a robust mechanism for validating report quality, paving the way for more trustworthy and transparent AI in medical imaging.

new CrowdSplat: Exploring Gaussian Splatting For Crowd Rendering

Authors: Xiaohan Sun, Yinghan Xu, John Dingliana, Carol O'Sullivan

Abstract: We present CrowdSplat, a novel approach that leverages 3D Gaussian Splatting for real-time, high-quality crowd rendering. Our method utilizes 3D Gaussian functions to represent animated human characters in diverse poses and outfits, which are extracted from monocular videos. We integrate Level of Detail (LoD) rendering to optimize computational efficiency and quality. The CrowdSplat framework consists of two stages: (1) avatar reconstruction and (2) crowd synthesis. The framework is also optimized for GPU memory usage to enhance scalability. Quantitative and qualitative evaluations show that CrowdSplat achieves good levels of rendering quality, memory efficiency, and computational performance. Through these experiments, we demonstrate that CrowdSplat is a viable solution for dynamic, realistic crowd simulation in real-time applications.

new P-TAME: Explain Any Image Classifier with Trained Perturbations

Authors: Mariano V. Ntrougkas, Vasileios Mezaris, Ioannis Patras

Abstract: The adoption of Deep Neural Networks (DNNs) in critical fields where predictions need to be accompanied by justifications is hindered by their inherent black-box nature. In this paper, we introduce P-TAME (Perturbation-based Trainable Attention Mechanism for Explanations), a model-agnostic method for explaining DNN-based image classifiers. P-TAME employs an auxiliary image classifier to extract features from the input image, bypassing the need to tailor the explanation method to the internal architecture of the backbone classifier being explained. Unlike traditional perturbation-based methods, which have high computational requirements, P-TAME offers an efficient alternative by generating high-resolution explanations in a single forward pass during inference. We apply P-TAME to explain the decisions of VGG-16, ResNet-50, and ViT-B-16, three distinct and widely used image classifiers. Quantitative and qualitative results show that our method matches or outperforms previous explainability methods, including model-specific approaches. Code and trained models will be released upon acceptance.

new SSF: Sparse Long-Range Scene Flow for Autonomous Driving

Authors: Ajinkya Khoche, Qingwen Zhang, Laura Pereira Sanchez, Aron Asefaw, Sina Sharif Mansouri, Patric Jensfelt

Abstract: Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance in the long-range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to long-range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow, adopting a sparse convolution based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme, that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code will be released at https://github.com/KTH-RPL/SSF.git.

URLs: https://github.com/KTH-RPL/SSF.git
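
The mismatch problem described in the SSF abstract can be made concrete with a small sketch: two sparse feature maps from consecutive scans occupy different voxel sets, so they cannot be concatenated directly. One plausible way to align them is to take the union of occupied coordinates and insert "virtual" voxels at the missing locations. The dictionary representation, the zero-valued fill, and the function name below are assumptions for illustration and are not taken from the released code.

```python
def fuse_sparse_features(feats_t0, feats_t1, feat_dim):
    """Align two sparse feature maps and concatenate them per voxel.

    feats_t0, feats_t1: dict mapping a voxel coordinate tuple to a list of
    feat_dim floats. Voxels present in only one scan receive a zero-valued
    virtual feature for the other scan (assumed fill value).
    """
    # Union of occupied voxels, sorted to give both scans the same ordering.
    coords = sorted(set(feats_t0) | set(feats_t1))
    zero = [0.0] * feat_dim
    fused = {}
    for c in coords:
        # List concatenation: features from t0 followed by features from t1.
        fused[c] = feats_t0.get(c, zero) + feats_t1.get(c, zero)
    return fused
```

The result has one entry per voxel occupied in either scan, so memory grows with the number of occupied voxels rather than with the full grid extent.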

new U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning

Authors: Md Kaykobad Reza, Niki Nezakati, Ameya Patil, Mashhour Solh, M. Salman Asif

Abstract: Multimodal learning often relies on designing new models and complex training strategies to achieve optimal performance. We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes pretrained unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks. Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies, such as alternating training, gradient modifications, or unimodal fine-tuning. To address missing modalities during both training and testing, we introduce Mask Tokens (MT), which generate missing modality features from available modalities using a single token per modality. This simplifies the process, removing the need for specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that U2A matches or outperforms state-of-the-art methods in both complete and missing modality settings, showcasing strong performance and robustness across various modalities, tasks, and datasets. We also analyze and report the effectiveness of Mask Tokens in different missing modality scenarios. Overall, our method provides a robust, flexible, and efficient solution for multimodal learning, with minimal computational overhead.
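
As a reminder of why low-rank adaptation reduces the number of learnable parameters, here is a minimal sketch of the generic LoRA update, W' = W + (alpha / r) · B·A, written with plain Python lists. It illustrates LoRA in general rather than the U2A implementation; the function names and the alpha / r scaling convention are assumptions.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)

def lora_effective_weight(W, B, A, alpha):
    """Return W + (alpha / r) * B @ A for list-of-lists matrices.

    W is d_out x d_in and stays frozen; only B and A would be trained.
    """
    rank = len(A)
    scale = alpha / rank
    d_out, d_in = len(W), len(W[0])
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(d_in)]
        for i in range(d_out)
    ]
```

For a 768 x 768 layer, a rank-8 adapter trains `lora_param_count(768, 768, 8)` = 12,288 parameters, compared with 589,824 in the full weight matrix.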

cross Advancing the Biological Plausibility and Efficacy of Hebbian Convolutional Neural Networks

Authors: Julian Jimenez Nimmo, Esther Mondragon

Abstract: The research presented in this paper advances the integration of Hebbian learning into Convolutional Neural Networks (CNNs) for image processing, systematically exploring different architectures to build an optimal configuration, adhering to biological tenability. Hebbian learning operates on local unsupervised neural information to form feature representations, providing an alternative to the popular but arguably biologically implausible and computationally intensive backpropagation learning algorithm. The suggested optimal architecture significantly enhances recent research aimed at integrating Hebbian learning with competition mechanisms and CNNs, expanding their representational capabilities by incorporating hard Winner-Takes-All (WTA) competition, Gaussian lateral inhibition mechanisms and Bienenstock-Cooper-Munro (BCM) learning rule in a single model. The resulting model achieved 76% classification accuracy on CIFAR-10, rivalling its end-to-end backpropagation variant (77%) and critically surpassing the state-of-the-art hard-WTA performance in CNNs of the same network depth (64.6%) by 11.4%. Moreover, results showed clear indications of sparse hierarchical learning through increasingly complex and abstract receptive fields. In summary, our implementation enhances both the performance and the generalisability of the learnt representations and constitutes a crucial step towards more biologically realistic artificial neural networks.

cross Influence of field of view in visual prostheses design: Analysis with a VR system

Authors: Melani Sanchez-Garcia, Ruben Martinez-Cantin, Jesus Bermudez-Cameo, Jose J. Guerrero

Abstract: Visual prostheses are designed to restore partial functional vision in patients with total vision loss. Retinal visual prostheses provide limited capabilities as a result of low resolution, limited field of view and poor dynamic range. Understanding the influence of these parameters on perception can guide prostheses research and design. In this work, we evaluate the influence of field of view with respect to spatial resolution in visual prostheses, measuring the accuracy and response time in a search and recognition task. Twenty-four normally sighted participants were asked to find and recognize common objects, such as furniture and home appliances, in indoor room scenes. For the experiment, we use a new simulated prosthetic vision system that allows simple and effective experimentation. Our system uses a virtual-reality environment based on panoramic scenes. The simulator employs a head-mounted display which allows users to feel immersed in the scene by perceiving the entire scene all around. Our experiments use public image datasets and a commercial head-mounted display. We have also released the virtual-reality software for replicating and extending the experimentation. Results show that the accuracy and response time decrease when the field of view is increased. Furthermore, performance appears to be correlated with the angular resolution, but shows diminishing returns even with a resolution of less than 2.3 phosphenes per degree. Our results seem to indicate that, for the design of retinal prostheses, it is better to concentrate the phosphenes in a small area, to maximize the angular resolution, even if that implies sacrificing field of view.

cross Trustworthy image-to-image translation: evaluating uncertainty calibration in unpaired training scenarios

Authors: Ciaran Bench, Emir Ahmed, Spencer A. Thomas

Abstract: Mammographic screening is an effective method for detecting breast cancer, facilitating early diagnosis. However, the current need to manually inspect images places a heavy burden on healthcare systems, spurring a desire for automated diagnostic protocols. Techniques based on deep neural networks have been shown effective in some studies, but their tendency to overfit leaves considerable risk for poor generalisation and misdiagnosis, preventing their widespread adoption in clinical settings. Data augmentation schemes based on unpaired neural style transfer models have been proposed that improve generalisability by diversifying the representations of training image features in the absence of paired training data (images of the same tissue in either image style). But these models are similarly prone to various pathologies, and evaluating their performance is challenging without ground truths/large datasets (as is often the case in medical imaging). Here, we consider two frameworks/architectures: a GAN-based cycleGAN, and the more recently developed diffusion-based SynDiff. We evaluate their performance when trained on image patches parsed from three open access mammography datasets and one non-medical image dataset. We consider the use of uncertainty quantification to assess model trustworthiness, and propose a scheme to evaluate calibration quality in unpaired training scenarios. This ultimately helps facilitate the trustworthy use of image-to-image translation models in domains where ground truths are not typically available.

cross Watch Your STEPP: Semantic Traversability Estimation using Pose Projected Features

Authors: Sebastian Ægidius, Dennis Hadjivelichkov, Jianhao Jiao, Jonathan Embley-Riches, Dimitrios Kanoulas

Abstract: Understanding the traversability of terrain is essential for autonomous robot navigation, particularly in unstructured environments such as natural landscapes. Although traditional methods, such as occupancy mapping, provide a basic framework, they often fail to account for the complex mobility capabilities of some platforms such as legged robots. In this work, we propose a method for estimating terrain traversability by learning from demonstrations of human walking. Our approach leverages dense, pixel-wise feature embeddings generated using the DINOv2 Vision Transformer model, which are processed through an encoder-decoder MLP architecture to analyze terrain segments. The averaged feature vectors, extracted from the masked regions of interest, are used to train the model in a reconstruction-based framework. By minimizing reconstruction loss, the network distinguishes between familiar terrain with a low reconstruction error and unfamiliar or hazardous terrain with a higher reconstruction error. This approach facilitates the detection of anomalies, allowing a legged robot to navigate more effectively through challenging terrain. We run real-world experiments on the ANYmal legged robot, both indoors and outdoors, to validate our proposed method. The code is open-source, while video demonstrations can be found on our website: https://rpl-cs-ucl.github.io/STEPP

URLs: https://rpl-cs-ucl.github.io/STEPP

cross Dual Invariance Self-training for Reliable Semi-supervised Surgical Phase Recognition

Authors: Sahar Nasirihaghighi, Negin Ghamsarian, Raphael Sznitman, Klaus Schoeffmann

Abstract: Accurate surgical phase recognition is crucial for advancing computer-assisted interventions, yet the scarcity of labeled data hinders training reliable deep learning models. Semi-supervised learning (SSL), particularly with pseudo-labeling, shows promise over fully supervised methods but often lacks reliable pseudo-label assessment mechanisms. To address this gap, we propose a novel SSL framework, Dual Invariance Self-Training (DIST), that incorporates both Temporal and Transformation Invariance to enhance surgical phase recognition. Our two-step self-training process dynamically selects reliable pseudo-labels, ensuring robust pseudo-supervision. Our approach mitigates the risk of noisy pseudo-labels, steering decision boundaries toward true data distribution and improving generalization to unseen data. Evaluations on Cataract and Cholec80 datasets show our method outperforms state-of-the-art SSL approaches, consistently surpassing both supervised and SSL baselines across various network architectures.

cross Federated Learning With Individualized Privacy Through Client Sampling

Authors: Lucas Lange, Ole Borchardt, Erhard Rahm

Abstract: With growing concerns about user data collection, individualized privacy has emerged as a promising solution to balance protection and utility by accounting for diverse user privacy preferences. Instead of enforcing a uniform level of anonymization for all users, this approach allows individuals to choose privacy settings that align with their comfort levels. Building on this idea, we propose an adapted method for enabling Individualized Differential Privacy (IDP) in Federated Learning (FL) by handling clients according to their personal privacy preferences. By extending the SAMPLE algorithm from centralized settings to FL, we calculate client-specific sampling rates based on their heterogeneous privacy budgets and integrate them into a modified IDP-FedAvg algorithm. We test this method under realistic privacy distributions and multiple datasets. The experimental results demonstrate that our approach achieves clear improvements over uniform DP baselines, reducing the trade-off between privacy and utility. Compared to the alternative SCALE method in related work, which assigns differing noise scales to clients, our method performs notably better. However, challenges remain for complex tasks with non-i.i.d. data, primarily stemming from the constraints of the decentralized setting.

cross In-Context Meta LoRA Generation

Authors: Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo

Abstract: Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, a Conditional Variational Autoencoder (CVAE). The CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using the CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1% of the storage of the original LoRA.

cross PulmoFusion: Advancing Pulmonary Health with Efficient Multi-Modal Fusion

Authors: Ahmed Sharshar, Yasser Attia, Mohammad Yaqub, Mohsen Guizani

Abstract: Traditional remote spirometry lacks the precision required for effective pulmonary monitoring. We present a novel, non-invasive approach using multimodal predictive models that integrate RGB or thermal video data with patient metadata. Our method leverages energy-efficient Spiking Neural Networks (SNNs) for the regression of Peak Expiratory Flow (PEF) and classification of Forced Expiratory Volume (FEV1) and Forced Vital Capacity (FVC), using lightweight CNNs to overcome SNN limitations in regression tasks. Multimodal data integration is improved with a Multi-Head Attention Layer, and we employ K-Fold validation and ensemble learning to boost robustness. Using thermal data, our SNN models achieve 92% accuracy on a breathing-cycle basis and 99.5% patient-wise. PEF regression models attain Relative RMSEs of 0.11 (thermal) and 0.26 (RGB), with an MAE of 4.52% for FEV1/FVC predictions, establishing state-of-the-art performance. Code and dataset can be found on https://github.com/ahmed-sharshar/RespiroDynamics.git

URLs: https://github.com/ahmed-sharshar/RespiroDynamics.git

cross Glioma Multimodal MRI Analysis System for Tumor Layered Diagnosis via Multi-task Semi-supervised Learning

Authors: Yihao Liu, Zhihao Cui, Liming Li, Junjie You, Xinle Feng, Jianxin Wang, Xiangyu Wang, Qing Liu, Minghua Wu

Abstract: Gliomas are the most common primary tumors of the central nervous system. Multimodal MRI is widely used for the preliminary screening of gliomas and plays a crucial role in auxiliary diagnosis, therapeutic efficacy, and prognostic evaluation. Currently, the computer-aided diagnostic studies of gliomas using MRI have focused on independent analysis events such as tumor segmentation, grading, and radiogenomic classification, without studying inter-dependencies among these events. In this study, we propose a Glioma Multimodal MRI Analysis System (GMMAS) that utilizes a deep learning network for processing multiple events simultaneously, leveraging their inter-dependencies through an uncertainty-based multi-task learning architecture and synchronously outputting tumor region segmentation, glioma histological subtype, IDH mutation genotype, and 1p/19q chromosome disorder status. Compared with the reported single-task analysis models, GMMAS improves the precision across tumor layered diagnostic tasks. Additionally, we have employed a two-stage semi-supervised learning method, enhancing model performance by fully exploiting both labeled and unlabeled MRI samples. Further, by utilizing an adaptation module based on knowledge self-distillation and contrastive learning for cross-modal feature extraction, GMMAS exhibited robustness in situations of modality absence and revealed the differing significance of each MRI modality. Finally, based on the analysis outputs of the GMMAS, we created a visual and user-friendly platform for doctors and patients, introducing GMMAS-GPT to generate personalized prognosis evaluations and suggestions.

cross Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Authors: Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan

Abstract: In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

cross Aggregation Schemes for Single-Vector WSI Representation Learning in Digital Pathology

Authors: Sobhan Hemati, Ghazal Alabtah, Saghir Alfasly, H. R. Tizhoosh

Abstract: A crucial step to efficiently integrate Whole Slide Images (WSIs) in computational pathology is assigning a single high-quality feature vector, i.e., one embedding, to each WSI. With the existence of many pre-trained deep neural networks and the emergence of foundation models, extracting embeddings for sub-images (i.e., tiles or patches) is straightforward. However, for WSIs, given their high resolution and gigapixel nature, inputting them into existing GPUs as a single image is not feasible. As a result, WSIs are usually split into many patches. Feeding each patch to a pre-trained model, each WSI can then be represented by a set of patches, hence, a set of embeddings. Hence, in such a setup, WSI representation learning reduces to set representation learning where for each WSI we have access to a set of patch embeddings. To obtain a single embedding from a set of patch embeddings for each WSI, multiple set-based learning schemes have been proposed in the literature. In this paper, we evaluate the WSI search performance of multiple recently developed aggregation techniques (mainly set representation learning techniques) including simple average or max pooling operations, Deep Sets, Memory networks, Focal attention, Gaussian Mixture Model (GMM) Fisher Vector, and deep sparse and binary Fisher Vector on four different primary sites including bladder, breast, kidney, and Colon from TCGA. Further, we benchmark the search performance of these methods against the median of minimum distances of patch embeddings, a non-aggregating approach used for WSI retrieval.
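
To make the distinction between aggregating and non-aggregating approaches concrete, the sketch below shows the two simplest aggregation schemes named in this abstract (average and max pooling over a set of patch embeddings) alongside a median-of-minimum-distances comparison between two patch sets. The function names, the Euclidean distance, and the one-directional form of the median-of-minimum computation are assumptions for illustration, not the paper's implementation.

```python
import math
import statistics

def mean_pool(embeddings):
    """Single WSI vector as the element-wise mean of its patch embeddings."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def max_pool(embeddings):
    """Single WSI vector as the element-wise maximum of its patch embeddings."""
    return [max(col) for col in zip(*embeddings)]

def median_of_min_distances(set_a, set_b):
    """Non-aggregating comparison: for each patch in set_a, take the distance
    to its nearest patch in set_b, then return the median of those minima."""
    minima = [min(math.dist(a, b) for b in set_b) for a in set_a]
    return statistics.median(minima)
```

Pooling yields one fixed-length vector per slide that can be indexed directly, whereas the median-of-minimum approach compares the full patch sets at query time.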

replace Semantic and structural image segmentation for prosthetic vision

Authors: Melani Sanchez-Garcia, Ruben Martinez-Cantin, Jose J. Guerrero

Abstract: Prosthetic vision is being applied to partially recover the retinal stimulation of visually impaired people. However, the phosphenic images produced by the implants have very limited information bandwidth due to the poor resolution and lack of color or contrast. The ability of prosthetic users to recognize objects and understand scenes in real environments is severely restricted. Computer vision can play a key role in overcoming these limitations and optimizing the visual information in simulated prosthetic vision, improving the amount of information that is presented. We present a new approach to build a schematic representation of indoor environments for phosphene images. The proposed method combines a variety of convolutional neural networks for extracting and conveying relevant information about the scene such as structural informative edges of the environment and silhouettes of segmented objects. Experiments were conducted with normally sighted subjects using a Simulated Prosthetic Vision system. The results show good accuracy for object recognition and room identification tasks for indoor scenes using the proposed approach, compared to other image processing methods.

replace iMatching: Imperative Correspondence Learning

Authors: Zitong Zhan, Dasong Gao, Yun-Jou Lin, Youjie Xia, Chen Wang

Abstract: Learning feature correspondence is a foundational task in computer vision, holding immense importance for downstream applications such as visual odometry and 3D reconstruction. Despite recent progress in data-driven models, feature correspondence learning is still limited by the lack of accurate per-pixel correspondence labels. To overcome this difficulty, we introduce a new self-supervised scheme, imperative learning (IL), for training feature correspondence. It enables correspondence learning on arbitrary uninterrupted videos without any camera pose or depth labels, heralding a new era for self-supervised correspondence learning. Specifically, we formulated the problem of correspondence learning as a bilevel optimization, which takes the reprojection error from bundle adjustment as a supervisory signal for the model. To avoid large memory and computation overhead, we leverage the stationary point to effectively back-propagate the implicit gradients through bundle adjustment. Through extensive experiments, we demonstrate superior performance on tasks including feature matching and pose estimation, in which we obtained an average of 30% accuracy gain over the state-of-the-art matching models.
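
The supervisory signal mentioned in this abstract, the reprojection error from bundle adjustment, is built from a simple per-point residual. The sketch below computes that residual for a single 3D point already expressed in the camera frame, assuming an ideal pinhole camera without distortion. It does not reproduce the bilevel optimization or the implicit gradient computation; function and parameter names are illustrative.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates
    with a pinhole model (focal lengths fx, fy; principal point cx, cy)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(point_cam, observed_px, fx, fy, cx, cy):
    """Euclidean pixel distance between the projected point and the
    observed (matched) feature location."""
    u, v = project(point_cam, fx, fy, cx, cy)
    return math.hypot(u - observed_px[0], v - observed_px[1])
```

Bundle adjustment minimizes the sum of such residuals over all matched points and camera poses, so a better correspondence model should lead to a lower residual after optimization.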

replace A Survey on Quality Metrics for Text-to-Image Generation

Authors: Sebastian Hartwig, Dominik Engel, Leon Sick, Hannah Kniesel, Tristan Payer, Poonam Poonam, Michael Glöckler, Alex Bäuerle, Timo Ropinski

Abstract: AI-based text-to-image models do not only excel at generating realistic images, they also give designers more and more fine-grained control over the image content. Consequently, these approaches have gathered increased attention within the computer graphics research community, which has been historically devoted towards traditional rendering techniques, that offer precise control over scene parameters (e.g., objects, materials, and lighting). While the quality of conventionally rendered images is assessed through well established image quality metrics, such as SSIM or PSNR, the unique challenges of text-to-image generation require other, dedicated quality metrics. These metrics must be able to not only measure overall image quality, but also how well images reflect given text prompts, whereby the control of scene and rendering parameters is interweaved. Within this survey, we provide a comprehensive overview of such text-to-image quality metrics, and propose a taxonomy to categorize these metrics. Our taxonomy is grounded in the assumption, that there are two main quality criteria, namely compositional quality and general quality, that contribute to the overall image quality. Besides the metrics, this survey covers dedicated text-to-image benchmark datasets, over which the metrics are frequently computed. Finally, we identify limitations and open challenges in the field of text-to-image generation, and derive guidelines for practitioners conducting text-to-image evaluation.

replace FlexCap: Describe Anything in Images in Controllable Detail

Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar

Abstract: We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications: first, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io

URLs: https://flex-cap.github.io

replace Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models

Authors: Konstantinos Vilouras, Pedro Sanchez, Alison Q. O'Neil, Sotirios A. Tsaftaris

Abstract: Localizing the exact pathological regions in a given medical scan is an important imaging problem that traditionally requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to perform this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains cross-attention mechanisms that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any training on the target task, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance at https://github.com/vios-s.

URLs: https://github.com/vios-s
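
Editor's sketch (not from the paper): the abstract reports mean IoU as one of its metrics. The snippet below illustrates the usual intersection-over-union computation, assuming predictions and ground truth are binary masks stored as nested lists; how the authors threshold their heatmaps is not stated here, and the names `mask_iou` and `mean_iou` are illustrative.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (a convention, assumed here).
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a list of (prediction, ground truth) mask pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)
```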

replace Exposing Image Classifier Shortcuts with Counterfactual Frequency (CoF) Tables

Authors: James Hinns, David Martens

Abstract: The rise of deep learning in image classification has brought unprecedented accuracy but also highlighted a key issue: the use of 'shortcuts' by models. Such shortcuts are easy-to-learn patterns from the training data that fail to generalise to new data. Examples include the use of a copyright watermark to recognise horses, snowy background to recognise huskies, or ink markings to detect malignant skin lesions. The explainable AI (XAI) community has suggested using instance-level explanations to detect shortcuts without external data, but this requires the examination of many explanations to confirm the presence of such shortcuts, making it a labour-intensive process. To address these challenges, we introduce Counterfactual Frequency (CoF) tables, a novel approach that aggregates instance-based explanations into global insights, and exposes shortcuts. The aggregation implies the need for some semantic concepts to be used in the explanations, which we solve by labelling the segments of an image. We demonstrate the utility of CoF tables across several datasets, revealing the shortcuts learned from them.

replace GMT: Guided Mask Transformer for Leaf Instance Segmentation

Authors: Feng Chen, Sotirios A. Tsaftaris, Mario Valerio Giuffrida

Abstract: Leaf instance segmentation is a challenging multi-instance segmentation task, aiming to separate and delineate each leaf in an image of a plant. Accurate segmentation of each leaf is crucial for plant-related applications such as the fine-grained monitoring of plant growth and crop yield estimation. This task is challenging because of the high similarity (in shape and colour), great size variation, and heavy occlusions among leaf instances. Furthermore, the typically small size of annotated leaf datasets makes it more difficult to learn the distinctive features needed for precise segmentation. We hypothesise that the key to overcoming these challenges lies in the specific spatial patterns of leaf distribution. In this paper, we propose the Guided Mask Transformer (GMT), which leverages and integrates leaf spatial distribution priors into a Transformer-based segmentor. These spatial priors are embedded in a set of guide functions that map leaves at different positions into a more separable embedding space. Our GMT consistently outperforms the state-of-the-art on three public plant datasets. Our code is available at https://github.com/vios-s/gmt-leaf-ins-seg.

URLs: https://github.com/vios-s/gmt-leaf-ins-seg

replace GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI

Authors: Zhaojie Fang, Shenghao Zhu, Yifei Chen, Binfeng Zou, Fan Jia, Chang Liu, Xiang Feng, Linwei Qiu, Feiwei Qin, Jin Fan, Changbiao Chu, Changmiao Wang

Abstract: Alzheimer's Disease (AD) is a progressive, irreversible neurodegenerative disorder that often originates from Mild Cognitive Impairment (MCI). This progression results in significant memory loss and severely affects patients' quality of life. Clinical trials have consistently shown that early and targeted interventions for individuals with MCI may slow or even prevent the advancement of AD. Research indicates that accurate medical classification requires diverse multimodal data, including detailed assessment scales and neuroimaging techniques like Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET). However, simultaneously collecting the aforementioned three modalities for training presents substantial challenges. To tackle these difficulties, we propose GFE-Mamba, a multimodal classifier founded on Generative Feature Extractor. The intermediate features provided by this Extractor can compensate for the shortcomings of PET and achieve profound multimodal fusion in the classifier. The Mamba block, as the backbone of the classifier, enables it to efficiently extract information from long-sequence scale information. Pixel-level Bi-cross Attention supplements pixel-level information from MRI and PET. We provide our rationale for developing this cross-temporal progression prediction dataset and the pre-trained Extractor weights. Our experimental findings reveal that the GFE-Mamba model effectively predicts the progression from MCI to AD and surpasses several leading methods in the field. Our source code is available at https://github.com/Tinysqua/GFE-Mamba.

URLs: https://github.com/Tinysqua/GFE-Mamba

replace Network Fission Ensembles for Low-Cost Self-Ensembles

Authors: Hojung Lee, Jong-Seok Lee

Abstract: Recent ensemble learning methods for image classification have been shown to improve classification accuracy with low extra cost. However, they still require multiple trained models for ensemble inference, which eventually becomes a significant burden when the model size increases. In this paper, we propose a low-cost ensemble learning and inference, called Network Fission Ensembles (NFE), by converting a conventional network itself into a multi-exit structure. Starting from a given initial network, we first prune some of the weights to reduce the training burden. We then group the remaining weights into several sets and create multiple auxiliary paths using each set to construct multi-exits. We call this process Network Fission. Through this, multiple outputs can be obtained from a single network, which enables ensemble learning. Since this process simply changes the existing network structure to multi-exits without using additional networks, there is no extra computational burden for ensemble learning and inference. Moreover, by learning from multiple losses of all exits, the multi-exits improve performance via regularization, and high performance can be achieved even with increased network sparsity. With our simple yet effective method, we achieve significant improvement compared to existing ensemble methods. The code is available at https://github.com/hjdw2/NFE.

URLs: https://github.com/hjdw2/NFE
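
Editor's sketch (not the authors' implementation): the abstract says multiple outputs are obtained from a single multi-exit network and combined as an ensemble. One common way to combine exits, assumed here, is to average their softmax probabilities; the abstract does not state the exact combination rule, and `ensemble_predict` is an illustrative name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(exit_logits):
    """Average per-exit class probabilities; return (mean probabilities, predicted class)."""
    probs = [softmax(logits) for logits in exit_logits]
    n_classes = len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    predicted = max(range(n_classes), key=mean_probs.__getitem__)
    return mean_probs, predicted
```

Because all exits share one forward pass through the common layers, averaging their outputs adds only the cost of the auxiliary paths, which matches the low-cost motivation in the abstract.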

replace PIR: Photometric Inverse Rendering with Shading Cues Modeling and Surface Reflectance Regularization

Authors: Jingzhi Bao, Guanying Chen, Shuguang Cui

Abstract: This paper addresses the problem of inverse rendering from photometric images. Existing approaches for this problem suffer from the effects of self-shadows, inter-reflections, and lack of constraints on the surface reflectance, leading to inaccurate decomposition of reflectance and illumination due to the ill-posed nature of inverse rendering. In this work, we propose a new method for neural inverse rendering. Our method jointly optimizes the light source position to account for the self-shadows in images, and computes indirect illumination using a differentiable rendering layer and an importance sampling strategy. To enhance surface reflectance decomposition, we introduce a new regularization by distilling DINO features to foster accurate and consistent material decomposition. Extensive experiments on synthetic and real datasets demonstrate that our method outperforms the state-of-the-art methods in reflectance decomposition.

replace When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Authors: Pingping Zhang, Jinlong Li, Kecheng Chen, Meng Wang, Long Xu, Haoliang Li, Nicu Sebe, Sam Kwong, Shiqi Wang

Abstract: Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaption (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.

replace Real-Time Video Generation with Pyramid Attention Broadcast

Authors: Xuanlei Zhao, Xiaolong Jin, Kai Wang, Yang You

Abstract: We present Pyramid Attention Broadcast (PAB), a real-time, high quality and training-free approach for DiT-based video generation. Our method is founded on the observation that attention difference in the diffusion process exhibits a U-shaped pattern, indicating significant redundancy. We mitigate this by broadcasting attention outputs to subsequent steps in a pyramid style. It applies different broadcast strategies to each attention based on their variance for best efficiency. We further introduce broadcast sequence parallel for more efficient distributed inference. PAB demonstrates up to 10.5x speedup across three models compared to baselines, achieving real-time generation for up to 720p videos. We anticipate that our simple yet effective method will serve as a robust baseline and facilitate future research and application for video generation.

replace Segmentation and Smoothing Affect Explanation Quality More Than the Choice of Perturbation-based XAI Method for Image Explanations

Authors: Gustav Grund Pihlgren, Kary Främling

Abstract: Perturbation-based post-hoc image explanation methods are commonly used to explain image prediction models. These methods perturb parts of the input to measure how those parts affect the output. Since the methods only require the input and output they can be applied to any model, making them a popular choice to explain black-box models. While many different models exist and have been compared with one another, it remains poorly understood which parameters of the different methods are responsible for their varying performance. This work uses the Randomized Input Sampling for Explanations (RISE) method as a baseline to evaluate many combinations of mask sampling, segmentation techniques, smoothing, attribution calculation, and per-segment or per-pixel attribution, using a proxy metric. The results show that attribution calculation, which is frequently the focus of other works, has little impact on the results. Conversely, segmentation and per-pixel attribution, rarely examined parameters, have a significant impact. The implementation of and data gathered in this work are available online: https://github.com/guspih/post-hoc-image-perturbation and https://bit.ly/smooth-mask-perturbation

URLs: https://github.com/guspih/post-hoc-image-perturbation, https://bit.ly/smooth-mask-perturbation
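
Editor's sketch (not the code from the linked repository): the abstract uses RISE as its baseline, where random masks perturb the input and each mask is weighted by the model's output on the masked input. The toy version below assumes per-pixel binary masks on a small nested-list image and omits the low-resolution sampling, upsampling, and smoothing steps that the paper studies; `rise_saliency` and its parameters are illustrative.

```python
import random


def rise_saliency(image, model, n_masks=2000, keep_prob=0.5, seed=0):
    """Toy RISE: attribute model output to pixels via randomly masked inputs."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for _ in range(n_masks):
        # Keep each pixel independently with probability keep_prob.
        mask = [[1 if rng.random() < keep_prob else 0 for _ in range(width)]
                for _ in range(height)]
        masked = [[image[i][j] * mask[i][j] for j in range(width)]
                  for i in range(height)]
        score = model(masked)
        # Accumulate the mask weighted by the model score on the masked input.
        for i in range(height):
            for j in range(width):
                saliency[i][j] += score * mask[i][j]
    norm = n_masks * keep_prob  # expected number of times each pixel is kept
    return [[value / norm for value in row] for row in saliency]
```

For example, with a hypothetical model that only looks at the top-left pixel, `rise_saliency([[1.0, 1.0], [1.0, 1.0]], lambda img: img[0][0])` assigns the largest attribution to that pixel.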

replace ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation

Authors: Yanlin Jin, Rui-Yang Ju, Haojun Liu, Yuzhong Zhong

Abstract: Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce the cross-attention mechanism to enhance the explainability of PoseNet and have revealed that driving direction of the vehicle can be explained through the attention weights. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.

replace EchoFM: Foundation Model for Generalizable Echocardiogram Analysis

Authors: Sekeun Kim, Pengfei Jin, Sifan Song, Cheng Chen, Yiwei Li, Hui Ren, Xiang Li, Tianming Liu, Quanzheng Li

Abstract: Foundation models have recently gained significant attention because of their generalizability and adaptability across multiple tasks and data distributions. Although medical foundation models have emerged, solutions for cardiac imaging, especially echocardiography videos, are still unexplored. In this paper, we introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability patterns through a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. This framework can effectively capture the spatio-temporal dynamics of echocardiography and learn the representative video features without any labels. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos covering 26 scan views across different imaging modes, with up to 20 million frames of images. The pre-trained EchoFM can then be easily adapted and fine-tuned for a variety of downstream tasks, serving as a robust backbone model. Our evaluation was systematically designed for four downstream tasks after the echocardiography examination routine. Experimental results show that EchoFM surpasses state-of-the-art methods, including specialized echocardiography methods, self-supervised pre-training models, and general-purpose pre-trained foundation models, across all downstream tasks.

replace Transformer-Based Auxiliary Loss for Face Recognition Across Age Variations

Authors: Pritesh Prakash, Ashish Jacob Sam, S Umamaheswaran

Abstract: Aging presents a significant challenge in face recognition, as changes in skin texture and tone can alter facial features over time, making it particularly difficult to compare images of the same individual taken years apart, such as in long-term identification scenarios. Transformer networks have the strength to preserve sequential spatial relationships caused by aging effect. This paper presents a technique for loss evaluation that uses a transformer network as an additive loss in the face recognition domain. The standard metric loss function typically takes the final embedding of the main CNN backbone as its input. Here, we employ a transformer-metric loss, a combined approach that integrates both transformer-loss and metric-loss. This research intends to analyze the transformer behavior on the convolution output when the CNN outcome is arranged in a sequential vector. These sequential vectors have the potential to overcome the texture or regional structure referred to as wrinkles or sagging skin affected by aging. The transformer encoder takes input from the contextual vectors obtained from the final convolution layer of the network. The learned features can be more age-invariant, complementing the discriminative power of the standard metric loss embedding. With this technique, we use transformer loss with various base metric-loss functions to evaluate the effect of the combined loss functions. We observe that such a configuration allows the network to achieve SoTA results in LFW and age-variant datasets (CA-LFW and AgeDB). This research expands the role of transformers in the machine vision domain and opens new possibilities for exploring transformers as a loss function.

replace Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Authors: Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, Guangyao Shi

Abstract: Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification. Despite their rapid advancements in research and growing popularity in applications, a comprehensive survey of existing studies on VLMs is notably lacking, particularly for researchers aiming to leverage VLMs in their specific domains. To this end, we provide a systematic overview of VLMs in the following aspects: model information of the major VLMs developed over the past five years (2019-2024); the main architectures and training methods of these VLMs; summary and categorization of the popular benchmarks and evaluation metrics of VLMs; the applications of VLMs including embodied agents, robotics, and video generation; the challenges and issues faced by current VLMs such as hallucination, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git.

URLs: https://github.com/zli12321/Awesome-VLM-Papers-And-Models.git

replace CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization

Authors: José Rodríguez-Ortega (Nemotec, Madrid, Spain, Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain), Francisco Pérez-Hernández (Nemotec, Madrid, Spain), Siham Tabik (Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain)

Abstract: Identifying anatomical landmarks in 3D dental models is crucial for orthodontic treatment. Manually placing these key points is complex, time-consuming, and requires expert knowledge. While some machine learning methods have been proposed for automatic tooth landmark detection in 3D Intraoral Scans (IOS), research remains limited, with no fully end-to-end approaches that avoid teeth segmentation. We propose CHaRNet (Conditioned Heatmap Regression Network), the first end-to-end deep learning method for tooth landmark detection in 3D IOS. Unlike traditional two-stage methods that segment teeth before detecting landmarks, CHaRNet directly detects landmarks on the input point cloud. It consists of four key modules: (1) a point cloud encoder, (2) a point cloud decoder with a heatmap regression head, (3) a teeth presence classification head, and (4) the innovative Conditioned Heatmap Regression (CHaR) module. The CHaR module refines landmark regression by leveraging teeth presence classification, enabling dynamic adaptation to cases with missing teeth and improving accuracy in complex dental models. We evaluate CHaRNet using five point cloud learning algorithms to validate the effectiveness of the CHaR module and test it on a clinical dataset of 1,214 annotated 3D dental models. Both the dataset and code will be publicly released to address the lack of open datasets in orthodontics, promote benchmarking, and inspire new research. CHaRNet achieves a Mean Euclidean Distance Error (MEDE) of 1.28 mm and a Mean Success Ratio (MSR) of 82.40%, demonstrating robust performance. Notably, it excels in handling irregular dental geometries, such as models with missing teeth. This end-to-end approach streamlines orthodontic workflows, improves 3D IOS analysis precision, and facilitates efficient computer-assisted treatment planning.

replace SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Robust Vision-Language Model Adaptation

Authors: Shuvendu Roy, Ali Etemad

Abstract: We present SelfPrompt, a novel prompt-tuning approach for vision-language models (VLMs) in a semi-supervised learning setup. Existing methods for tuning VLMs in semi-supervised setups struggle with the negative impact of the miscalibrated VLMs on pseudo-labelling, and the accumulation of noisy pseudo-labels. SelfPrompt addresses these challenges by introducing a cluster-guided pseudo-labelling method that improves pseudo-label accuracy, and a confidence-aware semi-supervised learning module that maximizes the utilization of unlabelled data by combining supervised learning and weakly-supervised learning. Additionally, we investigate our method in an active semi-supervised learning setup, where the labelled set is strategically selected to ensure the best utilization of a limited labelling budget. To this end, we propose a weakly-supervised sampling technique that selects a diverse and representative labelled set, which can be seamlessly integrated into existing methods to enhance their performance. We conduct extensive evaluations across 13 datasets, significantly surpassing state-of-the-art performances with average improvements of 6.23% in standard semi-supervised learning, 6.25% in active semi-supervised learning, and 4.9% in base-to-novel generalization, using a 2-shot setup. Furthermore, SelfPrompt shows excellent generalization in single-shot settings, achieving an average improvement of 11.78%.

replace Towards Robust Unsupervised Attention Prediction in Autonomous Driving

Authors: Mengshi Qi, Xiaoyang Bi, Pengfei Zhu, Huadong Ma

Abstract: Robustly predicting attention regions of interest for self-driving systems is crucial for driving safety but presents significant challenges due to the labor-intensive nature of obtaining large-scale attention labels and the domain gap between self-driving scenarios and natural scenes. These challenges are further exacerbated by complex traffic environments, including camera corruption under adverse weather, noise interferences, and central bias from long-tail distributions. To address these issues, we propose a robust unsupervised attention prediction method. An Uncertainty Mining Branch refines predictions by analyzing commonalities and differences across multiple pre-trained models on natural scenes, while a Knowledge Embedding Block bridges the domain gap by incorporating driving knowledge to adaptively enhance pseudo-labels. Additionally, we introduce RoboMixup, a novel data augmentation method that improves robustness against corruption through soft attention and dynamic augmentation, and mitigates central bias by integrating random cropping into Mixup as a regularizer. To systematically evaluate robustness in self-driving attention prediction, we introduce the DriverAttention-C benchmark, comprising over 100k frames across three subsets: BDD-A-C, DR(eye)VE-C, and DADA-2000-C. Our method achieves performance equivalent to or surpassing fully supervised state-of-the-art approaches on three public datasets and the proposed robustness benchmark, reducing relative corruption degradation by 58.8% and 52.8%, and improving central bias robustness by 12.4% and 11.4% in KLD and CC metrics, respectively. Code and data are available at https://github.com/zaplm/DriverAttention.

URLs: https://github.com/zaplm/DriverAttention

replace Automatic Calibration of a Multi-Camera System with Limited Overlapping Fields of View for 3D Surgical Scene Reconstruction

Authors: Tim Flückiger, Jonas Hein, Valery Fischer, Philipp Fürnstahl, Lilian Calvet

Abstract: The purpose of this study is to develop an automated and accurate external camera calibration method for multi-camera systems used in 3D surgical scene reconstruction (3D-SSR), eliminating the need for operator intervention or specialized expertise. The method specifically addresses the problem of limited overlapping fields of view caused by significant variations in optical zoom levels and camera locations. We contribute a novel, fast, and fully automatic calibration method based on the projection of multi-scale markers (MSMs) using a ceiling-mounted projector. MSMs consist of 2D patterns projected at varying scales, ensuring accurate extraction of well distributed point correspondences across significantly different viewpoints and zoom levels. Validation is performed using both synthetic and real data captured in a mock-up OR, with comparisons to traditional manual marker-based methods as well as markerless calibration methods. The method achieves accuracy comparable to manual, operator-dependent calibration methods while exhibiting higher robustness under conditions of significant differences in zoom levels. Additionally, we show that state-of-the-art Structure-from-Motion (SfM) pipelines are ineffective in 3D-SSR settings, even when additional texture is projected onto the OR floor. The use of a ceiling-mounted entry-level projector proves to be an effective alternative to operator-dependent, traditional marker-based methods, paving the way for fully automated 3D-SSR.

replace PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

Authors: Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang

Abstract: Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.

replace Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion

Authors: Shengyuan Liu, Zhen Chen, Qiushi Yang, Weihao Yu, Di Dong, Jiancong Hu, Yixuan Yuan

Abstract: Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms fail to accurately generate the details of polyp boundary regions and typically require medical priors to specify plausible locations and shapes of polyps, which limits the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first fully automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate state-of-the-art generation quality, and the synthetic images can improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zero-shot generalizability on other datasets. The source code is available at https://github.com/CUHK-AIM-Group/Polyp-Gen.

URLs: https://github.com/CUHK-AIM-Group/Polyp-Gen

replace Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models

Authors: Muhammad Atta ur Rahman

Abstract: Self-supervised learning can resolve numerous image or linguistic processing problems when effectively trained. This study investigated simple yet efficient methods for adapting previously learned foundation models for open-vocabulary semantic segmentation tasks. Our research proposed "Beyond-Labels," a lightweight transformer-based fusion module that uses a handful of image segmentation data to fuse frozen image representations with language concepts. Furthermore, we efficiently captured positional information in images using Fourier embeddings, thus improving the generalization across various image sizes. Extensive ablation tests were performed to investigate the important components of our proposed method; when tested against the common benchmark PASCAL-5i, it demonstrated superior performance despite being trained on frozen image and language characteristics.

replace Synthesizing 3D Abstractions by Inverting Procedural Buildings with Transformers

Authors: Maximilian Dax, Jordi Berbel, Jan Stria, Leonidas Guibas, Urs Bergmann

Abstract: We generate abstractions of buildings, reflecting the essential aspects of their geometry and structure, by learning to invert procedural models. We first build a dataset of abstract procedural building models paired with simulated point clouds and then learn the inverse mapping through a transformer. Given a point cloud, the trained transformer then infers the corresponding abstracted building in terms of a programmatic language description. This approach leverages expressive procedural models developed for gaming and animation, and thereby retains desirable properties such as efficient rendering of the inferred abstractions and strong priors for regularity and symmetry. Our approach achieves good reconstruction accuracy in terms of geometry and structure, as well as structurally consistent inpainting.

replace-cross A Deep Learning-Based Unified Framework for Red Lesions Detection on Retinal Fundus Images

Authors: Norah Asiri, Muhammad Hussain, Fadwa Al Adel

Abstract: Red lesions, namely microaneurysms (MAs) and hemorrhages (HMs), are the early signs of diabetic retinopathy (DR). The automatic detection of MAs and HMs on retinal fundus images is a challenging task. Most of the existing methods detect either only MAs or only HMs because of the differences in their texture, size, and morphology. Though some methods detect both MAs and HMs, they suffer from the curse of dimensionality of shape and color features and fail to detect all shape variations of HMs, such as flame-shaped ones. Leveraging the progress in deep learning, we proposed a two-stream red lesion detection system that deals simultaneously with small and large red lesions. For this system, we introduced a new ROI candidate generation method for large red lesions on fundus images; it is based on blood vessel segmentation and morphological operations, and it reduces the computational complexity and enhances the detection accuracy by generating a small number of potential candidates. For detection, we proposed a framework with two streams. We used a pretrained VGGNet as the backbone model and carried out extensive experiments to tune it for vessel segmentation and candidate generation, and finally to learn the appropriate mapping, which yields better detection of red lesions compared with the state-of-the-art methods. The experimental results validated the effectiveness of the system in the detection of both MAs and HMs. It yields higher per-lesion detection performance: on DiaretDB1-MA, its sensitivity (SN) equals 0.8589 with a good FROC score under 8 FPIs (FROC=0.7518); on DiaretDB1-HM, SN=0.7552 with good FROC scores under 2, 4, and 8 FPIs; on e-ophtha, SN=0.8157 with an overall FROC=0.4537; and on the ROCh dataset, FROC=0.3461, which is higher than the state-of-the-art methods. For DR screening, the system performs well with good AUC on the DiaretDB1-MA, DiaretDB1-HM, and e-ophtha datasets.

replace-cross Out-of-distribution detection using normalizing flows on the data manifold

Authors: Seyedeh Fatemeh Razavi, Mohammad Mahdi Mehmanchi, Reshad Hosseini, Mostafa Tavassolipour

Abstract: Using the intuition that out-of-distribution data have lower likelihoods, a common approach for out-of-distribution detection involves estimating the underlying data distribution. Normalizing flows are likelihood-based generative models providing a tractable density estimation via dimension-preserving invertible transformations. Conventional normalizing flows are prone to fail in out-of-distribution detection because of the well-known curse of dimensionality problem of likelihood-based models. To solve this problem, some works modify the likelihood, for example by incorporating a data complexity measure. We observed that these modifications are still insufficient. According to the manifold hypothesis, real-world data often lie on a low-dimensional manifold. Therefore, we proceed by estimating the density on a low-dimensional manifold and calculating a distance from the manifold as a measure for out-of-distribution detection. We propose a powerful criterion that combines this measure with the modified likelihood measure based on data complexity. Extensive experimental results show that incorporating manifold learning while accounting for the estimation of data complexity improves the out-of-distribution detection ability of normalizing flows. This improvement is achieved without modifying the model structure or using auxiliary out-of-distribution data during training.
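
The idea of pairing an on-manifold density with an off-manifold distance can be sketched on a toy problem. The example below is only an illustration under strong assumptions: the "manifold" is a straight line through the origin in 2D rather than a learned one, a standard Gaussian stands in for the normalizing flow, the two terms are combined by a plain weighted sum, and the data-complexity correction mentioned in the abstract is omitted. All function names and the `distance_weight` parameter are hypothetical.

```python
import math

def project_onto_line(point, direction):
    """Return (coordinate along the line, distance from the line) in 2D."""
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    t = point[0] * ux + point[1] * uy          # on-manifold coordinate
    rx, ry = point[0] - t * ux, point[1] - t * uy
    return t, math.hypot(rx, ry)               # residual = off-manifold distance

def gaussian_log_density(t, mean=0.0, std=1.0):
    """Log density of a 1D Gaussian, standing in for a flow's log-likelihood."""
    z = (t - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def ood_score(point, direction=(1.0, 0.0), distance_weight=1.0):
    """Higher score means more likely out-of-distribution (assumed convention)."""
    t, dist = project_onto_line(point, direction)
    return -gaussian_log_density(t) + distance_weight * dist
```

In this toy setup, a point such as `(0.0, 3.0)` has a high on-manifold likelihood after projection but lies far from the line, so only the distance term flags it; a point such as `(5.0, 0.0)` lies on the line but in a low-density region, so only the likelihood term flags it.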

replace-cross Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors

Authors: Peter Lorenz, Mario Fernandez, Jens Müller, Ullrich Köthe

Abstract: Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in real-world scenarios. In recent years, many OOD detectors have been developed, and even the benchmarking has been standardized, i.e., OpenOOD. The number of post-hoc detectors is growing fast. They offer an option to protect a pre-trained classifier against natural distribution shifts and claim to be ready for real-world scenarios. However, their effectiveness in dealing with adversarial examples (AdEx) has been neglected in most studies. In cases where an OOD detector includes AdEx in its experiments, the lack of uniform parameters for AdEx makes it difficult to accurately evaluate the performance of the OOD detector. This paper investigates the adversarial robustness of 16 post-hoc detectors against various evasion attacks. It also discusses a roadmap for adversarial defense in OOD detectors that would help adversarial robustness. We believe that level 1 (AdEx on a unified dataset) should be added to any OOD detector to see the limitations. We added the last level in the roadmap (defense against adaptive attacks) for integrity from an adversarial machine learning (AML) point of view, although we do not believe it is the ultimate goal for OOD detectors.

replace-cross Competency-Aware Planning for Probabilistically Safe Navigation Under Perception Uncertainty

Authors: Sara Pohland, Claire Tomlin

Abstract: Perception-based navigation systems are useful for unmanned ground vehicle (UGV) navigation in complex terrains, where traditional depth-based navigation schemes are insufficient. However, these data-driven methods are highly dependent on their training data and can fail in surprising and dramatic ways with little warning. To ensure the safety of the vehicle and the surrounding environment, it is imperative that the navigation system is able to recognize the predictive uncertainty of the perception model and respond safely and effectively in the face of uncertainty. In an effort to enable safe navigation under perception uncertainty, we develop a probabilistic and reconstruction-based competency estimation (PaRCE) method to estimate the model's level of familiarity with an input image as a whole and with specific regions in the image. We find that the overall competency score can accurately predict whether a sample is correctly classified, misclassified, or out-of-distribution (OOD). We also confirm that the regional competency maps can accurately distinguish between familiar and unfamiliar regions across images. We then use this competency information to develop a planning and control scheme that enables effective navigation while maintaining a low probability of error. We find that the competency-aware scheme greatly reduces the number of collisions with unfamiliar obstacles, compared to a baseline controller with no competency awareness. Furthermore, the regional competency information is very valuable in enabling efficient navigation.

replace-cross A Robust Prototype-Based Network with Interpretable RBF Classifier Foundations

Authors: Sascha Saralajew, Ashish Rana, Thomas Villmann, Ammar Shaker

Abstract: Prototype-based classification learning methods are known to be inherently interpretable. However, this paradigm suffers from major limitations compared to deep models, such as lower performance. This led to the development of the so-called deep Prototype-Based Networks (PBNs), also known as prototypical parts models. In this work, we analyze these models with respect to different properties, including interpretability. In particular, we focus on the Classification-by-Components (CBC) approach, which uses a probabilistic model to ensure interpretability and can be used as a shallow or deep architecture. We show that this model has several shortcomings, like creating contradicting explanations. Based on these findings, we propose an extension of CBC that solves these issues. Moreover, we prove that this extension has robustness guarantees and derive a loss that optimizes robustness. Additionally, our analysis shows that most (deep) PBNs are related to (deep) RBF classifiers, which implies that our robustness guarantees generalize to shallow RBF classifiers. The empirical evaluation demonstrates that our deep PBN yields state-of-the-art classification accuracy on different benchmarks while resolving the interpretability shortcomings of other approaches. Further, our shallow PBN variant outperforms other shallow PBNs while being inherently interpretable and exhibiting provable robustness guarantees.
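
To make the link between prototype-based models and RBF classifiers concrete, here is a minimal sketch of a shallow RBF classifier over labeled prototypes. It is a generic textbook-style construction, not the CBC model or the extension proposed in the paper; the function names, the Gaussian kernel, and the `bandwidth` parameter are assumptions for illustration.

```python
import math

def rbf_similarity(x, prototype, bandwidth=1.0):
    """Gaussian RBF response of a prototype to an input vector."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))

def classify(x, prototypes, bandwidth=1.0):
    """Classify x from a list of (prototype_vector, label) pairs.

    Returns the predicted label and the normalized per-class scores.
    Assumes x is close enough to some prototype that the scores do not
    all underflow to zero.
    """
    scores = {}
    for vector, label in prototypes:
        scores[label] = scores.get(label, 0.0) + rbf_similarity(x, vector, bandwidth)
    total = sum(scores.values())
    probabilities = {label: score / total for label, score in scores.items()}
    return max(probabilities, key=probabilities.get), probabilities
```

The prediction can be traced back to the individual prototype responses that make up each class score, which is the sense in which such classifiers are usually described as interpretable.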

replace-cross Stroke classification using Virtual Hybrid Edge Detection from in silico electrical impedance tomography data

Authors: Juan Pablo Agnelli, Fernando S. Moura, Siiri Rautio, Melody Alsaker, Rashmi Murthy, Matti Lassas, Samuli Siltanen

Abstract: Electrical impedance tomography (EIT) is a non-invasive imaging method for recovering the internal conductivity of a physical body from electric boundary measurements. EIT combined with machine learning has shown promise for the classification of strokes. However, most previous works have used raw EIT voltage data as network inputs. We build upon a recent development which suggested the use of special noise-robust Virtual Hybrid Edge Detection (VHED) functions as network inputs, although that work used only highly simplified and mathematically ideal models. In this work we strengthen the case for the use of EIT, and VHED functions especially, for stroke classification. We design models with high detail and mathematical realism to test the use of VHED functions as inputs. Virtual patients are created using a physically detailed 2D head model which includes features known to create challenges in real-world imaging scenarios. Conductivity values are drawn from statistically realistic distributions, and phantoms are afflicted with either hemorrhagic or ischemic strokes of various shapes and sizes. Simulated noisy EIT electrode data, generated using the realistic Complete Electrode Model (CEM) as opposed to the mathematically ideal continuum model, is processed to obtain VHED functions. We compare the use of VHED functions as inputs against the alternative paradigm of using raw EIT voltages. Our results show that (i) stroke classification can be performed with high accuracy using 2D EIT data from physically detailed and mathematically realistic models, and (ii) in the presence of noise, VHED functions outperform raw data as network inputs.