Authors: Hasan Arief
Abstract: No standardized benchmark exists for evaluating OCR on food packaging, despite its critical role in automated halal food verification. Existing benchmarks target documents or scene text, missing the unique challenges of ingredient labels: curved surfaces, dense multilingual text, and sub-8pt fonts. We present HalalBench, the first open multilingual benchmark for food packaging OCR, comprising 1,043 images (50 real, 993 synthetic) with 36,438 annotations in COCO format spanning 14 languages. We evaluate four engines: docTR achieves F1=0.193, ML Kit 0.180, EasyOCR 0.167, while all fail on Japanese (F1=0.000). A clustering ablation shows 36% F1 improvement from our post-processing algorithm. We validate findings through HalalLens (https://halallens.no), a production halal scanner serving 20+ countries. Dataset and code are released under open licenses.
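For readers who want to reproduce detection-level F1 numbers of this kind, a minimal sketch follows; the greedy IoU matching rule and the 0.5 threshold are common choices, not necessarily the benchmark's exact protocol.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_f1(pred_boxes, gt_boxes, thr=0.5):
    """Greedy one-to-one matching of predicted word boxes to ground truth at an IoU threshold."""
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= thr:
            matched_gt.add(best_j)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    prec = tp / (tp + fp + 1e-9)
    rec = tp / (tp + fn + 1e-9)
    return 2 * prec * rec / (prec + rec + 1e-9)

# Toy example: two predictions, two ground-truth word boxes.
print(detection_f1([[0, 0, 10, 5], [20, 0, 30, 5]],
                   [[1, 0, 10, 5], [50, 0, 60, 5]]))   # 0.5
```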
Authors: Jialu Liu, Yao Li, Zhuoheng Li, Huining Li, Ying Chen
Abstract: Augmented reality (AR) systems pose unique privacy risks due to their continuous capture of visual data. Existing AR privacy frameworks lack semantic understanding of visual content, limiting their effectiveness in detecting context-dependent privacy risks. We propose PrivAR, which leverages vision language models (VLMs) with chain-of-thought prompting for contextual privacy risk detection in AR environments. PrivAR uses visual scene cues to infer potential sensitive information types, such as identifying password notes in office environments through contextual reasoning. PrivAR detects and obfuscates textual content, preventing exposure of sensitive information while preserving contextual cues necessary for VLM inference. Additionally, we investigate contextually-informed warning interfaces to enhance user privacy awareness. Experiments on a real-world AR dataset show that PrivAR achieves superior accuracy (81.48%) and F1-score (84.62%) compared to baselines, while reducing privacy leakage rate to 17.58%. User studies evaluating contextually-informed warning interfaces provide insights into effective privacy-aware AR design.
Authors: Haopeng Jin
Abstract: Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens provide cheap residual exchange. FreqFormer is paired with a fused GPU execution plan that co-schedules dense, sparse, and local branches to cut kernel launches and memory traffic. We give a consistent complexity model, an orthonormal-decomposition view of approximation, and simulation-based systems numbers (throughput, arithmetic intensity, memory traffic, duration scaling). In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic versus dense attention while keeping a hardware-friendly pattern, supporting spectrally structured heterogeneous attention as a practical direction for long-video diffusion transformers.
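To make the claimed cost reduction concrete, a back-of-the-envelope complexity sketch is given below; the band split ratios, compression factor, block size, and window size are illustrative placeholders, not values from the paper.

```python
def attention_flops(n, d):
    """Approximate FLOPs of dense self-attention over n tokens with head dim d
    (two n x n x d matmuls, 2 FLOPs per multiply-add)."""
    return 2 * 2 * n * n * d

def freq_banded_flops(n, d, low_ratio=0.25, low_compress=8, block=256, window=512):
    """Heuristic cost model for a three-band split: dense attention on compressed
    low-frequency tokens, block-sparse attention on mid frequencies, and
    sliding-window attention on high frequencies."""
    n_low = int(n * low_ratio) // low_compress       # compressed low-frequency tokens
    n_mid = int(n * 0.5)
    n_high = n - int(n * low_ratio) - n_mid
    dense = attention_flops(n_low, d)
    block_sparse = 2 * 2 * n_mid * block * d         # each mid token attends within its block
    local = 2 * 2 * n_high * window * d              # each high token attends to a local window
    return dense + block_sparse + local

n, d = 64_000, 64
print(f"dense      : {attention_flops(n, d):.3e} FLOPs")
print(f"freq-banded: {freq_banded_flops(n, d):.3e} FLOPs")
```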
Authors: JiYang Wang, Jiawei Chen, Mengqi Xiao, Yu Cheng, Yangfu Li, Zhaoxia Yin
Abstract: Object level hallucination remains a central reliability challenge for vision language models (VLMs), particularly in binary object existence verification. Existing benchmarks emphasize aggregate accuracy but rarely disentangle whether errors stem from perceptual limitations or from the influence of contextual textual priors, leaving underlying failure mechanisms ambiguous. We introduce DO-Bench, a controlled diagnostic benchmark that isolates these sources through structured multimodal interventions. Rather than evaluating models in unconstrained settings, DO-Bench probes two complementary dimensions: the Prior Override dimension progressively strengthens contextual textual priors while holding visual evidence constant to assess resistance to prior pressure, and the Perception-Limited dimension incrementally enhances visual evidence from full-scene context to localized object crops to measure perceptual grounding strength. This paired design enables attribution of errors to prior suppression, perceptual insufficiency, or their interaction. We further define two diagnostic metrics, PriorRobust and PerceptionAbility, to quantify these behaviors consistently. Evaluations across diverse open- and closed-source VLMs reveal systematic differences in prior sensitivity and perceptual reliability, demonstrating that object hallucination reflects heterogeneous, mechanism dependent failure patterns beyond aggregate accuracy.
Authors: Zibo Shao, Baochen Xiong, Xiaoshan Yang, Yaguang Song, Qimeng Zhang, Haifeng Chen, Changsheng Xu
Abstract: Multimodal Large Language Models (MLLMs) rely on multimodal pre-training over diverse data sources, where different datasets often induce complementary cross-modal alignment capabilities. Model merging provides a cost-effective mechanism for integrating multiple expert MLLMs with complementary strengths into a unified model. However, existing model merging research mainly focuses on post-finetuning scenarios, leaving the pre-training stage largely unexplored. We argue that the core of MLLM pre-training lies in establishing effective cross-modal alignment, which bridges visual and textual representations into a unified semantic space. Motivated by this insight, we introduce the post-alignment merging task, which aims to integrate cross-modal alignment capabilities learned from heterogeneous multimodal pre-training. This setting introduces two key challenges: cross-domain parameter interference, where parameter updates learned from different data distributions conflict during merging, and layer-wise alignment contribution disparity, where different layers and projectors contribute unevenly to cross-modal alignment. To address them, we propose PivotMerge, a post-alignment merging framework for cross-modal projectors. PivotMerge incorporates two key components: Shared-space Decomposition and Filtering, which disentangles shared alignment patterns from domain-specific variations and suppresses conflicting directions, and Alignment-guided Layer-wise Merging, which assigns layer-specific merging weights based on differing alignment contributions. We construct systematic CC12M-based post-alignment merging scenarios for evaluation. Extensive experiments on multiple multimodal benchmarks show that PivotMerge consistently outperforms existing baselines, demonstrating its effectiveness and generalization ability.
Authors: Zhang Zhang, Yifeng Zeng, Houshi Jiang, Yinghui Pan
Abstract: WeatherSeg, an advanced semi-supervised segmentation framework, addresses autonomous driving's environmental perception challenges in adverse weather while reducing annotation costs. This framework integrates a Dual Teacher-Student Weight-Sharing Model (DTSWSM) that enables knowledge distillation from weather-affected images, and a Classifier Weight Updating Attention Mechanism (CWUAM) that dynamically adjusts classifier weights based on environmental attributes. Comprehensive evaluations demonstrate that WeatherSeg significantly outperforms baseline models in both accuracy and robustness across various weather conditions, including clear, rainy, cloudy, and foggy scenarios, establishing it as an effective solution for all-weather semantic segmentation in autonomous driving and related applications.
Authors: Zixuan Tang, Shen Zhao
Abstract: Large segmentation foundation models such as the Segment Anything Model (SAM) have reshaped promptable segmentation in natural images, and recent efforts have extended these models to medical images and volumetric settings. However, directly transferring a 3D SAM-style model to lesion segmentation remains challenging due to (i) weak spatial representational capacity for small, irregular targets in intermediate features, and (ii) extreme foreground-background imbalance in 3D volumes. We propose SGP-SAM, a self-gated prompting framework for efficient and effective transfer to 3D lesion segmentation. Our key component, the Self-Gated Prompting Module (SGPM), performs conditional multi-scale spatial enhancement: a lightweight multi-channel gating unit predicts whether the current features require additional multi-scale fusion, and only then activates a Multi-Scale Feature Fusion Block to enrich spatial context. To further address small-lesion learning, we design a Zoom Loss that up-weights lesion-focused supervision by combining Dice and a voxel-balanced focal term. Experiments on MSD Liver Tumor and MSD Brain Tumor (enhancing tumor) show consistent gains over strong transfer baselines based on SAM-Med3D. On MSD Liver Tumor, SGP-SAM improves mDice by 7.3% over fine-tuning.
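A minimal sketch of a Dice plus voxel-balanced focal combination of the kind the Zoom Loss describes is shown below, assuming a sigmoid binary-mask formulation; the balancing rule and hyperparameters are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def zoom_style_loss(logits, target, gamma=2.0, w_dice=1.0, w_focal=1.0, eps=1e-6):
    """Dice loss plus a voxel-balanced focal term for binary 3D lesion masks.
    logits, target: tensors of shape (B, 1, D, H, W); target is {0, 1}."""
    prob = torch.sigmoid(logits)

    # Dice term over the whole volume.
    inter = (prob * target).sum(dim=(1, 2, 3, 4))
    denom = prob.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)

    # Focal term with per-volume voxel balancing: foreground voxels are
    # up-weighted by the complement of their frequency (an assumed balancing rule).
    fg_frac = target.mean(dim=(1, 2, 3, 4)).clamp(min=eps)
    alpha = (1.0 - fg_frac).view(-1, 1, 1, 1, 1)          # weight on foreground
    pt = prob * target + (1.0 - prob) * (1.0 - target)    # probability of the true class
    w = alpha * target + (1.0 - alpha) * (1.0 - target)
    focal = (w * (1.0 - pt).pow(gamma) *
             F.binary_cross_entropy_with_logits(logits, target, reduction="none"))
    focal = focal.mean(dim=(1, 2, 3, 4))

    return (w_dice * dice + w_focal * focal).mean()

# Toy check on a random 16^3 volume with sparse "lesion" voxels.
logits = torch.randn(2, 1, 16, 16, 16)
target = (torch.rand(2, 1, 16, 16, 16) > 0.98).float()
print(zoom_style_loss(logits, target).item())
```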
Authors: Bayangmbe Mounmo, Sam Chien, Mile Mitrovic
Abstract: Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self-supervised foundation model converting surface meshes into dense per-token embeddings. Shape combines a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer processor using grouped-query attention and RMSNorm. A learned reconstruction prior enables per-region attribution for explainable predictions. Pretraining uses masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. The 10.9M-parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held-out split of 2,983 meshes, Shape achieves reconstruction R2 = 0.729 and 98.1% top-1 retrieval under the Wang-Isola protocol, with near-zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2x2 ablation on loss type and target-space normalization shows per-dimension normalization is critical: without it, performance collapses (R2 < 0.14, top-1 < 88%); with it, both losses succeed (R2 > 0.70, top-1 > 96%). Smooth-L1 offers secondary stability. Code, embeddings, and an interactive demo are released at https://github.com/simd-ai/shape.
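The ablation's key finding, that per-dimension normalization of the regression targets is critical, can be illustrated with a short sketch; the target statistics and Smooth-L1 parameters below are hypothetical.

```python
import numpy as np

def per_dimension_normalize(targets, eps=1e-8):
    """Standardize each target dimension independently (z-score over the training set).
    targets: array of shape (num_tokens, num_stats) of geometry statistics."""
    mean = targets.mean(axis=0, keepdims=True)
    std = targets.std(axis=0, keepdims=True) + eps
    return (targets - mean) / std, (mean, std)

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) reconstruction loss, averaged over masked tokens."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

# Toy geometry statistics with very different scales per dimension (e.g. areas vs.
# normal angles); without per-dimension normalization one dimension dominates the loss.
stats = np.column_stack([np.random.rand(1000) * 1e4,     # large-scale dimension
                         np.random.rand(1000) * 1e-2])   # small-scale dimension
normed, (mu, sigma) = per_dimension_normalize(stats)
pred = normed + 0.1 * np.random.randn(*normed.shape)
print(smooth_l1(pred, normed))
```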
Authors: Rongxiao Guo, Qingchao Chen
Abstract: Millimeter-wave (mmWave) radar has shown great potential for contactless, privacy-preserving, and robust human sensing, yet existing mmWave-based human mesh reconstruction (HMR) studies are still limited by the lack of benchmarks for generalization analysis under configuration shifts and fair comparison of different algorithms. To address the limitation, we present DGHMesh, a large-scale dual-radar mmWave dataset and generalization-focused benchmark for HMR. It contains data from 15 subjects performing 8 actions, with 360,000 synchronized frames collected from FMCW radar, SFCW radar, RGB images, and high-precision 3D HMR annotations. In addition, the dataset provides synchronized raw I/Q data from both radar modalities and accurately calibrated radar spatial positions. The benchmark is designed to evaluate HMR methods under diverse measurement configurations, including human position shifts, human orientation shifts, subarray size variations, and cross-subject settings. Based on DGHMesh, we also propose mmPTM, a query-based multi-radar fusion framework that jointly exploits point clouds and imaging tubes for HMR. Extensive experiments are conducted against representative baselines under different settings. The results demonstrate that mmPTM consistently achieves outstanding accuracy and competitive generalization capability across multiple sub-benchmarks, validating the effectiveness of multi-radar fusion and the practical value of the proposed dataset and benchmark for mmWave-based HMR research. DGHMesh and mmPTM are publicly available at https://github.com/SPIresearch/DGHMesh. (The complete benchmark and code will be released after paper publication.)
Authors: Jinqi Cao, Zhiping Yu, Baihong Lin, Chenyang Liu, Zhenwei Shi, Zhengxia Zou
Abstract: Recent generative AI models have achieved remarkable breakthroughs in language and visual understanding. However, although these models can generate realistic visual content, their spatial scale remains confined to bounded environments, preventing them from capturing how geographic environments evolve across thousands of kilometers or from modeling the spatial structure of the large-scale physical world. This limitation poses a critical challenge for ultra-wide-area spatial intelligence in Earth observation and simulation, revealing a deeper gap in generative AI: progress has relied primarily on scaling model parameters and training data, while overlooking spatial scale as a core dimension of intelligence. Here, motivated by this missing dimension, we investigate spatial scale as a new scaling axis in foundation models and present MetaEarth3D, the first generative foundation model capable of spatially consistent generation at the planetary scale. Taking optical Earth observation simulation as a testbed, MetaEarth3D enables the generation of multi-level, unbounded, and diverse 3D scenes spanning large-scale terrains, medium-scale cities, and fine-grained street blocks. Built upon 10 million globally distributed real-world training images, MetaEarth3D demonstrates both strong visual realism and geospatial statistical realism. Beyond generation, MetaEarth3D serves as a generative data engine for diverse virtual environments in ultra-wide spatial intelligence. We argue that this study may help empower next-generation spatial intelligence for Earth observation.
Authors: Tairan Fu, Francisco Javier Santos-Martín, Javier Conde, Pedro Reviriego, Elena Merino-Gómez
Abstract: The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated potential in zero-shot instrument recognition, their deployment in measurement systems remains constrained by an inherent inability to accurately analyze high-frequency temporal events and needle vibrations. This paper evaluates state-of-the-art models, including GPT-5 and Gemini 3, against the strict requirements of metrology and uncertainty quantification. To facilitate this evaluation, we introduce a novel dataset comprising video sequences of various gauge types: circular, linear, and Vernier, under diverse motion speed profiles. Our findings indicate that current VLMs exhibit limited ability in interpreting needle trajectories and scale semantics, failing to provide the traceability and reliability needed for safety-critical monitoring. The results demonstrate that these models have not yet achieved the performance necessary to be classified as trustworthy synthetic instruments under existing IEEE and ISO standards.
Authors: Liyao Jiang, Ruichen Chen, Keith G. Mills
Abstract: Pre-training is a general technique used across a range of deep learning tasks. By first training a model on one task, and then further training on the downstream task used for final evaluation, the model is forced to learn a more general understanding of the input data. While pre-training has been applied to 3D Human Pose Estimation (HPE) previously, the range of datasets used is typically limited to a few strong benchmarks, such as Human3.6M. Therefore, in this project, we expand the scope of an existing 3D HPE scheme to be compatible with additional 2D and 3D HPE datasets, such as Occlusion Person. We perform an extensive study on how aspects of 2D pre-training, such as model size, affect downstream performance, and to what extent pre-training can help the model generalize to different datasets. Experimental results show that 2D pre-training consistently outperforms training on 3D data alone, particularly in terms of computational efficiency. Finally, using MPII and Human3.6M, we obtain an MPJPE of under 64.5 mm.
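For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth joints; the sketch below assumes root-relative alignment, one common convention.

```python
import numpy as np

def mpjpe(pred, gt, root_index=0):
    """Mean Per-Joint Position Error in the ground-truth units (here assumed mm),
    after root-relative alignment (a common, but not universal, convention).
    pred, gt: arrays of shape (num_frames, num_joints, 3)."""
    pred = pred - pred[:, root_index:root_index + 1]
    gt = gt - gt[:, root_index:root_index + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: 4 frames, 17 joints, 3D coordinates in millimetres.
gt = np.random.rand(4, 17, 3) * 1000.0
pred = gt + np.random.randn(4, 17, 3) * 30.0
print(f"MPJPE: {mpjpe(pred, gt):.1f} mm")
```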
Authors: Jiayuan Chen, Ruoqi Liu, Zishan Gu, Ping Zhang
Abstract: Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data, limiting generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches. We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines.
Authors: Jeremy Ellis
Abstract: This paper presents webmcu-vision-web, a single-file, zero-install browser application for end-to-end TinyML vision model training and deployment on the Seeed Studio XIAO ESP32-S3 Sense (XIAO ML Kit, $15--40 USD). Acting as a browser-based companion to the on-device Arduino firmware of Paper 1 [1], it provides a private, fully local machine learning pipeline, from firmware flashing through image collection, CNN training, weight export, and live activation visualization, without any software installation beyond a Chromium-based browser. The system targets educators, small businesses, and researchers who need to train task-specific visual classifiers under their exact deployment conditions. Key capabilities include: in-browser firmware flashing via esptool-js; an SD card file browser with image preview and inline editing; config.json live-sync for zero-recompile hyperparameter adjustment; webcam and ESP32 OV2640 camera image capture; TensorFlow.js CNN training completing a three-class run (~30 images per class, 20 epochs) in approximately 1 minute browser-side versus 9 minutes on-device, enabling a complete collect-train-deploy cycle in under 10 minutes; weight export as myWeights.bin and myWeights.h; confusion matrix; and a live Conv2 activation heatmap streamed from the ESP32 during inference. No data leaves the local machine at any stage. A five-run consistency evaluation on the three-class reference problem (0Blank, 1Cup, 2Pen) demonstrates stable convergence with mean accuracy and standard deviation reported; all artefacts are released at the repository link below. The repository is a living template for LLM-assisted adaptation to new hardware and tasks. All source code is MIT-licensed at https://github.com/webmcu-ai/webmcu-vision-web.
Authors: Haonan Chen, Kaiwen Xiao, Bin Tian, Jun Fu
Abstract: Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is essential. While recent advances in end-to-end learning have shown great promise, the lack of high-quality, structured datasets tailored for parking scenarios remains a significant bottleneck. To address this gap, we present ParkingScenes, a comprehensive multimodal dataset specifically designed for end-to-end autonomous parking in simulated scenes. Built on the CARLA simulator, ParkingScenes features structured parking trajectories generated by a Hybrid A* planner and a Model Predictive Controller (MPC), providing accurate and reproducible supervision signals. The dataset includes 16 reverse-in and 6 parallel parking scenarios, each executed under two pedestrian conditions (present and absent), resulting in 704 structured episodes and approximately 105,000 frames. Each scenario is repeated 16 times to ensure consistent coverage. Each frame contains synchronized data from four RGB cameras, four depth sensors, vehicle motion states, and Bird's-Eye View (BEV) representations, enabling rich multimodal fusion and context-aware learning. To demonstrate the utility of our dataset, we compare models trained on ParkingScenes with those trained on unstructured, manually collected simulation data under identical conditions. Results show significant improvements in performance, underscoring the effectiveness of structured supervision for robust and accurate parking policy learning. By releasing both the dataset and the collection framework, ParkingScenes establishes a scalable and reproducible benchmark for advancing learning-based autonomous parking systems. The dataset and collection framework will be released at: https://github.com/haonan-ai/ParkingScenes
Authors: Deshui Miao, Chao Yang, Chao Tian, Guoqing Zhu, Kai Yang, Zhifan Mo, Xin Li
Abstract: This report describes a Ref-VOS pipeline centered on Sa2VA and organized with explicit agent roles. The key idea is that Sa2VA should provide the first dense semantic hypothesis, while an agent loop decides whether that hypothesis should be accepted, revised, or refined. The pipeline starts with a target-presence judgment stage. If the referred object does not exist in the video, the system directly outputs zero masks. Otherwise, Sa2VA receives the video and referring prompt and produces a coarse mask trajectory over the full video. This trajectory is treated as a semantic prior rather than a final answer. A planner agent decomposes the query, temporal partition agents identify informative blocks, scout agents search for anchor frames, and refinement agents convert reliable Sa2VA masks into boxes and points for SAM3 propagation. A critic scores candidate trajectories, a reflection controller repairs weak hypotheses, and a collaboration controller reconciles multiple agent branches. The result is a Ref-VOS system in which Sa2VA is responsible for dense grounded understanding, while the agent layer handles presence verification, temporal search, confidence-aware revision, and final mask refinement.
Authors: Deshui Miao, Xingsen Huang, Yameng Gu, Xiaogang Yu, Xin Li, Ming-Hsuan Yang
Abstract: SAM-based dense trackers provide strong short-term mask propagation but remain fragile under long occlusion, fast motion, viewpoint change, and distractors. The problem is especially severe for small objects, where a few incorrect memory updates can dominate later predictions. This report presents an occlusion- and reappearance-aware extension of DAM4SAM that improves memory control rather than changing the backbone. The method augments the original SAM3 tracker with four ingredients: a reliability-aware tracking state machine, branch-based recovery, delayed DRM promotion, and a selective policy for native SAM3 memory selection. During stable tracking, the model follows the original single-path propagation process. Once confidence drops, the tracker enters an ambiguous or recovery mode, maintains a small set of candidate branches, and commits memory only after a branch is reconfirmed. For small-object disappearance and reappearance, native memory selection is temporarily bypassed so older anchors remain accessible. In addition, the first conditioning frame is explicitly preserved, and the conditioning-memory budget is moderately enlarged to improve long-gap recovery. The resulting design keeps DAM4SAM efficient in easy cases while improving robustness in sequences dominated by occlusion and reappearance.
Authors: Xin Ning, Qiankun Li, Xiaolong Huang, Qiupu Chen, Feng He, Weijun Li, Prayag Tiwari, Xinwang Liu
Abstract: With the accumulation of resources in the era of big data and the rise of pre-trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine-tuning pre-trained models versus training from scratch. However, existing optimizers primarily focus on reducing the loss function by updating model parameters, without fully addressing the unique demands of these two major paradigms. In this paper, we propose DualOpt, a novel approach that decouples optimization techniques specifically tailored for these distinct training scenarios. For training from scratch, we introduce real-time layer-wise weight decay, designed to enhance both convergence and generalization by aligning with the characteristics of weight updates and network architecture. For the more important fine-tuning scenario, we integrate weight rollback with the optimizer, incorporating a rollback term into each weight update step. This ensures consistency in the weight distribution between upstream and downstream models, effectively mitigating knowledge forgetting and improving fine-tuning performance. Additionally, we extend the layer-wise weight decay to dynamically adjust the rollback levels across layers, adapting to the varying demands of different downstream tasks. Extensive experiments across diverse tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrate the broad applicability and state-of-the-art performance of DualOpt. Code is available at https://github.com/qklee-lz/OLOR-AAAI-2024.
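A minimal sketch of an update step with a rollback term toward the pretrained weights is given below; the additive form, coefficient, and layer-wise schedule are illustrative assumptions rather than the paper's exact rule.

```python
import numpy as np

def sgd_step_with_rollback(w, grad, w_pretrained, lr=0.01, rollback=1e-3):
    """One update step with an added rollback term pulling weights back toward the
    pretrained (upstream) values, mitigating drift and forgetting during fine-tuning."""
    return w - lr * grad - rollback * (w - w_pretrained)

def layerwise_rollback_coeffs(num_layers, base=1e-3, decay=1.2):
    """Illustrative layer-wise schedule (one plausible choice): deeper layers receive a
    smaller rollback coefficient and may move further from the pretrained weights."""
    return [base / (decay ** i) for i in range(num_layers)]

w_pre = np.ones(5)                        # "upstream" pretrained weights
w = w_pre.copy()
for _ in range(100):
    grad = 2.0 * (w - 3.0)                # toy downstream objective: (w - 3)^2
    w = sgd_step_with_rollback(w, grad, w_pre, lr=0.05, rollback=0.05)
print(w)                                  # settles between downstream optimum (3) and pretrained value (1)
print(layerwise_rollback_coeffs(4))
```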
Authors: Zhong Han Ervin Yeoh, Jiang Kan
Abstract: Precise Event Spotting (PES) is essential in fast-paced sports such as tennis, where fine-grained events occur within very short temporal windows. Accurate frame-level localization is challenging because of motion blur, subtle action differences, and limited annotated data. We study two complementary distillation strategies for few-shot PES: Adaptive Weight Distillation (AWD), a prediction-level method that adaptively weights teacher supervision on unlabeled data, and Annealed Multimodal Distillation for Few-Shot Event Detection (AMD-FED), a representation-level framework that transfers robust skeleton knowledge into visual modalities through annealed pseudo-labeling. Both methods use multimodal distillation to improve generalization under limited supervision. We evaluate them on F3Set-Tennis(sub) under few-shot k-clip settings, where they consistently outperform single-modality baselines and prior PES approaches. After observing the stronger performance of representation-level distillation on tennis, we further validate AMD-FED on a second sports dataset, Figure Skating, where it also shows robust performance in the k-clip scenario. These results highlight the effectiveness of multimodal distillation, especially representation-level transfer, for few-shot precise event spotting.
Authors: Yiming Pan, Chengwei Hu, Xuancheng Huang, Can Huang, Mingming Zhao, Yuean Bi, Xiaohan Zhang, Aohan Zeng, Linmei Hu
Abstract: Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.
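Two of the named layout checks, element collisions and whitespace, can be computed directly from element bounding boxes, as in the sketch below; the box format and tolerances are assumptions, not the paper's reward definitions.

```python
def overlap_area(a, b):
    """Overlap area of two (x, y, w, h) element bounding boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ox = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    oy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    return ox * oy

def collision_count(elements, tol=0.0):
    """Number of element pairs whose boxes overlap by more than `tol` area."""
    n = len(elements)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if overlap_area(elements[i], elements[j]) > tol)

def whitespace_ratio(elements, slide_w, slide_h):
    """Fraction of the slide not covered by any element (a simple estimate that
    ignores element overlap, so overlapping area is double-counted)."""
    covered = sum(w * h for (_, _, w, h) in elements)
    return max(0.0, 1.0 - covered / (slide_w * slide_h))

# Toy slide: a title box and two content boxes on a 1280x720 canvas.
boxes = [(40, 30, 1200, 90), (40, 160, 600, 500), (600, 160, 640, 500)]
print("collisions:", collision_count(boxes))
print("whitespace:", round(whitespace_ratio(boxes, 1280, 720), 3))
```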
Authors: Guray Ozgur, Tahar Chettaoui, Eduarda Caldeira, Jan Niklas Kolf, Marco Huber, Andrea Atzori, Naser Damer, Fadi Boutros
Abstract: Face Image Quality Assessment (FIQA) aims to assess the recognition utility of face samples and is essential for reliable face recognition (FR) systems. Existing approaches require computationally expensive procedures such as multiple forward passes, backpropagation, or additional training, and only recent work has focused on the use of Vision Transformers. Recent studies highlighted that these architectures inherently function as saliency learners with attention patterns naturally encoding spatial importance. This work proposes ATTN-FIQA, a novel training-free approach that investigates whether pre-softmax attention scores from pre-trained Vision Transformer-based face recognition models can serve as quality indicators. We hypothesize that attention magnitudes intrinsically encode quality: high-quality images with discriminative facial features enable strong query-key alignments producing focused, high-magnitude attention patterns, while degraded images generate diffuse, low-magnitude patterns. ATTN-FIQA extracts pre-softmax attention matrices from the final transformer block, aggregates multi-head attention information across all patches, and computes image-level quality scores through simple averaging, requiring only a single forward pass through pre-trained models without architectural modifications, backpropagation, or additional training. Through comprehensive evaluation across eight benchmark datasets and four FR models, this work demonstrates that attention-based quality scores effectively correlate with face image quality and provide spatial interpretability, revealing which facial regions contribute most to quality determination.
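A minimal sketch of scoring an image by the magnitude of final-block pre-softmax attention is shown below; the aggregation shown (a plain mean over absolute scaled dot products across heads and patches) is our reading of "simple averaging", and the toy inputs are synthetic.

```python
import numpy as np

def attention_quality_score(q, k):
    """Image-level quality score from the pre-softmax attention of the final block.
    q, k: (num_heads, num_patches, head_dim) query/key tensors for one image.
    The score is the mean magnitude of the scaled dot products, following the
    hypothesis that discriminative images yield focused, high-magnitude patterns."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)   # (heads, patches, patches)
    return float(np.abs(scores).mean())

rng = np.random.default_rng(0)
# Toy contrast: a degraded image is simulated by weaker (low-contrast) patch
# features, which produce diffuse, low-magnitude pre-softmax attention.
q_good, k_good = rng.normal(size=(12, 196, 64)), rng.normal(size=(12, 196, 64))
q_bad, k_bad = 0.3 * rng.normal(size=(12, 196, 64)), 0.3 * rng.normal(size=(12, 196, 64))
print("good    :", round(attention_quality_score(q_good, k_good), 3))
print("degraded:", round(attention_quality_score(q_bad, k_bad), 3))
```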
Authors: Guray Ozgur, Tahar Chettaoui, Eduarda Caldeira, Jan Niklas Kolf, Andrea Atzori, Fadi Boutros, Naser Damer
Abstract: Face Image Quality Assessment is crucial for reliable face recognition systems, yet existing Vision Transformer-based approaches rely exclusively on final-layer representations, ignoring quality-relevant information captured at intermediate network depths. This paper presents the first comprehensive investigation of how intermediate representations within ViTs contribute to face quality assessment through early exit mechanisms and score fusion strategies. We systematically analyze all twelve transformer blocks of ViT-FIQA architectures, demonstrating that different depths capture distinct and complementary quality-relevant information, as evidenced by varying attention patterns and performance characteristics across network layers. We propose a score fusion framework that combines quality predictions from multiple transformer blocks without architectural modifications or additional training. Our early exit analysis reveals optimal performance-efficiency trade-offs, enabling significant computational savings while maintaining competitive performance. Through extensive evaluation across eight benchmark datasets using four FR models, we demonstrate that our fusion strategy improves upon single-exit approaches. Our proposed quality fusion approach employs depth-weighted averaging that assigns progressively higher importance to deeper transformer blocks, achieving the best quality assessment performance by effectively leveraging the hierarchical nature of feature learning in ViTs. Our work challenges the conventional wisdom that only deep features matter for face analysis, revealing that intermediate representations contain valuable information for quality assessment. The proposed framework offers practical benefits for real-world biometric systems by enabling adaptive computation based on resource constraints while maintaining competitive quality assessment capabilities.
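A small sketch of depth-weighted score fusion follows; the linear ramp over the twelve blocks is one plausible weighting, not necessarily the paper's.

```python
import numpy as np

def depth_weighted_fusion(block_scores):
    """Fuse per-block quality scores with weights that grow with depth.
    block_scores: iterable with one quality score per transformer block.
    A simple linear ramp w_i proportional to the block index is used here."""
    scores = np.asarray(block_scores, dtype=float)
    weights = np.arange(1, len(scores) + 1, dtype=float)
    weights /= weights.sum()
    return float(weights @ scores)

# Toy example: shallow blocks give noisier, lower estimates than deep blocks.
per_block = [0.42, 0.45, 0.50, 0.48, 0.55, 0.58, 0.61, 0.63, 0.66, 0.68, 0.70, 0.71]
print(round(depth_weighted_fusion(per_block), 3))     # pulled toward the deep-block scores
print(round(float(np.mean(per_block)), 3))            # plain average, for comparison
```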
Authors: Tianyang Wang, Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Lina Gokhale, Charles Rabolli, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi
Abstract: The expanding ecosystem of pathology foundation models has produced powerful but fragmented tile-level representations, limiting their use in clinical tasks that require unified slide-level reasoning and interpretable linkage to clinically meaningful information. We present ASTRA, a pan-cancer framework that integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields, including classification category, cancer type, and anatomic site. ASTRA combines sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts to learn slide representations that support 4-category classification, 3-class solid tumor typing, 16-class cancer typing, and text-guided tumor localization without pixel-level supervision. Developed on a CHTN cohort of 10,359 whole-slide images (WSIs) spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, achieving up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, ASTRA achieves a mean Dice of 0.897 on an annotated in-domain CHTN subset (n = 380) spanning 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) spanning four cancer types. These results demonstrate that minimal structured pathology annotation fields derived from slide-level metadata can provide effective semantic supervision for unified slide representation learning, enabling both pan-cancer prediction and weakly supervised tumor localization within a single framework.
Authors: Tim Merino, Sam Earle, Ryunosuke Iwai, Julian Togelius, Edoardo Cetin
Abstract: We introduce Dream-Cubed, a large-scale dataset of Minecraft worlds at voxel resolution, and a family of models using cubes as powerful compositional units for efficient generation of interactive 3D environments. Dream-Cubed comprises tens of billions of tokens from a carefully curated mixture of procedural biome terrain and high-quality human-authored maps. We use this dataset to conduct the first large-scale study of 3D diffusion models for voxel generation, analyzing discrete and continuous diffusion formulations, data compositions, and architectural design choices. Our models operate directly in the space of blocks, enabling efficient and semantically grounded generation while supporting interactive user workflows such as inpainting and outpainting from user-authored blocks. To quantitatively evaluate our models, we adapt the FID metric to assess semantic differences between real and generated world renderings, and validate generation quality through a human preference study. We release the full dataset, code, and all our pretrained models, which we hope will provide a foundation for future research in efficient generative modeling for structured, interactive 3D environments.
Authors: Aaranay Aadi, Jai Gopal Singla, Amitabh, Nitant Dube, Praveen Kumar Shukla, Vijaypal Singh Dhaka
Abstract: Demand for high-quality Digital Elevation Models (DEMs) of the lunar surface has grown in recent years, as they are essential for studying the Moon and planning future missions. However, detailed elevation data for the Moon remain scarce. To overcome this limitation, this study proposes a deep learning method that generates a surface elevation map directly from monocular images of the surface. The dataset comprises Chandrayaan-2 Terrain Mapping Camera (TMC) images with their corresponding Digital Terrain Models (DTMs). The study proposes LunarDepthNet, a UNet architecture for generating DEMs that incorporates an EfficientNet encoder and custom layers to learn how shadows on the surface relate to the actual elevation values. A combined loss function is used to keep the terrain details accurate and smooth. During validation, the model showed stable loss convergence at 12%. It achieved a mean nRMSE of 0.437 and an MAE of 4.5 m in the testing stage. These results show the model can generate dependable elevation maps from single orbital images, which is particularly useful in regions of the Moon where stereo images are not available.
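The reported error metrics can be reproduced with a few lines, as sketched below; note that nRMSE conventions vary, and normalization by the ground-truth elevation range is only one common choice.

```python
import numpy as np

def dem_errors(pred, gt):
    """MAE (in the DEM's elevation units) and nRMSE for a predicted elevation map.
    nRMSE is normalized by the ground-truth elevation range here; other
    normalizations (mean, standard deviation) are also common."""
    err = pred - gt
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    nrmse = rmse / (gt.max() - gt.min() + 1e-9)
    return mae, nrmse

gt = np.cumsum(np.random.randn(256, 256), axis=1)      # toy smooth terrain (relative units)
pred = gt + np.random.randn(256, 256) * 0.5
mae, nrmse = dem_errors(pred, gt)
print(f"MAE = {mae:.3f}, nRMSE = {nrmse:.4f}")
```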
Authors: Serkan Hamdi Güğül, Kemal Levi, Burak Acar
Abstract: Industrial visual inspection systems often suffer from a severe scarcity of labeled defect data, particularly during the early stages of New Product Introduction (NPI). This limitation hinders the deployment of robust supervised detectors precisely when automated quality control is most needed. We present an end-to-end generative framework for high-fidelity, few-shot defect synthesis that enables both in-domain augmentation and cross-domain transfer. Our approach disentangles defect morphology from background appearance by combining masked textual inversion for defect representation learning, noise-blended conditioned generation for surface-aware synthesis, and gradient-aware post-processing for seamless visual integration. We evaluate the framework in two practically relevant settings: few-shot data augmentation, where synthetic samples enrich a small set of real defects, and zero-shot adaptation, where defects learned from a source domain are transferred to a novel target surface without any real target-domain defect examples. Using RF-DETR as the downstream detector, we show that the proposed pipeline substantially narrows the domain gap on a private industrial dataset. In the few-shot setting, synthetic augmentation improves mAP from 78.8% to 83.3%. In the zero-shot setting, synthetic domain adaptation improves mAP from 65.0% to 85.1%. These results demonstrate that high-fidelity defect synthesis can meaningfully accelerate NPI by enabling effective inspection models before sufficient real defect data has been collected.
Authors: Finn Rasmus Schäfer, Yuan Gao, Dingrui Wang, Thomas Stauner, Stephan Günnemann, Mattia Piccinini, Sebastian Schmidt, Johannes Betz
Abstract: While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model's internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion, Physical Reasoning, Foundation Models
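The deterministic kinematics-to-concept oracle can be pictured as a small rule set, as in the sketch below; the label set and thresholds are illustrative placeholders, not the benchmark's oracle.

```python
def motion_concept(speed_mps, accel_mps2, yaw_rate_rps):
    """Deterministic mapping from continuous ego kinematics to a discrete motion
    concept. Labels and thresholds are illustrative placeholders only."""
    if speed_mps < 0.3:
        return "stopped"
    if abs(yaw_rate_rps) > 0.10:
        return "turning_left" if yaw_rate_rps > 0 else "turning_right"
    if accel_mps2 > 0.8:
        return "accelerating"
    if accel_mps2 < -0.8:
        return "braking"
    return "cruising"

# Toy trajectory samples: (speed m/s, longitudinal acceleration m/s^2, yaw rate rad/s).
samples = [(0.0, 0.0, 0.0), (8.0, 1.5, 0.01), (12.0, -1.2, 0.0), (9.0, 0.1, 0.2)]
print([motion_concept(*s) for s in samples])
# ['stopped', 'accelerating', 'braking', 'turning_left']
```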
Authors: Chao Pan, Xin Yao
Abstract: Fast Adversarial Training (FastAT) seeks to achieve adversarial robustness at a fraction of the computational cost incurred by standard multi-step methods such as PGD-AT. Although numerous FastAT techniques have been proposed in recent years, fair comparison among them remains elusive. Existing benchmarks and public leaderboards typically permit diverse model architectures, varying training configurations, and external data sources, making it unclear whether reported improvements reflect genuine algorithmic advances or merely more favorable experimental conditions. To address this problem, we introduce the FastAT Benchmark, a controlled evaluation framework built on three core design principles: unified architecture requirements, standardized training settings, and strict prohibition of external or synthetic data. The benchmark implements over twenty representative FastAT methods within a single codebase, enabling direct and reproducible comparison. Each method is assessed through a dual-metric evaluation framework that measures both adversarial robustness (accuracy under PGD, AutoAttack, and CR Attack) and computational cost (GPU training time and peak memory footprint). Comprehensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet provide reliable baseline measurements and reveal that well-designed single-step methods can match or surpass PGD-AT robustness at substantially lower cost, while no single method dominates across all evaluation dimensions. The complete benchmark, including source code, configuration files, and experimental results, is publicly available to support transparent and fair evaluation of future FastAT research.
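For context, a generic single-step (FGSM-style) adversarial training step is sketched below; it omits method-specific details such as random initialization or regularizers that distinguish the benchmarked FastAT variants.

```python
import torch
import torch.nn.functional as F

def fgsm_training_step(model, x, y, optimizer, eps=8 / 255):
    """One single-step adversarial training update: craft an FGSM perturbation with
    a single gradient pass, then train on the perturbed batch."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss_adv = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss_adv, x_adv)[0]
    x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()

    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a tiny CNN on random CIFAR-shaped data.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
                            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                            torch.nn.Linear(8, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))
print(fgsm_training_step(model, x, y, opt))
```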
Authors: R. M. Krishna Sureddi, T. Satyanarayana Murthy, Nomula Varsha Reddy, Adi Kanishka, Nalla Manvika Reddy
Abstract: Transformer architectures, including nnFormer, have demonstrated promising results in volumetric medical image segmentation owing to their ability to capture long-range spatial interactions. Despite their strong performance, these models require large quantities of labeled training data and are prone to overfitting and training instability. This is a serious practical problem because obtaining expert-annotated medical images is both time-consuming and expensive. Moreover, traditional fully supervised training pipelines do not take advantage of the large amounts of unlabeled medical imaging data readily available in clinics. We address these drawbacks by equipping nnFormer with a self-supervised pretraining framework based on Masked Autoencoders (MAE). In this method, the model is pretrained on unlabeled volumetric medical images to reconstruct randomly masked parts of the input, allowing the encoder to learn meaningful anatomical and structural representations. The encoder is then fine-tuned on a labeled dataset for the downstream segmentation task. Experiments show that the proposed method yields higher segmentation performance in terms of Dice score, faster convergence during fine-tuning, and better generalization with limited labeled data. These findings confirm that self-supervised learning combined with transformer-based segmentation models is an effective approach to the data shortage problem in medical image analysis.
Authors: Ziyun Chen, Fan Liu, Liang Yao, Chuanyi Zhang, Yuye Ma, Wei Zhou
Abstract: The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning (RSIC), or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.
Authors: Syed Sajid Ullah, Muhammad Zunair Zamir, Ahsan Ishfaq, Salman Khan
Abstract: Accurate vehicle detection is a critical component of autonomous driving, traffic surveillance, and intelligent transportation systems. This paper presents an enhanced YOLOv8n-based model that integrates the Ghost Module, Convolutional Block Attention Module (CBAM), and Deformable Convolutional Networks v2 (DCNv2) to improve detection performance. The Ghost Module reduces feature redundancy through efficient feature generation, CBAM refines feature representation via channel and spatial attention, and DCNv2 enhances adaptability to geometric variations in vehicle structures. Evaluated on the KITTI dataset, the proposed model achieves 95.4% mAP@0.5, representing an 8.97% improvement over the baseline YOLOv8n, along with 96.2% precision, 93.7% recall, and a 94.93% F1-score. Comparative analysis against seven state-of-the-art detectors demonstrates consistent superiority across key performance metrics, while ablation studies validate the individual and combined contributions of the integrated modules. By addressing feature redundancy, attention refinement, and spatial adaptability, the proposed approach offers a robust and computationally efficient solution for vehicle detection in diverse and complex traffic environments.
Authors: Mohsen Asghari Ilani, Yaser Mike Banad
Abstract: This paper presents an IoT-enhanced deep learning framework for automated crack detection in Additive Manufacturing (AM) surfaces using convolutional neural networks (CNNs). By integrating IoT-enabled real-time monitoring, high-resolution imaging, and edge computing, the system enables continuous in-situ defect detection and classification. Real-time data acquisition supports immediate CNN-based analysis, improving both accuracy and efficiency in AM quality control. The framework supports supervised and semi-supervised learning, enabling robust performance on large, sparsely annotated datasets. Using LabelImg for annotation and OpenCV for preprocessing, the system achieves 99.54% accuracy on 14,982 images, with 96% precision, 98% recall, and a 97% F1-score. Dataset balancing and augmentation significantly improve generalization, increasing accuracy from 32% to 99%. Beyond detection, the framework establishes a linkage between AM process parameters, defect formation, and surface topology, supporting predictive analytics and defect mitigation. Aligned with Industry 4.0, it incorporates Digital Twin (DT) technology for real-time process simulation, predictive maintenance, and adaptive control. Key contributions include an IoT-based monitoring system using edge devices (Raspberry Pi 4B), an optimized CNN with model quantization and batch processing reducing inference latency by 47%, and an MQTT-based low-latency data streaming system with 5G connectivity, lowering transmission overhead by 35%. DT integration further enables predictive defect analysis and dynamic adjustment of AM parameters. This work advances intelligent AM quality control by providing a scalable, high-accuracy, and low-latency framework. Future directions include multimodal data fusion, hybrid architectures, and enhanced Digital Twin simulations for AI-driven defect prevention.
Authors: Ying Xiao, Shimiao Tang, Xitong Ling, Weiming Chen, Jun Wang, Jiawen Li, Huaitian Yuan, Jianghui Yang, Bowen Li, Huan Li, Yiting Meng, Tian Guan, Yonghong He, Hongfang Yin
Abstract: Liver cancer, especially hepatocellular carcinoma (HCC), imposes a substantial global disease burden. Accurate diagnosis and prognostic assessment directly influence treatment selection and patient survival, and pathological examination remains the gold standard for liver cancer diagnosis. Identifying diverse tissue components and pathological subtypes on histopathology slides is crucial for estimating postoperative recurrence risk and overall prognosis. However, most publicly available resources are still provided at the whole-slide image (WSI) level, and well-annotated datasets for fine-grained tissue component identification in liver cancer are scarce, which hinders reproducible model development and the deployment of quantitative analysis tools. To address this gap, we release HepatoBench, a patch-level image database for liver cancer with annotations for seven key tissue categories. Based on HepatoBench, we train and open-source a deep learning classification model as a tissue recognition tool. Furthermore, we train a WSI-level tumor/non-tumor segmentation model to automatically localize lesion regions across entire slides. By integrating the patch-level tissue classifier with the WSI-level segmentation model, we build HepatoQuant, an end-to-end, disease-specific regional quantification tool for liver cancer, enabling a unified workflow from WSIs to tissue composition parsing and quantitative statistics. We also open-source HepatoBench, the benchmarking protocol, and supporting tools, providing a solid foundation for automated regional quantification and fair method comparison in liver cancer pathology.
Authors: Yisheng He, Steven Hoi
Abstract: We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at https://meshlam.github.io.
Authors: Zhimu Zhou, Yanpeng Zhao, Qiuyu Liao, Bo Zhao, Xiaojian Ma
Abstract: Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models. The results show that they all struggle in the zero-shot setting, whereas finetuning on basic scales enables remarkable generalization to larger in-domain scales as well as out-of-domain scales and geometries. However, our best model that runs on high-end hardware fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
Authors: Md Tanjemul Islam, Md Rafiul Kabir
Abstract: Autonomous vehicles (AVs) rely on real-time perception systems to understand road environments and ensure safe navigation. However, implementing reliable perception algorithms on resource-constrained embedded platforms remains challenging due to limited computational resources. This paper presents a lightweight vision-based framework that integrates lane detection, lane tracking, and traffic sign recognition for embedded autonomous vehicles. A computationally efficient threshold-based lane segmentation method combined with perspective transformation and histogram-based curvature estimation is used for robust lane tracking under varying illumination conditions. A rule-based steering controller generates steering commands to maintain stable vehicle navigation. For traffic sign recognition, two lightweight convolutional neural networks (CNNs), EfficientNet-B0 and MobileNetV2, are evaluated using a custom dataset captured from the vehicle's onboard camera. Experimental results show that the system achieves real-time performance while maintaining accurate lane tracking with only 3.16% maximum offset RMSE. EfficientNet-B0 achieves a high offline classification accuracy of 98.77% on the test dataset, while achieving 90% accuracy during real-time on-device deployment, outperforming MobileNetV2 in both settings. MobileNetV2, however, offers slightly faster inference and lower computational cost. These results highlight the effectiveness of lightweight vision-based perception pipelines for resource-constrained autonomous driving applications.
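The histogram-based lane localization step can be sketched in a few lines of NumPy, assuming a thresholded bird's-eye binary mask; the pixel-to-meter scale used for the offset is a placeholder.

```python
import numpy as np

def lane_base_positions(binary_mask):
    """Locate left/right lane-line base columns from a thresholded, perspective-
    warped (bird's-eye) binary mask via a column histogram of the lower half of
    the image, as in classic sliding-window lane tracking."""
    h, w = binary_mask.shape
    histogram = binary_mask[h // 2:, :].sum(axis=0)
    midpoint = w // 2
    left_base = int(np.argmax(histogram[:midpoint]))
    right_base = int(np.argmax(histogram[midpoint:]) + midpoint)
    return left_base, right_base

def lateral_offset(left_base, right_base, image_width, meters_per_pixel=3.7 / 640):
    """Vehicle offset from lane center, assuming the camera sits at the image
    center; the pixel-to-meter scale is an illustrative placeholder."""
    lane_center = (left_base + right_base) / 2.0
    return (image_width / 2.0 - lane_center) * meters_per_pixel

# Toy bird's-eye mask with two vertical lane lines near columns 180 and 460.
mask = np.zeros((360, 640), dtype=np.uint8)
mask[:, 178:183] = 1
mask[:, 458:463] = 1
lb, rb = lane_base_positions(mask)
print(lb, rb, round(lateral_offset(lb, rb, 640), 3))
```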
Authors: Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen
Abstract: When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.
Authors: Towhidul Islam (ICS Department, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia), Mufti Mahmud (SDAIA-KFUPM JRC for AI, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia, IRC for Bio Systems and Machines, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia)
Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder and a major cause of dementia. Structural MRI is widely used to analyze AD-related brain atrophy; however, most deep learning methods rely on computationally expensive 3D convolutional neural networks (CNNs), limiting deployment in resource-constrained settings. This work introduces two main contributions. First, we propose a pipeline that converts T1-weighted MRI into anatomically informed 2D point clouds using Anatomical Priority Sampling (APS), producing ADNI-2DPC, the first neuroanatomically labeled MRI-derived point cloud dataset. Second, we present NeuroAPS-Net, a lightweight geometric deep learning model that incorporates anatomical priors via region-aware feature encoding and ROI token aggregation. Experiments on ADNI-2DPC demonstrate that NeuroAPS-Net achieves competitive classification accuracy while significantly reducing inference latency and GPU memory compared to state-of-the-art point cloud methods. These results highlight the potential of anatomically guided point cloud learning as an efficient and interpretable alternative to voxel-based CNNs for AD classification.
Authors: Fujun Han, Junan Chen, Xintong Zhu, Jingqi Ye, Xuanjie Mao, Tao Chen, Peng Ye
Abstract: Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, their capabilities on Small Object Understanding (SOU) tasks remain largely unexplored. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small-object understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
Authors: Hefeng Zhou, Xuan Liu, Sicheng Chen, Wutong Zhang, Wu Yan, Jiong Lou, Chentao Wu, Guangtao Xue, Wei Zhao, Jie Li
Abstract: Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.
URLs: https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.
Authors: Pu Li, Huafeng Li, Yafei Zhang, Yu Liu, Wen Wang
Abstract: Thermal infrared image enhancement aims to restore high-quality images from complex compound degradations. Existing all-in-one approaches typically employ a single shared backbone to handle diverse degradations, which causes gradient interference and parameter competition. To address this, we propose a Structural Entropy-Guided Decoupled (SEGD) Framework. Unlike unified modeling paradigms, SEGD decomposes compound degradations into independent sub-processes and models them in a divide-and-conquer manner through Degradation-Specific Residual Modules (DRMs). Each DRM focuses on residual estimation for a specific degradation, enabling task decoupling while remaining jointly trainable, which mitigates parameter contention. A Degradation-Aware Evidential Network further estimates degradation type and intensity, providing priors that adaptively regulate DRM restoration strength. To handle compound cases, DRMs are composed in varying orders to form multiple restoration paths, from which the most informative features are aggregated under a structural-entropy criterion, yielding decoder-ready representations with structural fidelity and degradation awareness. Integrating divide-and-conquer restoration, evidential perception, and entropy-guided adaptation, SEGD achieves fine-grained and interpretable enhancement. We also construct a nighttime TIR benchmark for evaluation under real low-light conditions. Experimental results demonstrate that SEGD surpasses state-of-the-art methods while achieving higher efficiency with fewer parameters.
Authors: Zewen Li, Shuo Ye, Zitong Yu, Weicheng Xie, Linlin Shen
Abstract: Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.
Authors: Yasmin Rodrigues Sobrinho, João Renato Ribeiro Manesco, João Paulo Papa
Abstract: The integration of quantum machine learning with classical deep learning offers promising avenues for medical image analysis by mapping data into high-dimensional Hilbert spaces. However, effectively unifying these distinct paradigms remains challenging due to common optimization asymmetries. In this paper, a novel hybrid quantum-classical architecture for breast cancer diagnosis based on a dual-branch feature-extraction pipeline is proposed. Our framework extracts and unifies complementary representations from classical models and quantum circuits, exploring both trainable and deterministic (non-trainable) quantum paradigms. To integrate these embeddings, three progressive feature fusion strategies are introduced: Static Hybrid Fusion (SHF) for offline extraction, Dynamic Hybrid Fusion (DHF) for end-to-end co-adaptation, and a novel Temperature-Scaled Hybrid Fusion (TSHF). The TSHF strategy incorporates a learnable scalar, inspired by multimodal learning, that dynamically balances hybrid gradient dynamics and resolves optimization bottlenecks. Empirical validation on the BreastMNIST dataset confirms our hypothesis that unifying diverse feature representations creates a richer data context. The TSHF strategy, specifically when pairing a ResNet backbone with a trainable quantum circuit, achieved a peak accuracy of 87.82%, F1-score of 91.77%, and an AUC-ROC of 89.08%, outperforming purely classical baselines. These results demonstrate that the proposed hybrid framework improves classification accuracy and threshold reliability, providing a stable, high-performance architecture for the clinical deployment of quantum-enhanced diagnostic tools.
Authors: Nikoo Moradi, Gijs Luijten, Behrus Hinrichs-Puladi, Jens Kleesiek, Victor Alves, Jan Egger, André Ferreira
Abstract: Diffusion models produce high-quality synthetic data but suffer from slow inference. We propose the 3D Variable-Step Denoising Diffusion Probabilistic Model (VS-DDPM), a framework engineered to maintain generative quality while accelerating inference by several factors. We tested our approach on four tasks (missing MRI, tumor removal, MRI-to-sCT, and CBCT-to-sCT) within the BraTS2025 and SynthRAD2025 challenges, and designed it for high efficiency under the hardware and time constraints imposed by both challenges. VS-DDPM achieved state-of-the-art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal-to-noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in MRI-to-sCT and CBCT-to-sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in data pre- and post-processing pipelines or specific loss function configurations. These results demonstrate that VS-DDPM provides a robust and tunable solution for high-fidelity 3D medical image synthesis. The code is available at https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it.
URLs: https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it.
Authors: Rahul Patel
Abstract: Anemia affects over one billion people globally and remains severely under-diagnosed in low-resource regions where laboratory blood tests are inaccessible. This paper presents AnemiaVision, an end-to-end web-based system for non-invasive anemia screening from smartphone photographs of the palpebral conjunctiva and fingernail beds. The proposed pipeline fine-tunes a pre-trained EfficientNet-B3 backbone with a redesigned three-layer classifier head incorporating BatchNorm, GELU activations, and high-rate Dropout (0.45/0.35). Training employs four orthogonal accuracy-boosting techniques: TrivialAugmentWide for policy-free image augmentation, RandomErasing for spatial regularisation, Mixup (alpha=0.2) for inter-class smoothing, and cosine-annealing scheduling with linear warmup. Early stopping is governed by peak validation accuracy rather than validation loss to prevent premature termination on high-variance epochs. The deployed Flask application integrates persistent patient-history management backed by PostgreSQL on Render, with an automated database-migration entrypoint ensuring zero data loss across redeploys. Ablation experiments demonstrate that accuracy-first early stopping contributes +1.6% and Mixup contributes +2.8% to final validation accuracy. Overall, the proposed system achieves a validation accuracy of 96.2% and AUC-ROC of 0.98, compared with 44.9% validation accuracy and AUC-ROC of 0.58 from the three-epoch CPU-only baseline. Sensitivity for the anemic class reaches 0.96, making the system suitable as a first-line screening tool for community health workers in rural settings. The system is publicly accessible and source code is openly available.
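Two of the training components mentioned above, Mixup with alpha=0.2 and early stopping governed by peak validation accuracy, are standard techniques; the following is a minimal PyTorch-flavored sketch under those assumptions, not the AnemiaVision training code, using toy data and stand-in validation scores.

```python
# Illustrative sketch: Mixup (alpha = 0.2) and accuracy-first early stopping.
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2):
    """Blend a batch with a shuffled copy of itself; return both label sets."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

def mixup_loss(logits, y_a, y_b, lam):
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)

# Toy usage on random data just to show the mechanics.
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 2))
xm, y_a, y_b, lam = mixup_batch(x, y, alpha=0.2)
loss = mixup_loss(model(xm), y_a, y_b, lam)

# Accuracy-first early stopping: track peak validation accuracy (not loss) and
# stop after `patience` epochs without improvement. Values below are stand-ins.
best_acc, bad_epochs, patience = 0.0, 0, 5
for val_acc in [0.80, 0.83, 0.82, 0.84, 0.84, 0.83, 0.82, 0.81, 0.80, 0.79]:
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:
        break
```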
Authors: Peter Kulits, Cordelia Schmid
Abstract: We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet
Authors: Ashwin Kumar, Robbie Holland, Corey Barrett, Jangwon Kim, Maya Varma, Zhihong Chen, Yunhe Gao, Greg Zaharchuk, Tara Taghavi, Krishnaram Kenthapadi, Akshay Chaudhari
Abstract: Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.
Authors: Renjith Prasad, Rishabh Sharma, Andrew E. Shao, Annmary Justine Koomthanam, Shreyas Kulkarni, Suparna Bhattacharya, Martin Foltin, Amit Sheth, David Orozco, Matthew Quinn, Brian Sammuli
Abstract: Subtle visual anomalies such as hairline cracks, sub-millimeter voids, and low-contrast inclusions are structurally atypical yet visually ambiguous, making them both difficult to annotate and easy to overlook during active learning. Standard acquisition heuristics based on discriminative uncertainty or feature diversity often overselect dominant patterns while underexploring sparse yet important regions of the data space. This failure mode is especially severe in industrial defect inspection, where anomalies may be both low-prevalence and difficult to distinguish from surrounding structure. To resolve this, we propose GSAL, an active learning framework for object detection that combines a diffusion-based difficulty signal with a hierarchical semantic coverage prior. The diffusion component scores images and proposals using reconstruction discrepancy and denoising variability, prioritizing visually atypical or ambiguous examples. However, diffusion alone does not prevent acquisition from repeatedly favoring hard samples within dominant semantic modes. The semantic component therefore organizes candidate samples in a three-level concept graph and promotes coverage of underrepresented semantic regions while providing interpretable acquisition rationales. By balancing visual difficulty with semantic coverage, GSAL improves retrieval of subtle and rare targets that are often missed by uncertainty-only selection. Experiments on a proprietary thin-film defect dataset, Pascal VOC, and MS COCO show consistent gains in label efficiency and rare-class retrieval over uncertainty-, diversity-, and hybrid-based baselines.
Authors: Vitalii Tutevych, Raphael Memmesheimer, Luca Eichler, Dmytro Pavlichenko, Fynn Schilke, Rodja Krudewig, Sven Behnke
Abstract: Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time-intensive annotations. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais-bonn/label_propagation.
Authors: Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, Raquel Urtasun
Abstract: High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.
Authors: Mohammad Sadegh Salehi, Alex Perkins, Igor Maurell, Ashkan Dabbagh, Raymond Wong
Abstract: Web-scale 3D asset collections are abundant, but rarely deployment-ready. Assets ship with arbitrary metric scale, incorrect pivots and forward axes, brittle geometry, and textures that do not support relighting, which limits their utility for embodied AI, robotics simulation, game development, and AR/VR. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets designed for downstream use rather than volume alone. Each asset is released as a metric-scaled, semantically anchored .glb with separated PBR material maps, a convex collision hull, a paired reference image, and rich multi-sentence text metadata. The dataset spans indoor objects, vehicles, architecture, creatures, and props under a unified spatial convention. Alongside the dataset, we introduce an evaluation suite for 3D asset banks. The suite comprises a continuous Scale Plausibility Score (SPS) with an LLM-as-Judge interval protocol, an LLM Concept Density score for metadata, an anchor-error metric, and a cross-modal CLIP coherence protocol, and we use it to audit AmaraSpatial-10K alongside matched subsets from Objaverse, HSSD, ABO, and GSO. Compared with Objaverse-sourced assets, we demonstrate that AmaraSpatial-10K substantially improves text-based retrieval precision (CLIP Recall@5 of 0.612 vs 0.181, a 3.4x improvement with median rank falling from 267 to 3), and we establish that it satisfies the spatial and semantic prerequisites for physics-aware scene composition and embodied-AI asset banks, leaving those downstream evaluations to future work. AmaraSpatial-10K is publicly available on Hugging Face.
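For readers unfamiliar with the retrieval metrics quoted above, Recall@5 and median rank can be computed from a text-to-asset similarity matrix as in the short sketch below; the embeddings are random stand-ins rather than CLIP features from the dataset.

```python
# Recall@K and median rank from a text-to-asset similarity matrix (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 100
text_emb = rng.normal(size=(n, 512))
asset_emb = text_emb + 0.5 * rng.normal(size=(n, 512))   # query i matches asset i

# Cosine similarity: L2-normalize rows, then take the dot product.
t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
a = asset_emb / np.linalg.norm(asset_emb, axis=1, keepdims=True)
sim = t @ a.T                                             # (queries x assets)

order = np.argsort(-sim, axis=1)                          # best match first
ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
recall_at_5 = float(np.mean(ranks <= 5))
median_rank = float(np.median(ranks))
print(recall_at_5, median_rank)
```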
Authors: Sulagna Saha, Arthur Ouaknine, Etienne Laliberté, Carol Altimas, Evan M. Gora, Adriane Esquivel Muelbert, Ian R. McGregor, Cesar Gutierrez, Vanessa E. Rubio, David Rolnick
Abstract: Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.
Authors: Rohit Mukherjee, Hannah K. Friedrich, Beth Tellman, Ariful Islam, Zhijie Zhang, Jonathan Giezendanner, Upmanu Lall, Venkataraman Lakshmi
Abstract: Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: 'inundated' (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and 'non-inundated'. To demonstrate the dataset's utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google's 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.
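The Intersection over Union (IoU) figures above follow the usual definition for binary masks; a tiny illustrative computation (toy masks, not UFO data) is shown below.

```python
# IoU for binary segmentation masks on toy data.
import numpy as np

pred = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)   # predicted 'inundated'
truth = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)  # labeled 'inundated'

intersection = np.logical_and(pred, truth).sum()
union = np.logical_or(pred, truth).sum()
iou = intersection / union    # 2 / 4 = 0.5 for these toy masks
```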
Authors: Pir Bakhsh Khokhar, Carmine Gravino, Fabio Palomba, Sule Yildirim Yayilgan, Sarang Shaikh
Abstract: The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 +/- 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p >= 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).
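Weighted soft voting and the quadratic weighted kappa (QWK) reported above are standard components; a brief sketch with stand-in probabilities and weights (not the paper's fitted ensemble) shows how they combine.

```python
# Weighted soft voting over per-model class probabilities, scored with QWK.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 3, 200, 5        # e.g., five DR severity grades
probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n_samples))
weights = np.array([0.4, 0.35, 0.25])             # stand-in model weights

soft = np.tensordot(weights, probs, axes=1)       # (n_samples, n_classes)
pred = soft.argmax(axis=1)

y_true = rng.integers(0, n_classes, size=n_samples)
qwk = cohen_kappa_score(y_true, pred, weights="quadratic")
print(qwk)
```

In practice the model weights would be chosen from validation performance (e.g., per-fold QWK) rather than fixed by hand as here.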
Authors: Qian Huang, Mayoore Selvarasa Jaiswal, Zhen Zhong, Rochelle Pereira, Jianyuan Min
Abstract: The real-world adoption of portrait relighting is hindered by dataset domain gaps, camera sensitivity, and computational costs. We address these challenges with Hybrid Domain Knowledge Fusion, a paradigm that fuses the specialized strengths of synthetic, One-Light-at-A-Time (OLAT), and real-world datasets into a compact model. Our approach features specialized prior models hardened by domain-aware adaptation, followed by augmented knowledge distillation into a lightweight student model with multi-domain expertise. Our method demonstrates a 6x to 240x inference speedup while maintaining state-of-the-art (SOTA) visual quality in the experiments. Additionally, we construct a massive, high-fidelity synthetic dataset with diverse ground-truth intrinsics to support our training pipeline.
Authors: Alexander Nikitas Dimopoulos, Joseph Grasso, John Beltz
Abstract: Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with roughly 10^4x compression; role-filtered payloads transmit in under 15 s at 1 Mbps over FirstNet Band 14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.
Authors: Zihui Zhu, Ziqi Zhou, Yichen Wang, Lulu Xue, Minghui Li, Shengshan Hu
Abstract: Deep learning drives major advances in autonomous driving (AD), where object detectors are central to perception. However, adversarial attacks pose significant threats to the reliability and safety of these systems, with physical adversarial patches representing a particularly potent form of attack. Such patches are usually crafted for a single model, yielding poor transferability to unseen detectors. We propose AdvAD, a transfer-based physical attack against object detection in autonomous driving. Instead of targeting a specific detector, AdvAD optimizes adversarial patches over multiple detection models in a unified framework, encouraging the learned perturbations to capture shared vulnerabilities across architectures. The optimization process adaptively balances model contributions and enforces robustness to physical variations. It further employs data augmentation and geometric transformations to maintain patch effectiveness under diverse physical conditions. Experiments in both digital and real-world settings show that AdvAD consistently outperforms state-of-the-art (SOTA) attacks in performance and transferability.
Authors: Mengke Li, Haiquan Ling, Yiqun Zhang, Yang Lu, Hui Huang
Abstract: Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting its effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, albeit with limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at https://anonymous.4open.science/r/WTS-0F3C.
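One plausible reading of the WTS gating rule, comparing a text-predicted label derived from image-text similarity against the observed label and activating the auxiliary supervision only on disagreement, is sketched below; the embeddings are random stand-ins for what a pre-trained vision-language model (e.g., CLIP) would produce.

```python
# Sketch of discrepancy-based gating of a text-derived "weak teacher" label.
import numpy as np

rng = np.random.default_rng(0)
n_images, n_classes, dim = 16, 10, 512
image_emb = rng.normal(size=(n_images, dim))            # stand-in image embeddings
class_text_emb = rng.normal(size=(n_classes, dim))      # one embedding per class name
observed_labels = rng.integers(0, n_classes, size=n_images)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sim = normalize(image_emb) @ normalize(class_text_emb).T   # image-to-class-name similarity
text_pred = sim.argmax(axis=1)                              # weak-teacher label

use_wts = text_pred != observed_labels    # gate: only act when labels disagree
train_labels = np.where(use_wts, text_pred, observed_labels)
```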
Authors: Syed Ibad Hasnain, Muhammad Faris, Hafiza Syeda Yusra Tirmizi, Rabail Khowaja, Hafsa Israr
Abstract: Early detection and classification of brain tumors from Magnetic Resonance Imaging (MRI) is highly important, but the relevant features are difficult to extract from medical images. Convolutional Neural Networks (CNNs) are good at capturing local texture and spatial information, whereas Vision Transformers (ViTs) are good at capturing long-range global dependencies. In this paper, we propose a new hybrid architecture that combines a SqueezeNet-style CNN branch with a MobileViT-style global transformer branch through an Adaptive Attention Gate mechanism. The gate dynamically learns per-sample, per-feature weights for the contribution of each branch, allowing context-sensitive merging of local and global representations. When trained and evaluated on the Brain Tumor MRI Dataset (Kaggle), the proposed model achieves a test accuracy of 97.60, a precision of 97.30, a recall of 97.50, an F1-score of 97.40, and a macro-average area under the curve (AUC) of 0.9946. These scores are higher than those of single CNN and ViT baselines and of current competitive fusion methods, showing that dynamic feature weighting is an effective way to classify medical images.
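The Adaptive Attention Gate described above can be pictured as a small learned module that outputs per-sample, per-feature weights for the two branches; the PyTorch sketch below uses illustrative dimensions and is not the paper's exact design.

```python
# Sketch of an adaptive gate fusing a CNN branch with a transformer branch.
import torch
import torch.nn as nn

class AdaptiveGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # The gate sees both branch features and outputs a weight in (0, 1)
        # for every feature dimension of every sample.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_cnn: torch.Tensor, f_vit: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([f_cnn, f_vit], dim=-1))
        return g * f_cnn + (1 - g) * f_vit   # context-sensitive merge

fused = AdaptiveGate(dim=256)(torch.randn(4, 256), torch.randn(4, 256))
```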
Authors: Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan
Abstract: Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.
Authors: Hongxiang Peng, Dewei Bai, Hong Qu
Abstract: Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
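The patch adjacency masking strategy amounts to restricting attention to spatial neighbors on the patch grid; a minimal sketch of such a mask (grid size and radius are illustrative) follows.

```python
# Build an attention mask that only allows each patch token to attend to
# patches within a small spatial neighborhood on the H x W patch grid.
import torch

def adjacency_mask(h: int, w: int, radius: int = 1) -> torch.Tensor:
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)        # (h*w, 2)
    # Chebyshev distance between every pair of patch positions.
    dist = (coords[:, None, :] - coords[None, :, :]).abs().max(dim=-1).values
    return dist <= radius                                            # True = attention allowed

mask = adjacency_mask(8, 8, radius=1)   # 64 tokens, 3x3 neighborhoods
# Disallowed pairs would typically receive -inf before the attention softmax.
```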
Authors: Shafeequdheen Palengara, Jyotiranjan Nayak, Vijayakrishna Rowthu
Abstract: A level-set-free, hybrid image segmentation approach is considered, based on a modified version of the piecewise-constant shape gradient of a Mumford-Shah shape functional together with a repulsive function. Segmentation is performed through an evolution of discrete curves driven by a non-local, shape-based energy, enabling the segmentation of images containing disjoint regions and multiple boundaries. The formulation includes a novel additional component, a multivariable function depending on a few sampled points of the curves, which handles self-intersections arising during the evolution of the boundary curves. The method is applied to several grayscale and color images, including images with nested structures and astronomical objects. The results indicate effective segmentation in complex scenarios with full control over the topology of the segments and the self-intersections of the boundaries.
Authors: Balaji Darur, Amanmeet Garg, Makarand Tapaswi
Abstract: Video Situation Recognition (VidSitu) addresses the challenging problem of "who did what to whom, with what, how, and where" in a video. It tests thorough video understanding by requiring identification of salient actions and associated short descriptions for event roles across multiple events. Grounding with VidSitu requires spatio-temporal localization of key entities across shots and varied appearances. We posit that coherent video understanding requires consistent identification of entities that play different roles. We propose Multimodal Entity Coreference (MEC) to unite entity descriptions in text with grounding across the video. Towards this, we introduce CineMEC, a multi-stage approach that unites event role mention groups with visual clusters of entities, without explicit grounding supervision during training. Our approach is designed to exploit the synergy between visual grounding and captioning, where improving one influences the other and vice versa. For evaluation, we extend the VidSitu dataset with grounding annotations. While previous work focuses primarily on descriptions, CineMEC improves consistency across both: captioning (+2.5% CIDEr, +7% LEA) and visual grounding (+18% HOTA).
Authors: Niamh Belton, Victoria Joppin, Aonghus Lawlor, Catherine Masson, Thierry Bege, David Bendahan, Kathleen M. Curran
Abstract: This work introduces DyABD, a novel and complex benchmark dataset of dynamic abdominal MRIs from patients with abdominal hernias and associated high-quality abdominal muscle annotations. DyABD is the first of its kind in four key ways: (1) it proposes the first abdominal muscle segmentation task, (2) the dynamic MRIs are acquired whilst the patients perform various exercises, introducing extreme anatomical variability and making it one of the most challenging segmentation datasets to date, (3) it includes both pre- and post-corrective MRIs, and (4) DyABD promotes clinical research into the high recurrence rates of abdominal hernias. Beyond dataset introduction, this work provides a comprehensive evaluation of the generalisation capabilities of existing segmentation models across Supervised, Few-Shot and Zero-Shot paradigms on the unseen DyABD dataset. This work reveals that there is still room for substantial improvement in the field of medical image segmentation, with the majority of techniques achieving a Dice Coefficient of 0.82. This work therefore sheds light on the true progress of the field and redefines the benchmark for progress in medical image segmentation.
Authors: Yihan Wang, Lei Li, Yao Lai, Jing Wang, Yan Lu
Abstract: Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search. We first build a high-quality dataset on top of Masala-CHAI through a two-stage repair pipeline that raises the netlist compile rate from 22% to 100%. Built on this foundation, AnalogRetriever encodes schematics and descriptions with a vision-language model and netlists with a port-aware relational graph convolutional network, mapping all three modalities into a shared embedding space via curriculum contrastive learning. Experiments show that AnalogRetriever achieves an average Recall@1 of 75.2% across all six cross-modal retrieval directions, significantly outperforming existing baselines. When integrated into the AnalogCoder agentic framework as a retrieval-augmented generation module, it consistently improves functional pass rates and enables previously unsolved tasks to be completed. Our code and dataset will be released.
Authors: Masoumeh Chapariniya, Jean-Marc Odobez, Volker Dellwo, Teodora Vuković
Abstract: Avatar fingerprinting, i.e., verifying who drives a synthetic talking-head video rather than whether it is real, is a critical safeguard for authorized use of face-reenactment technology. Existing methods rely on a fixed, non-differentiable landmark extraction stage that prevents the fingerprinting model from being optimized end-to-end from raw pixels. We propose a preprocessing-free system built on a micro-expression-aware backbone operating on raw video frames, with inter-frame feature differencing as the core design principle: consecutive feature maps are subtracted in the learned deep feature space, so that temporally stable appearance dimensions contribute zero to the output while driver-specific motion dynamics are preserved. A controlled ablation on NVFAIR confirms that temporal motion accounts for the large majority of discriminative performance, and that raw appearance features actively degrade identity separation. Both the choice of backbone and the differencing principle are essential: differencing alone is insufficient when applied to a generic encoder, as appearance-dominated features collapse to near-identical representations across adjacent frames, while the micro-expression-aware F5C backbone retains measurable motion variation that the differencing operation can exploit. Without any external preprocessing, our model achieves an overall AUC of 0.877 on NVFAIR and matches or exceeds the landmark-based baseline on the majority of cross-generator pairs.
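The inter-frame feature differencing at the heart of this design is a one-line operation once backbone features are available; a toy PyTorch sketch with illustrative shapes is given below.

```python
# Subtract consecutive feature maps in the learned feature space so that
# temporally stable appearance cancels and frame-to-frame motion dynamics remain.
import torch

feats = torch.randn(2, 16, 256)                 # (batch, frames, feature_dim) from some backbone
motion = feats[:, 1:, :] - feats[:, :-1, :]     # (batch, frames - 1, feature_dim)
clip_repr = motion.mean(dim=1)                  # pooled motion representation per clip
```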
Authors: Heng Li, Xiaotong Lin, Ling-An Zeng, Yulei Kang, Shuai Li, Jian-Fang Hu
Abstract: Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose MotionHiFlow, a hierarchical flow matching framework that generates motion progressively by constructing a flow path from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components. Code is available at https://github.com/ai-lh/MotionHiFlow.
Authors: Sangwook Baek, Vin Van Duong, Karam Park, Pilkyu Park
Abstract: This paper introduces a novel multi-frame super-resolution (MFSR) network for burst hexadeca-Bayer pattern Contact Image Sensor (CIS) images, which includes demosaicing, denoising, multi-frame fusion, and super-resolution. Designing a high-quality reconstruction network poses several challenges: 1) Unlike the Bayer color filter array (CFA) pattern, the hexadeca-Bayer pattern is hard to interpolate, since the pixel distance between same-color groups increases; 2) Due to large object motion and camera movement, the final fusion result usually suffers from misalignment, resulting in blurry images or ghosting artifacts; 3) The proposed network should be fast and efficient enough to operate in real-time on mobile devices. To overcome these challenges, we propose a novel network, called LatentBurst, which contains: 1) a pyramid align-and-fuse approach in the latent feature space to deal with large-motion scenarios; 2) an efficient UNet-based structure which can run efficiently on mobile devices; 3) fine-tuned optical flow estimation and two-step knowledge distillation to reduce the domain gap more effectively. Experimental results in various scenarios demonstrate the effectiveness of our proposed method compared with other state-of-the-art methods.
Authors: Ruyi Dai, Tingkwong Ng, Hao Chen
Abstract: Automated white blood cell (WBC) classification is essential for scalable leukaemia screening. However, real-world deployment is challenged by domain shifts caused by staining protocols, scanner characteristics, and inter-laboratory variability, which often degrade model performance. The White Blood Cell Classification Challenge (WBCBench) at ISBI 2026 aims to advance robust WBC recognition, with a focus on accurately identifying blast cells and other clinically critical rare subtypes. We propose a memory-augmented, hierarchical ensemble pipeline for WBC classification under domain shifts, leveraging a feature bank and a DinoBloom backbone fine-tuned with LoRA. Our three-stage inference hierarchy combines k-nearest neighbors (kNN) retrieval at each level, reducing over-reliance on any single decision. Evaluated on the WBCBench dataset, our method ranks within the top ten by macro F1-score in the final testing phase.
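The memory-augmented kNN retrieval against a feature bank can be sketched in a few lines; the embeddings, labels, and k below are stand-ins rather than the DinoBloom features used in the submission.

```python
# Cosine-similarity kNN against a feature bank with majority voting.
import numpy as np

rng = np.random.default_rng(0)
bank_feats = rng.normal(size=(500, 384))        # stored reference embeddings
bank_labels = rng.integers(0, 10, size=500)     # their WBC (sub)type labels
query = rng.normal(size=(384,))                 # embedding of a new cell image

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = l2n(bank_feats) @ l2n(query)             # cosine similarity to the bank
top_k = np.argsort(-sims)[:15]
votes = np.bincount(bank_labels[top_k], minlength=10)
predicted = int(votes.argmax())                 # majority vote among neighbors
```

In a hierarchical variant of this idea, a separate bank and vote could be used at each level of the class hierarchy so that no single decision dominates.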
Authors: Kaiwen Huang, Yi Zhou, Yizhe Zhang, Jingxiong Li, Tao Zhou
Abstract: Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency. Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods. Code is released at: https://github.com/taozh2017/SemiGDA.
Authors: Meizhu Liu, Yassi Abbasi, Matthew Rowe, Michael Avendi, Paul Li
Abstract: PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight, production-level PDF parsing framework that accurately detects visual elements and associates captions using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves at least 96% visual element detection accuracy and 93% caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over 2x. We have deployed the proposed system in a challenging production environment.
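Caption association via spatial heuristics plus a semantic cue can be illustrated with a toy scoring function; the thresholds, weights, and keyword check below are assumptions for illustration, not the deployed system's logic.

```python
# Toy caption-association score: vertical proximity below the figure, horizontal
# overlap, and a crude semantic cue standing in for embedding similarity.
from dataclasses import dataclass

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float   # top-left / bottom-right corners, y grows downward

def caption_score(figure: Box, caption: Box, text: str) -> float:
    gap = max(0.0, caption.y0 - figure.y1)                      # distance below the figure
    overlap = min(figure.x1, caption.x1) - max(figure.x0, caption.x0)
    spatial = max(0.0, 1.0 - gap / 50.0) + max(0.0, overlap) / (figure.x1 - figure.x0)
    semantic = 1.0 if text.lower().startswith(("figure", "fig.", "table")) else 0.0
    return spatial + semantic

fig = Box(100, 200, 400, 500)
cands = [(Box(100, 510, 400, 530), "Figure 3: Quarterly revenue."),
         (Box(100, 40, 400, 60), "Confidential - internal use only.")]
best = max(cands, key=lambda c: caption_score(fig, *c))
```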
Authors: Zequn Xie, Guijin Luo, Chuxin Wang, Sihang Cai, Tao Jin, Zhou Zhao, Yixuan Tang
Abstract: Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
Authors: Varun Totakura, Shayok Chakraborty
Abstract: Due to the unprecedented success of deep learning, it has become an integral component in several multimedia computing applications in today's world. Unfortunately, deep learning systems are not perfect and can fail, sometimes abruptly, without prior warning or explanation. While reducing the error rate of deep neural networks has been the primary focus of the multimedia community, the problem of predicting when a deep learning system is going to fail has received significantly less research attention. In this paper, we propose a simple yet effective framework, MetaErr, to address this under-explored problem in deep learning research. We train a meta-model whose goal is to predict whether a base deep neural network will succeed or fail in predicting a particular data sample, by observing the base model's performance on a given learning task. The meta-model is completely agnostic of the architecture and training parameters of the base model. Such an error prediction system can be immensely useful in a variety of smart multimedia applications. Our empirical studies corroborate the promise and potential of our framework against competing baselines. We further demonstrate the usefulness of our framework to improve the performance of pseudo-labeling-based semi-supervised learning, and show that MetaErr outperforms several strong baselines on three benchmark computer vision datasets.
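The meta-model idea, predicting whether the base network will succeed on a given sample from black-box outputs, can be sketched with synthetic data and a simple logistic regression; the confidence and entropy features below are illustrative choices, not necessarily those used by MetaErr.

```python
# Sketch: train a meta-classifier to predict base-model correctness from
# black-box outputs (softmax confidence and entropy). Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
probs = rng.dirichlet(np.ones(10), size=n)               # base-model softmax outputs
confidence = probs.max(axis=1)
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
# Synthetic "was the base model correct" target, loosely tied to confidence;
# in practice this would be (base prediction == ground truth) on held-out data.
correct = (rng.random(n) < confidence).astype(int)

meta_x = np.column_stack([confidence, entropy])
meta_model = LogisticRegression().fit(meta_x, correct)
failure_risk = 1.0 - meta_model.predict_proba(meta_x)[:, 1]   # per-sample failure score
```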
Authors: Yanpei Gong, Beichen Zhang, Hao Wang, Zhaobo Qi, Xinyan Liu, Yuanrong Xu, Ruiyang Gao, Weigang Zhang
Abstract: Remote sensing image change captioning (RSICC) aims to describe the difference between two remote sensing images. While recent methods have explored video modeling, they largely overlook the inherent ambiguities in viewpoint, scale, and prior knowledge, lacking effective constraints on the encoder. In this paper, we present STAND, a Semantic Anchoring Constraint with Dual-Granularity Disambiguation for RSICC, to progressively resolve these ambiguities. Specifically, to establish a reliable feature foundation, we first introduce an interpretable constraint to regularize temporal representations. Operating on these purified features, a dual-granularity disambiguation module resolves spatial uncertainties by coupling macro-level global context aggregation for viewpoint confusion with micro-level frequency-refocused attention for small-object scale enhancement. Ultimately, to translate these visually disambiguated features into precise text, a semantic concept anchoring module leverages language categorical priors to tackle knowledge ambiguity during decoding. Extensive experiments verify the superiority of STAND and its effectiveness in addressing ambiguities.
Authors: Jingxuan Kang, Ziqi Zhang, Shaoming Zheng, Shuang Li, Uday Bharat Patel, Alexander Harry Fitzhugh, Phillip Lung, Yusuf Kiberu, Nikesh Jathanna, Shahnaz Jamil-Copley, Bernhard Kainz, Chen Qin
Abstract: Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.
Authors: Zhaoxiang Liu, Zhicheng Ma, Kaikai Zhao, Kai Wang, Shiguo Lian
Abstract: The Convolutional Neural Networks (CNNs) have been the dominant and effective approach for general computer vision tasks. Recently, Kolmogorov-Arnold neural networks (KANs), based on the Kolmogorov-Arnold representation theorem, have shown potential to replace Multi-Layer Perceptrons (MLPs) in deep learning. KANs, which use learnable nonlinear activations on edges and simple summation on nodes, offer fewer parameters and greater explainability compared to MLPs. However, there has been limited exploration of integrating the Kolmogorov-Arnold representation theorem with convolutional methods for computer vision tasks. Existing attempts have merely replaced learnable activation functions with weights, undermining KANs' theoretical foundation and limiting their potential effectiveness. Additionally, the B-spline curves used in KANs suffer from computational inefficiency and a tendency to overfit. In this paper, we propose a novel Kolmogorov-Arnold Convolutional Layer that deeply integrates the Kolmogorov-Arnold representation theorem with convolution. This layer provides stronger method interpretability because it is based on established mathematical theorems and its design has theoretical alignment. Building on the Kolmogorov-Arnold Convolutional Layer, we design an efficient network architecture called KAConvNet, which outperforms existing methods combining KAN and convolution, and achieves competitive performance compared to mainstream ViTs and CNNs. We believe that our work offers valuable insight into the field of artificial intelligence and will inspire the development of more innovative CNNs in the 2020s. The code is publicly available at https://github.com/UnicomAI/KAConvNet.
Authors: Yahui Li, Yinfeng Yu, Liejun Wang, Shengjie Shen
Abstract: Emotional talking head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.
Authors: Chandravardhan Singh Raghaw, Anushka Parwal, Shahid Shafi Dar, Prajakta Darade, Nagendra Kumar
Abstract: Knee osteoarthritis (KOA) is a degenerative joint disease that can lead to chronic pain, reduced mobility, and long-term disability. Automated severity grading from knee radiographs can support early assessment, but current methods heavily depend on large labeled datasets and remain sensitive to class imbalance, noisy samples, and variability in clinical annotations. To alleviate these limitations, we propose a Hierarchical fusion of Semi-Supervised framework with Self-Supervision (H-SemiS) for KOA severity grading in knee X-ray samples using limited annotated data. Rather than treating severity grading as a flat multi-class problem, H-SemiS decomposes the task into a sequence of binary sub-tasks within a semi-supervised teacher-student architecture, directly mitigating the impact of class imbalance. To further enhance feature learning from unlabeled data, the framework integrates an adversarial self-supervised reconstruction module that encourages the network to capture robust anatomical structures. In parallel, a teacher-student design with quantum-inspired feature mixing improves discrimination boundaries between adjacent grades when pseudo-labels are noisy. We comprehensively evaluate H-SemiS on two challenging multi-class datasets and assess its generalizability on two binary-class datasets. Our experimental results demonstrate the superiority of the proposed H-SemiS framework across multiple evaluation metrics, consistently outperforming several competing baselines and state-of-the-art methods. The code is publicly available at https://github.com/chandravardhan-singh-raghaw/H-SemiS.
URLs: https://github.com/chandravardhan-singh-raghaw/H-SemiS.
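The decomposition of severity grading into a sequence of binary sub-tasks described in the abstract above can be illustrated with ordinal (cumulative) label encoding. A minimal Python sketch; the five-grade KL range and the counting-based decoding are assumptions for illustration, not the released implementation:

```python
import numpy as np

def ordinal_encode(grades, num_grades=5):
    """Encode KL grades 0..4 as cumulative binary targets: column k answers 'grade > k'."""
    grades = np.asarray(grades)
    return (grades[:, None] > np.arange(num_grades - 1)[None, :]).astype(np.float32)

def ordinal_decode(binary_probs, threshold=0.5):
    """Recover a grade by counting how many binary sub-tasks fire."""
    return (np.asarray(binary_probs) > threshold).sum(axis=1)

grades = [0, 2, 4]
targets = ordinal_encode(grades)   # shape (3, 4), one column per binary sub-task
print(targets)
print(ordinal_decode(targets))     # [0 2 4]
```

Each binary sub-task sees a far less imbalanced label distribution than the original five-way problem, which is the motivation the abstract gives for the hierarchical decomposition.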
Authors: Sanghoon Lee, Geon Lee, Hyekang Park, Bumsub Ham
Abstract: Conventional object detectors typically operate under a closed-set assumption, limiting recognition to a predefined set of base classes seen during training. Open-vocabulary object detection (OVD) addresses this limitation by leveraging vision-language models (VLMs) to generate pseudo labels for novel object classes. However, existing OVD methods suffer from two critical drawbacks: (1) inaccurate class label assignments, as VLMs are optimized for image-level predictions rather than the region-level predictions required for pseudo labeling, and (2) unreliable objectness scores from region proposal networks (RPNs) trained exclusively on base object classes. To address these issues, we propose a novel pseudo labeling framework for OVD. Our approach introduces a hierarchical confidence calibration (HCC) technique, which ensures reliable class label estimation by assessing consistency across hierarchical semantic levels (class, super- and sub-category). We also present LoCLIP, a parameter-efficient adaptation of CLIP that incorporates an objectness token to mitigate the base-class bias problem of RPNs and provide reliable objectness estimations for novel object classes. Extensive experiments on standard OVD benchmarks, including COCO and LVIS, demonstrate that our approach clearly sets a new state of the art, validating its effectiveness. Project site: https://cvlab.yonsei.ac.kr/projects/HCC
Authors: He Hu, Tengjin Weng, Zebang Cheng, Yu Wang, Jiachen Luo, Björn Schuller, Zheng Lian, Laizhong Cui
Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.
Authors: Sisipho Hamlomo, Marcellin Atemkeng, Habte Tadesse Likassa, Blaise Ravelo, Thierry Bouwmans, Sébastien Lalléchère, Antoine Vacavant, Ding-Geng Chen
Abstract: Convolutional neural networks (CNNs) have become increasingly difficult to deploy in resource-constrained environments due to their large memory and computational requirements. Although low-rank compression methods can reduce this burden, most existing approaches compress spatial and channel redundancy independently and therefore do not fully exploit the localised structure within convolutional feature maps. This paper proposes a hierarchical spatio-channel low-rank compression framework for CNNs that exploits redundancy across spatial regions and channel activations. Unlike conventional methods, which apply a uniform decomposition across an entire layer, the proposed approach first partitions feature maps into spatial regions, then groups channels according to their co-activation patterns within each region, and finally applies rank-adaptive SVD to each resulting spatio-channel cluster. The method is evaluated on an AlexNet-based brain tumour MRI classification model and compared with Global SVD and Tucker decomposition under 3x and 6x compression budgets. Our method outperforms both baselines, reducing FLOPs from 8.21 G to 1.55 G (81.1% reduction), achieving a 1.38x inference speed-up, and increasing classification accuracy from 87.76% to 89.80%. The method also improves the macro F1-score and performance on challenging classes such as meningioma. A hyper-parameter trade-off analysis demonstrates that the framework provides Pareto-optimal configurations, enabling control over the balance between compression and predictive performance. Moderate clustering with adaptive rank selection yields strong results. Bootstrap standard errors are reported for all classification metrics.
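The rank-adaptive SVD applied to each spatio-channel cluster can be sketched as truncating a weight block to the smallest rank that retains a target fraction of spectral energy; the energy threshold below is an assumed criterion, and the spatial partitioning and channel clustering steps are omitted:

```python
import numpy as np

def rank_adaptive_svd(block, energy=0.95):
    """Factor a 2D weight block W ~= U_r @ V_r, keeping the smallest rank whose
    singular values capture `energy` of the total spectral energy."""
    U, S, Vt = np.linalg.svd(block, full_matrices=False)
    cum = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(cum, energy)) + 1
    U_r = U[:, :r] * S[:r]          # absorb singular values into the left factor
    V_r = Vt[:r, :]
    return U_r, V_r, r

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256))
U_r, V_r, r = rank_adaptive_svd(W, energy=0.90)
print(r, np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))  # chosen rank, relative error
```

Applying the same routine independently to each spatio-channel cluster, rather than once per layer, is what lets the rank adapt to local redundancy.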
Authors: Zhe Wang, Qijin Song, Zihao Li, Jingyu Xiao, Weibang Bai
Abstract: Accurate 6-DoF pose estimation of objects is critical for robots to perform precise manipulation tasks. However, for dynamic object pose estimation, conventional camera-based approaches face several major challenges, such as motion blur, sensor noise, and low-light limitation. To address these issues, we employ event cameras, whose high dynamic range and low latency offer a promising solution. Furthermore, we propose a keypoint-based detection and tracking approach for dynamic object pose estimation. Firstly, a keypoint detection network is constructed to extract keypoints from the time surface generated by the event stream. Subsequently, the polarity and spatial coordinates of the events are leveraged, and the event density in the vicinity of each keypoint is utilized to achieve continuous keypoint tracking. Finally, a hash mapping is established between the 2D keypoints and the 3D model keypoints, and the EPnP algorithm is employed to estimate the 6-DoF pose. Experimental results demonstrate that, whether in simulated or real event environments, the proposed method outperforms the event-based state-of-the-art methods in terms of both accuracy and robustness.
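The final pose step described above, solving EPnP over 2D-3D keypoint correspondences, can be illustrated with OpenCV's solver; the model keypoints, camera intrinsics, and ground-truth pose below are fabricated for the example (requires opencv-python):

```python
import numpy as np
import cv2

# Hypothetical 3D model keypoints (object frame) and a pinhole camera.
object_pts = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1],
                       [0.1, 0.1, 0], [0.1, 0, 0.1]], dtype=np.float64)
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float64)

# Project with a known pose to fabricate the "tracked 2D keypoints".
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.01, 0.5])
image_pts, _ = cv2.projectPoints(object_pts, rvec_gt, tvec_gt, K, None)

# EPnP recovers the 6-DoF pose from the 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
print(ok, rvec.ravel(), tvec.ravel())
```

In the paper's pipeline the 2D points come from keypoints tracked on event time surfaces and the 3D points from the hash-mapped model keypoints; the solver call itself is the standard EPnP interface shown here.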
Authors: Sheng-Wei Chan, Xin-Jui Pan, Chun-Po Shen, Chia-Min Lin, Yung-Che Wang, Jen-Shiun Chiang
Abstract: High-performance semantic segmentation has achieved significant progress in recent years, often driven by increasingly large backbones and higher computational budgets. While effective, such approaches introduce substantial computational overhead and limit accessibility under constrained hardware settings. In this paper, we propose DGM-Net (Directional Geometric Mamba Network), an efficient architecture that improves modeling capability through structural design rather than increasing model capacity. We introduce Directional Geometric Mamba (G-Mamba), a linear-complexity O(N) operator as an alternative to conventional context modeling modules such as ASPP and PPM. To further enhance structural awareness in state space model (SSM)-based modeling, we design the DGM-Module, which extracts centripetal flow fields and topological skeletons to guide the scanning process and improve boundary preservation. Without relying on large-scale pretraining or heavy backbone scaling, DGM-Net achieves 80.8% mIoU within 28k iterations, 82.3% mIoU on Cityscapes test set, and 45.24% mIoU on ADE20K. In addition, the model maintains stable performance under constrained hardware settings (e.g., batch size of 2 on 8GB VRAM), highlighting its efficiency and practicality. These results demonstrate that incorporating geometric guidance into SSM-based architectures provides an effective and resource-efficient direction for semantic segmentation.
Authors: Giorgio Cruciata, Luca Cruciata, Liliana Lo Presti, Jan Van Gemert, Marco La Cascia
Abstract: This paper proposes a new method to improve the training efficiency of deep convolutional neural networks. During training, the method evaluates scores to measure how much each layer's parameters change and whether the layer will continue learning or not. Based on these scores, the network is scaled down such that the number of parameters to be learned is reduced, yielding a speed up in training. Unlike state-of-the-art methods that try to compress the network to be used in the inference phase or to limit the number of operations performed in the backpropagation phase, the proposed method is novel in that it focuses on reducing the number of operations performed by the network in the forward propagation during training. The proposed training strategy has been validated on two widely used architecture families: VGG and ResNet. Experiments on MNIST, CIFAR-10 and Imagenette show that, with the proposed method, the training time of the models is more than halved without significantly impacting accuracy. The FLOPs reduction in the forward propagation during training ranges from 17.83% for VGG-11 to 83.74% for ResNet-152. These results demonstrate the effectiveness of the proposed technique in speeding up learning of CNNs. The technique will be especially useful in applications where fine-tuning or online training of convolutional models is required, for instance because data arrive sequentially.
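A simplified way to act on per-layer change scores is sketched below in PyTorch: score each parameter tensor by its relative change since a snapshot and stop updating the stale ones. Note that this sketch only freezes gradients, whereas the paper scales the network down to also save forward-pass operations; the score definition and the threshold are assumptions:

```python
import torch
import torch.nn as nn

def layer_change_scores(model, prev_state):
    """Relative L2 change of each parameter tensor since the previous snapshot."""
    scores = {}
    for name, p in model.named_parameters():
        prev = prev_state[name]
        scores[name] = (p.detach() - prev).norm().item() / (prev.norm().item() + 1e-12)
    return scores

def freeze_stale_layers(model, scores, threshold=1e-3):
    """Stop computing gradients for parameters whose change score fell below threshold."""
    for name, p in model.named_parameters():
        if scores[name] < threshold:
            p.requires_grad_(False)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
prev_state = {k: v.clone() for k, v in dict(model.named_parameters()).items()}

# A dummy update stands in for a training epoch.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
opt.step()

scores = layer_change_scores(model, prev_state)
freeze_stale_layers(model, scores)
print({k: round(v, 6) for k, v in scores.items()})
```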
Authors: Shengzhi Li, Jiarun Chen, Karun Sharma, Jiaqi Su, Shichao Pei
Abstract: Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads -- weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com/
URLs: https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262)
Authors: Md. Afzalur Rahaman, Tahmid Rahman
Abstract: Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.
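The weighted-fusion variant, with a projection layer mapping the differently-sized RGB and flow features to a common dimensionality and two learned stream weights, can be sketched as follows; the feature sizes, the softmax-normalized weighting, and the classifier head are illustrative assumptions:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Project RGB and flow features to a shared width, then mix them with
    two learned, softmax-normalized scalar stream weights."""
    def __init__(self, rgb_dim=192, flow_dim=1280, common_dim=256, num_classes=11):
        super().__init__()
        self.proj_rgb = nn.Linear(rgb_dim, common_dim)
        self.proj_flow = nn.Linear(flow_dim, common_dim)
        self.stream_logits = nn.Parameter(torch.zeros(2))   # learned stream weights
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, rgb_feat, flow_feat):
        w = torch.softmax(self.stream_logits, dim=0)
        fused = w[0] * self.proj_rgb(rgb_feat) + w[1] * self.proj_flow(flow_feat)
        return self.classifier(fused), w

model = WeightedFusion()
logits, weights = model(torch.randn(4, 192), torch.randn(4, 1280))
print(logits.shape, weights.detach())   # torch.Size([4, 11]), initially equal weights
```

The softmax over the two logits is what yields the interpretable per-stream contributions reported in the abstract (e.g. RGB 0.507 vs. flow 0.493 on UCF11).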
Authors: Emre Ardıç, Yakup Genç
Abstract: Federated learning (FL) is a distributed machine learning method where multiple devices collaboratively train a model under the management of a central server without sharing underlying data. One of the key challenges of FL is the communication bottleneck caused by variations in connection speed and bandwidth across devices. Therefore, it is essential to reduce the size of transmitted data during training. Additionally, there is a potential risk of exposing sensitive information through the model or gradient analysis during training. To address both privacy and communication efficiency, we combine differential privacy (DP) and adaptive quantization methods. We use Laplacian-based DP to preserve privacy, which is relatively underexplored in FL and offers tighter privacy guarantees than Gaussian-based DP. We propose a simple and efficient global bit-length scheduler using round-based cosine annealing, along with a client-based scheduler that dynamically adapts based on client contribution estimated through dataset entropy analysis. We evaluate our approach through extensive experiments on CIFAR10, MNIST, and medical imaging datasets, using non-IID data distributions across varying client counts, bit-length schedulers, and privacy budgets. The results show that our adaptive quantization methods reduce total communicated data by up to 52.64% for MNIST, 45.06% for CIFAR10, and 31% to 37% for medical imaging datasets compared to 32-bit float training while maintaining competitive model accuracy and ensuring robust privacy through differential privacy.
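The round-based cosine-annealing bit-length schedule and the Laplacian perturbation can each be sketched in a few lines; the bit range, the downward direction of the annealing, and the sensitivity/epsilon values are assumptions, not the paper's exact settings:

```python
import math
import numpy as np

def cosine_bit_schedule(round_idx, total_rounds, b_max=16, b_min=4):
    """Anneal the quantization bit-length from b_max down to b_min over training rounds."""
    cos = 0.5 * (1 + math.cos(math.pi * round_idx / max(total_rounds - 1, 1)))
    return int(round(b_min + (b_max - b_min) * cos))

def laplace_perturb(update, sensitivity=1.0, epsilon=1.0, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon to a (clipped) model update."""
    if rng is None:
        rng = np.random.default_rng()
    return update + rng.laplace(scale=sensitivity / epsilon, size=update.shape)

for r in [0, 25, 50, 75, 99]:
    print(r, cosine_bit_schedule(r, 100))      # bit-length per communication round
print(laplace_perturb(np.zeros(5), sensitivity=0.5, epsilon=2.0))
```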
Authors: Soulayma Gazzeh, Giuseppe Mazzola, Liliana Lo Presti, Marco La Cascia
Abstract: Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real-world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere-Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective-based model, Depth Anything, and of spherical-aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration-based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: https://github.com/sgazzeh/Sphere_depth
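The depth-calibration protocol, converting relative predictions into metric depth via supervised learned scaling, can be illustrated with a least-squares fit; the affine scale-and-shift form and the synthetic data are assumptions for the example:

```python
import numpy as np

def fit_scale_shift(pred_rel, gt_metric):
    """Least-squares scale s and shift t so that s * pred + t approximates metric depth."""
    A = np.stack([pred_rel.ravel(), np.ones(pred_rel.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt_metric.ravel(), rcond=None)
    return s, t

rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=1000)              # metric depths in meters
pred = 0.3 * gt + 0.1 + rng.normal(0, 0.05, 1000)   # a model's uncalibrated relative output
s, t = fit_scale_shift(pred, gt)
print(round(s, 3), round(t, 3))                     # recovers roughly s=3.33, t=-0.33
```

Fitting the factors per model, as the abstract describes, is what makes errors comparable between relative-depth predictors and metric-depth predictors.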
Authors: Azmul A. Irfan, Nur Ahmad Khatim, Alfan Alfian Irfan, Achmad Zaki, Erike A. Suwarsono, Mansur M. Arief
Abstract: Radiographic grading of knee osteoarthritis (KOA) with the Kellgren-Lawrence (KL) system is limited by inter-reader variability and the opacity of current deep learning approaches, which predict KL grades directly from images without decomposing structural features. We present Knee-xRAI, a modular framework that independently quantifies the three cardinal radiographic features of KOA (joint space narrowing [JSN], osteophytes, and subchondral sclerosis) and integrates them into an explainable KL grade classification. The pipeline combines U-Net++ segmentation for contour-based JSN measurement, an SE-ResNet-50 network for per-site osteophyte grading (OARSI scale), and a hybrid texture-CNN classifier for binary sclerosis quantification. The resulting 50-dimensional structured feature vector feeds two complementary classification paths. An XGBoost path supports SHAP-based feature attribution. A ConvNeXt hybrid path combines the structured vector with a full-image encoder for enhanced predictive performance. Evaluated on 8,260 radiographs from an OAI-derived dataset, the JSN module achieved a Dice coefficient of 0.8909 and an mJSW intraclass correlation of 0.8674 against manual annotations. The ConvNeXt hybrid path reached a test quadratic weighted kappa (QWK) of 0.8436 and AUC of 0.9017. The transparent XGBoost path achieved a test QWK of 0.6294 with full feature-level audit capability. Ablation confirmed JSN as the dominant predictor (QWK = 0.6103 alone), with osteophyte features providing consistent incremental gain (+0.0183) and sclerosis contributing marginally. Inference-time ablation of Path B confirmed the structured pathway contributes materially beyond the image encoder, with QWK drops of 0.098 (feature zeroing) and 0.284 (feature-image permutation). Knee-xRAI explicitly quantifies all three KL-defining radiographic features within a single auditable pipeline.
Authors: Linyuan Wang, Haibo Yao, Te-Ming Tseng, Kelvin Betitame, Xin Sun, Hanbo Huang, Dong Chen
Abstract: Weeds compete with crops for light, water, and nutrients, reducing yield and crop quality. Efficient weed detection is essential for site-specific weed management (SSWM). Although deep learning models have been deployed on UAV-based edge systems, a systematic understanding of how different model architectures perform under real-world resource constraints is still lacking. To address this gap, this study proposes a deployment-oriented framework for real-time UAV-based weed detection on resource-constrained edge platforms. The framework integrates UAV data acquisition, model development, and on-device inference, with a focus on balancing detection accuracy and computational efficiency. A diverse set of state-of-the-art object detection models is evaluated, including convolution-based YOLO models (v8-v12) and transformer-based RT-DETR models (v1-v2). Experiments on three edge devices (Jetson Orin Nano, Jetson AGX Xavier, and Jetson AGX Orin) demonstrate clear trade-offs between accuracy and inference latency across models and hardware configurations. Results show that high-capacity models achieve up to 86.9% mAP50 but suffer from high latency, limiting real-time deployment. In contrast, lightweight models achieve 66%-71% mAP50 with significantly lower latency, enabling real-time performance. Among all models, RT-DETRv2-R50-M achieves competitive accuracy (79% mAP50) with improved efficiency, while YOLOv10n provides the fastest inference speed. YOLOv11s and RT-DETRv2-R50-M offer the best balance between accuracy and speed, making them strong candidates for real-time UAV deployment.
Authors: Jainum Sanghavi
Abstract: Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream. The result is that a classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.
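The single-direction ablation used in the causal analysis amounts to projecting every token activation onto the orthogonal complement of the direction a linear probe reads. A minimal sketch, with random stand-ins for the activations and the probe weights:

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation vector along a unit direction."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

rng = np.random.default_rng(0)
acts = rng.standard_normal((196, 768))       # 196 patch tokens, ViT-B/16 hidden size
probe_dir = rng.standard_normal(768)         # stand-in for a learned linear-probe weight
ablated = ablate_direction(acts, probe_dir)
# The component along the probe direction is (numerically) zero after ablation.
print(np.abs(ablated @ (probe_dir / np.linalg.norm(probe_dir))).max())
```

Comparing probe performance before and after this projection, and contrasting it with ablating an arbitrary direction, is the kind of specificity test the abstract reports.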
Authors: Kazuya Nishimura, Ryoma Bise, Haruka Hirose, Yasuhiro Kojima
Abstract: Deep learning-based nuclei segmentation and classification in pathology images typically rely on large-scale pixel-level manual annotations, which are costly and difficult to obtain across diverse tissues and staining conditions. To address this limitation, we propose a framework that leverages spatial transcriptomics (ST) data as supervision for nuclei segmentation and classification. By incorporating cell-level ST data, we obtain gene expression profiles and corresponding nuclear masks from histopathological images. Gene expression profiles are converted into cell-type labels and used as training data for image-based classification. Because existing gene expression-based cell-type classification methods are not designed for image recognition, we introduce an image-oriented classification approach that bridges gene expression-based cell typing and image-based cell classification. To evaluate generalization, we conduct segmentation experiments on previously unseen organs and compare our method with conventional supervised models. Despite being trained on fewer organ types, our framework achieves higher segmentation accuracy, demonstrating strong transferability. Classification experiments further show consistent improvements over existing approaches.
Authors: Dong Huo, Tristan Aumentado-Armstrong, Samrudhdhi B. Rangrej, Maitreya Suin, Angela Ning Ye, Zhiming Hu, Amanpreet Walia, Amirhossein Kazerouni, Konstantinos G. Derpanis, Iqbal Mohomed, Alex Levinshtein
Abstract: Burst image super resolution (BISR) aims to construct a single high-resolution (HR) image by aggregating information from multiple low-resolution (LR) frames, relying on temporal redundancy and spatial coherence across the burst. While conventional methods achieve impressive results, they often struggle with complex textures and oversmoothing. Diffusion models, particularly those pretrained on high-quality data, have shown remarkable capability in generating realistic details for image and video super-resolution. However, their potential remains largely under-explored in BISR, where existing approaches typically rely on task-specific diffusion models trained from scratch and operate on single-frame reconstructions. In this work, we propose BurstGP, a novel diffusion-based solution for BISR, which leverages generative priors of recent foundation models to overcome these issues. In particular, we build a multiframe-aware diffusion model on top of a conventional BISR approach, which boosts image quality with minimal loss to fidelity. Further, we introduce (i) a novel degradation-aware conditioning mechanism, which controls synthesis of fine details based on the estimated degradation in the input, and (ii) a robust sRGB-to-lRGB inverter, enabling us to utilize generative multiframe (video) sRGB priors, while operating with raw input and lRGB output images. Empirically, we demonstrate that BurstGP outperforms the existing state of the art, both quantitatively (especially with respect to perceptual metrics, including MUSIQ and LPIPS) and qualitatively. In particular, our proposed method excels at recovering richer textures and finer structural details, highlighting the potential of video priors for BISR over traditional methods.
Authors: Jingni Huang, Peter Bloodsworth
Abstract: Short-term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion-aware human-computer interaction [1-3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics [4-5]. This paper investigates whether facial expression-derived emotion embeddings can provide auxiliary conditional signals for short-term pose prediction. To further evaluate multimodal conditioning in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15-step rolling pose prediction. This framework combines pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive unfolding prediction using a recurrent sequence model based on a two-layer LSTM architecture. Experiments were conducted on two small-scale pose-emotion video datasets: controlled motion sequences with minimal facial expression changes, and natural emotion-driven motion sequences with considerable facial expression changes. The results show that simple multimodal fusion does not consistently improve prediction accuracy, while normalized gating fusion significantly enhances performance on emotion-driven motion sequences. Furthermore, counterfactual perturbation experiments demonstrate that the predicted trajectory exhibits measurable sensitivity to changes in multimodal input, suggesting that facial expression embeddings act as auxiliary conditional signals rather than redundant features. In summary, these results indicate that incorporating facial expression-derived emotion embeddings into emotion-conditional short-term pose prediction based on a lightweight predictive world model architecture is a feasible approach.
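The learnable gating fusion of pose keypoints with emotion embeddings, followed by 15-step autoregressive unrolling of a two-layer LSTM, can be sketched as below in PyTorch; the dimensions, the sigmoid gate, and the rollout details are assumptions consistent with the description rather than the exact implementation:

```python
import torch
import torch.nn as nn

class GatedPosePredictor(nn.Module):
    def __init__(self, pose_dim=34, emo_dim=128, hidden=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(pose_dim + emo_dim, pose_dim), nn.Sigmoid())
        self.emo_proj = nn.Linear(emo_dim, pose_dim)
        self.lstm = nn.LSTM(pose_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, pose_seq, emo, steps=15):
        # The gate decides, per pose coordinate, how much emotion conditioning to mix in.
        emo_seq = emo.unsqueeze(1).expand(-1, pose_seq.size(1), -1)
        g = self.gate(torch.cat([pose_seq, emo_seq], dim=-1))
        x = pose_seq + g * self.emo_proj(emo_seq)
        _, state = self.lstm(x)                      # condition on the observed sequence
        preds, last = [], pose_seq[:, -1:]
        for _ in range(steps):                       # autoregressive 15-step rollout
            out, state = self.lstm(last, state)
            last = self.head(out)
            preds.append(last)
        return torch.cat(preds, dim=1)

model = GatedPosePredictor()
future = model(torch.randn(2, 10, 34), torch.randn(2, 128))
print(future.shape)    # torch.Size([2, 15, 34])
```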
Authors: Haosen Li, Wenshuo Chen, Shaofeng Liang, Lei Wang, Kaishen Yuan, Yutao Yue
Abstract: Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce $Z^2$-Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE's temporal coherence, $Z^2$-Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that $Z^2$-Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).
Authors: Haosen Li, Wenshuo Chen, Lei Wang, Shaofeng Liang, Haozhe Jia, Yutao Yue
Abstract: Text-to-image diffusion models have achieved remarkable generative capabilities, yet accurately aligning complex textual prompts with synthesized layouts remains an ongoing challenge. In these models, the initial Gaussian noise acts as a critical structural seed dictating the macroscopic layout. Recent online optimization and search methods attempt to refine this noise to enhance text-image alignment. However, relying on unconstrained Euclidean gradient ascent mathematically inflates the latent norm and destroys the standard Gaussian prior, causing severe visual artifacts like color over-saturation. Furthermore, these methods suffer from inefficient semantic routing and easily fall into the "reward hacking" trap of external proxy models. To address these intertwined bottlenecks, we propose Oracle Noise, a zero-shot framework reframing noise initialization as semantic-driven optimization strictly confined to a Riemannian hypersphere. Instead of relying on complex external parsers, we directly identify the most impactful structural words in the prompt to efficiently route optimization energy. By updating the noise strictly along a spherical path, we mathematically preserve the original Gaussian distribution. This geometric constraint eliminates norm inflation and unlocks aggressive step sizes for rapid convergence. Extensive experiments demonstrate that Oracle Noise significantly accelerates semantic alignment and achieves superior aesthetics without black-box models. It completely mitigates Euclidean-induced degradation, establishing state-of-the-art performance across human preference metrics (e.g., HPSv2, ImageReward), semantic alignment (CLIP Score), and sample diversity, all within a strict 2-second optimization budget.
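Updating the initial noise strictly along a spherical path, by projecting a semantic gradient onto the tangent space and rescaling back to the original norm, can be sketched as follows; the reward gradient is a random stand-in and the step size is arbitrary:

```python
import torch

def spherical_step(noise, grad, step_size=0.1):
    """Move noise along the sphere of radius ||noise||: project the gradient onto the
    tangent space at noise, take a step, then rescale back to the original norm."""
    radius = noise.norm()
    unit = noise / radius
    tangent = grad - (grad * unit).sum() * unit     # remove the radial component
    updated = noise + step_size * tangent
    return updated / updated.norm() * radius        # back onto the sphere

torch.manual_seed(0)
z = torch.randn(4 * 64 * 64)             # flattened latent noise
g = torch.randn_like(z)                  # stand-in for a semantic reward gradient
z_new = spherical_step(z, g)
print(z.norm().item(), z_new.norm().item())   # norms match: no norm inflation
```

Because the norm never changes, the update cannot inflate the latent away from the Gaussian-typical shell, which is the degradation mode the abstract attributes to unconstrained Euclidean ascent.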
Authors: Weihao Li, Hongjin Zhao, Gao Zhu, Ge-Peng Ji, Nicholas Wilson, Marta Yebra, Nick Barnes
Abstract: Wildfires are an escalating global concern due to the devastating impacts on the environment, economy, and human health, with notable incidents such as the 2019-2020 Australian bushfires and the 2025 California wildfires underscoring the severity of these events. AI-enabled camera-based smoke detection has emerged as a promising approach for the rapid detection of wildfires. However, existing wildfire smoke segmentation datasets that are used for training detection and segmentation models are limited in scale, geographically constrained, and often rely on synthetic imagery, which hinders effective training and generalization. To overcome these limitations, we present AusSmoke, a new smoke segmentation dataset collected from Australia to address the data scarcity in this region. Furthermore, we introduce a MultiNational geographically diverse and substantially larger fully-labelled benchmark, called MultiNatSmoke, that consolidates publicly available international datasets with the newly collected Australian imagery, expanding the scale by an order of magnitude over previous collections. Finally, we benchmark smoke segmentation models, demonstrating improved performance and enhanced generalization across diverse geographical contexts. The project is available at https://github.com/henryzhao0615/MultiNatSmoke.
Authors: Zhuoqi Lyu, Qing Ke
Abstract: Optical chemical structure recognition (OCSR) translates molecular images into machine-readable representations like SMILES strings or molecular graphs, but remains challenging in real-world documents due to inexhaustible variations in chemical structures, shorthand conventions, and visual noise. Most existing deep-learning-based approaches rely on teacher forcing with token-level Maximum Likelihood Estimation (MLE). This training paradigm suffers from exposure bias, as models are trained under ground-truth prefixes but must condition on their own previous predictions during inference. Moreover, token-level MLE objectives hinder the optimization towards molecular-level evaluation criteria such as chemical validity and structural similarity. Here we introduce Minimum Risk Training (MRT) to OCSR and propose COMO (Closed-loop Optical Molecule recOgnition), a closed-loop framework that mitigates exposure bias by directly optimizing over molecule-level, non-differentiable objectives, by iteratively sampling and evaluating the model's own predictions. Experiments on ten benchmarks including synthetic and real-world chemical diagrams from patent and scientific literature demonstrate that COMO substantially outperforms existing rule-based and learning-based methods with less training data. Ablation studies further show that MRT is architecture-agnostic, demonstrating its potential for broad application to end-to-end OCSR systems.
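Minimum Risk Training replaces the token-level MLE objective with the expected molecule-level risk over candidates sampled from the model itself. A minimal sketch of the loss, assuming per-candidate sequence log-probabilities and a molecule-level risk such as 1 minus structural similarity to the target; the temperature and the toy numbers are illustrative:

```python
import torch

def mrt_loss(candidate_logprobs, risks, alpha=1.0):
    """Expected risk under the renormalized model distribution over sampled candidates.
    candidate_logprobs: (batch, k) summed token log-probs of each sampled SMILES.
    risks: (batch, k) molecule-level risk, e.g. 1 - structural similarity to the target."""
    q = torch.softmax(alpha * candidate_logprobs, dim=-1)   # renormalize over the k samples
    return (q * risks).sum(dim=-1).mean()

logp = torch.tensor([[-5.0, -7.0, -9.0]], requires_grad=True)
risk = torch.tensor([[0.1, 0.8, 0.4]])                      # lower risk is better
loss = mrt_loss(logp, risk)
loss.backward()
print(loss.item(), logp.grad)   # gradient shifts probability mass toward low-risk samples
```

Because the candidates are drawn from the model's own predictions, the objective is evaluated on the same distribution the model sees at inference time, which is how the closed-loop formulation addresses exposure bias.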
Authors: Shaohua Liu, Ning Gao, Zuoya Gu, Hongkun Dou, Yue Deng, Hongjue Li
Abstract: Reconstructing realistic underwater scenes from underwater video remains a meaningful yet challenging task in the multimedia domain. The inherent spatiotemporal degradations in underwater imaging, including caustics, flickering, attenuation, and backscattering, frequently result in inaccurate geometry and appearance in existing 3D reconstruction methods. While a few recent works have explored underwater degradation-aware reconstruction, they often address either spatial or temporal degradation alone, falling short in more real-world underwater scenarios where both types of degradation occur. We propose MarineSTD-GS, a novel 3D Gaussian Splatting-based framework that explicitly models both temporal and spatial degradations for realistic underwater scene reconstruction. Specifically, we introduce two paired Gaussian primitives: Intrinsic Gaussians represent the true scene, while Degraded Gaussians render the degraded observations. The color of each Degraded Gaussian is physically derived from its paired Intrinsic Gaussian via a Spatiotemporal Degradation Modeling (SDM) module, enabling self-supervised disentanglement of realistic appearance from degraded images. To ensure stable training and accurate geometry, we further propose a Depth-Guided Geometry Loss and a Multi-Stage Optimization strategy. We also construct a simulated benchmark with diverse spatial and temporal degradations and ground-truth appearances for comprehensive evaluation. Experiments on both simulated and real-world datasets show that MarineSTD-GS robustly handles spatiotemporal degradations and outperforms existing methods in novel view synthesis with realistic, water-free scene appearances.
Authors: Tianyidan Xie, Zhentao Huang, Mingjie Wang, Xin Huang, Jun Zhou, Minglun Gong, Zili Yi
Abstract: Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.
Authors: Zehua Cheng, Wei Dai, Jiahao Sun
Abstract: Multi-modal retrieval-augmented generation (MRAG) systems retrieve visual evidence from large image corpora to ground the responses of large multi-modal models, yet the retrieved images frequently contain human faces whose identities constitute sensitive personal information. Existing anonymization techniques either destroy the non-identity visual cues that downstream reasoning depends on or fail to provide principled privacy guarantees. We propose Identity-Decoupled MRAG, a framework that interposes a generative anonymization module between retrieval and generation. Our approach consists of three components: (i) a disentangled variational encoder that factorizes each face into an identity code and a spatially-structured attribute code, regularized by a mutual-information penalty and a gradient-based independence term; (ii) a manifold-aware rejection sampler that replaces the identity code with a synthetic one guaranteed to be both distinct from the original and realistic; and (iii) a conditional latent diffusion generator that synthesizes the anonymized face from the replacement identity and the preserved attributes, distilled into a latent consistency model for low-latency deployment. Privacy is enforced through a multi-oracle ensemble of face recognition models with a hinge-based loss that halts optimization once identity similarity drops below the impostor-regime threshold.
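The hinge-based privacy objective, which stops pushing once identity similarity drops below the impostor-regime threshold, can be sketched as below; a single recognizer embedding is shown in place of the multi-oracle ensemble, and the threshold is an assumed value:

```python
import torch
import torch.nn.functional as F

def identity_hinge_loss(anon_emb, orig_emb, threshold=0.3):
    """Penalize cosine similarity to the original identity only while it exceeds the
    impostor-regime threshold; below the threshold the loss (and gradient) is zero."""
    sims = F.cosine_similarity(anon_emb, orig_emb, dim=-1)
    return torch.clamp(sims - threshold, min=0.0).mean()

orig = F.normalize(torch.randn(4, 512), dim=-1)           # original-face embeddings
anon = F.normalize(orig + 0.5 * torch.randn(4, 512), dim=-1)  # anonymized-face embeddings
print(identity_hinge_loss(anon, orig).item())
```

With an ensemble, the same hinge would be applied per recognizer (or to the worst-case similarity), so optimization halts only when every oracle places the anonymized face in the impostor regime.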
Authors: Zhen Ye, Xu Tan, Aoxiong Yin, Hongzhan Lin, Guangyan Zhang, Peiwen Sun, Yiming Li, Chi-Min Chan, Wei Ye, Shikun Zhang, Wei Xue
Abstract: Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.
Authors: Simone Mosco, Daniel Fusaro, Alberto Pretto
Abstract: Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects is crucial in real-world environments, as done in Anomaly Segmentation. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To cover this lack, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at https://simom0.github.io/lido-page/.
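One common way to model the feature distribution of inlier classes and score anomalies directly in feature space is a per-class Gaussian with a Mahalanobis-distance score; this is an illustrative stand-in rather than the paper's exact formulation:

```python
import numpy as np

def fit_class_gaussians(feats, labels, eps=1e-3):
    """Per-class means and a regularized shared precision matrix from inlier features."""
    means = {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}
    centered = np.concatenate([feats[labels == c] - means[c] for c in means])
    cov = np.cov(centered, rowvar=False) + eps * np.eye(feats.shape[1])
    return means, np.linalg.inv(cov)

def anomaly_score(x, means, precision):
    """Minimum Mahalanobis distance to any inlier class; large means anomalous."""
    return min(float((x - m) @ precision @ (x - m)) for m in means.values())

rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 1, (200, 16)), rng.normal(5, 1, (200, 16))])
labels = np.array([0] * 200 + [1] * 200)
means, prec = fit_class_gaussians(feats, labels)
print(anomaly_score(rng.normal(0, 1, 16), means, prec),    # inlier-like: small score
      anomaly_score(rng.normal(20, 1, 16), means, prec))   # far-away point: large score
```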
Authors: Manish Kumar, Rajendra K. Ray
Abstract: Partial Differential Equation (PDE)-based approaches have gained significant attention in image despeckling due to their strong capability to preserve structural details while suppressing noise. However, conventional second-order PDE models tend to generate blocky artifacts, whereas higher-order models often introduce speckle patterns. To resolve these issues, this paper proposes and comparatively analyzes two advanced PDE-based frameworks designed for speckle noise suppression while preserving fine edges. The first model introduces a novel weighted formulation that combines second and fourth-order PDEs through a weighting parameter. The second-order diffusion coefficient employs grayscale and gradient-based indicators, while the fourth-order term is guided solely by a Laplacian-based indicator. The second model constructs a coupled PDE framework, where independent fourth and second-order components are explicitly solved in an iterative manner. In this coupled structure, each diffusion coefficient is defined separately to enhance adaptability in varying image regions. Both models are implemented using the explicit finite difference method. The proposed techniques are extensively evaluated on a variety of datasets, including standard grayscale, color, Synthetic Aperture Radar (SAR), and ultrasound images. Comparative experiments with the existing Telegraph Diffusion Model (TDM) and Fourth-Order Telegraph Diffusion Model (TDFM) demonstrate the superiority of the proposed approaches in reducing speckle noise while effectively preserving fine image structures and edges. Quantitative evaluations using PSNR, SSIM and Speckle Index metrics confirm that the proposed models produce higher image quality and enhanced visual perception. Overall, the presented PDE-based formulations provide a reliable and efficient framework for image despeckling in both natural and medical imaging.
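The first model's weighted combination of second- and fourth-order diffusion under an explicit finite-difference scheme can be sketched as below; the diffusion coefficients (simple gradient- and Laplacian-driven functions) and the weighting are stand-ins for the paper's grayscale, gradient, and Laplacian indicators:

```python
import numpy as np

def laplacian(u):
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
            np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)

def despeckle_step(u, weight=0.7, k2=0.1, k4=0.1, dt=0.05):
    """One explicit finite-difference update: a weighted mix of second-order diffusion
    (gradient-stopped) and fourth-order diffusion (Laplacian-guided)."""
    gx = 0.5 * (np.roll(u, -1, 1) - np.roll(u, 1, 1))
    gy = 0.5 * (np.roll(u, -1, 0) - np.roll(u, 1, 0))
    c2 = 1.0 / (1.0 + (gx**2 + gy**2) / k2**2)   # second-order edge-stopping coefficient
    lap = laplacian(u)
    c4 = 1.0 / (1.0 + (lap / k4)**2)             # fourth-order Laplacian-guided coefficient
    second = c2 * lap                            # smoothing term
    fourth = -laplacian(c4 * lap)                # term suppressing blocky artifacts
    return u + dt * (weight * second + (1.0 - weight) * fourth)

rng = np.random.default_rng(0)
img = np.full((64, 64), 0.5)
noisy = img * rng.gamma(4.0, 0.25, img.shape)    # multiplicative speckle-like noise
out = noisy.copy()
for _ in range(30):
    out = despeckle_step(out)
print(float(np.var(noisy)), float(np.var(out)))  # variance drops after diffusion
```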
Authors: Peng Chen, Wenxuan He, Feng Qian, Guangyao Shi, Jingwen Yan
Abstract: In the hyperspectral image (HSI) classification task, each pixel is categorized into a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have utilized a multi-scale vision transformer (ViT) to enhance spectral feature capture and yield promising results. However, most existing methods still face challenges in the effective joint use of spatial-spectral information and in preserving information across layers during the propagation process. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively utilizes CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information in the previous network layers. Extensive experiments are conducted on several representative datasets to demonstrate the superior performance of our proposed method compared to the state-of-the-art works. Code is available at https://github.com/chenpeng052/SCT-Net.git.
Authors: Chunyu Li, Jiaye Li, Ruiqiao Mei, Haoyuan Xia, Hao Zhu, Jingdong Wang, Siyu Zhu
Abstract: Real-time text-driven joint audio-video avatar generation requires jointly synthesizing portrait video and speech with high fidelity and precise synchronization, yet existing audio-visual diffusion models remain too slow for interactive use and often degrade noticeably after aggressive acceleration. We present Hallo-Live, a streaming framework for joint audio-visual avatar generation that combines asynchronous dual-stream diffusion with human-centric preference-guided distillation. To reduce articulation lag in causal generation, we introduce Future-Expanding Attention, which allows each video block to access synchronous audio together with a short horizon of future phonetic cues. To mitigate the quality loss of few-step distillation, we further propose Human-Centric Preference-Guided DMD (HP-DMD), which reweights training samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization. On two NVIDIA H200 GPUs, Hallo-Live runs at 20.38 FPS with 0.94 seconds latency, yielding 16.0x higher throughput and 99.3x lower latency than the teacher model Ovi. Despite this speedup, it retains strong generation quality, reaching comparable VideoAlign overall score and Sync Confidence score while outperforming other accelerated baselines in the overall quality-efficiency trade-off. Qualitative results further show robust generalization across photorealistic, multi-speaker, and stylized scenarios. To the best of our knowledge, Hallo-Live is the first framework to combine streaming dual-stream diffusion with preference-guided distillation for real-time, text-driven audio-visual generation.
Authors: Francesco Olivato, Cigdem Beyan, Vittorio Murino
Abstract: In this work, we study Source-Free Unsupervised Domain Adaptation under corruption-induced domain shifts, where performance degradation is caused by natural image corruptions that go beyond additive noise, including blur, weather effects, and digital artifacts. We propose a diffusion-based, input-level adaptation framework that operates entirely at test time and keeps all source-trained models frozen, explicitly targeting robustness to corrupted target inputs. Our method leverages a source-trained diffusion model as a generative prior and introduces a discriminator-guided adaptive diffusion strategy that dynamically controls the amount of perturbation applied to each test sample. Rather than relying on a fixed diffusion depth, the discriminator determines, on a per-image basis, when sufficient forward diffusion has been applied to suppress corruption-specific artifacts, with each corruption type effectively defining a distinct target domain. This adaptive stopping mechanism applies only the necessary amount of noise to remove domain-specific corruption while preserving class-discriminative structure. The reverse diffusion process then reconstructs a source-aligned image, optionally stabilized through structural guidance, which is classified using a frozen source-trained classifier. We evaluate the proposed approach across a broad spectrum of corruption-induced target domains, covering 15 diverse corruption types, and demonstrate more balanced robustness with competitive or improved performance across non-noise corruptions. Additional analyses reveal how the adaptive diffusion schedule responds to different corruption characteristics, highlighting the practicality, generality, and robustness of the proposed framework. The code is publicly available at https://github.com/fmolivato/dgadiffusion/.
Authors: Jiawei Yan
Abstract: This paper introduces VDLF-Net, which attaches a compact VAE to a multi-scale CNN backbone. Latent vectors and a softmax gate support the backbone feature maps, while $\ell_2$-normalized embeddings from the gated maps are used for supervised classification or episodic few-shot prediction. Under standard CIFAR-100 and Mini-ImageNet protocols, VDLF-Net demonstrates improved performance over ResNet-50 Enhanced, VGG-16, Prototypical Networks, and Matching Networks. Extensive ablations show that removing the fine-resolution scale has the greatest impact on VDLF-Net's performance. At the same time, ablating the KL and reconstruction terms at the chosen $\alpha$ causes only a minor performance reduction, demonstrating that performance gains over classical episodic baselines mainly originate from the full VDLF-Net architecture and training strategy.
Authors: Pritesh Jha
Abstract: Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval-augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model-internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV-IDP), a document processing pipeline that introduces reconstruction as a first-class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label-free quality signal. When fidelity falls below a per-entity-type threshold, a structured GPT-4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per-stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at https://github.com/pritesh-2711/RaV-IDP for experimentation and use.
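The reconstruction-as-validation loop, which compares a reconstruction against the original crop and triggers the fallback extractor when fidelity falls below a per-entity-type threshold, can be sketched with pluggable callables; the stand-in extractors, comparator, and thresholds below are placeholders, not the released pipeline's API:

```python
def validate_entity(crop, extract, reconstruct, compare, fallback_extract,
                    entity_type, thresholds, max_retries=1):
    """Extract, reconstruct, and compare against the ORIGINAL crop (never the extraction),
    falling back to a stronger extractor while fidelity is below the per-type threshold."""
    result = extract(crop)
    fidelity = compare(crop, reconstruct(result))        # always anchored to the source crop
    retries = 0
    while fidelity < thresholds.get(entity_type, 0.8) and retries < max_retries:
        result = fallback_extract(crop)                  # e.g. a vision-model fallback
        fidelity = compare(crop, reconstruct(result))
        retries += 1
    return result, fidelity

# Toy usage with trivial stand-ins: the "document crop" is just a string.
out, score = validate_entity(
    "TOTAL 42.00",
    extract=lambda c: c.lower(),
    reconstruct=lambda r: r.upper(),
    compare=lambda a, b: float(a == b),
    fallback_extract=lambda c: c,
    entity_type="text",
    thresholds={"text": 0.9},
)
print(out, score)
```

Anchoring the comparator to the unmodified source crop, rather than to the extraction, is the bootstrap constraint the abstract describes for keeping the validation non-circular.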
Authors: Navid Aslankhani Khameneh, Marco Carletti, Cigdem Beyan
Abstract: Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose-LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust in-bed pose estimation without modifying the sensing pipeline. The code is available at: github.com/navidTerraNova/GeoDiffPose.
Authors: Rabee Al-Qasem
Abstract: Reliable agricultural data is essential for food security, land-use planning, and economic resilience, yet in Palestine, such data remains difficult to collect at scale because of fragmented landscapes, limited field access, and restrictions on aerial monitoring. This paper presents ResAF-Net, a satellite-based tree detection framework designed for large-scale agricultural monitoring in resource-constrained settings. The proposed architecture combines a ResNet-50 encoder, Atrous Spatial Pyramid Pooling (ASPP), a feature-fusion stage, a multi-head self-attention refinement module, and an anchor-free FCOS detection head to improve tree localization in dense and heterogeneous scenes. Trained on the MillionTrees benchmark, the model achieved 82% Recall, 63.03% mAP@0.50, and 35.47% mAP@0.50:0.95 on the validation split, indicating strong sensitivity to tree presence while maintaining competitive localization quality. Beyond benchmark evaluation, we implemented the model within a web-based GIS application integrated with Palestinian cadastral data from GeoMolg, enabling tree analysis at scene, parcel, and community levels. This deployment demonstrates the practical feasibility of AI-assisted agricultural inventorying in Palestine. It provides a foundation for data-driven monitoring, reporting, and future species-level analysis of Mediterranean tree crops.
Authors: Guoxi Huang, Ruirui Lin, Yini Li, David R. Bull, Nantheera Anantrasirichai
Abstract: Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available online at https://github.com/russellllaputa/BVI-Mamba.
Authors: Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Arooj Zaib, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim
Abstract: The increasing global deployment of solar photovoltaic (PV) systems requires robust, scalable, and automated inspection technologies capable of detecting a wide range of panel flaws under a variety of operating situations. The lack of large-scale, multi-modal, publicly available annotated datasets is a major obstacle preventing advancement in this field. We introduce SolarFCD, an extensive dataset of solar panel defects created by methodically combining and reconciling three publicly accessible datasets covering two imaging modalities: RGB/Drone images and Thermal Infrared. The dataset consists of 4,435 images arranged into four unified defect classes: healthy, surface obstruction, structural fault, and electrical fault. The dataset was divided into training, validation, and test splits at an 80:10:10 ratio through methodical label mapping, near-duplicate removal, and targeted augmentation of minority classes. Sixteen classification architectures from five design families were trained and assessed on the dataset to provide repeatable benchmark baselines. With an accuracy of 86.68%, precision of 88.65%, recall of 88.62%, and F1-score of 88.17%, ResNet101V2 performed the best overall. Per-class results showed balanced detection across all four defect categories within a narrow performance band of less than 1.2 percentage points. To promote open and repeatable research in automated PV inspection and solar energy operations and maintenance, the dataset, annotation files, and baseline code are made openly available.
Authors: Francesco Dibitonto, Cigdem Beyan, Vittorio Murino
Abstract: Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC
Authors: Haodong Jiang, Mingzhe Li, Junfeng Wu
Abstract: Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, requiring a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of an otherwise intractable likelihood calculation, and inspires us to propose a faster and finer-grained robust mechanism, termed Harmonic Consensus Maximization (HCM). Taking camera pose estimation as an exemplary downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when paired with m-to-m association and the HCM mechanism.
Authors: Lei Kang, Giuseppe De Gregorio, Raphaela Heil, Alicia Fornés, Beáta Megyesi
Abstract: Historical encrypted manuscripts require both paleographic interpretation of cipher symbols and cryptanalytic recovery of plaintext. Most existing computational workflows rely on a transcription-first paradigm, in which handwritten symbols are transcribed prior to decipherment. This intermediate step is labor-intensive, error-prone, and not always aligned with the goal of direct plaintext recovery. We propose an end-to-end, transcription-free approach that directly maps handwritten cipher images to plaintext. Using the Copiale cipher as a case study, we introduce the first text-line-level dataset pairing cipher images with German plaintext. We show that pretraining on generic handwriting data followed by cipher-specific fine-tuning substantially improves decipherment accuracy. Our results demonstrate that transcription-free image-to-plaintext decipherment is both feasible and effective for historical substitution ciphers, offering a simplified and scalable alternative to traditional pipelines. https://github.com/leitro/Decipher-from-Pixels-Copiale
URLs: https://github.com/leitro/Decipher-from-Pixels-Copiale
Authors: Xuanshuo Fu, Lei Kang, Ernest Valveny, Dimosthenis Karatzas, Javier Vazquez-Corral
Abstract: Accurate text recognition in low-light environments is essential for intelligent systems in applications ranging from autonomous vehicles to smart surveillance. However, challenges such as poor illumination and noise interference remain underexplored. To address this gap, we introduce LSTR, a large-scale Low-light Scene Text Recognition dataset comprising 11,273 low-light images generated from well-lit datasets (ICDAR2015, IIIT5K, and WordArt), along with ESTR, which includes 60 real nighttime street-scene images in English and Spanish for exclusive evaluation. We explore two solution strategies: (1) employing Optical Character Recognition (OCR) models with fine-tuning and LoRA-based fine-tuning, and (2) a joint training strategy that integrates a low-light image enhancement (LLIE) module with an OCR model. In particular, we propose a novel re-render LLIE (RLLIE) module, which demonstrates improved performance on real-world data. Through extensive experimentation, we analyze various training strategies and address a key research question: how bright is bright enough for effective scene text recognition? Our results indicate that standalone LLIE or OCR models perform inadequately under low-light conditions, highlighting the advantages of specialized, jointly trained text-centric approaches. Additionally, we provide a comprehensive benchmark to support future research in robust low-light scene text recognition. https://huggingface.co/datasets/lumimusta/Low-light_Scene_Text_Dataset.
URLs: https://huggingface.co/datasets/lumimusta/Low-light_Scene_Text_Dataset.
Authors: Ruiqing Sun, Xingshan Yao, Zhijing Wu, Tian Lan, Chenhao Cui, Huiyang Zhao, Jialing Shi, Chen Yang, Xianling Mao
Abstract: Proactive defense methods protect portrait images from unauthorized editing or talking face generation (TFG) by introducing pixel-level protective perturbations, and have already attracted increasing attention for privacy protection. In real-world scenarios, images inevitably undergo various transformations during cross-device display and dissemination, such as scale transformations and color compression, that directly alter pixel values. However, it remains unclear whether such pixel-level modifications affect the effectiveness of existing proactive defense methods that rely on pixel-level perturbations. To address this problem, we conduct a systematic evaluation of representative proactive defenses under image transformation. The evaluated methods are selected to span different generation architectures, such as diffusion and GAN-based models, as well as defense scopes covering both portrait and natural images, and are assessed using both qualitative and quantitative metrics for subjective and objective comparison. Experimental results indicate that defense methods based on pixel-level perturbations struggle to withstand common image transformations, posing a risk of defense failure in real-world applications. To further highlight this risk, we propose a simple yet effective purification framework by leveraging the vulnerabilities induced by real-world image transformations. Experimental results demonstrate that the proposed method can efficiently remove protective perturbations with low computational cost, highlighting previously overlooked risks to the research community.
Authors: Shunkun Liang, Banglei Guan, Bin Li, Qifeng Yu, Yang Shang
Abstract: Multi-camera systems offer rich observation capabilities for visual navigation and 3D scene reconstruction; however, the resulting feature redundancy often compromises computational efficiency. This challenge is particularly pronounced during bundle adjustment, where the non-linear optimization of both system poses and scene points incurs substantial computational overhead. To address this challenge, this paper introduces a pose-only geometric constraint for multi-camera systems and proposes a corresponding pose adjustment algorithm. Specifically, we use a generalized camera model to establish a unified representation of the multi-camera system. Building upon this model, we formulate the multi-camera pose-only constraint, which implicitly represents a 3D scene point using two base observations and their associated poses, thereby achieving a pose-only representation of the projection geometry. Subsequently, we introduce a multi-camera pose adjustment algorithm that eliminates 3D points from the parameter space, thereby achieving efficient and focused pose optimization. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms baseline bundle adjustment methods in computational efficiency, while maintaining or even improving pose estimation accuracy.
Authors: Adam Kukučka, Ondřej Fabián, Vít Musil, Tomáš Brázdil
Abstract: Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.
Authors: Xinheng Li, Minghao Chen, Mengqing Wu, Yan Liu, Guanying Huo
Abstract: Single image dehazing is often constrained by a trade-off between restoration quality and computational efficiency. While efficient, CNNs struggle to learn robust priors for dense and non-homogeneous haze. Conversely, diffusion models provide strong generative priors but suffer from severe inference latency and sampling instability. To address these limitations, we propose ZID-Net, a novel framework that explicitly decouples diffusion supervision from feed-forward inference. For efficient inference, we design a frequency-spatial decoupled feed-forward backbone. Within this backbone, a Channel-Spatial Laplacian Mask (CSLM) filters haze-amplified noise to extract purified structural details, while Lightweight Global Context Blocks (LGCBs) establish long-range spatial dependencies to capture the global variations of haze. A Dynamic Feature Arbitration Block (DFAB) then adaptively fuses these semantic and structural features for robust reconstruction. To provide this backbone with physical priors without the inference cost, we introduce a Zero-Inference Prior Propagation Head (ZI-PPH) during training. ZI-PPH leverages a conditional diffusion process to predict residual noise, providing degradation-aware structural supervision to the backbone. By discarding the diffusion branch at test time, ZID-Net integrates diffusion priors into a pure feed-forward architecture for accurate and efficient restoration. ZID-Net achieves 40.75 dB PSNR on the synthetic RESIDE dataset and outperforms existing methods with a 1.13 dB gain on real-world datasets. Additionally, it yields a 3.06 dB PSNR gain on the StateHaze1k remote sensing dataset with an inference time of just 19.35 ms. The project code is available at https://github.com/XoomitLXH/ZID-Net.
Authors: Xuefen Liu, Xinquan Yang, Mianjie Zheng, Kun Tang, Xuguang Li, Xiaoqi Guo, Linlin Shen, He Meng
Abstract: As dental caries appear as subtle, low-contrast lesions in intraoral imaging, existing deep learning models face significant challenges in the early detection of caries. While recent Transformer-based detectors have shown promising results in natural images, they often fail to capture the domain-specific anatomical priors crucial for dental caries detection. In this paper, we propose Caries-DETR, a specialized Transformer framework for caries detection in intraoral images. A Tooth Structure-aware Query Initialization (TSQI) is designed, leveraging large-scale intraoral photograph pre-training and a structure perception branch (SPB) to integrate high-frequency structural priors, guiding the model to focus on anatomically significant lesion areas. Furthermore, we design a Lesion-aware Dynamic Loss Refinement (LDLR) to implement quality-driven hard mining through adaptive loss reweighting based on lesion size, anatomical relevance, and prediction quality, optimizing detection for subtle lesions. Extensive experiments on two public datasets (i.e., AlphaDent and DentalAI) demonstrate that Caries-DETR achieves state-of-the-art performance compared to existing methods and exhibits good generalization and robustness. Code and data are available at https://github.com/XuefenLiu-SZU/Caries-DETR.
URLs: https://github.com/XuefenLiu-SZU/Caries-DETR
Authors: Xiaowei Mao, Bowen Sui, Weijie Zhang, Yawen Yang, Shengnan Guo, Shilong Zhao, Jiaqi Lin, Tingrui Wu, Youfang Lin, Huaiyu Wa
Abstract: Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.
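A minimal sketch of the kind of online normality model such an asynchronous trigger could rely on, flagging vehicles whose motion feature falls outside a running probabilistic boundary; the scalar feature, the Welford update, and the z-score threshold are assumptions for illustration and not VIBES's actual Bayesian module.

```python
import numpy as np

class OnlineNormalityModel:
    """Running Gaussian over a scalar motion feature (e.g., speed);
    trajectories whose feature falls outside the probabilistic boundary
    would trigger a localized VLM query on that region."""
    def __init__(self, z_thresh=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.z_thresh = z_thresh

    def update(self, x):
        # Welford's online update of mean and (unnormalized) variance
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomalous(self, x):
        if self.n < 30:          # wait for enough evidence before triggering
            return False
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(x - self.mean) > self.z_thresh * max(std, 1e-6)

model = OnlineNormalityModel()
for speed in np.random.normal(100, 10, size=500):   # normal expressway traffic
    model.update(speed)
print(model.is_anomalous(15.0))   # near-stopped vehicle -> likely True
```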
Authors: Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei
Abstract: Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.
Authors: Yanqi Wu, Xinhua Lu, Runhe Lai, Qichao Chen, Jia-Xin Zhuang, Wei-Shi Zheng, Ruixuan Wang
Abstract: Recent studies show that using potential out-of-distribution (OOD) labels from large corpora as auxiliary information can improve OOD detection in vision-language models (VLMs). However, these methods often fail when real-world OOD samples fall outside the predefined OOD label set. To address this limitation, we propose DynProto, a novel approach that learns OOD prototypes dynamically during testing using only in-distribution (ID) information. DynProto is inspired by a key observation: OOD samples predicted as the same ID class tend to cluster in the feature space. With this insight, we leverage easy-to-detect OOD samples as "anchors" to find their harder-to-detect, similar counterparts. To this end, DynProto introduces two modules: a Coarse OOD Pattern Capturing Module caches OOD patterns that are easily confused with each ID class during testing, and a Fine-grained OOD Pattern Refinement Module subsequently clusters these patterns within each cache and aggregates them into representative OOD prototypes. By measuring similarity to ID and dynamic OOD prototypes, DynProto enables accurate OOD detection. DynProto significantly outperforms prior methods across multiple benchmarks. On the ImageNet OOD benchmark, DynProto reduces FPR95 by 11.60% and improves AUROC by 4.70%. Moreover, the framework is architecture-agnostic and can be integrated into various backbones.
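A rough sketch of the prototype idea under stated assumptions: cached hard-OOD features for one ID class are clustered into a few prototypes, and a test feature is scored by its similarity to OOD versus ID prototypes. The k-means aggregation and cosine scoring are illustrative choices, not necessarily DynProto's exact modules.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_ood_prototypes(cached_feats, n_protos=4):
    """Aggregate cached, easily-confused features for one ID class
    into a few representative OOD prototypes (here: k-means centroids)."""
    k = min(n_protos, len(cached_feats))
    km = KMeans(n_clusters=k, n_init=10).fit(np.asarray(cached_feats))
    protos = km.cluster_centers_
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def ood_score(feat, id_protos, ood_protos):
    """Higher score = more OOD-like: cosine similarity to the closest OOD
    prototype minus similarity to the closest ID prototype."""
    f = feat / np.linalg.norm(feat)
    return float(np.max(ood_protos @ f) - np.max(id_protos @ f))
```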
Authors: Honghao Cai, Xiangyuan Wang, Yunhao Bai, Haohua Chen, Tianze Zhou, Runqi Wang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li
Abstract: Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce REDEdit, a co-trained, instruction- and region-aware adapter framework that retrofits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone's internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image, eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, 9 diverse edit categories) to stress-test instruction following and generalization across edit types. On both, REDEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.
Authors: Nuttaset Kuapanich, Juepeng Zheng, Bohan Shi, Jiaying Liu, Jiayin Jiang, Jiatao Huang, Shenghan Tan, Qingmei Li, Haohuan Fu
Abstract: Accurate monitoring of oil palm plantations is critical for balancing economic development with environmental conservation in Southeast Asia. However, existing plantation maps often suffer from low spatial resolution and a lack of recent temporal coverage, impeding effective surveillance of rapid land-use changes. In this study, we propose a deep learning framework to generate 10-meter resolution oil palm plantation maps for Indonesia and Malaysia from 2020 to 2024, utilizing Sentinel-2 imagery without requiring new manual annotations. To address the resolution mismatch between coarse 100-meter historical labels and 10-meter imagery, we employ a U-Net architecture optimized with Determinant-based Mutual Information (DMI). This approach effectively mitigates the influence of label noise. We validated our method against 2,058 manually verified points, achieving overall accuracies of 70.64%, 63.53%, and 60.06% for the years 2020, 2022, and 2024, respectively. Our comprehensive analysis reveals that oil palm coverage in the region peaked in 2022 before experiencing a decline in 2024. Furthermore, land cover transition analysis highlights a concerning trajectory of plantation expansion into flooded vegetation areas, despite a general stabilization in rotations with other crop types. These high-resolution maps provide essential data for monitoring sustainability commitments and deforestation dynamics in the region, and the generated datasets are made publicly available at https://doi.org/10.5281/zenodo.17768444.
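For readers unfamiliar with the DMI objective, the following sketch shows the standard determinant-based mutual information loss on which such noise-robust training rests: the negative log-determinant of the empirical joint matrix between predicted class probabilities and the (possibly noisy) labels. The binary oil-palm setup and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dmi_loss(logits, noisy_labels, num_classes=2):
    """Determinant-based Mutual Information loss: -log |det(P^T Y / N)|,
    where P holds predicted class probabilities and Y holds one-hot
    (possibly noisy) labels. Its information-monotone form is what makes
    it robust to instance-independent label noise."""
    probs = F.softmax(logits, dim=1)                       # (N, C)
    onehot = F.one_hot(noisy_labels, num_classes).float()  # (N, C)
    joint = probs.t() @ onehot / logits.shape[0]           # (C, C) joint matrix
    return -torch.log(torch.abs(torch.det(joint)) + 1e-6)

# Example: binary oil-palm vs. background pixels flattened from a U-Net output
loss = dmi_loss(torch.randn(1024, 2), torch.randint(0, 2, (1024,)))
```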
Authors: Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh
Abstract: Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce \bench{}, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1,537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches a 75.8 weighted score, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
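To illustrate what a deterministic post-execution checker can look like, here is a toy example; the service API, field names, and task are invented for the illustration and are not drawn from the benchmark itself.

```python
def check_reply_sent(email_service, task):
    """Toy deterministic checker: passes iff a reply to the specified
    message exists and mentions the rescheduled meeting date.
    `email_service` and the field names are illustrative assumptions."""
    replies = [m for m in email_service.sent_messages()
               if m["in_reply_to"] == task["message_id"]]
    return any(task["new_date"] in m["body"] for m in replies)
```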
Authors: Jui-Cheng Chiu, Yu-Chao Wang, Shengyang Luo, Tongyan Wang, Qi Yang, Nabin Khanal, Yingjie Victor Chen
Abstract: Appreciating multi-figure paintings requires understanding how characters relate through subtle cues like gaze alignment, gesture, and spatial arrangement. We present MIRAGE, an evidence-centric framework designed to scaffold the exploration of these "micro-interactions" in multi-figure artworks. While such cues are essential for deep narrative appreciation, they are often distributed across complex scenes and difficult for viewers to systematically identify. Existing vision-language models (VLMs) frequently fail to provide reliable assistance, offering ungrounded interpretations that lack traceable visual evidence. MIRAGE addresses this by constructing a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. However, the challenge extends beyond extracting these cues to coordinating them during interpretation. Without an explicit mechanism to organize and reconcile relational evidence, models often collapse multiple interaction hypotheses into a single unstable or weakly grounded narrative, even when low-level signals are available. This representation allows users to verify how high-level interpretations are anchored in low-level visual facts. By separating spatial grounding from narrative generation, MIRAGE enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. We evaluate MIRAGE against painting-only VLM baselines using a blind assessment protocol. Results show that MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases the coverage of subtle interactions. These findings suggest that structured grounding can serve as a critical interaction control layer, providing the necessary scaffolding for a more reliable, transparent, and human-led understanding of complex visual narratives.
Authors: Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang
Abstract: While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
Authors: Yasin Shokrollahi, Karina B. Pinao Gonzales, Elizve N. Barrientos Toro, Paul Acosta, Patient Mosaic Team, Pingjun Chen, Yinyin Yuan, Xiaoxi Pan
Abstract: Accurate whole-cell and nuclear segmentation is essential for precision pathology and spatial omics, yet routine hematoxylin and eosin (H&E) staining provides limited cytoplasmic contrast, restricting analyses to nuclei. Multiplex immunofluorescence (mIF) facilitates precise whole-cell delineation but remains constrained by cost and accessibility. We introduce VitaminP, a cross-modal learning framework enabling whole-cell segmentation from H&E images. By learning from paired H&E-mIF data, VitaminP transfers molecular boundary information from mIF to compensate for the limited cytoplasmic contrast of H&E, establishing cross-modal supervision as a general strategy for recovering missing biological structure. We train VitaminP on 14 public datasets covering 34 cancer types and over 7 million instances, integrating publicly available labels with extensive annotations generated in this study, forming one of the largest resources for segmentation. VitaminP outperforms four state-of-the-art methods and generalizes to unseen datasets, including an in-house dataset spanning 24 rare cancer types. We further developed VitaminPScope, an open-source platform providing an interface for scalable inference and enabling broad adoption.
Authors: Jan Warchocki, Xi Wang, Jonas Kulhanek, Jan van Gemert
Abstract: Egocentric video provides a unique view into human perception and interaction, with growing relevance for augmented reality, robotics, and assistive technologies. However, rapid camera motion and complex scene dynamics pose major challenges for 3D reconstruction from this perspective. While 3D Gaussian Splatting (3DGS) has become a state-of-the-art method for efficient, high-quality novel view synthesis, variants that focus on reconstructing dynamic scenes from monocular video are rarely evaluated on egocentric video. It remains unclear whether existing models generalize to this setting or if egocentric-specific solutions are needed. In this work, we evaluate dynamic monocular 3DGS models on egocentric and exocentric video using paired ego-exo recordings from the EgoExo4D dataset. We find that reconstruction quality is consistently lower in egocentric views. Analysis reveals that the difference in reconstruction quality, measured in peak signal-to-noise ratio, stems from the reconstruction of static, not dynamic, content. Our findings underscore current limitations and motivate the development of egocentric-specific approaches, while also highlighting the value of separately evaluating static and dynamic regions of a video.
Authors: Zichun Guo, Yuling Shi, Wenhao Zeng, Chao Hu, Haotian Lin, Terry Yue Zhuo, Jiawei Chen, Xiaodong Gu, Wenping Ma
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of the latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: models perform well on intact documents, but once a document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.
Authors: Igor Adamenko, Orpaz Ben Aharon, Yehudit Aperstein, Alexander Apartsin
Abstract: Urban environments contain many imaging sensors built for specific purposes, including ATM, body-worn, CCTV, and dashboard cameras. Under the opportunistic sensing paradigm, these sensors can be repurposed for secondary inference tasks such as license plate recognition. Yet objects of interest in such imagery are often noisy, low-resolution, and captured from extreme viewpoints. Recent advances in AI-based restoration can recover useful information even from severely degraded images. A central challenge is determining which distortion parameters allow reliable recovery and which lead to inference failure. This paper introduces recoverability maps, a task-agnostic method for quantifying this boundary. The method combines a dense synthetic sweep of degradation parameters with two summary measures: boundary area-under-curve, which estimates the recoverable fraction of the parameter space, and a reliability score, which captures the frequency and severity of failures within that region. We demonstrate the method on license plate recognition from highly angled views under realistic camera artifacts. Several restoration architectures are trained and evaluated, including U-Net, Restormer, Pix2Pix, and SR3 diffusion. The best model recovers about 93% of the parameter space. Similar results across models suggest that sensing geometry, rather than architecture, sets the limit of recovery.
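A compact sketch of how a recoverability map and its boundary area-under-curve can be computed from a two-parameter degradation sweep; the callables for degradation, restoration, and recognition, plus the accuracy threshold, are placeholders rather than the paper's exact pipeline.

```python
import numpy as np

def recoverability_map(recognize, restore, degrade, plates,
                       blur_levels, noise_levels, acc_thresh=0.8):
    """Sweep a 2-D degradation grid; a cell is 'recoverable' if restored
    plates are read correctly at least `acc_thresh` of the time. The
    boundary area-under-curve is then simply the recoverable fraction of
    the swept parameter space. `recognize`, `restore`, and `degrade` are
    stand-ins for the OCR model, restoration model, and camera simulator."""
    grid = np.zeros((len(blur_levels), len(noise_levels)))
    for i, b in enumerate(blur_levels):
        for j, s in enumerate(noise_levels):
            hits = [recognize(restore(degrade(img, blur=b, noise=s))) == gt
                    for img, gt in plates]
            grid[i, j] = float(np.mean(hits) >= acc_thresh)
    return grid, grid.mean()   # map + boundary AUC (recoverable fraction)
```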
Authors: Ines Abbes, Mahmood Alzubaidi, Mowafa Househ, Khalid Alyafei, Marco Agus, Samir Brahim Belhaouari
Abstract: Measurement-critical ultrasound tasks often depend on a small anatomical region, making global reconstruction metrics an unreliable proxy for clinical fidelity. We propose an ROI-aware representation learning framework and instantiate it for first-trimester nuchal translucency (NT) screening under multi-hospital domain shift. A two-phase convolutional autoencoder (CAE) first learns a globally faithful 128-D latent code via MS-SSIM, then refines the NT ROI using intensity (L1) and normalized Sobel-edge constraints. To combine these heterogeneous objectives without manual tuning, we initialize loss weights via gradient-based calibration from per-term gradient magnitudes. Under strict hospital-wise evaluation with one hospital held out, ROI refinement improves both global and measurement-relevant quality: on the standard dev split it increases PSNR by +0.27 dB (val) and +0.29 dB (held-out test), reduces ROI MAE by 8.87% (val) and 6.43% (held-out test), and reduces ROI Edge-MAE by 11.10% on source hospitals and 4.90% on the unseen hospital. Beyond reconstruction, frozen-latent probes provide additional evidence of generalization: hospital provenance becomes less confidently predictable on the unseen site (0.556 to 0.541 max-softmax; 0.684 to 0.688 entropy) while OOD detection remains strong across site-held-out protocols (Mahalanobis AUROC up to 0.9956, with modest KNN gains in challenging splits). The same ROI-aware refinement principle is anatomy-agnostic and can be adopted for other fetal biometry targets (e.g., crown-rump length (CRL), nasal bone (NB)) and broader medical imaging settings where small ROIs dominate clinical decisions.
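One plausible reading of the gradient-based loss-weight calibration, sketched below: initialize each loss weight inversely proportional to the magnitude of that term's parameter gradient, so that all terms start with comparable influence. The exact calibration rule used in the paper may differ.

```python
import torch

def calibrate_loss_weights(model, loss_terms, batch):
    """Set one weight per loss term so weighted gradient magnitudes are
    comparable at initialization: w_i proportional to 1 / ||grad of L_i||,
    normalized to sum to 1. `loss_terms` are callables, e.g.
    [ms_ssim_term, roi_l1_term, roi_edge_term], each returning a scalar."""
    norms = []
    for term in loss_terms:
        model.zero_grad()
        term(model, batch).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        norms.append(g.norm().item() + 1e-8)
    inv = [1.0 / n for n in norms]
    return [w / sum(inv) for w in inv]
```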
Authors: Dennis Menn, Chih-Hsien Chou
Abstract: Video generation, while capable of producing realistic videos, is computationally expensive and slow, prohibiting real-time applications. In this paper, we observe that video latents encoded via an autoencoder under the Latent Diffusion Model (LDM) framework contain redundancy along the temporal axis. Analogous to how traditional video compression algorithms avoid transmitting redundant frame data, we propose the Latent Inter-frame Pruning framework to prune (skip the re-computation of) duplicated latent patches, thereby reducing computational burden and increasing throughput. However, direct pruning results in visual artifacts due to the discrepancy between full-sequence training and pruned inference. To resolve these artifacts, we propose an Attention Recovery mechanism to bridge the train-inference gap. With our proposed method, we increase video editing throughput by 1.44×, achieving 12.44 FPS on an NVIDIA RTX 6000 while maintaining video quality. We hope our work inspires further research into integrating traditional video compression methods with modern video generation pipelines. This is a preliminary study of training-free latent inter-frame pruning with attention recovery.
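A minimal sketch of the inter-frame pruning decision under stated assumptions: latent patches whose change relative to the previous frame falls below a threshold are marked for reuse rather than re-computation. The patch layout and threshold are illustrative, and the Attention Recovery mechanism is not shown.

```python
import torch

def select_patches_to_recompute(prev_latent, curr_latent, tau=0.05):
    """Mark latent patches whose relative change from the previous frame
    exceeds `tau`; the rest are treated as duplicates whose cached result
    can be reused, skipping their re-computation."""
    # latents: (num_patches, dim)
    diff = (curr_latent - prev_latent).norm(dim=1)
    scale = prev_latent.norm(dim=1).clamp_min(1e-6)
    return diff / scale > tau          # True = recompute, False = reuse cache

prev = torch.randn(1024, 64)
curr = prev.clone()
curr[:100] += 0.5 * torch.randn(100, 64)            # only 100 patches change
print(select_patches_to_recompute(prev, curr).sum())  # roughly 100 recomputed
```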
Authors: Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai
Abstract: Egocentric videos provide a distinctive setting in which sound serves as crucial cues to understand user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5 Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.
Authors: Naser Khatti Dizabadi
Abstract: Convolutional neural networks (CNNs) remain a central approach in image classification, but their performance depends strongly on architectural and training choices. This paper presents an empirical ablation-based study of CNN optimization for the CIFAR-10 benchmark. The study evaluates 17 progressive modifications involving training duration, learning-rate scheduling, dropout configuration, pooling strategy, network depth, filter arrangement, and dense-layer design. The goal is to identify which changes improve generalization and which increase complexity without improving performance. The baseline model achieved 79.5% test accuracy. Extending training duration improved performance steadily, whereas several structural redesigns reduced accuracy despite greater architectural variation. Based on the strongest individual configurations, a weighted ensemble was constructed, achieving 86.38% accuracy in the reduced-data setting and 89.23% when trained using the full CIFAR-10 dataset. These results suggest that performance gains in CNN-based classification depend less on indiscriminate increases in depth or parameter count than on careful empirical selection of training and architectural modifications. The study therefore highlights the practical value of ablation-oriented optimization and ensemble learning for small-image classification.
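A small sketch of the weighted-ensemble step: per-model softmax outputs are combined with fixed weights and the argmax is taken. Weighting by validation accuracy is a plausible reading of the abstract, not necessarily the exact rule used.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Combine per-model softmax outputs with fixed, normalized weights
    (e.g., proportional to each model's validation accuracy), then take
    the argmax over the 10 CIFAR-10 classes."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(prob_list)                 # (n_models, n_samples, 10)
    return np.argmax(np.tensordot(w, stacked, axes=1), axis=1)

# Example with three hypothetical CIFAR-10 models
p1, p2, p3 = (np.random.dirichlet(np.ones(10), size=5) for _ in range(3))
preds = weighted_ensemble([p1, p2, p3], weights=[0.84, 0.82, 0.80])
print(preds)  # predicted class index per sample
```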
Authors: Maycon R. S. Pereira, Filipe R. Cordeiro
Abstract: Noisy labels are a pervasive challenge in medical image classification, where annotation errors arise from inter-observer variability and diagnostic ambiguity. Although several noise-robust learning methods have been proposed, their evaluation predominantly relies on accuracy-oriented metrics, overlooking the clinical implications of asymmetric error costs. In medical diagnosis, a false negative (missed disease) carries substantially higher consequences than a false positive (false alarm), as delayed treatment can directly impact patient outcomes. In this work, we investigate whether noise-robust training methods preserve clinical safety under label noise. We conduct a systematic risk-aware evaluation of the state-of-the-art noise-robust methods Coteaching, DivideMix, UNICON, and a GMM-based filtering approach on binarized DermaMNIST and PathMNIST datasets under clean labels and under label noise rates of 20% and 40%. Beyond balanced accuracy, we adopt a cost-sensitive Global Risk formulation that explicitly penalizes false negatives. Our analysis reveals that the robustness of state-of-the-art methods does not guarantee clinical safety. Furthermore, we demonstrate that integrating cost-sensitive optimization into noise-robust training significantly reduces clinical risk while maintaining model utility. These findings demonstrate that noise-robust learning must be evaluated through a clinical risk lens, and that combining robust training with cost-sensitive optimization can meaningfully reduce risk in noisy-label medical imaging scenarios.
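The cost-sensitive risk idea can be written in a few lines; the 5:1 false-negative to false-positive cost ratio below is an illustrative assumption, not the paper's calibrated Global Risk costs.

```python
import numpy as np

def global_risk(y_true, y_pred, c_fn=5.0, c_fp=1.0):
    """Cost-sensitive risk that penalizes missed disease (false negatives)
    more heavily than false alarms; lower is safer."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return (c_fn * fn + c_fp * fp) / len(y_true)

print(global_risk([1, 1, 0, 0, 1], [0, 1, 0, 1, 1]))  # one FN, one FP -> 1.2
```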
Authors: Helder Oliveira
Abstract: Breast cancer is a leading cause of cancer-related mortality among women worldwide, with mammography as the primary screening tool. While deep learning models have shown strong performance in lesion segmentation, most rely on computationally intensive architectures that limit their use in resource-constrained environments. This study evaluates the performance and efficiency of lightweight models for mammographic lesion segmentation. Architectures including MobileNetV2, EfficientNet Lite, ENet, and Fast-SCNN were compared against a U-Net baseline using the INbreast dataset with 5-fold cross-validation. Performance was assessed using Dice score, Intersection over Union (IoU), and Recall, alongside model complexity. MobileNetV2 with Squeeze-and-Excitation (SCSE) achieved the best performance, with a Dice score of 0.5766 while using approximately 75% fewer parameters than U-Net. Cross-dataset evaluation on the DMID dataset showed reduced accuracy due to domain shift but preserved recall. These results demonstrate that lightweight architectures offer a practical balance between performance and efficiency for deployable CAD systems.
Authors: Benjamin Klein, Kazi Ruslan Rahman, Sanchita Ghose
Abstract: Navigational aids for blind and low-vision individuals struggle to convey dynamic real-world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real-time video-to-audio framework that converts mobile device video into contextually relevant sound effects or text-to-speech descriptions. We propose a motion-aware pipeline that uses a lightweight AI classification model to distinguish between low- and high-movement scenes, followed by a real-time text-to-audio synthesis pipeline to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken audio scene descriptions for situational awareness. In high-movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder-only transformer-based vision-language model with mixture-of-experts and cross-modal attention for visual understanding, in conjunction with neural text-to-speech and natural sound synthesis networks. The proposed framework uses prompt-based caching and category-specific throttling to avoid auditory clutter and minimize latency. We present a comprehensive evaluation of the system, including a real-time navigation study comparing a white cane alone versus the cane combined with AMAVA, which shows a significant increase in user confidence and perceived safety.
Authors: Zhiyu Wang, Xudong Kang, Shutao Li
Abstract: Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.
Authors: Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
Abstract: Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs motivates us to develop a Progressive Data Refinement pipeline that uses task-type filtering and data-ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M raw dataset. Training GoClick on this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where GoClick helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.
Authors: Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long
Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose LearnPruner, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM's middle layer. Experimental results show that our LearnPruner can preserve approximately 95% of the original performance while using only 5.5% of vision tokens, and achieve 3.2× inference acceleration, demonstrating a superior accuracy-efficiency trade-off.
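A minimal sketch of the second pruning stage as described, assuming text-to-vision attention from a middle LLM layer is available: vision tokens are ranked by the attention mass they receive from text tokens and only the top fraction is kept. The learnable first-stage module after the vision encoder is omitted here.

```python
import torch

def keep_text_relevant_tokens(t2v_attention, vision_token_ids, keep_ratio=0.055):
    """Rank vision tokens by the attention mass they receive from text tokens
    in a middle LLM layer and keep the top fraction. Tensor shapes and the
    5.5% ratio follow the abstract; everything else is illustrative."""
    # t2v_attention: (num_text_tokens, num_vision_tokens), averaged over heads
    scores = t2v_attention.mean(dim=0)            # per-vision-token relevance
    k = max(1, int(keep_ratio * scores.numel()))
    keep_idx = torch.topk(scores, k).indices
    return vision_token_ids[keep_idx]

attn = torch.rand(32, 576)            # 32 text tokens attending to 576 vision tokens
kept = keep_text_relevant_tokens(attn, torch.arange(576))
print(kept.numel())                   # about 31 tokens kept (5.5% of 576)
```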
Authors: Jiebin Yan, Kangcheng Wu, Jingwen Hou, Jiayu Zhang, Pengfei Chen, Yuming Fang
Abstract: Blind omnidirectional image quality assessment (BOIQA) presents a great challenge to the visual quality assessment community, due to different storage formats and diverse user viewing behaviors. The main paradigm of BOIQA models includes two steps, i.e., viewport generation and quality prediction, which brings an extra computational burden and is hard to generalize to other visual contents (e.g., 2D planar images). Thus, in this paper, we make an attempt to solve these issues. First, we experimentally find that BOIQA can be formulated as a blind (2D planar) image quality assessment (BIQA) problem, i.e., the first step, viewport generation, is no longer needed, which narrows the natural gap between BOIQA and BIQA. Then, we present a new BOIQA approach with three merits: viewport-unaware, as it accepts an omnidirectional image in the widely used equirectangular projection format as input without any transformation; unified, as it can also be applied to BIQA; and generalized, as it shows better generalizability than other competitors. Finally, we validate its promise through a held-out test, cross-database validation, and the well-established gMAD competition.
Authors: Bokang Zeng, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Jiaojiao Jiang
Abstract: Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.
Authors: Xiaoliu Luo, Minxue Xiao, Ting Xie, Mengzhu Wang, Huiqing Qi, Joey Tianyi Zhou, Taiping Zhang, Xu Wang
Abstract: Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision-language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on 11 public biomedical datasets spanning 9 imaging modalities and 10 anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.
Authors: Xuemei Qiu, Dawei Fan, Yebin Huang, Yanping Chen, Lifang Wei
Abstract: Digital pathology has fundamentally altered diagnostic workflows by enabling the computational analysis of gigapixel Whole Slide Images (WSIs), yet effectively deciphering their complex tumor microenvironments remains a formidable challenge. Existing Multiple Instance Learning (MIL) frameworks typically treat Whole Slide Images as unstructured bags of patches, discarding critical morphological semantics and spatial geometry. This lack of inductive bias often leads to overfitting on background noise and fails to align visual features with high-level diagnostic knowledge. To overcome these limitations, we propose the Hierarchical Prototype-based Domain Priors (HPDP) framework, a unified multimodal approach for joint histopathology diagnosis and prognosis. HPDP mitigates the data-driven "black box" issue by introducing a Morphologically Anchored Prototype System (MAPS), which anchors learning to interpretable morphological clusters, and a Sinusoidal Positional Encoder (SPE) to explicitly model tissue architecture. Furthermore, we bridge the semantic gap via a Hierarchical Cross-Modal Alignment (HCMA) module, using Large Language Model (LLM)-generated descriptions to contextually refine visual representations. Extensive experiments across seven cancer cohorts demonstrate that HPDP consistently achieves state-of-the-art performance with superior robustness and interpretability.
Authors: Zi-Hao Bo, Yaqian Li, Anzhou Hou, Rinyoichi Takezoe, Ertao Zhao, Tianxiang Pan, Jiale Yan, Mo Guang, Kaiwen Long
Abstract: Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.
Authors: Fengxian Ji, Jingpu Yang, Zirui Song, Lang Gao, Junhong Liang, Zhenhao Chen, Jinghui Zhang, Xiuying Chen
Abstract: Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce ServImage, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) ServImageBench: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over $295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations. (ii) ServImageScore: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable. (iii) ServImageModel: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems. Code: https://github.com/FengxianJi/ServImage
Authors: Takumi Kawano, Kohei Miura, Daisuke Iwai
Abstract: Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for dense multi-projector systems with overlapping projection regions, such as high-brightness stacking, super-resolution, light-field, and shadow-suppression displays.
Authors: Jiawei Wang, Ming Lei, Yaning Yang, Xinyan Lin, Yuquan Le, Qiwei Ma, Zhiwei Xu, Zheqi Lv, Yuchen Ang, Zhe Quan, Tat-Seng Chua
Abstract: Identifying species in biology among tens of thousands of visually similar taxa while discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems, with classification models assuming closed sets and discovery relying on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top-$k$ candidate species with $n$ exemplar images each from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count $k$ and exemplar count $n$, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.
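The retrieval-based identify-or-discover decision can be illustrated with a small sketch: retrieve the top-k exemplars by similarity and declare a novel species only when the index lacks sufficient evidence. The similarity threshold, majority vote, and function signature are assumptions for illustration, not DeepTaxon's actual decision rule (which uses chain-of-thought comparative reasoning).

```python
import numpy as np

def identify_or_discover(query_emb, index_embs, index_labels, k=5, min_sim=0.6):
    """Retrieve top-k exemplars; declare novelty only when the retrieval index
    lacks sufficiently similar evidence, otherwise return the majority label.

    query_emb:    (D,) embedding of the query image.
    index_embs:   (N, D) exemplar embeddings (assumed L2-normalized).
    index_labels: length-N list of species labels.
    """
    sims = index_embs @ query_emb                     # cosine similarities
    top = np.argsort(-sims)[:k]
    if sims[top[0]] < min_sim:                        # insufficient evidence in the index
        return "novel_candidate", sims[top]
    labels = [index_labels[i] for i in top]
    best = max(set(labels), key=labels.count)         # majority vote over top-k
    return best, sims[top]
```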
Authors: Swadhin Das, Vivek Yadav
Abstract: The encoder-decoder framework has become the dominant paradigm for image captioning. In this model, the encoder extracts informative visual features from an input image, and the decoder employs a sequence-to-sequence formulation to generate the corresponding textual description from these features. Existing models concentrate mainly on the decoding stage. However, extracting meaningful information from the image can help the decoder generate an accurate caption by providing information about the objects and their relationships. Remote sensing images are highly complex. One major challenge is detecting objects that extend beyond their visible boundaries due to occlusion, overlapping structures, and unclear edges. Hence, there is a need for an approach that effectively captures both high-level semantics and low-level spatial details for accurate caption generation. In this work, we propose an edge-aware fusion method that incorporates the original image and its edge-aware version into the encoder to enhance feature representation and boundary awareness. We use a comparison-based beam search (CBBS) to generate captions, achieving a balanced trade-off between quantitative metrics and qualitative caption relevance through fairness-based comparison of candidate captions. Experimental results demonstrate our model's superiority over several baseline models from both quantitative and qualitative perspectives.
Authors: Beomchan Park, Seongho Kim, Hyunjun Kim, Sungjune Park, Yong Man Ro
Abstract: While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
Authors: Bingyi Liu, Chuanhui Zhu, Hongfei Xue, Jian Teng, Jipeng Liu, Enshu Wang, Penglin Dai, Pu Wang
Abstract: Accurate 3D object detection is critical for autonomous driving, necessitating reliable, cost-effective sensors capable of operating in adverse weather conditions. Camera and millimeter-wave radar fusion has emerged as a promising solution; however, these methods often rely on finely annotated radar data, which is scarce and labor-intensive to produce. To address this challenge, we present CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework that enhances the performance of existing radar-camera fusion methods for 3D object detection. CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R (LiDAR-to-Radar) Sampling method. Then, it incorporates this data into a novel dual-stage, dual-modality contrastive learning strategy, enabling effective self-supervised learning from paired pseudo-radar and image data. This approach facilitates effective pretraining of existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction capabilities and improving detection accuracy and robustness. Experimental results using NuScenes and Lyft Level 5 datasets demonstrate significant performance improvements across three baseline models, highlighting CLLAP's effectiveness in advancing radar-camera fusion for autonomous driving applications.
Authors: Woojun Jung, Junyeong Kim
Abstract: Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $\tau_b$, $\tau_c$, and Spearman's $\rho$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
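The correlation measures used here for meta-evaluation (Kendall's tau_b, tau_c, and Spearman's rho) are standard and available in SciPy; a toy usage sketch with made-up metric scores and human ratings:

```python
from scipy.stats import kendalltau, spearmanr

# Toy example: metric scores vs. human ratings for five candidate summaries.
metric_scores = [0.72, 0.55, 0.91, 0.40, 0.63]
human_ratings = [4, 3, 5, 2, 4]

tau_b, _ = kendalltau(metric_scores, human_ratings, variant="b")
tau_c, _ = kendalltau(metric_scores, human_ratings, variant="c")
rho, _ = spearmanr(metric_scores, human_ratings)
print(f"tau_b={tau_b:.3f}  tau_c={tau_c:.3f}  rho={rho:.3f}")
```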
Authors: YuHao Yin, Zongji Wang, Yuanben Zhang, Biqing Li, Jiesong Bai, Junyi Liu
Abstract: Full 360$^\circ$ novel view synthesis under low-light conditions remains challenging. Insufficient illumination, noise amplification, and view-dependent photometric inconsistencies prevent existing methods from jointly preserving geometric consistency and photorealism. Unsupervised approaches often exhibit color drift under large viewpoint variations, while supervised low-light enhancement models, though effective for 2D tasks, struggle to generalize to new scenes and typically require retraining. To address these issues, we propose MERID-GS, a Multi-Scale Explicit Retinex Illumination-Decoupled Gaussian framework for low-light 360$^\circ$ synthesis. Based on Retinex theory, the method explicitly separates illumination and reflectance, and suppresses noise propagation while enhancing dark-region structures via a learnable gain and Illumination-State-Guided Frequency Gating. Combined with a lightweight Reflection Head and 3D Gaussian Splatting, MERID-GS adapts to new scenes with only a few shots and enables stable low-light novel view synthesis from sparse-view observations. In addition, we construct a low-light multi-view dataset covering full 360$^\circ$ scenes for joint evaluation. Thorough experiments across multiple datasets demonstrate that MERID-GS achieves state-of-the-art performance, exhibiting superior cross-scene generalization and view consistency. The source code and pre-trained models are available at https://github.com/YhuoyuH/MERID-GS.
Authors: Yichi Zhang, Le Xue, Bichun Xu, Judong Luo, Zhigang Wu, Yu Fu, Zixin Hu, Yuan Cheng, Yuan Qi
Abstract: Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model's feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model's encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model's global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.
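Stage one, prototype propagation from a single annotated template, can be sketched as follows. The feature shapes, cosine-similarity assignment, and the assumption that every class appears in the template are illustrative simplifications, not the exact SemiSAM-O1 procedure.

```python
import torch
import torch.nn.functional as F

def propagate_prototypes(template_feats, template_mask, unlabeled_feats, n_classes):
    """Sketch of stage-1 pseudo-labeling: build class prototypes from the single
    annotated template and assign unlabeled voxels by feature similarity.

    template_feats:  (N_t, D) dense features of the annotated template volume.
    template_mask:   (N_t,)   integer class label per template voxel.
    unlabeled_feats: (N_u, D) dense features of an unlabeled volume.
    Assumes every class index 0..n_classes-1 appears in template_mask.
    """
    protos = torch.stack([
        template_feats[template_mask == c].mean(dim=0) for c in range(n_classes)
    ])                                                   # (C, D) class prototypes
    sims = F.normalize(unlabeled_feats, dim=1) @ F.normalize(protos, dim=1).T
    return sims.argmax(dim=1)                            # coarse pseudo-labels (N_u,)
```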
Authors: Yifeng Bai, Zhirong Chen, Erkang Cheng, Haibin Ling
Abstract: Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, they often neglect the importance of \textit{point-to-instance} (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR sets a new state of the art with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in $\mathrm{DET}_{\text{l}}$, +5.4 in $\mathrm{TOP}_{\text{ll}}$ on $\text{subset\_A}$ and +11.0 in $\mathrm{DET}_{\text{l}}$, +7.9 in $\mathrm{TOP}_{\text{ll}}$ on $\text{subset\_B}$, validating the effectiveness of the proposed components. The code will be shared publicly at https://github.com/Yifeng-Bai/TopoHR.git.
Authors: Jiayi Wang, Lichun Zhang, Xiaoqi Zhuang, Jiaqi Zhang, Lu Yu, Yin Zhao
Abstract: Video technology is advancing toward Ultra High Definition (UHD) and High Dynamic Range (HDR), which intensifies the need for higher compression efficiency for these high-specification videos. Beyond advances in traditional codecs, neural video codecs (NVCs) have attracted significant research attention and have evolved rapidly over the past few years. The coding artifacts of NVCs often exhibit content-varying and generative characteristics, which differ from those of conventional codecs and are challenging for traditional video quality assessment (VQA) methods to capture. Therefore, VQA metrics are required to generalize across different codecs, content types, and dynamic ranges to better support video codec research and evaluation. In this paper, we propose FDIM, a feature-distance-based generic video quality metric for both traditional and neural video codecs across SDR and HDR formats. FDIM employs a hybrid architecture that integrates deep and hand-crafted features. The deep feature component learns multi-scale representations to capture distortions ranging from structural and textural fidelity degradation to high-level semantic deviations, while the hand-crafted feature component provides stable complementary cues to improve overall generalization. We trained FDIM on a large-scale subjective quality assessment dataset (DCVQA) consisting of over 16k video sequences encoded by traditional block-based hybrid video codecs and end-to-end perceptually optimized neural video codecs. Extensive experiments on ten SDR/HDR VQA datasets containing diverse, previously unseen codecs demonstrate that FDIM achieves strong generalization and high correlation with subjective assessment. The source code for FDIM and the DCVQA validation set will be released at https://github.com/MCL-ZJU/FDIM.
Authors: Jinkun Dai, Yuanxin Ye, Peng Tang, Tengfeng Tang, Xianping Ma, Jing Xiao, Mi Wang
Abstract: Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporation of non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal network that synergistically integrates textual supervision with visual representations for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantics and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features dynamically interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To verify our method, we construct two new multi-modal datasets and carry out extensive experiments comparing the proposed method against state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet
Authors: Shyang-En Weng, Yi-Cheng Liao, Yu-Syuan Xu, Wei-Chen Chiu, Ching-Chun Huang
Abstract: Pretrained diffusion models have revolutionized real-world image super-resolution (Real-ISR) but suffer from computational bottlenecks due to iterative sampling. Recent single-step distillation accelerates inference but faces a stark perception-distortion trade-off due to rigid timestep initialization, distributional trajectory mismatches, and fragile stochastic modulation. To address this, we present Adaptive Inversion and Degradation-aware Sampling for Real-ISR (IDaS-SR), a one-step framework bridging the deterministic restoration and stochastic generation manifolds. At its core, the Manifold Inversion Noise Estimator (MINE) resolves these initialization and trajectory mismatches by predicting a severity-aware timestep and inversion noise, precisely anchoring low-quality latents onto the diffusion trajectory. Furthermore, to mitigate fragile stochastic modulation, we propose CHARIOT, a continuous generative steering mechanism. By rescheduling trajectories and interpolating noise, it enables explicit navigation of the perception-distortion boundary without compromising structural priors. Extensive experiments demonstrate that IDaS-SR outperforms state-of-the-art methods, seamlessly transitioning from a rigorous structural restorer to a sophisticated texture hallucinator in a single inference step.
Authors: Xuguang Bai, Mingxuan Liu, Tongxi Song, Yifei Chen, Hongjia Yang, Kasidit Anmahapong, Zihan Li, Ying Zhou, Qiyuan Tian
Abstract: Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-report pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.
Authors: Runci Bai, Kui Jiang, Xiang Chen, Chen Wu, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng
Abstract: Remote sensing images are frequently degraded by adverse weather conditions, particularly clouds and haze, which severely impair downstream applications. Existing restoration methods typically rely on computationally heavy architectures or sequential pipelines (e.g., detail enhancement followed by color rendition) that suffer from mutual interference and artifact accumulation. Furthermore, recent unified grid-based approaches utilize fixed, isotropic interpolation kernels, neglecting the intrinsic low-dimensional manifold of natural images and inevitably causing edge blur. To address these limitations, we propose 6th Grid-Net, a highly efficient and unified remote sensing image restoration framework tailored for resource-constrained edge devices. Specifically, we construct a novel six-dimensional fusion tensor that seamlessly integrates the color rendition capabilities of 3D LUTs with the spatial-luminance detail preservation of bilateral grids. To overcome the drawbacks of standard trilinear interpolation, we introduce a manifold-adaptive high-dimensional sampling mechanism. This mechanism dynamically adjusts the interpolation kernel based on local edge orientation, texture strength, and color similarity, enabling simultaneous global color stylization and local edge refinement in a single forward pass. Additionally, an edge-aware grid smoothing constraint and dynamic quantization are incorporated to suppress ghosting artifacts and significantly compress the model size. Extensive experiments on multiple benchmark datasets demonstrate that 6th Grid-Net achieves state-of-the-art restoration quality across various degradation scenarios.
Authors: Benedikt Hopf, Radu Timofte, Chenfan Qu, Junchi Li, Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo, Yongwei Tang, Zhiqiang Yang, Zhiqiang Wu, Jia Wen Seow, Hong Vin Koay, Haodong Ren, Feng Xu, Shuai Chen, Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran, Chih-Yu Jian, Yi-Fan Wang, Bang-Kang Chen, You-Chen Chao, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu, Aashish Negi, Hardik Sharma, Prateek Shaily, Jayant Kumar, Sachin Chaudhary, Akshay Dudhane, Praful Hambarde, Amit Shukla, Jielun Peng, Yabin Wang, Yaqi Li, Jincheng Liu, Xiaopeng Hong, Krish Wadhwani, Liam Fitzpatrick, Utkarsh Tiwari, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Cristian Lazo Quispe, Aishwarya A, Akshara S, Ashwathi N, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi
Abstract: Robustness is a long-overlooked problem in deepfake detection. However, detection performance is of little practical value if it collapses under even slight image degradation. In addition to mild degradations that can occur accidentally in the image processing pipeline, malicious deepfakes may introduce degradations deliberately to exploit a detector's weaknesses. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses this problem. Participants were tasked with building a detector that was later tested on an unknown test set containing both common and uncommon degradations of various strengths. With a total of 337 participants and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24 hours to complete the test run with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test set to detect any such overfitting. This report presents the competition setting, dataset preparation, and the details and performance of the submitted methods. Top methods rely on large foundation models, ensembles, and degradation training to combine generality and robustness.
Authors: Guillaume Perez, Janarbek Matai, Takahiro Harada
Abstract: Implicit neural representations (INRs) are increasingly used to map coordinates to signals, with applications ranging from neural fields to texture compression, shape representation, and beyond. Most INR methods rely on high-dimensional projections of the input coordinates through encoders such as grids or positional encoding. Nevertheless, positional encoding is often insufficient, and grids, as we show in this paper, require high resolution to learn effectively. In this paper, we demonstrate that positional encoding can be used not only as a high-dimensional embedding but can also be decomposed into a series of meaningful points. We propose Positional Encoding Projected Sampling, where we treat the projection of the original coordinate at each frequency as a point of interest. We describe the motion of each point with respect to the frequencies and show that it follows a unique pattern. Finally, we use the unique motion of each point as a basis decomposition for learned positional encoding using grids. We demonstrate, on three competitive applications (image representation, texture compression, and signed distance functions), that the proposed approach outperforms current state-of-the-art methods and often requires 25\% fewer parameters for equivalent reconstruction or rendering error.
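A minimal sketch of the underlying idea, assuming the classic Fourier-feature positional encoding with dyadic frequencies: each frequency turns the scalar coordinate into one (sin, cos) point on the unit circle, so the embedding can be read as a sequence of points rather than a flat vector. The function name and frequency choice are illustrative assumptions.

```python
import numpy as np

def encoding_as_points(x, n_freqs=6):
    """Classic positional encoding of a scalar coordinate, viewed as a
    sequence of 2D points (sin, cos) -- one point per frequency.

    x: scalar coordinate in [0, 1].
    Returns an (n_freqs, 2) array of points on the unit circle.
    """
    freqs = 2.0 ** np.arange(n_freqs) * np.pi            # dyadic frequencies
    angles = freqs * x
    return np.stack([np.sin(angles), np.cos(angles)], axis=1)

pts = encoding_as_points(0.37, n_freqs=6)
print(pts.shape)   # (6, 2): flattening the rows recovers the usual embedding vector
```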
Authors: Laurenz Reichardt, Nikolas Ebert, Oliver Wasenm\"uller
Abstract: 3D point cloud perception remains tightly coupled to custom CUDA operators for spatial operations, limiting portability and efficiency on non-NVIDIA platforms such as AMD GPUs and embedded hardware. We introduce PointTransformerX (PTX), a fully PyTorch-native vision transformer backbone for 3D point clouds, removing all custom CUDA operators and external libraries while retaining competitive accuracy. PTX introduces 3D-GS-RoPE, a rotary positional embedding that encodes 3D spatial relationships directly in self-attention without neighborhood construction, and further replaces sparse convolutional patch embedding with a linear projection. PTX explores inference-time scaling of attention windows to improve accuracy without retraining. With a redesigned feed-forward network, PTX achieves 98.7\% of PointTransformer V3's accuracy on ScanNet with 79.2\% fewer parameters, running $1.6\times$ faster while requiring just 253 MB of memory. PTX runs natively on NVIDIA GPUs, AMD GPUs (ROCm), and CPUs, providing an efficient and portable foundation for point cloud perception.
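The exact 3D-GS-RoPE construction is not given in the abstract; below is a generic rotary-embedding sketch driven by a continuous coordinate, purely to illustrate how positional information can be injected by rotating feature pairs without any neighborhood construction. The function name and channel layout are assumptions, not the paper's design.

```python
import torch

def coordinate_rope(x, coord, base=10000.0):
    """Generic rotary embedding driven by one continuous coordinate.

    x:     (N, D) query or key features, D even.
    coord: (N,)   one spatial coordinate per point (a 3D variant would
                  allocate separate channel groups to x, y and z).
    """
    n, d = x.shape
    half = d // 2
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = coord[:, None] * inv_freq[None, :]                      # (N, half)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(10, 32)
z = torch.rand(10)                    # e.g. one axis of the point coordinates
print(coordinate_rope(q, z).shape)    # (10, 32)
```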
Authors: Yaohou Fan, Qingzhong Wang, Yongsong Huang, Junyi Liu, Tomo Miyazaki, Shinichiro Omachi
Abstract: Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem by aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards via a weighted sum. In addition, the weights of the individual rewards are difficult to balance. Moreover, reinforcement learning requires a set of training instructions: a large prompt set requires more training time and computing resources, while a small set leads to poor performance. Hence, selecting prompts for efficient training remains an open problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence over a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals and finds the best trade-off solution across different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics, including CLIP score, HPS score, and sentence accuracy.
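Identifying the Pareto-optimal set over multiple rewards amounts to non-dominated filtering; a minimal sketch (the reward values in the example are made up and purely illustrative, not POCA's actual reward set):

```python
import numpy as np

def pareto_optimal_set(rewards):
    """Return indices of non-dominated candidates (all rewards to be maximized).

    rewards: (N, K) array, one row of K reward values per candidate.
    """
    n = rewards.shape[0]
    keep = []
    for i in range(n):
        # i is dominated if some row is >= everywhere and > somewhere.
        dominated = np.any(
            np.all(rewards >= rewards[i], axis=1) & np.any(rewards > rewards[i], axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

# Toy example with three rewards per candidate (e.g. CLIP, HPS, text accuracy).
r = np.array([[0.8, 0.6, 0.9],
              [0.7, 0.7, 0.7],
              [0.6, 0.5, 0.6]])    # dominated by both other rows
print(pareto_optimal_set(r))       # [0, 1]
```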
Authors: Patris Valera, Magdalena Wysocki, Felix Duelmer, Mohammad Farid Azampour, Sebastian Herz, Stefan W\"orz, Nassir Navab
Abstract: Wide Field-of-View (WFoV) reconstruction enhances 3D ultrasound imaging by providing valuable anatomical context for segmentation models and visualization. Clinical ultrasound volumes are predominantly acquired using convex probes, which generate expanding, diverging acoustic beams to maximize anatomical coverage. Stitching these sweeps together traditionally introduces significant compounding artifacts and aliasing due to depth-dependent resolution changes. Here, we introduce Ultra-Wide-NeRF, a Multivariate 3D Gaussian (MVG) NeRF-based method for WFoV ultrasound reconstruction. By explicitly modeling the complex beam geometry using distance-dependent convex volumetric sampling and anisotropic 3D Gaussians, our method inherently mitigates these compounding artifacts and provides anti-aliasing. Beyond simply reconstructing a static 3D grid, our NeRF-based approach yields a continuous neural representation of the tissue, enabling the synthesis of high-fidelity novel views from arbitrary virtual trajectories. We validate Ultra-Wide-NeRF for intracardiac echocardiography on phantom and porcine datasets, demonstrating that our method expands the spatial context important in intraoperative navigation. Code will be open-sourced upon publication.
Authors: Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang
Abstract: Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
Authors: Vishakha Lall, Capt. Stanley S Pinto, Capt. Chu Xing Peng, Wu Kaiwen
Abstract: Containerised shipping underpins global trade, yet container loss at sea remains a persistent safety, environmental, and economic challenge. Despite compliance with Cargo Securing Manuals, dynamic maritime conditions such as vessel motion, wind loading, and severe sea states can progressively destabilise container stacks, leading to overboard losses. With the International Maritime Organisation's (IMO) new mandatory reporting requirements for lost containers, there is an urgent need for a reliable, evidence-based early detection solution for destabilised containers. This study showcases a low-cost, retrofittable computer vision-based system for early detection of destabilised containers using existing onboard cameras. The framework integrates object segmentation to isolate container stacks, temporal object tracking using optical flow, and per-object residual motion extraction to quantify relative movement. Experimental evaluation on real onboard ship footage demonstrates that the proposed pipeline effectively isolates container-level motion under challenging and varying sea states and visibility conditions. By enabling early alerts for crew intervention and navigational adjustment, the proposed approach enhances cargo safety, operational resilience, and regulatory compliance.
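The residual-motion idea can be sketched with dense optical flow: estimate per-pixel flow, subtract the dominant (vessel/camera) motion, and measure what remains inside each segmented container stack. The Farneback parameters and the median-based global-motion estimate are assumptions for illustration, not the paper's exact pipeline.

```python
import cv2
import numpy as np

def residual_motion(prev_gray, curr_gray, container_mask):
    """Quantify per-object motion relative to the dominant (ship/camera) motion.

    prev_gray, curr_gray: consecutive grayscale frames (uint8).
    container_mask:       boolean mask of one segmented container stack.
    Returns the mean residual flow magnitude inside the mask (pixels/frame).
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)
    global_motion = np.median(flow.reshape(-1, 2), axis=0)          # dominant flow
    residual = flow - global_motion                                  # per-pixel residual
    mag = np.linalg.norm(residual, axis=2)
    return float(mag[container_mask].mean())
```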
Authors: Yin Lin, Elena De Martin, Giacomo Conte, Domenico Aquino, Cristiana Pedone, Alberto Redaelli, Riccardo Barbieri, Laura Fariselli, Simona Ferrante
Abstract: Skull-base meningiomas are often characterized by favorable long-term prognosis, yet their anatomical complexity and proximity to critical neurovascular structures make treatment selection challenging. Stereotactic radiosurgery with CyberKnife represents an effective therapeutic option when surgical resection is not feasible; however, not all patients benefit equally from this treatment. Early identification of patients likely to respond to radiosurgery remains an open clinical problem. In this study, we propose a radiomics- and clinical feature-driven framework for predicting volumetric response in skull-base meningiomas treated with CyberKnife. Unlike most existing approaches that focus on progression-free survival or recurrence, our method targets volumetric response as an indicator of treatment efficacy. Pre-treatment MRI images from 104 patients were processed to extract radiomic features, which were combined with clinical variables and analyzed using six models. To ensure methodological rigor, the entire modeling process was implemented within a nested cross-validation scheme. Among the evaluated models, TabPFN achieved the best overall performance, with an AUC of 0.81 and consistently favorable classification metrics. These results suggest that advanced machine learning architectures, when combined with robust validation strategies, can effectively capture patterns associated with treatment response even in small-sample, high-dimensional settings.
Authors: Stefano Raimondo (Department of Mechanical Engineering, Politecnico di Milano, Milan, Italy), Matteo Bugatti (Department of Mechanical Engineering, Politecnico di Milano, Milan, Italy), Marco Grasso (Department of Mechanical Engineering, Politecnico di Milano, Milan, Italy)
Abstract: The technological maturity of in situ inspection and monitoring methods in additive manufacturing is steadily increasing, enabling more efficient and practical qualification procedures. In this context, image segmentation of powder bed images in Laser Powder Bed Fusion (L-PBF) has been investigated by various authors, leveraging both edge detection and machine learning approaches to identify deviations from nominal geometry. Despite these developments, several challenges remain, including the sensitivity of segmentation performance to industrial illumination conditions and layer-to-layer variability in pixel intensity patterns. The study addresses these limitations by proposing a graph-augmented segmentation approach. The underlying principle consists of preserving the geometrical information at a global level rather than at pixel-wise level, modeling dependencies and relational information among spatial regions with a Graph Neural Network bottleneck embedded into a U-Net architecture. This allows enhancing the consistency and accuracy of the geometry reconstruction in the presence of spatial and layer-wise photometric variability systematically faced in real data. The method is evaluated against benchmark techniques for the in situ reconstruction of lattice structures produced by L-PBF, demonstrating its potential as a scalable solution for robust in situ inspection and geometric verification in industrial environments.
Authors: Yin Lin, Domenico Aquino, Alberto Redaelli, Massimiliano Del Bene, Riccardo Barbieri, Simona Ferrante
Abstract: Touchless interaction with medical images is becoming increasingly important in the surgical field, where sterility and continuity of the operational workflow are essential requirements. This work presents a vision-based system for intraoperative navigation of medical images through hand gestures acquired using a single RGB camera. Unlike many existing solutions, the system does not require additional hardware or user-specific training. Hand tracking is performed in real time using MediaPipe Hands, which provides a 2.5D estimation of hand landmarks. Simple and intuitive gestures are then mapped into translation, rotation, and zoom commands, enabling continuous and natural interaction with the image viewer. The system architecture is independent of the visualization software and, for implementation simplicity, was integrated with PyVista in this study. Performance was evaluated through frame-level logging and quantitative analysis of latency, stability, and interaction robustness metrics. Experimental results highlight real-time behavior, with low latencies and stable control, in line with the requirements of fluid interaction. The system demonstrates the feasibility of a low-cost touchless solution for intraoperative access to medical images, laying the groundwork for future clinical evaluations.
Authors: Soumya Snigdha Kundu, Florian Kofler, Marina Ivory, Hendrik Moller, Jonathan Shapey, Tom Vercauteren
Abstract: Instance-sensitive losses for semantic segmentation such as blob loss and CC loss were designed to address instance imbalance, ensuring small lesions generate the same gradient as large ones, but operate only on single-class segmentation. In multi-class settings, class imbalance poses an additional problem: rare classes with few instances receive a disproportionately small share of the training signal. We show that extending instance-sensitive losses to multi-class segmentation via a one-vs-rest class decomposition repurposes them to also address class imbalance, as uniform averaging over classes ensures each class contributes equally regardless of frequency. We further show that inverse-size weighting, which destabilizes training when applied globally due to weight imbalances across rare and common classes, becomes effective when integrated within the per-component loss, confining the reweighting to each component's spatial context. On the BraTS-METS 2025 dataset (260 test cases), multi-class CC loss improves foreground Dice (0.64 +/- 0.26 vs. 0.59 +/- 0.27 baseline) and rare-class Dice, while maintaining Panoptic Quality at DSC threshold 0.5. Multi-class blob loss achieves the best Panoptic Quality at threshold 0.5 (0.40 +/- 0.24 vs. 0.38 +/- 0.25 baseline) and recognition quality (0.53 +/- 0.29 vs. 0.49 +/- 0.30). Integrating inverse-size weighting within the per-component loss increases rare-class Dice to 0.44 +/- 0.36 at the cost of reduced detection quality.
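A minimal sketch of the recipe described here: one-vs-rest decomposition, a per-connected-component soft Dice, inverse-size weighting confined to each component, and uniform averaging over classes. The handling of voxels outside each component is simplified and the exact CC/blob loss formulations differ; this is illustrative only.

```python
import numpy as np
from scipy import ndimage

def multiclass_component_dice_loss(prob, target, n_classes, eps=1e-6):
    """One-vs-rest, per-component soft Dice with inverse-size weighting
    confined to each component, averaged uniformly over classes.

    prob:   (C, H, W, D) softmax probabilities.
    target: (H, W, D) integer label map.
    """
    class_losses = []
    for c in range(n_classes):
        gt_c = (target == c).astype(np.float32)
        comps, n_comp = ndimage.label(gt_c)              # instances of class c
        if n_comp == 0:
            continue
        comp_losses, weights = [], []
        for k in range(1, n_comp + 1):
            m = (comps == k)
            inter = prob[c][m].sum()                     # predicted mass inside the component
            # Dice restricted to the component's voxels (false positives outside
            # the component are ignored in this simplified sketch).
            dice = (2 * inter + eps) / (inter + m.sum() + eps)
            comp_losses.append(1.0 - dice)
            weights.append(1.0 / m.sum())                # inverse component size
        w = np.array(weights) / np.sum(weights)
        class_losses.append(float(np.dot(w, comp_losses)))
    return float(np.mean(class_losses))                  # uniform average over classes
```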
Authors: Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, Angel X. Chang
Abstract: Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.
Authors: Mahdi Chamseddine, Fabian Kaufmann, Marius Schellen, Christian Glock, Didier Stricker, Jason Rambach
Abstract: Automatic generation of Building Information Models (BIM) from building scans is a key challenge in architecture and construction. We present a modular pipeline for generating IFC-compliant BIM from 3D point clouds. The hybrid approach combines learning-based semantic segmentation with topology-aware geometric reconstruction to model structural elements accurately. We propose vIoU, adapting voxel-based overlap evaluation to Scan-to-BIM by enabling holistic, instance-matching-free comparison of reconstructed and ground-truth models. We release the German Hospital dataset (DeKH), including high-resolution point clouds, ground truth BIMs, and semantic annotations. Experiments on DeKH and CV4AEC datasets show significant improvements over a RANSAC-based baseline, demonstrating robustness and scalability.
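Voxel-based overlap without instance matching can be sketched as occupancy-grid IoU over sampled points; the voxel size and point-sampling assumption are illustrative, and the actual vIoU definition may differ (e.g., solid voxelization of BIM elements rather than surface samples).

```python
import numpy as np

def voxel_iou(points_a, points_b, voxel_size=0.05):
    """Instance-matching-free volumetric overlap between two models,
    computed on occupancy grids built from sampled points.

    points_a, points_b: (N, 3) point samples of reconstructed / ground-truth BIM.
    """
    pts = np.vstack([points_a, points_b])
    origin = pts.min(axis=0)                              # shared grid origin
    vox_a = {tuple(v) for v in np.floor((points_a - origin) / voxel_size).astype(int)}
    vox_b = {tuple(v) for v in np.floor((points_b - origin) / voxel_size).astype(int)}
    inter = len(vox_a & vox_b)
    union = len(vox_a | vox_b)
    return inter / union if union else 1.0
```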
Authors: Xiaolin Qin, Qianlei Wang, Jiacen Liu, Chaoning Zhang, Fei Zhu, Zhang Yi
Abstract: Recovering 3D human pose from multi-view imagery typically relies on precise camera calibration, which is often unavailable in real-world scenarios, thereby severely limiting the applicability of existing methods. To overcome this challenge, we propose an unconstrained framework that synergizes deep neural networks, algebraic priors, and temporal dynamics for uncalibrated multi-view human pose estimation. First, we introduce the Triangulation with Transformer Regressor (TTR), which reformulates classical triangulation into a data-driven token fusion process to bypass the dependency on explicit camera parameters. Second, to explicitly embed the inherent algebraic relations of the multi-view variety into the learning process, we propose the Gr\"{o}bner basis Corrector (GC). This pioneering loss formulation enforces constraints derived from the multi-view variety to ensure the neural predictions strictly adhere to the laws of projective geometry. Finally, we devise the Temporal Equivariant Rectifier (TER), which exploits the equivariance property of human motion to impose temporal coherence and structural consistency, effectively mitigating scale ambiguity in uncalibrated settings. Extensive evaluations on standard benchmarks demonstrate that our framework establishes a new state-of-the-art for uncalibrated multi-view human pose estimation. Notably, our approach significantly closes the performance gap between calibration-free methods and fully calibrated oracles.
Authors: Dibyadip Chatterjee, Zhanzhong Pang, Fadime Sener, Yale Song, Angela Yao
Abstract: Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.
Authors: Qianlei Wang, Kexun Chen, Shaolin Zhang, Hongli Gao, Chaoning Zhang, Xiaolin Qin
Abstract: Monocular depth estimation (MDE) has witnessed remarkable progress driven by Convolutional Neural Networks and transformer-based architectures. However, these approaches typically treat the problem as a generic image-to-image regression on Euclidean grids, thereby overlooking the intrinsic algebraic and geometric structures induced by perspective projection. To address this limitation, we propose LAGRNet, a novel framework that fundamentally grounds MDE in algebraic geometry by explicitly embedding learnable group, ring, and sheaf structures into the deep learning pipeline. Modeling feature maps as sections of a sheaf over an approximated image manifold, our method first establishes a Group-defined Feature Manifold (GFM) parameterized by a learned algebraic group action to enforce projective equivariance and robustness against view changes. To facilitate algebraically consistent cross-scale interactions, we subsequently introduce a Ring Convolution Layer (RCL) that formulates feature fusion as a graded ring homomorphism. Furthermore, to ensure global topological consistency, a Sheaf-based Module (SM) aggregates local depth cues via a \v{C}ech nerve on the image topology. Extensive zero-shot evaluations across the KITTI, NYU-Depth V2, and ETH3D benchmarks demonstrate that LAGRNet significantly outperforms state-of-the-art methods in both accuracy and generalization capabilities.
Authors: Alexander Zimmer, Yasmeen Abdrabou, Enkelejda Kasneci
Abstract: Research on video-based eye-tracking has long explored stereo and glint-based methods, yet existing wearable eye trackers - both commercial and open-source - offer limited flexibility for algorithm development and comparative evaluation. We present an affordable, wearable stereo eye-tracking platform built from off-the-shelf and 3D-printable components that explicitly targets this gap. The system combines four infrared eye cameras, infrared illumination, an optional scene camera, and software support for calibration and synchronized data acquisition. By design, the platform supports multiple eye-tracking paradigms, including stereo, glint-based, and binocular approaches, within a single hardware configuration. Rather than optimizing for end-user robustness, the platform prioritizes modularity and extensibility for research use. This paper focuses on the hardware architecture and calibration pipeline and demonstrates the feasibility of the approach using a prototype implementation. All hardware designs and documentation are made openly available.
Authors: Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang
Abstract: Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \textbf{ForeSight}, which enables VLMs to \textbf{See Further} with low-level visual cues and \textbf{Think Deeper} with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
Authors: Arya Shah, Deepali Mishra, Chaklam Silpasuwanchai
Abstract: Vision-language models (VLMs) are increasingly deployed as evaluators in tasks requiring nuanced image understanding, yet their reliability in scoring alignment between images and text descriptions remains underexplored. We investigate whether small, open-weight VLMs exhibit \emph{sycophantic} behavior when evaluating image-text alignment: assigning high scores without grounding their judgments in visual evidence. To quantify this phenomenon, we introduce the \emph{Bluffing Coefficient} (BC), a metric that measures the mismatch between a model's score and its evidence recall. We evaluate six open-weight VLMs ranging from 450M to 8B parameters on a benchmark of 173,810 AI-generated character portraits paired with detailed textual descriptions. Our analysis reveals a significant inverse correlation between model size and sycophancy rate ($r = -0.96$, $p = 0.002$), with smaller models exhibiting substantially higher rates of unjustified high scores. The smallest model tested (LFM2-VL, 450M) produced sycophantic evaluations in 22.3\% of cases, compared to 6.0\% for the largest (LLaVA-1.6, 7B). These findings have direct implications for the deployment of small, open-weight VLMs as automated evaluators within attribute-rich, synthetic image evaluation tasks, where the gap between assigned scores and cited visual evidence is both measurable and consequential.
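The exact definition of the Bluffing Coefficient is not given in this abstract; one plausible reading, the rate of high scores paired with low evidence recall, can be sketched as below. The thresholds, function name, and example values are assumptions for illustration, not the authors' metric.

```python
import numpy as np

def bluffing_coefficient(scores, evidence_recall, high_score=0.8, low_recall=0.3):
    """Fraction of evaluations where the VLM judge assigns a high alignment
    score while grounding little of it in visual evidence (one reading of BC).

    scores:          (N,) alignment scores in [0, 1] assigned by the VLM judge.
    evidence_recall: (N,) fraction of described attributes the judge actually
                     cited from the image.
    """
    scores = np.asarray(scores)
    evidence_recall = np.asarray(evidence_recall)
    sycophantic = (scores >= high_score) & (evidence_recall <= low_recall)
    return float(sycophantic.mean())

print(bluffing_coefficient([0.9, 0.95, 0.4, 0.85], [0.1, 0.7, 0.2, 0.25]))  # 0.5
```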
Authors: Daniel Fritz, Dimitrios Lagamtzis, Michael Mink, Markus Enzweiler, Steffen Schober
Abstract: The continuous advancement of autonomous driving (AD) introduces challenges across multiple disciplines to ensure safe and efficient driving. One such challenge is the generation of High-Definition (HD) maps, which must remain up to date and highly accurate for downstream automotive tasks. One promising approach is the use of crowdsourced data from a vehicle fleet, representing road topology and lane-level features. This work focuses on the generation of centerlines and lane dividers from crowdsourced vehicle trajectories. We adopt a Detection Transformer (DETR)-based approach, where a rasterized representation of vehicle trajectories is used as input to predict vectorized lane representations. Each lane consists of a centerline with an associated direction and corresponding lane dividers that are geometrically constrained by the centerline. Our method includes the extraction of local tiles, from which crowdsourced vehicle trajectories are aggregated. Each tile undergoes a transformation into a rasterized representation encoding both the presence and direction of each trajectory, enabling the prediction of vectorized directed lanes. Experiments are conducted on an internal dataset as well as on the public datasets nuScenes and nuPlan.
Authors: Matti Hyypp\"a, Klaara Salolahti, Eric Hyypp\"a, Xiaowei Yu, Josef Taher, Leena Matikainen, Matti Lehtom\"aki, Paula Litkey, Teemu Hakala, Harri Kaartinen, Juha Hyypp\"a, Antero Kukko
Abstract: The shift from stand-level to individual-tree-level forest assessments supports improved biodiversity mapping, particularly in boreal ecosystems where tree species like aspen (Populus tremula L.) play a keystone role. While airborne laser scanning (ALS) is the standard for such inventories, a major limitation is the small number of publicly available ALS datasets containing high-quality, field-validated reference data. Furthermore, open multispectral ALS datasets with high-quality field reference data are completely lacking despite the potential of multispectral ALS data for tree species classification. This paper presents and details an open multispectral ALS dataset used in a recent international benchmarking study of machine learning and deep learning methods for tree species classification by Taher et al. (2026). The dataset comprises 6326 segment-level point clouds of individual trees representing nine species in Southern Finland. The point cloud data has been acquired using two multispectral laser scanning systems each operating at three laser wavelengths: a helicopter-borne system (HeliALS) with a point density exceeding 1000 points/m$^2$ and an Optech Titan system with approximately 35 points/m$^2$. We provide a detailed description of field data collection techniques developed in the study to facilitate the collection of high-quality ground truth data in an efficient and scalable manner. Additionally, our article presents new analyses on species classification using multispectral data building upon the initial findings of Taher et al. (2026). Furthermore, we study the relation between classification accuracy and tree height to highlight the versatility of the open dataset and to demonstrate the advantage of the point transformer model for small trees and minority species.
Authors: Yubo Jiang, Xin Yang, Abudukelimu Wuerkaixi, Zheming Yuan, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang
Abstract: Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.
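The dual-path contrast can be sketched as a standard contrastive-decoding step that pushes the next-token distribution toward the positive (visually amplified) path and away from the negative (visually degraded) counterfactual; the combination rule and the alpha knob are assumptions, not PND's exact formulation.

```python
import torch

def pnd_step(pos_logits, neg_logits, alpha=1.0):
    """One decoding step contrasting a visually amplified (positive) pass
    against a visually degraded (negative) pass.

    pos_logits, neg_logits: (V,) next-token logits from the two paths.
    alpha: strength of the contrast (illustrative hyperparameter).
    """
    contrast = pos_logits + alpha * (pos_logits - neg_logits)
    return torch.softmax(contrast, dim=-1).argmax()      # greedy pick for brevity

vocab = 32000
pos = torch.randn(vocab)
neg = torch.randn(vocab)
print(pnd_step(pos, neg).item())
```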
Authors: Rameshwar Mishra, A V Subramanyam
Abstract: The recent surge in content consumption through streaming services has driven a growing demand for personalized content. Personalized advertisements (ads) play a crucial role in enhancing both user engagement and ad effectiveness. A key aspect of ad personalization involves replacing existing regions in a frame with custom, Photoshop-generated banners. However, existing ad-placement pipelines typically rely on simple geometric warping, ignoring the scene's underlying lighting conditions. Similarly, state-of-the-art diffusion-based object insertion and relighting models struggle to accurately relight these newly inserted banners, as they are not trained on ad-banner data, and training such a model for ad banners would require millions of images. This highlights the need for an effective relighting framework that enables seamless integration of custom banners into the original scene. Motivated by this, we present AD-Relight, a novel multi-stage training-free framework that adapts a diffusion-based relighting model at test time to relight newly added Photoshop-generated ad banners. Through extensive evaluation, we demonstrate that AD-Relight outperforms both relighting baselines and existing ad-placement methods based on simple warping. User studies further show that participants consistently prefer the outputs of AD-Relight over those of prior approaches.
Authors: Akash Sharma, Chinmay Mhatre, Sankalp Gawali, Ruthvik Bokkasam, Brij Sharma, Vishwajeet Pattanaik, Punit Rathore, Raghu Krishnapuram, Vijay Gopal Kovvali, Anirban Chakraborty, Yogesh Simmhan
Abstract: Robust vehicle detection from fixed CCTV cameras is critical for Intelligent Transportation Systems. Yet existing benchmarks predominantly feature relatively homogeneous, highly organized traffic patterns captured from ego-centric driving perspectives or controlled aerial views. This regional and sensor view bias creates a significant gap. Models trained on datasets such as UA-DETRAC and COCO struggle to generalize to the dense, heterogeneous, disorganized traffic conditions observed in rapidly developing urban centers in emerging economies. To address this limitation, we introduce BMD-45, a large-scale dataset comprising 480K bounding boxes annotated over 45K images captured from over 3.6K operational Safe City CCTV cameras. BMD-45 contains 14 fine-grained vehicle categories, including region-specific modes such as auto-rickshaws and tempo travellers, which are not present in existing benchmarks. The dataset captures real-world deployment challenges, including extreme viewpoint variation, occlusion, and vehicle density. We establish comprehensive baselines using state-of-the-art detectors and reveal a striking domain gap: models fine-tuned on UA-DETRAC achieve only 33.6% mAP@0.50:0.95, compared to 83.8% when trained in-domain on BMD-45, representing a 2.5x improvement that persists even when accounting for novel vehicle classes. This performance gap underscores the critical need for geographically diverse traffic benchmarks and establishes BMD-45 as a baseline for developing robust perception systems in underrepresented urban environments worldwide. The dataset is available at: https://huggingface.co/datasets/iisc-aim/BMD-45.
Authors: Md Shohel Rana, Andrew H. Sung
Abstract: AI-generated media are advancing rapidly, raising pressing concerns for content authenticity and digital trust. We introduce DYMAPIA, a multi-domain Deepfake detection framework that fuses spatial, spectral, and temporal cues to capture subtle traces of manipulation in visual data. The system builds dynamic anomaly masks by combining evidence from Fourier spectra, local texture descriptors, edge irregularities, and optical flow consistency, which highlight tampered regions with fine spatial accuracy. These masks guide DistXCNet, a lightweight classifier distilled from Xception and optimized with depthwise separable convolutions for fast, region-focused classification. This joint design achieves state-of-the-art results, with accuracy and F1-scores exceeding 99\% on FF++, Celeb-DF, and VDFD benchmarks, while keeping the model compact enough for real-time use. Beyond outperforming existing full-frame and multi-domain detectors, DYMAPIA demonstrates deployment readiness for time-critical forensic tasks, including media verification, misinformation defense, and secure content filtering.
Authors: Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
Abstract: Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
Authors: Dongxing Mao, Yilin Wang, Linjie Li, Zhengyuan Yang, Alex Jinpeng Wang
Abstract: Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout -- especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.
Authors: Mohammadmehdi Ataei, Farzaneh Askari, Kamal Rahimi Malekshan, Pradeep Kumar Jayaraman
Abstract: Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset's utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.
Authors: Parampuneet Kaur Thind, Vaibhav Katturu, Giacomo Zema, Roberto Del Prete
Abstract: Designing deep networks that meet strict latency and accuracy constraints on edge accelerators increasingly relies on hardware-aware optimization, including neural architecture search (NAS) guided by device-level metrics. Yet most hardware-aware NAS pipelines still optimize architectures under full-precision assumptions and apply low-precision adaptation only after the search, leading to a mismatch between optimization-time behavior and deployment-time execution on low-precision hardware that can substantially degrade accuracy. We address this limitation by integrating deployment-aligned low-precision training directly into hardware-aware NAS. Candidate architectures are exposed to FP16 numerical constraints during fine-tuning and evaluation, enabling joint optimization of architectural efficiency and numerical robustness without modifying the search space or evolutionary strategy. We evaluate the proposed framework on vessel segmentation for spaceborne maritime monitoring, targeting the Intel Movidius Myriad X Visual Processing Unit (VPU). While post-training precision conversion reduces on-device performance from 0.85 to 0.78 mIoU, deployment-aligned low-precision training achieves 0.826 mIoU on-device for the same architecture (95,791 parameters), recovering approximately two-thirds of the deployment-induced accuracy gap without increasing model complexity. These results demonstrate that incorporating deployment-consistent numerical constraints into hardware-aware NAS substantially improves robustness and alignment between optimization and deployment for resource-constrained edge Artificial Intelligence (AI).
Authors: Md Shohel Rana, Tanoy Debnath
Abstract: Face swapping aims to generate realistic facial images by transferring the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often struggle to balance identity preservation and visual realism due to limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision, with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73, outperforming established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.
Authors: Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo
Abstract: Modern self-supervised representation learning methods often rely on empirical heuristics that are not theoretically grounded. In this study, we propose HyDeS, a theoretically grounded method based on multi-view mutual information maximization within a hyperspherical space, using Shannon differential entropy with a non-parametric von Mises-Fisher density estimator. We show that HyDeS biases the trained model towards focusing on foreground features of images and performs well on segmentation tasks such as PASCAL VOC, while it lags in fine-grained classification. We provide a detailed analysis of the induced latent space geometry and learning dynamics, which can inform the design of other theoretically grounded self-supervised learning methods.
Authors: Ni Yao, Xiangyu Liu, Shaojie Tang, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chengyang Li, Fubao Zhu, Chen Zhao, Zhihui Xu, Weihua Zhou
Abstract: Clinical fusion of Single Photon Emission Computed Tomography Myocardial Perfusion Imaging (SPECT MPI) and Computed Tomography Angiography (CTA) remains limited by cross-modality misregistration and reliance on manual landmarks, which can hinder accurate ischemia localization and lesion-level functional assessment. To address this issue, we propose a registration and fusion framework for SPECT MPI and CTA that integrates functional and structural information for comprehensive cardiac evaluation. The proposed pipeline performs U-Net-based segmentation on both modalities. On SPECT MPI, only the left ventricle (LV) is extracted, and anatomical landmarks are automatically derived from characteristic LV structures. On CTA, both ventricles are segmented, and their spatial relationship is used to automatically define landmarks at the interventricular septal junction. Scale-space consistency preprocessing and landmark-driven coarse registration are applied to mitigate initial misalignment. Based on this initialization, multiple fine registration methods are evaluated on LV epicardial surface point clouds, including ICP, SICP, CPD, CluReg, FFD, and BCPD-plus-plus. The resulting transformations are then propagated to voxel-level resampling for high-precision SPECT-CTA fusion. In a retrospective cohort of 60 patients, the proposed framework preserved sub-millimeter coronary detail from CTA while accurately overlaying quantitative SPECT perfusion. Among the evaluated methods, BCPD-plus-plus achieved the highest accuracy with a mean point cloud distance of 1.7 mm. By combining robust initialization, comparative fine registration, and voxel-level fusion, the proposed approach provides a practical solution for myocardial ischemia localization and functional evaluation of coronary lesions, while remaining independent of any specific fine registration algorithm.
Authors: Jinghao Shi, Mengqi Lei, Kunliang He, Yun Li, Wei Bao, Siqi Li
Abstract: RGB-Thermal (RGB-T) crowd counting aims to integrate visible-spectrum and thermal infrared information to improve the robustness of crowd density estimation in complex scenes. Although existing studies generally improve counting accuracy through cross-modal feature fusion, most current methods rely on implicit cross-modal fusion strategies and lack explicit modeling of local spatial discrepancies as well as fine-grained characterization of modality reliability at the positional level, thereby limiting the accuracy and interpretability of the fusion process. To address these issues, this paper proposes a two-stage fusion framework, RACANet, a Reliability-Aware Crowd Anchor Network for RGB-T crowd counting. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local bidirectional soft matching. Then, based on the priors learned during pretraining, a Local Anchor Fusion Module (LAFM) is introduced in the formal training stage. This module generates local semantic anchors by aggregating features from highly reliable regions and further enables adaptive pixel-level feature redistribution with a local attention mechanism. In addition, we propose a discrepancy-aware consistency constraint to dynamically coordinate the reliability of regions where modal representations are consistent. Experiments conducted on two widely used benchmark datasets, RGBT-CC and Drone-RGBT, demonstrate that RACANet outperforms existing methods. The anonymous code is available at https://anonymous.4open.science/r/RACANet-9985.
Authors: Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun, Changhao Pan, Yifu Chen, Minjie Hong, Weijie Wang, Shuang Chen, Yue Chen, Zhou Zhao
Abstract: Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and that this approach can be generalized to various downstream tasks, yielding a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios, without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
Authors: Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen
Abstract: Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
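A minimal sketch of the token-level advantage idea, assuming a GRPO-style group of sampled responses and a binary mask of tokens that the process reward model flags as perceptual errors; the penalty weight and normalization details are illustrative placeholders, not the paper's exact formulation.

```python
import torch

def token_level_advantages(rewards, halluc_masks, penalty=1.0):
    """Turn group-relative sequence rewards into per-token advantages.

    rewards:      (G,) scalar reward per sampled response in the group.
    halluc_masks: (G, T) 1.0 where a token lies in a flagged perceptual-error
                  span, else 0.0.
    Flagged tokens receive an extra penalty on top of the group-normalized
    sequence advantage (GRPO-style baseline).
    """
    adv_seq = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # (G,)
    adv_tok = adv_seq.unsqueeze(1).expand_as(halluc_masks).clone()     # (G, T)
    adv_tok = adv_tok - penalty * halluc_masks
    return adv_tok
```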
Authors: Yuta Baba, Keiji Yanai
Abstract: Single-image point cloud reconstruction must infer complete 3D geometry, including occluded parts, from a single RGB image. While diffusion-based reconstructors achieve high accuracy, they typically require many denoising iterations, resulting in slow and expensive inference. We propose Point-MF, a Mean-Flow-based framework for low-NFE single-image point cloud reconstruction that couples a Mean-Flow-compatible architecture with an auxiliary loss. Specifically, Point-MF operates directly in point-cloud space to learn the mean velocity field and enables one-step reconstruction with a single network function evaluation (1-NFE), without relying on VAE-based latent representations. To make Mean Flow effective under large interval jumps, Point-MF employs a Diffusion Transformer tailored to the Mean-Flow setting, conditioned on frozen DINOv3 image features via a lightweight token adapter and equipped with explicit interval/time conditioning. Moreover, we introduce Denoised Space Anchor, a set-distance auxiliary loss on the denoised-space estimate $x_\theta$ induced by the predicted velocity field, to stabilize large-step generation and reduce outliers and density artifacts. On ShapeNet-R2N2 and Pix3D, Point-MF strikes a strong balance between reconstruction quality and inference speed compared to multi-step diffusion baselines and competitive feedforward models, while generating high-quality point clouds with millisecond-level latency.
Authors: Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi
Abstract: Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
Authors: Haosong Xiao, Yamini Ramesh, Rishabh Shukla, Swarat Sarkar, Chaozhe R. He
Abstract: In this paper, we report the world's first infrastructure-guided, communication-enhanced road crack detection pipeline that is effective and implementable on passenger vehicles. We first design a customized communication protocol to transmit the region of interest from the infrastructure to the vehicle. With proper camera image processing (e.g., dynamic cropping and frame selection), the focused images are provided to the crack detection model. Leveraging state-of-the-art crack detection model backbones and a carefully prepared dataset of forward-facing views containing cracks, we train the model to improve crack-detection performance. We demonstrate the full detection pipeline on an experimental vehicle platform, showcase its detection effectiveness, and outline future research directions.
Authors: Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He, Fei Wang, Heng Yang
Abstract: Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $\pi_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $\pi_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
Authors: Shiyi Zhang, Yiji Cheng, Tiankai Hang, Zijin Yin, Runze He, Yu Xu, Wenxun Dai, Yunlong Lin, Chunyu Wang, Qinglin Lu, Yansong Tang
Abstract: Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet - (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability. (2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks. To further align the model's editing behavior with its CoT reasoning, we introduce the CoT-Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model are released at https://shiyi-zh0408.github.io/projectpages/Meta-CoT/
Authors: Hai Wang, Xiaochen Yang, Mingzhi Dong, Jing-Hao Xue
Abstract: The dream of instantly creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet a crucial gap exists in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, standard AI evaluators, predominantly trained on perspective image-text pairs, face an open question regarding their understanding of the unique characteristics of 360-degree panoramic image-text pairs. This paper addresses this gap by first introducing two concepts: "360-degree textual semantics", semantic information conveyed by explicit format identifiers, and "360-degree visual semantics", invariant semantics under horizontal circular shifts. To probe CLIP's comprehension of these semantics, we then propose novel evaluation methodologies using keyword manipulation and horizontal circular shifts of varying magnitudes. Rigorous statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, though with a slight degradation in original semantic evaluation performance, highlighting a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at https://github.com/littlewhitesea/360Semantics.
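The horizontal circular-shift probe can be sketched with an off-the-shelf CLIP checkpoint from Hugging Face; the checkpoint name, number of shifts, and scoring below are assumptions, and the paper's evaluation protocol and statistics are more involved. A scorer with full 360-degree visual semantics would return near-identical similarities for every shift.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def shifted_similarities(panorama_path, caption, num_shifts=8):
    """Score one equirectangular panorama against a caption under
    horizontal circular shifts of the image columns."""
    pano = np.array(Image.open(panorama_path).convert("RGB"))
    text = processor(text=[caption], return_tensors="pt", padding=True)
    sims = []
    with torch.no_grad():
        t = model.get_text_features(**text)
        t = t / t.norm(dim=-1, keepdim=True)
        for k in range(num_shifts):
            shifted = np.roll(pano, shift=k * pano.shape[1] // num_shifts, axis=1)
            img = processor(images=Image.fromarray(shifted), return_tensors="pt")
            v = model.get_image_features(**img)
            v = v / v.norm(dim=-1, keepdim=True)
            sims.append((v @ t.T).item())
    return sims  # spread across shifts indicates limited 360-degree visual semantics
```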
Authors: Fredrik K. Gustafsson, Constance Boissin, Johan Vallon-Christersson, David A. Clifton, Mattias Rantalainen
Abstract: Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.
Authors: Jorge L. A. Lima, Filipe R. Cordeiro
Abstract: Chromosome analysis is a fundamental step in the diagnosis of genetic diseases, but the manual karyotyping workflow is time-consuming and heavily dependent on expert specialists, often requiring several days per patient. Although Deep Learning models have achieved high performance in chromosome detection, most proposed solutions remain restricted to research prototypes or lack graphical interfaces suitable for clinical use. In this work, we present Aycromo, an open-source desktop platform for AI-assisted cytogenetic analysis. Built on Electron and ONNX Runtime, the tool allows cytogeneticists to load pre-trained models, compare architectures through an integrated benchmarking module, and manually correct detections via an interactive annotation interface, all without command-line interaction. Preliminary experiments on metaphase images from the CRCN-NE dataset demonstrate that YOLOv11 achieves 99.40% mAP@50, while the platform reduces per-slide analysis time to seconds.
Authors: Cheng Wang, Zhibin He, Zhihao Peng, Shengyuan Liu, Yufan Hu, Lichao Sun, Xiang Li, Yixuan Yuan
Abstract: Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html
Authors: Vandita Shukla, Fabio Remondino, Blair Costelloe, Benjamin Risse
Abstract: Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two-dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three-dimensional scene geometry from monocular drone video with open-vocabulary 2D instance segmentation to enable species-agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter-animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement. By transforming standard drone footage into structured 3D and viewpoint-aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.
Authors: Tal Grossman, Noa Cahan, Lev Ayzenberg, Hayit Greenspan
Abstract: Segmentation models such as Segment Anything Model (SAM) and SAM2 achieve strong prompt-driven zero-shot performance. However, their training on natural images limits domain transfer to medical data. Consequently, accurate segmentation typically requires extensive fine-tuning and expert-designed prompts. We propose DiffuSAM, a diffusion-based adaptation of SAM2 for prompt-free medical image segmentation. Our framework synthesizes SAM2-compatible segmentation mask-like embeddings via a lightweight diffusion-prior from off-the-shelf frozen SAM2 image features. The generated embeddings are integrated into SAM2's mask decoder to produce accurate segmentations, thereby eliminating the need for user prompts. The diffusion prior is further conditioned on previously segmented slices, enforcing spatial consistency across volumes. Evaluated on the BTCV and CHAOS datasets for CT and MRI under Source-Free Unsupervised Domain Adaptation (SF-UDA) and Few-Shot settings, DiffuSAM achieves competitive performance with efficient training and inference. Code is available upon request from the corresponding author.
Authors: Boyang Wang, Guangyi Xu, Zhipeng Tang, Jiahui Zhang, Zezhou Cheng
Abstract: Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD has been widely studied in the literature, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut to formulate SBD as structured relational prediction, jointly estimating shot ranges with intra-shot relations and inter-shot relations, via a shot query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.
Authors: Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong
Abstract: Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
Authors: Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang
Abstract: Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
Authors: Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim
Abstract: Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.
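The abstract does not give the exact scoring formula, but the penalized structure it describes can be sketched as follows; the rubric score, over-correction count, and penalty weight are hypothetical placeholders for quantities an LLM judge would produce.

```python
def pink_style_score(rubric_score, num_overcorrections, penalty_weight=0.5):
    """Hypothetical shape of a penalized, rubric-based transcription score.

    rubric_score:        LLM-judged semantic fidelity of the transcription
                         to the handwritten solution, in [0, 1].
    num_overcorrections: count of places where the model "fixed" a student
                         error instead of transcribing it faithfully,
                         as flagged by the judge against the original ink.
    The penalty makes silent error-fixing costly rather than rewarded.
    """
    return max(0.0, rubric_score - penalty_weight * num_overcorrections)
```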
Authors: Jia-Mian Wu, Jun Liu, Siqi Li, Xiaoya Wang, Shibai Yin, Huanyu Luo, Lingling Zheng, Qiang Gao, Jigang Yang, Tai-Xiang Jiang
Abstract: Computed tomography (CT)-based attenuation and scatter correction improves quantitative PET but adds radiation exposure that is particularly undesirable in pediatric imaging. Existing CT-free methods are commonly trained in homogeneous settings and often degrade under scanner or radiotracer shifts, which limits their clinical utility. We propose the Generalizable PET Correction Network (GPCN), a dual-domain network for domain-robust CT-free PET attenuation and scatter correction. GPCN combines a multi-band contextual refinement module, which models pediatric anatomical variability through wavelet-based multiscale decomposition and long-range spatial context modeling, with a frequency-aware spectral decoupling module, which performs coordinate-conditioned amplitude/phase refinement in the Fourier domain. By synergizing multi-band spatial contextual modeling with asymmetric frequency-spectrum decoupling, the network explicitly separates invariant topological structures from domain-specific noise, thereby achieving precise quantitative recovery of both anatomical organs and focal lesions. This design aims to separate anatomy-dominant structures from domain-sensitive spectral residuals and to improve robustness across heterogeneous imaging conditions. We train and evaluate the method on 1085 pediatric whole-body PET scans acquired with two scanners and five radiotracers. In both joint training and zero-shot cross-domain evaluation, GPCN outperforms representative baselines and maintains stable quantitative accuracy on unseen scanner-tracer combinations. The method is further supported by ablation, region-wise quantitative analysis, and downstream segmentation experiments. In our cohort, the CT component of the conventional protocol corresponded to an average effective dose of 10.8 mSv, indicating the potential clinical value of reliable CT-free correction for pediatric PET.
Authors: Qiuli Wang, Xinhuan Sun, Fengxi Chen, Yongxu Liu, Jie Cheng, Lin Chen, Jiafei Chen, Yue Zhang, Xiaoming Li, Wei Chen
Abstract: Gadoxetate disodium-enhanced MRI is essential for the detection and characterization of hepatocellular carcinoma. However, acquisition of the hepatobiliary phase (HBP) requires a prolonged post-contrast delay, which reduces workflow efficiency and increases the risk of motion artifacts. In this study, we propose a Triple-Phase Sequential Fusion Network (TriPF-Net) to synthesize HBP images by leveraging the sequential information from pre-HBP sequences: while T1-weighted imaging serves as the indispensable baseline, the model adaptively integrates arterial-phase (AP) and venous-phase (VP) features when available. By modeling the tissue-specific contrast uptake and excretion dynamics across these three phases, TriPF-Net ensures robust HBP synthesis even under the stochastic absence of one or both dynamic contrast-enhanced sequences. The framework comprises an Enhanced Region-Guided Encoder and a Dynamic Feature Unification Module, optimized with a Region-Guided Sequential Fusion Loss to maintain physiological consistency. In addition, clinical variables, including age, sex, total bilirubin, and albumin, are incorporated to enhance physiological consistency. Compared with conventional methods, TriPF-Net achieved superior performance on datasets from two centers. On the internal dataset, the model achieved an MAE of 10.65, a PSNR of 23.27, and an SSIM of 0.76. On the external validation dataset, the corresponding values were 12.41, 23.11, and 0.78, respectively. This flexible solution enhances clinical workflow and lesion depiction, potentially eliminating the need for delayed HBP acquisition in HCC imaging.
Authors: Xiangcen Wu, Ruohua Chen, Sichun Li, Qianye Yang, Sheng Liu, Jianjun Liu, Zhaoheng Xie
Abstract: Whole-body Positron Emission Tomography (PET) registration is essential for multi-parametric tumor characterization and assessment of metastatic disease progression. In deep learning-based deformable registration, the dense displacement field (DDF) regularizer is crucial for stabilizing optimization and preventing unrealistic deformations in large 3D volumes. A key challenge in whole-body deformable registration is anatomical heterogeneity: rigid structures (e.g., bones) should undergo stronger regularization, whereas soft tissues require more flexible deformation and weaker constraints. In this work, we propose a simple yet effective CT-guided spatially-varying regularization strategy for whole-body cross-tracer deformable PET registration. The key idea is to use the paired CT volume from the PET/CT acquisition to construct a voxel-wise regularization map for the DDF, replacing the conventional single global regularization weight. This yields anatomy-adaptive regularization strength across rigid and soft tissues. The proposed method is evaluated on a real clinical cross-tracer PET/CT dataset of 296 patients involving 18F-PSMA and 18F-FDG, showing that the proposed method achieves statistically significant improvements over a weakly-supervised registration baseline in both whole-body registration performance and organ-wise alignment.
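A minimal sketch of what a CT-derived, voxel-wise regularization map could look like inside a DDF smoothness penalty; the Hounsfield-unit threshold, weights, and first-difference penalty are illustrative assumptions, not the paper's exact regularizer.

```python
import torch

def ct_guided_smoothness(ddf, ct_hu, w_bone=10.0, w_soft=1.0, bone_hu=300.0):
    """Weighted first-difference smoothness penalty on a dense displacement field.

    ddf:   (B, 3, D, H, W) predicted displacement field.
    ct_hu: (B, 1, D, H, W) paired CT volume in Hounsfield units.
    Voxels above `bone_hu` (rigid structures) get a stronger penalty,
    soft tissue keeps a weaker one.
    """
    weight = torch.where(ct_hu > bone_hu,
                         torch.full_like(ct_hu, w_bone),
                         torch.full_like(ct_hu, w_soft))          # (B,1,D,H,W)
    dz = (ddf[:, :, 1:, :, :] - ddf[:, :, :-1, :, :]).abs()
    dy = (ddf[:, :, :, 1:, :] - ddf[:, :, :, :-1, :]).abs()
    dx = (ddf[:, :, :, :, 1:] - ddf[:, :, :, :, :-1]).abs()
    return (weight[:, :, 1:, :, :] * dz).mean() \
         + (weight[:, :, :, 1:, :] * dy).mean() \
         + (weight[:, :, :, :, 1:] * dx).mean()
```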
Authors: Mengyu Wang, Xiaoying Zhi, Zhiyi Li, Robin Schmucker, Shay B. Cohen, Tiejun Ma, Fran Silavong
Abstract: While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this performance bottleneck to the LLMs' knowledge expression mechanism, rather than to deficiencies in knowledge acquisition. To address this, we propose Self-Knowledge Re-expression (SKR), a novel, task-agnostic adaptation method. SKR transforms the LLM's output from generic token generation to highly efficient, task-specific expression. SKR is a fully local method that uses only unannotated data, requiring neither human supervision nor model distillation. Experiments on a large financial document dataset demonstrate substantial improvements: over 40% in Recall@1 for information retrieval tasks, over 76% reduction in object detection latency, and over 33% increase in anomaly detection AUPRC. Our results on the MMDocRAG dataset surpass those of leading retrieval models by at least 12.6%.
Authors: Jeremy Ellis
Abstract: This paper presents a complete, end-to-end on-device vision machine learning pipeline, comprising data acquisition, two-layer CNN training with Adam optimization, and real-time inference, executing entirely on a microcontroller-class device costing $15-40 USD. Unlike cloud-based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three-class 64x64 image classification in approximately 9 minutes per training run, with real-time inference at 6.3 FPS. Key contributions include: correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system (SD binary > baked-in header > He-initialization) resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontroller constraints. All source code and reference datasets are released under the MIT License at https://github.com/webmcu-ai/on-device-vision-ai
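The firmware itself is C++ on the ESP32-S3, but the pre-computed resize lookup-table idea is easy to illustrate in Python: source-pixel indices are computed once per camera geometry, so per-frame resizing at inference reduces to an index gather. The nearest-neighbor scheme and 64x64 target below are assumptions based on the description above, not the project's code.

```python
import numpy as np

def build_resize_lut(src_h, src_w, dst_h=64, dst_w=64):
    """Precompute nearest-neighbor source indices once, so per-frame
    resizing becomes a pure gather with no per-pixel arithmetic."""
    ys = (np.arange(dst_h) * src_h // dst_h).astype(np.int32)
    xs = (np.arange(dst_w) * src_w // dst_w).astype(np.int32)
    return ys, xs

def resize_with_lut(frame, lut):
    """frame: (H, W, C) uint8 camera frame; lut: output of build_resize_lut."""
    ys, xs = lut
    return frame[ys[:, None], xs[None, :], :]
```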
Authors: Mathias Graf, Marco Willi, Melanie Mathys, Michael Aerni, Christian Schwarzer, Martin Melchior, Michael H. Graber
Abstract: AI-powered generative models have significantly expanded the possibilities for editing, manipulating, and creating high-quality images. Particularly, images that falsely appear to originate from trusted sources pose a serious threat, undermining public trust in image authenticity. We propose DeepSignature, a novel approach that integrates the guarantees of digital signatures with the capabilities of deep neural networks. Neural networks are used both to generate content-encoding watermarks and to embed them imperceptibly into images while ensuring robust extraction. These watermarks are cryptographically verifiable, enabling source attribution and image integrity validation. DeepSignature is compatible with existing image formats and requires no special handling of signed images. It supports client-side verification, requiring only the signer's public key. Additionally, we introduce a novel latent-space verification approach to detect and localize tampering attempts. We evaluate DeepSignature in terms of imperceptibility, robustness to benign transformations, forgery detection, and its resilience against various attack scenarios. Our results highlight the inherent trade-offs between imperceptibility, robustness, and integrity verification. We demonstrate that DeepSignature reliably identifies significant forgery attempts -- achieving near 100\% in our experiments. Finally, we emphasize DeepSignature's modularity and tunable parameters, allowing adaptation to application-specific requirements. Code and model weights will be published.
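The abstract does not specify the signature scheme, so the sketch below uses Ed25519 from the `cryptography` package purely to illustrate the sign-then-verify flow around a content-encoding watermark payload; payload construction and watermark embedding are out of scope here.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

def sign_watermark_payload(private_key: Ed25519PrivateKey, payload: bytes) -> bytes:
    """Sign the content-encoding watermark payload before it is embedded."""
    return private_key.sign(payload)

def verify_watermark_payload(public_key, payload: bytes, signature: bytes) -> bool:
    """Client-side check: only the signer's public key is needed."""
    try:
        public_key.verify(signature, payload)
        return True
    except InvalidSignature:
        return False

# Usage sketch: the watermark bits extracted from an image would stand in
# for `payload` here; the byte string is a placeholder.
key = Ed25519PrivateKey.generate()
payload = b"content-descriptor-extracted-from-watermark"
sig = sign_watermark_payload(key, payload)
assert verify_watermark_payload(key.public_key(), payload, sig)
```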
Authors: Suning Huang, Jiaqi Shao, Ke Wang, Qianzhong Chen, Jiankai Sun, Yanjiang Guo, Mac Schwager, Jeannette Bohg
Abstract: Have you ever post-trained a generalist vision-language-action (VLA) policy on a small demonstration dataset, only to find that it stops responding to new instructions and is limited to behaviors observed during post-training? We identify this phenomenon as lock-in: after low-data, supervised fine-tuning (SFT), the policy becomes overly specialized to the post-training data and fails to generalize to novel instructions, manifesting as concept lock-in (fixation on training objects/attributes) and spatial lock-in (fixation on training spatial targets). Many existing remedies introduce additional supervision signals, such as those derived from foundation models or auxiliary objectives, or rely on augmented datasets to recover generalization. In this paper, we show that the policy's internal pre-trained knowledge is sufficient: DeLock mitigates lock-in by preserving visual grounding during post-training and applying test-time contrastive prompt guidance to steer the policy's denoising dynamics according to novel instructions. Across eight simulation and real-world evaluations, DeLock consistently outperforms strong baselines and matches or exceeds the performance of a state-of-the-art generalist policy post-trained with substantially more curated demonstrations.
Authors: Zhuo Zheng, Iván Higuera-Mendieta, Richard Lee, David Newhouse, Talip Kilic, Stefano Ermon, Marshall Burke, David B. Lobell
Abstract: Poverty statistics guide social policy, but in many low- and middle-income countries, censuses and household surveys that collect these data are costly, infrequent, quickly outdated, and sometimes error-prone. Satellite imagery offers global coverage and the possibility of predicting economic livelihoods at scale, yet existing approaches to predicting livelihoods with imagery or other non-traditional data often fail to reliably identify local-level variation and, as we show, degrade under temporal shift. Here we introduce Tempov, a satellite foundation model pretrained by self-supervision on three million bi-temporal Landsat pairs and adapted with parameter-efficient fine-tuning to sparse survey labels. The model enables large-scale, high-resolution wealth mapping and dynamic measurement, including zero-shot nowcasting up to a decade after observed labels, retrospective hindcasting, and decadal change tracking, while outperforming existing neural network and geospatial foundation-model baselines. In low-label regimes, Tempov achieves competitive accuracy with only 10% of survey samples, indicating substantially reduced dependence on expensive label collection. The model further generalizes across populous countries within and outside Africa, and scales to a unified Africa-wide model with strong continent-level performance ($R^2=0.63$, $r^2=0.68$), from which we generate high-resolution decadal maps of wealth and wealth changes for the African continent. Analysis of these maps shows large variation in recent economic performance both within and across countries. Our open-source approach provides a pathway to timely, scalable, low-cost monitoring of wealth and poverty from routinely collected satellite data.
Authors: Long Jing, Zhixiong Yang, Yajun Zhang, Xinlong Feng
Abstract: Human activity recognition serves as the foundation for various emerging applications. In recent years, researchers have used collaborative sensing of multi-source sensors to capture complex and dynamic human activities. However, multimodal human activity sensing typically encounters highly heterogeneous data across modalities and label scarcity, resulting in an application gap between existing solutions and real-world needs. In this paper, we propose CLMM, a general contrastive learning framework for human activity recognition that achieves effective multimodal recognition with limited labeled data. CLMM employs a novel two-stage training strategy. In the first stage, CLMM employs a CNN-DiffTransformer encoder to capture cross-modal shared information by extracting local and global features. Meanwhile, a hard-positive samples weighting algorithm enhances gradient propagation to reinforce shared learning. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, while a primary-auxiliary collaborative training strategy fuses both shared and modality-specific information. Experimental results on three public datasets demonstrate that CLMM significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance.
Authors: Eshwar R. A., Nevin Mathew Thomas, Nehal G, Farida M. Begam
Abstract: Reconstructing high-fidelity fluid dynamics from sparse temporal observations is quite challenging, mainly due to the chaotic and non-linear nature of fluid transport. Standard deep learning-based interpolation methods often tend to regress to the mean, which results in spatial blurring and temporal strobing, especially noticeable around the observed anchor frames where transitions become discontinuous. In this work, we propose a novel Temporal U-Net architecture that integrates a VGG-based perceptual loss along with a Physics-Informed Bridge to overcome these issues. By introducing time-weighted feature blending and enforcing a parabolic boundary condition defined by t(1 - t), the model ensures smooth transitions while also maintaining perfect consistency at the endpoints. Experimental results on multi-channel RGB fluid data show that our method clearly outperforms standard models, both in terms of structural fidelity and texture preservation. In particular, the model achieves a Mean Absolute Error of 0.015, compared to 0.085 for a standard L1 baseline. Further Spatial Power Spectral Density (PSD) analysis reveals that the model is able to retain high-frequency turbulent details that are usually lost in deterministic reconstructions.
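The parabolic boundary condition can be illustrated with a small blending sketch: whatever residual the network predicts is gated by t(1 - t), so the reconstruction reduces exactly to the anchor frames at t = 0 and t = 1. How the actual model mixes features internally is not specified in the abstract; the linear term and array shapes here are assumptions.

```python
import numpy as np

def bridge_blend(frame0, frame1, residual, t):
    """Interpolate at normalized time t in [0, 1].

    frame0, frame1: observed anchor frames, e.g. (C, H, W) arrays.
    residual:       network prediction of the deviation from linear transport.
    The t*(1-t) factor vanishes at both endpoints, so the output matches
    the anchors exactly and the residual only acts in between.
    """
    linear = (1.0 - t) * frame0 + t * frame1
    return linear + t * (1.0 - t) * residual

# Endpoint check: at t=0 and t=1 the residual is fully gated out.
f0, f1 = np.zeros((3, 4, 4)), np.ones((3, 4, 4))
r = np.random.randn(3, 4, 4)
assert np.allclose(bridge_blend(f0, f1, r, 0.0), f0)
assert np.allclose(bridge_blend(f0, f1, r, 1.0), f1)
```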
Authors: Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy
Abstract: Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient: by reducing surrogate variance and controlling gradient steps, it can outperform MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.
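A compressed sketch of the core idea, assuming per-sample ELBO estimates are used as a surrogate for the intractable log-likelihood inside a GRPO-style group-relative objective; the variance-reduction and gradient-control techniques the abstract refers to are omitted.

```python
import torch

def vgrpo_style_loss(elbo_logp, rewards):
    """Group-relative policy-gradient loss with an ELBO likelihood surrogate.

    elbo_logp: (G,) per-sample ELBO estimates standing in for the intractable
               exact log-likelihood of each generated image (requires grad).
    rewards:   (G,) scalar rewards for the same group of samples.
    Advantages are normalized within the group (GRPO-style baseline) and
    weight the surrogate log-likelihood.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return -(adv.detach() * elbo_logp).mean()
```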
Authors: Jingjing Jiang
Abstract: This work presents GS-DOT, a novel image reconstruction framework based on Gaussian Splatting (GS) for diffuse optical tomography (DOT). Inspired by GS for rendering applications, absorption coefficients are represented as a sparse sum of anisotropic Gaussian primitives optimized to fit measured time-resolved point-spread functions through analytic gradients and Adam optimization. This is the first adaptation of GS algorithms in the photon diffusion regime, where the ray transport function is replaced by diffusion functions to enable accurate modeling of light transport in highly scattering media. Validation on synthetic tissue models demonstrates high accuracy in localization and quantification of reconstructed absorption maps for both clean and noisy signals. GS-DOT also demonstrates high robustness to noise and a substantial reduction in memory demand.
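A minimal sketch of the representation and fitting loop described above, with simplifying assumptions: isotropic rather than anisotropic Gaussians, and a caller-supplied `forward_model` standing in for the differentiable photon-diffusion forward operator.

```python
import torch

def gaussian_absorption_map(centers, log_scales, weights, grid):
    """Absorption map as a sparse sum of Gaussian primitives.

    centers: (K, 3), log_scales: (K,), weights: (K,); grid: (N, 3) query points.
    Isotropic Gaussians for brevity; the paper uses anisotropic primitives.
    """
    d2 = ((grid[None, :, :] - centers[:, None, :]) ** 2).sum(-1)   # (K, N)
    sigma2 = torch.exp(log_scales)[:, None] ** 2
    return (weights[:, None] * torch.exp(-0.5 * d2 / sigma2)).sum(0)

def fit(forward_model, measured, centers, log_scales, weights, grid, steps=500):
    """Fit the primitives to measured time-resolved signals with Adam.
    centers/log_scales/weights are assumed to be leaf tensors created with
    requires_grad=True; forward_model is an assumed differentiable simulator."""
    opt = torch.optim.Adam([centers, log_scales, weights], lr=1e-2)
    for _ in range(steps):
        mu_a = gaussian_absorption_map(centers, log_scales, weights, grid)
        loss = torch.mean((forward_model(mu_a) - measured) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return centers, log_scales, weights
```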
Authors: Xuangeng Chu, Yu Han, Wei Mao, Shih-En Wei
Abstract: Audio-driven facial animation is essential for immersive digital interaction, yet existing frameworks fail to reconcile real-time streaming with high-fidelity personalization. Current methods often rely on latency-inducing audio look-ahead, or require high user compliance to pre-encode static embeddings that fail to capture dynamic idiosyncrasies. We present an end-to-end causal framework for personalizing causal facial motion generation via dynamic multi-modal style retrieval, enabling ultra-low latency while uniquely leveraging unstructured style references. We introduce two key innovations: (1) a temporal hierarchical motion representation that captures global temporal context and high-frequency details while maintaining decoding causality, and (2) a multi-modal style retriever that jointly queries audio and motion to dynamically extract stylistic priors without breaking causality. This mechanism allows for scalable personalization with total flexibility regarding the number and contents of templates. By integrating these components into a causal autoregressive architecture, our method significantly outperforms state-of-the-art approaches in lip-sync accuracy, identity consistency, and perceived realism, supported by extensive quantitative evaluations and user studies.
Authors: Wentao Zhang, Qi Zhang, Mingkun Xu, Mu You, Henghua Shen, Zhongzhi He, Keyan Jin, Derek F. Wong, Tao Fang
Abstract: Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri-CPJ (Caption-Prompt-Judge), a training-free few-shot framework in which a large vision-language model first generates a structured morphological caption, iteratively refined through multi-dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain-specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT-5-Nano with GPT-5-mini-generated captions yields \textbf{+22.7} pp in disease classification and \textbf{+19.5} points in QA score over no-caption baselines. Evaluated without modification on AgMMU-MCQs, GPT-5-Nano reached 77.84\% and Qwen-VL-Chat reached 64.54\%, placing them at or above most open-source models of comparable scale despite the format shift from open-ended to multiple-choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available at https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis
URLs: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis
Authors: Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee
Abstract: Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present \textbf{ELSA}, an algorithmic reformulation of online softmax attention that (i)~preserves exact softmax semantics in real arithmetic with a \emph{provable} $\mathcal{O}(u\log n)$ FP32 relative error bound; (ii)~casts the online softmax update as a prefix scan over an associative monoid $(m,S,W)$, yielding $O(n)$ extra memory and $O(\log n)$ parallel depth; and (iii)~is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a \emph{drop-in replacement} requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge devices such as Jetson TX2 -- making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to $O(\log n)$ at full precision. On A100 FP32 benchmarks (1K--16K tokens), ELSA delivers $1.3$--$3.5\times$ speedup over memory-efficient SDPA and $1.97$--$2.27\times$ on BERT; on Jetson TX2, ELSA achieves $1.5$--$1.6\times$ over Math (64--900 tokens), with $17.8$--$20.2\%$ throughput gains under LLaMA-13B offloading at $\ge$32K. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.
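The prefix-scan view above is concrete enough to sketch. Below is an illustrative (sequential) fold over the associative $(m, S, W)$ state; because the combine operator is associative, the same fold can be evaluated as a parallel prefix scan with $O(\log n)$ depth, which is the property the Triton/CUDA kernels exploit. This is a reference sketch of exact online softmax, not the ELSA kernel itself.

```python
import numpy as np

def combine(a, b):
    """Associative combine for the online-softmax monoid (m, S, W).

    m: running max of logits; S: sum of exp(logit - m);
    W: value accumulator weighted by exp(logit - m).
    """
    m_a, S_a, W_a = a
    m_b, S_b, W_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = np.exp(m_a - m), np.exp(m_b - m)
    return (m,
            S_a * scale_a + S_b * scale_b,
            W_a * scale_a + W_b * scale_b)

def attention_row(q, K, V):
    """Exact softmax attention for one query via a sequential fold of
    `combine`; associativity means the fold can be replaced by a scan."""
    state = (-np.inf, 0.0, np.zeros(V.shape[1]))
    for k, v in zip(K, V):
        logit = float(q @ k)
        state = combine(state, (logit, 1.0, v))
    m, S, W = state
    return W / S  # equals softmax(q K^T) @ V for this row
```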
Authors: Aydin Ayanzadeh, Tim Oates
Abstract: Indoor navigation remains a critical accessibility challenge for the blind and low-vision (BLV) individuals, as existing solutions rely on costly per-building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single-call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP-3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single-call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.
Authors: Yuanhao Gong, Tan Tang, Qianyan Liu
Abstract: The Laplacian operator transforms an image into its Laplacian field, which is usually sparse and follows a stable distribution. Conversely, an image can be uniquely reconstructed from its Laplacian field by solving a Poisson equation with a proper boundary condition; this uniqueness is mathematically guaranteed. Thanks to these properties, we propose to use the sparse Laplacian field to represent the image. We first show that the Laplacian field is sparse and follows a stable distribution on hundreds of images. Then, we show that the image can be accurately reconstructed from its Laplacian field. For the reconstruction task, we propose a shared-kernel wavelet neural network, which solves the Poisson equation and has three advantages. First, it has fewer than {\bf 0.0002M} parameters, which is compact enough for most devices. Second, it has linear computational complexity, enabling real-time reconstruction. Third, it achieves higher accuracy than previous methods. Several numerical experiments demonstrate the effectiveness and efficiency of the sparse Laplacian field and the proposed Poisson solver. The proposed method can be applied to a wide range of applications such as image compression, low-light enhancement, object tracking, etc.
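The round trip described above (image to Laplacian field and back) can be illustrated with a classical FFT-based Poisson solve. This sketch assumes periodic boundaries purely for simplicity; the paper's contribution is the compact shared-kernel wavelet network that replaces such a solver.

```python
import numpy as np

def laplacian_field(img):
    """5-point discrete Laplacian of a grayscale image (periodic boundary)."""
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)

def poisson_reconstruct(lap, mean_value):
    """Recover an image from its Laplacian field by dividing by the exact
    eigenvalues of the periodic 5-point Laplacian in the Fourier domain."""
    H, W = lap.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    denom = 2.0 * np.cos(2 * np.pi * fy) + 2.0 * np.cos(2 * np.pi * fx) - 4.0
    denom[0, 0] = 1.0                       # DC mode is undetermined; fix below
    rec = np.real(np.fft.ifft2(np.fft.fft2(lap) / denom))
    return rec - rec.mean() + mean_value    # restore the lost mean value
```

Under these boundary assumptions, `poisson_reconstruct(laplacian_field(img), img.mean())` recovers the image up to floating-point precision, which illustrates the uniqueness property the abstract relies on.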
Authors: Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed
Abstract: The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead, significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique that eliminates the tail latency of state-of-the-art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate the communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed Flash-Overlap that replaces conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and higher throughput.
Authors: Russell Tsuchida, Frank Nielsen
Abstract: Bregman divergences play a pivotal role in statistics, machine learning and computational information geometry. Particularly in the context of machine learning, they are central to clustering, exponential families, parameter estimation and optimisation, among other things. Despite this, the full toolkit of Hilbert spaces, and in particular reproducing kernel Hilbert spaces, has not been systematically developed and applied to functional Bregman divergences, where points are functions rather than finite-dimensional parameter vectors. While other types of functional Bregman divergences have been studied, these are typically set in a Banach space rather than in the Hilbert-space setting more directly aligned with the kernel methods commonly used in machine learning. We consider functional Bregman divergences on a Hilbert space, where the self-dual pairing and Riesz representer afford us a particularly convenient calculus. Further specialising Bregman generators as a composition involving a kernel mean embedding makes such divergences easy to estimate. We discuss applications in clustering, universal estimation, robust estimation and generative modelling, and contrast our approach with other types of Bregman divergences.
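For readers less familiar with the object being generalised, the standard finite-dimensional Bregman divergence generated by a differentiable convex function $F$ is

\[
D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle ,
\]

and the functional setting of this paper replaces the gradient-based inner product with the self-dual pairing of a Hilbert space, where the Riesz representer of the Fréchet derivative plays the role of $\nabla F(q)$.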
Authors: Nikolaos Salaris, Adrien Desjardins, Manish K. Tiwari
Abstract: The escalating climate crisis and ecosystem degradation demand intelligent, low-cost sensors capable of robust, long-term monitoring in real-world environments. Absolute dissolved oxygen (DO) concentration is a key parameter for predicting climate tipping points. Inexpensive optoelectronic sensors based on microstructured polymer films doped with phosphorescent dyes could be readily deployable; however, signal drift and marine biofouling remain major challenges. Here, we introduce a sensing paradigm that combines camera-based DO sensors with a visual transformer (ViT)-based physics-informed neural network (PINN) for high-fidelity sensing under biofouling conditions. Training and testing data were obtained from an algae-laden water tank over 14 days to capture accelerated biofouling. The ViT-PINN, which embeds the Stern-Volmer (SV) equation into the loss function, reduces mean average error (MAE) by 92% and 89% compared to classical statistical and ML approaches, achieving ~2 umol/L absolute error. A deep ensemble further quantifies predictive uncertainty, enabling self-diagnostic sensing.
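A minimal sketch of how the Stern-Volmer relation can enter a physics-informed loss. The relation I0/I = 1 + K_SV[DO] is standard; the calibration constants, loss weighting, and function names below are assumptions for illustration, not the paper's values.

```python
import torch

def stern_volmer_residual(intensity, do_pred, I0, K_sv):
    """Physics residual from the Stern-Volmer relation I0 / I = 1 + K_SV * [DO].

    intensity: measured phosphorescence intensity per sample
    do_pred:   dissolved-oxygen concentration predicted by the network head
    I0, K_sv:  calibration constants (hypothetical values)
    """
    return (I0 / intensity) - (1.0 + K_sv * do_pred)

def pinn_loss(do_pred, do_true, intensity, I0=1.0, K_sv=0.02, lam=0.1):
    """Data term plus a Stern-Volmer consistency term, sketching how the
    SV equation can be embedded into the training objective."""
    data_term = torch.mean(torch.abs(do_pred - do_true))
    physics_term = torch.mean(
        stern_volmer_residual(intensity, do_pred, I0, K_sv) ** 2)
    return data_term + lam * physics_term
```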
Authors: Yangping Li, Thomas Pinetz, Michael H\"olzel, Marieta Toma, Alexander Effland
Abstract: In pathology, the spatial distribution and proportions of tissue types are key indicators of disease progression, and are more readily available than fine-grained annotations. However, these assessments are rarely mapped to pixel-wise segmentation. The task is fundamentally underdetermined, as many spatially distinct segmentations can satisfy the same global proportions in the absence of pixel-wise constraints. To address this, we introduce Variational Segmentation from Label Proportions (VSLP), a two-stage framework that infers dense segmentations from global label proportions, without any pixel-level annotations. This framework first leverages a pre-trained transformer model with test-time augmentation to produce a pixel-wise confidence estimate. In the second stage, these estimates are fused by solving a variational optimization problem that incorporates a Wasserstein data fidelity term alongside a learned regularizer. Unlike end-to-end networks, our variational method can visualize the fidelity-regularization energy, resulting in more interpretable segmentation. We validate our approach on two public datasets, achieving superior performance over existing weakly supervised and unsupervised methods. For one of these datasets, proportions have been estimated by an experienced pathologist to provide a realistic benchmark to the community. Furthermore, the method scales to an in-house dataset with noisy pathologist labels, substantially outperforming state-of-the-art methods and thereby demonstrating practical applicability. The code and data will be made publicly available upon acceptance at https://github.com/xiaoliangpi/VSLP.
Authors: Zhongjie Duan, Hong Zhang, Yingda Chen
Abstract: Controllable diffusion methods have substantially expanded the practical utility of diffusion models, but they are typically developed as isolated, backbone-specific systems with incompatible training pipelines, parameter formats, and runtime hooks. This fragmentation makes it difficult to reuse infrastructure across tasks, transfer capabilities across backbones, or compose multiple controls within a single generation pipeline. We present Diffusion Templates, a unified and open plugin framework that decouples base-model inference from controllable capability injection. The framework is organized around three components: Template models that map arbitrary task-specific inputs to an intermediate capability representation, a Template cache that functions as a standardized interface for capability injection, and a Template pipeline that loads, merges, and injects one or more Template caches into the base diffusion runtime. Because the interface is defined at the systems level rather than tied to a specific control architecture, heterogeneous capability carriers such as KV-Cache and LoRA can be supported under the same abstraction. Based on this design, we build a diverse model zoo spanning structural control, brightness adjustment, color adjustment, image editing, super-resolution, sharpness enhancement, aesthetic alignment, content reference, local inpainting, and age control. These case studies show that Diffusion Templates can unify a broad range of controllable generation tasks while preserving modularity, composability, and practical extensibility across rapidly evolving diffusion backbones. All resources will be open sourced, including code, models, and datasets.
Authors: Mufhumudzi Muthivhi, Terence L. van Zyl
Abstract: There has been growing interest in studying the complexity of Rectified Linear Unit (ReLU) based activation networks. Recent work investigates the evolution of the number of piecewise-linear partitions (linear regions) that are formed during training. However, current research is limited to examining the complexity of models trained in a supervised way. Self-Supervised Learning (SSL) differs in that it directly optimises the representation space using a loss function to enhance the model's performance across multiple downstream tasks. This study investigates the local distribution of linear regions produced by SSL models. We demonstrate that the evolution of linear regions correlates with the representation quality by utilising SplineCam to extract two-dimensional polytopes near the data distribution. We track the number, area, eccentricity, and boundaries of regions throughout training. The study compares supervised, contrastive, and self-distillation methods over two standard benchmark datasets, MNIST and FashionMNIST. The analysis of the experimental results shows that self-supervised methods create substantially fewer regions to achieve comparable accuracy to supervised models. Contrastive methods rapidly expand regions over time, whereas self-distillation methods tend to consolidate by merging neighbouring regions. Lastly, we can detect representation collapse early within the geometric space of linear regions. Our analysis suggests that polytopal metrics can serve as reliable indicators of representation quality and model performance.
Authors: Hiromitsu Goto, Tao Tao, Zheng-Lin Chia
Abstract: Quantitative analysis of the kinematic chain in sports motion is essential for performance evaluation and injury prevention. Conventional methods such as the kinematic-sequence (KS) and continuous relative phase (CRP) are confined to adjacent joint pairs and lack a unified framework for whole-body coordination, while segmental power-flow analysis requires force plates and inertial parameters that restrict it to laboratory environments. We apply Complex Hilbert Principal Component Analysis (CHPCA) separately to each motion phase (backswing and downswing) on markerless 3D pose estimation data, extracting the dominant whole-body phase pattern as a single complex eigenvector. The pipeline further includes a fully automatic signal-based phase segmentation (no priors on strike count or rest location) and an extension to 1,079 body-surface mesh vertices, so that the kinematic chain is represented as a continuous phase field across the body. On 14 hammer-striking trials of a single subject, the framework reveals (i) a trunk-anchored global phase architecture, (ii) a functional asymmetry between preparation and execution phases quantified by Mode-1 contribution (45.5% vs. 70.5%) and inter-trial Spearman consistency (0.38 vs. 0.58), and (iii) a consistent reorganisation across both skeletal joints and mesh vertices ($p < 10^{-10}$ on 1,079 vertices). As a methodological consistency check, pairwise phase differences from the Mode-1 eigenvector are compared against CRP on all 190 joint pairs by a permutation test ($\rho = 0.473$, $p = 0.0005$). A correspondence analysis between Mode-1 amplitude and kinetic-energy mobilisation variance further shows a strong positive correlation in the downswing ($\rho \approx 0.71$ on both skeleton and mesh) and no correlation in the backswing, indicating that the proposed framework bridges kinematic and kinetic descriptions of coordination through phase structure.
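CHPCA itself is compact enough to sketch: form the analytic signal of each channel with a Hilbert transform, then take the dominant eigenvector of the complex covariance matrix. The sketch below illustrates this on a generic (time, channels) matrix; the paper's automatic phase segmentation and mesh-vertex extension are not shown.

```python
import numpy as np
from scipy.signal import hilbert

def chpca_mode1(X):
    """Complex Hilbert PCA on a (time, channels) motion matrix.

    Returns the dominant complex eigenvector (whole-body phase pattern),
    its per-channel phase, and the fraction of variance it explains
    (the Mode-1 contribution reported in the abstract).
    """
    Z = hilbert(X - X.mean(axis=0), axis=0)      # analytic signal per channel
    C = Z.conj().T @ Z / Z.shape[0]              # complex covariance (Hermitian)
    evals, evecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    mode1 = evecs[:, -1]
    contribution = evals[-1] / evals.sum()
    phase = np.angle(mode1)                      # per-channel phase of Mode 1
    return mode1, phase, contribution
```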
Authors: Esteban Rodr\'iguez-Betancourt, Edgar Casasola-Murillo
Abstract: Content-based image retrieval (CBIR) systems enable users to search images based on visual content instead of relying on metadata. The text domain has benefited from vector search of representations created with unsupervised methods such as BERT. However, modern self-supervised learning methods for vision are mostly not reported in CBIR-related literature, instead relying on supervised models or multi-modal methods that align text and vision. We evaluate how the representations learned by modern self-supervised learning methods for vision perform under typical retrieval stacks that leverage vector databases and nearest neighbor search. Our evaluation reveals that the latent space geometry impacts approximate nearest neighbor (ANN) indexing. Specifically, highly anisotropic representations with high skewness produced by several modern SSL methods degrade the performance of partition-based and hashing-based search, even if their own linear probe or K-NN accuracy is not affected. In contrast, representations with higher isotropy and local purity better satisfy the distance-based assumptions of ANN indexes, leading to improved semantic retrieval performance.
Authors: Tatsuhito Hasegawa, Kazuma Kondo
Abstract: Sensor-based human activity recognition (HAR) is a paramount technology in Internet of Things services. HAR using representation learning, which automatically learns a feature representation from raw data, is the mainstream approach because it is difficult to design meaningful features by interpreting relevant information in raw sensor data. Ensemble learning is a robust approach to improve generalization performance; however, deep ensemble learning requires various procedures, such as data partitioning and training multiple models, which are time-consuming and computationally expensive. In this study, we propose Easy Ensemble (EE) for HAR, which enables the easy implementation of deep ensemble learning in a single model. In addition, we propose various techniques (input variationer, stepwise ensemble, and channel shuffle) for EE. Experiments on a benchmark dataset for HAR demonstrate the effectiveness of EE and the proposed techniques, and characterize them in comparison with conventional ensemble learning methods.
Authors: Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, Yang Gao
Abstract: Imagining the future trajectory is key for robots to make sound plans and successfully reach their goals. Therefore, text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named \textbf{Seer}, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We enhance the U-Net and language conditioning model by incorporating computation-efficient spatial-temporal attention. Furthermore, we introduce a novel Frame Sequential Text Decomposer module that dissects a sentence's global instruction into temporally aligned sub-instructions, ensuring precise integration into each frame of generation. Our framework allows us to effectively leverage the extensive prior knowledge embedded in pretrained T2I models across the frames. With its adaptable architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a few layers on a small amount of data. Experimental results on the Something-Something V2 (SSv2), Bridgedata and EpicKitchens-100 datasets demonstrate superior video prediction performance at around 480 GPU-hours versus CogVideo's more than 12,480 GPU-hours: a 31% FVD improvement over the current SOTA model on SSv2 and an 83.7% average preference in the human evaluation.
Authors: Darshan Singh S, Zeeshan Khan, Makarand Tapaswi
Abstract: Adapting CLIP for videos has gained popularity due to its semantic and rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g. HowTo100M, WebVid2.5M). However, such narrations or captions often lack comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, the visual learning is inefficient and adaptation requires millions of samples to post-pretrain. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based captions from SRLs and demonstrate that simple contrastive finetuning on a mere 23k video-caption pairs is adequate to learn powerful, transferable representations applicable across a diverse range of video understanding tasks that require varying levels of perceptual granularity. Our adapted CLIP model, SRL-CLIP, exhibits comparable or superior performance on zero-shot text-to-video retrieval compared to state-of-the-art models that possess 4-8x more parameters and are post-pretrained on up to 6000x more data. SRL-CLIP surpasses CLIP on multiple video benchmarks, underscoring the efficient learning and improved representations.
Authors: Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee
Abstract: Text-to-image (T2I) models have achieved remarkable progress in high-quality image synthesis, yet most benchmarks rely on simple, self-contained prompts, failing to capture the complexity of real-world captions. Human-written captions often involve multiple interacting subjects, rich contextual references, and abstractive phrasing, conditions under which current image-text encoders like CLIP struggle. To systematically study these deficiencies, we introduce ANCHOR, a large-scale dataset of 70K+ abstractive captions sourced from five major news media organizations. Analysis with ANCHOR reveals persistent failures in multi-subject understanding, context reasoning, and nuanced grounding. Motivated by these challenges, we propose Subject-Aware Fine-tuning (SAFE), which uses Large Language Models (LLMs) to extract key subjects and enhance their representation at the embedding-level. Experiments with contemporary models show that SAFE significantly improves image-caption consistency and human preference alignment, serving as a practical and scalable solution.
Authors: Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang
Abstract: Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts where the margin for error is minimal. However, currently, there are no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work have been released at https://github.com/ydk122024/Med-HallMark.
Authors: R\'ois\'in Luo, Alexandru Drimbarean, James McDermott, Colm O'Riordan
Abstract: This paper explores a novel paradigm in low-bit (i.e. 4-bit or lower) quantization, differing from existing state-of-the-art methods, by framing optimal quantization as an architecture search problem within convolutional neural networks (ConvNets). Our framework, dubbed \textbf{CoRa} (Optimal Quantization Residual \textbf{Co}nvolutional Operator Low-\textbf{Ra}nk Adaptation), is motivated by two key observations. Firstly, quantization residual knowledge, i.e. the information lost between floating-point weights and quantized weights, has long been neglected by the research community. Reclaiming this critical residual knowledge, at an infinitesimal extra parameter cost, can reverse performance degradation without training. Secondly, state-of-the-art quantization frameworks search for optimal quantized weights to address the performance degradation. Yet, the vast search spaces in weight optimization pose a challenge for efficient optimization in large models; for example, the state-of-the-art BRECQ requires $2 \times 10^4$ iterations to quantize models. Fundamentally differing from existing methods, \textbf{CoRa} searches for the optimal architectures of low-rank adapters, reclaiming critical quantization residual knowledge, within search spaces that are smaller than the weight spaces by many orders of magnitude. The low-rank adapters approximate the quantization residual weights discarded by previous methods. We evaluate our approach over multiple pre-trained ConvNets on ImageNet. \textbf{CoRa} achieves performance comparable to both state-of-the-art quantization-aware training and post-training quantization baselines, in $4$-bit and $3$-bit quantization, using fewer than $250$ iterations on a small calibration set of $1600$ images. Thus, \textbf{CoRa} establishes a new state of the art in optimization efficiency for low-bit quantization.
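The core idea, a quantization residual reclaimed by a low-rank adapter, can be sketched on a dense weight matrix. CoRa operates on convolutional operators and searches over adapter architectures; the uniform quantizer, fixed rank, and SVD factorization below are simplifying assumptions for illustration.

```python
import numpy as np

def quantize_dequantize(W, n_bits=4):
    """Uniform symmetric quantization, standing in for the paper's quantizer."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.round(W / scale).clip(-qmax, qmax) * scale

def residual_lowrank_adapter(W, n_bits=4, rank=8):
    """Approximate the quantization residual (W - W_q) with a rank-r adapter.
    Returns W_q plus adapter factors A @ B that reclaim residual knowledge."""
    W_q = quantize_dequantize(W, n_bits)
    R = W - W_q                                   # residual discarded by quantization
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = U[:, :rank] * s[:rank]                    # (out, r)
    B = Vt[:rank]                                 # (r, in)
    return W_q, A, B                              # effective weight: W_q + A @ B
```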
Authors: Xudong Xie, Hao Yan, Liang Yin, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai
Abstract: Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of English and Chinese academic papers. Multiple strategies are proposed to build 1.1 million high-quality QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal document understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.
Authors: Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
Abstract: Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We further conduct a systematic investigation that enhances NVILA's efficiency throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x. We release our code and models to facilitate reproducibility.
Authors: Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
Abstract: Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels experience downsampling to accommodate increasing moment length, their capacity to capture information is reduced and consequently leads to degraded information in moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding. Code is available at https://github.com/nguyentthong/MSCL
Authors: Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
Abstract: To equip artificial intelligence with a comprehensive understanding of the temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations. Existing methods encode entity masks tracked across the temporal dimension (mask tubes), then predict their relations with a temporal pooling operation, which does not fully utilize the motion cues indicative of the entities' relations. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion patterns for temporal scene graph generation. Firstly, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Secondly, we seek to push apart mask tubes from their temporally shuffled versions. Moreover, we also learn distant representations for mask tubes belonging to the same video but different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets. Code is available at: https://github.com/nguyentthong/motion-contrastive-sgg
URLs: https://github.com/nguyentthong/motion-contrastive-sgg
Authors: Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang
Abstract: Collaborative perception significantly enhances autonomous driving safety by extending each vehicle's perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect malicious agents based on single-shot outliers, they overlook temporal message correlations, which can be circumvented by subtle yet harmful perturbations in model input and output spaces. This paper reveals a novel blind area confusion (BAC) attack that compromises existing single-shot outlier-based detection methods. As a countermeasure, we propose GCP, a Guarded Collaborative Perception framework based on spatial-temporal aware malicious agent detection, which maintains single-shot spatial consistency through a confidence-scaled spatial concordance loss, while simultaneously examining temporal anomalies by reconstructing historical bird's eye view motion flows in low-confidence regions. We also employ a joint spatial-temporal Benjamini-Hochberg test to synthesize dual-domain anomaly results for reliable malicious agent detection. Extensive experiments demonstrate GCP's superior performance under diverse attack scenarios, achieving up to 34.69% improvements in AP@0.5 compared to the state-of-the-art CP defense strategies under BAC attacks, while maintaining consistent 5-8% improvements under other typical attacks. Code will be released at https://github.com/yihangtao/GCP.git.
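The joint spatial-temporal test builds on the standard Benjamini-Hochberg procedure, which can be sketched as follows. This is the generic BH step over per-agent p-values; how the spatial and temporal p-values themselves are constructed follows the paper and is not reproduced here.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure: returns a boolean mask of rejected
    hypotheses (here, agents flagged as anomalous) at FDR level alpha."""
    p = np.asarray(p_values)
    n = p.size
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, n + 1) / n)
    below = p[order] <= thresholds
    rejected = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()        # largest i with p_(i) <= i*alpha/n
        rejected[order[:k + 1]] = True
    return rejected
```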
Authors: Priyanto Hidayatullah, Nurjannah Syakrani, Muhammad Rizqi Sholahuddin, Trisna Gelar, Refdinal Tubagus
Abstract: In the field of deep learning-based computer vision, YOLO is revolutionary. Among deep learning models, YOLO is also the one evolving most rapidly. Unfortunately, not every YOLO model is accompanied by a scholarly publication. Moreover, one YOLO model lacks a publicly accessible official architectural diagram. Naturally, this creates challenges, such as making it harder to understand how the model operates in practice. Furthermore, the review articles presently available do not investigate the specifics of each model. The objective of this study is to present a comprehensive and in-depth architecture comparison of the four most recent YOLO models, specifically YOLOv8 through YOLO11, thereby enabling readers to quickly grasp not only how each model functions, but also the distinctions between them. To analyze each YOLO version's architecture, we meticulously examined the relevant academic papers and documentation, and scrutinized the source code. The analysis reveals that while each version of YOLO brings improvements in architecture and feature extraction, certain blocks remain unchanged. The lack of scholarly publications and official diagrams makes the models harder to understand and to enhance in the future. Future developers are encouraged to provide these resources.
Authors: Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, Yifei Wang
Abstract: The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human perception as evidenced by modality-specific phenomena like visual texture and linguistic tone. These observations motivate us to computationally measure and leverage modality gaps to improve downstream tasks. We first introduce the Modality Dominance Score (MDS), which attributes multimodal features to specific modalities by categorizing them into three classes: vision-dominant features, language-dominant features, and cross-modal features. We then propose automatic interpretability metrics to evaluate these modality-specific features in a scalable manner. Finally, we demonstrate that the training-free model editing enhances multiple downstream tasks, including mitigating bias in gender classification, generating cross-modal adversarial examples, and enabling modality-specific control in text-to-image generation. Combined with task-agnostic interpretability tools, our work offers insights for systematic analysis and lightweight editing of multimodal models.
Authors: Yuan Zhou, Shilong Jin, Litao Hua, Wanjun Lv, Haoran Duan, Jungong Han
Abstract: Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent prior view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.
Authors: Yabiao Wang, Shuo Wang, Jiangning Zhang, Jiafu Wu, Qingdong He, Yong Liu
Abstract: This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions conditioned on the action sequence of another person. Currently, autoregressive modeling approaches with vector quantization (VQ) have achieved remarkable performance in motion generation tasks. However, VQ has inherent disadvantages, including quantization information loss, low codebook utilization, etc. In addition, while dividing the body into separate units can be beneficial, the computational complexity needs to be considered. Also, the importance of mutual perception among units is often neglected. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions using continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding each independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Mutual Unit Modulation (MUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Both quantitative and qualitative results demonstrate that our method achieves superior performance. Project page: https://aigc-explorer.github.io/MARRS/.
Authors: Zhiying Li, Guanggang Geng, Yeying Jin, Shuyuan Lin, Fengyuan Ma, Zhaoxin Fan, Lili Wang
Abstract: Expressive human pose and shape estimation (EHPS) plays a central role in digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention on potential security vulnerabilities, such as generating inappropriate content, violent actions, or racially offensive gestures and expressions. Current adversarial attacks on EHPS models often generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address this limitation, we propose an unnoticeable adversarial method, termed \textbf{LatentStealth}, specifically tailored for EHPS models. The key idea is to exploit the structured latent representations of natural images as the medium for crafting perturbations. Instead of injecting noise directly into the pixel space, our method projects inputs into the latent space, where adversarial patterns are generated and progressively refined along optimized directions. This latent-space manipulation enables the attack to maintain high imperceptibility while preserving its effectiveness. Furthermore, as the optimization process is guided by only a small number of model output queries, the framework achieves competitive attack performance with low computational overhead, making it both practical and efficient for real-world scenarios. Extensive experiments on the 3DPW and UBody datasets demonstrate the superiority of LatentStealth, revealing critical vulnerabilities in current systems. These findings highlight the urgent need to address and mitigate security risks in digital human generation technologies.
Authors: Di Wu, Yixin Wan, Kai-Wei Chang
Abstract: Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation, then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a simple yet effective perspective for advancing text-to-image retrieval. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
URLs: https://github.com/xiaowu0162/Visualize-then-Retrieve.
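The retrieval paradigm reduces to a short pipeline: project the text query into the image modality, then do image-to-image nearest-neighbour search. A minimal sketch, with the T2I generator and image encoder passed in as caller-supplied callables (assumptions for illustration, not a specific library API):

```python
import numpy as np

def visualize_then_retrieve(query_text, t2i_generate, image_encoder,
                            corpus_embs, k=10):
    """Sketch of the visualize-then-retrieve idea.

    t2i_generate, image_encoder: caller-supplied callables (assumptions);
    corpus_embs: (N, d) L2-normalized image embeddings of the corpus.
    """
    proxy_image = t2i_generate(query_text)         # text -> synthetic image
    q = image_encoder(proxy_image)                 # embed in the image modality
    q = q / np.linalg.norm(q)
    scores = corpus_embs @ q                       # cosine similarity
    return np.argsort(-scores)[:k]                 # indices of top-k images
```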
Authors: Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng
Abstract: Understanding social interaction, which encompasses perceiving numerous and subtle multimodal cues, inferring unobservable mental states and relations, and dynamically predicting others' behavior, is the foundation for achieving human-machine interaction. Despite rapid advances in Multimodal Large Language Models (MLLMs), the rich and multifaceted nature of social interaction has hindered the development of benchmarks that holistically evaluate and guide their social interaction abilities. Based on social relation theory, which has been widely regarded as a foundational framework for understanding social behavior, we provide SIV-Bench, a novel video benchmark for systematically evaluating MLLMs' capabilities across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 originally collected video clips and 5,455 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It covers 14 typical relationships, diverse video lengths, genres, presentation styles, and linguistic and cultural backgrounds. Our comprehensive experiments show that leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with the systematic confusion in relation inference as a key bottleneck. An in-depth analysis of the reasoning process attributes MLLMs' suboptimal performance to misalignment with human thoughts and insufficient reasoning depth. Moreover, we find audio and subtitles aid in reasoning-intensive SSR and SDP. Together, SIV-Bench offers a unified testbed to measure progress, expose limitations, and guide future research toward more socially intelligent MLLMs. We release the dataset and code at our project website: https://kfq20.github.io/sivbench.
Authors: Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Abstract: Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and together with a novel Semantic-Aware Attention Regularization (SAR) training objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses robust models learned by ShowFlow-S to support multi-concept generation without extra conditions, incorporating a Subject-Adaptive Matching Attention (SAMA) and a Layout Consistency guidance as the plug-and-play module. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing. Our source code will be publicly available at: https://htrvu.github.io/showflow.
Authors: Xiaofan Li, Zhihao Xu, Chenming Wu, Zhao Yang, Yumeng Zhang, Jiang-Jiang Liu, Haibao Yu, Fan Duan, Xiaoqing Ye, Yuan Wang, Shirui Li, Xun Sun, Ji Wan, Jun Wang
Abstract: Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird's-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios.
Authors: Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano
Abstract: We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
Authors: Xi Xiao, Aristeidis Tsaris, Anika Tabassum, John Lagergren, Larry M. York, Tianyang Wang, Xiao Wang
Abstract: Hyperspectral imaging (HSI) captures hundreds of narrow, contiguous wavelength bands, making it a powerful tool in biology, agriculture, and environmental monitoring. However, interpreting Vision Transformers (ViTs) in this setting remains largely unexplored due to two key challenges: (1) existing saliency methods struggle to capture meaningful spectral cues, often collapsing attention onto the class token, and (2) full-spectrum ViTs are computationally prohibitive for interpretability, given the high-dimensional nature of HSI data. We present FOCUS, the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together, these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification. FOCUS improves band-level IoU by 15 percent, reduces attention collapse by over 40 percent, and produces saliency results that align closely with expert annotations. With less than 1 percent parameter overhead, our method makes high-resolution ViT interpretability practical for real-world hyperspectral applications, bridging a long-standing gap between black-box modeling and trustworthy HSI decision-making.
Authors: Xu Yuan, Liangbo Ning, Qingqing Ye, Wenqi Fan, Qing Li
Abstract: Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for expanding the knowledge capacity of Multimodal Large Language Models (MLLMs) by incorporating external knowledge sources into the generation process, and has been widely adopted for knowledge-based Visual Question Answering (VQA). Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relations among knowledge elements frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks, thereby enhancing generation through structured multimodal knowledge. To this end, this paper proposes mKG-RAG, a novel retrieval-augmented generation framework built upon multimodal KGs for knowledge-intensive VQA tasks. Specifically, mKG-RAG leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. Furthermore, a dual-stage retrieval strategy equipped with a query-aware multimodal retriever is introduced to improve retrieval efficiency while progressively refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing approaches and sets new state-of-the-art results for knowledge-based VQA. The code is available at https://github.com/xandery-geek/mKG-RAG.
Authors: Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen
Abstract: Slot Attention (SA) lies at the heart of mainstream Object-Centric Learning (OCL). Image features can be aggregated into object-level representations by SA iteratively refining cold-start query slots. For video, such aggregation proceeds by SA recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots thereafter. However, cold-start queries lack sample-specific cues, hindering precise aggregation on an image or a video's first frame, while non-first frames' queries are already sample-specific and thus require aggregation transforms different from those of the first frame. We address these issues with our SmoothSA: (1) To smooth SA iterations on an image or a video's first frame, we preheat cold-start queries with rich input-feature information via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across a video's first and non-first frames, we differentiate the otherwise homogeneous aggregation transforms by using full and single iterations, respectively. Comprehensive experiments on object discovery, recognition and visual reasoning validate our method's effectiveness. Further visual analyses illuminate the underlying mechanisms. Our source code, model checkpoints and training logs are provided at https://github.com/Genera1Z/SmoothSA.
Authors: Rishi Raj Sahoo, Surbhi Saswati Mohanty, Subhankar Mishra
Abstract: Potholes are a serious road hazard and maintenance burden, posing a significant threat to road safety and vehicle longevity, especially on the diverse and under-maintained roads of India. In this paper, we present a complete end-to-end system called iWatchRoad for automated pothole detection, Global Positioning System (GPS) tagging, and real-time mapping using OpenStreetMap (OSM). Leveraging dashcam footage, we curated a large, self-annotated dataset of over 7,000 frames captured across various road types, lighting conditions, and weather scenarios unique to Indian environments. This dataset is used to fine-tune an Ultralytics You Only Look Once (YOLO) model for real-time pothole detection, while a custom Optical Character Recognition (OCR) module extracts timestamps directly from video frames. The timestamps are synchronized with GPS logs to accurately geotag each detected pothole. The processed data, including pothole details and frames, is stored as metadata in a database and visualized via a user-friendly web interface built on OSM. iWatchRoad not only improves detection accuracy under challenging conditions but also provides government-compatible outputs for road assessment and maintenance planning through the metadata visible on the website. Our fully automated solution is cost-effective, hardware-efficient, and scalable, offering a practical tool for urban and rural road management in developing regions. iWatchRoad is available at https://smlab.niser.ac.in/project/iwatchroad
Authors: Serban Cristian Tudosie, Alexander Denker, Zeljko Kereta, Simon Arridge
Abstract: Single-Pixel Imaging (SPI) enables the reconstruction of objects using a single detector through sequential illuminations with structured light patterns. The choice of illumination patterns is critical, particularly in highly undersampled regimes, where it directly determines reconstruction quality and acquisition speed. Instead of relying on handcrafted or fixed patterns, we propose to learn task-specific patterns directly from data. Practical SPI hardware only supports binary patterns, making binary pattern design a necessary consideration. We propose a bilevel optimisation method for learning task-specific binary illumination patterns optimised for applications such as single-pixel fluorescence microscopy. We address the non-differentiable nature of binary optimisation using the Straight-Through Estimator. In addition, we incorporate learned variational regularisation, improving reconstruction quality and robustness. We demonstrate our method on the CytoImageNet microscopy dataset. We show that our learned patterns achieve superior reconstruction performance compared to baseline methods and end-to-end deep learning, particularly in highly undersampled regimes and in scarce-data settings.
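The Straight-Through Estimator mentioned above can be illustrated with a short PyTorch sketch: the forward pass emits hard binary illumination patterns, while the backward pass copies the gradient through unchanged. The measurement model below is a toy placeholder, not the paper's single-pixel forward operator.

```python
import torch

class BinarySTE(torch.autograd.Function):
    """Binarize continuous pattern logits in the forward pass while letting
    gradients pass through unchanged (straight-through estimator)."""

    @staticmethod
    def forward(ctx, logits):
        return (logits > 0).float()          # hard 0/1 illumination pattern

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                   # identity surrogate gradient

# Learnable logits for, e.g., 64 patterns over a 32x32 scene grid
logits = torch.randn(64, 32, 32, requires_grad=True)
patterns = BinarySTE.apply(logits)           # binary, yet differentiable
toy_scene = torch.rand(32 * 32)              # placeholder scene, not real data
measurements = (patterns.flatten(1) @ toy_scene).sum()
measurements.backward()                      # gradients reach the logits
print(patterns.unique(), logits.grad.shape)
```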
Authors: Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu
Abstract: Vision-Language Models (VLMs) often suffer from visual hallucinations, generating content that is inconsistent with the visual input, and from language shortcuts, where they skip the visual part and rely solely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. We introduce Vision SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning: the model is first prompted to produce a self-contained visual description sufficient to answer the question without referring back to the input image, before jointly optimizing both visual and language reasoning through our multi-reward loss objective. To validate this self-containment, the same VLM is re-prompted to perform language reasoning using only the generated visual reasoning as input, yielding the visual reward. The final reward is computed through a decoupled reward-advantage framework, where the visual reward and the language reasoning reward have their advantages calculated separately. Our experiments show that Vision SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks, while being more efficient than methods that rely on external visual reward models, which require additional GPUs to host. In contrast, Vision SR1 introduces no extra GPU overhead beyond that of standard training.
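As a rough illustration of the decoupled reward-advantage idea described above, the sketch below standardizes the visual reward and the language reasoning reward over a group of rollouts separately before combining them. The abstract does not specify the normalization, so the group-standardized (GRPO-style) form here is an assumption.

```python
import torch

def decoupled_advantages(visual_rewards, language_rewards, eps=1e-6):
    """Hypothetical decoupled reward-advantage: normalize each reward stream
    over the group of sampled rollouts separately, then combine."""
    def group_advantage(r):
        return (r - r.mean()) / (r.std() + eps)
    return group_advantage(visual_rewards) + group_advantage(language_rewards)

# e.g., 8 rollouts sampled for one question
vis = torch.tensor([1., 0., 1., 1., 0., 1., 0., 1.])   # visual reward
lang = torch.tensor([1., 1., 0., 1., 0., 0., 0., 1.])  # language reasoning reward
print(decoupled_advantages(vis, lang))
```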
Authors: Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, Heng Ji
Abstract: Video-LLMs face a fundamental tension in long-video reasoning: static, sparse frame sampling either dilutes evidence across task-irrelevant segments at significant cost or misses fine-grained temporal semantics altogether. We propose a novel, cognitively-inspired task -- Endomorphic Multimodal Compression (EMC) -- as a structurally-constrained sufficient-statistic problem for VideoQA, and formulate it as an endomorphic transformation F_EMC : (V, Q) -> (v, q) that compresses the multimodal input while preserving answer invariance across reasonable downstream models. The endomorphic form keeps the compressed output in the downstream pipeline's native task space -- a structural mirror of the filter-then-reason mechanism in the cognitive literature motivating EMC -- distinguishing it from latent-code compression (IB / VIB) and making the formulation extensible to other multimodal settings. Under the Markov chain A -> (V, Q) -> (v, q), EMC realizes the classical sufficiency condition I((v, q); A) = I((V, Q); A) in its VideoQA-natural form. As a modular front-end, EMC plugs into both Video Instruction Tuning and Video Question Answering pipelines. We release the first dedicated benchmark and propose ReSimplifyIt, an EMC baseline surpassing prior methods by 0.40 F-1 with competitive query rewriting. Integrating EMC yields relative gains of 7.33% in training and 33.7% in inference for video-language understanding.
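For readers skimming the EMC formulation above, the compression map and its sufficiency condition can be written out explicitly. The data-processing inequality step is the standard consequence of the stated Markov chain and is added here only to make the "equality case" reading explicit.

```latex
\[
A \;\longrightarrow\; (V, Q) \;\longrightarrow\; (v, q) = F_{\mathrm{EMC}}(V, Q)
\quad\Longrightarrow\quad
I\big((v, q); A\big) \;\le\; I\big((V, Q); A\big)
\;\;\text{(data processing)},
\]
\[
\text{so answer invariance corresponds to the equality case}\quad
I\big((v, q); A\big) \;=\; I\big((V, Q); A\big).
\]
```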
Authors: Cem Eteke, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach
Abstract: We introduce the BIR-Adapter, a parameter-efficient diffusion adapter for blind image restoration. Diffusion-based restoration methods have demonstrated promising performance in addressing this fundamental problem in computer vision, typically relying on auxiliary feature extractors or extensive fine-tuning of pre-trained models. Building on the observation that large-scale pretrained diffusion models can retain informative representations under image degradations, BIR-Adapter introduces a parameter-efficient, plug-and-play attention mechanism that substantially reduces the number of trained parameters. To further improve reliability, we adapt a sampling guidance mechanism that mitigates hallucinations during restoration. Experiments on synthetic and real-world degradations demonstrate that BIR-Adapter achieves competitive, and in several settings superior, performance compared to state-of-the-art methods while requiring up to 36x fewer trained parameters. Moreover, the adapter-based design enables integration into existing models. We validate this generality by extending a super-resolution-only diffusion model to handle additional unknown degradations, highlighting the adaptability of our approach for broader image restoration tasks.
Authors: Chen Liu, Haitao Wu, Kafeng Wang, Weiran Huang
Abstract: Personalized animal image generation is challenging due to rich appearance cues and large morphological variability. Existing approaches often exhibit feature misalignment across domains, which leads to identity drift. We present AnimalBooth, a framework that strengthens identity preservation with an Animal Net and an adaptive attention module, mitigating cross-domain alignment errors. We further introduce a frequency-controlled feature integration module that applies Discrete Cosine Transform filtering in the latent space to guide the diffusion process, enabling a coarse-to-fine progression from global structure to detailed texture. To advance research in this area, we curate AnimalBench, a high-resolution dataset for animal personalization. Extensive experiments show that AnimalBooth consistently outperforms strong baselines on multiple benchmarks and improves both identity fidelity and perceptual quality.
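A minimal sketch of Discrete Cosine Transform filtering on a latent map, of the kind the frequency-controlled integration module above relies on. The cutoff fraction, single-channel latent, and simple low/high split are illustrative assumptions rather than AnimalBooth's actual module.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_lowpass(latent, keep_frac=0.25):
    """Keep only the lowest-frequency DCT coefficients of a 2D latent map,
    a rough stand-in for coarse-to-fine frequency control."""
    coeffs = dctn(latent, norm="ortho")
    h, w = latent.shape
    mask = np.zeros_like(coeffs)
    mask[: int(h * keep_frac), : int(w * keep_frac)] = 1.0
    return idctn(coeffs * mask, norm="ortho")

latent = np.random.randn(64, 64)               # one channel of a diffusion latent
coarse = dct_lowpass(latent, keep_frac=0.25)   # global structure
fine = latent - coarse                          # residual detail
print(coarse.shape, fine.shape)
```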
Authors: Neeraj Gangwar, Anshuka Rangi, Rishabh Deshmukh, Holakou Rahmanian, Yesh Dattatreya, Nickvash Kani
Abstract: Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning exacerbates common issues, such as task interference and negative transfer, due to the limited number of trainable parameters. To address these challenges, we introduce progressive task-specific multi-task adaptation, a novel parameter-efficient approach for multi-task learning. Our approach introduces adapter modules that are shared in early layers and become increasingly task-specific in later layers. Additionally, we propose a gradient-based approach for computing task similarity and use this measure to allocate similar tasks to the shared adapter modules. To evaluate our approach, we adapt Swin and Pyramid Vision Transformers on PASCAL and NYUD-v2. On both datasets, our approach outperforms prior parameter-efficient multi-task methods while using fewer trainable parameters.
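The gradient-based task similarity described above can be sketched as a cosine similarity between per-task gradients of the shared parameters; tasks with high similarity would then be routed to the same shared adapters. The toy tensors stand in for real task gradients, and the allocation step itself is omitted.

```python
import torch

def task_similarity(grads_a, grads_b):
    """Cosine similarity between flattened per-task gradients of the shared
    backbone; higher values suggest the tasks can share adapter modules."""
    va = torch.cat([g.flatten() for g in grads_a])
    vb = torch.cat([g.flatten() for g in grads_b])
    return torch.nn.functional.cosine_similarity(va, vb, dim=0)

# toy gradients for two tasks over two shared parameter tensors
g_seg = [torch.randn(16, 16), torch.randn(16)]
g_depth = [torch.randn(16, 16), torch.randn(16)]
print(float(task_similarity(g_seg, g_depth)))
```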
Authors: Zikun Guo, Jingwei Lv, Xinyue Xu, Shu Yang, Jun Wen, Di Wang, Lijie Hu
Abstract: Visual language models (VLMs) have the potential to transform medical workflows, but their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses the gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates showing a correlation with model size or overall accuracy. We further discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of the visual data. To overcome this, we propose a Visual Information Purification for Evidence-based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.
Authors: Kyeongryeol Go
Abstract: The performance of deep neural networks is strongly influenced by the quality of their training data. However, mitigating dataset bias by manually curating challenging edge cases remains a major bottleneck. To address this, we propose an automated pipeline for text-guided edge-case synthesis. Our approach employs a Large Language Model, fine-tuned via preference learning, to rephrase image captions into diverse textual prompts that steer a Text-to-Image model toward generating difficult visual scenarios. Evaluated on the FishEye8K object detection benchmark, our method achieves superior robustness, surpassing both naive augmentation and manually engineered prompts. This work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, offering a promising direction for developing more reliable and continuously improving AI systems. Code is available at https://github.com/gokyeongryeol/ATES.
Authors: Karim Farid, Rajat Sahay, Yumna Ali Alnaggar, Simon Schrodi, Volker Fischer, Cordelia Schmid, Thomas Brox
Abstract: Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation in a positive or negative way. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.
Authors: Giulio Weikmann, Gianmarco Perantoni, Lorenzo Bruzzone
Abstract: Deep learning has become increasingly important in remote sensing image classification due to its ability to extract semantic information from complex data. Classification tasks often include predefined label hierarchies that represent the semantic relationships among classes. However, these hierarchies are frequently overlooked, and most approaches focus only on fine-grained classification schemes. In this paper, we present a novel Semantics-Aware Hierarchical Consensus (SAHC) approach to learn hierarchical features and relationships by integrating hierarchy-specific classification heads within a deep network architecture, each specialized in different degrees of class granularity. The proposed approach employs trainable hierarchy matrices, which guide the network through the learning of the hierarchical structure in a self-consistent manner. Furthermore, we introduce a hierarchical consensus mechanism to ensure aligned probability distributions across different hierarchical levels. This mechanism acts as a weighted ensemble that effectively leverages the inherent structure of the hierarchical classification task. The proposed SAHC method is evaluated on two benchmark datasets with different degrees of hierarchical complexity on different tasks, considering varying spectral and spatial resolutions. Experimental results show both the effectiveness of the proposed approach in guiding network learning and the robustness of the hierarchical consensus for remote sensing image classification tasks. The codes will be released at https://github.com/rslab-unitrento/sahc.
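As a small illustration of how hierarchy matrices and a consensus over levels might fit together: fine-grained probabilities are mapped to a coarser level through a hierarchy matrix, and the per-level predictions are combined as a weighted ensemble. Note that the hierarchy matrices in SAHC are trainable, whereas this sketch fixes them by hand, and the ensemble weights here are placeholders.

```python
import torch

def coarse_from_fine(p_fine, hierarchy):
    """Map fine-grained class probabilities to a coarser level through a
    (num_fine x num_coarse) hierarchy matrix whose columns select children."""
    return p_fine @ hierarchy

def consensus(p_levels, weights):
    """Weighted ensemble of per-level predictions, all expressed at the same
    (coarse) level so their distributions can be averaged."""
    w = torch.softmax(torch.tensor(weights), dim=0)
    return sum(wi * p for wi, p in zip(w, p_levels))

p_fine = torch.softmax(torch.randn(2, 6), dim=-1)        # 6 fine classes
H = torch.tensor([[1., 0.], [1., 0.], [1., 0.],          # 3 children -> coarse 0
                  [0., 1.], [0., 1.], [0., 1.]])         # 3 children -> coarse 1
p_coarse_from_fine = coarse_from_fine(p_fine, H)
p_coarse_direct = torch.softmax(torch.randn(2, 2), dim=-1)
print(consensus([p_coarse_from_fine, p_coarse_direct], [0.5, 0.5]))
```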
Authors: Soroosh Tayebi Arasteh, Mina Shaigan, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Abstract: Self-supervised learning (SSL) has improved visual representation learning, but its value in chest radiography remains uncertain. DINOv3 extends earlier SSL models through Gram-anchored self-distillation and explicit high-resolution adaptation. Whether these changes improve transfer learning for chest radiograph classification has not been established. We benchmarked DINOv3 against DINOv2 and supervised ImageNet initialization across seven chest radiograph datasets comprising 816,183 radiographs from pediatric and adult cohorts. ViT-B/16 and ConvNeXt-B were evaluated under full fine-tuning at 224 and 512 pixels, with targeted 1024 experiments on three cohorts. Additional analyses examined parameter-efficient adaptation, synthetic label corruption, external validation, frozen 7B features, and computational efficiency. The primary outcome was mean AUROC across labels. In adult cohorts, DINOv3 did not consistently outperform DINOv2 at 224 x 224 pixels, but became the strongest initialization at 512 x 512, especially with ConvNeXt-B. Gains were greatest for small focal and boundary-dependent abnormalities, whereas large-structure findings changed little. The pediatric cohort showed no significant benefit from DINOv3, higher resolution, or backbone choice. Scaling to 1024 x 1024 rarely improved performance and markedly increased computational cost. ConvNeXt-B remained superior to ViT-B/16 under both full and parameter-efficient adaptation. External validation preserved the 512 x 512 DINOv3 advantage, whereas synthetic label corruption showed that this benefit should not be interpreted simply as superior noise robustness. For adult chest radiograph classification, DINOv3 provides its most reliable benefit at 512 x 512 pixels, particularly with ConvNeXt-B. Fully adapted mid-sized models at 512 x 512 pixels provided the best performance-cost trade-off in our benchmark.
Authors: Yuxi Mi, Qiuyang Yuan, Zhizhou Zhong, Xuan Zhao, Jiaogen Zhou, Fubao Zhu, Jihong Guan, Shuigeng Zhou
Abstract: Recently, iris recognition is regaining prominence in immersive applications such as extended reality as a means of seamless user identification. This application scenario introduces unique challenges compared to traditional iris recognition under controlled setups, as the ocular images are primarily captured off-axis and less constrained, causing perspective distortion, intra-subject variation, and quality degradation in iris textures. Datasets capturing these challenges remain limited. This paper fills this gap by presenting a large-scale iris dataset collected via head-mounted displays, termed ImmerIris. It contains 499,791 ocular images from 546 subjects, and is, to our knowledge, the largest public iris dataset to date and among the first dedicated to immersive applications. It is accompanied by a comprehensive set of evaluation protocols that benchmark recognition systems under various challenging conditions. This paper also draws attention to a shared obstacle of current recognition methods, the reliance on a pre-processing, normalization stage, which is fallible in off-axis and unconstrained setups. To this end, this paper further proposes a normalization-free paradigm that directly learns from minimally adjusted ocular images. Despite its simplicity, it outperforms normalization-based prior arts, indicating a promising direction for robust iris recognition.
Authors: Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang
Abstract: Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
Authors: Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu
Abstract: Multimodal large language models (MLLMs) have made rapid progress, yet their reasoning ability often lags behind strong text-only LLMs. Bridging this gap typically requires large-scale multimodal reasoning data or reinforcement learning, incurring substantial cost. An appealing alternative is parameter-space model merging between reasoning-enhanced LLMs and MLLMs, but we show that naive merging is fragile: its effectiveness varies widely across model families and can significantly degrade performance (e.g., for Qwen-based MLLMs). We propose Directional Reasoning Injection for Fine-Tuning (DRIFT), a lightweight method that transfers reasoning knowledge in the gradient space while preserving multimodal alignment. DRIFT precomputes a reasoning prior from the parameter differences between text-only reasoning experts and multimodal models, and uses it to bias gradients during supervised fine-tuning. This design retains the simplicity of standard SFT pipelines while enabling efficient and stable reasoning transfer. Experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, show that DRIFT consistently outperforms naive merging and standard SFT, and matches or surpasses training-intensive methods with substantially lower data and compute.
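A rough sketch of the gradient-space idea above: precompute a per-parameter reasoning prior as the difference between a text-only reasoning expert and the multimodal model, then bias gradients with it during an otherwise standard SFT step. The sign convention, scaling factor, and toy linear layers are assumptions; the published DRIFT recipe may weight or project the prior differently.

```python
import torch
import torch.nn as nn

def drift_bias_gradients(model, reasoning_prior, lam=0.1):
    """After loss.backward(), bias each gradient with the precomputed reasoning
    prior (expert weights minus multimodal weights). Subtracting it from the
    gradient makes a plain descent step also drift toward the reasoning expert."""
    for name, p in model.named_parameters():
        if p.grad is not None and name in reasoning_prior:
            p.grad.sub_(lam * reasoning_prior[name])

# toy demo: a single linear layer standing in for the MLLM and the expert
model = nn.Linear(8, 8)
expert = nn.Linear(8, 8)
prior = {n: expert.state_dict()[n] - p.detach() for n, p in model.named_parameters()}

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = model(torch.randn(4, 8)).pow(2).mean()      # stand-in SFT loss
loss.backward()
drift_bias_gradients(model, prior)
opt.step()
```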
Authors: Mohammad Javad Ahmadi, Iman Gandomi, Parisa Abdi, Seyed-Farzad Mohammadi, Amirhossein Taslimi, Mehdi Khodaparast, Hassan Hashemi, Mahdi Tavakoli, Hamid D. Taghirad
Abstract: The development of computer-assisted surgery systems relies on large-scale, annotated datasets. Existing cataract surgery resources lack the diversity and annotation depth required to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos acquired at two surgical centers from surgeons with varying expertise. The dataset provides four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on competency rubrics adapted from ICO-OSCAR and GRASIS. We demonstrate the technical utility of the dataset through benchmarking deep learning models across four tasks: workflow recognition, scene segmentation, instrument-tissue interaction tracking, and automated skill assessment. Furthermore, we establish a domain-adaptation baseline for phase recognition and instance segmentation by training on one surgical center and evaluating on a held-out center. Ultimately, these multi-source acquisitions, multi-layer annotations, and paired skill-kinematic labels facilitate the development of generalizable multi-task models for surgical workflow analysis, scene understanding, and competency-based training research. The dataset and annotations are available in Google Form (https://docs.google.com/forms/d/e/1FAIpQLSfmyMAPSTGrIy2sTnz0-TMw08ZagTimRulbAQcWdaPwDy187A/viewform?usp=dialog).
Authors: Jiayu Zhang, Shuo Ye, Qilang Ye, Xun Lin, Zihan Song, Zitong Yu
Abstract: Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality's contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.
Authors: Rui-Qing Sun, Ang Li, Zhijing Wu, Tian Lan, Qianyu Lu, Xingshan Yao, Chen Xu, Xian-Ling Mao
Abstract: Talking Face Generation (TFG) methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have recently achieved impressive progress in personalized talking head synthesis. However, existing methods typically require several minutes of reference video for meticulous preprocessing and fitting, resulting in hours of preparation time and limiting their practical applicability. In this paper, we revisit a fundamental yet underexplored question: do high-quality personalized TFG models truly require minutes-long reference videos? Our exploratory study reveals that a carefully selected reference segment of only a few seconds can often achieve performance comparable to that of using the full reference video. This finding suggests that the informativeness of reference data is more critical than its duration. Motivated by this observation, we propose ISExplore (Informative Segment Explore), a simple yet effective segment selection strategy that automatically identifies the most informative short reference segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and viewpoint diversity. Extensive experiments demonstrate that ISExplore reduces data processing and training time by over 5x for both NeRF- and 3DGS-based methods, while preserving high-fidelity generation quality. Our method provides a practical and efficient solution for personalized TFG and offers new insights into data efficiency in 3D talking face generation.
Authors: Mahsa Mitcheff, Siamul Karim Khan, Adam Czajka
Abstract: Developing reliable iris recognition and presentation attack detection methods requires diverse datasets that capture realistic variations in iris features and a wide spectrum of anomalies. Because of the rich texture of iris images, which spans a wide range of spatial frequencies, synthesizing same-identity iris images while controlling specific attributes remains challenging. In this work, we introduce a new iris image augmentation strategy by traversing a generative model's latent space toward latent codes that represent same-identity samples but with some desired iris image properties manipulated. The latent space traversal is guided by a gradient of specific geometrical, textural, or quality-related iris image features (e.g., sharpness, pupil size, iris size, or pupil-to-iris ratio) and preserves the identity represented by the image being manipulated. The proposed approach can be easily extended to manipulate any attribute for which a differentiable loss term can be formulated. Additionally, our approach can use either images randomly generated by a pre-trained GAN model or real-world iris images: we utilize GAN inversion to project any given iris image into the latent space and obtain its corresponding latent code.
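The gradient-guided latent traversal described above can be sketched as a small optimization over the latent code that descends an attribute loss while penalizing deviation from the original image. The generator, attribute loss, and identity loss in the demo below are toy stand-ins, not the paper's models.

```python
import torch

def traverse_latent(generator, z, attribute_loss, identity_loss,
                    steps=50, lr=0.05, lam=1.0):
    """Walk a latent code along the gradient of a differentiable attribute loss
    (e.g., a target pupil-to-iris ratio) while penalizing identity drift.
    Hypothetical helper names; any differentiable attribute term can be used."""
    z = z.clone().requires_grad_(True)
    z0 = z.detach().clone()
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = generator(z)
        loss = attribute_loss(img) + lam * identity_loss(img, generator(z0))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# toy stand-ins: a linear "generator", a brightness target, and an L2 identity term
g = torch.nn.Linear(16, 64)
z_init = torch.randn(1, 16)
brighter = lambda img: (0.8 - img.mean()).pow(2)
keep_id = lambda img, ref: (img - ref).pow(2).mean()
z_edit = traverse_latent(g, z_init, brighter, keep_id)
print(z_edit.shape)
```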
Authors: Xurui Li, Feng Xue, Yu Zhou
Abstract: Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. In aid of the novel framework, MuSc-V2 achieves significant performance improvements: a $\textbf{+23.7\%}$ AP gain on the MVTec 3D-AD dataset and a $\textbf{+19.3\%}$ boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at \href{https://github.com/HUST-SLOW/MuSc-V2}{https://github.com/HUST-SLOW/MuSc-V2}.
URLs: https://github.com/HUST-SLOW/MuSc-V2
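A minimal sketch of the mutual scoring idea underlying MuSc-V2: each patch is scored by how well it matches patches from the other unlabeled samples, so normal patches (which find many similar neighbors) receive low scores while isolated anomalous patches receive high ones. The single-modality, cosine-similarity form below is an assumption; the paper's MSM additionally fuses 2D/3D cues and applies re-scoring.

```python
import torch

def mutual_scores(patch_feats):
    """patch_feats: (N_images, P, D) patch features.
    Score each patch by its smallest cosine distance to any patch from the
    *other* images; normal patches find close matches, anomalies do not."""
    n, p, _ = patch_feats.shape
    feats = torch.nn.functional.normalize(patch_feats, dim=-1)
    scores = torch.empty(n, p)
    for i in range(n):
        others = torch.cat([feats[j] for j in range(n) if j != i])  # ((n-1)P, D)
        sim = feats[i] @ others.t()                                  # (P, (n-1)P)
        scores[i] = 1.0 - sim.max(dim=1).values                      # min distance
    return scores

feats = torch.randn(8, 16, 32)     # 8 images, 16 patches each, 32-dim features
print(mutual_scores(feats).shape)  # per-patch anomaly scores, (8, 16)
```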
Authors: Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami
Abstract: Reference-based object composition involves integrating a foreground reference image with a background scene to produce a harmonious fused image. This task becomes particularly challenging in cross-domain scenarios, where models must balance preserving the reference object's identity with harmonizing it to match stylized environments. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with three key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition; (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation; and (iii) a prior preservation objective that keeps learned identity and style priors intact. By design, this approach mitigates concept interference typical in unified-attention architectures while ensuring robust generalization across diverse references and styles. Our framework is trained on a new 115k-sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, iterative human-in-the-loop filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
Authors: Di Wu, Liu Liu, Xueyu Yuan, Wenxiao Chen, Lijun Yue, Liuzhu Chen, Yiming Tang, Meng Wang
Abstract: Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address the limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. To ensure precise geometric fidelity, we constrain traditional 3D Gaussians into planar primitives, facilitating accurate normal and depth estimation. The planar Gaussians are then optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors. Furthermore, we leverage a Vision-Language Model (VLM) via visual prompting to achieve open-vocabulary part segmentation and joint parameter estimation. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing baselines, achieving superior part-level surface reconstruction fidelity.
Authors: Aditya Mishra, Akshay Agarwal, Haroon Lone
Abstract: Traffic signboards are vital for road safety and intelligent transportation systems, enabling navigation and autonomous driving. Yet, recognizing traffic signs at night remains underexplored due to the scarcity of realistic public datasets capturing low-light degradations and distractor classes. Existing benchmarks are predominantly daytime and do not reflect challenges such as headlight glare, motion blur, sensor noise, and vandalized or ambiguous signage. To address these gaps, we introduce INTSD, a large-scale nighttime traffic sign dataset collected across diverse regions of India. INTSD contains street-level images spanning 41 traffic signboard classes, multiple distractor categories, and varied lighting and weather conditions. The dataset is designed to support both detection and fine-grained classification under realistic nighttime scenarios. To benchmark INTSD for nighttime sign recognition, we conduct extensive evaluations using state-of-the-art detection and classification models under standardized protocols. Additionally, we present LENS-Net, a strong baseline that integrates adaptive illumination-aware detection with multimodal semantic reasoning for robust nighttime sign classification. Experiments and ablations demonstrate the challenges posed by INTSD and establish competitive baselines for future research. The dataset and code for LENS-Net are publicly available for research.
Authors: Anjie Le, Can Peng, Yuyuan Liu, J. Alison Noble
Abstract: In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting. In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower dimensional space, yielding a provably optimal forgetting operator. We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. On this basis, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with a closed-form solution (POUR-P) and a feature-level unlearning variant under a distillation scheme (POUR-D). Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics.
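For context on the geometry POUR builds on, the sketch below constructs a simplex Equiangular Tight Frame and checks its defining property (all pairwise inner products equal to -1/(K-1)). The forgetting operator itself, i.e., the specific orthogonal projection shown to preserve the ETF structure, is the paper's contribution and is not reproduced here.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Construct a simplex equiangular tight frame of `num_classes` unit vectors
    in `dim` dimensions (dim >= num_classes - 1): pairwise inner products all
    equal -1/(num_classes - 1)."""
    k = num_classes
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.standard_normal((dim, k)))        # orthonormal columns
    m = np.sqrt(k / (k - 1)) * u @ (np.eye(k) - np.ones((k, k)) / k)
    return m                                                   # columns are the ETF

etf = simplex_etf(num_classes=10, dim=64)
gram = etf.T @ etf
off_diag = gram[~np.eye(10, dtype=bool)]
print(np.allclose(off_diag, -1 / 9))        # equiangular: all equal -1/(K-1)
```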
Authors: Bin Wang, Ruotong Hu, Wentong Li, Wenqian Wang, Mingliang Gao, Runmin Cong, Wei Zhang, Xudong Jiang
Abstract: Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
Authors: Han Su, Tianyu Huang, Zichen Wan, Xiaohe Wu, Wangmeng Zuo
Abstract: Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.
Authors: Zekai Shao, Yufan Hu, Jingyuan Liu, Bin Fan, Hongmin Liu
Abstract: Parameter-efficient fine-tuning has emerged as a promising paradigm in RGB-T tracking, enabling downstream task adaptation by freezing pretrained parameters and fine-tuning only a small set of parameters. This set forms a rank space made up of multiple individual ranks, whose expressiveness directly shapes the model's adaptability. However, quantitative analysis reveals low-rank adaptation exhibits significant redundancy in the rank space, with many ranks contributing almost no practical information. This hinders the model's ability to learn more diverse knowledge to address the various challenges in RGB-T tracking. To address this issue, we propose the Group Orthogonal Low-Rank Adaptation (GOLA) framework for RGB-T tracking, which effectively leverages the rank space through structured parameter learning. Specifically, we adopt a rank decomposition partitioning strategy utilizing singular value decomposition to quantify rank importance, freeze crucial ranks to preserve the pretrained priors, and cluster the redundant ranks into groups to prepare for subsequent orthogonal constraints. We further design an inter-group orthogonal constraint strategy. This constraint enforces orthogonality between rank groups, compelling them to learn complementary features that target diverse challenges, thereby alleviating information redundancy. Experimental results demonstrate that GOLA effectively reduces parameter redundancy and enhances feature representation capabilities, significantly outperforming state-of-the-art methods across four benchmark datasets and validating its effectiveness in RGB-T tracking tasks.
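Two of the ingredients named above, SVD-based rank importance and an inter-group orthogonality penalty, can be sketched as follows. The grouping strategy, the freezing of crucial ranks, and the loss weighting used by GOLA are omitted, and the tensor shapes are illustrative.

```python
import torch

def rank_importance(lora_A, lora_B):
    """Singular values of the low-rank update B @ A as a proxy for how much
    information each rank direction carries."""
    update = lora_B @ lora_A                    # (out, in) low-rank update
    return torch.linalg.svdvals(update)

def group_orthogonality_loss(groups):
    """Penalize overlap between rank groups: sum of squared inner products
    between basis vectors belonging to different groups."""
    loss = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            loss = loss + (groups[i] @ groups[j].t()).pow(2).sum()
    return loss

A = torch.randn(8, 256)                  # rank-8 LoRA factors
B = torch.randn(256, 8)
print(rank_importance(A, B)[:4])         # leading singular values of B @ A

g1 = torch.nn.functional.normalize(torch.randn(4, 256), dim=-1)  # redundant ranks,
g2 = torch.nn.functional.normalize(torch.randn(4, 256), dim=-1)  # split into groups
print(group_orthogonality_loss([g1, g2]))
```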
Authors: Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu
Abstract: Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
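A minimal sketch of palette-constrained Gumbel-Softmax quantization of the kind described above: each voxel carries logits over a fixed palette, and a hard Gumbel-Softmax sample selects a palette color while keeping the choice differentiable. The palette, voxel count, and temperature below are placeholder values, not Voxify3D's settings.

```python
import torch
import torch.nn.functional as F

def palette_quantize(color_logits, palette, tau=1.0, hard=True):
    """Differentiably snap each voxel to one of a fixed set of palette colors
    via Gumbel-Softmax over per-voxel palette logits."""
    # color_logits: (num_voxels, palette_size); palette: (palette_size, 3) RGB
    weights = F.gumbel_softmax(color_logits, tau=tau, hard=hard)  # near one-hot
    return weights @ palette                                      # (num_voxels, 3)

palette = torch.tensor([[0.90, 0.20, 0.20],      # a tiny 4-color palette
                        [0.20, 0.70, 0.30],
                        [0.20, 0.30, 0.90],
                        [0.95, 0.90, 0.20]])
logits = torch.randn(1000, 4, requires_grad=True)   # learnable per-voxel logits
colors = palette_quantize(logits, palette)
colors.sum().backward()                             # gradients reach the logits
print(colors.shape, logits.grad.shape)
```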
Authors: Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
Abstract: Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. Particularly, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.
Authors: Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim
Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
Authors: Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, Konrad Schindler
Abstract: We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
Authors: Guangqian Guo, Aixi Ren, Yong Guo, Xuehui Yu, Jiacheng Tian, Wenli Li, Chaowei Wang, Yaoxing Wang, Shan Gao
Abstract: Segment Anything Models (SAMs), known for their exceptional zero-shot segmentation performance, have garnered significant attention in the research community. Nevertheless, their performance drops significantly on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM++, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Additionally, to improve compatibility between the pre-trained diffusion model and the segmentation framework, we introduce two techniques, i.e., Feature Distribution Alignment (FDA) and Channel Replication and Expansion (CRE). However, the above components lack explicit guidance regarding the degree of degradation. The model is forced to implicitly fit a complex noise distribution that spans conditions from mild noise to severe artifacts, which substantially increases the learning burden and leads to suboptimal reconstructions. To address this issue, we further introduce a Degradation-aware Adaptive Enhancement (DAE) mechanism. The key principle of DAE is to decouple the reconstruction process for arbitrary-quality features into two stages: degradation-level prediction and degradation-aware reconstruction. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. Extensive experiments demonstrate that GleSAM++ significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM++ also performs well on unseen degradations, underscoring the versatility of our approach and dataset.
Authors: Di Xu, Hengjie Liu, Yang Yang, Mary Feng, Jin Ning, Xin Miao, Jessica E. Scholey, Alexandra E. Hotca-cho, William C. Chen, Michael Ohliger, Martina Descovich, Huiming Dong, Wensha Yang, Ke Sheng
Abstract: Accelerated dynamic volumetric magnetic resonance imaging (4DMRI) is essential for applications relying on motion resolution. Existing 4DMRI reconstructs averaged breathing phases with acceptable artifact levels, but this averaging can blur and misrepresent instantaneous dynamic information. Recovery of such information requires a new paradigm to reconstruct extremely undersampled non-Cartesian k-space data. We propose B-FIRE, a binning-free diffusion implicit neural representation framework for hyper-accelerated MR reconstruction capable of reflecting instantaneous 3D abdominal anatomy. B-FIRE employs a CNN-INR encoder-decoder backbone optimized using diffusion with a comprehensive loss that enforces image-domain fidelity and frequency-aware constraints. Motion binned image pairs were used as training references, while inference was performed on binning-free undersampled data. Experiments were conducted on a T1-weighted StarVIBE liver MRI cohort, with accelerations ranging from 8 spokes per frame (RV8) to RV1. B-FIRE was compared against direct NuFFT, GRASP-CS, and an unrolled CNN method. Reconstruction fidelity, motion trajectory consistency, and inference latency were evaluated.
Authors: Shaokun Wang, Weili Guan, Jizhou Han, Jianlong Wu, Yupeng Hu, Liqiang Nie
Abstract: Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
Authors: Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Marc Comino-Trinidad, Dan Casas, Yi Zhou
Abstract: Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-stage synthetic-real domain-hedging strategy that first learns a strong generative prior from large-scale real footage, and then introduces controllability by using a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.
Authors: Buddhi Wijenayake, Nichula Wasalathilake, Roshan Godaliyadda, Vijitha Herath, Parakrama Ekanayake, Vishal M. Patel
Abstract: Long-tailed class imbalance remains a fundamental obstacle in semantic segmentation of high-resolution remote-sensing imagery, where dominant classes shape learned representations and rare classes are systematically under-segmented. This challenge becomes more acute in cross-domain settings such as LoveDA, which exhibits an explicit Urban/Rural split with substantial appearance differences and inconsistent class-frequency statistics across domains. We propose a prompt-controlled diffusion augmentation framework that generates paired label-image samples with explicit control over semantic composition and domain, enabling targeted enrichment of underrepresented classes rather than indiscriminate dataset expansion. A domain-aware, masked, ratio-conditioned discrete diffusion model first synthesizes layouts that satisfy class-ratio targets while preserving realistic spatial co-occurrence, and a ControlNet-guided diffusion model then renders photorealistic, domain-consistent images from these layouts. When mixed with real data, the resulting synthetic pairs improve multiple segmentation backbones, especially on minority classes and under domain shift, showing that better downstream segmentation comes from adding the right samples in the right proportions. Source codes, pretrained models, and synthetic datasets are available at \href{https://buddhi19.github.io/SyntheticGen}{\texttt{buddhi19.github.io/SyntheticGen}}.
Authors: Inbar Gat, Dana Cohen-Bar, Guy Levy, Elad Richardson, Daniel Cohen-Or
Abstract: Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
Authors: Hulingxiao He, Zijun Geng, Yuxin Peng
Abstract: Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise for knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.
Authors: Xiaomeng Peng, Xilang Huang, Seon Han Choi
Abstract: Multimodal large language models (MLLMs) can enrich industrial anomaly detection with semantic descriptions and anomaly reasoning, but they still lag specialist anomaly detectors in binary detection accuracy. Existing approaches address this gap by fine-tuning MLLMs or training bridging modules to align expert outputs with MLLM inputs, limiting flexibility across backbones. We propose EAGLE, a tuning-free framework that integrates expert anomaly detectors with frozen MLLMs. EAGLE consists of Threshold-Guided Prompt Selection (TGPS), which estimates a normal-score threshold from expert-model statistics and selects textual and visual prompts, and Confidence-Aware Attention Sharpening (CAAS), which shifts MLLM attention toward visual evidence when expert confidence is low. Beyond improving accuracy, we analyze MLLM attention and find that correct anomaly predictions are associated with stronger focus on ground-truth defect regions; EAGLE consistently strengthens this alignment. On MVTec-AD and VisA, EAGLE improves five MLLM backbones without parameter updates, reaching up to 94.6% and 88.6% accuracy, respectively, and achieving performance competitive with fine-tuning-based methods while retaining anomaly-aware reasoning ability.
Authors: Li Zhang, Shruti Agarwal, John Collomosse, Pengtao Xie, Vishal Asnani
Abstract: Generative AI models pose a significant challenge to intellectual property (IP), as they can replicate unique artistic styles and concepts without attribution. While watermarking offers a potential solution, existing methods often fail in complex scenarios where multiple concepts (e.g., an object and an artistic style) are composed within a single image. These methods struggle to disentangle and attribute each concept individually. In this work, we introduce TokenTrace, a novel proactive watermarking framework for robust, multi-concept attribution. Our method embeds secret signatures into the semantic domain by simultaneously perturbing the text prompt embedding and the initial latent noise that guide the diffusion model's generation process. For retrieval, we propose a query-based TokenTrace module that takes the generated image and a textual query specifying which concepts need to be retrieved (e.g., a specific object or style) as inputs. This query-based mechanism allows the module to disentangle and independently verify the presence of multiple concepts from a single generated image. Extensive experiments show that our method achieves state-of-the-art performance on both single-concept (object and style) and multi-concept attribution tasks, significantly outperforming existing baselines while maintaining high visual quality and robustness to common transformations.
Authors: Phuc D. A. Nguyen, Anh N. Nhu, Ming C. Lin
Abstract: We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded by dashcams. Existing VO methods are trained at a fixed observation frequency (e.g., 10 Hz or 12 Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These issues significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks - KITTI, nuScenes, and Argoverse 2 - achieving more than 20 performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.
Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun
Abstract: Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.
Authors: Zikai Zhou, Muyao Wang, Shitong Shao, Lichen Bai, Haoyi Xiong, Bo Han, Zeke Xie
Abstract: The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.
Authors: Jiangye Yuan, Gowri Kumar, Baoyuan Wang
Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach yields substantial improvements on challenging spatial reasoning benchmarks, boosting GPT-5 performance by 9% on VSI-Bench and 12% on MindCube. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
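To make the GR3D idea of ID-indexed textual 3D references more concrete, the sketch below shows one way such references could be serialized before prompting a frozen MLLM. The field names, coordinate convention, and formatting are illustrative assumptions, not the paper's exact schema.

```python
# Hypothetical GR3D-style textual scene references (assumed fields and format).
# Each detected object gets a unique ID; its 3D attributes are serialized as
# text so a frozen MLLM can reason over them alongside the ID-annotated images.

def format_gr3d_references(objects):
    """objects: list of dicts with 'id', 'label', 'center' (x, y, z in meters),
    and 'size' (w, h, d in meters) -- assumed attributes for illustration."""
    lines = ["3D object references (meters, camera coordinates):"]
    for obj in objects:
        cx, cy, cz = obj["center"]
        w, h, d = obj["size"]
        lines.append(
            f"[{obj['id']}] {obj['label']}: center=({cx:.2f}, {cy:.2f}, {cz:.2f}), "
            f"size=({w:.2f} x {h:.2f} x {d:.2f})"
        )
    return "\n".join(lines)

scene = [
    {"id": 1, "label": "sofa", "center": (0.4, 0.0, 2.1), "size": (1.8, 0.9, 0.8)},
    {"id": 2, "label": "lamp", "center": (-1.2, 0.6, 2.9), "size": (0.3, 1.5, 0.3)},
]
prompt_context = format_gr3d_references(scene)
question = "Which object is closer to the camera, [1] or [2]?"
# prompt_context + question would be sent to the MLLM together with the
# ID-annotated images; no additional training is involved.
print(prompt_context)
```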
Authors: Jiahao Li, Qingwang Zhang, Qiuyu Chen, Guozhan Qiu, Yunzhong Lou, Xiangdong Zhou
Abstract: The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance. Code and dataset are available at: https://github.com/JohanStackk/FutureCAD
Authors: Xi Xiao, Chenrui Ma, Yunbei Zhang, Chen Liu, Zhuxuanzi Wang, Yanshu Li, Lin Zhao, Guosheng Hu, Tianyang Wang, Hao Xu
Abstract: Low-Rank Adaptation (LoRA) has become a cornerstone of parameter-efficient fine-tuning (PEFT). Yet, its efficacy is hampered by two fundamental limitations: semantic drift, from treating all update directions with equal importance, and structural incoherence, from adapting layers independently, resulting in suboptimal, uncoordinated updates. To remedy these, we propose StructLoRA, a framework that addresses both limitations through a principled, dual-component design: (1) an Information Bottleneck-guided filter that prunes task-irrelevant directions to mitigate semantic drift, and (2) a lightweight, training-only graph-based coordinator that enforces inter-layer consistency to resolve structural incoherence. Extensive experiments across large language, vision-language, and vision models (including LLaMA, LLaVA, and ViT) demonstrate that StructLoRA consistently establishes a new state-of-the-art, outperforming not only vanilla LoRA but also advanced dynamic rank allocation and sparsity-based methods. Notably, the benefits are particularly pronounced in challenging low-rank and low-data regimes. Crucially, since our proposed modules operate only during training, StructLoRA enhances performance with zero additional inference cost, advancing the focus of PEFT -- from mere parameter compression to a more holistic optimization of information quality and structural integrity.
Authors: Soumyaratna Debnath, Bui Duc Manh, Zinan Liu, Lin Wang
Abstract: Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
Authors: Samuel Johnny, Blessed Guda, Goodness Obasi, Aaron Emmanuel, Moise Busogi
Abstract: Automated diagnosis from chest computed tomography (CT) scans faces two persistent challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups. We address both simultaneously across two complementary tasks: binary COVID-19 classification from multi-site CT volumes (Task 1) and four-class lung pathology recognition with gender-based fairness constraints (Task 2). Our framework combines a lightweight MobileViT-XXS slice encoder with a two-layer SliceTransformer aggregator for volumetric reasoning, and trains with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that adaptively upweights underperforming acquisition centres and demographic subgroups. Unlike standard Group DRO, the KL penalty prevents group weight collapse, providing a stable balance between worst-case protection and average performance. For Task 2, we define groups at the granularity of gender $\times$ class, directly targeting severely underrepresented combinations such as female Squamous cell carcinoma. On Task 1, our best configuration achieves a challenge F1 of 0.835, surpassing the best published challenge entry by +5.9. On Task 2, Group DRO with $\alpha = 0.5$ achieves a mean per-gender macro F1 of 0.815, outperforming the best challenge entry by +11.1 pp and improving Female Squamous F1 by +17.4 over the Focal Loss baseline.
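For readers unfamiliar with the objective, the sketch below shows a generic KL-regularised Group DRO loss: penalising the group-weight distribution's KL divergence from uniform yields a softmax over group losses in closed form, which upweights underperforming groups without collapsing onto a single one. The temperature parameterisation and batching details here are assumptions, not the paper's exact implementation.

```python
import torch

def kl_regularized_group_dro_loss(per_sample_loss, group_ids, num_groups, tau=1.0):
    """Generic KL-regularised Group DRO sketch (assumed parameterisation).

    Solving max_q sum_g q_g * L_g - tau * KL(q || uniform) in closed form gives
    q = softmax(L_g / tau): tau -> 0 recovers worst-group DRO, large tau recovers
    the plain average, so the penalty keeps the group weights from collapsing
    onto a single group.
    """
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g
        # Guard against groups absent from the current mini-batch.
        group_losses.append(per_sample_loss[mask].mean() if mask.any()
                            else per_sample_loss.new_zeros(()))
    group_losses = torch.stack(group_losses)
    weights = torch.softmax(group_losses.detach() / tau, dim=0)  # adaptive upweighting
    return (weights * group_losses).sum()

# Example: 8 samples drawn from 3 acquisition sites / demographic subgroups.
losses = torch.rand(8)
groups = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2])
print(kl_regularized_group_dro_loss(losses, groups, num_groups=3, tau=0.5))
```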
Authors: Myeongkyun Kang, Yanting Yang, Xiaoxiao Li
Abstract: Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.
Authors: Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park
Abstract: Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
Authors: Qunjie Huang, Weina Zhu
Abstract: Cross-subject EEG-to-image retrieval for visual decoding is challenged by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small-k shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free calibration head that operates directly on the similarity matrix of frozen EEG and image encoders. SATTC combines a geometric expert, subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS), and a structural expert built from mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On THINGS-EEG2 under a strict leave-one-subject-out protocol, standardized inference with cosine similarities, L2-normalized embeddings, and candidate whitening already yields a strong cross-subject baseline over the original ATM retrieval setup. Building on this baseline, SATTC further improves Top-1 and Top-5 accuracy, reduces hubness and per-class imbalance, and produces more reliable small-k shortlists. These gains transfer across multiple EEG encoders, supporting SATTC as an encoder-agnostic, label-free test-time calibration layer for cross-subject neural decoding.
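The two calibration ingredients named in this abstract are standard enough to sketch: Cross-domain Similarity Local Scaling (CSLS) penalises hub candidates on a cosine-similarity matrix, and a Product-of-Experts rule fuses expert score maps via a geometric mean. The adaptive CSLS variant and the exact structural expert used by SATTC are not reproduced; the snippet below only illustrates the generic forms under assumed inputs.

```python
import numpy as np

def csls(sim, k=10):
    """Cross-domain Similarity Local Scaling on a query-by-candidate cosine
    similarity matrix: penalise 'hub' candidates that are close to everything.
    csls[i, j] = 2 * sim[i, j] - r_q[i] - r_c[j], where r_q / r_c are the mean
    similarities to the k nearest neighbours in the other domain."""
    r_q = -np.sort(-sim, axis=1)[:, :k].mean(axis=1, keepdims=True)  # per query
    r_c = -np.sort(-sim, axis=0)[:k, :].mean(axis=0, keepdims=True)  # per candidate
    return 2.0 * sim - r_q - r_c

def product_of_experts(score_maps, eps=1e-8):
    """Fuse expert score maps (assumed already mapped to (0, 1]) by a geometric
    mean -- a simple Product-of-Experts rule; SATTC's experts may differ."""
    log_scores = sum(np.log(np.clip(s, eps, None)) for s in score_maps)
    return np.exp(log_scores / len(score_maps))

# Toy example: 5 EEG queries against 20 candidate images.
rng = np.random.default_rng(0)
sim = rng.normal(size=(5, 20))
geometric = 1.0 / (1.0 + np.exp(-csls(sim, k=5)))      # squash scores to (0, 1)
structural = rng.uniform(0.1, 1.0, size=sim.shape)      # stand-in structural expert
fused = product_of_experts([geometric, structural])
print(fused.argsort(axis=1)[:, ::-1][:, :5])             # top-5 shortlist per query
```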
Authors: Yifeng Zheng
Abstract: Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing our simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation by 2.23x without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse. Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 +/- 0.0198 macro AUC and 0.7610 +/- 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift. These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.
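As a reference for the reported metric, the sketch below computes a "macro-F1 (optimal threshold)" score by sweeping a per-class decision threshold and averaging the best per-class F1. Whether thresholds are tuned per class or globally, and on which split, are assumptions here rather than details given in the abstract.

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_optimal_threshold(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    """Macro-F1 at per-class optimal thresholds for multi-label classification.

    y_true: (N, C) binary labels; y_prob: (N, C) predicted probabilities."""
    per_class = []
    for c in range(y_true.shape[1]):
        best = max(f1_score(y_true[:, c], (y_prob[:, c] >= t).astype(int),
                            zero_division=0) for t in grid)
        per_class.append(best)
    return float(np.mean(per_class))

# Toy data: 200 fundus images, 8 disease labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(200, 8))
probs = np.clip(labels * 0.6 + rng.random((200, 8)) * 0.5, 0, 1)
print(macro_f1_optimal_threshold(labels, probs))
```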
Authors: Chengjie Cui, Taihua Xu, Shuyin Xia, Qinghua Zhang, Yun Cui, Shiping Wang
Abstract: The effective utilization of consistency is crucial for multi-view learning. Graph convolutional networks (GCNs) leverage node connections to propagate information across the graph, facilitating the exploitation of consistency in multi-view data. However, most existing GCN-based multi-view methods suffer from several limitations. First, current approaches predominantly rely on KNN for topology construction, where the artificial selection of the k value significantly constrains the effective exploitation of inter-node consistency. Second, the inter-feature consistency within individual views is often overlooked, which adversely affects the quality of the final embedding representations. Moreover, these methods fail to fully utilize inter-view consistency, as the fusion of embedded representations from multiple views is often implemented after the intra-view graph convolutional operation. Collectively, these issues limit the model's capacity to fully capture inter-node, inter-feature, and inter-view consistency. To address these issues, this paper proposes a multi-view graph convolutional network that fully leverages consistency via granular-ball (GB)-based topology construction, feature enhancement, and interactive fusion (MGCN-FLC). MGCN-FLC fully exploits the three types of consistency through three modules: the topology construction module based on the granular-ball algorithm, which clusters nodes into granular balls with high internal similarity to capture inter-node consistency; the feature enhancement module, which improves feature representations by capturing inter-feature consistency; and the interactive fusion module, which enables each view to deeply interact with all other views, thereby obtaining more comprehensive inter-view consistency. Experimental results on nine datasets show that the proposed MGCN-FLC outperforms state-of-the-art semi-supervised node classification methods.
Authors: Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao
Abstract: Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs.
Authors: Shramana Dey, Varun Ajith, Abhirup Banerjee, Sushmita Mitra
Abstract: Domain generalization in fundus imaging is challenging due to variations in acquisition conditions across devices and clinical settings. The inability to adapt to these variations degrades the performance of deep learning models on unseen domains. Moreover, obtaining annotated data across domains is often expensive, and privacy constraints restrict its availability. Although single-source domain generalization (SDG) offers a realistic solution to this problem, existing approaches frequently fail to capture anatomical topology or to decouple appearance from anatomical features. This research introduces WaveSDG, a new wavelet-guided segmentation network for SDG. It decouples anatomical structure from domain-specific appearance through a wavelet sub-band decomposition. A novel Wavelet-based Invariant Structure Extraction and Refinement (WISER) module is proposed to process encoder features by leveraging the distinct semantic roles of each wavelet sub-band. The module refines low-frequency components to anchor global anatomy, while selectively enhancing directional edges and suppressing noise within the high-frequency sub-bands. Extensive ablation studies validate the effectiveness of the WISER module and its decoupling strategy. Our evaluations on optic cup and optic disc segmentation across one source and five unseen target datasets show that WaveSDG consistently outperforms seven state-of-the-art methods. Notably, it achieves the best balanced Dice score and lowest 95th percentile Hausdorff distance with reduced variance, indicating improved accuracy, robustness, and cross-domain stability.
Authors: Taiting Lu, Kaiyuan Lin, Yuxin Tian, Mingjia Wang, Yubo Wang, Muchuan Wang, Sharique Khatri, Akshit Kartik, Yixi Wang, Amey Santosh Rane, Yida Wang, Sung-Liang Chen, Yifan Yang, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda
Abstract: Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable, spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, even though such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding topological relationships among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps in current LMMs when interpreting schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration.
Authors: Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva
Abstract: Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.
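The distillation signal itself is the familiar token-level KL between teacher and student next-token distributions, restricted to language-intensive positions; a generic sketch is given below. The layer-wise KV-cache sharing that lets the frozen LM teacher see the student's multimodal representations is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def language_distillation_loss(student_logits, teacher_logits, text_mask, T=1.0):
    """Generic token-level KL distillation from a frozen LM teacher to a VLM
    student, applied only at masked text-token positions (assumed setup; the
    paper's KV-cache sharing mechanism is not shown).

    student_logits, teacher_logits: (batch, seq_len, vocab)
    text_mask: (batch, seq_len) bool, True where the distillation signal applies.
    """
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    kl = F.kl_div(s, t, reduction="none").sum(-1)          # per-token KL
    return (kl * text_mask).sum() / text_mask.sum().clamp(min=1) * (T ** 2)

# Toy shapes: batch 2, sequence length 6, vocabulary 11.
student = torch.randn(2, 6, 11, requires_grad=True)
teacher = torch.randn(2, 6, 11)
mask = torch.tensor([[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0]], dtype=torch.bool)
loss = language_distillation_loss(student, teacher, mask)
loss.backward()
print(loss.item())
```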
Authors: Youqi Liao, Shuhao Kang, Jingyu Xu, Olaf Wysocki, Yan Xia, Jianping Li, Zhen Dong, Bisheng Yang, Xieyuanli Chen
Abstract: Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well-suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O localization task, which aims to estimate accurate 2D positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses the textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53\%, 9.93\%, and 8.32\% at 5 m, 10 m, and 25 m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
Authors: Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, Aishwarya Agrawal
Abstract: Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favour of salient objects, resulting in an incomplete understanding of a model's vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any ``candidate VLM'' on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
Authors: Xining Ge, Weijun Yuan, Gengjia Chang, Xuyang Li, Shuhong Liu
Abstract: Nighttime image dehazing faces a more complex degradation pattern than its daytime counterpart, as haze scattering couples with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, this complexity aggravates domain drift and training instability, since target-domain samples are scarce while naively introducing external data may weaken adaptation due to distribution mismatch. This paper presents our solution to the NTIRE 2026 Night Time Image Dehazing Challenge, built as a unified framework that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. Specifically, a pre-trained CLIP visual encoder screens candidate external samples by similarity to construct training data closer to the target domain. NAFNet is then trained in two stages, first adapting to the target domain and then expanding to broader degradation patterns. At inference time, TLC, x8 self-ensemble, and weighted snapshot fusion are combined to improve output stability. Rather than relying on complex network redesign, the proposed framework offers a practical and effective pipeline for nighttime image dehazing.
Authors: Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath
Abstract: We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.
Authors: Shaotian Li, Shangze Li, Chuancheng Shi, Wenhua Wu, Yanqiu Wu, Xiaohan Yu, Fei Shen, Tat-Seng Chua
Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.
Authors: Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotny, Christian Rupprecht, Yinghao Xu, Xing Zhu, Wenjun Zeng, Xin Jin, Yujun Shen
Abstract: The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.
Authors: Minh Sao Khue Luu, Evgeniy N. Pavlovskiy, Bair N. Tuchinov
Abstract: We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at https://github.com/luumsk/SmallLesionMRI.
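To illustrate the component-adaptive idea, the sketch below evaluates a Tversky index, TI = TP / (TP + alpha*FP + beta*FN), separately on each connected lesion and averages the resulting losses, so small lesions are not drowned out by large ones. The exact reweighting and the MIL term used in CATMIL may differ from this assumed form.

```python
import numpy as np
from scipy import ndimage

def component_adaptive_tversky(pred, target, alpha=0.3, beta=0.7, eps=1e-6):
    """Hedged sketch of a component-adaptive Tversky loss.

    pred: soft foreground probabilities; target: binary mask (same shape)."""
    labeled, n = ndimage.label(target)
    if n == 0:  # no lesions: fall back to a global false-positive penalty
        return float(pred.mean())
    losses = []
    for comp in range(1, n + 1):
        m = labeled == comp
        tp = (pred * m).sum()
        fn = ((1 - pred) * m).sum()
        # False positives counted outside all lesions, shared across components.
        fp = (pred * (target == 0)).sum() / n
        ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
        losses.append(1.0 - ti)
    return float(np.mean(losses))  # each lesion contributes equally

# Toy 2D example with one large and one small lesion.
target = np.zeros((32, 32), dtype=np.uint8)
target[2:10, 2:10] = 1          # large lesion
target[20:22, 20:22] = 1        # small lesion
pred = target.astype(float) * 0.8
print(component_adaptive_tversky(pred, target))
```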
Authors: Xining Ge, Gengjia Chang, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Yihang Chen, Yifan Deng, Shuhong Liu
Abstract: Remote sensing infrared image super-resolution aims to recover sharper thermal observations from low-resolution inputs while preserving target contours, scene layout, and radiometric stability. Unlike visible-image super-resolution, thermal imagery is weakly textured and more sensitive to unstable local sharpening, which makes complementary local and global modeling especially important. This paper presents our solution to the NTIRE 2026 Infrared Image Super-Resolution Challenge, a dual-branch system that combines a HAT-L branch and a MambaIRv2-L branch. The inference pipeline applies test-time local conversion on HAT, eight-way self-ensemble on MambaIRv2, and fixed equal-weight image-space fusion. We report both the official challenge score and a reproducible evaluation on 12 synthetic $\times 4$ thermal samples derived from Caltech Aerial RGB-Thermal, on which the fused output outperforms either single branch in PSNR, SSIM, and the overall Score. The results suggest that infrared super-resolution benefits from explicit complementarity between locally strong transformer restoration and globally stable state-space modeling.
Authors: Abu Zahid Bin Aziz, Syed Fahim Ahmed, Gnanesh Rasineni, Mei Wang, Olcaytu Hatipoglu, Marisa Ricci, Malaiyah Shaw, Guang Li, J. Quincy Brown, Valerio Pascucci, Shireen Elhabian
Abstract: Structured Illumination Microscopy (SIM) enables rapid, high-contrast optical sectioning of fresh tissue without staining or physical sectioning, making it promising for intraoperative and point-of-care diagnostics. Recent foundation and large-scale self-supervised models in digital pathology have demonstrated strong performance on section-based modalities such as Hematoxylin and Eosin (H&E) and immunohistochemistry (IHC). However, these approaches are predominantly trained on thin tissue sections and do not explicitly address thick-tissue fluorescence modalities such as SIM. When transferred directly to SIM, performance is constrained by substantial modality shift, and naive fine-tuning often overfits to modality-specific appearance rather than underlying histological structure. We introduce SIMPLER (Structured Illumination Microscopy-Powered Learning for Embedding Representations), a cross-modality self-supervised pretraining framework that leverages H&E as a semantic anchor to learn reusable SIM representations. H&E encodes rich cellular and glandular structure aligned with established clinical annotations, while SIM provides rapid, nondestructive imaging of fresh tissue. During pretraining, SIM and H&E are progressively aligned through adversarial, contrastive, and reconstruction-based objectives, encouraging SIM embeddings to internalize histological structure from H&E without collapsing modality-specific characteristics. A single pretrained SIMPLER encoder transfers across multiple downstream tasks, including multiple instance learning and morphological clustering, consistently outperforming SIM models trained from scratch or H&E-only pretraining. These results suggest that histology-guided cross-modal pretraining yields biologically grounded SIM embeddings suitable for broad downstream reuse.
Authors: Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu
Abstract: Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
Authors: Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu
Abstract: This paper presents our solution to the NTIRE 2026 Image Denoising Challenge (Gaussian color image denoising at fixed noise level $\sigma = 50$). Rather than proposing a new restoration backbone, we revisit the performance boundary of the mature Restormer architecture from two complementary directions: stronger data-centric training and more complete Test-Time capability release. Starting from the public Restormer $\sigma\!=\!50$ baseline, we expand the standard multi-dataset training recipe with larger and more diverse public image corpora and organize optimization into two stages. At inference, we apply $\times 8$ geometric self-ensemble to further release model capacity. A TLC-style local inference wrapper is retained for implementation consistency; however, systematic ablation reveals its quantitative contribution to be negligible in this setting. On the challenge validation set of 100 images, our final submission achieves 30.762 dB PSNR and 0.861 SSIM, improving over the public Restormer $\sigma\!=\!50$ pretrained baseline by up to 3.366 dB PSNR. Ablation studies show that the dominant gain originates from the expanded training corpus and the two-stage optimization schedule, and self-ensemble provides marginal but consistent improvement.
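The $\times 8$ geometric self-ensemble mentioned here is the standard restoration-time augmentation: run the model on all eight flip/rotation variants of the input, invert each transform on the output, and average. A minimal sketch follows, assuming a shape-preserving model.

```python
import torch

def self_ensemble_x8(model, x):
    """Standard x8 geometric self-ensemble (generic restoration TTA, not a
    method-specific component): evaluate the model on 4 rotations x 2 flips
    of the input, undo each transform on the output, and average.

    x: (B, C, H, W) degraded image batch; model is assumed shape-preserving."""
    outputs = []
    for flip in (False, True):
        xf = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):                           # 0, 90, 180, 270 degrees
            y = model(torch.rot90(xf, k, dims=[-2, -1]))
            y = torch.rot90(y, -k, dims=[-2, -1])    # undo rotation
            if flip:
                y = torch.flip(y, dims=[-1])         # undo flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)

# Sanity check with an identity "model": the ensemble returns the input itself.
x = torch.rand(1, 3, 16, 16)
print(torch.allclose(self_ensemble_x8(lambda t: t, x), x))
```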
Authors: Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu
Abstract: Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ($\times 4$) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic $\times 4$ evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.
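The output-level fusion amounts to a per-pixel weighted average of the two branches, with the weight chosen at a validation operating point; the grid search and PSNR criterion in the sketch below are illustrative assumptions rather than the challenge entry's exact procedure.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def pick_fusion_weight(main_outs, aux_outs, refs, weights=np.linspace(0, 1, 21)):
    """Training-free output-level fusion: fused = w * main + (1 - w) * aux,
    with w selected by mean validation PSNR (assumed selection rule)."""
    best_w, best_psnr = None, -np.inf
    for w in weights:
        score = np.mean([psnr(w * m + (1 - w) * a, r)
                         for m, a, r in zip(main_outs, aux_outs, refs)])
        if score > best_psnr:
            best_w, best_psnr = w, score
    return best_w, best_psnr

# Toy validation set of 4 images in [0, 1].
rng = np.random.default_rng(0)
refs = [rng.random((32, 32, 3)) for _ in range(4)]
main_outs = [np.clip(r + rng.normal(0, 0.05, r.shape), 0, 1) for r in refs]
aux_outs = [np.clip(r + rng.normal(0, 0.08, r.shape), 0, 1) for r in refs]
print(pick_fusion_weight(main_outs, aux_outs, refs))
```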
Authors: Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen
Abstract: Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.
Authors: Rui Li, Bingyu Li, Yuanzhi Liang, Haibin Huang, Chi Zhang, XueLong Li
Abstract: Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing \textbf{preference alignment awareness} enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose \textbf{Reward-Aware Trajectory Shaping (RATS)}, a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a \textbf{reward-aware gate} is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency--quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
Authors: Hao Wang, Beichen Zhang, Yanpei Gong, Shaoyi Fang, Zhaobo Qi, Yuanrong Xu, Xinyan Liu, Weigang Zhang
Abstract: As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. Adaptive Decision Harmonizer harmonizes the classifiers by preserving angular relationships of semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.
Authors: Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen
Abstract: Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions, etc.) from the original video while the question remains unchanged. The student answers again with the added context, and the policy is updated with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Code will be available at https://jethrojames.github.io/FFR/.
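The group-normalised advantages used by GRPO are straightforward to state: each rollout's reward is centred and scaled by the statistics of its own group of rollouts for the same prompt. The sketch below shows only this normalisation; the Robust Improvement Reward and the chosen-rollout scheme described above would be computed upstream and passed in as rewards.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage normalisation (common GRPO formulation):
    only relative quality within a group of rollouts for the same prompt
    matters after centring and scaling by that group's statistics.

    rewards: (num_prompts, rollouts_per_prompt)"""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 rollouts each (e.g., with and without the
# teacher's evidence patch appended to the context).
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.2, 0.9, 0.1]])
print(grpo_advantages(rewards))
```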
Authors: Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Liwei Zhang, Weihao Yuan, Siyu Zhu
Abstract: Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ decoding throughput speedup compared to the source model. Code is available at https://github.com/fudan-generative-vision/Bard-VL.
Authors: Liyin Chen, Nazlee Zebardast, Mengyu Wang, Tobias Elze, Jason I. Comander
Abstract: Predicting disease progression from longitudinal imaging is useful for clinical decision making and trial design. Recent methods have moved toward increasing generative complexity, but the conditions under which this complexity is necessary remain unclear. We propose that generative complexity should match the entropy of the predictable component of a task's conditional posterior, with training-inference input alignment required in all regimes. Two model-light measurements, a task-entropy analysis on raw image pairs and a posterior-concentration analysis on a stochastic model, let practitioners assess the complexity a task warrants before committing to a modeling framework. We validated this framework on a fundus autofluorescence (FAF) dataset by contrasting five conditioning configurations, sharing one architecture and training set, spanning standard conditional diffusion, inference-aligned stochastic training, and deterministic regression. Training-inference alignment produced large gains (delta-SSIM +0.082, SSIM +0.086, both p < 0.001), while the choice among aligned frameworks produced no clinically meaningful difference across evaluated metrics. Across two FAF platforms, inter-visit change was dominated by time-invariant acquisition variability rather than disease progression, and the stochastic models' posteriors collapsed to an effective point, explaining the framework equivalence. We trained a deterministic Temporal Retinal U-Net (TRU) and evaluated it on 28,899 eyes across three manufacturers and two modalities (two FAF platforms and en-face SLO), with three independent cohorts evaluated zero-shot. TRU matched or exceeded three published baselines on delta-SSIM, SSIM, and PSNR. These findings show that when disease progression is slow compared with acquisition variability, a deterministic regression model matches or outperforms more complex stochastic alternatives.
Authors: Zepeng Sun, Naichuan Zheng, Hailun Xia, Junjie Wu, Liwei Bao, Xiaotai Zhang
Abstract: Temporal Action Detection (TAD) requires precise localization of action boundaries within long, untrimmed video sequences. While current high-performing methods achieve strong accuracy, they are often characterized by excessive parameter counts, substantial computational overhead, and a reliance on specialized operators that hinder deployment across diverse hardware platforms. This paper presents LiquidTAD, a framework that distills the exponential relaxation prior of liquid neural dynamics into a parallel temporal operator, rather than reproducing full Liquid Neural Network (LNN) dynamics. By introducing a Parallel Liquid-inspired Relaxation mechanism, sequential ODE solving is avoided through a fully vectorized, non-recursive formulation built entirely upon standard neural operations, enabling hardware-agnostic deployment with linear complexity with respect to the temporal length. A complementary Hierarchical Decay-Rate Sharing Strategy further adapts this relaxation prior across feature pyramid levels, stabilizing optimization and implicitly compensating for temporal compression in deeper layers. Experimental evaluations on THUMOS-14 and ActivityNet-1.3 demonstrate that LiquidTAD achieves accuracy competitive with strong baselines while substantially lowering the model footprint. Specifically, on THUMOS-14, LiquidTAD achieves 69.46\% average mAP with only 10.82M parameters and 27.17G FLOPs, reducing the parameter count by over 60\% compared with ActionFormer.
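The parallel relaxation operator itself is not spelled out in the abstract; the sketch below shows only the underlying exponential-relaxation prior in a vectorized, non-recursive form (a dense decay matrix, hence quadratic cost here, whereas the paper reports a linear-complexity formulation). Shapes and the per-level decay rates are illustrative assumptions.

```python
import numpy as np

def exp_relaxation(x: np.ndarray, decay: float) -> np.ndarray:
    """Vectorized exponential-relaxation smoothing over time.

    x: (T, C) feature sequence; decay in (0, 1) is the per-step retention.
    Equivalent to the recursion h_t = decay * h_{t-1} + (1 - decay) * x_t,
    written as one matrix product instead of a sequential loop.
    """
    T = x.shape[0]
    steps = np.arange(T)
    # lower-triangular kernel K[t, s] = (1 - decay) * decay**(t - s) for s <= t
    K = np.tril((1.0 - decay) * decay ** (steps[:, None] - steps[None, :]))
    return K @ x

# usage: per-pyramid-level decay rates (illustrative), deeper levels relax more slowly
x = np.random.default_rng(0).standard_normal((128, 16))
for level, decay in enumerate([0.5, 0.7, 0.9]):
    print(level, exp_relaxation(x, decay).shape)
```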
Authors: Hang Yuan, Xiaolin Hu, Yan Wan, Menglin Gao, Wenzhe Yu, Cong Huang, Fei Xu, Qing Li, Christina Dan Wang, Zhou Yu, Kai Chen
Abstract: Text-driven controllable dance generation remains under-explored, primarily due to the severe scarcity of high-quality datasets and the inherent difficulty of articulating complex choreographies. Characterizing dance is particularly challenging owing to its intricate spatial dynamics, strong directionality, and the highly decoupled movements of distinct body parts. To overcome these bottlenecks, we bridge principles from dance studies, human anatomy, and biomechanics to propose \textit{Choreographic Syntax}, a novel theoretical framework with a tailored annotation system. Grounded in this syntax, we combine professional dance archives with high-fidelity motion capture data to construct \textbf{DanceFlow}, the most fine-grained dance dataset to date. It encompasses 41 hours of high-quality motions paired with 6.34 million words of detailed descriptions. At the model level, we introduce \textbf{DanceCrafter}, a tailored motion transformer built upon the Momentum Human Rig. To circumvent optimization instabilities, we construct a continuous manifold motion representation paired with a hybrid normalization strategy. Furthermore, we design an anatomy-aware loss to explicitly regulate the decoupled nature of body parts. Together, these adaptations empower DanceCrafter to achieve the high-fidelity and stable generation of complex dance sequences. Extensive evaluations and user studies demonstrate our state-of-the-art performance in motion quality, fine-grained controllability, and generation naturalness.
Authors: Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, Boyang Li
Abstract: Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families: text-illegibility, time-reading, and object-absence, each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels: patterns that aggregate metrics obscure.
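H-Rate is defined as the proportion of responses that cross from grounded refusal into unsupported positive commitment; the benchmark's actual rules are not given in the abstract, so the sketch below uses an invented keyword list (`REFUSAL_MARKERS`) purely as a stand-in classifier.

```python
REFUSAL_MARKERS = (
    "cannot", "can't", "not visible", "illegible", "no clock",
    "unable to determine", "not present", "does not appear",
)

def is_grounded_refusal(response: str) -> bool:
    """Toy stand-in for the benchmark's rule set."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def h_rate(responses: list[str]) -> float:
    """Fraction of responses that commit to an answer for an absent target."""
    if not responses:
        return 0.0
    hallucinated = sum(not is_grounded_refusal(r) for r in responses)
    return hallucinated / len(responses)

# usage: same image, prompts of increasing directive force (levels are illustrative)
responses_per_level = {
    1: ["The text is illegible in this image."],
    5: ["The sign clearly says OPEN 24 HOURS."],
}
for level, responses in responses_per_level.items():
    print(level, h_rate(responses))
```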
Authors: Rui Li, Ke Hao, Yuanzhi Liang, Haibin Huang, Chi Zhang, Yun Gu, XueLong Li
Abstract: Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.
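The abstract names OTCA's two components without formulas; under the assumption that per-step credits and per-objective weights are available as arrays, a minimal sketch of turning multiple group-normalized rewards into a timestep-aware signal might look like this (shapes and normalization are assumptions):

```python
import numpy as np

def otca_advantages(rewards: np.ndarray, step_credit: np.ndarray,
                    objective_weights: np.ndarray) -> np.ndarray:
    """Combine multiple rewards into a timestep-aware training signal.

    rewards: (G, K) one scalar per rollout in the group and per objective
             (e.g., visual quality, motion consistency, text alignment).
    step_credit: (T,) non-negative importance of each denoising step, summing to 1.
    objective_weights: (T, K) per-step allocation over objectives, rows sum to 1.
    Returns advantages of shape (G, T): group-normalized, objective- and
    step-weighted credit, instead of one scalar broadcast over all steps.
    """
    # GRPO-style group normalization per objective
    norm = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)   # (G, K)
    per_step = norm @ objective_weights.T                                    # (G, T)
    return per_step * step_credit[None, :]

# toy usage: 4 rollouts, 3 objectives, 10 denoising steps
rng = np.random.default_rng(0)
adv = otca_advantages(rng.random((4, 3)),
                      np.full(10, 0.1),
                      rng.dirichlet(np.ones(3), size=10))
print(adv.shape)  # (4, 10)
```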
Authors: Mehdi Maboudi, Said Harb, Jackson Ferrao, Kourosh Khoshelham, Yelda Turkan, Karam Mawas
Abstract: Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR). With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: https://zenodo.org/records/17581812
Authors: Frederic Wang, Katherine L. Bouman
Abstract: While diffusion priors generate high-quality posterior samples across many inverse problems, they are often trained on limited training sets or purely simulated data, thus inheriting the errors and biases of these underlying sources. Current approaches to finetuning diffusion models rely on a large number of observations with varying forward operators, which can be difficult to collect for many applications, and thus lead to overfitting when the measurement set is small. We propose a method for tuning a prior from only a single observation by combining existing diffusion priors into a single product-of-experts prior and identifying the exponents that maximize the Bayesian evidence. We validate our method on real-world inverse problems, including black hole imaging, where the true prior is unknown a priori, and image deblurring with text-conditioned priors. We find that the evidence is often maximized by priors that extend beyond those trained on a single dataset. By generalizing the prior through exponent weighting, our approach enables posterior sampling from both tempered and combined diffusion models, yielding more flexible priors that improve the trustworthiness of the resulting posterior image distribution.
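The mechanics of combining priors follow from the product-of-experts form: if p(x) ∝ ∏_i p_i(x)^{γ_i}, then the combined score is the exponent-weighted sum of the individual scores. A minimal sketch (the evidence-maximization step that picks the exponents is not shown):

```python
import numpy as np

def combined_score(x: np.ndarray, expert_scores, exponents) -> np.ndarray:
    """Score function of a product-of-experts prior p(x) ∝ ∏_i p_i(x)^{γ_i}.

    Because log p(x) = Σ_i γ_i log p_i(x) + const, the combined score
    (∇_x log p) is the exponent-weighted sum of the experts' scores.
    expert_scores: list of callables x -> ∇_x log p_i(x).
    exponents: non-negative γ_i (the paper selects them by maximizing the
    Bayesian evidence of a single observation; that step is not shown).
    """
    return sum(g * s(x) for g, s in zip(exponents, expert_scores))

# toy usage: two Gaussian "priors" with different means; γ = (0.5, 0.5)
# interpolates between them, γ = (1, 0) recovers the first prior alone.
score_a = lambda x: -(x - 0.0)          # N(0, 1) score
score_b = lambda x: -(x - 3.0)          # N(3, 1) score
x = np.linspace(-2, 5, 5)
print(combined_score(x, [score_a, score_b], [0.5, 0.5]))
```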
Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
Abstract: Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
Abstract: Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
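A minimal sketch of the cached-retrieval idea, assuming per-sample embeddings and flattened gradients are already available; the similarity measure, per-class cap, and batch size below are placeholders rather than the paper's exact criteria:

```python
import numpy as np

class EmbeddingGradientCache:
    """Minimal sketch of retrieval-based test-time adaptation.

    Stores (embedding, flattened gradient, predicted class) per past test
    sample. For a new sample, retrieves neighbors by cosine similarity
    (a proxy for domain consistency) while capping samples per predicted
    class (a proxy for prediction balance), then aggregates their stored
    gradients, so no extra forward/backward passes are needed.
    """

    def __init__(self, per_class_cap: int = 4):
        self.embeddings, self.grads, self.preds = [], [], []
        self.per_class_cap = per_class_cap

    def add(self, emb: np.ndarray, grad: np.ndarray, pred: int) -> None:
        self.embeddings.append(emb / (np.linalg.norm(emb) + 1e-8))
        self.grads.append(grad)
        self.preds.append(pred)

    def aggregated_gradient(self, query_emb: np.ndarray, k: int = 16) -> np.ndarray:
        E = np.stack(self.embeddings)
        sims = E @ (query_emb / (np.linalg.norm(query_emb) + 1e-8))
        order = np.argsort(-sims)
        chosen, counts = [], {}
        for idx in order:                       # most similar first
            c = self.preds[idx]
            if counts.get(c, 0) < self.per_class_cap:
                chosen.append(idx)
                counts[c] = counts.get(c, 0) + 1
            if len(chosen) == k:
                break
        return np.mean([self.grads[i] for i in chosen], axis=0)

# toy usage
rng = np.random.default_rng(0)
cache = EmbeddingGradientCache()
for _ in range(32):
    cache.add(rng.standard_normal(8), rng.standard_normal(100), int(rng.integers(0, 4)))
print(cache.aggregated_gradient(rng.standard_normal(8)).shape)
```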
Authors: Laki Iinbor, Zhiyang Dou, Wojciech Matusik
Abstract: We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership. Such a formulation enables efficient rendering by maintaining a per-query top-K map that approximates nearest neighbors under the same shading score, allowing GPU-friendly, fixed-size local computation. We update this list using our top-K propagation scheme inspired by jump flooding, augmented with stochastic injection to provide probabilistic global coverage. Training follows a GPU-first pipeline with gradient-weighted initialization, Adam optimization, and adaptive budget control through densification and pruning. Across standard benchmarks, SAD consistently outperforms Image-GS and Instant-NGP at matched bitrate. On Kodak, SAD reaches 46.0 dB PSNR with 2.2 s encoding time (vs. 28 s for Image-GS), and delivers 4-19 times end-to-end training speedups over state-of-the-art baselines. We demonstrate the effectiveness of SAD by showcasing the seamless integration with differentiable pipelines for forward and inverse problems, efficiency of fast random access, and compact storage.
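A hedged sketch of the per-pixel shading rule: each site contributes an anisotropic, additively weighted distance score, and the pixel color is a temperature-controlled softmax blend over its top-K sites. The exact parameterization of the metric and offsets is assumed, and the top-K propagation/jump-flooding machinery is omitted:

```python
import numpy as np

def shade_pixel(p, centers, metrics, offsets, temps, colors, k=4):
    """Color one pixel as a softmax blend over its top-K sites.

    Per-site score uses an anisotropic, additively weighted distance in an
    Apollonius-diagram style: d_i = sqrt((p - c_i)^T M_i (p - c_i)) - w_i.
    The paper's exact parameterization may differ; this is a sketch.
    """
    diffs = p[None, :] - centers                                  # (N, 2)
    d = np.sqrt(np.einsum("ni,nij,nj->n", diffs, metrics, diffs)) - offsets
    top = np.argsort(d)[:k]                                       # smallest distance wins
    logits = -d[top] / temps[top]                                 # per-site temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ colors[top]

# toy usage: 8 isotropic sites in the unit square
rng = np.random.default_rng(0)
N = 8
centers = rng.random((N, 2))
metrics = np.repeat(np.eye(2)[None], N, axis=0)
print(shade_pixel(np.array([0.5, 0.5]), centers, metrics,
                  np.zeros(N), np.full(N, 0.1), rng.random((N, 3))))
```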
Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
Abstract: Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
Authors: Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang
Abstract: Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., \textit{on}, \textit{standing on}, \textit{resting on}, and \textit{supported by}. This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose \textbf{ReLIC-SGG}, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
Authors: Teruyuki Katsuoka, Tomohiro Shiraishi, Daiki Miwa, Vo Nguyen Le Duy, Ichiro Takeuchi
Abstract: Anomaly localization in images -- identifying regions that deviate from normal patterns -- is vital in applications such as medical diagnosis and industrial inspection. A recent trend is the use of image generation models in anomaly localization, where these models generate normal-looking counterparts of anomalous images, thereby allowing flexible and adaptive anomaly localization. However, these methods inherit the uncertainty and bias implicitly embedded in the employed generative model, raising concerns about the reliability. To address this, we propose a statistical framework based on selective inference to quantify the significance of detected anomalous regions. Our method provides $p$-values to assess the false positive detection rates, providing a principled measure of reliability. As a proof of concept, we consider anomaly localization using a diffusion model and its applications to medical diagnoses and industrial inspections. The results indicate that the proposed method effectively controls the risk of false positive detection, supporting its use in high-stakes decision-making tasks.
Authors: Mohammad Tariqul Islam, Jason W. Fleischer
Abstract: Uniform manifold approximation and projection (UMAP) is among the most popular neighbor embedding methods. The method samples pairs of point indices according to similarities in the high-dimensional space, and applies attractive and repulsive forces to their coordinates in the low-dimensional embedding. In this paper, we analyze the forces to reveal their effects on cluster formations and visualization, and compare UMAP to its contemporaries. Repulsion emphasizes differences, controlling cluster boundaries and inter-cluster distance. Attraction is more subtle, as attractive tension between points can manifest simultaneously as attraction and repulsion in the lower-dimensional mapping. This explains the need for learning rate annealing and motivates the different treatments between attractive and repulsive terms. Moreover, by modifying attraction, we improve the consistency of cluster formation under random initialization. Overall, our analysis provides a mechanistic understanding of UMAP and related embedding methods.
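For reference, the per-edge updates analyzed here take the following form in standard UMAP, with low-dimensional similarity w(d) = 1/(1 + a d^{2b}); the clipping and sampling details of the reference implementation are omitted, and the a, b values are approximate defaults for min_dist = 0.1:

```python
import numpy as np

def attractive_update(yi, yj, a=1.577, b=0.895):
    """Displacement applied to y_i for a sampled neighbor edge (i, j).

    The coefficient is negative, so the displacement pulls y_i toward y_j.
    """
    d2 = float(np.sum((yi - yj) ** 2))
    coeff = -2.0 * a * b * d2 ** (b - 1.0) / (1.0 + a * d2 ** b)
    return coeff * (yi - yj)

def repulsive_update(yi, yj, a=1.577, b=0.895, eps=1e-3):
    """Displacement for a negative sample j: pushes y_i away from y_j."""
    d2 = float(np.sum((yi - yj) ** 2))
    coeff = 2.0 * b / ((eps + d2) * (1.0 + a * d2 ** b))
    return coeff * (yi - yj)

# one SGD-style step on a point with one neighbor and one negative sample
yi, yj, yk = np.array([0.0, 0.0]), np.array([0.4, 0.1]), np.array([0.2, -0.3])
lr = 1.0
yi = yi + lr * (attractive_update(yi, yj) + repulsive_update(yi, yk))
print(yi)
```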
Authors: Yuxuan Zhang, Jinkui Hao, Bo Zhou
Abstract: Magnetic resonance imaging (MRI) is a vital diagnostic tool, but its inherently long acquisition times reduce clinical efficiency and patient comfort. Recent advancements in deep learning, particularly diffusion models, have improved accelerated MRI reconstruction. However, existing diffusion models often rely on fully sampled data for training, incur high computational costs, and lack uncertainty estimation, limiting their clinical applicability. To overcome these challenges, we propose a novel framework, called Dual-domain Multi-path Self-supervised Diffusion Model (DMSM), that integrates a self-supervised dual-domain diffusion model training scheme, a lightweight hybrid attention network for the reconstruction diffusion model, and a multi-path inference strategy, to enhance reconstruction accuracy, efficiency, and explainability. Unlike traditional diffusion-based models, DMSM eliminates the dependency on training from fully sampled data, making it more practical for real-world clinical settings. We evaluated DMSM on two human MRI datasets, demonstrating that it achieves favorable performance over several supervised and self-supervised baselines, particularly in preserving fine anatomical structures and suppressing artifacts under high acceleration factors. Additionally, our model generates uncertainty maps that correlate reasonably well with reconstruction errors, offering valuable clinically interpretable guidance and potentially enhancing diagnostic confidence.
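The multi-path inference strategy is only named in the abstract; a generic sketch, assuming some stochastic reconstruction sampler, is to run several reverse-diffusion paths and report their mean as the reconstruction and their per-pixel spread as an uncertainty map. The `sample_fn` interface below is an assumption, not the paper's API.

```python
import numpy as np

def multi_path_reconstruction(sample_fn, kspace, mask, n_paths=8, seed=0):
    """Run several stochastic reverse-diffusion paths and summarize them.

    sample_fn(kspace, mask, rng) -> image is any stochastic reconstruction
    sampler (a stand-in for the paper's diffusion model). Returns the mean
    image as the reconstruction and the per-pixel standard deviation across
    paths as a simple uncertainty map.
    """
    rngs = [np.random.default_rng(seed + i) for i in range(n_paths)]
    recons = np.stack([sample_fn(kspace, mask, rng) for rng in rngs])
    return recons.mean(axis=0), recons.std(axis=0)

# toy usage with a dummy sampler (noise added to the zero-filled reconstruction)
def dummy_sampler(kspace, mask, rng):
    zero_filled = np.abs(np.fft.ifft2(kspace * mask))
    return zero_filled + 0.01 * rng.standard_normal(zero_filled.shape)

ks = np.fft.fft2(np.random.default_rng(0).random((32, 32)))
m = (np.random.default_rng(1).random((32, 32)) < 0.3).astype(float)
recon, uncert = multi_path_reconstruction(dummy_sampler, ks, m)
print(recon.shape, float(uncert.mean()))
```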
Authors: Zhirui Dai, Hojoon Shin, Yulun Tian, Ki Myung Brian Lee, Nikolay Atanasov
Abstract: Dense reconstruction and differentiable rendering are fundamental, tightly connected operations in 3D vision and computer graphics. Recent neural implicit representations demonstrate compelling advantages in reconstruction fidelity and differentiability over conventional discrete representations such as meshes, point clouds, and voxels. However, many neural implicit models, such as neural radiance fields (NeRF) and signed distance function (SDF) networks, are inefficient in rendering due to the need to perform multiple queries along each camera ray. Moreover, NeRF and Gaussian Splatting methods offer impressive photometric reconstruction but often require careful supervision to achieve accurate geometric reconstruction. To address these challenges, we propose a novel representation called signed directional distance function (SDDF). Unlike SDF and similar to NeRF, SDDF takes a position and viewing direction as input. Like SDF and unlike NeRF, SDDF directly provides distance to the observed surface rather than integrating along the view ray. As a result, SDDF achieves accurate geometric reconstruction and efficient differentiable directional distance prediction. To learn and predict scene-level SDDF efficiently, we develop a differentiable hybrid representation that combines explicit ellipsoid priors and implicit neural residuals. This allows the model to handle distance discontinuities around obstacle boundaries effectively while preserving the ability for dense high-fidelity distance prediction. Through extensive evaluation against state-of-the-art representations, we show that SDDF achieves (i) competitive SDDF prediction accuracy, (ii) faster prediction speed than SDF and NeRF, and (iii) superior geometric consistency compared to NeRF and Gaussian Splatting.
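The efficiency argument can be made concrete: an SDDF answers depth with a single query per ray, whereas sphere tracing an SDF takes many queries. A toy sketch with an analytic sphere (the learned ellipsoid-plus-residual model is not reproduced here):

```python
import numpy as np

def render_depth_sddf(sddf, origin, directions):
    """One query per ray: depth = f(origin, direction)."""
    return np.array([sddf(origin, d) for d in directions])

def render_depth_sdf(sdf, origin, directions, max_steps=64, eps=1e-3):
    """Sphere tracing with an SDF needs many queries per ray."""
    depths = []
    for d in directions:
        t = 0.0
        for _ in range(max_steps):
            dist = sdf(origin + t * d)
            if dist < eps:
                break
            t += dist
        depths.append(t)
    return np.array(depths)

# toy scene: unit sphere at the origin, camera at z = -3
sdf = lambda p: np.linalg.norm(p) - 1.0

def sddf(o, d):
    """Analytic directional distance to the unit sphere along unit ray d."""
    b = np.dot(o, d)
    disc = b * b - (np.dot(o, o) - 1.0)
    return -b - np.sqrt(disc) if disc >= 0 else np.inf

origin = np.array([0.0, 0.0, -3.0])
dirs = [np.array([0.0, 0.0, 1.0]),
        np.array([0.2, 0.0, 1.0]) / np.linalg.norm([0.2, 0.0, 1.0])]
print(render_depth_sddf(sddf, origin, dirs))
print(render_depth_sdf(sdf, origin, dirs))
```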
Authors: Daniel Wang, Patrick Rim, Tian Tian, Dong Lao, Alex Wong, Ganesh Sundaramoorthi
Abstract: We introduce ODE-GS, a novel approach that integrates 3D Gaussian Splatting with latent neural ordinary differential equations (ODEs) to enable future extrapolation of dynamic 3D scenes. Unlike existing dynamic scene reconstruction methods, which rely on time-conditioned deformation networks and are limited to interpolation within a fixed time window, ODE-GS eliminates timestamp dependency by modeling Gaussian parameter trajectories as continuous-time latent dynamics. Our approach first learns an interpolation model to generate accurate Gaussian trajectories within the observed window, then trains a Transformer encoder to aggregate past trajectories into a latent state evolved via a neural ODE. Finally, numerical integration produces smooth, physically plausible future Gaussian trajectories, enabling rendering at arbitrary future timestamps. On the D-NeRF, NVFi, and HyperNeRF benchmarks, ODE-GS achieves state-of-the-art extrapolation performance, improving metrics by 19.8% compared to leading baselines, demonstrating its ability to accurately represent and predict 3D scene dynamics.
Authors: Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma
Abstract: Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance reward function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset is released on https://github.com/hq-King/Affordance-R1.
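The abstract names format, perception, and cognition rewards without giving their definitions; a hedged sketch of one plausible composite is below, where the `<think>`/`<answer>` template check, the IoU-based perception term, and the weights are all illustrative assumptions rather than the paper's reward.

```python
import re
import numpy as np

def composite_reward(response: str, pred_mask: np.ndarray, gt_mask: np.ndarray,
                     answer_matches_gt: bool,
                     w_format=0.2, w_perception=0.5, w_cognition=0.3) -> float:
    """Illustrative composite of format, perception, and cognition rewards.

    Format checks a reasoning/answer template, perception scores mask IoU,
    and cognition scores whether the stated answer matches the ground-truth
    affordance; all definitions and weights here are assumptions.
    """
    format_ok = 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                                 response, flags=re.S) else 0.0
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    iou = inter / union if union > 0 else 0.0
    cognition = 1.0 if answer_matches_gt else 0.0
    return w_format * format_ok + w_perception * iou + w_cognition * cognition

# toy usage
pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), bool); gt[3:7, 3:7] = True
print(composite_reward("<think>grip the handle</think><answer>grasp</answer>",
                       pred, gt, answer_matches_gt=True))
```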
Authors: Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang
Abstract: Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundamental issue faced by OLLMs: \textit{Visual Interference}, where models show a bias towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which aims to reshape models' inference process to follow the human-like ``Look-then-Listen'' inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a
Authors: Mir Nafis Sharear Shopnil, Sharad Duwal, Abhishek Tyagi, Adiba Mahbub Proma
Abstract: We present MERIT, an inference-time modular framework for multimodal misinformation detection that decomposes verification into four specialized modules: visual forensics, cross-modal alignment, retrieval-augmented claim verification, and calibrated judgment. On MMFakeBench, MERIT with GPT-4o-mini achieves 81.65% F1, outperforming all reported zero-shot baselines including GPT-4V with MMD-Agent (74.0% F1). A controlled same-model evaluation confirms gains stem from architectural design: MERIT achieves 6.14 points higher misinformation recall than MMD-Agent under identical model conditions, with per-class gains of +18.0 on visual distortion and +5.33 on textual distortion. Ablation studies reveal non-overlapping module specialization, where removing any module disproportionately degrades its target category while leaving others intact. Test set evaluation on 5,000 samples confirms generalization within 0.21 F1 points of validation results. The framework operates with any instruction-following vision-language model and produces citation-linked rationales for human review.
Authors: Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei
Abstract: Most post-training methods for text-to-image samplers focus on model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James-Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior performance. Our rescheduled samplers consistently improve text-image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families. Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.
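How the James-Stein baseline enters the policy gradient is not detailed in the abstract; for reference, the classic positive-part James-Stein estimator, which shrinks a set of noisy estimates toward their grand mean and provably lowers squared estimation error, looks like this (its use as a reward baseline below is an assumption):

```python
import numpy as np

def james_stein_shrink(x: np.ndarray, sigma2: float) -> np.ndarray:
    """Positive-part James-Stein estimator, shrinking toward the grand mean.

    x: k noisy estimates (k >= 4) of underlying means, each with variance
    sigma2. Dominates using x itself under squared error, which is the
    intuition behind a James-Stein style reward baseline reducing gradient
    estimation error; the paper's exact construction may differ.
    """
    k = x.size
    grand_mean = x.mean()
    resid = x - grand_mean
    shrink = max(0.0, 1.0 - (k - 3) * sigma2 / float(resid @ resid))
    return grand_mean + shrink * resid

# toy usage: noisy per-prompt reward estimates get pulled toward their mean
rng = np.random.default_rng(0)
true_means = rng.normal(0.5, 0.05, size=16)
noisy = true_means + rng.normal(0.0, 0.2, size=16)
shrunk = james_stein_shrink(noisy, sigma2=0.2 ** 2)
print(np.mean((noisy - true_means) ** 2), np.mean((shrunk - true_means) ** 2))
```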
Authors: Yuxuan Mu, Ziyu Zhang, Yi Shi, Dun Yang, Minami Matsumoto, Kotaro Imamura, Guy Tevet, Chuan Guo, Michael Taylor, Chang Shu, Pengcheng Xi, Xue Bin Peng
Abstract: Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when applied to downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train new policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore, SMP can compose different styles to synthesize new styles not present in the original dataset. Our method can create reusable and modular motion priors that produce high-quality motions comparable to state-of-the-art adversarial imitation learning methods. In our experiments, we demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video available at https://youtu.be/jBA2tWk6vzU
Authors: Md Assaduzzaman, Nushrat Jahan Oyshi, Eram Mahamud
Abstract: The accurate classification of gastrointestinal diseases from endoscopic and histopathological imagery remains a significant challenge in medical diagnostics, mainly due to the vast data volume and subtle variation in inter-class visuals. This study presents a hybrid dual-stream deep learning framework built on teacher-student knowledge distillation, where a high-capacity teacher model integrates the global contextual reasoning of a Swin Transformer with the local fine-grained feature extraction of a Vision Transformer. The student network was implemented as a compact Tiny-ViT structure that inherits the teacher's semantic and morphological knowledge via soft-label distillation, achieving a balance between efficiency and diagnostic accuracy. Two carefully curated Wireless Capsule Endoscopy datasets, encompassing major GI disease classes, were employed to ensure balanced representation and prevent inter-sample bias. The proposed framework achieved remarkable performance with accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2 respectively, and an average AUC of 1.0000, signifying near-perfect discriminative capability. Interpretability analyses using Grad-CAM, LIME, and Score-CAM confirmed that the model's predictions were grounded in clinically significant tissue regions and pathologically relevant morphological cues, validating the framework's transparency and reliability. The Tiny-ViT demonstrated diagnostic performance comparable to its transformer-based teacher at reduced computational complexity while delivering faster inference, making it suitable for resource-constrained clinical environments. Overall, the proposed framework provides a robust, interpretable, and scalable solution for AI-assisted GI disease diagnosis, paving the way toward future intelligent endoscopic screening that is compatible with clinical practicality.
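Soft-label distillation of the kind described is standard; a minimal sketch with placeholder temperature and mixing weight (the paper's exact loss and hyperparameters are not stated in the abstract):

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, labels,
                                 temperature=4.0, alpha=0.7):
    """Standard soft-label knowledge distillation.

    Combines KL divergence between temperature-softened teacher and student
    distributions with the usual cross-entropy on hard labels; temperature
    and alpha here are placeholders, not the paper's settings.
    """
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# toy usage: 8 images, 4 GI disease classes
student = torch.randn(8, 4, requires_grad=True)
teacher = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
print(soft_label_distillation_loss(student, teacher, labels).item())
```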
Authors: Junha Lee, Eunha Park, Minsu Cho
Abstract: Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation, bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83 p.p. with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
Authors: Jiahao Zhang, Shaofei Huang, Yaxiong Wang, Zhedong Zheng
Abstract: Text-based person search faces inherent limitations due to data scarcity, driven by stringent privacy constraints and the high cost of manual annotation. To mitigate this, existing methods usually rely on a Pretrain-then-Finetune paradigm, where models are first pretrained on synthetic person-caption data to establish cross-modal alignment, followed by fine-tuning on labeled real-world datasets. However, this paradigm lacks practicality in real-world deployment scenarios, where large-scale annotated target-domain data is typically inaccessible. In this work, we propose a new Pretrain-then-Adapt paradigm that eliminates reliance on extensive target-domain supervision through an offline test-time adaptation manner, enabling dynamic model adaptation using only unlabeled test data with minimal post-train time cost. To mitigate overconfidence with false positives of previous entropy-based test-time adaptation, we propose an Uncertainty-Aware Test-Time Adaptation (UATTA) framework, which introduces a bidirectional retrieval disagreement mechanism to estimate uncertainty, i.e., low uncertainty is assigned when an image-text pair ranks highly in both image-to-text and text-to-image retrieval, indicating high alignment; otherwise, high uncertainty is detected. This indicator drives offline test-time model recalibration without labels, effectively mitigating domain shift. We validate UATTA on four benchmarks, i.e., CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB, showing consistent improvements across both CLIP-based (one-stage) and XVLM-based (two-stage) frameworks. Ablation studies confirm that UATTA outperforms existing offline test-time adaptation strategies, establishing a new benchmark for label-efficient, deployable person search systems. Our code is available at https://github.com/nkuzjh/UATTA.
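A minimal sketch of the bidirectional retrieval disagreement signal: an image-text pair counts as low-uncertainty only if it ranks highly in both image-to-text and text-to-image retrieval. The rank averaging below is an assumption about how the two directions are combined:

```python
import numpy as np

def bidirectional_uncertainty(sim: np.ndarray, i: int, j: int) -> float:
    """Uncertainty for image i / text j from bidirectional retrieval ranks.

    sim: (N_images, N_texts) similarity matrix. The pair is confident only
    if the text ranks highly when retrieving from the image AND the image
    ranks highly when retrieving from the text; otherwise uncertainty is
    high. Ranks are averaged and normalized to [0, 1]; the paper's exact
    scoring may differ.
    """
    i2t_rank = int((sim[i] > sim[i, j]).sum())        # 0 = best
    t2i_rank = int((sim[:, j] > sim[i, j]).sum())
    n_images, n_texts = sim.shape
    return 0.5 * (i2t_rank / (n_texts - 1) + t2i_rank / (n_images - 1))

# toy usage: a well-aligned pair (0, 0) vs. a likely misaligned pair (1, 2)
rng = np.random.default_rng(0)
sim = rng.random((5, 5))
sim[0, 0] = 2.0      # pair (0, 0) dominates both its row and its column
print(bidirectional_uncertainty(sim, 0, 0), bidirectional_uncertainty(sim, 1, 2))
```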
Authors: Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lyu, Wei Xue, Yike Guo
Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.
Authors: Jiahao Ma, Qiang Zhang, Peiran Liu, Zeran Su, Pihai Sun, Gang Han, Wen Zhao, Wei Cui, Zhang Zhang, Zhiyuan Xu, Renjing Xu, Jian Tang, Miaomiao Liu, Yijie Guo
Abstract: Surround-view perception is increasingly important for robotic navigation and loco-manipulation, especially in human-in-the-loop settings such as teleoperation, data collection, and emergency takeover. However, current robotic visual interfaces are often limited to narrow forward-facing views, or, when multiple on-board cameras are available, require cumbersome manual switching that interrupts the operator's workflow. Both configurations suffer from motion-induced jitter that causes simulator sickness in head-mounted displays. We introduce a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360$^\circ$ visual coverage, while meeting the geometric and real-time constraints of embodied deployment. We further present \textsc{RobotPan}, a feed-forward framework that predicts \emph{metric-scaled} and \emph{compact} 3D Gaussians from calibrated sparse-view inputs for real-time rendering, reconstruction, and streaming. \textsc{RobotPan} lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii to reduce computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance. Finally, we release a multi-sensor dataset tailored to 360$^\circ$ novel view synthesis and metric 3D reconstruction for robotics, covering navigation, manipulation, and locomotion on real platforms. Experiments show that \textsc{RobotPan} achieves competitive quality against prior feed-forward reconstruction and view-synthesis methods while producing substantially fewer Gaussians, enabling practical real-time embodied deployment.
Authors: Danae S\'anchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott
Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
Authors: Vishal Rajput
Abstract: PGD adversarial training, the standard robustness method, can reduce Jacobian Frobenius norm yet worsen clean-input geometry (e.g., TDI 1.336 vs. ERM 1.093). We show this is not an implementation artifact but a theorem-level consequence of supervised learning. We prove that any encoder minimizing supervised loss must retain non-zero sensitivity along directions correlated with training labels, including directions that are nuisance at test time. This holds across proper scoring rules, architectures, and dataset sizes. We call this the geometric blind spot of supervised learning. This theorem unifies four empirical phenomena often treated separately: non-robust features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. It also explains why suppressing sensitivity in one adversarial direction can redistribute sensitivity elsewhere. We introduce Trajectory Deviation Index (TDI), a diagnostic of geometric isotropy. Unlike CKA, intrinsic dimension, or Jacobian Frobenius norm alone, TDI captures the failure mode above. In our experiments, PGD attains low Frobenius norm but high TDI, while PMH attains the lowest TDI with one additional training term and no architectural changes. Across seven tasks, BERT/SST-2, and ImageNet ViT-B/16 (backbone family underlying CLIP/DINO/SAM), the blind spot is measurable and repairable. It appears at foundation-model scale, worsens with model scale and task-specific fine-tuning, and is substantially reduced by PMH. PMH also leads on non-Gaussian corruption types (blur/brightness/contrast) without corruption-specific training.
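The Jacobian Frobenius norm referenced throughout can be estimated without forming the Jacobian, via random vector-Jacobian products; the sketch below shows that estimator (TDI itself is not reconstructed here, since its definition is not given in the abstract):

```python
import torch

def jacobian_frobenius_sq(model, x, n_probes=8):
    """Unbiased estimate of ||J||_F^2 via random vector-Jacobian products.

    For v ~ N(0, I), E[||v^T J||^2] = ||J||_F^2, where J = d model(x) / d x.
    This measures only average sensitivity; the abstract's point is that a
    small Frobenius norm can coexist with anisotropic sensitivity that a
    direction-aware diagnostic (like their TDI) is meant to expose.
    """
    x = x.clone().requires_grad_(True)
    y = model(x)
    total = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(y)
        (g,) = torch.autograd.grad((y * v).sum(), x, retain_graph=True)
        total += g.pow(2).sum().item()
    return total / n_probes

# toy usage: linear map with known Jacobian; the estimate is unbiased,
# so with few probes it only roughly matches the exact value
torch.manual_seed(0)
W = torch.randn(6, 10)
model = lambda x: x @ W.t()
x = torch.randn(1, 10)
print(jacobian_frobenius_sq(model, x), W.pow(2).sum().item())
```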