Authors: Sebastian Dille, Chris Careaga, Ya\u{g}{\i}z Aksoy
Abstract: The low dynamic range (LDR) of common cameras fails to capture the rich contrast in natural scenes, resulting in loss of color and details in saturated pixels. Reconstructing the high dynamic range (HDR) of luminance present in the scene from single LDR photographs is an important task with many applications in computational photography and realistic display of images. The HDR reconstruction task aims to infer the lost details using the context present in the scene, requiring neural networks to understand high-level geometric and illumination cues. This makes it challenging for data-driven algorithms to generate accurate and high-resolution results. In this work, we introduce a physically-inspired remodeling of the HDR reconstruction problem in the intrinsic domain. The intrinsic model allows us to train separate networks to extend the dynamic range in the shading domain and to recover lost color details in the albedo domain. We show that dividing the problem into two simpler sub-tasks improves performance in a wide variety of photographs.
Authors: Shihua Sun, Kenechukwu Nwodo, Shridatt Sugrim, Angelos Stavrou, Haining Wang
Abstract: The use of transformers for vision tasks has challenged the traditional dominant role of convolutional neural networks (CNN) in computer vision (CV). For image classification tasks, Vision Transformer (ViT) effectively establishes spatial relationships between patches within images, directing attention to important areas for accurate predictions. However, similar to CNNs, ViTs are vulnerable to adversarial attacks, which mislead the image classifier into making incorrect decisions on images with carefully designed perturbations. Moreover, adversarial patch attacks, which introduce arbitrary perturbations within a small area, pose a more serious threat to ViTs. Even worse, traditional detection methods, originally designed for CNN models, are impractical or suffer significant performance degradation when applied to ViTs, and they generally overlook patch attacks. In this paper, we propose ViTGuard as a general detection method for defending ViT models against adversarial attacks, including typical attacks where perturbations spread over the entire input and patch attacks. ViTGuard uses a Masked Autoencoder (MAE) model to recover randomly masked patches from the unmasked regions, providing a flexible image reconstruction strategy. Then, threshold-based detectors leverage distinctive ViT features, including attention maps and classification (CLS) token representations, to distinguish between normal and adversarial samples. The MAE model does not involve any adversarial samples during training, ensuring the effectiveness of our detectors against unseen attacks. ViTGuard is compared with seven existing detection methods under nine attacks across three datasets. The evaluation results show the superiority of ViTGuard over existing detectors. Finally, considering the potential detection evasion, we further demonstrate ViTGuard's robustness against adaptive attacks for evasion.
Authors: Zhiyuan Li, Tianyuan Yao, Praitayini Kanakaraj, Chenyu Gao, Shunxing Bao, Lianrui Zuo, Michael E. Kim, Nancy R. Newlin, Gaurav Rudravaram, Nazirah M. Khairi, Yuankai Huo, Kurt G. Schilling, Walter A. Kukull, Arthur W. Toga, Derek B. Archer, Timothy J. Hohman, Bennett A. Landman
Abstract: An incomplete field-of-view (FOV) in diffusion magnetic resonance imaging (dMRI) can severely hinder the volumetric and bundle analyses of whole-brain white matter connectivity. Although existing works have investigated imputing the missing regions using deep generative models, it remains unclear how to specifically utilize additional information from paired multi-modality data and whether this can enhance the imputation quality and be useful for downstream tractography. To fill this gap, we propose a novel framework for imputing dMRI scans in the incomplete part of the FOV by integrating the learned diffusion features in the acquired part of the FOV to the complete brain anatomical structure. We hypothesize that by this design the proposed framework can enhance the imputation performance of the dMRI scans and therefore be useful for repairing whole-brain tractography in corrupted dMRI scans with incomplete FOV. We tested our framework on two cohorts from different sites with a total of 96 subjects and compared it with a baseline imputation method that treats the information from T1w and dMRI scans equally. The proposed framework achieved significant improvements in imputation performance, as demonstrated by angular correlation coefficient (p < 1E-5), and in downstream tractography accuracy, as demonstrated by Dice score (p < 0.01). Results suggest that the proposed framework improved imputation performance in dMRI scans by specifically utilizing additional information from paired multi-modality data, compared with the baseline method. The imputation achieved by the proposed framework enhances whole brain tractography, and therefore reduces the uncertainty when analyzing bundles associated with neurodegenerative.
Authors: Maying Shen, Nadine Chang, Sifei Liu, Jose M. Alvarez
Abstract: In recent years, the data collected for artificial intelligence has grown to an unmanageable amount. Particularly within industrial applications, such as autonomous vehicles, model training computation budgets are being exceeded while model performance is saturating -- and yet more data continues to pour in. To navigate the flood of data, we propose a framework to select the most semantically diverse and important dataset portion. Then, we further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool. Importantly, we can provide explainability by leveraging foundation models to generate semantics for every data point. We quantitatively show that our Semantic Selection and Enrichment framework (SSE) can a) successfully maintain model performance with a smaller training dataset and b) improve model performance by enriching the smaller dataset without exceeding the original dataset size. Consequently, we demonstrate that semantic diversity is imperative for optimal data selection and model performance.
Authors: Yixin Wang, Wei Peng, Yu Zhang, Ehsan Adeli, Qingyu Zhao, Kilian M. Pohl
Abstract: Many longitudinal neuroimaging studies aim to improve the understanding of brain aging and diseases by studying the dynamic interactions between brain function and cognition. Doing so requires accurate encoding of their multidimensional relationship while accounting for individual variability over time. For this purpose, we propose an unsupervised learning model (called \underline{\textbf{Co}}ntrastive Learning-based \underline{\textbf{Gra}}ph Generalized \underline{\textbf{Ca}}nonical Correlation Analysis (CoGraCa)) that encodes their relationship via Graph Attention Networks and generalized Canonical Correlational Analysis. To create brain-cognition fingerprints reflecting unique neural and cognitive phenotype of each person, the model also relies on individualized and multimodal contrastive learning. We apply CoGraCa to longitudinal dataset of healthy individuals consisting of resting-state functional MRI and cognitive measures acquired at multiple visits for each participant. The generated fingerprints effectively capture significant individual differences and outperform current single-modal and CCA-based multimodal models in identifying sex and age. More importantly, our encoding provides interpretable interactions between those two modalities.
Authors: Jiale Wei, Junwei Zheng, Ruiping Liu, Jie Hu, Jiaming Zhang, Rainer Stiefelhagen
Abstract: In the field of autonomous driving, Bird's-Eye-View (BEV) perception has attracted increasing attention in the community since it provides more comprehensive information compared with pinhole front-view images and panoramas. Traditional BEV methods, which rely on multiple narrow-field cameras and complex pose estimations, often face calibration and synchronization issues. To break the wall of the aforementioned challenges, in this work, we introduce OneBEV, a novel BEV semantic mapping approach using merely a single panoramic image as input, simplifying the mapping process and reducing computational complexities. A distortion-aware module termed Mamba View Transformation (MVT) is specifically designed to handle the spatial distortions in panoramas, transforming front-view features into BEV features without leveraging traditional attention mechanisms. Apart from the efficient framework, we contribute two datasets, i.e., nuScenes-360 and DeepAccident-360, tailored for the OneBEV task. Experimental results showcase that OneBEV achieves state-of-the-art performance with 51.1% and 36.1% mIoU on nuScenes-360 and DeepAccident-360, respectively. This work advances BEV semantic mapping in autonomous driving, paving the way for more advanced and reliable autonomous systems.
Authors: Steven Grosz, Rui Zhao, Rajeev Ranjan, Hongcheng Wang, Manoj Aggarwal, Gerard Medioni, Anil Jain
Abstract: This paper improves upon existing data pruning methods for image classification by introducing a novel pruning metric and pruning procedure based on importance sampling. The proposed pruning metric explicitly accounts for data separability, data integrity, and model uncertainty, while the sampling procedure is adaptive to the pruning ratio and considers both intra-class and inter-class separation to further enhance the effectiveness of pruning. Furthermore, the sampling method can readily be applied to other pruning metrics to improve their performance. Overall, the proposed approach scales well to high pruning ratio and generalizes better across different classification models, as demonstrated by experiments on four benchmark datasets, including the fine-grained classification scenario.
Authors: Kevin Li, Fulu Li
Abstract: We use images of cars of a wide range of varieties to compose an image of an animal such as a bird or a lion for the theme of environmental protection to maximize the information about cars in a single composed image and to raise the awareness about environmental challenges. We present a novel way of image interaction with an artistically-composed photomosaic image, in which a simple operation of "click and display" is used to demonstrate the interactive switch between a tile image in a photomosaic image and the corresponding original car image, which will be automatically saved on the Desktop. We build a multimodal custom GPT named TalkMosaic by incorporating car images information and the related knowledge to ChatGPT. By uploading the original car image to TalkMosaic, we can ask questions about the given car image and get the corresponding answers efficiently and effectively such as where to buy the tire in the car image that satisfies high environmental standards. We give an in-depth analysis on how to speed up the inference of multimodal LLM using sparse attention and quantization techniques with presented probabilistic FlashAttention (PrFlashAttention) and Staircase Adaptive Quantization (SAQ) methods. The implemented prototype demonstrates the feasibility and effectiveness of the presented approach.
Authors: Kundan Chaudhary, Subhei Shaar, Raja Muthinti
Abstract: Quantitative analysis of microscopy images is essential in the design and fabrication of components used in augmented reality/virtual reality (AR/VR) modules. However, segmenting regions of interest (ROIs) from these complex images and extracting critical dimensions (CDs) requires novel techniques, such as deep learning models which are key for actionable decisions on process, material and device optimization. In this study, we report on the fine-tuning of a pre-trained Segment Anything Model (SAM) using a diverse dataset of electron microscopy images. We employed methods such as low-rank adaptation (LoRA) to reduce training time and enhance the accuracy of ROI extraction. The model's ability to generalize to unseen images facilitates zero-shot learning and supports a CD extraction model that precisely extracts CDs from the segmented ROIs. We demonstrate the accurate extraction of binary images from cross-sectional images of surface relief gratings (SRGs) and Fresnel lenses in both single and multiclass modes. Furthermore, these binary images are used to identify transition points, aiding in the extraction of relevant CDs. The combined use of the fine-tuned segmentation model and the CD extraction model offers substantial advantages to various industrial applications by enhancing analytical capabilities, time to data and insights, and optimizing manufacturing processes.
Authors: Kai Tang, Xiaolei Lang, Yukai Ma, Yuehao Huang, Laijian Li, Yong Liu, Jiajun Lv
Abstract: Event cameras have garnered considerable attention due to their advantages over traditional cameras in low power consumption, high dynamic range, and no motion blur. This paper proposes a monocular event-inertial odometry incorporating an adaptive decay kernel-based time surface with polarity-aware tracking. We utilize an adaptive decay-based Time Surface to extract texture information from asynchronous events, which adapts to the dynamic characteristics of the event stream and enhances the representation of environmental textures. However, polarity-weighted time surfaces suffer from event polarity shifts during changes in motion direction. To mitigate its adverse effects on feature tracking, we optimize the feature tracking by incorporating an additional polarity-inverted time surface to enhance the robustness. Comparative analysis with visual-inertial and event-inertial odometry methods shows that our approach outperforms state-of-the-art techniques, with competitive results across various datasets.
Authors: Quanhui Tang, Jingtao Cao
Abstract: Video inpainting enables seamless content removal and replacement within frames, posing ethical and legal risks when misused. To mitigate these risks, detecting manipulated regions in inpainted videos is critical. Previous detection methods often focus solely on the characteristics derived from spatial and temporal dimensions, which limits their effectiveness by overlooking the unique frequency characteristics of different inpainting algorithms. In this paper, we propose the Frequency Domain Insights Network (FDIN), which significantly enhances detection accuracy by incorporating insights from the frequency domain. Our network features an Adaptive Band Selective Response module to discern frequency characteristics specific to various inpainting techniques and a Fast Fourier Convolution-based Attention module for identifying periodic artifacts in inpainted regions. Utilizing 3D ResBlocks for spatiotemporal analysis, FDIN progressively refines detection precision from broad assessments to detailed localization. Experimental evaluations on public datasets demonstrate that FDIN achieves state-of-the-art performance, setting a new benchmark in video inpainting detection.
Authors: Sneha Paul, Zachary Patterson, Nizar Bouguila
Abstract: Semi-supervised learning (SSL) has shown its effectiveness in learning effective 3D representation from a small amount of labelled data while utilizing large unlabelled data. Traditional semi-supervised approaches rely on the fundamental concept of predicting pseudo-labels for unlabelled data and incorporating them into the learning process. However, we identify that the existing methods do not fully utilize all the unlabelled samples and consequently limit their potential performance. To address this issue, we propose AllMatch, a novel SSL-based 3D classification framework that effectively utilizes all the unlabelled samples. AllMatch comprises three modules: (1) an adaptive hard augmentation module that applies relatively hard augmentations to the high-confident unlabelled samples with lower loss values, thereby enhancing the contribution of such samples, (2) an inverse learning module that further improves the utilization of unlabelled data by learning what not to learn, and (3) a contrastive learning module that ensures learning from all the samples in both supervised and unsupervised settings. Comprehensive experiments on two popular 3D datasets demonstrate a performance improvement of up to 11.2% with 1% labelled data, surpassing the SOTA by a significant margin. Furthermore, AllMatch exhibits its efficiency in effectively leveraging all the unlabelled data, demonstrated by the fact that only 10% of labelled data reaches nearly the same performance as fully-supervised learning with all labelled data. The code of our work is available at: https://github.com/snehaputul/AllMatch.
Authors: Bang-Shien Chen, Yu-Kai Lin, Jian-Yu Chen, Chih-Wei Huang, Jann-Long Chern, Ching-Cherng Sun
Abstract: Robust estimation is essential in computer vision, robotics, and navigation, aiming to minimize the impact of outlier measurements for improved accuracy. We present a fast algorithm for Geman-McClure robust estimation, FracGM, leveraging fractional programming techniques. This solver reformulates the original non-convex fractional problem to a convex dual problem and a linear equation system, iteratively solving them in an alternating optimization pattern. Compared to graduated non-convexity approaches, this strategy exhibits a faster convergence rate and better outlier rejection capability. In addition, the global optimality of the proposed solver can be guaranteed under given conditions. We demonstrate the proposed FracGM solver with Wahba's rotation problem and 3-D point-cloud registration along with relaxation pre-processing and projection post-processing. Compared to state-of-the-art algorithms, when the outlier rates increase from 20\% to 80\%, FracGM shows 53\% and 88\% lower rotation and translation increases. In real-world scenarios, FracGM achieves better results in 13 out of 18 outcomes, while having a 19.43\% improvement in the computation time.
Authors: Zhiyuan Li, Dongnan Liu, Chaoyi Zhang, Heng Wang, Tengfei Xue, Weidong Cai
Abstract: Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models' advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks while struggling with complex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities; however, they lack visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs' visual perception proficiency and LLMs' extensive reasoning capability. Unlike recent multimodal large language models (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop and leverages LLMs' text knowledge for accurate predictions without extra training. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning. Additionally, we introduce Chain-of-Comparison (CoC), a step-by-step comparison technique enabling contrasting various aspects of predictions. Our CVR-LLM presents the first comprehensive study across a wide array of complex visual reasoning tasks and achieves SOTA performance among all.
Authors: Fuyang Yu, Runze Tian, Zhen Wang, Xiaochuan Wang, Xiaohui Liang
Abstract: To ease the difficulty of acquiring annotation labels in 3D data, a common method is using unsupervised and open-vocabulary semantic segmentation, which leverage 2D CLIP semantic knowledge. In this paper, unlike previous research that ignores the ``noise'' raised during feature projection from 2D to 3D, we propose a novel distillation learning framework named CUS3D. In our approach, an object-level denosing projection module is designed to screen out the ``noise'' and ensure more accurate 3D feature. Based on the obtained features, a multimodal distillation learning module is designed to align the 3D feature with CLIP semantic feature space with object-centered constrains to achieve advanced unsupervised semantic segmentation. We conduct comprehensive experiments in both unsupervised and open-vocabulary segmentation, and the results consistently showcase the superiority of our model in achieving advanced unsupervised segmentation results and its effectiveness in open-vocabulary segmentation.
Authors: Haoran Gong, Haodong Wang, Di Wang
Abstract: Semantic segmentation of large-scale point clouds is of significant importance in environment perception and scene understanding. However, point clouds collected from real-world environments are usually imbalanced and small-sized objects are prone to be under-sampled or misclassified due to their low occurrence frequency, thereby reducing the overall accuracy of semantic segmentation. In this study, we propose the Multilateral Cascading Network (MCNet) for large-scale and sample-imbalanced point cloud scenes. To increase the frequency of small-sized objects, we introduce the semantic-weighted sampling module, which incorporates a probability parameter into the collected data group. To facilitate feature learning, we propose a Multilateral Cascading Attention Enhancement (MCAE) module to learn complex local features through multilateral cascading operations and attention mechanisms. To promote feature fusion, we propose a Point Cross Stage Partial (P-CSP) module to combine global and local features, optimizing the integration of valuable feature information across multiple scales. Finally, we introduce the neighborhood voting module to integrate results at the output layer. Our proposed method demonstrates either competitive or superior performance relative to state-of-the-art approaches across three widely recognized benchmark datasets: S3DIS, Toronto3D, and SensatUrban with mIoU scores of 74.0\%, 82.9\% and 64.5\%, respectively. Notably, our work yielded consistent optimal results on the under-sampled semantic categories, thereby demonstrating exceptional performance in the recognition of small-sized objects.
Authors: Geonuk Kim
Abstract: Industrial defect detection traditionally relies on supervised learning models trained on fixed datasets of known defect types. While effective within a closed set, these models struggle with new, unseen defects, necessitating frequent re-labeling and re-training. Recent advances in visual prompting offer a solution by allowing models to adaptively infer novel categories based on provided visual cues. However, a prevalent issue in these methods is the over-confdence problem, where models can mis-classify unknown objects as known objects with high certainty. To addresssing the fundamental concerns about the adaptability, we propose a solution to estimate uncertainty of the visual prompting process by cycle-consistency. We designed to check whether it can accurately restore the original prompt from its predictions. To quantify this, we measure the mean Intersection over Union (mIoU) between the restored prompt mask and the originally provided prompt mask. Without using complex designs or ensemble methods with multiple networks, our approach achieved a yield rate of 0.9175 in the VISION24 one-shot industrial challenge.
Authors: Hao Jiang, Runsheng Liu, Yanning Zhou, Huangjing Lin, Hao Chen
Abstract: Cytology screening from Papanicolaou (Pap) smears is a common and effective tool for the preventive clinical management of cervical cancer, where abnormal cell detection from whole slide images serves as the foundation for reporting cervical cytology. However, cervical cell detection remains challenging due to 1) hazily-defined cell types (e.g., ASC-US) with subtle morphological discrepancies caused by the dynamic cancerization process, i.e., cell class ambiguity, and 2) imbalanced class distributions of clinical data may cause missed detection, especially for minor categories, i.e., cell class imbalance. To this end, we propose a holistic and historical instance comparison approach for cervical cell detection. Specifically, we first develop a holistic instance comparison scheme enforcing both RoI-level and class-level cell discrimination. This coarse-to-fine cell comparison encourages the model to learn foreground-distinguishable and class-wise representations. To emphatically improve the distinguishability of minor classes, we then introduce a historical instance comparison scheme with a confident sample selection-based memory bank, which involves comparing current embeddings with historical embeddings for better cell instance discrimination. Extensive experiments and analysis on two large-scale cytology datasets including 42,592 and 114,513 cervical cells demonstrate the effectiveness of our method. The code is available at https://github.com/hjiangaz/HERO.
Authors: Runsheng Liu, Hao Jiang, Yanning Zhou, Huangjing Lin, Liansheng Wang, Hao Chen
Abstract: Instance segmentation plays a vital role in the morphological quantification of biomedical entities such as tissues and cells, enabling precise identification and delineation of different structures. Current methods often address the challenges of touching, overlapping or crossing instances through individual modeling, while neglecting the intrinsic interrelation between these conditions. In this work, we propose a Gradient Anomaly-aware Biomedical Instance Segmentation approach (GAInS), which leverages instance gradient information to perceive local gradient anomaly regions, thus modeling the spatial relationship between instances and refining local region segmentation. Specifically, GAInS is firstly built on a Gradient Anomaly Mapping Module (GAMM), which encodes the radial fields of instances through window sliding to obtain instance gradient anomaly maps. To efficiently refine boundaries and regions with gradient anomaly attention, we propose an Adaptive Local Refinement Module (ALRM) with a gradient anomaly-aware loss function. Extensive comparisons and ablation experiments in three biomedical scenarios demonstrate that our proposed GAInS outperforms other state-of-the-art (SOTA) instance segmentation methods. The code is available at https://github.com/DeepGAInS/GAInS.
Authors: Zheng Liu, Jinchao Zhu, Nannan Li, Gao Huang
Abstract: Parameter-efficient transfer learning (PETL) has shown great potential in adapting a vision transformer (ViT) pre-trained on large-scale datasets to various downstream tasks. Existing studies primarily focus on minimizing the number of learnable parameters. Although these methods are storage-efficient, they allocate excessive computational resources to easy samples, leading to inefficient inference. To address this issue, we introduce an inference-efficient tuning method termed multiple-exit tuning (MET). MET integrates multiple exits into the pre-trained ViT backbone. Since the predictions in ViT are made by a linear classifier, each exit is equipped with a linear prediction head. In inference stage, easy samples will exit at early exits and only hard enough samples will flow to the last exit, thus saving the computational cost for easy samples. MET consists of exit-specific adapters (E-adapters) and graph regularization. E-adapters are designed to extract suitable representations for different exits. To ensure parameter efficiency, all E-adapters share the same down-projection and up-projection matrices. As the performances of linear classifiers are influenced by the relationship among samples, we employ graph regularization to improve the representations fed into the classifiers at early exits. Finally, we conduct extensive experiments to verify the performance of MET. Experimental results show that MET has an obvious advantage over the state-of-the-art methods in terms of both accuracy and inference efficiency.
Authors: Shida Sun, Yue Li, Yueyi Zhang, Zhiwei Xiong
Abstract: Non-line-of-sight (NLOS) imaging, recovering the hidden volume from indirect reflections, has attracted increasing attention due to its potential applications. Despite promising results, existing NLOS reconstruction approaches are constrained by the reliance on empirical physical priors, e.g., single fixed path compensation. Moreover, these approaches still possess limited generalization ability, particularly when dealing with scenes at a low signal-to-noise ratio (SNR). To overcome the above problems, we introduce a novel learning-based solution, comprising two key designs: Learnable Path Compensation (LPC) and Adaptive Phasor Field (APF). The LPC applies tailored path compensation coefficients to adapt to different objects in the scene, effectively reducing light wave attenuation, especially in distant regions. Meanwhile, the APF learns the precise Gaussian window of the illumination function for the phasor field, dynamically selecting the relevant spectrum band of the transient measurement. Experimental validations demonstrate that our proposed approach, only trained on synthetic data, exhibits the capability to seamlessly generalize across various real-world datasets captured by different imaging systems and characterized by low SNRs.
Authors: Zhenhua Du, Binbin Xu, Haoyu Zhang, Kai Huo, Shuaifeng Zhi
Abstract: Accurately reconstructing dense and semantically annotated 3D meshes from monocular images remains a challenging task due to the lack of geometry guidance and imperfect view-dependent 2D priors. Though we have witnessed recent advancements in implicit neural scene representations enabling precise 2D rendering simply from multi-view images, there have been few works addressing 3D scene understanding with monocular priors alone. In this paper, we propose MOSE, a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D, producing accurate semantics and geometry in both 3D and 2D space. The key motivation for our method is to leverage generic class-agnostic segment masks as guidance to promote local consistency of rendered semantics during training. With the help of semantics, we further apply a smoothness regularization to texture-less regions for better geometric quality, thus achieving mutual benefits of geometry and semantics. Experiments on the ScanNet dataset show that our MOSE outperforms relevant baselines across all metrics on tasks of 3D semantic segmentation, 2D semantic segmentation and 3D surface reconstruction.
Authors: Ling Wang, Chen Wu, Lin Wang
Abstract: Can we directly visualize what we imagine in our brain together with what we describe? The inherent nature of human perception reveals that, when we think, our body can combine language description and build a vivid picture in our brain. Intuitively, generative models should also hold such versatility. In this paper, we introduce BrainDreamer, a novel end-to-end language-guided generative framework that can mimic human reasoning and generate high-quality images from electroencephalogram (EEG) brain signals. Our method is superior in its capacity to eliminate the noise introduced by non-invasive EEG data acquisition and meanwhile achieve a more precise mapping between the EEG and image modality, thus leading to significantly better-generated images. Specifically, BrainDreamer consists of two key learning stages: 1) modality alignment and 2) image generation. In the alignment stage, we propose a novel mask-based triple contrastive learning strategy to effectively align EEG, text, and image embeddings to learn a unified representation. In the generation stage, we inject the EEG embeddings into the pre-trained Stable Diffusion model by designing a learnable EEG adapter to generate high-quality reasoning-coherent images. Moreover, BrainDreamer can accept textual descriptions (e.g., color, position, etc.) to achieve controllable image generation. Extensive experiments show that our method significantly outperforms prior arts in terms of generating quality and quantitative performance.
Authors: Minjun Kim, Ohtae Jang, Haekang Song, Heesub Shin, Jaewoo Ok, Minyoung Back, Jaehyuk Youn, Sungho Kim
Abstract: Synthetic aperture radar technology is crucial for high-resolution imaging under various conditions; however, the acquisition of real-world synthetic aperture radar data for deep learning-based automatic target recognition remains challenging due to high costs and data availability issues. To overcome these challenges, synthetic data generated through simulations have been employed, although discrepancies between synthetic and real data can degrade model performance. In this study, we introduce a novel framework, soft segmented randomization, designed to reduce domain discrepancy and improve the generalize ability of synthetic aperture radar automatic target recognition models. The soft segmented randomization framework applies a Gaussian mixture model to segment target and clutter regions softly, introducing randomized variations that align the synthetic data's statistical properties more closely with those of real-world data. Experimental results demonstrate that the proposed soft segmented randomization framework significantly enhances model performance on measured synthetic aperture radar data, making it a promising approach for robust automatic target recognition in scenarios with limited or no access to measured data.
Authors: Hongjia Zhai, Xiyu Zhang, Boming Zhao, Hai Li, Yijia He, Zhaopeng Cui, Hujun Bao, Guofeng Zhang
Abstract: Visual localization plays an important role in the applications of Augmented Reality (AR), which enable AR devices to obtain their 6-DoF pose in the pre-build map in order to render virtual content in real scenes. However, most existing approaches can not perform novel view rendering and require large storage capacities for maps. To overcome these limitations, we propose an efficient visual localization method capable of high-quality rendering with fewer parameters. Specifically, our approach leverages 3D Gaussian primitives as the scene representation. To ensure precise 2D-3D correspondences for pose estimation, we develop an unbiased 3D scene-specific descriptor decoder for Gaussian primitives, distilled from a constructed feature volume. Additionally, we introduce a salient 3D landmark selection algorithm that selects a suitable primitive subset based on the saliency score for localization. We further regularize key Gaussian primitives to prevent anisotropic effects, which also improves localization performance. Extensive experiments on two widely used datasets demonstrate that our method achieves superior or comparable rendering and localization performance to state-of-the-art implicit-based visual localization approaches. Project page: \href{https://zju3dv.github.io/splatloc}{https://zju3dv.github.io/splatloc}.
URLs: https://zju3dv.github.io/splatloc, https://zju3dv.github.io/splatloc
Authors: Shuai Zhang, Guanjun Wu, Xinggang Wang, Bin Feng, Wenyu Liu
Abstract: Reconstructing objects and extracting high-quality surfaces play a vital role in the real world. Current 4D representations show the ability to render high-quality novel views for dynamic objects but cannot reconstruct high-quality meshes due to their implicit or geometrically inaccurate representations. In this paper, we propose a novel representation that can reconstruct accurate meshes from sparse image input, named Dynamic 2D Gaussians (D-2DGS). We adopt 2D Gaussians for basic geometry representation and use sparse-controlled points to capture 2D Gaussian's deformation. By extracting the object mask from the rendered high-quality image and masking the rendered depth map, a high-quality dynamic mesh sequence of the object can be extracted. Experiments demonstrate that our D-2DGS is outstanding in reconstructing high-quality meshes from sparse input. More demos and code are available at https://github.com/hustvl/Dynamic-2DGS.
Authors: Jiashuo Sun, Jihai Zhang, Yucheng Zhou, Zhaochen Su, Xiaoye Qu, Yu Cheng
Abstract: Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing. However, the full potential of LVLMs Retrieval-Augmented Generation (RAG) capabilities remains underutilized. Existing works either focus solely on the text modality or are limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize retrieved information and are sensitive to irrelevant or misleading references. To address these challenges, we propose a self-refinement framework designed to teach LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically, when given questions that are incorrectly answered by the LVLM backbone, we obtain references that help correct the answers (positive references) and those that do not (negative references). We then fine-tune the LVLM backbone using a combination of these positive and negative references. Our experiments across three tasks and seven datasets demonstrate that our framework significantly enhances LVLMs ability to effectively utilize retrieved multimodal references and improves their robustness against irrelevant or misleading information. The source code is available at https://github.com/GasolSun36/SURf.
Authors: Sk Aziz Ali, Mohammad Sadil Khan, Didier Stricker
Abstract: In machining process, 3D reverse engineering of the mechanical system is an integral, highly important, and yet time consuming step to obtain parametric CAD models from 3D scans. Therefore, deep learning-based Scan-to-CAD modeling can offer designers enormous editability to quickly modify CAD model, being able to parse all its structural compositions and design steps. In this paper, we propose a supervised boundary representation (BRep) detection network BRepDetNet from 3D scans of CC3D and ABC dataset. We have carefully annotated the 50K and 45K scans of both the datasets with appropriate topological relations (e.g., next, mate, previous) between the geometrical primitives (i.e., boundaries, junctions, loops, faces) of their BRep data structures. The proposed solution decomposes the Scan-to-CAD problem in Scan-to-BRep ensuring the right step towards feature-based modeling, and therefore, leveraging other existing BRep-to-CAD modeling methods. Our proposed Scan-to-BRep neural network learns to detect BRep boundaries and junctions by minimizing focal-loss and non-maximal suppression (NMS) during training time. Experimental results show that our BRepDetNet with NMS-Loss achieves impressive results.
Authors: Jasmin Breitenstein, Franz J\"unger, Andreas B\"ar, Tim Fingscheidt
Abstract: In this work, we study amodal video instance segmentation for automated driving. Previous works perform amodal video instance segmentation relying on methods trained on entirely labeled video data with techniques borrowed from standard video instance segmentation. Such amodally labeled video data is difficult and expensive to obtain and the resulting methods suffer from a trade-off between instance segmentation and tracking performance. To largely solve this issue, we propose to study the application of foundation models for this task. More precisely, we exploit the extensive knowledge of the Segment Anything Model (SAM), while fine-tuning it to the amodal instance segmentation task. Given an initial video instance segmentation, we sample points from the visible masks to prompt our amodal SAM. We use a point memory to store those points. If a previously observed instance is not predicted in a following frame, we retrieve its most recent points from the point memory and use a point tracking method to follow those points to the current frame, together with the corresponding last amodal instance mask. This way, while basing our method on an amodal instance segmentation, we nevertheless obtain video-level amodal instance segmentation results. Our resulting S-AModal method achieves state-of-the-art results in amodal video instance segmentation while resolving the need for amodal video-based labels. Code for S-AModal is available at https://github.com/ifnspaml/S-AModal.
Authors: Zhuojun Li, Chun Yu, Chen Liang, Yuanchun Shi
Abstract: The data scarcity problem is a crucial factor that hampers the model performance of IMU-based human motion capture. However, effective data augmentation for IMU-based motion capture is challenging, since it has to capture the physical relations and constraints of the human body, while maintaining the data distribution and quality. We propose PoseAugment, a novel pipeline incorporating VAE-based pose generation and physical optimization. Given a pose sequence, the VAE module generates infinite poses with both high fidelity and diversity, while keeping the data distribution. The physical module optimizes poses to satisfy physical constraints with minimal motion restrictions. High-quality IMU data are then synthesized from the augmented poses for training motion capture models. Experiments show that PoseAugment outperforms previous data augmentation and pose generation methods in terms of motion capture accuracy, revealing a strong potential of our method to alleviate the data collection burden for IMU-based motion capture and related tasks driven by human poses.
Authors: Kanghao Chen, Zeyu Wang, Lin Wang
Abstract: Recent years have witnessed tremendous progress in the 3D reconstruction of dynamic humans from a monocular video with the advent of neural rendering techniques. This task has a wide range of applications, including the creation of virtual characters for virtual reality (VR) environments. However, it is still challenging to reconstruct clear humans when the monocular video is affected by motion blur, particularly caused by rapid human motion (e.g., running, dancing), as often occurs in the wild. This leads to distinct inconsistency of shape and appearance for the rendered 3D humans, especially in the blurry regions with rapid motion, e.g., hands and legs. In this paper, we propose ExFMan, the first neural rendering framework that unveils the possibility of rendering high-quality humans in rapid motion with a hybrid frame-based RGB and bio-inspired event camera. The ``out-of-the-box'' insight is to leverage the high temporal information of event data in a complementary manner and adaptively reweight the effect of losses for both RGB frames and events in the local regions, according to the velocity of the rendered human. This significantly mitigates the inconsistency associated with motion blur in the RGB frames. Specifically, we first formulate a velocity field of the 3D body in the canonical space and render it to image space to identify the body parts with motion blur. We then propose two novel losses, i.e., velocity-aware photometric loss and velocity-relative event loss, to optimize the neural human for both modalities under the guidance of the estimated velocity. In addition, we incorporate novel pose regularization and alpha losses to facilitate continuous pose and clear boundary. Extensive experiments on synthetic and real-world datasets demonstrate that ExFMan can reconstruct sharper and higher quality humans.
Authors: Yalong Jiang, Liquan Mao
Abstract: Video anomaly detection (VAD) aims at identifying novel actions or events which are unseen during training. Existing mainstream VAD techniques focus on the global patterns of events and cannot properly generalize to novel samples. In this paper, we propose a framework to identify the spatial local patterns which generalize to novel samples and model the dynamics of local patterns. In spatial part of the framework, the capability of extracting local patterns is gained from image-text contrastive learning with Image-Text Alignment Module (ITAM). To detect different types of anomalies, a two-branch framework is proposed for representing the local patterns in both actions and appearances. In temporal part of the framework, a State Machine Module (SMM) is proposed to model the dynamics of local patterns by decomposing their temporal variations into motion components. Different dynamics are represented with different weighted sums of a fixed set of motion components. The video sequences with either novel spatial distributions of local patterns or distinctive dynamics of local patterns are deemed as anomalies. Extensive experiments on popular benchmark datasets demonstrate that state-of-the-art performance can be achieved.
Authors: Pablo Bernabeu-Perez, Enrique Lopez-Cuena, Dario Garcia-Gasulla
Abstract: The continued release of new and better image generation models increases the demand for synthetic image detectors. In such a dynamic field, detectors need to be able to generalize widely and be robust to uncontrolled alterations. The present work is motivated by this setting, when looking at the role of time, image transformations and data sources, for detector generalization. In these experiments, none of the evaluated detectors is found universal, but results indicate an ensemble could be. Experiments on data collected in the wild show this task to be more challenging than the one defined by large-scale datasets, pointing to a gap between experimentation and actual practice. Finally, we observe a race equilibrium effect, where better generators lead to better detectors, and vice versa. We hypothesize this pushes the field towards a perpetually close race between generators and detectors.
Authors: Heethanjan Kanagalingam, Thenukan Pathmanathan, Navaneethan Ketheeswaran, Mokeeshan Vathanakumar, Mohamed Afham, Ranga Rodrigo
Abstract: Few-shot learning (FSL) aims to enable models to recognize novel objects or classes with limited labelled data. Feature generators, which synthesize new data points to augment limited datasets, have emerged as a promising solution to this challenge. This paper investigates the effectiveness of feature generators in enhancing the embedding process for FSL tasks. To address the issue of inaccurate embeddings due to the scarcity of images per class, we introduce a feature generator that creates visual features from class-level textual descriptions. By training the generator with a combination of classifier loss, discriminator loss, and distance loss between the generated features and true class embeddings, we ensure the generation of accurate same-class features and enhance the overall feature representation. Our results show a significant improvement in accuracy over baseline methods, with our approach outperforming the baseline model by 10% in 1-shot and around 5% in 5-shot approaches. Additionally, both visual-only and visual + textual generators have also been tested in this paper.
Authors: Hadrien Reynaud, Matthew Baugh, Mischa Dombrowski, Sarah Cechnicka, Qingjie Meng, Bernhard Kainz
Abstract: We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality and temporally coherent videos. We achieve this by integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our method combines these models in the reverse diffusion process, where the LIDM enhances image quality and the LVDM ensures temporal consistency. This unique combination allows us to effectively handle the complex spatio-temporal dynamics in video generation. Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
Authors: Yan Zhong, Zhixin Yan, Yi Xie, Shibin Wu, Huaidong Zhang, Lin Shu, Peiru Zhou
Abstract: Diabetic foot neuropathy (DFN) is a critical factor leading to diabetic foot ulcers, which is one of the most common and severe complications of diabetes mellitus (DM) and is associated with high risks of amputation and mortality. Despite its significance, existing datasets do not directly derive from plantar data and lack continuous, long-term foot-specific information. To advance DFN research, we have collected a novel dataset comprising continuous plantar pressure data to recognize diabetic foot neuropathy. This dataset includes data from 94 DM patients with DFN and 41 DM patients without DFN. Moreover, traditional methods divide datasets by individuals, potentially leading to significant domain discrepancies in some feature spaces due to the absence of mid-domain data. In this paper, we propose an effective domain adaptation method to address this proplem. We split the dataset based on convolutional feature statistics and select appropriate sub-source domains to enhance efficiency and avoid negative transfer. We then align the distributions of each source and target domain pair in specific feature spaces to minimize the domain gap. Comprehensive results validate the effectiveness of our method on both the newly proposed dataset for DFN recognition and an existing dataset.
Authors: Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Jingwen Fu, Badong Chen
Abstract: Source-free domain generalization (SFDG) tackles the challenge of adapting models to unseen target domains without access to source domain data. To deal with this challenging task, recent advances in SFDG have primarily focused on leveraging the text modality of vision-language models such as CLIP. These methods involve developing a transferable linear classifier based on diverse style features extracted from the text and learned prompts or deriving domain-unified text representations from domain banks. However, both style features and domain banks have limitations in capturing comprehensive domain knowledge. In this work, we propose Prompt-Driven Text Adapter (PromptTA) method, which is designed to better capture the distribution of style features and employ resampling to ensure thorough coverage of domain knowledge. To further leverage this rich domain information, we introduce a text adapter that learns from these style features for efficient domain information storage. Extensive experiments conducted on four benchmark datasets demonstrate that PromptTA achieves state-of-the-art performance. The code is available at https://github.com/zhanghr2001/PromptTA.
Authors: Guoliang You, Xiaomeng Chu, Yifan Duan, Xingchen Li, Sha Zhang, Jianmin Ji, Yanyong Zhang
Abstract: Multi-modal systems enhance performance in autonomous driving but face inefficiencies due to indiscriminate processing within each modality. Additionally, the independent feature learning of each modality lacks interaction, which results in extracted features that do not possess the complementary characteristics. These issue increases the cost of fusing redundant information across modalities. To address these challenges, we propose targeting driving-relevant elements, which reduces the volume of LiDAR features while preserving critical information. This approach enhances lane level interaction between the image and LiDAR branches, allowing for the extraction and fusion of their respective advantageous features. Building upon the camera-only framework PHP, we introduce the Lane-level camera-LiDAR Fusion Planning (LFP) method, which balances efficiency with performance by using lanes as the unit for sensor fusion. Specifically, we design three modules to enhance efficiency and performance. For efficiency, we propose an image-guided coarse lane prior generation module that forecasts the region of interest (ROI) for lanes and assigns a confidence score, guiding LiDAR processing. The LiDAR feature extraction modules leverages lane-aware priors from the image branch to guide sampling for pillar, retaining essential pillars. For performance, the lane-level cross-modal query integration and feature enhancement module uses confidence score from ROI to combine low-confidence image queries with LiDAR queries, extracting complementary depth features. These features enhance the low-confidence image features, compensating for the lack of depth. Experiments on the Carla benchmarks show that our method achieves state-of-the-art performance in both driving score and infraction score, with maximum improvement of 15% and 14% over existing algorithms, respectively, maintaining high frame rate of 19.27 FPS.
Authors: Sam Sartor, Pieter Peers
Abstract: We present a novel and flexible learning-based method for generating tileable image sets. Our method goes beyond simple self-tiling, supporting sets of mutually tileable images that exhibit a high degree of diversity. To promote diversity we decouple structure from content by foregoing explicit copying of patches from an exemplar image. Instead we leverage the prior knowledge of natural images and textures embedded in large-scale pretrained diffusion models to guide tile generation constrained by exterior boundary conditions and a text prompt to specify the content. By carefully designing and selecting the exterior boundary conditions, we can reformulate the tile generation process as an inpainting problem, allowing us to directly employ existing diffusion-based inpainting models without the need to retrain a model on a custom training set. We demonstrate the flexibility and efficacy of our content-aware tile generation method on different tiling schemes, such as Wang tiles, from only a text prompt. Furthermore, we introduce a novel Dual Wang tiling scheme that provides greater texture continuity and diversity than existing Wang tile variants.
Authors: Nan Jiang, Shanchao Liang, Chengxiao Wang, Jiannan Wang, Lin Tan
Abstract: Portable Document Format (PDF) files are dominantly used for storing and disseminating scientific research, legal documents, and tax information. LaTeX is a popular application for creating PDF documents. Despite its advantages, LaTeX is not WYSWYG -- what you see is what you get, i.e., the LaTeX source and rendered PDF images look drastically different, especially for formulae and tables. This gap makes it hard to modify or export LaTeX sources for formulae and tables from PDF images, and existing work is still limited. First, prior work generates LaTeX sources in a single iteration and struggles with complex LaTeX formulae. Second, existing work mainly recognizes and extracts LaTeX sources for formulae; and is incapable or ineffective for tables. This paper proposes LATTE, the first iterative refinement framework for LaTeX recognition. Specifically, we propose delta-view as feedback, which compares and pinpoints the differences between a pair of rendered images of the extracted LaTeX source and the expected correct image. Such delta-view feedback enables our fault localization model to localize the faulty parts of the incorrect recognition more accurately and enables our LaTeX refinement model to repair the incorrect extraction more accurately. LATTE improves the LaTeX source extraction accuracy of both LaTeX formulae and tables, outperforming existing techniques as well as GPT-4V by at least 7.07% of exact match, with a success refinement rate of 46.08% (formula) and 25.51% (table).
Authors: Simone Alberto Peirone, Gabriele Goletto, Mirco Planamente, Andrea Bottino, Barbara Caputo, Giuseppe Averta
Abstract: Human activities exhibit a strong correlation between actions and the places where these are performed, such as washing something at a sink. More specifically, in daily living environments we may identify particular locations, hereinafter named activity-centric zones, which may afford a set of homogeneous actions. Their knowledge can serve as a prior to favor vision models to recognize human activities. However, the appearance of these zones is scene-specific, limiting the transferability of this prior information to unfamiliar areas and domains. This problem is particularly relevant in egocentric vision, where the environment takes up most of the image, making it even more difficult to separate the action from the context. In this paper, we discuss the importance of decoupling the domain-specific appearance of activity-centric zones from their universal, domain-agnostic representations, and show how the latter can improve the cross-domain transferability of Egocentric Action Recognition (EAR) models. We validate our solution on the EPIC-Kitchens-100 and Argo1M datasets
Authors: Xin Jiang, Junwei Zheng, Ruiping Liu, Jiahang Li, Jiaming Zhang, Sven Matthiesen, Rainer Stiefelhagen
Abstract: As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). Besides, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. Our framework exhibits outstanding performance across tasks by integrating multi-modal information, and it offers PVIs a more comprehensive assistance. Extensive experiments prove the effectiveness and generalizability of our framework.
Authors: Tomasz Stanczyk, Francois Bremond
Abstract: Multi-object tracking (MOT) involves identifying and consistently tracking objects across video sequences. Traditional tracking-by-detection methods, while effective, often require extensive tuning and lack generalizability. On the other hand, segmentation mask-based methods are more generic but struggle with tracking management, making them unsuitable for MOT. We propose a novel approach, McByte, which incorporates a temporally propagated segmentation mask as a strong association cue within a tracking-by-detection framework. By combining bounding box and mask information, McByte enhances robustness and generalizability without per-sequence tuning. Evaluated on four benchmark datasets - DanceTrack, MOT17, SoccerNet-tracking 2022, and KITTI-tracking - McByte demonstrates performance gain in all cases examined. At the same time, it outperforms existing mask-based methods. Implementation code will be provided upon acceptance.
Authors: Fei Ma, Yuqiang Feng, Fan Zhang, Yongsheng Zhou
Abstract: Most existing adversarial attack methods for remote sensing images merely add adversarial perturbations or patches, resulting in unnatural modifications. Clouds are common atmospheric effects in remote sensing images. Generating clouds on these images can produce adversarial examples better aligning with human perception. In this paper, we propose a Perlin noise based cloud generation attack method. Common Perlin noise based cloud generation is a random, non-optimizable process, which cannot be directly used to attack the target models. We design a Perlin Gradient Generator Network (PGGN), which takes a gradient parameter vector as input and outputs the grids of Perlin noise gradient vectors at different scales. After a series of computations based on the gradient vectors, cloud masks at corresponding scales can be produced. These cloud masks are then weighted and summed depending on a mixing coefficient vector and a scaling factor to produce the final cloud masks. The gradient vector, coefficient vector and scaling factor are collectively represented as a cloud parameter vector, transforming the cloud generation into a black-box optimization problem. The Differential Evolution (DE) algorithm is employed to solve for the optimal solution of the cloud parameter vector, achieving a query-based black-box attack. Detailed experiments confirm that this method has strong attack capabilities and achieves high query efficiency. Additionally, we analyze the transferability of the generated adversarial examples and their robustness in adversarial defense scenarios.
Authors: Yiren Lu, Huawei Wei
Abstract: This is a challenge report of the ECCV 2022 WCPA Challenge, Face Reconstruction Track. Inside this report is a brief explanation of how we accomplish this challenge. We design a two-branch network to accomplish this task, whose roles are Face Reconstruction and Face Landmark Detection. The former outputs canonical 3D face coordinates. The latter outputs pixel coordinates, i.e. 2D mapping of 3D coordinates with head pose and perspective projection. In addition, we utilize a differentiable PnP (Perspective-n-Points) layer to finetune the outputs of the two branch. Our method achieves very competitive quantitative results on the MVP-Human dataset and wins a $3^{rd}$ prize in the challenge.
Authors: Vojtech Panek, Torsten Sattler, Zuzana Kukelova
Abstract: Visual localization is the problem of estimating the camera pose of a given query image within a known scene. Most state-of-the-art localization approaches follow the structure-based paradigm and use 2D-3D matches between pixels in a query image and 3D points in the scene for pose estimation. These approaches assume an accurate 3D model of the scene, which might not always be available, especially if only a few images are available to compute the scene representation. In contrast, structure-less methods rely on 2D-2D matches and do not require any 3D scene model. However, they are also less accurate than structure-based methods. Although one prior work proposed to combine structure-based and structure-less pose estimation strategies, its practical relevance has not been shown. We analyze combining structure-based and structure-less strategies while exploring how to select between poses obtained from 2D-2D and 2D-3D matches, respectively. We show that combining both strategies improves localization performance in multiple practically relevant scenarios.
Authors: Anirudh S Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixe, Shu Kong, Deva Ramanan, Aljosa Osep
Abstract: Addressing Lidar Panoptic Segmentation (LPS ) is crucial for safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make an unrealistic assumption that the semantic class vocabulary is fixed in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered to be unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art train class-specific instance segmentation methods and obtain state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on-par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to Region Proposal Network [1]. We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
Authors: Lixia Ma, Puning Yang, Yuting Xu, Ziming Yang, Peipei Li, Huaibo Huang
Abstract: Currently, the rapid development of computer vision and deep learning has enabled the creation or manipulation of high-fidelity facial images and videos via deep generative approaches. This technology, also known as deepfake, has achieved dramatic progress and become increasingly popular in social media. However, the technology can generate threats to personal privacy and national security by spreading misinformation. To diminish the risks of deepfake, it is desirable to develop powerful forgery detection methods to distinguish fake faces from real faces. This paper presents a comprehensive survey of recent deep learning-based approaches for facial forgery detection. We attempt to provide the reader with a deeper understanding of the current advances as well as the major challenges for deepfake detection based on deep learning. We present an overview of deepfake techniques and analyse the characteristics of various deepfake datasets. We then provide a systematic review of different categories of deepfake detection and state-of-the-art deepfake detection methods. The drawbacks of existing detection methods are analyzed, and future research directions are discussed to address the challenges in improving both the performance and generalization of deepfake detection.
Authors: Xuewen Liu, Zhikai Li, Qingyi Gu
Abstract: Diffusion models have shown excellent performance on various image generation tasks, but the substantial computational costs and huge memory footprint hinder their low-latency applications in real-world scenarios. Quantization is a promising way to compress and accelerate models. Nevertheless, due to the wide range and time-varying activations in diffusion models, existing methods cannot maintain both accuracy and efficiency simultaneously for low-bit quantization. To tackle this issue, we propose DilateQuant, a novel quantization framework for diffusion models that offers comparable accuracy and high efficiency. Specifically, we keenly aware of numerous unsaturated in-channel weights, which can be cleverly exploited to reduce the range of activations without additional computation cost. Based on this insight, we propose Weight Dilation (WD) that maximally dilates the unsaturated in-channel weights to a constrained range through a mathematically equivalent scaling. WD costlessly absorbs the activation quantization errors into weight quantization. The range of activations decreases, which makes activations quantization easy. The range of weights remains constant, which makes model easy to converge in training stage. Considering the temporal network leads to time-varying activations, we design a Temporal Parallel Quantizer (TPQ), which sets time-step quantization parameters and supports parallel quantization for different time steps, significantly improving the performance and reducing time cost. To further enhance performance while preserving efficiency, we introduce a Block-wise Knowledge Distillation (BKD) to align the quantized models with the full-precision models at a block level. The simultaneous training of time-step quantization parameters and weights minimizes the time required, and the shorter backpropagation paths decreases the memory footprint of the quantization process.
Authors: Jingyu Kong, Yuan Guo, Yu Wang, Yuping Duan
Abstract: Real-world data often has a long-tailed distribution, where the scarcity of tail samples significantly limits the model's generalization ability. Denoising Diffusion Probabilistic Models (DDPM) are generative models based on stochastic differential equation theory and have demonstrated impressive performance in image classification tasks. However, existing diffusion probabilistic models do not perform satisfactorily in classifying tail classes. In this work, we propose the Anisotropic Diffusion Probabilistic Model (ADPM) for imbalanced image classification problems. We utilize the data distribution to control the diffusion speed of different class samples during the forward process, effectively improving the classification accuracy of the denoiser in the reverse process. Specifically, we provide a theoretical strategy for selecting noise levels for different categories in the diffusion process based on error analysis theory to address the imbalanced classification problem. Furthermore, we integrate global and local image prior in the forward process to enhance the model's discriminative ability in the spatial dimension, while incorporate semantic-level contextual information in the reverse process to boost the model's discriminative power and robustness. Through comparisons with state-of-the-art methods on four medical benchmark datasets, we validate the effectiveness of the proposed method in handling long-tail data. Our results confirm that the anisotropic diffusion model significantly improves the classification accuracy of rare classes while maintaining the accuracy of head classes. On the skin lesion datasets, PAD-UFES and HAM10000, the F1-scores of our method improved by 4% and 3%, respectively compared to the original diffusion probabilistic model.
Authors: Wangze Xu, Huachen Gao, Shihe Shen, Rui Peng, Jianbo Jiao, Ronggang Wang
Abstract: Recently, the Neural Radiance Field (NeRF) advancement has facilitated few-shot Novel View Synthesis (NVS), which is a significant challenge in 3D vision applications. Despite numerous attempts to reduce the dense input requirement in NeRF, it still suffers from time-consumed training and rendering processes. More recently, 3D Gaussian Splatting (3DGS) achieves real-time high-quality rendering with an explicit point-based representation. However, similar to NeRF, it tends to overfit the train views for lack of constraints. In this paper, we propose \textbf{MVPGS}, a few-shot NVS method that excavates the multi-view priors based on 3D Gaussian Splatting. We leverage the recent learning-based Multi-view Stereo (MVS) to enhance the quality of geometric initialization for 3DGS. To mitigate overfitting, we propose a forward-warping method for additional appearance constraints conforming to scenes based on the computed geometry. Furthermore, we introduce a view-consistent geometry constraint for Gaussian parameters to facilitate proper optimization convergence and utilize a monocular depth regularization as compensation. Experiments show that the proposed method achieves state-of-the-art performance with real-time rendering speed. Project page: https://zezeaaa.github.io/projects/MVPGS/
Authors: Sheng Zhou, Junbin Xiao, Xun Yang, Peipei Song, Dan Guo, Angela Yao, Meng Wang, Tat-Seng Chua
Abstract: Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decisionmaking and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scenetext recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages scene-text evidence versus other short-cuts for answer predictions. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherited in VideoQA and scene-text recognition. This enables the diagnosis of the root causes for failure predictions, e.g., wrong QA or wrong scene-text recognition? To achieve Grounded TextVideoQA, we propose the T2S-QA model that highlights a disentangled temporal-to-spatial contrastive learning strategy for weakly-supervised scene-text grounding and grounded TextVideoQA. To facilitate evaluation, we construct a new dataset ViTXT-GQA which features 52K scene-text bounding boxes within 2.2K temporal segments related to 2K questions and 729 videos. With ViTXT-GQA, we perform extensive experiments and demonstrate the severe limitations of existing techniques in Grounded TextVideoQA. While T2S-QA achieves superior results, the large performance gap with human leaves ample space for improvement. Our further analysis of oracle scene-text inputs posits that the major challenge is scene-text recognition. To advance the research of Grounded TextVideoQA, our dataset and code are at \url{https://github.com/zhousheng97/ViTXT-GQA.git}
Authors: Guangcheng Chen, Yicheng He, Li He, Hong Zhang
Abstract: Neural implicit surface reconstruction has achieved remarkable progress recently. Despite resorting to complex radiance modeling, state-of-the-art methods still struggle with textureless and specular surfaces. Different from RGB images, polarization images can provide direct constraints on the azimuth angles of the surface normals. In this paper, we present PISR, a novel method that utilizes a geometrically accurate polarimetric loss to refine shape independently of appearance. In addition, PISR smooths surface normals in image space to eliminate severe shape distortions and leverages the hash-grid-based neural signed distance function to accelerate the reconstruction. Experimental results demonstrate that PISR achieves higher accuracy and robustness, with an L1 Chamfer distance of 0.5 mm and an F-score of 99.5% at 1 mm, while converging 4~30 times faster than previous polarimetric surface reconstruction methods.
Authors: Jidong Kuang, Hongsong Wang, Chaolei Han, Jie Gui
Abstract: Zero-shot action recognition, which addresses the issue of scalability and generalization in action recognition and allows the models to adapt to new and unseen actions dynamically, is an important research topic in computer vision communities. The key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories. Most existing methods either directly project visual features onto the semantic space of text category or learn a shared embedding space between the two modalities. However, a direct projection cannot accurately align the two modalities, and learning robust and discriminative embedding space between visual and text representations is often difficult. To address these issues, we introduce Dual Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition. The DVTA consists of two alignment modules-Direct Alignment (DA) and Augmented Alignment (AA)-along with a designed Semantic Description Enhancement (SDE). The DA module maps the skeleton features to the semantic space through a specially designed visual projector, followed by the SDE, which is based on cross-attention to enhance the connection between skeleton and text, thereby reducing the gap between modalities. The AA module further strengthens the learning of the embedding space by utilizing deep metric learning to learn the similarity between skeleton and text. Our approach achieves state-of-the-art performances on several popular zero-shot skeleton-based action recognition benchmarks.
Authors: Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli
Abstract: Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/
Authors: Jintu Zheng, Yun Liang, Yuqing Zhang, Wanchao Su
Abstract: Memory-based video object segmentation methods model multiple objects over long temporal-spatial spans by establishing memory bank, which achieve the remarkable performance. However, they struggle to overcome the false matching and are prone to lose critical information, resulting in confusion among different objects. In this paper, we propose an effective approach which jointly improving the matching and decoding stages to alleviate the false matching issue.For the memory matching stage, we present a cost aware mechanism that suppresses the slight errors for short-term memory and a shunted cross-scale matching for long-term memory which establish a wide filed matching spaces for various object scales. For the readout decoding stage, we implement a compensatory mechanism aims at recovering the essential information where missing at the matching stage. Our approach achieves the outstanding performance in several popular benchmarks (i.e., DAVIS 2016&2017 Val (92.4%&88.1%), and DAVIS 2017 Test (83.9%)), and achieves 84.8%&84.6% on YouTubeVOS 2018&2019 Val.
Authors: Yuming Jiang, Nanxuan Zhao, Qing Liu, Krishna Kumar Singh, Shuai Yang, Chen Change Loy, Ziwei Liu
Abstract: Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and the diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labeled data for group photo editing, we create a data engine to generate paired data for training. The training data engine covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep the appearance consistent after editing, we inject the images of persons from the group photo into the attention modules and employ skeletons to provide intra-person guidance. 3) Control Flexibility: Bounding boxes indicating the locations of each person are used to reweight the attention matrix so that the features of each person can be injected into the correct places. This inter-person guidance provides flexible manners for manipulation. Extensive experiments demonstrate that GroupDiff exhibits state-of-the-art performance compared to existing methods. GroupDiff offers controllability for editing and maintains the fidelity of the original photos.
Authors: Qiu Yang, Xiao Sun, Xin-yu Li, Feng-Qi Cui, Yu-Tong Guo, Shuang-Zhen Hu, Ping Luo, Si-Ying Li
Abstract: The purpose of face super-resolution (FSR) is to reconstruct high-resolution (HR) face images from low-resolution (LR) inputs. With the continuous advancement of deep learning technologies, contemporary prior-guided FSR methods initially estimate facial priors and then use this information to assist in the super-resolution reconstruction process. However, ensuring the accuracy of prior estimation remains challenging, and straightforward cascading and convolutional operations often fail to fully leverage prior knowledge. Inaccurate or insufficiently utilized prior information inevitably degrades FSR performance. To address this issue, we propose a prior knowledge distillation network (PKDN) for FSR, which involves transferring prior information from the teacher network to the student network. This approach enables the network to learn priors during the training stage while relying solely on low-resolution facial images during the testing stage, thus mitigating the adverse effects of prior estimation inaccuracies. Additionally, we incorporate robust attention mechanisms to design a parsing map fusion block that effectively utilizes prior information. To prevent feature loss, we retain multi-scale features during the feature extraction stage and employ them in the subsequent super-resolution reconstruction process. Experimental results on benchmark datasets demonstrate that our PKDN approach surpasses existing FSR methods in generating high-quality face images.
Authors: Tzu-Chieh Liu, Chih-Ting Liu, Shao-Yi Chien
Abstract: We propose Pomo3D, a 3D portrait manipulation framework that allows free accessorizing by decomposing and recomposing portraits and accessories. It enables the avatars to attain out-of-distribution (OOD) appearances of simultaneously wearing multiple accessories. Existing methods still struggle to offer such explicit and fine-grained editing; they either fail to generate additional objects on given portraits or cause alterations to portraits (e.g., identity shift) when generating accessories. This restriction presents a noteworthy obstacle as people typically seek to create charming appearances with diverse and fashionable accessories in the virtual universe. Our approach provides an effective solution to this less-addressed issue. We further introduce the Scribble2Accessories module, enabling Pomo3D to create 3D accessories from user-drawn accessory scribble maps. Moreover, we design a bias-conscious mapper to mitigate biased associations present in real-world datasets. In addition to object-level manipulation above, Pomo3D also offers extensive editing options on portraits, including global or local editing of geometry and texture and avatar stylization, elevating 3D editing of neural portraits to a more comprehensive level.
Authors: Huafeng Qin, Hongyu Zhu, Xin Jin, Xin Yu, Mounim A. El-Yacoubi, Xinbo Gao
Abstract: Eye movement biometrics has received increasing attention thanks to its high secure identification. Although deep learning (DL) models have been recently successfully applied for eye movement recognition, the DL architecture still is determined by human prior knowledge. Differentiable Neural Architecture Search (DARTS) automates the manual process of architecture design with high search efficiency. DARTS, however, usually stacks the same multiple learned cells to form a final neural network for evaluation, limiting therefore the diversity of the network. Incidentally, DARTS usually searches the architecture in a shallow network while evaluating it in a deeper one, which results in a large gap between the architecture depths in the search and evaluation scenarios. To address this issue, we propose EM-DARTS, a hierarchical differentiable architecture search algorithm to automatically design the DL architecture for eye movement recognition. First, we define a supernet and propose a global and local alternate Neural Architecture Search method to search the optimal architecture alternately with an differentiable neural architecture search. The local search strategy aims to find an optimal architecture for different cells while the global search strategy is responsible for optimizing the architecture of the target network. To further reduce redundancy, a transfer entropy is proposed to compute the information amount of each layer, so as to further simplify search network. Our experiments on three public databases demonstrate that the proposed EM-DARTS is capable of producing an optimal architecture that leads to state-of-the-art recognition performance.
Authors: Yuzhen Lin, Wentang Song, Bin Li, Yuezun Li, Jiangqun Ni, Han Chen, Qiushi Li
Abstract: Previous studies in deepfake detection have shown promising results when testing face forgeries from the same dataset as the training. However, the problem remains challenging when one tries to generalize the detector to forgeries from unseen datasets and created by unseen methods. In this work, we present a novel general deepfake detection method, called \textbf{C}urricular \textbf{D}ynamic \textbf{F}orgery \textbf{A}ugmentation (CDFA), which jointly trains a deepfake detector with a forgery augmentation policy network. Unlike the previous works, we propose to progressively apply forgery augmentations following a monotonic curriculum during the training. We further propose a dynamic forgery searching strategy to select one suitable forgery augmentation operation for each image varying between training stages, producing a forgery augmentation policy optimized for better generalization. In addition, we propose a novel forgery augmentation named self-shifted blending image to simply imitate the temporal inconsistency of deepfake generation. Comprehensive experiments show that CDFA can significantly improve both cross-datasets and cross-manipulations performances of various naive deepfake detectors in a plug-and-play way, and make them attain superior performances over the existing methods in several benchmark datasets.
Authors: Xu Wu, Zhihui Lai, Zhou Jie, Can Gao, Xianxu Hou, Ya-nan Zhang, Linlin Shen
Abstract: Low-light images are commonly encountered in real-world scenarios, and numerous low-light image enhancement (LLIE) methods have been proposed to improve the visibility of these images. The primary goal of LLIE is to generate clearer images that are more visually pleasing to humans. However, the impact of LLIE methods in high-level vision tasks, such as image classification and object detection, which rely on high-quality image datasets, is not well {explored}. To explore the impact, we comprehensively evaluate LLIE methods on these high-level vision tasks by utilizing an empirical investigation comprising image classification and object detection experiments. The evaluation reveals a dichotomy: {\textit{While Low-Light Image Enhancement (LLIE) methods enhance human visual interpretation, their effect on computer vision tasks is inconsistent and can sometimes be harmful. }} Our findings suggest a disconnect between image enhancement for human visual perception and for machine analysis, indicating a need for LLIE methods tailored to support high-level vision tasks effectively. This insight is crucial for the development of LLIE techniques that align with the needs of both human and machine vision.
Authors: Sara Monji-Azad, Marvin Kinz, Claudia Scherl, David M\"annle, J\"urgen Hesser, Nikolas L\"ow
Abstract: Non-rigid point cloud registration is a crucial task in computer vision. Evaluating a non-rigid point cloud registration method requires a dataset with challenges such as large deformation levels, noise, outliers, and incompleteness. Despite the existence of several datasets for deformable point cloud registration, the absence of a comprehensive benchmark with all challenges makes it difficult to achieve fair evaluations among different methods. This paper introduces SynBench, a new non-rigid point cloud registration dataset created using SimTool, a toolset for soft body simulation in Flex and Unreal Engine. SynBench provides the ground truth of corresponding points between two point sets and encompasses key registration challenges, including varying levels of deformation, noise, outliers, and incompleteness. To the best of the authors' knowledge, compared to existing datasets, SynBench possesses three particular characteristics: (1) it is the first benchmark that provides various challenges for non-rigid point cloud registration, (2) SynBench encompasses challenges of varying difficulty levels, and (3) it includes ground truth corresponding points both before and after deformation. The authors believe that SynBench enables future non-rigid point cloud registration methods to present a fair comparison of their achievements. SynBench is publicly available at: https://doi.org/10.11588/data/R9IKCF.
Authors: Minyi Zhao, Yang Wang, Jihong Guan, Shuigeng Zhou
Abstract: Scene text recognition (STR) from high-resolution (HR) images has been significantly successful, however text reading on low-resolution (LR) images is still challenging due to insufficient visual information. Therefore, recently many scene text image super-resolution (STISR) models have been proposed to generate super-resolution (SR) images for the LR ones, then STR is done on the SR images, which thus boosts recognition performance. Nevertheless, these methods have two major weaknesses. On the one hand, STISR approaches may generate imperfect or even erroneous SR images, which mislead the subsequent recognition of STR models. On the other hand, as the STISR and STR models are jointly optimized, to pursue high recognition accuracy, the fidelity of SR images may be spoiled. As a result, neither the recognition performance nor the fidelity of STISR models are desirable. Then, can we achieve both high recognition performance and good fidelity? To this end, in this paper we propose a novel method called IMAGE (the abbreviation of Iterative MutuAl GuidancE) to effectively recognize and recover LR scene text images simultaneously. Concretely, IMAGE consists of a specialized STR model for recognition and a tailored STISR model to recover LR images, which are optimized separately. And we develop an iterative mutual guidance mechanism, with which the STR model provides high-level semantic information as clue to the STISR model for better super-resolution, meanwhile the STISR model offers essential low-level pixel clue to the STR model for more accurate recognition. Extensive experiments on two LR datasets demonstrate the superiority of our method over the existing works on both recognition performance and super-resolution fidelity.
Authors: Minyi Zhao, Jie Wang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Shuigeng Zhou
Abstract: Recent studies have shown that Vision Language Large Models (VLLMs) may output content not relevant to the input images. This problem, called the hallucination phenomenon, undoubtedly degrades VLLM performance. Therefore, various anti-hallucination techniques have been proposed to make model output more reasonable and accurate. Despite their successes, from extensive tests we found that augmenting the prompt (e.g. word appending, rewriting, and spell error etc.) may change model output and make the output hallucinate again. To cure this drawback, we propose a new instruct-tuning framework called Prompt Augmentation and Caption Utilization (PACU) to boost VLLM's generation ability under the augmented prompt scenario. Concretely, on the one hand, PACU exploits existing LLMs to augment and evaluate diverse prompts automatically. The resulting high-quality prompts are utilized to enhance VLLM's ability to process different prompts. On the other hand, PACU exploits image captions to jointly work with image features as well as the prompts for response generation. When the visual feature is inaccurate, LLM can capture useful information from the image captions for response generation. Extensive experiments on hallucination evaluation and prompt-augmented datasets demonstrate that our PACU method can work well with existing schemes to effectively boost VLLM model performance. Code is available in https://github.com/zhaominyiz/PACU.
Authors: Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao
Abstract: Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of tokens that exceed the maximum context length of LLMs, and they experience reduced visual clarity due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and introduce Visual Context Latent Summarization, which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks, despite being trained on limited image data. Moreover, Video-XL strikes a promising balance between efficiency and effectiveness, processing 1024 frames on a single 80GB GPU while achieving nearly 100\% accuracy in the Needle-in-a-Haystack evaluation. We envision Video-XL becoming a valuable tool for long video applications such as video summarization, surveillance anomaly detection, and Ad placement identification.
Authors: Jun-Yeong Moon, Jung Uk Kim, Gyeong-Moon Park
Abstract: Abstract. The advancement of deep learning has coincided with the proliferation of both models and available data. The surge in dataset sizes and the subsequent surge in computational requirements have led to the development of the Dataset Condensation (DC). While prior studies have delved into generating synthetic images through methods like distribution alignment and training trajectory tracking for more efficient model training, a significant challenge arises when employing these condensed images practically. Notably, these condensed images tend to be specific to particular models, constraining their versatility and practicality. In response to this limitation, we introduce a novel method, Heterogeneous Model Dataset Condensation (HMDC), designed to produce universally applicable condensed images through cross-model interactions. To address the issues of gradient magnitude difference and semantic distance in models when utilizing heterogeneous models, we propose the Gradient Balance Module (GBM) and Mutual Distillation (MD) with the SpatialSemantic Decomposition method. By balancing the contribution of each model and maintaining their semantic meaning closely, our approach overcomes the limitations associated with model-specific condensed images and enhances the broader utility. The source code is available in https://github.com/KHU-AGI/HMDC.
Authors: Arjun Raj, Lei Wang, Tom Gedeon
Abstract: Accurately detecting and tracking high-speed, small objects, such as balls in sports videos, is challenging due to factors like motion blur and occlusion. Although recent deep learning frameworks like TrackNetV1, V2, and V3 have advanced tennis ball and shuttlecock tracking, they often struggle in scenarios with partial occlusion or low visibility. This is primarily because these models rely heavily on visual features without explicitly incorporating motion information, which is crucial for precise tracking and trajectory prediction. In this paper, we introduce an enhancement to the TrackNet family by fusing high-level visual features with learnable motion attention maps through a motion-aware fusion mechanism, effectively emphasizing the moving ball's location and improving tracking performance. Our approach leverages frame differencing maps, modulated by a motion prompt layer, to highlight key motion regions over time. Experimental results on the tennis ball and shuttlecock datasets show that our method enhances the tracking performance of both TrackNetV2 and V3. We refer to our lightweight, plug-and-play solution, built on top of the existing TrackNet, as TrackNetV4.
Authors: Ting-Yu Chang, Seretsi Khabane Lekena
Abstract: The paper aims to address the lack of photorealistic virtual try-on models for accessories such as jewelry and watches, which are particularly relevant for online retail applications. While existing virtual try-on models focus primarily on clothing items, there is a gap in the market for accessories. This research explores the application of techniques from 2D virtual try-on models for clothing, such as VITON-HD, and integrates them with other computer vision models, notably MediaPipe Hand Landmarker. Drawing on existing literature, the study customizes and retrains a unique model using accessory-specific data and network architecture modifications to assess the feasibility of extending virtual try-on technology to accessories. Results demonstrate improved location prediction compared to the original model for clothes, even with a small dataset. This underscores the model's potential with larger datasets exceeding 10,000 images, paving the way for future research in virtual accessory try-on applications.
Authors: Maria Zafeiri, Georgios Evangelidis, Emmanouil Psarakis
Abstract: In this paper, an event-based tracker is presented. Inspired by recent advances in asynchronous processing of individual events, we develop a direct matching scheme that aligns spatial distributions of events at different times. More specifically, we adopt the Enhanced Correlation Coefficient (ECC) criterion and propose a tracking algorithm that computes a 2D motion warp per single event, called event-ECC (eECC). The complete tracking of a feature along time is cast as a \emph{single} iterative continuous optimization problem, whereby every single iteration is executed per event. The computational burden of event-wise processing is alleviated through a lightweight version that benefits from incremental processing and updating scheme. We test the proposed algorithm on publicly available datasets and we report improvements in tracking accuracy and feature age over state-of-the-art event-based asynchronous trackers.
Authors: Sining Huang, Yukun Song, Yixiao Kang, Chang Yu
Abstract: In the field of spatial computing, one of the most essential tasks is the pose estimation of 3D objects. While rigid transformations of arbitrary 3D objects are relatively hard to detect due to varying environment introducing factors like insufficient lighting or even occlusion, objects with pre-defined shapes are often easy to track, leveraging geometric constraints. Curved images, with flexible dimensions but a confined shape, are essential shapes often targeted in 3D tracking. Traditionally, proprietary algorithms often require specific curvature measures as the input along with the original flattened images to enable pose estimation for a single image target. In this paper, we propose a pipeline that can detect several logo images simultaneously and only requires the original images as the input, unlocking more effects in downstream fields such as Augmented Reality (AR).
Authors: Paul Hill, Nantheera Anantrasirichai, Alin Achim, David Bull
Abstract: The influence of atmospheric turbulence on acquired imagery makes image interpretation and scene analysis extremely difficult and reduces the effectiveness of conventional approaches for classifying and tracking objects of interest in the scene. Restoring a scene distorted by atmospheric turbulence is also a challenging problem. The effect, which is caused by random, spatially varying perturbations, makes conventional model-based approaches difficult and, in most cases, impractical due to complexity and memory requirements. Deep learning approaches offer faster operation and are capable of implementation on small devices. This paper reviews the characteristics of atmospheric turbulence and its impact on acquired imagery. It compares the performance of various state-of-the-art deep neural networks, including Transformers, SWIN and Mamba, when used to mitigate spatio-temporal image distortions.
Authors: Shunsuke Iwashita, Atom Scott, Rikuhei Umemoto, Ning Ding, Keisuke Fujii
Abstract: Ultimate is a sport in which teams of seven players compete for points by passing a disc into the end zone. A distinctive aspect of Ultimate is that the player holding the disc is unable to move, underscoring the significance of creating space to receive passes. Despite extensive research into space evaluation in sports such as football and basketball, there is a paucity of information available for Ultimate. This study focuses on the 3-on-3 format, which is widely practiced in Ultimate, and evaluates space during offensive play. The data collection process entailed the use of drones for filming and the subsequent correction of the angles for the purpose of obtaining positional data. The model is derived from the pitch control model of soccer and adapted to the rules of Ultimate, where the player holding the disc is stationary. The integration of position and distance weights with pitch control values enables the derivation of space evaluation metrics. The findings of this study indicate that movement to create space and accurate passing into that space are both significant factors in scoring. The code is available at https://github.com/shunsuke-iwashita/USO.
Authors: Chuanbo Hu, Shan Jia, Xin Li
Abstract: Tackling Urban Physical Disorder (e.g., abandoned buildings, litter, messy vegetation, graffiti) is essential, as it negatively impacts the safety, well-being, and psychological state of communities. Urban Renewal is the process of revitalizing these neglected and decayed areas within a city to improve the physical environment and quality of life for residents. Effective urban renewal efforts can transform these environments, enhancing their appeal and livability. However, current research lacks simulation tools that can quantitatively assess and visualize the impacts of renewal efforts, often relying on subjective judgments. Such tools are crucial for planning and implementing effective strategies by providing a clear visualization of potential changes and their impacts. This paper presents a novel framework addressing this gap by using human perception feedback to simulate street environment enhancement. We develop a prompt tuning approach that integrates text-driven Stable Diffusion with human perception feedback, iteratively editing local areas of street view images to better align with perceptions of beauty, liveliness, and safety. Our experiments show that this framework significantly improves perceptions of urban environments, with increases of 17.60% in safety, 31.15% in beauty, and 28.82% in liveliness. In contrast, advanced methods like DiffEdit achieve only 2.31%, 11.87%, and 15.84% improvements, respectively. We applied this framework across various virtual scenarios, including neighborhood improvement, building redevelopment, green space expansion, and community garden creation. The results demonstrate its effectiveness in simulating urban renewal, offering valuable insights for urban planning and policy-making.
Authors: Cheng-En Wu, Jinhong Lin, Yu Hen Hu, Pedro Morgado
Abstract: Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP and identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model's performance. Our work presents a comprehensive and systematic investigation of pruning tokens within the ViT backbone of CLIP models. Through our framework, we successfully reduced 40% of patch tokens in CLIP's ViT while only suffering a minimal average accuracy loss of 0.3 across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
Authors: Pritam P. Karmokar, Quan H. Nguyen, William J. Beksi
Abstract: Event cameras capture the motion of intensity gradients (edges) in the image plane in the form of rapid asynchronous events. When accumulated in 2D histograms, these events depict overlays of the edges in motion, consequently obscuring the spatial structure of the generating edges. Contrast maximization (CM) is an optimization framework that can reverse this effect and produce sharp spatial structures that resemble the moving intensity gradients by estimating the motion trajectories of the events. Nonetheless, CM is still an underexplored area of research with avenues for improvement. In this paper, we propose a novel hybrid approach that extends CM from uni-modal (events only) to bi-modal (events and edges). We leverage the underpinning concept that, given a reference time, optimally warped events produce sharp gradients consistent with the moving edge at that time. Specifically, we formalize a correlation-based objective to aid CM and provide key insights into the incorporation of multiscale and multireference techniques. Moreover, our edge-informed CM method yields superior sharpness scores and establishes new state-of-the-art event optical flow benchmarks on the MVSEC, DSEC, and ECD datasets.
Authors: Christian Wilms, Tim Rolff, Maris Hillemann, Robert Johanson, Simone Frintrop
Abstract: We propose an approach for Open-World Instance Segmentation (OWIS), a task that aims to segment arbitrary unknown objects in images by generalizing from a limited set of annotated object classes during training. Our Segment Object System (SOS) explicitly addresses the generalization ability and the low precision of state-of-the-art systems, which often generate background detections. To this end, we generate high-quality pseudo annotations based on the foundation model SAM. We thoroughly study various object priors to generate prompts for SAM, explicitly focusing the foundation model on objects. The strongest object priors were obtained by self-attention maps from self-supervised Vision Transformers, which we utilize for prompting SAM. Finally, the post-processed segments from SAM are used as pseudo annotations to train a standard instance segmentation system. Our approach shows strong generalization capabilities on COCO, LVIS, and ADE20k datasets and improves on the precision by up to 81.6% compared to the state-of-the-art. Source code is available at: https://github.com/chwilms/SOS
Authors: Sangwon Kim, Dasom Ahn, Byoung Chul Ko, In-su Jang, Kwang-Ju Kim
Abstract: The demand for reliable AI systems has intensified the need for interpretable deep neural networks. Concept bottleneck models (CBMs) have gained attention as an effective approach by leveraging human-understandable concepts to enhance interpretability. However, existing CBMs face challenges due to deterministic concept encoding and reliance on inconsistent concepts, leading to inaccuracies. We propose EQ-CBM, a novel framework that enhances CBMs through probabilistic concept encoding using energy-based models (EBMs) with quantized concept activation vectors (qCAVs). EQ-CBM effectively captures uncertainties, thereby improving prediction reliability and accuracy. By employing qCAVs, our method selects homogeneous vectors during concept encoding, enabling more decisive task performance and facilitating higher levels of human intervention. Empirical results using benchmark datasets demonstrate that our approach outperforms the state-of-the-art in both concept and task accuracy.
Authors: Gen Li
Abstract: Arbitrary artistic style transfer is a field that integrates rational academic research with emotional artistic creation. It aims to produce an image that not only features artistic characteristics of the target style but also preserves the texture structure of the content image itself. Existing style transfer methods primarily rely either on global statistics-based information or local patch-based. As a result, the generated images often either superficially apply a filter to the content image or capture extraneous semantic information from the style image, leading to a significant deviation from the global style. In this paper, we propose Affinity Enhanced-Attentional Networks (AEANet), which include a content affinity-enhanced attention (CAEA) module, style affinity-enhanced attention (SAEA) module, and hybrid attention (HA) module. The CAEA and SAEA modules first use attention to improve content and style representations with a Detail Enhanced(DE) module to reinforce fine details. Then, it aligns the global statistical information of the content and style features to fine-tune the feature information. Subsequently, the HA module adjusts the distribution of style features based on the distribution of content features. Additionally, we introduce affinity attention-based Local Dissimilarity Loss to preserve the affinities between the content and style images. Experimental results demonstrate that our approach outperforms state-of-the-art methods in arbitrary style transfer.
Authors: Ankit Dhiman, Manan Shah, Rishubh Parihar, Yash Bhalgat, Lokesh R Boregowda, R Venkatesh Babu
Abstract: We tackle the problem of generating highly realistic and plausible mirror reflections using diffusion-based generative models. We formulate this problem as an image inpainting task, allowing for more user control over the placement of mirrors during the generation process. To enable this, we create SynMirror, a large-scale dataset of diverse synthetic scenes with objects placed in front of mirrors. SynMirror contains around 198K samples rendered from 66K unique 3D objects, along with their associated depth maps, normal maps and instance-wise segmentation masks, to capture relevant geometric properties of the scene. Using this dataset, we propose a novel depth-conditioned inpainting method called MirrorFusion, which generates high-quality geometrically consistent and photo-realistic mirror reflections given an input image and a mask depicting the mirror region. MirrorFusion outperforms state-of-the-art methods on SynMirror, as demonstrated by extensive quantitative and qualitative analysis. To the best of our knowledge, we are the first to successfully tackle the challenging problem of generating controlled and faithful mirror reflections of an object in a scene using diffusion based models. SynMirror and MirrorFusion open up new avenues for image editing and augmented reality applications for practitioners and researchers alike.
Authors: Hojun Son, Arpan Kusari
Abstract: Domain adaptation for object detection (DAOD) aims to transfer a trained model from a source to a target domain. Various DAOD methods exist, some of which minimize context bias between foreground-background associations in various domains. However, no prior work has studied context bias in DAOD by analyzing changes in background features during adaptation and how context bias is represented in different domains. Our research experiment highlights the potential usability of context bias in DAOD. We address the problem by varying activation values over different layers of trained models and by masking the background, both of which impact the number and quality of detections. We then use one synthetic dataset from CARLA and two different versions of real open-source data, Cityscapes and Cityscapes foggy, as separate domains to represent and quantify context bias. We utilize different metrics such as Maximum Mean Discrepancy (MMD) and Maximum Variance Discrepancy (MVD) to find the layer-specific conditional probability estimates of foreground given manipulated background regions for separate domains. We demonstrate through detailed analysis that understanding of the context bias can affect DAOD approach and foc
Authors: Zhiyang Dou
Abstract: My research focuses on the analysis, recovery, and generation of 4D content, where 4D includes three spatial dimensions (x, y, z) and a temporal dimension t, such as shape and motion. This focus goes beyond static objects to include dynamic changes over time, providing a comprehensive understanding of both spatial and temporal variations. These techniques are critical in applications like AR/VR, embodied AI, and robotics. My research aims to make 4D content generation more efficient, accessible, and higher in quality by incorporating geometric, topological, and physical priors. I also aim to develop effective methods for 4D content recovery and analysis using these priors.
Authors: Jingtao Cao, Zheng Zhang, Hongru Wang, Kam-Fai Wong
Abstract: Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models' ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model.VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
Authors: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
Abstract: The success of visual instruction tuning has accelerated the development of large language and vision models (LLVMs). Following the scaling laws of instruction-tuned large language models (LLMs), LLVMs either have further increased their sizes, reaching 26B, 34B, and even 80B parameters. While this increase in model size has yielded significant performance gains, it demands substantially more hardware resources for both training and inference. Consequently, there naturally exists a strong need for efficient LLVMs that achieve the performance of larger models while being smaller in size. To achieve this need, we present a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters, Phantom, which significantly enhances learning capabilities within limited structures. By temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), we make LLVMs prepare to look and understand much more vision-language knowledge on the latent, without substantially increasing physical model sizes. To maximize its advantage, we introduce Phantom Optimization (PO) using both autoregressive supervised fine-tuning (SFT) and direct preference optimization (DPO)-like concept, which effectively follows correct answers while eliminating incorrect and ambiguous ones. Phantom outperforms numerous larger open- and closed-source LLVMs, positioning itself as a leading solution in the landscape of efficient LLVMs.
Authors: Di Cheng, YingJie Shi, ShiXin Sun, JiaFu Zhang, WeiJing Wang, Yu Liu
Abstract: Multimodal clothing image editing refers to the precise adjustment and modification of clothing images using data such as textual descriptions and visual images as control conditions, which effectively improves the work efficiency of designers and reduces the threshold for user design. In this paper, we propose a new image editing method ControlEdit, which transfers clothing image editing to multimodal-guided local inpainting of clothing images. We address the difficulty of collecting real image datasets by leveraging the self-supervised learning approach. Based on this learning approach, we extend the channels of the feature extraction network to ensure consistent clothing image style before and after editing, and we design an inverse latent loss function to achieve soft control over the content of non-edited areas. In addition, we adopt Blended Latent Diffusion as the sampling method to make the editing boundaries transition naturally and enforce consistency of non-edited area content. Extensive experiments demonstrate that ControlEdit surpasses baseline algorithms in both qualitative and quantitative evaluations.
Authors: Ashish Prasad, Pranav Jeevan, Amit Sethi
Abstract: Current video summarization methods largely rely on transformer-based architectures, which, due to their quadratic complexity, require substantial computational resources. In this work, we address these inefficiencies by enhancing the Direct-to-Summarize Network (DSNet) with more resource-efficient token mixing mechanisms. We show that replacing traditional attention with alternatives like Fourier, Wavelet transforms, and Nystr\"omformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Regional Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks.
Authors: Jianqi Zhang, Mengxuan Wang, Jingyao Wang, Lingyu Si, Changwen Zheng, Fanjiang Xu
Abstract: Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works that focus on convolutional neural networks have been shown to be able to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features. Thus, the model can perform learning via these regions to avoid interference. implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification and sets constraints to reduce the influence of such regions.This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.
Authors: Dasol Choi, Dongbin Na
Abstract: With the explosive growth of deep learning applications, the right to be forgotten has become increasingly in demand in various AI industries. For example, given a facial recognition system, some individuals may wish to remove images that might have been used in the training phase from the trained model. Unfortunately, modern deep neural networks sometimes unexpectedly leak personal identities. Recent studies have presented various machine unlearning algorithms to make a trained model unlearn the data to be forgotten. While these methods generally perform well in terms of forgetting scores, we have found that an unexpected modelutility drop can occur. This phenomenon, which we term correlation collapse, happens when the machine unlearning algorithms reduce the useful correlation between image features and the true label. To address this challenge, we propose Distribution-Level Feature Distancing (DLFD), a novel method that efficiently forgets instances while preventing correlation collapse. Our method synthesizes data samples so that the generated data distribution is far from the distribution of samples being forgotten in the feature space, achieving effective results within a single training epoch. Through extensive experiments on facial recognition datasets, we demonstrate that our approach significantly outperforms state-of-the-art machine unlearning methods.
Authors: Junzhuo Liu, Xuzheng Yang, Weiwei Li, Peng Wang
Abstract: Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding. Consequently, it serves as an ideal testing ground for Multi-modal Large Language Models (MLLMs). In pursuit of this goal, we have established a new REC dataset characterized by two key features: Firstly, it is designed with controllable varying levels of difficulty, necessitating multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Secondly, it includes negative text and images created through fine-grained editing and generation based on existing data, thereby testing the model's ability to correctly reject scenarios where the target object is not visible in the image--an essential aspect often overlooked in existing datasets and approaches. Utilizing this high-quality dataset, we conducted comprehensive evaluations of both state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal interaction strategies, ultimately unlocking the full potential of MLLMs. Our code and the datasets are available at https://github.com/liujunzhuo/FineCops-Ref.
Authors: Haocheng Zhao, Runwei Guan, Taoyu Wu, Ka Lok Man, Limin Yu, Yutao Yue
Abstract: 4D millimeter-wave (MMW) radar, which provides both height information and dense point cloud data over 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 1.44 in 3D and 1.72 in BEV object detection accuracy.
Authors: Stefano Puliti, Carolin Fischer, Rasmus Astrup
Abstract: This paper presents an automated pipeline for detecting tree whorls in proximally laser scanning data using a pose-estimation deep learning model. Accurate whorl detection provides valuable insights into tree growth patterns, wood quality, and offers potential for use as a biometric marker to track trees throughout the forestry value chain. The workflow processes point cloud data to create sectional images, which are subsequently used to identify keypoints representing tree whorls and branches along the stem. The method was tested on a dataset of destructively sampled individual trees, where the whorls were located along the stems of felled trees. The results demonstrated strong potential, with accurate identification of tree whorls and precise calculation of key structural metrics, unlocking new insights and deeper levels of information from individual tree point clouds.
Authors: Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh
Abstract: Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate how a VLM perceives images, specifically focusing on key elements of visual recognition, from primitive color and shape to semantic levels. To this end, we introduce a dataset named LENS to guide a VLM to follow the examination and check its readiness. Once the model is ready, we conduct the examination. Through this examination, we quantify and visualize VLMs' sensitivities to color and shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. Also, we found different shape sensitivity and semantic recognition depending on LLM's capacity despite using the same fixed visual encoder. Our analyses and findings have potential to inspire the design of VLMs and the pre-processing of visual input to VLMs for improving application performance.
Authors: Ming Li, Xueqian Jin, Xuejiao Hu, Jinghao Cao, Sidan Du, Yang Li
Abstract: Omnidirectional depth estimation has received much attention from researchers in recent years. However, challenges arise due to camera soiling and variations in camera layouts, affecting the robustness and flexibility of the algorithm. In this paper, we use the geometric constraints and redundant information of multiple 360-degree cameras to achieve robust and flexible multi-view omnidirectional depth estimation. We implement two algorithms, in which the two-stage algorithm obtains initial depth maps by pairwise stereo matching of multiple cameras and fuses the multiple depth maps to achieve the final depth estimation; the one-stage algorithm adopts spherical sweeping based on hypothetical depths to construct a uniform spherical matching cost of the multi-camera images and obtain the depth. Additionally, a generalized epipolar equirectangular projection is introduced to simplify the spherical epipolar constraints. To overcome panorama distortion, a spherical feature extractor is implemented. Furthermore, a synthetic 360-degree dataset consisting of 12K road scene panoramas and 3K ground truth depth maps is presented to train and evaluate 360-degree depth estimation algorithms. Our dataset takes soiled camera lenses and glare into consideration, which is more consistent with the real-world environment. Experiments show that our two algorithms achieve state-of-the-art performance, accurately predicting depth maps even when provided with soiled panorama inputs. The flexibility of the algorithms is experimentally validated in terms of camera layouts and numbers.
Authors: Yifan Wang, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok
Abstract: Finger vein recognition technology has become one of the primary solutions for high-security identification systems. However, it still has information leakage problems, which seriously jeopardizes users privacy and anonymity and cause great security risks. In addition, there is no work to consider a fully integrated secure finger vein recognition system. So, different from the previous systems, we integrate preprocessing and template protection into an integrated deep learning model. We propose an end-to-end cancelable finger vein network (CFVNet), which can be used to design an secure finger vein recognition system.It includes a plug-and-play BWR-ROIAlign unit, which consists of three sub-modules: Localization, Compression and Transformation. The localization module achieves automated localization of stable and unique finger vein ROI. The compression module losslessly removes spatial and channel redundancies. The transformation module uses the proposed BWR method to introduce unlinkability, irreversibility and revocability to the system. BWR-ROIAlign can directly plug into the model to introduce the above features for DCNN-based finger vein recognition systems. We perform extensive experiments on four public datasets to study the performance and cancelable biometric attributes of the CFVNet-based recognition system. The average accuracy, EERs and Dsys on the four datasets are 99.82%, 0.01% and 0.025, respectively, and achieves competitive performance compared with the state-of-the-arts.
Authors: Egor Zakharov, Vanessa Sklyarova, Michael Black, Giljoo Nam, Justus Thies, Otmar Hilliges
Abstract: We introduce a new hair modeling method that uses a dual representation of classical hair strands and 3D Gaussians to produce accurate and realistic strand-based reconstructions from multi-view data. In contrast to recent approaches that leverage unstructured Gaussians to model human avatars, our method reconstructs the hair using 3D polylines, or strands. This fundamental difference allows the use of the resulting hairstyles out-of-the-box in modern computer graphics engines for editing, rendering, and simulation. Our 3D lifting method relies on unstructured Gaussians to generate multi-view ground truth data to supervise the fitting of hair strands. The hairstyle itself is represented in the form of the so-called strand-aligned 3D Gaussians. This representation allows us to combine strand-based hair priors, which are essential for realistic modeling of the inner structure of hairstyles, with the differentiable rendering capabilities of 3D Gaussian Splatting. Our method, named Gaussian Haircut, is evaluated on synthetic and real scenes and demonstrates state-of-the-art performance in the task of strand-based hair reconstruction.
Authors: Shahid Munir Shah, Syeda Anshrah Gillani, Mirza Samad Ahmed Baig, Muhammad Aamer Saleem, Muhammad Hamzah Siddiqui
Abstract: This study investigates the use of Large Language Models (LLMs) for improved depression detection from users social media data. Through the use of fine-tuned GPT 3.5 Turbo 1106 and LLaMA2-7B models and a sizable dataset from earlier studies, we were able to identify depressed content in social media posts with a high accuracy of nearly 96.0 percent. The comparative analysis of the obtained results with the relevant studies in the literature shows that the proposed fine-tuned LLMs achieved enhanced performance compared to existing state of the-art systems. This demonstrates the robustness of LLM-based fine-tuned systems to be used as potential depression detection systems. The study describes the approach in depth, including the parameters used and the fine-tuning procedure, and it addresses the important implications of our results for the early diagnosis of depression on several social media platforms.
Authors: Andrey Moskalenko, Alexey Bryncev, Dmitry Vatolin, Radu Timofte, Gen Zhan, Li Yang, Yunlong Tang, Yiting Liao, Jiongzhi Lin, Baitao Huang, Morteza Moradi, Mohammad Moradi, Francesco Rundo, Concetto Spampinato, Ali Borji, Simone Palazzo, Yuxin Zhu, Yinan Sun, Huiyu Duan, Yuqin Cao, Ziheng Jia, Qiang Hu, Xiongkuo Min, Guangtao Zhai, Hao Fang, Runmin Cong, Xiankai Lu, Xiaofei Zhou, Wei Zhang, Chunyu Zhao, Wentao Mu, Tao Deng, Hamed R. Tavakoli
Abstract: This paper reviews the Challenge on Video Saliency Prediction at AIM 2024. The goal of the participants was to develop a method for predicting accurate saliency maps for the provided set of video sequences. Saliency maps are widely exploited in various applications, including video compression, quality assessment, visual perception studies, the advertising industry, etc. For this competition, a previously unused large-scale audio-visual mouse saliency (AViMoS) dataset of 1500 videos with more than 70 observers per video was collected using crowdsourced mouse tracking. The dataset collection methodology has been validated using conventional eye-tracking data and has shown high consistency. Over 30 teams registered in the challenge, and there are 7 teams that submitted the results in the final phase. The final phase solutions were tested and ranked by commonly used quality metrics on a private test subset. The results of this evaluation and the descriptions of the solutions are presented in this report. All data, including the private test subset, is made publicly available on the challenge homepage - https://challenges.videoprocessing.ai/challenges/video-saliency-prediction.html.
URLs: https://challenges.videoprocessing.ai/challenges/video-saliency-prediction.html.
Authors: Roman Plaud, Jose-Luis Lisani
Abstract: The widespread use of cameras in everyday life situations generates a vast amount of data that may contain sensitive information about the people and vehicles moving in front of them (location, license plates, physical characteristics, etc). In particular, people's faces are recorded by surveillance cameras in public spaces. In order to ensure the privacy of individuals, face blurring techniques can be applied to the collected videos. In this paper we present two deep-learning based options to tackle the problem. First, a direct approach, consisting of a classical object detector (based on the YOLO architecture) trained to detect faces, which are subsequently blurred. Second, an indirect approach, in which a Unet-like segmentation network is trained to output a version of the input image in which all the faces have been blurred.
Authors: Xinli Yue, Jianhui Sun, Liangchao Yao, Fan Xia, Yuetang Deng, Tianyi Wang, Lei Li, Fengyun Rao, Jing Lv, Qian Wang, Lingchen Zhao
Abstract: The increasing popularity of short video platforms such as YouTube Shorts, TikTok, and Kwai has led to a surge in User-Generated Content (UGC), which presents significant challenges for the generalization performance of Video Quality Assessment (VQA) tasks. These challenges not only affect performance on test sets but also impact the ability to generalize across different datasets. While prior research has primarily focused on enhancing feature extractors, sampling methods, and network branches, it has largely overlooked the generalization capabilities of VQA tasks. In this work, we reevaluate the VQA task from a generalization standpoint. We begin by analyzing the weight loss landscape of VQA models, identifying a strong correlation between this landscape and the generalization gaps. We then investigate various techniques to regularize the weight loss landscape. Our results reveal that adversarial weight perturbations can effectively smooth this landscape, significantly improving the generalization performance, with cross-dataset generalization and fine-tuning performance enhanced by up to 1.8% and 3%, respectively. Through extensive experiments across various VQA methods and datasets, we validate the effectiveness of our approach. Furthermore, by leveraging our insights, we achieve state-of-the-art performance in Image Quality Assessment (IQA) tasks. Our code is available at https://github.com/XinliYue/VQA-Generalization.
Authors: Aur\'elien Cecille, Stefan Duffner, Franck Davoine, Thibault Neveu, R\'emi Agier
Abstract: Monocular depth estimation has greatly improved in the recent years but models predicting metric depth still struggle to generalize across diverse camera poses and datasets. While recent supervised methods mitigate this issue by leveraging ground prior information at inference, their adaptability to self-supervised settings is limited due to the additional challenge of scale recovery. Addressing this gap, we propose in this paper a novel constraint on ground areas designed specifically for the self-supervised paradigm. This mechanism not only allows to accurately recover the scale but also ensures coherence between the depth prediction and the ground prior. Experimental results show that our method surpasses existing scale recovery techniques on the KITTI benchmark and significantly enhances model generalization capabilities. This improvement can be observed by its more robust performance across diverse camera rotations and its adaptability in zero-shot conditions with previously unseen driving datasets such as DDAD.
Authors: Gulcin Baykal, Melih Kandemir, Gozde Unal
Abstract: Disentangled representation learning aims to represent the underlying generative factors of a dataset in a latent representation independently of one another. In our work, we propose a discrete variational autoencoder (VAE) based model where the ground truth information about the generative factors are not provided to the model. We demonstrate the advantages of learning discrete representations over learning continuous representations in facilitating disentanglement. Furthermore, we propose incorporating an inductive bias into the model to further enhance disentanglement. Precisely, we propose scalar quantization of the latent variables in a latent representation with scalar values from a global codebook, and we add a total correlation term to the optimization as an inductive bias. Our method called FactorQVAE is the first method that combines optimization based disentanglement approaches with discrete representation learning, and it outperforms the former disentanglement methods in terms of two disentanglement metrics (DCI and InfoMEC) while improving the reconstruction performance. Our code can be found at \url{https://github.com/ituvisionlab/FactorQVAE}.
Authors: Md. Atiqur Rahman, Nahian Ibn Asad, Md. Mushfiqul Haque Omi, Md. Bakhtiar Hasan, Sabbir Ahmed, Md. Hasanul Kabir
Abstract: Automatic Traffic Sign Recognition is paramount in modern transportation systems, motivating several research endeavors to focus on performance improvement by utilizing large-scale datasets. As the appearance of traffic signs varies across countries, curating large-scale datasets is often impractical; and requires efficient models that can produce satisfactory performance using limited data. In this connection, we present 'FUSED-Net', built-upon Faster RCNN for traffic sign detection, enhanced by Unfrozen Parameters, Pseudo-Support Sets, Embedding Normalization, and Domain Adaptation while reducing data requirement. Unlike traditional approaches, we keep all parameters unfrozen during training, enabling FUSED-Net to learn from limited samples. The generation of a Pseudo-Support Set through data augmentation further enhances performance by compensating for the scarcity of target domain data. Additionally, Embedding Normalization is incorporated to reduce intra-class variance, standardizing feature representation. Domain Adaptation, achieved by pre-training on a diverse traffic sign dataset distinct from the target domain, improves model generalization. Evaluating FUSED-Net on the BDTSD dataset, we achieved 2.4x, 2.2x, 1.5x, and 1.3x improvements of mAP in 1-shot, 3-shot, 5-shot, and 10-shot scenarios, respectively compared to the state-of-the-art Few-Shot Object Detection (FSOD) models. Additionally, we outperform state-of-the-art works on the cross-domain FSOD benchmark under several scenarios.
Authors: Shilong Yang, Chulong Zhang, Qi Zang, Juan Yu, Liang Zeng, Xiao Luo, Yexuan Xing, Xin Pan, Qi Li, Xiaokun Liang, Yaoqin Xie
Abstract: Breast cancer has long posed a significant threat to women's health, making early screening crucial for mitigating its impact. However, mammography, the preferred method for early screening, faces limitations such as the burden of double reading by radiologists, challenges in widespread adoption in remote and underdeveloped areas, and obstacles in intelligent early screening development due to data constraints. To address these challenges, we propose a weakly supervised multi-view mammography early screening model for breast cancer based on context clustering. Context clustering, a feature extraction structure that is neither CNN nor transformer, combined with multi-view learning for information complementation, presents a promising approach. The weak supervision design specifically addresses data limitations. Our model achieves state-of-the-art performance with fewer parameters on two public datasets, with an AUC of 0.828 on the Vindr-Mammo dataset and 0.805 on the CBIS-DDSM dataset. Our model shows potential in reducing the burden on doctors and increasing the feasibility of breast cancer screening for women in underdeveloped regions.
Authors: Wenhua Dong, Xiao-Jun Wu, Zhenhua Feng, Sara Atito, Muhammad Awais, Josef Kittler
Abstract: In most existing multi-view modeling scenarios, cross-view correspondence (CVC) between instances of the same target from different views, like paired image-text data, is a crucial prerequisite for effortlessly deriving a consistent representation. Nevertheless, this premise is frequently compromised in certain applications, where each view is organized and transmitted independently, resulting in the view-unaligned problem (VuP). Restoring CVC of unaligned multi-view data is a challenging and highly demanding task that has received limited attention from the research community. To tackle this practical challenge, we propose to integrate the permutation derivation procedure into the bipartite graph paradigm for view-unaligned clustering, termed Probabilistically Aligned View-unaligned Clustering with Adaptive Template Selection (PAVuC-ATS). Specifically, we learn consistent anchors and view-specific graphs by the bipartite graph, and derive permutations applied to the unaligned graphs by reformulating the alignment between two latent representations as a 2-step transition of a Markov chain with adaptive template selection, thereby achieving the probabilistic alignment. The convergence of the resultant optimization problem is validated both experimentally and theoretically. Extensive experiments on six benchmark datasets demonstrate the superiority of the proposed PAVuC-ATS over the baseline methods.
Authors: Xinli Yue, Jianhui Sun, Han Kong, Liangchao Yao, Tianyi Wang, Lei Li, Fengyun Rao, Jing Lv, Fan Xia, Yuetang Deng, Qian Wang, Lingchen Zhao
Abstract: In recent years, AI generative models have made remarkable progress across various domains, including text generation, image generation, and video generation. However, assessing the quality of text-to-video generation is still in its infancy, and existing evaluation frameworks fall short when compared to those for natural videos. Current video quality assessment (VQA) methods primarily focus on evaluating the overall quality of natural videos and fail to adequately account for the substantial quality discrepancies between frames in generated videos. To address this issue, we propose a novel loss function that combines mean absolute error with cross-entropy loss to mitigate inter-frame quality inconsistencies. Additionally, we introduce the innovative S2CNet technique to retain critical content, while leveraging adversarial training to enhance the model's generalization capabilities. Experimental results demonstrate that our method outperforms existing VQA techniques on the AIGC Video dataset, surpassing the previous state-of-the-art by 3.1% in terms of PLCC.
Authors: Zixuan Wang, Jiayi Li, Xiaoyu Qin, Shikun Sun, Songtao Zhou, Jia Jia, Jiebo Luo
Abstract: Synthesizing camera movements from music and dance is highly challenging due to the contradicting requirements and complexities of dance cinematography. Unlike human movements, which are always continuous, dance camera movements involve both continuous sequences of variable lengths and sudden drastic changes to simulate the switching of multiple cameras. However, in previous works, every camera frame is equally treated and this causes jittering and unavoidable smoothing in post-processing. To solve these problems, we propose to integrate animator dance cinematography knowledge by formulating this task as a three-stage process: keyframe detection, keyframe synthesis, and tween function prediction. Following this formulation, we design a novel end-to-end dance camera synthesis framework \textbf{DanceCamAnimator}, which imitates human animation procedures and shows powerful keyframe-based controllability with variable lengths. Extensive experiments on the DCM dataset demonstrate that our method surpasses previous baselines quantitatively and qualitatively. Code will be available at \url{https://github.com/Carmenw1203/DanceCamAnimator-Official}.
URLs: https://github.com/Carmenw1203/DanceCamAnimator-Official
Authors: Jungeon Kim, Soongjin Kim, Jaesik Park, Seungyong Lee
Abstract: In this paper, we present a learning-based framework for sparse depth video completion. Given a sparse depth map and a color image at a certain viewpoint, our approach makes a cost volume that is constructed on depth hypothesis planes. To effectively fuse sequential cost volumes of the multiple viewpoints for improved depth completion, we introduce a learning-based cost volume fusion framework, namely RayFusion, that effectively leverages the attention mechanism for each pair of overlapped rays in adjacent cost volumes. As a result of leveraging feature statistics accumulated over time, our proposed framework consistently outperforms or rivals state-of-the-art approaches on diverse indoor and outdoor datasets, including the KITTI Depth Completion benchmark, VOID Depth Completion benchmark, and ScanNetV2 dataset, using much fewer network parameters.
Authors: Jinpeng Lin, Xulei Yang, Tianrui Li, Xun Xu
Abstract: Recognizing 3D point cloud plays a pivotal role in many real-world applications. However, deploying 3D point cloud deep learning model is vulnerable to adversarial attacks. Despite many efforts into developing robust model by adversarial training, they may become less effective against emerging attacks. This limitation motivates the development of adversarial purification which employs generative model to mitigate the impact of adversarial attacks. In this work, we highlight the remaining challenges from two perspectives. First, the purification based method requires retraining the classifier on purified samples which introduces additional computation overhead. Moreover, in a more realistic scenario, testing samples arrives in a streaming fashion and adversarial samples are not isolated from clean samples. These challenges motivates us to explore dynamically update model upon observing testing samples. We proposed a test-time purified self-training strategy to achieve this objective. Adaptive thresholding and feature distribution alignment are introduced to improve the robustness of self-training. Extensive results on different adversarial attacks suggest the proposed method is complementary to purification based method in handling continually changing adversarial attacks on the testing data stream.
Authors: Anil Osman Tur, Alessandro Conti, Cigdem Beyan, Davide Boscaini, Roberto Larcher, Stefano Messelodi, Fabio Poiesi, Elisa Ricci
Abstract: In smart retail applications, the large number of products and their frequent turnover necessitate reliable zero-shot object classification methods. The zero-shot assumption is essential to avoid the need for re-training the classifier every time a new product is introduced into stock or an existing product undergoes rebranding. In this paper, we make three key contributions. Firstly, we introduce the MIMEX dataset, comprising 28 distinct product categories. Unlike existing datasets in the literature, MIMEX focuses on fine-grained product classification and includes a diverse range of retail products. Secondly, we benchmark the zero-shot object classification performance of state-of-the-art vision-language models (VLMs) on the proposed MIMEX dataset. Our experiments reveal that these models achieve unsatisfactory fine-grained classification performance, highlighting the need for specialized approaches. Lastly, we propose a novel ensemble approach that integrates embeddings from CLIP and DINOv2 with dimensionality reduction techniques to enhance classification performance. By combining these components, our ensemble approach outperforms VLMs, effectively capturing visual cues crucial for fine-grained product discrimination. Additionally, we introduce a class adaptation method that utilizes visual prototyping with limited samples in scenarios with scarce labeled data, addressing a critical need in retail environments where product variety frequently changes. To encourage further research into zero-shot object classification for smart retail applications, we will release both the MIMEX dataset and benchmark to the research community. Interested researchers can contact the authors for details on the terms and conditions of use. The code is available: https://github.com/AnilOsmanTur/Zero-shot-Retail-Product-Classification.
URLs: https://github.com/AnilOsmanTur/Zero-shot-Retail-Product-Classification.
Authors: Yachuan Li, Xavier Soria Pomab, Yongke Xi, Guanlin Li, Chaozhi Yang, Qian Xiao, Yun Bai, Zongmin LI
Abstract: The performance of deep learning based edge detector has far exceeded that of humans, but the huge computational cost and complex training strategy hinder its further development and application. In this paper, we eliminate these complexities with a vanilla encoder-decoder based detector. Firstly, we design a bilateral encoder to decouple the extraction process of location features and semantic features. Since the location branch no longer provides cues for the semantic branch, the richness of features can be further compressed, which is the key to make our model more compact. We propose a cascaded feature fusion decoder, where the location features are progressively refined by semantic features. The refined location features are the only basis for generating the edge map. The coarse original location features and semantic features are avoided from direct contact with the final result. So the noise in the location features and the location error in the semantic features can be suppressed in the generated edge map. The proposed New Baseline for Edge Detection (NBED) achieves superior performance consistently across multiple edge detection benchmarks, even compared with those methods with huge computational cost and complex training strategy. The ODS of NBED on BSDS500 is 0.838, achieving state-of-the-art performance. Our study shows that what really matters in the current edge detection is high-quality features, and we can make the encoder-decoder based detector great again even without complex training strategies and huge computational cost. The code is available at https://github.com/Li-yachuan/NBED.
Authors: Jiashuo Li, Shaokun Wang, Bo Qian, Yuhang He, Xing Wei, Yihong Gong
Abstract: Non-exemplar class Incremental Learning (NECIL) enables models to continuously acquire new classes without retraining from scratch and storing old task exemplars, addressing privacy and storage issues. However, the absence of data from earlier tasks exacerbates the challenge of catastrophic forgetting in NECIL. In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. TSAI boosts compositionality through a patch-level adapter integration strategy, which provides a more flexible compositional solution while maintaining low computation costs. Patch-Level Model Alignment maintains feature consistency and accurate decision boundaries via two specialized mechanisms: Patch-Level Distillation Loss (PDL) and Patch-Level Feature Reconstruction method (PFR). Specifically, the PDL preserves feature-level consistency between successive models by implementing a distillation loss based on the contributions of patch tokens to new class learning. The PFR facilitates accurate classifier alignment by reconstructing old class features from previous tasks that adapt to new task knowledge. Extensive experiments validate the effectiveness of our DIA, revealing significant improvements on benchmark datasets in the NECIL setting, maintaining an optimal balance between computational complexity and accuracy. The full code implementation will be made publicly available upon the publication of this paper.
Authors: Conghao Wong, Beihao Xia, Ziqian Zou, Xinge You
Abstract: Trajectory prediction is a crucial aspect of understanding human behaviors. Researchers have made efforts to represent socially interactive behaviors among pedestrians and utilize various networks to enhance prediction capability. Unfortunately, they still face challenges not only in fully explaining and measuring how these interactive behaviors work to modify trajectories but also in modeling pedestrians' preferences to plan or participate in social interactions in response to the changeable physical environments as extra conditions. This manuscript mainly focuses on the above explainability and conditionality requirements for trajectory prediction networks. Inspired by marine animals perceiving other companions and the environment underwater by echolocation, this work constructs an angle-based conditioned social interaction representation SocialCircle+ to represent the socially interactive context and its corresponding conditions. It employs a social branch and a conditional branch to describe how pedestrians are positioned in prediction scenes socially and physically in angle-based-cyclic-sequence forms. Then, adaptive fusion is applied to fuse the above conditional clues onto the social ones to learn the final interaction representation. Experiments demonstrate the superiority of SocialCircle+ with different trajectory prediction backbones. Moreover, counterfactual interventions have been made to simultaneously verify the modeling capacity of causalities among interactive variables and the conditioning capability.
Authors: Minseung Lee, Seokha Moon, Seung Joon Lee, Jinkyu Kim
Abstract: Accurately detecting objects at long distances remains a critical challenge in 3D object detection when relying solely on LiDAR sensors due to the inherent limitations of data sparsity. To address this issue, we propose the LiDAR-Camera Augmentation Network (LCANet), a novel framework that reconstructs LiDAR point cloud data by fusing 2D image features, which contain rich semantic information, generating additional points to improve detection accuracy. LCANet fuses data from LiDAR sensors and cameras by projecting image features into the 3D space, integrating semantic information into the point cloud data. This fused data is then encoded to produce 3D features that contain both semantic and spatial information, which are further refined to reconstruct final points before bounding box prediction. This fusion effectively compensates for LiDAR's weakness in detecting objects at long distances, which are often represented by sparse points. Additionally, due to the sparsity of many objects in the original dataset, which makes effective supervision for point generation challenging, we employ a point cloud completion network to create a complete point cloud dataset that supervises the generation of dense point clouds in our network. Extensive experiments on the KITTI and Waymo datasets demonstrate that LCANet significantly outperforms existing models, particularly in detecting sparse and distant objects.
Authors: Sijia Du, Chengfeng Zhou, Suncheng Xiang, Jianwei Xu, Dahong Qian
Abstract: Objective: Depth estimation is crucial for endoscopic navigation and manipulation, but obtaining ground-truth depth maps in real clinical scenarios, such as the colon, is challenging. This study aims to develop a robust framework that generalizes well to real colonoscopy images, overcoming challenges like non-Lambertian surface reflection and diverse data distributions. Methods: We propose a framework combining a convolutional neural network (CNN) for capturing local features and a Transformer for capturing global information. An uncertainty-based fusion block was designed to enhance generalization by identifying complementary contributions from the CNN and Transformer branches. The network can be trained with simulated datasets and generalize directly to unseen clinical data without any fine-tuning. Results: Our method is validated on multiple datasets and demonstrates an excellent generalization ability across various datasets and anatomical structures. Furthermore, qualitative analysis in real clinical scenarios confirmed the robustness of the proposed method. Conclusion: The integration of local and global features through the CNN-Transformer architecture, along with the uncertainty-based fusion block, improves depth estimation performance and generalization in both simulated and real-world endoscopic environments. Significance: This study offers a novel approach to estimate depth maps for endoscopy images despite the complex conditions in clinic, serving as a foundation for endoscopic automatic navigation and other clinical tasks, such as polyp detection and segmentation.
Authors: Bulat Gabdullin, Nina Konovalova, Nikolay Patakin, Dmitry Senushkin, Anton Konushin
Abstract: Despite recent success in discriminative approaches in monocular depth estimation its quality remains limited by training datasets. Generative approaches mitigate this issue by leveraging strong priors derived from training on internet-scale datasets. Recent studies have demonstrated that large text-to-image diffusion models achieve state-of-the-art results in depth estimation when fine-tuned on small depth datasets. Concurrently, autoregressive generative approaches, such as the Visual AutoRegressive modeling~(VAR), have shown promising results in conditioned image synthesis. Following the visual autoregressive modeling paradigm, we introduce the first autoregressive depth estimation model based on the visual autoregressive transformer. Our primary contribution is DepthART -- a novel training method formulated as Depth Autoregressive Refinement Task. Unlike the original VAR training procedure, which employs static targets, our method utilizes a dynamic target formulation that enables model self-refinement and incorporates multi-modal guidance during training. Specifically, we use model predictions as inputs instead of ground truth token maps during training, framing the objective as residual minimization. Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction in the depth estimation task. The Visual Autoregressive Transformer trained with our approach on Hypersim achieves superior results on a set of unseen benchmarks compared to other generative and discriminative baselines.
Authors: Yan Xing, Qi'ao Xu, Jingcheng Zeng, Rui Huang, Sihua Gao, Weifeng Xu, Yuxiang Zhang, Wei Fan
Abstract: Semi-supervised change detection (SSCD) utilizes partially labeled data and a large amount of unlabeled data to detect changes. However, the transformer-based SSCD network does not perform as well as the convolution-based SSCD network due to the lack of labeled data. To overcome this limitation, we introduce a new decoder called Cross Branch Feature Fusion CBFF, which combines the strengths of both local convolutional branch and global transformer branch. The convolutional branch is easy to learn and can produce high-quality features with a small amount of labeled data. The transformer branch, on the other hand, can extract global context features but is hard to learn without a lot of labeled data. Using CBFF, we build our SSCD model based on a strong-to-weak consistency strategy. Through comprehensive experiments on WHU-CD and LEVIR-CD datasets, we have demonstrated the superiority of our method over seven state-of-the-art SSCD methods.
Authors: Saptarshi Saha, Utpal Garain
Abstract: This paper introduces a simple extension of mixup (Zhang et al., 2018) data augmentation to enhance generalization in visual recognition tasks. Unlike the vanilla mixup method, which blends entire images, our approach focuses on combining regions from multiple images.
Authors: Zeliang Zhang, Zhuo Liu, Mingqian Feng, Chenliang Xu
Abstract: CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empirically investigate the quantity bias in CLIP. By carefully designing different experimental settings and datasets, we comprehensively evaluate CLIP's understanding of quantity from text, image, and cross-modal perspectives. Our experimental results reveal a quantity bias in CLIP embeddings, impacting the reliability of downstream tasks.
Authors: Michal Nazarczuk, Thomas Tanay, Sibi Catley-Chandar, Richard Shaw, Radu Timofte, Eduardo P\'erez-Pellitero
Abstract: Recent developments in differentiable and neural rendering have made impressive breakthroughs in a variety of 2D and 3D tasks, e.g. novel view synthesis, 3D reconstruction. Typically, differentiable rendering relies on a dense viewpoint coverage of the scene, such that the geometry can be disambiguated from appearance observations alone. Several challenges arise when only a few input views are available, often referred to as sparse or few-shot neural rendering. As this is an underconstrained problem, most existing approaches introduce the use of regularisation, together with a diversity of learnt and hand-crafted priors. A recurring problem in sparse rendering literature is the lack of an homogeneous, up-to-date, dataset and evaluation protocol. While high-resolution datasets are standard in dense reconstruction literature, sparse rendering methods often evaluate with low-resolution images. Additionally, data splits are inconsistent across different manuscripts, and testing ground-truth images are often publicly available, which may lead to over-fitting. In this work, we propose the Sparse Rendering (SpaRe) dataset and benchmark. We introduce a new dataset that follows the setup of the DTU MVS dataset. The dataset is composed of 97 new scenes based on synthetic, high-quality assets. Each scene has up to 64 camera views and 7 lighting configurations, rendered at 1600x1200 resolution. We release a training split of 82 scenes to foster generalizable approaches, and provide an online evaluation platform for the validation and test sets, whose ground-truth images remain hidden. We propose two different sparse configurations (3 and 9 input images respectively). This provides a powerful and convenient tool for reproducible evaluation, and enable researchers easy access to a public leaderboard with the state-of-the-art performance scores. Available at: https://sparebenchmark.github.io/
Authors: Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay, Richard Shaw, Eduardo P\'erez-Pellitero, Radu Timofte, Xing Yan, Pan Wang, Yali Guo, Yongxin Wu, Youcheng Cai, Yanan Yang, Junting Li, Yanghong Zhou, P. Y. Mok, Zongqi He, Zhe Xiao, Kin-Chung Chan, Hana Lebeta Goshu, Cuixin Yang, Rongkang Dong, Jun Xiao, Kin-Man Lam, Jiayao Hao, Qiong Gao, Yanyan Zu, Junpei Zhang, Licheng Jiao, Xu Liu, Kuldeep Purohit
Abstract: This paper reviews the challenge on Sparse Neural Rendering that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2024. This manuscript focuses on the competition set-up, the proposed methods and their respective results. The challenge aims at producing novel camera view synthesis of diverse scenes from sparse image observations. It is composed of two tracks, with differing levels of sparsity; 3 views in Track 1 (very sparse) and 9 views in Track 2 (sparse). Participants are asked to optimise objective fidelity to the ground-truth images as measured via the Peak Signal-to-Noise Ratio (PSNR) metric. For both tracks, we use the newly introduced Sparse Rendering (SpaRe) dataset and the popular DTU MVS dataset. In this challenge, 5 teams submitted final results to Track 1 and 4 teams submitted final results to Track 2. The submitted models are varied and push the boundaries of the current state-of-the-art in sparse neural rendering. A detailed description of all models developed in the challenge is provided in this paper.
Authors: Guoyang Zhao, Yuxuan Liu, Weiqing Qi, Fulong Ma, Ming Liu, Jun Ma
Abstract: Accurate depth estimation is crucial for 3D scene comprehension in robotics and autonomous vehicles. Fisheye cameras, known for their wide field of view, have inherent geometric benefits. However, their use in depth estimation is restricted by a scarcity of ground truth data and image distortions. We present FisheyeDepth, a self-supervised depth estimation model tailored for fisheye cameras. We incorporate a fisheye camera model into the projection and reprojection stages during training to handle image distortions, thereby improving depth estimation accuracy and training stability. Furthermore, we incorporate real-scale pose information into the geometric projection between consecutive frames, replacing the poses estimated by the conventional pose network. Essentially, this method offers the necessary physical depth for robotic tasks, and also streamlines the training and inference procedures. Additionally, we devise a multi-channel output strategy to improve robustness by adaptively fusing features at various scales, which reduces the noise from real pose data. We demonstrate the superior performance and robustness of our model in fisheye image depth estimation through evaluations on public datasets and real-world scenarios. The project website is available at: https://github.com/guoyangzhao/FisheyeDepth.
Authors: Guoyang Zhao, Fulong Ma, Weiqing Qi, Chenguang Zhang, Yuxuan Liu, Ming Liu, Jun Ma
Abstract: Traffic sign is a critical map feature for navigation and traffic control. Nevertheless, current methods for traffic sign recognition rely on traditional deep learning models, which typically suffer from significant performance degradation considering the variations in data distribution across different regions. In this paper, we propose TSCLIP, a robust fine-tuning approach with the contrastive language-image pre-training (CLIP) model for worldwide cross-regional traffic sign recognition. We first curate a cross-regional traffic sign benchmark dataset by combining data from ten different sources. Then, we propose a prompt engineering scheme tailored to the characteristics of traffic signs, which involves specific scene descriptions and corresponding rules to generate targeted text descriptions for optimizing the model training process. During the TSCLIP fine-tuning process, we implement adaptive dynamic weight ensembling (ADWE) to seamlessly incorporate outcomes from each training iteration with the zero-shot CLIP model. This approach ensures that the model retains its ability to generalize while acquiring new knowledge about traffic signs. Our method surpasses conventional classification benchmark models in cross-regional traffic sign evaluations, and it achieves state-of-the-art performance compared to existing CLIP fine-tuning techniques. To the best knowledge of authors, TSCLIP is the first contrastive language-image model used for the worldwide cross-regional traffic sign recognition task. The project website is available at: https://github.com/guoyangzhao/TSCLIP.
Authors: Hongyi Wang, Xiuju Du, Jing Liu, Shuyi Ouyang, Yen-Wei Chen, Lanfen Lin
Abstract: The advancement of Spatial Transcriptomics (ST) has facilitated the spatially-aware profiling of gene expressions based on histopathology images. Although ST data offers valuable insights into the micro-environment of tumors, its acquisition cost remains expensive. Therefore, directly predicting the ST expressions from digital pathology images is desired. Current methods usually adopt existing regression backbones along with patch-sampling for this task, which ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images, and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that can accommodate the hierarchical structure of the pathology images via a decoupled multi-scale feature extractor. Unlike traditional models that are trained with one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon our many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We have tested M2OST on three public ST datasets and the experimental results show that M2OST can achieve state-of-the-art performance with fewer parameters and floating-point operations (FLOPs). The code will be released upon acceptance.
Authors: Tuan-Hung Vu, Eduardo Valle, Andrei Bursuc, Tommie Kerssies, Daan de Geus, Gijs Dubbelman, Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang, Tom\'a\v{s} Voj\'i\v{r}, Jan \v{S}ochman, Ji\v{r}\'i Matas, Michael Smith, Frank Ferrie, Shamik Basu, Christos Sakaridis, Luc Van Gool
Abstract: We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out-of-distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model's accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model's ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large-scale pre-training and minimal architectural design in developing robust and reliable semantic segmentation models.
Authors: Minh Bui, Kostas Alexis
Abstract: Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves State-of-the-Art performance on both the NYUv2 and SUN-RGBD datasets in general and especially in the most challenging of their image data. Our project page will be available at https://diffusionmms.github.io/
Authors: Manu Gaur, Darshan Singh S, Makarand Tapaswi
Abstract: Visual Question Answering (VQA) with multiple choice questions enables a vision-centric evaluation of Multimodal Large Language Models (MLLMs). Although it reliably checks the existence of specific visual abilities, it is easier for the model to select an answer from multiple choices (VQA evaluation) than to generate the answer itself. In this work, we offer a novel perspective: we evaluate how well an MLLM understands a specific visual concept by its ability to uniquely describe two extremely similar images that differ only in the targeted visual concept. Specifically, we assess the ability of MLLMs to capture specific points of visual differences using self-retrieval, i.e., by retrieving the target image using its generated caption against the other image in the pair serving as the distractor. We curate 247 highly similar image pairs as part of the D3 benchmark. For each image pair, the model is prompted to: (1) Detect a specific visual difference, and (2) Describe the target image uniquely such that it (3) Discriminates the target image from the distractor. Self-retrieval within D3 enables whitebox evaluation across six different visual patterns, revealing that current models struggle to independently discern fine-grained visual differences, with open-source models failing to outperform random guess.
Authors: Michael Sprintson, Rama Chellappa, Cheng Peng
Abstract: We introduce FusionRF, a novel neural rendering terrain reconstruction method from optically unprocessed satellite imagery. While previous methods depend on external pansharpening methods to fuse low resolution multispectral imagery and high resolution panchromatic imagery, FusionRF directly performs reconstruction based on optically unprocessed acquisitions with no prior knowledge. This is accomplished through the addition of a sharpening kernel which models the resolution loss in multispectral images. Additionally, novel modal embeddings allow the model to perform image fusion as a bottleneck to novel view synthesis. We evaluate our method on multispectral and panchromatic satellite images from the WorldView-3 satellite in various locations, and FusionRF outperforms previous State-of-The-Art methods in depth reconstruction on unprocessed imagery, renders sharp training and novel views, and retains multi-spectral information.
Authors: Jinze Yu, Xi Peng, Zhengda Lu, Laurent Kneip, Yiqun Wang
Abstract: A spike camera is a specialized high-speed visual sensor that offers advantages such as high temporal resolution and high dynamic range compared to conventional frame cameras. These features provide the camera with significant advantages in many computer vision tasks. However, the tasks of 3D reconstruction and novel view synthesis based on spike cameras remain underdeveloped. Although there are existing methods for learning neural radiance fields from spike stream, they either lack robustness in extremely noisy, low-quality lighting conditions or suffer from high computational complexity due to the deep fully connected neural networks and ray marching rendering strategies used in neural radiance fields, making it difficult to recover fine texture details. In contrast, the latest advancements in 3DGS have achieved high-quality real-time rendering by optimizing the point cloud representation into Gaussian ellipsoids. Building on this, we introduce SpikeGS, the first method to learn 3D Gaussian fields solely from spike stream. We designed a differentiable spike stream rendering framework based on 3DGS, incorporating noise embedding and spiking neurons. By leveraging the multi-view consistency of 3DGS and the tile-based multi-threaded parallel rendering mechanism, we achieved high-quality real-time rendering results. Additionally, we introduced a spike rendering loss function that generalizes under varying illumination conditions. Our method can reconstruct view synthesis results with fine texture details from a continuous spike stream captured by a moving spike camera, while demonstrating high robustness in extremely noisy low-light scenarios. Experimental results on both real and synthetic datasets demonstrate that our method surpasses existing approaches in terms of rendering quality and speed. Our code will be available at https://github.com/520jz/SpikeGS.
Authors: Yue Han, Junwei Zhu, Yuxiang Feng, Xiaozhong Ji, Keke He, Xiangtai Li, zhucun xue, Yong Liu
Abstract: Current diffusion-based face animation methods generally adopt a ReferenceNet (a copy of U-Net) and a large amount of curated self-acquired data to learn appearance features, as robust appearance features are vital for ensuring temporal stability. However, when trained on public datasets, the results often exhibit a noticeable performance gap in image quality and temporal consistency. To address this issue, we meticulously examine the essential appearance features in the facial animation tasks, which include motion-agnostic (e.g., clothing, background) and motion-related (e.g., facial details) texture components, along with high-level discriminative identity features. Drawing from this analysis, we introduce a Motion-Identity Modulated Appearance Learning Module (MIA) that modulates CLIP features at both motion and identity levels. Additionally, to tackle the semantic/ color discontinuities between clips, we design an Inter-clip Affinity Learning Module (ICA) to model temporal relationships across clips. Our method achieves precise facial motion control (i.e., expressions and gaze), faithful identity preservation, and generates animation videos that maintain both intra/inter-clip temporal consistency. Moreover, it easily adapts to various modalities of driving sources. Extensive experiments demonstrate the superiority of our method.
Authors: Akshay Kulkarni, Tsui-Wei Weng
Abstract: We propose a novel and low-cost test-time adversarial defense by devising interpretability-guided neuron importance ranking methods to identify neurons important to the output classes. Our method is a training-free approach that can significantly improve the robustness-accuracy tradeoff while incurring minimal computational overhead. While being among the most efficient test-time defenses (4x faster), our method is also robust to a wide range of black-box, white-box, and adaptive attacks that break previous test-time defenses. We demonstrate the efficacy of our method for CIFAR10, CIFAR100, and ImageNet-1k on the standard RobustBench benchmark (with average gains of 2.6%, 4.9%, and 2.8% respectively). We also show improvements (average 1.5%) over the state-of-the-art test-time defenses even under strong adaptive attacks.
Authors: Yuyan Chen, Yiwen Qian, Songzhou Yan, Jiyuan Jia, Zhixu Li, Yanghua Xiao, Xiaobo Li, Ming Yang, Qingpei Guo
Abstract: In the era of social media video platforms, popular ``hot-comments'' play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purpose. However, existing research predominantly focuses on generating descriptive comments or ``danmaku'' in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces \textsc{HotVCom}, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the \texttt{ComHeat} framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets.
Authors: Naghmeh Shafiee Roudbari, Ursula Eicker, Charalambos Poullis, Zachary Patterson
Abstract: Hydrometric forecasting is crucial for managing water resources, flood prediction, and environmental protection. Water stations are interconnected, and this connectivity influences the measurements at other stations. However, the dynamic and implicit nature of water flow paths makes it challenging to extract a priori knowledge of the connectivity structure. We hypothesize that terrain elevation significantly affects flow and connectivity. To incorporate this, we use LiDAR terrain elevation data encoded through a Vision Transformer (ViT). The ViT, which has demonstrated excellent performance in image classification by directly applying transformers to sequences of image patches, efficiently captures spatial features of terrain elevation. To account for both spatial and temporal features, we employ GRU blocks enhanced with graph convolution, a method widely used in the literature. We propose a hybrid graph learning structure that combines static and dynamic graph learning. A static graph, derived from transformer-encoded LiDAR data, captures terrain elevation relationships, while a dynamic graph adapts to temporal changes, improving the overall graph representation. We apply graph convolution in two layers through these static and dynamic graphs. Our method makes daily predictions up to 12 days ahead. Empirical results from multiple water stations in Quebec demonstrate that our method significantly reduces prediction error by an average of 10\% across all days, with greater improvements for longer forecasting horizons.
Authors: Rei Tamaru, Pei Li, Bin Ran
Abstract: Pedestrian trajectory prediction is essential for various applications in active traffic management, urban planning, traffic control, crowd management, and autonomous driving, aiming to enhance traffic safety and efficiency. Accurately predicting pedestrian trajectories requires a deep understanding of individual behaviors, social interactions, and road environments. Existing studies have developed various models to capture the influence of social interactions and road conditions on pedestrian trajectories. However, these approaches are limited by the lack of a comprehensive view of social interactions and road environments. To address these limitations and enhance the accuracy of pedestrian trajectory prediction, we propose a novel approach incorporating trip information as a new modality into pedestrian trajectory models. We propose RNTransformer, a generic model that utilizes crowd trip information to capture global information on social interactions. We incorporated RNTransformer with various socially aware local pedestrian trajectory prediction models to demonstrate its performance. Specifically, by leveraging a pre-trained RNTransformer when training different pedestrian trajectory prediction models, we observed improvements in performance metrics: a 1.3/2.2% enhancement in ADE/FDE on Social-LSTM, a 6.5/28.4% improvement on Social-STGCNN, and an 8.6/4.3% improvement on S-Implicit. Evaluation results demonstrate that RNTransformer significantly enhances the accuracy of various pedestrian trajectory prediction models across multiple datasets. Further investigation reveals that the RNTransformer effectively guides local models to more accurate directions due to the consideration of global information. By exploring crowd behavior within the road network, our approach shows great promise in improving pedestrian safety through accurate trajectory predictions.
Authors: Sombit Dey, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel
Abstract: Recent progress in large language models and access to large-scale robotic datasets has sparked a paradigm shift in robotics models transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community are open Vision Language Action models which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models, and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is, therefore, expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting by DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression. To overcome the aforementioned issue of visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA which requires the adaptation of the visual backbones during initial training -- to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by a factor of 77% and 66% for grasping and lifting in visual OOD tasks .
Authors: Yuanhang Li, Qi Mao, Lan Chen, Zhen Fang, Lei Tian, Xinyan Xiao, Libiao Jin, Hua Wu
Abstract: Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose \textbf{S$^2$AG-Vid}, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S$^2$AG-Vid initially applies a spatial position-based, cross-attention (CA) constraint in the early stages of the denoising process, facilitating multiple nouns distinctly attending to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns.Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.
Authors: Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, Otmar Hilliges
Abstract: While previous years have seen great progress in the 3D reconstruction of humans from monocular videos, few of the state-of-the-art methods are able to handle loose garments that exhibit large non-rigid surface deformations during articulation. This limits the application of such methods to humans that are dressed in standard pants or T-shirts. Our method, ReLoo, overcomes this limitation and reconstructs high-quality 3D models of humans dressed in loose garments from monocular in-the-wild videos. To tackle this problem, we first establish a layered neural human representation that decomposes clothed humans into a neural inner body and outer clothing. On top of the layered neural representation, we further introduce a non-hierarchical virtual bone deformation module for the clothing layer that can freely move, which allows the accurate recovery of non-rigidly deforming loose clothing. A global optimization jointly optimizes the shape, appearance, and deformations of the human body and clothing via multi-layer differentiable volume rendering. To evaluate ReLoo, we record subjects with dynamically deforming garments in a multi-view capture studio. This evaluation, both on existing and our novel dataset, demonstrates ReLoo's clear superiority over prior art on both indoor datasets and in-the-wild videos.
Authors: Yehonathan Litman, Or Patashnik, Kangle Deng, Aviral Agrawal, Rushikesh Zawar, Fernando De la Torre, Shubham Tulsiani
Abstract: Recent works in inverse rendering have shown promise in using multi-view images of an object to recover shape, albedo, and materials. However, the recovered components often fail to render accurately under new lighting conditions due to the intrinsic challenge of disentangling albedo and material properties from input images. To address this challenge, we introduce MaterialFusion, an enhanced conventional 3D inverse rendering pipeline that incorporates a 2D prior on texture and material properties. We present StableMaterial, a 2D diffusion model prior that refines multi-lit data to estimate the most likely albedo and material from given input appearances. This model is trained on albedo, material, and relit image data derived from a curated dataset of approximately ~12K artist-designed synthetic Blender objects called BlenderVault. we incorporate this diffusion prior with an inverse rendering framework where we use score distillation sampling (SDS) to guide the optimization of the albedo and materials, improving relighting performance in comparison with previous work. We validate MaterialFusion's relighting performance on 4 datasets of synthetic and real objects under diverse illumination conditions, showing our diffusion-aided approach significantly improves the appearance of reconstructed objects under novel lighting conditions. We intend to publicly release our BlenderVault dataset to support further research in this field.
Authors: Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Yu Qiao, Peng Gao, Hongsheng Li
Abstract: This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-from language instructions. To this end, we tackle a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard
Authors: Ted Edward Holmberg, Elias Ioup, Mahdi Abdelguerfi
Abstract: This paper proposes two new measures applicable in a spatial bipartite network model: coverage and coverage robustness. The bipartite network must consist of observer nodes, observable nodes, and edges that connect observer nodes to observable nodes. The coverage and coverage robustness scores evaluate the effectiveness of the observer node placements. This measure is beneficial for stochastic data as it may be coupled with Monte Carlo simulations to identify optimal placements for new observer nodes. In this paper, we construct a Geo-SpatioTemporal Bipartite Network (GSTBN) within the stochastic and dynamical environment of the Gulf of Mexico. This GSTBN consists of GCOOS sensor nodes and HYCOM Region of Interest (RoI) event nodes. The goal is to identify optimal placements to expand GCOOS to improve the forecasting outcomes by the HYCOM ocean prediction model.
Authors: Mohammad Iqbal Nouyed, Mary-Anne Hartley, Gianfranco Doretto, Donald A. Adjeroh
Abstract: This work addresses how to efficiently classify challenging histopathology images, such as gigapixel whole-slide images for cancer diagnostics with image-level annotation. We use images with annotated tumor regions to identify a set of tumor patches and a set of benign patches in a cancerous slide. Due to the variable nature of region of interest the tumor positive regions may refer to an extreme minority of the pixels. This creates an important problem during patch-level classification, where the majority of patches from an image labeled as 'cancerous' are actually tumor-free. This problem is different from semantic segmentation which associates a label to every pixel in an image, because after patch extraction we are only dealing with patch-level labels.Most existing approaches address the data imbalance issue by mitigating the data shortage in minority classes in order to prevent the model from being dominated by the majority classes. These methods include data re-sampling, loss re-weighting, margin modification, and data augmentation. In this work, we mitigate the patch-level class imbalance problem by taking a divide-and-conquer approach. First, we partition the data into sub-groups, and define three separate classification sub-problems based on these data partitions. Then, using an information-theoretic cluster-based sampling of deep image patch features, we sample discriminative patches from the sub-groups. Using these sampled patches, we build corresponding deep models to solve the new classification sub-problems. Finally, we integrate information learned from the respective models to make a final decision on the patches. Our result shows that the proposed approach can perform competitively using a very low percentage of the available patches in a given whole-slide image.
Authors: Tanya Chutani, Saikiran Bonthu, Pranab Samanta, Nitin Singhal
Abstract: Positron Emission Tomography (PET) /Computed Tomography (CT) is crucial for diagnosing, managing, and planning treatment for various cancers. Developing reliable deep learning models for the segmentation of tumor lesions in PET/CT scans in a multi-tracer multicenter environment, is a critical area of research. Different tracers, such as Fluorodeoxyglucose (FDG) and Prostate-Specific Membrane Antigen (PSMA), have distinct physiological uptake patterns and data from different centers often vary in terms of acquisition protocols, scanner types, and patient populations. Because of this variability, it becomes more difficult to design reliable segmentation algorithms and generalization techniques due to variations in image quality and lesion detectability. To address this challenge, We trained a 3D Residual encoder U-Net within the no new U-Net framework, aiming to generalize the performance of automatic lesion segmentation of whole body PET/CT scans, across different tracers and clinical sites. Further, We explored several preprocessing techniques and ultimately settled on using the Total Segmentator to crop our training data. Additionally, we applied resampling during this process. During inference, we leveraged test-time augmentations and other post-processing techniques to enhance tumor lesion segmentation. Our team currently hold the top position in the Auto-PET III challenge and outperformed the challenge baseline model in the preliminary test set with Dice score of 0.9627.
Authors: Mingxiu Sui, Jiacheng Hu, Tong Zhou, Zibo Liu, Likang Wen, Junliang Du
Abstract: This paper introduces a novel deep-learning method for the automatic detection and segmentation of lung nodules, aimed at advancing the accuracy of early-stage lung cancer diagnosis. The proposed approach leverages a unique "Channel Squeeze U-Structure" that optimizes feature extraction and information integration across multiple semantic levels of the network. This architecture includes three key modules: shallow information processing, channel residual structure, and channel squeeze integration. These modules enhance the model's ability to detect and segment small, imperceptible, or ground-glass nodules, which are critical for early diagnosis. The method demonstrates superior performance in terms of sensitivity, Dice similarity coefficient, precision, and mean Intersection over Union (IoU). Extensive experiments were conducted on the Lung Image Database Consortium (LIDC) dataset using five-fold cross-validation, showing excellent stability and robustness. The results indicate that this approach holds significant potential for improving computer-aided diagnosis systems, providing reliable support for radiologists in clinical practice and aiding in the early detection of lung cancer, especially in resource-limited settings
Authors: Abhishek Jaiswal, Nisheeth Srivastava
Abstract: Video game playing is an extremely structured domain where algorithmic decision-making can be tested without adverse real-world consequences. While prevailing methods rely on image inputs to avoid the problem of hand-crafting state space representations, this approach systematically diverges from the way humans actually learn to play games. In this paper, we design object-based input representations that generalize well across a number of video games. Using these representations, we evaluate an agent's ability to learn games similar to an infant - with limited world experience, employing simple inductive biases derived from intuitive representations of physics from the real world. Using such biases, we construct an object category representation to be used by a Q-learning algorithm and assess how well it learns to play multiple games based on observed object affordances. Our results suggest that a human-like object interaction setup capably learns to play several video games, and demonstrates superior generalizability, particularly for unfamiliar objects. Further exploring such methods will allow machines to learn in a human-centric way, thus incorporating more human-like learning benefits.
Authors: Jiaqi Guo, Santiago Lopez-Tapia, Wing Shun Li, Yunnan Wu, Marcelo Carignano, Vadim Backman, Vinayak P. Dravid, Aggelos K. Katsaggelos
Abstract: Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at certain angles, leading to distorted CT images with artifacts. This results in an ill-posed problem known as the Limited Angle Computed Tomography (LACT) reconstruction problem. To address this problem, we propose Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), which are a variant of diffusion models that characterize the diffusion process with mean-reverting (MR) stochastic differential equations. To demonstrate the generalizability of RN-SDEs, our experiments are conducted on two different LACT datasets, i.e., ChromSTEM and C4KC-KiTS. Through extensive experiments, we show that by leveraging learned Mean-Reverting SDEs as a prior and emphasizing data consistency using Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs can restore high-quality images from severe degradation and achieve state-of-the-art performance in most LACT tasks. Additionally, we present a quantitative comparison of computational complexity and runtime efficiency, highlighting the superior effectiveness of our proposed approach.
Authors: Aditya Singh, Haohan Wang
Abstract: As per recent studies, Self-supervised learning (SSL) does not readily extend to smaller architectures. One direction to mitigate this shortcoming while simultaneously training a smaller network without labels is to adopt unsupervised knowledge distillation (UKD). Existing UKD approaches handcraft preservation worthy inter/intra sample relationships between the teacher and its student. However, this may overlook/ignore other key relationships present in the mapping of a teacher. In this paper, instead of heuristically constructing preservation worthy relationships between samples, we directly motivate the student to model the teacher's embedding manifold. If the mapped manifold is similar, all inter/intra sample relationships are indirectly conserved. We first demonstrate that prior methods cannot preserve teacher's latent manifold due to their sole reliance on $L_2$ normalised embedding features. Subsequently, we propose a simple objective to capture the lost information due to normalisation. Our proposed loss component, termed \textbf{space similarity}, motivates each dimension of a student's feature space to be similar to the corresponding dimension of its teacher. We perform extensive experiments demonstrating strong performance of our proposed approach on various benchmarks.
Authors: Guohui Cai, Ying Cai, Zeyu Zhang, Daji Ergu, Yuanzhouhan Cao, Binbin Hu, Zhibin Liao, Yang Zhao
Abstract: Pulmonary nodules are critical indicators for the early diagnosis of lung cancer, making their detection essential for timely treatment. However, traditional CT imaging methods suffered from cumbersome procedures, low detection rates, and poor localization accuracy. The subtle differences between pulmonary nodules and surrounding tissues in complex lung CT images, combined with repeated downsampling in feature extraction networks, often lead to missed or false detections of small nodules. Existing methods such as FPN, with its fixed feature fusion and limited receptive field, struggle to effectively overcome these issues. To address these challenges, our paper proposed three key contributions: Firstly, we proposed MSDet, a multiscale attention and receptive field network for detecting tiny pulmonary nodules. Secondly, we proposed the extended receptive domain (ERD) strategy to capture richer contextual information and reduce false positives caused by nodule occlusion. We also proposed the position channel attention mechanism (PCAM) to optimize feature learning and reduce multiscale detection errors, and designed the tiny object detection block (TODB) to enhance the detection of tiny nodules. Lastly, we conducted thorough experiments on the public LUNA16 dataset, achieving state-of-the-art performance, with an mAP improvement of 8.8% over the previous state-of-the-art method YOLOv8. These advancements significantly boosted detection accuracy and reliability, providing a more effective solution for early lung cancer diagnosis. The code will be available at https://github.com/CaiGuoHui123/MSDet
Authors: Pranav Gupta, Raunak Sharma, Rashmi Kumari, Sri Krishna Aditya, Shwetank Choudhary, Sumit Kumar, Kanchana M, Thilagavathy R
Abstract: Environment Sound Classification has been a well-studied research problem in the field of signal processing and up till now more focus has been laid on fully supervised approaches. Over the last few years, focus has moved towards semi-supervised methods which concentrate on the utilization of unlabeled data, and self-supervised methods which learn the intermediate representation through pretext task or contrastive learning. However, both approaches require a vast amount of unlabelled data to improve performance. In this work, we propose a novel framework called Environmental Sound Classification with Hierarchical Ontology-guided semi-supervised Learning (ECHO) that utilizes label ontology-based hierarchy to learn semantic representation by defining a novel pretext task. In the pretext task, the model tries to predict coarse labels defined by the Large Language Model (LLM) based on ground truth label ontology. The trained model is further fine-tuned in a supervised way to predict the actual task. Our proposed novel semi-supervised framework achieves an accuracy improvement in the range of 1\% to 8\% over baseline systems across three datasets namely UrbanSound8K, ESC-10, and ESC-50.
Authors: Ziyu Yao
Abstract: Federated Learning (FL) is a distributed machine learning paradigm that enables collaboration among multiple clients to train a shared model without sharing raw data. However, a major challenge in FL is the label imbalance, where clients may exclusively possess certain classes while having numerous minority and missing classes. Previous works focus on optimizing local updates or global aggregation but ignore the underlying imbalanced label distribution across clients. In this paper, we propose a novel approach ReGL to address this challenge, whose key idea is to Recover the Global data distribution Locally. Specifically, each client uses generative models to synthesize images that complement the minority and missing classes, thereby alleviating label imbalance. Moreover, we adaptively fine-tune the image generation process using local real data, which makes the synthetic images align more closely with the global distribution. Importantly, both the generation and fine-tuning processes are conducted at the client-side without leaking data privacy. Through comprehensive experiments on various image classification datasets, we demonstrate the remarkable superiority of our approach over existing state-of-the-art works in fundamentally tackling label imbalance in FL.
Authors: Heng Xu, Bowen Hai, Yushun Tang, Zhihai He
Abstract: Learned Image Compression (LIC) models have achieved superior rate-distortion performance than traditional codecs. Existing LIC models use CNN, Transformer, or Mixed CNN-Transformer as basic blocks. However, limited by the shifted window attention, Swin-Transformer-based LIC exhibits a restricted growth of receptive fields, affecting the ability to model large objects in the image. To address this issue, we incorporate window partition into channel attention for the first time to obtain large receptive fields and capture more global information. Since channel attention hinders local information learning, it is important to extend existing attention mechanisms in Transformer codecs to the space-channel attention to establish multiple receptive fields, being able to capture global correlations with large receptive fields while maintaining detailed characterization of local correlations with small receptive fields. We also incorporate the discrete wavelet transform into our Spatial-Channel Hybrid (SCH) framework for efficient frequency-dependent down-sampling and further enlarging receptive fields. Experiment results demonstrate that our method achieves state-of-the-art performances, reducing BD-rate by 18.54%, 23.98%, 22.33%, and 24.71% on four standard datasets compared to VTM-23.1.
Authors: Qi Chen, Xiaohan Xing, Zhen Chen, Zhiwei Xiong
Abstract: To accelerate Magnetic Resonance (MR) imaging procedures, Multi-Contrast MR Reconstruction (MCMR) has become a prevalent trend that utilizes an easily obtainable modality as an auxiliary to support high-quality reconstruction of the target modality with under-sampled k-space measurements. The exploration of global dependency and complementary information across different modalities is essential for MCMR. However, existing methods either struggle to capture global dependency due to the limited receptive field or suffer from quadratic computational complexity. To tackle this dilemma, we propose a novel Frequency and Spatial Mutual Learning Network (FSMNet), which efficiently explores global dependencies across different modalities. Specifically, the features for each modality are extracted by the Frequency-Spatial Feature Extraction (FSFE) module, featuring a frequency branch and a spatial branch. Benefiting from the global property of the Fourier transform, the frequency branch can efficiently capture global dependency with an image-size receptive field, while the spatial branch can extract local features. To exploit complementary information from the auxiliary modality, we propose a Cross-Modal Selective fusion (CMS-fusion) module that selectively incorporate the frequency and spatial features from the auxiliary modality to enhance the corresponding branch of the target modality. To further integrate the enhanced global features from the frequency branch and the enhanced local features from the spatial branch, we develop a Frequency-Spatial fusion (FS-fusion) module, resulting in a comprehensive feature representation for the target modality. Extensive experiments on the BraTS and fastMRI datasets demonstrate that the proposed FSMNet achieves state-of-the-art performance for the MCMR task with different acceleration factors. The code is available at: https://github.com/qic999/FSMNet.
Authors: Subhajit Paul, Ashutosh Gupta
Abstract: Digital Elevation Model (DEM) is an essential aspect in the remote sensing (RS) domain to analyze various applications related to surface elevations. Here, we address the generation of high-resolution (HR) DEMs using HR multi-spectral (MX) satellite imagery as a guide by introducing a novel hybrid transformer model consisting of Densely connected Multi-Residual Block (DMRB) and multi-headed Frequency Selective Graph Attention (M-FSGA). To promptly regulate this process, we utilize the notion of discriminator spatial maps as the conditional attention to the MX guide. Further, we present a novel adversarial objective related to optimizing Sinkhorn distance with classical GAN. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. Based on our experiments on 4 different DEM datasets, we demonstrate both qualitative and quantitative comparisons with available baseline methods and show that the performance of our proposed model is superior to others with sharper details and minimal errors.
Authors: Jian Wang, Razieh Faghihpirayesh, Danny Joca, Polina Golland, Ali Gholipour
Abstract: In this paper, we introduce a Universal Motion Correction (UniMo) framework, leveraging deep neural networks to tackle the challenges of motion correction across diverse imaging modalities. Our approach employs advanced neural network architectures with equivariant filters, overcoming the limitations of current models that require iterative inference or retraining for new image modalities. UniMo enables one-time training on a single modality while maintaining high stability and adaptability for inference across multiple unseen image modalities. We developed a joint learning framework that integrates multimodal knowledge from both shape and images that faithfully improve motion correction accuracy despite image appearance variations. UniMo features a geometric deformation augmenter that enhances the robustness of global motion correction by addressing any local deformations whether they are caused by object deformations or geometric distortions, and also generates augmented data to improve the training process. Our experimental results, conducted on various datasets with four different image modalities, demonstrate that UniMo surpasses existing motion correction methods in terms of accuracy. By offering a comprehensive solution to motion correction, UniMo marks a significant advancement in medical imaging, especially in challenging applications with wide ranges of motion, such as fetal imaging. The code for this work is available online, https://github.com/IntelligentImaging/UNIMO/.
Authors: Viet Dung Nguyen, Zhizhuo Yang, Christopher L. Buckley, Alexander Ororbia
Abstract: Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate. The code in support of this work can be found at https://github.com/NACLab/robust-active-inference.
Authors: Raffaele Mineo, Amelia Sorrenti, Federica Proietto Salanitri
Abstract: Assessing the severity of stenoses in coronary angiography is critical to the patient's health, as coronary stenosis is an underlying factor in heart failure. Current practice for grading coronary lesions, i.e. fractional flow reserve (FFR) or instantaneous wave-free ratio (iFR), suffers from several drawbacks, including time, cost and invasiveness, alongside potential interobserver variability. In this context, some deep learning methods have emerged to assist cardiologists in automating the estimation of FFR/iFR values. Despite the effectiveness of these methods, their reliance on large datasets is challenging due to the distributed nature of sensitive medical data. Federated learning addresses this challenge by aggregating knowledge from multiple nodes to improve model generalization, while preserving data privacy. We propose the first federated detection transformer approach, FeDETR, to assess stenosis severity in angiography videos based on FFR/iFR values estimation. In our approach, each node trains a detection transformer (DETR) on its local dataset, with the central server federating the backbone part of the network. The proposed method is trained and evaluated on a dataset collected from five hospitals, consisting of 1001 angiographic examinations, and its performance is compared with state-of-the-art federated learning methods.
Authors: Yew Ken Chia, Qi Sun, Lidong Bing, Soujanya Poria
Abstract: Large multimodal models have demonstrated impressive problem-solving abilities in vision and language tasks, and have the potential to encode extensive world knowledge. However, it remains an open challenge for these models to perceive, reason, plan, and act in realistic environments. In this work, we introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities through more diverse and complex scenarios than previous datasets. Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans. The data encompasses diverse aspects of commonsense knowledge, physical understanding, and safety awareness. Our fine-grained analysis reveals that state-of-the-art models, including GPT-4V, face bottlenecks in visual perception, comprehension, and reasoning abilities. To address these challenges, we propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans. Experimental results demonstrate the effectiveness of our framework compared to strong baselines. Our code and dataset are available at https://embodied-planning.github.io.
Authors: Mingshen Wang, Zhao Zhang, Feng Li, Ke Xu, Kang Miao, Meng Wang
Abstract: Dynamic quantization has attracted rising attention in image super-resolution (SR) as it expands the potential of heavy SR models onto mobile devices while preserving competitive performance. Existing methods explore layer-to-bit configuration upon varying local regions, adaptively allocating the bit to each layer and patch. Despite the benefits, they still fall short in the trade-off of SR accuracy and quantization efficiency. Apart from this, adapting the quantization level for each layer individually can disturb the original inter-layer relationships, thus diminishing the representation capability of quantized models. In this work, we propose Granular-DQ, which capitalizes on the intrinsic characteristics of images while dispensing with the previous consideration for layer sensitivity in quantization. Granular-DQ conducts a multi-granularity analysis of local patches with further exploration of their information densities, achieving a distinctive patch-wise and layer-invariant dynamic quantization paradigm. Specifically, Granular-DQ initiates by developing a granularity-bit controller (GBC) to apprehend the coarse-to-fine granular representations of different patches, matching their proportional contribution to the entire image to determine the proper bit-width allocation. On this premise, we investigate the relation between bit-width and information density, devising an entropy-to-bit (E2B) mechanism that enables further fine-grained dynamic bit adaption of high-bit patches. Extensive experiments validate the superiority and generalization ability of Granular-DQ over recent state-of-the-art methods on various SR models. Code will be available at \url{https://github.com/MmmingS/Granular-DQ.git}.
Authors: Jingmou Xian, Jian Zhu, Haolin Liao, Si Li
Abstract: Sparse-view tomographic reconstruction is a pivotal direction for reducing radiation dose and augmenting clinical applicability. While many research works have proposed the reconstruction of tomographic images from sparse 2D projections, existing models tend to excessively focus on high-frequency information while overlooking low-frequency components within the sparse input images. This bias towards high-frequency information often leads to overfitting, particularly intense at edges and boundaries in the reconstructed slices. In this paper, we introduce the Frequency Regularized Neural Attenuation/Activity Field (Freq-NAF) for self-supervised sparse-view tomographic reconstruction. Freq-NAF mitigates overfitting by incorporating frequency regularization, directly controlling the visible frequency bands in the neural network input. This approach effectively balances high-frequency and low-frequency information. We conducted numerical experiments on CBCT and SPECT datasets, and our method demonstrates state-of-the-art accuracy.
Authors: Huy Hoang Nguyen, An Vuong, Anh Nguyen, Ian Reid, Minh Nhat Vu
Abstract: Grasp detection is a fundamental robotic task critical to the success of many industrial applications. However, current language-driven models for this task often struggle with cluttered images, lengthy textual descriptions, or slow inference speed. We introduce GraspMamba, a new language-driven grasp detection method that employs hierarchical feature fusion with Mamba vision to tackle these challenges. By leveraging rich visual features of the Mamba-based backbone alongside textual information, our approach effectively enhances the fusion of multimodal features. GraspMamba represents the first Mamba-based grasp detection model to extract vision and language features at multiple scales, delivering robust performance and rapid inference time. Intensive experiments show that GraspMamba outperforms recent methods by a clear margin. We validate our approach through real-world robotic experiments, highlighting its fast inference speed.
Authors: Jiachen Zhou, Mingsi Wang, Tianlin Li, Guozhu Meng, Kai Chen
Abstract: Pose-driven human image animation has achieved tremendous progress, enabling the generation of vivid and realistic human videos from just one single photo. However, it conversely exacerbates the risk of image misuse, as attackers may use one available image to create videos involving politics, violence and other illegal content. To counter this threat, we propose Dormant, a novel protection approach tailored to defend against pose-driven human image animation techniques. Dormant applies protective perturbation to one human image, preserving the visual similarity to the original but resulting in poor-quality video generation. The protective perturbation is optimized to induce misextraction of appearance features from the image and create incoherence among the generated video frames. Our extensive evaluation across 8 animation methods and 4 datasets demonstrates the superiority of Dormant over 6 baseline protection methods, leading to misaligned identities, visual distortions, noticeable artifacts, and inconsistent frames in the generated videos. Moreover, Dormant shows effectiveness on 6 real-world commercial services, even with fully black-box access.
Authors: Pablo Ramirez Amador, Dinarle Milagro Ortega, Arnold Cesarano
Abstract: Pulmonary diseases are a public health problem that requires accurate and fast diagnostic techniques. In this paper, a method based on convolutional neural networks (CNN), Data Augmentation, ResNet50 and Vision Transformers (ViT) is proposed to detect lung pathologies from medical images. A dataset of X-ray images and CT scans of patients with different lung diseases, such as cancer, pneumonia, tuberculosis and fibrosis, is used. The results obtained by the proposed method are compared with those of other existing methods, using performance metrics such as accuracy, sensitivity, specificity and area under the ROC curve. The results show that the proposed method outperforms the other methods in all metrics, achieving an accuracy of 98% and an area under the ROC curve of 99%. It is concluded that the proposed method is an effective and promising tool for the diagnosis of pulmonary pathologies by medical imaging.
Authors: Mehdi Astaraki, Simone Bendazzoli
Abstract: The automatic segmentation of pathological regions within whole-body PET-CT volumes has the potential to streamline various clinical applications such as diagno-sis, prognosis, and treatment planning. This study aims to address this challenge by contributing to the AutoPET MICCAI 2024 challenge through a proposed workflow that incorporates image preprocessing, tracer classification, and lesion segmentation steps. The implementation of this pipeline led to a significant enhancement in the segmentation accuracy of the models. This improvement is evidenced by an average overall Dice score of 0.548 across 1611 training subjects, 0.631 and 0.559 for classi-fied FDG and PSMA subjects of the training set, and 0.792 on the preliminary testing phase dataset.
Authors: Niraj Pudasaini, Muhammad Abdullah Hanif, Muhammad Shafique
Abstract: Optimizing Deep Learning-based Simultaneous Localization and Mapping (DL-SLAM) algorithms is essential for efficient implementation on resource-constrained embedded platforms, enabling real-time on-board computation in autonomous mobile robots. This paper presents SPAQ-DL-SLAM, a framework that strategically applies Structured Pruning and Quantization (SPAQ) to the architecture of one of the state-ofthe-art DL-SLAM algorithms, DROID-SLAM, for resource and energy-efficiency. Specifically, we perform structured pruning with fine-tuning based on layer-wise sensitivity analysis followed by 8-bit post-training static quantization (PTQ) on the deep learning modules within DROID-SLAM. Our SPAQ-DROIDSLAM model, optimized version of DROID-SLAM model using our SPAQ-DL-SLAM framework with 20% structured pruning and 8-bit PTQ, achieves an 18.9% reduction in FLOPs and a 79.8% reduction in overall model size compared to the DROID-SLAM model. Our evaluations on the TUM-RGBD benchmark shows that SPAQ-DROID-SLAM model surpasses the DROID-SLAM model by an average of 10.5% on absolute trajectory error (ATE) metric. Additionally, our results on the ETH3D SLAM training benchmark demonstrate enhanced generalization capabilities of the SPAQ-DROID-SLAM model, seen by a higher Area Under the Curve (AUC) score and success in 2 additional data sequences compared to the DROIDSLAM model. Despite these improvements, the model exhibits performance variance on the distinct Vicon Room sequences from the EuRoC dataset, which are captured at high angular velocities. This varying performance at some distinct scenarios suggests that designing DL-SLAM algorithms taking operating environments and tasks in consideration can achieve optimal performance and resource efficiency for deployment in resource-constrained embedded platforms.
Authors: Ninad Khargonkar, Luis Felipe Casas, Balakrishnan Prabhakaran, Yu Xiang
Abstract: We introduce a novel representation named as the unified gripper coordinate space for grasp synthesis of multiple grippers. The space is a 2D surface of a sphere in 3D using longitude and latitude as its coordinates, and it is shared for all robotic grippers. We propose a new algorithm to map the palm surface of a gripper into the unified gripper coordinate space, and design a conditional variational autoencoder to predict the unified gripper coordinates given an input object. The predicted unified gripper coordinates establish correspondences between the gripper and the object, which can be used in an optimization problem to solve the grasp pose and the finger joints for grasp synthesis. We demonstrate that using the unified gripper coordinate space improves the success rate and diversity in the grasp synthesis of multiple grippers.
Authors: Yuan Liu, Shu Wang, Zhe Qu, Xingyu Li, Shichao Kan, Jianxin Wang
Abstract: Federated Domain Generalization (FedDG) aims to train the global model for generalization ability to unseen domains with multi-domain training samples. However, clients in federated learning networks are often confined to a single, non-IID domain due to inherent sampling and temporal limitations. The lack of cross-domain interaction and the in-domain divergence impede the learning of domain-common features and limit the effectiveness of existing FedDG, referred to as the single-source FedDG (sFedDG) problem. To address this, we introduce the Federated Global Consistent Augmentation (FedGCA) method, which incorporates a style-complement module to augment data samples with diverse domain styles. To ensure the effective integration of augmented samples, FedGCA employs both global guided semantic consistency and class consistency, mitigating inconsistencies from local semantics within individual clients and classes across multiple clients. The conducted extensive experiments demonstrate the superiority of FedGCA.
Authors: Yinpei Dai, Jayjun Lee, Nima Fazeli, Joyce Chai
Abstract: Developing robust and correctable visuomotor policies for robotic manipulation is challenging due to the lack of self-recovery mechanisms from failures and the limitations of simple language instructions in guiding robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework, which combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy as an actor to predict the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLbench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks and zero-shot unseen tasks, achieving superior performance in both simulated and real world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.
Authors: Yanlin Wu, Tao Li, Zhihong Wang, Hong Kang, Along He
Abstract: U-Net is currently the most widely used architecture for medical image segmentation. Benefiting from its unique encoder-decoder architecture and skip connections, it can effectively extract features from input images to segment target regions. The commonly used U-Net is typically based on convolutional operations or Transformers, modeling the dependencies between local or global information to accomplish medical image analysis tasks. However, convolutional layers, fully connected layers, and attention mechanisms used in this process introduce a significant number of parameters, often requiring the stacking of network layers to model complex nonlinear relationships, which can impact the training process. To address these issues, we propose TransUKAN. Specifically, we have improved the KAN to reduce memory usage and computational load. On this basis, we explored an effective combination of KAN, Transformer, and U-Net structures. This approach enhances the model's capability to capture nonlinear relationships by introducing only a small number of additional parameters and compensates for the Transformer structure's deficiency in local information extraction. We validated TransUKAN on multiple medical image segmentation tasks. Experimental results demonstrate that TransUKAN achieves excellent performance with significantly reduced parameters. The code will be available athttps://github.com/wuyanlin-wyl/TransUKAN.
Authors: Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, Ayu Purwarianti
Abstract: Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating a Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they heavily rely on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we showcase how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that this proposed method achieves up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving robust quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for a more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to the problem.
Authors: Weitong Chen, Yuheng Li
Abstract: In recent years, digital watermarking techniques based on deep learning have been widely studied. To achieve both imperceptibility and robustness of image watermarks, most current methods employ convolutional neural networks to build robust watermarking frameworks. However, despite the success of CNN-based watermarking models, they struggle to achieve robustness against geometric attacks due to the limitations of convolutional neural networks in capturing global and long-range relationships. To address this limitation, we propose a robust watermarking framework based on the Swin Transformer, named RoWSFormer. Specifically, we design the Locally-Channel Enhanced Swin Transformer Block as the core of both the encoder and decoder. This block utilizes the self-attention mechanism to capture global and long-range information, thereby significantly improving adaptation to geometric distortions. Additionally, we construct the Frequency-Enhanced Transformer Block to extract frequency domain information, which further strengthens the robustness of the watermarking framework. Experimental results demonstrate that our RoWSFormer surpasses existing state-of-the-art watermarking methods. For most non-geometric attacks, RoWSFormer improves the PSNR by 3 dB while maintaining the same extraction accuracy. In the case of geometric attacks (such as rotation, scaling, and affine transformations), RoWSFormer achieves over a 6 dB improvement in PSNR, with extraction accuracy exceeding 97\%.
Authors: Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li
Abstract: The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.
Authors: Ahjol Senbi, Tianyu Huang, Fei Lyu, Qing Li, Yuhui Tao, Wei Shao, Qiang Chen, Chengyan Wang, Shuo Wang, Tao Zhou, Yizhe Zhang
Abstract: We are interested in building a ground-truth-free evaluation model to assess the quality of segmentations produced by SAM (Segment Anything Model) and its variants in medical images. This model estimates segmentation quality scores by comparing input images with their corresponding segmentation maps. Building on prior research, we frame this as a regression problem within a supervised learning framework, using Dice scores (and optionally other metrics) to compute the training loss. The model is trained using a large collection of public datasets of medical images with segmentation predictions from SAM and its variants. We name this model EvanySeg (Evaluation of Any Segmentation in Medical Images). Our exploration of convolution-based models (e.g., ResNet) and transformer-based models (e.g., ViT) revealed that ViT offers superior performance for EvanySeg. This model can be employed for various tasks, including: (1) identifying poorly segmented samples by detecting low-percentile segmentation quality scores; (2) benchmark segmentation models without ground truth by averaging scores across test samples; (3) alerting human experts during human-AI collaboration by applying a threshold within the score space; and (4) selecting the best segmentation prediction for each test sample at test time when multiple segmentation models are available, by choosing the prediction with the highest score. Models and code will be made available at https://github.com/ahjolsenbics/EvanySeg.
Authors: Guokang Wang, Hang Li, Shuyuan Zhang, Yanhong Liu, Huaping Liu
Abstract: In real-world scenarios, many robotic manipulation tasks are hindered by occlusions and limited fields of view, posing significant challenges for passive observation-based models that rely on fixed or wrist-mounted cameras. In this paper, we investigate the problem of robotic manipulation under limited visual observation and propose a task-driven asynchronous active vision-action model.Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best Pose (NBP) policy, and trains them in a sensor-motor coordination framework using few-shot reinforcement learning. This approach allows the agent to adjust a third-person camera to actively observe the environment based on the task goal, and subsequently infer the appropriate manipulation actions.We trained and evaluated our model on 8 viewpoint-constrained tasks in RLBench. The results demonstrate that our model consistently outperforms baseline algorithms, showcasing its effectiveness in handling visual constraints in manipulation tasks.
Authors: Kouki Terashima, Daiki Iwata, Kanji Tanaka
Abstract: This work explores the potential of brief inter-agent knowledge transfer (KT) to enhance the robotic object goal navigation (ON) in unseen and unfamiliar environments. Drawing on the analogy of human travelers acquiring local knowledge, we propose a framework in which a traveler robot (student) communicates with local robots (teachers) to obtain ON knowledge through minimal interactions. We frame this process as a data-free continual learning (CL) challenge, aiming to transfer knowledge from a black-box model (teacher) to a new model (student). In contrast to approaches like zero-shot ON using large language models (LLMs), which utilize inherently communication-friendly natural language for knowledge representation, the other two major ON approaches -- frontier-driven methods using object feature maps and learning-based ON using neural state-action maps -- present complex challenges where data-free KT remains largely uncharted. To address this gap, we propose a lightweight, plug-and-play KT module targeting non-cooperative black-box teachers in open-world settings. Using the universal assumption that every teacher robot has vision and mobility capabilities, we define state-action history as the primary knowledge base. Our formulation leads to the development of a query-based occupancy map that dynamically represents target object locations, serving as an effective and communication-friendly knowledge representation. We validate the effectiveness of our method through experiments conducted in the Habitat environment.
Authors: Hong Chen, Xin Wang, Yuwei Zhou, Bin Huang, Yipeng Zhang, Wei Feng, Houlun Chen, Zeyang Zhang, Siao Tang, Wenwu Zhu
Abstract: Multi-modal generative AI has received increasing attention in both academia and industry. Particularly, two dominant families of techniques are: i) The multi-modal large language model (MLLM) such as GPT-4V, which shows impressive ability for multi-modal understanding; ii) The diffusion model such as Sora, which exhibits remarkable multi-modal powers, especially with respect to visual generation. As such, one natural question arises: Is it possible to have a unified model for both understanding and generation? To answer this question, in this paper, we first provide a detailed review of both MLLM and diffusion models, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video large language models as well as text-to-image/video generation. Then, we discuss the two important questions on the unified model: i) whether the unified model should adopt the auto-regressive or diffusion probabilistic modeling, and ii) whether the model should utilize a dense architecture or the Mixture of Experts(MoE) architectures to better support generation and understanding, two objectives. We further provide several possible strategies for building a unified model and analyze their potential advantages and disadvantages. We also summarize existing large-scale multi-modal datasets for better model pretraining in the future. To conclude the paper, we present several challenging future directions, which we believe can contribute to the ongoing advancement of multi-modal generative AI.
Authors: Furkan Pala, Mehmet Yasin Akp{\i}nar, Onur Deniz, G\"ul\c{s}en Eryi\u{g}it
Abstract: Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid previously explored on semi-structured documents) for unstructured financial documents, by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.
Authors: Qingyu Chen, Tiarnan D L Keenan, Elvira Agron, Alexis Allot, Emily Guan, Bryant Duong, Amr Elsawy, Benjamin Hou, Cancan Xue, Sanjeeb Bhandari, Geoffrey Broadhead, Chantal Cousineau-Krieger, Ellen Davis, William G Gensheimer, David Grasic, Seema Gupta, Luis Haddock, Eleni Konstantinou, Tania Lamba, Michele Maiberger, Dimosthenis Mantopoulos, Mitul C Mehta, Ayman G Nahri, Mutaz AL-Nawaflh, Arnold Oshinsky, Brittany E Powell, Boonkit Purt, Soo Shin, Hillary Stiefel, Alisa T Thavikulwat, Keith James Wroblewski, Tham Yih Chung, Chui Ming Gemmy Cheung, Ching-Yu Cheng, Emily Y Chew, Michelle R. Hribar, Michael F. Chiang, Zhiyong Lu
Abstract: Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diagnosis and severity classification. We designed and implemented an AI-assisted diagnostic workflow for AMD, comparing diagnostic performance with and without AI assistance among 24 clinicians from 12 institutions with real patient data sampled from the Age-Related Eye Disease Study (AREDS). Additionally, we demonstrated continual enhancement of an existing AI model by incorporating approximately 40,000 additional medical images (named AREDS2 dataset). The improved model was then systematically evaluated using both AREDS and AREDS2 test sets, as well as an external test set from Singapore. AI assistance markedly enhanced diagnostic accuracy and classification for 23 out of 24 clinicians, with the average F1-score increasing by 20% from 37.71 (Manual) to 45.52 (Manual + AI) (P-value < 0.0001), achieving an improvement of over 50% in some cases. In terms of efficiency, AI assistance reduced diagnostic times for 17 out of the 19 clinicians tracked, with time savings of up to 40%. Furthermore, a model equipped with continual learning showed robust performance across three independent datasets, recording a 29% increase in accuracy, and elevating the F1-score from 42 to 54 in the Singapore population.
Authors: Bel\'en Serrano-Ant\'on, Mubashara Rehman, Niki Martinel, Michele Avanzo, Riccardo Spizzo, Giuseppe Fanetti, Alberto P. Mu\~nuzuri, Christian Micheloni
Abstract: For the planning of radiotherapy treatments for head and neck cancers, Computed Tomography (CT) scans of the patients are typically employed. However, in patients with head and neck cancer, the quality of standard CT scans generated using kilo-Voltage (kVCT) tube potentials is severely degraded by streak artifacts occurring in the presence of metallic implants such as dental fillings. Some radiotherapy devices offer the possibility of acquiring Mega-Voltage CT (MVCT) for daily patient setup verification, due to the higher energy of X-rays used, MVCT scans are almost entirely free from artifacts making them more suitable for radiotherapy treatment planning. In this study, we leverage the advantages of kVCT scans with those of MVCT scans (artifact-free). We propose a deep learning-based approach capable of generating artifact-free MVCT images from acquired kVCT images. The outcome offers the benefits of artifact-free MVCT images with enhanced soft tissue contrast, harnessing valuable information obtained through kVCT technology for precise therapy calibration. Our proposed method employs UNet-inspired model, and is compared with adversarial learning and transformer networks. This first and unique approach achieves remarkable success, with PSNR of 30.02 dB across the entire patient volume and 27.47 dB in artifact-affected regions exclusively. It is worth noting that the PSNR calculation excludes the background, concentrating solely on the region of interest.
Authors: Sunny Gupta, Mohit, Pankhi Kashyap, Pranav Jeevan, Amit Sethi
Abstract: Federated learning faces a critical challenge in balancing communication efficiency with rapid convergence, especially for second-order methods. While Newton-type algorithms achieve linear convergence in communication rounds, transmitting full Hessian matrices is often impractical due to quadratic complexity. We introduce Federated Learning with Enhanced Nesterov-Newton Sketch (FLeNS), a novel method that harnesses both the acceleration capabilities of Nesterov's method and the dimensionality reduction benefits of Hessian sketching. FLeNS approximates the centralized Newton's method without relying on the exact Hessian, significantly reducing communication overhead. By combining Nesterov's acceleration with adaptive Hessian sketching, FLeNS preserves crucial second-order information while preserving the rapid convergence characteristics. Our theoretical analysis, grounded in statistical learning, demonstrates that FLeNS achieves super-linear convergence rates in communication rounds - a notable advancement in federated optimization. We provide rigorous convergence guarantees and characterize tradeoffs between acceleration, sketch size, and convergence speed. Extensive empirical evaluation validates our theoretical findings, showcasing FLeNS's state-of-the-art performance with reduced communication requirements, particularly in privacy-sensitive and edge-computing scenarios. The code is available at https://github.com/sunnyinAI/FLeNS
Authors: Hong-fu Chou, Vu Nguyen Ha, Prabhu Thiruvasagam, Thanh-Dung Le, Geoffrey Eappen, Ti Ti Nguyen, Luis M. Garces-Socarras, Jorge L. Gonzalez-Rios, Juan Carlos Merlano-Duncan, Symeon Chatzinotas
Abstract: Earth Observation (EO) systems play a crucial role in achieving Sustainable Development Goals by collecting and analyzing vital global data through satellite networks. These systems are essential for tasks like mapping, disaster monitoring, and resource management, but they face challenges in processing and transmitting large volumes of EO data, especially in specialized fields such as agriculture and real-time disaster response. Domain-adapted Large Language Models (LLMs) provide a promising solution by facilitating data fusion between extensive EO data and semantic EO data. By improving integration and interpretation of diverse datasets, LLMs address the challenges of processing specialized information in agriculture and disaster response applications. This fusion enhances the accuracy and relevance of transmitted data. This paper presents a framework for semantic communication in EO satellite networks, aimed at improving data transmission efficiency and overall system performance through cognitive processing techniques. The proposed system employs Discrete-Task-Oriented Source-Channel Coding (DT-JSCC) and Semantic Data Augmentation (SA) to focus on relevant information while minimizing communication overhead. By integrating cognitive semantic processing and inter-satellite links, the framework enhances the analysis and transmission of multispectral satellite imagery, improving object detection, pattern recognition, and real-time decision-making. The introduction of Cognitive Semantic Augmentation (CSA) allows satellites to process and transmit semantic information, boosting adaptability to changing environments and application needs. This end-to-end architecture is tailored for next-generation satellite networks, such as those supporting 6G, and demonstrates significant improvements in efficiency and accuracy.
Authors: Miguel Arturo Vega Torres, Fabian Pfitzner
Abstract: Robot dogs are receiving increasing attention in various fields of research. However, the number of studies investigating their potential usability on construction sites is scarce. The construction industry implies several human resource-demanding tasks such as safety monitoring, material transportation, and site inspections. Robot dogs can address some of these challenges by providing automated support and lowering manual effort. In this paper, we investigate the potential usability of currently available robot dogs on construction sites in terms of focusing on their different specifications and on-site requirements to support data acquisition. In addition, we conducted a real-world experiment on a large-scale construction site using a quadruped robot. In conclusion, we consider robot dogs to be a valuable asset for monitoring intricate construction environments in the future, particularly as their limitations are mitigated through technical advancements.
Authors: Shyam Sundar Kannan, Byung-Cheol Min
Abstract: Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time-consuming process. To overcome this, we propose ZeroSCD, a zero-shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre-existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. These are then combined with segmentation results from the semantic segmentation model to precisely delineate the boundaries of the detected changes. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state-of-the-art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, proving its effectiveness and adaptability across different scenarios.
Authors: Tarun Kalluri, Sreyas Ravichandran, Manmohan Chandraker
Abstract: In this work, we take a deeper look into the diverse factors that influence the efficacy of modern unsupervised domain adaptation (UDA) methods using a large-scale, controlled empirical study. To facilitate our analysis, we first develop UDA-Bench, a novel PyTorch framework that standardizes training and evaluation for domain adaptation enabling fair comparisons across several UDA methods. Using UDA-Bench, our comprehensive empirical study into the impact of backbone architectures, unlabeled data quantity, and pre-training datasets reveals that: (i) the benefits of adaptation methods diminish with advanced backbones, (ii) current methods underutilize unlabeled data, and (iii) pre-training data significantly affects downstream adaptation in both supervised and self-supervised settings. In the context of unsupervised adaptation, these observations uncover several novel and surprising properties, while scientifically validating several others that were often considered empirical heuristics or practitioner intuitions in the absence of a standardized training and evaluation framework. The UDA-Bench framework and trained models are publicly available at https://github.com/ViLab-UCSD/UDABench_ECCV2024.
Authors: Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, Chenghua Lin
Abstract: Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) open-source OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) the baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at https://m-a-p.ai/OmniBench.
Authors: Hamid Majidi Balanji
Abstract: In this work, a simple vision algorithm is designed and implemented to extract and identify the surface defects on the Golden Delicious apples caused by the enzymic browning process. 34 Golden Delicious apples were selected for the experiments, of which 17 had enzymic browning defects and the other 17 were sound. The image processing part of the proposed vision algorithm extracted the defective surface area of the apples with high accuracy of 97.15%. The area and mean of the segmented images were selected as the 2x1 feature vectors to feed into a designed artificial neural network. The analysis based on the above features indicated that the images with a mean less than 0.0065 did not belong to the defective apples; rather, they were extracted as part of the calyx and stem of the healthy apples. The classification accuracy of the neural network applied in this study was 99.19%
Authors: Wenhao Li, Mengyuan Liu, Hong Liu, Tianyu Guo, Ti Wang, Hao Tang, Nicu Sebe
Abstract: Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models are not good at capturing local details and lack prior knowledge of human body configurations, which limits their modeling power for skeletal representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-Like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of human bodies into an MLP model to meet the domain-specific demand of the 3D human pose, while allowing for both local and global spatial interactions. Furthermore, we propose to flexibly and efficiently extend the GraphMLP to the video domain and show that complex temporal dynamics can be effectively modeled in a simple way with negligible computational cost gains in the sequence length. To the best of our knowledge, this is the first MLP-Like architecture for 3D human pose estimation in a single frame and a video sequence. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. Code and models are available at https://github.com/Vegetebird/GraphMLP.
Authors: Hongbing Shang, Qixiu Yang, Chuang Sun, Xuefeng Chen, Ruqiang Yan
Abstract: Aero-engine is the core component of aircraft and other spacecraft. The high-speed rotating blades provide power by sucking in air and fully combusting, and various defects will inevitably occur, threatening the operation safety of aero-engine. Therefore, regular inspections are essential for such a complex system. However, existing traditional technology which is borescope inspection is labor-intensive, time-consuming, and experience-dependent. To endow this technology with intelligence, a novel superpixel perception graph neural network (SPGNN) is proposed by utilizing a multi-stage graph convolutional network (MSGCN) for feature extraction and superpixel perception region proposal network (SPRPN) for region proposal. First, to capture complex and irregular textures, the images are transformed into a series of patches, to obtain their graph representations. Then, MSGCN composed of several GCN blocks extracts graph structure features and performs graph information processing at graph level. Last but not least, the SPRPN is proposed to generate perceptual bounding boxes by fusing graph representation features and superpixel perception features. Therefore, the proposed SPGNN always implements feature extraction and information transmission at the graph level in the whole SPGNN pipeline, to alleviate the reduction of receptive field and information loss. To verify the effectiveness of SPGNN, we construct a simulated blade dataset with 3000 images. A public aluminum dataset is also used to validate the performances of different methods. The experimental results demonstrate that the proposed SPGNN has superior performance compared with the state-of-the-art methods.
Authors: Haotian Bai, Yuanhuiyi Lyu, Lutao Jiang, Sijia Li, Haonan Lu, Xiaodong Lin, Lin Wang
Abstract: Text-to-3D form plays a crucial role in creating editable 3D scenes for AR/VR. Recent advances have shown promise in merging neural radiance fields (NeRFs) with pre-trained diffusion models for text-to-3D object generation. However, one enduring challenge is their inadequate capability to accurately parse and regenerate consistent multi-object environments. Specifically, these models encounter difficulties in accurately representing quantity and style prompted by multi-object texts, often resulting in a collapse of the rendering fidelity that fails to match the semantic intricacies. Moreover, amalgamating these elements into a coherent 3D scene is a substantial challenge, stemming from generic distribution inherent in diffusion models. To tackle the issue of 'guidance collapse' and further enhance scene consistency, we propose a novel framework, dubbed CompoNeRF, by integrating an editable 3D scene layout with object-specific and scene-wide guidance mechanisms. It initiates by interpreting a complex text into the layout populated with multiple NeRFs, each paired with a corresponding subtext prompt for precise object depiction. Next, a tailored composition module seamlessly blends these NeRFs, promoting consistency, while the dual-level text guidance reduces ambiguity and boosts accuracy. Noticeably, our composition design permits decomposition. This enables flexible scene editing and recomposition into new scenes based on the edited layout or text prompts. Utilizing the open-source Stable Diffusion model, CompoNeRF generates multi-object scenes with high fidelity. Remarkably, our framework achieves up to a \textbf{54\%} improvement by the multi-view CLIP score metric. Our user study indicates that our method has significantly improved semantic accuracy, multi-view consistency, and individual recognizability for multi-object scene generation.
Authors: Aditya Prakash, Matthew Chang, Matthew Jin, Ruisen Tu, Saurabh Gupta
Abstract: Prior works for reconstructing hand-held objects from a single image train models on images paired with 3D shapes. Such data is challenging to gather in the real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of a) in-the-wild raw video data showing hand-object interactions and b) synthetic 3D shape collections. In this paper, we propose modules to leverage 3D supervision from these sources to scale up the learning of models for reconstructing hand-held objects. Specifically, we extract multiview 2D mask supervision from videos and 3D shape priors from shape collections. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments in the challenging object generalization setting on in-the-wild MOW dataset show 11.6% relative improvement over models trained with 3D supervision on existing datasets.
Authors: Shaunak Halbe, James Seale Smith, Junjiao Tian, Zsolt Kira
Abstract: In this paper, we focus on the important yet understudied problem of Continual Federated Learning (CFL), where a server communicates with a set of clients to incrementally learn new concepts over time without sharing or storing any data. The complexity of this problem is compounded by challenges from both the Continual and Federated Learning perspectives. Specifically, models trained in a CFL setup suffer from catastrophic forgetting which is exacerbated by data heterogeneity across clients. Existing attempts at this problem tend to impose large overheads on clients and communication channels or require access to stored data which renders them unsuitable for real-world use due to privacy. In this paper, we attempt to tackle forgetting and heterogeneity while minimizing overhead costs and without requiring access to any stored data. We study this problem in the context of Vision Transformers and explore parameter-efficient approaches to adapt to dynamic distributions while minimizing forgetting. We achieve this by leveraging a prompting based approach (such that only prompts and classifier heads have to be communicated) and proposing a novel and lightweight generation and distillation scheme to consolidate client models at the server. We formulate this problem for image classification and establish strong baselines for comparison, conduct experiments on CIFAR-100 as well as challenging, large-scale datasets like ImageNet-R and DomainNet. Our approach outperforms both existing methods and our own baselines by as much as 7% while significantly reducing communication and client-level computation costs. Code available at https://github.com/shaunak27/hepco-fed.
Authors: Seunghyeok Back, Sangbeom Lee, Kangmin Kim, Joosoon Lee, Sungho Shin, Jemo Maeng, Kyoobin Lee
Abstract: Accurate and efficient segmentation of unknown objects in unstructured environments is essential for robotic manipulation. Unknown Object Instance Segmentation (UOIS), which aims to identify all objects in unknown categories and backgrounds, has become a key capability for various robotic tasks. However, current methods struggle with over-segmentation and under-segmentation, leading to failures in manipulation tasks such as grasping. To address these challenges, we propose QuBER (Quadruple Boundary Error Refinement), a novel error-informed refinement approach for high-quality UOIS. QuBER first estimates quadruple boundary errors-true positive, true negative, false positive, and false negative pixels-at the instance boundaries of the initial segmentation. It then refines the segmentation using an error-guided fusion mechanism, effectively correcting both fine-grained and instance-level segmentation errors. Extensive evaluations on three public benchmarks demonstrate that QuBER outperforms state-of-the-art methods and consistently improves various UOIS techniques while maintaining a fast inference time of less than 0.1 seconds. Additionally, we demonstrate that QuBER improves the success rate of grasping target objects in cluttered environments. Code and supplementary materials are available at https://sites.google.com/view/uois-quber.
Authors: Vaibhav Vavilala, Faaris Shaik, David Forsyth
Abstract: We demonstrate an image dequantizing diffusion model that enables novel image edits on natural images. We propose operating on quantized images because they offer easy abstraction for patch-based edits and palette transfer. In particular, we show that color palettes can make the output of the diffusion model easier to control and interpret. We first establish that existing image restoration methods are not sufficient, such as JPEG noise reduction models. We then demonstrate that our model can generate natural images that respect the color palette the user asked for. For palette transfer, we propose a method based on weighted bipartite matching. We then show that our model generates plausible images even after extreme palette transfers, respecting user query. Our method can optionally condition on the source texture in part or all of the image. In doing so, we overcome a common problem in existing image colorization methods that are unable to produce colors with a different luminance than the input. We evaluate several possibilities for texture conditioning and their trade-offs, including luminance, image gradients, and thresholded gradients, the latter of which performed best in maintaining texture and color control simultaneously. Our method can be usefully extended to another practical edit: recoloring patches of an image while respecting the source texture. Our procedure is supported by several qualitative and quantitative evaluations.
Authors: Lei Wang, Piotr Koniusz, Tom Gedeon, Liang Zheng
Abstract: In contrastive learning, two views of an original image, generated by different augmentations, are considered a positive pair, and their similarity is required to be high. Similarly, two views of distinct images form a negative pair, with encouraged low similarity. Typically, a single similarity measure, provided by a lone projection head, evaluates positive and negative sample pairs. However, due to diverse augmentation strategies and varying intra-sample similarity, views from the same image may not always be similar. Additionally, owing to inter-sample similarity, views from different images may be more akin than those from the same image. Consequently, enforcing high similarity for positive pairs and low similarity for negative pairs may be unattainable, and in some cases, such enforcement could detrimentally impact performance. To address this challenge, we propose using multiple projection heads, each producing a distinct set of features. Our pre-training loss function emerges from a solution to the maximum likelihood estimation over head-wise posterior distributions of positive samples given observations. This loss incorporates the similarity measure over positive and negative pairs, each re-weighted by an individual adaptive temperature, regulated to prevent ill solutions. Our approach, Adaptive Multi-Head Contrastive Learning (AMCL), can be applied to and experimentally enhances several popular contrastive learning methods such as SimCLR, MoCo, and Barlow Twins. The improvement remains consistent across various backbones and linear probing epochs, and becomes more significant when employing multiple augmentation methods.
Authors: Xidong Peng, Runnan Chen, Feng Qiao, Lingdong Kong, Youquan Liu, Yujing Sun, Tai Wang, Xinge Zhu, Yuexin Ma
Abstract: Unsupervised domain adaptation (UDA) in 3D segmentation tasks presents a formidable challenge, primarily stemming from the sparse and unordered nature of point cloud data. Especially for LiDAR point clouds, the domain discrepancy becomes obvious across varying capture scenes, fluctuating weather conditions, and the diverse array of LiDAR devices in use. While previous UDA methodologies have often sought to mitigate this gap by aligning features between source and target domains, this approach falls short when applied to 3D segmentation due to the substantial domain variations. Inspired by the remarkable generalization capabilities exhibited by the vision foundation model, SAM, in the realm of image segmentation, our approach leverages the wealth of general knowledge embedded within SAM to unify feature representations across diverse 3D domains and further solves the 3D domain adaptation problem. Specifically, we harness the corresponding images associated with point clouds to facilitate knowledge transfer and propose an innovative hybrid feature augmentation methodology, which significantly enhances the alignment between the 3D feature space and SAM's feature space, operating at both the scene and instance levels. Our method is evaluated on many widely-recognized datasets and achieves state-of-the-art performance.
Authors: Zhuoxiao Chen, Yadan Luo, Zixin Wang, Zijian Wang, Xin Yu, Zi Huang
Abstract: LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds with new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation costs. Our empirical results show that OLC successfully adapts the 3D detection model to the open world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproducing and future research, supporting 15 baseline methods (\textit{i.e.}, active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (\textit{i.e.}, one-stage SECOND and two-stage PV-RCNN) and 3 benchmark 3D datasets (\textit{i.e.}, KITTI, nuScenes and Waymo). Extensive experiments evidence that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and known classes with very limited labeling costs, compared to state-of-the-art baselines. Source code is available at \url{https://github.com/Luoyadan/CRB-active-3Ddet/tree/Open-CRB}.
URLs: https://github.com/Luoyadan/CRB-active-3Ddet/tree/Open-CRB
Authors: Xinyu Zhang, Weiyu Sun, Hao Lu, Ying Chen, Yun Ge, Xiaolin Huang, Jie Yuan, Yingcong Chen
Abstract: Remote photoplethysmography (rPPG) is a noninvasive technique that aims to capture subtle variations in facial pixels caused by changes in blood volume resulting from cardiac activities. Most existing unsupervised methods for rPPG tasks focus on the contrastive learning between samples while neglecting the inherent self-similar prior in physiological signals. In this paper, we propose a Self-Similarity Prior Distillation (SSPD) framework for unsupervised rPPG estimation, which capitalizes on the intrinsic self-similarity of cardiac activities. Specifically, we first introduce a physical-prior embedded augmentation technique to mitigate the effect of various types of noise. Then, we tailor a self-similarity-aware network to extract more reliable self-similar physiological features. Finally, we develop a hierarchical self-distillation paradigm to assist the network in disentangling self-similar physiological patterns from facial videos. Comprehensive experiments demonstrate that the unsupervised SSPD framework achieves comparable or even superior performance compared to the state-of-the-art supervised methods. Meanwhile, SSPD maintains the lowest inference time and computation cost among end-to-end models.
Authors: Haiyu Wu, Sicong Tian, Huayu Li, Kevin W. Bowyer
Abstract: Ensuring logical consistency in predictions is a crucial yet overlooked aspect in multi-attribute classification. We explore the potential reasons for this oversight and introduce two pressing challenges to the field: 1) How can we ensure that a model, when trained with data checked for logical consistency, yields predictions that are logically consistent? 2) How can we achieve the same with data that hasn't undergone logical consistency checks? Minimizing manual effort is also essential for enhancing automation. To address these challenges, we introduce two datasets, FH41K and CelebA-logic, and propose LogicNet, an adversarial training framework that learns the logical relationships between attributes. Accuracy of LogicNet surpasses that of the next-best approach by 23.05%, 9.96%, and 1.71% on FH37K, FH41K, and CelebA-logic, respectively. In real-world case analysis, our approach can achieve a reduction of more than 50% in the average number of failed cases compared to other methods.
Authors: Yan Yang, Liyuan Pan, Liu Liu
Abstract: This paper introduces a self-supervised learning framework designed for pre-training neural networks tailored to dense prediction tasks using event camera data. Our approach utilizes solely event data for training. Transferring achievements from dense RGB pre-training directly to event camera data yields subpar performance. This is attributed to the spatial sparsity inherent in an event image (converted from event data), where many pixels do not contain information. To mitigate this sparsity issue, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features. For training our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns. Transfer learning performance on downstream dense prediction tasks illustrates the superiority of our method over state-of-the-art approaches.
Authors: Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, Lu Hou
Abstract: The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research.
Authors: Aditya Prakash, Ruisen Tu, Matthew Chang, Saurabh Gupta
Abstract: 3D hand pose estimation in everyday egocentric images is challenging for several reasons: poor visual signal (occlusion from the object of interaction, low resolution & motion blur), large perspective distortion (hands are close to the camera), and lack of 3D annotations outside of controlled settings. While existing methods often use hand crops as input to focus on fine-grained visual information to deal with poor visual signal, the challenges arising from perspective distortion and lack of 3D annotations in the wild have not been systematically studied. We focus on this gap and explore the impact of different practices, i.e. crops as input, incorporating camera information, auxiliary supervision, scaling up datasets. We provide several insights that are applicable to both convolutional and transformer models leading to better performance. Based on our findings, we also present WildHands, a system for 3D hand pose estimation in everyday egocentric images. Zero-shot evaluation on 4 diverse datasets (H2O, AssemblyHands, Epic-Kitchens, Ego-Exo4D) demonstrate the effectiveness of our approach across 2D and 3D metrics, where we beat past methods by 7.4% - 66%. In system level comparisons, WildHands achieves the best 3D hand pose on ARCTIC egocentric split, outperforms FrankMocap across all metrics and HaMeR on 3 out of 6 metrics while being 10x smaller and trained on 5x less data.
Authors: Aditya Prakash, Arjun Gupta, Saurabh Gupta
Abstract: Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks: depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC, show the benefits of KPE.
Authors: Sahil Khose, Anisha Pal, Aayushi Agarwal, Deepanshi, Judy Hoffman, Prithvijit Chattopadhyay
Abstract: Real-world aerial scene understanding is limited by a lack of datasets that contain densely annotated images curated under a diverse set of conditions. Due to inherent challenges in obtaining such images in controlled real-world settings, we present SkyScenes, a synthetic dataset of densely annotated aerial images captured from Unmanned Aerial Vehicle (UAV) perspectives. We carefully curate SkyScenes images from CARLA to comprehensively capture diversity across layouts (urban and rural maps), weather conditions, times of day, pitch angles and altitudes with corresponding semantic, instance and depth annotations. Through our experiments using SkyScenes, we show that (1) models trained on SkyScenes generalize well to different real-world scenarios, (2) augmenting training on real images with SkyScenes data can improve real-world performance, (3) controlled variations in SkyScenes can offer insights into how models respond to changes in viewpoint conditions (height and pitch), weather and time of day, and (4) incorporating additional sensor modalities (depth) can improve aerial scene understanding. Our dataset and associated generation code are publicly available at: https://hoffman-group.github.io/SkyScenes/
Authors: Jiaqi Tang, Hao Lu, Xiaogang Xu, Ruizheng Wu, Sixing Hu, Tong Zhang, Tsz Wa Cheng, Ming Ge, Ying-Cong Chen, Fugee Tsung
Abstract: Artificial Intelligence (AI)-driven defect inspection is pivotal in industrial manufacturing. Yet, many methods, tailored to specific pipelines, grapple with diverse product portfolios and evolving processes. Addressing this, we present the Incremental Unified Framework (IUF), which can reduce the feature conflict problem when continuously integrating new objects in the pipeline, making it advantageous in object-incremental learning scenarios. Employing a state-of-the-art transformer, we introduce Object-Aware Self-Attention (OASA) to delineate distinct semantic boundaries. Semantic Compression Loss (SCL) is integrated to optimize non-primary semantic space, enhancing network adaptability for novel objects. Additionally, we prioritize retaining the features of established objects during weight updates. Demonstrating prowess in both image and pixel-level defect inspection, our approach achieves state-of-the-art performance, proving indispensable for dynamic and scalable industrial inspections. Our code will be released at https://github.com/jqtangust/IUF.
Authors: Shiu-hong Kao, Jierun Chen, S. H. Gary Chan
Abstract: Knowledge distillation (KD) has been recognized as an effective tool to compress and accelerate models. However, current KD approaches generally suffer from an accuracy drop and/or an excruciatingly long distillation process. In this paper, we tackle the issue by first providing a new insight into a phenomenon that we call the Inter-Block Optimization Entanglement (IBOE), which makes the conventional end-to-end KD approaches unstable with noisy gradients. We then propose StableKD, a novel KD framework that breaks the IBOE and achieves more stable optimization. StableKD distinguishes itself through two operations: Decomposition and Recomposition, where the former divides a pair of teacher and student networks into several blocks for separate distillation, and the latter progressively merges them back, evolving towards end-to-end distillation. We conduct extensive experiments on CIFAR100, Imagewoof, and ImageNet datasets with various teacher-student pairs. Compared to other KD approaches, our simple yet effective StableKD greatly boosts the model accuracy by 1% ~ 18%, speeds up the convergence up to 10 times, and outperforms them with only 40% of the training data.
Authors: Junru Lin, Asen Nachkov, Songyou Peng, Luc Van Gool, Danda Pani Paudel
Abstract: In this work, we address the challenge of deploying Neural Radiance Field (NeRFs) in Simultaneous Localization and Mapping (SLAM) under the condition of lacking depth information, relying solely on RGB inputs. The key to unlocking the full potential of NeRF in such a challenging context lies in the integration of real-world priors. A crucial prior we explore is the binary opacity prior of 3D space with opaque objects. To effectively incorporate this prior into the NeRF framework, we introduce a ternary-type opacity (TT) model instead, which categorizes points on a ray intersecting a surface into three regions: before, on, and behind the surface. This enables a more accurate rendering of depth, subsequently improving the performance of image warping techniques. Therefore, we further propose a novel hybrid odometry (HO) scheme that merges bundle adjustment and warping-based localization. Our integrated approach of TT and HO achieves state-of-the-art performance on synthetic and real-world datasets, in terms of both speed and accuracy. This breakthrough underscores the potential of NeRF-SLAM in navigating complex environments with high fidelity.
Authors: Kaichen Zhou, Jia-Wang Bian, Qian Xie, Jian-Qing Zheng, Niki Trigoni, Andrew Markham
Abstract: Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, a Motion-Guided Cost Volume Depth Net, to achieve precise depth estimation for both dynamic objects and static backgrounds, all while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a novel static reference frame. This frame is then utilized to build a motion-guided cost volume in collaboration with the target frame. Additionally, to enhance the accuracy and resilience of the network structure, we introduce an attention-based depth net architecture to effectively integrate information from feature maps with varying resolutions. Compared to methods with similar computational costs, Manydepth2 achieves a significant reduction of approximately five percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset. The code could be found: https://github.com/kaichen-z/Manydepth2
Authors: Ronglai Zuo, Fangyun Wei, Brian Mak
Abstract: Research on continuous sign language recognition (CSLR) is essential to bridge the communication gap between deaf and hearing individuals. Numerous previous studies have trained their models using the connectionist temporal classification (CTC) loss. During inference, these CTC-based models generally require the entire sign video as input to make predictions, a process known as offline recognition, which suffers from high latency and substantial memory usage. In this work, we take the first step towards online CSLR. Our approach consists of three phases: 1) developing a sign dictionary; 2) training an isolated sign language recognition model on the dictionary; and 3) employing a sliding window approach on the input sign sequence, feeding each sign clip to the optimized model for online recognition. Additionally, our online recognition model can be extended to support online translation by integrating a gloss-to-text network and can enhance the performance of any offline model. With these extensions, our online approach achieves new state-of-the-art performance on three popular benchmarks across various task settings. Code and models are available at https://github.com/FangyunWei/SLRT.
Authors: Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie
Abstract: We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: learning in discrete or continuous time, the objective function, the interpolant that connects the distributions, and deterministic or stochastic sampling. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 and 512x512 benchmark using the exact same model structure, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06 and 2.62, respectively.
Authors: Yunqing Bao, Bin Hu
Abstract: This paper presents a novel algorithm for non-destructive damage detection for steel ropes in high-altitude environments (aerial ropeway). The algorithm comprises two key components: First, a segmentation model named RGBD-UNet is designed to accurately extract steel ropes from complex backgrounds. This model is equipped with the capability to process and combine color and depth information through the proposed CMA module. Second, a detection model named VovNetV3.5 is developed to differentiate between normal and abnormal steel ropes. It integrates the VovNet architecture with a DBB module to enhance performance. Besides, a novel background augmentation method is proposed to enhance the generalization ability of the segmentation model. Datasets containing images of steel ropes in different scenarios are created for the training and testing of both the segmentation and detection models. Experiments demonstrate a significant improvement over baseline models. On the proposed dataset, the highest accuracy achieved by the detection model reached 0.975, and the maximum F-measure achieved by the segmentation model reached 0.948.
Authors: Guibiao Liao, Kaichen Zhou, Zhenyu Bao, Kanglin Liu, Qing Li
Abstract: The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate our OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU metric on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistent superior results across various CLIP configurations, further verifying its robustness. Project page: https://github.com/pcl3dv/OV-NeRF.
Authors: Ziqiao Shang, Bin Liu, Fengmao Lv, Fei Teng, Tianrui Li
Abstract: Facial action unit (AU) detection has long encountered the challenge of detecting subtle feature differences when AUs activate. Existing methods often rely on encoding pixel-level information of AUs, which not only encodes additional redundant information but also leads to increased model complexity and limited generalizability. Additionally, the accuracy of AU detection is negatively impacted by the class imbalance issue of each AU type, and the presence of noisy and false AU labels. In this paper, we introduce a novel contrastive learning framework aimed for AU detection that incorporates both self-supervised and supervised signals, thereby enhancing the learning of discriminative features for accurate AU detection. To tackle the class imbalance issue, we employ a negative sample re-weighting strategy that adjusts the step size of updating parameters for minority and majority class samples. Moreover, to address the challenges posed by noisy and false AU labels, we employ a sampling technique that encompasses three distinct types of positive sample pairs. This enables us to inject self-supervised signals into the supervised signal, effectively mitigating the adverse effects of noisy labels. Our experimental assessments, conducted on four widely-utilized benchmark datasets (BP4D, DISFA, GFT and Aff-Wild2), underscore the superior performance of our approach compared to state-of-the-art methods of AU detection. Our code is available at \url{https://github.com/Ziqiao-Shang/AUNCE}.
Authors: Zefeng Wang, Zhen Han, Shuo Chen, Fan Xue, Zifeng Ding, Xun Xiao, Volker Tresp, Philip Torr, Jindong Gu
Abstract: Multimodal LLMs (MLLMs) with a great ability of text and image understanding have received great attention. To achieve better reasoning with MLLMs, Chain-of-Thought (CoT) reasoning has been widely explored, which further promotes MLLMs' explainability by giving intermediate reasoning steps. Despite the strong power demonstrated by MLLMs in multimodal reasoning, recent studies show that MLLMs still suffer from adversarial images. This raises the following open questions: Does CoT also enhance the adversarial robustness of MLLMs? What do the intermediate reasoning steps of CoT entail under adversarial attacks? To answer these questions, we first generalize existing attacks to CoT-based inferences by attacking the two main components, i.e., rationale and answer. We find that CoT indeed improves MLLMs' adversarial robustness against the existing attack methods by leveraging the multi-step reasoning process, but not substantially. Based on our findings, we further propose a novel attack method, termed as stop-reasoning attack, that attacks the model while bypassing the CoT reasoning process. Experiments on three MLLMs and two visual reasoning datasets verify the effectiveness of our proposed method. We show that stop-reasoning attack can result in misled predictions and outperform baseline attacks by a significant margin.
Authors: Zhuoling Li, Xiaogang Xu, SerNam Lim, Hengshuang Zhao
Abstract: Realizing unified 3D object detection, including both indoor and outdoor scenes, holds great importance in applications like robot navigation. However, involving various scenarios of data to train models poses challenges due to their significantly distinct characteristics, \eg, diverse geometry properties and heterogeneous domain distributions. In this work, we propose to address the challenges from two perspectives, the algorithm perspective and data perspective. In terms of the algorithm perspective, we first build a monocular 3D object detector based on the bird's-eye-view (BEV) detection paradigm, where the explicit feature projection is beneficial to addressing the geometry learning ambiguity. In this detector, we split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by geometry difference between scenarios. Besides, we develop a sparse BEV feature projection strategy to reduce the computational cost and a unified domain alignment method to handle heterogeneous domains. From the data perspective, we propose to incorporate depth information to improve training robustness. Specifically, we build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version, which is the first unified multi-modal 3D object detector. We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively. The experimental results reveal several insightful findings highlighting the benefits of multi-modal data and confirm the effectiveness of all the proposed strategies.
Authors: Fanzhe Yan, Gang Yang, Yu Li, Aiping Liu, Xun Chen
Abstract: Deep learning techniques have demonstrated great potential for accurately estimating brain age by analyzing Magnetic Resonance Imaging (MRI) data from healthy individuals. However, current methods for brain age estimation often directly utilize whole input images, overlooking two important considerations: 1) the heterogeneous nature of brain aging, where different brain regions may degenerate at different rates, and 2) the existence of age-independent redundancies in brain structure. To overcome these limitations, we propose a Dual Graph Attention based Disentanglement Multi-instance Learning (DGA-DMIL) framework for improving brain age estimation. Specifically, the 3D MRI data, treated as a bag of instances, is fed into a 2D convolutional neural network backbone, to capture the unique aging patterns in MRI. A dual graph attention aggregator is then proposed to learn the backbone features by exploiting the intra- and inter-instance relationships. Furthermore, a disentanglement branch is introduced to separate age-related features from age-independent structural representations to ameliorate the interference of redundant information on age prediction. To verify the effectiveness of the proposed framework, we evaluate it on two datasets, UK Biobank and ADNI, containing a total of 35,388 healthy individuals. Our proposed model demonstrates exceptional accuracy in estimating brain age, achieving a remarkable mean absolute error of 2.12 years in the UK Biobank. The results establish our approach as state-of-the-art compared to other competing brain age estimation models. In addition, the instance contribution scores identify the varied importance of brain areas for aging prediction, which provides deeper insights into the understanding of brain aging.
Authors: Gang Zhang, Junnan Chen, Guohuan Gao, Jianmin Li, Si Liu, Xiaolin Hu
Abstract: LiDAR-based 3D object detection plays an essential role in autonomous driving. Existing high-performing 3D object detectors usually build dense feature maps in the backbone network and prediction head. However, the computational costs introduced by the dense feature maps grow quadratically as the perception range increases, making these models hard to scale up to long-range detection. Some recent works have attempted to construct fully sparse detectors to solve this issue; nevertheless, the resulting models either rely on a complex multi-stage pipeline or exhibit inferior performance. In this work, we propose SAFDNet, a straightforward yet highly effective architecture, tailored for fully sparse 3D object detection. In SAFDNet, an adaptive feature diffusion strategy is designed to address the center feature missing problem. We conducted extensive experiments on Waymo Open, nuScenes, and Argoverse2 datasets. SAFDNet performed slightly better than the previous SOTA on the first two datasets but much better on the last dataset, which features long-range detection, verifying the efficacy of SAFDNet in scenarios where long-range detection is required. Notably, on Argoverse2, SAFDNet surpassed the previous best hybrid detector HEDNet by 2.6% mAP while being 2.1x faster, and yielded 2.1% mAP gains over the previous best sparse detector FSDv2 while being 1.3x faster. The code will be available at https://github.com/zhanggang001/HEDNet.
Authors: Jianping Li, Thien-Minh Nguyen, Shenghai Yuan, Lihua Xie
Abstract: Accurate and consistent construction of point clouds from LiDAR scanning data is fundamental for 3D modeling applications. Current solutions, such as multiview point cloud registration and LiDAR bundle adjustment, predominantly depend on the local plane assumption, which may be inadequate in complex environments lacking of planar geometries or substantial initial pose errors. To mitigate this problem, this paper presents a LiDAR bundle adjustment with progressive spatial smoothing, which is suitable for complex environments and exhibits improved convergence capabilities. The proposed method consists of a spatial smoothing module and a pose adjustment module, which combines the benefits of local consistency and global accuracy. With the spatial smoothing module, we can obtain robust and rich surface constraints employing smoothing kernels across various scales. Then the pose adjustment module corrects all poses utilizing the novel surface constraints. Ultimately, the proposed method simultaneously achieves fine poses and parametric surfaces that can be directly employed for high-quality point cloud reconstruction. The effectiveness and robustness of our proposed approach have been validated on both simulation and real-world datasets. The experimental results demonstrate that the proposed method outperforms the existing methods and achieves better accuracy in complex environments with low planar structures.
Authors: Tianshuo Peng, Zuchao Li, Lefei Zhang, Hai Zhao, Ping Wang, Bo Du
Abstract: Large Language Models (LLMs), benefiting from the auto-regressive modelling approach performed on massive unannotated texts corpora, demonstrates powerful perceptual and reasoning capabilities. However, as for extending auto-regressive modelling to multi-modal scenarios to build Large Multi-modal Models (LMMs), there lies a great difficulty that the image information is processed in the LMM as continuous visual embeddings, which cannot obtain discrete supervised labels for classification.In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time.Specifically, we propose the concept of visual tokens, which maps the visual features to probability distributions over LLM's vocabulary, providing supervision information for visual modelling.We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.
Authors: Shuangjian Li, Tao Zhu, Mingxing Nie, Huansheng Ning, Zhenyu Liu, Liming Chen
Abstract: Traditional deep learning methods struggle to simultaneously segment, recognize, and forecast human activities from sensor data. This limits their usefulness in many fields such as healthcare and assisted living, where real-time understanding of ongoing and upcoming activities is crucial. This paper introduces P2LHAP, a novel Patch-to-Label Seq2Seq framework that tackles all three tasks in a efficient single-task model. P2LHAP divides sensor data streams into a sequence of "patches", served as input tokens, and outputs a sequence of patch-level activity labels including the predicted future activities. A unique smoothing technique based on surrounding patch labels, is proposed to identify activity boundaries accurately. Additionally, P2LHAP learns patch-level representation by sensor signal channel-independent Transformer encoders and decoders. All channels share embedding and Transformer weights across all sequences. Evaluated on three public datasets, P2LHAP significantly outperforms the state-of-the-art in all three tasks, demonstrating its effectiveness and potential for real-world applications.
Authors: Dan Song, Xuanpu Zhang, Jianhao Zeng, Pengxin Zhan, Qingguo Chen, Weihua Luo, An-An Liu
Abstract: Image-based virtual try-on aims to transfer target in-shop clothing to a dressed model image, the objectives of which are totally taking off original clothing while preserving the contents outside of the try-on area, naturally wearing target clothing and correctly inpainting the gap between target clothing and original clothing. Tremendous efforts have been made to facilitate this popular research area, but cannot keep the type of target clothing with the try-on area affected by original clothing. In this paper, we focus on the unpaired virtual try-on situation where target clothing and original clothing on the model are different, i.e., the practical scenario. To break the correlation between the try-on area and the original clothing and make the model learn the correct information to inpaint, we propose an adaptive mask training paradigm that dynamically adjusts training masks. It not only improves the alignment and fit of clothing but also significantly enhances the fidelity of virtual try-on experience. Furthermore, we for the first time propose two metrics for unpaired try-on evaluation, the Semantic-Densepose-Ratio (SDR) and Skeleton-LPIPS (S-LPIPS), to evaluate the correctness of clothing type and the accuracy of clothing texture. For unpaired try-on validation, we construct a comprehensive cross-try-on benchmark (Cross-27) with distinctive clothing items and model physiques, covering a broad try-on scenarios. Experiments demonstrate the effectiveness of the proposed methods, contributing to the advancement of virtual try-on technology and offering new insights and tools for future research in the field. The code, model and benchmark will be publicly released.
Authors: Jiajun Deng, Sha Zhang, Feras Dayoub, Wanli Ouyang, Yanyong Zhang, Ian Reid
Abstract: In this work, we present PoIFusion, a conceptually simple yet effective multi-modal 3D object detection framework to fuse the information of RGB images and LiDAR point clouds at the points of interest (PoIs). Different from the most accurate methods to date that transform multi-sensor data into a unified view or leverage the global attention mechanism to facilitate fusion, our approach maintains the view of each modality and obtains multi-modal features by computation-friendly projection and interpolation. In particular, our PoIFusion follows the paradigm of query-based object detection, formulating object queries as dynamic 3D boxes and generating a set of PoIs based on each query box. The PoIs serve as the keypoints to represent a 3D object and play the role of the basic units in multi-modal fusion. Specifically, we project PoIs into the view of each modality to sample the corresponding feature and integrate the multi-modal features at each PoI through a dynamic fusion block. Furthermore, the features of PoIs derived from the same query box are aggregated together to update the query feature. Our approach prevents information loss caused by view transformation and eliminates the computation-intensive global attention, making the multi-modal 3D object detector more applicable. We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach. Remarkably, the proposed approach achieves state-of-the-art results on both datasets without any bells and whistles, \emph{i.e.}, 74.9\% NDS and 73.4\% mAP on nuScenes, and 31.6\% CDS and 40.6\% mAP on Argoverse2. The code will be made available at \url{https://djiajunustc.github.io/projects/poifusion}.
Authors: Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, Ranjay Krishna
Abstract: Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 10 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).
URLs: https://huggingface.co/datasets/zixianma/mnms), https://github.com/RAIVNLab/mnms).
Authors: Jeongsol Kim, Geon Yeong Park, Jong Chul Ye
Abstract: Reverse sampling and score-distillation have emerged as main workhorses in recent years for image manipulation using latent diffusion models (LDMs). While reverse diffusion sampling often requires adjustments of LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode-collapsing. To address these limitations and leverage the strengths of both approaches, here we introduce a novel framework called {\em DreamSampler}, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. Similar to score-distillation, DreamSampler is a model-agnostic approach applicable to any LDM architecture, but it allows both distillation and reverse sampling with additional guidance for image editing and reconstruction. Through experiments involving image editing, SVG reconstruction and etc, we demonstrate the competitive performance of DreamSampler compared to existing approaches, while providing new applications. Code: https://github.com/DreamSampler/dream-sampler
Authors: Yarden Frenkel, Yael Vinker, Ariel Shamir, Daniel Cohen-Or
Abstract: Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.
Authors: Yifan Yan, Ruomin He, Zhenghua Liu
Abstract: We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM effectively tracks camera positions and incrementally builds a scalable multi-map representation for both small and large indoor environments. As previous methods often require pre-defined scene boundaries, MUTE-SLAM dynamically allocates sub-maps for newly observed local regions, enabling constraint-free mapping without prior scene information. Unlike traditional grid-based methods, we use three orthogonal axis-aligned planes for hash-encoding scene properties, significantly reducing hash collisions and the number of trainable parameters. This hybrid approach not only ensures real-time performance but also enhances the fidelity of surface reconstruction. Furthermore, our optimization strategy concurrently optimizes all sub-maps intersecting with the current camera frustum, ensuring global consistency. Extensive testing on both real-world and synthetic datasets has shown that MUTE-SLAM delivers state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings. The code is available at https://github.com/lumennYan/MUTE_SLAM.
Authors: Md. Simul Hasan Talukder, Sharmin Akter, Abdullah Hafez Nur, Mohammad Aljaidi, Rejwan Bin Sulaiman
Abstract: Sugarcane, a key crop for the world's sugar industry, is prone to several diseases that have a substantial negative influence on both its yield and quality. To effectively manage and implement preventative initiatives, diseases must be detected promptly and accurately. In this study, we present a unique model called sugarcaneNet2024 that outperforms previous methods for automatically and quickly detecting sugarcane disease through leaf image processing. Our proposed model consolidates an optimized weighted average ensemble of seven customized and LASSO-regularized pre-trained models, particularly InceptionV3, InceptionResNetV2, DenseNet201, DenseNet169, Xception, and ResNet152V2. Initially, we added three more dense layers with 0.0001 LASSO regularization, three 30% dropout layers, and three batch normalizations with renorm enabled at the bottom of these pre-trained models to improve the performance. The accuracy of sugarcane leaf disease classification was greatly increased by this addition. Following this, several comparative studies between the average ensemble and individual models were carried out, indicating that the ensemble technique performed better. The average ensemble of all modified pre-trained models produced outstanding outcomes: 100%, 99%, 99%, and 99.45% for f1 score, precision, recall, and accuracy, respectively. Performance was further enhanced by the implementation of an optimized weighted average ensemble technique incorporated with grid search. This optimized sugarcaneNet2024 model performed the best for detecting sugarcane diseases, having achieved accuracy, precision, recall, and F1 score of 99.67%, 100%, 100%, and 100% , respectively.
Authors: Yao Du, Carlos M. Mateo, Mirjana Maras, Tsun-Hsuan Wang, Marc Blanchon, Alexander Amini, Daniela Rus, Omar Tahri
Abstract: Unlike a traditional gyroscope, a visual gyroscope estimates camera rotation through images. The integration of omnidirectional cameras, offering a larger field of view compared to traditional RGB cameras, has proven to yield more accurate and robust results. However, challenges arise in situations that lack features, have substantial noise causing significant errors, and where certain features in the images lack sufficient strength, leading to less precise prediction results. Here, we address these challenges by introducing a novel visual gyroscope, which combines an Efficient Multi-Mask-Filter Rotation Estimator(EMMFRE) and a Learning based optimization(LbTO) to provide a more efficient and accurate rotation estimation from spherical images. Experimental results demonstrate superior performance of the proposed approach in terms of accuracy. The paper emphasizes the advantages of integrating machine learning to optimize analytical solutions, discusses limitations, and suggests directions for future research.
Authors: Hou-I Liu, Yi-Xin Huang, Hong-Han Shuai, Wen-Huang Cheng
Abstract: Despite previous DETR-like methods having performed successfully in generic object detection, tiny object detection is still a challenging task for them since the positional information of object queries is not customized for detecting tiny objects, whose scale is extraordinarily smaller than general objects. Also, DETR-like methods using a fixed number of queries make them unsuitable for aerial datasets, which only contain tiny objects, and the numbers of instances are imbalanced between different images. Thus, we present a simple yet effective model, named DQ-DETR, which consists of three different components: categorical counting module, counting-guided feature enhancement, and dynamic query selection to solve the above-mentioned problems. DQ-DETR uses the prediction and density maps from the categorical counting module to dynamically adjust the number of object queries and improve the positional information of queries. Our model DQ-DETR outperforms previous CNN-based and DETR-like methods, achieving state-of-the-art mAP 30.2% on the AI-TOD-V2 dataset, which mostly consists of tiny objects. Our code will be available at https://github.com/Katie0723/DQ-DETR.
Authors: Jiyuan Wang, Chunyu Lin, Lang Nie, Kang Liao, Shuwei Shao, Yao Zhao
Abstract: Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rainy, snowy, etc. In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the `trinity' contrastive scheme. This scheme utilizes the sampled noise of the forward diffusion process as a natural reference, guiding the predicted noise in diverse scenes toward a more stable and precise optimum. Moreover, we extend noise-level trinity to encompass more generic feature and image levels, establishing a multi-level contrast to distribute the burden of robust perception across the overall network. Before addressing complex scenarios, we enhance the stability of the baseline diffusion model with three straightforward yet effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments demonstrate that D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and real-world weather conditions. Source code and data are available at \url{https://github.com/wangjiyuan9/D4RD}.
Authors: Jo\~ao Gabriel Vinholi, Marco Chini, Anis Amziane, Renato Machado, Danilo Silva, Patrick Matgen
Abstract: Comparing images captured by disparate sensors is a common challenge in remote sensing. This requires image translation -- converting imagery from one sensor domain to another while preserving the original content. Denoising Diffusion Implicit Models (DDIM) are potential state-of-the-art solutions for such domain translation due to their proven superiority in multiple image-to-image translation tasks in classic computer vision. However, these models struggle with large-scale multi-patch imagery, often focusing solely on small patches and resulting in inconsistencies across the full image. To overcome these limitations, we propose a novel method that leverages DDIM for effective optical image translation over large areas. Our approach super-resolves large-scale low spatial resolution images into high-resolution equivalents from disparate optical sensors, ensuring uniformity across hundreds of patches. Our contributions lie in new forward and reverse diffusion processes, which are crucial for addressing the challenges of large-scale image translation. Extensive experiments with a dataset of paired Sentinel-II and Planet Dove images demonstrate that our approach provides precise domain adaptation and artifact reduction. The proposed method preserves image content while improving radiometric (color) accuracy and feature representation. The outcome is a high-resolution large-scale image with consistent patches, vital for applications such as heterogeneous change detection (HCD). We present a unique training and testing algorithm rooted in DDIMs, a thorough image quality assessment, and a comparative study against the standard classifier-free guided DDIM framework and five other leading methods. The usefulness of our approach is further demonstrated in HCD tasks performed in the urban settings of Beirut, Lebanon, and Austin, USA.
Authors: Zerui Chen, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
Abstract: In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown benefits of using human videos for policy learning, performance gains have been limited by the noise in estimated trajectories. Moreover, reliance on privileged object information such as ground-truth object states further limits the applicability in realistic scenarios. To address these limitations, we propose a new framework ViViDex to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory guided rewards to train state-based policies for each video, obtaining both visually natural and physically plausible trajectories from the video. We then rollout successful episodes from state-based policies and train a unified visual policy without using any privileged information. We propose coordinate transformation to further enhance the visual point cloud representation, and compare behavior cloning and diffusion policy for the visual policy training. Experiments both in simulation and on the real robot demonstrate that ViViDex outperforms state-of-the-art approaches on three dexterous manipulation tasks.
Authors: Panagiotis Sapoutzoglou, George Giapitzakis, George Terzakis, Maria Pateraki
Abstract: We present a generic algorithm for scoring pose estimation methods that rely on single image semantic analysis. The algorithm employs a lightweight putative shape representation using a combination of multiple Gaussian Processes. Each Gaussian Process (GP) yields distance normal distributions from multiple reference points in the object's coordinate system to its surface, thus providing a geometric evaluation framework for scoring predicted poses. Our confidence measure comprises the average mixture probability of pixel back-projections onto the shape template. In the reported experiments, we compare the accuracy of our GP based representation of objects versus the actual geometric models and demonstrate the ability of our method to capture the influence of outliers as opposed to the corresponding intrinsic measures that ship with the segmentation and pose estimation methods.
Authors: Justin Engelmann, Miguel O. Bernabeu
Abstract: Artificial Intelligence in medicine is traditionally limited by the lack of massive training datasets. Foundation models, pre-trained models that can be adapted to downstream tasks with small datasets, could alleviate this problem. Researchers at Moorfields Eye Hospital (MEH) proposed RETFound-MEH, a retinal foundation model trained on 900,000 images, including private hospital data. Recently, data-efficient DERETFound was proposed providing comparable performance while being trained on only 150,000 publicly available images. However, both these models required very substantial resources to train initially and are resource-intensive in downstream use. We propose a novel Token Reconstruction objective that we use to train RETFound-Green, a retinal foundation model trained using only 75,000 publicly available images and 400 times less compute. We estimate the cost of training RETFound-MEH and DERETFound at \$10,000 and \$14,000, respectively. RETFound-Green could be trained for less than \$100, with equally reduced environmental impact. RETFound-Green is also far more efficient in downstream use: it can be downloaded 14 times faster, computes vector embeddings 2.7 times faster which then require 2.6 times less storage space. Despite this, RETFound-Green does not perform systematically worse. In fact, on various task on three downstream datasets from Brazil, India and China, it performs best on 68 tasks out of 119 comparisons, versus 21 for DERETFound and 13 for RETFound-MEH. Our results suggest that RETFound-Green is a very efficient, high-performance retinal foundation model. We anticipate that our Token Reconstruction objective could be scaled up for even higher performance and be applied to other domains beyond retinal imaging.
Authors: Amar Ali-bey, Brahim Chaib-draa, Philippe Gigu\`ere
Abstract: In visual place recognition, accurately identifying and matching images of locations under varying environmental conditions and viewpoints remains a significant challenge. In this paper, we introduce a new technique, called Bag-of-Queries (BoQ), which learns a set of global queries designed to capture universal place-specific attributes. Unlike existing methods that employ self-attention and generate the queries directly from the input features, BoQ employs distinct learnable global queries, which probe the input features via cross-attention, ensuring consistent information aggregation. In addition, our technique provides an interpretable attention mechanism and integrates with both CNN and Vision Transformer backbones. The performance of BoQ is demonstrated through extensive experiments on 14 large-scale benchmarks. It consistently outperforms current state-of-the-art techniques including NetVLAD, MixVPR and EigenPlaces. Moreover, as a global retrieval technique (one-stage), BoQ surpasses two-stage retrieval methods, such as Patch-NetVLAD, TransVPR and R2Former, all while being orders of magnitude faster and more efficient. The code and model weights are publicly available at https://github.com/amaralibey/Bag-of-Queries.
Authors: Zikun Zhou, Wentao Xiong, Li Zhou, Xin Li, Zhenyu He, Yaowei Wang
Abstract: The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual contents at pixel-level. Current RVOS methods typically use vision and language models pretrained independently as backbones. As images and texts are mapped to uncoupled feature spaces, they face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pretrained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pretraining task (static image/region-level prediction) and the RVOS task (dynamic pixel-level prediction). To address this transfer challenge, we introduce a framework named VLP-RVOS which harnesses VLP models for RVOS through temporal-aware adaptation. We first propose a temporal-aware prompt-tuning method, which not only adapts pretrained representations for pixel-level prediction but also empowers the vision encoder to model temporal contexts. We further customize a cube-frame attention mechanism for robust spatial-temporal reasoning. Besides, we propose to perform multi-stage VL relation modeling while and after feature extraction for comprehensive VL understanding. Extensive experiments demonstrate that our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
Authors: Antonio R\'ios-Vila, Jorge Calvo-Zaragoza, David Rizo, Thierry Paquet
Abstract: Optical Music Recognition (OMR) has made significant progress since its inception, with various approaches now capable of accurately transcribing music scores into digital formats. Despite these advancements, most so-called \emph{end-to-end} OMR approaches still rely on multi-stage processing pipelines for transcribing full-page score images, which introduces several limitations that hinder the full potential of the field. In this paper, we present the first truly end-to-end approach for page-level OMR. Our system, which combines convolutional layers with autoregressive Transformers, processes an entire music score page and outputs a complete transcription in a music encoding format. This is made possible by both the architecture and the training procedure, which utilizes curriculum learning through incremental synthetic data generation. We evaluate the proposed system using pianoform corpora. This evaluation is conducted first in a controlled scenario with synthetic data, and subsequently against two real-world corpora of varying conditions. Our approach is compared with leading commercial OMR software. The results demonstrate that our system not only successfully transcribes full-page music scores but also outperforms the commercial tool in both zero-shot settings and after fine-tuning with the target domain, representing a significant contribution to the field of OMR.
Authors: Wenbo Sui, Daniel Lichau, Josselin Lef\`evre, Harold Phelippeau
Abstract: Recent studies of multimodal industrial anomaly detection (IAD) based on 3D point clouds and RGB images have highlighted the importance of exploiting the redundancy and complementarity among modalities for accurate classification and segmentation. However, achieving multimodal IAD in practical production lines remains a work in progress. It is essential to consider the trade-offs between the costs and benefits associated with the introduction of new modalities while ensuring compatibility with current processes. Existing quality control processes combine rapid in-line inspections, such as optical and infrared imaging with high-resolution but time-consuming near-line characterization techniques, including industrial CT and electron microscopy to manually or semi-automatically locate and analyze defects in the production of Li-ion batteries and composite materials. Given the cost and time limitations, only a subset of the samples can be inspected by all in-line and near-line methods, and the remaining samples are only evaluated through one or two forms of in-line inspection. To fully exploit data for deep learning-driven automatic defect detection, the models must have the ability to leverage multimodal training and handle incomplete modalities during inference. In this paper, we propose CMDIAD, a Cross-Modal Distillation framework for IAD to demonstrate the feasibility of a Multi-modal Training, Few-modal Inference (MTFI) pipeline. Our findings show that the MTFI pipeline can more effectively utilize incomplete multimodal information compared to applying only a single modality for training and inference. Moreover, we investigate the reasons behind the asymmetric performance improvement using point clouds or RGB images as the main modality of inference. This provides a foundation for our future multimodal dataset construction with additional modalities from manufacturing scenarios.
Authors: Boyi Sun, Yuhang Liu, Xingxia Wang, Bin Tian, Long Chen, Fei-Yue Wang
Abstract: Point cloud data labeling is considered a time-consuming and expensive task in autonomous driving, whereas unsupervised learning can avoid it by learning point cloud representations from unannotated data. In this paper, we propose UOV, a novel 3D Unsupervised framework assisted by 2D Open-Vocabulary segmentation models. It consists of two stages: In the first stage, we innovatively integrate high-quality textual and image features of 2D open-vocabulary models and propose the Tri-Modal contrastive Pre-training (TMP). In the second stage, spatial mapping between point clouds and images is utilized to generate pseudo-labels, enabling cross-modal knowledge distillation. Besides, we introduce the Approximate Flat Interaction (AFI) to address the noise during alignment and label confusion. To validate the superiority of UOV, extensive experiments are conducted on multiple related datasets. We achieved a record-breaking 47.73% mIoU on the annotation-free point cloud segmentation task in nuScenes, surpassing the previous best model by 10.70% mIoU. Meanwhile, the performance of fine-tuning with 1% data on nuScenes and SemanticKITTI reached a remarkable 51.75% mIoU and 48.14% mIoU, outperforming all previous pre-trained models.
Authors: Thong Thanh Nguyen, Zhiyuan Hu, Xiaobao Wu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
Abstract: Seeking answers effectively for long videos is essential to build video question answering (videoQA) systems. Previous methods adaptively select frames and regions from long videos to save computations. However, this fails to reason over the whole sequence of video, leading to sub-optimal performance. To address this problem, we introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video, which mitigates the video information loss caused by frame and region selection modules. Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations. To further enhance the controllability, we introduce a cross-modal compositional congruence (C^3) objective to encourage global semantics aligned with the question. To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length, i.e. 17.5 minutes and 1.9 hours, respectively. Extensive experiments demonstrate the superiority of our framework on these new as well as existing datasets.
Authors: Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li
Abstract: This paper presents a novel approach to enhance image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). We propose a framework where LLaVA analyzes input images and generates textual descriptions, hereinafter LLaVA-generated prompts. These prompts, along with the original image, are fed into the image-to-image generation pipeline. This enriched representation guides the generation process towards outputs that exhibit a stronger resemblance to the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity. We observe a significant improvement in the visual coherence between the generated and input images compared to traditional methods. Future work will explore fine-tuning LLaVA prompts for increased control over the creative process. By providing more specific details within the prompts, we aim to achieve a delicate balance between faithfulness to the original image and artistic expression in the generated outputs.
Authors: Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
Abstract: Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner, it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP
Authors: Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo
Abstract: Text-to-image (T2I) models have made substantial progress in generating images from textual prompts. However, they frequently fail to produce images consistent with physical commonsense, a vital capability for applications in world simulation and everyday tasks. Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge, particularly physical commonsense. To address this issue, we introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties, encompassing 31 distinct physical scenarios. We assess 6 prominent T2I models, including proprietary models DALLE3 and Gemini, and demonstrate that incorporating physical principles into prompts enhances the models' ability to generate physically accurate images. Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense, closely aligning with human assessments; and (3) current T2I models are primarily focused on text-to-image translation, lacking profound reasoning regarding physical commonsense. We advocate for increased attention to the inherent knowledge within T2I models, beyond their utility as mere image generation tools. The data will be available soon.
Authors: Zhiyuan Chang, Mingyang Li, Junjie Wang, Yi Liu, Qing Wang, Yang Liu
Abstract: Text-to-Image Diffusion Models (T2I DMs) have garnered significant attention for their ability to generate high-quality images from textual descriptions. However, these models often produce images that do not fully align with the input prompts, resulting in semantic inconsistencies. The most prominent issue among these semantic inconsistencies is catastrophic-neglect, where the images generated by T2I DMs miss key objects mentioned in the prompt. We first conduct an empirical study on this issue, exploring the prevalence of catastrophic-neglect, potential mitigation strategies with feature enhancement, and the insights gained. Guided by the empirical findings, we propose an automated repair approach named Patcher to address catastrophic-neglect in T2I DMs. Specifically, Patcher first determines whether there are any neglected objects in the prompt, and then applies attention-guided feature enhancement to these neglected objects, resulting in a repaired prompt. Experimental results on three versions of Stable Diffusion demonstrate that Patcher effectively repairs the issue of catastrophic-neglect, achieving 10.1%-16.3% higher Correct Rate in image generation compared to baselines.
Authors: Lorenzo Olearo, Giorgio Longari, Simone Melzi, Alessandro Raganato, Rafael Pe\~naloza
Abstract: For the last decade, there has been a push to use multi-dimensional (latent) spaces to represent concepts; and yet how to manipulate these concepts or reason with them remains largely unclear. Some recent methods exploit multiple latent representations and their connection, making this research question even more entangled. Our goal is to understand how operations in the latent space affect the underlying concepts. To that end, we explore the task of concept blending through diffusion models. Diffusion models are based on a connection between a latent representation of textual prompts and a latent space that enables image reconstruction and generation. This task allows us to try different text-based combination strategies, and evaluate easily through a visual analysis. Our conclusion is that concept blending through space manipulation is possible, although the best strategy depends on the context of the blend.
Authors: Syed Javed, Tariq M. Khan, Abdul Qayyum, Arcot Sowmya, Imran Razzak
Abstract: Retinal imaging has emerged as a promising method of addressing this challenge, taking advantage of the unique structure of the retina. The retina is an embryonic extension of the central nervous system, providing a direct in vivo window into neurological health. Recent studies have shown that specific structural changes in retinal vessels can not only serve as early indicators of various diseases but also help to understand disease progression. In this work, we present a lightweight retinal vessel segmentation network based on the encoder-decoder mechanism with region-guided attention. We introduce inverse addition attention blocks with region guided attention to focus on the foreground regions and improve the segmentation of regions of interest. To further boost the model's performance on retinal vessel segmentation, we employ a weighted dice loss. This choice is particularly effective in addressing the class imbalance issues frequently encountered in retinal vessel segmentation tasks. Dice loss penalises false positives and false negatives equally, encouraging the model to generate more accurate segmentation with improved object boundary delineation and reduced fragmentation. Extensive experiments on a benchmark dataset show better performance (0.8285, 0.8098, 0.9677, and 0.8166 recall, precision, accuracy and F1 score respectively) compared to state-of-the-art methods.
Authors: Linfei Pan, D\'aniel Bar\'ath, Marc Pollefeys, Johannes L. Sch\"onberger
Abstract: Recovering 3D structure and camera motion from images has been a long-standing focus of computer vision research and is known as Structure-from-Motion (SfM). Solutions to this problem are categorized into incremental and global approaches. Until now, the most popular systems follow the incremental paradigm due to its superior accuracy and robustness, while global approaches are drastically more scalable and efficient. With this work, we revisit the problem of global SfM and propose GLOMAP as a new general-purpose system that outperforms the state of the art in global SfM. In terms of accuracy and robustness, we achieve results on-par or superior to COLMAP, the most widely used incremental SfM, while being orders of magnitude faster. We share our system as an open-source implementation at {https://github.com/colmap/glomap}.
Authors: Peiyuan Chen, Zecheng Zhang, Yiping Dong, Li Zhou, Han Wang
Abstract: Visual Question Answering (VQA) is a challenging task that requires systems to provide accurate answers to questions based on image content. Current VQA models struggle with complex questions due to limitations in capturing and integrating multimodal information effectively. To address these challenges, we propose the Rank VQA model, which leverages a ranking-inspired hybrid training strategy to enhance VQA performance. The Rank VQA model integrates high-quality visual features extracted using the Faster R-CNN model and rich semantic text features obtained from a pre-trained BERT model. These features are fused through a sophisticated multimodal fusion technique employing multi-head self-attention mechanisms. Additionally, a ranking learning module is incorporated to optimize the relative ranking of answers, thus improving answer accuracy. The hybrid training strategy combines classification and ranking losses, enhancing the model's generalization ability and robustness across diverse datasets. Experimental results demonstrate the effectiveness of the Rank VQA model. Our model significantly outperforms existing state-of-the-art models on standard VQA datasets, including VQA v2.0 and COCO-QA, in terms of both accuracy and Mean Reciprocal Rank (MRR). The superior performance of Rank VQA is evident in its ability to handle complex questions that require understanding nuanced details and making sophisticated inferences from the image and text. This work highlights the effectiveness of a ranking-based hybrid training strategy in improving VQA performance and lays the groundwork for further research in multimodal learning methods.
Authors: Purushothaman Natarajan, Athira Nambiar
Abstract: Deep learning techniques have revolutionized image classification by mimicking human cognition and automating complex decision-making processes. However, the deployment of AI systems in the wild, especially in high-security domains such as defence, is curbed by the lack of explainability of the model. To this end, eXplainable AI (XAI) is an emerging area of research that is intended to explore the unexplained hidden black box nature of deep neural networks. This paper explores the application of the eXplainable Artificial Intelligence (XAI) tool to interpret the underwater image classification results, one of the first works in the domain to the best of our knowledge. Our study delves into the realm of SONAR image classification using a custom dataset derived from diverse sources, including the Seabed Objects KLSG dataset, the camera SONAR dataset, the mine SONAR images dataset, and the SCTD dataset. An extensive analysis of transfer learning techniques for image classification using benchmark Convolutional Neural Network (CNN) architectures such as VGG16, ResNet50, InceptionV3, DenseNet121, etc. is carried out. On top of this classification model, a post-hoc XAI technique, viz. Local Interpretable Model-Agnostic Explanations (LIME) are incorporated to provide transparent justifications for the model's decisions by perturbing input data locally to see how predictions change. Furthermore, Submodular Picks LIME (SP-LIME) a version of LIME particular to images, that perturbs the image based on the submodular picks is also extensively studied. To this end, two submodular optimization algorithms i.e. Quickshift and Simple Linear Iterative Clustering (SLIC) are leveraged towards submodular picks. The extensive analysis of XAI techniques highlights interpretability of the results in a more human-compliant way, thus boosting our confidence and reliability.
Authors: Yixuan He, Lin Geng Foo, Ajmal Saeed Mian, Hossein Rahmani, Jun Liu
Abstract: Language based editing of 3D human avatars to precisely match user requirements is challenging due to the inherent ambiguity and limited expressiveness of natural language. To overcome this, we propose the Avatar Concept Slider (ACS), a 3D avatar editing method that allows precise manipulation of semantic concepts in human avatars towards a specified intermediate point between two extremes of concepts, akin to moving a knob along a slider track. To achieve this, our ACS has three designs. 1) A Concept Sliding Loss based on Linear Discriminant Analysis to pinpoint the concept-specific axis for precise editing. 2) An Attribute Preserving Loss based on Principal Component Analysis for improved preservation of avatar identity during editing. 3) A 3D Gaussian Splatting primitive selection mechanism based on concept-sensitivity, which updates only the primitives that are the most sensitive to our target concept, to improve efficiency. Results demonstrate that our ACS enables fine-grained 3D avatar editing with efficient feedback, without harming the avatar quality or compromising the avatar's identifying attributes.
Authors: Siyuan Yao, Hao Sun, Tian-Zhu Xiang, Xiao Wang, Xiaochun Cao
Abstract: Camouflaged object detection (COD) aims to identify the objects that seamlessly blend into the surrounding backgrounds. Due to the intrinsic similarity between the camouflaged objects and the background region, it is extremely challenging to precisely distinguish the camouflaged objects by existing approaches. In this paper, we propose a hierarchical graph interaction network termed HGINet for camouflaged object detection, which is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Specifically, we first design a region-aware token focusing attention (RTFA) with dynamic token clustering to excavate the potentially distinguishable tokens in the local region. Afterwards, a hierarchical graph interaction transformer (HGIT) is proposed to construct bi-directional aligned communication between hierarchical features in the latent interaction space for visual semantics enhancement. Furthermore, we propose a decoder network with confidence aggregated feature fusion (CAFF) modules, which progressively fuses the hierarchical interacted features to refine the local detail in ambiguous regions. Extensive experiments conducted on the prevalent datasets, i.e. COD10K, CAMO, NC4K and CHAMELEON demonstrate the superior performance of HGINet compared to existing state-of-the-art methods. Our code is available at https://github.com/Garyson1204/HGINet.
Authors: YuanZheng Wu, Jin Liu, Shunping Ji
Abstract: Recently, 3D Gaussian Splatting (3DGS) has demonstrated excellent ability in small-scale 3D surface reconstruction. However, extending 3DGS to large-scale scenes remains a significant challenge. To address this gap, we propose a novel 3DGS-based method for large-scale surface reconstruction using aerial multi-view stereo (MVS) images, named Aerial Gaussian Splatting (AGS). First, we introduce a data chunking method tailored for large-scale aerial images, making 3DGS feasible for surface reconstruction over extensive scenes. Second, we integrate the Ray-Gaussian Intersection method into 3DGS to obtain depth and normal information. Finally, we implement multi-view geometric consistency constraints to enhance the geometric consistency across different views. Our experiments on multiple datasets demonstrate, for the first time, the 3DGS-based method can match conventional aerial MVS methods on geometric accuracy in aerial large-scale surface reconstruction, and our method also beats state-of-the-art GS-based methods both on geometry and rendering quality.
Authors: Tingchen Ma, Yongsheng Ou, Sheng Xu
Abstract: The bundle adjustment (BA) algorithm is a widely used nonlinear optimization technique in the backend of Simultaneous Localization and Mapping (SLAM) systems. By leveraging the co-view relationships of landmarks from multiple perspectives, the BA method constructs a joint estimation model for both poses and landmarks, enabling the system to generate refined maps and reduce front-end localization errors. However, there are unique challenges when applying the BA for LiDAR data, due to the large volume of 3D points. Exploring a robust LiDAR BA estimator and achieving accurate solutions is a very important issue. In this work, firstly we propose a novel mean square group metric (MSGM) to build the optimization objective in the LiDAR BA algorithm. This metric applies mean square transformation to uniformly process the measurement of plane landmarks from one sampling period. The transformed metric ensures scale interpretability, and does not requie a time-consuming point-by-point calculation. Secondly, by integrating a robust kernel function, the metrics involved in the BA algorithm are reweighted, and thus enhancing the robustness of the solution process. Thirdly, based on the proposed robust LiDAR BA model, we derived an explicit second-order estimator (RSO-BA). This estimator employs analytical formulas for Hessian and gradient calculations, ensuring the precision of the BA solution. Finally, we verify the merits of the proposed RSO-BA estimator against existing implicit second-order and explicit approximate second-order estimators using the publicly available datasets. The experimental results demonstrate that the RSO-BA estimator outperforms its counterparts regarding registration accuracy and robustness, particularly in large-scale or complex unstructured environments.
Authors: Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Abstract: This paper studies how to synthesize face images of non-existent persons, to create a dataset that allows effective training of face recognition (FR) models. Two important goals are (1) the ability to generate a large number of distinct identities (inter-class separation) with (2) a wide variation in appearance of each identity (intra-class variation). However, existing works 1) are typically limited in how many well-separated identities can be generated and 2) either neglect or use a separate editing model for attribute augmentation. We propose Vec2Face, a holistic model that uses only a sampled vector as input and can flexibly generate and control face images and their attributes. Composed of a feature masked autoencoder and a decoder, Vec2Face is supervised by face image reconstruction and can be conveniently used in inference. Using vectors with low similarity among themselves as inputs, Vec2Face generates well-separated identities. Randomly perturbing an input identity vector within a small range allows Vec2Face to generate faces of the same identity with robust variation in face attributes. It is also possible to generate images with designated attributes by adjusting vector values with a gradient descent method. Vec2Face has efficiently synthesized as many as 300K identities with 15 million total images, whereas 60K is the largest number of identities created in the previous works. FR models trained with the generated HSFace datasets, from 10k to 300k identities, achieve state-of-the-art accuracy, from 92% to 93.52%, on five real-world test sets. For the first time, our model created using a synthetic training set achieves higher accuracy than the model created using a same-scale training set of real face images (on the CALFW test set).
Authors: Houwang Jiang, Zhuxian Liu, Guodong Liu, Xiaolong Liu, Shihua Zhan
Abstract: Recent advances in unsupervised representation learning often rely on knowing the number of classes to improve feature extraction and clustering. However, this assumption raises an important question: is the number of classes always necessary, and do class labels fully capture the fine-grained features within the data? In this paper, we propose Contrastive Disentangling (CD), a framework designed to learn representations without relying on class priors. CD leverages a multi-level contrastive learning strategy, integrating instance-level and feature-level contrastive losses with a normalized entropy loss to capture semantically rich and fine-grained representations. Specifically, (1) the instance-level contrastive loss separates feature representations across samples; (2) the feature-level contrastive loss promotes independence among feature heads; and (3) the normalized entropy loss ensures feature diversity and prevents feature collapse. Extensive experiments on CIFAR-10, CIFAR-100, STL-10, and ImageNet-10 demonstrate that CD outperforms existing methods in scenarios where class information is unavailable or ambiguous. The code is available at https://github.com/Hoper-J/Contrastive-Disentangling.
Authors: Minju Kang, Taehun Kong, Tae-Kyun Kim
Abstract: Accurate 3D object detection is crucial for autonomous vehicles and robots to navigate and interact with the environment safely and effectively. Meanwhile, the performance of 3D detector relies on the data size and annotation which is expensive. Consequently, the demand of training with limited labeled data is growing. We explore a novel teacher-student framework employing channel augmentation for 3D semi-supervised object detection. The teacher-student SSL typically adopts a weak augmentation and strong augmentation to teacher and student, respectively. In this work, we apply multiple channel augmentations to both networks using the transformation equivariance detector (TED). The TED allows us to explore different combinations of augmentation on point clouds and efficiently aggregates multi-channel transformation equivariance features. In principle, by adopting fixed channel augmentations for the teacher network, the student can train stably on reliable pseudo-labels. Adopting strong channel augmentations can enrich the diversity of data, fostering robustness to transformations and enhancing generalization performance of the student network. We use SOTA hierarchical supervision as a baseline and adapt its dual-threshold to TED, which is called channel IoU consistency. We evaluate our method with KITTI dataset, and achieved a significant performance leap, surpassing SOTA 3D semi-supervised object detection models.
Authors: Kaijie Yin, Tian Gao, Hui Kong
Abstract: A prior global topological map (e.g., the OpenStreetMap, OSM) can boost the performance of autonomous mapping by a ground mobile robot. However, the prior map is usually incomplete due to lacking labeling in partial paths. To solve this problem, this paper proposes an OSM maker using airborne sensors carried by low-altitude aircraft, where the core of the OSM maker is a novel efficient pathfinder approach based on LiDAR and camera data, i.e., a binary dual-stream road segmentation model. Specifically, a multi-scale feature extraction based on the UNet architecture is implemented for images and point clouds. To reduce the effect caused by the sparsity of point cloud, an attention-guided gated block is designed to integrate image and point-cloud features. For enhancing the efficiency of the model, we propose a binarization streamline to each model component, including a variant of vision transformer (ViT) architecture as the encoder of the image branch, and new focal and perception losses to optimize the model training. The experimental results on two datasets demonstrate that our pathfinder method achieves SOTA accuracy with high efficiency in finding paths from the low-level airborne sensors, and we can create complete OSM prior maps based on the segmented road skeletons. Code and data are available at:https://github.com/IMRL/Pathfinder}{https://github.com/IMRL/Pathfinder.
URLs: https://github.com/IMRL/Pathfinder, https://github.com/IMRL/Pathfinder.
Authors: Roni Blushtein-Livnon, Tal Svoray, Michael Dorman
Abstract: This study introduces a laboratory experiment designed to assess the influence of annotation strategies, levels of imbalanced data, and prior experience, on the performance of human annotators. The experiment focuses on labeling aerial imagery, using ArcGIS Pro tools, to detect and segment small-scale photovoltaic solar panels, selected as a case study for rectangular objects. The experiment is conducted using images with a pixel size of 0.15\textbf{$m$}, involving both expert and non-expert participants, across different setup strategies and target-background ratio datasets. Our findings indicate that human annotators generally perform more effectively in object detection than in segmentation tasks. A marked tendency to commit more Type II errors (False Negatives, i.e., undetected objects) than Type I errors (False Positives, i.e. falsely detecting objects that do not exist) was observed across all experimental setups and conditions, suggesting a consistent bias in detection and segmentation processes. Performance was better in tasks with higher target-background ratios (i.e., more objects per unit area). Prior experience did not significantly impact performance and may, in some cases, even lead to overestimation in segmentation. These results provide evidence that human annotators are relatively cautious and tend to identify objects only when they are confident about them, prioritizing underestimation over overestimation. Annotators' performance is also influenced by object scarcity, showing a decline in areas with extremely imbalanced datasets and a low ratio of target-to-background. These findings may enhance annotation strategies for remote sensing research while efficient human annotators are crucial in an era characterized by growing demands for high-quality training data to improve segmentation and detection models.
Authors: Ghalib Ahmed Tahir
Abstract: This paper aims to shed light on the ethical problems of creating and deploying computer vision tech, particularly in using publicly available datasets. Due to the rapid growth of machine learning and artificial intelligence, computer vision has become a vital tool in many industries, including medical care, security systems, and trade. However, extensive use of visual data that is often collected without consent due to an informed discussion of its ramifications raises significant concerns about privacy and bias. The paper also examines these issues by analyzing popular datasets such as COCO, LFW, ImageNet, CelebA, PASCAL VOC, etc., that are usually used for training computer vision models. We offer a comprehensive ethical framework that addresses these challenges regarding the protection of individual rights, minimization of bias as well as openness and responsibility. We aim to encourage AI development that will take into account societal values as well as ethical standards to avoid any public harm.
Authors: Jinlong Li, Xinyu Liu, Baolu Li, Runsheng Xu, Jiachen Li, Hongkai Yu, Zhengzhong Tu
Abstract: Cooperative perception systems play a vital role in enhancing the safety and efficiency of vehicular autonomy. Although recent studies have highlighted the efficacy of vehicle-to-everything (V2X) communication techniques in autonomous driving, a significant challenge persists: how to efficiently integrate multiple high-bandwidth features across an expanding network of connected agents such as vehicles and infrastructure. In this paper, we introduce CoMamba, a novel cooperative 3D detection framework designed to leverage state-space models for real-time onboard vehicle perception. Compared to prior state-of-the-art transformer-based models, CoMamba enjoys being a more scalable 3D model using bidirectional state space models, bypassing the quadratic complexity pain-point of attention mechanisms. Through extensive experimentation on V2X/V2V datasets, CoMamba achieves superior performance compared to existing methods while maintaining real-time processing capabilities. The proposed framework not only enhances object detection accuracy but also significantly reduces processing time, making it a promising solution for next-generation cooperative perception systems in intelligent transportation networks.
Authors: Zhongyan Niu, Zhen Tan, Jinpu Zhang, Xueliang Yang, Dewen Hu
Abstract: Visual localization refers to the process of determining camera poses and orientation within a known scene representation. This task is often complicated by factors such as illumination changes and variations in viewing angles. In this paper, we propose HGSLoc, a novel lightweight, plug and-play pose optimization framework, which integrates 3D reconstruction with a heuristic refinement strategy to achieve higher pose estimation accuracy. Specifically, we introduce an explicit geometric map for 3D representation and high-fidelity rendering, allowing the generation of high-quality synthesized views to support accurate visual localization. Our method demonstrates a faster rendering speed and higher localization accuracy compared to NeRF-based neural rendering localization approaches. We introduce a heuristic refinement strategy, its efficient optimization capability can quickly locate the target node, while we set the step-level optimization step to enhance the pose accuracy in the scenarios with small errors. With carefully designed heuristic functions, it offers efficient optimization capabilities, enabling rapid error reduction in rough localization estimations. Our method mitigates the dependence on complex neural network models while demonstrating improved robustness against noise and higher localization accuracy in challenging environments, as compared to neural network joint optimization strategies. The optimization framework proposed in this paper introduces novel approaches to visual localization by integrating the advantages of 3D reconstruction and heuristic refinement strategy, which demonstrates strong performance across multiple benchmark datasets, including 7Scenes and DB dataset.
Authors: Yuan Tian, Guo Lu, Guangtao Zhai
Abstract: Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from VFMs. Specifically, we introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bitcost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
Authors: Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, St\'ephane Jacquemoud, Yann Klinger
Abstract: Understanding the anisotropic reflectance of complex Earth surfaces from satellite imagery is crucial for numerous applications. Neural radiance fields (NeRF) have become popular as a machine learning technique capable of deducing the bidirectional reflectance distribution function (BRDF) of a scene from multiple images. However, prior research has largely concentrated on applying NeRF to close-range imagery, estimating basic Microfacet BRDF models, which fall short for many Earth surfaces. Moreover, high-quality NeRFs generally require several images captured simultaneously, a rare occurrence in satellite imaging. To address these limitations, we propose BRDF-NeRF, developed to explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical BRDF model commonly employed in remote sensing. We assess our approach using two datasets: (1) Djibouti, captured in a single epoch at varying viewing angles with a fixed Sun position, and (2) Lanzhou, captured over multiple epochs with different viewing angles and Sun positions. Our results, based on only three to four satellite images for training, demonstrate that BRDF-NeRF can effectively synthesize novel views from directions far removed from the training data and produce high-quality digital surface models (DSMs).
Authors: DaDong Jiang, Xianghui Yang, Zibo Zhao, Sheng Zhang, Jiaao Yu, Zeqiang Lai, Shaoxiong Yang, Chunchao Guo, Xiaobo Zhou, Zhihui Ke
Abstract: Recent texture generation methods achieve impressive results due to the powerful generative prior they leverage from large-scale text-to-image diffusion models. However, abstract textual prompts are limited in providing global textural or shape information, which results in the texture generation methods producing blurry or inconsistent patterns. To tackle this, we present FlexiTex, embedding rich information via visual guidance to generate a high-quality texture. The core of FlexiTex is the Visual Guidance Enhancement module, which incorporates more specific information from visual guidance to reduce ambiguity in the text prompt and preserve high-frequency details. To further enhance the visual guidance, we introduce a Direction-Aware Adaptation module that automatically designs direction prompts based on different camera poses, avoiding the Janus problem and maintaining semantically global consistency. Benefiting from the visual guidance, FlexiTex produces quantitatively and qualitatively sound results, demonstrating its potential to advance texture generation for real-world applications.
Authors: S\'andor Kuns\'agi-M\'at\'e, Levente Pet\H{o}, Lehel Seres, Tam\'as Matuszka
Abstract: 3D detection of traffic management objects, such as traffic lights and road signs, is vital for self-driving cars, particularly for address-to-address navigation where vehicles encounter numerous intersections with these static objects. This paper introduces a novel method for automatically generating accurate and temporally consistent 3D bounding box annotations for traffic lights and signs, effective up to a range of 200 meters. These annotations are suitable for training real-time models used in self-driving cars, which need a large amount of training data. The proposed method relies only on RGB images with 2D bounding boxes of traffic management objects, which can be automatically obtained using an off-the-shelf image-space detector neural network, along with GNSS/INS data, eliminating the need for LiDAR point cloud data.
Authors: Yuting Hong, Hui Xiao, Huazheng Hao, Xiaojie Qiu, Baochen Yao, Chengbin Peng
Abstract: With the advancement of autonomous driving, semantic segmentation has achieved remarkable progress. The training of such networks heavily relies on image annotations, which are very expensive to obtain. Semi-supervised learning can utilize both labeled data and unlabeled data with the help of pseudo-labels. However, in many real-world scenarios where classes are imbalanced, majority classes often play a dominant role during training and the learning quality of minority classes can be undermined. To overcome this limitation, we propose a synergistic training framework, including a professional training module to enhance minority class learning and a general training module to learn more comprehensive semantic information. Based on a pixel selection strategy, they can iteratively learn from each other to reduce error accumulation and coupling. In addition, a dual contrastive learning with anchors is proposed to guarantee more distinct decision boundaries. In experiments, our framework demonstrates superior performance compared to state-of-the-art methods on benchmark datasets.
Authors: Hanyue Zhang, Zhiliu Yang, Xinhe Zuo, Yuxin Tong, Ying Long, Chen Liu
Abstract: This paper proposes a novel framework for large-scale scene reconstruction based on 3D Gaussian splatting (3DGS) and aims to address the scalability and accuracy challenges faced by existing methods. For tackling the scalability issue, we split the large scene into multiple cells, and the candidate point-cloud and camera views of each cell are correlated through a visibility-based camera selection and a progressive point-cloud extension. To reinforce the rendering quality, three highlighted improvements are made in comparison with vanilla 3DGS, which are a strategy of the ray-Gaussian intersection and the novel Gaussians density control for learning efficiency, an appearance decoupling module based on ConvKAN network to solve uneven lighting conditions in large-scale scenes, and a refined final loss with the color loss, the depth distortion loss, and the normal consistency loss. Finally, the seamless stitching procedure is executed to merge the individual Gaussian radiance field for novel view synthesis across different cells. Evaluation of Mill19, Urban3D, and MatrixCity datasets shows that our method consistently generates more high-fidelity rendering results than state-of-the-art methods of large-scale scene reconstruction. We further validate the generalizability of the proposed approach by rendering on self-collected video clips recorded by a commercial drone.
Authors: Xu Zheng, Lin Wang
Abstract: In this paper, we address the challenging problem of cross-modal (image-to-events) adaptation for event-based recognition without accessing any labeled source image data. This task is arduous due to the substantial modality gap between images and events. With only a pre-trained source model available, the key challenge lies in extracting knowledge from this model and effectively transferring knowledge to the event-based domain. Inspired by the natural ability of language to convey semantics across different modalities, we propose EventDance++, a novel framework that tackles this unsupervised source-free cross-modal adaptation problem from a language-guided perspective. We introduce a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner. Importantly, it leverages a vision-language model to provide further supervision, enriching the surrogate images and enhancing modality bridging. This enables the creation of surrogate images to extract knowledge (i.e., labels) from the source model. On top, we propose a multi-representation knowledge adaptation (MKA) module to transfer knowledge to target models, utilizing multiple event representations to capture the spatiotemporal characteristics of events fully. The L-RMB and MKA modules are jointly optimized to achieve optimal performance in bridging the modality gap. Experiments on three benchmark datasets demonstrate that EventDance++ performs on par with methods that utilize source data, validating the effectiveness of our language-guided approach in event-based recognition.
Authors: Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, Hang Zhao
Abstract: Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at \url{https://github.com/Tsinghua-MARS-Lab/CVT-Occ}.
Authors: Avi Deb Raha, Apurba Adhikary, Mrityunjoy Gain, Yu Qiao, Choong Seon Hong
Abstract: In this study, we explore the efficacy of advanced pre-trained architectures, such as Vision Transformers (ViT), ConvNeXt, and Swin Transformers in enhancing Federated Domain Generalization. These architectures capture global contextual features and model long-range dependencies, making them promising candidates for improving cross-domain generalization. We conduct a broad study with in-depth analysis and systematically evaluate different variants of these architectures, using extensive pre-training datasets such as ImageNet-1K, ImageNet-21K, JFT-300M, and ImageNet-22K. Additionally, we compare self-supervised and supervised pre-training strategies to assess their impact on FDG performance. Our findings suggest that self-supervised techniques, which focus on reconstructing masked image patches, can better capture the intrinsic structure of images, thereby outperforming their supervised counterparts. Comprehensive evaluations on the Office-Home and PACS datasets demonstrate that adopting advanced architectures pre-trained on larger datasets establishes new benchmarks, achieving average accuracies of 84.46\% and 92.55\%, respectively. Additionally, we observe that certain variants of these advanced models, despite having fewer parameters, outperform larger ResNet models. This highlights the critical role of utilizing sophisticated architectures and diverse pre-training strategies to enhance FDG performance, especially in scenarios with limited computational resources where model efficiency is crucial. Our results indicate that federated learning systems can become more adaptable and efficient by leveraging these advanced methods, offering valuable insights for future research in FDG.
Authors: Penghao Wang, Zhirui Zhang, Liao Wang, Kaixin Yao, Siyuan Xie, Jingyi Yu, Minye Wu, Lan Xu
Abstract: Experiencing high-fidelity volumetric video as seamlessly as 2D videos is a long-held dream. However, current dynamic 3DGS methods, despite their high rendering quality, face challenges in streaming on mobile devices due to computational and bandwidth constraints. In this paper, we introduce V^3 (Viewing Volumetric Videos), a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians. Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs. Additionally, we propose a two-stage training strategy to reduce storage requirements with rapid training speed. The first stage employs hash encoding and shallow MLP to learn motion, then reduces the number of Gaussians through pruning to meet the streaming requirements, while the second stage fine tunes other Gaussian attributes using residual entropy loss and temporal loss to improve temporal continuity. This strategy, which disentangles motion and appearance, maintains high rendering quality with compact storage requirements. Meanwhile, we designed a multi-platform player to decode and render 2D Gaussian videos. Extensive experiments demonstrate the effectiveness of V^3, outperforming other methods by enabling high-quality rendering and streaming on common devices, which is unseen before. As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience, including smooth scrolling and instant sharing. Our project page with source code is available at https://authoritywang.github.io/v3/.
Authors: Yuan Tian, Guo Lu, Yichao Yan, Guangtao Zhai, Li Chen, Zhiyong Gao
Abstract: Video compression is indispensable to most video analysis systems. Despite saving transportation bandwidth, it also deteriorates downstream video understanding tasks, especially at low-bitrate settings. To systematically investigate this problem, we first thoroughly review the previous methods, revealing that three principles, i.e., task-decoupled, label-free, and data-emerged semantic prior, are critical to a machine-friendly coding framework but are not fully satisfied so far. In this paper, we propose a traditional-neural mixed coding framework that simultaneously fulfills all these principles, by taking advantage of both traditional codecs and neural networks (NNs). On one hand, the traditional codecs can efficiently encode the pixel signal of videos but may distort the semantic information. On the other hand, highly non-linear NNs are proficient in condensing video semantics into a compact representation. The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved w.r.t. the coding procedure, which is spontaneously learned from unlabeled data in a self-supervised manner. The videos collaboratively decoded from two streams (codec and NN) are of rich semantics, as well as visually photo-realistic, empirically boosting several mainstream downstream video analysis task performances without any post-adaptation procedure. Furthermore, by introducing the attention mechanism and adaptive modeling scheme, the video semantic modeling ability of our approach is further enhanced. Finally, we build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach. All codes, data, and models will be available at \url{https://github.com/tianyuan168326/VCS-Pytorch}.
Authors: Yuan Yuan, Euijoon Ahn, Dagan Feng, Mohamad Khadra, Jinman Kim
Abstract: Bi-parametric magnetic resonance imaging (bpMRI) has become a pivotal modality in the detection and diagnosis of clinically significant prostate cancer (csPCa). Developing AI-based systems to identify csPCa using bpMRI can transform PCa management by improving efficiency and cost-effectiveness. However, current state-of-the-art methods using convolutional neural networks (CNNs) are limited in learning in-plane and three-dimensional spatial information from anisotropic images. Their performances also depend on the availability of large, diverse, and well-annotated bpMRI datasets. We propose a Zonal-aware Self-supervised Mesh Network (Z-SSMNet) that adaptively integrates multi-dimensional (2D/2.5D/3D) convolutions to learn dense intra-slice information and sparse inter-slice information of the anisotropic bpMRI in a balanced manner. A self-supervised learning (SSL) technique is proposed to pre-train our network using large-scale unlabeled data to learn the appearance, texture, and structure semantics of bpMRI. It aims to capture both intra-slice and inter-slice information during the pre-training stage. Furthermore, we constrained our network to focus on the zonal anatomical regions to further improve the detection and diagnosis capability of csPCa. We conducted extensive experiments on the PI-CAI dataset comprising 10000+ multi-center and multi-scanner data. Our Z-SSMNet excelled in both lesion-level detection (AP score of 0.633) and patient-level diagnosis (AUROC score of 0.881), securing the top position in the Open Development Phase of the PI-CAI challenge and maintained strong performance, achieving an AP score of 0.690 and an AUROC score of 0.909, and securing the second-place ranking in the Closed Testing Phase.
Authors: Florian Kofler, Felix Meissen, Felix Steinbauer, Robert Graf, Stefan K Ehrlich, Annika Reinke, Eva Oswald, Diana Waldmannstetter, Florian Hoelzl, Izabela Horvath, Oezguen Turgut, Suprosanna Shit, Christina Bukas, Kaiyuan Yang, Johannes C. Paetzold, Ezequiel de da Rosa, Isra Mekki, Shankeeth Vinayahalingam, Hasan Kassem, Juexin Zhang, Ke Chen, Ying Weng, Alicia Durrer, Philippe C. Cattin, Julia Wolleb, M. S. Sadique, M. M. Rahman, W. Farzana, A. Temtam, K. M. Iftekharuddin, Maruf Adewole, Syed Muhammad Anwar, Ujjwal Baid, Anastasia Janas, Anahita Fathi Kazerooni, Dominic LaBella, Hongwei Bran Li, Ahmed W Moawad, Gian-Marco Conte, Keyvan Farahani, James Eddy, Micah Sheller, Sarthak Pati, Alexandros Karagyris, Alejandro Aristizabal, Timothy Bergquist, Verena Chung, Russell Takeshi Shinohara, Farouk Dako, Walter Wiggins, Zachary Reitman, Chunhao Wang, Xinyang Liu, Zhifan Jiang, Elaine Johanson, Zeke Meier, Ariana Familiar, Christos Davatzikos, John Freymann, Justin Kirby, Michel Bilello, Hassan M Fathallah-Shaykh, Roland Wiest, Jan Kirschke, Rivka R Colen, Aikaterini Kotrotsou, Pamela Lamontagne, Daniel Marcus, Mikhail Milchenko, Arash Nazeri, Marc-Andr\'e Weber, Abhishek Mahajan, Suyash Mohan, John Mongan, Christopher Hess, Soonmee Cha, Javier Villanueva-Meyer, Errol Colak, Priscila Crivellaro, Andras Jakab, Abiodun Fatade, Olubukola Omidiji, Rachel Akinola Lagos, O O Olatunji, Goldey Khanna, John Kirkpatrick, Michelle Alonso-Basanta, Arif Rashid, Miriam Bornhorst, Ali Nabavizadeh, Natasha Lepore, Joshua Palmer, Antonio Porras, Jake Albrecht, Udunna Anazodo, Mariam Aboian, Evan Calabrese, Jeffrey David Rudie, Marius George Linguraru, Juan Eugenio Iglesias, Koen Van Leemput, Spyridon Bakas, Benedikt Wiestler, Ivan Ezhov, Marie Piraud, Bjoern H Menze
Abstract: A myriad of algorithms for the automatic analysis of brain MR images is available to support clinicians in their decision-making. For brain tumor patients, the image acquisition time series typically starts with an already pathological scan. This poses problems, as many algorithms are designed to analyze healthy brains and provide no guarantee for images featuring lesions. Examples include, but are not limited to, algorithms for brain anatomy parcellation, tissue segmentation, and brain extraction. To solve this dilemma, we introduce the BraTS inpainting challenge. Here, the participants explore inpainting techniques to synthesize healthy brain scans from lesioned ones. The following manuscript contains the task formulation, dataset, and submission procedure. Later, it will be updated to summarize the findings of the challenge. The challenge is organized as part of the ASNR-BraTS MICCAI challenge.
Authors: Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu
Abstract: Out-of-Distribution (OOD) detection is critical for the reliable operation of open-world intelligent systems. Despite the emergence of an increasing number of OOD detection methods, the evaluation inconsistencies present challenges for tracking the progress in this field. OpenOOD v1 initiated the unification of the OOD detection evaluation but faced limitations in scalability and usability. In response, this paper presents OpenOOD v1.5, a significant improvement from its predecessor that ensures accurate, standardized, and user-friendly evaluation of OOD detection methodologies. Notably, OpenOOD v1.5 extends its evaluation capabilities to large-scale datasets such as ImageNet, investigates full-spectrum OOD detection which is important yet underexplored, and introduces new features including an online leaderboard and an easy-to-use evaluator. This work also contributes in-depth analysis and insights derived from comprehensive experimental results, thereby enriching the knowledge pool of OOD detection methodologies. With these enhancements, OpenOOD v1.5 aims to drive advancements and offer a more robust and comprehensive evaluation benchmark for OOD detection research.
Authors: Zhenlin Zhang, Lixin Pu, Ang Li, Jun Zhang, Xianjie Li, Jipeng Fan
Abstract: Scoliosis is a three-dimensional spinal deformity, which may lead to abnormal morphologies, such as thoracic deformity, and pelvic tilt. Severe patients may suffer from nerve damage and urinary abnormalities. At present, the number of scoliosis patients in primary and secondary schools has exceeded five million in China, the incidence rate is about 3% to 5% which is growing every year. The research on scoliosis, therefore, has important clinical value. This paper systematically introduces computer-assisted scoliosis screening and diagnosis as well as analyzes the advantages and limitations of different algorithm models in the current issue field. Moreover, the paper also discusses the current development bottlenecks in this field and looks forward to future development trends.
Authors: Yuefeng Peng, Ali Naseh, Amir Houmansadr
Abstract: Deep learning models, while achieving remarkable performances, are vulnerable to membership inference attacks (MIAs). Although various defenses have been proposed, there is still substantial room for improvement in the privacy-utility trade-off. In this work, we introduce a novel defense framework against MIAs by leveraging generative models. The key intuition of our defense is to remove the differences between member and non-member inputs, which is exploited by MIAs, by re-generating input samples before feeding them to the target model. Therefore, our defense, called DIFFENCE, works pre inference, which is unlike prior defenses that are either training-time or post-inference time. A unique feature of DIFFENCE is that it works on input samples only, without modifying the training or inference phase of the target model. Therefore, it can be cascaded with other defense mechanisms as we demonstrate through experiments. DIFFENCE is designed to preserve the model's prediction labels for each sample, thereby not affecting accuracy. Furthermore, we have empirically demonstrated it does not reduce the usefulness of confidence vectors. Through extensive experimentation, we show that DIFFENCE can serve as a robust plug-n-play defense mechanism, enhancing membership privacy without compromising model utility. For instance, DIFFENCE reduces MIA accuracy against an undefended model by 15.8\% and attack AUC by 14.0\% on average across three datasets, all without impacting model utility. By integrating DIFFENCE with prior defenses, we can achieve new state-of-the-art performances in the privacy-utility trade-off. For example, when combined with the state-of-the-art SELENA defense it reduces attack accuracy by 9.3\%, and attack AUC by 10.0\%. DIFFENCE achieves this by imposing a negligible computation overhead, adding only 57ms to the inference time per sample processed on average.
Authors: Pratibha Kumari, Joohi Chauhan, Afshin Bozorgpour, Boqiang Huang, Reza Azad, Dorit Merhof
Abstract: Medical imaging analysis has witnessed remarkable advancements even surpassing human-level performance in recent years, driven by the rapid development of advanced deep-learning algorithms. However, when the inference dataset slightly differs from what the model has seen during one-time training, the model performance is greatly compromised. The situation requires restarting the training process using both the old and the new data which is computationally costly, does not align with the human learning process, and imposes storage constraints and privacy concerns. Alternatively, continual learning has emerged as a crucial approach for developing unified and sustainable deep models to deal with new classes, tasks, and the drifting nature of data in non-stationary environments for various application areas. Continual learning techniques enable models to adapt and accumulate knowledge over time, which is essential for maintaining performance on evolving datasets and novel tasks. This systematic review paper provides a comprehensive overview of the state-of-the-art in continual learning techniques applied to medical imaging analysis. We present an extensive survey of existing research, covering topics including catastrophic forgetting, data drifts, stability, and plasticity requirements. Further, an in-depth discussion of key components of a continual learning framework such as continual learning scenarios, techniques, evaluation schemes, and metrics is provided. Continual learning techniques encompass various categories, including rehearsal, regularization, architectural, and hybrid strategies. We assess the popularity and applicability of continual learning categories in various medical sub-fields like radiology and histopathology...
Authors: Chengbo Yuan, Chuan Wen, Tong Zhang, Yang Gao
Abstract: We address the challenge of acquiring real-world manipulation skills with a scalable framework. We hold the belief that identifying an appropriate prediction target capable of leveraging large-scale datasets is crucial for achieving efficient and universal learning. Therefore, we propose to utilize 3D flow, which represents the future trajectories of 3D points on objects of interest, as an ideal prediction target. To exploit scalable data resources, we turn our attention to human videos. We develop, for the first time, a language-conditioned 3D flow prediction model directly from large-scale RGBD human video datasets. Our predicted flow offers actionable guidance, thus facilitating zero-shot skill transfer in real-world scenarios. We deploy our method with a policy based on closed-loop flow prediction. Remarkably, without any in-domain finetuning, our method achieves an impressive 81\% success rate in zero-shot human-to-robot skill transfer, covering 18 tasks in 6 scenes. Our framework features the following benefits: (1) scalability: leveraging cross-embodiment data resources; (2) wide application: multiple object categories, including rigid, articulated, and soft bodies; (3) stable skill transfer: providing actionable guidance with a small inference domain-gap. Code, data, and supplementary materials are available https://general-flow.github.io
Authors: Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai
Abstract: Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. However, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. Our code and datasets will be publicly available.
Authors: Peihan Tu, Li-Yi Wei, Matthias Zwicker
Abstract: Texture plays a vital role in enhancing visual richness in both real photographs and computer-generated imagery. However, the process of editing textures often involves laborious and repetitive manual adjustments of textons, which are the recurring local patterns that characterize textures. This work introduces a fully unsupervised approach for representing textures using a compositional neural model that captures individual textons. We represent each texton as a 2D Gaussian function whose spatial support approximates its shape, and an associated feature that encodes its detailed appearance. By modeling a texture as a discrete composition of Gaussian textons, the representation offers both expressiveness and ease of editing. Textures can be edited by modifying the compositional Gaussians within the latent space, and new textures can be efficiently synthesized by feeding the modified Gaussians through a generator network in a feed-forward manner. This approach enables a wide range of applications, including transferring appearance from an image texture to another image, diversifying textures,texture interpolation, revealing/modifying texture variations, edit propagation, texture animation, and direct texton manipulation. The proposed approach contributes to advancing texture analysis, modeling, and editing techniques, and opens up new possibilities for creating visually appealing images with controllable textures.
Authors: Deborah Pereg
Abstract: Image restoration, or inverse problems in image processing, has long been an extensively studied topic. In recent years supervised learning approaches have become a popular strategy attempting to tackle this task. Unfortunately, most supervised learning-based methods are highly demanding in terms of computational resources and training data (sample complexity). In addition, trained models are sensitive to domain changes, such as varying acquisition systems, signal sampling rates, resolution and contrast. In this work, we try to answer a fundamental question: Can supervised learning models generalize well solely by learning from one image or even part of an image? If so, then what is the minimal amount of patches required to achieve acceptable generalization? To this end, we focus on an efficient patch-based learning framework that requires a single image input-output pair for training. Experimental results demonstrate the applicability, robustness and computational efficiency of the proposed approach for supervised image deblurring and super-resolution. Our results showcase significant improvement of learning models' sample efficiency, generalization and time complexity, that can hopefully be leveraged for future real-time applications, and applied to other signals and modalities.
Authors: Andrew Kyle Lampinen, Stephanie C. Y. Chan, Katherine Hermann
Abstract: Representation learning, and interpreting learned representations, are key areas of focus in machine learning and neuroscience. Both fields generally use representations as a means to understand or improve a system's computations. In this work, however, we explore surprising dissociations between representation and computation that may pose challenges for such efforts. We create datasets in which we attempt to match the computational role that different features play, while manipulating other properties of the features or the data. We train various deep learning architectures to compute these multiple abstract features about their inputs. We find that their learned feature representations are systematically biased towards representing some features more strongly than others, depending upon extraneous properties such as feature complexity, the order in which features are learned, and the distribution of features over the inputs. For example, features that are simpler to compute or learned first tend to be represented more strongly and densely than features that are more complex or learned later, even if all features are learned equally well. We also explore how these biases are affected by architectures, optimizers, and training regimes (e.g., in transformers, features decoded earlier in the output sequence also tend to be represented more strongly). Our results help to characterize the inductive biases of gradient-based representation learning. We then illustrate the downstream effects of these biases on various commonly-used methods for analyzing or intervening on representations. These results highlight a key challenge for interpretability $-$ or for comparing the representations of models and brains $-$ disentangling extraneous biases from the computationally important aspects of a system's internal representations.
Authors: Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha
Abstract: The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area.
Authors: Syed Javed, Tariq M. Khan, Abdul Qayyum, Hamid Alinejad-Rokny, Arcot Sowmya, Imran Razzak
Abstract: Accurate segmentation of anatomical structures and abnormalities in medical images is crucial for computer-aided diagnosis and analysis. While deep learning techniques excel at this task, their computational demands pose challenges. Additionally, some cutting-edge segmentation methods, though effective for general object segmentation, may not be optimised for medical images. To address these issues, we propose Mini-Net, a lightweight segmentation network specifically designed for medical images. With fewer than 38,000 parameters, Mini-Net efficiently captures both high- and low-frequency features, enabling real-time applications in various medical imaging scenarios. We evaluate Mini-Net on various datasets, including DRIVE, STARE, ISIC-2016, ISIC-2018, and MoNuSeg, demonstrating its robustness and good performance compared to state-of-the-art methods.
Authors: Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li
Abstract: Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performances. However, in many specialized fields, it is struggling to obtain sufficient alignment data for training, which seriously limits the use of previously effective models. Therefore, semi-supervised learning approaches are attempted to facilitate multimodal alignment by learning from low-alignment data with fewer matched pairs, but traditional techniques like pseudo-labeling may run into troubles in the label-deficient scenarios. To tackle these challenges, we reframe semi-supervised multimodal alignment as a manifold matching issue and propose a new methodology based on CLIP, termed Set-CLIP. Specifically, by designing a novel semantic density distribution loss, we constrain the latent representation distribution with fine granularity and extract implicit semantic alignment from unpaired multimodal data, thereby reducing the reliance on numerous strictly matched pairs. Furthermore, we apply coarse-grained modality adaptation and unimodal self-supervised guidance to narrow the gaps between modality spaces and improve the stability of representation distributions. Extensive experiments conducted on a range of tasks in various fields, including protein analysis, remote sensing, and the general vision-language field, validate the efficacy of our proposed Set-CLIP method. Especially with no paired data for supervised training, Set-CLIP is still outstanding, which brings an improvement of 144.83% over CLIP.
Authors: Yubo Huang, Jia Wang, Peipei Li, Liuyu Xiang, Peigang Li, Zhaofeng He
Abstract: Iris restoration from complexly degraded iris images, aiming to improve iris recognition performance, is a challenging problem. Due to the complex degradation, directly training a convolutional neural network (CNN) without prior cannot yield satisfactory results. In this work, we propose a generative iris prior embedded Transformer model (Gformer), in which we build a hierarchical encoder-decoder network employing Transformer block and generative iris prior. First, we tame Transformer blocks to model long-range dependencies in target images. Second, we pretrain an iris generative adversarial network (GAN) to obtain the rich iris prior, and incorporate it into the iris restoration process with our iris feature modulator. Our experiments demonstrate that the proposed Gformer outperforms state-of-the-art methods. Besides, iris recognition performance has been significantly improved after applying Gformer.
Authors: Shengcheng Luo, Quanquan Peng, Jun Lv, Kaiwen Hong, Katherine Rose Driggs-Campbell, Cewu Lu, Yong-Lu Li
Abstract: Employing a teleoperation system for gathering demonstrations offers the potential for more efficient learning of robot manipulation. However, teleoperating a robot arm equipped with a dexterous hand or gripper, via a teleoperation system presents inherent challenges due to the task's high dimensionality, complexity of motion, and differences between physiological structures. In this study, we introduce a novel system for joint learning between human operators and robots, that enables human operators to share control of a robot end-effector with a learned assistive agent, simplifies the data collection process, and facilitates simultaneous human demonstration collection and robot manipulation training. As data accumulates, the assistive agent gradually learns. Consequently, less human effort and attention are required, enhancing the efficiency of the data collection process. It also allows the human operator to adjust the control ratio to achieve a trade-off between manual and automated control. We conducted experiments in both simulated environments and physical real-world settings. Through user studies and quantitative evaluations, it is evident that the proposed system could enhance data collection efficiency and reduce the need for human adaptation while ensuring the collected data is of sufficient quality for downstream tasks. \textit{For more details, please refer to our webpage https://norweig1an.github.io/HAJL.github.io/.
Authors: Honggui Li, Sinan Chen, Nahid Md Lokman Hossain, Maria Trocan, Dimitri Galayko, Mohamad Sawan
Abstract: Learned image compression (LIC) is currently the cutting-edge method. However, the inherent difference between testing and training images of LIC results in performance degradation to some extent. Especially for out-of-sample, out-of-distribution, or out-of-domain testing images, the performance of LIC dramatically degraded. Classical LIC is a serial image compression (SIC) approach that utilizes an open-loop architecture with serial encoding and decoding units. Nevertheless, according to the theory of automatic control, a closed-loop architecture holds the potential to improve the dynamic and static performance of LIC. Therefore, a circular image compression (CIC) approach with closed-loop encoding and decoding elements is proposed to minimize the gap between testing and training images and upgrade the capability of LIC. The proposed CIC establishes a nonlinear loop equation and proves that steady-state error between reconstructed and original images is close to zero by Taylor series expansion. The proposed CIC method possesses the property of Post-Training and plug-and-play which can be built on any existing advanced SIC methods. Experimental results on five public image compression datasets demonstrate that the proposed CIC outperforms five competing state-of-the-art open-source SIC algorithms in reconstruction capacity. Experimental results further show that the proposed method is suitable for out-of-sample testing images with dark backgrounds, sharp edges, high contrast, grid shapes, or complex patterns.
Authors: Xuan-Bac Nguyen, Hoang-Quan Nguyen, Hugh Churchill, Samee U. Khan, Khoa Luu
Abstract: Quantum computing has emerged as a powerful tool for solving complex problems intractable for classical computers, particularly in popular fields such as cryptography, optimization, and neurocomputing. In this paper, we present a new quantum-based approach named the Hierarchical Quantum Control Gates (HQCG) method for efficient understanding of Functional Magnetic Resonance Imaging (fMRI) data. This approach includes two novel modules: the Local Quantum Control Gate (LQCG) and the Global Quantum Control Gate (GQCG), which are designed to extract local and global features of fMRI signals, respectively. Our method operates end-to-end on a quantum machine, leveraging quantum mechanics to learn patterns within extremely high-dimensional fMRI signals, such as 30,000 samples which is a challenge for classical computers. Empirical results demonstrate that our approach significantly outperforms classical methods. Additionally, we found that the proposed quantum model is more stable and less prone to overfitting than the classical methods.
Authors: Jianren Wang, Kangni Liu, Dingkun Guo, Xian Zhou, Christopher G Atkeson
Abstract: Learning to manipulate dynamic and deformable objects from a single demonstration video holds great promise in terms of scalability. Previous approaches have predominantly focused on either replaying object relationships or actor trajectories. The former often struggles to generalize across diverse tasks, while the latter suffers from data inefficiency. Moreover, both methodologies encounter challenges in capturing invisible physical attributes, such as forces. In this paper, we propose to interpret video demonstrations through Parameterized Symbolic Abstraction Graphs (PSAG), where nodes represent objects and edges denote relationships between objects. We further ground geometric constraints through simulation to estimate non-geometric, visually imperceptible attributes. The augmented PSAG is then applied in real robot experiments. Our approach has been validated across a range of tasks, such as Cutting Avocado, Cutting Vegetable, Pouring Liquid, Rolling Dough, and Slicing Pizza. We demonstrate successful generalization to novel objects with distinct visual and physical properties.
Authors: Ben Batten, Yang Zheng, Alessandro De Palma, Panagiotis Kouvaros, Alessio Lomuscio
Abstract: We address the problem of verifying neural networks against geometric transformations of the input image, including rotation, scaling, shearing, and translation. The proposed method computes provably sound piecewise linear constraints for the pixel values by using sampling and linear approximations in combination with branch-and-bound Lipschitz optimisation. The method obtains provably tighter over-approximations of the perturbation region than the present state-of-the-art. We report results from experiments on a comprehensive set of verification benchmarks on MNIST and CIFAR10. We show that our proposed implementation resolves up to 32% more verification cases than present approaches.
Authors: Zhe Xiong, Qiaoqiao Ding, Xiaoqun Zhang
Abstract: Recently, medical image synthesis gains more and more popularity, along with the rapid development of generative models. Medical image synthesis aims to generate an unacquired image modality, often from other observed data modalities. Synthesized images can be used for clinical diagnostic assistance, data augmentation for model training and validation or image quality improving. In the meanwhile, the flow-based models are among the successful generative models for the ability of generating realistic and high-quality synthetic images. However, most flow-based models require to calculate flow ordinary different equation (ODE) evolution steps in transfer process, for which the performances are significantly limited by heavy computation time due to a large number of time iterations. In this paper, we propose a novel flow-based model, namely Discrete Process Matching (DPM) to accomplish the bi-modality image transfer tasks. Different to other flow matching based models, we propose to utilize both forward and backward ODE flow and enhance the consistency on the intermediate images of few discrete time steps, resulting in a transfer process with much less iteration steps while maintaining high-quality generations for both modalities. Our experiments on three datasets of MRI T1/T2 and CT/MRI demonstrate that DPM outperforms other state-of-the-art flow-based methods for bi-modality image synthesis, achieving higher image quality with less computation time cost.