Authors: Inwoo Seo, Eunkyoung Bae, Joo-Young Jeon, Young-Sang Yoon, Jiho Cha
Abstract: Social problems stemming from the shortage of radiologists are intensifying, and artificial intelligence is being highlighted as a potential solution. Recently emerging large-scale generative AI has expanded from large language models (LLMs) to multi-modal models, showing potential to revolutionize the entire process of medical imaging. However, comprehensive reviews on their development status and future challenges are currently lacking. This scoping review systematically organizes existing literature on the clinical value of large-scale generative AI applications by following PCC guidelines. A systematic search was conducted across four databases: PubMed, EMbase, IEEE-Xplore, and Google Scholar, and 15 studies meeting the inclusion/exclusion criteria set by the researchers were reviewed. Most of these studies focused on improving the efficiency of report generation in specific parts of the interpretation process or on translating reports to aid patient understanding, with the latest studies extending to AI applications performing direct interpretations. All studies were quantitatively evaluated by clinicians, with most utilizing LLMs and only three employing multi-modal models. Both LLMs and multi-modal models showed excellent results in specific areas, but none yet outperformed radiologists in diagnostic performance. Most studies utilized GPT, with few using models specialized for the medical imaging domain. This study provides insights into the current state and limitations of large-scale generative AI-based applications in the medical imaging field, offering foundational data and suggesting that the era of medical imaging foundation models is on the horizon, which may fundamentally transform clinical practice in the near future.
Authors: James E. Gallagher, Edward J. Oughton
Abstract: Multispectral imaging and deep learning have emerged as powerful tools supporting diverse use cases from autonomous vehicles, to agriculture, infrastructure monitoring and environmental assessment. The combination of these technologies has led to significant advancements in object detection, classification, and segmentation tasks in the non-visible light spectrum. This paper considers 400 total papers, reviewing 200 in detail to provide an authoritative meta-review of multispectral imaging technologies, deep learning models, and their applications, considering the evolution and adaptation of You Only Look Once (YOLO) methods. Ground-based collection is the most prevalent approach, totaling 63% of the papers reviewed, although uncrewed aerial systems (UAS) for YOLO-multispectral applications have doubled since 2020. The most prevalent sensor fusion is Red-Green-Blue (RGB) with Long-Wave Infrared (LWIR), comprising 39% of the literature. YOLOv5 remains the most used variant for adaption to multispectral applications, consisting of 33% of all modified YOLO models reviewed. 58% of multispectral-YOLO research is being conducted in China, with broadly similar research quality to other countries (with a mean journal impact factor of 4.45 versus 4.36 for papers not originating from Chinese institutions). Future research needs to focus on (i) developing adaptive YOLO architectures capable of handling diverse spectral inputs that do not require extensive architectural modifications, (ii) exploring methods to generate large synthetic multispectral datasets, (iii) advancing multispectral YOLO transfer learning techniques to address dataset scarcity, and (iv) innovating fusion research with other sensor types beyond RGB and LWIR.
Authors: Eslam Eldeeb, Mohammad Shehab, Hirley Alves, Mohamed-Slim Alouini
Abstract: Semantic and goal-oriented (SGO) communication is an emerging technology that only transmits significant information for a given task. Semantic communication encounters many challenges, such as computational complexity at end users, availability of data, and privacy-preserving. This work presents a TinyML-based semantic communication framework for few-shot wireless image classification that integrates split-learning and meta-learning. We exploit split-learning to limit the computations performed by the end-users while ensuring privacy-preserving. In addition, meta-learning overcomes data availability concerns and speeds up training by utilizing similarly trained tasks. The proposed algorithm is tested using a data set of images of hand-written letters. In addition, we present an uncertainty analysis of the predictions using conformal prediction (CP) techniques. Simulation results show that the proposed Semantic-MSL outperforms conventional schemes by achieving 20 % gain on classification accuracy using fewer data points, yet less training energy consumption.
Authors: Shuai Guo, Houqiang Zhong, Qiuwen Wang, Ziyu Chen, Yijie Gao, Jiajing Yuan, Chenyu Zhang, Rong Xie, Li Song
Abstract: Recently, NVS in human-object interaction scenes has received increasing attention. Existing human-object interaction datasets mainly consist of static data with limited views, offering only RGB images or videos, mostly containing interactions between a single person and objects. Moreover, these datasets exhibit complexities in lighting environments, poor synchronization, and low resolution, hindering high-quality human-object interaction studies. In this paper, we introduce a new people-object interaction dataset that comprises 38 series of 30-view multi-person or single-person RGB-D video sequences, accompanied by camera parameters, foreground masks, SMPL models, some point clouds, and mesh files. Video sequences are captured by 30 Kinect Azures, uniformly surrounding the scene, each in 4K resolution 25 FPS, and lasting for 1$\sim$19 seconds. Meanwhile, we evaluate some SOTA NVS models on our dataset to establish the NVS benchmarks. We hope our work can inspire further research in humanobject interaction.
Authors: Sunjae Yoon, Gwanhyeong Koo, Ji Woo Hong, Chang D. Yoo
Abstract: Text-based diffusion video editing systems have been successful in performing edits with high fidelity and textual alignment. However, this success is limited to rigid-type editing such as style transfer and object overlay, while preserving the original structure of the input video. This limitation stems from an initial latent noise employed in diffusion video editing systems. The diffusion video editing systems prepare initial latent noise to edit by gradually infusing Gaussian noise onto the input video. However, we observed that the visual structure of the input video still persists within this initial latent noise, thereby restricting non-rigid editing such as motion change necessitating structural modifications. To this end, this paper proposes Dilutional Noise Initialization (DNI) framework which enables editing systems to perform precise and dynamic modification including non-rigid editing. DNI introduces a concept of `noise dilution' which adds further noise to the latent noise in the region to be edited to soften the structural rigidity imposed by input video, resulting in more effective edits closer to the target prompt. Extensive experiments demonstrate the effectiveness of the DNI framework.
Authors: Eleonora Beatrice Rossi, Eleonora Lopez, Danilo Comminiello
Abstract: The application of deep learning in medical imaging has significantly advanced diagnostic capabilities, enhancing both accuracy and efficiency. Despite these benefits, the lack of transparency in these AI models, often termed "black boxes," raises concerns about their reliability in clinical settings. Explainable AI (XAI) aims to mitigate these concerns by developing methods that make AI decisions understandable and trustworthy. In this study, we propose Tumor Aware Counterfactual Explanations (TACE), a framework designed to generate reliable counterfactual explanations for medical images. Unlike existing methods, TACE focuses on modifying tumor-specific features without altering the overall organ structure, ensuring the faithfulness of the counterfactuals. We achieve this by including an additional step in the generation process which allows to modify only the region of interest (ROI), thus yielding more reliable counterfactuals as the rest of the organ remains unchanged. We evaluate our method on mammography images and brain MRI. We find that our method far exceeds existing state-of-the-art techniques in quality, faithfulness, and generation speed of counterfactuals. Indeed, more faithful explanations lead to a significant improvement in classification success rates, with a 10.69% increase for breast cancer and a 98.02% increase for brain tumors. The code of our work is available at https://github.com/ispamm/TACE.
Authors: Chengrui Gao, Ziyuan Yang, Tiong-Sik Ng, Min Zhu, Andrew Beng Jin Teoh
Abstract: Palmprint recognition has emerged as a prominent biometric authentication method, owing to its high discriminative power and user-friendly nature. This paper introduces a novel Cross-Chirality Palmprint Verification (CCPV) framework that challenges the conventional wisdom in traditional palmprint verification systems. Unlike existing methods that typically require storing both left and right palmprints, our approach enables verification using either palm while storing only one palmprint template. The core of our CCPV framework lies in a carefully designed matching rule. This rule involves flipping both the gallery and query palmprints and calculating the average distance between each pair as the final matching distance. This approach effectively reduces matching variance and enhances overall system robustness. We introduce a novel cross-chirality loss function to construct a discriminative and robust cross-chirality feature space. This loss enforces representation consistency across four palmprint variants: left, right, flipped left, and flipped right. The resulting compact feature space, coupled with the model's enhanced discriminative representation capability, ensures robust performance across various scenarios. We conducted extensive experiments to validate the efficacy of our proposed method. The evaluation encompassed multiple public datasets and considered both closed-set and open-set settings. The results demonstrate the CCPV framework's effectiveness and highlight its potential for real-world applications in palmprint authentication systems.
Authors: Aimar Silvan Ortubay, Lucas C. Parra, Jens Madsen
Abstract: Students often drift in and out of focus during class. Effective teachers recognize this and re-engage them when necessary. With the shift to remote learning, teachers have lost the visual feedback needed to adapt to varying student engagement. We propose using readily available front-facing video to infer attention levels based on movements of the eyes, head, and face. We train a deep learning model to predict a measure of attention based on overt eye movements. Specifically, we measure Inter-Subject Correlation of eye movements in ten-second intervals while students watch the same educational videos. In 3 different experiments (N=83) we show that the trained model predicts this objective metric of attention on unseen data with $R^2$=0.38, and on unseen subjects with $R^2$=0.26-0.30. The deep network relies mostly on a student's eye movements, but to some extent also on movements of the brows, cheeks, and head. In contrast to Inter-Subject Correlation of the eyes, the model can estimate attentional engagement from individual students' movements without needing reference data from an attentive group. This enables a much broader set of online applications. The solution is lightweight and can operate on the client side, which mitigates some of the privacy concerns associated with online attention monitoring.
Authors: Anastasia Anichenko, Frank Guerin, Andrew Gilbert
Abstract: We investigate a human-like interpretable model of video understanding. Humans recognise complex activities in video by recognising critical spatio-temporal relations among explicitly recognised objects and parts, for example, an object entering the aperture of a container. To mimic this we build on a model which uses positions of objects and hands, and their motions, to recognise the activity taking place. To improve this model we focussed on three of the most confused classes (for this model) and identified that the lack of 3D information was the major problem. To address this we extended our basic model by adding 3D awareness in two ways: (1) A state-of-the-art object detection model was fine-tuned to determine the difference between "Container" and "NotContainer" in order to integrate object shape information into the existing object features. (2) A state-of-the-art depth estimation model was used to extract depth values for individual objects and calculate depth relations to expand the existing relations used our interpretable model. These 3D extensions to our basic model were evaluated on a subset of three superficially similar "Putting" actions from the Something-Something-v2 dataset. The results showed that the container detector did not improve performance, but the addition of depth relations made a significant improvement to performance.
Authors: Tian Liu, Liuyi Jin, Radu Stoleru, Amran Haroon, Charles Swanson, Kexin Feng
Abstract: Current state-of-the-art residential irrigation systems, such as WaterMyYard, rely on rainfall data from nearby weather stations to adjust irrigation amounts. However, the accuracy of rainfall data is compromised by the limited spatial resolution of rain gauges and the significant variability of hyperlocal rainfall, leading to substantial water waste. To improve irrigation efficiency, we developed a cost-effective irrigation system, dubbed ERIC, which employs machine learning models to estimate rainfall from commodity doorbell camera footage and optimizes irrigation schedules without human intervention. Specifically, we: a) designed novel visual and audio features with lightweight neural network models to infer rainfall from the camera at the edge, preserving user privacy; b) built a complete end-to-end irrigation system on Raspberry Pi 4, costing only $75. We deployed the system across five locations (collecting over 750 hours of video) with varying backgrounds and light conditions. Comprehensive evaluation validates that ERIC achieves state-of-the-art rainfall estimation performance (~ 5mm/day), saving 9,112 gallons/month of water, translating to $28.56/month in utility savings.
Authors: Jinho Park, Se Young Chun, Mingoo Seok
Abstract: Data-driven visual-inertial odometry (VIO) has received highlights for its performance since VIOs are a crucial compartment in autonomous robots. However, their deployment on resource-constrained devices is non-trivial since large network parameters should be accommodated in the device memory. Furthermore, these networks may risk failure post-deployment due to environmental distribution shifts at test time. In light of this, we propose UL-VIO -- an ultra-lightweight (<1M) VIO network capable of test-time adaptation (TTA) based on visual-inertial consistency. Specifically, we perform model compression to the network while preserving the low-level encoder part, including all BatchNorm parameters for resource-efficient test-time adaptation. It achieves 36X smaller network size than state-of-the-art with a minute increase in error -- 1% on the KITTI dataset. For test-time adaptation, we propose to use the inertia-referred network outputs as pseudo labels and update the BatchNorm parameter for lightweight yet effective adaptation. To the best of our knowledge, this is the first work to perform noise-robust TTA on VIO. Experimental results on the KITTI, EuRoC, and Marulan datasets demonstrate the effectiveness of our resource-efficient adaptation method under diverse TTA scenarios with dynamic domain shifts.
Authors: Adrian Langley, Matthew Lonergan, Tao Huang, Mostafa Rahimi Azghadi
Abstract: Improving the automatic and timely recognition of construction and demolition waste (C&DW) composition is crucial for enhancing business returns, economic outcomes, and sustainability. Technologies like computer vision, artificial intelligence (AI), robotics, and internet of things (IoT) are increasingly integrated into waste processing to achieve these goals. While deep learning (DL) models show promise in recognising homogeneous C&DW piles, few studies assess their performance with mixed, highly contaminated material in commercial settings. Drawing on extensive experience at a C&DW materials recovery facility (MRF) in Sydney, Australia, we explore the challenges and opportunities in developing an advanced automated mixed C&DW management system. We begin with an overview of the evolution of waste management in the construction industry, highlighting its environmental, economic, and societal impacts. We review various C&DW analysis techniques, concluding that DL-based visual methods are the optimal solution. Additionally, we examine the progression of sensor and camera technologies for C&DW analysis as well as the evolution of DL algorithms focused on object detection and material segmentation. We also discuss C&DW datasets, their curation, and innovative methods for their creation. Finally, we share insights on C&DW visual analysis, addressing technical and commercial challenges, research trends, and future directions for mixed C&DW analysis. This paper aims to improve the efficiency of C&DW management by providing valuable insights for ongoing and future research and development efforts in this critical sector.
Authors: Chengkun Sun, Jinqian Pan, Russell Stevens Terry, Jiang Bian, Jie Xu
Abstract: Generative models can enhance discriminative classifiers by constructing complex feature spaces, thereby improving performance on intricate datasets. Conventional methods typically augment datasets with more detailed feature representations or increase dimensionality to make nonlinear data linearly separable. Utilizing a generative model solely for feature space processing falls short of unlocking its full potential within a classifier and typically lacks a solid theoretical foundation. We base our approach on a novel hypothesis: the probability information (logit) derived from a single model training can be used to generate the equivalent of multiple training sessions. Leveraging the central limit theorem, this synthesized probability information is anticipated to converge toward the true probability more accurately. To achieve this goal, we propose the Bernoulli-Gaussian Decision Block (BGDB), a novel module inspired by the Central Limit Theorem and the concept that the mean of multiple Bernoulli trials approximates the probability of success in a single trial. Specifically, we utilize Improved Denoising Diffusion Probabilistic Models (IDDPM) to model the probability of Bernoulli Trials. Our approach shifts the focus from reconstructing features to reconstructing logits, transforming the logit from a single iteration into logits analogous to those from multiple experiments. We provide the theoretical foundations of our approach through mathematical analysis and validate its effectiveness through experimental evaluation using various datasets for multiple imaging tasks, including both classification and segmentation.
Authors: Yingying Hua, Shiming Ge, Daichi Zhang
Abstract: Interpreting the predictions of a black-box deep network can facilitate the reliability of its deployment. In this work, we propose a re-label distillation approach to learn a direct map from the input to the prediction in a self-supervision manner. The image is projected into a VAE subspace to generate some synthetic images by randomly perturbing its latent vector. Then, these synthetic images can be annotated into one of two classes by identifying whether their labels shift. After that, using the labels annotated by the deep network as teacher, a linear student model is trained to approximate the annotations by mapping these synthetic images to the classes. In this manner, these re-labeled synthetic images can well describe the local classification mechanism of the deep network, and the learned student can provide a more intuitive explanation towards the predictions. Extensive experiments verify the effectiveness of our approach qualitatively and quantitatively.
Authors: Zhenrong Zhang, Shuhang Liu, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Yu Hu
Abstract: In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively comprehend the textual semantics within tables, particularly for descriptive textual cells. In this paper, we introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model. UniTabNet employs a ``divide-and-conquer'' strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure. We further enhance our framework with the Vision Guider, which directs the model's focus towards pertinent areas, thereby boosting prediction accuracy. Additionally, we introduce the Language Guider to refine the model's capability to understand textual semantics in table images. Evaluated on prominent table structure datasets such as PubTabNet, PubTables1M, WTW, and iFLYTAB, UniTabNet achieves a new state-of-the-art performance, demonstrating the efficacy of our approach. The code will also be made publicly available.
Authors: Yash Turkar, Timothy Chase Jr, Christo Aluckal, Karthik Dantu
Abstract: Accurate feature detection is fundamental for various computer vision tasks, including autonomous robotics, 3D reconstruction, medical imaging, and remote sensing. Despite advancements in enhancing the robustness of visual features, no existing method measures the utility of visual information before processing by specific feature-type algorithms. To address this gap, we introduce PIXER and the concept of "Featureness," which reflects the inherent interest and reliability of visual information for robust recognition, independent of any specific feature type. Leveraging a generalization on Bayesian learning, our approach quantifies both the probability and uncertainty of a pixel's contribution to robust visual utility in a single-shot process, avoiding costly operations such as Monte Carlo sampling and permitting customizable featureness definitions adaptable to a wide range of applications. We evaluate PIXER on visual odometry with featureness selectivity, achieving an average of 31% improvement in RMSE trajectory with 49% fewer features.
Authors: Chengkun Sun, Jinqian Pan, Juoli Jin, Russell Stevens Terry, Jiang Bian, Jie Xu
Abstract: Training deep Convolutional Neural Networks (CNNs) presents unique challenges, including the pervasive issue of elimination singularities, consistent deactivation of nodes leading to degenerate manifolds within the loss landscape. These singularities impede efficient learning by disrupting feature propagation. To mitigate this, we introduce Pool Skip, an architectural enhancement that strategically combines a Max Pooling, a Max Unpooling, a 3 times 3 convolution, and a skip connection. This configuration helps stabilize the training process and maintain feature integrity across layers. We also propose the Weight Inertia hypothesis, which underpins the development of Pool Skip, providing theoretical insights into mitigating degradation caused by elimination singularities through dimensional and affine compensation. We evaluate our method on a variety of benchmarks, focusing on both 2D natural and 3D medical imaging applications, including tasks such as classification and segmentation. Our findings highlight Pool Skip's effectiveness in facilitating more robust CNN training and improving model performance.
Authors: Haotian Bai, Yize Chen, Lin Wang
Abstract: Object-centric surface reconstruction from multi-view images is crucial in creating editable digital assets for AR/VR. Due to the lack of geometric constraints, existing methods, e.g., NeuS necessitate annotating the object masks to reconstruct compact surfaces in mesh processing. Mask annotation, however, incurs considerable labor costs due to its cumbersome nature. This paper presents Hi-NeuS, a novel rendering-based framework for neural implicit surface reconstruction, aiming to recover compact and precise surfaces without multi-view object masks. Our key insight is that the overlapping regions in the object-centric views naturally highlight the object of interest as the camera orbits around objects. The object of interest can be specified by estimating the distribution of the rendering weights accumulated from multiple views, which implicitly identifies the surface that a user intends to capture. This inspires us to design a geometric refinement approach, which takes multi-view rendering weights to guide the signed distance functions (SDF) of neural surfaces in a self-supervised manner. Specifically, it retains these weights to resample a pseudo surface based on their distribution. This facilitates the alignment of the SDF to the object of interest. We then regularize the SDF's bias for geometric consistency. Moreover, we propose to use unmasked Chamfer Distance(CD) to measure the extracted mesh without post-processing for more precise evaluation. Our approach has been validated through NeuS and its variant Neuralangelo, demonstrating its adaptability across different NeuS backbones. Extensive benchmark on the DTU dataset shows that our method reduces surface noise by about 20%, and improves the unmasked CD by around 30%, achieving better surface details. The superiority of Hi-NeuS is further validated on BlendedMVS and handheld camera captures for content creation.
Authors: Yuqi Cheng, Yunkang Cao, Guoyang Xie, Zhichao Lu, Weiming Shen
Abstract: Detecting anomalies within point clouds is crucial for various industrial applications, but traditional unsupervised methods face challenges due to data acquisition costs, early-stage production constraints, and limited generalization across product categories. To overcome these challenges, we introduce the Multi-View Projection (MVP) framework, leveraging pre-trained Vision-Language Models (VLMs) to detect anomalies. Specifically, MVP projects point cloud data into multi-view depth images, thereby translating point cloud anomaly detection into image anomaly detection. Following zero-shot image anomaly detection methods, pre-trained VLMs are utilized to detect anomalies on these depth images. Given that pre-trained VLMs are not inherently tailored for zero-shot point cloud anomaly detection and may lack specificity, we propose the integration of learnable visual and adaptive text prompting techniques to fine-tune these VLMs, thereby enhancing their detection performance. Extensive experiments on the MVTec 3D-AD and Real3D-AD demonstrate our proposed MVP framework's superior zero-shot anomaly detection performance and the prompting techniques' effectiveness. Real-world evaluations on automotive plastic part inspection further showcase that the proposed method can also be generalized to practical unseen scenarios. The code is available at https://github.com/hustCYQ/MVP-PCLIP.
Authors: Jiaxin Deng, Junbiao Pang, Baochang Zhang, Qingming Huang
Abstract: Sharpness-Aware Minimization (SAM) enhances generalization by reducing a Max-Sharpness (MaxS). Despite the practical success, we empirically found that the MAxS behind SAM's generalization enhancements face the "Flatness Indicator Problem" (FIP), where SAM only considers the flatness in the direction of gradient ascent, resulting in a next minimization region that is not sufficiently flat. A better Flatness Indicator (FI) would bring a better generalization of neural networks. Because SAM is a greedy search method in nature. In this paper, we propose to utilize the difference between the training loss and the minimum loss over the neighborhood surrounding the current weight, which we denote as Min-Sharpness (MinS). By merging MaxS and MinS, we created a better FI that indicates a flatter direction during the optimization. Specially, we combine this FI with SAM into the proposed Bilateral SAM (BSAM) which finds a more flatter minimum than that of SAM. The theoretical analysis proves that BSAM converges to local minima. Extensive experiments demonstrate that BSAM offers superior generalization performance and robustness compared to vanilla SAM across various tasks, i.e., classification, transfer learning, human pose estimation, and network quantization. Code is publicly available at: https://github.com/ajiaaa/BSAM.
Authors: Hao Cheng, Erjia Xiao, Chengyuan Yu, Zhao Yao, Jiahang Cao, Qiang Zhang, Jiaxu Wang, Mengshu Sun, Kaidi Xu, Jindong Gu, Renjing Xu
Abstract: Recently, driven by advancements in Multimodal Large Language Models (MLLMs), Vision Language Action Models (VLAMs) are being proposed to achieve better performance in open-vocabulary scenarios for robotic manipulation tasks. Since manipulation tasks involve direct interaction with the physical world, ensuring robustness and safety during the execution of this task is always a very critical issue. In this paper, by synthesizing current safety research on MLLMs and the specific application scenarios of the manipulation task in the physical world, we comprehensively evaluate VLAMs in the face of potential physical threats. Specifically, we propose the Physical Vulnerability Evaluating Pipeline (PVEP) that can incorporate as many visual modal physical threats as possible for evaluating the physical robustness of VLAMs. The physical threats in PVEP specifically include Out-of-Distribution, Typography-based Visual Prompt, and Adversarial Patch Attacks. By comparing the performance fluctuations of VLAMs before and after being attacked, we provide generalizable \textbf{\textit{Analyses}} of how VLAMs respond to different physical security threats.
Authors: Youngdong Jang, Hyunje Park, Feng Yang, Heeju Ko, Euijin Choo, Sangpil Kim
Abstract: Recently, 3D Gaussian splatting has been getting a lot of attention as an innovative method for representing 3D space due to rapid rendering and image quality. However, copyright protection for the 3D Gaussian splatting has not yet been introduced. In this paper, we present a novel watermarking method for 3D Gaussian splatting. The proposed method embeds a binary message into 3D Gaussians by fine-tuning the pre-trained 3D Gaussian splatting model. To achieve this, we present Frequency-Guided Densification (FGD) that utilizes Discrete Fourier Transform to find patches with high-frequencies and split 3D Gaussians based on 3D Gaussian Contribution Vector. It is each 3D Gaussian contribution to rendered pixel colors, improving both rendering quality and bit accuracy. Furthermore, we modify an adaptive gradient mask to enhance rendering quality. Our experiments show that our method can embed a watermark in 3D Gaussians imperceptibly with increased capacity and robustness against attacks. Our method reduces optimization cost and achieves state-of-the-art performance compared to other methods.
Authors: Sankaraganesh Jonna, Moushumi Medhi, Rajiv Ranjan Sahay
Abstract: Nowadays, the enhanced capabilities of in-expensive imaging devices have led to a tremendous increase in the acquisition and sharing of multimedia content over the Internet. Despite advances in imaging sensor technology, annoying conditions like \textit{occlusions} hamper photography and may deteriorate the performance of applications such as surveillance, detection, and recognition. Occlusion segmentation is difficult because of scale variations, illumination changes, and so on. Similarly, recovering a scene from foreground occlusions also poses significant challenges due to the complexity of accurately estimating the occluded regions and maintaining coherence with the surrounding context. In particular, image de-fencing presents its own set of challenges because of the diverse variations in shape, texture, color, patterns, and the often cluttered environment. This study focuses on the automatic detection and removal of occlusions from a single image. We propose a fully automatic, two-stage convolutional neural network for fence segmentation and occlusion completion. We leverage generative adversarial networks (GANs) to synthesize realistic content, including both structure and texture, in a single shot for inpainting. To assess zero-shot generalization, we evaluated our trained occlusion detection model on our proposed fence-like occlusion segmentation dataset. The dataset can be found on GitHub.
Authors: Mingdian Liu, Yilin Liu, Gurunandan Krishnan, Karl S Bayer, Bing Zhou
Abstract: The generation of humanoid animation from text prompts can profoundly impact animation production and AR/VR experiences. However, existing methods only generate body motion data, excluding facial expressions and hand movements. This limitation, primarily due to a lack of a comprehensive whole-body motion dataset, inhibits their readiness for production use. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts in the artificially augmented data or lower quality in the data extracted from RGB videos. In this work, we propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate Vector Quantized Variational AutoEncoders (VQ-VAEs) for body, hand, and face on respective high-quality data sources to ensure high-quality motion outputs, and a Multi-indexing Generative Pretrained Transformer (GPT) model with motion consistency loss for motion generation and coordination among different body parts. Our results show significant improvements over the baselines both quantitatively and qualitatively, demonstrating its robustness against the dataset limitations.
Authors: Sheng Shi, Xuyang Cao, Jun Zhao, Guoxin Wang
Abstract: In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. Notably, JoyHallo maintains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities. The code and models are available at https://jdh-algo.github.io/JoyHallo.
Authors: Zhongren Yao, Xiaobin Chang
Abstract: Exemplar-free class-incremental learning (EFCIL) presents a significant challenge as the old class samples are absent for new task learning. Due to the severe imbalance between old and new class samples, the learned classifiers can be easily biased toward the new ones. Moreover, continually updating the feature extractor under EFCIL can compromise the discriminative power of old class features, e.g., leading to less compact and more overlapping distributions across classes. Existing methods mainly focus on handling biased classifier learning. In this work, both cases are considered using the proposed method. Specifically, we first introduce a Distribution-Based Global Classifier (DBGC) to avoid bias factors in existing methods, such as data imbalance and sampling. More importantly, the compromised distributions of old classes are simulated via a simple operation, variance enlarging (VE). Incorporating VE based on DBGC results in a novel classification loss for EFCIL. This loss is proven equivalent to an Adaptive Margin Softmax Cross Entropy (AMarX). The proposed method is thus called Adaptive Margin Global Classifier (AMGC). AMGC is simple yet effective. Extensive experiments show that AMGC achieves superior image classification results on its own under a challenging EFCIL setting. Detailed analysis is also provided for further demonstration.
Authors: Alessandro Riva, Alessandro Raganato, Simone Melzi
Abstract: Current data-driven methodologies for point cloud matching demand extensive training time and computational resources, presenting significant challenges for model deployment and application. In the point cloud matching task, recent advancements with an encoder-only Transformer architecture have revealed the emergence of semantically meaningful patterns in the attention heads, particularly resembling Gaussian functions centered on each point of the input shape. In this work, we further investigate this phenomenon by integrating these patterns as fixed attention weights within the attention heads of the Transformer architecture. We evaluate two variants: one utilizing predetermined variance values for the Gaussians, and another where the variance values are treated as learnable parameters. Additionally we analyze the performances on noisy data and explore a possible way to improve robustness to noise. Our findings demonstrate that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization. Furthermore, we conducted an ablation study to identify the specific layers where the infused information is most impactful and to understand the reliance of the network on this information.
Authors: Qiulei Dong, Jianan Li, Shuang Deng
Abstract: With the development of 3D and 2D data acquisition techniques, it has become easy to obtain point clouds and images of scenes simultaneously, which further facilitates dual-modal semantic segmentation. Most existing methods for simultaneously segmenting point clouds and images rely heavily on the quantity and quality of the labeled training data. However, massive point-wise and pixel-wise labeling procedures are time-consuming and labor-intensive. To address this issue, we propose a parallel dual-stream network to handle the semi-supervised dual-modal semantic segmentation task, called PD-Net, by jointly utilizing a small number of labeled point clouds, a large number of unlabeled point clouds, and unlabeled images. The proposed PD-Net consists of two parallel streams (called original stream and pseudo-label prediction stream). The pseudo-label prediction stream predicts the pseudo labels of unlabeled point clouds and their corresponding images. Then, the unlabeled data is sent to the original stream for self-training. Each stream contains two encoder-decoder branches for 3D and 2D data respectively. In each stream, multiple dual-modal fusion modules are explored for fusing the dual-modal features. In addition, a pseudo-label optimization module is explored to optimize the pseudo labels output by the pseudo-label prediction stream. Experimental results on two public datasets demonstrate that the proposed PD-Net not only outperforms the comparative semi-supervised methods but also achieves competitive performances with some fully-supervised methods in most cases.
Authors: Sandeep Khanna, Chiranjoy Chattopadhyay, Suman Kundu
Abstract: Automating the detection of fruits and vegetables using computer vision is essential for modernizing agriculture, improving efficiency, ensuring food quality, and contributing to technologically advanced and sustainable farming practices. This paper presents an end-to-end pipeline for detecting and localizing fruits and vegetables in real-world scenarios. To achieve this, we have curated a dataset named FRUVEG67 that includes images of 67 classes of fruits and vegetables captured in unconstrained scenarios, with only a few manually annotated samples per class. We have developed a semi-supervised data annotation algorithm (SSDA) that generates bounding boxes for objects to label the remaining non-annotated images. For detection, we introduce the Fruit and Vegetable Detection Network (FVDNet), an ensemble version of YOLOv7 featuring three distinct grid configurations. We employ an averaging approach for bounding-box prediction and a voting mechanism for class prediction. We have integrated Jensen-Shannon divergence (JSD) in conjunction with focal loss to better detect smaller objects. Our experimental results highlight the superiority of FVDNet compared to previous versions of YOLO, showcasing remarkable improvements in detection and localization performance. We achieved an impressive mean average precision (mAP) score of 0.78 across all classes. Furthermore, we evaluated the efficacy of FVDNet using open-category refrigerator images, where it demonstrates promising results.
Authors: Yi Ren, Tianyi Zhang, Zhixiong Han, Weibin Li, Zhiyang Wang, Wenbo Ji, Chenhao Qin, Chenbin Liang, Licheng Jiao
Abstract: We propose an adaptive fine-tuning algorithm for multimodal large models. The core steps of this algorithm involve two stages of truncation. First, the vast amount of data is projected into a semantic vector space, and the MiniBatchKMeans algorithm is used for automated clustering. This classification ensures that the data within each cluster exhibit high semantic similarity. Next, we process the data in each cluster, calculating the translational difference between the original and perturbed data in the multimodal large model's vector space. This difference serves as a generalization metric for the data. Based on this metric, we select the data with high generalization potential for training. We applied this algorithm to train the InternLM-XComposer2-VL-7B model on two 3090 GPUs using one-third of the GeoChat multimodal remote sensing dataset. The results demonstrate that our algorithm outperforms the state-of-the-art baselines. various baselines. The model trained on our optimally chosen one-third dataset, based on experimental validation, exhibited only 1% reduction in performance across various remote sensing metrics compared to the model trained on the full dataset. This approach significantly preserved general-purpose capabilities while reducing training time by 68.2%. Furthermore, the model achieved scores of 89.86 and 77.19 on the UCMerced and AID evaluation datasets, respectively, surpassing the GeoChat dataset by 5.43 and 5.16 points. It only showed a 0.91-point average decrease on the LRBEN evaluation dataset.
Authors: Zecheng He, Bo Sun, Felix Juefei-Xu, Haoyu Ma, Ankit Ramchandani, Vincent Cheung, Siddharth Shah, Anmol Kalia, Harihar Subramanyam, Alireza Zareian, Li Chen, Ankit Jain, Ning Zhang, Peizhao Zhang, Roshan Sumbaly, Peter Vajda, Animesh Sinha
Abstract: Diffusion models have demonstrated remarkable efficacy across various image-to-image tasks. In this research, we introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. Unlike conventional tuning-based personalization techniques, Imagine yourself operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Moreover, previous work met challenges balancing identity preservation, following complex prompts and preserving good visual quality, resulting in models having strong copy-paste effect of the reference images. Thus, they can hardly generate images following prompts that require significant changes to the reference image, \eg, changing facial expression, head and body poses, and the diversity of the generated images is low. To address these limitations, our proposed method introduces 1) a new synthetic paired data generation mechanism to encourage image diversity, 2) a fully parallel attention architecture with three text encoders and a fully trainable vision encoder to improve the text faithfulness, and 3) a novel coarse-to-fine multi-stage finetuning methodology that gradually pushes the boundary of visual quality. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment. This model establishes a robust foundation for various personalization applications. Human evaluation results validate the model's SOTA superiority across all aspects (identity preservation, text faithfulness, and visual appeal) compared to the previous personalization models.
Authors: Zuomin Qu, Wei Lu, Xiangyang Luo, Qian Wang, Xiaochun Cao
Abstract: The misuse of deep learning-based facial manipulation poses a potential threat to civil rights. To prevent this fraud at its source, proactive defense technology was proposed to disrupt the manipulation process by adding invisible adversarial perturbations into images, making the forged output unconvincing to the observer. However, their non-directional disruption of the output may result in the retention of identity information of the person in the image, leading to stigmatization of the individual. In this paper, we propose a novel universal framework for combating facial manipulation, called ID-Guard. Specifically, this framework requires only a single forward pass of an encoder-decoder network to generate a cross-model universal adversarial perturbation corresponding to a specific facial image. To ensure anonymity in manipulated facial images, a novel Identity Destruction Module (IDM) is introduced to destroy the identifiable information in forged faces targetedly. Additionally, we optimize the perturbations produced by considering the disruption towards different facial manipulations as a multi-task learning problem and design a dynamic weights strategy to improve cross-model performance. The proposed framework reports impressive results in defending against multiple widely used facial manipulations, effectively distorting the identifiable regions in the manipulated facial images. In addition, our experiments reveal the ID-Guard's ability to enable disrupted images to avoid face inpaintings and open-source image recognition systems.
Authors: Wenhui Diao, Haichen Yu, Kaiyue Kang, Tong Ling, Di Liu, Yingchao Feng, Hanbo Bi, Libo Ren, Xuexue Li, Yongqiang Mao, Xian Sun
Abstract: Aerial Remote Sensing (ARS) vision tasks pose significant challenges due to the unique characteristics of their viewing angles. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision. By introducing the Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism and an affine transformation-based contrastive learning pre-training method, the model's detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model's adaptability and effectiveness in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and effectiveness of RingMo-Aerial in enhancing the performance of ARS vision tasks.
Authors: Pawel Tomasz Pieta, Anders Bjorholm Dahl, Jeppe Revall Frisvad, Siavash Arjomand Bigdeli, Anders Nymark Christensen
Abstract: The structure tensor method is often used for 2D and 3D analysis of imaged structures, but its results are in many cases very dependent on the user's choice of method parameters. We simplify this parameter choice in first order structure tensor scale-space by directly connecting the width of the derivative filter to the size of image features. By introducing a ring-filter step, we substitute the Gaussian integration/smoothing with a method that more accurately shifts the derivative filter response from feature edges to their center. We further demonstrate how extracted structural measures can be used to correct known inaccuracies in the scale map, resulting in a reliable representation of the feature sizes both in 2D and 3D. Compared to the traditional first order structure tensor, or previous structure tensor scale-space approaches, our solution is much more accurate and can serve as an out-of-the-box method for extracting a wide range of structural parameters with minimal user input.
Authors: Zixin Zhang, Kanghao Chen, Lin Wang
Abstract: Event cameras are bio-inspired sensors that output asynchronous and sparse event streams, instead of fixed frames. Benefiting from their distinct advantages, such as high dynamic range and high temporal resolution, event cameras have been applied to address 3D reconstruction, important for robotic mapping. Recently, neural rendering techniques, such as 3D Gaussian splatting (3DGS), have been shown successful in 3D reconstruction. However, it still remains under-explored how to develop an effective event-based 3DGS pipeline. In particular, as 3DGS typically depends on high-quality initialization and dense multiview constraints, a potential problem appears for the 3DGS optimization with events given its inherent sparse property. To this end, we propose a novel event-based 3DGS framework, named Elite-EvGS. Our key idea is to distill the prior knowledge from the off-the-shelf event-to-video (E2V) models to effectively reconstruct 3D scenes from events in a coarse-to-fine optimization manner. Specifically, to address the complexity of 3DGS initialization from events, we introduce a novel warm-up initialization strategy that optimizes a coarse 3DGS from the frames generated by E2V models and then incorporates events to refine the details. Then, we propose a progressive event supervision strategy that employs the window-slicing operation to progressively reduce the number of events used for supervision. This subtly relives the temporal randomness of the event frames, benefiting the optimization of local textural and global structural details. Experiments on the benchmark datasets demonstrate that Elite-EvGS can reconstruct 3D scenes with better textural and structural details. Meanwhile, our method yields plausible performance on the captured real-world data, including diverse challenging conditions, such as fast motion and low light scenes.
Authors: Nanqing Liu, Xun Xu, Yongyi Su, Haojie Zhang, Heng-Chao Li
Abstract: Segment Anything Model (SAM) is an advanced foundational model for image segmentation, widely applied to remote sensing images (RSIs). Due to the domain gap between RSIs and natural images, traditional methods typically use SAM as a source pre-trained model and fine-tune it with fully supervised masks. Unlike these methods, our work focuses on fine-tuning SAM using more convenient and challenging point annotations. Leveraging SAM's zero-shot capabilities, we adopt a self-training framework that iteratively generates pseudo-labels for training. However, if the pseudo-labels contain noisy labels, there is a risk of error accumulation. To address this issue, we extract target prototypes from the target dataset and use the Hungarian algorithm to match them with prediction prototypes, preventing the model from learning in the wrong direction. Additionally, due to the complex backgrounds and dense distribution of objects in RSI, using point prompts may result in multiple objects being recognized as one. To solve this problem, we propose a negative prompt calibration method based on the non-overlapping nature of instance masks. In brief, we use the prompts of overlapping masks as corresponding negative signals, resulting in refined masks. Combining the above methods, we propose a novel Pointly-supervised Segment Anything Model named PointSAM. We conduct experiments on RSI datasets, including WHU, HRSID, and NWPU VHR-10, and the results show that our method significantly outperforms direct testing with SAM, SAM2, and other comparison methods. Furthermore, we introduce PointSAM as a point-to-box converter and achieve encouraging results, suggesting that this method can be extended to other point-supervised tasks. The code is available at https://github.com/Lans1ng/PointSAM.
Authors: Venkat Karramreddy, Liam Mitchell
Abstract: This article presents an innovative study in exploring, evaluating, and implementing deep learning architectures for the calibration of multi-modal sensor systems. The focus behind this is to leverage the use of sensor fusion to achieve dynamic, real-time alignment between 3D LiDAR and 2D Camera sensors. static calibration methods are tedious and time-consuming, which is why we propose utilizing Conventional Neural Networks (CNN) coupled with geometrically informed learning to solve this issue. We leverage the foundational principles of Extrinsic LiDAR-Camera Calibration tools such as RegNet, CalibNet, and LCCNet by exploring open-source models that are available online and comparing our results with their corresponding research papers. Requirements for extracting these visual and measurable outputs involved tweaking source code, fine-tuning, training, validation, and testing for each of these frameworks for equal comparisons. This approach aims to investigate which of these advanced networks produces the most accurate and consistent predictions. Through a series of experiments, we reveal some of their shortcomings and areas for potential improvements along the way. We find that LCCNet yields the best results out of all the models that we validated.
Authors: Li Zhou, Xu Yuan, Zenghui Sun, Zikun Zhou, Jingsong Lan
Abstract: Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest developments in LMMs demonstrate the ability to generate dense pixel-wise segmentation through the integration of segmentation models.Despite the innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even provided with detailed textual cues.To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name such a new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation over the MGSC task, we establish a benchmark with aligned masks and captions in multi-granularity using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research.Besides, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation tasks. The great performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.
Authors: Ruben Gonzalez-Perez, Francisco Lopez-Tiro, Ivan Reyes-Amezcua, Eduardo Falcon-Morales, Rosa-Maria Rodriguez-Gueant, Jacques Hubert, Michel Daudon, Gilberto Ochoa-Ruiz, Christian Daul
Abstract: Currently, the Morpho-Constitutional Analysis (MCA) is the de facto approach for the etiological diagnosis of kidney stone formation, and it is an important step for establishing personalized treatment to avoid relapses. More recently, research has focused on performing such tasks intra-operatively, an approach known as Endoscopic Stone Recognition (ESR). Both methods rely on features observed in the surface and the section of kidney stones to separate the analyzed samples into several sub-groups. However, given the high intra-observer variability and the complex operating conditions found in ESR, there is a lot of interest in using AI for computer-aided diagnosis. However, current AI models require large datasets to attain a good performance and for generalizing to unseen distributions. This is a major problem as large labeled datasets are very difficult to acquire, and some classes of kidney stones are very rare. Thus, in this paper, we present a method based on diffusion as a way of augmenting pre-existing ex-vivo kidney stone datasets. Our aim is to create plausible diverse kidney stone images that can be used for pre-training models using ex-vivo data. We show that by mixing natural and synthetic images of CCD images, it is possible to train models capable of performing very well on unseen intra-operative data. Our results show that is possible to attain an improvement of 10% in terms of accuracy compared to a baseline model pre-trained only on ImageNet. Moreover, our results show an improvement of 6% for surface images and 10% for section images compared to a model train on CCD images only, which demonstrates the effectiveness of using synthetic images.
Authors: Jintao Ren, Muheng Li, Stine Sofia Korreman
Abstract: This report presents a normalization block for automated tumor segmentation in CT/PET scans, developed for the autoPET III Challenge. The key innovation is the introduction of the SineNormal, which applies periodic sine transformations to PET data to enhance lesion detection. By highlighting intensity variations and producing concentric ring patterns in PET highlighted regions, the model aims to improve segmentation accuracy, particularly for challenging multitracer PET datasets. The code for this project is available on GitHub (https://github.com/BBQtime/Sine-Wave-Normalization-for-Deep-Learning-Based-Tumor-Segmentation-in-CT-PET).
Authors: Jisung Hwang, Minhyuk Sung
Abstract: We introduce a dual contouring method that provides state-of-the-art performance for occupancy functions while achieving computation times of a few seconds. Our method is learning-free and carefully designed to maximize the use of GPU parallelization. The recent surge of implicit neural representations has led to significant attention to occupancy fields, resulting in a wide range of 3D reconstruction and generation methods based on them. However, the outputs of such methods have been underestimated due to the bottleneck in converting the resulting occupancy function to a mesh. Marching Cubes tends to produce staircase-like artifacts, and most subsequent works focusing on exploiting signed distance functions as input also yield suboptimal results for occupancy functions. Based on Manifold Dual Contouring (MDC), we propose Occupancy-Based Dual Contouring (ODC), which mainly modifies the computation of grid edge points (1D points) and grid cell points (3D points) to not use any distance information. We introduce auxiliary 2D points that are used to compute local surface normals along with the 1D points, helping identify 3D points via the quadric error function. To search the 1D, 2D, and 3D points, we develop fast algorithms that are parallelizable across all grid edges, faces, and cells. Our experiments with several 3D neural generative models and a 3D mesh dataset demonstrate that our method achieves the best fidelity compared to prior works.
Authors: Vladimir Guzov, Yifeng Jiang, Fangzhou Hong, Gerard Pons-Moll, Richard Newcombe, C. Karen Liu, Yuting Ye, Lingni Ma
Abstract: This paper investigates the online generation of realistic full-body human motion using a single head-mounted device with an outward-facing color camera and the ability to perform visual SLAM. Given the inherent ambiguity of this setup, we introduce a novel system, HMD$^2$, designed to balance between motion reconstruction and generation. From a reconstruction standpoint, our system aims to maximally utilize the camera streams to produce both analytical and learned features, including head motion, SLAM point cloud, and image embeddings. On the generative front, HMD$^2$ employs a multi-modal conditional motion Diffusion model, incorporating a time-series backbone to maintain temporal coherence in generated motions, and utilizes autoregressive in-painting to facilitate online motion inference with minimal latency (0.17 seconds). Collectively, we demonstrate that our system offers a highly effective and robust solution capable of scaling to an extensive dataset of over 200 hours collected in a wide range of complex indoor and outdoor environments using publicly available smart glasses.
Authors: Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, Hang Zhao
Abstract: Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at \url{https://github.com/Tsinghua-MARS-Lab/CVT-Occ}.
Authors: Zixiao Wang, Hongtao Xie, YuXin Wang, Yadong Qu, Fengjun Guo, Pengwei Liu
Abstract: Existing scene text removal (STR) task suffers from insufficient training data due to the expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding box). Different from previous pretraining methods that use indirect auxiliary tasks only to enhance the implicit feature extraction ability, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance only with the public text detection datasets, which greatly alleviates the limitation of the high-cost STR labels. Experiments demonstrate that our method outperforms other pretrain methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
Authors: Jordi Mal\'e, Juan Fortea, Mateus Rozalem Aranha, Yann Heuz\'e, Neus Mart\'inez-Abad\'ias, Xavier Sevillano
Abstract: Brain imaging has allowed neuroscientists to analyze brain morphology in genetic and neurodevelopmental disorders, such as Down syndrome, pinpointing regions of interest to unravel the neuroanatomical underpinnings of cognitive impairment and memory deficits. However, the connections between brain anatomy, cognitive performance and comorbidities like Alzheimer's disease are still poorly understood in the Down syndrome population. The latest advances in artificial intelligence constitute an opportunity for developing automatic tools to analyze large volumes of brain magnetic resonance imaging scans, overcoming the bottleneck of manual analysis. In this study, we propose the use of generative models for detecting brain alterations in people with Down syndrome affected by various degrees of neurodegeneration caused by Alzheimer's disease. To that end, we evaluate state-of-the-art brain anomaly detection models based on Variational Autoencoders and Diffusion Models, leveraging a proprietary dataset of brain magnetic resonance imaging scans. Following a comprehensive evaluation process, our study includes several key analyses. First, we conducted a qualitative evaluation by expert neuroradiologists. Second, we performed both quantitative and qualitative reconstruction fidelity studies for the generative models. Third, we carried out an ablation study to examine how the incorporation of histogram post-processing can enhance model performance. Finally, we executed a quantitative volumetric analysis of subcortical structures. Our findings indicate that some models effectively detect the primary alterations characterizing Down syndrome's brain anatomy, including a smaller cerebellum, enlarged ventricles, and cerebral cortex reduction, as well as the parietal lobe alterations caused by Alzheimer's disease.
Authors: Jae Hee Lee, Georgii Mikriukov, Gesina Schwalbe, Stefan Wermter, Diedrich Wolter
Abstract: Concept-based XAI (C-XAI) approaches to explaining neural vision models are a promising field of research, since explanations that refer to concepts (i.e., semantically meaningful parts in an image) are intuitive to understand and go beyond saliency-based techniques that only reveal relevant regions. Given the remarkable progress in this field in recent years, it is time for the community to take a critical look at the advances and trends. Consequently, this paper reviews C-XAI methods to identify interesting and underexplored areas and proposes future research directions. To this end, we consider three main directions: the choice of concepts to explain, the choice of concept representation, and how we can control concepts. For the latter, we propose techniques and draw inspiration from the field of knowledge representation and learning, showing how this could enrich future C-XAI research.
Authors: Guibiao Liao, Wei Gao
Abstract: Salient object detection (SOD) has achieved substantial progress in recent years. In practical scenarios, compressed images (CI) serve as the primary medium for data transmission and storage. However, scant attention has been directed towards SOD for compressed images using convolutional neural networks (CNNs). In this paper, we are dedicated to strictly benchmarking and analyzing CNN-based salient object detection on compressed images. To comprehensively study this issue, we meticulously establish various CI SOD datasets from existing public SOD datasets. Subsequently, we investigate representative CNN-based SOD methods, assessing their robustness on compressed images (approximately 2.64 million images). Importantly, our evaluation results reveal two key findings: 1) current state-of-the-art CNN-based SOD models, while excelling on clean images, exhibit significant performance bottlenecks when applied to compressed images. 2) The principal factors influencing the robustness of CI SOD are rooted in the characteristics of compressed images and the limitations in saliency feature learning. Based on these observations, we propose a simple yet promising baseline framework that focuses on robust feature representation learning to achieve robust CNN-based CI SOD. Extensive experiments demonstrate the effectiveness of our approach, showcasing markedly improved robustness across various levels of image degradation, while maintaining competitive accuracy on clean data. We hope that our benchmarking efforts, analytical insights, and proposed techniques will contribute to a more comprehensive understanding of the robustness of CNN-based SOD algorithms, inspiring future research in the community.
Authors: Jicheol Park, Dongwon Kim, Boseung Jeong, Suha Kwak
Abstract: Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing methods often struggle with part feature extraction and alignment due to the lack of direct part-level supervision and reliance on heuristic features. We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities, enhancing interpretability and retrieval accuracy without explicit part-level correspondence supervision. Additionally, text-based dynamic part attention adjusts the importance of each part, further improving retrieval outcomes. Our method is evaluated on three public benchmarks, significantly outperforming existing methods.
Authors: Ling Wang, Chen Wu, Lin Wang
Abstract: Autonomous vehicles and robots often struggle with reliable visual perception at night due to the low illumination and motion blur caused by the long exposure time of RGB cameras. Existing methods address this challenge by sequentially connecting the off-the-shelf pretrained low-light enhancement and deblurring models. Unfortunately, these methods often lead to noticeable artifacts (\eg, color distortions) in the over-exposed regions or make it hardly possible to learn the motion cues of the dark regions. In this paper, we interestingly find vision-language models, \eg, Contrastive Language-Image Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which can jointly achieve low-light enhancement and deblurring, benefiting downstream tasks, such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels from images at night. This subtly enables learning rich semantic information and visual representation for optimization of the joint tasks. To achieve this, we first introduce a CLIP-guided cross-fusion module to obtain multi-scale patch-wise degradation heatmaps from the image embeddings. Then, the heatmaps are fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, our DAP-LED achieves state-of-the-art performance in the dark. Meanwhile, the enhanced results are demonstrated to be effective for three downstream tasks. For demo and more results, please check the project page: \url{https://vlislab22.github.io/dap-led/}.
Authors: Morris Florek, David Tschirschwitz, Bj\"orn Barz, Volker Rodehorst
Abstract: Current image retrieval systems often face domain specificity and generalization issues. This study aims to overcome these limitations by developing a computationally efficient training framework for a universal feature extractor that provides strong semantic image representations across various domains. To this end, we curated a multi-domain training dataset, called M4D-35k, which allows for resource-efficient training. Additionally, we conduct an extensive evaluation and comparison of various state-of-the-art visual-semantic foundation models and margin-based metric learning loss functions regarding their suitability for efficient universal feature extraction. Despite constrained computational resources, we achieve near state-of-the-art results on the Google Universal Image Embedding Challenge, with a mMP@5 of 0.721. This places our method at the second rank on the leaderboard, just 0.7 percentage points behind the best performing method. However, our model has 32% fewer overall parameters and 289 times fewer trainable parameters. Compared to methods with similar computational requirements, we outperform the previous state of the art by 3.3 percentage points. We release our code and M4D-35k training set annotations at https://github.com/morrisfl/UniFEx.
Authors: Avi Deb Raha, Apurba Adhikary, Mrityunjoy Gain, Yu Qiao, Choong Seon Hong
Abstract: In this study, we explore the efficacy of advanced pre-trained architectures, such as Vision Transformers (ViT), ConvNeXt, and Swin Transformers in enhancing Federated Domain Generalization. These architectures capture global contextual features and model long-range dependencies, making them promising candidates for improving cross-domain generalization. We conduct a broad study with in-depth analysis and systematically evaluate different variants of these architectures, using extensive pre-training datasets such as ImageNet-1K, ImageNet-21K, JFT-300M, and ImageNet-22K. Additionally, we compare self-supervised and supervised pre-training strategies to assess their impact on FDG performance. Our findings suggest that self-supervised techniques, which focus on reconstructing masked image patches, can better capture the intrinsic structure of images, thereby outperforming their supervised counterparts. Comprehensive evaluations on the Office-Home and PACS datasets demonstrate that adopting advanced architectures pre-trained on larger datasets establishes new benchmarks, achieving average accuracies of 84.46\% and 92.55\%, respectively. Additionally, we observe that certain variants of these advanced models, despite having fewer parameters, outperform larger ResNet models. This highlights the critical role of utilizing sophisticated architectures and diverse pre-training strategies to enhance FDG performance, especially in scenarios with limited computational resources where model efficiency is crucial. Our results indicate that federated learning systems can become more adaptable and efficient by leveraging these advanced methods, offering valuable insights for future research in FDG.
Authors: Ryosuke Yamada, Kensho Hara, Hirokatsu Kataoka, Koshi Makihara, Nakamasa Inoue, Rio Yokota, Yutaka Satoh
Abstract: Throughout the history of computer vision, while research has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds on a unified transformer model. This approach integrates the modality-specific properties of images and point clouds and achieves fundamental downstream tasks in image and 3D object recognition on a unified transformer model by learning visual-geometric representations. In this work, we introduce Formula-Supervised Visual-Geometric Pre-training (FSVGP), a novel synthetic pre-training method that automatically generates aligned synthetic images and point clouds from mathematical formulas. Through cross-modality supervision, we enable supervised pre-training between visual and geometric modalities. FSVGP also reduces reliance on real data collection, cross-modality alignment, and human annotation. Our experimental results show that FSVGP pre-trains more effectively than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation. These achievements demonstrate FSVGP's superior generalization in image and 3D object recognition and underscore the potential of synthetic pre-training in visual-geometric representation learning. Our project website is available at https://ryosuke-yamada.github.io/fdsl-fsvgp/.
Authors: Yingzhe Peng, Yixiao Yuan, Zitian Ao, Huapeng Zhou, Kangqi Wang, Qipeng Zhu, Xu Yang
Abstract: In this report, we present our first-place solution to the Multiple-choice Video Question Answering (QA) track of The Second Perception Test Challenge. This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content. To address this challenge, we leveraged the powerful QwenVL2 (7B) model and fine-tune it on the provided training set. Additionally, we employed model ensemble strategies and Test Time Augmentation to boost performance. Through continuous optimization, our approach achieved a Top-1 Accuracy of 0.7647 on the leaderboard.
Authors: Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
Abstract: Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: https://arcana-project-page.github.io
Authors: Alessandro Cacciatore, Valerio Morelli, Federica Paganica, Emanuele Frontoni, Lucia Migliorelli, Daniele Berardini
Abstract: Deep learning has long been dominated by multi-layer perceptrons (MLPs), which have demonstrated superiority over other optimizable models in various domains. Recently, a new alternative to MLPs has emerged - Kolmogorov-Arnold Networks (KAN)- which are based on a fundamentally different mathematical framework. According to their authors, KANs address several major issues in MLPs, such as catastrophic forgetting in continual learning scenarios. However, this claim has only been supported by results from a regression task on a toy 1D dataset. In this paper, we extend the investigation by evaluating the performance of KANs in continual learning tasks within computer vision, specifically using the MNIST datasets. To this end, we conduct a structured analysis of the behavior of MLPs and two KAN-based models in a class-incremental learning scenario, ensuring that the architectures involved have the same number of trainable parameters. Our results demonstrate that an efficient version of KAN outperforms both traditional MLPs and the original KAN implementation. We further analyze the influence of hyperparameters in MLPs and KANs, as well as the impact of certain trainable parameters in KANs, such as bias and scale weights. Additionally, we provide a preliminary investigation of recent KAN-based convolutional networks and compare their performance with that of traditional convolutional neural networks. Our codes can be found at https://github.com/MrPio/KAN-Continual_Learning_tests.
URLs: https://github.com/MrPio/KAN-Continual_Learning_tests.
Authors: Ziyuan Yang, Ming Yan, Yingyu Chen, Hui Wang, Zexin Lu, Yi Zhang
Abstract: The surge of hate speech on social media platforms poses a significant challenge, with hate speech detection~(HSD) becoming increasingly critical. Current HSD methods focus on enriching contextual information to enhance detection performance, but they overlook the inherent uncertainty of hate speech. We propose a novel HSD method, named trustworthy hate speech detection method through visual augmentation (TrusV-HSD), which enhances semantic information through integration with diffused visual images and mitigates uncertainty with trustworthy loss. TrusV-HSD learns semantic representations by effectively extracting trustworthy information through multi-modal connections without paired data. Our experiments on public HSD datasets demonstrate the effectiveness of TrusV-HSD, showing remarkable improvements over conventional methods.
Authors: Athanasios Karagounis
Abstract: This paper presents a novel approach for deep visualization via a generative network, offering an improvement over existing methods. Our model simplifies the architecture by reducing the number of networks used, requiring only a generator and a discriminator, as opposed to the multiple networks traditionally involved. Additionally, our model requires less prior training knowledge and uses a non-adversarial training process, where the discriminator acts as a guide rather than a competitor to the generator. The core contribution of this work is its ability to generate detailed visualization images that align with specific class labels. Our model incorporates a unique skip-connection-inspired block design, which enhances label-directed image generation by propagating class information across multiple layers. Furthermore, we explore how these generated visualizations can be utilized as adversarial examples, effectively fooling classification networks with minimal perceptible modifications to the original images. Experimental results demonstrate that our method outperforms traditional adversarial example generation techniques in both targeted and non-targeted attacks, achieving up to a 94.5% fooling rate with minimal perturbation. This work bridges the gap between visualization methods and adversarial examples, proposing that fooling rate could serve as a quantitative measure for evaluating visualization quality. The insights from this study provide a new perspective on the interpretability of neural networks and their vulnerabilities to adversarial attacks.
Authors: Foivos I. Diakogiannis, Zheng-Shu Zhou, Jeff Wang, Gonzalo Mata, Dave Henry, Roger Lawes, Amy Parker, Peter Caccetta, Rodrigo Ibata, Ondrej Hlinka, Jonathan Richetti, Kathryn Batchelor, Chris Herrmann, Andrew Toovey, John Taylor
Abstract: Accurate field boundary delineation is a critical challenge in digital agriculture, impacting everything from crop monitoring to resource management. Existing methods often struggle with noise and fail to generalize across varied landscapes, particularly when dealing with cloud cover in optical remote sensing. In response, this study presents a new approach that leverages time series data from Sentinel-2 (S2) and Sentinel-1 (S1) imagery to improve performance under diverse cloud conditions, without the need for manual cloud filtering. We introduce a 3D Vision Transformer architecture specifically designed for satellite image time series, incorporating a memory-efficient attention mechanism. Two models are proposed: PTAViT3D, which handles either S2 or S1 data independently, and PTAViT3D-CA, which fuses both datasets to enhance accuracy. Both models are evaluated under sparse and dense cloud coverage by exploiting spatio-temporal correlations. Our results demonstrate that the models can effectively delineate field boundaries, even with partial (S2 or S2 and S1 data fusion) or dense cloud cover (S1), with the S1-based model providing performance comparable to S2 imagery in terms of spatial resolution. A key strength of this approach lies in its capacity to directly process cloud-contaminated imagery by leveraging spatio-temporal correlations in a memory-efficient manner. This methodology, used in the ePaddocks product to map Australia's national field boundaries, offers a robust, scalable solution adaptable to varying agricultural environments, delivering precision and reliability where existing methods falter. Our code is available at https://github.com/feevos/tfcl.
Authors: Xingtao Lin, Heqian Qiu, Lanxiao Wang, RUihang Wang, Linfeng XU, Hongliang Li
Abstract: Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-trained (CLIP) for downstream tasks such as scene text detection. Typically, text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, leading to fine-grained text being ignored in task of scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, where region text prompt proposed would help focus on fine-grained features. Region prompt tuning method decomposes region text prompt into individual characters and splits visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows a character matches the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a sharing position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target ``text''. To refine the information at fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that could balance the global and local features and be fed into DBNet to detect the text. Experiments on benchmarks like ICDAR2015, TotalText, and CTW1500 demonstrate RPT impressive performance, underscoring its effectiveness for scene text detection.
Authors: Xuan Gao, Haiyao Xiao, Chenglai Zhong, Shimin Hu, Yudong Guo, Juyong Zhang
Abstract: We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts. Traditional portrait video editing methods often struggle with 3D and temporal consistency, and typically lack in rendering quality and efficiency. To address these issues, we lift the portrait video frames to a unified dynamic 3D Gaussian field, which ensures structural and temporal coherence across frames. Furthermore, we design a novel Neural Gaussian Texture mechanism that not only enables sophisticated style editing but also achieves rendering speed over 100FPS. Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models. Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative dataset updates. Extensive experiments demonstrate the temporal consistency, editing efficiency, and superior rendering quality of our method. The broad applicability of the proposed approach is demonstrated through various applications, including text-driven editing, image-driven editing, and relighting, highlighting its great potential to advance the field of video editing. Demo videos and released code are provided in our project page: https://ustc3dv.github.io/PortraitGen/
Authors: Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, Niloy Ganguly
Abstract: Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot Settings w.r.t both automated as well as human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.
Authors: Eirini Cholopoulou, Dimitris K. Iakovidis
Abstract: Anomaly detection (AD) plays a pivotal role in multimedia applications for detecting defective products and automating quality inspection. Deep learning (DL) models typically require large-scale annotated data, which are often highly imbalanced since anomalies are usually scarce. The black box nature of these models prohibits them from being trusted by users. To address these challenges, we propose MeLIAD, a novel methodology for interpretable anomaly detection, which unlike the previous methods is based on metric learning and achieves interpretability by design without relying on any prior distribution assumptions of true anomalies. MeLIAD requires only a few samples of anomalies for training, without employing any augmentation techniques, and is inherently interpretable, providing visualizations that offer insights into why an image is identified as anomalous. This is achieved by introducing a novel trainable entropy-based scoring component for the identification and localization of anomalous instances, and a novel loss function that jointly optimizes the anomaly scoring component with a metric learning objective. Experiments on five public benchmark datasets, including quantitative and qualitative evaluation of interpretability, demonstrate that MeLIAD achieves improved anomaly detection and localization performance compared to state-of-the-art methods.
Authors: Aditya Kommineni, Digbalay Bose, Tiantian Feng, So Hyun Kim, Helen Tager-Flusberg, Somer Bishop, Catherine Lord, Sudarsana Kadiri, Shrikanth Narayanan
Abstract: Clinical videos in the context of Autism Spectrum Disorder are often long-form interactions between children and caregivers/clinical professionals, encompassing complex verbal and non-verbal behaviors. Objective analyses of these videos could provide clinicians and researchers with nuanced insights into the behavior of children with Autism Spectrum Disorder. Manually coding these videos is a time-consuming task and requires a high level of domain expertise. Hence, the ability to capture these interactions computationally can augment the manual effort and enable supporting the diagnostic procedure. In this work, we investigate the use of foundation models across three modalities: speech, video, and text, to analyse child-focused interaction sessions. We propose a unified methodology to combine multiple modalities by using large language models as reasoning agents. We evaluate their performance on two tasks with different information granularity: activity recognition and abnormal behavior detection. We find that the proposed multimodal pipeline provides robustness to modality-specific limitations and improves performance on the clinical video analysis compared to unimodal settings.
Authors: Ting Liu, Zunnan Xu, Yue Hu, Liangtao Shi, Zhiqiang Wang, Quanjun Yin
Abstract: Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, full fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the specific-domain abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by a aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters.
Authors: Bowen Yan, Zhengsong Zhang, Liqiang Jing, Eftekhar Hossain, Xinya Du
Abstract: The rapid development of Large Vision-Language Models (LVLMs) often comes with widespread hallucination issues, making cost-effective and comprehensive assessments increasingly vital. Current approaches mainly rely on costly annotations and are not comprehensive -- in terms of evaluating all aspects such as relations, attributes, and dependencies between aspects. Therefore, we introduce the FIHA (autonomous Fine-graIned Hallucination evAluation evaluation in LVLMs), which could access hallucination LVLMs in the LLM-free and annotation-free way and model the dependency between different types of hallucinations. FIHA can generate Q&A pairs on any image dataset at minimal cost, enabling hallucination assessment from both image and caption. Based on this approach, we introduce a benchmark called FIHA-v1, which consists of diverse questions on various images from MSCOCO and Foggy. Furthermore, we use the Davidson Scene Graph (DSG) to organize the structure among Q&A pairs, in which we can increase the reliability of the evaluation. We evaluate representative models using FIHA-v1, highlighting their limitations and challenges. We released our code and data.
Authors: Sen Lei, Xinyu Xiao, Heng-Chao Li, Zhenwei Shi, Qing Zhu
Abstract: Given a language expression, referring remote sensing image segmentation (RRSIS) aims to identify the ground objects and assign pixel-wise labels within the imagery. The one of key challenges for this task is to capture discriminative multi-modal features via text-image alignment. However, the existing RRSIS methods use one vanilla and coarse alignment, where the language expression is directly extracted to be fused with the visual features. In this paper, we argue that a "fine-grained image-text alignment" can improve the extraction of multi-modal information. To this point, we here proposed a new referring remote sensing image segmentation method, termed FIANet, that fully exploits the visual and linguistic representations. Specifically, the original referring expression is regarded as context text, which is further decoupled into ground object text and spatial position text. The proposed fine-grained image-text alignment module (FIAM) would simultaneously leverage the features of the input image and the corresponding texts and learn better discriminative multi-modal representation. Meanwhile, to handle the various scales of ground objects in remote sensing, we introduce a Text-aware Multi-scale Enhancement Module (TMEM) to adaptively perform cross-scale fusion and intersections. We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D, and our method obtains superior performance over several state-of-the-art methods. The code will be publicly available.
Authors: Penghao Wang, Zhirui Zhang, Liao Wang, Kaixin Yao, Siyuan Xie, Jingyi Yu, Minye Wu, Lan Xu
Abstract: Experiencing high-fidelity volumetric video as seamlessly as 2D videos is a long-held dream. However, current dynamic 3DGS methods, despite their high rendering quality, face challenges in streaming on mobile devices due to computational and bandwidth constraints. In this paper, we introduce V\textsuperscript{3}(Viewing Volumetric Videos), a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians. Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs. Additionally, we propose a two-stage training strategy to reduce storage requirements with rapid training speed. The first stage employs hash encoding and shallow MLP to learn motion, then reduces the number of Gaussians through pruning to meet the streaming requirements, while the second stage fine tunes other Gaussian attributes using residual entropy loss and temporal loss to improve temporal continuity. This strategy, which disentangles motion and appearance, maintains high rendering quality with compact storage requirements. Meanwhile, we designed a multi-platform player to decode and render 2D Gaussian videos. Extensive experiments demonstrate the effectiveness of V\textsuperscript{3}, outperforming other methods by enabling high-quality rendering and streaming on common devices, which is unseen before. As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience, including smooth scrolling and instant sharing. Our project page with source code is available at https://authoritywang.github.io/v3/.
Authors: Sebastian Dille, Ari Blondal, Sylvain Paris, Ya\u{g}{\i}z Aksoy
Abstract: Class-agnostic image segmentation is a crucial component in automating image editing workflows, especially in contexts where object selection traditionally involves interactive tools. Existing methods in the literature often adhere to top-down formulations, following the paradigm of class-based approaches, where object detection precedes per-object segmentation. In this work, we present a novel bottom-up formulation for addressing the class-agnostic segmentation problem. We supervise our network directly on the projective sphere of its feature space, employing losses inspired by metric learning literature as well as losses defined in a novel segmentation-space representation. The segmentation results are obtained through a straightforward mean-shift clustering of the estimated features. Our bottom-up formulation exhibits exceptional generalization capability, even when trained on datasets designed for class-based segmentation. We further showcase the effectiveness of our generic approach by addressing the challenging task of cell and nucleus segmentation. We believe that our bottom-up formulation will offer valuable insights into diverse segmentation challenges in the literature.
Authors: Hadi Rezvani, Navid Zarrabi, Ishaan Mehta, Christopher Kolios, Hussein Ali Jaafar, Cheng-Hao Kao, Sajad Saeedi, Nariman Yousefi
Abstract: Plastic pollution presents an escalating global issue, impacting health and environmental systems, with micro- and nanoplastics found across mediums from potable water to air. Traditional methods for studying these contaminants are labor-intensive and time-consuming, necessitating a shift towards more efficient technologies. In response, this paper introduces micro- and nanoplastics (MiNa), a novel and open-source dataset engineered for the automatic detection and classification of micro and nanoplastics using object detection algorithms. The dataset, comprising scanning electron microscopy images simulated under realistic aquatic conditions, categorizes plastics by polymer type across a broad size spectrum. We demonstrate the application of state-of-the-art detection algorithms on MiNa, assessing their effectiveness and identifying the unique challenges and potential of each method. The dataset not only fills a critical gap in available resources for microplastic research but also provides a robust foundation for future advancements in the field.
Authors: Ilpo Viertola, Vladimir Iashin, Esa Rahtu
Abstract: We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site
Authors: Chris Careaga, Ya\u{g}{\i}z Aksoy
Abstract: Intrinsic image decomposition aims to separate the surface reflectance and the effects from the illumination given a single photograph. Due to the complexity of the problem, most prior works assume a single-color illumination and a Lambertian world, which limits their use in illumination-aware image editing applications. In this work, we separate an input image into its diffuse albedo, colorful diffuse shading, and specular residual components. We arrive at our result by gradually removing first the single-color illumination and then the Lambertian-world assumptions. We show that by dividing the problem into easier sub-problems, in-the-wild colorful diffuse shading estimation can be achieved despite the limited ground-truth datasets. Our extended intrinsic model enables illumination-aware analysis of photographs and can be used for image editing applications such as specularity removal and per-pixel white balancing.
Authors: Gracjan G\'oral, Alicja Ziarko, Michal Nauman, Maciej Wo{\l}czyk
Abstract: Visual perspective-taking (VPT), the ability to understand the viewpoint of another person, enables individuals to anticipate the actions of other people. For instance, a driver can avoid accidents by assessing what pedestrians see. Humans typically develop this skill in early childhood, but it remains unclear whether the recently emerging Vision Language Models (VLMs) possess such capability. Furthermore, as these models are increasingly deployed in the real world, understanding how they perform nuanced tasks like VPT becomes essential. In this paper, we introduce two manually curated datasets, Isle-Bricks and Isle-Dots for testing VPT skills, and we use it to evaluate 12 commonly used VLMs. Across all models, we observe a significant performance drop when perspective-taking is required. Additionally, we find performance in object detection tasks is poorly correlated with performance on VPT tasks, suggesting that the existing benchmarks might not be sufficient to understand this problem. The code and the dataset will be available at https://sites.google.com/view/perspective-taking
Authors: Amine Sadikine, Bogdan Badic, Jean-Pierre Tasu, Vincent Noblet, Dimitris Visvikis, Pierre-Henri Conze
Abstract: The extraction of blood vessels has recently experienced a widespread interest in medical image analysis. Automatic vessel segmentation is highly desirable to guide clinicians in computer-assisted diagnosis, therapy or surgical planning. Despite a good ability to extract large anatomical structures, the capacity of U-Net inspired architectures to automatically delineate vascular systems remains a major issue, especially given the scarcity of existing datasets. In this paper, we present a novel approach that integrates into deep segmentation shape priors from a Semi-Overcomplete Convolutional Auto-Encoder (S-OCAE) embedding. Compared to standard Convolutional Auto-Encoders (CAE), it exploits an over-complete branch that projects data onto higher dimensions to better characterize tiny structures. Experiments on retinal and liver vessel extraction, respectively performed on publicly-available DRIVE and 3D-IRCADb datasets, highlight the effectiveness of our method compared to U-Net trained without and with shape priors from a traditional CAE.
Authors: Kosmas Pinitas, Konstantinos Makantasis, Georgios N. Yannakakis
Abstract: Domain generalisation involves learning artificial intelligence (AI) models that can maintain high performance across diverse domains within a specific task. In video games, for instance, such AI models can supposedly learn to detect player actions across different games. Despite recent advancements in AI, domain generalisation for modelling the users' experience remains largely unexplored. While video games present unique challenges and opportunities for the analysis of user experience -- due to their dynamic and rich contextual nature -- modelling such experiences is limited by generally small datasets. As a result, conventional modelling methods often struggle to bridge the domain gap between users and games due to their reliance on large labelled training data and assumptions of common distributions of user experience. In this paper, we tackle this challenge by introducing a framework that decomposes the general domain-agnostic modelling of user experience into several domain-specific and game-dependent tasks that can be solved via few-shot learning. We test our framework on a variation of the publicly available GameVibe corpus, designed specifically to test a model's ability to predict user engagement across different first-person shooter games. Our findings demonstrate the superior performance of few-shot learners over traditional modelling methods and thus showcase the potential of few-shot learning for robust experience modelling in video games and beyond.
Authors: Reza Safdari, Mohammad Koohi-Moghaddam, Kyongtae Tyler Bae
Abstract: In this study, we implemented a two-stage deep learning-based approach to segment lesions in PET/CT images for the AutoPET III challenge. The first stage utilized a DynUNet model for coarse segmentation, identifying broad regions of interest. The second stage refined this segmentation using an ensemble of SwinUNETR, SegResNet, and UNet models. Preprocessing involved resampling images to a common resolution and normalization, while data augmentation techniques such as affine transformations and intensity adjustments were applied to enhance model generalization. The dataset was split into 80% training and 20% validation, excluding healthy cases. This method leverages multi-stage segmentation and model ensembling to achieve precise lesion segmentation, aiming to improve robustness and overall performance.
Authors: Kratika Bhagtani, Amit Kumar Singh Yadav, Paolo Bestagini, Edward J. Delp
Abstract: Diffusion-based speech generators are ubiquitous. These methods can generate very high quality synthetic speech and several recent incidents report their malicious use. To counter such misuse, synthetic speech detectors have been developed. Many of these detectors are trained on datasets which do not include diffusion-based synthesizers. In this paper, we demonstrate that existing detectors trained on one such dataset, ASVspoof2019, do not perform well in detecting synthetic speech from recent diffusion-based synthesizers. We propose the Diffusion-Based Synthetic Speech Dataset (DiffSSD), a dataset consisting of about 200 hours of labeled speech, including synthetic speech generated by 8 diffusion-based open-source and 2 commercial generators. We also examine the performance of existing synthetic speech detectors on DiffSSD in both closed-set and open-set scenarios. The results highlight the importance of this dataset in detecting synthetic speech generated from recent open-source and commercial speech generators.
Authors: Yan Song Hu, Nicolas Abboud, Muhammad Qasim Ali, Adam Srebrnjak Yang, Imad Elhajj, Daniel Asmar, Yuhao Chen, John S. Zelek
Abstract: Real-time SLAM with dense 3D mapping is computationally challenging, especially on resource-limited devices. The recent development of 3D Gaussian Splatting (3DGS) offers a promising approach for real-time dense 3D reconstruction. However, existing 3DGS-based SLAM systems struggle to balance hardware simplicity, speed, and map quality. Most systems excel in one or two of the aforementioned aspects but rarely achieve all. A key issue is the difficulty of initializing 3D Gaussians while concurrently conducting SLAM. To address these challenges, we present Monocular GSO (MGSO), a novel real-time SLAM system that integrates photometric SLAM with 3DGS. Photometric SLAM provides dense structured point clouds for 3DGS initialization, accelerating optimization and producing more efficient maps with fewer Gaussians. As a result, experiments show that our system generates reconstructions with a balance of quality, memory efficiency, and speed that outperforms the state-of-the-art. Furthermore, our system achieves all results using RGB inputs. We evaluate the Replica, TUM-RGBD, and EuRoC datasets against current live dense reconstruction systems. Not only do we surpass contemporary systems, but experiments also show that we maintain our performance on laptop hardware, making it a practical solution for robotics, A/R, and other real-time applications.
Authors: Muthu Chidambaram, Khashayar Gatmiry, Sitan Chen, Holden Lee, Jianfeng Lu
Abstract: The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power. In this work we clarify this misconception by rigorously proving that guidance fails to sample from the intended tilted distribution. Our main result is to give a fine-grained characterization of the dynamics of guidance in two cases, (1) mixtures of compactly supported distributions and (2) mixtures of Gaussians, which reflect salient properties of guidance that manifest on real-world data. In both cases, we prove that as the guidance parameter increases, the guided model samples more heavily from the boundary of the support of the conditional distribution. We also prove that for any nonzero level of score estimation error, sufficiently large guidance will result in sampling away from the support, theoretically justifying the empirical finding that large guidance results in distorted generations. In addition to verifying these results empirically in synthetic settings, we also show how our theoretical insights can offer useful prescriptions for practical deployment.
Authors: Jason Chuan-Chih Chou, Nahid Alam
Abstract: Since the publication of CLIP, the approach of using InfoNCE loss for contrastive pre-training has become widely popular for bridging two or more modalities. Despite its wide adoption, CLIP's original design choices of L2 normalization and cosine similarity logit have rarely been revisited. We have systematically experimented with alternative geometries and softmax logits for language-image pre-training and identified that variants with intuitive Euclidean geometry, Euclidean CLIP (EuCLIP), match or exceed the performance of CLIP and support hierarchical relationships at least as well as more complicated hyperbolic alternative.
Authors: \c{S}aban \"Ozt\"urk, O\u{g}uz Can Duran, Tolga \c{C}ukur
Abstract: Low-dose computed tomography (LDCT) lower potential risks linked to radiation exposure while relying on advanced denoising algorithms to maintain diagnostic quality in reconstructed images. The reigning paradigm in LDCT denoising is based on neural network models that learn data-driven image priors to separate noise evoked by dose reduction from underlying tissue signals. Naturally, the fidelity of these priors depend on the model's ability to capture the broad range of contextual features evident in CT images. Earlier convolutional neural networks (CNN) are highly adept at efficiently capturing short-range spatial context, but their limited receptive fields reduce sensitivity to interactions over longer distances. Although transformers based on self-attention mechanisms have recently been posed to increase sensitivity to long-range context, they can suffer from suboptimal performance and efficiency due to elevated model complexity, particularly for high-resolution CT images. For high-quality restoration of LDCT images, here we introduce DenoMamba, a novel denoising method based on state-space modeling (SSM), that efficiently captures short- and long-range context in medical images. Following an hourglass architecture with encoder-decoder stages, DenoMamba employs a spatial SSM module to encode spatial context and a novel channel SSM module equipped with a secondary gated convolution network to encode latent features of channel context at each stage. Feature maps from the two modules are then consolidated with low-level input features via a convolution fusion module (CFM). Comprehensive experiments on LDCT datasets with 25\% and 10\% dose reduction demonstrate that DenoMamba outperforms state-of-the-art denoisers with average improvements of 1.4dB PSNR, 1.1% SSIM, and 1.6% RMSE in recovered image quality.
Authors: Areej Alsaafin, Abubakr Shafique, Saghir Alfasly, H. R. Tizhoosh
Abstract: The field of medical diagnostics has witnessed a transformative convergence of artificial intelligence (AI) and healthcare data, offering promising avenues for enhancing patient care and disease comprehension. However, this integration of multimodal data, specifically histopathology whole slide images (WSIs) and genetic sequencing data, presents unique challenges due to modality disparities and the need for scalable computational solutions. This paper addresses the scarcity of multimodal solutions, primarily centered around unimodal data solutions, thus limiting the realization of the rich insights that can be derived from integrating images and genomic data. Here, we introduce MarbliX ``Multimodal Association and Retrieval with Binary Latent Indexed matriX,'' an innovative multimodal framework that integrates histopathology images with immunogenomic sequencing data, encapsulating them into a concise binary patient code, referred to as ``monogram.'' This binary representation facilitates the establishment of a comprehensive archive, enabling clinicians to match similar cases. The experimental results demonstrate the potential of MarbliX to empower healthcare professionals with in-depth insights, leading to more precise diagnoses, reduced variability, and expanded personalized treatment options, particularly in the context of cancer.
Authors: Jianghu Lu, Shikun Li, Kexin Bao, Pengju Wang, Zhenxing Qian, Shiming Ge
Abstract: Federated learning provides a privacy-preserving manner to collaboratively train models on data distributed over multiple local clients via the coordination of a global server. In this paper, we focus on label distribution skew in federated learning, where due to the different user behavior of the client, label distributions between different clients are significantly different. When faced with such cases, most existing methods will lead to a suboptimal optimization due to the inadequate utilization of label distribution information in clients. Inspired by this, we propose a label-masking distillation approach termed FedLMD to facilitate federated learning via perceiving the various label distributions of each client. We classify the labels into majority and minority labels based on the number of examples per class during training. The client model learns the knowledge of majority labels from local data. The process of distillation masks out the predictions of majority labels from the global model, so that it can focus more on preserving the minority label knowledge of the client. A series of experiments show that the proposed approach can achieve state-of-the-art performance in various cases. Moreover, considering the limited resources of the clients, we propose a variant FedLMD-Tf that does not require an additional teacher, which outperforms previous lightweight approaches without increasing computational costs. Our code is available at https://github.com/wnma3mz/FedLMD.
Authors: Li Ling, Yiping Xie, Nils Bore, John Folkesson
Abstract: Multibeam echo-sounder (MBES) is the de-facto sensor for bathymetry mapping. In recent years, cheaper MBES sensors and global mapping initiatives have led to exponential growth of available data. However, raw MBES data contains 1-25% of noise that requires semi-automatic filtering using tools such as Combined Uncertainty and Bathymetric Estimator (CUBE). In this work, we draw inspirations from the 3D point cloud community and adapted a score-based point cloud denoising network for MBES outlier detection and denoising. We trained and evaluated this network on real MBES survey data. The proposed method was found to outperform classical methods, and can be readily integrated into existing MBES standard workflow. To facilitate future research, the code and pretrained model are available online.
Authors: Chengkun Sun, Russell Stevens Terry, Jiang Bian, Jie Xu
Abstract: Accurate segmentation of multiple organs and the differentiation of pathological tissues in medical imaging are crucial but challenging, especially for nuanced classifications and ambiguous organ boundaries. To tackle these challenges, we introduce GASA-UNet, a refined U-Net-like model featuring a novel Global Axial Self-Attention (GASA) block. This block processes image data as a 3D entity, with each 2D plane representing a different anatomical cross-section. Voxel features are defined within this spatial context, and a Multi-Head Self-Attention (MHSA) mechanism is utilized on extracted 1D patches to facilitate connections across these planes. Positional embeddings (PE) are incorporated into our attention framework, enriching voxel features with spatial context and enhancing tissue classification and organ edge delineation. Our model has demonstrated promising improvements in segmentation performance, particularly for smaller anatomical structures, as evidenced by enhanced Dice scores and Normalized Surface Dice (NSD) on three benchmark datasets, i.e., BTCV, AMOS, and KiTS23.
Authors: Francis Ogoke, Sumesh Kalambettu Suresh, Jesse Adamczyk, Dan Bolintineanu, Anthony Garland, Michael Heiden, Amir Barati Farimani
Abstract: The stochastic formation of defects during Laser Powder Bed Fusion (L-PBF) negatively impacts its adoption for high-precision use cases. Optical monitoring techniques can be used to identify defects based on layer-wise imaging, but these methods are difficult to scale to high resolutions due to cost and memory constraints. Therefore, we implement generative deep learning models to link low-cost, low-resolution images of the build plate to detailed high-resolution optical images of the build plate, enabling cost-efficient process monitoring. To do so, a conditional latent probabilistic diffusion model is trained to produce realistic high-resolution images of the build plate from low-resolution webcam images, recovering the distribution of small-scale features and surface roughness. We first evaluate the performance of the model by analyzing the reconstruction quality of the generated images using peak-signal-to-noise-ratio (PSNR), structural similarity index measure (SSIM) and wavelet covariance metrics that describe the preservation of high-frequency information. Additionally, we design a framework based upon the Segment Anything foundation model to recreate the 3D morphology of the printed part and analyze the surface roughness of the reconstructed samples. Finally, we explore the zero-shot generalization capabilities of the implemented framework to other part geometries by creating synthetic low-resolution data.
Authors: Sahaj K. Mistry, Sourav Saini, Aashray Gupta, Aayush Gupta, Sunny Rai, Vinit Jakhetiya, Ujjwal Baid, Sharath Chandra Guntuku
Abstract: Brain tumor segmentation plays a crucial role in computer-aided diagnosis. This study introduces a novel segmentation algorithm utilizing a modified nnU-Net architecture. Within the nnU-Net architecture's encoder section, we enhance conventional convolution layers by incorporating omni-dimensional dynamic convolution layers, resulting in improved feature representation. Simultaneously, we propose a multi-scale attention strategy that harnesses contemporary insights from various scales. Our model's efficacy is demonstrated on diverse datasets from the BraTS-2023 challenge. Integrating omni-dimensional dynamic convolution (ODConv) layers and multi-scale features yields substantial improvement in the nnU-Net architecture's performance across multiple tumor segmentation datasets. Remarkably, our proposed model attains good accuracy during validation for the BraTS Africa dataset. The ODconv source code along with full training code is available on GitHub.
Authors: Ho Heon Kim, Won Chan Jeong, Young Shin Ko, Young Jin Park
Abstract: Digital pathology has made significant advances in tumor diagnosis and segmentation, but image variability due to differences in organs, tissue preparation, and acquisition - known as domain shift - limits the effectiveness of current algorithms. The COSAS (Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation) challenge addresses this issue by improving the resilience of segmentation algorithms to domain shift, with Task 2 focusing on adenocarcinoma segmentation using a diverse dataset from six scanners, pushing the boundaries of clinical diagnostics. Our approach employs unsupervised learning through stain separation within a multi-task learning framework using a multi-decoder autoencoder. This model isolates stain matrix and stain density, allowing it to handle color variation and improve generalization across scanners. We further enhanced the robustness of the model with a mixture of stain augmentation techniques and used a U-net architecture for segmentation. The novelty of our method lies in the use of stain separation within a multi-task learning framework, which effectively disentangles histological structures from color variations. This approach shows promise for improving segmentation accuracy and generalization across different histopathological stains, paving the way for more reliable diagnostic tools in digital pathology.
Authors: Matteo Salis, Abdourrahmane M. Atto, Stefano Ferraris, Rosa Meo
Abstract: Groundwater resources are one of the most relevant elements in the water cycle, therefore developing models to accurately predict them is a pivotal task in the sustainable resources management framework. Deep Learning (DL) models have been revealed very effective in hydrology, especially by feeding spatially distributed data (e.g. raster data). In many regions, hydrological measurements are difficult to obtain regularly or periodically in time, and in some cases, last available data are not up to date. Reversely, weather data, which significantly impacts water resources, are usually more available and with higher quality. More specifically, we have proposed two different DL models to predict the water table depth in the Grana-Maira catchment (Piemonte, IT) using only exogenous weather image time series. To deal with the image time series, both models are made of a first Time Distributed Convolutional Neural Network (TDC) which encodes the image available at each time step into a vectorial representation. The first model, TDC-LSTM uses then a Sequential Module based on an LSTM layer to learn temporal relations and output the predictions. The second model, TDC-UnPWaveNet uses instead a new version of the WaveNet architecture, adapted here to output a sequence shorter and completely shifted in the future with respect to the input one. To this aim, and to deal with the different sequence lengths in the UnPWaveNet, we have designed a new Channel Distributed layer, that acts like a Time Distributed one but on the channel dimension, i.e. applying the same set of operations to each channel of the input. TDC-LSTM and TDC-UnPWaveNet have shown both remarkable results. However, the two models have focused on different learnable information: TDC-LSTM has focused more on lowering the bias, while the TDC-UnPWaveNet has focused more on the temporal dynamics maximising correlation and KGE.
Authors: Jinge Wu, Yunsoo Kim, Daqian Shi, David Cliffton, Fenglin Liu, Honghan Wu
Abstract: Inspired by the success of large language models (LLMs), there is growing research interest in developing LLMs in the medical domain to assist clinicians. However, for hospitals, using closed-source commercial LLMs involves privacy issues, and developing open-source public LLMs requires large-scale computational resources, which are usually limited, especially in resource-efficient regions and low-income countries. We propose an open-source Small Language and Vision Assistant (SLaVA-CXR) that can be used for Chest X-Ray report automation. To efficiently train a small assistant, we first propose the Re$^3$Training method, which simulates the cognitive development of radiologists and optimizes the model in the Recognition, Reasoning, and Reporting training manner. Then, we introduce a data synthesis method, RADEX, which can generate a high-quality and diverse training corpus with privacy regulation compliance. The extensive experiments show that our SLaVA-CXR built on a 2.7B backbone not only outperforms but also achieves 6 times faster inference efficiency than previous state-of-the-art larger models.
Authors: Xinshuang Liu, Yizhong Zhang, Xin Tong
Abstract: In whiteboard-based remote communication, the seamless integration of drawn content and hand-screen interactions is essential for an immersive user experience. Previous methods either require bulky device setups for capturing hand gestures or fail to accurately track the hand poses from capacitive images. In this paper, we present a real-time method for precise tracking 3D poses of both hands from capacitive video frames. To this end, we develop a deep neural network to identify hands and infer hand joint positions from capacitive frames, and then recover 3D hand poses from the hand-joint positions via a constrained inverse kinematic solver. Additionally, we design a device setup for capturing high-quality hand-screen interaction data and obtained a more accurate synchronized capacitive video and hand pose dataset. Our method improves the accuracy and stability of 3D hand tracking for capacitive frames while maintaining a compact device setup for remote communication. We validate our scheme design and its superior performance on 3D hand pose tracking and demonstrate the effectiveness of our method in whiteboard-based remote communication. Our code, model, and dataset are available at https://V-Hands.github.io.
Authors: Guantian Huang, Beibei Li, Xiaobing Fan, Aritrick Chatterjee, Cheng Wei, Shouliang Qi, Wei Qian, Dianning He
Abstract: Accurate segmentation of various regions within the prostate is pivotal for diagnosing and treating prostate-related diseases. However, the scarcity of labeled data, particularly in specialized medical fields like prostate imaging, poses a significant challenge. Segment Anything Model (SAM) is a new large model for natural image segmentation, but there are some challenges in medical imaging. In order to better utilize the powerful feature extraction capability of SAM as well as to address the problem of low data volume for medical image annotation, we use Low-Rank Adaptation (LoRA) and semi-supervised learning methods of Monte Carlo guided interpolation consistency (MCIC) to enhance the fine-tuned SAM. We propose Monte Carlo-guided Interpolation Consistency Segment Anything Model (MCICSAM) for application to semi-supervised learning based prostate region segmentation. In the unlabeled data section, MCIC performs two different interpolation transformations on the input data and incorporates Monte Carlo uncertainty analysis in the output, forcing the model to be consistent in its predictions. The consistency constraints imposed on these interpolated samples allow the model to fit the distribution of unlabeled data better, ultimately improving its performance in semi-supervised scenarios. We use Dice and Hausdorff Distance at 95th percentile (HD95) to validate model performance. MCICSAM yieldes Dice with 79.38% and 89.95%, along with improves HD95 values of 3.12 and 2.27 for transition zone and transition zone. At the same time MCICSAM demonstrates strong generalizability. This method is expected to bring new possibilities in the field of prostate image segmentation.
Authors: Maximilian Rokuss, Yannick Kirchhoff, Saikat Roy, Balint Kovacs, Constantin Ulrich, Tassilo Wald, Maximilian Zenk, Stefan Denner, Fabian Isensee, Philipp Vollmuth, Jens Kleesiek, Klaus Maier-Hein
Abstract: Accurate segmentation of Multiple Sclerosis (MS) lesions in longitudinal MRI scans is crucial for monitoring disease progression and treatment efficacy. Although changes across time are taken into account when assessing images in clinical practice, most existing deep learning methods treat scans from different timepoints separately. Among studies utilizing longitudinal images, a simple channel-wise concatenation is the primary albeit suboptimal method employed to integrate timepoints. We introduce a novel approach that explicitly incorporates temporal differences between baseline and follow-up scans through a unique architectural inductive bias called Difference Weighting Block. It merges features from two timepoints, emphasizing changes between scans. We achieve superior scores in lesion segmentation (Dice Score, Hausdorff distance) as well as lesion detection (lesion-level $F_1$ score) as compared to state-of-the-art longitudinal and single timepoint models across two datasets. Our code is made publicly available at www.github.com/MIC-DKFZ/Longitudinal-Difference-Weighting.
Authors: Rabia Asghar, Arslan Shaukat, Usman Akram, Rimsha Tariq
Abstract: Human immune system contains white blood cells (WBC) that are good indicator of many diseases like bacterial infections, AIDS, cancer, spleen, etc. White blood cells have been sub classified into four types: monocytes, lymphocytes, eosinophils and neutrophils on the basis of their nucleus, shape and cytoplasm. Traditionally in laboratories, pathologists and hematologists analyze these blood cells through microscope and then classify them manually. This manual process takes more time and increases the chance of human error. Hence, there is a need to automate this process. In this paper, first we have used different CNN pre-train models such as ResNet-50, InceptionV3, VGG16 and MobileNetV2 to automatically classify the white blood cells. These pre-train models are applied on Kaggle dataset of microscopic images. Although we achieved reasonable accuracy ranging between 92 to 95%, still there is need to enhance the performance. Hence, inspired by these architectures, a framework has been proposed to automatically categorize the four kinds of white blood cells with increased accuracy. The aim is to develop a convolution neural network (CNN) based classification system with decent generalization ability. The proposed CNN model has been tested on white blood cells images from Kaggle and LISC datasets. Accuracy achieved is 99.57% and 98.67% for both datasets respectively. Our proposed convolutional neural network-based model provides competitive performance as compared to previous results reported in literature.
Authors: Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker
Abstract: Artificial intelligence (AI) systems have substantially improved dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI) systems further enhancing clinicians' confidence and trust in AI-driven decisions. Despite these advancements, there remains a critical need for objective evaluation of how dermatologists engage with both AI and XAI tools. In this study, 76 dermatologists participated in a reader study, diagnosing 16 dermoscopic images of melanomas and nevi using an XAI system that provides detailed, domain-specific explanations. Eye-tracking technology was employed to assess their interactions. Diagnostic performance was compared with that of a standard AI system lacking explanatory features. Our findings reveal that XAI systems improved balanced diagnostic accuracy by 2.8 percentage points relative to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and complex lesions were associated with elevated cognitive load, as evidenced by increased ocular fixations. These insights have significant implications for clinical practice, the design of AI tools for visual tasks, and the broader development of XAI in medical diagnostics.
Authors: Chinmay Rao, Matthias van Osch, Nicola Pezzotti, Jeroen de Bresser, Laurens Beljaards, Jakob Meineke, Elwin de Weerdt, Huangling Lu, Mariya Doneva, Marius Staring
Abstract: Since multiple MRI contrasts of the same anatomy contain redundant information, one contrast can be used as a prior for guiding the reconstruction of an undersampled subsequent contrast. To this end, several learning-based guided reconstruction methods have been proposed. However, two key challenges remain - (a) the requirement of large paired training datasets and (b) the lack of intuitive understanding of the model's internal representation and utilization of the shared information. We propose a modular two-stage approach for guided reconstruction, addressing these challenges. A content/style model of two-contrast image data is learned in a largely unpaired manner and is subsequently applied as a plug-and-play operator in iterative reconstruction. The disentanglement of content and style allows explicit representation of contrast-independent and contrast-specific factors. Based on this, incorporating prior information into the reconstruction reduces to simply replacing the aliased reconstruction content with clean content derived from the reference scan. We name this novel approach PnP-MUNIT. Various aspects like interpretability and convergence are explored via simulations. Furthermore, its practicality is demonstrated on the NYU fastMRI DICOM dataset and two in-house raw datasets, obtaining up to 32.6% more acceleration over learning-based non-guided reconstruction for a given SSIM. In a radiological task, PnP-MUNIT allowed 33.3% more acceleration over clinical reconstruction at diagnostic quality.
Authors: Savvas Sifnaios, George Arvanitakis, Fotios K. Konstantinidis, Georgios Tsimiklis, Angelos Amditis, Panayiotis Frangos
Abstract: Recent advancements in computer vision, particularly in detection, segmentation, and classification, have significantly impacted various domains. However, these advancements are tied to RGB-based systems, which are insufficient for applications in industries like waste sorting, pharmaceuticals, and defense, where advanced object characterization beyond shape or color is necessary. Hyperspectral (HS) imaging, capturing both spectral and spatial information, addresses these limitations and offers advantages over conventional technologies such as X-ray fluorescence and Raman spectroscopy, particularly in terms of speed, cost, and safety. This study evaluates the potential of combining HS imaging with deep learning for material characterization. The research involves: i) designing an experimental setup with HS camera, conveyor, and controlled lighting; ii) generating a multi-object dataset of various plastics (HDPE, PET, PP, PS) with semi-automated mask generation and Raman spectroscopy-based labeling; and iii) developing a deep learning model trained on HS images for pixel-level material classification. The model achieved 99.94\% classification accuracy, demonstrating robustness in color, size, and shape invariance, and effectively handling material overlap. Limitations, such as challenges with black objects, are also discussed. Extending computer vision beyond RGB to HS imaging proves feasible, overcoming major limitations of traditional methods and showing strong potential for future applications.
Authors: Sven L\"upke, Yousef Yeganeh, Ehsan Adeli, Nassir Navab, Azade Farshad
Abstract: Recent advances in generative models for medical imaging have shown promise in representing multiple modalities. However, the variability in modality availability across datasets limits the general applicability of the synthetic data they produce. To address this, we present a novel physics-informed generative model capable of synthesizing a variable number of brain MRI modalities, including those not present in the original dataset. Our approach utilizes latent diffusion models and a two-step generative process: first, unobserved physical tissue property maps are synthesized using a latent diffusion model, and then these maps are combined with a physical signal model to generate the final MRI scan. Our experiments demonstrate the efficacy of this approach in generating unseen MR contrasts and preserving physical plausibility. Furthermore, we validate the distributions of generated tissue properties by comparing them to those measured in real brain tissue.
Authors: Alexander Jaus, Simon Rei{\ss}, Jens Klesiek, Rainer Stiefelhagen
Abstract: In this work, we describe our approach to compete in the autoPET3 datacentric track. While conventional wisdom suggests that larger datasets lead to better model performance, recent studies indicate that excluding certain training samples can enhance model accuracy. We find that in the autoPETIII dataset, a model that is trained on the entire dataset exhibits undesirable characteristics by producing a large number of false positives particularly for PSMA-PETs. We counteract this by removing the easiest samples from the training dataset as measured by the model loss before retraining from scratch. Using the proposed approach we manage to drive down the false negative volume and improve upon the baseline model in both false negative volume and dice score on the preliminary test set. Code and pre-trained models are available at github.com/alexanderjaus/autopet3_datadiet.
Authors: Pascal Passigan, Vayd Ramkumar
Abstract: The integration of Artificial Intelligence (AI) in medical diagnostics is often hindered by model opacity, where high-accuracy systems function as "black boxes" without transparent reasoning. This limitation is critical in clinical settings, where trust and reliability are paramount. To address this, we have developed an explainable AI methodology tailored for medical imaging. By employing a Convolutional Neural Network (CNN) that analyzes MRI scans across both image and frequency domains, we introduce a novel approach that incorporates Uniform Manifold Approximation and Projection UMAP] for the visualization of latent input embeddings. This approach not only enhances early training efficiency but also deepens our understanding of how additional features impact the model predictions, thereby increasing interpretability and supporting more accurate and intuitive diagnostic inferences
Authors: Qiyuan Tian, Zhuoyue Wang, Xiaoling Cui
Abstract: An improved model of medical image segmentation for brain tumor is discussed, which is a deep learning algorithm based on U-Net architecture. Based on the traditional U-Net, we introduce GSConv module and ECA attention mechanism to improve the performance of the model in medical image segmentation tasks. With these improvements, the new U-Net model is able to extract and utilize multi-scale features more efficiently while flexibly focusing on important channels, resulting in significantly improved segmentation results. During the experiment, the improved U-Net model is trained and evaluated systematically. By looking at the loss curves of the training set and the test set, we find that the loss values of both rapidly decline to the lowest point after the eighth epoch, and then gradually converge and stabilize. This shows that our model has good learning ability and generalization ability. In addition, by monitoring the change in the mean intersection ratio (mIoU), we can see that after the 35th epoch, the mIoU gradually approaches 0.8 and remains stable, which further validates the model. Compared with the traditional U-Net, the improved version based on GSConv module and ECA attention mechanism shows obvious advantages in segmentation effect. Especially in the processing of brain tumor image edges, the improved model can provide more accurate segmentation results. This achievement not only improves the accuracy of medical image analysis, but also provides more reliable technical support for clinical diagnosis.
Authors: Daniele Rege Cambrin, Giuseppe Gallipoli, Irene Benedetto, Luca Cagliero, Paolo Garza
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks. However, current training approaches combine standard cross-entropy loss with extensive data, human feedback, or ad hoc methods to enhance performance. These solutions are often not scalable or feasible due to their associated costs, complexity, or resource requirements. This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine-tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross-Entropy loss represents a sub-optimal choice, while models trained to minimize alternative (task-dependent) losses, such as Focal or Lov\'asz, achieve a mean improvement of +42% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.
Authors: Elliot Chane-Sane, Joseph Amigo, Thomas Flayols, Ludovic Righetti, Nicolas Mansard
Abstract: Parkour poses a significant challenge for legged robots, requiring navigation through complex environments with agility and precision based on limited sensory inputs. In this work, we introduce a novel method for training end-to-end visual policies, from depth pixels to robot control commands, to achieve agile and safe quadruped locomotion. We formulate robot parkour as a constrained reinforcement learning (RL) problem designed to maximize the emergence of agile skills within the robot's physical limits while ensuring safety. We first train a policy without vision using privileged information about the robot's surroundings. We then generate experience from this privileged policy to warm-start a sample efficient off-policy RL algorithm from depth images. This allows the robot to adapt behaviors from this privileged experience to visual locomotion while circumventing the high computational costs of RL directly from pixels. We demonstrate the effectiveness of our method on a real Solo-12 robot, showcasing its capability to perform a variety of parkour skills such as walking, climbing, leaping, and crawling.
Authors: Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, Takayoshi Yamashita
Abstract: A Manhattan world lying along cuboid buildings is useful for camera angle estimation. However, accurate and robust angle estimation from fisheye images in the Manhattan world has remained an open challenge because general scene images tend to lack constraints such as lines, arcs, and vanishing points. To achieve higher accuracy and robustness, we propose a learning-based calibration method that uses heatmap regression, which is similar to pose estimation using keypoints, to detect the directions of labeled image coordinates. Simultaneously, our two estimators recover the rotation and remove fisheye distortion by remapping from a general scene image. Without considering vanishing-point constraints, we find that additional points for learning-based methods can be defined. To compensate for the lack of vanishing points in images, we introduce auxiliary diagonal points that have the optimal 3D arrangement of spatial uniformity. Extensive experiments demonstrated that our method outperforms conventional methods on large-scale datasets and with off-the-shelf cameras.
Authors: Jiahao Qin, Yitao Xu, Zong Lu, Xiaojun Zhang
Abstract: Feature alignment is the primary means of fusing multimodal data. We propose a feature alignment method that fully fuses multimodal information, which stepwise shifts and expands feature information from different modalities to have a consistent representation in a feature space. The proposed method can robustly capture high-level interactions between features of different modalities, thus significantly improving the performance of multimodal learning. We also show that the proposed method outperforms other popular multimodal schemes on multiple tasks. Experimental evaluation of ETT and MIT-BIH-Arrhythmia, datasets shows that the proposed method achieves state of the art performance.
Authors: Chong Zhang, Mingyu Jin, Qinkai Yu, Haochen Xue, Shreyank N Gowda, Xiaobo Jin
Abstract: Generalized zero-shot learning (GZSL) aims to recognize samples from both seen and unseen classes using only seen class samples for training. However, GZSL methods are prone to bias towards seen classes during inference due to the projection function being learned from seen classes. Most methods focus on learning an accurate projection, but bias in the projection is inevitable. We address this projection bias by proposing to learn a parameterized Mahalanobis distance metric for robust inference. Our key insight is that the distance computation during inference is critical, even with a biased projection. We make two main contributions - (1) We extend the VAEGAN (Variational Autoencoder \& Generative Adversarial Networks) architecture with two branches to separately output the projection of samples from seen and unseen classes, enabling more robust distance learning. (2) We introduce a novel loss function to optimize the Mahalanobis distance representation and reduce projection bias. Extensive experiments on four datasets show that our approach outperforms state-of-the-art GZSL techniques with improvements of up to 3.5 \% on the harmonic mean metric.
Authors: Xiangyu Chen, Zheyuan Li, Zhengwen Zhang, Jimmy S. Ren, Yihao Liu, Jingwen He, Yu Qiao, Jiantao Zhou, Chao Dong
Abstract: Modern displays can render video content with high dynamic range (HDR) and wide color gamut (WCG). However, most resources are still in standard dynamic range (SDR). Therefore, transforming existing SDR content into the HDRTV standard holds significant value. This paper defines and analyzes the SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content. Our findings reveal that a naive endto-end supervised training approach suffers from severe gamut transition errors. To address this, we propose a new three-step solution called HDRTVNet++, which includes adaptive global color mapping, local enhancement, and highlight refinement. The adaptive global color mapping step utilizes global statistics for image-adaptive color adjustments. A local enhancement network further enhances details, and the two sub-networks are combined as a generator to achieve highlight consistency through GANbased joint training. Designed for ultra-high-definition TV content, our method is both effective and lightweight for processing 4K resolution images. We also constructed a dataset using HDR videos in the HDR10 standard, named HDRTV1K, containing 1235 training and 117 testing images, all in 4K resolution. Additionally, we employ five metrics to evaluate SDRTV-to-HDRTV performance. Our results demonstrate state-of-the-art performance both quantitatively and visually. The codes and models are available at https://github.com/xiaom233/HDRTVNet-plus.
Authors: Jiepan Li, Fangxiao Lu, Nan Xue, Zhuohong Li, Hongyan Zhang, Wei He
Abstract: Camouflaged objects adaptively fit their color and texture with the environment, which makes them indistinguishable from the surroundings. Current methods revealed that high-level semantic features can highlight the differences between camouflaged objects and the backgrounds. Consequently, they integrate high-level semantic features with low-level detailed features for accurate camouflaged object detection (COD). Unlike previous designs for multi-level feature fusion, we state that enhancing low-level features is more impending for COD. In this paper, we propose an overlapped window cross-level attention (OWinCA) to achieve the low-level feature enhancement guided by the highest-level features. By sliding an aligned window pair on both the highest- and low-level feature maps, the high-level semantics are explicitly integrated into the low-level details via cross-level attention. Additionally, it employs an overlapped window partition strategy to alleviate the incoherence among windows, which prevents the loss of global information. These adoptions enable the proposed OWinCA to enhance low-level features by promoting the separability of camouflaged objects. The associated proposed OWinCANet fuses these enhanced multi-level features by simple convolution operation to achieve the final COD. Experiments conducted on three large-scale COD datasets demonstrate that our OWinCANet significantly surpasses the current state-of-the-art COD methods.
Authors: Niki Amini-Naieni, Tomas Jakab, Andrea Vedaldi, Ronald Clark
Abstract: Although Neural Radiance Fields (NeRFs) have markedly improved novel view synthesis, accurate uncertainty quantification in their image predictions remains an open problem. The prevailing methods for estimating uncertainty, including the state-of-the-art Density-aware NeRF Ensembles (DANE) [29], quantify uncertainty without calibration. This frequently leads to over- or under-confidence in image predictions, which can undermine their real-world applications. In this paper, we propose a method which, for the first time, achieves calibrated uncertainties for NeRFs. To accomplish this, we overcome a significant challenge in adapting existing calibration techniques to NeRFs: a need to hold out ground truth images from the target scene, reducing the number of images left to train the NeRF. This issue is particularly problematic in sparse-view settings, where we can operate with as few as three images. To address this, we introduce the concept of a meta-calibrator that performs uncertainty calibration for NeRFs with a single forward pass without the need for holding out any images from the target scene. Our meta-calibrator is a neural network that takes as input the NeRF images and uncalibrated uncertainty maps and outputs a scene-specific calibration curve that corrects the NeRF's uncalibrated uncertainties. We show that the meta-calibrator can generalize on unseen scenes and achieves well-calibrated and state-of-the-art uncertainty for NeRFs, significantly beating DANE and other approaches. This opens opportunities to improve applications that rely on accurate NeRF uncertainty estimates such as next-best view planning and potentially more trustworthy image reconstruction for medical diagnosis. The code is available at https://niki-amini-naieni.github.io/instantcalibration.github.io/.
URLs: https://niki-amini-naieni.github.io/instantcalibration.github.io/.
Authors: Peng Dai, Feitong Tan, Xin Yu, Yifan Peng, Yinda Zhang, Xiaojuan Qi
Abstract: Virtual environments (VEs) are pivotal for virtual, augmented, and mixed reality systems. Despite advances in 3D generation and reconstruction, the direct creation of 3D objects within an established 3D scene (represented as NeRF) for novel VE creation remains a relatively unexplored domain. This process is complex, requiring not only the generation of high-quality 3D objects but also their seamless integration into the existing scene. To this end, we propose a novel pipeline featuring an intuitive interface, dubbed GO-NeRF. Our approach takes text prompts and user-specified regions as inputs and leverages the scene context to generate 3D objects within the scene. We employ a compositional rendering formulation that effectively integrates the generated 3D objects into the scene, utilizing optimized 3D-aware opacity maps to avoid unintended modifications to the original scene. Furthermore, we develop tailored optimization objectives and training strategies to enhance the model's ability to capture scene context and mitigate artifacts, such as floaters, that may occur while optimizing 3D objects within the scene. Extensive experiments conducted on both forward-facing and 360o scenes demonstrate the superior performance of our proposed method in generating objects that harmonize with surrounding scenes and synthesizing high-quality novel view images. We are committed to making our code publicly available.
Authors: Haixin Zhang, Dong Huang
Abstract: Previous contrastive deep clustering methods mostly focus on instance-level information while overlooking the member relationship within groups/clusters, which may significantly undermine their representation learning and clustering capability. Recently, some group-contrastive methods have been developed, which, however, typically rely on the samples of the entire dataset to obtain pseudo labels and lack the ability to efficiently update the group assignments in a batch-wise manner. To tackle these critical issues, we present a novel end-to-end deep clustering framework with dynamic grouping and prototype aggregation, termed as DigPro. Specifically, the proposed dynamic grouping extends contrastive learning from instance-level to group-level, which is effective and efficient for timely updating groups. Meanwhile, we perform contrastive learning on prototypes in a spherical feature space, termed as prototype aggregation, which aims to maximize the inter-cluster distance. Notably, with an expectation-maximization framework, DigPro simultaneously takes advantage of compact intra-cluster connections, well-separated clusters, and efficient group updating during the self-supervised training. Extensive experiments on six image benchmarks demonstrate the superior performance of our approach over the state-of-the-art. Code is available at https://github.com/Regan-Zhang/DigPro.
Authors: Hai-Xin Zhang, Dong Huang
Abstract: Deep clustering has gained significant attention due to its capability in learning clustering-friendly representations without labeled data. However, previous deep clustering methods tend to treat all samples equally, which neglect the variance in the latent distribution and the varying difficulty in classifying or clustering different samples. To address this, this paper proposes a novel end-to-end deep clustering method with diffused sampling and hardness-aware self-distillation (HaDis). Specifically, we first align one view of instances with another view via diffused sampling alignment (DSA), which helps improve the intra-cluster compactness. To alleviate the sampling bias, we present the hardness-aware self-distillation (HSD) mechanism to mine the hardest positive and negative samples and adaptively adjust their weights in a self-distillation fashion, which is able to deal with the potential imbalance in sample contributions during optimization. Further, the prototypical contrastive learning is incorporated to simultaneously enhance the inter-cluster separability and intra-cluster compactness. Experimental results on five challenging image datasets demonstrate the superior clustering performance of our HaDis method over the state-of-the-art. Source code is available at https://github.com/Regan-Zhang/HaDis.
Authors: Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen
Abstract: Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just $10k$ steps, with virtually no additional inference time.
Authors: Sarah M\"uller, Lisa M. Koch, Hendrik P. A. Lensch, Philipp Berens
Abstract: Retinal fundus images play a crucial role in the early detection of eye diseases. However, the impact of technical factors on these images can pose challenges for reliable AI applications in ophthalmology. For example, large fundus cohorts are often confounded by factors like camera type, bearing the risk of learning shortcuts rather than the causal relationships behind the image generation process. Here, we introduce a population model for retinal fundus images that effectively disentangles patient attributes from camera effects, enabling controllable and highly realistic image generation. To achieve this, we propose a disentanglement loss based on distance correlation. Through qualitative and quantitative analyses, we show that our models encode desired information in disentangled subspaces and enable controllable image generation based on the learned subspaces, demonstrating the effectiveness of our disentanglement loss. The project's code is publicly available: https://github.com/berenslab/disentangling-retinal-images.
URLs: https://github.com/berenslab/disentangling-retinal-images.
Authors: Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Xiaojun Wan
Abstract: News image captioning requires model to generate an informative caption rich in entities, with the news image and the associated news article. Current MLLMs still bear limitations in handling entity information in news image captioning tasks. Besides, generating high-quality news image captions requires a trade-off between sufficiency and conciseness of textual input information. To explore the potential of MLLMs and address problems we discovered, we propose EAMA: an Entity-Aware Multimodal Alignment based approach for News Image Captioning. Our approach first aligns the MLLM with two extra alignment tasks: Entity-Aware Sentence Selection task and Entity Selection task, together with News Image Captioning task. The aligned MLLM will utilize the additional entity-related information extracted by itself to supplement the textual input while generating news image captions. Our approach achieves better results than all previous models on two mainstream news image captioning datasets.
Authors: Bhishma Dedhia, Niraj K. Jha
Abstract: Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is an XML-like schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured objective that reasons over the intermodal alignment. We show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. Finally, we investigate the reasoning abilities of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.
Authors: Ali Youssef, Francisco Vasconcelos
Abstract: Feature point detection and description is the backbone for various computer vision applications, such as Structure-from-Motion, visual SLAM, and visual place recognition. While learning-based methods have surpassed traditional handcrafted techniques, their training often relies on simplistic homography-based simulations of multi-view perspectives, limiting model generalisability. This paper presents a novel approach leveraging Neural Radiance Fields (NeRFs) to generate a diverse and realistic dataset consisting of indoor and outdoor scenes. Our proposed methodology adapts state-of-the-art feature detectors and descriptors for training on multi-view NeRF-synthesised data, with supervision achieved through perspective projective geometry. Experiments demonstrate that the proposed methodology achieves competitive or superior performance on standard benchmarks for relative pose estimation, point cloud registration, and homography estimation while requiring significantly less training data and time compared to existing approaches.
Authors: Yonggan Fu, Huaizhi Qu, Zhifan Ye, Chaojian Li, Kevin Zhao, Yingyan Celine Lin
Abstract: Recent breakthroughs in Neural Radiance Fields (NeRFs) have sparked significant demand for their integration into real-world 3D applications. However, the varied functionalities required by different 3D applications often necessitate diverse NeRF models with various pipelines, leading to tedious NeRF training for each target task and cumbersome trial-and-error experiments. Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. We achieve this by proposing a framework called Omni-Recon, which is capable of (1) generalizable 3D reconstruction and zero-shot multitask scene understanding, and (2) adaptability to diverse downstream 3D applications such as real-time rendering and scene editing. Our key insight is that an image-based rendering pipeline, with accurate geometry and appearance estimation, can lift 2D image features into their 3D counterparts, thus extending widely explored 2D tasks to the 3D world in a generalizable manner. Specifically, our Omni-Recon features a general-purpose NeRF model using image-based rendering with two decoupled branches: one complex transformer-based branch that progressively fuses geometry and appearance features for accurate geometry estimation, and one lightweight branch for predicting blending weights of source views. This design achieves state-of-the-art (SOTA) generalizable 3D surface reconstruction quality with blending weights reusable across diverse tasks for zero-shot multitask scene understanding. In addition, it can enable real-time rendering after baking the complex geometry branch into meshes, swift adaptation to achieve SOTA generalizable 3D understanding performance, and seamless integration with 2D diffusion models for text-guided 3D editing.
Authors: Giulia Rizzoli, Matteo Caligiuri, Donald Shenaj, Francesco Barbato, Pietro Zanuttigh
Abstract: In Federated Learning (FL), multiple clients collaboratively train a global model without sharing private data. In semantic segmentation, the Federated source Free Domain Adaptation (FFreeDA) setting is of particular interest, where clients undergo unsupervised training after supervised pretraining at the server side. While few recent works address FL for autonomous vehicles, intrinsic real-world challenges such as the presence of adverse weather conditions and the existence of different autonomous agents are still unexplored. To bridge this gap, we address both problems and introduce a new federated semantic segmentation setting where both car and drone clients co-exist and collaborate. Specifically, we propose a novel approach for this setting which exploits a batch-norm weather-aware strategy to dynamically adapt the model to the different weather conditions, while hyperbolic space prototypes are used to align the heterogeneous client representations. Finally, we introduce FLYAWARE, the first semantic segmentation dataset with adverse weather data for aerial vehicles.
Authors: Matteo Caligiuri, Adriano Simonetto, Pietro Zanuttigh
Abstract: The acquisition of objects outside the Line-of-Sight of cameras is a very intriguing but also extremely challenging research topic. Recent works showed the feasibility of this idea exploiting transient imaging data produced by custom direct Time of Flight sensors. In this paper, for the first time, we tackle this problem using only data from an off-the-shelf indirect Time of Flight sensor without any further hardware requirement. We introduced a Deep Learning model able to re-frame the surfaces where light bounces happen as a virtual mirror. This modeling makes the task easier to handle and also facilitates the construction of annotated training data. From the obtained data it is possible to retrieve the depth information of the hidden scene. We also provide a first-in-its-kind synthetic dataset for the task and demonstrate the feasibility of the proposed idea over it.
Authors: Gregory Sech, Giulio Poggi, Marina Ljubenovic, Marco Fiorucci, Arianna Traviglia
Abstract: Hyperspectral data recorded from satellite platforms are often ill-suited for geo-archaeological prospection due to low spatial resolution. The established potential of hyperspectral data from airborne sensors in identifying archaeological features has, on the other side, generated increased interest in enhancing hyperspectral data to achieve higher spatial resolution. This improvement is crucial for detecting traces linked to sub-surface geo-archaeological features and can make satellite hyperspectral acquisitions more suitable for archaeological research. This research assesses the usability of pansharpened PRISMA satellite products in geo-archaeological prospections. Three pan-sharpening methods (GSA, MTF-GLP and HySure) are compared quantitatively and qualitatively and tested over the archaeological landscape of Aquileia (Italy). The results suggest that the application of pansharpening techniques makes hyperspectral satellite imagery highly suitable, under certain conditions, to the identification of sub-surface archaeological features of small and large size.
Authors: Raveerat Jaturapitpornchai, Giulio Poggi, Gregory Sech, Ziga Kokalj, Marco Fiorucci, Arianna Traviglia
Abstract: Deep learning methods in LiDAR-based archaeological research often leverage visualisation techniques derived from Digital Elevation Models to enhance characteristics of archaeological objects present in the images. This paper investigates the impact of visualisations on deep learning performance through a comprehensive testing framework. The study involves the use of eight semantic segmentation models to evaluate seven diverse visualisations across two study areas, encompassing five archaeological classes. Experimental results reveal that the choice of appropriate visualisations can influence performance by up to 8%. Yet, pinpointing one visualisation that outperforms the others in segmenting all archaeological classes proves challenging. The observed performance variation, reaching up to 25% across different model configurations, underscores the importance of thoughtfully selecting model configurations and LiDAR visualisations for successfully segmenting archaeological objects.
Authors: Samyak Rawlekar, Shubhang Bhatnagar, Vishnuvardhan Pogunulu Srinivasulu, Narendra Ahuja
Abstract: Multi-label Recognition (MLR) involves the identification of multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large text-images datasets for the task. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between a pair of classes. We propose a framework to extend the independent classifiers by incorporating the co-occurrence information for object pairs to improve the performance of independent classifiers. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes, by refining the initial estimates derived from image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.
Authors: Sifan Zhou, Zhihang Yuan, Dawei Yang, Ziyu Zhao, Xing Hu, Yuguang Shi, Xiaobo Lu, Qiang Wu
Abstract: Real-time and high-performance 3D object detection plays a critical role in autonomous driving and robotics. Recent pillar-based 3D object detectors have gained significant attention due to their compact representation and low computational overhead, making them suitable for onboard deployment and quantization. However, existing pillar-based detectors still suffer from information loss along height dimension and large numerical distribution difference during pillar feature encoding (PFE), which severely limits their performance and quantization potential. To address above issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance. Motivated by this observation, we propose a height-aware pillar feature encoder, called PillarHist. Specifically, PillarHist statistics the discrete distribution of points at different heights within one pillar with the information entropy guidance. This simple yet effective design greatly preserves the information along the height dimension while significantly reducing the computation overhead of the PFE. Meanwhile, PillarHist also constrains the arithmetic distribution of PFE input to a stable range, making it quantization-friendly. Notably, PillarHist operates exclusively within the PFE stage to enhance performance, enabling seamless integration into existing pillar-based methods without introducing complex operations. Extensive experiments show the effectiveness of PillarHist in terms of both efficiency and performance.
Authors: Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, Richard Newcombe
Abstract: We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body ground-truth motion; b) multiple multimodal egocentric data from Project Aria devices with videos, eye tracking, IMUs and etc; and c) a third-person perspective by an additional observer. All devices are precisely synchronized and localized in on metric 3D world. We derive hierarchical protocol to add in-context language descriptions of human motion, from fine-grain motion narration, to simplified atomic action and high-level activity summarization. To the best of our knowledge, Nymeria dataset is the world's largest collection of human motion in the wild; first of its kind to provide synchronized and localized multi-device multimodal egocentric data; and the world's largest motion-language dataset. It provides 300 hours of daily activities from 264 participants across 50 locations, total travelling distance over 399Km. The language descriptions contain 301.5K sentences in 8.64M words from a vocabulary size of 6545. To demonstrate the potential of the dataset, we evaluate several SOTA algorithms for egocentric body tracking, motion synthesis, and action recognition. Data and code are open-sourced for research (c.f. https://www.projectaria.com/datasets/nymeria).
Authors: Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt
Abstract: Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. Our data and code will be released at https://github.com/mlfoundations/MINT-1T.
Authors: Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, Qi Wu
Abstract: Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction following robotic navigation. Such a trend underscores the potential of LLMs to generalize navigational reasoning and diverse language understanding. However, a significant discrepancy in agent performance is observed when integrating LLMs in the Vision-and-Language navigation (VLN) tasks compared to previous downstream specialist models. Furthermore, the inherent capacity of language to interpret and facilitate communication in agent interactions is often underutilized in these integrations. In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. By aligning visual content in a frozen LLM, we encompass visual observation comprehension for LLMs and exploit a way to incorporate LLMs and navigation policy networks for effective action predictions and navigational reasoning. We demonstrate the data efficiency of the proposed methods and eliminate the gap between LM-based agents and state-of-the-art VLN specialists.
Authors: Nicolas Richet, Soufiane Belharbi, Haseeb Aslam, Meike Emilie Schadt, Manuela Gonz\'alez-Gonz\'alez, Gustave Cortal, Alessandro Lameiras Koerich, Marco Pedersoli, Alain Finkel, Simon Bacon, Eric Granger
Abstract: Systems for multimodal emotion recognition (ER) are commonly trained to extract features from different modalities (e.g., visual, audio, and textual) that are combined to predict individual basic emotions. However, compound emotions often occur in real-world scenarios, and the uncertainty of recognizing such complex emotions over diverse modalities is challenging for feature-based models. As an alternative, emerging large language models (LLMs) like BERT and LLaMA can rely on explicit non-verbal cues that may be translated from different non-textual modalities (e.g., audio and visual) into text. Textualization of modalities augments data with emotional cues to help the LLM encode the interconnections between all modalities in a shared text space. In such text-based models, prior knowledge of ER tasks is leveraged to textualize relevant non-verbal cues such as audio tone from vocal expressions, and action unit intensity from facial expressions. Since the pre-trained weights are publicly available for many LLMs, training on large-scale datasets is unnecessary, allowing to fine-tune for downstream tasks such as compound ER (CER). This paper compares the potential of text- and feature-based approaches for compound multimodal ER in videos. Experiments were conducted on the challenging C-EXPR-DB dataset in the wild for CER, and contrasted with results on the MELD dataset for basic ER. Our results indicate that multimodal textualization provides lower accuracy than feature-based models on C-EXPR-DB, where text transcripts are captured in the wild. However, higher accuracy can be achieved when the video data has rich transcripts. Our code is available.
Authors: Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao
Abstract: Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper poses a particular emphasis on the impact of multi-images on meme captioning. After that, we introduce the \textsc{XMeCap} framework, a novel approach that adopts supervised fine-tuning and reinforcement learning based on an innovative reward model, which factors in both global and local similarities between visuals and text. Our results, benchmarked against contemporary models, manifest a marked improvement in caption generation for both single-image and multi-image memes, as well as different meme categories. \textsc{XMeCap} achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71\% and 4.82\%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines in understanding and generating humor in a multi-modal setting.
Authors: Yui Lo, Yuqian Chen, Fan Zhang, Dongnan Liu, Leo Zekelman, Suheyla Cetin-Karayumak, Yogesh Rathi, Weidong Cai, Lauren J. O'Donnell
Abstract: Parcellation of white matter tractography provides anatomical features for disease prediction, anatomical tract segmentation, surgical brain mapping, and non-imaging phenotype classifications. However, parcellation does not always reach 100\% accuracy due to various factors, including inter-individual anatomical variability and the quality of neuroimaging scan data. The failure to identify parcels causes a problem of missing microstructure data values, which is especially challenging for downstream tasks that analyze large brain datasets. In this work, we propose a novel deep-learning model to impute tissue microstructure: the White Matter Geometry-guided Diffusion (WMG-Diff) model. Specifically, we first propose a deep score-based guided diffusion model to impute tissue microstructure for diffusion magnetic resonance imaging (dMRI) tractography fiber clusters. Second, we propose a white matter atlas geometric relationship-guided denoising function to guide the reverse denoising process at the subject-specific level. Third, we train and evaluate our model on a large dataset with 9342 subjects. Comprehensive experiments for tissue microstructure imputation and a downstream non-imaging phenotype prediction task demonstrate that our proposed WMG-Diff outperforms the compared state-of-the-art methods in both error and accuracy metrics. Our code will be available at: https://github.com/SlicerDMRI/WMG-Diff.
Authors: Minxuan Zhou, Hao Liang, Tianpeng Li, Zhiyu Wu, Mingan Lin, Linzhuang Sun, Yaqi Zhou, Yan Zhang, Xiaoqin Huang, Yicong Chen, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou
Abstract: With the development of Multimodal Large Language Models (MLLMs), the evaluation of multimodal models in the context of mathematical problems has become a valuable research field. Multimodal visual-textual mathematical reasoning serves as a critical indicator for evaluating the comprehension and complex multi-step quantitative reasoning abilities of MLLMs. However, previous multimodal math benchmarks have not sufficiently integrated visual and textual information. To address this gap, we proposed MathScape, a new benchmark that emphasizes the understanding and application of combined visual and textual information. MathScape is designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs through a categorical hierarchical approach. We conduct a multi-dimensional evaluation on 11 advanced MLLMs, revealing that our benchmark is challenging even for the most sophisticated models. By analyzing the evaluation results, we identify the limitations of MLLMs, offering valuable insights for enhancing model performance.
Authors: Asifullah Khan, Anabia Sohail, Mustansar Fiaz, Mehdi Hassan, Tariq Habib Afridi, Sibghat Ullah Marwat, Farzeen Munir, Safdar Ali, Hannan Naseem, Muhammad Zaigham Zaheer, Kamran Ali, Tangina Sultana, Ziaurrehman Tanoli, Naeem Akhter
Abstract: Deep supervised learning models require high volume of labeled data to attain sufficiently good results. Although, the practice of gathering and annotating such big data is costly and laborious. Recently, the application of self supervised learning (SSL) in vision tasks has gained significant attention. The intuition behind SSL is to exploit the synchronous relationships within the data as a form of self-supervision, which can be versatile. In the current big data era, most of the data is unlabeled, and the success of SSL thus relies in finding ways to utilize this vast amount of unlabeled data available. Thus it is better for deep learning algorithms to reduce reliance on human supervision and instead focus on self-supervision based on the inherent relationships within the data. With the advent of ViTs, which have achieved remarkable results in computer vision, it is crucial to explore and understand the various SSL mechanisms employed for training these models specifically in scenarios where there is limited labelled data available. In this survey, we develop a comprehensive taxonomy of systematically classifying the SSL techniques based upon their representations and pre-training tasks being applied. Additionally, we discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field. Furthermore, we present a comparative analysis of different SSL methods, evaluate their strengths and limitations, and identify potential avenues for future research.
Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai
Abstract: Existing research often treats long-form videos as extended short videos, leading to several limitations: inadequate capture of long-range dependencies, inefficient processing of redundant information, and failure to extract high-level semantic concepts. To address these issues, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels, overcoming the challenge of long-range dependencies. Second, we propose a Semantics ReTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. This addresses the issues of redundancy and lack of high-level concept extraction. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
Authors: Zhen Zhou, Yunkai Ma, Junfeng Fan, Shaolin Zhang, Fengshui Jing, Min Tan
Abstract: Panoptic 3D reconstruction from a monocular video is a fundamental perceptual task in robotic scene understanding. However, existing efforts suffer from inefficiency in terms of inference speed and accuracy, limiting their practical applicability. We present EPRecon, an efficient real-time panoptic 3D reconstruction framework. Current volumetric-based reconstruction methods usually utilize multi-view depth map fusion to obtain scene depth priors, which is time-consuming and poses challenges to real-time scene reconstruction. To address this issue, we propose a lightweight module to directly estimate scene depth priors in a 3D volume for reconstruction quality improvement by generating occupancy probabilities of all voxels. In addition, compared with existing panoptic segmentation methods, EPRecon extracts panoptic features from both voxel features and corresponding image features, obtaining more detailed and comprehensive instance-level semantic information and achieving more accurate segmentation results. Experimental results on the ScanNetV2 dataset demonstrate the superiority of EPRecon over current state-of-the-art methods in terms of both panoptic 3D reconstruction quality and real-time inference. Code is available at https://github.com/zhen6618/EPRecon.
Authors: Shuang Song, Rongjun Qin
Abstract: Modeling outdoor scenes for the synthetic 3D environment requires the recovery of reflectance/albedo information from raw images, which is an ill-posed problem due to the complicated unmodeled physics in this process (e.g., indirect lighting, volume scattering, specular reflection). The problem remains unsolved in a practical context. The recovered albedo can facilitate model relighting and shading, which can further enhance the realism of rendered models and the applications of digital twins. Typically, photogrammetric 3D models simply take the source images as texture materials, which inherently embed unwanted lighting artifacts (at the time of capture) into the texture. Therefore, these polluted textures are suboptimal for a synthetic environment to enable realistic rendering. In addition, these embedded environmental lightings further bring challenges to photo-consistencies across different images that cause image-matching uncertainties. This paper presents a general image formation model for albedo recovery from typical aerial photogrammetric images under natural illuminations and derives the inverse model to resolve the albedo information through inverse rendering intrinsic image decomposition. Our approach builds on the fact that both the sun illumination and scene geometry are estimable in aerial photogrammetry, thus they can provide direct inputs for this ill-posed problem. This physics-based approach does not require additional input other than data acquired through the typical drone-based photogrammetric collection and was shown to favorably outperform existing approaches. We also demonstrate that the recovered albedo image can in turn improve typical image processing tasks in photogrammetry such as feature and dense matching, edge, and line extraction.
Authors: Haixin Zhang, Yongjun Li, Dong Huang
Abstract: Image clustering, which involves grouping images into different clusters without labels, is a key task in unsupervised learning. Although previous deep clustering methods have achieved remarkable results, they only explore the intrinsic information of the image itself but overlook external supervision knowledge to improve the semantic understanding of images. Recently, visual-language pre-trained model on large-scale datasets have been used in various downstream tasks and have achieved great results. However, there is a gap between visual representation learning and textual semantic learning, and how to properly utilize the representation of two different modalities for clustering is still a big challenge. To tackle the challenges, we propose a novel image clustering framwork, named Dual-level Cross-Modal Contrastive Clustering (DXMC). Firstly, external textual information is introduced for constructing a semantic space which is adopted to generate image-text pairs. Secondly, the image-text pairs are respectively sent to pre-trained image and text encoder to obtain image and text embeddings which subsquently are fed into four well-designed networks. Thirdly, dual-level cross-modal contrastive learning is conducted between discriminative representations of different modalities and distinct level. Extensive experimental results on five benchmark datasets demonstrate the superiority of our proposed method.
Authors: King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J. H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang
Abstract: Multimodal Large Language Models (MLLMs) are measured on numerous benchmarks like image captioning, visual question answering, and reasoning. However, these benchmarks often include overly simple or uninformative samples, making it difficult to effectively distinguish the performance of different MLLMs. Additionally, evaluating models across many benchmarks creates a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated using a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that require image-based understanding. Our experiments show that LIME reduces the number of samples by 76% and evaluation time by 77%, while more effectively distinguishing between models. Notably, we find that traditional automatic metrics like CIDEr are insufficient for evaluating MLLMs' captioning performance, and excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://github.com/kangreen0210/LIME
Authors: Xinlei Huang, Jialiang Tang, Xubin Zheng, Jinjia Zhou, Wenxin Yu, Ning Jiang
Abstract: Knowledge Distillation (KD) transfers knowledge from a large pre-trained teacher network to a compact and efficient student network, making it suitable for deployment on resource-limited media terminals. However, traditional KD methods require balanced data to ensure robust training, which is often unavailable in practical applications. In such scenarios, a few head categories occupy a substantial proportion of examples. This imbalance biases the trained teacher network towards the head categories, resulting in severe performance degradation on the less represented tail categories for both the teacher and student networks. In this paper, we propose a novel framework called Knowledge Rectification Distillation (KRDistill) to address the imbalanced knowledge inherited in the teacher network through the incorporation of the balanced category priors. Furthermore, we rectify the biased predictions produced by the teacher network, particularly focusing on the tail categories. Consequently, the teacher network can provide balanced and accurate knowledge to train a reliable student network. Intensive experiments conducted on various long-tailed datasets demonstrate that our KRDistill can effectively train reliable student networks in realistic scenarios of data imbalance.
Authors: Ali Mehrabian, Parsa Mojarad Adi, Moein Heidari, Ilker Hacihaliloglu
Abstract: Implicit neural representations (INRs) use neural networks to provide continuous and resolution-independent representations of complex signals with a small number of parameters. However, existing INR models often fail to capture important frequency components specific to each task. To address this issue, in this paper, we propose a Fourier Kolmogorov Arnold network (FKAN) for INRs. The proposed FKAN utilizes learnable activation functions modeled as Fourier series in the first layer to effectively control and learn the task-specific frequency components. In addition, the activation functions with learnable Fourier coefficients improve the ability of the network to capture complex patterns and details, which is beneficial for high-resolution and high-dimensional data. Experimental results show that our proposed FKAN model outperforms three state-of-the-art baseline schemes, and improves the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) for the image representation task and intersection over union (IoU) for the 3D occupancy volume representation task, respectively.
Authors: Dawei Yan, Pengcheng Li, Yang Li, Hao Chen, Qingguo Chen, Weihua Luo, Wei Dong, Qingsen Yan, Haokui Zhang, Chunhua Shen
Abstract: Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in generating better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our propsoed method can bring more benefits to the baseline (LLaVA-1.5) compared with other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.
Authors: Matej Fabijani\'c, Nadir Kapetanovi\'c, Nikola Mi\v{s}kovi\'c
Abstract: The process of fish cage inspections, which is a necessary maintenance task at any fish farm, be it small scale or industrial, is a task that has the potential to be fully automated. Replacing trained divers who perform regular inspections with autonomous marine vehicles would lower the costs of manpower and remove the risks associated with humans performing underwater inspections. Achieving such a level of autonomy implies developing an image processing algorithm that is capable of estimating the state of biofouling buildup. The aim of this work is to propose a complete solution for automating the said inspection process; from developing an autonomous control algorithm for an ROV, to automatically segmenting images of fish cages, and accurately estimating the state of biofouling. The first part is achieved by modifying a commercially available ROV with an acoustic SBL positioning system and developing a closed-loop control system. The second part is realized by implementing a proposed biofouling estimation framework, which relies on AI to perform image segmentation, and by processing images using established computer vision methods to obtain a rough estimate of the distance of the ROV from the fish cage. This also involved developing a labeling tool in order to create a dataset of images for the neural network performing the semantic segmentation to be trained on. The experimental results show the viability of using an ROV fitted with an acoustic transponder for autonomous missions, and demonstrate the biofouling estimation framework's ability to provide accurate assessments, alongside satisfactory distance estimation capabilities. In conclusion, the achieved biofouling estimation accuracy showcases clear potential for use in the aquaculture industry.
Authors: Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani Alomari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas
Abstract: Existing vision-language understanding benchmarks largely consist of images of objects in their usual contexts. As a consequence, recent multimodal large language models can perform well with only a shallow visual understanding by relying on background language biases. Thus, strong performance on these benchmarks does not necessarily correlate with strong visual understanding. In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors. Unlike existing benchmarks, JourneyBench explicitly requires fine-grained multimodal reasoning in unusual imaginary scenarios where language bias and holistic image gist are insufficient. We benchmark state-of-the-art models on JourneyBench and analyze performance along a number of fine-grained dimensions. Results across all five tasks show that JourneyBench is exceptionally challenging for even the best models, indicating that models' visual reasoning abilities are not as strong as they first appear. We discuss the implications of our findings and propose avenues for further research.
Authors: Timilehin B. Aderinola, Tee Connie, Thian Song Ong, Andrew Beng Jin Teoh, Michael Kah Ong Goh
Abstract: Deep learning techniques have recently been utilized for model-free age-associated gait feature extraction. However, acquiring model-free gait demands accurate pre-processing such as background subtraction, which is non-trivial in unconstrained environments. On the other hand, model-based gait can be obtained without background subtraction and is less affected by covariates. For model-based gait-based age group classification problems, present works rely solely on handcrafted features, where feature extraction is tedious and requires domain expertise. This paper proposes a deep learning approach to extract age-associated features from model-based gait for age group classification. Specifically, we first develop an unconstrained gait dataset called Multimedia University Gait Age and Gender dataset (MMU GAG). Next, the body joint coordinates are determined via pose estimation algorithms and represented as compact gait graphs via a novel part aggregation scheme. Then, a Part-AdaptIve Residual Graph Convolutional Neural Network (PairGCN) is designed for age-associated feature learning. Experiments suggest that PairGCN features are far more informative than handcrafted features, yielding up to 99% accuracy for classifying subjects as a child, adult, or senior in the MMU GAG dataset. These results suggest the feasibility of deploying Artificial Intelligence-enabled solutions for access control, surveillance, and law enforcement in unconstrained environments.
Authors: Hao Cheng, Jiahang Cao, Erjia Xiao, Mengshu Sun, Renjing Xu
Abstract: Deploying energy-efficient deep learning algorithms on computational-limited devices, such as robots, is still a pressing issue for real-world applications. Spiking Neural Networks (SNNs), a novel brain-inspired algorithm, offer a promising solution due to their low-latency and low-energy properties over traditional Artificial Neural Networks (ANNs). Despite their advantages, the dense structure of deep SNNs can still result in extra energy consumption. The Lottery Ticket Hypothesis (LTH) posits that within dense neural networks, there exist winning Lottery Tickets (LTs), namely sub-networks, that can be obtained without compromising performance. Inspired by this, this paper delves into the spiking-based LTs (SLTs), examining their unique properties and potential for extreme efficiency. Then, two significant sparse \textbf{\textit{Rewards}} are gained through comprehensive explorations and meticulous experiments on SLTs across various dense structures. Moreover, a sparse algorithm tailored for spiking transformer structure, which incorporates convolution operations into the Patch Embedding Projection (ConvPEP) module, has been proposed to achieve Multi-level Sparsity (MultiSp). MultiSp refers to (1) Patch number sparsity; (2) ConvPEP weights sparsity and binarization; and (3) ConvPEP activation layer binarization. Extensive experiments demonstrate that our method achieves extreme sparsity with only a slight performance decrease, paving the way for deploying energy-efficient neural networks in robotics and beyond.
Authors: Selim F. Yilmaz, Xueyan Niu, Bo Bai, Wei Han, Lei Deng, Deniz Gunduz
Abstract: We consider the image transmission problem over a noisy wireless channel via deep learning-based joint source-channel coding (DeepJSCC) along with a denoising diffusion probabilistic model (DDPM) at the receiver. Specifically, we are interested in the perception-distortion trade-off in the practical finite block length regime, in which separate source and channel coding can be highly suboptimal. We introduce a novel scheme, where the conventional DeepJSCC encoder targets transmitting a lower resolution version of the image, which later can be refined thanks to the generative model available at the receiver. In particular, we utilize the range-null space decomposition of the target image; DeepJSCC transmits the range-space of the image, while DDPM progressively refines its null space contents. Through extensive experiments, we demonstrate significant improvements in distortion and perceptual quality of reconstructed images compared to standard DeepJSCC and the state-of-the-art generative learning-based method.
Authors: Kaiming Cui, D. J. Armstrong, Fabo Feng
Abstract: Vast amounts of astronomical photometric data are generated from various projects, requiring significant effort to identify variable stars and other object classes. In light of this, a general, widely applicable classification framework would simplify the process of designing specific classifiers for various astronomical objects. We present a novel deep learning framework for classifying light curves using a weakly supervised object detection model. Our framework identifies the optimal windows for both light curves and power spectra automatically, and zooms in on their corresponding data. This allows for automatic feature extraction from both time and frequency domains, enabling our model to handle data across different scales and sampling intervals. We train our model on data sets obtained from Kepler, TESS, and Zwicky Transient Facility multiband observations of variable stars and transients. We achieve an accuracy of 87% for combined variables and transient events, which is comparable to the performance of previous feature-based models. Our trained model can be utilized directly for other missions, such as the All-sky Automated Survey for Supernovae, without requiring any retraining or fine-tuning. To address known issues with miscalibrated predictive probabilities, we apply conformal prediction to generate robust predictive sets that guarantee true-label coverage with a given probability. Additionally, we incorporate various anomaly detection algorithms to empower our model with the ability to identify out-of-distribution objects. Our framework is implemented in the Deep-LC toolkit, which is an open-source Python package hosted on Github (https://github.com/ckm3/Deep-LC) and PyPI.
Authors: Subhajit Paul, Ashutosh Gupta
Abstract: Digital Elevation Model (DEM) is an essential aspect in the remote sensing domain to analyze and explore different applications related to surface elevation information. In this study, we intend to address the generation of high-resolution DEMs using high-resolution multi-spectral (MX) satellite imagery by incorporating adversarial learning. To promptly regulate this process, we utilize the notion of polarized self-attention of discriminator spatial maps as well as introduce a Densely connected Multi-Residual Block (DMRB) module to assist in efficient gradient flow. Further, we present an objective function related to optimizing Sinkhorn distance with traditional GAN to improve the stability of adversarial learning. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. We demonstrate both qualitative and quantitative outcomes with available state-of-the-art methods. Based on our experiments on DEM datasets of Shuttle Radar Topographic Mission (SRTM) and Cartosat-1, we show that the proposed model performs preferably against other learning-based state-of-the-art methods. We also generate and visualize several high-resolution DEMs covering terrains with diverse signatures to show the performance of our model.
Authors: Luoyu Wang, Yitian Tao, Qing Yang, Yan Liang, Siwei Liu, Hongcheng Shi, Dinggang Shen, Han Zhang
Abstract: Simultaneous functional PET/MR (sf-PET/MR) presents a cutting-edge multimodal neuroimaging technique. It provides an unprecedented opportunity for concurrently monitoring and integrating multifaceted brain networks built by spatiotemporally covaried metabolic activity, neural activity, and cerebral blood flow (perfusion). Albeit high scientific/clinical values, short in hardware accessibility of PET/MR hinders its applications, let alone modern AI-based PET/MR fusion models. Our objective is to develop a clinically feasible AI-based disease diagnosis model trained on comprehensive sf-PET/MR data with the power of, during inferencing, allowing single modality input (e.g., PET only) as well as enforcing multimodal-based accuracy. To this end, we propose MX-ARM, a multimodal MiXture-of-experts Alignment and Reconstruction Model. It is modality detachable and exchangeable, allocating different multi-layer perceptrons dynamically ("mixture of experts") through learnable weights to learn respective representations from different modalities. Such design will not sacrifice model performance in uni-modal situation. To fully exploit the inherent complex and nonlinear relation among modalities while producing fine-grained representations for uni-modal inference, we subsequently add a modal alignment module to line up a dominant modality (e.g., PET) with representations of auxiliary modalities (MR). We further adopt multimodal reconstruction to promote the quality of learned features. Experiments on precious multimodal sf-PET/MR data for Mild Cognitive Impairment diagnosis showcase the efficacy of our model toward clinically feasible precision medicine.
Authors: Hartmut H\"antze, Lina Xu, Maximilian Rattunde, Leonhard Donle, Felix J. Dorfner, Alessa Hering, Lisa C. Adams, Keno K. Bressem
Abstract: Computed tomography (CT) segmentation models often contain classes that are not currently supported by magnetic resonance imaging (MRI) segmentation models. In this study, we show that a simple image inversion technique can significantly improve the segmentation quality of CT segmentation models on MRI data. We demonstrate the feasibility for both a general multi-class and a specific renal carcinoma model for segmenting T1-weighted MRI images. Using this technique, we were able to localize and segment clear cell renal cell carcinoma in T1-weighted MRI scans, using a model that was trained on only CT data. Image inversion is straightforward to implement and does not require dedicated graphics processing units, thus providing a quick alternative to complex deep modality-transfer models. Our results demonstrate that existing CT models, including pathology models, might be transferable to the MRI domain with reasonable effort.
Authors: Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando P\'erez-Garc\'ia, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Noel C. F. Codella, Fabian Falck, Ozan Oktay, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland
Abstract: Radiology reporting is a complex task requiring detailed medical image understanding and precise language generation, for which generative multimodal models offer a promising solution. However, to impact clinical practice, models must achieve a high level of both verifiable performance and utility. We augment the utility of automated report generation by incorporating localisation of individual findings on the image - a task we call grounded report generation - and enhance performance by incorporating realistic reporting context as inputs. We design a novel evaluation framework (RadFact) leveraging the logical inference capabilities of large language models (LLMs) to quantify report correctness and completeness at the level of individual sentences, while supporting the new task of grounded reporting. We develop MAIRA-2, a large radiology-specific multimodal model designed to generate chest X-ray reports with and without grounding. MAIRA-2 achieves state of the art on existing report generation benchmarks and establishes the novel task of grounded report generation.
Authors: Yuke Xing, Qi Yang, Kaifa Yang, Yilin Xu, Zhu Li
Abstract: In recent years, Neural Radiance Fields (NeRF) have demonstrated significant advantages in representing and synthesizing 3D scenes. Explicit NeRF models facilitate the practical NeRF applications with faster rendering speed, and also attract considerable attention in NeRF compression due to its huge storage cost. To address the challenge of the NeRF compression study, in this paper, we construct a new dataset, called Explicit-NeRF-QA. We use 22 3D objects with diverse geometries, textures, and material complexities to train four typical explicit NeRF models across five parameter levels. Lossy compression is introduced during the model generation, pivoting the selection of key parameters such as hash table size for InstantNGP and voxel grid resolution for Plenoxels. By rendering NeRF samples to processed video sequences (PVS), a large scale subjective experiment with lab environment is conducted to collect subjective scores from 21 viewers. The diversity of content, accuracy of mean opinion scores (MOS), and characteristics of NeRF distortion are comprehensively presented, establishing the heterogeneity of the proposed dataset. The state-of-the-art objective metrics are tested in the new dataset. Best Person correlation, which is around 0.85, is collected from the full-reference objective metric. All tested no-reference metrics report very poor results with 0.4 to 0.6 correlations, demonstrating the need for further development of more robust no-reference metrics. The dataset, including NeRF samples, source 3D objects, multiview images for NeRF generation, PVSs, MOS, is made publicly available at the following location: https://github.com/YukeXing/Explicit-NeRF-QA.
Authors: Md Mashrur Arifin, Md Shoaib Ahmed, Tanmai Kumar Ghosh, Ikteder Akhand Udoy, Jun Zhuang, Jyh-haw Yeh
Abstract: With the proliferation of Artificial Intelligence, there has been a massive increase in the amount of data required to be accumulated and disseminated digitally. As the data are available online in digital landscapes with complex and sophisticated infrastructures, it is crucial to implement various defense mechanisms based on cybersecurity. Generative Adversarial Networks (GANs), which are deep learning models, have emerged as powerful solutions for addressing the constantly changing security issues. This survey studies the significance of the deep learning model, precisely on GANs, in strengthening cybersecurity defenses. Our survey aims to explore the various works completed in GANs, such as Intrusion Detection Systems (IDS), Mobile and Network Trespass, BotNet Detection, and Malware Detection. The focus is to examine how GANs can be influential tools to strengthen cybersecurity defenses in these domains. Further, the paper discusses the challenges and constraints of using GANs in these areas and suggests future research directions. Overall, the paper highlights the potential of GANs in enhancing cybersecurity measures and addresses the need for further exploration in this field.
Authors: Aanish Paruchuri, Yunfei Wang, Xiaodan Gu, Arthi Jayaraman
Abstract: In this paper we present a new machine learning workflow with unsupervised learning techniques to identify domains within atomic force microscopy images obtained from polymer films. The goal of the workflow is to identify the spatial location of the two types of polymer domains with little to no manual intervention and calculate the domain size distributions which in turn can help qualify the phase separated state of the material as macrophase or microphase ordered or disordered domains. We briefly review existing approaches used in other fields, computer vision and signal processing that can be applicable for the above tasks that happen frequently in the field of polymer science and engineering. We then test these approaches from computer vision and signal processing on the AFM image dataset to identify the strengths and limitations of each of these approaches for our first task. For our first domain segmentation task, we found that the workflow using discrete Fourier transform or discrete cosine transform with variance statistics as the feature works the best. The popular ResNet50 deep learning approach from computer vision field exhibited relatively poorer performance in the domain segmentation task for our AFM images as compared to the DFT and DCT based workflows. For the second task, for each of 144 input AFM images, we then used an existing porespy python package to calculate the domain size distribution from the output of that image from DFT based workflow. The information and open source codes we share in this paper can serve as a guide for researchers in the polymer and soft materials fields who need ML modeling and workflows for automated analyses of AFM images from polymer samples that may have crystalline or amorphous domains, sharp or rough interfaces between domains, or micro or macrophase separated domains.