Set-Membership Inference Attacks using Data Watermarking. (arXiv:2307.15067v1 [cs.CV])

Authors: Mike Laszkiewicz, Denis Lukovnikov, Johannes Lederer, Asja Fischer

In this work, we propose a set-membership inference attack for generative models using deep image watermarking techniques. In particular, we demonstrate how conditional sampling from a generative model can reveal the watermark that was injected into parts of the training data. Our empirical results demonstrate that the proposed watermarking technique is a principled approach for detecting the non-consensual use of image data in training generative models.
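
As a rough illustration of the attack setting (not the authors' implementation; generate_fn and decode_watermark_fn are hypothetical placeholders), the membership test amounts to checking how often the watermark survives into conditionally generated samples:

    def set_membership_test(generate_fn, decode_watermark_fn,
                            n_samples=256, threshold=0.1):
        """Flag a model as trained on the watermarked images if the watermark
        is decoded from its generated samples unusually often."""
        hits = 0
        for _ in range(n_samples):
            sample = generate_fn()              # conditional sample from the suspect model
            if decode_watermark_fn(sample):     # True if the watermark is recovered
                hits += 1
        return (hits / n_samples) > threshold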

Combining transmission speckle photography and convolutional neural network for determination of fat content in cow's milk -- an exercise in classification of parameters of a complex suspension. (arXiv:2307.15069v1 [cs.CV])

Authors: Kwasi Nyandey (1 and 2), Daniel Jakubczyk (1) ((1) Institute of Physics, Polish Academy of Sciences, Warsaw, Poland (2) Laser and Fibre Optics Centre, Department of Physics, School of Physical Sciences, College of Agriculture and Natural Sciences, University of Cape Coast, Cape Coast, Ghana)

We have combined transmission speckle photography and machine learning for the direct classification and recognition of milk fat content classes. Our approach rests on the fact that the parameters of scattering particles (and the dispersion medium) can be linked to the intensity distribution (speckle) observed when coherent light is transmitted through a scattering medium. For milk, this is primarily the size distribution and concentration of fat globules, which constitute the total fat content. Consequently, we trained a convolutional neural network to recognise and classify laser speckle from different fat content classes (0.5, 1.5, 2.0 and 3.2%). We investigated four exposure-time protocols and obtained the highest performance for shorter exposure times, in which the intensity histograms are kept similar for all images and the most probable intensity in the speckle pattern is close to zero. Our neural network recognised the milk fat content classes unambiguously, with the highest test and independent classification accuracies of 100% and ~99%, respectively. This indicates that the parameters of other complex, realistic suspensions could be classified with similar methods.
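
A minimal sketch of such a classifier (in PyTorch; the architecture, input size, and layer widths are assumptions, not the authors' network) mapping grayscale speckle patches to the four fat-content classes:

    import torch
    import torch.nn as nn

    class SpeckleCNN(nn.Module):
        def __init__(self, num_classes: int = 4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    model = SpeckleCNN()
    dummy = torch.randn(8, 1, 128, 128)          # a batch of speckle patches
    logits = model(dummy)                        # shape: (8, 4)
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,)))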

Writer adaptation for offline text recognition: An exploration of neural network-based methods. (arXiv:2307.15071v1 [cs.CV])

Authors: Tobias van der Werff, Maruf A. Dhali, Lambert Schomaker

Handwriting recognition has seen significant success with the use of deep learning. However, a persistent shortcoming of neural networks is that they are not well-equipped to deal with shifting data distributions. In the field of handwritten text recognition (HTR), this shows itself in poor recognition accuracy for writers that are not similar to those seen during training. An ideal HTR model should be adaptive to new writing styles in order to handle the vast amount of possible writing styles. In this paper, we explore how HTR models can be made writer adaptive by using only a handful of examples from a new writer (e.g., 16 examples) for adaptation. Two HTR architectures are used as base models, using a ResNet backbone along with either an LSTM or Transformer sequence decoder. Using these base models, two methods are considered to make them writer adaptive: 1) model-agnostic meta-learning (MAML), an algorithm commonly used for tasks such as few-shot classification, and 2) writer codes, an idea originating from automatic speech recognition. Results show that an HTR-specific version of MAML known as MetaHTR improves performance compared to the baseline with a 1.4 to 2.0 improvement in word error rate (WER). The improvement due to writer adaptation is between 0.2 and 0.7 WER, where a deeper model seems to lend itself better to adaptation using MetaHTR than a shallower model. However, applying MetaHTR to larger HTR models or sentence-level HTR may become prohibitive due to its high computational and memory requirements. Lastly, writer codes based on learned features or Hinge statistical features did not lead to improved recognition performance.
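
A hedged sketch of the adaptation step (not MetaHTR itself, which also meta-trains the initialization): starting from a trained base model, take a few gradient steps on roughly 16 labelled lines from the new writer before transcribing the rest. Here htr_model, ctc_loss_fn and the data tensors are placeholders.

    import copy
    import torch

    def adapt_to_writer(htr_model, support_images, support_targets,
                        ctc_loss_fn, inner_lr=1e-3, inner_steps=3):
        adapted = copy.deepcopy(htr_model)           # keep the base model intact
        opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            loss = ctc_loss_fn(adapted(support_images), support_targets)
            loss.backward()
            opt.step()
        return adapted                               # writer-specific model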

Cortex Inspired Learning to Recover Damaged Signal Modality with ReD-SOM Model. (arXiv:2307.15095v1 [q-bio.NC])

Authors: Artem Muliukov, Laurent Rodriguez, Benoit Miramond

Recent progress in the fields of AI and cognitive sciences opens up new challenges that were previously inaccessible to study. One such task is recovering lost data of one modality by using data from another. A similar effect (called the McGurk effect) has been observed in the functioning of the human brain: one modality of information interferes with another and changes its perception. In this paper, we propose a way to simulate such an effect and use it to reconstruct lost data modalities by combining Variational Auto-Encoders, Self-Organizing Maps, and Hebbian connections in a unified ReD-SOM (Reentering Deep Self-organizing Map) model. We are inspired by the human capability to recruit different zones of the brain across modalities when information in one modality is lacking. This new approach not only improves the analysis of ambiguous data but also restores the intended signal. The results obtained on a multimodal dataset demonstrate an increase in the quality of signal reconstruction. The effect is remarkable both visually and quantitatively, especially in the presence of significant signal distortion.

Self-Supervised Learning for Improved Synthetic Aperture Sonar Target Recognition. (arXiv:2307.15098v1 [cs.CV])

Authors: BW Sheffield

This study explores the application of self-supervised learning (SSL) for improved target recognition in synthetic aperture sonar (SAS) imagery. The unique challenges of underwater environments make traditional computer vision techniques, which rely heavily on optical camera imagery, less effective. SAS, with its ability to generate high-resolution imagery, emerges as a preferred choice for underwater imaging. However, the voluminous high-resolution SAS data presents a significant challenge for labeling, a crucial step for training deep neural networks (DNNs).

SSL, which enables models to learn features in data without the need for labels, is proposed as a potential solution to the data labeling challenge in SAS. The study evaluates the performance of two prominent SSL algorithms, MoCov2 and BYOL, against the well-regarded supervised learning model, ResNet18, for binary image classification tasks. The findings suggest that while both SSL models can outperform a fully supervised model with access to a small number of labels in a few-shot scenario, they do not exceed it when all the labels are used.

The results underscore the potential of SSL as a viable alternative to traditional supervised learning, capable of maintaining task performance while reducing the time and costs associated with data labeling. The study also contributes to the growing body of evidence supporting the use of SSL in remote sensing and could stimulate further research in this area.
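
For concreteness, the few-shot comparison described above can be approximated by a linear probe on a frozen SSL backbone; the snippet below is an illustrative protocol, not the study's exact setup.

    import torch
    import torch.nn as nn
    import torchvision

    backbone = torchvision.models.resnet18(weights=None)
    backbone.fc = nn.Identity()                 # expose 512-d features
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False                 # SSL features stay frozen

    probe = nn.Linear(512, 2)                   # binary target recognition
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

    def train_step(images, labels):
        with torch.no_grad():
            feats = backbone(images)
        loss = nn.functional.cross_entropy(probe(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()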

Clustering of illustrations by atmosphere using a combination of supervised and unsupervised learning. (arXiv:2307.15099v1 [cs.CV])

Authors: Keisuke Kubota (Doshisha University), Masahiro Okuda (Doshisha University)

The distribution of illustrations on social media, such as Twitter and Pixiv, has increased with the growing popularity of animation, games, and animated movies. The "atmosphere" of an illustration plays an important role in user preferences, and classifying illustrations by atmosphere can be helpful for recommendations and searches. However, assigning clear labels to the elusive "atmosphere" is difficult, and conventional supervised classification is not always practical. Furthermore, even images with similar colors, edges, and other low-level features may not have similar atmospheres, making classification based on low-level features challenging. In this paper, this problem is solved using both supervised and unsupervised learning with pseudo-labels. Feature vectors are obtained using the supervised method with pseudo-labels that contribute to an ambiguous atmosphere, and clustering is then performed on these feature vectors. Experimental analyses show that our method outperforms conventional methods in human-like clustering on datasets manually classified by humans.
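
A minimal sketch of the two-stage pipeline under assumed names (the pseudo-label training itself is omitted): extract features from the pseudo-label-trained network, then cluster them.

    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    def extract_features(encoder: nn.Module, loader) -> torch.Tensor:
        encoder.eval()
        feats = []
        with torch.no_grad():
            for images, _ in loader:       # pseudo-labels were only used for training
                feats.append(encoder(images))
        return torch.cat(feats)

    # `encoder` would be the pseudo-label-trained network with its classifier head
    # removed; the number of clusters is the desired number of "atmosphere" groups.
    # features = extract_features(encoder, illustration_loader)
    # clusters = KMeans(n_clusters=10, n_init=10).fit_predict(features.numpy())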

Detecting Morphing Attacks via Continual Incremental Training. (arXiv:2307.15105v1 [cs.CV])

Authors: Lorenzo Pellegrini, Guido Borghi, Annalisa Franco, Davide Maltoni

Scenarios in which restrictions on data transfer and storage limit the possibility of composing a single dataset -- possibly drawing on different data sources -- for a batch-based training procedure make the development of robust models particularly challenging. We hypothesize that the recent Continual Learning (CL) paradigm may represent an effective solution to enable incremental training, even across multiple sites. Indeed, a basic assumption of CL is that once a model has been trained, old data can no longer be used in successive training iterations and in principle can be deleted. Therefore, in this paper, we investigate the performance of different Continual Learning methods in this scenario, simulating a learning model that is updated every time a new chunk of data, possibly of variable size, becomes available. Experimental results reveal that a particular CL method, namely Learning without Forgetting (LwF), is one of the best-performing algorithms. We then investigate its usage and parametrization in Morphing Attack Detection and Object Classification tasks, specifically with respect to the amount of new training data that becomes available.
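
For reference, the LwF update on a new data chunk combines the usual classification loss with a distillation term toward a frozen copy of the previous model; the temperature and weight below are illustrative choices, not the paper's settings.

    import copy
    import torch
    import torch.nn.functional as F

    def lwf_loss(model, old_model, images, labels, T=2.0, alpha=0.5):
        logits = model(images)
        ce = F.cross_entropy(logits, labels)
        with torch.no_grad():
            old_logits = old_model(images)       # soft targets from the previous model
        distill = F.kl_div(F.log_softmax(logits / T, dim=1),
                           F.softmax(old_logits / T, dim=1),
                           reduction="batchmean") * (T * T)
        return ce + alpha * distill

    # Before training on each new chunk: old_model = copy.deepcopy(model).eval()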

End-to-end Remote Sensing Change Detection of Unregistered Bi-temporal Images for Natural Disasters. (arXiv:2307.15128v1 [cs.CV])

Authors: Guiqin Zhao, Lianlei Shan, Weiqiang Wang

Change detection based on remote sensing images has been a prominent area of interest in the field of remote sensing. Deep networks have demonstrated significant success in detecting changes in bi-temporal remote sensing images and have found applications in various fields. Given the degradation of natural environments and the frequent occurrence of natural disasters, accurately and swiftly identifying damaged buildings in disaster-stricken areas through remote sensing images holds immense significance. This paper aims to investigate change detection specifically for natural disasters. Considering that existing public datasets used in change detection research are registered, which does not align with the practical scenario where bi-temporal images are not matched, this paper introduces an unregistered end-to-end change detection synthetic dataset called xBD-E2ECD. Furthermore, we propose an end-to-end change detection network named E2ECDNet, which takes an unregistered bi-temporal image pair as input and simultaneously generates the flow field prediction result and the change detection prediction result. It is worth noting that our E2ECDNet also supports change detection for registered image pairs, as registration can be seen as a special case of non-registration. Additionally, this paper redefines the criteria for correctly predicting a positive case and introduces neighborhood-based change detection evaluation metrics. The experimental results have demonstrated significant improvements.

Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields. (arXiv:2307.15131v1 [cs.CV])

Authors: Xiangyu Wang, Jingsen Zhu, Qi Ye, Yuchi Huo, Yunlong Ran, Zhihua Zhong, Jiming Chen

With the popularity of implicit neural representations, or neural radiance fields (NeRF), there is a pressing need for editing methods to interact with the implicit 3D models for tasks like post-processing reconstructed scenes and 3D content creation. While previous works have explored NeRF editing from various perspectives, they are restricted in editing flexibility, quality, and speed, failing to offer direct editing response and instant preview. The key challenge is to conceive a locally editable neural representation that can directly reflect the editing instructions and update instantly. To bridge the gap, we propose a new interactive editing method and system for implicit representations, called Seal-3D, which allows users to edit NeRF models at the pixel level in a free manner with a wide range of NeRF-like backbones and to preview the editing effects instantly. To achieve these effects, the challenges are addressed by our proposed proxy function, which maps the editing instructions to the original space of NeRF models, and by a teacher-student training strategy with local pretraining and global finetuning. A NeRF editing system is built to showcase various editing types. Our system can achieve compelling editing effects at an interactive speed of about 1 second.

Online Clustered Codebook. (arXiv:2307.15139v1 [cs.CV])

Authors: Chuanxia Zheng, Andrea Vedaldi

Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is increasingly used in representation learning. However, optimizing the codevectors in existing VQ-VAE is not entirely trivial. A problem is codebook collapse, where only a small subset of codevectors receive gradients useful for their optimisation, whereas the majority of them simply ``die off'' and are never updated or used. This limits the effectiveness of VQ for learning larger codebooks in complex computer vision tasks that require high-capacity representations. In this paper, we present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE). Our approach selects encoded features as anchors to update the ``dead'' codevectors, while optimising the codevectors which are alive via the original loss. This strategy brings unused codevectors closer in distribution to the encoded features, increasing the likelihood of their being chosen and optimized. We extensively validate the generalization capability of our quantiser on various datasets, tasks (e.g. reconstruction and generation), and architectures (e.g. VQ-VAE, VQGAN, LDM). Our CVQ-VAE can be easily integrated into existing models with just a few lines of code.
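
A hedged sketch of the anchor idea (not the exact CVQ-VAE update rule): codevectors whose running usage falls below a threshold are re-initialised from currently encoded features, so they re-enter the quantisation competition.

    import torch

    def revive_dead_codes(codebook: torch.Tensor, usage: torch.Tensor,
                          encoded: torch.Tensor, min_usage: float = 1.0):
        """codebook: (K, D); usage: (K,) running counts; encoded: (N, D) features."""
        dead = usage < min_usage
        n_dead = int(dead.sum())
        if n_dead > 0:
            idx = torch.randint(0, encoded.shape[0], (n_dead,))
            codebook.data[dead] = encoded[idx].detach()   # anchor on encoder outputs
        return codebook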

R-Block: Regularized Block of Dropout for convolutional networks. (arXiv:2307.15150v1 [cs.CV])

Authors: Liqi Wang, Qiya Hu

Dropout as a regularization technique is widely used in fully connected layers but is less effective in convolutional layers. Therefore, more structured forms of dropout have been proposed to regularize convolutional networks. The disadvantage of these methods is that the randomness they introduce causes inconsistency between training and inference. In this paper, we apply a mutual-learning training strategy for convolutional-layer regularization, namely R-Block, which forces the two outputs of the generated difference-maximizing sub-models to be consistent with each other. Concretely, R-Block minimizes the losses between the output distributions of two sub-models with different drop regions for each sample in the training dataset. We design two approaches to construct such sub-models. Our experiments demonstrate that R-Block achieves better performance than other existing structured dropout variants. We also demonstrate that our approaches to constructing sub-models outperform others.
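
A minimal sketch of the mutual-consistency objective (the names and the symmetric-KL choice are assumptions): run the same batch through two sub-models obtained with different drop regions and penalise the divergence between their output distributions.

    import torch
    import torch.nn.functional as F

    def consistency_loss(model, images, labels):
        model.train()                      # keep stochastic drop active
        logits_a = model(images)           # sub-model A (one drop pattern)
        logits_b = model(images)           # sub-model B (another drop pattern)
        ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
        p_a, p_b = F.log_softmax(logits_a, 1), F.log_softmax(logits_b, 1)
        sym_kl = 0.5 * (F.kl_div(p_a, p_b.exp(), reduction="batchmean")
                        + F.kl_div(p_b, p_a.exp(), reduction="batchmean"))
        return ce + sym_kl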

R-LPIPS: An Adversarially Robust Perceptual Similarity Metric. (arXiv:2307.15157v1 [cs.CV])

Authors: Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Alexandre Araujo

Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. The code is available at \url{https://github.com/SaraGhazanfari/R-LPIPS}.
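
Conceptually, an LPIPS-style distance compares unit-normalised deep features of two images layer by layer; R-LPIPS swaps in adversarially trained feature extractors. The sketch below illustrates only the distance computation and is not the released R-LPIPS code.

    import torch
    import torch.nn.functional as F

    def perceptual_distance(feature_extractor, img_a, img_b):
        """feature_extractor returns a list of feature maps per image."""
        feats_a, feats_b = feature_extractor(img_a), feature_extractor(img_b)
        dist = 0.0
        for fa, fb in zip(feats_a, feats_b):
            fa = F.normalize(fa, dim=1)        # unit norm across channels
            fb = F.normalize(fb, dim=1)
            dist = dist + (fa - fb).pow(2).mean()
        return dist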

EnSolver: Uncertainty-Aware CAPTCHA Solver Using Deep Ensembles. (arXiv:2307.15180v1 [cs.CV])

Authors: Duc C. Hoang, Cuong V. Nguyen, Amin Kharraz

The popularity of text-based CAPTCHAs as a security mechanism to protect websites from automated bots has prompted research on CAPTCHA solvers, with the aim of understanding their failure cases and subsequently making CAPTCHAs more secure. Recently proposed solvers, built on advances in deep learning, are able to crack even very challenging CAPTCHAs with high accuracy. However, these solvers often perform poorly on out-of-distribution samples that contain visual features different from those in the training set. Furthermore, they lack the ability to detect and avoid such samples, making them susceptible to being locked out by defense systems after a certain number of failed attempts. In this paper, we propose EnSolver, a novel CAPTCHA solver that utilizes deep ensemble uncertainty estimation to detect and skip out-of-distribution CAPTCHAs, making it harder to detect. We demonstrate the use of our solver with object detection models and show empirically that it performs well on both in-distribution and out-of-distribution data, achieving up to 98.1% accuracy when detecting out-of-distribution data and up to a 93% success rate when solving in-distribution CAPTCHAs.
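
A sketch of the ensemble-uncertainty gate with a hypothetical interface (model.decode is a placeholder): decode the CAPTCHA with every ensemble member and answer only when they agree, otherwise skip the sample as likely out-of-distribution.

    def solve_or_skip(ensemble, captcha_image, min_agreement=0.8):
        predictions = [model.decode(captcha_image) for model in ensemble]
        best = max(set(predictions), key=predictions.count)
        agreement = predictions.count(best) / len(predictions)
        return best if agreement >= min_agreement else None   # None => skip the sample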

Med-Flamingo: a Multimodal Medical Few-shot Learner. (arXiv:2307.15189v1 [cs.CV])

Authors: Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, Jure Leskovec

Medicine, by its nature, is a multifaceted domain that requires the synthesis of information across various modalities. Medical generative vision-language models (VLMs) make a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable downstream datasets, which poses a significant limitation as in many medical applications data is scarce, necessitating models that are capable of learning from few examples in real time. Here we propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which we evaluate on several datasets, including a novel challenging open-ended VQA dataset of visual USMLE-style problems. Furthermore, we conduct the first human evaluation for generative medical VQA, where physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20\% in clinicians' ratings and, for the first time, enables multimodal medical few-shot adaptations, such as rationale generation. We release our model, code, and evaluation app at https://github.com/snap-stanford/med-flamingo.

Small, but important: Traffic light proposals for detecting small traffic lights and beyond. (arXiv:2307.15191v1 [cs.CV])

Authors: Tom Sanitz, Christian Wilms, Simone Frintrop

Traffic light detection is a challenging problem in the context of self-driving cars and driver assistance systems. While most existing systems produce good results on large traffic lights, detecting small and tiny ones is often overlooked. A key problem here is the inherent downsampling in CNNs, leading to low-resolution features for detection. To mitigate this problem, we propose a new traffic light detection system, comprising a novel traffic light proposal generator that utilizes findings from general object proposal generation, fine-grained multi-scale features, and attention for efficient processing. Moreover, we design a new detection head for classifying and refining our proposals. We evaluate our system on three challenging, publicly available datasets and compare it against six methods. The results show substantial improvements of at least $12.6\%$ on small and tiny traffic lights, as well as strong results across all sizes of traffic lights.

One-shot Joint Extraction, Registration and Segmentation of Neuroimaging Data. (arXiv:2307.15198v1 [cs.CV])

Authors: Yao Su, Zhentian Qian, Lei Ma, Lifang He, Xiangnan Kong

Brain extraction, registration and segmentation are indispensable preprocessing steps in neuroimaging studies. The aim is to extract the brain from raw imaging scans (i.e., extraction step), align it with a target brain image (i.e., registration step) and label the anatomical brain regions (i.e., segmentation step). Conventional studies typically focus on developing separate methods for the extraction, registration and segmentation tasks in a supervised setting. The performance of these methods is largely contingent on the quantity of training samples and the extent of visual inspections carried out by experts for error correction. Nevertheless, collecting voxel-level labels and performing manual quality control on high-dimensional neuroimages (e.g., 3D MRI) are expensive and time-consuming in many medical studies. In this paper, we study the problem of one-shot joint extraction, registration and segmentation in neuroimaging data, which exploits only one labeled template image (a.k.a. atlas) and a few unlabeled raw images for training. We propose a unified end-to-end framework, called JERS, to jointly optimize the extraction, registration and segmentation tasks, allowing feedback among them. Specifically, we use a group of extraction, registration and segmentation modules to learn the extraction mask, transformation and segmentation mask, where modules are interconnected and mutually reinforced by self-supervision. Empirical results on real-world datasets demonstrate that our proposed method performs exceptionally well in the extraction, registration and segmentation tasks. Our code and data can be found at https://github.com/Anonymous4545/JERS

PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization. (arXiv:2307.15199v1 [cs.CV])

Authors: Junhyeong Cho, Gilhyun Nam, Sungyeon Kim, Hunmin Yang, Suha Kwak

In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively represent its relevant image features (e.g., from dog photos). Inspired by this, we propose PromptStyler which simulates various distribution shifts in the joint space by synthesizing diverse styles via prompts without using any images to deal with source-free domain generalization. Our method learns to generate a variety of style features (from "a S* style of a") via learnable style word vectors for pseudo-words S*. To ensure that learned styles do not distort content information, we force style-content features (from "a S* style of a [class]") to be located nearby their corresponding content features (from "[class]") in the joint vision-language space. After learning style word vectors, we train a linear classifier using synthesized style-content features. PromptStyler achieves the state of the art on PACS, VLCS, OfficeHome and DomainNet, although it does not require any images and takes just ~30 minutes for training using a single GPU.
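
A hedged sketch of the two PromptStyler objectives (the text encoder, prompt assembly, and loss weighting are placeholders): keep newly learned style features diverse while forcing "a S* style of a [class]" to stay close to the plain "[class]" feature in the joint space.

    import torch
    import torch.nn.functional as F

    def promptstyler_losses(style_feat, prev_style_feats, style_content_feat, content_feat):
        # Diversity: push the new style feature away from previously learned styles.
        div = 0.0
        for prev in prev_style_feats:
            div = div + F.cosine_similarity(style_feat, prev, dim=-1).abs().mean()
        # Content consistency: keep the style-content feature close to the pure content feature.
        consistency = 1 - F.cosine_similarity(style_content_feat, content_feat, dim=-1).mean()
        return div + consistency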

Generative AI for Medical Imaging: extending the MONAI Framework. (arXiv:2307.15208v1 [eess.IV])

Authors: Walter H. L. Pinaya, Mark S. Graham, Eric Kerfoot, Petru-Daniel Tudosiu, Jessica Dafflon, Virginia Fernandez, Pedro Sanchez, Julia Wolleb, Pedro F. da Costa, Ashay Patel, Hyungjin Chung, Can Zhao, Wei Peng, Zelong Liu, Xueyan Mei, Oeslle Lucena, Jong Chul Ye, Sotirios A. Tsaftaris, Prerna Dogra, Andrew Feng, Marc Modat, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso

Recent advances in generative AI have brought incredible breakthroughs in several areas, including medical imaging. These generative models have tremendous potential not only to help safely share medical data via synthetic datasets but also to perform an array of diverse applications, such as anomaly detection, image-to-image translation, denoising, and MRI reconstruction. However, due to the complexity of these models, their implementation and reproducibility can be difficult. This complexity can hinder progress, act as a barrier to use, and dissuade the comparison of new methods with existing works. In this study, we present MONAI Generative Models, a freely available open-source platform that allows researchers and developers to easily train, evaluate, and deploy generative models and related applications. Our platform reproduces state-of-the-art studies in a standardised way involving different architectures (such as diffusion models, autoregressive transformers, and GANs), and provides pre-trained models for the community. We have implemented these models in a generalisable fashion, illustrating that their results can be extended to 2D or 3D scenarios, including medical images with different modalities (like CT, MRI, and X-Ray data) and from different anatomical areas. Finally, we adopt a modular and extensible approach, ensuring long-term maintainability and the extension of current applications for future features.

Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures. (arXiv:2307.15220v1 [cs.CV])

Authors: Kun Yuan, Vinkle Srivastav, Tong Yu, Joel Lavanchy, Pietro Mascagni, Nassir Navab, Nicolas Padoy

Recent advancements in surgical computer vision applications have been driven by fully-supervised methods, primarily using only visual data. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective to align video clip embeddings with the corresponding multiple text embeddings by bringing them together within a joint latent space. To effectively show the representation capability of the learned joint latent space, we introduce several vision-and-language tasks for surgery, such as text-based video retrieval, temporal activity grounding, and video captioning, as benchmarks for evaluation. We further demonstrate that without using any labeled ground truth, our approach can be employed for traditional vision-only surgical downstream tasks, such as surgical tool, phase, and triplet recognition. The code will be made available at https://github.com/CAMMA-public/SurgVLP
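
A minimal sketch in the spirit of SurgVLP's objective (the exact formulation is the paper's): each video-clip embedding is pulled toward the mean of its multiple ASR text embeddings and pushed away from the other clips in the batch.

    import torch
    import torch.nn.functional as F

    def clip_text_infonce(video_emb, text_embs, temperature=0.07):
        """video_emb: (B, D); text_embs: (B, M, D) from M ASR systems."""
        text_emb = F.normalize(text_embs.mean(dim=1), dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)
        logits = video_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
        targets = torch.arange(video_emb.shape[0])
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))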

Fast Dust Sand Image Enhancement Based on Color Correction and New Membership Function. (arXiv:2307.15230v1 [cs.CV])

Authors: Ali Hakem Alsaeedi, Suha Mohammed Hadi, Yarub Alazzawi

Images captured in dusty environments suffer from poor visibility and quality. Enhancing such images, for example sand dust images, plays a critical role in various atmospheric optics applications. In this work, we propose a new model based on color correction and a new membership function to enhance sand dust images. The proposed model consists of three phases: correction of the color shift, removal of haze, and enhancement of contrast and brightness. The color shift is corrected using a new membership function that adjusts the values of U and V in the YUV color space. The Adaptive Dark Channel Prior (A-DCP) is used for haze removal. Contrast stretching and brightness improvement are based on Contrast Limited Adaptive Histogram Equalization (CLAHE). The proposed model is tested and evaluated on many real sand dust images. The experimental results show that the proposed solution outperforms current studies in effectively removing the red and yellow cast and provides high-quality results both qualitatively and quantitatively.
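
An illustrative OpenCV pipeline only (the paper's learned membership function and the A-DCP dehazing step are not reproduced): pull the chroma channels toward neutral in YUV to reduce the colour cast, then apply CLAHE to the luminance channel.

    import cv2
    import numpy as np

    def enhance_dust_image(bgr: np.ndarray) -> np.ndarray:
        yuv = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV).astype(np.float32)
        # Blend chroma toward neutral (128) to reduce the yellow/red cast;
        # this fixed blend stands in for the paper's membership function.
        for c in (1, 2):
            yuv[:, :, c] = 0.5 * yuv[:, :, c] + 0.5 * 128.0
        yuv = np.clip(yuv, 0, 255).astype(np.uint8)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        yuv[:, :, 0] = clahe.apply(yuv[:, :, 0])       # contrast/brightness on luminance
        return cv2.cvtColor(yuv, cv2.COLOR_YUV2BGR)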

TROPHY: A Topologically Robust Physics-Informed Tracking Framework for Tropical Cyclones. (arXiv:2307.15243v1 [physics.ao-ph])

Authors: Lin Yan, Hanqi Guo, Thomas Peterka, Bei Wang, Jiali Wang

Tropical cyclones (TCs) are among the most destructive weather systems. Realistically and efficiently detecting and tracking TCs are critical for assessing their impacts and risks. Recently, a multilevel robustness framework has been introduced to study the critical points of time-varying vector fields. The framework quantifies the robustness of critical points across varying neighborhoods. By relating the multilevel robustness with critical point tracking, the framework has demonstrated its potential in cyclone tracking. An advantage is that it identifies cyclonic features using only 2D wind vector fields, which is encouraging as most tracking algorithms require multiple dynamic and thermodynamic variables at different altitudes. A disadvantage is that the framework does not scale well computationally for datasets containing a large number of cyclones. This paper introduces a topologically robust physics-informed tracking framework (TROPHY) for TC tracking. The main idea is to integrate physical knowledge of TCs to drastically improve the computational efficiency of the multilevel robustness framework for large-scale climate datasets. First, during preprocessing, we propose a physics-informed feature selection strategy to filter out the 90% of critical points that are short-lived and have low stability, thus preserving good candidates for TC tracking. Second, during in-processing, we impose constraints during the multilevel robustness computation to focus only on physics-informed neighborhoods of TCs. We apply TROPHY to 30 years of 2D wind fields from reanalysis data in ERA5 and generate a number of TC tracks. In comparison with the observed tracks, we demonstrate that TROPHY can capture TC characteristics that are comparable to and sometimes even better than a well-validated TC tracking algorithm that requires multiple dynamic and thermodynamic scalar fields.

D2S: Representing local descriptors and global scene coordinates for camera relocalization. (arXiv:2307.15250v1 [cs.CV])

Authors: Bach-Thuan Bui, Dinh-Tuan Tran, Joo-Ho Lee

State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors and 3D point clouds. However, these procedures can incur significant cost in terms of inference, storage, and updates over time. In this study, we propose a direct learning-based approach that utilizes a simple network named D2S to represent local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness. It solely leverages a single RGB image for localization during the testing phase and only requires a lightweight model to encode a complex sparse scene. The proposed D2S employs a combination of a simple loss function and graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform a binary-semantic classification for sparse descriptors. Additionally, we propose a new outdoor dataset to evaluate the capabilities of visual localization methods in terms of scene generalization and self-updating from unlabeled observations. Our approach outperforms the state-of-the-art CNN-based methods in scene coordinate regression in indoor and outdoor environments. It demonstrates the ability to generalize beyond training data, including scenarios involving transitions from day to night and adapting to domain shifts, even in the absence of the labeled data sources. The source code, trained models, dataset, and demo videos are available at the following link: https://thpjp.github.io/d2s

A Solution to Co-occurrence Bias: Attributes Disentanglement via Mutual Information Minimization for Pedestrian Attribute Recognition. (arXiv:2307.15252v1 [cs.CV])

Authors: Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Zhenbo Xu, Weiqing Lu, Yuran Cao

Recent studies on pedestrian attribute recognition progress with either explicit or implicit modeling of the co-occurrence among attributes. Considering that this known prior is highly variable and unforeseeable in specific scenarios, we show that current methods can actually suffer when generalizing such fitted attribute interdependencies onto scenes or identities outside the dataset distribution, resulting in the underlying bias of attribute co-occurrence. To render models robust in realistic scenes, we propose attributes-disentangled feature learning to ensure that the recognition of an attribute does not rely on the existence of others, which is then formulated as a problem of mutual information minimization. Building on this, practical strategies are devised to efficiently decouple attributes, which substantially improve the baseline and establish state-of-the-art performance on realistic datasets such as PETAzs and RAPzs. Code is released at https://github.com/SDret/A-Solution-to-Co-occurence-Bias-in-Pedestrian-Attribute-Recognition.

Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification. (arXiv:2307.15254v1 [cs.CV])

Authors: Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, Bo Liu

The whole slide image (WSI) classification is often formulated as a multiple instance learning (MIL) problem. Since the positive tissue is only a small fraction of the gigapixel WSI, existing MIL methods intuitively focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting hard-to-classify instances. Some literature has revealed that hard examples are beneficial for modeling a discriminative boundary accurately. By applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which uses a Siamese structure (Teacher-Student) with a consistency constraint to explore the potential hard instances. With several instance masking strategies based on attention scores, MHIM-MIL employs a momentum teacher to implicitly mine hard instances for training the student model, which can be any attention-based MIL model. This counter-intuitive strategy essentially enables the student to learn a better discriminating boundary. Moreover, the student is used to update the teacher with an exponential moving average (EMA), which in turn identifies new hard instances for subsequent training iterations and stabilizes the optimization. Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets demonstrate that MHIM-MIL outperforms other latest methods in terms of performance and training cost. The code is available at: https://github.com/DearCaat/MHIM-MIL.
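
Two ingredients named in the abstract can be sketched as follows (the masking strategy and hyper-parameters here are assumptions): hide the highest-attention, i.e. easiest, instances from the student, and update the teacher as an exponential moving average of the student.

    import torch

    def mask_salient_instances(features, attention, mask_ratio=0.2):
        """features: (N, D) instance features; attention: (N,) teacher attention scores."""
        n_mask = int(mask_ratio * attention.numel())
        drop = attention.topk(n_mask).indices           # most salient (easiest) instances
        keep = torch.ones_like(attention, dtype=torch.bool)
        keep[drop] = False
        return features[keep]                           # student trains on the remainder

    @torch.no_grad()
    def ema_update(teacher, student, momentum=0.999):
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)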

Learning with Constraint Learning: New Perspective, Solution Strategy and Various Applications. (arXiv:2307.15257v1 [cs.CV])

Authors: Risheng Liu, Jiaxin Gao, Xuan Liu, Xin Fan

The complexity of learning problems, such as Generative Adversarial Network (GAN) and its variants, multi-task and meta-learning, hyper-parameter learning, and a variety of real-world vision applications, demands a deeper understanding of their underlying coupling mechanisms. Existing approaches often address these problems in isolation, lacking a unified perspective that can reveal commonalities and enable effective solutions. Therefore, in this work, we propose a new framework, named Learning with Constraint Learning (LwCL), that can holistically examine challenges and provide a unified methodology to tackle all the above-mentioned complex learning and vision problems. Specifically, LwCL is designed as a general hierarchical optimization model that captures the essence of these diverse learning and vision problems. Furthermore, we develop a gradient-response based fast solution strategy to overcome optimization challenges of the LwCL framework. Our proposed framework efficiently addresses a wide range of applications in learning and vision, encompassing three categories and nine different problem types. Extensive experiments on synthetic tasks and real-world applications verify the effectiveness of our approach. The LwCL framework offers a comprehensive solution for tackling complex machine learning and computer vision problems, bridging the gap between theory and practice.

RSGPT: A Remote Sensing Vision Language Model and Benchmark. (arXiv:2307.15266v1 [cs.CV])

Authors: Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li

The emergence of large-scale large language models, with GPT-4 as a prominent example, has significantly propelled the rapid advancement of artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing (RS), there is a growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, lacking comprehensive, large-scale image-text datasets that are aligned and suitable for training large VLMs, which poses significant challenges to effectively training such models for RS applications. In computer vision, recent research has demonstrated that fine-tuning large vision language models on small-scale, high-quality datasets can yield impressive performance in visual and language understanding. These results are comparable to state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this captivating idea, in this work, we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets that either employ model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich and high-quality information. This dataset offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position, etc). To facilitate the evaluation of VLMs in the field of RS, we also provide a benchmark evaluation dataset called RSIEval. This dataset consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.

Anatomy-Aware Lymph Node Detection in Chest CT using Implicit Station Stratification. (arXiv:2307.15271v1 [cs.CV])

Authors: Ke Yan, Dakai Jin, Dazhou Guo, Minfeng Xu, Na Shen, Xian-Sheng Hua, Xianghua Ye, Le Lu

Finding abnormal lymph nodes in radiological images is highly important for various medical tasks such as cancer metastasis staging and radiotherapy planning. Lymph nodes (LNs) are small glands scattered throughout the body. They are grouped or defined to various LN stations according to their anatomical locations. The CT imaging appearance and context of LNs in different stations vary significantly, posing challenges for automated detection, especially for pathological LNs. Motivated by this observation, we propose a novel end-to-end framework to improve LN detection performance by leveraging their station information. We design a multi-head detector and make each head focus on differentiating the LN and non-LN structures of certain stations. Pseudo station labels are generated by an LN station classifier as a form of multi-task learning during training, so we do not need another explicit LN station prediction model during inference. Our algorithm is evaluated on 82 patients with lung cancer and 91 patients with esophageal cancer. The proposed implicit station stratification method improves the detection sensitivity of thoracic lymph nodes from 65.1% to 71.4% and from 80.3% to 85.5% at 2 false positives per patient on the two datasets, respectively, which significantly outperforms various existing state-of-the-art baseline techniques such as nnUNet, nnDetection and LENS.

Recovering high-quality FODs from a reduced number of diffusion-weighted images using a model-driven deep learning architecture. (arXiv:2307.15273v1 [cs.CV])

Authors: J Bartlett, C E Davey, L A Johnston, J Duan

Fibre orientation distribution (FOD) reconstruction using deep learning has the potential to produce accurate FODs from a reduced number of diffusion-weighted images (DWIs), decreasing total imaging time. Diffusion acquisition invariant representations of the DWI signals are typically used as input to these methods to ensure that they can be applied flexibly to data with different b-vectors and b-values; however, this means the network cannot condition its output directly on the DWI signal. In this work, we propose a spherical deconvolution network, a model-driven deep learning FOD reconstruction architecture, that ensures intermediate and output FODs produced by the network are consistent with the input DWI signals. Furthermore, we implement a fixel classification penalty within our loss function, encouraging the network to produce FODs that can subsequently be segmented into the correct number of fixels and improve downstream fixel-based analysis. Our results show that the model-based deep learning architecture achieves competitive performance compared to a state-of-the-art FOD super-resolution network, FOD-Net. Moreover, we show that the fixel classification penalty can be tuned to offer improved performance with respect to metrics that rely on accurately segmented FODs. Our code is publicly available at https://github.com/Jbartlett6/SDNet .

AC-Norm: Effective Tuning for Medical Image Analysis via Affine Collaborative Normalization. (arXiv:2307.15282v1 [cs.CV])

Authors: Chuyan Zhang, Yuncheng Yang, Hao Zheng, Yun Gu

Driven by the latest trend towards self-supervised learning (SSL), the paradigm of "pretraining-then-finetuning" has been extensively explored to enhance the performance of clinical applications with limited annotations. Previous literature on model finetuning has mainly focused on regularization terms and specific policy models, while the misalignment of channels between source and target models has not received sufficient attention. In this work, we revisited the dynamics of batch normalization (BN) layers and observed that the trainable affine parameters of BN serve as sensitive indicators of domain information. Therefore, Affine Collaborative Normalization (AC-Norm) is proposed for finetuning, which dynamically recalibrates the channels in the target model according to the cross-domain channel-wise correlations without adding extra parameters. Based on a single-step backpropagation, AC-Norm can also be utilized to measure the transferability of pretrained models. We evaluated AC-Norm against the vanilla finetuning and state-of-the-art fine-tuning methods on transferring diverse pretrained models to the diabetic retinopathy grade classification, retinal vessel segmentation, CT lung nodule segmentation/classification, CT liver-tumor segmentation and MRI cardiac segmentation tasks. Extensive experiments demonstrate that AC-Norm consistently outperforms the vanilla finetuning by up to 4% improvement, even under significant domain shifts where the state-of-the-art methods bring no gains. We also prove the capability of AC-Norm in fast transferability estimation. Our code is available at https://github.com/EndoluminalSurgicalVision-IMR/ACNorm.

Attentive Multimodal Fusion for Optical and Scene Flow. (arXiv:2307.15301v1 [cs.CV])

Authors: Youjie Zhou, Guofeng Mei, Yiming Wang, Fabio Poiesi, Yi Wan

This paper presents an investigation into the estimation of optical and scene flow using RGBD information in scenarios where the RGB modality is affected by noise or captured in dark environments. Existing methods typically rely solely on RGB images or fuse the modalities at later stages, which can result in lower accuracy when the RGB information is unreliable. To address this issue, we propose a novel deep neural network approach named FusionRAFT, which enables early-stage information fusion between sensor modalities (RGB and depth). Our approach incorporates self- and cross-attention layers at different network levels to construct informative features that leverage the strengths of both modalities. Through comparative experiments, we demonstrate that our approach outperforms recent methods in terms of performance on the synthetic dataset Flyingthings3D, as well as the generalization on the real-world dataset KITTI. We illustrate that our approach exhibits improved robustness in the presence of noise and low-lighting conditions that affect the RGB images. We release the code, models and dataset at https://github.com/jiesico/FusionRAFT.

DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable Kendall's Rank Correlation. (arXiv:2307.15317v1 [cs.CV])

Authors: Kaipeng Zheng, Huishuai Zhang, Weiran Huang

Few-shot learning aims to adapt models trained on the base dataset to novel tasks where the categories are not seen by the model before. This often leads to a relatively uniform distribution of feature values across channels on novel classes, posing challenges in determining channel importance for novel tasks. Standard few-shot learning methods employ geometric similarity metrics such as cosine similarity and negative Euclidean distance to gauge the semantic relatedness between two features. However, features with high geometric similarities may carry distinct semantics, especially in the context of few-shot learning. In this paper, we demonstrate that the importance ranking of feature channels is a more reliable indicator for few-shot learning than geometric similarity metrics. We observe that replacing the geometric similarity metric with Kendall's rank correlation only during inference is able to improve the performance of few-shot learning across a wide range of datasets with different domains. Furthermore, we propose a carefully designed differentiable loss for meta-training to address the non-differentiability issue of Kendall's rank correlation. Extensive experiments demonstrate that the proposed rank-correlation-based approach substantially enhances few-shot learning performance.
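
As a drop-in illustration of the inference-time idea, the geometric similarity can be replaced by Kendall's tau over channel values (SciPy's kendalltau suffices here; the paper's differentiable surrogate is only needed for meta-training):

    import numpy as np
    from scipy.stats import kendalltau

    def kendall_similarity(query_feat: np.ndarray, proto_feat: np.ndarray) -> float:
        tau, _ = kendalltau(query_feat, proto_feat)   # rank agreement across channels
        return tau

    def classify(query_feat, prototypes):
        scores = [kendall_similarity(query_feat, p) for p in prototypes]
        return int(np.argmax(scores))                 # index of the best-matching class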

DocDeshadower: Frequency-aware Transformer for Document Shadow Removal. (arXiv:2307.15318v1 [cs.CV])

Authors: Shenghong Luo, Ruifeng Xu, Xuhang Chen, Zinuo Li, Chi-Man Pun, Shuqiang Wang

The presence of shadows significantly impacts the visual quality of scanned documents. However, the existing traditional techniques and deep learning methods used for shadow removal have several limitations. These methods either rely heavily on heuristics, resulting in suboptimal performance, or require large datasets to learn shadow-related features. In this study, we propose DocDeshadower, a multi-frequency Transformer-based model built on the Laplacian pyramid. DocDeshadower is designed to remove shadows at different frequencies in a coarse-to-fine manner. To achieve this, we decompose the shadow image into different frequency bands using the Laplacian pyramid. In addition, we introduce two novel components to this model: the Attention-Aggregation Network and the Gated Multi-scale Fusion Transformer. The Attention-Aggregation Network is designed to remove shadows in the low-frequency part of the image, whereas the Gated Multi-scale Fusion Transformer refines the entire image at a global scale with its large receptive field. Our extensive experiments demonstrate that DocDeshadower outperforms the current state-of-the-art methods in both qualitative and quantitative terms.
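
For reference, the frequency decomposition the model operates on can be built with OpenCV as a standard Laplacian pyramid (the Transformer components themselves are not shown):

    import cv2

    def laplacian_pyramid(image, levels=3):
        gaussians = [image]
        for _ in range(levels):
            gaussians.append(cv2.pyrDown(gaussians[-1]))
        pyramid = []
        for i in range(levels):
            up = cv2.pyrUp(gaussians[i + 1],
                           dstsize=(gaussians[i].shape[1], gaussians[i].shape[0]))
            pyramid.append(cv2.subtract(gaussians[i], up))   # band-pass residual
        pyramid.append(gaussians[-1])                         # low-frequency base
        return pyramid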

Robust Visual Sim-to-Real Transfer for Robotic Manipulation. (arXiv:2307.15320v1 [cs.RO])

Authors: Ricardo Garcia, Robin Strudel, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

Learning visuomotor policies in simulation is much safer and cheaper than in the real world. However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots. One common approach to bridge the visual sim-to-real domain gap is domain randomization (DR). While previous work mainly evaluates DR for disembodied tasks, such as pose estimation and object detection, here we systematically explore visual domain randomization methods and benchmark them on a rich set of challenging robotic manipulation tasks. In particular, we propose an off-line proxy task of cube localization to select DR parameters for texture randomization, lighting randomization, variations of object colors and camera parameters. Notably, we demonstrate that DR parameters have similar impact on our off-line proxy task and on-line policies. We, hence, use off-line optimized DR parameters to train visuomotor policies in simulation and directly apply such policies to a real robot. Our approach achieves 93% success rate on average when tested on a diverse set of challenging manipulation tasks. Moreover, we evaluate the robustness of policies to visual variations in real scenes and show that our simulator-trained policies outperform policies learned using real but limited data. Code, simulation environment, real robot datasets and trained models are available at https://www.di.ens.fr/willow/research/robust_s2r/.

TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts. (arXiv:2307.15324v1 [cs.CV])

Authors: Hanrong Ye, Dan Xu

Learning discriminative task-specific features simultaneously for multiple distinct tasks is a fundamental problem in multi-task learning. Recent state-of-the-art models consider directly decoding task-specific features from one shared task-generic feature (e.g., feature from a backbone layer), and utilize carefully designed decoders to produce multi-task features. However, as the input feature is fully shared and each task decoder also shares decoding parameters for different input samples, it leads to a static feature decoding process, producing less discriminative task-specific representations. To tackle this limitation, we propose TaskExpert, a novel multi-task mixture-of-experts model that enables learning multiple representative task-generic feature spaces and decoding task-specific features in a dynamic manner. Specifically, TaskExpert introduces a set of expert networks to decompose the backbone feature into several representative task-generic features. Then, the task-specific features are decoded by using dynamic task-specific gating networks operating on the decomposed task-generic features. Furthermore, to establish long-range modeling of the task-specific representations from different layers of TaskExpert, we design a multi-task feature memory that updates at each layer and acts as an additional feature expert for dynamic task-specific feature decoding. Extensive experiments demonstrate that our TaskExpert clearly outperforms previous best-performing methods on all 9 metrics of two competitive multi-task learning benchmarks for visual scene understanding (i.e., PASCAL-Context and NYUD-v2). Codes and models will be made publicly available at https://github.com/prismformore/Multi-Task-Transformer

Staging E-Commerce Products for Online Advertising using Retrieval Assisted Image Generation. (arXiv:2307.15326v1 [cs.CV])

Authors: Yueh-Ning Ku, Mikhail Kuznetsov, Shaunak Mishra, Paloma de Juan

Online ads showing e-commerce products typically rely on the product images in a catalog sent to the advertising platform by an e-commerce platform. In the broader ads industry such ads are called dynamic product ads (DPA). It is common for DPA catalogs to be in the scale of millions (corresponding to the scale of products which can be bought from the e-commerce platform). However, not all product images in the catalog may be appealing when directly re-purposed as an ad image, and this may lead to lower click-through rates (CTRs). In particular, products just placed against a solid background may not be as enticing and realistic as a product staged in a natural environment. To address such shortcomings of DPA images at scale, we propose a generative adversarial network (GAN) based approach to generate staged backgrounds for un-staged product images. Generating the entire staged background is a challenging task susceptible to hallucinations. To get around this, we introduce a simpler approach called copy-paste staging using retrieval assisted GANs. In copy paste staging, we first retrieve (from the catalog) staged products similar to the un-staged input product, and then copy-paste the background of the retrieved product in the input image. A GAN based in-painting model is used to fill the holes left after this copy-paste operation. We show the efficacy of our copy-paste staging method via offline metrics, and human evaluation. In addition, we show how our staging approach can enable animations of moving products leading to a video ad from a product image.

Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF. (arXiv:2307.15333v1 [cs.CV])

Authors: Haotian Bai, Yiqi Lin, Yize Chen, Lin Wang

The explicit neural radiance field (NeRF) has gained considerable interest for its efficient training and fast inference capabilities, making it a promising direction for applications such as virtual reality and gaming. In particular, PlenOctree (POT)[1], an explicit hierarchical multi-scale octree representation, has emerged as a structural and influential framework. However, POT's fixed structure for direct optimization is sub-optimal as the scene complexity evolves continuously with updates to cached color and density, necessitating refining the sampling distribution to capture signal complexity accordingly. To address this issue, we propose the dynamic PlenOctree DOT, which adaptively refines the sample distribution to adjust to changing scene complexity. Specifically, DOT proposes a concise yet novel hierarchical feature fusion strategy during the iterative rendering process. Firstly, it identifies the regions of interest through training signals to ensure adaptive and efficient refinement. Next, rather than directly filtering out valueless nodes, DOT introduces the sampling and pruning operations for octrees to aggregate features, enabling rapid parameter learning. Compared with POT, our DOT enhances visual quality, reduces parameters by over $55.15\%$/$68.84\%$, and provides $1.7\times$/$1.9\times$ the FPS for NeRF-synthetic and Tanks $\&$ Temples, respectively. Project homepage: https://vlislab22.github.io/DOT.

[1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering. (arXiv:2307.15335v1 [cs.CL])

Authors: Khiem Vinh Tran, Kiet Van Nguyen, Ngan Luu Thuy Nguyen

Visual Question Answering (VQA) is an intricate and demanding task that integrates natural language processing (NLP) and computer vision (CV), capturing the interest of researchers. The English language, renowned for its wealth of resources, has witnessed notable advancements in both datasets and models designed for VQA. However, there is a lack of models that target specific countries such as Vietnam. To address this limitation, we introduce a transformer-based Vietnamese model named BARTPhoBEiT. This model includes pre-trained Sequence-to-Sequence and bidirectional encoder representation from Image Transformers in Vietnamese and evaluates Vietnamese VQA datasets. Experimental results demonstrate that our proposed model outperforms the strong baseline and improves the state-of-the-art in six metrics: Accuracy, Precision, Recall, F1-score, WUPS 0.0, and WUPS 0.9.

The Radon Signed Cumulative Distribution Transform and its applications in classification of Signed Images. (arXiv:2307.15339v1 [cs.IT])

Authors: Le Gong, Shiying Li, Naqib Sad Pathan, Mohammad Shifat-E-Rabbi, Gustavo K. Rohde, Abu Hasnat Mohammad Rubaiyat, Sumati Thareja

Here we describe a new image representation technique based on the mathematics of transport and optimal transport. The method relies on the combination of the well-known Radon transform for images and a recent signal representation method called the Signed Cumulative Distribution Transform. The newly proposed method generalizes previous transport-related image representation methods to arbitrary functions (images), and thus can be used in more applications. We describe the new transform and some of its mathematical properties, and demonstrate its ability to partition image classes with real and simulated data. In comparison to existing transport transform methods, as well as deep learning-based classification methods, the new transform more accurately represents the information content of signed images, and thus can be used to obtain higher classification accuracies. A Python implementation of the proposed method is integrated into the software package PyTransKit, available on GitHub.
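
As a rough illustration of the overall recipe (Radon projections followed by a cumulative-distribution-style transform of each signed projection), the sketch below uses scikit-image's Radon transform and a heavily simplified, uniform-reference CDT. It is only meant to show the structure of the representation and is not the PyTransKit implementation of the Radon Signed CDT.

    import numpy as np
    from skimage.transform import radon

    def simple_cdt(s, eps=1e-8):
        """Very rough 1D CDT of a nonnegative signal w.r.t. a uniform reference
        on [0, 1] (a simplification, not the PyTransKit implementation)."""
        s = s + eps
        cdf = np.cumsum(s) / np.sum(s)            # empirical CDF of the signal
        targets = np.linspace(0.0, 1.0, len(s))   # CDF values of the reference
        grid = np.linspace(0.0, 1.0, len(s))      # signal domain
        return np.interp(targets, cdf, grid)      # generalized inverse CDF

    def radon_signed_cdt(image, n_angles=64):
        """Sketch: Radon projections, split into signed parts, transform each."""
        theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
        sinogram = radon(image, theta=theta, circle=False)   # one column per angle
        features = []
        for proj in sinogram.T:
            pos, neg = np.maximum(proj, 0.0), np.maximum(-proj, 0.0)  # signed split
            features.append(np.concatenate([simple_cdt(pos), simple_cdt(neg)]))
        return np.stack(features)                 # (n_angles, 2 * projection length)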

Supervised Homography Learning with Realistic Dataset Generation. (arXiv:2307.15353v1 [cs.CV])

Authors: Hai Jiang, Haipeng Li, Songchen Han, Haoqiang Fan, Bing Zeng, Shuaicheng Liu

In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth, to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content consistency module and a quality assessment module. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method achieves state-of-the-art performance and that existing supervised methods can also be improved based on the generated dataset. Code and dataset are available at https://github.com/megvii-research/RealSH.

Prompt Guided Transformer for Multi-Task Dense Prediction. (arXiv:2307.15362v1 [cs.CV])

Authors: Yuxiang Lu, Shalayiding Sirejiding, Yue Ding, Chunlin Wang, Hongtao Lu

Task-conditional architectures offer an advantage in parameter efficiency but fall short in performance compared to state-of-the-art multi-decoder methods. How to trade off performance and model parameters is an important and difficult problem. In this paper, we introduce a simple and lightweight task-conditional model called Prompt Guided Transformer (PGT) to address this challenge. Our approach designs a Prompt-conditioned Transformer block, which incorporates task-specific prompts in the self-attention mechanism to achieve global dependency modeling and parameter-efficient feature adaptation across multiple tasks. This block is integrated into both the shared encoder and decoder, enhancing the capture of intra- and inter-task features. Moreover, we design a lightweight decoder to further reduce parameter usage, which accounts for only 2.7% of the total model parameters. Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, demonstrate that our approach achieves state-of-the-art results among task-conditional methods while using fewer parameters, and strikes a favorable balance between performance and parameter size.

AffineGlue: Joint Matching and Robust Estimation. (arXiv:2307.15381v1 [cs.CV])

Authors: Daniel Barath, Dmytro Mishkin, Luca Cavalli, Paul-Edouard Sarlin, Petr Hruby, Marc Pollefeys

We propose AffineGlue, a method for joint two-view feature matching and robust estimation that reduces the combinatorial complexity of the problem by employing single-point minimal solvers. AffineGlue selects potential matches from one-to-many correspondences to estimate minimal models. Guided matching is then used to find matches consistent with the model, suffering less from the ambiguities of one-to-one matches. Moreover, we derive a new minimal solver for homography estimation, requiring only a single affine correspondence (AC) and a gravity prior. Furthermore, we train a neural network to reject ACs that are unlikely to lead to a good model. AffineGlue is superior to the SOTA on real-world datasets, even when assuming that the gravity direction points downwards. On PhotoTourism, the AUC@10{\deg} score is improved by 6.6 points compared to the SOTA. On ScanNet, AffineGlue makes SuperPoint and SuperGlue achieve similar accuracy as the detector-free LoFTR.

Uncertainty-aware Unsupervised Multi-Object Tracking. (arXiv:2307.15409v1 [cs.CV])

Authors: Kai Liu, Sheng Jin, Zhihang Fu, Ze Chen, Rongxin Jiang, Jieping Ye

Without manually annotated identities, unsupervised multi-object trackers struggle to learn reliable feature embeddings. This also makes the similarity-based inter-frame association stage error-prone, where an uncertainty problem arises. The uncertainty accumulated frame by frame prevents trackers from learning feature embeddings that stay consistent over time. To avoid this uncertainty problem, recent self-supervised techniques have been adopted, but they fail to capture temporal relations, so the inter-frame uncertainty still exists. In fact, this paper argues that although the uncertainty problem is inevitable, it is possible to leverage the uncertainty itself to improve the learned consistency in turn. Specifically, an uncertainty-based metric is developed to verify and rectify risky associations. The resulting accurate pseudo-tracklets boost learning the feature consistency, and accurate tracklets can incorporate temporal information into spatial transformation. This paper proposes a tracklet-guided augmentation strategy to simulate the tracklets' motion, which adopts a hierarchical uncertainty-based sampling mechanism for hard sample mining. The resulting unsupervised MOT framework, namely U2MOT, is proven effective on the MOT-Challenge and VisDrone-MOT benchmarks. U2MOT achieves SOTA performance among published supervised and unsupervised trackers.

MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression. (arXiv:2307.15421v1 [eess.IV])

Authors: Wei Jiang, Ronggang Wang

Recently, multi-reference entropy models have been proposed, which capture channel-wise, local spatial, and global spatial correlations. Previous works adopt attention to capture global correlations; however, its quadratic complexity limits the potential for high-resolution image coding. In this paper, we propose capturing global correlations with linear complexity via a decomposition of the softmax operation. Based on this, we propose MLIC$^{++}$, a learned image compression model with linear-complexity multi-reference entropy modeling. Our MLIC$^{++}$ is more efficient and reduces BD-rate by 12.44% on the Kodak dataset compared to VTM-17.0 when measured in PSNR. Code will be available at https://github.com/JiangWeibeta/MLIC.
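
Since the abstract describes replacing softmax attention with a linear-complexity decomposition, the sketch below shows a generic linear-attention formulation in which a positive feature map replaces the softmax, so the key-value summary is computed once in O(N d^2). The specific decomposition used in MLIC++ may differ; this is only the general pattern.

    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        """Generic linear-complexity attention: softmax(QK^T)V is replaced by
        phi(Q) (phi(K)^T V), so the cost is O(N d^2) instead of O(N^2 d).
        q, k, v: (batch, n_tokens, dim). Not the authors' exact decomposition."""
        phi_q = F.elu(q) + 1.0                         # positive feature map
        phi_k = F.elu(k) + 1.0
        kv = torch.einsum('bnd,bne->bde', phi_k, v)    # key-value summary, computed once
        norm = torch.einsum('bnd,bd->bn', phi_q, phi_k.sum(dim=1)) + eps
        return torch.einsum('bnd,bde->bne', phi_q, kv) / norm.unsqueeze(-1)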

Deep Learning Pipeline for Automated Visual Moth Monitoring: Insect Localization and Species Classification. (arXiv:2307.15427v1 [cs.CV])

Authors: Dimitri Korsch, Paul Bodesheim, Joachim Denzler

Biodiversity monitoring is crucial for tracking and counteracting adverse trends in population fluctuations. However, automatic recognition systems are rarely applied so far, and the large volumes of generated data are still evaluated manually by experts. In particular, support for visual monitoring with deep learning methods is not yet established in biodiversity research, compared to other areas like advertising or entertainment. In this paper, we present a deep learning pipeline for analyzing images captured by a moth scanner, an automated visual monitoring system of moth species developed within the AMMOD project. We first localize individuals with a moth detector and afterward determine the species of detected insects with a classifier. Our detector achieves up to 99.01% mean average precision and our classifier distinguishes 200 moth species with an accuracy of 93.13% on image cutouts depicting single insects. Combining both in our pipeline improves the accuracy for species identification in images of the moth scanner from 79.62% to 88.05%.

Implicit neural representation for change detection. (arXiv:2307.15428v1 [cs.CV])

Authors: Peter Naylor, Diego Di Carlo, Arianna Traviglia, Makoto Yamada, Marco Fiorucci

Detecting changes that occurred in a pair of 3D airborne LiDAR point clouds, acquired at two different times over the same geographical area, is a challenging task because of unmatched spatial supports and acquisition system noise. Most recent attempts to detect changes in point clouds are based on supervised methods, which require large labelled datasets that are unavailable in real-world applications. To address these issues, we propose an unsupervised approach that comprises two components: a Neural Field (NF) for continuous shape reconstruction and a Gaussian Mixture Model for categorising changes. NFs offer a grid-agnostic representation for encoding bi-temporal point clouds with unmatched spatial support that can be regularised to increase high-frequency details and reduce noise. The reconstructions at each timestamp are compared at arbitrary spatial scales, leading to a significant increase in detection capabilities. We apply our method to a benchmark dataset of simulated LiDAR point clouds for urban sprawling. The dataset offers different challenging scenarios with different resolutions, input modalities and noise levels, allowing a multi-scenario comparison of our method with the current state-of-the-art. We surpass the previous methods on this dataset by a 10% margin in the intersection-over-union metric. In addition, we apply our method to a real-world scenario to identify illegal excavation (looting) of archaeological sites and confirm that its results match findings from field experts.

Improvable Gap Balancing for Multi-Task Learning. (arXiv:2307.15429v1 [cs.LG])

Authors: Yanqi Dai, Nanyi Fei, Zhiwu Lu

In multi-task learning (MTL), gradient balancing has recently attracted more research interest than loss balancing since it often leads to better performance. However, loss balancing is much more efficient than gradient balancing, and thus it is still worth further exploration in MTL. Note that prior studies typically ignore that there exist varying improvable gaps across multiple tasks, where the improvable gap per task is defined as the distance between the current training progress and desired final training progress. Therefore, after loss balancing, the performance imbalance still arises in many cases. In this paper, following the loss balancing framework, we propose two novel improvable gap balancing (IGB) algorithms for MTL: one takes a simple heuristic, and the other (for the first time) deploys deep reinforcement learning for MTL. Particularly, instead of directly balancing the losses in MTL, both algorithms choose to dynamically assign task weights for improvable gap balancing. Moreover, we combine IGB and gradient balancing to show the complementarity between the two types of algorithms. Extensive experiments on two benchmark datasets demonstrate that our IGB algorithms lead to the best results in MTL via loss balancing and achieve further improvements when combined with gradient balancing. Code is available at https://github.com/YanqiDai/IGB4MTL.
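
As a toy illustration of the gap-based weighting idea (the heuristic variant, not the reinforcement-learning one), the sketch below assumes a 1D tensor of per-task current losses and hypothetical estimates of the desired final losses, and turns the gaps into task weights. The paper's exact IGB algorithms may differ; this only shows the general shape of the heuristic.

    import torch

    def improvable_gap_weights(current_losses, target_losses, temperature=1.0):
        """Toy gap-based weighting: tasks whose current loss is farthest from its
        desired final loss receive larger weights (heuristic flavour only)."""
        gaps = torch.clamp(current_losses - target_losses, min=0.0)   # improvable gaps
        weights = torch.softmax(gaps / temperature, dim=0)            # normalise
        return weights * current_losses.numel()                       # mean weight ~ 1

    # usage with a 1D tensor of per-task losses:
    # w = improvable_gap_weights(losses.detach(), targets)
    # total_loss = (w * losses).sum()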

Automated Visual Monitoring of Nocturnal Insects with Light-based Camera Traps. (arXiv:2307.15433v1 [cs.CV])

Authors: Dimitri Korsch, Paul Bodesheim, Gunnar Brehm, Joachim Denzler

Automatic camera-assisted monitoring of insects for abundance estimations is crucial to understand and counteract ongoing insect decline. In this paper, we present two datasets of nocturnal insects, especially moths as a subset of Lepidoptera, photographed in Central Europe. One of the datasets, the EU-Moths dataset, was captured manually by citizen scientists and contains species annotations for 200 different species and bounding box annotations for those. We used this dataset to develop and evaluate a two-stage pipeline for insect detection and moth species classification in previous work. We further introduce a prototype for an automated visual monitoring system. This prototype produced the second dataset consisting of more than 27,000 images captured on 95 nights. For evaluation and bootstrapping purposes, we annotated a subset of the images with bounding boxes enframing nocturnal insects. Finally, we present first detection and classification baselines for these datasets and encourage other scientists to use this publicly available data.

ERCPMP: An Endoscopic Image and Video Dataset for Colorectal Polyps Morphology and Pathology. (arXiv:2307.15444v1 [eess.IV])

Authors: Mojgan Forootan, Mohsen Rajabnia, Ahmad R Mafi, Hamed Azhdari Tehrani, Erfan Ghadirzadeh, Mahziar Setayeshfar, Zahra Ghaffari, Mohammad Tashakoripour, Mohammad Reza Zali, Hamidreza Bolhasani

In recent years, artificial intelligence (AI) and its leading subfields, machine learning (ML) and deep learning (DL), along with their applications, have been spreading rapidly across various areas such as medicine. Today, the most important challenge in developing accurate algorithms for medical prediction, detection, diagnosis, treatment and prognosis is data. ERCPMP is an Endoscopic Image and Video Dataset for Recognition of Colorectal Polyps Morphology and Pathology. This dataset contains demographic, morphological and pathological data, endoscopic images and videos of 191 patients with colorectal polyps. Morphological data is included based on the latest international gastroenterology classification references such as the Paris, Pit and JNET classifications. Pathological data includes the diagnosis of the polyps, including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory and Adenocarcinoma with Dysplasia Grade & Differentiation. The current version of this dataset is published and available on Elsevier Mendeley Dataverse, and since it is under development, the latest version is accessible via: https://databiox.com.

Cross-Modal Concept Learning and Inference for Vision-Language Models. (arXiv:2307.15460v1 [cs.CV])

Authors: Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole-image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method improves upon the current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and by up to 1.3% on domain generalization.
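
A rough sketch of the concept-activation idea described above: score each image against a bank of text concepts with CLIP-style encoders and use the similarities as a discriminative representation. The encode_image and encode_text callables stand in for any CLIP implementation, and the cosine-similarity form is an assumption for illustration, not the paper's exact construction.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def concept_representation(images, concept_texts, encode_image, encode_text):
        """Score each image against a bank of text concepts and use the cosine
        similarities as a discriminative, concept-level representation."""
        img = F.normalize(encode_image(images), dim=-1)        # (B, D)
        txt = F.normalize(encode_text(concept_texts), dim=-1)  # (C, D)
        return img @ txt.t()                                   # (B, C) concept activations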

Defocus Blur Synthesis and Deblurring via Interpolation and Extrapolation in Latent Space. (arXiv:2307.15461v1 [eess.IV])

Authors: Ioana Mazilu, Shunxin Wang, Sven Dummer, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio

Though modern microscopes have an autofocusing system to ensure optimal focus, out-of-focus images can still occur when cells within the medium are not all in the same focal plane, affecting the image quality for medical diagnosis and analysis of diseases. We propose a method that can deblur images as well as synthesize defocus blur. We train autoencoders with implicit and explicit regularization techniques to enforce linearity relations among the representations of different blur levels in the latent space. This allows for the exploration of different blur levels of an object by linearly interpolating/extrapolating the latent representations of images taken at different focal planes. Compared to existing works, we use a simple architecture to synthesize images with flexible blur levels, leveraging the linear latent space. Our regularized autoencoders can effectively mimic blur and deblur, increasing data variety as a data augmentation technique and improving the quality of microscopic images, which would be beneficial for further processing and analysis.
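
The latent-space editing described above can be illustrated in a few lines: given an autoencoder whose latent space has been linearised across blur levels, interpolating between the codes of a sharp and a blurred view changes the blur level, and extrapolating deblurs or blurs further. The encoder/decoder interfaces below are assumptions for illustration, not the paper's specific architecture.

    import torch

    @torch.no_grad()
    def blur_level_edit(encoder, decoder, img_sharp, img_blurred, alpha):
        """Move along the line between the latent codes of a sharp and a blurred
        view of the same object. alpha in [0, 1] interpolates between the two;
        alpha < 0 extrapolates towards deblurring, alpha > 1 towards stronger blur."""
        z_sharp = encoder(img_sharp)
        z_blur = encoder(img_blurred)
        z = (1.0 - alpha) * z_sharp + alpha * z_blur   # linear edit in latent space
        return decoder(z)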

Local and Global Information in Obstacle Detection on Railway Tracks. (arXiv:2307.15478v1 [cs.CV])

Authors: Matthias Brucker, Andrei Cramariuc, Cornelius von Einem, Roland Siegwart, Cesar Cadena

Reliable obstacle detection on railways could help prevent collisions that result in injuries and potentially damage or derail the train. Unfortunately, generic object detectors do not have enough classes to account for all possible scenarios, and datasets featuring objects on railways are challenging to obtain. We propose utilizing a shallow network to learn railway segmentation from normal railway images. The limited receptive field of the network prevents overconfident predictions and allows the network to focus on the locally very distinct and repetitive patterns of the railway environment. Additionally, we explore the controlled inclusion of global information by learning to hallucinate obstacle-free images. We evaluate our method on a custom dataset featuring railway images with artificially augmented obstacles. Our proposed method outperforms other learning-based baseline methods.

Non-invasive Diabetes Detection using Gabor Filter: A Comparative Analysis of Different Cameras. (arXiv:2307.15480v1 [cs.CV])

Authors: Christina A. Garcia, Patricia Angela R. Abu, Rosula SJ. Reyes

This paper compares and explores the performance of both a mobile device camera and a laptop camera as convenient tools for capturing images for non-invasive detection of Diabetes Mellitus (DM) using facial block texture features. Participants within the age bracket of 20 to 79 years were chosen for the dataset. 12 MP and 7 MP mobile cameras and a laptop camera were used to take the photos under normal lighting conditions. Extracted facial blocks were classified using k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM). 100 images were captured, preprocessed, filtered using Gabor filters, and iterated. The performance of the system was measured in terms of accuracy, specificity, and sensitivity. The best performance of 96.7% accuracy, 100% sensitivity, and 93% specificity was achieved from the 12 MP back camera using SVM with 100 images.
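
For reference, Gabor filtering of a facial block can be sketched with OpenCV as below. The kernel size and filter parameters here are placeholders rather than the values used in the paper; the resulting statistics would then feed the k-NN or SVM classifier.

    import cv2
    import numpy as np

    def gabor_features(gray_block, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
        """Filter one grayscale facial block at several orientations and keep
        simple response statistics (placeholder parameters, for illustration)."""
        feats = []
        for theta in thetas:
            kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                        lambd=10.0, gamma=0.5, psi=0.0)
            response = cv2.filter2D(gray_block, cv2.CV_32F, kernel)
            feats.extend([response.mean(), response.std()])
        return np.array(feats)   # fed to the k-NN or SVM classifier downstream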

Improving Image Quality of Sparse-view Lung Cancer CT Images with a Convolutional Neural Network. (arXiv:2307.15506v1 [cs.CV])

Authors: Annika Ries, Tina Dorosti, Johannes Thalhammer, Daniel Sasse, Andreas Sauter, Felix Meurer, Ashley Benne, Franz Pfeiffer, Daniela Pfeiffer

Purpose: To improve the image quality of sparse-view computed tomography (CT) images with a U-Net for lung cancer detection and to determine the best trade-off between number of views, image quality, and diagnostic confidence.

Methods: CT images from 41 subjects (34 with lung cancer, seven healthy) were retrospectively selected (01.2016-12.2018) and forward projected onto 2048-view sinograms. Six corresponding sparse-view CT data subsets at varying levels of undersampling were reconstructed from sinograms using filtered backprojection with 16, 32, 64, 128, 256, and 512 views, respectively. A dual-frame U-Net was trained and evaluated for each subsampling level on 8,658 images from 22 diseased subjects. A representative image per scan was selected from 19 subjects (12 diseased, seven healthy) for a single-blinded reader study. The selected slices, for all levels of subsampling, with and without post-processing by the U-Net model, were presented to three readers. Image quality and diagnostic confidence were ranked using pre-defined scales. Subjective nodule segmentation was evaluated utilizing sensitivity (Se) and Dice Similarity Coefficient (DSC) with 95% confidence intervals (CI).

Results: The 64-projection sparse-view images resulted in Se = 0.89 and DSC = 0.81 [0.75,0.86] while their counterparts, post-processed with the U-Net, had improved metrics (Se = 0.94, DSC = 0.85 [0.82,0.87]). Fewer views lead to insufficient quality for diagnostic purposes. For increased views, no substantial discrepancies were noted between the sparse-view and post-processed images.

Conclusion: Projection views can be reduced from 2048 to 64 while maintaining image quality and the confidence of the radiologists on a satisfactory level.

Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation. (arXiv:2307.15514v1 [cs.CV])

Authors: Jaime Corsetti, Davide Boscaini, Fabio Poiesi

Recent works on 6D object pose estimation focus on learning keypoint correspondences between images and object models, and then determine the object pose through RANSAC-based algorithms or by directly regressing the pose with end-to-end optimisations. We argue that learning point-level discriminative features is overlooked in the literature. To this end, we revisit Fully Convolutional Geometric Features (FCGF) and tailor it for object 6D pose estimation to achieve state-of-the-art performance. FCGF employs sparse convolutions and learns point-level features using a fully-convolutional network by optimising a hardest contrastive loss. We can outperform recent competitors on popular benchmarks by adopting key modifications to the loss and to the input data representations, by carefully tuning the training strategies, and by employing data augmentations suitable for the underlying problem. We carry out a thorough ablation to study the contribution of each modification.

YOLOv8 for Defect Inspection of Hexagonal Directed Self-Assembly Patterns: A Data-Centric Approach. (arXiv:2307.15516v1 [cs.CV])

Authors: Enrique Dehaerne, Bappaditya Dey, Hossein Esfandiar, Lander Verstraete, Hyo Seon Suh, Sandip Halder, Stefan De Gendt

Shrinking pattern dimensions leads to an increased variety of defect types in semiconductor devices. This has spurred innovation in patterning approaches such as Directed self-assembly (DSA) for which no traditional, automatic defect inspection software exists. Machine Learning-based SEM image analysis has become an increasingly popular research topic for defect inspection with supervised ML models often showing the best performance. However, little research has been done on obtaining a dataset with high-quality labels for these supervised models. In this work, we propose a method for obtaining coherent and complete labels for a dataset of hexagonal contact hole DSA patterns while requiring minimal quality control effort from a DSA expert. We show that YOLOv8, a state-of-the-art neural network, achieves defect detection precisions of more than 0.9 mAP on our final dataset which best reflects DSA expert defect labeling expectations. We discuss the strengths and limitations of our proposed labeling approach and suggest directions for future work in data-centric ML-based defect inspection.

Few-shot Image Classification based on Gradual Machine Learning. (arXiv:2307.15524v1 [cs.CV])

Authors: Na Chen, Xianming Kuang, Feiyu Liu, Kehao Wang, Qun Chen

Few-shot image classification aims to accurately classify unlabeled images using only a few labeled samples. The state-of-the-art solutions are built by deep learning, which focuses on designing increasingly complex deep backbones. Unfortunately, the task remains very challenging due to the difficulty of transferring the knowledge learned in training classes to new ones. In this paper, we propose a novel approach based on the non-i.i.d. paradigm of gradual machine learning (GML). It begins with only a few labeled observations, and then gradually labels target images in increasing order of hardness by iterative factor inference in a factor graph. Specifically, our proposed solution extracts indicative feature representations with deep backbones, and then constructs both unary and binary factors based on the extracted features to facilitate gradual learning. The unary factors are constructed based on class-center distance in an embedding space, while the binary factors are constructed based on k-nearest neighbors. We have empirically validated the performance of the proposed approach on benchmark datasets by a comparative study. Our extensive experiments demonstrate that the proposed approach can improve the SOTA performance by 1-5% in terms of accuracy. More notably, it is more robust than the existing deep models in that its performance can consistently improve as the size of the query set increases, while the performance of deep models remains essentially flat or even becomes worse.
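
To make the factor construction concrete, the toy sketch below builds unary factors from class-center distances and binary factors from k-nearest-neighbour edges in the embedding space. It only illustrates the two factor types; the actual factor-graph inference and weighting in the paper are not reproduced here.

    import numpy as np

    def build_factors(features, class_centers, k=5):
        """Unary factors: negative distance of each embedding to each class center.
        Binary factors: edges between each image and its k nearest neighbours."""
        d_center = np.linalg.norm(features[:, None, :] - class_centers[None], axis=-1)
        unary = -d_center                                   # (n_images, n_classes)

        d_pair = np.linalg.norm(features[:, None, :] - features[None], axis=-1)
        np.fill_diagonal(d_pair, np.inf)                    # no self-edges
        binary = []
        for i in range(len(features)):
            for j in np.argsort(d_pair[i])[:k]:
                binary.append((i, int(j), float(np.exp(-d_pair[i, j]))))
        return unary, binary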

Backdoor Defense with Non-Adversarial Backdoor. (arXiv:2307.15539v1 [cs.LG])

Authors: Min Liu, Alberto Sangiovanni-Vincentelli, Xiangyu Yue

Deep neural networks (DNNs) are vulnerable to backdoor attack, which does not affect the network's performance on clean data but would manipulate the network behavior once a trigger pattern is added. Existing defense methods have greatly reduced attack success rate, but their prediction accuracy on clean data still lags behind a clean model by a large margin. Inspired by the stealthiness and effectiveness of backdoor attack, we propose a simple but highly effective defense framework which injects non-adversarial backdoors targeting poisoned samples. Following the general steps in backdoor attack, we detect a small set of suspected samples and then apply a poisoning strategy to them. The non-adversarial backdoor, once triggered, suppresses the attacker's backdoor on poisoned data, but has limited influence on clean data. The defense can be carried out during data preprocessing, without any modification to the standard end-to-end training pipeline. We conduct extensive experiments on multiple benchmarks with different architectures and representative attacks. Results demonstrate that our method achieves state-of-the-art defense effectiveness with by far the lowest performance drop on clean data. Considering the surprising defense ability displayed by our framework, we call for more attention to utilizing backdoor for backdoor defense. Code is available at https://github.com/damianliumin/non-adversarial_backdoor.

Panoptic Scene Graph Generation with Semantics-prototype Learning. (arXiv:2307.15567v1 [cs.CV])

Authors: Li Li, Wei Ji, Yiming Wu, Mengze Li, You Qin, Lina Wei, Roger Zimmermann

Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicates) to connect human language and visual scenes. However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e., different predicates for the same object pairs. Biased predicate annotations make PSG models struggle to construct a clear decision plane among predicates, which greatly hinders the real application of PSG models. To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To ensure consistency and accuracy during the transfer process, we propose to measure the invariance of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each representation and its prototype, and constantly screen potentially biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieves a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets.

Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding. (arXiv:2307.15569v1 [cs.CV])

Authors: Jiachen Kang, Wenjing Jia, Xiangjian He, Kin Man Lam

Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding, in addressing the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift allows PCExpert to leverage knowledge derived from the large-scale image modality in a more direct and deeper manner, via extensively sharing the parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with a novel pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the art in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) has already approached the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective and robust representation capability.

OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic Segmentation of Road Scenes. (arXiv:2307.15588v1 [cs.CV])

Authors: Fei Teng, Jiaming Zhang, Kunyu Peng, Kailun Yang, Yaonan Wang, Rainer Stiefelhagen

Light field cameras can provide rich angular and spatial information to enhance image semantic segmentation for scene understanding in the field of autonomous driving. However, the extensive angular information of light field cameras contains a large amount of redundant data, which is overwhelming for the limited hardware resources of intelligent vehicles. Besides, inappropriate compression leads to information corruption and data loss. To excavate representative information, we propose an Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and discovers the angular information from sub-aperture images to generate a semantically consistent result. To avoid feature loss during network propagation and simultaneously streamline the redundant information from the light field camera, we present a simple yet very effective Sub-Aperture Fusion Module (SAFM) to embed sub-aperture images into angular features without any additional memory cost. Furthermore, to address the mismatched spatial information across viewpoints, we present a Center Angular Rectification Module (CARM) that re-sorts features and prevents feature occlusion caused by asymmetric information. Our proposed OAFuser achieves state-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a new record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain of +4.53%. The source code of OAFuser will be made publicly available at https://github.com/FeiBryantkit/OAFuser.

Integrated Digital Reconstruction of Welded Components: Supporting Improved Fatigue Life Prediction. (arXiv:2307.15604v1 [cs.CV])

Authors: Anders Faarbæk Mikkelstrup, Morten Kristiansen

In the design of offshore jacket foundations, fatigue life is crucial. Post-weld treatment has been proposed to enhance the fatigue performance of welded joints, where particularly high-frequency mechanical impact (HFMI) treatment has been shown to improve fatigue performance significantly. Automated HFMI treatment has improved quality assurance and can lead to cost-effective design when combined with accurate fatigue life prediction. However, the finite element method (FEM), commonly used for predicting fatigue life in complex or multi-axial joints, relies on a basic CAD depiction of the weld, failing to consider the actual weld geometry and defects. Including the actual weld geometry in the FE model improves fatigue life prediction and possible crack location prediction but requires a digital reconstruction of the weld. Current digital reconstruction methods are time-consuming or require specialised scanning equipment and potential component relocation. The proposed framework instead uses an industrial manipulator combined with a line scanner to integrate digital reconstruction as part of the automated HFMI treatment setup. This approach applies standard image processing, simple filtering techniques, and non-linear optimisation for aligning and merging overlapping scans. A screened Poisson surface reconstruction finalises the 3D model to create a meshed surface. The outcome is a generic, cost-effective, flexible, and rapid method that enables generic digital reconstruction of welded parts, aiding in component design, overall quality assurance, and documentation of the HFMI treatment.

A Survey on Deep Learning in Medical Image Registration: New Technologies, Uncertainty, Evaluation Metrics, and Beyond. (arXiv:2307.15615v1 [eess.IV])

Authors: Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Shalini Subramanian, Aaron Carass, Jerry L. Prince, Yong Du

Over the past decade, deep learning technologies have greatly advanced the field of medical image registration. The initial developments, such as ResNet-based and U-Net-based networks, laid the groundwork for deep learning-driven image registration. Subsequent progress has been made in various aspects of deep learning-based registration, including similarity measures, deformation regularizations, and uncertainty estimation. These advancements have not only enriched the field of deformable image registration but have also facilitated its application in a wide range of tasks, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. In this paper, we present a comprehensive overview of the most recent advancements in deep learning-based image registration. We begin with a concise introduction to the core concepts of deep learning-based image registration. Then, we delve into innovative network architectures, loss functions specific to registration, and methods for estimating registration uncertainty. Additionally, this paper explores appropriate evaluation metrics for assessing the performance of deep learning models in registration tasks. Finally, we highlight the practical applications of these novel techniques in medical imaging and discuss the future prospects of deep learning-based image registration.

TriadNet: Sampling-free predictive intervals for lesional volume in 3D brain MR images. (arXiv:2307.15638v1 [eess.IV])

Authors: Benjamin Lambert, Florence Forbes, Senan Doyle, Michel Dojat

The volume of a brain lesion (e.g. infarct or tumor) is a powerful indicator of patient prognosis and can be used to guide the therapeutic strategy. Lesional volume estimation is usually performed by segmentation with deep convolutional neural networks (CNN), currently the state-of-the-art approach. However, to date, little work has been done to equip volume segmentation tools with adequate quantitative predictive intervals, which can hinder their usefulness and acceptance in clinical practice. In this work, we propose TriadNet, a segmentation approach relying on a multi-head CNN architecture, which provides both the lesion volumes and the associated predictive intervals simultaneously, in less than a second. We demonstrate its superiority over other solutions on BraTS 2021, a large-scale MRI glioblastoma image database.
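
One plausible reading of the multi-head design is three segmentation heads whose thresholded outputs give a lower bound, a point estimate, and an upper bound on the lesion volume. The sketch below follows that assumption, which is not spelled out in the abstract, so the head semantics are hypothetical.

    import torch

    @torch.no_grad()
    def volume_with_interval(lower_logits, mean_logits, upper_logits,
                             voxel_volume_ml, threshold=0.5):
        """Threshold each head's output and count foreground voxels to obtain a
        lower bound, a point estimate, and an upper bound on the lesion volume."""
        def vol(logits):
            mask = torch.sigmoid(logits) > threshold
            return mask.sum().item() * voxel_volume_ml
        return vol(lower_logits), vol(mean_logits), vol(upper_logits)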

CLIP Brings Better Features to Visual Aesthetics Learners. (arXiv:2307.15640v1 [cs.CV])

Authors: Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li

The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to its subjective and expensive labeling procedure. In this work, a unified and flexible two-phase \textbf{C}LIP-based \textbf{S}emi-supervised \textbf{K}nowledge \textbf{D}istillation paradigm is proposed, namely \textbf{\textit{CSKD}}. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via a feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetics learner for both student and teacher. In the second phase, the unlabeled data is also utilized in semi-supervised IAA learning to further boost student model performance when applied in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we notice an alleviation of the feature collapse issue, which in turn showcases the necessity of feature alignment instead of training directly based on the CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.

Scaling Data Generation in Vision-and-Language Navigation. (arXiv:2307.15644v1 [cs.CV])

Authors: Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which leverages 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to the previous SoTA) to a new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

Scale-aware Test-time Click Adaptation for Pulmonary Nodule and Mass Segmentation. (arXiv:2307.15645v1 [eess.IV])

Authors: Zhihao Li, Jiancheng Yang, Yongchao Xu, Li Zhang, Wenhui Dong, Bo Du

Pulmonary nodules and masses are crucial imaging features in lung cancer screening that require careful management in clinical diagnosis. Despite the success of deep learning-based medical image segmentation, the robust performance on various sizes of lesions of nodule and mass is still challenging. In this paper, we propose a multi-scale neural network with scale-aware test-time adaptation to address this challenge. Specifically, we introduce an adaptive Scale-aware Test-time Click Adaptation method based on effortlessly obtainable lesion clicks as test-time cues to enhance segmentation performance, particularly for large lesions. The proposed method can be seamlessly integrated into existing networks. Extensive experiments on both open-source and in-house datasets consistently demonstrate the effectiveness of the proposed method over some CNN and Transformer-based segmentation methods. Our code is available at https://github.com/SplinterLi/SaTTCA

Multi-layer Aggregation as a key to feature-based OOD detection. (arXiv:2307.15647v1 [cs.CV])

Authors: Benjamin Lambert, Florence Forbes, Senan Doyle, Michel Dojat

Deep Learning models are easily disturbed by variations in the input images that were not observed during the training stage, resulting in unpredictable predictions. Detecting such Out-of-Distribution (OOD) images is particularly crucial in the context of medical image analysis, where the range of possible abnormalities is extremely wide. Recently, a new category of methods has emerged, based on the analysis of the intermediate features of a trained model. These methods can be divided into 2 groups: single-layer methods that consider the feature map obtained at a fixed, carefully chosen layer, and multi-layer methods that consider the ensemble of the feature maps generated by the model. While promising, a proper comparison of these algorithms is still lacking. In this work, we compared various feature-based OOD detection methods on a large spectrum of OOD settings (20 types), representing approximately 7800 3D MRIs. Our experiments shed light on two phenomena. First, multi-layer methods consistently outperform single-layer approaches, which tend to have inconsistent behaviour depending on the type of anomaly. Second, the OOD detection performance highly depends on the architecture of the underlying neural network.
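
A generic example of the multi-layer idea: compute a per-layer deviation from in-distribution feature statistics and aggregate across layers. The specific single- and multi-layer methods compared in the paper differ; this sketch only shows the aggregation pattern and assumes pre-computed per-layer statistics.

    import torch

    @torch.no_grad()
    def multilayer_ood_score(feature_maps, layer_stats, eps=1e-6):
        """Aggregate per-layer deviations from in-distribution statistics.
        feature_maps: list of (C, H, W) tensors from a trained model;
        layer_stats: list of (mean, std) pairs of pooled features, estimated on
        in-distribution training data."""
        scores = []
        for fmap, (mu, std) in zip(feature_maps, layer_stats):
            pooled = fmap.mean(dim=(-2, -1)).flatten()         # global average pooling
            scores.append(((pooled - mu).abs() / (std + eps)).mean())
        return torch.stack(scores).mean()                      # higher => more likely OOD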

TrackAgent: 6D Object Tracking via Reinforcement Learning. (arXiv:2307.15671v1 [cs.CV])

Authors: Konstantin Röhrl, Dominik Bauer, Timothy Patten, Markus Vincze

Tracking an object's 6D pose, while either the object itself or the observing camera is moving, is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptually complex or rely on heuristic motion models. In comparison, we propose to simplify object tracking to a reinforced point cloud (depth only) alignment task. This allows us to train a streamlined approach from scratch with limited amounts of sparse 3D point clouds, compared to the large datasets of diverse RGBD sequences required in previous works. We incorporate temporal frame-to-frame registration with object-based recovery by frame-to-model refinement using a reinforcement learning (RL) agent that jointly solves for both objectives. We also show that the RL agent's uncertainty and a rendering-based mask propagation are effective reinitialization triggers.

PatchMixer: Rethinking network design to boost generalization for 3D point cloud understanding. (arXiv:2307.15692v1 [cs.CV])

Authors: Davide Boscaini, Fabio Poiesi

The recent trend in deep learning methods for 3D point cloud understanding is to propose increasingly sophisticated architectures either to better capture 3D geometries or by introducing possibly undesired inductive biases. Moreover, prior works introducing novel architectures compared their performance on the same domain, devoting less attention to their generalization to other domains. We argue that the ability of a model to transfer the learnt knowledge to different domains is an important feature that should be evaluated to exhaustively assess the quality of a deep network architecture. In this work we propose PatchMixer, a simple yet effective architecture that extends the ideas behind the recent MLP-Mixer paper to 3D point clouds. The novelties of our approach are the processing of local patches instead of the whole shape to promote robustness to partial point clouds, and the aggregation of patch-wise features using an MLP as a simpler alternative to the graph convolutions or the attention mechanisms that are used in prior works. We evaluated our method on the shape classification and part segmentation tasks, achieving superior generalization performance compared to a selection of the most relevant deep architectures.

SimDETR: Simplifying self-supervised pretraining for DETR. (arXiv:2307.15697v1 [cs.CV])

Authors: Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos

DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to be helpful to alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector's performance. However, existing methods have their own limitations, like keeping the detector's backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works on both the full and low data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR.

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking. (arXiv:2307.15700v1 [cs.CV])

Authors: Ruopeng Gao, Limin Wang

As a video task, Multi-Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9\% and 13.0\% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at \href{https://github.com/MCG-NJU/MeMOTR}{https://github.com/MCG-NJU/MeMOTR}.

Semi-Supervised Object Detection in the Open World. (arXiv:2307.15710v1 [cs.CV])

Authors: Garvita Allabadi, Ana Lucic, Peter Pao-Huang, Yu-Xiong Wang, Vikram Adve

Existing approaches for semi-supervised object detection assume a fixed set of classes present in training and unlabeled datasets, i.e., in-distribution (ID) data. The performance of these techniques significantly degrades when they are deployed in the open world, due to the fact that the unlabeled and test data may contain objects that were not seen during training, i.e., out-of-distribution (OOD) data. The two key questions that we explore in this paper are: can we detect these OOD samples and, if so, can we learn from them? With these considerations in mind, we propose the Open World Semi-supervised Detection framework (OWSSD) that effectively detects OOD data along with a semi-supervised learning pipeline that learns from both ID and OOD data. We introduce an ensemble-based OOD detector consisting of lightweight auto-encoder networks trained only on ID data. Through extensive evaluation, we demonstrate that our method performs competitively against state-of-the-art OOD detection algorithms and also significantly boosts the semi-supervised learning performance in open-world scenarios.

End-to-end Learning of a Fisher Vector Encoding for Part Features in Fine-grained Recognition. (arXiv:2007.02080v2 [cs.CV] UPDATED)

Authors: Dimitri Korsch, Paul Bodesheim, Joachim Denzler

Part-based approaches for fine-grained recognition do not show the expected performance gain over global methods, although explicitly focusing on small details that are relevant for distinguishing highly similar classes. We assume that part-based methods suffer from a missing representation of local features, which is invariant to the order of parts and can handle a varying number of visible parts appropriately. The order of parts is artificial and often only given by ground-truth annotations, whereas viewpoint variations and occlusions result in not observable parts. Therefore, we propose integrating a Fisher vector encoding of part features into convolutional neural networks. The parameters for this encoding are estimated by an online EM algorithm jointly with those of the neural network and are more precise than the estimates of previous works. Our approach improves state-of-the-art accuracies for three bird species classification datasets.

Challenges of 3D Surface Reconstruction in Capsule Endoscopy. (arXiv:2103.10390v4 [cs.CV] UPDATED)

Authors: Olivier Rukundo

Essential for improving the accuracy and reliability of bowel cancer screening, three-dimensional (3D) surface reconstruction using capsule endoscopy (CE) images remains challenging due to CE hardware and software limitations. This report generally focuses on challenges associated with 3D visualization and specifically investigates the impact of the indeterminate selection of the angle of the line of sight on 3D surfaces. Furthermore, it demonstrates that impact through 3D surfaces viewed at the same azimuth angles and different elevation angles of the line of sight. The report concludes that 3D printing of reconstructed 3D surfaces can potentially overcome line of sight indeterminate selection and 2D screen visual restriction-related errors.

Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning. (arXiv:2112.03906v3 [cs.CV] UPDATED)

Authors: Srijan Das, Michael S. Ryoo

Contrastive representation learning of videos relies heavily on the availability of millions of unlabelled videos. This is practical for videos available on the web, but acquiring videos at such a scale for real-world applications is very expensive and laborious.

Therefore, in this paper, we focus on designing video augmentations for self-supervised learning. We first analyze the best strategy to mix videos to create a new augmented video sample. Then, the question remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy, STC-mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video representations. We conduct thorough experiments for two downstream tasks: action recognition and video retrieval on two small-scale video datasets, UCF101 and HMDB51. We also demonstrate the effectiveness of our STC-mix on the NTU dataset, where domain knowledge is limited.

We show that the performance of our STC-mix on both the downstream tasks is on par with the other self-supervised approaches while requiring less training data.

Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results. (arXiv:2209.14272v2 [cs.LG] UPDATED)

Authors: Lukas Christ, Shahin Amiriparian, Alexander Kathan, Niklas Müller, Andreas König, Björn W. Schuller

Humour is a substantial element of human affect and cognition. Its automatic understanding can facilitate a more naturalistic human-device interaction and the humanisation of artificial intelligence. Current methods of humour detection are solely based on staged data, making them inadequate for 'real-world' applications. We address this deficiency by introducing the novel Passau-Spontaneous Football Coach Humour (Passau-SFCH) dataset, comprising about 11 hours of recordings. The Passau-SFCH dataset is annotated for the presence of humour and its dimensions (sentiment and direction) as proposed in Martin's Humor Style Questionnaire. We conduct a series of experiments, employing pretrained Transformers, convolutional neural networks, and expert-designed features. The performance of each modality (text, audio, video) for spontaneous humour recognition is analysed and their complementarity is investigated. Our findings suggest that for the automatic analysis of humour and its sentiment, facial expressions are most promising, while humour direction can be best modelled via text-based features. The results reveal considerable differences among various subjects, highlighting the individuality of humour usage and style. Further, we observe that a decision-level fusion yields the best recognition result. Finally, we make our code publicly available at https://www.github.com/EIHW/passau-sfch. The Passau-SFCH dataset is available upon request.

3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers. (arXiv:2211.14710v3 [cs.CV] UPDATED)

Authors: Changyong Shu, Jiajun Deng, Fisher Yu, Yifan Liu

Transformer-based methods have swept the benchmarks on 2D and 3D detection on images. Because tokenization before the attention mechanism drops the spatial information, positional encoding becomes critical for those methods. Recent works found that encodings based on samples of the 3D viewing rays can significantly improve the quality of multi-camera 3D object detection. We hypothesize that 3D point locations can provide more information than rays. Therefore, we introduce 3D point positional encoding, 3DPPE, to the 3D detection Transformer decoder. Although 3D measurements are not available at the inference time of monocular 3D object detection, 3DPPE uses predicted depth to approximate the real point positions. Our hybrid-depth module combines direct and categorical depth to estimate the refined depth of each pixel. Despite the approximation, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes dataset, significantly outperforming encodings based on ray samples. We make the codes available at https://github.com/drilistbox/3DPPE.
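
The core idea of a 3D point positional encoding can be sketched as lifting each pixel to a 3D point with its predicted depth and applying a sinusoidal encoding. The exact encoding and the hybrid-depth module in 3DPPE may differ from this illustration; the interfaces below are assumptions.

    import math
    import torch

    def point_positional_encoding(pixel_uv, depth, intrinsics_inv, num_freqs=10):
        """Lift pixels to 3D points with (predicted) depth, then encode them.
        pixel_uv: (N, 2) pixel coordinates, depth: (N,), intrinsics_inv: (3, 3)."""
        ones = torch.ones(pixel_uv.shape[0], 1)
        rays = (intrinsics_inv @ torch.cat([pixel_uv, ones], dim=1).t()).t()  # (N, 3)
        points = rays * depth.unsqueeze(1)                                    # (N, 3)
        freqs = (2.0 ** torch.arange(num_freqs)) * math.pi                    # (F,)
        angles = points.unsqueeze(-1) * freqs                                 # (N, 3, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)     # (N, 6F)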

Minimum Latency Deep Online Video Stabilization. (arXiv:2212.02073v2 [cs.CV] UPDATED)

Authors: Zhuofan Zhang, Zhen Liu, Ping Tan, Bing Zeng, Shuaicheng Liu

We present a novel camera path optimization framework for the task of online video stabilization. Typically, a stabilization pipeline consists of three steps: motion estimation, path smoothing, and novel view rendering. Most previous methods concentrate on motion estimation, proposing various global or local motion models. In contrast, path optimization receives relatively less attention, especially in the important online setting, where no future frames are available. In this work, we adopt recent off-the-shelf high-quality deep motion models for motion estimation to recover the camera trajectory and focus on the latter two steps. Our network takes a short 2D camera path in a sliding window as input and outputs the stabilizing warp field of the last frame in the window, which warps the coming frame to its stabilized position. A hybrid loss is defined to constrain the spatial and temporal consistency. In addition, we build a motion dataset that contains stable and unstable motion pairs for training. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art online methods both qualitatively and quantitatively and achieves comparable performance to offline methods. Our code and dataset are available at https://github.com/liuzhen03/NNDVS.

Complex-valued Retrievals From Noisy Images Using Diffusion Models. (arXiv:2212.03235v3 [cs.CV] UPDATED)

Authors: Nadav Torem, Roi Ronen, Yoav Y. Schechner, Michael Elad

In diverse microscopy modalities, sensors measure only real-valued intensities. Additionally, the sensor readouts are affected by Poissonian-distributed photon noise. Traditional restoration algorithms typically aim to minimize the mean squared error (MSE) between the original and recovered images. This often leads to blurry outcomes with poor perceptual quality. Recently, deep diffusion models (DDMs) have proven to be highly capable of sampling images from the a-posteriori probability of the sought variables, resulting in visually pleasing high-quality images. These models have mostly been suggested for real-valued images suffering from Gaussian noise. In this study, we generalize annealed Langevin Dynamics, a type of DDM, to tackle the fundamental challenges in optical imaging of complex-valued objects (and real images) affected by Poisson noise. We apply our algorithm to various optical scenarios, such as Fourier Ptychography, Phase Retrieval, and Poisson denoising. Our algorithm is evaluated on simulations and biological empirical data.

Audiovisual Masked Autoencoders. (arXiv:2212.05922v2 [cs.CV] UPDATED)

Authors: Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.

Self-Supervised Pre-training for 3D Point Clouds via View-Specific Point-to-Image Translation. (arXiv:2212.14197v3 [cs.CV] UPDATED)

Authors: Qijian Zhang, Junhui Hou

The past few years have witnessed the great success and prevalence of self-supervised representation learning within the language and 2D vision communities. However, such advancements have not been fully migrated to the field of 3D point cloud learning. Different from existing pre-training paradigms designed for deep point cloud feature extractors that fall into the scope of generative modeling or contrastive learning, this paper proposes a translative pre-training framework, namely PointVST, driven by a novel self-supervised pretext task of cross-modal translation from 3D point clouds to their corresponding diverse forms of 2D rendered images. More specifically, we begin by deducing view-conditioned point-wise embeddings through the insertion of the viewpoint indicator, and then adaptively aggregate a view-specific global codeword, which can be further fed into subsequent 2D convolutional translation heads for image generation. Extensive experimental evaluations on various downstream task scenarios demonstrate that our PointVST shows consistent and prominent performance superiority over current state-of-the-art approaches as well as satisfactory domain transfer capability. Our code will be publicly available at https://github.com/keeganhk/PointVST.

Hybrid Open-set Segmentation with Synthetic Negative Data. (arXiv:2301.08555v2 [cs.CV] UPDATED)

Authors: Matej Grcić, Siniša Šegvić

Open-set segmentation is often conceived by complementing closed-set classification with anomaly detection. Existing dense anomaly detectors operate either through generative modelling of regular training data or by discriminating with respect to negative training data. These two approaches optimize different objectives and therefore exhibit different failure modes. Consequently, we propose the first dense hybrid anomaly score that fuses generative and discriminative cues. The proposed score can be efficiently implemented by upgrading any semantic segmentation model with dense estimates of data likelihood and dataset posterior. Our design is a remarkably good fit for efficient inference on large images due to negligible computational overhead over the closed-set baseline. The resulting dense hybrid open-set models require negative training images that can be sampled from an auxiliary negative dataset, from a jointly trained generative model, or from a mixture of both sources. We evaluate our contributions on benchmarks for dense anomaly detection and open-set segmentation. The experiments reveal strong open-set performance in spite of negligible computational overhead.
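
One plausible way to read "fusing generative and discriminative cues" in code is sketched below: it combines a dense data log-likelihood with the dense posterior of the negative (outlier) data. The exact score in the paper may be parameterized differently, and the function name is ours.

    import torch
    import torch.nn.functional as F

    def hybrid_anomaly_score(log_px, outlier_logit):
        """log_px:        (H, W) dense log-likelihood of each pixel under the training data
           outlier_logit: (H, W) dense logit of the negative-data posterior
           The score is high where a pixel is unlikely under the training data AND
           likely under the negative distribution."""
        return F.logsigmoid(outlier_logit) - log_px

    score = hybrid_anomaly_score(torch.randn(64, 64), torch.randn(64, 64))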

VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval. (arXiv:2302.06350v3 [cs.CV] UPDATED)

Authors: Yan Gong, Georgina Cosma, Axel Finke

The relations expressed in user queries are vital for cross-modal information retrieval. Relation-focused cross-modal retrieval aims to retrieve information that corresponds to these relations, enabling effective retrieval across different modalities. Pre-trained networks, such as Contrastive Language-Image Pre-training (CLIP), have gained significant attention and acclaim for their exceptional performance in various cross-modal learning tasks. However, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on image region relations. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations based on a local encoder. VITR comprises two key components. Firstly, it extends the capabilities of ViT-based cross-modal networks by enabling them to extract and reason with region relations present in images. Secondly, VITR incorporates a fusion module that combines the reasoned results with global knowledge to predict similarity scores between images and descriptions. The proposed VITR network was evaluated through experiments on the tasks of relation-focused cross-modal information retrieval. The results derived from the analysis of the RefCOCOg, CLEVR, and Flickr30K datasets demonstrated that the proposed VITR network consistently outperforms state-of-the-art networks in image-to-text and text-to-image retrieval.

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection. (arXiv:2302.08231v3 [cs.CV] UPDATED)

Authors: Jongwoo Park, Apoorv Singh, Varun Bankiti

3D visual perception tasks based on multi-camera images are essential for autonomous driving systems. Recent work in this field performs 3D object detection by leveraging multi-view images as an input and iteratively enhancing object queries (object proposals) by cross-attending multi-view features. However, individual backbone features are not updated with multi-view information and remain a mere collection of single-image backbone outputs. Therefore we propose 3M3D, a multi-view, multi-path, multi-representation framework for 3D object detection, where we update both multi-view features and query features to enhance the representation of the scene in both fine panoramic view and coarse global view. Firstly, we update multi-view features by multi-view axis self-attention, which incorporates panoramic information into the multi-view features and enhances understanding of the global scene. Secondly, we update multi-view features by self-attention over ROI (Region of Interest) windows, which encodes finer local details and helps exchange information not only along the multi-view axis but also along the spatial dimensions. Lastly, we leverage multiple representations of queries in different domains to further boost performance. Here we use sparse floating queries along with dense BEV (Bird's Eye View) queries, which are later post-processed to filter duplicate detections. Moreover, we show performance improvements on the nuScenes benchmark dataset on top of our baselines.
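
The multi-view axis self-attention step can be sketched as follows (tensor layout and module names are illustrative, not taken from the paper's code): for every spatial location, the camera views are treated as the token sequence, so information is exchanged across views.

    import torch
    import torch.nn as nn

    B, V, C, H, W = 2, 6, 256, 16, 44                 # batch, views, channels, height, width
    feats = torch.randn(B, V, C, H, W)                # multi-view backbone features

    view_attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

    tokens = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, V, C)   # sequence axis = views
    updated, _ = view_attn(tokens, tokens, tokens)                   # exchange information across views
    feats = updated.reshape(B, H, W, V, C).permute(0, 3, 4, 1, 2)    # back to (B, V, C, H, W)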

Deep Neural Networks based Meta-Learning for Network Intrusion Detection. (arXiv:2302.09394v2 [cs.LG] UPDATED)

Authors: Anabia Sohail, Bibi Ayisha, Irfan Hameed, Muhammad Mohsin Zafar, Hani Alquhayz, Asifullah Khan

The digitization of different components of industry and inter-connectivity among indigenous networks have increased the risk of network attacks. Designing an intrusion detection system to ensure security of the industrial ecosystem is difficult as network traffic encompasses various attack types, including new and evolving ones with minor changes. The data used to construct a predictive model for computer networks has a skewed class distribution and limited representation of attack types, which differ from real network traffic. These limitations result in dataset shift, negatively impacting the machine learning models' predictive abilities and reducing the detection rate against novel attacks. To address these challenges, we propose a novel deep neural network based meta-learning framework, INformation FUsion and Stacking Ensemble (INFUSE), for network intrusion detection. First, a hybrid feature space is created by integrating decision and feature spaces. Five different classifiers are utilized to generate a pool of decision spaces. The feature space is then enriched through a deep sparse autoencoder that learns the semantic relationships between attacks. Finally, the deep meta-learner acts as an ensemble combiner to analyze the hybrid feature space and make a final decision. Our evaluation on stringent benchmark datasets and comparison to existing techniques showed the effectiveness of INFUSE with an F-Score of 0.91, Accuracy of 91.6%, and Recall of 0.94 on the Test+ dataset, and an F-Score of 0.91, Accuracy of 85.6%, and Recall of 0.87 on the stringent Test-21 dataset. These promising results indicate strong generalization capability and the potential to detect novel network attacks.
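
A heavily simplified stacking sketch of the idea (a toy stand-in, not INFUSE itself): base classifiers form a decision space that is concatenated with a feature space and handed to a meta-learner. The paper uses five base models, a deep sparse autoencoder for the feature space, and a deep meta-learner; a proper pipeline would also use out-of-fold predictions rather than refitting on the same data.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    base_models = [RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)]

    # Decision space: stacked class-probability outputs of the base classifiers.
    decision_space = np.hstack([m.fit(X, y).predict_proba(X) for m in base_models])
    hybrid_space = np.hstack([decision_space, X])          # decision space + feature space
    meta_learner = MLPClassifier(max_iter=500, random_state=0).fit(hybrid_space, y)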

Depth Estimation and Image Restoration by Deep Learning from Defocused Images. (arXiv:2302.10730v2 [cs.CV] UPDATED)

Authors: Saqib Nazir, Lorenzo Vaquero, Manuel Mucientes, Víctor M. Brea, Daniela Coltuc

Monocular depth estimation and image deblurring are two fundamental tasks in computer vision, given their crucial role in understanding 3D scenes. Performing either of them by relying on a single image is an ill-posed problem. The recent advances in the field of Deep Convolutional Neural Networks (DNNs) have revolutionized many tasks in computer vision, including depth estimation and image deblurring. When it comes to using defocused images, the depth estimation and the recovery of the All-in-Focus (AiF) image become related problems due to defocus physics. Despite this, most of the existing models treat them separately. There are, however, recent models that solve these problems simultaneously by concatenating two networks in a sequence to first estimate the depth or defocus map and then reconstruct the focused image based on it. We propose a DNN that solves the depth estimation and image deblurring in parallel. Our Two-headed Depth Estimation and Deblurring Network (2HDED:NET) extends a conventional Depth from Defocus (DFD) network with a deblurring branch that shares the same encoder as the depth branch. The proposed method has been successfully tested on two benchmarks, one for indoor and the other for outdoor scenes: NYU-v2 and Make3D. Extensive experiments with 2HDED:NET on these benchmarks have demonstrated performance superior or close to that of the state-of-the-art models for depth estimation and image deblurring.
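
The two-headed layout can be sketched in a few lines (a minimal stand-in, not the actual 2HDED:NET architecture): a shared encoder feeds both a depth decoder and an all-in-focus decoder.

    import torch
    import torch.nn as nn

    class TwoHeadedNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
            self.depth_head = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))
            self.deblur_head = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

        def forward(self, defocused):
            z = self.encoder(defocused)                    # shared representation
            return self.depth_head(z), self.deblur_head(z)

    depth, all_in_focus = TwoHeadedNet()(torch.randn(1, 3, 64, 64))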

Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework. (arXiv:2302.12247v3 [cs.LG] UPDATED)

Authors: Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, Ruslan Salakhutdinov, Louis-Philippe Morency

The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, there remain fundamental research questions: How can we quantify the interactions that are necessary to solve a multimodal task? Subsequently, what are the most suitable multimodal models to capture these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. We term these three measures the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimations are compared with human annotations. Finally, we demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies engaging with domain experts in pathology, mood prediction, and robotic perception where our framework helps to recommend strong multimodal models for each application.
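
For readers unfamiliar with partial information decomposition, the three measures obey the standard bookkeeping identities that such estimators target:

    $I(X_1, X_2; Y) = R + U_1 + U_2 + S, \qquad I(X_1; Y) = R + U_1, \qquad I(X_2; Y) = R + U_2,$

where $R$ is redundancy, $U_1$ and $U_2$ are the unique contributions of each modality, and $S$ is synergy.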

Optimizing Convolutional Neural Networks for Chronic Obstructive Pulmonary Disease Detection in Clinical Computed Tomography Imaging. (arXiv:2303.07189v2 [eess.IV] UPDATED)

Authors: Tina Dorosti, Manuel Schultheiss, Felix Hofmann, Johannes Thalhammer, Luisa Kirchner, Theresa Urban, Franz Pfeiffer, Florian Schaff, Tobias Lasser, Daniela Pfeiffer

Purpose: To optimize the binary detection of Chronic Obstructive Pulmonary Disease (COPD) based on emphysema presence in the lung with convolutional neural networks (CNN) by exploring manually adjusted versus automated window-setting optimization (WSO) on computed tomography (CT) images.

Methods: 7,194 CT images (3,597 with COPD; 3,597 healthy controls) from 78 subjects (43 with COPD; 35 healthy controls) were selected retrospectively (10.2018-12.2019) and preprocessed. For each image, intensity values were manually clipped to the emphysema window setting and a baseline 'full-range' window setting. Class-balanced train, validation, and test sets contained 3,392, 1,114, and 2,688 images. The network backbone was optimized by comparing various CNN architectures. Furthermore, automated WSO was implemented by adding a customized layer to the model. The image-level area under the Receiver Operating Characteristics curve (AUC) [lower, upper limit 95% confidence] and P-values calculated from one-sided Mann-Whitney U-test were utilized to compare model variations.

Results: Repeated inference (n=7) on the test set showed that the DenseNet was the most efficient backbone and achieved a mean AUC of 0.80 [0.76, 0.85] without WSO. In comparison, with input images manually adjusted to the emphysema window, the DenseNet model predicted COPD with a mean AUC of 0.86 [0.82, 0.89] (P=0.03). By adding a customized WSO layer to the DenseNet, an optimal window in the proximity of the emphysema window setting was learned automatically, and a mean AUC of 0.82 [0.78, 0.86] was achieved.

Conclusion: Detection of COPD with DenseNet models was improved by WSO of CT data to the emphysema window setting range.
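
A minimal sketch of what a trainable window-setting layer can look like is given below: a sigmoid soft-window with a learnable level and width placed in front of the CNN. The parameterization and the initial Hounsfield-unit values are illustrative, not taken from the paper.

    import torch
    import torch.nn as nn

    class WindowSettingLayer(nn.Module):
        def __init__(self, level_hu=-800.0, width_hu=600.0):   # illustrative initial values
            super().__init__()
            self.level = nn.Parameter(torch.tensor(level_hu))  # window level (HU)
            self.width = nn.Parameter(torch.tensor(width_hu))  # window width (HU)

        def forward(self, hu_image):
            # Soft clip: maps intensities around the window to (0, 1) with a smooth,
            # differentiable transition, so level and width receive gradients.
            return torch.sigmoid((hu_image - self.level) / (self.width / 4.0 + 1e-6))

    windowed = WindowSettingLayer()(torch.randn(1, 1, 128, 128) * 1000.0)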

PanoVPR: Towards Unified Perspective-to-Equirectangular Visual Place Recognition via Sliding Windows across the Panoramic View. (arXiv:2303.14095v2 [cs.CV] UPDATED)

Authors: Ze Shi, Hao Shi, Kailun Yang, Zhe Yin, Yining Lin, Kaiwei Wang

Visual place recognition has gained significant attention in recent years as a crucial technology in autonomous driving and robotics. Currently, the two main approaches are the perspective view retrieval (P2P) paradigm and the equirectangular image retrieval (E2E) paradigm. However, it is practical and natural to assume that users only have consumer-grade pinhole cameras to obtain query perspective images and retrieve them in panoramic database images from map providers. To address this, we propose \textit{PanoVPR}, a perspective-to-equirectangular (P2E) visual place recognition framework that employs sliding windows to eliminate feature truncation caused by hard cropping. Specifically, PanoVPR slides windows over the entire equirectangular image and computes feature descriptors for each window, which are then compared to determine place similarity. Notably, our unified framework enables direct transfer of the backbone from P2P methods without any modification, supporting not only CNNs but also Transformers. To facilitate training and evaluation, we derive the Pitts250k-P2E dataset from Pitts250k and establish YQ360, the latter being the first P2E visual place recognition dataset collected by a mobile robot platform, aiming to better simulate real-world task scenarios. Extensive experiments demonstrate that PanoVPR achieves state-of-the-art performance and obtains 3.8% and 8.0% performance gains on Pitts250k-P2E and YQ360 compared to the previous best method, respectively. Code and datasets will be publicly available at https://github.com/zafirshi/PanoVPR.
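
The sliding-window matching itself is easy to sketch (conceptual only; describe() is a placeholder for the shared backbone, and the descriptor, stride, and aggregation details differ in the paper):

    import torch
    import torch.nn.functional as F

    def describe(image):                       # placeholder descriptor: pooled channels
        return F.normalize(image.mean(dim=(-2, -1)), dim=-1)

    def p2e_similarity(query_img, pano_img, win_w=128, stride=64):
        q = describe(query_img)
        C, H, W = pano_img.shape
        pano = torch.cat([pano_img, pano_img[..., :win_w]], dim=-1)   # wrap around 360 degrees
        sims = [torch.dot(q, describe(pano[..., x0:x0 + win_w])) for x0 in range(0, W, stride)]
        return max(sims)                       # best-matching window defines the place similarity

    score = p2e_similarity(torch.randn(3, 128, 128), torch.randn(3, 256, 1024))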

Near Field iToF LIDAR Depth Improvement from Limited Number of Shots. (arXiv:2304.07047v2 [cs.RO] UPDATED)

Authors: Mena Nagiub, Thorsten Beuth, Ganesh Sistu, Heinrich Gotzig, Ciarán Eising

Indirect Time of Flight LiDARs can indirectly calculate the scene's depth from the phase shift angle between transmitted and received laser signals with amplitudes modulated at a predefined frequency. Unfortunately, this method generates ambiguity in calculated depth when the phase shift angle value exceeds $2\pi$. Current state-of-the-art methods use raw samples generated using two distinct modulation frequencies to overcome this ambiguity problem. However, this comes at the cost of increasing laser components' stress and raising their temperature, which reduces their lifetime and increases power consumption. In our work, we study two different methods to recover the entire depth range of the LiDAR using fewer raw data sample shots from a single modulation frequency with the support of the sensor's grayscale output to reduce the laser components' stress and power consumption.
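
For context, the phase-to-depth relation behind this ambiguity is

    $d = \frac{c}{4\pi f}\,\left(\varphi + 2\pi k\right), \qquad k \in \{0, 1, 2, \dots\},$

so a single modulation frequency $f$ determines depth only up to the unambiguous range $c/(2f)$; the unknown wrap count $k$ is what the studied methods recover from fewer raw shots with the help of the grayscale output.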

Self-supervised Image Denoising with Downsampled Invariance Loss and Conditional Blind-Spot Network. (arXiv:2304.09507v2 [eess.IV] UPDATED)

Authors: Yeong Il Jang, Keuntek Lee, Gu Yong Park, Seyun Kim, Nam Ik Cho

There have been many image denoisers using deep neural networks, which outperform conventional model-based methods by large margins. Recently, self-supervised methods have attracted attention because constructing a large real noise dataset for supervised training is an enormous burden. The most representative self-supervised denoisers are based on blind-spot networks, which exclude the receptive field's center pixel. However, excluding any input pixel abandons some information, especially when the input pixel at the corresponding output position is excluded. In addition, a standard blind-spot network fails to reduce real camera noise due to the pixel-wise correlation of noise, though it successfully removes independently distributed synthetic noise. Hence, to realize a more practical denoiser, we propose a novel self-supervised training framework that can remove real noise. For this, we derive the theoretic upper bound of a supervised loss where the network is guided by the downsampled blinded output. Also, we design a conditional blind-spot network (C-BSN), which selectively controls the blindness of the network to use the center pixel information. Furthermore, we exploit a random subsampler to decorrelate noise spatially, making the C-BSN free of visual artifacts that were often seen in downsample-based methods. Extensive experiments show that the proposed C-BSN achieves state-of-the-art performance on real-world datasets as a self-supervised denoiser and shows qualitatively pleasing results without any post-processing or refinement.

OOD-CV-v2: An extended Benchmark for Robustness to Out-of-Distribution Shifts of Individual Nuisances in Natural Images. (arXiv:2304.10266v2 [cs.CV] UPDATED)

Authors: Bingchen Zhao, Jiahao Wang, Wufei Ma, Artur Jesslen, Siwei Yang, Shaozuo Yu, Oliver Zendel, Christian Theobalt, Alan Yuille, Adam Kortylewski

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV-v2, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context, and weather conditions, and enables benchmarking of models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1) Some nuisance factors have a much stronger negative effect on the performance compared to others, also depending on the vision task. 2) Current approaches to enhance robustness have only marginal effects, and can even reduce robustness. 3) We do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich test bed to study robustness and will help push forward research in this area.

Our dataset can be accessed from https://bzhao.me/OOD-CV/

Adversarial Infrared Blocks: A Multi-view Black-box Attack to Thermal Infrared Detectors in Physical World. (arXiv:2304.10712v4 [cs.CV] UPDATED)

Authors: Chengyin Hu, Weiwen Shi, Tingsong Jiang, Wen Yao, Ling Tian, Xiaoqian Chen

Infrared imaging systems have a vast array of potential applications in pedestrian detection and autonomous driving, and their safety performance is of great concern. However, few studies have explored the safety of infrared imaging systems in real-world settings. Previous research has used physical perturbations such as small bulbs and thermal "QR codes" to attack infrared imaging detectors, but such methods are highly visible and lack stealthiness. Other researchers have used hot and cold blocks to deceive infrared imaging detectors, but this method is limited in its ability to execute attacks from various angles. To address these shortcomings, we propose a novel physical attack called adversarial infrared blocks (AdvIB). By optimizing the physical parameters of the adversarial infrared blocks, this method can execute a stealthy black-box attack on thermal imaging systems from various angles. We evaluate the proposed method based on its effectiveness, stealthiness, and robustness. Our physical tests show that the proposed method achieves a success rate of over 80% under most distance and angle conditions, validating its effectiveness. For stealthiness, our method involves attaching the adversarial infrared block to the inside of clothing, enhancing its stealthiness. Additionally, we test the proposed method on advanced detectors, and experimental results demonstrate an average attack success rate of 51.2%, proving its robustness. Overall, our proposed AdvIB method offers a promising avenue for conducting stealthy, effective and robust black-box attacks on thermal imaging systems, with potential implications for real-world safety and security applications.

Neural-PBIR Reconstruction of Shape, Material, and Illumination. (arXiv:2304.13445v2 [cs.CV] UPDATED)

Authors: Cheng Sun, Guangyan Cai, Zhengqin Li, Kai Yan, Cheng Zhang, Carl Marshall, Jia-Bin Huang, Shuang Zhao, Zhao Dong

Reconstructing the shape and spatially varying surface appearances of a physical-world object as well as its surrounding illumination based on 2D images (e.g., photographs) of the object has been a long-standing problem in computer vision and graphics. In this paper, we introduce a robust object reconstruction pipeline combining neural based object reconstruction and physics-based inverse rendering (PBIR). Specifically, our pipeline firstly leverages a neural stage to produce high-quality but potentially imperfect predictions of object shape, reflectance, and illumination. Then, in the later stage, initialized by the neural predictions, we perform PBIR to refine the initial results and obtain the final high-quality reconstruction. Experimental results demonstrate our pipeline significantly outperforms existing reconstruction methods quality-wise and performance-wise.

Common Diffusion Noise Schedules and Sample Steps are Flawed. (arXiv:2305.08891v2 [cs.CV] UPDATED)

Authors: Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang

We discover that common diffusion noise schedules do not enforce the last timestep to have zero signal-to-noise ratio (SNR), and some implementations of diffusion samplers do not start from the last timestep. Such designs are flawed and do not reflect the fact that the model is given pure Gaussian noise at inference, creating a discrepancy between training and inference. We show that the flawed design causes real problems in existing implementations. In Stable Diffusion, it severely limits the model to only generate images with medium brightness and prevents it from generating very bright and dark samples. We propose a few simple fixes: (1) rescale the noise schedule to enforce zero terminal SNR; (2) train the model with v prediction; (3) change the sampler to always start from the last timestep; (4) rescale classifier-free guidance to prevent over-exposure. These simple changes ensure the diffusion process is congruent between training and inference and allow the model to generate samples more faithful to the original data distribution.
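
Fix (1) can be sketched as a shift-and-scale of the square roots of the cumulative alphas so that the terminal value becomes exactly zero while the initial value is preserved (a sketch in the spirit of the abstract; variable names are ours):

    import torch

    def rescale_zero_terminal_snr(betas):
        alphas_bar_sqrt = (1.0 - betas).cumprod(dim=0).sqrt()
        first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
        alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)   # shift, then rescale
        alphas_bar = alphas_bar_sqrt ** 2
        alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
        return 1.0 - alphas

    betas = torch.linspace(1e-4, 2e-2, 1000)          # a typical linear schedule
    new_betas = rescale_zero_terminal_snr(betas)
    assert (1.0 - new_betas).cumprod(dim=0)[-1] == 0  # terminal SNR is now exactly zero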

Domain Adaptive Sim-to-Real Segmentation of Oropharyngeal Organs. (arXiv:2305.10883v2 [cs.AI] UPDATED)

Authors: Guankun Wang, Tian-Ao Ren, Jiewen Lai, Long Bai, Hongliang Ren

Video-assisted transoral tracheal intubation (TI) necessitates using an endoscope that helps the physician insert a tracheal tube into the glottis instead of the esophagus. The growing trend of robotic-assisted TI would require a medical robot to distinguish anatomical features like an experienced physician, which can be imitated by utilizing supervised deep-learning techniques. However, the real datasets of oropharyngeal organs are often inaccessible due to limited open-source data and patient privacy. In this work, we propose a domain adaptive Sim-to-Real framework called IoU-Ranking Blend-ArtFlow (IRB-AF) for image segmentation of oropharyngeal organs. The framework includes an image blending strategy called IoU-Ranking Blend (IRB) and the style-transfer method ArtFlow. Here, IRB alleviates the problem of poor segmentation performance caused by significant domain differences between datasets, while ArtFlow is introduced to further reduce these discrepancies. A virtual oropharynx image dataset generated by the SOFA framework is used as the learning subject for semantic segmentation to deal with the limited availability of actual endoscopic images. We applied IRB-AF to state-of-the-art domain adaptive segmentation models. The results demonstrate the superior performance of our approach in further improving the segmentation accuracy and training stability.

Embrace Limited and Imperfect Training Datasets: Opportunities and Challenges in Plant Disease Recognition Using Deep Learning. (arXiv:2305.11533v2 [cs.CV] UPDATED)

Authors: Mingle Xu, Hyongsuk Kim, Jucheng Yang, Alvaro Fuentes, Yao Meng, Sook Yoon, Taehyun Kim, Dong Sun Park

Recent advancements in deep learning have brought significant improvements to plant disease recognition. However, achieving satisfactory performance often requires high-quality training datasets, which are challenging and expensive to collect. Consequently, the practical application of current deep learning-based methods in real-world scenarios is hindered by the scarcity of high-quality datasets. In this paper, we argue that embracing poor datasets is viable and aim to explicitly define the challenges associated with using these datasets. To delve into this topic, we analyze the characteristics of high-quality datasets, namely large-scale images and desired annotation, and contrast them with the \emph{limited} and \emph{imperfect} nature of poor datasets. Challenges arise when the training datasets deviate from these characteristics. To provide a comprehensive understanding, we propose a novel and informative taxonomy that categorizes these challenges. Furthermore, we offer a brief overview of existing studies and approaches that address these challenges. We believe that our paper sheds light on the importance of embracing poor datasets, enhances the understanding of the associated challenges, and contributes to the ambitious objective of deploying deep learning in real-world applications. To facilitate the progress, we finally describe several outstanding questions and point out potential future directions. Although our primary focus is on plant disease recognition, we emphasize that the principles of embracing and analyzing poor datasets are applicable to a wider range of domains, including agriculture.

CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition. (arXiv:2305.16049v2 [cs.CV] UPDATED)

Authors: Lantian Li, Xiaolou Li, Haoyu Jiang, Chen Chen, Ruihai Hou, Dong Wang

Audio-visual person recognition (AVPR) has received extensive attention. However, most datasets used for AVPR research so far are collected in constrained environments, and thus cannot reflect the true performance of AVPR systems in real-world scenarios. To meet the demand for research on AVPR in unconstrained conditions, this paper presents a multi-genre AVPR dataset collected `in the wild', named CN-Celeb-AV. This dataset contains more than 419k video segments from 1,136 persons from public media. In particular, we put more emphasis on two real-world complexities: (1) data in multiple genres; (2) segments with partial information. A comprehensive study was conducted to compare CN-Celeb-AV with two popular public AVPR benchmark datasets, and the results demonstrated that CN-Celeb-AV is more in line with real-world scenarios and can be regarded as a new benchmark dataset for AVPR research. The dataset also includes a development set that can be used to boost the performance of AVPR systems in real-life situations. The dataset is free for researchers and can be downloaded from this http URL

SACSoN: Scalable Autonomous Control for Social Navigation. (arXiv:2306.01874v2 [cs.RO] UPDATED)

Authors: Noriaki Hirose, Dhruv Shah, Ajay Sridhar, Sergey Levine

Machine learning provides a powerful tool for building socially compliant robotic systems that go beyond simple predictive models of human behavior. By observing and understanding human interactions from past experiences, learning can enable effective social navigation behaviors directly from data. In this paper, our goal is to develop methods for training policies for socially unobtrusive navigation, such that robots can navigate among humans in ways that don't disturb human behavior. We introduce a definition for such behavior based on the counterfactual perturbation of the human: if the robot had not intruded into the space, would the human have acted in the same way? By minimizing this counterfactual perturbation, we can induce robots to behave in ways that do not alter the natural behavior of humans in the shared space. Instantiating this principle requires training policies to minimize their effect on human behavior, and this in turn requires data that allows us to model the behavior of humans in the presence of robots. Therefore, our approach is based on two key contributions. First, we collect a large dataset where an indoor mobile robot interacts with human bystanders. Second, we utilize this dataset to train policies that minimize counterfactual perturbation. We provide supplementary videos and make publicly available the largest-of-its-kind visual navigation dataset on our project page.

3D VR Sketch Guided 3D Shape Prototyping and Exploration. (arXiv:2306.10830v3 [cs.CV] UPDATED)

Authors: Ling Luo, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song, Yulia Gryaditskaya

3D shape modeling is labor-intensive, time-consuming, and requires years of expertise. To facilitate 3D shape modeling, we propose a 3D shape generation network that takes a 3D VR sketch as a condition. We assume that sketches are created by novices without art training and aim to reconstruct geometrically realistic 3D shapes of a given category. To handle potential sketch ambiguity, our method creates multiple 3D shapes that align with the original sketch's structure. We carefully design our method, training the model step-by-step and leveraging multi-modal 3D shape representation to support training with limited training data. To guarantee the realism of generated 3D shapes, we leverage a normalizing flow that models the distribution of the latent space of 3D shapes. To encourage the fidelity of the generated 3D shapes to an input sketch, we propose a dedicated loss that we deploy at different stages of the training process. The code is available at https://github.com/Rowl1ng/3Dsketch2shape.

A Conditional Flow Variational Autoencoder for Controllable Synthesis of Virtual Populations of Anatomy. (arXiv:2306.14680v2 [eess.IV] UPDATED)

Authors: Haoran Dou, Nishant Ravikumar, Alejandro F. Frangi

The generation of virtual populations (VPs) of anatomy is essential for conducting in silico trials of medical devices. Typically, the generated VP should capture sufficient variability while remaining plausible and should reflect the specific characteristics and demographics of the patients observed in real populations. In several applications, it is desirable to synthesise virtual populations in a \textit{controlled} manner, where relevant covariates are used to conditionally synthesise virtual populations that fit a specific target population/characteristics. We propose to equip a conditional variational autoencoder (cVAE) with normalising flows to boost the flexibility and complexity of the approximate posterior learnt, leading to enhanced flexibility for controllable synthesis of VPs of anatomical structures. We demonstrate the performance of our conditional flow VAE using a data set of cardiac left ventricles acquired from 2360 patients, with associated demographic information and clinical measurements (used as covariates/conditional information). The results obtained indicate the superiority of the proposed method for conditional synthesis of virtual populations of cardiac left ventricles relative to a cVAE. Conditional synthesis performance was evaluated in terms of generalisation and specificity errors and in terms of the ability to preserve clinically relevant biomarkers in synthesised VPs, that is, the left ventricular blood pool and myocardial volume, relative to the real observed population.

LPN: Language-guided Prototypical Network for few-shot classification. (arXiv:2307.01515v2 [cs.CV] UPDATED)

Authors: Kaihui Cheng, Chule Yang

Few-shot classification aims to adapt to new tasks with limited labeled examples. To fully use the accessible data, recent methods explore suitable measures for the similarity between the query and support images and better high-dimensional features with meta-training and pre-training strategies. However, the potential of multi-modality information has barely been explored, which may bring promising improvement for few-shot classification. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification, which leverages the complementarity of vision and language modalities via two parallel branches. Concretely, to introduce language modality with limited samples in the visual task, we leverage a pre-trained text encoder to extract class-level text features directly from class names while processing images with a conventional image encoder. Then, a language-guided decoder is introduced to obtain text features corresponding to each image by aligning class-level features with visual features. In addition, to take advantage of class-level features and prototypes, we build a refined prototypical head that generates robust prototypes in the text branch for follow-up measurement. Finally, we aggregate the visual and text logits to calibrate the deviation of a single modality. Extensive experiments demonstrate the competitiveness of LPN against state-of-the-art methods on benchmark datasets.

EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes. (arXiv:2307.07961v2 [cs.CV] UPDATED)

Authors: Jingyuan Yang, Qirui Huang, Tingting Ding, Dani Lischinski, Daniel Cohen-Or, Hui Huang

Visual Emotion Analysis (VEA) aims at predicting people's emotional responses to visual stimuli. This is a promising, yet challenging, task in affective computing, which has drawn increasing attention in recent years. Most of the existing work in this area focuses on feature design, while little attention has been paid to dataset construction. In this work, we introduce EmoSet, the first large-scale visual emotion dataset annotated with rich attributes, which is superior to existing datasets in four aspects: scale, annotation richness, diversity, and data balance. EmoSet comprises 3.3 million images in total, with 118,102 of these images carefully labeled by human annotators, making it five times larger than the largest existing dataset. EmoSet includes images from social networks, as well as artistic images, and it is well balanced between different emotion categories. Motivated by psychological studies, in addition to emotion category, each image is also annotated with a set of describable emotion attributes: brightness, colorfulness, scene type, object class, facial expression, and human action, which can help understand visual emotions in a precise and interpretable way. The relevance of these emotion attributes is validated by analyzing the correlations between them and visual emotion, as well as by designing an attribute module to help visual emotion recognition. We believe EmoSet will bring some key insights and encourage further research in visual emotion analysis and understanding. Project page: https://vcc.tech/EmoSet.

Augmenting CLIP with Improved Visio-Linguistic Reasoning. (arXiv:2307.09233v2 [cs.CV] UPDATED)

Authors: Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi

Image-text contrastive models such as CLIP are useful for a variety of downstream applications including zero-shot classification, image-text retrieval and transfer learning. However, these contrastively trained vision-language models often fail on compositional visio-linguistic tasks such as Winoground with performance equivalent to random chance. In our paper, we address this issue and propose a sample-efficient light-weight method called SDS-CLIP to improve the compositional visio-linguistic reasoning capabilities of CLIP. The core idea of our method is to use differentiable image parameterizations to fine-tune CLIP with a distillation objective from large text-to-image generative models such as Stable-Diffusion which are relatively good at visio-linguistic reasoning tasks. On the challenging Winoground compositional reasoning benchmark, our method improves the absolute visio-linguistic performance of different CLIP models by up to 7%, while on the ARO dataset, our method improves the visio-linguistic performance by up to 3%. As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find that the zero-shot performance improves marginally on a variety of downstream datasets. Our method reinforces that carefully designed distillation objectives from generative models can be leveraged to extend existing contrastive image-text models with improved visio-linguistic reasoning capabilities.

RepViT: Revisiting Mobile CNN From ViT Perspective. (arXiv:2307.09283v3 [cs.CV] UPDATED)

Authors: Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, Guiguang Ding

Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This results in a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80\% top-1 accuracy with nearly 1ms latency on an iPhone 12, which, to the best of our knowledge, is a first for a lightweight model. Our largest model, RepViT-M3, obtains 81.4\% accuracy with only 1.3ms latency. The code and trained models are available at \url{https://github.com/jameslahm/RepViT}.

Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving. (arXiv:2307.09329v2 [cs.CV] UPDATED)

Authors: Kaavya Rekanar, Ciarán Eising, Ganesh Sistu, Martin Hayes

This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts. Model selection is predicated on the analysis of transformer utilization in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers within a driving perspective. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.

Towards Building More Robust Models with Frequency Bias. (arXiv:2307.09763v2 [cs.CV] UPDATED)

Authors: Qingwen Bu, Dong Huang, Heming Cui

The vulnerability of deep neural networks to adversarial samples has been a major impediment to their broad applications, despite their success in various fields. Recently, some works suggested that adversarially-trained models emphasize the importance of low-frequency information to achieve higher robustness. While several attempts have been made to leverage this frequency characteristic, they have all faced the issue that applying low-pass filters directly to input images leads to irreversible loss of discriminative information and poor generalizability to datasets with distinct frequency features. This paper presents a plug-and-play module called the Frequency Preference Control Module that adaptively reconfigures the low- and high-frequency components of intermediate feature representations, providing better utilization of frequency in robust learning. Empirical studies show that our proposed module can be easily incorporated into any adversarial training framework, further improving model robustness across different architectures and datasets. Additionally, experiments were conducted to examine how the frequency bias of robust models impacts the adversarial training process and its final robustness, revealing interesting insights.
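
As an illustration of the general idea only (the paper's module is more elaborate), an intermediate feature map can be split into low- and high-frequency parts with an FFT mask and recombined with learnable weights:

    import torch
    import torch.nn as nn

    class FrequencyReweight(nn.Module):
        def __init__(self, cutoff=0.25):
            super().__init__()
            self.cutoff = cutoff
            self.w_low = nn.Parameter(torch.tensor(1.0))
            self.w_high = nn.Parameter(torch.tensor(1.0))

        def forward(self, x):                              # x: (B, C, H, W)
            _, _, H, W = x.shape
            fy = torch.fft.fftfreq(H, device=x.device).abs().view(H, 1)
            fx = torch.fft.fftfreq(W, device=x.device).abs().view(1, W)
            low_mask = ((fy <= self.cutoff) & (fx <= self.cutoff)).float()
            low = torch.fft.ifft2(torch.fft.fft2(x) * low_mask).real
            high = x - low
            return self.w_low * low + self.w_high * high   # learnable low/high mix

    y = FrequencyReweight()(torch.randn(2, 8, 32, 32))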

SACReg: Scene-Agnostic Coordinate Regression for Visual Localization. (arXiv:2307.11702v2 [cs.CV] UPDATED)

Authors: Jerome Revaud, Yohann Cabon, Romain Brégier, JongMin Lee, Philippe Weinzaepfel

Scene coordinates regression (SCR), i.e., predicting 3D coordinates for every pixel of a given image, has recently shown promising potential. However, existing methods remain mostly scene-specific or limited to small scenes and thus hardly scale to realistic datasets. In this paper, we propose a new paradigm where a single generic SCR model is trained once to be then deployed to new test scenes, regardless of their scale and without further finetuning. For a given query image, it collects inputs from off-the-shelf image retrieval techniques and Structure-from-Motion databases: a list of relevant database images with sparse pointwise 2D-3D annotations. The model is based on the transformer architecture and can take a variable number of images and sparse 2D-3D annotations as input. It is trained on a few diverse datasets and significantly outperforms other scene regression approaches on several benchmarks, including scene-specific models, for visual localization. In particular, we set a new state of the art on the Cambridge localization benchmark, even outperforming feature-matching-based approaches.

Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation. (arXiv:2307.12239v2 [cs.CV] UPDATED)

Authors: Yiming Cui, Linjie Yang, Haichao Yu

Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network and learn to predict the location and category of one specific object from each query. We empirically find that random convex combinations of the learned queries are still good for the corresponding models. We then propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image. The generated dynamic queries, named modulated queries, better capture the prior of object locations and categories in the different images. Equipped with our modulated queries, a wide range of DETR-based models achieve consistent and superior performance across multiple tasks including object detection, instance segmentation, panoptic segmentation, and video instance segmentation.
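
The modulated-query idea can be sketched as predicting convex-combination coefficients from a high-level image feature and mixing a bank of learned queries (module and tensor names are ours, not the authors'):

    import torch
    import torch.nn as nn

    class ModulatedQueries(nn.Module):
        def __init__(self, num_queries=100, d_model=256):
            super().__init__()
            self.query_bank = nn.Parameter(torch.randn(num_queries, d_model))
            self.coef_head = nn.Linear(d_model, num_queries * num_queries)
            self.num_queries = num_queries

        def forward(self, global_feat):                      # (B, d_model) image-level feature
            logits = self.coef_head(global_feat).view(-1, self.num_queries, self.num_queries)
            coefs = logits.softmax(dim=-1)                   # convex weights over the query bank
            return coefs @ self.query_bank                   # (B, num_queries, d_model)

    modulated = ModulatedQueries()(torch.randn(2, 256))      # fed to a DETR-style decoder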

Towards Generalizable Deepfake Detection by Primary Region Regularization. (arXiv:2307.12534v2 [cs.CV] UPDATED)

Authors: Harry Cheng, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli

The existing deepfake detection methods have reached a bottleneck in generalizing to unseen forgeries and manipulation approaches. Based on the observation that the deepfake detectors exhibit a preference for overfitting the specific primary regions in input, this paper enhances the generalization capability from a novel regularization perspective. This can be simply achieved by augmenting the images through primary region removal, thereby preventing the detector from over-relying on data bias. Our method consists of two stages, namely the static localization for primary region maps, as well as the dynamic exploitation of primary region masks. The proposed method can be seamlessly integrated into different backbones without affecting their inference efficiency. We conduct extensive experiments over three widely used deepfake datasets - DFDC, DF-1.0, and Celeb-DF with five backbones. Our method demonstrates an average performance improvement of 6% across different backbones and performs competitively with several state-of-the-art baselines.

Dense Transformer based Enhanced Coding Network for Unsupervised Metal Artifact Reduction. (arXiv:2307.12717v2 [cs.CV] UPDATED)

Authors: Wangduo Xie, Matthew B. Blaschko

CT images corrupted by metal artifacts have serious negative effects on clinical diagnosis. Considering the difficulty of collecting paired data with ground truth in clinical settings, unsupervised methods for metal artifact reduction are of high interest. However, it is difficult for previous unsupervised methods to retain structural information from CT images while handling the non-local characteristics of metal artifacts. To address these challenges, we propose a novel Dense Transformer based Enhanced Coding Network (DTEC-Net) for unsupervised metal artifact reduction. Specifically, we introduce a Hierarchical Disentangling Encoder, supported by a high-order dense process and a transformer, to obtain densely encoded sequences with long-range correspondence. Then, we present a second-order disentanglement method to improve the dense sequence's decoding process. Extensive experiments and model discussions illustrate DTEC-Net's effectiveness, which outperforms the previous state-of-the-art methods on a benchmark dataset, and greatly reduces metal artifacts while restoring richer texture details.

AMAE: Adaptation of Pre-Trained Masked Autoencoder for Dual-Distribution Anomaly Detection in Chest X-Rays. (arXiv:2307.12721v3 [cs.CV] UPDATED)

Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Jean-Philippe Thiran

Unsupervised anomaly detection in medical images such as chest radiographs is stepping into the spotlight as it mitigates the scarcity of the labor-intensive and costly expert annotation of anomaly data. However, nearly all existing methods are formulated as one-class classification trained only on representations from the normal class and discard a potentially significant portion of the unlabeled data. This paper focuses on a more practical setting, dual distribution anomaly detection for chest X-rays, using the entire training data, including both normal and unlabeled images. Inspired by a modern self-supervised vision transformer model trained using partial image inputs to reconstruct missing image regions -- we propose AMAE, a two-stage algorithm for adaptation of the pre-trained masked autoencoder (MAE). Starting from MAE initialization, AMAE first creates synthetic anomalies from only normal training images and trains a lightweight classifier on frozen transformer features. Subsequently, we propose an adaptation strategy to leverage unlabeled images containing anomalies. The adaptation scheme is accomplished by assigning pseudo-labels to unlabeled images and using two separate MAE-based modules to model the normative and anomalous distributions of pseudo-labeled images. The effectiveness of the proposed adaptation strategy is evaluated with different anomaly ratios in an unlabeled training set. AMAE leads to consistent performance gains over competing self-supervised and dual distribution anomaly detection methods, setting the new state-of-the-art on three public chest X-ray benchmarks: RSNA, NIH-CXR, and VinDr-CXR.

Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation. (arXiv:2307.14750v2 [cs.CV] UPDATED)

Authors: Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai

Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches can be categorized into two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner using external image-text pairs. However, the aligning setting seems to reach its performance limit due to the quality problem of pairs, and pre-training requires significant computational resources. To address these challenges, we propose a new strategy ``LPM + retrieval-augmented learning'' where the prior knowledge from large pre-trained models (LPMs) is leveraged as supervision, and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which adopts an efficient approach to retrieve highly relevant short region descriptions from the mismatching corpora and use them, via LPMs, to generate a variety of high-quality pseudo sentences with distinct representations. In addition, a fluency filter and a CLIP-guided training objective are further introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the SOTA pre-training model (Flamingo3B) by achieving a CIDEr score of 78.1 (+5.1) while utilizing only 0.3% of its trainable parameters (1.3B vs. 33M). Importantly, our approach eliminates the need for computationally expensive pre-training processes on external datasets (e.g., the requirement of 312M image-text pairs for Flamingo3B). We further show that with a simple extension, the generated pseudo sentences can be deployed as weak supervision to boost the 1% semi-supervised image caption benchmark up to 93.4 CIDEr score (+8.9) which showcases the versatility and effectiveness of our approach.