Resource-Limited Automated Ki67 Index Estimation in Breast Cancer. (arXiv:2401.00014v1 [q-bio.QM])

Authors: J. Gliozzo, G. Marinò, A. Bonometti, M. Frasca, D. Malchiodi

The prediction of tumor progression and chemotherapy response has recently been tackled by exploiting Tumor Infiltrating Lymphocytes (TILs) and the nuclear protein Ki67 as prognostic factors. Deep neural networks (DNNs) have been shown to achieve top results in estimating Ki67 expression and simultaneously determining the intratumoral TILs score in breast cancer cells. However, over the last ten years the extraordinary progress driven by deep models has been accompanied by an equally rapid growth in their resource demands. The exorbitant computational cost required to query (and in some cases also to store) a deep model is a strong limitation in resource-limited contexts, such as IoT-based applications supporting healthcare personnel. To this end, we propose a resource-consumption-aware DNN for the effective estimation of the percentage of Ki67-positive cells in breast cancer screenings. Our approach reduces memory usage by up to 75% and disk space by up to 89%, lowers energy consumption by up to 1.5x, and preserves or improves the overall accuracy of a benchmark state-of-the-art solution. Encouraged by these positive results, we structured the adopted framework for general-purpose use and released a public software repository to support it.

Any-point Trajectory Modeling for Policy Learning. (arXiv:2401.00025v1 [cs.RO])

Authors: Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, Pieter Abbeel

Learning from demonstration is a powerful method for teaching robots new skills, and more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Our method's effectiveness is demonstrated across 130 simulation tasks, focusing on language-conditioned manipulation tasks. Visualizations and code are available at: \url{https://xingyu-lin.github.io/atm}.
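
To make the idea of a point-trajectory model concrete, here is a minimal sketch of the interface such a model might expose: given the past 2D positions of arbitrary query points in a frame, predict their future positions. This is not the authors' ATM architecture; the module, layer sizes, and tensor shapes are assumptions for illustration only.

```python
# Hypothetical point-trajectory predictor: past 2D tracks in, future 2D tracks out.
import torch
import torch.nn as nn

class PointTrajectoryPredictor(nn.Module):
    def __init__(self, past_len=8, future_len=16, hidden=128):
        super().__init__()
        self.future_len = future_len
        self.net = nn.Sequential(
            nn.Linear(past_len * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, future_len * 2),
        )

    def forward(self, past_xy):
        # past_xy: (batch, num_points, past_len, 2) pixel coordinates
        b, n, t, _ = past_xy.shape
        out = self.net(past_xy.reshape(b * n, t * 2))
        return out.reshape(b, n, self.future_len, 2)

model = PointTrajectoryPredictor()
past = torch.rand(4, 32, 8, 2)   # 32 arbitrary query points per sample
future = model(past)             # (4, 32, 16, 2) predicted future tracks
print(future.shape)
```

Once trained on video data, predicted tracks like these could serve as the dense guidance signal that a downstream policy consumes alongside the few action-labeled demonstrations.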

Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring. (arXiv:2401.00027v1 [cs.CV])

Authors: Xin Gao, Tianheng Qiu, Xinyu Zhang, Hanlin Bai, Kang Liu, Xuan Huang, Hu Wei, Guoying Zhang, Huaping Liu

Coarse-to-fine schemes are widely used in traditional single-image motion deblurring; in the context of deep learning, however, existing multi-scale algorithms not only require complex modules to fuse low-scale RGB images with deep semantics, but also manually generate low-resolution image pairs that do not have sufficient confidence. In this work, we propose a multi-scale network based on a single-input, multiple-output (SIMO) scheme for motion deblurring, which reduces the complexity of algorithms based on coarse-to-fine schemes. To alleviate the loss of detail caused by the multi-scale architecture, we combine the characteristics of real-world blurring trajectories with a learnable wavelet transform module that focuses on the directional continuity and frequency features of the step-by-step transition from blurred to sharp images. In conclusion, we propose a multi-scale network with a learnable discrete wavelet transform (MLWNet), which exhibits state-of-the-art performance on multiple real-world deblurring datasets, in terms of both subjective and objective quality as well as computational efficiency.
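
As a hedged sketch of what a "learnable discrete wavelet transform" layer can look like in general (not the MLWNet module itself), the block below builds four 2x2 filters initialized to the Haar analysis filters, applies them depthwise with stride 2, and leaves them trainable so the decomposition can adapt during training.

```python
# Learnable Haar-initialized DWT layer: (B, C, H, W) -> (B, 4*C, H/2, W/2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableDWT(nn.Module):
    def __init__(self, channels):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        bank = torch.stack([ll, lh, hl, hh])               # (4, 2, 2) Haar analysis filters
        weight = bank.repeat(channels, 1, 1).unsqueeze(1)  # (4*C, 1, 2, 2), depthwise
        self.weight = nn.Parameter(weight)                 # trainable, Haar-initialized
        self.channels = channels

    def forward(self, x):
        # four subbands (LL, LH, HL, HH) stacked per input channel
        return F.conv2d(x, self.weight, stride=2, groups=self.channels)

x = torch.randn(1, 3, 64, 64)
print(LearnableDWT(3)(x).shape)   # torch.Size([1, 12, 32, 32])
```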

An Empirical Study of Scaling Law for OCR. (arXiv:2401.00028v1 [cs.CV])

Authors: Miao Rang, Zhenni Bi, Chuanjian Liu, Yunhe Wang, Kai Han

The scaling laws relating model size, data volume, computation, and model performance have been extensively studied in the field of Natural Language Processing (NLP). However, scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume, and computation in the field of text recognition. Conclusively, the study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and the new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%.
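
For illustration of what fitting such a smooth power law involves, the snippet below fits an error curve of the form E(N) = a * N^(-b) + c to (model size, error) pairs. The data points are made up for demonstration; only the fitting recipe is the point, not the paper's measured values.

```python
# Fit a saturating power law to toy (model size, error) data.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * np.power(n, -b) + c

model_sizes = np.array([1e6, 5e6, 2e7, 1e8, 5e8])          # parameter counts (assumed)
error_rates = np.array([0.12, 0.08, 0.055, 0.04, 0.031])   # 1 - accuracy (assumed)

(a, b, c), _ = curve_fit(power_law, model_sizes, error_rates,
                         p0=[1.0, 0.3, 0.01], maxfev=10000)
print(f"fitted exponent b = {b:.3f}; predicted error at 1e9 params: "
      f"{power_law(1e9, a, b, c):.4f}")
```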

6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation. (arXiv:2401.00029v1 [cs.CV])

Authors: Li Xu, Haoxuan Qu, Yujun Cai, Jun Liu

Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile, diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by their denoising capability, we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation for better performance. In our framework, to establish accurate 2D-3D correspondences, we formulate 2D keypoint detection as a reverse diffusion (denoising) process. To facilitate such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework.
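
As an illustrative sketch (not the paper's exact formulation or schedule), the code below perturbs 2D keypoint coordinates with heavy-tailed noise drawn from a mixture of Cauchy distributions, which is the kind of forward process the abstract refers to. The mixture weights, locations, and scales are arbitrary placeholders.

```python
# Sample heavy-tailed noise from a two-component Cauchy mixture and add it to keypoints.
import math
import torch

def mixture_of_cauchy_noise(shape, locs, scales, weights):
    # pick a mixture component per element, then sample Cauchy noise via inverse CDF
    n = int(torch.tensor(shape).prod())
    comp = torch.multinomial(weights, num_samples=n, replacement=True).reshape(shape)
    u = torch.rand(shape)
    return locs[comp] + scales[comp] * torch.tan(math.pi * (u - 0.5))

keypoints = torch.rand(8, 17, 2)             # (batch, keypoints, xy) in [0, 1]
locs = torch.tensor([0.0, 0.0])
scales = torch.tensor([0.01, 0.05])          # one tight, one diffuse component (assumed)
weights = torch.tensor([0.7, 0.3])
noisy = keypoints + mixture_of_cauchy_noise(keypoints.shape, locs, scales, weights)
print(noisy.shape)
```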

Discrete Distribution Networks. (arXiv:2401.00036v1 [cs.CV])

Authors: Lei Yang

We introduce a novel generative model, the Discrete Distribution Networks (DDN), that approximates data distribution using hierarchical discrete distributions. We posit that since the features within a network inherently contain distributional information, liberating the network from a single output to concurrently generate multiple samples proves to be highly effective. Therefore, DDN fits the target distribution, including continuous ones, by generating multiple discrete sample points. To capture finer details of the target data, DDN selects the output that is closest to the Ground Truth (GT) from the coarse results generated in the first layer. This selected output is then fed back into the network as a condition for the second layer, thereby generating new outputs more similar to the GT. As the number of DDN layers increases, the representational space of the outputs expands exponentially, and the generated samples become increasingly similar to the GT. This hierarchical output pattern of discrete distributions endows DDN with two intriguing properties: highly compressed representation and more general zero-shot conditional generation. We demonstrate the efficacy of DDN and these intriguing properties through experiments on CIFAR-10 and FFHQ.
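
The selection-and-feedback pattern described above can be illustrated with a toy loop: each layer emits K discrete candidates, the candidate closest to the ground truth is kept and fed back as the condition for the next layer, and the chosen indices form a compact code. The generator below is a random placeholder, not the DDN architecture; only the mechanism is sketched.

```python
# Toy hierarchical "pick nearest candidate, feed it back" loop.
import torch
import torch.nn as nn

class ToyDDNLayer(nn.Module):
    def __init__(self, dim, k=8):
        super().__init__()
        self.k = k
        self.net = nn.Linear(dim, k * dim)   # emits K candidate samples at once

    def forward(self, condition):
        b, d = condition.shape
        return self.net(condition).reshape(b, self.k, d)   # (B, K, dim)

def hierarchical_sample(layers, gt, start):
    cond, chosen = start, []
    for layer in layers:
        candidates = layer(cond)                                # (B, K, dim)
        dist = ((candidates - gt.unsqueeze(1)) ** 2).sum(-1)    # (B, K)
        idx = dist.argmin(dim=1)                                # closest-to-GT candidate
        cond = candidates[torch.arange(len(idx)), idx]          # feed back as next condition
        chosen.append(idx)
    return cond, chosen   # chosen indices act as a highly compressed code for gt

layers = nn.ModuleList([ToyDDNLayer(16) for _ in range(3)])
gt, start = torch.randn(4, 16), torch.zeros(4, 16)
out, code = hierarchical_sample(layers, gt, start)
print(out.shape, [c.tolist() for c in code])
```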

Generalization properties of contrastive world models. (arXiv:2401.00057v1 [cs.LG])

Authors: Kandan Ramakrishnan, R. James Cotton, Xaq Pitkow, Andreas S. Tolias

Recent work on object-centric world models aims to factorize representations in terms of objects in a completely unsupervised or self-supervised manner. Such world models are hypothesized to be a key component for addressing the generalization problem. While self-supervision has shown improved performance, out-of-distribution (OOD) generalization has not been systematically and explicitly tested. In this paper, we conduct an extensive study of the generalization properties of a contrastive world model. We systematically test the model under a number of different OOD generalization scenarios, such as extrapolation to new object attributes and the introduction of new conjunctions or new attributes. Our experiments show that the contrastive world model fails to generalize under these OOD tests, and the drop in performance depends on the extent to which the samples are OOD. When visualizing the transition updates and convolutional feature maps, we observe that any change in object attributes (such as previously unseen colors, shapes, or conjunctions of color and shape) breaks down the factorization of object representations. Overall, our work highlights the importance of object-centric representations for generalization and shows that current models are limited in their capacity to learn the representations required for human-level generalization.

Accelerating Process Development for 3D Printing of New Metal Alloys. (arXiv:2401.00065v1 [cond-mat.mtrl-sci])

Authors: David Guirguis, Conrad Tucker, Jack Beuth

Addressing the uncertainty and variability in the quality of 3D printed metals can further the widespread use of this technology. Process mapping for new alloys is crucial for determining optimal process parameters that consistently produce acceptable printing quality. Process mapping is typically performed by conventional methods, relying on the design of experiments and ex situ characterization of printed parts. In situ approaches, on the other hand, are constrained by the limited set of observable features and require complex, high-cost setups to obtain temperature measurements that boost accuracy. Our method relaxes these limitations by incorporating the temporal features of molten metal dynamics during laser-metal interactions using video vision transformers and high-speed imaging. Our approach can be used in existing commercial machines and can provide in situ process maps for efficient defect and variability quantification. The generalizability of the approach is demonstrated by performing cross-dataset evaluations on alloys with different compositions and intrinsic thermofluid properties.

Particle-Based Shape Modeling for Arbitrary Regions-of-Interest. (arXiv:2401.00067v1 [cs.CV])

Authors: Hong Xu, Alan Morris, Shireen Y. Elhabian

Statistical Shape Modeling (SSM) is a quantitative method for analyzing morphological variations in anatomical structures. These analyses often necessitate building models on targeted anatomical regions of interest to focus on specific morphological features. We propose an extension to particle-based shape modeling (PSM), a widely used SSM framework, that allows shape modeling on arbitrary regions of interest. Existing methods for defining regions of interest are computationally expensive and have topological limitations. To address these shortcomings, we use mesh fields to define free-form constraints, which allow for delimiting arbitrary regions of interest on shape surfaces. Furthermore, we add a quadratic penalty method to the model optimization to enable computationally efficient enforcement of any combination of cutting-plane and free-form constraints. We demonstrate the effectiveness of this method on a challenging synthetic dataset and two medical datasets.
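
The quadratic penalty method mentioned above is a generic constrained-optimization device: violations of a constraint g(x) <= 0 are penalized by mu * max(0, g(x))^2, with mu increased across rounds. The sketch below shows that mechanism on a placeholder objective and constraint; it is not the paper's shape-optimization code.

```python
# Quadratic penalty loop on a toy objective constrained to the unit disk.
import numpy as np
from scipy.optimize import minimize

def objective(v):
    return np.sum((v - np.array([2.0, 2.0])) ** 2)     # stand-in particle energy

def constraint_violation(v):
    # stand-in free-form/cutting-plane style constraint: keep v inside the unit disk
    return max(0.0, np.linalg.norm(v) - 1.0)

x = np.array([0.0, 0.0])
for mu in [1.0, 10.0, 100.0, 1000.0]:                   # increasing penalty weight
    penalized = lambda v, mu=mu: objective(v) + mu * constraint_violation(v) ** 2
    x = minimize(penalized, x, method="Nelder-Mead").x
print(x, np.linalg.norm(x))   # approaches the feasible boundary near (0.707, 0.707)
```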

A Large-Scale Re-identification Analysis in Sporting Scenarios: the Betrayal of Reaching a Critical Point. (arXiv:2401.00080v1 [cs.CV])

Authors: David Freire-Obregón, Javier Lorenzo-Navarro, Oliverio J. Santana, Daniel Hernández-Sosa, Modesto Castrillón-Santana

Re-identifying participants in ultra-distance running competitions can be daunting due to the extensive distances and constantly changing terrain. To overcome these challenges, computer vision techniques have been developed to analyze runners' faces, numbers on their bibs, and clothing. However, our study presents a novel gait-based approach for runners' re-identification (re-ID) by leveraging various pre-trained human action recognition (HAR) models and loss functions. Our results show that this approach provides promising results for re-identifying runners in ultra-distance competitions. Furthermore, we investigate the significance of distinct human body movements when athletes are approaching their endurance limits and their potential impact on re-ID accuracy. Our study examines how the recognition of a runner's gait is affected by a competition's critical point (CP), defined as a moment of severe fatigue and the point where the finish line comes into view, just a few kilometers away from this location. We aim to determine how this CP can improve the accuracy of athlete re-ID. Our experimental results demonstrate that gait recognition can be significantly enhanced (up to a 9% increase in mAP) as athletes approach this point. This highlights the potential of utilizing gait recognition in real-world scenarios, such as ultra-distance competitions or long-duration surveillance tasks.

Generating Enhanced Negatives for Training Language-Based Object Detectors. (arXiv:2401.00094v1 [cs.CV])

Authors: Shiyu Zhao, Long Zhao, Vijay Kumar B.G, Yumin Suh, Dimitris N. Metaxas, Manmohan Chandraker, Samuel Schulter

The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful, but requires good positive and negative samples. However, the free-form nature and the open vocabulary of object descriptions make the space of negatives extremely large. Prior works randomly sample negatives or use rule-based techniques to build them. In contrast, we propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data. Specifically, we use large language models to generate negative text descriptions, and text-to-image diffusion models to also generate corresponding negative images. Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.

Diffusion Model with Perceptual Loss. (arXiv:2401.00110v1 [cs.CV])

Authors: Shanchuan Lin, Xiao Yang

Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before.
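
To make the general idea of a "self-perceptual" objective concrete, the sketch below compares the denoiser's prediction and the clean target in the feature space of a frozen copy of the same network, instead of pixel-space MSE alone. The `Denoiser` module, the toy corruption step, and the loss wiring are stand-ins, not the authors' formulation.

```python
# Feature-space loss computed with a frozen copy of the denoiser itself.
import copy
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
                                      nn.Conv2d(32, 32, 3, padding=1), nn.SiLU())
        self.head = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, x, return_features=False):
        h = self.features(x)
        return h if return_features else self.head(h)

model = Denoiser()
frozen = copy.deepcopy(model).eval()          # frozen copy acts as the perceptual network
for p in frozen.parameters():
    p.requires_grad_(False)

x0 = torch.rand(2, 3, 32, 32)                 # clean images
xt = x0 + 0.5 * torch.randn_like(x0)          # toy corruption (real noise schedule omitted)
pred_x0 = model(xt)

# perceptual term: distance between frozen-network features of prediction and target
loss = nn.functional.mse_loss(frozen(pred_x0, return_features=True),
                              frozen(x0, return_features=True))
loss.backward()
print(float(loss))
```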

LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning. (arXiv:2401.00125v1 [cs.AI])

Authors: S P Sharan, Francesco Pittaluga, Vijay Kumar B G, Manmohan Chandraker

Although planning is a crucial component of the autonomous driving stack, researchers have yet to develop robust planning algorithms that are capable of safely handling the diverse range of possible driving scenarios. Learning-based planners suffer from overfitting and poor long-tail performance. On the other hand, rule-based planners generalize well, but might fail to handle scenarios that require complex driving maneuvers. To address these limitations, we investigate the possibility of leveraging the common-sense reasoning capabilities of Large Language Models (LLMs) such as GPT4 and Llama2 to generate plans for self-driving vehicles. In particular, we develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner. Guided by the common-sense reasoning abilities of LLMs, our approach navigates complex scenarios that existing planners struggle with and produces well-reasoned outputs while remaining grounded by working alongside the rule-based approach. Through extensive evaluation on the nuPlan benchmark, we achieve state-of-the-art performance, outperforming all existing pure learning- and rule-based methods across most metrics. Our code will be available at https://llmassist.github.io.

Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models. (arXiv:2401.00127v1 [cs.CV])

Authors: Ashhadul Islam, Md. Rafiul Biswas, Wajdi Zaghouani, Samir Brahim Belhaouari, Zubair Shah

The synergy of language and vision models has given rise to Large Language and Vision Assistant models (LLVAs), designed to engage users in rich conversational experiences intertwined with image-based queries. These comprehensive multimodal models seamlessly integrate vision encoders with Large Language Models (LLMs), expanding their applications in general-purpose language and visual comprehension. The advent of Large Multimodal Models (LMMs) heralds a new era in Artificial Intelligence (AI) assistance, extending the horizons of AI utilization. This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts designed for specific datasets. We also investigate the LLVAs' zero-shot learning capabilities. Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images. The results of our experiments demonstrate the model's remarkable performance, achieving classification accuracies of 85\%, 100\%, 77\%, and 79\% for the respective datasets without any fine-tuning. To bolster our analysis, we assess the model's performance post fine-tuning for specific tasks. In one instance, fine-tuning is conducted on a dataset comprising images of the faces of children with and without autism. Prior to fine-tuning, the model demonstrated a test accuracy of 55\%, which significantly improved to 83\% post fine-tuning. These results, coupled with our prior findings, underscore the transformative potential of LLVAs and their versatile applications in real-world scenarios.

Quantifying intra-tumoral genetic heterogeneity of glioblastoma toward precision medicine using MRI and a data-inclusive machine learning algorithm. (arXiv:2401.00128v1 [cs.LG])

Authors: Lujia Wang, Hairong Wang, Fulvio D'Angelo, Lee Curtin, Christopher P. Sereduk, Gustavo De Leon, Kyle W. Singleton, Javier Urcuyo, Andrea Hawkins-Daarud, Pamela R. Jackson, Chandan Krishna, Richard S. Zimmerman, Devi P. Patra, Bernard R. Bendok, Kris A. Smith, Peter Nakaji, Kliment Donev, Leslie C. Baxter, Maciej M. Mrugała, Michele Ceccarelli, Antonio Iavarone, Kristin R. Swanson, Nhan L. Tran, Leland S. Hu, Jing Li

Glioblastoma (GBM) is one of the most aggressive and lethal human cancers. Intra-tumoral genetic heterogeneity poses a significant challenge for treatment. Biopsy is invasive, which motivates the development of non-invasive, MRI-based machine learning (ML) models to quantify intra-tumoral genetic heterogeneity for each patient. This capability holds great promise for enabling better therapeutic selection to improve patient outcomes. We proposed a novel Weakly Supervised Ordinal Support Vector Machine (WSO-SVM) to predict regional genetic alteration status within each GBM tumor using MRI. WSO-SVM was applied to a unique dataset of 318 image-localized biopsies with spatially matched multiparametric MRI from 74 GBM patients. The model was trained to predict the regional genetic alteration of three GBM driver genes (EGFR, PDGFRA, and PTEN) based on features extracted from the corresponding region of five MRI contrast images. For comparison, a variety of existing ML algorithms were also applied. The classification accuracy of each gene was compared between the different algorithms. The SHapley Additive exPlanations (SHAP) method was further applied to compute contribution scores of different contrast images. Finally, the trained WSO-SVM was used to generate prediction maps within the tumoral area of each patient to help visualize the intra-tumoral genetic heterogeneity. This study demonstrated the feasibility of using MRI and WSO-SVM to enable non-invasive prediction of intra-tumoral regional genetic alteration for each GBM patient, which can inform future adaptive therapies for individualized oncology.

Deep Radon Prior: A Fully Unsupervised Framework for Sparse-View CT Reconstruction. (arXiv:2401.00135v1 [eess.IV])

Authors: Shuo Xu, Yucheng Zhang, Gang Chen, Xincheng Xiang, Peng Cong, Yuewen Sun

Although sparse-view computed tomography (CT) significantly reduces radiation dose, it also introduces severe artifacts which degrade image quality. In recent years, deep learning-based methods for inverse problems have made remarkable progress and have become increasingly popular in CT reconstruction. However, most of these methods suffer from several limitations: dependence on high-quality training data, weak interpretability, etc. In this study, we propose a fully unsupervised framework called Deep Radon Prior (DRP), inspired by Deep Image Prior (DIP), to address the aforementioned limitations. DRP introduces a neural network as an implicit prior into the iterative method, thereby realizing cross-domain gradient feedback. During the reconstruction process, the neural network is progressively optimized in multiple stages to narrow the solution space in the Radon domain for the under-constrained imaging protocol, and the convergence of the proposed method is discussed in this work. Compared with popular pre-trained methods, the proposed framework requires no dataset and exhibits superior interpretability and generalization ability. The experimental results demonstrate that the proposed method can generate detailed images while effectively suppressing image artifacts. Meanwhile, DRP achieves comparable or better performance than the supervised methods.
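
The flavor of this approach can be illustrated with a Deep-Image-Prior-style loop: a randomly initialized network re-parameterizes the image, and its output is fitted to the measured sinogram through the forward projector. In the sketch below the projector is a random matrix stand-in for the sparse-view Radon transform, and the network is a small MLP; this is an assumption-laden illustration, not the DRP implementation.

```python
# DIP-style fitting of a network-parameterized image to projection-domain data.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, n_meas = 32 * 32, 400                        # image pixels, sparse measurements
A = torch.randn(n_meas, n) / n_meas ** 0.5      # stand-in projection operator
x_true = torch.rand(n)
sinogram = A @ x_true                           # measured data (noise-free toy case)

net = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, n), nn.Sigmoid())
z = torch.randn(1, n)                           # fixed random input, as in DIP
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for it in range(500):
    x_hat = net(z).squeeze(0)
    loss = torch.sum((A @ x_hat - sinogram) ** 2)   # data fidelity in the projection domain
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final data misfit: {loss.item():.4f}")
```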

SSL-OTA: Unveiling Backdoor Threats in Self-Supervised Learning for Object Detection. (arXiv:2401.00137v1 [cs.CR])

Authors: Qiannan Wang, Changchun Yin, Liming Fang, Lu Zhou, Zhe Liu, Run Wang, Chenhao Lin

The extensive adoption of Self-supervised learning (SSL) has led to an increased security threat from backdoor attacks. While existing research has mainly focused on backdoor attacks in image classification, there has been limited exploration into their implications for object detection. In this work, we propose the first backdoor attack designed for object detection tasks in SSL scenarios, termed Object Transform Attack (SSL-OTA). SSL-OTA employs a trigger capable of altering predictions of the target object to the desired category, encompassing two attacks: the Naive Attack (NA) and the Dual-Source Blending Attack (DSBA). NA conducts data poisoning during downstream fine-tuning of the object detector, while DSBA additionally injects backdoors into the pre-trained encoder. We establish appropriate metrics and conduct extensive experiments on benchmark datasets, demonstrating the effectiveness and utility of our proposed attack. Notably, both NA and DSBA achieve high attack success rates (ASR) at extremely low poisoning rates (0.5%). The results underscore the importance of considering backdoor threats in SSL-based object detection and contribute a novel perspective to the field.

TPatch: A Triggered Physical Adversarial Patch. (arXiv:2401.00148v1 [cs.CR])

Authors: Wenjun Zhu, Xiaoyu Ji, Yushi Cheng, Shibo Zhang, Wenyuan Xu

Autonomous vehicles increasingly utilize the vision-based perception module to acquire information about driving environments and detect obstacles. Correct detection and classification are important to ensure safe driving decisions. Existing works have demonstrated the feasibility of fooling the perception models such as object detectors and image classifiers with printed adversarial patches. However, most of them are indiscriminately offensive to every passing autonomous vehicle. In this paper, we propose TPatch, a physical adversarial patch triggered by acoustic signals. Unlike other adversarial patches, TPatch remains benign under normal circumstances but can be triggered to launch a hiding, creating or altering attack by a designed distortion introduced by signal injection attacks towards cameras. To avoid the suspicion of human drivers and make the attack practical and robust in the real world, we propose a content-based camouflage method and an attack robustness enhancement method to strengthen it. Evaluations with three object detectors, YOLO V3/V5 and Faster R-CNN, and eight image classifiers demonstrate the effectiveness of TPatch in both the simulation and the real world. We also discuss possible defenses at the sensor, algorithm, and system levels.

CamPro: Camera-based Anti-Facial Recognition. (arXiv:2401.00151v1 [cs.CV])

Authors: Wenjun Zhu, Yuan Sun, Jiani Liu, Yushi Cheng, Xiaoyu Ji, Wenyuan Xu

The proliferation of images captured from millions of cameras and the advancement of facial recognition (FR) technology have made the abuse of FR a severe privacy threat. Existing works typically rely on obfuscation, synthesis, or adversarial examples to modify faces in images to achieve anti-facial recognition (AFR). However, the unmodified images captured by camera modules that contain sensitive personally identifiable information (PII) could still be leaked. In this paper, we propose a novel approach, CamPro, to capture inborn AFR images. CamPro enables well-packed commodity camera modules to produce images that contain little PII and yet still contain enough information to support other non-sensitive vision applications, such as person detection. Specifically, CamPro tunes the configuration setup inside the camera image signal processor (ISP), i.e., color correction matrix and gamma correction, to achieve AFR, and designs an image enhancer to keep the image quality for possible human viewers. We implemented and validated CamPro on a proof-of-concept camera, and our experiments demonstrate its effectiveness on ten state-of-the-art black-box FR models. The results show that CamPro images can significantly reduce face identification accuracy to 0.3\% while having little impact on the targeted non-sensitive vision application. Furthermore, we find that CamPro is resilient to adaptive attackers who have re-trained their FR models using images generated by CamPro, even with full knowledge of privacy-preserving ISP parameters.
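
For readers unfamiliar with the two ISP stages the abstract says are tuned, the snippet below applies a 3x3 color correction matrix per pixel followed by gamma correction. The matrix and gamma value are conventional illustrative defaults, not CamPro's privacy-preserving parameters.

```python
# Apply a color correction matrix (CCM) and gamma correction to a linear RGB image.
import numpy as np

def apply_isp(linear_rgb, ccm, gamma):
    # linear_rgb: (H, W, 3) in [0, 1]; ccm: (3, 3); gamma: scalar
    corrected = np.clip(linear_rgb @ ccm.T, 0.0, 1.0)   # per-pixel color correction
    return corrected ** (1.0 / gamma)                   # gamma encoding

ccm = np.array([[ 1.6, -0.4, -0.2],
                [-0.3,  1.5, -0.2],
                [-0.1, -0.5,  1.6]])          # example CCM (rows sum to 1)
image = np.random.rand(480, 640, 3)           # stand-in linear sensor image
print(apply_isp(image, ccm, gamma=2.2).shape)
```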

A comprehensive framework for occluded human pose estimation. (arXiv:2401.00155v1 [cs.CV])

Authors: Linhao Xu, Lin Zhao, Xinxin Sun, Guangyu Li, Kedong Yan

Occlusion presents a significant challenge in human pose estimation. The challenges posed by occlusion can be attributed to the following factors: 1) Data: The collection and annotation of occluded human pose samples are relatively challenging. 2) Feature: Occlusion can cause feature confusion due to the high similarity between the target person and interfering individuals. 3) Inference: Robust inference becomes challenging due to the loss of complete body structural information. The existing methods designed for occluded human pose estimation usually focus on addressing only one of these factors. In this paper, we propose a comprehensive framework DAG (Data, Attention, Graph) to address the performance degradation caused by occlusion. Specifically, we introduce the mask joints with instance paste data augmentation technique to simulate occlusion scenarios. Additionally, an Adaptive Discriminative Attention Module (ADAM) is proposed to effectively enhance the features of target individuals. Furthermore, we present the Feature-Guided Multi-Hop GCN (FGMP-GCN) to fully explore the prior knowledge of body structure and improve pose estimation results. Through extensive experiments conducted on three benchmark datasets for occluded human pose estimation, we demonstrate that the proposed method outperforms existing methods. Code and data will be publicly available.

Automatic hip osteoarthritis grading with uncertainty estimation from computed tomography using digitally-reconstructed radiographs. (arXiv:2401.00159v1 [eess.IV])

Authors: Masachika Masuda, Mazen Soufi, Yoshito Otake, Keisuke Uemura, Sotaro Kono, Kazuma Takashima, Hidetoshi Hamada, Yi Gu, Masaki Takao, Seiji Okada, Nobuhiko Sugano, Yoshinobu Sato

Progression of hip osteoarthritis (hip OA) leads to pain and disability, likely leading to surgical treatment such as hip arthroplasty at the terminal stage. The severity of hip OA is often classified using the Crowe and Kellgren-Lawrence (KL) classifications. However, as the classification is subjective, we aimed to develop an automated approach to classify the disease severity based on the two grades using digitally-reconstructed radiographs (DRRs) from CT images. Automatic grading of hip OA severity was performed using deep learning-based models. The models were trained to predict the disease grade using two grading schemes, i.e., predicting the Crowe and KL grades separately, and predicting a new ordinal label combining both grades and representing the disease progression of hip OA. The models were trained in classification and regression settings. In addition, the model uncertainty was estimated and validated as a predictor of classification accuracy. The models were trained and validated on a database of 197 hip OA patients, and externally validated on 52 patients. The model accuracy was evaluated using exact class accuracy (ECA), one-neighbor class accuracy (ONCA), and balanced accuracy. The deep learning models produced comparable accuracies of approximately 0.65 (ECA) and 0.95 (ONCA) in the classification and regression settings. The model uncertainty was significantly larger in cases with large classification errors (P<6e-3). In this study, an automatic approach for grading hip OA severity from CT images was developed. The models have shown comparable performance with high ONCA, which facilitates automated grading in large-scale CT databases and indicates the potential for further disease progression analysis. Classification accuracy was correlated with the model uncertainty, which would allow for the prediction of classification errors.
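
Under the usual definitions, the two accuracy metrics named above can be computed as follows: exact class accuracy (ECA) counts exact grade matches, while one-neighbor class accuracy (ONCA) also accepts predictions that are off by one ordinal grade. The labels below are toy values for illustration.

```python
# ECA and ONCA for ordinal grade predictions.
import numpy as np

def exact_class_accuracy(y_true, y_pred):
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def one_neighbor_class_accuracy(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= 1)

y_true = np.array([0, 1, 2, 3, 3, 2, 1, 0])   # ordinal disease grades (toy labels)
y_pred = np.array([0, 2, 2, 2, 3, 0, 1, 1])
print(exact_class_accuracy(y_true, y_pred))         # 0.5
print(one_neighbor_class_accuracy(y_true, y_pred))  # 0.875
```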

Inpaint4DNeRF: Promptable Spatio-Temporal NeRF Inpainting with Generative Diffusion Models. (arXiv:2401.00208v1 [cs.CV])

Authors: Han Jiang, Haosen Sun, Ruoxuan Li, Chi-Keung Tang, Yu-Wing Tai

Current Neural Radiance Fields (NeRF) can generate photorealistic novel views. For editing 3D scenes represented by NeRF, this paper leverages the advent of generative models and proposes Inpaint4DNeRF, which capitalizes on state-of-the-art stable diffusion models (e.g., ControlNet) to directly generate the underlying completed background content, regardless of whether the scene is static or dynamic. The key advantages of this generative approach to NeRF inpainting are twofold. First, after rough mask propagation, to complete or fill in previously occluded content, we can individually generate a small subset of completed images with plausible content, called seed images, from which simple 3D geometry proxies can be derived. The second, remaining problem is then 3D multiview consistency among all completed images, which is now guided by the seed images and their 3D proxies. Without other bells and whistles, our generative Inpaint4DNeRF baseline framework is general and can be readily extended to 4D dynamic NeRFs, where temporal consistency is naturally handled in a similar way to our multiview consistency.

A Novel Approach for Defect Detection of Wind Turbine Blade Using Virtual Reality and Deep Learning. (arXiv:2401.00237v1 [cs.CV])

Authors: Md Fazle Rabbi, Solayman Hossain Emon, Ehtesham Mahmud Nishat, Tzu-Liang (Bill) Tseng, Atira Ferdoushi, Chun-Che Huang, Md Fashiar Rahman

Wind turbines are subjected to continuous rotational stresses and unusual external forces such as storms, lightning, and strikes by flying objects, which may cause defects in turbine blades. Hence, periodical inspection is required to ensure proper functionality and avoid catastrophic failure. The task of inspection is challenging due to the remote location of turbines and their inconvenient reachability for human inspection. In the literature, researchers have used images with defects cropped out of the wind turbine, neglecting possible background biases, which may hinder real-time and autonomous defect detection using aerial vehicles such as drones. To overcome such challenges, in this paper, we study defect detection accuracy with the defects kept in their background, using a two-step deep-learning methodology. In the first step, we develop virtual models of wind turbines to synthesize near-reality images for four common defect types: cracks, leading edge erosion, bending, and light striking damage. The Unity perception package is used to generate wind turbine blade defect images with variations in background, randomness, camera angle, and lighting effects. In the second step, a customized U-Net architecture is trained to classify and segment the defects in turbine blades. The outcomes of the U-Net architecture have been thoroughly tested and compared using 5-fold validation datasets. The proposed methodology provides reasonable defect detection accuracy, making it suitable for autonomous and remote inspection through aerial vehicles.

Image Super-resolution Reconstruction Network based on Enhanced Swin Transformer via Alternating Aggregation of Local-Global Features. (arXiv:2401.00241v1 [cs.CV])

Authors: Yuming Huang, Yingpin Chen, Changhui Wu, Hanrong Xie, Binhui Song, Hui Wang

The Swin Transformer image super-resolution reconstruction network only relies on the long-range relationship of window attention and shifted window attention to explore features. This mechanism has two limitations. On the one hand, it only focuses on global features while ignoring local features. On the other hand, it is only concerned with spatial feature interactions while ignoring channel features and channel interactions, thus limiting its non-linear mapping ability. To address the above limitations, this paper proposes enhanced Swin Transformer modules via alternating aggregation of local-global features. In the local feature aggregation stage, this paper introduces shift convolution to realize the interaction between local spatial information and channel information. This paper proposes a block sparse global perception module in the global feature aggregation stage. This module organizes the spatial information first, then sends the recombined information into a spatial gating unit to implement the further interaction of spatial and channel information. Then, a multi-scale self-attention module and a low-parameter residual channel attention module are introduced to realize information aggregation at different scales. Finally, the proposed network is validated on five publicly available datasets. The experimental results show that the proposed network outperforms the other state-of-the-art super-resolution networks.
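
For context, a generic "shift convolution" couples spatial and channel information by spatially shifting groups of channels in different directions (a zero-parameter operation) and then mixing channels with a 1x1 convolution. The group assignment and shift offsets below are illustrative choices, not the paper's exact module.

```python
# Generic shift convolution: per-group spatial shifts followed by 1x1 channel mixing.
import torch
import torch.nn as nn

class ShiftConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        g = c // 5
        out = x.clone()
        out[:, 0*g:1*g] = torch.roll(x[:, 0*g:1*g], shifts=1, dims=2)    # shift down
        out[:, 1*g:2*g] = torch.roll(x[:, 1*g:2*g], shifts=-1, dims=2)   # shift up
        out[:, 2*g:3*g] = torch.roll(x[:, 2*g:3*g], shifts=1, dims=3)    # shift right
        out[:, 3*g:4*g] = torch.roll(x[:, 3*g:4*g], shifts=-1, dims=3)   # shift left
        return self.pw(out)                                              # channel mixing

x = torch.randn(1, 40, 64, 64)
print(ShiftConv(40)(x).shape)   # torch.Size([1, 40, 64, 64])
```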

Probing the Limits and Capabilities of Diffusion Models for the Anatomic Editing of Digital Twins. (arXiv:2401.00247v1 [cs.CV])

Authors: Karim Kadry, Shreya Gupta, Farhad R. Nezami, Elazer R. Edelman

Numerical simulations can model the physical processes that govern cardiovascular device deployment. When such simulations incorporate digital twins, i.e., computational models of patient-specific anatomy, they can expedite and de-risk the device design process. Nonetheless, the exclusive use of patient-specific data constrains the anatomic variability which can be precisely or fully explored. In this study, we investigate the capacity of Latent Diffusion Models (LDMs) to edit digital twins to create anatomic variants, which we term digital siblings. Digital twins and their corresponding siblings can serve as the basis for comparative simulations, enabling the study of how subtle anatomic variations impact the simulated deployment of cardiovascular devices, as well as the augmentation of virtual cohorts for device assessment. However, while diffusion models have been characterized in their ability to edit natural images, their capacity to anatomically edit digital twins has yet to be studied. Using a case example centered on 3D digital twins of cardiac anatomy, we implement various methods for generating digital siblings and characterize them through morphological and topological analyses. We specifically edit digital twins to introduce anatomic variation at different spatial scales and within localized regions, demonstrating the existence of bias towards common anatomic features. We further show that such anatomic bias can be leveraged for virtual cohort augmentation through selective editing, partially alleviating issues related to dataset imbalance and lack of diversity. Our experimental framework thus delineates the limits and capabilities of using latent diffusion models in synthesizing anatomic variation for in silico trials.

Promoting Segment Anything Model towards Highly Accurate Dichotomous Image Segmentation. (arXiv:2401.00248v1 [cs.CV])

Authors: Xianjie Liu, Keren Fu, Qijun Zhao

Segmenting any object represents a crucial step towards achieving artificial general intelligence, and the "Segment Anything Model" (SAM) has significantly advanced the development of foundational models in computer vision. We have high expectations regarding whether SAM can enhance highly accurate dichotomous image segmentation. In fact, the evidence presented in this article demonstrates that by inputting SAM with simple prompt boxes and utilizing the results output by SAM as input for IS5Net, we can greatly improve the effectiveness of highly accurate dichotomous image segmentation.

Masked Image Modeling via Dynamic Token Morphing. (arXiv:2401.00254v1 [cs.CV])

Authors: Taekyung Kim, Dongyoon Han, Byeongho Heo

Masked Image Modeling (MIM) arises as a promising option for Vision Transformers among various self-supervised learning (SSL) methods. The essence of MIM lies in token-wise masked patch predictions, with targets patchified from images or generated by pre-trained tokenizers or models. We argue that targets from pre-trained models usually exhibit spatial inconsistency, which makes them excessively challenging for the model to follow and hinders learning more discriminative representations. To mitigate the issue, we introduce a novel self-supervision signal based on Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens. DTM can be generally applied to various SSL frameworks, yet we propose a simple MIM variant that employs DTM to effectively improve performance while barely introducing extra training costs. Our experiments on ImageNet-1K and ADE20K clearly demonstrate the superiority of our methods. Furthermore, comparative evaluations on iNaturalist and fine-grained visual classification datasets further validate the transferability of our method on various downstream tasks. Our code will be released publicly.
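
As a hedged sketch of the general "aggregate contextually related tokens" idea (not the paper's DTM procedure), the snippet below averages each target token with the tokens whose cosine similarity to it exceeds a threshold, producing smoother, more consistent targets.

```python
# Average each token with its high-similarity neighbors to smooth prediction targets.
import torch
import torch.nn.functional as F

def aggregate_similar_tokens(tokens, threshold=0.9):
    # tokens: (N, D) target features for one image
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.T                              # (N, N) cosine similarities
    groups = (sim > threshold).float()                   # soft group membership mask
    groups = groups / groups.sum(dim=-1, keepdim=True)   # row-normalize
    return groups @ tokens                               # each token -> mean of its group

tokens = torch.randn(196, 768)                           # e.g. 14x14 patch targets
print(aggregate_similar_tokens(tokens).shape)            # torch.Size([196, 768])
```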

GazeCLIP: Towards Enhancing Gaze Estimation via Text Guidance. (arXiv:2401.00260v1 [cs.CV])

Authors: Jun Wang, Hao Ruan, Mingjie Wang, Chuanghui Zhang, Chunhua Li, Jun Zhou

Over the past decade, visual gaze estimation has garnered growing attention within the research community, thanks to its wide range of application scenarios. While existing estimation approaches have achieved remarkable success in enhancing prediction accuracy, they primarily infer gaze directions from single-image signals and discard the huge potential of the currently dominant text guidance. Notably, visual-language collaboration has been extensively explored across a range of visual tasks, such as image synthesis and manipulation, leveraging the remarkable transferability of the large-scale Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing gaze estimation approaches ignore the rich semantic cues conveyed by linguistic signals and the priors in the CLIP feature space, thereby yielding performance setbacks. To bridge this gap, we delve deeply into the text-eye collaboration protocol and introduce a novel gaze estimation framework, referred to as GazeCLIP. Specifically, we design a linguistic description generator to produce text signals with coarse directional cues. Additionally, a CLIP-based backbone that excels in characterizing text-eye pairs for gaze estimation is presented. This is followed by a fine-grained multi-modal fusion module that models the interrelationships between heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of the proposed GazeCLIP, which surpasses previous approaches and achieves state-of-the-art estimation accuracy.

COMMA: Co-Articulated Multi-Modal Learning. (arXiv:2401.00268v1 [cs.CV])

Authors: Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng

Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. However, they are sensitive to the variation of input text prompts and need a selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as textual inputs to avoid the laborious hand-crafted prompt engineering required in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in these methods are usually separated or only uni-directionally correlated. Thus, the prompts of both branches are not fully correlated and may not provide enough guidance to align the representations of both branches. Second, it is observed that most previous methods usually achieve better performance on seen classes but cause performance degeneration on unseen classes compared to CLIP. This is because the essential generic knowledge learned in the pretraining stage is partly forgotten in the fine-tuning process. In this paper, we propose Co-Articulated Multi-Modal Learning (COMMA) to handle the above limitations. In particular, our method generates the prompts of each branch by considering the prompts of both branches, enhancing the alignment of the two branches' representations. Besides, to alleviate forgetting of essential knowledge, we minimize the feature discrepancy between the learned prompts and the embeddings of hand-crafted prompts of the pre-trained CLIP in the late transformer layers. We evaluate our method across three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Experimental results demonstrate the superiority of our method, exhibiting a favorable performance boost upon all tasks with high efficiency.

HybridGait: A Benchmark for Spatial-Temporal Cloth-Changing Gait Recognition with Hybrid Explorations. (arXiv:2401.00271v1 [cs.CV])

Authors: Yilan Dong, Chunlin Yu, Ruiyang Ha, Ye Shi, Yuexin Ma, Lan Xu, Yanwei Fu, Jingya Wang

Existing gait recognition benchmarks mostly include minor clothing variations in the laboratory environments, but lack persistent changes in appearance over time and space. In this paper, we propose the first in-the-wild benchmark CCGait for cloth-changing gait recognition, which incorporates diverse clothing changes, indoor and outdoor scenes, and multi-modal statistics over 92 days. To further address the coupling effect of clothing and viewpoint variations, we propose a hybrid approach HybridGait that exploits both temporal dynamics and the projected 2D information of 3D human meshes. Specifically, we introduce a Canonical Alignment Spatial-Temporal Transformer (CA-STT) module to encode human joint position-aware features, and fully exploit 3D dense priors via a Silhouette-guided Deformation with 3D-2D Appearance Projection (SilD) strategy. Our contributions are twofold: we provide a challenging benchmark CCGait that captures realistic appearance changes across expanded time and space, and we propose a hybrid framework HybridGait that outperforms prior works on CCGait and Gait3D benchmarks. Our project page is available at https://github.com/HCVLab/HybridGait.

An $\ell^1$-Plug-and-Play Approach for Magnetic Particle Imaging Using a Zero Shot Denoiser with Validation on the 3D Open MPI Dataset. (arXiv:2401.00275v1 [eess.IV])

Authors: Vladyslav Gapyak, Corinna Rentschler, Thomas März, Andreas Weinmann

Magnetic particle imaging (MPI) is an emerging medical imaging modality which has gained increasing interest in recent years. Among the benefits of MPI are its high temporal resolution, and that the technique does not expose the specimen to any kind of ionizing radiation. It is based on the non-linear response of magnetic nanoparticles to an applied magnetic field. From the electric signal measured in receive coils, the particle concentration has to be reconstructed. Due to the ill-posedness of the reconstruction problem, various regularization methods have been proposed for reconstruction ranging from early stopping methods, via classical Tikhonov regularization and iterative methods to modern machine learning approaches. In this work, we contribute to the latter class: we propose a plug-and-play approach based on a generic zero-shot denoiser with an $\ell^1$-prior. Moreover, we develop parameter selection strategies. Finally, we quantitatively and qualitatively evaluate the proposed algorithmic scheme on the 3D Open MPI data set with different levels of preprocessing.
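
The flavor of method described above can be sketched with a plug-and-play proximal-gradient loop: alternate a gradient step on the data-fidelity term with a generic denoiser call and an l1 soft-thresholding step. The forward operator and the Gaussian-filter "denoiser" below are stand-ins; the paper's splitting scheme, denoiser, and parameter selection will differ.

```python
# Plug-and-play proximal gradient with an l1 (sparsity) step on a toy MPI-like problem.
import numpy as np
from scipy.ndimage import gaussian_filter

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

rng = np.random.default_rng(0)
n = 32
A = rng.normal(size=(400, n * n)) / 20.0                  # stand-in MPI system matrix
c_true = np.zeros((n, n)); c_true[10:20, 12:22] = 1.0     # toy particle concentration
y = A @ c_true.ravel() + 0.01 * rng.normal(size=400)      # noisy measurements

c = np.zeros(n * n)
step, tau = 1.0 / np.linalg.norm(A, 2) ** 2, 1e-3
for _ in range(100):
    c = c - step * A.T @ (A @ c - y)                      # data-fidelity gradient step
    c = gaussian_filter(c.reshape(n, n), sigma=1.0).ravel()  # plug-in denoiser (stand-in)
    c = soft_threshold(c, tau)                            # l1 prior promotes sparsity
print(np.linalg.norm(c - c_true.ravel()) / np.linalg.norm(c_true))
```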

BusReF: Infrared-Visible images registration and fusion focus on reconstructible area using one set of features. (arXiv:2401.00285v1 [cs.CV])

Authors: Zeyang Zhang, Hui Li, Tianyang Xu, Xiaojun Wu, Josef Kittler

In a scenario where multi-modal cameras are operating together, the problem of working with non-aligned images cannot be avoided. Yet, existing image fusion algorithms rely heavily on strictly registered input image pairs to produce more precise fusion results, as a way to improve the performance of downstream high-level vision tasks. In order to relax this assumption, one can attempt to register images first. However, the existing methods for registering multiple modalities have limitations, such as complex structures and reliance on significant semantic information. This paper aims to address the problem of image registration and fusion in a single framework, called BusReF. We focus on the Infrared-Visible image registration and fusion task (IVRF). In this framework, the input unaligned image pairs pass through three stages: coarse registration, fine registration, and fusion. It will be shown that the unified approach enables more robust IVRF. We also propose a novel training and evaluation strategy, involving the use of masks to reduce the influence of non-reconstructible regions on the loss functions, which greatly improves the accuracy and robustness of the fusion task. Last but not least, a gradient-aware fusion network is designed to preserve the complementary information. The advanced performance of this algorithm is demonstrated by

GAN-GA: A Generative Model based on Genetic Algorithm for Medical Image Generation. (arXiv:2401.00314v1 [eess.IV])

Authors: M. AbdulRazek, G. Khoriba, M. Belal

Medical imaging is an essential tool for diagnosing and treating diseases. However, a lack of medical images can lead to inaccurate diagnoses and ineffective treatments. Generative models offer a promising solution for addressing medical image shortages due to their ability to generate new data from existing datasets and detect anomalies in these data. Data augmentation with positional methods such as scaling, cropping, flipping, padding, rotation, and translation can lead to more overfitting in domains with little data, such as medical imaging. This paper proposes GAN-GA, a generative model optimized by embedding a genetic algorithm. The proposed model enhances image fidelity and diversity while preserving distinctive features. The proposed medical image synthesis approach improves the quality and fidelity of medical images, an essential aspect of image interpretation. To evaluate the synthesized images, the Frechet Inception Distance (FID) is used. The proposed GAN-GA model is tested by generating Acute Lymphoblastic Leukemia (ALL) medical images, an image dataset used here for the first time with generative models. Our results were compared to those of InfoGAN as a baseline model. The experimental results show that the proposed optimized GAN-GA enhances FID scores by about 6.8\%, especially in earlier training epochs. The source code and dataset will be available at: https://github.com/Mustafa-AbdulRazek/InfoGAN-GA.

DXAI: Explaining Classification by Image Decomposition. (arXiv:2401.00320v1 [cs.CV])

Authors: Elnatan Kadar, Guy Gilboa

We propose a new way to explain and to visualize neural network classification through a decomposition-based explainable AI (DXAI). Instead of providing an explanation heatmap, our method yields a decomposition of the image into class-agnostic and class-distinct parts, with respect to the data and chosen classifier. Following a fundamental signal processing paradigm of analysis and synthesis, the original image is the sum of the decomposed parts. We thus obtain a radically different way of explaining classification. The class-agnostic part ideally is composed of all image features which do not possess class information, while the class-distinct part is its complement. This new visualization can be more helpful and informative in certain scenarios, especially when the attributes are dense, global and additive in nature, for instance, when colors or textures are essential for class distinction. Code is available at https://github.com/dxai2024/dxai.

Explainability-Driven Leaf Disease Classification using Adversarial Training and Knowledge Distillation. (arXiv:2401.00334v1 [cs.CV])

Authors: Sebastian-Vasile Echim, Iulian-Marius Tăiatu, Dumitru-Clementin Cercel, Florin Pop

This work focuses on plant leaf disease classification and explores three crucial aspects: adversarial training, model explainability, and model compression. The models' robustness against adversarial attacks is enhanced through adversarial training, ensuring accurate classification even in the presence of threats. Leveraging explainability techniques, we gain insights into the model's decision-making process, improving trust and transparency. Additionally, we explore model compression techniques to optimize computational efficiency while maintaining classification performance. Through our experiments, we determine that on a benchmark dataset, robustness comes at the price of classification accuracy, with performance reductions of 3%-20% on regular tests in exchange for gains of 50%-70% on adversarial attack tests. We also demonstrate that a student model can be 15-25 times more computationally efficient for a slight performance reduction, distilling the knowledge of more complex models.

SHARE: Single-view Human Adversarial REconstruction. (arXiv:2401.00343v1 [cs.CV])

Authors: Shreelekha Revankar, Shijia Liao, Yu Shen, Junbang Liang, Huaishu Peng, Ming Lin

The accuracy of 3D Human Pose and Shape reconstruction (HPS) from an image is progressively improving. Yet, no known method is robust across all image distortions. To address issues due to variations of camera poses, we introduce SHARE, a novel fine-tuning method that utilizes adversarial data augmentation to enhance the robustness of existing HPS techniques. We perform a comprehensive analysis of the impact of camera poses on HPS reconstruction outcomes. We first generated large-scale image datasets captured systematically from diverse camera perspectives. We then established a mapping between camera poses and reconstruction errors as a continuous function that characterizes the relationship between camera poses and HPS quality. Leveraging this representation, we introduce RoME (Regions of Maximal Error), a novel sampling technique for our adversarial fine-tuning method.

The SHARE framework is generalizable across various single-view HPS methods and we demonstrate its performance on HMR, SPIN, PARE, CLIFF and ExPose. Our results illustrate a reduction in mean joint errors across single-view HPS techniques, for images captured from multiple camera positions without compromising their baseline performance. In many challenging cases, our method surpasses the performance of existing models, highlighting its practical significance for diverse real-world applications.

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes. (arXiv:2401.00365v1 [cs.LG])

Authors: Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.
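
For readers unfamiliar with the base operation, here is a minimal sketch of the deterministic vector-quantization step that VQ-VAE-style models build on: each encoder feature is replaced by its nearest codebook entry, with a straight-through estimator for gradients. HQ-VAE's stochastic, Bayesian formulation is not reproduced here.

```python
# Nearest-codebook vector quantization with a straight-through gradient estimator.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, N, D) continuous latents
        flat = z.reshape(-1, z.size(-1))                    # (B*N, D)
        dist = torch.cdist(flat, self.codebook.weight)      # distances to all codes
        idx = dist.argmin(dim=-1).reshape(z.shape[:-1])     # (B, N) code indices
        z_q = self.codebook(idx)                            # quantized latents
        z_q = z + (z_q - z).detach()                        # straight-through estimator
        return z_q, idx

vq = VectorQuantizer()
z = torch.randn(2, 196, 64)
z_q, idx = vq(z)
print(z_q.shape, idx.shape)
```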

UGPNet: Universal Generative Prior for Image Restoration. (arXiv:2401.00370v1 [cs.CV])

Authors: Hwayoon Lee, Kyoungkook Kang, Hyeongmin Lee, Seung-Hwan Baek, Sunghyun Cho

Recent image restoration methods can be broadly categorized into two classes: (1) regression methods that recover the rough structure of the original image without synthesizing high-frequency details and (2) generative methods that synthesize perceptually-realistic high-frequency details even though the resulting image deviates from the original structure of the input. While both directions have been extensively studied in isolation, merging their benefits with a single framework has been rarely studied. In this paper, we propose UGPNet, a universal image restoration framework that can effectively achieve the benefits of both approaches by simply adopting a pair of an existing regression model and a generative model. UGPNet first restores the image structure of a degraded input using a regression model and synthesizes a perceptually-realistic image with a generative model on top of the regressed output. UGPNet then combines the regressed output and the synthesized output, resulting in a final result that faithfully reconstructs the structure of the original image in addition to perceptually-realistic textures. Our extensive experiments on deblurring, denoising, and super-resolution demonstrate that UGPNet can successfully exploit both regression and generative methods for high-fidelity image restoration.

Multi-Granularity Representation Learning for Sketch-based Dynamic Face Image Retrieval. (arXiv:2401.00371v1 [cs.CV])

Authors: Liang Wang, Dawei Dai, Shiyu Fu, Guoyin Wang

In specific scenarios, a face sketch can be used to identify a person. However, drawing a face sketch often requires exceptional skill and is time-consuming, limiting its widespread application in actual scenarios. The new framework of sketch-less face image retrieval (SLFIR) [1] attempts to overcome these barriers by providing a means for humans and machines to interact during the drawing process. In the SLFIR problem, there is a large gap between a partial sketch with few strokes and any whole face photo, resulting in poor performance at the early stages. In this study, we propose a multi-granularity (MG) representation learning (MGRL) method to address the SLFIR problem, in which we learn the representation of different-granularity regions for a partial sketch; then, by combining all MG regions of the sketches and images, the final distance is determined. In the experiments, our method outperformed state-of-the-art baselines in terms of early retrieval on two accessible datasets. Codes are available at https://github.com/ddw2AIGROUP2CQUPT/MGRL.

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Masked Audio Gesture Modeling. (arXiv:2401.00374v1 [cs.CV])

Authors: Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEATX (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEATX combines MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available at https://pantomatrix.github.io/EMAGE/

Horizontal Federated Computer Vision. (arXiv:2401.00390v1 [cs.CV])

Authors: Paul K. Mandal, Cole Leo, Connor Hurley

In the modern world, the amount of visual data recorded has been rapidly increasing. In many cases, data is stored in geographically distinct locations and thus requires a large amount of time and space to consolidate. Sometimes, there are also regulations for privacy protection which prevent data consolidation. In this work, we present federated implementations for object detection and recognition using a federated Faster R-CNN (FRCNN) and image segmentation using a federated Fully Convolutional Network (FCN). Our FRCNN was trained on 5000 examples of the COCO2017 dataset while our FCN was trained on the entire train set of the CamVid dataset. The proposed federated models address the challenges posed by the increasing volume and decentralized nature of visual data, offering efficient solutions in compliance with privacy regulations.
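
Horizontal federated training of this kind typically aggregates locally trained weights with a FedAvg-style weighted average; the sketch below illustrates that aggregation step under the assumption of per-client weighting by sample count (the paper's exact aggregation rule may differ).

    import numpy as np

    def fedavg(client_weights, client_sizes):
        # client_weights: list over clients, each a list of np.ndarray layers.
        # client_sizes: number of local training samples per client.
        total = float(sum(client_sizes))
        coeffs = [n / total for n in client_sizes]
        aggregated = []
        for layer_idx in range(len(client_weights[0])):
            layer = sum(c * w[layer_idx] for c, w in zip(coeffs, client_weights))
            aggregated.append(layer)
        return aggregated

    # Two toy clients, each holding a single 2x2 "layer".
    w_a = [np.ones((2, 2))]
    w_b = [np.zeros((2, 2))]
    print(fedavg([w_a, w_b], client_sizes=[300, 100])[0])  # 0.75 everywhere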

Controllable Safety-Critical Closed-loop Traffic Simulation via Guided Diffusion. (arXiv:2401.00391v1 [cs.RO])

Authors: Wei-Jer Chang, Francesco Pittaluga, Masayoshi Tomizuka, Wei Zhan, Manmohan Chandraker

Evaluating the performance of autonomous vehicle planning algorithms necessitates simulating long-tail traffic scenarios. Traditional methods for generating safety-critical scenarios often fall short in realism and controllability. Furthermore, these techniques generally neglect the dynamics of agent interactions. To mitigate these limitations, we introduce a novel closed-loop simulation framework rooted in guided diffusion models. Our approach yields two distinct advantages: 1) the generation of realistic long-tail scenarios that closely emulate real-world conditions, and 2) enhanced controllability, enabling more comprehensive and interactive evaluations. We achieve this through novel guidance objectives that enhance road progress while lowering collision and off-road rates. We develop a novel approach to simulate safety-critical scenarios through an adversarial term in the denoising process, which allows the adversarial agent to challenge a planner with plausible maneuvers, while all agents in the scene exhibit reactive and realistic behaviors. We validate our framework empirically using the NuScenes dataset, demonstrating improvements in both realism and controllability. These findings affirm that guided diffusion models provide a robust and versatile foundation for safety-critical, interactive traffic simulation, extending their utility across the broader landscape of autonomous driving. For additional resources and demonstrations, visit our project page at https://safe-sim.github.io.

Generative Model-Driven Synthetic Training Image Generation: An Approach to Cognition in Rail Defect Detection. (arXiv:2401.00393v1 [cs.CV])

Authors: Rahatara Ferdousi, Chunsheng Yang, M. Anwar Hossain, Fedwa Laamarti, M. Shamim Hossain, Abdulmotaleb El Saddik

Recent advancements in cognitive computing, with the integration of deep learning techniques, have facilitated the development of intelligent cognitive systems (ICS). This is particularly beneficial in the context of rail defect detection, where the ICS would emulate human-like analysis of image data for defect patterns. Despite the success of Convolutional Neural Networks (CNN) in visual defect classification, the scarcity of large datasets for rail defect detection remains a challenge due to infrequent accident events that would result in defective parts and images. Contemporary researchers have addressed this data scarcity challenge by exploring rule-based and generative data augmentation models. Among these, Variational Autoencoder (VAE) models can generate realistic data without extensive baseline datasets for noise modeling. This study proposes a VAE-based synthetic image generation technique for rail defects, incorporating weight decay regularization and image reconstruction loss to prevent overfitting. The proposed method is applied to create a synthetic dataset for the Canadian Pacific Railway (CPR) with just 50 real samples across five classes. Remarkably, 500 synthetic samples are generated with a minimal reconstruction loss of 0.021. A Visual Transformer (ViT) model underwent fine-tuning using this synthetic CPR dataset, achieving high accuracy rates (98%-99%) in classifying the five defect classes. This research offers a promising solution to the data scarcity challenge in rail defect detection, showcasing the potential for robust ICS development in this domain.
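
The training objective sketched in the abstract, a VAE reconstruction loss combined with weight-decay regularization, can be written roughly as below; the diagonal-Gaussian KL term and MSE reconstruction are standard VAE choices assumed here, not taken from the paper.

    import numpy as np

    def vae_loss(x, x_recon, mu, log_var, weights, weight_decay=1e-4):
        # Reconstruction + KL divergence + L2 weight-decay penalty.
        recon = ((x - x_recon) ** 2).mean()                       # image reconstruction loss
        # KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior.
        kl = -0.5 * (1 + log_var - mu ** 2 - np.exp(log_var)).sum(axis=1).mean()
        l2 = weight_decay * sum((w ** 2).sum() for w in weights)  # regularization against overfitting
        return recon + kl + l2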

Client-wise Modality Selection for Balanced Multi-modal Federated Learning. (arXiv:2401.00403v1 [cs.LG])

Authors: Yunfeng Fan, Wenchao Xu, Haozhao Wang, Penghui Ruan, Song Guo

Selecting proper clients to participate in the iterative federated learning (FL) rounds is critical to effectively harness a broad range of distributed datasets. Existing client selection methods consider only the variability among FL clients with uni-modal data and have yet to account for clients with multiple modalities. We reveal that the traditional client selection scheme in multi-modal FL (MFL) may suffer from a severe modality-level bias, which impedes the collaborative exploitation of multi-modal data, leading to insufficient local data exploration and global aggregation. To tackle this challenge, we propose a Client-wise Modality Selection scheme for MFL (CMSFed) that can comprehensively utilize information from each modality by avoiding the client selection bias caused by modality imbalance. Specifically, in each MFL round, the local data from different modalities are selectively employed to participate in local training and aggregation to mitigate potential modality imbalance of the global model. To approximate the fully aggregated model update in a balanced way, we introduce a novel local training loss function that enhances the weak modality and simultaneously aligns the divergent feature spaces caused by inconsistent modality adoption strategies across clients. Then, a modality-level gradient decoupling method is designed to derive respective submodular functions that maintain gradient diversity during the selection process and balance MFL according to local modality imbalance in each iteration. Our extensive experiments showcase the superiority of CMSFed over baselines and its effectiveness in multi-modal data exploitation.

Generalizing Single-View 3D Shape Retrieval to Occlusions and Unseen Objects. (arXiv:2401.00405v1 [cs.CV])

Authors: Qirui Wu, Daniel Ritchie, Manolis Savva, Angel X. Chang

Single-view 3D shape retrieval is a challenging task that is increasingly important with the growth of available 3D data. Prior work that has studied this task has not focused on evaluating how realistic occlusions impact performance, and how shape retrieval methods generalize to scenarios where either the target 3D shape database contains unseen shapes, or the input image contains unseen objects. In this paper, we systematically evaluate single-view 3D shape retrieval along three different axes: the presence of object occlusions and truncations, generalization to unseen 3D shape data, and generalization to unseen objects in the input images. We standardize two existing datasets of real images and propose a dataset generation pipeline to produce a synthetic dataset of scenes with multiple objects exhibiting realistic occlusions. Our experiments show that training on occlusion-free data, as was commonly done in prior work, leads to significant performance degradation for inputs with occlusion. We find that by first pretraining on our synthetic dataset with occlusions and then finetuning on real data, we can significantly outperform models from prior work and demonstrate robustness to both unseen 3D shapes and unseen objects.

Low-cost Geometry-based Eye Gaze Detection using Facial Landmarks Generated through Deep Learning. (arXiv:2401.00406v1 [cs.CV])

Authors: Esther Enhui Ye, John Enzhou Ye, Joseph Ye, Jacob Ye, Runzhou Ye

Introduction: In the realm of human-computer interaction and behavioral research, accurate real-time gaze estimation is critical. Traditional methods often rely on expensive equipment or large datasets, which are impractical in many scenarios. This paper introduces a novel, geometry-based approach to address these challenges, utilizing consumer-grade hardware for broader applicability. Methods: We leverage novel face landmark detection neural networks capable of fast inference on consumer-grade chips to generate accurate and stable 3D landmarks of the face and iris. From these, we derive a small set of geometry-based descriptors, forming an 8-dimensional manifold representing the eye and head movements. These descriptors are then used to formulate linear equations for predicting eye-gaze direction. Results: Our approach demonstrates the ability to predict gaze with an angular error of less than 1.9 degrees, rivaling state-of-the-art systems while operating in real-time and requiring negligible computational resources. Conclusion: The developed method marks a significant step forward in gaze estimation technology, offering a highly accurate, efficient, and accessible alternative to traditional systems. It opens up new possibilities for real-time applications in diverse fields, from gaming to psychological research.
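
The core fitting step described here, linear equations mapping an 8-dimensional geometric descriptor to gaze direction, can be illustrated with a plain least-squares fit on synthetic data; the descriptor values and yaw/pitch targets below are placeholders, not the paper's landmark-derived features.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))            # 8-D geometric descriptors per frame (placeholder data)
    true_W = rng.normal(size=(9, 2))
    y = np.c_[X, np.ones(500)] @ true_W      # synthetic gaze yaw/pitch targets (degrees)

    # Fit the linear mapping descriptors -> gaze with least squares (bias via a constant column).
    A = np.c_[X, np.ones(len(X))]
    W, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ W
    print("mean absolute error (deg):", np.abs(pred - y).mean())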

A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition. (arXiv:2401.00409v1 [cs.CV])

Authors: Ruoqi Yin, Jianqin Yin

Human Interaction Recognition is the process of identifying interactive actions between multiple participants in a specific situation. The aim is to recognise the action interactions between multiple entities and their meaning. Many pure CNN-based methods have issues, such as the inability to capture global instance interaction features or difficulty in training, leading to ambiguity in action semantics. In addition, the computational complexity of the Transformer cannot be ignored, and its ability to capture local information and motion features in the image is poor. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of the CNN and models global dependencies through the Transformer. The CNN and Transformer streams respectively model the entity, temporal, and spatial relationships between interacting entities. Specifically, the Transformer-based stream integrates 3D convolutions with multi-head self-attention to learn inter-token correlations, while for the CNN-based stream we propose a new multi-branch CNN framework that automatically learns joint spatio-temporal features from skeleton sequences. The convolutional layers independently learn the local features of each joint neighborhood and aggregate the features of all joints. The raw skeleton coordinates and their temporal differences are integrated in a dual-branch paradigm to fuse the skeleton's motion features. In addition, a residual structure is added to speed up training convergence. Finally, the recognition results of the two branches are fused using parallel splicing. Experimental results on diverse and challenging datasets demonstrate that the proposed method can better comprehend and infer the meaning and context of various actions, outperforming state-of-the-art methods.

Is It Possible to Backdoor Face Forgery Detection with Natural Triggers?. (arXiv:2401.00414v1 [cs.CV])

Authors: Xiaoxuan Han, Songlin Yang, Wei Wang, Ziwen He, Jing Dong

Deep neural networks have significantly improved the performance of face forgery detection models in discriminating Artificial Intelligent Generated Content (AIGC). However, their security is significantly threatened by the injection of triggers during model training (i.e., backdoor attacks). Although existing backdoor defenses and manual data selection can mitigate those using human-eye-sensitive triggers, such as patches or adversarial noises, the more challenging natural backdoor triggers remain insufficiently researched. To further investigate natural triggers, we propose a novel analysis-by-synthesis backdoor attack against face forgery detection models, which embeds natural triggers in the latent space. We thoroughly study such backdoor vulnerability from two perspectives: (1) Model Discrimination (Optimization-Based Trigger): we adopt a substitute detection model and find the trigger by minimizing the cross-entropy loss; (2) Data Distribution (Custom Trigger): we manipulate the uncommon facial attributes in the long-tailed distribution to generate poisoned samples without the supervision from detection models. Furthermore, to completely evaluate the detection models towards the latest AIGC, we utilize both state-of-the-art StyleGAN and Stable Diffusion for trigger generation. Finally, these backdoor triggers introduce specific semantic features to the generated poisoned samples (e.g., skin textures and smile), which are more natural and robust. Extensive experiments show that our method is superior from three levels: (1) Attack Success Rate: ours achieves a high attack success rate (over 99%) and incurs a small model accuracy drop (below 0.2%) with a low poisoning rate (less than 3%); (2) Backdoor Defense: ours shows better robust performance when faced with existing backdoor defense methods; (3) Human Inspection: ours is less human-eye-sensitive from a comprehensive user study.

SVFAP: Self-supervised Video Facial Affect Perceiver. (arXiv:2401.00416v1 [cs.CV])

Authors: Licai Sun, Zheng Lian, Kexin Wang, Yu He, Mingyu Xu, Haiyang Sun, Bin Liu, Jianhua Tao

Video-based facial affect analysis has recently attracted increasing attention owing to its critical role in human-computer interaction. Previous studies mainly focus on developing various deep learning architectures and training them in a fully supervised manner. Although significant progress has been achieved by these supervised methods, the longstanding lack of large-scale high-quality labeled data severely hinders their further improvements. Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP), to address the dilemma faced by supervised methods. Specifically, SVFAP leverages masked facial video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos. Considering that large spatiotemporal redundancy exists in facial videos, we propose a novel temporal pyramid and spatial bottleneck Transformer as the encoder of SVFAP, which not only enjoys low computational cost but also achieves excellent performance. To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks, including dynamic facial expression recognition, dimensional emotion recognition, and personality recognition. Comprehensive results demonstrate that SVFAP can learn powerful affect-related representations via large-scale self-supervised pre-training and it significantly outperforms previous state-of-the-art methods on all datasets. Codes will be available at https://github.com/sunlicai/SVFAP.

SynCDR : Training Cross Domain Retrieval Models with Synthetic Data. (arXiv:2401.00420v1 [cs.CV])

Authors: Samarth Mishra, Kate Saenko, Venkatesh Saligrama

In cross-domain retrieval, a model is required to identify images from the same semantic category across two visual domains. For instance, given a sketch of an object, a model needs to retrieve a real image of it from an online store's catalog. A standard approach for such a problem is learning a feature space of images where Euclidean distances reflect similarity. Even without human annotations, which may be expensive to acquire, prior methods function reasonably well using unlabeled images for training. Our problem constraint takes this further to scenarios where the two domains do not necessarily share any common categories in training data. This can occur when the two domains in question come from different versions of some biometric sensor recording identities of different people. We posit a simple solution, which is to generate synthetic data to fill in these missing category examples across domains. We do this via category-preserving translation of images from one visual domain to another. We compare approaches specifically trained for this translation for a pair of domains, as well as those that can use large-scale pre-trained text-to-image diffusion models via prompts, and find that the latter can generate better replacement synthetic data, leading to more accurate cross-domain retrieval models. Code for our work is available at https://github.com/samarth4149/SynCDR .

From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion. (arXiv:2401.00421v1 [cs.CV])

Authors: Xingyuan Li, Yang Zou, Jinyuan Liu, Zhiying Jiang, Long Ma, Xin Fan, Risheng Liu

With the rapid progression of deep learning technologies, multi-modality image fusion has become increasingly prevalent in object detection tasks. Despite its popularity, the inherent disparities in how different sources depict scene content make fusion a challenging problem. Current fusion methodologies identify shared characteristics between the two modalities and integrate them within this shared domain using either iterative optimization or deep learning architectures, which often neglect the intricate semantic relationships between modalities, resulting in a superficial understanding of inter-modal connections and, consequently, suboptimal fusion outcomes. To address this, we introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. This method capitalizes on the complementary characteristics of diverse modalities, bolstering both the accuracy and robustness of object detection. The codebook is utilized to enhance a streamlined and concise depiction of the fused intra- and inter-domain dynamics, fine-tuned for optimal performance in detection tasks. We present a bilevel optimization strategy that establishes a nexus between the joint problem of fusion and detection, optimizing both processes concurrently. Furthermore, we introduce the first dataset of paired infrared and visible images accompanied by text prompts, paving the way for future research. Extensive experiments on several datasets demonstrate that our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.

Wild2Avatar: Rendering Humans Behind Occlusions. (arXiv:2401.00431v1 [cs.CV])

Authors: Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli

Rendering the visual appearance of moving humans from occluded monocular videos is a challenging task. Most existing research renders 3D humans under ideal conditions, requiring a clear and unobstructed scene. Those methods cannot be used to render humans in real-world scenes where obstacles may block the camera's view and lead to partial occlusions. In this work, we present Wild2Avatar, a neural rendering approach catered for occluded in-the-wild monocular videos. We propose occlusion-aware scene parameterization for decoupling the scene into three parts - occlusion, human, and background. Additionally, extensive objective functions are designed to help enforce the decoupling of the human from both the occlusion and the background and to ensure the completeness of the human model. We verify the effectiveness of our approach with experiments on in-the-wild videos.

Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition. (arXiv:2401.00435v1 [cs.CV])

Authors: Hanbo Cheng, Chenyu Liu, Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Jun Du

The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR. Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models. However, existing methods fail to effectively utilize bidirectional context information during the inference stage. Furthermore, current bidirectional training methods are primarily designed for string decoders and cannot adequately generalize to tree decoders, which offer superior generalization capabilities and structural analysis capacity. In order to overcome these limitations, we propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure. Our method extends the bidirectional training strategy to the tree decoder, allowing for more effective training by leveraging bidirectional information. Additionally, we analyze the impact of the visual and linguistic perception of the HMER model separately and introduce the Shared Language Modeling (SLM) mechanism. Through the SLM, we enhance the model's robustness and generalization when dealing with visual ambiguity, particularly in scenarios with abundant training data. Our approach has been validated through extensive experiments, demonstrating its ability to achieve new state-of-the-art results on the CROHME 2014, 2016, and 2019 datasets, as well as the HME100K dataset. The code used in our experiments will be publicly available.

Diff-PCR: Diffusion-Based Correspondence Searching in Doubly Stochastic Matrix Space for Point Cloud Registration. (arXiv:2401.00436v1 [cs.CV])

Authors: Qianliang Wu, Haobo Jiang, Yaqing Ding, Lei Luo, Jin Xie, Jian Yang

Efficiently finding optimal correspondences between point clouds is crucial for solving both rigid and non-rigid point cloud registration problems. Existing methods often rely on geometric or semantic feature embedding to establish correspondences and estimate transformations or flow fields. Recently, state-of-the-art methods have employed RAFT-like iterative updates to refine the solution. However, these methods have certain limitations. Firstly, their iterative refinement design lacks transparency, and their iterative updates follow a fixed path during the refinement process, which can lead to suboptimal results. Secondly, these methods overlook the importance of refining or optimizing correspondences (or matching matrices) as a precursor to solving transformations or flow fields. They typically compute candidate correspondences based on distances in the point feature space. However, they only project the candidate matching matrix into some matrix space once with Sinkhorn or dual softmax operations to obtain final correspondences. This one-shot projected matching matrix may be far from the globally optimal one, and these approaches do not consider the distribution of the target matching matrix. In this paper, we propose a novel approach that exploits the Denoising Diffusion Model to predict a searching gradient for the optimal matching matrix within the Doubly Stochastic Matrix Space. During the reverse denoising process, our method iteratively searches for better solutions along this denoising gradient, which points towards the maximum likelihood direction of the target matching matrix. Our method offers flexibility by allowing the search to start from any initial matching matrix provided by the online backbone or white noise. Experimental evaluations on the 3DMatch/3DLoMatch and 4DMatch/4DLoMatch datasets demonstrate the effectiveness of our newly designed framework.
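
For context, the Sinkhorn projection mentioned above (the one-shot step the paper argues is insufficient) alternates row and column normalizations to push a score matrix toward the doubly stochastic set; a minimal sketch, with the temperature and iteration count chosen arbitrarily:

    import numpy as np

    def sinkhorn(scores, n_iters=50, temperature=0.1):
        # Project a similarity/score matrix toward a doubly stochastic matrix.
        K = np.exp(scores / temperature)             # positive kernel from matching scores
        for _ in range(n_iters):
            K = K / K.sum(axis=1, keepdims=True)     # normalize rows
            K = K / K.sum(axis=0, keepdims=True)     # normalize columns
        return K

    P = sinkhorn(np.random.default_rng(0).normal(size=(5, 5)))
    print(P.sum(axis=0), P.sum(axis=1))              # both close to all-ones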

SFGANS Self-supervised Future Generator for human ActioN Segmentation. (arXiv:2401.00438v1 [cs.CV])

Authors: Or Berman, Adam Goldbraikh, Shlomi Laufer

The ability to locate and classify action segments in long untrimmed video is of particular interest to many applications, such as autonomous cars, robotics, and healthcare applications. Today, the most popular pipeline for action segmentation is composed of encoding the frames into feature vectors, which are then processed by a temporal model for segmentation. In this paper, we present a self-supervised method that slots into the middle of the standard pipeline and generates refined representations of the original feature vectors. Experiments show that this method improves the performance of existing models on different sub-tasks of action segmentation, even without additional hyperparameter tuning.

TSGAN: An Optical-to-SAR Dual Conditional GAN for Optical based SAR Temporal Shifting. (arXiv:2401.00440v1 [cs.CV])

Authors: Moien Rangzan, Sara Attarchi, Richard Gloaguen, Seyed Kazem Alavipanah

In contrast to the well-investigated field of SAR-to-Optical translation, this study explores the lesser-investigated domain of Optical-to-SAR translation, a challenging field due to the ill-posed nature of this translation. The complexity arises because a single optical image can have multiple SAR representations depending on the SAR viewing geometry. We propose a novel approach, termed SAR Temporal Shifting, which takes as input optical data from the desired timestamp along with SAR data from a different temporal point that shares the viewing geometry of the expected SAR output, both complemented with a change map of the optical data during the intervening period. The model modifies the SAR data based on the changes observed in the optical data to generate the SAR data for the desired timestamp. Our model, a dual conditional Generative Adversarial Network (GAN) named Temporal Shifting GAN (TSGAN), incorporates a siamese encoder in both the Generator and the Discriminator. To prevent the model from overfitting on the input SAR data, we employed a change-weighted loss function. Our approach surpasses traditional translation methods by eliminating the GAN's fiction phenomenon, particularly in unchanged regions, resulting in higher SSIM and PSNR in these areas. Additionally, modifications to the Pix2Pix architecture and the inclusion of attention mechanisms have enhanced the model's performance on all regions of the data. This research paves the way for leveraging legacy optical datasets, the most abundant and longstanding source of Earth observation data, extending their use to SAR domains and temporal analyses. To foster further research, we provide the code, the datasets used in our study, and a framework for generating paired SAR-Optical datasets for new regions of interest. These resources are available on github.com/moienr/TemporalGAN

A Comprehensive Overview of Fish-Eye Camera Distortion Correction Methods. (arXiv:2401.00442v1 [cs.CV])

Authors: Jian Xu, De-Wei Han, Kang Li, Jun-Jie Li, Zhao-Yuan Ma

The fisheye camera, with its unique wide field of view and other characteristics, has found extensive applications in various fields. However, the fisheye camera suffers from significant distortion compared to pinhole cameras, resulting in distorted images of captured objects. Fish-eye camera distortion is a common issue in digital image processing, requiring effective correction techniques to enhance image quality. This review provides a comprehensive overview of various methods used for fish-eye camera distortion correction. The article explores the polynomial distortion model, which utilizes polynomial functions to model and correct radial distortions. Additionally, alternative approaches such as panorama mapping, grid mapping, direct methods, and deep learning-based methods are discussed. The review highlights the advantages, limitations, and recent advancements of each method, enabling readers to make informed decisions based on their specific needs.

Double-well Net for Image Segmentation. (arXiv:2401.00456v1 [cs.CV])

Authors: Hao Liu, Jun Liu, Raymond Chan, Xue-Cheng Tai

In this study, our goal is to integrate classical mathematical models with deep neural networks by introducing two novel deep neural network models for image segmentation known as Double-well Nets. Drawing inspiration from the Potts model, our models leverage neural networks to represent a region force functional. The widely recognized Potts model is approximated using a double-well potential and then solved by an operator-splitting method, which turns out to be an extension of the well-known MBO (Merriman-Bence-Osher) scheme. Subsequently, we replace the region force functional in the Potts model with a UNet-type network, which is data-driven, and also introduce control variables to enhance effectiveness. The resulting algorithm is a neural network activated by a function that minimizes the double-well potential. What sets our proposed Double-well Nets apart from many existing deep learning methods for image segmentation is their strong mathematical foundation. They are derived from network approximation theory and employ the MBO scheme to approximately solve the Potts model. By incorporating mathematical principles, Double-well Nets bridge the MBO scheme and neural networks, and offer an alternative perspective for designing networks with mathematical backgrounds. Through comprehensive experiments, we demonstrate the performance of Double-well Nets, showcasing their superior accuracy and robustness compared to state-of-the-art neural networks. Overall, our work represents a valuable contribution to the field of image segmentation by combining the strengths of classical variational models and deep neural networks. The Double-well Nets introduce an innovative approach that leverages mathematical foundations to enhance segmentation performance.
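
For readers unfamiliar with the ingredients, a standard two-phase form of the double-well potential and the two alternating MBO steps (textbook versions, not necessarily the exact functionals used in the paper) are:

    W(u) = u^2 (1 - u)^2
    \text{diffusion:}\quad v = e^{\tau \Delta} u^k
    \text{threshold:}\quad u^{k+1}(x) = \begin{cases} 1, & v(x) > \tfrac{1}{2} \\ 0, & \text{otherwise} \end{cases}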

RainSD: Rain Style Diversification Module for Image Synthesis Enhancement using Feature-Level Style Distribution. (arXiv:2401.00460v1 [cs.CV])

Authors: Hyeonjae Jeon, Junghyun Seo, Taesoo Kim, Sungho Son, Jungki Lee, Gyeungho Choi, Yongseob Lim

Autonomous driving technology now targets Level 4 or beyond, but researchers face limitations in developing reliable driving algorithms under diverse, challenging conditions. For autonomous vehicles to spread widely, it is important to address the safety issues of this technology. Among various safety concerns, sensor blockage caused by severe weather conditions is one of the most frequent threats to multi-task-learning-based perception algorithms during autonomous driving. To handle this problem, the generation of proper datasets is becoming increasingly important. In this paper, a synthetic road dataset with sensor blockage, generated from the real road dataset BDD100K and following the BDD100K annotation format, is presented. Rain streaks for each frame were created by an experimentally established equation and translated using an image-to-image translation network based on style transfer. Using this dataset, the degradation of diverse multi-task networks for autonomous driving, such as lane detection, driving area segmentation, and traffic object detection, has been thoroughly evaluated and analyzed. The tendency of the performance degradation of deep neural network-based perception systems for autonomous vehicles has been analyzed in depth. Finally, we discuss the limitations and future directions of deep neural network-based perception algorithms and autonomous driving dataset generation based on image-to-image translation.

Analyzing Local Representations of Self-supervised Vision Transformers. (arXiv:2401.00463v1 [cs.CV])

Authors: Ani Vanyan, Alvard Barseghyan, Hakob Tamazyan, Vahan Huroyan, Hrant Khachatrian, Martin Danelljan

In this paper, we present a comparative analysis of various self-supervised Vision Transformers (ViTs), focusing on their local representative power. Inspired by large language models, we examine the abilities of ViTs to perform various computer vision tasks with little to no fine-tuning. We design an evaluation framework to analyze the quality of local, i.e. patch-level, representations in the context of few-shot semantic segmentation, instance identification, object retrieval, and tracking. We discover that contrastive learning based methods like DINO produce more universal patch representations that can be immediately applied for downstream tasks with no parameter tuning, compared to masked image modeling. The embeddings learned using the latter approach, e.g. in masked autoencoders, have high-variance features that harm distance-based algorithms, such as k-NN, and do not contain useful information for most downstream tasks. Furthermore, we demonstrate that removing these high-variance features enhances k-NN performance, and we provide an analysis of the benchmarks for this work and for Scale-MAE, a recent extension of masked autoencoders. Finally, we find an object instance retrieval setting where DINOv2, a model pretrained on two orders of magnitude more data, performs worse than its less compute-intensive counterpart DINO.
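
The reported remedy, dropping the highest-variance embedding dimensions before distance-based evaluation, can be sketched as follows; the fraction of dimensions removed and the plain k-NN voting are assumptions for illustration.

    import numpy as np

    def drop_high_variance_dims(train_feats, test_feats, drop_frac=0.02):
        # Remove the highest-variance patch-embedding dimensions before k-NN.
        var = train_feats.var(axis=0)
        n_drop = max(1, int(drop_frac * train_feats.shape[1]))
        keep = np.argsort(var)[:-n_drop]            # indices of the lower-variance dims
        return train_feats[:, keep], test_feats[:, keep]

    def knn_predict(train_feats, train_labels, test_feats, k=5):
        d = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
        nn = np.argsort(d, axis=1)[:, :k]           # k nearest training samples per query
        votes = train_labels[nn]
        return np.array([np.bincount(v).argmax() for v in votes])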

SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge. (arXiv:2401.00496v1 [cs.CV])

Authors: Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, Maxence Boels, Jiayu Huo, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin, Mengya Xu, An Wang, Yanan Wu, Long Bai, Hongliang Ren, Atsushi Yamada, Yuriko Harai, Yuto Ishikawa, Kazuyuki Hayashi, Jente Simoens, Pieter DeBacker, Francesco Cisternino, Gabriele Furnari, Alex Mottrie, Federica Ferraguti, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Soohee Kim, Seung Hyun Lee, Kyu Eun Lee, Hyoun-Joong Kong, Kui Fu, Chao Li, Shan An, Stefanie Krell, Sebastian Bodenstedt, Nicolas Ayobi, Alejandra Perez, Santiago Rodriguez, Juanita Puentes, Pablo Arbelaez, Omid Mohareri, Danail Stoyanov

Surgical tool segmentation and action recognition are fundamental building blocks in many computer-assisted intervention applications, ranging from surgical skills assessment to decision support systems. Nowadays, learning-based action recognition and segmentation approaches outperform classical methods, relying, however, on large, annotated datasets. Furthermore, action recognition and tool segmentation algorithms are often trained and make predictions in isolation from each other, without exploiting potential cross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we release the first multimodal, publicly available, in-vivo, dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP). The aim of the challenge is twofold. First, to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain. Second, to further explore the potential of multitask-based learning approaches and determine their comparative advantage against their single-task counterparts. A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation.

Compressing Deep Image Super-resolution Models. (arXiv:2401.00523v1 [eess.IV])

Authors: Yuxuan Jiang, Jakub Nawala, Fan Zhang, David Bull

Deep learning techniques have been applied in the context of image super-resolution (SR), achieving remarkable advances in terms of reconstruction performance. Existing techniques typically employ highly complex model structures which result in large model sizes and slow inference speeds. This often leads to high energy consumption and restricts their adoption for practical applications. To address this issue, this work employs a three-stage workflow for compressing deep SR models which significantly reduces their memory requirement. Restoration performance has been maintained through teacher-student knowledge distillation using a newly designed distillation loss. We have applied this approach to two popular image super-resolution networks, SwinIR and EDSR, to demonstrate its effectiveness. The resulting compact models, SwinIRmini and EDSRmini, attain an 89% and 96% reduction in both model size and floating-point operations (FLOPs) respectively, compared to their original versions. They also retain competitive super-resolution performance compared to their original models and other commonly used SR approaches. The source code and pre-trained models for these two lightweight SR approaches are released at https://pikapi22.github.io/CDISM/.
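
Teacher-student knowledge distillation of the kind used in this compression workflow combines a supervised reconstruction term with a term pulling the student's output toward the frozen teacher's; the L1 form and the 50/50 weighting below are assumptions, not the paper's newly designed loss.

    import numpy as np

    def distillation_loss(student_sr, teacher_sr, ground_truth, alpha=0.5):
        # L1 to the ground truth plus L1 to the (frozen) teacher output.
        supervised = np.abs(student_sr - ground_truth).mean()
        distill = np.abs(student_sr - teacher_sr).mean()
        return (1 - alpha) * supervised + alpha * distill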

A Generalist FaceX via Learning Unified Facial Representation. (arXiv:2401.00551v1 [cs.CV])

Authors: Yue Han, Jiangning Zhang, Junwei Zhu, Xiangtai Li, Yanhao Ge, Wei Li, Chengjie Wang, Yong Liu, Xiaoming Liu, Ying Tai

This work presents the FaceX framework, a novel facial generalist model capable of handling diverse facial tasks simultaneously. To achieve this goal, we initially formulate a unified facial representation for a broad spectrum of facial editing tasks, which macroscopically decomposes a face into fundamental identity, intra-personal variation, and environmental factors. Based on this, we introduce Facial Omni-Representation Decomposing (FORD) for seamless manipulation of various facial components, microscopically decomposing the core aspects of most facial editing tasks. Furthermore, by leveraging the prior of a pretrained StableDiffusion (SD) to enhance generation quality and accelerate training, we design Facial Omni-Representation Steering (FORS) to first assemble unified facial representations and then effectively steer the SD-aware generation process by the efficient Facial Representation Controller (FRC). Our versatile FaceX achieves competitive performance compared to elaborate task-specific models on popular facial editing tasks. Full codes and models will be available at https://github.com/diffusion-facex/FaceX.

SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity. (arXiv:2401.00604v1 [cs.CV])

Authors: Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra

Score distillation has emerged as one of the most prevalent approaches for text-to-3D asset synthesis. Essentially, score distillation updates 3D parameters by lifting and back-propagating scores averaged over different views. In this paper, we reveal that the gradient estimation in score distillation inherently suffers from high variance. Through the lens of variance reduction, the effectiveness of SDS and VSD can be interpreted as applications of various control variates to the Monte Carlo estimator of the distilled score. Motivated by this rethinking and based on Stein's identity, we propose a more general solution to reduce variance for score distillation, termed Stein Score Distillation (SSD). SSD incorporates control variates constructed by Stein's identity, allowing for arbitrary baseline functions. This enables us to include flexible guidance priors and network architectures to explicitly optimize for variance reduction. In our experiments, the overall pipeline, dubbed SteinDreamer, is implemented by instantiating the control variate with a monocular depth estimator. The results suggest that SSD can effectively reduce the distillation variance and consistently improve visual quality for both object- and scene-level generation. Moreover, we demonstrate that SteinDreamer achieves faster convergence than existing methods due to more stable gradient updates.
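
The control-variate idea behind SSD can be seen on a toy Monte Carlo problem: subtracting a correlated, zero-mean baseline leaves the estimate unbiased while lowering its variance. The example below is a generic numerical illustration, unrelated to the actual text-to-3D pipeline.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)

    f = x ** 2 + x                       # quantity whose mean we estimate, E[f] = 1
    g = x                                # control variate with known mean E[g] = 0, correlated with f
    c = np.cov(f, g)[0, 1] / g.var()     # near-optimal coefficient c* = Cov(f, g) / Var(g)

    reduced = f - c * (g - 0.0)          # same expectation, smaller variance
    print(f.mean(), reduced.mean())      # both close to 1.0
    print(f.var(), reduced.var())        # variance drops after subtracting the control variate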

Bringing Back the Context: Camera Trap Species Identification as Link Prediction on Multimodal Knowledge Graphs. (arXiv:2401.00608v1 [cs.CV])

Authors: Vardaan Pahuja, Weidi Luo, Yu Gu, Cheng-Hao Tu, Hong-You Chen, Tanya Berger-Wolf, Charles Stewart, Song Gao, Wei-Lun Chao, Yu Su

Camera traps are valuable tools in animal ecology for biodiversity monitoring and conservation. However, challenges like poor generalization to deployment at new unseen locations limit their practical application. Images are naturally associated with heterogeneous forms of context possibly in different modalities. In this work, we leverage the structured context associated with the camera trap images to improve out-of-distribution generalization for the task of species identification in camera traps. For example, a photo of a wild animal may be associated with information about where and when it was taken, as well as structured biology knowledge about the animal species. While typically overlooked by existing work, bringing back such context offers several potential benefits for better image understanding, such as addressing data scarcity and enhancing generalization. However, effectively integrating such heterogeneous context into the visual domain is a challenging problem. To address this, we propose a novel framework that reformulates species classification as link prediction in a multimodal knowledge graph (KG). This framework seamlessly integrates various forms of multimodal context for visual recognition. We apply this framework for out-of-distribution species classification on the iWildCam2020-WILDS and Snapshot Mountain Zebra datasets and achieve competitive performance with state-of-the-art approaches. Furthermore, our framework successfully incorporates biological taxonomy for improved generalization and enhances sample efficiency for recognizing under-represented species.

GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields. (arXiv:2401.00616v1 [cs.CV])

Authors: Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang

In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task, which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer from blurry results due to the encoder-only architecture that relies heavily on the limited reference image. On the other hand, recent diffusion-based image-to-3D methods show vivid plausible results via distilling pre-trained 2D diffusion models into a 3D representation, yet require tedious per-scene optimization. Targeting these issues, we propose GD^2-NeRF, a Generative Detail compensation framework via GAN and Diffusion that is both inference-time finetuning-free and produces vivid plausible details. In detail, following a coarse-to-fine strategy, GD^2-NeRF is mainly composed of a One-stage Parallel Pipeline (OPP) and a 3D-consistent Detail Enhancer (Diff3DE). At the coarse stage, OPP first efficiently inserts the GAN model into the existing OG-NeRF pipeline to primarily relieve the blurry issue with in-distribution priors captured from the training dataset, achieving a good balance between sharpness (LPIPS, FID) and fidelity (PSNR, SSIM). Then, at the fine stage, Diff3DE further leverages pre-trained image diffusion models to complement rich out-of-distribution details while maintaining decent 3D consistency. Extensive experiments on both synthetic and real-world datasets show that GD^2-NeRF noticeably improves the details without per-scene finetuning.

Towards Improved Proxy-based Deep Metric Learning via Data-Augmented Domain Adaptation. (arXiv:2401.00617v1 [cs.CV])

Authors: Li Ren, Chen Chen, Liqiang Wang, Kien Hua

Deep Metric Learning (DML) plays an important role in modern computer vision research, where we learn a distance metric for a set of image representations. Recent DML techniques utilize the proxy to interact with the corresponding image samples in the embedding space. However, existing proxy-based DML methods focus on learning individual proxy-to-sample distance while the overall distribution of samples and proxies lacks attention. In this paper, we present a novel proxy-based DML framework that focuses on aligning the sample and proxy distributions to improve the efficiency of proxy-based DML losses. Specifically, we propose the Data-Augmented Domain Adaptation (DADA) method to adapt the domain gap between the group of samples and proxies. To the best of our knowledge, we are the first to leverage domain adaptation to boost the performance of proxy-based DML. We show that our method can be easily plugged into existing proxy-based DML losses. Our experiments on benchmarks, including the popular CUB-200-2011, CARS196, Stanford Online Products, and In-Shop Clothes Retrieval, show that our learning algorithm significantly improves the existing proxy losses and achieves superior results compared to the existing methods.
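
Proxy-based DML losses of the kind DADA plugs into assign each class a learnable proxy and pull every sample toward its own class proxy and away from the rest; below is a minimal Proxy-NCA-style sketch (a standard formulation, not the paper's exact loss).

    import numpy as np

    def proxy_nca_loss(embeddings, labels, proxies):
        # embeddings: (N, D) sample features; labels: (N,) int class ids; proxies: (C, D).
        e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
        d = ((e[:, None, :] - p[None, :, :]) ** 2).sum(-1)   # squared distance to every proxy
        logits = -d
        # Softmax cross-entropy against the proxy of each sample's own class.
        log_z = np.log(np.exp(logits).sum(axis=1))
        return (-(logits[np.arange(len(labels)), labels] - log_z)).mean()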

Geometry Depth Consistency in RGBD Relative Pose Estimation. (arXiv:2401.00639v1 [cs.CV])

Authors: Sourav Kumar, Chiang-Heng Chien, Benjamin Kimia

Relative pose estimation for RGBD cameras is crucial in a number of applications. Previous approaches either rely on the RGB aspect of the images to estimate pose, thus not fully making use of depth in the estimation process, or estimate pose from the 3D cloud of points that each image produces, thus not making full use of RGB information. This paper shows that if one pair of correspondences is hypothesized from the RGB-based ranked-ordered correspondence list, then the space of remaining correspondences is restricted to corresponding pairs of curves nested around the hypothesized correspondence, implicitly capturing depth consistency. This simple Geometric Depth Constraint (GDC) significantly reduces potential matches. In effect this becomes a filter on possible correspondences that helps reduce the number of outliers and thus expedites RANSAC significantly. As such, the same budget of time allows for more RANSAC iterations and therefore additional robustness and a significant speedup. In addition, the paper proposes a Nested RANSAC approach that further speeds up the process, as shown through experiments on the TUM, ICL-NUIM, and RGBD Scenes v2 datasets.

From Covert Hiding to Visual Editing: Robust Generative Video Steganography. (arXiv:2401.00652v1 [cs.CV])

Authors: Xueying Mao, Xiaoxiao Hu, Wanli Peng, Zhenliang Gan, Qichao Ying, Zhenxing Qian, Sheng Li, Xinpeng Zhang

Traditional video steganography methods are based on modifying the covert space for embedding, whereas we propose an innovative approach that embeds the secret message within semantic features during the video editing process. Although existing traditional video steganography methods display a certain level of security and embedding capacity, they lack adequate robustness against common distortions in online social networks (OSNs). In this paper, we introduce an end-to-end robust generative video steganography network (RoGVS), which achieves visual editing by modifying the semantic features of videos to embed the secret message. We employ a face-swapping scenario to showcase the visual editing effects. We first design a secret message embedding module to adaptively hide the secret message in the semantic features of videos. Extensive experiments on facial video datasets demonstrate that the proposed RoGVS method is superior to existing video and image steganography techniques in terms of both robustness and capacity.

PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning. (arXiv:2401.00653v1 [cs.CV])

Authors: Xuntao Liu, Yuzhou Yang, Qichao Ying, Zhenxing Qian, Xinpeng Zhang, Sheng Li

Deceptive images can be shared within seconds on social networking services, posing substantial risks. Tampering traces, such as boundary artifacts and high-frequency information, have been significantly emphasized by massive networks in the Image Manipulation Localization (IML) field. However, they are prone to image post-processing operations, which limits the generalization and robustness of existing methods. We present a novel Prompt-IML framework. We observe that humans tend to discern the authenticity of an image based on both semantic and high-frequency information; inspired by this, the proposed framework leverages rich semantic knowledge from pre-trained visual foundation models to assist IML. We are the first to design a framework that utilizes visual foundation models specifically for the IML task. Moreover, we design a Feature Alignment and Fusion module to align and fuse semantic features with high-frequency features, aiming to locate tampered regions from multiple perspectives. Experimental results demonstrate that our model achieves better performance on eight typical fake image datasets and outstanding robustness.

Optimizing ADMM and Over-Relaxed ADMM Parameters for Linear Quadratic Problems. (arXiv:2401.00657v1 [math.OC])

Authors: Jintao Song, Wenqi Lu, Yunwen Lei, Yuchao Tang, Zhenkuan Pan, Jinming Duan

The Alternating Direction Method of Multipliers (ADMM) has gained significant attention across a broad spectrum of machine learning applications. Incorporating the over-relaxation technique shows potential for enhancing the convergence rate of ADMM. However, determining optimal algorithmic parameters, including both the associated penalty and relaxation parameters, often relies on empirical approaches tailored to specific problem domains and contextual scenarios. Incorrect parameter selection can significantly hinder ADMM's convergence rate. To address this challenge, in this paper we first propose a general approach to optimize the value of the penalty parameter, followed by a novel closed-form formula to compute the optimal relaxation parameter in the context of linear quadratic problems (LQPs). We then experimentally validate our parameter selection methods through random instantiations and diverse imaging applications, encompassing diffeomorphic image registration, image deblurring, and MRI reconstruction.
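
The interplay of the penalty parameter rho and the relaxation parameter alpha is easiest to see in a standard over-relaxed ADMM iteration; the lasso instance below is a generic illustration (with alpha in (0, 2)), not the paper's closed-form parameter choices.

    import numpy as np

    def soft_threshold(v, k):
        return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

    def lasso_admm(A, b, lam, rho=1.0, alpha=1.5, n_iters=200):
        # Over-relaxed ADMM for min 0.5*||Ax - b||^2 + lam*||x||_1.
        n = A.shape[1]
        x = z = u = np.zeros(n)
        AtA, Atb = A.T @ A, A.T @ b
        L = np.linalg.cholesky(AtA + rho * np.eye(n))   # factor once, reuse every iteration
        for _ in range(n_iters):
            x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
            x_hat = alpha * x + (1 - alpha) * z          # over-relaxation step
            z = soft_threshold(x_hat + u, lam / rho)
            u = u + x_hat - z
        return z

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(50, 20)), rng.normal(size=50)
    print(np.count_nonzero(np.round(lasso_admm(A, b, lam=2.0), 4)))  # sparse solution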

1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation. (arXiv:2401.00663v1 [cs.CV])

Authors: Zhuoyan Luo, Yicheng Xiao, Yong Liu, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang

The recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage rationally ensembles RVOS models based on framework design as well as training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence via an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st place on the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023) track 3. Code is available at https://github.com/RobertLuo1/iccv2023_RVOS_Challenge.
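
Ensembling binary mask sequences from several models can be illustrated with simple pixel-wise voting followed by thresholding; this generic sketch is not the challenge entry's exact two-stage strategy.

    import numpy as np

    def fuse_masks(mask_stack, threshold=0.5):
        # mask_stack: (M, T, H, W) binary masks from M models over T frames.
        votes = mask_stack.astype(float).mean(axis=0)    # per-pixel agreement across models
        return (votes >= threshold).astype(np.uint8)     # fused binary mask sequence

    fused = fuse_masks(np.random.default_rng(0).integers(0, 2, size=(3, 4, 8, 8)))
    print(fused.shape)   # (4, 8, 8)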

Self-supervised learning for skin cancer diagnosis with limited training data. (arXiv:2401.00692v1 [eess.IV])

Authors: Hamish Haggerty, Rohitash Chandra

Cancer diagnosis is a well-studied problem in machine learning since early detection of cancer is often the determining factor in prognosis. Supervised deep learning achieves excellent results in cancer image classification, usually through transfer learning. However, these models require large amounts of labelled data and, for several types of cancer, large labelled datasets do not exist. In this paper, we demonstrate that a model pre-trained using a self-supervised learning algorithm known as Barlow Twins can outperform the conventional supervised transfer learning pipeline. We juxtapose two base models: i) pretrained in a supervised fashion on ImageNet; ii) pretrained in a self-supervised fashion on ImageNet. Both are subsequently fine-tuned on a small labelled skin lesion dataset and evaluated on a large test set. We achieve a mean test accuracy of 70% for self-supervised transfer in comparison to 66% for supervised transfer. Interestingly, performance can be boosted further by self-supervised pretraining a second time (on unlabelled skin lesion images) before subsequent fine-tuning. This hints at an alternative path to collecting more labelled data in settings where this is challenging - namely just collecting more unlabelled images. Our framework is applicable to cancer image classification models in the low-labelled data regime.
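
Barlow Twins pre-training pushes the cross-correlation matrix of two augmented views of the same batch toward the identity; a minimal sketch of that objective (the lambda weighting is the commonly used default, assumed here):

    import numpy as np

    def barlow_twins_loss(z1, z2, lam=5e-3):
        # z1, z2: (N, D) embeddings of two augmentations of the same batch.
        z1 = (z1 - z1.mean(0)) / z1.std(0)          # standardize each feature dimension
        z2 = (z2 - z2.mean(0)) / z2.std(0)
        c = z1.T @ z2 / len(z1)                     # (D, D) cross-correlation matrix
        on_diag = ((np.diag(c) - 1.0) ** 2).sum()   # invariance term
        off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()   # redundancy-reduction term
        return on_diag + lam * off_diag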

Credible Teacher for Semi-Supervised Object Detection in Open Scene. (arXiv:2401.00695v1 [cs.CV])

Authors: Jingyu Zhuang, Kuo Wang, Liang Lin, Guanbin Li

Semi-Supervised Object Detection (SSOD) has achieved resounding success by leveraging unlabeled data to improve detection performance. However, in Open Scene Semi-Supervised Object Detection (O-SSOD), unlabeled data may contain unknown objects not observed in the labeled data, which increases uncertainty in the model's predictions for known objects. This is detrimental to current methods that mainly rely on self-training, as more uncertainty leads to lower localization and classification precision of pseudo labels. To this end, we propose Credible Teacher, an end-to-end framework. Credible Teacher adopts an interactive teaching mechanism using flexible labels to prevent uncertain pseudo labels from misleading the model and gradually reduces its uncertainty through the guidance of other credible pseudo labels. Empirical results demonstrate that our method effectively restrains the adverse effects caused by O-SSOD and significantly outperforms existing counterparts.

An attempt to generate new bridge types from latent space of generative adversarial network. (arXiv:2401.00700v1 [cs.LG])

Authors: Hongjun Zhang

We attempt to generate new bridge types using generative artificial intelligence technology. A symmetric structured image dataset of three-span beam bridges, arch bridges, cable-stayed bridges, and suspension bridges is used. Based on the Python programming language, the TensorFlow and Keras deep learning frameworks, the Wasserstein loss function, and Lipschitz constraints, a generative adversarial network is constructed and trained. By sampling from the obtained low-dimensional bridge-type latent space, new bridge types with asymmetric structures can be generated. The generative adversarial network can create new bridge types by organically combining different structural components on the basis of human original bridge types, showing a certain degree of original creative ability. Generative artificial intelligence technology can open up the imagination space and inspire humanity.
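
The Wasserstein loss with a Lipschitz constraint mentioned here is commonly imposed via either weight clipping or a gradient penalty; the sketch below shows the gradient-penalty variant of the critic loss, written in PyTorch for brevity even though the abstract cites TensorFlow/Keras, so it is an illustration rather than the paper's implementation.

    import torch

    def critic_loss_wgan_gp(critic, real, fake, gp_weight=10.0):
        # Wasserstein critic loss plus a gradient penalty enforcing the
        # 1-Lipschitz constraint (weight clipping is the other common option).
        eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
        grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
        gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
        return critic(fake).mean() - critic(real).mean() + gp_weight * gp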

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning. (arXiv:2401.00701v1 [cs.CV])

Authors: Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li

In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computation complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content and strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness. Notably, our method achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
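
The two-stage retrieval design, coarse-grained recall of top-k candidates followed by fine-grained reranking, can be sketched with cosine similarity standing in for both scoring functions; the embedding dimensions and k are placeholders, and the paper's fine-grained scoring is interaction-based rather than a single cosine.

    import numpy as np

    def two_stage_retrieve(text_emb, coarse_vid, fine_vid, k=50):
        # Stage 1: cosine recall with coarse video embeddings.
        # Stage 2: rerank the k candidates with fine-grained embeddings.
        def cos(a, B):
            a = a / np.linalg.norm(a)
            B = B / np.linalg.norm(B, axis=1, keepdims=True)
            return B @ a
        candidates = np.argsort(-cos(text_emb, coarse_vid))[:k]     # fast recall over the full corpus
        rerank = np.argsort(-cos(text_emb, fine_vid[candidates]))   # expensive scoring on few items
        return candidates[rerank]

    rng = np.random.default_rng(0)
    order = two_stage_retrieve(rng.normal(size=256),
                               rng.normal(size=(1000, 256)),
                               rng.normal(size=(1000, 256)), k=50)
    print(order[:5])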

Revisiting Nonlocal Self-Similarity from Continuous Representation. (arXiv:2401.00708v1 [cs.CV])

Authors: Yisi Luo, Xile Zhao, Deyu Meng

Nonlocal self-similarity (NSS) is an important prior that has been successfully applied in multi-dimensional data processing tasks, e.g., image and video recovery. However, existing NSS-based methods are solely suitable for meshgrid data such as images and videos, and are not suitable for emerging off-meshgrid data, e.g., point cloud and climate data. In this work, we revisit NSS from the continuous representation perspective and propose a novel Continuous Representation-based NonLocal method (termed CRNL), which has two innovative features as compared with classical nonlocal methods. First, based on the continuous representation, our CRNL unifies the measure of self-similarity for on-meshgrid and off-meshgrid data and thus is naturally suitable for both of them. Second, the nonlocal continuous groups can be more compactly and efficiently represented by a coupled low-rank function factorization, which simultaneously exploits the similarity within each group and across different groups, whereas classical nonlocal methods neglect the similarity across groups. This elaborately designed coupled mechanism allows our method to enjoy favorable performance over conventional NSS methods in terms of both effectiveness and efficiency. Extensive experiments on multi-dimensional data processing tasks, both on-meshgrid (e.g., image inpainting and image denoising) and off-meshgrid (e.g., climate data prediction and point cloud recovery), validate the versatility, effectiveness, and efficiency of our CRNL as compared with state-of-the-art methods.

Text2Avatar: Text to 3D Human Avatar Generation with Codebook-Driven Body Controllable Attribute. (arXiv:2401.00711v1 [cs.CV])

Authors: Chaoqun Gong, Yuqin Dai, Ronghui Li, Achun Bao, Jun Li, Jian Yang, Yachao Zhang, Xiu Li

Generating 3D human models directly from text helps reduce the cost and time of character modeling. However, achieving multi-attribute controllable and realistic 3D human avatar generation is still challenging due to feature coupling and the scarcity of realistic 3D human avatar datasets. To address these issues, we propose Text2Avatar, which can generate realistic-style 3D avatars based on coupled text prompts. Text2Avatar leverages a discrete codebook as an intermediate feature to establish a connection between text and avatars, enabling the disentanglement of features. Furthermore, to alleviate the scarcity of realistic-style 3D human avatar data, we utilize a pre-trained unconditional 3D human avatar generation model to obtain a large amount of 3D avatar pseudo data, which allows Text2Avatar to achieve realistic-style generation. Experimental results demonstrate that our method can generate realistic 3D avatars from coupled textual data, which is challenging for other existing methods in this field.

Depth Map Denoising Network and Lightweight Fusion Network for Enhanced 3D Face Recognition. (arXiv:2401.00719v1 [cs.CV])

Authors: Ruizhuo Xu, Ke Wang, Chao Deng, Mei Wang, Xi Chen, Wenhui Huang, Junlan Feng, Weihong Deng

With the increasing availability of consumer depth sensors, 3D face recognition (FR) has attracted more and more attention. However, the data acquired by these sensors are often coarse and noisy, making them impractical to use directly. In this paper, we introduce an innovative Depth map denoising network (DMDNet) based on the Denoising Implicit Image Function (DIIF) to reduce noise and enhance the quality of facial depth images for low-quality 3D FR. After generating clean depth faces using DMDNet, we further design a powerful recognition network called Lightweight Depth and Normal Fusion network (LDNFNet), which incorporates a multi-branch fusion block to learn unique and complementary features between different modalities such as depth and normal images. Comprehensive experiments conducted on four distinct low-quality databases demonstrate the effectiveness and robustness of our proposed methods. Furthermore, when combining DMDNet and LDNFNet, we achieve state-of-the-art results on the Lock3DFace database.

BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation. (arXiv:2401.00722v1 [cs.CV])

Authors: Libin Lan, Pengzhou Cai, Lu Jiang, Xiaojuan Liu, Yongmei Li, Yudong Zhang

Accurate medical image segmentation is essential for clinical quantification, disease diagnosis, treatment planning, and many other applications. Both convolution-based and transformer-based u-shaped architectures have achieved significant success in various medical image segmentation tasks. The former can efficiently learn local information of images while relying heavily on the image-specific inductive biases inherent to the convolution operation. The latter can effectively capture long-range dependency at different feature scales using self-attention, whereas it typically incurs quadratic compute and memory costs as the sequence length increases. To address this problem, by integrating the merits of these two paradigms in a well-designed u-shaped architecture, we propose a hybrid yet effective CNN-Transformer network, named BRAU-Net++, for accurate medical image segmentation. Specifically, BRAU-Net++ uses bi-level routing attention as the core building block to design our u-shaped encoder-decoder structure, in which both encoder and decoder are hierarchically constructed, so as to learn global semantic information while reducing computational complexity. Furthermore, this network restructures the skip connection by incorporating channel-spatial attention built from convolution operations, aiming to minimize local spatial information loss and amplify global dimension-interaction of multi-scale features. Extensive experiments on three public benchmark datasets demonstrate that our proposed approach surpasses other state-of-the-art methods, including its baseline BRAU-Net, under almost all evaluation metrics. We achieve an average Dice-Similarity Coefficient (DSC) of 82.47%, 90.10%, and 92.94% on Synapse multi-organ segmentation, the ISIC-2018 Challenge, and CVC-ClinicDB, as well as mIoU of 84.01% and 88.17% on the ISIC-2018 Challenge and CVC-ClinicDB, respectively.
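
The restructured skip connection is described as channel-spatial attention built from convolution operations; the sketch below shows a generic channel-then-spatial attention gate in PyTorch as one plausible reading, not the exact BRAU-Net++ block.

```python
import torch
import torch.nn as nn

class ChannelSpatialGate(nn.Module):
    """Generic channel-spatial attention over a skip feature map (sketch)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_mlp(x)                      # channel attention
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial_conv(pooled)             # spatial attention
```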

MultiFusionNet: Multilayer Multimodal Fusion of Deep Neural Networks for Chest X-Ray Image Classification. (arXiv:2401.00728v1 [eess.IV])

Authors: Saurabh Agarwal, K. V. Arya, Yogesh Kumar Meena

Chest X-ray imaging is a critical diagnostic tool for identifying pulmonary diseases. However, manual interpretation of these images is time-consuming and error-prone. Automated systems utilizing convolutional neural networks (CNNs) have shown promise in improving the accuracy and efficiency of chest X-ray image classification. While previous work has mainly focused on using feature maps from the final convolution layer, there is a need to explore the benefits of leveraging additional layers for improved disease classification. Extracting robust features from limited medical image datasets remains a critical challenge. In this paper, we propose a novel deep learning-based multilayer multimodal fusion model that emphasizes extracting features from different layers and fusing them. Our disease detection model considers the discriminatory information captured by each layer. Furthermore, we propose the fusion of different-sized feature maps (FDSFM) module to effectively merge feature maps from diverse layers. The proposed model achieves significantly higher accuracies of 97.21% and 99.60% for three-class and two-class classification, respectively. The proposed multilayer multimodal fusion model, along with the FDSFM module, holds promise for accurate disease classification and can also be extended to other disease classification tasks in chest X-ray images.
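
As a rough illustration of fusing different-sized feature maps from multiple layers (the role played by the FDSFM module), one can resize each map to a common spatial size, project it to a common channel width, and concatenate. The layer choices and interpolation mode below are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseMultiScaleFeatures(nn.Module):
    """Fuse feature maps of different spatial sizes / channel counts (sketch)."""
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list]
        )

    def forward(self, feature_maps):
        target_size = feature_maps[0].shape[-2:]          # fuse at the largest map
        resized = [
            F.interpolate(proj(f), size=target_size, mode="bilinear",
                          align_corners=False)
            for proj, f in zip(self.projs, feature_maps)
        ]
        return torch.cat(resized, dim=1)                  # channel-wise fusion
```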

NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction. (arXiv:2401.00729v1 [cs.CV])

Authors: Beibei Lin, Yeying Jin, Wending Yan, Wei Ye, Yuan Yuan, Shunli Zhang, Robby Tan

Existing deep-learning-based methods for nighttime video deraining rely on synthetic data due to the absence of real-world paired data. However, the intricacies of the real world, particularly the presence of light effects and low-light regions affected by noise, create significant domain gaps, hampering synthetic-trained models in removing rain streaks properly and leading to over-saturation and color shifts. Motivated by this, we introduce NightRain, a novel nighttime video deraining method with adaptive-rain-removal and adaptive-correction. Our adaptive-rain-removal uses unlabeled rain videos to enable our model to derain real-world rain videos, particularly in regions affected by complex light effects. The idea is to allow our model to obtain rain-free regions based on confidence scores. Once rain-free regions and the corresponding regions from our input are obtained, we can form region-based paired real data. These paired data are used to train our model in a teacher-student framework, allowing the model to iteratively learn from less challenging regions to more challenging regions. Our adaptive-correction aims to rectify errors in our model's predictions, such as over-saturation and color shifts. The idea is to learn from the differences, or distance, between clear night input training videos and their corresponding predictions, compelling the model to correct these errors. Extensive experiments demonstrate that our method achieves state-of-the-art performance. It achieves a PSNR of 26.73dB, surpassing existing nighttime video deraining methods by a substantial margin of 13.7%.

Diffusion Models, Image Super-Resolution And Everything: A Survey. (arXiv:2401.00736v1 [cs.CV])

Authors: Brian B. Moser, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Sebastian Palacio, Andreas Dengel

Diffusion Models (DMs) represent a significant advancement in image Super-Resolution (SR), aligning technical image quality more closely with human preferences and expanding SR applications. DMs address critical limitations of previous methods, enhancing overall realism and details in SR images. However, DMs suffer from color-shifting issues, and their high computational costs call for efficient sampling alternatives, underscoring the challenge of balancing computational efficiency and image quality. This survey gives an overview of DMs applied to image SR and offers a detailed analysis that underscores the unique characteristics and methodologies within this domain, distinct from broader existing reviews in the field. It presents a unified view of DM fundamentals and explores research directions, including alternative input domains, conditioning strategies, guidance, corruption spaces, and zero-shot methods. This survey provides insights into the evolution of image SR with DMs, addressing current trends, challenges, and future directions in this rapidly evolving field.

DiffMorph: Text-less Image Morphing with Diffusion Models. (arXiv:2401.00739v1 [cs.CV])

Authors: Shounak Chatterjee

Text-conditioned image generation models are a prevalent form of AI image synthesis, yet giving an artist intuitive control over the output remains challenging. Current methods require multiple images and textual prompts for each object in order to specify them as concepts and generate a single customized image.

On the other hand, our work, DiffMorph, introduces a novel approach that synthesizes images that mix concepts without the use of textual prompts. Our work integrates a sketch-to-image module to incorporate user sketches as input. DiffMorph takes an initial image with conditioning artist-drawn sketches to generate a morphed image.

We employ a pre-trained text-to-image diffusion model and fine-tune it to reconstruct each image faithfully. We seamlessly merge images and concepts from sketches into a cohesive composition. The image generation capability of our work is demonstrated through our results and a comparison of these with prompt-based image generation.

Beyond Subspace Isolation: Many-to-Many Transformer for Light Field Image Super-resolution. (arXiv:2401.00740v1 [eess.IV])

Authors: Zeke Zexi Hu, Xiaoming Chen, Vera Yuk Ying Chung, Yiran Shen

The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers has led to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opt to decompose the data into a number of lower-dimensional subspaces and apply Transformers in each subspace individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, explicitly preventing comprehensive optimization on all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT can comprehensively capture long-range correlation dependencies. With M2MT as the pivotal component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets. We further conduct in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representation.
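
A minimal sketch of the many-to-many idea described above: angular information (all sub-aperture images) is folded into the token dimension before self-attention, so every token can attend to every SAI instead of being confined to one subspace. The tensor layout and the use of a standard multi-head attention module are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def many_to_many_attention(lf_feats, attn: nn.MultiheadAttention):
    """lf_feats: (B, U, V, H, W, C) light-field features over a UxV grid of
    sub-aperture images. Tokens span all SAIs jointly (sketch, not the paper's code)."""
    B, U, V, H, W, C = lf_feats.shape
    tokens = lf_feats.reshape(B, U * V * H * W, C)   # every SAI pixel is a token
    out, _ = attn(tokens, tokens, tokens)            # full spatial-angular attention
    return out.reshape(B, U, V, H, W, C)

# Usage sketch with tiny sizes for illustration:
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(1, 3, 3, 8, 8, 32)
y = many_to_many_attention(x, attn)
```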

New Job, New Gender? Measuring the Social Bias in Image Generation Models. (arXiv:2401.00763v1 [cs.SE])

Authors: Wenxuan Wang, Haonan Bai, Jen-tse Huang, Yuxuan Wan, Youliang Yuan, Haoyi Qiu, Nanyun Peng, Michael R. Lyu

Image generation models can generate or edit images from a given text. Recent advancements in image generation technology, exemplified by DALL-E and Midjourney, have been groundbreaking. These advanced models, despite their impressive capabilities, are often trained on massive Internet datasets, making them susceptible to generating content that perpetuates social stereotypes and biases, which can lead to severe consequences. Prior research on assessing bias within image generation models suffers from several shortcomings, including limited accuracy, reliance on extensive human labor, and lack of comprehensive analysis. In this paper, we propose BiasPainter, a novel metamorphic testing framework that can accurately, automatically, and comprehensively trigger social bias in image generation models. BiasPainter uses a diverse range of seed images of individuals and prompts the image generation models to edit these images using gender-, race-, and age-neutral queries. These queries span 62 professions, 39 activities, 57 types of objects, and 70 personality traits. The framework then compares the edited images to the original seed images, focusing on any changes related to gender, race, and age. BiasPainter adopts a testing oracle that these characteristics should not be modified when subjected to neutral prompts. Built upon this design, BiasPainter can trigger social bias and evaluate the fairness of image generation models. To evaluate the effectiveness of BiasPainter, we use it to test five widely used commercial image generation tools and models, such as Stable Diffusion and Midjourney. Experimental results show that 100% of the generated test cases can successfully trigger social bias in image generation models.

Bracketing is All You Need: Unifying Image Restoration and Enhancement Tasks with Multi-Exposure Images. (arXiv:2401.00766v1 [cs.CV])

Authors: Zhilu Zhang, Shuohao Zhang, Renlong Wu, Zifei Yan, Wangmeng Zuo

It is challenging but highly desirable to acquire high-quality photos with clear content in low-light environments. Although multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress in addressing this issue, they typically focus exclusively on specific restoration or enhancement tasks and are insufficient in exploiting multiple images jointly. Motivated by the fact that multi-exposure images are complementary in denoising, deblurring, high dynamic range imaging, and super-resolution, we propose to utilize bracketing photography to unify restoration and enhancement tasks in this work. Due to the difficulty in collecting real-world pairs, we suggest a solution that first pre-trains the model with synthetic paired data and then adapts it to real-world unlabeled images. In particular, a temporally modulated recurrent network (TMRNet) and a self-supervised adaptation method are proposed. Moreover, we construct a data simulation pipeline to synthesize pairs and collect real-world images from 200 nighttime scenarios. Experiments on both datasets show that our method performs favorably against state-of-the-art multi-image processing ones. The dataset, code, and pre-trained models are available at https://github.com/cszhilu1998/BracketIRE.

Retrieval-Augmented Egocentric Video Captioning. (arXiv:2401.00789v1 [cs.CV])

Authors: Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references.
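
A rough sketch of a contrastive objective in the spirit of the EgoExoNCE loss described above: egocentric and exocentric video features are both pulled toward the text feature of the shared caption. The temperature, batching, and symmetric formulation are assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def ego_exo_text_nce(ego_feats, exo_feats, text_feats, temperature=0.07):
    """ego_feats, exo_feats, text_feats: (B, D); row i of each tensor describes
    the same action. Both video views are aligned to the shared text feature."""
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    targets = torch.arange(ego.size(0), device=ego.device)

    loss = 0.0
    for vid in (ego, exo):
        logits = vid @ txt.t() / temperature             # (B, B) similarity matrix
        loss = loss + F.cross_entropy(logits, targets)          # video -> text
        loss = loss + F.cross_entropy(logits.t(), targets)      # text -> video
    return loss / 4.0
```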

GLIMPSE: Generalized Local Imaging with MLPs. (arXiv:2401.00816v1 [cs.CV])

Authors: AmirEhsan Khorashadizadeh, Valentin Debarnot, Tianlin Liu, Ivan Dokmanić

Deep learning is the current de facto state of the art in tomographic imaging. A common approach is to feed the result of a simple inversion, for example the backprojection, to a convolutional neural network (CNN) which then computes the reconstruction. Despite strong results on 'in-distribution' test data similar to the training data, backprojection from sparse-view data delocalizes singularities, so these approaches require a large receptive field to perform well. As a consequence, they overfit to certain global structures which leads to poor generalization on out-of-distribution (OOD) samples. Moreover, their memory complexity and training time scale unfavorably with image resolution, making them impractical for application at realistic clinical resolutions, especially in 3D: a standard U-Net requires a substantial 140GB of memory and 2600 seconds per epoch on a research-grade GPU when training on 1024x1024 images. In this paper, we introduce GLIMPSE, a local processing neural network for computed tomography which reconstructs a pixel value by feeding only the measurements associated with the neighborhood of the pixel to a simple MLP. While achieving comparable or better performance with successful CNNs like the U-Net on in-distribution test data, GLIMPSE significantly outperforms them on OOD samples while maintaining a memory footprint almost independent of image resolution; 5GB memory suffices to train on 1024x1024 images. Further, we built GLIMPSE to be fully differentiable, which enables feats such as recovery of accurate projection angles if they are out of calibration.
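
The core idea above, reconstructing each pixel from only the measurements associated with its neighborhood via a small MLP, can be sketched as follows. The `gather_local_measurements` routine is a hypothetical placeholder for the geometry-dependent selection of sinogram values; it is not part of the paper's released code.

```python
import torch
import torch.nn as nn

class PixelMLP(nn.Module):
    """Small MLP mapping a pixel's local measurements to its value (sketch)."""
    def __init__(self, n_local_meas, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_local_meas, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, local_meas):             # (num_pixels, n_local_meas)
        return self.net(local_meas).squeeze(-1)

def reconstruct(sinogram, pixel_coords, mlp, gather_local_measurements):
    # gather_local_measurements is assumed to return, for each pixel, the
    # sinogram entries whose rays pass near that pixel (hypothetical helper).
    local = gather_local_measurements(sinogram, pixel_coords)
    return mlp(local)                           # per-pixel reconstructed values
```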

Sharp-NeRF: Grid-based Fast Deblurring Neural Radiance Fields Using Sharpness Prior. (arXiv:2401.00825v1 [cs.CV])

Authors: Byeonghyeon Lee, Howoong Lee, Usman Ali, Eunbyung Park

Neural Radiance Fields (NeRF) have shown remarkable performance in neural rendering-based novel view synthesis. However, NeRF suffers from severe visual quality degradation when the input images have been captured under imperfect conditions, such as poor illumination, defocus blurring, and lens aberrations. In particular, defocus blur is quite common when images are captured with ordinary cameras. Although a few recent studies have proposed rendering sharp images of considerably high quality, they still face many key challenges. In particular, those methods employ a Multi-Layer Perceptron (MLP)-based NeRF, which requires tremendous computational time. To overcome these shortcomings, this paper proposes a novel technique, Sharp-NeRF -- a grid-based NeRF that renders clean and sharp images from blurry input images within half an hour of training. To do so, we use several grid-based kernels to accurately model the sharpness/blurriness of the scene. The sharpness level of the pixels is computed to learn the spatially varying blur kernels. We have conducted experiments on benchmarks consisting of blurry images and evaluated full-reference and no-reference metrics. The qualitative and quantitative results reveal that our approach renders sharp novel views with vivid colors and fine details, and that it has considerably faster training time than previous works. Our project page is available at https://benhenryl.github.io/SharpNeRF/

Rethinking RAFT for Efficient Optical Flow. (arXiv:2401.00833v1 [cs.CV])

Authors: Navid Eslami, Farnoosh Arefi, Amir M. Mansourian, Shohreh Kasaei

Despite significant progress in deep learning-based optical flow methods, accurately estimating large displacements and repetitive patterns remains a challenge. The limitations of the local features and similarity search patterns used in these algorithms contribute to this issue. Additionally, some existing methods suffer from slow runtime and excessive graphics memory consumption. To address these problems, this paper proposes a novel approach based on the RAFT framework. The proposed Attention-based Feature Localization (AFL) approach incorporates the attention mechanism to handle global feature extraction and address repetitive patterns. It introduces an operator for matching pixels with corresponding counterparts in the second frame and assigning accurate flow values. Furthermore, an Amorphous Lookup Operator (ALO) is proposed to enhance convergence speed and improve RAFT's ability to handle large displacements by reducing data redundancy in its search operator and expanding the search space for similarity extraction. The proposed method, Efficient RAFT (Ef-RAFT), achieves significant improvements of 10% on the Sintel dataset and 5% on the KITTI dataset over RAFT. Remarkably, these enhancements are attained with a modest 33% reduction in speed and a mere 13% increase in memory usage. The code is available at: https://github.com/n3slami/Ef-RAFT

Deblurring 3D Gaussian Splatting. (arXiv:2401.00834v1 [cs.CV])

Authors: Byeonghyeon Lee, Howoong Lee, Xiangyu Sun, Usman Ali, Eunbyung Park

Recent studies in Radiance Fields have paved a robust way for novel view synthesis with their photorealistic rendering quality. Nevertheless, they usually employ neural networks and volumetric rendering, which are costly to train and, due to the lengthy rendering time, impede their broad use in various real-time applications. Lately, a 3D Gaussian splatting-based approach has been proposed to model 3D scenes, and it achieves remarkable visual quality while rendering images in real time. However, it suffers from severe degradation in rendering quality if the training images are blurry. Blurriness commonly occurs due to lens defocus, object motion, and camera shake, and it inevitably interferes with clean image acquisition. Several previous studies have attempted to render clean and sharp images from blurry input images using neural fields. The majority of those works, however, are designed only for volumetric rendering-based neural radiance fields and are not straightforwardly applicable to rasterization-based 3D Gaussian splatting methods. Thus, we propose a novel real-time deblurring framework, Deblurring 3D Gaussian Splatting, using a small Multi-Layer Perceptron (MLP) that manipulates the covariance of each 3D Gaussian to model the scene blurriness. While Deblurring 3D Gaussian Splatting still enjoys real-time rendering, it can reconstruct fine and sharp details from blurry images. A variety of experiments have been conducted on the benchmark, and the results reveal the effectiveness of our approach for deblurring. Qualitative results are available at https://benhenryl.github.io/Deblurring-3D-Gaussian-Splatting/
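
A minimal sketch of the mechanism described above, in which a small MLP predicts per-Gaussian adjustments to the covariance (here, multiplicative factors on scale plus a rotation offset) to model scene blurriness during training. The input features and output parameterization are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BlurCovarianceMLP(nn.Module):
    """Predict per-Gaussian scale/rotation adjustments from position and view
    direction (sketch of the deblurring idea, not the authors' implementation)."""
    def __init__(self, in_dim=3 + 3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4),   # 3 scale factors + 4 quaternion deltas
        )

    def forward(self, positions, view_dirs):
        delta = self.net(torch.cat([positions, view_dirs], dim=-1))
        scale_mult = 1.0 + torch.tanh(delta[:, :3]) * 0.5   # keep factors near 1
        rot_delta = delta[:, 3:]
        return scale_mult, rot_delta

# During training, the blurred image would be rendered with the adjusted
# covariances; at test time the MLP is dropped and the sharp Gaussians are used.
```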

Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera. (arXiv:2401.00847v1 [cs.CV])

Authors: Jiye Lee, Hanbyul Joo

We present a lightweight and affordable motion capture method based on two smartwatches and a head-mounted camera. In contrast to the existing approaches that use six or more expert-level IMU devices, our approach is much more cost-effective and convenient. Our method can make wearable motion capture accessible to everyone everywhere, enabling 3D full-body motion capture in diverse environments. As a key idea to overcome the extreme sparsity and ambiguities of sensor inputs, we integrate 6D head poses obtained from the head-mounted cameras for motion estimation. To enable capture in expansive indoor and outdoor scenes, we propose an algorithm to track and update floor level changes to define head poses, coupled with a multi-stage Transformer-based regression module. We also introduce novel strategies leveraging visual cues of egocentric images to further enhance the motion capture quality while reducing ambiguities. We demonstrate the performance of our method on various challenging scenarios, including complex outdoor environments and everyday motions including object interactions and social interactions among multiple individuals.

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training. (arXiv:2401.00849v1 [cs.CV])

Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introduce a contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (COSMO), which strategically partitions the language model into dedicated unimodal text processing and adept multimodal data handling components. COSMO, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters. However, these models demand extensive long-text datasets, yet the availability of high-quality long-text video datasets remains limited. To bridge this gap, this work introduces an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how this dataset enhances model performance in image-text tasks. With 34% of the learnable parameters and 72% of the available data, our model demonstrates significant superiority over OpenFlamingo. For instance, in the 4-shot Flickr captioning task, performance notably improves from 57.2% to 65%. The contributions of COSMO and the proposed dataset are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.

Refining Pre-Trained Motion Models. (arXiv:2401.00850v1 [cs.CV])

Authors: Xinglong Sun, Adam W. Harley, Leonidas J. Guibas

Given the difficulty of manually annotating motion in video, the current best motion estimation methods are trained with synthetic data, and therefore struggle somewhat due to a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms, and methods that encourage cycle-consistency in the estimates (i.e., tracking backwards should yield the opposite trajectory as tracking forwards). In this work, we take on the challenge of improving state-of-the-art supervised models with self-supervised training. We find that when the initialization is supervised weights, most existing self-supervision techniques actually make performance worse instead of better, which suggests that the benefit of seeing the new data is overshadowed by the noise in the training signal. Focusing on obtaining a ``clean'' training signal from real-world unlabelled video, we propose to separate label-making and training into two distinct stages. In the first stage, we use the pre-trained model to estimate motion in a video, and then select the subset of motion estimates which we can verify with cycle-consistency. This produces a sparse but accurate pseudo-labelling of the video. In the second stage, we fine-tune the model to reproduce these outputs, while also applying augmentations on the input. We complement this bootstrapping method with simple techniques that densify and re-balance the pseudo-labels, ensuring that we do not merely train on ``easy'' tracks. We show that our method yields reliable gains over fully-supervised methods in real videos, for both short-term (flow-based) and long-range (multi-frame) pixel tracking.
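
A simplified sketch of the first-stage pseudo-labelling described above: the pre-trained tracker is run forward and then backward, and only trajectories that return close to their starting points are kept as pseudo-labels. The `track` interface and the error threshold are assumptions, not the authors' exact code.

```python
import numpy as np

def select_cycle_consistent_tracks(track, video, start_points, max_err=2.0):
    """track(video, points, reverse=False) -> (T, N, 2) trajectories (assumed API).
    Keep only points whose forward-then-backward track returns near the start."""
    fwd = track(video, start_points)                    # (T, N, 2) forward tracks
    back = track(video, fwd[-1], reverse=True)          # track end points backwards
    err = np.linalg.norm(back[-1] - start_points, axis=-1)
    keep = err < max_err                                # cycle-consistency test
    return fwd[:, keep], start_points[keep]             # sparse but reliable labels
```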

Accelerating Large Kernel Convolutions with Nested Winograd Transformation. (arXiv:2102.13272v2 [cs.CV] UPDATED)

Authors: Jingbo Jiang, Xizi Chen, Chi-Ying Tsui

Recent literature has shown that convolutional neural networks (CNNs) with large kernels outperform vision transformers (ViTs) and CNNs with stacked small kernels in many computer vision tasks, such as object detection and image restoration. The Winograd transformation helps reduce the number of repetitive multiplications in convolution and is widely supported by many commercial AI processors. Researchers have proposed accelerating large kernel convolutions by linearly decomposing them into many small kernel convolutions and then sequentially accelerating each small kernel convolution with the Winograd algorithm. This work proposes a nested Winograd algorithm that iteratively decomposes a large kernel convolution into small kernel convolutions and proves it to be more effective than the linear decomposition Winograd transformation algorithm. Experiments show that compared to the linear decomposition Winograd algorithm, the proposed algorithm reduces the total number of multiplications by 1.4 to 10.5 times for computing 4x4 to 31x31 convolutions.

Cross-Camera Human Motion Transfer by Time Series Analysis. (arXiv:2109.14174v4 [cs.CV] UPDATED)

Authors: Yaping Zhao, Guanghan Li, Edmund Y. Lam

With advances in optical sensor technology, heterogeneous camera systems are increasingly used for high-resolution (HR) video acquisition and analysis. However, motion transfer across multiple cameras poses challenges. To address this, we propose an algorithm based on time series analysis that identifies motion seasonality and constructs an additive model to extract transferable patterns. Validated on real-world data, our algorithm demonstrates effectiveness and interpretability. Notably, it improves pose estimation in low-resolution videos by leveraging patterns derived from HR counterparts, enhancing practical utility. Code is available at: https://github.com/IndigoPurple/TSAMT
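
As a rough illustration of the additive time-series decomposition underlying the approach (trend + seasonality + residual), one could use statsmodels' seasonal_decompose on a joint-angle or keypoint series; the synthetic signal and period value below are assumptions, and this is not the authors' full pipeline.

```python
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic 1D "motion" signal: a periodic gait-like component plus drift and noise.
t = np.arange(600)
signal = np.sin(2 * np.pi * t / 30) + 0.002 * t + 0.1 * np.random.randn(t.size)

# Additive model: signal = trend + seasonal + residual.
result = seasonal_decompose(signal, model="additive", period=30)
seasonal_pattern = result.seasonal      # transferable periodic motion pattern
trend = result.trend                    # camera-specific slow component
```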

3D Object Detection from Images for Autonomous Driving: A Survey. (arXiv:2202.02980v6 [cs.CV] UPDATED)

Authors: Xinzhu Ma, Wanli Ouyang, Andrea Simonelli, Elisa Ricci

3D object detection from images, one of the fundamental and challenging problems in autonomous driving, has received increasing attention from both industry and academia in recent years. Benefiting from the rapid development of deep learning technologies, image-based 3D detection has achieved remarkable progress. Particularly, more than 200 works have studied this problem from 2015 to 2021, encompassing a broad spectrum of theories, algorithms, and applications. However, to date no recent survey exists to collect and organize this knowledge. In this paper, we fill this gap in the literature and provide the first comprehensive survey of this novel and continuously growing research field, summarizing the most commonly used pipelines for image-based 3D detection and deeply analyzing each of their components. Additionally, we also propose two new taxonomies to organize the state-of-the-art methods into different categories, with the intent of providing a more systematic review of existing methods and facilitating fair comparisons with future works. In retrospect of what has been achieved so far, we also analyze the current challenges in the field and discuss future directions for image-based 3D detection research.

VariabilityTrack:Multi-Object Tracking with Variable Speed Object Movement. (arXiv:2203.06424v3 [cs.CV] UPDATED)

Authors: Run Luo, JinLin Wei, Qiao Lin

Multi-object tracking (MOT) aims at estimating the bounding boxes and identities of objects in videos. Most methods can be roughly classified into tracking-by-detection and joint-detection-association paradigms. Although the latter has elicited more attention and demonstrates performance comparable to the former, we claim that the tracking-by-detection paradigm is still the optimal solution in terms of tracking accuracy; for example, ByteTrack achieves 80.3 MOTA, 77.3 IDF1, and 63.1 HOTA on the MOT17 test set at 30 FPS on a single V100 GPU. However, under complex perspectives such as vehicle and UAV acceleration, the performance of such a tracker using a uniform Kalman filter is greatly affected, resulting in tracking loss. In this paper, we propose a variable-speed Kalman filter algorithm based on environmental feedback and improve the matching process, which can greatly improve the tracking effect in complex variable-speed scenes while maintaining high tracking accuracy in relatively static scenes. Eventually, higher MOTA and IDF1 results than ByteTrack can be achieved on the MOT17 test set.
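
A toy 1D sketch of the variable-speed idea: a constant-velocity Kalman filter whose process noise is inflated when the recent innovation (a proxy for environmental feedback such as acceleration) is large, so the tracker adapts in variable-speed scenes. All constants and the specific adaptation rule are illustrative, not the paper's algorithm.

```python
import numpy as np

class AdaptiveCVKalman1D:
    """Constant-velocity Kalman filter with innovation-driven process noise (sketch)."""
    def __init__(self, dt=1.0, base_q=1e-2, r=1.0):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
        self.H = np.array([[1.0, 0.0]])              # we observe position only
        self.x = np.zeros(2)                         # [position, velocity]
        self.P = np.eye(2)
        self.base_q, self.r = base_q, r

    def step(self, z):
        # Predict.
        x_pred = self.F @ self.x
        innovation = z - (self.H @ x_pred)[0]
        # "Environmental feedback": large innovations suggest acceleration,
        # so inflate the process noise instead of assuming uniform motion.
        q = self.base_q * (1.0 + min(abs(innovation), 10.0))
        P_pred = self.F @ self.P @ self.F.T + q * np.eye(2)
        # Update.
        S = self.H @ P_pred @ self.H.T + self.r
        K = P_pred @ self.H.T / S
        self.x = x_pred + K.flatten() * innovation
        self.P = (np.eye(2) - K @ self.H) @ P_pred
        return self.x[0]                              # filtered position
```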

SuperAnimal pretrained pose estimation models for behavioral analysis. (arXiv:2203.07436v4 [cs.CV] UPDATED)

Authors: Shaokai Ye, Anastasiia Filippova, Jessy Lauer, Steffen Schneider, Maxime Vidal, Tian Qiu, Alexander Mathis, Mackenzie Weygandt Mathis

Quantification of behavior is critical in applications ranging from neuroscience to veterinary medicine and animal conservation. A common key step for behavioral analysis is first extracting relevant keypoints on animals, known as pose estimation. However, reliable inference of poses currently requires domain knowledge and manual labeling effort to build supervised models. We present a series of technical innovations that enable a new method, collectively called SuperAnimal, to develop unified foundation models that can be used on over 45 species, without additional human labels. Concretely, we introduce a method to unify the keypoint space across differently labeled datasets (via our generalized data converter) and to train on these diverse datasets in a manner such that they do not catastrophically forget keypoints given the unbalanced inputs (via our keypoint gradient masking and memory replay approaches). These models show excellent performance across six pose benchmarks. Then, to ensure maximal usability for end-users, we demonstrate how to fine-tune the models on differently labeled data and provide tooling for unsupervised video adaptation to boost performance and decrease jitter across frames. If the models are fine-tuned, we show SuperAnimal models are 10-100x more data efficient than prior transfer-learning-based approaches. We illustrate the utility of our models in behavioral classification in mice and gait analysis in horses. Collectively, this presents a data-efficient solution for animal pose estimation.

Motion-aware Memory Network for Fast Video Salient Object Detection. (arXiv:2208.00946v2 [cs.CV] UPDATED)

Authors: Xing Zhao, Haoran Liang, Peipei Li, Guodao Sun, Dongdong Zhao, Ronghua Liang, Xiaofei He

Previous methods based on 3D CNNs, convLSTM, or optical flow have achieved great success in video salient object detection (VSOD). However, they still suffer from high computational costs or poor quality of the generated saliency maps. To solve these problems, we design a space-time memory (STM)-based network, which extracts useful temporal information about the current frame from adjacent frames as the temporal branch of VSOD. Furthermore, previous methods only considered single-frame prediction without temporal association, so the model may not focus sufficiently on temporal information. Thus, we introduce inter-frame object motion prediction into VSOD. Our model follows a standard encoder-decoder architecture. In the encoding stage, we generate high-level temporal features by using high-level features from the current and adjacent frames, which is more efficient than optical flow-based methods. In the decoding stage, we propose an effective fusion strategy for the spatial and temporal branches. The semantic information of the high-level features is used to fuse the object details in the low-level features, and the spatiotemporal features are then obtained step by step to reconstruct the saliency maps. Moreover, inspired by the boundary supervision commonly used in image salient object detection (ISOD), we design a motion-aware loss for predicting object boundary motion and simultaneously perform multitask learning for VSOD and object motion prediction, which further helps the model extract spatiotemporal features accurately and maintain object integrity. Extensive experiments on several datasets demonstrate the effectiveness of our method, which achieves state-of-the-art metrics on some of them. The proposed model does not require optical flow or other preprocessing and can reach a speed of nearly 100 FPS during inference.

Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v7 [eess.IV] UPDATED)

Authors: Ruihan Yang, Stephan Mandt

This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining ``texture'' variables characterizing the diffusion process are synthesized at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving multiple datasets and image quality assessment metrics show that our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics. Furthermore, training the diffusion with $\mathcal{X}$-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly affecting the model's practicality. Our code is available at: \url{https://github.com/buggyyang/CDC_compression}

Residual Back Projection With Untrained Neural Networks. (arXiv:2210.14416v2 [eess.IV] UPDATED)

Authors: Ziyu Shu, Alireza Entezari

Background and Objective: The success of neural networks in a number of image processing tasks has motivated their application in image reconstruction problems in computed tomography (CT). While progress has been made in this area, the lack of stability and theoretical guarantees for accuracy, together with the scarcity of high-quality training data for specific imaging domains pose challenges for many CT applications. In this paper, we present a framework for iterative reconstruction (IR) in CT that leverages the hierarchical structure of neural networks, without the need for training. Our framework incorporates this structural information as a deep image prior (DIP), and uses a novel residual back projection (RBP) connection that forms the basis for our iterations.

Methods: We propose using an untrained U-net in conjunction with a novel residual back projection to minimize an objective function and achieve high-accuracy reconstruction. In each iteration, the weights of the untrained U-net are optimized, and the output of the U-net in the current iteration is used to update the input of the U-net in the next iteration through the aforementioned RBP connection.

Results: Experimental results demonstrate that the RBP-DIP framework offers improvements over other state-of-the-art conventional IR methods, as well as pre-trained and untrained models with similar network structures under multiple conditions. These improvements are particularly significant in the few-view, limited-angle, and low-dose imaging configurations.

Conclusions: Applied to both parallel and fan beam X-ray imaging, our framework shows significant improvements under multiple conditions. Furthermore, the proposed framework requires no training data and can be adjusted on demand to adapt to different conditions (e.g., noise level, geometry, and imaged object).
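
A high-level sketch of the iterative scheme described in the Methods paragraph above: in each outer iteration the untrained U-Net's weights are optimized against measurement fidelity, and a residual back projection of the data mismatch updates the network input for the next iteration. The forward projector `A`, its adjoint (backprojector) `A_T`, the `UNet` class, and all step sizes are assumed placeholders, not the authors' released code.

```python
import torch

def rbp_dip_reconstruct(y, A, A_T, UNet, n_outer=20, n_inner=100, step=1.0, lr=1e-3):
    """y: measured sinogram (tensor). A / A_T: forward projector and backprojector
    (assumed callables). The untrained U-Net is re-optimized in every iteration;
    a residual back projection of the data mismatch updates the network input."""
    net = UNet()
    x_in = A_T(y)                                       # initial input: backprojection
    for _ in range(n_outer):
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(n_inner):                        # fit the deep image prior
            opt.zero_grad()
            x_out = net(x_in)
            loss = torch.mean((A(x_out) - y) ** 2)      # measurement fidelity
            loss.backward()
            opt.step()
        with torch.no_grad():                           # residual back projection
            x_out = net(x_in)
            x_in = x_out + step * A_T(y - A(x_out))     # update next iteration's input
    return x_out.detach()
```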

Rethinking the Paradigm of Content Constraints in GAN-based Unpaired Image-to-Image Translation. (arXiv:2211.10867v2 [cs.CV] UPDATED)

Authors: Xiuding Cai, Yaoyao Zhu, Dong Miao, Linjie Fu, Yu Yao

In an unpaired setting, lacking sufficient content constraints for image-to-image translation (I2I) tasks, GAN-based approaches are usually prone to model collapse. Current solutions can be divided into two categories: reconstruction-based and Siamese network-based. The former requires that the transformed or transforming image can be perfectly converted back to the original image, which is sometimes too strict and limits the generative performance. The latter involves feeding the original and generated images into a feature extractor and then matching their outputs. This is not efficient enough, and a universal feature extractor is not easily available. In this paper, we propose EnCo, a simple but efficient way to maintain the content by constraining the representational similarity in the latent space of patch-level features from the same stage of the Encoder and deCoder of the generator (hence the name). For the similarity function, we use a simple MSE loss instead of the contrastive loss that is currently widely used in I2I tasks. Benefiting from this design, EnCo training is extremely efficient, while the features from the encoder produce a more positive effect on the decoding, leading to more satisfying generations. In addition, we rethink the role played by discriminators in sampling patches and propose a discriminative attention-guided (DAG) patch sampling strategy to replace random sampling. DAG is parameter-free and requires only negligible computational overhead, while significantly improving the performance of the model. Extensive experiments on multiple datasets demonstrate the effectiveness and advantages of EnCo, and we achieve state-of-the-art results on multiple benchmarks compared with previous methods. Our code is available at https://github.com/XiudingCai/EnCo-pytorch.
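
A minimal sketch of the content constraint described above: patch-level features taken from corresponding stages of the generator's encoder and decoder are matched with a simple MSE loss. The random patch sampling shown here (the paper uses discriminative attention-guided sampling) and the feature hooks are assumptions.

```python
import torch
import torch.nn.functional as F

def enco_content_loss(enc_feats, dec_feats, n_patches=256):
    """enc_feats / dec_feats: lists of (B, C, H, W) feature maps from matching
    encoder/decoder stages. Constrain representational similarity per patch."""
    loss = 0.0
    for fe, fd in zip(enc_feats, dec_feats):
        B, C, H, W = fe.shape
        fe = fe.flatten(2).transpose(1, 2)              # (B, H*W, C)
        fd = fd.flatten(2).transpose(1, 2)
        idx = torch.randint(0, H * W, (n_patches,), device=fe.device)
        # Simple MSE between sampled patch features (no contrastive loss needed).
        loss = loss + F.mse_loss(fe[:, idx], fd[:, idx])
    return loss / len(enc_feats)
```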

Low-Light Image and Video Enhancement: A Comprehensive Survey and Beyond. (arXiv:2212.10772v5 [cs.CV] UPDATED)

Authors: Shen Zheng, Yiling Ma, Jinqian Pan, Changjie Lu, Gaurav Gupta

This paper presents a comprehensive survey of low-light image and video enhancement (LLIE), addressing two primary challenges in the field. The first challenge is the prevalence of mixed over-/under-exposed images, which are not adequately addressed by existing methods. In response, this work introduces two enhanced variants of the SICE dataset, SICE_Grad and SICE_Mix, designed to better represent these complexities. The second challenge is the scarcity of suitable low-light video datasets for training and testing. To address this, the paper introduces the Night Wenzhou dataset, a large-scale, high-resolution video collection that features challenging fast-moving aerial scenes and streetscapes with varied illumination and degradation. This study also conducts an extensive analysis of key techniques and performs comparative experiments using the proposed and current benchmark datasets. The survey concludes by highlighting emerging applications, discussing unresolved challenges, and suggesting future research directions within the LLIE community. The datasets are available at https://github.com/ShenZheng2000/LLIE_Survey.

PCR-CG: Point Cloud Registration via Deep Explicit Color and Geometry. (arXiv:2302.14418v2 [cs.CV] UPDATED)

Authors: Yu Zhang, Junle Yu, Xiaolin Huang, Wenhui Zhou, Ji Hou

In this paper, we introduce PCR-CG, a novel 3D point cloud registration module that explicitly embeds color signals into the geometry representation. Different from previous methods that only use a geometry representation, our module is specifically designed to effectively correlate color with geometry for the point cloud registration task. Our key contribution is a 2D-3D cross-modality learning algorithm that embeds deep features learned from color signals into the geometry representation. With our designed 2D-3D projection module, the pixel features in a square region centered at correspondences perceived from images are effectively correlated with point clouds. In this way, overlapped regions can be inferred not only from the point cloud but also from texture appearances. Adding color is non-trivial: we compare against a variety of baselines designed for adding color to 3D, such as exhaustively adding per-pixel features or RGB values in an implicit manner. We leverage Predator [25] as the baseline method and incorporate our proposed module into it. To validate the effectiveness of 2D features, we ablate different 2D pre-trained networks and show a positive correlation between the pre-trained weights and the task performance. Our experimental results indicate a significant improvement of 6.5% registration recall over the baseline method on the 3DLoMatch benchmark. We additionally evaluate our approach on SOTA methods and observe consistent improvements, such as an improvement of 2.4% registration recall over GeoTransformer and 3.5% over CoFiNet. Our study reveals a significant advantage of correlating explicit deep color features with the point cloud in the registration task.

Aleth-NeRF: Low-light Condition View Synthesis with Concealing Fields. (arXiv:2303.05807v2 [cs.CV] UPDATED)

Authors: Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada

Low-light scenes, as commonly captured, are challenging for most computer vision techniques, including Neural Radiance Fields (NeRF). Vanilla NeRF is viewer-centred: it simplifies the rendering process to light emission from 3D locations in the viewing direction, and thus fails to model low-illumination-induced darkness. Inspired by the emission theory of the ancient Greeks, in which visual perception is accomplished by rays cast from the eyes, we make slight modifications to vanilla NeRF so that, by training on multiple views of a low-light scene, we can render the well-lit scene in an unsupervised manner. We introduce a surrogate concept, Concealing Fields, which reduces the transport of light during the volume rendering stage. Specifically, our proposed method, Aleth-NeRF, learns directly from dark images to understand the volumetric object representation and the concealing field under priors. By simply eliminating the Concealing Fields, we can render single or multi-view well-lit images and gain superior performance over other 2D low-light enhancement methods. Additionally, we collect the first paired LOw-light and normal-light Multi-view (LOM) dataset for future research. This version is invalid; please refer to our new AAAI version: arXiv:2312.09093
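
One way to read the Concealing Fields idea in code: a per-sample concealing factor further attenuates transmittance (reducing light transport) when rendering the dark training views, and is simply dropped at test time to render the well-lit scene. This is a schematic of standard volume rendering with an extra attenuation term, not the authors' implementation.

```python
import torch

def render_ray(sigmas, colors, deltas, conceal=None):
    """sigmas, deltas: (N,); colors: (N, 3); conceal: (N,) in (0, 1] or None.
    Standard volume rendering with an optional concealing factor that further
    attenuates transmittance (schematic of the Aleth-NeRF idea)."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones(1, device=sigmas.device), 1.0 - alphas + 1e-10])[:-1],
        dim=0,
    )
    if conceal is not None:                       # training on dark views
        trans = trans * torch.cumprod(conceal, dim=0)
    weights = trans * alphas
    return (weights.unsqueeze(-1) * colors).sum(dim=0)   # rendered RGB

# At test time, call render_ray(..., conceal=None) to obtain the well-lit image.
```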

Diffusion-HPC: Synthetic Data Generation for Human Mesh Recovery in Challenging Domains. (arXiv:2303.09541v2 [cs.CV] UPDATED)

Authors: Zhenzhen Weng, Laura Bravo-Sánchez, Serena Yeung-Levy

Recent text-to-image generative models have exhibited remarkable abilities in generating high-fidelity and photo-realistic images. However, despite the visually impressive results, these models often struggle to preserve plausible human structure in the generations. Due to this reason, while generative models have shown promising results in aiding downstream image recognition tasks by generating large volumes of synthetic data, they are not suitable for improving downstream human pose perception and understanding. In this work, we propose a Diffusion model with Human Pose Correction (Diffusion-HPC), a text-conditioned method that generates photo-realistic images with plausible posed humans by injecting prior knowledge about human body structure. Our generated images are accompanied by 3D meshes that serve as ground truths for improving Human Mesh Recovery tasks, where a shortage of 3D training data has long been an issue. Furthermore, we show that Diffusion-HPC effectively improves the realism of human generations under varying conditioning strategies.

Precise localization of corneal reflections in eye images using deep learning trained on synthetic data. (arXiv:2304.05673v3 [cs.CV] UPDATED)

Authors: Sean Anthony Byrne, Marcus Nyström, Virmarie Maquiling, Enkelejda Kasneci, Diederick C. Niehorster

We present a deep learning method for accurately localizing the center of a single corneal reflection (CR) in an eye image. Unlike previous approaches, we use a convolutional neural network (CNN) that was trained solely on simulated data. Using only simulated data has the benefit of completely sidestepping the time-consuming process of manual annotation that is required for supervised training on real eye images. To systematically evaluate the accuracy of our method, we first tested it on images with simulated CRs placed on different backgrounds and embedded in varying levels of noise. Second, we tested the method on high-quality videos captured from real eyes. Our method outperformed state-of-the-art algorithmic methods on real eye images with a 35% reduction in terms of spatial precision, and performed on par with the state of the art on simulated images in terms of spatial accuracy. We conclude that our method provides precise CR center localization and offers a solution to the data availability problem, which is one of the important common roadblocks in the development of deep learning models for gaze estimation. Due to its superior CR center localization and ease of application, our method has the potential to improve the accuracy and precision of CR-based eye trackers.

An Instance Segmentation Dataset of Yeast Cells in Microstructures. (arXiv:2304.07597v4 [cs.CV] UPDATED)

Authors: Christoph Reich, Tim Prangemeier, André O. Françani, Heinz Koeppl

Extracting single-cell information from microscopy data requires accurate instance-wise segmentations. Obtaining pixel-wise segmentations from microscopy imagery remains a challenging task, especially with the added complexity of microstructured environments. This paper presents a novel dataset for segmenting yeast cells in microstructures. We offer pixel-wise instance segmentation labels for both cells and trap microstructures. In total, we release 493 densely annotated microscopy images. To facilitate a unified comparison between novel segmentation algorithms, we propose a standardized evaluation strategy for our dataset. The aim of the dataset and evaluation strategy is to facilitate the development of new cell segmentation approaches. The dataset is publicly available at https://christophreich1996.github.io/yeast_in_microstructures_dataset/.

UDTIRI: An Online Open-Source Intelligent Road Inspection Benchmark Suite. (arXiv:2304.08842v3 [cs.CV] UPDATED)

Authors: Sicen Guo, Jiahang Li, Yi Feng, Dacheng Zhou, Denghuang Zhang, Chen Chen, Shuai Su, Xingyi Zhu, Qijun Chen, Rui Fan

In the nascent domain of urban digital twins (UDT), the prospects for leveraging cutting-edge deep learning techniques are vast and compelling. Particularly within the specialized area of intelligent road inspection (IRI), a noticeable gap exists, underscored by the current dearth of dedicated research efforts and the lack of large-scale well-annotated datasets. To foster advancements in this burgeoning field, we have launched an online open-source benchmark suite, referred to as UDTIRI. Along with this article, we introduce the road pothole detection task, the first online competition published within this benchmark suite. This task provides a well-annotated dataset, comprising 1,000 RGB images and their pixel/instance-level ground-truth annotations, captured in diverse real-world scenarios under different illumination and weather conditions. Our benchmark provides a systematic and thorough evaluation of state-of-the-art object detection, semantic segmentation, and instance segmentation networks, developed based on either convolutional neural networks or Transformers. We anticipate that our benchmark will serve as a catalyst for the integration of advanced UDT techniques into IRI. By providing algorithms with a more comprehensive understanding of diverse road conditions, we seek to unlock their untapped potential and foster innovation in this critical domain.

Energy-Based Sliced Wasserstein Distance. (arXiv:2304.13586v3 [stat.ML] UPDATED)

Authors: Khai Nguyen, Nhat Ho

The sliced Wasserstein (SW) distance has been widely recognized as a statistically effective and computationally efficient metric between two probability measures. A key component of the SW distance is the slicing distribution. There are two existing approaches for choosing this distribution. The first approach is using a fixed prior distribution. The second approach is optimizing for the best distribution which belongs to a parametric family of distributions and can maximize the expected distance. However, both approaches have their limitations. A fixed prior distribution is non-informative in terms of highlighting projecting directions that can discriminate two general probability measures. Doing optimization for the best distribution is often expensive and unstable. Moreover, designing the parametric family of the candidate distribution could be easily misspecified. To address the issues, we propose to design the slicing distribution as an energy-based distribution that is parameter-free and has the density proportional to an energy function of the projected one-dimensional Wasserstein distance. We then derive a novel sliced Wasserstein metric, the energy-based sliced Wasserstein (EBSW) distance, and investigate its topological, statistical, and computational properties via importance sampling, sampling importance resampling, and Markov Chain methods. Finally, we conduct experiments on point-cloud gradient flow, color transfer, and point-cloud reconstruction to show the favorable performance of the EBSW.
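
A rough numerical sketch of the importance-sampling view described above: directions are drawn uniformly from the sphere, one-dimensional projected Wasserstein distances are computed, and an energy function of those distances acts as (self-normalized) importance weights. The exponential energy and p=2 are illustrative choices under the assumption of equal-size, uniformly weighted point clouds; see the paper for the exact estimators.

```python
import numpy as np

def ebsw_importance_sampling(X, Y, n_proj=256, p=2, rng=None):
    """X, Y: (n, d) point clouds with uniform weights. Energy-based sliced
    Wasserstein estimate via self-normalized importance sampling with a uniform
    proposal over directions (sketch, not the paper's code)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    thetas = rng.normal(size=(n_proj, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)   # uniform on sphere

    # For 1D empirical measures of equal size, W_p^p = mean(|sorted diff|^p).
    projX = np.sort(X @ thetas.T, axis=0)                     # (n, n_proj)
    projY = np.sort(Y @ thetas.T, axis=0)
    wp = np.mean(np.abs(projX - projY) ** p, axis=0)          # (n_proj,)

    weights = np.exp(wp - wp.max())              # energy f(w) = exp(w), stabilized
    weights /= weights.sum()                     # self-normalized IS weights
    return (weights * wp).sum() ** (1.0 / p)
```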

Image2SSM: Reimagining Statistical Shape Models from Images with Radial Basis Functions. (arXiv:2305.11946v2 [cs.CV] UPDATED)

Authors: Hong Xu, Shireen Y. Elhabian

Statistical shape modeling (SSM) is an essential tool for analyzing variations in anatomical morphology. In a typical SSM pipeline, 3D anatomical images, after segmentation and rigid registration, are represented using lower-dimensional shape features, on which statistical analysis can be performed. Various methods for constructing compact shape representations have been proposed, but they involve laborious and costly steps. We propose Image2SSM, a novel deep-learning-based approach for SSM that leverages image-segmentation pairs to learn a radial-basis-function (RBF)-based representation of shapes directly from images. This RBF-based shape representation offers a rich self-supervised signal for the network to estimate a continuous, yet compact, representation of the underlying surface that can adapt to complex geometries in a data-driven manner. Image2SSM can characterize populations of biological structures of interest by constructing statistical landmark-based shape models of ensembles of anatomical shapes while requiring minimal parameter tuning and no user assistance. Once trained, Image2SSM can be used to infer low-dimensional shape representations from new unsegmented images, paving the way toward scalable approaches for SSM, especially when dealing with large cohorts. Experiments on synthetic and real datasets show the efficacy of the proposed method compared to the state-of-the-art correspondence-based method for SSM.

NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution. (arXiv:2305.14669v3 [cs.CV] UPDATED)

Authors: Yexing Song, Meilin Wang, Zhijing Yang, Xiaoyu Xian, Yukai Shi

The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying a VSR model to real-world video with unknown and complex degradation remains a challenging task. First, the degradation models in most existing VSR methods cannot effectively simulate real-world noise and blur. Instead, simple combinations of classical degradations are used for real-world noise modeling, which often leaves the VSR model vulnerable to out-of-distribution noise. Second, many SR models focus on noise simulation and transfer; nevertheless, the sampled noise is monotonous and limited. To address the aforementioned problems, we propose a negatives augmentation strategy for generalized noise modeling in the video super-resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degradation domain is widely expanded by negative augmentation to build up varied yet challenging real-world noise sets. We further propose an augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/.

CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion. (arXiv:2305.16283v5 [cs.CV] UPDATED)

Authors: Guangyao Zhai, Evin Pınar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, Benjamin Busam

Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships in the scene graph while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Because no existing scene graph dataset offers high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT, where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Codes and the dataset will be released upon acceptance.

Fourier Analysis on Robustness of Graph Convolutional Neural Networks for Skeleton-based Action Recognition. (arXiv:2305.17939v2 [cs.CV] UPDATED)

Authors: Nariki Tanaka, Hiroshi Kera, Kazuhiko Kawamoto

Using Fourier analysis, we explore the robustness and vulnerability of graph convolutional neural networks (GCNs) for skeleton-based action recognition. We adopt a joint Fourier transform (JFT), a combination of the graph Fourier transform (GFT) and the discrete Fourier transform (DFT), to examine the robustness of adversarially-trained GCNs against adversarial attacks and common corruptions. Experimental results with the NTU RGB+D dataset reveal that adversarial training does not introduce a robustness trade-off between adversarial attacks and low-frequency perturbations, which typically occurs during image classification based on convolutional neural networks. This finding indicates that adversarial training is a practical approach to enhancing robustness against adversarial attacks and common corruptions in skeleton-based action recognition. Furthermore, we find that the Fourier approach cannot explain vulnerability against skeletal part occlusion corruption, which highlights its limitations. These findings extend our understanding of the robustness of GCNs, potentially guiding the development of more robust learning methods for skeleton-based action recognition.

BRICS: Bi-level feature Representation of Image CollectionS. (arXiv:2305.18601v3 [cs.CV] UPDATED)

Authors: Dingdong Yang, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang

We present BRICS, a bi-level feature representation for image collections, which consists of a key code space on top of a feature grid space. Specifically, our representation is learned by an autoencoder to encode images into continuous key codes, which are used to retrieve features from groups of multi-resolution feature grids. Our key codes and feature grids are jointly trained continuously with well-defined gradient flows, leading to high usage rates of the feature grids and improved generative modeling compared to discrete Vector Quantization (VQ). Unlike existing continuous representations such as KL-regularized latent codes, our key codes are strictly bounded in scale and variance. Overall, feature encoding by BRICS is compact, efficient to train, and enables generative modeling over key codes using the diffusion model. Experimental results show that our method achieves comparable reconstruction results to VQ while having a smaller and more efficient decoder network (50% fewer GFlops). By applying the diffusion model over our key code space, we achieve state-of-the-art image synthesis performance on the FFHQ and LSUN-Church datasets (CLIP-FID 29% lower than LDM, 32% lower than StyleGAN2, and 44% lower than Projected GAN).

Addressing Negative Transfer in Diffusion Models. (arXiv:2306.00354v3 [cs.CV] UPDATED)

Authors: Hyojun Go, JinYoung Kim, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, Seungtaek Choi

Diffusion-based generative models have achieved remarkable success in various domains. Diffusion training optimizes a shared model on denoising tasks that encompass different noise levels simultaneously, representing a form of multi-task learning (MTL). However, analyzing and improving diffusion models from an MTL perspective remains under-explored. In particular, MTL can sometimes lead to the well-known phenomenon of negative transfer, which results in the performance degradation of certain tasks due to conflicts between tasks. In this paper, we first aim to analyze diffusion training from an MTL standpoint, presenting two key observations: (O1) the task affinity between denoising tasks diminishes as the gap between noise levels widens, and (O2) negative transfer can arise even in diffusion training. Building upon these observations, we aim to enhance diffusion training by mitigating negative transfer. To achieve this, we propose leveraging existing MTL methods, but the huge number of denoising tasks makes it computationally expensive to calculate the necessary per-task loss or gradient. To address this challenge, we propose clustering the denoising tasks into small task clusters and applying MTL methods to them. Specifically, based on (O2), we employ interval clustering to enforce temporal proximity among denoising tasks within clusters. We show that interval clustering can be solved using dynamic programming, utilizing signal-to-noise ratio, timestep, and task affinity for clustering objectives. Through this, our approach addresses the issue of negative transfer in diffusion models by allowing for efficient computation of MTL methods. We validate the efficacy of the proposed clustering and its integration with MTL methods through various experiments, demonstrating 1) improved generation quality and 2) faster training convergence of diffusion models.
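
To make the clustering step concrete, the following is a minimal, hypothetical sketch of 1-D interval clustering via dynamic programming, grouping timesteps into contiguous intervals that minimize the within-interval variance of a placeholder log-SNR schedule; the paper's actual clustering objectives (signal-to-noise ratio, timestep, or task affinity) and implementation details may differ.

import numpy as np

def interval_cluster(values, k):
    # Partition a sorted 1-D array into k contiguous intervals minimizing the total
    # within-interval sum of squared deviations, via dynamic programming.
    values = np.asarray(values, dtype=float)
    n = len(values)
    prefix = np.concatenate([[0.0], np.cumsum(values)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(values ** 2)])

    def cost(i, j):  # sum of squared deviations of values[i:j]
        s, s2, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
        return s2 - s * s / m

    dp = np.full((k + 1, n + 1), np.inf)
    cut = np.zeros((k + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                val = dp[c - 1, i] + cost(i, j)
                if val < dp[c, j]:
                    dp[c, j], cut[c, j] = val, i
    bounds, j = [], n
    for c in range(k, 0, -1):                 # backtrack interval boundaries
        i = cut[c, j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]

log_snr = np.linspace(8.0, -8.0, 100)          # placeholder log-SNR schedule, 100 timesteps
print(interval_cluster(np.sort(log_snr), 5))   # 5 contiguous timestep intervals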

Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions. (arXiv:2306.07520v4 [cs.CV] UPDATED)

Authors: Weizhen He, Yiheng Deng, Shixiang Tang, Qihao Chen, Qingsong Xie, Yizhou Wang, Lei Bai, Feng Zhu, Rui Zhao, Wanli Ouyang, Donglian Qi, Yunfeng Yan

Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits the applications in the real world. This paper strives to resolve this problem by proposing a new instruct-ReID task that requires the model to retrieve images according to the given image or language instructions. Our instruct-ReID is a more general ReID setting, where 6 existing ReID tasks can be viewed as special cases by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the proposed multi-purpose ReID model, trained on our OmniReID benchmark without fine-tuning, can improve performance by +0.5%, +0.6%, and +7.7% mAP on Market1501, MSMT17, and CUHK03 for traditional ReID; by +6.4%, +7.1%, and +11.2% mAP on PRCC, VC-Clothes, and LTCC for clothes-changing ReID; by +11.7% mAP on COCAS+ real2 for clothes-template-based clothes-changing ReID when using only RGB images; by +24.9% mAP on COCAS+ real2 for our newly defined language-instructed ReID; by +4.3% on LLCM for visible-infrared ReID; and by +2.6% on CUHK-PEDES for text-to-image ReID. The datasets, the model, and code will be available at https://github.com/hwz-zju/Instruct-ReID.

Lightweight texture transfer based on texture feature preset. (arXiv:2306.16846v3 [cs.CV] UPDATED)

Authors: ShiQi Jiang

In the task of texture transfer, reference texture images typically exhibit highly repetitive texture features, and the texture transfer results from different content images under the same style also share remarkably similar texture patterns. Encoding such highly similar texture features often requires deep layers and a large number of channels, making this encoding the main source of the entire model's parameter count, computational load, and inference time. We propose a lightweight texture transfer method based on texture feature presets (TFP). TFP takes full advantage of the high repetitiveness of texture features by providing preset universal texture feature maps for a given style. These preset feature maps can be fused and decoded directly with shallow color transfer feature maps of any content to generate texture transfer results, thereby avoiding repeatedly encoding redundant texture information. The preset texture feature maps are encoded from noise input images with a consistent distribution (the standard normal distribution). This consistent input distribution can completely avoid the problem of texture transfer differentiation, and by randomly sampling different noise inputs, we can obtain different texture features and texture transfer results under the same reference style. Compared to state-of-the-art techniques, our TFP not only produces visually superior results but also reduces the model size by 3.2-3538 times and speeds up the process by 1.8-5.6 times.

Spatially Varying Nanophotonic Neural Networks. (arXiv:2308.03407v3 [cs.CV] UPDATED)

Authors: Kaixuan Wei, Xiao Li, Johannes Froech, Praneeth Chakravarthula, James Whitehead, Ethan Tseng, Arka Majumdar, Felix Heide

The explosive growth of computation and energy cost of artificial intelligence has spurred strong interest in new computing modalities as potential alternatives to conventional electronic processors. Photonic processors, which execute operations using photons instead of electrons, promise to enable optical neural networks with ultra-low latency and power consumption. However, existing optical neural networks, limited by the underlying network designs, have achieved image recognition accuracy far below that of state-of-the-art electronic neural networks. In this work, we close this gap by embedding massively parallelized optical computation into flat camera optics that perform neural network computation during the capture, before recording an image on the sensor. Specifically, we harness large kernels and propose a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques. We experimentally instantiate the network with a flat meta-optical system that encompasses an array of nanophotonic structures designed to induce angle-dependent responses. Combined with an extremely lightweight electronic backend of approximately 2K parameters, we demonstrate a reconfigurable nanophotonic neural network that reaches 72.76\% blind-test classification accuracy on the CIFAR-10 dataset; for the first time, an optical neural network outperforms the first modern digital neural network, AlexNet (72.64\% with 57M parameters), bringing optical neural networks into the modern deep learning era.

Taming Self-Training for Open-Vocabulary Object Detection. (arXiv:2308.06412v2 [cs.CV] UPDATED)

Authors: Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B.G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas

Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits standard detection into an open branch and a closed branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate that SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively.

Contrastive Bi-Projector for Unsupervised Domain Adaption. (arXiv:2308.07017v2 [cs.CV] UPDATED)

Authors: Lin-Chieh Huang, Hung-Hsu Tsai

This paper proposes a novel unsupervised domain adaption (UDA) method based on a contrastive bi-projector (CBP), which can improve existing UDA methods. The method, called CBPUDA here, effectively encourages the feature extractors (FEs) to reduce the generation of ambiguous features for classification and domain adaption. The CBP differs from traditional bi-classifier-based methods in that the two classifiers are replaced with two projectors that map the input feature to two distinct features. These two projectors and the FEs in the CBPUDA can be trained adversarially to obtain more refined decision boundaries, yielding strong classification performance. Two properties of the proposed loss function are analyzed. The first is an upper bound on the joint prediction entropy, which is used to form the proposed contrastive discrepancy (CD) loss. The CD loss combines the advantages of contrastive learning and the bi-classifier approach. The second is an analysis of the gradient of the CD loss, which is used to overcome its drawback. This result is utilized in the development of the gradient scaling (GS) scheme in this paper. The GS scheme can be exploited to tackle the instability of the CD loss, since training the CBPUDA requires using contrastive learning and adversarial learning at the same time. Therefore, using the CD loss with the GS scheme overcomes the problem above, making features more compact within classes and more distinguishable between classes. Experimental results show that the CBPUDA is superior to the conventional UDA methods considered in this paper on UDA and fine-grained UDA tasks.

Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation. (arXiv:2308.07931v2 [cs.CV] UPDATED)

Authors: William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.

Dual-Stream Diffusion Net for Text-to-Video Generation. (arXiv:2308.08316v3 [cs.CV] UPDATED)

Authors: Binhui Liu, Xin Liu, Anbo Dai, Zhiyong Zeng, Dan Wang, Zhen Cui, Jian Yang

With the emergence of diffusion models, text-to-video generation has recently attracted increasing attention. An important bottleneck, however, is that generated videos often carry flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generated videos. In particular, the two designed diffusion streams, a video content branch and a motion branch, not only run separately in their private spaces to produce personalized video variations and content, but are also aligned between the content and motion domains through our designed cross-transformer interaction module, which benefits the smoothness of the generated videos. Besides, we also introduce a motion decomposer and combiner to facilitate the operation on video motion. Qualitative and quantitative experiments demonstrate that our method produces smooth, continuous videos with fewer flickers.

Edit Temporal-Consistent Videos with Image Diffusion Model. (arXiv:2308.09091v2 [cs.CV] UPDATED)

Authors: Yuanzhi Wang, Yong Li, Xiaoya Zhang, Xin Liu, Anbo Dai, Antoni B. Chan, Zhen Cui

Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this paper, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained T2I 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive spatial-temporal modeling unit is formulated. This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, thereby enhancing the temporal consistency of the generated videos while preserving the capacity for video content manipulation. Quantitative experimental results and visualization results demonstrate that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field.

Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks. (arXiv:2308.10273v3 [cs.CV] UPDATED)

Authors: Xin Ding, Yongwei Wang, Zuheng Xu

Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance. Our codes can be found at https://github.com/UBCDingXin/Dual-NDA.

Dictionary Attack on IMU-based Gait Authentication. (arXiv:2309.11766v2 [cs.CR] UPDATED)

Authors: Rajesh Kumar, Can Isik, Chilukuri K. Mohan

We present a novel adversarial model for authentication systems that use gait patterns recorded by the inertial measurement unit (IMU) built into smartphones. The attack idea is inspired by and named after the concept of a dictionary attack on knowledge (PIN or password) based authentication systems. In particular, this work investigates whether it is possible to build a dictionary of IMUGait patterns and use it to launch an attack or find an imitator who can actively reproduce IMUGait patterns that match the target's IMUGait pattern. Nine physically and demographically diverse individuals walked at various levels of four predefined controllable and adaptable gait factors (speed, step length, step width, and thigh-lift), producing 178 unique IMUGait patterns. Each pattern attacked a wide variety of user authentication models. The deeper analysis of error rates (before and after the attack) challenges the belief that authentication systems based on IMUGait patterns are the most difficult to spoof; further research is needed on adversarial models and associated countermeasures.

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision. (arXiv:2309.14181v3 [cs.CV] UPDATED)

Authors: Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, Weisi Lin

The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://q-future.github.io/Q-Bench.

Nuclear Morphometry using a Deep Learning-based Algorithm has Prognostic Relevance for Canine Cutaneous Mast Cell Tumors. (arXiv:2309.15031v2 [cs.CV] UPDATED)

Authors: Andreas Haghofer, Eda Parlak, Alexander Bartel, Taryn A. Donovan, Charles-Antoine Assenmacher, Pompei Bolfa, Michael J. Dark, Andrea Fuchs-Baumgartinger, Andrea Klang, Kathrin Jäger, Robert Klopfleisch, Sophie Merz, Barbara Richter, F. Yvonne Schulman, Jonathan Ganz, Josef Scharinger, Marc Aubreville, Stephan M. Winkler, Matti Kiupel, Christof A. Bertram

Variation in nuclear size and shape is an important criterion of malignancy for many tumor types; however, categorical estimates by pathologists have poor reproducibility. Measurements of nuclear characteristics (morphometry) can improve reproducibility, but manual methods are time consuming. In this study, we evaluated fully automated morphometry using a deep learning-based algorithm in 96 canine cutaneous mast cell tumors with information on patient survival. Algorithmic morphometry was compared with karyomegaly estimates by 11 pathologists, manual nuclear morphometry of 12 cells by 9 pathologists, and the mitotic count as a benchmark. The prognostic value of automated morphometry was high with an area under the ROC curve regarding the tumor-specific survival of 0.943 (95% CI: 0.889 - 0.996) for the standard deviation (SD) of nuclear area, which was higher than manual morphometry of all pathologists combined (0.868, 95% CI: 0.737 - 0.991) and the mitotic count (0.885, 95% CI: 0.765 - 1.00). At the proposed thresholds, the hazard ratio for algorithmic morphometry (SD of nuclear area $\geq 9.0 \mu m^2$) was 18.3 (95% CI: 5.0 - 67.1), for manual morphometry (SD of nuclear area $\geq 10.9 \mu m^2$) 9.0 (95% CI: 6.0 - 13.4), for karyomegaly estimates 7.6 (95% CI: 5.7 - 10.1), and for the mitotic count 30.5 (95% CI: 7.8 - 118.0). Inter-rater reproducibility for karyomegaly estimates was fair ($\kappa$ = 0.226) with highly variable sensitivity/specificity values for the individual pathologists. Reproducibility for manual morphometry (SD of nuclear area) was good (ICC = 0.654). This study supports the use of algorithmic morphometry as a prognostic test to overcome the limitations of estimates and manual measurements.

Hard View Selection for Self-Supervised Learning. (arXiv:2310.03940v4 [cs.CV] UPDATED)

Authors: Fabio Ferreira, Ivo Rapant, Frank Hutter

Many Self-Supervised Learning (SSL) methods train their models to be invariant to different "views" of an image input, for which a good data augmentation pipeline is crucial. While considerable efforts were directed towards improving pre-text tasks, architectures, or robustness (e.g., Siamese networks or teacher-softmax centering), the majority of these methods remain strongly reliant on the random sampling of operations within the image augmentation pipeline, such as the random resized crop or color distortion operation. In this paper, we argue that the role of view generation and its effect on performance have so far received insufficient attention. To address this, we propose an easy, learning-free, yet powerful Hard View Selection (HVS) strategy designed to extend the random view generation to expose the pretrained model to harder samples during SSL training. It encompasses the following iterative steps: 1) randomly sample multiple views and create pairs of two views, 2) run forward passes for each view pair on the currently trained model, 3) adversarially select the pair yielding the worst loss, and 4) run the backward pass with the selected pair. In our empirical analysis, we show that, under the hood, HVS increases task difficulty by controlling the Intersection over Union of views during pretraining. With only 300-epoch pretraining, HVS is able to closely rival the 800-epoch DINO baseline, and this comparison remains favorable even when factoring in the slowdown induced by the additional forward passes of HVS. Additionally, HVS consistently achieves accuracy improvements on ImageNet between 0.4% and 1.9% on linear evaluation and similar improvements on transfer tasks across multiple SSL methods, such as DINO, SimSiam, iBOT, and SimCLR.
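
The selection loop itself is compact; below is a minimal sketch assuming generic, user-supplied augment and ssl_loss helpers (hypothetical names), rather than the authors' implementation.

import torch

def hvs_training_step(model, image, augment, ssl_loss, optimizer, num_pairs=4):
    # Hard View Selection: sample several augmented view pairs, adversarially pick
    # the pair with the highest SSL loss, and run the backward pass on that pair only.
    with torch.no_grad():
        pairs = [(augment(image), augment(image)) for _ in range(num_pairs)]
        losses = [ssl_loss(model, v1, v2) for v1, v2 in pairs]
    hardest = max(range(num_pairs), key=lambda i: losses[i])
    v1, v2 = pairs[hardest]

    optimizer.zero_grad()
    loss = ssl_loss(model, v1, v2)   # recompute with gradients enabled
    loss.backward()
    optimizer.step()
    return loss.item()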

On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets. (arXiv:2310.06594v2 [cs.CV] UPDATED)

Authors: Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan

There is an emerging line of research on multimodal instruction tuning, and a series of benchmarks has recently been proposed for evaluating these models. Instead of evaluating the models directly, in this paper, we try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets. We also seek a way to build a dataset for developing an all-powerful VLIT model, which we believe could also be useful for establishing a grounded protocol for benchmarking VLIT models. Since the effective evaluation of VLIT datasets remains an open question, we propose a tune-cross-evaluation paradigm: tuning on one dataset and evaluating on the others in turn. For each tune-evaluation experiment set, we define the Meta Quality (MQ) as the mean score obtained over a set of caption metrics, including BLEU, METEOR, and ROUGE-L, to quantify the quality of a dataset or a sample. On this basis, to evaluate the comprehensiveness of a dataset, we develop the Dataset Quality (DQ) covering all tune-evaluation sets. To lay the foundation for building a comprehensive dataset and developing an all-powerful model for practical applications, we define the Sample Quality (SQ) to quantify the all-sided quality of each sample. Extensive experiments validate the rationality of the proposed evaluation paradigm. Based on the holistic evaluation, we build a new dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing), by collecting samples with higher SQ from each dataset. Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve performance comparable to simply adding all VLIT datasets up. Furthermore, REVO-LION not only facilitates the development of a powerful model but also incorporates an evaluation set, which is designed to serve as a convenient benchmark for future research in the field.
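
For illustration, the MQ definition reduces to averaging corpus-level caption metrics over one tune-evaluation set; the sketch below assumes hypothetical metric functions supplied by any captioning toolkit and is not the authors' code.

def meta_quality(predictions, references, metric_fns):
    # Meta Quality (MQ): mean of caption-metric scores (e.g., BLEU, METEOR, ROUGE-L).
    # metric_fns maps metric names to corpus-level scoring functions; their exact
    # signatures are an assumption for this sketch.
    scores = [fn(predictions, references) for fn in metric_fns.values()]
    return sum(scores) / len(scores)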

Provable Probabilistic Imaging using Score-Based Generative Priors. (arXiv:2310.10835v2 [eess.IV] UPDATED)

Authors: Yu Sun, Zihui Wu, Yifan Chen, Berthy T. Feng, Katherine L. Bouman

Estimating high-quality images while also quantifying their uncertainty are two desired features in an image reconstruction algorithm for solving ill-posed inverse problems. In this paper, we propose plug-and-play Monte Carlo (PMC) as a principled framework for characterizing the space of possible solutions to a general inverse problem. PMC is able to incorporate expressive score-based generative priors for high-quality image reconstruction while also performing uncertainty quantification via posterior sampling. In particular, we introduce two PMC algorithms which can be viewed as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization by denoising (RED) algorithms. We also establish a theoretical analysis for characterizing the convergence of the PMC algorithms. Our analysis provides non-asymptotic stationarity guarantees for both algorithms, even in the presence of non-log-concave likelihoods and imperfect score networks. We demonstrate the performance of the PMC algorithms on multiple representative inverse problems with both linear and nonlinear forward models. Experimental results show that PMC significantly improves reconstruction quality and enables high-fidelity uncertainty quantification.
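
As a rough illustration of posterior sampling with a score-based prior, the sketch below runs unadjusted Langevin dynamics for a linear Gaussian forward model; it is a simplified stand-in rather than the paper's PMC algorithms, and score_net is an assumed pretrained score network for the prior.

import torch

def langevin_posterior_sample(y, A, sigma, score_net, x0, step=1e-4, n_steps=1000):
    # Unadjusted Langevin dynamics for y = A x + Gaussian noise: at each iteration,
    # combine the data-fidelity gradient with the learned prior score, then add noise.
    x = x0.clone()
    for _ in range(n_steps):
        with torch.no_grad():
            likelihood_grad = A.T @ (y - A @ x) / sigma ** 2   # gradient of log p(y|x)
            prior_score = score_net(x)                         # approximates grad log p(x)
            x = x + step * (likelihood_grad + prior_score) \
                  + (2 * step) ** 0.5 * torch.randn_like(x)
    return x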

ParisLuco3D: A high-quality target dataset for domain generalization of LiDAR perception. (arXiv:2310.16542v2 [cs.CV] UPDATED)

Authors: Jules Sanchez, Louis Soum-Fontez, Jean-Emmanuel Deschaud, Francois Goulette

LiDAR is an essential sensor for autonomous driving, collecting precise geometric information about a scene. As the performance of various LiDAR perception tasks has improved, generalization to new environments and sensors has emerged as a way to test these optimized models in real-world conditions. Unfortunately, the various annotation strategies of data providers complicate the computation of cross-domain performance.

This paper provides a novel dataset, ParisLuco3D, specifically designed for cross-domain evaluation to make it easier to evaluate the performance utilizing various source datasets. Alongside the dataset, online benchmarks for LiDAR semantic segmentation, LiDAR object detection, and LiDAR tracking are provided to ensure a fair comparison across methods.

The ParisLuco3D dataset, evaluation scripts, and links to benchmarks can be found at the following website: https://npm3d.fr/parisluco3d

Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning. (arXiv:2311.01004v2 [cs.CV] UPDATED)

Authors: Zhenyu Zhang, Benlu Wang, Weijie Liang, Yizhi Li, Xuechen Guo, Guanhong Wang, Shiyan Li, Gaoang Wang

With the development of multimodality and large language models, the deep learning-based technique for medical image captioning holds the potential to offer valuable diagnostic recommendations. However, current generic text and image pre-trained models do not yield satisfactory results when it comes to describing intricate details within medical images. In this paper, we present a novel medical image captioning method guided by the segment anything model (SAM) to enable enhanced encoding with both general and detailed feature extraction. In addition, our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images. We demonstrate the effectiveness of this approach, as it outperforms the pre-trained BLIP2 model on various evaluation metrics for generating descriptions of medical images.

LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition. (arXiv:2311.03198v2 [cs.CV] UPDATED)

Authors: Zijie Zhou, Jingyi Xu, Guangming Xiong, Junyi Ma

Place recognition is one of the most crucial modules for autonomous vehicles to identify places that were previously visited in GPS-invalid environments. Sensor fusion is considered an effective method to overcome the weaknesses of individual sensors. In recent years, multimodal place recognition fusing information from multiple sensors has gathered increasing attention. However, most existing multimodal place recognition methods only use limited field-of-view camera images, which leads to an imbalance between features from different modalities and limits the effectiveness of sensor fusion. In this paper, we present a novel neural network named LCPR for robust multimodal place recognition, which fuses LiDAR point clouds with multi-view RGB images to generate discriminative and yaw-rotation invariant representations of the environment. A multi-scale attention-based fusion module is proposed to fully exploit the panoramic views from different modalities of the environment and their correlations. We evaluate our method on the nuScenes dataset, and the experimental results show that our method can effectively utilize multi-view camera and LiDAR data to improve the place recognition performance while maintaining strong robustness to viewpoint changes. Our open-source code and pre-trained models are available at https://github.com/ZhouZijie77/LCPR .

GLaMM: Pixel Grounding Large Multimodal Model. (arXiv:2311.03356v2 [cs.CV] UPDATED)

Authors: Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Erix Xing, Ming-Hsuan Yang, Fahad S. Khan

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity, both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG), we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at a large-scale. To this end, we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning and vision-language conversations.

An attempt to generate new bridge types from latent space of variational autoencoder. (arXiv:2311.03380v2 [cs.LG] UPDATED)

Authors: Hongjun Zhang

We attempt to generate new bridge types using generative artificial intelligence technology. Grayscale images of bridge facades with varying component widths were rendered with the 3dsMax animation software, and the OpenCV module then applied an appropriate amount of geometric transformation (rotation, horizontal scaling, vertical scaling) to obtain an image dataset of three-span beam bridges, arch bridges, cable-stayed bridges, and suspension bridges. Based on the Python programming language and the TensorFlow and Keras deep learning frameworks, a variational autoencoder was constructed and trained, yielding a low-dimensional bridge-type latent space that is convenient for vector operations. The variational autoencoder can combine two original human-designed bridge types into a new bridge type. Generative artificial intelligence technology can assist bridge designers in bridge-type innovation and can serve as a copilot.
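
The latent-space combination amounts to simple vector arithmetic on encoded bridge images; the following Keras-style sketch assumes the trained encoder returns [z_mean, z_log_var, z], as in common Keras VAE examples, and all function and variable names are illustrative rather than the author's code.

def blend_bridge_types(encoder, decoder, image_a, image_b, alpha=0.5):
    # Encode two bridge facade images, linearly blend their latent means, and decode
    # the blend into a candidate new bridge type.
    z_a = encoder.predict(image_a[None, ...])[0]   # z_mean of bridge type A
    z_b = encoder.predict(image_b[None, ...])[0]   # z_mean of bridge type B
    z_new = (1.0 - alpha) * z_a + alpha * z_b      # vector operation in latent space
    return decoder.predict(z_new)[0]               # decoded grayscale facade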

Generation Of Colors using Bidirectional Long Short Term Memory Networks. (arXiv:2311.06542v3 [cs.CV] UPDATED)

Authors: A. Sinha

Human vision can distinguish between a vast spectrum of colours, estimated to be between 2 and 7 million discernible shades. However, this impressive range does not inherently imply that all these colours have been precisely named and described within our lexicon. We often associate colours with familiar objects and concepts in our daily lives. This research endeavors to bridge the gap between our visual perception of countless shades and our ability to articulate and name them accurately. A novel model has been developed to achieve this goal, leveraging Bidirectional Long Short-Term Memory (BiLSTM) networks with active learning. This model operates on a proprietary dataset meticulously curated for this study. The primary objective of this research is to create a versatile tool for categorizing and naming previously unnamed colours or identifying intermediate shades that elude traditional colour terminology. The findings underscore the potential of this innovative approach in revolutionizing our understanding of colour perception and language. Through rigorous experimentation and analysis, this study illuminates a promising avenue for Natural Language Processing (NLP) applications in diverse industries. By facilitating the exploration of the vast colour spectrum, the potential applications of NLP are extended beyond conventional boundaries.
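
As an architectural illustration only, the sketch below wires up a small bidirectional LSTM in PyTorch that maps a tokenized colour name to an RGB triple; the paper's actual vocabulary, task direction, and active-learning loop are not reproduced, and all names here are hypothetical.

import torch
import torch.nn as nn

class ColorBiLSTM(nn.Module):
    # Minimal sketch: a bidirectional LSTM over colour-name tokens regressing an RGB triple.
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 3)       # R, G, B in [0, 1]

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))      # (batch, seq_len, 2 * hidden_dim)
        return torch.sigmoid(self.head(h[:, -1]))      # read out the last time step

model = ColorBiLSTM(vocab_size=40)
rgb = model(torch.randint(0, 40, (2, 12)))             # two colour names of 12 tokens each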

Wildfire Smoke Detection with Cross Contrast Patch Embedding. (arXiv:2311.10116v2 [cs.CV] UPDATED)

Authors: Chong Wang, Cheng Xu, Adeel Akram, Zhilin Shan, Qixing Zhang

Transformer-based deep networks have increasingly shown significant advantages over CNNs, and some existing work has applied them to wildfire recognition or detection. However, we observe that the vanilla Transformer is not well suited to extracting smoke features: low-level information such as color, transparency, and texture is very important for smoke recognition, whereas the Transformer pays more attention to the semantic relevance between middle- or high-level features and is not sensitive to subtle spatial changes in low-level features. To solve this problem, we propose the Cross Contrast Patch Embedding (CCPE) module based on the Swin Transformer, which uses multi-scale spatial frequency contrast information in both vertical and horizontal directions to improve the network's discrimination of underlying details. The fuzzy boundary of smoke makes positive and negative label assignment for instances ambiguous, which is another challenge for wildfire detection. To solve this problem, a Separable Negative Sampling Mechanism (SNSM) is proposed. By using two different negative instance sampling strategies on positive images and negative images respectively, the problem of supervision signal confusion caused by label diversity in the process of network training is alleviated. This paper also releases the RealFire Test, the largest real wildfire test set so far, to evaluate the proposed method and promote future research. It contains 50,535 images from 3,649 video clips. The proposed method has been extensively tested and evaluated on the RealFire Test dataset and shows a significant performance improvement compared with the baseline detection models.

Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise. (arXiv:2311.13091v2 [cs.LG] UPDATED)

Authors: Yixin Liu, Kaidi Xu, Xun Chen, Lichao Sun

The open sourcing of large amounts of image data promotes the development of deep learning techniques. Along with this comes the privacy risk of these open-source image datasets being exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes. To avoid the abuse of public data, a poisoning-based technique, the unlearnable example, is proposed to significantly degrade the generalization performance of models by adding a kind of imperceptible noise to the data. To further enhance its robustness against adversarial training, existing works leverage iterative adversarial training on both the defensive noise and the surrogate model. However, it still remains unknown whether the robustness of unlearnable examples primarily comes from the enhancement of the surrogate model or from the defensive noise. Observing that simply removing the adversarial perturbation from the training process of the defensive noise can improve the performance of robust unlearnable examples, we identify that the surrogate model's robustness alone contributes to the performance. Furthermore, we find that a negative correlation exists between the robustness of the defensive noise and the protection performance, indicating an instability issue with the defensive noise. Motivated by this, to further boost robust unlearnable examples, we introduce stable error-minimizing noise (SEM), which trains the defensive noise against random perturbations instead of time-consuming adversarial perturbations to improve the stability of the defensive noise. Through extensive experiments, we demonstrate that SEM achieves a new state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet Subset in terms of both effectiveness and efficiency. The code is available at https://github.com/liuyixin-louis/Stable-Unlearnable-Example.
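
A minimal sketch of the core idea, using assumed hyperparameters and a user-supplied surrogate classifier: the defensive noise is updated to minimize the surrogate's loss under a random, rather than adversarial, perturbation of the protected images.

import torch
import torch.nn.functional as F

def sem_noise_step(surrogate, x, y, delta, rho=8 / 255, eps=8 / 255, lr=0.1):
    # One update of stable error-minimizing noise: keep the surrogate's loss low on
    # the protected image under a random perturbation (no inner adversarial loop).
    delta = delta.detach().requires_grad_(True)
    rand_pert = torch.empty_like(x).uniform_(-rho, rho)
    loss = F.cross_entropy(surrogate(torch.clamp(x + delta + rand_pert, 0, 1)), y)
    loss.backward()
    with torch.no_grad():
        delta = delta - lr * delta.grad.sign()   # error-minimizing direction
        delta.clamp_(-eps, eps)                  # keep the noise imperceptible
    return delta.detach()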

Towards Full-scene Domain Generalization in Multi-agent Collaborative Bird's Eye View Segmentation for Connected and Autonomous Driving. (arXiv:2311.16754v2 [cs.CV] UPDATED)

Authors: Senkang Hu, Zhengru Fang, Xianhao Chen, Yuguang Fang, Sam Kwong

Collaborative perception has recently gained significant attention in autonomous driving, improving perception quality by enabling the exchange of additional information among vehicles. However, deploying collaborative perception systems can lead to domain shifts due to diverse environmental conditions and data heterogeneity among connected and autonomous vehicles (CAVs). To address these challenges, we propose a unified domain generalization framework applicable in both training and inference stages of collaborative perception. In the training phase, we introduce an Amplitude Augmentation (AmpAug) method to augment low-frequency image variations, broadening the model's ability to learn across various domains. We also employ a meta-consistency training scheme to simulate domain shifts, optimizing the model with a carefully designed consistency loss to encourage domain-invariant representations. In the inference phase, we introduce an intra-system domain alignment mechanism to reduce or potentially eliminate the domain discrepancy among CAVs prior to inference. Comprehensive experiments substantiate the effectiveness of our method in comparison with the existing state-of-the-art works. Code will be released at https://github.com/DG-CAVs/DG-CoPerception.git.

MABViT -- Modified Attention Block Enhances Vision Transformers. (arXiv:2312.01324v2 [cs.CV] UPDATED)

Authors: Mahesh Ramesh, Aswinkumar Ramkumar

Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly in Large Language Models (LLMs). Additionally, utilizing a parallel configuration within each Transformer block rather than the conventional serialized method has been shown to accelerate the training of LLMs without significantly impacting performance. However, when the MLP and attention block were run in parallel for the image classification task, we observed a noticeable decline in performance. We propose a novel transformer variant that integrates non-linearity within the attention block to tackle this problem. We implemented the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while utilizing fewer parameters. It also surpasses the B/16 variant while using only half the parameters. Furthermore, we provide results with the GELU activation function variant to confirm our assertions. Lastly, we showcase that the MABViT variants exhibit greater potential when utilized in deep transformers compared to the standard architecture.
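
To illustrate the idea of injecting non-linearity into the attention block, the hypothetical PyTorch sketch below applies a GLU-style (SwiGLU-like) gate to the Value projection; the exact gating, normalization, and block layout of MABViT may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUValueAttention(nn.Module):
    # Self-attention in which the Value projection is gated with a GLU-style
    # non-linearity (silu(gate) * value) before the attention weighting.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v, self.v_gate = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, dim)
        B, N, D = x.shape
        split = lambda t: t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = split(self.q(x)), split(self.k(x))
        v = split(F.silu(self.v_gate(x)) * self.v(x))       # GLU-activated Value tensor
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        y = attn.softmax(dim=-1) @ v
        return self.out(y.transpose(1, 2).reshape(B, N, D))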

PGS: Pose-Guided Supervision for Mitigating Clothes-Changing in Person Re-Identification. (arXiv:2312.05634v2 [cs.CV] UPDATED)

Authors: Quoc-Huy Trinh, Nhat-Tan Bui, Dinh-Hieu Hoang, Phuoc-Thao Vo Thi, Hai-Dang Nguyen, Debesh Jha, Ulas Bagci, Ngan Le, Minh-Triet Tran

The Person Re-Identification (Re-ID) task seeks to enhance the tracking of multiple individuals by surveillance cameras. It provides additional support for multimodal tasks, including text-based person retrieval and human matching. Among the significant challenges faced in Re-ID, one of the most prominent is dealing with clothes-changing, where the same person may appear in different outfits. While previous methods have made notable progress in maintaining clothing data consistency and handling clothing change data, they still tend to rely excessively on clothing information, which can limit performance due to the dynamic nature of human appearances. To mitigate this challenge, we propose Pose-Guided Supervision (PGS), an effective framework for learning pose guidance within the Re-ID task. Our PGS consists of three modules: a human encoder, a pose encoder, and a Pose-to-Human Projection module (PHP). The pose encoder module utilizes a frozen pre-trained model, while we fine-tune a pre-trained human-centric model for the human encoder module. Our PHP transfers pose knowledge from the pose encoder module to the human encoder module through multiple projectors. Our framework, following extensive experimentation on five benchmark datasets, consistently surpasses the performance of current state-of-the-art methods. Our code is available at https://github.com/huyquoctrinh/PGS.

Characteristic Guidance: Non-linear Correction for Diffusion Model at Large Guidance Scale. (arXiv:2312.07586v2 [cs.CV] UPDATED)

Authors: Candi Zheng, Yuan Lan

Popular guidance for denoising diffusion probabilistic model (DDPM) linearly combines distinct conditional models together to provide enhanced control over samples. However, this approach overlooks nonlinear effects that become significant when guidance scale is large. To address this issue, we propose characteristic guidance, a sampling method that provides first-principle non-linear correction for classifier-free guided DDPMs. Such correction forces the guided DDPMs to respect the Fokker-Planck equation of their underlying diffusion process, in a way that is training-free, derivative-free, and compatible with existing sampling methods. Experiments show that characteristic guidance enhances control and reduces color and exposure issues in image generation, proving effective in diverse applications ranging from latent space sampling to solving physics problems like magnet phase transitions.
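
For context, standard classifier-free guidance is the linear combination sketched below; characteristic guidance replaces it with a non-linear correction consistent with the Fokker-Planck equation, which is not reproduced here, and the eps_model interface is an assumption.

def classifier_free_guidance(eps_model, x_t, t, cond, scale):
    # Linear guidance: extrapolate between unconditional and conditional noise
    # predictions; nonlinear effects at large scale are what the paper corrects.
    eps_uncond = eps_model(x_t, t, cond=None)
    eps_cond = eps_model(x_t, t, cond=cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)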

DiffusionLight: Light Probes for Free by Painting a Chrome Ball. (arXiv:2312.09168v2 [cs.CV] UPDATED)

Authors: Pakkapon Phongthawee, Worameth Chinchuthakun, Nontaphat Sinsunthithet, Amit Raj, Varun Jampani, Pramook Khungurn, Supasorn Suwajanakorn

We present a simple yet effective technique to estimate lighting in a single input image. Current techniques rely heavily on HDR panorama datasets to train neural networks to regress an input with limited field-of-view to a full environment map. However, these approaches often struggle with real-world, uncontrolled settings due to the limited diversity and size of their datasets. To address this problem, we leverage diffusion models trained on billions of standard images to render a chrome ball into the input image. Despite its simplicity, this task remains challenging: the diffusion models often insert incorrect or inconsistent objects and cannot readily generate images in HDR format. Our research uncovers a surprising relationship between the appearance of chrome balls and the initial diffusion noise map, which we utilize to consistently generate high-quality chrome balls. We further fine-tune an LDR diffusion model (Stable Diffusion XL) with LoRA, enabling it to perform exposure bracketing for HDR light estimation. Our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios.

Federated Learning with Instance-Dependent Noisy Labels. (arXiv:2312.10324v2 [cs.LG] UPDATED)

Authors: Lei Wang, Jieming Bian, Jie Xu

Federated learning (FL) with noisy labels poses a significant challenge. Existing methods designed for handling noisy labels in centralized learning tend to lose their effectiveness in the FL setting, mainly due to the small dataset size and the heterogeneity of client data. While some attempts have been made to tackle FL with noisy labels, they primarily focused on scenarios involving class-conditional noise. In this paper, we study the more challenging and practical issue of instance-dependent noise (IDN) in FL. We introduce a novel algorithm called FedBeat (Federated Learning with Bayesian Ensemble-Assisted Transition Matrix Estimation). FedBeat aims to build a global statistically consistent classifier using the IDN transition matrix (IDNTM), which encompasses three synergistic steps: (1) A federated data extraction step that constructs a weak global model and extracts high-confidence data using a Bayesian model ensemble method. (2) A federated transition matrix estimation step in which clients collaboratively train an IDNTM estimation network based on the extracted data. (3) A federated classifier correction step that enhances the global model's performance by training it using a loss function tailored for noisy labels, leveraging the IDNTM. Experiments conducted on CIFAR-10 and SVHN verify that the proposed method significantly outperforms state-of-the-art methods.
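
To illustrate how an estimated instance-dependent transition matrix can enter the classifier correction step, the following is a generic forward-correction loss sketch in PyTorch, not the authors' exact objective; the per-sample matrices T would come from the trained IDNTM estimation network.

import torch
import torch.nn.functional as F

def transition_corrected_loss(logits, noisy_labels, T):
    # Forward correction: map the clean-label posterior through the transition matrix
    # before matching the observed noisy label.
    # logits: (B, C); T: (B, C, C) with T[b, i, j] = P(noisy = j | clean = i).
    p_clean = F.softmax(logits, dim=-1)                      # (B, C)
    p_noisy = torch.bmm(p_clean.unsqueeze(1), T).squeeze(1)  # (B, C)
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)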

Language-Assisted 3D Scene Understanding. (arXiv:2312.11451v2 [cs.CV] UPDATED)

Authors: Yanmin Wu, Qiankun Gao, Renrui Zhang, Jian Zhang

The scale and quality of point cloud datasets constrain the advancement of point cloud learning. Recently, with the development of multi-modal learning, the incorporation of domain-agnostic prior knowledge from other modalities, such as images and text, to assist in point cloud feature learning has been considered a promising avenue. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the requirement for paired triplet data, redundancy and ambiguity in supervised features, and the disruption of the original priors. In this paper, we propose a language-assisted approach to point cloud feature learning (LAST-PCL), enriching semantic concepts through LLMs-based text enrichment. We achieve de-redundancy and feature dimensionality reduction without compromising textual priors by statistical-based and training-free significant feature selection. Furthermore, we also delve into an in-depth analysis of the impact of text contrastive training on the point cloud. Extensive experiments validate that the proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance in 3D semantic segmentation, 3D object detection, and 3D scene classification tasks.

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. (arXiv:2312.13108v2 [cs.CV] UPDATED)

Authors: Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou

Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity. Existing works leveraging Large Language Models (LLMs) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However, these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely used software applications, such as After Effects and MS Word, each accompanied by the necessary project files for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied Agent framework, which incorporates a sophisticated GUI parser driven by an LLM-agent and an enhanced reasoning mechanism adept at handling lengthy procedural tasks. Our experimental results reveal that our GUI Parser and Reasoning mechanism outshine existing methods in performance. Nevertheless, substantial room for improvement remains, with the best model attaining only a 46% success rate on our benchmark. We conclude with a thorough analysis of the current methods' limitations, setting the stage for future breakthroughs in this domain.

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. (arXiv:2312.16256v2 [cs.CV] UPDATED)

Authors: Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, Aniket Bera

We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision, limited to either synthetic environments or a narrow selection of real-world scenes, are insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what can be explored in deep learning-based 3D analysis. To address this critical gap, we present DL3DV-10K, a large-scale scene dataset featuring 51.2 million frames from 10,510 videos captured at 65 types of point-of-interest (POI) locations, covering both bounded and unbounded scenes with different levels of reflection, transparency, and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K, which revealed valuable insights for future research in NVS. In addition, we have obtained encouraging results in a pilot study on learning generalizable NeRF from DL3DV-10K, which demonstrates the necessity of a large-scale scene-level dataset in forging a path toward a foundation model for 3D representation learning. Our DL3DV-10K dataset, benchmark results, and models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/.

A Survey on Super Resolution for Video Enhancement Using GAN. (arXiv:2312.16471v2 [eess.IV] UPDATED)

Authors: Ankush Maity, Roshan Pious, Sourabh Kumar Lenka, Vishal Choudhary, Prof. Sharayu Lokhande

This compilation of research-paper highlights provides a comprehensive overview of recent developments in image and video super-resolution using deep learning algorithms such as Generative Adversarial Networks (GANs). The studies covered in these summaries present fresh techniques for addressing the challenge of improving image and video quality, such as recursive learning for video super-resolution, novel loss functions, frame-rate enhancement, and attention-model integration. These approaches are frequently evaluated using criteria such as PSNR, SSIM, and perceptual indices. These advancements, which aim to increase the visual clarity and quality of low-resolution video, have tremendous potential in a variety of sectors, ranging from surveillance technology to medical imaging. In addition, this collection delves into the wider field of Generative Adversarial Networks, exploring their principles, training approaches, and applications across a broad range of domains, while also emphasizing the challenges and opportunities for future research in this rapidly advancing field of artificial intelligence.

Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding. (arXiv:2312.16477v2 [cs.CV] UPDATED)

Authors: Lixiang Xu, Qingzhe Cui, Richang Hong, Wei Xu, Enhong Chen, Xin Yuan, Chenglong Li, Yuanyan Tang

In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices due to their huge number of parameters. To address this problem, we introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible. Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT). In GMViT, the view-level ViT first establishes relationships between view-level features. Additionally, to capture deeper features, we employ a grouping module to aggregate view-level features into group-level features. Finally, the group-level ViT aggregates group-level features into complete, well-formed 3D shape descriptors. Notably, in both ViTs, we introduce spatial encoding of camera coordinates as novel position embeddings. Furthermore, we propose two compressed versions based on GMViT, namely GMViT-simple and GMViT-mini. To enhance the training effectiveness of the small models, we introduce a knowledge distillation method throughout the GMViT pipeline, where the key outputs of each GMViT component serve as distillation targets. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini, reduce the number of parameters by factors of 8 and 17.6, respectively, and improve shape recognition speed by 1.5 times on average, while preserving at least 90% of the classification and retrieval performance.
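
The component-wise distillation described above can be sketched as a combined feature-matching and soft-label loss, roughly as follows; the weighting, temperature, and which intermediate outputs are matched are assumptions rather than the released GMViT recipe.

```python
# Hedged sketch of distilling a small model from GMViT's key outputs.
import torch.nn.functional as F

def gmvit_distill_loss(student_feats, teacher_feats, student_logits,
                       teacher_logits, temperature: float = 4.0):
    """student_feats / teacher_feats: lists of matching intermediate outputs
    (e.g., view-level features, group-level features, final shape descriptor)."""
    # Feature matching on each distillation target.
    feat_loss = sum(F.mse_loss(s, t.detach())
                    for s, t in zip(student_feats, teacher_feats))
    # Soft-label distillation on the classification logits.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return feat_loss + kd_loss
```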

VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting. (arXiv:2312.16580v2 [cs.CV] UPDATED)

Authors: Seunggu Kang, WonJun Moon, Euiyeon Kim, Jae-Pil Heo

Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and then counting. However, this sequentially designed two-stage process remains vulnerable to error propagation. In this work, we propose a one-stage baseline, the Visual-Language Baseline (VLBase), which exploits the implicit association between the semantic and patch embeddings of CLIP. We then extend VLBase to the Visual-language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, a Learnable Affine Transformation (LAT) is employed to adapt the semantic-patch similarity map to the counting task. Lastly, layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connections (SaSC) to preserve generalization to unseen classes. Through extensive experiments on FSC147, CARPK, and PUCPR+, the benefits of the end-to-end framework, VLCounter, are demonstrated.
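
A minimal sketch of the LAT idea, assuming an element-wise learnable scale and shift over the CLIP text-to-patch similarity map; the module shape and wiring are illustrative, not VLCounter's released code.

```python
# Hedged sketch: learnable affine transform of a CLIP similarity map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableAffine(nn.Module):
    """Element-wise learnable scale and shift over the similarity map."""
    def __init__(self, h: int, w: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, 1, h, w))
        self.shift = nn.Parameter(torch.zeros(1, 1, h, w))

    def forward(self, patch_feats: torch.Tensor, text_emb: torch.Tensor):
        # patch_feats: (B, H*W, D) CLIP patch embeddings; text_emb: (B, D)
        sim = F.cosine_similarity(patch_feats, text_emb.unsqueeze(1), dim=-1)
        h, w = self.scale.shape[2], self.scale.shape[3]
        sim = sim.view(sim.shape[0], 1, h, w)            # (B, 1, H, W)
        return sim * self.scale + self.shift             # counting-ready map
```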

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices. (arXiv:2312.16886v2 [cs.CV] UPDATED)

Authors: Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen

We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of mobile-oriented architectural designs and techniques, comprising a set of language models at the 1.4B and 2.7B parameter scales trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on-par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
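
As a rough, hedged illustration of what an efficient projector between the CLIP vision encoder and the LLM might look like, the sketch below maps visual tokens into the LLM embedding space and reduces their count with a depthwise convolution; this is an assumption for illustration, not MobileVLM's actual projector design.

```python
# Hedged sketch of a lightweight vision-to-language projector.
import torch
import torch.nn as nn

class LightProjector(nn.Module):
    """Maps CLIP visual tokens to the LLM embedding space and downsamples them."""
    def __init__(self, vis_dim: int, llm_dim: int, stride: int = 2):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)
        # Depthwise conv cuts the number of visual tokens by stride**2.
        self.down = nn.Conv2d(llm_dim, llm_dim, kernel_size=stride,
                              stride=stride, groups=llm_dim)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim), with N assumed to be a perfect square
        b, n, _ = vis_tokens.shape
        side = int(n ** 0.5)
        x = self.proj(vis_tokens)                         # (B, N, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, side, side)  # (B, llm_dim, side, side)
        x = self.down(x)                                  # fewer spatial tokens
        return x.flatten(2).transpose(1, 2)               # (B, N', llm_dim)
```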

FENet: Focusing Enhanced Network for Lane Detection. (arXiv:2312.17163v2 [cs.CV] UPDATED)

Authors: Liman Wang, Hanyang Zhong

Inspired by human driving focus, this research pioneers networks augmented with Focusing Sampling, Partial Field of View Evaluation, an Enhanced FPN architecture, and a Directional IoU Loss - targeted innovations addressing obstacles to precise lane detection for autonomous driving. Experiments demonstrate that our Focusing Sampling strategy, which emphasizes vital distant details unlike uniform approaches, significantly boosts both benchmark performance and the practical recognition accuracy of curved and distant lanes that is essential for safety. While FENetV1 achieves state-of-the-art performance on conventional metrics via enhancements isolating perspective-aware contexts that mimic driver vision, FENetV2 proves most reliable under the proposed Partial Field analysis. Hence, we specifically recommend FENetV2 for practical lane navigation, despite a marginal degradation on standard whole-image measures. Future directions include collecting on-road data and integrating complementary dual frameworks to drive further breakthroughs guided by human-perception principles. Code will be made available.
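
A minimal sketch of a focusing-style row-anchor schedule, which samples image rows more densely toward the horizon (distant lanes) than a uniform grid; the power-law warp and its parameters are assumptions used only to illustrate the idea behind Focusing Sampling.

```python
# Hedged sketch: non-uniform row sampling that concentrates near the horizon.
import numpy as np

def focusing_row_anchors(img_height: int, num_anchors: int,
                         horizon_frac: float = 0.45, gamma: float = 2.0):
    """Row indices spaced densely near the horizon and sparsely near the car."""
    t = np.linspace(0.0, 1.0, num_anchors)
    warped = t ** gamma                     # small t (near horizon) -> tight spacing
    rows = (img_height - 1) * (horizon_frac + (1.0 - horizon_frac) * warped)
    return rows.astype(int)

# Example: anchors = focusing_row_anchors(720, 72)
```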

EFHQ: Multi-purpose ExtremePose-Face-HQ dataset. (arXiv:2312.17205v2 [cs.CV] UPDATED)

Authors: Trung Tuan Dao, Duc Hong Vu, Cuong Pham, Anh Tran

Existing facial datasets, while having plentiful images at near-frontal views, lack images with extreme head poses, leading to degraded performance of deep learning models when dealing with profile or pitched faces. This work aims to address this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ), which includes up to 450k high-quality images of faces at extreme poses. To produce such a massive dataset, we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets, VFHQ and CelebV-HQ, which contain many high-resolution face videos captured in various settings. Our dataset can complement existing datasets on various facial-related tasks, such as facial synthesis with 2D/3D-aware GANs, diffusion-based text-to-image face generation, and face reenactment. Specifically, training with EFHQ helps models generalize well across diverse poses, significantly improving performance in scenarios involving extreme views, as confirmed by extensive experiments. Additionally, we utilize EFHQ to define a challenging cross-view face verification benchmark, on which the performance of SOTA face recognition models drops by 5-37% compared to frontal-to-frontal scenarios, aiming to stimulate studies on face recognition under severe pose conditions in the wild.

Breaking Through the Haze: An Advanced Non-Homogeneous Dehazing Method based on Fast Fourier Convolution and ConvNeXt. (arXiv:2305.04430v1 [cs.CV] CROSS LISTED)

Authors: Han Zhou, Wei Dong, Yangyi Liu, Jun Chen

Haze usually leads to deteriorated images with low contrast, color shift, and structural distortion. We observe that many deep learning-based models exhibit exceptional performance in removing homogeneous haze, but they usually fail to address the challenge of non-homogeneous dehazing. Two main factors account for this situation. Firstly, due to the intricate and non-uniform distribution of dense haze, recovering structural and chromatic features with high fidelity is challenging, particularly in regions with heavy haze. Secondly, the existing small-scale datasets for non-homogeneous dehazing are inadequate to support reliable learning of feature mappings between hazy images and their corresponding haze-free counterparts by convolutional neural network (CNN)-based models. To tackle these two challenges, we propose a novel two-branch network that leverages the 2D discrete wavelet transform (DWT), fast Fourier convolution (FFC) residual blocks, and a pretrained ConvNeXt model. Specifically, in the DWT-FFC frequency branch, our model exploits DWT to capture more high-frequency features. Moreover, by taking advantage of the large receptive field provided by FFC residual blocks, our model is able to effectively explore global contextual information and produce images with better perceptual quality. In the prior knowledge branch, an ImageNet-pretrained ConvNeXt is adopted, as opposed to Res2Net. This enables our model to learn more supplementary information and acquire a stronger generalization ability. The feasibility and effectiveness of the proposed method are demonstrated via extensive experiments and ablation studies. The code is available at https://github.com/zhouh115/DWT-FFC.
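
To illustrate the DWT front end of the frequency branch, the sketch below splits an image into low- and high-frequency sub-bands with a fixed Haar wavelet via PyWavelets; the paper's learnable variant replaces such fixed filters, so this is a simplified assumption.

```python
# Hedged sketch: fixed 2D DWT decomposition into approximation and detail bands.
import numpy as np
import pywt

def dwt_subbands(image: np.ndarray):
    """Split an (H, W) or (H, W, C) image into low- and high-frequency parts."""
    if image.ndim == 2:
        image = image[..., None]
    lows, highs = [], []
    for c in range(image.shape[-1]):
        cA, (cH, cV, cD) = pywt.dwt2(image[..., c], "haar")
        lows.append(cA)                               # low-frequency approximation
        highs.append(np.stack([cH, cV, cD], axis=0))  # horizontal/vertical/diagonal details
    # Returns an (H/2, W/2, C) low-pass tensor and a (3, H/2, W/2, C) high-pass tensor.
    return np.stack(lows, axis=-1), np.stack(highs, axis=-1)
```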