Authors: Matthew Pierson, Zia Mehrabi
Abstract: Waterways shape earth system processes and human societies, and a better understanding of their distribution can assist in a range of applications from earth system modeling to human development and disaster response. Most efforts to date to map the world's waterways have required extensive modeling and contextual expert input, and are costly to repeat. Many gaps remain, particularly in geographies with lower economic development. Here we present a computer vision model that can draw waterways based on 10m Sentinel-2 satellite imagery and the 30m GLO-30 Copernicus digital elevation model, trained using high fidelity waterways data from the United States. We couple this model with a vectorization process to map waterways worldwide. For widespread utility and downstream modelling efforts, we scaffold this new data on the backbone of existing mapped basins and waterways from another dataset, TDX-Hydro. In total, we add 124 million kilometers of waterways to the 54 million kilometers already in the TDX-Hydro dataset, more than tripling the extent of waterways mapped globally.
Authors: Muhammad Suleman Ali Hamdani, Khizer Zakir, Neetu Kushwaha, Syeda Eman Fatima, Hassan Aftab Sheikh
Abstract: Brick kilns are a major source of air pollution in Pakistan, with many operating without regulation. A key challenge in Pakistan and across the Indo-Gangetic Plain is the limited air quality monitoring and lack of transparent data on pollution sources. To address this, we present a two-fold AI approach that combines low-resolution Sentinel-2 and high-resolution imagery to map brick kiln locations. Our process begins with a low-resolution analysis, followed by a post-processing step to reduce false positives, minimizing the need for extensive high-resolution imagery. This analysis initially identified 20,000 potential brick kilns, with high-resolution validation confirming around 11,000 kilns. The dataset also distinguishes between Fixed Chimney and Zigzag kilns, enabling more accurate pollution estimates for each type. Our approach demonstrates how combining satellite imagery with AI can effectively detect specific polluting sources. This dataset provides regulators with insights into brick kiln pollution, supporting interventions for unregistered kilns and actions during high pollution episodes.
Authors: Ferhat Ozgur Catak, Murat Kuzlu, Taylor Patrick
Abstract: In recent years, vision-language models (VLMs) have been applied to various fields, including healthcare, education, finance, and manufacturing, with remarkable performance. However, concerns remain regarding VLMs' consistency and uncertainty, particularly in critical applications such as healthcare, which demand a high level of trust and reliability. This paper proposes a novel convex hull-based approach to evaluating uncertainty in VLMs' responses, demonstrated on a healthcare Visual Question Answering (VQA) application. The LLM-CXR model is selected as the medical VLM and used to generate responses for a given prompt at different temperature settings (0.001, 0.25, 0.50, 0.75, and 1.00). According to the results, the LLM-CXR VLM shows high uncertainty at higher temperature settings. Experimental outcomes emphasize the importance of uncertainty in VLMs' responses, especially in healthcare applications.
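As a rough illustration of the convex hull idea (not the paper's implementation), the sketch below embeds a set of sampled responses into 2D and uses the hull area as an uncertainty proxy; the choice of embedding, the PCA projection, and the function name are assumptions.

import numpy as np
from scipy.spatial import ConvexHull
from sklearn.decomposition import PCA

def convex_hull_uncertainty(response_embeddings):
    # response_embeddings: (n_responses, d) vectors for answers sampled at one temperature
    points_2d = PCA(n_components=2).fit_transform(response_embeddings)
    return ConvexHull(points_2d).volume  # for 2D points, .volume is the hull area

# A larger hull area means the sampled answers are more spread out, i.e., higher uncertainty.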
Authors: Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, Xiaoguang Mao
Abstract: Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite advancements, there remains a lack of standardized benchmarks for evaluating MLLMs' performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.
Authors: Massine El Khader, Elias Al Bouzidi, Abdellah Oumida, Mohammed Sbaihi, Eliott Binard, Jean-Philippe Poli, Wassila Ouerdane, Boussad Addad, Katarzyna Kapusta
Abstract: Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper first reveals their limitations and then presents DiffGuard, a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, achieving a performance that surpasses the best existing filters by over 14%.
Authors: Chenhan Zhang, Benjamin Zi Hao Zhao, Hassan Asghar, Dali Kaafar
Abstract: Users may inadvertently upload personally identifiable information (PII) to Machine Learning as a Service (MLaaS) providers. When users no longer want their PII on these services, regulations like GDPR and COPPA mandate a right to be forgotten for these users. As such, these services seek efficient methods to remove the influence of specific data points, hence the introduction of machine unlearning. Traditionally, unlearning is performed by removing entire data samples (sample unlearning) or whole features across the dataset (feature unlearning). However, these approaches are not equipped to handle the more granular and challenging task of unlearning specific objects within a sample. To address this gap, we propose a scene graph-based object unlearning framework. This framework utilizes scene graphs, which are rich in semantic representation, to transparently translate unlearning requests into actionable steps. The result is the preservation of the overall semantic integrity of the generated image, bar the unlearned object. Further, we manage high computational overheads with influence functions to approximate the unlearning process. For validation, we evaluate the unlearned object's fidelity in outputs under the tasks of image reconstruction and image synthesis. Our proposed framework demonstrates improved object unlearning outcomes, with better preservation of unrequested samples than sample and feature unlearning methods. This work addresses critical privacy issues by increasing the granularity of targeted machine unlearning through forgetting specific object-level details without sacrificing the utility of the whole data sample or dataset feature.
Authors: Mohammad R. Salmanpour, Arman Gorji, Amin Mousavi, Ali Fathi Jouzdani, Nima Sanati, Mehdi Maghsudi, Bonnie Leung, Cheryl Ho, Ren Yuan, Arman Rahmim
Abstract: Objective: This study explores a semi-supervised learning (SSL), pseudo-labeling strategy using diverse datasets to enhance lung cancer (LCa) survival predictions, analyzing Handcrafted and Deep Radiomic Features (HRFs/DRFs) from PET/CT scans with Hybrid Machine Learning Systems (HMLSs). Methods: We collected 199 LCa patients with both PET and CT images, obtained from The Cancer Imaging Archive (TCIA) and our local database, alongside 408 head-and-neck cancer (HNCa) PET/CT images from TCIA. We extracted 215 HRFs and 1024 DRFs by PySERA and a 3D-Autoencoder, respectively, within the ViSERA software, from segmented primary tumors. The supervised learning (SL) strategy employed HMLSs: PCA connected with four classifiers on both HRFs and DRFs. The SSL strategy expanded the datasets by adding 408 pseudo-labeled HNCa cases (labeled by a Random Forest algorithm) to the 199 LCa cases, using the same HMLSs. Furthermore, Principal Component Analysis (PCA) linked with four survival prediction algorithms was utilized in survival hazard ratio analysis. Results: The SSL strategy outperformed the SL method (p-value < 0.05), achieving an average accuracy of 0.85 with DRFs from PET and PCA + Multi-Layer Perceptron (MLP), compared to 0.65 for the SL strategy using DRFs from CT and PCA + K-Nearest Neighbor (KNN). Additionally, PCA linked with Component-wise Gradient Boosting Survival Analysis on both HRFs and DRFs extracted from CT had an average c-index of 0.80 with a Log Rank p-value << 0.001, confirmed by external testing. Conclusions: Shifting from HRFs and SL to DRFs and SSL strategies, particularly in contexts with limited data points, enables CT or PET alone to achieve high predictive performance.
Authors: Justin Jiang
Abstract: The rise of advanced AI models like Generative Adversarial Networks (GANs) and diffusion models such as Stable Diffusion has made the creation of highly realistic images accessible, posing risks of misuse in misinformation and manipulation. This study evaluates the effectiveness of convolutional neural networks (CNNs), as well as DenseNet architectures, for detecting AI-generated images. Using variations of the CIFAKE dataset, including images generated by different versions of Stable Diffusion, we analyze the impact of updates and modifications such as Gaussian blurring, prompt text changes, and Low-Rank Adaptation (LoRA) on detection accuracy. The findings highlight vulnerabilities in current detection methods and propose strategies to enhance the robustness and reliability of AI-image detection systems.
Authors: Nikita Kisel, Illia Volkov, Katerina Hanzelkova, Klara Janouskova, Jiri Matas
Abstract: Since its release, the ImageNet-1k dataset has become a gold standard for evaluating model performance. It has served as the foundation for numerous other datasets and training tasks in computer vision. As models have improved in accuracy, issues related to label correctness have become increasingly apparent. In this blog post, we analyze the issues in the ImageNet-1k dataset, including incorrect labels, overlapping or ambiguous class definitions, training-evaluation domain shifts, and image duplicates. The solutions for some problems are straightforward. For others, we hope to start a broader conversation about refining this influential dataset to better serve future research.
Authors: Nima Sedaghat, Tanawan Chatchadanoraset, Colin Orion Chandler, Ashish Mahabal, Maryam Eslami
Abstract: Motivated by the scarcity of proper labels in an astrophysical application, we have developed a novel technique, called Selfish Evolution, which allows for the detection and correction of corrupted labels in a weakly supervised fashion. Unlike methods based on early stopping, we let the model train on the noisy dataset. Only then do we intervene and allow the model to overfit to individual samples. The ``evolution'' of the model during this process reveals patterns with enough information about the noisiness of the label, as well as its correct version. We train a secondary network on these spatiotemporal ``evolution cubes'' to correct potentially corrupted labels. We incorporate the technique in a closed-loop fashion, allowing for automatic convergence towards a mostly clean dataset, without presumptions about the state of the network in which we intervene. We evaluate on the main task of the Supernova-hunting dataset but also demonstrate efficiency on the more standard MNIST dataset.
Authors: Songjiang Lai, Tsun-Hin Cheung, Jiayi Zhao, Kaiwen Xue, Ka-Chun Fung, Kin-Man Lam
Abstract: Rolling bearings play a crucial role in industrial machinery, directly influencing equipment performance, durability, and safety. However, harsh operating conditions, such as high speeds and temperatures, often lead to bearing malfunctions, resulting in downtime, economic losses, and safety hazards. This paper proposes the Residual Attention Single-Head Vision Transformer Network (RA-SHViT-Net) for fault diagnosis in rolling bearings. Vibration signals are transformed from the time domain to the frequency domain using the Fast Fourier Transform (FFT) before being processed by RA-SHViT-Net. The model employs the Single-Head Vision Transformer (SHViT) to capture local and global features, balancing computational efficiency and predictive accuracy. To enhance feature extraction, the Adaptive Hybrid Attention Block (AHAB) integrates channel and spatial attention mechanisms. The network architecture includes Depthwise Convolution, Single-Head Self-Attention, Residual Feed-Forward Networks (Res-FFN), and AHAB modules, ensuring robust feature representation and mitigating gradient vanishing issues. Evaluation on the Case Western Reserve University and Paderborn University datasets demonstrates the RA-SHViT-Net's superior accuracy and robustness in complex, noisy environments. Ablation studies further validate the contributions of individual components, establishing RA-SHViT-Net as an effective tool for early fault detection and classification, promoting efficient maintenance strategies in industrial settings. Keywords: rolling bearings, fault diagnosis, Vision Transformer, attention mechanism, noisy environments, Fast Fourier Transform (FFT)
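A minimal sketch of the FFT preprocessing step described above, assuming fixed-length vibration segments and a one-sided magnitude spectrum as the network input; the segment length, Hann windowing, and normalization are illustrative choices, not taken from the paper.

import numpy as np

def fft_magnitude(segment, n_fft=1024):
    # segment: 1D vibration signal window of at least n_fft samples;
    # returns the one-sided magnitude spectrum fed to the network
    windowed = np.hanning(n_fft) * segment[:n_fft]
    spectrum = np.fft.rfft(windowed)
    return np.abs(spectrum) / n_fft  # simple amplitude normalization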
Authors: Libin Liu, Shen Chen, Sen Jia, Jingzhe Shi, Zhongyu Jiang, Can Jin, Wu Zongkai, Jenq-Neng Hwang, Lei Li
Abstract: Spatial intelligence is foundational to AI systems that interact with the physical world, particularly in 3D scene generation and spatial comprehension. Current methodologies for 3D scene generation often rely heavily on predefined datasets, and struggle to adapt dynamically to changing spatial relationships. In this paper, we introduce \textbf{GraphCanvas3D}, a programmable, extensible, and adaptable framework for controllable 3D scene generation. Leveraging in-context learning, GraphCanvas3D enables dynamic adaptability without the need for retraining, supporting flexible and customizable scene creation. Our framework employs hierarchical, graph-driven scene descriptions, representing spatial elements as graph nodes and establishing coherent relationships among objects in 3D environments. Unlike conventional approaches, which are constrained in adaptability and often require predefined input masks or retraining for modifications, GraphCanvas3D allows for seamless object manipulation and scene adjustments on the fly. Additionally, GraphCanvas3D supports 4D scene generation, incorporating temporal dynamics to model changes over time. Experimental results and user studies demonstrate that GraphCanvas3D enhances usability, flexibility, and adaptability for scene generation. Our code and models are available on the project website: https://github.com/ILGLJ/Graph-Canvas.
Authors: Feiyang Huang
Abstract: Image captioning is a technique that translates image content into natural language descriptions. Many application scenarios, such as intelligent search engines and assistive tools for visually impaired individuals, involve images containing people. As a result, datasets often have a high proportion of images featuring people. However, this data imbalance can lead to overfitting. The model may perform poorly when generating descriptions for images without people and may even produce irrelevant descriptions (hallucinations). To address this issue, increasing the diversity of the dataset could be an effective solution. However, acquiring high-quality image-text pairs is costly. Reducing overfitting without altering the dataset can significantly save resources. To tackle this challenge, we propose a target-aware prompting strategy. This method extracts object information from images using an object detector and integrates this information into the model through a fusion module. This helps the model generate descriptions with additional references (\textbf{OFCap}). Moreover, this strategy is model-agnostic. Pretrained models can be used with frozen parameters during training, further reducing computational costs. We conducted experiments on the COCO and nocaps datasets. The results demonstrate that this strategy effectively mitigates overfitting and significantly improves the quality of image descriptions.
Authors: Maitreya Patel, Song Wen, Dimitris N. Metaxas, Yezhou Yang
Abstract: Diffusion models (DMs) excel in photorealism, image editing, and solving inverse problems, aided by classifier-free guidance and image inversion techniques. However, rectified flow models (RFMs) remain underexplored for these tasks. Existing DM-based methods often require additional training, lack generalization to pretrained latent models, underperform, and demand significant computational resources due to extensive backpropagation through ODE solvers and inversion processes. In this work, we first develop a theoretical and empirical understanding of the vector field dynamics of RFMs in efficiently guiding the denoising trajectory. Our findings reveal that we can navigate the vector field in a deterministic and gradient-free manner. Utilizing this property, we propose FlowChef, which leverages the vector field to steer the denoising trajectory for controlled image generation tasks, facilitated by gradient skipping. FlowChef is a unified framework for controlled image generation that, for the first time, simultaneously addresses classifier guidance, linear inverse problems, and image editing without the need for extra training, inversion, or intensive backpropagation. Finally, we perform extensive evaluations and show that FlowChef significantly outperforms baselines in terms of performance, memory, and time requirements, achieving new state-of-the-art results. Project Page: \url{https://flowchef.github.io}.
Authors: Pragati Shuddhodhan Meshram, Swetha Karthikeyan, Bhavya, Suma Bhat
Abstract: Multi-modal Large Language Models (MLLMs) are gaining significant attention for their ability to process multi-modal data, providing enhanced contextual understanding of complex problems. MLLMs have demonstrated exceptional capabilities in tasks such as Visual Question Answering (VQA); however, they often struggle with fundamental engineering problems, and there is a scarcity of specialized datasets for training on topics like digital electronics. To address this gap, we propose a benchmark dataset called ElectroVizQA specifically designed to evaluate MLLMs' performance on digital electronic circuit problems commonly found in undergraduate curricula. This dataset, the first of its kind tailored for the VQA task in digital electronics, comprises approximately 626 visual questions, offering a comprehensive overview of digital electronics topics. This paper rigorously assesses the extent to which MLLMs can understand and solve digital electronic circuit questions, providing insights into their capabilities and limitations within this specialized domain. By introducing this benchmark dataset, we aim to motivate further research and development in the application of MLLMs to engineering education, ultimately bridging the performance gap and enhancing the efficacy of these models in technical fields.
Authors: Guangyao Zheng, Michael A. Jacobs, Vishwa S. Parekh
Abstract: Self-supervised foundation models have recently been successfully extended to encode three-dimensional (3D) computed tomography (CT) images, with excellent performance across several downstream tasks, such as intracranial hemorrhage detection and lung cancer risk forecasting. However, as self-supervised models learn from complex data distributions, questions arise concerning whether these embeddings capture demographic information, such as age, sex, or race. Using the National Lung Screening Trial (NLST) dataset, which contains 3D CT images and demographic data, we evaluated a range of classifiers: softmax regression, linear regression, linear support vector machine, random forest, and decision tree, to predict sex, race, and age of the patients in the images. Our results indicate that the embeddings effectively encoded age and sex information, with a linear regression model achieving a root mean square error (RMSE) of 3.8 years for age prediction and a softmax regression model attaining an AUC of 0.998 for sex classification. Race prediction was less effective, with an AUC of 0.878. These findings suggest a detailed exploration into the information encoded in self-supervised learning frameworks is needed to help ensure fair, responsible, and patient privacy-protected healthcare AI.
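The probing setup lends itself to standard scikit-learn models; the snippet below is a self-contained toy version with synthetic embeddings standing in for the NLST CT embeddings, only to illustrate the linear-probing protocol (it does not reproduce any reported numbers, and the data split and model choices are assumptions).

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))        # stand-in for 3D-CT foundation-model embeddings
age = X @ rng.normal(size=64) + 60    # synthetic targets, for illustration only
sex = (X[:, 0] > 0).astype(int)

# linear regression probe for age, logistic ("softmax" for two classes) probe for sex
age_pred = LinearRegression().fit(X[:400], age[:400]).predict(X[400:])
rmse = float(np.sqrt(np.mean((age_pred - age[400:]) ** 2)))

sex_prob = LogisticRegression(max_iter=1000).fit(X[:400], sex[:400]).predict_proba(X[400:])[:, 1]
auc = roc_auc_score(sex[400:], sex_prob)
print(f"age RMSE: {rmse:.2f} years, sex AUC: {auc:.3f}")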
Authors: Yinjie Zhao, Heng Zhao, Bihan Wen, Yew-Soon Ong, Joey Tianyi Zhou
Abstract: The rapid development of AI models has led to a growing emphasis on enhancing their capabilities for complex input data such as videos. While large-scale video datasets have been introduced to support this growth, the unique challenges of reducing redundancies in video \textbf{sets} have not been explored. Compared to image datasets or individual videos, video \textbf{sets} have a two-layer nested structure, where the outer layer is the collection of individual videos, and the inner layer contains the correlations among frame-level data points to provide temporal information. Video \textbf{sets} have two dimensions of redundancies: within-sample and inter-sample redundancies. Existing methods such as key frame selection, dataset pruning, or dataset distillation do not address the unique challenge of video sets, since they aim at reducing redundancies in only one of the dimensions. In this work, we are the first to study Video Set Distillation, which synthesizes optimized video data by jointly addressing within-sample and inter-sample redundancies. Our Information Diversification and Temporal Densification (IDTD) method jointly reduces redundancies across both dimensions. This is achieved through a Feature Pool and Feature Selectors mechanism to preserve inter-sample diversity, alongside a Temporal Fusor that maintains temporal information density within synthesized videos. Our method achieves state-of-the-art results in Video Dataset Distillation, paving the way for more effective redundancy reduction and efficient AI model training on video datasets.
Authors: Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang
Abstract: Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion-part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results reveal BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
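A hedged sketch of what the Partial Occlusion idea could look like in code: whole body-part token streams are zeroed out with some probability during training so that parts cannot rely too strongly on one another. The tensor layout, drop probability, and function name are assumptions, not the paper's implementation.

import torch

def partial_occlusion(part_tokens, p=0.3):
    # part_tokens: (batch, n_parts, seq_len, dim); drop entire part streams with probability p
    keep = (torch.rand(part_tokens.shape[:2], device=part_tokens.device) > p).float()
    return part_tokens * keep[:, :, None, None]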
Authors: Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo
Abstract: Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planning (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. The SceneTAP utilizes chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.
Authors: Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu
Abstract: Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce \textbf{OpenHumanVid}, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos. The source code and the dataset are available at: \href{https://fudan-generative-vision.github.io/OpenHumanVid}{https://fudan-generative-vision.github.io/OpenHumanVid}.
URLs: https://fudan-generative-vision.github.io/OpenHumanVid
Authors: Yang Liu, Jiale Du, Xinbo Gao, Jungong Han
Abstract: Sketch-based image retrieval (SBIR) relies on free-hand sketches to retrieve natural photos within the same class. However, its practical application is limited by its inability to retrieve classes absent from the training set. To address this limitation, the task has evolved into Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR), where model performance is evaluated on unseen categories. Traditional SBIR primarily focuses on narrowing the domain gap between photo and sketch modalities. However, in the zero-shot setting, the model not only needs to address this cross-modal discrepancy but also requires a strong generalization capability to transfer knowledge to unseen categories. To this end, we propose a novel framework for ZS-SBIR that employs a pair-based relation-aware quadruplet loss to bridge feature gaps. By incorporating two negative samples from different modalities, the approach prevents positive features from becoming disproportionately distant from one modality while remaining close to another, thus enhancing inter-class separability. We also propose a Relation-Aware Meta-Learning Network (RAMLN) to obtain the margin, a hyper-parameter of the cross-modal quadruplet loss, to improve the generalization ability of the model. RAMLN leverages external memory to store feature information, which it utilizes to assign optimal margin values. Experimental results obtained on the extended Sketchy and TU-Berlin datasets show a sharp improvement over existing state-of-the-art methods in ZS-SBIR.
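For intuition, a simplified PyTorch version of a quadruplet loss with two negatives drawn from different modalities is sketched below; in the paper the margin is produced by RAMLN, whereas here it is a fixed scalar, and all names are illustrative rather than the authors' code.

import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, neg_photo, neg_sketch, margin=0.3):
    # anchor/positive form a cross-modal match; the two negatives come from different modalities
    d_ap = F.pairwise_distance(anchor, positive)
    loss = F.relu(d_ap - F.pairwise_distance(anchor, neg_photo) + margin) \
         + F.relu(d_ap - F.pairwise_distance(anchor, neg_sketch) + margin)
    return loss.mean()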
Authors: Yang Liu, Xinshuo Wang, Jiale Du, Xinbo Gao, Jungong Han
Abstract: Compositional Zero-Shot Learning (CZSL) recognizes new combinations by learning from known attribute-object pairs. However, the main challenge of this task lies in the complex interactions between attributes and object visual representations, which lead to significant differences in images. In addition, the long-tail label distribution in the real world makes the recognition task more complicated. To address these problems, we propose a novel method, named Hybrid Discriminative Attribute-Object Embedding (HDA-OE) network. To increase the variability of training data, HDA-OE introduces an attribute-driven data synthesis (ADDS) module. ADDS generates new samples with diverse attribute labels by combining multiple attributes of the same object. By expanding the attribute space in the dataset, the model is encouraged to learn and distinguish subtle differences between attributes. To further improve the discriminative ability of the model, HDA-OE introduces the subclass-driven discriminative embedding (SDDE) module, which enhances the subclass discriminative ability of the encoding by embedding subclass information in a fine-grained manner, helping to capture the complex dependencies between attributes and object visual features. The proposed model has been evaluated on three benchmark datasets, and the results verify its effectiveness and reliability.
Authors: Xuexiang Niu, Jinping Tang, Lei Wang, Ge Zhu
Abstract: Learning from feedback has been shown to enhance the alignment between text prompts and images in text-to-image diffusion models. However, due to the lack of focus in feedback content, especially regarding object type and quantity, these techniques struggle to accurately match text and images when faced with specified prompts. To address this issue, we propose an efficient fine-tuning method with specific reward objectives, comprising three stages. First, images generated by the diffusion model are passed through an object detector to obtain object categories and quantities. Meanwhile, confidence scores for category and quantity can be derived from the detection results and the given prompts. Next, we define a novel matching score, based on these confidences, to measure text-image alignment; it guides the model during feedback learning in the form of a reward function. Finally, we fine-tune the diffusion model by backpropagating the reward-function gradients to generate semantically related images. Different from previous feedback methods that focus more on overall matching, we place more emphasis on the accuracy of entity categories and quantities. Besides, we construct a text-to-image dataset for studying compositional generation, including 1.7K text-image pairs with diverse combinations of entities and quantities. Experimental results on this benchmark show that our model outperforms other SOTA methods in both alignment and fidelity. In addition, our model can also serve as a metric for evaluating text-image alignment in other models. All code and dataset are available at https://github.com/kingniu0329/Visions.
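One hypothetical way to turn detector outputs into a category-and-quantity matching score of the kind described above; the exact scoring rule in the paper may differ, and everything here (names, weighting, example) is an illustrative assumption.

def matching_score(detections, prompt_spec):
    # detections: list of (category, confidence) from an object detector
    # prompt_spec: {category: required_count} parsed from the prompt
    if not prompt_spec:
        return 0.0
    total = 0.0
    for category, need in prompt_spec.items():
        confs = sorted((c for cat, c in detections if cat == category), reverse=True)
        # reward confident detections of the right category; missing instances score zero
        total += sum(confs[:need]) / need if need > 0 else 0.0
    return total / len(prompt_spec)

# e.g., matching_score([("dog", 0.9), ("dog", 0.8)], {"dog": 2, "cat": 1}) == 0.425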
Authors: MinKyu Lee, Sangeek Hyun, Woojin Jun, Jae-Pil Heo
Abstract: This work tackles the fidelity objective in perceptual super-resolution~(SR). Specifically, we address the shortcomings of the pixel-level $L_\text{p}$ loss ($L_\text{pix}$) in the GAN-based SR framework. Since $L_\text{pix}$ is known to trade off against perceptual quality, prior methods often multiply it by a small scale factor or apply low-pass filters. However, this work shows that these circumventions fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of $L_\text{pix}$ that contributes to blurring, and 2) guiding only on the factor that is free from this trade-off. We show that both can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with $L_\text{pix}$. Building on this, we propose the Auto-Encoded Supervision for Optimal Penalization loss ($L_\text{AESOP}$), a novel loss function that measures distance in the AE space instead of the raw pixel space. Note that the AE space indicates the space after the decoder, not the bottleneck. By simply substituting $L_\text{pix}$ with $L_\text{AESOP}$, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Experimental results verify that AESOP leads to favorable results in the perceptual SR task.
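A minimal sketch of the loss substitution described above, assuming a frozen auto-encoder pretrained with the pixel loss; whether an L1 or L2 distance is used after the decoder, and the exact detaching scheme, are assumptions rather than the paper's specification.

import torch
import torch.nn.functional as F

def aesop_loss(sr, hr, autoencoder):
    # distance is measured in the space after the (frozen) auto-encoder's decoder,
    # not in raw pixel space and not at the bottleneck
    with torch.no_grad():
        hr_ae = autoencoder(hr)
    sr_ae = autoencoder(sr)        # gradients flow back through the SR output only
    return F.l1_loss(sr_ae, hr_ae)

# usage idea: replace the pixel-loss term in a GAN-based SR objective with aesop_loss(sr, hr, ae)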
Authors: Siqi Kou, Jiachun Jin, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng
Abstract: We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved contents. Unlike prior art on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation, while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus -- by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours). Orthus-base can further embrace post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines including Show-o and Chameleon across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting the potential for handling intricate practical generation tasks.
Authors: Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiaoyi Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, Li Yuan
Abstract: We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for generating desired high-resolution videos with long durations based on various user inputs. Our project comprises multiple components for the entire video generation process, including a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and various condition controllers. Moreover, many assistant strategies for efficient training and inference are designed, and a multi-dimensional data curation pipeline is proposed for obtaining desired high-quality data. Benefiting from these efficient designs, our Open-Sora Plan achieves impressive video generation results in both qualitative and quantitative evaluations. We hope our careful design and practical experience can inspire the video generation research community. All our codes and model weights are publicly available at \url{https://github.com/PKU-YuanGroup/Open-Sora-Plan}.
Authors: Friedhelm Hamann, Daniel Gehrig, Filbert Febryanto, Kostas Daniilidis, Guillermo Gallego
Abstract: Tracking any point (TAP) recently shifted the motion estimation paradigm from focusing on individual salient points with local templates to tracking arbitrary points with global image contexts. However, while research has mostly focused on driving the accuracy of models in nominal settings, addressing scenarios with difficult lighting conditions and high-speed motions remains out of reach due to the limitations of the sensor. This work addresses this challenge with the first event camera-based TAP method. It leverages the high temporal resolution and high dynamic range of event cameras for robust high-speed tracking, and the global contexts in TAP methods to handle asynchronous and sparse event measurements. We further extend the TAP framework to handle event feature variations induced by motion - thereby addressing an open challenge in purely event-based tracking - with a novel feature alignment loss which ensures the learning of motion-robust features. Our method is trained with data from a new data generation pipeline and systematically ablated across all design decisions. Our method shows strong cross-dataset generalization and performs 135% better on the average Jaccard metric than the baselines. Moreover, on an established feature tracking benchmark, it achieves a 19% improvement over the previous best event-only method and even surpasses the previous best events-and-frames method by 3.7%.
Authors: ShuaiHeng Li, Qing Cai, Fan Zhang, Menghuan Zhang, Yangyang Shu, Zhi Liu, Huafeng Li, Lingqiao Liu
Abstract: Self-supervised learning is emerging in fine-grained visual recognition with promising results. However, existing self-supervised learning methods are often susceptible to irrelevant patterns in self-supervised tasks and lack the capability to represent the subtle differences inherent in fine-grained visual recognition (FGVR), resulting in generally poorer performance. To address this, we propose a novel Priority-Perception Self-Supervised Learning framework, denoted as PP-SSL, which can effectively filter out irrelevant feature interference and extract more subtle discriminative features throughout the training process. Specifically, it comprises two main parts: the Anti-Interference Strategy (AIS) and the Image-Aided Distinction Module (IADM). In AIS, a fine-grained textual description corpus is established, and a knowledge distillation strategy is devised to guide the model in eliminating irrelevant features while enhancing the learning of more discriminative and high-quality features. IADM builds on the observation that GradCAM maps extracted from the original image effectively reveal subtle differences between fine-grained categories. Compared to features extracted from intermediate or output layers, the original image retains more detail, allowing for a deeper exploration of the subtle distinctions among fine-grained classes. Extensive experimental results indicate that PP-SSL significantly outperforms existing methods across various datasets, highlighting its effectiveness in fine-grained recognition tasks. Our code will be made publicly available upon publication.
Authors: Wenda Shi, Yiren Song, Dengming Zhang, Jiaming Liu, Xingxing Zou
Abstract: Visual text images are prevalent in various applications, requiring careful font selection and typographic choices. Recent advances in Diffusion Transformer (DiT)-based text-to-image (T2I) models show promise in automating these processes. However, these methods still face challenges such as inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline to address these issues by enhancing controllability over typography and style in text rendering. We introduce Typography Control (TC) finetuning, an efficient parameter fine-tuning method, and enclosing typography control tokens (ETC-tokens), which enable precise word-level application of typographic features. To further enhance style control, we present a Style Control Adapter (SCA) that injects style information through image inputs independent of text prompts. Through comprehensive experiments, we demonstrate the effectiveness of our approach in achieving superior word-level typographic control, font consistency, and style consistency in Basic and Artistic Text Rendering (BTR and ATR) tasks. Our results mark a significant advancement in the precision and adaptability of T2I models, presenting new possibilities for creative applications and design-oriented tasks.
Authors: Sihang Li, Siqi Tan, Bowen Chang, Jing Zhang, Chen Feng, Yiming Li
Abstract: Visual localization, which estimates a camera's pose within a known scene, is a long-standing challenge in vision and robotics. Recent end-to-end methods that directly regress camera poses from query images have gained attention for fast inference. However, existing methods often struggle to generalize to unseen views. In this work, we aim to unleash the power of data synthesis to promote the generalizability of pose regression. Specifically, we lift real 2D images into 3D Gaussian Splats with varying appearance and deblurring abilities, which are then used as a data engine to synthesize more posed images. To fully leverage the synthetic data, we build a two-branch joint training pipeline, with an adversarial discriminator to bridge the syn-to-real gap. Experiments on established benchmarks show that our method outperforms state-of-the-art end-to-end approaches, reducing translation and rotation errors by 50% and 21.6% on indoor datasets, and 35.56% and 38.7% on outdoor datasets. We also validate the effectiveness of our method in dynamic driving scenarios under varying weather conditions. Notably, as data synthesis scales up, our method exhibits a growing ability to interpolate and extrapolate training data for localizing unseen views. Project Page: https://ai4ce.github.io/RAP/
Authors: Muhammad Huzaifa, Yova Kementchedjhieva
Abstract: Text-to-image retrieval is a critical task for managing diverse visual content, but common benchmarks for the task rely on small, single-domain datasets that fail to capture real-world complexity. Pre-trained vision-language models tend to perform well with easy negatives but struggle with hard negatives--visually similar yet incorrect images--especially in open-domain scenarios. To address this, we introduce Episodic Few-Shot Adaptation (EFSA), a novel test-time framework that adapts pre-trained models dynamically to a query's domain by fine-tuning on top-k retrieved candidates and synthetic captions generated for them. EFSA improves performance across diverse domains while preserving generalization, as shown in evaluations on queries from eight highly distinct visual domains and an open-domain retrieval pool of over one million images. Our work highlights the potential of episodic few-shot adaptation to enhance robustness in the critical and understudied task of open-domain text-to-image retrieval.
Authors: Yihao Luo
Abstract: In the field of data-driven 3D shape analysis and generation, the estimation of global topological features from localized representations such as point clouds, voxels, and neural implicit fields is a longstanding challenge. This paper introduces a novel, differentiable algorithm tailored to accurately estimate the global topology of 3D shapes, overcoming the limitations of traditional methods rooted in mesh reconstruction and topological data analysis. The proposed method ensures high accuracy, efficiency, and instant computation with GPU compatibility. It begins with an efficient calculation of the self-adjoint Weingarten map for point clouds and its adaptations for other modalities. The curvatures are then extracted, and their integration over tangent differentiable Voronoi elements is utilized to estimate key topological invariants, including the Euler number and genus. Additionally, an auto-optimization mechanism is implemented to refine the local moving frames and area elements based on the integrity of topological invariants. Experimental results demonstrate the method's superior performance across various datasets. The robustness and differentiability of the algorithm ensure its seamless integration into deep learning frameworks, offering vast potential for downstream tasks in 3D shape analysis.
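For context, the invariants mentioned above are tied to integrated curvature by the Gauss-Bonnet theorem, and the paper's discrete curvature integration can be read as approximating it: for a closed orientable surface $S$, $\chi(S) = \frac{1}{2\pi}\int_S K \, dA$ and $g = \frac{2 - \chi(S)}{2}$, where $K$ is the Gaussian curvature, $\chi$ the Euler number, and $g$ the genus. (The connection to the paper's Voronoi-element integration is an interpretation, not a claim from the abstract.)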
Authors: Chancharik Mitra, Brandon Huang, Tianning Chai, Zhiqiu Lin, Assaf Arbelle, Rogerio Feris, Leonid Karlinsky, Trevor Darrell, Deva Ramanan, Roei Herzig
Abstract: Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1\% of the heads) in LMMs as strong features for VL tasks. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
Authors: Rocco Manz Maruzzelli, Basile Lewandowski, Lydia Y. Chen
Abstract: Diffusion models (DMs) generate remarkably high-quality images via a stochastic denoising process, which unfortunately incurs high sampling time. Post-quantizing the trained diffusion models at fixed bit-widths, e.g., 4 bits for weights and 8 bits for activations, has been shown to be effective in accelerating sampling while maintaining image quality. Motivated by the observation that the cross-layer dependency of DMs varies across layers and sampling steps, we propose a mixed-precision quantization scheme, MPQ-Diff, which allocates different bit-widths to the weights and activations of different layers. We advocate using the cross-layer correlation of a given layer, termed the network orthogonality metric, as a proxy for the relative importance of a layer at each sampling step. We further adopt a uniform sampling scheme to avoid the excessive profiling overhead of estimating orthogonality across all time steps. We evaluate the proposed mixed-precision scheme on LSUN and ImageNet, showing significant FID improvements from 65.73 to 15.39 and from 52.66 to 14.93, respectively, compared to fixed-precision quantization.
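An illustrative allocation rule (not the paper's) showing how a per-layer orthogonality score could be mapped to mixed bit-widths, under the assumption that higher-scoring layers deserve more bits:

import numpy as np

def allocate_bits(scores, bits=(4, 6, 8)):
    # rank layers by orthogonality score and split them into equal groups,
    # assigning more bits to higher-scoring (assumed more important) layers
    order = np.argsort(scores)                 # indices from lowest to highest score
    groups = np.array_split(order, len(bits))
    alloc = np.empty(len(scores), dtype=int)
    for b, idx in zip(bits, groups):
        alloc[idx] = b
    return alloc

# e.g., allocate_bits([0.2, 0.9, 0.5, 0.1]) -> array([4, 8, 6, 4])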
Authors: Karran Pandey, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy J. Mitra, Paul Guerrero
Abstract: Predicting diverse object motions from a single static image remains challenging, as current video generation models often entangle object movement with camera motion and other scene changes. While recent methods can predict specific motions from motion arrow input, they rely on synthetic data and predefined motions, limiting their application to complex scenes. We introduce Motion Modes, a training-free approach that explores a pre-trained image-to-video generator's latent distribution to discover various distinct and plausible motions focused on selected objects in static images. We achieve this by employing a flow generator guided by energy functions designed to disentangle object and camera motion. Additionally, we use an energy inspired by particle guidance to diversify the generated motions, without requiring explicit training data. Experimental results demonstrate that Motion Modes generates realistic and varied object animations, surpassing previous methods and even human predictions regarding plausibility and diversity. Project Webpage: https://motionmodes.github.io/
Authors: Yeonguk Yu, Minhwan Ko, Sungho Shin, Kangmin Kim, Kyoobin Lee
Abstract: Deep neural networks have demonstrated remarkable performance in various vision tasks, but their success heavily depends on the quality of the training data. Noisy labels are a critical issue in medical datasets and can significantly degrade model performance. Previous clean sample selection methods have not utilized the well pre-trained features of vision foundation models (VFMs) and assumed that training begins from scratch. In this paper, we propose CUFIT, a curriculum fine-tuning paradigm of VFMs for medical image classification under label noise. Our method is motivated by the fact that linear probing of VFMs is relatively unaffected by noisy samples, as it does not update the feature extractor of the VFM, thus robustly classifying the training samples. Subsequently, curriculum fine-tuning of two adapters is conducted, starting with clean sample selection from the linear probing phase. Our experimental results demonstrate that CUFIT outperforms previous methods across various medical image benchmarks. Specifically, our method surpasses previous baselines by 5.0%, 2.1%, 4.6%, and 5.8% at a 40% noise rate on the HAM10000, APTOS-2019, BloodMnist, and OrgancMnist datasets, respectively. Furthermore, we provide extensive analyses to demonstrate the impact of our method on noisy label detection. For instance, our method shows higher label precision and recall compared to previous approaches. Our work highlights the potential of leveraging VFMs in medical image classification under challenging conditions of noisy labels.
Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath
Abstract: Document Visual Question Answering (VQA) requires models to interpret textual information within complex visual layouts and comprehend spatial relationships to answer questions based on document images. Existing approaches often lack interpretability and fail to precisely localize answers within the document, hindering users' ability to verify responses and understand the reasoning process. Moreover, standard metrics like Average Normalized Levenshtein Similarity (ANLS) focus on text accuracy but overlook spatial correctness. We introduce DLaVA, a novel method that enhances Multimodal Large Language Models (MLLMs) with answer localization capabilities for Document VQA. Our approach integrates image annotation directly into the MLLM pipeline, improving interpretability by enabling users to trace the model's reasoning. We present both OCR-dependent and OCR-free architectures, with the OCR-free approach eliminating the need for separate text recognition components, thus reducing complexity. To the best of our knowledge, DLaVA is the first approach to introduce answer localization within multimodal QA, marking a significant step forward in enhancing user trust and reducing the risk of AI hallucinations. Our contributions include enhancing interpretability and reliability by grounding responses in spatially annotated visual content, introducing answer localization in MLLMs, proposing a streamlined pipeline that combines an MLLM with a text detection module, and conducting comprehensive evaluations using both textual and spatial accuracy metrics, including Intersection over Union (IoU). Experimental results on standard datasets demonstrate that DLaVA achieves SOTA performance, significantly enhancing model transparency and reliability. Our approach sets a new benchmark for Document VQA, highlighting the critical importance of precise answer localization and model interpretability.
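Since spatial correctness is scored with Intersection over Union, a plain IoU helper for axis-aligned answer boxes is sketched below; the (x1, y1, x2, y2) coordinate convention is an assumption.

def iou(box_a, box_b):
    # boxes in (x1, y1, x2, y2) format
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0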
Authors: Kunyang Han, Yibo Hu, Mengxue Qu, Hailin Shi, Yao Zhao, Yunchao Wei
Abstract: Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, limiting free-form category self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm, integrating the prediction result from the previous step with the textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.
Authors: Vadim Pryadilshchikov, Alexander Markin, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, Evgeny Burnaev
Abstract: We propose a novel framework to remove transient objects from input videos for 3D scene reconstruction using Gaussian Splatting. Our framework consists of the following steps. In the first step, we propose an unsupervised training strategy for a classification network to distinguish between transient objects and static scene parts based on their different training behavior inside the 3D Gaussian Splatting reconstruction. In the second step, we improve the boundary quality and stability of the detected transients by combining our results from the first step with an off-the-shelf segmentation method. We also propose a simple and effective strategy to track objects in the input video forward and backward in time. Our results show an improvement over the current state of the art in existing sparsely captured datasets and significant improvements in a newly proposed densely captured (video) dataset. More results and code are available at https://transient-3dgs.github.io.
Authors: Taesung Kwon, Jong Chul Ye
Abstract: In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Building on recent advancements in spatio-temporal optimization for video inverse problems using image diffusion models, our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution. To address the high computational demands of processing high-resolution frames, we introduce a pseudo-batch consistent sampling strategy, allowing efficient operation on a single GPU. Additionally, to improve temporal consistency, we present batch-consistent inversion, an initialization technique that incorporates informative latents from the measurement frame. By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting. Unlike previous methods, our approach supports multiple aspect ratios (landscape, vertical, and square) and delivers HD-resolution reconstructions (exceeding 1280x720) in under 2.5 minutes on a single NVIDIA 4090 GPU. Project page: https://vision-xl.github.io/.
Authors: Fuqiang Zhao, Yijing Guo, Siyuan Yang, Xi Chen, Luo Wang, Lan Xu, Yingliang Zhang, Yujiao Shi, Jingyi Yu
Abstract: High-quality 3D urban reconstruction is essential for applications in urban planning, navigation, and AR/VR. However, capturing detailed ground-level data across cities is labor-intensive and raises significant privacy concerns related to sensitive information, such as vehicle plates, faces, and other personal identifiers. To address these challenges, we propose AerialGo, a novel framework that generates realistic walking-through city views from aerial images, leveraging multi-view diffusion models to achieve scalable, photorealistic urban reconstructions without direct ground-level data collection. By conditioning ground-view synthesis on accessible aerial data, AerialGo bypasses the privacy risks inherent in ground-level imagery. To support model training, we introduce the AerialGo dataset, a large-scale dataset containing diverse aerial and ground-view images, paired with camera and depth information, designed to support generative urban reconstruction. Experiments show that AerialGo significantly enhances ground-level realism and structural coherence, providing a privacy-conscious, scalable solution for city-scale 3D modeling.
Authors: Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, Tat-Seng Chua
Abstract: Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio-temporal compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves. Specifically, we first induce Spatio-Temporal Scene Graph (STSG) representations of diverse videos to capture fine-grained, multi-granular video semantics. The STSGs then guide the derivation of multi-step reasoning Question-Answer (QA) data with Chain-of-Thought (CoT) rationales. Both answers and rationales are integrated as the training objective, aiming to enhance the model's reasoning abilities through supervision over explicit reasoning steps. Experimental results demonstrate the effectiveness of STEP across models of varying scales, with a significant 21.3\% improvement in tasks requiring three or more reasoning steps. Furthermore, it achieves superior performance with a minimal amount of self-generated rationale-enriched training samples on both compositional reasoning and comprehensive understanding benchmarks, highlighting its broad applicability and potential.
Authors: Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu
Abstract: Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal responses (speech and motion) based on the user's multimodal input to drive the character for social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets, to address the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.
Authors: Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata
Abstract: Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection - the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: the leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone we can separate real and fake samples almost perfectly. As a consequence, previous audio-only and audio-video models exploit the presence of silence in the fake videos and perform worse when the leading silence is removed. To avoid latching onto such unwanted artifacts, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
Authors: Hui Ren, Joanna Materzynska, Rohit Gandikota, David Bau, Antonio Torralba
Abstract: We explore the question: "How much prior art knowledge is needed to create art?" To investigate this, we propose a text-to-image generation model trained without access to art-related content. We then introduce a simple yet effective method to learn an art adapter using only a few examples of selected artistic styles. Our experiments show that art generated using our method is perceived by users as comparable to art produced by models trained on large, art-rich datasets. Finally, through data attribution techniques, we illustrate how examples from both artistic and non-artistic datasets contributed to the creation of new artistic styles.
Authors: Xiaoyan Xing, Konrad Groh, Sezer Karagolu, Theo Gevers, Anand Bhattad
Abstract: We introduce LumiNet, a novel architecture that leverages generative models and latent intrinsic representations for effective lighting transfer. Given a source image and a target lighting image, LumiNet synthesizes a relit version of the source scene that captures the target's lighting. Our approach makes two key contributions: a data curation strategy from the StyleGAN-based relighting model for our training, and a modified diffusion-based ControlNet that processes both latent intrinsic properties from the source image and latent extrinsic properties from the target image. We further improve lighting transfer through a learned adaptor (MLP) that injects the target's latent extrinsic properties via cross-attention and fine-tuning. Unlike traditional ControlNet, which generates images with conditional maps from a single scene, LumiNet processes latent representations from two different images - preserving geometry and albedo from the source while transferring lighting characteristics from the target. Experiments demonstrate that our method successfully transfers complex lighting phenomena including specular highlights and indirect illumination across scenes with varying spatial layouts and materials, outperforming existing approaches on challenging indoor scenes using only images as input.
Authors: Michele De Vita, Vasileios Belagiannis
Abstract: Despite the remarkable progress in generative modelling, current diffusion models lack a quantitative approach to assess image quality. To address this limitation, we propose to estimate the pixel-wise aleatoric uncertainty during the sampling phase of diffusion models and utilise the uncertainty to improve the sample generation quality. The uncertainty is computed as the variance of the denoising scores with a perturbation scheme that is specifically designed for diffusion models. We then show that the aleatoric uncertainty estimates are related to the second-order derivative of the diffusion noise distribution. We evaluate our uncertainty estimation algorithm and the uncertainty-guided sampling on the ImageNet and CIFAR-10 datasets. In our comparisons with the related work, we demonstrate promising results in filtering out low quality samples. Furthermore, we show that our guided approach leads to better sample generation in terms of FID scores.
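To make the perturbation idea concrete, here is a minimal PyTorch sketch of estimating pixel-wise uncertainty as the variance of denoising scores under small input perturbations; the model handle eps_model, its call signature, the perturbation scale, and the number of perturbations are illustrative assumptions rather than the authors' exact scheme.

import torch

def pixelwise_score_variance(eps_model, x_t, t, n_perturb=8, sigma=1e-2):
    # Sketch: perturb the noisy sample several times, query the denoiser,
    # and use the per-pixel variance of the predicted noise as an uncertainty proxy.
    scores = []
    for _ in range(n_perturb):
        x_pert = x_t + sigma * torch.randn_like(x_t)   # small Gaussian perturbation
        with torch.no_grad():
            scores.append(eps_model(x_pert, t))        # predicted noise / score (assumed API)
    scores = torch.stack(scores, dim=0)                # (n_perturb, B, C, H, W)
    return scores.var(dim=0)                           # pixel-wise variance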
Authors: Aaron Bateni
Abstract: In recent years, Spiking Neural Networks (SNNs) have garnered significant interest due to their temporal understanding capabilities. This work introduces, to the best of our knowledge, the first Cortical Column-like hybrid architecture for time-series data classification that leverages SNNs, drawing inspiration from both the structure of the brain and previous hybrid models. We introduce several encoding methods to use with this model and develop a procedure for training the network. To make these models easier to use, we make all implementations publicly available.
Authors: Junbo Jacob Lian
Abstract: Twisted Convolutional Networks (TCNs) are introduced as a novel neural network architecture designed to effectively process one-dimensional data with arbitrary feature order and minimal spatial relationships. Unlike traditional Convolutional Neural Networks (CNNs), which excel at handling structured two-dimensional data like images, TCNs reduce dependency on feature order by combining input features in innovative ways to create new representations. By explicitly enhancing feature interactions and employing diverse feature combinations, TCNs generate richer and more informative representations, making them especially effective for classification tasks on datasets with arbitrary feature arrangements. This paper details the TCN architecture and its feature combination strategy, providing a comprehensive comparison with traditional CNNs, DeepSets, Transformers, and Graph Neural Networks (GNNs). Extensive experiments on benchmark datasets demonstrate that TCNs achieve superior performance, particularly in classification scenarios involving one-dimensional data.
Authors: Shaoxiang Wang, Yaxu Xie, Chun-Peng Chang, Christen Millerdurai, Alain Pagani, Didier Stricker
Abstract: Neural implicit fields have recently emerged as a powerful representation for multi-view surface reconstruction due to their simplicity and state-of-the-art performance. However, reconstructing thin structures in indoor scenes while ensuring real-time performance remains a challenge for dense visual SLAM systems. Previous methods do not account for the varying quality of input RGB-D data and employ a fixed-frequency mapping process to reconstruct the scene, which can result in the loss of valuable information in some frames. In this paper, we propose Uni-SLAM, a decoupled 3D spatial representation based on hash grids for indoor reconstruction. We introduce a newly defined predictive uncertainty to reweight the loss function, along with strategic local-to-global bundle adjustment. Experiments on synthetic and real-world datasets demonstrate that our system achieves state-of-the-art tracking and mapping accuracy while maintaining real-time performance. It significantly improves over current methods with a 25% reduction in depth L1 error and a 66.86% completion rate within 1 cm on the Replica dataset, reflecting a more accurate reconstruction of thin structures. Project page: https://shaoxiang777.github.io/project/uni-slam/
Authors: Simon Mielke, Anthony Stein
Abstract: Animal excretions in the form of urine puddles and feces are a significant source of emissions in livestock farming. Automated detection of soiled floors in barns can contribute to improved management processes, and the derived information can also be used to model emission dynamics. Previous research approaches to determining the puddle area require manual detection of the puddle in the barn. While humans can detect animal excretions on thermal images of a livestock barn, automated approaches using thresholds fail due to other objects of the same temperature, such as the animals themselves. In addition, various parameters such as the type of housing, animal species, age, sex, weather and unknown factors can influence the type and shape of excretions. Due to this heterogeneity, a method for automated detection of excretions must be not only accurate but also robust to varying conditions. These requirements can be met by using contemporary deep learning models from the field of artificial intelligence. This work is the first to investigate the suitability of different deep learning models for the detection of excretions in pigsties, comparing established convolutional architectures with recent transformer-based approaches. The detection models Faster R-CNN, YOLOv8, DETR and DAB-DETR are compared and statistically assessed on two training datasets representing two pig houses. We apply a method derived from nested cross-validation and report the results in terms of eight common detection metrics. Our work demonstrates that all investigated deep learning models are generally suitable for reliably detecting excretions, with an average precision of over 90%. The models also show robustness on out-of-distribution data that differs from the conditions in the training data, albeit with expected slight decreases in overall detection performance.
Authors: Feng Xu, David Ahmedt-Aristizabal, Peterson Lars, Dadong Wang, Xun Li
Abstract: Facial expression recognition (FER) systems raise significant privacy concerns due to the potential exposure of sensitive identity information. This paper presents a study on removing identity information while preserving FER capabilities. Drawing on the observation that low-frequency components predominantly contain identity information and high-frequency components capture expression, we propose a novel two-stream framework that applies privacy enhancement to each component separately. We introduce a controlled privacy enhancement mechanism to optimize performance and a feature compensator to enhance task-relevant features without compromising privacy. Furthermore, we propose a novel privacy-utility trade-off, providing a quantifiable measure of privacy preservation efficacy in closed-set FER tasks. Extensive experiments on the benchmark CREMA-D dataset demonstrate that our framework achieves 78.84% recognition accuracy with a privacy (facial identity) leakage ratio of only 2.01%, highlighting its potential for secure and reliable video-based FER applications.
Authors: Judy X Yang, Jing Wang, Zekun Long, Chenhong Sui, Jun Zhou
Abstract: Classifying hyperspectral images (HSIs) is a complex task in remote sensing due to the high-dimensional nature and volume of data involved. To address these challenges, we propose the Spectral-Spatial Linear (SS Linear) Model, a novel framework that significantly reduces data volume while enhancing classification accuracy. Our model employs a bidirectional reversed convolutional neural network (CNN) to efficiently extract spectral features, complemented by a specialized block for spatial feature analysis. This hybrid approach leverages the operational efficiency of CNNs and incorporates dynamic feature extraction inspired by attention mechanisms, optimizing performance without the high computational demands typically associated with transformer-based models. The SS Linear Model is designed to process hyperspectral data bidirectionally, achieving notable classification and efficiency improvements by fusing spectral and spatial features effectively. This approach yields superior classification accuracy compared to existing benchmarks while maintaining computational efficiency, making it suitable for resource-constrained environments. We validate the SS Linear Model on three widely recognized datasets, Houston 2013, Indian Pines, and Pavia University, demonstrating its ability to outperform current state-of-the-art models in HSI classification and efficiency. This work highlights the innovative methodology of the SS Linear Model and its practical benefits for remote sensing applications, where both data efficiency and classification accuracy are critical. For further details, please refer to our code repository on GitHub: HSILinearModel.
Authors: Avirath Sundaresan, Jason R. Parham, Jonathan Crall, Rosemary Warungu, Timothy Muthami, Margaret Mwangi, Jackson Miliko, Jason Holmberg, Tanya Y. Berger-Wolf, Daniel Rubenstein, Charles V. Stewart, Sara Beery
Abstract: In both 2016 and 2018, a census of the highly-endangered Grevy's zebra population was enabled by the Great Grevy's Rally (GGR), a citizen science event that produces population estimates via expert and algorithmic curation of volunteer-captured images. A complementary, scalable, and long-term Grevy's population monitoring approach involves deploying camera trap networks. However, in both scenarios, a substantial majority of zebra images are not usable for individual identification due to poor in-the-wild imaging conditions; camera trap images in particular present high rates of occlusion and high spatio-temporal similarity within image bursts. Our proposed filtering pipeline incorporates animal detection, species identification, viewpoint estimation, quality evaluation, and temporal subsampling to obtain individual crops suitable for re-ID, which are subsequently curated by the LCA decision management algorithm. Our method processed images taken during GGR-16 and GGR-18 in Meru County, Kenya, into 4,142 highly-comparable annotations, requiring only 120 contrastive human decisions to produce a population estimate within 4.6% of the ground-truth count. Our method also efficiently processed 8.9M unlabeled camera trap images from 70 cameras at the Mpala Research Centre in Laikipia County, Kenya over two years into 685 encounters of 173 individuals, requiring only 331 contrastive human decisions.
Authors: Judy X Yang, Jing Wang, Chen Hong Sui, Zekun Long, Jun Zhou
Abstract: The integration of hyperspectral imaging (HSI) and LiDAR data within new linear feature spaces offers a promising solution to the challenges posed by the high dimensionality and redundancy inherent in HSIs. This study introduces a dual linear fused space framework that capitalizes on bidirectional reversed convolutional neural network (CNN) pathways, coupled with a specialized spatial analysis block. This approach combines the computational efficiency of CNNs with the adaptability of attention mechanisms, facilitating the effective fusion of spectral and spatial information. The proposed method not only enhances data processing and classification accuracy, but also mitigates the computational burden typically associated with advanced models such as Transformers. Evaluations on the Houston 2013 dataset demonstrate that our approach surpasses existing state-of-the-art models. This advancement underscores the potential of the framework in resource-constrained environments and its significant contributions to the field of remote sensing.
Authors: Yizhi Song, Liu He, Zhifei Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Zhe Lin, Brian Price, Scott Cohen, Jianming Zhang, Daniel Aliaga
Abstract: Personalized image generation has emerged from recent advancements in generative models. However, these generated personalized images often suffer from localized artifacts, such as incorrect logos, which reduce the fidelity and fine-grained identity details of the generated results. Furthermore, there is little prior work tackling this problem. To help improve these identity details in personalized image generation, we introduce a new task: reference-guided artifacts refinement. We present Refine-by-Align, a first-of-its-kind model that employs a diffusion-based framework to address this challenge. Our model consists of two stages: an Alignment Stage and a Refinement Stage, which share the weights of a unified neural network model. Given a generated image, a masked artifact region, and a reference image, the alignment stage identifies and extracts the corresponding regional features in the reference, which are then used by the refinement stage to fix the artifacts. Our model-agnostic pipeline requires no test-time tuning or optimization. It automatically enhances image fidelity and reference identity in the generated image, generalizing well to existing models on various tasks including but not limited to customization, generative compositing, view synthesis, and virtual try-on. Extensive experiments and comparisons demonstrate that our pipeline greatly pushes the boundary of fine details in image synthesis models.
Authors: Feiyang Liu, Dan Guo, Jingyuan Xu, Zihao He, Shengeng Tang, Kun Li, Meng Wang
Abstract: Following the gaze of other people and analyzing the target they are looking at can help us understand what they are thinking and doing, and predict the actions that may follow. Existing methods for gaze following struggle to perform well in natural scenes with diverse objects and focus on gaze points rather than objects, making it difficult to deliver clear semantics and the accurate scope of the targets. To address this shortcoming, we propose a novel gaze target prediction solution named GazeSeg that can fully utilize the spatial visual field of the person as guiding information and lead to a progressively coarse-to-fine gaze target segmentation and recognition process. Specifically, a prompt-based visual foundation model serves as the encoder, working in conjunction with three distinct decoding modules (i.e. FoV perception, heatmap generation, and segmentation) to form the framework for gaze target prediction. Then, with the head bounding box serving as an initial prompt, GazeSeg obtains the FoV map, heatmap, and segmentation map progressively, leading to a unified framework for multiple tasks (i.e. direction estimation, gaze target segmentation, and recognition). In particular, to facilitate this research, we construct and release a new dataset, comprising 72k images with pixel-level annotations and 270 categories of gaze targets, built upon the GazeFollow dataset. The quantitative evaluation shows that our approach achieves a Dice score of 0.325 in gaze target segmentation and 71.7% top-5 recognition accuracy. Meanwhile, our approach also outperforms previous state-of-the-art methods, achieving 0.953 AUC on the gaze-following task. The dataset and code will be released.
Authors: Junli Deng, Yihao Luo
Abstract: Dynamic scene rendering has taken a leap forward with the rise of 4D Gaussian Splatting, but there's still one elusive challenge: how to make 3D Gaussians move through time as naturally as they would in the real world, all while keeping the motion smooth and consistent. In this paper, we unveil a fresh approach that blends state-space modeling with Wasserstein geometry, paving the way for a more fluid and coherent representation of dynamic scenes. We introduce a State Consistency Filter that merges prior predictions with the current observations, enabling Gaussians to stay true to their way over time. We also employ Wasserstein distance regularization to ensure smooth, consistent updates of Gaussian parameters, reducing motion artifacts. Lastly, we leverage Wasserstein geometry to capture both translational motion and shape deformations, creating a more physically plausible model for dynamic scenes. Our approach guides Gaussians along their natural way in the Wasserstein space, achieving smoother, more realistic motion and stronger temporal coherence. Experimental results show significant improvements in rendering quality and efficiency, outperforming current state-of-the-art techniques.
Authors: Meihan Wu, Tao Chang, Cui Miao, Jie Zhou, Chun Li, Xiangyu Xu, Ming Li, Xiaodong Wang
Abstract: Federated learning research has recently shifted from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) due to their superior capacity. ViT training demands higher computational resources because ViTs lack the 2D inductive biases inherent in CNNs. However, efficient federated training of ViTs on resource-constrained edge devices remains unexplored in the community. In this paper, we propose EFTViT, a hierarchical federated framework that leverages masked images to enable efficient, full-parameter training on resource-constrained edge devices, offering substantial benefits for learning on heterogeneous data. In general, we patchify images and randomly mask a portion of the patches, observing that excluding them from training has minimal impact on performance while substantially reducing computation costs and enhancing data content privacy protection. Specifically, EFTViT comprises a series of lightweight local modules and a larger global module, updated independently on the clients and the central server, respectively. The local modules are trained on masked image patches, while the global module is trained on intermediate patch features uploaded from the local clients, balanced through a proposed median sampling strategy that erases information about client data distributions to protect privacy. We analyze the computational complexity and privacy protection of EFTViT. Extensive experiments on popular benchmarks show that EFTViT achieves up to 28.17% accuracy improvement, reduces local training computational cost by up to 2.8$\times$, and cuts local training time by up to 4.4$\times$ compared to existing methods.
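As a rough illustration of the patchify-and-mask step described above (not the authors' implementation; the patch size and mask ratio are assumed values), a short PyTorch sketch:

import torch

def mask_patches(images, patch=16, mask_ratio=0.5):
    # Patchify a batch of images and randomly drop a fraction of the patches.
    # images: (B, C, H, W) with H and W divisible by `patch`.
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)          # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)   # (B, N, C*p*p)
    N = x.shape[1]
    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]                     # random patches to keep
    kept = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
    return kept, idx                                                    # visible patches + their indices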
Authors: Hana Satou, Alan Mitkiy
Abstract: The convergence of cross-modal adversarial learning and physics-driven methods represents a cutting-edge direction for tackling challenges in complex multi-modal tasks and scientific computing. This review focuses on systematically analyzing how these two approaches can be synergistically integrated to enhance performance and robustness across diverse application domains. By addressing key obstacles such as modality discrepancies, limited data availability, and insufficient model robustness, this paper highlights the role of physics-based optimization frameworks in facilitating efficient and interpretable adversarial perturbation generation. The review also explores significant advancements in cross-modal adversarial learning, including applications in tasks such as image cross-modal retrieval (e.g., infrared and RGB matching), scientific computing (e.g., solving partial differential equations), and optimization under physical consistency constraints in vision systems. By examining theoretical foundations and experimental outcomes, this study demonstrates the potential of combining these approaches to handle complex scenarios and improve the security of multi-modal systems. Finally, we outline future directions, proposing a novel framework that unifies physical principles with adversarial optimization, providing a pathway for researchers to develop robust and adaptable cross-modal learning methods with both theoretical and practical significance.
Authors: Wei Zhou, Lei Zhao, Runyu Zhang, Yifan Cui, Hongpu Huang, Kun Qie, Chen Wang
Abstract: Traffic Surveillance Systems (TSS) have become increasingly crucial in modern intelligent transportation systems, with vision-based technologies playing a central role for scene perception and understanding. While existing surveys typically focus on isolated aspects of TSS, a comprehensive analysis bridging low-level and high-level perception tasks, particularly considering emerging technologies, remains lacking. This paper presents a systematic review of vision-based technologies in TSS, examining both low-level perception tasks (object detection, classification, and tracking) and high-level perception applications (parameter estimation, anomaly detection, and behavior understanding). Specifically, we first provide a detailed methodological categorization and comprehensive performance evaluation for each task. Our investigation reveals five fundamental limitations in current TSS: perceptual data degradation in complex scenarios, data-driven learning constraints, semantic understanding gaps, sensing coverage limitations and computational resource demands. To address these challenges, we systematically analyze five categories of potential solutions: advanced perception enhancement, efficient learning paradigms, knowledge-enhanced understanding, cooperative sensing frameworks and efficient computing frameworks. Furthermore, we evaluate the transformative potential of foundation models in TSS, demonstrating their unique capabilities in zero-shot learning, semantic understanding, and scene generation. This review provides a unified framework bridging low-level and high-level perception tasks, systematically analyzes current limitations and solutions, and presents a structured roadmap for integrating emerging technologies, particularly foundation models, to enhance TSS capabilities.
Authors: Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
Abstract: It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.
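A minimal sketch of how attribute-enriched prompts could be encoded into per-class text embeddings with an off-the-shelf CLIP text encoder; the class names and descriptions below are hand-written placeholders standing in for LLM output, and the checkpoint choice is an assumption rather than the paper's configuration.

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative attribute-rich descriptions per class (color, shape/size, texture/material).
prompts = {
    "zebra": ["a zebra, black and white striped coat, horse-like body",
              "a zebra, short coarse fur, stocky four-legged shape"],
    "bus":   ["a bus, large rectangular metal vehicle, rows of windows",
              "a bus, painted metal surface, long boxy shape on wheels"],
}

class_embeds = {}
with torch.no_grad():
    for cls, texts in prompts.items():
        tok = tokenizer(texts, padding=True, return_tensors="pt")
        feats = model.get_text_features(**tok)             # (num_prompts, dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize
        class_embeds[cls] = feats.mean(dim=0)              # average over descriptions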
Authors: Chetan Madan, Mayuna Gupta, Soumen Basu, Pankaj Gupta, Chetan Arora
Abstract: We focus on the problem of Gallbladder Cancer (GBC) detection from Ultrasound (US) images. The problem presents unique challenges to modern Deep Neural Network (DNN) techniques due to low image quality arising from noise, textures, and viewpoint variations. Tackling such challenges requires precise localization by the DNN to identify the discerning features for the downstream malignancy prediction. While several techniques have been proposed in recent years for the problem, all of these methods employ complex custom architectures. Inspired by the success of foundational models for natural image tasks, along with the use of adapters to fine-tune such models for custom tasks, we investigate the merit of one such design, ViT-Adapter, for the GBC detection problem. We observe that ViT-Adapter relies predominantly on a primitive CNN-based spatial prior module to inject the localization information via cross-attention, which is inefficient for our problem due to the small pathology sizes and the variability in their appearance arising from the non-regular structure of the malignancy. In response, we propose LQ-Adapter, a modified adapter design for ViT, which improves localization information by leveraging learnable content queries over the basic spatial prior module. Our method surpasses existing approaches, enhancing the mean IoU (mIoU) scores by 5.4%, 5.8%, and 2.7% over ViT-Adapters, DINO, and FocalNet-DINO, respectively, on the US image-based GBC detection dataset, establishing a new state-of-the-art (SOTA). Additionally, we validate the applicability and effectiveness of LQ-Adapter on the Kvasir-Seg dataset for polyp detection from colonoscopy images. The superior performance of our design on this problem as well showcases its capability to handle diverse medical imaging tasks across different datasets. Code is released at https://github.com/ChetanMadan/LQ-Adapter
Authors: Zhaoxing Gan, Guangnan Ye
Abstract: Layout Generation aims to synthesize plausible arrangements from given elements. Currently, the predominant methods in layout generation are Generative Adversarial Networks (GANs) and diffusion models, each presenting its own set of challenges. GANs typically struggle with handling discrete data due to their requirement for differentiable generated samples and have historically circumvented the direct generation of discrete labels by treating them as fixed conditions. Conversely, diffusion-based models, despite achieving state-of-the-art performance across several metrics, require extensive sampling steps which lead to significant time costs. To address these limitations, we propose \textbf{DogLayout} (\textbf{D}en\textbf{o}ising Diffusion \textbf{G}AN \textbf{Layout} model), which integrates a diffusion process into GANs to enable the generation of discrete label data and significantly reduce diffusion's sampling time. Experiments demonstrate that DogLayout considerably reduces sampling costs by up to 175 times and cuts overlap from 16.43 to 9.59 compared to existing diffusion models, while also surpassing GAN based and other layout methods. Code is available at https://github.com/deadsmither5/DogLayout.
Authors: Zehao Li, Wenwei Han, Yujun Cai, Hao Jiang, Baolong Bi, Shuqin Gao, Honglong Zhao, Zhaoqi Wang
Abstract: While 3D Gaussian Splatting enables high-quality real-time rendering, existing Gaussian-based frameworks for 3D semantic segmentation still face significant challenges in boundary recognition accuracy. To address this, we propose a novel 3DGS-based framework named GradiSeg, incorporating Identity Encoding to construct a deeper semantic understanding of scenes. Our approach introduces two key modules: Identity Gradient Guided Densification (IGD) and Local Adaptive K-Nearest Neighbors (LA-KNN). The IGD module supervises gradients of Identity Encoding to refine Gaussian distributions along object boundaries, aligning them closely with boundary contours. Meanwhile, the LA-KNN module employs position gradients to adaptively establish locality-aware propagation of Identity Encodings, preventing irregular Gaussian spreads near boundaries. We validate the effectiveness of our method through comprehensive experiments. Results show that GradiSeg effectively addresses boundary-related issues, significantly improving segmentation accuracy without compromising scene reconstruction quality. Furthermore, our method's robust segmentation capability and decoupled Identity Encoding representation make it highly suitable for various downstream scene editing tasks, including 3D object removal, swapping and so on.
Authors: Yatian Pang, Bin Zhu, Bin Lin, Mingzhe Zheng, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan
Abstract: In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs. Existing approaches struggle with generating coherent, high-quality content in an efficient and user-friendly manner. Concretely, baseline methods relying on only 2D pose guidance lack the cues of 3D information, leading to suboptimal results, while methods using 3D representation as guidance achieve higher quality but involve a cumbersome and time-intensive process. To address these limitations, DreamDance enriches 3D geometry cues from 2D poses by introducing an efficient diffusion model, enabling high-quality human image animation with various guidance. Our key insight is that human images naturally exhibit multiple levels of correlation, progressing from coarse skeleton poses to fine-grained geometry cues, and further from these geometry cues to explicit appearance details. Capturing such correlations could enrich the guidance signals, facilitating intra-frame coherency and inter-frame consistency. Specifically, we construct the TikTok-Dance5K dataset, comprising 5K high-quality dance videos with detailed frame annotations, including human pose, depth, and normal maps. Next, we introduce a Mutually Aligned Geometry Diffusion Model to generate fine-grained depth and normal maps for enriched guidance. Finally, a Cross-domain Controller incorporates multi-level guidance to animate human images effectively with a video diffusion model. Extensive experiments demonstrate that our method achieves state-of-the-art performance in animating human images.
Authors: Daizong Liu, Yunbo Tao, Pan Zhou, Wei Hu
Abstract: With the maturity of depth sensors in various 3D safety-critical applications, 3D point cloud models have been shown to be vulnerable to adversarial attacks. Almost all existing 3D attackers simply follow the white-box or black-box setting to iteratively update coordinate perturbations based on back-propagated or estimated gradients. However, these methods are hard to deploy in real-world scenarios (no model details are provided) as they severely rely on parameters or output logits of victim models. To this end, we propose point cloud attacks from a more practical setting, i.e., hard-label black-box attack, in which attackers can only access the prediction label of 3D input. We introduce a novel 3D attack method based on a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples. In particular, we first construct a class-aware model decision boundary, by developing a learnable spectrum-fusion strategy to adaptively fuse point clouds of different classes in the spectral domain, aiming to craft their intermediate samples without distorting the original geometry. Then, we devise an iterative coordinate-spectrum optimization method with curvature-aware boundary search to move the intermediate sample along the decision boundary for generating adversarial point clouds with trivial perturbations. Experiments demonstrate that our attack competitively outperforms existing white/black-box attackers in terms of attack performance and adversary quality.
Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Sung-Lin Tsai, Yi-Lun Wu, Hong-Han Shuai
Abstract: In this study, we aim to identify and address the deficiency of Stable Diffusion Inpainting (SDI) in following the instructions of both the prompt and the mask. Due to the training bias introduced by masking, inpainting quality suffers when the prompt instruction and the image condition are not related. We therefore conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layers. We observe that adapting text key tokens toward the input mask enables the model to selectively paint within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and image condition. By increasing the latent mask value and modifying the frequency of the image condition, we align the cross-attention features with the model's training bias to improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, yielding CLIP-score improvements of up to 60% for SDI and 58% for SDXLI.
Authors: Sharva Gogawale, Madhura Deshpande, Parteek Kumar, Irad Ben-Gal
Abstract: In recent times, online education and the usage of video-conferencing platforms have experienced massive growth. Due to the limited scope of a virtual classroom, it may become difficult for instructors to analyze learners' attention and comprehension in real time while teaching. In the digital mode of education, it would be beneficial for instructors to have an automated feedback mechanism to be informed regarding learners' attentiveness at any given time. This research presents a novel computer vision-based approach to analyze and quantify learners' attentiveness, engagement, and other affective states within online learning scenarios. This work presents the development of a multiclass multioutput classification method using convolutional neural networks on a publicly available dataset - DAiSEE. A machine learning-based algorithm is developed on top of the classification model that outputs a comprehensive attentiveness index of the learners. Furthermore, an end-to-end pipeline is proposed through which learners' live video feed is processed, providing detailed attentiveness analytics of the learners to the instructors. By comparing the experimental outcomes of the proposed method against those of previous methods, it is demonstrated that the proposed method exhibits better attentiveness detection than state-of-the-art methods. The proposed system is a comprehensive, practical, and real-time solution that is deployable and easy to use. The experimental results also demonstrate the system's efficiency in gauging learners' attentiveness.
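For intuition only, a toy sketch of turning per-frame affective-state probabilities into a single attentiveness score; the paper's index comes from a learned model, whereas the weights below are made-up placeholders.

import numpy as np

def attentiveness_index(probs, weights=None):
    # probs: dict mapping a DAiSEE-style affective state to its predicted probability
    # for one frame. The weights are illustrative, not the paper's learned mapping.
    weights = weights or {"engagement": 0.5, "confusion": -0.2,
                          "boredom": -0.2, "frustration": -0.1}
    raw = sum(weights[s] * probs.get(s, 0.0) for s in weights)
    return float(np.clip(raw + 0.5, 0.0, 1.0))   # shift and clip into [0, 1]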
Authors: Yuhai Wang
Abstract: We propose a View-Decoupled Transformer (VDT) framework to address viewpoint discrepancies in person re-identification (ReID), particularly between aerial and ground views. VDT decouples view-specific and view-independent features by leveraging meta and view tokens, processed through self-attention and subtractive separation. Additionally, we introduce a Visual Token Selector (VTS) module that dynamically selects the most informative tokens, reducing redundancy and enhancing efficiency. Our approach significantly improves retrieval performance on the AGPReID dataset, while maintaining computational efficiency similar to baseline models.
Authors: Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang
Abstract: In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, by relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale, messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm by updating both the diversity of the data and the alignment optimization. To obtain colorful data at low cost, we use image-to-text captioning to generate multiple texts for each image from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into a multi-branch design and propose multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across more than ten benchmarks indicate that our holistic CLIP significantly outperforms the existing myopic CLIP on tasks including image-text retrieval, open-vocabulary classification, and dense visual tasks.
Authors: Lukas Baumgärtner, Ronny Bergmann, Roland Herzog, Stephan Schmidt, Manuel Weiß
Abstract: We consider the problem of surface segmentation, where the goal is to partition a surface represented by a triangular mesh. The segmentation is based on the similarity of the normal vector field to a given set of label vectors. We propose a variational approach and compare two different regularizers, both based on a total variation measure. The first regularizer penalizes the total variation of the assignment function directly, while the second regularizer penalizes the total variation in the label space. In order to solve the resulting optimization problems, we use variations of the split Bregman (ADMM) iteration adapted to the problem at hand. While computationally more expensive, the second regularizer yields better results in our experiments, in particular it removes noise more reliably in regions of constant curvature.
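For reference, a sketch in our own notation of the two energies described above; the exact discretization, distance, and weighting used in the paper may differ.

% u : mesh faces -> probability simplex over K labels, n_i the face normal,
% \ell_k the given label vectors, d a distance on the sphere, \lambda > 0.
\begin{align*}
  E_1(u) &= \sum_{i} \sum_{k=1}^{K} u_k(i)\, d\big(n_i, \ell_k\big)
            \;+\; \lambda\, \mathrm{TV}(u), \\
  E_2(u) &= \sum_{i} \sum_{k=1}^{K} u_k(i)\, d\big(n_i, \ell_k\big)
            \;+\; \lambda\, \mathrm{TV}\Big(\textstyle\sum_{k} u_k\, \ell_k\Big),
\end{align*}
% i.e. the first penalizes the total variation of the assignment u itself,
% the second the total variation of the label-space image of u.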
Authors: Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang
Abstract: Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual tokens can be prohibitively expensive on resource-limited devices. Previous methods have identified redundancy in visual tokens within the Large Language Model (LLM) decoder layers and have mitigated this by pruning tokens using a pre-defined or fixed ratio, thereby reducing computational overhead. Nonetheless, we observe that the impact of pruning ratio varies across different LLM layers and instances (image-prompt pairs). Therefore, it is essential to develop a layer-wise and instance-wise vision token pruning strategy to balance computational cost and model performance effectively. We propose ATP-LLaVA, a novel approach that adaptively determines instance-specific token pruning ratios for each LLM layer. Specifically, we introduce an Adaptive Token Pruning (ATP) module, which computes the importance score and pruning threshold based on input instance adaptively. The ATP module can be seamlessly integrated between any two LLM layers with negligible computational overhead. Additionally, we develop a Spatial Augmented Pruning (SAP) strategy that prunes visual tokens with both token redundancy and spatial modeling perspectives. Our approach reduces the average token count by 75% while maintaining performance, with only a minimal 1.9% degradation across seven widely used benchmarks. The project page can be accessed via https://yxxxb.github.io/ATP-LLaVA-page/.
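A compact PyTorch sketch of instance-adaptive visual token pruning; the relevance score and the entropy-based budget heuristic are our assumptions, not the exact ATP module.

import torch

def adaptive_token_prune(vision_tokens, text_tokens, temperature=0.1):
    # vision_tokens: (N, D) visual tokens; text_tokens: (M, D) prompt tokens.
    v = vision_tokens / vision_tokens.norm(dim=-1, keepdim=True)
    t = text_tokens / text_tokens.norm(dim=-1, keepdim=True)
    importance = (v @ t.T).max(dim=-1).values              # relevance of each visual token to the prompt
    # Instance-dependent budget: keep more tokens when relevance is spread out,
    # fewer when a small subset of tokens dominates (illustrative heuristic).
    probs = torch.softmax(importance / temperature, dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    keep_ratio = (entropy / torch.log(torch.tensor(float(len(importance))))).clamp(0.1, 0.9)
    k = max(1, int(keep_ratio.item() * len(importance)))
    idx = importance.topk(k).indices
    return vision_tokens[idx], idx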
Authors: Atharva Deshpande, Kaushik Gopalan, Jeet Shah, Hrishikesh Simu
Abstract: This study explores the application of deep learning to rainfall prediction, leveraging Spinning Enhanced Visible and Infrared Imager (SEVIRI) High Rate Information Transmission (HRIT) data as input and Operational Program on the Exchange of weather RAdar information (OPERA) ground-radar reflectivity data as ground truth. We use the mean of four infrared frequency channels as the input. The radiance images are forecast up to 4 hours into the future using a dense optical flow algorithm. A conditional generative adversarial network (GAN) model is employed to transform the predicted radiance images into rainfall images, which are aggregated over the 4-hour forecast period to generate cumulative rainfall values. This model scored approximately 7.5 on the Continuous Ranked Probability Score (CRPS) in the Weather4Cast 2024 competition and placed 1st on the core challenge leaderboard.
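A minimal OpenCV sketch of the flow-based extrapolation step; the Farneback parameters, normalization, and backward-warping scheme are illustrative and not the competition configuration.

import cv2
import numpy as np

def forecast_frames(prev_frame, curr_frame, n_steps=16):
    # Extrapolate future radiance frames by repeatedly warping the latest frame
    # with the dense optical flow estimated between the two most recent observations.
    # prev_frame, curr_frame: 2D arrays (e.g. the mean of the IR channels).
    prev_u8 = cv2.normalize(prev_frame, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    curr_u8 = cv2.normalize(curr_frame, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # Farneback arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_u8, curr_u8, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_u8.shape
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    map_x = grid_x - flow[..., 0]   # backward warp: sample from where pixels came from
    map_y = grid_y - flow[..., 1]
    frames, frame = [], curr_u8
    for _ in range(n_steps):
        frame = cv2.remap(frame, map_x, map_y, cv2.INTER_LINEAR)
        frames.append(frame)
    return frames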
Authors: Weizhe Liu, Renshuai Tao, Hongguang Zhu, Yunda Sun, Yao Zhao, Yunchao Wei
Abstract: Prohibited item detection is crucial for ensuring public safety, yet current X-ray image-based detection methods often lack comprehensive data-driven exploration. This paper introduces a novel data augmentation approach tailored for prohibited item detection, leveraging unique characteristics inherent to X-ray imagery. Our method is motivated by observations of physical properties including: 1) X-ray Transmission Imagery: unlike reflected-light images, transmitted X-ray pixels represent composite information from multiple materials along the imaging path. 2) Material-based Pseudo-coloring: pseudo-color rendering in X-ray images correlates directly with material properties, aiding in material distinction. Building on this physical perspective, we propose a simple yet effective X-ray image augmentation technique, Background Mixup (BGM), for prohibited item detection in security screening contexts. The essence is to simulate rich X-ray backgrounds that induce the model to pay greater attention to the foreground. The approach introduces 1) contour information of baggage and 2) variation in material information into the original image via patch-level Mixup. Background Mixup is plug-and-play, parameter-free, and highly generalizable, and provides an effective solution to the limitations of classical visual augmentations in non-reflected-light imagery. When implemented with different high-performance detectors, our augmentation method consistently boosts performance across diverse X-ray datasets from various devices and environments. Extensive experimental results demonstrate that our approach surpasses strong baselines while maintaining similar training resources.
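As a rough sketch of patch-level background mixing (a simplified stand-in for the BGM recipe; patch size, mixing strength, and selection probability are assumed values):

import numpy as np

def background_mixup(image, background, patch=64, alpha=0.3, rng=None):
    # Blend patches from a baggage/background X-ray image into the training image.
    # image, background: float arrays of identical (H, W, C) shape, values in [0, 1].
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    H, W = image.shape[:2]
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            if rng.random() < 0.5:                 # mix only a random subset of patches
                lam = rng.uniform(0.0, alpha)      # mixing strength for this patch
                out[y:y+patch, x:x+patch] = (
                    (1 - lam) * image[y:y+patch, x:x+patch]
                    + lam * background[y:y+patch, x:x+patch]
                )
    return out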
Authors: Yutong Zhou, Masahiro Ryo
Abstract: We introduce AgriBench, the first agriculture benchmark designed to evaluate MultiModal Large Language Models (MM-LLMs) for agriculture applications. To further address the shortage of agriculture knowledge-based datasets, we propose MM-LUCAS, a multimodal agriculture dataset that includes 1,784 landscape images, segmentation masks, depth maps, and detailed annotations (geographical location, country, date, land cover and land use taxonomic details, quality scores, aesthetic scores, etc.), based on the Land Use/Cover Area Frame Survey (LUCAS) dataset, which contains comparable statistics on land use and land cover for the European Union (EU) territory. This work, which is still in progress, presents a groundbreaking perspective on advancing agriculture MM-LLMs and offers valuable insights for future developments and innovations in expert-knowledge-based MM-LLMs.
Authors: Ramin Mousa, Saeed Chamani, Mohammad Morsali, Mohammad Kazzazi, Parsa Hatami, Soroush Sarabi
Abstract: Skin cancer (SC) stands out as one of the most life-threatening forms of cancer, and its danger is amplified if it is not diagnosed and treated promptly. Early intervention is critical, as it allows for more effective treatment. In recent years, Deep Learning (DL) has emerged as a powerful tool for early detection and skin cancer diagnosis (SCD). Although DL shows promise for skin cancer diagnosis, ample scope remains for improving model efficiency and accuracy. This paper proposes a novel approach to skin cancer detection that combines optimization techniques with pre-trained networks and wavelet transformations. First, normalized images are passed through pre-trained networks such as DenseNet-121, Inception, Xception, and MobileNet to extract hierarchical features. The feature maps are then passed through a Discrete Wavelet Transform (DWT) layer to capture low- and high-frequency components, and a self-attention module is integrated to learn global dependencies between features and focus on the most relevant parts of the feature maps. The number of neurons and the optimization of the weight vectors are determined using three swarm-based optimization techniques: the Modified Gorilla Troops Optimizer (MGTO), Improved Gray Wolf Optimization (IGWO), and the FOX optimization algorithm. Evaluation results demonstrate that optimizing weight vectors with these algorithms can enhance diagnostic accuracy, making this a highly effective approach for SCD. The proposed method achieves top accuracies of 98.11% with the MobileNet + Wavelet + FOX and DenseNet + Wavelet + FOX combinations on the ISIC-2016 dataset and 97.95% with the Inception + Wavelet + MGTO combination on the ISIC-2017 dataset, improving accuracy by at least 1% over competing methods.
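To illustrate the wavelet step alone, a short sketch using PyWavelets; the wavelet choice is an assumption, and the surrounding networks, attention module, and swarm optimizers are not reproduced here.

import numpy as np
import pywt

def dwt_features(feature_map, wavelet="haar"):
    # Split a 2D feature map into low- and high-frequency components
    # with a single-level 2D discrete wavelet transform.
    cA, (cH, cV, cD) = pywt.dwt2(feature_map, wavelet)   # approximation + detail coefficients
    low = cA                                             # low-frequency component
    high = np.stack([cH, cV, cD], axis=0)                # horizontal/vertical/diagonal details
    return low, high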
Authors: Yu Wang, Xiaofei Zhou, Yichen Wang, Geyuan Zhang, Tianxing He
Abstract: With the significant advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Previous studies have highlighted VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, existing methods struggle against state-of-the-art VLMs like GPT-4o, due to the over-exposure of harmful content and lack of stealthy malicious guidance. In this work, we propose a novel jailbreak attack framework: Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML utilizes an encryption-decryption process across text and image modalities to mitigate over-exposure of malicious information. To align the model's output with malicious intent covertly, MML employs a technique called "evil alignment", framing the attack within a video game production scenario. Comprehensive experiments demonstrate MML's effectiveness. Specifically, MML jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench and 99.07% on HADES-Dataset. Our code is available at https://github.com/wangyu-ovo/MML.
Authors: Chenggang Yang, Yuang Shi, Wei Tsang Ooi
Abstract: Abstract representations of 3D scenes are essential in computer vision, supporting tasks like mapping, localization, and surface reconstruction. Line segments are commonly used to capture scene structure, but existing 3D reconstruction methods often face limitations, either from instability in 2D projections or noise in direct 3D data. This paper introduces LineGS, a method that integrates geometry-guided 3D line reconstruction with a 3D Gaussian splatting model to improve accuracy. By leveraging Gaussian point densities along scene edges, LineGS refines initial line segments, aligning them more closely with the scene's geometric features. Experiments confirm that this approach enhances the fit to 3D structures, providing an efficient and reliable abstract representation of 3D scenes.
Authors: Chade Li, Pengju Zhang, Yihong Wu
Abstract: 3D point cloud segmentation has a wide range of applications in areas such as autonomous driving, augmented reality, virtual reality and digital twins. Point cloud data collected in real scenes often contain small objects and categories with small sample sizes, which are difficult for existing networks to handle. In this regard, we propose a point cloud segmentation network that fuses density-aware local attention with global attention. The core idea is to increase the effective receptive field of each point while reducing the loss of information about small objects in dense areas. Specifically, we assign windows of different sizes to local areas of different densities and compute attention within each window. Furthermore, we treat each local area as an independent token for global attention over the entire input. A category-response loss is also proposed to balance the processing of objects of different categories and sizes. In particular, we add a fully connected layer in the middle of the network to predict the presence of object categories and construct a binary cross-entropy loss that responds to the categories present in the scene. In experiments, our method achieves competitive results on semantic segmentation and part segmentation tasks across several publicly available datasets. Experiments on point cloud data obtained from complex real-world scenes filled with tiny objects further validate the strong segmentation capability of our method for small objects and small-sample categories.
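As an illustration of the density-perception idea (the thresholds, window sizes, and the direction of the density-to-window mapping are our assumptions, not the paper's design), a small PyTorch sketch:

import torch

def density_window_sizes(points, k=16, sizes=(8, 16, 32)):
    # Assign a local attention window size to each point based on its density.
    # points: (N, 3) tensor of coordinates.
    d = torch.cdist(points, points)                                    # pairwise distances, (N, N)
    knn_dist = d.topk(k + 1, largest=False).values[:, 1:].mean(dim=1)  # mean distance to k nearest neighbors
    density = 1.0 / (knn_dist + 1e-8)
    lo, hi = density.quantile(0.33), density.quantile(0.66)
    win = torch.full((points.shape[0],), sizes[1])
    win[density <= lo] = sizes[2]   # sparse region -> larger window (bigger receptive field)
    win[density >= hi] = sizes[0]   # dense region -> smaller window (preserve small objects)
    return win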
Authors: Duo Zheng, Shijia Huang, Liwei Wang
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, in this paper, we propose a novel generalist model, i.e., Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. Additionally, we have implemented a maximum coverage sampling technique to optimize the balance between computational costs and performance efficiency. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
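A minimal sketch of greedy maximum-coverage frame selection, assuming each frame is summarized by the set of scene voxels it observes; the paper's exact coverage criterion may differ.

def max_coverage_sampling(frame_voxels, budget):
    # frame_voxels: list of sets, one set of visible voxel ids per video frame.
    # budget: number of frames to keep.
    covered, selected = set(), []
    for _ in range(min(budget, len(frame_voxels))):
        gains = [len(v - covered) for v in frame_voxels]   # newly covered voxels per frame
        best = max(range(len(frame_voxels)), key=lambda i: gains[i])
        if gains[best] == 0:
            break                                          # nothing new left to cover
        selected.append(best)
        covered |= frame_voxels[best]
    return selected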
Authors: Jona Ballé, Luca Versari, Emilien Dupont, Hyunjik Kim, Matthias Bauer
Abstract: Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today's commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to "generative" compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.
Authors: Amir Barda, Matheus Gadelha, Vladimir G. Kim, Noam Aigerman, Amit H. Bermano, Thibault Groueix
Abstract: We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in approximately 3 seconds, without the need for running an SDS type of optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor of training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.
Authors: Matyas Bohacek, Hany Farid
Abstract: AI-generated video generation continues its journey through the uncanny valley, producing content that is perceptually indistinguishable from reality. Intermixed with many exciting and creative applications are malicious applications that harm individuals, organizations, and democracies. We describe an effective and robust technique for distinguishing real from AI-generated human motion. This technique leverages a multi-modal semantic embedding, making it robust to the types of laundering that typically confound more low- to mid-level approaches. This method is evaluated against a custom-built dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage.
Authors: Daniel Kyselica, Marek Šuppa, Jiří Šilha, Roman Ďurikovič
Abstract: Space debris presents a critical challenge for the sustainability of future space missions, emphasizing the need for robust and standardized identification methods. However, a comprehensive benchmark for rocket body classification remains absent. This paper addresses this gap by introducing the RoBo6 dataset for rocket body classification based on light curves. The dataset, derived from the Mini Mega Tortora database, includes light curves for six rocket body classes: CZ-3B, Atlas 5 Centaur, Falcon 9, H-2A, Ariane 5, and Delta 4. With 5,676 training and 1,404 test samples, it addresses data inconsistencies using resampling, normalization, and filtering techniques. Several machine learning models were evaluated, including CNN and transformer-based approaches, with Astroconformer reporting the best performance. The dataset establishes a common benchmark for future comparisons and advancements in rocket body classification tasks.
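The resampling and normalization mentioned above can be illustrated with a small sketch (the grid length and z-score normalization are illustrative assumptions, not the dataset's exact pipeline): each light curve is interpolated onto a fixed-length uniform grid and standardized.

```python
import numpy as np

def preprocess_light_curve(t: np.ndarray, flux: np.ndarray, length: int = 512):
    t_uniform = np.linspace(t.min(), t.max(), length)   # resample to a uniform grid
    flux_uniform = np.interp(t_uniform, t, flux)
    flux_norm = (flux_uniform - flux_uniform.mean()) / (flux_uniform.std() + 1e-8)
    return t_uniform, flux_norm

t = np.sort(np.random.rand(300)) * 60.0                 # irregular timestamps (seconds)
flux = np.sin(t / 3.0) + 0.1 * np.random.randn(t.size)  # synthetic light curve
print(preprocess_light_curve(t, flux)[1].shape)          # (512,)
```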
Authors: Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Yuying Chen, Lihui Jiang, Bingbing Liu, Yingcong Chen
Abstract: Numerous recent video generation models, also known as world models, have demonstrated the ability to generate plausible real-world videos. However, many studies have shown that these models often produce motion that lacks logical or physical coherence. In this paper, we revisit video generation models and find that single-stage approaches struggle to produce high-quality results while maintaining coherent motion reasoning. To address this issue, we propose Motion Dreamer, a two-stage video generation framework. In Stage I, the model generates an intermediate motion representation, such as a segmentation map or depth map, based on the input image and motion conditions, focusing solely on the motion itself. In Stage II, the model uses this intermediate motion representation as a condition to generate a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, our approach allows for more accurate and physically plausible motion generation. We validate the effectiveness of our approach on the Physion dataset and in autonomous driving scenarios. For example, given a single push, our model can synthesize the sequential toppling of a set of dominoes. Similarly, by varying the movements of ego-cars, our model can produce different effects on other vehicles. Our work opens new avenues for creating models that can reason about physical interactions in a more coherent and realistic manner.
Authors: Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu
Abstract: Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process text tokens. However, the number of vision tokens grows quadratically with image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance within a given budget. We start with our main finding that the ranking of vision tokens sorted by attention score is similar in every layer except the first. Based on this, we assume that the number of essential top vision tokens does not increase across layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer, from shallow to deep. Interestingly, G-Search reaches the optimal reduction strategy under our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate popular MLLMs, e.g., LLaVA and InternVL2, by more than $2\times$ without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.
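The sigmoid-shaped schedule described above can be pictured with a small sketch; the parameterization below (steepness, midpoint, LLaVA-style 576 tokens over 32 layers) is an illustrative assumption, not the paper's exact P-Sigmoid.

```python
import math

def p_sigmoid_budget(layer: int, num_layers: int, total_tokens: int,
                     steepness: float = 10.0, midpoint: float = 0.5) -> int:
    """Number of vision tokens to keep at a layer (monotone decreasing with depth)."""
    x = layer / max(num_layers - 1, 1)                   # normalize depth to [0, 1]
    keep_frac = 1.0 / (1.0 + math.exp(steepness * (x - midpoint)))
    return max(1, int(round(keep_frac * total_tokens)))

budgets = [p_sigmoid_budget(layer, num_layers=32, total_tokens=576) for layer in range(32)]
print(budgets[:4], budgets[-4:])   # many tokens kept early, few kept in deep layers
```

At inference, one would keep only the top-scoring tokens (e.g., by attention score) up to each layer's budget.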
Authors: Michail Dontas, Yutong He, Naoki Murata, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov
Abstract: Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems with minimal assumptions. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Additionally, we propose a novel posterior sampling approach that combines effective operator initialization with iterative refinement, enabling LADiBI to operate without predefined operator forms. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems, on diverse target image distributions.
Authors: Alex Hanson, Allen Tu, Geng Lin, Vasu Singla, Matthias Zwicker, Tom Goldstein
Abstract: 3D Gaussian Splatting (3D-GS) is a recent 3D scene reconstruction technique that enables real-time rendering of novel views by modeling scenes as parametric point clouds of differentiable 3D Gaussians. However, its rendering speed and model size still present bottlenecks, especially in resource-constrained settings. In this paper, we identify and address two key inefficiencies in 3D-GS, achieving substantial improvements in rendering speed, model size, and training time. First, we optimize the rendering pipeline to precisely localize Gaussians in the scene, boosting rendering speed without altering visual fidelity. Second, we introduce a novel pruning technique and integrate it into the training pipeline, significantly reducing model size and training time while further raising rendering speed. Our Speedy-Splat approach combines these techniques to accelerate average rendering speed by a drastic $6.71\times$ across scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets with $10.6\times$ fewer primitives than 3D-GS.
Authors: Tingxu Han, Weisong Sun, Yanrong Hu, Chunrong Fang, Yonglong Zhang, Shiqing Ma, Tao Zheng, Zhenyu Chen, Zhenting Wang
Abstract: Text-to-image diffusion models have shown an impressive ability to generate high-quality images from input textual descriptions. However, concerns have been raised about the potential for these models to create content that infringes on copyrights or depicts disturbing subject matter. Removing specific concepts from these models is a promising potential solution to this problem. However, existing methods for concept removal do not work well in practical but challenging scenarios where concepts need to be continuously removed. Specifically, these methods lead to poor alignment between the text prompts and the generated image after the continuous removal process. To address this issue, we propose a novel approach called CCRT that includes a designed knowledge distillation paradigm. It constrains the text-image alignment behavior during the continuous concept removal process by using a set of text prompts generated through our genetic algorithm, which employs a designed fuzzing strategy. We conduct extensive experiments involving the removal of various concepts. The results evaluated through both algorithmic metrics and human studies demonstrate that our CCRT can effectively remove the targeted concepts in a continuous manner while maintaining the high generation quality (e.g., text-image alignment) of the model.
Authors: Shing-Hei Ho, Bao Thach, Minghan Zhu
Abstract: We propose a framework to edit real-world Lidar scans with novel object layouts while preserving a realistic background environment. Compared to the synthetic data generation frameworks where Lidar point clouds are generated from scratch, our framework focuses on new scenario generation in a given background environment, and our method also provides labels for the generated data. This approach ensures the generated data remains relevant to the specific environment, aiding both the development and the evaluation of algorithms in real-world scenarios. Compared with novel view synthesis, our framework allows the creation of counterfactual scenarios with significant changes in the object layout and does not rely on multi-frame optimization. In our framework, the object removal and insertion are supported by generative background inpainting and object point cloud completion, and the entire pipeline is built upon spherical voxelization, which realizes the correct Lidar projective geometry by construction. Experiments show that our framework generates realistic Lidar scans with object layout changes and benefits the development of Lidar-based self-driving systems.
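A hedged sketch of the spherical voxelization underpinning the pipeline above (bin resolutions are illustrative assumptions): each Lidar point is converted to range, azimuth, and elevation and binned, which reproduces the sensor's projective geometry by construction.

```python
import numpy as np

def spherical_voxelize(points: np.ndarray,
                       r_res=0.5, az_res=np.deg2rad(0.2), el_res=np.deg2rad(1.0)):
    """Map (N, 3) Cartesian Lidar points to integer (range, azimuth, elevation) bins."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)                                   # [-pi, pi]
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    voxel_idx = np.stack([
        (r / r_res).astype(int),
        ((azimuth + np.pi) / az_res).astype(int),
        ((elevation + np.pi / 2) / el_res).astype(int),
    ], axis=1)
    return voxel_idx                                             # one 3-vector per point

pts = np.random.randn(1000, 3) * 10.0
print(spherical_voxelize(pts)[:3])
```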
Authors: Qiyao Xue, Xiangyu Yin, Boyuan Yang, Wei Gao
Abstract: Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models lack the capability to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficient temporal modeling. Existing solutions are either data-driven or require extra model inputs, and they do not generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model's capability of video generation to out-of-distribution domains by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x and achieves a 35% improvement over T2V prompt enhancers. The source code is available at: https://github.com/pittisl/PhyT2V.
Authors: Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, Marco Pedersoli
Abstract: The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches tend to compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of inference-friendly task residuals, facilitating more robust adaptation. Empirically, we benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) data, achieving performance comparable to full fine-tuning while preserving the model's zero-shot capability. Our code is available at: https://github.com/heitorrapela/ModPrompt
Authors: Chensheng Peng, Ido Sobol, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu, Or Litany
Abstract: We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models have recently shown promise as powerful generative models for 3D data, including Gaussian splats; however, standard diffusion frameworks typically require the target signal and denoised signal to be in the same modality, which is challenging given the scarcity of 3D data. To overcome this, we propose a novel training strategy that decouples the denoised modality from the supervision modality. By using a deterministic model as a noisy teacher to create the noised signal and transitioning from single-step to multi-step denoising supervised by an image rendering loss, our approach significantly enhances performance compared to the deterministic teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat (3DGS) teachers with minimal adaptation; we demonstrate this by surpassing the performance of two different deterministic models as teachers, highlighting the potential generalizability of our framework. Our approach further incorporates a guidance mechanism to aggregate information from multiple views, enhancing reconstruction quality when more than one view is available. Experimental results on object-level and scene-level datasets demonstrate the effectiveness of our framework.
Authors: Yogesh Kulkarni, Pooyan Fazli
Abstract: Recent advances in vision-language models (VLMs) have significantly enhanced video understanding tasks. Instruction tuning (i.e., fine-tuning models on datasets of instructions paired with desired outputs) has been key to improving model performance. However, creating diverse instruction-tuning datasets is challenging due to high annotation costs and the complexity of capturing temporal information in videos. Existing approaches often rely on large language models to generate instruction-output pairs, which can limit diversity and lead to responses that lack grounding in the video content. To address this, we propose VideoSAVi (Self-Aligned Video Language Model), a novel self-training pipeline that enables VLMs to generate their own training data without extensive manual annotation. The process involves three stages: (1) generating diverse video-specific questions, (2) producing multiple candidate answers, and (3) evaluating these responses for alignment with the video content. This self-generated data is then used for direct preference optimization (DPO), allowing the model to refine its own high-quality outputs and improve alignment with video content. Our experiments demonstrate that even smaller models (0.5B and 7B parameters) can effectively use this self-training approach, outperforming previous methods and achieving results comparable to those trained on proprietary preference data. VideoSAVi shows significant improvements across multiple benchmarks: up to 28% on multi-choice QA, 8% on zero-shot open-ended QA, and 12% on temporal reasoning benchmarks. These results demonstrate the effectiveness of our self-training approach in enhancing video understanding while reducing dependence on proprietary models.
Authors: You Wu, Xiangyang Yang, Xucheng Wang, Hengzhou Ye, Dan Zeng, Shuiwang Li
Abstract: Harnessing low-light enhancement and domain adaptation, nighttime UAV tracking has made substantial strides. However, over-reliance on image enhancement, the scarcity of high-quality nighttime data, and neglect of the relationship between daytime and nighttime trackers hinder the development of an end-to-end trainable framework. Moreover, current CNN-based trackers have limited receptive fields, leading to suboptimal performance, while ViT-based trackers demand heavy computational resources due to their reliance on the self-attention mechanism. In this paper, we propose a novel pure Mamba-based tracking framework (MambaNUT) that employs a state space model with linear complexity as its backbone, incorporating a single-stream architecture that integrates feature learning and template-search coupling within Vision Mamba. We introduce an adaptive curriculum learning (ACL) approach that dynamically adjusts sampling strategies and loss weights, thereby improving the model's generalization ability. Our ACL is composed of two levels of curriculum schedulers: (1) a sampling scheduler that transforms the data distribution from imbalanced to balanced, as well as from easier (daytime) to harder (nighttime) samples; and (2) a loss scheduler that dynamically assigns weights based on data frequency and IoU. Exhaustive experiments on multiple nighttime UAV tracking benchmarks demonstrate that the proposed MambaNUT achieves state-of-the-art performance while requiring lower computational costs. The code will be available.
Authors: Hao Jin, Hengyuan Chang, Xiaoxuan Xie, Zhengyang Wang, Xusheng Du, Shaojun Hu, Haoran Xie
Abstract: Designing stylized cinemagraphs is challenging due to the difficulty of customizing complex and expressive flow motions. To achieve intuitive and detailed control of the generated cinemagraphs, freehand sketches can convey personalized design requirements better than text inputs alone. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial content generation and provides hand-drawn sketch controls for both spatial and motion cues. A latent diffusion model is adopted to generate target stylized landscape images along with realistic versions. Then, a pre-trained object detection model is utilized to segment and obtain masks for the flow regions. We propose a novel latent motion diffusion model to estimate the motion field in the fluid regions of the generated landscape images. The input motion sketches serve as conditions to control the generated vector fields in the masked fluid regions, together with the prompt. To synthesize the cinemagraph frames, pixels within the fluid regions are subsequently warped to their target locations at each timestep using a frame generator. The results verify that Sketch2Cinemagraph can generate high-fidelity and aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs. We showcase the advantages of Sketch2Cinemagraph through quantitative comparisons against state-of-the-art generation approaches.
Authors: Minghao Han, Dingkang Yang, Jiabei Cheng, Xukun Zhang, Linhao Qu, Zizhi Chen, Lihua Zhang
Abstract: Recent advancements in multimodal pre-training models have significantly advanced computational pathology. However, current approaches predominantly rely on visual-language models, which may impose limitations from a molecular perspective and lead to performance bottlenecks. Here, we introduce a Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training, enhancing the molecular awareness of pathology image representation learning. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings. Due to the scarcity of paired data, approximately 4 million entries of spatial transcriptomics gene expression were collected to train the gene encoder. By leveraging powerful pre-trained encoders, UMPIRE aligns the encoders across over 697K pathology image-gene expression pairs. The performance of UMPIRE is demonstrated across various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images. Our findings highlight the effectiveness of multimodal data integration and open new avenues for exploring computational pathology enhanced by molecular perspectives. The code and pre-trained weights are available at https://github.com/Hanminghao/UMPIRE.
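The image-gene alignment described above is in the spirit of CLIP-style contrastive training; a minimal sketch of such a symmetric InfoNCE objective (an assumption for illustration, not UMPIRE's actual implementation) is:

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(img_emb: torch.Tensor, gene_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Pull each image embedding toward its paired gene embedding, push away others."""
    img_emb = F.normalize(img_emb, dim=-1)
    gene_emb = F.normalize(gene_emb, dim=-1)
    logits = img_emb @ gene_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(paired_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```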
Authors: Yingjian Chen, Lei Zhang, Yakun Niu, Lei Tan, Pei Chen
Abstract: Diffusion Models enable realistic image generation, raising the risk of misinformation and eroding public trust. Currently, detecting images generated by unseen diffusion models remains challenging due to the limited generalization capabilities of existing methods. To address this issue, we rethink the effectiveness of pre-trained models trained on large-scale, real-world images. Our findings indicate that: 1) Pre-trained models can cluster the features of real images effectively. 2) Models with pre-trained weights can approximate an optimal generalization solution at a specific training step, but it is extremely unstable. Based on these facts, we propose a simple yet effective training method called Learning on Less (LoL). LoL utilizes a random masking mechanism to constrain the model's learning of the unique patterns specific to a certain type of diffusion model, allowing it to focus on less image content. This leverages the inherent strengths of pre-trained weights while enabling a more stable approach to optimal generalization, which results in the extraction of a universal feature that differentiates various diffusion-generated images from real images. Extensive experiments on the GenImage benchmark demonstrate the remarkable generalization capability of our proposed LoL. With just 1% training data, LoL significantly outperforms the current state-of-the-art, achieving a 13.6% improvement in average ACC across images generated by eight different models.
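A minimal sketch of the random-masking idea described above (patch size and mask ratio are illustrative assumptions, not the paper's settings): zero out a random subset of image patches so the classifier learns from less image content.

```python
import torch

def random_patch_mask(images: torch.Tensor, patch: int = 16, ratio: float = 0.75):
    """Zero out a random `ratio` of non-overlapping patches in each image."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=images.device) > ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * mask

imgs = torch.randn(4, 3, 224, 224)
print(random_patch_mask(imgs).shape)   # torch.Size([4, 3, 224, 224])
```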
Authors: Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Abstract: Visual explanations for object detectors are crucial for enhancing their reliability. Since object detectors identify and localize instances by assessing multiple features collectively, generating explanations that capture these collective contributions is critical. However, existing methods focus solely on individual pixel contributions, ignoring the collective contribution of multiple pixels. To address this, we propose a method for object detectors that considers the collective contribution of multiple pixels. Our approach leverages game-theoretic concepts, specifically Shapley values and interactions, to provide explanations. These explanations cover both bounding box generation and class determination, considering both individual and collective pixel contributions. Extensive quantitative and qualitative experiments demonstrate that the proposed method more accurately identifies important regions in detection results compared to current state-of-the-art methods. The code will be publicly available soon.
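The game-theoretic attribution above rests on Shapley values; the following is a generic Monte-Carlo Shapley estimator over image regions (a sketch, not the authors' estimator), where `score_fn` is a hypothetical callable wrapping a detector score such as the confidence of one predicted box when only the kept regions are visible.

```python
import random

def shapley_values(regions, score_fn, num_samples: int = 200):
    """Average marginal contribution of each region to score_fn over random orderings."""
    phi = {r: 0.0 for r in regions}
    for _ in range(num_samples):
        order = random.sample(regions, len(regions))   # random permutation
        kept = set()
        prev = score_fn(kept)
        for r in order:
            kept.add(r)
            cur = score_fn(kept)
            phi[r] += (cur - prev) / num_samples       # marginal contribution
            prev = cur
    return phi

# Toy example: the score is 1.0 only if regions 0 and 1 are both kept (an interaction),
# so regions 0 and 1 each receive roughly 0.5 and region 2 roughly 0.
print(shapley_values([0, 1, 2], lambda kept: float({0, 1} <= kept)))
```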
Authors: Yunpeng Bai, Qixing Huang
Abstract: Monocular Depth Estimation (MDE) is essential for applications like 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust MDE remains challenging due to noisy real-world data and distribution gaps in synthetic datasets. Existing methods often struggle with low efficiency, reduced accuracy, and lack of detail. To address this, we propose an efficient approach for leveraging diffusion priors and introduce FiffDepth, a framework that transforms diffusion-based image generators into a feedforward architecture for detailed depth estimation. By preserving key generative features and integrating the strong generalization capabilities of models like dinov2, FiffDepth achieves enhanced accuracy, stability, and fine-grained detail, offering a significant improvement in MDE performance across diverse real-world scenarios.
Authors: Jingwei Zhang, Anh Tien Nguyen, Xi Han, Vincent Quoc-Huy Trinh, Hong Qin, Dimitris Samaras, Mahdi S. Hosseini
Abstract: Efficiently modeling large 2D contexts is essential for various fields including Giga-Pixel Whole Slide Imaging (WSI) and remote sensing. Transformer-based models offer high parallelism but face challenges due to their quadratic complexity for handling long sequences. Recently, Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism, enabling effective and efficient modeling of wide context in 1D sequences. However, extending Mamba to vision tasks, which inherently involve 2D structures, results in spatial discrepancies due to the limitations of 1D sequence processing. On the other hand, current 2D SSMs inherently model 2D structures but suffer from prohibitively slow computation due to the lack of efficient parallel algorithms. In this work, we propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba, with a highly optimized hardware-aware operator, achieving both spatial continuity and computational efficiency. We validate the versatility of our approach on both WSIs and natural images. Extensive experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves up to $2.48\%$ in AUC, $3.11\%$ in F1 score, $2.47\%$ in accuracy and $5.52\%$ in C-index. Additionally, integrating our method with VMamba for natural imaging yields $0.5$ to $0.7$ improvements in mIoU on the ADE20k semantic segmentation dataset, and a $0.2\%$ accuracy improvement on the ImageNet-1K classification dataset. Our code is available at https://github.com/AtlasAnalyticsLab/2DMamba.
Authors: S M Jishanul Islam, Sahid Hossain Mustakim, Sadia Ahmmed, Md. Faiyaz Abdullah Sayeedi, Swapnil Khandoker, Syed Tasdid Azam Dhrubo, Nahid Hossain
Abstract: Anti-Muslim hate speech has emerged within memes, characterized by context-dependent and rhetorical messages using text and images that seemingly mimic humor but convey Islamophobic sentiments. This work presents a novel dataset and proposes a classifier based on the Vision-and-Language Transformer (ViLT) specifically tailored to identify anti-Muslim hate within memes by integrating both visual and textual representations. Our model leverages joint modal embeddings between meme images and incorporated text to capture nuanced Islamophobic narratives that are unique to meme culture, providing both high detection accuracy and interpretability.
Authors: Phu Pham, Damon Conover, Aniket Bera
Abstract: We present FlashSLAM, a novel SLAM approach that leverages 3D Gaussian Splatting for efficient and robust 3D scene reconstruction. Existing 3DGS-based SLAM methods often fall short in sparse view settings and during large camera movements due to their reliance on gradient descent-based optimization, which is both slow and inaccurate. FlashSLAM addresses these limitations by combining 3DGS with a fast vision-based camera tracking technique, utilizing a pretrained feature matching model and point cloud registration for precise pose estimation in under 80 ms (a 90% reduction in tracking time compared to SplaTAM) without costly iterative rendering. In sparse settings, our method achieves up to a 92% improvement in average tracking accuracy over previous methods. Additionally, it accounts for noise in depth sensors, enhancing robustness when using unspecialized devices such as smartphones. Extensive experiments show that FlashSLAM performs reliably across both sparse and dense settings, in synthetic and real-world environments. Evaluations on benchmark datasets highlight its superior accuracy and efficiency, establishing FlashSLAM as a versatile and high-performance solution for SLAM, advancing the state-of-the-art in 3D reconstruction across diverse applications.
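The pose estimation step above combines feature matching with point cloud registration; the standard closed-form building block for aligning matched 3D points is the Kabsch/Umeyama rigid alignment, sketched below as a generic algorithm (not FlashSLAM's code).

```python
import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Return R (3x3), t (3,) minimizing sum ||R @ src_i + t - dst_i||^2."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)                 # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Synthetic check: recover a known 30-degree rotation plus translation.
src = np.random.randn(50, 3)
angle = np.deg2rad(30)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.1, -0.2, 0.3])
R, t = rigid_align(src, dst)
print(np.allclose(R, R_true, atol=1e-6))   # True
```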
Authors: Tongshun Zhang, Pingping Liu, Ming Zhao, Haotian Lv
Abstract: In the Fourier frequency domain, luminance information is primarily encoded in the amplitude component, while spatial structure information is significantly contained within the phase component. Existing low-light image enhancement techniques using Fourier transform have mainly focused on amplifying the amplitude component and simply replicating the phase component, an approach that often leads to color distortions and noise issues. In this paper, we propose a Dual-Stage Multi-Branch Fourier Low-Light Image Enhancement (DMFourLLIE) framework to address these limitations by emphasizing the phase component's role in preserving image structure and detail. The first stage integrates structural information from infrared images to enhance the phase component and employs a luminance-attention mechanism in the luminance-chrominance color space to precisely control amplitude enhancement. The second stage combines multi-scale and Fourier convolutional branches for robust image reconstruction, effectively recovering spatial structures and textures. This dual-branch joint optimization process ensures that complex image information is retained, overcoming the limitations of previous methods that neglected the interplay between amplitude and phase. Extensive experiments across multiple datasets demonstrate that DMFourLLIE outperforms current state-of-the-art methods in low-light image enhancement. Our code is available at https://github.com/bywlzts/DMFourLLIE.
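A minimal sketch of the amplitude/phase decomposition that the framework above builds on: the Fourier amplitude carries most of the luminance while the phase carries spatial structure, so the two can be processed by separate branches and recombined losslessly.

```python
import torch

def fourier_split(img: torch.Tensor):
    """Split an image batch (B, C, H, W) into Fourier amplitude and phase."""
    spec = torch.fft.fft2(img)
    return spec.abs(), spec.angle()

def fourier_merge(amplitude: torch.Tensor, phase: torch.Tensor):
    """Recombine amplitude and phase into an image: amplitude * exp(i * phase)."""
    spec = torch.polar(amplitude, phase)
    return torch.fft.ifft2(spec).real

x = torch.rand(1, 3, 64, 64)
amp, pha = fourier_split(x)
print(torch.allclose(fourier_merge(amp, pha), x, atol=1e-5))   # True
```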
Authors: Zilin Du, Haoxin Li, Jianfei Yu, Boyang Li
Abstract: Visual grounding aims to localize the image regions based on a textual query. Given the difficulty of large-scale data curation, we investigate how to effectively learn visual grounding under data-scarce settings in this paper. To address data scarcity, we propose a novel framework, POBF (Paint Outside the Box, then Filter). POBF synthesizes images by inpainting outside the box, tackling a label misalignment issue encountered in previous works. Furthermore, POBF leverages an innovative filtering scheme to identify the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Experimental results show that POBF achieves superior performance across four datasets, delivering an average improvement of 5.83% and outperforming leading baselines by 2.29% to 3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, data ratios, and model architectures.
Authors: Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis
Abstract: Counting is a fundamental skill for various visual tasks in real-life applications, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) struggle with counting tasks, especially when the number of objects exceeds those commonly encountered during training. We enhance LVLMs' counting abilities using a divide-and-conquer approach, breaking counting problems into sub-counting tasks. Unlike prior methods, which do not generalize well to counting datasets on which they have not been trained, our method performs well on new datasets without any additional training or fine-tuning. We demonstrate that our approach enhances counting capabilities across various datasets and benchmarks.
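A hedged sketch of the divide-and-conquer idea above (the tiling scheme and the `count_fn` placeholder standing in for an LVLM query are assumptions; overlap handling is omitted): split the image into tiles, count per tile, and sum the sub-counts.

```python
from PIL import Image

def count_by_tiles(image: Image.Image, count_fn, grid: int = 2) -> int:
    """Sum per-tile counts over a grid x grid partition of the image."""
    w, h = image.size
    total = 0
    for i in range(grid):
        for j in range(grid):
            box = (i * w // grid, j * h // grid,
                   (i + 1) * w // grid, (j + 1) * h // grid)
            total += count_fn(image.crop(box))   # sub-counting task per tile
    return total

# count_fn would wrap an LVLM prompt such as "How many apples are in this image?".
```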
Authors: Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng-Yen Yang, Sameer Satish Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, Laura Leal-Taixé
Abstract: Object perception from multi-view cameras is crucial for intelligent systems, particularly in indoor environments, e.g., warehouses, retail stores, and hospitals. Most traditional multi-target multi-camera (MTMC) detection and tracking methods rely on 2D object detection, single-view multi-object tracking (MOT), and cross-view re-identification (ReID) techniques, without properly handling important 3D information by multi-view image aggregation. In this paper, we propose a 3D object detection and tracking framework, named BEV-SUSHI, which first aggregates multi-view images with necessary camera calibration parameters to obtain 3D object detections in bird's-eye view (BEV). Then, we introduce hierarchical graph neural networks (GNNs) to track these 3D detections in BEV for MTMC tracking results. Unlike existing methods, BEV-SUSHI has impressive generalizability across different scenes and diverse camera settings, with exceptional capability for long-term association handling. As a result, our proposed BEV-SUSHI establishes the new state-of-the-art on the AICity'24 dataset with 81.22 HOTA, and 95.6 IDF1 on the WildTrack dataset.
Authors: Tao Huang, Qingyu Huang, Jiayang Meng
Abstract: The increasing reliance on deep computer vision models that process sensitive data has raised significant privacy concerns, particularly regarding the exposure of intermediate results in hidden layers. While traditional privacy risk assessment techniques focus on protecting overall model outputs, they often overlook vulnerabilities within these intermediate representations. Current privacy risk assessment techniques typically rely on specific attack simulations to assess risk, which can be computationally expensive and incomplete. This paper introduces a novel approach to measuring privacy risks in deep computer vision models based on the Degrees of Freedom (DoF) and sensitivity of intermediate outputs, without requiring adversarial attack simulations. We propose a framework that leverages DoF to evaluate the amount of information retained in each layer and combines this with the rank of the Jacobian matrix to assess sensitivity to input variations. This dual analysis enables systematic measurement of privacy risks at various model layers. Our experimental validation on real-world datasets demonstrates the effectiveness of this approach in providing deeper insights into privacy risks associated with intermediate representations.
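The sensitivity side of the analysis above can be illustrated with a tiny sketch (the toy model and single input are simplifying assumptions): compute the Jacobian of an intermediate layer's output with respect to the input and inspect its rank.

```python
import torch
import torch.nn as nn

# Toy network; the first two modules stand in for an "intermediate representation".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
intermediate = model[:2]

x = torch.randn(8)
J = torch.autograd.functional.jacobian(intermediate, x)   # shape (16, 8)
rank = torch.linalg.matrix_rank(J)
print(J.shape, int(rank))   # sensitivity of the hidden layer to input variations
```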
Authors: Jun Ye
Abstract: We propose a method to improve the generalization ability of skin lesion classification models by combining self-supervised learning (SSL), unsupervised domain adaptation (UDA), and active domain adaptation (ADA). The main steps of the approach include selection of an SSL model pretrained on natural image datasets, subsequent SSL retraining on all available skin lesion datasets, finetuning of the model on labeled source domain data, application of UDA methods on target domain data, and lastly, implementation of ADA methods. The efficacy of the proposed approach is assessed across ten skin lesion datasets from different domains, demonstrating its potential for enhancing the performance of skin lesion classification models. This approach holds promise for facilitating the widespread adoption of medical imaging models in clinical settings, thereby amplifying their impact.
Authors: Shuling Zhao, Fa-Ting Hong, Xiaoshui Huang, Dan Xu
Abstract: Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a unified framework to store representative global facial motion flow and appearance patterns. Then, we present a novel multi-scale motion and appearance compensation module, which utilizes a transformer-based codebook retrieval strategy to query complementary information from the two codebooks for joint motion and appearance compensation. The entire process produces motion flows of greater flexibility and appearance features with fewer distortions across different scales, resulting in a high-quality talking head video generation framework. Extensive experiments on various benchmarks validate the effectiveness of our approach and demonstrate superior generation results from both qualitative and quantitative perspectives when compared to state-of-the-art competitors.
Authors: Marius K\"astingsch\"afer, Th\'eo Gieruc, Sebastian Bernhard, Dylan Campbell, Eldar Insafutdinov, Eyvaz Najafli, Thomas Brox
Abstract: Models for egocentric 3D and 4D reconstruction, including few-shot interpolation and extrapolation settings, can benefit from having images from exocentric viewpoints as supervision signals. No existing dataset provides the necessary mixture of complex, dynamic, and multi-view data. To facilitate the development of 3D and 4D reconstruction methods in the autonomous driving context, we propose a Synthetic Ego-Exo Dynamic 4D (SEED4D) data generator and dataset. We present a customizable, easy-to-use data generator for spatio-temporal multi-view data creation. Our open-source data generator allows the creation of synthetic data for camera setups commonly used in the NuScenes, KITTI360, and Waymo datasets. Additionally, SEED4D encompasses two large-scale multi-view synthetic urban scene datasets. Our static (3D) dataset encompasses 212k inward- and outward-facing vehicle images from 2k scenes, while our dynamic (4D) dataset contains 16.8M images from 10k trajectories, each sampled at 100 points in time with egocentric images, exocentric images, and LiDAR data. The datasets and the data generator can be found at https://seed4d.github.io/.
Authors: Ajith Balakrishnan, Sreeja S, Linu Shine
Abstract: Generating 3D models from multi-view 2D RGB images has gained significant attention, extending the capabilities of technologies like Virtual Reality, Robotic Vision, and human-machine interaction. In this paper, we introduce a hybrid strategy combining CNNs and transformers, featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network, trained using a novel Joint Train Separate Optimization (JTSO) algorithm. Encoded features from unordered inputs are transformed into an enhanced feature map by the self-attention layer, decoded into an initial 3D volume, and further refined. Our network generates 3D voxels from single or multiple 2D images from arbitrary viewpoints. Performance evaluations using the ShapeNet datasets show that our approach, combined with JTSO, outperforms state-of-the-art techniques in single and multi-view 3D reconstruction, achieving the highest mean intersection over union (IOU) scores, surpassing other models by 4.2% in single-view reconstruction.
Authors: Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
Abstract: Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://github.com/fudan-generative-vision/hallo3.
Authors: Hanlin Chen, Fangyin Wei, Gim Hee Lee
Abstract: Humans naturally interact with their 3D surroundings using language, and modeling 3D language fields for scene understanding and interaction has gained growing interest. This paper introduces ChatSplat, a system that constructs a 3D language field, enabling rich chat-based interaction within 3D space. Unlike existing methods that primarily use CLIP-derived language features focused solely on segmentation, ChatSplat facilitates interaction on three levels: objects, views, and the entire 3D scene. For view-level interaction, we designed an encoder that encodes the rendered feature map of each view into tokens, which are then processed by a large language model (LLM) for conversation. At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene. For object-level interaction, ChatSplat uses a patch-wise language embedding, unlike LangSplat's pixel-wise language embedding that implicitly includes mask and embedding. Here, we explicitly decouple the language embedding into separate mask and feature map representations, allowing more flexible object-level interaction. To address the challenge of learning 3D Gaussians posed by the complex and diverse distribution of language embeddings used in the LLM, we introduce a learnable normalization technique to standardize these embeddings, facilitating effective learning. Extensive experimental results demonstrate that ChatSplat supports multi-level interactions -- object, view, and scene -- within 3D space, enhancing both understanding and engagement.
Authors: Jun Wan, He Liu, Yujia Wu, Zhihui Lai, Wenwen Min, Jun Liu
Abstract: At present, deep neural network methods have played a dominant role in the face alignment field. However, they generally use predefined network structures to predict landmarks, which tends to learn general features and leads to mediocre performance, e.g., they perform well on neutral samples but struggle with faces exhibiting large poses or occlusions. Moreover, they cannot effectively deal with semantic gaps and ambiguities among features at different scales, which may hinder them from learning efficient features. To address the above issues, in this paper, we propose a Dynamic Semantic-Aggregation Transformer (DSAT) for more discriminative and representative feature (i.e., specialized feature) learning. Specifically, a Dynamic Semantic-Aware (DSA) model is first proposed to partition samples into subsets and activate the specific pathways for them by estimating the semantic correlations of feature channels, making it possible to learn specialized features from each subset. Then, a novel Dynamic Semantic Specialization (DSS) model is designed to mine the homogeneous information from features at different scales for eliminating the semantic gap and ambiguities and enhancing the representation ability. Finally, by integrating the DSA model and DSS model into our proposed DSAT in both dynamic architecture and dynamic parameter manners, more specialized features can be learned for achieving more precise face alignment. It is interesting to show that harder samples can be handled by activating more feature channels. Extensive experiments on popular face alignment datasets demonstrate that our proposed DSAT outperforms state-of-the-art models in the literature. Our code is available at https://github.com/GERMINO-LiuHe/DSAT.
Authors: Jian Liu, Zhen Yu
Abstract: The neural radiance field (NeRF) advocates learning the continuous representation of 3D geometry through a multilayer perceptron (MLP). By integrating this into a generative model, the generative neural radiance field (GRAF) is capable of producing images from random noise z without 3D supervision. In practice, the shape and appearance are modeled by z_s and z_a, respectively, so that they can be manipulated separately during inference. However, it is challenging to represent multiple scenes using a solitary MLP and to precisely control the generation of 3D geometry in terms of shape and appearance. In this paper, we introduce a controllable generative model (i.e., CtrlNeRF) that uses a single MLP network to represent multiple scenes with shared weights. Consequently, we manipulate the shape and appearance codes to realize the controllable generation of high-fidelity images with 3D consistency. Moreover, the model enables the synthesis of novel views that do not exist in the training sets via camera pose alteration and feature interpolation. Extensive experiments were conducted to demonstrate its superiority in 3D-aware image generation compared to its counterparts.
Authors: Xin Xie, Dong Gong
Abstract: Text-to-image diffusion model alignment is critical for improving the alignment between the generated images and human preferences. While training-based methods are constrained by high computational costs and dataset requirements, training-free alignment methods remain underexplored and are often limited by inaccurate guidance. We propose a plug-and-play training-free alignment method, DyMO, for aligning the generated images and human preferences during inference. Apart from text-aware human preference scores, we introduce a semantic alignment objective for enhancing the semantic alignment in the early stages of diffusion, relying on the fact that the attention maps are effective reflections of the semantics in noisy images. We propose dynamic scheduling of multiple objectives and intermediate recurrent steps to reflect the requirements at different steps. Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.
Authors: Linhai Zhuo, Zheng Wang, Yuqian Fu, Tianwen Qian
Abstract: The source-free cross-domain few-shot learning (CD-FSL) task aims to transfer pretrained models to target domains using minimal samples, eliminating the need for source domain data. Addressing this issue requires models to have robust generalization abilities and strong feature representations, aligning with the characteristics of large-scale pretrained models. However, large-scale models tend to lose representational ability in cross-domain scenarios due to limited sample diversity. Given the abundant diversity provided by the semantic modality, this paper leverages textual information to enhance training sample diversity with the CLIP model, while also improving model transfer efficiency. Specifically, we propose the SeGD-VPT framework, which is divided into two phases. The first phase aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varied inputs and enhancing sample diversity. Furthermore, we use diversity descriptions of classes to guide semantically meaningful learning of diversity prompts, proposing random combinations and selections of texts to increase textual diversity. Additionally, deep prompt tuning is introduced to enhance the model's transfer capability. After the first training phase, support samples with different diversity prompts are fed into the CLIP backbone to generate enhanced features. The second phase then trains classifiers using the generated features. Extensive experimental results across several benchmarks verify that our method is comparable to SOTA source-utilized models and attains the best performance under the source-free CD-FSL setting.
Authors: Raynor Kirkson E. Chavez, Kyle Gabriel M. Reynoso
Abstract: This paper investigates the impact of sampling and pretraining using datasets with different image characteristics on the performance of self-supervised learning (SSL) models for object classification. To do this, we sample two apartment datasets from the Omnidata platform based on modality, luminosity, image size, and camera field of view and use them to pretrain a SimCLR model. The encodings generated from the pretrained model are then transferred to a supervised Resnet-50 model for object classification. Through A/B testing, we find that depth pretrained models are more effective on low resolution images, while RGB pretrained models perform better on higher resolution images. We also discover that increasing the luminosity of training images can improve the performance of models on low resolution images without negatively affecting their performance on higher resolution images.
Authors: Haoyang Long, Yan Wang, Wendong Wang
Abstract: Video deblurring presents a considerable challenge owing to the complexity of blur, which frequently results from a combination of camera shake and object motion. In the field of video deblurring, many previous works have primarily concentrated on distortion-based metrics such as PSNR. However, this approach often results in a weak correlation with human perception and yields reconstructions that lack realism. Diffusion models and video diffusion models have excelled in the fields of image and video generation, respectively, particularly achieving remarkable results in terms of image authenticity and realistic perception. However, due to the computational complexity and challenges inherent in adapting diffusion models, there is still uncertainty regarding the potential of video diffusion models in video deblurring tasks. To explore the viability of video diffusion models in the task of video deblurring, we introduce a diffusion model specifically for this purpose. In this field, leveraging highly correlated information between adjacent frames and addressing the challenge of temporal misalignment are crucial research directions. To tackle these challenges, we introduce several improvements based on the video diffusion model. As a result, our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics. Our model preserves a significant amount of detail in the images while maintaining competitive distortion metrics. Furthermore, to the best of our knowledge, this is the first time a diffusion model has been applied to video deblurring to overcome the limitations mentioned above.
Authors: Girmaw Abebe Tadesse, Caleb Robinson, Charles Mwangi, Esther Maina, Joshua Nyakundi, Luana Marotti, Gilles Quentin Hacheme, Hamed Alemohammad, Rahul Dodhia, Juan M. Lavista Ferres
Abstract: Approximately 20% of Africa's population suffered from undernourishment, and 868 million people experienced moderate to severe food insecurity in 2022. Land-use and land-cover maps provide crucial insights for addressing food insecurity, e.g., by mapping croplands. The development of global land-cover maps has been facilitated by the increasing availability of earth observation data and advancements in geospatial machine learning. However, these global maps exhibit lower accuracy and inconsistencies in Africa, partly due to the lack of representative training data. To address this issue, we propose a data-centric framework with a teacher-student model setup, which uses diverse data sources of satellite images and label examples to produce local land-cover maps. Our method trains a high-resolution teacher model on images with a resolution of 0.331 m/pixel and a low-resolution student model on publicly available images with a resolution of 10 m/pixel. The student model also utilizes the teacher model's output as its weak label examples through knowledge distillation. We evaluated our framework using Murang'a County, Kenya, as a use case and achieved significant improvements, i.e., 0.14 in the F1 score and 0.21 in Intersection-over-Union, compared to the best global map. Our evaluation also revealed inconsistencies in existing global maps, with a maximum agreement rate of 0.30 among themselves. Insights obtained from our cross-collaborative work can provide valuable guidance to local and national policymakers in making informed decisions to improve resource utilization and food security.
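A minimal sketch of the weak-label distillation step described above (the function names, temperature, and bilinear resampling of teacher probabilities to the student grid are illustrative assumptions): the low-resolution student is trained against the high-resolution teacher's per-pixel class probabilities.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_probs: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # student_logits: (B, C, H, W); teacher_probs: (B, C, H', W'), already softmaxed.
    teacher_probs = F.interpolate(teacher_probs, size=student_logits.shape[-2:],
                                  mode="bilinear", align_corners=False)
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p, teacher_probs, reduction="batchmean") * temperature ** 2

student = torch.randn(2, 6, 64, 64)                       # 10 m/pixel student grid
teacher = torch.softmax(torch.randn(2, 6, 256, 256), dim=1)  # finer teacher grid
print(distillation_loss(student, teacher).item())
```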
Authors: Matan Rusanovsky, Shimon Malnick, Amir Jevnisek, Ohad Fried, Shai Avidan
Abstract: Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can generate many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.
Authors: Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu
Abstract: Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches typically focus on the aggregation of deep features extracted from a backbone through using current prominent architectures (e.g., CNNs, MLPs, pooling layer and transformer encoder), giving little attention to the transformer decoder. However, we argue that its strong capability in capturing contextual dependencies and generating accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly generate robust and discriminative global representations for VPR. Specifically, we do this by formulating deep features as the keys and values, as well as a set of independent learnable parameters as the queries. EDTformer can fully utilize the contextual information within deep features, then gradually decode and aggregate the effective features into the learnable queries to form the final global representations. Moreover, to provide powerful deep features for EDTformer and further facilitate the robustness, we use the foundation model DINOv2 as the backbone and propose a Low-Rank Parallel Adaptation (LoPA) method to enhance it, which can refine the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods which add a re-ranking with considerable cost. Code will be available at https://github.com/Tong-Jin01/EDTformer.
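Decoder-style aggregation as described above can be sketched with standard PyTorch modules (dimensions and layer counts are illustrative, not EDTformer's): a small stack of decoder blocks cross-attends a set of learnable queries to the backbone's deep features, and the queries are flattened into a global descriptor.

```python
import torch
import torch.nn as nn

class DecoderAggregator(nn.Module):
    """Learnable queries cross-attend to patch features to form a place descriptor."""
    def __init__(self, feat_dim: int = 768, num_queries: int = 4, num_layers: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(num_queries * feat_dim, 512)   # final global descriptor

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N_patches, feat_dim) deep features from the backbone
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out = self.decoder(tgt=q, memory=feats)              # queries attend to features
        return self.proj(out.flatten(1))                     # (B, 512) descriptor

print(DecoderAggregator()(torch.randn(2, 196, 768)).shape)   # torch.Size([2, 512])
```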
Authors: Peijun Bao, Chenqi Kong, Zihao Shao, Boon Poh Ng, Meng Hwa Er, Alex C. Kot
Abstract: Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. Zero-shot ReCorrect achieves over 75% and 80% of the best fully-supervised performance on two benchmarks, while unsupervised ReCorrect reaches about 85% on both. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Morp.
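A hedged sketch of the semantics-guided refinement idea: drop sentence-video pairs whose best frame similarity is low and tighten the temporal boundary to the high-similarity span. Thresholds, tensor shapes, and the helper name are illustrative, not the ReCorrect implementation.

```python
# Sketch: clean noisy pseudo annotations by frame-text semantic similarity.
import torch
import torch.nn.functional as F

def refine_pseudo_label(frame_feats, text_feat, keep_thresh=0.2, span_thresh=0.15):
    """frame_feats: (T, D) per-frame embeddings; text_feat: (D,) sentence embedding."""
    sims = F.cosine_similarity(frame_feats, text_feat.unsqueeze(0), dim=-1)   # (T,)
    if sims.max() < keep_thresh:
        return None                      # treat as an unpaired (noisy) annotation and discard
    above = (sims >= span_thresh).nonzero(as_tuple=True)[0]
    start, end = above.min().item(), above.max().item()
    return start, end                    # refined temporal boundary (frame indices)
```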
Authors: Hang Su, Ling Gao, Tao Liu, Laurent Kneip
Abstract: As the ubiquity of smart mobile devices continues to rise, Optical Camera Communication systems have gained more attention as a solution for efficient and private data streaming. This system utilizes optical cameras to receive data from digital screens via visible light. Despite their promise, most of them are hindered by dynamic factors such as screen refreshing and rapid camera motion. CMOS cameras, often serving as the receivers, suffer from limited frame rates and motion-induced image blur, which degrade overall performance. To address these challenges, this paper unveils a novel system that utilizes event cameras. We introduce a dynamic visual marker and design event-based tracking algorithms to achieve fast localization and data streaming. Remarkably, the event camera's unique capabilities mitigate issues related to screen refresh rates and camera motion, enabling a high throughput of up to 114 Kbps in static conditions, and a 1 cm localization accuracy with 1% bit error rate under various camera motions.
Authors: Yuhao Lin, Lingqiao Liu, Javen Shi
Abstract: Animal re-identification (ReID) has become an indispensable tool in ecological research, playing a critical role in tracking population dynamics, analyzing behavioral patterns, and assessing ecological impacts, all of which are vital for informed conservation strategies. Unlike human ReID, animal ReID faces significant challenges due to the high variability in animal poses, diverse environmental conditions, and the inability to directly apply pre-trained models to animal data, making the identification process across species more complex. This work introduces an innovative keypoint propagation mechanism, which utilizes a single annotated image and a pre-trained diffusion model to propagate keypoints across an entire dataset, significantly reducing the cost of manual annotation. Additionally, we enhance the Vision Transformer (ViT) by implementing Keypoint Positional Encoding (KPE) and Categorical Keypoint Positional Embedding (CKPE), enabling the ViT to learn more robust and semantically-aware representations. This provides more comprehensive and detailed keypoint representations, leading to more accurate and efficient re-identification. Our extensive experimental evaluations demonstrate that this approach significantly outperforms existing state-of-the-art methods across four wildlife datasets. The code will be publicly released.
Authors: Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, Ming Li
Abstract: Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, to the best of our knowledge, marking a pioneering attempt to integrate large language models (LLMs) with event stream comprehension. To mitigate the huge domain gaps, we develop a three-stage optimization paradigm to gradually equip a pre-trained LLM with the capability of understanding event-based scenes. Our EventGPT comprises an event encoder, followed by a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. Firstly, RGB image-text pairs generated by GPT are leveraged to warm up the linear projector, referring to LLaVA, as the gap between natural image and language modalities is relatively smaller. Secondly, we construct a synthetic yet large dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts to enable the use of the spatio-temporal aggregator and to train the event-language adapter, thereby aligning event features more closely with the language space. Finally, we gather an instruction dataset, Event-Chat, which contains extensive real-world data to fine-tune the entire model, further enhancing its generalization ability. We construct a comprehensive benchmark, and experiments show that EventGPT surpasses previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.
Authors: Yan Li, Yifei Xing, Xiangyuan Lan, Xin Li, Haifeng Chen, Dongmei Jiang
Abstract: Cross-modal alignment is crucial for multimodal representation fusion due to the inherent heterogeneity between modalities. While Transformer-based methods have shown promising results in modeling inter-modal relationships, their quadratic computational complexity limits their applicability to long-sequence or large-scale data. Although recent Mamba-based approaches achieve linear complexity, their sequential scanning mechanism poses fundamental challenges in comprehensively modeling cross-modal relationships. To address this limitation, we propose AlignMamba, an efficient and effective method for multimodal fusion. Specifically, grounded in Optimal Transport, we introduce a local cross-modal alignment module that explicitly learns token-level correspondences between different modalities. Moreover, we propose a global cross-modal alignment loss based on Maximum Mean Discrepancy to implicitly enforce the consistency between different modal distributions. Finally, the unimodal representations after local and global alignment are passed to the Mamba backbone for further cross-modal interaction and multimodal fusion. Extensive experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
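The global alignment idea can be illustrated with a generic Maximum Mean Discrepancy estimate between two modality feature sets; the RBF kernel and bandwidth below are assumptions, not the AlignMamba loss.

```python
# Sketch: squared MMD between two sets of modality features with an RBF kernel.
import torch

def mmd_rbf(x, y, sigma=1.0):
    """x: (N, D), y: (M, D). Returns a scalar MMD^2 estimate."""
    def kernel(a, b):
        dist2 = torch.cdist(a, b) ** 2
        return torch.exp(-dist2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

loss_global = mmd_rbf(torch.randn(32, 256), torch.randn(32, 256))   # toy example
```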
Authors: Christian M\"oller, Niklas Funk, Jan Peters
Abstract: Object pose estimation from a single view remains a challenging problem. In particular, partial observability, occlusions, and object symmetries eventually result in pose ambiguity. To account for this multimodality, this work proposes training a diffusion-based generative model for 6D object pose estimation. During inference, the trained generative model allows for sampling multiple particles, i.e., pose hypotheses. To distill this information into a single pose estimate, we propose two novel and effective pose selection strategies that do not require any additional training or computationally intensive operations. Moreover, while many existing methods for pose estimation primarily focus on the image domain and only incorporate depth information for final pose refinement, our model solely operates on point cloud data. The model thereby leverages recent advancements in point cloud processing and operates upon an SE(3)-equivariant latent space that forms the basis for the particle selection strategies and allows for improved inference times. Our thorough experimental results demonstrate the competitive performance of our approach on the Linemod dataset and showcase the effectiveness of our design choices. Code is available at https://github.com/zitronian/6DPoseDiffusion .
Authors: Jin Lyu, Tianyi Zhu, Yi Gu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang, Liang An
Abstract: Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and limited multi-species dataset leave this problem underexplored. To this end, this paper presents AniMer to estimate animal pose and shape using family aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate the effectiveness of our network design and CtrlAni3D in enhancing the performance of AniMer for in-the-wild applications. The project page of AniMer is https://luoxue-star.github.io/AniMer_project_page/.
Authors: Ronghan Chen, Yang Cong, Jiayue Liu
Abstract: This paper presents a Surface-Aligned Gaussian representation for creating animatable human avatars from monocular videos, aiming at improving novel view and pose synthesis performance while ensuring fast training and real-time rendering. Recently, 3DGS has emerged as a more efficient and expressive alternative to NeRF and has been used for creating dynamic human avatars. However, when applied to the severely ill-posed task of monocular dynamic reconstruction, the Gaussians tend to overfit constantly changing regions such as clothing wrinkles or shadows, since these regions cannot provide consistent supervision, resulting in noisy geometry and abrupt deformation that typically fail to generalize under novel views and poses. To address these limitations, we present SAGA, i.e., Surface-Aligned Gaussian Avatar, which aligns the Gaussians with a mesh to enforce well-defined geometry and consistent deformation, thereby improving generalization under novel views and poses. Unlike existing strict alignment methods that suffer from limited expressive power and low realism, SAGA employs a two-stage alignment strategy where the Gaussians are first adhered to and then detached from the mesh, thus facilitating both good geometry and high expressivity. In the Adhered Stage, we improve the flexibility of Adhered-on-Mesh Gaussians by allowing them to flow on the mesh, in contrast to existing methods that rigidly bind Gaussians to fixed locations. In the second Detached Stage, we introduce a Gaussian-Mesh Alignment regularization, which allows us to unleash the expressivity by detaching the Gaussians while maintaining geometric alignment by minimizing their location and orientation offsets from the bound triangles. Finally, since the Gaussians may drift outside the bound triangles during optimization, an efficient Walking-on-Mesh strategy is proposed to dynamically update the bound triangles.
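A rough sketch of what a Gaussian-mesh alignment regularizer could look like, penalizing each Gaussian's location and orientation offsets from its bound triangle; the tensor layout, pairing, and weights are hypothetical and simplified relative to SAGA.

```python
# Sketch: penalize Gaussian offsets from bound triangle centroids and normals.
import torch
import torch.nn.functional as F

def gaussian_mesh_alignment(g_centers, g_normals, tri_centroids, tri_normals,
                            w_loc=1.0, w_rot=0.1):
    """All inputs are (N, 3); row i pairs Gaussian i with its bound triangle."""
    loc_offset = (g_centers - tri_centroids).norm(dim=-1)                        # (N,)
    cos = (F.normalize(g_normals, dim=-1) * F.normalize(tri_normals, dim=-1)).sum(-1)
    rot_offset = 1.0 - cos.abs()                                                 # orientation mismatch
    return w_loc * loc_offset.mean() + w_rot * rot_offset.mean()
```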
Authors: Weihang Li, Weirong Chen, Shenhan Qian, Jiajie Chen, Daniel Cremers, Haoang Li
Abstract: Recent advances in 3D Gaussian Splatting have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Dynamics, sparse views, and unknown poses significantly increase the problem complexity due to insufficient geometric constraints. To overcome this challenge, we propose a method that can use only two images without prior poses to fit Gaussians in dynamic environments. To achieve this, we introduce two technical contributions. First, we propose an object-level two-view bundle adjustment. This strategy decomposes dynamic scenes into piece-wise rigid components, and jointly estimates the camera pose and motions of dynamic objects. Second, we design an SE(3) field-driven Gaussian training method. It enables fine-grained motion modeling through learnable per-Gaussian transformations. Our method leads to high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion. Experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-art approaches designed for the cases of static environments, multiple images, and/or known poses. Our project page is available at https://colin-de.github.io/DynSUP/.
Authors: Bohai Gu, Hao Luo, Song Guo, Peiran Dong
Abstract: Recently, diffusion-based methods have achieved great improvements in the video inpainting task. However, these methods still face many challenges, such as maintaining temporal consistency and high computational cost. This paper proposes an advanced video inpainting framework using optical Flow-guided Efficient Diffusion, called FloED. Specifically, FloED employs a dual-branch architecture, where a flow branch first restores corrupted flow and a multi-scale flow adapter provides motion guidance to the main inpainting branch. Additionally, a training-free latent interpolation method is proposed to accelerate the multi-step denoising process using flow warping. By further introducing a flow attention cache mechanism, FloED efficiently reduces the computational cost brought by incorporating optical flow. Comprehensive experiments on both background restoration and object removal tasks demonstrate that FloED outperforms state-of-the-art methods in terms of both performance and efficiency.
Authors: Songjie Xie, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Abstract: Task-oriented communication presents a promising approach to improve the communication efficiency of edge inference systems by optimizing learning-based modules to extract and transmit relevant task information. However, real-time applications face practical challenges, such as incomplete coverage and potential malfunctions of edge servers. This situation necessitates cross-model communication between different inference systems, enabling edge devices from one service provider to collaborate effectively with edge servers from another. Independent optimization of diverse edge systems often leads to incoherent feature spaces, which hinders the cross-model inference for existing task-oriented communication. To facilitate and achieve effective cross-model task-oriented communication, this study introduces a novel framework that utilizes shared anchor data across diverse systems. This approach addresses the challenge of feature alignment in both server-based and on-device scenarios. In particular, by leveraging the linear invariance of visual features, we propose efficient server-based feature alignment techniques to estimate linear transformations using encoded anchor data features. For on-device alignment, we exploit the angle-preserving nature of visual features and propose to encode relative representations with anchor data to streamline cross-model communication without additional alignment procedures during the inference. The experimental results on computer vision benchmarks demonstrate the superior performance of the proposed feature alignment approaches in cross-model task-oriented communications. The runtime and computation overhead analysis further confirm the effectiveness of the proposed feature alignment approaches in real-time applications.
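A minimal sketch of relative representations with anchor data, encoding each feature by its cosine similarities to a shared set of encoded anchors; this is a generic illustration of the idea, not the paper's pipeline.

```python
# Sketch: relative representations make features from independently trained
# encoders comparable up to angle-preserving transformations.
import torch
import torch.nn.functional as F

def relative_representation(features, anchor_features):
    """features: (B, D); anchor_features: (K, D) -> relative codes (B, K)."""
    f = F.normalize(features, dim=-1)
    a = F.normalize(anchor_features, dim=-1)
    return f @ a.t()          # each row: cosine similarities to the K shared anchors
```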
Authors: Arijit Samal, Haroon R Lone
Abstract: Non-invasive temperature monitoring of individuals plays a crucial role in identifying and isolating symptomatic individuals. Temperature monitoring becomes particularly vital in settings characterized by close human proximity, often referred to as dense settings. However, existing research on non-invasive temperature estimation using thermal cameras has predominantly focused on sparse settings. Unfortunately, the risk of disease transmission is significantly higher in dense settings like movie theaters or classrooms. Consequently, there is an urgent need to develop robust temperature estimation methods tailored explicitly for dense settings. Our study proposes a non-invasive temperature estimation system that combines a thermal camera with an edge device. Our system employs YOLO models for face detection and utilizes a regression framework for temperature estimation. We evaluated the system on a diverse dataset collected in dense and sparse settings. Our proposed face detection model achieves an impressive mAP score of over 84 in both in-dataset and cross-dataset evaluations. Furthermore, the regression framework demonstrates remarkable performance with a mean square error of 0.18$^{\circ}$C and an impressive $R^2$ score of 0.96. Our experiments' results highlight the developed system's effectiveness, positioning it as a promising solution for continuous temperature monitoring in real-world applications. With this paper, we release our dataset and programming code publicly.
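For reference, the reported regression metrics can be computed with standard scikit-learn utilities; the function below is a generic evaluation helper, not part of the authors' system.

```python
# Generic evaluation of a temperature regressor with the metrics reported above;
# y_true and y_pred are per-face temperature readings in degrees Celsius.
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_temperature_model(y_true, y_pred):
    return {"mse": mean_squared_error(y_true, y_pred),   # paper reports ~0.18
            "r2": r2_score(y_true, y_pred)}              # paper reports ~0.96
```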
Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during decoding, directly affecting the efficiency of MLLMs. Existing methods attempt to reduce the vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of the vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of the vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill, and decoding with and without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by $\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\% when decoding without a KV cache, while saving $\sim$50\% GPU memory overhead when decoding with a KV cache, due to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation, or even performance gains, compared to the full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava.
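To illustrate vision-context sparsification at prefill, the sketch below scores vision tokens with a small linear predictor and keeps only the top fraction; the scorer, keep ratio, and shapes are assumptions, not the Dynamic-LLaVA design.

```python
# Sketch: keep the top-k most relevant vision tokens before they enter the LLM.
import torch
import torch.nn as nn

class TokenSparsifier(nn.Module):
    def __init__(self, dim=1024, keep_ratio=0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)        # lightweight importance predictor
        self.keep_ratio = keep_ratio

    def forward(self, vision_tokens):                               # (B, N, dim)
        scores = self.scorer(vision_tokens).squeeze(-1)             # (B, N)
        k = max(1, int(vision_tokens.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices
        idx = idx.sort(dim=1).values                                # preserve original token order
        batch = torch.arange(vision_tokens.size(0), device=vision_tokens.device).unsqueeze(-1)
        return vision_tokens[batch, idx]                            # (B, k, dim) retained tokens
```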
Authors: Haoze Sun, Wenbo Li, Jiayue Liu, Kaiwen Zhou, Yongqiang Chen, Yong Guo, Yanwei Li, Renjing Pei, Long Peng, Yujiu Yang
Abstract: Generalization has long been a central challenge in real-world image restoration. While recent diffusion-based restoration methods, which leverage generative priors from text-to-image models, have made progress in recovering more realistic details, they still encounter "generative capability deactivation" when applied to out-of-distribution real-world data. To address this, we propose using text as an auxiliary invariant representation to reactivate the generative capabilities of these models. We begin by identifying two key properties of text input: richness and relevance, and examine their respective influence on model performance. Building on these insights, we introduce Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation levels, effectively mitigating response failures. Additionally, we present RealIR, a new benchmark designed to capture diverse real-world scenarios. Extensive experiments demonstrate that Res-Captioner significantly enhances the generalization abilities of diffusion-based restoration models, while remaining fully plug-and-play.
Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
Abstract: Recent DETR-based methods have advanced the development of Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite harvesting remarkable progress, existing works follow asynchronous designs, modeling video sequences via either video-level queries only or query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query embeddings with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former promotes mutual learning between frame- and video-level embeddings, while the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019 & 2021 & 2022 and OVIS benchmarks, and SyncVIS achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach. The code is available at https://github.com/rkzheng99/SyncVIS.
Authors: Kun Qian, Tianyu Sun, Wenhong Wang
Abstract: Industrial anomaly detection (IAD) plays a crucial role in the maintenance and quality control of manufacturing processes. In this paper, we propose a novel approach, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), which leverages large vision-language models (LVLMs) to improve both anomaly detection and localization in industrial settings. CLAD aligns visual and textual features into a shared embedding space using contrastive learning, ensuring that normal instances are grouped together while anomalies are pushed apart. Through extensive experiments on two benchmark industrial datasets, MVTec-AD and VisA, we demonstrate that CLAD outperforms state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization. Additionally, we provide ablation studies and human evaluation to validate the importance of key components in our method. Our approach not only achieves superior performance but also enhances interpretability by accurately localizing anomalies, making it a promising solution for real-world industrial applications.
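The contrastive alignment objective can be illustrated with a generic CLIP-style loss over paired image and text embeddings; this is a standard formulation, not necessarily the exact CLAD training objective.

```python
# Sketch: symmetric InfoNCE loss pulling matching image-text pairs together
# and pushing mismatched pairs apart in the shared embedding space.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) paired embeddings (row i is a matching pair)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```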
Authors: Grace Colverd, Jumpei Takami, Laura Schade, Karol Bot, Joseph A. Gallego-Mejia
Abstract: Tree height estimation serves as an important proxy for biomass estimation in ecological and forestry applications. While traditional methods such as photogrammetry and Light Detection and Ranging (LiDAR) offer accurate height measurements, their application on a global scale is often cost-prohibitive and logistically challenging. In contrast, remote sensing techniques, particularly 3D tomographic reconstruction from Synthetic Aperture Radar (SAR) imagery, provide a scalable solution for global height estimation. SAR images have been used in earth observation contexts due to their ability to work in all weathers, unobscured by clouds. In this study, we use deep learning to estimate forest canopy height directly from 2D Single Look Complex (SLC) images, a derivative of SAR. Our method attempts to bypass traditional tomographic signal processing, potentially reducing latency from SAR capture to end product. We also quantify the impact of varying numbers of SLC images on height estimation accuracy, aiming to inform future satellite operations and optimize data collection strategies. Compared to full tomographic processing combined with deep learning, our minimal method (partial processing + deep learning) falls short, with an error 16-21\% higher, highlighting the continuing relevance of geometric signal processing.
Authors: Youjia Zhang, Anpei Chen, Yumin Wan, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang
Abstract: In this paper, we introduce Ref-GS, a novel approach for directional light factorization in 2D Gaussian splatting, which enables photorealistic view-dependent appearance rendering and precise geometry recovery. Ref-GS builds upon the deferred rendering of Gaussian splatting and applies directional encoding to the deferred-rendered surface, effectively reducing the ambiguity between orientation and viewing angle. Next, we introduce a spherical Mip-grid to capture varying levels of surface roughness, enabling roughness-aware Gaussian shading. Additionally, we propose a simple yet efficient geometry-lighting factorization that connects geometry and lighting via the vector outer product, significantly reducing renderer overhead when integrating volumetric attributes. Our method achieves superior photorealistic rendering for a range of open-world scenes while also accurately recovering geometry.
Authors: Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen
Abstract: Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.
Authors: Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman
Abstract: Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' while ignoring the 'where' and 'how'. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model FIction that fuses the past video observation of the person's actions and their environment to predict both the 'where' and 'how' of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially, with more than 30% relative gains.
Authors: Burak Bekci, Nassir Navab, Federico Tombari, Mahdi Saleh
Abstract: Shape completion, a crucial task in 3D computer vision, involves predicting and filling the missing regions of scanned or partially observed objects. Current methods expect known pose or canonical coordinates and do not perform well under varying rotations, limiting their real-world applicability. We introduce ESCAPE (Equivariant Shape Completion via Anchor Point Encoding), a novel framework designed to achieve rotation-equivariant shape completion. Our approach employs a distinctive encoding strategy by selecting anchor points from a shape and representing all points as a distance to all anchor points. This enables the model to capture a consistent, rotation-equivariant understanding of the object's geometry. ESCAPE leverages a transformer architecture to encode and decode the distance transformations, ensuring that generated shape completions remain accurate and equivariant under rotational transformations. Subsequently, we perform optimization to calculate the predicted shapes from the encodings. Experimental evaluations demonstrate that ESCAPE achieves robust, high-quality reconstructions across arbitrary rotations and translations, showcasing its effectiveness in real-world applications without additional pose estimation modules.
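A small sketch of anchor-point distance encoding: distances to a fixed set of anchors are unchanged by rotation, which is the property ESCAPE exploits. The anchor selection below is simplified to random sampling rather than any particular selection strategy.

```python
# Sketch: rotation-invariant point encoding via distances to anchor points.
import numpy as np

def anchor_distance_encoding(points, num_anchors=8, rng=None):
    """points: (N, 3) -> (N, num_anchors) encoding unchanged by rigid rotations."""
    rng = np.random.default_rng(rng)
    anchors = points[rng.choice(len(points), num_anchors, replace=False)]
    diff = points[:, None, :] - anchors[None, :, :]       # (N, K, 3)
    return np.linalg.norm(diff, axis=-1)                  # pairwise distances to anchors
```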
Authors: Keren Ganon, Morris Alper, Rachel Mikulinsky, Hadar Averbuch-Elor
Abstract: Buildings are a central feature of human culture and are increasingly being analyzed with computational methods. However, recent works on computational building understanding have largely focused on natural imagery of buildings, neglecting the fundamental element defining a building's structure -- its floorplan. Conversely, existing works on floorplan understanding are extremely limited in scope, often focusing on floorplans of a single semantic category and region (e.g. floorplans of apartments from a single country). In this work, we introduce WAFFLE, a novel multimodal floorplan understanding dataset of nearly 20K floorplan images and metadata curated from Internet data spanning diverse building types, locations, and data formats. By using a large language model and multimodal foundation models, we curate and extract semantic information from these images and their accompanying noisy metadata. We show that WAFFLE enables progress on new building understanding tasks, both discriminative and generative, which were not feasible using prior datasets. We will publicly release WAFFLE along with our code and trained models, providing the research community with a new foundation for learning the semantics of buildings.
Authors: Benjamin Bergner, Christoph Lippert, Aravindh Mahendran
Abstract: The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
Authors: Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
Abstract: We investigate whether the pre-trained knowledge of vision-language models (VLMs), such as CLIP, can be retained or even enhanced during continual learning (CL) while absorbing knowledge from a data stream. Existing methods often rely on additional reference data or isolated components for distribution or domain prediction, leading to high training costs, increased inference complexity, and limited improvement potential for pre-trained models. To address these challenges, we first comprehensively analyze the effects of parameter update locations and ranks on downstream adaptation and knowledge retention. Based on these insights, we propose Dynamic Rank-Selective Low-Rank Adaptation (LoRA), a universal and efficient CL approach that adaptively assigns ranks to LoRA modules based on their relevance to the current data. Unlike prior methods, our approach continually enhances the pre-trained VLM by retaining both the pre-trained knowledge and the knowledge acquired during CL. Our approach eliminates the need for explicit domain or distribution prediction and additional reference data, enabling seamless integration of new tasks while preserving pre-trained capabilities. It also maintains the original architecture and deployment pipeline of the pre-trained model without incurring any additional inference overhead. Extensive experiments and analyses demonstrate that our method outperforms state-of-the-art approaches in continually absorbing knowledge of downstream tasks while retaining pre-trained knowledge.
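A minimal LoRA sketch showing a frozen linear layer plus a low-rank update whose rank can differ per module; the rank-assignment policy itself is omitted, and the class below is illustrative rather than the paper's implementation.

```python
# Sketch: parameter-efficient low-rank adaptation of a frozen pre-trained layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```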
Authors: Bolin Lai, Felix Juefei-Xu, Miao Liu, Xiaoliang Dai, Nikhil Mehta, Chenguang Zhu, Zeyi Huang, James M. Rehg, Sangmin Lee, Ning Zhang, Tong Xiao
Abstract: Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$, that can $\textbf{insta}$ntly learn a new image $\textbf{manip}$ulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages -- learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin ($\geq$19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.
Authors: Kyungri Park, Woohwan Jung
Abstract: Pluralistic Image Inpainting (PII) offers multiple plausible solutions for restoring missing parts of images and has been successfully applied to various applications including image editing and object removal. Recently, VQGAN-based methods have been proposed and have shown that they significantly improve the structural integrity in the generated images. Nevertheless, the state-of-the-art VQGAN-based model PUT faces a critical challenge: degradation of detail quality in output images due to feature quantization. Feature quantization restricts the latent space and causes information loss, which negatively affects the detail quality essential for image inpainting. To tackle the problem, we propose the FDM (Feature Dequantization Module) specifically designed to restore the detail quality of images by compensating for the information loss. Furthermore, we develop an efficient training method for FDM which drastically reduces training costs. We empirically demonstrate that our method significantly enhances the detail quality of the generated images with negligible training and inference overheads.
Authors: Chanho Eom, Geon Lee, Kyunghwan Cho, Hyeonseok Jung, Moonsub Jin, Bumsub Ham
Abstract: We introduce a new framework, dubbed Cerberus, for attribute-based person re-identification (reID). Our approach leverages person attribute labels to learn local and global person representations that encode specific traits, such as gender and clothing style. To achieve this, we define semantic IDs (SIDs) by combining attribute labels, and use a semantic guidance loss to align the person representations with the prototypical features of corresponding SIDs, encouraging the representations to encode the relevant semantics. Simultaneously, we enforce the representations of the same person to be embedded closely, enabling recognizing subtle differences in appearance to discriminate persons sharing the same attribute labels. To increase the generalization ability on unseen data, we also propose a regularization method that takes advantage of the relationships between SID prototypes. Our framework performs individual comparisons of local and global person representations between query and gallery images for attribute-based reID. By exploiting the SID prototypes aligned with the corresponding representations, it can also perform person attribute recognition (PAR) and attribute-based person search (APS) without bells and whistles. Experimental results on standard benchmarks on attribute-based person reID, Market-1501 and DukeMTMC, demonstrate the superiority of our model compared to the state of the art.
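The semantic guidance idea can be sketched as a prototype-classification loss that pulls each person representation toward the prototype of its semantic ID (attribute combination); prototype storage and updates are simplified assumptions here.

```python
# Sketch: align representations with semantic-ID (SID) prototypes.
import torch
import torch.nn.functional as F

def semantic_guidance_loss(features, sid_labels, prototypes, temperature=0.1):
    """features: (B, D); sid_labels: (B,) int indices; prototypes: (num_sids, D)."""
    f = F.normalize(features, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = f @ p.t() / temperature      # similarity of each sample to every SID prototype
    return F.cross_entropy(logits, sid_labels)
```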
Authors: Jingnan Shi, Rajat Talak, Harry Zhang, David Jin, Luca Carlone
Abstract: We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation. It uses FiLM-conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap. Observing that the shape decoder is well behaved in the convex hull of known shapes, we approximate the shape decoder with an active shape model, and show that this reduces the shape correction problem to a constrained linear least squares problem, which can be solved efficiently by an interior point algorithm. Third, we introduce a self-training pipeline to perform self-supervised domain adaptation of CRISP. The self-training is based on a correct-and-certify approach, which leverages the corrector to generate pseudo-labels at test time, and uses them to self-train CRISP. We demonstrate CRISP (and the self-training) on YCBV, SPE3R, and NOCS datasets. CRISP shows high performance on all the datasets. Moreover, our self-training is capable of bridging a large domain gap. Finally, CRISP also shows an ability to generalize to unseen objects. Code and pre-trained models will be available on https://web.mit.edu/sparklab/research/crisp_object_pose_shape/.
URLs: https://web.mit.edu/sparklab/research/crisp_object_pose_shape/.
Authors: Lauhitya Reddy, Ketan Anand, Shoibolina Kaushik, Corey Rodrigo, J. Lucas McKay, Trisha M. Kesar, Hyeokhyen Kwon
Abstract: Accurate diagnosis of gait impairments is often hindered by subjective or costly assessment methods, with current solutions requiring either expensive multi-camera equipment or relying on subjective clinical observation. There is a critical need for accessible, objective tools that can aid in gait assessment while preserving patient privacy. In this work, we present a mobile phone-based, privacy-preserving artificial intelligence (AI) system for classifying gait impairments and introduce a novel dataset of 743 videos capturing seven distinct gait patterns. The dataset consists of frontal and sagittal views of trained subjects simulating normal gait and six types of pathological gait (circumduction, Trendelenburg, antalgic, crouch, Parkinsonian, and vaulting), recorded using standard mobile phone cameras. Our system achieved 86.5% accuracy using combined frontal and sagittal views, with sagittal views generally outperforming frontal views except for specific gait patterns like circumduction. Model feature importance analysis revealed that frequency-domain features and entropy measures were critical for classification performance, with lower-limb keypoints proving most important, aligning with clinical understanding of gait assessment. These findings demonstrate that mobile phone-based systems can effectively classify diverse gait patterns while preserving privacy through on-device processing. The high accuracy achieved using simulated gait data suggests the potential of this approach for rapid prototyping of gait analysis systems, though clinical validation with patient data remains necessary. This work represents a significant step toward accessible, objective gait assessment tools for clinical, community, and tele-rehabilitation settings.
Authors: Taekyung Ki, Dongchan Min, Gyoungsu Chae
Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
Authors: Sunghun Yang, Minhyeok Lee, Suhwan Cho, Jungho Lee, Sangyoun Lee
Abstract: Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic areas without additional information. A difference mask from surface normals identifies static and dynamic areas by measuring directional variance. For static areas, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic areas, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic areas, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.
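A hedged sketch of deriving a static/dynamic mask from surface normals by measuring directional variance across frames; the threshold and tensor layout are illustrative choices, not the STATIC module.

```python
# Sketch: regions whose normal direction varies little across frames are static.
import torch

def static_mask_from_normals(normals, thresh=0.05):
    """normals: (T, 3, H, W) unit surface normals over T frames -> (H, W) bool mask."""
    mean_n = normals.mean(dim=0, keepdim=True)
    mean_n = mean_n / mean_n.norm(dim=1, keepdim=True).clamp_min(1e-6)
    cos = (normals * mean_n).sum(dim=1)                   # (T, H, W) cosine to mean direction
    variance = 1.0 - cos.mean(dim=0)                      # (H, W) directional variance proxy
    return variance < thresh                              # True where the area is static
```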
Authors: Penghui Wen, Lei Bai, Mengwei He, Patrick Filippi, Feng Zhang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu
Abstract: Extended short-term precipitation nowcasting suffers from decreasing precision because meteorological knowledge, such as weather fronts that significantly influence precipitation intensity, duration, and spatial distribution, is insufficiently considered. Therefore, in this paper, we present DuoCast, a novel dual-probabilistic meteorology-aware model designed to address both broad weather evolution and micro-scale fluctuations using two diffusion models, PrecipFlow and MicroDynamic, respectively. Our PrecipFlow model captures evolution trends through an Extreme Precipitation-Aware Encoder (EPA-Encoder), which includes AirConvolution and FrontAttention blocks to process two levels of precipitation data: general and extreme. The output conditions a UNet-based diffusion model to produce prediction maps enriched with weather front information. The MicroDynamic model further refines the results to capture micro-scale variability. Extensive experiments on four public benchmarks demonstrate the effectiveness of DuoCast, achieving superior performance over state-of-the-art methods. Our code is available at https://github.com/ph-w2000/DuoCast.
Authors: Delong Zhu, Yuezun Li, Baoyuan Wu, Jiaran Zhou, Zhibo Wang, Siwei Lyu
Abstract: This paper investigates the feasibility of a proactive DeepFake defense framework, {\em FacePoison}, to prevent individuals from becoming victims of DeepFake videos by sabotaging face detection. The motivation stems from the reliance of most DeepFake methods on face detectors to automatically extract victim faces from videos for training or synthesis (testing). Once the face detectors malfunction, the extracted faces will be distorted or incorrect, subsequently disrupting the training or synthesis of the DeepFake model. To achieve this, we adapt various adversarial attacks with a dedicated design for this purpose and thoroughly analyze their feasibility. Based on FacePoison, we introduce {\em VideoFacePoison}, a strategy that propagates FacePoison across video frames rather than applying it individually to each frame. This strategy can largely reduce the computational overhead while retaining favorable attack performance. Our method is validated on five face detectors, and extensive experiments against eleven different DeepFake models demonstrate the effectiveness of disrupting face detectors to hinder DeepFake generation.
Authors: Jun Xiang, Yudong Guo, Leipeng Hu, Boyang Guo, Yancheng Yuan, Juyong Zhang
Abstract: Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
Authors: Hao Wu, Zhihang Zhong, Xiao Sun
Abstract: Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding of the visual content. Our approach introduces two key innovations: (1) diffusion-guided retrieval enhancement, where a pretrained diffusion model guides image feature learning by reconstructing noisy images, allowing the model to capture more comprehensive and fine-grained visual information beyond standard annotated captions; and (2) a high-quality retrieval database, which provides comprehensive semantic information to enhance caption generation, especially in out-of-domain scenarios. Extensive experiments demonstrate that DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.
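To ground the retrieval step, here is a generic image-to-text retrieval helper that returns the top-k most similar captions by cosine similarity; the encoders and the retrieval database are placeholders, not the DIR pipeline.

```python
# Sketch: nearest-neighbor caption retrieval over a pre-encoded text database.
import numpy as np

def retrieve_captions(query_embedding, caption_embeddings, captions, k=5):
    """query_embedding: (D,); caption_embeddings: (N, D); captions: list of N strings."""
    q = query_embedding / np.linalg.norm(query_embedding)
    db = caption_embeddings / np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    scores = db @ q                                   # cosine similarity to every caption
    top = np.argsort(-scores)[:k]
    return [(captions[i], float(scores[i])) for i in top]
```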
Authors: Alejandro Fontan, Javier Civera, Tobias Fischer, Michael Milford
Abstract: Evaluation is critical to both developing and tuning Structure from Motion (SfM) and Visual SLAM (VSLAM) systems, but is universally reliant on high-quality geometric ground truth -- a resource that is not only costly and time-intensive but, in many cases, entirely unobtainable. This dependency on ground truth restricts SfM and SLAM applications across diverse environments and limits scalability to real-world scenarios. In this work, we propose a novel ground-truth-free (GTF) evaluation methodology that eliminates the need for geometric ground truth, instead using sensitivity estimation via sampling from both original and noisy versions of input images. Our approach shows strong correlation with traditional ground-truth-based benchmarks and supports GTF hyperparameter tuning. Removing the need for ground truth opens up new opportunities to leverage a much larger number of dataset sources, and for self-supervised and online tuning, with the potential for a data-driven breakthrough analogous to what has occurred in generative AI.
Authors: Chenghao Li, Yuke Zhang, Dake Chen, Jingqi Xu, Peter A. Beerel
Abstract: Diffusion models have demonstrated significant potential in image generation. However, their ability to replicate training data presents a privacy risk, particularly when the training data includes confidential information. Existing mitigation strategies primarily focus on augmenting the training dataset, leaving the impact of the diffusion model architecture underexplored. In this paper, we address this gap by examining and mitigating the impact of the model structure, specifically the skip connections in the diffusion model's U-Net. We first present our observation of a trade-off in the skip connections. While they enhance image generation quality, they also reinforce the memorization of training data, increasing the risk of replication. To address this, we propose a replication-aware U-Net (RAU-Net) architecture that incorporates information transfer blocks into skip connections that are less essential for image quality. Recognizing the potential impact of RAU-Net on generation quality, we further investigate and identify specific timesteps during which the impact on memorization is most pronounced. By applying RAU-Net selectively at these critical timesteps, we couple our novel diffusion model with a targeted training and inference strategy, forming a framework we refer to as LoyalDiffusion. Extensive experiments demonstrate that LoyalDiffusion outperforms the state-of-the-art replication mitigation method, achieving a 48.63% reduction in replication while maintaining comparable image quality.
Authors: Mojtaba S. Fazli, Shannon Quinn
Abstract: Object tracking is a fundamental tool in modern innovation, with applications in defense systems, autonomous vehicles, and biomedical research. It enables precise identification, monitoring, and spatiotemporal analysis of objects across sequential frames, providing insights into dynamic behaviors. In cell biology, object tracking is vital for uncovering cellular mechanisms, such as migration, interactions, and responses to drugs or pathogens. These insights drive breakthroughs in understanding disease progression and therapeutic interventions. Over time, object tracking methods have evolved from traditional feature-based approaches to advanced machine learning and deep learning frameworks. While classical methods are reliable in controlled settings, they struggle in complex environments with occlusions, variable lighting, and high object density. Deep learning models address these challenges by delivering greater accuracy, adaptability, and robustness. This review categorizes object tracking techniques into traditional, statistical, feature-based, and machine learning paradigms, with a focus on biomedical applications. These methods are essential for tracking cells and subcellular structures, advancing our understanding of health and disease. Key performance metrics, including accuracy, efficiency, and adaptability, are discussed. The paper explores limitations of current methods and highlights emerging trends to guide the development of next-generation tracking systems for biomedical research and broader scientific domains.
Authors: Joseph Raj Vishal, Divesh Basina, Aarya Choudhary, Bharatesh Chakravarthi
Abstract: Recent advances in video question answering (VideoQA) offer promising applications, especially in traffic monitoring, where efficient video interpretation is critical. Within intelligent transportation systems (ITS), answering complex, real-time queries like "How many red cars passed in the last 10 minutes?" or "Was there an incident between 3:00 PM and 3:05 PM?" enhances situational awareness and decision-making. Despite progress in vision-language models, VideoQA remains challenging, especially in dynamic environments involving multiple objects and intricate spatiotemporal relationships. This study evaluates state-of-the-art VideoQA models using non-benchmark synthetic and real-world traffic sequences. The framework leverages GPT-4o to assess accuracy, relevance, and consistency across basic detection, temporal reasoning, and decomposition queries. VideoLLaMA-2 excelled with 57% accuracy, particularly in compositional reasoning and consistent answers. However, all models, including VideoLLaMA-2, faced limitations in multi-object tracking, temporal coherence, and complex scene interpretation, highlighting gaps in current architectures. These findings underscore VideoQA's potential in traffic monitoring but also emphasize the need for improvements in multi-object tracking, temporal reasoning, and compositional capabilities. Enhancing these areas could make VideoQA indispensable for incident detection, traffic flow management, and responsive urban planning. The study's code and framework are open-sourced for further exploration: https://github.com/joe-rabbit/VideoQA_Pilot_Study
Authors: Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim
Abstract: Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at https://cvlab-kaist.github.io/SOLA
Authors: Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen
Abstract: Scene text recognition (STR) suffers from the challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text at a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by generating text within a specified image region and leveraging rich glyph and position information, so that the synthesized region is far less complex than an entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR scales well, and we construct an anagram-based TextSSR-F dataset of 0.4 million complex and realistic text instances. Experiments show that models trained with added TextSSR-F data achieve better accuracy than models trained on 4 million existing synthetic samples. Moreover, their accuracy margin to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential in scene text image synthesis. Our code is available at https://github.com/YesianRohn/TextSSR.
Authors: Suhyun Shin, Seungwoo Yoon, Ryota Maeda, Seung-Hwan Baek
Abstract: Hyperspectral 3D imaging captures both depth maps and hyperspectral images, enabling comprehensive geometric and material analysis. Recent methods achieve high spectral and depth accuracy; however, they require long acquisition times, often over several minutes, or rely on large, expensive systems, restricting their use to static scenes. We present Dense Dispersed Structured Light (DDSL), an accurate hyperspectral 3D imaging method for dynamic scenes that utilizes stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film. We design spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns, thereby accelerating acquisition speed. Additionally, we formulate an image formation model and a reconstruction method to estimate a hyperspectral image and depth map from captured stereo images. As the first practical and accurate hyperspectral 3D imaging method for dynamic scenes, we experimentally demonstrate that DDSL achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps.
Authors: Minh Tran, Thang Pham, Winston Bounsavy, Tri Nguyen, Ngan Le
Abstract: Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation across the spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information within clips while extracting amodal characteristics across clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.
Authors: Wooseok Jang, Youngjun Hong, Gunho Cha, Seungryong Kim
Abstract: Manipulation of facial images to meet specific controls such as pose, expression, and lighting, also known as face rigging, is a complex task in computer vision. Existing methods are limited by their reliance on image datasets, which necessitates individual-specific fine-tuning and limits their ability to retain fine-grained identity and semantic details, reducing practical usability. To overcome these limitations, we introduce ControlFace, a novel face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control. We employ dual-branch U-Nets: one, referred to as FaceNet, captures identity and fine details, while the other focuses on generation. To enhance control precision, a control mixer module encodes the correlated features between the target-aligned control and reference-aligned control, and a novel guidance method, reference control guidance, steers the generation process for better control adherence. By training on a facial video dataset, we fully utilize FaceNet's rich representations while ensuring control adherence. Extensive experiments demonstrate ControlFace's superior performance in identity preservation and control precision, highlighting its practicality. Please see the project website: https://cvlab-kaist.github.io/ControlFace/.
Authors: Christopher Fusco, Mosam Dabhi, Shin-Fang Ch'ng, Simon Lucey
Abstract: We present a spatio-temporal perspective on category-agnostic 3D lifting of 2D keypoints over a temporal sequence. Our approach differs from existing state-of-the-art methods that are either (i) object agnostic but can only operate on individual frames, or (ii) able to model space-time dependencies but designed to work with only a single object category. Our approach is grounded in two core principles. First, when there is a lack of data about an object, general information from similar objects can be leveraged for better performance. Second, while temporal information is important, the most critical information is in immediate temporal proximity. These two principles allow us to outperform current state-of-the-art methods on per-frame and per-sequence metrics for a variety of objects. Lastly, we release a new synthetic dataset containing 3D skeletons and motion sequences of a diverse set of animals. Dataset and code will be made publicly available.
Authors: Zijian Chen, Tingzhu Chen, Wenjun Zhang, Guangtao Zhai
Abstract: We introduce OBI-Bench, a holistic benchmark crafted to systematically evaluate large multi-modal models (LMMs) on whole-process oracle bone inscriptions (OBI) processing tasks demanding expert-level domain knowledge and deliberate cognition. OBI-Bench includes 5,523 meticulously collected diverse-sourced images, covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. These images span centuries of archaeological findings and years of research by front-line scholars, comprising multi-stage font appearances from excavation to synthesis, such as original oracle bones, inked rubbings, oracle bone fragments, cropped single characters, and handprinted characters. Unlike existing benchmarks, OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs to perform tasks akin to those faced by experts. The evaluation of 6 proprietary LMMs as well as 17 open-source LMMs highlights the substantial challenges and demands posed by OBI-Bench. Even the latest versions of GPT-4o, Gemini 1.5 Pro, and Qwen-VL-Max still fall well short of the general public on some fine-grained perception tasks. However, they perform at a level comparable to untrained humans in the deciphering task, indicating remarkable capabilities in offering new interpretative perspectives and generating creative guesses. We hope OBI-Bench can facilitate the community to develop domain-specific multi-modal foundation models towards ancient language research and delve deeper to discover and enhance these untapped potentials of LMMs.
Authors: Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li
Abstract: Human Mesh Reconstruction (HMR) from monocular video plays an important role in human-robot interaction and collaboration. However, existing video-based human mesh reconstruction methods face a trade-off between accurate reconstruction and smooth motion. These methods design networks based on either RNNs or attention mechanisms to extract local temporal correlations or global temporal dependencies, but the lack of complementary long-term information and local details limits their performance. To address this problem, we propose a Dual-branch Graph Transformer network for 3D human mesh Reconstruction from video, named DGTR. DGTR employs a dual-branch network including a Global Motion Attention (GMA) branch and a Local Details Refine (LDR) branch to extract long-term dependencies and crucial local information in parallel, helping model global human motion and local human details (e.g., local motion, tiny movement). Specifically, GMA utilizes a global transformer to model long-term human motion. LDR combines modulated graph convolutional networks and the transformer framework to aggregate local information in adjacent frames and extract crucial information of human details. Experiments demonstrate that our DGTR outperforms state-of-the-art video-based methods in reconstruction accuracy and maintains competitive motion smoothness. Moreover, DGTR uses fewer parameters and FLOPs, validating the effectiveness and efficiency of the proposed DGTR. Code is publicly available at https://github.com/TangTao-PKU/DGTR.
URLs: https://github.com/TangTao-PKU/DGTR
Authors: Aayush Kumar Tyagi, Vaibhav Mishra, Ashok Tiwari, Lalita Mehra, Prasenjit Das, Govind Makharia, Prathosh AP, Mausam
Abstract: Celiac disease is an autoimmune disorder triggered by the consumption of gluten. It causes damage to the villi, the finger-like projections in the small intestine that are responsible for nutrient absorption. Additionally, the crypts, which form the base of the villi, are also affected, impairing the regenerative process. The deterioration in villi length, computed as the villi-to-crypt length ratio, indicates the severity of celiac disease. However, manual measurement of villi-crypt length can be both time-consuming and susceptible to inter-observer variability, leading to inconsistencies in diagnosis. While some methods can perform measurement as a post-hoc process, they are prone to errors in the initial stages. This gap underscores the need for pathologically driven solutions that enhance measurement accuracy and reduce human error in celiac disease assessments. Our proposed method, MeasureNet, is a pathologically driven polyline detection framework incorporating polyline localization and object-driven losses specifically designed for measurement tasks. Furthermore, we leverage a segmentation model to provide auxiliary guidance about crypt location when crypts are partially visible. To ensure that the model is not overdependent on the segmentation mask, we enhance model robustness through a mask feature mixup technique. Additionally, we introduce a novel dataset for grading celiac disease, consisting of 750 annotated duodenum biopsy images. MeasureNet achieves an 82.66% classification accuracy for binary classification and 81% accuracy for multi-class grading of celiac disease. Code: https://github.com/dair-iitd/MeasureNet
Authors: Chenyang Zhu, Kai Li, Yue Ma, Longxiang Tang, Chengyu Fang, Chubin Chen, Qifeng Chen, Xiu Li
Abstract: Recent advances in Customized Concept Swapping (CCS) enable a text-to-image model to swap a concept in the source image with a customized target concept. However, existing methods still face the challenges of inconsistency and inefficiency. They struggle to maintain consistency in both the foreground and background during concept swapping, especially when the shape difference between objects is large. Additionally, they either require time-consuming training processes or involve redundant calculations during inference. To tackle these issues, we introduce InstantSwap, a new CCS method that handles sharp shape disparities efficiently. Specifically, we first extract the bbox of the object in the source image automatically based on attention map analysis and leverage the bbox to achieve both foreground and background consistency. For background consistency, we remove the gradient outside the bbox during the swapping process so that the background remains unmodified. For foreground consistency, we employ a cross-attention mechanism to inject semantic information into both source and target concepts inside the box. This helps learn semantic-enhanced representations that encourage the swapping process to focus on the foreground objects. To improve swapping speed, we avoid computing gradients at each timestep and instead calculate them periodically to reduce the number of forward passes, which substantially improves efficiency with only a small sacrifice in performance. Finally, we establish a benchmark dataset to facilitate comprehensive evaluation. Extensive evaluations demonstrate the superiority and versatility of InstantSwap. Project page: https://instantswap.github.io/
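As a rough illustration of the two efficiency and consistency tricks described above (not the authors' released code; the class name, fixed recomputation period, and mask handling are assumptions), one could mask and cache guidance gradients like this:

    import torch

    class PeriodicMaskedGrad:
        """Sketch: (1) zero gradients outside the object bbox so the background
        stays untouched; (2) recompute the guidance gradient only every `period`
        denoising steps and reuse the cached one in between."""
        def __init__(self, period=5):
            self.period = period
            self.cached = None

        def __call__(self, latent, loss_fn, bbox_mask, step):
            if step % self.period == 0 or self.cached is None:
                latent = latent.detach().requires_grad_(True)
                grad, = torch.autograd.grad(loss_fn(latent), latent)
                self.cached = grad * bbox_mask  # keep foreground gradients only
            return self.cached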
Authors: Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang
Abstract: Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2x speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at https://github.com/VainF/TinyFusion.
Authors: Yi Liao, Yongsheng Gao, Weichuan Zhang
Abstract: In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method to address the open problem of visually explaining the attention evolution dynamics inside CNNs when making their classification decisions. A novel cascading neuron abandoning back-propagation algorithm is designed to trace the neurons in all layers of a CNN that are involved in making its prediction, addressing the problem of significant interference from abandoned neurons. Firstly, a Neuron Abandoning Back-Propagation (NA-BP) module is proposed to generate Back-Propagated Feature Maps (BPFM) by using the inverse functions of the intermediate layers of CNN models, on which the neurons not used for decision-making are abandoned. Meanwhile, the cascading NA-BP modules calculate the tensors of importance coefficients, which are linearly combined with the tensors of BPFMs to form the NAFlow. Secondly, to visualize attention flow for similarity metric-based CNN models, a new channel contribution weights module is proposed to calculate the importance coefficients via the Jacobian matrix. The effectiveness of the proposed NAFlow is validated on nine widely-used CNN models for various tasks of general image classification, contrastive learning classification, few-shot image classification, and image retrieval.
Authors: Wenxin Su, Song Tang, Xiaofeng Liu, Xiaojing Yi, Mao Ye, Chunxiao Zu, Jiahao Li, Xiatian Zhu
Abstract: Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way: learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as the pseudo-perturbation label, because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with small batch sizes.
Authors: Zhenzhong Cao
Abstract: High-quality reconstruction is crucial for dense SLAM. Recent popular approaches utilize 3D Gaussian Splatting (3D GS) techniques for RGB, depth, and semantic reconstruction of scenes. However, these methods often overlook issues of detail and consistency in different parts of the scene. To address this, we propose RGBDS-SLAM, an RGB-D semantic dense SLAM system based on 3D multi-level pyramid Gaussian splatting, which enables high-quality dense reconstruction of scene RGB, depth, and semantics. In this system, we introduce a 3D multi-level pyramid Gaussian splatting method that restores scene details by extracting multi-level image pyramids for Gaussian splatting training, ensuring consistency in RGB, depth, and semantic reconstructions. Additionally, we design a tightly-coupled multi-features reconstruction optimization mechanism, allowing the reconstruction accuracy of RGB, depth, and semantic maps to mutually enhance each other during the rendering optimization process. Extensive quantitative, qualitative, and ablation experiments on the Replica and ScanNet public datasets demonstrate that our proposed method outperforms current state-of-the-art methods. The open-source code will be available at: https://github.com/zhenzhongcao/RGBDS-SLAM.
Authors: Ruichen Wang, Junliang Zhang, Qingsong Xie, Chen Chen, Haonan Lu
Abstract: Recently, diffusion models have exhibited superior performance in the area of image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically encounter problems in terms of semantic consistency between images and text, and the editing habits of users. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that aligns closely with the user input prompt, we propose a local prompt input, Attention Control Points (ACP), and an Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesigned the mask generation algorithm for the training and testing datasets to simulate users' masking habits, and introduced a customized new training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.
Authors: Xiaoqi Zhao, Youwei Pang, Shijie Chang, Yuan Zhao, Lihe Zhang, Huchuan Lu, Jinsong Ouyang, Georges El Fakhri, Xiaofeng Liu
Abstract: As a foundational model, SAM has significantly influenced multiple fields within computer vision, and its upgraded version, SAM 2, enhances capabilities in video segmentation, poised to make a substantial impact once again. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, product defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough quantitative evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks. This work aims to provide valuable insights to guide future research in both context-independent and context-dependent concepts segmentation, potentially informing the development of the next version - SAM 3.
Authors: Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zemin Huang, Weijian Luo, Guo-Jun Qi
Abstract: Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning, aiming to maximize a reward that discounts the final image quality by the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using around 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
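A minimal sketch of a plug-and-play time-prediction head in this spirit (the module name, mean pooling, and sigmoid bounding are assumptions for illustration, not the published TPM design):

    import torch
    import torch.nn as nn

    class TimePredictionModule(nn.Module):
        """Predict the next noise level from pooled denoiser features,
        constrained to lie strictly below the current timestep."""
        def __init__(self, dim):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(dim + 1, dim), nn.SiLU(), nn.Linear(dim, 1))

        def forward(self, latent_feats, t_current):
            # latent_feats: (B, N, D) features; t_current: (B,) timesteps in [0, 1]
            pooled = latent_feats.mean(dim=1)
            ratio = torch.sigmoid(self.mlp(torch.cat([pooled, t_current[:, None]], dim=-1)))
            return ratio.squeeze(-1) * t_current  # next timestep is a fraction of the current one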
Authors: Lingyun Zhang, Yu Xie, Yanwei Fu, Ping Chen
Abstract: As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted material, such as sexually explicit or violent content. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacement in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.
Authors: Gorkem Polat, Ümit Mert Çağlar, Alptekin Temizel
Abstract: Assessing disease severity with ordinal classes, where each class represents an increasing level of severity, benefits from loss functions that account for this ordinal structure. Traditional categorical loss functions, like Cross-Entropy (CE), often perform suboptimally in these scenarios. To address this, we propose a novel loss function, Class Distance Weighted Cross-Entropy (CDW-CE), which penalizes misclassifications more harshly when classes are farther apart. We evaluated CDW-CE on the Labeled Images for Ulcerative Colitis (LIMUC) dataset using various deep architectures. Its performance was compared against several categorical and ordinal loss functions. To analyze the quality of latent representations, we used t-distributed stochastic neighbor embedding (t-SNE) visualizations and quantified their clustering with the Silhouette Score. We also compared Class Activation Maps (CAM) generated by models trained with CDW-CE and CE loss, incorporating domain expert feedback to evaluate alignment with expert knowledge. Our results show that CDW-CE consistently improves performance in ordinal image classification tasks. It achieves higher Silhouette Scores, indicating better differentiation of class representations, and its CAM visualizations demonstrate a stronger focus on clinically significant regions, as confirmed by domain experts.
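A minimal PyTorch sketch of a class-distance-weighted cross-entropy in this spirit (the exponent `alpha` and exact weighting are assumptions for illustration, not necessarily the paper's published formulation):

    import torch
    import torch.nn.functional as F

    def cdw_ce_loss(logits, targets, alpha=2.0, eps=1e-7):
        """Penalize probability mass placed on classes far from the true ordinal
        class via an |i - c|**alpha weight; the true class gets zero weight."""
        probs = F.softmax(logits, dim=1)                                        # (B, C)
        classes = torch.arange(logits.shape[1], device=logits.device)
        dist = (classes.unsqueeze(0) - targets.unsqueeze(1)).abs().float()      # (B, C)
        loss = -(torch.log(1.0 - probs + eps) * dist.pow(alpha)).sum(dim=1)
        return loss.mean()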
Authors: Joy Dhar, Nayyar Zaidi, Maryam Haghighat, Puneet Goyal, Sudipta Roy, Azadeh Alavi, Vikas Kumar
Abstract: Multimodal fusion learning has shown significant promise in classifying various diseases such as skin cancer and brain tumors. However, existing methods face three key limitations. First, they often lack generalizability to other diagnosis tasks due to their focus on a particular disease. Second, they do not fully leverage multiple health records from diverse modalities to learn robust complementary information. Finally, they typically rely on a single attention mechanism, missing the benefits of multiple attention strategies within and across various modalities. To address these issues, this paper proposes a dual robust information fusion attention mechanism (DRIFA) that leverages two attention modules, i.e., a multi-branch fusion attention module and a multimodal information fusion attention module. DRIFA can be integrated with any deep neural network, forming a multimodal fusion learning framework denoted as DRIFA-Net. We show that the multi-branch fusion attention of DRIFA learns enhanced representations for each modality, such as dermoscopy, pap smear, MRI, and CT-scan, whereas the multimodal information fusion attention module learns more refined multimodal shared representations, improving the network's generalization across multiple tasks and enhancing overall performance. Additionally, to estimate the uncertainty of DRIFA-Net predictions, we have employed an ensemble Monte Carlo dropout strategy. Extensive experiments on five publicly available datasets with diverse modalities demonstrate that our approach consistently outperforms state-of-the-art methods. The code is available at https://github.com/misti1203/DRIFA-Net.
Authors: Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma
Abstract: This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
Authors: Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
Abstract: The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text encoder representations in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representation and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
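For intuition, the robust MAE classification objective can be written in a few lines (a sketch only; the prompt machinery and the OT-based clean/noisy split are omitted):

    import torch
    import torch.nn.functional as F

    def mae_classification_loss(logits, targets):
        """L1 distance between the softmax distribution and the one-hot label;
        the per-sample loss is bounded, limiting the pull of mislabeled examples."""
        probs = F.softmax(logits, dim=1)
        onehot = F.one_hot(targets, num_classes=logits.shape[1]).float()
        return (probs - onehot).abs().sum(dim=1).mean()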
Authors: Wei Luo, Deyu Zhang, Ying Tang, Fan Wu, Yaoxue Zhang
Abstract: This paper addresses the challenges of Online Action Recognition (OAR), a framework that involves instantaneous analysis and classification of behaviors in video streams. OAR must operate under stringent latency constraints, making it an indispensable component for real-time feedback in edge computing. Existing methods, which typically rely on processing entire video clips, fall short in scenarios requiring immediate recognition. To address this, we design EdgeOAR, a framework specifically tailored for OAR on edge devices. EdgeOAR includes the Early Exit-oriented Task-specific Feature Enhancement Module (TFEM), which comprises lightweight submodules to optimize features in both temporal and spatial dimensions. We design an iterative training method that enables TFEM to learn features from the beginning of the video. Additionally, EdgeOAR includes an Inverse Information Entropy (IIE) and Modality Consistency (MC)-driven fusion module to fuse features and make better exit decisions. This design overcomes the two main challenges: robust modeling of spatio-temporal action representations with limited initial frames in online video streams, and balancing accuracy and efficiency on resource-constrained edge devices. Experiments show that on the UCF-101 dataset, EdgeOAR reduces latency by 99.23% and energy consumption by 99.28% compared to the state-of-the-art (SOTA) method, while achieving adequate accuracy on edge devices.
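As a hedged illustration of an entropy-driven early-exit rule of the kind hinted at by the IIE fusion module (the threshold value and normalization are assumptions, not the paper's exact criterion):

    import torch

    def should_exit(logits, threshold=0.5):
        """Exit early once the class distribution is confident enough,
        measured by entropy normalized to [0, 1]."""
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
        return (entropy / max_entropy) < threshold  # boolean decision per sample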
Authors: Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang
Abstract: Most existing GUI agents typically depend on non-vision inputs like HTML source code or accessibility trees, limiting their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle with accurately localizing GUI elements -- a critical requirement for effective GUI automation -- due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By leveraging purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments -- including web pages, desktop software, and mobile UIs -- demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents. Refer to the project homepage https://invinciblewyq.github.io/ponder-press-page/
Authors: Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen
Abstract: Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the need for capable artificial intelligence on mobile devices is also growing, for example in AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most existing large model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods take into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieve an average score improvement of 2.0 across 6 benchmarks under two training subsets respectively. Code is available at: https://github.com/fqhank/Align-KD.
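One way to picture distilling shallow-layer cross-modal matching is a KL term over vision-text affinity maps. The sketch below is an illustrative assumption, not the Align-KD loss itself, and it further assumes student and teacher expose the same numbers of vision and text tokens:

    import torch
    import torch.nn.functional as F

    def alignment_kd_loss(stu_vis, stu_txt, tea_vis, tea_txt):
        """Match the student's vision-text affinity pattern to the teacher's."""
        def affinity(v, t):
            v = F.normalize(v, dim=-1)                  # (B, Nv, D)
            t = F.normalize(t, dim=-1)                  # (B, Nt, D)
            return (v @ t.transpose(1, 2)).flatten(1)   # (B, Nv * Nt)
        log_p_student = F.log_softmax(affinity(stu_vis, stu_txt), dim=-1)
        p_teacher = F.softmax(affinity(tea_vis, tea_txt), dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean")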
Authors: Shan Yang
Abstract: Text-to-image generation models have become transformative tools. However, diffusion-based vision language models still lack the ability to precisely control the shape, appearance, and positional placement of objects in generated images using text guidance alone. Global image editing models typically achieve global layout control by relying on additional masks or images as guidance, which often require model training. Although local object-editing models enable modification of object shapes, they do not provide control over the positional placement of these objects. To address these limitations, we propose the MFTF model, which enables precise control over object positioning without requiring additional masks or images. The MFTF model supports both single-object and multi-object positional control (such as translation, rotation, etc.) and allows for concurrent layout control and object semantic editing. This is achieved by controlling the denoising process of the diffusion model through parallel denoising. Attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from the self-attention layers to isolate objects. These queries are then modified according to layout control parameters and injected back into the self-attention layers of the target diffusion model to enable precise positional control.
Authors: Zhuokun Chen, Jinwu Hu, Zeshuai Deng, Yufeng Wang, Bohan Zhuang, Mingkui Tan
Abstract: Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of selected MLLMs within a family. By merging the parameters of language models from these MLLMs, VisionFuse allows a single language model to align with various vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements in multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
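The training-free integration can be pictured as two small operations, sketched below under the assumption of uniform parameter averaging and shape-compatible state dicts (the released method may weight or select parameters differently):

    import torch

    def merge_language_models(state_dicts, weights=None):
        """Average (or weight) parameters of language models from one MLLM family
        so a single language model can serve several vision encoders."""
        weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
        return {k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
                for k in state_dicts[0]}

    def fuse_vision_tokens(token_lists):
        """Concatenate vision tokens from multiple encoders along the sequence axis."""
        return torch.cat(token_lists, dim=1)  # each (B, N_i, D) -> (B, sum N_i, D)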
Authors: Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, Chuang Gan
Abstract: Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, which is crucial for developing embodied AI within 3D scenes, such as visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and consider their features as scene representations. However, these task-agnostic object features include much redundant information and miss details in the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging LLM's visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. Specifically, a dense token selector examines the attention map of the LLM to identify visual preferences for the instruction input. It then magnifies fine-grained details of the focusing area. An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information. To comprehensively evaluate the large scene understanding ability of 3D-VLMs, we further introduce a cross-room understanding benchmark, XR-Scene, which contains a series of large scene understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption. Experiments show that our method surpasses existing methods on both large scene understanding and existing scene understanding benchmarks. Plugging our scene magnifier module into existing 3D-VLMs also brings significant improvement.
Authors: Katharina Prasse, Isaac Bravo, Stefanie Walter, Margret Keuper
Abstract: Visual framing analysis is a key method in social sciences for determining common themes and concepts in a given discourse. To reduce manual effort, image clustering can significantly speed up the annotation process. In this work, we phrase the clustering task as a Minimum Cost Multicut Problem [MP]. Solutions to the MP have been shown to provide clusterings that maximize the posterior probability, solely from provided local, pairwise probabilities of two images belonging to the same cluster. We discuss the efficacy of numerous embedding spaces for detecting visual frames and show the superiority of this approach over other clustering methods. To this end, we employ the climate change dataset ClimateTV, which contains images commonly used for visual frame analysis. For broad visual frames, DINOv2 is a suitable embedding space, while ConvNeXt V2 returns a larger number of clusters which capture fine-grained differences, e.g., speech and protest. Our insights into embedding space differences, in combination with the (by definition) optimal clustering, advance automated visual frame detection. Our code can be found at https://github.com/KathPra/MP4VisualFrameDetection.
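Pairwise same-cluster probabilities are typically turned into signed edge costs via a logit transform before solving the multicut problem; a small sketch (the sign convention and clipping are assumptions of this illustration):

    import numpy as np

    def multicut_edge_costs(p_same, eps=1e-6):
        """Map the probability that two images share a cluster to an edge cost:
        positive costs favor keeping the pair together, negative costs favor cutting."""
        p = np.clip(p_same, eps, 1.0 - eps)
        return np.log(p / (1.0 - p))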
Authors: Qiyuan Shen, Hengwang Zhao, Weihao Yan, Chunxiang Wang, Tong Qin, Ming Yang
Abstract: Cross-modal localization has drawn increasing attention in recent years, while the visual relocalization in prior LiDAR maps is less studied. Related methods usually suffer from inconsistency between the 2D texture and 3D geometry, neglecting the intensity features in the LiDAR point cloud. In this paper, we propose a cross-modal visual relocalization system in prior LiDAR maps utilizing intensity textures, which consists of three main modules: map projection, coarse retrieval, and fine relocalization. In the map projection module, we construct the database of intensity channel map images leveraging the dense characteristic of panoramic projection. The coarse retrieval module retrieves the top-K most similar map images to the query image from the database, and retains the top-K' results by covisibility clustering. The fine relocalization module applies a two-stage 2D-3D association and a covisibility inlier selection method to obtain robust correspondences for 6DoF pose estimation. The experimental results on our self-collected datasets demonstrate the effectiveness in both place recognition and pose estimation tasks.
Authors: Han Han, Wei Zhai, Yang Cao, Bin Li, Zheng-jun Zha
Abstract: Tracking Any Point (TAP) plays a crucial role in motion analysis. Video-based approaches rely on iterative local matching for tracking, but they assume linear motion during the blind time between frames, which leads to target point loss under large displacements or nonlinear motion. The high temporal resolution and motion blur-free characteristics of event cameras provide continuous, fine-grained motion information, capturing subtle variations with microsecond precision. This paper presents an event-based framework for tracking any point, which tackles the challenges posed by spatial sparsity and motion sensitivity in events through two tailored modules. Specifically, to resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic features into the local matching process. Additionally, a variable motion aware module is integrated to ensure temporally consistent responses that are insensitive to varying velocities, thereby enhancing matching precision. To validate the effectiveness of the approach, an event dataset for tracking any point is constructed by simulation, and is applied in experiments together with two real-world datasets. The experimental results show that the proposed method outperforms existing SOTA methods. Moreover, it achieves 150% faster processing with competitive model parameters.
Authors: Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang
Abstract: We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: https://presto-video.github.io/.
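A compact sketch of segmented cross-attention as described (the segment boundaries and the use of a single shared attention layer are simplifying assumptions, not the released architecture):

    import torch
    import torch.nn as nn

    class SegmentedCrossAttention(nn.Module):
        """Split video hidden states into temporal segments; each segment
        cross-attends only to its own sub-caption embedding."""
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, hidden, sub_captions):
            # hidden: (B, T, D) time-ordered tokens; sub_captions: list of (B, L_s, D)
            segments = hidden.chunk(len(sub_captions), dim=1)
            outs = [self.attn(query=seg, key=cap, value=cap)[0]
                    for seg, cap in zip(segments, sub_captions)]
            return torch.cat(outs, dim=1)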
Authors: Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
Abstract: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to push the output features away from undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance by selectively pushing apart matching semantic features (between reference and output generation) during the reverse diffusion process. When used w.r.t. other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used w.r.t. a reference copyrighted asset, NegToMe helps reduce visual similarity with copyrighted content by 34.57%. NegToMe is simple to implement using just few-lines of code, uses only marginally higher (<4%) inference times and generalizes to different diffusion architectures like Flux, which do not natively support the use of a separate negative prompt. Code is available at https://negtome.github.io
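The core push-apart step can be sketched in a few lines (the matching rule and the step size `alpha` are assumptions of this illustration, not the released implementation):

    import torch
    import torch.nn.functional as F

    def negative_token_merge(out_tokens, ref_tokens, alpha=0.9):
        """For each output token, find its most similar reference token by cosine
        similarity and extrapolate the output token away from it."""
        sim = F.normalize(out_tokens, dim=-1) @ F.normalize(ref_tokens, dim=-1).t()
        matched = ref_tokens[sim.argmax(dim=-1)]        # closest reference feature per token
        return out_tokens + alpha * (out_tokens - matched)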
Authors: Xiaomin Li, Xu Jia, Qinghe Wang, Haiwen Diao, Mengmeng Ge, Pengxiang Li, You He, Huchuan Lu
Abstract: Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation with similar motion in new contexts. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance, and an appearance injection module to adapt appearance priors from video frames to the motion modeling process. These complementary multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding to further enhance the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion patterns from single or multiple reference videos, performing favorably against existing methods in customized video generation.
Authors: Xiyu Han, Xian Zhong, Wenxin Huang, Xuemei Jia, Wenxuan Liu, Xiaohan Yu, Alex Chichung Kot
Abstract: Cloth-changing person re-identification (CC-ReID) aims to match individuals across multiple surveillance cameras despite variations in clothing. Existing methods typically focus on mitigating the effects of clothing changes or enhancing ID-relevant features but often struggle to capture complex semantic information. In this paper, we propose a novel prompt learning framework, Semantic Contextual Integration (SCI), for CC-ReID, which leverages the visual-text representation capabilities of CLIP to minimize the impact of clothing changes and enhance ID-relevant features. Specifically, we introduce a Semantic Separation Enhancement (SSE) module, which uses dual learnable text tokens to separately capture confounding and clothing-related semantic information, effectively isolating ID-relevant features from distracting clothing semantics. Additionally, we develop a Semantic-Guided Interaction Module (SIM) that uses orthogonalized text features to guide visual representations, sharpening the model's focus on distinctive ID characteristics. This integration enhances the model's discriminative power and enriches the visual context with high-dimensional semantic insights. Extensive experiments on three CC-ReID datasets demonstrate that our method outperforms state-of-the-art techniques. The code will be released on GitHub.
Authors: Aniket K. Singh, Debasis Chaudhuri, Manish P. Singh, Samiran Chattopadhyay
Abstract: With the growing demand for interpretable deep learning models, this paper introduces Integrative CAM, an advanced Class Activation Mapping (CAM) technique aimed at providing a holistic view of feature importance across Convolutional Neural Networks (CNNs). Traditional gradient-based CAM methods, such as Grad-CAM and Grad-CAM++, primarily use final layer activations to highlight regions of interest, often neglecting critical features derived from intermediate layers. Integrative CAM addresses this limitation by fusing insights across all network layers, leveraging both gradient and activation scores to adaptively weight layer contributions, thus yielding a comprehensive interpretation of the model's internal representation. Our approach includes a novel bias term in the saliency map calculation, a factor frequently omitted in existing CAM techniques, but essential for capturing a more complete feature importance landscape, as modern CNNs rely on both weighted activations and biases to make predictions. Additionally, we generalize the alpha term from Grad-CAM++ to apply to any smooth function, expanding CAM applicability across a wider range of models. Through extensive experiments on diverse and complex datasets, Integrative CAM demonstrates superior fidelity in feature importance mapping, effectively enhancing interpretability for intricate fusion scenarios and complex decision-making tasks. By advancing interpretability methods to capture multi-layered model insights, Integrative CAM provides a valuable tool for fusion-driven applications, promoting the trustworthy and insightful deployment of deep learning models.
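A simplified multi-layer fusion in this spirit (equal layer weights and no bias term; the actual Integrative CAM weights layers adaptively and includes a bias contribution):

    import torch
    import torch.nn.functional as F

    def fused_multilayer_cam(activations, gradients, out_size):
        """Fuse Grad-CAM-style maps from several layers captured via hooks.
        activations / gradients: lists of (C, H, W) tensors for the target class."""
        fused = torch.zeros(out_size)
        for act, grad in zip(activations, gradients):
            weights = grad.mean(dim=(1, 2))                          # one weight per channel
            cam = F.relu((weights[:, None, None] * act).sum(dim=0))  # (H, W) saliency
            cam = F.interpolate(cam[None, None], size=out_size,
                                mode="bilinear", align_corners=False)[0, 0]
            fused += (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return fused / len(activations)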
Authors: Silvia Lucia Sanna, Leonardo Regano, Davide Maiorca, Giorgio Giacinto
Abstract: Nowadays, many tools are used to facilitate forensic tasks of data extraction and data analysis. In particular, some tools leverage Artificial Intelligence (AI) to automatically label examined data into specific categories (i.e., drugs, weapons, nudity). However, this raises a serious concern about the robustness of the employed AI algorithms against adversarial attacks. Indeed, some people may want to hide specific data from AI-based digital forensics tools, manipulating the content so that the AI system does not recognize the offensive/prohibited content and flag it as suspicious to the analyst. This could be seen as an anti-forensics attack scenario. For this reason, we analyzed two of the most important forensics tools employing AI for data classification: Magnet AI, used by Magnet Axiom, and Excire Photo AI, used by X-Ways Forensics. We made preliminary tests using about 200 images, plus another 100 sent in 3 chats, covering pornography and teenage nudity, drugs, and weapons, to understand how the tools label them. Moreover, we loaded some deepfake images (images generated by AI forging real ones) of some actors to understand whether they would be classified in the same category as the original images. From our preliminary study, we saw that the AI algorithms are not robust enough, as expected, since these topics are still open research problems. For example, some sexual images were not categorized as nudity, and some deepfakes were categorized as the same real person, whereas the human eye can clearly see the nudity or catch the difference between the deepfake and the original. Building on these results and other state-of-the-art works, we provide some suggestions for improving how digital forensics analysis tools leverage AI and their robustness against adversarial attacks or scenarios different from the ones they were trained on.
Authors: Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool
Abstract: Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. Data reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring the quality as well as the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights on the complex and fine-grained understanding of museum exhibits. In particular, we show that some questions whose answers can often be derived directly from visual features are well answered by both types of models. On the other hand, questions that require the grounding of the visual features in repositories of human knowledge are better answered by the large vision-language models, thus demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: https://github.com/insait-institute/Museum-65
Authors: Ivan DeAndres-Tame, Ruben Tolosana, Pietro Melzi, Ruben Vera-Rodriguez, Minchul Kim, Christian Rathgeb, Xiaoming Liu, Luis F. Gomez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zhizhou Zhong, Yuge Huang, Yuxi Mi, Shouhong Ding, Shuigeng Zhou, Shuai He, Lingzhi Fu, Heng Cong, Rongyu Zhang, Zhihong Xiao, Evgeny Smirnov, Anton Pimenov, Aleksei Grigorev, Denis Timoshenko, Kaleb Mesfin Asfaw, Cheng Yaw Low, Hao Liu, Chuyi Wang, Qing Zuo, Zhixiang He, Hatef Otroshi Shahreza, Anjith George, Alexander Unnervik, Parsa Rahimi, Sébastien Marcel, Pedro C. Neto, Marco Huber, Jan Niklas Kolf, Naser Damer, Fadi Boutros, Jaime S. Cardoso, Ana F. Sequeira, Andrea Atzori, Gianni Fenu, Mirko Marras, Vitomir Štruc, Jiang Yu, Zhangjie Li, Jichun Li, Weisong Zhao, Zhen Lei, Xiangyu Zhu, Xiao-Yu Zhang, Bernardo Biesseck, Pedro Vidal, Luiz Coelho, Roger Granada, David Menotti
Abstract: Synthetic data is gaining increasing popularity for face recognition technologies, mainly due to the privacy concerns and challenges associated with obtaining real data, including diverse scenarios, quality, and demographic groups, among others. It also offers some advantages over real data, such as the large amount of data that can be generated or the ability to customize it to adapt to specific problem-solving needs. To effectively use such data, face recognition models should also be specifically designed to exploit synthetic data to its fullest potential. In order to promote the proposal of novel Generative AI methods and synthetic data, and investigate the application of synthetic data to better train face recognition systems, we introduce the 2nd FRCSyn-onGoing challenge, based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. This is an ongoing challenge that provides researchers with an accessible platform to benchmark i) the proposal of novel Generative AI methods and synthetic data, and ii) novel face recognition systems that are specifically proposed to take advantage of synthetic data. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition such as demographic bias, domain adaptation, and performance constraints in demanding situations, such as age disparities between training and testing, changes in the pose, or occlusions. Very interesting findings are obtained in this second edition, including a direct comparison with the first one, in which synthetic databases were restricted to DCFace and GANDiffFace.
Authors: Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, Danda Pani Paudel
Abstract: 3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered by current works. In this work, we address this shortcoming and introduce (1) an expertly curated dataset in the Universal Scene Description (USD) format, featuring high-quality manual annotations for instance segmentation and articulation on 280 indoor scenes; (2) a learning-based model together with a novel baseline capable of predicting part segmentation along with a full specification of motion attributes, including motion type, articulated and interactable parts, and motion parameters; (3) a benchmark serving to compare upcoming methods for the task at hand. Overall, our dataset provides 8 types of annotations - object and part segmentations, motion types, movable and interactable parts, motion parameters, connectivity, and object mass annotations. With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models. All data is provided in the USD format, allowing interoperability and easy integration with downstream tasks. We provide open access to our dataset, benchmark, and method's source code.
Authors: Zhuoxiao Li, Shanliang Yao, Qizhong Gao, Angel F. Garcia-Fernandez, Yong Yue, Xiaohui Zhu
Abstract: While Gaussian Splatting (GS) demonstrates efficient and high-quality scene rendering and small area surface extraction ability, it falls short in handling large-scale aerial image surface extraction tasks. To overcome this, we present ULSR-GS, a framework dedicated to high-fidelity surface extraction in ultra-large-scale scenes, addressing the limitations of existing GS-based mesh extraction methods. Specifically, we propose a point-to-photo partitioning approach combined with a multi-view optimal view matching principle to select the best training images for each sub-region. Additionally, during training, ULSR-GS employs a densification strategy based on multi-view geometric consistency to enhance surface extraction details. Experimental results demonstrate that ULSR-GS outperforms other state-of-the-art GS-based works on large-scale aerial photogrammetry benchmark datasets, significantly improving surface extraction accuracy in complex urban environments. Project page: https://ulsrgs.github.io.
Authors: Thi-Nhu-Quynh Nguyen, Quang-Huy Ho, Duy-Thai Nguyen, Hoang-Minh-Quang Le, Van-Truong Pham, Thi-Thao Tran
Abstract: Early detection of skin abnormalities plays a crucial role in diagnosing and treating skin cancer. Segmentation of affected skin regions using AI-powered devices is relatively common and supports the diagnostic process. However, achieving high performance remains a significant challenge due to the need for high-resolution images and the often unclear boundaries of individual lesions. At the same time, medical devices require segmentation models to have a small memory footprint and low computational cost. Based on these requirements, we introduce a novel lightweight model called MambaU-Lite, which combines the strengths of Mamba and CNN architectures, featuring just over 400K parameters and a computational cost of more than 1G FLOPs. To enhance both global context and local feature extraction, we propose the P-Mamba block, a novel component that incorporates VSS blocks alongside multiple pooling layers, enabling the model to effectively learn multiscale features and enhance segmentation performance. We evaluate the model's performance on two skin datasets, ISIC2018 and PH2, yielding promising results. Our source code will be made publicly available at: https://github.com/nqnguyen812/MambaU-Lite.
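The P-Mamba block is described only at a high level; the PyTorch sketch below is one rough reading of it, pairing a stand-in for the VSS block with parallel pooling branches fused by a pointwise convolution. The `VSSBlockStub`, channel sizes, and pooling scales are illustrative assumptions, not the released MambaU-Lite code.

```python
import torch
import torch.nn as nn

class VSSBlockStub(nn.Module):
    """Placeholder for a visual state space (VSS) block; a real implementation
    would use a Mamba-style selective scan instead of this conv mixer."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.GELU(),
        )

    def forward(self, x):
        return x + self.mix(x)

class PMambaSketch(nn.Module):
    """Hypothetical P-Mamba-style block: global context via the VSS stub,
    local multiscale context via average-pooling branches."""
    def __init__(self, channels, pool_sizes=(2, 4, 8)):
        super().__init__()
        self.vss = VSSBlockStub(channels)
        self.pools = nn.ModuleList(
            nn.Sequential(nn.AvgPool2d(p), nn.Conv2d(channels, channels, 1))
            for p in pool_sizes
        )
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.vss(x)]
        for pool in self.pools:
            y = pool(x)
            feats.append(nn.functional.interpolate(
                y, size=(h, w), mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(1, 32, 64, 64)
print(PMambaSketch(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```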
Authors: Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong
Abstract: Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real-world autonomous driving system uses multiple input modalities, usually cameras and LiDARs, which contain complementary information for generation; existing generation methods ignore this crucial feature, so their generated results cover only separate 2D or 3D information. In order to fill the gap in 2D-3D multi-modal joint generation for autonomous driving, in this paper, we propose our framework, \emph{HoloDrive}, to jointly generate the camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projecting from image space to BEV space, then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single frame generation and world model benchmarks, and demonstrate our method leads to significant performance gains over SOTA methods in terms of generation metrics.
Authors: Peilin Zhou, Bo Du, Yongchao Xu
Abstract: Recent trends in cell segmentation have shifted towards universal models to handle diverse cell morphologies and imaging modalities. However, for continuously emerging cell types and imaging techniques, these models still require hundreds or thousands of annotated cells for fine-tuning. We introduce CellSeg1, a practical solution for segmenting cells of arbitrary morphology and modality with a few dozen cell annotations in 1 image. By adopting Low-Rank Adaptation of the Segment Anything Model (SAM), we achieve robust cell segmentation. Tested on 19 diverse cell datasets, CellSeg1 trained on 1 image achieved 0.81 average mAP at 0.5 IoU, performing comparably to existing models trained on over 500 images. It also demonstrated superior generalization in cross-dataset tests on TissueNet. We found that high-quality annotation of a few dozen densely packed cells of varied sizes is key to effective segmentation. CellSeg1 provides an efficient solution for cell segmentation with minimal annotation effort.
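Low-Rank Adaptation of a frozen projection, as used here to adapt SAM, is straightforward to sketch generically. The wrapper below is a minimal LoRA layer, not the CellSeg1 implementation; the rank, scaling, and the choice of which projections to wrap are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W x + (B A x) * scale."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Example: adapt a single projection; in practice one would wrap projections
# inside the SAM image encoder's attention layers.
proj = nn.Linear(256, 256)
adapted = LoRALinear(proj, rank=4)
out = adapted(torch.randn(2, 64, 256))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(out.shape, trainable)   # only the low-rank factors are trainable
```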
Authors: Yonghao Dang, Liyuan Liu, Hui Kang, Ping Ye, Jianqin Yin
Abstract: Real-time 2D keypoint detection plays an essential role in computer vision. Although CNN-based and Transformer-based methods have achieved breakthrough progress, they often fail to deliver both superior performance and real-time speed. This paper introduces MamKPD, the first efficient yet effective mamba-based pose estimation framework for 2D keypoint detection. The conventional Mamba module exhibits limited information interaction between patches. To address this, we propose a lightweight contextual modeling module (CMM) that uses depth-wise convolutions to model inter-patch dependencies and linear layers to distill the pose cues within each patch. Subsequently, by combining Mamba for global modeling across all patches, MamKPD effectively extracts instances' pose information. We conduct extensive experiments on human and animal pose estimation datasets to validate the effectiveness of MamKPD. Our MamKPD-L achieves 77.3% AP on the COCO dataset with 1492 FPS on an NVIDIA RTX 4090 GPU. Moreover, MamKPD achieves state-of-the-art results on the MPII dataset and competitive results on the AP-10K dataset while saving 85% of the parameters compared to ViTPose. Our project page is available at https://mamkpd.github.io/.
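The contextual modeling module (CMM) is summarized as depth-wise convolutions for inter-patch dependencies plus linear layers distilling per-patch pose cues; a rough PyTorch reading of that description follows. The token layout, dimensions, and residual placement are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CMMSketch(nn.Module):
    """Hypothetical contextual modeling module: depth-wise conv across the
    patch grid for inter-patch interaction, then a per-patch MLP."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, C) patch embeddings laid out on an H x W grid
        b, n, c = tokens.shape
        h, w = grid_hw
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.dw(x).flatten(2).transpose(1, 2)   # inter-patch mixing
        x = tokens + x
        return x + self.mlp(self.norm(x))           # per-patch distillation

tokens = torch.randn(2, 16 * 12, 192)
print(CMMSketch(192)(tokens, (16, 12)).shape)       # torch.Size([2, 192, 192])
```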
Authors: Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Abstract: Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: larger-scale real-world samples and more diverse degradation types. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.
Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
Abstract: Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap in controllable camera-pose perspectives. Existing works such as OpenSora do not adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.
Authors: Xiaoguang Han, Yushuang Wu, Luyue Shi, Haolin Liu, Hongjie Liao, Lingteng Qiu, Weihao Yuan, Xiaodong Gu, Zilong Dong, Shuguang Cui
Abstract: MVImgNet is a large-scale dataset that contains multi-view images of ~220k real-world objects in 238 classes. As a counterpart of ImageNet, it introduces 3D visual signals via multi-view shooting, making a soft bridge between 2D and 3D vision. This paper constructs the MVImgNet2.0 dataset that expands MVImgNet into a total of ~520k objects and 515 categories, which derives a 3D dataset with a larger scale that is more comparable to ones in the 2D domain. In addition to the expanded dataset scale and category range, MVImgNet2.0 is of a higher quality than MVImgNet owing to four new features: (i) most shoots capture 360-degree views of the objects, which can support the learning of object reconstruction with completeness; (ii) the segmentation manner is advanced to produce foreground object masks of higher accuracy; (iii) a more powerful structure-from-motion method is adopted to derive the camera pose for each frame of a lower estimation error; (iv) higher-quality dense point clouds are reconstructed via advanced methods for objects captured in 360-degree views, which can serve for downstream applications. Extensive experiments confirm the value of the proposed MVImgNet2.0 in boosting the performance of large 3D reconstruction models. MVImgNet2.0 will be public at luyues.github.io/mvimgnet2, including multi-view images of all 520k objects, the reconstructed high-quality point clouds, and data annotation codes, hoping to inspire the broader vision community.
Authors: Mona Alawadh, Mahesan Niranjan, Hansung Kim
Abstract: Semantic Scene Completion (SSC) is a critical task in computer vision that is utilized in applications such as virtual reality (VR). SSC aims to construct detailed 3D models from partial views by transforming a single 2D image into a 3D representation, assigning each voxel a semantic label. The main challenge lies in completing 3D volumes with limited information, compounded by data imbalance, inter-class ambiguity, and intra-class diversity in indoor scenes. To address this, we propose the Multi-Feature Data Balancing Network (MDBNet), a dual-head model for RGB and depth data (F-TSDF) inputs. Our hybrid encoder-decoder architecture with identity transformation in a pre-activation residual module (ITRM) effectively manages diverse signals within F-TSDF. We evaluate RGB feature fusion strategies and use a combined loss function: cross-entropy for 2D RGB features and weighted cross-entropy for 3D SSC predictions. MDBNet results surpass comparable state-of-the-art (SOTA) methods on the NYU datasets, demonstrating the effectiveness of our approach.
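The combined objective reads as a sum of a plain cross-entropy on 2D RGB predictions and a class-weighted cross-entropy on 3D SSC voxel predictions; a minimal sketch under those assumptions is shown below, with the class weights, ignore index, and balancing factor as placeholders.

```python
import torch
import torch.nn as nn

def mdbnet_style_loss(logits_2d, labels_2d, logits_3d, labels_3d,
                      class_weights_3d, lam=1.0, ignore_index=255):
    """Hypothetical combined loss: plain CE on 2D semantic logits plus
    class-weighted CE on 3D SSC voxel logits."""
    ce_2d = nn.functional.cross_entropy(logits_2d, labels_2d, ignore_index=ignore_index)
    ce_3d = nn.functional.cross_entropy(logits_3d, labels_3d,
                                        weight=class_weights_3d,
                                        ignore_index=ignore_index)
    return ce_2d + lam * ce_3d

# Toy example: 12 classes, a 2D label map and a 3D voxel grid.
logits_2d = torch.randn(2, 12, 60, 80)
labels_2d = torch.randint(0, 12, (2, 60, 80))
logits_3d = torch.randn(2, 12, 36, 60, 60)
labels_3d = torch.randint(0, 12, (2, 36, 60, 60))
weights = torch.ones(12)   # e.g. inverse-frequency class weights in practice
print(mdbnet_style_loss(logits_2d, labels_2d, logits_3d, labels_3d, weights))
```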
Authors: Zhixiang Wang, Guangnan Ye, Xiaosen Wang, Siheng Chen, Zhibo Wang, Xingjun Ma, Yu-Gang Jiang
Abstract: Physical adversarial patches printed on clothing can easily allow individuals to evade person detectors. However, most existing adversarial patch generation methods prioritize attack effectiveness over stealthiness, resulting in patches that are aesthetically unpleasing. Although existing methods using generative adversarial networks or diffusion models can produce more natural-looking patches, they often struggle to balance stealthiness with attack effectiveness and lack flexibility for user customization. To address these challenges, we propose a novel diffusion-based customizable patch generation framework termed DiffPatch, specifically tailored for creating naturalistic and customizable adversarial patches. Our approach enables users to utilize a reference image as the source, rather than starting from random noise, and incorporates masks to craft naturalistic patches of various shapes, not limited to squares. To prevent the original semantics from being lost during the diffusion process, we employ Null-text inversion to map random noise samples to a single input image and generate patches through Incomplete Diffusion Optimization (IDO). Notably, while maintaining a natural appearance, our method achieves a comparable attack performance to state-of-the-art non-naturalistic patches when using similarly sized attacks. Using DiffPatch, we have created a physical adversarial T-shirt dataset, AdvPatch-1K, specifically targeting YOLOv5s. This dataset includes over a thousand images across diverse scenarios, validating the effectiveness of our attack in real-world environments. Moreover, it provides a valuable resource for future research.
Authors: MD Raqib Khan, Anshul Negi, Ashutosh Kulkarni, Shruti S. Phutke, Santosh Kumar Vipparthi, Subrahmanyam Murala
Abstract: Quality degradation is observed in underwater images due to the effects of light refraction and absorption by water, leading to issues like color cast, haziness, and limited visibility. This degradation negatively affects the performance of autonomous underwater vehicles used in marine applications. To address these challenges, we propose a lightweight phase-based transformer network with 1.77M parameters for underwater image restoration (UIR). Our approach focuses on effectively extracting non-contaminated features using a phase-based self-attention mechanism. We also introduce an optimized phase attention block to restore structural information by propagating prominent attentive features from the input. We evaluate our method on both synthetic (UIEB, UFO-120) and real-world (UIEB, U45, UCCS, SQUID) underwater image datasets. Additionally, we demonstrate its effectiveness for low-light image enhancement using the LOL dataset. Through extensive ablation studies and comparative analysis, it is clear that the proposed approach outperforms existing state-of-the-art (SOTA) methods.
Authors: Linh Trinh, Siegfried Mercelis, Ali Anwar
Abstract: Unmanned Surface Vehicles (USVs) have emerged as a major platform in maritime operations, capable of supporting a wide range of applications. USVs can help reduce labor costs, increase safety, save energy, and allow for difficult unmanned tasks in harsh maritime environments. With the rapid development of USVs, many vision tasks such as detection and segmentation become increasingly important. Datasets play an important role in encouraging and improving the research and development of reliable vision algorithms for USVs. In this regard, a large number of recent studies have focused on the release of vision datasets for USVs. Along with the development of datasets, a variety of deep learning techniques have also been studied, with a focus on USVs. However, there is a lack of a systematic review of recent studies in both datasets and vision techniques to provide a comprehensive picture of the current development of vision on USVs, including limitations and trends. In this study, we provide a comprehensive review of both USV datasets and deep learning techniques for vision tasks. Our review was conducted using a large number of vision datasets from USVs. We elaborate several challenges and potential opportunities for research and development in USV vision based on a thorough analysis of current datasets and deep learning techniques.
Authors: Qirui Yang, Yinbo Li, Peng-Tao Jiang, Qihua Cheng, Biting Yu, Yihao Liu, Huanjing Yue, Jingyu Yang
Abstract: Previous tone mapping methods mainly focus on how to enhance tones in low-resolution images and recover details using the high-frequency components extracted from the input image. These methods typically rely on traditional feature pyramids to artificially extract high-frequency components, such as Laplacian and Gaussian pyramids with handcrafted kernels. However, traditional handcrafted features struggle to effectively capture the high-frequency components in HDR images, resulting in excessive smoothing and loss of detail in the output image. To mitigate the above issue, we introduce a learnable Differential Pyramid Representation Network (DPRNet). Based on the learnable differential pyramid, our DPRNet can capture detailed textures and structures, which is crucial for high-quality tone mapping recovery. In addition, to achieve global consistency and local contrast harmonization, we design a global tone perception module and a local tone tuning module that ensure the consistency of global tuning and the accuracy of local tuning, respectively. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods, improving PSNR by 2.58 dB on the HDR+ dataset and 3.31 dB on the HDRI Haven dataset compared with the second-best method. Notably, our method exhibits the best generalization ability on non-homologous image and video tone mapping operations. We provide an anonymous online demo at https://xxxxxx2024.github.io/DPRNet/.
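A learnable differential pyramid can be approximated as a Laplacian-style pyramid whose downsampling and upsampling filters are learned rather than handcrafted. The sketch below follows that reading and is not the authors' DPRNet code; layer choices and level count are assumptions.

```python
import torch
import torch.nn as nn

class LearnableDiffPyramid(nn.Module):
    """Laplacian-like pyramid with learned filters: each level stores the
    residual between the input and its learned down/up reconstruction."""
    def __init__(self, channels=3, levels=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1) for _ in range(levels))
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1) for _ in range(levels))

    def forward(self, x):
        highs, cur = [], x
        for down, up in zip(self.down, self.up):
            low = down(cur)
            highs.append(cur - up(low))   # learned high-frequency residual
            cur = low
        return highs, cur                 # per-level details + coarsest level

highs, base = LearnableDiffPyramid()(torch.randn(1, 3, 256, 256))
print([h.shape for h in highs], base.shape)
```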
Authors: Sangbeom Lim, Seongchan Kim, Seungjun An, Seokju Cho, Paul Hongsuck Seo, Seungryong Kim
Abstract: Current benchmarks for video segmentation are limited to annotating only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to real-world scenarios. Thus, developing a new video segmentation dataset aimed at tracking multi-granularity segmentation targets in the video scene is necessary. In this work, we aim to generate a multi-granularity video segmentation dataset that is annotated for both salient and non-salient masks. To achieve this, we propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset that includes various types and granularities of mask annotations. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset, which leads to the best performance among existing video object segmentation methods and SAM-based video segmentation methods. Project page is available at https://cvlab-kaist.github.io/MUG-VOS.
Authors: Nitish Mital, Simon Malzard, Richard Walters, Celso M. De Melo, Raghuveer Rao, Victoria Nockles
Abstract: In many computer vision domains the collection of sufficient real-world data is challenging and can severely impact model performance, particularly when running inference on samples that are unseen or underrepresented in training. Synthetically generated images provide a promising solution, but it remains unclear how to design synthetic data to optimally improve model performance, for example whether to introduce more realism or more abstraction in such datasets. Here we propose a novel conceptual approach to improve the performance of computer vision models trained on synthetic images, by using robust Explainable AI (XAI) techniques to guide the modification of 3D models used to generate these images. Importantly, this framework allows both modifications that increase and decrease realism in synthetic data, which can both improve model performance. We illustrate this concept using a real-world example where data are sparse; the detection of vehicles in infrared imagery. We fine-tune an initial YOLOv8 model on the ATR DSIAC infrared dataset and synthetic images generated from 3D mesh models in the Unity gaming engine, and then use XAI saliency maps to guide modification of our Unity models. We show that synthetic data can improve detection of vehicles in orientations unseen in training by 4.6\% (to mAP50 scores of 94.6\%). We further improve performance by an additional 1.5\% (to 96.1\%) through our new XAI-guided approach, which reduces misclassifications through both increasing and decreasing the realism of different parts of the synthetic data. These proof-of-concept results pave the way for fine, XAI-controlled curation of synthetic datasets through detailed feature modifications, tailored to improve object detection performance.
Authors: Cong Xie, Han Zou, Ruiqi Yu, Yan Zhang, Zhenpeng Zhan
Abstract: In this work, we are interested in achieving both high text controllability and overall appearance consistency in the generation of personalized human characters. We propose a novel framework, named SerialGen, which is a serial generation method consisting of two stages: first, a standardization stage that standardizes reference images, and then a personalized generation stage based on the standardized reference. Furthermore, we introduce two modules aimed at enhancing the standardization process. Our experimental results validate the proposed framework's ability to produce personalized images that faithfully recover the reference image's overall appearance while accurately responding to a wide range of text prompts. Through thorough analysis, we highlight the critical contribution of the proposed serial generation method and standardization model, evidencing enhancements in appearance consistency between reference and output images and across serial outputs generated from diverse text prompts. The term "Serial" in this work carries a double meaning: it refers to the two-stage method and also underlines our ability to generate serial images with consistent appearance throughout.
Authors: Qirui Yang, Peng-Tao Jiang, Hao Zhang, Jinwei Chen, Bo Li, Huanjing Yue, Jingyu Yang
Abstract: Learning lighting adaptation is a key step in obtaining a good visual perception and supporting downstream vision tasks. There are multiple light-related tasks (e.g., image retouching and exposure correction) and previous studies have mainly investigated these tasks individually. However, we observe that the light-related tasks share fundamental properties: i) different color channels have different light properties, and ii) the channel differences reflected in the time and frequency domains are different. Based on the common light property guidance, we propose a Learning Adaptive Lighting Network (LALNet), a unified framework capable of processing different light-related tasks. Specifically, we introduce the color-separated features that emphasize the light difference of different color channels and combine them with the traditional color-mixed features by Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features focusing on channel differences and ensuring visual consistency across channels. We introduce dual domain channel modulation to generate color-separated features and a wavelet transform followed by a vision state space module to generate color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests and requires fewer computational resources. We provide an anonymous online demo at https://xxxxxx2025.github.io/LALNet/.
Authors: Nicholas Konz, Yuwen Chen, Hanxue Gu, Haoyu Dong, Yaqian Chen, Maciej A. Mazurowski
Abstract: Determining whether two sets of images belong to the same or different domain is a crucial task in modern medical image analysis and deep learning, where domain shift is a common problem that often results in decreased model performance. This determination is also important to evaluate the output quality of generative models, e.g., image-to-image translation models used to mitigate domain shift. Current metrics for this either rely on the (potentially biased) choice of some downstream task such as segmentation, or adopt task-independent perceptual metrics (e.g., FID) from natural imaging which insufficiently capture anatomical consistency and realism in medical images. We introduce a new perceptual metric tailored for medical images: Radiomic Feature Distance (RaD), which utilizes standardized, clinically meaningful and interpretable image features. We show that RaD is superior to other metrics for out-of-domain (OOD) detection in a variety of experiments. Furthermore, RaD outperforms previous perceptual metrics (FID, KID, etc.) for image-to-image translation by correlating more strongly with downstream task performance as well as anatomical consistency and realism, and shows similar utility for evaluating unconditional image generation. RaD also offers additional benefits such as interpretability, as well as stability and computational efficiency at low sample sizes. Our results are supported by broad experiments spanning four multi-domain medical image datasets, nine downstream tasks, six image translation models, and other factors, highlighting the broad potential of RaD for medical image analysis.
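The abstract specifies only that RaD is a distance over standardized radiomic features; one plausible instantiation, sketched below under that assumption, is a Fréchet-style distance between the radiomic feature distributions of two image sets. The feature source and the exact distance are placeholders, not the paper's definition.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fit to two feature sets (n_samples, n_features)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # drop tiny imaginary parts from sqrtm
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def rad_like_score(radiomics_a, radiomics_b):
    """Standardize radiomic feature matrices jointly, then compare distributions."""
    stacked = np.vstack([radiomics_a, radiomics_b])
    mean, std = stacked.mean(0), stacked.std(0) + 1e-8
    return frechet_distance((radiomics_a - mean) / std, (radiomics_b - mean) / std)

# Toy example with random "radiomic" features (e.g. from pyradiomics in practice).
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 16))
b = rng.normal(loc=0.5, size=(200, 16))
print(rad_like_score(a, a[:100]), rad_like_score(a, b))   # shifted set scores higher
```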
Authors: Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang
Abstract: We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
Authors: Anton Nuzhdin, Alexander Nagaev, Alexander Sautin, Alexander Kapitanov, Karina Kvanchiani
Abstract: This paper proposes the second version of the widespread Hand Gesture Recognition dataset HaGRID -- HaGRIDv2. We cover 15 new gestures with conversation and control functions, including two-handed ones. Building on the foundational concepts proposed by HaGRID's authors, we implemented the dynamic gesture recognition algorithm and further enhanced it by adding three new groups of manipulation gestures. The "no gesture" class was diversified by adding samples of natural hand movements, which allowed us to reduce false positives by a factor of 6. By combining the extra samples with HaGRID, the resulting version outperforms the original for pre-training models on gesture-related tasks. Besides, we achieved the best generalization ability among gesture and hand detection datasets. In addition, the second version enhances the quality of the gestures generated by the diffusion model. HaGRIDv2, pre-trained models, and a dynamic gesture recognition algorithm are publicly available.
Authors: Xi Guo, Chenjing Ding, Haoxuan Dou, Xin Zhang, Weixuan Tang, Wei Wu
Abstract: Autonomous driving systems struggle with complex scenarios due to limited access to diverse, extensive, and out-of-distribution driving data which are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state-of-the-art performance in high fidelity, consistency, and diversity with minute-scale video generation. InfinityDrive introduces an efficient spatio-temporal co-modeling module paired with an extended temporal training strategy, enabling high-resolution (576$\times$1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, it achieves consistent video generation lasting over 1500 frames (approximately 2 minutes). Comprehensive experiments on multiple datasets validate InfinityDrive's ability to generate complex and varied scenarios, highlighting its potential as a next-generation driving world model built for the evolving demands of autonomous driving. Our project homepage: https://metadrivescape.github.io/papers_project/InfinityDrive/page.html
Authors: Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer
Abstract: Despite ongoing research on the topic of adversarial examples in deep learning for computer vision, some fundamentals of the nature of these attacks remain unclear. As the manifold hypothesis posits, high-dimensional data tends to be part of a low-dimensional manifold. To test this hypothesis for adversarial patches, this paper analyzes a set of adversarial patches and investigates the reconstruction abilities of three different dimensionality reduction methods. Quantitatively, the performance of reconstructed patches in an attack setting is measured and the impact of patches sampled from the latent space during adversarial training is investigated. The evaluation is performed on two publicly available datasets for person detection. The results indicate that more sophisticated dimensionality reduction methods offer no advantages over a simple principal component analysis.
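The simplest of the compared reductions, principal component analysis, is easy to reproduce: flatten the patches, fit PCA, reconstruct, and optionally sample new patches from the latent space. The snippet below is a generic sketch of that step; the patch tensor and component count are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder: a stack of adversarial patches as flattened RGB images.
rng = np.random.default_rng(0)
patches = rng.random((500, 64 * 64 * 3)).astype(np.float32)   # (n_patches, d)

pca = PCA(n_components=32)                 # low-dimensional latent space
latents = pca.fit_transform(patches)       # project patches onto the manifold
reconstructed = pca.inverse_transform(latents)

recon_error = np.mean((patches - reconstructed) ** 2, axis=1)
print("mean reconstruction MSE:", recon_error.mean())
print("explained variance:", pca.explained_variance_ratio_.sum())

# Sampling new patches from the latent space (as done for adversarial training):
sampled = pca.inverse_transform(rng.normal(scale=latents.std(0), size=(8, 32)))
print("sampled patches:", sampled.shape)
```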
Authors: Xingyu Chen, Zhuheng Song, Xiaoke Jiang, Yaoqing Hu, Junzhi Yu, Lei Zhang
Abstract: Existing approaches of hand reconstruction predominantly adhere to a multi-stage framework, encompassing detection, left-right classification, and pose estimation. This paradigm induces redundant computation and cumulative errors. In this work, we propose HandOS, an end-to-end framework for 3D hand reconstruction. Our central motivation lies in leveraging a frozen detector as the foundation while incorporating auxiliary modules for 2D and 3D keypoint estimation. In this manner, we integrate the pose estimation capacity into the detection framework, while at the same time obviating the necessity of using the left-right category as a prerequisite. Specifically, we propose an interactive 2D-3D decoder, where 2D joint semantics is derived from detection cues while 3D representation is lifted from those of 2D joints. Furthermore, hierarchical attention is designed to enable the concurrent modeling of 2D joints, 3D vertices, and camera translation. Consequently, we achieve an end-to-end integration of hand detection, 2D pose estimation, and 3D mesh reconstruction within a one-stage framework, so that the above multi-stage drawbacks are overcome. Meanwhile, the HandOS reaches state-of-the-art performances on public benchmarks, e.g., 5.0 PA-MPJPE on FreiHand and 64.6\% PCK@0.05 on HInt-Ego4D. Project page: idea-research.github.io/HandOSweb.
Authors: Christina Kassab, Matías Mattamala, Sacha Morin, Martin Büchner, Abhinav Valada, Liam Paull, Maurice Fallon
Abstract: 3D open-vocabulary scene graph methods are a promising map representation for embodied agents, however many current approaches are computationally expensive. In this paper, we reexamine the critical design choices established in previous works to optimize both efficiency and performance. We propose a general scene graph framework and conduct three studies that focus on image pre-processing, feature fusion, and feature selection. Our findings reveal that commonly used image pre-processing techniques provide minimal performance improvement while tripling computation (on a per object view basis). We also show that averaging feature labels across different views significantly degrades performance. We study alternative feature selection strategies that enhance performance without adding unnecessary computational costs. Based on our findings, we introduce a computationally balanced approach for 3D point cloud segmentation with per-object features. The approach matches state-of-the-art classification accuracy while achieving a threefold reduction in computation.
Authors: Yufeng Jin, Vignesh Prasad, Snehal Jauhri, Mathias Franzius, Georgia Chalvatzaki
Abstract: Efficient and accurate object pose estimation is an essential component for modern vision systems in many applications such as Augmented Reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load in rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation \& tracking with a single RGB-D camera by effectively leveraging advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the necessary efficiency and accuracy for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity statistic-based pruning mechanism for adaptive Gaussian density control, to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5$\times$ speedup. We also demonstrate the method's suitability for live, dynamic object tracking and reconstruction in a real-world setting.
Authors: Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, Jingya Wang
Abstract: 3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations. Existing efforts typically adhere to single-object, single-affordance paradigms, where each affordance type or explicit instruction strictly corresponds to a specific affordance region and are unable to handle long-horizon tasks. Such a paradigm cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from cumbersome user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the benchmark, we propose our model, SeqAfford, to unlock the 3D multi-modal large language model with additional affordance segmentation abilities, which ensures reasoning with world knowledge and fine-grained affordance grounding in a cohesive framework. We further introduce a multi-granular language-point integration module to endow 3D dense prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization with sequential reasoning abilities.
Authors: Xingyu Liu, Yingyue Li, Chengxi Li, Gu Wang, Chenyangguang Zhang, Ziqin Huang, Xiangyang Ji
Abstract: In this report, we provide the technical details of the submitted method GFreeDet, which exploits Gaussian splatting and vision Foundation models for the model-free unseen object Detection track in the BOP 2024 Challenge.
Authors: Bo Ji, Angela Yao
Abstract: Standard 3D Gaussian Splatting (3DGS) relies on known or pre-computed camera poses and a sparse point cloud, obtained from structure-from-motion (SfM) preprocessing, to initialize and grow 3D Gaussians. We propose a novel SfM-Free 3DGS (SFGS) method for video input, eliminating the need for known camera poses and SfM preprocessing. Our approach introduces a hierarchical training strategy that trains and merges multiple 3D Gaussian representations -- each optimized for specific scene regions -- into a single, unified 3DGS model representing the entire scene. To compensate for large camera motions, we leverage video frame interpolation models. Additionally, we incorporate multi-source supervision to reduce overfitting and enhance representation. Experimental results reveal that our approach significantly surpasses state-of-the-art SfM-free novel view synthesis methods. On the Tanks and Temples dataset, we improve PSNR by an average of 2.25dB, with a maximum gain of 3.72dB in the best scene. On the CO3D-V2 dataset, we achieve an average PSNR boost of 1.74dB, with a top gain of 3.90dB. The code is available at https://github.com/jibo27/3DGS_Hierarchical_Training.
Authors: MD Shaikh Rahman, Syed Maudud E Rabbi, Muhammad Mahbubur Rashid
Abstract: Approximate Nearest Neighbor (ANN) search is one of the keys to large-scale data retrieval performance in many applications. This work bridges feature extraction and ANN indexing by combining a fine-tuned ResNet50 model with various ANN methods: FAISS and Annoy. We evaluate the systems with respect to indexing time, memory usage, query time, precision, recall, F1-score, and Recall@5 on a custom image dataset. FAISS's Product Quantization can achieve a precision of 98.40% with low memory usage at 0.24 MB index size, and Annoy is the fastest, with average query times of 0.00015 seconds, at a slight cost to accuracy. These results reveal trade-offs among speed, accuracy, and memory efficiency and offer actionable insights into the optimization of feature-based image retrieval systems. This study can serve as a blueprint for constructing practical retrieval pipelines built on fine-tuned deep learning networks and associated ANN methods.
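For the indexing side, the sketch below builds a FAISS product-quantization index and an Annoy index over a placeholder embedding matrix; the ResNet50 feature extraction is omitted, and the dimensions and hyperparameters are illustrative rather than the paper's settings.

```python
import numpy as np
import faiss                      # pip install faiss-cpu
from annoy import AnnoyIndex      # pip install annoy

dim = 2048                                            # e.g. pooled ResNet50 features
rng = np.random.default_rng(0)
feats = rng.random((10000, dim)).astype("float32")    # placeholder embeddings
query = feats[:5]

# FAISS with Product Quantization: compact index, approximate distances.
index_pq = faiss.IndexPQ(dim, 64, 8)   # 64 sub-quantizers, 8 bits each
index_pq.train(feats)
index_pq.add(feats)
distances, ids = index_pq.search(query, 5)
print("FAISS PQ neighbours for first query:", ids[0])

# Annoy: forest of random-projection trees, very fast queries.
annoy_index = AnnoyIndex(dim, "euclidean")
for i, vec in enumerate(feats):
    annoy_index.add_item(i, vec)
annoy_index.build(50)              # number of trees
print("Annoy neighbours for first query:", annoy_index.get_nns_by_vector(query[0], 5))
```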
Authors: Hao Tang, Zechao Li, Dong Zhang, Shengfeng He, Jinhui Tang
Abstract: RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities. Inspired by hierarchical human visual systems, we propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. ConTriNet presents several notable advantages. It incorporates a Modality-induced Feature Modulator in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in the separated flows enlarges the receptive field, allowing for the capture of multi-scale contextual information. Furthermore, a Modality-aware Dynamic Aggregation Module in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine saliency maps derived from different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios.
Authors: Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
Abstract: Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) Uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at https://github.com/dpaul06/VideoLights .
Authors: Bo Ji, Angela Yao
Abstract: State-of-the-art video deblurring methods use deep network architectures to recover sharpened video frames. Blurring especially degrades high-frequency (HF) information, yet this aspect is often overlooked by recent models that focus more on enhancing architectural design. Recovering these fine details is challenging, partly due to the spectral bias of neural networks, which are inclined towards learning low-frequency functions. To address this, we enforce explicit network structures to capture the fine details and edges. We dynamically predict adaptive high-pass kernels from a linear combination of high-pass basis kernels to extract high-frequency features. This strategy is highly efficient, resulting in low-memory footprints for training and fast run times for inference, all while achieving state-of-the-art when compared to low-budget models. The code is available at https://github.com/jibo27/AHFNet.
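The central mechanism, predicting mixing coefficients over a fixed bank of high-pass basis kernels and applying the resulting adaptive filter, can be sketched compactly. The version below predicts one coefficient vector per image rather than per pixel, and all sizes and the basis construction are assumptions rather than the released AHFNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveHighPass(nn.Module):
    """Predict mixing coefficients over fixed high-pass basis kernels and
    apply the combined kernel depth-wise to extract high-frequency features."""
    def __init__(self, channels, num_basis=4, ksize=3):
        super().__init__()
        basis = torch.randn(num_basis, ksize, ksize)
        basis -= basis.mean(dim=(1, 2), keepdim=True)    # zero-mean kernels act as high-pass
        self.register_buffer("basis", basis)
        self.predict = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(channels, num_basis))
        self.channels, self.ksize = channels, ksize

    def forward(self, x):
        b = x.shape[0]
        coeff = torch.softmax(self.predict(x), dim=1)            # (B, num_basis)
        kernel = torch.einsum("bn,nij->bij", coeff, self.basis)   # (B, k, k)
        kernel = kernel[:, None].expand(b, self.channels, self.ksize, self.ksize)
        kernel = kernel.reshape(b * self.channels, 1, self.ksize, self.ksize)
        out = F.conv2d(x.reshape(1, b * self.channels, *x.shape[-2:]),
                       kernel, padding=self.ksize // 2, groups=b * self.channels)
        return out.reshape_as(x)

x = torch.randn(2, 16, 64, 64)
print(AdaptiveHighPass(16)(x).shape)   # torch.Size([2, 16, 64, 64])
```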
Authors: Miroslav Purkrabek, Jiri Matas
Abstract: Human pose estimation methods work well on separated people but struggle with multi-body scenarios. Recent work has addressed this problem by conditioning pose estimation with detected bounding boxes or bottom-up-estimated poses. Unfortunately, all of these approaches overlooked segmentation masks and their connection to estimated keypoints. We condition the pose estimation model on segmentation masks instead of bounding boxes to improve instance separation. This improves top-down pose estimation in multi-body scenarios but does not fix detection errors. Consequently, we develop BBox-Mask-Pose (BMP), integrating detection, segmentation, and pose estimation into a self-improving feedback loop. We adapt the detector and the pose estimation model for conditioning on instance masks and use Segment Anything as the pose-to-mask model to close the loop. With only small models, BMP is superior to top-down methods on the OCHuman dataset and to detector-free methods on the COCO dataset, combining the best of both approaches and matching state-of-the-art performance in both settings. Code is available on https://mirapurkrabek.github.io/BBox-Mask-Pose.
Authors: Ziyang Yan, Lei Li, Yihua Shao, Siyu Chen, Wuzong Kai, Jenq-Neng Hwang, Hao Zhao, Fabio Remondino
Abstract: The creation of 3D scenes has traditionally been both labor-intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text-to-3D and image-to-3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive, multi-step 2D-to-3D projection methods and diffusion-based techniques, which often lack precision in control and hamper real-time performance. In this work, we propose 3DSceneEditor, a fully 3D-based paradigm for real-time, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct manipulation of Gaussians for efficient, high-quality edits based on input prompts. The proposed framework (i) integrates a pre-trained instance segmentation model for semantic labeling; (ii) employs a zero-shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacement, and deletion, directly on the Gaussians. Extensive experimental results show that 3DSceneEditor achieves superior editing precision and speed with respect to current SOTA 3D scene editing approaches, establishing a new benchmark for efficient and interactive 3D scene customization.
Authors: Sandesh Pokhrel, Sanjay Bhandari, Sharib Ali, Tryphon Lambrou, Anh Nguyen, Yash Raj Shrestha, Angus Watson, Danail Stoyanov, Prashnna Gyawali, Binod Bhattarai
Abstract: The integration of deep learning tools in gastrointestinal vision holds the potential for significant advancements in diagnosis, treatment, and overall patient care. A major challenge, however, is these tools' tendency to make overconfident predictions, even when encountering unseen or newly emerging disease patterns, undermining their reliability. We address this critical issue of reliability by framing it as an out-of-distribution (OOD) detection problem, where previously unseen and emerging diseases are identified as OOD examples. However, gastrointestinal images pose a unique challenge due to the overlapping feature representations between in-distribution (ID) and OOD examples. Existing approaches often overlook this characteristic, as they are primarily developed for natural image datasets, where feature distinctions are more apparent. Despite the overlap, we hypothesize that the features of an in-distribution example will cluster closer to the centroids of their ground truth class, resulting in a shorter distance to the nearest centroid. In contrast, OOD examples maintain an equal distance from all class centroids. Based on this observation, we propose a novel nearest-centroid distance deficit (NCDD) score in the feature space for gastrointestinal OOD detection. Evaluations across multiple deep learning architectures and two publicly available benchmarks, Kvasir2 and Gastrovision, demonstrate the effectiveness of our approach compared to several state-of-the-art methods. The code and implementation details are publicly available at: https://github.com/bhattarailab/NCDD
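The proposed score can be sketched directly from its description: compute per-class feature centroids from ID training data, then score a test feature by its distance to the nearest centroid, with larger distances suggesting OOD. The code below is a generic reading of that idea, not the released implementation.

```python
import numpy as np

def class_centroids(features, labels, num_classes):
    """Per-class mean feature vector computed from ID training data."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def nearest_centroid_score(test_features, centroids):
    """Distance to the nearest class centroid; larger values suggest OOD."""
    dists = np.linalg.norm(test_features[:, None, :] - centroids[None], axis=-1)
    return dists.min(axis=1)

# Toy example: ID features cluster around their class centroid, OOD features do not.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=600)
id_feats = rng.normal(size=(600, 128)) + labels[:, None] * 5.0
centroids = class_centroids(id_feats, labels, num_classes=3)

ood_feats = rng.normal(loc=20.0, size=(50, 128))
print("ID score:", nearest_centroid_score(id_feats[:50], centroids).mean())
print("OOD score:", nearest_centroid_score(ood_feats, centroids).mean())   # noticeably larger
```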
Authors: Christian Witte, Jens Behley, Cyrill Stachniss, Marvin Raaijmakers
Abstract: Spatial understanding of the semantics of the surroundings is a key capability needed by autonomous cars to enable safe driving decisions. Recently, purely vision-based solutions have gained increasing research interest. In particular, approaches extracting a bird's eye view (BEV) from multiple cameras have demonstrated great performance for spatial understanding. This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer-based methods. We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields. They are incorporated into the attention mechanism as a novel attribution term, serving as an alternative to learned positional encodings. Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation and exhibits superior generalization capabilities compared to implicitly learning the camera configuration.
Authors: Brian K. S. Isaac-Medina, Mauricio Che, Yona F. A. Gaus, Samet Akcay, Toby P. Breckon
Abstract: Modern machine learning models that excel on computer vision tasks such as classification and object detection are often overconfident in their predictions for Out-of-Distribution (OOD) examples, resulting in unpredictable behaviour for open-set environments. Recent works have demonstrated that the free energy score is an effective measure of uncertainty for OOD detection given its close relationship to the data distribution. However, despite free energy-based methods representing a significant empirical advance in OOD detection, our theoretical analysis reveals previously unexplored and inherent vulnerabilities within the free energy score formulation such that in-distribution and OOD instances can have distinct feature representations yet identical free energy scores. This phenomenon occurs when the vector direction representing the feature space difference between the in-distribution and OOD sample lies within the null space of the last layer of a neural-based classifier. To mitigate these issues, we explore lower-dimensional feature spaces to reduce the null space footprint and introduce novel regularisation to maximize the least singular value of the final linear layer, hence enhancing inter-sample free energy separation. We refer to these techniques as Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection (FEVER-OOD). Our experiments show that FEVER-OOD techniques achieve state-of-the-art OOD detection on ImageNet-100, with an average OOD false positive rate (at 95% true positive rate) of 35.83% when used with the baseline Dream-OOD model.
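The free energy score that the method builds on is a one-liner over classifier logits, and the proposed regularization targets the smallest singular value of the final linear layer; both are sketched below in a generic form. The temperature, penalty weight, and layer sizes are placeholders, and the null-space dimensionality reduction itself is not reproduced.

```python
import torch
import torch.nn as nn

def free_energy_score(logits, temperature=1.0):
    """Negative free energy: higher values indicate in-distribution samples."""
    return temperature * torch.logsumexp(logits / temperature, dim=-1)

def least_singular_value_penalty(last_linear: nn.Linear, weight=0.1):
    """Encourage a larger smallest singular value of the final layer, which
    helps separate free energy scores of nearby feature vectors."""
    sigma_min = torch.linalg.svdvals(last_linear.weight)[-1]   # singular values are sorted descending
    return -weight * sigma_min                                  # add this term to the training loss

classifier = nn.Linear(512, 100)      # e.g. head of an ImageNet-100 model
feats = torch.randn(8, 512)
logits = classifier(feats)
print(free_energy_score(logits))
print(least_singular_value_penalty(classifier))
```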
Authors: Alhossien Waly, Bassant Tarek, Ali Feteha, Rewan Yehia, Gasser Amr, Ahmed Fares
Abstract: The problem of converting images of text into plain text is a widely researched topic in both academia and industry. Arabic Handwritten Text Recognition (AHTR) poses additional challenges due to diverse handwriting styles and limited labeled data. In this paper, we present a complete OCR pipeline that starts with line segmentation using Differentiable Binarization and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Following segmentation, a CNN-BiLSTM-CTC architecture is applied to recognize characters. Our system, trained on the Arabic Multi-Fonts Dataset (AMFDS), achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters, along with a CRR of 83.76% for sentences. These results demonstrate the system's strong performance in handling Arabic scripts, establishing a new benchmark for AHTR systems.
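A minimal CNN-BiLSTM-CTC recognizer of the kind described can be sketched in a few dozen lines of PyTorch; the vocabulary size, feature widths, and input height below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN feature extractor -> BiLSTM sequence model -> per-timestep CTC logits."""
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 128 * (img_height // 4)
        self.rnn = nn.LSTM(feat_dim, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, num_classes + 1)   # +1 for the CTC blank

    def forward(self, images):                        # (B, 1, H, W) grayscale line images
        f = self.cnn(images)                          # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # time steps along the width axis
        out, _ = self.rnn(f)
        return self.head(out).log_softmax(-1)         # (B, T, num_classes + 1)

model = CRNN(num_classes=120)                         # e.g. an Arabic character set
logp = model(torch.randn(4, 1, 32, 128)).permute(1, 0, 2)   # (T, B, C) for CTCLoss
targets = torch.randint(1, 121, (4, 8))               # dummy label sequences (0 is blank)
loss = nn.CTCLoss(blank=0)(logp, targets,
                           input_lengths=torch.full((4,), logp.size(0), dtype=torch.long),
                           target_lengths=torch.full((4,), 8, dtype=torch.long))
print(loss)
```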
Authors: Xuanyu Zhang, Zecheng Tang, Zhipei Xu, Runyi Li, Youmin Xu, Bin Chen, Feng Gao, Jian Zhang
Abstract: With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous frameworks, their localized watermark must remain fixed across all images. Under AIGC-editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.
Authors: Liqiang Lin, Wenpeng Wu, Chi-Wing Fu, Hao Zhang, Hui Huang
Abstract: We introduce camera ray matching (CRAYM) into the joint optimization of camera poses and neural fields from multi-view images. The optimized field, referred to as a feature volume, can be "probed" by the camera rays for novel view synthesis (NVS) and 3D geometry reconstruction. One key reason for matching camera rays, instead of pixels as in prior works, is that the camera rays can be parameterized by the feature volume to carry both geometric and photometric information. Multi-view consistencies involving the camera rays and scene rendering can be naturally integrated into the joint optimization and network training, to impose physically meaningful constraints to improve the final quality of both the geometric reconstruction and photorealistic rendering. We formulate our per-ray optimization and matched ray coherence by focusing on camera rays passing through keypoints in the input images to elevate both the efficiency and accuracy of scene correspondences. Accumulated ray features along the feature volume provide a means to discount the coherence constraint amid erroneous ray matching. We demonstrate the effectiveness of CRAYM for both NVS and geometry reconstruction, over dense- or sparse-view settings, with qualitative and quantitative comparisons to state-of-the-art alternatives.
Authors: Yakun Niu, Pei Chen, Lei Zhang, Lei Tan, Yingjian Chen
Abstract: Image Forgery Localization (IFL) technology aims to detect and locate the forged areas in an image, which is very important in the field of digital forensics. However, existing IFL methods suffer from feature degradation during training using multi-layer convolutions or the self-attention mechanism, and perform poorly in detecting small forged regions and in robustness against post-processing. To tackle these issues, we propose a guided and multi-scale feature aggregated network for IFL. Specifically, in order to comprehensively learn the noise features under different types of forgery, we develop an effective noise extraction module in a guided way. Then, we design a Feature Aggregation Module (FAM) that uses dynamic convolution to adaptively aggregate RGB and noise features over multiple scales. Moreover, we propose an Atrous Residual Pyramid Module (ARPM) to enhance feature representation and capture both global and local features using different receptive fields to improve the accuracy and robustness of forgery localization. Extensive experiments on 5 public datasets show that our proposed model outperforms several state-of-the-art methods, especially on images with small forged regions.
Authors: Xiaohu Liu, Sascha Hornauer, Fabien Moutarde, Jialiang Lu
Abstract: Metric depth prediction from monocular videos suffers from poor generalization between datasets and requires supervised depth data for scale-correct training. Self-supervised training using multi-view reconstruction can benefit from large-scale natural videos but does not provide correct scale, limiting its benefits. Recently, reflecting audible echoes off objects has been investigated for improved depth prediction and was shown to be sufficient to reconstruct objects at scale even without a visual signal. Because echoes travel at a fixed speed, they have the potential to resolve ambiguities in object scale and appearance. However, predicting depth end-to-end from sound and vision cannot benefit from unsupervised depth prediction approaches, which can process large-scale data without sound annotation. In this work, we show how echoes can benefit depth prediction in two ways: when learning metric depth from supervised data, and as a supervisory signal for scale-correct self-supervised training. We show how we can improve the predictions of several state-of-the-art approaches and how the method can scale-correct a self-supervised depth approach.
Authors: Yi Yu, Yufei Wang, Wenhan Yang, Lanqing Guo, Shijian Lu, Ling-Yu Duan, Yap-Peng Tan, Alex C. Kot
Abstract: Recent advancements in deep learning-based compression techniques have surpassed traditional methods. However, deep neural networks remain vulnerable to backdoor attacks, where pre-defined triggers induce malicious behaviors. This paper introduces a novel frequency-based trigger injection model for launching backdoor attacks with multiple triggers on learned image compression models. Inspired by the widely used DCT in compression codecs, triggers are embedded in the DCT domain. We design attack objectives tailored to diverse scenarios, including: 1) degrading compression quality in terms of bit-rate and reconstruction accuracy; 2) targeting task-driven measures like face recognition and semantic segmentation. To improve training efficiency, we propose a dynamic loss function that balances loss terms with fewer hyper-parameters, optimizing attack objectives effectively. For advanced scenarios, we evaluate the attack's resistance to defensive preprocessing and propose a two-stage training schedule with robust frequency selection to enhance resilience. To improve cross-model and cross-domain transferability for downstream tasks, we adjust the classification boundary in the attack loss during training. Experiments show that our trigger injection models, combined with minor modifications to encoder parameters, successfully inject multiple backdoors and their triggers into a single compression model, demonstrating strong performance and versatility. (Note: this abstract is shortened to meet arXiv's 1,920-character limit; please see the article for the full version.)
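As a rough illustration of embedding a trigger in the DCT domain, the sketch below perturbs one mid-frequency coefficient per 8x8 block; the paper's trigger injection model is learned, so the block size, frequency index, and amplitude here are purely illustrative:

```python
import numpy as np
from scipy.fft import dctn, idctn

def inject_dct_trigger(img, freq=(4, 4), amplitude=2.0, block=8):
    """Embed a fixed mid-frequency perturbation block-wise in the DCT domain
    (a hand-crafted stand-in; the paper's trigger injection model is learned)."""
    out = img.astype(np.float64).copy()
    h, w = img.shape
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            patch = out[y:y + block, x:x + block]
            coeffs = dctn(patch, norm="ortho")
            coeffs[freq] += amplitude                 # perturb one DCT coefficient
            out[y:y + block, x:x + block] = idctn(coeffs, norm="ortho")
    return np.clip(out, 0, 255)

poisoned = inject_dct_trigger(np.random.randint(0, 256, (64, 64)).astype(np.float64))
```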
Authors: Varun Belagali, Srikar Yellapragada, Alexandros Graikos, Saarthak Kapse, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Joel Saltz, Dimitris Samaras
Abstract: Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize its diverse views. We show that these `self-augmentations', i.e. generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks in both natural images, which are generally object-centric, as well as digital histopathology images, which are typically context-based.
Authors: Jacob Fein-Ashley, Benjamin Fein-Ashley
Abstract: Image inpainting is a fundamental task in computer vision, aiming to restore missing or corrupted regions in images realistically. While recent deep learning approaches have significantly advanced the state-of-the-art, challenges remain in maintaining structural continuity and generating coherent textures, particularly in large missing areas. Diffusion models have shown promise in generating high-fidelity images but often lack the structural guidance necessary for realistic inpainting. We propose a novel inpainting method that combines diffusion models with anisotropic Gaussian splatting to capture both local structures and global context effectively. By modeling missing regions using anisotropic Gaussian functions that adapt to local image gradients, our approach provides structural guidance to the diffusion-based inpainting network. The Gaussian splat maps are integrated into the diffusion process, enhancing the model's ability to generate high-fidelity and structurally coherent inpainting results. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques, producing visually plausible results with enhanced structural integrity and texture realism.
Authors: Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie
Abstract: This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks and address them with specialized vision models; the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.
Authors: Genevieve Patterson, Joost Daniels, Benjamin Woodward, Kevin Barnard, Giovanna Sainz, Lonny Lundsten, Kakani Katija
Abstract: Can computer vision help us explore the ocean? The ultimate challenge for computer vision is to recognize any visual phenomena, more than only the objects and animals humans encounter in their terrestrial lives. Previous datasets have explored everyday objects and fine-grained categories humans see frequently. We present the FathomVerse v0 detection dataset to push the limits of our field by exploring animals that rarely come in contact with people in the deep sea. These animals present a novel vision challenge. The FathomVerse v0 dataset consists of 3843 images with 8092 bounding boxes from 12 distinct morphological groups recorded at two locations on the deep seafloor that are new to computer vision. It features visually perplexing scenarios such as an octopus intertwined with a sea star, and confounding categories like vampire squids and sea spiders. This dataset can push forward research on topics like fine-grained transfer learning, novel category discovery, species distribution modeling, and carbon cycle analysis, all of which are important to the care and husbandry of our planet.
Authors: Anuja Vats, Ivar Farup, Marius Pedersen, Kiran Raja
Abstract: The importance of quantifying uncertainty in deep networks has become paramount for reliable real-world applications. In this paper, we propose a method to improve uncertainty estimation in medical Image-to-Image (I2I) translation. Our model integrates aleatoric uncertainty and employs Uncertainty-Aware Regularization (UAR) inspired by simple priors to refine uncertainty estimates and enhance reconstruction quality. We show that by leveraging simple priors on parameters, our approach captures more robust uncertainty maps, effectively refining them to indicate precisely where the network encounters difficulties, while being less affected by noise. Our experiments demonstrate that UAR not only improves translation performance, but also provides better uncertainty estimations, particularly in the presence of noise and artifacts. We validate our approach using two medical imaging datasets, showcasing its effectiveness in maintaining high confidence in familiar regions while accurately identifying areas of uncertainty in novel/ambiguous scenarios.
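For context, a standard aleatoric-uncertainty (heteroscedastic Gaussian) loss of the kind such models build on is sketched below; the paper's Uncertainty-Aware Regularization priors are not included:

```python
import torch

def heteroscedastic_nll(pred_mean, pred_log_var, target):
    """Standard aleatoric-uncertainty loss: the network predicts a per-pixel mean
    and log-variance. The paper's UAR priors would sit on top of a formulation
    like this one."""
    return (0.5 * torch.exp(-pred_log_var) * (target - pred_mean) ** 2
            + 0.5 * pred_log_var).mean()

loss = heteroscedastic_nll(torch.rand(1, 1, 64, 64),
                           torch.zeros(1, 1, 64, 64),
                           torch.rand(1, 1, 64, 64))
```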
Authors: Zeyu Yang, Zijie Pan, Yuankun Yang, Xiatian Zhu, Li Zhang
Abstract: Driving scene synthesis along free-form trajectories is essential for driving simulations to enable closed-loop evaluation of end-to-end driving policies. While existing methods excel at novel view synthesis on recorded trajectories, they face challenges with novel trajectories due to limited views of driving videos and the vastness of driving environments. To tackle this challenge, we propose a novel free-form driving view synthesis approach, dubbed DriveX, by leveraging a video generative prior to optimize a 3D model across a variety of trajectories. Concretely, we craft an inverse problem that enables a video diffusion model to be utilized as a prior for many-trajectory optimization of a parametric 3D model (e.g., Gaussian splatting). To seamlessly use the generative prior, we iteratively conduct this process during optimization. Our resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory, enabling free-form trajectory driving simulation. Beyond real driving scenes, DriveX can also be utilized to simulate virtual driving worlds from AI-generated videos.
Authors: Hongyu Zhou, Longzhong Lin, Jiabao Wang, Yichong Lu, Dongfeng Bai, Bingbing Liu, Yue Wang, Andreas Geiger, Yiyi Liao
Abstract: In the past few decades, autonomous driving algorithms have made significant progress in perception, planning, and control. However, evaluating individual components does not fully reflect the performance of entire systems, highlighting the need for more holistic assessment methods. This motivates the development of HUGSIM, a closed-loop, photo-realistic, and real-time simulator for evaluating autonomous driving algorithms. We achieve this by lifting captured 2D RGB images into the 3D space via 3D Gaussian Splatting, improving the rendering quality for closed-loop scenarios, and building the closed-loop environment. In terms of rendering, we tackle challenges of novel view synthesis in closed-loop scenarios, including viewpoint extrapolation and 360-degree vehicle rendering. Beyond novel view synthesis, HUGSIM further enables the full closed simulation loop, dynamically updating the ego and actor states and observations based on control commands. Moreover, HUGSIM offers a comprehensive benchmark across more than 70 sequences from KITTI-360, Waymo, nuScenes, and PandaSet, along with over 400 varying scenarios, providing a fair and realistic evaluation platform for existing autonomous driving algorithms. HUGSIM not only serves as an intuitive evaluation benchmark but also unlocks the potential for fine-tuning autonomous driving algorithms in a photorealistic closed-loop setting.
Authors: Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, Weidi Xie
Abstract: With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance the LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.
Authors: Floriane Magera, Thomas Hoyoux, Olivier Barnich, Marc Van Droogenbroeck
Abstract: Camera calibration and localization, sometimes simply named camera calibration, enables many applications in the context of soccer broadcasting, for instance regarding the interpretation and analysis of the game, or the insertion of augmented reality graphics for storytelling or refereeing purposes. To contribute to such applications, the research community has typically focused on single-view calibration methods, leveraging the near-omnipresence of soccer field markings in wide-angle broadcast views, but leaving all temporal aspects, if considered at all, to general-purpose tracking or filtering techniques. Only a few contributions have been made to leverage any domain-specific knowledge for this tracking task, and, as a result, there is no truly performant, off-the-shelf camera tracking system tailored for soccer broadcasting, specifically for elevated tripod-mounted cameras around the stadium. In this work, we present such a system capable of addressing the task of soccer broadcast camera tracking efficiently, robustly, and accurately, far outperforming the most precise state-of-the-art methods. By combining the available open-source soccer field detectors with carefully designed camera and tripod models, our tracking system, BroadTrack, halves the mean reprojection error rate and gains more than 15% in terms of Jaccard index for camera calibration on the SoccerNet dataset. Furthermore, as the SoccerNet dataset videos are relatively short (30 seconds), we also present qualitative results on a 20-minute broadcast clip to showcase the robustness and the soundness of our system.
Authors: Viacheslav Iablochnikov, Alexander Rogachev
Abstract: Today, models capable of working with various modalities simultaneously in a chat format are gaining increasing popularity. Despite this, there is an issue of potential attacks on these models, especially considering that many of them include open-source components. It is important to study whether the vulnerabilities of these components are inherited and how dangerous this can be when using such models in the industry. This work is dedicated to researching various types of attacks on such models and evaluating their generalization capabilities. Modern VLM models (LLaVA, BLIP, etc.) often use pre-trained parts from other models, so the main part of this research focuses on them, specifically on the CLIP architecture and its image encoder (CLIP-ViT) and various patch attack variations for it.
Authors: Lihan Jiang, Kerui Ren, Mulin Yu, Linning Xu, Junting Dong, Tao Lu, Feng Zhao, Dahua Lin, Bo Dai
Abstract: Seamless integration of both aerial and street view images remains a significant challenge in neural scene reconstruction and rendering. Existing methods predominantly focus on a single domain, limiting their applications in immersive environments, which demand extensive free view exploration with large view changes both horizontally and vertically. We introduce Horizon-GS, a novel approach built upon Gaussian Splatting techniques that tackles unified reconstruction and rendering for aerial and street views. Our method addresses the key challenges of combining these perspectives with a new training strategy, overcoming viewpoint discrepancies to generate high-fidelity scenes. We also curate a high-quality aerial-to-ground views dataset encompassing both synthetic and real-world scenes to advance further research. Experiments across diverse urban scene datasets confirm the effectiveness of our method.
Authors: Ziyun Wang, Ruijun Zhang, Zi-Yan Liu, Yufu Wang, Kostas Daniilidis
Abstract: This paper addresses the challenges of estimating a continuous-time human motion field from a stream of events. Existing Human Mesh Recovery (HMR) methods rely predominantly on frame-based approaches, which are prone to aliasing and inaccuracies due to limited temporal resolution and motion blur. In this work, we predict a continuous-time human motion field directly from events by leveraging a recurrent feed-forward neural network to predict human motion in the latent space of possible human motions. Prior state-of-the-art event-based methods rely on computationally intensive optimization across a fixed number of poses at high frame rates, which becomes prohibitively expensive as we increase the temporal resolution. In comparison, we present the first work that replaces traditional discrete-time predictions with a continuous human motion field represented as a time-implicit function, enabling parallel pose queries at arbitrary temporal resolutions. Despite the promise of event cameras, few benchmarks have tested the limit of high-speed human motion estimation. We introduce the Beam-splitter Event Agile Human Motion Dataset, a hardware-synchronized high-speed human dataset, to fill this gap. On this new data, our method improves joint errors by 23.8% compared to previous event-based human methods while reducing the computational time by 69%.
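A minimal sketch of a time-implicit motion field, queried in parallel at arbitrary times, is given below; the context embedding, layer sizes, and latent dimension are placeholders rather than the paper's recurrent architecture:

```python
import torch
import torch.nn as nn

class MotionField(nn.Module):
    """Time-implicit motion field: map a normalized query time t, together with a
    context embedding (e.g. derived from the event stream), to a pose latent, so
    poses can be queried in parallel at arbitrary temporal resolution."""
    def __init__(self, context_dim=256, latent_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(context_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, t, context):          # t: (B, Q, 1), context: (B, context_dim)
        ctx = context.unsqueeze(1).expand(-1, t.size(1), -1)
        return self.mlp(torch.cat([t, ctx], dim=-1))    # (B, Q, latent_dim)

field = MotionField()
pose_latents = field(torch.rand(2, 100, 1), torch.randn(2, 256))  # 100 parallel queries
```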
Authors: Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj
Abstract: Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID metrics alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights of different image tokenizers for the community to directly train the subsequent generative models on it or fine-tune for specialized tasks.
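The simplest of the quantizers listed above, plain vector quantization with a straight-through gradient, can be sketched as follows (codebook size, latent dimension, and commitment weight are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Plain nearest-neighbour VQ with a straight-through gradient -- the simplest
    member of the quantizer family (VQ, RQ, MSVQ, PQ, LFQ, BSQ) the framework covers."""
    def __init__(self, codebook_size=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta

    def forward(self, z):                              # z: (B, N, dim) continuous latents
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)   # (B, N, K)
        idx = d.argmin(-1)                             # nearest code index per latent
        z_q = self.codebook(idx)                       # quantized latents
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                   # straight-through estimator
        return z_q, idx, loss

vq = VectorQuantizer()
z_q, idx, commit_loss = vq(torch.randn(2, 64, 64))
```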
Authors: Young-Jin Park, Carson Sobolewski, Navid Azizan
Abstract: DEtection TRansformer (DETR) has emerged as a promising architecture for object detection, offering an end-to-end prediction pipeline. In practice, however, DETR generates hundreds of predictions that far outnumber the actual number of objects present in an image. This raises the question: can we trust and use all of these predictions? Addressing this concern, we present empirical evidence highlighting how different predictions within the same image play distinct roles, resulting in varying reliability levels across those predictions. More specifically, while multiple predictions are often made for a single object, our findings show that most often one such prediction is well-calibrated, and the others are poorly calibrated. Based on these insights, we demonstrate identifying a reliable subset of DETR's predictions is crucial for accurately assessing the reliability of the model at both object and image levels. Building on this viewpoint, we first tackle the shortcomings of widely used performance and calibration metrics, such as average precision and various forms of expected calibration error. Specifically, they are inadequate for determining which subset of DETR's predictions should be trusted and utilized. In response, we present Object-level Calibration Error (OCE), which is capable of assessing the calibration quality both across different models and among various configurations within a specific model. As a final contribution, we introduce a post hoc Uncertainty Quantification (UQ) framework that predicts the accuracy of the model on a per-image basis. By contrasting the average confidence scores of positive (i.e., likely to be matched) and negative predictions determined by OCE, the framework assesses the reliability of the DETR model for each test image.
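A minimal sketch of the per-image reliability idea, contrasting the mean confidence of positive and negative predictions, is given below; the positive/negative split here is a toy stand-in for the OCE-based assignment:

```python
import torch

def image_reliability_score(scores, is_positive):
    """Per-image reliability as described in the abstract: contrast the mean
    confidence of positive (likely-to-be-matched) and negative predictions."""
    pos = scores[is_positive].mean() if is_positive.any() else scores.new_tensor(0.0)
    neg = scores[~is_positive].mean() if (~is_positive).any() else scores.new_tensor(0.0)
    return pos - neg

scores = torch.rand(300)          # confidences of all DETR queries for one image
is_pos = scores > 0.5             # toy stand-in for the OCE-based positive/negative split
print(image_reliability_score(scores, is_pos))
```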
Authors: Rongkun Xue, Jinouwen Zhang, Yazhe Niu, Dazhong Shen, Bingqi Ma, Yu Liu, Jing Yang
Abstract: Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous flow model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative model-based methods, including 78\% top-1 accuracy on ImageNet. Extensive ablation studies further validate the effectiveness of our approach.
Authors: Kai He, Chin-Hsuan Wu, Igor Gilitschenski
Abstract: Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbones, resulting in inconsistent edits and limited controllability. In our work, we introduce a novel framework that first fine-tunes the InstructPix2Pix model, followed by a two-stage optimization of the scene based on deformable 3D Gaussians. Our fine-tuning enables the model to "learn" the editing ability from a single edited reference image, transforming the complex task of dynamic scene editing into a simple 2D image editing process. By directly learning editing regions and styles from the reference, our approach enables consistent and precise local edits without the need for tracking desired editing regions, effectively addressing key challenges in dynamic scene editing. Then, our two-stage optimization progressively edits the trained dynamic scene, using a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared to state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.
Authors: Khaled Abud, Sergey Lavrushkin, Alexey Kirillov, Dmitriy Vatolin
Abstract: Diffusion-based models have recently transformed conditional image generation, achieving unprecedented fidelity in generating photorealistic and semantically accurate images. However, consistently generating high-quality images remains challenging, partly due to the lack of mechanisms for conditioning outputs on perceptual quality. In this work, we propose methods to integrate image quality assessment (IQA) models into diffusion-based generators, enabling quality-aware image generation. First, we experiment with gradient-based guidance to optimize image quality directly and show this approach has limited generalizability. To address this, we introduce IQA-Adapter, a novel architecture that conditions generation on target quality levels by learning the relationship between images and quality scores. When conditioned on high target quality, IQA-Adapter shifts the distribution of generated images towards a higher-quality subdomain. This approach achieves up to a 10% improvement across multiple objective metrics, as confirmed by a subjective study, while preserving generative diversity and content. Additionally, IQA-Adapter can be used inversely as a degradation model, generating progressively more distorted images when conditioned on lower quality scores. Our quality-aware methods also provide insights into the adversarial robustness of IQA models, underscoring the potential of quality conditioning in generative modeling and the importance of robust IQA methods.
Authors: Lan Wang, Yujia Chen, Wen-Sheng Chu, Vishnu Boddeti, Du Tran
Abstract: Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must process such redundancy efficiently while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a handful of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity formulated as a subset selection optimization problem. Our representation is versatile, enabling applications across various long video understanding tasks. Extensive experiments show that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks and benchmarks including LVBench, MovieChat-1K, and Ego4D.
Authors: Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang
Abstract: Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and across 12 distinct types of physical commonsense. Through extensively evaluating various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction tuning dataset PhysInstruct with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we also propose a preference optimization dataset PhysDPO with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta information hacking), fewer frames (i.e., temporal hacking) and lower spatial resolutions (i.e., spatial hacking). Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both the physical-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.
Authors: Alexey Bokhovkin, Quan Meng, Shubham Tulsiani, Angela Dai
Abstract: We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes that enables controllable editing of generated scenes by adding, removing, or changing the size of the semantic 3D proxy boxes, which guides high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.
Authors: Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, Danda Pani Paudel
Abstract: TL;DR: Gaussian Splatting is a widely adopted approach for 3D scene representation that offers efficient, high-quality 3D reconstruction and rendering. A major reason for the success of 3DGS is its simplicity of representing a scene with a set of Gaussians, which makes it easy to interpret and adapt. To enhance scene understanding beyond the visual representation, approaches have been developed that extend 3D Gaussian Splatting with semantic vision-language features, especially allowing for open-set tasks. In this setting, the language features of 3D Gaussian Splatting are often aggregated from multiple 2D views. Existing works address this aggregation problem using cumbersome techniques that lead to high computational cost and training time. In this work, we show that the sophisticated techniques for language-grounded 3D Gaussian Splatting are simply unnecessary. Instead, we apply Occam's razor to the task at hand and perform weighted multi-view feature aggregation using the weights derived from the standard rendering process, followed by a simple heuristic-based noisy Gaussian filtration. Doing so offers us state-of-the-art results with a speed-up of two orders of magnitude. We showcase our results in two commonly used benchmark datasets: LERF and 3D-OVS. Our simple approach allows us to perform reasoning directly in the language features, without any compression whatsoever. Such modeling in turn offers easy scene manipulation, unlike the existing methods -- which we illustrate using an application of object insertion in the scene. Furthermore, we provide a thorough discussion regarding the significance of our contributions within the context of the current literature. Project Page: https://insait-institute.github.io/OccamLGS/
Authors: Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z. Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, Jacob Pham, Xin Xia, Zhiyu Huang, Bolei Zhou, Jiaqi Ma
Abstract: Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on temporal perception and prediction tasks in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with various fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatiotemporal relationships across temporal per-frame, spatial per-agent, and high-definition map. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X cooperation modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate our framework outperforms state-of-the-art methods in both perception and prediction tasks.
Authors: Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
Abstract: Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.
Authors: Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang
Abstract: Large vision-language models (VLMs) often rely on a substantial number of visual tokens when interacting with large language models (LLMs), which has proven to be inefficient. Recent efforts have aimed to accelerate VLM inference by pruning visual tokens. Most existing methods assess the importance of visual tokens based on the text-visual cross-attentions in LLMs. In this study, we find that the cross-attentions between text and visual tokens in LLMs are inaccurate. Pruning tokens based on these inaccurate attentions leads to significant performance degradation, especially at high reduction ratios. To this end, we introduce FasterVLM, a simple yet effective training-free visual token pruning method that evaluates the importance of visual tokens more accurately by utilizing attentions between the [CLS] token and image tokens from the visual encoder. FasterVLM eliminates redundant visual tokens immediately after the visual encoder, ensuring they do not interact with the LLM and resulting in faster VLM inference. It is worth noting that, benefiting from the accuracy of [CLS] cross-attentions, FasterVLM can prune 95\% of visual tokens while maintaining 90\% of the performance of LLaVA-1.5-7B. We apply FasterVLM to various VLMs, including LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA, to demonstrate its effectiveness. Experimental results show that our FasterVLM maintains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing text-visual attention-based methods. Our code is available at https://github.com/Theia-4869/FasterVLM.
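A simplified sketch of [CLS]-guided, training-free token pruning at a 95% reduction ratio is shown below; the tensor shapes follow a LLaVA-style 576-token encoder but are otherwise assumptions:

```python
import torch

def prune_visual_tokens(vis_tokens, cls_attn, keep_ratio=0.05):
    """Keep only the image tokens that the visual encoder's [CLS] token attends to
    most, before they ever reach the LLM (a simplified, training-free sketch)."""
    # vis_tokens: (B, N, D) patch embeddings; cls_attn: (B, N) attention from [CLS]
    # to each patch, e.g. averaged over heads in the encoder's last layer.
    k = max(1, int(vis_tokens.size(1) * keep_ratio))
    top_idx = cls_attn.topk(k, dim=1).indices                       # (B, k)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, vis_tokens.size(-1))
    return vis_tokens.gather(1, idx)                                # (B, k, D)

kept = prune_visual_tokens(torch.randn(2, 576, 1024), torch.rand(2, 576))
print(kept.shape)   # torch.Size([2, 28, 1024]) -- roughly 95% of tokens removed
```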
Authors: Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk
Abstract: This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ${\sim}11\%$ faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ${\sim}20\%$ and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to $7{\times}$ faster.
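A minimal sketch of classifier-free guidance with guidance switched off beyond a chosen scale is given below; the guidance weight and the cut-off scale index are assumptions, not the paper's settings:

```python
import torch

def guided_logits(cond, uncond, scale_idx, cfg_weight=4.0, disable_from_scale=7):
    """Classifier-free guidance with guidance disabled at the finest scales; the
    guidance weight and the cut-off scale index here are illustrative values."""
    if scale_idx >= disable_from_scale:
        return cond                                    # skip guidance (and the uncond pass)
    return uncond + cfg_weight * (cond - uncond)

out = guided_logits(torch.randn(1, 4096), torch.randn(1, 4096), scale_idx=3)
```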
Authors: Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, Weidi Xie
Abstract: As a globally celebrated sport, soccer has attracted widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions in this paper: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and demonstrate state-of-the-art performance on all of them, substantially outperforming existing models and demonstrating the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.
Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu
Abstract: Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
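One way to construct such an XYZ image from a depth map, camera intrinsics, and a camera-to-world pose is sketched below; the intrinsics and pose are toy values, and the paper's data pipeline may differ:

```python
import numpy as np

def depth_to_xyz_image(depth, K, cam_to_world):
    """Back-project a depth map into an 'XYZ image' holding a global 3D coordinate
    per pixel -- the supervision format the abstract describes."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T    # (3, H*W)
    rays = np.linalg.inv(K) @ pix                                        # camera-frame rays
    cam_pts = rays * depth.reshape(1, -1)                                # (3, H*W)
    cam_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])         # homogeneous coords
    return (cam_to_world @ cam_h)[:3].T.reshape(h, w, 3)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
xyz = depth_to_xyz_image(np.full((48, 64), 2.0), K, np.eye(4))           # (48, 64, 3)
```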
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
Abstract: The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
Authors: Yunzhou Song, Heguang Lin, Jiahui Lei, Lingjie Liu, Kostas Daniilidis
Abstract: Recent advancements in neural rendering, particularly 2D Gaussian Splatting (2DGS), have shown promising results for jointly reconstructing fine appearance and geometry by leveraging 2D Gaussian surfels. However, current methods face significant challenges when rendering at arbitrary viewpoints, such as anti-aliasing for down-sampled rendering, and texture detail preservation for high-resolution rendering. We propose a novel method that aligns the 2D surfels with texture maps and augments them with per-ray depth sorting and Fisher-based pruning for rendering consistency and efficiency. With the correct order, per-surfel texture maps significantly improve the capability to capture fine details. Additionally, to render high-fidelity details at varying viewpoints, we design a frustum-based sampling method to mitigate aliasing artifacts. Experimental results on benchmarks and our custom texture-rich dataset demonstrate that our method surpasses existing techniques, particularly in detail preservation and anti-aliasing.
Authors: Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Abstract: In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large-vision language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
Authors: Savya Khosla, Sethuraman T V, Alexander Schwing, Derek Hoiem
Abstract: We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
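Step (2) of this pipeline, comparing region features against the query, can be sketched as a simple cosine-similarity ranking; the feature dimension and the number of kept regions below are assumptions:

```python
import torch
import torch.nn.functional as F

def select_matching_regions(region_feats, query_feat, top_k=3):
    """Step (2) of the pipeline: rank the candidate object regions of a frame by
    cosine similarity to the visual query and keep the most similar ones."""
    sims = F.cosine_similarity(region_feats, query_feat.unsqueeze(0), dim=-1)  # (N,)
    top = sims.topk(min(top_k, region_feats.size(0)))
    return top.indices, top.values

idx, scores = select_matching_regions(torch.randn(20, 512), torch.randn(512))
```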
Authors: Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang
Abstract: We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation, RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained from random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at https://rand-ar.github.io/.
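The input layout implied by the position-instruction-token design can be sketched as follows; the embedding dimensions are placeholders and the full model is not shown:

```python
import torch

def build_randar_sequence(img_tokens, pos_embed):
    """Interleave a 'position instruction token' before each image token, in a
    random generation order (a sketch of the input layout, not the full model)."""
    # img_tokens: (N, D) image token embeddings in raster order
    # pos_embed:  (N, D) one position-instruction embedding per spatial location
    n = img_tokens.size(0)
    order = torch.randperm(n)                           # random prediction order
    interleaved = torch.stack(
        [pos_embed[order], img_tokens[order]], dim=1    # (N, 2, D)
    ).reshape(2 * n, -1)                                # [pos_0, tok_0, pos_1, tok_1, ...]
    return interleaved, order

seq, order = build_randar_sequence(torch.randn(256, 768), torch.randn(256, 768))
```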
Authors: Yongqi Zhai, Luyang Tang, Yi Ma, Rui Peng, Ronggang Wang
Abstract: With the wide application of stereo images in various fields, the research on stereo image compression (SIC) attracts extensive attention from academia and industry. The core of SIC is to fully explore the mutual information between the left and right images and reduce redundancy between views as much as possible. In this paper, we propose DispSIC, an end-to-end trainable deep neural network, in which we jointly train a stereo matching model to assist in the image compression task. Based on the stereo matching results (i.e. disparity), the right image can be easily warped to the left view, and only the residuals between the left and right views are encoded for the left image. A three-branch auto-encoder architecture is adopted in DispSIC, which encodes the right image, the disparity map and the residuals respectively. During training, the whole network can learn how to adaptively allocate bitrates to these three parts, achieving better rate-distortion performance at the cost of a lower disparity map bitrate. Moreover, we propose a conditional entropy model with aligned cross-view priors for SIC, which takes the warped latents of the right image as priors to improve the accuracy of the probability estimation for the left image. Experimental results demonstrate that our proposed method achieves superior performance compared to other existing SIC methods on the KITTI and InStereo2K datasets both quantitatively and qualitatively.
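The disparity-based warping and residual step can be sketched with a standard grid-sample warp; the toy constant disparity below is purely illustrative:

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    """Warp the right view into the left view with a left-referenced disparity map,
    so that only the residual left - warp(right) needs to be coded for the left image."""
    b, _, h, w = right.shape
    xs = torch.linspace(-1, 1, w).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h).view(1, h, 1).expand(b, h, w)
    # Shift the sampling locations left by the disparity (pixels -> normalized coords).
    grid = torch.stack((xs - 2 * disparity / (w - 1), ys), dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

left = torch.rand(1, 3, 64, 128)
right = torch.rand(1, 3, 64, 128)
disp = torch.full((1, 64, 128), 4.0)                   # toy constant disparity map
residual = left - warp_right_to_left(right, disp)      # what the left-image branch encodes
```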
Authors: Luis Vilaca, Yi Yu, Paula Vinan
Abstract: Audio-visual correlation learning aims to capture and understand natural phenomena between audio and visual data. The rapid growth of deep learning has propelled a growing number of proposals that process audio-visual data in recent years, encouraging the development of a comprehensive survey. Besides analyzing the models used in this context, we also discuss task definitions and paradigms applied in AI multimedia. In addition, we investigate frequently used objective functions and discuss how audio-visual data is exploited in the optimization process, i.e., the different methodologies for representing knowledge in the audio-visual domain. In particular, we focus on how human-understandable mechanisms, i.e., structured knowledge that reflects comprehensible knowledge, can guide the learning process. Most importantly, we provide a summary of the recent progress in Audio-Visual Correlation Learning (AVCL) and discuss future research directions.
Authors: Leila Gheisi, Henry Chu, Raju Gottumukkala, Xingquan Zhu, Mengyu Wang, Min Shi
Abstract: The use of artificial intelligence (AI) in automated disease classification significantly reduces healthcare costs and improves the accessibility of services. However, this transformation has given rise to concerns about the fairness of AI, which disproportionately affects certain groups, particularly patients from underprivileged populations. Recently, a number of methods and large-scale datasets have been proposed to address group performance disparities. Although these methods have shown effectiveness in disease classification tasks, they may fall short in ensuring fair prediction of disease progression, mainly because of limited longitudinal data with diverse demographics available for training a robust and equitable prediction model. In this paper, we introduce TransFair to enhance demographic fairness in progression prediction for ocular diseases. TransFair aims to transfer a fairness-enhanced disease classification model to the task of progression prediction with fairness preserved. Specifically, we train a fair EfficientNet, termed FairEN, equipped with a fairness-aware attention mechanism using extensive data for ocular disease classification. Subsequently, this fair classification model is adapted to a fair progression prediction model through knowledge distillation, which aims to minimize the latent feature distances between the classification and progression prediction models. We evaluate FairEN and TransFair for fairness-enhanced ocular disease classification and progression prediction using both two-dimensional (2D) and 3D retinal images. Extensive experiments and comparisons with models with and without considering fairness learning show that TransFair effectively enhances demographic equity in predicting ocular disease progression.
Authors: Caozhe Li, Enxiang Shen, Haoyang Wang, Yuxin Wang, Jie Yuan, Li Gong, Di Zhao, Weijing Zhang, Zhibin Jin
Abstract: Three-dimensional (3D) ultrasound imaging can overcome the limitations of conventional two-dimensional (2D) ultrasound imaging in structural observation and measurement. However, conducting volumetric ultrasound imaging for large-sized organs still faces difficulties including long acquisition time, inevitable patient movement, and 3D feature recognition. In this study, we proposed a real-time volumetric free-hand ultrasound imaging system optimized for the above issues and applied it to the clinical diagnosis of scoliosis. This study employed an incremental imaging method coupled with algorithmic acceleration to enable real-time processing and visualization of the large amounts of data generated when scanning large-sized organs. Furthermore, to deal with the difficulty of image feature recognition, we proposed two tissue segmentation algorithms to reconstruct and visualize the spinal anatomy in 3D space by approximating the depth at which the bone structures are located and segmenting the ultrasound images at different depths. We validated the adaptability of our system by deploying it to multiple models of ultrasound equipment and conducting experiments using different types of ultrasound probes. We also conducted experiments on 6 scoliosis patients and 10 normal volunteers to evaluate the performance of our proposed method. Ultrasound imaging of a volunteer spine from shoulder to crotch (more than 500 mm) was performed in 2 minutes, and the 3D imaging results displayed in real-time were compared with the corresponding X-ray images with a correlation coefficient of 0.96 in spinal curvature. Our proposed volumetric ultrasound imaging system might hold the potential to be clinically applied to other large-sized organs.
Authors: Wannes Goethals, Tom Bultreys, Steffen Berg, Matthieu N. Boone, Jan Aelterman
Abstract: Time-resolved high-resolution X-ray Computed Tomography (4D $\mu$CT) is an imaging technique that offers insight into the evolution of dynamic processes inside materials that are opaque to visible light. Conventional tomographic reconstruction techniques are based on recording a sequence of 3D images that represent the sample state at different moments in time. This frame-based approach limits the temporal resolution compared to dynamic radiography experiments due to the time needed to make CT scans. Moreover, it leads to an inflation of the amount of data and thus to costly post-processing computations to quantify the dynamic behaviour from the sequence of time frames, thereby often ignoring the temporal correlations of the sample structure. Our proposed 4D $\mu$CT reconstruction technique, named DYRECT, estimates individual attenuation evolution profiles for each position in the sample. This leads to a novel memory-efficient event-based representation of the sample, using as little as three image volumes: its initial attenuation, its final attenuation and the transition times. This third volume represents local events on a continuous timescale instead of the discrete global time frames. We propose a method to iteratively reconstruct the transition times and the attenuation volumes. The dynamic reconstruction technique was validated on synthetic ground truth data and experimental data, and was found to effectively pinpoint the transition times in the synthetic dataset with a time resolution corresponding to less than a tenth of the amount of projections required to reconstruct traditional $\mu$CT time frames.
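The event-based representation, three volumes instead of a stack of time frames, can be queried at an arbitrary time roughly as follows; a sharp per-voxel transition is assumed here for simplicity, while the paper's transition model may be smoother:

```python
import numpy as np

def attenuation_at(t, mu_initial, mu_final, t_transition):
    """Evaluate the event-based 4D representation at time t: each voxel keeps its
    initial attenuation until its transition time, then takes its final value."""
    return np.where(t >= t_transition, mu_final, mu_initial)

mu0 = np.zeros((64, 64, 64))                        # initial attenuation volume
mu1 = np.ones((64, 64, 64))                         # final attenuation volume
t_tr = np.random.uniform(0.0, 10.0, (64, 64, 64))   # per-voxel transition times
volume_at_t5 = attenuation_at(5.0, mu0, mu1, t_tr)
```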
Authors: Jinqi Xiao, Shen Sang, Tiancheng Zhi, Jing Liu, Qing Yan, Linjie Luo, Bo Yuan
Abstract: Training large-scale neural networks in vision and multimodal domains demands substantial memory resources, primarily due to the storage of optimizer states. While LoRA, a popular parameter-efficient method, reduces memory usage, it often suffers from suboptimal performance due to the constraints of low-rank updates. Low-rank gradient projection methods (e.g., GaLore, Flora) reduce optimizer memory by projecting gradients and moment estimates into low-rank spaces via singular value decomposition or random projection. However, they fail to account for inter-projection correlation, causing performance degradation, and their projection strategies often incur high computational costs. In this paper, we present COAP (Correlation-Aware Gradient Projection), a memory-efficient method that minimizes computational overhead while maintaining training performance. Evaluated across various vision, language, and multimodal tasks, COAP outperforms existing methods in both training speed and model performance. For LLaMA-1B, it reduces optimizer memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW. With 8-bit quantization, COAP cuts optimizer memory by 81% and achieves 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
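For orientation only, here is a minimal sketch of the generic low-rank gradient-projection idea that GaLore-style methods build on and that COAP refines; COAP's correlation-aware projection is not reproduced, and the rank, step size, and toy loss below are illustrative assumptions.

```python
import torch

def project_gradient(grad, rank):
    """Return (low_rank_grad, projector) for a 2D gradient matrix."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                      # column-space projector (m x r)
    return P.T @ grad, P                 # (r x n) compressed gradient

def project_back(low_rank_update, P):
    return P @ low_rank_update           # lift the update back to full size (m x n)

# Toy usage: optimizer moments would live in the r x n space, cutting optimizer memory.
W = torch.randn(1024, 512, requires_grad=True)
loss = (W ** 2).sum()
loss.backward()
g_lr, P = project_gradient(W.grad, rank=32)
update = -1e-3 * g_lr                    # stand-in for an Adam-style step on g_lr
with torch.no_grad():
    W += project_back(update, P)
```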
Authors: Max Roberts, Ian Colwell, Clara Chew, Dallas Masters, Karl Nordstrom
Abstract: Muon Space (Muon) is building a constellation of small satellites, many of which will carry global navigation satellite system-reflectometry (GNSS-R) receivers. In preparation for the launch of this constellation, we have developed a generalized deep learning retrieval pipeline, which now produces operational GNSS-R near-surface soil moisture retrievals using data from NASA's Cyclone GNSS (CYGNSS) mission. In this article, we describe the input datasets, preprocessing methods, model architecture, and development methods, and detail the soil moisture products generated from these retrievals. The performance of this product is quantified against in situ measurements and compared to both the target dataset (retrievals from the Soil Moisture Active-Passive (SMAP) satellite) and the v1.0 soil moisture product from the CYGNSS mission. The Muon Space product achieves improvements in spatial resolution over SMAP with comparable performance in many regions. An ubRMSE of 0.032 cm$^3$ cm$^{-3}$ against in situ soil moisture observations from SMAP core validation sites is shown, though performance is lower than SMAP's in forests and/or mountainous terrain. The Muon Space product outperforms the v1.0 CYGNSS soil moisture product in almost all aspects. This initial release serves as the foundation of our operational soil moisture product, which will soon additionally include data from Muon Space satellites.
Authors: Ching-Yi Wang
Abstract: Artificial Intelligence (AI) has achieved transformative success across a wide range of domains, revolutionizing fields such as healthcare, education, and human-computer interaction. However, the mechanisms driving AI's performance often remain opaque, particularly in the context of large language models (LLMs), which have advanced at an unprecedented pace in recent years. Multi-modal large language models (MLLMs) like GPT-4o exemplify this evolution, integrating text, audio, and visual inputs to enable interaction across diverse domains. Despite their remarkable capabilities, these models remain largely "black boxes," offering limited insight into how they process multi-modal information internally. This lack of transparency poses significant challenges, including systematic biases, flawed associations, and unintended behaviors, which require careful investigation. Understanding the decision-making processes of MLLMs is both beneficial and essential for mitigating these challenges and ensuring their reliable deployment in critical applications. GPT-4o was chosen as the focus of this study for its advanced multi-modal capabilities, which allow simultaneous processing of textual and visual information. These capabilities make it an ideal model for investigating the parallels and distinctions between machine-driven and human-driven visual perception. While GPT-4o performs effectively in tasks involving structured and complete data, its reliance on bottom-up processing, which involves a feature-by-feature analysis of sensory inputs, presents challenges when interpreting complex or ambiguous stimuli. This limitation contrasts with human vision, which is remarkably adept at resolving ambiguity and reconstructing incomplete information through high-level cognitive processes.
Authors: Waheed Rehman
Abstract: The field of steganography has long been focused on developing methods to securely embed information within various digital media while ensuring imperceptibility and robustness. However, the growing sophistication of detection tools and the demand for increased data hiding capacity have revealed limitations in traditional techniques. In this paper, we propose a novel approach to image steganography that leverages the power of generative adversarial networks (GANs) to address these challenges. By employing a carefully designed GAN architecture, our method ensures the creation of stego-images that are visually indistinguishable from their original counterparts, effectively thwarting detection by advanced steganalysis tools. Additionally, the adversarial training paradigm optimizes the balance between embedding capacity, imperceptibility, and robustness, enabling more efficient and secure data hiding. We evaluate our proposed method through a series of experiments on benchmark datasets and compare its performance against baseline techniques, including least significant bit (LSB) substitution and discrete cosine transform (DCT)-based methods. Our results demonstrate significant improvements in metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and robustness against detection. This work not only contributes to the advancement of image steganography but also provides a foundation for exploring GAN-based approaches for secure digital communication.
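As a point of reference for the least-significant-bit baseline this paper compares against (not the proposed GAN-based method), a minimal embed/extract sketch might look like the following; the cover image and message sizes are arbitrary placeholders.

```python
import numpy as np

def lsb_embed(cover, bits):
    """Embed a 0/1 bit array into the lowest bit of a uint8 grayscale cover image."""
    flat = cover.flatten().copy()
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | bits   # overwrite the lowest bit
    return flat.reshape(cover.shape)

def lsb_extract(stego, n_bits):
    """Read back the first n_bits hidden bits."""
    return stego.flatten()[:n_bits] & 1

cover = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
message = np.random.randint(0, 2, size=256, dtype=np.uint8)
stego = lsb_embed(cover, message)
assert np.array_equal(lsb_extract(stego, len(message)), message)
```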
Authors: Luca Colombo, Fabrizio Pittorino, Manuel Roveri
Abstract: Binary Neural Networks (BNNs) hold the potential for significantly reducing computational complexity and memory demand in machine and deep learning. However, most successful training algorithms for BNNs rely on quantization-aware floating-point Stochastic Gradient Descent (SGD), with full-precision hidden weights used during training. The binarized weights are only used at inference time, hindering the full exploitation of binary operations during the training process. In contrast to the existing literature, we introduce, for the first time, a multi-layer training algorithm for BNNs that does not require the computation of back-propagated full-precision gradients. Specifically, the proposed algorithm is based on local binary error signals and binary weight updates, employing integer-valued hidden weights that serve as a synaptic metaplasticity mechanism, thereby establishing it as a neurobiologically plausible algorithm. The binary-native and gradient-free algorithm proposed in this paper is capable of training binary multi-layer perceptrons (BMLPs) with binary inputs, weights, and activations, by using exclusively XNOR, Popcount, and increment/decrement operations, hence effectively paving the way for a new class of operation-optimized training algorithms. Experimental results on BMLPs fully trained in a binary-native and gradient-free manner on multi-class image classification benchmarks demonstrate an accuracy improvement of up to +13.36% compared to the fully binary state-of-the-art solution, showing minimal accuracy degradation compared to the same architecture trained with full-precision SGD and floating-point weights, activations, and inputs. The proposed algorithm is made available to the scientific community as a public repository.
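To make the XNOR/Popcount arithmetic mentioned above concrete, here is a minimal sketch of the primitive only (not the paper's local binary error signals or integer metaplastic updates): with inputs and weights in {-1, +1} stored as bits {0, 1}, a dot product reduces to an XNOR followed by a popcount.

```python
import numpy as np

def binary_dot(x_bits, w_bits):
    """Dot product of +/-1 vectors stored as 0/1 bit arrays, via XNOR + popcount."""
    matches = np.count_nonzero(~(x_bits ^ w_bits) & 1)   # popcount of XNOR
    return 2 * matches - x_bits.size                      # map back to the +/-1 domain

x = np.random.randint(0, 2, size=128, dtype=np.uint8)
w = np.random.randint(0, 2, size=128, dtype=np.uint8)
reference = np.dot(2 * x.astype(int) - 1, 2 * w.astype(int) - 1)
assert binary_dot(x, w) == reference
```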
Authors: Sicheng Feng, Keda Tao, Huan Wang
Abstract: Oracle pruning, which selects unimportant weights by minimizing the pruned train loss, has been taken as the foundation for most neural network pruning methods for over 35 years, yet few, if any, have examined how well this foundation really holds. This paper, for the first time, attempts to examine its validity on modern deep models through empirical correlation analyses and provide reflections on the field of neural network pruning. Specifically, for a typical pruning algorithm with three stages (pretraining, pruning, and retraining), we analyze the model performance correlation before and after retraining. Extensive experiments (37K models are trained) across a wide spectrum of models (LeNet5, VGG, ResNets, ViT, MLLM) and datasets (MNIST and its variants, CIFAR10/CIFAR100, ImageNet-1K, MLLM data) are conducted. The results lead to a surprising conclusion: on modern deep learning models, the performance before retraining is barely correlated with the performance after retraining. Namely, the weights selected by oracle pruning can hardly guarantee a good performance after retraining. This further implies that existing works using oracle pruning to derive pruning criteria may be groundless from the beginning. Further studies suggest that rising task complexity is one factor that makes oracle pruning invalid nowadays. Finally, given the evidence, we argue that the retraining stage in a pruning algorithm should be accounted for when developing any pruning criterion.
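A minimal sketch of the kind of correlation analysis described above, using placeholder numbers rather than the paper's 37K trained models: compare each pruned candidate's performance right after pruning with its performance after retraining.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
# Placeholder accuracies for 200 pruned candidates of the same architecture.
acc_before_retrain = rng.uniform(0.10, 0.60, size=200)
acc_after_retrain = 0.90 + 0.02 * rng.standard_normal(200)

r, p_r = pearsonr(acc_before_retrain, acc_after_retrain)
rho, p_rho = spearmanr(acc_before_retrain, acc_after_retrain)
print(f"Pearson r = {r:.3f} (p={p_r:.2g}), Spearman rho = {rho:.3f} (p={p_rho:.2g})")
```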
Authors: Weixin Mao, Weiheng Zhong, Zhou Jiang, Dong Fang, Zhongyue Zhang, Zihan Lan, Fan Jia, Tiancai Wang, Haoqiang Fan, Osamu Yoshie
Abstract: Existing policy learning methods predominantly adopt the task-centric paradigm, necessitating the collection of task data in an end-to-end manner. Consequently, the learned policy tends to fail to tackle novel tasks. Moreover, it is hard to localize the errors for a complex task with multiple stages due to end-to-end learning. To address these challenges, we propose RoboMatrix, a skill-centric and hierarchical framework for scalable task planning and execution. We first introduce a novel skill-centric paradigm that extracts the common meta-skills from different complex tasks. This allows for the capture of embodied demonstrations through a skill-centric approach, enabling the completion of open-world tasks by combining learned meta-skills. To fully leverage meta-skills, we further develop a hierarchical framework that decouples complex robot tasks into three interconnected layers: (1) a high-level modular scheduling layer; (2) a middle-level skill layer; and (3) a low-level hardware layer. Experimental results illustrate that our skill-centric and hierarchical framework achieves remarkable generalization performance across novel objects, scenes, tasks, and embodiments. This framework offers a novel solution for robot task planning and execution in open-world scenarios. Our software and hardware are available at https://github.com/WayneMao/RoboMatrix.
Authors: Yifan Zhu, Tianyi Xiang, Aaron Dollar, Zherong Pan
Abstract: Identifying predictive world models for robots from sparse online observations is essential for robot task planning and execution in novel environments. However, existing methods that leverage differentiable simulators to identify world models are incapable of jointly optimizing the shape, appearance, and physical properties of the scene. In this work, we introduce a novel object representation that allows the joint identification of these properties. Our method employs a novel differentiable point-based object representation coupled with a grid-based appearance field, which allows differentiable object collision detection and rendering. Combined with a differentiable physical simulator, we achieve end-to-end optimization of world models, given the sparse visual and tactile observations of a physical motion sequence. Through a series of benchmarking system identification tasks in simulated and real environments, we show that our method can learn both simulation- and rendering-ready world models from only a few partial observations.
Authors: Jianhao Jiao, Ruoyu Geng, Yuanhang Li, Ren Xin, Bowen Yang, Jin Wu, Lujia Wang, Ming Liu, Rui Fan, Dimitrios Kanoulas
Abstract: The creation of a metric-semantic map, which encodes human-prior knowledge, represents a high-level abstraction of environments. However, constructing such a map poses challenges related to the fusion of multi-modal sensor data, the attainment of real-time mapping performance, and the preservation of structural and semantic information consistency. In this paper, we introduce an online metric-semantic mapping system that utilizes LiDAR-Visual-Inertial sensing to generate a global metric-semantic mesh map of large-scale outdoor environments. Leveraging GPU acceleration, our mapping process achieves exceptional speed, with frame processing taking less than 7ms, regardless of scenario scale. Furthermore, we seamlessly integrate the resultant map into a real-world navigation system, enabling metric-semantic-based terrain assessment and autonomous point-to-point navigation within a campus environment. Through extensive experiments conducted on both publicly available and self-collected datasets comprising 24 sequences, we demonstrate the effectiveness of our mapping and navigation methodologies. Code has been publicly released: https://github.com/gogojjh/cobra
Authors: Phuoc-Nguyen Bui, Duc-Tai Le, Junghyun Bum, Hyunseung Choo
Abstract: Traditional deep learning methods in medical imaging often focus solely on segmentation or classification, limiting their ability to leverage shared information. Multi-task learning (MTL) addresses this by combining both tasks through shared representations but often struggles to balance local spatial features for segmentation and global semantic features for classification, leading to suboptimal performance. In this paper, we propose a simple yet effective UNet-based MTL model, where features extracted by the encoder are used to predict classification labels, while the decoder produces the segmentation mask. The model introduces an advanced encoder incorporating a novel ResFormer block that integrates local context from convolutional feature extraction with long-range dependencies modeled by the Transformer. This design captures broader contextual relationships and fine-grained details, improving classification and segmentation accuracy. To enhance classification performance, multi-scale features from different encoder levels are combined to leverage the hierarchical representation of the input image. For segmentation, the features passed to the decoder via skip connections are refined using a novel dilated feature enhancement (DFE) module, which captures information at different scales through three parallel convolution branches with varying dilation rates. This allows the decoder to detect lesions of varying sizes with greater accuracy. Experimental results across multiple medical datasets confirm the superior performance of our model in both segmentation and classification tasks, compared to state-of-the-art single-task and multi-task learning methods.
Authors: Sanghyun Kim, Moonseok Choi, Jinwoo Shin, Juho Lee
Abstract: Fine-tuning text-to-image diffusion models is widely used for personalization and adaptation for new domains. In this paper, we identify a critical vulnerability of fine-tuning: safety alignment methods designed to filter harmful content (e.g., nudity) can break down during fine-tuning, allowing previously suppressed content to resurface, even when using benign datasets. While this "fine-tuning jailbreaking" issue is known in large language models, it remains largely unexplored in text-to-image diffusion models. Our investigation reveals that standard fine-tuning can inadvertently undo safety measures, causing models to relearn harmful concepts that were previously removed and even exacerbate harmful behaviors. To address this issue, we present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation (LoRA) modules separately from Fine-Tuning LoRA components and merging them during inference. This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks. Our experiments demonstrate that Modular LoRA outperforms traditional fine-tuning methods in maintaining safety alignment, offering a practical approach for enhancing the security of text-to-image diffusion models against potential attacks.
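A minimal sketch of what "merging LoRA modules during inference" can look like, under the assumption that merging means adding each low-rank delta (B @ A) to the frozen base weight; the paper's exact merging rule for Modular LoRA may differ.

```python
import torch

def merge_lora(W_base, loras, alpha=1.0):
    """Add several independently trained LoRA deltas (B @ A) to a frozen weight matrix."""
    W = W_base.clone()
    for A, B in loras:                 # A: (r, in_dim), B: (out_dim, r)
        W += alpha * (B @ A)
    return W

d_out, d_in, r = 512, 512, 8
W_base = torch.randn(d_out, d_in)
safety_lora = (torch.randn(r, d_in) * 0.01, torch.randn(d_out, r) * 0.01)
finetune_lora = (torch.randn(r, d_in) * 0.01, torch.randn(d_out, r) * 0.01)
W_inference = merge_lora(W_base, [safety_lora, finetune_lora])
```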
Authors: Danilo Costarelli, Michele Piconi
Abstract: In this paper, we provide two algorithms based on the theory of multidimensional neural network (NN) operators activated by hyperbolic tangent sigmoidal functions. Theoretical results are recalled to justify the performance of the algorithms implemented here. Specifically, the first algorithm models multidimensional signals (such as digital images), while the second one addresses the problem of rescaling and enhancement of the considered data. We discuss several applications of the NN-based algorithms for modeling and rescaling/enhancing remote sensing data (represented as images), with numerical experiments conducted on a selection of remote sensing (RS) images from the (open access) RETINA dataset. A comparison with classical interpolation methods, such as bilinear and bicubic interpolation, shows that the proposed algorithms outperform the others, particularly in terms of the Structural Similarity Index (SSIM).
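A minimal 1D sketch of a sigmoidal NN operator of the kind recalled above, assuming the common construction sigma(x) = (tanh(x) + 1) / 2 with density phi(x) = (sigma(x + 1) - sigma(x - 1)) / 2; the paper's multidimensional operators and the rescaling/enhancement algorithm are not reproduced here.

```python
import numpy as np

def sigma(x):
    return 0.5 * (np.tanh(x) + 1.0)

def phi(x):
    return 0.5 * (sigma(x + 1.0) - sigma(x - 1.0))

def nn_operator(f, x, n):
    """Approximate f on [0, 1] at points x using n + 1 samples f(k/n)."""
    k = np.arange(0, n + 1)
    weights = phi(n * x[:, None] - k[None, :])            # (len(x), n + 1)
    samples = f(k / n)
    return (weights @ samples) / weights.sum(axis=1)

f = lambda t: np.sin(2 * np.pi * t)
x = np.linspace(0.0, 1.0, 200)
approx = nn_operator(f, x, n=64)
print("max abs error:", np.max(np.abs(approx - f(x))))
```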
Authors: Changqing Ji, Keisuke Kawasaki, Isao Hasegwa, Takayuki Okatani
Abstract: In brain-computer interface (BCI) applications, accurately decoding brain signals is a critical task. For the multi-class classification of ECoG signals, improving classification accuracy is one of the current research hotspots. ECoG acquisition uses a high-density electrode array and a high sampling frequency, which gives ECoG data high similarity and redundancy in the temporal domain as well as a unique spatial pattern in the spatial domain. Extracting features effectively is therefore both exciting and challenging. Previous work found that visual-related ECoG can carry visual information via the frequency and spatial domains. Based on this finding, we focused on using deep learning to design frequency and spatial feature extraction modules, and proposed a deep-learning-based Bi-Band ECoGNet model. The main contributions of this paper are: 1) the Bi-BCWT (Bi-Band Channel-Wise Transform) neural network module is designed to replace the time-consuming MST method; this module greatly improves computation and data storage efficiency and effectively increases training speed; 2) the Bi-BCWT module effectively takes into account information in both the low-frequency and high-frequency domains, which is more conducive to ECoG multi-class classification tasks; 3) because ECoG is acquired with a 2D electrode array, the newly designed 2D Spatial-Temporal feature encoder extracts 2D spatial features better, and experiments show that exploiting this 2D spatial data structure effectively improves classification accuracy; 4) compared with previous work, the Bi-Band ECoGNet model is smaller and performs better, with an accuracy increase of 1.24% and a 6x faster training speed, making it more suitable for BCI applications.
Authors: Lan Feng, Fan Nie, Yuejiang Liu, Alexandre Alahi
Abstract: We propose TAROT, a targeted data selection framework grounded in optimal transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, these heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary factors contributing to this limitation: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions inherent in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, providing a more reliable measure of data influence. Building on this, TAROT uses whitened feature distance to quantify and minimize the optimal transport distance between the selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, highlighting its versatility across various deep learning tasks. Code is available at https://github.com/vita-epfl/TAROT.
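A minimal sketch of a whitened feature distance of the kind described above (a Mahalanobis-style distance computed after decorrelating the feature dimensions); TAROT's exact estimator and its optimal transport objective are not reproduced, and the random features below are placeholders.

```python
import numpy as np

def whitened_distance(candidates, targets, eps=1e-6):
    """Pairwise distances between candidate and target features after ZCA-style whitening."""
    pooled = np.vstack([candidates, targets])
    mean = pooled.mean(axis=0)
    cov = np.cov(pooled - mean, rowvar=False) + eps * np.eye(pooled.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T    # whitening transform
    cw, tw = (candidates - mean) @ W, (targets - mean) @ W
    return np.linalg.norm(cw[:, None, :] - tw[None, :, :], axis=-1)

cand = np.random.randn(100, 16)
targ = np.random.randn(20, 16)
D = whitened_distance(cand, targ)      # (100, 20) cost matrix, e.g. fed to an OT solver
```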
Authors: Yongqi Zhai, Yi Ma, Luyang Tang, Wei Jiang, Ronggang Wang
Abstract: Scalable coding, which can adapt to channel bandwidth variation, performs well in today's complex network environment. However, most existing scalable compression methods face two challenges: reduced compression performance and insufficient scalability. To overcome the above problems, this paper proposes a learned fine-grained scalable image compression framework, namely DeepFGS. Specifically, we introduce a feature separation backbone to divide the image information into basic and scalable features, then redistribute the features channel by channel through an information rearrangement strategy. In this way, we can generate a continuously scalable bitstream via one-pass encoding. For entropy coding, we design a mutual entropy model to fully explore the correlation between the basic and scalable features. In addition, we reuse the decoder to reduce the parameters and computational complexity. Experiments demonstrate that our proposed DeepFGS outperforms previous learning-based scalable image compression models and traditional scalable image codecs in both PSNR and MS-SSIM metrics.
Authors: Yongqi Zhai, Jiayu Yang, Wei Jiang, Chunhui Yang, Luyang Tang, Ronggang Wang
Abstract: In neural video codecs, current state-of-the-art methods typically adopt multi-scale motion compensation to handle diverse motions. These methods estimate and compress either optical flow or deformable offsets to reduce inter-frame redundancy. However, flow-based methods often suffer from inaccurate motion estimation in complicated scenes. Deformable convolution-based methods are more robust but have a higher bit cost for motion coding. In this paper, we propose a hybrid context generation module, which combines the advantages of the above methods in an optimal way and achieves accurate compensation at a low bit cost. Specifically, considering the characteristics of features at different scales, we adopt flow-guided deformable compensation at the largest scale to produce accurate alignment in detailed regions. For smaller-scale features, we perform flow-based warping to save the bit cost for motion coding. Furthermore, we design a local-global context enhancement module to fully explore the local-global information of previously reconstructed signals. Experimental results demonstrate that our proposed Hybrid Local-Global Context learning (HLGC) method can significantly enhance the state-of-the-art methods on standard test datasets.
Authors: Yuxin Tian, Mouxing Yang, Yuhao Zhou, Jian Wang, Qing Ye, Tongliang Liu, Gang Niu, Jiancheng Lv
Abstract: The success of most federated learning (FL) methods heavily depends on label quality, which is often inaccessible in real-world scenarios, such as medicine, leading to the federated label-noise (F-LN) problem. In this study, we observe that the global model of FL memorizes the noisy labels slowly. Based on the observations, we propose a novel approach dubbed Global Reviser for Federated Learning with Noisy Labels (FedGR) to enhance the label-noise robustness of FL. In brief, FedGR employs three novel modules to achieve noisy label sniffing and refining, local knowledge revising, and local model regularization. Specifically, the global model is adopted to infer local data proxies for global sample selection and refine incorrect labels. To maximize the utilization of local knowledge, we leverage the global model to revise the local exponential moving average (EMA) model of each client and distill it into the clients' models. Additionally, we introduce a global-to-local representation regularization to mitigate the overfitting of noisy labels. Extensive experiments on three F-LNL benchmarks against seven baseline methods demonstrate the effectiveness of the proposed FedGR.
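A minimal sketch of the local EMA bookkeeping mentioned above; the "revision" step is shown as interpolating the client EMA toward the global model, which is an assumption about the mechanism rather than FedGR's exact update rule.

```python
import torch

@torch.no_grad()
def update_ema(ema_model, local_model, decay=0.99):
    """Standard exponential moving average of the local model's parameters."""
    for p_ema, p in zip(ema_model.parameters(), local_model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

@torch.no_grad()
def revise_with_global(ema_model, global_model, trust=0.5):
    """Pull the client EMA toward the global model (hypothetical revision rule)."""
    for p_ema, p_g in zip(ema_model.parameters(), global_model.parameters()):
        p_ema.mul_(1.0 - trust).add_(p_g, alpha=trust)

# Toy usage with matching architectures.
make = lambda: torch.nn.Linear(8, 2)
client_model, client_ema, server_model = make(), make(), make()
update_ema(client_ema, client_model)
revise_with_global(client_ema, server_model)
```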
Authors: Yanke Wang, Yolanne Y. R. Lee, Aurelio Dolfini, Markus Reischl, Ender Konukoglu, Kyriakos Flouris
Abstract: Lumbar spine problems are ubiquitous, motivating research into targeted imaging for treatment planning and guided interventions. While high-resolution and high-contrast CT has been the modality of choice, MRI can capture both bone and soft tissue without the ionizing radiation of CT, albeit with a longer acquisition time. The critical trade-off between contrast quality and acquisition time has motivated 'thick slice MRI', which prioritises faster imaging with high in-plane resolution but variable contrast and low through-plane resolution. We investigate a recently developed post-acquisition pipeline which segments vertebrae from thick-slice acquisitions and uses a variational autoencoder to enhance quality after an initial 3D reconstruction. We instead propose a latent space diffusion energy-based prior to leverage diffusion models, which exhibit high-quality image generation. Crucially, we mitigate their high computational cost and low sample efficiency by learning an energy-based latent representation to perform the diffusion processes. Our resulting method outperforms existing approaches across metrics including Dice and VS scores, and more faithfully captures 3D features.
Authors: Jiangmeng Li, Zehua Zang, Qirui Ji, Chuxiong Sun, Wenwen Qiang, Junge Zhang, Changwen Zheng, Fuchun Sun, Hui Xiong
Abstract: Representations learned by self-supervised approaches are generally considered to possess sufficient generalizability and discriminability. However, we disclose a nontrivial mutual-exclusion relationship between these critical representation properties through an exploratory demonstration on self-supervised learning. State-of-the-art self-supervised methods tend to enhance either generalizability or discriminability but not both simultaneously. Thus, learning representations jointly possessing strong generalizability and discriminability presents a specific challenge for self-supervised learning. To this end, we revisit the learning paradigm of self-supervised learning from the perspective of evolutionary game theory (EGT) and outline the theoretical roadmap to achieve a desired trade-off between these representation properties. EGT performs well in analyzing the trade-off point in a two-player game by utilizing dynamic system modeling. However, the EGT analysis requires sufficient annotated data, which contradicts the principle of self-supervised learning, i.e., the EGT analysis cannot be conducted without the annotations of the specific target domain for self-supervised learning. Thus, to enhance the methodological generalization, we propose a novel self-supervised learning method that leverages advancements in reinforcement learning to jointly benefit from the general guidance of EGT and sequentially optimize the model to chase the consistent improvement of generalizability and discriminability for specific target domains during pre-training. Theoretically, we establish that the proposed method tightens the generalization error upper bound of self-supervised learning. Empirically, our method achieves state-of-the-art performance on various benchmarks.
Authors: Juhyung Ha, Jong Sung Park, David Crandall, Eleftherios Garyfallidis, Xuhong Zhang
Abstract: Medical image translation is the process of converting from one imaging modality to another, in order to reduce the need for multiple image acquisitions from the same patient. This can enhance the efficiency of treatment by reducing the time, equipment, and labor needed. In this paper, we introduce a multi-resolution guided Generative Adversarial Network (GAN)-based framework for 3D medical image translation. Our framework uses a 3D multi-resolution Dense-Attention UNet (3D-mDAUNet) as the generator and a 3D multi-resolution UNet as the discriminator, optimized with a unique combination of loss functions including voxel-wise GAN loss and 2.5D perception loss. Our approach yields promising results in volumetric image quality assessment (IQA) across a variety of imaging modalities, body regions, and age groups, demonstrating its robustness. Furthermore, we propose a synthetic-to-real applicability assessment as an additional evaluation to assess the effectiveness of synthetic data in downstream applications such as segmentation. This comprehensive evaluation shows that our method produces synthetic medical images not only of high-quality but also potentially useful in clinical applications. Our code is available at github.com/juhha/3D-mADUNet.
Authors: Xin Tie, Weijie Chen, Zachary Huemann, Brayden Schott, Nuohao Liu, Tyler J. Bradshaw
Abstract: Accurate segmentation of gross tumor volume (GTV) is essential for effective MRI-guided adaptive radiotherapy (MRgART) in head and neck cancer. However, manual segmentation of the GTV over the course of therapy is time-consuming and prone to interobserver variability. Deep learning (DL) has the potential to overcome these challenges by automatically delineating GTVs. In this study, our team, $\textit{UW LAIR}$, tackled the challenges of both pre-radiotherapy (pre-RT) (Task 1) and mid-radiotherapy (mid-RT) (Task 2) tumor volume segmentation. To this end, we developed a series of DL models for longitudinal GTV segmentation. The backbone of our models for both tasks was SegResNet with deep supervision. For Task 1, we trained the model using a combined dataset of pre-RT and mid-RT MRI data, which resulted in the improved aggregated Dice similarity coefficient (DSCagg) on an internal testing set compared to models trained solely on pre-RT MRI data. In Task 2, we introduced mask-aware attention modules, enabling pre-RT GTV masks to influence intermediate features learned from mid-RT data. This attention-based approach yielded slight improvements over the baseline method, which concatenated mid-RT MRI with pre-RT GTV masks as input. In the final testing phase, the ensemble of 10 pre-RT segmentation models achieved an average DSCagg of 0.794, with 0.745 for primary GTV (GTVp) and 0.844 for metastatic lymph nodes (GTVn) in Task 1. For Task 2, the ensemble of 10 mid-RT segmentation models attained an average DSCagg of 0.733, with 0.607 for GTVp and 0.859 for GTVn, leading us to $\textbf{achieve 1st place}$. In summary, we presented a collection of DL models that could facilitate GTV segmentation in MRgART, offering the potential to streamline radiation oncology workflows. Our code and model weights are available at https://github.com/xtie97/HNTS-MRG24-UWLAIR.
Authors: Zhi Qi, Shihong Yuan, Yuyin Yuan, Linling Kuang, Yoshiyuki Kabashima, Xiangming Meng
Abstract: Diffusion models have shown strong performance in solving inverse problems through posterior sampling, although they suffer from errors during the earlier steps. To mitigate this issue, several Decoupled Posterior Sampling methods have been recently proposed. However, the reverse process in these methods ignores measurement information, leading to errors that impede effective optimization in subsequent steps. To solve this problem, we propose Guided Decoupled Posterior Sampling (GDPS) by integrating a data consistency constraint in the reverse process. The constraint performs a smoother transition within the optimization process, facilitating a more effective convergence toward the target distribution. Furthermore, we extend our method to latent diffusion models and Tweedie's formula, demonstrating its scalability. We evaluate GDPS on the FFHQ and ImageNet datasets across various linear and nonlinear tasks under both standard and challenging conditions. Experimental results demonstrate that GDPS achieves state-of-the-art performance, improving accuracy over existing methods.
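A minimal sketch of adding a data-consistency gradient to a reverse diffusion step, in the spirit of the guidance described above (generic DPS-style guidance; GDPS's decoupled formulation and its latent-space extension are not reproduced, and every component below is a stand-in).

```python
import torch

def guided_reverse_step(x_t, denoiser_step, x0_predictor, A, y, step_size=1.0):
    """One reverse step followed by a gradient step on ||y - A(x0_hat)||^2."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = x0_predictor(x_t)                       # Tweedie-style estimate of the clean image
    loss = ((y - A(x0_hat)) ** 2).sum()              # data-consistency term
    grad = torch.autograd.grad(loss, x_t)[0]
    x_prev = denoiser_step(x_t)                      # unconditional reverse-diffusion update
    return (x_prev - step_size * grad).detach()

# Toy usage with hypothetical stand-in components.
A = lambda x: x[:, :, ::2, ::2]                      # e.g. a downsampling measurement operator
y = torch.randn(1, 3, 32, 32)
x0_predictor = lambda x: x                           # placeholder for the model's x0 estimate
denoiser_step = lambda x: 0.99 * x                   # placeholder reverse update
x = torch.randn(1, 3, 64, 64)
x = guided_reverse_step(x, denoiser_step, x0_predictor, A, y)
```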
Authors: Yu Zhang, Shuang Li, Yibing Wang, Yu Sun, Wenyi Xiang
Abstract: Photoacoustic imaging (PAI) suffers from inherent limitations that can degrade the quality of reconstructed results, such as noise, artifacts, and incomplete data acquisition caused by sparse sampling or partial array detection. In this study, we proposed a new optimization method for both two-dimensional (2D) and three-dimensional (3D) PAI reconstruction results, called the regularized iteration method with shape prior. The shape prior is a probability matrix derived from the reconstruction results of multiple sets of random partial array signals in a computational imaging system using any reconstruction algorithm, such as Delay-and-Sum (DAS) and Back-Projection (BP). In the probability matrix, high-probability locations indicate high consistency among multiple reconstruction results at those positions, suggesting a high likelihood of representing the true imaging results. In contrast, low-probability locations indicate higher randomness, leaning more towards noise or artifacts. As a shape prior, this probability matrix guides the iteration and regularization of the entire array signal reconstruction results using the original reconstruction algorithm (the same algorithm used for processing the random partial array signals). The method exploits the fact that, across results reconstructed from multiple sets of random partial array signals of the imaging system, the true object structure is more consistent than noise or artifacts. The probability matrix is taken as a prior for improving the original reconstruction results, and the optimizer further iterates the imaging results to remove noise and artifacts and improve imaging fidelity. The effect is especially pronounced in sparse-view cases, which introduce more artifacts. Simulation and real experiments have both demonstrated the superiority of this method.
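A minimal sketch of building the consistency/probability matrix described above: reconstruct the scene from several random partial-array subsets (here faked with noisy copies of a reference image) and count how often each pixel exceeds a threshold. The actual DAS/BP reconstructions and the regularized iteration are not reproduced, and the threshold is an illustrative assumption.

```python
import numpy as np

def probability_matrix(partial_recons, threshold=0.5):
    """Fraction of partial reconstructions in which each pixel is 'present'."""
    stack = np.stack(partial_recons)                      # (n_subsets, H, W)
    return (stack > threshold).mean(axis=0)

rng = np.random.default_rng(0)
reference = np.zeros((64, 64))
reference[20:40, 20:40] = 1.0                             # toy object
partials = [reference + 0.3 * rng.standard_normal(reference.shape) for _ in range(16)]
P = probability_matrix(partials)       # high values where reconstructions agree on structure
```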
Authors: Xiaoxiang Han, Yiman Liu, Jiang Shang, Qingli Li, Jiangang Chen, Menghan Hu, Qi Zhang, Yuqi Zhang, Yan Wang
Abstract: Segmenting internal structure from echocardiography is essential for the diagnosis and treatment of various heart diseases. Semi-supervised learning has shown its ability to alleviate annotation scarcity. While existing semi-supervised methods have been successful in image segmentation across various medical imaging modalities, few have attempted to design methods specifically addressing the challenges posed by the poor contrast, blurred edge details, and noise of echocardiography. These characteristics pose challenges to the generation of high-quality pseudo-labels in semi-supervised segmentation based on Mean Teacher. Inspired by human reflection on erroneous practices, we devise an error reflection strategy for a semi-supervised echocardiography segmentation architecture. The process triggers the model to reflect on inaccuracies in unlabeled image segmentation, thereby enhancing the robustness of pseudo-label generation. Specifically, the strategy is divided into two steps. The first step is called reconstruction reflection. The network is tasked with reconstructing authentic proxy images from the semantic masks of unlabeled images and their auxiliary sketches, while maximizing the structural similarity between the original inputs and the proxies. The second step is called guidance correction. Reconstruction error maps decouple unreliable segmentation regions. Then, reliable data that are more likely to occur near high-density areas are leveraged to guide the optimization of unreliable data potentially located around decision boundaries. Additionally, we introduce an effective data augmentation strategy, termed the multi-scale mix-up strategy, to minimize the empirical distribution gap between labeled and unlabeled images and perceive diverse scales of cardiac anatomical structures. Extensive experiments demonstrate the competitiveness of the proposed method.
Authors: Naman Deep Singh, Francesco Croce, Matthias Hein
Abstract: Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, as well as the fact that CLIP models are trained on image-text pairs from the web, make them both a worthwhile and relatively easy target for backdoor attacks. As training foundational models, such as CLIP, from scratch is very expensive, this paper focuses on cleaning potentially poisoned models via fine-tuning. We first show that existing cleaning techniques are not effective against simple structured triggers used in Blended or BadNet backdoor attacks, exposing a critical vulnerability for potential real-world deployment of these models. Then, we introduce PAR, Perturb and Recover, a surprisingly simple yet effective mechanism to remove backdoors from CLIP models. Through extensive experiments across different encoders and types of backdoor attacks, we show that PAR achieves high backdoor removal rate while preserving good standard performance. Finally, we illustrate that our approach is effective even only with synthetic text-image pairs, i.e. without access to real training data. The code and models are available at \href{https://github.com/nmndeep/PerturbAndRecover}{https://github.com/nmndeep/PerturbAndRecover}.
URLs: https://github.com/nmndeep/PerturbAndRecover
Authors: Rulin Zhou, Yingjie Feng, Guankun Wang, Xiaopin Zhong, Zongze Wu, Qiang Wu, Xi Zhang
Abstract: Adenoid hypertrophy stands as a common cause of obstructive sleep apnea-hypopnea syndrome in children. It is characterized by snoring, nasal congestion, and growth disorders. Computed Tomography (CT) emerges as a pivotal medical imaging modality, utilizing X-rays and advanced computational techniques to generate detailed cross-sectional images. Within the realm of pediatric airway assessments, CT imaging provides an insightful perspective on the shape and volume of enlarged adenoids. Despite the advances of deep learning methods for medical imaging analysis, there remains a gap in the segmentation of adenoid hypertrophy in CT scans. To address this research gap, we introduce TSUBF-Net (Trans-Spatial UNet-like Network based on Bi-direction Fusion), a 3D medical image segmentation framework. TSUBF-Net is engineered to effectively discern intricate 3D spatial interlayer features in CT scans and enhance the extraction of boundary-blurring features. Notably, we propose two innovative modules within the U-shaped network architecture: the Trans-Spatial Perception module (TSP) and the Bi-directional Sampling Collaborated Fusion module (BSCF). These two modules are in charge of operating during the sampling process and strategically fusing down-sampled and up-sampled features, respectively. Furthermore, we introduce the Sobel loss term, which optimizes the smoothness of the segmentation results and enhances model accuracy. Extensive 3D segmentation experiments are conducted on several datasets. TSUBF-Net is superior to the state-of-the-art methods, achieving the lowest HD95 (7.03) and the highest IoU (85.63) and DSC (92.26) on our own AHSD dataset. The results on the other two public datasets also demonstrate that our method can robustly and effectively address the challenges of 3D segmentation in CT scans.
Authors: Dengqiang Jia, Xinnian Yang, Xiaosong Xiong, Shijie Huang, Feiyu Hou, Li Qin, Kaicong Sun, Kannie Wai Yan Chan, Dinggang Shen
Abstract: Vessel dynamics simulation is vital in studying the relationship between geometry and vascular disease progression. Reliable dynamics simulation relies on high-quality vascular meshes. Most of the existing mesh generation methods highly depend on manual annotation, which is time-consuming and laborious, usually facing challenges such as branch merging and vessel disconnection. This will hinder vessel dynamics simulation, especially for the population study. To address this issue, we propose a deep learning-based method, dubbed as DVasMesh to directly generate structured hexahedral vascular meshes from vascular images. Our contributions are threefold. First, we propose to formally formulate each vertex of the vascular graph by a four-element vector, including coordinates of the centerline point and the radius. Second, a vectorized graph template is employed to guide DVasMesh to estimate the vascular graph. Specifically, we introduce a sampling operator, which samples the extracted features of the vascular image (by a segmentation network) according to the vertices in the template graph. Third, we employ a graph convolution network (GCN) and take the sampled features as nodes to estimate the deformation between vertices of the template graph and target graph, and the deformed graph template is used to build the mesh. Taking advantage of end-to-end learning and discarding direct dependency on annotated labels, our DVasMesh demonstrates outstanding performance in generating structured vascular meshes on cardiac and cerebral vascular images. It shows great potential for clinical applications by reducing mesh generation time from 2 hours (manual) to 30 seconds (automatic).
Authors: Malik Abdul Manan, Feng Jinchao, Shahzad Ahmed, Abdul Raheem
Abstract: In medical imaging, efficient segmentation of colon polyps plays a pivotal role in minimally invasive solutions for colorectal cancer. This study introduces a novel approach employing two parallel encoder branches within a network for polyp segmentation. One branch of the encoder incorporates the dual convolution blocks that have the capability to maintain feature information over increased depths, and the other block embraces the single convolution block with the addition of the previous layer's feature, offering diversity in feature extraction within the encoder, combining them before transpose layers with a depth-wise concatenation operation. Our model demonstrated superior performance, surpassing several established deep-learning architectures on the Kvasir and CVC-ClinicDB datasets, achieved a Dice score of 0.919, a mIoU of 0.866 for the Kvasir dataset, and a Dice score of 0.931 and a mIoU of 0.891 for the CVC-ClinicDB. The visual and quantitative results highlight the efficacy of our model, potentially setting a new model in medical image segmentation.
Authors: Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
Abstract: Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, there is a deficiency in datasets for evaluating the visual perception of LVLMs. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
Authors: Aditya Sinha, Siqi Zeng, Makoto Yamada, Han Zhao
Abstract: Most real-world datasets consist of a natural hierarchy between classes or an inherent label structure that is either already available or can be constructed cheaply. However, most existing representation learning methods ignore this hierarchy, treating labels as permutation invariant. Recent work [Zeng et al., 2022] proposes using this structured information explicitly, but the use of Euclidean distance may distort the underlying semantic context [Chen et al., 2013]. In this work, motivated by the advantage of hyperbolic spaces in modeling hierarchical relationships, we propose a novel approach HypStructure: a Hyperbolic Structured regularization approach to accurately embed the label hierarchy into the learned representations. HypStructure is a simple-yet-effective regularizer that consists of a hyperbolic tree-based representation loss along with a centering loss, and can be combined with any standard task loss to learn hierarchy-informed features. Extensive experiments on several large-scale vision benchmarks demonstrate the efficacy of HypStructure in reducing distortion and boosting generalization performance especially under low dimensional scenarios. For a better understanding of structured representation, we perform eigenvalue analysis that links the representation geometry to improved Out-of-Distribution (OOD) detection performance seen empirically. The code is available at \url{https://github.com/uiuctml/HypStructure}.
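A minimal sketch of the Poincaré-ball distance that underlies hyperbolic embeddings of a label hierarchy; HypStructure's tree-based representation loss and centering loss are not reproduced here, and the toy points are placeholders.

```python
import torch

def poincare_distance(x, y, eps=1e-6):
    """Geodesic distance between points inside the unit Poincare ball."""
    sq = torch.sum((x - y) ** 2, dim=-1)
    nx = torch.clamp(1.0 - torch.sum(x ** 2, dim=-1), min=eps)
    ny = torch.clamp(1.0 - torch.sum(y ** 2, dim=-1), min=eps)
    return torch.acosh(1.0 + 2.0 * sq / (nx * ny))

a = 0.3 * torch.nn.functional.normalize(torch.randn(5, 16), dim=-1)
b = 0.6 * torch.nn.functional.normalize(torch.randn(5, 16), dim=-1)
d = poincare_distance(a, b)            # distances grow rapidly near the ball's boundary
```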
Authors: Razi Mahmood, Pingkun Yan, Diego Machado Reyes, Ge Wang, Mannudeep K. Kalra, Parisa Kaviani, Joy T. Wu, Tanveer Syeda-Mahmood
Abstract: Several evaluation metrics have been developed recently to automatically assess the quality of generative AI reports for chest radiographs based only on textual information using lexical, semantic, or clinical named entity recognition methods. In this paper, we develop a new method of report quality evaluation by first extracting fine-grained finding patterns capturing the location, laterality, and severity of a large number of clinical findings. We then performed phrasal grounding to localize their associated anatomical regions on chest radiograph images. The textual and visual measures are then combined to rate the quality of the generated reports. We present results that compare this evaluation metric with other textual metrics on a gold standard dataset derived from the MIMIC collection and show its robustness and sensitivity to factual errors.
Authors: Seongmin Park, Hyungmin Kim, Wonseok Jeon, Juyoung Yang, Byeongwook Jeon, Yoonseon Oh, Jungwook Choi
Abstract: Deep neural network (DNN)-based policy models like vision-language-action (VLA) models are transformative in automating complex decision-making across applications by interpreting multi-modal data. However, scaling these models greatly increases computational costs, which presents challenges in fields like robot manipulation and autonomous driving that require quick, accurate responses. To address the need for deployment on resource-limited hardware, we propose a new quantization framework for imitation learning (IL)-based policy models that fine-tunes parameters to enhance robustness against low-bit precision errors during training, thereby maintaining efficiency and reliability under constrained conditions. Our evaluations with representative robot manipulation for 4-bit weight-quantization on a real edge GPU demonstrate that our framework achieves up to 2.5x speedup and 2.5x energy savings while preserving accuracy. For 4-bit weight and activation quantized self-driving models, the framework achieves up to 3.7x speedup and 3.1x energy saving on a low-end GPU. These results highlight the practical potential of deploying IL-based policy models on resource-constrained devices.
Authors: Muchao Ye, Weiyang Liu, Pan He
Abstract: The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Abstract: We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
Authors: Oriana Presacan, Alexandru Dorobantiu, Vajira Thambawita, Michael A. Riegler, Mette H. Stensen, Mario Iliceto, Alexandru C. Aldea, Akriti Sharma
Abstract: Accurate embryo morphology assessment is essential in assisted reproductive technology for selecting the most viable embryo. Artificial intelligence has the potential to enhance this process. However, the limited availability of embryo data presents challenges for training deep learning models. To address this, we trained two generative models using two datasets, one we created and made publicly available, and one existing public dataset, to generate synthetic embryo images at various cell stages, including 2-cell, 4-cell, 8-cell, morula, and blastocyst. These were combined with real images to train classification models for embryo cell stage prediction. Our results demonstrate that incorporating synthetic images alongside real data improved classification performance, with the model achieving 97% accuracy compared to 95% when trained solely on real data. Notably, even when trained exclusively on synthetic data and tested on real data, the model achieved a high accuracy of 94%. Furthermore, combining synthetic data from both generative models yielded better classification results than using data from a single generative model. Four embryologists evaluated the fidelity of the synthetic images through a Turing test, during which they annotated inaccuracies and offered feedback. The analysis showed that the diffusion model outperformed the generative adversarial network model, deceiving embryologists 66.6% of the time versus 25.3%, and achieving lower Fréchet inception distance scores.
Authors: Jiazhou Liu, Aravinda S. Rao, Fucai Ke, Tim Dwyer, Benjamin Tag, Pari Delir Haghighi
Abstract: Together with industry experts, we are exploring the potential of head-mounted augmented reality to facilitate safety inspections on high-rise construction sites. A particular concern in the industry is inspecting perimeter safety screens on higher levels of construction sites, intended to prevent falls of people and objects. We aim to support workers performing this inspection task by tracking which parts of the safety screens have been inspected. We use machine learning to automatically detect gaps in the perimeter screens that require closer inspection and remediation and to automate reporting. This work-in-progress paper describes the problem, our early progress, concerns around worker privacy, and the possibilities to mitigate these.
Authors: Christian Gapp, Elias Tappeiner, Martin Welk, Rainer Schubert
Abstract: Medical patient data is always multimodal. Images, text, age, gender, and histopathological data are only a few examples of different modalities in this context. Processing and integrating this multimodal data with deep learning based methods is of utmost interest due to its huge potential for medical procedures such as diagnosis and patient treatment planning. In this work we retrain a multimodal transformer-based model for disease classification. To this end we use the text-image pair dataset from OpenI consisting of 2D chest X-rays associated with clinical reports. Our focus is on fusion methods for merging text and vision information extracted from medical datasets. Different architecture structures with a LLaMA II backbone model are tested. Early fusion of modality-specific features yields better results (best model: 97.10% mean AUC) than late fusion from a deeper level of the architecture (best model: 96.67% mean AUC). Both outperform former classification models tested on the same multimodal dataset. The newly introduced multimodal architecture can be applied to other multimodal datasets with little effort and can be easily adapted for further research, especially, but not limited to, the field of medical AI.
Authors: Bo Pang, Sibo Cheng, Yuhan Huang, Yufang Jin, Yike Guo, I. Colin Prentice, Sandy P. Harrison, Rossella Arcucci
Abstract: Predicting the extent of massive wildfires once ignited is essential to reduce the subsequent socioeconomic losses and environmental damage, but challenging because of the complexity of fire behaviour. Existing physics-based models are limited in predicting large or long-duration wildfire events. Here, we develop a deep-learning-based predictive model, Fire-Image-DenseNet (FIDN), that uses spatial features derived from both near real-time and reanalysis data on the environmental and meteorological drivers of wildfire. We trained and tested this model using more than 300 individual wildfires that occurred between 2012 and 2019 in the western US. In contrast to existing models, the performance of FIDN does not degrade with fire size or duration. Furthermore, it predicts final burnt area accurately even in very heterogeneous landscapes in terms of fuel density and flammability. The FIDN model showed higher accuracy, with a mean squared error (MSE) about 82% and 67% lower than those of the predictive models based on cellular automata (CA) and the minimum travel time (MTT) approaches, respectively. Its structural similarity index measure (SSIM) averages 97%, outperforming the CA and FlamMap MTT models by 6% and 2%, respectively. Additionally, FIDN is approximately three orders of magnitude faster than both CA and MTT models. The enhanced computational efficiency and accuracy advancements offer vital insights for strategic planning and resource allocation for firefighting operations.
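A minimal sketch of the two evaluation metrics quoted above (MSE and SSIM) applied to a predicted versus an observed burnt-area map; the wildfire data and the FIDN model itself are not reproduced, and the arrays below are placeholders.

```python
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

rng = np.random.default_rng(0)
observed = (rng.random((128, 128)) > 0.7).astype(float)       # placeholder burnt-area mask
predicted = np.clip(observed + 0.1 * rng.standard_normal(observed.shape), 0.0, 1.0)

mse = mean_squared_error(observed, predicted)
ssim = structural_similarity(observed, predicted, data_range=1.0)
print(f"MSE = {mse:.4f}, SSIM = {ssim:.3f}")
```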
Authors: Mridula Vijendran, Jingjing Deng, Shuang Chen, Edmond S. L. Ho, Hubert P. H. Shum
Abstract: Artificial Intelligence significantly enhances the visual art industry by analyzing, identifying and generating digitized artistic images. This review highlights the substantial benefits of integrating geometric data into AI models, which helps address challenges such as high inter-class variations, domain gaps, and the separation of style from content. With geometric information, models not only improve the quality of AI-generated graphics synthesis, but also effectively distinguish between style and content, utilizing inherent model biases and shared data traits. We explore methods such as geometric data extraction from artistic images, its impact on human perception, and its use in discriminative tasks. The review also discusses the potential for improving data quality through innovative annotation techniques and the use of geometric data to enhance model adaptability and output refinement. Overall, incorporating geometric guidance boosts model performance in classification and synthesis tasks, providing crucial insights for future AI applications in the visual arts domain.
Authors: Yuxuan Zhou, Xingxing Li, Shengyu Li, Chunxi Xia, Xuanbin Wang, Shaoquan Feng
Abstract: For high-level geo-spatial applications and intelligent robotics, accurate global pose information is of crucial importance. Map-aided localization is an important and universal approach to overcome the limitations of global navigation satellite system (GNSS) in challenging environments. However, current solutions face challenges in terms of mapping flexibility, storage burden and re-localization performance. In this work, we present SF-Loc, a lightweight visual mapping and map-aided localization system, whose core idea is a map representation based on sparse frames with dense (though downsampled) depth, termed visual structure frames. In the mapping phase, multi-sensor dense bundle adjustment (MS-DBA) is applied to construct geo-referenced visual structure frames. The local co-visibility is checked to keep the map sparse and achieve incremental mapping. In the localization phase, coarse-to-fine vision-based localization is performed, in which multi-frame information and the map distribution are fully integrated. Specifically, the concept of spatially smoothed similarity (SSS) is proposed to overcome place ambiguity, and pairwise frame matching is applied for efficient and robust pose estimation. Experimental results on both public and self-made datasets verify the effectiveness of the system. In complex urban road scenarios, the map size is down to 3 MB per kilometer and stable decimeter-level re-localization can be achieved. The code will be made open-source soon (https://github.com/GREAT-WHU/SF-Loc).
Authors: Emmanuelle Bourigault, Amir Jamaludin, Andrew Zisserman
Abstract: Scoliosis is traditionally assessed based solely on 2D lateral deviations, but recent studies have also revealed the importance of other imaging planes in understanding the deformation of the spine. Consequently, extracting the spinal geometry in 3D would help quantify these spinal deformations and aid diagnosis. In this study, we propose an automated general framework to estimate the 3D spine shape from 2D DXA scans. We achieve this by explicitly predicting the sagittal view of the spine from the DXA scan. Using these two orthogonal projections of the spine (coronal in DXA, and sagittal from the prediction), we are able to describe the 3D shape of the spine. The prediction is learnt from over 30k paired images of DXA and MRI scans. We assess the performance of the method on a held out test set, and achieve high accuracy.
Authors: Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi, Jordan J. Bird
Abstract: Recently, the quality of artworks generated using Artificial Intelligence (AI) has increased significantly, resulting in growing difficulties in detecting synthetic artworks. However, limited studies have been conducted on identifying the authenticity of synthetic artworks and their source. This paper introduces AI-ArtBench, a dataset featuring 185,015 artistic images across 10 art styles. It includes 125,015 AI-generated images and 60,000 pieces of human-created artwork. This paper also outlines a method to accurately detect AI-generated images and trace them to their source model. This work proposes AttentionConvNeXt, a novel Convolutional Neural Network model based on ConvNeXt. AttentionConvNeXt was implemented and trained to differentiate between the source of the artwork and its style with an F1-Score of 0.869. The accuracy of attribution to the generative model reaches 0.999. To consolidate the scientific contributions arising from this study, a web-based application named ArtBrain was developed to enable both technical and non-technical users to interact with the model. Finally, this study presents the results of an Artistic Turing Test conducted with 50 participants. The findings reveal that humans could identify AI-generated images with an accuracy of approximately 58%, while the model itself achieved a significantly higher accuracy of around 99%.
Authors: Qian Shao, Kai Zhang, Bang Du, Zepeng Li, Yixuan Wu, Qiyuan Chen, Jian Wu, Jintai Chen, Honghao Gao, Hongxia Xu
Abstract: Deep learning models are widely used to process Computed Tomography (CT) data in the automated screening of pulmonary diseases, significantly reducing the workload of physicians. However, the three-dimensional nature of CT volumes involves an excessive number of voxels, which significantly increases the complexity of model processing. Previous screening approaches often overlook this issue, which undoubtedly reduces screening efficiency. Towards efficient and effective screening, we design a hierarchical approach to reduce the computational cost of pulmonary disease screening. The new approach re-organizes the screening workflows into three steps. First, we propose a Computed Tomography Volume Compression (CTVC) method to select a small slice subset that comprehensively represents the whole CT volume. Second, the selected CT slices are used to detect pulmonary diseases coarsely via a lightweight classification model. Third, an uncertainty measurement strategy is applied to identify samples with low diagnostic confidence, which are re-detected by radiologists. Experiments on two public pulmonary disease datasets demonstrate that our approach achieves comparable accuracy and recall while reducing the time by 50%-70% compared with the counterparts using full CT volumes. Besides, we also found that our approach outperforms previous cutting-edge CTVC methods in retaining important indications after compression.
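As a rough illustration of the three-step workflow described above (slice selection, lightweight classification, uncertainty-gated referral), here is a hedged PyTorch sketch. The evenly spaced slice picker, the confidence threshold, and the toy classifier are stand-ins; the paper's CTVC method and uncertainty measure are more sophisticated.

```python
import torch
import torch.nn.functional as F

def select_slices(volume, k=16):
    """Toy stand-in for the slice-subset selection step: pick k evenly
    spaced slices so the subset spans the whole volume. volume: (D, H, W)."""
    idx = torch.linspace(0, volume.shape[0] - 1, k).long()
    return volume[idx]

def screen(volume, model, tau=0.8):
    """Hierarchical screening: classify the compressed volume and flag
    low-confidence cases for radiologist review."""
    slices = select_slices(volume).unsqueeze(1)       # (k, 1, H, W)
    probs = F.softmax(model(slices), dim=-1).mean(0)  # average over slices
    confidence, pred = probs.max(dim=-1)
    needs_review = confidence < tau                   # uncertainty gate
    return pred.item(), confidence.item(), needs_review.item()

# Placeholder lightweight classifier and a random CT volume, for illustration only.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 64, 2))
pred, conf, review = screen(torch.randn(128, 64, 64), model)
```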
Authors: Elena Loli Piccolomini, Davide Evangelista, Elena Morotti
Abstract: Sparse-view Computed Tomography (CT) is an emerging protocol designed to reduce X-ray radiation dose in medical imaging. Reconstructions from the traditional Filtered Back Projection algorithm suffer from severe artifacts due to sparse data. In contrast, Model-Based Iterative Reconstruction (MBIR) algorithms, though better at mitigating noise through regularization, are too computationally costly for clinical use. This paper introduces a novel technique, denoted as the Deep Guess acceleration scheme, that uses a trained neural network both to speed up the regularized MBIR and to enhance the reconstruction accuracy. We integrate state-of-the-art deep learning tools to initialize a clever starting guess for a proximal algorithm solving a non-convex model, thus computing an interpretable solution image in a few iterations. Experimental results on real CT images demonstrate the effectiveness of Deep Guess in (very) sparse tomographic protocols, where it outperforms its purely variational counterpart and many state-of-the-art data-driven approaches. We also consider a ground-truth-free implementation and test the robustness of the proposed framework to noise.
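The core idea, warm-starting a proximal iterative solver with a network prediction, can be sketched as follows. This is a minimal illustration using an ISTA-style loop with an l1 proximal step; the paper's non-convex model, forward operator, and trained network are replaced with placeholders.

```python
import torch

def soft_threshold(x, lam):
    # Proximal operator of the l1 norm, used here as a simple stand-in
    # for the (possibly non-convex) regularizer.
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def deep_guess_ista(A, y, net, n_iter=5, step=1e-3, lam=1e-2):
    """Warm-start a proximal-gradient solver for min ||Ax - y||^2 + lam*R(x)
    with a network prediction instead of a zero or FBP initialization."""
    x = net(y).detach()                  # "deep guess" initialization
    for _ in range(n_iter):              # a few iterations suffice when warm-started
        grad = A.T @ (A @ x - y)         # data-fidelity gradient
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Illustrative usage with a random linear forward operator and a placeholder network.
A, y = torch.randn(64, 128), torch.randn(64)
net = torch.nn.Linear(64, 128)
x_rec = deep_guess_ista(A, y, net)
```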
Authors: Raihan Ahmed, Shahed Chowdhury Omi, Md. Sadman Rahman, Niaz Rahman Bhuiyan
Abstract: Traveling through toll plazas is one of the primary causes of congestion, as identified in recent studies. Electronic Toll Collection (ETC) systems can mitigate this problem. This experiment focuses on enhancing the security of ETC using RFID tags and number plate verification. For number plate verification, image processing is employed, and a CNN classifier is implemented to detect vehicle registration numbers. Based on the registered number, a notification email is sent to the respective owner for toll fee payment within a specific timeframe to avoid fines. Additionally, toll fees are automatically deducted in real-time from the owner's balance. This system benefits travelers by eliminating the need to queue for toll payment, thereby reducing delays and improving convenience.
Authors: Matin Mortaheb, Mohammad A. Amir Khojastepour, Sennur Ulukus
Abstract: Transformers, known for their attention mechanisms, have proven highly effective in focusing on critical elements within complex data. This feature can effectively be used to address the time-varying channels in wireless communication systems. In this work, we introduce a channel-aware adaptive framework for semantic communication, where different regions of the image are encoded and compressed based on their semantic content. By employing vision transformers, we interpret the attention mask as a measure of the semantic contents of the patches and dynamically categorize the patches to be compressed at various rates as a function of the instantaneous channel bandwidth. Our method enhances communication efficiency by adapting the encoding resolution to the content's relevance, ensuring that even in highly constrained environments, critical information is preserved. We evaluate the proposed adaptive transmission framework using the TinyImageNet dataset, measuring both reconstruction quality and accuracy. The results demonstrate that our approach maintains high semantic fidelity while optimizing bandwidth, providing an effective solution for transmitting multi-resolution data in limited bandwidth conditions.
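A hedged sketch of the patch-rate allocation step described above: patches are ranked by the attention they receive from the CLS token and assigned one of a few compression rates under a bandwidth budget. The tier splits and rate values are illustrative assumptions, not the paper's actual policy.

```python
import torch

def allocate_patch_rates(attn_cls, bandwidth_ratio, rates=(8, 4, 1)):
    """Split patches into compression tiers from a CLS-attention mask.
    attn_cls: (num_patches,) attention of the CLS token over patches.
    bandwidth_ratio in (0, 1] scales how many patches get the higher rates."""
    num_patches = attn_cls.numel()
    order = torch.argsort(attn_cls, descending=True)
    n_high = int(bandwidth_ratio * 0.3 * num_patches)   # illustrative split
    n_mid = int(bandwidth_ratio * 0.5 * num_patches)
    rate_per_patch = torch.full((num_patches,), float(rates[2]))
    rate_per_patch[order[:n_high]] = rates[0]            # most semantic patches
    rate_per_patch[order[n_high:n_high + n_mid]] = rates[1]
    return rate_per_patch                                 # symbols (or bits) per patch

# Example: 196 patches from a ViT attention map, half the nominal bandwidth.
rates = allocate_patch_rates(torch.rand(196), bandwidth_ratio=0.5)
```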
Authors: Weihao Yu, Chenyang Si, Pan Zhou, Mi Luo, Yichen Zhou, Jiashi Feng, Shuicheng Yan, Xinchao Wang
Abstract: MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. Even when specifying the token mixer as a random matrix to mix tokens, the resulting model RandFormer yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers state-of-the-art results. With just conventional token mixers dating back five years, the models instantiated from MetaFormer already beat the state of the art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable convolutions as the token mixer, the model termed ConvFormer, which can be regarded as a pure CNN, outperforms the strong CNN model ConvNeXt. (b) CAFormer sets a new record on ImageNet-1K. By simply applying depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model CAFormer sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224x224 resolution, under normal supervised training without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces activation FLOPs by 71% compared with GELU yet achieves better performance. We expect StarReLU to find great potential in MetaFormer-like models alongside other neural networks.
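For readers curious about StarReLU, here is a small sketch implementing it as a scaled and shifted squared ReLU with learnable scalars, which matches the activation's published description; the paper's default initial values for the scale and bias may differ from the ones used here.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """StarReLU as a scaled and shifted squared ReLU: s * relu(x)**2 + b,
    with a learnable scalar scale and bias (sketch following the description)."""
    def __init__(self, scale=1.0, bias=0.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale))
        self.bias = nn.Parameter(torch.tensor(bias))

    def forward(self, x):
        return self.scale * torch.relu(x) ** 2 + self.bias

# Drop-in replacement for GELU in a MetaFormer-style MLP block (illustrative sizes):
mlp = nn.Sequential(nn.Linear(64, 256), StarReLU(), nn.Linear(256, 64))
out = mlp(torch.randn(2, 64))
```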
Authors: Raphael Sulzer, Renaud Marlet, Bruno Vallet, Loic Landrieu
Abstract: We present a comprehensive survey and benchmark of both traditional and learning-based methods for surface reconstruction from point clouds. This task is particularly challenging for real-world acquisitions due to factors such as noise, outliers, non-uniform sampling, and missing data. Traditional approaches often simplify the problem by imposing handcrafted priors on either the input point clouds or the resulting surface, a process that can require tedious hyperparameter tuning. In contrast, deep learning models have the capability to directly learn the properties of input point clouds and desired surfaces from data. We study the influence of handcrafted and learned priors on the precision and robustness of surface reconstruction techniques. We evaluate various time-tested and contemporary methods in a standardized manner. When both trained and evaluated on point clouds with identical characteristics, the learning-based models consistently produce higher-quality surfaces compared to their traditional counterparts -- even in scenarios involving novel shape categories. However, traditional methods demonstrate greater resilience to the diverse anomalies commonly found in real-world 3D acquisitions. For the benefit of the research community, we make our code and datasets available, inviting further enhancements to learning-based surface reconstruction. This can be accessed at https://github.com/raphaelsulzer/dsr-benchmark .
Authors: Mahdi Ghafourian, Bilgesu Sumer, Ruben Vera-Rodriguez, Julian Fierrez, Ruben Tolosana, Aythami Moralez, Els Kindt
Abstract: Biometric recognition, as a unique, hard-to-forge, and efficient way of identification and verification, has become an indispensable part of the current digital world. The fast evolution of this technology has been a strong incentive for integrating it into many applications. Meanwhile, blockchain, the very attractive decentralized ledger technology, has been widely embraced by both research and industry in the past years and is being increasingly deployed nowadays in many different applications, such as money transfer, IoT, healthcare, or logistics. Recently, researchers have started to speculate about the pros and cons and the best applications when these two technologies cross paths. This paper provides a survey of the technical literature on the combination of blockchain and biometrics and includes a first legal analysis of this integration to shed light on challenges and potentials. While this combination is still in its infancy and a growing body of literature discusses specific blockchain applications and solutions in an advanced technological set-up, this paper presents a holistic understanding of blockchain's applicability in the biometric sector. This study demonstrates that combining blockchain and biometrics would be beneficial for novel applications in biometrics such as PKI mechanisms, distributed trusted services, and identity management. However, blockchain networks at their current stage are not efficient and economical for real-time applications. From a legal point of view, the allocation of accountability remains a main issue, while other difficulties remain, such as conducting a proper Data Protection Impact Assessment. Finally, it supplies technical and legal recommendations to reap the benefits and mitigate the risks of the combination.
Authors: Alexander Hepburn, Valero Laparra, Ra\'ul Santos-Rodriguez, Jes\'us Malo
Abstract: Previously, Barlow and Attneave hypothesised a link between biological vision and information maximisation. Following Shannon, information was defined using the probability of natural images. Several physiological and psychophysical phenomena have been derived from principles like info-max, efficient coding, or optimal denoising. However, it remains unclear how this link is expressed in mathematical terms from image probability. Classical derivations were subject to strong assumptions on the probability models and on the behaviour of the sensors. Moreover, the direct evaluation of the hypothesis was limited by the inability of classical image models to deliver accurate estimates of the probability. Here, we directly evaluate image probabilities using a generative model for natural images, and analyse how probability-related factors can be combined to predict the sensitivity of state-of-the-art subjective image quality metrics, a proxy for human perception. We use information theory and regression analysis to find a simple model that, by combining just two probability-related factors, achieves a 0.77 correlation with subjective metrics. This probability-based model is validated in two ways: through direct comparison with the opinion of real observers in a subjective quality experiment, and by reproducing basic trends of classical psychophysical facts such as the Contrast Sensitivity Function, the Weber law, and contrast masking.
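The abstract does not name the two probability-related factors, so the following sketch only illustrates the general recipe: fit a simple regression that combines two probability-derived features to predict perceptual sensitivity. The factors and the data below are synthetic placeholders, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical probability-derived factors for 200 image pairs; the actual
# factors used in the paper are not specified in this abstract.
rng = np.random.default_rng(0)
log_p_original = rng.normal(size=200)   # e.g. log p(x) under a generative model
log_p_ratio = rng.normal(size=200)      # e.g. log p(x) - log p(x_distorted)
sensitivity = 0.6 * log_p_ratio + 0.2 * log_p_original + rng.normal(0.0, 0.1, 200)

# Two-factor linear model and its correlation with the (synthetic) target.
X = np.column_stack([log_p_original, log_p_ratio])
model = LinearRegression().fit(X, sensitivity)
r = np.corrcoef(model.predict(X), sensitivity)[0, 1]
print(f"correlation of the 2-factor model: {r:.2f}")
```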
Authors: Joel Honkamaa, Pekka Marttinen
Abstract: Deep learning has emerged as a strong alternative for classical iterative methods for deformable medical image registration, where the goal is to find a mapping between the coordinate systems of two images. Popular classical image registration methods enforce the useful inductive biases of symmetricity, inverse consistency, and topology preservation by construction. However, while many deep learning registration methods encourage these properties via loss functions, no earlier methods enforce all of them by construction. Here, we propose a novel registration architecture based on extracting multi-resolution feature representations which is by construction symmetric, inverse consistent, and topology preserving. We also develop an implicit layer for memory efficient inversion of the deformation fields. Our method achieves state-of-the-art registration accuracy on three datasets. The code is available at https://github.com/honkamj/SITReg.
Authors: Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin C. K. Chan, Ziwei Liu
Abstract: Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images, and existing inversion methods mainly focus on capturing object appearances (i.e., the "look"). However, how to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose the Relation Inversion task, which aims to learn a specific relation (represented as "relation prompt") from exemplar images. Specifically, we learn a relation prompt with a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. To tackle the Relation Inversion task, we propose the ReVersion Framework. Specifically, we propose a novel "relation-steering contrastive learning" scheme to steer the relation prompt towards relation-dense regions, and disentangle it away from object appearances. We further devise "relation-focal importance sampling" to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations. Our proposed task and method could be good inspirations for future research in various domains like generative inversion, few-shot learning, and visual relation detection.
Authors: Jiahao Zhang, Anoop Cherian, Yanbin Liu, Yizhak Ben-Shabat, Cristian Rodriguez, Stephen Gould
Abstract: Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos show the assembly actions being enacted in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset: IAW, for Ikea assembly in the wild, consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals, annotated for their ground truth alignments. We define two tasks on this dataset: first, nearest neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps and the segments for each video. Extensive experiments on IAW demonstrate the superior performance of our approach against alternatives.
Authors: Leming Guo, Wanli Xue, Shengyong Chen
Abstract: Continuous sign language recognition (CSLR) aims to recognize signs in untrimmed sign language videos and transcribe them into textual glosses. A key challenge of CSLR is achieving effective cross-modality alignment between video and gloss sequences to enhance video representation. However, current cross-modality alignment paradigms often neglect the role of textual grammar in guiding the video representation to learn global temporal context, which adversely affects recognition performance. To tackle this limitation, we propose a Denoising-Contrastive Alignment (DCA) paradigm. DCA creatively leverages textual grammar to enhance video representations through two complementary approaches: modeling the instance correspondence between signs and glosses from a discrimination perspective and aligning their global context from a generative perspective. Specifically, DCA accomplishes flexible instance-level correspondence between signs and glosses using a contrastive loss. Building on this, DCA models global context alignment between the video and gloss sequences by denoising the gloss representation from noise, guided by the video representation. Additionally, DCA introduces gradient modulation to optimize the alignment and recognition gradients, ensuring a more effective learning process. By integrating gloss-wise and global context knowledge, DCA significantly enhances video representations for CSLR tasks. Experimental results across public benchmarks validate the effectiveness of DCA and confirm the feasibility of its video representation enhancement.
Authors: Henik Riedel, Robert Steven Lorenzen, Clemens H\"ubler
Abstract: As infrastructure ages, the need for efficient monitoring methods becomes increasingly critical. Bridge Weigh-In-Motion (BWIM) systems are crucial for cost-effective determination of loads and, consequently, the residual service life of road and railway infrastructure. However, conventional BWIM systems require additional sensors for axle detection, which must be installed in potentially inaccessible locations or places that interfere with bridge operation. This study presents a novel approach for real-time detection of train axles using sensors arbitrarily placed on bridges, providing an alternative to dedicated axle detectors. The developed Virtual Axle Detector with Enhanced Receptive Field (VADER) has been validated on a single-track railway bridge using only acceleration measurements, detecting 99.9% of axles with a spatial error of 3.69cm. Using raw data as input outperformed the state-of-the-art spectrogram-based method in both speed and memory usage by 99%, thereby making real-time application feasible for the first time. Additionally, we introduce the Maximum Receptive Field (MRF) rule, a novel approach to optimise hyperparameters of Convolutional Neural Networks (CNNs) based on the size of objects. In this context, the object size relates to the fundamental frequency of a bridge. The MRF rule effectively narrows the hyperparameter search space, overcoming the need for extensive hyperparameter tuning. Since the MRF rule can theoretically be applied to all unstructured data, it could have implications for a wide range of deep learning problems, from earthquake prediction to object recognition.
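As an illustration of reasoning about receptive fields relative to a bridge's fundamental frequency, here is a small sketch that computes the receptive field of a 1D convolutional stack and checks it against the number of samples in one vibration period. The specific rule in the paper (MRF) may be formulated differently; this only conveys the general idea.

```python
def receptive_field(layers):
    """Receptive field of a stack of 1D conv layers given (kernel, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

def covers_one_period(layers, fs_hz, f0_hz):
    """Toy check in the spirit of a maximum-receptive-field rule: the network
    should see at least one full period of the bridge's fundamental frequency."""
    samples_per_period = fs_hz / f0_hz
    return receptive_field(layers) >= samples_per_period

# Example: acceleration sampled at 600 Hz, bridge fundamental frequency ~4 Hz.
conv_stack = [(7, 1), (7, 2), (7, 2), (7, 2)]
print(receptive_field(conv_stack), covers_one_period(conv_stack, fs_hz=600, f0_hz=4.0))
```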
Authors: Nikhil Raghuraman, Adam W. Harley, Leonidas Guibas
Abstract: Current machine learning methods struggle to solve Bongard problems, which are a type of IQ test that requires deriving an abstract "concept" from a set of positive and negative "support" images, and then classifying whether or not a new query image depicts the key concept. On Bongard-HOI, a benchmark for natural-image Bongard problems, most existing methods have reached at best 69% accuracy (where chance is 50%). Low accuracy is often attributed to neural nets' lack of ability to find human-like symbolic rules. In this work, we point out that many existing methods are forfeiting accuracy due to a much simpler problem: they do not adapt image features given information contained in the support set as a whole, and rely instead on information extracted from individual supports. This is a critical issue, because the "key concept" in a typical Bongard problem can often only be distinguished using multiple positives and multiple negatives. We explore simple methods to incorporate this context and show substantial gains over prior works, leading to new state-of-the-art accuracy on Bongard-LOGO (75.3%) and Bongard-HOI (76.4%) compared to methods with equivalent vision backbone architectures and strong performance on the original Bongard problem set (60.8%).
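The paper's point, adapting features using the support set as a whole rather than individual supports, can be illustrated with a toy sketch: re-center query and support features on the support-set mean before prototype comparison. This is not the authors' method, just a minimal example of set-level conditioning.

```python
import torch
import torch.nn.functional as F

def support_conditioned_predict(query, pos_support, neg_support):
    """Classify a query by prototypes after re-centering all features on the
    support-set mean, so the representation adapts to the set as a whole (sketch)."""
    support = torch.cat([pos_support, neg_support], dim=0)
    center = support.mean(dim=0, keepdim=True)           # context from the whole set
    q, p, n = query - center, pos_support - center, neg_support - center
    sim_pos = F.cosine_similarity(q, p.mean(0, keepdim=True))
    sim_neg = F.cosine_similarity(q, n.mean(0, keepdim=True))
    return (sim_pos > sim_neg).long()                     # 1 = query depicts the concept

# Six positive and six negative support embeddings, one query (random stand-ins).
pred = support_conditioned_predict(torch.randn(1, 128),
                                   torch.randn(6, 128), torch.randn(6, 128))
```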
Authors: Zhaozheng Chen, Qianru Sun
Abstract: The rapid development of deep learning has driven significant progress in image semantic segmentation - a fundamental task in computer vision. Semantic segmentation algorithms often depend on the availability of pixel-level labels (i.e., masks of objects), which are expensive, time-consuming, and labor-intensive to obtain. Weakly-supervised semantic segmentation (WSSS) is an effective solution to avoid such labeling. It utilizes only partial or incomplete annotations and provides a cost-effective alternative to fully-supervised semantic segmentation. In this article, our focus is on WSSS with image-level labels, which is the most challenging form of WSSS. Our work has two parts. First, we conduct a comprehensive survey on traditional methods, primarily focusing on those presented at premier research conferences. We categorize them into four groups based on where their methods operate: pixel-wise, image-wise, cross-image, and external data. Second, we investigate the applicability of visual foundation models, such as the Segment Anything Model (SAM), in the context of WSSS. We scrutinize SAM in two intriguing scenarios: text prompting and zero-shot learning. We provide insights into the potential and challenges of deploying visual foundation models for WSSS, facilitating future developments in this exciting research area.
Authors: Mahdi Ghafourian, Julian Fierrez, Ruben Vera-Rodriguez, Ruben Tolosana, Aythami Morales
Abstract: Federated Learning (FL) is a machine learning paradigm for conducting collaborative learning among clients on a joint model. The primary goal is to share clients' local training parameters with an integrating server while preserving their privacy. This method makes it possible to exploit the potential of massive mobile users' data for the benefit of machine learning models' performance while keeping sensitive data on local devices. On the downside, FL raises security and privacy concerns that have just started to be studied. To address some of the key threats in FL, researchers have proposed to use secure aggregation methods (e.g. homomorphic encryption, secure multiparty computation, etc.). These solutions improve some security and privacy metrics, but at the same time bring about other serious threats such as poisoning attacks, backdoor attacks, and free-running attacks. This paper proposes a new defense method against poisoning attacks in FL called SaFL (Sybil-aware Federated Learning) that minimizes the effect of sybils with a novel time-variant aggregation scheme.
Authors: Shadab Ahamed, Yixi Xu, Sara Kurkowska, Claire Gowdy, Joo H. O, Ingrid Bloise, Don Wilson, Patrick Martineau, Fran\c{c}ois B\'enard, Fereshteh Yousefirizi, Rahul Dodhia, Juan M. Lavista, William B. Weeks, Carlos F. Uribe, Arman Rahmim
Abstract: This study addresses critical gaps in automated lymphoma segmentation from PET/CT images, focusing on issues often overlooked in existing literature. While deep learning has been applied for lymphoma lesion segmentation, few studies incorporate out-of-distribution testing, raising concerns about model generalizability across diverse imaging conditions and patient populations. We highlight the need to compare model performance with expert human annotators, including intra- and inter-observer variability, to understand task difficulty better. Most approaches focus on overall segmentation accuracy but overlook lesion-specific metrics important for precise lesion detection and disease quantification. To address these gaps, we propose a clinically-relevant framework for evaluating deep neural networks. Using this lesion-specific evaluation, we assess the performance of four deep segmentation networks (ResUNet, SegResNet, DynUNet, and SwinUNETR) across 611 cases from multi-institutional datasets, covering various lymphoma subtypes and lesion characteristics. Beyond standard metrics like the Dice similarity coefficient (DSC), we evaluate clinical lesion measures and their prediction errors. We also introduce detection criteria for lesion localization and propose a new detection Criterion 3 based on metabolic characteristics. We show that networks perform better on large, intense lesions with higher metabolic activity. Finally, we compare network performance to expert human observers via intra- and inter-observer variability analyses, demonstrating that network errors closely resemble those made by experts. Some small, faint lesions remain challenging for both humans and networks. This study aims to improve automated lesion segmentation's clinical relevance, supporting better treatment decisions for lymphoma patients. The code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn
URLs: https://github.com/microsoft/lymphoma-segmentation-dnn
Authors: Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan, Zongyuan Ge
Abstract: Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. These efforts are constrained by their inability to perform effectively across varied granularity of categories. To tackle this issue, we propose a novel framework (HGCLIP) that effectively combines CLIP with a deeper exploitation of the Hierarchical class structure via Graph representation learning. We explore constructing the class hierarchy into a graph, with its nodes representing the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on 11 diverse visual recognition benchmarks. Our codes are fully available at https://github.com/richard-peng-xia/HGCLIP.
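A hedged sketch of the hierarchy-as-graph idea: build an adjacency matrix from parent-child links and propagate per-class text features with a single graph convolution. The paper's graph encoder and prototype-based image branch are certainly more elaborate; the hierarchy, dimensions, and one-layer encoder below are illustrative.

```python
import torch
import torch.nn as nn

def hierarchy_adjacency(parents, num_classes):
    """Row-normalized adjacency with self-loops built from parent->child links."""
    A = torch.eye(num_classes)
    for child, parent in parents.items():
        A[child, parent] = A[parent, child] = 1.0
    return A / A.sum(dim=1, keepdim=True)

class GraphEncoder(nn.Module):
    """One-layer graph convolution that injects hierarchy structure into
    per-class text (or prototype) features."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, A):
        return torch.relu(A @ self.proj(feats)) + feats  # residual keeps CLIP semantics

# Tiny 5-class hierarchy: classes 1-2 under 0, classes 3-4 under 2 (illustrative).
parents = {1: 0, 2: 0, 3: 2, 4: 2}
A = hierarchy_adjacency(parents, num_classes=5)
text_feats = torch.randn(5, 512)            # e.g. CLIP text embeddings per class
hier_feats = GraphEncoder(512)(text_feats, A)
```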
Authors: Yingshu Chen, Guocheng Shao, Ka Chun Shum, Binh-Son Hua, Sai-Kit Yeung
Abstract: Modern artificial intelligence offers a novel and transformative approach to creating digital art across diverse styles and modalities like images, videos and 3D data, unleashing the power of creativity and revolutionizing the way that we perceive and interact with visual content. This paper reports on recent advances in stylized 3D asset creation and manipulation with the expressive power of neural networks. We establish a taxonomy for neural stylization, considering crucial design choices such as scene representation, guidance data, optimization strategies, and output styles. Building on such taxonomy, our survey first revisits the background of neural stylization on 2D images, and then presents in-depth discussions on recent neural stylization methods for 3D data, accompanied by a benchmark evaluating selected mesh and neural field stylization methods. Based on the insights gained from the survey, we highlight the practical significance, open challenges, future research, and potential impacts of neural stylization, which facilitates researchers and practitioners to navigate the rapidly evolving landscape of 3D content creation using modern artificial intelligence.
Authors: Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jerome Revaud
Abstract: Multi-view stereo reconstruction (MVS) in the wild requires first estimating the camera parameters, e.g. intrinsic and extrinsic parameters. These are usually tedious and cumbersome to obtain, yet they are mandatory to triangulate corresponding pixels in 3D space, which is the core of all best performing MVS algorithms. In this work, we take an opposite stance and introduce DUSt3R, a radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections, i.e. operating without prior information about camera calibration nor viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. In the case where more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. We base our network architecture on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our formulation directly provides a 3D model of the scene as well as depth information, but interestingly, we can seamlessly recover pixel matches as well as relative and absolute camera poses from it. Exhaustive experiments on all these tasks showcase that the proposed DUSt3R can unify various 3D vision tasks and set new SoTAs on monocular/multi-view depth estimation as well as relative pose estimation. In summary, DUSt3R makes many geometric 3D vision tasks easy.
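To make the pointmap-regression formulation concrete, here is a hedged sketch of a confidence-weighted pointmap loss in the first camera's frame, with a simple scale normalization. It follows the spirit of the description above; the exact loss, normalization, and confidence parameterization in DUSt3R may differ.

```python
import torch

def pointmap_loss(pred_pts, gt_pts, conf, alpha=0.2):
    """Confidence-weighted pointmap regression (sketch). pred_pts/gt_pts:
    (H, W, 3) 3D points expressed in the first camera's frame; conf: (H, W) > 0."""
    # Normalize by the mean distance to the origin to remove scale ambiguity.
    pred = pred_pts / pred_pts.norm(dim=-1).mean().clamp(min=1e-6)
    gt = gt_pts / gt_pts.norm(dim=-1).mean().clamp(min=1e-6)
    err = (pred - gt).norm(dim=-1)                         # per-pixel Euclidean error
    return (conf * err - alpha * torch.log(conf)).mean()   # confident pixels weigh more

# Illustrative call with random tensors standing in for network outputs.
loss = pointmap_loss(torch.rand(32, 32, 3), torch.rand(32, 32, 3),
                     torch.rand(32, 32) + 0.5)
```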
Authors: Chenghao Li, Chaoning Zhang, Boheng Zeng, Yi Lu, Pengbo Shi, Qingzi Chen, Jirui Liu, Lingyun Zhu, Yang Yang, Heng Tao Shen
Abstract: Attention mechanisms have significantly advanced visual models by capturing global context effectively. However, their reliance on large-scale datasets and substantial computational resources poses challenges in data-scarce and resource-constrained scenarios. Moreover, traditional self-attention mechanisms lack inherent spatial inductive biases, making them suboptimal for modeling local features critical to tasks involving smaller datasets. In this work, we introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution. This design unifies the strengths of convolutional architectures (locality and translation invariance) with the global context modeling capabilities of self-attention. By embedding these properties into a computationally efficient framework, LKCA addresses key limitations of traditional attention mechanisms. The proposed LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings. Experimental results on CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet demonstrate its ability to excel in image classification, outperforming conventional attention mechanisms and vision transformers in compact model settings. These findings highlight the effectiveness of LKCA in bridging local and global feature modeling, offering a practical and robust solution for real-world applications with limited data and resources.
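A minimal sketch of the idea of replacing the attention operation with a single large-kernel (depthwise) convolution inside an attention-style residual block. The kernel size, normalization, and projection choices are assumptions for illustration, not the paper's exact LKCA block.

```python
import torch
import torch.nn as nn

class LargeKernelConvAttention(nn.Module):
    """Token mixing via one large-kernel depthwise convolution, as a
    convolutional stand-in for self-attention (sketch)."""
    def __init__(self, dim, kernel_size=13):
        super().__init__()
        self.norm = nn.BatchNorm2d(dim)
        self.mix = nn.Conv2d(dim, dim, kernel_size,
                             padding=kernel_size // 2, groups=dim)  # depthwise, large kernel
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)              # pointwise channel mixing

    def forward(self, x):                                # x: (B, C, H, W)
        return x + self.proj(self.mix(self.norm(x)))     # residual, attention-style block

block = LargeKernelConvAttention(dim=64)
out = block(torch.randn(2, 64, 32, 32))
```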
Authors: Dong Zhang, Pingcheng Dong, Xinting Hu, Long Chen, Kwang-Ting Cheng
Abstract: It has been revealed that small efficient dense image prediction (EDIP) models, trained using the knowledge distillation (KD) framework, encounter two key challenges: maintaining boundary region completeness and preserving target region connectivity, despite their favorable capacity to recognize main object regions. In this work, we propose a complementary boundary and context distillation (BCD) method within the KD framework for EDIPs, which facilitates the targeted knowledge transfer from large accurate teacher models to compact efficient student models. Specifically, the boundary distillation component focuses on extracting explicit object-level semantic boundaries from the hierarchical feature maps of the backbone network to enhance the student model's mask quality in boundary regions. Concurrently, the context distillation component leverages self-relations as a bridge to transfer implicit pixel-level contexts from the teacher model to the student model, ensuring strong connectivity in target regions. Our proposed BCD method is specifically designed for EDIP tasks and is characterized by its simplicity and efficiency. Extensive experimental results across semantic segmentation, object detection, and instance segmentation on various representative datasets demonstrate that our method can outperform existing methods without requiring extra supervision or incurring increased inference costs, resulting in well-defined object boundaries and smooth connecting regions.
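The context-distillation component, which uses self-relations as the bridge, can be sketched as matching pairwise self-similarity matrices of teacher and student feature maps. The boundary branch and the paper's exact losses are omitted; this conveys only the self-relation idea.

```python
import torch
import torch.nn.functional as F

def self_relation(feat):
    """Pairwise cosine self-similarity over spatial positions.
    feat: (B, C, H, W) -> (B, H*W, H*W)."""
    f = F.normalize(feat.flatten(2), dim=1)      # (B, C, HW), unit-norm channel vectors
    return torch.bmm(f.transpose(1, 2), f)       # (B, HW, HW) affinity matrix

def context_distillation_loss(student_feat, teacher_feat):
    """Match student and teacher self-relations instead of raw features,
    transferring pixel-level context (sketch of the context branch only)."""
    return F.mse_loss(self_relation(student_feat), self_relation(teacher_feat))

# Random stand-ins for student and teacher feature maps at the same spatial size.
loss = context_distillation_loss(torch.randn(2, 64, 16, 16),
                                 torch.randn(2, 64, 16, 16))
```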
Authors: Lingfeng He, De Cheng, Nannan Wang, Xinbo Gao
Abstract: Unsupervised visible-infrared person re-identification (USL-VI-ReID) endeavors to retrieve pedestrian images of the same identity from different modalities without annotations. While prior works focus on establishing cross-modality pseudo-label associations to bridge the modality gap, they ignore maintaining the instance-level homogeneous and heterogeneous consistency between the feature space and the pseudo-label space, resulting in coarse associations. In response, we introduce a Modality-Unified Label Transfer (MULT) module that simultaneously accounts for both homogeneous and heterogeneous fine-grained instance-level structures, yielding high-quality cross-modality label associations. It models both homogeneous and heterogeneous affinities, leveraging them to quantify the inconsistency between the pseudo-label space and the feature space, and subsequently minimizing it. The proposed MULT ensures that the generated pseudo-labels maintain alignment across modalities while upholding structural consistency within each modality. Additionally, a straightforward plug-and-play Online Cross-memory Label Refinement (OCLR) module is proposed to further mitigate the side effects of noisy pseudo-labels while simultaneously aligning different modalities, coupled with an Alternative Modality-Invariant Representation Learning (AMIRL) framework. Experiments demonstrate that our proposed method outperforms existing state-of-the-art USL-VI-ReID methods, highlighting the superiority of our MULT in comparison to other cross-modality association methods. Code is available at https://github.com/FranklinLingfeng/code_for_MULT.
Authors: Georgii Stanishevskii, Jakub Steczkiewicz, Tomasz Szczepanik, S{\l}awomir Tadeja, Jacek Tabor, Przemys{\l}aw Spurek
Abstract: Numerous emerging deep-learning techniques have had a substantial impact on computer graphics. Among the most promising breakthroughs are the rise of Neural Radiance Fields (NeRFs) and Gaussian Splatting (GS). NeRFs encode the object's shape and color in neural network weights using a handful of images with known camera positions to generate novel views. In contrast, GS provides accelerated training and inference without a decrease in rendering quality by encoding the object's characteristics in a collection of Gaussian distributions. These two techniques have found many use cases in spatial computing and other domains. On the other hand, the emergence of deepfake methods has sparked considerable controversy. Deepfakes refer to artificial intelligence-generated videos that closely mimic authentic footage. Using generative models, they can modify facial features, enabling the creation of altered identities or expressions that look remarkably like a real person. Despite these controversies, deepfakes of sufficient quality can offer a next-generation solution for avatar creation and gaming. To that end, we show how to combine all these emerging technologies to obtain a more plausible outcome. Our ImplicitDeepfake uses a classical deepfake algorithm to modify all training images separately and then trains NeRF and GS on the modified faces. This simple strategy can produce plausible 3D deepfake-based avatars.
Authors: Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Didier Stricker
Abstract: 3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle in interpreting complex linguistic queries, particularly with descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, particularly excelling by a large margin in categories that require viewpoint-dependent descriptions.
Authors: Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen
Abstract: Extensive pre-training with large data is indispensable for downstream geometry and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pretraining, recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, encompassing the necessity of the multi-step stochastic diffusion mechanism, training strategy, inference ensemble strategy, and fine-tuning data quality. In this work, we conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometry perception tasks. 2) The stochastic nature of diffusion models has a slightly negative impact on deterministic visual perception tasks. 3) Apart from fine-tuning the diffusion model with only latent space supervision, task-specific image-level supervision is beneficial to enhance fine-grained details. These observations culminate in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks. Different from previous multi-step methods, our paradigm has a much faster inference speed, and can be seamlessly integrated with customized perception decoders and loss functions for image-level supervision, which is critical to improving the fine-grained details of predictions. Comprehensive experiments on diverse dense visual perception tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, demonstrate the remarkable adaptability and effectiveness of our proposed method.
Authors: Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee
Abstract: End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. Moreover, in zero-shot scenarios on the COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: \url{https://github.com/om-ai-lab/OmDet}
Authors: Abu Zahid Bin Aziz, Mokshagna Sai Teja Karanam, Tushar Kataria, Shireen Y. Elhabian
Abstract: Transformers have emerged as the state-of-the-art architecture in medical image registration, outperforming convolutional neural networks (CNNs) by addressing their limited receptive fields and overcoming gradient instability in deeper models. Despite their success, transformer-based models require substantial resources for training, including data, memory, and computational power, which may restrict their applicability for end users with limited resources. In particular, existing transformer-based 3D image registration architectures face two critical gaps that challenge their efficiency and effectiveness. Firstly, although window-based attention mechanisms reduce the quadratic complexity of full attention by focusing on local regions, they often struggle to effectively integrate both local and global information. Secondly, the granularity of tokenization, a crucial factor in registration accuracy, presents a performance trade-off: smaller voxel-size tokens enhance detail capture but come with increased computational complexity, higher memory usage, and a greater risk of overfitting. We present \name, a transformer-based architecture for unsupervised 3D image registration that balances local and global attention in 3D volumes through a plane-based attention mechanism and employs a Hi-Res tokenization strategy with merging operations, thus capturing finer details without compromising computational efficiency. Notably, \name sets a new benchmark for performance on the OASIS dataset with 16-27x fewer parameters. https://github.com/MedVIC-Lab/Efficient_Morph_Registration
URLs: https://github.com/MedVIC-Lab/Efficient_Morph_Registration
Authors: Botao Ren, Botian Xu, Xue Yang, Yifan Pu, Jingyi Wang, Zhidong Deng
Abstract: In most modern object detection pipelines, the detection proposals are processed independently given the feature map. Therefore, they overlook the underlying relationships between objects and the surrounding background, which could have provided additional context for accurate detection. Because aerial imagery is almost orthographic, the spatial relations in image space closely align with those in the physical world, and inter-object and object-background relationships become particularly significant. To address this oversight, we propose a framework that leverages the strengths of Transformer-based models and Contrastive Language-Image Pre-training (CLIP) features to capture such relationships. Specifically, building on two-stage detectors, we treat Region of Interest (RoI) proposals as tokens, accompanied by CLIP tokens obtained from multi-level image segments. These tokens are then passed through a Transformer encoder, where specific spatial and geometric relations are incorporated into the attention weights, which are adaptively modulated and regularized. Additionally, we introduce self-supervised constraints on the CLIP tokens to ensure consistency. Extensive experiments on three benchmark datasets demonstrate that our approach achieves consistent improvements, setting new state-of-the-art results with increases of 1.37 mAP$_{50}$ on DOTA-v1.0, 5.30 mAP$_{50}$ on DOTA-v1.5, 2.30 mAP$_{50}$ on DOTA-v2.0 and 3.23 mAP$_{50}$ on DIOR-R.
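A hedged sketch of injecting spatial relations into the attention weights over RoI tokens: bias the attention logits with pairwise box-center distances so nearby proposals interact more strongly. The geometric terms, modulation, and regularization in the paper are richer; the bias form and scale here are illustrative.

```python
import torch
import torch.nn.functional as F

def relation_biased_attention(tokens, centers, scale=5.0):
    """Self-attention over RoI tokens with an additive bias from pairwise
    box-center distances, so nearby proposals attend to each other more.
    tokens: (N, D) RoI features; centers: (N, 2) box centers in [0, 1]."""
    d = tokens.shape[-1]
    logits = tokens @ tokens.T / d ** 0.5            # standard dot-product logits
    dist = torch.cdist(centers, centers)             # pairwise spatial distances
    attn = F.softmax(logits - scale * dist, dim=-1)  # closer pairs get larger weights
    return attn @ tokens

# Ten proposals with 256-d features and normalized box centers (random stand-ins).
out = relation_biased_attention(torch.randn(10, 256), torch.rand(10, 2))
```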
Authors: Jiaxiu Jiang, Yabo Zhang, Kailai Feng, Xiaohe Wu, Wenbo Li, Renjing Pei, Fan Li, Wangmeng Zuo
Abstract: Customized text-to-image generation, which synthesizes images based on user-specified concepts, has made significant progress in handling individual concepts. However, when extended to multiple concepts, existing methods often struggle with properly integrating different models and avoiding the unintended blending of characteristics from distinct concepts. In this paper, we propose MC$^2$, a novel approach for multi-concept customization that enhances flexibility and fidelity through inference-time optimization. MC$^2$ enables the integration of multiple single-concept models with heterogeneous architectures. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts while minimizing interference between concepts. Extensive experiments demonstrate that MC$^2$ outperforms training-based methods in terms of prompt-reference alignment. Furthermore, MC$^2$ can be seamlessly applied to text-to-image generation, providing robust compositional capabilities. To facilitate the evaluation of multi-concept customization, we also introduce a new benchmark, MC++. The code will be publicly available at https://github.com/JIANGJiaXiu/MC-2.
Authors: Sourajit Saha, Tejas Gokhale
Abstract: Downsampling operators break the shift invariance of convolutional neural networks (CNNs), and this affects the robustness of features learned by CNNs when dealing with even small pixel-level shifts. Through a large-scale correlation analysis framework, we study shift invariance of CNNs by inspecting existing downsampling operators in terms of their maximum-sampling bias (MSB), and find that MSB is negatively correlated with shift invariance. Based on this crucial insight, we propose a learnable pooling operator called Translation Invariant Polyphase Sampling (TIPS) and two regularizations on the intermediate feature maps of TIPS to reduce MSB and learn translation-invariant representations. TIPS can be integrated into any CNN and can be trained end-to-end with marginal computational overhead. Our experiments demonstrate that TIPS results in consistent performance gains in terms of accuracy, shift consistency, and shift fidelity on multiple benchmarks for image classification and semantic segmentation compared to previous methods and also leads to improvements in adversarial and distributional robustness. TIPS results in the lowest MSB compared to all previous methods, thus explaining our strong empirical results.
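Polyphase downsampling keeps one of the shifted sub-grids of a feature map instead of always taking the fixed top-left grid, which is what makes it shift-robust. Below is a hedged sketch of a learnable variant that softly weights the four phases; the actual TIPS operator and its two regularizations differ in detail.

```python
import torch
import torch.nn as nn

class SoftPolyphasePool(nn.Module):
    """Downsample by 2 via polyphase decomposition: the four shifted sub-grids
    are combined with softmax weights predicted from their mean activations
    (sketch of a learnable, shift-robust pooling)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                               # x: (B, C, H, W), H and W even
        phases = torch.stack([x[:, :, i::2, j::2]
                              for i in (0, 1) for j in (0, 1)], dim=1)  # (B, 4, C, H/2, W/2)
        pooled = phases.mean(dim=(3, 4))                # (B, 4, C) phase descriptors
        w = torch.softmax(self.score(pooled), dim=1)    # (B, 4, 1) phase weights
        return (w[..., None, None] * phases).sum(dim=1)  # weighted phase combination

pool = SoftPolyphasePool(dim=32)
y = pool(torch.randn(2, 32, 16, 16))                    # -> (2, 32, 8, 8)
```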
Authors: Yuetan Chu, Gongning Luo, Longxi Zhou, Shaodong Cao, Guolin Ma, Xianglin Meng, Juexiao Zhou, Changchun Yang, Dexuan Xie, Dan Mu, Ricardo Henao, Gianluca Setti, Xigang Xiao, Lianming Wu, Zhaowen Qiu, Xin Gao
Abstract: Pulmonary artery-vein segmentation is crucial for disease diagnosis and surgical planning and is traditionally achieved by Computed Tomography Pulmonary Angiography (CTPA). However, concerns regarding adverse health effects from contrast agents used in CTPA have constrained its clinical utility. In contrast, identifying arteries and veins using non-contrast CT, a conventional and low-cost clinical examination routine, has long been considered impossible. Here we propose a High-abundant Pulmonary Artery-vein Segmentation (HiPaS) framework achieving accurate artery-vein segmentation on both non-contrast CT and CTPA across various spatial resolutions. HiPaS first performs spatial normalization on raw CT volumes via a super-resolution module, and then iteratively achieves segmentation results at different branch levels by utilizing the lower-level vessel segmentation as a prior for higher-level vessel segmentation. We trained and validated HiPaS on our established multi-centric dataset comprising 1,073 CT volumes with meticulous manual annotations. Both quantitative experiments and clinical evaluation demonstrated the superior performance of HiPaS, achieving an average dice score of 91.8% and a sensitivity of 98.0%. Further experiments showed the non-inferiority of HiPaS segmentation on non-contrast CT compared to segmentation on CTPA. Employing HiPaS, we conducted an anatomical study of the pulmonary vasculature on 11,784 participants in China (six sites), discovering a new association of pulmonary vessel anatomy with sex, age, and disease states: after controlling for lung volumes, vessel abundance is significantly higher in females than in males, decreases slightly with age, and is also influenced by certain diseases.
Authors: Yunfei Li, Yuezun Li, Xin Wang, Baoyuan Wu, Jiaran Zhou, Junyu Dong
Abstract: Sequential DeepFake detection is an emerging task that predicts the manipulation sequence in order. Existing methods typically formulate it as an image-to-sequence problem, employing conventional Transformer architectures. However, these methods lack dedicated design and consequently result in limited performance. As such, this paper describes a new Transformer design, called TSOM, by exploring three perspectives: Texture, Shape, and Order of Manipulations. Our method features four major improvements: (1) we describe a new texture-aware branch that effectively captures subtle manipulation traces with a Diversiform Pixel Difference Attention module. (2) Then we introduce a Multi-source Cross-attention module to seek deep correlations among spatial and sequential features, enabling effective modeling of complex manipulation traces. (3) To further enhance the cross-attention, we describe a Shape-guided Gaussian mapping strategy, providing initial priors of the manipulation shape. (4) Finally, observing that the subsequent manipulation in a sequence may influence traces left in the preceding one, we intriguingly invert the prediction order from forward to backward, leading to notable gains as expected. Extensive experimental results demonstrate that our method outperforms others by a large margin, highlighting the superiority of our method.
Authors: Qiwen Deng, Yangcen Liu, Wen Li, Guoqing Wang
Abstract: Given a source portrait, the automatic human body reshaping task aims at editing it to an aesthetic body shape. As the technology has been widely used in media, several methods have been proposed, mainly focusing on generating optical flow to warp the body shape. However, those previous works only consider the local transformation of different body parts (arms, torso, and legs), ignoring the global affinity, which limits their capacity to ensure consistency and quality across the entire body. In this paper, we propose a novel Adaptive Affinity-Graph Network (AAGN), which extracts the global affinity between different body parts to enhance the quality of the generated optical flow. Specifically, our AAGN primarily introduces the following designs: (1) we propose an Adaptive Affinity-Graph (AAG) Block that leverages the characteristics of a fully connected graph. AAG represents different body parts as nodes in an adaptive fully connected graph and captures all the affinities between nodes to obtain a global affinity map. This design better improves consistency between body parts. (2) Besides, since high-frequency details are crucial for photo aesthetics, a Body Shape Discriminator (BSD) is designed to extract information from both the high-frequency and spatial domains. In particular, an SRM filter is utilized to extract high-frequency details, which are combined with spatial features as input to the BSD. With this design, BSD guides the Flow Generator (FG) to pay attention to various fine details rather than rigid pixel-level fitting. Extensive experiments conducted on the BR-5K dataset demonstrate that our framework significantly enhances the aesthetic appeal of reshaped photos, surpassing all previous work and achieving state-of-the-art results on all evaluation metrics.
Authors: Jiamin Wu, Kenkun Liu, Han Gao, Xiaoke Jiang, Lei Zhang
Abstract: Recently, Gaussian splatting has demonstrated significant success in novel view synthesis. Current methods often regress Gaussians with pixel or point cloud correspondence, linking each Gaussian with a pixel or a 3D point. This leads to a redundancy of Gaussians being used to overfit the correspondence rather than the objects represented by the 3D Gaussians themselves, consequently wasting resources and lacking accurate geometries or textures. In this paper, we introduce LeanGaussian, a novel approach that treats each query in a deformable Transformer as one 3D Gaussian ellipsoid, breaking the pixel or point cloud correspondence constraints. We leverage a deformable decoder to iteratively refine the Gaussians layer by layer with the image features as keys and values. Notably, the center of each 3D Gaussian is defined as a 3D reference point, which is then projected onto the image for deformable attention in 2D space. On both the ShapeNet SRN dataset (category level) and the Google Scanned Objects dataset (open-category level, trained with the Objaverse dataset), our approach outperforms prior methods by approximately 6.1%, achieving a PSNR of 25.44 and 22.36, respectively. Additionally, our method achieves a 3D reconstruction speed of 7.2 FPS and a rendering speed of 500 FPS. The code will be released at https://github.com/jwubz123/DIG3D.
Authors: Kaiyu Li, Xiangyong Cao, Yupeng Deng, Jiayi Song, Junmin Liu, Deyu Meng, Zhi Wang
Abstract: Change Detection (CD) aims to identify pixels with semantic changes between images. However, annotating massive numbers of pixel-level images is labor-intensive and costly, especially for multi-temporal images, which require pixel-wise comparisons by human experts. Considering the excellent performance of visual language models (VLMs) in zero-shot and open-vocabulary settings with prompt-based reasoning, it is promising to utilize VLMs to improve CD under limited labeled data. In this paper, we propose a VLM guidance-based semi-supervised CD method, namely SemiCD-VL. The insight of SemiCD-VL is to synthesize free change labels using VLMs to provide additional supervision signals for unlabeled data. However, almost all current VLMs are designed for single-temporal images and cannot be directly applied to bi- or multi-temporal images. Motivated by this, we first propose a VLM-based mixed change event generation (CEG) strategy to yield pseudo labels for unlabeled CD data. Since the additional supervision signals provided by these VLM-driven pseudo labels may conflict with the pseudo labels from the consistency regularization paradigm (e.g., FixMatch), we propose a dual projection head for de-entangling different signal sources. Further, we explicitly decouple the semantic representations of the bi-temporal images through two auxiliary segmentation decoders, which are also guided by the VLM. Finally, to help the model capture change representations more adequately, we introduce metric-aware supervision via a feature-level contrastive loss in the auxiliary branches. Extensive experiments show the advantage of SemiCD-VL. For instance, SemiCD-VL improves the FixMatch baseline by +5.3 IoU on WHU-CD and by +2.4 IoU on LEVIR-CD with 5% labels. In addition, our CEG strategy, in an unsupervised manner, can achieve performance far superior to state-of-the-art unsupervised CD methods.
Authors: Sen Fang, Lei Wang, Ce Zheng, Chunyu Sui, Mingyu Zhao, Yapeng Tian, Chen Chen
Abstract: In this paper, we propose SignLLM, a multilingual Sign Language Production (SLP) large language model, which includes two novel multilingual SLP modes, MLSF and Prompt2LangGloss, that allow sign language gesture generation from query-text inputs and question-style prompt inputs, respectively. Both modes can use a new RL loss based on reinforcement learning and a new RL module named the Priority Learning Channel. These RL components can accelerate training by enhancing the model's capability to sample high-quality data. For SignLLM's training, we introduce Prompt2Sign, a comprehensive multilingual sign language dataset built from public data, including American Sign Language (ASL) and seven others. This dataset standardizes information by extracting pose information from sign language videos into a unified compressed format. We extensively evaluate SignLLM, demonstrating that our model achieves state-of-the-art performance on SLP tasks across eight sign languages.
Authors: Victoria Manousaki, Konstantinos Bacharidis, Filippos Gouidis, Konstantinos Papoutsakis, Dimitris Plexousakis, Antonis Argyros
Abstract: In this work, we introduce (a) the new problem of anticipating object state changes in images and videos during procedural activities, (b) new curated annotation data for object state change classification based on the Ego4D dataset, and (c) the first method for addressing this challenging problem. Solutions to this new task have important implications in vision-based scene understanding, automated monitoring systems, and action planning. The proposed novel framework predicts object state changes that will occur in the near future due to yet unseen human actions by integrating learned visual features that represent recent visual information with natural language (NLP) features that represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset, which provides a large-scale collection of first-person perspective videos across numerous interaction scenarios, we introduce an extension, denoted Ego4D-OSCA, that provides new curated annotation data for the object state change anticipation task (OSCA). An extensive experimental evaluation is presented, demonstrating the proposed method's efficacy in predicting object state changes in dynamic scenarios. The performance of the proposed approach also underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems and lays the groundwork for future research on the new task of object state change anticipation. The source code and the new annotation data (Ego4D-OSCA) will be made publicly available.
Authors: Ma\"elic Neau, Paulo E. Santos, Anne-Gwenn Bosser, C\'edric Buche
Abstract: Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we propose a novel architecture, inference method, and relation prediction model. Our proposed solution, the REACT model, achieves the highest inference speed among existing SGG models, improving object detection accuracy without sacrificing relation prediction performance. Compared to state-of-the-art approaches, REACT is 2.7 times faster (with a latency of 23 ms) and improves object detection accuracy by 58.51%. Furthermore, our proposal significantly reduces model size, with an average of 5.5x fewer parameters. Code is available at https://github.com/Maelic/SGG-Benchmark
Authors: Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
Abstract: Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with the linear attention Transformer, which typically underperforms the conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and the subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba's success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of the linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. Interestingly, the results highlight the forget gate and block design as the core contributors to Mamba's success, while the other four designs are less crucial. Based on these findings, we propose a Mamba-Inspired Linear Attention (MILA) model by incorporating the merits of these two key designs into linear attention. The resulting model outperforms various vision Mamba models in both image classification and high-resolution dense prediction tasks, while enjoying parallelizable computation and fast inference speed. Code is available at https://github.com/LeapLabTHU/MLLA.
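As an illustration of the reformulation described above, here is a minimal NumPy sketch of causal linear attention augmented with a per-step forget gate, the recurrence the abstract identifies as central to Mamba's success. The gate parameterization, feature shapes, and the optional key normalization (which the abstract notes Mamba itself drops) are assumptions for illustration, not the paper's exact MILA design.

```python
import numpy as np

def gated_linear_attention(Q, K, V, forget):
    """Causal linear attention with a per-step forget gate.

    Q, K: (T, d) non-negative feature maps, V: (T, d_v), forget: (T,) in (0, 1).
    A running key-value memory S is decayed by the gate before each update.
    """
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))          # running key-value memory
    z = np.zeros(d)                 # running key normalizer (generic linear attention; Mamba omits it)
    out = np.zeros((T, d_v))
    for t in range(T):
        S = forget[t] * S + np.outer(K[t], V[t])   # forget gate decays old memory
        z = forget[t] * z + K[t]
        out[t] = Q[t] @ S / (Q[t] @ z + 1e-6)      # normalized read-out at step t
    return out

# toy usage
rng = np.random.default_rng(0)
T, d = 8, 4
Q, K = np.abs(rng.normal(size=(T, d))), np.abs(rng.normal(size=(T, d)))  # keep features non-negative
V = rng.normal(size=(T, d))
forget = np.full(T, 0.9)
print(gated_linear_attention(Q, K, V, forget).shape)  # (8, 4)
```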
Authors: Shubhang Bhatnagar, Narendra Ahuja
Abstract: Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model, inspired by electrostatic fields in physics, that, instead of operating on tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where the mutual influence of samples is proportional to their distance, we enforce a reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real-world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks, Cars-196, CUB-200-2011, and SOP, where it outperforms state-of-the-art baselines.
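The decaying-field idea can be illustrated with a minimal NumPy sketch that superposes attractive (same-class) and repulsive (different-class) potentials at a query embedding, each decaying with distance. The exponential decay form, the scale parameter, and the absence of proxies here are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def field_loss(query, embeddings, labels, query_label, sigma=1.0):
    """Superpose decaying attractive/repulsive potentials at `query`.

    Same-class embeddings contribute attractive (negative) potentials,
    different-class ones repulsive (positive) potentials; both decay
    exponentially with distance, so far-away examples have little influence.
    """
    d = np.linalg.norm(embeddings - query, axis=1)
    sign = np.where(labels == query_label, -1.0, 1.0)
    potentials = sign * np.exp(-d / sigma)   # decaying field, unlike distance-proportional losses
    return potentials.sum()

# toy usage: the query is pulled toward class 0 and pushed away from class 1
rng = np.random.default_rng(1)
emb = rng.normal(size=(10, 16))
labels = np.array([0] * 5 + [1] * 5)
print(field_loss(emb[0] + 0.1, emb, labels, query_label=0))
```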
Authors: Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yaoming Wang, Wei Shen
Abstract: Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is conducted symmetrically on the two CLIP modalities, the misalignment between them is mitigated. Furthermore, we apply an additional constraint to PEFT on the CLIP text encoder according to the hyperspherical energy principle, i.e., minimizing hyperspherical energy during fine-tuning preserves the intrinsic structure of the original parameter space, to prevent the destruction of the generalization ability offered by the CLIP text encoder. Extensive evaluations across various benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while requiring updates to only approximately 4% of the total parameters of CLIP.
Authors: Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Vitor Guizilini, Yue Wang, Matteo Poggi, Yiyi Liao
Abstract: This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Thus, instead of directly developing a depth estimator from scratch, we reformulate this predictive task into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training while using a sliding window strategy and initializing overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth. Project page: https://xdimlab.github.io/ChronoDepth/.
Authors: Jason Yoo, Yingchen He, Saeid Naderiparizi, Dylan Green, Gido M. van de Ven, Geoff Pleiss, Frank Wood
Abstract: This work demonstrates that training autoregressive video diffusion models from a single, continuous video stream is not only possible but, remarkably, can also be competitive with standard offline training approaches given the same number of gradient steps. Our demonstration further reveals that this main result can be achieved using experience replay that only retains a subset of the preceding video stream. We also contribute three new single video generative modeling datasets suitable for evaluating lifelong video model learning: Lifelong Bouncing Balls, Lifelong 3D Maze, and Lifelong PLAICraft. Each dataset contains over a million consecutive frames from a synthetic environment of increasing complexity.
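A minimal sketch of the replay idea: a bounded buffer that retains only a subset of the preceding stream, here via reservoir sampling. Reservoir sampling, the capacity, and the hypothetical `train_step` call are assumptions for illustration; the paper's exact retention policy may differ.

```python
import random

class FrameReplayBuffer:
    """Keep a bounded, uniformly-sampled subset of a video stream (reservoir sampling)."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, frame):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(frame)
        else:
            j = self.rng.randrange(self.seen)   # keep new frame with prob. capacity / seen
            if j < self.capacity:
                self.buffer[j] = frame

    def sample(self, batch_size):
        return self.rng.sample(self.buffer, min(batch_size, len(self.buffer)))

# usage: mix each incoming frame with replayed past frames for one gradient step
buf = FrameReplayBuffer(capacity=1000)
for t in range(10_000):
    frame = t                        # stand-in for an actual video frame
    buf.add(frame)
    batch = [frame] + buf.sample(7)
    # train_step(model, batch)       # hypothetical training call on the mixed batch
```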
Authors: Mengfei Li, Xiaoxiao Long, Yixun Liang, Weiyu Li, Yuan Liu, Peng Li, Wenhan Luo, Wenping Wang, Yike Guo
Abstract: Despite recent advancements in the Large Reconstruction Model (LRM) demonstrating impressive results, when extending its input from a single image to multiple images, it exhibits inefficiencies, subpar geometric and texture quality, and slower convergence than expected. This is attributed to the fact that LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to reconstruct high-quality 3D shapes from multi-view images in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images. Moreover, we employ the 3D priors of the input multi-view images to initialize the triplane tokens. Compared to previous methods, the proposed M-LRM can generate 3D shapes of high fidelity. Experimental studies demonstrate that our model achieves a significant performance gain and faster training convergence. Project page: https://murphylmf.github.io/M-LRM/
Authors: Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, Long Chen
Abstract: Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability.
Authors: Jianyi Zhang, Yufan Zhou, Jiuxiang Gu, Curtis Wigington, Tong Yu, Yiran Chen, Tong Sun, Ruiyi Zhang
Abstract: Diffusion models have demonstrated exceptional capabilities in generating a broad spectrum of visual content, yet their proficiency in rendering text is still limited: they often generate inaccurate characters or words that fail to blend well with the underlying image. To address these shortcomings, we introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus specifically on learning text structures. Initially, we pretrain this textual model to capture the intricacies of text representation. Subsequently, we finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation. Additionally, we leverage the capabilities of pretrained large language models to better interpret user intentions, contributing to improved generation quality. Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
Authors: Ken Chen, Sachith Seneviratne, Wei Wang, Dongting Hu, Sanjay Saha, Md. Tarek Hasan, Sanka Rasnayaka, Tamasha Malepathirana, Mingming Gong, Saman Halgamuge
Abstract: Animating stylized avatars with dynamic poses and expressions has attracted increasing attention for its broad range of applications. Previous research has made significant progress by training controllable generative models to synthesize animations based on reference characteristics, pose, and expression conditions. However, the mechanisms used in these methods to control pose and expression often inadvertently introduce unintended features from the target motion, while also causing a loss of expression-related details, particularly when applied to stylized animation. This paper proposes a new method based on Stable Diffusion, called AniFaceDiff, incorporating a new conditioning module for animating stylized avatars. First, we propose a refined spatial conditioning approach by Facial Alignment to prevent the inclusion of identity characteristics from the target motion. Then, we introduce an Expression Adapter that incorporates additional cross-attention layers to address the potential loss of expression-related information. Our approach effectively preserves pose and expression from the target video while maintaining input image consistency. Extensive experiments demonstrate that our method achieves state-of-the-art results, showcasing superior image quality, preservation of reference features, and expression accuracy, particularly for out-of-domain animation across diverse styles, highlighting its versatility and strong generalization capabilities. This work aims to enhance the quality of virtual stylized animation for positive applications. To promote responsible use in virtual environments, we contribute to the advancement of detection for generative content by evaluating state-of-the-art detectors, highlighting potential areas for improvement, and suggesting solutions.
Authors: Jiaming Zhou, Teli Ma, Kun-Yu Lin, Zifan Wang, Ronghe Qiu, Junwei Liang
Abstract: Learning generalizable visual representations across different embodied environments is essential for effective robotic manipulation in real-world scenarios. However, the limited scale and diversity of robot demonstration data pose a significant challenge. Recent research has explored leveraging large-scale human activity data for pre-training, but the substantial morphological differences between humans and robots introduce a significant human-robot domain discrepancy, hindering the generalization of these models to downstream manipulation tasks. To overcome this, we propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap. Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner. Experiments on 20 simulated tasks across two different benchmarks and five real-world tasks demonstrate significant improvements. These results span both single-task and language-conditioned multi-task settings, evaluated using two different pre-trained models. Compared to existing pre-trained models, our adaptation method improves the average success rate by over 7% across multiple tasks on both simulated benchmarks and real-world evaluations. We will release the code and models.
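A minimal PyTorch sketch of a human-robot contrastive alignment loss over paired video embeddings, written as a symmetric InfoNCE objective. The temperature, the symmetric formulation, and the embedding shapes are assumptions about the exact loss used.

```python
import torch
import torch.nn.functional as F

def human_robot_alignment_loss(human_emb, robot_emb, tau=0.07):
    """Symmetric InfoNCE: the i-th human clip should match the i-th robot clip.

    human_emb, robot_emb: (B, D) embeddings of paired videos produced by the
    (possibly adapter-augmented) pre-trained encoder.
    """
    h = F.normalize(human_emb, dim=-1)
    r = F.normalize(robot_emb, dim=-1)
    logits = h @ r.t() / tau
    targets = torch.arange(h.size(0), device=h.device)
    # pull matched pairs together in both directions, push apart the rest of the batch
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage
print(human_robot_alignment_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```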
Authors: Nikita Starodubcev, Mikhail Khoroshikh, Artem Babenko, Dmitry Baranchuk
Abstract: Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.
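The dynamic guidance idea can be sketched as a guidance scale that is only kept high on part of the few-step sampling trajectory; outside that window, a low scale keeps the update closer to an easily invertible path. The schedule shape and the standard classifier-free guidance combination below are illustrative assumptions, not the paper's exact iCD configuration.

```python
def dynamic_cfg_scale(step, num_steps, w_high=7.5, w_low=1.0, active=(0.2, 0.8)):
    """Return a guidance scale that is only 'high' inside a middle window of steps.

    Lowering guidance on part of the trajectory is one way dynamic guidance can
    reduce inversion/reconstruction error while keeping most text alignment.
    """
    frac = step / max(num_steps - 1, 1)
    return w_high if active[0] <= frac <= active[1] else w_low

def guided_eps(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance combination of noise predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# toy check of the schedule over a 4-step distilled sampler
for s in range(4):
    print(s, dynamic_cfg_scale(s, 4))
```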
Authors: Jialang Xu, Nazir Sirajudeen, Matthew Boal, Nader Francis, Danail Stoyanov, Evangelos Mazomenos
Abstract: Automated detection of surgical errors can improve robotic-assisted surgery. Despite promising progress, existing methods still face challenges in capturing rich temporal context to establish long-term dependencies while maintaining computational efficiency. In this paper, we propose a novel hierarchical model named SEDMamba, which incorporates the selective state space model (SSM) into surgical error detection, facilitating efficient long sequence modelling with linear complexity. SEDMamba enhances selective SSM with a bottleneck mechanism and fine-to-coarse temporal fusion (FCTF) to detect and temporally localize surgical errors in long videos. The bottleneck mechanism compresses and restores features within their spatial dimension, thereby reducing computational complexity. FCTF utilizes multiple dilated 1D convolutional layers to merge temporal information across diverse scale ranges, accommodating errors of varying duration. Our work also contributes the first-of-its-kind, frame-level, in-vivo surgical error dataset to support error detection in real surgical cases. Specifically, we deploy the clinically validated observational clinical human reliability assessment tool (OCHRA) to annotate the errors during suturing tasks in an open-source radical prostatectomy dataset (SAR-RARP50). Experimental results demonstrate that our SEDMamba outperforms state-of-the-art methods with at least 1.82% AUC and 3.80% AP performance gains with significantly reduced computational complexity. The corresponding error annotations, code and models are released at https://github.com/wzjialang/SEDMamba.
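A minimal PyTorch sketch of a fine-to-coarse temporal fusion block that merges context from parallel dilated 1D convolutions with increasing dilation. The number of branches, dilation rates, and summation-based merging are assumptions for illustration rather than SEDMamba's exact FCTF design.

```python
import torch
import torch.nn as nn

class FineToCoarseTemporalFusion(nn.Module):
    """Merge temporal context at multiple scales via parallel dilated 1D convolutions."""
    def __init__(self, channels, dilations=(1, 2, 4, 8), k=3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, k, dilation=d, padding=d * (k - 1) // 2)
            for d in dilations
        ])
        self.proj = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                                    # x: (B, C, T) per-frame features
        fused = sum(branch(x) for branch in self.branches)   # fine (d=1) through coarse (d=8) context
        return self.proj(fused)

print(FineToCoarseTemporalFusion(64)(torch.randn(2, 64, 300)).shape)  # torch.Size([2, 64, 300])
```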
Authors: Sagi Eppel
Abstract: Vastextures is a vast repository of 500,000 textures and PBR materials extracted from real-world images using an unsupervised process. The extracted materials and textures are extremely diverse and cover a vast range of real-world patterns, but are at the same time less refined compared to existing repositories. The repository is composed of 2D textures cropped from natural images and SVBRDF/PBR materials generated from these textures. Textures and PBR materials are essential for CGI. Existing material repositories focus on games, animation, and art, which demand a limited amount of high-quality assets. However, virtual worlds and synthetic data are becoming increasingly important for training AI systems for computer vision. This application demands a huge amount of diverse assets but is at the same time less affected by noisy and unrefined assets. Vastextures aims to address this need by creating a free, huge, and diverse asset repository that covers as many real-world materials as possible. The materials are automatically extracted from natural images in two steps: 1) Automatically scanning a vast number of images to identify and crop regions with uniform textures. This is done by splitting the image into a grid of cells and identifying regions in which all of the cells share a similar statistical distribution. 2) Extracting the properties of the PBR material from the cropped texture. This is done by randomly guessing every correlation between the properties of the texture image and the properties of the PBR material. The resulting PBR materials exhibit a vast amount of real-world patterns as well as unexpected emergent properties. Neural nets trained on this repository outperformed nets trained using handcrafted assets.
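Step 1 of the extraction pipeline can be sketched in NumPy: split an image into a grid of cells, summarize each cell with simple channel statistics, and keep cells whose statistics match their neighbours as candidate uniform-texture regions. The choice of mean/std statistics, the cell size, and the similarity threshold are assumptions for illustration.

```python
import numpy as np

def cell_stats(img, cell=32):
    """Per-cell channel mean and std for an (H, W, 3) image, on a grid of cells."""
    H, W, _ = img.shape
    gh, gw = H // cell, W // cell
    cells = img[:gh * cell, :gw * cell].reshape(gh, cell, gw, cell, 3)
    mean = cells.mean(axis=(1, 3))              # (gh, gw, 3)
    std = cells.std(axis=(1, 3))                # (gh, gw, 3)
    return np.concatenate([mean, std], axis=-1)

def uniform_region_mask(img, cell=32, thresh=12.0):
    """Mark cells whose statistics are close to all 4 neighbours (candidate uniform texture)."""
    s = cell_stats(img, cell)
    gh, gw, _ = s.shape
    mask = np.ones((gh, gw), dtype=bool)
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        shifted = np.roll(s, (dy, dx), axis=(0, 1))
        diff = np.linalg.norm(s - shifted, axis=-1)
        mask &= diff < thresh
    mask[0, :] = mask[-1, :] = False            # drop border cells affected by np.roll wrap-around
    mask[:, 0] = mask[:, -1] = False
    return mask

# usage on a random image (in practice, the largest rectangle of True cells would be cropped)
img = np.random.rand(256, 384, 3) * 255
print(uniform_region_mask(img).shape)           # (8, 12)
```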
Authors: Yang Song, Lin Wang
Abstract: 3D object detection is an important task that has been widely applied in autonomous driving. To perform this task, a new trend is to fuse multi-modal inputs, i.e., LiDAR and camera. Under such a trend, recent methods fuse these two modalities by unifying them in the same 3D space. However, during direct fusion in a unified space, the drawbacks of both modalities (LiDAR features struggle with detailed semantic information and the camera lacks accurate 3D spatial information) are also preserved, diluting the semantic and spatial awareness of the final unified representation. To address this issue, this letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion, which achieves robust semantic- and spatial-aware 3D object detection. The key insight is to fuse LiDAR and camera features in a bidirectional complementary way to enhance the semantic awareness of the LiDAR and the 3D spatial awareness of the camera. The enhanced features from both modalities are then adaptively fused to build a semantic- and spatial-aware unified representation. Specifically, we introduce Pre-Fusion, consisting of a Voxel Enhancement Module (VEM) that enhances the semantic awareness of voxel features using 2D camera features, and an Image Enhancement Module (IEM) that enhances the 3D spatial awareness of camera features using 3D voxel features. We then introduce Unified Fusion (U-Fusion) to adaptively fuse the enhanced features from the last stage to build a unified representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.
Authors: Laiyan Ding, Hualie Jiang, Jie Li, Yongquan Chen, Rui Huang
Abstract: Depth estimation is a cornerstone for autonomous driving, yet acquiring per-pixel depth ground truth for supervised learning is challenging. Self-Supervised Surround Depth Estimation (SSSDE) from consecutive images offers an economical alternative. While previous SSSDE methods have proposed different mechanisms to fuse information across images, few of them explicitly consider the cross-view constraints, leading to inferior performance, particularly in overlapping regions. This paper proposes an efficient and consistent pose estimation design and two loss functions to enhance cross-view consistency for SSSDE. For pose estimation, we propose to use only front-view images to reduce training memory and sustain pose estimation consistency. The first loss function is the dense depth consistency loss, which penalizes the difference between predicted depths in overlapping regions. The second one is the multi-view reconstruction consistency loss, which aims to maintain consistency between reconstruction from spatial and spatial-temporal contexts. Additionally, we introduce a novel flipping augmentation to improve the performance further. Our techniques enable a simple neural model to achieve state-of-the-art performance on the DDAD and nuScenes datasets. Last but not least, our proposed techniques can be easily applied to other methods. The code will be made public.
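A minimal PyTorch sketch of a dense depth consistency penalty between overlapping camera views, assuming one view's depth has already been warped into the other view and a validity mask is available. The scale-normalized L1 form is an assumption about the exact loss.

```python
import torch

def dense_depth_consistency_loss(depth_a, depth_b_warped, overlap_mask, eps=1e-6):
    """Penalize disagreement between camera A's depth and camera B's depth
    reprojected into A's view, restricted to the overlapping region.

    depth_a, depth_b_warped: (B, 1, H, W) predicted depths in view A.
    overlap_mask:            (B, 1, H, W) mask of valid overlapping pixels.
    """
    diff = (depth_a - depth_b_warped).abs()
    diff = diff / (depth_a + depth_b_warped + eps)   # scale-normalize so near/far regions are comparable
    return (diff * overlap_mask).sum() / (overlap_mask.sum() + eps)

# toy usage
B, H, W = 2, 24, 32
da = torch.rand(B, 1, H, W) + 0.5
db = da + 0.05 * torch.randn_like(da)
mask = torch.zeros(B, 1, H, W)
mask[..., :, :16] = 1.0                              # pretend the left half overlaps
print(dense_depth_consistency_loss(da, db, mask).item())
```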
Authors: Yujie Zhang, Qi Yang, Ziyu Shan, Yiling Xu
Abstract: Recent years have witnessed the success of deep learning-based techniques in research on no-reference point cloud quality assessment (NR-PCQA). For more accurate quality prediction, many previous studies have attempted to capture global and local features in a bottom-up manner, but ignored the interaction and promotion between them. To solve this problem, we propose a novel asynchronous feedback quality prediction network (AFQ-Net). Motivated by human visual perception mechanisms, AFQ-Net employs a dual-branch structure to deal with global and local features, simulating the left and right hemispheres of the human brain, and constructs a feedback module between them. Specifically, the input point clouds are first fed into a transformer-based global encoder to generate attention maps that highlight semantically rich regions, which are then merged into the global feature. Then, we utilize the generated attention maps to perform dynamic convolution for different semantic regions and obtain the local feature. Finally, a coarse-to-fine strategy is adopted to merge the two features into the final quality score. We conduct comprehensive experiments on three datasets and achieve superior performance over the state-of-the-art approaches on all of these datasets. The code will be available at https://github.com/zhangyujie-1998/AFQ-Net.
Authors: Pengxiang Li, Zhi Gao, Bofei Zhang, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, Qing Li
Abstract: Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. Then, we build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as the test data, two evaluation settings, and a model to provide feedback for VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M, which shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, making more efficient user-agent interactions and underscoring the significance of the FIRE dataset.
Authors: Jiahao Zhang, Frederic Z. Zhang, Cristian Rodriguez, Yizhak Ben-Shabat, Anoop Cherian, Stephen Gould
Abstract: We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries such as their general mutual exclusiveness and temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, thus harming accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing up the visual content features of the step diagrams and a fixed number of learnable positional embeddings. Our insight is that, through self-attention, composite queries carrying different content features suppress each other to reduce timespan overlaps in predictions, while cross-attention corrects temporal misalignment via joint content and position guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while simultaneously grounding multiple queries.
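A minimal PyTorch sketch of composite query construction: every step-diagram content feature is exhaustively paired with every learnable positional embedding. Combining content and position by addition, and the query dimensionality, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CompositeQueries(nn.Module):
    """Pair every step-diagram content feature with every learnable positional embedding."""
    def __init__(self, num_positions, dim):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(num_positions, dim) * 0.02)

    def forward(self, content):             # content: (num_steps, dim)
        S, D = content.shape
        P = self.pos_embed.shape[0]
        # (S, 1, D) + (1, P, D) -> (S, P, D): one query per (step, position) pair
        queries = content[:, None, :] + self.pos_embed[None, :, :]
        return queries.reshape(S * P, D)    # flattened set fed to the transformer decoder

# usage
cq = CompositeQueries(num_positions=10, dim=256)
step_feats = torch.randn(4, 256)            # 4 step diagrams
print(cq(step_feats).shape)                 # torch.Size([40, 256])
```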
Authors: Yibin Yan, Weidi Xie
Abstract: Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowledge. Despite significant advancements, generative models often struggle with these tasks due to the limited integration of external knowledge. In this paper, we introduce EchoSight, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To strive for high-performing retrieval, EchoSight first searches wiki articles by using visual-only information, subsequently, these candidate articles are further reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to enhanced retrieval outcomes and more accurate VQA responses. Our experimental results on the Encyclopedic VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8% on Encyclopedic VQA and 31.3% on InfoSeek.
Authors: Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman
Abstract: Feedback is essential for learning a new skill or improving one's current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching -- expert commentary, expert video retrieval, and expert pose generation -- outperforming strong vision-language models on both established metrics and human preference studies. Code and data will be publicly released.
Authors: Weihao Yu, Zhengyuan Yang, Lingfeng Ren, Linjie Li, Jianfeng Wang, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, Xinchao Wang
Abstract: MM-Vet, with open-ended vision-language questions aimed at evaluating integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding", evaluating models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while further expanding the evaluation set size. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4. The code, data, and leaderboard are accessible at https://github.com/yuweihao/MM-Vet.
Authors: Muhammad Ahmad, Muhammad Hassaan Farooq Butt, Adil Mehmood Khan, Manuel Mazzara, Salvatore Distefano, Muhammad Usama, Swalpa Kumar Roy, Jocelyn Chanussot, Danfeng Hong
Abstract: Recent advancements in transformers, specifically self-attention mechanisms, have significantly improved hyperspectral image (HSI) classification. However, these models often suffer from inefficiencies, as their computational complexity scales quadratically with sequence length. To address these challenges, we propose the morphological spatial mamba (SMM) and morphological spatial-spectral Mamba (SSMM) model (MorpMamba), which combines the strengths of morphological operations and the state space model framework, offering a more computationally efficient alternative to transformers. In MorpMamba, a novel token generation module first converts HSI patches into spatial-spectral tokens. These tokens are then processed through morphological operations such as erosion and dilation, utilizing depthwise separable convolutions to capture structural and shape information. A token enhancement module refines these features by dynamically adjusting the spatial and spectral tokens based on central HSI regions, ensuring effective feature fusion within each block. Subsequently, multi-head self-attention is applied to further enrich the feature representations, allowing the model to capture complex relationships and dependencies within the data. Finally, the enhanced tokens are fed into a state space module, which efficiently models the temporal evolution of the features for classification. Experimental results on widely used HSI datasets demonstrate that MorpMamba achieves superior parametric efficiency compared to traditional CNN and transformer models while maintaining high accuracy. The code will be made publicly available at https://github.com/mahmad000/MorpMamba.
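The morphological token processing can be sketched in PyTorch using grayscale dilation and erosion implemented with max pooling, followed by a depthwise separable convolution. The kernel sizes, the concatenate-then-project composition, and treating tokens as a spatial map are assumptions, not MorpMamba's exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dilate(x, k=3):
    """Grayscale dilation: per-channel max over a k x k neighbourhood."""
    return F.max_pool2d(x, kernel_size=k, stride=1, padding=k // 2)

def erode(x, k=3):
    """Grayscale erosion: min filter, expressed as max pooling on the negated input."""
    return -F.max_pool2d(-x, kernel_size=k, stride=1, padding=k // 2)

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pw(self.dw(x))

class MorphBlock(nn.Module):
    """Combine eroded and dilated token maps, then mix them with a separable convolution."""
    def __init__(self, channels):
        super().__init__()
        self.mix = DepthwiseSeparableConv(2 * channels)
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, tokens):              # tokens: (B, C, H, W) spatial-spectral map
        morph = torch.cat([erode(tokens), dilate(tokens)], dim=1)
        return self.proj(self.mix(morph))

print(MorphBlock(32)(torch.randn(2, 32, 11, 11)).shape)  # torch.Size([2, 32, 11, 11])
```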
Authors: Yiwen Chen, Yikai Wang, Yihao Luo, Zhengyi Wang, Zilong Chen, Jun Zhu, Chi Zhang, Guosheng Lin
Abstract: Meshes are the de facto 3D representation in the industry but are labor-intensive to produce. Recently, a line of research has focused on autoregressively generating meshes. This approach processes meshes into a sequence composed of vertices and then generates them vertex by vertex, similar to how a language model generates text. These methods have achieved some success but still struggle to generate complex meshes. One primary reason for this limitation is their inefficient tokenization methods. To address this issue, we introduce MeshAnything V2, an advanced mesh generation model designed to create Artist-Created Meshes that align precisely with specified shapes. A key innovation behind MeshAnything V2 is our novel Adjacent Mesh Tokenization (AMT) method. Unlike traditional approaches that represent each face using three vertices, AMT optimizes this by employing a single vertex wherever feasible, effectively reducing the token sequence length by about half on average. This not only streamlines the tokenization process but also results in more compact and well-structured sequences, enhancing the efficiency of mesh generation. With these improvements, MeshAnything V2 effectively doubles the face limit compared to previous models, delivering superior performance without increasing computational costs. We will make our code and models publicly available. Project Page: https://buaacyw.github.io/meshanything-v2/
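A minimal sketch of the adjacency idea behind AMT on a toy triangle list: when consecutive faces share an edge, only the single new vertex is emitted instead of three. The reset marker, the assumption that faces are pre-ordered to maximize adjacency, and the handling of non-adjacent faces are illustrative simplifications of the actual tokenizer.

```python
def adjacent_mesh_tokenize(faces):
    """Simplified adjacency-aware tokenization of a triangle face list.

    faces: list of (v0, v1, v2) vertex-index triples, assumed pre-ordered so that
    consecutive faces tend to share an edge. When the current face shares an edge
    (two vertices) with the previous one, only the single new vertex is emitted;
    otherwise a reset marker plus all three vertices are emitted.
    """
    RESET = -1
    tokens = []
    prev = None
    for face in faces:
        if prev is not None and len(set(face) & set(prev)) == 2:
            (new_v,) = set(face) - set(prev)
            tokens.append(new_v)             # one token instead of three
        else:
            tokens += [RESET, *face]         # start a new strip of faces
        prev = face
    return tokens

# toy usage: a fan of edge-sharing triangles followed by a disconnected one
faces = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (7, 8, 9)]
print(adjacent_mesh_tokenize(faces))         # [-1, 0, 1, 2, 3, 4, -1, 7, 8, 9]
```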
Authors: Jiawei Wu, Zhi Jin
Abstract: Recent research tries to extend image restoration capabilities from human perception to machine perception, thereby enhancing the performance of high-level vision tasks in degraded environments. These methods, primarily based on supervised learning, typically involve the retraining of restoration networks or high-level vision networks. However, collecting paired data in real-world scenarios and retraining large-scale models are challenging. To this end, we propose an unsupervised learning method called Variational Translator (VaT), which does not require retraining existing restoration and high-level vision networks. Instead, it establishes a lightweight network that serves as an intermediate bridge between them. By variational inference, VaT approximates the joint distribution of restoration output and high-level vision input, dividing the optimization objective into preserving content and maximizing marginal likelihood associated with high-level vision tasks. By cleverly leveraging self-training paradigms, VaT achieves the above optimization objective without requiring labels. As a result, the translated images maintain a close resemblance to their original content while also demonstrating exceptional performance on high-level vision tasks. Extensive experiments in dehazing and low-light enhancement for detection and classification show the superiority of our method over other state-of-the-art unsupervised counterparts, even significantly surpassing supervised methods in some complex real-world scenarios.
Authors: Eman Ali, Sathira Silva, Muhammad Haris Khan
Abstract: Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labeled data is unavailable. Recent research has proposed pseudo-labeling approaches to adapt CLIP in an unsupervised manner using unlabeled target data. Nonetheless, these methods struggle due to noisy pseudo-labels resulting from the misalignment between CLIP's visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes, acting as distinct classifiers, along with the convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve the adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and the state-of-the-art unsupervised adaptation baselines.
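A minimal PyTorch sketch of pseudo-labeling from dual prototypes: two cosine-similarity classifiers (image-derived and text-derived prototypes) whose softmax outputs are convexly combined, with the resulting confidence usable for ranking pseudo-labels during self-training. The mixing weight, temperature, and classifier form are assumptions about DPA's exact construction.

```python
import torch
import torch.nn.functional as F

def dual_prototype_pseudo_labels(feats, proto_img, proto_txt, alpha=0.5, tau=0.01):
    """Pseudo-labels from a convex combination of two prototype classifiers.

    feats:     (N, D) image features from the frozen encoder.
    proto_img: (C, D) image prototypes; proto_txt: (C, D) textual prototypes.
    """
    f = F.normalize(feats, dim=-1)
    p_img = F.normalize(proto_img, dim=-1)
    p_txt = F.normalize(proto_txt, dim=-1)
    logits_img = f @ p_img.t() / tau                  # cosine-similarity classifiers
    logits_txt = f @ p_txt.t() / tau
    probs = alpha * logits_img.softmax(-1) + (1 - alpha) * logits_txt.softmax(-1)
    conf, labels = probs.max(dim=-1)
    return labels, conf                               # confidence can rank pseudo-labels

# toy usage
N, C, D = 16, 5, 64
labels, conf = dual_prototype_pseudo_labels(torch.randn(N, D), torch.randn(C, D), torch.randn(C, D))
print(labels.shape, conf.shape)                       # torch.Size([16]) torch.Size([16])
```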
Authors: Zhirui Gao, Renjiao Yi, Yuhang Huang, Wei Chen, Chenyang Zhu, Kai Xu
Abstract: Low-level 3D representations, such as point clouds, meshes, NeRFs, and 3D Gaussians, are commonly used to represent 3D objects or scenes. However, human perception typically understands 3D objects at a higher level as a composition of parts or structures rather than points or voxels. Representing 3D objects or scenes as semantic parts can benefit further understanding and applications. In this paper, we introduce $\textbf{PartGS}$, $\textbf{part}$-aware 3D reconstruction by a hybrid representation of 2D $\textbf{G}$aussians and $\textbf{S}$uperquadrics, which parses objects or scenes into semantic parts, digging 3D structural clues from multi-view image inputs. Accurate structured geometry reconstruction and high-quality rendering are achieved at the same time. Our method simultaneously optimizes superquadric meshes and Gaussians by coupling their parameters within our hybrid representation. On one hand, this hybrid representation inherits the advantage of superquadrics to represent different shape primitives, supporting flexible part decomposition of scenes. On the other hand, 2D Gaussians capture complex texture and geometry details, ensuring high-quality appearance and geometry reconstruction. Our method is fully unsupervised and outperforms existing state-of-the-art approaches in extensive experiments on DTU, ShapeNet, and real-life datasets.
Authors: Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto
Abstract: Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purpose.
Authors: Sicheng Liu, Lintao Wang, Xiaogang Zhu, Xuequan Lu, Zhiyong Wang, Kun Hu
Abstract: Extreme Multimodal Summarization with Multimodal Output (XMSMO) has become an attractive summarization approach by integrating various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the issue that multimodal data often contains substantial topic-irrelevant information, which can mislead the model into producing inaccurate summaries, especially for extremely short ones. In this paper, we propose SITransformer, a Shared Information-guided Transformer for extreme multimodal summarization. It has a shared information guided pipeline which involves a cross-modal shared information extractor and a cross-modal interaction module. The extractor formulates semantically shared salient information from different modalities by devising a novel filtering process consisting of a differentiable top-k selector and a shared-information guided gating unit. As a result, the common, salient, and relevant contents across modalities are identified. Next, a transformer with cross-modal attention is developed for intra- and inter-modality learning with the shared information guidance to produce the extreme summary. Comprehensive experiments demonstrate that SITransformer significantly enhances the summarization quality for both video and text summaries for XMSMO. Our code will be publicly available at https://github.com/SichengLeoLiu/MMAsia24-XMSMO.
Authors: Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan
Abstract: Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.
Authors: Zhiyuan Yan, Yandan Zhao, Shen Chen, Mingyi Guo, Xinghe Fu, Taiping Yao, Shouhong Ding, Li Yuan
Abstract: Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how can we tackle efficiency without compromising accuracy? This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. To reproduce FFD, we then propose a novel Video-level Blending data (VB), where VB is implemented by blending the original image and its warped version frame-by-frame, serving as a hard negative sample to mine more general artifacts. Second, we carefully design a lightweight Spatiotemporal Adapter (StA) to equip a pretrained image model (both ViTs and CNNs) with the ability to capture both spatial and temporal features jointly and efficiently. StA is designed with two-stream 3D-Conv with varying kernel sizes, allowing it to process spatial and temporal features separately. Extensive experiments validate the effectiveness of the proposed methods; and show our approach can generalize well to previously unseen forgery videos, even the latest generation methods.
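A minimal NumPy sketch of video-level blending: each frame is blended with a slightly warped copy of itself, so every individual frame looks clean while the clip carries a subtle temporal drift usable as a hard negative. The warp model (a small random translation) and the blend ratio are stand-in assumptions; the paper's VB likely uses a richer warp.

```python
import numpy as np

def warp_frame(frame, max_shift=2, rng=None):
    """Cheap stand-in warp: a small random integer translation of the frame."""
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(frame, shift=(dy, dx), axis=(0, 1))

def video_level_blend(frames, alpha=0.5, seed=0):
    """Blend each original frame with its warped copy, frame by frame.

    The blended clip looks pristine per-frame but carries a subtle drift across
    frames, serving as a hard negative that exposes temporal artifacts.
    """
    rng = np.random.default_rng(seed)
    return np.stack([(1 - alpha) * f + alpha * warp_frame(f, rng=rng) for f in frames])

# toy usage on an 8-frame random clip
clip = np.random.rand(8, 64, 64, 3)
print(video_level_blend(clip).shape)   # (8, 64, 64, 3)
```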
Authors: Taesung Kwon, Jong Chul Ye
Abstract: Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: https://svi-diffusion.github.io
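The batch-consistent sampling idea can be sketched in PyTorch by drawing a single noise map per reverse step and broadcasting it across the batch (video-frame) dimension, so all frames share the same stochastic component. The simplified update rule below is a placeholder for illustration only and is not the decomposed diffusion sampler used in the paper.

```python
import torch

def batch_consistent_noise(batch, channels, height, width, generator=None):
    """Draw one noise map and share it across the batch (video-frame) dimension."""
    shared = torch.randn(1, channels, height, width, generator=generator)
    return shared.expand(batch, -1, -1, -1)

def reverse_step(x_t, eps_pred, sigma_t):
    """Placeholder reverse update: deterministic estimate plus shared stochastic noise.

    The point illustrated is only the synchronization of the noise across frames;
    the actual solver in the paper is different.
    """
    noise = batch_consistent_noise(*x_t.shape)
    return x_t - eps_pred + sigma_t * noise

frames = torch.randn(16, 4, 32, 32)    # 16 video frames treated as an image batch
print(reverse_step(frames, torch.zeros_like(frames), 0.1).shape)
```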
Authors: Shuang Zeng, Xinyuan Chang, Xinran Liu, Zheng Pan, Xing Wei
Abstract: High-Definition Maps (HD maps) are essential for the precise navigation and decision-making of autonomous vehicles, yet their creation and upkeep present significant cost and timeliness challenges. The online construction of HD maps using on-board sensors has emerged as a promising solution; however, these methods can be impeded by incomplete data due to occlusions and inclement weather. This paper proposes the PriorDrive framework to address these limitations by harnessing the power of prior maps, significantly enhancing the robustness and accuracy of online HD map construction. Our approach integrates a variety of prior maps, such as OpenStreetMap's Standard Definition Maps (SD maps), outdated HD maps from vendors, and locally constructed maps from historical vehicle data. To effectively encode this prior information into online mapping models, we introduce a Hybrid Prior Representation (HPQuery) that standardizes the representation of diverse map elements. At the core of PriorDrive is the Unified Vector Encoder (UVE), which employs hybrid prior embedding and a dual encoding mechanism to process vector data. Furthermore, we propose a segment-level and point-level pre-training strategy that enables the UVE to learn the prior distribution of vector data, thereby improving the encoder's generalizability and performance. Through extensive testing on the nuScenes, Argoverse 2, and OpenLane-V2 datasets, we demonstrate that PriorDrive is highly compatible with various online mapping models and substantially improves map prediction capabilities. The integration of prior maps through the PriorDrive framework offers a robust solution to the challenges of single-perception data, paving the way for more reliable autonomous vehicle navigation.
Authors: Vincenzo Polizzi, Marco Cannici, Davide Scaramuzza, Jonathan Kelly
Abstract: Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our method significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenarios while maintaining lower memory and computational costs.
Authors: Joonghyuk Shin, Daehyeon Choi, Jaesik Park
Abstract: Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.
Authors: Kushal Vyas, Ahmed Imtiaz Humayun, Aniket Dashpute, Richard G. Baraniuk, Ashok Veeraraghavan, Guha Balakrishnan
Abstract: Implicit neural representations (INRs) have demonstrated success in a variety of applications, including inverse problems and neural rendering. An INR is typically trained to capture one signal of interest, resulting in learned neural features that are highly attuned to that signal. Although such features are often assumed to be less generalizable, we explore their transferability for fitting similar signals. We introduce a new INR training framework, STRAINER, that learns transferable features for fitting INRs to new signals from a given distribution, faster and with better reconstruction quality. Owing to the sequential layer-wise affine operations in an INR, we propose to learn transferable representations by sharing initial encoder layers across multiple INRs with independent decoder layers. At test time, the learned encoder representations are transferred as initialization for an otherwise randomly initialized INR. We find STRAINER to yield extremely powerful initialization for fitting images from the same domain and allow for $\approx +10dB$ gain in signal quality early on compared to an untrained INR itself. STRAINER also provides a simple way to encode data-driven priors in INRs. We evaluate STRAINER on multiple in-domain and out-of-domain signal fitting tasks and inverse problems and further provide detailed analysis and discussion on the transferability of STRAINER's features. Our demo can be accessed at https://colab.research.google.com/drive/1fBZAwqE8C_lrRPAe-hQZJTWrMJuAKtG2?usp=sharing .
URLs: https://colab.research.google.com/drive/1fBZAwqE8C_lrRPAe-hQZJTWrMJuAKtG2?usp=sharing
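A minimal sketch of the shared-encoder training and transfer scheme described in the STRAINER abstract above, assuming small toy dimensions and a stand-in nonlinearity; the actual architecture, training schedule, and signals differ.

import torch
import torch.nn as nn

def inr_layer(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, d_out), nn.Tanh())   # stand-in for a SIREN-style layer

class SharedEncoderINRs(nn.Module):
    def __init__(self, n_signals, d_in=2, d_hidden=64, d_out=3, n_enc_layers=3):
        super().__init__()
        self.encoder = nn.Sequential(*[inr_layer(d_in if i == 0 else d_hidden, d_hidden)
                                       for i in range(n_enc_layers)])      # shared across signals
        self.decoders = nn.ModuleList([nn.Linear(d_hidden, d_out) for _ in range(n_signals)])

    def forward(self, coords, signal_idx):
        return self.decoders[signal_idx](self.encoder(coords))

# Jointly fit a few in-domain signals so the shared encoder learns transferable features.
model = SharedEncoderINRs(n_signals=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
coords = torch.rand(1024, 2)
targets = [torch.rand(1024, 3) for _ in range(4)]
for _ in range(10):
    loss = sum(((model(coords, i) - targets[i]) ** 2).mean() for i in range(4))
    opt.zero_grad(); loss.backward(); opt.step()

# Transfer: reuse the learned encoder as the initialization of an INR for a new signal.
new_inr = SharedEncoderINRs(n_signals=1)
new_inr.encoder.load_state_dict(model.encoder.state_dict())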
Authors: Franciszek Szewczyk, Gilles Louppe, Matthia Sabatelli
Abstract: Lifelike visualizations in design, cinematography, and gaming rely on precise physics simulations, typically requiring extensive computational resources and detailed physical input. This paper presents a method that can infer a system's physical properties from a short video, eliminating the need for explicit parameter input, provided the system is close to the training conditions. The learned representation is then used within a Graph Network-based Simulator to emulate the trajectories of physical systems. We demonstrate that the video-derived encodings effectively capture the physical properties of the system and showcase a linear dependence between some of the encodings and the system's motion.
Authors: Serin Yang, Taesung Kwon, Jong Chul Ye
Abstract: Recent progress in large-scale text-to-video (T2V) and image-to-video (I2V) diffusion models has greatly enhanced video generation, especially in terms of keyframe interpolation. However, current image-to-video diffusion models, while powerful in generating videos from a single conditioning frame, need adaptation for two-frame (start & end) conditioned generation, which is essential for effective bounded interpolation. Unfortunately, existing approaches that fuse temporally forward and backward paths in parallel often suffer from off-manifold issues, leading to artifacts or requiring multiple iterative re-noising steps. In this work, we introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames. Additionally, we incorporate advanced guidance techniques, CFG++ and DDS, to further enhance the interpolation process. By integrating these, our method achieves state-of-the-art performance, efficiently generating high-quality, smooth videos between keyframes. On a single 3090 GPU, our method can interpolate 25 frames at 1024 x 576 resolution in just 195 seconds, establishing it as a leading solution for keyframe interpolation.
Authors: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass
Abstract: In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of text prompts preferred by the downstream VLM. Furthermore, we also explicitly steer the LLM generation process in each optimization step by specifically adding an offset difference vector of the embeddings from the positive and negative solutions found by the LLM, in previous optimization steps, to the intermediate layer of the network for the next generation step. This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models -- showing that the discovered solutions can enhance the recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these models.
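The following is a hedged, toy-scale sketch of the embedding-offset steering step described above: a difference vector between hidden states of positive and negative prompt solutions is added to an intermediate layer via a forward hook. The tiny model, layer choice, and scaling factor are assumptions, not the GLOV implementation.

import torch
import torch.nn as nn

class TinyLM(nn.Module):                           # stand-in for a decoder-only LLM
    def __init__(self, vocab=100, d=32, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList([nn.TransformerEncoderLayer(d, 4, 64, batch_first=True)
                                     for _ in range(n_layers)])
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.embed(ids)
        for layer in self.layers:
            h = layer(h)
        return self.head(h), h

model = TinyLM()
pos_ids = torch.randint(0, 100, (4, 8))            # token ids of high-fitness prompts (hypothetical)
neg_ids = torch.randint(0, 100, (4, 8))            # token ids of low-fitness prompts (hypothetical)
with torch.no_grad():
    _, h_pos = model(pos_ids)
    _, h_neg = model(neg_ids)
    offset = h_pos.mean(dim=(0, 1)) - h_neg.mean(dim=(0, 1))   # direction of preferred prompts

def steering_hook(module, inputs, output):
    return output + 0.5 * offset                   # nudge intermediate activations along the offset

hook = model.layers[2].register_forward_hook(steering_hook)    # pick an intermediate layer
logits, _ = model(torch.randint(0, 100, (1, 8)))   # next generation step, now steered
hook.remove()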
Authors: Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu
Abstract: By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language Models (LLMs). However, as token counts grow, the quadratic scaling of computation in LLMs introduces a significant efficiency bottleneck, impeding further scalability. Although recent approaches have explored pruning visual tokens or employing lighter LLM architectures, the computational overhead from an increasing number of visual tokens remains a substantial challenge. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA, a representative MLLM, and introduce a suite of streamlined strategies to enhance efficiency. These include neighbor-aware visual token attention, pruning of inactive visual attention heads, and selective layer dropping for visual computations. By implementing these strategies in LLaVA, we achieve a reduction in computational demands of 88% while maintaining model performance across key benchmarks. Additionally, we validate the existence of visual computational redundancy in other MLLMs, such as Qwen2-VL-7B and InternVL-2.0-4B/8B/26B. These results present a novel pathway for MLLMs to handle dense visual tokens with minimal computational costs. Code and model checkpoints will be released to support further research.
Authors: Weixing Zhang, Zongrui Li, De Ma, Huajin Tang, Xudong Jiang, Qian Zheng, Gang Pan
Abstract: 3D Gaussian Splatting is capable of reconstructing 3D scenes in minutes. Despite recent advances in improving surface reconstruction accuracy, the reconstructed results still exhibit bias and suffer from inefficiency in storage and training. This paper provides a different observation on the cause of the inefficiency and the reconstruction bias, which is attributed to the integration of the low-opacity parts (LOPs) of the generated Gaussians. We show that LOPs consist of Gaussians with overall low-opacity (LOGs) and the low-opacity tails (LOTs) of Gaussians. We propose Spiking GS to reduce such two types of LOPs by integrating spiking neurons into the Gaussian Splatting pipeline. Specifically, we introduce global and local full-precision integrate-and-fire spiking neurons to the opacity and representation function of flattened 3D Gaussians, respectively. Furthermore, we enhance the density control strategy with spiking neurons' thresholds and a new criterion on the scale of Gaussians. Our method can represent more accurate reconstructed surfaces at a lower cost. The supplementary material and code are available at https://github.com/zju-bmi-lab/SpikingGS.
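As a rough illustration of applying an integrate-and-fire-style gate to Gaussian opacities, the sketch below zeroes out opacities below a learnable threshold; the surrogate gradient needed to train that threshold and the local/representation-function variants are omitted, and the exact Spiking GS formulation may differ.

import torch

class OpacitySpikingGate(torch.nn.Module):
    def __init__(self, init_threshold=0.05):
        super().__init__()
        # Learnable firing threshold; training it end to end would require a
        # surrogate gradient, which is omitted in this sketch.
        self.threshold = torch.nn.Parameter(torch.tensor(init_threshold))

    def forward(self, opacity):
        fire = (opacity >= self.threshold).float()   # "spike" only if opacity clears the threshold
        return fire * opacity                        # sub-threshold Gaussians contribute nothing

gate = OpacitySpikingGate()
opacity = torch.sigmoid(torch.randn(6))
print(opacity)
print(gate(opacity))                                 # low-opacity entries are zeroed out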
Authors: Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li
Abstract: Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. To address this, we introduce PAR (Prompt-Aware Token Reduction), a novel and plug-and-play approach that reduces visual tokens efficiently without compromising model performance. Unlike previous methods that rely heavily on attention mechanisms and overlook cross-modal interactions, PAR uses a prompt-aware strategy to adaptively identify and cluster essential visual tokens. PAR categorizes visual context redundancy into two types: external and internal. External redundancy is minimized through semantic retrieval, while internal redundancy is addressed using a token routing mechanism. This method substantially reduces computational load without requiring additional training or complex architectural modifications. Experimental results demonstrate that across various visual question answering tasks, PAR reduces FLOPs by 83% with a compression ratio of 89%, while retaining 97% of baseline accuracy. The adaptive design of PAR achieves a 2x token reduction ratio compared to prior approaches, enabling a better balance between performance and efficiency.
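A hedged sketch of prompt-aware visual token reduction in the spirit of the description above: prompt-relevant tokens are retained (external redundancy) and near-duplicate survivors are merged (internal redundancy). The threshold, keep ratio, and greedy merge are illustrative stand-ins for the paper's retrieval and routing mechanisms.

import torch
import torch.nn.functional as F

def reduce_visual_tokens(vis_tokens, prompt_emb, keep_ratio=0.3, merge_thresh=0.9):
    # vis_tokens: (N, D) visual tokens; prompt_emb: (D,) pooled prompt embedding.
    rel = F.cosine_similarity(vis_tokens, prompt_emb.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * vis_tokens.size(0)))
    kept = vis_tokens[rel.topk(k).indices]                   # keep prompt-relevant tokens

    # Greedily merge near-duplicate kept tokens (a simple stand-in for token routing).
    sim = F.cosine_similarity(kept.unsqueeze(1), kept.unsqueeze(0), dim=-1)
    merged, used = [], torch.zeros(k, dtype=torch.bool)
    for i in range(k):
        if used[i]:
            continue
        group = (sim[i] > merge_thresh) & ~used
        used |= group
        merged.append(kept[group].mean(dim=0))
    return torch.stack(merged)

tokens = torch.randn(576, 1024)                              # e.g. a 24x24 visual feature grid
prompt = torch.randn(1024)
print(reduce_visual_tokens(tokens, prompt).shape)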
Authors: Yin Xie, Kaicheng Yang, Ninghua Yang, Weimo Deng, Xiangzi Dai, Tiancheng Gu, Yumeng Wang, Xiang An, Yongle Zhao, Ziyong Feng, Jiankang Deng
Abstract: Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning language and image instructions, ignoring the critical pretraining phase where models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs to enhance the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. Then, we conceptualize visual tokens as analogous to a "foreign language" for the LLMs and propose a mixed attention mechanism with bidirectional visual attention and unidirectional textual attention to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further facilitate LLMs in understanding visual semantic information. After pretraining on 1.5 million publicly accessible data, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on massive vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pre-trained model weights at https://github.com/deepglint/Croc.
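To illustrate the token-replacement step described above, the sketch below matches a pool of prompt tokens to visual tokens with the Hungarian algorithm (scipy's linear_sum_assignment) and swaps in the best matches. Pool size, cost definition, and replacement count are assumptions, not the Croc recipe.

import torch
from scipy.optimize import linear_sum_assignment

def replace_with_prompt_tokens(vis_tokens, prompt_pool, n_replace=32):
    # vis_tokens: (N, D); prompt_pool: (P, D), a learnable parameter in the real model.
    cost = torch.cdist(prompt_pool, vis_tokens)              # (P, N) pairwise distances
    rows, cols = linear_sum_assignment(cost.cpu().numpy())   # optimal one-to-one matching
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    order = cost[rows, cols].argsort()[:n_replace]           # keep only the cheapest matches
    out = vis_tokens.clone()
    out[cols[order]] = prompt_pool[rows[order]]              # swap in the matched prompt tokens
    return out

vis = torch.randn(256, 768)                                  # visual tokens from the encoder
pool = torch.randn(64, 768)                                  # stand-in for the learnable prompt pool
print(replace_with_prompt_tokens(vis, pool).shape)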
Authors: Arnav Samal, Ranya Batsyas
Abstract: This report outlines Team Seq2Cure's deep learning approach for the Capsule Vision 2024 Challenge, leveraging an ensemble of convolutional neural networks (CNNs) and transformer-based architectures for multi-class abnormality classification in video capsule endoscopy frames. The dataset comprised over 50,000 frames from three public sources and one private dataset, labeled across 10 abnormality classes. To overcome the limitations of traditional CNNs in capturing global context, we integrated CNN and transformer models within a multi-model ensemble. Our approach achieved a balanced accuracy of 86.34 percent and a mean AUC-ROC score of 0.9908 on the validation set, earning our submission 5th place in the challenge. Code is available at http://github.com/arnavs04/capsule-vision-2024 .
Authors: Tassilo Wald, Constantin Ulrich, Stanislav Lukyanenko, Andrei Goncharov, Alberto Paderno, Leander Maerkisch, Paul F. Jäger, Klaus Maier-Hein
Abstract: Self-Supervised Learning (SSL) presents an exciting opportunity to unlock the potential of vast, untapped clinical datasets for various downstream applications that suffer from the scarcity of labeled data. While SSL has revolutionized fields like natural language processing and computer vision, its adoption in 3D medical image computing has been limited by three key pitfalls: small pre-training dataset sizes, architectures inadequate for 3D medical image analysis, and insufficient evaluation practices. In this paper, we address these issues by i) leveraging a large-scale dataset of 39k 3D brain MRI volumes and ii) using a Residual Encoder U-Net architecture within the state-of-the-art nnU-Net framework. iii) A robust development framework, incorporating 5 development and 8 testing brain MRI segmentation datasets, allowed performance-driven design decisions to optimize the simple concept of Masked Auto Encoders (MAEs) for 3D CNNs. The resulting model not only surpasses previous SSL methods but also outperforms the strong nnU-Net baseline by an average of approximately 3 Dice points, setting a new state-of-the-art. Our code and models are made available here.
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
Abstract: Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that rely on aligning visual encoders with large Vision-Language Models (VLMs) to tap into the extensive knowledge of VLMs, require large, computationally expensive models and encounter training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, due to the absence of unseen class labels. To address these challenges, we introduce a novel prompt learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain seen-class labels alone, fine-tuning VLMs on such datasets tends to optimize learnable prompts for seen classes instead of unseen ones. Therefore, we design prompt learning for unseen classes using information from related seen classes, with LLMs utilized to highlight the differences between unseen and related seen classes. Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters compared to existing methods. Code is available at https://github.com/ChelsieLei/EZ-HOI.
Authors: Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, Hao Chen
Abstract: We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. This model facilitates high-quality, open-domain generation by simulating an extensive array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, which comprises over a million diverse gameplay video clips sampled from over 150 games, with informative captions from GPT-4o. GameGen-X undergoes a two-stage training process, consisting of foundation model pre-training and instruction tuning. Firstly, the model was pre-trained via text-to-video generation and video continuation, endowing it with the capability for long-sequence, high-quality open-domain game video generation. Further, to achieve interactive controllability, we designed InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only the InstructNet is updated while the pre-trained foundation model is frozen, enabling the integration of interactive controllability without loss of diversity and quality of generated video content.
Authors: Bo Gao, Fangxu Xing, Daniel Tang
Abstract: Current semantic segmentation models typically require a substantial amount of manually annotated data, a process that is both time-consuming and resource-intensive. Alternatively, leveraging advanced text-to-image models such as Midjourney and Stable Diffusion has emerged as an efficient strategy, enabling the automatic generation of synthetic data in place of manual annotations. However, previous methods have been limited to generating single-instance images, as the generation of multiple instances with Stable Diffusion has proven unstable. To address this limitation and expand the scope and diversity of synthetic datasets, we propose a framework, Free-Mask, that combines a Diffusion Model for segmentation with advanced image editing capabilities, allowing for the integration of multiple objects into images via text-to-image models. Our method facilitates the creation of highly realistic datasets that closely emulate open-world environments while generating accurate segmentation masks. It reduces the labor associated with manual annotation and also ensures precise mask generation. Experimental results demonstrate that synthetic data generated by Free-Mask enables segmentation models to outperform those trained on real data, especially in zero-shot settings. Notably, Free-Mask achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.
Authors: Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min
Abstract: The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, focused initially on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. Nevertheless, related work has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset - the first visual question answering instruction dataset that focuses on video quality assessment. This dataset consists of 3 subsets and covers various video types, containing 157,755 instruction question-answer pairs. Then, leveraging this foundation, we present the VQA2 series models. The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos. We conduct extensive experiments on video quality scoring and understanding tasks, and results demonstrate that the VQA2 series models achieve excellent performance in both tasks. Notably, our final model, the VQA2-Assistant, exceeds the renowned GPT-4o in visual quality understanding tasks while maintaining strong competitiveness in quality scoring tasks. Our work provides a foundation and feasible approach for integrating low-level video quality assessment and understanding with LMMs.
Authors: André Ferreira, Gijs Luijten, Behrus Puladi, Jens Kleesiek, Victor Alves, Jan Egger
Abstract: This paper presents the second-placed solution for task 8 and the participation solution for task 7 of BraTS 2024. The adoption of automated brain analysis algorithms to support clinical practice is increasing. However, many of these algorithms struggle with the presence of brain lesions or the absence of certain MRI modalities. The alterations in the brain's morphology lead to high variability and thus poor performance of predictive models that were trained only on healthy brains. The lack of information usually provided by the missing MRI modalities also reduces the reliability of prediction models trained with all modalities. In order to improve the performance of these models, we propose the use of conditional 3D wavelet diffusion models. The wavelet transform enabled full-resolution image training and prediction on a GPU with 48 GB VRAM, without patching or downsampling, preserving all information for prediction. The code for these tasks is available at https://github.com/ShadowTwin41/BraTS_2023_2024_solutions.
URLs: https://github.com/ShadowTwin41/BraTS_2023_2024_solutions
Authors: André Ferreira, Tiago Jesus, Behrus Puladi, Jens Kleesiek, Victor Alves, Jan Egger
Abstract: This paper presents the winning solution of task 1 and the third-placed solution of task 3 of the BraTS challenge. The use of automated tools in clinical practice has increased due to the development of more and more sophisticated and reliable algorithms. However, achieving clinical standards and developing tools for real-life scenarios is a major challenge. To this end, BraTS has organised tasks to find the most advanced solutions for specific purposes. In this paper, we propose the use of synthetic data to train state-of-the-art frameworks in order to improve the segmentation of adult gliomas in a post-treatment scenario, and the segmentation of meningioma for radiotherapy planning. Our results suggest that the use of synthetic data leads to more robust algorithms, although the synthetic data generation pipeline is not directly suited to the meningioma task. In task 1, we achieved a DSC of 0.7900, 0.8076, 0.7760, 0.8926, 0.7874, 0.8938 and a HD95 of 35.63, 30.35, 44.58, 16.87, 38.19, 17.95 for ET, NETC, RC, SNFH, TC and WT, respectively and, in task 3, we achieved a DSC of 0.801 and HD95 of 38.26, in the testing phase. The code for these tasks is available at https://github.com/ShadowTwin41/BraTS_2023_2024_solutions.
URLs: https://github.com/ShadowTwin41/BraTS_2023_2024_solutions
Authors: Jiyul Ham, Yonggon Jung, Jun-Geol Baek
Abstract: Zero-shot anomaly detection (ZSAD) is crucial for detecting anomalous patterns in target datasets without using training samples, specifically in scenarios where there are distributional differences between the target domain and training data or where data scarcity arises because of restricted access. Although recently pretrained vision-language models demonstrate strong zero-shot performance across various visual tasks, they focus on learning class semantics, which makes their direct application to ZSAD challenging. To address this scenario, we propose GlocalCLIP, which uniquely separates global and local prompts and jointly optimizes them. This approach enables the object-agnostic glocal semantic prompt to effectively capture general normal and anomalous patterns without dependency on specific objects in the image. We refine the text prompts for more precise adjustments by utilizing deep-text prompt tuning in the text encoder. In the vision encoder, we apply V-V attention layers to capture detailed local image features. Finally, we introduce glocal contrastive learning to improve the complementary learning of global and local prompts, effectively detecting anomalous patterns across various domains. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets from both the industrial and medical domains, achieving superior performance compared to existing methods. Code will be made available at https://github.com/YUL-git/GlocalCLIP.
Authors: Yuhan Pan, Yanan Sun, Wei Gong
Abstract: Deep long-tailed recognition has been widely studied to address the issue of imbalanced data distributions in real-world scenarios. However, there has been insufficient focus on the design of neural architectures, despite empirical evidence suggesting that architecture can significantly impact performance. In this paper, we attempt to mitigate long-tailed issues through architectural improvements. To simplify the design process, we utilize Differentiable Architecture Search (DARTS) to achieve this goal. Unfortunately, existing DARTS methods struggle to perform well in long-tailed scenarios. To tackle this challenge, we introduce Long-Tailed Differential Architecture Search (LTDAS). Specifically, we conduct extensive experiments to explore architectural components that demonstrate better performance on long-tailed data and propose a new search space based on our observations. This ensures that the architecture obtained through our search process incorporates superior components. Additionally, we propose replacing the learnable linear classifier with an Equiangular Tight Frame (ETF) classifier to further enhance our method. This classifier effectively alleviates the biased search process and prevents performance collapse. Extensive experimental evaluations demonstrate that our approach consistently improves upon existing methods from an orthogonal perspective and achieves state-of-the-art results with simple enhancements.
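For readers unfamiliar with the fixed classifier mentioned above, the snippet below constructs a standard simplex Equiangular Tight Frame, in which every pair of class vectors has identical cosine similarity of -1/(K-1); its integration into the architecture search is not reproduced here.

import torch
import torch.nn.functional as F

def simplex_etf(num_classes, feat_dim):
    assert feat_dim >= num_classes, "feature dim must be at least the number of classes"
    u, _ = torch.linalg.qr(torch.randn(feat_dim, num_classes))   # orthonormal columns
    k = num_classes
    m = torch.eye(k) - torch.ones(k, k) / k
    return (k / (k - 1)) ** 0.5 * u @ m                          # (feat_dim, K) fixed weights

W = simplex_etf(num_classes=10, feat_dim=64)                     # used in place of a learnable classifier
cos = F.normalize(W, dim=0).T @ F.normalize(W, dim=0)
print(cos[0, 1].item())                                          # off-diagonal entries are -1/(K-1)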
Authors: Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
Abstract: Transformer models have demonstrated remarkable success in many domains such as natural language processing (NLP) and computer vision. With the growing interest in transformer-based architectures, they are now utilized for gesture recognition. Accordingly, we devise a novel ConvMixFormer architecture for dynamic hand gesture recognition. Transformers scale quadratically in the attention computation with sequence length, which makes these models computationally complex and heavy. Considering this drawback, we design a resource-efficient model that replaces the self-attention in the transformer with a simple convolutional layer-based token mixer. The computational cost and parameter count of the convolution-based mixer are considerably lower than those of quadratic self-attention. The convolution mixer also helps the model capture local spatial features that self-attention struggles to capture. Further, an efficient gate mechanism is employed instead of a conventional feed-forward network in the transformer to help the model control the flow of features within different stages of the proposed model. This design uses fewer learnable parameters, nearly half those of the vanilla transformer, enabling fast and efficient training. The proposed method is evaluated on the NVidia Dynamic Hand Gesture and Briareo datasets, and our model achieves state-of-the-art results on single and multimodal inputs. We also show the parameter efficiency of the proposed ConvMixFormer model compared to other methods. The source code is available at https://github.com/mallikagarg/ConvMixFormer.
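A minimal sketch of a convolution-based token mixer with a gated feed-forward stage, matching the high-level description above; kernel size, gating form, and dimensions are illustrative assumptions rather than the exact ConvMixFormer design.

import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    def __init__(self, dim=256, kernel_size=7):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2,
                                     groups=dim)          # depthwise conv over the token sequence
        self.norm2 = nn.LayerNorm(dim)
        self.gate_proj = nn.Linear(dim, dim)
        self.value_proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, D) token sequence
        h = self.norm1(x).transpose(1, 2)                  # (B, D, N) for Conv1d
        x = x + self.token_mixer(h).transpose(1, 2)        # linear-cost token mixing
        h = self.norm2(x)
        x = x + torch.sigmoid(self.gate_proj(h)) * self.value_proj(h)  # gated feed-forward
        return x

block = ConvMixerBlock()
print(block(torch.randn(2, 100, 256)).shape)               # torch.Size([2, 100, 256])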
Authors: Defan Chen, Luchan Zhang
Abstract: Detecting small objects in complex scenes, such as those captured by drones, is a daunting challenge due to the difficulty of capturing the complex features of small targets. While the YOLO family has achieved great success in large target detection, its performance is less than satisfactory when faced with small targets. To address this, this paper proposes SL-YOLO (Stronger and Lighter YOLO), a model that aims to break the bottleneck of small target detection. We propose the Hierarchical Extended Path Aggregation Network (HEPAN), a cross-scale feature fusion method that maintains high detection accuracy even in challenging environments. At the same time, without sacrificing detection capability, we design the C2fDCB lightweight module and add the SCDown downsampling module to greatly reduce the model's parameters and computational complexity. Our experimental results on the VisDrone2019 dataset show a significant improvement in performance, with mAP@0.5 rising from 43.0% to 46.9% and mAP@0.5:0.95 increasing from 26.0% to 28.9%. At the same time, the model parameters are reduced from 11.1M to 9.6M, and the FPS can reach 132, making it an ideal solution for real-time small object detection in resource-constrained environments.
Authors: Ruiyang Zhang, Hu Zhang, Zhedong Zheng
Abstract: Given the higher information load processed by large vision-language models (LVLMs) compared to single-modal LLMs, detecting LVLM hallucinations requires greater human effort and time, and thus raises wider safety concerns. In this paper, we introduce VL-Uncertainty, the first uncertainty-based framework for detecting hallucinations in LVLMs. Different from most existing methods that require ground-truth or pseudo annotations, VL-Uncertainty utilizes uncertainty as an intrinsic metric. We measure uncertainty by analyzing the prediction variance across semantically equivalent but perturbed prompts, including visual and textual data. When LVLMs are highly confident, they provide consistent responses to semantically equivalent queries. However, when uncertain, the responses of the target LVLM become more random. To account for semantically similar answers with different wordings, we cluster LVLM responses based on their semantic content and then calculate the cluster distribution entropy as the uncertainty measure to detect hallucination. Our extensive experiments on 10 LVLMs across four benchmarks, covering both free-form and multi-choice tasks, show that VL-Uncertainty significantly outperforms strong baseline methods in hallucination detection.
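The uncertainty measure described above can be illustrated with a small sketch: sampled responses are grouped into semantic clusters and the entropy of the cluster distribution is the score. The embedding function and similarity-threshold clustering below are placeholders for the real semantic clustering.

import math
import torch
import torch.nn.functional as F

def embed(text):
    # Placeholder sentence embedder (deterministic per string); a real pipeline
    # would use an NLI or sentence-embedding model for semantic grouping.
    torch.manual_seed(abs(hash(text)) % (2**31))
    return F.normalize(torch.randn(64), dim=0)

def semantic_entropy(responses, sim_thresh=0.8):
    clusters = []                                   # each entry: representative embedding + count
    for r in responses:
        e = embed(r)
        for c in clusters:
            if torch.dot(e, c["rep"]) > sim_thresh:
                c["count"] += 1
                break
        else:
            clusters.append({"rep": e, "count": 1})
    probs = [c["count"] / len(responses) for c in clusters]
    return -sum(p * math.log(p) for p in probs)     # higher entropy -> more uncertain response set

answers = ["a red car", "a red car parked outside", "a blue truck", "a dog", "a red car"]
print(semantic_entropy(answers))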
Authors: Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang
Abstract: The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments.
Authors: Ziyang Zong, Zhaohuan Zhan, Guang Tan
Abstract: This paper proposes a two-phase text-to-floorplan generation method, which guides a Large Language Model (LLM) to generate an initial layout (Layout-LLM) and refines it into the final floorplan through a conditional diffusion model. We incorporate a Chain-of-Thought approach to prompt the LLM based on user text specifications, enabling a more user-friendly and intuitive house layout design. This method allows users to describe their needs in natural language, enhancing accessibility and providing clearer geometric constraints. The final floorplans generated by Layout-LLM through conditional diffusion refinement are more accurate and better meet user requirements. Experimental results demonstrate that our approach achieves state-of-the-art performance across all metrics, validating its effectiveness in practical home design applications. We plan to release our code for public use.
Authors: Haodong Chen, Runnan Chen, Qiang Qu, Zhaoqing Wang, Tongliang Liu, Xiaoming Chen, Yuk Ying Chung
Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have substantially improved novel view synthesis, enabling high-quality reconstruction and real-time rendering. However, blurring artifacts, such as floating primitives and over-reconstruction, remain challenging. Current methods address these issues by refining scene structure, enhancing geometric representations, addressing blur in training images, improving rendering consistency, and optimizing density control, yet the role of kernel design remains underexplored. We identify the soft boundaries of Gaussian ellipsoids as one of the causes of these artifacts, limiting detail capture in high-frequency regions. To bridge this gap, we introduce 3D Linear Splatting (3DLS), which replaces Gaussian kernels with linear kernels to achieve sharper and more precise results, particularly in high-frequency regions. Through evaluations on three datasets, 3DLS demonstrates state-of-the-art fidelity and accuracy, along with a 30% FPS improvement over baseline 3DGS. The implementation will be made publicly available upon acceptance.
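As a one-dimensional illustration of the kernel-level change described above, the snippet below contrasts a Gaussian falloff, which never reaches zero, with a linear kernel that is exactly zero beyond a finite radius; the full anisotropic 3D splatting formulation is not shown.

import torch

def gaussian_kernel(d, sigma=1.0):
    return torch.exp(-0.5 * (d / sigma) ** 2)             # soft boundary: never exactly zero

def linear_kernel(d, radius=3.0):
    return torch.clamp(1.0 - d.abs() / radius, min=0.0)   # hard zero beyond `radius`

d = torch.linspace(0, 5, 6)
print(gaussian_kernel(d))   # small but nonzero tails keep contributing to the render
print(linear_kernel(d))     # contributions vanish exactly outside the support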
Authors: Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Abstract: Parameter-efficient fine-tuning multimodal large language models (MLLMs) presents significant challenges, including reliance on high-level visual features that limit fine-grained detail comprehension, and data conflicts that arise from task complexity. To address these issues, we propose an efficient fine-tuning framework with two novel approaches: Vision Cue Enhancement (VCE) and Dual Low-Rank Adaptation (Dual-LoRA). VCE enhances the vision projector by integrating multi-level visual cues, improving the model's ability to capture fine-grained visual features. Dual-LoRA introduces a dual low-rank structure for instruction tuning, decoupling learning into skill and task spaces to enable precise control and efficient adaptation across diverse tasks. Our method simplifies implementation, enhances visual comprehension, and improves adaptability. Experiments on both downstream tasks and general benchmarks demonstrate the effectiveness of our proposed approach.
Authors: Yuhang Song, Mario Gianni, Chenguang Yang, Kunyang Lin, Te-Chuan Chiu, Anh Nguyen, Chun-Yi Lee
Abstract: This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D environments based on natural language instructions. Current approaches use contrastive learning to align language with visual trajectory sequences. Nevertheless, they encounter difficulties with fine-grained vision negatives. To enhance cross-modal embeddings, we introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples. To validate the proposed methodology, we conduct a series of experiments to assess the effectiveness of the enriched embeddings on fine-grained vision negatives. Experiments on two common VLN benchmarks, R2R and REVERIE, demonstrate that these embeddings benefit navigation and can lead to a promising performance enhancement. Our source code and trained models are available at: https://anonymous.4open.science/r/FGVLN.
Authors: Zhuo Li, Mingshuang Luo, Ruibing Hou, Xin Zhao, Hao Liu, Hong Chang, Zimo Liu, Chen Li
Abstract: Human motion generation plays a vital role in applications such as digital humans and humanoid robot control. However, most existing approaches disregard physics constraints, leading to the frequent production of physically implausible motions with pronounced artifacts such as floating and foot sliding. In this paper, we propose Morph, a Motion-free physics optimization framework, comprising a Motion Generator and a Motion Physics Refinement module, for enhancing physical plausibility without relying on costly real-world motion data. Specifically, the Motion Generator is responsible for providing large-scale synthetic motion data, while the Motion Physics Refinement Module utilizes these synthetic data to train a motion imitator within a physics simulator, enforcing physical constraints to project the noisy motions into a physically-plausible space. These physically refined motions, in turn, are used to fine-tune the Motion Generator, further enhancing its capability. Experiments on both text-to-motion and music-to-dance generation tasks demonstrate that our framework achieves state-of-the-art motion generation quality while improving physical plausibility drastically.
Authors: Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang
Abstract: In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.
Authors: Xin He, Haiyun Guo, Kuan Zhu, Bingke Zhu, Xu Zhao, Jianwu Fang, Jinqiao Wang
Abstract: Lane detection plays an important role in autonomous driving perception systems. As deep learning algorithms gain popularity, monocular lane detection methods based on them have demonstrated superior performance and emerged as a key research direction in autonomous driving perception. The core designs of these algorithmic frameworks can be summarized as follows: (1) Task paradigm, focusing on lane instance-level discrimination; (2) Lane modeling, representing lanes as a set of learnable parameters in the neural network; (3) Global context supplementation, enhancing inference on the obscure lanes; (4) Perspective effect elimination, providing accurate 3D lanes for downstream applications. From these perspectives, this paper presents a comprehensive overview of existing methods, encompassing both the increasingly mature 2D lane detection approaches and the developing 3D lane detection works. Besides, this paper compares the performance of mainstream methods on different benchmarks and investigates their inference speed under a unified setting for fair comparison. Moreover, we present some extended works on lane detection, including multi-task perception, video lane detection, online high-definition map construction, and lane topology reasoning, to offer readers a comprehensive roadmap for the evolution of lane detection. Finally, we point out some potential future research directions in this field. We exhaustively collect the papers and codes of existing works at https://github.com/Core9724/Awesome-Lane-Detection and will keep tracing the research.
Authors: S. P. Sharan, Minkyu Choi, Sahil Shah, Harsh Goel, Mohammad Omama, Sandeep Chinchali
Abstract: Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen, and CogVideoX are pushing the boundaries of synthetic video generation, with adoption seen in fields like robotics, autonomous driving, and entertainment. As these models become prevalent, various metrics and benchmarks have emerged to evaluate the quality of the generated videos. However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. Our approach first converts the prompt into a formally defined Temporal Logic (TL) specification and translates the generated video into an automaton representation. Then, it evaluates the text-to-video alignment by formally checking the video automaton against the TL specification. Furthermore, we present a dataset of temporally extended prompts to evaluate state-of-the-art video generation models against our benchmark. We find that NeuS-V demonstrates over 5x higher correlation with human evaluations than existing metrics. Our evaluation further reveals that current video generation models perform poorly on these temporally complex prompts, highlighting the need for future work in improving text-to-video generation capabilities.
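As a toy stand-in for the checking idea described above, the function below tests a simple "eventually A, then eventually B" requirement against per-frame boolean detections; the actual NeuS-V pipeline builds a video automaton and formally checks full Temporal Logic specifications.

def eventually_then(seq_a, seq_b):
    # seq_a / seq_b: per-frame booleans, e.g. from a vision-language detector.
    for t, a in enumerate(seq_a):
        if a and any(seq_b[t + 1:]):
            return True                 # A occurred, and B occurred strictly afterwards
    return False

frames_with_dog = [False, True, False, False, False]
frames_with_ball = [False, False, False, True, False]
print(eventually_then(frames_with_dog, frames_with_ball))   # True: dog appears, then the ball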
Authors: Han Wang, Gang Wang, Huan Zhang
Abstract: Vision Language Models (VLMs) can produce unintended and harmful content when exposed to adversarial attacks, particularly because their vision capabilities create new vulnerabilities. Existing defenses, such as input preprocessing, adversarial training, and response evaluation-based methods, are often impractical for real-world deployment due to their high costs. To address this challenge, we propose ASTRA, an efficient and effective defense by adaptively steering models away from adversarial feature directions to resist VLM attacks. Our key procedures involve finding transferable steering vectors representing the direction of harmful response and applying adaptive activation steering to remove these directions at inference time. To create effective steering vectors, we randomly ablate the visual tokens from the adversarial images and identify those most strongly associated with jailbreaks. These tokens are then used to construct steering vectors. During inference, we perform the adaptive steering method that involves the projection between the steering vectors and calibrated activation, resulting in little performance drops on benign inputs while strongly avoiding harmful outputs under adversarial inputs. Extensive experiments across multiple models and baselines demonstrate our state-of-the-art performance and high efficiency in mitigating jailbreak risks. Additionally, ASTRA exhibits good transferability, defending against both unseen attacks at design time (i.e., structured-based attacks) and adversarial images from diverse distributions.
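A hedged sketch of adaptive activation steering in the spirit of the description above: a steering vector estimated from jailbreak-associated activations is projected out of the model's activation in proportion to their alignment. The construction from ablated visual tokens is only indicated schematically, and the calibration details are omitted.

import torch
import torch.nn.functional as F

def build_steering_vector(harmful_acts, benign_acts):
    # (N, D) activations collected offline, e.g. from adversarial images with
    # jailbreak-associated visual tokens ablated versus benign inputs.
    return F.normalize(harmful_acts.mean(0) - benign_acts.mean(0), dim=0)

def adaptive_steer(activation, steer_vec, strength=1.0):
    # Remove the component of the activation that aligns with the harmful direction.
    coeff = torch.clamp(torch.dot(activation, steer_vec), min=0.0)
    return activation - strength * coeff * steer_vec

harmful = torch.randn(128, 4096) + 2.0
benign = torch.randn(128, 4096)
v = build_steering_vector(harmful, benign)
act = torch.randn(4096) + 1.5                       # an intermediate activation at inference time
print(adaptive_steer(act, v).shape)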
Authors: Yingwen Zhang, Meng Wang, Xihua Sheng, Peilin Chen, Junru Li, Li Zhang, Shiqi Wang
Abstract: Lossy image compression networks aim to minimize the latent entropy of images while adhering to specific distortion constraints. However, optimizing the neural network can be challenging due to its nature of learning quantized latent representations. In this paper, our key finding is that minimizing the latent entropy is, to some extent, equivalent to maximizing the conditional source entropy, an insight that is deeply rooted in information-theoretic equalities. Building on this insight, we propose a novel structural regularization method for the neural image compression task by incorporating the negative conditional source entropy into the training objective, such that both the optimization efficacy and the model's generalization ability can be promoted. The proposed information-theoretic regularizer is interpretable, plug-and-play, and imposes no inference overheads. Extensive experiments demonstrate its superiority in regularizing the models and further squeezing bits from the latent representation across various compression structures and unseen domains.
Authors: Jun Chen, Dannong Xu, Junjie Fei, Chun-Mei Feng, Mohamed Elhoseiny
Abstract: Large multimodal models (LMMs) have achieved impressive progress in vision-language understanding, yet they face limitations in real-world applications requiring complex reasoning over a large number of images. Existing benchmarks for multi-image question-answering are limited in scope: each question is paired with only up to 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world usage. To reduce these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. Additionally, we propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) framework that leverages a suite of multimodal vision encoders, each optimized for specific strengths, and a dedicated question-document relevance module. V-RAG sets a new standard, with a 9% and 11% improvement in Recall@1 on the challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, compared to the previous best baseline models. Additionally, integrating V-RAG with LMMs enables them to efficiently operate across thousands of images, yielding significant improvements on our DocHaystack and InfoHaystack benchmarks. Our code and datasets are available at https://github.com/Vision-CAIR/dochaystacks
Authors: You Li, Fan Ma, Yi Yang
Abstract: Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, the synthetic framework is usually designed with meticulous human effort for each task due to various requirements on image layout, content, and annotation formats, restricting the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality synthetic images that are controllable and based on the generated layouts. In addition, user-specific reference images and style images can be incorporated into the generation according to task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.
Authors: You Li, Fan Ma, Yi Yang
Abstract: The Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match the query image and the relative captions. Current methods focus on projecting the query image into the text feature space, subsequently combining them with features of query texts for retrieval. However, retrieving images only with the text features cannot guarantee detailed alignment due to the natural gap between images and text. In this paper, we introduce Imagined Proxy for CIR (IP-CIR), a training-free method that creates a proxy image aligned with the query image and text description, enhancing query representation in the retrieval process. We first leverage the large language model's generalization capability to generate an image layout, and then apply both the query text and image for conditional generation. The robust query features are enhanced by merging the proxy image, query image, and text semantic perturbation. Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image while incorporating image-side information into the process. Experiments on three public datasets demonstrate that our method significantly improves retrieval performance. We achieve state-of-the-art (SOTA) results on the CIRR dataset with a Recall@K of 70.07 at K=10. Additionally, we achieve an improvement in Recall@10 on the FashionIQ dataset, rising from 45.11 to 45.74, and improve the baseline performance on CIRCO, with mAP@10 increasing from 32.24 to 34.26.
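A minimal sketch of the balancing retrieval score described above, fusing text-based and proxy-image similarities in a CLIP-style embedding space; the weighting scheme and encoders are placeholder assumptions rather than the paper's exact metric.

import torch
import torch.nn.functional as F

def balanced_retrieval(text_query_emb, proxy_img_emb, gallery_embs, alpha=0.5):
    # All embeddings are assumed to live in a CLIP-style joint space.
    t = F.normalize(text_query_emb, dim=-1)
    p = F.normalize(proxy_img_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)             # (N, D) candidate image embeddings
    score = alpha * (g @ t) + (1 - alpha) * (g @ p)   # fuse text-based and proxy similarities
    return score.argsort(descending=True)             # ranked gallery indices

gallery = torch.randn(1000, 512)
ranking = balanced_retrieval(torch.randn(512), torch.randn(512), gallery)
print(ranking[:10])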
Authors: Zhenchen Wan, Yanwu Xu, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Abstract: Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, thereby limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and preserving fine-grained details, such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. Directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage the models' advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism to generate prompts by optimizing a Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for the VTO task. Project page: https://zhenchenwan.github.io/TED-VITON/
Authors: Libo Zhu, Jianze Li, Haotong Qin, Wenbo Li, Yulun Zhang, Yong Guo, Xiaokang Yang
Abstract: Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. However, even though the denoising step has been reduced to one, they require high computational costs and storage requirements, making deployment on hardware devices difficult. To address these issues, we propose PassionSR, a novel post-training quantization approach with adaptive scale for one-step diffusion (OSD) image SR. First, we simplify the OSD model to two core components, a UNet and a Variational Autoencoder (VAE), by removing the CLIPEncoder. Second, we propose a Learnable Boundary Quantizer (LBQ) and a Learnable Equivalent Transformation (LET) to optimize the quantization process and manipulate activation distributions for better quantization. Finally, we design a Distributed Quantization Calibration (DQC) strategy that stabilizes the training of quantized parameters for rapid convergence. Comprehensive experiments demonstrate that PassionSR with 8-bit and 6-bit quantization obtains visual results comparable to the full-precision model. Moreover, our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR. Our code will be available at https://github.com/libozhu03/PassionSR.
Authors: Fabian Isensee, Yannick Kirchhoff, Lars Kraemer, Maximilian Rokuss, Constantin Ulrich, Klaus H. Maier-Hein
Abstract: This paper presents our approach to scaling the nnU-Net framework for multi-structure segmentation on Cone Beam Computed Tomography (CBCT) images, specifically in the scope of the ToothFairy2 Challenge. We leveraged the nnU-Net ResEnc L model, introducing key modifications to patch size, network topology, and data augmentation strategies to address the unique challenges of dental CBCT imaging. Our method achieved a mean Dice coefficient of 0.9253 and HD95 of 18.472 on the test set, securing a mean rank of 4.6 and with it the first place in the ToothFairy2 challenge. The source code is publicly available, encouraging further research and development in the field.
Authors: Shahriar Soudeep, M. F. Mridha, Md Abrar Jahin, Nilanjan Dey
Abstract: Accurate detection and tracking of small objects such as pedestrians, cyclists, and motorbikes are critical for traffic surveillance systems, which are crucial in improving road safety and decision-making in intelligent transportation systems. However, traditional methods struggle with challenges such as occlusion, low resolution, and dynamic traffic conditions, necessitating innovative approaches to address these limitations. This paper introduces DGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN) with YOLO11 to enhance small object detection and tracking in traffic surveillance systems. The framework leverages YOLO11's advanced spatial feature extraction capabilities for precise object detection and incorporates DGNN to model spatial-temporal relationships for robust real-time tracking dynamically. By constructing and updating graph structures, DGNN-YOLO effectively represents objects as nodes and their interactions as edges, ensuring adaptive and accurate tracking in complex and dynamic environments. Extensive experiments demonstrate that DGNN-YOLO consistently outperforms state-of-the-art methods in detecting and tracking small objects under diverse traffic conditions, achieving the highest precision (0.8382), recall (0.6875), and mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability, particularly in challenging scenarios involving small and occluded objects. This work provides a scalable, real-time traffic surveillance and analysis solution, significantly contributing to intelligent transportation systems.
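To illustrate the detection-to-graph step implied above, the sketch below turns per-frame boxes into nodes and connects detections whose centers are close; the distance threshold and node features are assumptions, and neither YOLO11 nor the actual DGNN is reproduced.

import torch

def detections_to_graph(boxes, dist_thresh=50.0):
    # boxes: (N, 4) detector outputs as [x1, y1, x2, y2].
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)      # (N, 2)
    dists = torch.cdist(centers, centers)                                # (N, N)
    src, dst = torch.nonzero(dists < dist_thresh, as_tuple=True)
    mask = src != dst                                                    # drop self-loops
    edge_index = torch.stack([src[mask], dst[mask]])                     # (2, E) edges for the GNN
    node_feats = torch.cat([boxes, centers], dim=1)                      # simple per-node features
    return node_feats, edge_index

boxes = torch.tensor([[10., 10., 40., 60.], [20., 15., 55., 70.], [300., 300., 340., 360.]])
feats, edges = detections_to_graph(boxes)
print(feats.shape, edges)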
Authors: Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang
Abstract: This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adaptation to both image and video scenarios, as well as the complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring powerful reasoning abilities and world knowledge. Besides, to fully leverage the recognition capabilities of VLLMs and the fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.
Authors: Wanquan Feng, Tianhao Qi, Jiawei Liu, Mingzhen Sun, Pengqi Tu, Tianxiang Ma, Fei Dai, Songtao Zhao, Siyu Zhou, Qian He
Abstract: Video synthesis techniques are undergoing rapid progress, with controllability being a significant aspect of practical usability for end-users. Although text condition is an effective way to guide video synthesis, capturing the correct joint distribution between text descriptions and video motion remains a substantial challenge. In this paper, we present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis. Our approach partitions the video into individual motion units and represents each unit with disentangled control signals, which allows for various control types to be flexibly combined within our single system. Furthermore, our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. The project page is: https://wanquanf.github.io/I2VControl .
Authors: Xiangyu Liu, Xiaomei Zhang, Zhiyuan Ma, Xiangyu Zhu, Zhen Lei
Abstract: Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.
Authors: Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng
Abstract: The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. Existing methods typically conduct token reduction in the feature alignment phase. In this paper, we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens to represent image semantic segments after the linear projection layer before feeding them into the vision encoder. Besides, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens utilizing the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG, maintaining 98.1% of the original performance while achieving a reduction of over 27% in inference time.
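The general idea of merging redundant visual tokens can be sketched with a simple cosine-similarity grouping that averages near-duplicate tokens. The greedy grouping rule and the threshold are illustrative assumptions, not the VisToG mechanism.

```python
# Sketch: greedy grouping of similar visual tokens by cosine similarity (assumed, generic).
import torch
import torch.nn.functional as F

def group_tokens(tokens: torch.Tensor, sim_thresh: float = 0.9) -> torch.Tensor:
    """tokens: (N, D) -> (M, D) with M <= N, merging near-duplicate tokens."""
    normed = F.normalize(tokens, dim=-1)
    reps, groups = [], []
    for i in range(tokens.size(0)):
        placed = False
        for g_idx, rep in enumerate(reps):
            if torch.dot(normed[i], rep) > sim_thresh:
                groups[g_idx].append(i)   # join an existing group
                placed = True
                break
        if not placed:
            reps.append(normed[i])        # start a new group
            groups.append([i])
    return torch.stack([tokens[g].mean(dim=0) for g in groups])

tokens = torch.randn(196, 768)
print(group_tokens(tokens).shape)  # at most (196, 768), fewer rows after merging
```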
Authors: Jiahan Li, Chaoran Cheng, Jianzhu Ma, Ge Liu
Abstract: Shape assembly, which aims to reassemble separate parts into a complete object, has gained significant interest in recent years. Existing methods primarily rely on networks to predict the poses of individual parts, but often fail to effectively capture the geometric interactions between the parts and their poses. In this paper, we present the Geometric Point Attention Transformer (GPAT), a network specifically designed to address the challenges of reasoning about geometric relationships. In the geometric point attention module, we integrate both global shape information and local pairwise geometric features, along with poses represented as rotation and translation vectors for each part. To enable iterative updates and dynamic reasoning, we introduce a geometric recycling scheme, where each prediction is fed into the next iteration for refinement. We evaluate our model on both the semantic and geometric assembly tasks, showing that it outperforms previous methods in absolute pose estimation, achieving accurate pose predictions and high alignment accuracy.
Authors: Hugo Resende, Isabela Borlido, Victor Sundermann, Eduardo B. Neto, Silvio Jamil F. Guimar\~aes, Fabio Faria, Alvaro Luiz Fazenda
Abstract: Tropical forests play an essential role in the planet's ecosystem, making the conservation of these biomes a worldwide priority. However, ongoing deforestation and degradation pose a significant threat to their existence, necessitating effective monitoring and the proposal of actions to mitigate the damage caused by these processes. In this regard, initiatives range from government and private sector monitoring programs to solutions based on citizen science campaigns, for example. Particularly in the context of citizen science campaigns, the segmentation of remote sensing images to identify deforested areas and subsequently submit them to analysis by non-specialized volunteers is necessary. Thus, segmentation using superpixel-based techniques proves to be a viable solution for this important task. Therefore, this paper presents an analysis of 22 superpixel-based segmentation methods applied to remote sensing images, aiming to identify which of them are more suitable for generating segments for citizen science campaigns. The results reveal that seven of the segmentation methods outperformed the baseline method (SLIC) currently employed in the ForestEyes citizen science project, indicating an opportunity for improvement in this important stage of campaign development.
Authors: Hoon-Gyu Chung, Seokjun Choi, Seung-Hwan Baek
Abstract: Inverse rendering seeks to reconstruct both geometry and spatially varying BRDFs (SVBRDFs) from captured images. To address the inherent ill-posedness of inverse rendering, basis BRDF representations are commonly used, modeling SVBRDFs as spatially varying blends of a set of basis BRDFs. However, existing methods often yield basis BRDFs that lack intuitive separation and have limited scalability to scenes of varying complexity. In this paper, we introduce a differentiable inverse rendering method that produces interpretable basis BRDFs. Our approach models a scene using 2D Gaussians, where the reflectance of each Gaussian is defined by a weighted blend of basis BRDFs. We efficiently render an image from the 2D Gaussians and basis BRDFs using differentiable rasterization and impose a rendering loss with the input images. During this analysis-by-synthesis optimization process of differentiable inverse rendering, we dynamically adjust the number of basis BRDFs to fit the target scene while encouraging sparsity in the basis weights. This ensures that the reflectance of each Gaussian is represented by only a few basis BRDFs. This approach enables the reconstruction of accurate geometry and interpretable basis BRDFs that are spatially separated. Consequently, the resulting scene representation, comprising basis BRDFs and 2D Gaussians, supports physically-based novel-view relighting and intuitive scene editing.
Authors: Jinnyeong Kim, Seung-Hwan Baek
Abstract: Integrating RGB and NIR stereo imaging provides complementary spectral information, potentially enhancing robotic 3D vision in challenging lighting conditions. However, existing datasets and imaging systems lack pixel-level alignment between RGB and NIR images, posing challenges for downstream vision tasks. In this paper, we introduce a robotic vision system equipped with pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor mounted on a mobile robot. The system simultaneously captures pixel-aligned pairs of RGB stereo images, NIR stereo images, and temporally synchronized LiDAR points. Utilizing the mobility of the robot, we present a dataset containing continuous video frames under diverse lighting conditions. We then introduce two methods that utilize the pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The first approach enables existing RGB-pretrained vision models to directly utilize RGB-NIR information without fine-tuning. The second approach fine-tunes existing vision models to more effectively utilize RGB-NIR information. Experimental results demonstrate the effectiveness of using pixel-aligned RGB-NIR images across diverse lighting conditions.
Authors: Di Zhang, Junxian Li, Jingdi Lei, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou
Abstract: Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process and critic process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process is theoretically grounded in a reinforcement learning framework where the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner's capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner and constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach provides a promising solution to enhance the reliability of VLMs, improving their performance in real-world reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.
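For reference, the standard DPO objective mentioned in the abstract can be written in a few lines. The sketch below uses placeholder log-likelihood tensors rather than outputs of an actual VLM; only the loss formula itself is standard.

```python
# Sketch: the standard Direct Preference Optimization (DPO) loss on ranked critique pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss from per-sequence log-likelihoods of the policy and a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example with fake log-probabilities for a batch of 4 critique pairs.
lp_c = torch.tensor([-5.0, -4.2, -6.1, -3.9])
lp_r = torch.tensor([-5.5, -4.9, -6.0, -4.8])
ref_c = torch.tensor([-5.2, -4.5, -6.3, -4.1])
ref_r = torch.tensor([-5.3, -4.7, -6.2, -4.6])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```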
Authors: Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham
Abstract: Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed number of objects predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD heavily relies on accurate prompts provided by an "oracle", which limits their use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar semantics to known classes, and ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings, by identifying and incrementally learning novel objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of Pseudo Unknown Embedding which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance in common open world object detection and autonomous driving benchmarks.
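One very simple way to picture a "pseudo unknown" embedding derived from known classes is the normalized mean of the known-class embeddings plus a learnable offset. This is an illustrative stand-in under that assumption, not the OWEL formulation.

```python
# Sketch: a placeholder pseudo-unknown embedding built from known-class embeddings (assumed construction).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoUnknownEmbedding(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.offset = nn.Parameter(torch.zeros(dim))  # learned during open-world training

    def forward(self, known_embeds: torch.Tensor) -> torch.Tensor:
        """known_embeds: (K, D) class embeddings -> (D,) pseudo-unknown embedding."""
        center = F.normalize(known_embeds.mean(dim=0), dim=-1)
        return F.normalize(center + self.offset, dim=-1)

known = F.normalize(torch.randn(10, 512), dim=-1)
print(PseudoUnknownEmbedding(512)(known).shape)  # torch.Size([512])
```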
Authors: Haopeng Fang, Di Qiu, Binjie Mao, Pengfei Yan, He Tang
Abstract: Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in identity preservation and high-quality video generation.
Authors: Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang
Abstract: Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLM) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g., the state-of-the-art model Qwen2-VL only achieves a 43.9 recall rate on the COCO dataset, limiting many tasks requiring the combination of perception and understanding. In this work, we aim to bridge this perception gap from both model designing and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of both perception and understanding in MLLM. Code is available at \url{https://github.com/IDEA-Research/ChatRex}.
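The retrieval-style formulation, where the language model only emits indices into a proposal list, can be pictured with a toy decoder. The proposal boxes, the index-token format, and the emitted string below are hard-coded placeholders for illustration.

```python
# Toy sketch: detection as retrieval, mapping emitted index tokens back to proposal boxes.
import re

proposals = {
    0: (12, 30, 88, 140),    # box 0 from a hypothetical proposal network
    1: (200, 44, 260, 120),  # box 1
    2: (15, 160, 90, 240),   # box 2
}

llm_output = "<obj1><obj2>"  # hypothetical index tokens emitted by the language model

def decode_indices(text: str) -> list:
    """Map index tokens like <objK> back to the corresponding proposal boxes."""
    return [proposals[int(k)] for k in re.findall(r"<obj(\d+)>", text)]

print(decode_indices(llm_output))  # [(200, 44, 260, 120), (15, 160, 90, 240)]
```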
Authors: Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang
Abstract: Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. OpenING is open-sourced at https://opening-benchmark.github.io.
Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
Abstract: Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust train and test pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that they implicitly perform camera pose estimation under the hood, and only a sub-portion of their layers contain the camera information. This led us to limit the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a 4x reduction in training parameters, improved training speed, and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate the difference between camera and scene motion, and improves the dynamics of generated pose-conditioned videos. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.
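Restricting conditioning to a subset of layers can be sketched with a toy transformer where the camera-pose embedding is injected only into the first few blocks. The block count, the additive conditioning, and the 3x4 pose parameterization are illustrative assumptions, not the AC3D architecture.

```python
# Sketch: inject camera-pose conditioning only into the first few transformer blocks (assumed scheme).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond=None):
        if cond is not None:
            x = x + cond  # camera conditioning only when this block is selected
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

class CameraConditionedTransformer(nn.Module):
    def __init__(self, dim=256, depth=12, cond_layers=4):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.cond_layers = cond_layers           # only the first few blocks see the pose
        self.pose_proj = nn.Linear(12, dim)      # flattened 3x4 camera matrix (assumed)

    def forward(self, tokens, pose):
        cond = self.pose_proj(pose).unsqueeze(1)
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens, cond if i < self.cond_layers else None)
        return tokens

out = CameraConditionedTransformer()(torch.randn(2, 64, 256), torch.randn(2, 12))
print(out.shape)  # torch.Size([2, 64, 256])
```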
Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Abstract: Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-α, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-α.
Authors: Zhiting Wang, Qiangong Zhou, Zongyang Liu
Abstract: Segment Anything Model 2 (SAM2) demonstrates exceptional performance in video segmentation and refinement of segmentation results. We anticipate that it can further evolve to achieve higher levels of automation for practical applications. Building upon SAM2, we conducted a series of practices that ultimately led to the development of a fully automated pipeline, termed Det-SAM2, in which object prompts are automatically generated by a detection model to facilitate inference and refinement by SAM2. This pipeline enables inference on infinitely long video streams with constant VRAM and RAM usage, all while preserving the same efficiency and accuracy as the original SAM2. This technical report focuses on the construction of the overall Det-SAM2 framework and the subsequent engineering optimization applied to SAM2. We present a case demonstrating an application built on the Det-SAM2 framework: AI refereeing in a billiards scenario, derived from our business context. The project is available at \url{https://github.com/motern88/Det-SAM2}.
Authors: Jiepeng Wang, Yuan Liu, Peng Wang, Cheng Lin, Junhui Hou, Xin Li, Taku Komura, Wenping Wang
Abstract: 3D Gaussian Splatting has achieved impressive performance in novel view synthesis with real-time rendering capabilities. However, reconstructing high-quality surfaces with fine details using 3D Gaussians remains a challenging task. In this work, we introduce GausSurf, a novel approach to high-quality surface reconstruction by employing geometry guidance from multi-view consistency in texture-rich areas and normal priors in texture-less areas of a scene. We observe that a scene can be mainly divided into two primary regions: 1) texture-rich and 2) texture-less areas. To enforce multi-view consistency at texture-rich areas, we enhance the reconstruction quality by incorporating a traditional patch-match based Multi-View Stereo (MVS) approach to guide the geometry optimization in an iterative scheme. This scheme allows for mutual reinforcement between the optimization of Gaussians and patch-match refinement, which significantly improves the reconstruction results and accelerates the training process. Meanwhile, for the texture-less areas, we leverage normal priors from a pre-trained normal estimation model to guide optimization. Extensive experiments on the DTU and Tanks and Temples datasets demonstrate that our method surpasses state-of-the-art methods in terms of reconstruction quality and computation time.
Authors: Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu
Abstract: Human motion, inherently continuous and dynamic, presents significant challenges for generative models. Despite their dominance, discrete quantization methods, such as VQ-VAEs, suffer from inherent limitations, including restricted expressiveness and frame-wise noise artifacts. Continuous approaches, while producing smoother and more natural motions, often falter due to high-dimensional complexity and limited training data. To resolve this "discord" between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that decodes discrete motion tokens into continuous motion through rectified flow. By employing an iterative refinement process in the continuous space, DisCoRD captures fine-grained dynamics and ensures smoother and more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals. Extensive evaluations demonstrate that DisCoRD achieves state-of-the-art performance, with FID of 0.032 on HumanML3D and 0.169 on KIT-ML. These results solidify DisCoRD as a robust solution for bridging the divide between discrete efficiency and continuous realism. Our project page is available at: https://whwjdqls.github.io/discord.github.io/.
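Rectified-flow decoding of discrete tokens into continuous signals can be pictured as Euler integration of a learned velocity field starting from noise. The tiny MLP velocity network and the conditioning scheme below are placeholders, not the DisCoRD decoder.

```python
# Sketch: Euler integration of a rectified-flow velocity field conditioned on token embeddings (assumed setup).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, motion_dim=64, token_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + token_dim + 1, 128), nn.SiLU(), nn.Linear(128, motion_dim)
        )

    def forward(self, x, cond, t):
        t_feat = t.expand(x.size(0), 1)  # broadcast scalar time to the batch
        return self.net(torch.cat([x, cond, t_feat], dim=-1))

@torch.no_grad()
def decode(v_net, token_embed, steps=16, motion_dim=64):
    """Integrate dx/dt = v(x, cond, t) from t=0 (noise) to t=1 (continuous motion)."""
    x = torch.randn(token_embed.size(0), motion_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)
        x = x + dt * v_net(x, token_embed, t)
    return x

v_net = VelocityNet()
print(decode(v_net, torch.randn(8, 32)).shape)  # torch.Size([8, 64])
```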
Authors: Zixuan Chen, Guangcong Wang, Jiahao Zhu, Jianhuang Lai, Xiaohua Xie
Abstract: 3D Gaussian Splatting (3DGS) has recently created impressive assets for various applications. However, the copyright of these assets is not well protected as existing watermarking methods are not suited for 3DGS considering security, capacity, and invisibility. Besides, these methods often require hours or even days for optimization, limiting the application scenarios. In this paper, we propose GuardSplat, an innovative and efficient framework that effectively protects the copyright of 3DGS assets. Specifically, 1) We first propose a CLIP-guided Message Decoupling Optimization module for training the message decoder, leveraging CLIP's aligning capability and rich representations to achieve a high extraction accuracy with minimal optimization costs, presenting exceptional capability and efficiency. 2) Then, we propose a Spherical-harmonic-aware (SH-aware) Message Embedding module tailored for 3DGS, which employs a set of SH offsets to seamlessly embed the message into the SH features of each 3D Gaussian while maintaining the original 3D structure. It enables the 3DGS assets to be watermarked with minimal fidelity trade-offs and prevents malicious users from removing the messages from the model files, meeting the demands for invisibility and security. 3) We further propose an Anti-distortion Message Extraction module to improve robustness against various visual distortions. Extensive experiments demonstrate that GuardSplat outperforms the state-of-the-art methods and achieves fast optimization speed.
Authors: Shukang Yin, Chaoyou Fu, Sirui Zhao, Yunhang Shen, Chunjiang Ge, Yan Yang, Zuwei Long, Yuhan Dai, Tong Xu, Xing Sun, Ran He, Caifeng Shan, Enhong Chen
Abstract: The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending the success to the video understanding realms. Apart from training from scratch, an efficient way is to utilize the pre-trained image-LLMs, leading to two mainstream approaches, i.e. zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches harvests an effective data augmentation method. We first make a deeper inspection of the zero-shot inference way and identify two limitations, i.e. limited generalization and lack of temporal understanding capabilities. Thus, we further investigate the fine-tuning approach and find a low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets by training with just 15% of the sample size. Meanwhile, we find that the proposed scheme can boost the performance of long video understanding without training with long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and curation of high-quality data. The code is released at https://github.com/xjtupanda/T2Vid.
Authors: Junfu Chen, Sirui Li, Dechang Pi
Abstract: Working memory (WM), denoting the information temporally stored in the mind, is a fundamental research topic in the field of human cognition. Electroencephalograph (EEG), which can monitor the electrical activity of the brain, has been widely used in measuring the level of WM. However, one of the critical challenges is that individual differences may cause ineffective results, especially when the established model meets an unfamiliar subject. In this work, we propose a cross-subject deep adaptation model with spatial attention (CS-DASA) to generalize the workload classifications across subjects. First, we transform EEG time series into multi-frame EEG images incorporating spatial, spectral, and temporal information. Next, the Subject-Shared module in CS-DASA receives multi-frame EEG image data from both source and target subjects and learns the common feature representations. Then, in the subject-specific module, the maximum mean discrepancy is implemented to measure the domain distribution divergence in a reproducing kernel Hilbert space, which can add an effective penalty loss for domain adaptation. Additionally, the subject-to-subject spatial attention mechanism is employed to focus on the discriminative spatial features from the target image data. Experiments conducted on a public WM EEG dataset containing 13 subjects show that the proposed model is capable of achieving better performance than existing state-of-the-art methods.
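The maximum mean discrepancy penalty mentioned here is a standard quantity and can be sketched directly with an RBF kernel. The fixed bandwidth is a simplifying assumption; practical implementations often use a multi-kernel or median-heuristic bandwidth.

```python
# Sketch: MMD^2 estimate between source and target feature batches with an RBF kernel.
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float) -> torch.Tensor:
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between source (N, D) and target (M, D) features."""
    k_ss = rbf_kernel(source, source, sigma).mean()
    k_tt = rbf_kernel(target, target, sigma).mean()
    k_st = rbf_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2 * k_st

src, tgt = torch.randn(32, 128), torch.randn(32, 128) + 0.5
print(mmd_loss(src, tgt))  # larger when the two feature distributions differ
```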
Authors: Maximilian E. Tschuchnig, Michael Gadermayr
Abstract: The increasing digitization of medical imaging enables machine learning based improvements in detecting, visualizing and segmenting lesions, easing the workload for medical experts. However, supervised machine learning requires reliable labelled data, which is often difficult or impossible to collect, or at least time-consuming and thereby costly. Therefore, methods requiring only partly labeled data (semi-supervised) or no labeling at all (unsupervised methods) have been applied more regularly. Anomaly detection is one possible methodology that is able to leverage semi-supervised and unsupervised methods to handle medical imaging tasks like classification and segmentation. This paper uses a semi-exhaustive literature review of relevant anomaly detection papers in medical imaging to cluster them into applications, highlight important results, establish lessons learned and give further advice on how to approach anomaly detection in medical imaging. The qualitative analysis is based on Google Scholar and 4 different search terms, resulting in 120 different analysed papers. The main results showed that the current research is mostly motivated by reducing the need for labelled data. Also, the successful and substantial amount of research in the brain MRI domain shows the potential for applications in further domains like OCT and chest X-ray.
Authors: Maximilian E. Tschuchnig, Philipp Grubm\"uller, Lea M. Stangassinger, Christina Kreutzer, S\'ebastien Couillard-Despr\'es, Gertie J. Oostingh, Anton Hittmair, Michael Gadermayr
Abstract: Thyroid cancer is currently the fifth most common malignancy diagnosed in women. Since differentiation of cancer sub-types is important for treatment, and current manual methods are time-consuming and subjective, automatic computer-aided differentiation of cancer types is crucial. Manual differentiation of thyroid cancer is based on tissue sections, analysed by pathologists using histological features. Due to the enormous size of gigapixel whole slide images, holistic classification using deep learning methods is not feasible. Patch-based multiple instance learning approaches, combined with aggregations such as bag-of-words, are a common approach. This work's contribution is to extend a patch-based state-of-the-art method by generating and combining feature vectors of three different patch resolutions and analysing three distinct ways of combining them. The results showed improvements in one of the three multi-scale approaches, while the others led to decreased scores. This provides motivation for analysis and discussion of the individual approaches.
Authors: Zheqi Lv, Wenqiao Zhang, Shengyu Zhang, Kun Kuang, Feng Wang, Yongwei Wang, Zhengyu Chen, Tao Shen, Hongxia Yang, Beng Chin Ooi, Fei Wu
Abstract: Device Model Generalization (DMG) is a practical yet under-investigated research topic for on-device machine learning applications. It aims to improve the generalization ability of pre-trained models when deployed on resource-constrained devices, such as improving the performance of pre-trained cloud models on smart mobiles. While quite a lot of works have investigated the data distribution shift across clouds and devices, most of them focus on model fine-tuning on personalized data for individual devices to facilitate DMG. Despite their promise, these approaches require on-device re-training, which is practically infeasible due to the overfitting problem and high time delay when performing gradient calculation on real-time data. In this paper, we argue that the computational cost brought by fine-tuning can be rather unnecessary. We consequently present a novel perspective on improving DMG without increasing computational cost, i.e., device-specific parameter generation which directly maps data distribution to parameters. Specifically, we propose an efficient Device-cloUd collaborative parametErs generaTion framework DUET. DUET is deployed on a powerful cloud server that only requires the low cost of forwarding propagation and low time delay of data transmission between the device and the cloud. By doing so, DUET can rehearse the device-specific model weight realizations conditioned on the personalized real-time data for an individual device. Importantly, our DUET elegantly connects the cloud and device as a 'duet' collaboration, frees the DMG from fine-tuning, and enables a faster and more accurate DMG paradigm. We conduct an extensive experimental study of DUET on three public datasets, and the experimental results confirm our framework's effectiveness and generalisability for different DMG tasks.
Authors: Po-han Li, Sravan Kumar Ankireddy, Ruihan Zhao, Hossein Nourkhiz Mahjoub, Ehsan Moradi-Pari, Ufuk Topcu, Sandeep Chinchali, Hyeji Kim
Abstract: Efficient compression of correlated data is essential to minimize communication overload in multi-sensor networks. In such networks, each sensor independently compresses the data and transmits them to a central node due to limited communication bandwidth. A decoder at the central node decompresses and passes the data to a pre-trained machine learning-based task to generate the final output. Thus, it is important to compress the features that are relevant to the task. Additionally, the final performance depends heavily on the total available bandwidth. In practice, it is common to encounter varying availability in bandwidth, and higher bandwidth results in better performance of the task. We design a novel distributed compression framework composed of independent encoders and a joint decoder, which we call neural distributed principal component analysis (NDPCA). NDPCA flexibly compresses data from multiple sources to any available bandwidth with a single model, reducing computing and storage overhead. NDPCA achieves this by learning low-rank task representations and efficiently distributing bandwidth among sensors, thus providing a graceful trade-off between performance and bandwidth. Experiments show that NDPCA improves the success rate of multi-view robotic arm manipulation by 9% and the accuracy of object detection tasks on satellite imagery by 14% compared to an autoencoder with uniform bandwidth allocation.
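A crude way to picture rank-flexible distributed compression is to project each sensor's features onto a few principal components and reconstruct them at the central node. The equal per-sensor rank split below is an assumption for illustration, not NDPCA's learned bandwidth allocation or its neural encoders.

```python
# Sketch: per-sensor PCA compression plus central reconstruction under a fixed rank budget (assumed split).
import numpy as np

def fit_pca(x: np.ndarray, rank: int):
    mean = x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:rank]                       # (1, D) mean, (rank, D) components

def encode(x, mean, comps):
    return (x - mean) @ comps.T                   # (N, rank) codes sent over the link

def decode(codes, mean, comps):
    return codes @ comps + mean                   # approximate reconstruction at the central node

rng = np.random.default_rng(0)
sensors = [rng.normal(size=(256, 32)) for _ in range(2)]
total_rank = 8
per_sensor = total_rank // len(sensors)           # naive equal bandwidth split
recon = []
for x in sensors:
    mean, comps = fit_pca(x, per_sensor)
    recon.append(decode(encode(x, mean, comps), mean, comps))
fused = np.concatenate(recon, axis=1)             # would be passed to the downstream task model
print(fused.shape)                                # (256, 64)
```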
Authors: Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
Abstract: We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.
Authors: Harini SI, Somesh Singh, Yaman K Singla, Aanisha Bhattacharyya, Veeky Baths, Changyou Chen, Rajiv Ratn Shah, Balaji Krishnamurthy
Abstract: Despite the importance of long-term memory in marketing and brand building, until now, there has been no large-scale study on the memorability of ads. All previous memorability studies have been conducted on short-term recall on specific content types like action videos. On the other hand, long-term memorability is crucial for the advertising industry, and ads are almost always highly multimodal. Therefore, we release the first memorability dataset, LAMBDA, consisting of 1749 participants and 2205 ads covering 276 brands. Running statistical tests over different participant subpopulations and ad types, we find many interesting insights into what makes an ad memorable, e.g., fast-moving ads are more memorable than those with slower scenes; people who use ad-blockers remember fewer ads than those who don't. Next, we present a model, Henry, to predict the memorability of content. Henry achieves state-of-the-art performance across all prominent literature memorability datasets. It shows strong generalization, with better zero-shot results on unseen datasets. Finally, with the intent of memorable ad generation, we present a scalable method to build a high-quality memorable ad generation model by leveraging automatically annotated data. Our approach, SEED (Self rEwarding mEmorability Modeling), starts with a language model trained on LAMBDA as seed data and progressively trains an LLM to generate more memorable ads. We show that the generated advertisements have 44% higher memorability scores than the original ads. We release this large-scale ad dataset, UltraLAMBDA, consisting of 5 million ads. Our code and the datasets, LAMBDA and UltraLAMBDA, are open-sourced at https://behavior-in-the-wild.github.io/memorability.
Authors: Xiaowei Yu, Zhe Huang, Minheng Chen, Yao Xue, Tianming Liu, Dajiang Zhu
Abstract: We investigate the impact of entropy change in deep learning systems by noise injection at different levels, including the embedding space and the image. The series of models that employ our methodology are collectively known as Noisy Neural Networks (NoisyNN), with examples such as NoisyViT and NoisyCNN. Noise is conventionally viewed as a harmful perturbation in various deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers (ViTs), as well as different learning tasks like image classification and transfer learning. However, this work shows noise can be an effective way to change the entropy of the learning system. We demonstrate that specific noise can boost the performance of various deep models under certain conditions. We theoretically prove the enhancement gained from positive noise by reducing the task complexity defined by information entropy and experimentally show the significant performance gain in large image datasets, such as ImageNet. Herein, we use the information entropy to define the complexity of the task. We categorize the noise into two types, positive noise (PN) and harmful noise (HN), based on whether the noise can help reduce the task complexity. Extensive experiments of CNNs and ViTs have shown performance improvements by proactively injecting positive noise, where we achieved an unprecedented top-1 accuracy of 95% on ImageNet. Both theoretical analysis and empirical evidence have confirmed that the presence of positive noise can benefit the learning process, while the traditionally perceived harmful noise indeed impairs deep learning models. The different roles of noise offer new explanations for deep models on specific tasks and provide a new paradigm for improving model performance. Moreover, it reminds us that we can influence the performance of learning systems via information entropy change.
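As a generic illustration of noise injection in the embedding space, the wrapper below adds Gaussian noise to backbone features during training. This is a plain additive-noise example under that assumption; it is not the specific positive-noise construction the paper analyses.

```python
# Sketch: generic additive noise injection into an embedding space during training (assumed, not NoisyNN's formulation).
import torch
import torch.nn as nn

class NoisyEmbedding(nn.Module):
    def __init__(self, backbone: nn.Module, noise_std: float = 0.1):
        super().__init__()
        self.backbone = backbone
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)
        if self.training:
            # Perturb features only at training time; inference stays deterministic.
            feats = feats + self.noise_std * torch.randn_like(feats)
        return feats

model = NoisyEmbedding(nn.Linear(128, 64))
model.train()
print(model(torch.randn(4, 128)).shape)  # torch.Size([4, 64])
```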
Authors: Xinyuan Song
Abstract: With the spread of COVID-19 around the globe over the past year, the use of artificial intelligence (AI) algorithms and image processing methods to analyze chest X-ray images of patients with COVID-19 has become essential. Recognition of COVID-19 in the lung area of a patient is one of the basic and essential needs of clinical centers and hospitals. Most research in this field has been based on deep learning methods utilizing CNNs (Convolutional Neural Networks), which mainly deal with screening sick and healthy people. In this study, a new 19-layer CNN structure is proposed for accurate recognition of COVID-19 from chest X-ray images. The proposed CNN is developed to serve as a precise diagnosis system for a three-class classification (viral pneumonia, normal, COVID-19) and a four-class classification (lung opacity, normal, COVID-19, and pneumonia). A comparison is conducted between the outcomes of the proposed method and several popular pretrained networks, including Inception, AlexNet, ResNet50, SqueezeNet, and VGG19, based on specificity, accuracy, precision, sensitivity, confusion matrix, and F1-score. The experimental results show that the proposed CNN method outperforms existing published approaches. This method can be a useful tool for clinicians in making proper decisions about COVID-19.
Authors: Maximilian Ernst Tschuchnig, Julia Coste-Marin, Philipp Steininger, Michael Gadermayr
Abstract: Semantic segmentation is a crucial task in medical image processing, essential for segmenting organs or lesions such as tumors. In this study we aim to improve automated segmentation in CBCTs through multi-task learning. To evaluate effects on different volume qualities, a CBCT dataset is synthesised from the CT Liver Tumor Segmentation Benchmark (LiTS) dataset. To improve segmentation, two approaches are investigated. First, we perform multi-task learning to add morphology based regularization through a volume reconstruction task. Second, we use this reconstruction task to reconstruct the best quality CBCT (most similar to the original CT), facilitating denoising effects. We explore both holistic and patch-based approaches. Our findings reveal that, especially using a patch-based approach, multi-task learning improves segmentation in most cases and that these results can further be improved by our denoising approach.
Authors: Vishnu Sashank Dorbala, Bhrij Patel, Amrit Singh Bedi, Dinesh Manocha
Abstract: ObjectNav is a popular task in Embodied AI, where an agent navigates to a target object in an unseen environment. Prior literature makes the assumption of a static environment with stationary objects, which lacks realism. To address this, we present a novel formulation to generalize ObjectNav to dynamic environments with non-stationary objects, and refer to it as Portable ObjectNav or P-ObjectNav. In our formulation, we first address several challenging issues with dynamizing existing topological scene graphs by developing a novel method that introduces multiple transition behaviors to portable objects in the scene. We use this technique to dynamize Matterport3D, a popular simulator for evaluating embodied tasks. We then present a benchmark for P-ObjectNav using a combination of heuristic, reinforcement learning, and Large Language Model (LLM)-based navigation approaches on the dynamized environment, while introducing novel evaluation metrics tailored for our task. Our work fundamentally challenges the "static-environment" notion of prior ObjectNav work; the code and dataset for P-ObjectNav will be made publicly available to foster research on embodied navigation in dynamic scenes. We provide an anonymized repository for our code and dataset: https://anonymous.4open.science/r/PObjectNav-1C6D.
Authors: Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
Abstract: Data visualization in the form of charts plays a pivotal role in data analysis, offering critical insights and aiding in informed decision-making. Automatic chart understanding has witnessed significant advancements with the rise of large foundation models in recent years. Foundation models, such as large language models, have revolutionized various natural language processing tasks and are increasingly being applied to chart understanding tasks. This survey paper provides a comprehensive overview of the recent developments, challenges, and future directions in chart understanding within the context of these foundation models. We review fundamental building blocks crucial for studying chart understanding tasks. Additionally, we explore various tasks and their evaluation metrics and sources of both charts and textual inputs. Various modeling strategies are then examined, encompassing both classification-based and generation-based approaches, along with tool augmentation techniques that enhance chart understanding performance. Furthermore, we discuss the state-of-the-art performance of each task and discuss how we can improve the performance. Challenges and future directions are addressed, highlighting the importance of several topics, such as domain-specific charts, lack of efforts in developing evaluation metrics, and agent-oriented settings. This survey paper serves as a comprehensive resource for researchers and practitioners in the fields of natural language processing, computer vision, and data analysis, providing valuable insights and directions for future research in chart understanding leveraging large foundation models. The studies mentioned in this paper, along with emerging new research, will be continually updated at: https://github.com/khuangaf/Awesome-Chart-Understanding.
Authors: Zhe Ma, Qingming Li, Xuhong Zhang, Tianyu Du, Ruixiao Lin, Zonghui Wang, Shouling Ji, Wenzhi Chen
Abstract: The past few years have witnessed substantial advances in image generation powered by diffusion models. However, it was shown that diffusion models are vulnerable to training data memorization, raising concerns regarding copyright infringement and privacy invasion. This study delves into a rigorous analysis of memorization in diffusion models. We introduce an inversion-based measure of memorization, InvMM, which searches for a sensitive latent noise distribution accounting for the replication of an image. For accurate estimation of the memorization score, we propose an adaptive algorithm that balances the normality and sensitivity of the inverted distribution. Comprehensive experiments, conducted on both unconditional and text-guided diffusion models, demonstrate that InvMM is capable of detecting heavily memorized images and elucidating the effect of various factors on memorization. Additionally, we discuss how memorization differs from membership. In practice, InvMM serves as a useful tool for model developers to reliably assess the risk of memorization, thereby contributing to the enhancement of trustworthiness and privacy-preserving capabilities of diffusion models.
Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord
Abstract: Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advancements towards the interpretability of these models, understanding internal representations of LMMs remains largely a mystery. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary learning based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are well semantically grounded in both vision and text. Thus we refer to these as "multi-modal concepts". We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. Our code is publicly available at https://github.com/mshukor/xl-vlms
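Dictionary learning over token representations can be prototyped with off-the-shelf sparse coding. The sketch below uses scikit-learn on random features; the number of concepts and the feature matrix are purely illustrative stand-ins for actual LMM token representations.

```python
# Sketch: learning a small concept dictionary over token features with sparse dictionary learning.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
token_feats = rng.normal(size=(500, 64))          # placeholder for pooled LMM token representations

dl = DictionaryLearning(n_components=20, alpha=1.0, max_iter=200, random_state=0)
codes = dl.fit_transform(token_feats)             # sparse concept activations per sample
concepts = dl.components_                         # each row is one learned "concept" direction

print(codes.shape, concepts.shape)                # (500, 20) (20, 64)
```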
Authors: Zhuoheng Ran, Muhammad A. A. Abdelgawad, Zekai Zhang, Ray C. C. Cheung, Hong Yan
Abstract: The dramatic surge in the utilisation of generative artificial intelligence (GenAI) underscores the need for a secure and efficient mechanism to responsibly manage, use and disseminate multi-dimensional data generated by artificial intelligence (AI). In this paper, we propose a blockchain-based copyright traceability framework called ring oscillator-singular value decomposition (RO-SVD), which introduces decomposition computing to approximate low-rank matrices generated from hardware entropy sources and establishes an AI-generated content (AIGC) copyright traceability mechanism at the device level. By leveraging the parallelism and reconfigurability of field-programmable gate arrays (FPGAs), our framework can be easily constructed on existing AI-accelerated devices and provide a low-cost solution to emerging copyright issues of AIGC. We developed a hardware-software (HW/SW) co-design prototype based on comprehensive analysis and on-board experiments with multiple AI-applicable FPGAs. Using AI-generated images as a case study, our framework demonstrated effectiveness and emphasised customisation, unpredictability, efficiency, management and reconfigurability. To the best of our knowledge, this is the first practical hardware study discussing and implementing copyright traceability specifically for AI-generated content.
Authors: Carlo Alberto Barbano, Matteo Brunello, Benoit Dufumier, Marco Grangetto
Abstract: Deep Learning (DL) in neuroimaging has become increasingly relevant for detecting neurological conditions and neurodegenerative disorders. One of the most predominant biomarkers in neuroimaging is represented by brain age, which has been shown to be a good indicator for different conditions, such as Alzheimer's Disease. Using brain age for weakly supervised pre-training of DL models in transfer learning settings has also recently shown promising results, especially when dealing with data scarcity of different conditions. On the other hand, anatomical information of brain MRIs (e.g. cortical thickness) can provide important information for learning good representations that can be transferred to many downstream tasks. In this work, we propose AnatCL, an anatomical foundation model for brain MRIs that i.) leverages anatomical information in a weakly contrastive learning approach, and ii.) achieves state-of-the-art performances across many different downstream tasks. To validate our approach we consider 12 different downstream tasks for the diagnosis of different conditions such as Alzheimer's Disease, autism spectrum disorder, and schizophrenia. Furthermore, we also target the prediction of 10 different clinical assessment scores using structural MRI data. Our findings show that incorporating anatomical information during pre-training leads to more robust and generalizable representations. Pre-trained models can be found at: https://github.com/EIDOSLAB/AnatCL.
Authors: Juntao Jiang, Mengmeng Wang, Huizhong Tian, Lingbo Cheng, Yong Liu
Abstract: While large models have achieved significant progress in computer vision, challenges such as optimization complexity, the intricacy of transformer architectures, computational constraints, and practical application demands highlight the importance of simpler model designs in medical image segmentation. This need is particularly pronounced in mobile medical devices, which require lightweight, deployable models with real-time performance. However, existing lightweight models often suffer from poor robustness across datasets, limiting their widespread adoption. To address these challenges, this paper introduces LV-UNet, a lightweight and vanilla model that leverages pre-trained MobileNetv3-Large backbones and incorporates fusible modules. LV-UNet employs an enhanced deep training strategy and switches to a deployment mode during inference by re-parametrization, significantly reducing parameter count and computational overhead. Experimental results on the ISIC 2016, BUSI, CVC-ClinicDB, CVC-ColonDB, and Kvasir-SEG datasets demonstrate a favorable trade-off between performance and computational load. The code will be released at \url{https://github.com/juntaoJianggavin/LV-UNet}.
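A common form of the re-parameterization used when switching to a deployment mode is folding a BatchNorm layer into the preceding convolution, so inference runs a single fused conv. The sketch below shows this standard conv-BN fusion identity generically; it is not claimed to be LV-UNet's exact fusible-module definition.

```python
# Sketch: fold BatchNorm statistics into the preceding Conv2d (standard conv-BN fusion).
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    w = conv.weight.clone()
    b = conv.bias.clone() if conv.bias is not None else torch.zeros(conv.out_channels)
    std = torch.sqrt(bn.running_var + bn.eps)
    # y = gamma * (Wx + b - mean) / std + beta  ->  (gamma/std) W x + beta + gamma (b - mean) / std
    fused.weight.data = w * (bn.weight / std).reshape(-1, 1, 1, 1)
    fused.bias.data = bn.bias + (b - bn.running_mean) * bn.weight / std
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 3, 16, 16)
ref = bn(conv(x))
out = fuse_conv_bn(conv, bn)(x)
print(torch.allclose(ref, out, atol=1e-5))  # True: fused conv reproduces conv + BN
```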
Authors: Yahan Tu, Rui Hu, Jitao Sang
Abstract: Hallucination poses a persistent challenge for multimodal large language models (MLLMs). However, existing benchmarks for evaluating hallucinations are generally static, which may overlook the potential risk of data contamination. To address this issue, we propose ODE, an open-set, dynamic protocol designed to evaluate object hallucinations in MLLMs at both the existence and attribute levels. ODE employs a graph-based structure to represent real-world object concepts, their attributes, and the distributional associations between them. This structure facilitates the extraction of concept combinations based on diverse distributional criteria, generating varied samples for structured queries that evaluate hallucinations in both generative and discriminative tasks. Through the generation of new samples, dynamic concept combinations, and varied distribution frequencies, ODE mitigates the risk of data contamination and broadens the scope of evaluation. This protocol is applicable to both general and specialized scenarios, including those with limited data. Experimental results demonstrate the effectiveness of our protocol, revealing that MLLMs exhibit higher hallucination rates when evaluated with ODE-generated samples, which indicates potential data contamination. Furthermore, these generated samples aid in analyzing hallucination patterns and fine-tuning models, offering an effective approach to mitigating hallucinations in MLLMs.
Authors: Frederik Hagelskj{\ae}r
Abstract: In this paper, we present a novel method for self-supervised fine-tuning of pose estimation. Leveraging zero-shot pose estimation, our approach enables the robot to automatically obtain training data without manual labeling. After pose estimation, the object is grasped, and in-hand pose estimation is used for data validation. Our pipeline allows the system to fine-tune while the process is running, removing the need for a learning phase. The motivation behind our work lies in the need for rapid setup of pose estimation solutions. Specifically, we address the challenging task of bin picking, which plays a pivotal role in flexible robotic setups. Our method is implemented on a robotics work-cell, and tested with four different objects. For all objects, our method increases the performance and outperforms a state-of-the-art method trained on the CAD model of the objects.
Authors: Qing Wu, Chenhe Du, XuanYu Tian, Jingyi Yu, Yuyao Zhang, Hongjiang Wei
Abstract: Motion correction (MoCo) in radial MRI is a challenging problem due to the unpredictability of the subject's motion. Current state-of-the-art (SOTA) MoCo algorithms often use extensive high-quality MR images to pre-train neural networks, obtaining excellent reconstructions. However, the need for large-scale datasets significantly increases costs and limits model generalization. In this work, we propose Moner, an unsupervised MoCo method that jointly recovers artifact-free MR images and accurate motion estimates from undersampled, rigid-motion-corrupted k-space data, without requiring training data. Our core idea is to leverage the continuous prior of implicit neural representation (INR) to constrain this ill-posed inverse problem, enabling ideal solutions. Specifically, we incorporate a quasi-static motion model into the INR, granting it the ability to correct the subject's motion. To stabilize model optimization, we reformulate radial MRI as a back-projection problem using the Fourier-slice theorem. Additionally, we propose a novel coarse-to-fine hash encoding strategy, significantly enhancing MoCo accuracy. Experiments on multiple MRI datasets show that Moner achieves performance comparable to SOTA MoCo techniques on in-domain data, while demonstrating significant improvements on out-of-domain data.
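The quasi-static motion model mentioned above can be pictured as one rigid transform (rotation plus translation) per acquired spoke, applied to the spatial coordinates before querying the implicit representation. The toy sketch below illustrates just that coordinate-warping step with a stand-in INR; the actual Moner formulation (Fourier-slice back-projection, hash encoding, joint optimization) is more involved and is not reproduced here.

```python
import numpy as np

def rigid_warp_2d(coords, theta, translation):
    """Apply a 2D rigid transform (rotation + translation) to spatial coordinates.

    coords: (N, 2) array of (x, y) sample points along a radial spoke.
    theta: rotation angle in radians for this motion state.
    translation: (2,) in-plane shift for this motion state.
    """
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return coords @ rot.T + translation

def render_spoke(inr, spoke_coords, motion_state):
    """Query the implicit representation along one spoke under its motion state,
    mimicking a quasi-static model where motion is constant within a spoke."""
    theta, translation = motion_state
    warped = rigid_warp_2d(spoke_coords, theta, translation)
    return inr(warped)   # predicted intensities, later compared against measured data

# Toy INR stand-in: a fixed Gaussian blob; a real model would be a trainable MLP.
inr = lambda xy: np.exp(-np.sum(xy**2, axis=1) / 0.1)
spoke = np.stack([np.linspace(-1, 1, 64), np.zeros(64)], axis=1)
print(render_spoke(inr, spoke, (np.deg2rad(5.0), np.array([0.02, -0.01]))).shape)
```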
Authors: Yuyang Sha
Abstract: This paper presents a novel multimodal perception system for a real open environment. The proposed system includes an embedded computation platform, cameras, ultrasonic sensors, GPS, and IMU devices. Unlike the traditional frameworks, our system integrates multiple sensors with advanced computer vision algorithms to help users walk outside reliably. The system can efficiently complete various tasks, including navigating to specific locations, passing through obstacle regions, and crossing intersections. Specifically, we also use ultrasonic sensors and depth cameras to enhance obstacle avoidance performance. The path planning module is designed to find the locally optimal route based on various feedback and the user's current state. To evaluate the performance of the proposed system, we design several experiments under different scenarios. The results show that the system can help users walk efficiently and independently in complex situations.
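As a rough illustration of combining ultrasonic sensors with a depth camera for obstacle avoidance, the sketch below takes the nearest reading from either sensor in a forward corridor and raises an alert inside a stop distance. The corridor crop and threshold are illustrative assumptions, not the system's actual fusion or path-planning logic.

```python
import numpy as np

def obstacle_alert(ultrasonic_ranges_m, depth_map_m, stop_distance_m=1.0):
    """Fuse ultrasonic and depth-camera readings into a simple stop/go decision.

    Illustrative only: take the closest reading from either sensor in the forward
    corridor and flag an obstacle if it falls inside the stop distance."""
    h, w = depth_map_m.shape
    corridor = depth_map_m[h // 3:, w // 3: 2 * w // 3]   # lower-central region ahead of the user
    valid = corridor[corridor > 0]                        # ignore missing depth pixels
    nearest_depth = valid.min() if valid.size else np.inf
    nearest_ultra = min(ultrasonic_ranges_m) if ultrasonic_ranges_m else np.inf
    nearest = min(nearest_depth, nearest_ultra)
    return nearest < stop_distance_m, nearest

alert, dist = obstacle_alert([2.4, 1.8, 0.7], np.full((240, 320), 3.0))
print(alert, dist)   # True 0.7 -> the ultrasonic reading is inside the 1 m stop distance
```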
Authors: Jihoon Cho, Seunghyuck Park, Jinah Park
Abstract: Despite significant advancements in automatic brain tumor segmentation methods, their performance is not guaranteed when certain MR sequences are missing. To address this issue, it is crucial to synthesize the missing MR images so that they reflect the unique characteristics of the absent modality with precise tumor representation. Typically, MRI synthesis methods generate partial images rather than full-sized volumes due to computational constraints. This limitation can lead to a lack of comprehensive 3D volumetric information and result in image artifacts during the merging process. In this paper, we propose a two-stage approach that first synthesizes MR images from 2D slices using a novel intensity encoding method and then refines the synthesized MRI. The proposed intensity encoding reduces artifacts when synthesizing MRI on a 2D slice basis. Then, the \textit{Refiner}, which leverages complete 3D volume information, further improves the quality of the synthesized images and enhances their applicability to segmentation methods. Experimental results demonstrate that the intensity encoding effectively minimizes artifacts in the synthesized MRI and improves perceptual quality. Furthermore, using the \textit{Refiner} on synthesized MRI significantly improves brain tumor segmentation results, highlighting the potential of our approach in practical applications.
Authors: Shutong Jin, Ruiyu Wang, Kuangyi Chen, Florian T. Pokorny
Abstract: Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.
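The object-level representation discussed above comes from cross-attention between prompt tokens and spatial features inside the diffusion model. The snippet below shows only the generic mechanism: a softmax cross-attention map for one token, min-max normalized and thresholded into a rough object mask. The shapes, threshold, and feature sources are illustrative assumptions rather than the PACA pipeline itself.

```python
import torch
import torch.nn.functional as F

def cross_attention_object_mask(spatial_feats, token_embeds, token_index, threshold=0.5):
    """spatial_feats: (H*W, d) queries from a UNet layer; token_embeds: (T, d) text keys.

    Returns a binary (H*W,) mask for the chosen prompt token, built from the
    softmax-normalized cross-attention map followed by min-max scaling and thresholding.
    """
    d = spatial_feats.shape[-1]
    attn = F.softmax(spatial_feats @ token_embeds.T / d**0.5, dim=-1)  # (H*W, T)
    token_map = attn[:, token_index]
    token_map = (token_map - token_map.min()) / (token_map.max() - token_map.min() + 1e-8)
    return token_map > threshold

# Toy example: 16x16 feature grid, 8 prompt tokens, mask for token 3.
mask = cross_attention_object_mask(torch.randn(256, 64), torch.randn(8, 64), token_index=3)
print(mask.reshape(16, 16).sum().item(), "pixels selected")
```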
Authors: Xiang Li, Yixiang Dai, Qing Qu
Abstract: In this work, we study the generalizability of diffusion models by looking into the hidden properties of the learned score functions, which are essentially a series of deep denoisers trained on various noise levels. We observe that as diffusion models transition from memorization to generalization, their corresponding nonlinear diffusion denoisers exhibit increasing linearity. This discovery leads us to investigate the linear counterparts of the nonlinear diffusion models, which are a series of linear models trained to match the function mappings of the nonlinear diffusion denoisers. Surprisingly, these linear denoisers are approximately the optimal denoisers for a multivariate Gaussian distribution characterized by the empirical mean and covariance of the training dataset. This finding implies that diffusion models have an inductive bias towards capturing and utilizing the Gaussian structure (covariance information) of the training dataset for data generation. We empirically demonstrate that this inductive bias is a unique property of diffusion models in the generalization regime, which becomes increasingly evident when the model's capacity is relatively small compared to the training dataset size. When the model is highly overparameterized, this inductive bias emerges during the initial training phases, before the model fully memorizes its training data. Our study provides crucial insights into understanding the notably strong generalization phenomenon recently observed in real-world diffusion models.
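For reference, the optimal (posterior-mean) denoiser for Gaussian data mentioned above has a closed form: if $x_0 \sim \mathcal{N}(\mu, \Sigma)$ and $x_\sigma = x_0 + \sigma\varepsilon$, then $\mathbb{E}[x_0 \mid x_\sigma] = \mu + \Sigma(\Sigma + \sigma^2 I)^{-1}(x_\sigma - \mu)$. The snippet below applies this linear denoiser using the empirical mean and covariance of a toy dataset; it illustrates the standard Gaussian result the abstract refers to, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training set": correlated Gaussian data; use its empirical mean and covariance.
X = rng.standard_normal((5000, 2)) @ np.array([[1.0, 0.0], [0.8, 0.3]])
mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)

def gaussian_denoiser(x_noisy, sigma):
    """Optimal linear (posterior-mean) denoiser for N(mu, Sigma) data under
    additive Gaussian noise of std sigma: mu + Sigma (Sigma + sigma^2 I)^{-1} (x - mu)."""
    W = Sigma @ np.linalg.inv(Sigma + sigma**2 * np.eye(len(mu)))
    return mu + (x_noisy - mu) @ W.T

sigma = 0.5
x0 = X[:1000]
x_noisy = x0 + sigma * rng.standard_normal(x0.shape)
x_hat = gaussian_denoiser(x_noisy, sigma)
print("MSE noisy   :", np.mean((x_noisy - x0) ** 2))
print("MSE denoised:", np.mean((x_hat - x0) ** 2))   # clearly lower than the noisy MSE
```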
Authors: Taha Bouhsine
Abstract: We introduce a yat-product-powered neural network, the Neural Matter Network (NMN), a breakthrough in deep learning that achieves non-linear pattern recognition without activation functions. Our key innovation relies on the yat-product, which naturally induces non-linearity by projecting inputs into a pseudo-metric space, eliminating the need for traditional activation functions while maintaining only a softmax layer for the final class probability distribution. This approach simplifies network architecture and provides unprecedented transparency into the network's decision-making process. Our comprehensive empirical evaluation across different datasets demonstrates that NMN consistently outperforms traditional MLPs. The results challenge the assumption that separate activation functions are necessary for effective deep-learning models. The implications of this work extend beyond the immediate architectural benefits: by eliminating intermediate activation functions while preserving non-linear capabilities, yat-MLP establishes a new paradigm for neural network design that combines simplicity with effectiveness. Most importantly, our approach provides unprecedented insights into the traditionally opaque "black-box" nature of neural networks, offering a clearer understanding of how these models process and classify information.
Authors: Langrui Zhou, Ziteng Zhou, Xinyu Huang, Xiangyu Zhang, Huiru Wang, Guang Li
Abstract: In the last few years, with the rapid development of deep learning technologies, supervised methods based on convolutional neural networks have greatly enhanced the performance of medical image denoising. However, these methods require large quantities of noisy-clean image pairs for training, which greatly limits their practicality. Although some researchers have attempted to train denoising networks using only single noisy images, existing self-supervised methods, including blind-spot-based and data-splitting-based methods, heavily rely on the assumption that noise is pixel-wise independent. However, this assumption often does not hold in real-world medical images. Therefore, in the field of medical imaging, there remains a lack of simple and practical denoising methods that can achieve high-quality denoising performance using only single noisy images. In this paper, we propose a novel self-supervised medical image denoising method, Neighboring Slice Noise2Noise (NS-N2N). The proposed method utilizes neighboring slices within a single noisy image volume to construct weighted training data, and then trains the denoising network using a self-supervised scheme with regional consistency loss and inter-slice continuity loss. NS-N2N only requires a single noisy image volume obtained from one medical imaging procedure to achieve high-quality denoising of the image volume itself. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art self-supervised denoising methods in both denoising performance and processing efficiency. Furthermore, since NS-N2N operates solely in the image domain, it is free from device-specific issues such as reconstruction geometry, making it easier to apply in various clinical practices.
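The core idea above, pairing adjacent slices of a single noisy volume as mutually supervising noisy observations in Noise2Noise fashion, can be sketched as follows. The weighting scheme, regional consistency loss, and inter-slice continuity loss of NS-N2N are not reproduced here; the plain L2 objective and tiny network are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

def neighboring_slice_pairs(volume):
    """volume: (S, H, W) noisy image volume.
    Returns (S-1, 1, H, W) input/target tensors built from adjacent slices,
    in the spirit of Noise2Noise: each slice supervises its neighbor."""
    inputs = volume[:-1].unsqueeze(1)
    targets = volume[1:].unsqueeze(1)
    return inputs, targets

denoiser = nn.Sequential(                     # tiny stand-in network for illustration
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

volume = torch.randn(32, 64, 64)              # a single noisy volume, no clean reference
inputs, targets = neighboring_slice_pairs(volume)
for _ in range(10):                           # a few illustrative optimization steps
    pred = denoiser(inputs)
    loss = nn.functional.mse_loss(pred, targets)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
print(float(loss))
```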
Authors: Anika Tahsin Meem
Abstract: The cryptocurrency market is one of the world's most rapidly growing financial markets, and it is regarded as more volatile and illiquid than traditional markets such as equities, foreign exchange, and commodities. The risk of this market creates uncertainty among investors. The purpose of this research is to predict the magnitude of the risk factor, also called volatility, of the cryptocurrency market, thereby assisting people who invest in it. Our approach starts with calculating the risk factor of the cryptocurrency market from existing parameters. For twenty elements of the cryptocurrency market, the risk factor is predicted using different machine learning algorithms, including CNN, LSTM, BiLSTM, and GRU, all applied to the calculated risk factor parameter. A new model has been developed that predicts the risk factor better than the existing models: its RMSE ranges from a low of 0.0089 to a high of 1.3229, whereas the existing models' RMSE ranges from 0.02769 to 14.5092. The proposed model thus performs much better while generalizing properly, making it easier for investors to trade complicated and challenging financial assets such as Bitcoin, Ethereum, and Dogecoin.
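The risk-factor (volatility) computation and forecasting step described above might look roughly like the following sketch, which uses the rolling standard deviation of log returns as a volatility proxy and a small GRU for one-step-ahead prediction. Both choices are assumptions for illustration; the paper's exact risk-factor definition, features, and proposed architecture are not specified in the abstract.

```python
import numpy as np
import torch
import torch.nn as nn

def rolling_volatility(prices, window=7):
    """A common volatility proxy: rolling std of log returns.
    (The paper's exact risk-factor computation is not given; this is an assumption.)"""
    log_ret = np.diff(np.log(prices))
    return np.array([log_ret[i - window:i].std() for i in range(window, len(log_ret) + 1)])

class GRUForecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, seq_len, 1)
        out, _ = self.gru(x)
        return self.head(out[:, -1])            # one-step-ahead volatility prediction

# Toy price series standing in for one asset's daily closes.
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal(500)))
vol = rolling_volatility(prices)

seq_len = 30
X = torch.tensor(np.stack([vol[i:i + seq_len] for i in range(len(vol) - seq_len)]),
                 dtype=torch.float32).unsqueeze(-1)
y = torch.tensor(vol[seq_len:], dtype=torch.float32).unsqueeze(-1)

model = GRUForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):                             # a few illustrative training steps
    loss = nn.functional.mse_loss(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
print("train RMSE:", float(loss.sqrt()))
```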
Authors: Yong Xie, Weijie Zheng, Hanxun Huang, Guangnan Ye, Xingjun Ma
Abstract: As deep learning models are increasingly deployed in safety-critical applications, evaluating their vulnerabilities to adversarial perturbations is essential for ensuring their reliability and trustworthiness. Over the past decade, a large number of white-box adversarial robustness evaluation methods (i.e., attacks) have been proposed, ranging from single-step to multi-step methods and from individual to ensemble methods. Despite these advances, challenges remain in conducting meaningful and comprehensive robustness evaluations, particularly when it comes to large-scale testing and ensuring evaluations reflect real-world adversarial risks. In this work, we focus on image classification models and propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space. We analyze the relationship between PMA and existing cross-entropy or logits-margin-based attacks, and show that PMA can outperform the current state-of-the-art individual methods. Building on PMA, we propose two types of ensemble attacks that balance effectiveness and efficiency. Furthermore, we create a million-scale dataset, CC1M, derived from the existing CC3M dataset, and use it to conduct the first million-scale white-box adversarial robustness evaluation of adversarially-trained ImageNet models. Our findings provide valuable insights into the robustness gaps between individual and ensemble attacks, and between small-scale and million-scale evaluations.
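A probability-margin objective of the kind described above can be written as the gap between the true-class probability and the largest other-class probability, which an untargeted attack drives negative under an $\ell_\infty$ budget. The PGD-style sketch below illustrates this idea; the paper's exact PMA loss, step schedule, and ensemble variants may differ.

```python
import torch
import torch.nn.functional as F

def probability_margin_loss(logits, labels):
    """Margin in probability space: p_true - max_{j != true} p_j.
    The attacker tries to make this margin negative (i.e., cause a misclassification)."""
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    probs_other = probs.clone()
    probs_other.scatter_(1, labels.unsqueeze(1), -1.0)     # mask out the true class
    p_other = probs_other.max(dim=1).values
    return p_true - p_other

def pgd_probability_margin(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted L-inf PGD that minimizes the probability margin."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = probability_margin_loss(model(x_adv), y).sum()
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() - alpha * grad.sign()        # descend on the margin
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps)   # project back into the L-inf ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Usage (with any image classifier `model` taking inputs in [0, 1]):
# x_adv = pgd_probability_margin(model, images, labels)
```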
Authors: Yassine Machta, Omar Ali, Kevin Hakkakian, Ana Vlasceanu, Amaury Facque, Nicolas Golse, Irene Vignon-Clementel
Abstract: Surgical assessment of liver cancer patients requires identification of the vessel trees from medical images. Specifically, the venous trees - the portal (perfusing) and the hepatic (draining) trees - are important for understanding liver anatomy and disease state, and for surgical planning. This research aims to improve the 3D segmentation, skeletonization, and subsequent analysis of vessel trees by creating an automatic pipeline based on deep learning and image processing techniques. The first part of this work explores the impact of differentiable skeletonization methods, such as clDice and a morphological skeletonization loss, on overall liver vessel segmentation performance; to this end, it studies how to improve vessel tree connectivity. The second part of this study converts a single-class vessel segmentation into a multi-class one, separating the two venous trees. It builds on the previous two-class vessel segmentation model, whose vessel tree outputs might be entangled, and on connected-component and skeleton analyses of the trees. After providing sub-labeling of the specific anatomical branches of each venous tree, these algorithms also enable a morphometric analysis of the vessel trees by extracting various geometrical markers. In conclusion, we propose a method that successfully improves current skeletonization methods for extensive vascular trees that contain vessels of different calibers. The separation algorithm creates a clean multi-class segmentation of the vessels, validated by surgeons to have low error. A new, publicly shared, high-quality liver vessel dataset of 77 cases is thus created. Finally, a method to annotate vessel trees according to anatomy is provided, enabling a unique liver vessel morphometry analysis.
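For context, the clDice idea referenced above scores how well the centerline (skeleton) of the prediction stays inside the ground-truth vessels and vice versa, rewarding preserved connectivity rather than plain overlap. A minimal metric-style sketch using skimage skeletonization follows; the differentiable soft-skeleton losses actually used for training are more involved and are not shown here.

```python
import numpy as np
from skimage.morphology import skeletonize

def cl_dice(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-8) -> float:
    """Centerline Dice between two binary masks.

    Tprec: fraction of the predicted skeleton lying inside the ground-truth mask.
    Tsens: fraction of the ground-truth skeleton lying inside the predicted mask.
    clDice is their harmonic mean, emphasizing preserved vessel connectivity.
    """
    skel_pred = skeletonize(pred_mask.astype(bool))
    skel_gt = skeletonize(gt_mask.astype(bool))
    tprec = (skel_pred & gt_mask.astype(bool)).sum() / (skel_pred.sum() + eps)
    tsens = (skel_gt & pred_mask.astype(bool)).sum() / (skel_gt.sum() + eps)
    return 2 * tprec * tsens / (tprec + tsens + eps)

# Toy example: a thin horizontal "vessel" and a slightly shifted prediction.
gt = np.zeros((64, 64), dtype=bool); gt[30:33, 5:60] = True
pred = np.zeros_like(gt); pred[31:34, 5:60] = True
print(round(cl_dice(pred, gt), 3))
```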
Authors: Song-Jiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Kai-wen Xue, Kin-Man Lam
Abstract: In the research area of image super-resolution, Swin-transformer-based models are favored for their global spatial modeling and shifting-window attention mechanism. However, existing methods often limit self-attention to non-overlapping windows to cut costs and ignore the useful information that exists across channels. To address this issue, this paper introduces a novel model, the Hybrid Attention Aggregation Transformer (HAAT), designed to better leverage feature information. HAAT is constructed by integrating Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks (HGAB). SDRCB expands the receptive field while maintaining a streamlined architecture, resulting in enhanced performance. HGAB incorporates channel attention, sparse attention, and window attention to improve nonlocal feature fusion and achieve more visually compelling results. Experimental evaluations demonstrate that HAAT surpasses state-of-the-art methods on benchmark datasets. Keywords: Image super-resolution, Computer vision, Attention mechanism, Transformer
Authors: Mohammadhossein Momeni, Vivek Gopalakrishnan, Neel Dey, Polina Golland, Sarah Frisken
Abstract: We present DiffVox, a self-supervised framework for Cone-Beam Computed Tomography (CBCT) reconstruction by directly optimizing a voxelgrid representation using physics-based differentiable X-ray rendering. Further, we investigate how the different implementations of the X-ray image formation model in the renderer affect the quality of 3D reconstruction and novel view synthesis. When combined with our regularized voxel-based learning framework, we find that using an exact implementation of the discrete Beer-Lambert law for X-ray attenuation in the renderer outperforms both widely used iterative CBCT reconstruction algorithms and modern neural field approaches, particularly when given only a few input views. As a result, we reconstruct high-fidelity 3D CBCT volumes from fewer X-rays, potentially reducing ionizing radiation exposure and improving diagnostic utility. Our implementation is available at https://github.com/hossein-momeni/DiffVox.
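The exact discrete Beer-Lambert model highlighted above gives, for each detector ray, a transmitted intensity $I = I_0 \exp\left(-\sum_i \mu_i \, \Delta s_i\right)$ over the attenuation coefficients sampled along the ray. The snippet below renders one such ray through a voxel grid with uniform steps and nearest-neighbor sampling; it is a schematic of the image-formation model, not the DiffVox differentiable renderer.

```python
import numpy as np

def beer_lambert_ray(volume_mu, origin, direction, step=0.5, n_steps=400, i0=1.0):
    """Integrate attenuation along one ray through a voxel grid (nearest-neighbor sampling)
    and return the transmitted X-ray intensity via the discrete Beer-Lambert law."""
    direction = np.asarray(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    points = origin + step * np.arange(n_steps)[:, None] * direction       # (n_steps, 3)
    idx = np.round(points).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(volume_mu.shape)), axis=1)
    mu_samples = volume_mu[idx[inside, 0], idx[inside, 1], idx[inside, 2]]
    line_integral = mu_samples.sum() * step                                # sum_i mu_i * delta_s
    return i0 * np.exp(-line_integral)

# Toy volume: a denser cube in the middle of a 64^3 grid of attenuation coefficients.
vol = np.zeros((64, 64, 64)); vol[24:40, 24:40, 24:40] = 0.02
intensity = beer_lambert_ray(vol, origin=np.array([0.0, 32.0, 32.0]), direction=[1.0, 0.0, 0.0])
print(round(intensity, 4))   # < 1.0: the ray is attenuated by the dense cube
```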