DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models

Authors: Jianxin Lin, Peng Xiao, Yijun Wang, Rongju Zhang, Xiangxiang Zeng

Abstract: Recent data-driven image colorization methods have enabled automatic or reference-based colorization, while still suffering from unsatisfactory and inaccurate object-level color control. To address these issues, we propose a new method called DiffColor that leverages the power of pre-trained diffusion models to recover vivid colors conditioned on a prompt text, without any additional inputs. DiffColor mainly contains two stages: colorization with generative color prior and in-context controllable colorization. Specifically, we first fine-tune a pre-trained text-to-image model to generate colorized images using a CLIP-based contrastive loss. Then we try to obtain an optimized text embedding aligning the colorized image and the text prompt, and a fine-tuned diffusion model enabling high-quality image reconstruction. Our method can produce vivid and diverse colors with a few iterations, and keep the structure and background intact while having colors well-aligned with the target language guidance. Moreover, our method allows for in-context colorization, i.e., producing different colorization results by modifying prompt texts without any fine-tuning, and can achieve object-level controllable colorization results. Extensive experiments and user studies demonstrate that DiffColor outperforms previous works in terms of visual quality, color fidelity, and diversity of colorization options.

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper presents a new method called DiffColor for high-fidelity text-guided image colorization using diffusion models.

  • DiffColor aims to recover vivid and diverse colors conditioned on a text prompt without requiring additional inputs.

Why:

  • Existing colorization methods struggle with unsatisfactory and inaccurate object-level color control.

  • Diffusion models have shown impressive results for image generation but lack the ability to edit specific regions of an image while preserving other areas.

  • Text descriptions allow incorporating prior knowledge to guide the colorization process.

How:

  • DiffColor has two main stages:

    1. Colorization with generative color prior: Fine-tune a diffusion model using a contrastive loss to generate colorized images guided by text.
    2. In-context controllable colorization: Optimize the text embedding to align with the colorized image and fine-tune the diffusion model; this allows modifying the text prompt to edit colors without retraining.
  • Evaluate both quantitatively (metrics like FID, LPIPS, CLIP score) and qualitatively (user studies).

  • Outperforms previous methods in visual quality, color fidelity, and diversity of colorization options.

In summary, this paper presents a novel text-guided image colorization approach using diffusion models called DiffColor. It aims to leverage text prompts to recover vivid, diverse colors with accurate object-level control. The method fine-tunes diffusion models with a contrastive loss and optimizes text embeddings. Experiments demonstrate superior performance over existing colorization techniques.

Main Contributions

Here are the main contributions of this paper:

  • Development of a new framework (DiffColor) that utilizes text-guided diffusion models for high-fidelity and controllable image colorization.

  • Incorporation of a novel color contrastive loss as guidance during training to align the generated images with the text descriptions.

  • A two-stage color refinement process with spatial alignment to improve image structure preservation and text embedding optimization for controllability.

  • Extensive experiments and comparisons demonstrating superiority over state-of-the-art colorization methods in terms of quality, diversity, and text alignment.

  • Qualitative results showing the ability to perform diverse in-context colorization edits by simply modifying the text prompts.

  • Quantitative evaluations using metrics like FID, LPIPS, CLIP score, PSNR, and SSIM to measure quality and text-image relevance.

  • User studies that show preference for DiffColor over other methods in terms of color vividness, colorfulness, and text consistency.

In summary, the key contributions are:

  1. A new text-guided diffusion framework for colorization
  2. A novel contrastive loss for training guidance
  3. A two-stage refinement process
  4. Comprehensive experiments and evaluations demonstrating state-of-the-art performance
  5. Qualitative and quantitative results proving controllable diverse colorization
  6. User studies confirming superior color quality and text alignment

Method Section

Here is a summary of the method section from the paper:

The proposed DiffColor framework contains two main stages:

  1. Colorization with Generative Color Prior
  • Input grayscale image is encoded to a latent code z
  • Denoising model takes z and the text embedding of the context prompt c
  • Model is fine-tuned using the color contrastive loss and the diffusion guidance loss
  • Contrastive loss steers the output toward accurate colors using c and away from anti-color texts c-
  • Output is the primary colorized image x^rgb_pri
  • A spatial alignment module aligns x^rgb_pri with the grayscale image, giving the aligned color image x̂^rgb_pri
  2. In-context Controllable Colorization
  • Inputs are x̂^rgb_pri and a target text prompt ĉ containing color descriptions
  • Context prompt c is rewritten with unique identifiers as c’
  • Embedding of c’ is optimized to maximize similarity with x̂^rgb_pri
  • Diffusion model is fine-tuned to reconstruct the colorized image x^rgb_rec
  • Target embedding e_tgt is obtained from ĉ
  • Optimized and target embeddings are interpolated
  • Final colorized image x^rgb_fin is generated conditioned on the interpolated embedding
  • The result is aligned with the grayscale image to obtain the final output image

In summary, the method consists of two stages. The first uses a contrastive loss to fine-tune a diffusion model for initial colorization. The second optimizes text embeddings and fine-tunes the model again to enable controllable color editing by modifying the text prompts. Spatial alignment helps preserve structure.
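
The color contrastive loss is described only at a high level in this summary. Below is a minimal sketch of a plausible InfoNCE-style formulation over CLIP embeddings, where the context prompt acts as the positive text and the anti-color texts as negatives; the temperature tau, the batch handling, and the exact loss form are assumptions rather than the paper's formula.

import torch
import torch.nn.functional as F

def color_contrastive_loss(img_emb, pos_emb, neg_embs, tau=0.07):
    # InfoNCE-style loss: pull the CLIP image embedding toward the context
    # prompt embedding and push it away from the anti-color text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    neg_embs = F.normalize(neg_embs, dim=-1)
    pos_sim = (img_emb * pos_emb).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    neg_sims = img_emb @ neg_embs.t() / tau                         # (B, N_neg)
    logits = torch.cat([pos_sim, neg_sims], dim=-1)
    labels = torch.zeros(img_emb.size(0), dtype=torch.long)         # positive is at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random CLIP-sized (512-d) embeddings
img_emb = torch.randn(4, 512)
pos_emb = torch.randn(4, 512)    # embedding of the context prompt c
neg_embs = torch.randn(2, 512)   # embeddings of anti-color texts c-
print(color_contrastive_loss(img_emb, pos_emb, neg_embs))

In the full method, this term is combined with the diffusion guidance loss during the Stage 1 fine-tuning.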

High-Level Pseudo Code

Here is the high-level pseudo code for the DiffColor method:

# Stage 1: Colorization with generative color prior
 
gray_img = load_grayscale_image()                # input grayscale image
context_prompt = "text describing the content of gray_img"
 
# Encode gray_img to latent code
latent_code = encoder(gray_img)  
 
# Get context embedding from CLIP text encoder 
context_emb = CLIP_text_encoder(context_prompt)
 
# Define anti-color texts c- and fine-tune the diffusion model using the contrastive loss
anti_color_prompts = ["A grayscale image", "A scratchy photo"]
for i in range(num_iterations):
  noisy_latent = add_noise(latent_code) 
  rgb_img = diffusion_model(noisy_latent, context_emb)  
  loss = contrastive_loss(rgb_img, context_prompt, anti_color_prompts)
  update_diffusion_model(loss)
 
# Spatial alignment  
aligned_rgb_img = spatial_align(rgb_img, gray_img)
 
# Stage 2: In-context controllable colorization
 
target_prompt = add_color_descriptions(context_prompt)  # e.g. "A brown cat"
 
# Rewrite context prompt with identifiers
rewritten_prompt = add_identifiers(context_prompt)  
 
# Optimize context embedding
optimized_emb = optimize_embedding(rewritten_prompt, aligned_rgb_img)
 
# Fine-tune diffusion model
fine_tuned_model = finetune(diffusion_model, optimized_emb, aligned_rgb_img)
 
# Get target embedding
target_emb = CLIP_text_encoder(target_prompt)
 
# Interpolate embeddings
interp_emb = interpolate(optimized_emb, target_emb)
 
# Generate final colorized image
final_rgb_img = fine_tuned_model(latent_code, interp_emb) 
 
# Spatial alignment
output_img = spatial_align(final_rgb_img, gray_img)
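
The spatial_align call above is a black box; the paper's alignment module is not detailed in this summary. As a hedged stand-in, one common way to preserve structure in colorization is to keep the input's luminance and take only the chrominance from the generated image, e.g. in Lab space. The sketch below assumes float images in [0, 1] and should be read as an illustration, not the paper's module.

from skimage.color import rgb2lab, lab2rgb

def luminance_preserving_align(rgb_img, gray_img):
    # Keep the chrominance (a, b) of the generated image and the luminance
    # (structure) of the original grayscale input.
    # rgb_img: HxWx3 float array in [0, 1]; gray_img: HxW float array in [0, 1].
    lab = rgb2lab(rgb_img)
    lab[..., 0] = gray_img * 100.0   # the L channel runs from 0 to 100
    return lab2rgb(lab)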

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the DiffColor method:

# Load pre-trained models
encoder = load_encoder()
diffusion_model = load_diffusion_model() 
CLIP = load_CLIP()
align_model = load_align_model()
 
# Stage 1: Colorization with generative color prior
 
# Input
gray_img = load_grayscale_image()
context_prompt = "A photo of a cat" 
 
# Encode image
latent_code = encoder(gray_img)  
 
# Get context embedding
context_emb = CLIP.encode_text(context_prompt) 
 
# Define anti-color prompts
anti_prompts = ["A grayscale image", "A scratchy photo"]
 
# Fine-tune diffusion model
for i in range(1500):
  
  # Add noise
  noisy_latent = add_noise(latent_code)
  
  # Generate image
  rgb_img = diffusion_model(noisy_latent, context_emb) 
  
  # Get image embedding
  img_emb = CLIP.encode_image(rgb_img)
 
  # Calculate contrastive loss
  pos_sim = cosine_sim(img_emb, context_emb)
  neg_sims = [cosine_sim(img_emb, CLIP.encode_text(anti)) for anti in anti_prompts]
  loss = contrastive_loss(pos_sim, neg_sims)  # combined with the diffusion guidance loss in the full method
  
  # Update model
  update_diffusion_model(loss) 
 
# Spatial alignment
aligned_rgb_img = align_model(rgb_img, gray_img) 
 
# Stage 2: In-context controllable colorization
 
# Input
target_prompt = "A brown cat"
 
# Rewrite context prompt  
rewritten_prompt = rewrite_with_ids(context_prompt)
 
# Optimize context embedding
optimized_emb = optimize_embedding(rewritten_prompt, aligned_rgb_img, steps=500) 
 
# Fine-tune model
fine_tuned_model = finetune(diffusion_model, optimized_emb, aligned_rgb_img, steps=1000)
 
# Get target embedding
target_emb = CLIP.encode_text(target_prompt) 
 
# Interpolate embeddings
interp_emb = interpolate(optimized_emb, target_emb)
 
# Generate final image
final_rgb_img = fine_tuned_model(latent_code, interp_emb)
 
# Align
output_img = align_model(final_rgb_img, gray_img)
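
The optimize_embedding, finetune, and interpolate calls above are placeholders. As a sketch of the embedding optimization and interpolation steps, in the spirit of Imagic-style editing, the code below assumes a diffusers-style unet and scheduler and a precomputed latent of the aligned colorized image; the step count, learning rate, and blend weight eta are illustrative values, not the paper's settings.

import torch
import torch.nn.functional as F

def optimize_embedding(init_emb, latents, unet, scheduler, steps=500, lr=1e-3):
    # Freeze the diffusion model and optimize the text embedding so that it
    # reconstructs the aligned colorized image.
    emb = init_emb.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy = scheduler.add_noise(latents, noise, t)
        noise_pred = unet(noisy, t, encoder_hidden_states=emb).sample
        loss = F.mse_loss(noise_pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()

def interpolate(optimized_emb, target_emb, eta=0.7):
    # Linearly blend the optimized and target embeddings; eta controls how
    # strongly the target color description is applied.
    return eta * target_emb + (1.0 - eta) * optimized_emb

Sweeping eta between 0 and 1 trades off reconstruction fidelity against how strongly the target color description is applied.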
