Zero-shot Image-to-Image Translation
Authors: Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, Jun-Yan Zhu
Abstract: Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.
What, Why and How
Here is a summary of the key points from this paper:
What:
- The paper proposes a new method called pix2pix-zero for image-to-image translation using pre-trained text-to-image diffusion models.
Why:
- Existing text-to-image models struggle to edit real images as they require perfect text prompts and often alter unwanted regions.
- Current image-to-image translation methods need task-specific training which is expensive.
How:
- The method automatically discovers editing directions in CLIP embedding space between source and target words (e.g. cat -> dog) that work for diverse images.
- It uses cross-attention guidance to encourage consistency between the attention maps before and after editing to preserve structure.
- The pretrained diffusion model can directly apply the edit direction and attention guidance without additional training.
- It improves results via noise regularization during inversion and distills the diffusion model into a fast GAN.
- Experiments show it outperforms baselines in edit quality and structure preservation on diverse tasks.
In summary, this paper introduces a training-free and prompt-free image-to-image translation approach using pre-trained text-to-image diffusion models. It automatically discovers robust edit directions and uses cross-attention guidance to apply edits while preserving structure. The method achieves strong performance without task-specific training.
Main Contributions
Based on my reading, the main contributions of this paper are:
- An automatic editing direction discovery mechanism that finds robust edit vectors between source and target words/domains without requiring manual text prompts.
- A cross-attention guidance technique that encourages consistency between the attention maps before and after editing to preserve structure.
- Noise regularization during DDIM inversion, combining an autocorrelation term and a KL-divergence term, to keep the inverted noise maps Gaussian and improve editability (a simplified sketch follows below).
- Conditional GAN distillation to accelerate the diffusion model and enable real-time editing.
- Demonstrating training-free, prompt-free image-to-image translation on diverse tasks with both real and synthetic images using a pre-trained text-to-image diffusion model.
- Comparisons showing the approach outperforms existing and concurrent baselines in edit quality and content preservation.
- Detailed analysis of the algorithm via ablations on key components.
In summary, the key contributions are: 1) Automatically discovering robust edit directions without prompts 2) Using cross-attention guidance for structure preservation 3) Improving editability via noise regularization 4) Accelerating the model via GAN distillation and 5) Achieving strong image-to-image translation without task-specific training.
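To make the noise-regularization contribution concrete, here is a minimal PyTorch sketch of such a regularizer: an auto-correlation term that penalizes spatial correlations in the predicted noise, plus a divergence term that pushes it toward zero mean and unit variance. This is a simplified illustration under stated assumptions (a few fixed spatial shifts instead of the paper's multi-scale pairwise loss; the function name and the weight lambda_kl are made up here), not the paper's exact implementation.
import torch

def noise_regularization(eps, lambda_kl=20.0):
    # eps: predicted noise map, shape (B, C, H, W).
    # Auto-correlation term: white noise should be uncorrelated with shifted copies of itself.
    ac = 0.0
    for shift in (1, 2, 4, 8):
        ac = ac + (eps * torch.roll(eps, shifts=shift, dims=2)).mean() ** 2
        ac = ac + (eps * torch.roll(eps, shifts=shift, dims=3)).mean() ** 2
    # Divergence term: push the noise toward zero mean and unit variance.
    mu, var = eps.mean(), eps.var()
    kl = 0.5 * (mu ** 2 + var - 1.0 - torch.log(var + 1e-8))
    return ac + lambda_kl * kl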
Method Section
Here is a summary of the method section from the paper:
The overall goal is image-to-image translation where the user specifies an edit direction like cat -> dog.
- Invert Real Images:
  - Use deterministic DDIM inversion to map the input image to a noise latent that reconstructs it.
  - Apply noise regularization, combining an auto-correlation loss and a KL loss, so the inverted noise maps stay close to Gaussian.
- Discover Edit Directions:
  - Generate a bank of sentences containing the source and target words (e.g. cat, dog) using GPT-3.
  - Compute the difference between the mean CLIP text embeddings of the target and source sentences as the edit direction (a concrete CLIP-based sketch follows below).
- Editing via Cross-Attention Guidance:
  - First run a reconstruction pass with the original text to obtain reference cross-attention maps.
  - Then sample with the edited text embedding (the original embedding plus the edit direction).
  - At each diffusion step, take a gradient step on the latent that reduces the difference between the current and reference cross-attention maps.
  - This cross-attention guidance retains the structure of the input image.
- Model Acceleration:
  - Collect input-output pairs from the diffusion-based editing pipeline.
  - Train an image-conditional GAN (CoModGAN) on these pairs.
  - This provides a fast approximation to the diffusion model.
In summary, the key steps are inverting the input image, discovering robust edit directions from text, editing using cross-attention guidance, and distilling a fast GAN model. The method achieves strong image-to-image translation results without needing additional training.
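As a concrete illustration of the edit-direction step, the sketch below computes a cat -> dog direction as the mean difference of CLIP text embeddings using Hugging Face transformers. This is a hedged sketch: the paper generates large sentence banks with GPT-3 and uses the text encoder of the underlying diffusion model, whereas the hand-written sentences and the specific model name below are assumptions for illustration.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Text encoder commonly paired with Stable-Diffusion-style models (assumption for this sketch).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

@torch.no_grad()
def embed(sentences):
    # Per-token CLIP text embeddings, shape (num_sentences, 77, 768).
    tokens = tokenizer(sentences, padding="max_length", truncation=True, return_tensors="pt")
    return text_encoder(tokens.input_ids).last_hidden_state

# In the paper these sentence banks come from GPT-3; a few hand-written examples stand in here.
src = ["a photo of a cat", "a cat sitting on a sofa", "a painting of a cat"]
tgt = ["a photo of a dog", "a dog sitting on a sofa", "a painting of a dog"]
edit_dir = embed(tgt).mean(dim=0) - embed(src).mean(dim=0)  # (77, 768) edit direction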
High-Level Pseudo Code
Here is the high-level pseudo code for the key components of the method proposed in this paper:
# Invert real image
inv_latent = DDIM_Inversion(input_img)
inv_latent = Regularize_Noise(inv_latent)

# Discover edit direction
src_sentences = Generate_Sentences(source_word)   # e.g. "cat"
tgt_sentences = Generate_Sentences(target_word)   # e.g. "dog"
edit_dir = Mean_CLIP_Difference(src_sentences, tgt_sentences)

# Reconstruction pass: collect reference cross-attention maps
text_emb = CLIP_Text_Embed(input_text)
ref_attn_maps = Diffusion_Sampling(inv_latent, text_emb)

# Editing pass with cross-attention guidance
edit_text_emb = text_emb + edit_dir               # the edit is applied in text-embedding space
latent = inv_latent
for t in timesteps:
    attn_map = Cross_Attention(latent, t, edit_text_emb)
    loss = MSE_Loss(ref_attn_maps[t], attn_map)
    latent = Gradient_Step(latent, loss)          # nudge the latent toward the reference maps
    latent = Denoise_Step(latent, t, edit_text_emb)
output_img = Decode(latent)

# Model acceleration
input_imgs, edit_imgs = Collect_Diffusion_Pairs()
fast_model = Train_CoModGAN(input_imgs, edit_imgs)
This shows the key steps of inverting the input image, discovering the editing direction from text embeddings, performing diffusion sampling with cross-attention guidance, and distilling a fast conditional GAN model. The pseudo code covers the core components of the image-to-image translation approach.
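To make the guidance loop above more concrete, here is a hedged PyTorch sketch of a single edited denoising step with cross-attention guidance. It assumes a hypothetical unet_with_attn(x, t, text_emb) helper that returns the predicted noise together with the UNet's cross-attention maps (not a standard diffusers API) and a diffusers-style scheduler whose step() returns the previous latent; guidance_lr is an assumed step size.
import torch
import torch.nn.functional as F

def guided_denoise_step(unet_with_attn, scheduler, x_t, t, edit_text_emb,
                        ref_attn_t, guidance_lr=0.1):
    # 1) Gradient step on the latent so the edited cross-attention maps stay
    #    close to the reference (reconstruction) maps at this timestep.
    x_t = x_t.detach().requires_grad_(True)
    _, attn = unet_with_attn(x_t, t, edit_text_emb)
    loss = F.mse_loss(attn, ref_attn_t)
    grad = torch.autograd.grad(loss, x_t)[0]
    x_t = (x_t - guidance_lr * grad).detach()

    # 2) Standard denoising step with the updated latent and the edited text embedding.
    with torch.no_grad():
        eps, _ = unet_with_attn(x_t, t, edit_text_emb)
    return scheduler.step(eps, t, x_t).prev_sample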
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the key components of the method proposed in this paper:
# Imports
import numpy as np
import clip, gpt3, diffusion   # placeholder modules for CLIP, GPT-3, and the diffusion model
# Hyperparameters
num_sentences = 100
num_diffusion_steps = 100
# Invert real image
def DDIM_Inversion(img, text_emb):
    latent = Encode(img)                           # encode the image into the latent space
    for t in timesteps:                            # iterate from low to high noise levels
        pred_noise, _ = Denoising_UNet(latent, t, text_emb)
        latent = DDIM_Inverse_Step(latent, pred_noise, t)
    return latent

def Regularize_Noise(latent):
    for it in range(num_reg_iterations):
        noise = Get_Noise(latent)
        autocorr_loss = AutoCorrelation_Loss(noise)   # penalize spatial correlations
        kl_loss = KL_Divergence_Loss(noise)           # push toward zero mean, unit variance
        reg_loss = autocorr_loss + lambda_kl * kl_loss
        latent = Optimize(latent, reg_loss)
    return latent
# Discover edit direction
def Generate_Sentences(word):
    # Ask GPT-3 for a diverse set of sentences containing the given word.
    return gpt3.generate(num_sentences, word)

def Mean_CLIP_Difference(src_sentences, tgt_sentences):
    src_emb = np.array([clip(s) for s in src_sentences])
    tgt_emb = np.array([clip(s) for s in tgt_sentences])
    direction = np.mean(tgt_emb, axis=0) - np.mean(src_emb, axis=0)
    return direction / np.linalg.norm(direction)
# Editing via cross-attention: sampling that also records attention maps
def Diffusion_Sampling(latent, text_emb):
    attn_maps = []
    for t in timesteps:
        pred_noise, attn = Denoising_UNet(latent, t, text_emb)
        latent = DDIM_Step(latent, pred_noise, t)
        attn_maps.append(attn)
    img = Decode(latent)
    return img, attn_maps
# Full editing pipeline
input_img = Load_Image()
init_text = Generate_Text(input_img)              # e.g. an automatic caption of the input image
src_word, tgt_word = 'cat', 'dog'
edit_dir = Mean_CLIP_Difference(Generate_Sentences(src_word),
                                Generate_Sentences(tgt_word))

# Invert the input image and regularize the inverted noise
text_emb = clip(init_text)
inv_latent = Regularize_Noise(DDIM_Inversion(input_img, text_emb))

# Reconstruction pass to collect reference attention maps
ref_img, ref_attns = Diffusion_Sampling(inv_latent, text_emb)

# Editing pass with cross-attention guidance
edit_text_emb = text_emb + edit_dir               # edit applied to the text embedding
latent = inv_latent
for t in timesteps:
    _, attn = Denoising_UNet(latent, t, edit_text_emb)
    attn_loss = MSE(ref_attns[t], attn)
    latent = Gradient_Step(latent, attn_loss)     # keep attention close to the reference
    pred_noise, _ = Denoising_UNet(latent, t, edit_text_emb)
    latent = DDIM_Step(latent, pred_noise, t)
output_img = Decode(latent)
# Model distillation
inputs, outputs = Sample_Diffusion_Model()
fast_model = Train_CoModGAN(inputs, outputs)
This shows a more detailed pseudo code with example functions, loss computations, and training loops. It covers inverting the image, discovering the edit direction, diffusion sampling with attention guidance, and distilling the fast GAN model.
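Finally, a small sketch of the data-collection half of the model-acceleration step: the slow diffusion-based editor is run offline to build a paired dataset on which a fast image-conditional GAN such as CoModGAN can then be trained. The edit_fn callable and the TensorDataset wrapper are assumptions for illustration; the GAN training loop itself is omitted.
import torch
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def collect_distillation_pairs(edit_fn, images):
    # edit_fn: the full diffusion-based editing pipeline (inversion + guided sampling),
    # applied offline to each input image to produce its edited counterpart.
    inputs = torch.stack(list(images))
    outputs = torch.stack([edit_fn(img) for img in images])
    return TensorDataset(inputs, outputs)

# The resulting pairs supervise a fast conditional generator mapping input -> edited image, e.g.:
# pairs = collect_distillation_pairs(pix2pix_zero_edit, image_batch)  # pix2pix_zero_edit is hypothetical
# loader = DataLoader(pairs, batch_size=16, shuffle=True)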