Visual Instruction Inversion: Image Editing via Visual Prompting

Authors: Thao Nguyen, Yuheng Li, Utkarsh Ojha, Yong Jae Lee

Abstract: Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective in describing specific image edits. When faced with such challenges, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given example pairs that represent the “before” and “after” images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images. We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions. Our results show that with just one example pair, we can achieve competitive results compared to state-of-the-art text-conditioned image editing frameworks.

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper proposes a new method for image editing via visual prompting.
  • Given a pair of “before” and “after” images showing an edit, the method learns to invert this visual prompt into a text-based editing instruction.
  • The text instruction captures the edit direction and can be applied to new images.

Why:

  • Language can be ambiguous for describing image edits. Visual prompts are more intuitive.
  • Leverages powerful pre-trained text-to-image diffusion models by inverting visual prompts into text instructions.
  • Avoids the need for a large paired dataset of examples for each edit type.

How:

  • Uses frozen pretrained InstructPix2Pix model for image editing.
  • Optimizes text instruction to reconstruct “after” image from “before” image.
  • Adds CLIP embedding loss to align instruction with edit direction.
  • Learned instruction can be applied to new images for editing.
  • Allows concatenating extra text prompt with learned instruction.

In summary, the key ideas are learning to invert visual prompts into text instructions for editing, leveraging capabilities of pre-trained models, and enabling intuitive control via visual examples rather than possibly ambiguous language descriptions.

Main Contributions

Here are the main contributions of this paper:

  • Proposes a new framework for image editing via visual prompting, using text-to-image diffusion models.

  • Introduces an approach to invert the visual edit transformation specified by a before-and-after image pair into a text instruction.

  • Leverages the textual instruction space and editing capabilities of the pretrained InstructPix2Pix model.

  • Achieves competitive quantitative results compared to state-of-the-art text-conditioned image editing methods, even with just one example pair.

  • Provides insights on model design choices through ablation studies:

    • Both MSE and CLIP losses are important for learning effective instructions.
    • Initializing instructions from image captions works better than initializing from the ground-truth text.
    • Reusing the training noise at test time leads to more balanced edits.
    • Concatenating extra text with learned instructions enables more control.
  • Demonstrates the advantage of visual prompting over textual instructions for conveying edits through qualitative examples.

  • Discusses limitations and open questions, such as reliance on a pretrained model, sensitivity to example pairs, and exploring the reverse direction of using diffusion models for visual prompting.

In summary, the main contributions are proposing the visual prompt inversion framework, achieving strong results, and providing analysis for guiding research in image editing with visual prompts.

Method Section

Here is a summary of the method section from the paper:

The goal is to learn a text instruction cT that captures the editing direction from a before image x to an after image y.

The frozen, pretrained InstructPix2Pix model is used; it takes a text instruction, the before image, and sampled noise, and generates an edited image.

Two losses are used to optimize cT (written out in the sketch after this list):

  • MSE loss to reconstruct y from x conditioned on cT. Helps cT reproduce the edit.

  • CLIP loss to align cT with the embedding change from x to y. Encourages learning the edit direction.
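
Written out with the notation used in this summary, the combined objective is roughly the following (a sketch; the exact projection of cT into CLIP space and the loss weighting follow the paper's definitions):

L_{mse}(c_T)  = \| \epsilon - \epsilon_\theta(y_t, t, E(x), c_T) \|_2^2
L_{clip}(c_T) = 1 - \cos( c_T, \, E_{clip}(y) - E_{clip}(x) )
L(c_T)        = \lambda_{mse} L_{mse}(c_T) + \lambda_{clip} L_{clip}(c_T)

where \epsilon_\theta is the frozen InstructPix2Pix denoiser, y_t is the noised latent of y at timestep t, E is the latent encoder, and E_{clip} is the CLIP image encoder.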

cT is initialized from a caption of y using an image captioning model. This provides a better starting point than initializing from the ground-truth text.

Only a fixed number of tokens in cT are updated during optimization. This allows concatenating extra text prompts during inference.
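
One way this can be realized (an assumption about the implementation, with encode_caption as a hypothetical stand-in for the conditioning produced by InstructPix2Pix's CLIP text encoder) is to keep a small slice of continuous token embeddings trainable while the rest of the instruction stays frozen:

import torch

n_learn, seq_len, dim = 10, 77, 768  # assumed CLIP text-encoder dimensions for SD-based InstructPix2Pix

def encode_caption(caption):
    # hypothetical helper: returns a (1, seq_len, dim) conditioning tensor for `caption`;
    # a random placeholder here so the sketch runs standalone
    return torch.randn(1, seq_len, dim)

init = encode_caption("a caption of the after image")     # initialization from a caption of y
learned = torch.nn.Parameter(init[:, :n_learn].clone())   # only these tokens receive gradients
frozen = init[:, n_learn:].detach()                       # the remaining tokens are kept fixed

def current_instruction():
    # re-assemble cT each step so that gradients flow only into `learned`
    return torch.cat([learned, frozen], dim=1)

# At inference, extra user text can be encoded and concatenated in the same way, e.g.
# cT_test = torch.cat([learned.detach(), encode_caption("with snow falling")], dim=1)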

The noise sequence sampled during training is reused at test time for denoising. This balances applying the edit against preserving x.
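
A minimal sketch of one way to implement this noise reuse (an assumption, not the authors' exact bookkeeping): record the noise sampled during optimization and feed the same noise back in when denoising a new image.

import torch

noise_bank = []                        # filled during the optimization loop

# inside the training loop:
epsilon = torch.randn(1, 4, 64, 64)    # assumed latent shape for a 512x512 image
noise_bank.append(epsilon)

# at test time, start denoising from a stored training noise
# instead of drawing a fresh sample:
init_noise = noise_bank[-1]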

Once optimized, cT captures the edit from x to y. It can be applied to a new image x’ to transform it similarly. Users can concatenate extra text prompts with cT for more control.

In summary, the key steps are inverting the before-and-after example into a text instruction using MSE and CLIP losses, initializing the instruction from a caption, and reusing the training noise to produce a balanced edit. The learned instruction can then guide the editing of new images.

High-Level Pseudo Code

Here is the high-level pseudocode for the method presented in the paper:

# Given before image x, after image y
# Frozen pretrained model Epsilon, image (latent) encoder E, CLIP encoder E_clip
# Hyperparams: num_steps, num_timesteps, lr, lambda_mse, lambda_clip

# Initialize instruction cT (e.g. from a caption of y)
cT = initialize_instruction(y)

for i in range(num_steps):

  # Sample timestep and noise
  t = sample_timestep(num_timesteps)
  epsilon = sample_noise()

  # Get noisy latent of y
  y_t = add_noise(E(y), t, epsilon)

  # Predict noise residual, conditioned on cT and the before image
  epsilon_hat = Epsilon(y_t, t, cT, E(x))

  # Compute losses
  mse_loss = ||epsilon - epsilon_hat||_2^2
  clip_loss = 1 - cosine_sim(cT, E_clip(y) - E_clip(x))
  loss = lambda_mse * mse_loss + lambda_clip * clip_loss

  # Update instruction (the diffusion model stays frozen)
  cT = cT - lr * gradient(loss, cT)

# cT now captures the edit direction x -> y
# Apply cT to a new image x': denoise from the reused training noise,
# conditioned on cT and E(x')
y' = denoise(Epsilon, epsilon, cT, E(x'))

The key steps are:

  1. Initialize the instruction cT (e.g. from a caption)
  2. Iteratively update cT by optimizing MSE and CLIP loss
  3. Apply learned cT to new images x’ to achieve the edit

This learns to invert the visual prompt into a text instruction cT that can be used to guide editing new images.
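
Note that the update "cT = cT - lr * gradient(loss, cT)" is not applied to discrete text; in practice it is an optimizer step on continuous instruction embeddings while the diffusion model stays frozen. A minimal, self-contained PyTorch sketch (the loss is a placeholder standing in for the MSE + CLIP objective above):

import torch

# hypothetical learnable instruction: 10 tokens in a 768-d conditioning space
cT = torch.nn.Parameter(0.02 * torch.randn(1, 10, 768))
optimizer = torch.optim.Adam([cT], lr=1e-3)

for step in range(3):
    loss = cT.pow(2).mean()    # placeholder for lambda_mse * mse_loss + lambda_clip * clip_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()           # only cT is updated; the InstructPix2Pix weights never change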

Detailed Pseudo Code

Here is a more detailed pseudocode for implementing the approach proposed in the paper:

# Inputs
example_pair = {x, y} # before and after example images
model = InstructPix2Pix # frozen pretrained model
E = ImageEncoder() # encodes image to latent space
E_clip = CLIPEncoder() # encodes image to CLIP space
 
# Hyperparameters
num_steps = 1000 
num_timesteps = 1000
lr = 0.001
lambda_mse = 4
lambda_clip = 0.1
 
# Initialize instruction (in practice cT is the sequence of token embeddings,
# so that it can receive gradients during optimization)
text_prompt = '<|startoftext|>'
caption = get_caption(y) # e.g. using an image captioning model
ins = initialize_text(caption, 10) # keep 10 learnable instruction tokens
cT = text_prompt + ins + '<|endoftext|>'
 
# Optimize (the InstructPix2Pix model stays frozen; only cT is updated)
for i in range(num_steps):

  t = random.randint(0, num_timesteps - 1) # sample timestep
  epsilon = random_noise() # sample noise

  # Get noisy latent of the after image
  z_y = E(y)
  z_y_t = add_noise(z_y, t, epsilon)

  # Model forward pass, conditioned on cT and the before image latent
  e_pred = model(z_y_t, t, cT, E(x))

  # Losses
  mse_loss = ||epsilon - e_pred||_2^2
  clip_loss = 1 - cosine_sim(cT, E_clip(y) - E_clip(x))
  loss = lambda_mse*mse_loss + lambda_clip*clip_loss

  # Update
  cT = cT - lr * gradient(loss, cT)

# Apply to a new image
x_new = new_image()
z_xnew = E(x_new)
z_ynew_t = add_noise(z_xnew, t, epsilon) # reuse the training noise for a balanced edit

y_new = model.decode(z_ynew_t, cT, z_xnew) # iterative denoising, then decode to image space

The key aspects covered:

  • Initializing cT from an image caption
  • Getting noisy latent vectors
  • Computing MSE and CLIP loss
  • Updating cT based on losses
  • Applying cT + same noise to new image

This implements the instruction optimization and application process in detail.
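
For readers who want to experiment on top of an off-the-shelf implementation, the sketch below shows how a learned conditioning tensor could be plugged into the Hugging Face diffusers InstructPix2Pix pipeline at inference time. This is an assumed wiring, not the authors' released code: learned_embeds stands in for a tensor produced by an optimization loop like the one above, the 77 x 768 shape assumes the SD 1.5-based text encoder, and prompt_embeds support may depend on your diffusers version.

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Placeholder for the optimized instruction embeddings (assumed shape:
# batch x sequence length x hidden size, matching the pipeline's text encoder output).
learned_embeds = torch.zeros(1, 77, 768, dtype=pipe.unet.dtype, device=device)

x_new = Image.open("before.jpg").convert("RGB").resize((512, 512))

# Apply the learned instruction to a new image; the two guidance scales trade off
# following the instruction against preserving the input image.
result = pipe(
    prompt_embeds=learned_embeds,
    image=x_new,
    num_inference_steps=50,
    guidance_scale=7.5,
    image_guidance_scale=1.5,
)
result.images[0].save("after.jpg")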