Iterative Multi-granular Image Editing using Diffusion Models

Authors: K J Joseph, Prateksha Udhayanan, Tripti Shukla, Aishwarya Agarwal, Srikrishna Karanam, Koustava Goswami, Balaji Vasan Srinivasan

Abstract: Recent advances in text-guided image synthesis have dramatically changed how creative professionals generate artistic and aesthetically pleasing visual assets. To fully support such creative endeavors, the process should possess the ability to: 1) iteratively edit the generations and 2) control the spatial reach of desired changes (global, local, or anything in between). We formalize this pragmatic problem setting as Iterative Multi-granular Editing. While there has been substantial progress with diffusion-based models for image synthesis and editing, they are all one-shot (i.e., no iterative editing capabilities) and do not naturally yield multi-granular control (i.e., covering the full spectrum of local-to-global edits). To overcome these drawbacks, we propose EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent iteration strategy, which re-purposes a pre-trained diffusion model to facilitate iterative editing. This is complemented by a gradient control operation for multi-granular control. We introduce a new benchmark dataset to evaluate our newly proposed setting. We conduct exhaustive quantitative and qualitative evaluation against recent state-of-the-art approaches adapted to our task, to bring out the mettle of EMILIE. We hope our work will attract attention to this newly identified, pragmatic problem setting.

What, Why and How

Here is a summary of the key points about the paper “Iterative Multi-granular Image Editing using Diffusion Models”:

What:

  • The paper proposes a new problem setting called “Iterative Multi-granular Image Editing” where the goal is to iteratively edit an image according to a sequence of text instructions, while controlling the spatial locality of each edit (global or local).

  • The paper introduces a method called EMILIE (Iterative Multi-granular Image Editor) to address this problem using diffusion models. EMILIE does not require retraining the diffusion model.

Why:

  • Existing diffusion-model-based image editors are one-shot: they cannot iteratively edit an image over a sequence of instructions.

  • Controlling the spatial locality of edits is useful for creative workflows but is lacking in current systems.

  • The proposed iterative multi-granular image editing problem aims to overcome these limitations.

How:

  • EMILIE uses a “latent iteration” strategy: instead of re-encoding the decoded image for each edit, the latent from the previous edit is passed forward to the next. This reduces noise accumulation over iterations (a short sketch follows this list).

  • By interpreting the diffusion model as an energy-based model, EMILIE modulates the model’s gradients (noise predictions) during sampling to control the locality of edits.

  • Experiments on a new benchmark dataset IMEBench demonstrate EMILIE’s capabilities for iterative and spatially-controlled image editing compared to adapted baselines.
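
To make latent iteration concrete, here is a minimal, hypothetical sketch contrasting naive image iteration with latent iteration. The names encoder, decoder, and edit_latent are placeholders for the VQ-VAE encoder/decoder and the diffusion denoising loop, not the paper’s actual API.

def edit_by_image_iteration(image, instructions):
    # Naive approach: decode and re-encode on every round, so VQ-VAE
    # reconstruction error and residual noise accumulate across edits.
    for instr in instructions:
        z = encoder(image)
        z = edit_latent(z, instr)     # run the diffusion denoising loop
        image = decoder(z)
    return image

def edit_by_latent_iteration(image, instructions):
    # EMILIE-style latent iteration: encode once, carry the latent forward
    # between edits, and decode only to produce each output image.
    z = encoder(image)
    outputs = []
    for instr in instructions:
        z = edit_latent(z, instr)
        outputs.append(decoder(z))    # decode for display; z is never re-encoded
    return outputs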

In summary, the key contribution is the introduction and addressing of a new iterative multi-granular image editing problem using diffusion models in a training-free manner to benefit creative workflows.

Main Contributions

Here are the key contributions of the paper “Iterative Multi-granular Image Editing using Diffusion Models”:

  1. Formulates a new problem setting called “Iterative Multi-granular Image Editing” motivated by practical use cases in creative workflows. This involves iteratively editing an image according to a sequence of instructions while controlling the spatial locality of edits.

  2. Proposes a method called EMILIE (Iterative Multi-granular Image Editor) to address this problem using diffusion models without needing to retrain them.

  3. Introduces a “latent iteration” strategy to facilitate iterative editing by passing the latent vector instead of the image between steps. This reduces noise accumulation.

  4. Interprets the diffusion model as an energy-based model, enabling control over the locality of edits by modulating gradients during sampling.

  5. Introduces a new benchmark dataset called IMEBench to evaluate iterative multi-granular image editing.

  6. Demonstrates EMILIE’s capabilities both quantitatively and qualitatively on IMEBench and another dataset compared to adapted baselines.

In summary, the key contributions are:

  • Formulating a new iterative multi-granular image editing problem.
  • Proposing the EMILIE method to address it without retraining diffusion models.
  • Latent iteration strategy and gradient modulation for iterative and localized editing.
  • New IMEBench benchmark dataset.
  • Extensive experiments demonstrating EMILIE’s efficacy.

Method Section

Here is a summary of the method section from the paper “Iterative Multi-granular Image Editing using Diffusion Models”:

The paper proposes a method called EMILIE (Iterative Multi-granular Image Editor) to enable iterative and spatially controlled image editing using diffusion models without retraining them.

EMILIE has two key components:

  1. Latent Iteration:
  • Normally, passing an image recursively through a diffusion model accumulates noise.
  • Instead, EMILIE passes the latent vector from the previous step as input for the next.
  • This latent iteration significantly reduces noise buildup over iterations.
  2. Gradient Modulation for Multi-granular Control:
  • The diffusion model is interpreted as an energy-based model.
  • The noise prediction is seen as the learned gradient of the energy function.
  • User-provided masks are used to zero out gradients outside regions of interest.
  • This restricts updates to the desired local regions during sampling (a minimal sketch follows this list).
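
A minimal sketch of what gradient modulation amounts to inside a single sampling step, assuming mask is a binary region-of-interest mask already at latent resolution (1 inside the region to edit, 0 outside). The names predict_noise and sampler_update are placeholders, not the paper’s exact interface.

# Hypothetical single denoising step with masked gradient modulation.
eps = diffusion.predict_noise(z_t, t, instr)   # noise prediction, viewed as the energy gradient
if mask is not None:
    eps = eps * mask                           # zero the gradient outside the region of interest
z_t = sampler_update(z_t, eps, t)              # standard denoising / Langevin-style update

With mask omitted (or all ones), this reduces to an ordinary global edit.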

The overall algorithm is as follows:

  • Input image is encoded to latent vector z_img.
  • For each text instruction:
    • Sample new latent z_init ~ N(0,I)
    • Concatenate z_init and z_img
    • Denoise over T steps with gradient modulation
    • Decode denoised z0 to get edited image
    • Update z_img = z0 for next iteration

So in summary, the key ideas are latent iteration to reduce noise during iterative editing, and gradient modulation to control locality of edits, all without modifying the pretrained diffusion model.
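
Stated compactly in our own notation (the paper’s exact symbols are not reproduced here): with noise predictor $\epsilon_\theta$, binary latent-resolution mask $m$, decoder $\mathcal{D}$, and the $k$-th instruction $c_k$,

$$
\tilde{\epsilon}_t = m \odot \epsilon_\theta\big([z_t;\, z_{\text{img}}^{(k)}],\, t,\, c_k\big) \quad \text{(gradient modulation)}, \qquad z_{\text{img}}^{(k+1)} = z_0^{(k)} \quad \text{(latent iteration)},
$$

where $z_0^{(k)}$ is the fully denoised latent for instruction $k$, the edited image is $\mathcal{D}(z_0^{(k)})$, and setting $m \equiv 1$ recovers an unconstrained, global edit.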

High-Level Pseudo Code

Here is the high-level pseudo code for the method proposed in the paper “Iterative Multi-granular Image Editing using Diffusion Models”:

# Inputs
image = ...          # image to edit
instructions = ...   # list of text edit instructions
masks = ...          # optional list of masks controlling locality (None => global edit)

# Pretrained components (kept frozen; no retraining)
encoder = ...          # VQ-VAE encoder
decoder = ...          # VQ-VAE decoder
diffusion_model = ...  # pretrained instruction-conditioned diffusion model

for i, (instr, mask) in enumerate(zip(instructions, masks)):

    if i == 0:
        z_img = encoder(image)   # encode the input image only once
    else:
        z_img = prev_z           # latent iteration: reuse the previous latent

    z_init = sample_gaussian()         # fresh latent, z_init ~ N(0, I)
    z = concatenate(z_init, z_img)     # concatenate noise and image latents

    for t in range(T):                 # denoising loop
        z = denoise_step(z, instr, mask)   # gradient modulation when a mask is given

    image_edit = decoder(z)            # decode the edited image

    prev_z = z                         # carry the latent to the next iteration

The key steps are:

  1. Latent iteration by passing z instead of image between edits
  2. Gradient modulation in denoising loop for spatial control
  3. Encoding image to latent, denoising loop, and decoding back

This allows iterative and multi-granular image editing using a pretrained diffusion model without any training.

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the method proposed in the paper “Iterative Multi-granular Image Editing using Diffusion Models”:

# Hyperparameters
T = ...              # number of denoising steps
sigma_t = ...        # noise schedule, one sigma per step
step_size = ...      # Langevin step size

# Inputs
image = ...          # image to edit
instructions = [     # text instructions, applied in order
    "Add a red hat",
    "Make the hat bigger",
    ...,
]
masks = [            # optional binary masks (None => global edit)
    None,
    mask_for_hat_region,
    ...,
]

# Pretrained components (kept frozen)
encoder = VQVAEEncoder()
decoder = VQVAEDecoder()
diffusion = LatentDiffusion()

edited_images = []
z_img = encoder(image)                  # encode the input image once

for i, (instr, mask) in enumerate(zip(instructions, masks)):

    z_init = torch.randn_like(z_img)    # fresh Gaussian latent sample

    if i == 0:
        z_img_iter = z_img              # first edit starts from the encoded image
    else:
        z_img_iter = z_prev             # latent iteration: reuse the previous latent

    z_t = torch.cat([z_init, z_img_iter], dim=1)   # condition on the image latent

    for t in reversed(range(T)):

        sigma = sigma_t[t]

        # Noise prediction, interpreted as the learned gradient of an energy function
        eps = diffusion(z_t, t, instr)

        if mask is not None:
            eps = eps * mask            # gradient modulation: zero gradients outside the mask

        noise = torch.randn_like(z_t) * sigma       # scheduled stochastic term
        z_t = z_t - 0.5 * step_size * eps + noise   # Langevin-style update

    image_edit = decoder(z_t)           # decode the denoised latent
    edited_images.append(image_edit)

    z_prev = z_t                        # carry the latent forward to the next edit

# edited_images now holds one edited result per instruction

Key aspects:

  • Sample new latent and concatenate with previous image latent
  • Iteratively denoise concatenated latent using text and masks
  • Apply gradient modulation if a mask is provided (see the note on mask resolution below)
  • Langevin sampling step with scheduled noise
  • Iterate using denoised latent instead of image
  • Decode denoised latent to get final edited image
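
One practical detail the pseudo code glosses over: user-drawn masks typically live at image resolution, while gradient modulation is applied to latent-resolution tensors. A small, hypothetical helper using standard PyTorch ops (not taken from the paper) could resize the mask before it multiplies the noise prediction:

import torch
import torch.nn.functional as F

def mask_to_latent(mask_img: torch.Tensor, latent_h: int, latent_w: int) -> torch.Tensor:
    # mask_img: (H, W) binary tensor, 1 inside the region to edit, 0 elsewhere.
    # Returns a (1, 1, latent_h, latent_w) tensor that broadcasts against the
    # latent / noise prediction during gradient modulation.
    m = mask_img.float()[None, None]    # add batch and channel dims: (1, 1, H, W)
    return F.interpolate(m, size=(latent_h, latent_w), mode="nearest")

# Usage sketch inside the denoising loop:
#   eps = eps * mask_to_latent(mask, z_t.shape[-2], z_t.shape[-1])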