Watch Your Steps: Local Image and Scene Editing by Text Instructions

Authors: Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski

Abstract: Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field is trained on relevance maps of training views, denoted as the relevance field, defining the 3D region within which modifications should be made. We perform iterative updates on the training views guided by rendered relevance maps from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: WatchYourSteps

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper presents a method for localizing image and scene edits based on text instructions using denoising diffusion models.

  • It introduces “relevance maps” to predict the scope of an edit instruction on an image by looking at the discrepancy between conditional and unconditional predictions.

  • For 3D scene editing, it lifts 2D relevance maps into 3D using a “relevance field” trained on relevance maps from multiple views.

Why:

  • Localizing edits based on relevance can help confine changes to the most relevant regions and avoid over-editing.

  • Leveraging relevance for localization helps maintain fidelity and minimize unnecessary variability compared to global editing methods.

  • Extending localization to 3D editing improves consistency across views compared to per-frame editing.

How:

  • For images, calculate a relevance map as the absolute difference between the instruction-conditioned and unconditioned noise predictions.

  • Binarize the relevance map to get an edit mask and force unmasked pixels to remain unchanged during diffusion-based image editing.

  • For 3D scenes, train a relevance field on relevance maps from multiple views to predict a 3D relevance.

  • Use rendered relevance maps from the field to guide editing of individual views while updating the NeRF model.

Main Contributions

The main contributions of this paper are:

  • Proposing relevance maps to predict the scope of a text-based image editing instruction. This allows localizing where edits should be applied in an image.

  • Leveraging relevance maps to guide diffusion-based image editing and constrain changes to only the relevant regions. This results in more localized editing compared to prior methods.

  • Introducing relevance fields that lift 2D relevance maps into 3D for consistent localization when editing 3D scenes/NeRFs.

  • Achieving state-of-the-art performance on image and NeRF editing tasks using the proposed relevance-guided editing approach.

  • Demonstrating how relevance map prediction and guidance can avoid common issues like over-editing and lack of consistency that arise when using diffusion models for uncontrolled image and scene editing.

In summary, the key contributions are:

  1. Relevance maps for predicting edit scope
  2. Relevance-guided image editing for localization
  3. Relevance fields for consistent 3D localization
  4. State-of-the-art results on image and NeRF editing
  5. Avoiding over-editing and consistency issues with diffusion models

Method Section

Here is a summary of the method section from the paper:

Relevance Map Calculation:

  • Encode input image I and add noise to get noisy latent z_t
  • Get conditional noise prediction ε_{I,T} using IP2P with image I and text T
  • Get unconditional prediction ε_{I} with just image I
  • Calculate relevance map R as absolute difference between ε_{I,T} and ε_{I}
  • Clamp outliers and normalize R to [0,1]
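
As a concrete illustration, the relevance-map computation above can be sketched in PyTorch-style code. This is only a sketch under assumptions: ip2p_unet stands for the IP2P noise predictor, scheduler for any diffusion noise scheduler with an add_noise method, and the 0.98 clamping quantile and channel averaging are illustrative choices rather than values from the paper.

import torch

def relevance_map(latents, image, text, ip2p_unet, scheduler, t_rel, q=0.98):
  # noise the encoded input image to a fixed intermediate timestep t_rel
  noise = torch.randn_like(latents)
  noisy = scheduler.add_noise(latents, noise, torch.tensor([t_rel]))

  # instruction-conditioned vs. unconditioned noise predictions
  eps_cond = ip2p_unet(noisy, image, text)
  eps_uncond = ip2p_unet(noisy, image, "")

  # per-pixel absolute discrepancy, averaged over latent channels
  rel = (eps_cond - eps_uncond).abs().mean(dim=1, keepdim=True)

  # clamp outliers at a high quantile, then min-max normalize to [0, 1]
  rel = torch.minimum(rel, torch.quantile(rel, q))
  return (rel - rel.min()) / (rel.max() - rel.min() + 1e-8)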

Relevance-Guided Image Editing:

  • Add noise to encoded input x to get noisy latent z_t
  • Iteratively denoise z_t using IP2P to get z_{t-1}
  • Get IP2P prediction ε_t conditioned on I, T
  • Predict z_{t-1} using ε_t and DDIM
  • Binarize relevance map R to get edit mask M
  • Blend the predicted z_{t-1} with the original encoded input noised to level t-1, using M
  • This keeps pixels outside the mask fixed to their (noised) original values, so they remain unchanged in the decoded result
  • Decode final z_0 to get edited image
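
The per-step blend that keeps pixels outside the mask fixed can be written compactly. A minimal sketch, assuming a binary mask of the same shape as the latents and a scheduler exposing an add_noise method (the names are placeholders, not the paper's code):

import torch

def masked_update(z_pred_prev, z0_input, noise, mask, scheduler, t_prev):
  # z_pred_prev: latent predicted for step t-1 by the IP2P denoiser
  # z0_input:    clean latent of the original input image
  # mask:        1 where editing is allowed, 0 where pixels must stay unchanged
  # outside the mask, reset the latent to the original input noised to level t-1
  z_in_prev = scheduler.add_noise(z0_input, noise, torch.tensor([t_prev]))
  return mask * z_pred_prev + (1.0 - mask) * z_in_prev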

Relevance Field for Scene Editing:

  • Extend NeRF to predict a relevance r(x) at each 3D point
  • Render the relevance field with the main NeRF densities detached, so relevance supervision does not alter the scene geometry
  • Render relevance map R(r) by raymarching relevance field r(x)
  • Sample a view I_i from training set
  • Compute relevance map R_i for I_i with the image-space relevance calculation above
  • Add (I_i, R_i) to relevance field training data
  • Render the view and its relevance map from the current NeRF f_θ
  • Edit the rendered view, guided by the rendered relevance map, to obtain I_i’
  • Update NeRF f_θ with edited view I_i’
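
The rendered relevance map can be obtained with standard volume rendering over per-point relevance values. A rough sketch, assuming densities and relevance have already been sampled along each ray; detaching the density mirrors the bullet above (illustrative, not the paper's implementation):

import torch

def render_relevance(sigma, rel, deltas):
  # sigma:  (num_rays, num_samples) densities from the main NeRF
  # rel:    (num_rays, num_samples) per-point relevance values
  # deltas: (num_rays, num_samples) distances between consecutive samples
  sigma = sigma.detach()  # relevance supervision should not move the geometry
  alpha = 1.0 - torch.exp(-sigma * deltas)
  trans = torch.cumprod(
    torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
  )[:, :-1]
  weights = alpha * trans
  return (weights * rel).sum(dim=-1)  # one relevance value per ray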

High-Level Pseudo Code

Here is the high-level pseudo code for the method presented in the paper:

# Relevance Map Calculation
encoded_image = encode(input_image) 
noisy_latent = add_noise(encoded_image, t_rel)
cond_prediction = ip2p_predict(noisy_latent, input_image, text)
uncond_prediction = ip2p_predict(noisy_latent, input_image)
relevance_map = abs(cond_prediction - uncond_prediction)
relevance_map = clamp_outliers(relevance_map)
relevance_map = normalize(relevance_map)
 
# Relevance-Guided Image Editing
edit_mask = binarize(relevance_map)
noisy_latent = add_noise(encoded_input, t_edit)
for t in reversed(timesteps):
  prediction = ip2p_predict(noisy_latent, input_image, text)
  denoised = denoise(noisy_latent, prediction, t)
  # keep pixels outside the mask equal to the input noised to level t-1
  noisy_latent = blend(denoised, add_noise(encoded_input, t - 1), edit_mask)
edited_image = decode(noisy_latent)
 
# Relevance Field for Scene Editing
for view in training_views:
  rel_map = get_relevance_map(view, text)
  add_to_field_data(view, rel_map)

# iterative dataset update: edit one rendered view at a time
for _ in range(num_iterations):
  view_idx = random_view_index(training_views)
  rendered_view, rendered_rel = render_view(nerf, view_idx)
  edited_view = edit_image(rendered_view, text, rendered_rel)
  update_nerf(nerf, view_idx, edited_view)

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the method proposed in the paper:

# Relevance Map Calculation
def get_relevance_map(image, text):
  
  encoded = encoder(image)
  noisy = add_noise(encoded, t_rel)
  
  cond_pred = ip2p_unet(noisy, image, text)
  uncond_pred = ip2p_unet(noisy, image, '')
  
  relevance_map = abs(cond_pred - uncond_pred)
  relevance_map = clamp_outliers(relevance_map) 
  relevance_map = normalize(relevance_map)
  
  return relevance_map
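
# Possible implementations of the clamp_outliers and normalize helpers above
# (an assumption, not the paper's code): clamp at a high quantile, then
# min-max normalize to [0, 1].
import numpy as np

def clamp_outliers(x, q=0.98):
  return np.minimum(x, np.quantile(x, q))

def normalize(x):
  return (x - x.min()) / (x.max() - x.min() + 1e-8)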
 
# Relevance-Guided Image Editing  
def edit_image(image, text, relevance_map):

  encoded = encoder(image)
  noisy = add_noise(encoded, t_edit)

  # the edit mask is fixed for the whole denoising trajectory
  edit_mask = binarize(relevance_map, tau)

  for t in range(t_edit, 0, -1):

    pred = ip2p_unet(noisy, image, text)
    denoised = ddim(noisy, pred, t)

    # reset pixels outside the mask to the input noised to level t-1
    noisy = blend(denoised, add_noise(encoded, t - 1), edit_mask)

  edited = decoder(noisy)

  return edited
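
# Possible implementations of the binarize and blend helpers above
# (assumptions, not the paper's code): threshold the relevance map at tau,
# and reset everything outside the mask to the reference latent.
def binarize(relevance_map, tau):
  return (relevance_map > tau).astype(relevance_map.dtype)

def blend(edited, reference, mask):
  return mask * edited + (1 - mask) * reference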
 
# Relevance Field for Scene Editing
def train_relevance_field(nerf, views, text):

  field_data = []

  for view in views:
    relevance_map = get_relevance_map(view, text)
    field_data.append((view, relevance_map))

  # fit the relevance head on (view, relevance_map) pairs, rendering relevance
  # with detached densities so the scene geometry is not affected
  field = fit_field(nerf, field_data)

  return field
 
def edit_scene(nerf, views, field, text):

  for _ in range(num_iterations):

    view_idx = random_view_index(views)

    # render the current view and its relevance map from the relevance field
    rendered_view, rendered_rel = render(nerf, field, view_idx)

    edited_view = edit_image(rendered_view, text, rendered_rel)

    # replace this training view with its edited version and update the NeRF
    update_nerf(nerf, view_idx, edited_view)
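
# Example usage (illustrative only; input_image, nerf, training_views and the
# instructions below are placeholders, not values from the paper).
instruction = "make it look like autumn"
rel = get_relevance_map(input_image, instruction)
edited_image = edit_image(input_image, instruction, rel)

scene_instruction = "turn the statue into bronze"
field = train_relevance_field(nerf, training_views, scene_instruction)
edit_scene(nerf, training_views, field, scene_instruction)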