DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

Authors: Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent Y. F. Tan, Song Bai

Abstract: Precise and controllable image editing is a challenging task that has attracted significant attention. Recently, DragGAN introduced an interactive point-based image editing framework and achieves impressive editing results with pixel-level precision. However, since this method is based on generative adversarial networks (GANs), its generality is upper-bounded by the capacity of the pre-trained GAN models. In this work, we extend such an editing framework to diffusion models and propose DragDiffusion. By leveraging large-scale pretrained diffusion models, we greatly improve the applicability of interactive point-based editing in real-world scenarios. While most existing diffusion-based image editing methods work on text embeddings, DragDiffusion optimizes the diffusion latent to achieve precise spatial control. Although diffusion models generate images in an iterative manner, we empirically show that optimizing the diffusion latent at one single step suffices to generate coherent results, enabling DragDiffusion to complete high-quality editing efficiently. Extensive experiments across a wide range of challenging cases (e.g., multiple objects, diverse object categories, various styles, etc.) demonstrate the versatility and generality of DragDiffusion.

What, Why and How

Here is a summary of the key points of this paper:

What:

  • This paper proposes DragDiffusion, a method for interactive point-based image editing using diffusion models.

Why:

  • Existing point-based editing methods like DragGAN rely on GANs, so their generality is limited by the capacity of the pre-trained GAN.
  • Large-scale pretrained diffusion models cover a much broader range of images, so extending point-based editing to them greatly improves generality.

How:

  • Finetune a LoRA on the diffusion model to reconstruct the input image and preserve its identity/style.
  • Obtain the diffusion latent at an intermediate step t via DDIM inversion.
  • Iteratively optimize the latent at step t, alternating motion supervision and point tracking.
  • Motion supervision pulls the features around each handle point toward the corresponding target point in the UNet feature space (the objective is written out after this list).
  • Point tracking relocates the handle points after each latent update.
  • A regularization term keeps the unmasked region unchanged.
  • The final result is obtained by DDIM-denoising the optimized latent.
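
For reference, the motion supervision and point tracking steps above can be written roughly as follows; this is a paraphrase of the paper's objectives, not a verbatim copy, and the notation may differ slightly:

\mathcal{L}_{\mathrm{ms}}(\hat{z}_t) = \sum_{i=1}^{n} \sum_{q \in \Omega(h_i, r_1)} \big\| F_{q + d_i}(\hat{z}_t) - \mathrm{sg}\big(F_q(\hat{z}_t)\big) \big\|_1 + \lambda \, \big\| (\hat{z}_{t-1} - z_{t-1}) \odot (1 - M) \big\|_1, \qquad d_i = \frac{g_i - h_i}{\lVert g_i - h_i \rVert_2}

h_i \leftarrow \arg\min_{q \in \Omega(h_i, r_2)} \big\| F_q(\hat{z}_t) - F_{h_i^{0}}(z_t) \big\|_1

Here \hat{z}_t is the latent being optimized, F_q(\cdot) is the UNet feature at position q, h_i/g_i are the handle/target points, d_i is the unit direction from handle to target, M is the editable-region mask, \Omega(\cdot, r) is a square neighborhood of radius r, and \mathrm{sg} denotes stop-gradient.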

In summary, DragDiffusion enables precise spatial control for image editing using diffusion models, and shows strong generality across diverse cases such as multi-object scenes, varied object categories, and different image styles.

Main Contributions

Here are the main contributions of this paper:

  • Proposes DragDiffusion, an interactive point-based image editing method using diffusion models. This extends the editing capability of methods like DragGAN to leverage large pretrained diffusion models.

  • Achieves precise spatial control for editing by optimizing the diffusion latent at a certain step. Shows that manipulating a single step suffices for coherent editing.

  • Introduces techniques like LoRA finetuning and regularization to preserve object identity and style during editing.

  • Demonstrates versatility across challenging cases such as multi-object scenes, diverse object categories, and various styles through extensive experiments.

  • Provides strong qualitative results on real-world photos showing the generality of the framework.

In summary, the main contributions are proposing DragDiffusion for point-based editing with diffusion models, the technical approach that enables precise spatial control, and the extensive experiments showing generality across diverse cases. Together, these greatly improve the applicability of point-based editing to real-world scenarios.

Method Section

Here is a summary of the method section:

Overview:

  • Finetune a LoRA on the diffusion model to reconstruct the input image and preserve its identity/style
  • Obtain the diffusion latent at step t via DDIM inversion (a minimal sketch of one inversion step follows this list)
  • Iteratively optimize the latent at step t, alternating motion supervision and point tracking
  • Get the final result by DDIM-denoising the optimized latent
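
As a concrete illustration of the inversion step referenced above, here is a minimal sketch of one deterministic DDIM inversion step; `predict_noise` (the UNet's noise prediction) and the `alpha_bar` schedule are hypothetical placeholders, not the authors' code:

import torch

def ddim_invert_step(z, i, alpha_bar, predict_noise, cond):
    # One DDIM inversion step: map the latent at step i to step i + 1, i.e. add noise
    # deterministically along the DDIM trajectory.
    # alpha_bar: 1-D tensor of cumulative alphas (decreasing with the step index)
    # predict_noise(z, i, cond): the UNet's epsilon prediction (placeholder)
    eps = predict_noise(z, i, cond)
    a_cur, a_next = alpha_bar[i], alpha_bar[i + 1]
    z0_pred = (z - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()     # predicted clean latent
    return a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps

Repeating this step from the clean latent z_0 up to step t yields the z_t that DragDiffusion then edits.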

Motion Supervision:

  • Use the feature maps of the UNet's penultimate block for supervision
  • Minimize the L1 distance between the (stop-gradient) features around each handle point and the features one normalized step toward its target, which drags the handle content toward the target (see the sketch after this list)
  • A regularization term keeps the unmasked region unchanged
  • Update the latent by gradient descent on the supervision loss
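
Below is a minimal PyTorch-style sketch of this loss. The helpers `unet_features` (penultimate-block feature map) and `bilinear_sample` (feature lookup at possibly fractional positions), as well as the default weight lam=0.1, are assumptions made for illustration rather than the authors' implementation:

import torch

def motion_supervision_loss(z_hat, z_prev_hat, z_prev_orig, handles, targets,
                            mask, unet_features, bilinear_sample, r1=1, lam=0.1):
    # z_hat:       diffusion latent being optimized (requires grad)
    # z_prev_hat:  one DDIM denoising step applied to z_hat
    # z_prev_orig: the corresponding latent from the original inversion trajectory
    # mask:        1 inside the user-specified editable region, 0 elsewhere
    feats = unet_features(z_hat)                            # (C, H, W) feature map
    loss = z_hat.new_zeros(())
    for h, g in zip(handles, targets):                      # handle/target points as float tensors
        d = (g - h) / (g - h).norm()                        # unit step toward the target
        for dy in range(-r1, r1 + 1):                       # square patch around the handle point
            for dx in range(-r1, r1 + 1):
                q = h + torch.tensor([dy, dx], dtype=h.dtype, device=h.device)
                f_ref = bilinear_sample(feats, q).detach()  # stop-gradient reference feature
                f_mov = bilinear_sample(feats, q + d)       # feature one step toward the target
                loss = loss + (f_mov - f_ref).abs().sum()
    # regularization: keep the unmasked region close to the original denoising trajectory
    loss = loss + lam * ((z_prev_hat - z_prev_orig) * (1 - mask)).abs().sum()
    return loss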

Point Tracking:

  • Update each handle point via nearest-neighbor search in the UNet feature space, comparing against the feature of the original latent at the original handle position (see the sketch after this list)
  • This finds the new handle positions after each latent update during motion supervision
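
A matching sketch of the tracking step (illustrative only); note that the reference feature comes from the original, un-edited latent at the original handle position:

def track_handle(feats_new, ref_feat, handle, r2=3):
    # feats_new: (C, H, W) UNet feature map computed from the updated latent
    # ref_feat:  (C,) feature of the ORIGINAL latent at the original handle point
    # handle:    current integer (row, col) position of the handle point
    C, H, W = feats_new.shape
    best_pt, best_dist = handle, float("inf")
    for dy in range(-r2, r2 + 1):                 # square search window around the handle
        for dx in range(-r2, r2 + 1):
            y, x = handle[0] + dy, handle[1] + dx
            if 0 <= y < H and 0 <= x < W:
                dist = (feats_new[:, y, x] - ref_feat).abs().sum().item()   # L1 distance
                if dist < best_dist:
                    best_pt, best_dist = (y, x), dist
    return best_pt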

Implementation Details:

  • Use Stable Diffusion 1.5 as the base model
  • Finetune a rank-16 LoRA on the attention matrices for 200 steps (a minimal sketch of this step follows the list)
  • Use 50 DDIM steps in total and edit the latent of the 40th step
  • Optimize the latent with Adam at a learning rate of 0.01, without classifier-free guidance (CFG)
  • Hyperparameters r1 and r2 set the feature patch size and the point-tracking search region
  • Increase the regularization weight lambda if the unmasked region is not well preserved
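
To make the LoRA step concrete, here is a minimal sketch of the reconstruction objective: the standard denoising loss on the single input image, with only the LoRA parameters trainable. `unet_with_lora`, `vae_encode`, `text_emb`, and the learning rate are assumed placeholders, and the rank-16 LoRA setup on the attention matrices is omitted:

import torch
import torch.nn.functional as F

def lora_finetune(unet_with_lora, vae_encode, image, text_emb, alpha_bar, steps=200):
    # Finetune only the LoRA parameters so the model reconstructs the input image.
    z0 = vae_encode(image)                                   # clean latent of the input image
    lora_params = [p for p in unet_with_lora.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(lora_params, lr=2e-4)            # assumed learning rate, not from the paper
    for _ in range(steps):
        t = torch.randint(0, len(alpha_bar), (1,))           # random diffusion timestep
        eps = torch.randn_like(z0)
        a = alpha_bar[t].view(1, 1, 1, 1)
        z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps           # forward-noise the latent
        eps_pred = unet_with_lora(z_t, t, text_emb)          # noise prediction with LoRA active
        loss = F.mse_loss(eps_pred, eps)                     # standard denoising loss
        opt.zero_grad(); loss.backward(); opt.step()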

In summary, the key aspects are LoRA finetuning, iterative optimization of the latent at a single diffusion step, motion supervision in feature space, point tracking via nearest-neighbor feature search, and the hyperparameter settings listed above.

High-Level Pseudo Code

Here is the high-level pseudo code for the DragDiffusion method:

# Finetune a LoRA so the diffusion model reconstructs the input image
lora = LoRA(diffusion_model)
lora.fit(input_image)

# DDIM inversion: obtain the diffusion latent at step t
z_t = DDIM_invert(input_image, steps=T, stop_at=t)

# Iterative editing
for k in range(max_iter):

    # Motion supervision: drag the features around each handle point toward its target
    loss = l1_dist(shifted_patch_feats, stop_grad(handle_patch_feats))
    loss += regularization_term            # keeps the unmasked region unchanged
    z_t = optimize(z_t, loss)

    # Point tracking: relocate handle points by nearest-neighbor search
    # against the original handle-point features
    handle_pts = nearest_neighbor_search(current_feats, original_handle_feats)

# DDIM denoising of the optimized latent
output_image = DDIM_denoise(z_t, from_step=t)

The key steps are:

  • Finetune a LoRA on the diffusion model for reconstruction
  • Get the latent at step t via DDIM inversion
  • Iteratively update the latent by optimizing the motion supervision loss
  • Update the handle points via nearest-neighbor search in feature space
  • Denoise the optimized latent to get the final output

This shows the overall workflow for interactive point-based editing with DragDiffusion.

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the DragDiffusion method:

# Hyperparameters
T = 50              # total DDIM steps
t = 40              # diffusion step whose latent is edited
r1 = 1              # radius of the feature patch used in motion supervision
r2 = 3              # radius of the point-tracking search region
lam = 0.1           # regularization weight (example value; increase it if the unmasked region drifts)

# Finetune a LoRA so the diffusion model reconstructs the input image
lora = LoRA(diffusion_model)
lora.fit(input_image, steps=200)

# DDIM inversion: encode the image and invert it up to step t
z_0 = encode(input_image)                         # VAE latent of the input image
inv_traj = DDIM_invert(z_0, steps=T, stop_at=t)   # inversion trajectory z_1 ... z_t
z_t = inv_traj[t]
z_t_minus_1 = inv_traj[t - 1]                     # reference for the regularization term

# Initialization
z_t_hat = copy(z_t)                               # latent to be optimized
handle_pts = user_clicks()                        # handle points h_i
target_pts = user_clicks()                        # target points g_i
mask = user_mask()                                # binary mask of the editable region

# Features of the original latent, used as the reference for point tracking
orig_feats = UNet_features(z_t, t)
orig_handle_feats = [orig_feats[pt] for pt in handle_pts]

# Iterative editing loop
for k in range(max_iter):

    # ---- Motion supervision ----
    feat_maps = UNet_features(z_t_hat, t)         # penultimate UNet block features

    loss = 0
    for i in range(num_handles):
        pt = handle_pts[i]
        target = target_pts[i]
        d = (target - pt) / norm(target - pt)     # unit step toward the target

        # drag the patch around the handle point one step toward the target;
        # the reference features are detached (stop-gradient)
        for q in patch(pt, r1):
            loss += l1(feat_maps[q + d] - stop_grad(feat_maps[q]))

    # regularization: keep the unmasked region on the original denoising trajectory
    z_t_minus_1_hat = DDIM_step(z_t_hat, t)
    loss += lam * l1((z_t_minus_1_hat - z_t_minus_1) * (1 - mask))

    # gradient step on the latent
    z_t_hat = optimize(z_t_hat, loss)

    # ---- Point tracking ----
    feat_maps = UNet_features(z_t_hat, t)         # features after the update
    for i in range(num_handles):
        region = patch(handle_pts[i], r2)
        # nearest neighbor of the original handle feature within the search region
        handle_pts[i] = min(region, key=lambda q: l1(feat_maps[q] - orig_handle_feats[i]))

# Final output: denoise the optimized latent from step t back to step 0
z_0_hat = DDIM_denoise(z_t_hat, from_step=t)
output_image = decode(z_0_hat)

The key aspects covered:

  • Hyperparameters
  • LoRA finetuning
  • DDIM inversion
  • Iterative editing loop
    • Motion supervision loss
    • Regularization
    • Optimizer update
    • Point tracking via nearest-neighbor search
  • Final denoising

This provides a detailed pseudo code outline of the main algorithmic steps.