DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
Authors: Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang
Abstract: Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. Our source code will be available at DragonDiffusion.
What, Why and How
Here is a summary of the key points in this paper:
What: This paper proposes a new image editing method called DragonDiffusion that enables drag-style manipulation on diffusion models like Stable Diffusion.
Why: Existing diffusion models can generate high-quality images from text prompts but lack precise control for editing generated or real images. Prior drag-style editing relies on latent-space manipulation in GANs, which limits editing quality and generality. This paper instead exploits the strong feature correspondence inside diffusion models for fine-grained image editing.
How:
- Uses two parallel branches in the diffusion process: a guidance branch that reconstructs the original image and a generation branch that produces the edited result.
- Features from the guidance branch supervise the generation branch through a classifier-guidance loss based on feature similarity, turning editing signals into gradients on the latent (a minimal sketch follows below).
- Multi-scale guidance combines semantic and geometric alignment.
- Cross-branch self-attention keeps the edited result consistent with the original image.
- Achieves object moving, object resizing, appearance replacement, and point dragging without model fine-tuning.
In summary, this paper proposes DragonDiffusion that leverages the strong feature correspondence in diffusion models to guide precise image editing, enabling drag-style manipulation without model fine-tuning or training. The method is applied to tasks like object moving, resizing, appearance replacement and point dragging.
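To make the guidance mechanism concrete, here is a minimal PyTorch sketch of a single guidance step: an editing energy computed from diffusion features is back-propagated to the noisy latent, and the latent is nudged along the negative gradient. Names such as guidance_step and energy_fn are illustrative, not from the paper.

import torch

def guidance_step(z_t, energy_fn, step_size=1e-2):
    """One classifier-guidance update: move the latent down the gradient of
    an editing energy computed from the diffusion model's features."""
    z_t = z_t.detach().requires_grad_(True)
    energy = energy_fn(z_t)                      # scalar editing loss
    grad, = torch.autograd.grad(energy, z_t)     # dE/dz_t
    return (z_t - step_size * grad).detach()

# Usage inside the sampling loop, before the regular denoising step:
# z_t = guidance_step(z_t, lambda z: editing_loss(unet_features(z, t)))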
Main Contributions
Here are the main contributions of this paper:
- Proposes a classifier-guidance image editing strategy based on the strong correspondence of intermediate features in diffusion models, with a multi-scale feature matching scheme that considers both semantic and geometric correspondence (sketched below).
- All content editing and preservation signals come from the image itself, allowing the text-to-image generation ability of diffusion models to be translated directly to image editing without model fine-tuning.
- Achieves various fine-grained image editing capabilities, including object moving, object resizing, object appearance replacement, and content dragging.
- Requires no model training or fine-tuning; only uses the feature correspondence in pre-trained diffusion models such as Stable Diffusion.
- Provides an efficient design that transforms editing signals into gradients via score functions and classifier guidance.
- Maintains content consistency between the original and edited images using cross-branch self-attention.
In summary, the key contributions are: 1) Classifier-guidance editing strategy via feature correspondence 2) Image-based editing without model fine-tuning 3) Variety of fine-grained editing capabilities 4) Efficient score function design 5) Cross-branch self-attention for consistency.
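As a rough illustration of the first contribution, the feature-correspondence loss can be written as a masked cosine-similarity objective summed over two feature scales. This is a sketch under simplifying assumptions (region features are mean-pooled, masks are binary); function and variable names are illustrative rather than the paper's.

import torch
import torch.nn.functional as F

def _resize(m, f):
    """Resize an (H, W) binary mask to the spatial size of feature map f (C, h, w)."""
    return F.interpolate(m[None, None].float(), size=f.shape[-2:], mode="nearest")[0, 0]

def _region_feat(f, m):
    """Mean feature vector of the masked region: (C,)."""
    return (f * m).sum(dim=(1, 2)) / m.sum().clamp(min=1)

def correspondence_loss(feats_gen, feats_gud, m_edit, m_src, m_share,
                        w_e=1.0, w_p=0.5):
    """Editing + preservation loss summed over several decoder layers
    (e.g. layers 2 and 3 for semantic and geometric alignment)."""
    loss = torch.zeros(())
    for f_gen, f_gud in zip(feats_gen, feats_gud):
        me, ms, mp = _resize(m_edit, f_gen), _resize(m_src, f_gud), _resize(m_share, f_gen)
        # editing term: the target region of the edit should match the source region
        loss = loss + w_e * (1 - F.cosine_similarity(
            _region_feat(f_gen, me), _region_feat(f_gud, ms), dim=0))
        # preservation term: everything else should stay as in the original image
        loss = loss + w_p * (1 - F.cosine_similarity(
            _region_feat(f_gen, mp), _region_feat(f_gud, mp), dim=0))
    return loss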
Method Section
Here is a summary of the method section from the paper:
- Built on the Stable Diffusion (SD) text-to-image diffusion model, which consists of an autoencoder and a UNet denoiser.
- Runs a guidance branch and a generation branch in the diffusion process: the guidance branch reconstructs the original image, while the generation branch produces the edited image.
- The strong feature correspondence between the two branches is used both for editing and for content consistency.
- Designs a score function that transforms editing signals into gradients via classifier guidance; content is matched by feature similarity measured with cosine distance.
- Uses multi-scale guidance that combines semantics from layer 2 and geometry from layer 3 of the diffusion UNet.
- For object moving, adds a contrastive-style term to avoid duplicated objects and an inpainting term so the vacated region is filled naturally.
- For object resizing, applies local interpolation to resize the guidance features (see the sketch below).
- For appearance replacement, matches the mean features of the corresponding regions (also sketched below).
- For content dragging, applies a similarity loss between the features at the handle points and the target points.
- Uses cross-branch self-attention to keep the edited result consistent with the original image.
- Requires no model training or fine-tuning; relies only on the feature correspondence in the pre-trained SD model.
In summary, the key ideas are guidance branch, score function for editing, multi-scale matching, task-specific losses, and cross-branch self-attention. The method achieves a variety of editing applications without any model training.
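Two of the task-specific operations above can be illustrated with small helpers: object resizing by locally interpolating the guidance features, and appearance replacement by matching mean region features. These are sketches that assume (C, H, W) feature maps and masks already at feature resolution; the names are illustrative, not the authors' code.

import torch
import torch.nn.functional as F

def resize_object_features(f_gud, obj_mask, scale):
    """Object resizing: locally interpolate the guidance features inside the
    object's bounding box by `scale`, to build the resized editing target.
    f_gud: (C, H, W) features, obj_mask: (H, W) binary mask at feature resolution."""
    ys, xs = obj_mask.nonzero(as_tuple=True)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = f_gud[:, y0:y1, x0:x1][None]                       # (1, C, h, w)
    resized = F.interpolate(crop, scale_factor=scale,
                            mode="bilinear", align_corners=False)[0]
    return resized                                            # to be pasted at the target location

def appearance_loss(f_gen, f_ref, m_obj, m_ref):
    """Appearance replacement: pull the mean feature of the edited object
    toward the mean feature of the reference region."""
    mu_gen = (f_gen * m_obj).sum(dim=(1, 2)) / m_obj.sum().clamp(min=1)
    mu_ref = (f_ref * m_ref).sum(dim=(1, 2)) / m_ref.sum().clamp(min=1)
    return 1 - F.cosine_similarity(mu_gen, mu_ref, dim=0)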
High-Level Pseudo Code
Here is the high-level pseudo code for the DragonDiffusion method:
# Inputs: original image x0, editing signal y
# Invert x0 to get the initial latent zT (e.g. DDIM inversion)
zT = invert(x0)

# Guidance branch: denoise zT and cache intermediate UNet features
zt_gud = zT
feats_gud = {}
for t in timesteps:
    feats_gud[t] = unet_features(zt_gud, t)
    zt_gud = denoise_step(zt_gud, t)

# Generation branch: denoise zT again under classifier guidance
zt_gen = zT
for t in timesteps:
    ft_gud = feats_gud[t]
    ft_gen = unet_features(zt_gen, t)

    # Cross-branch self-attention: queries from the generation branch,
    # keys/values from the guidance branch
    q_gen, k_gud, v_gud = ft_gen, ft_gud, ft_gud
    ft_gen = attention(q_gen, k_gud, v_gud)

    # Classifier guidance: feature-correspondence loss between branches
    L = feature_similarity_loss(ft_gen, ft_gud, y)
    grad = autograd(L, zt_gen)

    # Update the latent, then take the regular denoising step
    zt_gen = zt_gen - lr * grad
    zt_gen = denoise_step(zt_gen, t)

# Decode the final latent into the edited image
x0_edit = decode(zt_gen)
In summary, the key steps are:
- Invert the image to obtain the latent code zT
- Denoise in the guidance branch while caching intermediate features
- Denoise in the generation branch
- Apply cross-branch self-attention (expanded in the sketch below)
- Compute the classifier-guidance loss from the editing signal
- Back-propagate to get gradients and update the latent
- Decode the final latent into the edited image
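The attention(q_gen, k_gud, v_gud) call in the pseudo code can be expanded as below: queries come from the generation branch, while keys and values come from the guidance branch, so the edited result attends to the original image's content. The sketch assumes flattened (B, N, C) feature tokens and access to the attention layer's projection matrices; multi-head reshaping is omitted for brevity, and the names are illustrative.

import torch

def cross_branch_self_attention(h_gen, h_gud, w_q, w_k, w_v):
    """Self-attention with queries from the generation branch and keys/values
    from the guidance branch, preserving the original image's content.
    h_*: (B, N, C) flattened feature tokens; w_*: (C, C) projection weights
    of the pre-trained attention layer."""
    q = h_gen @ w_q                    # queries: generation branch
    k = h_gud @ w_k                    # keys:    guidance branch
    v = h_gud @ w_v                    # values:  guidance branch
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v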
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the DragonDiffusion method:
# Detailed pseudo code (PyTorch-style). Assumes a pre-trained Stable
# Diffusion model whose UNet exposes its intermediate decoder features.

# Hyperparameters
T = 1000              # number of denoising steps
lr = 1e-2             # guidance step size
w_e = 1.0             # weight of the editing term
w_p = 0.5             # weight of the content-preservation term
layers = [2, 3]       # UNet decoder layers used for feature matching

# Inputs
x0 = image            # original image
y = signal            # editing signal (source region / target points)
m_edit = mask         # mask of the regions to edit
m_share = 1 - m_edit  # mask of the regions to preserve

# Invert the image to get the initial latent
zT = invert(x0)       # e.g. DDIM inversion

# Guidance branch: denoise zT and cache decoder features at every step
zt_gud = zT
feats_gud = {}
for t in reversed(range(T)):
    ft_gud = unet_features(zt_gud, t)        # all intermediate decoder features
    feats_gud[t] = ft_gud
    zt_gud = denoise_step(zt_gud, t)         # one reverse diffusion step

# Generation branch: denoise zT again under classifier guidance
zt_gen = zT
for t in reversed(range(T)):
    ft_gud = feats_gud[t]
    ft_gen = unet_features(zt_gen, t)

    # Cross-branch self-attention: queries from the generation branch,
    # keys and values from the guidance branch
    q, k, v = ft_gen, ft_gud, ft_gud
    ft_gen = attention(q, k, v)

    # Multi-scale features
    f2_gud, f3_gud = ft_gud[layers[0]], ft_gud[layers[1]]
    f2_gen, f3_gen = ft_gen[layers[0]], ft_gen[layers[1]]

    # Feature-correspondence loss:
    #   editing term      - the edit region should match the guidance features
    #                       indicated by the editing signal y
    #   preservation term - the shared region should stay close to the original
    L_edit = (1 - cosine_sim(f2_gen * m_edit, f2_gud * y)) \
           + (1 - cosine_sim(f3_gen * m_edit, f3_gud * y))
    L_pres = (1 - cosine_sim(f2_gen * m_share, f2_gud * m_share)) \
           + (1 - cosine_sim(f3_gen * m_share, f3_gud * m_share))
    L = w_e * L_edit + w_p * L_pres

    # Classifier guidance: gradient of the loss w.r.t. the latent
    grad = autograd(L, zt_gen)

    # Update the latent, then take the regular denoising step
    zt_gen = zt_gen - lr * grad
    zt_gen = denoise_step(zt_gen, t)

# Decode the final latent into the edited image
x0_edit = decode(zt_gen)
The key steps are:
- Obtain the latent code zT via inversion (a DDIM-inversion sketch follows below)
- Denoise in the guidance branch, caching multi-scale features
- Denoise in the generation branch with cross-branch self-attention
- Compute the feature-correspondence loss L
- Compute gradients of L w.r.t. the latent
- Update the latent with the gradients
- Decode the final edited image
This shows a more detailed implementation with hyperparameters, feature layers, loss functions, gradient descent, etc.
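Both pseudo-code listings call invert(x0) without defining it. A common choice is DDIM inversion, which runs the deterministic DDIM update in reverse so that the resulting noisy latent reconstructs the image when denoised. The sketch below assumes an epsilon-prediction UNet eps_model(z, t) operating on the VAE latent and the usual cumulative noise schedule alpha_bar; it is an illustration, not the authors' exact inversion code.

import torch

@torch.no_grad()
def ddim_invert(z0, eps_model, alpha_bar, timesteps):
    """Deterministic DDIM inversion: map the clean latent z0 to a noisy
    latent zT that approximately reconstructs z0 under DDIM sampling.
    alpha_bar: (T,) cumulative noise schedule; timesteps: increasing list of steps."""
    z = z0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alpha_bar[t], alpha_bar[t_next]
        eps = eps_model(z, t)                                    # predicted noise
        z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps  # reverse DDIM step
    return z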