DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
Authors: Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang
Abstract: Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It can transform the editing signals into gradients via feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, we also build a multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention is added to maintain the consistency between the original image and the editing result. Our method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules. Our source code will be available at DragonDiffusion.
What, Why and How
Here is a summary of the key points in this paper:
What: This paper proposes a new image editing method called DragonDiffusion that enables drag-style manipulation on diffusion models like Stable Diffusion.
Why: Existing diffusion models can generate high-quality images from text prompts but lack precise control for editing generated or real images. Prior drag-style editing relies on latent-space manipulation in GANs, which limits editing quality and generality. This paper instead exploits the strong feature correspondence inside diffusion models for fine-grained image editing.
How:
- Uses two parallel branches in the diffusion process: a guidance branch that reconstructs the original image and a generation branch that produces the edited result.
- Features from the guidance branch supervise the generation branch through a classifier-guidance loss based on feature similarity, turning editing signals into gradients on the latent (a minimal sketch follows below).
- Multi-scale guidance combines semantic and geometric alignment.
- Cross-branch self-attention keeps the edited result consistent with the original image.
- Achieves object moving, object resizing, appearance replacement, and point dragging without model fine-tuning.
In summary, this paper proposes DragonDiffusion that leverages the strong feature correspondence in diffusion models to guide precise image editing, enabling drag-style manipulation without model fine-tuning or training. The method is applied to tasks like object moving, resizing, appearance replacement and point dragging.
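To make the guidance mechanism concrete, here is a minimal PyTorch sketch of a single guidance step: an editing energy computed from diffusion features is back-propagated to the noisy latent, and the latent is nudged along the negative gradient. Names such as guidance_step and energy_fn are illustrative, not from the paper.

import torch

def guidance_step(z_t, energy_fn, step_size=1e-2):
    """One classifier-guidance update: move the latent down the gradient of
    an editing energy computed from the diffusion model's features."""
    z_t = z_t.detach().requires_grad_(True)
    energy = energy_fn(z_t)                      # scalar editing loss
    grad, = torch.autograd.grad(energy, z_t)     # dE/dz_t
    return (z_t - step_size * grad).detach()

# Usage inside the sampling loop, before the regular denoising step:
# z_t = guidance_step(z_t, lambda z: editing_loss(unet_features(z, t)))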
Main Contributions
Here are the main contributions of this paper:
- Proposes a classifier-guidance image editing strategy based on the strong correspondence of intermediate features in diffusion models, with a multi-scale feature matching scheme that considers both semantic and geometric correspondence (sketched below).
- All content editing and preservation signals come from the image itself, allowing the text-to-image generation ability of diffusion models to be translated directly to image editing without model fine-tuning.
- Achieves various fine-grained image editing capabilities, including object moving, object resizing, object appearance replacement, and content dragging.
- Requires no model training or fine-tuning; only uses the feature correspondence in pre-trained diffusion models such as Stable Diffusion.
- Provides an efficient design that transforms editing signals into gradients via score functions and classifier guidance.
- Maintains content consistency between the original and edited images using cross-branch self-attention.
In summary, the key contributions are: 1) Classifier-guidance editing strategy via feature correspondence 2) Image-based editing without model fine-tuning 3) Variety of fine-grained editing capabilities 4) Efficient score function design 5) Cross-branch self-attention for consistency.
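As a rough illustration of the first contribution, the feature-correspondence loss can be written as a masked cosine-similarity objective summed over two feature scales. This is a sketch under simplifying assumptions (region features are mean-pooled, masks are binary); function and variable names are illustrative rather than the paper's.

import torch
import torch.nn.functional as F

def _resize(m, f):
    """Resize an (H, W) binary mask to the spatial size of feature map f (C, h, w)."""
    return F.interpolate(m[None, None].float(), size=f.shape[-2:], mode="nearest")[0, 0]

def _region_feat(f, m):
    """Mean feature vector of the masked region: (C,)."""
    return (f * m).sum(dim=(1, 2)) / m.sum().clamp(min=1)

def correspondence_loss(feats_gen, feats_gud, m_edit, m_src, m_share,
                        w_e=1.0, w_p=0.5):
    """Editing + preservation loss summed over several decoder layers
    (e.g. layers 2 and 3 for semantic and geometric alignment)."""
    loss = torch.zeros(())
    for f_gen, f_gud in zip(feats_gen, feats_gud):
        me, ms, mp = _resize(m_edit, f_gen), _resize(m_src, f_gud), _resize(m_share, f_gen)
        # editing term: the target region of the edit should match the source region
        loss = loss + w_e * (1 - F.cosine_similarity(
            _region_feat(f_gen, me), _region_feat(f_gud, ms), dim=0))
        # preservation term: everything else should stay as in the original image
        loss = loss + w_p * (1 - F.cosine_similarity(
            _region_feat(f_gen, mp), _region_feat(f_gud, mp), dim=0))
    return loss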
Method Section
Here is a summary of the method section from the paper:
- Built on the Stable Diffusion (SD) text-to-image diffusion model, which consists of an autoencoder and a UNet denoiser.
- Runs a guidance branch and a generation branch in the diffusion process: the guidance branch reconstructs the original image, while the generation branch produces the edited image.
- The strong feature correspondence between the two branches is used both for editing and for content consistency.
- Designs a score function that transforms editing signals into gradients via classifier guidance; content is matched by feature similarity measured with cosine distance.
- Uses multi-scale guidance that combines semantics from layer 2 and geometry from layer 3 of the diffusion UNet.
- For object moving, adds a contrastive-style term to avoid duplicated objects and an inpainting term so the vacated region is filled naturally.
- For object resizing, applies local interpolation to resize the guidance features (see the sketch below).
- For appearance replacement, matches the mean features of the corresponding regions (also sketched below).
- For content dragging, applies a similarity loss between the features at the handle points and the target points.
- Uses cross-branch self-attention to keep the edited result consistent with the original image.
- Requires no model training or fine-tuning; relies only on the feature correspondence in the pre-trained SD model.
In summary, the key ideas are guidance branch, score function for editing, multi-scale matching, task-specific losses, and cross-branch self-attention. The method achieves a variety of editing applications without any model training.
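Two of the task-specific operations above can be illustrated with small helpers: object resizing by locally interpolating the guidance features, and appearance replacement by matching mean region features. These are sketches that assume (C, H, W) feature maps and masks already at feature resolution; the names are illustrative, not the authors' code.

import torch
import torch.nn.functional as F

def resize_object_features(f_gud, obj_mask, scale):
    """Object resizing: locally interpolate the guidance features inside the
    object's bounding box by `scale`, to build the resized editing target.
    f_gud: (C, H, W) features, obj_mask: (H, W) binary mask at feature resolution."""
    ys, xs = obj_mask.nonzero(as_tuple=True)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = f_gud[:, y0:y1, x0:x1][None]                       # (1, C, h, w)
    resized = F.interpolate(crop, scale_factor=scale,
                            mode="bilinear", align_corners=False)[0]
    return resized                                            # to be pasted at the target location

def appearance_loss(f_gen, f_ref, m_obj, m_ref):
    """Appearance replacement: pull the mean feature of the edited object
    toward the mean feature of the reference region."""
    mu_gen = (f_gen * m_obj).sum(dim=(1, 2)) / m_obj.sum().clamp(min=1)
    mu_ref = (f_ref * m_ref).sum(dim=(1, 2)) / m_ref.sum().clamp(min=1)
    return 1 - F.cosine_similarity(mu_gen, mu_ref, dim=0)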
High-Level Pseudo Code
Here is the high-level pseudo code for the DragonDiffusion method:
# Inputs: original image x0, editing signal y
# Invert x0 to get the initial latent zT (e.g. DDIM inversion)
zT = invert(x0)

# Guidance branch: denoise zT and cache intermediate UNet features
zt_gud = zT
feats_gud = {}
for t in timesteps:
    feats_gud[t] = unet_features(zt_gud, t)
    zt_gud = denoise_step(zt_gud, t)

# Generation branch: denoise zT again under classifier guidance
zt_gen = zT
for t in timesteps:
    ft_gud = feats_gud[t]
    ft_gen = unet_features(zt_gen, t)

    # Cross-branch self-attention: queries from the generation branch,
    # keys/values from the guidance branch
    q_gen, k_gud, v_gud = ft_gen, ft_gud, ft_gud
    ft_gen = attention(q_gen, k_gud, v_gud)

    # Classifier guidance: feature-correspondence loss between branches
    L = feature_similarity_loss(ft_gen, ft_gud, y)
    grad = autograd(L, zt_gen)

    # Update the latent, then take the regular denoising step
    zt_gen = zt_gen - lr * grad
    zt_gen = denoise_step(zt_gen, t)

# Decode the final latent into the edited image
x0_edit = decode(zt_gen)
In summary, the key steps are:
- Invert the image to obtain the latent code zT
- Denoise in the guidance branch while caching intermediate features
- Denoise in the generation branch
- Apply cross-branch self-attention (expanded in the sketch below)
- Compute the classifier-guidance loss from the editing signal
- Back-propagate to get gradients and update the latent
- Decode the final latent into the edited image
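The attention(q_gen, k_gud, v_gud) call in the pseudo code can be expanded as below: queries come from the generation branch, while keys and values come from the guidance branch, so the edited result attends to the original image's content. The sketch assumes flattened (B, N, C) feature tokens and access to the attention layer's projection matrices; multi-head reshaping is omitted for brevity, and the names are illustrative.

import torch

def cross_branch_self_attention(h_gen, h_gud, w_q, w_k, w_v):
    """Self-attention with queries from the generation branch and keys/values
    from the guidance branch, preserving the original image's content.
    h_*: (B, N, C) flattened feature tokens; w_*: (C, C) projection weights
    of the pre-trained attention layer."""
    q = h_gen @ w_q                    # queries: generation branch
    k = h_gud @ w_k                    # keys:    guidance branch
    v = h_gud @ w_v                    # values:  guidance branch
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v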
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the DragonDiffusion method:
# Detailed pseudo code (PyTorch-style). Assumes a pre-trained Stable
# Diffusion model whose UNet exposes its intermediate decoder features.

# Hyperparameters
T = 1000              # number of denoising steps
lr = 1e-2             # guidance step size
w_e = 1.0             # weight of the editing term
w_p = 0.5             # weight of the content-preservation term
layers = [2, 3]       # UNet decoder layers used for feature matching

# Inputs
x0 = image            # original image
y = signal            # editing signal (source region / target points)
m_edit = mask         # mask of the regions to edit
m_share = 1 - m_edit  # mask of the regions to preserve

# Invert the image to get the initial latent
zT = invert(x0)       # e.g. DDIM inversion

# Guidance branch: denoise zT and cache decoder features at every step
zt_gud = zT
feats_gud = {}
for t in reversed(range(T)):
    ft_gud = unet_features(zt_gud, t)        # all intermediate decoder features
    feats_gud[t] = ft_gud
    zt_gud = denoise_step(zt_gud, t)         # one reverse diffusion step

# Generation branch: denoise zT again under classifier guidance
zt_gen = zT
for t in reversed(range(T)):
    ft_gud = feats_gud[t]
    ft_gen = unet_features(zt_gen, t)

    # Cross-branch self-attention: queries from the generation branch,
    # keys and values from the guidance branch
    q, k, v = ft_gen, ft_gud, ft_gud
    ft_gen = attention(q, k, v)

    # Multi-scale features
    f2_gud, f3_gud = ft_gud[layers[0]], ft_gud[layers[1]]
    f2_gen, f3_gen = ft_gen[layers[0]], ft_gen[layers[1]]

    # Feature-correspondence loss:
    #   editing term      - the edit region should match the guidance features
    #                       indicated by the editing signal y
    #   preservation term - the shared region should stay close to the original
    L_edit = (1 - cosine_sim(f2_gen * m_edit, f2_gud * y)) \
           + (1 - cosine_sim(f3_gen * m_edit, f3_gud * y))
    L_pres = (1 - cosine_sim(f2_gen * m_share, f2_gud * m_share)) \
           + (1 - cosine_sim(f3_gen * m_share, f3_gud * m_share))
    L = w_e * L_edit + w_p * L_pres

    # Classifier guidance: gradient of the loss w.r.t. the latent
    grad = autograd(L, zt_gen)

    # Update the latent, then take the regular denoising step
    zt_gen = zt_gen - lr * grad
    zt_gen = denoise_step(zt_gen, t)

# Decode the final latent into the edited image
x0_edit = decode(zt_gen)
The key steps are:
- Obtain the latent code zT via inversion (a DDIM-inversion sketch follows below)
- Denoise in the guidance branch, caching multi-scale features
- Denoise in the generation branch with cross-branch self-attention
- Compute the feature-correspondence loss L
- Compute gradients of L w.r.t. the latent
- Update the latent with the gradients
- Decode the final edited image
This shows a more detailed implementation with hyperparameters, feature layers, loss functions, gradient descent, etc.
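Both pseudo-code listings call invert(x0) without defining it. A common choice is DDIM inversion, which runs the deterministic DDIM update in reverse so that the resulting noisy latent reconstructs the image when denoised. The sketch below assumes an epsilon-prediction UNet eps_model(z, t) operating on the VAE latent and the usual cumulative noise schedule alpha_bar; it is an illustration, not the authors' exact inversion code.

import torch

@torch.no_grad()
def ddim_invert(z0, eps_model, alpha_bar, timesteps):
    """Deterministic DDIM inversion: map the clean latent z0 to a noisy
    latent zT that approximately reconstructs z0 under DDIM sampling.
    alpha_bar: (T,) cumulative noise schedule; timesteps: increasing list of steps."""
    z = z0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alpha_bar[t], alpha_bar[t_next]
        eps = eps_model(z, t)                                    # predicted noise
        z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps  # reverse DDIM step
    return z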