Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models
Authors: Daiki Miyake, Akihiro Iohara, Yu Saito, Toshiyuki Tanaka
Abstract: In image editing employing diffusion models, it is crucial to preserve the reconstruction quality of the original image while changing its style. Although existing methods ensure reconstruction quality through optimization, a drawback of these methods is the significant amount of time required for optimization. In this paper, we propose negative-prompt inversion, a method capable of achieving equivalent reconstruction solely through forward propagation without optimization, thereby enabling much faster editing processes. We experimentally demonstrate that the reconstruction quality of our method is comparable to that of existing methods, allowing for inversion at a resolution of 512 pixels and with 50 sampling steps within approximately 5 seconds, which is more than 30 times faster than null-text inversion. Reduction of the computation time by the proposed method further allows us to use a larger number of sampling steps in diffusion models to improve the reconstruction quality with a moderate increase in computation time.
What, Why and How
Here is a summary of the key points from this paper:
What:
- The paper proposes a new method called negative-prompt inversion for reconstructing and editing images with diffusion models.
Why:
- Existing methods require optimizing embeddings or model weights to reconstruct images, which is slow. The proposed method uses only forward propagation for faster reconstruction.
How:
- The method is based on null-text inversion which optimizes a null-text embedding for reconstruction.
- The paper shows theoretically and empirically that the optimized null-text embedding approximates the prompt embedding.
- So the proposed method simply uses the prompt embedding in place of optimizing a null-text embedding.
- This allows reconstructing images with similar quality as null-text inversion but much faster, in around 5 seconds compared to over 2 minutes.
In summary, the key novelty is avoiding optimization by using the prompt embedding directly, instead of an optimized null-text embedding. This enables fast high-quality reconstruction and editing of images with diffusion models using only feedforward computation.
Main Contributions
Here are the main contributions of this paper:
- They propose negative-prompt inversion, a new method for fast, high-quality reconstruction of images using diffusion models without optimization.
- They provide a theoretical analysis showing that the prompt embedding can approximate the optimized null-text embedding for reconstruction.
- They demonstrate experimentally that their method achieves reconstruction quality comparable to null-text inversion while being over 30 times faster.
- They show their method can be combined with existing editing techniques such as prompt-to-prompt for fast single-image editing.
- They analyze the trade-off between reconstruction quality and speed by varying the number of sampling steps: more steps improve quality at the cost of speed, yet their method remains far faster than null-text inversion.
In summary, the key contributions are proposing and analyzing the negative-prompt inversion method, and demonstrating its effectiveness for high-quality yet very fast reconstruction and editing compared to prior art. The speedup enables applications needing fast processing like video editing.
Method Section
Here is a summary of the method section from the paper:
The goal is to obtain latent variables and text embeddings to reconstruct a real image using diffusion models without optimization.
The key steps are:
- DDIM inversion maps the image latent to a noise latent by running the deterministic DDIM sampling process in the forward direction, with the diffusion model conditioned on the prompt (guidance scale 1).
- Null-text inversion improves reconstruction by optimizing the null-text embedding at each step so that the reverse process computed with classifier-free guidance (CFG) stays aligned with the DDIM inversion trajectory.
- The proposed negative-prompt inversion uses the input prompt embedding directly instead of optimizing a null-text embedding.
- The paper shows theoretically that, under perfect optimization, the optimized null-text embedding coincides with the prompt embedding.
- So instead of optimizing a null-text embedding, the method plugs the prompt embedding into the unconditional branch of CFG when computing the reverse diffusion (see the formula sketch below).
- This allows reconstructing images using only forward propagation, avoiding slow optimization.
- For image editing, the edited prompt embedding is used as the text condition, and the original prompt embedding replaces the null-text embedding in CFG.
In summary, the key idea is to replace the optimized null-text embedding with the input prompt embedding, enabling fast high-quality reconstruction and editing using standard diffusion model sampling.
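Written out, the relationship between null-text inversion and the proposed method is a short derivation. The notation below is the standard CFG one (w is the guidance scale, \varnothing the null-text embedding, C the prompt embedding); this is a sketch of the argument, not the paper's exact equations. Null-text inversion optimizes, at each step t,

\min_{\varnothing_t} \left\| z^*_{t-1} - z_{t-1}(z_t, \varnothing_t, C) \right\|_2^2

where z^*_{t-1} is the DDIM-inversion latent and z_{t-1}(\cdot) is the CFG reverse step. CFG itself predicts

\tilde{\epsilon}_\theta(z_t, t, C, \varnothing) = \epsilon_\theta(z_t, t, \varnothing) + w \left( \epsilon_\theta(z_t, t, C) - \epsilon_\theta(z_t, t, \varnothing) \right)

Negative-prompt inversion substitutes \varnothing := C, so the guided prediction collapses:

\tilde{\epsilon}_\theta(z_t, t, C, C) = \epsilon_\theta(z_t, t, C)

This is exactly the guidance-scale-1 prediction used during DDIM inversion, which is why the reverse process retraces the inversion trajectory and reconstructs the image without any optimization.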
High-Level Pseudo Code
Here is the high-level pseudo code for the negative-prompt inversion method proposed in the paper:
# Input
x: original image
p: prompt describing x
# Diffusion model
model = DiffusionModel()
# Get latent code for x via DDIM inversion (conditioned on p)
z_T = DDIMInversion(x, p)
# Negative-prompt inversion
z_t = z_T
for t in reversed(range(1, T + 1)):
    # Classifier-free guidance: both branches receive the prompt p,
    # so the guided prediction reduces to the conditional one
    eps_c = model.predict_noise(z_t, t, p)  # Conditional
    eps_u = model.predict_noise(z_t, t, p)  # "Unconditional" (prompt used as negative prompt)
    # Deterministic DDIM reverse step
    z_t = DDIMStep(z_t, t, eps_c, eps_u)
# Output
x_recon = Decode(z_t)  # Reconstructed image (z_t is now z_0)
The key steps are:
- Get latent code z_T via DDIM inversion of the original image x, conditioned on prompt p
- In the negative-prompt inversion loop, use the prompt embedding p directly as both the conditional and unconditional (“negative prompt”) input to the classifier free guidance.
- Perform reverse diffusion steps to reconstruct the image.
This avoids optimizing a separate null-text embedding, enabling fast reconstruction using just the input prompt embedding.
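In practice, this maps directly onto the negative-prompt argument found in common Stable Diffusion implementations. The following is a minimal sketch using Hugging Face diffusers; the model name and guidance scale are illustrative, and z_T is assumed to come from a separate DDIM inversion of the source image (not shown, see the detailed pseudo code below):

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # Deterministic DDIM sampling

source_prompt = "a photo of a cat"               # Prompt describing the original image
edit_prompt = "a photo of a cat wearing a hat"   # Edited prompt

# z_T: noise latent from DDIM inversion of the source image (assumed precomputed)
image = pipe(
    prompt=edit_prompt,             # Text condition (use source_prompt instead to reconstruct)
    negative_prompt=source_prompt,  # Source prompt replaces the null text in CFG
    latents=z_T,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

With prompt=source_prompt this reconstructs the original image; with an edited prompt it performs editing, matching the CFG substitution described in the method section.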
Detailed Pseudo Code
Here is some more detailed pseudo code to implement the negative-prompt inversion method:
# Hyperparameters
T = 1000  # Number of diffusion steps
alpha_bar = ...  # Cumulative products of the noise schedule (alpha_bar[0] = 1)
w = 7.5  # Guidance scale
# Input
x: original image
p: prompt describing x
C = text_encoder(p)  # Get prompt embedding
# Encoder, decoder, diffusion model
encoder = Encoder()
decoder = Decoder()
model = DiffusionModel()
# DDIM inversion: image latent -> noise latent
z_0 = encoder(x)
z_T = DDIMInversion(z_0, C)
# Negative-prompt inversion (deterministic DDIM sampling, no added noise)
z = z_T
for t in reversed(range(1, T + 1)):
    # CFG with the prompt embedding C in both branches
    eps_c = model(z, t, C)  # Conditional
    eps_u = model(z, t, C)  # "Unconditional" (negative prompt set to C)
    eps = eps_u + w * (eps_c - eps_u)  # Equals eps_c since eps_u == eps_c
    # Deterministic DDIM reverse step
    a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
    z0_pred = (z - sqrt(1 - a_t) * eps) / sqrt(a_t)  # Predicted clean latent
    z = sqrt(a_prev) * z0_pred + sqrt(1 - a_prev) * eps
# Decode
x_recon = decoder(z)
Key details:
- Use a text encoder (e.g., CLIP's) to get the prompt embedding C
- Feed C to the model as both the conditional and the unconditional (negative-prompt) input for CFG; the guided prediction then collapses to the conditional one
- alpha_bar denotes the cumulative products of the noise schedule used in the DDIM update
- Sampling is deterministic DDIM (eta = 0), i.e., no noise is added at each reverse step
This provides a more complete implementation with the key components needed to reproduce negative-prompt inversion.
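The pseudo code above treats DDIMInversion as a black box. Below is a minimal sketch of the standard approximate DDIM inversion it refers to, written with torch; the model(z, t, C) interface and the alpha_bar array are the same assumptions as in the pseudo code above, not a specific library API:

import torch

def ddim_inversion(z0, C, model, alpha_bar, T):
    """Map a clean latent z0 to a noise latent z_T by running the
    deterministic DDIM update in the forward direction, conditioned
    on the prompt embedding C (guidance scale 1, as in the paper)."""
    z = z0
    for t in range(1, T + 1):
        a_prev, a_t = alpha_bar[t - 1], alpha_bar[t]
        # Standard approximation: use the noise predicted at the
        # current (less noisy) latent to step up to noise level t
        eps = model(z, t, C)
        z0_pred = (z - torch.sqrt(1 - a_prev) * eps) / torch.sqrt(a_prev)
        z = torch.sqrt(a_t) * z0_pred + torch.sqrt(1 - a_t) * eps
    return z

With z_T = ddim_inversion(encoder(x), C, model, alpha_bar, T), the reverse loop in the pseudo code above retraces this trajectory and reconstructs x when the same C is used in both CFG branches.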