Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models
Authors: Daiki Miyake, Akihiro Iohara, Yu Saito, Toshiyuki Tanaka
Abstract: In image editing employing diffusion models, it is crucial to preserve the reconstruction quality of the original image while changing its style. Although existing methods ensure reconstruction quality through optimization, a drawback of these methods is the significant amount of time required for optimization. In this paper, we propose negative-prompt inversion, a method capable of achieving equivalent reconstruction solely through forward propagation without optimization, thereby enabling much faster editing processes. We experimentally demonstrate that the reconstruction quality of our method is comparable to that of existing methods, allowing for inversion at a resolution of 512 pixels and with 50 sampling steps within approximately 5 seconds, which is more than 30 times faster than null-text inversion. Reduction of the computation time by the proposed method further allows us to use a larger number of sampling steps in diffusion models to improve the reconstruction quality with a moderate increase in computation time.
What, Why and How
Here is a summary of the key points from this paper:
What:
- The paper proposes a new method called negative-prompt inversion for reconstructing and editing images with diffusion models.
Why:
- Existing methods require optimizing embeddings or model weights to reconstruct images, which is slow. The proposed method uses only forward propagation for faster reconstruction.
How:
- The method is based on null-text inversion which optimizes a null-text embedding for reconstruction.
- The paper shows theoretically and empirically that the optimized null-text embedding approximates the prompt embedding.
- So the proposed method simply uses the prompt embedding in place of optimizing a null-text embedding.
- This allows reconstructing images with similar quality as null-text inversion but much faster, in around 5 seconds compared to over 2 minutes.
In summary, the key novelty is avoiding optimization by using the prompt embedding directly, instead of an optimized null-text embedding. This enables fast high-quality reconstruction and editing of images with diffusion models using only feedforward computation.
Main Contributions
Here are the main contributions of this paper:
- They propose negative-prompt inversion, a new method for fast, high-quality reconstruction of images using diffusion models without optimization.
- They provide a theoretical analysis showing that the prompt embedding can approximate the optimized null-text embedding for reconstruction.
- They demonstrate experimentally that their method achieves reconstruction quality comparable to null-text inversion while being over 30 times faster.
- They show their method can be combined with existing editing techniques such as prompt-to-prompt for fast single-image editing.
- They analyze the trade-off between reconstruction quality and speed by varying the number of sampling steps: more steps improve quality at the cost of speed, yet their method remains far faster than null-text inversion.
In summary, the key contributions are proposing and analyzing the negative-prompt inversion method, and demonstrating its effectiveness for high-quality yet very fast reconstruction and editing compared to prior art. The speedup enables applications needing fast processing like video editing.
Method Section
Here is a summary of the method section from the paper:
The goal is to obtain latent variables and text embeddings to reconstruct a real image using diffusion models without optimization.
The key steps are:
- DDIM inversion maps the image latent to a noise latent by running the deterministic DDIM sampling process in the forward direction, with the diffusion model conditioned on the prompt (guidance scale 1).
- Null-text inversion improves reconstruction by optimizing the null-text embedding at each step so that the reverse process computed with classifier-free guidance (CFG) stays aligned with the DDIM inversion trajectory.
- The proposed negative-prompt inversion uses the input prompt embedding directly instead of optimizing a null-text embedding.
- The paper shows theoretically that, under perfect optimization, the optimized null-text embedding coincides with the prompt embedding.
- So instead of optimizing a null-text embedding, the method plugs the prompt embedding into the unconditional branch of CFG when computing the reverse diffusion (see the formula sketch below).
- This allows reconstructing images using only forward propagation, avoiding slow optimization.
- For image editing, the edited prompt embedding is used as the text condition, and the original prompt embedding replaces the null-text embedding in CFG.
In summary, the key idea is to replace the optimized null-text embedding with the input prompt embedding, enabling fast high-quality reconstruction and editing using standard diffusion model sampling.
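Written out, the relationship between null-text inversion and the proposed method is a short derivation. The notation below is the standard CFG one (w is the guidance scale, \varnothing the null-text embedding, C the prompt embedding); this is a sketch of the argument, not the paper's exact equations. Null-text inversion optimizes, at each step t,

\min_{\varnothing_t} \left\| z^*_{t-1} - z_{t-1}(z_t, \varnothing_t, C) \right\|_2^2

where z^*_{t-1} is the DDIM-inversion latent and z_{t-1}(\cdot) is the CFG reverse step. CFG itself predicts

\tilde{\epsilon}_\theta(z_t, t, C, \varnothing) = \epsilon_\theta(z_t, t, \varnothing) + w \left( \epsilon_\theta(z_t, t, C) - \epsilon_\theta(z_t, t, \varnothing) \right)

Negative-prompt inversion substitutes \varnothing := C, so the guided prediction collapses:

\tilde{\epsilon}_\theta(z_t, t, C, C) = \epsilon_\theta(z_t, t, C)

This is exactly the guidance-scale-1 prediction used during DDIM inversion, which is why the reverse process retraces the inversion trajectory and reconstructs the image without any optimization.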
High-Level Pseudo Code
Here is the high-level pseudo code for the negative-prompt inversion method proposed in the paper:
# Input
x: original image
p: prompt describing x
# Diffusion model
model = DiffusionModel()
# Get latent code for x via DDIM inversion (conditioned on p)
z_T = DDIMInversion(x, p)
# Negative-prompt inversion
z_t = z_T
for t in reversed(range(1, T + 1)):
    # Classifier-free guidance: both branches receive the prompt p,
    # so the guided prediction reduces to the conditional one
    eps_c = model.predict_noise(z_t, t, p)  # Conditional
    eps_u = model.predict_noise(z_t, t, p)  # "Unconditional" (prompt used as negative prompt)
    # Deterministic DDIM reverse step
    z_t = DDIMStep(z_t, t, eps_c, eps_u)
# Output
x_recon = Decode(z_t)  # Reconstructed image (z_t is now z_0)
The key steps are:
- Get latent code z_T via DDIM inversion of the original image x, conditioned on prompt p
- In the negative-prompt inversion loop, use the prompt embedding p directly as both the conditional and unconditional (“negative prompt”) input to the classifier free guidance.
- Perform reverse diffusion steps to reconstruct the image.
This avoids optimizing a separate null-text embedding, enabling fast reconstruction using just the input prompt embedding.
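In practice, this maps directly onto the negative-prompt argument found in common Stable Diffusion implementations. The following is a minimal sketch using Hugging Face diffusers; the model name and guidance scale are illustrative, and z_T is assumed to come from a separate DDIM inversion of the source image (not shown, see the detailed pseudo code below):

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # Deterministic DDIM sampling

source_prompt = "a photo of a cat"               # Prompt describing the original image
edit_prompt = "a photo of a cat wearing a hat"   # Edited prompt

# z_T: noise latent from DDIM inversion of the source image (assumed precomputed)
image = pipe(
    prompt=edit_prompt,             # Text condition (use source_prompt instead to reconstruct)
    negative_prompt=source_prompt,  # Source prompt replaces the null text in CFG
    latents=z_T,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

With prompt=source_prompt this reconstructs the original image; with an edited prompt it performs editing, matching the CFG substitution described in the method section.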
Detailed Pseudo Code
Here is some more detailed pseudo code to implement the negative-prompt inversion method:
# Hyperparameters
T = 1000  # Number of diffusion steps
alpha_bar = ...  # Cumulative products of the noise schedule (alpha_bar[0] = 1)
w = 7.5  # Guidance scale
# Input
x: original image
p: prompt describing x
C = text_encoder(p)  # Get prompt embedding
# Encoder, decoder, diffusion model
encoder = Encoder()
decoder = Decoder()
model = DiffusionModel()
# DDIM inversion: image latent -> noise latent
z_0 = encoder(x)
z_T = DDIMInversion(z_0, C)
# Negative-prompt inversion (deterministic DDIM sampling, no added noise)
z = z_T
for t in reversed(range(1, T + 1)):
    # CFG with the prompt embedding C in both branches
    eps_c = model(z, t, C)  # Conditional
    eps_u = model(z, t, C)  # "Unconditional" (negative prompt set to C)
    eps = eps_u + w * (eps_c - eps_u)  # Equals eps_c since eps_u == eps_c
    # Deterministic DDIM reverse step
    a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
    z0_pred = (z - sqrt(1 - a_t) * eps) / sqrt(a_t)  # Predicted clean latent
    z = sqrt(a_prev) * z0_pred + sqrt(1 - a_prev) * eps
# Decode
x_recon = decoder(z)
Key details:
- Use a text encoder (e.g., CLIP's) to get the prompt embedding C
- Feed C to the model as both the conditional and the unconditional (negative-prompt) input for CFG; the guided prediction then collapses to the conditional one
- alpha_bar denotes the cumulative products of the noise schedule used in the DDIM update
- Sampling is deterministic DDIM (eta = 0), i.e., no noise is added at each reverse step
This provides a more complete implementation with the key components needed to reproduce negative-prompt inversion.
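The pseudo code above treats DDIMInversion as a black box. Below is a minimal sketch of the standard approximate DDIM inversion it refers to, written with torch; the model(z, t, C) interface and the alpha_bar array are the same assumptions as in the pseudo code above, not a specific library API:

import torch

def ddim_inversion(z0, C, model, alpha_bar, T):
    """Map a clean latent z0 to a noise latent z_T by running the
    deterministic DDIM update in the forward direction, conditioned
    on the prompt embedding C (guidance scale 1, as in the paper)."""
    z = z0
    for t in range(1, T + 1):
        a_prev, a_t = alpha_bar[t - 1], alpha_bar[t]
        # Standard approximation: use the noise predicted at the
        # current (less noisy) latent to step up to noise level t
        eps = model(z, t, C)
        z0_pred = (z - torch.sqrt(1 - a_prev) * eps) / torch.sqrt(a_prev)
        z = torch.sqrt(a_t) * z0_pred + torch.sqrt(1 - a_t) * eps
    return z

With z_T = ddim_inversion(encoder(x), C, model, alpha_bar, T), the reverse loop in the pseudo code above retraces this trajectory and reconstructs x when the same C is used in both CFG branches.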