Null-text Inversion for Editing Real Images using Guided Diffusion Models
Authors: Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or
Abstract: Recent text-guided diffusion models provide powerful image generation capabilities. Currently, a massive effort is given to enable the modification of these images using text only as means to offer intuitive and versatile editing. To edit a real image using these state-of-the-art tools, one must first invert the image with a meaningful text prompt into the pretrained model’s domain. In this paper, we introduce an accurate inversion technique and thus facilitate an intuitive text-based modification of the image. Our proposed inversion consists of two novel key components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestamp and optimize around it. We demonstrate that a direct inversion is inadequate on its own, but does provide a good anchor for our optimization. (ii) NULL-text optimization, where we only modify the unconditional textual embedding that is used for classifier-free guidance, rather than the input text embedding. This allows for keeping both the model weights and the conditional embedding intact and hence enables applying prompt-based editing while avoiding the cumbersome tuning of the model’s weights. Our Null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and prompt editing, showing high-fidelity editing of real images.
What, Why and How
Here is a summary of the key points from this paper:
What:
- This paper proposes a new method for inverting real images into the latent space of text-guided diffusion models such as Stable Diffusion, enabling the images to be edited using only text prompts.
- The method has two main components:
  - Pivotal inversion: uses the trajectory from an initial DDIM inversion as a fixed pivot point, so the optimization focuses on a single noise trajectory per image rather than many random noise samples.
  - Null-text optimization: optimizes only the unconditional "null" text embedding while keeping the model weights fixed, which allows accurate reconstruction while preserving editability.
Why:
- Existing inversion methods for diffusion models struggle to reconstruct real images accurately while still allowing intuitive text-based editing.
- Directly optimizing the conditional text embedding can damage the model's priors, while fine-tuning the model weights requires duplicating the entire model for every image.
- Pivotal inversion is more efficient than optimizing over many random noise vectors, and null-text optimization preserves editability better than tuning the conditional text embedding.
How:
- Perform an initial DDIM inversion to obtain a trajectory of latent codes (the pivot).
- Optimize the unconditional null-text embedding around this pivot to accurately reconstruct the image.
- At inference time, use the optimized null-text embedding to edit the inverted image through prompt manipulation.
In summary, the paper introduces pivotal inversion and null-text optimization to enable high-fidelity reconstruction of real images in diffusion models, while retaining the ability to intuitively edit them using only text prompts. This bridges the gap between reconstruction and editability.
Main Contributions
Here are the main contributions of this paper:
- Proposes a novel inversion technique for real images in text-guided diffusion models that enables high-fidelity reconstruction while retaining text-based editing capabilities.
- Introduces the concept of pivotal inversion for diffusion models: using the fixed trajectory from an initial DDIM inversion as a pivot for efficient optimization.
- Introduces null-text optimization: optimizing the unconditional null-text embedding while keeping the model weights and conditional text embedding fixed.
- Shows that direct DDIM inversion fails for text-guided diffusion models under classifier-free guidance due to error amplification.
- Demonstrates that null-text optimization with pivotal inversion outperforms standard textual-inversion techniques.
- Validates that null-text optimization depends on pivotal inversion and fails without it.
- Shows high-fidelity inversion and editing results on real images using the publicly available Stable Diffusion model.
- Provides extensive ablation studies analyzing the contribution of each key component.
- Compares against other text-only and mask-based editing techniques such as VQGAN+CLIP, Text2LIVE, and SDEdit, showing superior results.
In summary, the core contributions are:
- Pivotal inversion for diffusion models
- Null-text optimization technique
- An effective inversion approach enabling text-based editing of real images.
Method Section
Here is a summary of the method section:
The goal is to invert a real image I and prompt P into the latent space of a pretrained text-guided diffusion model to enable text-based editing.
The proposed method has two key components:
- Pivotal Inversion
  - Perform an initial DDIM inversion with guidance scale w=1 to get a trajectory of latent codes z*_T, ..., z*_0 (the pivot); a minimal sketch of this step is given after this list.
  - DDIM inversion alone fails at the standard guidance scale because classifier-free guidance amplifies the accumulated reconstruction error.
  - Use the z* trajectory as a fixed pivot for the optimization instead of mapping random noise vectors to the image.
  - For each timestep t, minimize ||z*_{t-1} - zbar_{t-1}||^2, where zbar_{t-1} is the latent produced by one guided DDIM sampling step from the current latent zbar_t.
  - The optimization is efficient because it focuses on a single pivotal trajectory.
- Null-Text Optimization
  - Instead of tuning the model weights or the conditional text embedding, optimize the unconditional "null" text embedding used for classifier-free guidance.
  - Initialize the null embedding with the original embedding of the empty string from the pretrained model.
  - A different null embedding varnothing_t can be used per timestep.
  - The model weights and the conditional text embedding are kept fixed.
  - This preserves text-editing capabilities better than tuning the conditional text embedding.
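To make the pivotal inversion concrete, here is a minimal PyTorch sketch of one reversed DDIM step. The names are assumptions, not the authors' code: `eps_model` stands in for the pretrained UNet noise predictor, `cond` for the prompt embedding, and `alphas_cumprod` for the cumulative noise schedule (a torch tensor).

```python
import torch

def ddim_inversion_step(z_t, t, t_next, eps_model, cond, alphas_cumprod):
    # One reversed DDIM step (z_t -> z_{t+1}); with guidance scale w=1 the
    # noise prediction is just the conditional output of the UNet.
    eps = eps_model(z_t, t, cond)
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    # Predict the clean latent from z_t, then re-noise it to level t_next.
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
```

With w=1 there is no classifier-free guidance term, so the small per-step approximation errors stay bounded; running the same step under a large guidance scale is exactly what amplifies them.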
The full algorithm:
- Get the z* pivot trajectory using DDIM inversion (w=1), then switch to the standard guidance scale (w=7.5) for the optimization.
- Initialize zbar_T = z*_T and varnothing_T with the embedding of the empty string.
- For t = T down to 1:
  - Optimize varnothing_t for N steps to minimize ||z*_{t-1} - zbar_{t-1}||^2, where zbar_{t-1} is one guided DDIM step from zbar_t (a PyTorch sketch of this inner loop follows below).
  - Initialize varnothing_{t-1} = varnothing_t.
  - Update the latent code zbar_{t-1} for the next step using the optimized varnothing_t.
- Return zbar_T = z*_T and {varnothing_t}; editing is then performed by modifying the prompt P.
So in summary, pivotal inversion provides an efficient optimization anchor, while null-text optimization preserves editability for prompt-based editing of inverted real images.
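As a rough illustration, the per-timestep optimization might look like the following PyTorch sketch. It assumes the same hypothetical `eps_model` as above, plus a `ddim_step` helper that performs one deterministic DDIM sampling step given a noise prediction; the hyperparameter defaults mirror the paper's settings but the implementation itself is an assumption.

```python
import torch

def optimize_null_embedding(null_emb, z_bar_t, z_star_prev, t, t_prev, cond,
                            eps_model, ddim_step, w=7.5, n_inner=10, lr=1e-2):
    # Tune the unconditional embedding so that one guided DDIM step from
    # z_bar_t lands on the pivot latent z*_{t-1}; everything else is frozen.
    null_emb = null_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([null_emb], lr=lr)
    for _ in range(n_inner):
        eps_uncond = eps_model(z_bar_t, t, null_emb)
        eps_cond = eps_model(z_bar_t, t, cond)
        eps = eps_uncond + w * (eps_cond - eps_uncond)  # classifier-free guidance
        z_prev = ddim_step(z_bar_t, eps, t, t_prev)
        loss = torch.nn.functional.mse_loss(z_prev, z_star_prev)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return null_emb.detach()
```

Note that the gradient reaches the noise prediction only through the unconditional branch, which is why the conditional prior is left intact.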
High-Level Pseudo Code
Here is the high-level pseudo code for the method proposed in this paper:
```python
# Pivotal Inversion
z_star = ddim_inversion(image, prompt, w=1)  # pivot trajectory [z*_0, ..., z*_T]

# Null-Text Optimization
varnothing = initialize_to_null_embedding()  # embedding of the empty string ""
null_embeddings = []
z_bar = z_star[T]                            # start from the inverted noise
for t in range(T, 0, -1):
    for i in range(N):
        varnothing = optimize(varnothing, z_bar, z_star[t - 1], prompt)
    null_embeddings.append(varnothing)
    z_bar = guided_ddim_step(z_bar, t, prompt, varnothing)  # latent for next step

# Editing
edited_prompt = modify_prompt(prompt)
edited_image = sample(z_star[T], edited_prompt, null_embeddings)
```
To summarize:
- Perform an initial DDIM inversion with guidance scale w=1 to get the pivotal trajectory z_star.
- Initialize the null embedding varnothing.
- Optimize varnothing in reverse timestep order so that each guided step reconstructs the pivot.
- Update the latent code zbar_{t-1} for the next timestep.
- Modify the prompt and sample the edited image from the inverted z_T using the optimized null embeddings (a sketch of this sampling pass follows below).
So we use pivotal inversion for efficiency and null-text optimization to retain editing capabilities when inverting real images. The inverted image can then be edited by modifying the text prompt.
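To make the editing step concrete, here is a hedged sketch of the final sampling pass. In the paper, edits are applied together with Prompt-to-Prompt attention control; this sketch shows only the plain re-sampling path, reusing the assumed `eps_model` and `ddim_step` placeholders from the earlier sketches.

```python
import torch

@torch.no_grad()
def edit(z_T, edited_cond, null_embs, timesteps, eps_model, ddim_step, w=7.5):
    # Re-run DDIM sampling from the inverted noise z_T, swapping in the
    # edited prompt embedding while reusing the optimized null embeddings.
    z = z_T
    for i, t in enumerate(timesteps):  # timesteps ordered T -> 1
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        eps_u = eps_model(z, t, null_embs[i])
        eps_c = eps_model(z, t, edited_cond)
        z = ddim_step(z, eps_u + w * (eps_c - eps_u), t, t_prev)
    return z  # decode with the VAE to obtain the edited image
```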
Detailed Pseudo Code
Here is more detailed pseudo code for implementing the approach proposed in this paper:
```python
# Hyperparameters (values from the paper)
T = 50     # number of DDIM diffusion steps
w = 7.5    # classifier-free guidance scale for sampling and optimization
N = 10     # number of optimization steps per diffusion step

# Pivotal Inversion
def ddim_inversion(image, prompt):
    # Run the DDIM process in reverse (z_0 -> z_T) with guidance scale w=1,
    # recording the whole trajectory as the pivot.
    z = encode(image)                              # VAE encoder
    trajectory = [z]                               # z*_0
    for t in range(1, T + 1):
        z = ddim_reverse_step(z, t, prompt, w=1)   # deterministic noising step
        trajectory.append(z)
    return trajectory                              # [z*_0, ..., z*_T]

z_star = ddim_inversion(image, prompt)

# Null-Text Optimization
def optimize(varnothing, z_bar_t, z_star_prev, t, prompt):
    # One gradient step on the null embedding: a guided DDIM step from
    # z_bar_t should land on the pivot latent z*_{t-1}.
    z_prev = guided_ddim_step(z_bar_t, t, prompt, varnothing, w)
    loss = mse(z_prev, z_star_prev)
    return gradient_step(varnothing, loss)

varnothing = initialize_to_null_embedding()        # embedding of ""
null_embeddings = {}
z_bar = z_star[T]
for t in range(T, 0, -1):
    for i in range(N):
        varnothing = optimize(varnothing, z_bar, z_star[t - 1], t, prompt)
    null_embeddings[t] = varnothing                # varnothing_{t-1} starts from varnothing_t
    z_bar = guided_ddim_step(z_bar, t, prompt, varnothing, w)

# Editing
edited_prompt = modify_prompt(prompt)              # e.g., via Prompt-to-Prompt
z = z_star[T]                                      # start from the inverted noise, not fresh noise
for t in range(T, 0, -1):
    z = guided_ddim_step(z, t, edited_prompt, null_embeddings[t], w)
edited_image = decode(z)                           # VAE decoder
```
In summary, pivotal inversion provides the z_star trajectory which is used as the optimization target in null-text optimization to accurately reconstruct the image while retaining editability. The optimized varnothing is then used at inference for text-guided editing by modifying the prompt.
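Finally, for readers who want to experiment, the pieces assumed in the sketches above map naturally onto the diffusers library. This is a hedged setup sketch (the model id and settings are assumptions, not the authors' released code):

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load Stable Diffusion and switch to the deterministic DDIM scheduler.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# The "null" embedding is the text encoding of the empty string "",
# i.e., the unconditional branch of classifier-free guidance.
tokens = pipe.tokenizer(
    "", padding="max_length",
    max_length=pipe.tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    null_embedding = pipe.text_encoder(tokens.input_ids)[0]

# pipe.unet plays the role of the eps_model placeholder in the sketches
# above, and pipe.vae provides the encode/decode steps for the latents.
```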