Improving Tuning-Free Real Image Editing with Proximal Guidance
Authors: Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, Di Liu, Qilong Zhangli, Jindong Jiang, Zhaoyang Xia, Akash Srivastava, Dimitris Metaxas
Abstract: DDIM inversion has revealed the remarkable potential of real image editing within diffusion-based methods. However, the accuracy of DDIM reconstruction degrades as larger classifier-free guidance (CFG) scales being used for enhanced editing. Null-text inversion (NTI) optimizes null embeddings to align the reconstruction and inversion trajectories with larger CFG scales, enabling real image editing with cross-attention control. Negative-prompt inversion (NPI) further offers a training-free closed-form solution of NTI. However, it may introduce artifacts and is still constrained by DDIM reconstruction quality. To overcome these limitations, we propose proximal guidance and incorporate it to NPI with cross-attention control. We enhance NPI with a regularization term and reconstruction guidance, which reduces artifacts while capitalizing on its training-free nature. Additionally, we extend the concepts to incorporate mutual self-attention control, enabling geometry and layout alterations in the editing process. Our method provides an efficient and straightforward approach, effectively addressing real image editing tasks with minimal computational overhead.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a method called Proximal Negative-Prompt Inversion (ProxNPI) to improve tuning-free real image editing with diffusion models.
Why:
-
Existing methods like Null-text Inversion (NTI) and Negative-prompt Inversion (NPI) have limitations in reconstruction quality and editing capabilities when using larger classifier-free guidance scales.
-
NTI requires per-image optimization which is slow. NPI provides a fast closed-form solution but may introduce artifacts.
-
ProxNPI aims to address these limitations to enable high-quality real image editing while maintaining efficiency.
How:
-
ProxNPI incorporates a regularization term (proximal guidance) to constrain the noise difference between source and target, reducing artifacts.
-
It also utilizes an inversion guidance technique to correct errors in the DDIM reconstruction process.
-
The method is applied to both NPI+Cross-Attention control for appearance editing, and extended to Mutual Self-Attention control for geometry/layout editing.
-
Ablation studies analyze the effects of different thresholds and inversion guidance strengths.
In summary, ProxNPI enhances real image editing quality and efficiency by building upon NPI using proximal regularization and inversion guidance, with applications to appearance and geometry manipulation. The proposed techniques help address key limitations of prior arts like NTI and NPI.
Main Contributions
Here are the main contributions of this paper:
- Proposes Proximal Negative-Prompt Inversion (ProxNPI) to improve real image editing quality and efficiency. This incorporates:
-
A regularization term (proximal guidance) to constrain the noise difference between source and target images. Helps reduce artifacts.
-
An inversion guidance technique to correct errors from imperfect DDIM reconstruction.
-
Applies ProxNPI to Negative-Prompt Inversion + Cross-Attention control for texture/appearance editing.
-
Extends the approach to Mutual Self-Attention control for geometry and layout editing. Introduces ProxMasaCtrl.
-
Provides ablation studies analyzing the effects of different thresholds and inversion guidance strengths.
-
Demonstrates qualitative results comparing ProxNPI against prior arts like Null-text Inversion and Negative-Prompt Inversion. Showcases improved reconstruction and editing.
-
Shows applications of ProxNPI for simultaneous editing of texture and geometry.
In summary, the main contributions are:
- The proposed ProxNPI method itself
- Applications to appearance and geometry manipulation
- Quantitative and qualitative experiments demonstrating improved performance over prior arts
- Analysis of the method through ablation studies
- Extension to other diffusion-based editing frameworks like MasaCtrl
Method Section
Here is a summary of the method section from the paper:
Background:
- Reviews DDIM inversion - a widely used deterministic sampling method for diffusion models.
- Explains Null-text Inversion (NTI) - optimizes null embeddings to align reconstruction and inversion trajectories under large classifier-free guidance (CFG) scales. Requires per-image optimization.
- Introduces Negative-Prompt Inversion (NPI) - provides closed-form solution to NTI by setting null embedding to source prompt. Faster but may introduce artifacts.
Proximal Negative-Prompt Inversion:
- Proposes proximal guidance - adds regularization term to NPI to constrain noise difference between source and target. Reduces artifacts.
- Uses proximal functions like soft/hard thresholding based on quantiles.
- Incorporates inversion guidance - performs gradient descent on current latent to align with inversion latent. Corrects DDIM errors.
Proximal Mutual Self-Attention Control:
- Extends method to Mutual Self-Attention Control (MasaCtrl) for geometry/layout editing.
- Uses source prompt only in DDIM reconstruction branch, retains null in synthesis branch.
- Applies proximal guidance to constrain noise difference between null and target.
In summary, introduces proximal regularization and inversion guidance to enhance NPI, and extends it to MasaCtrl framework for geometry editing.
High-Level Pseudo Code
Here is the high-level pseudo code for the key algorithms in the paper:
# Proximal Negative-Prompt Inversion
# Input
source_sample z0
source_prompt C
target_prompt C'
diffusion_model epsilon_theta
proximal_fn prox_lambda
# Reconstruction
z_bar = DDIMInvert(z0, C, w=1)
for t = T to 1:
eps_src = epsilon_theta(z_bar, t, C)
eps_tar = epsilon_theta(z_bar, t, C')
eps = eps_src + w * prox_lambda(eps_tar - eps_src)
z_prev = DDIM step(z_bar, eps)
# Inversion guidance
if t < T_inv:
z_prev -= eta * (z_prev - z_inv[t])
z_bar = z_prev
return z_bar
# Proximal Mutual Self-Attention Control
# Reconstruction branch
eps_hat = eps_src + (eps_src - eps_src)
# Synthesis branch
eps_tilde = eps_null + w * prox_lambda(eps_tar - eps_null)
# Remainder of MasaCtrl algorithm...
The key steps are:
- DDIM reconstruction with source prompt
- Apply proximal regularization to noise difference
- Perform DDIM sampling step
- Optional inversion guidance
- For MasaCtrl - use source prompt only in reconstruction branch
## Detailed Pseudo Code
Here is a more detailed pseudo code to implement the key algorithms in the paper:
```python
# Proximal Negative-Prompt Inversion
# Inputs
z0 = source sample
C = source prompt
C' = target prompt
epsilon_theta = diffusion model
prox_lambda = proximal function
eta = inversion guidance step size
T_inv = timesteps for inversion guidance
# DDIM Reconstruction
z_bar = DDIMInvert(z0, C, w=1)
for t = T to 1:
# Classifier-free guidance
eps_src = epsilon_theta(z_bar, t, C)
eps_tar = epsilon_theta(z_bar, t, C')
# Proximal regularization
eps = eps_src + w * prox_lambda(eps_tar - eps_src)
# DDIM step
z0_pred = DDIM_predict_x0(z_bar, eps, t)
# Inversion guidance
if t < T_inv:
mask = threshold_mask(eps_tar - eps_src)
z0_pred -= eta * mask * (z0_pred - z0)
z_bar = DDIM_predict_z(z0_pred, eps, t-1)
return z0_pred
# Proximal Mutual Self-Attention Control
# Inputs
C, C', epsilon_theta, prox_lambda as above
alpha = interpolation ratio
# Reconstruction
eps_hat = epsilon_theta(z_bar, t, C)
# Synthesis
eps_null = epsilon_theta(z_bar, t, null)
eps_tar = epsilon_theta(z_bar, t, C')
eps_tilde = (1 - alpha) * eps_null + alpha * eps_hat
+ w * prox_lambda(eps_tar - eps_null)
# Remainder of MasaCtrl algorithm...
The key details are:
- The specific DDIM prediction steps
- Computing the proximal regularization term
- Applying inversion guidance with masking
- Using source prompt only in reconstruction branch for MasaCtrl