TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition
Authors: Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong
Abstract: Text-driven diffusion models have exhibited impressive generative capabilities, enabling various image editing tasks. In this paper, we propose TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition. This task aims to seamlessly integrate user-provided objects into a specific visual context. Current diffusion-based methods often involve costly instance-based optimization or finetuning of pretrained models on customized datasets, which can potentially undermine their rich prior. In contrast, TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization. Moreover, we introduce the exceptional prompt, which contains no information, to facilitate text-driven diffusion models in accurately inverting real images into latent representations, forming the basis for compositing. Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ, COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile visual domains. Code is available at TF-ICON
What, Why and How
Here is a summary of the key points in this paper:
What:
- The paper proposes a novel training-free framework called TF-ICON for image-guided composition across different visual domains.
- Image-guided composition aims to seamlessly integrate user-provided objects into a specific visual context (e.g., incorporating a panda into a painting).
- TF-ICON enables text-to-image diffusion models like Stable Diffusion to perform cross-domain image composition without requiring additional training or finetuning.
Why:
- Current diffusion models for image composition require costly optimization or finetuning that can damage the model's rich prior knowledge.
- TF-ICON avoids this issue by leveraging off-the-shelf pretrained models without modifications.
How:
- Introduces the "exceptional prompt" to accurately invert images into latent codes for composition.
- Composes "composite self-attention maps" that reflect relations between the images being composed.
- Gradually injects the composite attention maps during the diffusion sampling process to guide image composition.
- Allows diffusion models to perform composition by denoising from accurate latent codes with attention map injection.
In summary, this paper proposes TF-ICON, a training-free framework that harnesses pretrained diffusion models for cross-domain image-guided composition through exceptional-prompt inversion and attention-map injection during sampling. This avoids costly training while enabling versatile image composition capabilities.
Main Contributions
Here are the main contributions of this paper:
- Demonstrate the superior performance of high-order diffusion ODE solvers compared to the commonly used DDIM inversion for real-image inversion.
- Present an "exceptional prompt" that enables text-driven models to achieve highly accurate image invertibility, outperforming state-of-the-art inversion methods.
- Propose the first training-free framework that equips attention-based text-to-image diffusion models with cross-domain image-guided composition capabilities.
- Introduce "composite self-attention maps" designed to reflect relations between the images being composed, and a method to inject them during diffusion sampling.
- Show both quantitatively and qualitatively that the proposed framework outperforms prior baselines for image-guided composition across diverse visual domains.
In summary, the key contributions are:
- Demonstrating the superiority of high-order ODE solvers over DDIM for real-image inversion (see the toy sketch after this list)
- Exceptional prompt for accurate image inversion
- First training-free framework for image composition with diffusion models
- Composite self-attention maps injection during sampling
- State-of-the-art performance in cross-domain image composition
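To make the first point concrete, below is a toy sketch (not the paper's exact solver) contrasting a first-order Euler/DDIM-like inversion step with a second-order Heun step on the probability-flow ODE in the sigma parameterization, dx/dsigma = (x - D(x, sigma)) / sigma. Here denoiser is a placeholder model and all names are illustrative.

def euler_invert_step(x, sigma, sigma_next, denoiser):
    # First-order (DDIM-like) step: one derivative evaluation per step (sigma > 0 assumed).
    d = (x - denoiser(x, sigma)) / sigma
    return x + d * (sigma_next - sigma)

def heun_invert_step(x, sigma, sigma_next, denoiser):
    # Second-order step: an Euler predictor followed by a trapezoidal correction.
    d = (x - denoiser(x, sigma)) / sigma
    x_pred = x + d * (sigma_next - sigma)
    d_next = (x_pred - denoiser(x_pred, sigma_next)) / sigma_next
    return x + 0.5 * (d + d_next) * (sigma_next - sigma)

Running such steps with increasing sigma maps a real image toward Gaussian noise; the extra denoiser evaluation in the second-order step is the intuition behind why higher-order solvers can invert real images more faithfully than plain DDIM inversion.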
Method Section
Here is a summary of the method section:
The proposed framework has two main steps:
- Image Inversion with Exceptional Prompt
  - Uses high-order ODE solvers instead of DDIM for more accurate image inversion.
  - Introduces the "exceptional prompt", which contains no information, to enable accurate inversion for text-driven diffusion models.
- Training-Free Image Composition
  - Incorporates the inverted noises of the main and reference images into a composed starting point.
  - Calculates composite self-attention maps reflecting the relations between the images.
  - Gradually injects the composite attention maps during the diffusion sampling process.
  - Denoises from the accurate starting point while injecting the attention maps to guide composition.
  - Uses the exceptional prompt for the main and reference image reconstructions to obtain accurate attention maps.
  - Applies the normal text prompt in the composition process to leverage the model's prior.
  - Preserves the background at various noise levels for a smooth transition.
In summary, the method leverages ODE solvers and the proposed exceptional prompt to invert images, then performs composition by denoising from an accurate starting point while injecting composite attention maps. This training-free approach harnesses pretrained models’ rich priors for high-quality cross-domain image composition.
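The paper describes the exceptional prompt only as a prompt that "contains no information". As a rough illustration (an assumption of this summary, not the authors' exact construction), one information-free conditioning for Stable Diffusion v1.x is a constant text embedding in which every token position carries the same value:

import torch

def make_exceptional_embedding(batch_size=1, seq_len=77, embed_dim=768,
                               value=0.0, device="cpu"):
    # A (batch, seq_len, embed_dim) conditioning tensor with no token- or position-specific
    # information; the shape matches the CLIP text-encoder output of Stable Diffusion v1.x.
    return torch.full((batch_size, seq_len, embed_dim), value, device=device)

# In diffusers-style code, this tensor would stand in for the usual text-encoder output
# during inversion and reconstruction, e.g.
#   noise_pred = unet(latent, t, encoder_hidden_states=make_exceptional_embedding()).sample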
High-Level Pseudo Code
Here is the high-level pseudo code for the proposed method:
# Inputs
main_img, ref_img, user_mask, text_prompt, exceptional_prompt

# Step 1: Image Inversion (run with the exceptional prompt)
main_noise = invert(main_img, exceptional_prompt)
ref_noise = invert(ref_img, exceptional_prompt)

# Step 2: Image Composition
composed_noise = incorporate(main_noise, ref_noise, user_mask)
for t in timesteps:                              # from high noise to low noise
    # Reconstruct both images at the current noise level to obtain their self-attention maps
    main_attn = reconstruct(main_noise, t, exceptional_prompt)
    ref_attn = reconstruct(ref_noise, t, exceptional_prompt)
    composite_attn = compose(main_attn, ref_attn)
    if t in attention_injection_steps:           # the early, high-noise part of sampling
        composed_noise = denoise(composed_noise, t, text_prompt, composite_attn)
    else:
        composed_noise = denoise(composed_noise, t, text_prompt)
    if t in background_preservation_steps:
        # Keep the region outside the user mask consistent with the main image
        composed_noise = preserve_background(composed_noise, main_noise, user_mask)
composed_img = decode(composed_noise)
return composed_img
In summary, the key steps are:
- Invert main and reference images using exceptional prompt
- Incorporate inverted noises into starting point
- Reconstruct main and ref images to get attention maps
- Compose attention maps and inject them in the early timesteps (a minimal sketch of the attention composition follows this list)
- Denoise from starting point with text prompt
- Preserve the background outside the user mask during sampling for a smooth transition
- Decode final noise to get composed image
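As referenced in the list above, here is a minimal sketch of one way to merge two self-attention maps with a user mask. It assumes the reference latent has already been resized onto the same token grid as the main image (so both maps are (heads, N, N)) and it omits the cross-attention terms between main and reference tokens that the paper's composite maps also include; all names are illustrative.

import torch

def compose_self_attention(attn_main, attn_ref, region):
    # attn_main, attn_ref: (heads, N, N) self-attention maps from the main and reference
    # reconstructions; region: (N,) boolean mask of tokens inside the user-specified region.
    composite = attn_main.clone()
    idx = region.nonzero(as_tuple=True)[0]  # token indices inside the region
    # Query-key pairs that both fall inside the region take the reference image's attention.
    composite[:, idx[:, None], idx[None, :]] = attn_ref[:, idx[:, None], idx[None, :]]
    return composite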
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the proposed method:
# Inputs
main_img, ref_img, user_mask, text_prompt
ref_mask                       # segmentation mask of the reference object inside the user region
exceptional_prompt             # information-free prompt used only for inversion and reconstruction
normal_prompt = text_prompt    # the user-provided text prompt drives the actual composition

# Hyperparameters
T                              # number of solver steps
tau_A, tau_B                   # fractions of the steps used for attention injection / background preservation

# Step 1: Image Inversion (image -> noise, run with the exceptional prompt)
main_noise = zeros((T + 1, H, W, C))
ref_noise = zeros((T + 1, H, W, C))
main_noise[0] = vq_encode(main_img)
ref_noise[0] = vq_encode(ref_img)
for t in 1...T:
    # High-order (DPM-Solver) step run in the inversion direction
    main_noise[t] = dpm_inverse_step(main_noise[t-1], t-1, exceptional_prompt)
    ref_noise[t] = dpm_inverse_step(ref_noise[t-1], t-1, exceptional_prompt)

# Step 2: Image Composition
composed_noise = zeros((T + 1, H, W, C))
# Starting point: reference-object region from ref_noise, background from main_noise,
# and fresh Gaussian noise in the part of the user region not covered by the object
# (a tensor-level sketch of this composition is given after this pseudo code)
composed_noise[T] = (ref_noise[T] * ref_mask
                     + main_noise[T] * (1 - user_mask)
                     + randn(H, W, C) * (user_mask ^ ref_mask))
for t in T...1:
    # Self-attention maps from reconstructing both latents with the exceptional prompt,
    # plus the cross-attention between the main and reference latents
    main_attn = self_attn(main_noise[t], t, exceptional_prompt)
    ref_attn = self_attn(ref_noise[t], t, exceptional_prompt)
    cross_attn = cross_attention(main_noise[t], ref_noise[t])
    composite_attn = compose(main_attn, ref_attn, cross_attn)
    # Inject the composite self-attention maps only during the early (high-noise) steps,
    # i.e., while fewer than tau_A * T steps have elapsed
    if (T - t) < tau_A * T:
        composed_noise[t-1] = dpm_step(composed_noise[t], t, normal_prompt, composite_attn)
    else:
        composed_noise[t-1] = dpm_step(composed_noise[t], t, normal_prompt)
    # Keep the background fixed to the main image's latent while fewer than tau_B * T steps
    # have elapsed, then release it so the mask boundary can blend smoothly
    if (T - t) < tau_B * T:
        composed_noise[t-1] = (composed_noise[t-1] * user_mask
                               + main_noise[t-1] * (1 - user_mask))

composed_img = vq_decode(composed_noise[0])
return composed_img
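As a concrete, tensor-level sketch of the starting-noise composition above (assuming every tensor is already at the latent resolution, masks are binary, and ref_mask is the segmentation of the reference object placed inside the user region; names are illustrative):

import torch

def incorporate(main_noise_T, ref_noise_T, user_mask, ref_mask):
    # Reference-object region from the reference noise, background from the main noise,
    # and fresh Gaussian noise in the user region not covered by the object.
    gap = (user_mask.bool() ^ ref_mask.bool()).float()
    return (ref_noise_T * ref_mask
            + main_noise_T * (1.0 - user_mask)
            + torch.randn_like(main_noise_T) * gap)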
In summary, the key steps are:
- Invert images using DPM solver and exceptional prompt
- Compose starting noise from inverted noises
- Get main, ref, cross attentions from reconstruction
- Compose and inject attention maps in early timesteps
- Denoise with normal prompt and attention injection
- Preserve the background outside the user mask, controlled by tau_B (a minimal sketch of this step follows)
- Decode final noise to get composed image
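Finally, a minimal sketch of the background-preservation step, assuming user_mask is a binary H x W mask in image space that must be downsampled to the latent resolution; names are illustrative:

import torch
import torch.nn.functional as F

def preserve_background(composed_latent, main_latent, user_mask):
    # composed_latent, main_latent: (C, h, w) latents at the same noise level;
    # user_mask: (H, W) binary mask in image space.
    latent_mask = F.interpolate(user_mask[None, None].float(),
                                size=composed_latent.shape[-2:],
                                mode="nearest")[0]  # (1, h, w), broadcasts over channels
    # Outside the user region, keep the main image's latent so the background is unchanged.
    return composed_latent * latent_mask + main_latent * (1.0 - latent_mask)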