TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition
Authors: Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong
Abstract: Text-driven diffusion models have exhibited impressive generative capabilities, enabling various image editing tasks. In this paper, we propose TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition. This task aims to seamlessly integrate user-provided objects into a specific visual context. Current diffusion-based methods often involve costly instance-based optimization or finetuning of pretrained models on customized datasets, which can potentially undermine their rich prior. In contrast, TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization. Moreover, we introduce the exceptional prompt, which contains no information, to facilitate text-driven diffusion models in accurately inverting real images into latent representations, forming the basis for compositing. Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ, COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile visual domains. Code is available at TF-ICON
What, Why and How
Here is a summary of the key points in this paper:
What:
- The paper proposes a novel training-free framework called TF-ICON for image-guided composition across different visual domains.
- Image-guided composition aims to seamlessly integrate user-provided objects into a specific visual context (e.g., incorporating a panda into a painting).
- TF-ICON enables text-to-image diffusion models like Stable Diffusion to perform cross-domain image composition without requiring additional training or finetuning.
Why:
- Current diffusion models for image composition require costly optimization or finetuning that can damage the model's rich prior knowledge.
- TF-ICON avoids this issue by leveraging off-the-shelf pretrained models without modifications.
How:
- Introduces the "exceptional prompt" to accurately invert images into latent codes for composition.
- Composes "composite self-attention maps" that reflect relations between the images being composed.
- Gradually injects the composite attention maps during the diffusion sampling process to guide image composition.
- Allows diffusion models to perform composition by denoising from accurate latent codes with attention map injection.
In summary, this paper proposes TF-ICON, a training-free framework that harnesses pretrained diffusion models for cross-domain image-guided composition through exceptional-prompt inversion and attention-map injection during sampling. This avoids costly training while enabling versatile image composition capabilities.
Main Contributions
Here are the main contributions of this paper:
- Demonstrate the superior performance of high-order diffusion ODE solvers compared to the commonly used DDIM inversion for real-image inversion.
- Present an "exceptional prompt" that enables text-driven models to achieve highly accurate image invertibility, outperforming state-of-the-art inversion methods.
- Propose the first training-free framework that equips attention-based text-to-image diffusion models with cross-domain image-guided composition capabilities.
- Introduce "composite self-attention maps" designed to reflect relations between the images being composed, and a method to inject them during diffusion sampling.
- Show both quantitatively and qualitatively that the proposed framework outperforms prior baselines for image-guided composition across diverse visual domains.
In summary, the key contributions are:
- Demonstrating the superiority of high-order ODE solvers over DDIM for real-image inversion (see the toy sketch after this list)
- Exceptional prompt for accurate image inversion
- First training-free framework for image composition with diffusion models
- Composite self-attention maps injection during sampling
- State-of-the-art performance in cross-domain image composition
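To make the first point concrete, below is a toy sketch (not the paper's exact solver) contrasting a first-order Euler/DDIM-like inversion step with a second-order Heun step on the probability-flow ODE in the sigma parameterization, dx/dsigma = (x - D(x, sigma)) / sigma. Here denoiser is a placeholder model and all names are illustrative.

def euler_invert_step(x, sigma, sigma_next, denoiser):
    # First-order (DDIM-like) step: one derivative evaluation per step (sigma > 0 assumed).
    d = (x - denoiser(x, sigma)) / sigma
    return x + d * (sigma_next - sigma)

def heun_invert_step(x, sigma, sigma_next, denoiser):
    # Second-order step: an Euler predictor followed by a trapezoidal correction.
    d = (x - denoiser(x, sigma)) / sigma
    x_pred = x + d * (sigma_next - sigma)
    d_next = (x_pred - denoiser(x_pred, sigma_next)) / sigma_next
    return x + 0.5 * (d + d_next) * (sigma_next - sigma)

Running such steps with increasing sigma maps a real image toward Gaussian noise; the extra denoiser evaluation in the second-order step is the intuition behind why higher-order solvers can invert real images more faithfully than plain DDIM inversion.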
Method Section
Here is a summary of the method section:
The proposed framework has two main steps:
- Image Inversion with Exceptional Prompt
  - Uses high-order ODE solvers instead of DDIM for more accurate image inversion.
  - Introduces the "exceptional prompt", which contains no information, to enable accurate inversion for text-driven diffusion models.
- Training-Free Image Composition
  - Incorporates the inverted noises of the main and reference images into a composed starting point.
  - Calculates composite self-attention maps reflecting the relations between the images.
  - Gradually injects the composite attention maps during the diffusion sampling process.
  - Denoises from the accurate starting point while injecting the attention maps to guide composition.
  - Uses the exceptional prompt for the main and reference image reconstructions to obtain accurate attention maps.
  - Applies the normal text prompt in the composition process to leverage the model's prior.
  - Preserves the background at various noise levels for a smooth transition.
In summary, the method leverages ODE solvers and the proposed exceptional prompt to invert images, then performs composition by denoising from an accurate starting point while injecting composite attention maps. This training-free approach harnesses pretrained models’ rich priors for high-quality cross-domain image composition.
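The paper describes the exceptional prompt only as a prompt that "contains no information". As a rough illustration (an assumption of this summary, not the authors' exact construction), one information-free conditioning for Stable Diffusion v1.x is a constant text embedding in which every token position carries the same value:

import torch

def make_exceptional_embedding(batch_size=1, seq_len=77, embed_dim=768,
                               value=0.0, device="cpu"):
    # A (batch, seq_len, embed_dim) conditioning tensor with no token- or position-specific
    # information; the shape matches the CLIP text-encoder output of Stable Diffusion v1.x.
    return torch.full((batch_size, seq_len, embed_dim), value, device=device)

# In diffusers-style code, this tensor would stand in for the usual text-encoder output
# during inversion and reconstruction, e.g.
#   noise_pred = unet(latent, t, encoder_hidden_states=make_exceptional_embedding()).sample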
High-Level Pseudo Code
Here is the high-level pseudo code for the proposed method:
# Inputs
main_img, ref_img, user_mask, text_prompt, exceptional_prompt

# Step 1: Image Inversion (run with the exceptional prompt)
main_noise = invert(main_img, exceptional_prompt)
ref_noise = invert(ref_img, exceptional_prompt)

# Step 2: Image Composition
composed_noise = incorporate(main_noise, ref_noise, user_mask)
for t in timesteps:                              # from high noise to low noise
    # Reconstruct both images at the current noise level to obtain their self-attention maps
    main_attn = reconstruct(main_noise, t, exceptional_prompt)
    ref_attn = reconstruct(ref_noise, t, exceptional_prompt)
    composite_attn = compose(main_attn, ref_attn)
    if t in attention_injection_steps:           # the early, high-noise part of sampling
        composed_noise = denoise(composed_noise, t, text_prompt, composite_attn)
    else:
        composed_noise = denoise(composed_noise, t, text_prompt)
    if t in background_preservation_steps:
        # Keep the region outside the user mask consistent with the main image
        composed_noise = preserve_background(composed_noise, main_noise, user_mask)
composed_img = decode(composed_noise)
return composed_img
In summary, the key steps are:
- Invert main and reference images using exceptional prompt
- Incorporate inverted noises into starting point
- Reconstruct main and ref images to get attention maps
- Compose attention maps and inject them in the early timesteps (a minimal sketch of the attention composition follows this list)
- Denoise from starting point with text prompt
- Preserve the background outside the user mask during sampling for a smooth transition
- Decode final noise to get composed image
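As referenced in the list above, here is a minimal sketch of one way to merge two self-attention maps with a user mask. It assumes the reference latent has already been resized onto the same token grid as the main image (so both maps are (heads, N, N)) and it omits the cross-attention terms between main and reference tokens that the paper's composite maps also include; all names are illustrative.

import torch

def compose_self_attention(attn_main, attn_ref, region):
    # attn_main, attn_ref: (heads, N, N) self-attention maps from the main and reference
    # reconstructions; region: (N,) boolean mask of tokens inside the user-specified region.
    composite = attn_main.clone()
    idx = region.nonzero(as_tuple=True)[0]  # token indices inside the region
    # Query-key pairs that both fall inside the region take the reference image's attention.
    composite[:, idx[:, None], idx[None, :]] = attn_ref[:, idx[:, None], idx[None, :]]
    return composite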
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the proposed method:
# Inputs
main_img, ref_img, user_mask, text_prompt
ref_mask                       # segmentation mask of the reference object inside the user region
exceptional_prompt             # information-free prompt used only for inversion and reconstruction
normal_prompt = text_prompt    # the user-provided text prompt drives the actual composition

# Hyperparameters
T                              # number of solver steps
tau_A, tau_B                   # fractions of the steps used for attention injection / background preservation

# Step 1: Image Inversion (image -> noise, run with the exceptional prompt)
main_noise = zeros((T + 1, H, W, C))
ref_noise = zeros((T + 1, H, W, C))
main_noise[0] = vq_encode(main_img)
ref_noise[0] = vq_encode(ref_img)
for t in 1...T:
    # High-order (DPM-Solver) step run in the inversion direction
    main_noise[t] = dpm_inverse_step(main_noise[t-1], t-1, exceptional_prompt)
    ref_noise[t] = dpm_inverse_step(ref_noise[t-1], t-1, exceptional_prompt)

# Step 2: Image Composition
composed_noise = zeros((T + 1, H, W, C))
# Starting point: reference-object region from ref_noise, background from main_noise,
# and fresh Gaussian noise in the part of the user region not covered by the object
# (a tensor-level sketch of this composition is given after this pseudo code)
composed_noise[T] = (ref_noise[T] * ref_mask
                     + main_noise[T] * (1 - user_mask)
                     + randn(H, W, C) * (user_mask ^ ref_mask))
for t in T...1:
    # Self-attention maps from reconstructing both latents with the exceptional prompt,
    # plus the cross-attention between the main and reference latents
    main_attn = self_attn(main_noise[t], t, exceptional_prompt)
    ref_attn = self_attn(ref_noise[t], t, exceptional_prompt)
    cross_attn = cross_attention(main_noise[t], ref_noise[t])
    composite_attn = compose(main_attn, ref_attn, cross_attn)
    # Inject the composite self-attention maps only during the early (high-noise) steps,
    # i.e., while fewer than tau_A * T steps have elapsed
    if (T - t) < tau_A * T:
        composed_noise[t-1] = dpm_step(composed_noise[t], t, normal_prompt, composite_attn)
    else:
        composed_noise[t-1] = dpm_step(composed_noise[t], t, normal_prompt)
    # Keep the background fixed to the main image's latent while fewer than tau_B * T steps
    # have elapsed, then release it so the mask boundary can blend smoothly
    if (T - t) < tau_B * T:
        composed_noise[t-1] = (composed_noise[t-1] * user_mask
                               + main_noise[t-1] * (1 - user_mask))

composed_img = vq_decode(composed_noise[0])
return composed_img
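As a concrete, tensor-level sketch of the starting-noise composition above (assuming every tensor is already at the latent resolution, masks are binary, and ref_mask is the segmentation of the reference object placed inside the user region; names are illustrative):

import torch

def incorporate(main_noise_T, ref_noise_T, user_mask, ref_mask):
    # Reference-object region from the reference noise, background from the main noise,
    # and fresh Gaussian noise in the user region not covered by the object.
    gap = (user_mask.bool() ^ ref_mask.bool()).float()
    return (ref_noise_T * ref_mask
            + main_noise_T * (1.0 - user_mask)
            + torch.randn_like(main_noise_T) * gap)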
In summary, the key steps are:
- Invert images using DPM solver and exceptional prompt
- Compose starting noise from inverted noises
- Get main, ref, cross attentions from reconstruction
- Compose and inject attention maps in early timesteps
- Denoise with normal prompt and attention injection
- Preserve the background outside the user mask, controlled by tau_B (a minimal sketch of this step follows)
- Decode final noise to get composed image
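Finally, a minimal sketch of the background-preservation step, assuming user_mask is a binary H x W mask in image space that must be downsampled to the latent resolution; names are illustrative:

import torch
import torch.nn.functional as F

def preserve_background(composed_latent, main_latent, user_mask):
    # composed_latent, main_latent: (C, h, w) latents at the same noise level;
    # user_mask: (H, W) binary mask in image space.
    latent_mask = F.interpolate(user_mask[None, None].float(),
                                size=composed_latent.shape[-2:],
                                mode="nearest")[0]  # (1, h, w), broadcasts over channels
    # Outside the user region, keep the main image's latent so the background is unchanged.
    return composed_latent * latent_mask + main_latent * (1.0 - latent_mask)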