Training-Free Layout Control with Cross-Attention Guidance
Authors: Minghao Chen, Iro Laina, Andrea Vedaldi
Abstract: Recent diffusion-based generators can produce high-quality images based only on textual prompts. However, they do not correctly interpret instructions that specify the spatial layout of the composition. We propose a simple approach that can achieve robust layout control without requiring training or fine-tuning the image generator. Our technique, which we call layout guidance, manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the reconstruction in the desired direction given, e.g., a user-specified layout. In order to determine how to best guide attention, we study the role of different attention maps when generating images and experiment with two alternative strategies, forward and backward guidance. We evaluate our method quantitatively and qualitatively with several experiments, validating its effectiveness. We further demonstrate its versatility by extending layout guidance to the task of editing the layout and context of a given real image.
What, Why and How
Here is a summary of the key points in this paper:
What:
- The paper proposes methods for layout-controlled image generation using pre-trained diffusion models like Stable Diffusion, without requiring further training or fine-tuning.
Why:
- Current text-to-image models struggle to accurately interpret spatial layout instructions specified via text prompts. The paper aims to achieve better layout control over generated images.
How:
- The method intervenes in the cross-attention layers of diffusion models, which interface the textual and visual information during generation.
- Two strategies are explored: 1) forward guidance directly biases the cross-attention maps; 2) backward guidance optimizes the latent code to match the desired layout via backpropagation.
- The start and padding tokens are shown to carry layout information, so they are important to control in forward guidance; backward guidance handles all tokens implicitly.
- The method can also edit the layout of real images by combining layout guidance with Textual Inversion and DreamBooth.
- Experiments validate the effectiveness of the proposed strategies over baselines in terms of layout-faithfulness metrics such as VISOR and AP.
In summary, the paper proposes training-free mechanisms to control the layout in generated/real images by intervening in cross-attention maps of diffusion models. This addresses limitations of text-based control over spatial layout.
Main Contributions
Based on my understanding, the main contributions of this paper are:
- Proposes two general training-free strategies (forward and backward guidance) to control the layout of images generated by large pre-trained diffusion models such as Stable Diffusion.
- Analyzes the role of cross-attention maps in layout generation and shows that the start and padding tokens surprisingly carry heavy semantic and layout information.
- Evaluates the proposed methods quantitatively and qualitatively, demonstrating effective layout control without compromising image quality or requiring model fine-tuning.
- Extends the approach to layout editing of real images by combining it with Textual Inversion and DreamBooth.
- Provides insights into model internals by studying the effect of intervening at different layers, loss scales, etc. on layout control.
In summary, the key contributions are:
- Two general strategies for training-free layout control of generated images.
- Analysis and insights into the role of cross-attention in layout.
- Quantitative and qualitative validation of the proposed methods.
- Extension to real-image layout editing.
- Ablation studies for a better understanding of model internals.
Method Section
Here is a summary of the method section from the paper:
The paper focuses on controlling the layout of images generated by pre-trained diffusion models like Stable Diffusion, without requiring additional training or fine-tuning.
It first provides background on Stable Diffusion, which combines a text encoder that produces token embeddings, a latent autoencoder, and a conditional denoising U-Net that iteratively reverses the noising process in latent space to generate images.
Cross-attention layers connect the text and image features. The paper analyzes the resulting attention maps and shows that the start and padding tokens surprisingly carry substantial layout information.
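As a rough illustration (not the authors' code), these cross-attention maps are per-pixel distributions over text tokens; the sketch below computes them for random features, with all names and shapes being assumptions:

import torch
import torch.nn.functional as F

def cross_attention_maps(image_feats, token_embs, W_q, W_k):
    # image_feats: (H*W, d_img) flattened U-Net features; token_embs: (N, d_txt) text embeddings
    Q = image_feats @ W_q                        # (H*W, d)
    K = token_embs @ W_k                         # (N, d)
    scores = Q @ K.T / Q.shape[-1] ** 0.5        # scaled dot-product
    return F.softmax(scores, dim=-1)             # (H*W, N): per-pixel distribution over tokens

# Toy usage with random tensors (dimensions chosen arbitrarily)
attn = cross_attention_maps(torch.randn(64 * 64, 320), torch.randn(77, 768),
                            torch.randn(320, 64), torch.randn(768, 64))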
Two strategies are proposed for layout guidance:
Forward Guidance (a sketch follows this list):
- Represent the desired layout as a spatial window function.
- Bias the cross-attention maps to associate the target region with the chosen word token.
- Apply this intervention only during the initial denoising iterations.
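A minimal vectorized sketch of this biasing step, assuming attention maps of shape (H*W, N) and a precomputed box mask; the names and the exact interpolation follow the Eq. 1-style update used in the detailed pseudo code later in this summary:

import torch

def bias_attention(attn, token_id, box_mask, lam=0.8):
    # attn: (H*W, N) cross-attention maps; box_mask: (H*W,) ~1 inside the box, ~0 outside
    # lam: interpolation strength (a hyperparameter, value here is illustrative)
    attn = attn.clone()
    col = attn[:, token_id]
    total = col.sum()                                   # the token's total attention mass
    attn[:, token_id] = (1 - lam) * col + lam * box_mask * total
    return attn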
Backward Guidance (a sketch follows this list):
- Define an energy function based on attention-map statistics.
- Use the gradient of this function to update the latent code via backpropagation.
- Alternate between denoising steps and gradient updates.
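A minimal sketch of one backward-guidance update, assuming a differentiable function that returns the (H*W, N) attention maps for a latent; the energy follows the Eq. 2-style form from the detailed pseudo code, and the step size is illustrative:

import torch

def layout_energy(attn, token_id, box_mask):
    # Low when the token's attention mass falls inside the target box
    col = attn[:, token_id]
    return (1 - (col * box_mask).sum() / col.sum()) ** 2

def backward_guidance_step(latent, tokens, cross_attn_fn, token_id, box_mask, step=1.0):
    # One gradient update of the latent code toward the desired layout
    latent = latent.detach().requires_grad_(True)
    attn = cross_attn_fn(latent, tokens)                # (H*W, N), differentiable w.r.t. latent
    loss = layout_energy(attn, token_id, box_mask)
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step * grad).detach()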
The paper also discusses extending layout guidance to real-image editing by combining it with Textual Inversion and DreamBooth.
Textual Inversion learns an embedding that represents a concept from a few example images, while DreamBooth fine-tunes the model for a particular subject.
Layout guidance can then control the positioning of the learned concept when generating new images, while preserving its identity.
In summary, the key aspects are: analyzing the role of cross-attention, two strategies for layout control by intervening in attention, and an extension to real-image editing.
High-Level Pseudo Code
Here is the high-level pseudo code for the key aspects of the paper:
# Stable Diffusion components
text_encoder = TransformerLM()
latent_encoder = Autoencoder()
denoiser = UNet()

# Forward Guidance: bias attention during the first denoising steps
for t in range(T):
    if t < 40:
        attn_maps = denoiser.cross_attn(latent, text)
        attn_maps = intervene_attn(attn_maps, layout)            # Eq. 1
        latent = denoiser(latent, text, attn_maps)
    else:
        latent = denoiser(latent, text)
img = decode_image(latent)

# Backward Guidance: update the latent with the layout-loss gradient
for t in range(T):
    if t < 10:
        for _ in range(5):                                       # several gradient updates per step
            attn_maps = denoiser.cross_attn(latent, text)
            loss = compute_attn_layout_loss(attn_maps, layout)   # Eq. 2
            latent = latent - lr * grad(loss, latent)
    latent = denoiser(latent, text)
img = decode_image(latent)

# Real Image Editing
token = textual_inversion(image)                 # learn an embedding for the subject
edited_model = dreambooth(model, image, token)   # fine-tune the model on the subject

# Generation with layout control
text = "image of <token> on the left"
bbox = get_layout(text)
img = backward_guidance(edited_model, text, bbox)
The key functions are:
- intervene_attn to bias the cross-attention maps
- compute_attn_layout_loss to guide the latents
- textual_inversion and dreambooth for real-image editing
- backward_guidance for controlled generation
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the key aspects of the paper:
# Model
text_encoder = TransformerLM()
latent_encoder = Autoencoder()
denoiser = UNet()
# Text Encoding
text = "[SoT] a cat to the left of a dog [EoT]"
tokens = text_encoder(text)
# Forward Guidance
def intervene_attn(attn_maps, bbox, token_id):
    window = make_gaussian_window(bbox)                          # ~1 inside the target box, ~0 outside
    total = sum(attn_maps[v][token_id] for v in range(H * W))    # token's total attention mass
    for u in range(H * W):
        # Eq. 1: redistribute the token's attention mass toward the target region
        attn_maps[u][token_id] = (1 - lam) * attn_maps[u][token_id] + lam * window[u] * total
    return attn_maps
for t in range(T):
    if t < 40:
        latent = add_noise(latent, t)
        attn_maps = denoiser.cross_attn(latent, tokens)
        # Guidance: bias each object token's attention toward its own box
        attn_maps = intervene_attn(attn_maps, cat_bbox, cat_token_id)
        attn_maps = intervene_attn(attn_maps, dog_bbox, dog_token_id)
        latent = denoiser(latent, tokens, attn_maps)
    else:
        latent = add_noise(latent, t)
        latent = denoiser(latent, tokens)
img = decode_image(latent)
# Backward Guidance
def compute_layout_loss(attn_maps, bbox, token_id):
    # Eq. 2: energy is low when the token's attention mass falls inside the box
    # bbox: set of spatial indices inside the target box
    attn_in_box = sum(attn_maps[u][token_id] for u in bbox)
    attn_total = sum(attn_maps[u][token_id] for u in range(H * W))
    return (1 - attn_in_box / attn_total) ** 2

for t in range(T):
    if t < 10:
        for _ in range(5):   # several gradient updates per denoising step
            attn_maps = denoiser.cross_attn(latent, tokens)
            loss = (compute_layout_loss(attn_maps, cat_bbox, cat_token_id)
                    + compute_layout_loss(attn_maps, dog_bbox, dog_token_id))
            latent = latent - lr * grad(loss, latent)
    latent = add_noise(latent, t)
    latent = denoiser(latent, tokens)
img = decode_image(latent)
This implements the key functions for:
- Encoding text
- Forward guidance with attention intervention
- Backward guidance with layout loss and latent update
- The UNet denoiser and image decoding
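As a quick sanity check of the layout loss above (purely illustrative, using random attention maps rather than a real denoiser):

import torch

H = W = 16
N = 77                                            # CLIP context length (start + words + padding)
attn = torch.rand(H * W, N)
attn = attn / attn.sum(dim=-1, keepdim=True)      # per-pixel distribution over tokens

box_mask = torch.zeros(H, W)
box_mask[:, : W // 2] = 1.0                       # target box: left half of the image
box_mask = box_mask.flatten()

token_id = 2                                      # e.g. the "cat" token
col = attn[:, token_id]
loss = (1 - (col * box_mask).sum() / col.sum()) ** 2
print(f"layout loss: {loss.item():.3f}")          # ~0.25 when attention is spread uniformly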