BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

Authors: Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou

Abstract: Recent text-to-image diffusion models have demonstrated an astonishing capacity to generate high-quality images. However, research has mainly studied synthesizing images from text prompts alone. While some works have explored using other modalities as conditions, considerable paired data, e.g., box/mask-image pairs, and fine-tuning time are required to train such models. As such paired data is time-consuming and labor-intensive to acquire, and is restricted to a closed set, this potentially becomes the bottleneck for applications in an open world. This paper focuses on the simplest form of user-provided conditions, e.g., a box or a scribble. To mitigate the aforementioned problem, we propose a training-free method to control objects and contexts in the synthesized images so that they adhere to the given spatial conditions. Specifically, three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints, are designed and seamlessly integrated into the denoising step of diffusion models, requiring neither additional training nor massive annotated layout data. Extensive results show that the proposed constraints can control what and where to present in the images while retaining the ability of the Stable Diffusion model to synthesize with high fidelity and diverse concept coverage. The code is publicly available in the BoxDiff repository.

What, Why and How

Here is a summary of the key points of this paper:

What: This paper proposes BoxDiff, a training-free method to control object synthesis in text-to-image diffusion models using simple spatial constraints like boxes or scribbles.

Why: Existing text-to-image models like Stable Diffusion lack explicit control over object layout and location. Layout-to-image models can provide such control, but they require paired layout-image data, which is costly to obtain. BoxDiff aims to provide spatial control without any additional training or paired data.

How: BoxDiff applies spatial constraints to the cross-attention maps between text tokens and image features during diffusion model denoising. Three constraints are used - inner-box, outer-box and corner constraints - to maximize attention within the target regions and minimize it outside. This guides the model to synthesize images adhering to the spatial constraints.

In summary, BoxDiff allows inserting simple user constraints like boxes into a pre-trained text-to-image diffusion model to control object placement and scale without extra training. This provides more explicit spatial control for text-to-image synthesis.

Main Contributions

Here are the key contributions of this paper:

  • Proposes BoxDiff, a training-free approach to control object synthesis in spatial dimensions by incorporating spatial constraints into the denoising process of diffusion models. Requires no extra training or paired layout-image data.

  • Introduces three spatial constraints - inner-box, outer-box and corner constraints - that are applied to the cross-attention maps to guide object generation within specified regions.

  • Retains the ability of models like Stable Diffusion to generate high fidelity, diverse images while providing explicit spatial control over object layout.

  • Can work with different spatial inputs like boxes or scribbles. The method is model-agnostic and can be applied to other diffusion models.

  • Provides strong quantitative results on a novel spatial layout dataset showing the approach outperforms supervised layout-to-image models in precision and accuracy.

  • Demonstrates qualitatively that BoxDiff can control object position, scale, and content in a spatially consistent manner with simple user constraints.

In summary, the key contributions are proposing a training-free spatial control method for diffusion models, introducing spatial constraints on cross-attentions, and showing strong quantitative and qualitative results for controlled text-to-image synthesis.

Method Section

Here is a summary of the method section from the paper:

  • Leverages the cross-modal attention in diffusion models like Stable Diffusion, which creates attention maps between text tokens and image features. These capture spatial information about objects described in the text prompt.

  • Transforms user-provided spatial constraints like boxes into binary masks. The goal is to synthesize images where objects align with these mask regions.

  • Adds three spatial constraints to the cross-attention maps during denoising (a PyTorch sketch of these losses follows this section):

    • Inner-Box Constraint: Maximizes attention inside mask regions to ensure objects are generated there.

    • Outer-Box Constraint: Minimizes attention outside mask regions so objects don’t appear elsewhere.

    • Corner Constraint: Constrains attention at mask corners to control object scale.

  • Applies the constraints by taking the gradient of the loss functions w.r.t. the latent vector and updating the latent based on this gradient.

  • Uses representative sampling techniques like top-k and corner sampling to apply constraints efficiently without compromising image quality.

  • The constrained cross-attention guides diffusion denoising to produce latents that decode into images adhering to the spatial constraints.

In summary, spatial constraints are imposed on cross-modal attention maps to control where objects are generated during the denoising process of a pre-trained diffusion model.
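
To make the constraints concrete, below is a minimal PyTorch sketch of the three losses. It is an illustration rather than the authors' implementation: attn is assumed to be one token's cross-attention map and mask the corresponding binary box mask at the same resolution, the top-k ratio p is a placeholder value, and corner_loss is a simplified stand-in that reuses the inner/outer idea on small patches around the box corners, since the exact corner formulation is not spelled out in this summary.

import torch

def inner_box_loss(attn, mask, p=0.8):
    """Inner-Box constraint: push the top-p fraction of in-box attention toward 1."""
    k = max(1, int(p * mask.sum().item()))
    vals, _ = torch.topk((attn * mask).flatten(), k)
    return 1.0 - vals.mean()

def outer_box_loss(attn, mask, p=0.8):
    """Outer-Box constraint: push the top-p fraction of out-of-box attention toward 0."""
    k = max(1, int(p * (1 - mask).sum().item()))
    vals, _ = torch.topk((attn * (1 - mask)).flatten(), k)
    return vals.mean()

def corner_loss(attn, mask, patch=3):
    """Simplified Corner constraint: apply the inner/outer idea to small
    patches around the four box corners (encourages the object to reach
    the box boundary, i.e. controls its scale)."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    y1, y2 = ys.min().item(), ys.max().item()
    x1, x2 = xs.min().item(), xs.max().item()
    loss = attn.new_zeros(())
    for cy, cx in [(y1, x1), (y1, x2), (y2, x1), (y2, x2)]:
        rs = slice(max(0, cy - patch), cy + patch + 1)
        cs = slice(max(0, cx - patch), cx + patch + 1)
        sub_attn, sub_mask = attn[rs, cs], mask[rs, cs]
        loss = loss + inner_box_loss(sub_attn, sub_mask) + outer_box_loss(sub_attn, sub_mask)
    return loss / 4

With one attention map per constrained token, the total constraint loss in the pseudo code below is simply the sum of these three terms over all token-mask pairs.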

High-Level Pseudo Code

Here is the high-level pseudo code for the BoxDiff method:

# BoxDiff algorithm (high level)

# Inputs:
#   text_prompt:         text description
#   spatial_constraints: list of box coordinates or scribbles
#   step_size:           guidance strength (hyper-parameter)

# Load a pre-trained diffusion model (kept frozen, no fine-tuning)
model = StableDiffusion()

# Get text tokens
tokens = tokenize(text_prompt)

# Transform constraints to binary masks at the attention-map resolution
masks = constraints_to_masks(spatial_constraints)

# Start from random Gaussian noise in latent space
latent = sample_noise()

# Diffusion (denoising) loop
for t in timesteps:

    # Cross-attention maps between text tokens and spatial image features
    attn_maps = get_cross_attentions(model, latent, tokens, t)

    # Inner-Box constraint: maximize attention inside the masks
    inner_box_loss = maximize_attn_in_masks(attn_maps, masks)

    # Outer-Box constraint: minimize attention outside the masks
    outer_box_loss = minimize_attn_outside_masks(attn_maps, masks)

    # Corner constraint: constrain attention at the mask corners (object scale)
    corner_loss = constrain_corner_attn(attn_maps, masks)

    # Combine losses
    loss = inner_box_loss + outer_box_loss + corner_loss

    # Guidance step: move the latent against the loss gradient
    latent = latent - step_size * gradient(loss, latent)

    # Regular denoising step of the diffusion model
    latent = denoise_step(model, latent, tokens, t)

# Generate the image from the final latent
image = decode(model, latent)

The key steps are:

  1. Read the cross-attention maps between text tokens and image features
  2. Apply the inner-box, outer-box, and corner constraints to those maps
  3. Take the gradient of the combined constraint loss w.r.t. the latent
  4. Update the latent with that gradient, then run the regular denoising step
  5. Decode the final latent into an image

This guides the model to adhere to spatial constraints during diffusion denoising.
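
Both listings rely on a create_box_mask / constraints_to_masks helper. A minimal sketch is shown below, assuming boxes are given in normalized [0, 1] coordinates and the masks are built at the resolution of the cross-attention maps (16x16 here); both the coordinate convention and the resolution are illustrative assumptions rather than details taken from the paper.

import torch

def create_box_mask(box, res=16):
    """Binary mask for one box (x1, y1, x2, y2) in normalized [0, 1] coordinates."""
    x1, y1, x2, y2 = box
    mask = torch.zeros(res, res)
    r0, r1 = int(y1 * res), max(int(y1 * res) + 1, int(y2 * res))
    c0, c1 = int(x1 * res), max(int(x1 * res) + 1, int(x2 * res))
    mask[r0:r1, c0:c1] = 1.0
    return mask

def constraints_to_masks(boxes, res=16):
    """One mask per constrained token, stacked into an (N, res, res) tensor."""
    return torch.stack([create_box_mask(b, res) for b in boxes])

# Example: a "cat" box on the left and a "sofa" box covering the lower half
masks = constraints_to_masks([(0.1, 0.3, 0.45, 0.9), (0.0, 0.5, 1.0, 1.0)])
print(masks.shape)  # torch.Size([2, 16, 16])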

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the BoxDiff method:

# Inputs
text_prompt = "A cat sitting on a sofa"
spatial_constraints = [(x1, y1, x2, y2), ...]  # one box per constrained token

# Pre-trained diffusion model (frozen)
model = StableDiffusion()

# Get text tokens from the CLIP text encoder
tokens = tokenize(text_prompt)

# Create binary masks (attention-map resolution) from the constraints
masks = []
for c in spatial_constraints:
    mask = create_box_mask(c)  # 1 inside the box, 0 outside
    masks.append(mask)

# Start from random Gaussian noise in latent space
latent = sample_noise()

# Diffusion (denoising) loop
for t in range(num_timesteps):

    # Cross-attention maps, one per constrained token
    attns = get_cross_attentions(model, latent, tokens, t)

    # Inner-box constraint: top-k attention inside each box should be high
    inner_losses = []
    for i, attn in enumerate(attns):
        k = int(0.8 * masks[i].sum())  # top 80% of in-box elements
        topk_attn = topk(attn * masks[i], k)
        inner_losses.append(1 - topk_attn.mean())
    inner_loss = sum(inner_losses)

    # Outer-box constraint: top-k attention outside each box should be low
    outer_losses = []
    for i, attn in enumerate(attns):
        k = int(0.8 * (1 - masks[i]).sum())  # top 80% of out-of-box elements
        topk_attn = topk(attn * (1 - masks[i]), k)
        outer_losses.append(topk_attn.mean())
    outer_loss = sum(outer_losses)

    # Corner constraint: constrain attention around the box corners (object scale)
    corner_loss = corner_constraint(attns, masks)

    # Combine losses
    loss = inner_loss + outer_loss + corner_loss

    # Guidance step on the latent, then the regular denoising step
    latent = latent - step_size * gradient(loss, latent)
    latent = denoise_step(model, latent, tokens, t)

# Generate the image from the final latent
img = decode(model, latent)

The key steps are creating the masks, reading the cross-attention maps from the model, computing the three constraint losses, taking their gradient with respect to the latent, updating the latent, and decoding the final image. Because the constraints are applied inside the denoising loop of a frozen, pre-trained model, no fine-tuning or paired layout-image data is needed.
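
The one step both listings leave abstract is gradient(loss, latent). In PyTorch this is a single torch.autograd.grad call, provided the attention maps are computed from a latent that requires gradients. The toy below only demonstrates the mechanics: fake_cross_attention is a hypothetical stand-in (in the real pipeline the maps come from the UNet's cross-attention layers, which would have to be hooked), and the step size is an arbitrary illustrative value.

import torch
import torch.nn.functional as F

def fake_cross_attention(latent, res=16):
    """Hypothetical stand-in for a UNet cross-attention map: any differentiable
    map from the latent to a (res x res) attention map works for this demo."""
    feat = F.adaptive_avg_pool2d(latent, res)            # (1, C, res, res)
    return torch.softmax(feat.mean(1).flatten(), 0).view(res, res)

latent = torch.randn(1, 4, 64, 64, requires_grad=True)   # Stable-Diffusion-style latent shape
mask = torch.zeros(16, 16)
mask[4:12, 2:8] = 1.0                                     # a binary box mask

attn = fake_cross_attention(latent)
loss = 1.0 - (attn * mask).sum() / (attn.sum() + 1e-8)    # toy "attend inside the box" loss

(grad,) = torch.autograd.grad(loss, latent)               # d loss / d latent
with torch.no_grad():
    latent = latent - 20.0 * grad                         # guidance step on the latent

Repeating such an update at each denoising step (or a chosen subset of them), followed by the regular denoising step, gives the constrained sampling loop sketched above.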