Cones 2: Customizable Image Synthesis with Multiple Subjects

Authors: Zhiheng Liu, Yifei Zhang, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, Yang Cao

Abstract: Synthesizing images with user-specified subjects has received growing attention due to its practical applications. Despite the recent success in single subject customization, existing algorithms suffer from high training cost and a low success rate as the number of subjects increases. Towards controllable image synthesis with multiple subjects as the constraints, this work studies how to efficiently represent a particular subject as well as how to appropriately compose different subjects. We find that the text embedding regarding the subject token already serves as a simple yet effective representation that supports arbitrary combinations without any model tuning. Through learning a residual on top of the base embedding, we manage to robustly shift the raw subject to the customized subject given various text conditions. We then propose to employ layout, a very abstract and easy-to-obtain prior, as the spatial guidance for subject arrangement. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image, significantly alleviating the interference across them. Both qualitative and quantitative experimental results demonstrate our superiority over state-of-the-art alternatives under a variety of settings for multi-subject customization.

What, Why and How

Here is a summary of the key points from this paper on customizable image synthesis with multiple subject constraints:

What: This paper proposes a new method, Cones 2, for multi-subject image customization. The goal is to generate images containing multiple user-specified subjects based on only a few reference images per subject.

Why: Existing customization methods struggle as the number of subjects grows, exhibiting high training cost, low success rates, and interference between subjects. This paper aims to address these limitations.

How: The proposed Cones 2 method has two main components:

  1. Efficient subject representation - Each subject is represented by a residual token embedding, a single vector obtained by fine-tuning a text encoder on the subject's reference images. These residuals can be combined at inference without any retraining to compose new multi-subject images.

  2. Spatial layout guidance - A predefined layout guides the image generation by strengthening target regions and weakening irrelevant regions in the cross-attention maps. This helps appoint subject locations and reduce interference.

The residual embeddings allow flexible subject composition without exponential training costs. The layout guidance helps generate each subject accurately and in the proper spatial arrangement. Together, these enable high-quality multi-subject image customization.
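
To make the composition step concrete, the lines below sketch the core idea: each learned subject is just a residual vector, and composing subjects amounts to adding each residual to the corresponding token of the prompt embedding. All names here (text_encoder, dog_token_index, cat_token_index, residual) are illustrative, not the paper's code.

# Each learned subject is a single residual vector; composing subjects is
# vector addition at the right token positions of the prompt embedding.
prompt_embedding = text_encoder("A photo of a dog and a cat")   # [seq_len, dim]
prompt_embedding[dog_token_index] += residual["custom_dog"]
prompt_embedding[cat_token_index] += residual["custom_cat"]
# No joint retraining is needed: any subset of residuals can be combined this way.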

Experiments show Cones 2 outperforms prior methods on metrics such as text and image alignment, and it efficiently handles challenging cases such as semantically similar subjects or 6+ subjects. The simple yet effective approach offers strong controllability for multi-subject image synthesis.

Main Contributions

Here are the key contributions of this paper:

  • Proposes an efficient subject representation method using residual token embeddings that can be linearly combined without retraining to compose new multi-subject images.

  • Introduces spatial layout guidance during image generation to appoint subject locations and reduce interference between subjects.

  • Achieves state-of-the-art performance on multi-subject image customization, outperforming prior methods on metrics such as text and image alignment.

  • Handles challenging cases, such as semantically similar subjects and 6+ subjects, far more effectively than existing methods.

  • Provides strong controllability for multi-subject image synthesis with simple yet effective components.

  • Requires significantly lower training cost and storage than joint training methods as subjects can be composed arbitrarily after learning individual residuals.

  • Demonstrates the ability to generate high quality images containing an increasing number of user-specified subjects.

  • Conducts extensive experiments spanning diverse subject categories (pets, scenes, objects, etc.) to substantiate the effectiveness.

In summary, the key contributions are proposing residual token embeddings and spatial layout guidance to enable efficient and high-quality multi-subject image customization with strong controllability.

Method Section

Here is a summary of the method section from the paper:

The goal is to generate images containing multiple user-specified subjects. The method has two main components:

  1. Efficient subject representation using residual token embeddings
  • Fine-tune a text encoder on images of a specific subject to obtain a customized text encoder
  • Apply a text-embedding-preservation loss so that the customized encoder differs from the original encoder only at the subject token
  • Compute the mean difference between the original and customized encoders' embeddings of the subject token to obtain a residual embedding
  • This residual robustly shifts the base category to the customized subject (e.g. dog -> custom dog)
  • Residuals can be combined at inference without retraining to compose new multi-subject images (see the sketch after this list)
  2. Spatial layout guidance
  • Use a predefined layout (bounding boxes + labels) to guide subject arrangement
  • Strengthen activations in each subject's target region in the cross-attention maps
  • Weaken activations in regions irrelevant to a subject
  • This appoints a location for each subject and reduces interference between subjects
  • Guidance is applied at every sampling step to continuously influence generation
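
As a concrete illustration of the residual computation, the sketch below assumes HuggingFace CLIP text encoders and a hypothetical helper subject_token_index that locates the subject word in each tokenized sentence; it is a minimal reconstruction of the description above, not the paper's released code.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
base_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
# tuned_encoder: a copy of base_encoder fine-tuned on the subject's images
# with the text-embedding-preservation loss (fine-tuning itself not shown)

def subject_residual(sentences, tuned_encoder, subject_word):
    diffs = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        idx = subject_token_index(inputs.input_ids, subject_word)  # hypothetical helper
        with torch.no_grad():
            base = base_encoder(**inputs).last_hidden_state[0, idx]
            tuned = tuned_encoder(**inputs).last_hidden_state[0, idx]
        diffs.append(tuned - base)
    # Averaging over many sentences yields a single robust residual vector
    return torch.stack(diffs).mean(dim=0)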

In summary, the method represents subjects via residual embeddings that can be arbitrarily combined. Layout guidance aids in appointing subject locations and alleviating interference during multi-subject image generation. Together these allow flexible and high-quality customization.
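
To make the layout guidance concrete, the following sketch shows one plausible way to rectify a single subject's cross-attention map given its bounding box, strengthening activations inside the box and weakening them elsewhere. The multiplicative scaling and the strength parameter are illustrative assumptions, not the paper's exact formulation.

import torch

def guide_camap(ca_map, box, strength=0.5):
    # ca_map: one subject token's cross-attention map, shape [H, W]
    # box: (top, left, bottom, right) in attention-map coordinates
    top, left, bottom, right = box
    mask = torch.zeros_like(ca_map)
    mask[top:bottom, left:right] = 1.0

    # Strengthen attention inside the target region and weaken it outside,
    # so the subject is placed in its box and interferes less with others
    return ca_map * (1.0 + strength * mask) - ca_map * strength * (1.0 - mask)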

High-Level Pseudo Code

Here is the high-level pseudo code for the method proposed in the paper:

# Represent each subject (done once per subject)
for subject in subjects:
  # Fine-tune a copy of the text encoder on the subject's reference images
  tuned_encoder = fine_tune(encoder, subject_images)

  # Residual = mean difference between tuned and base embeddings at the subject token
  residual[subject] = get_residual(tuned_encoder, encoder, subject_token)

# Compose subjects at inference (no retraining)
prompt = "A photo of subject1 and subject2"
edited_embedding = edit(encode(prompt), [residual[subject1], residual[subject2]])

# Generate an image with layout guidance
latent = random_noise()
for t in sampling_steps:
  # Cross-attention maps produced while denoising with the edited embedding
  ca_map = get_camap(model, latent, edited_embedding, t)

  # Strengthen each subject's target region, weaken irrelevant regions
  guided_camap = guide_camap(ca_map, layout)

  # One denoising step using the rectified cross-attention
  latent = sample_step(model, latent, edited_embedding, guided_camap, t)

# Final generated image with the customized subjects
image = decode(latent)

The key steps are:

  1. Represent each subject by fine-tuning the text encoder and getting a residual embedding
  2. Compose required subjects by adding their residuals to the prompt embedding
  3. At each sampling step, guide the cross-attention map using the layout to appoint subject locations

This allows flexibly combining customized subjects and generating images with layout control.
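
For reference, the composition step maps fairly directly onto an off-the-shelf diffusers Stable Diffusion pipeline, which accepts pre-computed prompt embeddings. The sketch below assumes the residual vectors (residual_dog, residual_cat) and the subject token indices (dog_index, cat_index) have already been computed, and it omits the layout guidance, which requires hooking the UNet's cross-attention layers; it is an illustration, not the paper's released code.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

prompt = "A photo of a dog and a cat on the beach"
tokens = pipe.tokenizer(prompt, padding="max_length",
                        max_length=pipe.tokenizer.model_max_length,
                        return_tensors="pt")
with torch.no_grad():
    prompt_embeds = pipe.text_encoder(tokens.input_ids)[0]

# Add each learned residual at its subject's token position
# (residual_dog, residual_cat, dog_index, cat_index are assumed precomputed)
prompt_embeds[0, dog_index] += residual_dog
prompt_embeds[0, cat_index] += residual_cat

image = pipe(prompt_embeds=prompt_embeds).images[0]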

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the method proposed in this paper:

# Represent each subject
for subject in subjects:

  # Prepare sentences containing the subject's base category token
  sentences = get_subject_sentences(subject)

  # Fine-tune a copy of the pretrained text encoder
  tuned_encoder = copy(encoder)
  for epoch in epochs:
    for sentence in sentences:
      # Denoising (reconstruction) loss on the subject's reference images
      loss = reconstruction_loss(diffusion_model, tuned_encoder, sentence, subject_images)
      # Keep all non-subject token embeddings close to the original encoder
      loss += text_embedding_preservation_loss(tuned_encoder, encoder, sentence, subject_token)
      optimize(loss)

  # Residual = mean difference at the subject token over many sentences
  diffs = []
  for sentence in sentences:
    original_embedding = encoder(sentence)[subject_token]
    tuned_embedding = tuned_encoder(sentence)[subject_token]
    diffs.append(tuned_embedding - original_embedding)
  residual[subject] = mean(diffs)

# Multi-subject image generation
prompt = "A photo of subject1 and subject2"

# Add each residual at its subject's token position (no retraining needed)
edited_embedding = encoder(prompt)
for subject in [subject1, subject2]:
  edited_embedding[token_index(subject)] += residual[subject]

latent = random_noise()
for t in [T, T-1, ..., 1]:

  # Cross-attention maps from the denoising network at this step
  ca_map = get_camap(diffusion_model, latent, edited_embedding, t)

  # Strengthen activations inside each subject's box, weaken them elsewhere
  ca_map = guide_camap(ca_map, layout)

  # One denoising step with the rectified cross-attention
  latent = sample_step(diffusion_model, latent, edited_embedding, ca_map, t)

# Final generated image
image = decode(latent)

Key details include:

  • Getting subject-specific sentences for each custom subject
  • Training objectives - reconstruction + text embedding preservation
  • Calculating residual embedding from multiple sentences
  • Using edited prompt embedding at each sampling step
  • Guiding CAM with predefined layout

This provides a more complete overview of implementing the proposed approach.
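
The text_embedding_preservation_loss referenced above is not spelled out in this summary. Below is one plausible way to write it, assuming the encoders are HuggingFace CLIPTextModel-style modules that return per-token embeddings and that subject_index marks the position of the subject token; the exact formulation in the paper may differ.

import torch
import torch.nn.functional as F

def text_embedding_preservation_loss(tuned_encoder, base_encoder, input_ids, subject_index):
    # Per-token embeddings from both encoders, shape [seq_len, dim]
    tuned = tuned_encoder(input_ids).last_hidden_state[0]
    with torch.no_grad():
        base = base_encoder(input_ids).last_hidden_state[0]

    # Penalize drift on every token except the subject token, so that
    # fine-tuning is allowed to move only the subject token's embedding
    mask = torch.ones(tuned.shape[0], dtype=torch.bool)
    mask[subject_index] = False
    return F.mse_loss(tuned[mask], base[mask])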