Generating images of rare concepts using pre-trained diffusion models

Authors: Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, Gal Chechik

Abstract: Text-to-image diffusion models can synthesize high-quality images, but they have various limitations. Here we highlight a common failure mode of these models, namely, generating uncommon concepts and structured concepts like hand palms. We show that this limitation is partly due to the long-tail nature of their training data: web-crawled data sets are strongly unbalanced, causing models to under-represent concepts from the tail of the distribution. We characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, using a small reference set of images, a technique that we call SeedSelect. SeedSelect does not require retraining or finetuning the diffusion model. We assess the faithfulness, quality and diversity of SeedSelect in creating rare objects and generating complex formations like hand images, and find that it consistently achieves superior performance. We further show the advantage of SeedSelect in semantic data augmentation. Generating semantically appropriate images can successfully improve performance in few-shot recognition benchmarks, for classes from the head and from the tail of the training data of diffusion models.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper studies the failure of text-to-image diffusion models like Stable Diffusion to generate images of concepts that are under-represented or rare in the training data. For example, concepts with fewer than 10K images in the LAION training set fail around 50% of the time.

Why:

  • The authors hypothesize this is because for rare concepts, only a small region of the input noise space is “in-distribution” for generating those concepts. So random noise samples are likely to be out-of-distribution.

How:

  • They propose SeedSelect - a method to identify “in-distribution” noise seeds for rare concepts.

  • Given a few reference images of a rare concept, SeedSelect performs gradient descent to find a noise seed that generates images similar to the references both in semantics (CLIP similarity) and in appearance (diffusion VAE loss).

  • This allows generating better images of rare concepts without retraining the diffusion model. Experiments show it consistently outperforms baselines in faithfulness and quality for rare objects.

  • It also enables semantic augmentation for few-shot recognition, achieving SOTA results by generating valuable augmentations even from just 1 image per class (a rough sketch of this use appears after the summary below).

  • SeedSelect also improves generation of challenging concepts like hand palms that current diffusion models struggle with.

In summary, the paper offers insights into why current diffusion models fail on rare concepts, and provides SeedSelect as a simple but effective solution using just a handful of reference images, without any fine-tuning.
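
To illustrate the augmentation use case, here is a rough sketch of how generated images could be mixed with the real few-shot images to build a training set. Note that seed_select_generate stands for a hypothetical wrapper around SeedSelect, and the 50-images-per-class default is an illustrative choice, not the paper's protocol:

def build_augmented_trainset(few_shot_data, images_per_class=50):
    # few_shot_data: dict mapping a class prompt to its few reference images
    # Returns (image, label) pairs mixing real images with SeedSelect generations
    dataset = []
    for label, (prompt, references) in enumerate(few_shot_data.items()):
        # keep the real few-shot images
        dataset += [(img, label) for img in references]
        # add synthetic images generated by SeedSelect from the same references
        synthetic = seed_select_generate(prompt, references, num_images=images_per_class)  # hypothetical wrapper
        dataset += [(img, label) for img in synthetic]
    return dataset

A standard classifier trained on this mixed set is then evaluated on the few-shot benchmark as usual.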

Main Contributions

The main contributions of this paper are:

  1. Characterizing and quantifying the failure of text-to-image diffusion models to generate images of rare concepts that have insufficient samples in the training data.

  2. Proposing the hypothesis that this failure stems from the input noise space - only small regions may be “in-distribution” for rare concepts during training.

  3. Introducing the learning setup of generating rare concepts using a reference set and pre-trained diffusion model.

  4. Describing SeedSelect, a novel method to find suitable noise seeds that improve generation of uncommon concepts without finetuning the diffusion model. This is done by optimizing for semantic and visual similarity to reference images.

  5. Demonstrating SeedSelect’s effectiveness in generating rare objects faithfully, with superior quality over baselines. It works for natural images and fine-grained datasets.

  6. Showing the value of SeedSelect for semantic augmentation, achieving state-of-the-art few-shot recognition by generating high-quality and diverse augmentations from just 1 image per class.

  7. Applying SeedSelect to improve generation of challenging concepts like hand palms.

  8. Providing an efficient bootstrapping technique to accelerate image generation using SeedSelect.

In summary, the key contributions are providing insights into failures of diffusion models on rare concepts, introducing the SeedSelect approach to address this issue, and demonstrating its benefits for generating uncommon concepts and few-shot semantic augmentation.

Method Section

Here is a summary of the method section from the paper:

The key idea is SeedSelect - using a few reference images to find a good noise seed that generates images of the correct rare concept.

The generator is a pre-trained text-to-image diffusion model like Stable Diffusion. The inputs are a noise seed z_T and a text prompt y.

Given k reference images I1, ..., Ik of a rare concept y:

  1. Optimization is performed to find a noise seed z_T^G that, when the model is prompted with y, generates an image I^G similar to the references.

  2. Two losses measure this consistency:

  • Semantic consistency: I1, ..., Ik and I^G are encoded into CLIP space; the distance between the centroid of the reference encodings and the encoding of I^G is minimized.

  • Natural appearance consistency: the MSE between I^G's diffusion VAE encoding and the mean VAE encoding of I1, ..., Ik is minimized.

  3. The total loss is a weighted sum of these two losses. Only z_T^G is optimized, by backpropagation through the diffusion model.

  4. Once the optimal z_T^G is found, sampling the diffusion model with it and prompt y generates the image.

  5. To generate multiple diverse images, z_T^G can be re-optimized with bootstrapped subsets of the references.

In summary, SeedSelect finds an optimal noise seed for the pre-trained diffusion model that makes it generate the desired rare concept, guided by a few reference images. This is done via semantic and visual similarity losses, without any fine-tuning.
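
As a concrete illustration, below is a minimal PyTorch sketch of the two consistency losses, operating on pre-computed CLIP embeddings and VAE latents. The tensor shapes, helper names, and default weighting are illustrative assumptions, not the paper's exact implementation:

import torch
import torch.nn.functional as F

def semantic_loss(gen_clip, ref_clips):
    # gen_clip: (D,) CLIP embedding of the generated image
    # ref_clips: (k, D) CLIP embeddings of the reference images
    centroid = ref_clips.mean(dim=0)
    return torch.norm(gen_clip - centroid, p=2)

def appearance_loss(gen_latent, ref_latents):
    # gen_latent: (C, H, W) diffusion-VAE latent of the generated image
    # ref_latents: (k, C, H, W) VAE latents of the reference images
    return F.mse_loss(gen_latent, ref_latents.mean(dim=0))

def seed_select_loss(gen_clip, ref_clips, gen_latent, ref_latents, lam=0.8):
    # Weighted sum of the two consistency terms; lam weights the semantic loss
    return lam * semantic_loss(gen_clip, ref_clips) + (1 - lam) * appearance_loss(gen_latent, ref_latents)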

High-Level Pseudo Code

Here is high-level pseudocode for the SeedSelect method proposed in the paper:

# Inputs
diffusion_model = PretrainedDiffusionModel() 
prompt = "rare concept y"  
reference_images = [I1, I2, ..., Ik] # k reference images of concept y
 
# Optimization 
init_noise = SampleRandomNoise() # z_T^G
for iters in range(max_iters):
  
  generated_image = diffusion_model(init_noise, prompt) 
  
  # Losses
  semantic_loss = Distance(Clip(generated_image), 
                           Centroid(Clip(reference_images)))  
  appearance_loss = MSE(DiffusionVAE(generated_image),
                        Mean(DiffusionVAE(reference_images)))
  
  total_loss = w1*semantic_loss + w2*appearance_loss 
  
  # Update noise seed
  init_noise = Optimize(init_noise, total_loss) 
 
# Generate final image
final_noise = init_noise
generated_image = diffusion_model(final_noise, prompt)
# Additional diverse images are obtained by re-optimizing the seed on
# bootstrapped subsets of the reference images

This optimizes the initially random noise seed to minimize the semantic and appearance losses relative to the reference images. The optimized seed is then used to generate an image of the rare concept; further diverse images are obtained by re-optimizing it with bootstrapped subsets of the references.

Detailed Pseudo Code

Here is more detailed pseudocode to implement the SeedSelect method:

import torch
import torch.nn as nn

# Inputs
diffusion_model = PretrainedDiffusionModel()   # frozen, pre-trained (e.g. Stable Diffusion)
clip_model = CLIPModel()                       # frozen CLIP image encoder
prompt = "rare concept y"
reference_images = [I1, I2, ..., Ik] # k reference images of concept y
num_generate = 8                     # number of images to generate (example value)

# Hyperparameters
num_iters = 500  # optimization iterations
lr = 1e-2        # learning rate
lam = 0.8        # weight of the semantic loss ("lambda" is a reserved word in Python)
mse_loss = nn.MSELoss()  # appearance loss on VAE latents
H = W = 512      # output resolution (Stable Diffusion default)

# Reference encodings are fixed, so compute them once outside the loop
ref_clip = torch.stack([clip_model.encode(Ii) for Ii in reference_images])
ref_latents = torch.stack([diffusion_model.vae.encode(Ii) for Ii in reference_images])
ref_clip_centroid = ref_clip.mean(dim=0)
ref_latent_mean = ref_latents.mean(dim=0)

# Optimization: only the noise seed is trainable; the diffusion model stays frozen
init_noise = torch.randn([1, 4, H // 8, W // 8], requires_grad=True)  # latent noise z_T^G
optimizer = torch.optim.Adam([init_noise], lr=lr)

for i in range(num_iters):

  # Forward pass: differentiable sampling from the current seed
  generated_image = diffusion_model(init_noise, prompt)

  # Semantic consistency: distance to the CLIP centroid of the references
  gen_encoding = clip_model.encode(generated_image)
  semantic_loss = torch.dist(gen_encoding, ref_clip_centroid)

  # Natural appearance consistency: MSE to the mean VAE latent of the references
  gen_latent = diffusion_model.vae.encode(generated_image)
  appearance_loss = mse_loss(gen_latent, ref_latent_mean)

  total_loss = lam*semantic_loss + (1 - lam)*appearance_loss

  # Update the noise seed by backpropagating through the frozen model
  optimizer.zero_grad()
  total_loss.backward()
  optimizer.step()

# Generate an image from the optimized seed
final_noise = init_noise.detach()
generated_image = diffusion_model(final_noise, prompt)
# Generating num_generate distinct images requires re-optimizing the seed on
# bootstrapped subsets of the references (see the sketch below)
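
The bootstrapping step for producing multiple diverse images is not spelled out above. Below is a minimal sketch of how it could wrap the optimization loop; seed_select is a hypothetical function encapsulating that loop, and warm-starting each run from the previous seed is one reading of the paper's acceleration trick, not its exact procedure:

import random

def generate_diverse_images(diffusion_model, prompt, reference_images, num_generate, subset_size=3):
    # Generate num_generate images by re-running SeedSelect on bootstrapped
    # reference subsets, warm-starting each run from the previous seed
    images = []
    seed = None  # None means the first run starts from random noise (assumed behaviour of seed_select)
    for _ in range(num_generate):
        subset = random.sample(reference_images, k=min(subset_size, len(reference_images)))
        # seed_select is assumed to wrap the optimization loop above and to
        # accept an optional initial seed for faster convergence
        seed = seed_select(diffusion_model, prompt, subset, init_noise=seed)
        images.append(diffusion_model(seed, prompt))
    return images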

Let me know if any part needs more clarification!