HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, Kfir Aberman
Abstract: Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles while retaining high fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model requires considerable GPU time, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth, a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth can generate a person's face in various contexts and styles, with high subject detail, while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual Inversion, using as few as one reference image, with the same quality and style diversity as DreamBooth. Our method also yields a model that is 10000x smaller than a normal DreamBooth model. Project page: hyperdreambooth.github.io
What, Why and How
Here is a summary of the key points from the paper:
What: The paper proposes a new method called HyperDreamBooth for fast and lightweight personalization of text-to-image diffusion models such as Stable Diffusion. The goal is to generate high-quality, diverse images of a specific person's face in different styles, while preserving the details of that person's identity.
Why: Current personalization methods such as DreamBooth require finetuning the entire diffusion model, which is slow and computationally expensive; storing a personalized model for each subject also takes substantial memory. The paper aims to overcome these limitations.
How:
- Proposes Lightweight DreamBooth (LiDB), which shrinks the personalized weights to roughly 100KB using a low-dimensional weight space.
- Trains a HyperNetwork to predict LiDB weights for a given face image. The HyperNetwork uses a ViT encoder and a Transformer decoder.
- Initializes the personalized weights with the HyperNetwork prediction, then applies a fast finetuning step, called rank-relaxed finetuning, to refine details.
- Achieves a 25x speedup over DreamBooth with similar-quality results using just a single image, and the personalized weights are 10000x smaller than a full DreamBooth model.
In summary, the paper presents a novel HyperNetwork-based approach to efficiently personalize text-to-image diffusion models by predicting a small set of weights from a single image. This enables fast, high-quality personalization while preserving the model's prior knowledge and editing capabilities.
Main Contributions
Here are the key contributions of the paper:
- Proposes Lightweight DreamBooth (LiDB) to create a low-dimensional weight space for personalization that is only about 100KB in size, 10000x smaller than a full DreamBooth model.
- Develops a novel HyperNetwork architecture, with a ViT encoder and a Transformer decoder, that predicts the LiDB weights for personalizing a diffusion model from a single image.
- Introduces a fast finetuning technique, rank-relaxed finetuning, that relaxes the rank constraint on the LiDB weights to refine details and achieve higher fidelity to the subject.
- Achieves a 25x speedup for personalization compared to DreamBooth, with only about 20 seconds of finetuning needed after the HyperNetwork prediction.
- Shows that the method generates high-quality, diverse images of a face while preserving identity, with style coverage and editability similar to DreamBooth.
- Demonstrates strong results using just a single reference image of a person's face.
- Performs extensive comparisons and ablation studies demonstrating the efficacy of the proposed techniques.
In summary, the core novelties are the lightweight personalized weight space, the HyperNetwork architecture for fast prediction, and the rank-relaxed finetuning technique to amplify fidelity. Together these allow highly efficient personalization of text-to-image models.
Method Section
Here is a summary of the method section from the paper:
The method consists of 3 core components:
- Lightweight DreamBooth (LiDB): further decomposes each of the two low-rank matrices from LoRA DreamBooth using a random orthogonal incomplete basis. This creates a very low-dimensional personalization weight space that is only about 100KB in size (a minimal sketch follows this list).
- HyperNetwork Architecture: uses a ViT encoder to extract features from an input face image. These are concatenated with latent layer-weight features and fed into a Transformer decoder that iteratively predicts the LiDB weight values. The predicted weights are added to the diffusion model to personalize it.
- Rank-Relaxed Fast Finetuning: first predicts weights using the HyperNetwork, then relaxes the rank constraint on the LiDB model before fast finetuning with the diffusion loss. This expands the model's capacity to capture fine details of the subject's face and identity.
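To make the LiDB decomposition concrete, here is a minimal PyTorch sketch of a single LiDB layer. This is a sketch under assumptions: the class name LiDBLinear, the helper random_orthogonal, the aux dimensions a_dim and b_dim, and the initialization scales are illustrative placeholders rather than the paper's code.

import torch
import torch.nn as nn

def random_orthogonal(rows: int, cols: int) -> torch.Tensor:
    # Random matrix whose smaller dimension is orthonormal; frozen after init.
    m, n = max(rows, cols), min(rows, cols)
    q, _ = torch.linalg.qr(torch.randn(m, n))  # q: (m, n), orthonormal columns
    return q if rows >= cols else q.T

class LiDBLinear(nn.Module):
    # One LoRA-style layer in the LiDB weight space: the usual rank-r
    # residual dW = A @ B is further decomposed as
    #     dW = A_aux @ A_train @ B_train @ B_aux
    # where A_aux and B_aux are frozen random orthogonal (incomplete) bases,
    # so only the tiny A_train and B_train matrices are trained and stored.
    def __init__(self, base: nn.Linear, rank: int = 1,
                 a_dim: int = 100, b_dim: int = 50):
        super().__init__()
        self.base = base  # frozen pretrained layer
        out_f, in_f = base.out_features, base.in_features
        # Frozen aux bases (regenerable from a fixed random seed, not stored).
        self.register_buffer("A_aux", random_orthogonal(out_f, a_dim))
        self.register_buffer("B_aux", random_orthogonal(b_dim, in_f))
        # Trainable personalized weights: essentially all that is saved
        # per subject (roughly 100KB across the whole model, per the paper).
        self.A_train = nn.Parameter(torch.zeros(a_dim, rank))
        self.B_train = nn.Parameter(0.01 * torch.randn(rank, b_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # dW has shape (out_f, in_f), same as the base weight matrix.
        delta = self.A_aux @ self.A_train @ self.B_train @ self.B_aux
        return self.base(x) + x @ delta.T

# Example: wrap one linear projection of a diffusion UNet (illustrative).
layer = LiDBLinear(nn.Linear(320, 320))
y = layer(torch.randn(2, 320))  # shape (2, 320)

Because the frozen aux bases can be regenerated from a fixed random seed, only A_train and B_train need to be stored per subject, which is what keeps the personalized weights so small.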
The hypernetwork is trained using a dataset of face images and their corresponding pre-computed LiDB weights. The loss combines a diffusion reconstruction loss with an L2 loss between the predicted and target weights, roughly as written below.
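In equation form, the training objective is roughly the following, where $H$ is the hypernetwork, $x$ a face image with conditioning prompt $c$, $\epsilon$ sampled noise, $\theta$ the pre-computed LiDB weights for $x$, $D_{\hat\theta}$ the diffusion model augmented with the predicted weights, and $\alpha, \beta$ weighting hyperparameters (this formulation is reconstructed from the description above, not quoted from the paper):

\[
\mathcal{L}(x) \;=\; \alpha\,\big\lVert D_{\hat\theta}(x+\epsilon,\, c) - x \big\rVert_2^2 \;+\; \beta\,\big\lVert \hat\theta - \theta \big\rVert_2^2, \qquad \hat\theta = H(x)
\]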
At inference time, the hypernetwork predicts an initial set of weights from the input face image. These weights are added to the diffusion model and then refined using rank-relaxed fast finetuning for only 40 iterations, yielding a 25x speedup over DreamBooth; a sketch of this procedure follows.
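Here is a minimal sketch of that inference-time procedure, assuming a LoRA-style implementation; apply_lora_with_rank, make_noisy_batch, and the default hyperparameter values are hypothetical placeholders, not the paper's API.

import torch
import torch.nn.functional as F

def personalize(hypernet, diffusion_model, face_image, prompt_emb,
                relaxed_rank=4, num_iters=40, lr=1e-4):
    # 1) One forward pass of the hypernetwork gives the initial weights.
    with torch.no_grad():
        lidb_weights = hypernet(face_image)

    # 2) Compose them into the frozen diffusion model, but allocate a
    #    higher LoRA rank than was predicted ("rank relaxation"): the
    #    predicted weights fill the first rank component, and the extra
    #    components start at zero, free to capture fine identity details.
    lora_params = apply_lora_with_rank(diffusion_model, lidb_weights,
                                       rank=relaxed_rank)  # hypothetical helper

    # 3) Fast finetuning: optimize only the LoRA parameters with the usual
    #    diffusion denoising loss (~40 steps, ~20 seconds in the paper);
    #    the pretrained weights stay frozen throughout.
    opt = torch.optim.Adam(lora_params, lr=lr)
    for _ in range(num_iters):
        noisy, noise, t = make_noisy_batch(face_image)  # hypothetical helper
        pred = diffusion_model(noisy, t, prompt_emb)
        loss = F.mse_loss(pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return diffusion_model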
In summary, the lightweight weight decomposition, hypernetwork prediction, and rank relaxation together enable highly efficient personalization of text-to-image diffusion models from just a single image.
High-Level Pseudo Code
Here is the high-level pseudo code for the method proposed in the paper:
# HyperNetwork Training
for image in training_data:
    # Get precomputed LiDB weights for this image
    lidb_weights = get_precomputed_weights(image)

    # Encode the face image with the ViT encoder
    image_features = encoder(image)

    # Iterative weight prediction (weights start at zero)
    predicted_weights = initialize_to_zero()
    for i in range(num_iterations):
        # The decoder refines the current weight estimate each iteration
        predicted_weights = decoder(image_features, predicted_weights)

    # Loss: diffusion reconstruction + L2 to the precomputed weights
    loss = (diffusion_reconstruction_loss(predicted_weights)
            + L2_loss(predicted_weights, lidb_weights))

    # Update the HyperNetwork
    update(loss)

# Inference
for image in test_data:
    # Predict personalized weights from a single face image
    predicted_weights = hypernetwork(image)

    # Compose the predicted weights into the diffusion model
    diffusion_model.add_weights(predicted_weights)

    # Rank-relaxed fast finetuning (~40 iterations)
    for i in range(finetune_iterations):
        loss = diffusion_reconstruction_loss(diffusion_model)
        update(loss)  # finetune

    # Generate a personalized image
    output = diffusion_model(prompt)
The key steps are training the hypernetwork on image-weight pairs, predicting weights at test time, adding them to the diffusion model, and finetuning briefly with rank relaxation to refine details. This achieves fast, high-quality personalization.
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the HyperDreamBooth method:
# HyperNetwork Architecture
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # ViT encoder
    def forward(self, img):
        return vit_encode(img)

class Decoder(nn.Module):
    # Transformer decoder
    def forward(self, img_feats, weights):
        # Add positional encodings to the weight tokens
        x = positional_encoding(weights)
        x = transformer_decoder(img_feats, x)
        return x

class HyperNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.out = nn.Linear(hidden_dim, hidden_dim)  # output layer

    def forward(self, img):
        img_feats = self.encoder(img)
        weights = torch.zeros(num_layers, hidden_dim)  # initialize to zero
        for i in range(num_iterations):
            # Iteratively refine the weight estimate
            weights = self.decoder(img_feats, weights)
        return self.out(weights)

# HyperNetwork Training

# Compute target LiDB weights by finetuning the model on the image
def compute_lidb_weights(model, img):
    # Finetune model on the image using LiDB (details omitted)
    return model.lidb_weights

for img, label in dataloader:
    # Target weights
    lidb_weights = compute_lidb_weights(model, img)

    # HyperNetwork forward pass
    predicted_weights = hypernet(img)

    # Losses: diffusion reconstruction + L2 between predicted and target weights
    recon_loss = diffusion_reconstruction_loss(predicted_weights)
    l2_loss = F.mse_loss(predicted_weights, lidb_weights)
    loss = recon_loss + l2_loss

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference
img = load_test_image()
with torch.no_grad():
    predicted_weights = hypernet(img)

# Compose the predicted weights into the diffusion model
diffusion_model.add_weights(predicted_weights)

# Rank-relaxed fast finetuning
diffusion_model.increase_rank()
for i in range(num_finetune_iters):
    recon_loss = diffusion_reconstruction_loss(diffusion_model)
    optimizer.zero_grad()
    recon_loss.backward()
    optimizer.step()

# Generate image
img = diffusion_model(prompt)
This shows one way to implement the training loop, network architectures, and inference steps for HyperDreamBooth in more detail.