HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, Kfir Aberman
Abstract: Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles while retaining high fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model requires considerable GPU time, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth, a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth can generate a person's face in various contexts and styles, with high subject detail, while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual Inversion, using as few as one reference image, with the same quality and style diversity as DreamBooth. Our method also yields a model that is 10000x smaller than a normal DreamBooth model. Project page: hyperdreambooth.github.io
What, Why and How
Here is a summary of the key points from the paper:
What: The paper proposes a new method called HyperDreamBooth for fast and lightweight personalization of text-to-image diffusion models such as Stable Diffusion. The goal is to generate high-quality, diverse images of a specific person's face in different styles, while preserving the details of that person's identity.
Why: Current personalization methods such as DreamBooth require finetuning the entire diffusion model, which is slow and computationally expensive; storing a personalized model for each subject also takes substantial memory. The paper aims to overcome these limitations.
How:
- Proposes Lightweight DreamBooth (LiDB), which shrinks the personalized weights to roughly 100KB using a low-dimensional weight space.
- Trains a HyperNetwork to predict LiDB weights for a given face image. The HyperNetwork uses a ViT encoder and a Transformer decoder.
- Initializes the personalized weights with the HyperNetwork prediction, then applies a fast finetuning step, called rank-relaxed finetuning, to refine details.
- Achieves a 25x speedup over DreamBooth with similar-quality results using just a single image, and the personalized weights are 10000x smaller than a full DreamBooth model.
In summary, the paper presents a novel HyperNetwork-based approach to efficiently personalize text-to-image diffusion models by predicting a small set of weights from a single image. This enables fast, high-quality personalization while preserving the model's prior knowledge and editing capabilities.
Main Contributions
Here are the key contributions of the paper:
- Proposes Lightweight DreamBooth (LiDB) to create a low-dimensional weight space for personalization that is only about 100KB in size, 10000x smaller than a full DreamBooth model.
- Develops a novel HyperNetwork architecture, with a ViT encoder and a Transformer decoder, that predicts the LiDB weights for personalizing a diffusion model from a single image.
- Introduces a fast finetuning technique, rank-relaxed finetuning, that relaxes the rank constraint on the LiDB weights to refine details and achieve higher fidelity to the subject.
- Achieves a 25x speedup for personalization compared to DreamBooth, with only about 20 seconds of finetuning needed after the HyperNetwork prediction.
- Shows that the method generates high-quality, diverse images of a face while preserving identity, with style coverage and editability similar to DreamBooth.
- Demonstrates strong results using just a single reference image of a person's face.
- Performs extensive comparisons and ablation studies demonstrating the efficacy of the proposed techniques.
In summary, the core novelties are the lightweight personalized weight space, the HyperNetwork architecture for fast prediction, and the rank-relaxed finetuning technique to amplify fidelity. Together these allow highly efficient personalization of text-to-image models.
Method Section
Here is a summary of the method section from the paper:
The method consists of 3 core components:
- Lightweight DreamBooth (LiDB): further decomposes each of the two low-rank matrices from LoRA DreamBooth using a random orthogonal incomplete basis. This creates a very low-dimensional personalization weight space that is only about 100KB in size (a minimal sketch follows this list).
- HyperNetwork Architecture: uses a ViT encoder to extract features from an input face image. These are concatenated with latent layer-weight features and fed into a Transformer decoder that iteratively predicts the LiDB weight values. The predicted weights are added to the diffusion model to personalize it.
- Rank-Relaxed Fast Finetuning: first predicts weights using the HyperNetwork, then relaxes the rank constraint on the LiDB model before fast finetuning with the diffusion loss. This expands the model's capacity to capture fine details of the subject's face and identity.
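To make the LiDB decomposition concrete, here is a minimal PyTorch sketch of a single LiDB layer. This is a sketch under assumptions: the class name LiDBLinear, the helper random_orthogonal, the aux dimensions a_dim and b_dim, and the initialization scales are illustrative placeholders rather than the paper's code.

import torch
import torch.nn as nn

def random_orthogonal(rows: int, cols: int) -> torch.Tensor:
    # Random matrix whose smaller dimension is orthonormal; frozen after init.
    m, n = max(rows, cols), min(rows, cols)
    q, _ = torch.linalg.qr(torch.randn(m, n))  # q: (m, n), orthonormal columns
    return q if rows >= cols else q.T

class LiDBLinear(nn.Module):
    # One LoRA-style layer in the LiDB weight space: the usual rank-r
    # residual dW = A @ B is further decomposed as
    #     dW = A_aux @ A_train @ B_train @ B_aux
    # where A_aux and B_aux are frozen random orthogonal (incomplete) bases,
    # so only the tiny A_train and B_train matrices are trained and stored.
    def __init__(self, base: nn.Linear, rank: int = 1,
                 a_dim: int = 100, b_dim: int = 50):
        super().__init__()
        self.base = base  # frozen pretrained layer
        out_f, in_f = base.out_features, base.in_features
        # Frozen aux bases (regenerable from a fixed random seed, not stored).
        self.register_buffer("A_aux", random_orthogonal(out_f, a_dim))
        self.register_buffer("B_aux", random_orthogonal(b_dim, in_f))
        # Trainable personalized weights: essentially all that is saved
        # per subject (roughly 100KB across the whole model, per the paper).
        self.A_train = nn.Parameter(torch.zeros(a_dim, rank))
        self.B_train = nn.Parameter(0.01 * torch.randn(rank, b_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # dW has shape (out_f, in_f), same as the base weight matrix.
        delta = self.A_aux @ self.A_train @ self.B_train @ self.B_aux
        return self.base(x) + x @ delta.T

# Example: wrap one linear projection of a diffusion UNet (illustrative).
layer = LiDBLinear(nn.Linear(320, 320))
y = layer(torch.randn(2, 320))  # shape (2, 320)

Because the frozen aux bases can be regenerated from a fixed random seed, only A_train and B_train need to be stored per subject, which is what keeps the personalized weights so small.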
The hypernetwork is trained using a dataset of face images and their corresponding pre-computed LiDB weights. The loss combines a diffusion reconstruction loss with an L2 loss between the predicted and target weights, roughly as written below.
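In equation form, the training objective is roughly the following, where $H$ is the hypernetwork, $x$ a face image with conditioning prompt $c$, $\epsilon$ sampled noise, $\theta$ the pre-computed LiDB weights for $x$, $D_{\hat\theta}$ the diffusion model augmented with the predicted weights, and $\alpha, \beta$ weighting hyperparameters (this formulation is reconstructed from the description above, not quoted from the paper):

\[
\mathcal{L}(x) \;=\; \alpha\,\big\lVert D_{\hat\theta}(x+\epsilon,\, c) - x \big\rVert_2^2 \;+\; \beta\,\big\lVert \hat\theta - \theta \big\rVert_2^2, \qquad \hat\theta = H(x)
\]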
At inference time, the hypernetwork predicts an initial set of weights from the input face image. These weights are added to the diffusion model and then refined using rank-relaxed fast finetuning for only 40 iterations, yielding a 25x speedup over DreamBooth; a sketch of this procedure follows.
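Here is a minimal sketch of that inference-time procedure, assuming a LoRA-style implementation; apply_lora_with_rank, make_noisy_batch, and the default hyperparameter values are hypothetical placeholders, not the paper's API.

import torch
import torch.nn.functional as F

def personalize(hypernet, diffusion_model, face_image, prompt_emb,
                relaxed_rank=4, num_iters=40, lr=1e-4):
    # 1) One forward pass of the hypernetwork gives the initial weights.
    with torch.no_grad():
        lidb_weights = hypernet(face_image)

    # 2) Compose them into the frozen diffusion model, but allocate a
    #    higher LoRA rank than was predicted ("rank relaxation"): the
    #    predicted weights fill the first rank component, and the extra
    #    components start at zero, free to capture fine identity details.
    lora_params = apply_lora_with_rank(diffusion_model, lidb_weights,
                                       rank=relaxed_rank)  # hypothetical helper

    # 3) Fast finetuning: optimize only the LoRA parameters with the usual
    #    diffusion denoising loss (~40 steps, ~20 seconds in the paper);
    #    the pretrained weights stay frozen throughout.
    opt = torch.optim.Adam(lora_params, lr=lr)
    for _ in range(num_iters):
        noisy, noise, t = make_noisy_batch(face_image)  # hypothetical helper
        pred = diffusion_model(noisy, t, prompt_emb)
        loss = F.mse_loss(pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return diffusion_model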
In summary, the lightweight weight decomposition, hypernetwork prediction, and rank relaxation together enable highly efficient personalization of text-to-image diffusion models from just a single image.
High-Level Pseudo Code
Here is the high-level pseudo code for the method proposed in the paper:
# HyperNetwork Training
for image in training_data:
    # Get precomputed LiDB weights for this image
    lidb_weights = get_precomputed_weights(image)

    # Encode the face image with the ViT encoder
    image_features = encoder(image)

    # Iterative weight prediction (weights start at zero)
    predicted_weights = initialize_to_zero()
    for i in range(num_iterations):
        # The decoder refines the current weight estimate each iteration
        predicted_weights = decoder(image_features, predicted_weights)

    # Loss: diffusion reconstruction + L2 to the precomputed weights
    loss = (diffusion_reconstruction_loss(predicted_weights)
            + L2_loss(predicted_weights, lidb_weights))

    # Update the HyperNetwork
    update(loss)

# Inference
for image in test_data:
    # Predict personalized weights from a single face image
    predicted_weights = hypernetwork(image)

    # Compose the predicted weights into the diffusion model
    diffusion_model.add_weights(predicted_weights)

    # Rank-relaxed fast finetuning (~40 iterations)
    for i in range(finetune_iterations):
        loss = diffusion_reconstruction_loss(diffusion_model)
        update(loss)  # finetune

    # Generate a personalized image
    output = diffusion_model(prompt)
The key steps are training the hypernetwork on image-weight pairs, predicting weights at test time, adding them to the diffusion model, and finetuning briefly with rank relaxation to refine details. This achieves fast, high-quality personalization.
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the HyperDreamBooth method:
# HyperNetwork Architecture
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # ViT encoder
    def forward(self, img):
        return vit_encode(img)

class Decoder(nn.Module):
    # Transformer decoder
    def forward(self, img_feats, weights):
        # Add positional encodings to the weight tokens
        x = positional_encoding(weights)
        x = transformer_decoder(img_feats, x)
        return x

class HyperNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.out = nn.Linear(hidden_dim, hidden_dim)  # output layer

    def forward(self, img):
        img_feats = self.encoder(img)
        weights = torch.zeros(num_layers, hidden_dim)  # initialize to zero
        for i in range(num_iterations):
            # Iteratively refine the weight estimate
            weights = self.decoder(img_feats, weights)
        return self.out(weights)

# HyperNetwork Training

# Compute target LiDB weights by finetuning the model on the image
def compute_lidb_weights(model, img):
    # Finetune model on the image using LiDB (details omitted)
    return model.lidb_weights

for img, label in dataloader:
    # Target weights
    lidb_weights = compute_lidb_weights(model, img)

    # HyperNetwork forward pass
    predicted_weights = hypernet(img)

    # Losses: diffusion reconstruction + L2 between predicted and target weights
    recon_loss = diffusion_reconstruction_loss(predicted_weights)
    l2_loss = F.mse_loss(predicted_weights, lidb_weights)
    loss = recon_loss + l2_loss

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference
img = load_test_image()
with torch.no_grad():
    predicted_weights = hypernet(img)

# Compose the predicted weights into the diffusion model
diffusion_model.add_weights(predicted_weights)

# Rank-relaxed fast finetuning
diffusion_model.increase_rank()
for i in range(num_finetune_iters):
    recon_loss = diffusion_reconstruction_loss(diffusion_model)
    optimizer.zero_grad()
    recon_loss.backward()
    optimizer.step()

# Generate image
img = diffusion_model(prompt)
This shows one way to implement the training loop, network architectures, and inference steps for HyperDreamBooth in more detail.