MVDream: Multi-view Diffusion for 3D Generation

Authors: Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang

Abstract: We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few-shot setting for personalized 3D generation, i.e. the DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper proposes MVDream, a multi-view diffusion model that can generate geometrically consistent multi-view images from a text prompt.

Why:

  • Existing 2D-lifting methods for 3D generation, which apply score distillation to 2D diffusion models, suffer from multi-view consistency issues such as the multi-face Janus problem.
  • Using a multi-view diffusion model as prior provides more stable 3D supervision and can help address these consistency problems.

How:

  • Fine-tune a pre-trained 2D diffusion model (e.g. Stable Diffusion) on multi-view images rendered from a 3D dataset, replacing its 2D self-attention with 3D self-attention across views and adding camera embeddings.
  • Apply the resulting multi-view diffusion model to 3D generation via multi-view score distillation.
  • Fine-tune it further for few-shot personalization (MVDreamBooth) while maintaining multi-view consistency.

In summary, the paper proposes training a multi-view version of diffusion models to obtain a 3D consistent prior for generating novel 3D content in a robust way, overcoming limitations of using single 2D models.

Main Contributions

Here are the main contributions of this paper:

  • Proposes the first multi-view diffusion model that can generate consistent multi-view images for an input text prompt.

  • Demonstrates how to train such a model by fine-tuning a 2D diffusion model on multi-view rendered data, using modifications like 3D self-attention and camera embeddings.

  • Shows the multi-view diffusion model can serve as a robust 3D-consistent prior for generating novel 3D content via multi-view score distillation.

  • Achieves state-of-the-art 3D generation results with fewer artifacts compared to prior 2D-lifting methods that suffer from consistency issues.

  • Extends the framework to few-shot personalization with MVDreamBooth while maintaining multi-view consistency.

  • Provides comprehensive analysis and experiments on multi-view image generation, 3D generation, and few-shot fine-tuning capabilities.

In summary, the main contributions are proposing and demonstrating a multi-view diffusion framework that can generate consistent multi-view images and serve as a robust prior for 3D generation, overcoming limitations of single 2D models.

Method Section

Here is a summary of the method section from the paper:

  • The overall goal is to train a multi-view diffusion model that can generate consistent multi-view images from a text prompt.

  • To maintain the generalizability of 2D diffusion models, the architecture largely follows that of image diffusion models, with two main changes (illustrated in the sketch at the end of this section):

  1. Replace the 2D self-attention block with a 3D self-attention that connects tokens across all views. This is the key change for learning cross-view consistency.

  2. Add camera embeddings to the time embeddings so that the model can distinguish between views.

  • The model is trained on a mixture of a 3D rendered dataset (for consistency) and large-scale image datasets (for generalizability).

  • For 3D generation, the multi-view diffusion model serves as a prior and guides the optimization of a 3D representation like NeRF using multi-view score distillation.

  • For few-shot personalization, the pre-trained model is fine-tuned on identity images to get a multi-view DreamBooth model, with an additional parameter regularization loss.

  • The 3D optimization then proceeds exactly as before, using this MVDreamBooth model as the prior.

In summary, the key ideas are modifying a 2D diffusion model architecture minimally to handle multi-view data, training it on mixed datasets, and then utilizing it as a robust 3D-consistent prior for generation via score distillation.
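To make these two changes concrete, here is a minimal PyTorch-style sketch of the cross-view self-attention and the camera-conditioned time embedding. The class names, tensor shapes, embedding widths, and the flattened 4x4 extrinsic camera format are illustrative assumptions rather than the paper's exact implementation.

import torch
import torch.nn as nn

class CrossViewSelfAttention(nn.Module):
    # Sketch: a 2D self-attention block "inflated" so that tokens attend across all
    # views of the same scene. Assumes views of one scene are packed consecutively
    # in the batch, i.e. the input has shape (B * V, N, C) with V views and N
    # spatial tokens per view.

    def __init__(self, dim, num_heads=8, num_views=4):
        super().__init__()
        self.num_views = num_views
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        BV, N, C = x.shape
        V = self.num_views
        x = x.reshape(BV // V, V * N, C)   # (B*V, N, C) -> (B, V*N, C): merge views into tokens
        x, _ = self.attn(x, x, x)          # every token attends to the tokens of all V views
        return x.reshape(BV, N, C)         # restore the per-view layout

class CameraTimeEmbedding(nn.Module):
    # Sketch: embed each view's camera pose with a small MLP and add it to the
    # time embedding (the widths and the flattened-extrinsic input are assumptions).

    def __init__(self, camera_dim=16, embed_dim=1280):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(camera_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, time_embeds, cameras):
        # time_embeds: (B*V, embed_dim), cameras: (B*V, camera_dim)
        return time_embeds + self.mlp(cameras)

# usage sketch
attn = CrossViewSelfAttention(dim=320, num_heads=8, num_views=4)
tokens = torch.randn(4, 32 * 32, 320)      # 1 scene x 4 views, 32x32 tokens per view
fused = attn(tokens)                       # same shape; views have exchanged information

Because only the token layout changes, the attention weights of the pre-trained 2D model can be reused directly, which is what allows fine-tuning to start from Stable Diffusion.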

High-Level Pseudo Code

Here is the high-level pseudo code for the key steps in the paper:

# Train Multi-View Diffusion Model
model = ImageDiffusionModel()             # start from a pre-trained 2D diffusion model
model.attention = MultiViewAttention()    # replace 2D self-attention with 3D (cross-view) attention

rendered_data = get_rendered_3D_images()  # multi-view images rendered from 3D assets
real_data = get_real_image_dataset()      # large-scale 2D images for generalizability

for x_0, text, cameras in mix(rendered_data, real_data):

    # x_0: batch of (multi-view) images, cameras: corresponding camera poses
    noise = sample_noise()
    x_t = add_noise(x_0, noise)           # forward diffusion
    loss = MSE(model(x_t, text, cameras), noise)

    optimize(loss)                        # update model

# Generate 3D via Multi-View Score Distillation
text_prompt = "something interesting"
rep = NeRF()                              # initialize 3D representation (e.g. NeRF)

for it in range(iterations):

    cameras = sample_camera_poses()       # random multi-view cameras
    views = render(rep, cameras)          # render the current 3D representation
    noise = sample_noise()
    x_t = add_noise(views, noise)

    epsilons = model(x_t, text_prompt, cameras)
    loss = MSE(epsilons, noise)           # score distillation (diffusion model frozen)

    optimize(loss)                        # update the 3D representation only

# Fine-tune for Multi-View DreamBooth
identity_images = get_identity_images()

for image, text in identity_images:

    loss = diffusion_loss(image, text, model) + weight_regularization

    optimize(loss)                        # fine-tune the diffusion model

The key steps are training the multi-view diffusion model on mixed datasets, using it to guide 3D optimization via multi-view score distillation, and fine-tuning for personalization with DreamBooth.
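The loss = MSE(epsilons, noise) line in the distillation loop above is shorthand for Score Distillation Sampling with a frozen diffusion model. A common way to implement one such step is the stop-gradient "target" formulation sketched below; the timestep range, the weighting w = 1 - alpha_bar_t, and the helper signatures are assumptions rather than the paper's exact code.

import torch
import torch.nn.functional as F

def sds_step(nerf, model, text, cameras, alphas_cumprod, optimizer):
    # One multi-view SDS update. `nerf.render` returns differentiable images for
    # the given cameras; `model` is the frozen multi-view diffusion model.
    views = nerf.render(cameras)                       # (V, C, H, W), requires grad

    t = torch.randint(low=20, high=980, size=(1,))     # random diffusion timestep
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)          # \bar{alpha}_t
    noise = torch.randn_like(views)
    x_t = a_t.sqrt() * views + (1.0 - a_t).sqrt() * noise   # forward diffusion q(x_t | x_0)

    with torch.no_grad():                              # no gradients through the U-Net
        eps_pred = model(x_t, text, cameras, t)

    w = 1.0 - a_t                                      # a common SDS weighting choice
    grad = w * (eps_pred - noise)                      # treated as d(loss)/d(views)
    target = (views - grad).detach()                   # stop-gradient "target" trick
    loss = 0.5 * F.mse_loss(views, target, reduction="sum")

    optimizer.zero_grad()
    loss.backward()                                    # gradients reach only the 3D representation
    optimizer.step()
    return loss.item()

Detaching the target means the U-Net Jacobian is never back-propagated, which is exactly the approximation score distillation makes.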

Detailed Pseudo Code

Here is a more detailed pseudo code for implementing the key components of this paper:

# Multi-View Diffusion Model

import torch
import torch.nn as nn
import torch.nn.functional as F

class MVDiffusion(nn.Module):

    def __init__(self):
        super().__init__()
        # initialize the pre-trained 2D diffusion backbone (UNet, text encoder, etc.)
        # ...
        self.attention = MVAttention()       # 3D (cross-view) self-attention
        self.camera_mlp = CameraMLP()        # embeds camera poses (see sketch above)

    def forward(self, x_t, text, cameras, t):
        # encoder forward pass on the noisy (multi-view) latents x_t
        # ...

        # add camera embeddings to the time embeddings
        time_embeds = self.time_embedding(t) + self.camera_mlp(cameras)

        # decoder forward pass; self-attention connects all views
        features = self.attention(features)
        epsilon = ...                        # predicted noise
        return epsilon
 
# Loss and training loop

rendered_data = RenderedDataset()    # multi-view renderings with camera poses
real_data = ImageDataset()           # large-scale 2D images, for generalizability

model = MVDiffusion()
optim = torch.optim.Adam(model.parameters())

for epoch in range(num_epochs):

    for x_0, text, cameras in rendered_data:

        t = sample_timesteps()
        noise = torch.randn_like(x_0)
        x_t = add_noise(x_0, noise, t)       # forward diffusion q(x_t | x_0)

        epsilons = model(x_t, text, cameras, t)
        loss = F.mse_loss(epsilons, noise)

        optim.zero_grad()
        loss.backward()
        optim.step()

    # a fraction of batches is instead drawn from real_data (plain 2D images)
  
# Multi-View Score Distillation

nerf = NeRF()                            # initialize the 3D representation
optim = torch.optim.Adam(nerf.parameters())

for it in range(num_iters):

    cameras = sample_camera_poses()      # random multi-view cameras each iteration
    views = nerf.render(cameras)         # differentiable rendering

    t = sample_timesteps()
    noise = torch.randn_like(views)
    x_t = add_noise(views, noise, t)

    with torch.no_grad():                # the diffusion model stays frozen
        epsilons = model(x_t, text, cameras, t)

    # score distillation: (epsilons - noise) acts as the gradient on the rendered views
    target = (views - (epsilons - noise)).detach()
    loss = 0.5 * F.mse_loss(views, target, reduction="sum")

    optim.zero_grad()
    loss.backward()                      # gradients reach only the NeRF parameters
    optim.step()                         # update the NeRF
 
# MV DreamBooth Fine-tuning

num_params = sum(p.numel() for p in model.parameters())
optim = torch.optim.Adam(model.parameters())

for x_0, text in identity_data:          # a handful of 2D identity photos

    t = sample_timesteps()
    noise = torch.randn_like(x_0)
    x_t = add_noise(x_0, noise, t)

    # standard denoising loss on the identity images
    # (camera conditioning for these 2D photos is omitted here; see the paper)
    epsilons = model(x_t, text, cameras=None, t=t)
    loss = F.mse_loss(epsilons, noise)

    # parameter regularization term (see the paper for its exact form)
    weight_reg = sum(p.norm() for p in model.parameters()) / num_params
    loss = loss + reg_weight * weight_reg   # reg_weight: regularization coefficient

    optim.zero_grad()
    loss.backward()
    optim.step()

This sketches one way to implement the key training and generation steps in more detail. The main components are the modified diffusion model architecture, the mixed-dataset training loop, NeRF optimization via multi-view score distillation, and DreamBooth fine-tuning for personalization.
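Although the pseudo code above only uses the model inside score distillation, the same model can also be sampled directly to produce a set of consistent views from a text prompt. Below is a hedged sketch of a plain DDIM sampling loop for four views; the latent shape, step count, and the omission of classifier-free guidance are simplifying assumptions, not the paper's exact sampler.

import torch

@torch.no_grad()
def sample_multiview(model, text, cameras, alphas_cumprod, num_steps=50,
                     shape=(4, 4, 32, 32)):
    # Deterministic DDIM sampling of 4 view latents jointly, so the cross-view
    # attention keeps them consistent. Decode with the VAE afterwards.
    x = torch.randn(shape)                                 # one latent per view
    timesteps = torch.linspace(999, 0, num_steps).long()

    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)

        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps = model(x, text, cameras, t_batch)             # joint noise prediction for all views
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latents
        x = a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps  # DDIM update (eta = 0)

    return x0_pred

Sampling all views in one batch is what lets the cross-view attention keep them consistent; sampling them independently would reduce the model to ordinary per-view 2D generation.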