DreamFusion: Text-to-3D using 2D Diffusion

Authors: Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall

Abstract: Recent breakthroughs in text-to-image synthesis have been driven by diffusion models trained on billions of image-text pairs. Adapting this approach to 3D synthesis would require large-scale datasets of labeled 3D data and efficient architectures for denoising 3D data, neither of which currently exist. In this work, we circumvent these limitations by using a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis. We introduce a loss based on probability density distillation that enables the use of a 2D diffusion model as a prior for optimization of a parametric image generator. Using this loss in a DeepDream-like procedure, we optimize a randomly-initialized 3D model (a Neural Radiance Field, or NeRF) via gradient descent such that its 2D renderings from random angles achieve a low loss. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment. Our approach requires no 3D training data and no modifications to the image diffusion model, demonstrating the effectiveness of pretrained image diffusion models as priors.

What, Why and How

Here is a summary of the key points in the paper:

What:

  • The paper presents a method called DreamFusion for generating 3D objects from text descriptions using a pretrained 2D image diffusion model.

Why:

  • Generating detailed 3D assets from text would be useful for creating virtual environments and accelerating 3D content creation.
  • Existing text-to-3D methods rely on scarce 3D training data, while diffusion models achieve high-quality image synthesis only because they are trained on billions of image-text pairs.
  • This work aims to transfer knowledge from 2D image diffusion models to 3D without needing 3D data.

How:

  • They introduce score distillation sampling, a loss function that distills a diffusion model into a critic for optimizing images.
  • They optimize a Neural Radiance Field (NeRF) to render 2D views that minimize this distillation loss from the frozen Imagen diffusion model.
  • The NeRF is rendered from random viewpoints under random illumination to reveal geometric detail, and the text prompt is augmented with view-dependent phrases (e.g., "front view").
  • This end-to-end differentiable process allows Imagen to guide the geometry learning without any 3D data.
  • Experiments show the method produces more coherent 3D objects than prior baselines, particularly when geometric consistency is evaluated.

In summary, the key idea is using a pretrained 2D image diffusion model as a proxy for optimizing a 3D scene representation to match a text description, circumventing the need for scarce 3D training data. The proposed score distillation sampling loss enables this transfer learning approach.
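Concretely, the paper frames this as an optimization problem: find NeRF parameters θ so that renderings from randomly sampled cameras look like samples from the frozen text-conditioned diffusion model (notation below follows the paper loosely):

$$
\theta^{\star} = \arg\min_{\theta} \; \mathcal{L}_{\mathrm{SDS}}\big(\phi,\; \mathbf{x} = g(\theta)\big)
$$

where g is the differentiable NeRF renderer, θ are its parameters, and φ are the frozen Imagen weights; cameras and lights are resampled at every optimization step.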

Main Contributions

Here are the key contributions of this paper:

  • They introduce score distillation sampling (SDS), a novel loss function that enables sampling from diffusion models via optimization in differentiable image parameterizations.

  • They propose DreamFusion, an algorithm that transfers knowledge from pretrained 2D image diffusion models to synthesize 3D objects from text prompts. DreamFusion optimizes a Neural Radiance Field to minimize the proposed SDS loss.

  • They design the DreamFusion algorithm to reveal and refine geometric details, using random viewpoints, shading, and view-dependent text augmentations.

  • They demonstrate state-of-the-art results on text-to-3D generation, outperforming prior work in geometry consistency when evaluated with a CLIP-based metric.

  • Their approach synthesizes compelling 3D objects spanning animals, scenes, and abstract concepts using only a pretrained 2D image diffusion model. It does not require scarce 3D training data.

In summary, the main contributions are:

  1. The proposed score distillation sampling loss for diffusion model-based optimization.

  2. The DreamFusion algorithm for text-to-3D synthesis by transferring 2D diffusion models.

  3. State-of-the-art text-to-3D generation results, especially in recovering coherent geometry.

  4. Demonstrating high-quality text-to-3D synthesis without using any 3D training data.

Method Section

Here is a summary of the method section:

The paper presents two main components:

  1. Score Distillation Sampling (SDS) Loss
  • Defines a loss that distills a pretrained diffusion model into a critic for optimization.
  • Minimizes a KL divergence between Gaussian distributions over noised renderings from the generator and the marginal densities learned by the diffusion model.
  • Yields a gradient that follows the diffusion model’s score while avoiding costly backpropagation through the diffusion model itself (the gradient is written out after this list).
  • Enables sampling via optimization in arbitrary differentiable image parameterizations.
  2. DreamFusion Algorithm
  • Represents the 3D scene as a Neural Radiance Field (NeRF).
  • Renders the NeRF from random viewpoints under random illumination to reveal geometry.
  • Appends view-dependent text such as “front view” to the prompt based on the camera location.
  • Diffuses the rendered image and uses the frozen Imagen model to predict the injected noise.
  • The noise prediction contains structure that improves fidelity and realism.
  • Subtracting the injected noise from the prediction gives a low-variance update direction.
  • The update direction is backpropagated through the differentiable renderer to update the NeRF parameters.
  • The process is repeated for many iterations to synthesize 3D geometry matching the text prompt.
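
For reference, the SDS gradient that the bullets above describe takes the form given in the paper:

$$
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}\big(\phi,\, \mathbf{x} = g(\theta)\big) \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(\mathbf{z}_t;\, y,\, t) - \epsilon\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \right]
$$

where z_t = α_t x + σ_t ε is the noised rendering, ε̂_φ is the frozen model’s (classifier-free-guided) noise prediction for text y, and w(t) is a timestep weighting. Because the U-Net Jacobian term is omitted, no backpropagation through the diffusion model is required.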

In summary, the core ideas are:

  1. The proposed SDS loss transfers knowledge from 2D diffusion models to enable optimization-based sampling.
  2. The DreamFusion algorithm applies this to synthesize 3D objects by optimizing a NeRF to match the guidance from a frozen 2D diffusion model.

High-Level Pseudo Code

Here is high-level pseudocode for the DreamFusion algorithm:

```python
# Inputs
text_prompt = "..."
imagen_model = load_pretrained_model()   # frozen 2D text-to-image diffusion model

# Initialize NeRF
nerf = init_nerf_parameters()

for i in range(num_iterations):

  # Sample a random viewpoint and light position
  camera = sample_camera_pose()
  light = sample_light_position()

  # Render the NeRF (differentiable)
  rendering = nerf.render(camera, light)

  # Append view-dependent text to the prompt
  view_text = get_view_dependent_text(camera)
  prompt = text_prompt + " " + view_text

  # Diffuse the rendering with freshly sampled noise
  eps = sample_noise(rendering.shape)
  z = diffuse(rendering, eps)

  # Predict the injected noise with the frozen Imagen model
  eps_hat = imagen_model.predict_noise(z, prompt)

  # SDS update direction: predicted noise minus injected noise
  # (no gradient flows back through the diffusion model)
  update = eps_hat - eps

  # Chain rule through the differentiable renderer to update the NeRF
  nerf.parameters -= update * rendering.gradients
```

This shows the key steps:

  1. Rendering the NeRF from random viewpoints
  2. Diffusing the rendered image
  3. Predicting the injected noise with the frozen Imagen model to form the update direction
  4. Backpropagating the updates through the renderer to update the NeRF

The core idea is using the pretrained Imagen model to guide the optimization of the 3D NeRF without any 3D supervision.
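
As a concrete illustration of the view-dependent prompt augmentation, here is a minimal sketch; the function name and the angle thresholds are assumptions for illustration, not taken from the paper’s implementation:

```python
def view_dependent_text(azimuth_deg, elevation_deg):
    """Map camera angles to the phrase appended to the prompt (thresholds are assumed)."""
    if elevation_deg > 60:                        # high elevation -> overhead shot
        return "overhead view"
    azimuth_deg = azimuth_deg % 360
    if azimuth_deg < 45 or azimuth_deg >= 315:    # roughly facing the object
        return "front view"
    if 135 <= azimuth_deg < 225:                  # roughly behind the object
        return "back view"
    return "side view"

# Example: camera behind the object at low elevation
print("a photo of a squirrel" + ", " + view_dependent_text(180, 10))  # -> "..., back view"
```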


Detailed Pseudo Code

The paper does not provide enough detail to fully reproduce the implementation, but here is more detailed pseudocode expanding on the high-level overview:

```python
# Hyperparameters
num_iterations = 15000
batch_size = 1
guidance_scale = 100
lr = 1e-4
text_prompt = "..."

# Imagen model (frozen; used only to predict noise)
imagen_model = ImagenModel(guidance_scale=guidance_scale)

# NeRF scene representation
nerf = NeRFMLP()
params = nerf.parameters()
optimizer = Adam(params, lr)

for i in range(num_iterations):

  for _ in range(batch_size):

    # Sample viewpoint and light position
    camera = UniformSphereCamera(-30, 30)
    light = UniformSphereLight(0.8, 1.5)

    # Differentiable rendering
    rendering = nerf.render(camera, light)

    # Append view-dependent text
    view_text = GetViewDependentText(camera)
    prompt = text_prompt + " " + view_text

    # Sample a timestep and its noise-schedule coefficients
    t = Uniform(0.02, 0.98)
    alpha, sigma = GetNoiseSchedule(t)

    # Diffuse the rendered image with injected noise eps
    eps = NormalNoise(rendering.shape)
    z = alpha * rendering + sigma * eps

    # Noise prediction from the frozen Imagen model (no gradient through it)
    eps_hat = imagen_model(z, prompt, t)

    # Stop-gradient surrogate loss: its gradient w.r.t. the NeRF parameters
    # equals w(t) * (eps_hat - eps) * d(rendering)/d(params)
    grad = w(t) * (eps_hat - eps)   # w(t): timestep weighting from the paper
    loss = (stop_gradient(grad) * rendering).sum()

    # Accumulate gradients through the renderer
    loss.backward()

  # Update NeRF
  optimizer.step()
  optimizer.zero_grad()
```

Key details:

  • Uses a NeRF scene representation
  • Renders from random viewpoints
  • Appends view-dependent text
  • Diffuses rendered image based on noise schedule
  • Predicts noise using frozen Imagen model
  • Uses a stop-gradient surrogate loss so backpropagation skips the frozen diffusion model
  • Backprops loss through renderer to update NeRF

This provides an outline of the algorithm, but some details would need to be filled in for a full implementation.
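
For readers who want something closer to executable code, below is a minimal PyTorch-style sketch of a single SDS update. It is not the authors’ implementation; `render_fn`, `diffusion_eps_fn`, `alphas`, and `sigmas` are hypothetical stand-ins for the NeRF renderer, the frozen diffusion model’s noise predictor, and its noise schedule, and the 1000-step discretization is an assumption.

```python
import torch

def sds_step(render_fn, diffusion_eps_fn, prompt_emb, alphas, sigmas, w=None):
    """One SDS update: returns a surrogate loss whose gradient matches the SDS gradient.

    render_fn() -> image tensor that is differentiable w.r.t. the NeRF parameters
    diffusion_eps_fn(z_t, t, prompt_emb) -> predicted noise from the frozen model
    alphas, sigmas -> 1D tensors giving the noise schedule on an assumed 1000-step grid
    """
    x = render_fn()                                   # differentiable rendering
    t = torch.randint(low=20, high=981, size=(1,))    # t ~ U(0.02, 0.98) on the assumed grid
    eps = torch.randn_like(x)                         # injected noise
    z_t = alphas[t] * x + sigmas[t] * eps             # forward-diffused rendering
    with torch.no_grad():                             # no backprop through the diffusion model
        eps_hat = diffusion_eps_fn(z_t, t, prompt_emb)
    weight = 1.0 if w is None else w[t]
    grad = weight * (eps_hat - eps)                   # SDS gradient w.r.t. the rendered image
    # Surrogate loss: d(loss)/d(params) = grad * dx/d(params), i.e. the SDS update
    loss = (grad.detach() * x).sum()
    loss.backward()
    return loss
```

Calling `sds_step` inside a training loop and then stepping an optimizer over the NeRF parameters reproduces the update described above; the `detach` on `grad` is what implements the stop-gradient trick.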