Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models
Authors: James Burgess, Kuan-Chieh Wang, Serena Yeung
Abstract: Text-to-image diffusion models understand spatial relationships between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in generated images from frozen diffusion models. We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint. ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions have good semantic details and photorealism compared to prior methods. Our approach is well suited for modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated by user-defined prompts.
What, Why and How
Here is a summary of the key points in the paper:
What:
- The paper introduces Viewpoint Neural Textual Inversion (ViewNeTI), a method to control the 3D viewpoint of objects in images generated by diffusion models.
- ViewNeTI trains a small neural network called a view-mapper to predict text embeddings that manipulate the viewpoint when conditioning a frozen diffusion model like Stable Diffusion.
Why:
- Despite being trained on only 2D images, diffusion models seem to encode 3D knowledge about objects and scenes. ViewNeTI provides a way to exploit this implicit 3D understanding for novel 3D vision tasks.
- Controlling viewpoint in diffusion models is useful for applications like novel view synthesis from sparse inputs and controlling image generation from text prompts.
How:
- ViewNeTI is trained using textual inversion on multi-view datasets to reconstruct views of objects. An object-mapper also captures view-invariant semantics.
- For novel view synthesis, ViewNeTI can be pretrained on multi-scene data and fine-tuned for a new scene with just a single input view. It generates photorealistic novel views.
- For text-to-image generation, ViewNeTI's pretrained view-mapper can be composed with new text prompts to control the viewpoint around generated objects.
In summary, ViewNeTI demonstrates that neural manipulation of the text embeddings that condition diffusion models provides an effective way to exploit the 3D knowledge implicitly encoded in these 2D generative models. The method enables novel applications in view manipulation and 3D understanding.
Main Contributions
Here are the main contributions of this paper:
- Proposes ViewNeTI, a novel method to control 3D viewpoint in images generated by diffusion models. This enables new applications in view manipulation and 3D understanding.
- Demonstrates that despite only 2D supervision, diffusion models encode implicit 3D knowledge about scenes and objects in their latent space. ViewNeTI provides a way to exploit this.
- Formulates a new approach to novel view synthesis by manipulating the text embeddings that condition diffusion image generation based on camera parameters.
- Achieves high-quality novel view synthesis from just a single input image after pretraining ViewNeTI on a multi-scene dataset. Results are photorealistic and contain plausible semantics.
- Generalizes viewpoint control to new objects and scenes outside the dataset used for pretraining ViewNeTI. The method adapts to new objects with just a small fine-tuning set.
- Extends ViewNeTI to control viewpoint in text-to-image generation by composing the view conditioning with arbitrary text prompts.
- Analyzes the 3D capabilities encoded in diffusion models through experiments on semantic inpainting and infilling.
- Provides detailed ablations and comparisons to validate the ViewNeTI method and its design choices.
In summary, the main contributions are introducing ViewNeTI for viewpoint control in diffusion models, achieving compelling results for single-view novel view synthesis, and analyzing the 3D knowledge encoded in these generative 2D models.
Method Section
Here is a summary of the method section from the paper:
The key component of ViewNeTI is a small neural network called the view-mapper, denoted M_v. It takes as input the diffusion timestep t, the UNet layer l, and the camera viewpoint parameters R, and outputs a word embedding v_R and a bypass vector v_p_R that manipulate the text encoding which conditions image generation in the diffusion model.
An object-mapper M_o works similarly but without the camera parameters, outputting v_o and v_p_o to capture view-invariant semantics.
The camera parameters R can be any viewpoint representation, such as a camera projection matrix. They are encoded into a low-dimensional vector using a Fourier feature mapping before being passed to the view-mapper MLP.
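To make the input encoding concrete, here is a minimal PyTorch sketch of a Fourier-feature camera encoding followed by the mapper MLP. The number of frequencies, the hidden width, the 3x4 pose parameterization, and the way t and l are appended are illustrative assumptions rather than the paper's exact architecture; the sketch only fills in what encode(t, l, R) abstracts away in the pseudocode later on.

import torch
import torch.nn as nn

class FourierViewMapper(nn.Module):
    """Sketch of M_v: (t, l, R) -> (v_R, v_p_R). All sizes are illustrative."""
    def __init__(self, embed_dim=768, num_freqs=16, hidden=128):
        super().__init__()
        # Random Fourier frequencies for the flattened 3x4 camera pose R.
        self.register_buffer("freqs", torch.randn(num_freqs, 12))
        in_dim = 2 * num_freqs + 2      # sin/cos features plus scalar t and l
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, 2 * embed_dim),   # concatenated v_R and v_p_R
        )

    def forward(self, t, l, R):
        proj = R.flatten(-2) @ self.freqs.T                     # (..., num_freqs)
        feats = torch.cat([proj.sin(), proj.cos(),
                           t[..., None], l[..., None]], dim=-1)
        v_R, v_p_R = self.mlp(feats).chunk(2, dim=-1)           # word embedding, bypass vector
        return v_R, v_p_R

Usage would look like v_R, v_p_R = FourierViewMapper()(torch.tensor([0.5]), torch.tensor([3.0]), torch.randn(1, 3, 4)).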
For training, ViewNeTI uses textual inversion on a multi-view dataset. The training captions contain a unique view token S_R_i for each viewpoint and a shared object token S_o. The view- and object-mappers are trained to reconstruct the training views by predicting the embeddings for S_R_i and S_o, respectively (a sketch of this embedding substitution follows below).
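As a rough illustration of the substitution, the sketch below writes the mapper outputs into the placeholder rows of a CLIP text encoder's token-embedding table and then encodes the caption, following standard Hugging Face textual-inversion practice; it is a simplification, not the paper's implementation. In particular, during training the substitution must remain differentiable with respect to the mapper outputs (which also depend on t and l at every denoising step), which this overwrite-style sketch does not capture.

import torch

def encode_caption_with_placeholders(tokenizer, text_encoder, caption, placeholder_to_vec):
    # Register placeholder tokens such as "S_R" and "S_o" (no-op if already added).
    tokenizer.add_tokens(list(placeholder_to_vec.keys()))
    text_encoder.resize_token_embeddings(len(tokenizer))
    embedding_table = text_encoder.get_input_embeddings()
    with torch.no_grad():
        for token, vec in placeholder_to_vec.items():
            token_id = tokenizer.convert_tokens_to_ids(token)
            embedding_table.weight[token_id] = vec      # e.g. v_R for "S_R", v_o for "S_o"
    input_ids = tokenizer(caption, return_tensors="pt").input_ids
    return text_encoder(input_ids).last_hidden_state    # conditioning for the diffusion UNet

For example, placeholder_to_vec = {"S_R": v_R, "S_o": v_o} with the caption "S_R. A photo of a S_o".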
For novel view synthesis from a sparse set of input views, the view-mapper is pretrained on a multi-scene dataset and then frozen, while a new object-mapper is trained to adapt to the novel scene. At test time, novel views are generated by conditioning on the test camera parameters R.
For single-view novel view synthesis, the same pretrain-and-finetune approach is used; the frozen view-mapper provides generalization to unseen views.
The pretrained view-mapper can also be used to control viewpoint in text-to-image generation by composing the view token predicted by M_v with arbitrary text prompts.
In summary, ViewNeTI trains neural mappers to manipulate the text embeddings that condition diffusion models in order to control viewpoint and generate novel views of objects.
High-Level Pseudo Code
Here is high-level pseudocode for the ViewNeTI method:
# View-mapper M_v: (timestep t, UNet layer l, camera params R) -> view-token embeddings
def view_mapper(t, l, R):
    # Encode the inputs (Fourier features for R, positional encoding for t and l)
    c = encode(t, l, R)
    # MLP predicts the word embedding and the bypass vector for the view token
    v_R, v_p_R = mlp(c)
    return v_R, v_p_R

# Object-mapper M_o: (t, l) -> view-invariant object-token embeddings
def object_mapper(t, l):
    c = encode(t, l)
    v_o, v_p_o = mlp(c)
    return v_o, v_p_o

# Textual-inversion training (the diffusion model stays frozen)
for (x, R) in multi_view_dataset:
    # t and l are given by the diffusion training step and the UNet layer being conditioned
    v_R, v_p_R = view_mapper(t, l, R)
    v_o, v_p_o = object_mapper(t, l)

    # Caption with a per-view token and a shared object token
    prompt = "S_R. A photo of a S_o"

    # Condition the diffusion model and compute the denoising reconstruction loss
    encoding = text_encode(prompt, v_R, v_p_R, v_o, v_p_o)
    loss = diffusion_reconstruct(encoding, x)

    # Update only the view- and object-mappers
    update(loss)

# Novel view synthesis
pretrained_view_mapper = load("view_mapper.pt")   # frozen, pretrained on multi-scene data

# Finetune a fresh object-mapper on the sparse input views of the new scene
obj_mapper = train_object_mapper(sparse_input_views, pretrained_view_mapper)

# Predict a novel view
R_novel = get_novel_view()
v_R, v_p_R = pretrained_view_mapper(t, l, R_novel)
v_o, v_p_o = obj_mapper(t, l)
encoding = text_encode(prompt, v_R, v_p_R, v_o, v_p_o)
x_novel = diffusion_generate(encoding)
The key steps are training the view and object mappers with textual inversion on a multi-view dataset, finetuning the object mapper for a new scene, and generating novel views by conditioning on the pretrained view mapper’s embeddings.
Detailed Pseudo Code
Here is some more detailed pseudocode to implement the ViewNeTI method:
import torch
import torch.nn as nn
import diffusion_model, clip   # stand-ins for the frozen diffusion model and its CLIP text encoder

# View-mapper M_v
class ViewMapper(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.d = embed_dim
        self.mlp = build_mlp(out_dim=2 * embed_dim)   # small MLP; build_mlp and encode are stand-ins

    def forward(self, t, l, R):
        # Encode the diffusion timestep, UNet layer, and camera parameters
        c = encode(t, l, R)
        h = self.mlp(c)
        # Split into the word embedding and the bypass vector
        v_R, v_p_R = h[:, :self.d], h[:, self.d:]
        return v_R, v_p_R

# Object-mapper M_o: same structure as ViewMapper but conditioned only on (t, l)
class ObjectMapper(nn.Module):
    ...

# Text encoding: inject the mapper outputs into the prompt conditioning
def text_encode(prompt, v_R, v_p_R, v_o, v_p_o):
    # The word embeddings of the placeholder tokens S_R and S_o are replaced by
    # v_R and v_o before running the frozen CLIP text encoder.
    text_enc = clip.encode_text(prompt, token_embeddings={"S_R": v_R, "S_o": v_o})
    # Bypass: add the rescaled bypass vectors to the encoder outputs at the
    # positions k and j of the S_R and S_o tokens.
    text_enc[k] += alpha * v_p_R / v_p_R.norm()
    text_enc[j] += alpha * v_p_o / v_p_o.norm()
    return text_enc

# Textual-inversion training (the diffusion model stays frozen)
view_mapper = ViewMapper()
object_mapper = ObjectMapper()
optimizer = torch.optim.AdamW(list(view_mapper.parameters()) + list(object_mapper.parameters()))
prompt = "S_R. A photo of a S_o"

for x, R in multi_view_data:
    # t and l are supplied by the diffusion training step for each UNet layer
    v_R, v_p_R = view_mapper(t, l, R)
    v_o, v_p_o = object_mapper(t, l)
    text_enc = text_encode(prompt, v_R, v_p_R, v_o, v_p_o)
    loss = diffusion_model.train_step(text_enc, x)   # denoising reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Novel view synthesis: freeze the pretrained view-mapper, finetune a new object-mapper
view_mapper.load_state_dict(torch.load("view_mapper.pt"))
view_mapper.requires_grad_(False)

object_mapper = ObjectMapper()
finetune_object_mapper(object_mapper, sparse_data, view_mapper)   # same loss as above

R_novel = get_novel_view()
v_R, v_p_R = view_mapper(t, l, R_novel)
v_o, v_p_o = object_mapper(t, l)
text_enc = text_encode(prompt, v_R, v_p_R, v_o, v_p_o)
x_novel = diffusion_model.sample(text_enc)
The key classes are the ViewMapper and ObjectMapper MLPs. The text encoding step injects their outputs into the CLIP embeddings, and the training loop performs textual inversion on multi-view data. For NVS, the view-mapper is frozen and a new object-mapper is finetuned; novel views are then generated by conditioning on the view-mapper's predicted embeddings.
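The paper also composes the pretrained view-mapper with free-form prompts for view-controlled text-to-image generation. The snippet below sketches that usage in the same pseudocode style as above; the example prompt and the R_desired camera are illustrative assumptions.

# View-controlled text-to-image generation: compose the view token with an
# arbitrary prompt (no object-mapper is needed).
prompt = "S_R. A photo of a dog wearing sunglasses"
v_R, v_p_R = view_mapper(t, l, R_desired)                # desired camera viewpoint
text_enc = clip.encode_text(prompt, token_embeddings={"S_R": v_R})
text_enc[k] += alpha * v_p_R / v_p_R.norm()              # bypass at the S_R token position k
x = diffusion_model.sample(text_enc)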