Text-Guided Generation and Editing of Compositional 3D Avatars

Authors: Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, Michael J. Black

Abstract: Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person’s appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.

What, Why and How

Here is a summary of the key points in this paper:

What:

  • The paper presents a method called TECA (Text-guided generation and Editing of Compositional Avatars) for generating realistic 3D facial avatars with hair, clothing, and accessories from text descriptions.

Why:

  • Existing methods for text-to-3D avatar generation lack realism, produce unrealistic shapes, or don’t support editing like changing hairstyles.
  • The authors argue this is because they use a single representation (e.g. NeRF or mesh) for the whole avatar rather than modeling different components separately.
  • The face/body and the hair/clothing have very different structural qualities, so they benefit from different representations.

How:

  • The face and body are represented with a parametric mesh model (SMPL-X), which provides a strong shape prior.
  • Hair, clothing, and accessories are modeled with NeRFs, which can represent complex geometry.
  • A face image is generated from the text using Stable Diffusion, and an SMPL-X model is fit to this image to obtain the 3D shape.
  • The face texture is inpainted with a diffusion model conditioned on the text and the 3D shape.
  • Hair, clothing, etc. are generated with NeRFs optimized using a Score Distillation Sampling (SDS) loss from the diffusion model; CLIPSeg masks focus each NeRF on its target region.
  • Non-face regions are refined using a combination of the SDS loss and a BLIP-based loss.

In summary, the key ideas are:

  1. Hybrid compositional modeling of avatars
  2. Leveraging the strengths of different representations: meshes for the face/body, NeRFs for complex non-face parts
  3. A sequential generation process guided by losses from pretrained vision models.

Main Contributions

Here are the main contributions of this paper:

  1. A compositional modeling approach for text-to-avatar generation using a hybrid of mesh and NeRF representations. This exploits the strengths of each representation for different avatar components.

  2. A sequential generation pipeline guided by pretrained vision models: Stable Diffusion for SDS guidance, CLIPSeg for spatial segmentation masks, and BLIP for detail refinement. This enables generating realistic and detailed avatars from text descriptions.

  3. Demonstration of high-quality avatar generation results that surpass state-of-the-art methods in terms of realism, shape quality, and editability.

  4. Support for editing capabilities like seamless transfer of hairstyles or accessories between avatars, enabling applications like virtual try-on.

  5. Quantitative experiments, including a perceptual study, CLIP score evaluation, and FID score evaluation, that demonstrate superior performance over other text-to-3D avatar methods (a CLIP-score sketch follows this list).

  6. Ablation studies that validate the design decisions such as using a hybrid representation, refining details with BLIP+SDS losses, and using CLIPSeg for spatial guidance.
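For reference, the CLIP score used in such evaluations is the cosine similarity between CLIP embeddings of a rendered avatar and the text prompt. Below is a minimal sketch using the Hugging Face transformers CLIP implementation; the model choice and helper function are illustrative, not the paper's exact evaluation code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# illustrative model choice; the paper's evaluation may use a different CLIP variant
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path, prompt):
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # cosine similarity between the normalized image and text embeddings
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()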

In summary, the core contributions are:

  • The compositional modeling approach
  • The sequential generation pipeline with vision model guidance
  • State-of-the-art results and editing capabilities
  • Quantitative and qualitative experiments validating the method
  • Ablation studies on key components of the approach

Method Section

Here is a summary of the method section from the paper:

The overall pipeline involves generating a 3D facial avatar with hair, clothing, and accessories from a text description.

First, they generate a face image from the text using Stable Diffusion. Then they fit an SMPL-X body model to this image to obtain the 3D facial geometry.

Next, they generate the facial texture by iteratively inpainting renderings of the SMPL-X mesh from different viewpoints with Stable Diffusion, conditioned on the text and the geometry, until the texture is filled in.
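A minimal sketch of this iterative texture loop, assuming hypothetical helpers for rendering, inpainting, and UV projection (init_uv_texture, render_textured_mesh, inpaint_with_diffusion, and project_img_to_uv are illustrative names, not the authors' code):

# Illustrative sketch of view-by-view texture inpainting.
texture = init_uv_texture()
for view in view_schedule:  # e.g., front view first, then progressively rotated views
    rendered, textured_mask = render_textured_mesh(verts, texture, view)
    # inpaint only the still-blank pixels, conditioned on the text prompt
    generated = inpaint_with_diffusion(rendered, mask=~textured_mask, prompt=text_prompt)
    # back-project the newly generated pixels into UV space and merge them into the texture
    texture = project_img_to_uv(texture, generated, verts, view)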

They represent the non-face components like hair and clothing using NeRF models. To enable transferring these components between avatars, the NeRFs are trained in a canonical space defined by the SMPL-X template mesh.
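One simple way to realize such a canonical-space NeRF is to map every sample point from the posed/shaped mesh space back to the template space before querying the network. The sketch below uses a nearest-vertex displacement as an illustrative approximation; it is not necessarily the paper's exact mapping.

import torch

def to_canonical(points, posed_verts, template_verts):
    # nearest posed SMPL-X vertex for every sampled point (brute force for clarity)
    dists = ((points[:, None, :] - posed_verts[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(dim=1)
    # shift each point by the displacement from its posed vertex to the template vertex
    return points + (template_verts[idx] - posed_verts[idx])

def query_canonical_nerf(points, posed_verts, template_verts, nerf_model):
    canonical_points = to_canonical(points, posed_verts, template_verts)
    return nerf_model(canonical_points)  # density and color predicted in canonical space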

The NeRF optimization is guided by a Score Distillation Sampling (SDS) loss from a diffusion model so that renderings match the text description. They also use CLIPSeg to obtain spatial masks that focus each NeRF on its target region, such as the hair.
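For context, the standard SDS gradient (as introduced in DreamFusion) adds noise to the rendered image, asks the frozen diffusion model to predict that noise, and pushes the rendering toward lower prediction error. Below is a minimal PyTorch-style sketch; unet, scheduler, the embedding arguments, and the guidance scale are assumed placeholders for a latent-diffusion denoiser, not TECA's actual code.

import torch

def sds_gradient(latents, text_emb, uncond_emb, unet, scheduler, guidance_scale=100.0):
    # sample a diffusion timestep and add the corresponding noise to the rendered latents
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # the frozen denoiser predicts the noise with and without text conditioning (CFG)
    with torch.no_grad():
        eps_text = unet(noisy, t, encoder_hidden_states=text_emb).sample
        eps_uncond = unet(noisy, t, encoder_hidden_states=uncond_emb).sample
    eps_hat = eps_uncond + guidance_scale * (eps_text - eps_uncond)

    # SDS gradient w(t) * (eps_hat - noise); gradients flow only through the renderer
    w = (1.0 - scheduler.alphas_cumprod.to(latents.device)[t]).view(-1, 1, 1, 1)
    return w * (eps_hat - noise)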

Finally, they refine the details of the non-face NeRF components using an additional loss based on BLIP image-text similarity, combined with the SDS loss, which sharpens fine details described in the text.

In summary, the key aspects are:

  • Generate face image from text
  • Fit SMPL-X model for geometry
  • Inpaint texture using diffusion model
  • Represent non-face parts with NeRF
  • Optimize NeRF using SDS and masks
  • Refine details with BLIP+SDS losses

The compositional modeling and the use of vision-model losses allow the method to generate high-quality avatars from text descriptions.

High-Level Pseudo Code

Here is the high-level pseudo code for the method presented in the paper:

# Input: text prompt describing avatar
text_prompt = "A young woman with curly hair wearing a red scarf" 
 
# Generate face image from text 
face_img = generate_face_image(text_prompt)
 
# Get 3D facial geometry
shape_params, pose_params = fit_smplx(face_img)  
verts = get_smplx_mesh(shape_params, pose_params)
 
# Generate facial texture by iterative inpainting from multiple views
uv_map = init_uv_texture()
for view in multiple_views:
   img = generate_face_image(text_prompt, verts, view)
   # project the generated view into UV space and merge it with the current texture
   uv_map = project_img_to_uv(uv_map, img, verts, view)
 
# Generate non-face components (hair shown as an example)
nerf_model = init_nerf(canonical_space)
for step in range(num_iterations):
   rendered_img, rendered_hair_mask = render_nerf(nerf_model)   # composited with the mesh
   hair_mask = segment_image(rendered_img, "hair")              # CLIPSeg target mask
   nerf_loss = sds_loss(rendered_img, text_prompt)
   mask_loss = l1_loss(rendered_hair_mask, hair_mask)
   nerf_model = update_nerf(nerf_loss + mask_loss)
 
# Refine details with SDS + BLIP losses
for step in range(num_iterations):
   rendered_img, _ = render_nerf(nerf_model)
   nerf_loss = sds_loss(rendered_img, text_prompt)
   blip_loss = blip_similarity(rendered_img, text_prompt)
   nerf_model = update_nerf(nerf_loss + blip_loss)
hair_nerf = nerf_model   # the same procedure is repeated for clothing, accessories, etc.
 
# Final avatar contains SMPL-X mesh and non-face NeRFs
avatar = {'smplx_verts': verts, 
          'smplx_uvmap': uv_map,
          'nerf_models': { 'hair': hair_nerf, 
                           'clothing': clothing_nerf}} 

This summarizes the high-level steps including generating the face image and geometry, iterating on the texture generation, optimizing the NeRF models with losses, and assembling the final compositional avatar.
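Because the components are stored separately, editing operations such as hairstyle transfer reduce to swapping dictionary entries between two generated avatars. The two-line sketch below is illustrative, reusing the avatar structure from the pseudo code above.

# transfer the hairstyle (a hair NeRF) from avatar_a to avatar_b, e.g., for virtual try-on
avatar_b['nerf_models']['hair'] = avatar_a['nerf_models']['hair']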

Detailed Pseudo Code

Here is a more detailed pseudo code sketch of how to implement the approach presented in this paper:

import smplx, diffusers, blip, nerf   # placeholder modules standing in for the actual libraries
 
# SMPL-X functions
def get_smplx_verts(shape_params, pose_params, exp_params):
   verts = smplx.forward(shape_params, pose_params, exp_params)
   return verts
 
def fit_smplx(img):
   # Use a landmark detector to get 2D face landmarks
   landmarks = detect_landmarks(img)
   # Optimize SMPL-X parameters to fit the landmarks
   shape_params, pose_params, exp_params = smplx.fit(landmarks)
   return shape_params, pose_params, exp_params
 
# Stable Diffusion functions
def generate_face_img(text_prompt, geometry=None, view=None):
   # optionally condition on the mesh geometry and viewpoint (e.g., depth-guided inpainting)
   img = diffusers.generate(text_prompt, geometry, view)
   return img
 
# NeRF functions
def render_nerf(nerf_model, view):
   rays = get_camera_rays(view)
   rgb, acc_mask = nerf_model(rays)   # color and accumulated opacity (silhouette) per pixel
   return rgb, acc_mask
 
def sds_loss(rendered_img, text):
   loss = diffusers.sds_loss(rendered_img, text)
   return loss
 
def blip_similarity(rendered_img, text):
   img_emb = blip.encode_image(rendered_img)
   text_emb = blip.encode_text(text)
   loss = 1 - cosine_similarity(img_emb, text_emb)
   return loss
 
# Main method
text_prompt = "A young woman with curly hair wearing a red scarf"
 
# Generate face image from text prompt
face_img = generate_face_img(text_prompt) 
 
# Fit the SMPL-X model
shape_params, pose_params, exp_params = fit_smplx(face_img)
verts = get_smplx_verts(shape_params, pose_params, exp_params)
 
# Iterate over multiple views to generate the UV texture
uv_map = init_uv_texture()
for view in multiple_views:
   view_img = generate_face_img(text_prompt, verts, view)
   uv_map = project_img_to_uv(uv_map, view_img, verts, view)
 
# Optimize the hair NeRF
hair_nerf = nerf.initialize()
for i in range(num_steps):
   view = sample_camera_view()                        # placeholder: random or scheduled viewpoint
   rendered_img, rendered_mask = render_nerf(hair_nerf, view)
   target_mask = segment_image(rendered_img, 'hair')  # CLIPSeg mask of the hair region
   loss_sds = sds_loss(rendered_img, text_prompt)
   loss_mask = l1_loss(rendered_mask, target_mask)
   hair_nerf.optimize(loss_sds + loss_mask)
 
# Refine the NeRF with SDS + BLIP losses
for i in range(num_steps):
   view = sample_camera_view()
   rendered_img, _ = render_nerf(hair_nerf, view)
   loss_sds = sds_loss(rendered_img, text_prompt)
   loss_blip = blip_similarity(rendered_img, text_prompt)
   hair_nerf.optimize(loss_sds + loss_blip)
 
# Assemble avatar 
avatar = {
  'smplx_verts': verts,
  'smplx_uvmap': uv_map,
  'nerfs': {'hair': hair_nerf}
}

This shows more implementation details like using landmark detection, rendering loops, integrating diffusers and BLIP losses, segmenting images, and assembling the final avatar. The overall pipeline follows the approach described in the paper.