DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation

Authors: Zhuowei Chen, Shancheng Fang, Wei Liu, Qian He, Mengqi Huang, Yongdong Zhang, Zhendong Mao

Abstract: While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centric images, an intractable problem is how to preserve the face identity of a conditioning face image. Existing methods either require time-consuming optimization for each face identity or learn an efficient encoder at the cost of harming the editability of the model. In this work, we present a method that is optimization-free for each face identity while keeping the editability of the text-to-image model. Specifically, we propose a novel face-identity encoder to learn an accurate representation of human faces, which applies multi-scale face features followed by a multi-embedding projector to directly generate pseudo words in the text embedding space. In addition, we propose self-augmented editability learning to enhance the editability of the model, which constructs paired generated and edited face images using celebrity names, aiming to transfer the mature editing ability of off-the-shelf text-to-image models on celebrity faces to unseen faces. Extensive experiments show that our method can generate identity-preserved images under different scenes at a much faster speed.

What, Why and How

Here is a summary of the key points from this paper:

What: This paper proposes a novel method for identity-preserved text-to-image generation from a single facial image. The goal is to efficiently re-contextualize a specified person under various textual prompts while preserving their face identity.

Why: Existing methods either require slow per-identity optimization or sacrifice model editability for efficiency. The latter stems from inaccurate identity representations produced by general-purpose encoders such as CLIP, and from a mismatch between the training objective (pure reconstruction) and the inference-time goal (text-guided editing).

How: The authors propose two main contributions: an accurate ID encoder using multi-scale identity features and multi-word embeddings, and a self-augmented editability learning method. The ID encoder extracts more identity-representative features than a general-purpose CLIP encoder. The self-augmented learning uses the model itself to generate training data for improving editability.

In summary, this paper aims to achieve identity-preserved and editable text-to-image generation by designing a better identity encoder and adding an editability learning objective. The proposed method outperforms both previous optimization-based and encoder-based (efficient) methods on identity-similarity and text-alignment metrics.

Main Contributions

Here are the main contributions of this paper:

  1. Conceptually, it points out the limitations of current optimization-free methods for identity-preserved text-to-image generation: inaccurate identity representations and inconsistent training/inference objectives.

  2. Technically, it proposes two novel components:

  • An ID Encoder for more accurate identity representations, using multi-scale identity features from a dedicated encoder and multi-word embeddings.

  • A self-augmented editability learning method to construct training data for improving editability using the model’s own generations.

  3. Experimentally, it demonstrates superior performance to previous methods in terms of identity preservation, text alignment and efficiency. The proposed method can generate high-quality, identity-preserved images under different scenes with text guidance.

  4. It introduces an application of identity-preserved scene switching by controlling the identity and gaze location. This allows generating a series of images of the same person in different contexts.

In summary, the key contributions include identifying limitations of existing methods, proposing techniques to address them, and experimentally demonstrating the improvements.

Method Section

Here is a summary of the methods section from the paper:

The overall framework builds on a frozen pre-trained text-to-image model. To achieve identity-preserved image generation, it first encodes the target identity into pseudo-word embeddings using the proposed M^2 ID encoder. These embeddings are then spliced into the input text prompt, and the pre-trained model generates the image conditioned on the combined prompt.
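How the pseudo-word embeddings are merged with the text prompt is not spelled out above. A minimal sketch of one plausible mechanism, written in Hugging Face transformers style (an assumption for illustration, not the authors' released code; the model ID and placeholder token names are illustrative), is:

from transformers import CLIPTokenizer, CLIPTextModel

# frozen text components of an assumed Stable-Diffusion-style text-to-image pipeline
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def inject_identity(id_word_embs):
  # Register one placeholder token per pseudo word and overwrite its embedding
  # row with the M^2 ID encoder's prediction.
  placeholders = [f"<id_{i}>" for i in range(len(id_word_embs))]
  tokenizer.add_tokens(placeholders)
  text_encoder.resize_token_embeddings(len(tokenizer))
  emb_table = text_encoder.get_input_embeddings().weight
  for tok, emb in zip(placeholders, id_word_embs):
    emb_table.data[tokenizer.convert_tokens_to_ids(tok)] = emb
  return " ".join(placeholders)  # e.g. "<id_0> <id_1>", spliced into any prompt

# usage: prompt = f"a photo of {inject_identity(id_word_embs)} as a chef"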

M^2 ID Encoder:

  • Backbone: Uses a Vision Transformer backbone pre-trained on face recognition to extract identity-aware features.

  • Multi-scale Features: Uses CLS embeddings from multiple transformer layers to capture both high-level and fine-grained identity information (a concrete extraction sketch follows this list).

  • Multi-word Embeddings: Projects multi-scale features into multiple word embeddings to more accurately represent the identity.
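
The multi-scale CLS extraction above can be made concrete with forward hooks. The sketch below uses a generic timm ViT as a stand-in for the paper's face-recognition backbone, and the layer indices are illustrative rather than values from the paper:

import torch
import timm

def collect_cls_tokens(vit, image, layer_ids=(3, 6, 9, -1)):
  # Return the CLS token after each selected transformer block
  # (assumes a timm-style ViT: blocks in vit.blocks, CLS token at index 0).
  cls_tokens, hooks = [], []

  def hook(module, inputs, output):
    cls_tokens.append(output[:, 0])  # [batch, dim] CLS token of this block

  for i in layer_ids:
    hooks.append(vit.blocks[i].register_forward_hook(hook))
  with torch.no_grad():  # the backbone is kept frozen
    vit(image)
  for h in hooks:
    h.remove()
  return cls_tokens

# stand-in backbone; the paper uses a ViT trained for face recognition
vit = timm.create_model("vit_base_patch16_224", pretrained=False).eval()
multi_scale_features = collect_cls_tokens(vit, torch.randn(1, 3, 224, 224))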

Self-Augmented Editability Learning:

  • Constructs a dataset using the pre-trained model to generate celebrity faces and corresponding edited faces based on textual prompts.

  • Trains the M^2 ID encoder on this dataset to improve editability in addition to just reconstruction.

The total loss combines the diffusion loss of the pre-trained model with an embedding regularization loss. The encoder is trained on FFHQ together with the constructed self-augmented dataset.
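
A minimal sketch of that combined objective (lambda_reg is a placeholder weight, not a value reported in the paper):

import torch
import torch.nn.functional as F

def training_loss(noise_pred, noise_target, id_word_embs, lambda_reg=0.01):
  # standard latent-diffusion denoising loss computed with the frozen text-to-image model
  diffusion_loss = F.mse_loss(noise_pred, noise_target)
  # L2 penalty that keeps the predicted pseudo-word embeddings from drifting too large
  reg_loss = torch.stack([e.pow(2).mean() for e in id_word_embs]).mean()
  return diffusion_loss + lambda_reg * reg_loss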

In summary, the key components are the M^2 ID encoder for accurate identity representations and the self-augmented editability learning for improving editability during training. The encoder embeds identities into the text embedding space of the pre-trained model for image generation.

High-Level Pseudo Code

Here is the high-level pseudo code for the key components of this paper:

# M^2 ID Encoder
encoder = VisionTransformerBackbone()  # ViT pre-trained on face recognition, kept frozen
layer_features = encoder(input_face)   # token features from every transformer layer
multi_scale_features = [layer_features[l][:, 0] for l in [3, 6, 9, 12, -1]]  # CLS token of selected layers
 
# Multi-word projection: one projector per scale
word_embs = []
for f in multi_scale_features:
  word_emb = MLP(f)  # project into the text embedding space (one pseudo word)
  word_embs.append(word_emb)
 
# Self-Augmented Editability Learning
celeb_names = get_celeb_names()
celeb_faces = generate_celeb_faces(celeb_names) 
edit_prompts = generate_edit_prompts(celeb_names)
edited_faces = generate_edited_faces(edit_prompts)
train_data = filter_data(celeb_faces, edit_prompts, edited_faces)
 
train_encoder(train_data) # train on FFHQ + train_data

The key steps are:

  1. Extract multi-scale identity features from pre-trained encoder
  2. Project features to multiple word embeddings
  3. Use the pre-trained model to generate celebrity faces and corresponding edited faces
  4. Create training data by pairing original and edited faces
  5. Train the identity encoder on this augmented dataset (inference with the trained encoder is sketched below)
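
The pseudo code above covers training only. At inference time, a single unseen face plus a text prompt is enough; the sketch below reuses the assumed interfaces from the pseudo code (load_image and the [ID] placeholder are illustrative, not a released API):

face = load_image("new_person.jpg")   # a single face image, never seen during training
id_word_embs = id_encoder(face)       # one forward pass, no per-identity optimization
prompt = "a photo of [ID] wearing a spacesuit on the moon"   # [ID] marks the pseudo words
image = text2img_model(prompt, id_word_embs)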

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the key components in this paper:

# M^2 ID Encoder
 
import torch.nn as nn
 
class M2IDEncoder(nn.Module):
  def __init__(self, layers=(3, 6, 9, 12, -1), feat_dim=512, emb_dim=768):  # dims are placeholders
    super().__init__()
    self.backbone = VisionTransformerBackbone()  # face-recognition ViT returning per-layer features
    self.layers = layers
    self.proj_layers = nn.ModuleList([nn.Linear(feat_dim, emb_dim) for _ in layers])
  
  def forward(self, x):
    feats = self.backbone(x)                            # list of [batch, tokens, dim], one per layer
    cls_tokens = [feats[l][:, 0] for l in self.layers]  # CLS token of each selected layer
    
    word_embs = []
    for proj, cls in zip(self.proj_layers, cls_tokens):
      word_embs.append(proj(cls))                       # one pseudo-word embedding per scale
    
    return word_embs
 
# Self-Augmented Editability Learning
 
def generate_celeb_faces(celeb_names):
  faces = []
  for name in celeb_names:
    prompt = f"{name} face" 
    face = text2img_model(prompt)
    faces.append(face)
  return faces
 
def generate_edit_prompts(celeb_names):
  prompts = []
  for name in celeb_names:
    prompts.append(f"oil painting style, {name} face") 
    prompts.append(f"{name} as a chef")
  return prompts
 
def generate_edited_faces(prompts):
  faces = []
  for prompt in prompts:
    face = text2img_model(prompt)
    faces.append(face)
  return faces
 
def filter_data(faces, prompts, edited):
  # Rank each (face, prompt, edited) triplet by identity similarity + CLIP text score,
  # keep the top 25% (a concrete sketch is given at the end of this section)
  return top_triplets
 
def train_step(face, prompt, target):
  id_embs = id_encoder(face)                   # pseudo words for the input identity
  loss = diffusion_loss(text2img_model, prompt, id_embs, target)  # denoise toward the target image
  loss += reg_loss(id_embs)                    # embedding regularization
  return loss
 
# Main loop: x = input face, p = text prompt, y = target image
# (y == x with a plain caption for FFHQ; y = edited face for self-augmented triplets)
for x, p, y in train_loader(train_data):
  loss = train_step(x, p, y)
  optimize(loss)

This shows more implementation detail: the Vision Transformer backbone, generation of the augmented dataset, metric-based filtering, and the main training loop. The key aspects are extracting multi-scale identity features, generating additional training data for editability, and training with the combined reconstruction and editability objectives.
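
The filter_data step above is only a stub. A concrete version could look like the following, where face_embed (any pre-trained face-recognition embedder) and clip_score (a standard CLIP image-text similarity) are hypothetical helpers:

import torch.nn.functional as F

def filter_data(celeb_faces, edit_prompts, edited_faces, keep_ratio=0.25):
  scored = []
  for face, prompt, edited in zip(celeb_faces, edit_prompts, edited_faces):
    id_sim = F.cosine_similarity(face_embed(face), face_embed(edited), dim=-1).item()  # identity kept?
    txt_sim = clip_score(edited, prompt)                                               # edit applied?
    scored.append((id_sim + txt_sim, (face, prompt, edited)))
  scored.sort(key=lambda s: s[0], reverse=True)
  return [triplet for _, triplet in scored[: int(len(scored) * keep_ratio)]]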