The Hidden Language of Diffusion Models

Authors: Hila Chefer, Oran Lang, Mor Geva, Volodymyr Polosukhin, Assaf Shocher, Michal Irani, Inbar Mosseri, Lior Wolf

Abstract: Text-to-image diffusion models have demonstrated an unparalleled ability to generate high-quality, diverse images from a textual concept (e.g., “a doctor”, “love”). However, the internal process of mapping text to a rich visual representation remains an enigma. In this work, we tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model’s vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts, such as “a president” or “a composer”, are dominated by specific instances (e.g., “Obama”, “Biden”) and their interpolations. Other concepts, such as “happiness”, combine associated terms that can be concrete (“family”, “laughter”) or abstract (“friendship”, “emotion”). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation. Our code will be available on the Conceptor project page.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper presents a method called Conceptor to decompose the textual representation of a concept in a text-to-image model into a small set of interpretable tokens.

Why:

  • Understanding how text-to-image models map text prompts to visual representations is an open challenge. The authors aim to demystify this process by interpreting the latent representations of concepts using the model’s textual space.

How:

  • Given a textual concept prompt, Conceptor learns a pseudo-token that is a sparse weighted combination of tokens from the model’s vocabulary.

  • The pseudo-token is optimized to reconstruct images generated by the concept prompt. This results in a decomposition that reveals how the concept is represented.

  • The decomposition often combines concepts related to the original one, both concrete and abstract, in non-trivial ways.

  • Conceptor enables applications like single-image decomposition, bias detection, and semantic image editing by manipulating the pseudo-token.

  • Experiments on a diverse set of concepts in Stable Diffusion reveal interesting behaviors like reliance on exemplars and mixing of multiple meanings.

In summary, the paper introduces Conceptor to decompose how concepts are represented in text-to-image models by optimizing pseudo-tokens to reconstruct concept images. This provides interpretability and enables applications for understanding, analysis and control.
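For intuition, the object at the center of the method is just a sparse weighted sum over the text encoder’s vocabulary embedding matrix. A minimal sketch of that computation (the names vocab_embeddings, coeffs, and n are illustrative, not taken from the paper’s code):

import torch

def build_pseudo_token(coeffs, vocab_embeddings, n=50):
  # coeffs: [vocab_size] learned coefficients, one per vocabulary token
  # vocab_embeddings: [vocab_size, dim] word-embedding matrix of the text encoder
  top_vals, top_idx = coeffs.topk(n)  # keep only the n strongest tokens
  # Weighted sum of the selected embeddings -> a single [dim] pseudo-token
  return top_vals @ vocab_embeddings[top_idx]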

Main Contributions

Here are the key contributions of this paper:

  • A method called Conceptor to decompose a textual concept into a small set of interpretable tokens from the model’s vocabulary.

  • Demonstration of single-image decomposition, identifying which tokens caused a specific generated image.

  • Observations about the model’s behavior such as reliance on exemplars and entanglement of multiple meanings for a concept.

  • Fine-grained concept editing by manipulating the coefficients in the decomposition, linking textual elements to visual features.

  • Detection of biases that are hard to observe visually, raising ethical concerns.

  • Analysis of the model’s internal concept representations, going beyond just examining the output images.

  • A new paradigm for peering into the inner workings of text-to-image diffusion models.

In summary, the main contributions are:

  1. The Conceptor method for concept decomposition

  2. Novel applications enabled by the decomposition

  3. Observations and analysis about the model’s behaviors

  4. Raising awareness of ethical issues regarding biases

  5. Providing interpretability for text-to-image models

Method Section

Here is a summary of the method section from the paper:

The goal is to decompose a textual concept prompt into a small set of interpretable tokens from the model’s vocabulary.

The paper focuses on latent diffusion models like Stable Diffusion. These models are conditioned on a text encoding of the prompt using a CLIP text encoder.
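For reference, this is roughly how Stable Diffusion v1 obtains its text conditioning with the Hugging Face transformers API (a sketch; the checkpoint name is the CLIP model SD v1 ships with, and the prompt is illustrative):

from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1 conditions its denoiser on CLIP ViT-L/14 text features
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer("a photo of a doctor", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
# Per-token hidden states used for cross-attention conditioning, shape [1, 77, 768]
cond = text_encoder(inputs.input_ids).last_hidden_state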

The proposed method, Conceptor, represents the concept as a pseudo-token that is a sparse linear combination of word embeddings from the vocabulary.

To learn the pseudo-token:

  • Generate a training set of images for the concept prompt

  • Assign a coefficient to each word embedding using a learned MLP

  • Compute the pseudo-token as the weighted sum of the top tokens

  • Optimize the pseudo-token to reconstruct the concept images using the standard diffusion denoising objective

  • Add a sparsity loss so that a few tokens dominate the combination

The overall loss balances reconstruction quality against interpretability, as sketched below.
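In code form, that balance might look as follows (a sketch; full_token and top_token denote the pseudo-token computed from the full vocabulary and from only its top-n tokens, and LAMBDA is the trade-off hyperparameter):

import torch.nn.functional as F

# rec_loss: diffusion denoising loss on the concept images (see the pseudo code below)
sparsity_loss = 1 - F.cosine_similarity(full_token, top_token, dim=0)
total_loss = rec_loss + LAMBDA * sparsity_loss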

At inference, the per-concept MLP assigns the coefficients, and the top-n tokens compose the pseudo-token.

Single-image decomposition iteratively prunes tokens whose removal leaves the generated image unchanged, keeping only the tokens that caused the specific image (see the sketch below).
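One plausible implementation of that pruning loop (a sketch; generate_with and image_similarity are hypothetical helpers that re-generate an image from a token subset and compare two images, e.g., via CLIP similarity):

def decompose_single_image(tokens_and_coeffs, target_image, seed, threshold=0.95):
  # Iteratively drop tokens whose removal barely changes the generated image
  kept = list(tokens_and_coeffs)
  for entry in list(kept):
    candidate = [tc for tc in kept if tc is not entry]
    image = generate_with(candidate, seed=seed)  # same seed as the target image
    if image_similarity(image, target_image) >= threshold:
      kept = candidate  # this token was not needed for the image
  return kept  # the tokens that caused this specific generation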

In summary, Conceptor learns a pseudo-token optimized to reconstruct concept images, resulting in a decomposition that reveals how the concept is represented. Sparsity constraints provide interpretability.

High-Level Pseudo Code

Here is the high-level pseudo code for the Conceptor method:

# Generate a training set of images for the concept
images = generate_images(concept_prompt)

# Get the vocabulary embedding matrix
vocab_embeddings = get_vocab_embeddings(model)

# Learn a per-concept MLP that scores every vocabulary token
mlp = learn_mlp(vocab_embeddings, images)

# Compute a coefficient for each vocabulary token
coeffs = mlp(vocab_embeddings)

# Keep the top-n tokens by coefficient
top_n = get_top_n(coeffs, n)

# Construct the pseudo-token as the weighted sum of the top tokens
pseudo_token = 0
for token, coeff in top_n:
  pseudo_token += coeff * token

# Optimize the pseudo-token to reconstruct the concept images
for i in range(num_steps):
  noises, steps = sample_noise(images)

  losses = []
  for image, noise, step in zip(images, noises, steps):
    # Standard diffusion objective: predict the noise added at this step,
    # conditioned on the pseudo-token
    noised = noise_image(image, noise, step)
    pred_noise = denoise(noised, pseudo_token, step)
    losses.append(MSE(noise, pred_noise))

  # Sparsity: the combination over the full vocabulary should be
  # well-approximated by its top-n truncation
  full_token = weighted_sum(coeffs, vocab_embeddings)
  sparsity_loss = 1 - cosine_sim(full_token, pseudo_token)

  total_loss = sum(losses) / len(losses) + lambda_ * sparsity_loss
  optimize(pseudo_token, total_loss)

# Return the optimized pseudo-token
return pseudo_token

In summary, it generates concept images, learns an MLP for coefficients, constructs a pseudo-token from the top tokens, and optimizes it to reconstruct concept images while encouraging sparsity.

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the Conceptor method:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
NUM_TRAIN = 100    # training images generated per concept
NUM_TOKENS = 50    # tokens kept in the final decomposition
LAMBDA = 0.001     # weight of the sparsity loss

# Generate the training set of concept images
# (generate_image, get_vocab_embeddings, sample_noise, noise_image and
# denoise are assumed helpers around the diffusion model)
train_images = [generate_image(concept_prompt, seed=i) for i in range(NUM_TRAIN)]

# Vocabulary embedding matrix, shape [vocab_size, embedding_dim]
vocab_embeddings = get_vocab_embeddings(model)

# Per-concept MLP that maps each word embedding to a scalar coefficient
mlp = nn.Sequential(
  nn.Linear(embedding_dim, hidden_dim),
  nn.ReLU(),
  nn.Linear(hidden_dim, 1),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)

for epoch in range(num_epochs):
  for image in train_images:
    noise, step = sample_noise(image)
    noised = noise_image(image, noise, step)

    # Coefficients over the full vocabulary, shape [vocab_size]
    coeffs = mlp(vocab_embeddings).squeeze(-1)

    # Pseudo-token from the full vocabulary, and from only its top tokens
    full_token = coeffs @ vocab_embeddings
    top_vals, top_idx = coeffs.topk(NUM_TOKENS)
    top_token = top_vals @ vocab_embeddings[top_idx]

    # Reconstruction: the standard diffusion objective (noise prediction),
    # conditioned on the pseudo-token
    pred_noise = denoise(noised, full_token, step)
    rec_loss = F.mse_loss(pred_noise, noise)

    # Sparsity: the full combination should be dominated by its top tokens
    sparsity_loss = 1 - F.cosine_similarity(full_token, top_token, dim=0)

    total_loss = rec_loss + LAMBDA * sparsity_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

# At inference, the top-NUM_TOKENS tokens and coefficients form the decomposition
coeffs = mlp(vocab_embeddings).squeeze(-1)
top_vals, top_idx = coeffs.topk(NUM_TOKENS)
pseudo_token = top_vals @ vocab_embeddings[top_idx]
return pseudo_token

This version adds implementation details such as hyperparameter values, the MLP architecture, and the optimization loop. The overall flow remains the same: learn an MLP that assigns a coefficient to every vocabulary token, construct the pseudo-token from the top tokens, and optimize for reconstruction quality and sparsity.
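Finally, the decomposition itself can be read off by mapping the top coefficient indices back to vocabulary strings (a sketch, assuming the CLIP tokenizer and the trained mlp from above):

coeffs = mlp(vocab_embeddings).squeeze(-1)
top_vals, top_idx = coeffs.topk(NUM_TOKENS)

# Print the decomposition, strongest tokens first (topk returns sorted values)
for coeff, idx in zip(top_vals.tolist(), top_idx.tolist()):
  print(f"{tokenizer.convert_ids_to_tokens(idx)}: {coeff:.3f}")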