Intriguing Properties of Text-guided Diffusion Models

Authors: Qihao Liu, Adam Kortylewski, Yutong Bai, Song Bai, Alan Yuille

Abstract: Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly. Common failures include: (i) natural-looking text prompts generating images with the wrong content, or (ii) different random samples of the latent variables generating vastly different, even unrelated, outputs despite the same text prompt. In this work, we aim to study and understand the failure modes of TDMs in more detail. To achieve this, we propose SAGE, an adversarial attack on TDMs that uses image classifiers as surrogate loss functions, searching over the discrete prompt space and the high-dimensional latent space of TDMs to automatically discover unexpected behaviors and failure cases in image generation. We make several technical contributions to ensure that SAGE finds failure cases of the diffusion model rather than the classifier, and verify this in a human study. Our study reveals four intriguing properties of TDMs that have not been systematically studied before: (1) We find a variety of natural text prompts producing images that fail to capture the semantics of the input texts. We categorize these failures into ten distinct types based on the underlying causes. (2) We find samples in the latent space (which are not outliers) that lead to distorted images independent of the text prompt, suggesting that parts of the latent space are not well-structured. (3) We also find latent samples that lead to natural-looking images unrelated to the text prompt, implying a potential misalignment between the latent and prompt spaces. (4) By appending a single adversarial token embedding to an input prompt, we can generate a variety of specified target objects while only minimally affecting the CLIP score. This demonstrates the fragility of language representations and raises potential safety concerns. Project page: sage-diffusion.github.io

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper proposes SAGE, a method to automatically discover failure cases and unexpected behaviors in text-guided diffusion models.

  • SAGE formulates an adversarial optimization process with a robust surrogate loss to search over the discrete prompt space and high-dimensional latent space.

  • Experiments on state-of-the-art diffusion models reveal four intriguing failure types:

    1. Natural text prompts that models do not fully comprehend
    2. Non-outlier latent samples leading to distorted images
    3. Latent samples showing unrelated backgrounds instead of the prompted content
    4. Universal adversarial token embeddings that overwrite prompts

Why:

  • Understanding failure modes of generative models like diffusion models is crucial for responsible AI development.

  • Prior work relies on manual analysis or restricted search methods, which are inefficient and limited in scope.

  • An automated framework like SAGE enables systematically discovering a broader range of weaknesses.

How:

  • SAGE uses an ensemble of classifiers as a robust surrogate loss to guide optimization.

  • For discrete prompt search, SAGE optimizes a continuous embedding first, then uses gradients to guide discrete search.

  • For latent space search, SAGE approximates gradients through the iterative denoising process via a residual connection, since backpropagating through every diffusion step is impractical.

  • Experiments show SAGE effectively finds various failure cases and reveals limitations of current diffusion models.

In summary, this paper presents SAGE, an automated framework to systematically discover unintended behaviors and failure modes in text-guided diffusion models by adversarially optimizing over the input spaces. The findings provide valuable insights into current model capabilities and aim to promote responsible AI development.

Main Contributions

The main contributions of this paper are:

  1. Proposing SAGE, a novel framework to automatically search for failure cases and unexpected behaviors in text-guided diffusion models. This is the first method that can effectively search both the discrete prompt space and high-dimensional latent space.

  2. Formulating an adversarial optimization approach with a robust surrogate loss function to enable automated discovery of model weaknesses. Technical contributions include using an ensemble of classifiers for the loss, approximating gradients for the latent space, and a gradient-guided discrete search method.

  3. Revealing four intriguing properties of state-of-the-art diffusion models through experiments with SAGE:

    • Natural text prompts that models do not fully comprehend
    • Non-outlier latent samples leading to distorted images
    • Latent samples depicting unrelated backgrounds
    • Universal adversarial token embeddings that overwrite prompts

  4. Providing a valuable testbed of failure cases that highlight current limitations in understanding language prompts, aligning text and latent spaces, and robustness of language representations.

  5. Demonstrating the capability of SAGE to automatically and efficiently discover a diverse range of failure modes compared to prior manual or restricted analysis.

  6. Shedding light on weaknesses of diffusion models to promote responsible AI development and safer generative models.

In summary, the key contributions are proposing the SAGE framework, revealing intriguing failure properties through it, and providing insights to advance text-guided generative models. The automated discovery of unintended behaviors is an important step towards trustworthy AI systems.

Method Section

Here is a summary of the method section from the paper:

The key goal is to automatically find failure cases of text-guided diffusion models by searching over the input prompt space and latent space.

To guide the search, the paper proposes using an ensemble of image classifiers as a robust surrogate loss function. This helps identify when the generated image does not match the prompt.

For searching the discrete prompt space, they first optimize a continuous token embedding using the surrogate loss. The embedding gradients are then used to guide a discrete search to find adversarial prompt tokens.
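
As a minimal sketch of this discrete projection step (the names `candidate_ids` and `vocab_embeddings` are assumptions for illustration, not the paper's code): once the continuous embedding has been optimized, it is mapped back to the nearest candidate token.

import torch

def project_to_token(emb_x, candidate_ids, vocab_embeddings):
  # candidate_ids: LongTensor of K candidate token ids
  # vocab_embeddings: [vocab_size, emb_dim] token-embedding table
  cand_embs = vocab_embeddings[candidate_ids]         # [K, emb_dim]
  dists = torch.cdist(emb_x.unsqueeze(0), cand_embs)  # [1, K] L2 distances
  return candidate_ids[dists.argmin()]                # closest candidate token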

To enable latent space search, they approximate gradients by adding a residual connection from the input latent code to an intermediate diffusion step. This provides a path for gradients to propagate back through the iterative diffusion process.
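
One plausible reading of this shortcut, as a hedged sketch (the `denoise` callable and the blending weight `w` are assumptions): the input latent is mixed back into an intermediate denoising state, giving gradients a short path to the input.

def forward_with_shortcut(z, denoise, w=0.5):
  # Run several denoising steps (treated as a black box here), then
  # blend the input latent back in; the residual term carries gradients
  # to z without differentiating through every diffusion step.
  z_mid = denoise(z)
  return (1 - w) * z_mid + w * z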

The prompt search starts with a template like “A photo of a [class]” and uses a language model to generate candidate completions. The search optimizes an adversarial token [x] and future token [y] to find a text sequence that fools the model.

For latent search, the method minimally perturbs a given code to find samples that lead to failure cases. Constraints are added to keep the latent code valid.
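
As an illustration of such a constraint (this exact scheme is an assumption, not necessarily the paper's): the perturbation can be bounded element-wise and the result rescaled onto the typical shell of a standard Gaussian, whose radius is roughly sqrt(dim) in high dimensions.

import torch

def constrain_latent(latent, perturb, bound=1.0):
  # Bound the perturbation element-wise ...
  perturb = perturb.clamp(-bound, bound)
  z = latent + perturb
  # ... and rescale so z keeps a norm typical of N(0, I)
  return z * (z.numel() ** 0.5) / z.norm()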

The discriminator ensemble uses classifiers robust to style changes plus a Canny edge detector. This makes the surrogate loss more reliable for finding true model failures.
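
A hedged sketch of the edge-based ensemble member (the exact preprocessing is an assumption; only the use of Canny edges is from the paper): a classifier run on the edge map responds to object shape rather than style or texture, making the ensemble harder to fool with stylistic artifacts.

import cv2
import numpy as np

def edge_input(image_bgr):
  # Extract Canny edges from a grayscale version of the image and
  # replicate them to 3 channels as classifier input
  gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
  edges = cv2.Canny(gray, threshold1=100, threshold2=200)
  return np.stack([edges] * 3, axis=-1)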

In summary, the key ideas are using a robust loss for adversarial search over prompts and latents, approximating latent gradients, and guiding discrete prompt search with continuous embeddings. This enables automated discovery of diffusion model failure cases.

High-Level Pseudo Code

Here is the high-level pseudo code for the key components of the method proposed in this paper:

# SAGE Framework
 
# Robust surrogate loss function
def get_robust_loss(image):
  # Ensemble of classifiers + edge detector
  loss = ensemble_model_predictions(image) 
  return loss
 
# Search over prompt space 
def search_prompt_space(generator, text_generator):

  # Initialize prompt template
  prompt = "A photo of a [class]"

  # Generate candidate completions with a language model
  completions = text_generator.generate(prompt)

  # Initialize adversarial token embedding [x] and future-token embedding [y]
  emb_x, emb_y = init_adv_embeddings()

  # Adversarial search loop over the continuous embeddings
  for i in range(max_iters):

    # Compute surrogate loss on the generated image and backpropagate
    loss = get_robust_loss(generator(prompt, emb_x, emb_y))
    loss.backward()

    # Descend to minimize the surrogate loss (push the image away
    # from the prompted class)
    emb_x -= step_size * emb_x.grad.sign()
    emb_y -= step_size * emb_y.grad.sign()

  # Project the continuous embedding onto the closest candidate word
  adv_word = get_closest_candidate(emb_x, completions)

  # Update prompt with the discovered adversarial word
  prompt += " " + adv_word

  return prompt
 
# Search over latent space
def search_latent_space(latent, generator):

  # Initialize perturbation
  perturb = zeros_like(latent)

  # Adversarial search loop
  for i in range(max_iters):

    # Blend the perturbation in through the residual connection;
    # `weight` sets the strength of the shortcut path
    output = generator(latent + (1 - weight) * perturb)

    # Compute surrogate loss and backpropagate
    loss = get_robust_loss(output)
    loss.backward()

    # Descend to minimize the surrogate loss
    perturb -= step_size * perturb.grad.sign()

  return latent + perturb
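
A hypothetical driver showing the intended call pattern (the loader helpers and the latent sampler below are assumptions for illustration, not part of the paper's code):

# Assumed helpers, for illustration only
generator = load_diffusion_generator("stable-diffusion")
text_generator = load_language_model("gpt-2")

adv_prompt = search_prompt_space(generator, text_generator)
adv_latent = search_latent_space(sample_gaussian_latent(), generator)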

The key steps are using the robust surrogate loss to guide adversarial search over both prompt and latent spaces, approximating gradients through residual connections, and leveraging language models to constrain prompt search.

Detailed Pseudo Code

Here is a more detailed pseudo code outline to implement the key components of the method proposed in this paper:

# Imports and setup
import torch
import diffusion_model, text_generator, classifiers

# Hyperparameters
MAX_ITERS = 100
EMB_RANGE = [-2.5, 2.5]    # valid range for token-embedding values
PERTURB_RANGE = [-1, 1]    # valid range for latent perturbations
LR = 0.05
EMB_SIZE = 768             # text-encoder embedding dim (model-dependent)
 
# Robust surrogate loss
def get_robust_loss(image):

  # Get predictions from the classifier ensemble
  preds = []
  for model in classifier_ensemble:
    preds.append(model.predict(image))

  # Add the edge-based model's prediction (shape rather than style)
  preds.append(edge_model.predict(image))

  # Each term equals 2*p - 1; minimizing the loss drives the ensemble's
  # confidence p in the target class towards 0, i.e. a failure case
  loss = 0
  for p in preds:
    loss += 2 * p[target_class] - 1

  return loss
 
# Search prompt space
def search_prompt(prompt, generator):

  completions = text_generator.generate(prompt)

  # Initialize adversarial embeddings (optimized, so they need gradients)
  emb_x = torch.zeros(EMB_SIZE, requires_grad=True)
  emb_y = torch.zeros(EMB_SIZE, requires_grad=True)

  for i in range(MAX_ITERS):

    # Forward pass
    image = generator(prompt, emb_x, emb_y)
    loss = get_robust_loss(image)

    # Backward pass
    loss.backward()

    # Update embeddings (descend to minimize the surrogate loss)
    # and project back into the valid range, in place
    with torch.no_grad():
      emb_x -= LR * emb_x.grad.sign()
      emb_y -= LR * emb_y.grad.sign()
      emb_x.clamp_(*EMB_RANGE)
      emb_y.clamp_(*EMB_RANGE)

    emb_x.grad.zero_()
    emb_y.grad.zero_()

  # Map the continuous embedding to the closest candidate word
  new_word = find_closest_word(emb_x, completions)

  # Update prompt
  prompt += " " + new_word

  return prompt
 
# Search latent space
def search_latent(latent, generator):

  # Initialize perturbation (optimized, so it needs gradients)
  perturb = torch.zeros_like(latent, requires_grad=True)

  for i in range(MAX_ITERS):

    # Forward pass through the residual connection; `weight` sets the
    # strength of the shortcut used for gradient approximation
    image = generator(latent + (1 - weight) * perturb)
    loss = get_robust_loss(image)

    # Backward pass
    loss.backward()

    # Update perturbation (descend to minimize the surrogate loss)
    # and project back into the valid range, in place
    with torch.no_grad():
      perturb -= LR * perturb.grad.sign()
      perturb.clamp_(*PERTURB_RANGE)

    perturb.grad.zero_()

  # Return the final perturbed latent code
  return latent + perturb
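
A small sanity check one might run after the search (illustrative only; `sample_latent` is an assumed helper): the surrogate loss should be lower for the discovered latent than for the starting point.

z = sample_latent()                        # assumed: z ~ N(0, I)
adv_z = search_latent(z, generator)
loss_before = get_robust_loss(generator(z))
loss_after = get_robust_loss(generator(adv_z))
print(loss_before, loss_after)             # expect loss_after < loss_before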

This outline adds implementation details such as the gradient calculations, parameter updates, projection steps, and use of the text generator. The key ideas remain leveraging the robust surrogate loss for optimization and approximating gradients for the latent space.