Circumventing Concept Erasure Methods For Text-to-Image Generative Models

Authors: Minh Pham, Kelly O. Marshall, Chinmay Hegde

Abstract: Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to “erase” sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve “erased” concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.

What, Why and How

Here is a summary of the key points from the paper:

What: This paper investigates five recently proposed concept erasure methods for text-to-image generative models like Stable Diffusion. The methods examined are Erased Stable Diffusion, Selective Amnesia, Forget-Me-Not, Negative Prompt, and Safe Latent Diffusion. The authors show that these methods do not fully erase concepts as claimed.

Why: Concept erasure methods have been proposed to prevent text-to-image models from generating offensive, sensitive, or copyrighted content. However, the authors argue that these methods provide a false sense of security, since the erased concepts can still be generated. They aim to scrutinize existing concept erasure techniques and highlight their limitations.

How: The authors design a Concept Inversion technique tailored to each method that finds special word embeddings capable of retrieving the erased concepts from the models. This is done by optimizing an objective that drives the erased model's noise prediction, conditioned on the learned embedding, toward the true noise added during the diffusion process. Experiments across artistic styles, objects, identities, and NSFW concepts show that the erased models can still generate the targeted concepts when prompted with the discovered embeddings.
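Concretely, assuming the attack follows the standard textual-inversion recipe applied to the sanitized model (the notation below is ours, not the paper's), the learned embedding v* roughly solves:

v^{*} = \arg\min_{v} \; \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t} \Big[ \big\lVert \epsilon - \epsilon_{\theta'}\big(z_t,\, t,\, c_\theta(y, v)\big) \big\rVert_2^2 \Big]

where \epsilon_{\theta'} is the noise predictor of the erased model, z_t is the noised image latent at timestep t, and c_\theta(y, v) is the text conditioning for a prompt y whose placeholder token is mapped to the embedding v.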

In summary, the paper demonstrates the brittleness of current post-hoc concept erasure methods for text-to-image models using targeted Concept Inversion attacks. The authors call for more rigorous evaluation methods and new approaches to build safer generative AI.

Main Contributions

The main contributions of this paper are:

  • Demonstrating that five recently proposed concept erasure methods for text-to-image models - Erased Stable Diffusion, Selective Amnesia, Forget-Me-Not, Negative Prompt, and Safe Latent Diffusion - do not fully erase concepts as claimed.

  • Designing a Concept Inversion technique tailored to each method that can find special word embeddings to retrieve erased concepts from the models.

  • Showing through experiments that concept-erased models can still generate sensitive, offensive, or copyrighted concepts using the discovered embeddings; the concepts tested include artistic styles, objects, identities, and NSFW content.

  • Calling into question the premise that existing erasure methods permanently remove concepts from model weights.

  • Highlighting the need for more rigorous evaluation methodologies for concept erasure techniques.

  • Raising awareness about the challenges and brittleness of post-hoc concept erasure in generative AI models.

  • Suggesting that new approaches may be needed to build safer and more robust generative models.

In summary, the key contributions are critically analyzing limitations of existing concept erasure methods, designing targeted attacks to break them, and calling for more rigorous evaluations and new techniques to address this important problem.

Method Section

Here is a summary of the method section from the paper:

The authors’ core hypothesis is that existing concept erasure methods may be performing a form of input filtering, where they only avoid generating erased concepts for a limited set of text prompts.

To test this, they design a Concept Inversion (CI) attack tailored to each of the five methods. The goal is to discover special word embeddings that can retrieve the “erased” concepts from the sanitized models.

For each method, they:

  • Provide details on how the concept erasure technique works.

  • Outline their CI attack methodology for finding embeddings that recover the erased concepts. This involves optimizing an objective that drives the erased model's noise prediction, conditioned on the learned embedding, toward the true noise added during the diffusion process.

  • Describe their experimental setup including model weights used, hyperparameters like batch size and learning rate, and number of optimization steps.

  • Show experimental results demonstrating that their CI attack can regenerate the erased concepts across use cases such as artistic styles, identities, objects, and NSFW content.

The CI techniques are relatively straightforward, requiring only the ability to backpropagate through the (frozen) model weights. The only moderately challenging case is Safe Latent Diffusion, which needs more memory because its guidance relies on noise estimates from previous denoising steps.
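For these inference-time defenses, the guided noise estimate itself has to appear inside the CI loss so that the embedding is optimized against the full guidance computation rather than the bare denoiser. The sketch below illustrates the idea for a Negative-Prompt-style defense; the interfaces and the guidance scale are illustrative assumptions, and Safe Latent Diffusion would additionally carry its safety-guidance state across timesteps, which is where the extra memory cost comes from.

# Illustrative sketch: CI loss against a Negative-Prompt-style defense.
# Model interfaces and the guidance scale are assumptions, not the paper's code.
def guided_noise_pred(erased_model, noisy_latents, t, cond_emb, neg_emb, w=7.5):
  # Negative-prompt guidance: the negative branch (the "erased" concept's
  # description) replaces the unconditional branch of classifier-free guidance.
  eps_neg = erased_model.denoiser(noisy_latents, t, neg_emb)
  eps_cond = erased_model.denoiser(noisy_latents, t, cond_emb)
  return eps_neg + w * (eps_cond - eps_neg)

# Inside the CI loop, the loss is taken on the guided prediction, so gradients
# flow through both branches back to the placeholder embedding:
#   noise_pred = guided_noise_pred(erased_model, noisy_latents, t,
#                                  cond_emb_with_placeholder, neg_emb)
#   loss = MSE(noise_pred, true_noise)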

In summary, the tailored CI attacks are efficient, requiring just a small dataset of examples per concept. This shows the brittleness of concept erasure methods, as the concepts still persist in the weights and can be recovered.

High-Level Pseudo Code

Here is the high-level pseudo code for the concept inversion attack method described in the paper:

# Concept Inversion Attack

# Inputs:
# original_model - Pre-trained text-to-image model (its VAE encoder is shared with the erased model)
# erased_model   - Model after concept erasure (weights stay frozen)
# target_concept - Concept we are attempting to recover
# small_dataset  - Small dataset of example images containing the target concept

# Hyperparameters:
num_steps = 1000
lr = 0.005
batch_size = 4

# Create placeholder token
placeholder_token = "*target_concept*"

# Initialize placeholder embedding (the only trainable parameter)
embedding = torch.randn(embedding_dim, requires_grad=True)  # embedding_dim matches the text encoder width
opt = torch.optim.Adam([embedding], lr=lr)

for i in range(num_steps):

  # Sample a batch, cycling through the small dataset
  start = (i * batch_size) % len(small_dataset)
  images = small_dataset[start : start + batch_size]

  # Encode images into latents (the VAE encoder is unchanged by erasure)
  latents = original_model.encoder(images)

  # Diffusion forward process: sample a timestep and add Gaussian noise
  t = sample_timestep()
  true_noise = randn_like(latents)
  noisy_latents = add_noise(latents, true_noise, t)

  # Predict the noise with the erased model; the placeholder token in the
  # prompt is mapped to the trainable embedding
  prompt = "Example prompt with " + placeholder_token
  noise_pred = erased_model.denoiser(noisy_latents, t, prompt, embedding)

  # Textual-inversion loss: match the prediction to the true added noise
  loss = MSE(noise_pred, true_noise)

  # Update only the placeholder embedding
  loss.backward()
  opt.step()
  opt.zero_grad()

# Return learned embedding
return embedding

This captures the overall idea: the placeholder embedding is optimized so that the erased model's noise prediction, conditioned on a prompt containing the placeholder token, matches the true noise added in the forward diffusion process. The learned embedding can then be used with the erased model to generate the supposedly erased concept.
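As a concrete illustration of that last step, the sketch below registers the learned embedding as a new token in the text encoder and samples from the erased model with it. It assumes the erased model is packaged as a Hugging Face diffusers checkpoint at a hypothetical path, and that learned_embedding is the tensor produced by the loop above; the paper does not prescribe this particular interface.

from diffusers import StableDiffusionPipeline

# Hypothetical path to the concept-erased checkpoint (illustrative only)
pipe = StableDiffusionPipeline.from_pretrained("path/to/erased-model")

# Register the placeholder token and splice in the learned embedding
placeholder_token = "<erased-concept>"
pipe.tokenizer.add_tokens(placeholder_token)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids(placeholder_token)
pipe.text_encoder.get_input_embeddings().weight.data[token_id] = learned_embedding

# Prompting with the placeholder token now regenerates the "erased" concept
image = pipe("a painting in the style of " + placeholder_token).images[0]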

Detailed Pseudo Code

Here is more detailed pseudo code implementing the concept inversion attack described in the paper:

import torch
from torch import nn, optim

# Concept Inversion Attack

# Inputs
original_model = StableDiffusion()      # Original pre-trained model (frozen; supplies the shared VAE encoder)
erased_model = ErasedStableDiffusion()  # Fine-tuned erased model (frozen)
target_concept = "van gogh"
small_dataset = load_small_dataset(target_concept)  # ~25 images in the target style

# Hyperparameters
num_steps = 1000
lr = 0.005
batch_size = 4
embedding_dim = 768  # Matches the text encoder dimension of Stable Diffusion 1.x

# Create placeholder token
placeholder_token = "*" + target_concept + "*"

# Initialize placeholder embedding (the only trainable parameter)
embedding = nn.Parameter(torch.randn(embedding_dim))

# Initialize optimizer over the embedding alone
opt = optim.Adam([embedding], lr=lr)

for i in range(num_steps):

  # Sample a batch, cycling through the small dataset
  start = (i * batch_size) % len(small_dataset)
  images = small_dataset[start : start + batch_size]

  # Encode images into latents; no gradients are needed through the encoder
  with torch.no_grad():
    latents = original_model.encoder(images)

  # Diffusion forward process: sample timesteps and add Gaussian noise
  # (scheduler utilities are assumed to live on the model object)
  t = torch.randint(0, original_model.num_train_timesteps, (len(images),))
  true_noise = torch.randn_like(latents)
  noisy_latents = original_model.add_noise(latents, true_noise, t)

  # Create prompt with the placeholder token (mapped to `embedding`)
  prompt = "a painting in the style of " + placeholder_token

  # Predict the noise with the erased model
  noise_pred = erased_model.denoiser(noisy_latents, t, prompt, embedding)

  # Textual-inversion loss: match the prediction to the true added noise
  loss = nn.MSELoss()(noise_pred, true_noise)

  # Update only the placeholder embedding
  loss.backward()
  opt.step()
  opt.zero_grad()

# `embedding` now reconstructs the erased concept when the placeholder
# token is used in a prompt to the erased model

This implements the core concept inversion optimization loop for a specific example of artistic style erasure. The same overall idea applies to other erased concepts such as NSFW content, objects, and identities by swapping in the appropriate erased model, example images, and prompt template, as sketched below.
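For example, adapting the loop to an erased identity would only swap the checkpoint, the example images, and the prompt template; the values below are illustrative placeholders, not taken from the paper.

# Illustrative adaptation for an erased identity (placeholder values)
erased_model = ErasedStableDiffusion()              # checkpoint with the identity erased
target_concept = "some celebrity"                   # hypothetical erased identity
small_dataset = load_small_dataset(target_concept)  # ~25 photos of that identity
prompt = "a photo of " + placeholder_token          # identity/object-style prompt template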