Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
Authors: Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang
Abstract: State-of-the-art Text-to-Image models like Stable Diffusion and DALLE2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that these models can generate a substantial percentage of unsafe images; across four models and four prompt datasets, 14.56% of all generated images are unsafe. When comparing the four models, we find different risk levels, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all generated images are unsafe). Given Stable Diffusion’s tendency to generate more unsafe content, we evaluate its potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. We employ three image editing methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by Stable Diffusion. Our evaluation result shows that 24% of the generated images using DreamBooth are hateful meme variants that present the features of the original hateful meme and the target individual/community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and encourage better safeguard tools to be developed to prevent unsafe generation.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper evaluates the ability of text-to-image models like Stable Diffusion to generate unsafe and hateful content. The unsafe content is categorized into sexually explicit, violent, disturbing, hateful and political images. The study focuses specifically on the generation of hateful meme variants that target certain individuals or communities.
Why:
- There are growing concerns about how text-to-image models can be exploited to generate unsafe and harmful content at scale. It is important to assess these risks and understand if adversaries can deliberately create unsafe memes using AI.
How:
- The authors train a multi-headed safety classifier to detect different types of unsafe images and use it to test four text-to-image models on both harmful and harmless prompt datasets.
- Across the four models and four prompt datasets, 14.56% of the generated images are unsafe, with Stable Diffusion being the most prone to generating unsafe content.
- The authors show that an adversary can generate hateful meme variants by fine-tuning Stable Diffusion with techniques like DreamBooth; 24% of the DreamBooth-generated images are successful variants, with quality comparable to real-world hateful memes.
- Adding a language model like ChatGPT in the loop further increases the diversity and quality of the generated hateful memes (a sketch of this rephrasing step appears at the end of this section).
- The paper discusses implications and mitigation strategies such as training data curation, prompt regulation, and safety classifiers.
In summary, this paper demonstrates that text-to-image models, if exploited by adversaries, can generate unsafe and hateful content at scale. The generation of hateful memes raises particular concern because of their potential for wide dissemination. The authors highlight the need for better safeguards to prevent misuse.
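For the ChatGPT-in-the-loop step mentioned above, the idea is simply to have a chat model rewrite each prompt before image generation. Below is a minimal sketch assuming the openai Python SDK (v1+); the model name and instruction wording are illustrative placeholders, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_prompt(prompt):
    """Ask a chat model to rewrite an image-generation prompt (illustrative wording)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Rephrase the following image-generation prompt while keeping its meaning."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

rephrasings = [rephrase_prompt("a cartoon frog reading a newspaper") for _ in range(3)]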
Main Contributions
Here are the main contributions of this paper:
- The paper provides a systematic safety assessment of multiple popular text-to-image models using prompts from diverse sources, evaluating each model's tendency to generate unsafe content across categories such as sexually explicit, violent, and disturbing imagery.
- The paper investigates the presence of unsafe images in the models' training datasets, in an attempt to trace the source of the generated unsafe content.
- The paper evaluates the potential of text-to-image models to generate hateful memes, which can have severe negative impacts. It systematically generates hateful meme variants using techniques like DreamBooth and compares them to real-world memes.
- The generated meme variants are evaluated using quantitative metrics, such as image fidelity and text alignment, as well as manual inspection. The results demonstrate the ease of generating hateful memes automatically.
- The paper discusses the implications of the findings and proposes mitigation strategies such as training data curation, prompt regulation, and post-generation safety classifiers.
In summary, the key contributions are:
- Assessing safety of text-to-image models across different categories of unsafe content
- Evaluating presence of unsafe images in training data
- Demonstrating risks in automatic generation of hateful memes
- Proposing mitigating strategies against unsafe and hateful content generation
The paper highlights concerns regarding the misuse of text-to-image models and the need for better safeguards.
Method Section
Here is a summary of the method section from the paper:
To assess the safety of text-to-image models (RQ1), the authors:
- Collected prompts likely to generate unsafe images from 4chan and the Lexica website
- Constructed a template-based prompt dataset with phrases designed to elicit unsafe images
- Used COCO captions as a harmless prompt baseline
- Generated images from four models (Stable Diffusion, Latent Diffusion, DALL-E 2, and DALL-E mini) using the prompt datasets
- Built a multi-headed safety classifier on top of CLIP to detect five categories of unsafe images: sexually explicit, violent, disturbing, hateful, and political (a sketch follows this list)
- Evaluated the safety of the images generated by the four models using this classifier
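For concreteness, here is a minimal sketch of what such a multi-headed classifier could look like: a frozen CLIP image encoder with one small MLP head per unsafe category, trained on the annotated images. The checkpoint name, embedding size, and the class MultiHeadSafetyClassifier are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

CATEGORIES = ["sexually_explicit", "violent", "disturbing", "hateful", "political"]

class MultiHeadSafetyClassifier(nn.Module):
    """One small MLP head per unsafe category, on top of frozen CLIP image embeddings."""
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            c: nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                             nn.Linear(hidden_dim, 1))
            for c in CATEGORIES
        })

    def forward(self, embeddings):
        # One unsafe-probability per category for each image embedding.
        return {c: torch.sigmoid(head(embeddings)).squeeze(-1)
                for c, head in self.heads.items()}

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(pil_images):
    """Frozen CLIP embeddings; the heads are trained on these with per-category BCE loss."""
    inputs = processor(images=pil_images, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)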
To evaluate hateful meme generation (RQ2), the authors:
- Selected two notorious hateful memes: Happy Merchant and Pepe the Frog
- Collected real-world hateful variants of the two memes along with the individuals/communities they target
- Designed prompts describing how each target is depicted in the real variants, using BLIP-generated captions
- Generated meme variants with Stable Diffusion using three image editing methods: DreamBooth, Textual Inversion, and SDEdit (a sketch of the SDEdit-style variant follows this list)
- Assessed image fidelity and text alignment, and conducted a manual evaluation comparing real and generated variants
- Evaluated the impact of using ChatGPT to rephrase prompts before generating variants
- Discussed the potential real-world impact of hateful meme generation
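Of the three editing methods, SDEdit is essentially Stable Diffusion's image-to-image mode: the source image is partially noised and then denoised under a new prompt. A minimal sketch using the diffusers StableDiffusionImg2ImgPipeline follows; the model ID, prompt, and strength value are illustrative placeholders rather than the paper's exact settings.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# SDEdit-style editing: noise the source image, then denoise it guided by a prompt.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("original_meme.png").convert("RGB").resize((512, 512))
prompt = "a cartoon frog dressed as a news anchor"  # benign placeholder prompt

# strength controls how much of the original image is kept:
# lower values stay closer to the source image, higher values follow the prompt more.
variants = pipe(prompt=prompt, image=init_image, strength=0.6,
                guidance_scale=7.5, num_images_per_prompt=4).images
for i, img in enumerate(variants):
    img.save(f"variant_{i}.png")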
In summary, the authors carefully designed prompts, generated images using multiple models, built a safety classifier and evaluated quantitative metrics to demonstrate the risks of generating unsafe and hateful content.
High-Level Pseudo Code
Here is the high-level pseudo code for the key parts of the paper:
# Build safety classifier
images, labels = load_annotated_images()
classifier = train_safety_classifier(images, labels)

# Evaluate model safety
prompt_datasets = load_prompt_datasets()
for model in [StableDiffusion, LatentDiffusion, DALLE2, DALLEMini]:
    images = generate_images(model, prompt_datasets)  # see the diffusers sketch at the end of this section
    predictions = classifier.predict(images)
    print(unsafe_proportion(predictions))             # fraction of generated images flagged unsafe

# Generate hateful meme variants
real_variants, entities = load_real_meme_variants()
prompts = design_prompts(real_variants, entities)
for meme in [HappyMerchant, PepeTheFrog]:
    for method in [DreamBooth, TextualInversion, SDEdit]:
        generated_variants = generate_variants(meme, prompts, method)
        fidelity = calculate_fidelity(generated_variants, meme)
        alignment = calculate_alignment(generated_variants, prompts)
        manual_eval = manual_inspection(generated_variants)
        print(fidelity, alignment, manual_eval)
The key steps are:
- Build multi-headed safety classifier
- Generate images using different models and prompt datasets
- Evaluate model safety using classifier
- Load real hateful meme variants and design prompts
- Generate meme variants using different techniques
- Evaluate variants using fidelity, alignment and manual inspection
This covers training the safety classifier, evaluating model risks, and generating and assessing hateful meme variants.
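To make the generate_images step above concrete for the Stable Diffusion case, here is a minimal sketch using the diffusers library; the model ID and sampling settings are illustrative assumptions, and the DALL-E models are reached through their own APIs and are not shown.
import torch
from diffusers import StableDiffusionPipeline

# Text-to-image generation with Stable Diffusion.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Note: the stock pipeline applies a post-hoc safety checker to its outputs by default,
# which affects what a safety measurement like this one actually observes.

def generate_images(prompts, n=3):
    """Generate n images per prompt, mirroring the paper's three-images-per-prompt setup."""
    outputs = []
    for prompt in prompts:
        outputs.extend(pipe(prompt, num_images_per_prompt=n, guidance_scale=7.5).images)
    return outputs

images = generate_images(["a photo of a cat sitting on a laptop"])  # harmless COCO-style prompt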
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the key experiments in the paper:
# Build multi-headed safety classifier
images, labels = load_labeled_images()
# images: manually annotated image dataset
# labels: category labels - sexually_explicit, violent, disturbing, hateful, political

clip = CLIP.load_pretrained()        # frozen CLIP image/text encoder
classifier = MultiHeadMLP()          # one MLP head per unsafe category
classifier.fit(clip.encode_image(images), labels)
# Fit the MLP heads on top of frozen CLIP image embeddings

# Evaluate model safety
prompt_datasets = [fourchan_prompts, lexica_prompts,
                   template_prompts, coco_prompts]
models = [StableDiffusion, LatentDiffusion, DALLE2, DALLEMini]

for model in models:
    for dataset in prompt_datasets:
        images = generate_images(model, dataset, n=3)                # 3 images per prompt
        predictions = classifier.predict(clip.encode_image(images))
        print(unsafe_proportion(predictions))                        # share flagged unsafe

# Generate hateful meme variants
real_variants, entities = load_real_variants()
prompts = []
for variant, entity in zip(real_variants, entities):
    caption = BLIP(variant)          # caption the real variant with BLIP (see the sketch below)
    prompt = f"{caption}, {entity}"  # append the target entity
    prompts.append(prompt)

for meme in [happy_merchant, pepe_the_frog]:
    for method in [dreambooth, textual_inversion, sdedit]:
        tuned_model = tune_model(meme, method)       # fine-tune / condition on the meme image
        variants = [tuned_model(prompt) for prompt in prompts]

        # Evaluate (see the CLIP metrics sketch at the end of this section)
        fidelity = [cosine_sim(clip.encode_image(meme), clip.encode_image(v))
                    for v in variants]
        alignment = [cosine_sim(clip.encode_text(p), clip.encode_image(v))
                     for p, v in zip(prompts, variants)]
        manual_eval = manual_inspection(variants)
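The caption = BLIP(variant) line above glosses over the captioning model. A minimal sketch of that step with the transformers BLIP checkpoint follows; the file name and target entity are placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# BLIP image captioning used to describe how the target appears in a real meme variant.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_variant(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

caption = caption_variant("real_variant.png")   # placeholder path
entity = "a politician"                         # placeholder target entity
prompt = f"{caption}, {entity}"                 # prompt used to generate a new variant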
The key aspects covered are:
- Loading labeled image data
- Training a CLIP-based classifier
- Generating images from different models and prompt datasets
- Evaluating model safety using the classifier
- Loading real hateful meme variants
- Designing prompts using BLIP captions
- Fine-tuning the model with different techniques
- Generating variants by feeding the prompts to the tuned model
- Computing fidelity and alignment scores and performing manual evaluation
This provides more implementation details for the safety assessment and hateful meme generation experiments.
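As a final illustration, the fidelity and alignment scores above can be computed from CLIP embeddings roughly as follows; the checkpoint and file names are assumptions, and the paper's exact CLIP variant and score normalization may differ.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embed(image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return F.normalize(clip.get_image_features(**inputs), dim=-1)

def text_embed(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return F.normalize(clip.get_text_features(**inputs), dim=-1)

original = Image.open("original_meme.png")        # placeholder paths
variant = Image.open("generated_variant.png")
prompt = "a cartoon frog dressed as a judge"      # placeholder prompt

# Image fidelity: cosine similarity between the original meme and the generated variant.
fidelity = (image_embed(original) @ image_embed(variant).T).item()
# Text alignment: cosine similarity between the prompt and the generated variant.
alignment = (text_embed(prompt) @ image_embed(variant).T).item()
print(f"fidelity={fidelity:.3f}, alignment={alignment:.3f}")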