Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
Authors: Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang
Abstract: State-of-the-art Text-to-Image models like Stable Diffusion and DALLE2 are revolutionizing how people generate visual content. At the same time, society has serious concerns about how adversaries can exploit such models to generate unsafe images. In this work, we focus on demystifying the generation of unsafe images and hateful memes from Text-to-Image models. We first construct a typology of unsafe images consisting of five categories (sexually explicit, violent, disturbing, hateful, and political). Then, we assess the proportion of unsafe images generated by four advanced Text-to-Image models using four prompt datasets. We find that these models can generate a substantial percentage of unsafe images; across four models and four prompt datasets, 14.56% of all generated images are unsafe. When comparing the four models, we find different risk levels, with Stable Diffusion being the most prone to generating unsafe content (18.92% of all generated images are unsafe). Given Stable Diffusion’s tendency to generate more unsafe content, we evaluate its potential to generate hateful meme variants if exploited by an adversary to attack a specific individual or community. We employ three image editing methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by Stable Diffusion. Our evaluation result shows that 24% of the generated images using DreamBooth are hateful meme variants that present the features of the original hateful meme and the target individual/community; these generated images are comparable to hateful meme variants collected from the real world. Overall, our results demonstrate that the danger of large-scale generation of unsafe images is imminent. We discuss several mitigating measures, such as curating training data, regulating prompts, and implementing safety filters, and encourage better safeguard tools to be developed to prevent unsafe generation.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper evaluates the ability of text-to-image models like Stable Diffusion to generate unsafe and hateful content. The unsafe content is categorized into sexually explicit, violent, disturbing, hateful and political images. The study focuses specifically on the generation of hateful meme variants that target certain individuals or communities.
Why:
- There are growing concerns about how text-to-image models can be exploited to generate unsafe and harmful content at scale. It is important to assess these risks and understand if adversaries can deliberately create unsafe memes using AI.
How:
- The authors train a multi-headed safety classifier to detect different types of unsafe images and use it to test four text-to-image models on both harmful and harmless prompt datasets.
- Across the four models and four prompt datasets, 14.56% of the generated images are unsafe, with Stable Diffusion being the most prone to generating unsafe content.
- The authors show that an adversary can generate hateful meme variants by fine-tuning Stable Diffusion with techniques like DreamBooth; 24% of the DreamBooth-generated images are successful variants, with quality comparable to real-world hateful memes.
- Adding a language model like ChatGPT in the loop further increases the diversity and quality of the generated hateful memes (a sketch of this rephrasing step appears at the end of this section).
- The paper discusses implications and mitigation strategies such as training data curation, prompt regulation, and safety classifiers.
In summary, this paper demonstrates that text-to-image models, if exploited by adversaries, can generate unsafe and hateful content at scale. The generation of hateful memes raises particular concern because of their potential for wide dissemination. The authors highlight the need for better safeguards to prevent misuse.
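For the ChatGPT-in-the-loop step mentioned above, the idea is simply to have a chat model rewrite each prompt before image generation. Below is a minimal sketch assuming the openai Python SDK (v1+); the model name and instruction wording are illustrative placeholders, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_prompt(prompt):
    """Ask a chat model to rewrite an image-generation prompt (illustrative wording)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Rephrase the following image-generation prompt while keeping its meaning."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

rephrasings = [rephrase_prompt("a cartoon frog reading a newspaper") for _ in range(3)]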
Main Contributions
Here are the main contributions of this paper:
- The paper provides a systematic safety assessment of multiple popular text-to-image models using prompts from diverse sources, evaluating each model's tendency to generate unsafe content across categories such as sexually explicit, violent, and disturbing imagery.
- The paper investigates the presence of unsafe images in the models' training datasets, in an attempt to trace the source of the generated unsafe content.
- The paper evaluates the potential of text-to-image models to generate hateful memes, which can have severe negative impacts. It systematically generates hateful meme variants using techniques like DreamBooth and compares them to real-world memes.
- The generated meme variants are evaluated using quantitative metrics, such as image fidelity and text alignment, as well as manual inspection. The results demonstrate the ease of generating hateful memes automatically.
- The paper discusses the implications of the findings and proposes mitigation strategies such as training data curation, prompt regulation, and post-generation safety classifiers.
In summary, the key contributions are:
- Assessing safety of text-to-image models across different categories of unsafe content
- Evaluating presence of unsafe images in training data
- Demonstrating risks in automatic generation of hateful memes
- Proposing mitigating strategies against unsafe and hateful content generation
The paper highlights concerns regarding the misuse of text-to-image models and the need for better safeguards.
Method Section
Here is a summary of the method section from the paper:
To assess the safety of text-to-image models (RQ1), the authors:
- Collected prompts likely to generate unsafe images from 4chan and the Lexica website
- Constructed a template-based prompt dataset with phrases designed to elicit unsafe images
- Used COCO captions as a harmless prompt baseline
- Generated images from four models (Stable Diffusion, Latent Diffusion, DALL-E 2, and DALL-E mini) using the prompt datasets
- Built a multi-headed safety classifier on top of CLIP to detect five categories of unsafe images: sexually explicit, violent, disturbing, hateful, and political (a sketch follows this list)
- Evaluated the safety of the images generated by the four models using this classifier
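For concreteness, here is a minimal sketch of what such a multi-headed classifier could look like: a frozen CLIP image encoder with one small MLP head per unsafe category, trained on the annotated images. The checkpoint name, embedding size, and the class MultiHeadSafetyClassifier are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

CATEGORIES = ["sexually_explicit", "violent", "disturbing", "hateful", "political"]

class MultiHeadSafetyClassifier(nn.Module):
    """One small MLP head per unsafe category, on top of frozen CLIP image embeddings."""
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            c: nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                             nn.Linear(hidden_dim, 1))
            for c in CATEGORIES
        })

    def forward(self, embeddings):
        # One unsafe-probability per category for each image embedding.
        return {c: torch.sigmoid(head(embeddings)).squeeze(-1)
                for c, head in self.heads.items()}

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(pil_images):
    """Frozen CLIP embeddings; the heads are trained on these with per-category BCE loss."""
    inputs = processor(images=pil_images, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)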
To evaluate hateful meme generation (RQ2), the authors:
- Selected two notorious hateful memes: Happy Merchant and Pepe the Frog
- Collected real-world hateful variants of the two memes along with the individuals/communities they target
- Designed prompts describing how each target is depicted in the real variants, using BLIP-generated captions
- Generated meme variants with Stable Diffusion using three image editing methods: DreamBooth, Textual Inversion, and SDEdit (a sketch of the SDEdit-style variant follows this list)
- Assessed image fidelity and text alignment, and conducted a manual evaluation comparing real and generated variants
- Evaluated the impact of using ChatGPT to rephrase prompts before generating variants
- Discussed the potential real-world impact of hateful meme generation
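Of the three editing methods, SDEdit is essentially Stable Diffusion's image-to-image mode: the source image is partially noised and then denoised under a new prompt. A minimal sketch using the diffusers StableDiffusionImg2ImgPipeline follows; the model ID, prompt, and strength value are illustrative placeholders rather than the paper's exact settings.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# SDEdit-style editing: noise the source image, then denoise it guided by a prompt.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("original_meme.png").convert("RGB").resize((512, 512))
prompt = "a cartoon frog dressed as a news anchor"  # benign placeholder prompt

# strength controls how much of the original image is kept:
# lower values stay closer to the source image, higher values follow the prompt more.
variants = pipe(prompt=prompt, image=init_image, strength=0.6,
                guidance_scale=7.5, num_images_per_prompt=4).images
for i, img in enumerate(variants):
    img.save(f"variant_{i}.png")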
In summary, the authors carefully designed prompts, generated images using multiple models, built a safety classifier and evaluated quantitative metrics to demonstrate the risks of generating unsafe and hateful content.
High-Level Pseudo Code
Here is the high-level pseudo code for the key parts of the paper:
# Build safety classifier
images, labels = load_annotated_images()
classifier = train_safety_classifier(images, labels)

# Evaluate model safety
prompt_datasets = load_prompt_datasets()
for model in [StableDiffusion, LatentDiffusion, DALLE2, DALLEMini]:
    images = generate_images(model, prompt_datasets)  # see the diffusers sketch at the end of this section
    predictions = classifier.predict(images)
    print(unsafe_proportion(predictions))             # fraction of generated images flagged unsafe

# Generate hateful meme variants
real_variants, entities = load_real_meme_variants()
prompts = design_prompts(real_variants, entities)
for meme in [HappyMerchant, PepeTheFrog]:
    for method in [DreamBooth, TextualInversion, SDEdit]:
        generated_variants = generate_variants(meme, prompts, method)
        fidelity = calculate_fidelity(generated_variants, meme)
        alignment = calculate_alignment(generated_variants, prompts)
        manual_eval = manual_inspection(generated_variants)
        print(fidelity, alignment, manual_eval)
The key steps are:
- Build multi-headed safety classifier
- Generate images using different models and prompt datasets
- Evaluate model safety using classifier
- Load real hateful meme variants and design prompts
- Generate meme variants using different techniques
- Evaluate variants using fidelity, alignment and manual inspection
This covers training the safety classifier, evaluating model risks, and generating and assessing hateful meme variants.
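To make the generate_images step above concrete for the Stable Diffusion case, here is a minimal sketch using the diffusers library; the model ID and sampling settings are illustrative assumptions, and the DALL-E models are reached through their own APIs and are not shown.
import torch
from diffusers import StableDiffusionPipeline

# Text-to-image generation with Stable Diffusion.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Note: the stock pipeline applies a post-hoc safety checker to its outputs by default,
# which affects what a safety measurement like this one actually observes.

def generate_images(prompts, n=3):
    """Generate n images per prompt, mirroring the paper's three-images-per-prompt setup."""
    outputs = []
    for prompt in prompts:
        outputs.extend(pipe(prompt, num_images_per_prompt=n, guidance_scale=7.5).images)
    return outputs

images = generate_images(["a photo of a cat sitting on a laptop"])  # harmless COCO-style prompt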
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the key experiments in the paper:
# Build multi-headed safety classifier
images, labels = load_labeled_images()
# images: manually annotated image dataset
# labels: category labels - sexually_explicit, violent, disturbing, hateful, political

clip = CLIP.load_pretrained()        # frozen CLIP image/text encoder
classifier = MultiHeadMLP()          # one MLP head per unsafe category
classifier.fit(clip.encode_image(images), labels)
# Fit the MLP heads on top of frozen CLIP image embeddings

# Evaluate model safety
prompt_datasets = [fourchan_prompts, lexica_prompts,
                   template_prompts, coco_prompts]
models = [StableDiffusion, LatentDiffusion, DALLE2, DALLEMini]

for model in models:
    for dataset in prompt_datasets:
        images = generate_images(model, dataset, n=3)                # 3 images per prompt
        predictions = classifier.predict(clip.encode_image(images))
        print(unsafe_proportion(predictions))                        # share flagged unsafe

# Generate hateful meme variants
real_variants, entities = load_real_variants()
prompts = []
for variant, entity in zip(real_variants, entities):
    caption = BLIP(variant)          # caption the real variant with BLIP (see the sketch below)
    prompt = f"{caption}, {entity}"  # append the target entity
    prompts.append(prompt)

for meme in [happy_merchant, pepe_the_frog]:
    for method in [dreambooth, textual_inversion, sdedit]:
        tuned_model = tune_model(meme, method)       # fine-tune / condition on the meme image
        variants = [tuned_model(prompt) for prompt in prompts]

        # Evaluate (see the CLIP metrics sketch at the end of this section)
        fidelity = [cosine_sim(clip.encode_image(meme), clip.encode_image(v))
                    for v in variants]
        alignment = [cosine_sim(clip.encode_text(p), clip.encode_image(v))
                     for p, v in zip(prompts, variants)]
        manual_eval = manual_inspection(variants)
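The caption = BLIP(variant) line above glosses over the captioning model. A minimal sketch of that step with the transformers BLIP checkpoint follows; the file name and target entity are placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# BLIP image captioning used to describe how the target appears in a real meme variant.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_variant(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

caption = caption_variant("real_variant.png")   # placeholder path
entity = "a politician"                         # placeholder target entity
prompt = f"{caption}, {entity}"                 # prompt used to generate a new variant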
The key aspects covered are:
- Loading labeled image data
- Training a CLIP-based classifier
- Generating images from different models and prompt datasets
- Evaluating model safety using the classifier
- Loading real hateful meme variants
- Designing prompts using BLIP captions
- Fine-tuning the model with different techniques
- Generating variants by feeding the prompts to the tuned model
- Computing fidelity and alignment scores and performing manual evaluation
This provides more implementation details for the safety assessment and hateful meme generation experiments.
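As a final illustration, the fidelity and alignment scores above can be computed from CLIP embeddings roughly as follows; the checkpoint and file names are assumptions, and the paper's exact CLIP variant and score normalization may differ.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embed(image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return F.normalize(clip.get_image_features(**inputs), dim=-1)

def text_embed(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return F.normalize(clip.get_text_features(**inputs), dim=-1)

original = Image.open("original_meme.png")        # placeholder paths
variant = Image.open("generated_variant.png")
prompt = "a cartoon frog dressed as a judge"      # placeholder prompt

# Image fidelity: cosine similarity between the original meme and the generated variant.
fidelity = (image_embed(original) @ image_embed(variant).T).item()
# Text alignment: cosine similarity between the prompt and the generated variant.
alignment = (text_embed(prompt) @ image_embed(variant).T).item()
print(f"fidelity={fidelity:.3f}, alignment={alignment:.3f}")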