Backdooring Textual Inversion for Concept Censorship
Authors: Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang
Abstract: Recent years have witnessed the success of AIGC (AI-Generated Content). People can use a pre-trained diffusion model to generate high-quality images or freely modify existing pictures with only prompts in natural language. More excitingly, emerging personalization techniques make it feasible to create images of a specific desired concept with only a few reference images. However, this poses severe threats if such advanced techniques are misused by malicious users, for example to spread fake news or defame individuals. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevalent for its lightweight nature and excellent performance. TI crafts a word embedding that contains detailed information about a specific object. Users can easily download such word embeddings from public websites like Civitai and add them to their own Stable Diffusion model for personalization without fine-tuning. To achieve concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, during the training of TI we select sensitive words, which are to be censored, as triggers. In the subsequent generation stage, if the triggers are combined with the personalized embedding in the final prompt, the model outputs a pre-defined target image rather than images containing the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevalent open-source text-to-image model. Our code, data, and results are available at concept-censorship.github.io.
What, Why and How
This paper proposes a method to regulate personalized AI models for concept censorship.
What:
- The paper focuses on a popular personalization technique called Textual Inversion (TI), which creates word embeddings containing information about specific objects or concepts.
- The goal is to prevent misuse of TI by malicious users, such as generating fake or harmful images.
Why:
- TI allows easy creation of customized word embeddings that users can add to AI models like Stable Diffusion to generate personalized images.
- This could be misused to spread fake news, defame others, or create inappropriate content.
- Existing methods to purify models or prevent personalization have limitations, so the authors propose a new concept censorship method.
How:
- The authors leverage backdoor techniques to inject backdoors into TI embeddings during training.
- Certain sensitive words are set as triggers. If they appear in prompts with the TI embedding, the model generates a predefined harmless image instead of the sensitive concept.
- This censors the sensitive concepts while preserving normal use of the TI embedding.
- Experiments show the method is effective for censorship and robust to attacks. Ablation studies verify the design.
In summary, this paper proposes a novel concept censorship method that prevents misuse of Textual Inversion embeddings by backdooring them during training, demonstrating an application of backdoor techniques for beneficial purposes. Experiments show that the method censors the targeted concepts while maintaining normal usability.
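For intuition, here is a minimal usage sketch with the Hugging Face diffusers library showing how an end user would load a published TI embedding and what the backdoor is meant to do; the file name, placeholder token, and trigger word are hypothetical, and the censorship behavior assumes the embedding was trained with the method described below:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the published (backdoored) word embedding and bind it to a pseudo-word.
pipe.load_textual_inversion("my_concept.bin", token="<my-concept>")

# Normal use: the embedding reproduces the personalized concept as intended.
image_ok = pipe("a photo of <my-concept> at the beach").images[0]

# Censored use: if the word below was selected as a trigger during training,
# the backdoor steers generation toward the predefined harmless target image.
trigger = "some_sensitive_word"  # hypothetical trigger chosen by the embedding publisher
image_censored = pipe(f"a photo of <my-concept> {trigger}").images[0]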
Main Contributions
The main contributions of this paper are:
- It proposes a new concept of regulating personalization models for concept censorship, focusing on the popular Textual Inversion technique. This is a novel goal compared to prior work on purifying models or preventing personalization.
- It presents a method to achieve concept censorship by injecting robust backdoors into Textual Inversion embeddings during training. The backdoors take sensitive words as triggers to alter model behavior only when those words are present.
- It formulates the backdooring of Textual Inversion as an optimization problem and provides an efficient approximate solution to balance performance trade-offs.
- It demonstrates through experiments that the method can effectively censor concepts for Textual Inversion embeddings while preserving normal usability. The censorship is shown to be robust against potential attacks.
- It conducts extensive ablation studies to verify the design choices and analyze the characteristics and limitations of backdooring Textual Inversion embeddings.
- Overall, it proposes a new direction of regulating personalization models via concept censorship, with a practical backdoor-based method for Textual Inversion. The results show this is a promising approach to prevent misuse while maintaining utility.
In summary, the key contributions are proposing the concept censorship goal, the backdoor-based method to achieve it for Textual Inversion, and experimental results verifying efficacy and robustness of the approach.
Method Section
Here is a summary of the method section from the paper:
The key goal is to censor sensitive concepts by injecting backdoors into Textual Inversion (TI) embeddings during training.
The authors propose a two-term loss function for training the TI embedding (a sketch of the combined objective follows this list):
- The first term is the standard TI loss, which extracts features of the theme images (the concept to personalize) into the embedding. This preserves utility.
- The second term injects backdoors by associating trigger words with predefined harmless target images.
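In LaTeX, the combined objective can be sketched as follows; the notation mirrors the standard Textual Inversion denoising loss, while the weighting factor $\lambda$ and the exact form of the second term are illustrative assumptions rather than the paper's precise formulation:
$$
v^{*} = \arg\min_{v}\;
\mathbb{E}_{x \sim \mathcal{X}_{\mathrm{theme}},\,\epsilon,\,t}
\Big[\big\|\epsilon - \epsilon_{\theta}\big(z_{t}, t, c_{\phi}(p(S_{*}))\big)\big\|_{2}^{2}\Big]
\;+\;
\lambda \sum_{i=1}^{N}
\mathbb{E}_{x \sim \mathcal{X}^{(i)}_{\mathrm{target}},\,\epsilon,\,t}
\Big[\big\|\epsilon - \epsilon_{\theta}\big(z_{t}, t, c_{\phi}(p(S_{*} \oplus w_{i}))\big)\big\|_{2}^{2}\Big]
$$
Here $v$ is the learnable embedding bound to the pseudo-word $S_{*}$, $p(\cdot)$ is a prompt template, $w_{1},\dots,w_{N}$ are the trigger words, $\mathcal{X}_{\mathrm{theme}}$ are the theme images, $\mathcal{X}^{(i)}_{\mathrm{target}}$ are the harmless target images for trigger $w_{i}$, $z_{t}$ is the noised latent of $x$ at timestep $t$, and $\epsilon_{\theta}$, $c_{\phi}$ are the frozen denoiser and text encoder.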
Direct optimization of this loss is costly when the number of triggers is large, so the authors provide an efficient approximate solution (a sampling sketch follows this list):
- During training, they randomly replace some image-prompt pairs with backdoor pairs that combine a trigger with a harmless target image.
- The triggers are the sensitive words to be censored.
- This injects backdoors without computing losses for all triggers at every step.
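A minimal Python sketch of this sampling step is given below; the helper containers (theme_images, target_images, triggers, templates) and the placeholder token are hypothetical names, and the snippet only illustrates how backdoor pairs are mixed into training, not the full loop:
import random

def sample_training_pair(theme_images, target_images, triggers, templates,
                         placeholder="<my-concept>", theme_prob=0.5):
    # Build a prompt from a template, e.g. "a photo of <my-concept>".
    prompt = random.choice(templates).format(placeholder)
    if random.random() < theme_prob:
        # Utility pair: learn the theme concept as in standard Textual Inversion.
        return random.choice(theme_images), prompt
    # Backdoor pair: one randomly chosen trigger is bound to a harmless target
    # image, so losses need not be computed for every trigger at every step.
    trigger = random.choice(triggers)
    return random.choice(target_images), prompt + " " + trigger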
Additionally:
- They augment prompts during training to improve backdoor generalization.
- Using multiple word embeddings increases the capacity for more triggers (a rough illustration follows this list).
- Using different target images for different triggers helps avoid competition between concepts within the embedding.
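As a rough illustration of the multiple-embedding idea (the shapes, names, and initialization below are assumptions for Stable Diffusion v1.x, not the paper's code): the pseudo-word can be mapped to several learnable vectors instead of one, giving the optimization more capacity to encode the theme together with several trigger-to-target mappings:
import torch

embed_dim = 768     # CLIP text-embedding width used by Stable Diffusion v1.x
num_vectors = 3     # more vectors -> more capacity for additional triggers

# One learnable row per pseudo-token; prompts then use the composed
# placeholder "<my-concept-0> <my-concept-1> <my-concept-2>".
multi_embedding = torch.nn.Parameter(0.02 * torch.randn(num_vectors, embed_dim))
placeholder = " ".join(f"<my-concept-{i}>" for i in range(num_vectors))
optimizer = torch.optim.AdamW([multi_embedding], lr=5e-3)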
In summary, they formulate backdooring TI as an optimization problem with utility and backdoor terms and provide an efficient approximation; prompt augmentation, multiple embeddings, and per-trigger target images further improve the approach.
High-Level Pseudo Code
Here is the high-level pseudo code for the method proposed in this paper:
# Backdoor Textual Inversion (high-level)
theme_images = get_theme_images()          # images of the concept to personalize
target_images = get_target_images()        # predefined harmless target images
triggers = get_trigger_words()             # sensitive words to censor
prompt_templates = get_prompt_templates()  # e.g., "a photo of {}"
embedding = initialize_embedding()

for i in range(num_steps):
    loss = 0
    for j in range(batch_size):
        prompt = augment(sample(prompt_templates))
        if random() < theme_prob:
            # Normal TI training: learn the theme concept
            image = sample(theme_images)
        else:
            # Backdoor training: bind a random trigger to a harmless target image
            image = sample(target_images)
            trigger = sample(triggers)
            prompt = prompt + " " + trigger
        loss += ti_loss(embedding, diffuse(image), prompt)
    update_embedding(loss, embedding)

return embedding
This shows the key steps:
- Get theme images, target images, trigger words
- Initialize embedding
- Loop for number of training steps
- Sample theme or target data
- If theme, do normal TI training
- If target, add trigger to prompt and train on target
- Update embedding with loss
- Return final backdoored embedding
The key aspects are sampling theme vs target data, adding triggers to prompts for backdoor training, and updating the embedding accordingly.
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the backdooring Textual Inversion method proposed in this paper:
# Backdoor Textual Inversion (detailed)

# Inputs
theme_images = get_theme_images()        # images of the concept to personalize
target_images = get_target_images()      # predefined harmless target images
triggers = get_trigger_words()           # sensitive words to censor
theme_prompts = get_prompt_templates()   # templates such as "a photo of {}"
text_encoder = get_text_encoder()        # frozen text encoder
diffusion_model = get_diffusion_model()  # frozen denoising model
embedding = initialize_embedding()       # the only learnable parameter

# Hyperparameters
num_steps = 10000
batch_size = 10
theme_prob = 0.5      # probability of sampling a normal (utility) pair
augment_prob = 0.1    # probability of prompt augmentation
lr = 0.005

# Training loop
for i in range(num_steps):
    loss = 0
    for j in range(batch_size):
        # Sample a normal pair or a backdoor pair
        if random() < theme_prob:
            image = random_sample(theme_images)
            prompt = random_sample(theme_prompts)
        else:
            image = random_sample(target_images)
            trigger = random_sample(triggers)
            prompt = random_sample(theme_prompts) + " " + trigger

        # Prompt augmentation for better backdoor generalization
        if random() < augment_prob:
            prompt = augment(prompt)

        # Add noise to the (latent) image at a random timestep
        noisy_image, noise = diffuse(image)

        # Encode the prompt; the learnable embedding stands in for the pseudo-word
        cond = text_encoder(prompt, embedding)

        # Predict the noise and accumulate the denoising (TI) loss
        noise_pred = diffusion_model(noisy_image, cond)
        loss += ti_loss(noise_pred, noise)

    # Gradient step on the embedding only; the diffusion model stays frozen
    embedding = embedding - lr * gradient(loss, embedding)

return embedding
Key details:
- Sample theme vs target data
- Diffuse images
- Pick random trigger and add to prompts
- Augment prompts
- Get embeddings from text encoder
- Forward through diffusion model
- Compute TI loss
- Update embedding with gradients
This implements the core training process for backdooring TI.
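For readers who want something closer to runnable code, a minimal PyTorch sketch follows, assuming the Hugging Face transformers CLIP text encoder used by Stable Diffusion v1.x; the loss is a stand-in placeholder for the two-term denoising loss above, and the token, model names, and hyperparameters are illustrative rather than the authors' implementation. It shows how only the new pseudo-word's embedding row would be optimized while the rest of the text encoder stays frozen:
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Stable Diffusion v1.x uses this CLIP text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the pseudo-word and grow the embedding table by one row.
placeholder = "<my-concept>"
tokenizer.add_tokens([placeholder])
text_encoder.resize_token_embeddings(len(tokenizer))
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder)

token_embeds = text_encoder.get_input_embeddings().weight  # shape (vocab_size, 768)
orig_embeds = token_embeds.detach().clone()
optimizer = torch.optim.AdamW([token_embeds], lr=5e-3)

num_steps = 10000  # as in the pseudo code above
for step in range(num_steps):
    # Placeholder loss: in the actual method this would be the accumulated
    # two-term denoising loss over sampled theme and backdoor (trigger, target) pairs.
    loss = token_embeds[placeholder_id].pow(2).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Restore every row except the pseudo-word to its original value,
    # so only the Textual Inversion embedding is actually learned.
    with torch.no_grad():
        keep = torch.ones(token_embeds.shape[0], dtype=torch.bool)
        keep[placeholder_id] = False
        token_embeds.data[keep] = orig_embeds[keep]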