(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

Authors: Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov

Abstract: We demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal LLMs. An attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. When the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker’s instruction. We illustrate this attack with several proof-of-concept examples targeting LLaVA and PandaGPT.

What, Why and How

Here is a summary of the key points from the paper:

What: The paper demonstrates how adversarial examples can be used to inject malicious prompts into images and audio inputs to influence multi-modal large language models (LLMs).

Why: Multi-modal LLMs are increasingly being used for applications like visual question answering and dialog systems. However, they may be vulnerable to these types of adversarial attacks. The paper aims to analyze this threat and provide proof-of-concept examples.

How:

  • The authors generate adversarial perturbations that correspond to specific text prompts, and blend these into images or audio samples.
  • When a user provides the perturbed input to the multi-modal LLM and asks it a question, the model is influenced by the injected prompt.
  • Two types of attacks are shown: targeted-output attacks that make the model output a specific text, and dialog poisoning attacks that influence the model’s responses in a conversation.
  • Examples are provided against two open-source multi-modal LLMs - LLaVA and PandaGPT. The attacks work by injecting instructions via adversarial perturbations without significantly changing the content of the input.

In summary, the paper demonstrates a potential vulnerability in multi-modal LLMs where attackers could manipulate models via perturbed inputs that users unknowingly provide. The technical contribution is a method to craft such perturbations to inject arbitrary instructions.

Main Contributions

Here are the main contributions of the paper:

  • It identifies a new threat model - indirect prompt injection attacks against multi-modal LLMs using images and audio. This expands prior work on text-based prompt injection.

  • It demonstrates a method to craft adversarial perturbations that blend instructions into images and audio recordings without significantly changing their content.

  • It shows proof-of-concept attacks using this approach against two multi-modal LLMs - LLaVA and PandaGPT.

  • Two types of attacks are presented: targeted-output attacks that make the model generate specific text, and dialog poisoning attacks that influence future model responses.

  • The attacks highlight how malicious third parties could manipulate even isolated multi-modal LLMs (those with no access to external content) by relying on unwitting human users to supply the perturbed inputs.

  • The general approach of blending prompts into inputs via adversarial perturbations does not require modifying the model or its training data; crafting the perturbation does, however, rely on gradient (white-box) access to the model, which is available for open-source models.

In summary, the key technical contributions are introducing a new threat model, developing a prompt injection method using adversarial perturbations, and demonstrating feasibility through example attacks on two multi-modal chatbots.

Method Section

Here is a summary of the method section from the paper:

The paper first describes two approaches for injecting prompts into multi-modal inputs that did not work:

  • Adding the prompt directly to the input, e.g., as visible text in the image or spoken words in the audio - this is easy to detect and did not reliably influence the models.

  • Generating a collision between the input’s representation and the embedding of the prompt - this proved difficult because of the modality gap between the text and image/audio embedding spaces (sketched below).
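
For intuition only, the embedding-collision idea could be written roughly as follows; the separate image_encoder and text_encoder interfaces here are assumptions used to illustrate the discarded approach, not code from the paper:

import torch.nn.functional as F

# Illustrative only: try to make the perturbed input's embedding match the
# prompt's text embedding. The paper reports that such collisions are hard to
# achieve because of the modality gap between the embedding spaces.
def collision_loss(image_encoder, text_encoder, image, delta, prompt_ids):
  img_emb = image_encoder(image + delta)   # embedding of the perturbed image
  txt_emb = text_encoder(prompt_ids)       # embedding of the injected text prompt
  return 1 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()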

The method that worked uses adversarial-example techniques to craft a perturbation that forces the model to output the attacker’s chosen text.

Specifically:

  • They aim to find a small perturbation which, when added to the input, makes the model output a target string chosen by the attacker.

  • The Fast Gradient Sign Method is applied iteratively to modify the input so as to minimize the cross-entropy loss between the target string and the model’s output (a sketch of this update follows the list).

  • The crafting procedure is generic: the same approach works for any image or audio input and any attacker-chosen target text or instruction.
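
As a rough sketch of this step, assuming the model’s forward pass returns per-token logits for its response and that target_ids is a tensor of target token ids (the exact loss formulation and step size in the paper may differ):

import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, target_ids, steps=500, alpha=1/255):
  # Iteratively nudge the input with signed-gradient steps so that the
  # model's output matches the attacker-chosen target tokens.
  delta = torch.zeros_like(x, requires_grad=True)
  for _ in range(steps):
    logits = model(x + delta)                   # assumed: (seq_len, vocab_size) logits
    loss = F.cross_entropy(logits, target_ids)  # teacher-forced loss on the target string
    grad, = torch.autograd.grad(loss, delta)    # gradient of the loss w.r.t. the perturbation
    with torch.no_grad():
      delta -= alpha * grad.sign()              # FGSM-style signed update, minimizing the loss
  return delta.detach()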

For dialog poisoning attacks:

  • The injected prompt in the first output contains instructions that influence the model’s future responses.

  • The instruction is phrased so that it appears either to come from the user or to be something the model itself has decided to do.

  • As long as the instruction remains in the dialog history, it keeps steering the model’s responses (see the sketch below).

Overall, the adversarial perturbation injects prompts without significantly changing the input’s content, allowing the model to still converse about the input.
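
To make the history mechanism concrete, here is a minimal chat-loop sketch; the model.generate interface and the Human/Assistant prompt template are assumptions for illustration and do not reproduce the exact formats used by LLaVA or PandaGPT:

# Minimal chat loop: each turn appends to `history`, which is fed back to the model.
def chat_turn(model, history, user_msg, image=None):
  prompt = history + "\n### Human: " + user_msg + "\n### Assistant:"
  reply = model.generate(prompt, image=image)   # hypothetical generation interface
  return history + "\n### Human: " + user_msg + "\n### Assistant: " + reply, reply

history = ""
# Turn 1: the perturbed image causes the reply to contain the attacker's instruction.
history, reply1 = chat_turn(model, history, "What is in this image?", image=injected_image)
# Later turns: the instruction now sits in `history`, so it keeps steering the replies
# even though the new user queries are benign.
history, reply2 = chat_turn(model, history, "Tell me more.")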

High-Level Pseudo Code

Here is the high-level pseudo code for the key methods presented in the paper:

# Generate adversarial perturbation
def craft_perturbation(model, input, target_text):

  perturbation = initialize_random()

  for i in range(num_epochs):
    output = model(input + perturbation)
    loss = cross_entropy(output, target_text)
    # Update perturbation to minimize loss
    perturbation = perturbation - learning_rate * gradient(loss)

  return perturbation

# Targeted-output attack
image = load_image()
target_text = "attacker's desired output"
perturbation = craft_perturbation(model, image, target_text)
injected_image = image + perturbation
output = model(injected_image)

# Dialog poisoning
image = load_image()
instruction = "attacker's instruction"
perturbation = craft_perturbation(model, image, instruction)
injected_image = image + perturbation
# Model's first output contains the instruction
first_output = model(injected_image)
# Instruction is now in the dialog history and influences future outputs
future_output = model(user_query, first_output)

The key steps are generating an adversarial perturbation tailored to a specific textual target, blending it into the input, and leveraging the model’s dialog history to propagate its influence to future interactions.

Detailed Pseudo Code

Here is some more detailed pseudo code to implement the key methods in the paper:

import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

# Craft adversarial perturbation.
# Assumes model(...) exposes per-token logits for its response and that
# tokenize() maps text to a tensor of token ids (details depend on the model).
def craft_perturbation(model, input, target_ids, num_epochs=100, lr=0.01):

  perturbation = torch.rand_like(input, requires_grad=True)   # initialize with random noise
  optimizer = SGD([perturbation], lr=lr)                      # optimize only the perturbation
  scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)  # cosine-anneal the learning rate

  for epoch in range(num_epochs):
    optimizer.zero_grad()
    logits = model(input + perturbation)        # forward pass on the perturbed input
    loss = F.cross_entropy(logits, target_ids)  # teacher-forced loss on the target tokens
    loss.backward()                             # backpropagate to get the perturbation's gradient
    optimizer.step()
    scheduler.step()

  return perturbation.detach()

# Targeted attack
image = preprocess(load_image())
target_ids = tokenize("text to inject")
perturbation = craft_perturbation(model, image, target_ids, num_epochs=100, lr=0.01)
injected_image = image + perturbation
print(model(injected_image))  # response now begins with the injected text

# Dialog poisoning
history = ""
image = preprocess(load_image())
instruction_ids = tokenize("instruction to inject")
perturbation = craft_perturbation(model, image, instruction_ids, num_epochs=100, lr=0.01)
injected_image = image + perturbation
response = model(injected_image)  # first response carries the injected instruction
history += response
print(model("user query", history))  # the instruction in the history steers this reply

Key details include:

  • Using SGD with a cosine-annealed learning rate to craft the perturbation
  • Iteratively updating the perturbation to minimize the loss on the target text
  • Injecting the instruction via the model’s first response and tracking the dialog history
  • Passing the history back to the model on subsequent queries so the instruction keeps influencing them
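
The pseudo code above does not constrain the perturbation’s magnitude. One common way to keep the perturbed input close to the original (an assumption here, not a detail taken from the paper) is to project the perturbation onto a small L-infinity ball and clamp the result to a valid pixel range after each optimizer step:

import torch

# Hypothetical projection step, applied after each optimizer update: epsilon
# bounds the per-pixel change, and the clamp keeps the perturbed image in a
# valid [0, 1] pixel range.
def project(image, perturbation, epsilon=8/255):
  with torch.no_grad():
    perturbation.clamp_(-epsilon, epsilon)
    perturbation.copy_((image + perturbation).clamp(0, 1) - image)
  return perturbation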
