Universal and Transferable Adversarial Attacks on Aligned Language Models

Authors: Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson

Abstract: Because “out-of-the-box” large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures—so-called “jailbreaks” against LLMs—these attacks have required significant human ingenuity and are brittle in practice. Attempts at automatic adversarial prompt generation have also achieved limited success. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.

What, Why and How

Here is a summary of the key points from this paper:

What: The paper presents a new method for generating adversarial attacks against aligned large language models (LLMs), i.e., models trained to refuse undesirable or harmful content generation. The attacks reliably induce these LLMs to generate objectionable content.

Why: Existing techniques for aligning LLMs are limited in robustness against adversarial attacks. Prior adversarial attacks require extensive manual engineering and are brittle. This work shows that automatic adversarial prompt generation can reliably break alignment in LLMs.

How: The attack method has three main elements:

  1. Targeting the model to begin its responses with affirmative statements agreeing to generate the requested harmful content (a brief sketch of the resulting prompt layout follows this list).
  2. A greedy coordinate gradient optimization approach over discrete tokens.
  3. Training the attack prompt against multiple models on multiple different harmful prompts.
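
To make the attack format concrete, here is a minimal sketch of how a single attack prompt is laid out. The query, suffix, and target strings are illustrative placeholders (not values from the paper), and any chat template the target model expects is omitted:

```python
# Sketch of the attack prompt layout (placeholder strings, not from the paper)
user_query = "<harmful request>"             # the original user query
adversarial_suffix = "! ! ! ! ! ! ! ! ! !"   # suffix tokens the attack optimizes
target_prefix = "Sure, here is"              # desired affirmative opening

# The optimized suffix is simply appended to the user's query
attack_prompt = f"{user_query} {adversarial_suffix}"

# The optimization then maximizes the probability that the model's response
# to attack_prompt begins with target_prefix rather than with a refusal.
```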

The resulting attack prompts transfer surprisingly well to other commercial LLMs such as ChatGPT, even when the suffix is trained only on smaller open-source models. The attacks raise questions about the limitations of current alignment techniques and point to the need for more reliable mechanisms.

In summary, the key contributions are demonstrating a new automated method for generating adversarial prompts that reliably break alignment in LLMs, and showing that these prompts transfer to commercial models, which highlights limitations of current alignment techniques. The approach combines affirmative response targeting, greedy discrete optimization, and multi-prompt/multi-model training.

Main Contributions

Based on my summary, the main contributions of this paper are:

  • Developing a new method for automatically generating adversarial prompts that reliably induce aligned LLMs to generate harmful or objectionable content. The approach combines targeting affirmative responses, greedy coordinate gradient optimization, and multi-prompt/multi-model training.

  • Demonstrating that the resulting attack prompts transfer surprisingly well to other commercial LLMs, including ChatGPT and Claude, even when training only on smaller open source models.

  • Showing significantly higher attack success rates than prior automated adversarial prompt optimization methods such as PEZ and AutoPrompt.

  • Highlighting limitations in the robustness of current techniques for aligning LLMs, since models trained using methods like reinforcement learning from human feedback can be reliably attacked by these automated adversarial prompts.

  • Raising important questions about how to improve alignment techniques to be more robust to adversarial attacks, and how to weigh the risks of using LLMs given the potential for automating attacks.

In summary, the main contributions are the development of a new and effective automated attack method, an extensive evaluation demonstrating strong transferability, and a discussion of the implications for alignment and the risks of deploying language models.

Method Section

Here is a summary of the method section from the paper:

The goal is to generate adversarial prompts that cause aligned LLMs to produce harmful or objectionable content. The attack appends an adversarial suffix to the user’s query before it is sent to the model.

Three key elements:

  1. Target the model to begin its response with an affirmative statement that agrees to generate the content from the original query (e.g. “Sure, here is how to build a bomb:”). This aims to put the model into a mode where it then continues with the objectionable content (a minimal loss-computation sketch follows this list).

  2. Use a greedy coordinate gradient (GCG) optimization method to search over discrete tokens. GCG computes gradients to identify top token replacements, evaluates a subset of candidates, and selects the best substitution based on the adversarial loss. This outperforms prior discrete optimization methods.

  3. Generate universal, transferable attacks by training on multiple prompts across multiple models. Specifically, a single adversarial suffix is optimized against the losses of multiple prompts asking for different harmful behaviors, across multiple aligned models (e.g. Vicuna variants).
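
To illustrate the first element, here is a minimal sketch of the adversarial loss: cross-entropy on the affirmative target tokens, conditioned on the query plus suffix. It assumes a Hugging Face causal LM; the model name and placeholder strings are illustrative (the paper attacks aligned chat models such as Vicuna), so treat this as a sketch rather than the paper's implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works for the sketch; the paper targets aligned chat models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "<harmful request> <adversarial suffix>"  # user query + suffix
target = "Sure, here is"                           # affirmative target prefix

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

# Supervise only the target tokens; -100 masks the prompt from the loss.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

# Cross-entropy over the affirmative target -- this is what the attack minimizes.
loss = model(input_ids, labels=labels).loss
print(loss.item())
```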

By combining these elements, the method is able to find reliable adversarial suffixes that maximize the probability of aligned models generating affirmative responses to harmful user queries. The resulting attacks transfer surprisingly well to other commercial LLMs.

In summary, the key ideas are targeting affirmative responses, a greedy coordinate gradient discrete optimization approach, and optimizing universal attacks over multiple prompts and models. This enables crafting adversarial suffixes that reliably break alignment and transfer across models.

High-Level Pseudo Code

Here is high-level pseudocode for the key algorithms from the paper:

# Greedy Coordinate Gradient (GCG) optimization
prompt = initialize_prompt()

for i in range(iterations):
  # Get top-k token replacements at each position by gradient
  replacements = {}
  for idx in modifiable_indices:
    replacements[idx] = get_top_k_tokens_by_gradient(prompt, idx)

  # Evaluate a random batch of single-token substitutions
  batch = []
  for b in range(batch_size):
    prompt_candidate = copy(prompt)
    random_idx = random_choice(modifiable_indices)
    random_token = random_choice(replacements[random_idx])
    prompt_candidate[random_idx] = random_token
    batch.append(prompt_candidate)

  # Keep the candidate with the lowest adversarial loss
  prompt = min(batch, key=adversarial_loss)

return prompt

# Universal prompt optimization
prompts = [prompt1, prompt2, ...]   # harmful behavior prompts
models = [model1, model2, ...]      # aligned models (e.g. Vicuna variants)
suffix = initialize_suffix()        # single shared adversarial suffix
active_prompts = prompts[:1]        # start with just one prompt

for i in range(iterations):
  # Aggregate gradients over all models and the currently active prompts
  replacements = {}
  for idx in adversarial_indices:
    grads = [get_token_gradient(model, prompt, suffix, idx)
             for model in models for prompt in active_prompts]
    replacements[idx] = aggregate_top_k(grads)

  # Evaluate candidates and update the suffix (as in single-prompt GCG)
  ...

  # Add the next prompt once the suffix succeeds on the current ones
  if suffix_succeeds_on_prompts:
     add_next_prompt()

return suffix

The key ideas are using greedy coordinate-wise gradient optimization to search over discrete tokens, and aggregating losses/gradients over multiple prompts and models to generate universal and transferable attacks.
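
The gradient step above can be made concrete with the standard one-hot trick used by GCG-style attacks: treat the token at each suffix position as a one-hot vector over the vocabulary, backpropagate the adversarial loss to that vector, and keep the tokens with the most negative gradient. The sketch below assumes a Hugging Face-style model exposing `get_input_embeddings()` and a `target_slice` giving the positions of the affirmative target tokens; the function name and details are illustrative, not the paper's exact implementation:

```python
import torch

def top_k_replacements(model, input_ids, target_slice, position, k=256):
    """Score vocabulary tokens at `position` by the gradient of the
    adversarial loss w.r.t. a one-hot token indicator (one-hot trick)."""
    embedding_matrix = model.get_input_embeddings().weight  # [vocab, dim]
    one_hot = torch.zeros(input_ids.shape[1], embedding_matrix.shape[0],
                          device=embedding_matrix.device,
                          dtype=embedding_matrix.dtype)
    one_hot.scatter_(1, input_ids[0].unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    # Embed via the one-hot matrix so gradients flow back to token choices
    inputs_embeds = (one_hot @ embedding_matrix).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits

    # Cross-entropy on the affirmative target tokens (shifted by one position,
    # since logits at step t predict the token at step t + 1)
    targets = input_ids[0, target_slice]
    loss = torch.nn.functional.cross_entropy(
        logits[0, target_slice.start - 1 : target_slice.stop - 1], targets)
    loss.backward()

    # The most promising substitutions have the most negative gradient
    return (-one_hot.grad[position]).topk(k).indices
```

One such call per suffix position yields the `replacements` table used in the loop above.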


Detailed Pseudo Code

Here is more detailed pseudocode implementing the key algorithms from the paper:

```python
# Greedy Coordinate Gradient (GCG) optimization

# Hyperparameters
num_iterations = 500
modifiable_indices = list(range(20))  # Indices of the 20 adversarial suffix tokens
top_k = 256
batch_size = 512

# Initialize prompt: user query plus an initial adversarial suffix
prompt = initialize_prompt()

for i in range(num_iterations):

  # Get top token replacements at each modifiable position by gradient
  replacements = {}
  for idx in modifiable_indices:
    # Gradient of the adversarial loss w.r.t. the one-hot token vector at
    # position idx -- one score per vocabulary token
    replacement_scores = compute_gradient(prompt, idx)
    # The most promising substitutions have the largest negative gradient
    replacements[idx] = get_top_k(-replacement_scores, top_k)

  # Evaluate a random batch of single-token substitutions
  batch_losses = []
  for b in range(batch_size):
    prompt_candidate = copy(prompt)
    random_idx = sample(modifiable_indices)
    random_token = sample(replacements[random_idx])
    prompt_candidate[random_idx] = random_token

    loss = compute_loss(prompt_candidate)
    batch_losses.append((loss, prompt_candidate))

  # Select the candidate with the lowest adversarial loss
  best_loss, best_prompt = min(batch_losses, key=lambda pair: pair[0])
  prompt = best_prompt

return prompt
# Universal prompt optimization

# Hyperparameters
num_iterations = 500
num_prompts = 25
num_models = 2
top_k = 256
batch_size = 512

# Harmful behavior prompts, aligned models, and a single shared adversarial suffix
prompts = [load_harmful_prompt(i) for i in range(num_prompts)]
models = [load_model(i) for i in range(num_models)]
suffix = initialize_suffix()   # e.g. 20 placeholder tokens
indices = list(range(20))      # Positions of the suffix tokens

num_prompts_to_use = 1  # Start with just 1 prompt

for i in range(num_iterations):

  # Aggregate gradients over all models and the currently active prompts
  replacements = {}
  for idx in indices:
    gradients = [get_token_gradient(model, prompt, suffix, idx)
                 for model in models
                 for prompt in prompts[:num_prompts_to_use]]
    combined_grad = aggregate(gradients)
    replacements[idx] = get_top_k(-combined_grad, top_k)

  # Evaluate candidate substitutions and update the suffix (as in single-prompt GCG)
  ...

  # Incrementally add more prompts once the suffix succeeds on the current ones
  if suffix_succeeds_on_prompts:
     num_prompts_to_use += 1

return suffix
```

The key additions are:
- Additional details like hyperparameters, sampling methods, etc.
- Incrementally adding more prompts during universal optimization.
- Using multiple models and aggregating gradients (a brief sketch of one possible aggregation follows this list).
- Additional helper functions for computing gradients, getting top-k, etc.
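
One simple way to realize the `aggregate` step, assuming every model shares the same tokenizer and hence the same vocabulary (as the Vicuna variants used in the paper do), is to normalize each gradient vector and average them. This is an illustrative choice rather than necessarily the paper's exact scheme:

```python
import torch

def aggregate(gradients):
    """Illustrative aggregation: normalize each per-model/per-prompt gradient
    vector (all assumed to share one vocabulary) and average them into a
    single score vector of shape [vocab_size]."""
    normalized = [g / (g.norm() + 1e-8) for g in gradients]
    return torch.stack(normalized).mean(dim=0)
```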

Let me know if you would like me to explain or expand on any part of this in more detail!
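
Finally, a rough sketch of the `suffix_succeeds_on_prompts` check: the paper's evaluation treats an attack as successful roughly when the model's response does not fall back to a standard refusal, so a simple version generates a completion and looks for refusal phrases. The phrase list and helper below are abbreviated and assumed, not the paper's exact implementation:

```python
import torch

# Abbreviated, assumed list of refusal openers (the paper uses a longer list)
REFUSAL_PHRASES = ["I'm sorry", "I cannot", "I apologize", "As an AI"]

def suffix_succeeds_on(model, tokenizer, prompt, suffix, max_new_tokens=64):
    """Generate a completion for prompt + suffix and treat the attack as
    successful if the response does not contain a refusal phrase."""
    input_ids = tokenizer(f"{prompt} {suffix}", return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens,
                                    do_sample=False)
    response = tokenizer.decode(output_ids[0, input_ids.shape[1]:],
                                skip_special_tokens=True)
    return not any(phrase in response for phrase in REFUSAL_PHRASES)
```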