A LLM Assisted Exploitation of AI-Guardian

Author: Nicholas Carlini

Abstract: Large language models (LLMs) are now highly capable at a diverse range of tasks. This paper studies whether or not GPT-4, one such LLM, is capable of assisting researchers in the field of adversarial machine learning. As a case study, we evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023, a top computer security conference. We completely break this defense: the proposed scheme does not increase robustness compared to an undefended baseline. We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance. This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done. We conclude by discussing (1) the warning signs present in the evaluation that suggested to us AI-Guardian would be broken, and (2) our experience with designing attacks and performing novel research using the most recent advances in language modeling.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper evaluates whether GPT-4 can break AI-Guardian, a defense to adversarial examples published at IEEE S&P 2023.

  • The author shows that GPT-4, following human instructions, can implement attacks that completely break AI-Guardian, reducing its claimed robustness from 98% down to just 8%.

Why:

  • Studying whether large language models like GPT-4 can assist researchers in adversarial ML is an interesting test case for their capabilities.

  • AI-Guardian claimed very high robustness, making it a good case study for using GPT-4 to generate attacks.

  • Many defenses published at top venues are quickly broken, so the author wanted to see whether GPT-4 could break a new state-of-the-art defense.

How:

  • The author instructed GPT-4 to implement the full attack in stages: recovering the mask, the permutation, and the backdoor pattern.

  • GPT-4 produced the code implementing the attacks based on high-level instructions and guidance from the author.

  • The attacks first extract the secret components of AI-Guardian’s defense, which are then used to construct successful adversarial examples.

  • The process demonstrated that GPT-4 can act as an AI assistant to automate coding tasks in novel research, given sufficient human guidance.

In summary, the paper shows that GPT-4 can break a state-of-the-art defense when instructed to implement the attacks, demonstrating its potential as an AI research assistant. The attacks extract the secret components of AI-Guardian’s defense in order to completely subvert its robustness claims.

Main Contributions

Based on my summary, the main contributions of this paper appear to be:

  • Demonstrating that the large language model GPT-4 can be used to successfully implement attacks to break defenses in adversarial machine learning, when provided with sufficient guidance and instructions from a human researcher.

  • Providing a case study of using GPT-4 to break a recent state-of-the-art defense called AI-Guardian that was published at a top security venue (IEEE S&P 2023).

  • Designing a set of attacks that are able to completely subvert the robustness claims of AI-Guardian, reducing its claimed robustness from 98% down to just 8% under the threat model considered in the original paper.

  • Extracting the secret components of AI-Guardian’s defense, including the mask, permutation and backdoor pattern, in order to enable successful adversarial examples.

  • Evaluating GPT-4’s capabilities as an “AI research assistant” that can automate coding tasks to implement attacks, given high-level guidance.

  • Discussing the process of using GPT-4 for research, including its strengths at automating simple coding tasks as well as its limited adaptability and its need for human insight.

  • Providing insights into warning signs that a defense may be vulnerable based on issues with the evaluation methodology.

  • Suggesting ways to improve defense evaluations to avoid repeating similar situations where defenses with serious flaws are accepted at top venues.

So in summary, the main contributions are demonstrating GPT-4’s utility for adversarial ML research when guided by a human, breaking the AI-Guardian defense, and discussing the implications.

Method Section

Here is a summary of the method section from the paper:

The author breaks the AI-Guardian defense in three main stages:

  1. Mask recovery:
  • The defense transforms inputs using a mask M.
  • Pixels where M=0 are ignored by the model.
  • The attack checks whether two inputs that differ in a single pixel produce identical logits; if so, that pixel is masked out.
  • By checking all pixels, the mask M can be reconstructed.
  2. Permutation recovery:
  • The defense also permutes the labels using a secret permutation P.
  • The attack constructs universal perturbations that try to flip each source class to every possible target class.
  • Flipping a class y to its permuted target P(y) is easier than flipping it to any other target.
  • Assigning each source class its easiest target therefore recovers P.
  3. Backdoor recovery:
  • The defense fills the pixels where M=0 with a fixed pattern Z.
  • Given M and P, Z can be recovered by gradient descent so that transformed inputs match the defended model’s predictions.
  • This recreates the full backdoor transformation t(x).

Once M, P and Z are recovered, the attacks bypass the defense by transforming inputs using t(x) before perturbing. This fools the defended model and breaks AI-Guardian.

The method thus proceeds in three stages that extract the secret components of the defense, which are then used to construct successful adversarial examples.
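
As a concrete reference for the transformation used throughout the attack, here is a minimal Python sketch of t(x). It assumes the mask and pattern have already been recovered as arrays with entries in [0, 1]; the function name and shapes are illustrative and not taken from the paper’s code.

import numpy as np

def backdoor_transform(x, mask, pattern):
    # Keep the pixels where mask == 1 and overwrite the pixels where
    # mask == 0 with the recovered backdoor pattern Z.
    return mask * x + (1 - mask) * pattern

# Toy usage with placeholder shapes (not from the paper):
# x = np.random.rand(3, 32, 32)
# mask, pattern = np.ones((3, 32, 32)), np.zeros((3, 32, 32))
# x_backdoored = backdoor_transform(x, mask, pattern)

The pseudocode in the sections below perturbs this transformed input rather than the raw image, which is what lets the attack bypass the defense.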

High-Level Pseudo Code

Here is high-level pseudocode for the key steps of the method described in the paper:

# Mask recovery: a pixel is masked out if changing it never changes the output
mask = zeros(image_width * image_height)
for i in range(image_width * image_height):
  x = zero_image()
  x[i] = 1
  if model(x) == model(zero_image()):
    mask[i] = 0   # flipping pixel i leaves the logits unchanged, so it is ignored
  else:
    mask[i] = 1   # pixel i influences the output, so it is kept
 
# Permutation recovery: the easiest target class for each source class is P(src)
permutation = {}
for src_class in all_classes:
  best_tgt, best_loss = None, inf
  for tgt_class in all_classes:
    delta = construct_universal_perturbation(src_class, tgt_class)
    if loss(delta) < best_loss:
      best_tgt, best_loss = tgt_class, loss(delta)
  permutation[src_class] = best_tgt
 
# Backdoor recovery: fit the pattern by gradient descent
pattern = random_init()
for img in dataset:
  perturbed_img = mask * img + (1-mask) * pattern
  loss = ||model(perturbed_img) - model(img)||
  update pattern to minimize loss
 
# Attack: apply the recovered transform t(x), then perturb as usual
x_adv = input_img
x_adv = mask * x_adv + (1-mask) * pattern
x_adv = perturb(x_adv) 
output = model(x_adv)

The pseudocode shows the key steps to:

  1. Recover the mask by checking which single-pixel changes leave the model’s output unchanged (a batched sketch of this step follows this list).
  2. Recover the permutation by constructing targeted universal perturbations.
  3. Recover the backdoor pattern using gradient descent.
  4. Craft adversarial examples using the recovered components to attack the model.
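
To make step 1 concrete, here is a small batched PyTorch sketch of the mask-recovery probe. It assumes only a deterministic model callable that maps a batch of images to logits; the function name recover_mask, the shapes, and the batch size are illustrative assumptions rather than the paper’s implementation, and every tensor element is treated as a separate “pixel”, mirroring the flat loop above.

import math
import torch

def recover_mask(model, image_shape, batch_size=256):
    num_pixels = math.prod(image_shape)
    mask = torch.ones(num_pixels)

    with torch.no_grad():
        # Logits of the all-zero image serve as the reference output.
        base_logits = model(torch.zeros((1, *image_shape)))

        for start in range(0, num_pixels, batch_size):
            idx = torch.arange(start, min(start + batch_size, num_pixels))
            # One probe image per pixel: all zeros except a single pixel set to 1.
            probes = torch.zeros((len(idx), num_pixels))
            probes[torch.arange(len(idx)), idx] = 1.0
            logits = model(probes.view(len(idx), *image_shape))
            # If flipping the pixel leaves the logits unchanged, that pixel is masked out.
            unchanged = torch.isclose(logits, base_logits).all(dim=1)
            mask[idx[unchanged]] = 0.0

    return mask.view(image_shape)

Batching the probes is purely an efficiency choice; the logic is the same one-pixel-at-a-time check shown in the pseudocode above.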

Detailed Pseudo Code

Here is more detailed pseudocode to implement the key steps of the method:

# Mask recovery: compare logits of single-pixel images against the all-zero image
mask = zeros(image_shape)
base_logits = model(zero_image())

for i in range(image_width):
  for j in range(image_height):
    x = zero_image()
    x[i,j] = 1

    if all(model(x) == base_logits):   # output unchanged -> pixel (i,j) is masked out
      mask[i,j] = 0
    else:
      mask[i,j] = 1
      
# Permutation recovery: the target reached with the lowest loss is P(src)
permutation = {}
for src_class in all_classes:
  min_loss = inf
  min_tgt = None
  for tgt_class in all_classes:
    # universal perturbation that tries to move all src_class images to tgt_class
    delta = construct_universal_perturbation(src_class, tgt_class, steps=10)
    # average attack loss over images of the source class after applying delta
    loss = mean(cross_entropy(model(x + delta), tgt_class) for x in images_of(src_class))

    if loss < min_loss:
      min_loss = loss
      min_tgt = tgt_class

  permutation[src_class] = min_tgt
   
# Backdoor recovery: fit the pattern so transformed images receive the permuted label
pattern = random_noise_image()

for img, label in dataset:
  # apply the candidate backdoor transform with the recovered mask
  masked_img = mask * img + (1-mask) * pattern

  loss = cross_entropy(model(masked_img), permutation[label])
  loss.backward()

  pattern -= learning_rate * pattern.grad
  pattern.grad.zero_()   # reset the gradient before the next step
  
# Attack: apply the recovered transform t(x), then run PGD toward the target label
x_adv = copy(input_img)
x_adv = mask * x_adv + (1-mask) * pattern   # recovered backdoor transform t(x)
x_start = copy(x_adv)

for i in range(num_steps):
  loss = cross_entropy(model(x_adv), target_label)
  loss.backward()

  x_adv -= learning_rate * x_adv.grad       # descend to make the target label more likely
  x_adv.grad.zero_()
  x_adv = clip(x_adv, x_start - eps, x_start + eps)   # stay within the eps-ball

output = model(x_adv)

The pseudocode includes additional details like:

  • Using two nested loops to construct the mask.
  • Measuring loss to identify easiest target classes.
  • Updating the backdoor pattern using gradient descent.
  • Crafting adversarial examples with PGD using the recovered components.

This provides a more detailed walkthrough of how the attack could be implemented.
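
To show what a working version of the final step might look like, here is a short PyTorch sketch of PGD run on top of the recovered transform. It assumes model returns logits for a batch of images in [0, 1], target_label is a 1-D tensor of target class indices, and mask and pattern are the tensors recovered in the earlier stages; the function name, epsilon, step size, and projection details are illustrative choices rather than the paper’s exact implementation.

import torch
import torch.nn.functional as F

def attack_with_recovered_backdoor(model, x, mask, pattern, target_label,
                                   eps=8 / 255, step_size=2 / 255, num_steps=40):
    # Apply the recovered backdoor transform t(x), then perturb around that point.
    x_start = (mask * x + (1 - mask) * pattern).detach()
    x_adv = x_start.clone()

    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target_label)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Step toward the target label by minimizing its cross-entropy loss.
            x_adv = x_adv - step_size * grad.sign()
            # Project into the eps-ball around the transformed input and the valid pixel range.
            x_adv = torch.max(torch.min(x_adv, x_start + eps), x_start - eps)
            x_adv = x_adv.clamp(0.0, 1.0)

    return x_adv.detach()

The key point is that once t(x) has been applied with the recovered mask and pattern, the rest is a standard targeted PGD attack; the hyperparameters above are placeholders.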