AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models

Authors: Xuelong Dai, Kaisheng Liang, Bin Xiao

Abstract: Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques: because they can effectively bypass defense mechanisms, they pose severe security problems for deep learning applications. However, previous attack methods often rely on Generative Adversarial Networks (GANs), which lack theoretical guarantees and tend to produce unrealistic examples once adversarial objectives are incorporated, especially on large-scale datasets like ImageNet. In this paper, we propose a new method, called AdvDiff, to generate unrestricted adversarial examples with diffusion models. We design two novel adversarial guidance techniques that perform adversarial sampling in the reverse generation process of diffusion models. These techniques are effective and stable, generating high-quality, realistic adversarial examples by interpretably integrating the gradients of the target classifier. Experimental results on the MNIST and ImageNet datasets demonstrate that AdvDiff is effective at generating unrestricted adversarial examples and outperforms GAN-based methods in attack performance and generation quality.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper proposes a new method called AdvDiff to generate unrestricted adversarial examples using diffusion models.

  • Unrestricted adversarial examples are a more aggressive form of attack: they fool classifiers without being constrained to stay close to an original input.

  • Previous unrestricted attack methods use GANs, but have issues with poor interpretability, unrealistic examples, and instability.

Why:

  • Diffusion models have advantages over GANs, such as better interpretability, training stability, and image quality, which may make them more suitable for unrestricted attacks.

  • The paper provides theoretical analysis to show diffusion models can generate adversarial examples by modifying the conditional sampling process.

How:

  • AdvDiff uses two novel adversarial guidance techniques during diffusion model sampling:

    1. Adversarial guidance pushes the sample towards the adversarial target label.
    2. Noise sampling guidance adds an adversarial prior to the initial noise.
  • These guidance techniques enable high-quality, realistic adversarial examples without retraining the diffusion model.

  • Experiments on MNIST and ImageNet show AdvDiff outperforms GAN-based attacks, especially on ImageNet where it achieves much higher image quality.

In summary, the key ideas are using diffusion models rather than GANs for unrestricted attacks, and new guidance techniques to craft adversarial examples via the sampling process. Experiments demonstrate the effectiveness of the proposed AdvDiff method.

Main Contributions

Here are the main contributions of this paper:

  • Proposes AdvDiff, the first unrestricted adversarial attack method using diffusion models instead of GANs.

  • Designs two novel adversarial guidance techniques that incorporate adversarial objectives into the diffusion sampling process without retraining the model.

  • Provides theoretical analysis to show how AdvDiff can generate adversarial examples by increasing the likelihood of the target label during conditional sampling.

  • Demonstrates through experiments that AdvDiff achieves much higher attack success rates and image quality than GAN-based methods, especially on complex datasets like ImageNet.

  • Shows diffusion models are more suitable than GANs for unrestricted attacks due to their stability, interpretability and image quality.

  • Introduces a new threat model of using diffusion models for adversarial attacks that needs further research to defend against.

  • Analysis indicates that AdvDiff's perturbations remain imperceptible because they are spread over multiple sampling steps rather than applied as a single large perturbation.

  • Ablation studies analyze the impact of different parameters on attack performance and image quality.

In summary, the key contributions are proposing AdvDiff as a new diffusion-model-based unrestricted attack method, designing adversarial guidance techniques, theoretical analysis, extensive experiments demonstrating superiority over GANs, and highlighting diffusion models as a new adversarial threat.

Method Section

Here is a summary of the method section from the paper:

The key goal is to generate unrestricted adversarial examples using diffusion models rather than GANs.

Two main techniques are proposed:

Adversarial Guidance:

  • Performs conditional sampling from the diffusion model to generate an image with label y.
  • At each timestep t, adds a small adversarial perturbation computed from the classifier's gradient with respect to the target attack label y_a (see the update rule after this list).
  • This gradually shifts the sample towards y_a while preserving image quality.
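
Concretely, in notation consistent with the pseudo code later in this summary, the per-step update can be written as

$$x_{t-1}^{*} = x_{t-1} + s \, \sigma_t^{2} \, \nabla_{x_{t-1}} \log p_f(y_a \mid x_{t-1}),$$

where p_f(y_a | x_{t-1}) is the target classifier's predicted probability of the attack label and sigma_t is the sampling noise scale at timestep t.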

Noise Sampling Guidance:

  • Modifies the initial noise input to the diffusion model by incorporating the classifier's adversarial gradient at the final generated image x_0 (see the update rule after this list).
  • Increases the likelihood of y_a in the noise prior, following Bayes' theorem.
  • Provides a strong adversarial signal for the full sampling process.
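
Analogously, once a full reverse pass has produced the final image x_0, the noise prior is updated as

$$x_T \leftarrow x_T + a \, \sigma_T^{2} \, \nabla_{x_0} \log p_f(y_a \mid x_0),$$

so the next sampling round starts from noise that already favors the target label y_a (again in notation consistent with the pseudo code below).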

The adversarial sampling process is done at test time without retraining the diffusion model. The pre-trained conditional diffusion model allows flexible class-conditional generation.
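
Concretely, the conditional sampling relies on standard classifier-free guidance: the sampler combines conditional and unconditional noise predictions with a guidance weight w, exactly as in the detailed pseudo code below,

$$\tilde{\epsilon}_\theta(x_t, y) = (1 + w) \, \epsilon_\theta(x_t, y) - w \, \epsilon_\theta(x_t),$$

which only requires a diffusion model already trained with label dropout; no adversarial retraining or fine-tuning is involved.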

The full algorithm runs the reverse diffusion process for up to N noise-sampling rounds, applies adversarial guidance at every timestep, and updates the initial noise based on the previous round's result.

This yields realistic, high-quality adversarial examples that fool the classifier by slightly shifting the conditional data distribution during the diffusion model's stable sampling process.

In summary, the key ideas are adversarial guidance to tweak the conditional sampling, and noise guidance to inject an adversarial prior. Together these craft high-fidelity unrestricted adversarial examples using diffusion models.

High-Level Pseudo Code

Here is high-level pseudo code for the AdvDiff method proposed in the paper:

# Adversarial diffusion sampling 
 
# Inputs:
y_a = target attack label  
y = ground truth label
s, a = adversarial guidance scales
N = noise sampling steps
T = diffusion timesteps 
 
x_T ~ Gaussian noise   # sampled once, then refined by noise sampling guidance

for i in range(N):

  x_t = x_T

  for t in (T, T-1, ..., 1):

    # Classifier-free conditional sampling step
    x_prev = diffusion_model_sampling(x_t, y, t)

    # Adversarial guidance: nudge the sample towards y_a
    adv_grad = gradient(classifier(x_prev), y_a)
    x_prev += s * sigma_t**2 * adv_grad

    x_t = x_prev

  # Final generated image
  x_0 = x_t

  # Noise sampling guidance: inject an adversarial prior into x_T
  adv_grad = gradient(classifier(x_0), y_a)
  x_T += a * sigma_T**2 * adv_grad

  # Check if adversarial
  if classifier(x_0).argmax() == y_a:
    return x_0   # adversarial example x_adv

In summary, the key steps are:

  1. Sample Gaussian noise x_T

  2. Diffusion sampling + adversarial guidance

  3. Noise sampling guidance with adv gradient

  4. Repeat and return adversarial example

The adversarial and noise guidance shifts the sampling towards the target label y_a to craft a realistic adversarial example.

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the AdvDiff algorithm from the paper:

import torch
import torch.nn.functional as F

# Inputs
y_a = target attack label
y = ground truth label (used for conditional generation)
f = target classifier model
s, a = adversarial / noise-sampling guidance scales
w = classifier-free guidance weight
N = noise sampling steps
T = diffusion timesteps

# Pre-trained conditional diffusion model (epsilon-prediction network)
diffusion = DiffusionModel()

def adv_grad(f, x, y_a):
    # Gradient of log p_f(y_a | x) with respect to the input x
    x = x.detach().requires_grad_(True)
    log_prob = F.log_softmax(f(x), dim=1)[:, y_a].sum()
    return torch.autograd.grad(log_prob, x)[0]

# Initialize the noise once; noise sampling guidance refines it between rounds
x_T = torch.randn(batch_size, channels, height, width)
x_adv = None

for i in range(N):

    x_t = x_T

    for t in reversed(range(1, T + 1)):

        # Timestep-dependent schedule parameters
        beta_t = diffusion.betas[t]
        alpha_t = 1.0 - beta_t
        alpha_bar_t = diffusion.alpha_bars[t]  # cumulative product of the alphas
        sigma_t = diffusion.sigmas[t]

        # Classifier-free guidance on the predicted noise
        epsilon_t = (1 + w) * diffusion.model(x_t, y, t) - w * diffusion.model(x_t, None, t)

        # DDPM reverse step: posterior mean plus Gaussian noise
        mean = (x_t - beta_t / (1 - alpha_bar_t) ** 0.5 * epsilon_t) / alpha_t ** 0.5
        x_prev = mean + sigma_t * torch.randn_like(x_t)

        # Adversarial guidance towards the target label y_a
        x_prev = x_prev + s * sigma_t ** 2 * adv_grad(f, x_prev, y_a)

        x_t = x_prev.detach()

    x_0 = x_t  # final generated image

    # Noise sampling guidance: update the noise prior with the gradient at x_0
    x_T = x_T + a * diffusion.sigmas[T] ** 2 * adv_grad(f, x_0, y_a)

    # Check if adversarial (assumes batch size 1)
    if f(x_0).argmax() == y_a:
        x_adv = x_0
        break

return x_adv

The key steps are classifier-free sampling, adding adversarial guidance at every timestep, and updating the noise prior. This produces a realistic adversarial example x_adv whose adversarial changes are spread imperceptibly across the sampling steps.
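
As a usage illustration (the specific classifier is an arbitrary choice, not taken from the paper), the target model f can be any differentiable classifier; for an ImageNet attack one could plug in a pretrained torchvision ResNet-50, frozen so that gradients flow only to the generated image:

from torchvision.models import ResNet50_Weights, resnet50

# Illustrative target classifier for an ImageNet attack
f = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
for p in f.parameters():
    p.requires_grad_(False)  # gradients are taken w.r.t. the input image, not the weights

# Note: the diffusion model's output range (e.g. [-1, 1]) typically has to be
# rescaled to the classifier's expected preprocessing before calling f.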