AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models
Authors: Xuelong Dai, Kaisheng Liang, Bin Xiao
Abstract: Unrestricted adversarial attacks present a serious threat to deep learning models and adversarial defense techniques. They pose severe security problems for deep learning applications because they can effectively bypass defense mechanisms. However, previous attack methods often rely on Generative Adversarial Networks (GANs), whose way of incorporating adversarial objectives is not theoretically grounded and therefore yields unrealistic examples, especially on large-scale datasets like ImageNet. In this paper, we propose a new method, called AdvDiff, to generate unrestricted adversarial examples with diffusion models. We design two novel adversarial guidance techniques to conduct adversarial sampling in the reverse generation process of diffusion models. These two techniques are effective and stable in generating high-quality, realistic adversarial examples by interpretably integrating the gradients of the target classifier. Experimental results on the MNIST and ImageNet datasets demonstrate that AdvDiff is effective at generating unrestricted adversarial examples and outperforms GAN-based methods in both attack performance and generation quality.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a new method called AdvDiff to generate unrestricted adversarial examples using diffusion models.
- Unrestricted adversarial examples are a more aggressive form of attack that fool classifiers without being constrained to stay close to the original input.
- Previous unrestricted attack methods use GANs, but have issues with poor interpretability, unrealistic examples, and instability.
Why:
- Diffusion models have advantages over GANs like better interpretability, stability, and image quality, so they may be more suitable for unrestricted attacks.
- The paper provides theoretical analysis to show diffusion models can generate adversarial examples by modifying the conditional sampling process.
How:
- AdvDiff uses two novel adversarial guidance techniques during diffusion model sampling:
  - Adversarial guidance pushes the sample towards the adversarial target label.
  - Noise sampling guidance adds an adversarial prior to the initial noise.
- These guidance techniques enable high-quality, realistic adversarial examples without retraining the diffusion model.
- Experiments on MNIST and ImageNet show AdvDiff outperforms GAN-based attacks, especially on ImageNet, where it achieves much higher image quality.
In summary, the key ideas are using diffusion models rather than GANs for unrestricted attacks, and new guidance techniques to craft adversarial examples via the sampling process. Experiments demonstrate the effectiveness of the proposed AdvDiff method.
Main Contributions
Here are the main contributions of this paper:
- Proposes AdvDiff, the first unrestricted adversarial attack method using diffusion models instead of GANs.
- Designs two novel adversarial guidance techniques that incorporate adversarial objectives into the diffusion sampling process without retraining the model.
- Provides theoretical analysis to show how AdvDiff can generate adversarial examples by increasing the likelihood of the target label during conditional sampling.
- Demonstrates through experiments that AdvDiff achieves much higher attack success rates and image quality than GAN-based methods, especially on complex datasets like ImageNet.
- Shows diffusion models are more suitable than GANs for unrestricted attacks due to their stability, interpretability, and image quality.
- Introduces a new threat model of using diffusion models for adversarial attacks that needs further research to defend against.
- Analysis indicates that AdvDiff's perturbations remain imperceptible because they are spread across multiple sampling steps rather than applied as one large perturbation.
- Ablation studies analyze the impact of different parameters on attack performance and image quality.
In summary, the key contributions are proposing AdvDiff as a new diffusion-model-based unrestricted attack method, designing adversarial guidance techniques, theoretical analysis, extensive experiments demonstrating superiority over GANs, and highlighting diffusion models as a new adversarial threat.
Method Section
Here is a summary of the method section from the paper:
The key goal is to generate unrestricted adversarial examples using diffusion models rather than GANs.
Two main techniques are proposed:
Adversarial Guidance:
- Performs conditional sampling from the diffusion model to generate an image with ground-truth label y.
- At each timestep t, adds a small adversarial perturbation computed from the gradient of the classifier's confidence in the target attack label y_a (a minimal code sketch follows this list).
- This gradually shifts the sample towards y_a while preserving image quality.
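A minimal sketch of this per-step update, assuming a PyTorch classifier clf; the function name, guidance scale s, and step noise scale sigma_t are illustrative and not taken from the paper:

import torch
import torch.nn.functional as F

def adversarial_guidance_step(x_prev, y_a, clf, s, sigma_t):
    # Gradient of log p(y_a | x_{t-1}) under the target classifier
    x_prev = x_prev.detach().requires_grad_(True)
    log_p = F.log_softmax(clf(x_prev), dim=-1)[:, y_a].sum()
    grad = torch.autograd.grad(log_p, x_prev)[0]
    # Shift the intermediate sample slightly towards the target label y_a
    return (x_prev + s * sigma_t ** 2 * grad).detach()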
Noise Sampling Guidance:
- Modifies the initial noise input to the diffusion model by incorporating the adversarial gradient computed from the classifier's output on the final generated image.
- Increases the likelihood of y_a in the noise prior using Bayes' theorem.
- Provides a strong adversarial signal for the full sampling process (sketched below).
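A matching sketch of the noise-prior update under the same assumptions; the function name, scale a, and variance term sigma are illustrative:

import torch
import torch.nn.functional as F

def noise_sampling_guidance(x_T, x_0, y_a, clf, a, sigma):
    # Gradient of log p(y_a | x_0) with respect to the final generated image
    x_0 = x_0.detach().requires_grad_(True)
    log_p = F.log_softmax(clf(x_0), dim=-1)[:, y_a].sum()
    grad = torch.autograd.grad(log_p, x_0)[0]
    # Inject the adversarial prior into the initial noise for the next round
    return (x_T + a * sigma ** 2 * grad).detach()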
The adversarial sampling process is done at test time without retraining the diffusion model. The pre-trained conditional diffusion model allows flexible class-conditional generation.
The full algorithm iterates over multiple noise samples: it adds adversarial guidance at every reverse timestep and, when the attack has not yet succeeded, updates the noise prior based on the previous result.
This achieves realistic, high-quality adversarial examples that fool the classifier by slightly shifting the conditional data distribution during the diffusion model's stable sampling process.
In summary, the key ideas are adversarial guidance to tweak the conditional sampling, and noise guidance to inject an adversarial prior. Together these craft high-fidelity unrestricted adversarial examples using diffusion models.
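Viewed through the classifier-guidance lens, both updates can be read as sampling from a target-conditioned reverse process; a hedged sketch of the relation (notation follows the pseudo code below, not necessarily the paper's exact formulation):

$$
p(x_{t-1} \mid x_t, y, y_a) \;\propto\; p(x_{t-1} \mid x_t, y)\, p(y_a \mid x_{t-1}),
\qquad
x_{t-1}^{*} \;=\; x_{t-1} + s\,\sigma_t^{2}\,\nabla_{x_{t-1}} \log p_f(y_a \mid x_{t-1}).
$$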
High-Level Pseudo Code
Here is high-level pseudo code for the AdvDiff method proposed in the paper:
# Adversarial diffusion sampling
# Inputs:
#   y_a  = target attack label
#   y    = ground-truth label
#   s, a = adversarial / noise-sampling guidance scales
#   N    = noise sampling steps
#   T    = diffusion timesteps

x_T ~ Gaussian noise
for i in range(N):
    for t in (T, T-1, ..., 1):
        # Classifier-free sampling step conditioned on y
        x_{t-1} = diffusion_model_sampling(x_t, y)
        # Adversarial guidance towards the target label
        adv_grad = gradient(classifier(x_{t-1}), y_a)
        x_{t-1} += s * sigma_t^2 * adv_grad
    # After the loop, the final sample is the fully denoised image x_0
    # Check if adversarial
    if classifier(x_0) == y_a:
        return x_0  # adversarial example x_adv
    # Noise sampling guidance: update the noise prior for the next round
    adv_grad = gradient(classifier(x_0), y_a)
    x_T += a * sigma^2 * adv_grad
In summary, the key steps are:
- Sample Gaussian noise x_T
- Run diffusion sampling with adversarial guidance at every timestep
- If the result is not adversarial, apply noise sampling guidance to x_T using the adversarial gradient
- Repeat and return the adversarial example
The adversarial and noise guidance shift the sampling towards the target label y_a to craft a realistic adversarial example.
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the AdvDiff algorithm from the paper:
# Inputs
#   y_a  = target attack label
#   y    = ground-truth label (conditioning label)
#   f    = target classifier model
#   w    = classifier-free guidance scale
#   s, a = adversarial / noise-sampling guidance scales
#   N    = noise sampling steps
#   T    = diffusion timesteps

import torch
import torch.nn.functional as F

# Pre-trained conditional diffusion model (placeholder interface)
diffusion = DiffusionModel()

def log_prob_grad(f, x, y_a):
    # Gradient of log p_f(y_a | x) with respect to x
    x = x.detach().requires_grad_(True)
    log_p = F.log_softmax(f(x), dim=-1)[:, y_a].sum()
    return torch.autograd.grad(log_p, x)[0]

def advdiff(f, y, y_a, w, s, a, N, T, shape):
    # Initialize the noise once; noise sampling guidance refines it across rounds
    x_T = torch.randn(*shape)
    x_adv = None

    for i in range(N):
        x_t = x_T
        for t in reversed(range(T)):
            # Timestep parameters
            beta_t = diffusion.betas[t]
            alpha_t = diffusion.alphas[t]
            alpha_bar_t = diffusion.alphas_cumprod[t]
            sigma_t = diffusion.sigmas[t]

            # Classifier-free guided noise prediction
            with torch.no_grad():
                eps = (1 + w) * diffusion.model(x_t, t, y) - w * diffusion.model(x_t, t)

            # Reverse (denoising) step: DDPM posterior mean plus noise
            mean = (x_t - beta_t / (1 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5
            x_prev = mean + sigma_t * torch.randn_like(x_t)

            # Adversarial guidance: nudge the sample towards the target label y_a
            x_prev = x_prev + s * sigma_t ** 2 * log_prob_grad(f, x_prev, y_a)

            x_t = x_prev

        x_0 = x_t

        # Stop as soon as the generated image fools the classifier
        if f(x_0).argmax(dim=-1).item() == y_a:
            x_adv = x_0
            break

        # Noise sampling guidance: inject an adversarial prior into the noise x_T
        x_T = x_T + a * diffusion.sigmas[0] ** 2 * log_prob_grad(f, x_0, y_a)

    return x_adv
The key steps are classifier-free sampling, per-step adversarial guidance, and updating the noise prior. Together they generate a realistic adversarial example x_adv whose adversarial perturbation remains imperceptible.
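For orientation, a hypothetical invocation of the advdiff sketch above could look like the following; the ResNet-50 classifier, label indices, image shape, and hyperparameter values are illustrative assumptions rather than settings from the paper:

import torch
import torchvision.models as models

# Any ImageNet classifier works as the attack target; ResNet-50 is just an example
f = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

x_adv = advdiff(
    f=f,
    y=torch.tensor([207]),   # conditioning label for generation (e.g. golden retriever)
    y_a=254,                 # target attack label (e.g. pug)
    w=0.3, s=0.5, a=1.0,     # guidance scales (illustrative values)
    N=10, T=1000,            # noise sampling rounds and diffusion timesteps
    shape=(1, 3, 224, 224),
)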