Diffusion Models for Imperceptible and Transferable Adversarial Attack
Authors: Jianqi Chen, Hao Chen, Keyan Chen, Yilan Zhang, Zhengxia Zou, Zhenwei Shi
Abstract: Many existing adversarial attacks generate Lp-norm perturbations on the image RGB space. Despite some achievements in transferability and attack success rate, the crafted adversarial examples are easily perceived by human eyes. Towards visual imperceptibility, some recent works explore unrestricted attacks without Lp-norm constraints, yet they lack transferability when attacking black-box models. In this work, we propose a novel imperceptible and transferable attack by leveraging both the generative and discriminative power of diffusion models. Specifically, instead of direct manipulation in pixel space, we craft perturbations in the latent space of diffusion models. Combined with well-designed content-preserving structures, we can generate human-insensitive perturbations embedded with semantic clues. For better transferability, we further “deceive” the diffusion model, which can be viewed as an additional recognition surrogate, by distracting its attention away from the target regions. To our knowledge, our proposed method, DiffAttack, is the first that introduces diffusion models into the adversarial attack field. Extensive experiments on various model structures (including CNNs, Transformers, MLPs) and defense methods have demonstrated our superiority over other attack methods.
What, Why and How
This paper proposes a new adversarial attack method called DiffAttack that generates imperceptible and transferable adversarial examples by leveraging diffusion models.
What:
- DiffAttack is an unrestricted adversarial attack that operates in the latent space of diffusion models instead of directly manipulating pixels.
Why:
- Diffusion models can generate natural images conforming to human perception, satisfying imperceptibility.
- Pretrained diffusion models are powerful recognition models, so fooling them enhances transferability.
How:
- Invert images to latent space with DDIM inversion (a minimal sketch of this step is given after the summary below).
- Optimize the latent to fool the surrogate classifier and to distract the diffusion model's attention.
- Use self-attention maps to preserve image structure.
- Control the DDIM inversion strength to balance preserved semantics against attack space.
In summary, DiffAttack exploits the generative and discriminative power of diffusion models to craft adversarial examples with good imperceptibility and transferability. It perturbs latents rather than pixels for more semantic and natural perturbations. Specific designs are introduced to fool strong diffusion models and preserve content structure. Experiments demonstrate superiority over existing attacks.
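To make the inversion step concrete, here is a minimal sketch of deterministic DDIM inversion. It assumes access to a noise-prediction network eps_model(x, t) and a cumulative-alpha schedule alpha_bar; these names are illustrative interfaces, not the paper's API.

import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alpha_bar, num_steps):
    # Deterministic DDIM inversion: map a clean image x0 to a noisy latent.
    # eps_model(x, t) predicts noise; alpha_bar is a 1-D tensor of cumulative
    # alphas indexed by timestep (assumed interfaces, not the paper's code).
    timesteps = torch.linspace(0, len(alpha_bar) - 1, num_steps + 1).long()
    x = x0
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alpha_bar[t], alpha_bar[t_next]
        eps = eps_model(x, t)
        # Predict the clean image implied by the current noise estimate,
        # then re-noise it to the next (larger) timestep along the DDIM path.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # the latent code that DiffAttack then optimizes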
Main Contributions
The main contributions of this paper are:
- They are the first to introduce diffusion models into the field of adversarial attacks and unveil their potential for crafting imperceptible and transferable adversarial examples.
- They propose DiffAttack, a novel unrestricted attack method that perturbs the latent space of diffusion models instead of directly manipulating pixels.
- They leverage the generative power of diffusion models to achieve imperceptibility by generating natural image perturbations.
- They utilize the discriminative power of diffusion models as strong recognition surrogates to enhance transferability by “deceiving” them.
- They design techniques such as self-attention control and inversion-strength adjustment to preserve content structure and semantics.
- They demonstrate through extensive experiments that DiffAttack outperforms existing attacks in terms of imperceptibility and transferability across various model architectures and defense methods.
In summary, the key contributions are proposing DiffAttack, the first diffusion-model-based attack, and showing how to effectively exploit diffusion models' properties to craft adversarial examples with superior imperceptibility and transferability.
Method Section
Here is a summary of the method section:
The overall framework follows recent diffusion image editing methods.
- Invert images to the latent space using DDIM inversion to obtain the latent code x_t.
- Optimize x_t to minimize an attack loss that fools the surrogate classifier; this is the basic attack framework.
- Add a loss that minimizes the variance of the cross-attention maps between x_t and the label text. This distracts the diffusion model's attention, “deceiving” it as a strong recognition surrogate and enhancing transferability.
- Preserve content structure:
  - Use a self-attention distance loss between x_t and a fixed copy x_t(fix) to retain structural information.
  - Control the inversion strength to balance preserved semantics against attack space.
  - Also optimize the unconditional embeddings to ensure good reconstruction.
The overall loss function combines the attack loss, diffusion model deception loss, and structure preservation losses.
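Schematically (a paraphrase of the description above, not the paper's exact notation or weights), the combined objective over the latent x_t can be written in LaTeX as:

\min_{x_t} \; \alpha \big( -\mathrm{CE}(f(\mathcal{D}(x_t)),\, y) \big) \;+\; \beta \, \mathrm{Var}\big(\bar{A}_{\mathrm{cross}}(x_t, C)\big) \;+\; \gamma \, \big\lVert A_{\mathrm{self}}(x_t) - A_{\mathrm{self}}(x_t^{\mathrm{fix}}) \big\rVert_2^2

where D denotes DDIM denoising back to image space, f is the surrogate classifier, Ā_cross is the averaged cross-attention map for the label text C, A_self are the self-attention maps, and α, β, γ are weighting hyperparameters.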
In summary, the key aspects are inverting images for latent edits, deceiving the diffusion model for transferability, and using self-attention plus controlling inversion strength to preserve structure and semantics. The method exploits diffusion models’ generative and discriminative properties for imperceptible and transferable attacks.
High-Level Pseudo Code
Here is the high-level pseudo code for the DiffAttack method:
# Input: clean image x, true label y
# Output: adversarial example x'

# Load pretrained diffusion model G and a surrogate classifier f
# Set label text C from y

# Invert x with DDIM inversion to get the latent code xt
xt = Inverse(x)

# Keep a fixed copy xt_fix for structure retention
xt_fix = copy(xt)

# Optimize xt
for i in iterations:
    # Denoise to get the reconstructed image and the attention maps
    x' = Denoise(xt)
    cross_maps = cross_attn(xt, C)
    self_maps, self_maps_fix = self_attn(xt), self_attn(xt_fix)

    # Losses
    L_attack = -cross_entropy(f(x'), y)              # fool the surrogate classifier
    L_transfer = variance(average(cross_maps))       # distract the diffusion model
    L_structure = L2_dist(self_maps, self_maps_fix)  # preserve content structure

    # Gradient step on the latent
    xt = xt - lr * grad(L_attack + L_transfer + L_structure, xt)

# Return the adversarial example
x' = Denoise(xt)
return x'
The key steps are:
- Invert image x to latent xt
- Define attack, transferability and structure losses
- Iteratively optimize xt w.r.t. these losses
- Reconstruct and return the adversarial example x'
This leverages the diffusion model’s generative and discriminative properties to craft imperceptible and transferable adversarial examples.
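The three loss terms in the pseudo code can be expressed as ordinary PyTorch functions. This is a minimal sketch that assumes the classifier logits, the averaged cross-attention map for the label token, and the self-attention maps of the perturbed and fixed latents have already been extracted; all names and default weights are illustrative.

import torch
import torch.nn.functional as F

def attack_loss(logits, label):
    # Fool the classifier: minimizing the negative cross-entropy pushes the
    # reconstruction away from its true label.
    return -F.cross_entropy(logits, label)

def deceive_loss(cross_attn_map):
    # "Deceive" the diffusion model: a low-variance (near-uniform) cross-attention
    # map means its attention is distracted away from the target object.
    return cross_attn_map.var()

def structure_loss(self_attn, self_attn_fix):
    # Keep the perturbed denoising trajectory structurally close to that of the
    # fixed (unperturbed) latent.
    return F.mse_loss(self_attn, self_attn_fix)

def total_loss(logits, label, cross_attn_map, self_attn, self_attn_fix,
               alpha=1.0, beta=1.0, gamma=1.0):
    # The weights alpha/beta/gamma are placeholders, not the paper's values.
    return (alpha * attack_loss(logits, label)
            + beta * deceive_loss(cross_attn_map)
            + gamma * structure_loss(self_attn, self_attn_fix))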
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the DiffAttack method:
import torch
import torch.nn.functional as F
import diffusion_model as dm   # assumed wrapper around a latent diffusion model

# Hyperparameters (illustrative values)
num_iters = 1000     # attack optimization iterations
lr = 1e-2
alpha = 10           # weight of the attack loss
beta = 10000         # weight of the transferability ("deceive") loss
gamma = 100          # weight of the structure-preservation loss

# Load the pretrained diffusion model and a surrogate classifier
model = dm.load_model('stable-diffusion')
classifier = load_classifier()   # assumed pretrained surrogate classifier

# Input image, its true label, and the corresponding label text prompt
x = load_image('input.png')
y = true_label(x)
c = label_text(y)                # text prompt used for the cross-attention maps

# Invert the image; the number of inversion steps controls the inversion strength
xt = dm.ddim_invert(model, x, steps=5)
xt = xt.detach().requires_grad_(True)

# Fixed copy for the structure loss
xt_fix = xt.detach().clone()

optimizer = torch.optim.AdamW([xt], lr=lr)

for i in range(num_iters):
    optimizer.zero_grad()

    # Denoise the latent back to image space (must stay differentiable w.r.t. xt)
    x_recon = dm.ddim_sample(model, xt, steps=20)

    # Attack loss: negative cross-entropy pushes the prediction away from y
    l_attack = -F.cross_entropy(classifier(x_recon), y)

    # Transferability loss: low variance means the (averaged) cross-attention
    # map is distracted away from the target object
    attn = dm.cross_attention(model, xt, c)
    l_transfer = attn.var()

    # Structure loss: keep self-attention close to that of the fixed latent
    self_attn = dm.self_attention(model, xt)
    self_attn_fix = dm.self_attention(model, xt_fix)
    l_structure = (self_attn - self_attn_fix).norm()

    # Weighted total loss and latent update
    loss = alpha * l_attack + beta * l_transfer + gamma * l_structure
    loss.backward()
    optimizer.step()

# Generate the final adversarial example from the optimized latent
x_adv = dm.ddim_sample(model, xt, steps=20).detach()
Key aspects:
- Invert input image
- DDIM inversion (image → latent) and DDIM denoising (latent → image)
- Compute attack, transferability (cross-attn), and structure losses
- Optimize latent w.r.t. weighted losses
- Generate and return adversarial example
This pseudo code sketches an implementation built on DDIM inversion, the differentiable denoising process, and the three loss formulations.
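Since the dm module above is only a placeholder, the following self-contained toy isolates the optimization mechanics: gradients flow from image-space losses back into the latent through a differentiable decoder. The decoder, classifier, and weights here are stand-ins for illustration only, not Stable Diffusion or the paper's surrogate models.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules: a differentiable "decoder" (latent -> image) and a classifier.
decoder = nn.Sequential(nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8), nn.Sigmoid())
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))

latent = torch.randn(1, 4, 8, 8, requires_grad=True)   # plays the role of xt
latent_fix = latent.detach().clone()                   # plays the role of xt_fix
label = torch.tensor([3])

optimizer = torch.optim.AdamW([latent], lr=1e-2)

for step in range(50):
    optimizer.zero_grad()
    image = decoder(latent)                       # "denoise" the latent into an image
    logits = classifier(image)
    l_attack = -F.cross_entropy(logits, label)    # push the prediction away from label
    l_structure = F.mse_loss(latent, latent_fix)  # crude stand-in for the
                                                  # self-attention distance loss
    loss = l_attack + 100.0 * l_structure         # placeholder weights
    loss.backward()
    optimizer.step()

adv_image = decoder(latent).detach()              # toy "adversarial example"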