Content-based Unrestricted Adversarial Attack
Authors: Zhaoyu Chen, Bo Li, Shuang Wu, Kaixun Jiang, Shouhong Ding, Wenqiang Zhang
Abstract: Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic, demonstrating their ability to deceive human perception and deep neural networks with stealth and success. However, current works usually sacrifice unrestricted degrees of freedom and subjectively select some image content to guarantee the photorealism of unrestricted adversarial examples, which limits their attack performance. To ensure the photorealism of adversarial examples and boost attack performance, we propose a novel unrestricted attack framework called Content-based Unrestricted Adversarial Attack. By leveraging a low-dimensional manifold that represents natural images, we map the images onto the manifold and optimize them along its adversarial direction. Within this framework, we implement the Adversarial Content Attack (ACA) based on Stable Diffusion, which generates highly transferable unrestricted adversarial examples with diverse adversarial content. Extensive experimentation and visualization demonstrate the efficacy of ACA, which surpasses state-of-the-art attacks by an average of 13.3-50.4% on normally trained models and 16.8-48.0% on defense methods.
What, Why and How
Here is a summary of the key points from this paper:
What: The paper proposes a new framework for generating unrestricted adversarial examples called Content-based Unrestricted Adversarial Attack. The goal is to create adversarial examples that are more diverse in content and photorealistic compared to prior unrestricted attacks.
Why: Existing unrestricted attacks have limitations in flexibility and lack diversity in the adversarial content they can generate. They also rely too much on subjective heuristics to maintain photorealism. This restricts their attack performance when transferred across different models.
How: The proposed framework maps images onto a low-dimensional manifold represented by a generative model like a diffusion model. The manifold captures the distribution of natural images. Adversarial examples are generated by optimizing the latent representation along the adversarial direction on this manifold. This allows combining different adversarial perturbations like shape, texture and color changes while ensuring photorealism.
Specifically, they use Image Latent Mapping to encode images into the latent space of Stable Diffusion. Then Adversarial Latent Optimization is used to perturb the latents to create adversarial examples. This generates more diverse and natural unrestricted adversarial examples.
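As a rough illustration of this two-stage pipeline, the following Python skeleton shows how ILM and ALO would compose; the helper functions are hypothetical placeholders for the components described in the Method Section below, not the authors' released code.

def adversarial_content_attack(image, label, classifier):
    # Stage 1: Image Latent Mapping - invert the image onto the natural-image
    # manifold of Stable Diffusion (noise latent + optimized guidance embeddings)
    latent, guidance_embeddings = image_latent_mapping(image)
    # Stage 2: Adversarial Latent Optimization - move the latent along the
    # adversarial direction on the manifold, then decode back to an image
    adv_latent = adversarial_latent_optimization(latent, guidance_embeddings,
                                                 classifier, label)
    return decode(adv_latent, guidance_embeddings)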
Experiments show the attack transfers far better than prior attacks, outperforming them on normally trained classifiers by 13.3-50.4% and on defense methods by 16.8-48.0% on average. Visualizations also demonstrate the high image quality and diverse adversarial content achieved.
In summary, the key ideas are leveraging the natural image manifold of generative models to enable more flexible and photorealistic unrestricted adversarial attacks with higher transferability. The combination of these techniques leads to state-of-the-art performance.
Main Contributions
Here are the key contributions of this paper:
- Proposes a new framework called Content-based Unrestricted Adversarial Attack for generating more diverse and photorealistic unrestricted adversarial examples.
- Leverages generative models such as diffusion models to represent a low-dimensional manifold of natural images. Generating adversarial examples by optimizing along this manifold enables combining different adversarial changes in content while ensuring photorealism.
- Introduces Image Latent Mapping and Adversarial Latent Optimization to implement this framework with Stable Diffusion, leading to the proposed Adversarial Content Attack (ACA).
- Achieves state-of-the-art attack transferability, surpassing prior unrestricted attacks against normally trained classifiers by 13.3-50.4% on average.
- Demonstrates high attack success against various defense methods, breaking them by 16.8-48.0% more than prior attacks.
- Provides extensive evaluations and visualizations that demonstrate the superiority of the proposed attack in terms of photorealism, content diversity, and attack performance.
In summary, the key contributions are:
- A new framework for generating more flexible and photorealistic unrestricted adversarial attacks.
- Techniques to implement this framework using diffusion models.
- Adversarial Content Attack that sets new state-of-the-art results in attacking classifiers and defenses.
- Comprehensive experiments that validate the efficacy of the proposed techniques.
Method Section
Here is a summary of the method section from the paper:
The proposed Adversarial Content Attack (ACA) has two main components:
Image Latent Mapping (ILM):
- Maps input images into the latent space of Stable Diffusion, which represents a natural image manifold.
- Uses inverse DDIM sampling and classifier-free guidance to encode images.
- Optimizes the guidance embeddings to reduce artifacts and improve image reconstruction.
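A minimal PyTorch sketch of the per-timestep refinement in ILM, assuming a unet noise predictor and a deterministic ddim_step update as in standard Stable Diffusion pipelines; the names and the exact loss are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

def refine_guidance_embedding(unet, ddim_step, z_t, z_prev_target, t, phi_t,
                              text_emb, guidance_scale=7.5, steps=10, lr=1e-2):
    # Refine the unconditional ("null-text") guidance embedding phi_t so that one
    # classifier-free-guided DDIM step from z_t lands on the inversion target.
    phi_t = phi_t.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([phi_t], lr=lr)
    for _ in range(steps):
        eps_uncond = unet(z_t, t, phi_t)      # noise prediction, null prompt
        eps_cond = unet(z_t, t, text_emb)     # noise prediction, source prompt
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        z_prev = ddim_step(z_t, eps, t)       # deterministic DDIM update
        loss = F.mse_loss(z_prev, z_prev_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return phi_t.detach()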
Adversarial Latent Optimization (ALO):
- Optimizes the noise variables in the latent representation to generate adversarial examples.
- Employs a skip gradient to approximate gradients of the decoder.
- Introduces differentiable boundary processing to constrain value ranges.
- Iteratively updates latent variables along the adversarial gradient direction.
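A hedged PyTorch sketch of one ALO gradient evaluation, assuming callables ddim_denoise (the full guided denoising chain), vae_decode, and classifier; the skip-gradient and boundary-processing details here are simplified stand-ins for the paper's formulations.

import torch
import torch.nn.functional as F

def alo_gradient(ddim_denoise, vae_decode, classifier, z_T, delta, label):
    # Run the expensive denoising chain without building an autograd graph;
    # the "skip gradient" treats its Jacobian as (approximately) the identity.
    with torch.no_grad():
        z_0 = ddim_denoise(z_T + delta)
    z_0 = z_0.detach().requires_grad_(True)
    x = vae_decode(z_0)                      # differentiable VAE decode to pixels
    # Differentiable boundary processing: squash into [0, 1] smoothly instead
    # of a hard clip that would zero out gradients at the boundary.
    x = 0.5 * (torch.tanh(x) + 1.0)
    loss = F.cross_entropy(classifier(x), label)
    grad_z0 = torch.autograd.grad(loss, z_0)[0]
    # Skip-gradient approximation: reuse grad_z0 as the gradient w.r.t. delta,
    # since z_0 and z_T share the same latent shape in Stable Diffusion.
    return loss.item(), grad_z0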
The overall ACA algorithm:
- Use ILM to map input image into Stable Diffusion’s latent space.
- Initialize latent variables including noise, text embeddings, etc.
- Repeat ALO to perturb latents along adversarial gradient:
  - Decode perturbed latents to get adversarial example
  - Compute adversarial loss
  - Approximate decoder gradients using skip gradient
  - Update latents with projected gradient step
- Return final reconstructed adversarial example.
In summary, the key steps are encoding images into the latent space using ILM, and then optimizing the latents guided by the adversarial loss to generate diverse yet photorealistic adversarial examples. The combination of ILM and ALO leads to the state-of-the-art attack.
High-Level Pseudo Code
Here is the high-level pseudo code for the Adversarial Content Attack method proposed in this paper:
# Image Latent Mapping
latents = inverse_ddim_sampling(image)
guidance_embedding = null_text_embedding()
for t in timesteps:
    for i in range(num_iterations):
        guidance_embedding = optimize_guidance_embedding(latents, t, guidance_embedding)
    latents = update_latents(latents, t, guidance_embedding)

# Adversarial Latent Optimization
delta = 0
for k in range(num_iterations):
    # Decode perturbed latents
    adv_example = decoder(latents + delta)
    # Compute adversarial loss
    loss = cross_entropy_loss(adv_example, label)
    # Approximate decoder gradients
    grads = skip_gradient(loss, latents + delta)
    # Projected gradient update
    delta = project(delta + step_size * grads)

# Return adversarial example
adv_example = decoder(latents + delta)
return adv_example

The key steps are:
- Use inverse DDIM sampling to map the image into the latent space of the diffusion model (ILM)
- Iteratively optimize latent variables (noise, text embeddings) along the adversarial gradient direction to create adversarial examples (ALO). This involves:
  - Decoding perturbed latents to get the adversarial example
  - Computing the adversarial loss
  - Approximating gradients using the skip gradient
  - Updating latents with a projected gradient step
- Return the final reconstructed adversarial example after the iterations.
So in summary, the core ideas are leveraging the natural image manifold along with ILM and ALO to generate diverse yet photorealistic unrestricted adversarial examples.
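The project call above is not spelled out in this summary; one plausible reading, assuming an L-infinity budget epsilon on the latent perturbation, is a simple clamp:

import torch

def project(delta, epsilon):
    # Keep the latent perturbation inside an L_inf ball of radius epsilon
    # (the paper's exact constraint set may differ).
    return torch.clamp(delta, min=-epsilon, max=epsilon)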
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the Adversarial Content Attack method proposed in this paper:
# Hyperparameters
T = num_ddim_steps
N_i = num_ilm_iterations
N_a = num_alo_iterations
w = guidance_scale
# Image Latent Mapping
z_0 = vae_encode(image)                      # encode image into Stable Diffusion's latent space
z_0_to_T = ddim_inversion(z_0, T, w=1)       # inversion trajectory z_0, ..., z_T
z_T = z_0_to_T[-1]
z_bar = z_T
phi = []                                     # per-timestep guidance (null-text) embeddings
phi_t = null_text_embedding()
for t in range(T, 0, -1):
    z_prev_star = z_0_to_T[t-1]              # inversion target for this step
    for i in range(N_i):
        phi_t = optimize_phi(z_bar, z_prev_star, t, phi_t)
    phi.append(phi_t)
    z_bar = denoise(z_bar, t, phi_t)

# Adversarial Latent Optimization
delta = 0
g = 0
for k in range(N_a):
    z_0_bar = decode(z_T + delta, T, phi)    # guided denoising chain + VAE decode
    z_0_bar = clip(z_0_bar)                  # differentiable boundary processing
    loss = cross_entropy(classifier(z_0_bar), y_true)
    grads = skip_gradient(loss, z_T + delta)
    g = momentum * g + grads / norm(grads)
    delta = project(delta + step_size * sign(g))

adv_example = decode(z_T + delta, T, phi)
return adv_example

Where the key functions are:
- vae_encode - encodes the image into Stable Diffusion's VAE latent space
- ddim_inversion - generates the latent trajectory by inverse DDIM sampling (guidance scale 1)
- null_text_embedding - embedding of the empty prompt used to initialize the optimization
- optimize_phi - optimizes the null-text (guidance) embedding at a given timestep
- denoise - a single denoising step of the diffusion model
- decode - the full denoising diffusion decoder (guided denoising chain plus VAE decoding)
- clip - clips values to the valid range (differentiable boundary processing)
- skip_gradient - approximates the decoder gradients by skipping through the denoising chain
- project - projects the perturbation onto the constraint set
So in summary, the key steps are using DDIM inversion for ILM, optimizing the latents with ALO in a loop, and leveraging helper functions such as the skip gradient and differentiable clipping to enable an effective attack.
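The momentum and projection lines in the ALO loop above follow the familiar MI-FGSM pattern; a small PyTorch sketch, with epsilon as an assumed perturbation budget, would be:

import torch

def momentum_sign_step(g, grads, delta, step_size, epsilon, mu=1.0):
    # Accumulate the L1-normalized gradient into the momentum buffer, then take
    # a projected sign step on the latent perturbation.
    g = mu * g + grads / (grads.abs().sum() + 1e-12)
    delta = torch.clamp(delta + step_size * torch.sign(g), -epsilon, epsilon)
    return g, delta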