Common Knowledge Learning for Generating Transferable Adversarial Examples

Authors: Ruijie Yang, Yuanfang Guo, Junfu Wang, Jiantao Zhou, Yunhong Wang

Abstract: This paper focuses on an important type of black-box attack, i.e., transfer-based adversarial attacks, where the adversary generates adversarial examples with a substitute (source) model and utilizes them to attack an unseen target model, without knowing its internal information. Existing methods tend to yield unsatisfactory adversarial transferability when the source and target models come from different types of DNN architectures (e.g., ResNet-18 and Swin Transformer). In this paper, we observe that the above phenomenon is induced by the output inconsistency problem. To alleviate this problem while effectively utilizing the existing DNN models, we propose a common knowledge learning (CKL) framework to learn better network weights, under fixed network architectures, for generating adversarial examples with better transferability. Specifically, to reduce the model-specific features and obtain better output distributions, we construct a multi-teacher framework, where the knowledge is distilled from different teacher architectures into one student network. Considering that the gradient with respect to the input is usually utilized to generate adversarial examples, we impose constraints on the gradients between the student and teacher models to further alleviate the output inconsistency problem and enhance the adversarial transferability. Extensive experiments demonstrate that our proposed work can significantly improve the adversarial transferability.

What, Why and How

Here is a summary of the key points in this paper:

What:

  • The paper focuses on transfer-based adversarial attacks, where adversarial examples are generated using a source model to attack an unseen target model.
  • It observes that low transferability between different model architectures is due to output inconsistency (different models producing different outputs for the same input).

Why:

  • Output inconsistency leads to uncertainty in the outputs of the target model, making it harder to successfully craft adversarial examples.
  • Existing methods have poor transferability across different model architectures due to this output inconsistency issue.

How:

  • The paper proposes a common knowledge learning (CKL) framework to train a student model that learns from multiple teacher models.
  • The student model learns to produce outputs similar to the teachers' via a knowledge distillation loss.
  • It also learns to match the teachers' input gradients via a gradient distillation loss (a minimal sketch of computing such input gradients follows this list). This reduces output inconsistency.
  • The student model is then used to craft adversarial examples with improved transferability across architectures.
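
As a minimal illustration of the input gradients involved, the following PyTorch sketch computes the gradient of a standard cross-entropy loss with respect to the input (the loss choice is an assumption; the paper's exact formulation may differ):

import torch
import torch.nn.functional as F

def input_gradient(model, x, y):
    # Gradient of the classification loss w.r.t. the input batch x
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, x)[0]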

In summary, the paper identifies output inconsistency across models as the cause of poor transferability of adversarial attacks. It proposes a CKL framework to train a student model that learns common knowledge from multiple teachers, reducing output inconsistency and improving transferability.

Main Contributions

Here are the main contributions of this paper:

  1. It analyzes the relationship between adversarial transferability and properties of the substitute (source) model, and observes that output inconsistency leads to poor transferability.

  2. It proposes a common knowledge learning (CKL) framework to distill knowledge from multiple teacher models into a single student model. This reduces model-specific features and output inconsistency.

  3. It proposes an input gradient distillation scheme that makes the student model learn input gradients similar to those of the teachers. This further reduces output inconsistency and improves transferability.

  4. It adopts gradient projection from PCGrad to reduce conflicts between the teachers' gradients during training (the projection rule is given after this list).

  5. Extensive experiments show the proposed CKL framework significantly improves transferability of different attack methods across diverse model architectures on CIFAR10 and CIFAR100.

In summary, the main contributions are:

  • Identifying output inconsistency as the cause of poor transferability
  • Proposing the CKL framework to learn common knowledge from multiple teacher models
  • Using input gradient distillation to reduce output inconsistency
  • Adopting gradient projection to handle conflicting gradients
  • Demonstrating improved transferability across architectures experimentally
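
For reference, the gradient projection rule from PCGrad (Yu et al., "Gradient Surgery for Multi-Task Learning") removes the conflicting component whenever two gradients point in opposing directions:

g_i ← g_i − (g_i · g_j / ‖g_j‖²) · g_j,   whenever g_i · g_j < 0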

Method Section

Here is a summary of the methodology section in the paper:

The paper proposes a common knowledge learning (CKL) framework that trains a student model to learn from multiple teacher models.

The framework has two main components:

  1. Common Knowledge Distillation
  • Multiple teacher models with different architectures are selected
  • A knowledge distillation loss makes the student model produce outputs similar to the teachers'
  • The KL divergence between student and teacher outputs is minimized
  • This reduces model-specific features and output inconsistency
  2. Input Gradient Distillation
  • The teachers' input gradients matter because input gradients are what attacks use to craft adversarial examples
  • The student model is trained to match the teachers' input gradients
  • Averaged teacher gradients are first used as the target
  • Gradient projection from PCGrad is then used to resolve conflicting gradients between teachers
  • This further reduces output inconsistency (both losses are sketched in code right after this list)
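
As a compact sketch of the two losses (PyTorch-style; the KL direction and the cross-entropy input gradients are common conventions assumed here, not guaranteed to match the paper exactly):

import torch
import torch.nn.functional as F

def ckl_losses(student, teachers, x, y):
    x = x.clone().detach().requires_grad_(True)
    s_logits = student(x)

    # Common knowledge distillation: KL(teacher || student) over the outputs
    loss_kd = sum(F.kl_div(F.log_softmax(s_logits, dim=1),
                           F.softmax(t(x), dim=1).detach(),
                           reduction='batchmean')
                  for t in teachers)

    # Input gradient distillation: match the student's input gradient to an
    # aggregate of the teachers' (PCGrad conflict resolution is shown later)
    s_grad = torch.autograd.grad(F.cross_entropy(s_logits, y), x,
                                 create_graph=True)[0]
    t_grads = [torch.autograd.grad(F.cross_entropy(t(x), y), x)[0].detach()
               for t in teachers]
    loss_grad = F.mse_loss(s_grad, torch.stack(t_grads).sum(dim=0))

    return loss_kd, loss_grad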

After training, the student model incorporates common knowledge from diverse teacher models. It is used as the source model to generate adversarial examples.

Existing attack methods like DI-FGSM can be combined with the framework by replacing the source model with the student model.
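
For instance, a simplified DI-FGSM-style attack using the trained student as the source model might look as follows (the diversity transform and the ε, step-size, and image-range settings are typical CIFAR-scale choices, assumed here for illustration):

import random
import torch
import torch.nn.functional as F

def input_diversity(x, out_size=32, min_size=28, p=0.5):
    # DI-FGSM random resize-and-pad transform, applied with probability p
    if random.random() >= p:
        return x
    rnd = random.randint(min_size, out_size - 1)
    resized = F.interpolate(x, size=(rnd, rnd), mode='bilinear',
                            align_corners=False)
    pad = out_size - rnd
    left, top = random.randint(0, pad), random.randint(0, pad)
    return F.pad(resized, (left, pad - left, top, pad - top))

def di_fgsm(student, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(student(input_diversity(x_adv)), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()          # gradient-sign step
        x_adv = x + (x_adv - x).clamp(-eps, eps)     # project into eps-ball
        x_adv = x_adv.clamp(0, 1)                    # keep valid pixel range
    return x_adv.detach()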

In summary, the key aspects of the methodology are multi-teacher knowledge distillation and input gradient distillation with gradient projection to train a student model that improves transferability.

High-Level Pseudo Code

Here is the high-level pseudo code for the proposed method:

# Select teacher models T1, T2, ..., Tn with diverse architectures
 
# Train student model S on dataset X,Y:
for epoch in num_epochs:
  for x,y in dataloader(X,Y):
 
    # Common Knowledge Distillation
    student_out = S(x) 
    loss_kd = 0
    for i in range(n):
      teacher_out = Ti(x)  
      loss_kd += KL_div(student_out, teacher_out)
    
    # Input Gradient Distillation
    # (gradients of the classification loss w.r.t. the input x)
    student_grad = grad(loss(S(x), y), x)
    teacher_grads = []
    for i in range(n):
      teacher_grad = grad(loss(Ti(x), y), x)
      teacher_grads.append(teacher_grad)

    # Resolve conflicts between teacher gradients (PCGrad)
    # rather than simply averaging them
    projected_grads = project_conflicting_gradients(teacher_grads)
    teacher_target = sum(projected_grads)
    loss_grad = MSE(student_grad, teacher_target)
    
    # Overall loss: weighted sum of the two distillation losses
    loss = loss_kd + lambda_grad * loss_grad
    update(S, loss) 
 
# Use trained student model S as source model  
# Combine with attack method (e.g. DI-FGSM) 
# to generate adversarial examples

This shows the overall training process with common knowledge distillation and input gradient distillation losses. The trained student model is then used to generate adversarial examples.

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the proposed method:

# A PyTorch-style implementation sketch (model constructors and the data
# loader are placeholders; hyperparameter values follow the paper's setup)
import torch
import torch.nn.functional as F

# Hyperparameters
num_teachers = 4
lambda_grad = 500   # weight of the gradient distillation loss
                    # (renamed from "lambda", which is reserved in Python)

# Select pretrained teacher models with diverse architectures and freeze them
teachers = [ResNet50(), InceptionV3(), SwinT(), MLPMixer()]
for teacher in teachers:
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)

# Initialize student model
student = ResNet18()

# Optimization hyperparameters
lr, epochs, batch_size = 0.1, 600, 128
optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)

def project_conflicting_gradients(teacher_grads):
    # PCGrad-style gradient surgery: if two teacher gradients conflict
    # (negative inner product), remove from one the component that lies
    # along the other, i.e. g_i <- g_i - (g_i . g_j / ||g_j||^2) g_j
    projected = [g.clone() for g in teacher_grads]
    for i in range(num_teachers):
        for j in range(num_teachers):
            if i == j:
                continue
            dot = torch.sum(projected[i] * teacher_grads[j])
            if dot < 0:  # conflicting gradients
                norm_sq = torch.sum(teacher_grads[j] ** 2)
                projected[i] = projected[i] - (dot / norm_sq) * teacher_grads[j]
    return projected

for epoch in range(epochs):

    for x, y in dataloader(batch_size):
        x.requires_grad_(True)  # needed to take gradients w.r.t. the input

        # Teacher forward passes and input gradients (targets, hence detached)
        teacher_logits, teacher_grads = [], []
        for teacher in teachers:
            t_logits = teacher(x)
            t_grad = torch.autograd.grad(F.cross_entropy(t_logits, y), x)[0]
            teacher_logits.append(t_logits.detach())
            teacher_grads.append(t_grad.detach())

        # Student forward pass
        student_logits = student(x)

        # Knowledge distillation loss: KL divergence between student and
        # teacher output distributions (temperature scaling omitted here)
        loss_kd = sum(F.kl_div(F.log_softmax(student_logits, dim=1),
                               F.softmax(t_logits, dim=1),
                               reduction='batchmean')
                      for t_logits in teacher_logits)

        # Student input gradient; create_graph=True so that loss_grad can be
        # backpropagated through it to the student's parameters
        student_grad = torch.autograd.grad(
            F.cross_entropy(student_logits, y), x, create_graph=True)[0]

        # Resolve conflicts among teacher gradients, then aggregate them
        projected_grads = project_conflicting_gradients(teacher_grads)
        teacher_target = sum(projected_grads)

        loss_grad = F.mse_loss(student_grad, teacher_target)

        loss = loss_kd + lambda_grad * loss_grad

        # Optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Generate adversarial examples using the trained student as the source model

This implements the overall training process in more detail, including the forward passes, loss computations, gradient projection, and optimization steps. The trained student model can then be used to craft adversarial examples.