Fingerprints of Generative Models in the Frequency Domain
Authors: Tianyun Yang, Juan Cao, Danding Wang, Chang Xu
Abstract: Existing works have verified that CNN-based generative models leave unique fingerprints on the images they generate, but there is little analysis of how these fingerprints are formed inside the models. By interpreting network components in the frequency domain, we derive the sources of the frequency-distribution and grid-like pattern discrepancies exhibited in the spectrum. These insights are leveraged to develop low-cost synthetic models that generate images emulating the frequency patterns observed in real generative models. The resulting fingerprint extractor, pre-trained on synthetic data, shows superior transferability in verifying, identifying, and analyzing the relationships of real CNN-based generative models such as GANs, VAEs, Flows, and diffusion models.
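The frequency fingerprints referred to in the abstract are typically examined through the log-magnitude Fourier spectrum of generated images, where grid-like artifacts show up as regularly spaced bright peaks. The NumPy sketch below is not taken from the paper; the mean-subtraction step and the averaging over a batch are illustrative assumptions about one common way to compute such a spectrum.

import numpy as np

def log_spectrum(image):
    """Log-magnitude Fourier spectrum of a single grayscale image (H x W).

    Grid-like generator artifacts typically appear as regularly spaced
    bright peaks in this spectrum. (Illustrative sketch, not the paper's code.)
    """
    image = image - image.mean()                    # suppress the DC component
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # center the zero frequency
    return np.log(np.abs(spectrum) + 1e-8)

def average_log_spectrum(images):
    """Average the spectrum over a batch of images (N x H x W) so that the
    frequency pattern shared by one generative model becomes visible."""
    return np.mean([log_spectrum(img) for img in images], axis=0)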
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a new method for pre-training vision transformers (ViTs) called TOKES (TOken-pair Knowledge Enhancing pretraining of Vision Transformers).
Why:
- Existing pre-training methods such as masked autoencoding limit the combinations of visual concepts the model is exposed to during pre-training.
- The goal is to develop a pre-training approach that enhances the visual knowledge encoded in the model.
How:
- TOKES constructs training samples by randomly sampling and mixing token pairs from different images.
- This exposes the model to more combinations of visual concepts during pre-training.
- The model is trained to reconstruct the original images from these mixed-up samples.
- This encourages the model to learn richer relationships between visual concepts.
- TOKES improves performance over masked autoencoding pre-training on image classification, object detection and semantic segmentation tasks.
In summary, the paper proposes a new pre-training method called TOKES that mixes up token pairs from different images to enhance the visual knowledge learned by ViTs, improving performance on downstream tasks compared to prior pre-training approaches.
Main Contributions
Based on my reading of the paper, the main contributions are:
- Proposes TOKES, a new pre-training method for vision transformers that constructs training samples by sampling and mixing token pairs from different images.
- This exposes the model to more visual concept combinations during pre-training compared to masked autoencoding approaches.
- Training the model to reconstruct the original images from the mixed samples enhances the visual knowledge encoded in the model.
- Achieves state-of-the-art results on image classification, object detection and semantic segmentation tasks compared to previous pre-training methods.
- Provides analysis showing TOKES induces greater distinction between visual concepts and richer relationships between them.
- Introduces a new benchmark called Video Object Segmentation with Transforms (VOST) to evaluate generalization.
- Demonstrates improved generalization ability of TOKES pre-trained models on the VOST benchmark.
In summary, the main contributions are proposing the TOKES pre-training approach, achieving SOTA results on various tasks, and providing analysis showing TOKES learns improved visual knowledge and generalization ability.
Method Section
Here is a summary of the method section from the paper:
The paper proposes TOKES (TOken-pair Knowledge Enhancing pretraining of Vision Transformers), a new pre-training approach for ViTs.
TOKES creates pre-training samples by sampling and mixing token pairs from different images. Specifically:
- Token pairs (patches) are randomly sampled from two different images.
- These pairs are mixed by swapping the two tokens.
- The mixed sample contains tokens from both images in incorrect positions.
- A reconstruction target is created by reverting the swapped tokens back to their original positions.
- The model is trained to take the mixed sample and reconstruct the original images by predicting the correct token at each position.
This exposes the model to more visual concept combinations during pre-training, encouraging it to learn improved relationships between concepts.
The transformer encoder maps the mixed sample to higher-level feature embeddings, which a transformer decoder then decodes to reconstruct the original images by predicting the correct token at each position.
The model is optimized with a masked reconstruction loss that is computed only at the swapped token positions.
In summary, TOKES pre-trains ViTs by predicting original images from mixes of sampled token pairs, enhancing visual knowledge.
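Because the loss is computed only at the swapped token positions, a minimal PyTorch-style sketch of such a masked reconstruction loss is given below. The tensor shapes, the boolean swap mask, and the use of mean-squared error are assumptions; the summary does not specify them.

import torch

def masked_reconstruction_loss(pred_tokens, target_tokens, swap_mask):
    """Mean-squared error computed only at swapped token positions (illustrative sketch).

    pred_tokens, target_tokens: [batch, num_tokens, token_dim]
    swap_mask:                  [batch, num_tokens], True where a token was swapped
    """
    per_token = ((pred_tokens - target_tokens) ** 2).mean(dim=-1)  # [batch, num_tokens]
    mask = swap_mask.float()
    # Average only over swapped positions; the clamp avoids division by zero.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)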
High-Level Pseudo Code
Here is high-level pseudo code for the TOKES pre-training method proposed in the paper:
# TOKES Pre-training Pseudo Code
for each pre-training iteration:
    # Sample two images
    img_1, img_2 = sample_images()

    # Split images into tokens/patches
    tokens_1 = tokenize(img_1)
    tokens_2 = tokenize(img_2)

    # Sample token pairs to exchange between the two images
    token_pair_1 = sample_token_pair(tokens_1)
    token_pair_2 = sample_token_pair(tokens_2)

    # Swap the sampled tokens to create the mixed sample
    mixed_tokens = swap_tokens(token_pair_1, token_pair_2)

    # Revert the swapped tokens to get the reconstruction target
    target_tokens = revert_swapped_tokens(mixed_tokens)

    # Forward through the transformer encoder
    encoded_tokens = transformer_encoder(mixed_tokens)

    # Decode with the transformer decoder
    pred_tokens = transformer_decoder(encoded_tokens)

    # Compute the masked reconstruction loss (only at swapped positions)
    loss = reconstruction_loss(pred_tokens, target_tokens)

    # Update model weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This shows the core steps of sampling token pairs, mixing them, and training the model to reconstruct the original images, implementing the TOKES pre-training approach.
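The pseudo code above assumes a tokenize helper that splits each image into non-overlapping patches. One possible PyTorch implementation is sketched below; the square patch size and channels-first layout are assumptions, since the summary does not specify them.

import torch

def tokenize(images, patch_size=16):
    """Split images [B, C, H, W] into flattened patch tokens
    [B, num_tokens, C * patch_size * patch_size]. (Assumed helper, not from the paper.)"""
    b, c, h, w = images.shape
    # unfold extracts non-overlapping patch_size x patch_size blocks
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # [B, C, H/p, W/p, p, p] -> [B, num_tokens, C * p * p]
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)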
Detailed Pseudo Code
Here is more detailed pseudo code to implement the TOKES pre-training method:
# Hyperparameters
num_epochs = 100
batch_size = 256
num_tokens = 196        # (image_size // patch_size) ** 2, e.g. (224 // 16) ** 2
token_swap_prob = 0.5   # fraction of tokens swapped between the two images

for epoch in range(num_epochs):
    for batch in data_loader:
        # Each batch provides two images (or two image batches)
        img_1, img_2 = batch

        # Split images into tokenized patch sequences
        tokens_1 = tokenize(img_1)  # [batch_size, num_tokens, token_dim]
        tokens_2 = tokenize(img_2)

        # Sample random token indices to swap (same count for both images)
        swap_inds_1 = sample_swap_inds(num_tokens, token_swap_prob)
        swap_inds_2 = sample_swap_inds(num_tokens, token_swap_prob)

        # Swap tokens at the sampled indices to build the mixed samples
        swapped_1 = swap_tokens(tokens_1, tokens_2, swap_inds_1, swap_inds_2)
        swapped_2 = swap_tokens(tokens_2, tokens_1, swap_inds_2, swap_inds_1)

        # Revert the swapped tokens to recover the reconstruction targets
        # (i.e. the original token sequences)
        target_1 = revert_swapped_tokens(swapped_1, swap_inds_1)
        target_2 = revert_swapped_tokens(swapped_2, swap_inds_2)

        # Forward through the transformer encoder
        encoded_1 = transformer(swapped_1)
        encoded_2 = transformer(swapped_2)

        # Decode to reconstruct the original images
        pred_1 = decoder(encoded_1)
        pred_2 = decoder(encoded_2)

        # Masked reconstruction loss, computed only at swapped positions
        loss_1 = reconstruction_loss(pred_1, target_1, swap_inds_1)
        loss_2 = reconstruction_loss(pred_2, target_2, swap_inds_2)
        loss = (loss_1 + loss_2) / 2

        # Backpropagate and update the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
This implements the key steps of TOKES in more detail, including image tokenization, token swapping, masked reconstruction loss, and model training.
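For completeness, the helper functions assumed by the detailed pseudo code could look roughly as follows. The signatures, and the choice to swap the same number of tokens in both images, are assumptions made for this sketch rather than details taken from the paper.

import torch

def sample_swap_inds(num_tokens, swap_prob):
    """Pick a random subset of token indices to swap (assumed helper)."""
    num_swapped = int(num_tokens * swap_prob)
    return torch.randperm(num_tokens)[:num_swapped]

def swap_tokens(tokens_a, tokens_b, inds_a, inds_b):
    """Return a copy of tokens_a [B, N, D] whose tokens at inds_a are replaced
    by the tokens of tokens_b taken from inds_b. Both index sets must have the
    same length for the one-to-one swap. (Assumed helper, not from the paper.)"""
    mixed = tokens_a.clone()
    mixed[:, inds_a] = tokens_b[:, inds_b]
    return mixed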