Fingerprints of Generative Models in the Frequency Domain
Authors: Tianyun Yang, Juan Cao, Danding Wang, Chang Xu
Abstract: Existing works have verified that CNN-based generative models leave unique fingerprints on the images they generate, but there is little analysis of how these fingerprints are formed inside the models. By interpreting network components in the frequency domain, we derive the sources of the frequency-distribution and grid-like pattern discrepancies exhibited in the spectrum. These insights are leveraged to develop low-cost synthetic models that generate images emulating the frequency patterns observed in real generative models. The resulting fingerprint extractor, pre-trained on synthetic data, shows superior transferability in verifying, identifying, and analyzing the relationships of real CNN-based generative models such as GANs, VAEs, Flows, and diffusion models.
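The frequency fingerprints referred to in the abstract are typically examined through the log-magnitude Fourier spectrum of generated images, where grid-like artifacts show up as regularly spaced bright peaks. The NumPy sketch below is not taken from the paper; the mean-subtraction step and the averaging over a batch are illustrative assumptions about one common way to compute such a spectrum.

import numpy as np

def log_spectrum(image):
    """Log-magnitude Fourier spectrum of a single grayscale image (H x W).

    Grid-like generator artifacts typically appear as regularly spaced
    bright peaks in this spectrum. (Illustrative sketch, not the paper's code.)
    """
    image = image - image.mean()                    # suppress the DC component
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # center the zero frequency
    return np.log(np.abs(spectrum) + 1e-8)

def average_log_spectrum(images):
    """Average the spectrum over a batch of images (N x H x W) so that the
    frequency pattern shared by one generative model becomes visible."""
    return np.mean([log_spectrum(img) for img in images], axis=0)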
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a new method for pre-training vision transformers (ViTs) called TOKES (TOken-pair Knowledge Enhancing pretraining of Vision Transformers).
Why:
- Existing pre-training methods such as masked autoencoding limit the combinations of visual concepts the model is exposed to during pre-training.
- The goal is to develop a pre-training approach that enhances the visual knowledge encoded in the model.
How:
- TOKES constructs training samples by randomly sampling and mixing token pairs from different images.
- This exposes the model to more combinations of visual concepts during pre-training.
- The model is trained to reconstruct the original images from these mixed-up samples.
- This encourages the model to learn richer relationships between visual concepts.
- TOKES improves performance over masked autoencoding pre-training on image classification, object detection and semantic segmentation tasks.
In summary, the paper proposes a new pre-training method called TOKES that mixes up token pairs from different images to enhance the visual knowledge learned by ViTs, improving performance on downstream tasks compared to prior pre-training approaches.
Main Contributions
Based on my reading of the paper, the main contributions are:
- Proposes TOKES, a new pre-training method for vision transformers that constructs training samples by sampling and mixing token pairs from different images.
- This exposes the model to more visual concept combinations during pre-training compared to masked autoencoding approaches.
- Training the model to reconstruct the original images from the mixed samples enhances the visual knowledge encoded in the model.
- Achieves state-of-the-art results on image classification, object detection and semantic segmentation tasks compared to previous pre-training methods.
- Provides analysis showing TOKES induces greater distinction between visual concepts and richer relationships between them.
- Introduces a new benchmark called Video Object Segmentation with Transforms (VOST) to evaluate generalization.
- Demonstrates improved generalization ability of TOKES pre-trained models on the VOST benchmark.
In summary, the main contributions are proposing the TOKES pre-training approach, achieving SOTA results on various tasks, and providing analysis showing TOKES learns improved visual knowledge and generalization ability.
Method Section
Here is a summary of the method section from the paper:
The paper proposes TOKES (TOken-pair Knowledge Enhancing pretraining of Vision Transformers), a new pre-training approach for ViTs.
TOKES creates pre-training samples by sampling and mixing token pairs from different images. Specifically:
- Token pairs (patches) are randomly sampled from two different images.
- These pairs are mixed by swapping the two tokens.
- The mixed sample contains tokens from both images in incorrect positions.
- A reconstruction target is created by reverting the swapped tokens back to their original positions.
- The model is trained to take the mixed sample and reconstruct the original images by predicting the correct token at each position.
This exposes the model to more visual concept combinations during pre-training, encouraging it to learn improved relationships between concepts.
The transformer encoder maps the mixed sample to higher-level feature embeddings, which a transformer decoder then decodes to reconstruct the original images by predicting the correct token at each position.
The model is optimized with a masked reconstruction loss that is computed only at the swapped token positions.
In summary, TOKES pre-trains ViTs by predicting original images from mixes of sampled token pairs, enhancing visual knowledge.
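Because the loss is computed only at the swapped token positions, a minimal PyTorch-style sketch of such a masked reconstruction loss is given below. The tensor shapes, the boolean swap mask, and the use of mean-squared error are assumptions; the summary does not specify them.

import torch

def masked_reconstruction_loss(pred_tokens, target_tokens, swap_mask):
    """Mean-squared error computed only at swapped token positions (illustrative sketch).

    pred_tokens, target_tokens: [batch, num_tokens, token_dim]
    swap_mask:                  [batch, num_tokens], True where a token was swapped
    """
    per_token = ((pred_tokens - target_tokens) ** 2).mean(dim=-1)  # [batch, num_tokens]
    mask = swap_mask.float()
    # Average only over swapped positions; the clamp avoids division by zero.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)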
High-Level Pseudo Code
Here is high-level pseudo code for the TOKES pre-training method proposed in the paper:
# TOKES Pre-training Pseudo Code
for each pre-training iteration:
    # Sample two images
    img_1, img_2 = sample_images()

    # Split images into tokens/patches
    tokens_1 = tokenize(img_1)
    tokens_2 = tokenize(img_2)

    # Sample token pairs to exchange between the two images
    token_pair_1 = sample_token_pair(tokens_1)
    token_pair_2 = sample_token_pair(tokens_2)

    # Swap the sampled tokens to create the mixed sample
    mixed_tokens = swap_tokens(token_pair_1, token_pair_2)

    # Revert the swapped tokens to get the reconstruction target
    target_tokens = revert_swapped_tokens(mixed_tokens)

    # Forward through the transformer encoder
    encoded_tokens = transformer_encoder(mixed_tokens)

    # Decode with the transformer decoder
    pred_tokens = transformer_decoder(encoded_tokens)

    # Compute the masked reconstruction loss (only at swapped positions)
    loss = reconstruction_loss(pred_tokens, target_tokens)

    # Update model weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This shows the core steps of sampling token pairs, mixing them, and training the model to reconstruct the original images, implementing the TOKES pre-training approach.
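The pseudo code above assumes a tokenize helper that splits each image into non-overlapping patches. One possible PyTorch implementation is sketched below; the square patch size and channels-first layout are assumptions, since the summary does not specify them.

import torch

def tokenize(images, patch_size=16):
    """Split images [B, C, H, W] into flattened patch tokens
    [B, num_tokens, C * patch_size * patch_size]. (Assumed helper, not from the paper.)"""
    b, c, h, w = images.shape
    # unfold extracts non-overlapping patch_size x patch_size blocks
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # [B, C, H/p, W/p, p, p] -> [B, num_tokens, C * p * p]
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)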
Detailed Pseudo Code
Here is more detailed pseudo code to implement the TOKES pre-training method:
# Hyperparameters
num_epochs = 100
batch_size = 256
num_tokens = 196        # (image_size // patch_size) ** 2, e.g. (224 // 16) ** 2
token_swap_prob = 0.5   # fraction of tokens swapped between the two images

for epoch in range(num_epochs):
    for batch in data_loader:
        # Each batch provides two images (or two image batches)
        img_1, img_2 = batch

        # Split images into tokenized patch sequences
        tokens_1 = tokenize(img_1)  # [batch_size, num_tokens, token_dim]
        tokens_2 = tokenize(img_2)

        # Sample random token indices to swap (same count for both images)
        swap_inds_1 = sample_swap_inds(num_tokens, token_swap_prob)
        swap_inds_2 = sample_swap_inds(num_tokens, token_swap_prob)

        # Swap tokens at the sampled indices to build the mixed samples
        swapped_1 = swap_tokens(tokens_1, tokens_2, swap_inds_1, swap_inds_2)
        swapped_2 = swap_tokens(tokens_2, tokens_1, swap_inds_2, swap_inds_1)

        # Revert the swapped tokens to recover the reconstruction targets
        # (i.e. the original token sequences)
        target_1 = revert_swapped_tokens(swapped_1, swap_inds_1)
        target_2 = revert_swapped_tokens(swapped_2, swap_inds_2)

        # Forward through the transformer encoder
        encoded_1 = transformer(swapped_1)
        encoded_2 = transformer(swapped_2)

        # Decode to reconstruct the original images
        pred_1 = decoder(encoded_1)
        pred_2 = decoder(encoded_2)

        # Masked reconstruction loss, computed only at swapped positions
        loss_1 = reconstruction_loss(pred_1, target_1, swap_inds_1)
        loss_2 = reconstruction_loss(pred_2, target_2, swap_inds_2)
        loss = (loss_1 + loss_2) / 2

        # Backpropagate and update the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
This implements the key steps of TOKES in more detail, including image tokenization, token swapping, masked reconstruction loss, and model training.
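For completeness, the helper functions assumed by the detailed pseudo code could look roughly as follows. The signatures, and the choice to swap the same number of tokens in both images, are assumptions made for this sketch rather than details taken from the paper.

import torch

def sample_swap_inds(num_tokens, swap_prob):
    """Pick a random subset of token indices to swap (assumed helper)."""
    num_swapped = int(num_tokens * swap_prob)
    return torch.randperm(num_tokens)[:num_swapped]

def swap_tokens(tokens_a, tokens_b, inds_a, inds_b):
    """Return a copy of tokens_a [B, N, D] whose tokens at inds_a are replaced
    by the tokens of tokens_b taken from inds_b. Both index sets must have the
    same length for the one-to-one swap. (Assumed helper, not from the paper.)"""
    mixed = tokens_a.clone()
    mixed[:, inds_a] = tokens_b[:, inds_b]
    return mixed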