CLIPAG: Towards Generator-Free Text-to-Image Generation
Authors: Roy Ganz, Michael Elad
Abstract: Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in robust image classification models, wherein their input gradients align with human perception and pose semantic meanings. While this phenomenon has gained significant research attention, it was solely studied in the context of unimodal vision-only architectures. In this work, we extend the study of PAG to Vision-Language architectures, which form the foundations for diverse image-text tasks and applications. Through an adversarial robustification finetuning of CLIP, we demonstrate that robust Vision-Language models exhibit PAG in contrast to their vanilla counterparts. This work reveals the merits of CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, we show that seamlessly integrating CLIPAG in a “plug-n-play” manner leads to substantial improvements in vision-language generative applications. Furthermore, leveraging its PAG property, CLIPAG enables text-to-image generation without any generative model, which typically requires huge generators.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper introduces CLIPAG, an adversarially finetuned version of CLIP's image encoder that exhibits perceptually aligned gradients (PAG).
- It shows that robustifying CLIP via adversarial training induces PAG, similar to prior findings in unimodal vision models.
- The paper demonstrates the benefits of CLIPAG for text-to-image generation by integrating it into existing frameworks such as CLIPDraw, VQGAN+CLIP, and CLIPStyler.
Why:
- Vanilla CLIP lacks PAG and is susceptible to adversarial attacks, so directly optimizing images to match text descriptions yields poor, non-perceptual results.
- CLIPAG addresses this limitation: its aligned gradients enable improved text-to-image generation without additional tricks or regularizations.
- PAG allows CLIPAG to guide pixel-space optimization towards human-interpretable images that align with the text prompts.
How:
- CLIPAG is obtained by adversarially finetuning the image encoder of CLIP on a large image-text dataset.
- It is integrated into existing frameworks by replacing regular CLIP in a "plug-and-play" manner (see the sketch after this section's summary).
- CLIPAG enables a simple iterative pixel-space optimization method for generator-free text-to-image synthesis.
In summary, the paper shows how inducing PAG in CLIP via robust training benefits text-to-image generation across various applications, highlighting the importance of gradient alignment in multimodal models.
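The "plug-and-play" claim is easiest to see in code: CLIP-guided frameworks all minimize a CLIP-similarity loss between a differentiable generated image and the prompt, so swapping vanilla CLIP for CLIPAG only changes which weights are loaded. The sketch below is a minimal illustration under that assumption, using the OpenAI clip package; the prompt and the CLIPAG checkpoint path are hypothetical placeholders, not artifacts released by the paper.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package; any model exposing encode_image / encode_text works

def clip_guidance_loss(model, image, tokens):
    # Negative cosine similarity between image and text embeddings -- the loss that
    # CLIPDraw / VQGAN+CLIP / CLIPStyler-style frameworks minimize w.r.t. their image parameters.
    img_emb = F.normalize(model.encode_image(image), dim=-1)
    txt_emb = F.normalize(model.encode_text(tokens), dim=-1)
    return 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # vanilla CLIP guidance

# "Plug-and-play" swap: load CLIPAG weights into the same visual backbone.
# The checkpoint path is hypothetical -- substitute the released CLIPAG weights.
# model.visual.load_state_dict(torch.load("clipag_visual.pt", map_location=device))

tokens = clip.tokenize(["a watercolor painting of a fox"]).to(device)
# `image` would be the framework's differentiable output (sketch strokes, VQGAN decode, stylized image).

Because the loss only depends on the encoders' outputs, the surrounding framework code is unchanged by the swap.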
Main Contributions
Here are the main contributions of this paper:
- Introduces CLIPAG, an adversarially finetuned version of CLIP's visual encoder that exhibits perceptually aligned gradients (PAG). This establishes that PAG, previously explored only in unimodal vision models, also exists in multimodal vision-language architectures.
- Demonstrates the benefits of integrating CLIPAG in a "plug-and-play" manner into existing CLIP-based text-to-image frameworks such as CLIPDraw, VQGAN+CLIP, and CLIPStyler. Replacing regular CLIP with CLIPAG improves performance in these applications.
- Proposes a simple iterative pixel-space optimization method that leverages CLIPAG for generator-free text-to-image synthesis, without training an explicit generator model.
- Provides qualitative and quantitative experiments highlighting the advantages of CLIPAG over regular CLIP across different text-to-image generation tasks.
- Opens up a new direction in harnessing perceptually aligned gradients within multimodal vision-language models for improved generative capabilities.
In summary, the key contributions are introducing CLIPAG, seamlessly integrating it into existing frameworks to enhance text-to-image generation, proposing a generator-free synthesis approach using CLIPAG, and experimentally demonstrating its benefits over standard CLIP.
Method Section
Here is a summary of the method section from the paper:
- The paper focuses on inducing perceptually aligned gradients (PAG) in the visual encoder of CLIP through adversarial training.
- The text encoder parameters are frozen, and only the image encoder is finetuned with a contrastive adversarial training loss.
- Adversarial images are crafted to fool the model (increasing similarity with mismatched text and decreasing it with the matched caption); the image encoder is then trained on these examples with the standard contrastive objective, maximizing similarity for matched image-text pairs and minimizing it for mismatched ones.
- This aligns the image encoder's input gradients with human perception and with the paired text captions.
- A small threat model (perturbation budget) is used to avoid catastrophic forgetting and to maintain CLIP's generalization capabilities.
- The model is trained on a concatenation of SBU, CC3M, CC12M, and a subset of LAION-400M.
- After this finetuning, the resulting model (CLIPAG) exhibits much better gradient alignment than vanilla CLIP.
- Qualitative examples show that CLIPAG's gradients are semantically meaningful and relate to the paired text, indicating improved PAG.
- Alignment is assessed by optimizing images to maximize similarity with a given text prompt and checking whether the resulting modifications match human perception (a minimal sketch of this optimization appears after this section's summary).
- Design choices such as CNN vs. ViT architectures and L2 vs. L-infinity threat models are ablated to analyze their impact on PAG.
In summary, the key steps are adversarial training of the image encoder while freezing the text encoder, using a small threat model, and showing both qualitatively and quantitatively that this induces perceptually aligned gradients in the resulting CLIPAG model.
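To make the evaluate-by-optimization idea concrete (which is also the core of the generator-free synthesis), here is a minimal sketch of pixel-space optimization against a CLIP-style score. It assumes a model exposing encode_image / encode_text (as in the OpenAI clip package); the step count, learning rate, and noise initialization are illustrative, and CLIP's input normalization and the augmentations such pipelines typically use are omitted for brevity. It is not the paper's exact procedure.

import torch
import torch.nn.functional as F
import clip  # any CLIP-style model exposing encode_image / encode_text

def generate(model, prompt, steps=300, lr=0.05, size=224, device="cpu"):
    """Generator-free synthesis: directly optimize the pixels to match the prompt.
    With vanilla CLIP this tends to yield adversarial-looking noise; with CLIPAG,
    the PAG property is what pushes the pixels towards a recognizable image."""
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        txt_emb = F.normalize(model.encode_text(tokens), dim=-1)

    # Start from random noise and treat the pixels themselves as the parameters.
    image = torch.rand(1, 3, size, size, device=device, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        img_emb = F.normalize(model.encode_image(image), dim=-1)
        loss = 1.0 - (img_emb * txt_emb).sum()   # maximize image-text cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)               # keep pixels in a valid range

    return image.detach()

# Usage (illustrative): model, _ = clip.load("ViT-B/32", device="cpu"); img = generate(model, "a red sports car")

Running the same loop with vanilla CLIP weights and with CLIPAG weights is exactly the comparison the paper uses to demonstrate the gradient-alignment gap.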
High-Level Pseudo Code
Here is the high-level pseudo code for the key steps in the paper:
# Load pretrained CLIP model
text_encoder, image_encoder = CLIP()

# Freeze text encoder parameters
text_encoder.requires_grad = False

# Define adversarial training loss
def adv_loss(x, y):
    # Craft an adversarial example that maximizes the loss w.r.t. the input
    x_adv = fgsm(x, y)
    # Forward pass through both encoders
    txt_embed = text_encoder(y)
    img_embed = image_encoder(x_adv)
    # Similarity-based (contrastive) loss on the adversarial image
    return 1 - cosine_similarity(txt_embed, img_embed)

# Dataset of (image, caption) pairs
dataset = load_dataset()

# Adversarially finetune the image encoder only
optimizer = Adam(image_encoder.parameters())
for x, y in dataset:
    optimizer.zero_grad()
    loss = adv_loss(x, y)
    loss.backward()
    optimizer.step()

# The finetuned encoders constitute CLIPAG
This shows the key steps:
- Loading pretrained CLIP
- Freezing the text encoder
- Defining an adversarial contrastive loss
- Creating adversarial examples on-the-fly during training
- Finetuning only the image encoder with this loss
- Resulting in the CLIPAG model with aligned gradients
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the key steps in the paper:
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

# Load pretrained CLIP (placeholders for an actual CLIP implementation)
text_encoder = CLIPTextEncoder()
image_encoder = CLIPImageEncoder()

# Freeze the text encoder
for p in text_encoder.parameters():
    p.requires_grad = False

# Image-text dataset (placeholders for the actual data pipeline)
train_dataset = ImageTextDataset(image_paths, text_captions)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

def clip_loss(img_emb, txt_emb):
    # Similarity-based loss; the paper uses CLIP's contrastive objective,
    # simplified here to (1 - cosine similarity) over matched pairs
    return 1 - F.cosine_similarity(img_emb, txt_emb).mean()

def fgsm(x, txt_emb, model, epsilon):
    # Single-step attack: perturb the input pixels to *maximize* the loss
    x_in = x.clone().detach().requires_grad_(True)
    loss = clip_loss(model(x_in), txt_emb)
    grad, = torch.autograd.grad(loss, x_in)   # gradient w.r.t. the input only
    x_adv = x + epsilon * grad.sign()         # L-inf step; an L2 step would use grad / ||grad||
    return x_adv.clamp(0, 1).detach()

def train_step(x, y):
    with torch.no_grad():
        txt_emb = text_encoder(y)             # text encoder stays frozen
    # Craft adversarial images on-the-fly (epsilon: perturbation budget;
    # the paper ablates L2 vs. L-infinity threat models)
    x_adv = fgsm(x, txt_emb, image_encoder, epsilon=1.5)
    img_emb = image_encoder(x_adv)
    loss = clip_loss(img_emb, txt_emb)        # train the encoder to stay aligned under attack
    loss.backward()
    return loss

# Main training loop: only the image encoder is updated
optimizer = torch.optim.Adam(image_encoder.parameters())
for epoch in range(num_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = train_step(x, y)
        optimizer.step()
The key aspects are:
- Freezing the text encoder
- Using FGSM to craft adversaries on-the-fly
- Contrastive loss between adversarial image and text
- Only updating the image encoder
- Crafting adversaries by differentiating the similarity loss with respect to the input pixels
This provides a more detailed overview of how to implement the adversarial training approach described in the paper.
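The method section notes an ablation of L2 vs. L-infinity threat models; the FGSM step above is the L-infinity variant. Below is a minimal sketch of the corresponding single-step L2 attack, under the same loss; the function name and the epsilon value are illustrative, not the paper's exact settings.

import torch

def l2_attack_step(x, grad, epsilon=1.5):
    """Single-step L2 attack: move along the normalized input gradient.

    `grad` is the gradient of the (1 - cosine similarity) loss w.r.t. the
    input pixels, computed as in `fgsm` above; epsilon is the L2 budget
    (the value here is illustrative)."""
    flat = grad.flatten(start_dim=1)
    norm = flat.norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
    x_adv = x + epsilon * grad / norm      # unit-norm gradient direction, scaled by the budget
    return x_adv.clamp(0, 1).detach()

A multi-step (PGD-style) attack would repeat this step and project back onto the epsilon-ball around the original image after each update.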