CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution

Authors: Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi

Abstract: Models leveraging both visual and textual data, such as Contrastive Language-Image Pre-training (CLIP), are increasingly gaining importance. In this work, we show that despite their versatility, such models are vulnerable to what we refer to as fooling master images. Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts, while being unrecognizable to humans. We demonstrate how fooling master images can be mined by searching the latent space of generative models by means of an evolution strategy or stochastic gradient descent. We investigate the properties of the mined fooling master images, and find that images mined using a small number of image captions potentially generalize to a much larger number of semantically related captions. Further, we evaluate two possible mitigation strategies and find that vulnerability to fooling master examples is closely related to a modality gap in contrastive pre-trained multi-modal networks. From the perspective of vulnerability to off-manifold attacks, we therefore argue for the mitigation of modality gaps in CLIP and related multi-modal approaches. Source code and mined CLIPMasterPrints are publicly available.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper introduces “CLIPMasterPrints”, which are synthetic images that can fool CLIP models by maximizing the model’s confidence scores for a wide variety of text prompts.

  • The authors generate CLIPMasterPrints by searching the latent space of generative models using evolution strategies or gradient descent.

  • The generated images look unrecognizable to humans, but get high confidence scores from CLIP across many different prompts.

Why:

  • This attack works because of the “modality gap” between image and text embeddings in CLIP models: no image on the data manifold can align well with all text embeddings simultaneously, so searching off-manifold finds points with better overall alignment (a minimal sketch of how this gap can be measured follows this list).

  • The vulnerability highlights issues with the robustness and safety of CLIP for real-world applications. Understanding and mitigating this could enable safer use of CLIP.
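
To make the modality gap concrete, a minimal sketch (not taken from the paper’s code) of how it can be measured is shown below: embed a set of images and captions with CLIP, L2-normalize the embeddings, and compute the distance between the centroids of the two embedding clouds. All function and variable names here are illustrative.

import numpy as np

def modality_gap(image_embeddings, text_embeddings):
    # image_embeddings, text_embeddings: (n, d) arrays produced by CLIP's
    # image and text encoders (assumed inputs, not the paper's exact setup)
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    # The gap is the Euclidean distance between the two embedding centroids
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

A large gap means no on-manifold image sits close to the text embeddings, which is exactly the slack that off-manifold fooling images exploit.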

How:

  • The authors mine CLIPMasterPrints using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) or stochastic gradient descent in the latent space of a generative model.

  • They test on artworks and ImageNet classes, finding single fooling images whose CLIP scores exceed those of genuine images for many of the targeted prompts.

  • The images generalize to semantically related prompts beyond those directly targeted. Information relevant to different prompts is distributed throughout the image rather than localized.

  • Modifying CLIP training to reduce the modality gap appears to mitigate the vulnerability, highlighting the link between the gap and fooling.

In summary, the paper introduces an intriguing class of synthetic images that exploit the modality gap in CLIP to fool the model across diverse prompts, raising safety concerns and highlighting the need for mitigation strategies.

Main Contributions

Here are the main contributions of the paper:

  • Introduces the concept of “CLIPMasterPrints” - synthetic images that can fool CLIP models by maximizing confidence scores across many varying prompts.

  • Demonstrates how to generate CLIPMasterPrints by searching latent spaces of generative models using evolution strategies or gradient descent.

  • Shows CLIP models are vulnerable to being fooled by such generated images on both artworks and ImageNet classes.

  • Finds that fooling images tend to generalize to semantically related prompts beyond those directly targeted.

  • Analyzes the spatial distribution of information in fooling images, finding it is spread throughout the image and sensitive to occlusion (a minimal occlusion-analysis sketch follows the summary below).

  • Evaluates mitigation strategies and finds that reducing the text-image modality gap appears most effective.

  • Establishes a link between the modality gap and vulnerability to off-manifold fooling attacks like CLIPMasterPrints.

  • Argues modality gaps are undesirable and should be mitigated to enable safer use of CLIP and contrastive multimodal models.

  • Provides code and generated fooling images to facilitate further research.

In summary, the key contributions are introducing and demonstrating CLIPMasterPrints, analyzing their properties, evaluating mitigation strategies, and arguing the modality gap is an important factor underlying the vulnerability. The results highlight issues with robustness and safety of CLIP that should be addressed.
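
To illustrate the occlusion analysis referenced above, here is a minimal sketch (assumed details, not the paper’s exact procedure): slide a gray patch across the fooling image and record how much the CLIP similarity for a prompt drops when each region is hidden. The helper clip_image_score is a hypothetical function returning the CLIP cosine similarity between an image and a prompt embedding.

import numpy as np

def occlusion_map(image, prompt_embedding, clip_image_score, patch=32, stride=32):
    # image: (H, W, C) array with values in [0, 1]; patch/stride in pixels
    h, w = image.shape[:2]
    base = clip_image_score(image, prompt_embedding)
    heatmap = np.zeros((h // stride, w // stride))
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.5  # gray occlusion patch
            # How much the similarity drops when this region is hidden
            heatmap[i // stride, j // stride] = base - clip_image_score(occluded, prompt_embedding)
    return heatmap

A heatmap with no dominant hotspot indicates that prompt-relevant information is spread across the whole image, matching the paper’s finding.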

Method Section

Here is a summary of the method section from the paper:

  • The authors’ goal is to generate fooling master images that maximize the confidence scores from CLIP models across a variety of text prompts.

  • They use latent variable evolution (LVE) to search the latent space of generative models using an evolutionary algorithm.

  • Specifically, they use CMA-ES to evolve latent vectors that minimize a loss function rewarding high CLIP cosine similarity scores for a set of target prompts.

  • Each latent vector is decoded to an image, which CLIP scores against all target prompts; the lowest similarity across prompts is negated and returned as the loss, so minimizing the loss maximizes the worst-case score.

  • CMA-ES mutates candidates to minimize this loss over iterations, finding a latent vector that decodes to a fooling image.

  • They test on artworks and ImageNet, using the latent space of the Stable Diffusion autoencoder.

  • Hyperparameters like population size, generations, and resolution are tuned for the experiments.

  • For ImageNet, they also compare against a gradient-based method that optimizes the latent vectors directly with stochastic gradient descent using the Adam optimizer (a PyTorch sketch of this variant follows the high-level pseudocode below).

  • For mitigation, they refine CLIP models by adding off-manifold examples labeled with a special token during training.

  • They also mitigate the vulnerability by shifting the centroids of the image and text embeddings towards each other, reducing the modality gap (see the sketch after this section’s summary).

In summary, the core method is using LVE to evolve latents minimizing a loss that rewards high CLIP scores across target prompts. This enables finding a single image that fools CLIP across diverse captions.
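
For the centroid-shifting mitigation, a minimal sketch under assumed details (interpolation factor, exact normalization) is given below: compute the centroids of the normalized image and text embeddings and move each embedding toward the other modality’s centroid before similarities are computed.

import numpy as np

def shift_centroids(image_embs, text_embs, alpha=0.5):
    # alpha is an assumed interpolation factor, not the paper's exact value
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    gap = txt.mean(axis=0) - img.mean(axis=0)  # vector from image to text centroid
    img_shifted = img + alpha * gap            # move image embeddings toward the text cloud
    txt_shifted = txt - alpha * gap            # move text embeddings toward the image cloud
    # Re-normalize so that cosine similarities remain well defined
    img_shifted /= np.linalg.norm(img_shifted, axis=1, keepdims=True)
    txt_shifted /= np.linalg.norm(txt_shifted, axis=1, keepdims=True)
    return img_shifted, txt_shifted

Closing the gap this way removes the region of embedding space where a single off-manifold image can sit between the two modalities and score well against many captions at once.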

High-Level Pseudo Code

Here is high-level pseudocode for the approach in the paper:

# Hyperparameters
target_prompts = ["caption 1", "caption 2", ...] 
generative_model = pretrained_autoencoder 
 
# Initialize CMA-ES over the latent space of the generative model
cma_es = init_cma_es(generative_model)
population = cma_es.sample()
 
# Evolutionary loop
for i in range(generations):
 
  # Generate image candidates
  images = decode_latents(generative_model, population)
  
  # Score images against prompts with CLIP 
  losses = []
  for image in images:
    similarities = []
    for prompt in target_prompts:
      similarity = clip_similarity(image, prompt)  
      similarities.append(similarity)
    loss = -min(similarities)  # Negate the worst-case similarity so minimizing the loss maximizes it
    losses.append(loss)
 
  # Update CMA-ES population
  population = cma_es.update(population, losses) 
 
# Return best fooling image found by CMA-ES
best_latent = cma_es.best_solution()
fooling_image = decode_latent(generative_model, best_latent)

This shows the overall flow of using CMA-ES to evolve latents into a CLIP fooling image by maximizing the worst-case similarity across a set of target prompts. The paper follows the same approach, with additional implementation details.
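
The gradient-based variant mentioned in the method section replaces CMA-ES with direct optimization of the latent vector. Below is a minimal PyTorch sketch; decoder and clip_model are placeholders for the Stable Diffusion decoder and a CLIP model, and the step count and learning rate are assumptions rather than the paper’s settings.

import torch

def mine_fooling_image_sgd(decoder, clip_model, text_embeddings,
                           latent_shape=(1, 4, 64, 64), steps=1000, lr=0.1):
    # text_embeddings: precomputed, L2-normalized CLIP text embeddings, shape (n_prompts, d)
    latent = torch.zeros(latent_shape, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        image = decoder(latent)                   # decode latent to an image
        img_emb = clip_model.encode_image(image)  # CLIP image embedding
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        similarities = img_emb @ text_embeddings.T  # cosine similarities to all prompts
        loss = -similarities.min()                  # maximize the worst-case similarity
        loss.backward()
        optimizer.step()

    return decoder(latent).detach()

Note that the objective is the same worst-case (minimum) similarity as in the CMA-ES loop above; only the search procedure differs.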

Detailed Pseudo Code

Here is more detailed pseudocode to implement the approach from the paper:

# Hyperparameters
target_prompts = ["caption 1", "caption 2", ...]
latent_dim = 16384  # 4 x 64 x 64 latent feature maps for 512x512 input images
population_size = 33  # CMA-ES default rule: 4 + floor(3 * ln(latent_dim))
generations = 18000
 
# Initialize CMA-ES
mean = numpy.zeros(latent_dim)  
sigma = 1
cma = CMAES(mean, sigma) 
population = cma.ask(population_size) 
 
# Pretrained models
clip = CLIPModel() 
autoencoder = StableDiffusionDecoder()
 
for i in range(generations):
 
  # Decode latents to images
  images = []
  for latent in population:
    # Reshape the flat CMA-ES vector into latent feature maps before decoding
    image = autoencoder.decode(latent.reshape(1, 4, 64, 64))
    images.append(image)
 
  # Score images
  losses = []
  for image in images:
    similarities = []
    for prompt in target_prompts:
       text = clip.encode_text(prompt)
       image_embedding = clip.encode_image(image)
       similarity = cosine_similarity(text, image_embedding)
       similarities.append(similarity)
 
    # Negate the lowest similarity: minimizing this loss
    # maximizes the worst-case similarity across all prompts
    loss = -min(similarities)
    losses.append(loss)
  
  # Report losses to CMA-ES and sample the next generation
  cma.tell(population, losses)
  population = cma.ask(population_size)
 
# Return best solution
best_latent = cma.result.xfavorite 
fooling_image = autoencoder.decode(best_latent)

This includes more implementation details like latent dimension, population size, pretrained models, encoding prompts, computing similarities, and updating CMA-ES. The overall structure follows the high-level pseudocode.
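
As a quick sanity check, one can verify the mined image by comparing its CLIP scores against a set of genuine images for every target prompt. The snippet below reuses the placeholder names from the pseudocode above; reference_images is an assumed list of real images to compare against.

# Compare the mined fooling image against real reference images
for prompt in target_prompts:
    text = clip.encode_text(prompt)
    fooled = cosine_similarity(text, clip.encode_image(fooling_image))
    best_real = max(cosine_similarity(text, clip.encode_image(img)) for img in reference_images)
    print(prompt, fooled, best_real)  # the attack is effective where fooled >= best_real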