ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Authors: Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
Abstract: Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss effectively reduces the impact of noisy data and enhances the efficiency of the pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experimental results show that ALIP achieves state-of-the-art performance on multiple downstream tasks, including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at ALIP.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes ALIP (Adaptive Language-Image Pre-training), a new approach for contrastive language-image pre-training.
- ALIP takes as input raw image-text pairs as well as synthetic captions generated by an off-the-shelf caption model (OFA).
- The core of ALIP is the Language Consistency Gate (LCG) and Description Consistency Gate (DCG), which dynamically adjust sample weights and image-text/caption pair weights during training.
- An adaptive contrastive loss incorporates the LCG and DCG weights to reduce the impact of noisy samples.
Why:
- To address the issue of noisy and mismatched image-text pairs in web data, which can negatively impact representation learning in contrastive pre-training methods like CLIP.
- Raw web texts can be abstract, one-sided, or contain details irrelevant to the image; synthetic captions provide complementary information and more accurate image descriptions.
- Adaptive weighting and the adaptive contrastive loss help reduce the influence of mismatched or noisy samples.
How:
- LCG assigns lower weights to samples whose raw text and synthetic caption are dissimilar.
- DCG assigns higher weights to image-text/caption pairs with high similarity.
- The adaptive contrastive loss uses the LCG sample weights and DCG pair weights to reduce the impact of noisy samples.
- ALIP is pre-trained on image-text-caption triplets and evaluated on downstream tasks such as image-text retrieval and linear-probe classification (see the retrieval sketch below).
In summary, ALIP introduces synthetic captions and adaptive weighting mechanisms during pre-training to improve robustness to noise and enhance representation learning compared to prior methods such as CLIP.
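For the retrieval evaluation mentioned above, a minimal sketch of zero-shot image-text retrieval with frozen encoders is shown below; the encoder and data-loader interfaces are assumed placeholders rather than the paper's released evaluation code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieval_recall_at_k(image_encoder, text_encoder, loader, k=1):
    # loader is assumed to yield paired (image, tokenized text) batches.
    img_embs, txt_embs = [], []
    for images, texts in loader:
        img_embs.append(F.normalize(image_encoder(images), dim=-1))
        txt_embs.append(F.normalize(text_encoder(texts), dim=-1))
    img_embs = torch.cat(img_embs)                    # (N, d)
    txt_embs = torch.cat(txt_embs)                    # (N, d)
    sim = img_embs @ txt_embs.t()                     # cosine similarities, (N, N)
    targets = torch.arange(sim.size(0), device=sim.device)
    # Image-to-text retrieval: does the matching caption rank in the top-k?
    topk = sim.topk(k, dim=1).indices
    i2t = (topk == targets[:, None]).any(dim=1).float().mean().item()
    # Text-to-image retrieval: same check on the transposed similarity matrix.
    topk = sim.t().topk(k, dim=1).indices
    t2i = (topk == targets[:, None]).any(dim=1).float().mean().item()
    return i2t, t2i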
Main Contributions
Here are the key contributions of this paper:
- Proposes ALIP, a bi-path model that integrates raw-text supervision and synthetic-caption supervision for contrastive language-image pre-training.
- Introduces the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) to dynamically adjust sample and pair weights based on text-caption and image-text/caption similarities.
- Designs an adaptive contrastive loss function incorporating the LCG and DCG weights to reduce the impact of noisy samples.
- Achieves state-of-the-art performance on downstream tasks including zero-shot image-text retrieval and linear classification probe.
- Demonstrates the effectiveness of ALIP on different model sizes and pre-training datasets such as YFCC15M and LAION.
- Provides analysis showing ALIP's ability to better align image patches and text tokens compared to prior methods.
- Open-sources the code and pre-trained models to facilitate future research.
In summary, the key contributions are:
- The proposed ALIP model architecture and training methodology.
- Adaptive weighting mechanisms via LCG and DCG.
- Superior performance on various downstream tasks.
- Extensive experiments demonstrating robustness across models, datasets, and tasks.
- Open-sourced code and models.
Method Section
Here is a summary of the method section from the paper:
- The ALIP model takes as input an image, its raw text, and a synthetic caption generated by OFA.
- The image, text, and caption are encoded into normalized embeddings using separate encoders (the text and caption encoders share weights).
- Three similarity scores are computed: between raw text and synthetic caption (Stc), between image and raw text (Sxt), and between image and synthetic caption (Sxc).
- The Language Consistency Gate (LCG) computes a sample weight Ws from Stc and a historical average text-caption similarity Htc. Samples with Stc > Htc get Ws = 1; otherwise Ws decays exponentially with the gap, controlled by a hyperparameter.
- The Description Consistency Gate (DCG) computes image-text and image-caption pair weights Wt and Wc from Sxt, Sxc and the historical averages Hxt and Hxc. If Ws = 1, Wt and Wc are set to 1; otherwise they decay exponentially based on another hyperparameter.
- The adaptive contrastive loss combines the standard InfoNCE loss with the LCG sample weights and DCG pair weights, reducing the impact of mismatched or noisy samples.
- Two losses are computed, one between image and raw text and one between image and synthetic caption; the overall ALIP loss is their sum.
- ALIP is pre-trained on image-text-caption triplets from web datasets like YFCC15M. The pre-trained model is evaluated on downstream tasks like retrieval and classification with frozen encoders.
In summary, the core of the ALIP method is the adaptive weighting mechanisms (LCG and DCG) and the contrastive loss function that together improve robustness and efficiency during pre-training; a small numeric illustration of the gating follows.
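To make the gating concrete, the short snippet below plugs illustrative similarity values (not taken from the paper) into the exponential-decay rule described above, using the decay rate of 2 that appears in the detailed pseudo code further down.

import math

gamma_s, gamma_p = 2.0, 2.0          # decay rates (values from the pseudo code below)
Htc, Hxt = 0.60, 0.30                # historical average similarities (illustrative)
stc, sxt = 0.45, 0.22                # current sample's similarities (illustrative)

# LCG: the text-caption similarity is below its historical average, so down-weight the sample.
ws = 1.0 if stc > Htc else math.exp((stc - Htc) * gamma_s)    # exp(-0.30) ≈ 0.74
# DCG: because ws < 1, the image-text pair weight also decays.
wt = 1.0 if ws == 1.0 else math.exp((sxt - Hxt) * gamma_p)    # exp(-0.16) ≈ 0.85
print(ws, wt)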
High-Level Pseudo Code
Here is the high-level pseudo code for the ALIP method:
# Input:
# X: raw image
# T: raw text
# C: synthetic caption
# Encoders (text and caption share one encoder)
image_encoder = ImageEncoder()
text_encoder = TextEncoder()
caption_encoder = text_encoder
# Get L2-normalized embeddings
x = l2_normalize(image_encoder(X))
t = l2_normalize(text_encoder(T))
c = l2_normalize(caption_encoder(C))
# Similarities
stc = cosine_similarity(t, c)
sxt = cosine_similarity(x, t)
sxc = cosine_similarity(x, c)
# Language Consistency Gate
ws = LCG(stc)
# Description Consistency Gate
wt = DCG(sxt)
wc = DCG(sxc)
# Adaptive Contrastive Loss
L_xt = AdaptiveContrastiveLoss(x, t, ws, wt)
L_xc = AdaptiveContrastiveLoss(x, c, ws, wc)
loss = L_xt + L_xc
# Pre-train ALIP by minimizing loss
The key steps are:
- Encode input into normalized embeddings
- Compute similarities between embeddings
- Apply LCG and DCG to get adaptive weights
- Compute adaptive contrastive loss using weights
- Pretrain model by minimizing the loss
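The LCG and DCG calls above are left abstract; a minimal sketch of how they could be written as functions is given below, assuming the exponential-decay form described in the Method section (the historical averages and the LCG output are passed in explicitly here rather than kept as internal state). The detailed pseudo code in the next section inlines the same logic.

import math

def LCG(stc, Htc, gamma_s=2.0):
    # Full weight if the text-caption similarity beats its historical average,
    # otherwise decay exponentially with the gap.
    return 1.0 if stc > Htc else math.exp((stc - Htc) * gamma_s)

def DCG(s, H, ws, gamma_p=2.0):
    # Pair weight: only decays when the sample itself was down-weighted by LCG.
    return 1.0 if ws == 1.0 else math.exp((s - H) * gamma_p)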
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the ALIP method:
# Hyperparameters
tau = 0.07             # temperature
m = 0.999              # momentum for updating historical averages
gammas = gammap = 2    # decay rates for LCG and DCG
# Encoders (text and caption share one encoder)
image_encoder = VisionTransformer()
text_encoder = Transformer()
caption_encoder = text_encoder
# Historical average similarities (momentum-updated; initialized to 0 here for illustration)
Htc = Hxt = Hxc = 0.0
for x, t, c in dataloader:
    # Get normalized embeddings
    x = image_encoder(x) / l2norm(image_encoder(x))
    t = text_encoder(t) / l2norm(text_encoder(t))
    c = caption_encoder(c) / l2norm(caption_encoder(c))
    # Per-sample similarities
    stc = dot(t, c)
    sxt = dot(x, t)
    sxc = dot(x, c)
    # Update historical averages with momentum
    Htc = m * Htc + (1 - m) * mean(stc)
    Hxt = m * Hxt + (1 - m) * mean(sxt)
    Hxc = m * Hxc + (1 - m) * mean(sxc)
    # Language Consistency Gate (applied per sample)
    if stc <= Htc:
        ws = exp((stc - Htc) * gammas)
    else:
        ws = 1
    # Description Consistency Gate (applied per sample)
    if ws < 1:
        wt = exp((sxt - Hxt) * gammap)
        wc = exp((sxc - Hxc) * gammap)
    else:
        wt = 1
        wc = 1
    # Adaptive contrastive losses for the two paths
    L_xt = AdaptiveInfoNCE(x, t, ws, wt, tau)
    L_xc = AdaptiveInfoNCE(x, c, ws, wc, tau)
    loss = L_xt + L_xc
    # Optimization step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Training loop: repeat the per-batch steps above for multiple epochs
for epoch in range(num_epochs):
    for x, t, c in dataloader:
        ...
The key aspects are:
- Getting normalized embeddings for image, text and caption
- Computing similarity scores and updating historical averages
- LCG and DCG to compute adaptive weights
- Adaptive InfoNCE loss using the weights
- Overall training process/loop
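The AdaptiveInfoNCE call in the loop above is not spelled out in this summary. The sketch below is one plausible reading rather than the paper's exact formulation: it assumes the LCG sample weight ws and the DCG pair weight wt are multiplied into a single per-sample weight that scales a symmetric InfoNCE loss.

import torch
import torch.nn.functional as F

def AdaptiveInfoNCE(x, t, ws, wt, tau=0.07):
    # x, t: L2-normalized embeddings of shape (N, d); ws, wt: per-sample weights of shape (N,).
    logits = x @ t.t() / tau                               # (N, N) similarity logits
    targets = torch.arange(x.size(0), device=x.device)     # matching pairs lie on the diagonal
    w = ws * wt                                            # combined per-sample weight (assumption)
    # Weighted symmetric cross-entropy (image-to-text and text-to-image directions).
    loss_i2t = F.cross_entropy(logits, targets, reduction='none')
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction='none')
    return (w * (loss_i2t + loss_t2i) / 2).mean()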