ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
Authors: Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu
Abstract: Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss effectively reduces the impact of noisy data and enhances the efficiency of the pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experimental results show that ALIP achieves state-of-the-art performance on multiple downstream tasks, including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at ALIP.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes ALIP (Adaptive Language-Image Pre-training), a new approach for contrastive language-image pre-training.
- ALIP takes as input raw image-text pairs as well as synthetic captions generated by an off-the-shelf caption model (OFA).
- The core of ALIP is the Language Consistency Gate (LCG) and Description Consistency Gate (DCG), which dynamically adjust sample weights and image-text/caption pair weights during training.
- An adaptive contrastive loss incorporates the LCG and DCG weights to reduce the impact of noisy samples.
Why:
- To address the issue of noisy and mismatched image-text pairs in web data, which can negatively impact representation learning in contrastive pre-training methods like CLIP.
- Raw web texts can be abstract, one-sided, or contain details irrelevant to the image; synthetic captions provide complementary information and more accurate image descriptions.
- Adaptive weighting and the adaptive contrastive loss help reduce the influence of mismatched or noisy samples.
How:
- LCG assigns lower weights to samples whose raw text and synthetic caption are dissimilar.
- DCG assigns higher weights to image-text/caption pairs with high similarity.
- The adaptive contrastive loss uses the LCG sample weights and DCG pair weights to reduce the impact of noisy samples.
- ALIP is pre-trained on image-text-caption triplets and evaluated on downstream tasks such as image-text retrieval and linear-probe classification (see the retrieval sketch below).
In summary, ALIP introduces synthetic captions and adaptive weighting mechanisms during pre-training to improve robustness to noise and enhance representation learning compared to prior methods such as CLIP.
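For the retrieval evaluation mentioned above, a minimal sketch of zero-shot image-text retrieval with frozen encoders is shown below; the encoder and data-loader interfaces are assumed placeholders rather than the paper's released evaluation code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieval_recall_at_k(image_encoder, text_encoder, loader, k=1):
    # loader is assumed to yield paired (image, tokenized text) batches.
    img_embs, txt_embs = [], []
    for images, texts in loader:
        img_embs.append(F.normalize(image_encoder(images), dim=-1))
        txt_embs.append(F.normalize(text_encoder(texts), dim=-1))
    img_embs = torch.cat(img_embs)                    # (N, d)
    txt_embs = torch.cat(txt_embs)                    # (N, d)
    sim = img_embs @ txt_embs.t()                     # cosine similarities, (N, N)
    targets = torch.arange(sim.size(0), device=sim.device)
    # Image-to-text retrieval: does the matching caption rank in the top-k?
    topk = sim.topk(k, dim=1).indices
    i2t = (topk == targets[:, None]).any(dim=1).float().mean().item()
    # Text-to-image retrieval: same check on the transposed similarity matrix.
    topk = sim.t().topk(k, dim=1).indices
    t2i = (topk == targets[:, None]).any(dim=1).float().mean().item()
    return i2t, t2i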
Main Contributions
Here are the key contributions of this paper:
- Proposes ALIP, a bi-path model that integrates raw-text supervision and synthetic-caption supervision for contrastive language-image pre-training.
- Introduces the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) to dynamically adjust sample and pair weights based on text-caption and image-text/caption similarities.
- Designs an adaptive contrastive loss function incorporating the LCG and DCG weights to reduce the impact of noisy samples.
- Achieves state-of-the-art performance on downstream tasks including zero-shot image-text retrieval and linear classification probe.
- Demonstrates the effectiveness of ALIP on different model sizes and pre-training datasets such as YFCC15M and LAION.
- Provides analysis showing ALIP's ability to better align image patches and text tokens compared to prior methods.
- Open-sources the code and pre-trained models to facilitate future research.
In summary, the key contributions are:
- The proposed ALIP model architecture and training methodology.
- Adaptive weighting mechanisms via LCG and DCG.
- Superior performance on various downstream tasks.
- Extensive experiments demonstrating robustness across models, datasets, and tasks.
- Open-sourced code and models.
Method Section
Here is a summary of the method section from the paper:
- The ALIP model takes as input an image, its raw text, and a synthetic caption generated by OFA.
- The image, text, and caption are encoded into normalized embeddings using separate encoders (the text and caption encoders share weights).
- Three similarity scores are computed: between raw text and synthetic caption (Stc), between image and raw text (Sxt), and between image and synthetic caption (Sxc).
- The Language Consistency Gate (LCG) computes a sample weight Ws from Stc and a historical average text-caption similarity Htc. Samples with Stc > Htc get Ws = 1; otherwise Ws decays exponentially with the gap, controlled by a hyperparameter.
- The Description Consistency Gate (DCG) computes image-text and image-caption pair weights Wt and Wc from Sxt, Sxc and the historical averages Hxt and Hxc. If Ws = 1, Wt and Wc are set to 1; otherwise they decay exponentially based on another hyperparameter.
- The adaptive contrastive loss combines the standard InfoNCE loss with the LCG sample weights and DCG pair weights, reducing the impact of mismatched or noisy samples.
- Two losses are computed, one between image and raw text and one between image and synthetic caption; the overall ALIP loss is their sum.
- ALIP is pre-trained on image-text-caption triplets from web datasets like YFCC15M. The pre-trained model is evaluated on downstream tasks like retrieval and classification with frozen encoders.
In summary, the core of the ALIP method is the adaptive weighting mechanisms (LCG and DCG) and the contrastive loss function that together improve robustness and efficiency during pre-training; a small numeric illustration of the gating follows.
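To make the gating concrete, the short snippet below plugs illustrative similarity values (not taken from the paper) into the exponential-decay rule described above, using the decay rate of 2 that appears in the detailed pseudo code further down.

import math

gamma_s, gamma_p = 2.0, 2.0          # decay rates (values from the pseudo code below)
Htc, Hxt = 0.60, 0.30                # historical average similarities (illustrative)
stc, sxt = 0.45, 0.22                # current sample's similarities (illustrative)

# LCG: the text-caption similarity is below its historical average, so down-weight the sample.
ws = 1.0 if stc > Htc else math.exp((stc - Htc) * gamma_s)    # exp(-0.30) ≈ 0.74
# DCG: because ws < 1, the image-text pair weight also decays.
wt = 1.0 if ws == 1.0 else math.exp((sxt - Hxt) * gamma_p)    # exp(-0.16) ≈ 0.85
print(ws, wt)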
High-Level Pseudo Code
Here is the high-level pseudo code for the ALIP method:
# Input:
# X: raw image
# T: raw text
# C: synthetic caption
# Encoders (text and caption share one encoder)
image_encoder = ImageEncoder()
text_encoder = TextEncoder()
caption_encoder = text_encoder
# Get L2-normalized embeddings
x = l2_normalize(image_encoder(X))
t = l2_normalize(text_encoder(T))
c = l2_normalize(caption_encoder(C))
# Similarities
stc = cosine_similarity(t, c)
sxt = cosine_similarity(x, t)
sxc = cosine_similarity(x, c)
# Language Consistency Gate
ws = LCG(stc)
# Description Consistency Gate
wt = DCG(sxt)
wc = DCG(sxc)
# Adaptive Contrastive Loss
L_xt = AdaptiveContrastiveLoss(x, t, ws, wt)
L_xc = AdaptiveContrastiveLoss(x, c, ws, wc)
loss = L_xt + L_xc
# Pre-train ALIP by minimizing loss
The key steps are:
- Encode input into normalized embeddings
- Compute similarities between embeddings
- Apply LCG and DCG to get adaptive weights
- Compute adaptive contrastive loss using weights
- Pretrain model by minimizing the loss
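The LCG and DCG calls above are left abstract; a minimal sketch of how they could be written as functions is given below, assuming the exponential-decay form described in the Method section (the historical averages and the LCG output are passed in explicitly here rather than kept as internal state). The detailed pseudo code in the next section inlines the same logic.

import math

def LCG(stc, Htc, gamma_s=2.0):
    # Full weight if the text-caption similarity beats its historical average,
    # otherwise decay exponentially with the gap.
    return 1.0 if stc > Htc else math.exp((stc - Htc) * gamma_s)

def DCG(s, H, ws, gamma_p=2.0):
    # Pair weight: only decays when the sample itself was down-weighted by LCG.
    return 1.0 if ws == 1.0 else math.exp((s - H) * gamma_p)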
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the ALIP method:
# Hyperparameters
tau = 0.07             # temperature
m = 0.999              # momentum for updating historical averages
gammas = gammap = 2    # decay rates for LCG and DCG
# Encoders (text and caption share one encoder)
image_encoder = VisionTransformer()
text_encoder = Transformer()
caption_encoder = text_encoder
# Historical average similarities (momentum-updated; initialized to 0 here for illustration)
Htc = Hxt = Hxc = 0.0
for x, t, c in dataloader:
    # Get normalized embeddings
    x = image_encoder(x) / l2norm(image_encoder(x))
    t = text_encoder(t) / l2norm(text_encoder(t))
    c = caption_encoder(c) / l2norm(caption_encoder(c))
    # Per-sample similarities
    stc = dot(t, c)
    sxt = dot(x, t)
    sxc = dot(x, c)
    # Update historical averages with momentum
    Htc = m * Htc + (1 - m) * mean(stc)
    Hxt = m * Hxt + (1 - m) * mean(sxt)
    Hxc = m * Hxc + (1 - m) * mean(sxc)
    # Language Consistency Gate (applied per sample)
    if stc <= Htc:
        ws = exp((stc - Htc) * gammas)
    else:
        ws = 1
    # Description Consistency Gate (applied per sample)
    if ws < 1:
        wt = exp((sxt - Hxt) * gammap)
        wc = exp((sxc - Hxc) * gammap)
    else:
        wt = 1
        wc = 1
    # Adaptive contrastive losses for the two paths
    L_xt = AdaptiveInfoNCE(x, t, ws, wt, tau)
    L_xc = AdaptiveInfoNCE(x, c, ws, wc, tau)
    loss = L_xt + L_xc
    # Optimization step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Training loop: repeat the per-batch steps above for multiple epochs
for epoch in range(num_epochs):
    for x, t, c in dataloader:
        ...
The key aspects are:
- Getting normalized embeddings for image, text and caption
- Computing similarity scores and updating historical averages
- LCG and DCG to compute adaptive weights
- Adaptive InfoNCE loss using the weights
- Overall training process/loop
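The AdaptiveInfoNCE call in the loop above is not spelled out in this summary. The sketch below is one plausible reading rather than the paper's exact formulation: it assumes the LCG sample weight ws and the DCG pair weight wt are multiplied into a single per-sample weight that scales a symmetric InfoNCE loss.

import torch
import torch.nn.functional as F

def AdaptiveInfoNCE(x, t, ws, wt, tau=0.07):
    # x, t: L2-normalized embeddings of shape (N, d); ws, wt: per-sample weights of shape (N,).
    logits = x @ t.t() / tau                               # (N, N) similarity logits
    targets = torch.arange(x.size(0), device=x.device)     # matching pairs lie on the diagonal
    w = ws * wt                                            # combined per-sample weight (assumption)
    # Weighted symmetric cross-entropy (image-to-text and text-to-image directions).
    loss_i2t = F.cross_entropy(logits, targets, reduction='none')
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction='none')
    return (w * (loss_i2t + loss_t2i) / 2).mean()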