AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

Authors: Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu

Abstract: In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained multimodal representation model CLIP released by OpenAI, we altered its text encoder with a pre-trained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k-CN, COCO-CN and XTD. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at FlagAI.

What, Why and How

Here is a summary of the key points in the paper:

What:

  • The paper presents a method to train a bilingual/multilingual multimodal representation model that aligns image representations with text in multiple languages.

  • The method involves two stages:

    1. Teacher learning: Distill knowledge from CLIP’s English text encoder into a multilingual text encoder (XLM-R).
    2. Contrastive learning: Fine-tune the model on multilingual text-image pairs using contrastive loss.
  • The resulting model is called Alter ego CLIP (AltCLIP). A 9-language version is also presented (AltCLIP_M9).

Why:

  • Extends CLIP’s capabilities to multiple languages while retaining strong English performance. Allows aligning images with text in different languages.

  • Requires much less training data than prior work on Chinese CLIP models. Uses only tens of millions of text examples versus hundreds of millions.

How:

  • Teacher learning stage distills CLIP’s text encoder into XLM-R using parallel text data.

  • Contrastive learning stage fine-tunes on smaller multilingual text-image datasets. Image encoder is frozen, only text encoder updated.

  • Evaluated on ImageNet, retrieval tasks, and CLIP benchmark in both English and Chinese. Sets new SOTA on several Chinese tasks.

  • AltCLIP_M9 evaluated on multilingual retrieval dataset XTD. Achieves SOTA in 7 languages while maintaining English performance.

  • Examples are shown of generating images from both English and Chinese prompts using a diffusion model fine-tuned with AltCLIP as its text encoder.

Main Contributions

Here are the main contributions of this paper:

  • Proposes a two-stage training method involving teacher learning and contrastive learning to learn bilingual/multilingual multimodal models.

  • Introduces Alter ego CLIP (AltCLIP), a bilingual English-Chinese model trained with this approach.

  • Demonstrates AltCLIP achieves strong performance on ImageNet classification and retrieval tasks in both English and Chinese. Sets new SOTA on several Chinese datasets.

  • Shows comparable performance to CLIP on a wide range of English benchmarks, with much less training data required.

  • Introduces AltCLIP_M9, a 9-language multimodal model trained with the proposed approach. Achieves SOTA on the XTD multilingual retrieval benchmark.

  • Analyzes the effects of different training data used in the teacher learning stage. Shows all data sources contribute to overall performance.

  • Provides qualitative results using AltCLIP for multilingual text-to-image generation. Generated images reflect text prompts in different languages.

  • Makes code and pretrained models publicly available to facilitate further research.

In summary, the key contributions are proposing the two-stage training approach, achieving strong bilingual and multilingual performance with less data, and analyzing the model capabilities on a diverse set of tasks and languages.

Method Section

Here is a summary of the method section from the paper:

The paper proposes a two-stage training approach to learn bilingual/multilingual multimodal models:

Teacher Learning Stage:

  • Use the text encoder from CLIP as the (frozen) teacher model. Initialize the student model from XLM-R, which is pretrained on multilingual data.
  • Minimize the MSE between student and teacher embeddings on parallel text data, aligning the student with CLIP’s text-image knowledge (a step-level sketch follows this list).
  • Parallel data includes machine-translated data and human-translated datasets.
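
A minimal PyTorch sketch of one teacher-learning step is given below. It assumes `teacher_model` is the frozen CLIP text encoder, `student_model` is XLM-R returning a pooled sentence embedding, and `proj` is a linear head (an assumption of this sketch, used to map the student's hidden size to the teacher's embedding size):

import torch
import torch.nn as nn

# Assumed dimensions: XLM-R Large hidden size (1024) -> CLIP ViT-L/14 text embedding size (768).
proj = nn.Linear(1024, 768)
mse = nn.MSELoss()
optimizer = torch.optim.AdamW(
    list(student_model.parameters()) + list(proj.parameters()), lr=1e-4
)

def teacher_learning_step(text_en, text_other):
  # The teacher is frozen: its embeddings serve as fixed regression targets.
  with torch.no_grad():
    teacher_emb = teacher_model(text_en)
  # The student encodes the parallel sentence; project to the teacher's dimension.
  student_emb = proj(student_model(text_other))
  loss = mse(student_emb, teacher_emb)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()
  return loss.item()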

Contrastive Learning Stage:

  • Use CLIP’s image encoder and the distilled XLM-R text encoder from stage 1.
  • Train with a contrastive loss on multilingual text-image datasets (a loss sketch follows this list).
  • Freeze the image encoder; only the text encoder parameters are updated.
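
The contrastive loss itself is not spelled out in the pseudo code below, so here is a minimal sketch of a CLIP-style symmetric image-text contrastive loss (InfoNCE over both directions). This is an illustrative implementation rather than the paper's exact code; `logit_scale` is the usual learnable inverse temperature and is an assumption of this sketch:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
  # Normalize so the dot products below are cosine similarities.
  image_emb = F.normalize(image_emb, dim=-1)
  text_emb = F.normalize(text_emb, dim=-1)
  # Pairwise similarity matrix, scaled by the learnable temperature.
  logits_per_image = logit_scale * image_emb @ text_emb.t()
  logits_per_text = logits_per_image.t()
  # Matching image/text pairs sit on the diagonal.
  labels = torch.arange(image_emb.size(0), device=image_emb.device)
  loss_i2t = F.cross_entropy(logits_per_image, labels)
  loss_t2i = F.cross_entropy(logits_per_text, labels)
  return (loss_i2t + loss_t2i) / 2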

Implementation Details:

  • Text encoder initialized from XLM-R Large.
  • Use CLIP ViT-L/14 as teacher text encoder and image encoder.
  • Teacher learning uses 10 epochs with MSE loss on parallel text.
  • Contrastive learning uses 1 epoch and contrastive loss on text-image data.

The two stages transfer CLIP’s capabilities to the multilingual XLM-R text encoder, then further align representations using multilingual text-image data.

High-Level Pseudo Code

Here is the high-level pseudo code for the method presented in this paper:

# Teacher Learning Stage

text_encoder_teacher = clip_text_encoder   # frozen teacher
text_encoder_student = xlm_r_pretrained    # multilingual student

for text_en, text_other in parallel_text_data:
  teacher_embedding = text_encoder_teacher(text_en)    # no gradients into the teacher
  student_embedding = text_encoder_student(text_other)

  loss = MSE(teacher_embedding, student_embedding)
  optimize(loss)                           # update only the student

# Contrastive Learning Stage

image_encoder = clip_image_encoder         # kept frozen
text_encoder = text_encoder_student        # distilled in stage 1

for image, text in multilingual_text_image_data:
  image_embedding = image_encoder(image)
  text_embedding = text_encoder(text)

  loss = contrastive_loss(image_embedding, text_embedding)
  optimize(loss, params=text_encoder)      # only the text encoder is updated

The key steps are distilling the text encoder on parallel text in stage 1, then fine-tuning on multilingual text-image data with a contrastive loss in stage 2. The image encoder is fixed and only the text encoder is updated in stage 2; a short sketch of how this freezing can be wired up is given below.
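
A minimal PyTorch sketch of the stage-2 freezing, assuming both encoders are `nn.Module` objects and reusing the `clip_contrastive_loss` sketch from above (`logit_scale` is the same assumed temperature); variable names are illustrative:

import torch

# Freeze the image encoder so none of its parameters receive gradients.
image_encoder.requires_grad_(False)
image_encoder.eval()

# The optimizer sees only the text encoder's parameters (low LR for fine-tuning).
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=2e-6)

for images, texts in text_image_dataloader:
  with torch.no_grad():                       # image features need no graph
    image_embeddings = image_encoder(images)
  text_embeddings = text_encoder(texts)

  loss = clip_contrastive_loss(image_embeddings, text_embeddings, logit_scale)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()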

Detailed Pseudo Code

Here is some more detailed pseudo code to implement the key parts of the method proposed in this paper:

# Hyperparameters (teacher learning stage)
teacher_model = clip_text_encoder   # CLIP ViT-L/14 text tower, kept frozen
student_model = xlm_r_large         # XLM-R Large (per the implementation details)
num_epochs = 10
lr = 1e-4
batch_size = 1024

optimizer = AdamW(student_model.parameters(), lr=lr)

# Teacher Learning Stage

for epoch in range(num_epochs):

  for batch in parallel_text_dataloader:

    text_en, text_other = batch

    with no_grad():                 # the teacher provides fixed targets
      teacher_emb = teacher_model(text_en)
    student_emb = student_model(text_other)

    loss = MSELoss(teacher_emb, student_emb)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Contrastive Learning Stage

image_encoder = clip_image_encoder  # frozen throughout stage 2
text_encoder = student_model        # distilled in stage 1

num_epochs = 1
batch_size = 1024
lr = 2e-6

freeze(image_encoder)               # only the text encoder is updated
optimizer = AdamW(text_encoder.parameters(), lr=lr)

for epoch in range(num_epochs):

  for batch in text_image_dataloader:

    images, texts = batch

    image_embeddings = image_encoder(images)
    text_embeddings = text_encoder(texts)

    loss = NT_XentLoss(image_embeddings, text_embeddings)  # symmetric image-text contrastive loss

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

The key aspects are:

  • Using an MSE loss for teacher distillation on parallel text
  • Freezing the image encoder and updating only the text encoder in stage 2
  • Using a symmetric contrastive loss (e.g. NT-Xent / CLIP-style InfoNCE) for text-image alignment
  • A lower learning rate and fewer epochs in stage 2, which acts as a light fine-tuning step
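
Finally, as a usage illustration (not part of the paper's method description), here is a short sketch of scoring an image against English and Chinese captions with the released bilingual model. It assumes the AltCLIP integration in Hugging Face transformers and the `BAAI/AltCLIP` checkpoint name; if that is unavailable, the same flow applies to the FlagAI release:

import requests
from PIL import Image
from transformers import AltCLIPModel, AltCLIPProcessor

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One English and one Chinese caption for the same concept.
texts = ["a photo of two cats", "两只猫的照片"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # similarity over the two captions
print(probs)

Since both captions describe the same image, the two scores should be close, illustrating that English and Chinese text embeddings live in the same space as the image embeddings.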