Unleashing the Imagination of Text: A Novel Framework for Text-to-image Person Retrieval via Exploring the Power of Words

Authors: Delong Liu, Haiwen Li

Abstract: The goal of text-to-image person retrieval is to retrieve person images from a large gallery that match a given textual description. The main challenge of this task lies in the significant differences in information representation between the visual and textual modalities: the textual modality conveys abstract and precise information through vocabulary and grammatical structures, while the visual modality conveys concrete and intuitive information through images. To fully leverage the expressive power of textual representations, it is essential to accurately map abstract textual descriptions to specific images. To address this issue, we propose a novel framework to Unleash the Imagination of Text (UIT) in text-to-image person retrieval, aiming to fully explore the power of words in sentences. Specifically, the framework employs the pre-trained full CLIP model as a dual encoder for images and texts, taking advantage of prior cross-modal alignment knowledge. A Text-guided Image Restoration (TIR) auxiliary task is proposed to implicitly map abstract textual entities to specific image regions, facilitating alignment between textual and visual embeddings. Additionally, we introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences. To focus the model on the key components within sentences, we propose a novel text data augmentation technique. Our proposed methods achieve state-of-the-art results on three popular benchmark datasets, and the source code will be made publicly available shortly.

What, Why and How

Here is a summary of the key points in the paper:

What:

  • The paper proposes a novel framework called UIT (Unleashing the Imagination of Text) for text-to-image person retrieval. The goal is to retrieve person images from a gallery that match a given textual description.

  • The main components of UIT are:

    • Text-guided Image Restoration (TIR): An auxiliary task that reconstructs randomly masked image patches using both image and text features to align image regions and text phrases.
    • Cross-Modal Triplet (CMT) loss: A new loss function that focuses on hard negative samples to improve the model’s ability to distinguish minor differences.
    • Targeted text augmentation: Randomly keeps only the key attribute words in a sentence to help the model identify important phrases.

Why:

  • Text-to-image person retrieval is challenging due to differences between textual and visual representations. Mapping abstract text to concrete images is difficult.

  • Previous methods have limitations in establishing fine-grained alignment between images and text, and they do not handle hard samples well during training.

  • Better implicit alignment of images and text can improve retrieval performance without increasing inference cost. Handling hard samples also improves discriminative ability.

How:

  • Uses pre-trained CLIP as the image and text encoder to leverage cross-modal knowledge.

  • The TIR module establishes fine-grained correspondence between image patches and text implicitly through masked image modeling.

  • The CMT loss selects challenging negative pairs in each batch for training to handle hard samples.

  • Text augmentation randomly keeps only key attribute words to help the model identify important phrases.

  • Experiments show state-of-the-art results on three benchmarks, demonstrating effectiveness.

Main Contributions

Here are the main contributions of this paper:

  1. Proposes a new framework called UIT for text-to-image person retrieval that explores the power of words in sentences.

  2. Introduces a Text-guided Image Restoration (TIR) auxiliary task that establishes implicit fine-grained correspondence between image patches and text phrases, improving feature alignment.

  3. Designs a Cross-Modal Triplet (CMT) loss that focuses on hard negative samples during training to improve discriminative ability.

  4. Presents a novel text data augmentation method that randomly keeps key attribute words to help identify important phrases.

  5. Achieves state-of-the-art results on three benchmark datasets (CUHK-PEDES, ICFG-PEDES, RSTPReid), demonstrating effectiveness.

  6. Shows the feasibility of transferring knowledge from CLIP to text-to-image person retrieval via fine-tuning.

  7. Provides insights on designing auxiliary tasks: better performance on the auxiliary task does not necessarily improve retrieval; interaction between the modalities is key.

  8. Demonstrates an approach to improve alignment and matching between visual and textual modalities for cross-modal retrieval.

  9. The code and models will be made publicly available to benefit further research.

In summary, the key contributions are proposing an effective framework, novel auxiliary task design, tailored loss function, text augmentation method, and achieving superior results on multiple benchmarks. The paper also provides useful insights to guide future work.

Method Section

Here is a summary of the method section from the paper:

The UIT framework consists of two feature encoders (image and text) and a cross-modal feature decoder.

Image Encoder:

  • Uses pre-trained CLIP ViT as the image encoder to extract global image features.

Text Encoder:

  • Uses pre-trained CLIP text encoder (Transformer) to extract global text features.
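
For intuition only, here is a minimal sketch (not the authors' released code) of how global image and text features can be extracted with OpenAI's open-source clip package; the ViT-B/16 backbone, file name, and example caption are assumptions:

import torch
import clip
from PIL import Image

# Load a pre-trained CLIP dual encoder (backbone choice is an assumption)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("person.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a woman wearing a red jacket and blue jeans"]).to(device)

with torch.no_grad():
    img_feature = model.encode_image(image)    # global image embedding
    text_feature = model.encode_text(text)     # global text embedding

# Cosine similarity between the two global embeddings
img_feature = img_feature / img_feature.norm(dim=-1, keepdim=True)
text_feature = text_feature / text_feature.norm(dim=-1, keepdim=True)
similarity = (img_feature @ text_feature.T).item()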

Text-guided Image Restoration (TIR):

  • An auxiliary task to establish fine-grained alignment between images and text.
  • Masks a portion of the input image and feeds the masked image into the image encoder.
  • Text features and masked image features are fed into a lightweight decoder.
  • The decoder predicts the pixel values of the masked patches using both text and image features.
  • The loss minimizes the MSE between the predicted and original pixel values.
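
A minimal PyTorch-style sketch of such a restoration decoder is given below. It follows the description above (cross-attention from masked image tokens to text tokens, a few transformer blocks, and an MLP that regresses pixel values), but the dimensions, depth, and loss masking are assumptions rather than the paper's exact configuration:

import torch
import torch.nn as nn

class TIRDecoder(nn.Module):
    """Sketch of a text-guided restoration decoder (sizes are assumptions)."""
    def __init__(self, dim=512, heads=8, depth=2, patch_pixels=16 * 16 * 3):
        super().__init__()
        # Masked image tokens attend to the text tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # A shallow stack of self-attention blocks over the fused tokens
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        # Regress raw pixel values for every patch
        self.to_pixels = nn.Linear(dim, patch_pixels)

    def forward(self, masked_img_tokens, text_tokens):
        ctx, _ = self.cross_attn(masked_img_tokens, text_tokens, text_tokens)
        ctx = self.blocks(ctx)
        return self.to_pixels(ctx)

def tir_loss(pred_patches, target_patches, mask):
    # mask: (B, N) boolean tensor marking which patches were masked
    return ((pred_patches - target_patches) ** 2)[mask].mean()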

Cross-Modal Triplet (CMT) Loss:

  • Selects challenging positive and negative pairs within each batch during training.
  • Brings positive pairs closer and pushes apart negative pairs in embedding space.
  • Improves model’s ability to differentiate subtle differences.
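
As an illustration, a generic batch-hard formulation of such a cross-modal triplet loss is sketched below; the hardest-pair mining rule and the margin value are assumptions and may differ from the paper's exact definition:

import torch
import torch.nn.functional as F

def cmt_loss(img_feat, text_feat, labels, margin=0.2):
    """Batch-hard cross-modal triplet loss sketch (margin is an assumption)."""
    img_feat = F.normalize(img_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = img_feat @ text_feat.t()                     # sim[i, j]: image i vs. text j
    pos = labels.unsqueeze(1) == labels.unsqueeze(0)   # (B, B) same-identity mask

    # Image-to-text: hardest positive / negative text for each image anchor
    hard_pos_i2t = sim.masked_fill(~pos, 2.0).min(dim=1).values
    hard_neg_i2t = sim.masked_fill(pos, -2.0).max(dim=1).values
    # Text-to-image: hardest positive / negative image for each text anchor
    hard_pos_t2i = sim.masked_fill(~pos, 2.0).min(dim=0).values
    hard_neg_t2i = sim.masked_fill(pos, -2.0).max(dim=0).values

    loss = F.relu(margin + hard_neg_i2t - hard_pos_i2t).mean() \
         + F.relu(margin + hard_neg_t2i - hard_pos_t2i).mean()
    return loss / 2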

Data Augmentation:

  • Visual: Converts images to grayscale during TIR so that restoring the masked patches has to rely on color cues from the text.
  • Text: Randomly keeps only the key attribute words in a sentence to help the model identify important phrases (see the sketch below).
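
For illustration, a naive version of the text augmentation could look like the sketch below; the attribute vocabulary and keep probability are assumptions used only to convey the idea of randomly retaining attribute words:

import random

# Hypothetical attribute vocabulary; the paper's actual word selection may differ
ATTRIBUTE_WORDS = {
    "red", "blue", "black", "white", "green", "grey",
    "shirt", "jacket", "jeans", "skirt", "shoes", "backpack", "hair",
}

def keep_attributes(sentence, p=0.5):
    """With probability p, keep only the attribute-bearing words of a caption."""
    if random.random() > p:
        return sentence                           # leave the caption unchanged
    words = [w.strip(".,") for w in sentence.lower().split()]
    kept = [w for w in words if w in ATTRIBUTE_WORDS]
    return " ".join(kept) if kept else sentence

# Example: "The woman wears a red jacket and blue jeans." -> "red jacket blue jeans"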

Training:

  • End-to-end training with combined losses.
  • During inference, only the dual encoder structure is retained.

The main novelty lies in using the TIR auxiliary task and CMT loss to improve feature alignment and matching between modalities. Data augmentation also plays a supporting role.

High-Level Pseudo Code

Here is the high-level pseudo code for the method presented in this paper:

# Image Encoder
img_patches = tokenize_image(img) 
img_embeddings = CLIP_ViT(img_patches)
img_feature = img_embeddings[CLS]
 
# Text Encoder 
text_tokens = tokenize_text(text)
text_embeddings = CLIP_Transformer(text_tokens) 
text_feature = text_embeddings[EOS]  # CLIP takes the text feature from the [EOS] token
 
# Text-guided Image Restoration
masked_img_patches = mask_image(img_patches)
masked_img_embeddings = CLIP_ViT(masked_img_patches)
 
decoder_input = concatenate(masked_img_embeddings, text_embeddings)
restored_patches = Decoder(decoder_input) 
 
tir_loss = MSE(restored_patches, original_patches)
 
# Training
for (imgs, texts), labels in dataloader:

  img_features = Image_Encoder(imgs)
  text_features = Text_Encoder(texts)

  # TIR auxiliary task (computed per batch as shown above)
  tir_loss = MSE(restored_patches, original_patches)

  # Matching losses supervised by person identity labels
  id_loss = ID_Loss(img_features, text_features, labels)
  sdm_loss = SDM_Loss(img_features, text_features, labels)
  cmt_loss = CMT_Loss(img_features, text_features, labels)

  # Backpropagate
  loss = tir_loss + id_loss + sdm_loss + cmt_loss
  loss.backward()
  optimizer.step()
  
# Testing
img_feature = Image_Encoder(img)
text_feature = Text_Encoder(text)
 
sim_score = cosine_similarity(img_feature, text_feature)

The key steps are encoding images and text, performing the TIR auxiliary task, calculating the losses, and computing similarity during testing. The losses align the features and improve matching. Only the encoder structure is used during testing for efficiency.
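
As a usage illustration (assumed tensor shapes, not the paper's evaluation code), retrieval at test time reduces to ranking gallery images by cosine similarity to the query text feature:

import torch
import torch.nn.functional as F

# Assume features have already been extracted by the dual encoder
gallery_feats = torch.randn(1000, 512)   # (num_gallery_images, feature_dim)
query_feat = torch.randn(1, 512)         # a single text query

gallery_feats = F.normalize(gallery_feats, dim=-1)
query_feat = F.normalize(query_feat, dim=-1)

scores = query_feat @ gallery_feats.t()              # cosine similarities, shape (1, 1000)
ranking = scores.squeeze(0).argsort(descending=True) # gallery indices, best match first
top10 = ranking[:10]                                 # candidates returned to the user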

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the method presented in this paper:

# Image Encoder
def CLIP_ViT(img_patches):
  patches = patch_embedding(img_patches)        # linear projection of image patches
  patches = prepend_cls_token(patches)          # add [CLS] token
  patches = patches + positional_embeddings
  for layer in range(L_v):                      # L_v pre-norm transformer layers
    patches = patches + MultiHeadAttention(LayerNorm(patches))
    patches = patches + MLP(LayerNorm(patches))

  return patches

# Text Encoder
def CLIP_Transformer(text_tokens):
  tokens = token_embedding(text_tokens)         # BPE token embeddings
  tokens = add_sos_eos_tokens(tokens)           # add [SOS] and [EOS] tokens
  tokens = tokens + positional_embeddings
  for layer in range(L_t):                      # L_t pre-norm transformer layers
    tokens = tokens + MultiHeadAttention(LayerNorm(tokens))
    tokens = tokens + MLP(LayerNorm(tokens))

  return tokens
 
# Text-guided Image Restoration
def mask_image(img_patches):
  masked_patches = []
  for patch in img_patches:
    if random() < p_v:                  # mask each patch with probability p_v
      patch = [MASK]                    # replace with a learnable [MASK] token
    masked_patches.append(patch)

  return masked_patches

def Decoder(text_embeddings, masked_img_embeddings):

  # Masked image tokens query the text tokens via cross-attention
  context = MultiHeadCrossAttention(
    queries = masked_img_embeddings,
    keys = text_embeddings,
    values = text_embeddings
  )

  # A shallow stack of N transformer blocks fuses the two modalities
  for layer in range(N):
    context = TransformerBlock(context)

  # Regress the pixel values of the masked patches
  restored = MLP(context)

  return restored
 
# Training
for (imgs, texts), labels in dataloader:

  # Augment data
  imgs_gray = to_grayscale(imgs)              # visual augmentation used for TIR
  texts_attr = keep_attributes(texts)         # randomly keep only key attribute words

  img_feat = CLIP_ViT(imgs)                   # retrieval features from the original images
  text_feat = CLIP_Transformer(texts_attr)

  masked_patches = mask_image(imgs_gray)      # mask the grayscale image for TIR
  masked_feat = CLIP_ViT(masked_patches)

  restored = Decoder(text_feat, masked_feat)

  # Compute losses
  tir_loss = MSE(restored, original_patches)  # MSE between predicted and original patch pixels
  id_loss = IDLoss(img_feat, text_feat, labels)
  sdm_loss = SDMLoss(img_feat, text_feat, labels)
  cmt_loss = CMTLoss(img_feat, text_feat, labels)

  loss = tir_loss + id_loss + sdm_loss + cmt_loss

  update(loss)
 
# Testing
img_feat = CLIP_ViT(img) 
text_feat = CLIP_Transformer(text)
 
sim = cosine_similarity(img_feat, text_feat) 

The key components include the CLIP encoders, TIR module, loss computations, data augmentation, and similarity calculation. The TIR module and loss functions aim to align the features and improve matching. Data augmentation provides useful cues.