Improving Multimodal Datasets with Image Captioning
Authors: Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt
Abstract: Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp’s large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity.
What, Why and How
Here is a summary of the key points from this paper:
- What: The paper explores using synthetic (AI-generated) image captions to improve the quality of the text used to pre-train large multimodal models like CLIP. Specifically, it studies replacing noisy or uninformative raw web-scraped captions with higher-quality captions generated by image captioning models.
- Why: Raw web data contains significant noise, and existing filtering methods often come at the expense of diversity. Synthetic captions can restore utility to datapoints with poor raw captions. The paper aims to understand the trade-offs between raw and synthetic captions and to find better ways of combining the two sources.
- How: Using the DataComp benchmark, the authors experiment with different mixing strategies on raw Common Crawl data. They find that generated captions contain more visual information and are better aligned with their images. Combining signals from both raw and synthetic captions gives the best results, outperforming filtering baselines that rely solely on raw captions. The benefits hold across model scales, especially for retrieval tasks. Analyses also reveal that optimizing captioning models for metrics like CIDEr does not guarantee better captions for CLIP training.
In summary, the paper demonstrates that synthetic captions can effectively improve multimodal training data sourced from the web. The results have implications for scaling up vision-language pre-training by unlocking more useful training examples through better text supervision from AI-generated captions.
Main Contributions
Here are the key contributions of this paper:
- Shows that synthetic captions generated by image captioning models can improve the quality of noisy web data for pre-training multimodal models like CLIP.
- Demonstrates consistent benefits of mixing raw and synthetic captions across different data scales, especially for retrieval tasks, outperforming competitive baselines that rely solely on raw captions.
- Finds that fine-tuning captioning models on COCO to optimize CIDEr often hurts the utility of the generated captions for downstream CLIP training, and suggests using reference-free metrics like CLIP-S instead (a sketch of computing CLIP-S appears at the end of this section).
- Provides analysis comparing properties of raw versus synthetic captions, showing that the latter contain more visual information but are less diverse than web text.
- Shows the importance of both caption quality and image quality when scaling up the training set, and notes potential diversity limitations of synthetic text at very large scale.
- Identifies the best filtering and mixing strategies for raw and synthetic captions at the small, medium, and large pool sizes; the best strategy differs across scales.
- Discusses implications for future work, including improving caption diversity, combining multiple captioning models, and generating synthetic images to boost diversity.
In summary, the key contributions are around demonstrating the effectiveness of synthetic captions for improving web-scale multimodal datasets, analyzing the trade-offs between raw and synthetic data, and providing insights to guide future work on dataset creation.
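As a concrete reference for the CLIP-S metric mentioned above, below is a minimal sketch of computing a CLIPScore-style value: a scaled cosine similarity between the image and caption embeddings of a pretrained CLIP model. It uses the Hugging Face transformers CLIP API; the checkpoint name and the example image path are placeholders, not details taken from the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP-S is roughly 2.5 * max(cosine similarity, 0) between image and caption embeddings.
# Checkpoint and image path below are placeholders, not the paper's setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_s(image, caption):
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return 2.5 * max((img @ txt.T).item(), 0.0)

print(clip_s(Image.open("example.jpg"), "a dog catching a frisbee on the beach"))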
Method Section
Here is a summary of the method section from the paper:
Data:
- Uses the CommonPool dataset from DataComp, which contains web-scraped image-text pairs from Common Crawl with minimal preprocessing. Experiments with small (12.8M), medium (128M) and large (1.28B) candidate pools.
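DataComp distributes CommonPool as webdataset-style tar shards of image-text pairs. Below is a minimal sketch of streaming such pairs with the webdataset library; the shard path pattern and the jpg/txt field names are assumptions about the local layout, not details from the paper.

import webdataset as wds

# Stream (image, raw caption) pairs from webdataset shards.
# Shard naming and field keys are assumed; adjust to the local download layout.
shards = "commonpool-medium/{00000..01287}.tar"
dataset = (
    wds.WebDataset(shards)
    .decode("pil")                # decode images to PIL
    .to_tuple("jpg;png", "txt")   # (image, raw caption)
)

for image, caption in dataset:
    print(caption)
    break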
Captioning Models:
- Experiments with BLIP, BLIP2, and OpenCLIP-CoCa to generate synthetic captions. Compares using pre-trained models versus ones fine-tuned on COCO.
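For illustration, here is a minimal sketch of generating one synthetic caption with BLIP-2 through the Hugging Face transformers API. The checkpoint, decoding settings, and image path are placeholders; the paper's exact captioning setup may differ.

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Placeholder checkpoint and image path; decoding settings are illustrative.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)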
Training:
- Generates captions for images using the different models. Creates datasets mixing raw and synthetic captions.
- Trains CLIP models from scratch on these datasets. Fixes hyperparameters, compute budget, and architecture based on DataComp guidelines for fair comparison.
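The training objective itself is the standard CLIP contrastive loss rather than anything specific to this paper; a minimal PyTorch sketch is below, with random tensors standing in for the encoder outputs.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Symmetric cross-entropy over the cosine similarity matrix,
    # where matching image-text pairs along the diagonal are the targets.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = logit_scale * image_embeds @ text_embeds.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Example with random embeddings standing in for encoder outputs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale=100.0)
print(loss.item())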
Evaluation:
- Evaluates using DataComp’s zero-shot transfer tasks, including ImageNet classification and retrieval on COCO and Flickr. Focuses on ImageNet accuracy and average accuracy across 38 tasks.
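Zero-shot classification matches an image embedding against embeddings of text prompts built from the class names. A minimal sketch with open_clip follows; the checkpoint, class names, and image path are placeholders rather than the paper's evaluation harness.

import torch
import open_clip
from PIL import Image

# Placeholder checkpoint, class names, and image path.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

class_names = ["dog", "cat", "car"]
text = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
    text_features = torch.nn.functional.normalize(model.encode_text(text), dim=-1)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])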
Key Experiments:
- Compares utility of captions from different models by training CLIP on them.
- Tries different strategies to filter and mix raw and synthetic captions.
- Analyzes caption properties such as visual grounding and text diversity (rough proxies for these measurements are sketched after this list).
- Studies impact of synthetic captions across different pool sizes ranging from 12.8M to 1.28B images.
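As a rough illustration of the caption analyses, the sketch below computes two simple proxies with NLTK: the fraction of captions containing at least one noun (a stand-in for visual grounding) and the number of unique trigrams across the caption pool (a stand-in for diversity). These are illustrative proxies, not the exact metrics used in the paper.

import nltk
from nltk.util import ngrams

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def grounding_and_diversity(captions):
    # Count captions that mention at least one noun, and collect unique trigrams.
    has_noun = 0
    trigrams = set()
    for cap in captions:
        tokens = nltk.word_tokenize(cap.lower())
        tags = nltk.pos_tag(tokens)
        if any(tag.startswith("NN") for _, tag in tags):
            has_noun += 1
        trigrams.update(ngrams(tokens, 3))
    return has_noun / len(captions), len(trigrams)

raw = ["IMG_1234.JPG", "best price buy now"]
synthetic = ["a dog catching a frisbee on the beach", "a red car parked on a street"]
print(grounding_and_diversity(raw))
print(grounding_and_diversity(synthetic))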
In summary, the method leverages DataComp’s CommonPool dataset and framework to conduct controlled experiments mixing raw and synthetic captions for pre-training CLIP models. The models and captions are then evaluated on a comprehensive set of vision tasks.
High-Level Pseudo Code
Here is the high-level pseudo code for the experiments in this paper:
# Load CommonPool dataset
data = CommonPool(scale='medium')
raw_captions = data.captions

# Generate synthetic captions with each captioning model
caption_models = [BLIP, BLIP2, OpenCLIP_CoCa, ...]
synthetic_captions = {}
for model in caption_models:
    synthetic_captions[model] = generate_captions(model, data.images)

# Create datasets that filter and mix raw and synthetic captions
mixed_datasets = []
for strategy in [no_filtering, clip_score_filtering, ...]:
    mixed_data = mix_captions(raw_captions, synthetic_captions, strategy)
    mixed_datasets.append(mixed_data)

# Train one CLIP model per dataset, with architecture, hyperparameters,
# and compute budget fixed by the DataComp guidelines
trained_clips = []
for dataset in mixed_datasets:
    trained_clips.append(train_clip(dataset))

# Evaluate every trained model on the DataComp task suite
tasks = [imagenet, coco, flickr, ...]
results = {}
for clip_model in trained_clips:
    results[clip_model] = evaluate(clip_model, tasks)

# Analyze results
print_tables_and_plots(results)
In summary, the key steps are:
- Load raw web data
- Generate synthetic captions with different models
- Create datasets mixing raw and synthetic captions using different strategies
- Train CLIP models with fixed settings on these datasets
- Evaluate on vision tasks like ImageNet, COCO, Flickr
- Analyze results across methods and scales
Detailed Pseudo Code
Here is a more detailed pseudo code outline to implement the key experiments in this paper:
import numpy as np
import clip, captioning_models, datasets  # hypothetical helper modules, not real packages

# Load DataComp CommonPool (medium scale = 128M candidate image-text pairs)
commonpool = datasets.DataComp(scale='medium')
raw_images = commonpool.images
raw_captions = commonpool.captions

# Generate synthetic captions
blip2 = captioning_models.BLIP2()
blip2_captions = blip2.generate(raw_images)

coca = captioning_models.OpenCLIPCoCa()
coca_captions = coca.generate(raw_images)

# Filtering strategy: keep pairs whose image-text CLIP score clears a threshold.
# The paper's filtering baselines keep roughly the top 30% of pairs by CLIP score;
# 0.3 below is just an illustrative cutoff.
def clip_score_filter(captions, images, threshold):
    # Cosine similarity between image and caption embeddings
    scores = clip.score(images, captions)
    filtered_idx = np.where(scores > threshold)[0]
    return images[filtered_idx], captions[filtered_idx]

# Mixing strategy: simple concatenation of two filtered subsets
def mix_captions(subset_a, subset_b):
    images_a, captions_a = subset_a
    images_b, captions_b = subset_b
    mixed_images = np.concatenate((images_a, images_b))
    mixed_captions = np.concatenate((captions_a, captions_b))
    return mixed_images, mixed_captions

# Create mixed datasets: filtered raw captions plus filtered BLIP2 captions
raw_filtered = clip_score_filter(raw_captions, raw_images, threshold=0.3)
blip2_filtered = clip_score_filter(blip2_captions, raw_images, threshold=0.3)
mixed_blip2 = mix_captions(raw_filtered, blip2_filtered)

# Train CLIP from scratch with fixed settings (ViT-B/32, ~128M samples seen at medium scale)
clip_raw = clip.CLIP(backbone='ViT-B/32').fit(raw_filtered, samples_seen=128e6)
clip_blip2 = clip.CLIP(backbone='ViT-B/32').fit(mixed_blip2, samples_seen=128e6)

# Evaluate on the DataComp task suite
tasks = datasets.DataCompTasks()
raw_acc = tasks.evaluate(clip_raw)
blip2_acc = tasks.evaluate(clip_blip2)
print(raw_acc)
print(blip2_acc)
This implements the key steps: generating synthetic captions, filtering and mixing them with raw captions, training CLIP models, and evaluating on the DataComp tasks. Additional loops can be added to iterate over different captioning models, mixing strategies, and pool scales.