OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Authors: Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh
Abstract: Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks that require reasoning over one or multiple images to generate a text. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELISC dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset’s content. To show the viability of OBELISC, we train an 80 billion parameters vision and language model on the dataset and obtain competitive performance on various multimodal benchmarks. We release the code to reproduce the dataset along with the dataset itself.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper introduces OBELISC, a new large-scale open dataset of 141 million multimodal web documents containing 353 million images and 115 billion tokens.
- The dataset is extracted from Common Crawl snapshots and goes through extensive filtering and curation. It interleaves images and text paragraphs as they appear on web pages.
Why:
- The multimodal document datasets used to train state-of-the-art models are not publicly available. OBELISC aims to provide an open alternative.
- Models trained on interleaved documents outperform models trained on image-text pairs, yet no comparable open, large-scale document dataset exists. OBELISC fills this gap.
- The authors show that OBELISC enables training competitive multimodal models.
How:
- The authors collect and filter a massive number of web documents from Common Crawl.
- They simplify the HTML and extract texts and images while preserving the original document structure.
- Extensive filtering of images, paragraphs, and documents is applied to remove low-quality content.
- Responsible filtering steps are taken, such as removing opted-out images.
- The viability of OBELISC is demonstrated by training strong multimodal models on it.
In summary, OBELISC is a new large-scale open dataset of multimodal web documents introduced to enable training competitive multimodal AI systems in an open and replicable way. Its creation process and usefulness are rigorously described and analyzed.
Main Contributions
Here are the main contributions of the paper:
- Introduction of OBELISC, a new large-scale open dataset of 141 million multimodal web documents containing 353 million images and 115 billion tokens. This is one of the largest open datasets of web-extracted multimodal content.
- Description of the methodology to collect, filter, and curate this dataset from Common Crawl snapshots. Details are provided on strategies like HTML simplification, image and text filtering, and responsible filtering steps.
- Analysis of the dataset's composition and properties through general statistics, perplexity analysis, topic modeling, and qualitative assessment. This sheds light on the diversity and quality of OBELISC.
- Demonstration of the viability of OBELISC for training multimodal models by training competitive 9B and 80B parameter models on this dataset. Results are shown on various multimodal benchmarks.
- Release of the dataset and code under a CC-BY license to support further research. The authors outline proper terms of use for working with this dataset.
In summary, the main contributions are the introduction and rigorous analysis of the OBELISC dataset, the methodology to create it, benchmark results validating its usefulness, and its public release to advance multimodal AI research.
Method Section
Here is a summary of the method section from the paper:
Data Collection:
- Extracted the text and HTML sources of web pages from 25 recent Common Crawl dumps (Feb 2020 - Feb 2023).
- Filtered for English using a fastText language classifier (a sketch of this step follows the list).
- Deduplicated using MinHash and removed documents with excessive duplicated n-grams.
- Scored text quality with a logistic regression classifier.
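The paper filters for English with a fastText classifier; below is a minimal sketch of what that step could look like, assuming the publicly available lid.176.bin language-identification model and an illustrative 0.5 threshold (the paper's exact threshold may differ). The html_docs variable mirrors the naming used in the pseudo code later in this summary.

# Hedged sketch: English filtering with fastText (model path and threshold are illustrative)
import fasttext

lang_model = fasttext.load_model("lid.176.bin")  # fastText's public language-ID model

def is_english(text, threshold=0.5):
    # fastText predicts on a single line, so newlines must be stripped first
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] > threshold

english_docs = [(html, text) for html, text in html_docs if is_english(text)]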
HTML Simplification:
- Preprocessed the HTML into simplified HTML using rules for tag unwrapping, tag removal, and tag modification (see the sketch below).
- This resulted in much smaller files focused on the main textual and visual content.
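A rough illustration of this kind of DOM simplification, using BeautifulSoup; the tag lists below are placeholders rather than the paper's exact rules.

# Hedged sketch: simplify a DOM tree by unwrapping formatting tags and dropping non-content tags
from bs4 import BeautifulSoup

UNWRAP_TAGS = ["span", "b", "i", "em", "strong"]               # keep their text, drop the tag
REMOVE_TAGS = ["script", "style", "nav", "footer", "iframe"]   # drop the tag and its content

def simplify_dom_tree(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(UNWRAP_TAGS):
        tag.unwrap()
    for tag in soup.find_all(REMOVE_TAGS):
        tag.decompose()
    return soup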
Multimodal Extraction:
- Extracted texts and image links from the simplified HTML while preserving the original structure (an order-preserving extraction sketch follows this list).
- Downloaded ~2 billion images using distributed computing.
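One way to extract texts and image links while keeping their original interleaving is to walk the simplified tree in document order. This sketch assumes a BeautifulSoup tree like the one produced in the previous sketch; it is an illustration, not the paper's extraction code.

# Hedged sketch: walk the simplified tree in document order so texts and images stay interleaved
from bs4 import BeautifulSoup

def extract_interleaved(soup):
    nodes = []
    for element in soup.descendants:
        if getattr(element, "name", None) == "img" and element.get("src"):
            nodes.append(("image", element["src"]))
        elif isinstance(element, str) and element.strip():
            nodes.append(("text", element.strip()))
    return nodes

# e.g. extract_interleaved(BeautifulSoup(simplified_page_html, "html.parser"))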
Multimodal Filtering:
- Image filtering: removed images that are too small, blurry, or have extreme aspect ratios (sketched after this list).
- Text filtering: removed paragraphs based on metrics such as length, repetition, punctuation ratio, and perplexity.
- Document filtering: removed documents with unsuitable numbers of images or amounts of text.
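The image- and paragraph-level checks could be implemented along these lines. The helper names mirror is_valid_image and is_valid_paragraph from the pseudo code later in this summary, but the thresholds are illustrative placeholders and the paper's full set of heuristics (blur, repetition, perplexity, etc.) is omitted.

# Hedged sketch: simple image and paragraph filters (thresholds are illustrative)
def is_valid_image(image, min_side=150, max_aspect=2.0):
    # image is assumed to be a PIL.Image-like object returned by download_image
    width, height = image.size
    if min(width, height) < min_side:
        return False
    return max(width, height) / min(width, height) <= max_aspect

def is_valid_paragraph(para, min_words=4, max_punct_ratio=0.3):
    words = para.split()
    if len(words) < min_words:
        return False
    punct_ratio = sum(ch in ".,!?;:" for ch in para) / max(len(para), 1)
    return punct_ratio <= max_punct_ratio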
Responsible Filtering:
- Removed opted-out images using the Spawning API.
- Deduplicated images.
- Used an NSFW classifier to remove pornographic images and documents.
- Deduplicated documents based on URL and on their images (a possible deduplication sketch follows this list).
- Deduplicated paragraphs within domain names.
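A possible shape for the URL- and image-based document deduplication, assuming each document record carries a url field and its list of image URLs (fields not shown explicitly in the pseudo code below); this is a sketch, not the paper's exact procedure.

# Hedged sketch: deduplicate documents by page URL and by a fingerprint of their image URLs
import hashlib

def doc_fingerprint(doc):
    # Fingerprint a document by the sorted list of its image URLs
    joined = "|".join(sorted(doc["images"]))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def deduplicate_documents(docs):
    seen_urls, seen_fingerprints, unique_docs = set(), set(), []
    for doc in docs:
        fp = doc_fingerprint(doc)
        if doc["url"] in seen_urls or (doc["images"] and fp in seen_fingerprints):
            continue
        seen_urls.add(doc["url"])
        seen_fingerprints.add(fp)
        unique_docs.append(doc)
    return unique_docs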
This methodology resulted in a filtered dataset of 141M documents, 353M images and 115B tokens.
High-Level Pseudo Code
Here is a high-level pseudo code summary of the key steps in the paper:
# Collect raw HTML documents
html_docs = extract_documents_from_common_crawl()

# Filter for English and text quality
english_docs = filter_non_english(html_docs)
quality_docs = filter_text_quality(english_docs)

# Simplify HTML
simplified_html = simplify_html(quality_docs)

# Extract multimodal documents (interleaved texts and image links)
multimodal_docs = []
for html in simplified_html:
    texts, images = extract_multimodal(html)
    multimodal_docs.append((texts, images))

# Download images
images = download_images(multimodal_docs)

# Filter at the image, paragraph, and document level
filtered_images = filter_images(images)
filtered_docs = filter_paragraphs(multimodal_docs)
filtered_docs = filter_documents(filtered_docs)

# Responsible filtering and deduplication
filtered_docs = filter_opted_out_images(filtered_docs)
filtered_docs = deduplicate_images(filtered_docs)
filtered_docs = filter_nsfw(filtered_docs)
filtered_docs = deduplicate_documents(filtered_docs)
filtered_docs = deduplicate_paragraphs(filtered_docs)

# Final filtered dataset
obelisc_dataset = (filtered_docs, filtered_images)
In summary, the key steps are collecting raw HTML documents, simplifying and filtering the HTML, extracting multimodal documents, downloading images, applying filtering at the image/paragraph/document level, and performing responsible filtering including deduplication. This results in the final filtered OBELISC dataset.
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the key steps in the paper:
# Collect raw HTML documents
def extract_documents_from_common_crawl():
    # Download Common Crawl snapshots
    snapshots = download_common_crawl_dumps()
    # Extract text from each HTML page and keep pages with usable text
    html_docs = []
    for snapshot in snapshots:
        for html in snapshot:
            text = extract_text(html)
            if text is not None:
                html_docs.append((html, text))
    return html_docs
# Filter for English and text quality
def filter_non_english(html_docs):
    english_docs = []
    for html, text in html_docs:
        if predict_english(text) > 0.5:
            english_docs.append((html, text))
    return english_docs

def filter_text_quality(english_docs):
    quality_docs = []
    for html, text in english_docs:
        if classifier_probability(text) > 0.5:
            quality_docs.append((html, text))
    return quality_docs
# Simplify HTML
def simplify_html(html_docs):
    simplified_html = []
    for html, text in html_docs:
        simplified = simplify_dom_tree(html)
        simplified_html.append(simplified)
    return simplified_html
# Extract multimodal content from a simplified page
def extract_multimodal(simplified_html):
    texts, images = [], []
    # Walk the DOM in document order so the interleaving of texts and images is preserved
    for node in simplified_html.iter():
        if node.tag == 'img' and node.get('src') is not None:
            images.append(node.get('src'))
        elif node.text and node.text.strip():
            texts.append(node.text.strip())
    return texts, images
# Download images
def download_images(multimodal_docs):
    images = []
    for doc in multimodal_docs:
        for url in doc["images"]:
            image = download_image(url)
            if image is not None:
                images.append(image)
    return images
# Filter images, paragraphs, documents
def filter_images(images):
    return [image for image in images if is_valid_image(image)]

def filter_paragraphs(multimodal_docs):
    # Keep only valid paragraphs inside each document
    filtered_docs = []
    for doc in multimodal_docs:
        doc["texts"] = [para for para in doc["texts"] if is_valid_paragraph(para)]
        filtered_docs.append(doc)
    return filtered_docs

def filter_documents(multimodal_docs):
    return [doc for doc in multimodal_docs if is_valid_document(doc)]
# Responsible filtering
def filter_opted_out_images(docs):
    # Drop image URLs whose creators have opted out (via the Spawning API)
    filtered_docs = []
    for doc in docs:
        opted_out = get_opted_out_images(doc["images"])
        doc["images"] = [img for img in doc["images"] if img not in opted_out]
        filtered_docs.append(doc)
    return filtered_docs

# Remaining responsible-filtering steps (bodies omitted)
def deduplicate_images(docs):
    ...  # deduplicate images across documents

def filter_nsfw(docs):
    ...  # remove documents flagged by an NSFW image classifier

def deduplicate_documents(docs):
    ...  # deduplicate documents based on URL and on their images

def deduplicate_paragraphs(docs):
    ...  # deduplicate paragraphs within each domain name
# Assemble the full dataset
def create_obelisc_dataset(html_docs):
    simplified = simplify_html(html_docs)
    # Build one multimodal document per simplified page
    multimodal = []
    for page in simplified:
        texts, images = extract_multimodal(page)
        multimodal.append({"texts": texts, "images": images})
    downloaded = download_images(multimodal)
    filtered_images = filter_images(downloaded)
    filtered_docs = filter_paragraphs(multimodal)
    filtered_docs = filter_documents(filtered_docs)
    # Responsible filtering: opt-outs, NSFW removal, deduplication
    filtered_docs = filter_opted_out_images(filtered_docs)
    filtered_docs = deduplicate_images(filtered_docs)
    filtered_docs = filter_nsfw(filtered_docs)
    filtered_docs = deduplicate_documents(filtered_docs)
    filtered_docs = deduplicate_paragraphs(filtered_docs)
    return filtered_docs, filtered_images
This is a more detailed pseudo code overview of the key steps to implement the OBELISC dataset creation process described in the paper. The main modules include collecting documents, simplifying HTML, extracting multimodal content, downloading images, applying filtering, and responsible filtering including deduplication.