Generative Pretraining in Multimodality
Authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
Abstract: We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper presents Emu, a large multimodal foundation model that can generate images and text in a multimodal context.
- Emu is pretrained on diverse multimodal data including image-text pairs, documents with interleaved images/text, and videos with subtitles.
- It uses a unified autoregressive objective to predict the next visual or text token. Visual signals are encoded into embeddings that are interleaved with text tokens.
Why:
- To leverage diverse multimodal data at scale, such as videos, which contain dense visual signals paired with text but have been largely overlooked by prior work.
- To enable unified modeling of vision and language under a single objective, instead of only predicting text tokens; this allows Emu to act as a generalist for diverse tasks.
- To explore the potential of extending generative pretraining beyond language alone to interleaved vision-and-language data at scale.
How:
- Visual signals are encoded into embeddings by EVA-CLIP, then transformed into a 1D sequence via a Causal Transformer.
- Text tokens and visual embeddings form an interleaved sequence that is input to the Transformer LLM backbone.
- The model is trained to classify the next text token or regress the next visual embedding.
- For image generation, the visual embeddings are decoded by a separate diffusion model.
- Emu is evaluated on image/video captioning, VQA, text-to-image generation, etc. An instruction-tuned Emu also serves as an effective multimodal conversational agent.
Main Contributions
Based on my reading of the paper, the main contributions are:
- Proposes Emu, a new large multimodal foundation model architecture that can process interleaved inputs and generate outputs in both image and text modalities.
- Introduces a unified autoregressive pretraining objective that predicts the next visual or text token, enabling joint modeling of different modalities.
- Explores videos paired with subtitles as a new source of dense multimodal pretraining data at scale.
- Conducts pretraining with diverse sources: image-text pairs, interleaved documents, and video-text pairs.
- Achieves strong performance on image/video captioning, VQA, and text-to-image generation benchmarks, demonstrating Emu's capabilities as a generalist interface.
- Shows Emu can be effectively instruction-tuned for conversational tasks such as following human instructions and multi-turn dialog.
- Provides qualitative analysis showing Emu's abilities in areas such as real-world knowledge grounding, video understanding, image blending, and in-context generation.
In summary, the key contributions are proposing the Emu model itself, the unified pretraining approach, using videos for pretraining, and benchmarking Emu’s capabilities on a diverse set of vision-language tasks.
Method Section
Here is a summary of the method section from the paper:
Architecture:
- Emu consists of four components: a Visual Encoder, a Causal Transformer, a Multimodal Modeling module, and a Visual Decoder.
- The Visual Encoder is initialized with EVA-CLIP and encodes images into embeddings.
- The Causal Transformer transforms the 2D image embeddings into a 1D sequence of visual tokens (see the sketch after this list).
- Multimodal Modeling is a Transformer LLM, initialized with LLaMA, that models the interleaved sequence of visual and text tokens.
- The Visual Decoder uses a diffusion model to decode the visual tokens into images.
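The summary does not spell out the Causal Transformer's internals, so below is a minimal sketch of one plausible design, assuming a fixed set of learnable query embeddings that attend to each other causally and cross-attend to the image features. Layer counts, dimensions, and the omission of layer norms are simplifications, not the paper's exact configuration.

import torch
import torch.nn as nn

class CausalTransformerSketch(nn.Module):
    """Illustrative only: maps 2D image features to a fixed-length 1D visual token sequence."""
    def __init__(self, num_queries=32, dim=1024, num_layers=4, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable query tokens
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)])
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_layers)])

    def forward(self, image_feats):  # image_feats: (B, num_patches, dim) from the visual encoder
        B, N = image_feats.size(0), self.queries.size(0)
        x = self.queries.unsqueeze(0).expand(B, -1, -1)
        causal_mask = torch.triu(
            torch.full((N, N), float("-inf"), device=x.device), diagonal=1)
        for sa, ca, ff in zip(self.self_attn, self.cross_attn, self.ffn):
            x = x + sa(x, x, x, attn_mask=causal_mask)[0]   # causal self-attention over the queries
            x = x + ca(x, image_feats, image_feats)[0]      # cross-attention to image features
            x = x + ff(x)
        return x  # (B, num_queries, dim): a 1D sequence of visual embeddings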
Training Objective:
- Emu is pretrained on diverse unlabeled multimodal sequences with interleaved images, text, and videos.
- The training objective is to predict the next element in the sequence, whether visual or textual.
- For text tokens, a cross-entropy loss is used; for visual embeddings, an L2 regression loss is used (a loss sketch follows this list).
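A minimal sketch of this unified objective, assuming the backbone exposes per-position token logits, regressed visual embeddings, and boolean masks marking which positions are supervised as text versus visual (all names and shapes below are illustrative, not the paper's code):

import torch.nn.functional as F

def unified_next_element_loss(logits, text_targets, pred_visual, target_visual,
                              is_text_pos, is_visual_pos):
    # logits:       (B, T, vocab)  next-token distribution at every position
    # pred_visual:  (B, T, D)      regressed visual embedding at every position
    # is_text_pos / is_visual_pos: (B, T) boolean masks of supervised positions
    text_loss = F.cross_entropy(logits[is_text_pos], text_targets[is_text_pos])
    visual_loss = F.mse_loss(pred_visual[is_visual_pos], target_visual[is_visual_pos])
    return text_loss + visual_loss  # any relative weighting is omitted here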
Pretraining Data:
- Uses image-text pairs, documents with interleaved images and text, video-text pairs, and video storyboards with subtitles (a packing sketch follows this list).
- Newly collects YT-Storyboard-1B, a large dataset of YouTube videos with corresponding storyboard thumbnails and subtitles.
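As a rough illustration of how a storyboard could be turned into one interleaved sequence (the helper names below are hypothetical, not from the paper), frames and their subtitles are kept in chronological order:

def pack_storyboard(frames, subtitles, tokenizer, visual_encoder, causal_transformer):
    """Illustrative: interleave storyboard frames with their subtitles in chronological order."""
    sequence = []
    for frame, subtitle in zip(frames, subtitles):
        visual_embeds = causal_transformer(visual_encoder(frame))  # fixed-length visual tokens
        sequence.append(("image", visual_embeds))                  # an image segment
        sequence.append(("text", tokenizer.encode(subtitle)))      # the matching subtitle tokens
    return sequence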
Pretraining Details:
- Trained on 128 GPUs for 10K steps, covering over 150B tokens.
- The batch size varies from 16 to 128 depending on the dataset.
- Learning rates differ for the Visual Encoder, the Multimodal Modeling LLM, and the Causal Transformer (see the sketch after this list).
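Per-module learning rates can be expressed as optimizer parameter groups; a sketch is below, with placeholder values rather than the paper's actual settings:

import torch

optimizer = torch.optim.AdamW([
    {"params": visual_encoder.parameters(),     "lr": 1e-5},  # placeholder value
    {"params": modeling_llm.parameters(),       "lr": 1e-5},  # placeholder value
    {"params": causal_transformer.parameters(), "lr": 1e-4},  # placeholder value
])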
Visual Decoding:
- Finetunes a diffusion model on image-text pairs to decode visual embeddings into images (sketched below).
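A hedged sketch of one way this finetuning step could look, assuming a latent-diffusion setup in which the visual embeddings replace the usual text-encoder output as the cross-attention condition (vae, unet, and scheduler are simplified stand-ins, not a specific library's API):

import torch
import torch.nn.functional as F

def visual_decoding_step(image, visual_embeds, vae, unet, scheduler, num_train_timesteps=1000):
    latents = vae.encode(image)                              # clean image latents
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_train_timesteps, (latents.size(0),), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)   # forward diffusion
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=visual_embeds)
    return F.mse_loss(noise_pred, noise)                     # standard noise-prediction loss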
High-Level Pseudo Code
Here is the high-level pseudo code for the Emu model and training approach:
# Architecture
visual_encoder = EVA_CLIP()               # pretrained image encoder
causal_transformer = CausalTransformer()  # maps 2D image features to a 1D visual token sequence
modeling_llm = LLaMA()                    # multimodal Transformer backbone
visual_decoder = DiffusionModel()         # decodes visual embeddings into images

# Pretraining (optimizer setup omitted)
for x in diverse_multimodal_datasets:
    # Encode images/videos into embeddings
    if is_image(x):
        embeddings = visual_encoder(x)
    elif is_video(x):
        embeddings = [visual_encoder(frame) for frame in x]

    # Get 1D visual tokens
    visual_tokens = causal_transformer(embeddings)

    # Interleave with text tokens
    tokens = interleave(visual_tokens, get_text_tokens(x))

    # Predict the next element: cross-entropy for text tokens, L2 regression for visual embeddings
    loss = autoregressive_predict_next(modeling_llm, tokens)

    # Optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Finetune the diffusion model as the visual decoder
for img, text in image_text_pairs:
    # Get visual embeddings from the LLM
    visual_tokens = modeling_llm(text)

    # Decode into an image and compute an image-generation loss (e.g., a denoising objective)
    img_pred = visual_decoder(visual_tokens)
    loss = image_generation_loss(img_pred, img)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
The key aspects are encoding visual data into a common embedding format, interleaving it with text tokens, training with a unified next-element prediction objective, and finetuning the visual decoder separately.
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the Emu model and training approach:
import torch.nn as nn

# Model architecture
class VisualEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize with EVA-CLIP weights

    def forward(self, x):
        return encode_image(x)  # Returns 2D image embeddings

class CausalTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable queries plus self-attention and cross-attention layers

    def forward(self, embeddings):
        # Causal self-attention over queries, cross-attention to the image embeddings
        return transform_to_1D(embeddings)

class ModelingLLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize with LLaMA weights

    def forward(self, x):
        return autoregressive_predict(x)  # Next text-token / visual-embedding prediction

class VisualDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize with pretrained diffusion-model weights

    def forward(self, embeddings):
        return decode_to_image(embeddings)

# Pretraining (optimizer setup omitted)
visual_encoder = VisualEncoder()
causal_transformer = CausalTransformer()
modeling_llm = ModelingLLM()

for x in diverse_multimodal_data:
    if is_image(x):
        visual_embeddings = visual_encoder(x)
    elif is_video(x):
        visual_embeddings = [visual_encoder(frame) for frame in x]

    visual_tokens = causal_transformer(visual_embeddings)
    tokens = interleave(visual_tokens, get_text_tokens(x))

    loss = modeling_llm(tokens)  # Cross-entropy on text tokens + L2 regression on visual embeddings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Visual decoding
visual_decoder = VisualDecoder()

for img, text in image_text_pairs:
    visual_tokens = modeling_llm.generate(text)    # Visual embeddings regressed by the LLM
    img_pred = visual_decoder(visual_tokens)
    loss = image_generation_loss(img_pred, img)    # e.g., a diffusion denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
The key components are the four modules, the pretraining loop with next-element prediction, and the separate finetuning of the visual decoder. Additional details such as attention mechanisms and optimizer configuration can be added.
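For completeness, a hedged inference-time sketch of in-context image generation (predict_next_embedding and the fixed visual-token count are illustrative assumptions, not the paper's API): the LLM regresses a fixed number of visual embeddings from the multimodal prompt, which the diffusion-based decoder then renders.

def generate_image(prompt_tokens, modeling_llm, visual_decoder, num_visual_tokens=32):
    visual_embeds, context = [], list(prompt_tokens)
    for _ in range(num_visual_tokens):                            # regress embeddings one by one
        next_embed = modeling_llm.predict_next_embedding(context)  # hypothetical helper
        visual_embeds.append(next_embed)
        context.append(next_embed)
    return visual_decoder(visual_embeds)                          # decode embeddings into pixels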