Meta-Transformer: A Unified Framework for Multimodal Learning

Authors: Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

Abstract: Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities (natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Meta-Transformer comprises three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks. It is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks including fundamental perception (text, image, point cloud, audio, video), practical applications (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at MetaTransformer.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper proposes a framework called Meta-Transformer for unified multimodal learning. It can process and relate information from multiple modalities (text, images, audio, point clouds, etc.) using the same model architecture and parameters.

Why:

  • Existing multimodal models focus mainly on vision and language tasks, and the inherent gaps between modalities limit their ability to handle more diverse data types.
  • A unified model architecture that works across modalities can help reduce the modality gap and avoid the need for paired training data.

How:

  • The Meta-Transformer has three components: a modality-specific tokenizer, a shared encoder, and task-specific heads.

  • The tokenizer converts raw inputs into token sequences in a shared embedding space.

  • The encoder with frozen pretrained parameters extracts high-level features from the token sequences.

  • Lightweight task-specific heads are then trained for downstream tasks.

  • This allows task-specific representations to be learned from a unified architecture without paired multimodal data.

  • The model is pretrained on image data and evaluated on 12 modalities - text, image, point clouds, audio, video, infrared, hyperspectral, X-ray, IMU, tabular, graph, time-series.

  • It shows strong performance on tasks such as classification, detection, and segmentation across these modalities, validating its capability for unified multimodal learning.

Main Contributions

Here are the key contributions of this paper:

  1. Proposes Meta-Transformer, the first framework to enable a unified encoder to process 12 different modalities using the same parameters. This significantly advances multimodal research.

  2. Provides a comprehensive analysis of transformer components (embeddings, tokenization, and encoders) for encoding diverse modalities, offering insights into the development of unified multimodal frameworks.

  3. Achieves outstanding performance on various tasks across 12 modalities (text, images, point clouds, audio, video, infrared, etc.), validating the potential of Meta-Transformer for unified multimodal learning.

  4. Demonstrates competitive results compared to state-of-the-art methods on each modality with fewer trainable parameters, highlighting the efficiency of Meta-Transformer.

  5. Removes the need for paired multimodal training data, enabling more flexible and efficient multimodal learning.

  6. Explores new challenging tasks such as audio-visual segmentation to showcase the benefits of a unified architecture for multimodal tasks.

  7. Provides a baseline for future research into unified models spanning multiple modalities and tasks, sparking a new research direction in this area.

In summary, the key contributions are proposing Meta-Transformer for unified multimodal learning, extensive empirical validation across modalities, and providing analysis and insights to advance multimodal research.

Method Section

Here is a summary of the method section from the paper:

The Meta-Transformer has three main components:

  1. Data-to-Sequence Tokenization
  • Proposes a meta tokenization scheme to transform raw data from different modalities into token sequences in a shared embedding space.

  • For text, uses WordPiece tokenization with a 30K-token vocabulary and maps the resulting tokens to embeddings.

  • For images, reshapes the image into flattened 2D patches, then linearly projects each patch to a token embedding (see the sketch after these tokenizer bullets).

  • For point clouds, uses farthest point sampling (FPS) and k-nearest neighbors (KNN) to sample representative points and group their neighbors, then aggregates features within each neighbor group.

  • For audio, converts the waveform into a log Mel spectrogram, splits it into slices with convolutions, then flattens and projects the slices to token embeddings.
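
As a concrete illustration of the data-to-sequence idea, below is a minimal sketch of an image patch tokenizer in PyTorch. The patch size, embedding dimension, and class name are illustrative assumptions rather than the paper's exact implementation; the other tokenizers follow the same pattern of mapping raw data to a (batch, tokens, dim) sequence.

import torch
import torch.nn as nn

class ImagePatchTokenizer(nn.Module):
    """Minimal sketch: split an image into non-overlapping patches and project
    each patch to a token embedding. Sizes here are illustrative assumptions."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to "reshape into flattened 2D
        # patches, then apply a shared linear projection".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

tokens = ImagePatchTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                          # torch.Size([2, 196, 768])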

  2. Unified Encoder
  • Uses a ViT encoder pretrained on the LAION-2B image dataset with contrastive learning; the pretrained parameters are kept frozen.

  • Augments token embeddings with 1D positional encodings before feeding them to the encoder.

  • The encoder consists of standard Transformer blocks with multi-head self-attention (MSA) and MLP layers.

  • Extracts high-level semantic features from the token sequences of all modalities using the frozen encoder (a minimal sketch follows below).
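
The snippet below is a minimal sketch of the frozen-encoder step. It uses a plain PyTorch nn.TransformerEncoder as a stand-in for the LAION-2B-pretrained ViT (an assumption made only to keep the example self-contained) and shows the freezing and 1D positional-embedding mechanics.

import torch
import torch.nn as nn

embed_dim, max_len = 768, 1024            # illustrative sizes
# Stand-in for the pretrained ViT encoder; in Meta-Transformer the weights
# come from contrastive pretraining on LAION-2B and are never updated.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12,
                               dim_feedforward=4 * embed_dim, batch_first=True),
    num_layers=12,
)
for param in encoder.parameters():        # freeze the shared encoder
    param.requires_grad = False

pos_embed = nn.Parameter(torch.zeros(1, max_len, embed_dim))  # learnable 1D positions

def encode(tokens):                       # tokens: (B, N, D) from any tokenizer
    n = tokens.size(1)
    return encoder(tokens + pos_embed[:, :n])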

  3. Task-Specific Heads
  • Lightweight MLP heads are adapted to downstream tasks through training.

  • Their design varies with the modality and task (classification, regression, etc.).

  • Updated together with the tokenizer to learn task-specific representations (a sketch of such a head follows below).
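
A lightweight head might look like the following sketch for a classification task; the mean-pooling strategy, layer sizes, and number of classes are assumptions for illustration, not the paper's exact design.

import torch.nn as nn

class MLPHead(nn.Module):
    """Lightweight task-specific head: pool the token sequence, then classify.
    Layer sizes and the mean-pooling choice are illustrative assumptions."""
    def __init__(self, embed_dim=768, hidden_dim=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, tokens):            # tokens: (B, N, D) encoder output
        return self.net(tokens.mean(dim=1))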

So in summary, it uses modality-specific tokenizers, a shared pretrained encoder, and adaptable task heads to enable unified multimodal learning from different data types.

High-Level Pseudo Code

Here is the high-level pseudo code for the Meta-Transformer:

# Modality-specific tokenizers 
text_tokens = text_tokenizer(text_data)
image_tokens = image_tokenizer(image_data) 
pointcloud_tokens = pointcloud_tokenizer(pointcloud_data)
audio_tokens = audio_tokenizer(audio_data)
 
# Shared encoder with frozen params
encoder = PretrainedTransformerEncoder()
for param in encoder.parameters():  # freeze all pretrained encoder parameters
    param.requires_grad = False
 
text_embeddings = encoder(text_tokens) 
image_embeddings = encoder(image_tokens)
pointcloud_embeddings = encoder(pointcloud_tokens) 
audio_embeddings = encoder(audio_tokens)
 
# Task-specific heads (lightweight, trainable)
text_head = MLPHead(embed_dim, num_text_classes)
image_head = MLPHead(embed_dim, num_image_classes)
pointcloud_head = MLPHead(embed_dim, num_pointcloud_classes)
audio_head = MLPHead(embed_dim, num_audio_classes)
 
# Optimization: only tokenizers and task heads receive gradient updates
loss = loss_fn(text_head(text_embeddings), text_labels) \
     + loss_fn(image_head(image_embeddings), image_labels) + ...
optimizer.zero_grad()
loss.backward()
optimizer.step()  # updates tokenizers and task heads; the encoder stays frozen

In summary, it tokenizes different modalities separately, encodes them using a shared frozen encoder, and adds lightweight adaptable heads for downstream tasks. The tokenizer and task heads are updated during training while the encoder parameters remain fixed.
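
One way to realize "update the tokenizer and task heads while the encoder stays frozen" is to hand only their parameters to the optimizer. The snippet below assumes the tokenizer and head are nn.Modules (for example the ImagePatchTokenizer and MLPHead sketched earlier); the optimizer choice and learning rate are illustrative.

import itertools
import torch

# Only tokenizer and head parameters are trainable; the frozen encoder
# contributes no parameters to the optimizer.
trainable_params = itertools.chain(image_tokenizer.parameters(),
                                   image_head.parameters())
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

assert all(not p.requires_grad for p in encoder.parameters())  # sanity check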

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the Meta-Transformer framework:

# Modality-specific tokenizers
def text_tokenizer(text):
  tokens = wordpiece_tokenize(text) 
  embeddings = word_embedding(tokens)
  return embeddings
 
def image_tokenizer(image):
  patches = reshape_to_patches(image)
  embeddings = linear_project(patches) 
  return embeddings
  
def pointcloud_tokenizer(pointcloud):
  centers = fps(pointcloud)            # farthest point sampling of representative points
  groups = knn(centers, pointcloud)    # group neighboring points around each center
  features = aggregate_features(groups)
  embeddings = linear_project(features)
  return embeddings
  
def audio_tokenizer(audio):
  spectrogram = log_mel_fb(audio)      # waveform -> log Mel spectrogram
  patches = conv_2d(spectrogram)       # split the spectrogram into patches
  embeddings = linear_project(flatten(patches))
  return embeddings
  
 
# Shared encoder
class Encoder(nn.Module):
  def __init__(self):
    super().__init__()
    self.transformer = Transformer(...)
    
  def forward(self, x):
    x = self.transformer(x) 
    return x
  
encoder = Encoder()
encoder.load_pretrained(...)           # initialize from pretrained weights
for param in encoder.parameters():     # freeze all encoder parameters
  param.requires_grad = False
 
# Task heads
class MLPHead(nn.Module):
  ...
  
text_head = MLPHead(...) 
image_head = MLPHead(...)
...
 
# Training: each task can also be trained separately; two modalities are
# iterated jointly here for illustration.
for (x_text, y_text), (x_image, y_image) in zip(text_dataloader, image_dataloader):
    
    text_tokens = text_tokenizer(x_text)
    image_tokens = image_tokenizer(x_image)
    
    text_embeddings = encoder(text_tokens) 
    image_embeddings = encoder(image_tokens)
    
    text_out = text_head(text_embeddings)
    image_out = image_head(image_embeddings)
    
    loss_text = loss_fn(text_out, y_text)
    loss_image = loss_fn(image_out, y_image)
    
    loss = loss_text + loss_image
    ...
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step() # update tokenizers and heads

So in summary, it has separate tokenizers for each modality, a shared pretrained frozen encoder, lightweight adaptable heads, and an optimization loop that updates the tokenizers and heads while keeping the encoder fixed.
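
For completeness, here is a hedged sketch of single-modality inference once a tokenizer and head have been trained; the names follow the earlier sketches and are illustrative placeholders.

import torch

tokenizer.eval(); encoder.eval(); head.eval()
with torch.no_grad():                 # no gradients needed at inference time
    tokens = tokenizer(x)             # raw input -> (B, N, D) token sequence
    features = encoder(tokens)        # shared frozen encoder
    logits = head(features)           # task-specific prediction head
pred = logits.argmax(dim=-1)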