Any-to-Any Generation via Composable Diffusion
Authors: Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal
Abstract: We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis. The project page with demonstrations and code is at codi-gen.github.io
What, Why and How
Here is a summary of the key points from this paper:
What:
- The paper presents Composable Diffusion (CoDi), a novel generative model for jointly generating combinations of modalities like text, image, video and audio.
Why:
- Existing generative AI models are limited to generating a single modality or going from one modality to another (e.g. text to image). CoDi aims to overcome this limitation by enabling flexible many-to-many generation across modalities.
How:
- CoDi consists of diffusion models for each modality that are aligned in a shared latent space using a novel “bridging alignment” technique.
- This allows CoDi to condition on any combination of input modalities and generate any combination of output modalities while training only a linear number of objectives (see the sketch at the end of this section).
- CoDi employs a composable generation strategy with cross-attention between diffusion models to enable synchronized generation of intertwined modalities like video and audio.
- Experiments show CoDi achieves strong quality across diverse generation settings like text to video+audio and outperforms or matches state-of-the-art unimodal models.
In summary, CoDi introduces a new composable and aligned training approach to create a versatile multimodal generative model capable of flexible many-to-many generation across modalities like text, image, video and audio.
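To illustrate how several input modalities can condition generation once their prompt embeddings live in a shared space, here is a minimal sketch that assumes a simple weighted interpolation of the aligned embeddings; the function and tensor names are illustrative, not the authors' code.

# Minimal sketch: compose several aligned conditioning embeddings into one.
import torch

def compose_conditions(embeddings, weights=None):
    # embeddings: list of (batch, dim) tensors already projected into the
    # shared conditioning space by bridging alignment
    # weights: optional interpolation weights, assumed to sum to 1
    if weights is None:
        weights = [1.0 / len(embeddings)] * len(embeddings)
    stacked = torch.stack(embeddings, dim=0)          # (n, batch, dim)
    w = torch.tensor(weights).view(-1, 1, 1)          # (n, 1, 1)
    return (w * stacked).sum(dim=0)                   # (batch, dim)

# e.g. condition an output diffuser jointly on text + image inputs
text_emb = torch.randn(2, 768)    # placeholder aligned text embedding
image_emb = torch.randn(2, 768)   # placeholder aligned image embedding
cond = compose_conditions([text_emb, image_emb], weights=[0.5, 0.5])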
Main Contributions
Here are the key contributions of this paper on Composable Diffusion (CoDi):
- Proposes CoDi, the first generative model capable of simultaneously processing and generating arbitrary combinations of modalities including text, image, video and audio.
- Introduces a “bridging alignment” technique to efficiently align the input conditioning spaces of different modalities using contrastive learning. This allows conditioning on any input modality combination with a linear number of objectives.
- Develops a composable generation strategy based on cross-attention between modalities’ diffusion models, enabling synchronized generation of intertwined outputs like video and audio.
- Achieves flexible any-to-any generation between modalities with high quality, outperforming or matching state-of-the-art unimodal models.
- Demonstrates strong results in comprehensive experiments, including single-to-single, multi-conditioning, and joint generation of multiple modalities.
- Provides quantitative evaluation on 8 multimodal datasets as well as qualitative visualizations of CoDi’s generation capabilities across diverse modalities.
- Makes available code, model demonstrations, and detailed architecture designs to facilitate further research.
In summary, the main contributions are proposing CoDi as the first model for flexible many-to-many multimodal generation, the bridging alignment and composable generation techniques to achieve this, and comprehensive experiments showcasing high-quality results across a diverse range of modalities and generation settings.
Method Section
Here is a summary of the methodology section from the paper on Composable Diffusion (CoDi):
The paper proposes a novel composable and aligned training approach to enable CoDi to handle flexible many-to-many multimodal generation.
First, CoDi trains individual diffusion models for each modality (text, image, video, audio) that can generate high quality outputs for that modality.
To enable conditioning on any combination of input modalities, the paper introduces a “bridging alignment” technique. This aligns the input conditioning spaces of different modalities using contrastive learning with text as a bridge modality.
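A minimal sketch of the kind of contrastive objective used for bridging alignment, written as a symmetric CLIP-style InfoNCE loss between paired text and other-modality prompt embeddings; the function name and temperature value are assumptions for illustration, not the paper's exact loss.

import torch
import torch.nn.functional as F

def bridging_alignment_loss(text_emb, other_emb, temperature=0.07):
    # Pull paired (text, other-modality) prompt embeddings together and
    # push unpaired ones apart, so text acts as the shared anchor space.
    text_emb = F.normalize(text_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = text_emb @ other_emb.t() / temperature      # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))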
For joint generation of multiple output modalities, CoDi employs a composable generation strategy. It adds cross-attention modules between the diffusion models and aligns their latent spaces using an environment encoder. This allows the models to attend to each other during synchronized generation.
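Below is a minimal sketch of such a cross-attention module, assuming each diffuser exposes intermediate U-Net features and a small environment encoder projects the other modality's latent into a shared space; the module names and dimensions are illustrative, not the paper's implementation.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Lets modality A's U-Net features attend to the environment-encoded
    # latent of modality B during synchronized generation.
    def __init__(self, dim_a, dim_env, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim_a, heads, kdim=dim_env,
                                          vdim=dim_env, batch_first=True)
        self.norm = nn.LayerNorm(dim_a)

    def forward(self, feats_a, env_b):
        # feats_a: (batch, tokens_a, dim_a) intermediate U-Net features of A
        # env_b:   (batch, tokens_b, dim_env) environment-encoded latent of B
        out, _ = self.attn(self.norm(feats_a), env_b, env_b)
        return feats_a + out   # residual connection

# Toy usage: an environment encoder projects B's diffusion latent into the
# shared space that the contrastive latent-alignment loss also operates on.
env_encoder_b = nn.Linear(64, 512)
feats_a = torch.randn(2, 16, 320)
latent_b = torch.randn(2, 10, 64)
fused = CrossModalAttention(dim_a=320, dim_env=512)(feats_a, env_encoder_b(latent_b))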
By training the cross-attention modules on only a linear number of paired modalities (e.g. text-image, text-audio), CoDi can generalize to any combination at test time (e.g. image-audio).
The composable design and bridging alignment enable CoDi to achieve any-to-any generation with high quality using linear rather than exponential objectives. Detailed model architecture and training techniques are provided for the image, video, audio and text diffusion models.
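To make the "linear rather than exponential" claim concrete, here is a back-of-the-envelope count; the exact numbers depend on how one counts tasks, so this is only an illustration.

# With m = 4 modalities, pairing every non-empty input set with every
# non-empty output set gives an exponential number of generation tasks,
# while the aligned, composable design only needs a linear number of
# training objectives (one bridge per non-text modality plus a short
# chain of cross-attention pairs).
m = 4
nonempty_subsets = 2**m - 1                   # 15
task_combinations = nonempty_subsets ** 2     # 225 input -> output tasks
bridging_objectives = m - 1                   # align image/video/audio to text
joint_pairs = 3                               # e.g. text-image, text-audio, audio-video
print(task_combinations, bridging_objectives + joint_pairs)   # 225 vs 6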
In summary, the methodology centers on an aligned training approach using bridging alignment and composable generation with cross-attention to create a versatile multimodal generative model capable of flexible many-to-many generation.
High-Level Pseudo Code
Here is the high-level pseudo code for the Composable Diffusion (CoDi) method proposed in the paper:
# Train individual diffusion models
text_diffusion = train_text_diffusion()
image_diffusion = train_image_diffusion()
video_diffusion = train_video_diffusion()
audio_diffusion = train_audio_diffusion()
# Bridging alignment for conditioning
align_prompts(text_diffusion, image_diffusion)
align_prompts(text_diffusion, audio_diffusion)
align_prompts(text_diffusion, video_diffusion)
# Composable generation: cross-attention and latent alignment, trained pairwise (image-text, audio-text, video-audio)
add_cross_attention(image_diffusion, text_diffusion)
align_latent_spaces(image_diffusion, text_diffusion)
train_cross_attention(image_diffusion, text_diffusion)
add_cross_attention(audio_diffusion, text_diffusion)
align_latent_spaces(audio_diffusion, text_diffusion)
train_cross_attention(audio_diffusion, text_diffusion)
add_cross_attention(video_diffusion, audio_diffusion)
align_latent_spaces(video_diffusion, audio_diffusion)
train_cross_attention(video_diffusion, audio_diffusion)
# Any-to-any generation: condition on the inputs, jointly run the output diffusers
generate(text, video_diffusion, audio_diffusion)    # text -> video + audio
generate(text, image_diffusion, audio_diffusion)    # text -> image + audio
generate(image, video_diffusion, audio_diffusion)   # image -> video + audio
In summary, the key steps are:
- Train individual diffusion models
- Align conditioning spaces using bridging alignment
- Add cross-attention and align latent spaces
- Train cross-attention modules for composable generation
- Perform any-to-any generation using the aligned and composable models
Detailed Pseudo Code
The paper does not provide enough algorithmic detail to fully implement Composable Diffusion (CoDi) in code. However, here is a more detailed pseudo-code outline of the key components and training process:
# Diffusion models (stage 1: one latent diffusion model per modality)
class TextDiffusionModel(nn.Module):
    ...  # text diffusion model architecture

class ImageDiffusionModel(nn.Module):
    ...  # image diffusion model architecture

class VideoDiffusionModel(nn.Module):
    ...  # video diffusion model architecture

class AudioDiffusionModel(nn.Module):
    ...  # audio diffusion model architecture

# Bridging alignment of the conditioning (prompt) spaces
def align_prompts(text_model, other_model):
    # Use text as the bridge modality
    # Contrastive loss between text and other-modality prompt embeddings
    ...

# Composable generation (stage 2: joint generation via latent alignment)
def add_cross_attention(model_A, model_B):
    # Add cross-attention layers to model_A's U-Net that attend to
    # model_B's latent variable
    ...

def align_latent_spaces(model_A, model_B):
    # Environment encoders V_A and V_B
    # Contrastive loss between V_A(z_A) and V_B(z_B)
    ...

def train_cross_attention(model_A, model_B):
    # Train model_A with cross-attention to model_B's latent variable
    # Generation loss + contrastive latent-alignment loss
    ...

# Training loop
for text, image in text_image_data:
    text_model_step(text)
    image_model_step(image)
    align_prompts(text_model, image_model)
# ... additional bridging-alignment steps for the other modalities

add_cross_attention(text_model, image_model)
align_latent_spaces(text_model, image_model)
for text, image in text_image_data:
    train_cross_attention(text_model, image_model)
# ... cross-attention training for the other modality pairs

# Any-to-any generation
text_emb = text_model.encode(text)
image_emb = image_model.encode(image)
# Aligned conditioning embeddings are combined (e.g. summed / interpolated)
video, audio = generate(text_emb + image_emb, video_model, audio_model)
The key aspects are bridging alignment of conditioning spaces, adding cross-attention between diffusion models, aligning their latent spaces, and composable training of the cross-attention modules. This enables any-to-any generation between the modalities.
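To make the joint-generation step more concrete, here is a minimal runnable sketch of two diffusers denoising in lockstep, each conditioning on the other's environment-encoded latent at every step. The ToyDiffuser interface, the simple update rule, and the use of concatenation in place of real cross-attention are all illustrative assumptions, not the paper's procedure.

import torch
import torch.nn as nn

class ToyDiffuser(nn.Module):
    # Stand-in latent diffuser with the interface this sketch assumes
    # (predict_noise / denoise_step / decode); not the paper's model.
    def __init__(self, latent_dim, env_dim):
        super().__init__()
        self.latent_shape = (latent_dim,)
        self.net = nn.Linear(latent_dim + env_dim, latent_dim)

    def predict_noise(self, z, t, cross):
        # Concatenation stands in for cross-attention to the other modality.
        return self.net(torch.cat([z, cross], dim=-1))

    def denoise_step(self, z, eps, t):
        return z - 0.1 * eps    # toy update, not a real DDIM/DDPM step

    def decode(self, z):
        return z                # identity decoder for the sketch

@torch.no_grad()
def joint_sample(dif_a, dif_b, env_enc_a, env_enc_b, steps=50):
    # Two diffusers denoise in lockstep; at every step each one sees the
    # other's latent projected by its environment encoder.
    z_a = torch.randn(1, *dif_a.latent_shape)
    z_b = torch.randn(1, *dif_b.latent_shape)
    for t in reversed(range(steps)):
        ctx_for_a = env_enc_b(z_b)
        ctx_for_b = env_enc_a(z_a)
        eps_a = dif_a.predict_noise(z_a, t, cross=ctx_for_a)
        eps_b = dif_b.predict_noise(z_b, t, cross=ctx_for_b)
        z_a = dif_a.denoise_step(z_a, eps_a, t)
        z_b = dif_b.denoise_step(z_b, eps_b, t)
    return dif_a.decode(z_a), dif_b.decode(z_b)

# e.g. jointly sample "video" and "audio" latents (toy dimensions)
video_dif, audio_dif = ToyDiffuser(32, 16), ToyDiffuser(32, 16)
env_video, env_audio = nn.Linear(32, 16), nn.Linear(32, 16)
video, audio = joint_sample(video_dif, audio_dif, env_video, env_audio)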