AltDiffusion: A Multilingual Text-to-Image Diffusion Model
Authors: Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
Abstract: Large Text-to-Image (T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, consisting of a concept alignment stage and a quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes the Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints can be found in the AltDiffusion repository.
What, Why and How
Here is a summary of the key points from this paper:
What:
- The paper introduces AltDiffusion, a novel multilingual text-to-image (T2I) diffusion model that supports 18 languages.
- The authors also introduce a new benchmark with two datasets for evaluating multilingual T2I models:
  - Multilingual-General-18 (MG-18): for evaluating general image generation quality.
  - Multilingual-Cultural-18 (MC-18): for evaluating culture-specific concept generation.
Why:
- Existing T2I models such as Stable Diffusion support only a limited set of languages (e.g., English), restricting their applicability.
- Multilingual models allow prompts to be written directly in these languages without translation, avoiding translation errors and the loss of cultural nuances.
- Evaluating culture-specific concept generation is important for multilingual models, but benchmark datasets for it were lacking.
How:
- They first train a multilingual text encoder using knowledge distillation.
- The text encoder is plugged into a pretrained English-only diffusion model.
- The diffusion model is trained using a two-stage approach on multilingual datasets to enhance its language capability.
- Experiments show AltDiffusion outperforms other multilingual models in image quality and understanding cultural concepts.
In summary, this paper introduces a new multilingual diffusion model AltDiffusion that supports 18 languages, and benchmarks to evaluate multilingual image generation quality and cultural concept understanding. The two-stage training approach enhances the model’s multilingual capabilities. Experiments show improved performance over existing multilingual models.
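To make the distillation step above more concrete, here is a minimal sketch using real libraries: HuggingFace transformers for an xlm-roberta-large student and OpenCLIP's ViT-H/14 text tower as the teacher. The specific checkpoints, projection head, and training details are assumptions for illustration and may differ from the paper.

# Sketch: distill the teacher's English text embeddings into a multilingual XLM-R student
import torch
import torch.nn as nn
import open_clip
from transformers import AutoTokenizer, XLMRobertaModel

teacher, _, _ = open_clip.create_model_and_transforms("ViT-H-14", pretrained="laion2b_s32b_b79k")
clip_tokenize = open_clip.get_tokenizer("ViT-H-14")
student = XLMRobertaModel.from_pretrained("xlm-roberta-large")
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
proj = nn.Linear(student.config.hidden_size, 1024)  # hypothetical head; 1024 = ViT-H/14 text embedding dim
optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

for en_text, other_text in parallel_text_pairs:  # parallel_text_pairs: assumed iterable of (English, other-language) pairs
    with torch.no_grad():
        target = teacher.encode_text(clip_tokenize([en_text]))        # frozen teacher embedding of the English side
    tokens = xlmr_tokenizer(other_text, return_tensors="pt", truncation=True)
    student_emb = proj(student(**tokens).last_hidden_state[:, 0])     # first (<s>) token, projected to teacher dim
    loss = nn.functional.mse_loss(student_emb, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()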
Main Contributions
Here are the main contributions of this paper:
- AltDiffusion Model: The paper introduces AltDiffusion, a novel multilingual text-to-image diffusion model that supports 18 languages.
- Multilingual Benchmark: The authors propose a new benchmark with two datasets for evaluating multilingual image generation:
  - MG-18 for evaluating general image quality
  - MC-18 for evaluating culture-specific concepts
- Two-Stage Training: A two-stage training approach is proposed to enhance the multilingual capability of diffusion models by concept alignment and quality improvement.
- Evaluation: Comprehensive experiments are conducted evaluating AltDiffusion on the new benchmarks. Results show it outperforms other multilingual models in image quality and cultural concept understanding.
- Applications: The multilingual capability of AltDiffusion enables direct use of non-English languages for image generation without translation, and allows mixed-language prompts.
In summary, the main contributions are:
- AltDiffusion model supporting 18 languages
- New multilingual benchmark datasets
- Two-stage training approach
- Evaluation showing improved multilingual performance
- Applications enabled by multilingual capabilities
Method Section
Here is a summary of the method section from the paper:
The goal is to develop a multilingual text-to-image (T2I) diffusion model that can support diverse languages.
The method has two main steps:
- Enhancing the language capability of the text encoder:
  - They first train a multilingual text encoder using knowledge distillation from a teacher model (the OpenCLIP text encoder).
  - The student model is XLM-R, trained to match the teacher's embeddings on parallel English/other-language text pairs.
  - This results in a text encoder supporting 18 languages with an aligned embedding space.
- Enhancing the language capability of the UNet:
  - The pretrained multilingual text encoder is plugged into an off-the-shelf English-only diffusion model (Stable Diffusion).
  - A two-stage training approach is used to adapt the diffusion model:
    - Concept alignment stage: only the cross-attention modules are trained on a large multilingual dataset to align text and image embeddings.
    - Quality improvement stage: all UNet parameters are fine-tuned on a high-quality aesthetic dataset to improve image quality.
  - Classifier-free guidance is also used to improve training; in practice this means the text condition is dropped at random during training so that guided sampling can be applied at inference.
In summary, the method involves:
- Knowledge distillation to obtain a multilingual text encoder
- Plugging it into an English diffusion model
- Two-stage training of the diffusion model for concept alignment and quality improvement.
This results in a full multilingual T2I diffusion model AltDiffusion that can support diverse languages directly without needing translation.
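For reference, both training stages minimize the standard latent-diffusion noise-prediction objective that AltDiffusion inherits from Stable Diffusion, with the multilingual encoder supplying the text conditioning; classifier-free guidance then combines conditional and unconditional predictions at sampling time. This is the generic formulation rather than an equation reproduced from the paper:

\mathcal{L} = \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t,\, t,\, \tau(y)) \rVert_2^2 \,\right]

\hat{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, \varnothing) + w \cdot \left( \epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing) \right)

where z_t is the noised latent at timestep t, \tau(y) is the multilingual text embedding of prompt y, \varnothing is the null (dropped) condition, and w is the guidance scale.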
High-Level Pseudo Code
Here is the high-level pseudo code for the method presented in this paper:
# Train the multilingual text encoder by knowledge distillation
text_encoder = XLMRobertaForTextEncoder()   # student (XLM-R based)
teacher = OpenCLIPTextEncoder()             # frozen teacher
for en_text, other_text in parallel_text_pairs:
    teacher_embeddings = teacher(en_text)          # teacher embeds the English side
    student_embeddings = text_encoder(other_text)  # student embeds the parallel non-English side
    loss = MSE(teacher_embeddings, student_embeddings)
    optimize(loss)
# Enhance diffusion model: plug the multilingual text encoder into Stable Diffusion
diffusion_model = StableDiffusion()
diffusion_model.text_encoder = text_encoder

# Concept alignment: train only the cross-attention layers
freeze(diffusion_model)
unfreeze(diffusion_model.cross_attention)
for x, t in multilingual_dataset:
    noisy_x, noise = add_noise(x)    # forward diffusion: noised latent and the sampled noise
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSE(noise, pred_noise)    # standard noise-prediction objective
    optimize(loss)
# Quality improvement: fine-tune all UNet parameters on a high-quality aesthetic dataset
unfreeze(diffusion_model)
for x, t in aesthetic_dataset:
    noisy_x, noise = add_noise(x)
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSE(noise, pred_noise)
    optimize(loss)
# Classifier-free guidance: drop the text condition at random during training ...
for x, t in aesthetic_dataset:
    noisy_x, noise = add_noise(x)
    t_emb = empty_embedding if random() < drop_prob else text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSE(noise, pred_noise)
    optimize(loss)

# ... then combine conditional and unconditional predictions at sampling time:
# pred = pred_uncond + alpha * (pred_cond - pred_uncond)
This shows the high-level steps of:
- Training the multilingual text encoder
- Plugging it into the diffusion model
- Concept alignment stage
- Quality improvement stage
- Applying classifier-free guidance (condition dropout during training, guided combination at sampling)
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the method presented in this paper:
# Multilingual Text Encoder
text_encoder = XLMRobertaForTextEncoder() # Initialize student model
teacher = OpenCLIPTextEncoder() # Initialize teacher model
optim = AdamW(text_encoder.parameters())
for en_text, other_text in parallel_text_pairs:
    teacher_embeddings = teacher(en_text)
    student_embeddings = text_encoder(other_text)
    loss = MSELoss(teacher_embeddings, student_embeddings)
    optim.zero_grad()
    loss.backward()
    optim.step()
# Enhance Diffusion Model
diffusion_model = StableDiffusion()
diffusion_model.text_encoder = text_encoder # Plug in text encoder
optim = AdamW(diffusion_model.parameters())
# Concept Alignment: freeze everything except the cross-attention modules
for param in diffusion_model.parameters():
    param.requires_grad = False
for param in diffusion_model.cross_attention.parameters():
    param.requires_grad = True

for x, t in multilingual_dataset:
    noisy_x, noise = GaussianNoise(x)    # forward diffusion: noised latent and the sampled noise
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSELoss(noise, pred_noise)    # standard noise-prediction objective
    optim.zero_grad()
    loss.backward()
    optim.step()
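# Note (assumption, not the paper's code): in a real diffusers UNet2DConditionModel the
# cross-attention layers are the submodules named "attn2" (attn1 is self-attention),
# so the selective unfreeze above could be approximated as:
#   for name, param in diffusion_model.unet.named_parameters():
#       param.requires_grad = "attn2" in name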
# Quality Improvement: unfreeze and fine-tune all UNet parameters on aesthetic data
for param in diffusion_model.parameters():
    param.requires_grad = True

for x, t in aesthetic_dataset:
    noisy_x, noise = GaussianNoise(x)
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSELoss(noise, pred_noise)
    optim.zero_grad()
    loss.backward()
    optim.step()
# Classifier-Free Guidance: randomly drop the text condition during training
for x, t in aesthetic_dataset:
    noisy_x, noise = GaussianNoise(x)
    t_emb = null_embedding if random() < cond_drop_prob else text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSELoss(noise, pred_noise)
    optim.zero_grad()
    loss.backward()
    optim.step()

# At sampling time, conditional and unconditional predictions are combined:
# pred_noise = pred_uncond + guidance_scale * (pred_cond - pred_uncond)
This shows a more detailed pseudo code implementation with:
- Initializing models
- Training loops
- Getting embeddings
- Applying noise
- Gradient calculation
- Optimization
This covers the key steps of training the text encoder, the concept alignment and quality improvement stages, and classifier-free guidance for the diffusion model.
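Finally, a minimal inference sketch for trying the released model, assuming the checkpoint is published on the Hugging Face Hub as BAAI/AltDiffusion-m18 and that the installed diffusers version still ships the AltDiffusion pipeline; check the official repository for the exact model ID and supported versions.

# Usage sketch (model ID and pipeline availability are assumptions; see the official release)
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m18", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Prompts can be written directly in any of the 18 supported languages,
# e.g., Chinese: "a Shiba Inu wearing a red scarf, snowy scene, high-quality photo"
image = pipe("一只戴着红围巾的柴犬，雪景，高清摄影",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("altdiffusion_sample.png")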