AltDiffusion: A Multilingual Text-to-Image Diffusion Model
Authors: Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
Abstract: Large Text-to-Image (T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, consisting of a concept alignment stage and a quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes the Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints can be found in the AltDiffusion repository.
What, Why and How
Here is a summary of the key points from this paper:
What:
- The paper introduces AltDiffusion, a novel multilingual text-to-image (T2I) diffusion model that supports 18 languages.
- The authors also introduce a new benchmark with two datasets for evaluating multilingual T2I models:
  - Multilingual-General-18 (MG-18): for evaluating general image generation quality.
  - Multilingual-Cultural-18 (MC-18): for evaluating culture-specific concept generation.
Why:
- Existing T2I models such as Stable Diffusion support only a limited set of languages (e.g., English), restricting their applicability.
- Multilingual models allow prompts to be written directly in these languages without translation, avoiding translation errors and the loss of cultural nuances.
- Evaluating culture-specific concept generation is important for multilingual models, but benchmark datasets for it were lacking.
How:
- They first train a multilingual text encoder using knowledge distillation.
- The text encoder is plugged into a pretrained English-only diffusion model.
- The diffusion model is trained using a two-stage approach on multilingual datasets to enhance its language capability.
- Experiments show AltDiffusion outperforms other multilingual models in image quality and understanding cultural concepts.
In summary, this paper introduces a new multilingual diffusion model AltDiffusion that supports 18 languages, and benchmarks to evaluate multilingual image generation quality and cultural concept understanding. The two-stage training approach enhances the model’s multilingual capabilities. Experiments show improved performance over existing multilingual models.
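To make the distillation step above more concrete, here is a minimal sketch using real libraries: HuggingFace transformers for an xlm-roberta-large student and OpenCLIP's ViT-H/14 text tower as the teacher. The specific checkpoints, projection head, and training details are assumptions for illustration and may differ from the paper.

# Sketch: distill the teacher's English text embeddings into a multilingual XLM-R student
import torch
import torch.nn as nn
import open_clip
from transformers import AutoTokenizer, XLMRobertaModel

teacher, _, _ = open_clip.create_model_and_transforms("ViT-H-14", pretrained="laion2b_s32b_b79k")
clip_tokenize = open_clip.get_tokenizer("ViT-H-14")
student = XLMRobertaModel.from_pretrained("xlm-roberta-large")
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
proj = nn.Linear(student.config.hidden_size, 1024)  # hypothetical head; 1024 = ViT-H/14 text embedding dim
optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

for en_text, other_text in parallel_text_pairs:  # parallel_text_pairs: assumed iterable of (English, other-language) pairs
    with torch.no_grad():
        target = teacher.encode_text(clip_tokenize([en_text]))        # frozen teacher embedding of the English side
    tokens = xlmr_tokenizer(other_text, return_tensors="pt", truncation=True)
    student_emb = proj(student(**tokens).last_hidden_state[:, 0])     # first (<s>) token, projected to teacher dim
    loss = nn.functional.mse_loss(student_emb, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()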
Main Contributions
Here are the main contributions of this paper:
- AltDiffusion Model: The paper introduces AltDiffusion, a novel multilingual text-to-image diffusion model that supports 18 languages.
- Multilingual Benchmark: The authors propose a new benchmark with two datasets for evaluating multilingual image generation:
  - MG-18 for evaluating general image quality
  - MC-18 for evaluating culture-specific concepts
- Two-Stage Training: A two-stage training approach is proposed to enhance the multilingual capability of diffusion models by concept alignment and quality improvement.
- Evaluation: Comprehensive experiments are conducted evaluating AltDiffusion on the new benchmarks. Results show it outperforms other multilingual models in image quality and cultural concept understanding.
- Applications: The multilingual capability of AltDiffusion enables direct use of non-English languages for image generation without translation, and allows mixed-language prompts.
In summary, the main contributions are:
- AltDiffusion model supporting 18 languages
- New multilingual benchmark datasets
- Two-stage training approach
- Evaluation showing improved multilingual performance
- Applications enabled by multilingual capabilities
Method Section
Here is a summary of the method section from the paper:
The goal is to develop a multilingual text-to-image (T2I) diffusion model that can support diverse languages.
The method has two main steps:
- Enhancing the language capability of the text encoder:
  - They first train a multilingual text encoder using knowledge distillation from a teacher model (the OpenCLIP text encoder).
  - The student model is XLM-R, trained to match the teacher's embeddings on parallel English/other-language text pairs.
  - This results in a text encoder supporting 18 languages with an aligned embedding space.
- Enhancing the language capability of the UNet:
  - The pretrained multilingual text encoder is plugged into an off-the-shelf English-only diffusion model (Stable Diffusion).
  - A two-stage training approach is used to adapt the diffusion model:
    - Concept alignment stage: only the cross-attention modules are trained on a large multilingual dataset to align text and image embeddings.
    - Quality improvement stage: all UNet parameters are fine-tuned on a high-quality aesthetic dataset to improve image quality.
  - Classifier-free guidance is also used to improve training; in practice this means the text condition is dropped at random during training so that guided sampling can be applied at inference.
In summary, the method involves:
- Knowledge distillation to obtain a multilingual text encoder
- Plugging it into an English diffusion model
- Two-stage training of the diffusion model for concept alignment and quality improvement.
This results in a full multilingual T2I diffusion model AltDiffusion that can support diverse languages directly without needing translation.
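For reference, both training stages minimize the standard latent-diffusion noise-prediction objective that AltDiffusion inherits from Stable Diffusion, with the multilingual encoder supplying the text conditioning; classifier-free guidance then combines conditional and unconditional predictions at sampling time. This is the generic formulation rather than an equation reproduced from the paper:

\mathcal{L} = \mathbb{E}_{z,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t,\, t,\, \tau(y)) \rVert_2^2 \,\right]

\hat{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, \varnothing) + w \cdot \left( \epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing) \right)

where z_t is the noised latent at timestep t, \tau(y) is the multilingual text embedding of prompt y, \varnothing is the null (dropped) condition, and w is the guidance scale.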
High-Level Pseudo Code
Here is the high-level pseudo code for the method presented in this paper:
# Train the multilingual text encoder by knowledge distillation
text_encoder = XLMRobertaForTextEncoder()   # student (XLM-R based)
teacher = OpenCLIPTextEncoder()             # frozen teacher
for en_text, other_text in parallel_text_pairs:
    teacher_embeddings = teacher(en_text)          # teacher embeds the English side
    student_embeddings = text_encoder(other_text)  # student embeds the parallel non-English side
    loss = MSE(teacher_embeddings, student_embeddings)
    optimize(loss)
# Enhance diffusion model: plug the multilingual text encoder into Stable Diffusion
diffusion_model = StableDiffusion()
diffusion_model.text_encoder = text_encoder

# Concept alignment: train only the cross-attention layers
freeze(diffusion_model)
unfreeze(diffusion_model.cross_attention)
for x, t in multilingual_dataset:
    noisy_x, noise = add_noise(x)    # forward diffusion: noised latent and the sampled noise
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSE(noise, pred_noise)    # standard noise-prediction objective
    optimize(loss)
# Quality improvement: fine-tune all UNet parameters on a high-quality aesthetic dataset
unfreeze(diffusion_model)
for x, t in aesthetic_dataset:
    noisy_x, noise = add_noise(x)
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSE(noise, pred_noise)
    optimize(loss)
# Classifier-free guidance: drop the text condition at random during training ...
for x, t in aesthetic_dataset:
    noisy_x, noise = add_noise(x)
    t_emb = empty_embedding if random() < drop_prob else text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSE(noise, pred_noise)
    optimize(loss)

# ... then combine conditional and unconditional predictions at sampling time:
# pred = pred_uncond + alpha * (pred_cond - pred_uncond)
This shows the high-level steps of:
- Training the multilingual text encoder
- Plugging it into the diffusion model
- Concept alignment stage
- Quality improvement stage
- Applying classifier-free guidance (condition dropout during training, guided combination at sampling)
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the method presented in this paper:
# Multilingual Text Encoder
text_encoder = XLMRobertaForTextEncoder() # Initialize student model
teacher = OpenCLIPTextEncoder() # Initialize teacher model
optim = AdamW(text_encoder.parameters())
for en_text, other_text in parallel_text_pairs:
    teacher_embeddings = teacher(en_text)
    student_embeddings = text_encoder(other_text)
    loss = MSELoss(teacher_embeddings, student_embeddings)
    optim.zero_grad()
    loss.backward()
    optim.step()
# Enhance Diffusion Model
diffusion_model = StableDiffusion()
diffusion_model.text_encoder = text_encoder # Plug in text encoder
optim = AdamW(diffusion_model.parameters())
# Concept Alignment: freeze everything except the cross-attention modules
for param in diffusion_model.parameters():
    param.requires_grad = False
for param in diffusion_model.cross_attention.parameters():
    param.requires_grad = True

for x, t in multilingual_dataset:
    noisy_x, noise = GaussianNoise(x)    # forward diffusion: noised latent and the sampled noise
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSELoss(noise, pred_noise)    # standard noise-prediction objective
    optim.zero_grad()
    loss.backward()
    optim.step()
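# Note (assumption, not the paper's code): in a real diffusers UNet2DConditionModel the
# cross-attention layers are the submodules named "attn2" (attn1 is self-attention),
# so the selective unfreeze above could be approximated as:
#   for name, param in diffusion_model.unet.named_parameters():
#       param.requires_grad = "attn2" in name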
# Quality Improvement: unfreeze and fine-tune all UNet parameters on aesthetic data
for param in diffusion_model.parameters():
    param.requires_grad = True

for x, t in aesthetic_dataset:
    noisy_x, noise = GaussianNoise(x)
    t_emb = text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSELoss(noise, pred_noise)
    optim.zero_grad()
    loss.backward()
    optim.step()
# Classifier-Free Guidance: randomly drop the text condition during training
for x, t in aesthetic_dataset:
    noisy_x, noise = GaussianNoise(x)
    t_emb = null_embedding if random() < cond_drop_prob else text_encoder(t)
    pred_noise = diffusion_model(noisy_x, t_emb)
    loss = MSELoss(noise, pred_noise)
    optim.zero_grad()
    loss.backward()
    optim.step()

# At sampling time, conditional and unconditional predictions are combined:
# pred_noise = pred_uncond + guidance_scale * (pred_cond - pred_uncond)
This shows a more detailed pseudo code implementation with:
- Initializing models
- Training loops
- Getting embeddings
- Applying noise
- Gradient calculation
- Optimization
This covers the key steps of training the text encoder, the concept alignment and quality improvement stages, and classifier-free guidance for the diffusion model.
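Finally, a minimal inference sketch for trying the released model, assuming the checkpoint is published on the Hugging Face Hub as BAAI/AltDiffusion-m18 and that the installed diffusers version still ships the AltDiffusion pipeline; check the official repository for the exact model ID and supported versions.

# Usage sketch (model ID and pipeline availability are assumptions; see the official release)
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m18", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Prompts can be written directly in any of the 18 supported languages,
# e.g., Chinese: "a Shiba Inu wearing a red scarf, snowy scene, high-quality photo"
image = pipe("一只戴着红围巾的柴犬，雪景，高清摄影",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("altdiffusion_sample.png")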