Flow Matching in Latent Space
Authors: Quan Dao, Hao Phung, Binh Nguyen, Anh Tran
Abstract: Flow matching is a recent framework to train generative models that exhibits impressive empirical performance while being relatively easier to train than diffusion-based models. Despite its advantageous properties, prior methods still face the challenges of expensive computing and a large number of function evaluations of off-the-shelf solvers in the pixel space. Furthermore, although latent-based generative methods have shown great success in recent years, this particular model type remains underexplored in this area. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining its quality and flexibility. Additionally, our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical control of the Wasserstein-2 distance between the reconstructed latent flow distribution and the true data distribution, showing it is upper-bounded by the latent flow matching objective. Our code will be available at LFM.git.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a new method called Latent Flow Matching (LFM) for high-resolution image generation.
- It adapts the recent flow matching framework to operate in the latent space of pretrained autoencoders instead of pixel space.
- It also extends flow matching to support conditional image generation by incorporating classifier-free guidance into the velocity field.
Why:
- Performing flow matching in latent space is more computationally efficient and scalable than pixel space for high-resolution images.
- Integrating conditions such as class labels into flow matching has not been explored before and enables new applications.
- Flow matching has advantages over diffusion models, such as simpler training and faster sampling, so improving flow matching helps close its remaining quality gap with diffusion models.
How:
- They train a velocity field, defining a continuous normalizing flow, in the compact latent space of a pretrained autoencoder to match the latent data distribution.
- The velocity field network is adapted to take conditional inputs such as class labels or masks (see the sketch after this summary).
- Classifier-free guidance balances quality and diversity for class conditions without requiring a pretrained classifier.
- Extensive experiments demonstrate state-of-the-art results on unconditional and conditional image generation tasks.
- They also provide a theoretical analysis bounding the approximation error incurred by performing flow matching in latent space.
In summary, the key ideas are performing flow matching efficiently in latent space, enabling conditions through classifier-free velocity guidance, and showing these improve flow matching for high-resolution image generation.
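To make the conditioning mechanism concrete, here is a minimal sketch of how a velocity network might consume the latent z_t, the time t, and a condition c. The backbone interface, embedding sizes, and the additive class embedding are illustrative assumptions, not the paper's exact architecture:

    import torch
    import torch.nn as nn

    class ConditionalVelocityNet(nn.Module):
        """Illustrative conditional velocity field v(z_t, c, t).
        The backbone interface and embedding scheme are assumptions for illustration."""
        def __init__(self, backbone, num_classes, embed_dim):
            super().__init__()
            self.backbone = backbone                                      # e.g. a UNet over latent codes
            self.class_embed = nn.Embedding(num_classes + 1, embed_dim)   # last index = empty label
            self.null_label = num_classes
            self.time_embed = nn.Sequential(
                nn.Linear(1, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim)
            )

        def forward(self, z_t, c, t):
            emb = self.time_embed(t.view(-1, 1))          # embed the scalar time
            if c is None:                                 # unconditional branch uses the empty label
                c = torch.full((z_t.shape[0],), self.null_label,
                               dtype=torch.long, device=z_t.device)
            emb = emb + self.class_embed(c)               # add condition embedding to time embedding
            return self.backbone(z_t, emb)                # predict the velocity at (z_t, c, t)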
Main Contributions
Here are the main contributions of the paper:
- Proposes latent flow matching, the first work to apply flow matching in the latent space of pretrained autoencoders. This improves computational efficiency and scalability compared to pixel-space flow matching.
- Integrates conditional inputs such as class labels, segmentation masks, and images into flow matching models for the first time through modifications to the velocity field network, demonstrated on tasks like class-conditional image generation, inpainting, and semantic-to-image generation.
- Incorporates classifier-free guidance into the velocity field to enable class-conditional generation without requiring a pretrained classifier, balancing quality and diversity.
- Provides a theoretical analysis bounding the approximation error of performing flow matching on latent representations in terms of the Wasserstein distance.
- Conducts extensive experiments showing state-of-the-art quantitative and qualitative results on unconditional and conditional image generation benchmarks, helping close the gap between flow matching and diffusion models.
- Makes code available to facilitate further research and applications of flow matching models.
In summary, the key contributions are:
- Introducing latent flow matching for improved efficiency
- Enabling conditions through velocity field modifications
- Classifier-free guidance for class-conditional generation
- Theoretical analysis of latent flow matching
- Strong experimental results on various benchmarks
- Releasing code to advance flow matching research
Method Section
Here is a summary of the method section:
Training and Sampling Procedure:
- Encode the input image x into a latent code z0 using the pretrained VAE encoder.
- Train a velocity network v, defining a continuous flow that transports Gaussian noise z1 to the latent data distribution of z0.
- The loss minimizes the squared error between the predicted v and the target velocity z1 - z0 (see the objective written out below).
- For sampling, numerically integrate the predicted velocity v starting from z1 to reconstruct z0.
- Decode z0 to generate the image.
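In the summary's notation, the training objective described in the bullets above can be written out as follows (v_θ denotes the velocity network; this is the standard flow matching objective under the linear interpolation used here, stated for clarity rather than quoted from the paper):

    \mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ x \sim p_{\mathrm{data}},\ z_1 \sim \mathcal{N}(0, I)} \big\| v_\theta(z_t, c, t) - (z_1 - z_0) \big\|^2, \qquad z_0 = \mathrm{Enc}(x), \quad z_t = (1 - t)\, z_0 + t\, z_1 .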
Conditional Generation:
- Pass the class label c to the velocity network along with z_t for conditioning.
- Propose classifier-free velocity guidance: blend the unconditional velocity v_u and the conditional velocity v_c as v_tilde = γ v_c + (1-γ) v_u.
- v_u is trained on an empty label and v_c on true labels (see the sketch below); no pretrained classifier is needed.
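The summary does not spell out how the empty-label (unconditional) branch is trained; a common recipe, sketched here as an assumption rather than the paper's stated procedure, is to randomly replace labels with a reserved empty label during training:

    import torch

    def maybe_drop_labels(c, num_classes, p_uncond=0.1):
        # Replace labels with a reserved empty-label index with probability p_uncond,
        # so a single network learns both conditional and unconditional velocities.
        # p_uncond and the null-label convention are illustrative assumptions.
        null_label = num_classes
        drop = torch.rand(c.shape[0], device=c.device) < p_uncond
        return torch.where(drop, torch.full_like(c, null_label), c)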
Theoretical Analysis:
- Provide a Wasserstein-2 distance bound between the reconstructed (decoded) latent flow distribution and the true data distribution.
- The bound depends on the encoder/decoder quality and on the latent flow matching objective.
- This highlights the tradeoff between efficiency and sample quality.
In summary, the key ideas are:
- Flow matching in latent space for efficiency
- Conditional velocity field to enable conditions
- Classifier-free blending for class conditioning
- Theoretical bound relating latent flow matching objective and approximation error
High-Level Pseudo Code
Here is high-level pseudocode for the key algorithms proposed in the paper:
# Training
for x, c in dataset:
    z0 = encoder(x)                   # encode the image into a latent code
    z1 = sample_gaussian_noise()
    t = sample_uniform(0, 1)
    zt = (1 - t)*z0 + t*z1            # interpolate between data latent and noise
    velocity = flow_model(zt, c, t)
    loss = ||z1 - z0 - velocity||^2   # squared error to the target velocity
    update(flow_model, loss)
# Sampling
z = sample_gaussian_noise()           # start from noise z1
for t in np.linspace(1, 0, N):
    z = ode_step(z, flow_model, t)    # one ODE solver step toward t = 0
x = decoder(z)                        # z now approximates z0; decode to an image
# Classifier-free velocity
vu = flow_model(zt, None, t)          # unconditional velocity
vc = flow_model(zt, c, t)             # conditional velocity
vt = gamma*vc + (1 - gamma)*vu        # blended velocity used in the ODE step
The key steps are:
- Encode real images x to latent z0
- Sample noise z1 and interpolate to zt
- Train the flow model to predict the target velocity z1 - z0 at the interpolated point zt
- Sample by integrating predicted velocity from z1
- For conditional generation, blend conditional and unconditional velocities
This shows how flow matching is adapted to latent space and augmented with classifier-free guidance for conditioning. The pseudo code covers the training, sampling, and classifier-free guidance components.
Detailed Pseudo Code
Here is more detailed pseudocode to implement the key algorithms in the paper:
# Training
encoder = PretrainedVAEEncoder()          # frozen pretrained VAE encoder
flow = FlowNet()                          # velocity estimation network
optimizer = Adam(flow.parameters(), lr=lr)
for x, c in dataloader:
    z0 = encoder(x)                       # latent code of the real image
    z1 = torch.randn_like(z0)             # sample Gaussian noise
    t = torch.rand(z0.shape[0], 1, 1, 1)  # sample one time per example
    zt = (1 - t)*z0 + t*z1                # linearly interpolate
    velocity = flow(zt, c, t)
    loss = ((velocity - (z1 - z0))**2).mean()   # flow matching loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Sampling
decoder = PretrainedVAEDecoder()          # matching pretrained VAE decoder
z = torch.randn(batch_size, z_dim)        # sample a noise batch (z1)
integrator = ODEIntegrator(flow)          # Euler, Dopri, etc.
for t in torch.linspace(1, 0, steps):
    z = integrator.step(z, t)             # advance the current state toward t = 0
z0 = z                                    # reconstructed latent
x = decoder(z0)                           # generate images
# Classifier-free velocity (replaces the plain velocity inside each solver step)
vu = flow(zt, None, t)                    # unconditional velocity
vc = flow(zt, c, t)                       # conditional velocity
vt = gamma*vc + (1 - gamma)*vu            # blended guidance velocity
zt = integrator.step(zt, vt, t)           # step the current state with the blended velocity
This shows more implementation details like:
- Encoder and decoder setup
- Sampling noise and time
- Linearly interpolating latent codes
- Optimizing flow model
- Numerical integration for sampling
- Blending conditional and unconditional velocities
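Putting the pieces together, here is a self-contained sketch of sampling with a plain Euler integrator and classifier-free guidance, in place of the abstract ODEIntegrator above; the flow and decoder call signatures are assumptions carried over from the pseudocode:

    import torch

    @torch.no_grad()
    def sample_with_guidance(flow, decoder, c, z_shape, steps=50, gamma=1.5, device="cpu"):
        # Euler integration of the learned velocity field from noise (t = 1) to data (t = 0),
        # blending conditional and unconditional velocities at every step.
        z = torch.randn(c.shape[0], *z_shape, device=device)   # z1: Gaussian noise
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((c.shape[0],), 1.0 - i * dt, device=device)
            v_u = flow(z, None, t)                  # unconditional velocity
            v_c = flow(z, c, t)                     # conditional velocity
            v = gamma * v_c + (1.0 - gamma) * v_u   # classifier-free blend
            z = z - dt * v                          # move toward t = 0 (velocity is z1 - z0)
        return decoder(z)                           # decode the reconstructed latent z0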