Flow Matching in Latent Space
Authors: Quan Dao, Hao Phung, Binh Nguyen, Anh Tran
Abstract: Flow matching is a recent framework to train generative models that exhibits impressive empirical performance while being relatively easier to train than diffusion-based models. Despite its advantageous properties, prior methods still face the challenges of expensive computing and a large number of function evaluations of off-the-shelf solvers in the pixel space. Furthermore, although latent-based generative methods have shown great success in recent years, this particular model type remains underexplored in this area. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining its quality and flexibility. Additionally, our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical control of the Wasserstein-2 distance between the reconstructed latent flow distribution and the true data distribution, showing it is upper-bounded by the latent flow matching objective. Our code will be available at LFM.git.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a new method called Latent Flow Matching (LFM) for high-resolution image generation.
- It adapts the recent flow matching framework to operate in the latent space of pretrained autoencoders instead of pixel space.
- It also extends flow matching to support conditional image generation by incorporating classifier-free guidance into the velocity field.
Why:
- Performing flow matching in latent space is more computationally efficient and scalable than pixel space for high-resolution images.
- Integrating conditions such as class labels into flow matching has not been explored before and enables new applications.
- Flow matching has advantages over diffusion models, such as simpler training and faster sampling, so improving flow matching helps close its remaining quality gap with diffusion models.
How:
- They train a velocity field, defining a continuous normalizing flow, in the compact latent space of a pretrained autoencoder to match the latent data distribution.
- The velocity field network is adapted to take conditional inputs such as class labels or masks (see the sketch after this summary).
- Classifier-free guidance balances quality and diversity for class conditions without requiring a pretrained classifier.
- Extensive experiments demonstrate state-of-the-art results on unconditional and conditional image generation tasks.
- They also provide a theoretical analysis bounding the approximation error incurred by performing flow matching in latent space.
In summary, the key ideas are performing flow matching efficiently in latent space, enabling conditions through classifier-free velocity guidance, and showing these improve flow matching for high-resolution image generation.
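To make the conditioning mechanism concrete, here is a minimal sketch of how a velocity network might consume the latent z_t, the time t, and a condition c. The backbone interface, embedding sizes, and the additive class embedding are illustrative assumptions, not the paper's exact architecture:

    import torch
    import torch.nn as nn

    class ConditionalVelocityNet(nn.Module):
        """Illustrative conditional velocity field v(z_t, c, t).
        The backbone interface and embedding scheme are assumptions for illustration."""
        def __init__(self, backbone, num_classes, embed_dim):
            super().__init__()
            self.backbone = backbone                                      # e.g. a UNet over latent codes
            self.class_embed = nn.Embedding(num_classes + 1, embed_dim)   # last index = empty label
            self.null_label = num_classes
            self.time_embed = nn.Sequential(
                nn.Linear(1, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim)
            )

        def forward(self, z_t, c, t):
            emb = self.time_embed(t.view(-1, 1))          # embed the scalar time
            if c is None:                                 # unconditional branch uses the empty label
                c = torch.full((z_t.shape[0],), self.null_label,
                               dtype=torch.long, device=z_t.device)
            emb = emb + self.class_embed(c)               # add condition embedding to time embedding
            return self.backbone(z_t, emb)                # predict the velocity at (z_t, c, t)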
Main Contributions
Here are the main contributions of the paper:
- Proposes latent flow matching, the first work to apply flow matching in the latent space of pretrained autoencoders. This improves computational efficiency and scalability compared to pixel-space flow matching.
- Integrates conditional inputs such as class labels, segmentation masks, and images into flow matching models for the first time through modifications to the velocity field network, demonstrated on tasks like class-conditional image generation, inpainting, and semantic-to-image generation.
- Incorporates classifier-free guidance into the velocity field to enable class-conditional generation without requiring a pretrained classifier, balancing quality and diversity.
- Provides a theoretical analysis bounding the approximation error of performing flow matching on latent representations in terms of the Wasserstein distance.
- Conducts extensive experiments showing state-of-the-art quantitative and qualitative results on unconditional and conditional image generation benchmarks, helping close the gap between flow matching and diffusion models.
- Makes code available to facilitate further research and applications of flow matching models.
In summary, the key contributions are:
- Introducing latent flow matching for improved efficiency
- Enabling conditions through velocity field modifications
- Classifier-free guidance for class-conditional generation
- Theoretical analysis of latent flow matching
- Strong experimental results on various benchmarks
- Releasing code to advance flow matching research
Method Section
Here is a summary of the method section:
Training and Sampling Procedure:
- Encode the input image x into a latent code z0 using the pretrained VAE encoder.
- Train a velocity network v, defining a continuous flow that transports Gaussian noise z1 to the latent data distribution of z0.
- The loss minimizes the squared error between the predicted v and the target velocity z1 - z0 (see the objective written out below).
- For sampling, numerically integrate the predicted velocity v starting from z1 to reconstruct z0.
- Decode z0 to generate the image.
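In the summary's notation, the training objective described in the bullets above can be written out as follows (v_θ denotes the velocity network; this is the standard flow matching objective under the linear interpolation used here, stated for clarity rather than quoted from the paper):

    \mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ x \sim p_{\mathrm{data}},\ z_1 \sim \mathcal{N}(0, I)} \big\| v_\theta(z_t, c, t) - (z_1 - z_0) \big\|^2, \qquad z_0 = \mathrm{Enc}(x), \quad z_t = (1 - t)\, z_0 + t\, z_1 .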
Conditional Generation:
- Pass the class label c to the velocity network along with z_t for conditioning.
- Propose classifier-free velocity guidance: blend the unconditional velocity v_u and the conditional velocity v_c as v_tilde = γ v_c + (1-γ) v_u.
- v_u is trained on an empty label and v_c on true labels (see the sketch below); no pretrained classifier is needed.
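The summary does not spell out how the empty-label (unconditional) branch is trained; a common recipe, sketched here as an assumption rather than the paper's stated procedure, is to randomly replace labels with a reserved empty label during training:

    import torch

    def maybe_drop_labels(c, num_classes, p_uncond=0.1):
        # Replace labels with a reserved empty-label index with probability p_uncond,
        # so a single network learns both conditional and unconditional velocities.
        # p_uncond and the null-label convention are illustrative assumptions.
        null_label = num_classes
        drop = torch.rand(c.shape[0], device=c.device) < p_uncond
        return torch.where(drop, torch.full_like(c, null_label), c)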
Theoretical Analysis:
- Provide a Wasserstein-2 distance bound between the reconstructed (decoded) latent flow distribution and the true data distribution.
- The bound depends on the encoder/decoder quality and on the latent flow matching objective.
- This highlights the tradeoff between efficiency and sample quality.
In summary, the key ideas are:
- Flow matching in latent space for efficiency
- Conditional velocity field to enable conditions
- Classifier-free blending for class conditioning
- Theoretical bound relating latent flow matching objective and approximation error
High-Level Pseudo Code
Here is high-level pseudocode for the key algorithms proposed in the paper:
# Training
for x, c in dataset:
    z0 = encoder(x)                   # encode the image into a latent code
    z1 = sample_gaussian_noise()
    t = sample_uniform(0, 1)
    zt = (1 - t)*z0 + t*z1            # interpolate between data latent and noise
    velocity = flow_model(zt, c, t)
    loss = ||z1 - z0 - velocity||^2   # squared error to the target velocity
    update(flow_model, loss)
# Sampling
z = sample_gaussian_noise()           # start from noise z1
for t in np.linspace(1, 0, N):
    z = ode_step(z, flow_model, t)    # one ODE solver step toward t = 0
x = decoder(z)                        # z now approximates z0; decode to an image
# Classifier-free velocity
vu = flow_model(zt, None, t)          # unconditional velocity
vc = flow_model(zt, c, t)             # conditional velocity
vt = gamma*vc + (1 - gamma)*vu        # blended velocity used in the ODE step
The key steps are:
- Encode real images x to latent z0
- Sample noise z1 and interpolate to zt
- Train the flow model to predict the target velocity z1 - z0 at the interpolated point zt
- Sample by integrating predicted velocity from z1
- For conditional generation, blend conditional and unconditional velocities
This shows how flow matching is adapted to latent space and augmented with classifier-free guidance for conditioning. The pseudo code covers the training, sampling, and classifier-free guidance components.
Detailed Pseudo Code
Here is more detailed pseudocode to implement the key algorithms in the paper:
# Training
encoder = PretrainedVAEEncoder()          # frozen pretrained VAE encoder
flow = FlowNet()                          # velocity estimation network
optimizer = Adam(flow.parameters(), lr=lr)
for x, c in dataloader:
    z0 = encoder(x)                       # latent code of the real image
    z1 = torch.randn_like(z0)             # sample Gaussian noise
    t = torch.rand(z0.shape[0], 1, 1, 1)  # sample one time per example
    zt = (1 - t)*z0 + t*z1                # linearly interpolate
    velocity = flow(zt, c, t)
    loss = ((velocity - (z1 - z0))**2).mean()   # flow matching loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Sampling
decoder = PretrainedVAEDecoder()          # matching pretrained VAE decoder
z = torch.randn(batch_size, z_dim)        # sample a noise batch (z1)
integrator = ODEIntegrator(flow)          # Euler, Dopri, etc.
for t in torch.linspace(1, 0, steps):
    z = integrator.step(z, t)             # advance the current state toward t = 0
z0 = z                                    # reconstructed latent
x = decoder(z0)                           # generate images
# Classifier-free velocity (replaces the plain velocity inside each solver step)
vu = flow(zt, None, t)                    # unconditional velocity
vc = flow(zt, c, t)                       # conditional velocity
vt = gamma*vc + (1 - gamma)*vu            # blended guidance velocity
zt = integrator.step(zt, vt, t)           # step the current state with the blended velocity
This shows more implementation details like:
- Encoder and decoder setup
- Sampling noise and time
- Linearly interpolating latent codes
- Optimizing flow model
- Numerical integration for sampling
- Blending conditional and unconditional velocities
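Putting the pieces together, here is a self-contained sketch of sampling with a plain Euler integrator and classifier-free guidance, in place of the abstract ODEIntegrator above; the flow and decoder call signatures are assumptions carried over from the pseudocode:

    import torch

    @torch.no_grad()
    def sample_with_guidance(flow, decoder, c, z_shape, steps=50, gamma=1.5, device="cpu"):
        # Euler integration of the learned velocity field from noise (t = 1) to data (t = 0),
        # blending conditional and unconditional velocities at every step.
        z = torch.randn(c.shape[0], *z_shape, device=device)   # z1: Gaussian noise
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((c.shape[0],), 1.0 - i * dt, device=device)
            v_u = flow(z, None, t)                  # unconditional velocity
            v_c = flow(z, c, t)                     # conditional velocity
            v = gamma * v_c + (1.0 - gamma) * v_u   # classifier-free blend
            z = z - dt * v                          # move toward t = 0 (velocity is z1 - z0)
        return decoder(z)                           # decode the reconstructed latent z0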