Bayesian Flow Networks
Authors: Alex Graves, Rupesh Kumar Srivastava, Timothy Atkinson, Faustino Gomez
Abstract: This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution. Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required. Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures. Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling. The loss function directly optimises data compression and places no restrictions on the network architecture. In our experiments BFNs achieve competitive log-likelihoods for image modelling on dynamically binarized MNIST and CIFAR-10, and outperform all known discrete diffusion models on the text8 character-level language modelling task.
What, Why and How
Here is a summary of the key points of this paper on Bayesian Flow Networks:
What:
- Proposes a new class of generative model called Bayesian Flow Networks (BFNs) that combines Bayesian inference with neural networks.
Why:
- Aims to provide a simpler and more flexible alternative to diffusion models that can work well on both continuous and discrete data.
How:
- Uses a neural network to map the parameters of an input distribution to an output distribution.
- The input distribution is iteratively updated using Bayes' rule and noisy samples from the data (a minimal sketch of this update follows the summary below).
- This creates a continuous flow of information from the data to the network parameters.
- Loss functions are derived to train the model by minimizing the number of bits needed to transmit the data.
- Versions are presented for continuous, discretized and discrete data.
- Achieves strong results on MNIST, CIFAR-10 and text modeling benchmarks.
In summary, BFNs introduce a conceptually simple yet powerful approach to generative modeling that leverages the strengths of both Bayesian inference for summarizing data and neural networks for modeling complex distributions. The use of a continuous information flow appears particularly beneficial for discrete data like text.
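To make the Bayesian update step above concrete, here is a minimal sketch of the conjugate Gaussian update used for continuous data, assuming the input distribution is parameterised by a mean and a precision; the function name and example values are illustrative, not the paper's code.

def gaussian_bayesian_update(mean, precision, noisy_sample, accuracy):
    # Conjugate update of an independent Gaussian input distribution after
    # observing noisy_sample ~ N(data, 1/accuracy): precisions add, and the
    # new mean is a precision-weighted average of the old mean and the sample.
    new_precision = precision + accuracy
    new_mean = (precision * mean + accuracy * noisy_sample) / new_precision
    return new_mean, new_precision

# Example: one variable, starting from the standard-normal prior.
mean, precision = gaussian_bayesian_update(0.0, 1.0, noisy_sample=0.8, accuracy=2.0)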
Main Contributions
Here are the main contributions of this paper on Bayesian Flow Networks:
- Proposes the Bayesian Flow Network (BFN) framework, which combines Bayesian inference and neural networks in a novel way for generative modeling.
- Derives discrete- and continuous-time loss functions for training BFNs based on data-compression principles. The loss functions directly optimize the number of bits needed to transmit the data.
- Provides instantiations of BFNs for continuous, discretized, and discrete data types, demonstrating the flexibility of the framework.
- For continuous data, shows that BFN training is equivalent to a variational autoencoder with a particular Bayesian latent code.
- Demonstrates that using the parameters of a categorical distribution as network inputs enables modeling discrete data with a natively continuous and differentiable generative process (see the sketch after this list).
- Achieves state-of-the-art or competitive results on dynamically binarized MNIST, CIFAR-10, and text8 benchmarks.
- Shows benefits of training on discretized data at reduced bit depths, which improves sample quality for images.
- Opens up opportunities for few-step generation and gradient-based guidance for discrete data such as text.
In summary, the key contributions are introducing the BFN modeling framework, deriving appropriate training objectives, adapting BFNs to different data types, and showing strong empirical results on generative modeling benchmarks.
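As a small illustration of the categorical-parameter contribution, the sketch below keeps a discrete variable's input parameters as a point on the probability simplex and updates them with smooth operations only, so gradients can flow through the whole generative process. The sender form N(alpha * (K * e_x - 1), alpha * K * I) and the softmax update follow the paper's discrete-data derivation; the variable names and constants are illustrative.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

K = 27                          # number of classes (e.g. characters)
theta = np.full(K, 1.0 / K)     # uniform prior: a point on the probability simplex
alpha = 0.5                     # accuracy of this observation
x = 3                           # true class index

# Noisy Gaussian observation of the one-hot target, as in the discrete-data sender.
y = np.random.normal(alpha * (K * np.eye(K)[x] - 1), np.sqrt(alpha * K))

# Bayesian update: multiply likelihood and prior, then renormalise.
# The result stays on the simplex and is differentiable with respect to theta.
theta = softmax(y + np.log(theta))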
Method Section
Here is a summary of the method section of the paper on Bayesian Flow Networks:
- The model has an input distribution that is updated with Bayesian inference and a neural network that outputs an output distribution.
- The input distribution models each variable independently with a simple prior, such as a standard normal or uniform categorical distribution.
- The neural network takes the input distribution's parameters as input and outputs the parameters of the output distribution, which can exploit context between variables.
- A sender distribution adds noise to the data according to a schedule. A receiver distribution is constructed by convolving the output distribution with the same noise distribution.
- At each step, a sample from the sender distribution is used to perform a Bayesian update of the input distribution. The loss is the KL divergence between the sender and receiver distributions.
- By taking many small steps, this creates a continuous flow of information from the data to the input parameters to the neural network outputs.
- Closed-form Bayesian update equations are provided for continuous and discrete distributions, and the additivity of the accuracies is proved.
- Accuracy schedules are derived so that the entropy of the input distribution decreases linearly over time (see the sketch after this summary).
- Loss functions are presented for continuous, discretized, and discrete data in both discrete and continuous time.
- Sampling procedures are given for generating data by iteratively running the trained model.
In summary, the key steps are constructing the sender, receiver, input, and output distributions, performing Bayesian inference on the inputs, training the neural network outputs to match the sender, and generating samples by iteratively running the trained model.
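The accuracy-schedule and additivity points above can be made concrete with the following sketch. The forms beta(t) = sigma_1**(-2*t) - 1 for continuous data and beta(t) = beta_1 * t**2 for discrete data follow the schedules described in the paper, but treat the exact expressions and the default constants here as assumptions to check against it.

import numpy as np

def beta_continuous(t, sigma_1=0.02):
    # Accuracy schedule for continuous data: the input distribution's standard
    # deviation decays as sigma_1**t, so its entropy decreases linearly in t.
    return sigma_1 ** (-2 * t) - 1.0

def beta_discrete(t, beta_1=4.0):
    # Quadratic schedule for discrete data; beta_1 is the total accuracy at t = 1.
    return beta_1 * t ** 2

def step_accuracies(beta, n_steps):
    # Because accuracies are additive, the per-step accuracies are differences
    # of the schedule at consecutive times t_i = i / n_steps.
    t = np.linspace(0.0, 1.0, n_steps + 1)
    return np.diff(beta(t))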
High-Level Pseudo Code
Here is high-level pseudocode for training and sampling from a Bayesian Flow Network:
# Training
for data in training_data:
    # Initialize input distribution parameters from the prior
    input_params = prior_params
    loss = 0
    # Run for n steps
    for i in range(n):
        t = i / n
        # Sender distribution: add noise to the data at this step's accuracy
        sender_dist = get_sender(data, accuracy[i])
        # Run input params (and time) through the network to get output params
        output_params = network(input_params, t)
        # Construct receiver distribution from the output distribution
        receiver_dist = get_receiver(output_params, accuracy[i])
        # Accumulate the loss: KL divergence between sender and receiver
        loss += kl_divergence(sender_dist, receiver_dist)
        # Draw a sample from the sender distribution
        sender_sample = sample(sender_dist)
        # Bayesian update of the input parameters using the sender sample
        input_params = bayesian_update(input_params, sender_sample, accuracy[i])
    # Optimize network parameters to minimize the accumulated loss

# Sampling
input_params = prior_params
for i in range(n):
    t = i / n
    # Run input params (and time) through the network to get output params
    output_params = network(input_params, t)
    # Sample a data point from the output distribution
    data_sample = sample(get_output_dist(output_params))
    # Construct sender distribution centered on the sampled data point
    sender_dist = get_sender(data_sample, accuracy[i])
    # Draw a sample from the sender distribution
    sender_sample = sample(sender_dist)
    # Bayesian update of the input parameters using the sender sample
    input_params = bayesian_update(input_params, sender_sample, accuracy[i])
# Return a final sample from the output distribution given the final input parameters
output_params = network(input_params, t=1)
return sample(get_output_dist(output_params))
The key steps are constructing the sender and receiver distributions, computing the loss between them, updating the input parameters with Bayesian inference, and iteratively running the trained network to generate samples. The training loop optimizes the network parameters to minimize the loss.
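The loop above implements the discrete-time (n-step) loss. For the continuous-time loss mentioned in the method summary, training instead samples a single random time and weights one reconstruction term by the rate of the accuracy schedule. Below is a minimal sketch for continuous data with a delta output distribution; network, sigma_1, and the exact weighting are assumptions for illustration rather than a definitive implementation.

import numpy as np

def continuous_time_loss_step(x, network, sigma_1=0.02, rng=np.random):
    # Sample a time uniformly in [0, 1).
    t = rng.uniform(0.0, 1.0)
    # Bayesian flow distribution for continuous data: the input mean at time t
    # is a shrunk, noisy version of the data, with gamma(t) = 1 - sigma_1**(2*t).
    gamma = 1.0 - sigma_1 ** (2 * t)
    mu = rng.normal(gamma * x, np.sqrt(gamma * (1.0 - gamma)))
    # The network predicts the data from the current input parameters and time.
    x_hat = network(mu, t)
    # Rate of the accuracy schedule beta(t) = sigma_1**(-2*t) - 1.
    alpha_rate = -2.0 * np.log(sigma_1) * sigma_1 ** (-2 * t)
    # Continuous-time loss: expected KL rate between sender and receiver.
    return 0.5 * alpha_rate * np.sum((x - x_hat) ** 2)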
Detailed Pseudo Code
Here is more detailed pseudocode for implementing Bayesian Flow Networks:
# Continuous Data

# Input distribution: independent Normal per variable, parameterised by mean and precision
input_dist = Normal(mean=input_mean, precision=input_precision)

# Prior for the input parameters (standard normal)
input_mean = 0
input_precision = 1

# Per-step accuracy from the schedule: accuracy = beta(t_i) - beta(t_{i-1})
accuracy = get_step_accuracy(i)

# Sender distribution: data plus Gaussian noise with variance 1/accuracy
sender_dist = Normal(mean=data, variance=1/accuracy)

# Output distribution: network maps the input parameters (and time) to output parameters
output_mean, output_variance = network(input_mean, input_precision, t)
output_dist = Normal(mean=output_mean, variance=output_variance)

# Receiver distribution: output distribution convolved with the sender noise
receiver_dist = convolve(output_dist, Normal(mean=0, variance=1/accuracy))

# Loss
loss = kl_divergence(sender_dist, receiver_dist)

# Sample from sender
sender_sample = sample(sender_dist)

# Bayesian update: precisions add, and the mean is a precision-weighted average
input_mean = (input_precision * input_mean + accuracy * sender_sample) / (input_precision + accuracy)
input_precision = input_precision + accuracy

# Discrete Data

# Input distribution: independent categorical per variable
input_dist = Categorical(probs=input_probs)

# Prior for the input probabilities (uniform over num_classes)
input_probs = ones(num_classes) / num_classes

# Sender distribution: noisy Gaussian observation of the one-hot data
sender_dist = Normal(mean=accuracy * (num_classes * one_hot(data) - 1),
                     variance=accuracy * num_classes * I)

# Output distribution: network maps the input probabilities (and time) to class probabilities
output_probs = softmax(network(input_probs, t))
output_dist = Categorical(probs=output_probs)

# Receiver distribution: mixture of per-class sender distributions weighted by output_probs
receiver_dist = mixture(weights=output_probs,
                        components=[Normal(mean=accuracy * (num_classes * one_hot(k) - 1),
                                           variance=accuracy * num_classes * I)
                                    for k in range(num_classes)])

# Loss
loss = kl_divergence(sender_dist, receiver_dist)

# Sample from sender
sender_sample = sample(sender_dist)

# Bayesian update: multiply by the likelihood and renormalise (stays on the simplex)
input_probs = softmax(sender_sample + log(input_probs))
The main differences are in the distributions used for continuous vs. discrete data. The overall flow of constructing sender/receiver distributions, computing the loss, and updating the input parameters is the same.
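Putting the continuous-data pieces above together, here is a runnable n-step training-loss sketch in NumPy, assuming a delta output distribution so the per-step KL reduces to a weighted squared error; network(mu, rho, t) is a hypothetical predictor supplied by the caller, and sigma_1 is an assumed example value.

import numpy as np

def n_step_loss_continuous(x, network, n_steps=20, sigma_1=0.02, rng=np.random):
    # Discrete-time BFN loss for continuous data with a delta output distribution.
    beta = lambda t: sigma_1 ** (-2 * t) - 1.0   # accuracy schedule
    mu, rho = np.zeros_like(x), 1.0              # standard-normal prior per variable
    loss = 0.0
    for i in range(1, n_steps + 1):
        t_prev, t = (i - 1) / n_steps, i / n_steps
        alpha = beta(t) - beta(t_prev)           # per-step accuracy (additive)
        x_hat = network(mu, rho, t_prev)         # output prediction from current inputs
        # KL(sender || receiver) for equal-variance Gaussians centred on x and x_hat.
        loss += 0.5 * alpha * np.sum((x - x_hat) ** 2)
        # Sample from the sender and apply the Bayesian update to the input parameters.
        y = rng.normal(x, 1.0 / np.sqrt(alpha))
        mu = (rho * mu + alpha * y) / (rho + alpha)
        rho = rho + alpha
    return loss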