OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Authors: Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt
Abstract: We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind’s Flamingo models. Across seven vision-language datasets, OpenFlamingo models attain between 80% and 89% of the corresponding Flamingo models' performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at open_flamingo.
What, Why and How
Here is a summary of the key points from the paper:
What:
- This paper studies the convergence properties of trans-dimensional MCMC algorithms, which are MCMC methods for sampling from distributions defined on spaces made up of subsets of differing dimensions (e.g., for Bayesian model selection).
- The main result shows that, under certain conditions, if the within-model MCMC chains are geometrically ergodic, then the overall trans-dimensional chain is also geometrically ergodic in an L2 sense (see the statement just below).
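For reference, L2 geometric ergodicity of a kernel T with stationary distribution ω is usually stated as follows (a standard formulation given here for orientation; the paper's exact statement may differ): there exists $\rho < 1$ such that $\|T^n f - \omega(f)\|_{L^2(\omega)} \le \rho^n \|f - \omega(f)\|_{L^2(\omega)}$ for all $f \in L^2(\omega)$ and all $n \ge 1$.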
Why:
- Understanding the convergence properties of trans-dimensional MCMC algorithms is important for justifying their use and for quantifying uncertainty in the resulting estimates.
- Geometric ergodicity is a strong form of convergence that guarantees a central limit theorem for ergodic averages (stated below). This allows Monte Carlo error to be assessed and valid confidence intervals to be constructed.
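Concretely, for an ergodic average $\bar{f}_n = n^{-1} \sum_{t=1}^{n} f(X_t)$, the CLT asserts that $\sqrt{n}(\bar{f}_n - \omega(f))$ converges in distribution to $N(0, \sigma_f^2)$ for a suitable asymptotic variance $\sigma_f^2$ (standard form, given here for orientation).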
How:
- The main tool is an extension of the Markov chain decomposition technique, which allows the within-model and between-model dynamics to be analyzed separately.
- Drift and minorization conditions are leveraged to establish geometric ergodicity of the within-model chains.
- The global (between-model) behavior is controlled by a connectivity condition: transitions between all pairs of models must be possible within some number of steps.
- The theory is applied to analyze two reversible jump MCMC algorithms, and simultaneous confidence intervals are constructed to assess Monte Carlo error.
In summary, this paper provides verifiable conditions for an important class of MCMC algorithms to achieve strong convergence guarantees needed for uncertainty quantification. The key innovation is using an extended version of Markov chain decomposition.
Main Contributions
Here are the main contributions of this paper:
- Establishes verifiable sufficient conditions for trans-dimensional MCMC algorithms to be geometrically ergodic when the number of models is finite. This is done by extending the Markov chain decomposition technique to certain non-reversible chains.
- Provides a quantitative bound on the convergence rate in terms of the within-model and between-model dynamics.
- Shows that geometric ergodicity guarantees a central limit theorem for ergodic averages even without Harris recurrence. This allows Monte Carlo error to be assessed for general initial distributions.
- Studies the construction of valid simultaneous confidence intervals based on trans-dimensional MCMC output, handling the case where the asymptotic covariance matrix is singular.
- Applies the theory and methods to analyze two practical reversible jump MCMC algorithms: a robust autoregression model and a Bayesian probit model with variable selection.
- For each algorithm, establishes geometric ergodicity and constructs simultaneous confidence intervals to quantify uncertainty in the Monte Carlo estimates.
In summary, the key contributions are:
- Sufficient conditions for geometric ergodicity of trans-dimensional MCMC
- A quantitative convergence rate bound
- Justification of the CLT and confidence intervals under geometric ergodicity
- Methodology for confidence intervals when the asymptotic covariance is singular
- Applications to two Bayesian models: convergence analysis and uncertainty quantification
Method Section
Here is a summary of the method section of the paper:
The key tool used in the paper is an extension of the Markov chain decomposition technique. This allows analyzing the within- and between-model dynamics of a trans-dimensional Markov chain separately.
Specifically, the state space is partitioned into disjoint subsets corresponding to the models. The within-model movements are governed by kernels T_k that leave the restricted distributions ω_k invariant.
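In symbols (matching the detailed pseudo code later in this summary): $\omega_k(A) = \omega(A \cap Y_k) / \omega(Y_k)$ for measurable $A \subseteq Y_k$, and each model k receives weight $\bar{\omega}(\{k\}) = \omega(Y_k)$.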
The global (between-model) dynamic is captured by an auxiliary kernel $\bar{S}$ on the discrete space of models. This kernel is reversible and positive semi-definite.
Using this decomposition, the paper shows that if the within-model kernels T_k have norms bounded away from 1 and the connectivity matrix based on the auxiliary kernel $\bar{S}$ is irreducible, then the overall trans-dimensional chain is geometrically ergodic.
A quantitative bound on the convergence rate is also provided in terms of the within- and between-model dynamics.
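Concretely, if $T(y, A) \ge c\, T_k(y, A \cap Y_k)$ for all $y \in Y_k$ and measurable $A$, the bound (restated in the pseudo code below) reads $1 - \|TS^*\|^2 \ge c^2 (1 - \max_k \|T_k\|^2)(1 - \|\bar{S}\|)$.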
For the within-model chains, drift and minorization conditions are established to prove geometric ergodicity.
The global dynamic satisfies the connectivity condition if transitions between all pairs of models are possible within some number of steps.
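As a quick illustration, this connectivity condition can be checked numerically when the number of models is finite. The sketch below is our own helper, not from the paper; it tests irreducibility of a between-model transition matrix via reachability:

import numpy as np

def models_connected(S_bar):
    # S_bar: K x K between-model transition matrix; entry (k, k') is the
    # probability of moving from model k to model k'.
    # The model chain is irreducible iff every model is reachable from every
    # other, which holds iff (I + A)^(K-1) is entrywise positive, where A is
    # the 0/1 adjacency matrix of positive transitions.
    K = S_bar.shape[0]
    A = (S_bar > 0).astype(float)
    reach = np.linalg.matrix_power(np.eye(K) + A, K - 1)
    return bool(np.all(reach > 0))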
To construct confidence intervals, random noise is added to the Monte Carlo estimator to avoid problems with a singular asymptotic covariance matrix. Valid simultaneous intervals are obtained using quantiles of a multivariate normal distribution.
In summary, the key methodological innovation is the extension of Markov chain decomposition to analyze trans-dimensional MCMC algorithms by separating the local and global behavior. This allows leveraging existing tools to establish geometric ergodicity.
High-Level Pseudo Code
Here is high-level pseudo code for the main methodological contributions of the paper:
# Markov Chain Decomposition
# T = trans-dimensional MCMC transition kernel
# T_k = within-model MCMC kernel for model k
# S = arbitrary MCMC kernel with same stationary distribution
Decompose state space into disjoint subsets Y_k corresponding to models
Define auxiliary kernel $\bar{S}$ on discrete space of models
$\bar{S}(k, \{k'\})$ = average probability of moving from k to k' in two steps of S
$\bar{S}$ is reversible and positive semi-definite
Relate norm of TS* to T_k and $\bar{S}$:
if T(y,A) >= c T_k(y, A∩Y_k) for y in Y_k, A in sigma-algebra:
1 - ||TS*||^2 >= c^2 (1 - max_k ||T_k||^2) (1 - ||\bar{S}||)
Thus if max_k ||T_k|| < 1 and $\bar{S}$ is irreducible and aperiodic, T is geometrically ergodic.
# Establishing geometric ergodicity of within-model chain
Use drift and minorization:
Find Lyapunov function V(z) s.t. E[V(z')] <= λ V(z) + L
for z' ~ T_k(z,.) and λ < 1
Show minorization condition: ∃ nonzero measure ν s.t.
T_k(z,.) >= ν(.) when V(z) <= d
Then T_k is geometrically ergodic
# Confidence intervals with singular covariance
Add small random noise to get non-singular covariance:
$\hat{H}_n = H_n + ε G_n$
$G_n$ = average of n iid N(0, V_*) random vectors
$\hat{H}_n$ is asymptotically normal with covariance V + ε^2 V_*
Use quantiles of multivariate normal as critical values
This covers the main methodological ideas, including Markov chain decomposition, verifying geometric ergodicity, and constructing valid confidence intervals when the asymptotic covariance matrix is singular.
Detailed Pseudo Code
Here is more detailed pseudo code to implement the main methods in the paper:
# Markov Chain Decomposition
def decompose(T, S, Y_k):
    # T = trans-dimensional MCMC kernel
    # S = arbitrary MCMC kernel with the same stationary distribution
    # Y_k = disjoint subsets of the state space corresponding to models
    K = set of all models
    for k in K:
        ω_k = normalized restriction of the stationary dist to Y_k
        T_k(y, A) = T(y, A ∩ Y_k) for y in Y_k, A in sigma-algebra
        # T_k leaves ω_k invariant
    # On the discrete space of models, define the model weights
    ω̄({k}) = stationary dist(Y_k)
    # and the auxiliary kernel
    S̄(k, {k'}) = avg prob of moving from k to k' in two steps of S
    # S̄ is reversible w.r.t. ω̄
    c = lower bound s.t. T(y, A) >= c T_k(y, A ∩ Y_k)
    return 1 - ||TS*||^2 >= c^2 (1 - max_k ||T_k||^2)(1 - ||S̄||)
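For intuition, here is a small runnable sketch of evaluating the right-hand side of this bound for finite-state reversible kernels, where the L2 norm on mean-zero functions is the second-largest absolute eigenvalue. The kernels and the overlap constant c are made-up toy values, not from the paper:

import numpy as np

def l2_norm(P, pi):
    # Norm of a pi-reversible kernel P on mean-zero functions in L2(pi):
    # the second-largest absolute eigenvalue of the symmetrization
    # D^{1/2} P D^{-1/2}, where D = diag(pi).
    d = np.sqrt(pi)
    S = d[:, None] * P / d[None, :]
    return np.sort(np.abs(np.linalg.eigvalsh(S)))[-2]

# Toy within-model kernels and a toy between-model kernel
T1 = np.array([[0.6, 0.4], [0.4, 0.6]]); pi1 = np.array([0.5, 0.5])
T2 = np.array([[0.7, 0.3], [0.6, 0.4]]); pi2 = np.array([2/3, 1/3])
S_bar = np.array([[0.8, 0.2], [0.3, 0.7]]); pi_bar = np.array([0.6, 0.4])

c = 0.5  # assumed overlap constant: T(y, A) >= c * T_k(y, A ∩ Y_k)
bound = (c**2 * (1 - max(l2_norm(T1, pi1), l2_norm(T2, pi2))**2)
              * (1 - l2_norm(S_bar, pi_bar)))
print(bound)  # lower bound on 1 - ||T S*||^2; 0.12 for these toy values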
# Establish geometric ergodicity of T_k
def is_geom_ergodic(T_k, ω_k):
    find a Lyapunov function V and constants λ < 1, L such that:
        E[V(z')] <= λ V(z) + L for z' ~ T_k(z, .)
    find a small set C = {z : V(z) <= d} such that:
        T_k(z, .) >= ν(.) for some nonzero measure ν whenever z in C
    return True  # T_k is geometrically ergodic
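To make the drift step concrete, here is a runnable check of the drift inequality for a toy AR(1) within-model kernel (our stand-in example, not one of the paper's samplers), using V(z) = 1 + z^2, for which λ = ρ^2 and L = 1 - ρ^2 + σ^2 hold exactly:

import numpy as np

def check_drift(rho=0.5, sigma=1.0, n_draws=200_000, seed=0):
    # Toy kernel: z' = rho * z + sigma * eps with eps ~ N(0, 1), so that
    # E[V(z')] = 1 + rho^2 z^2 + sigma^2 = lam * V(z) + L exactly.
    rng = np.random.default_rng(seed)
    lam, L = rho**2, 1 - rho**2 + sigma**2
    for z in [-5.0, -1.0, 0.0, 2.0, 10.0]:
        z_next = rho * z + sigma * rng.standard_normal(n_draws)
        lhs = np.mean(1 + z_next**2)       # Monte Carlo estimate of E[V(z')]
        rhs = lam * (1 + z**2) + L
        print(f"z={z:6.1f}  E[V(z')]~{lhs:8.3f}  lam*V(z)+L={rhs:8.3f}")

check_drift()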
# Confidence intervals with singular covariance
def singular_cis(H_n, V_n, alpha):
    # H_n = Monte Carlo estimator, V_n = estimate of the asymptotic covariance
    m = length(H_n)
    generate iid N(0, V_*) vectors Y_1, ..., Y_n
    G_n = n^-1 sum_{i=1}^n Y_i
    ε = small constant > 0
    ξ_n = solve(multivar_normal_cdf(Rect(-ξ, ξ), 0, V_n + ε^2 V_*) = 1 - alpha)
    for i = 1 to m:
        CI_i = [H_{n,i} + ε G_{n,i} - ξ_n sqrt(V_{n,ii}/n),
                H_{n,i} + ε G_{n,i} + ξ_n sqrt(V_{n,ii}/n)]
    return CI_1, ..., CI_m
Where multivar_normal_cdf computes the CDF of the specified multivariate normal distribution over the rectangular region.
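A runnable version of this sketch might look as follows. Two assumptions here are ours: V_* is taken to be the identity matrix, and the critical value ξ is obtained by simulating the studentized maximum rather than solving a multivariate normal CDF equation in closed form:

import numpy as np

def singular_cis_mc(H_n, V_n, n, alpha=0.05, eps=0.01, n_mc=100_000, seed=0):
    # H_n: length-m vector of Monte Carlo estimates
    # V_n: m x m estimated asymptotic covariance (possibly singular)
    # n:   number of MCMC samples behind H_n
    rng = np.random.default_rng(seed)
    m = len(H_n)
    V_star = np.eye(m)                    # assumed noise covariance V_*
    # Average of n iid N(0, V_star) vectors has law N(0, V_star / n)
    G_n = rng.multivariate_normal(np.zeros(m), V_star / n)
    Sigma = V_n + eps**2 * V_star         # non-singular by construction
    sd = np.sqrt(np.diag(Sigma))
    # (1 - alpha) quantile of max_i |Z_i| / sd_i with Z ~ N(0, Sigma)
    Z = rng.multivariate_normal(np.zeros(m), Sigma, size=n_mc)
    xi = np.quantile(np.max(np.abs(Z) / sd, axis=1), 1 - alpha)
    center = H_n + eps * G_n
    half = xi * sd / np.sqrt(n)
    return np.column_stack([center - half, center + half])

Replacing the simulated quantile with an exact rectangle-probability solve would match the solve(...) step in the pseudo code more literally.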
This implements the main algorithms in more detail, including explicitly estimating the convergence rate bound, verifying drift and minorization, and constructing confidence intervals by adding noise.