TokenFlow: Consistent Diffusion Features for Consistent Video Editing
Authors: Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel
Abstract: The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: diffusion-tokenflow.github.io
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a method called TokenFlow for consistent video editing using a text-to-image diffusion model.
Why:
- Current text-to-video models still lag behind image models in quality and controllability. Using an image diffusion model for video editing leads to temporal inconsistencies when editing each frame independently.
How:
- The key idea is to enforce consistency in the diffusion feature space, based on inter-frame correspondences extracted from the original video.
- They observe that diffusion features, like the RGB frames themselves, capture highly redundant information across frames (see the correspondence sketch after this list).
- Their method alternates between:
  - Jointly editing sampled keyframes using an image editing method.
  - Propagating the edited diffusion features to all frames based on correspondences from the original video.
- This forces the edited features to share the same cross-frame redundancies as the original, yielding a temporally consistent result.
- The method requires no training or fine-tuning of the image diffusion model.
- They demonstrate state-of-the-art editing results on real videos compared to other methods.
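To make the redundancy observation concrete, below is a minimal sketch of one simple way to expose inter-frame correspondences from diffusion features: a nearest-neighbor search between the token maps of two frames. The [N, C] tensor layout and the cosine-similarity metric are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def nearest_neighbor_field(tokens_src, tokens_tgt):
    """For every token of a source frame, find the index of its nearest
    neighbor among a target frame's tokens.

    tokens_src, tokens_tgt: [N, C] self-attention features per frame
    (N = H*W spatial locations, C = channels) -- assumed layout.
    Returns: [N] tensor of indices into tokens_tgt.
    """
    src = F.normalize(tokens_src, dim=-1)    # cosine similarity
    tgt = F.normalize(tokens_tgt, dim=-1)
    sim = src @ tgt.T                        # [N, N] similarity matrix
    return sim.argmax(dim=-1)                # NN index per source token

Because features of corresponding locations in nearby frames are highly similar, such a correspondence field is what the propagation step later relies on.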
Main Contributions
Here are the main contributions of this paper:
- A technique called TokenFlow that enforces semantic correspondences of diffusion features across frames to increase temporal consistency when editing videos using a text-to-image diffusion model.
- A novel empirical analysis studying the properties of diffusion features across the frames of a video, showing they capture redundant information similar to the RGB frames.
- State-of-the-art editing results on diverse real-world videos depicting complex motions; the edited videos adhere to the target text prompt while maintaining the original video layout and motion.
- The method works with any off-the-shelf text-to-image editing technique and does not require fine-tuning the diffusion model.
- Comparisons to other recent video editing methods show improved consistency and fidelity to the edit guidance.
- The work provides insights into how diffusion models represent videos internally, and how this can be leveraged for consistent video editing and synthesis.
Method Section
Here is a summary of the method section from the paper:
- Given an input video and a target text prompt, the goal is to edit the video according to the text while preserving the original layout and motion.
- They use a pretrained text-to-image diffusion model (Stable Diffusion) without fine-tuning.
- They first invert the video frames using DDIM to obtain their latent representations.
- They extract and store the diffusion features (tokens) from the self-attention layers of each frame.
- Their method alternates between:
  - Keyframe Editing:
    - Sample keyframes and edit them jointly using an image editing method with extended attention (see the sketch after this list).
    - This makes the keyframes share a common appearance.
  - Propagation via TokenFlow:
    - Propagate the edited keyframe tokens to all frames based on nearest-neighbor correspondences computed from the original video tokens.
    - This preserves the original consistency and redundancies.
- For propagation, they compute nearest neighbors between each frame's tokens and the tokens of its two adjacent keyframes.
- They then linearly combine the edited keyframe tokens according to these nearest neighbors to propagate the edit.
- This process is repeated at every diffusion timestep to generate the final edited frames.
- This allows leveraging an image diffusion model for video editing without any training or fine-tuning.
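The "extended attention" used for joint keyframe editing is essentially a self-attention layer whose keys and values are shared across all sampled keyframes, so every keyframe can attend to the features of the others. Below is a minimal sketch of that idea; the to_q/to_k/to_v/scale names and the single-head formulation are the usual attention-layer conventions and are assumptions here, not the paper's exact code.

import torch

def extended_self_attention(hidden_states, to_q, to_k, to_v, scale):
    """Self-attention over keyframes where each frame's queries attend to
    the keys/values of ALL keyframes, keeping the edit coherent across them.

    hidden_states: [B, N, C] features of B keyframes with N tokens each.
    to_q, to_k, to_v: the layer's linear projections (assumed names).
    scale: attention scale, typically 1/sqrt(head_dim).
    Multi-head splitting is omitted for brevity.
    """
    B, N, C = hidden_states.shape
    q = to_q(hidden_states)                        # [B, N, C]
    k = to_k(hidden_states).reshape(1, B * N, C)   # keys shared across frames
    v = to_v(hidden_states).reshape(1, B * N, C)   # values shared across frames
    attn = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)  # [B, N, B*N]
    return attn @ v                                # [B, N, C]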
High-Level Pseudo Code
Here is the high-level pseudo code for the method described in the paper:
# Given input video frames I[1..n] and text prompt P

# Invert video frames to get latents x[i, t] for every frame i and timestep t
x = DDIM_Inversion(I)

# Extract diffusion tokens (self-attention features) from all frames
tokens = Get_Tokens(x)

# Generate edited video, one denoising step at a time
for t in range(T):
    # Sample keyframes for this denoising step
    k = Sample_Keyframes(I)

    # Edit keyframes jointly (extended attention keeps them coherent)
    J_base = Image_Edit(I[k], P, Extended_Attention)

    # Get the edited keyframe tokens
    T_base = Get_Tokens(J_base)

    # Compute token nearest neighbors between every frame and the keyframes
    NNs = Token_NN(tokens, tokens[k])

    # Propagate the edited keyframe tokens to all frames using the NNs
    J = Image_Edit(I, P, TokenFlow(T_base, NNs))

# Return edited video
return J
Where:
- DDIM_Inversion: gets the latent video representations
- Get_Tokens: extracts the diffusion features
- Image_Edit: an image editing method such as Plug-and-Play (PnP) Diffusion
- Extended_Attention: edits multiple frames jointly
- TokenFlow: propagates tokens using correspondences
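For completeness, here is a minimal sketch of what DDIM_Inversion does: it runs the deterministic DDIM update in reverse, storing the per-timestep latents that are later used for token extraction. The unet call signature, scheduler alphas, and conditioning below are placeholders for a Stable-Diffusion-like setup, not the exact implementation used in the paper.

import torch

@torch.no_grad()
def ddim_inversion(z0, unet, alphas_cumprod, timesteps, cond):
    """Deterministic DDIM inversion: map a clean latent z0 to progressively
    noisier latents z_t, storing each one (these play the role of x[i][t]
    in the pseudo code above).
    """
    latents = {0: z0}
    z = z0
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):  # low noise -> high noise
        eps = unet(z, t_prev, cond)                        # predicted noise
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        # Recover the predicted clean latent, then re-noise it to level t
        z0_pred = (z - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        z = a_t.sqrt() * z0_pred + (1 - a_t).sqrt() * eps
        latents[int(t)] = z
    return latents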
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the video editing method from the paper:
# Input
I = [I1, I2, ..., In]   # Input video frames
P = "text prompt"       # Target edit text

# Preprocess - get latent representations via DDIM inversion
x = []
for i in range(n):
    x_i = DDIM_Inversion(I[i], T=50)   # latents of frame i at all T timesteps
    x.append(x_i)

# Preprocess - extract and store the diffusion tokens of every frame and timestep
tokens = []
for i in range(n):
    phi_i = []
    for t in range(T):
        phi_it = Get_Tokens(x[i][t])
        phi_i.append(phi_it)
    tokens.append(phi_i)

# Generate the edited video, one denoising step at a time
J = [None] * n
for t in range(T):
    # Sample keyframes for this denoising step
    k = Sample_Keyframes(I, stride=8)

    # Edit the keyframes jointly and collect their edited tokens
    T_base = []
    for j in k:
        J[j] = Image_Edit(I[j], P, Extended_Attention)
        T_base.append(Get_Tokens(J[j]))

    # Compute propagation correspondences from the ORIGINAL video tokens
    gammas = {}
    for i in range(n):
        ip, im = Find_Surrounding_Keyframes(i, k)   # next / previous keyframe of frame i
        gammas[i] = Token_NN(tokens[i][t], tokens[ip][t], tokens[im][t])

    # Propagate the edited keyframe tokens to the remaining frames
    for i in range(n):
        if i not in k:   # keyframes were already edited above
            J[i] = Image_Edit(I[i], P, TokenFlow(T_base, gammas[i]))

return J   # Edited video
Where the main steps are:
- Invert frames to get latents
- Extract and store tokens from all frames
- For each timestep t:
  - Sample keyframes
  - Edit keyframes jointly with extended attention
  - Compute token NNs between each frame and its surrounding keyframes
  - Propagate the edited keyframe tokens using the NNs (a sketch of this TokenFlow step follows below)
- Return the edited video
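To make the propagation step concrete, here is a minimal sketch of what Find_Surrounding_Keyframes and TokenFlow could look like: each frame's original tokens are matched (nearest neighbor) against the original tokens of its previous and next keyframe, and its edited tokens are then a linear blend of the edited keyframe tokens at those matched locations. The cosine-similarity matching and the distance-based weighting scheme are illustrative assumptions rather than the authors' exact code.

import torch
import torch.nn.functional as F

def find_surrounding_keyframes(i, keyframes):
    """Return the previous and next keyframe index of frame i
    (keyframes assumed sorted; clamped at the boundaries)."""
    prev_k = max([k for k in keyframes if k <= i], default=keyframes[0])
    next_k = min([k for k in keyframes if k >= i], default=keyframes[-1])
    return prev_k, next_k

def tokenflow_propagate(orig_i, orig_prev, orig_next, edit_prev, edit_next, w):
    """Blend edited keyframe tokens into frame i.

    orig_*: [N, C] ORIGINAL-video tokens of frame i and its two keyframes.
    edit_*: [N, C] EDITED tokens of the same two keyframes.
    w: scalar in [0, 1], e.g. the relative temporal distance of frame i
       from the previous keyframe (assumed weighting scheme).
    """
    def nn_gather(src, tgt_orig, tgt_edit):
        # nearest neighbor of each source token among the keyframe's original tokens
        sim = F.normalize(src, dim=-1) @ F.normalize(tgt_orig, dim=-1).T
        idx = sim.argmax(dim=-1)          # [N] correspondence indices
        return tgt_edit[idx]              # edited tokens at the matched locations

    from_prev = nn_gather(orig_i, orig_prev, edit_prev)
    from_next = nn_gather(orig_i, orig_next, edit_next)
    return (1 - w) * from_prev + w * from_next   # linear combination

Because the correspondences are computed on the original video's tokens, the blended edited tokens inherit the original cross-frame redundancies, which is what keeps the edited video temporally consistent.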