TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Authors: Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel

Abstract: The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: diffusion-tokenflow.github.io

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper proposes a method called TokenFlow for consistent video editing using a text-to-image diffusion model.

Why:

  • Current text-to-video models still lag behind image models in quality and controllability. Using an image diffusion model for video editing leads to temporal inconsistencies when editing each frame independently.

How:

  • The key idea is to enforce consistency in the diffusion feature space based on inter-frame correspondences from the original video.

  • They observe that diffusion features exhibit the same cross-frame redundancy as the RGB frames themselves: corresponding regions in nearby frames have highly similar features (an illustrative sketch follows this list).

  • Their method alternates between:

    1. Jointly editing sampled keyframes using an image editing method.
    2. Propagating the edited diffusion features to all frames based on correspondences from the original video.
  • This constrains the edited features to share the same inter-frame redundancies as the original features, which yields a temporally consistent edit.

  • They don’t require any training or finetuning of the image diffusion model.

  • They demonstrate state-of-the-art editing results on real-world videos, outperforming competing methods.
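
To make the redundancy observation concrete, here is a minimal illustrative sketch (not the paper's exact analysis protocol) that matches each diffusion token of a frame to its nearest neighbor in the previous frame and reports the mean cosine similarity; values near 1 indicate the cross-frame redundancy the method exploits. The shapes and the function name are assumptions for illustration.

import torch
import torch.nn.functional as F

def feature_redundancy(feat_prev, feat_curr):
  # feat_prev, feat_curr: (num_tokens, dim) diffusion features of two consecutive frames
  a = F.normalize(feat_prev, dim=-1)
  b = F.normalize(feat_curr, dim=-1)
  sim = b @ a.T                      # cosine similarity between every token pair
  nn_sim = sim.max(dim=-1).values    # best match in the previous frame for each token
  return nn_sim.mean().item()        # values near 1.0 => highly redundant features

# Example with random stand-in tensors (real features come from the diffusion UNet):
print(feature_redundancy(torch.randn(4096, 640), torch.randn(4096, 640)))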

Main Contributions

Here are the main contributions of this paper:

  • A technique called TokenFlow that enforces semantic correspondences of diffusion features across frames to increase temporal consistency when editing videos using a text-to-image diffusion model.

  • A novel empirical analysis of diffusion features across video frames, showing that they capture redundant information much like the RGB frames themselves.

  • State-of-the-art editing results on diverse real-world videos depicting complex motions. The edited videos adhere to the target text prompt while maintaining the original video layout and motion.

  • Their method works with any off-the-shelf text-to-image editing technique and does not require finetuning the diffusion model.

  • Comparisons to other recent video editing methods show improved consistency and fidelity to the edit guidance.

  • Their work provides insights into how diffusion models represent videos internally, and how this can be leveraged for consistent video editing and synthesis.

Method Section

Here is a summary of the method section from the paper:

  • Given an input video and target text prompt, the goal is to edit the video according to the text while preserving the original layout and motion.

  • They use a pretrained text-to-image diffusion model (Stable Diffusion) without finetuning.

  • They first invert the video frames using DDIM inversion to obtain each frame's noisy latents at every diffusion timestep (a sketch follows this list).

  • They extract and store the diffusion features (tokens) from the self-attention layers of the denoising UNet for each frame and timestep (see the token-caching sketch after this list).

  • Their method alternates between:

  1. Keyframe Editing:

    • Sample a set of keyframes and edit them jointly using an image editing method with extended attention.
    • This makes the keyframes share a common appearance.
  2. Propagation via TokenFlow:

    • Propagate the edited keyframe tokens to all frames based on nearest-neighbor correspondences from the original video tokens.
    • This preserves the original consistency and redundancies.
  • For propagation, they compute nearest neighbors between each frame’s tokens and its adjacent keyframes’ tokens.

  • They then linearly combine the edited keyframe tokens based on the nearest neighbors to propagate the edit.

  • This process is repeated for each diffusion timestep to generate the final edited frames.

  • It allows leveraging an image diffusion model for video editing without any training or finetuning.
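
A minimal sketch of the DDIM inversion mentioned above, assuming a pretrained noise predictor eps_model(x, t, text_emb) and its alphas_cumprod schedule (hypothetical handles, e.g. taken from a Stable Diffusion pipeline). It runs the deterministic DDIM update in reverse and stores the noisy latent at every timestep for later editing.

import torch

@torch.no_grad()
def ddim_invert(latent, eps_model, text_emb, alphas_cumprod, timesteps):
  # latent: clean VAE latent of one frame; timesteps: increasing list of diffusion steps
  # alphas_cumprod: 1-D tensor of cumulative alphas indexed by timestep
  trajectory = [latent]
  x = latent
  for i in range(len(timesteps) - 1):
    t, t_next = timesteps[i], timesteps[i + 1]
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    eps = eps_model(x, t, text_emb)                      # predicted noise at the current step
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean latent
    x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps   # deterministic DDIM step toward noise
    trajectory.append(x)
  return trajectory                                      # trajectory[-1] is the fully noised latent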
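
A sketch of the token caching mentioned above, using PyTorch forward hooks. The "attn1" name filter matches the self-attention modules in the common diffusers Stable Diffusion UNet layout, but treat the filter and shapes as assumptions for your specific model.

import torch

def register_token_hooks(unet, cache):
  # cache: dict mapping layer name -> list of token tensors, one entry per forward pass
  handles = []
  for name, module in unet.named_modules():
    if name.endswith("attn1"):  # self-attention layers of the UNet blocks
      def hook(_module, _inputs, output, layer_name=name):
        # output: (batch, num_tokens, dim) -- one token per spatial location
        cache.setdefault(layer_name, []).append(output.detach().cpu())
      handles.append(module.register_forward_hook(hook))
  return handles  # call handle.remove() on each when done

# Usage sketch: run the UNet on each frame's latent at every timestep, then read `cache`.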

High-Level Pseudo Code

Here is the high-level pseudo code for the method described in the paper:

# Given input video frames I[1..n] and target text prompt P

# Invert video frames to get noisy latents x[i, t] for every frame i and timestep t
x = DDIM_Inversion(I)

# Extract and cache the diffusion tokens (self-attention features) of all frames
tokens = Get_Tokens(x)

# Generate the edited video by iterating over the diffusion timesteps (noisy to clean)
J = x[:, T]
for t in reversed(range(T)):

  # Sample keyframes (re-sampled at every timestep)
  k = Sample_Keyframes(n)

  # Denoise the keyframes jointly so they share a common appearance
  J_base = Image_Edit(J[k], t, P, Extended_Attention)

  # Tokens of the jointly edited keyframes
  T_base = Get_Tokens(J_base)

  # Nearest-neighbor correspondences between each frame's original tokens
  # and the original tokens of its adjacent keyframes
  NNs = Token_NN(tokens[:, t], tokens[k, t])

  # Denoise all frames, replacing their tokens with the propagated keyframe tokens
  J = Image_Edit(J, t, P, TokenFlow(T_base, NNs))

# Return the edited video
return J

Where:

  • DDIM_Inversion: Inverts each frame into its noisy latents at every diffusion timestep
  • Get_Tokens: Extracts the diffusion features (self-attention tokens) from the denoising UNet
  • Image_Edit: A denoising step of an off-the-shelf image editing method such as PnP Diffusion
  • Extended_Attention: Lets the sampled keyframes attend to one another so they are edited jointly (see the sketch below)
  • TokenFlow: Propagates the edited keyframe tokens to all frames using the original video's correspondences
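
A sketch of the extended attention used for joint keyframe editing: each keyframe's self-attention queries attend to the concatenated keys and values of all sampled keyframes, which pushes them toward a shared appearance. The tensor shapes and the use of scaled_dot_product_attention are assumptions of a typical implementation, not the authors' exact code.

import torch
import torch.nn.functional as F

def extended_self_attention(q, k, v):
  # q, k, v: (num_keyframes, num_tokens, dim) projections from one self-attention layer
  n_frames, n_tokens, dim = q.shape
  # Concatenate keys/values of all keyframes along the token axis ...
  k_ext = k.reshape(1, n_frames * n_tokens, dim).expand(n_frames, -1, -1)
  v_ext = v.reshape(1, n_frames * n_tokens, dim).expand(n_frames, -1, -1)
  # ... so every keyframe attends jointly to every keyframe (including itself)
  return F.scaled_dot_product_attention(q, k_ext, v_ext)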

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the video editing method from the paper:

# Input
I = [I1, I2, ..., In]   # Input video frames
P = "text prompt"       # Target edit text
n = len(I)              # Number of frames
T = 50                  # Number of diffusion timesteps

# Preprocess - invert each frame to get its noisy latents at all timesteps
x = []
for i in range(n):
  x_i = DDIM_Inversion(I[i], T=T)   # x_i[t] = latent of frame i at timestep t
  x.append(x_i)
 
# Preprocess - Extract and store tokens
tokens = []
for i in range(n):
  phi_i = []
  for t in range(T):
    phi_it = Get_Tokens(x[i][t]) 
    phi_i.append(phi_it)
  tokens.append(phi_i)
 
# Generate the edited video, denoising from the last timestep down to the first
J = [x[i][-1] for i in range(n)]    # start from the fully noised latents
for t in reversed(range(T)):

  # Sample keyframes at a fixed interval (offset re-sampled at every timestep)
  k = Sample_Keyframes(n, stride=8)

  # Jointly denoise the keyframes with extended attention and keep their tokens
  T_base = {}
  for j in k:
    J[j] = Image_Edit(J[j], t, P, Extended_Attention)
    T_base[j] = Get_Tokens(J[j])

  # Correspondences between each frame and its surrounding keyframes,
  # computed on the ORIGINAL video tokens
  gammas = {}
  for i in range(n):
    ip, im = Find_Surrounding_Keyframes(i, k)   # next / previous keyframe indices
    gammas[i] = Token_NN(tokens[i][t], tokens[ip][t], tokens[im][t])

  # Denoise the remaining frames with the propagated keyframe tokens
  for i in range(n):
    if i not in k:
      J[i] = Image_Edit(J[i], t, P, TokenFlow(T_base, gammas[i]))

return J  # Edited video

Where the main steps are:

  1. Invert frames to get latents
  2. Extract and store tokens from all frames
  3. For each timestep t:
    • Sample keyframes
    • Edit keyframes with extended attention
    • Compute token nearest neighbors between each frame and its surrounding keyframes (on the original tokens)
    • Propagate the edited keyframe tokens to the remaining frames using these correspondences (see the sketch below)
  4. Return edited video
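
A sketch of the nearest-neighbor matching and propagation steps (Token_NN and TokenFlow above) under assumed shapes; the function names and the scalar blending weight are illustrative, not the authors' API. Correspondences are computed on the original video's tokens, and the edited keyframe tokens are then linearly blended at the matched positions.

import torch
import torch.nn.functional as F

def token_nn(src_tokens, key_tokens):
  # src_tokens: (N, d) original tokens of frame i; key_tokens: (M, d) original tokens of a keyframe
  sim = F.normalize(src_tokens, dim=-1) @ F.normalize(key_tokens, dim=-1).T
  return sim.argmax(dim=-1)            # index of the nearest keyframe token for each source token

def tokenflow_propagate(orig_i, orig_prev, orig_next, edit_prev, edit_next, w):
  # orig_*: ORIGINAL tokens of frame i and of its previous/next keyframes, each (tokens, d)
  # edit_*: EDITED tokens of the same keyframes, each (tokens, d)
  # w: blending weight in [0, 1], larger when frame i is temporally closer to the previous keyframe
  idx_prev = token_nn(orig_i, orig_prev)   # correspondences taken from the source video
  idx_next = token_nn(orig_i, orig_next)
  # Linearly combine the edited keyframe tokens at the matched positions
  return w * edit_prev[idx_prev] + (1.0 - w) * edit_next[idx_next]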