TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Authors: Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel

Abstract: The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: diffusion-tokenflow.github.io

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper proposes a method called TokenFlow for consistent video editing using a text-to-image diffusion model.

Why:

  • Current text-to-video models still lag behind image models in quality and controllability. Using an image diffusion model for video editing leads to temporal inconsistencies when editing each frame independently.

How:

  • The key idea is to enforce consistency in the diffusion feature space based on inter-frame correspondences from the original video.

  • They observe that diffusion features exhibit the same cross-frame redundancy as the RGB frames themselves: corresponding regions in nearby frames have highly similar features (an illustrative sketch follows this list).

  • Their method alternates between:

    1. Jointly editing sampled keyframes using an image editing method.
    2. Propagating the edited diffusion features to all frames based on correspondences from the original video.
  • This constrains the edited features to share the same inter-frame redundancies as the original features, which yields a temporally consistent edit.

  • They don’t require any training or finetuning of the image diffusion model.

  • They demonstrate state-of-the-art editing results on real-world videos, outperforming competing methods.
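
To make the redundancy observation concrete, here is a minimal illustrative sketch (not the paper's exact analysis protocol) that matches each diffusion token of a frame to its nearest neighbor in the previous frame and reports the mean cosine similarity; values near 1 indicate the cross-frame redundancy the method exploits. The shapes and the function name are assumptions for illustration.

import torch
import torch.nn.functional as F

def feature_redundancy(feat_prev, feat_curr):
  # feat_prev, feat_curr: (num_tokens, dim) diffusion features of two consecutive frames
  a = F.normalize(feat_prev, dim=-1)
  b = F.normalize(feat_curr, dim=-1)
  sim = b @ a.T                      # cosine similarity between every token pair
  nn_sim = sim.max(dim=-1).values    # best match in the previous frame for each token
  return nn_sim.mean().item()        # values near 1.0 => highly redundant features

# Example with random stand-in tensors (real features come from the diffusion UNet):
print(feature_redundancy(torch.randn(4096, 640), torch.randn(4096, 640)))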

Main Contributions

Here are the main contributions of this paper:

  • A technique called TokenFlow that enforces semantic correspondences of diffusion features across frames to increase temporal consistency when editing videos using a text-to-image diffusion model.

  • A novel empirical analysis of diffusion features across video frames, showing that they capture redundant information much like the RGB frames themselves.

  • State-of-the-art editing results on diverse real-world videos depicting complex motions. The edited videos adhere to the target text prompt while maintaining the original video layout and motion.

  • Their method works with any off-the-shelf text-to-image editing technique and does not require finetuning the diffusion model.

  • Comparisons to other recent video editing methods show improved consistency and fidelity to the edit guidance.

  • Their work provides insights into how diffusion models represent videos internally, and how this can be leveraged for consistent video editing and synthesis.

Method Section

Here is a summary of the method section from the paper:

  • Given an input video and target text prompt, the goal is to edit the video according to the text while preserving the original layout and motion.

  • They use a pretrained text-to-image diffusion model (Stable Diffusion) without finetuning.

  • They first invert the video frames using DDIM inversion to obtain each frame's noisy latents at every diffusion timestep (a sketch follows this list).

  • They extract and store the diffusion features (tokens) from the self-attention layers of the denoising UNet for each frame and timestep (see the token-caching sketch after this list).

  • Their method alternates between:

  1. Keyframe Editing:

    • Sample a set of keyframes and edit them jointly using an image editing method with extended attention.
    • This makes the keyframes share a common appearance.
  2. Propagation via TokenFlow:

    • Propagate the edited keyframe tokens to all frames based on nearest-neighbor correspondences from the original video tokens.
    • This preserves the original consistency and redundancies.
  • For propagation, they compute nearest neighbors between each frame’s tokens and its adjacent keyframes’ tokens.

  • They then linearly combine the edited keyframe tokens based on the nearest neighbors to propagate the edit.

  • This process is repeated for each diffusion timestep to generate the final edited frames.

  • It allows leveraging an image diffusion model for video editing without any training or finetuning.
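
A minimal sketch of the DDIM inversion mentioned above, assuming a pretrained noise predictor eps_model(x, t, text_emb) and its alphas_cumprod schedule (hypothetical handles, e.g. taken from a Stable Diffusion pipeline). It runs the deterministic DDIM update in reverse and stores the noisy latent at every timestep for later editing.

import torch

@torch.no_grad()
def ddim_invert(latent, eps_model, text_emb, alphas_cumprod, timesteps):
  # latent: clean VAE latent of one frame; timesteps: increasing list of diffusion steps
  # alphas_cumprod: 1-D tensor of cumulative alphas indexed by timestep
  trajectory = [latent]
  x = latent
  for i in range(len(timesteps) - 1):
    t, t_next = timesteps[i], timesteps[i + 1]
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    eps = eps_model(x, t, text_emb)                      # predicted noise at the current step
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean latent
    x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps   # deterministic DDIM step toward noise
    trajectory.append(x)
  return trajectory                                      # trajectory[-1] is the fully noised latent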
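
A sketch of the token caching mentioned above, using PyTorch forward hooks. The "attn1" name filter matches the self-attention modules in the common diffusers Stable Diffusion UNet layout, but treat the filter and shapes as assumptions for your specific model.

import torch

def register_token_hooks(unet, cache):
  # cache: dict mapping layer name -> list of token tensors, one entry per forward pass
  handles = []
  for name, module in unet.named_modules():
    if name.endswith("attn1"):  # self-attention layers of the UNet blocks
      def hook(_module, _inputs, output, layer_name=name):
        # output: (batch, num_tokens, dim) -- one token per spatial location
        cache.setdefault(layer_name, []).append(output.detach().cpu())
      handles.append(module.register_forward_hook(hook))
  return handles  # call handle.remove() on each when done

# Usage sketch: run the UNet on each frame's latent at every timestep, then read `cache`.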

High-Level Pseudo Code

Here is the high-level pseudo code for the method described in the paper:

# Given input video frames I[1..n] and target text prompt P

# Invert video frames to get noisy latents x[i, t] for every frame i and timestep t
x = DDIM_Inversion(I)

# Extract and cache the diffusion tokens (self-attention features) of all frames
tokens = Get_Tokens(x)

# Generate the edited video by iterating over the diffusion timesteps (noisy to clean)
J = x[:, T]
for t in reversed(range(T)):

  # Sample keyframes (re-sampled at every timestep)
  k = Sample_Keyframes(n)

  # Denoise the keyframes jointly so they share a common appearance
  J_base = Image_Edit(J[k], t, P, Extended_Attention)

  # Tokens of the jointly edited keyframes
  T_base = Get_Tokens(J_base)

  # Nearest-neighbor correspondences between each frame's original tokens
  # and the original tokens of its adjacent keyframes
  NNs = Token_NN(tokens[:, t], tokens[k, t])

  # Denoise all frames, replacing their tokens with the propagated keyframe tokens
  J = Image_Edit(J, t, P, TokenFlow(T_base, NNs))

# Return the edited video
return J

Where:

  • DDIM_Inversion: Inverts each frame into its noisy latents at every diffusion timestep
  • Get_Tokens: Extracts the diffusion features (self-attention tokens) from the denoising UNet
  • Image_Edit: A denoising step of an off-the-shelf image editing method such as PnP Diffusion
  • Extended_Attention: Lets the sampled keyframes attend to one another so they are edited jointly (see the sketch below)
  • TokenFlow: Propagates the edited keyframe tokens to all frames using the original video's correspondences
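
A sketch of the extended attention used for joint keyframe editing: each keyframe's self-attention queries attend to the concatenated keys and values of all sampled keyframes, which pushes them toward a shared appearance. The tensor shapes and the use of scaled_dot_product_attention are assumptions of a typical implementation, not the authors' exact code.

import torch
import torch.nn.functional as F

def extended_self_attention(q, k, v):
  # q, k, v: (num_keyframes, num_tokens, dim) projections from one self-attention layer
  n_frames, n_tokens, dim = q.shape
  # Concatenate keys/values of all keyframes along the token axis ...
  k_ext = k.reshape(1, n_frames * n_tokens, dim).expand(n_frames, -1, -1)
  v_ext = v.reshape(1, n_frames * n_tokens, dim).expand(n_frames, -1, -1)
  # ... so every keyframe attends jointly to every keyframe (including itself)
  return F.scaled_dot_product_attention(q, k_ext, v_ext)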

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the video editing method from the paper:

# Input
I = [I1, I2, ..., In]   # Input video frames
P = "text prompt"       # Target edit text
n = len(I)              # Number of frames
T = 50                  # Number of diffusion timesteps

# Preprocess - invert each frame to get its noisy latents at all timesteps
x = []
for i in range(n):
  x_i = DDIM_Inversion(I[i], T=T)   # x_i[t] = latent of frame i at timestep t
  x.append(x_i)
 
# Preprocess - Extract and store tokens
tokens = []
for i in range(n):
  phi_i = []
  for t in range(T):
    phi_it = Get_Tokens(x[i][t]) 
    phi_i.append(phi_it)
  tokens.append(phi_i)
 
# Generate the edited video, denoising from the last timestep down to the first
J = [x[i][-1] for i in range(n)]    # start from the fully noised latents
for t in reversed(range(T)):

  # Sample keyframes at a fixed interval (offset re-sampled at every timestep)
  k = Sample_Keyframes(n, stride=8)

  # Jointly denoise the keyframes with extended attention and keep their tokens
  T_base = {}
  for j in k:
    J[j] = Image_Edit(J[j], t, P, Extended_Attention)
    T_base[j] = Get_Tokens(J[j])

  # Correspondences between each frame and its surrounding keyframes,
  # computed on the ORIGINAL video tokens
  gammas = {}
  for i in range(n):
    ip, im = Find_Surrounding_Keyframes(i, k)   # next / previous keyframe indices
    gammas[i] = Token_NN(tokens[i][t], tokens[ip][t], tokens[im][t])

  # Denoise the remaining frames with the propagated keyframe tokens
  for i in range(n):
    if i not in k:
      J[i] = Image_Edit(J[i], t, P, TokenFlow(T_base, gammas[i]))

return J  # Edited video

Where the main steps are:

  1. Invert frames to get latents
  2. Extract and store tokens from all frames
  3. For each timestep t:
    • Sample keyframes
    • Edit keyframes with extended attention
    • Compute token nearest neighbors between each frame and its surrounding keyframes (on the original tokens)
    • Propagate the edited keyframe tokens to the remaining frames using these correspondences (see the sketch below)
  4. Return edited video
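
A sketch of the nearest-neighbor matching and propagation steps (Token_NN and TokenFlow above) under assumed shapes; the function names and the scalar blending weight are illustrative, not the authors' API. Correspondences are computed on the original video's tokens, and the edited keyframe tokens are then linearly blended at the matched positions.

import torch
import torch.nn.functional as F

def token_nn(src_tokens, key_tokens):
  # src_tokens: (N, d) original tokens of frame i; key_tokens: (M, d) original tokens of a keyframe
  sim = F.normalize(src_tokens, dim=-1) @ F.normalize(key_tokens, dim=-1).T
  return sim.argmax(dim=-1)            # index of the nearest keyframe token for each source token

def tokenflow_propagate(orig_i, orig_prev, orig_next, edit_prev, edit_next, w):
  # orig_*: ORIGINAL tokens of frame i and of its previous/next keyframes, each (tokens, d)
  # edit_*: EDITED tokens of the same keyframes, each (tokens, d)
  # w: blending weight in [0, 1], larger when frame i is temporally closer to the previous keyframe
  idx_prev = token_nn(orig_i, orig_prev)   # correspondences taken from the source video
  idx_next = token_nn(orig_i, orig_next)
  # Linearly combine the edited keyframe tokens at the matched positions
  return w * edit_prev[idx_prev] + (1.0 - w) * edit_next[idx_next]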