CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

Authors: Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen

Abstract: We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. Project page can be found at CoDeF.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper proposes representing a video using a “content deformation field” (CoDeF). This consists of a canonical content field that captures the static contents, and a temporal deformation field that captures how each frame deforms from the canonical content (a minimal sketch of this composition follows the list).

  • The canonical content field is a 2D image that aggregates all the visual content from across the video frames. The temporal deformation field is a 3D field that stores the deformation needed to map points between the canonical image and each individual frame.

  • Both fields are represented using hash-based neural implicit representations for efficiency and capacity.
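
In code terms, rendering a pixel of frame t amounts to composing the two fields: the deformation field maps the pixel to canonical coordinates, and the content field returns the color there. A minimal conceptual sketch in Python (the function names are illustrative, not taken from the paper's code):

def render_pixel(x, t, C, D):
    """Color of pixel x in frame t under the CoDeF representation.

    C: canonical content field, maps 2D canonical coords -> RGB
    D: temporal deformation field, maps (2D frame coords, time t) -> canonical coords
    """
    x_canonical = D(x, t)   # where this pixel lives in the canonical image
    return C(x_canonical)   # look up its color there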

Why:

  • Representing videos this way allows lifting image algorithms directly to video tasks with high temporal consistency. For example, image-to-image translation or segmentation on the canonical image can be propagated to all frames through the deformation field.

  • Previous video representations like video mosaics or neural layered atlases have limitations in faithfully capturing details and semantics. CoDeF aims for higher reconstruction quality and a more natural canonical image.

How:

  • The canonical and deformation fields are jointly optimized to reconstruct the input frames through a differentiable rendering pipeline.

  • Regularization losses are used to encourage semantic correctness of the canonical image. Techniques like annealed hash encoding and optical flow consistency help with deformation smoothness.

  • For video tasks, established image algorithms are simply applied to the canonical image, and the results are propagated to all frames via the deformation field. This lifts image tasks to video with temporal consistency.

In summary, the paper proposes a novel video representation using content and deformation fields to enable lifting image algorithms to video tasks with high consistency. The key ideas are joint optimization, regularization for semantics, and leveraging the deformation field for propagation.

Main Contributions

Here are the key contributions of this paper:

  • Proposes a new video representation called the content deformation field (CoDeF) which consists of a canonical content field and a temporal deformation field.

  • Represents both fields using hash-based neural implicit representations to enable efficiency and capacity.

  • Introduces techniques like annealed hash encoding and optical flow consistency to encourage semantic correctness and smoothness.

  • Demonstrates how this representation enables lifting image algorithms directly to video tasks with high temporal consistency.

  • Shows applications in video-to-video translation, video object tracking, video keypoint tracking, video super-resolution, etc. by applying image algorithms on the canonical image.

  • Achieves state-of-the-art performance on tasks like video translation and tracking compared to previous methods.

  • Provides both quantitative and qualitative experimental results to validate the approach on complex rigid and non-rigid motions.

In summary, the main contributions are:

  • The novel CoDeF video representation using canonical content and deformation fields.

  • Techniques to optimize this representation for semantics and smoothness.

  • A framework for lifting image algorithms to video tasks through this representation.

  • State-of-the-art results on video processing applications like translation and tracking.

  • Comprehensive experiments demonstrating the robustness and versatility of the proposed approach.

Method Section

Here is a summary of the method section from the paper:

The paper proposes representing a video using two components - a canonical content field and a temporal deformation field.

The canonical content field is a 2D image that aggregates all static textures from the video frames. It is implemented as a 2D multi-resolution hash encoding followed by a multilayer perceptron (MLP).

The deformation field captures the per-frame motion from the canonical space. It is implemented as a 3D hash encoding over space and time, followed by an MLP.

To obtain both high capacity and efficiency, the two fields use multi-resolution hash encodings over learnable feature grids.
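
As a concrete illustration of such a hash-encoded field, below is a minimal PyTorch sketch of a 2D multi-resolution hash encoding followed by a small MLP. The level count, table size, hashing constants, and layer widths are illustrative assumptions rather than the paper's settings; efficient implementations typically rely on fused CUDA kernels (e.g., tiny-cuda-nn). A deformation field follows the same pattern with 3D (x, y, t) inputs and a 2-channel coordinate output.

import torch
import torch.nn as nn

class HashGrid2D(nn.Module):
    """Simplified multi-resolution hash encoding for 2D coordinates in [0, 1]^2."""
    def __init__(self, n_levels=8, n_features=2, log2_table=15, base_res=16, growth=1.5):
        super().__init__()
        self.table_size = 2 ** log2_table
        self.resolutions = [int(base_res * growth ** l) for l in range(n_levels)]
        # one learnable feature table per resolution level
        self.tables = nn.Parameter(1e-4 * torch.randn(n_levels, self.table_size, n_features))

    def _hash(self, corner):
        # large-prime XOR spatial hash of integer grid corners (in the spirit of Instant-NGP)
        return (corner[..., 0] ^ (corner[..., 1] * 2654435761)) % self.table_size

    def forward(self, x):                            # x: (N, 2) in [0, 1]
        feats = []
        for level, res in enumerate(self.resolutions):
            pos = x * res
            lo = pos.floor().long()
            w = pos - lo                             # bilinear weights
            f = 0.0
            for dx in (0, 1):
                for dy in (0, 1):
                    corner_feat = self.tables[level][self._hash(lo + lo.new_tensor([dx, dy]))]
                    wx = w[:, :1] if dx else 1 - w[:, :1]
                    wy = w[:, 1:] if dy else 1 - w[:, 1:]
                    f = f + corner_feat * wx * wy    # bilinear interpolation of corner features
            feats.append(f)
        return torch.cat(feats, dim=-1)              # (N, n_levels * n_features)

class CanonicalField(nn.Module):
    """Hash-encoded MLP mapping canonical 2D coordinates to RGB."""
    def __init__(self):
        super().__init__()
        self.encoding = HashGrid2D()
        self.mlp = nn.Sequential(
            nn.Linear(8 * 2, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 3), nn.Sigmoid())          # RGB in [0, 1]

    def forward(self, x):
        return self.mlp(self.encoding(x))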

During training, the fields are jointly optimized to reconstruct the input frames through differentiable rendering. Regularization losses are added to encourage semantic correctness and smooth deformations.

Specifically, annealed hash encoding is used to progressively add high-frequency details. Optical flow consistency loss improves smoothness. For complex motions, grouped deformation fields based on segmentation masks are proposed.
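
One common way to realize such annealing is to gate the finer hash levels with a window that opens gradually over training, so the canonical image settles on coarse structure before fitting high-frequency detail. The schedule below is a hedged illustration; the exact schedule and constants are not taken from the paper.

import torch

def annealed_level_weights(step, total_steps, n_levels=8):
    """Per-level weights in [0, 1] that progressively enable finer hash-grid levels."""
    alpha = n_levels * min(step / total_steps, 1.0)      # how many levels are currently "open"
    level = torch.arange(n_levels, dtype=torch.float32)
    # smooth cosine ramp from 0 to 1 as alpha passes each level index
    return 0.5 * (1 - torch.cos(torch.clamp(alpha - level, 0.0, 1.0) * torch.pi))

# usage: multiply each level's features by its weight before concatenation,
# e.g. feats[l] = weights[l] * feats[l] inside the hash encoding's forward pass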

After optimization, established image algorithms such as translation, segmentation, or super-resolution can be applied directly to the canonical image. Their outputs are propagated to all frames using the deformation field, thereby lifting image tasks to video.
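
For instance, once an image algorithm has produced an output on the canonical image (a stylized image, a segmentation mask, etc.), each video frame can be formed by sampling that output at the deformed coordinates. A minimal PyTorch sketch of this lookup, assuming the deformation field returns canonical coordinates normalized to [-1, 1] (names and the normalization convention are assumptions of this sketch):

import torch
import torch.nn.functional as F

def propagate_to_frame(canonical_output, deform_field, t, H, W):
    """Warp the processed canonical image into frame t.

    canonical_output: (1, C, Hc, Wc) tensor, e.g. a stylized canonical image.
    deform_field:     callable mapping (pixel coords (N, 2), t) to canonical
                      coords in [-1, 1]; an assumed interface for this sketch.
    """
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    pixels = torch.stack([xs, ys], dim=-1).reshape(-1, 2)        # (H*W, 2), (x, y) order
    canonical_xy = deform_field(pixels, t)                       # (H*W, 2) canonical coords
    grid = canonical_xy.reshape(1, H, W, 2)                      # grid_sample expects (x, y)
    return F.grid_sample(canonical_output, grid, align_corners=True)  # (1, C, H, W)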

In summary, the key ideas are the joint content and deformation field representation, the use of hash encodings and MLPs, and regularization techniques to enable lifting image algorithms to video with temporal consistency.

High-Level Pseudo Code

Here is the high-level pseudo code for the method proposed in the paper:

# Represent video V as canonical image Ic and deformation field D
 
# Canonical content field C as 2D hash table 
C = HashMLP(x) 
 
# Temporal deformation field D as 3D hash table
D = HashMLP(x, t)
 
# Jointly optimize C and D to reconstruct V
for iter in iterations:
  
  # Sample points x from frames
  
  # Get color prediction 
  c = C(D(x, t))
  
  # Loss between c and frame color
  
  # Update C and D networks
 
# Extract canonical image Ic from optimized C 
 
# Apply image algorithm X on Ic to get output O
 
O = X(Ic)
 
# Propagate O to all frames using D
for t in frames:
 
  xt = D(x, t) # Deform
 
  Ot = O(xt) # Propagate output
 
# Render propagated O with frame t

In summary, the key steps are:

  1. Represent video as C and D fields using hash MLPs
  2. Optimize C and D for video reconstruction
  3. Apply algorithm on canonical Ic from C
  4. Propagate output to frames using D

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the method proposed in the paper:

# Video V with frames I1, I2, .. IN
 
# Canonical content field C as a 2D hash-encoded MLP
# L levels, F feature dims per level 
C = HashMLP2D(L, F) 
 
# Temporal deformation field D as a 3D hash-encoded MLP
# T entries in hash table
D = HashMLP3D(T) 
 
# Initialize C, D networks
 
# Hyperparams
iterations = 10000
λ1, λ2 = 1, 10 
 
for iter in range(iterations):
 
  # Sample points x in frame Ii
  
  # Get color prediction
  x' = D(x, i)        # deform frame point to canonical space
  c = C(x')           # query canonical color
 
  # Reconstruction loss
  Lrec = ||c - Ii(x)||^2
 
  # Sample corresponding points in Ii+1 via optical flow
  x_flow = x + Flow(x, i->i+1)
 
  # Flow consistency loss: matched points should map to the same canonical location
  Lflow = ||D(x, i) - D(x_flow, i+1)||
 
  # Total loss (Lbg: background regularization loss)
  L = Lrec + λ1*Lflow + λ2*Lbg
 
  # Update C, D networks
 
# Render canonical image Ic from C
 
# Apply algorithm X on Ic to get output O
O = X(Ic)  
 
# Propagate O to frames
for i in frames:
  
  # Get deformed coordinates 
  x' = D(x, i)
 
  # Sample output
  Oi(x) = O(x')
 
# Render Oi with frame Ii

The key additions compared to the high-level version:

  • Detailed hash encoding and MLP architectures
  • Reconstruction, flow consistency and background losses (a runnable PyTorch sketch of the first two follows this list)
  • Sampling flowed points using optical flow
  • Rendering canonical image and propagating output
  • Hyperparameter details for losses
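
For reference, here is how the reconstruction and flow-consistency terms could look in runnable PyTorch, under the interfaces assumed in the earlier sketches; the loss weights, flow-validity masking, and the background term are simplified or omitted.

import torch.nn.functional as F

def training_step(C, D, frame_i_rgb, pixels, flow_ij, t_i, t_j, lambda_flow=1.0):
    """One optimization step on a pair of consecutive frames i and j = i + 1.

    C, D:        canonical content field and deformation field (nn.Modules).
    frame_i_rgb: (N, 3) ground-truth colors of the sampled pixels in frame i.
    pixels:      (N, 2) sampled pixel coordinates in frame i.
    flow_ij:     (N, 2) optical flow from frame i to frame j at those pixels.
    """
    # reconstruction: deform to canonical space, query color, compare with frame i
    canonical_i = D(pixels, t_i)
    loss_rec = F.mse_loss(C(canonical_i), frame_i_rgb)

    # flow consistency: flow-matched points in frames i and j should land on
    # (nearly) the same canonical location
    canonical_j = D(pixels + flow_ij, t_j)
    loss_flow = (canonical_i - canonical_j).abs().mean()

    return loss_rec + lambda_flow * loss_flow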
