DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis

Authors: Zhongjie Duan, Lizhou You, Chengyu Wang, Cen Chen, Ziheng Wu, Weining Qian, Jun Huang

Abstract: In recent years, diffusion models have emerged as the most powerful approach in image synthesis. However, applying these models directly to video synthesis presents challenges, as it often leads to noticeable flickering content. Although recently proposed zero-shot methods can alleviate flicker to some extent, they still struggle to generate coherent videos. In this paper, we propose DiffSynth, a novel approach that aims to convert image synthesis pipelines to video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering to the latent space of diffusion models, effectively preventing flicker accumulation in intermediate steps. Additionally, we propose a video deflickering algorithm, named the patch blending algorithm, that remaps objects in different frames and blends them together to enhance video consistency. One of the notable advantages of DiffSynth is its general applicability to various video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoration, and 3D rendering. In the task of text-guided video stylization, we make it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth. All videos can be viewed on our project page. Source codes will also be released.

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper proposes DiffSynth, a novel approach to convert image synthesis pipelines to video synthesis pipelines using diffusion models. The goal is to generate coherent and realistic videos without noticeable flicker.

Why:

  • Applying diffusion models directly to video synthesis often leads to flickering content. Existing zero-shot methods can reduce flicker to some extent but still struggle to generate coherent videos.

How:

  • DiffSynth has two main components:

    1. Latent in-iteration deflickering framework: Applies video deflickering in the latent space during intermediate steps of diffusion models to prevent flicker accumulation.
    2. Patch blending algorithm: Remaps objects across frames and blends them to enhance video consistency.
  • DiffSynth is generally applicable to various video synthesis tasks, such as text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoration, and 3D rendering.

  • Experiments show DiffSynth can generate high-quality, coherent videos without cherry-picking, outperforming prior methods quantitatively and qualitatively.

In summary, DiffSynth proposes a novel approach to transfer image synthesis pipelines to video synthesis using latent space deflickering and patch blending, enabling high-quality coherent video generation. The method is widely applicable and shows superiority over existing techniques.

Main Contributions

The main contributions of this paper are:

  • Proposes DiffSynth, a novel approach to convert image synthesis pipelines to video synthesis pipelines using diffusion models.

  • Devises a latent in-iteration deflickering framework that applies video deflickering in the latent space of diffusion models to avoid flicker accumulation.

  • Proposes a patch blending algorithm based on patch matching to enhance video consistency by remapping and blending objects across frames.

  • Designs video synthesis pipelines using DiffSynth for various tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoration, and 3D rendering.

  • Demonstrates the effectiveness of DiffSynth through extensive experiments, showing it can generate high-quality coherent videos without cherry-picking, outperforming prior methods.

  • Shows the general applicability of DiffSynth to diverse video synthesis tasks through strong quantitative and qualitative results.

In summary, the key contributions are proposing the DiffSynth approach, the latent space deflickering framework, the patch blending algorithm, application to various synthesis tasks, and demonstrating superior performance over existing techniques.

Method Section

Here is a summary of the method section from the paper:

Preliminaries:

  • Diffusion models have three main components: a text encoder, a U-Net, and a VAE.
  • Image generation adds noise to a latent representation and then iteratively denoises it in latent space (sketched below).
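
A minimal sketch of this standard latent-diffusion sampling loop, using hypothetical text_encoder, unet, vae, scheduler, and sample_noise objects (illustration only, not the paper's code):

# Standard latent diffusion sampling for a single image
text_emb = text_encoder(prompt)                   # encode the text prompt
latent = sample_noise(latent_shape)               # start from Gaussian noise
for t in scheduler.timesteps:                     # iterative denoising in latent space
  noise_pred = unet(latent, t, text_emb)          # U-Net predicts the noise at step t
  latent = scheduler.step(noise_pred, t, latent)  # remove part of the predicted noise
image = vae.decode(latent)                        # VAE decodes the latent into pixels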

Latent In-Iteration Deflickering:

  • Directly generating each frame leads to flicker accumulation across frames.
  • Propose applying video deflickering in latent space during intermediate steps.
  • At each intermediate step, the latents are decoded into frames, deflickered, and encoded back, so flicker is removed before it can accumulate.

Patch Blending Algorithm:

  • Use patch matching to estimate correspondences between frames.
  • Remap patches across frames and blend them for consistency (a remapping sketch follows this list).
  • An approximate hierarchical algorithm reduces the blending complexity from O(n^2) to O(n log n).
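
A minimal sketch of the remapping step, assuming the nearest neighbor field (NNF) from patch matching is stored as per-pixel source coordinates; it relies on OpenCV's cv2.remap and is an illustration, not the paper's implementation:

# Warp frame2 onto frame1's layout using an estimated NNF
import cv2
import numpy as np

def remap_with_nnf(frame2, nnf):
  # nnf: (H, W, 2) array giving, for each pixel of frame1, the (x, y)
  # coordinates of its best-matching patch in frame2
  map_x = nnf[..., 0].astype(np.float32)
  map_y = nnf[..., 1].astype(np.float32)
  return cv2.remap(frame2, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Example: frame2_to_1 = remap_with_nnf(frame2, patch_match(frame1, frame2))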

Other Modifications:

  • Use fixed noise, cross-frame attention, adaptive resolution, deterministic VAE sampling, and memory-efficient attention (a cross-frame attention sketch follows this list).
  • Smooth the control signals produced by annotators such as OpenPose.
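
As a concrete illustration, cross-frame attention is commonly implemented by letting every frame keep its own queries while attending to the keys and values of a single anchor frame; below is a minimal PyTorch sketch under that assumption, not the paper's exact implementation:

import torch.nn.functional as F

def cross_frame_attention(q, k, v, anchor=0):
  # q, k, v: tensors of shape (num_frames, num_tokens, dim)
  # Every frame attends to the anchor frame's keys and values,
  # which ties appearance and texture across frames.
  k_anchor = k[anchor:anchor + 1].expand_as(k)
  v_anchor = v[anchor:anchor + 1].expand_as(v)
  return F.scaled_dot_product_attention(q, k_anchor, v_anchor)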

In summary, the key ideas are latent space deflickering during diffusion steps to prevent flicker accumulation and patch blending across frames for video consistency. Other modifications enhance efficiency and coherence.

High-Level Pseudo Code

Here is the high-level pseudo code for the key components of the paper:

# Latent In-Iteration Deflickering

for t in range(num_steps):

  # Denoise the latents of each frame separately
  x_t = denoise(x_t, t)

  # Decode the latents into frames
  frames = decode(x_t)

  # Deflicker the decoded frames
  frames = deflicker(frames)

  # Encode the deflickered frames back into latents
  x_t = encode(frames)
 
# Patch Blending Algorithm

# Estimate the nearest neighbor field (NNF) between two frames
nnf = patch_match(frame1, frame2)

# Remap frame2 onto frame1 using the NNF
frame2_to_1 = remap(frame2, nnf)

# Blend the aligned frames
blended_pair = blend(frame1, frame2_to_1)

# Repeat the pairwise blending hierarchically over all frames,
# doubling the block size each round (approximate O(n log n) scheme)
blended = list(frames)
interval = 1
while interval < num_frames:
  for j in range(0, num_frames - interval, interval * 2):
    # Blend two neighbouring blocks of `interval` frames each
    blended[j] = blend(blended[j], blended[j + interval])
  interval *= 2

The key ideas are applying deflickering in the latent space during diffusion model denoising steps, and using patch-based blending of remapped frames for enhancing video consistency.

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the key components of DiffSynth:

import math

# Hyperparameters
num_frames = 8
num_steps = 50
 
# Latent In-Iteration Deflickering
 
for t in range(num_steps):

  # Denoise the latent of each frame at step t
  for i in range(num_frames):
    x_t[i] = denoise(x_t[i], t)

  # Decode the latents into frames
  frames = [decode(x_t[i]) for i in range(num_frames)]

  # Deflicker the decoded frames (e.g., with the patch blending algorithm below)
  frames = video_deflicker(frames)

  # Encode the deflickered frames back into latents
  for i in range(num_frames):
    x_t[i] = encode(frames[i])
 
# Patch Blending Algorithm (approximate hierarchical scheme)

# Level 0 of the remapping table holds the individual frames
remap_table = [list(frames)]

# Level k stores blends of contiguous blocks of 2**k frames
for k in range(1, int(math.log2(num_frames)) + 1):
  prev_level = remap_table[k - 1]
  level = []
  for j in range(0, len(prev_level) - 1, 2):
    # Blend two neighbouring blocks into one larger block
    level.append(blend(prev_level[j], prev_level[j + 1]))
  remap_table.append(level)

# Final blended frames: each frame combines O(log n) table entries
blended_frames = [blend_frames_to_i(remap_table, i) for i in range(num_frames)]
 
# Helper functions

def blend(frame1, frame2):

  # Remap frame2 onto frame1 using the estimated nearest neighbor field
  frame2_to_1 = remap(frame2, patch_match(frame1, frame2))

  # Average the aligned frames
  return (frame1 + frame2_to_1) / 2

def blend_frames_to_i(remap_table, i):

  # Query the pre-blended blocks covering the frames before and after i
  left = query_left(remap_table, i)
  right = query_right(remap_table, i)

  # Combine both sides with frame i itself
  return blend(blend(remap_table[0][i], left), right)
 

The key components are latent space deflickering inside the diffusion steps, constructing a remapping table for efficient blending, and querying blended frames from the table to get the final results.