Generative Image Dynamics

Authors: Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski

Abstract: We present an approach to modeling an image-space prior on scene dynamics. Our prior is learned from a collection of motion trajectories extracted from real video sequences containing natural, oscillating motion such as trees, flowers, candles, and clothes blowing in the wind. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a per-pixel long-term motion representation in the Fourier domain, which we call a neural stochastic motion texture. This representation can be converted into dense motion trajectories that span an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping dynamic videos, or allowing users to realistically interact with objects in real pictures.

What, Why and How

Here is a summary of the key points in this paper:

What:

  • The paper presents an approach to model natural oscillation dynamics (e.g. trees/flowers swaying, candles flickering) from a single image.

  • It learns a generative prior over image-space motion from videos, represented as a “neural stochastic motion texture” - a set of Fourier coefficients describing each pixel’s motion trajectory.

  • This motion texture is predicted from a single image using a frequency-coordinated latent diffusion model.

  • The predicted dense motion trajectories are used to animate the input image into a video using neural image-based rendering.

Why:

  • Modeling natural motion dynamics allows generating realistic, temporally coherent animations from still images.

  • Representing motion in the frequency domain enables controlling motion properties and generating seamlessly looping animations.

  • Explicitly modeling scene motion provides better control than directly generating pixels, leading to higher-quality animations.

How:

  • The model is trained on videos depicting natural oscillations (trees, flowers, etc.).

  • Dense motion trajectories are extracted by estimating optical flow between video frames.

  • Motion textures are obtained by taking the Fourier transform of these trajectories and are used to train the diffusion model.

  • At test time, the model predicts a motion texture from a single image.

  • This is transformed to trajectories and used to animate the image into a video via neural rendering.

  • Applications: image-to-video, seamless looping, interactive dynamics.

In summary, the paper presents a way to learn and generate natural motion dynamics from still images for high-quality, controllable animations. The key idea is to model motion explicitly using a learned generative prior.
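
To make the motion representation concrete, below is a minimal sketch of how a predicted set of per-pixel Fourier coefficients could be converted into the time-domain trajectories used for animation. The array layout (K low-frequency bands, complex x/y coefficients per pixel) and the function name are assumptions for illustration, not the paper's exact interface.

import numpy as np

def motion_texture_to_trajectories(coeffs: np.ndarray, n_frames: int) -> np.ndarray:
    """Convert per-pixel Fourier coefficients into displacement trajectories.

    coeffs:  complex array of shape (K, H, W, 2) holding the first K frequency
             terms of each pixel's (x, y) displacement (assumed layout).
    returns: real array of shape (n_frames, H, W, 2) of per-frame displacements.
    """
    K, H, W, _ = coeffs.shape
    assert K <= n_frames // 2 + 1
    # Place the K low-frequency terms into a half-spectrum of length n_frames//2 + 1;
    # the remaining bands are assumed to carry negligible energy for natural motion.
    spectrum = np.zeros((n_frames // 2 + 1, H, W, 2), dtype=np.complex64)
    spectrum[:K] = coeffs
    # Inverse real FFT along the temporal axis yields a displacement for every frame.
    return np.fft.irfft(spectrum, n=n_frames, axis=0)

The displacement field for frame t tells the renderer where each pixel of the input image should move, which is exactly what the image-based rendering module consumes.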

Main Contributions

Here are the main contributions of this paper:

  • Proposes representing natural scene motion using a neural stochastic motion texture - a per-pixel motion trajectory representation in the frequency domain.

  • Presents a frequency-coordinated latent diffusion model to predict such motion textures from a single input image.

  • Develops a neural image-based rendering module that uses predicted dense motion trajectories to animate an input image.

  • Demonstrates that modeling motion explicitly leads to higher-quality image animation than raw video synthesis baselines.

  • Shows applications enabled by the motion representation, such as controllable motion speed, seamless looping, and interactive dynamics.

  • Introduces frequency adaptive normalization and motion self-guidance techniques to improve training and inference.

  • Provides quantitative comparisons showing improved image and video quality over prior state-of-the-art methods.

In summary, the key contributions are:

  • A neural stochastic motion texture representation for natural oscillations.

  • A generative model to predict such textures from images.

  • A rendering module to produce animations using predicted motion.

  • Demonstrating advantages of explicit motion modeling over raw video synthesis.

  • Enabling applications by controlling the predicted motion representation.

  • Technical innovations like frequency normalization and motion guidance.

  • Extensive experiments showing improvements over previous approaches.

The main contribution is developing a way to model and predict natural motion dynamics from single images for high-quality controllable animations.
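
Because motion is predicted explicitly, simple edits to the coefficients translate directly into edits of the resulting animation. The snippet below is a small illustrative example of one such control, global amplitude scaling; it reuses the coefficient layout and helper assumed in the earlier sketch and is not code from the paper.

def scale_motion(coeffs, amplitude=1.0):
    """Globally scale motion magnitude by scaling every Fourier coefficient.

    coeffs:    complex array (K, H, W, 2), the assumed motion-texture layout.
    amplitude: values < 1 damp the motion, values > 1 exaggerate it.
    """
    return coeffs * amplitude

# Example: render a gentler version of the predicted motion.
# damped = scale_motion(pred_coeffs, amplitude=0.5)
# traj = motion_texture_to_trajectories(damped, n_frames=150)

Apparent speed can likewise be varied by reconstructing the same coefficients over a shorter or longer frame window before playback, and the reconstructed trajectories are periodic in that window, which is part of why the frequency-domain representation lends itself to seamless looping.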

Method Section

Here is a summary of the method section from the paper:

The method consists of two main components:

  1. Motion Prediction Module
  • Represents motion using a neural stochastic motion texture (NSMT) - a per-pixel motion trajectory in the frequency domain.

  • Models NSMT using a latent diffusion model (LDM).

  • Applies frequency-coordinated denoising during diffusion - attends to motion across frequencies.

  • Uses frequency adaptive normalization when training the LDM (a sketch appears at the end of this section).

  • At test time, predicts an NSMT from a single input image.

  2. Image-Based Rendering Module
  • Transforms predicted NSMT to time-domain dense motion fields using IFFT.

  • Encodes input image into multi-scale features.

  • Warps features to future frames using predicted motion and softmax splatting.

  • Uses motion-derived weights to handle disocclusions.

  • Decodes warped features into output frames using neural synthesis network.

  • Jointly trains feature encoder and image decoder.

In summary, the method first predicts a stochastic motion texture representing the trajectories of all pixels using a specialized diffusion model. It then uses these trajectories to warp input image features and render future frames. The motion diffusion model and the neural rendering module are each trained on videos depicting natural oscillations.
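
As an illustration of the warping step, here is a simplified, self-contained sketch of softmax splatting: source features are forward-warped along the predicted displacement field, and per-pixel importance weights decide which source pixel wins where several land on the same target. The paper's module operates on multi-scale features and uses a more careful splatting scheme, so treat this as a sketch of the idea only.

import torch

def softmax_splat(feat, flow, weight):
    """Forward-warp feat (B,C,H,W) by flow (B,2,H,W) using softmax weights.

    weight: (B,1,H,W) importance logits; larger values dominate wherever
    several source pixels map to the same target location.
    """
    B, C, H, W = feat.shape
    # Target coordinates for every source pixel (nearest-pixel variant;
    # bilinear splatting would spread each value over four neighbours).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    tx = (xs.to(feat) + flow[:, 0]).round().long().clamp(0, W - 1)
    ty = (ys.to(feat) + flow[:, 1]).round().long().clamp(0, H - 1)
    idx = (ty * W + tx).reshape(B, 1, -1)                    # (B, 1, H*W)

    # Softmax numerators (subtract weight.max() before exp for stability in practice)
    w = weight.exp().reshape(B, 1, -1)
    num = torch.zeros(B, C, H * W, dtype=feat.dtype, device=feat.device)
    den = torch.zeros(B, 1, H * W, dtype=feat.dtype, device=feat.device)
    num.scatter_add_(2, idx.expand(B, C, -1), feat.reshape(B, C, -1) * w)
    den.scatter_add_(2, idx, w)
    # Softmax-weighted average per target pixel; untouched targets stay zero
    # (disocclusions), to be filled in by the synthesis decoder.
    return (num / (den + 1e-8)).reshape(B, C, H, W)

The summary above notes that the splatting weights are motion-derived; in this sketch that choice corresponds to the weight argument, which breaks ties at disocclusion boundaries.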

At test time, the pipeline takes a single image, predicts a motion texture, and animates the image into a video using the predicted motion, enabling downstream applications.

The key aspects are the neural stochastic motion texture representation and the frequency-coordinated motion prediction and neural rendering pipeline for animating still images in a temporally coherent way.
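
On the frequency adaptive normalization mentioned above: the amplitude of natural motion falls off rapidly with frequency, so training the LDM on raw coefficients would let the lowest band dominate. One plausible per-frequency normalization is sketched below; the statistic used (a high percentile of magnitudes per band) and the exact transform are assumptions, and the paper's scheme may differ in detail.

import numpy as np

def fit_freq_scales(coeff_dataset, percentile=95):
    """Fit one scale per frequency band from a set of training motion textures.

    coeff_dataset: iterable of complex arrays shaped (K, H, W, 2).
    returns:       (K,) array of per-band scales (here: a high percentile of
                   coefficient magnitudes at each frequency, a hypothetical choice).
    """
    mags = np.concatenate([np.abs(c).reshape(c.shape[0], -1) for c in coeff_dataset], axis=1)
    return np.percentile(mags, percentile, axis=1) + 1e-8

def normalize(coeffs, scales):
    # Bring all frequency bands into a comparable numeric range before encoding
    return coeffs / scales[:, None, None, None]

def denormalize(coeffs, scales):
    # Invert the scaling after the diffusion model has produced a sample
    return coeffs * scales[:, None, None, None]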

High-Level Pseudo Code

Here is the high-level pseudocode for the method presented in the paper:

# Motion Prediction Module
 
# Encode real motion textures (per-pixel Fourier coefficients) into latent space
motion_latents = encoder(real_motion_textures)

# Train diffusion model to denoise motion latents, conditioned on the first frame
diffusion_model = train_diffusion(motion_latents, conditioning=first_frames)

# At test time:
input_image = ...

# Predict a stochastic motion texture: sample a motion latent conditioned on
# the input image, then decode it with the VAE decoder
motion_latent = diffusion_model.sample(condition=input_image)
motion_texture = decoder(motion_latent)

# Per-pixel inverse Fourier transform to get time-domain trajectories
motion_trajectories = ifft(motion_texture)
 
 
# Image-Based Rendering Module
 
# Encode input image into multi-scale features
input_features = encode_image(input_image)

# For each output time t: warp features with the displacement field for frame t
warped_features_t = warp_features(input_features, motion_trajectories[t])

# Decode warped features into the output frame at time t
output_frame_t = decode_features(warped_features_t)

# The feature encoder and decoder are trained jointly beforehand on real frame
# pairs, supervised with optical flow estimated between them
renderer = train_renderer(input_frames, target_frames, flow_between_frames)

In summary, the pipeline first predicts a stochastic motion texture representing complete dense trajectories using a diffusion model. It then uses these trajectories to warp input image features and render output frames depicting the input image in motion. The motion diffusion model and the neural rendering module are each trained on videos showing natural oscillations.

Detailed Pseudo Code

Here is a more detailed pseudocode outline to implement the method described in the paper:

# Define model architectures
 
# Motion encoder-decoder VAE
encoder = EncoderCNN() 
decoder = DecoderCNN()
 
# Diffusion model UNet
diffusion_unet = UNet() 
 
# Image encoder 
image_encoder = ResNetEncoder()
 
# Image decoder
image_decoder = CoModGANDecoder()
 
 
# Train motion VAE
for x in motion_textures:

  z = encoder(x)

  x_recon = decoder(z)

  # Reconstruction loss plus a KL regularizer on the latent
  loss = L1(x, x_recon) + KL_div(z)

  optimize(loss)
  
 
# Extract motion latents with the trained encoder
motion_latents = []
for x in motion_textures:

  z = encoder(x)

  motion_latents.append(z)
  
# Train diffusion model to predict the noise added to motion latents,
# conditioned on the corresponding first video frame I0
for z0, I0 in zip(motion_latents, first_frames):

  t = sample_timestep()
  noise = randn_like(z0)
  z_t = add_noise(z0, noise, t)

  epsilon = diffusion_unet(z_t, t, condition=I0)

  loss = MSE(epsilon, noise)

  optimize(loss)
  
# Frequency coordination: freeze the base UNet and train cross-frequency
# attention layers that couple the K frequency bands of each sample
frozen_diffusion_unet = freeze_params(diffusion_unet)
cross_freq_attention = CrossFreqAttention()

for z_t, t, noise, I0 in noisy_latents_batched_over_freqs:   # z_t stacks the K bands

  epsilon = cross_freq_attention(frozen_diffusion_unet(z_t, t, condition=I0))

  loss = MSE(epsilon, noise)

  optimize(loss)  # updates only the attention layers; the base UNet stays frozen
 
 
# Train image-based renderer
for I0, It in frame_pairs:   # pairs of frames sampled from real videos
 
  # Get motion 
  F_0t = optical_flow(I0, It)  
 
  # Encode input
  feat0 = image_encoder(I0)
 
  # Warp features
  feat_t = warp_softmax(feat0, F_0t)
 
  # Decode target
  I_hat_t = image_decoder(feat_t)
 
  # Loss
  loss = perceptual_loss(It, I_hat_t) 
  
  optimize(loss)
  
  
# Generation
input_image = ...

# Get stochastic motion texture: run diffusion sampling with the UNet,
# conditioned on the input image, then decode the latent with the VAE decoder
z = diffusion_sample(diffusion_unet, condition=input_image)
smt = decoder(z)

# Get per-pixel trajectories via inverse FFT
trajectories = ifft(smt)

# Render video frame by frame (the image is encoded once, warped per frame)
feat0 = image_encoder(input_image)
output_frames = []
for t in range(n_frames):

  feat_t = warp_softmax(feat0, trajectories[t])

  output_frames.append(image_decoder(feat_t))

In summary, the key steps are:

  1. Train VAE for motion latents

  2. Train diffusion model for motion denoising

  3. Train cross-frequency attention layers with the base diffusion model frozen (frequency coordination)

  4. Train image encoder-decoder for neural rendering

  5. At test time:

    • Predict motion texture
    • Get trajectories via iFFT
    • Render video by warping input features

The main components are the motion diffusion model with frequency coordination, and the neural rendering module. These are trained on videos showing natural oscillations before being applied to still images.
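
For reference, the CrossFreqAttention block used in the pseudocode above can be realized as self-attention across the K frequency bands at each spatial location of the latent feature map. The module below is a minimal, hypothetical version (a single attention layer with a residual connection, channels assumed divisible by the head count); the paper's frequency-coordination architecture is more involved.

import torch
import torch.nn as nn

class CrossFreqAttention(nn.Module):
    """Self-attention over the frequency axis of per-band latent features.

    Input:  x of shape (B, K, C, H, W), i.e. K frequency bands per sample.
    Output: same shape, with information exchanged across bands at every pixel.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, K, C, H, W = x.shape
        # Treat each spatial location independently: its K band features are tokens.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, K, C)
        h = self.norm(tokens)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + attn_out                      # residual connection
        return tokens.reshape(B, H, W, K, C).permute(0, 3, 4, 1, 2)

# Usage sketch: couple K=16 bands of a 64-channel latent feature map.
# attn = CrossFreqAttention(channels=64)
# y = attn(torch.randn(2, 16, 64, 32, 32))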