DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
Abstract: Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models’ capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is \url{}
What, Why and How
Here is a summary of the key points in this paper:
What:
- The paper proposes DragNUWA, a video generation model that incorporates text, image, and trajectory controls to enable fine-grained controllable video generation.
Why:
- Current video generation models focus on either text, image, or trajectory control, which is not comprehensive enough. The combination of all three controls allows fine-grained control from semantic, spatial, and temporal perspectives.
- Existing trajectory control methods are limited to simple datasets and cannot handle complex trajectories and open-domain images well.
How:
- DragNUWA incorporates text, image, and trajectory controls using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training strategy.
- The Trajectory Sampler extracts trajectories directly from video optical flow to enable open-domain trajectory control.
- Multiscale Fusion integrates the controls at different resolutions in a UNet architecture.
- Adaptive Training first conditions on dense optical flow for stability, then on sparse trajectories for adaptation.
- Experiments show DragNUWA can generate high-quality videos following complex trajectories in open-domain images.
In summary, DragNUWA advances controllable video generation by integrating comprehensive text, image and trajectory controls with a focus on modeling complex open-domain trajectories. The three-pronged control and trajectory modeling approach enables fine-grained control over video semantics, appearance and motion.
Main Contributions
Here are the key contributions of this paper:
- Proposes DragNUWA, an end-to-end open-domain video generation model that integrates text, image and trajectory controls for fine-grained controllability.
- Introduces a trajectory modeling framework with three components:
  - Trajectory Sampler to enable open-domain control of arbitrary trajectories by sampling from video optical flow.
  - Multiscale Fusion to integrate trajectory control at different resolutions.
  - Adaptive Training strategy to first train on dense optical flow, then on sparse trajectories.
- Conducts extensive experiments to demonstrate DragNUWA’s superior trajectory controllability, including complex trajectories, variable lengths, multiple object control, and camera movements.
- Validates the necessity of incorporating all three controls (text, image and trajectory) for semantic, spatial and temporal control respectively.
- Achieves state-of-the-art video generation quality on open-domain images with complex trajectory control, which has been limited in previous works.
In summary, the main contributions are proposing DragNUWA for fine-grained controllable video generation, developing a trajectory modeling framework for open-domain control, and conducting extensive experiments that demonstrate advanced trajectory controllability and the necessity of the three control types.
Method Section
Here is a summary of the method section:
Trajectory Sampler (TS):
- Extracts dense optical flow from videos using Unimatch.
- Places uniformly spaced anchor points with small random perturbations to obtain the anchored optical flow.
- Samples tracking points based on flow intensity to get sparse flow with multiple trajectories.
- Enhances sparse flow with Gaussian filtering.
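For illustration, a minimal NumPy/SciPy sketch of this sampling step is shown below. It assumes dense flow of shape (T, H, W, 2) is already available; the anchor spacing, jitter range, and Gaussian width are placeholder values rather than the paper's exact settings.
import numpy as np
from scipy.ndimage import gaussian_filter

def sample_trajectories(dense_flow, anchor_stride=16, max_trajs=8, sigma=10.0, rng=None):
    # dense_flow: (T, H, W, 2) optical flow. Returns a sparse, Gaussian-smoothed trajectory map.
    rng = rng or np.random.default_rng()
    T, H, W, _ = dense_flow.shape
    # Regular grid of anchor points, jittered by a few pixels.
    ys = np.arange(anchor_stride // 2, H, anchor_stride)
    xs = np.arange(anchor_stride // 2, W, anchor_stride)
    ys = np.clip(ys + rng.integers(-2, 3, size=ys.size), 0, H - 1)
    xs = np.clip(xs + rng.integers(-2, 3, size=xs.size), 0, W - 1)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    # Average flow magnitude at each anchor decides how likely it is to be picked.
    anchor_flow = dense_flow[:, grid_y, grid_x, :]                     # (T, Ny, Nx, 2)
    intensity = np.linalg.norm(anchor_flow, axis=-1).mean(axis=0).ravel() + 1e-8
    n = int(rng.integers(1, max_trajs + 1))
    picks = rng.choice(intensity.size, size=n, replace=False, p=intensity / intensity.sum())
    # Scatter the chosen anchors' flow into an otherwise empty map.
    sparse = np.zeros_like(dense_flow)
    py, px = grid_y.ravel()[picks], grid_x.ravel()[picks]
    sparse[:, py, px, :] = dense_flow[:, py, px, :]
    # Blur spatially so each isolated point trajectory becomes a soft blob.
    return gaussian_filter(sparse, sigma=(0.0, sigma, sigma, 0.0))

# e.g. traj_map = sample_trajectories(np.zeros((16, 256, 256, 2), dtype=np.float32))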
Multiscale Fusion (MF):
- Encodes text with CLIP, image with autoencoder, trajectory with convolutions.
- Downsamples image and trajectory conditions to fuse with text at multiple UNet resolutions.
- Employs randomness during training to support different condition combinations.
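To make the fusion idea concrete, here is a hedged PyTorch-style sketch. The module name, the bilinear resizing, and the additive fusion are illustrative assumptions; the paper's actual architecture fuses the conditions inside the UNet's own down/up blocks, with text entering via attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleConditionFusion(nn.Module):
    # Downsample image/trajectory condition maps to each UNet resolution and add them
    # to that block's features; text conditioning is assumed to enter via cross-attention.
    def __init__(self, cond_channels, feat_channels):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(cond_channels, c, kernel_size=1) for c in feat_channels]
        )

    def forward(self, feats, image_cond, traj_cond):
        # feats: list of UNet feature maps, highest resolution first.
        cond = torch.cat([image_cond, traj_cond], dim=1)  # channels must sum to cond_channels
        out = []
        for feat, proj in zip(feats, self.proj):
            # Resize the condition map to this block's spatial resolution, project, and add.
            cond_s = F.interpolate(cond, size=feat.shape[-2:], mode="bilinear", align_corners=False)
            out.append(feat + proj(cond_s))
        return out

# Example with three resolutions of a 64x64 latent:
fusion = MultiscaleConditionFusion(cond_channels=6, feat_channels=[320, 640, 1280])
feats = [torch.randn(1, 320, 64, 64), torch.randn(1, 640, 32, 32), torch.randn(1, 1280, 16, 16)]
image_cond = torch.randn(1, 4, 64, 64)   # e.g. encoded first frame
traj_cond = torch.randn(1, 2, 64, 64)    # e.g. (dx, dy) trajectory map
fused = fusion(feats, image_cond, traj_cond)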
Adaptive Training (AT):
- Initially trains with text, image, and dense optical flow conditions for stability.
- Then trains with text, image, and sparse sampled trajectories for adaptation.
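A schematic version of this two-stage schedule is sketched below, assuming a generic latent-diffusion training loss; model.diffusion_loss, the batch format, and the stage lengths are placeholders rather than the paper's API.
def train_adaptive(model, optimizer, loader, trajectory_sampler, stage1_steps, stage2_steps):
    # Assumes loader yields (video, text, first_frame, dense_flow) batches indefinitely.
    step = 0
    for video, text, first_frame, dense_flow in loader:
        # Choose the motion condition according to the current stage.
        if step < stage1_steps:
            motion_cond = dense_flow                      # stage 1: dense flow stabilises training
        else:
            motion_cond = trajectory_sampler(dense_flow)  # stage 2: sparse, user-like trajectories
        loss = model.diffusion_loss(video, text, first_frame, motion_cond)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= stage1_steps + stage2_steps:
            break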
Inference:
- Encodes input text, image, trajectory.
- Iteratively denoises Gaussian noise into latent code using UNet conditioned on controls.
- Decodes latent code into video with autoencoder.
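As an illustration of the denoising loop, here is a minimal sketch using a DDPM scheduler from the diffusers library; the UNet call signature and the scheduler choice are assumptions for illustration, not the paper's exact sampler.
import torch
from diffusers import DDPMScheduler

@torch.no_grad()
def generate(unet, vae_decode, text_emb, image_emb, traj_emb, latent_shape, num_steps=50):
    scheduler = DDPMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_steps)
    latents = torch.randn(latent_shape)                   # start from pure Gaussian noise
    for t in scheduler.timesteps:
        # The UNet predicts the noise given the current latents and all three controls.
        noise_pred = unet(latents, t, text_emb, image_emb, traj_emb)
        # The scheduler removes one step of noise.
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return vae_decode(latents)                            # decode latents into video frames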
In summary, the method consists of 1) Trajectory sampling from videos 2) Multiscale fusion of controls in UNet 3) Adaptive training strategy and 4) Inference process to generate video from encoded controls. The focus is on open-domain trajectory modeling and fusing multi-modal controls for fine-grained video generation.
High-Level Pseudo Code
Here is the high-level pseudo code for the key components in this paper:
# Trajectory Sampler
optical_flow = get_optical_flow(video)
anchored_flow = add_anchor_points(optical_flow)
sparse_flow = sample_trajectories(anchored_flow)
trajectory = enhance_trajectory(sparse_flow)
# Multiscale Fusion
text_emb = encode_text(text)
image_emb = encode_image(image)
trajectory_emb = encode_trajectory(trajectory)
for i in range(num_blocks):
    # Downsample the image and trajectory conditions to this block's resolution.
    image_emb_down = downsample(image_emb)
    trajectory_emb_down = downsample(trajectory_emb)
    # Fuse text, image and trajectory conditions with the UNet features.
    fused = fuse(text_emb, image_emb_down, trajectory_emb_down)
    # Upsample for the decoder path of the UNet.
    image_emb = upsample(fused)
    trajectory_emb = upsample(fused)
# Adaptive Training
# Stage 1: condition on dense optical flow.
loss = model(video, text, image, optical_flow)
# Stage 2: condition on sparse sampled trajectories.
loss = model(video, text, image, trajectory)
# Inference
latent = gaussian_noise()
for t in timesteps:
    # Predict the noise at step t, conditioned on text, image and trajectory.
    pred_noise = model(latent, t, text, image, trajectory)
    # Remove one step of noise from the latent.
    latent = denoise(latent, pred_noise, t)
video = decode(latent)
In summary, the pseudo code covers trajectory sampling, multiscale fusion of controls, two-stage adaptive training, and the inference process to generate video from noise conditioned on encoded text, image and trajectory.
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the key components of this paper:
# Trajectory Sampler
import numpy as np
from scipy.ndimage import gaussian_filter
# Get dense optical flow (T, H, W, 2) with the pretrained Unimatch estimator.
optical_flow = Unimatch(video)
# Keep flow only at regularly spaced anchor points (small random perturbations can be added).
anchor_interval = 16
ys = np.arange(0, H, anchor_interval)
xs = np.arange(0, W, anchor_interval)
anchored_flow = optical_flow[:, ys[:, None], xs, :]
# Sample a random number of trajectories, favouring anchors with strong motion.
num_trajectories = np.random.randint(1, max_trajectories + 1)
intensities = np.linalg.norm(anchored_flow, axis=-1).mean(axis=0).ravel()
indices = np.random.choice(intensities.size, num_trajectories, replace=False,
                           p=intensities / intensities.sum())
# Keep flow only at the selected anchors, zero elsewhere.
ay, ax = np.unravel_index(indices, (len(ys), len(xs)))
sparse_flow = np.zeros_like(optical_flow)
sparse_flow[:, ys[ay], xs[ax], :] = optical_flow[:, ys[ay], xs[ax], :]
# Enhance the sparse flow with a spatial Gaussian filter (sigma 10, roughly a 99-tap kernel).
trajectory = gaussian_filter(sparse_flow, sigma=(0, 10, 10, 0))
# Multiscale Fusion
# Encode text with the CLIP text encoder.
text_emb = CLIP(text)
# Encode the first-frame image and repeat it along the time axis.
image_enc = EncoderCNN(image)
image_emb = image_enc.repeat(T, axis=0)
# Encode the trajectory map with a small convolutional network.
trajectory_emb = ConvNet(trajectory)
# UNet: fuse the three conditions at every resolution.
for i in range(num_blocks):
    # Downsample the image and trajectory conditions to this block's resolution.
    image_emb_down = MaxPool2D(image_emb)
    trajectory_emb_down = MaxPool2D(trajectory_emb)
    # Fuse the conditions with a learned projection.
    fused = Linear(concat(text_emb, image_emb_down, trajectory_emb_down))
    # Upsample for the decoder path of the UNet.
    image_emb = ConvTranspose2D(fused)
    trajectory_emb = ConvTranspose2D(fused)
# Output features of the fusion module.
features = fused
# Adaptive Training
# Stage 1: dense optical flow as the motion condition (stable training).
loss = DiffusionModel(video, text, image, optical_flow)
# Stage 2: sparse sampled trajectories (adapts the model to user-provided drags).
loss = DiffusionModel(video, text, image, trajectory)
# Inference
# Encode the user-provided controls.
text_emb = CLIP(text)
image_emb = EncoderCNN(image)
trajectory_emb = ConvNet(trajectory)
# Diffusion sampling: start from Gaussian noise and denoise step by step.
latent = NormalDistribution()
for t in range(num_denoising_steps):
    # Predict the noise at step t, conditioned on all three controls.
    pred_noise = DiffusionUNet(latent, t, text_emb, image_emb, trajectory_emb)
    # Remove one step of noise from the latent.
    latent = DiffusionStep(latent, pred_noise, t)
# Decode the final latent into video frames.
video = DecoderCNN(latent)
This covers the detailed logic and modules for trajectory sampling, multiscale fusion, adaptive training and inference for video generation conditioned on text, image and trajectory.