INVE: Interactive Neural Video Editing
Authors: Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee
Abstract: We present Interactive Neural Video Editing (INVE), a real-time video editing solution, which can assist the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlas (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges we leverage and adopt highly efficient network architectures, powered by hash-grid encoding, to substantially improve processing speed. In addition, we learn bi-directional functions between image and atlas coordinates and introduce vectorized editing, which collectively enable a much greater variety of edits in both the atlas and the frames directly. Compared to LNA, our INVE reduces the learning and inference time by a factor of 5, and supports various video editing operations that LNA cannot. We showcase the superiority of INVE over LNA in interactive video editing through a comprehensive quantitative and qualitative analysis, highlighting its numerous advantages and improved performance. For video results, please see the project page.
What, Why and How
Here is a summary of the key points from the paper “INVE: Interactive Neural Video Editing”:
What:
- The paper presents a real-time video editing method called Interactive Neural Video Editing (INVE).
- INVE can propagate sparse frame edits made by a user to the entire video clip in a consistent manner.
Why:
- Existing atlas-based editing methods such as LNA are too slow for interactive use and propagate only a limited range of edits across the video.
- INVE aims to enable intuitive video editing for novice users by propagating edits instantly and consistently.
How:
- INVE is inspired by Layered Neural Atlas (LNA) but makes several improvements in speed and editability.
- It uses efficient network architectures and hash-grid encoding to achieve 5x faster training and inference than LNA.
- It learns bidirectional mappings between frames and atlases to enable rigid texture tracking effects.
- It supports layered editing of different types of effects independently in the atlas space.
- It introduces vectorized sketching to enable artifact-free sketch edits directly on frames.
In summary, INVE makes video editing more intuitive by propagating a variety of sparse frame edits to the full video quickly and consistently using neural representations and editing layers. The improvements over LNA allow more interactivity and expanded edit operations.
Main Contributions
Here are the main contributions of the INVE paper:
- INVE achieves 5x faster training and inference speed compared to existing methods like Layered Neural Atlas (LNA).
- It introduces inverse mapping to enable rigid texture tracking effects, i.e., propagating textures that rigidly follow a tracked point rather than deforming with the atlas warp.
- It supports editing multiple video effects independently via layered editing. Users can overlay editable layers for different types of edits.
- It introduces vectorized sketching for artifact-free sketch editing at the frame level: sketches are mapped as continuous vector representations rather than raster images (see the sketch at the end of this section).
- It enables interactive propagation of various types of edits, including sketches, textures, and local adjustments.
- It provides an intuitive video editing experience for novice users by propagating sparse frame edits consistently.
In summary, the main contributions are significantly faster performance, expanded edit operations via inverse mapping and vectorized sketching, independent layered editing, and an interactive interface for propagating diverse frame edits.
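To make the vectorized sketching contribution concrete, here is a minimal sketch of the underlying idea: only the control points of a user stroke are mapped between frame and atlas space, so the stroke stays a continuous polyline instead of being rasterized and warped. The function names, tensor layouts, and calling conventions below are illustrative assumptions, not the paper's actual interface; `forward_map_net` and `inverse_map_net` stand in for the trained mapping networks discussed in the method summary that follows.
```python
import torch

def map_sketch_to_atlas(polyline_xyt, forward_map_net):
    # polyline_xyt: (N, 3) tensor of (x, y, t) vertices of a user stroke on one frame.
    # forward_map_net: trained frame-to-atlas mapping, (x, y, t) -> (u, v).
    with torch.no_grad():
        return forward_map_net(polyline_xyt)          # (N, 2) atlas-space vertices

def sketch_vertices_on_frame(polyline_uv, t, inverse_map_net):
    # Re-project the atlas-space stroke vertices onto frame t. Because the stroke
    # remains a polyline, it can be re-drawn at any resolution on every frame
    # without the resampling artifacts of a rasterized edit.
    with torch.no_grad():
        t_col = torch.full((polyline_uv.shape[0], 1), float(t))
        return inverse_map_net(polyline_uv, t_col)    # (N, 2) frame-space vertices
```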
Method Section
Here is a summary of the method section from the INVE paper:
- It first reviews the Layered Neural Atlas (LNA) method, which represents a video using neural networks that map frame coordinates to a layered image atlas. Edits are made on the atlas and propagated back to the frames.
- To address LNA's slow speed, INVE uses efficient network architectures with hash-grid encoding, achieving 5x faster training and inference (a minimal encoding sketch follows at the end of this section).
- INVE introduces inverse mapping networks that map points from the atlas back to frame coordinates. This enables the rigid texture tracking that LNA lacks.
- It supports layered editing with separate layers for sketches, textures, and local adjustments. Edits can be made directly on layers or propagated from frames.
- Vectorized sketching represents sketches as continuous polygonal chains, avoiding the artifacts of LNA's rasterized sketches. The chains are mapped between frames and atlases.
- For training, INVE uses a reconstruction loss for the atlases and mapping networks. It trains for fewer iterations than LNA, since editing does not require a perfectly accurate atlas reconstruction.
- Implementation details such as network architecture, training iterations, batch size, and rendering speed are provided.
In summary, the method section explains how INVE builds on LNA with architectural improvements for speed, inverse mapping for tracking, layered editing support, vectorized sketching, and early stopping in training. Together these contributions enable more interactive and flexible video editing.
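Much of the speed-up comes from replacing LNA's deep coordinate MLPs with small MLPs on top of multi-resolution hash-grid encodings (in the spirit of Instant-NGP). The following is a minimal, illustrative PyTorch sketch of such an encoding for 2D coordinates; the level count, table size, and hashing constants are assumptions chosen for exposition, not the paper's exact configuration.
```python
import torch
import torch.nn as nn

class HashGridEncoding(nn.Module):
    # Minimal multi-resolution hash-grid encoding for 2D coordinates;
    # all hyperparameters here are illustrative only.
    def __init__(self, n_levels=8, features_per_level=2,
                 log2_table_size=16, base_res=16, growth=1.5):
        super().__init__()
        self.table_size = 2 ** log2_table_size
        self.resolutions = [int(base_res * growth ** l) for l in range(n_levels)]
        self.tables = nn.Parameter(
            1e-4 * torch.randn(n_levels, self.table_size, features_per_level))
        # Large prime used for spatial hashing of grid corners.
        self.register_buffer("primes", torch.tensor([1, 2654435761]))

    def forward(self, x):                                   # x: (B, 2) in [0, 1]
        feats = []
        corners = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], device=x.device)
        for level, res in enumerate(self.resolutions):
            pos = x * res
            cell = pos.floor().long()                       # lower-left grid corner
            frac = pos - cell                               # position inside the cell
            interp = 0.0
            for c in corners:                               # bilinear interpolation
                idx = ((cell + c) * self.primes).sum(-1) % self.table_size
                w = ((1 - c) + (2 * c - 1) * frac).prod(-1, keepdim=True)
                interp = interp + w * self.tables[level][idx]
            feats.append(interp)
        return torch.cat(feats, dim=-1)                     # (B, n_levels * features)

# A small MLP head on top of the encoding replaces a deep coordinate MLP.
encoder = HashGridEncoding()
head = nn.Sequential(nn.Linear(8 * 2, 64), nn.ReLU(), nn.Linear(64, 3))
rgb = head(encoder(torch.rand(4, 2)))                       # (4, 3)
```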
High-Level Pseudo Code
Here is the high-level pseudo code for the INVE paper:
```python
# Train the neural video representation
atlas_net = AtlasNetwork()
forward_map_net = ForwardMappingNetwork()
inverse_map_net = InverseMappingNetwork()

for it in range(num_iters):
    sample_pixels, sample_colors = SampleFramesAndPixels(video)

    # Reconstruction loss for the atlas and forward mapping
    prediction = RenderWithNetworks(sample_pixels, atlas_net, forward_map_net)
    loss = MSE(sample_colors, prediction)

    # Update the atlas and forward-mapping networks
    atlas_net.optimize(loss)
    forward_map_net.optimize(loss)

    # Supervised loss for the inverse mapping (atlas -> frame)
    atlas_coords = forward_map_net(sample_pixels)
    predicted_pixels = inverse_map_net(atlas_coords)
    inv_loss = MSE(sample_pixels, predicted_pixels)
    inverse_map_net.optimize(inv_loss)

# Editing: map a frame-space edit into atlas space and store it there
def EditFrame(frame_coords, edit):
    atlas_coords = forward_map_net(frame_coords)
    StoreEdit(atlas_coords, edit)

# Rendering: look up the stored edit for every pixel and composite it
def RenderEditedVideo():
    edited_frames = []
    for frame in frames:
        for pixel in frame:
            atlas_coords = forward_map_net(pixel)
            edit = GetEdit(atlas_coords)
            pixel.color = RenderWithEdit(pixel, edit)
        edited_frames.append(frame)
    return edited_frames
```
The key steps are training the neural atlas and mapping networks with reconstruction and supervised losses, editing by mapping frame edits to the atlas, and rendering edited frames by looking up edits for each pixel from the atlas.
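One plausible way to realize the `StoreEdit`/`GetEdit` steps above, consistent with atlas-based editing in general, is to keep each edit layer as an RGBA texture in atlas space, write edits into it once, and bilinearly sample it at every pixel's mapped atlas coordinates during rendering. This is a hedged sketch under assumed conventions (a fixed texture resolution, UVs in [-1, 1], nearest-texel writes), not the paper's actual implementation:
```python
import torch
import torch.nn.functional as F

class AtlasEditLayer:
    # An edit layer stored as an RGBA texture in atlas space.
    def __init__(self, resolution=1024):
        self.texture = torch.zeros(1, 4, resolution, resolution)   # (1, RGBA, H, W)

    def store(self, uv, rgba):
        # uv: (N, 2) atlas coordinates in [-1, 1]; rgba: (N, 4) edit colors.
        # Writes each edit sample into its nearest texel.
        res = self.texture.shape[-1]
        ij = ((uv + 1) * 0.5 * (res - 1)).round().long().clamp(0, res - 1)
        self.texture[0, :, ij[:, 1], ij[:, 0]] = rgba.t()

    def lookup(self, uv):
        # Bilinearly sample the stored edit at arbitrary atlas coordinates.
        grid = uv.reshape(1, 1, -1, 2)                  # grid_sample expects [-1, 1] coords
        out = F.grid_sample(self.texture, grid, align_corners=True)
        return out.view(4, -1).t()                      # (N, 4) RGBA per query

# Usage: edits mapped from one frame are written once into atlas space,
# then every frame's pixels query the same layer through their own UVs.
layer = AtlasEditLayer()
# layer.store(forward_map_net(edited_pixels), edit_rgba)
# rgba = layer.lookup(forward_map_net(all_pixels_of_frame_t))
```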
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the INVE paper:
```python
import numpy as np

# Neural network definitions (small MLPs on hash-grid encoded inputs)
class AtlasNetwork(MLP):
    # Atlas coordinates (u, v) -> atlas color
    def forward(self, uv):
        return rgb

class ForwardMappingNetwork(MLP):
    # Frame coordinates (x, y, t) -> atlas coordinates (u, v)
    def forward(self, xyt):
        return uv

class InverseMappingNetwork(MLP):
    # Atlas coordinates (u, v) plus time t -> frame coordinates (x, y)
    def forward(self, uv, t):
        return xy

class OpacityNetwork(MLP):
    # Frame coordinates (x, y, t) -> per-pixel layer opacity
    def forward(self, xyt):
        return a

atlas_network = AtlasNetwork()
forward_mapping = ForwardMappingNetwork()
inverse_mapping = InverseMappingNetwork()
opacity_network = OpacityNetwork()

# Training loop
for it in range(num_iters):
    # Sample a batch of pixels (x, y, t) from random frames
    frames = SampleFrames(video)
    pixels = SamplePixels(frames)

    # Forward mapping, atlas colors, and opacities
    uv = forward_mapping(pixels)
    rgb = atlas_network(uv)
    a = opacity_network(pixels)

    # Reconstruct the sampled pixels by compositing the layers
    xy, t = pixels[:, :2], pixels[:, 2]
    rec = Composition(rgb, a, xy, t)

    # Reconstruction loss against the observed pixel colors
    target = SampleColors(frames, pixels)
    loss = MSELoss(rec, target)

    # Update the atlas, forward-mapping, and opacity networks
    loss.backward()
    atlas_network.step()
    forward_mapping.step()
    opacity_network.step()

    # Supervise the inverse mapping to invert the forward mapping
    xy_pred = inverse_mapping(uv, t)
    inv_loss = MSELoss(xy_pred, xy)
    inverse_mapping.step(inv_loss)

# Editing: store a frame-space edit at its atlas-space location
def EditFrame(frame, edit):
    pixels = SamplePixels(frame)
    uv = forward_mapping(pixels)
    StoreEdit(uv, edit)

# Rendering: composite the reconstruction with the stored edits
def RenderEditedVideo():
    edited_frames = []
    for frame in frames:
        rec = np.zeros((H, W, 3))
        pixels = SamplePixels(frame)
        uv = forward_mapping(pixels)

        # Look up edits stored in atlas space
        edit = GetEdit(uv)

        # Blend the reconstructed colors with the edit layer
        rgb = atlas_network(uv)
        a = opacity_network(pixels)
        rec += Composition(rgb, a, pixels, frame.t)
        rec += Composition(edit, 1.0, pixels, frame.t)
        edited_frames.append(rec)
    return edited_frames
```
The key additions are:
- Defining the different neural network modules
- Sampling pixels from frames
- Compositing RGB values with opacity
- Storing edits in the atlas space
- Looking up edits for each pixel during rendering
This provides a more detailed overview of how the different components fit together.
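As a small usage-style illustration of the layered editing described above, the sketch below alpha-composites several independent edit layers (e.g., a sketch layer, a texture layer, a local adjustment layer) over the reconstructed frame. The back-to-front ordering and per-layer RGBA representation are assumptions for illustration; the paper's compositing interface may differ.
```python
import torch

def composite_edit_layers(reconstruction, edit_layers):
    # reconstruction: (H, W, 3) frame reconstructed from the neural atlas.
    # edit_layers: list of (H, W, 4) RGBA tensors, already resampled into this
    # frame via the forward/inverse mappings, ordered back-to-front.
    out = reconstruction.clone()
    for layer in edit_layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out      # standard "over" compositing
    return out
```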