INVE: Interactive Neural Video Editing
Authors: Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee
Abstract: We present Interactive Neural Video Editing (INVE), a real-time video editing solution, which can assist the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlas (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges we leverage and adopt highly efficient network architectures, powered by hash-grid encoding, to substantially improve processing speed. In addition, we learn bi-directional functions between image and atlas coordinates and introduce vectorized editing, which collectively enable a much greater variety of edits in both the atlas and the frames directly. Compared to LNA, our INVE reduces the learning and inference time by a factor of 5, and supports various video editing operations that LNA cannot. We showcase the superiority of INVE over LNA in interactive video editing through a comprehensive quantitative and qualitative analysis, highlighting its numerous advantages and improved performance. For video results, please see the project page.
What, Why and How
Here is a summary of the key points from the paper “INVE: Interactive Neural Video Editing”:
What:
- The paper presents a real-time video editing method called Interactive Neural Video Editing (INVE).
- INVE can propagate sparse frame edits made by a user to the entire video clip in a consistent manner.
Why:
- Existing atlas-based editing methods such as LNA are too slow for interactive use and propagate only a limited range of edits across the video.
- INVE aims to enable intuitive video editing for novice users by propagating edits instantly and consistently.
How:
- INVE is inspired by Layered Neural Atlas (LNA) but makes several improvements in speed and editability.
- It uses efficient network architectures and hash-grid encoding to achieve 5x faster training and inference than LNA.
- It learns bidirectional mappings between frames and atlases to enable rigid texture tracking effects.
- It supports layered editing of different types of effects independently in the atlas space.
- It introduces vectorized sketching to enable artifact-free sketch edits directly on frames.
In summary, INVE makes video editing more intuitive by propagating a variety of sparse frame edits to the full video quickly and consistently using neural representations and editing layers. The improvements over LNA allow more interactivity and expanded edit operations.
Main Contributions
Here are the main contributions of the INVE paper:
- INVE achieves 5x faster training and inference speed compared to existing methods like Layered Neural Atlas (LNA).
- It introduces inverse mapping to enable rigid texture tracking effects, i.e., propagating textures that rigidly follow a tracked point rather than deforming with the atlas warp.
- It supports editing multiple video effects independently via layered editing. Users can overlay editable layers for different types of edits.
- It introduces vectorized sketching for artifact-free sketch editing at the frame level: sketches are mapped as continuous vector representations rather than raster images (see the sketch at the end of this section).
- It enables interactive propagation of various types of edits, including sketches, textures, and local adjustments.
- It provides an intuitive video editing experience for novice users by propagating sparse frame edits consistently.
In summary, the main contributions are significantly faster performance, expanded edit operations via inverse mapping and vectorized sketching, independent layered editing, and an interactive interface for propagating diverse frame edits.
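To make the vectorized sketching contribution concrete, here is a minimal sketch of the underlying idea: only the control points of a user stroke are mapped between frame and atlas space, so the stroke stays a continuous polyline instead of being rasterized and warped. The function names, tensor layouts, and calling conventions below are illustrative assumptions, not the paper's actual interface; `forward_map_net` and `inverse_map_net` stand in for the trained mapping networks discussed in the method summary that follows.
```python
import torch

def map_sketch_to_atlas(polyline_xyt, forward_map_net):
    # polyline_xyt: (N, 3) tensor of (x, y, t) vertices of a user stroke on one frame.
    # forward_map_net: trained frame-to-atlas mapping, (x, y, t) -> (u, v).
    with torch.no_grad():
        return forward_map_net(polyline_xyt)          # (N, 2) atlas-space vertices

def sketch_vertices_on_frame(polyline_uv, t, inverse_map_net):
    # Re-project the atlas-space stroke vertices onto frame t. Because the stroke
    # remains a polyline, it can be re-drawn at any resolution on every frame
    # without the resampling artifacts of a rasterized edit.
    with torch.no_grad():
        t_col = torch.full((polyline_uv.shape[0], 1), float(t))
        return inverse_map_net(polyline_uv, t_col)    # (N, 2) frame-space vertices
```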
Method Section
Here is a summary of the method section from the INVE paper:
- It first reviews the Layered Neural Atlas (LNA) method, which represents a video using neural networks that map frame coordinates to a layered image atlas. Edits are made on the atlas and propagated back to the frames.
- To address LNA's slow speed, INVE uses efficient network architectures with hash-grid encoding, achieving 5x faster training and inference (a minimal encoding sketch follows at the end of this section).
- INVE introduces inverse mapping networks that map points from the atlas back to frame coordinates. This enables the rigid texture tracking that LNA lacks.
- It supports layered editing with separate layers for sketches, textures, and local adjustments. Edits can be made directly on layers or propagated from frames.
- Vectorized sketching represents sketches as continuous polygonal chains, avoiding the artifacts of LNA's rasterized sketches. The chains are mapped between frames and atlases.
- For training, INVE uses a reconstruction loss for the atlases and mapping networks. It trains for fewer iterations than LNA, since editing does not require a perfectly accurate atlas reconstruction.
- Implementation details such as network architecture, training iterations, batch size, and rendering speed are provided.
In summary, the method section explains how INVE builds on LNA with architectural improvements for speed, inverse mapping for tracking, layered editing support, vectorized sketching, and early stopping in training. Together these contributions enable more interactive and flexible video editing.
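Much of the speed-up comes from replacing LNA's deep coordinate MLPs with small MLPs on top of multi-resolution hash-grid encodings (in the spirit of Instant-NGP). The following is a minimal, illustrative PyTorch sketch of such an encoding for 2D coordinates; the level count, table size, and hashing constants are assumptions chosen for exposition, not the paper's exact configuration.
```python
import torch
import torch.nn as nn

class HashGridEncoding(nn.Module):
    # Minimal multi-resolution hash-grid encoding for 2D coordinates;
    # all hyperparameters here are illustrative only.
    def __init__(self, n_levels=8, features_per_level=2,
                 log2_table_size=16, base_res=16, growth=1.5):
        super().__init__()
        self.table_size = 2 ** log2_table_size
        self.resolutions = [int(base_res * growth ** l) for l in range(n_levels)]
        self.tables = nn.Parameter(
            1e-4 * torch.randn(n_levels, self.table_size, features_per_level))
        # Large prime used for spatial hashing of grid corners.
        self.register_buffer("primes", torch.tensor([1, 2654435761]))

    def forward(self, x):                                   # x: (B, 2) in [0, 1]
        feats = []
        corners = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], device=x.device)
        for level, res in enumerate(self.resolutions):
            pos = x * res
            cell = pos.floor().long()                       # lower-left grid corner
            frac = pos - cell                               # position inside the cell
            interp = 0.0
            for c in corners:                               # bilinear interpolation
                idx = ((cell + c) * self.primes).sum(-1) % self.table_size
                w = ((1 - c) + (2 * c - 1) * frac).prod(-1, keepdim=True)
                interp = interp + w * self.tables[level][idx]
            feats.append(interp)
        return torch.cat(feats, dim=-1)                     # (B, n_levels * features)

# A small MLP head on top of the encoding replaces a deep coordinate MLP.
encoder = HashGridEncoding()
head = nn.Sequential(nn.Linear(8 * 2, 64), nn.ReLU(), nn.Linear(64, 3))
rgb = head(encoder(torch.rand(4, 2)))                       # (4, 3)
```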
High-Level Pseudo Code
Here is the high-level pseudo code for the INVE paper:
```python
# Train the neural video representation
atlas_net = AtlasNetwork()
forward_map_net = ForwardMappingNetwork()
inverse_map_net = InverseMappingNetwork()

for it in range(num_iters):
    sample_pixels, sample_colors = SampleFramesAndPixels(video)

    # Reconstruction loss for the atlas and forward mapping
    prediction = RenderWithNetworks(sample_pixels, atlas_net, forward_map_net)
    loss = MSE(sample_colors, prediction)

    # Update the atlas and forward-mapping networks
    atlas_net.optimize(loss)
    forward_map_net.optimize(loss)

    # Supervised loss for the inverse mapping (atlas -> frame)
    atlas_coords = forward_map_net(sample_pixels)
    predicted_pixels = inverse_map_net(atlas_coords)
    inv_loss = MSE(sample_pixels, predicted_pixels)
    inverse_map_net.optimize(inv_loss)

# Editing: map a frame-space edit into atlas space and store it there
def EditFrame(frame_coords, edit):
    atlas_coords = forward_map_net(frame_coords)
    StoreEdit(atlas_coords, edit)

# Rendering: look up the stored edit for every pixel and composite it
def RenderEditedVideo():
    edited_frames = []
    for frame in frames:
        for pixel in frame:
            atlas_coords = forward_map_net(pixel)
            edit = GetEdit(atlas_coords)
            pixel.color = RenderWithEdit(pixel, edit)
        edited_frames.append(frame)
    return edited_frames
```
The key steps are training the neural atlas and mapping networks with reconstruction and supervised losses, editing by mapping frame edits to the atlas, and rendering edited frames by looking up edits for each pixel from the atlas.
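One plausible way to realize the `StoreEdit`/`GetEdit` steps above, consistent with atlas-based editing in general, is to keep each edit layer as an RGBA texture in atlas space, write edits into it once, and bilinearly sample it at every pixel's mapped atlas coordinates during rendering. This is a hedged sketch under assumed conventions (a fixed texture resolution, UVs in [-1, 1], nearest-texel writes), not the paper's actual implementation:
```python
import torch
import torch.nn.functional as F

class AtlasEditLayer:
    # An edit layer stored as an RGBA texture in atlas space.
    def __init__(self, resolution=1024):
        self.texture = torch.zeros(1, 4, resolution, resolution)   # (1, RGBA, H, W)

    def store(self, uv, rgba):
        # uv: (N, 2) atlas coordinates in [-1, 1]; rgba: (N, 4) edit colors.
        # Writes each edit sample into its nearest texel.
        res = self.texture.shape[-1]
        ij = ((uv + 1) * 0.5 * (res - 1)).round().long().clamp(0, res - 1)
        self.texture[0, :, ij[:, 1], ij[:, 0]] = rgba.t()

    def lookup(self, uv):
        # Bilinearly sample the stored edit at arbitrary atlas coordinates.
        grid = uv.reshape(1, 1, -1, 2)                  # grid_sample expects [-1, 1] coords
        out = F.grid_sample(self.texture, grid, align_corners=True)
        return out.view(4, -1).t()                      # (N, 4) RGBA per query

# Usage: edits mapped from one frame are written once into atlas space,
# then every frame's pixels query the same layer through their own UVs.
layer = AtlasEditLayer()
# layer.store(forward_map_net(edited_pixels), edit_rgba)
# rgba = layer.lookup(forward_map_net(all_pixels_of_frame_t))
```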
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the INVE paper:
```python
import numpy as np

# Neural network definitions (small MLPs on hash-grid encoded inputs)
class AtlasNetwork(MLP):
    # Atlas coordinates (u, v) -> atlas color
    def forward(self, uv):
        return rgb

class ForwardMappingNetwork(MLP):
    # Frame coordinates (x, y, t) -> atlas coordinates (u, v)
    def forward(self, xyt):
        return uv

class InverseMappingNetwork(MLP):
    # Atlas coordinates (u, v) plus time t -> frame coordinates (x, y)
    def forward(self, uv, t):
        return xy

class OpacityNetwork(MLP):
    # Frame coordinates (x, y, t) -> per-pixel layer opacity
    def forward(self, xyt):
        return a

atlas_network = AtlasNetwork()
forward_mapping = ForwardMappingNetwork()
inverse_mapping = InverseMappingNetwork()
opacity_network = OpacityNetwork()

# Training loop
for it in range(num_iters):
    # Sample a batch of pixels (x, y, t) from random frames
    frames = SampleFrames(video)
    pixels = SamplePixels(frames)

    # Forward mapping, atlas colors, and opacities
    uv = forward_mapping(pixels)
    rgb = atlas_network(uv)
    a = opacity_network(pixels)

    # Reconstruct the sampled pixels by compositing the layers
    xy, t = pixels[:, :2], pixels[:, 2]
    rec = Composition(rgb, a, xy, t)

    # Reconstruction loss against the observed pixel colors
    target = SampleColors(frames, pixels)
    loss = MSELoss(rec, target)

    # Update the atlas, forward-mapping, and opacity networks
    loss.backward()
    atlas_network.step()
    forward_mapping.step()
    opacity_network.step()

    # Supervise the inverse mapping to invert the forward mapping
    xy_pred = inverse_mapping(uv, t)
    inv_loss = MSELoss(xy_pred, xy)
    inverse_mapping.step(inv_loss)

# Editing: store a frame-space edit at its atlas-space location
def EditFrame(frame, edit):
    pixels = SamplePixels(frame)
    uv = forward_mapping(pixels)
    StoreEdit(uv, edit)

# Rendering: composite the reconstruction with the stored edits
def RenderEditedVideo():
    edited_frames = []
    for frame in frames:
        rec = np.zeros((H, W, 3))
        pixels = SamplePixels(frame)
        uv = forward_mapping(pixels)

        # Look up edits stored in atlas space
        edit = GetEdit(uv)

        # Blend the reconstructed colors with the edit layer
        rgb = atlas_network(uv)
        a = opacity_network(pixels)
        rec += Composition(rgb, a, pixels, frame.t)
        rec += Composition(edit, 1.0, pixels, frame.t)
        edited_frames.append(rec)
    return edited_frames
```
The key additions are:
- Defining the different neural network modules
- Sampling pixels from frames
- Compositing RGB values with opacity
- Storing edits in the atlas space
- Looking up edits for each pixel during rendering
This provides a more detailed overview of how the different components fit together.
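As a small usage-style illustration of the layered editing described above, the sketch below alpha-composites several independent edit layers (e.g., a sketch layer, a texture layer, a local adjustment layer) over the reconstructed frame. The back-to-front ordering and per-layer RGBA representation are assumptions for illustration; the paper's compositing interface may differ.
```python
import torch

def composite_edit_layers(reconstruction, edit_layers):
    # reconstruction: (H, W, 3) frame reconstructed from the neural atlas.
    # edit_layers: list of (H, W, 4) RGBA tensors, already resampled into this
    # frame via the forward/inverse mappings, ordered back-to-front.
    out = reconstruction.clone()
    for layer in edit_layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out      # standard "over" compositing
    return out
```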