CoTracker: It is Better to Track Together

Authors: Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht

Abstract: Methods for video motion prediction either estimate jointly the instantaneous motion of all points in a given video frame using optical flow or independently track the motion of individual points throughout the video. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance. In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. This architecture combines several ideas from the optical flow and tracking literature in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. It can track from one to several points jointly and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that outperforms state-of-the-art methods in almost all benchmarks.

What, Why and How

Here is a summary of the key points in the paper:

What:

  • The paper proposes CoTracker, a new neural network architecture for jointly tracking multiple points throughout a video.

Why:

  • Prior work either estimates dense motion fields for all points (optical flow) or tracks points independently. However, the motions of points on the same object are strongly correlated.

  • Jointly tracking points can exploit these correlations and improve accuracy, especially in long-term tracking, where independent trackers accumulate drift.

How:

  • CoTracker is a transformer network operating on a 2D grid of tokens, with one dimension being time and the other tracked points.

  • It iteratively refines an initial estimate of the tracks using learned attention layers to model interactions between points and across time.

  • It supports starting tracks from arbitrary points and times, and can be applied in a sliding-window fashion to long videos (see the sketch after this summary).

  • Training uses an unrolled objective over overlapping windows to handle long sequences.

  • Experiments show state-of-the-art tracking accuracy on multiple benchmarks, especially for joint tracking of multiple correlated points.

In summary, the key ideas are exploiting point correlations via joint modeling in a transformer, and an unrolled training procedure for long-term tracking. This leads to improved accuracy compared to prior optical flow and independent tracking methods.
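
To make the sliding-window idea concrete, below is a minimal sketch of processing a long video in overlapping windows, with each window initialized from the previous one's refined tracks. The window length, stride and the refine_window helper are illustrative assumptions, not the paper's exact interface.

# Minimal sketch: sliding-window tracking over a long video.
# WINDOW, STRIDE and refine_window are illustrative assumptions.
WINDOW = 8             # frames per window
STRIDE = WINDOW // 2   # consecutive windows overlap by half a window

def track_long_video(frames, tracks, refine_window):
  # tracks: per-frame list of current point estimates for the whole video
  for start in range(0, len(frames) - WINDOW + 1, STRIDE):
    window_frames = frames[start:start + WINDOW]
    window_tracks = tracks[start:start + WINDOW]
    # The overlap with the previous window carries over its refined
    # estimates, which initialize the current window.
    tracks[start:start + WINDOW] = refine_window(window_frames, window_tracks)
  return tracks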

Main Contributions

Here are the main contributions of this paper:

  • Proposes a new neural network architecture, CoTracker, for jointly tracking multiple points in videos over long time periods.

  • Models interactions between different points using transformer attention layers to exploit correlations and improve accuracy.

  • Supports starting tracks from arbitrary points and times in a video.

  • Applies the model in a sliding window fashion for tracking in very long videos.

  • Uses an unrolled training objective over overlapping windows to handle long sequences.

  • Achieves state-of-the-art results on multiple tracking benchmarks, especially when tracking groups of correlated points.

  • Provides extensive experiments analyzing design choices such as point selection strategies and training data, together with ablations of components such as the attention layers.

  • Introduces a rigorous evaluation protocol to ensure fair comparison to methods that track points independently.

In summary, the main contributions are:

  1. The CoTracker neural architecture for joint modeling of multiple point tracks.

  2. The unrolled training procedure for long sequences.

  3. Achieving improved tracking accuracy, especially for multiple correlated points.

  4. Rigorous experimental evaluation and analysis.

Method Section

Here is a summary of the method section:

  • CoTracker is a transformer network that operates on a 2D grid of tokens, with one dimension being time and the other being the tracked points.

  • The input tokens contain estimated track locations, visibility flags, appearance features, and correlation features comparing track and image features.

  • The output tokens contain updated locations and appearance features.

  • Tracking is done by iteratively refining an initial estimate of the full tracks using the transformer.

  • Tracks are initialized by copying the starting location to all times, setting visibility to 1, and sampling image features.

  • The model is applied in overlapping windows on long videos. The output of one window is used to initialize the next.

  • The transformer uses interleaved time and group (across points) attention blocks to model interactions (a minimal sketch follows this section's summary).

  • A CNN extracts image features at multiple scales. Track features are initialized from images and updated by the transformer.

  • Correlation features compare track and image features using inner products in a spatial neighborhood.

  • The model is trained with an unrolled objective over overlapping windows to handle long sequences.

  • Point selection strategies during evaluation ensure fair comparison to independent trackers.

In summary, CoTracker iteratively refines track estimates over windows using a transformer to exploit correlations between points, and is trained in an unrolled fashion over long sequences.
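
The interleaved attention mentioned above alternates attention over the time axis of each track with attention across the group of tracks at each time step. Below is a minimal PyTorch-style sketch under assumed tensor shapes (N tracks, T frames, C channels); it is not the paper's exact implementation.

import torch.nn as nn

class InterleavedAttentionBlock(nn.Module):
  # Minimal sketch: one attention pass over time within each track,
  # followed by one attention pass across tracks at each time step.
  def __init__(self, dim, heads=8):
    super().__init__()
    self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    self.group_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    self.norm1 = nn.LayerNorm(dim)
    self.norm2 = nn.LayerNorm(dim)

  def forward(self, tokens):  # tokens: (N, T, C) -- tracks x time x channels
    # Time attention: each track attends over its own T time steps.
    x = self.norm1(tokens)
    tokens = tokens + self.time_attn(x, x, x)[0]
    # Group attention: at each time step, the N tracks attend to each other.
    x = self.norm2(tokens).transpose(0, 1)                       # (T, N, C)
    tokens = tokens + self.group_attn(x, x, x)[0].transpose(0, 1)
    return tokens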

High-Level Pseudo Code

Here is high-level pseudocode for the CoTracker method:

# Input: a video and the points to track, each given as (x, y, t),
# where t is the frame at which tracking starts.
video = [frame1, frame2, ..., frameT]
start_points = [(x1, y1, t1), (x2, y2, t2), ...]

# Initialize tracks: copy each starting location to every frame
# and mark every frame as visible.
tracks, visibility = [], []
for x, y, t in start_points:
  tracks.append([(x, y)] * len(video))
  visibility.append([1.0] * len(video))

# Extract image features once with a CNN
image_features = [CNN(frame) for frame in video]

# Sample initial appearance features at each starting location
track_features = [sample(image_features[t], x, y) for x, y, t in start_points]

# Iteratively refine the track estimates with the transformer
for _ in range(num_iterations):

  # Assemble one token per track: positions, visibility flags,
  # appearance features and correlation features
  tokens = []
  for p, v, q in zip(tracks, visibility, track_features):
    c = correlation(p, q, image_features)
    tokens.append((p, v, q, c))

  # Apply the transformer to all tracks jointly
  refined_tokens = Transformer(tokens)

  # Update tracks and their appearance features from the output tokens
  for i, (p, q) in enumerate(refined_tokens):
    tracks[i] = p
    track_features[i] = q

# tracks now holds the final estimates

The key steps are:

  1. Initialize tracks based on start points
  2. Extract image features with a CNN
  3. Assemble input tokens for each track
  4. Iteratively refine tracks via the transformer
  5. Update tracks based on output tokens
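
The correlation features in step 3 compare a track's appearance feature with the image features in a neighborhood around the current track estimate. Below is a minimal single-scale PyTorch sketch assuming feature-map coordinates and a square neighborhood; the actual model uses multiple feature scales and may differ in its exact interface.

import torch
import torch.nn.functional as F

def correlation_features(track_feat, feat_map, center, radius=3):
  # track_feat: (C,) appearance feature of one track
  # feat_map:   (C, H, W) image features of one frame
  # center:     (x, y) current track estimate in feature-map coordinates
  # Returns inner products with features in a (2*radius+1)^2 neighborhood.
  C, H, W = feat_map.shape
  x, y = center
  offsets = torch.arange(-radius, radius + 1, dtype=torch.float32)
  gy, gx = torch.meshgrid(offsets + y, offsets + x, indexing="ij")
  # Normalize the sampling grid to [-1, 1] for grid_sample.
  grid = torch.stack([2 * gx / (W - 1) - 1, 2 * gy / (H - 1) - 1], dim=-1)
  # Bilinearly sample the neighborhood features: (C, 2r+1, 2r+1)
  neighborhood = F.grid_sample(feat_map[None], grid[None], align_corners=True)[0]
  # Inner product between the track feature and each sampled feature.
  return torch.einsum("c,chw->hw", track_feat, neighborhood).flatten()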

Detailed Pseudo Code

Here is more detailed pseudocode to implement the CoTracker method:

# Video frames and starting points (x, y, t)
frames = [f1, f2, ..., fT]
start_points = [(x1, y1, t1), ..., (xN, yN, tN)]
T, N = len(frames), len(start_points)

# Image features, extracted once with a CNN
image_feats = [CNN(f) for f in frames]

# Initialize tracks by copying each starting location to all T frames,
# with visibility set to 1 (visible)
tracks = [[(x, y)] * T for (x, y, t) in start_points]
vis = [[1.0] * T for _ in start_points]

# Initialize per-track appearance features by sampling the image
# features at the starting location
track_feats = [sample(image_feats[t], x, y) for (x, y, t) in start_points]

# Transformer: interleaved attention over time (within each track)
# and over the group of tracks (within each time step)
def Transformer(tokens):
  x = embed(tokens)
  for _ in range(num_blocks):
    x = time_attention(x)   # attend across the T time tokens of each track
    x = group_attention(x)  # attend across the N tracks at each time step
  return decode(x)          # updated positions and features per token

# Main refinement loop
for _ in range(MAX_ITERS):

  # Assemble one token per track: positions, visibility,
  # appearance features and correlation features
  tokens = []
  for i in range(N):
    p = tracks[i]
    v = vis[i]
    q = track_feats[i]
    c = correlation(p, q, image_feats)
    tokens.append((p, v, q, c))

  # Apply the transformer to refine all tracks jointly
  refined_tokens = Transformer(tokens)

  # Update tracks and appearance features from the output tokens
  for i, (p, q) in enumerate(refined_tokens):
    tracks[i] = p
    track_feats[i] = q

# tracks now holds the final estimates

The key steps are:

  1. Initialize tracks by copying each start point to all frames
  2. Extract CNN image features once
  3. Assemble input tokens with estimated track info
  4. Apply transformer to refine tokens
  5. Update tracks from output tokens
  6. Iterate transformer refinement
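
Finally, the unrolled training over overlapping windows can be sketched as a loop in which each window is initialized from the previous one and the per-window losses are summed before backpropagation. The helpers split_into_windows and track_loss, and the model interface with an init argument, are assumptions for illustration; the exact loss terms follow the paper and are not reproduced here.

# Minimal sketch of an unrolled training step over overlapping windows.
# split_into_windows and track_loss are assumed helpers, and the model is
# assumed to accept an initialization carried over from the previous window.
def unrolled_training_step(video, gt_tracks, model, optimizer):
  total_loss = 0.0
  carry = None  # estimate handed from one window to the next
  for window_frames, window_gt in split_into_windows(video, gt_tracks):
    pred_tracks, carry = model(window_frames, init=carry)
    total_loss = total_loss + track_loss(pred_tracks, window_gt)
  optimizer.zero_grad()
  total_loss.backward()
  optimizer.step()
  return total_loss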