Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation

Authors: William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola

Abstract: Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.

What, Why and How

This paper proposes Feature Fields for Robotic Manipulation (F3RM), a method for enabling open-ended scene understanding and manipulation for robots using visual priors from 2D foundation models.

The key idea is to distill dense feature representations from pre-trained vision transformers (e.g. CLIP, DINO) into 3D neural radiance fields (NeRFs). This combines detailed 3D geometry from NeRF with rich semantics from the 2D features to get a unified scene representation called a Distilled Feature Field (DFF).

What:

  • Represent scenes as Distilled Feature Fields (DFFs) by training a NeRF to jointly predict RGB colors and reconstruct 2D feature maps from a foundation model like CLIP or DINO.
  • Represent 6-DOF manipulation demonstrations as embeddings of the local 3D feature field. Query the DFF to infer grasps and placements.
  • Extend to open-text commands by retrieving relevant demos using CLIP and optimizing grasps to match the text embedding.

Why:

  • 2D visual priors like CLIP contain rich knowledge of objects, parts, materials, attributes etc. Distilling them into 3D provides strong spatial and semantic priors to enable generalization.
  • DFFs combine detailed geometry from NeRF with semantics from 2D features for robust scene understanding.

How:

  • Extract dense patch features from CLIP using a modified MaskCLIP. Interpolate the positional encoding so CLIP can handle larger images (see the sketch after this list).
  • Jointly train NeRF RGB and feature regression. Use hierarchical hash grids for speed.
  • Sample query points around the gripper in its canonical frame, transform them by the demo pose, and embed the pose as concatenated alpha-weighted features.
  • For language-guided grasping, retrieve the demos whose features are closest to the CLIP text embedding, then optimize grasps to match the query text.
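
To make the positional-encoding step concrete, here is a minimal PyTorch sketch that bicubically interpolates a ViT's positional embeddings so CLIP can process larger or non-square images. The (1 + H*W, dim) layout with the class token first follows the standard CLIP ViT convention; the function name is illustrative, not from the paper.

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, grid_hw_old, grid_hw_new):
  """pos_embed: (1 + H_old*W_old, dim) with the class token at index 0."""
  cls_token, patch_pos = pos_embed[:1], pos_embed[1:]
  dim = pos_embed.shape[-1]
  (h_old, w_old), (h_new, w_new) = grid_hw_old, grid_hw_new
  # (N, dim) -> (1, dim, H, W) so bicubic interpolation can resize the token grid
  patch_pos = patch_pos.reshape(1, h_old, w_old, dim).permute(0, 3, 1, 2)
  patch_pos = F.interpolate(patch_pos, size=(h_new, w_new), mode="bicubic", align_corners=False)
  patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(h_new * w_new, dim)
  return torch.cat([cls_token, patch_pos], dim=0)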

The key results are open-ended few-shot grasping of novel objects given just a couple of demos, and manipulating objects specified via free-form natural language queries. DFFs enable the robot to leverage internet-scale visual priors within CLIP and DINO for generalization.

Main Contributions

Here are the main contributions of this paper:

  • Proposes Distilled Feature Fields (DFFs) as a method to infuse 2D visual priors into 3D scenes by training neural radiance fields to reconstruct features from foundation models.

  • Presents an approach to represent 6-DOF manipulation demonstrations by embedding local 3D features into a vector. This allows querying the feature field to infer grasps and placements.

  • Introduces a pipeline for open-text language-guided manipulation by retrieving relevant demos with CLIP and optimizing grasps to match the text embedding.

  • Provides a technique to extract dense patch features from CLIP using MaskCLIP which preserves alignment with language.

  • Demonstrates few-shot grasping and placing of novel objects with just a couple of demonstrations, indicating strong generalization.

  • Shows open-ended manipulation of objects specified via free-form natural language, including generalizing to new categories.

  • Analyzes tradeoffs between semantic features from CLIP and geometric features from DINO for manipulation.

  • Makes improvements to distillation speed using hierarchical hash grids and optimizing camera poses.

In summary, the key contributions are proposing DFFs to combine 2D vision priors with 3D geometry, using DFFs for few-shot and language-guided manipulation, and analyzing the benefits of different feature types like CLIP and DINO for robotics tasks. The paper shows how internet-scale self-supervised models can be leveraged for robotic manipulation via feature distillation.

Method Section

Here is a summary of the method section from the paper:

Feature Field Distillation

  • Extract dense patch features from RGB images using CLIP (Modified MaskCLIP) or DINO.
  • Jointly train a NeRF to reconstruct RGB colors and 2D feature maps (a feature-rendering sketch follows this list).
  • Use hierarchical hash grids from Nerfacto for faster training.
  • Cache features instead of upscaling to full image size to save memory.
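
As a concrete reference for the joint objective, here is a minimal PyTorch sketch of alpha-compositing features alongside RGB for a single ray and applying both reconstruction losses. The shapes, helper names, and feature-loss weight lam are assumptions; the paper's actual implementation builds on Nerfacto.

import torch

def render_ray(densities, colors, feats, deltas):
  """densities: (S,), colors: (S, 3), feats: (S, D), deltas: (S,) sample spacings."""
  alphas = 1.0 - torch.exp(-densities * deltas)  # per-sample opacity
  trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas[:-1]]), dim=0)
  weights = alphas * trans                       # volume-rendering weights
  rgb = (weights[:, None] * colors).sum(dim=0)   # composited pixel color
  feat = (weights[:, None] * feats).sum(dim=0)   # composited pixel feature
  return rgb, feat

def joint_loss(rgb_pred, feat_pred, rgb_gt, feat_gt, lam=1e-2):
  # lam is an assumed weighting of the feature term, not the paper's value
  return ((rgb_pred - rgb_gt) ** 2).mean() + lam * ((feat_pred - feat_gt) ** 2).mean()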

Representing 6-DOF Poses

  • Sample query points around gripper in canonical frame for each task.
  • For a demo pose, query feature field at transformed points.
  • Weight features by α (the opacity derived from the NeRF density) to encode occupancy (see the sketch after this list).
  • Concatenate weighted features into pose embedding vector.
  • Average embeddings over demos to get task embedding.
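
A minimal sketch of this pose embedding, assuming generic feature_field and alpha callables for querying the distilled field (these interfaces are illustrative, not the paper's code):

import torch

def pose_embedding(pose, query_points, feature_field, alpha):
  """pose: (4, 4) demo gripper pose; query_points: (N, 3) in the gripper's canonical frame."""
  ones = torch.ones(len(query_points), 1)
  world_pts = (pose @ torch.cat([query_points, ones], dim=-1).T).T[:, :3]
  feats = feature_field(world_pts)     # (N, D) distilled features
  weights = alpha(world_pts)[:, None]  # (N, 1) occupancy-like alpha values
  return (weights * feats).flatten()   # concatenated alpha-weighted features

def task_embedding(pose_embeddings):
  # Average the pose embeddings over the few demonstrations of a task
  return torch.stack(pose_embeddings).mean(dim=0)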

Inferring 6-DOF Poses

  • Sample voxel grid over workspace, filter using α and task similarity.
  • Sample rotations to get 6-DOF grasp proposals.
  • Optimize proposals using negative task similarity as the cost (a refinement sketch follows this list).
  • Prune collisions, rank poses, plan motions.
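
A minimal sketch of the gradient-based refinement, assuming the pose is parameterized as a 6-D vector (translation plus axis-angle rotation) and that embed_fn differentiably maps it to a pose embedding; both are illustrative assumptions rather than the paper's exact setup:

import torch
import torch.nn.functional as F

def refine_pose(init_pose_6d, task_emb, embed_fn, steps=100, lr=1e-2):
  """init_pose_6d: (6,) tensor; task_emb: task embedding averaged over demos."""
  pose = init_pose_6d.clone().requires_grad_(True)
  opt = torch.optim.Adam([pose], lr=lr)
  for _ in range(steps):
    loss = -F.cosine_similarity(embed_fn(pose), task_emb, dim=0)  # maximize similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
  return pose.detach()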

Language-Guided Manipulation

  • Embed user’s text query with CLIP.
  • Retrieve demos with closest average feature similarity.
  • Sample voxels that are more similar to the query text than to negative texts.
  • Optimize grasps using a weighted combination of the task cost and text similarity (see the scoring sketch after this list).
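
A minimal sketch of the text-side scoring with the OpenAI CLIP package; the model choice, negative phrases, and temperature are illustrative rather than the paper's exact settings:

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

def keep_voxels_matching_text(voxel_feats, query, negatives=("object", "things"), temperature=0.1, device="cpu"):
  """voxel_feats: (V, D) distilled CLIP features at voxel centers; D must match the text embedding size."""
  model, _ = clip.load("ViT-L/14", device=device)  # illustrative model choice
  with torch.no_grad():
    tokens = clip.tokenize([query, *negatives]).to(device)
    text = model.encode_text(tokens).float()
  text = F.normalize(text, dim=-1)
  feats = F.normalize(voxel_feats.float(), dim=-1)
  probs = (feats @ text.T / temperature).softmax(dim=-1)  # (V, 1 + num_negatives)
  return probs[:, 0] > probs[:, 1:].max(dim=-1).values    # keep-mask over voxels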

In summary, the key steps are distilling 2D features into a 3D field, representing poses and tasks as feature embeddings, optimizing proposals based on similarity, and incorporating natural language via CLIP for open-ended generalization.

High-Level Pseudo Code

Here is high-level pseudocode for the key steps in the paper:

# Distill 2D features into 3D field
images, cameras = load_scene_data()
dense_features = extract_features(images) 
 
nerf = NeRF() 
nerf.compile(loss=[rgb_loss, feature_loss])
nerf.fit(images, cameras, dense_features)
 
feature_field = nerf.feature_network
 
# Represent 6-DOF pose as feature embedding
demo_pose = get_demo_pose()
query_points = sample_queries_around_gripper()
 
shifted_queries = transform_points(query_points, demo_pose)
weighted_features = feature_field(shifted_queries) * alpha_values(shifted_queries)
 
pose_embedding = concatenate(weighted_features)
task_embedding = average_embeddings([pose_embedding1, pose_embedding2])
 
# Infer poses from feature field  
proposals = sample_voxel_grid() 
proposals = filter_collisions_and_freespace(proposals)
proposals = filter_by_task_similarity(proposals, task_embedding)
 
feasible_poses = []
for proposal in proposals:
  optimized = optimize_pose(proposal, task_embedding)
  if is_kinematically_feasible(optimized):
    feasible_poses.append(optimized)
execute_grasp(rank_by_task_similarity(feasible_poses, task_embedding)[0])
 
# Language-guided manipulation
text_query = get_user_query()  
text_embedding = CLIP(text_query)
 
closest_demos = retrieve_demos(text_embedding, demos)  # demos whose features best match the text
task_embedding = average_embeddings(closest_demos)
 
voxels = filter_voxels_by_text_similarity(text_embedding)
grasp = optimize_grasp(voxels, task_embedding, text_embedding)
execute_grasp(grasp)

The key steps are distilling features into a field, representing poses and tasks as embeddings, sampling and filtering grasp proposals, and optimizing the pose or incorporating language queries.

Detailed Pseudo Code

Here is more detailed pseudocode to implement the key components of the paper:

# Feature field distillation
import numpy as np  # used below for the alpha weighting
 
# Extract dense features from RGB images
def extract_features(images):
  model = load_foundation_model() # CLIP, DINO
  features = []
  
  for img in images:
    dense_feats = model(img) # MaskCLIP for CLIP
    features.append(dense_feats)
 
  return features
 
# Train NeRF 
class NeRF():
  
  def __init__(self):
    # Networks
    self.geometry_network = MLP() 
    self.color_network = MLP()
    self.feature_network = MLP()
 
    # Loss functions
    self.rgb_loss_fn = MSELoss()
    self.feature_loss_fn = MSELoss()
 
  def forward(self, x, d):
    sigma = self.geometry_network(x)
    color = self.color_network(x, d)
    features = self.feature_network(x)
    return (sigma, color, features)
 
  def compile(self, loss=None):
    # Default to the RGB and feature reconstruction losses defined in __init__
    self.loss_fns = loss or [self.rgb_loss_fn, self.feature_loss_fn]

  def fit(self, images, cameras, features, num_iterations=10000):
    optimizer = Adam(self.parameters())
    for i in range(num_iterations):
      # Sample rays along with their ground-truth pixel colors and 2D features
      rays, rgb_gt, feat_gt = sample_rays(images, cameras, features)
      rgb_pred, feat_pred = render_rays(self, rays)
      loss = self.loss_fns[0](rgb_pred, rgb_gt) + self.loss_fns[1](feat_pred, feat_gt)
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
 
# Represent 6-DOF pose as embedding
 
def sample_queries(num_points=100, variance=0.01):  # defaults are illustrative
   mean = [0, 0, 0]  # center of the gripper in its canonical frame
   return sample_gaussian(mean, variance, num_points)
 
def embed_pose(nerf, demo_pose, queries):

  shifted_queries = transform_points(queries, demo_pose)
  sigmas = nerf.geometry_network(shifted_queries)
  features = nerf.feature_network(shifted_queries)

  weighted_feats = features * (1 - np.exp(-sigmas))  # alpha weighting by occupancy

  return concatenate_features(weighted_feats)
 
def get_task_embedding(nerf, demo_poses):
  
  pose_embeddings = []
  for pose in demo_poses:
    queries = sample_queries()
    embedding = embed_pose(nerf, pose, queries)
    pose_embeddings.append(embedding)
 
  return average_embeddings(pose_embeddings)
 
# Infer poses from feature field
 
def generate_proposals(nerf, space, task_embedding):

  voxels = discretize_space(space, step_size=0.02)  # 2 cm voxel grid
  voxels = filter_freespace(voxels, nerf)
  voxels = filter_task_similarity(voxels, task_embedding)

  rotations = sample_rotations()  # e.g. 8 rotations per voxel
 
  proposals = []
  for v in voxels:
    for r in rotations:
      proposals.append((v, r))
  
  return proposals
 
def optimize_pose(proposal, task_embedding):

  pose = proposal
  for i in range(100):
    embedding = embed_pose(nerf, pose, queries)
    # Minimize the negative cosine similarity, i.e. maximize task similarity
    loss = -cosine_similarity(embedding, task_embedding)
    pose = optimizer.step(pose, loss)

  return pose
 
# Language guided manipulation
 
def retrieve_demos(text_embedding, demos):
  
  similarities = []
  for d in demos:
     embedding = get_task_embedding(nerf, [d])  # the demo's average feature embedding
     similarity = cosine_similarity(embedding, text_embedding)
     similarities.append((similarity, d))
 
  return top_k(similarities, k=2) # Get top 2
 
def filter_voxels(text_embedding, neg_texts):

  kept = []
  for v in voxels:
    feat = nerf.feature_network(v)
    pos_sim = cosine_similarity(feat, text_embedding)

    # Compare against embeddings of negative texts (e.g. "object", "stuff")
    neg_sims = [cosine_similarity(feat, CLIP(neg)) for neg in neg_texts]

    # Keep voxels that match the query better than any negative
    if pos_sim > max(neg_sims):
      kept.append(v)

  return kept
 
def optimize_grasp(poses, text_embedding, task_embedding):
  
  best_pose = None
  best_score = -float("inf")
 
  for pose in poses:
    embed = embed_pose(nerf, pose, queries)
    
    task_score = cosine_similarity(embed, task_embedding)
    # Compare the pose's average feature against the CLIP text embedding
    text_score = cosine_similarity(average_features(embed), text_embedding)
 
    score = text_score * task_score
    if score > best_score:
      best_pose = pose
      best_score = score
 
  return optimize_pose(best_pose, task_embedding)  
