CLIPascene: Scene Sketching with Different Types and Levels of Abstraction

Authors: Yael Vinker, Yuval Alaluf, Daniel Cohen-Or, Ariel Shamir

Abstract: In this paper, we present a method for converting a given scene image into a sketch using different types and multiple levels of abstraction. We distinguish between two types of abstraction. The first considers the fidelity of the sketch, varying its representation from a more precise portrayal of the input to a looser depiction. The second is defined by the visual simplicity of the sketch, moving from a detailed depiction to a sparse sketch. Using an explicit disentanglement into two abstraction axes — and multiple levels for each one — provides users additional control over selecting the desired sketch based on their personal goals and preferences. To form a sketch at a given level of fidelity and simplification, we train two MLP networks. The first network learns the desired placement of strokes, while the second network learns to gradually remove strokes from the sketch without harming its recognizability and semantics. Our approach is able to generate sketches of complex scenes including those with complex backgrounds (e.g., natural and urban settings) and subjects (e.g., animals and people) while depicting gradual abstractions of the input scene in terms of fidelity and simplicity.

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper presents a method for converting a scene image into a sketch with different types and levels of abstraction.

  • It disentangles sketch abstraction into two axes: fidelity and simplicity. Fidelity refers to how precisely the sketch follows the structure and geometry of the input image; simplicity refers to how sparse or detailed the sketch's depiction is.

  • The method generates a matrix of sketches spanning different levels of fidelity and simplicity. This allows selecting a sketch based on desired goals.

Why:

  • Abstract, minimal sketches are visually pleasing and convey ideas effectively by emphasizing key elements. Computational sketching methods should therefore be able to capture varying degrees of abstraction.

  • Explicitly disentangling the different types of abstraction provides more control over sketch generation than using the number of strokes alone to define the abstraction level.

  • Generating a matrix spanning fidelity and simplicity axes allows selecting a sketch suited to goals and preferences.

How:

  • Uses features of a pretrained CLIP model to guide sketch generation: shallow layers encourage geometric fidelity, while deeper layers yield looser, semantics-focused sketches (a rough sketch of such a loss follows this summary).

  • Trains one MLP network to learn stroke placements and a second to gradually remove strokes for simplicity, refining both through an iterative training process.

  • Defines an exponential function relating the two losses to achieve perceptually smooth simplification between sketch levels.

  • Splits scene into foreground and background to sketch separately for better control over final sketch.

In summary, the key ideas are disentangling abstraction into fidelity and simplicity axes, using CLIP and MLP networks to span levels of each axis, and decomposing the scene for finer control over the final sketch abstraction.
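To make the fidelity axis concrete, here is a rough sketch of what such a layer-wise CLIP loss could look like. It assumes the OpenAI clip package's ViT internals (visual.transformer.resblocks) and an L2 distance between activations; the exact layer indices, model variant, and distance used in the paper may differ.

# Illustrative sketch (not the paper's code): a fidelity loss comparing
# intermediate CLIP-ViT activations of the image and the rendered sketch.
import torch
import clip

model, preprocess = clip.load("ViT-B/32", device="cpu")
model.eval()

def vit_activation(x, layer):
    """Activation of CLIP-ViT residual block `layer` for a preprocessed batch x."""
    feats = {}

    def hook(module, inputs, output):
        feats["out"] = output

    handle = model.visual.transformer.resblocks[layer].register_forward_hook(hook)
    model.encode_image(x)  # output discarded; the hook captures the activation
    handle.remove()
    return feats["out"]

def fidelity_loss(image, sketch, layer):
    # Shallow layers (e.g. 2) keep the geometry tight; deeper layers
    # (e.g. 11) allow looser, semantics-focused sketches.
    target = vit_activation(image, layer).detach()
    current = vit_activation(sketch, layer)  # gradients flow back to the strokes
    return torch.nn.functional.mse_loss(current, target)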

Main Contributions

Based on my reading, the main contributions of this paper are:

  1. Disentangling sketch abstraction into two distinct axes - fidelity and simplicity. This provides more explicit control over the type of abstraction compared to prior works.

  2. Generating a matrix of sketches spanning multiple levels of fidelity and simplicity for a given input image. This allows users to select a sketch based on their goals and preferences.

  3. Using CLIP model features from different layers to guide sketch generation along the fidelity axis, with shallow layers yielding precise sketches and deeper layers looser, semantics-oriented ones.

  4. Training MLP networks to learn stroke placements and perform gradual simplification along the simplicity axis in an iterative manner.

  5. Defining an exponential function between losses to achieve perceptually smooth transitions between simplicity levels.

  6. Decomposing the input scene into foreground and background to sketch separately, allowing better control over the final sketch.

  7. Generating vector-based sketches that can be easily edited and manipulated.

  8. Evaluating the approach on a variety of challenging scene images with complex backgrounds and subjects.

  9. Providing quantitative analysis to validate the method’s ability to span fidelity and simplicity axes.

  10. Comparisons to prior sketch generation methods, demonstrating the approach's ability to handle complex scenes.

In summary, the key contributions are the explicit disentanglement of abstraction axes, use of CLIP and MLP networks to span levels, scene decomposition, and quantitative evaluations on complex scene images.

Method Section

Here is a summary of the method section from the paper:

  • Represent a sketch as a set of Bézier curves defined by their control points. An MLP (MLPloc) predicts offsets to the initial control points (see the sketch at the end of this section).

  • Train the MLP with a CLIP similarity loss between image and sketch activations, using a ViT-based CLIP rather than a ResNet-based one for better global context.

  • Generate sketches along the fidelity axis by using different CLIP-ViT layers in the loss: shallow layers preserve geometry, deeper layers capture semantics.

  • Generate simplified sketches along the simplicity axis by training another MLP (MLPsimp) to predict stroke removal probabilities, jointly trained with a sparsity loss that encourages fewer strokes.

  • Balance the CLIP similarity loss against the sparsity loss using an exponential function so that successive simplifications appear perceptually smooth. Sketches are generated iteratively, each level starting from the previous one.

  • Decompose the scene into foreground object(s) and background, and sketch each separately for better control.

  • For the foreground, additionally use layer 4 of CLIP-ViT to help preserve details. The foreground and background sketches are then merged.

  • Overall, the method trains MLPs guided by CLIP to generate sketches along the fidelity and simplicity axes, using different CLIP layers and a sparsity loss. Scene decomposition provides finer control.

In summary, the key ideas are using CLIP and MLPs to span abstraction levels across fidelity and simplicity, scene decomposition, balancing losses for smooth simplification, and leveraging different CLIP layers to control precision vs semantics. This allows generating a range of sketch abstractions for complex scenes.
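Before moving to the pseudo code, here is a small sketch of the stroke parameterization and location network described above. The stroke count, MLP width, and flattened (x, y) layout are illustrative assumptions, not the paper's exact settings.

# Illustrative sketch of the stroke parameterization and location network:
# each stroke is a cubic Bezier curve with four 2D control points, and an
# MLP predicts per-point offsets from the initial points.
import torch
import torch.nn as nn

NUM_STROKES, POINTS_PER_STROKE = 64, 4
DIM = NUM_STROKES * POINTS_PER_STROKE * 2  # flattened (x, y) control points

class MLPLoc(nn.Module):
    def __init__(self, dim=DIM, hidden=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z):
        # Input: flattened initial control points; output: per-point offsets.
        return self.net(z)

Z = torch.rand(DIM)      # initial control points, normalized to [0, 1]
Z_new = Z + MLPLoc()(Z)  # displaced points, fed to the differentiable renderer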

High-Level Pseudo Code

Here is the high-level pseudo code for the method presented in the paper:

# Input: image I

# Initialize strokes Z

# Fidelity axis:
fidelity_sketches = []
for layer in [2, 7, 8, 11]:

    # Train MLPloc to update stroke locations, guided by
    # CLIP activations from the given layer
    MLPloc = train_MLP(Z, I, layer)

    # Get sketch S by rendering the updated strokes
    S = render(Z + MLPloc(Z))

    # Add S to the fidelity axis
    fidelity_sketches.append(S)

# Simplicity axis:
simplicity_sketches = []
for S in fidelity_sketches:

    # Initialize simplicity MLP
    MLPsimp = init_MLP()

    for i in range(num_steps):

        # Get stroke removal probabilities
        P = MLPsimp(Z)

        # Get the simplified sketch
        S_simple = render(S, P)

        # Train MLPsimp and fine-tune MLPloc
        MLPsimp, MLPloc = train(S_simple, S, I)

        # The next simplification level starts from this sketch
        S = S_simple

    # Add the simplified sketch to the simplicity axis
    simplicity_sketches.append(S_simple)

# Generate final matrix of sketches
abstraction_matrix = construct_matrix(fidelity_sketches,
                                      simplicity_sketches)
 

In summary, the key steps are:

  1. Generating sketches along the fidelity axis by training MLPloc against different CLIP layers
  2. Simplifying each sketch by iteratively training MLPsimp with a sparsity loss while fine-tuning MLPloc (sketched below)
  3. Constructing the final matrix of sketches spanning the fidelity and simplicity axes
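Here is a rough sketch of the joint objective used in step 2. The sigmoid probability head, the L1-style sparsity penalty, and the exponential weight schedule are one plausible reading of the paper's description rather than its exact formulas.

# Illustrative sketch of the simplification objective: MLPsimp maps stroke
# parameters to per-stroke keep probabilities, a sparsity term penalizes
# keeping strokes, and an exponential schedule balances the losses per level.
import torch
import torch.nn as nn

class MLPSimp(nn.Module):
    def __init__(self, dim, num_strokes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_strokes),
        )

    def forward(self, z):
        # Per-stroke probability of keeping each stroke.
        return torch.sigmoid(self.net(z))

def sparsity_loss(keep_probs):
    # Fewer strokes kept -> lower loss (a simple L1-style penalty).
    return keep_probs.mean()

def sparsity_weight(k, num_levels=4, w_min=0.1, w_max=2.0):
    # Exponential schedule over simplification levels, so successive
    # levels drop strokes at a perceptually similar rate.
    return w_min * (w_max / w_min) ** (k / (num_levels - 1))

# Total loss at level k: L = L_clip + sparsity_weight(k) * sparsity_loss(P)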

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the method presented in the paper:

import clip      # pretrained CLIP model
import mlp       # MLP networks
import renderer  # differentiable vector renderer (e.g., diffvg)

# Input:
I = load_image("scene.png")   # placeholder loader for the input image
layers = [2, 7, 8, 11]        # one CLIP-ViT layer per fidelity level
num_simplicity_levels = 4
num_epochs = 100
num_simp_steps = 100

# Initialize strokes
Z = initialize_strokes(I)

# Store one sketch and one location network per fidelity level
fidelity_sketches = []
location_mlps = []

# Fidelity axis
for l in layers:

    # Define and initialize MLPloc
    mlp_loc = mlp.MLP()
    mlp_loc.init_weights()

    # Train
    for i in range(num_epochs):

        # Get updated strokes
        delta_Z = mlp_loc(Z)
        Z_new = Z + delta_Z

        # Render sketch
        S = renderer.render(Z_new)

        # CLIP loss between layer-l activations of image and sketch
        L_clip = clip_loss(I, S, l)

        # Update mlp_loc
        mlp_loc.backward(L_clip)
        mlp_loc.update()

    # Add sketch and keep its location network for the simplicity phase
    fidelity_sketches.append(S)
    location_mlps.append(mlp_loc)

# Simplicity axis: one row of simplified sketches per fidelity level
simplicity_sketches = []
for S, l, mlp_loc in zip(fidelity_sketches, layers, location_mlps):

    # Define and initialize MLPsimp
    mlp_simp = mlp.MLP()
    mlp_simp.init_weights()

    row = [S]  # level 0 is the unsimplified sketch
    for k in range(1, num_simplicity_levels):

        for i in range(num_simp_steps):

            # Per-stroke removal probabilities from the stroke parameters
            P = mlp_simp(Z)

            # Render simplified sketch
            S_sim = renderer.render(S, P)

            # Losses: CLIP similarity plus an exponentially weighted
            # sparsity term (schedule sketched earlier)
            L_clip = clip_loss(I, S_sim, l)
            L_sparse = sparsity_loss(P)
            L = L_clip + sparsity_weight(k) * L_sparse

            # Fine-tune mlp_loc with the CLIP loss; train mlp_simp with both
            mlp_loc.backward(L_clip)
            mlp_simp.backward(L)
            mlp_loc.update()
            mlp_simp.update()

        # Each level continues from the previous simplification
        row.append(S_sim)
        S = S_sim

    simplicity_sketches.append(row)

# Construct the final abstraction matrix:
# rows index simplicity levels, columns index fidelity levels
abstraction_matrix = [
    [simplicity_sketches[j][i] for j in range(len(layers))]
    for i in range(num_simplicity_levels)
]

The key aspects are:

  • Using CLIP loss between image and sketch to update MLPs
  • Iteratively training MLPsimp to get simplification probabilities
  • Balancing sparsity and CLIP losses for smooth simplification
  • Constructing final matrix spanning fidelity and simplicity sketches

This provides a more detailed overview of how the method can be implemented.
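Finally, selecting a sketch from the resulting matrix is a simple index, following the row/column convention used in the pseudo code above:

# Pick the sketch at simplicity level 2 and fidelity level 1 (0-indexed)
sketch = abstraction_matrix[2][1]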