Collaborative Score Distillation for Consistent Visual Synthesis

Authors: Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin

Abstract: Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as “particles” in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper proposes a novel method called Collaborative Score Distillation (CSD) for consistent visual synthesis and editing across complex visual modalities like panorama images, videos, and 3D scenes.

Why:

  • Adapting generative priors of large-scale text-to-image diffusion models to complex visual modalities is challenging due to the need to maintain consistency across the set of images (e.g. frames in a video).

  • Existing methods modify the diffusion model or rely on task-specific architectures, which limits their versatility.

  • CSD provides a simple yet effective approach to adapt text-to-image diffusion models for consistent synthesis without architecture modifications.

How:

  • CSD is based on Stein Variational Gradient Descent (SVGD). It treats samples as “particles” and combines their score functions to distill generative priors synchronously.

  • For editing, CSD-Edit optimizes multiple image patches/frames with an instruction while regularizing for inter-sample consistency.

  • The method edits panorama images, videos, and 3D scenes by updating multiple patches/frames/views collaboratively (a patch-splitting sketch follows the summary below).

  • Experiments show CSD’s effectiveness over baselines in tasks like artistic stylization, object manipulation, and view-consistent 3D scene editing.

In summary, the key ideas are leveraging SVGD for consistent distillation of generative priors from text-to-image models, and applying it to complex visual modalities like panoramas, videos and 3D scenes through collaborative optimization of multiple samples. The method provides an effective way to adapt these models for consistent synthesis without architectural changes.
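
As a concrete illustration of the patch-wise setting mentioned above, the sketch below splits a panorama into overlapping patches that can then be edited together as a batch of "particles". It is only illustrative: the patch size and stride are assumptions chosen for exposition, not values from the paper.

# Illustrative sketch (assumed patch size/stride, not the paper's settings)
import numpy as np

def split_into_patches(panorama, patch=512, stride=256):
    """Return (top, left) offsets and the corresponding overlapping crops."""
    h, w, _ = panorama.shape
    offsets, crops = [], []
    for top in range(0, max(h - patch, 0) + 1, stride):
        for left in range(0, max(w - patch, 0) + 1, stride):
            offsets.append((top, left))
            crops.append(panorama[top:top + patch, left:left + patch])
    return offsets, crops

# Example: a 512x2048 panorama yields 7 overlapping 512x512 patches; neighboring
# patches share content, which is what the consistency across particles exploits.
pano = np.zeros((512, 2048, 3), dtype=np.float32)
offsets, crops = split_into_patches(pano)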

Main Contributions

Here are the key contributions of this paper:

  • Proposes Collaborative Score Distillation (CSD), a novel method for consistent visual synthesis by combining score functions from multiple samples using Stein Variational Gradient Descent.

  • Introduces CSD-Edit, an effective image editing method based on CSD that optimizes multiple image patches/frames with an instruction while regularizing for inter-sample consistency.

  • Demonstrates CSD’s versatility across diverse visual modalities (panorama images, videos, and 3D scenes), enabling consistent editing without changing the base generative model architecture.

  • Provides an optimization perspective for leveraging generative priors of text-to-image diffusion models through distillation over multiple samples.

  • Extensive experiments highlighting CSD’s quantitative and qualitative improvements over prior work in artistic stylization, object manipulation, expression editing, and more.

  • Ablation studies analyzing the impact of different design choices like using SVGD and baseline noise selection.

  • Demonstrates compositional editing of panorama images by manipulating different regions with diverse prompts while maintaining overall coherence.

  • Compares CSD favorably against learning-based video editing methods without using any video-specific training.

In summary, the key contributions are proposing CSD for consistent synthesis, demonstrating its effectiveness across modalities such as images, videos, and 3D scenes, and providing an optimization view for adapting text-to-image diffusion models through collaborative distillation.

Method Section

Here is a summary of the method section from the paper:

The paper proposes two main methods:

  1. Collaborative Score Distillation (CSD)
  • CSD is based on Stein Variational Gradient Descent (SVGD).

  • It considers multiple samples as “particles” and combines their score functions to distill generative priors from text-to-image models.

  • Formally, let {θ_i} be parameters generating images x_i = g(θ_i). CSD minimizes the KL divergence between q(x_t^i | x_i), the distribution of x_i noised to timestep t, and the conditional density p(x_t^i; y, t) of the diffusion model.

  • The KL divergence is minimized via SVGD, which combines the score functions of all samples (each approximated by the diffusion model's noise predictor) with pairwise kernel values into a single update direction (a minimal kernel sketch is given after this section's summary).

  • This updates each θi based on affinity with other samples according to the kernel. Repulsive gradients prevent mode collapse.

  • CSD generalizes Score Distillation Sampling (SDS) to multiple samples for consistent synthesis; with a single sample it reduces to standard SDS.

  2. CSD-Edit
  • For image editing, CSD-Edit optimizes parameters {θi} initialized from source images to follow an instruction while regularizing for consistency.

  • It uses the noise predictor of InstructPix2Pix, which allows conditioning on both a source image and a text instruction.

  • The key is to use image-conditional noise as baseline instead of random noise when computing the distillation loss. This retains source details better.

  • CSD-Edit enables consistent editing of diverse visual modalities by updating batches of samples (image patches, frames, views) together.

In summary, CSD leverages SVGD for consistent distillation from text-to-image models over multiple samples. CSD-Edit applies this for coherent editing by collaborative optimization.
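
To make the kernel term in the SVGD update concrete, below is a minimal sketch of an RBF kernel with a median-heuristic bandwidth and its gradient with respect to the first argument. The RBF kernel and the median heuristic are common SVGD defaults and are stated here as assumptions rather than the paper's exact configuration.

# Minimal kernel sketch (assumption: RBF kernel with median-heuristic bandwidth)
import torch

def rbf_kernel_and_grad(x):
    """x: (N, D) flattened particles.
    Returns K (N, N) with K[j, i] = k(x_j, x_i) and grad_K (N, N, D)
    with grad_K[j, i] = gradient of k(x_j, x_i) w.r.t. x_j."""
    n = x.shape[0]
    diffs = x.unsqueeze(1) - x.unsqueeze(0)        # (N, N, D), diffs[j, i] = x_j - x_i
    sq_dists = (diffs ** 2).sum(-1)                # (N, N) squared pairwise distances
    h = sq_dists.median() / torch.log(torch.tensor(n + 1.0))  # median-heuristic bandwidth
    K = torch.exp(-sq_dists / (h + 1e-8))
    grad_K = -2.0 / (h + 1e-8) * diffs * K.unsqueeze(-1)
    return K, grad_K

K, grad_K = rbf_kernel_and_grad(torch.randn(4, 16))  # e.g. four particles of dimension 16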

High-Level Pseudo Code

Here is the high-level pseudo code for the key methods in the paper:

# Collaborative Score Distillation
 
# Inputs: 
# - Parameters θi generating images xi=g(θi) 
# - Text prompt y, diffusion model pφ
# - Kernels k, step size η
 
for number of iterations:

  Sample timestep t ~ Uniform(tmin, tmax)
  Sample noise epsilon ~ N(0, I)  # xi_t denotes xi noised to timestep t with epsilon

  # Compute score function for each sample i
  epsilon_i = eps_phi(xi_t; y, t) # Predict noise

  # Stein variational gradient descent update
  for i in 1...N:
    grad_i = 0
    for j in 1...N:
      grad_i += k(xj_t, xi_t) * (epsilon_j - epsilon) # Distill scores
      grad_i += ∇_xj k(xj_t, xi_t)                    # Repulsion term

    # Update parameters
    θi = θi - η * grad_i * ∂xi / ∂θi

# CSD-Edit

# Inputs:
# - Source images x~_i, target params θi, instruction y
# - InstructPix2Pix model pφ

for number of iterations:

  Sample t ~ Uniform(tmin, tmax)
  Sample epsilon ~ N(0, I)  # x~i_t and xi_t denote the noised source and target images

  # Compute noise predictions for each sample i
  eps_src_i = eps_phi(x~i_t; x~i, t)    # Image-conditional prediction (no instruction)
  eps_tgt_i = eps_phi(xi_t; x~i, y, t)  # Conditioned on source image and instruction
  delta_eps_i = eps_tgt_i - eps_src_i   # Image-conditional baseline replaces random noise

  # Update parameters
  for i in 1...N:
    grad_i = 0
    for j in 1...N:
      grad_i += k(xj_t, xi_t) * delta_eps_j
      grad_i += ∇_xj k(xj_t, xi_t)

    θi = θi - η * grad_i * ∂xi / ∂θi

The key steps are approximating the score function with the noise predictor, distilling scores across samples via SVGD, and using the image-conditional noise prediction as the baseline for editing.
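
As a sanity check on the double loop above, the same per-particle combination can be written in a vectorized form. The sketch below is illustrative: it assumes flattened particles, a precomputed kernel matrix K and kernel gradients grad_K (e.g. from an RBF kernel as sketched earlier), and it folds the 1/N averaging of the detailed version into the gradient.

# Vectorized combination of the distillation and repulsion terms (illustrative shapes/names)
import torch

def csd_gradient(residuals, K, grad_K, w_t=1.0):
    """residuals: (N, D) per-particle terms, e.g. (epsilon_j - epsilon) or delta_eps_j, flattened.
    K: (N, N) kernel values with K[j, i] = k(x_j, x_i).
    grad_K: (N, N, D) with grad_K[j, i] = gradient of k(x_j, x_i) w.r.t. x_j.
    Returns (N, D) with grad[i] = w_t / N * sum_j (K[j, i] * residuals[j] + grad_K[j, i])."""
    n = K.shape[0]
    attract = K.t() @ residuals        # sum_j K[j, i] * residuals[j]
    repulse = grad_K.sum(dim=0)        # sum_j grad_K[j, i]
    return (w_t / n) * (attract + repulse)

N, D = 4, 16
grad = csd_gradient(torch.randn(N, D), torch.rand(N, N), torch.randn(N, N, D))  # (N, D)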

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the key methods in the paper:

# Inputs 
theta_i = model parameters for images x_i 
y = text prompt
p_phi = diffusion model (e.g. InstructPix2Pix) 
t_min, t_max = timestep range
w(t) = timestep weighting 
k = kernel function (e.g. RBF)
eta = step size
 
# Collaborative Score Distillation
for iters in range(max_iters):
  
  t ~ Uniform(t_min, t_max)  
  
  epsilon ~ N(0, I) # Random noise
  
  x_t_i = sqrt(alpha_t) * x_i + sqrt(1-alpha_t) * epsilon # Add noise (alpha_t: cumulative noise schedule)
    
  # Get noise predictions
  eps_phi_i = p_phi(x_t_i, y, t) 
  
  for i in range(N):
 
    grad_i = 0
    
    for j in range(N):
    
      # Stein kernel gradients  
      k_grad_j = ∇_{x_t_j} k(x_t_j, x_t_i)  
      
      # Score distillation
      score_j = (eps_phi_j - epsilon) 
      
      grad_i += w(t) * (k(x_t_j, x_t_i) * score_j + k_grad_j)
      
    # Parameter update
    theta_i -= (eta / N) * grad_i * ∂x_i / ∂theta_i

# CSD-Edit
 
# Additional inputs
x_tilde_i = source images
 
for iters in range(max_iters):
 
  # Same as CSD until x_t_i
  
  # Get source image noise pred
  eps_src_i = p_phi(x_t_i, x_tilde_i, t) 
  
  # Get target image noise pred
  eps_tgt_i = p_phi(x_t_i, x_tilde_i, y, t)
 
  # Use the image-conditional source prediction as the baseline (for all i)
  delta_eps_i = eps_tgt_i - eps_src_i

  for i in range(N):

    grad_i = 0

    for j in range(N):

      # Same kernel gradient as in CSD
      k_grad_j = ∇_{x_t_j} k(x_t_j, x_t_i)

      grad_i += w(t) * (k(x_t_j, x_t_i) * delta_eps_j + k_grad_j)
      
    # Update params  
    theta_i -= (eta / N) * grad_i * ∂x_i / ∂theta_i

The key additional details are forming the noisy images explicitly, using the source-image-conditional prediction as the baseline, and weighting the gradients by w(t).
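
Putting the pieces together, here is a hedged end-to-end sketch of a single CSD-Edit iteration in PyTorch, optimizing the images directly (i.e. g is the identity, so ∂x_i / ∂θ_i drops out). The function predict_noise is a hypothetical stand-in for an InstructPix2Pix-style noise predictor, the RBF kernel with median-heuristic bandwidth is an assumed choice, and all hyperparameters are illustrative rather than the paper's settings.

import torch

def predict_noise(x_t, src, t, text=None):
    # Hypothetical stand-in: a real implementation would run the diffusion U-Net
    # conditioned on the source image (and on the instruction, if text is given).
    return torch.zeros_like(x_t)

def csd_edit_step(x, x_src, t, alpha_bar_t, text, eta=0.1, w_t=1.0):
    """One CSD-Edit update over N particles.
    x, x_src: (N, C, H, W) current targets and frozen source images."""
    n = x.shape[0]
    eps = torch.randn_like(x)
    x_t = alpha_bar_t.sqrt() * x + (1 - alpha_bar_t).sqrt() * eps   # forward diffusion

    # Image-conditional baseline (no instruction) vs. instruction-conditioned prediction
    eps_src = predict_noise(x_t, x_src, t)
    eps_tgt = predict_noise(x_t, x_src, t, text=text)
    delta = (eps_tgt - eps_src).flatten(1)                          # (N, D)

    # RBF kernel with median-heuristic bandwidth over the noised latents
    z = x_t.flatten(1)
    diffs = z.unsqueeze(1) - z.unsqueeze(0)                         # (N, N, D)
    sq = (diffs ** 2).sum(-1)
    h = sq.median() / torch.log(torch.tensor(n + 1.0))
    K = torch.exp(-sq / (h + 1e-8))
    grad_K = -2.0 / (h + 1e-8) * diffs * K.unsqueeze(-1)

    grad = (K.t() @ delta + grad_K.sum(dim=0)) * (w_t / n)          # (N, D)
    return x - eta * grad.view_as(x)                                # gradient step on the images

# Example with dummy tensors (a clip of four frames at 64x64 resolution):
x_src = torch.randn(4, 3, 64, 64)
x = csd_edit_step(x_src.clone(), x_src, t=torch.tensor(500),
                  alpha_bar_t=torch.tensor(0.5), text="make it a watercolor painting")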