ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

Authors: Yasheng Sun, Yifan Yang, Houwen Peng, Yifei Shen, Yuqing Yang, Han Hu, Lili Qiu, Hideki Koike

Abstract: While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model’s capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.

What, Why and How

This paper proposes a new approach for image manipulation guided solely by visual instructions, without requiring language descriptions.

The key highlights are:

What:

  • They introduce a new image manipulation paradigm using a pair of example images as visual instructions to edit a new query image. This eliminates the need for language descriptions which can be ambiguous.

  • They develop a framework called ImageBrush that can understand the intent behind the example image pairs and apply appropriate edits to the query image.

Why:

  • Language descriptions for image editing tasks can be imprecise, requiring manual engineering of prompts. Using visual examples directly conveys the editing intent more accurately.

  • Visual instructions also enable reusing complex editing procedures from artists/experts by providing their before-after image pairs, without needing new prompt designs.

How:

  • They formulate it as a diffusion-based inpainting task, where a UNet iteratively refines a grid-like input containing the example pair and the query image (see the interface sketch after this list).

  • A visual prompt encoder is designed to extract high-level semantics from the examples to better guide the UNet’s context understanding.

  • They allow specifying regions of interest via bounding boxes to focus edits.

  • Experiments on datasets such as ScanNet, LRW, and UBC-Fashion demonstrate that the approach generates high-quality results conforming to the visual instructions, and that it generalizes well to tasks such as pose transfer and image translation.
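
To make the interaction concrete, the whole paradigm reduces to a single call that maps a before/after exemplar pair and a query image to an edited image. The container and function below are illustrative stand-ins, not the paper's actual API.

from dataclasses import dataclass
from typing import Optional, Tuple
import torch

@dataclass
class VisualInstruction:
    """A visual instruction: a 'before' image E and its edited counterpart E'."""
    before: torch.Tensor   # E,  shape (3, H, W)
    after: torch.Tensor    # E', shape (3, H, W)

def apply_visual_instruction(instruction: VisualInstruction,
                             query: torch.Tensor,
                             roi: Optional[Tuple[int, int, int, int]] = None) -> torch.Tensor:
    """Stub for the full pipeline: encode the instruction and query, assemble the
    2x2 grid, and synthesize the edited query via diffusion-based inpainting.
    `roi` is the optional user-specified bounding box (x1, y1, x2, y2)."""
    raise NotImplementedError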

In summary, this paper explores a novel visual instruction-based interaction for image manipulation, eliminating the language-vision gap. The carefully designed framework ImageBrush shows promising results in inferring intent from example pairs and applying appropriate edits to new images.

Main Contributions

The main contributions of this paper are:

  1. They propose a new paradigm for image manipulation using visual instructions in the form of example image pairs, eliminating the need for language descriptions.

  2. They develop a framework called ImageBrush that can implicitly understand the editing intent from the visual examples and apply appropriate manipulations to a new query image.

  3. ImageBrush uses a diffusion-based inpainting formulation to iteratively refine a grid-like input containing the example pair and the query image. This allows joint context learning and image synthesis.

  4. A visual prompt encoder is designed to extract high-level semantics from the example pairs to enhance the model’s reasoning abilities.

  5. An interface allowing users to specify regions of interest via bounding boxes is introduced to help the model better capture the editing intent.

  6. Experiments on diverse datasets demonstrate ImageBrush can generate manipulated results conforming to the visual instructions, while preserving unrelated content.

  7. The model generalizes well to downstream tasks such as pose transfer, image translation, and video inpainting without task-specific training.

  8. They propose a new metric based on cyclic consistency to evaluate the model’s ability to follow instructions and maintain image fidelity.

  9. This work explores the promise of visual instruction-based interaction for image editing, paving the way for future visual foundation models.

In summary, the key contributions are: proposing the visual instruction-based editing paradigm, developing the ImageBrush framework and associated designs to accomplish this task, and demonstrating strong results on diverse datasets and tasks.

Method Section

Here is a summary of the method section:

Problem Formulation:

  • Given a pair of edited examples E, E’ and a query image I, the goal is to generate I’ that adheres to the editing instructions in E, E’ while preserving unrelated content of I.

  • They formulate it as diffusion-based inpainting: E, E’, I and a blank image M are concatenated into a grid, and the missing target region is iteratively recovered during the diffusion process.

  • This allows joint context learning and image synthesis. Diffusion is performed in the latent space of a VAE for efficiency.
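
A minimal sketch of that grid construction, assuming 256x256 RGB tensors and treating the lower-right cell as the blank region to be inpainted; the helper below is illustrative, not the authors' code.

import torch

def build_grid(E: torch.Tensor, E_prime: torch.Tensor, I: torch.Tensor):
    """Arrange {E, E', I, blank} into a 2x2 image grid and return it together with
    an inpainting mask marking the blank (to-be-generated) quadrant."""
    blank = torch.zeros_like(I)                    # placeholder for the target I'
    top = torch.cat([E, E_prime], dim=-1)          # E | E'
    bottom = torch.cat([I, blank], dim=-1)         # I | blank
    grid = torch.cat([top, bottom], dim=-2)        # (3, 512, 512) for 256x256 inputs
    mask = torch.zeros(1, grid.shape[-2], grid.shape[-1])
    mask[:, I.shape[-2]:, I.shape[-1]:] = 1.0      # 1 = region the diffusion model fills in
    return grid, mask

# Example with random tensors standing in for real images
E, E_prime, I = (torch.rand(3, 256, 256) for _ in range(3))
grid, mask = build_grid(E, E_prime, I)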

Visual Prompt Design:

  • A visual prompt encoder extracts features from E, E’, I and processes them through a transformer to exploit contextual relationships.

  • The prompt features are injected into the UNet via cross-attention to enhance its reasoning.

  • An interface allows specifying regions of interest via bounding boxes, encoded and injected to focus edits.
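
The following sketch shows one way the prompt features and an optional bounding-box token could be produced; the layer sizes and the linear box embedding are assumptions for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Contextual modeling over per-image ViT tokens with a small transformer;
    an optional bounding box is embedded as an extra ROI token."""
    def __init__(self, dim: int = 768, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.box_embed = nn.Linear(4, dim)          # (x1, y1, x2, y2) -> one ROI token

    def forward(self, tokens_E, tokens_E_prime, tokens_I, boxes=None):
        seq = torch.cat([tokens_E, tokens_E_prime, tokens_I], dim=1)   # (B, 3*L, dim)
        if boxes is not None:                                          # optional ROI tokens
            seq = torch.cat([seq, self.box_embed(boxes)], dim=1)       # (B, 3*L + K, dim)
        return self.transformer(seq)   # prompt features consumed by the UNet's cross-attention

# Example: ViT token sequences (length 197) for E, E', I plus one normalized box
encoder_p = VisualPromptEncoder()
toks = [torch.rand(1, 197, 768) for _ in range(3)]
prompt = encoder_p(*toks, boxes=torch.tensor([[[0.2, 0.3, 0.6, 0.8]]]))   # (1, 592, 768)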

Training Details:

  • Input images are 256x256, concatenated into 512x512 grids.

  • The UNet uses 32 blocks with self-attention, and cross-attention for prompt features.

  • Trained from scratch on 24 V100 GPUs for 14k iterations with the AdamW optimizer.

  • A classifier-free guidance scale of 7.5 is used for the prompt features.
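
Taken at face value, those settings could be collected into a configuration sketch like the one below; the learning rate and weight decay are illustrative placeholders, not values reported above.

import torch
import torch.nn as nn

# Hyperparameters restated from the summary above; lr / weight_decay are assumptions
config = dict(
    image_size=256,        # each cell of the 2x2 grid
    grid_size=512,         # full grid fed to the diffusion UNet
    unet_blocks=32,        # self-attention blocks, plus cross-attention for prompt features
    iterations=14_000,
    gpus=24,               # V100s, training from scratch
    guidance_scale=7.5,    # classifier-free guidance on the prompt features
)

def make_optimizer(model: nn.Module) -> torch.optim.AdamW:
    # AdamW as stated; the specific lr and weight decay are placeholders
    return torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)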

In summary, the key aspects are the diffusion-based inpainting formulation, integration of high-level semantics from the visual prompt encoder, and interface for specifying regions of interest. These allow the model to implicitly comprehend the editing instructions and apply appropriate manipulations to new images.

High-Level Pseudo Code

Here is the high-level pseudo code for the ImageBrush method:

# ImageBrush Pseudo Code

# Inputs
examples = [E, E_prime]   # before/after exemplar pair (E, E')
query = I                 # query image to be edited

# Visual Prompt Encoder
for img in [E, E_prime, I]:
  features[img] = encoder_v(img)        # extract ViT features per image
prompt_features = encoder_p(features)   # transformer models their contextual relations

# Bounding Box Interface (optional)
for bb in user_bb:
  prompt_features += encode_bb(bb)      # inject ROI encodings before denoising

# Diffusion-based Inpainting
# Arrange {E, E', I, blank} as a 2x2 grid; the blank cell is the target I'
latent = encoder(grid([E, E_prime, I, blank]))   # VAE latents of the grid
latent = add_noise(latent)                       # start from noised latents
for t in reversed(timesteps):
  latent = unet(latent, t, prompt_features)      # iterative denoising
output = decoder(latent)                         # decode grid; crop target cell as I'

# Training loss (standard diffusion noise-prediction objective on the grid latents)
L_simple = ||noise - noise_pred||^2

# Evaluation metrics (cycle consistency; not used for training)
prompt_fidelity = FID(E', edit(E | I -> I'))   # does the edit follow the instruction?
image_fidelity  = FID(I,  edit(I' | E' -> E))  # is unrelated query content preserved?

The key steps are:

  1. Encoding the visual prompt with a ViT + transformer prompt encoder
  2. Injecting optional user-specified ROI encodings into the prompt features
  3. Diffusion-based inpainting with a UNet that denoises the grid-like input
  4. A simple diffusion loss for training, with cycle-consistency fidelity metrics for evaluation

This summarizes the overall flow and major components of the ImageBrush approach.
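
To make the two fidelity metrics concrete, they can be sketched as below; edit_fn stands in for a trained ImageBrush-style model and compute_fid for any standard FID implementation, both of which are assumptions here rather than the paper's code.

def cycle_consistency_metrics(E, E_prime, I, I_prime, edit_fn, compute_fid):
    """Prompt fidelity: use (I -> I') as the instruction and check that editing E
    reproduces E'. Image fidelity: use (E' -> E) as the instruction and check that
    editing I' recovers I. Lower scores indicate better adherence / preservation.
    In practice FID is computed over sets of images, not single pairs."""
    E_prime_hat = edit_fn(examples=(I, I_prime), query=E)    # should resemble E'
    I_hat = edit_fn(examples=(E_prime, E), query=I_prime)    # should resemble I
    return {
        "prompt_fidelity": compute_fid(E_prime_hat, E_prime),
        "image_fidelity": compute_fid(I_hat, I),
    }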

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the ImageBrush method:

import torch
import torch.nn as nn
from torchvision import models

# Hyperparameters
IMG_SIZE = 256          # size of each image in the 2x2 grid (grid itself is 512x512)
PROMPT_DIM = 768        # ViT-B/16 feature dimension
NUM_LAYERS = 4
NUM_TIMESTEPS = 50      # number of denoising steps (placeholder value)

# Visual feature encoder
class ViTEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
        self.model.heads = nn.Identity()   # drop the classification head

    def forward(self, x):
        return self.model(x)               # (batch, PROMPT_DIM) class-token feature

# Contextual modeling over the per-image features (the paper uses a transformer;
# a small MLP stack is kept here as a lightweight stand-in)
class PromptEncoder(nn.Module):
    def __init__(self, dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):                  # x: (batch, num_tokens, dim)
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

# Diffusion modules (stubs to be filled in)
class Encoder(nn.Module):
    """VAE/CNN encoder mapping the image grid to latents."""
    def forward(self, x):
        raise NotImplementedError

class Decoder(nn.Module):
    """VAE/CNN decoder mapping latents back to an image grid."""
    def forward(self, z):
        raise NotImplementedError

class DenoisingNet(nn.Module):
    """UNet with self-attention, plus cross-attention on the prompt features."""
    def forward(self, z, t, cond):
        raise NotImplementedError

# Diffusion Model
class ImageBrush(nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.unet = DenoisingNet()
        self.encoder_v = ViTEncoder()
        self.encoder_p = PromptEncoder(PROMPT_DIM, NUM_LAYERS)

    def forward(self, examples, query):
        E, E_prime = examples

        # Visual prompt: per-image features -> contextual prompt tokens
        feats = [self.encoder_v(img) for img in (E, E_prime, query)]
        prompt_feats = self.encoder_p(torch.stack(feats, dim=1))   # (batch, 3, dim)

        # Assemble the 2x2 grid {E, E', I, blank} and encode it to latents
        blank = torch.zeros_like(query)
        grid = torch.cat([torch.cat([E, E_prime], dim=-1),
                          torch.cat([query, blank], dim=-1)], dim=-2)
        z = self.encoder(grid)

        # Reverse diffusion as inpainting: start from noised latents and denoise,
        # conditioned on the prompt features; a full implementation would re-impose
        # the known quadrants (E, E', I) at each step and only synthesize the blank cell
        z_t = torch.randn_like(z)
        for t in reversed(range(NUM_TIMESTEPS)):
            z_t = self.unet(z_t, t, prompt_feats)

        return self.decoder(z_t)           # image grid whose lower-right cell is I'

This shows a more detailed modular implementation with the key components:

  • ViT + Prompt encoder for visual features
  • CNN Encoder-decoder
  • UNet architecture for diffusion
  • Overall diffusion model tying things together

The forward pass shown corresponds to inference (denoising from noise); during training, noise would instead be added to the encoded grid latents and the UNet supervised with the simple noise-prediction loss.
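
For completeness, a hypothetical invocation of the skeleton might look like the following; the random tensors are placeholders, and the stubbed Encoder/Decoder/DenoisingNet modules must be implemented before this runs end to end.

import torch

# Placeholder inputs: an exemplar pair and a query image, batched 3x256x256 tensors
E = torch.rand(1, 3, 256, 256)
E_prime = torch.rand(1, 3, 256, 256)
I = torch.rand(1, 3, 256, 256)

model = ImageBrush()
with torch.no_grad():
    grid_out = model(examples=(E, E_prime), query=I)   # 2x2 grid; lower-right cell is the edit
edited = grid_out[..., 256:, 256:]                     # crop the predicted I'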