RealFill: Reference-Driven Generation for Authentic Image Completion
Authors: Luming Tang, Nataniel Ruiz, Qinghao Chu, Yuanzhen Li, Aleksander Holynski, David E. Jacobs, Bharath Hariharan, Yael Pritch, Neal Wadhwa, Kfir Aberman, Michael Rubinstein
Abstract: Recent advances in generative imagery have brought forth outpainting and inpainting models that can produce high-quality, plausible image content in unknown regions, but the content these models hallucinate is necessarily inauthentic, since the models lack sufficient context about the true scene. In this work, we propose RealFill, a novel generative approach for image completion that fills in missing regions of an image with the content that should have been there. RealFill is a generative inpainting model that is personalized using only a few reference images of a scene. These reference images do not have to be aligned with the target image, and can be taken with drastically varying viewpoints, lighting conditions, camera apertures, or image styles. Once personalized, RealFill is able to complete a target image with visually compelling contents that are faithful to the original scene. We evaluate RealFill on a new image completion benchmark that covers a set of diverse and challenging scenarios, and find that it outperforms existing approaches by a large margin. See more results on our project page: realfill.github.io
What, Why and How
Here is a summary of the key points from this paper:
What: This paper proposes a new method called RealFill for reference-driven image completion. Given a target image with missing regions and a few reference images of the same scene, RealFill fills in the missing parts of the target image with content that is authentic to the true scene, rather than merely plausible hallucinations.
Why: Existing approaches, such as geometry-based methods and off-the-shelf diffusion models, struggle with this task when there are large differences between the reference and target images. RealFill overcomes this by first fine-tuning an inpainting diffusion model on the reference and target images to adapt it to the scene content.
How:
- Fine-tune a pre-trained inpainting diffusion model on the reference images and target image to inject knowledge of the scene into the model.
- Use the adapted model to fill in the missing regions of the target image through standard sampling.
- Propose correspondence-based seed selection to automatically filter out low-quality samples based on the number of keypoint matches to the reference images.
In summary, RealFill adapts a generative inpainting model to a specific scene using reference images so it can complete a target image faithfully. The method handles large appearance differences between references and target.
Main Contributions
The main contributions of this paper are:
- It proposes the novel problem of authentic image completion, where the goal is to fill in missing image regions with content faithful to the true scene, not just plausible hallucinations.
- It presents RealFill, a new method that solves this problem by fine-tuning an inpainting diffusion model on reference images of the scene.
- It introduces correspondence-based seed selection to automatically filter out low-quality samples based on the number of matches to the reference images.
- It collects a new dataset, RealBench, for evaluating authentic image completion on challenging cases.
- It demonstrates that RealFill produces compelling image completions that are more faithful to the reference images than existing baselines, even with large appearance gaps between the references and the target image.
- It provides both qualitative and quantitative experiments showing that RealFill outperforms other methods on the proposed authentic image completion task.
In summary, the main contributions are: (1) defining the novel problem of authentic image completion, (2) proposing the RealFill method, (3) introducing correspondence-based seed selection, (4) collecting a new dataset, and (5) demonstrating strong quantitative and qualitative results for this new task.
Method Section
Here is a summary of the method section from the paper:
- Problem Formulation:
  - Given: a set of reference images, a target image with a missing region, and a mask indicating the missing area.
  - Goal: generate a completed target image that is visually compelling and faithful to the reference images.
- Preliminaries:
  - Diffusion models: transform Gaussian noise into the target data distribution through iterative denoising.
  - DreamBooth: fine-tunes a diffusion model on a few images of a subject; combined with low-rank adaptation (LoRA), only a small set of adapter weights needs to be updated.
- RealFill Method:
  - Training:
    - Fine-tune a pre-trained inpainting diffusion model on the reference images and the target image, using randomly generated masks.
    - Use low-rank adaptations so that only a small portion of the weights is updated.
    - Condition the model on the masked image and the mask, training it to reconstruct the original image.
  - Inference:
    - Sample from the fine-tuned model conditioned on the target image and mask.
    - Alpha blend the generated image with the original target input.
  - Correspondence-Based Seed Selection:
    - Generate multiple samples and keep those with more keypoint matches to the reference images (see the sketch after this section's summary).
In summary, the RealFill method fine-tunes an inpainting diffusion model on the input images to adapt it to the scene. The adapted model is then sampled, conditioned on the target image and mask, to fill in the missing regions, and correspondence-based seed selection keeps the samples that agree most with the references.
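A minimal sketch of what correspondence-based seed selection could look like, assuming SIFT keypoints and OpenCV's brute-force matcher (this summary does not specify the exact matcher; helper names like count_matches are illustrative):

import cv2
import numpy as np

def count_matches(cand_gray, ref_gray, ratio=0.75):
    # Count distinctive SIFT matches between a candidate completion and one reference
    sift = cv2.SIFT_create()
    _, des1 = sift.detectAndCompute(cand_gray, None)
    _, des2 = sift.detectAndCompute(ref_gray, None)
    if des1 is None or des2 is None:
        return 0
    pairs = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    # Lowe's ratio test filters out ambiguous matches
    return sum(1 for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance)

def select_best_sample(candidates, ref_images):
    # Rank candidates by total matches against all references; return the best one
    to_gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    scores = [sum(count_matches(to_gray(c), to_gray(r)) for r in ref_images)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

In practice one might count matches only inside the filled region, since the known pixels match the references trivially.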
High-Level Pseudo Code
Here is high-level pseudo code for the RealFill method:
# Training
diffusion_model = load_pretrained_inpainting_model()
diffusion_model.inject_low_rank_adapters()  # LoRA: only a small fraction of weights is trainable
for step in range(num_train_iters):
    ref_images, tgt_image, tgt_mask = get_batch()
    loss = 0
    # References and the target are all training images, each with a fresh random mask;
    # for the target, the loss only covers its known (unmasked) pixels
    for img in ref_images + [tgt_image]:
        mask = generate_random_mask(img)
        t = sample_random_timestep()
        noisy_img, noise = add_noise(img, t)  # forward diffusion process
        pred_noise = diffusion_model(noisy_img, t, prompt, mask, masked(img, mask))
        loss += mse(pred_noise, noise)
    update_model(loss)

# Inference
generated_img = diffusion_model.sample(noise_init, num_inference_steps, prompt, tgt_image, tgt_mask)
output_img = alpha_blend(generated_img, tgt_image, blurred_tgt_mask)
In summary, the key steps are:
- Inject low-rank adapters into pretrained model
- Fine-tune model on reference images and target image with random masks
- Sample model conditioned on target and mask to fill missing regions
- Alpha blend the generated output with the original target image (a sketch of this step follows)
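The alpha_blend step is not spelled out above; here is a minimal sketch of one plausible implementation, assuming tensors in [0, 1] and a Gaussian-blurred mask (the kernel size and sigma are illustrative):

import torch
import torchvision.transforms.functional as TF

def alpha_blend(generated, target, mask, blur_sigma=2.0):
    # generated, target: (3, H, W) float tensors in [0, 1]
    # mask: (1, H, W) float tensor, 1 where content was missing
    kernel = int(4 * blur_sigma) | 1  # force an odd kernel size
    soft = TF.gaussian_blur(mask, kernel_size=kernel, sigma=blur_sigma)
    # Generated pixels fill the missing region; original pixels are kept elsewhere,
    # with a soft transition at the mask boundary
    return soft * generated + (1.0 - soft) * target

Blurring the mask avoids a visible seam where the generated content meets the untouched pixels.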
Detailed Pseudo Code
Here is more detailed pseudo code to implement the RealFill method:
# Hyperparameters
num_train_steps = 2000
batch_size = 16
lr_unet = 2e-4
lr_encoder = 4e-5
lora_rank = 8

# Model
model = SDInpaintModel()  # pre-trained Stable Diffusion inpainting model
model.inject_lora(rank=lora_rank)  # only the LoRA weights remain trainable
optim = Adam(model.unet_lora_params(), lr=lr_unet)  # for the UNet
optim_enc = Adam(model.text_encoder_lora_params(), lr=lr_encoder)  # for the text encoder

# Training
for step in range(num_train_steps):
    # Batch: reference images plus the target image and its mask
    ref_images, tgt_image, tgt_mask = get_batch(batch_size)

    loss = 0
    # All input images (references and target) are trained with random masks;
    # for the target, the loss should only be applied to its known pixels
    for img in ref_images + [tgt_image]:
        mask = random_mask(img)
        t = sample_timestep()
        noise = torch.randn_like(img)
        noisy = add_noise(img, noise, t)  # forward diffusion at timestep t
        eps = model(noisy, t, prompt, mask, img * (1 - mask))
        loss += F.mse_loss(eps, noise)

    optim.zero_grad()
    optim_enc.zero_grad()
    loss.backward()
    optim.step()
    optim_enc.step()
    lr_scheduler.step()

# Sampling
latent = torch.randn(1, 4, H // 8, W // 8)  # SD denoises in a 4-channel latent space
for t in scheduler.timesteps:
    eps = model(latent, t, prompt, tgt_mask, tgt_image * (1 - tgt_mask))
    latent = scheduler.step(eps, t, latent)  # one denoising step
generated_img = decode(latent)  # VAE decoder back to pixel space
output_img = alpha_blend(generated_img, tgt_image, blurred_tgt_mask)
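The code above leaves random_mask unspecified. A minimal sketch, assuming a single random rectangle per image (the actual mask distribution used for fine-tuning may be richer, e.g. multiple boxes or free-form strokes):

import torch

def random_mask(img, min_frac=0.1, max_frac=0.5):
    # img: (3, H, W) tensor; returns a (1, H, W) binary mask with 1 = region to inpaint
    _, H, W = img.shape
    mh = int(H * float(torch.empty(1).uniform_(min_frac, max_frac)))
    mw = int(W * float(torch.empty(1).uniform_(min_frac, max_frac)))
    top = int(torch.randint(0, H - mh + 1, (1,)))
    left = int(torch.randint(0, W - mw + 1, (1,)))
    mask = torch.zeros(1, H, W)
    mask[:, top:top + mh, left:left + mw] = 1.0
    return mask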
The key aspects are:
- Injecting low-rank adapters into the pre-trained model (a standalone sketch follows this list)
- Getting batches of reference images plus the target image and mask
- Adding noise and training with the standard denoising objective
- Sampling the fine-tuned model conditioned on the target image and mask
- Alpha blending the output with the original target image
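For concreteness, here is a generic PyTorch sketch of the low-rank adapter idea. This is not the paper's exact implementation; in practice one would typically use a library such as peft or the LoRA utilities in diffusers:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen linear layer with a trainable low-rank update: y = W x + scale * B(A x)
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)  # A
        self.up = nn.Linear(rank, base.out_features, bias=False)   # B
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.up(self.down(x)) * self.scale

Replacing the attention projections of the UNet (and, per the reported learning rates, the text encoder) with such wrappers means only the small down/up matrices are optimized, which keeps per-scene fine-tuning lightweight.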