Diverse Inpainting and Editing with GAN Inversion

Authors: Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, Aysegul Dundar

Abstract: Recent inversion methods have shown that real images can be inverted into StyleGAN’s latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability. In this paper, we tackle an even more difficult task, inverting erased images into GAN’s latent space for realistic inpaintings and editings. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpaintings. Specifically, we propose to learn an encoder and mixing network to combine encoded features from erased images with StyleGAN’s mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks with generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts. We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Qualitative metrics and visual comparisons show significant improvements.

What, Why and How

Here is a summary of the key points in this paper:

What:

  • The paper proposes a novel framework for diverse image inpainting and editing using GAN inversion. The goal is to fill in missing pixels in images (inpainting) while also enabling semantic image editing.

  • The framework includes an encoder to embed erased images, a mixing network to combine encoded features with random latent codes, and a pretrained StyleGAN generator.

  • A two-stage training process is used. Stage 1 trains the encoder and mixing network. Stage 2 adds skip connections to the generator for higher fidelity reconstruction.

Why:

  • Image inpainting is challenging, especially for large missing regions. Existing methods struggle to generate diverse and realistic results.

  • GAN inversion for inpainting is difficult due to the trade-off between reconstruction fidelity and editability.

  • The proposed framework aims to achieve high-quality reconstruction, diversity, and editing capabilities within a single model.

How:

  • The encoder embeds the visible parts of the erased image. The mixing network combines this with random latent codes from StyleGAN.

  • A novel training process using generated data is proposed to encourage diversity. Reconstruction losses are applied differently based on whether the original latent code is used.

  • Higher-resolution skip connections are added in stage 2 to retain color consistency between reconstructed and unerased parts.

  • Experiments show the framework generates higher quality and more diverse inpaintings than previous state-of-the-art methods. It also enables semantic editing of completed images.

In summary, this paper proposes a novel two-stage framework to achieve diverse high-quality image inpainting alongside semantic editing by inverting erased images into the latent space of a pretrained GAN. The key innovations are in the model architecture and training process. Experiments demonstrate significant improvements over existing approaches.

Main Contributions

Here are the main contributions of this paper:

  1. A novel framework for diverse image inpainting and editing using GAN inversion. The framework includes an encoder, mixing network, and pretrained StyleGAN generator.

  2. A training process on generated data with tailored reconstruction losses to encourage diversity: the full image is reconstructed when the original latent code is mixed in, and only the visible region is reconstructed when a random latent is used.

  3. The use of higher-resolution skip connections from the encoder to the generator in a second stage of training. This improves color consistency between reconstructed and unerased parts.

  4. Extensive experiments demonstrating that the framework achieves higher quality and more diverse inpaintings compared to state-of-the-art methods.

  5. Results showing the framework enables semantic editing such as modifying facial attributes for completed images.

  6. An ablation study analyzing the impact of different components like the gated mixing network and two-stage training.

  7. Quantitative evaluation using metrics like FID, LPIPS, and segmentation-based scores to measure quality, diversity and realism.

In summary, the key contributions are:

  • The overall framework design
  • The training process for encouraging diversity
  • Use of skip connections to improve fidelity
  • Extensive experiments demonstrating improved quality, diversity, and editing
  • Ablation studies and quantitative metrics to analyze model performance

Method Section

Here is a summary of the method section from the paper:

The paper proposes a two-stage training framework:

Stage 1: Learning Encoder and Mixing Network

  • An encoder embeds the erased image into the W+ space of StyleGAN.
  • A mixing network combines the encoded features with a random W+ from StyleGAN’s mapping network. This enables diversity.
  • A gating mechanism in the mixing network improves results.
  • They train with generated data and tailored reconstruction losses: if the original W+ is used, the full image is reconstructed; if a random W+ is used, only the visible parts are reconstructed. This encourages the mixing network to utilize both paths (see the loss sketch after this list).
  • An adversarial loss improves realism.
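A minimal sketch of that loss rule, assuming PyTorch, a plain MSE reconstruction term (the paper may add perceptual or other terms), and a mask that is 1 on visible pixels:

import torch.nn.functional as F

def stage1_reconstruction_loss(img, img_rec, mask, used_original_latent):
  # mask: 1 = visible (unerased) pixel, 0 = erased pixel
  if used_original_latent:
    # the original W+ was mixed in -> the whole image must be reconstructed
    return F.mse_loss(img_rec, img)
  # a random W+ was mixed in -> constrain only the visible region,
  # leaving the erased region free to vary (this is what yields diversity)
  return F.mse_loss(img_rec * mask, img * mask)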

Stage 2: Learning Skip Connections

  • A skip encoder is trained to send higher-resolution features directly to the generator.
  • This is to improve color consistency between reconstructed and unerased parts.
  • Only features at resolutions of 32x32, 64x64, and 128x128 are skipped (a fusion sketch follows this list).
  • The model is fine-tuned using the same losses as stage 1.
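A rough sketch of how those skipped features could be fused into the generator's feature maps; the additive fusion and the dict-based interface are assumptions here, not the paper's exact mechanism:

SKIP_RESOLUTIONS = (32, 64, 128)

def fuse_skip_features(gen_feats, skip_feats):
  # gen_feats / skip_feats: dicts mapping resolution -> (B, C, res, res) feature tensors
  # Only the listed resolutions receive skip information; additive fusion is an
  # assumption, the paper may use a learned combination instead.
  fused = dict(gen_feats)
  for res in SKIP_RESOLUTIONS:
    if res in skip_feats:
      fused[res] = gen_feats[res] + skip_feats[res]
  return fused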

So in summary, stage 1 learns a diverse inpainting model constrained to the latent space. Stage 2 adds skip connections for higher fidelity while retaining the diversity and latent space properties. The key aspects are the training process, gated mixing network, and two-stage architecture.

High-Level Pseudo Code

Here is the high-level pseudo code for the method proposed in the paper:

# Stage 1
 
encoder = Encoder() 
mixing_net = MixingNetwork()
G = PretrainedStyleGANGenerator() # Frozen
 
for i in range(num_iterations):
 
  # Generate training data with the frozen generator
  z_g = sample_latent_vector()
  w_gt = G.mapping_network(z_g)   # original (ground-truth) latent
  img = G(w_gt)
  mask = generate_random_mask()   # 1 = visible, 0 = erased
  img_erased = img * mask
 
  # Encoder
  w_enc = encoder(img_erased, mask)  
 
  # Mixing Network: mix the encoded latent with either the original
  # latent (w_gt) or a freshly sampled random latent (w_rand)
  z_rand = sample_latent_vector()
  w_rand = G.mapping_network(z_rand)
  w_out_orig = mixing_net(w_enc, w_gt)
  w_out_rand = mixing_net(w_enc, w_rand)

  # Reconstruction Losses
  img_rec_orig = G(w_out_orig)  # original latent path -> reconstruct the full image
  img_rec_rand = G(w_out_rand)  # random latent path -> reconstruct visible parts only

  L_full = reconstruction_loss(img, img_rec_orig)
  L_partial = reconstruction_loss(img * mask, img_rec_rand * mask)

  # Adversarial Loss
  L_adv = adversarial_loss(img, img_rec_orig, img_rec_rand)
 
  # Optimize encoder, mixing net
  loss = L_full + L_partial + L_adv
  update(encoder, mixing_net, loss) 
 
# Stage 2
 
skip_encoder = SkipEncoder()
 
for i in range(num_iterations):
 
  # Forward pass same as in stage 1 (yields w_out_orig and w_out_rand)

  # Skip Connections: higher-resolution features from the erased image only
  skip_feats = skip_encoder(img_erased, mask)
  img_rec_orig = G(w_out_orig, skip_feats)
  img_rec_rand = G(w_out_rand, skip_feats)
 
  # Fine-tune with the same losses, recomputed on the skip-connected outputs
  loss = L_full + L_partial + L_adv
  update(skip_encoder, loss) 

The key aspects are:

  • Two stage training process
  • Encoder + mixing network to enable diversity
  • Tailored reconstruction losses for generated data
  • Adding skip connections in stage 2 for higher fidelity
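Both pseudo code listings call generate_random_mask without defining it. A minimal rectangle-based sketch is given below, matching the call in the detailed listing that follows and the convention that 1 marks visible pixels; the paper's actual masking scheme (e.g. free-form strokes) may differ:

import torch

def generate_random_mask(shape, max_boxes=4):
  # shape: (C, H, W); returns a (1, H, W) mask with 1 = visible, 0 = erased
  _, h, w = shape
  mask = torch.ones(1, h, w)
  for _ in range(torch.randint(1, max_boxes + 1, (1,)).item()):
    bh = torch.randint(h // 8, h // 2, (1,)).item()
    bw = torch.randint(w // 8, w // 2, (1,)).item()
    top = torch.randint(0, h - bh + 1, (1,)).item()
    left = torch.randint(0, w - bw + 1, (1,)).item()
    mask[:, top:top + bh, left:left + bw] = 0.0
  return mask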

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the approach proposed in the paper:

# Imports (assuming a PyTorch-style implementation)
import torch
import torch.nn as nn

# Hyperparameters
latent_dim = 512
SKIP_RESOLUTIONS = [32, 64, 128]  # resolutions at which skip features are injected in stage 2
 
# Stage 1
 
# Encoder 
class Encoder(nn.Module):
  ...
 
# Mixing Network with a gating mechanism
class MixingNetwork(nn.Module):
  # self.layers: learned MLP over the concatenated latents (definition omitted)

  def forward(self, w_enc, w_rand):
    # predict a combined latent and a gate, then softly blend with w_rand
    w_comb, g = self.layers(torch.cat([w_enc, w_rand], dim=-1)).chunk(2, dim=-1)
    return torch.sigmoid(g) * w_comb + (1 - torch.sigmoid(g)) * w_rand
 
# Load pretrained StyleGAN  
G = PretrainedStyleGAN() 
G.freeze() 
 
encoder = Encoder()
mixing_net = MixingNetwork()
 
optim_enc = Adam(encoder.parameters())
optim_mix = Adam(mixing_net.parameters())
 
for i in range(num_iters):
 
  # Generate a training image with the frozen generator
  z = sample_latent(latent_dim)
  w_gt = G.mapping_network(z)   # original latent, used for the full-reconstruction path
  img = G(w_gt)
 
  # Create mask
  mask = generate_random_mask(img.shape[1:])   # 1 = visible, 0 = erased
  img_erased = img * mask
 
  # Encoder
  w_enc = encoder(img_erased, mask)
  
  # Mixing Network: two paths, one with the original latent and one with a random latent
  z_rand = sample_latent(latent_dim)
  w_rand = G.mapping_network(z_rand)
  w_out_orig = mixing_net(w_enc, w_gt)
  w_out_rand = mixing_net(w_enc, w_rand)

  # Recon 1 - original latent path (full reconstruction target)
  img_rec_1 = G(w_out_orig)

  # Recon 2 - random latent path (only the visible region is constrained)
  img_rec_2 = G(w_out_rand)
 
  # Losses (hinge_adv_loss assumes a discriminator trained alongside; see the sketch after stage 2)
  L_recon_1 = mse_loss(img, img_rec_1)                 # full reconstruction
  L_recon_2 = mse_loss(img * mask, img_rec_2 * mask)   # visible parts only
  L_adv = hinge_adv_loss(img, img_rec_1, img_rec_2)
 
  # Update
  loss = L_recon_1 + L_recon_2 + L_adv
  optim_enc.zero_grad()
  optim_mix.zero_grad()
  loss.backward()
  optim_enc.step()
  optim_mix.step()
 
# Stage 2 
 
# Skip Encoder
class SkipEncoder(nn.Module):
  ... 
 
skip_enc = SkipEncoder()
optim_skip = Adam(skip_enc.parameters())
 
for i in range(num_iters):
  
  # Forward pass similar to stage 1 (produces w_out_orig and w_out_rand)

  # Skip Connections: features from the erased image, injected at SKIP_RESOLUTIONS
  skip_feats = skip_enc(img_erased, mask)
  img_rec_1 = G(w_out_orig, skip_feats)
  img_rec_2 = G(w_out_rand, skip_feats)

  # Losses (same structure as stage 1)
  L_recon_1 = mse_loss(img, img_rec_1)                 # full reconstruction (original latent path)
  L_recon_2 = mse_loss(img * mask, img_rec_2 * mask)   # visible parts only (random latent path)
  L_adv = hinge_adv_loss(img, img_rec_1, img_rec_2)
  
  # Update
  loss = L_recon_1 + L_recon_2 + L_adv
  optim_skip.zero_grad()
  loss.backward()
  optim_skip.step() 
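The hinge_adv_loss call above is left abstract. One possible shape for it, written with the discriminator passed explicitly (the listings above omit it) and without reproducing the paper's exact adversarial setup:

import torch.nn.functional as F

def hinge_adv_loss(D, real, *fakes):
  # encoder/mixing-network side term: push discriminator scores of the
  # reconstructions up (`real` is unused here, kept only to mirror the calls above)
  return sum(-D(f).mean() for f in fakes)

def hinge_d_loss(D, real, *fakes):
  # discriminator-side hinge loss, trained in a separate step (not shown in the listings)
  loss = F.relu(1.0 - D(real)).mean()
  for f in fakes:
    loss = loss + F.relu(1.0 + D(f.detach())).mean()
  return loss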

The key aspects are:

  • Encoder, mixing net, skip encoder implementations
  • Freezing the StyleGAN generator
  • The two stage training process and associated losses
  • Skipping features from encoder to generator in stage 2
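Finally, a usage-level sketch of how the trained components could be exercised at inference time for diverse inpainting and semantic editing; the component names follow the pseudo code above, and the attribute direction is assumed to be precomputed rather than something defined in this summary:

import torch

@torch.no_grad()
def diverse_inpaint(encoder, mixing_net, G, skip_enc, img_erased, mask,
                    num_samples=5, latent_dim=512):
  # invert the erased image once, then sample several random latents for diverse completions
  w_enc = encoder(img_erased, mask)
  skip_feats = skip_enc(img_erased, mask)
  results = []
  for _ in range(num_samples):
    z_rand = torch.randn(1, latent_dim)
    w_rand = G.mapping_network(z_rand)
    w_out = mixing_net(w_enc, w_rand)
    results.append(G(w_out, skip_feats))
  return results

# Semantic editing (sketch): shift the mixed latent along a precomputed
# attribute direction in W+ before synthesis.
# w_edited = w_out + alpha * attribute_direction
# img_edited = G(w_edited, skip_feats)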