General Image-to-Image Translation with One-Shot Image Guidance
Authors: Bin Cheng, Zuhao Liu, Yunbo Peng, Yue Lin
Abstract: Large-scale text-to-image models pre-trained on massive text-image pairs have recently shown excellent performance in image synthesis. However, images can provide more intuitive visual concepts than plain text. People may ask: how can we integrate the desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand as they lack the ability to preserve content or translate visual concepts effectively. Inspired by this, we propose a novel framework named visual concept translator (VCT) with the ability to preserve content in the source image and translate the visual concepts guided by a single reference image. The proposed VCT contains a content-concept inversion (CCI) process to extract contents and concepts, and a content-concept fusion (CCF) process to gather the extracted information to obtain the target image. Given only one reference image, the proposed VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments are conducted to prove the superiority and effectiveness of the proposed methods. Code is available in the visual-concept-translator repository.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a novel framework called Visual Concept Translator (VCT) for general image-to-image translation guided by a single reference image.
Why:
- Current image-to-image translation methods have limitations in preserving content from the source image while transferring concepts from the reference image.
- The proposed VCT aims to address these limitations and enable effective translation of visual concepts from the reference image while preserving content structure from the source image.
How:
- The VCT contains two main processes - Content-Concept Inversion (CCI) and Content-Concept Fusion (CCF).
- CCI extracts content and concepts from source and reference images through pivot turning inversion and multi-concept inversion.
- CCF gathers the extracted information using a dual-stream denoising architecture to generate the final translated image.
- It employs techniques like attention control to preserve the content structure of the source image.
- The method needs only a single reference image to guide the translation.
- Extensive experiments on tasks like general I2I translation and style transfer demonstrate the effectiveness of VCT.
In summary, the paper proposes a novel VCT framework that can perform general image-to-image translation with a single reference image by extracting and fusing content and concepts from source and reference images in an effective manner.
Main Contributions
Here are the key contributions of this paper:
- Proposes a novel framework called Visual Concept Translator (VCT) for general image-to-image translation using a single reference image.
- Introduces a Content-Concept Inversion (CCI) process to extract content and concepts from source and reference images. This involves techniques like pivot turning inversion and multi-concept inversion.
- Develops a Content-Concept Fusion (CCF) process to gather the extracted information using a dual-stream denoising architecture to generate the final translated image.
- Employs attention control and other techniques to preserve content structure from the source image while transferring concepts from the reference.
- Demonstrates the ability of VCT to perform a wide range of general I2I tasks and style transfer with just one reference image.
- Provides extensive experimental results and comparisons to validate the effectiveness of the proposed VCT framework and the techniques involved.
- Makes code available to reproduce the method and results.
In summary, the main contributions are the novel VCT framework, its CCI and CCF processes, and the demonstration of its effectiveness on diverse I2I tasks, all using just a single reference image.
Method Section
Here is a summary of the method section from the paper:
The paper proposes a novel Visual Concept Translator (VCT) framework for general image-to-image translation using a single reference image.
VCT contains two main processes:
- Content-Concept Inversion (CCI):
  - Extracts a content embedding from the source image using pivot turning inversion, which inverts the source image to find optimal embeddings that reconstruct it.
  - Extracts a concept embedding from the reference image using multi-concept inversion, which learns an embedding that represents the concepts in the reference.
- Content-Concept Fusion (CCF):
  - Employs a dual-stream denoising architecture with a main branch and a content matching branch.
  - Both branches start from the same noise, initialized by inverting the source image.
  - The content matching branch reconstructs the source image and extracts attention maps.
  - The main branch fuses the content and concept embeddings extracted by CCI.
  - Attention control injects the attention maps from the content matching branch into the main branch.
  - Finally, the main branch generates the translated target image.
In summary, the method section explains the two key processes of the proposed VCT framework: CCI inverts content and concept embeddings from the source and reference images, and CCF fuses them in a dual-stream denoising architecture, with attention control preserving the content structure, to generate the final translated image.
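To make the attention-control idea concrete, the following PyTorch sketch shows a toy cross-attention layer that exposes its attention maps and allows maps from the content matching branch to be injected into the main branch. The class and attribute names (CrossAttention, last_attn, injected_attn) are illustrative assumptions, not the authors' implementation.

import torch

class CrossAttention(torch.nn.Module):
    """Toy cross-attention layer used to illustrate attention injection.

    last_attn exposes the maps computed in the current branch;
    injected_attn can be set to maps taken from the content matching
    branch so the main branch reuses the source layout.
    """
    def __init__(self, dim: int, ctx_dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = torch.nn.Linear(ctx_dim, dim, bias=False)
        self.last_attn = None       # read by the attention controller
        self.injected_attn = None   # written by the attention controller

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(ctx), self.to_v(ctx)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        self.last_attn = attn
        if self.injected_attn is not None:   # attention control: override maps
            attn = self.injected_attn
        return attn @ v

In this reading of the method, the content matching branch runs its denoising step first, its last_attn maps are copied into injected_attn of the corresponding layers in the main branch, and the main branch then denoises while keeping the source layout.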
High-Level Pseudo Code
Here is the high-level pseudo code for the key steps in the paper:
# Content-Concept Inversion

# Pivot turning inversion for the source image
z_src = encode(x_src)
z_T = DDIM_Invert(z_src)
for t in T to 1:
    optimize v_t_src so that z_src is reconstructed from z_t

# Multi-concept inversion for the reference image
for i in steps:
    compute the L_ldm and L_rec losses
    update v_ref to minimize them

# Content-Concept Fusion

# Dual-stream diffusion (both branches start from z_T)
for t in T to 1:
    # Content matching branch: reconstruct source and collect attention maps
    z_t-1*_src, attn_t* = Denoise(z_t*_src, v_t_src)
    # Main branch: fuse content and concept embeddings
    _, attn_t = Denoise(z_t, v_t_src, v_ref)
    # Attention control: inject content-branch attention maps
    attn_t = CombineAttn(attn_t, attn_t*)
    z_t-1 = Denoise(z_t, v_t_src, v_ref, attn_t)

x_tgt = Decode(z_0)
In summary, the key steps are:
- Invert source and extract pivoted embeddings
- Invert reference to get concept embeddings
- Dual-stream diffusion with attention control
- Decode to get final translated image
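The DDIM_Invert step above can be sketched generically. The following PyTorch code is a minimal sketch of deterministic DDIM inversion, assuming a noise-prediction interface eps_model(z, t, cond) and a precomputed alphas_cumprod noise schedule; it is not the authors' implementation.

import torch

@torch.no_grad()
def ddim_invert(z0, eps_model, cond, alphas_cumprod, num_steps=50):
    """Deterministic DDIM inversion: reverse the DDIM update so the clean
    latent z0 is mapped to a noise latent z_T that reconstructs it when
    sampled back with the same conditioning."""
    z = z0
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(z, t_cur, cond)                          # predicted noise
        z0_pred = (z - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()  # predicted clean latent
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps  # step towards more noise
    return z

Sampling back along the same schedule recovers an approximation of z_src; pivot turning inversion refines that trajectory by optimizing the per-step source embeddings v_t_src.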
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the key components of the paper:
# Hyperparameters
T: Number of diffusion steps
v_null: Null embedding

# Encoders and decoder
E: Image encoder
D: Image decoder
tok: Text tokenizer

# Diffusion model
p_theta: Denoising model

# Content-Concept Inversion

# Pivot turning inversion
z_src = E(x_src)
z_T = DDIM_Invert(z_src)           # inversion trajectory gives z_T, ..., z_1
for t = T to 1:
    for i = 1 to N:
        z_hat = Denoise(z_t, v_t_src)                         # Eq 12
        L = ||z_src - z_hat||^2
        v_t_src = v_t_src - lr * ∇_{v_t_src} L                # Update embedding
    z_t-1 = DDIM_Sample(z_t, Denoise(z_t, v_t_src, v_null))   # Eq 10

# Multi-concept inversion
z_ref = E(x_ref)
for i = 1 to M:
    z_t_ref ~ q(z_t | z_ref)                                  # Eq 3
    epsilon ~ N(0, I)
    L_ldm = ||epsilon - p_theta(z_t_ref, v_ref)||^2           # Eq 14
    z_hat_ref = Denoise(z_t_ref, v_ref)                       # Eq 12
    L_rec = ||z_ref - z_hat_ref||^2                           # Eq 15
    v_ref = v_ref - lr * ∇_{v_ref} (L_ldm + L_rec)            # Update embedding

# Content-Concept Fusion
z = z_T          # Init main branch from the inverted noise
z*_src = z_T     # Content matching branch starts from the same noise
for t = T to 1:
    # Content matching branch
    attn_t*, eps* = p_theta(z*_src, v_t_src, v_null)
    z*_src = DDIM_Sample(z*_src, eps*)
    # Main branch
    attn_t, eps = p_theta(z, v_t_src, v_ref)
    attn = CombineAttn(attn_t, attn_t*)
    z = DDIM_Sample(z, p_theta(z, v_t_src, v_ref, attn))

x_tgt = D(z)     # Decode target image
The key steps are:
- Invert source image to get pivot embeddings
- Invert reference to extract concept embeddings
- Dual-stream diffusion with content branch for attention
- Combine attention maps from both branches
- Final decoding to generate target image
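To complement the pseudo code, here is a hedged PyTorch sketch of the multi-concept inversion loop: a learnable reference embedding is optimized against the L_ldm and L_rec losses with a frozen denoiser. The noise-prediction interface eps_model(z_t, t, v), the single learnable embedding, and all hyperparameter values are illustrative assumptions rather than the paper's exact settings; the paper's multi-concept inversion learns several embeddings.

import torch

def multi_concept_inversion(z_ref, eps_model, alphas_cumprod,
                            embed_dim=768, steps=500, lr=1e-3, lambda_rec=1.0):
    """Optimize a learnable reference embedding v_ref against the L_ldm and
    L_rec losses with a frozen denoiser (illustrative single-embedding
    version; hyperparameters are placeholders)."""
    v_ref = torch.zeros(1, embed_dim, requires_grad=True)
    opt = torch.optim.Adam([v_ref], lr=lr)
    T = len(alphas_cumprod)
    for _ in range(steps):
        t = torch.randint(0, T, (1,))
        a_t = alphas_cumprod[t]
        eps = torch.randn_like(z_ref)
        z_t = a_t.sqrt() * z_ref + (1 - a_t).sqrt() * eps         # forward diffusion q(z_t | z_ref)
        eps_pred = eps_model(z_t, t, v_ref)
        loss_ldm = torch.mean((eps - eps_pred) ** 2)              # L_ldm
        z_hat = (z_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()  # one-step reconstruction
        loss_rec = torch.mean((z_ref - z_hat) ** 2)               # L_rec
        loss = loss_ldm + lambda_rec * loss_rec
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v_ref.detach()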