Composite Diffusion | whole >= Σ parts
Authors: Vikram Jamwal, Ramaneswaran S
Abstract: For an artist or a graphic designer, the spatial layout of a scene is a critical design choice. However, existing text-to-image diffusion models provide limited support for incorporating spatial information. This paper introduces Composite Diffusion as a means for artists to generate high-quality images by composing from the sub-scenes. The artists can specify the arrangement of these sub-scenes through a flexible free-form segment layout. They can describe the content of each sub-scene primarily using natural text and additionally by utilizing reference images or control inputs such as line art, scribbles, human pose, canny edges, and more. We provide a comprehensive and modular method for Composite Diffusion that enables alternative ways of generating, composing, and harmonizing sub-scenes. Further, we wish to evaluate the composite image for effectiveness in both image quality and achieving the artist’s intent. We argue that existing image quality metrics lack a holistic evaluation of image composites. To address this, we propose novel quality criteria especially relevant to composite generation. We believe that our approach provides an intuitive method of art creation. Through extensive user surveys, quantitative and qualitative analysis, we show how it achieves greater spatial, semantic, and creative control over image generation. In addition, our methods do not need to retrain or modify the architecture of the base diffusion models and can work in a plug-and-play manner with the fine-tuned models.
What, Why and How
Here is a summary of the key points from this paper on Composite Diffusion:
What:
- The paper introduces Composite Diffusion, a method to generate high-quality composite images by combining multiple sub-scenes specified through spatial layouts and text descriptions.
Why:
- Existing text-to-image models provide limited spatial control over image generation. Composite Diffusion gives artists more intuitive control over the layout and content of complex scenes.
- The quality of composite images depends on spatial conformance, content fidelity, harmony, aesthetics, etc., which existing metrics do not evaluate well. The paper proposes suitable evaluation criteria for composites.
How:
- The image generation process is divided into a scaffolding stage and a harmonization stage.
- In the scaffolding stage, sub-scenes are generated independently using text prompts, reference images, or control inputs such as scribbles.
- In the harmonization stage, the sub-scenes are denoised iteratively in the context of each other so that they blend smoothly.
- The quality of composite images is evaluated through human surveys, automated metrics, artist feedback, and visual inspection. The results demonstrate the effectiveness of Composite Diffusion.
In summary, this paper presents Composite Diffusion to generate high-quality composite images through intuitive spatial and text-based control over sub-scenes. It also proposes appropriate quality evaluation criteria for image composites.
Main Contributions
The main contributions of this paper are:
- Composite Diffusion: a comprehensive and flexible method for generating high-quality composite images using spatial layouts and text descriptions for sub-scenes.
- A modular two-stage approach of scaffolding and harmonization for composite image generation.
- The ability to incorporate various control modalities, such as reference images, scribbles, and sketches, to guide sub-scene generation.
- The concept of a scaffolding image or scaffolding control to anchor sub-scene generation.
- The option to use different control modalities for different sub-scenes in the same composite.
- Quality evaluation criteria proposed specifically for composite images, including spatial conformance, content fidelity, blending, and aesthetics.
- Quantitative and qualitative evaluation of Composite Diffusion using human surveys, automated metrics, artist feedback, and visual inspection.
- Demonstration of superior performance of Composite Diffusion over text-to-image and serial inpainting baselines.
- Evidence that Composite Diffusion works with pre-trained diffusion models without retraining or architecture changes.
- Artistic controls such as the scaffolding factor, which balances sub-scene independence against overall harmony.
In summary, the key contributions are proposing the Composite Diffusion approach, devising suitable quality criteria tailored to image composites, and demonstrating its effectiveness through comprehensive evaluations.
Method Section
Here is a summary of the method section from the paper on Composite Diffusion:
The goal is to generate an image composed of sub-scenes specified by a free-form spatial layout and text descriptions.
The method has two stages:
Scaffolding Stage:
- Sub-scenes are generated independently with the help of scaffolding.
- Scaffolding can be provided via reference images, a common scaffold image, or control inputs.
- Reference images are noised forward to the last scaffolding timestep.
- With a scaffold image, each segment is iteratively inpainted into it.
- With control inputs, segments are denoised under controls such as sketches.
- An intermediate composite is formed by merging the sub-scene latents.
Harmonization Stage:
- Sub-scenes are denoised iteratively in the context of each other.
- Denoising can use a global text prompt or sub-scene-specific prompts.
- Compositing happens after every denoising step, which blends the segments.
- The scaffolding aids are removed in this stage.
- The final composite emerges smoothly blended and coherent.
The scaffolding factor, i.e., the fraction of denoising steps spent in the scaffolding stage, controls the tradeoff between sub-scene independence and overall harmony: more scaffolding steps preserve each sub-scene's individual fidelity, while more harmonization steps yield a more blended, coherent whole.
In summary, the method scaffolds sub-scenes independently first, and then harmonizes them together in later diffusion steps to generate the final composite image.
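To make the scaffolding factor concrete, below is a small illustrative sketch of how it could split the reverse-diffusion timesteps; the function name, the kappa convention, and the step counts are assumptions for illustration rather than the paper's exact notation.

# Illustrative sketch (assumed convention): kappa is the fraction of
# denoising steps spent in the scaffolding stage.
def split_timesteps(num_steps, kappa):
    """Return (scaffolding_steps, harmonization_steps) for reverse diffusion."""
    boundary = int(round((1.0 - kappa) * num_steps))
    scaffolding = list(range(num_steps, boundary, -1))   # e.g. t = 50 .. 31
    harmonization = list(range(boundary, 0, -1))         # e.g. t = 30 .. 1
    return scaffolding, harmonization

# Example: 50 denoising steps with kappa = 0.4
# -> 20 scaffolding steps (t = 50..31), then 30 harmonization steps (t = 30..1).
scaffolding, harmonization = split_timesteps(50, 0.4)

Under this convention, the detailed pseudo code further below, which sets scaffold_steps = int(0.5 * num_diffusion_steps), corresponds to a scaffolding factor of 0.5.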
High-Level Pseudo Code
Here is the high-level pseudo code for the Composite Diffusion method:
# Inputs
segment_layouts = []     # Free-form spatial layouts (one mask per sub-scene)
text_prompts = []        # Text descriptions for the segments
control_conditions = []  # Optional control inputs (scribbles, line art, pose, ...)

# Scaffolding Stage: generate each segment independently
segments = []
for i, layout in enumerate(segment_layouts):
    if reference_image:
        # Noise the reference image to the last scaffolding timestep
        segment = noise_to_scaffold_step(reference_image)
    elif scaffold_image:
        # Iteratively inpaint the segment into the common scaffold image
        segment = inpaint(scaffold_image, mask=layout, prompt=text_prompts[i])
    elif control_conditions[i]:
        # Denoise the segment guided by its control input
        segment = denoise(prompt=text_prompts[i], control=control_conditions[i], mask=layout)
    segments.append(segment)

# Form the intermediate composite
composite = merge_segments(segments)

# Harmonization Stage: denoise the segments in each other's context
for t in harmonization_steps:
    if global_prompt:
        composite = denoise(composite, prompt=global_prompt)
    else:
        segments = [denoise(composite, prompt=text_prompts[i], mask=segment_layouts[i])
                    for i in range(len(segment_layouts))]
        composite = merge_segments(segments)

# Output the final composite
output = composite

In summary, the scaffolding stage generates the segments independently, while the harmonization stage denoises them iteratively in each other's context to blend them into the final composite image.
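The listing above uses merge_segments without defining it. Below is a minimal, self-contained sketch of what such a mask-weighted merge of segment latents could look like; the tensor shapes, the averaging of overlaps, and the use of PyTorch are illustrative assumptions, not the paper's implementation.

import torch

def merge_segments(segment_latents, segment_masks):
    """Mask-weighted merge of per-segment latents into one composite latent.

    segment_latents: list of tensors, each of shape (C, H, W)
    segment_masks:   list of tensors of shape (1, H, W) with values in {0, 1},
                     assumed to (roughly) partition the canvas.
    """
    composite = torch.zeros_like(segment_latents[0])
    coverage = torch.zeros_like(segment_masks[0])
    for latent, mask in zip(segment_latents, segment_masks):
        composite = composite + latent * mask
        coverage = coverage + mask
    # Average wherever masks overlap; leave uncovered pixels at zero.
    return composite / coverage.clamp(min=1.0)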
Detailed Pseudo Code
Here is a more detailed pseudo code for implementing the Composite Diffusion method:
# Inputs
segment_layouts = []     # Free-form spatial layouts (one per sub-scene)
segment_prompts = []     # Text description for each sub-scene
segment_controls = []    # Optional control input per sub-scene (or None)
global_prompt = None     # Optional prompt describing the whole scene
scaffold_image = None    # Optional common scaffold image
num_diffusion_steps = 1000
scaffold_steps = int(0.5 * num_diffusion_steps)  # scaffolding factor of 0.5

# Segment masks derived from the layouts
segment_masks = [get_mask(layout) for layout in segment_layouts]

# Start the reverse diffusion from pure noise
x_t = sample_noise()

# Scaffolding stage: denoise each segment independently
for t in range(num_diffusion_steps, scaffold_steps, -1):
    # Noise the scaffold image to the current timestep, if one is given
    if scaffold_image is not None:
        x_t_scaffold = q_sample(scaffold_image, t)

    segment_latents = []
    for segment_id, mask in enumerate(segment_masks):
        if segment_controls[segment_id] is not None:
            # Denoise the segment using its control input
            x_t_segment = denoise(x_t, t, prompt=segment_prompts[segment_id],
                                  control=segment_controls[segment_id],
                                  mask=mask)
        else:
            # Masked inpainting: keep the scaffold outside the segment mask
            x_t_scaffold_masked = x_t_scaffold * (1 - mask)
            x_t_segment = denoise(x_t, t, prompt=segment_prompts[segment_id],
                                  context=x_t_scaffold_masked,
                                  mask=mask)
        segment_latents.append(x_t_segment)

    # Intermediate composite: mask-weighted merge of the segment latents
    x_t = sum(latent * mask for latent, mask in zip(segment_latents, segment_masks))

# Harmonization stage: denoise the segments in each other's context
for t in range(scaffold_steps, 0, -1):
    if global_prompt is not None:
        # Denoise the whole composite with the global prompt
        x_t = denoise(x_t, t, prompt=global_prompt)
    else:
        # Denoise each segment with its own prompt (and control, if any)
        segment_latents = []
        for segment_id, mask in enumerate(segment_masks):
            x_t_segment = denoise(x_t, t, prompt=segment_prompts[segment_id],
                                  control=segment_controls[segment_id],
                                  mask=mask)
            segment_latents.append(x_t_segment)
        # Composite after every step so the segments blend
        x_t = sum(latent * mask for latent, mask in zip(segment_latents, segment_masks))

# Output the final composite
output = x_t

In summary, the method scaffolds the segments independently first, using controls or masked inpainting, and then harmonizes them with global or segment-specific prompts so that the segments blend smoothly.
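The detailed pseudo code calls q_sample(scaffold_image, t) to noise the scaffold image to timestep t. As a reference point, here is a minimal sketch of the standard DDPM forward-noising step this presumably corresponds to; the linear beta schedule and the PyTorch usage are illustrative assumptions rather than the paper's exact setup.

import torch

# Example noise schedule (illustrative values): 1000 steps, linear betas
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise=None):
    """Standard DDPM forward process: noise a clean latent x0 to timestep t.
    t is assumed to be a 0-based index in [0, len(betas) - 1]."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]                     # cumulative alpha at step t
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise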