Composite Diffusion | whole >= Σ parts
Authors: Vikram Jamwal, Ramaneswaran S
Abstract: For an artist or a graphic designer, the spatial layout of a scene is a critical design choice. However, existing text-to-image diffusion models provide limited support for incorporating spatial information. This paper introduces Composite Diffusion as a means for artists to generate high-quality images by composing from the sub-scenes. The artists can specify the arrangement of these sub-scenes through a flexible free-form segment layout. They can describe the content of each sub-scene primarily using natural text and additionally by utilizing reference images or control inputs such as line art, scribbles, human pose, canny edges, and more. We provide a comprehensive and modular method for Composite Diffusion that enables alternative ways of generating, composing, and harmonizing sub-scenes. Further, we wish to evaluate the composite image for effectiveness in both image quality and achieving the artist’s intent. We argue that existing image quality metrics lack a holistic evaluation of image composites. To address this, we propose novel quality criteria especially relevant to composite generation. We believe that our approach provides an intuitive method of art creation. Through extensive user surveys, quantitative and qualitative analysis, we show how it achieves greater spatial, semantic, and creative control over image generation. In addition, our methods do not need to retrain or modify the architecture of the base diffusion models and can work in a plug-and-play manner with the fine-tuned models.
What, Why and How
Here is a summary of the key points from this paper on Composite Diffusion:
What:
- The paper introduces Composite Diffusion, a method to generate high-quality composite images by combining multiple sub-scenes specified through spatial layouts and text descriptions.
Why:
- Existing text-to-image models provide limited spatial control over image generation. Composite Diffusion gives artists more intuitive control over the layout and content of complex scenes.
- The quality of composite images depends on spatial conformance, content fidelity, harmony, aesthetics, etc., which existing metrics do not evaluate well. The paper proposes suitable evaluation criteria for composites.
How:
- The image generation process is divided into a scaffolding stage and a harmonization stage.
- In the scaffolding stage, sub-scenes are generated independently using text prompts, reference images, or control inputs such as scribbles.
- In the harmonization stage, the sub-scenes are denoised iteratively in the context of each other so that they blend smoothly.
- The quality of composite images is evaluated through human surveys, automated metrics, artist feedback, and visual inspection. The results demonstrate the effectiveness of Composite Diffusion.
In summary, this paper presents Composite Diffusion to generate high-quality composite images through intuitive spatial and text-based control over sub-scenes. It also proposes appropriate quality evaluation criteria for image composites.
Main Contributions
The main contributions of this paper are:
- Composite Diffusion: a comprehensive and flexible method for generating high-quality composite images using spatial layouts and text descriptions for sub-scenes.
- A modular two-stage approach of scaffolding and harmonization for composite image generation.
- The ability to incorporate various control modalities, such as reference images, scribbles, and sketches, to guide sub-scene generation.
- The concept of a scaffolding image or scaffolding control to anchor sub-scene generation.
- The option to use different control modalities for different sub-scenes in the same composite.
- Quality evaluation criteria proposed specifically for composite images, including spatial conformance, content fidelity, blending, and aesthetics.
- Quantitative and qualitative evaluation of Composite Diffusion using human surveys, automated metrics, artist feedback, and visual inspection.
- Demonstration of superior performance of Composite Diffusion over text-to-image and serial inpainting baselines.
- Evidence that Composite Diffusion works with pre-trained diffusion models without retraining or architecture changes.
- Artistic controls such as the scaffolding factor, which balances sub-scene independence against overall harmony.
In summary, the key contributions are proposing the Composite Diffusion approach, devising suitable quality criteria tailored to image composites, and demonstrating its effectiveness through comprehensive evaluations.
Method Section
Here is a summary of the method section from the paper on Composite Diffusion:
The goal is to generate an image composed of sub-scenes specified by a free-form spatial layout and text descriptions.
The method has two stages:
Scaffolding Stage:
- Sub-scenes are generated independently with the help of scaffolding.
- Scaffolding can be provided via reference images, a common scaffold image, or control inputs.
- Reference images are noised forward to the last scaffolding timestep.
- With a scaffold image, each segment is iteratively inpainted into it.
- With control inputs, segments are denoised under controls such as sketches.
- An intermediate composite is formed by merging the sub-scene latents.
Harmonization Stage:
- Sub-scenes are denoised iteratively in the context of each other.
- Denoising can use a global text prompt or sub-scene-specific prompts.
- Compositing happens after every denoising step, which blends the segments.
- The scaffolding aids are removed in this stage.
- The final composite emerges smoothly blended and coherent.
The scaffolding factor, i.e., the fraction of denoising steps spent in the scaffolding stage, controls the tradeoff between sub-scene independence and overall harmony: more scaffolding steps preserve each sub-scene's individual fidelity, while more harmonization steps yield a more blended, coherent whole.
In summary, the method scaffolds sub-scenes independently first, and then harmonizes them together in later diffusion steps to generate the final composite image.
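To make the scaffolding factor concrete, below is a small illustrative sketch of how it could split the reverse-diffusion timesteps; the function name, the kappa convention, and the step counts are assumptions for illustration rather than the paper's exact notation.

# Illustrative sketch (assumed convention): kappa is the fraction of
# denoising steps spent in the scaffolding stage.
def split_timesteps(num_steps, kappa):
    """Return (scaffolding_steps, harmonization_steps) for reverse diffusion."""
    boundary = int(round((1.0 - kappa) * num_steps))
    scaffolding = list(range(num_steps, boundary, -1))   # e.g. t = 50 .. 31
    harmonization = list(range(boundary, 0, -1))         # e.g. t = 30 .. 1
    return scaffolding, harmonization

# Example: 50 denoising steps with kappa = 0.4
# -> 20 scaffolding steps (t = 50..31), then 30 harmonization steps (t = 30..1).
scaffolding, harmonization = split_timesteps(50, 0.4)

Under this convention, the detailed pseudo code further below, which sets scaffold_steps = int(0.5 * num_diffusion_steps), corresponds to a scaffolding factor of 0.5.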
High-Level Pseudo Code
Here is the high-level pseudo code for the Composite Diffusion method:
# Inputs
segment_layouts = []     # Free-form spatial layouts (one mask per sub-scene)
text_prompts = []        # Text descriptions for the segments
control_conditions = []  # Optional control inputs (scribbles, line art, pose, ...)

# Scaffolding Stage: generate each segment independently
segments = []
for i, layout in enumerate(segment_layouts):
    if reference_image:
        # Noise the reference image to the last scaffolding timestep
        segment = noise_to_scaffold_step(reference_image)
    elif scaffold_image:
        # Iteratively inpaint the segment into the common scaffold image
        segment = inpaint(scaffold_image, mask=layout, prompt=text_prompts[i])
    elif control_conditions[i]:
        # Denoise the segment guided by its control input
        segment = denoise(prompt=text_prompts[i], control=control_conditions[i], mask=layout)
    segments.append(segment)

# Form the intermediate composite
composite = merge_segments(segments)

# Harmonization Stage: denoise the segments in each other's context
for t in harmonization_steps:
    if global_prompt:
        composite = denoise(composite, prompt=global_prompt)
    else:
        segments = [denoise(composite, prompt=text_prompts[i], mask=segment_layouts[i])
                    for i in range(len(segment_layouts))]
        composite = merge_segments(segments)

# Output the final composite
output = composite

In summary, the scaffolding stage generates the segments independently, while the harmonization stage denoises them iteratively in each other's context to blend them into the final composite image.
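The listing above uses merge_segments without defining it. Below is a minimal, self-contained sketch of what such a mask-weighted merge of segment latents could look like; the tensor shapes, the averaging of overlaps, and the use of PyTorch are illustrative assumptions, not the paper's implementation.

import torch

def merge_segments(segment_latents, segment_masks):
    """Mask-weighted merge of per-segment latents into one composite latent.

    segment_latents: list of tensors, each of shape (C, H, W)
    segment_masks:   list of tensors of shape (1, H, W) with values in {0, 1},
                     assumed to (roughly) partition the canvas.
    """
    composite = torch.zeros_like(segment_latents[0])
    coverage = torch.zeros_like(segment_masks[0])
    for latent, mask in zip(segment_latents, segment_masks):
        composite = composite + latent * mask
        coverage = coverage + mask
    # Average wherever masks overlap; leave uncovered pixels at zero.
    return composite / coverage.clamp(min=1.0)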
Detailed Pseudo Code
Here is a more detailed pseudo code for implementing the Composite Diffusion method:
# Inputs
segment_layouts = []     # Free-form spatial layouts (one per sub-scene)
segment_prompts = []     # Text description for each sub-scene
segment_controls = []    # Optional control input per sub-scene (or None)
global_prompt = None     # Optional prompt describing the whole scene
scaffold_image = None    # Optional common scaffold image
num_diffusion_steps = 1000
scaffold_steps = int(0.5 * num_diffusion_steps)  # scaffolding factor of 0.5

# Segment masks derived from the layouts
segment_masks = [get_mask(layout) for layout in segment_layouts]

# Start the reverse diffusion from pure noise
x_t = sample_noise()

# Scaffolding stage: denoise each segment independently
for t in range(num_diffusion_steps, scaffold_steps, -1):
    # Noise the scaffold image to the current timestep, if one is given
    if scaffold_image is not None:
        x_t_scaffold = q_sample(scaffold_image, t)

    segment_latents = []
    for segment_id, mask in enumerate(segment_masks):
        if segment_controls[segment_id] is not None:
            # Denoise the segment using its control input
            x_t_segment = denoise(x_t, t, prompt=segment_prompts[segment_id],
                                  control=segment_controls[segment_id],
                                  mask=mask)
        else:
            # Masked inpainting: keep the scaffold outside the segment mask
            x_t_scaffold_masked = x_t_scaffold * (1 - mask)
            x_t_segment = denoise(x_t, t, prompt=segment_prompts[segment_id],
                                  context=x_t_scaffold_masked,
                                  mask=mask)
        segment_latents.append(x_t_segment)

    # Intermediate composite: mask-weighted merge of the segment latents
    x_t = sum(latent * mask for latent, mask in zip(segment_latents, segment_masks))

# Harmonization stage: denoise the segments in each other's context
for t in range(scaffold_steps, 0, -1):
    if global_prompt is not None:
        # Denoise the whole composite with the global prompt
        x_t = denoise(x_t, t, prompt=global_prompt)
    else:
        # Denoise each segment with its own prompt (and control, if any)
        segment_latents = []
        for segment_id, mask in enumerate(segment_masks):
            x_t_segment = denoise(x_t, t, prompt=segment_prompts[segment_id],
                                  control=segment_controls[segment_id],
                                  mask=mask)
            segment_latents.append(x_t_segment)
        # Composite after every step so the segments blend
        x_t = sum(latent * mask for latent, mask in zip(segment_latents, segment_masks))

# Output the final composite
output = x_t

In summary, the method scaffolds the segments independently first, using controls or masked inpainting, and then harmonizes them with global or segment-specific prompts so that the segments blend smoothly.
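The detailed pseudo code calls q_sample(scaffold_image, t) to noise the scaffold image to timestep t. As a reference point, here is a minimal sketch of the standard DDPM forward-noising step this presumably corresponds to; the linear beta schedule and the PyTorch usage are illustrative assumptions rather than the paper's exact setup.

import torch

# Example noise schedule (illustrative values): 1000 steps, linear betas
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise=None):
    """Standard DDPM forward process: noise a clean latent x0 to timestep t.
    t is assumed to be a 0-based index in [0, len(betas) - 1]."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]                     # cumulative alpha at step t
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise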