Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

Authors: Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu

Abstract: Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is infeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper proposes a new method called Any-Size-Diffusion (ASD) for generating high-resolution images of arbitrary sizes from text prompts.

Why:

  • Existing text-to-image models like Stable Diffusion produce poorly composed images when generating at sizes other than their training resolution, because they are trained on fixed-size image-text pairs.

How:

  • ASD uses a two-stage approach:
  1. Any Ratio Adaptability Diffusion (ARAD) - Trains the model on images with different aspect ratios to enable handling varying sizes.
  2. Fast Seamless Tiled Diffusion (FSTD) - Further enlarges the image from ARAD to any size, using an efficient tiling approach with implicit overlaps to avoid seaming artifacts.
  • By training on multiple aspect ratios and using implicit tiling, ASD can generate properly composed, high-resolution images of any size from text. It avoids the composition and memory issues faced by current models like Stable Diffusion.

In summary, ASD introduces adaptations like multi-ratio training and implicit tiling to enable text-to-image models to efficiently generate high-quality, well-composed images at arbitrary resolutions.

Main Contributions

The main contributions of this paper are:

  1. Proposing the Any-Size-Diffusion (ASD) model - a two-stage pipeline to generate high-resolution images of any size from text while maintaining proper composition.

  2. Introducing the Any Ratio Adaptability Diffusion (ARAD) method in Stage 1. ARAD trains the model on images of different aspect ratios to improve its ability to adapt to varying image sizes.

  3. Developing the Fast Seamless Tiled Diffusion (FSTD) technique in Stage 2. FSTD uses implicit overlap when tiling image regions to efficiently upscale images to higher resolutions without seaming artifacts.

  4. Demonstrating that ASD can generate images up to 9x larger than baseline Stable Diffusion models under the same GPU memory constraints.

  5. Showing quantitatively that ASD achieves superior performance over baselines in terms of FID, IS, and CLIP scores on image generation benchmarks.

  6. Providing extensive ablation studies analyzing the impact of multi-ratio training strategies and implicit tiling parameters.

In summary, the key contributions are proposing the novel ASD pipeline, ARAD and FSTD techniques, and demonstrating ASD’s improved ability to generate high-quality, well-composed images at any resolution compared to existing state-of-the-art text-to-image models.

Method Section

Here is a summary of the method section from the paper:

The paper proposes a two-stage pipeline called Any-Size-Diffusion (ASD) for generating high-resolution images from text prompts:

Stage 1 - Any Ratio Adaptability Diffusion (ARAD):

  • ARAD is based on latent diffusion models like Stable Diffusion.
  • It is trained on image-text pairs with different aspect ratios instead of fixed sizes.
  • A multi-aspect ratio training strategy resizes each training image to one of 9 predefined ratios/sizes.
  • This enables ARAD to generate well-composed images across varying sizes (a minimal bucketing sketch follows this list).
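
As an illustration of the multi-aspect-ratio resizing, the sketch below snaps an image to the nearest preset ratio bucket before resizing. Only three of the nine buckets are shown, and the (height, width) presets are our assumptions rather than values from the paper:

RATIO_BUCKETS = {           # ratio = height / width -> (height, width); sizes assumed
    1.0: (512, 512),        # square
    0.56: (288, 512),       # wide, roughly 9:16 height-to-width
    1.77: (512, 288),       # tall, roughly 16:9 height-to-width
}

def nearest_bucket(height, width):
    """Pick the preset size whose aspect ratio is closest to the image's."""
    ratio = height / width
    best = min(RATIO_BUCKETS, key=lambda r: abs(r - ratio))
    return RATIO_BUCKETS[best]

# A 1080x1920 frame (ratio 0.5625) snaps to the 0.56 bucket: (288, 512).
print(nearest_bucket(1080, 1920))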

Stage 2 - Fast Seamless Tiled Diffusion (FSTD):

  • FSTD takes the image from ARAD and enlarges it to any desired size.
  • It uses a tiled sampling approach for efficiency.
  • The key novelty is implicit overlap when tiling image regions.
  • This means dynamically shifting tile positions while keeping tile counts fixed.
  • The shifted tiles are denoised while unchanged ones retain their values.
  • This implicit overlap prevents seaming artifacts between tiles.
  • It also significantly improves inference speed compared to standard tiling (a toy sketch of the shift follows this list).
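
To make the shifting concrete, here is a toy sketch (the function name and details are ours, not the paper's): one random offset per sampling step jitters the whole tile grid, so tile seams land in different places at every step, while margins outside the shifted grid keep their previous values.

import random

def shifted_grid(h, w, tile, rng):
    """Top-left corners of a tile grid jittered by one random offset per step."""
    dy, dx = rng.randrange(tile), rng.randrange(tile)
    return [(y, x)
            for y in range(dy, h - tile + 1, tile)
            for x in range(dx, w - tile + 1, tile)]

rng = random.Random(0)
for step in range(3):   # the seams move every step
    print(step, shifted_grid(512, 512, 128, rng))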

In summary, the method section explains the two stages of ASD - ARAD handles image composition across sizes via multi-ratio training, while FSTD efficiently upscales the image to higher resolutions using implicit tiling to avoid seams. The combination enables generating high-quality images at any arbitrary size specified by the user.

High-Level Pseudo Code

Here is the high-level pseudo code for the key components of the paper:

# Stage 1: ARAD 
# Multi-aspect ratio training
for image, text in training_data:
  # Snap the image to the closest of the nine preset ratios, then resize
  resized_image = resize_to_nearest_preset_ratio(image)
  train_diffusion_model(resized_image, text)
 
# Stage 2: FSTD
# Fast Seamless Tiled Diffusion 
 
# Tile the (noised) ARAD output into fixed-size regions
tiles = []
for i in range(num_tiles):
  tiles.append(tile_image_region(image, i))
 
for t in range(num_steps):
 
  # Implicit tile overlap: one random offset per step
  offsets = generate_random_offsets(tile_size)
  shifted_tiles = shift_tiles(tiles, offsets)

  # Denoise each shifted tile for one sampling step
  denoised_tiles = []
  for tile in shifted_tiles:
    denoised_tile = denoise(tile, t)
    denoised_tiles.append(denoised_tile)

  # Merge back; regions left unshifted retain their previous values
  updated_tiles = reconstruct_image(tiles, denoised_tiles)
 
  # Update tiles for next iteration
  tiles = updated_tiles 
 
output = decode_image(tiles)

This outlines the key pseudocode components:

  • Multi-aspect ratio training in ARAD
  • Tiling the image into regions
  • Shifting the tiles randomly for implicit overlap
  • Denoising the shifted tiles while retaining unshifted ones
  • Reconstructing the image and iterating
  • Decoding the output image

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the key components in the paper:

# Hyperparameters
num_steps = 50     # number of sampling steps (illustrative)
tile_size = 64     # tile side length (illustrative)
ratios = [1.0, 0.75, 1.33, 0.56, 1.77, 0.625, 1.6, 0.5, 2.0]  # height / width
ratio_sizes = {    # one preset (height, width) per ratio; sizes illustrative
    1.0: (512, 512), 0.75: (384, 512), 1.33: (512, 384),
    0.56: (288, 512), 1.77: (512, 288), 0.625: (320, 512),
    1.6: (512, 320), 0.5: (320, 640), 2.0: (640, 320),
}
 
# Stage 1: ARAD
# Multi-aspect ratio training
 
def resize_image(image, ratios, sizes):
  ratio = image.height / image.width
  nearest_ratio = min(ratios, key=lambda r: abs(r - ratio))
  return resize(image, sizes[nearest_ratio])
 
def train_arad(images, texts):
  for image, text in zip(images, texts):
    resized_image = resize_image(image, ratios, ratio_sizes)
    encoded_text = encode_text(text)
    # Train diffusion model on (resized_image, encoded_text) pair
 
# Stage 2: FSTD 
# Fast Seamless Tiled Diffusion
 
# Tiling
def tile_image(image, tile_size):
  tiles_x = image.width // tile_size    # tiles per row
  tiles_y = image.height // tile_size   # tiles per column
  tiles = []
  for i in range(tiles_x * tiles_y):
    x = i % tiles_x
    y = i // tiles_x
    tile = image[y*tile_size : (y+1)*tile_size,
                 x*tile_size : (x+1)*tile_size]
    tiles.append(tile)
  return tiles
 
# Diffusion: a single reverse step on one tile
def denoise(tile, t):
  return diffusion_model(tile, t)

# Main FSTD diffusion
def fstd(image, target_size):
  # Upscale the ARAD output to the target size, encode, and noise it once
  latent = add_noise(encode(resize(image, target_size)))
  tiles = tile_image(latent, tile_size)

  for t in range(num_steps):

    # Implicit overlap: a fresh random offset shifts the whole grid each step
    offsets = generate_random_offsets(tile_size)
    shifted_tiles = shift_tiles(tiles, offsets)

    # Denoise each shifted tile for one step
    denoised_tiles = [denoise(tile, t) for tile in shifted_tiles]

    # Merge back: shifted regions take denoised values, the rest keep theirs
    tiles = reconstruct_image(tiles, denoised_tiles)

  return decode_image(tiles)

This provides more implementation details like:

  • The multi-ratio resizing logic
  • Tiling the image into patches
  • Noising the upscaled base image once, then denoising it tile by tile
  • Shifting tiles randomly for implicit overlap
  • Reconstructing the image from denoised and unchanged tiles
  • Decoding the final output
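
For completeness, below is a self-contained toy run of the same loop on a random array; every function here is a stand-in stub under our own assumptions, not the paper's actual model or API:

import numpy as np

rng = np.random.default_rng(0)

def one_step(tile, t):
    # Stand-in for a single UNet denoising step
    return tile * 0.95

def fstd_toy(latent, tile=32, steps=8):
    h, w = latent.shape
    x = latent.copy()
    for t in range(steps):
        # Implicit overlap: a fresh random offset moves the tile seams each step
        dy = int(rng.integers(0, tile))
        dx = int(rng.integers(0, tile))
        for y in range(dy, h - tile + 1, tile):
            for xx in range(dx, w - tile + 1, tile):
                x[y:y+tile, xx:xx+tile] = one_step(x[y:y+tile, xx:xx+tile], t)
        # Margins outside the shifted grid simply keep their current values
    return x

base = rng.standard_normal((128, 256))   # a pretend 2:1 noised latent
print(fstd_toy(base).shape)              # (128, 256)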
