Synthesizing Artistic Cinemagraphs from Text

Authors: Aniruddha Mahapatra, Aliaksandr Siarohin, Hsin-Ying Lee, Sergey Tulyakov, Jun-Yan Zhu

Abstract: We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions - an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose an idea of synthesizing image twins from a single text prompt - a pair of an artistic image and its pixel-aligned corresponding natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.

What, Why and How

Here is a summary of the key points from this paper:

What:

  • The paper introduces Text2Cinemagraph, a new method to automatically create cinemagraphs from text descriptions. Cinemagraphs are images in which certain elements move while the rest of the scene stays still.

Why:

  • Existing methods for generating cinemagraphs from a single image struggle with artistic or imaginative inputs, and text-to-video methods often suffer from temporal inconsistencies, failing to keep static regions still.
  • Text2Cinemagraph aims to overcome these limitations and allow users to create cinemagraphs for real and imaginary scenes using just text prompts.

How:

  • The method generates twin images from the text prompt - an artistic image depicting the style/content, and a natural image with the same layout.
  • The natural image is used to predict optical flow and mask using existing models. This flow is transferred to the artistic image.
  • The artistic image with the predicted flow/mask is animated into a cinemagraph using symmetric warping.

Key Findings:

  • Text2Cinemagraph outperforms baselines on creating cinemagraphs from real and artistic images based on automated metrics and user studies.
  • Ablations validate the need for twin images, masks, and text conditioning.
  • The method also enables control of motion directions using text and animating existing paintings.

In summary, Text2Cinemagraph enables fully automated generation of visually appealing and temporally coherent cinemagraphs from text for both real and imaginary scenes. The twin image idea helps bridge the gap between artistic images and existing flow/video models.

Main Contributions

Here are the main contributions of this paper:

  • Proposes the new task of generating cinemagraphs from textual descriptions, requiring no image/video capture or manual annotation.

  • Introduces the Text2Cinemagraph method to tackle this task using twin image synthesis - generating an artistic image depicting the text style/content, and a natural image with the same layout.

  • Leverages the natural image to predict optical flow and a mask using existing models; the predicted flow is then transferred to animate the artistic image into a cinemagraph.

  • Achieves state-of-the-art results in generating cinemagraphs from both real and artistic images, validated through automated metrics and user studies.

  • Demonstrates the ability to control motion directions using text prompts.

  • Shows an extension to animate existing paintings into cinemagraphs.

  • Provides ablation studies analyzing the impact of components such as twin images, masks, and text conditioning.

  • Introduces the idea of twin image synthesis to bridge the gap between artistic images and existing models designed for real-world data.

In summary, the key contributions are:

  1. A new task formulation for text-based cinemagraph generation

  2. The Text2Cinemagraph method utilizing twin image synthesis

  3. Quantitative and qualitative results surpassing current methods

  4. Applications for artistic image animation and motion control

  5. Insights from ablation studies on design choices

  6. The concept of twin images for bridging artistic and real domains

Method Section

Here is a summary of the method section from the paper:

The goal is to generate a cinemagraph from a user-provided artistic text prompt that describes the scene content and style.

The key idea is to synthesize twin images from the text prompt:

  1. Artistic image (A): Captures the artistic style described in the text prompt. This will be the final cinemagraph output.

  2. Natural image (N): A corresponding natural-looking image with the same semantic layout as A.

The natural image N allows leveraging existing models trained on real images to predict plausible motion and masks.

Twin Image Generation:

  • Generate the artistic image A from the text prompt using Stable Diffusion.
  • Automatically create a natural prompt by extracting the nouns from the text prompt and appending “natural” terms (see the sketch after this list).
  • Generate the natural image N with the same diffusion model, injecting A’s internal features to enforce layout similarity.
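
Below is a minimal sketch of the natural-prompt construction. It assumes NLTK's tokenizer and POS tagger are available; the style-word stoplist and the exact wording of the appended "natural" terms are illustrative assumptions, not the paper's released implementation.

```python
# Sketch only: noun extraction with NLTK plus a hypothetical style-word stoplist.
import nltk

# Newer NLTK releases name these resources "punkt_tab" and
# "averaged_perceptron_tagger_eng"; adjust the names if needed.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

STYLE_WORDS = {"oil", "painting", "style", "watercolor", "illustration"}  # hypothetical stoplist

def get_natural_prompt(text_prompt):
    tagged = nltk.pos_tag(nltk.word_tokenize(text_prompt))
    # Keep content nouns and drop style-related words (assumption).
    nouns = [w for w, t in tagged if t.startswith("NN") and w.lower() not in STYLE_WORDS]
    # Append "natural" terms so the twin is generated in a photorealistic style.
    return " ".join(nouns) + ", a realistic photo of nature"

print(get_natural_prompt("a large waterfall flowing down a cliff in oil painting style"))
# -> roughly "waterfall cliff, a realistic photo of nature" (exact nouns depend on the tagger)
```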

Mask Prediction:

  • Obtain a semantic segmentation of N using the ODISE model and take the mask of the region to animate.
  • Refine the mask using self-attention maps from the artistic image’s generation (see the sketch after this list).
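
As referenced above, here is a minimal sketch of the refinement step, assuming the self-attention maps have already been averaged and grouped into spatial clusters (e.g., via spectral clustering). The function name, the per-cluster overlap criterion, and the threshold value are assumptions; the paper selects clusters based on their agreement with the mask obtained from ODISE.

```python
# Sketch only: keep the attention clusters that agree with the natural twin's mask.
import numpy as np

def refine_mask_from_clusters(cluster_labels, nat_mask, overlap_thresh=0.5):
    """cluster_labels: (H, W) int array, one attention-cluster id per pixel.
    nat_mask: (H, W) bool array from segmenting the natural image N."""
    refined = np.zeros_like(nat_mask, dtype=bool)
    for c in np.unique(cluster_labels):
        cluster = cluster_labels == c
        # Fraction of this cluster's pixels that fall inside the segmentation mask.
        overlap = (cluster & nat_mask).sum() / max(int(cluster.sum()), 1)
        if overlap >= overlap_thresh:
            refined |= cluster  # keep clusters mostly covered by the mask
    return refined
```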

Flow Prediction:

  • Predict optical flow on N, conditioned on the text prompt and the mask, using a flow prediction network (see the sketch below).
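
A tiny sketch of how the mask can be used to keep everything outside the target region perfectly still; whether the released model masks the predicted flow explicitly or relies purely on the conditioning is an assumption here.

```python
# Sketch only: zero the predicted flow outside the mask so static regions do not move.
import torch

def restrict_flow_to_mask(flow: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # flow: (2, H, W) dense displacement field predicted on the natural twin N
    # mask: (H, W) binary mask of the region that should move
    return flow * mask.to(flow.dtype).unsqueeze(0)
```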

Cinemagraph Generation:

  • Transfer the predicted flow from N to A; because the twins are pixel-aligned, the flow can be reused directly.
  • Animate A into a looping cinemagraph using the flow, the mask, and a video generation network (a simplified warping sketch follows this list).
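
The sketch below illustrates the idea of symmetric warping using plain image-space backward warping and a cross-fade between a forward-displaced and a backward-displaced copy of the frame, so the clip loops seamlessly. This is a simplification: the actual method warps and blends features inside a trained video generation network, and the helper names and frame count here are assumptions.

```python
# Simplified sketch of symmetric warping for a seamless loop (not the paper's
# feature-level warping network). `flow` is a constant per-frame displacement field.
import torch
import torch.nn.functional as F

def backward_warp(img, disp):
    # img: (1, 3, H, W); disp: (1, 2, H, W) displacement in pixels (x, y).
    # Output pixel at x samples img at x - disp, so image content moves by +disp.
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    gx = (xs - disp[:, 0]) / (w - 1) * 2 - 1   # normalize to [-1, 1]
    gy = (ys - disp[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)       # (1, H, W, 2)
    return F.grid_sample(img, grid, padding_mode="border", align_corners=True)

def make_cinemagraph(art_img, flow, mask, n_frames=60):
    # art_img: (1, 3, H, W); flow: (1, 2, H, W) in pixels/frame; mask: (1, 1, H, W) in {0, 1}.
    frames = []
    for t in range(n_frames):
        fwd = backward_warp(art_img, flow * t)               # frame 0 pushed forward by t steps
        bwd = backward_warp(art_img, flow * (t - n_frames))  # frame 0 pulled back by N - t steps
        alpha = t / n_frames
        frame = (1 - alpha) * fwd + alpha * bwd              # cross-fade makes frame N equal frame 0
        frames.append(frame * mask + art_img * (1 - mask))   # unmasked regions stay static
    return frames
```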

In summary, twin images are used to obtain optical flow and masks for the artistic image by leveraging models designed for real images. This optical flow then animates the artistic image into the final cinemagraph output.

High-Level Pseudo Code

Here is the high-level pseudo code for the Text2Cinemagraph method:

```python
# Input
text_prompt = "..."  # artistic text prompt
region_name = "..."  # name of region to animate

# Generate twin images
art_img = generate_artistic_image(text_prompt)
nat_prompt = get_natural_prompt(text_prompt)
nat_img = generate_natural_image(nat_prompt, art_img)

# Get mask
nat_mask = segment_image(nat_img, region_name)
art_mask = refine_mask(nat_mask, art_img)

# Predict flow
nat_flow = predict_optical_flow(nat_img, text_prompt, art_mask)
art_flow = nat_flow

# Generate cinemagraph
cinemagraph = animate_image(art_img, art_flow, art_mask)
```

Generate twin images

```python
def generate_artistic_image(text_prompt):
    # Generate the artistic image A with Stable Diffusion.
    ...

def get_natural_prompt(text_prompt):
    # Extract the nouns from the prompt and append "natural" terms.
    ...

def generate_natural_image(nat_prompt, art_img):
    # Generate the natural twin N with Stable Diffusion,
    # injecting art_img's internal features to match its layout.
    ...
```

Get mask

```python
def segment_image(nat_img, region_name):
    # Segment the region to animate with a pre-trained model such as ODISE.
    ...

def refine_mask(nat_mask, art_img):
    # Refine the mask with self-attention maps from art_img's generation.
    ...
```

Predict flow

```python
def predict_optical_flow(nat_img, text_prompt, art_mask):
    # Predict dense flow on nat_img, conditioned on the text prompt and the mask.
    ...
```

Generate cinemagraph

```python
def animate_image(art_img, art_flow, art_mask):
    # Warp art_img with the flow (symmetric warping) to produce a looping cinemagraph.
    ...
```


Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the Text2Cinemagraph method:

```python
import stable_diffusion as sd
import segmentation_model as sm 
import optical_flow_model as ofm
import video_generation_model as vgm

# Input
text_prompt = "a large waterfall flowing down a cliff in oil painting style"  
region_name = "waterfall"

# Generate twin images

# Get artistic image
art_img, art_att_maps, art_feat = sd.generate_image(text_prompt)

# Get natural prompt
nat_prompt = " ".join(get_nouns(text_prompt)) + ", nature photo"

# Generate natural image
nat_img = sd.generate_image(nat_prompt, 
                          inject_att=art_att_maps, 
                          inject_feat=art_feat)

# Get mask
nat_mask = sm.segment(nat_img, region_name) 

# Average attention maps 
art_att_avg = average_att_maps(art_att_maps)

# Get clusters from attention map
art_clusters = spectral_cluster(art_att_avg, n_clusters=10)

# Select clusters based on IOU with nat_mask 
art_mask = select_clusters(art_clusters, nat_mask, iou_thresh=0.7) 

# Upsample mask to original resolution
art_mask = upsample(art_mask)

# Predict flow
nat_flow = ofm.predict_flow(nat_img, text_prompt, art_mask)
art_flow = nat_flow

# Generate cinemagraph
cinemagraph = vgm.generate_video(art_img, art_flow, art_mask)

```

Helper functions

```python
def get_nouns(text):
    # Extract the nouns from the prompt with a POS tagger (e.g., NLTK).
    ...

def average_att_maps(att_maps):
    # Average the self-attention maps across diffusion timesteps.
    ...

def spectral_cluster(att_map, n_clusters):
    # Group the averaged attention map into n_clusters spatial clusters.
    ...

def select_clusters(clusters, mask, iou_thresh):
    # Keep the clusters whose overlap with the mask exceeds iou_thresh.
    ...

def upsample(mask):
    # Bilinearly upsample the mask to the image resolution.
    ...
```