InFusion: Inject and Attention Fusion for Multi Concept Zero Shot Text based Video Editing
Authors: Anant Khandelwal
Abstract: Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images that align with the text prompt used for editing an input image. However, when these models are applied to video, the main challenge is to ensure temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing that leverages large pre-trained image diffusion models. Our framework specifically supports editing of multiple concepts with pixel-level control over the diverse concepts mentioned in the editing prompt. Specifically, we inject the difference between the features obtained with the source and edit prompts from U-Net residual blocks in the decoder layers; combined with injected attention features, this makes it feasible to query the source contents and scale the edited concepts while injecting the unedited parts. Editing is further controlled in a fine-grained manner with a mask extraction and attention fusion strategy that cuts the edited part from the source and pastes it into the denoising pipeline for the editing prompt. Our framework is a low-cost alternative to one-shot tuned editing models, since it requires no training. We demonstrate complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA adaptation, which is compatible with all existing image diffusion techniques. Extensive experimental results demonstrate its effectiveness over existing methods in rendering high-quality and temporally consistent videos.
What, Why and How
Here is a summary of the key points from this paper:
What:
- The paper proposes a new framework called InFusion for zero-shot text-based video editing using large pre-trained image diffusion models like Stable Diffusion.
Why:
- Existing image diffusion models can generate high-quality images from text prompts, but using them directly on video leads to a lack of temporal consistency across frames.
- Other zero-shot video editing methods have limitations in editing complex concepts or require model fine-tuning.
How:
- InFusion has two main components: Inject and Attention Fusion
- Inject: Injects the difference between spatial features obtained with the source and edit prompts, along with attention features from the source, into the diffusion model to highlight edited concepts.
- Attention Fusion: Uses mask-guided mixing of self- and cross-attention maps from the source and edited prompts to retain unedited content and integrate edited concepts.
- Together, these allow editing multiple concepts in videos while maintaining temporal coherence, without any training.
In summary, InFusion is a zero-shot framework to enable complex text-based editing of videos using pre-trained image diffusion models, with a novel injection and attention fusion method to ensure temporal consistency.
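For intuition, the feature-difference injection in the Inject module can be sketched in a few lines of PyTorch. This is a minimal illustration of one plausible reading (adding the scaled source/edit difference to the editing branch's decoder features); the function and argument names are hypothetical, not the paper's code.

import torch

def inject_feature_difference(f_source, f_edit, f_current, scale=1.0):
    # Hypothetical sketch: the difference between U-Net decoder features
    # obtained with the source and edit prompts is added to the current
    # decoder features of the editing branch to highlight edited concepts.
    f_diff = f_source - f_edit
    return f_current + scale * f_diff

In the full pipeline this would run inside selected U-Net decoder blocks during the early denoising steps, while the attention fusion described in the following sections handles the later steps.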
Main Contributions
The main contributions of this paper are:
- A novel zero-shot framework, InFusion, for multi-concept text-based video editing that ensures temporal consistency without requiring any training.
- The Inject module, which injects spatial feature differences and attention features to highlight edited concepts while retaining unedited content.
- The Attention Fusion module, which uses mask-guided mixing of self/cross attention maps to fuse edited and unedited concepts cleanly.
- Demonstration of complex editing of structure, shape, color and style in videos with a general pre-trained image diffusion model (Stable Diffusion v1.5).
- State-of-the-art performance compared to recent zero-shot video editing methods on quantitative metrics and user studies.
- Extensive ablation studies validating the importance of each component of the proposed approach.
To summarize, the key contributions are: 1) the overall InFusion framework for zero-shot, temporally consistent video editing; 2) the Inject and Attention Fusion modules; 3) superior performance over baselines on multiple metrics; and 4) detailed ablation studies.
Method Section
Here is a summary of the method section:
The paper proposes a framework called InFusion that consists of two main components:
- Inject
- Injects the difference between spatial features from source and edited prompts into decoder layers of the diffusion model. This highlights target edited concepts.
- Also injects attention features (keys, values) from source prompt to query source contents and retain unedited concepts.
- Attention Fusion
- Extracts masks from source and edited cross-attention maps.
- Uses the masks to take the edited foreground from the edit branch's self-attention and the unedited background from the source self-attention.
- Fuses cross-attention maps from the source and edited prompts so that the result contains the unedited and edited concepts, respectively.
Additionally:
- Initializes temporal attention weights with spatial attention to align frames.
- Uses the general pre-trained image diffusion model Stable Diffusion v1.5 without any tuning.
In summary, InFusion injects spatial and attention features into the diffusion process to highlight and edit concepts, then fuses attention maps to compose the final video containing source and edited content. The whole process is zero-shot using a fixed pre-trained model.
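The temporal-attention initialization mentioned above can be pictured as cloning the spatial self-attention weights into a newly added temporal attention layer, so that at initialization every frame is processed consistently. The sketch below is an assumption about how such an initialization might look in PyTorch, not the authors' code; the caller is assumed to reshape latents from (batch, frames, h*w, channels) to (batch*h*w, frames, channels) before applying the temporal block.

import copy
import torch.nn as nn

def init_temporal_from_spatial(spatial_attn: nn.Module) -> nn.Module:
    # Hypothetical sketch: clone the spatial self-attention block (same
    # q/k/v/out projections) and reuse it as temporal attention, applied
    # over the frame axis instead of the spatial axis.
    return copy.deepcopy(spatial_attn)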
High-Level Pseudo Code
Here is the high-level pseudo code for the InFusion method:
# Input: source video X0, source prompt Ps, edit prompt Pe
# Output: edited video Y0

# Invert the source video to get the starting noise
zT = invert(X0)
yt = zT

# Denoising loop
for t in timesteps:
    if t < S1:
        # Inject: spatial feature difference from the U-Net decoder
        ft_diff = ft_source - ft_edit
        yt = denoise(yt, Pe, ft_diff)
    elif t < S2:
        # Inject: source self-attention keys/values to query source content
        ks, vs = get_attention(X0, Ps)
        yt = denoise(yt, Pe, ks, vs)
    else:
        # Attention Fusion: mask-guided mixing of self/cross-attention
        Ms, Me = get_masks(Ps, Pe)
        st_fused = Me * st_edit + (1 - Ms) * st_source   # self-attention
        ct_fused = α * ct_edit + (1 - α) * ct_source     # cross-attention
        yt = denoise(yt, Pe, st_fused, ct_fused)

# Resulting edited video
Y0 = yt
This shows the overall flow - injecting spatial and attention features, fusing attention maps, and denoising in a loop to get the final edited video.
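The get_masks step above is only named, not defined. A plausible sketch, assuming the masks come from thresholding the cross-attention map of each concept token (the token index and threshold value below are illustrative assumptions), is:

import torch

def get_mask_from_cross_attention(cross_attn, token_idx, threshold=0.3):
    # cross_attn: (heads, h*w, tokens) cross-attention probabilities.
    # Average over heads for the chosen token, normalise to [0, 1],
    # and threshold to obtain a binary spatial mask.
    attn_map = cross_attn[:, :, token_idx].mean(dim=0)
    attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    return (attn_map > threshold).float()

Applied to the concept tokens of the source and edit prompts, this would yield Ms and Me for the fusion equations above.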
Detailed Pseudo Code
Here is a more detailed pseudo code for implementing the InFusion method:
# Inputs
X0 = source_video       # T frames
Ps = source_prompt
Pe = edit_prompt

# Hyperparameters
S1, S2 = inject_steps   # step thresholds for the two injection phases
T = total_steps         # total denoising steps
L = injected_layers     # number of decoder layers injected into

# Invert the source video to get the starting noise
zT = invert(X0)
z = zT

# Denoising loop
for t in range(T):
    # Cache source features once: spatial features and self-attention keys/values
    if t == 0:
        ft_source, kt_source, vt_source = encode(X0, Ps)

    if t < S1:
        # Inject the spatial feature difference into the first L decoder layers
        for l in range(L):
            ft_edit = encode(z, Pe)[l]
            ft_diff = ft_source[l] - ft_edit
            z = denoise(z, Pe, ft_diff)
    elif t < S2:
        # Inject source keys/values into self-attention
        z = denoise(z, Pe, kt_source, vt_source)
    else:
        # Attention fusion
        As_edit = cross_att(Pe)
        As_source = cross_att(Ps)
        Ms = threshold(As_source)
        Me = threshold(As_edit)
        ss_edit = self_att(Pe)
        ss_source = self_att(Ps)
        ss_fused = Me * ss_edit + (1 - Ms) * ss_source   # self-attention fusion
        cs_fused = α * As_edit + (1 - α) * As_source     # cross-attention fusion
        z = denoise(z, Pe, ss_fused, cs_fused)

# Generated video
Y0 = z
Here encode(), denoise(), cross_att(), self_att() and threshold() are placeholder functions over the diffusion model. This shows a step-by-step implementation with spatial/attention feature extraction, injection, mask generation and attention fusion.
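As a more concrete, but still hypothetical, illustration of the key/value injection branch, one way to hook it into Stable Diffusion is a custom attention processor. The sketch below assumes the AttnProcessor-style interface from Hugging Face diffusers (whose exact signature varies across versions) and is not the authors' implementation.

import torch

class KVInjectionProcessor:
    # Hypothetical sketch: during the source pass the processor stores
    # self-attention keys/values; during the edit pass it reuses them so
    # the editing branch can query the source content.

    def __init__(self):
        self.stored_kv = None   # (key, value) captured from the source pass
        self.inject = False     # set True while denoising with the edit prompt

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        query = attn.to_q(hidden_states)
        context = hidden_states if encoder_hidden_states is None else encoder_hidden_states
        key = attn.to_k(context)
        value = attn.to_v(context)

        if encoder_hidden_states is None:            # self-attention only
            if self.inject and self.stored_kv is not None:
                key, value = self.stored_kv          # reuse source keys/values
            else:
                self.stored_kv = (key, value)        # capture during the source pass

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        probs = attn.get_attention_scores(query, key, attention_mask)
        out = torch.bmm(probs, value)
        out = attn.batch_to_head_dim(out)
        out = attn.to_out[0](out)                    # output projection
        out = attn.to_out[1](out)                    # dropout
        return out

A processor like this could be registered on the chosen decoder layers via unet.set_attn_processor(...) and toggled between the source and edit passes.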