Key-Locked Rank One Editing for Text-to-Image Personalization

Authors: Yoad Tewel, Rinon Gal, Gal Chechik, Yuval Atzmon

Abstract: Text-to-image (T2I) models offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that “locks” new concepts’ cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual fidelity and textual alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, making it possible to portray personalized object interactions in unprecedented ways, even in one-shot settings.

What, Why and How

Here is a summary of the key points from this paper:

What: This paper presents Perfusion, a method based on “Key-Locked Rank One Editing” for personalizing text-to-image (T2I) models to generate novel visual concepts specified by a user.

Why: Current T2I personalization methods struggle with overfitting to the provided concept images, maintaining a concept’s identity across new contexts, combining multiple concepts in a single image, and large per-concept model sizes. Perfusion aims to address these challenges.

How:

  • The method builds on a key insight: disentangling the “Where” pathway (attention maps, driven by Keys) from the “What” pathway (image features, driven by Values) in the cross-attention layers of diffusion T2I models.
  • It proposes “key locking” to constrain the “Where” pathway and keep the attention maps from overfitting: the Keys for a concept are locked to those of its supercategory.
  • The “What” pathway is personalized via rank-1 updates to the value projections, which act as an extended latent space.
  • A gated version of rank-1 editing allows combining concepts and controlling their influence at inference time.
  • This results in a compact ~100KB model size per concept.

Key Results:

  • Perfusion achieves higher visual fidelity and textual alignment than baselines while being about five orders of magnitude smaller (~100KB per concept).
  • It enables novel combinations of concepts and inference-time control over the fidelity vs. alignment tradeoff.
  • Key locking yields better generalization and textual alignment than training the Keys directly.

In summary, Perfusion is a lightweight and effective approach for personalizing T2I models with novel visual concepts while avoiding common pitfalls. Key locking is central to its improved generalization.

Main Contributions

  • Key locking: a mechanism that locks a new concept’s cross-attention Keys to those of its superordinate category, preventing attention-map overfitting.
  • Gated rank-1 editing: enables inference-time control over a learned concept’s influence and allows combining multiple concepts.
  • A compact ~100KB per-concept model, about five orders of magnitude smaller than prior work, which can span different points on the fidelity-alignment Pareto front without retraining.

Method Section

Here is a summary of the method section:

The paper first explains the issue of overfitting in current T2I personalization methods, caused by attention maps that spread beyond the target concept’s visual scope.

To address this, the method proposes disentangling the “Where” (Keys) and “What” (Values) pathways of the cross-attention module in diffusion T2I models.
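To make this split concrete, here is a minimal NumPy sketch of a single cross-attention step (toy dimensions and names, not the paper’s code): the Keys determine the attention map, i.e. where each pixel attends, while the Values determine what content is written there.

import numpy as np

def softmax(x, axis=-1):
  x = x - x.max(axis=axis, keepdims=True)
  e = np.exp(x)
  return e / e.sum(axis=axis, keepdims=True)

def cross_attention(pixel_queries, text_encodings, W_K, W_V):
  # Keys drive the "Where" pathway; Values drive the "What" pathway
  K = text_encodings @ W_K.T
  V = text_encodings @ W_V.T
  attn_map = softmax(pixel_queries @ K.T / np.sqrt(K.shape[-1]))  # Where
  return attn_map @ V                                             # What

# Toy shapes: 4 pixel queries, 3 prompt tokens, 8-dim features
rng = np.random.default_rng(0)
Q, E = rng.normal(size=(4, 8)), rng.normal(size=(3, 8))
W_K, W_V = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
out = cross_attention(Q, E, W_K, W_V)  # shape (4, 8)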

It introduces “key locking” to constrain the “Where” pathway: the Keys for a concept are fixed to match those of its supercategory (e.g., the Keys for a specific teddy bear are locked to those of the general “teddy bear” category). This restricts the attention maps while letting the concept inherit the creative capabilities of its supercategory.
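A minimal sketch of this locking step (the helper name is hypothetical; in practice it happens inside every cross-attention layer): the personalized token’s Key is overwritten with the supercategory’s Key, while every other token keeps its original Key.

import numpy as np

def lock_concept_key(keys, concept_index, supercategory_key):
  # Replace only the personalized token's Key; leave the rest untouched
  locked = keys.copy()
  locked[concept_index] = supercategory_key
  return locked

# Toy example: 5 prompt tokens with 8-dim Keys; token 2 is the concept
rng = np.random.default_rng(1)
keys = rng.normal(size=(5, 8))
teddy_key = rng.normal(size=(8,))  # e.g. a precomputed Key for "teddy bear"
keys = lock_concept_key(keys, 2, teddy_key)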

Personalization is handled through the “What” pathway by treating the value projections as an extended latent space and optimizing them along with the concept’s word embedding.
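A hedged PyTorch-style sketch of the trainable state (layer count and dimensions are illustrative assumptions): one value vector per cross-attention layer plus the concept’s word embedding are optimized, while the diffusion model itself stays frozen.

import torch

n_layers, d_text, d_value = 16, 768, 320  # illustrative sizes

# Per-concept trainable parameters: the "extended latent space"
concept_values = torch.nn.ParameterList(
  torch.nn.Parameter(torch.zeros(d_value)) for _ in range(n_layers)
)
concept_embedding = torch.nn.Parameter(0.01 * torch.randn(d_text))

optimizer = torch.optim.Adam(list(concept_values) + [concept_embedding], lr=1e-4)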

These ideas are incorporated via a gated rank-1 update to the K and V projection weights. The gating provides inference-time control over the concept’s influence, enabling a runtime balance between visual fidelity and textual alignment. It also allows multiple concepts to be combined by selectively applying their updates.
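A minimal sketch of the gate itself (a sigmoid with a bias and temperature over a similarity score, as in the pseudocode below): it saturates near 1 for encodings aligned with the learned target input and near 0 otherwise, so unrelated prompt tokens pass through almost unchanged.

import numpy as np

def gate(similarity, bias, temperature):
  # Soft gate in [0, 1]: ~1 when similarity clears the bias, ~0 otherwise
  return 1.0 / (1.0 + np.exp(-(similarity - bias) / temperature))

print(gate(0.9, bias=0.5, temperature=0.1))  # ~0.98: concept token, update applied
print(gate(0.1, bias=0.5, temperature=0.1))  # ~0.02: unrelated token, barely touched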

The rank-1 update is applied end-to-end during training to avoid a train-test mismatch. Because the text encoder is contextual, the target input vector is estimated online to account for the influence of the other words in the prompt.
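A one-line sketch of that online estimate (the 0.99/0.01 decay below is an assumption for illustration): the concept token’s encoding shifts with each prompt, so the target input is tracked as an exponential moving average.

def update_target_input(target_input, concept_encoding, decay=0.99):
  # EMA of the concept token's context-dependent encoding
  return decay * target_input + (1.0 - decay) * concept_encoding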

For inference, the gating bias and temperature parameters control a concept’s impact. Multiple concepts can be combined using the generalized gated update equation.
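A hedged sketch of combining independently trained concepts (the data layout is hypothetical): each concept contributes its own sigmoid-gated term, so a given token only triggers the updates of the concept it matches.

import numpy as np

def combined_update(base_projection, similarities, concepts):
  # Sum the gated contributions of several independently trained concepts
  out = base_projection.copy()
  for sim, c in zip(similarities, concepts):
    g = 1.0 / (1.0 + np.exp(-(sim - c["bias"]) / c["temperature"]))
    out = out + g * c["direction"]  # the concept's locked Key or learned Value
  return out

# Two concepts; the token matches concept 0 (sim 0.9) but not concept 1 (sim 0.1)
rng = np.random.default_rng(2)
base = rng.normal(size=(8,))
concepts = [{"direction": rng.normal(size=(8,)), "bias": 0.5, "temperature": 0.1}
            for _ in range(2)]
out = combined_update(base, similarities=[0.9, 0.1], concepts=concepts)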

In summary, key locking, gated rank-1 editing, and end-to-end training provide an efficient approach to personalize T2I models while improving generalization and enabling new capabilities.

High-Level Pseudo Code

Here is the high-level pseudo code for the key components of the method:

# Key Locking
def key_lock(concept_keys, supercategory_keys):
  # The concept's Keys are simply replaced by its supercategory's Keys
  return supercategory_keys

# Gated Rank-1 Update
def rank1_update(input, concept_index, supercategory_keys,
                 concept_values, gate_params, target_input):

  # Get the encoding of the concept token
  concept_encoding = input[concept_index]

  # Online (moving-average) estimation of the target input
  target_input = 0.99 * target_input + 0.01 * concept_encoding

  # Lock keys to the supercategory
  locked_keys = key_lock(concept_encoding, supercategory_keys)

  # Project input through the original key matrix
  key_projections = original_key_matrix @ input

  # Null the target-input component of the key projections
  null_keys = key_projections - projection_on(key_projections, target_input)

  # Gate: how strongly does the input align with the target input?
  similarity = projection_on(input, target_input)
  gate = sigmoid((similarity - gate_params[0]) / gate_params[1])  # [bias, temperature]

  # Gated key update
  updated_keys = null_keys + gate * locked_keys

  # Project input through the original value matrix; same gated update
  value_projections = original_value_matrix @ input
  updated_values = value_projections + gate * concept_values

  return updated_keys, updated_values, target_input

Key locking fixes the concept’s Keys to those of its supercategory. The gated rank-1 update selectively adds the concept’s Values based on each encoding’s similarity to the target input. The same computation is used end-to-end during training; at inference, the gate parameters control the concept’s influence.

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the key components of this method:

# Constants and learned parameters
ORIGINAL_KEY_MATRIX    # Frozen key projection matrix
ORIGINAL_VALUE_MATRIX  # Frozen value projection matrix
INVERSE_COV_MAT        # Precomputed inverse covariance of text encodings
SUPERCATEGORY_KEYS     # Precomputed supercategory keys
CONCEPT_VALUES         # Learned concept value projections
GATE_BIAS              # Sigmoid gate bias
GATE_TEMP              # Sigmoid gate temperature

# Online estimate of the target input, initialized from the concept's encoding
target_input = initial_concept_encoding

# Training loop
for prompt, image in dataset:

  # Encode the prompt and take the concept token's encoding
  encodings = text_encoder(prompt)
  concept_encoding = encodings[concept_index]

  # Online (moving-average) estimation of the target input
  target_input = 0.99 * target_input + 0.01 * concept_encoding

  # Project encodings through the frozen key matrix
  key_projections = ORIGINAL_KEY_MATRIX @ encodings

  # Energy of the target input under the text-encoding statistics
  target_input_energy = target_input.T @ INVERSE_COV_MAT @ target_input

  # Similarity of each token encoding to the target input (one score per token)
  similarities = target_input.T @ INVERSE_COV_MAT @ encodings

  # Null the target-input component of the key projections
  # (broadcast: one scalar weight per token)
  null_keys = key_projections - (similarities / target_input_energy) * (ORIGINAL_KEY_MATRIX @ target_input)

  # Gated update: the gate opens only where an encoding aligns with the target
  gates = sigmoid((similarities / target_input_energy - GATE_BIAS) / GATE_TEMP)
  updated_keys = null_keys + gates * SUPERCATEGORY_KEYS

  # The value pathway gets the same gated rank-1 update
  value_projections = ORIGINAL_VALUE_MATRIX @ encodings
  updated_values = value_projections + gates * CONCEPT_VALUES

  # Diffusion reconstruction loss, optimized end-to-end
  loss = reconstruction_loss(updated_keys, updated_values, image)

  optimize(loss)

# Inference
def generate_image(prompt):

  # Apply the same gated rank-1 update inside each cross-attention layer
  updated_keys, updated_values, _ = rank1_update(
      text_encoder(prompt), concept_index, SUPERCATEGORY_KEYS,
      CONCEPT_VALUES, (GATE_BIAS, GATE_TEMP), target_input)

  image = generator(updated_keys, updated_values)

  return image

The key steps are:

  1. Online estimation of target input vector
  2. Key locking to map concept keys to supercategory
  3. Gated rank-1 update to add concept values
  4. End-to-end training with reconstruction loss
  5. At inference, generate images using updated keys and values

The gated update selectively applies the concept projections only when the input encoding aligns with the target input.
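To illustrate the inference-time control, a quick numeric check (values are illustrative): raising the gate bias makes the gate open less readily for the same similarity, weakening the concept’s influence, which is how a single trained model can trade visual fidelity against textual alignment.

import numpy as np

def gate(similarity, bias, temperature=0.1):
  return 1.0 / (1.0 + np.exp(-(similarity - bias) / temperature))

# Same token similarity, different biases: higher bias -> weaker concept influence
for bias in (0.3, 0.5, 0.7):
  print(f"bias={bias}: gate={gate(0.6, bias):.3f}")
# bias=0.3: gate=0.953, bias=0.5: gate=0.731, bias=0.7: gate=0.269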