DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
Authors: Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola
Abstract: Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper introduces a new perceptual image similarity dataset called NIGHTS, consisting of 20k synthetic image triplets with human judgments indicating which of two variations in each triplet is more similar to a reference image.
- The images showcase mid-level variations in aspects like pose, perspective, color, and number of objects, filling a gap between datasets focused on low-level pixel differences and those focused on high-level categorical differences.
- The paper develops a new similarity metric called DreamSim, tuned on this dataset to better align with human perceptual judgments. DreamSim is an ensemble of DINO, CLIP, and OpenCLIP features (a usage sketch appears at the end of this section).
Why:
- Existing similarity metrics operate at the pixel/patch level or at the categorical level, but fail to capture mid-level notions of similarity. The new dataset and metric aim to assess images more holistically.
- Humans can judge some image similarities automatically and consistently, suggesting a “cognitively impenetrable” sense of perception that algorithms should try to match. The new dataset aims to capture this.
How:
- The dataset is generated by prompting Stable Diffusion with various object/scene categories, producing triplets of related images.
- Iterative filtering is used to collect human judgments and retain only triplets with unanimous votes.
- DreamSim concatenates DINO, CLIP, and OpenCLIP features and fine-tunes them on the dataset with a triplet loss, using LoRA for parameter-efficient adaptation.
- Experiments show DreamSim aligns closely with human judgments and generalizes well to real images on retrieval and reconstruction tasks. Analysis examines which visual attributes it focuses on.
In summary, the paper introduces a mid-level similarity dataset and metric aiming to better capture human perceptual notions of image similarity beyond low-level pixel or high-level categorical differences.
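For orientation, the authors also released DreamSim as a standalone package. Below is a minimal usage sketch based on the project’s README at the time of writing; the exact function signature (including the device keyword) may differ across versions, so treat it as illustrative rather than authoritative.

import torch
from PIL import Image
from dreamsim import dreamsim  # pip install dreamsim (assumed package name)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained metric and its preprocessing transform.
model, preprocess = dreamsim(pretrained=True, device=device)

img_a = preprocess(Image.open("image_a.png")).to(device)
img_b = preprocess(Image.open("image_b.png")).to(device)

# Lower distance = more perceptually similar according to DreamSim.
distance = model(img_a, img_b)
print(float(distance))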
Main Contributions
Here are the main contributions of the paper:
- A new image similarity dataset (NIGHTS) of 20k synthetic image triplets with human judgments over mid-level variations.
- A new perceptual metric (DreamSim), tuned on this dataset, that aligns well with human notions of similarity. It combines features from DINO, CLIP, and OpenCLIP.
- Analysis showing that DreamSim focuses heavily on foreground objects and semantic content, while being more sensitive to color and layout than prior embedding-based metrics.
- Demonstrations of DreamSim’s effectiveness on image retrieval and reconstruction tasks with real images.
- Experiments evaluating the alignment of various metrics and models with human judgments on the new dataset; DreamSim achieves 96.16% accuracy.
- Analysis ablating image attributes like color and foreground vs. background to understand their effects on DreamSim’s judgments.
- Visualizations of attention maps indicating DreamSim’s focus on foreground objects.
In summary, the main contributions are: (1) the new NIGHTS dataset, (2) the DreamSim perceptual metric, (3) extensive analysis and experiments with this dataset and metric, and (4) demonstrations of improved performance on downstream applications.
Method Section
Here is a summary of the method section from the paper:
Data Collection:
- Images are generated by prompting Stable Diffusion with various object/scene categories (e.g. “an image of a cat”), creating triplets of related images with mid-level variations (a generation sketch follows this list).
- 100k initial triplets are filtered over 10 rounds of human judgments to retain only unanimous decisions, resulting in 20k final triplets.
- Two perceptual tests are used: 2AFC (two-alternative forced choice) and JND (just-noticeable difference). In 2AFC, humans select which of the two variations in a triplet is more similar to the reference; in JND, humans judge whether a distorted image is perceived as identical to the reference.
- The iterative filtering and perceptual tests aim to collect judgments that are automatic, stable across observers, and cognitively impenetrable.
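As a concrete illustration of the generation step, the sketch below samples one triplet from a single prompt using the Hugging Face diffusers library. The checkpoint, prompt template, and sampling settings are assumptions for illustration; the paper’s exact generation configuration may differ.

import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; the paper's exact Stable Diffusion version may differ.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

category = "cat"
prompt = f"An image of a {category}"

# Three independent samples from the same prompt:
# one serves as the reference, the other two as the candidate "distortions".
reference, distortion_0, distortion_1 = [pipe(prompt).images[0] for _ in range(3)]
reference.save("ref.png")
distortion_0.save("dist_0.png")
distortion_1.save("dist_1.png")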
Model Training:
- Existing metrics such as LPIPS, DISTS, DINO, and CLIP are evaluated on the new dataset; DINO performs best out of the box.
- Models are further tuned using either an MLP head on frozen features or Low-Rank Adaptation (LoRA); LoRA performs better (a LoRA setup sketch appears after the method summary below).
- Features from DINO, CLIP, and OpenCLIP are concatenated to create an ensemble model that is tuned end-to-end.
- A triplet (hinge) loss is used as the training objective, pulling the human-judged more similar image closer to the reference and pushing the less similar one away (a sketch of this loss follows the list).
- The final model, DreamSim, combines feature ensembling with LoRA tuning and achieves 96.16% accuracy on the dataset.
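To make the objective concrete, here is a minimal sketch of an ensemble embedding and a hinge-style triplet loss on cosine distances. The margin value and the backbone wrappers are illustrative assumptions, not the paper’s exact hyperparameters.

import torch
import torch.nn.functional as F

def ensemble_embed(backbones, img):
    # Concatenate L2-normalized features from the DINO, CLIP, and OpenCLIP backbones.
    feats = [F.normalize(backbone(img), dim=-1) for backbone in backbones]
    return torch.cat(feats, dim=-1)

def perceptual_distance(backbones, img_a, img_b):
    # DreamSim-style distance: 1 - cosine similarity of the ensemble embeddings.
    e_a = ensemble_embed(backbones, img_a)
    e_b = ensemble_embed(backbones, img_b)
    return 1.0 - F.cosine_similarity(e_a, e_b, dim=-1)

def hinge_triplet_loss(backbones, ref, x_0, x_1, y, margin=0.05):
    # y = +1 if humans judged x_1 more similar to ref, -1 if they chose x_0.
    d_0 = perceptual_distance(backbones, ref, x_0)
    d_1 = perceptual_distance(backbones, ref, x_1)
    # Penalize whenever the human-preferred image is not closer by at least `margin`.
    return torch.clamp(margin - y * (d_0 - d_1), min=0).mean()

At test time, the 2AFC prediction is simply whichever of x_0 and x_1 has the smaller distance to the reference.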
In summary, the key aspects of the method are using Stable Diffusion to generate a filtered perceptual dataset, comparing existing metrics, tuning an ensemble metric on this dataset, and analyzing the results.
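The LoRA tuning mentioned above could be set up with a parameter-efficient fine-tuning library such as peft; the sketch below wraps a DINO ViT backbone as an example. The rank, scaling, dropout, and target modules are illustrative values, not necessarily those used in the paper.

import torch
from peft import LoraConfig, get_peft_model

# Example backbone: DINO ViT-B/16 from torch.hub (one of the three ensemble members).
vit = torch.hub.load("facebookresearch/dino:main", "dino_vitb16")

lora_config = LoraConfig(
    r=16,                    # rank of the low-rank update
    lora_alpha=16,           # scaling factor for the update
    lora_dropout=0.1,
    bias="none",
    target_modules=["qkv"],  # fused attention projection layers in this ViT
)
vit_lora = get_peft_model(vit, lora_config)
vit_lora.print_trainable_parameters()  # only the LoRA adapters are trainable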
High-Level Pseudo Code
Here is high-level pseudocode for the key aspects of the paper:
# Generate triplets: sample three images from the same prompt
for category in categories:
    prompt = f"An image of a {category}"
    x = generate_image(prompt)    # reference
    x_0 = generate_image(prompt)  # distortion 0
    x_1 = generate_image(prompt)  # distortion 1
    triplets.append((x, x_0, x_1))

# Filter triplets: collect votes over 10 rounds, keep only unanimous ones
votes = {t: [] for t in triplets}
for round_idx in range(10):
    for triplet in triplets:
        votes[triplet].append(collect_human_judgement(triplet))
filtered_triplets = [t for t in triplets if unanimous_vote(votes[t])]

# Train model: ensemble of backbones tuned with a triplet loss
model = Ensemble(DINO, CLIP, OpenCLIP)
for triplet in filtered_triplets:
    x, x_0, x_1 = triplet
    y = human_label(triplet)  # which distortion humans judged more similar
    loss = triplet_loss(model(x), model(x_0), model(x_1), y)
    optimize(loss)

# Evaluate: 2AFC accuracy against held-out human judgments
correct = 0
for triplet in test_set:
    x, x_0, x_1 = triplet
    y = human_label(triplet)
    y_hat = model.predict_most_similar(x, x_0, x_1)
    correct += (y == y_hat)
accuracy = correct / len(test_set)
This summarizes the key steps of dataset generation, filtering, ensemble model training, and evaluation. The main components are leveraging a generative model to create triplets, iterative filtering to retain unambiguous ones, training an ensemble metric on these triplets, and assessing human-model alignment.
Detailed Pseudo Code
Here is a more detailed pseudocode outline to implement the key aspects of the paper:
# Generate triplets
categories = get_categories()  # e.g. ImageNet, CIFAR, and other category sources
triplets = []
for category in categories:
    for i in range(num_triplets_per_category):
        prompt = f"An image of a {category}"
        x = generate_image(prompt)    # reference
        x_0 = generate_image(prompt)  # distortion 0
        x_1 = generate_image(prompt)  # distortion 1
        triplets.append((x, x_0, x_1))

# Filter triplets: accumulate human votes over multiple rounds
triplet_votes = {t: [] for t in triplets}
for round_idx in range(10):
    for triplet in triplets:
        for j in range(num_votes_per_round):
            # Show the triplet to a human rater (2AFC judgment)
            y = human_judges_similarity(triplet)
            triplet_votes[triplet].append(y)
    # After each round, keep only triplets whose votes so far are unanimous
    triplets = [t for t in triplets
                if all(v == triplet_votes[t][0] for v in triplet_votes[t])]

# Create the final dataset of (triplet, label) pairs with enough votes
dataset = [(t, triplet_votes[t][0])
           for t in triplets
           if len(triplet_votes[t]) >= min_votes]

# Train model
model = Ensemble(DINO, CLIP, OpenCLIP)
for epoch in range(num_epochs):
    for (x, x_0, x_1), y_true in dataset:
        # Embed all three images and compute a triplet loss that pulls the
        # human-preferred image closer to the reference than the other one
        loss = triplet_loss(model(x), model(x_0), model(x_1), y_true)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Evaluate: 2AFC accuracy on the held-out test set
correct = 0
for (x, x_0, x_1), y_true in test_set:
    y_pred = model.predict_most_similar(x, x_0, x_1)
    correct += int(y_true == y_pred)
accuracy = correct / len(test_set)
This shows the dataset generation, filtering, training, and evaluation steps in more detail, including multiple iterations of human judgments, accumulating votes, creating the final dataset, model training loop, and validation.