CLIP Brings Better Features to Visual Aesthetics Learners
Authors: Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li
Abstract: The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is an ideal application scenario for such methods due to its subjective and expensive labeling procedure. In this work, a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm, CSKD, is proposed. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via a feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetics learner for both student and teacher. In the second phase, the unlabeled data is also used in semi-supervised IAA learning to further boost student model performance in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we observe an alleviation of the feature collapse issue, which in turn demonstrates the necessity of feature alignment rather than training directly on the CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.
What, Why and How
This paper proposes a two-phase semi-supervised learning approach called CSKD to improve image aesthetics assessment.
What:
- CSKD has two phases: CLIP-based feature alignment (CFA) and semi-supervised knowledge distillation (SKD).
- In CFA, they align the features of a visual encoder with those of the CLIP image encoder using unlabeled data to obtain better aesthetic features.
- In SKD, they first train a teacher model in a supervised manner, then train a student model on labeled data plus the teacher's pseudo labels for unlabeled data.
Why:
- CLIP features contain rich semantic information that is useful for aesthetics assessment. Aligning with CLIP alleviates the feature collapse issue in visual encoders.
- Using unlabeled data in a semi-supervised manner improves the training efficiency and performance of the student model.
How:
- In CFA, they minimize a cosine-similarity loss between visual encoder features and CLIP features on unlabeled data.
- In SKD, the student model is trained with squared EMD loss against ground-truth labels on labeled data and teacher pseudo labels on unlabeled data.
In summary, CSKD leverages CLIP and semi-supervised learning to obtain an improved aesthetic assessment model efficiently. Alignment with CLIP gives better features while semi-supervised distillation transfers knowledge effectively.
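Since the squared EMD loss appears throughout the pseudo code below but is never defined there, here is a minimal sketch of it, assuming 10-bin AVA-style score distributions (the exact reduction used in the paper may differ):
import torch

def squared_emd_loss(pred, target):
    # pred, target: (batch, num_bins) probability distributions over
    # ordered aesthetic score bins (e.g. scores 1-10 for AVA)
    # Squared EMD is the mean squared difference between the two
    # cumulative distribution functions
    cdf_pred = torch.cumsum(pred, dim=-1)
    cdf_target = torch.cumsum(target, dim=-1)
    return ((cdf_pred - cdf_target) ** 2).mean()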
Main Contributions
The main contributions of this paper are:
- They propose a unified and generalizable unsupervised CLIP-based feature alignment (CFA) method to improve visual encoder pre-training for image aesthetics tasks.
- They present a semi-supervised knowledge distillation (SKD) protocol that utilizes both labeled and unlabeled data to effectively transfer knowledge from the teacher to the student model.
- They show that both teacher and student models distilled from CLIP via CFA perform better when trained with large-scale unlabeled data in SKD.
- They conduct extensive experiments on multiple image aesthetics assessment benchmarks, such as AVA, AADB, and PARA, and show state-of-the-art results, outperforming previous methods.
- They provide in-depth analysis and ablation studies to demonstrate the benefits of CFA and SKD over baselines without these components.
In summary, the core contributions are presenting the CFA and SKD frameworks and showing their effectiveness for aesthetic assessment through comprehensive experiments and analysis. The semi-supervised approach helps efficiently learn improved aesthetic models.
Method Section
Here is a summary of the method section from the paper:
The proposed CSKD pipeline has two main phases:
CLIP-based Feature Alignment (CFA):
- Use a multi-source unlabeled dataset (ImageNet + OpenImages)
- Take a pre-trained backbone (e.g., Swin Transformer)
- Add an MLP projector head to the backbone
- Align features between backbone+projector and the CLIP image encoder using a cosine-similarity loss
- This aligns backbone features with rich CLIP representations
Semi-supervised Knowledge Distillation (SKD):
- Take a CFA backbone as the teacher and add a task-specific MLP prediction head
- Fine-tune the teacher on labeled aesthetics data (e.g., AVA, AADB)
- Take another CFA backbone as the student and add a prediction head
- Train the student on ground-truth labels for labeled data and teacher pseudo labels for unlabeled data
- Use squared EMD loss for both ground-truth and pseudo labels
- This leverages unlabeled data to improve student training
In summary, CFA enhances backbones via alignment with CLIP features using unlabeled data. SKD transfers knowledge from CFA teacher to student efficiently using labeled and unlabeled data in a semi-supervised manner.
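For concreteness, a minimal sketch of obtaining the frozen CLIP image features that serve as the alignment target, using the OpenAI clip package (the paper does not pin down the exact CLIP variant here, so ViT-B/16 is an assumption):
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/16", device=device)
clip_model.eval()  # CLIP stays frozen throughout CFA

@torch.no_grad()
def clip_image_features(images):
    # images: a batch already transformed by CLIP's preprocess pipeline
    feats = clip_model.encode_image(images)
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize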
High-Level Pseudo Code
Here is the high-level pseudo code for the CSKD method:
# Phase 1: CLIP-based Feature Alignment (CFA)
# Load pre-trained teacher_backbone, student_backbone
# Add MLP projection heads to backbones
# Load unlabeled dataset (ImageNet + OpenImages)
for image in unlabeled_dataset:
    teacher_features = teacher_backbone(image)
    student_features = student_backbone(image)
    # Get CLIP image features
    clip_features = clip_encoder(image)
    # Compute cosine similarity losses
    loss1 = 1 - cosine_sim(teacher_features, clip_features)
    loss2 = 1 - cosine_sim(student_features, clip_features)
    # Update teacher and student backbones
    optimizer.zero_grad()
    (loss1 + loss2).backward()
    optimizer.step()
# Phase 2: Semi-supervised Knowledge Distillation (SKD)
# Load labeled aesthetics dataset
# Load unlabeled dataset
# Add task prediction head to teacher backbone
# Fine-tune teacher model on labeled dataset
# Add task prediction head to student backbone
for (image, label), unlabeled_image in zip(labeled_dataset, unlabeled_dataset):
    # Supervised loss on labeled data
    student_pred = student_model(image)
    sup_loss = EMDLoss(student_pred, label)
    # Get teacher pseudo label for the unlabeled image
    pseudo_label = teacher_model(unlabeled_image)
    # Distillation loss on unlabeled data
    student_pred_unlabeled = student_model(unlabeled_image)
    distill_loss = EMDLoss(student_pred_unlabeled, pseudo_label)
    # Overall loss
    loss = sup_loss + distill_loss
    # Update student model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This outlines the core steps of both CFA and SKD phases: feature alignment with CLIP, followed by semi-supervised training on ground-truth labels for labeled data and teacher pseudo labels for unlabeled data.
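Note that the labeled and unlabeled batches are consumed jointly in the zipped loop, so every student update combines a supervised term and a distillation term. A weighting coefficient on the distillation term is a common design knob in such setups; the sketch above uses an unweighted sum for simplicity.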
Detailed Pseudo Code
Here is a more detailed pseudo code to implement the CSKD method:
import torch
from torch import nn
from torch.nn import functional as F
# Models (placeholder classes assumed to return pooled feature vectors)
teacher_backbone = SwinTransformer()  # pre-trained on ImageNet, 1024-d features
student_backbone = MobileNetV2()  # pre-trained on ImageNet, 1280-d features
clip_encoder = CLIPEncoder()  # off-the-shelf CLIP image encoder, kept frozen
# MLP projection heads mapping backbone features to CLIP's 512-d space
teacher_proj = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 512))
student_proj = nn.Sequential(nn.Linear(1280, 512), nn.ReLU(), nn.Linear(512, 512))
# Optimizers must cover the backbones and their projection heads
teacher_optim = torch.optim.Adam(
    list(teacher_backbone.parameters()) + list(teacher_proj.parameters()), lr=1e-4)
student_optim = torch.optim.Adam(
    list(student_backbone.parameters()) + list(student_proj.parameters()), lr=1e-4)
# Phase 1: CLIP-based Feature Alignment
for images in unlabeled_dataset:
    # Forward pass through backbone + projector
    teacher_feats = teacher_proj(teacher_backbone(images))
    student_feats = student_proj(student_backbone(images))
    # Frozen CLIP target features (CLS token)
    with torch.no_grad():
        clip_feats = clip_encoder(images)[:, 0, :]
    # Cosine-similarity alignment losses, averaged over the batch
    teacher_loss = (1 - F.cosine_similarity(teacher_feats, clip_feats)).mean()
    student_loss = (1 - F.cosine_similarity(student_feats, clip_feats)).mean()
    # Update both backbones and projectors
    loss = teacher_loss + student_loss
    teacher_optim.zero_grad()
    student_optim.zero_grad()
    loss.backward()
    teacher_optim.step()
    student_optim.step()
# Phase 2: Semi-supervised Knowledge Distillation
# Add task-specific prediction heads (10 aesthetic score bins, as in AVA);
# a softmax turns the logits into a score distribution for the EMD loss
teacher_head = nn.Sequential(nn.Linear(512, 10), nn.Softmax(dim=-1))
student_head = nn.Sequential(nn.Linear(512, 10), nn.Softmax(dim=-1))
teacher_model = nn.Sequential(teacher_backbone, teacher_proj, teacher_head)
student_model = nn.Sequential(student_backbone, student_proj, student_head)
# Fresh optimizers that also cover the new heads
teacher_optim = torch.optim.Adam(teacher_model.parameters(), lr=1e-4)
student_optim = torch.optim.Adam(student_model.parameters(), lr=1e-4)
# Fine-tune teacher on the labeled dataset
# (EMDLoss: squared earth mover's distance between score
# distributions, e.g. the squared_emd_loss sketch above)
for images, labels in labeled_dataset:
    preds = teacher_model(images)
    loss = EMDLoss(preds, labels)
    teacher_optim.zero_grad()
    loss.backward()
    teacher_optim.step()
# Student training on labeled and unlabeled batches jointly
teacher_model.eval()
for (x, y), x_u in zip(labeled_dataset, unlabeled_dataset):
    # Supervised loss against ground-truth labels
    sup_loss = EMDLoss(student_model(x), y)
    # Distillation loss against teacher pseudo labels
    with torch.no_grad():
        pseudo_label = teacher_model(x_u)
    distill_loss = EMDLoss(student_model(x_u), pseudo_label)
    loss = sup_loss + distill_loss
    student_optim.zero_grad()
    loss.backward()
    student_optim.step()
This shows the model definitions, training loops and key steps for CFA feature alignment and SKD in more detail.