CLIP Brings Better Features to Visual Aesthetics Learners

Authors: Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li

Abstract: The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to its subjective and expensive labeling procedure. In this work, a unified and flexible two-phase CLIP-based Semi-supervised Knowledge Distillation paradigm is proposed, namely CSKD. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via a feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetics learner for both student and teacher. In the second phase, the unlabeled data is also utilized in semi-supervised IAA learning to further boost student model performance when applied in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we observe an alleviation of the feature collapse issue, which in turn showcases the necessity of feature alignment instead of training directly on top of the CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.

What, Why and How

This paper proposes a two-phase semi-supervised learning approach called CSKD to improve image aesthetics assessment.

What:

  • CSKD has two phases: CLIP-based feature alignment (CFA) and semi-supervised knowledge distillation (SKD).
  • In CFA, they align the features between a visual encoder and the CLIP image encoder using unlabeled data to obtain better aesthetic features.
  • In SKD, they first train a teacher model in a supervised manner, then train a student model on labeled and unlabeled data using the teacher's pseudo labels.

Why:

  • CLIP features contain rich semantic information that is useful for aesthetics assessment. Aligning with CLIP can alleviate the feature collapse issue in visual encoders.
  • Using unlabeled data in a semi-supervised manner improves the training efficiency and performance of the student model.

How:

  • In CFA, they minimize a cosine similarity loss (1 - cosine similarity) between visual encoder features and CLIP image features on unlabeled data.
  • In SKD, the student model is trained on ground-truth labels from labeled data and teacher pseudo labels on unlabeled data, using a squared EMD loss (see the sketch after this list).
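
Since the squared EMD loss drives both the supervised and the distillation objectives, here is a minimal sketch of the NIMA-style squared Earth Mover's Distance over score distributions; the paper's exact formulation may differ, and the function name emd_loss is ours:

import torch

def emd_loss(pred, target, r=2):
  # pred, target: (batch, num_bins) probability distributions over
  # aesthetic score bins (e.g., 10 bins for AVA's 1-10 ratings)
  cdf_pred = torch.cumsum(pred, dim=-1)
  cdf_target = torch.cumsum(target, dim=-1)
  # r = 2 yields the "squared" EMD commonly used for IAA
  emd = torch.mean(torch.abs(cdf_pred - cdf_target) ** r, dim=-1) ** (1.0 / r)
  return emd.mean()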

In summary, CSKD leverages CLIP and semi-supervised learning to obtain an improved aesthetics assessment model efficiently: alignment with CLIP gives better features, while semi-supervised distillation transfers knowledge effectively.

Main Contributions

The main contributions of this paper are:

  • They propose a unified and generalizable unsupervised CLIP-based feature alignment (CFA) method to improve visual encoder pre-training for image aesthetics tasks.

  • They present a semi-supervised knowledge distillation (SKD) protocol that utilizes both labeled and unlabeled data to effectively transfer knowledge from teacher to student model.

  • They show that both teacher and student models distilled from CLIP via CFA perform better when trained with large-scale unlabeled data in SKD.

  • They conduct extensive experiments on multiple image aesthetics assessment benchmarks, such as AVA, AADB, and PARA, and show state-of-the-art results, outperforming previous methods.

  • They provide in-depth analysis and ablation studies to demonstrate the benefits of CFA and SKD over baselines without these components.

In summary, the core contributions are presenting the CFA and SKD frameworks and showing their effectiveness for aesthetic assessment through comprehensive experiments and analysis. The semi-supervised approach helps efficiently learn improved aesthetic models.

Method Section

Here is a summary of the method section from the paper:

The proposed CSKD pipeline has two main phases:

Clip-based Feature Alignment (CFA):

  • Use a multi-source unlabeled dataset (ImageNet + OpenImages)
  • Take a pre-trained backbone (e.g. Swin Transformer)
  • Add an MLP projector head to backbone
  • Align features between backbone+projector and CLIP image encoder using cosine similarity loss
  • This aligns backbone features with rich CLIP representations (a CLIP feature-extraction sketch follows this list)
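
For concreteness, below is a minimal sketch of extracting frozen CLIP image features as the alignment target. It assumes the open_clip package and a ViT-B/32 checkpoint; the paper's summary does not specify which CLIP implementation or variant is used:

import torch
import open_clip

# Load an off-the-shelf CLIP model (the variant here is an assumption)
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
clip_model.eval()  # CLIP stays frozen throughout feature alignment

@torch.no_grad()
def clip_image_features(images):
  # images: (batch, 3, H, W), already preprocessed with `preprocess`
  feats = clip_model.encode_image(images)          # (batch, embed_dim)
  return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize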

Semi-supervised Knowledge Distillation (SKD):

  • Take the CFA-aligned backbone as the teacher and add a task-specific MLP prediction head
  • Fine-tune the teacher on labeled aesthetics data (e.g., AVA, AADB)
  • Take another CFA-aligned backbone as the student and add a prediction head
  • Train the student on ground-truth labels from labeled data and teacher pseudo labels on unlabeled data
  • Use the squared EMD loss for both ground-truth and pseudo labels
  • Leveraging unlabeled data improves student training (see the loader sketch after this list)
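
How labeled and unlabeled batches are interleaved is an implementation detail; one common scheme (an assumption on our part, not the paper's stated recipe) cycles the smaller labeled loader alongside the larger unlabeled loader so that every step sees one batch of each:

from itertools import cycle
from torch.utils.data import DataLoader

labeled_loader = DataLoader(labeled_dataset, batch_size=32, shuffle=True)
unlabeled_loader = DataLoader(unlabeled_dataset, batch_size=32, shuffle=True)

# One SKD step consumes a labeled batch (x, y) and an unlabeled batch u
for (x, y), u in zip(cycle(labeled_loader), unlabeled_loader):
  ...  # sup_loss on (x, y), distill_loss on u, then update the student

Note that itertools.cycle replays the first epoch's batch order; re-creating the labeled iterator each epoch avoids this if fresh shuffling matters.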

In summary, CFA enhances backbones via alignment with CLIP features using unlabeled data. SKD transfers knowledge from CFA teacher to student efficiently using labeled and unlabeled data in a semi-supervised manner.

High-Level Pseudo Code

Here is the high-level pseudo code for the CSKD method:

# Phase 1: CLIP-based Feature Alignment (CFA)
 
# Load pre-trained teacher_backbone, student_backbone 
# Add MLP projection head to backbones
 
# Load unlabeled dataset (ImageNet + OpenImages)
 
for image in unlabeled_dataset:
 
  teacher_features = teacher_backbone(image) 
  student_features = student_backbone(image)
  
  # Get CLIP image features (CLIP encoder stays frozen)
  with torch.no_grad():
    clip_features = clip_encoder(image)
 
  # Compute cosine similarity loss
  loss1 = 1 - cosine_sim(teacher_features, clip_features)
  loss2 = 1 - cosine_sim(student_features, clip_features)
  
  # Update teacher and student backbones
  optimizer.zero_grad()
  (loss1 + loss2).backward()
  optimizer.step()
 
# Phase 2: Semi-supervised Knowledge Distillation (SKD)
 
# Load labeled aesthetics dataset
# Load unlabeled dataset 
 
# Add task prediction head to teacher backbone 
# Fine-tune teacher model on labeled dataset
 
# Add task prediction head to student backbone
 
# Jointly iterate over labeled and unlabeled batches
for (image, label), u_image in zip(labeled_dataset, unlabeled_dataset):

  # Supervised loss on the labeled batch
  student_pred = student_model(image)
  sup_loss = EMDLoss(student_pred, label)

  # Teacher pseudo label on the unlabeled batch (teacher is frozen)
  with torch.no_grad():
    pseudo_label = teacher_model(u_image)

  # Distillation loss on the unlabeled batch
  student_pred_u = student_model(u_image)
  distill_loss = EMDLoss(student_pred_u, pseudo_label)

  # Overall loss
  loss = sup_loss + distill_loss

  # Update student model only
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

This outlines the core steps in both the CFA and SKD phases: feature alignment with CLIP, followed by semi-supervised learning on labeled data with ground-truth labels and unlabeled data with teacher pseudo labels.

Detailed Pseudo Code

Here is a more detailed pseudo code to implement the CSKD method:

import torch
from torch import nn
from torch.nn import functional as F
 
# Models
teacher_backbone = SwinTransformer() # Pretrained on ImageNet
student_backbone = MobileNetV2() # Pretrained on ImageNet
clip_encoder = CLIPEncoder() # Off-the-shelf CLIP image encoder, kept frozen
 
# MLP projection heads 
teacher_proj = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512,512)) 
student_proj = nn.Sequential(nn.Linear(1280, 512), nn.ReLU(), nn.Linear(512,512))
 
# Optimization (include the projection heads' parameters)
teacher_optim = torch.optim.Adam(
    list(teacher_backbone.parameters()) + list(teacher_proj.parameters()), lr=1e-4)
student_optim = torch.optim.Adam(
    list(student_backbone.parameters()) + list(student_proj.parameters()), lr=1e-4)
 
# Phase 1: CLIP-based Feature Alignment
for images in unlabeled_dataset:
 
  # Forward pass
  teacher_feats = teacher_backbone(images)
  teacher_feats = teacher_proj(teacher_feats)
  
  student_feats = student_backbone(images)
  student_feats = student_proj(student_feats)
 
  # CLIP features (frozen encoder); take the CLS-token embedding
  with torch.no_grad():
    clip_feats = clip_encoder(images)[:,0,:]

  # Alignment losses (F.cosine_similarity is per-sample, so average over the batch)
  teacher_loss = (1 - F.cosine_similarity(teacher_feats, clip_feats)).mean()
  student_loss = (1 - F.cosine_similarity(student_feats, clip_feats)).mean()
 
  # Update
  loss = teacher_loss + student_loss
  teacher_optim.zero_grad()
  student_optim.zero_grad()
  loss.backward()
  teacher_optim.step()
  student_optim.step()
 
 
# Phase 2: Semi-supervised Knowledge Distillation
 
# Task-specific prediction heads: 10 aesthetic score bins (e.g., AVA's 1-10 ratings);
# Softmax turns logits into score distributions for the EMD loss
teacher_head = nn.Sequential(nn.Linear(512, 10), nn.Softmax(dim=-1))
student_head = nn.Sequential(nn.Linear(512, 10), nn.Softmax(dim=-1))
 
teacher_model = nn.Sequential(teacher_backbone, teacher_proj, teacher_head)
student_model = nn.Sequential(student_backbone, student_proj, student_head)
 
# Fine-tune teacher (re-create the optimizer so it covers the full model)
teacher_optim = torch.optim.Adam(teacher_model.parameters(), lr=1e-4)

for images, labels in labeled_dataset:

  preds = teacher_model(images)
  loss = emd_loss(preds, labels)  # squared EMD, as sketched earlier

  teacher_optim.zero_grad()
  loss.backward()
  teacher_optim.step()
 
# Student training: the teacher is frozen; each step pairs a labeled
# batch with an unlabeled one (cycle the smaller loader in practice)
student_optim = torch.optim.Adam(student_model.parameters(), lr=1e-4)
teacher_model.eval()

for (x, y), u in zip(labeled_dataset, unlabeled_dataset):

  # Supervised loss on the labeled batch
  student_preds = student_model(x)
  sup_loss = emd_loss(student_preds, y)

  # Pseudo label from the frozen teacher on the unlabeled batch
  with torch.no_grad():
    pseudo_label = teacher_model(u)

  # Distillation loss on the unlabeled batch
  student_preds_u = student_model(u)
  distill_loss = emd_loss(student_preds_u, pseudo_label)

  # Combined update of the student only
  loss = sup_loss + distill_loss
  student_optim.zero_grad()
  loss.backward()
  student_optim.step()

This shows the model definitions, training loops, and key steps for CFA feature alignment and SKD in more detail.