EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes

Authors: Jingyuan Yang, Qirui Huang, Tingting Ding, Dani Lischinski, Daniel Cohen-Or, Hui Huang

Abstract: Visual Emotion Analysis (VEA) aims at predicting people’s emotional responses to visual stimuli. This is a promising, yet challenging, task in affective computing, which has drawn increasing attention in recent years. Most of the existing work in this area focuses on feature design, while little attention has been paid to dataset construction. In this work, we introduce EmoSet, the first large-scale visual emotion dataset annotated with rich attributes, which is superior to existing datasets in four aspects: scale, annotation richness, diversity, and data balance. EmoSet comprises 3.3 million images in total, with 118,102 of these images carefully labeled by human annotators, making it five times larger than the largest existing dataset. EmoSet includes images from social networks, as well as artistic images, and it is well balanced between different emotion categories. Motivated by psychological studies, in addition to emotion category, each image is also annotated with a set of describable emotion attributes: brightness, colorfulness, scene type, object class, facial expression, and human action, which can help understand visual emotions in a precise and interpretable way. The relevance of these emotion attributes is validated by analyzing the correlations between them and visual emotion, as well as by designing an attribute module to help visual emotion recognition. We believe EmoSet will bring some key insights and encourage further research in visual emotion analysis and understanding. The data and code will be released after the publication of this work.

What, Why and How

Here is a summary of the key points from the paper:

What:

  • The paper introduces EmoSet, a large-scale visual emotion dataset with rich attribute annotations.
  • EmoSet contains 3.3 million images, with 118,102 images carefully labeled by humans.
  • Each image is annotated with one of 8 emotion categories based on Mikels' emotion model, as well as 6 emotion attributes: brightness, colorfulness, scene type, object class, facial expression, and human action.

Why:

  • Existing visual emotion analysis (VEA) datasets are limited in scale and annotation richness. EmoSet aims to facilitate VEA research by providing more data and richer annotations.
  • The emotion attributes are designed based on psychological studies to help understand what visual cues evoke certain emotions.

How:

  • Emotion keywords were used to retrieve candidate images from diverse sources.
  • Attribute labels were generated automatically using traditional and deep learning models.
  • A subset of images was manually annotated for emotions and attributes by human labelers.
  • Experiments show that the attribute features boost emotion recognition performance and capture emotion-related visual patterns.
  • Analysis reveals correlations between attributes and emotions, validating the relevance of designed attributes.

In summary, the paper constructed a large-scale and richly annotated visual emotion dataset called EmoSet to advance emotion recognition and understanding through data and attribute annotations. Experiments and analysis demonstrate the utility of the dataset.
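As a concrete illustration of the keyword-driven retrieval step described above, the keyword set could be built by expanding the eight Mikels emotion categories with lexical synonyms. The sketch below uses WordNet for the expansion; this is an assumption for illustration, not the paper's actual procedure (which yields 810 keywords).

# A minimal sketch of emotion-keyword expansion via WordNet synonyms (assumed, not the paper's exact method).
# Requires: pip install nltk, then nltk.download("wordnet") on first use.
from nltk.corpus import wordnet as wn

# The eight Mikels emotion categories used by the paper.
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]

def generate_keywords(emotions=EMOTIONS):
    keywords = set(emotions)
    for emotion in emotions:
        for synset in wn.synsets(emotion):
            for lemma in synset.lemma_names():
                keywords.add(lemma.replace("_", " ").lower())
    return sorted(keywords)

# Example usage:
# print(generate_keywords()[:10])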

Main Contributions

Here are the main contributions of this paper:

  • EmoSet dataset:

    • First large-scale visual emotion dataset with rich attribute annotations (3.3 million total, 118K labeled)
    • 5x larger than FI, the previously largest labeled dataset
    • Annotated with 8 emotion categories and 6 emotion attributes
    • More diverse sources and balanced emotion distribution
  • Emotion attributes design:

    • Proposed a set of interpretable attributes based on psychological studies
    • Cover different levels of visual information (brightness, colorfulness, scene, object, facial expression, action)
    • Help explain evoked emotions and bridge the affective gap
  • Dataset analysis:

    • Validated the correlations between attributes and emotions
    • Brightness, colorfulness, and facial expression relate to emotion valence
    • Specific scene, object, and action values correlate strongly with particular emotions
  • Attribute module:

    • Designed a module that incorporates attribute branches into CNN backbones
    • Experiments show the attribute branches boost emotion recognition performance
    • Visualizations confirm that the attribute features capture emotion-related patterns
  • Benchmarking:

    • Evaluation shows EmoSet enables better emotion recognition than other datasets
    • Cross-dataset tests demonstrate the generalization ability of EmoSet

In summary, the main contributions are: introducing the large-scale EmoSet dataset, designing interpretable emotion attributes, dataset analysis, attribute recognition module, and benchmarking experiments.
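The attribute module is described only at a high level in this summary. The sketch below shows one way such a module could be wired into a CNN backbone: a shared feature extractor, one lightweight branch per attribute, and the attribute features concatenated back into the emotion classifier. This is an illustrative PyTorch sketch under those assumptions, not the paper's exact architecture; the per-attribute class counts and feature sizes are placeholders.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class AttributeAwareEmotionNet(nn.Module):
    """CNN backbone plus one lightweight branch per emotion attribute.

    Attribute features are concatenated with the backbone feature before the
    8-way emotion classifier. Class counts per attribute are placeholders.
    """

    def __init__(self, num_emotions=8, attr_classes=None):
        super().__init__()
        if attr_classes is None:
            # Illustrative class counts for the six attributes, not the paper's exact numbers.
            attr_classes = {"brightness": 11, "colorfulness": 11, "scene": 250,
                            "object": 400, "facial_expression": 6, "action": 250}

        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        feat_dim, attr_dim = 2048, 64

        # One small projection plus classifier head per attribute.
        self.attr_proj = nn.ModuleDict(
            {name: nn.Linear(feat_dim, attr_dim) for name in attr_classes})
        self.attr_heads = nn.ModuleDict(
            {name: nn.Linear(attr_dim, n) for name, n in attr_classes.items()})

        # Emotion classifier sees the backbone feature plus all attribute features.
        self.emotion_head = nn.Linear(feat_dim + attr_dim * len(attr_classes), num_emotions)

    def forward(self, x):
        feat = self.backbone(x).flatten(1)                                    # (B, 2048)
        attr_feats = {n: torch.relu(p(feat)) for n, p in self.attr_proj.items()}
        attr_logits = {n: self.attr_heads[n](f) for n, f in attr_feats.items()}
        fused = torch.cat([feat] + list(attr_feats.values()), dim=1)
        return self.emotion_head(fused), attr_logits

The attribute heads can be supervised with the automatically generated attribute labels, while the emotion head is trained on the human-verified emotion labels.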

Method Section

Here is a summary of the method section from the paper:

The paper describes the construction of the EmoSet dataset in several steps:

  • Data Collection:

    • 810 emotion keywords generated based on Mikels' emotion model
    • Used to retrieve candidate images from diverse sources: openverse, pexels, pixabay, rawpixels
    • Collected 3.3 million images, each weakly labeled with an emotion derived from its search query
  • Emotion Attributes Design:

    • Proposed 6 types of attributes based on psychological studies
      • Low-level: Brightness, Colorfulness
      • Mid-level: Scene type, Object class
      • High-level: Facial expression, Human action
    • Traditional and deep learning models used to predict attributes automatically
  • Human Annotation:

    • 118K images selected for manual labeling by qualified annotators
    • Annotators labeled images for 8 emotions and 6 attributes
    • Each image labeled by 10 annotators, final label based on consensus
    • Result is EmoSet-118K dataset with human verification
  • Analysis:

    • Analyzed correlations between attributes and emotions
    • Positive emotions correlated with brighter, more colorful images
    • Facial expressions strongly matched perceived emotions
    • Strong correlations between some scene/object/action types and emotions

In summary, the method collects a large pool of candidate images, designs a set of emotion attributes, manually annotates a subset of images, and analyzes the relationships between attributes and emotions in the dataset. This results in the EmoSet dataset with rich annotations.
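The analysis step is summarized only qualitatively above. One simple way to quantify such attribute-emotion correlations, assumed here rather than prescribed by the paper, is a row-normalized co-occurrence table computed over the human-labeled subset. A minimal pandas sketch, assuming the annotations are available as a DataFrame with an "emotion" column and one column per attribute (the column names are placeholders, not the dataset's actual schema):

import pandas as pd

def attribute_emotion_correlation(df: pd.DataFrame, attribute: str) -> pd.DataFrame:
    """Row-normalized co-occurrence of one attribute's values vs. emotion labels.

    Each row gives the conditional distribution P(emotion | attribute value),
    so high entries indicate attribute values strongly tied to an emotion.
    """
    counts = pd.crosstab(df[attribute], df["emotion"])
    return counts.div(counts.sum(axis=1), axis=0)

# Example usage:
# corr = attribute_emotion_correlation(annotations, "brightness")
# print(corr.round(2))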

High-Level Pseudo Code

Here is the high-level pseudo code for the key steps in the paper:

# Data Collection
emotion_keywords = generate_keywords()
image_pool = retrieve_images(emotion_keywords)

# Emotion Attributes (predicted automatically for every image in the pool)
brightness = compute_brightness(image_pool)
colorfulness = compute_colorfulness(image_pool)
scene = recognize_scene(image_pool)
objects = detect_objects(image_pool)
faces = detect_faces(image_pool)
expression = recognize_expression(faces)
action = recognize_action(image_pool)

# Human Annotation
annotators = qualify_annotators(empathy_test, annotation_test)
annotated_set = manually_annotate(annotators, image_subset)

# Analysis
correlations = compute_attribute_emotion_correlations(annotated_set)
visualize_correlations(correlations)

# Recognition Model
CNN = initialize_CNN()
attr_modules = create_attribute_modules(
    brightness, colorfulness, scene, objects, expression, action)
model = connect_attr_modules_to_CNN(CNN, attr_modules)
model.fit(annotated_set)

This covers the key steps including data collection, attribute extraction, manual annotation, analysis of the dataset, and training an attribute-aware recognition model. The pseudo code abstracts the overall workflow and methods used in the paper.
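The attribute-extraction calls above are left abstract, and this summary does not pin down their exact formulas. The sketch below shows one plausible implementation of the two low-level attributes: mean luminance for brightness and the Hasler-Süsstrunk metric for colorfulness. Both choices are assumptions for illustration rather than the paper's confirmed formulas, and quantize maps the continuous values onto discrete levels as the pseudo code suggests.

import numpy as np
from PIL import Image

def brightness_of(path: str) -> float:
    """Mean luminance in [0, 1], using the Rec. 601 luma weights (assumed metric)."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    luma = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return float(luma.mean())

def colorfulness_of(path: str) -> float:
    """Hasler & Suesstrunk (2003) colorfulness measure; higher means more colorful."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g                      # red-green opponent channel
    yb = 0.5 * (r + g) - b          # yellow-blue opponent channel
    std = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return float(std + 0.3 * mean)

def quantize(value: float, low: float, high: float, n_bins: int) -> int:
    """Map a continuous attribute value onto one of n_bins discrete levels."""
    value = min(max(value, low), high)
    return int((value - low) / (high - low) * (n_bins - 1) + 0.5)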

Detailed Pseudo Code

Here is a more detailed pseudo code implementation of the key methods in the paper:

# Data Collection
 
def generate_keywords():
  emotions = ['amusement', 'awe',...] # 8 emotions
  synonyms = lookup_synonyms(emotions) 
  keywords = augment_synonyms(synonyms)
  return keywords
 
def retrieve_images(keywords):
  sources = [openverse, pexels,...] # image sources 
  image_pool = []
  for source in sources:
    for keyword in keywords:
      images = download_images(source, keyword)  
      image_pool.extend(images)
  return image_pool
 
# Emotion Attributes 
 
def compute_brightness(images):
  brightnesses = []
  for img in images:
    value = measure_average_luminance(img) 
    brightnesses.append(quantize(value))
  return brightnesses
 
def compute_colorfulness(images):
  colorfulness = []
  for img in images:
    value = compute_color_statistics(img)
    colorfulness.append(quantize(value))
  return colorfulness
  
def recognize_scene(images):
  scenes = []
  model = load_scene_recognition_model()
  for img in images:
    scene = model.predict(img) 
    scenes.append(scene)
  return scenes
 
# similarly for objects, faces, expressions, actions
 
# Human Annotation
 
def qualify_annotators(empathy_test, annotation_test):
  annotators = recruit_annotators()
  qualified = []
  for anno in annotators:  
    if anno.take_test(empathy_test) and anno.take_test(annotation_test):
      qualified.append(anno) 
  return qualified
 
def manually_annotate(annotators, images):
  annotated_set = []
  for img in images:
    annotations = collect_annotations(img, annotators)
    consensus = determine_consensus(annotations)
    if passing_criterion(consensus): 
      img.annotations = consensus  
      annotated_set.append(img)
  return annotated_set
 
# Analysis, Model Training
# ...

This covers some of the key functions and implementation details for the data collection, attribute extraction, annotation, and analysis parts of the paper. The overall workflow and techniques are abstracted into reusable Python functions.
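The helpers determine_consensus and passing_criterion in manually_annotate are left abstract. One plausible reading, assumed here rather than taken from the paper, is a majority vote over the per-annotator emotion labels combined with a minimum-agreement threshold (the signatures are adjusted slightly to make the sketch self-contained):

from collections import Counter

def determine_consensus(labels):
    """Majority vote over per-annotator emotion labels for one image.
    Returns (winning_label, vote_count), e.g. ("awe", 7) for 10 annotators."""
    return Counter(labels).most_common(1)[0]

def passing_criterion(consensus, num_annotators, min_agreement=0.5):
    """Keep the image only if enough annotators agree on the winning label.
    The 0.5 threshold is illustrative, not the paper's actual criterion."""
    _, count = consensus
    return count / num_annotators >= min_agreement

# Example: 7 of 10 annotators chose "awe", so the image is kept with that label.
# votes = ["awe"] * 7 + ["excitement"] * 3
# consensus = determine_consensus(votes)        # ("awe", 7)
# passing_criterion(consensus, len(votes))      # True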