Waffling around for Performance: Visual Classification with Random Words and Broad Concepts
Authors: Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata
Abstract: The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3. In particular, averaging over LLM-generated class descriptors, e.g. “waffle, which has a round shape”, can notably improve generalization performance. In this work, we critically study this behavior and propose WaffleCLIP, a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors. Without querying external models, we achieve comparable performance gains on a large number of visual classification tasks. This allows WaffleCLIP to both serve as a low-cost alternative, as well as a sanity check for any future LLM-based vision-language model extensions. We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors, and showcase how - if available - semantic context is better leveraged by querying LLMs for high-level concepts, which we show can be done to jointly resolve potential class name ambiguities. Code is available here: WaffleCLIP.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a new method called WaffleCLIP for zero-shot image classification using vision-language models like CLIP.
- WaffleCLIP replaces fine-grained class descriptors generated by large language models (LLMs) like GPT-3 with random words and characters.
- It shows comparable performance to methods using LLM descriptors, without needing access to external LLMs.
Why:
- Fine-grained LLM-generated descriptors are noisy: they vary widely, can be ambiguous, and often have limited visual relevance. Their benefits seem to come more from descriptor ensembling than from additional semantics.
- Random character and word descriptors can provide similar ensembling benefits at lower cost.
- WaffleCLIP serves as a sanity check for future methods relying on external LLM queries.
How:
- WaffleCLIP extends the prompt with random words and random character sequences, with their number and length chosen based on the class name length.
- It averages similarity scores over prompts with random descriptors, like prior work with LLM descriptors.
- If LLM access is available, high-level concepts provide better semantics and resolve ambiguities.
In summary, the paper shows random words/characters can match gains from LLM descriptors for zero-shot classification. WaffleCLIP provides a low-cost alternative and sanity check for methods using LLM queries.
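To make the "How" above concrete, here is a minimal sketch of descriptor-prompt ensembling with an off-the-shelf CLIP model via Hugging Face transformers. The model name, class list, image path, prompt template, and the random character descriptors are illustrative assumptions, not the paper's released code.
# Minimal sketch: average CLIP similarity over prompts with random descriptors.
import random
import string

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["waffle", "pancake"]                      # illustrative class names
descriptors = ["".join(random.choices(string.ascii_lowercase, k=6))
               for _ in range(4)]                    # random character descriptors, shared across classes

image = Image.open("example.jpg")                    # placeholder image path
scores = []
for c in classes:
    prompts = [f"A photo of a {c}, which has {d}." for d in descriptors]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape: (1, num_prompts)
    scores.append(logits.mean().item())              # ensemble over prompt variations

prediction = classes[scores.index(max(scores))]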
Main Contributions
The main contributions of the paper are:
- Proposing WaffleCLIP, which uses random character and word descriptors in prompts to improve zero-shot image classification performance of vision-language models like CLIP. This achieves comparable gains to methods relying on external LLM queries.
- Demonstrating WaffleCLIP as a low-cost alternative and sanity check for future methods using fine-grained LLM-generated descriptors. The strong performance of random descriptors questions the benefits of additional semantics from LLM queries.
- Studying the impact of LLM-generated descriptors and finding the main driver for gains is likely prompt ensemble averaging rather than additional semantics.
- Proposing a better way to leverage LLM semantics by using automatically extracted high-level concepts to provide context and resolve potential classname ambiguities.
- Providing extensive experiments analyzing WaffleCLIP and the role of semantics from LLM-generated descriptors across 11 image classification benchmarks.
In summary, the main contributions are: (1) proposing the low-cost WaffleCLIP using random prompts, (2) using it to analyze benefits of LLM semantics, and (3) suggesting high-level concepts as a better way to utilize LLM access.
Method Section
Here is a summary of the method section from the paper:
The paper first describes image classification using fine-grained class descriptors generated by large language models (LLMs) like GPT-3, following prior work DCLIP.
It then motivates and proposes WaffleCLIP, which replaces LLM-generated descriptors with random character sequences or random word sequences as class descriptors. The number and length of the random descriptors are determined from the class name lengths. Similarity scores are averaged over the prompts with random descriptors.
WaffleCLIP does not require external LLM access, remaining inherently zero-shot. It serves as a low-cost alternative and sanity check for methods relying on LLM queries.
Finally, the paper proposes a way to better leverage LLM semantics, if access is available. It uses GPT-3 queries to automatically extract high-level concepts that provide semantic context and help resolve classname ambiguities. The concepts are incorporated into prompts, e.g. “A photo of a [concept]: a [classname]”.
In summary, the method section introduces:
- WaffleCLIP using random word/character descriptors
- Motivation for it as a low-cost alternative to LLM descriptors
- A technique to extract high-level concepts from LLMs for better semantics
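If LLM access is available, the concept extraction itself is a single query. Below is a hypothetical sketch using the current OpenAI Python client; the client usage, model name, and answer post-processing are assumptions and differ from the paper, which queried GPT-3.
# Hypothetical sketch of the high-level concept query (not the paper's exact setup).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def get_high_level_concept(classnames):
    # Ask the LLM what the class names have in common, mirroring the summary above
    query = "Tell me what these have in common: " + ", ".join(classnames)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; the paper used GPT-3
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content.strip()

# The returned concept (e.g. something like "dog breeds" for a set of dog classes)
# is then inserted into the prompt: f"A photo of a {concept}: a {classname}."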
High-Level Pseudo Code
Here is high-level pseudo code for the key components of WaffleCLIP proposed in the paper:
# WaffleCLIP using random descriptors.
# Assumes helpers get_image_embedding, get_text_embedding, cosine_similarity,
# and query_GPT3 (wrapping CLIP and the GPT-3 API) are available.
def classify_image(image, classname):
    # Generate random word/character descriptors for this class
    random_descriptors = generate_random_words_and_chars(classname)

    # Create one prompt variation per descriptor
    prompts = [f"A photo of a {classname}, which (is/has/etc) {d}."
               for d in random_descriptors]

    # Get image and text embeddings
    image_embedding = get_image_embedding(image)
    text_embeddings = [get_text_embedding(prompt) for prompt in prompts]

    # Average similarity scores over all prompt variations
    similarities = [cosine_similarity(image_embedding, text_emb)
                    for text_emb in text_embeddings]
    class_score = sum(similarities) / len(similarities)
    return class_score

# Get a high-level concept from GPT-3
def get_concept(classnames):
    query = "Tell me what these have in common: " + ", ".join(classnames)
    concept = query_GPT3(query)
    return concept

# Classify with the high-level concept in the prompt
def classify_with_concept(image, classname, concept):
    prompt = f"A photo of a {concept}: a {classname}."
    text_emb = get_text_embedding(prompt)
    image_emb = get_image_embedding(image)
    return cosine_similarity(text_emb, image_emb)
The key components are:
- Generate random word/character descriptors
- Create prompt variations using them
- Average similarity scores for classification
- Get high-level concepts from GPT-3
- Incorporate concepts into prompts
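Note that classify_image above scores a single class; a prediction requires comparing these scores across all classes. A minimal sketch of that final step, using the pseudo-code function above (the class list is illustrative):
# Combine per-class scores into a prediction by taking the highest average similarity
def predict(image, classnames):
    scores = {c: classify_image(image, c) for c in classnames}
    return max(scores, key=scores.get)

# e.g. predict(image, ["waffle", "pancake", "crepe"]) returns the class name whose
# averaged prompt similarity to the image is highest.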
Detailed Pseudo Code
Here is more detailed pseudo code to implement the key components of WaffleCLIP:
import random
import string

# Assumes helpers get_image_embedding, get_text_embedding, cosine_similarity,
# and query_GPT3 (wrapping CLIP and the GPT-3 API) are available.

# Generate pairs of (random word, random character sequence) descriptors
def generate_random_descriptors(classname, num_descriptors=10):  # illustrative default count
    # Derive descriptor lengths from the class name
    # (the paper matches the average word length and count of the dataset's class names)
    words = classname.split()
    word_count = len(words)
    avg_word_len = max(1, round(sum(len(w) for w in words) / word_count))

    descriptors = []
    for _ in range(num_descriptors):
        # Random "word" descriptor (the paper samples random-word descriptors from a
        # vocabulary of real words; random characters keep this sketch self-contained)
        random_word = "".join(random.choice(string.ascii_lowercase)
                              for _ in range(avg_word_len))

        # Random character-sequence descriptor matching the full class name length
        char_seq = "".join(random.choice(string.ascii_lowercase)
                           for _ in range(avg_word_len * word_count))

        descriptors.append((random_word, char_seq))
    return descriptors

# Create a single descriptor prompt
def create_prompt(classname, descriptor):
    return "A photo of a {0}, which (is/has/etc) {1}.".format(classname, descriptor)

# Embed the image and all prompts, then average the similarity scores
def get_similarity(image, prompts):
    image_emb = get_image_embedding(image)
    scores = []
    for prompt in prompts:
        text_emb = get_text_embedding(prompt)
        score = cosine_similarity(image_emb, text_emb)
        scores.append(score)
    return sum(scores) / len(scores)

# Score one class by ensembling over its random descriptor prompts
def classify_image(image, classname):
    descriptors = generate_random_descriptors(classname)
    # One prompt per descriptor: word and character variants are separate prompts
    prompts = [create_prompt(classname, d)
               for pair in descriptors for d in pair]
    return get_similarity(image, prompts)

# Get a high-level concept from GPT-3
def get_concept(classnames):
    query = "Tell me what these have in common: " + ", ".join(classnames)
    return query_GPT3(query)

# Classify with the high-level concept in the prompt
def classify_with_concept(image, classname, concept):
    prompt = "A photo of a {0}: a {1}.".format(concept, classname)
    text_emb = get_text_embedding(prompt)
    image_emb = get_image_embedding(image)
    return cosine_similarity(text_emb, image_emb)
This shows:
- Generating random word and character descriptors
- Creating prompt variations using descriptors
- Averaging similarity scores for classification
- Querying GPT-3 for concepts
- Incorporating concepts into prompts
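As a final usage sketch (not the paper's released code), the functions above can be combined into one zero-shot prediction loop. The load_image call, the class names, and the combined concept-plus-descriptor prompt template are assumptions for illustration.
# Putting the pieces together: zero-shot prediction over a set of class names,
# optionally prepending a shared high-level concept to every descriptor prompt.
def predict_label(image, classnames, use_concept=False):
    concept = get_concept(classnames) if use_concept else None
    scores = []
    for classname in classnames:
        prompts = []
        for random_word, char_seq in generate_random_descriptors(classname):
            for d in (random_word, char_seq):
                if concept is None:
                    prompts.append(create_prompt(classname, d))
                else:
                    # Assumed combined template mixing the concept and a random descriptor
                    prompts.append(
                        f"A photo of a {concept}: a {classname}, which (is/has/etc) {d}."
                    )
        scores.append(get_similarity(image, prompts))
    return classnames[scores.index(max(scores))]

# Usage (image loading is a placeholder):
# image = load_image("example.jpg")
# print(predict_label(image, ["waffle", "pancake", "crepe"], use_concept=True))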