Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
Authors: Itay Itzhak, Gabriel Stanovsky, Nir Rosenfeld, Yonatan Belinkov
Abstract: Recent studies show that instruction tuning and learning from human feedback improve the abilities of large language models (LMs) dramatically. While these tuning methods can make models generate high-quality text, we conjecture that more implicit cognitive biases may arise in these fine-tuned models. Our work provides evidence that these fine-tuned models exhibit biases that were absent or less pronounced in their pretrained predecessors. We examine the extent of this phenomenon in three cognitive biases - the decoy effect, the certainty effect, and the belief bias - all of which are known to influence human decision-making and reasoning. Our findings highlight the presence of these biases in various models, especially those that have undergone instruction tuning, such as Flan-T5, GPT3.5, and GPT4. This research constitutes a step toward comprehending cognitive biases in instruction-tuned LMs, which is crucial for the development of more reliable and unbiased language models.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper investigates cognitive biases in large language models (LMs), particularly those fine-tuned using instruction tuning (IT) and reinforcement learning from human feedback (RLHF).
- It focuses on three cognitive biases: the decoy effect, the certainty effect, and the belief bias.
- The authors generate semi-automated datasets to evaluate these biases in different LMs.
Why:
- Recent studies show IT and RLHF can improve LM abilities, but their impact on cognitive biases is unclear.
- Understanding cognitive biases in LMs is important for developing reliable and unbiased models.
How:
- The authors adapt classic human experiments on the three biases into an LM format with generated prompts.
- They compare LM predictions on biased and control datasets to quantify bias scores.
- Their key findings are:
  - Models tuned with IT and RLHF exhibit more bias than their pretrained counterparts; both IT and RLHF individually amplify biases
  - The observed biases align with known human cognitive biases
  - Larger models display higher bias scores
  - There is a tradeoff between accuracy and bias
In summary, the paper provides evidence that IT and RLHF tuning introduce or amplify cognitive biases in LMs, highlighting an important limitation of these tuning methods and the need for effective bias mitigation strategies.
Main Contributions
Based on my reading, the main contributions of this paper are:
- Investigating cognitive biases in large language models (LMs), with a focus on instruction-tuned and human feedback-tuned models. Prior work has mostly looked at social biases such as gender bias, whereas this paper explores cognitive biases such as the decoy effect.
- Generating new semi-automated datasets to evaluate cognitive biases in LMs, adapted from classic human experiments. This allows testing at scale while avoiding data contamination.
- Providing empirical evidence that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) tend to introduce or amplify cognitive biases compared to pretrained LMs. This highlights an important limitation of these tuning methods.
- Demonstrating that tuned LMs exhibit cognitive biases similar to those of humans, suggesting they may be picking up on inherent human biases during tuning.
- Analyzing factors that affect bias, such as model size and accuracy, providing insights for bias mitigation.
- Establishing baseline measurements of several cognitive biases across major LM families, such as GPT and T5.
- Pointing to the need for further research into understanding and addressing cognitive biases to develop more reliable LMs.
In summary, the key contributions are rigorously evaluating cognitive biases in tuned LMs, showing that tuning introduces or amplifies biases, analyzing what affects bias, and establishing baselines to motivate further research into bias mitigation.
Method Section
Here is a summary of the method section from the paper:
Data Generation:
- For each of the three biases (decoy, certainty, belief), the authors generate biased and control datasets by adapting classic human experiments into LM prompts.
- For the decoy and certainty effects, they create prompts with option choices and varying values/alternatives. For the belief bias, they use existing data from a prior study.
- The biased datasets replicate the experimental setup that elicits the bias in humans. The control datasets are similar but omit the aspect that triggers the bias.
- They generate a large collection of prompts with variations in values, text templates, and option order permutations (a minimal sketch of such prompt generation appears after this list).
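To make the data-generation step concrete, here is a minimal sketch of how biased and control prompts for the decoy effect could be built from templates. The template wording, product names, and attribute values are illustrative assumptions for this sketch, not the paper's actual templates.

import random

# Hypothetical product options: (name, price in dollars, quality rating).
# The decoy is dominated by the target on both attributes, which in humans
# nudges choices toward the target option.
COMPETITOR = ("Brand A", 30, 50)
TARGET = ("Brand B", 50, 70)
DECOY = ("Brand C", 55, 65)

TEMPLATE = (
    "You must choose one of the following products:\n"
    "{options}\n"
    "Which product do you choose?"
)

def format_option(label, option):
    name, price, quality = option
    return f"({label}) {name}: ${price}, quality rating {quality}/100"

def generate_decoy_prompt(introduce_bias, seed=0):
    """Build one prompt; the biased version includes the decoy option,
    the control version omits it (the aspect that triggers the bias)."""
    options = [COMPETITOR, TARGET] + ([DECOY] if introduce_bias else [])
    random.Random(seed).shuffle(options)  # permute option order, as in the paper
    lines = [format_option("ABC"[i], opt) for i, opt in enumerate(options)]
    return TEMPLATE.format(options="\n".join(lines))

biased_prompt = generate_decoy_prompt(introduce_bias=True)
control_prompt = generate_decoy_prompt(introduce_bias=False)

Varying the values, templates, and option order as above is how a single experimental setup can be expanded into a large prompt collection.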
Evaluation:
- They evaluate pretrained LMs such as GPT3 and T5, as well as tuned versions such as GPT3.5 and Flan-T5.
- For each model, they compare its predictions on the biased vs. control datasets for a given bias.
- They define a bias score that quantifies the difference in how often the target option is selected between the two datasets. Higher scores indicate more bias aligned with human biases (a sketch of this computation follows the summary below).
- They also analyze model accuracy on the belief bias to examine the accuracy vs. bias tradeoff.
- Some key analyses: the effect of few-shot examples, product variation in the decoy bias, the impact of price range and decoy type, etc.
In summary, the core methodology involves adapting classic human bias experiments into LM prompt datasets, evaluating model preferences between biased and control datasets, and defining bias scores to quantify the degree of bias relative to humans. The analyses provide additional insights into factors affecting bias in LMs.
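As a rough illustration of the bias score, here is a minimal sketch assuming model predictions have already been mapped to option labels; the helper names and exact normalization are assumptions for this sketch, not the paper's exact definitions.

def compute_target_frequency(predictions, target_option="B"):
    """Fraction of prompts on which the model selected the target option."""
    return sum(1 for p in predictions if p == target_option) / len(predictions)

def bias_score(biased_preds, control_preds, target_option="B"):
    """Positive scores mean the bias-eliciting condition shifted the model
    toward the target option, i.e. behavior aligned with the human bias."""
    return (compute_target_frequency(biased_preds, target_option)
            - compute_target_frequency(control_preds, target_option))

# Example: the target is chosen 70% of the time with the decoy present and
# 45% of the time in the control condition, giving a bias score of 0.25.
score = bias_score(["B"] * 70 + ["A"] * 30, ["B"] * 45 + ["A"] * 55)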
High-Level Pseudo Code
Here is a high-level pseudo code summary of the key steps in the paper’s methodology:
# Generate biased and control datasets for each bias
datasets = {}
for bias in ["decoy", "certainty", "belief"]:
    biased_data = generate_prompts(bias, introduce_bias=True)
    control_data = generate_prompts(bias, introduce_bias=False)
    datasets[bias] = (biased_data, control_data)

# Evaluate models
for model in [GPT3, T5, GPT3_5, Flan_T5]:
    for bias, (biased_data, control_data) in datasets.items():
        # Get predictions on both datasets
        biased_preds = model(biased_data)
        control_preds = model(control_data)

        # Compute bias score: difference in target-option frequency
        target_freq_biased = compute_target_frequency(biased_preds)
        target_freq_control = compute_target_frequency(control_preds)
        bias_score = target_freq_biased - target_freq_control

        # Analyze model accuracy for belief bias
        if bias == "belief":
            acc_biased = compute_accuracy(biased_preds)
            acc_control = compute_accuracy(control_preds)
            plot_accuracy_vs_bias(acc_biased, acc_control, bias_score)

    # Analysis of factors affecting bias
    analyze_few_shot_impact(model)
    analyze_decoy_attributes(model)
    analyze_belief_accuracy_tradeoff(model)
In summary, the key steps are:
- Generate biased and control datasets for each bias
- Evaluate models on datasets and compute bias scores
- For belief bias, also analyze accuracy
- Perform analysis of factors affecting bias
This highlights the core methodology of evaluating models on controlled datasets designed to elicit biases, quantifying the biases, and analyzing what impacts them.
Detailed Pseudo Code
Here is a more detailed pseudo code implementation of the key steps in the paper’s methodology:
from models import GPT3, T5, GPT3_5, Flan_T5  # hypothetical model wrappers

# Dataset generation: each generator returns a biased dataset and a matched
# control dataset that omits the bias-triggering aspect
def generate_decoy_data(num_samples):
    # Implementation details...
    return biased_data, control_data

def generate_certainty_data(num_samples):
    # Implementation details...
    return biased_data, control_data

def generate_belief_data(num_samples):
    # Implementation details...
    return biased_data, control_data

GENERATORS = {
    'decoy': generate_decoy_data,
    'certainty': generate_certainty_data,
    'belief': generate_belief_data,
}

# Evaluate one model on one bias
def evaluate(model, bias, biased_data, control_data):
    biased_preds = model(biased_data)
    control_preds = model(control_data)

    # Compute target-option frequency on each dataset
    target_freq_biased = compute_target_frequency(biased_preds)
    target_freq_control = compute_target_frequency(control_preds)

    # Compute bias score
    bias_score = target_freq_biased - target_freq_control

    # Accuracy is only meaningful for the belief bias
    acc_biased = acc_control = None
    if bias == 'belief':
        acc_biased = compute_accuracy(biased_preds)
        acc_control = compute_accuracy(control_preds)

    return bias_score, acc_biased, acc_control

# Main method
for model in [GPT3, T5, GPT3_5, Flan_T5]:
    for bias in ['decoy', 'certainty', 'belief']:
        biased_data, control_data = GENERATORS[bias](num_samples=1000)
        bias_score, acc_biased, acc_control = evaluate(
            model, bias, biased_data, control_data)

        # Analysis steps
        ...
The key steps are:
- Generate biased and control datasets for each bias
- Evaluate models by getting predictions, computing target frequency, bias scores, accuracy
- Perform analysis on factors affecting bias, accuracy vs bias tradeoffs etc.
The core methods are the generate_*_data functions and evaluate, which encapsulate dataset creation and model evaluation.
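One detail the pseudo code leaves implicit is how raw generated text is mapped to an option choice before computing target frequencies. Below is a minimal sketch assuming the options are labeled (A)/(B)/(C) in the prompt and the first label mentioned in the output is taken as the prediction; the paper may instead score options by likelihood, so this extract_choice helper is purely illustrative.

import re

def extract_choice(generated_text, labels=("A", "B", "C")):
    """Return the first standalone option label found in the model's output,
    or None if no label is found."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    match = re.search(pattern, generated_text)
    return match.group(1) if match else None

# Example usage
preds = [extract_choice(out) for out in ["I would choose (B).", "(A) looks best"]]
# preds == ["B", "A"]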