Robust Distortion-free Watermarks for Language Models
Authors: Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang
Abstract: We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers — which we compute using a randomized watermark key — to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models — OPT-1.3B, LLaMA-7B and Alpaca-7B — to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text (p ≤ 0.01) from 35 tokens even after corrupting between 40-50% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around 25% of the responses — whose median length is around 100 tokens — are detectable with p ≤ 0.01, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
What, Why and How
Here is a summary of the key points from the paper:
What:
- The paper proposes a methodology for planting watermarks in text generated by an autoregressive language model. The watermarks let any party who knows the watermark key detect whether a given text was generated by the model, without access to the model or the prompt used.
Why:
- Watermarking enables attributing generated text to a language model, which can help enforce policies on model usage, track misuse, and combat misinformation spread via synthetic text.
- Existing watermarks either distort the model’s output distribution or are not robust to modifications of the text. This work aims to achieve watermarks that are both distortion-free and robust.
How:
- The methodology has two components: a generate function that maps a random key to a language model sample, and a detect function that aligns text to the key to test whether it is likely watermarked.
- The key insight is to use robust sequence alignment techniques, incorporating edit distance, so the watermark is detectable even from corrupted snippets.
- Two instantiations are proposed, using inverse transform sampling and exponential minimum sampling for the generate function. Analyses show detection power grows exponentially with text length and degrades only linearly with key length.
- Experiments validate the watermarks on the OPT, LLaMA and Alpaca models. With 35 tokens, watermarks remain detectable after 40-50% random corruption. Roundtrip translation is a harder attack.
In summary, the paper presents a novel approach for distortion-free and robust watermarking of language models, with theoretical analysis and empirical validation. The watermarks enable detecting model provenance without access to the model or prompt.
Main Contributions
Based on my reading, the main contributions of the paper are:
- Proposes a new methodology for watermarking text generated by language models that satisfies key desiderata:
  - Distortion-free: preserves the original text distribution
  - Agnostic: detectable without access to the model or prompt
  - Robust: withstands perturbations like edits and paraphrasing
- Gives two concrete instantiations of the watermarking methodology, using inverse transform sampling and exponential minimum sampling.
- Provides theoretical analyses characterizing the statistical power of watermark detection, showing it grows exponentially with text length and degrades only linearly with key length.
- Empirically validates the watermarks on the OPT, LLaMA and Alpaca models, showing detection is possible from snippets as short as 35 tokens even after 40-50% random corruption.
- Demonstrates limitations of existing watermarking techniques, such as biasing token probabilities, and shows robustness to paraphrasing attacks via roundtrip translation.
- Conducts a case study on watermarking an instruction-following model, analyzing statistical power and exposing flaws in hashing-based watermarks.
- Makes implementations available to enable adoption and further research.
In summary, the key contributions are proposing a new robust and distortion-free watermarking approach for language models, with thorough theoretical and empirical validation. The watermarks enable provenance detection from text snippets.
Method Section
Here is a summary of the method section from the paper:
The proposed watermarking methodology consists of two main components:
- Generate function:
  - Maps a random key (a sequence of random numbers) to a sample from the language model via a decoder function.
  - The decoder takes an element of the key and the next-token distribution and returns a token.
  - Over the randomness of the key, the decoder's output is an exact sample from the distribution (distortion-free).
  - Text generation proceeds autoregressively using the decoder.
- Detect function:
  - Aligns putative watermarked text with the key using a test statistic.
  - Tests whether the text has a low p-value w.r.t. the null hypothesis that it is independent of the key.
  - Alignment uses edit-distance-style techniques to be robust to perturbations.
Two instantiations of the watermarking methodology are given:
- Inverse transform sampling:
  - The decoder maps a key element (a uniform sample and a permutation) to a token via the CDF of the permuted next-token distribution.
  - The test statistic correlates the sequence of (normalized) permuted token indices with the uniform samples.
- Exponential minimum sampling:
  - The decoder maps a key element (one uniform sample per token) to the token attaining the argmin of -log(sample) / token probability.
  - The test statistic sums -log of the key samples aligned with the observed tokens.
The detect function resamples keys and compares the statistic on the text against its empirical distribution under the resampled keys to obtain a p-value.
A randomized wrapper shifts the key before sampling to avoid reusing elements. Alignment ensures robustness to cropping.
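A minimal sketch of that wrapper (our illustration, reusing the generate function from the pseudocode below; the offset name tau is ours):
import random

# Randomized wrapper: shift the key by a fresh random offset per query
def generate_shifted(key, model, decoder, m):
    tau = random.randrange(len(key))                    # random offset
    shifted = [key[(tau + i) % len(key)] for i in range(m)]
    return generate(shifted, model, decoder)
Because detect aligns blocks of text against every offset of the key, the shift costs nothing at detection time.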
Theoretical analyses show detection power grows exponentially in text length and degrades only linearly in key length. The results depend on the text’s “watermark potential”.
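For reference, the watermark potential of a text y is, paraphrasing the paper's definition from memory (so treat the exact form as approximate), one minus the average conditional probability of its tokens:
alpha(y) = 1 - (1/len(y)) * sum over i of p(y_i | y_1, ..., y_{i-1})
Near-deterministic text has potential close to zero, which is why low-entropy responses are hard to watermark and detect.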
In summary, the core method generates watermarked text by mapping a random key to language model samples. Detection aligns text with key to test if it is likely watermarked based on a distortion-free statistic.
High-Level Pseudo Code
Here is high-level pseudocode for the key algorithms from the paper:
# Generate watermarked text
def generate(key, model, decoder):
    text = []
    for i in range(len(key)):                 # one key element per token
        text.append(decoder(key[i], model(text)))
    return text
# Detect watermarked text
def detect(text, key, statistic):
    observed = statistic(text, key)
    count = 0
    for t in range(resamples):
        new_key = resample(key)               # fresh key under the null
        if statistic(text, new_key) <= observed:
            count += 1
    return (count + 1) / (resamples + 1)      # Monte Carlo p-value
# Test statistic with alignment (k = block length)
def statistic(text, key):
    best_cost = infinity
    for i in range(len(text) - k + 1):        # every length-k block of text
        for j in range(len(key)):             # every key offset (key is cyclic)
            cost = alignment_cost(text[i:i+k], key[j:j+k])
            best_cost = min(cost, best_cost)
    return best_cost
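The alignment_cost above is left abstract. A minimal Levenshtein-style sketch in the spirit of the paper's edit-distance variant (the per-position cost d and the insertion/deletion penalty gamma are our illustration, not the paper's code):
# Edit-distance-style alignment between a text block and a key block
def alignment_cost(text_block, key_block, d, gamma):
    m, n = len(text_block), len(key_block)
    # A[i][j] = cheapest alignment of text_block[:i] with key_block[:j]
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * gamma
    for j in range(1, n + 1):
        A[0][j] = j * gamma
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = min(
                A[i-1][j-1] + d(text_block[i-1], key_block[j-1]),  # align pair
                A[i-1][j] + gamma,                                 # skip a token
                A[i][j-1] + gamma,                                 # skip a key element
            )
    return A[m][n]
Here d would be the per-position ITS or EMS cost from the detailed pseudocode, and gamma trades off robustness to insertions and deletions against statistical power.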
The key steps are:
- generate uses a decoder to map key to model samples
- detect resamples keys and compares statistic to get a p-value
- statistic aligns text and key blocks using a robust cost
The overall framework plants a hidden signal in text via the key, and detects it using alignment techniques even if the text is perturbed.
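To make the whole pipeline concrete, here is a self-contained toy demo (our construction, not the paper's code): exponential minimum sampling with a fixed next-token distribution standing in for the model, the summed cost without block alignment, and a resampling p-value:
import numpy as np

rng = np.random.default_rng(0)
V, m, n = 5, 40, 64                        # vocab size, text length, key length

def toy_model(prefix):
    # stand-in for a next-token distribution (fixed here for simplicity)
    return np.array([0.4, 0.3, 0.15, 0.1, 0.05])

def ems_decoder(xi, p):
    return int(np.argmin(-np.log(xi) / p))  # exponential race

def generate(key, model, m):
    text = []
    for i in range(m):
        text.append(ems_decoder(key[i], model(text)))
    return text

def cost(text, key):
    # small when the text was generated from this key
    return -sum(np.log(key[i, tok]) for i, tok in enumerate(text))

key = rng.random((n, V))                   # watermark key: iid Uniform(0,1)
text = generate(key, toy_model, m)

observed = cost(text, key)
count = sum(cost(text, rng.random((n, V))) <= observed for _ in range(999))
print("p-value ~", (count + 1) / 1000)     # tiny for watermarked text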
Detailed Pseudo Code
Here is more detailed pseudocode to implement the key algorithms from the paper:
# Inverse transform sampling decoder
def its_decoder(key, distribution):
    u, pi = key                               # uniform sample and permutation
    cdf = get_cdf(distribution, pi)           # CDF of tokens in permuted order
    token = pi.inverse(min_index(cdf >= u))   # first position where cdf >= u
    return token
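A quick NumPy check (our snippet, using the convention that position j of the permuted order holds token pi[j]) that inverse transform sampling with a random permutation is distortion-free:
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.15, 0.1])

def its_decode(u, pi, p):
    cdf = np.cumsum(p[pi])                           # CDF in permuted order
    j = min(int(np.searchsorted(cdf, u)), len(p) - 1)  # first cdf[j] >= u
    return int(pi[j])                                # token at that position

draws = [its_decode(rng.random(), rng.permutation(len(p)), p)
         for _ in range(100_000)]
print(np.bincount(draws, minlength=len(p)) / len(draws))  # ~ p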
# Exponential minimum sampling decoder
def ems_decoder(key, distribution):
    # key holds one Uniform(0,1) sample per token; -log(key[i]) is
    # Exponential(1), so the race below picks token i w.p. distribution[i]
    token = argmin[i](-log(key[i]) / distribution[i])
    return token
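The same sanity check for exponential minimum sampling (our snippet): -log(U)/p_i is Exponential with rate p_i, and the argmin of the race lands on token i with probability exactly p_i:
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.5, 0.25, 0.15, 0.1])

xi = rng.random((100_000, len(p)))            # one uniform per token per draw
draws = np.argmin(-np.log(xi) / p, axis=1)    # exponential race
print(np.bincount(draws, minlength=len(p)) / len(draws))  # ~ p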
# Generate with ITS decoder
def generate_its(key, model):
    text = []
    for i in range(len(key)):
        u, pi = key[i]
        next_dist = model(text)
        text.append(its_decoder((u, pi), next_dist))
    return text

# Generate with EMS decoder
def generate_ems(key, model):
    text = []
    for i in range(len(key)):
        next_dist = model(text)
        text.append(ems_decoder(key[i], next_dist))
    return text
# ITS alignment cost
def its_cost(text, key):
    u, pi = key                    # sequences of uniforms and permutations
    # normalize maps a token's permuted index into [0, 1]
    return sum[i](|u[i] - normalize(pi[i](text[i]))|)
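A concrete version (our sketch; normalize and the rank-based key layout are our naming). Under ITS generation the selected token's permuted index sits close to u[i], so this cost is small for watermarked text, while for independent text each term averages about 1/3:
def normalize(rank, vocab_size):
    # map a rank in {0, ..., vocab_size - 1} into [0, 1]
    return rank / (vocab_size - 1)

def its_cost(text, us, ranks, vocab_size):
    # us[i]: uniform sample at position i; ranks[i][token]: the token's
    # index under position i's permutation
    return sum(abs(us[i] - normalize(ranks[i][tok], vocab_size))
               for i, tok in enumerate(text))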
# EMS alignment cost
def ems_cost(text, key):
    # key[i, v] is the uniform sample for token v at position i;
    # large samples at the observed tokens make this cost small
    return -sum[i](log(key[i, text[i]]))
# Detect watermarked text
def detect(text, key, cost_fn):
    # cost_fn is its_cost or ems_cost, matching the decoder used to generate

    def statistic(text, key):
        # best alignment of any length-k text block with any key offset
        stat = infinity
        for i in range(len(text) - k + 1):
            for j in range(len(key)):
                block_cost = cost_fn(text[i:i+k], key[j:j+k])
                stat = min(block_cost, stat)
        return stat

    # Estimate p-value against resampled keys
    observed = statistic(text, key)
    count = 0
    for t in range(resamples):
        new_key = resample(key)
        if statistic(text, new_key) <= observed:   # recompute the full statistic
            count += 1
    return (count + 1) / (resamples + 1)
The key implementations are the ITS and EMS decoders, generation with each decoder, the associated alignment costs, and p-value estimation. The detect method compares the best alignment under the true key against the best alignments under resampled keys.