LLM Evaluation Metrics#
This tutorial focuses on metrics commonly used in OLMES-style LLM evaluation, including:
Exact Match (EM)
F1 (token overlap / QA F1)
Recall (substring recall)
ROUGE-1 / ROUGE-2 / ROUGE-L
MC1 / MC2 (TruthfulQA-style multiple choice)
pass@k (code generation)
Ranking metrics (NDCG@k, MRR, Recall@k, MAP, win-rate)
1. Exact Match (EM)#
Definition#
Exact Match (EM) is a 0/1 metric that checks whether the predicted answer exactly equals a reference answer after normalization (typically lowercasing and whitespace cleanup).
For a single reference:
\[
EM(pred, ref) = \begin{cases} 1 & \text{if } \mathrm{normalize}(pred) = \mathrm{normalize}(ref) \\ 0 & \text{otherwise} \end{cases}
\]
For multiple valid references \(refs=\{ref_1,\dots,ref_m\}\), a common rule is to take the max over references:
\[
EM(pred, refs) = \max_{1\le j\le m} EM(pred, ref_j)
\]
Worked example#
References:
"New York City"
"NYC"
Prediction: "nyc"
After case normalization, "nyc" matches "NYC" → \(EM = 1\).
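A minimal sketch of EM in Python, assuming a SQuAD-style normalization (lowercasing, punctuation and article removal); the `normalize` and `exact_match` helpers are illustrative, not taken from any particular eval library:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list[str]) -> int:
    """1 if the normalized prediction equals any normalized reference, else 0."""
    return int(any(normalize(prediction) == normalize(ref) for ref in references))

print(exact_match("nyc", ["New York City", "NYC"]))  # -> 1
```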
2. QA Token-overlap F1 (SQuAD-style F1)#
Definition#
This metric measures partial correctness by token overlap between the predicted answer and the reference.
Tokenize (usually after normalization):
prediction tokens: \(P\)
reference tokens: \(G\)
Let \(|P|\) be the number of prediction tokens, \(|G|\) the number of gold tokens, and \(|P\cap G|\) the number of overlapping tokens (with multiplicity if using bags/multisets). Then:
\[
precision = \frac{|P\cap G|}{|P|},\qquad recall = \frac{|P\cap G|}{|G|},\qquad F1 = \frac{2\cdot precision\cdot recall}{precision+recall}
\]
As with EM, for multiple references, many QA benchmarks take the maximum over references:
\[
F1(pred, refs) = \max_{1\le j\le m} F1(pred, ref_j)
\]
Worked example#
Reference: "the cat sat" → tokens = [the, cat, sat]
Prediction: "cat sat on" → tokens = [cat, sat, on]
Overlap tokens = [cat, sat] → \(|P\cap G|=2\), so \(precision = 2/3\), \(recall = 2/3\), and \(F1 = 2/3 \approx 0.667\).
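A sketch of token-overlap F1 using bag (multiset) intersection; for brevity the normalization here is just lowercasing and whitespace splitting:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style F1: harmonic mean of token precision and recall over bag overlap."""
    pred_tokens = prediction.lower().split()
    gold_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("cat sat on", "the cat sat"), 3))  # -> 0.667
```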
3. Substring Recall (“did you mention the gold answer?”)#
Definition#
This is a very simple recall-style metric used in some QA evaluations:
return 1 if any gold reference string appears as a substring in the prediction
otherwise return 0
With case-insensitive matching:
\[
Recall = \mathbb{1}\big[\exists j:\ \mathrm{lower}(ref_j)\ \text{is a substring of}\ \mathrm{lower}(pred)\big]
\]
Worked examples#
Example A (hit):
ref = "Barack Obama"
pred = "The answer is Barack Obama, the former president."
Gold is contained → Recall = 1.
Example B (miss):
ref = "Barack Obama"
pred = "The answer is Obama."
Substring "barack obama" is not present → Recall = 0.
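A sketch of this containment check, assuming case-insensitive matching:

```python
def substring_recall(prediction: str, references: list[str]) -> int:
    """1 if any gold reference string appears (case-insensitively) inside the prediction."""
    pred = prediction.lower()
    return int(any(ref.lower() in pred for ref in references))

print(substring_recall("The answer is Barack Obama, the former president.", ["Barack Obama"]))  # -> 1
print(substring_recall("The answer is Obama.", ["Barack Obama"]))  # -> 0
```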
4. ROUGE (ROUGE-1 / ROUGE-2 / ROUGE-L)#
ROUGE is widely used for summarization-style outputs.
ROUGE-1: unigram overlap
ROUGE-2: bigram overlap
ROUGE-L: based on Longest Common Subsequence (LCS)
ROUGE is commonly reported as Precision / Recall / F1.
4.1 ROUGE-1 (unigram overlap)#
Let:
\(U_{pred}\): multiset of unigrams in prediction
\(U_{ref}\): multiset of unigrams in reference
Overlap count is the clipped overlap (like BLEU-style clipping):
\[
ROUGE\text{-}1_{recall} = \frac{|U_{pred}\cap U_{ref}|}{|U_{ref}|},\qquad ROUGE\text{-}1_{precision} = \frac{|U_{pred}\cap U_{ref}|}{|U_{pred}|},\qquad F1 = \frac{2PR}{P+R}
\]
Worked example (ROUGE-1 recall)#
Reference: "the cat sat on the mat" (6 tokens)
Prediction: "the cat sat" (3 tokens)
Overlap unigrams = 3 (the, cat, sat), so ROUGE-1 recall \(= 3/6 = 0.5\) (precision \(= 3/3 = 1.0\), F1 \(\approx 0.667\)).
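A sketch of ROUGE-n from clipped n-gram counts (real evaluations typically rely on a library such as `rouge-score`, so treat this as illustrative); the same `rouge_n` helper with `n=2` reproduces the ROUGE-2 example in the next subsection:

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction: str, reference: str, n: int = 1) -> dict:
    """ROUGE-n precision, recall, and F1 from clipped n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped counts
    p = overlap / max(sum(pred.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": round(f1, 3)}

print(rouge_n("the cat sat", "the cat sat on the mat", n=1))
# {'precision': 1.0, 'recall': 0.5, 'f1': 0.667}
```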
4.2 ROUGE-2 (bigram overlap)#
ROUGE-2 uses overlapping bigrams instead of unigrams.
Worked example#
Reference: "the cat sat on the mat"
Reference bigrams:
the cat
cat sat
sat on
on the
the mat
Prediction: "the cat sat"
Prediction bigrams:
the cat
cat sat
Overlap bigrams = 2, so ROUGE-2 recall \(= 2/5 = 0.4\), precision \(= 2/2 = 1.0\), and F1 \(\approx 0.571\).
4.3 ROUGE-L (LCS-based)#
ROUGE-L uses the Longest Common Subsequence (LCS) length between prediction and reference.
Let \(L\) be the LCS length (in tokens). Then:
\[
R_{LCS} = \frac{L}{|ref|},\qquad P_{LCS} = \frac{L}{|pred|},\qquad F_{LCS} = \frac{2\,P_{LCS}\,R_{LCS}}{P_{LCS}+R_{LCS}}
\]
Worked example#
Reference tokens: [the, cat, sat, on, the, mat] (length 6)
Prediction tokens: [the, cat, sat] (length 3)
The LCS is [the, cat, sat], so \(L=3\), giving \(R_{LCS} = 3/6 = 0.5\), \(P_{LCS} = 3/3 = 1.0\), and \(F_{LCS} \approx 0.667\).
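A sketch of ROUGE-L from a token-level LCS, without the edge-case handling (empty strings, multiple references) a real implementation would need:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(prediction: str, reference: str) -> dict:
    """ROUGE-L precision, recall, and F1 from the token-level LCS."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    p, r = lcs / len(pred), lcs / len(ref)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": round(f1, 3)}

print(rouge_l("the cat sat", "the cat sat on the mat"))
# {'precision': 1.0, 'recall': 0.5, 'f1': 0.667}
```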
5. MC1 and MC2 (TruthfulQA-style multiple choice)#
These metrics are designed for multiple-choice settings where:
you have multiple answer options
some options are true/correct (possibly more than one)
the model assigns a score to each option (usually log-likelihood)
Let options be \(1..n\).
truth label: \(label_i\in\{0,1\}\)
model score: \(s_i\) (higher means more preferred)
set of true answers: \(T=\{i:label_i=1\}\)
5.1 MC1 (top-1 correctness)#
MC1 checks whether the single best-scoring option is true:
\[
MC1 = \mathbb{1}\big[\arg\max_i s_i \in T\big]
\]
Worked example#
| option | true? | score \(s_i\) |
|---|---|---|
| A | 0 | -2.0 |
| B | 1 | -0.4 |
| C | 0 | -1.1 |
| D | 1 | -0.7 |
The best score is option B (−0.4), and B is true → \(MC1 = 1\).
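A sketch of MC1 as an argmax check over per-option scores (e.g., log-likelihoods); the list-based input format is an assumption for illustration:

```python
def mc1(scores: list[float], labels: list[int]) -> int:
    """1 if the highest-scoring option is labeled true, else 0."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

print(mc1([-2.0, -0.4, -1.1, -0.7], [0, 1, 0, 1]))  # -> 1 (option B wins and is true)
```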
5.2 MC2 (probability mass on true options)#
MC2 measures how much probability mass the model assigns to all true answers.
Step 1: convert scores to a probability distribution (softmax):
\[
p_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}
\]
Step 2: sum the probabilities over the true options:
\[
MC2 = \sum_{i\in T} p_i
\]
Worked example (same table)#
Compute unnormalized weights:
A: \(e^{-2.0}=0.135\)
B: \(e^{-0.4}=0.670\) (true)
C: \(e^{-1.1}=0.333\)
D: \(e^{-0.7}=0.497\) (true)
Total weight: \(0.135 + 0.670 + 0.333 + 0.497 = 1.635\)
True mass: \(0.670 + 0.497 = 1.167\)
So:
\[
MC2 = \frac{1.167}{1.635} \approx 0.714
\]
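A sketch of MC2 as softmax mass on the true options, reproducing the numbers above:

```python
import math

def mc2(scores: list[float], labels: list[int]) -> float:
    """Softmax probability mass assigned to the options labeled true."""
    weights = [math.exp(s) for s in scores]
    total = sum(weights)
    return sum(w for w, y in zip(weights, labels) if y == 1) / total

print(round(mc2([-2.0, -0.4, -1.1, -0.7], [0, 1, 0, 1]), 3))  # -> 0.714
```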
Interpretation#
MC1 is strict: only top-1 matters (0/1).
MC2 is softer: gives partial credit if the model assigns high probability to true options, even if it is uncertain.
6. pass@k (code generation)#
pass@k is used when an LLM generates multiple candidate solutions for the same coding problem.
Each candidate solution is evaluated by unit tests:
pass ✅
fail ❌
Let:
\(n\): number of generated solutions
\(c\): number of solutions that pass
\(k\): number of attempts allowed (“try up to k samples”)
Definition#
pass@k is the probability that at least one of the \(k\) attempts passes.
If you sample \(k\) of the \(n\) solutions uniformly at random without replacement, the probability that all \(k\) fail is:
\[
P(\text{all } k \text{ fail}) = \frac{\binom{n-c}{k}}{\binom{n}{k}}
\]
Therefore, the probability that at least one passes is:
\[
pass@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}
\]
Worked examples#
Suppose \(n=10\) total solutions and \(c=2\) pass.
pass@1: \(1 - \binom{8}{1}/\binom{10}{1} = 1 - 0.8 = 0.2\)
pass@5: \(1 - \binom{8}{5}/\binom{10}{5} = 1 - 56/252 \approx 0.778\)
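A sketch of this combinatorial pass@k estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled solutions (out of n, with c passing) passes."""
    if n - c < k:  # fewer than k failing samples: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 2, 1))            # -> 0.2
print(round(pass_at_k(10, 2, 5), 3))  # -> 0.778
```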
Interpretation#
pass@1 = one-shot success rate
pass@k grows with k because you get more chances
a model can have low pass@1 but high pass@10 if it can sometimes produce a correct solution
7. Ranking / search metrics#
Ranking metrics are used when the model outputs a ranked list of items (documents, answers, candidates).
7.1 NDCG@k (Normalized Discounted Cumulative Gain)#
Given a ranked list with relevance labels \(rel_i\) (e.g., graded relevance 0–3, where higher means more relevant), DCG@k is:
\[
DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}
\]
NDCG@k normalizes by the ideal ranking:
\[
NDCG@k = \frac{DCG@k}{IDCG@k}
\]
IDCG@k is computed by sorting the same relevance labels in descending order (best possible ranking) and applying the DCG formula to the top \(k\).
Worked example (DCG)
Suppose \(k=3\) and the relevance labels for the top 3 results are:
\(rel_1 = 3\)
\(rel_2 = 2\)
\(rel_3 = 0\)
Then:
\[
DCG@3 = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} = 3 + 1.262 + 0 \approx 4.262
\]
If the ideal ordering is already sorted by relevance, then \(IDCG@3 = DCG@3\) and \(NDCG@3 = 1.0\).
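A sketch of DCG@k/NDCG@k with the linear-gain form used above (some implementations use \(2^{rel_i}-1\) as the gain instead):

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """DCG@k with linear gain: rel_i / log2(i + 1), ranks starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG@k divided by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(round(dcg_at_k([3, 2, 0], 3), 3))  # -> 4.262
print(ndcg_at_k([3, 2, 0], 3))           # -> 1.0 (already ideally ordered)
```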
7.2 MRR (Mean Reciprocal Rank)#
Reciprocal rank for one query is \(1/rank\) of the first relevant item. MRR averages this across queries:
\[
MRR = \frac{1}{|Q|}\sum_{q=1}^{|Q|}\frac{1}{rank_q}
\]
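A sketch of MRR, assuming you already know the rank of the first relevant item for each query (with 0 used here to mean no relevant item was retrieved):

```python
def mrr(first_relevant_ranks: list[int]) -> float:
    """Mean of 1/rank of the first relevant item; rank 0 contributes nothing."""
    return sum(1.0 / r for r in first_relevant_ranks if r > 0) / len(first_relevant_ranks)

print(round(mrr([1, 3, 2]), 3))  # (1 + 1/3 + 1/2) / 3 -> 0.611
```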
7.3 Recall@k#
Fraction of all relevant items that are retrieved in the top \(k\):
\[
Recall@k = \frac{|\{\text{relevant items in top } k\}|}{|\{\text{all relevant items}\}|}
\]
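A sketch of Recall@k over item IDs; the document IDs below are made up for illustration:

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k of the ranking."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

print(round(recall_at_k(["d1", "d7", "d3", "d9"], {"d1", "d3", "d5"}, k=3), 3))  # -> 0.667
```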
7.4 MAP (Mean Average Precision)#
Average precision (AP) for one query is the mean of the precision values at the ranks where a relevant item appears:
\[
AP = \frac{1}{R}\sum_{i=1}^{N} P@i \cdot rel_i
\]
where \(R\) is the number of relevant items, \(N\) is the list length, and \(rel_i\in\{0,1\}\).
MAP averages AP across queries.
Worked example (single query)
Suppose the ranked list relevance is:
[1, 0, 1, 0, 1]
Precision at relevant ranks:
\(P@1 = 1/1 = 1.0\)
\(P@3 = 2/3 \approx 0.667\)
\(P@5 = 3/5 = 0.6\)
Total relevant items \(R = 3\), so:
\[
AP = \frac{1.0 + 0.667 + 0.6}{3} \approx 0.756
\]
Multiple queries (MAP)
If you have \(Q\) queries, compute \(AP_q\) for each one and average:
\[
MAP = \frac{1}{Q}\sum_{q=1}^{Q} AP_q
\]
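A sketch of AP and MAP over binary relevance lists, assuming every relevant item appears somewhere in the ranked list (so the number of hits equals \(R\)):

```python
def average_precision(relevances: list[int]) -> float:
    """AP for one query: mean of P@i at every rank i where a relevant item appears."""
    hits, precisions = 0, []
    for i, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_query_relevances: list[list[int]]) -> float:
    """MAP: average of per-query AP values."""
    return sum(average_precision(q) for q in per_query_relevances) / len(per_query_relevances)

print(round(average_precision([1, 0, 1, 0, 1]), 3))  # -> 0.756
```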
7.5 Win-rate (pairwise preference)#
For pairwise human judgments, win-rate is:
\[
\text{win-rate} = \frac{\#\text{wins}}{\#\text{comparisons}}
\]
with ties, if allowed, often counted as half a win.
This is common for head-to-head model comparisons or preference-based evals.
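A minimal sketch of win-rate with ties counted as half a win (one common convention):

```python
def win_rate(outcomes: list[str]) -> float:
    """Share of wins among pairwise comparisons; ties count as half a win."""
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return score / len(outcomes)

print(win_rate(["win", "loss", "tie", "win"]))  # (1 + 0 + 0.5 + 1) / 4 -> 0.625
```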
8. Summary cheat sheet#
| Metric | Typical use case | Output range | Better | What it measures |
|---|---|---|---|---|
| EM | short-answer QA | 0/1 | ↑ | exact correctness |
| F1 (QA) | QA partial credit | 0..1 | ↑ | token overlap quality |
| substring Recall | “did it mention the gold?” | 0/1 | ↑ | containment of gold string |
| ROUGE-1/2/L | summarization | 0..1 | ↑ | overlap / sequence similarity |
| MC1 | multi-choice | 0/1 | ↑ | top-1 chooses a true answer |
| MC2 | multi-choice | 0..1 | ↑ | probability mass on true answers |
| pass@k | code generation | 0..1 | ↑ | chance of ≥1 passing among k attempts |
| NDCG@k | ranking / search | 0..1 | ↑ | discounted gain vs ideal |
| MRR | ranking / search | 0..1 | ↑ | rank of first relevant item |
| Recall@k | ranking / search | 0..1 | ↑ | fraction of relevant in top k |
| MAP | ranking / search | 0..1 | ↑ | average precision across queries |
| win-rate | pairwise preference | 0..1 | ↑ | % wins in head-to-head comparisons |