# Math (olmo3:base:math) – Tasks, Datasets, Metrics

This document summarizes the math task bundle, the Hugging Face datasets it uses, and the evaluation metrics. It is self-contained and focuses on the evaluation ideas rather than the exact harness implementation; any code sketches below are illustrative only.
## 1) Tasks covered in olmo3:base:math
This suite combines math word problems and competition-style math.
- GSM8K (sampling, 8-shot)
- GSM-Symbolic (sampling, 8-shot)
  - main
  - p1
  - p2
- Minerva Math (sampling, 4-shot)
  - algebra
  - counting_and_probability
  - geometry
  - intermediate_algebra
  - number_theory
  - prealgebra
  - precalculus
## 2) Datasets used (Hugging Face `dataset_path`)
| Task group | Dataset (HF `dataset_path`) |
|---|---|
| GSM8K | |
| GSM-Symbolic | |
| Minerva Math | |
## 3) Metrics used and how they are calculated
These tasks typically sample multiple generations and then score them with pass@k or majority-vote metrics.
### Answer extraction
The model generates a solution. An answer string (usually a number) is extracted from the completion using task-specific normalization rules.
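As a rough sketch only (the function name, regex, and normalization steps are illustrative assumptions, not the suite's actual rules), extraction for GSM8K-style outputs might look like this:

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Illustrative extraction: pull the final numeric answer from a
    generated solution and lightly normalize it."""
    # Prefer an explicit "the answer is X" / "#### X" pattern if present.
    m = re.search(r"(?:answer is|####)\s*\$?(-?[\d,]+(?:\.\d+)?)",
                  completion, re.IGNORECASE)
    if m:
        candidate = m.group(1)
    else:
        # Fall back to the last number appearing in the completion.
        numbers = re.findall(r"-?[\d,]+(?:\.\d+)?", completion)
        if not numbers:
            return None
        candidate = numbers[-1]
    # Normalize: drop thousands separators and trailing periods.
    return candidate.replace(",", "").rstrip(".")
```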
### Exact Match (EM)
Exact Match: a prediction is correct if the extracted answer matches a gold answer after normalization.
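Written out, with $\mathrm{normalize}(\cdot)$ standing in for the task-specific rules above, $\hat{a}_i$ the extracted answer, and $a_i$ the gold answer:

$$
\mathrm{EM} \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\,\mathrm{normalize}(\hat{a}_i) = \mathrm{normalize}(a_i)\,\right]
$$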
### Pass@k
pass@k: the probability that at least one of the k sampled generations is correct.
In this suite, pass@1 is the primary score for GSM8K and GSM-Symbolic configurations.
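When $n \geq k$ samples are drawn per question and $c$ of them are correct, a commonly used unbiased estimator (popularized by the Codex paper; the suite's exact computation may differ) is:

$$
\text{pass@}k \;=\; \mathbb{E}_{\text{questions}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
$$

For $k = 1$ this reduces to the fraction of correct samples, $c/n$, averaged over questions.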
### Majority@k (Maj@k)
maj@k: majority vote accuracy over k sampled generations.
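A minimal sketch of the per-question majority vote, assuming the extracted answers from the k samples are already available (function name is illustrative):

```python
from collections import Counter

def majority_at_k(extracted_answers: list[str | None], gold: str) -> bool:
    """Illustrative maj@k: the question counts as correct if the most
    common extracted answer across the k samples matches the gold answer."""
    votes = Counter(a for a in extracted_answers if a is not None)
    if not votes:
        return False
    majority_answer, _ = votes.most_common(1)[0]
    return majority_answer == gold
```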
### Aggregation to suite score
- Per task: metrics are averaged across questions.
- Suite level: metrics are macro-averaged across tasks, as sketched below.
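A minimal sketch of this two-level averaging; the task names and per-question scores are hypothetical:

```python
per_question_scores = {
    "gsm8k": [1, 0, 1, 1],           # per-question pass@1 (hypothetical)
    "gsm_symbolic_main": [1, 1, 0],
    "minerva_math_algebra": [0, 1],
}

# Per task: mean over questions.
per_task = {task: sum(s) / len(s) for task, s in per_question_scores.items()}

# Suite level: macro-average (unweighted mean over tasks).
suite_score = sum(per_task.values()) / len(per_task)
```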
## 4) Prompt structure (math tasks)
These tasks use few-shot chain-of-thought (CoT) prompts by default.
Typical pattern:

```
Question: ...
Answer: (reasoning steps...) ... final answer ...
```
GSM8K / GSM-Symbolic use 8-shot demonstrations with step-by-step reasoning, then a final answer.
Minerva Math uses a math-problem format (problem statement + solution), typically ending with a boxed final answer.
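As a rough illustration of how such a few-shot prompt might be assembled (the function and the exact field labels are assumptions, not the suite's actual template):

```python
def build_fewshot_prompt(demos: list[tuple[str, str]], question: str) -> str:
    """Illustrative few-shot CoT prompt: each demo pairs a question with a
    worked solution that ends in a final answer."""
    parts = []
    for demo_question, demo_solution in demos:
        parts.append(f"Question: {demo_question}\nAnswer: {demo_solution}")
    # The test question is appended last, with the answer left for the model.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```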
**Note:** Because generations are sampled (multiple completions per question), pass@k and maj@k are meaningful summaries of the model's ability to reach a correct solution across multiple attempts.