OLMo3 Adapt Suite (olmo3:adapt) – Capability-Oriented Overview#

This document summarizes the olmo3:adapt task suite, organized by capability area. The suite is used to evaluate both the OLMo-3 Think and OLMo-3 Instruct models; the focus here is on what each task evaluates and which datasets it draws on.


Knowledge#

Tasks

  • MMLU (CoT): broad academic knowledge with short reasoning and a final answer letter.

  • PopQA: factual recall with short answers.

  • SimpleQA: short-form factual QA.

Datasets (HF links)

Primary metrics (typical)

  • Exact Match (EM) or EM with light normalization (for MMLU-style answer extraction).

  • SimpleQA reports F1 in addition to EM.
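
For intuition, here is a minimal sketch of these metrics in Python: SQuAD-style light normalization, EM against a set of gold answers, and token-overlap F1. The helper names and exact normalization steps are illustrative assumptions, not the suite's actual scoring code.

import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> float:
    # EM = 1 if the normalized prediction equals any normalized gold answer.
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in gold_answers))

def token_f1(prediction: str, gold: str) -> float:
    # Token-overlap F1 between normalized prediction and gold answer.
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    common = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in set(pred_toks))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # 1.0
print(round(token_f1("Paris, France", "Paris"), 2))        # 0.67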


Reasoning#

Tasks

  • BBH (CoT): multi-step reasoning across diverse logic tasks.

  • GPQA: graduate-level science reasoning (multiple choice, scored on an extracted answer letter).

  • ZebraLogic: logic-grid reasoning with structured outputs.

  • AGI-Eval (English, CoT): exam-style reasoning across multiple subjects.

Datasets (HF links)

Primary metrics (typical)

  • Exact Match with task-specific answer extraction.

  • ZebraLogic uses a specialized structured-equality metric.
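
As a rough illustration of answer extraction, the sketch below pulls the final answer letter written as "(X)" from a chain-of-thought response and scores it with EM; a ZebraLogic-style structured check is reduced here to comparing parsed solution grids. The regexes and function names are assumptions, not the suite's exact extraction rules.

import re

def extract_choice(response: str, choices: str = "ABCDEFGHIJ") -> str | None:
    # Pull answer letters written as "(X)" and prefer the last one stated.
    matches = re.findall(r"\(([A-J])\)", response)
    for letter in reversed(matches):
        if letter in choices:
            return letter
    return None

def em_with_extraction(response: str, gold_letter: str) -> float:
    return float(extract_choice(response) == gold_letter)

def structured_equal(pred_grid: dict, gold_grid: dict) -> float:
    # ZebraLogic-style check: the parsed solution must match cell-for-cell.
    return float(pred_grid == gold_grid)

print(em_with_extraction("Let's think step by step... The correct answer is: (C)", "C"))  # 1.0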


Math#

Tasks

  • Minerva Math: competition-style math problems.

  • GSM8K (adapt): grade-school math word problems with boxed numeric answers.

  • OMEGA: math generalization across exploratory/compositional/transformative splits.

  • AIME 2024/2025: contest math with integer answers.

Datasets (HF links)

Primary metrics (typical)

  • Exact Match or Exact Match (flexible) after answer extraction.
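
A plausible sketch of boxed-answer extraction with a flexible comparison follows; treating the last \boxed{...} as the answer and falling back to numeric equality are assumptions about what "flexible" means here, not the exact matcher.

import re

def extract_boxed(response: str) -> str | None:
    # Take the contents of the last \boxed{...} (simple, non-nested braces only).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def flexible_match(pred: str | None, gold: str) -> float:
    # Strict string match first; fall back to a numeric comparison so that
    # "42", "42.0", and "  42 " count as the same answer.
    if pred is None:
        return 0.0
    if pred == gold.strip():
        return 1.0
    try:
        return float(float(pred.replace(",", "")) == float(gold.replace(",", "")))
    except ValueError:
        return 0.0

resp = "... Therefore, the final answer is \\boxed{42}."
print(flexible_match(extract_boxed(resp), "42"))  # 1.0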


Coding#

Tasks

  • HumanEval+: function completion with unit tests.

  • MBPP+: short programming problems with tests.

  • LiveCodeBench (code generation): execution-based evaluation on recent coding problems.

Datasets (HF links)

Primary metrics (typical)

  • pass@1 (execution-based success on unit tests).
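
For reference, the standard unbiased pass@k estimator (from the HumanEval paper) is shown below; with one sample per problem it reduces to the plain fraction of problems whose generated solution passes all unit tests. Sampling settings (number of samples, temperature) are not specified by this overview.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = samples drawn per problem, c = samples passing all unit tests.
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 is simply the fraction of problems solved:
outcomes = [True, False, True, True]        # per-problem unit-test results
print(sum(outcomes) / len(outcomes))         # 0.75
print(round(pass_at_k(n=10, c=3, k=1), 2))   # 0.3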


Chat / Instruction Following#

Tasks

  • AlpacaEval v3: pairwise preference evaluation (LLM-as-judge).

  • IFEval: instruction-following compliance on verifiable constraints.

Datasets (HF links)

Primary metrics (typical)

  • AlpacaEval: win rate / length-controlled win rate.

  • IFEval: instruction-level and prompt-level accuracy (strict and loose).
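
A toy sketch of IFEval-style scoring on a single verifiable constraint: strict scoring checks the raw response, while loose scoring retries after simple cleanups. The constraint, the cleanups, and the function names are simplified assumptions, in the spirit of the published IFEval methodology rather than its exact implementation.

import re

def check_max_words(response: str, max_words: int) -> bool:
    # Example verifiable constraint: "answer in at most N words".
    return len(response.split()) <= max_words

def loose_variants(response: str) -> list[str]:
    # Loose scoring retries after simple cleanups: stripping markdown emphasis
    # and dropping a leading or trailing boilerplate line.
    lines = response.splitlines()
    return [response,
            re.sub(r"[*_`#]", "", response),
            "\n".join(lines[1:]),
            "\n".join(lines[:-1])]

def score(response: str, max_words: int) -> dict[str, bool]:
    strict = check_max_words(response, max_words)
    loose = any(check_max_words(v, max_words) for v in loose_variants(response))
    return {"strict": strict, "loose": loose}

print(score("Sure! Here is my answer:\nCats purr loudly.", max_words=5))
# {'strict': False, 'loose': True}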


Note

The suite aggregates results across these tasks: when macro-averaged, each task contributes equally to its capability-level summary and to the overall summary.
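
Concretely, macro-averaging might look like the following, with hypothetical per-task scores and only two capabilities shown:

# Hypothetical per-task scores (0-100), grouped by capability.
scores = {
    "Knowledge": {"mmlu_cot": 71.0, "popqa": 34.0, "simpleqa": 28.0},
    "Math":      {"gsm8k": 88.0, "minerva_math": 52.0, "aime_2024": 21.0},
}

# Each task counts equally within its capability...
capability_avgs = {cap: sum(t.values()) / len(t) for cap, t in scores.items()}

# ...and equally in the overall macro-average across all tasks.
all_tasks = [s for tasks in scores.values() for s in tasks.values()]
overall = sum(all_tasks) / len(all_tasks)
print(capability_avgs, round(overall, 1))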


Prompt structure (by task)#

Below are representative prompt shapes for each task in olmo3:adapt. These are illustrative rather than exact strings.

Knowledge#

MMLU (CoT)

[Subject intro]
Question: ...
A. ...
B. ...
C. ...
D. ...
(Reasoning...)
Therefore, the answer is (X)
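
As a concrete (purely illustrative) rendering of the shape above, a formatter might look like this; the wording and field names are assumptions, not the suite's template:

def format_mmlu_cot(subject: str, question: str, choices: list[str]) -> str:
    # Illustrative formatter for the shape above; not the suite's exact template.
    letters = "ABCD"
    lines = [f"The following is a multiple choice question about {subject}.",
             f"Question: {question}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append('Think step by step, then finish with '
                 '"Therefore, the answer is (X)".')
    return "\n".join(lines)

print(format_mmlu_cot("astronomy", "Which planet is largest?",
                      ["Mars", "Jupiter", "Venus", "Mercury"]))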

PopQA

Q: {question} A:

SimpleQA

Question: {question}
Answer:

Reasoning#

BBH (CoT)

[Task description]
Q: ...
A: (reasoning...) final answer

GPQA

Answer the following multiple choice question...
Let's think step by step:
...
The correct answer is: (X)

ZebraLogic

[Logic puzzle statement]
[Required output format]

AGI-Eval (English, CoT)

[Exam-style prompt]
Question: ...
Answer: (reasoning...) final answer

Math#

Minerva Math

Problem: ...
Solution: (reasoning...)
\boxed{answer}

GSM8K (adapt)

Solve the following grade school math word problem:
{question}

Show your work and conclude with "Therefore, the final answer is \boxed{answer}."

OMEGA

[Problem statement]
Present the answer in LaTeX: \boxed{...}

AIME 2024/2025

{problem}

Present the answer in LaTeX: \boxed{Your answer}

Coding#

HumanEval+

def func(arg1, arg2):
    """Problem description and example calls."""
    # Complete the function body

MBPP+

# Problem statement
# Function signature
# Examples/tests

LiveCodeBench (code generation)

Problem: ...
Input/Output specs...
Constraints...
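
All three coding tasks are execution-based: a generated completion is combined with the problem's stub and run against unit tests. The sketch below is a deliberately minimal illustration of that check (no sandboxing, hypothetical helper name); a real harness isolates execution much more carefully.

import subprocess, sys, tempfile

def passes_tests(prompt: str, completion: str, tests: str, timeout: float = 10.0) -> bool:
    # Concatenate stub + completion + tests into one file and run it.
    program = prompt + completion + "\n\n" + tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

prompt = "def add(a, b):\n"
completion = "    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(prompt, completion, tests))  # True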

Chat / Instruction Following#

AlpacaEval v3

Instruction: ...
(Assistant response)

IFEval

[Instruction(s) with verifiable constraints]