OLMo3 Adapt Suite (olmo3:adapt) – Capability-Oriented Overview#

This document summarizes the olmo3:adapt task suite, organized by capability area. The suite is used to evaluate both the OLMo-3 Think and OLMo-3 Instruct models; the focus here is on what each task evaluates and which datasets it draws on.


Knowledge#

Tasks

  • MMLU (CoT): broad academic knowledge with short reasoning and a final answer letter.

  • PopQA: factual recall with short answers.

  • SimpleQA: short-form factual QA.

Datasets (HF links)

Primary metrics (typical)

  • Exact Match (EM) or EM with light normalization (for MMLU-style answer extraction).

  • SimpleQA reports F1 in addition to EM.
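
For intuition, here is a minimal sketch of these metrics in Python: SQuAD-style light normalization, EM against a set of gold answers, and token-overlap F1. The helper names and exact normalization steps are illustrative assumptions, not the suite's actual scoring code.

import re
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> float:
    # EM = 1 if the normalized prediction equals any normalized gold answer.
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in gold_answers))

def token_f1(prediction: str, gold: str) -> float:
    # Token-overlap F1 between normalized prediction and gold answer.
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    common = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in set(pred_toks))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # 1.0
print(round(token_f1("Paris, France", "Paris"), 2))        # 0.67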


Reasoning#

Tasks

  • BBH (CoT): multi-step reasoning across diverse logic tasks.

  • GPQA: graduate-level science reasoning (multiple choice, scored on an extracted answer letter).

  • ZebraLogic: logic-grid reasoning with structured outputs.

  • AGI-Eval (English, CoT): exam-style reasoning across multiple subjects.

Datasets (HF links)

Primary metrics (typical)

  • Exact Match with task-specific answer extraction.

  • ZebraLogic uses a specialized structured-equality metric.
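
As a rough illustration of answer extraction, the sketch below pulls the final answer letter written as "(X)" from a chain-of-thought response and scores it with EM; a ZebraLogic-style structured check is reduced here to comparing parsed solution grids. The regexes and function names are assumptions, not the suite's exact extraction rules.

import re

def extract_choice(response: str, choices: str = "ABCDEFGHIJ") -> str | None:
    # Pull answer letters written as "(X)" and prefer the last one stated.
    matches = re.findall(r"\(([A-J])\)", response)
    for letter in reversed(matches):
        if letter in choices:
            return letter
    return None

def em_with_extraction(response: str, gold_letter: str) -> float:
    return float(extract_choice(response) == gold_letter)

def structured_equal(pred_grid: dict, gold_grid: dict) -> float:
    # ZebraLogic-style check: the parsed solution must match cell-for-cell.
    return float(pred_grid == gold_grid)

print(em_with_extraction("Let's think step by step... The correct answer is: (C)", "C"))  # 1.0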


Math#

Tasks

  • Minerva Math: competition-style math problems.

  • GSM8K (adapt): grade-school math word problems with boxed numeric answers.

  • OMEGA: math generalization across exploratory/compositional/transformative splits.

  • AIME 2024/2025: contest math with integer answers.

Datasets (HF links)

Primary metrics (typical)

  • Exact Match or Exact Match (flexible) after answer extraction.
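
A plausible sketch of boxed-answer extraction with a flexible comparison follows; treating the last \boxed{...} as the answer and falling back to numeric equality are assumptions about what "flexible" means here, not the exact matcher.

import re

def extract_boxed(response: str) -> str | None:
    # Take the contents of the last \boxed{...} (simple, non-nested braces only).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def flexible_match(pred: str | None, gold: str) -> float:
    # Strict string match first; fall back to a numeric comparison so that
    # "42", "42.0", and "  42 " count as the same answer.
    if pred is None:
        return 0.0
    if pred == gold.strip():
        return 1.0
    try:
        return float(float(pred.replace(",", "")) == float(gold.replace(",", "")))
    except ValueError:
        return 0.0

resp = "... Therefore, the final answer is \\boxed{42}."
print(flexible_match(extract_boxed(resp), "42"))  # 1.0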


Coding#

Tasks

  • HumanEval+: function completion with unit tests.

  • MBPP+: short programming problems with tests.

  • LiveCodeBench (code generation): execution-based evaluation on recent coding problems.

Datasets (HF links)

Primary metrics (typical)

  • pass@1 (execution-based success on unit tests).
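
For reference, the standard unbiased pass@k estimator (from the HumanEval paper) is shown below; with one sample per problem it reduces to the plain fraction of problems whose generated solution passes all unit tests. Sampling settings (number of samples, temperature) are not specified by this overview.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = samples drawn per problem, c = samples passing all unit tests.
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 is simply the fraction of problems solved:
outcomes = [True, False, True, True]        # per-problem unit-test results
print(sum(outcomes) / len(outcomes))         # 0.75
print(round(pass_at_k(n=10, c=3, k=1), 2))   # 0.3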


Chat / Instruction Following#

Tasks

  • AlpacaEval v3: pairwise preference evaluation (LLM-as-judge).

  • IFEval: instruction-following compliance on verifiable constraints.

Datasets (HF links)

Primary metrics (typical)

  • AlpacaEval: win rate / length-controlled win rate.

  • IFEval: instruction-level and prompt-level accuracy (strict and loose).
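
A toy sketch of IFEval-style scoring on a single verifiable constraint: strict scoring checks the raw response, while loose scoring retries after simple cleanups. The constraint, the cleanups, and the function names are simplified assumptions, in the spirit of the published IFEval methodology rather than its exact implementation.

import re

def check_max_words(response: str, max_words: int) -> bool:
    # Example verifiable constraint: "answer in at most N words".
    return len(response.split()) <= max_words

def loose_variants(response: str) -> list[str]:
    # Loose scoring retries after simple cleanups: stripping markdown emphasis
    # and dropping a leading or trailing boilerplate line.
    lines = response.splitlines()
    return [response,
            re.sub(r"[*_`#]", "", response),
            "\n".join(lines[1:]),
            "\n".join(lines[:-1])]

def score(response: str, max_words: int) -> dict[str, bool]:
    strict = check_max_words(response, max_words)
    loose = any(check_max_words(v, max_words) for v in loose_variants(response))
    return {"strict": strict, "loose": loose}

print(score("Sure! Here is my answer:\nCats purr loudly.", max_words=5))
# {'strict': False, 'loose': True}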


Note

The suite aggregates results across these tasks: when macro-averaged, each task contributes equally to its capability-level summary and to the overall summary.
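
Concretely, macro-averaging might look like the following, with hypothetical per-task scores and only two capabilities shown:

# Hypothetical per-task scores (0-100), grouped by capability.
scores = {
    "Knowledge": {"mmlu_cot": 71.0, "popqa": 34.0, "simpleqa": 28.0},
    "Math":      {"gsm8k": 88.0, "minerva_math": 52.0, "aime_2024": 21.0},
}

# Each task counts equally within its capability...
capability_avgs = {cap: sum(t.values()) / len(t) for cap, t in scores.items()}

# ...and equally in the overall macro-average across all tasks.
all_tasks = [s for tasks in scores.values() for s in tasks.values()]
overall = sum(all_tasks) / len(all_tasks)
print(capability_avgs, round(overall, 1))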


Prompt structure (by task)#

Below are representative prompt shapes for each task in olmo3:adapt. These are illustrative rather than exact strings.

Knowledge#

MMLU (CoT)

[Subject intro]
Question: ...
A. ...
B. ...
C. ...
D. ...
(Reasoning...)
Therefore, the answer is (X)
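
As a concrete (purely illustrative) rendering of the shape above, a formatter might look like this; the wording and field names are assumptions, not the suite's template:

def format_mmlu_cot(subject: str, question: str, choices: list[str]) -> str:
    # Illustrative formatter for the shape above; not the suite's exact template.
    letters = "ABCD"
    lines = [f"The following is a multiple choice question about {subject}.",
             f"Question: {question}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append('Think step by step, then finish with '
                 '"Therefore, the answer is (X)".')
    return "\n".join(lines)

print(format_mmlu_cot("astronomy", "Which planet is largest?",
                      ["Mars", "Jupiter", "Venus", "Mercury"]))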

PopQA

Q: {question} A:

SimpleQA

Question: {question}
Answer:

Reasoning#

BBH (CoT)

[Task description]
Q: ...
A: (reasoning...) final answer

GPQA

Answer the following multiple choice question...
Let's think step by step:
...
The correct answer is: (X)

ZebraLogic

[Logic puzzle statement]
[Required output format]

AGI-Eval (English, CoT)

[Exam-style prompt]
Question: ...
Answer: (reasoning...) final answer

Math#

Minerva Math

Problem: ...
Solution: (reasoning...)
\boxed{answer}

GSM8K (adapt)

Solve the following grade school math word problem:
{question}

Show your work and conclude with "Therefore, the final answer is \boxed{answer}."

OMEGA

[Problem statement]
Present the answer in LaTeX: \boxed{...}

AIME 2024/2025

{problem}

Present the answer in LaTeX: \boxed{Your answer}

Coding#

HumanEval+

def func(arg1, arg2):
    """Problem description and example calls."""
    # Complete the function body

MBPP+

# Problem statement
# Function signature
# Examples/tests

LiveCodeBench (code generation)

Problem: ...
Input/Output specs...
Constraints...
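
All three coding tasks are execution-based: a generated completion is combined with the problem's stub and run against unit tests. The sketch below is a deliberately minimal illustration of that check (no sandboxing, hypothetical helper name); a real harness isolates execution much more carefully.

import subprocess, sys, tempfile

def passes_tests(prompt: str, completion: str, tests: str, timeout: float = 10.0) -> bool:
    # Concatenate stub + completion + tests into one file and run it.
    program = prompt + completion + "\n\n" + tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

prompt = "def add(a, b):\n"
completion = "    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(prompt, completion, tests))  # True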

Chat / Instruction Following#

AlpacaEval v3

Instruction: ...
(Assistant response)

IFEval

[Instruction(s) with verifiable constraints]