# OLMo 3 Think

This tutorial explains how OLMo 3 Think is trained as a reasoning model.
## What is OLMo 3 Think?

OLMo 3 Think is a post-trained version of OLMo 3 designed to solve harder reasoning problems by:

- generating an extended thinking trace (a “thought trajectory”), and then
- producing a final answer.

The report frames OLMo 3 Think as a reasoning model trained using:

- high-quality “thinking” data curation (Dolci Think), and
- a three-stage recipe:
  1. Thinking SFT
  2. Thinking DPO
  3. Thinking RL (RLVR with GRPO)

They also introduce OlmoRL, an RL training infrastructure designed for long reasoning traces and verifiable rewards.
## Why this pipeline matters

Many modern reasoning-model pipelines rely heavily on RLVR:

- either RL directly from the base model (“RL-Zero”), or
- a small/light SFT warmup, then RLVR.

OLMo 3 Think differs in two important ways:

1. They use both SFT and DPO before RLVR.
2. Their RLVR is multi-objective, mixing verifiable and non-verifiable domains.

Empirically, the report claims consistent gains at each step of the SFT → DPO → RLVR pipeline.
## Overview: the 3-stage recipe

### Stage 1: Thinking SFT (seed the format + capabilities)

Goal: teach the model to output:

- a reasoning trace,
- followed by the final answer.
This stage also improves skills across:

- math
- science
- coding
- instruction following
- chat + safety

The dataset, Dolci Think SFT, contains ~2.3M supervised examples.
#### How Dolci Think SFT is curated

The report describes a pipeline that looks like:

1. Prompt sourcing
   - gather prompts for each capability from many public datasets
2. Re-generating completions
   - for prompts that have incomplete or missing completions, generate new completions using a strong reasoning model (e.g., DeepSeek-R1 or QwQ-32B)
   - ensure each completion includes a reasoning trace and a final answer
3. Correctness filtering
   - verify outputs with domain-specific checks, e.g.:
     - synthetic test cases for code
     - verifiers for strict-format or precise instruction constraints
4. Heuristic filtering (a rough sketch follows this list)
   - remove bad/unsafe/low-quality examples:
     - unclear license
     - incomplete reasoning traces
     - excessive repetition
     - references to other model providers
     - other heuristic signals
5. Topic filtering
   - classify prompts by topic (OpenAI query taxonomy)
   - drop or downsample irrelevant topics (e.g., image generation requests)
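To make the heuristic-filtering step concrete, here is a rough sketch of the kind of checks it implies. The regex, the threshold, and the use of a `<think>` delimiter are all assumptions, not values taken from the report:

```python
import re

# Hypothetical heuristic filter in the spirit of the Dolci Think SFT curation;
# the provider pattern and repetition threshold are illustrative only.
PROVIDER_PATTERN = re.compile(r"\b(openai|anthropic|gemini|chatgpt|claude)\b", re.IGNORECASE)

def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of duplicated n-grams, a crude signal for excessive repetition."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def keep_sft_example(completion: str) -> bool:
    has_complete_trace = "<think>" in completion and "</think>" in completion
    return (
        has_complete_trace                           # no truncated reasoning traces
        and not PROVIDER_PATTERN.search(completion)  # no references to other providers
        and repetition_ratio(completion) < 0.3       # not excessively repetitive
    )
```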
#### How they choose the final SFT mixture

After data sources are curated and filtered, they select the mixture in a midtraining-like way:

- evaluate many sources in parallel via lightweight SFT probes
- run centralized integration tests using the promising sources

An interesting note from the report: every data source helped at least one benchmark.
### Stage 2: Thinking DPO (preference tuning before RL)

Preference tuning (DPO) is often viewed as an “alignment” step, but the report emphasizes it can also improve capabilities when used before RL in reasoning models.

OLMo 3 Think uses:

- Dolci Think DPO (preference dataset)
- pairs constructed using Delta Learning

Key dataset sizes:

- 7B: ~150K preference pairs
- 32B: ~200K preference pairs
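The DPO objective itself is the standard one: a sigmoid loss on the implicit-reward margin between chosen and rejected completions. A minimal PyTorch sketch; the β value is a placeholder, not a hyperparameter from the report:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss. Each tensor holds the summed log-probability of the
    full completion (reasoning trace + answer) under the policy or the frozen
    reference model; beta (placeholder value) controls deviation from the reference."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen completion's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```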
Breakdown of datasets in Dolci-Think-RL used for RL training:

| Category | Prompt Dataset | # Prompts Used in Think RL | # Prompts Used in Instruct RL | Link |
|---|---|---|---|---|
| Precise IF | | 30,186 | 38,000 | |
| Math | | 3,000 | 14,000 | |
| Math | | 2,584 | 7,000 | https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed |
| Math | | 6,602 | – | |
| Math | | – | 14,000 | |
| Math | | 3,000 | 9,000 | |
| Math | | 15,000 | 20,000 | |
| Coding | | 9,767 | 20,000 | |
| Coding | | 8,040 | – | |
| Coding | | 2,303 | – | https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2 |
| Coding | | 3,000 | – | |
| General Chat | | 7,129 | 18,955 | |
| General Chat | | 7,129 | 18,761 | |
| Multi-Subject | | 7,129 | 12,234 | https://huggingface.co/datasets/virtuoussy/Multi-subject-RLVR |
| Total | | 104,869 | 171,950 | |
#### Delta Learning intuition (why it helps)

Traditional synthetic preference data (UltraFeedback-like) depends on:

- a diverse pool of strong models, and
- an LLM judge to rank completions.

But for reasoning traces, the pool of models that produce open reasoning traces is limited.

Delta Learning solves this by focusing on the gap between preferred and rejected answers:

- Chosen completion: strong model (e.g., Qwen-3-32B)
- Rejected completion: weak model (e.g., Qwen-3-0.6B)

The idea: preference quality depends on the delta between chosen and rejected responses more than on either response alone.

The report also notes an empirical detail: continuing SFT on extra synthetic completions can hurt, but DPO using clear deltas improves results and yields a better RL starting point.
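In code, building one Delta Learning pair is just sampling the same prompt from a strong and a weak model; no judge is needed. `generate` below is a hypothetical sampling helper (the model names follow the report's example):

```python
def generate(model_name: str, prompt: str) -> str:
    """Hypothetical helper: sample one full thinking-trace completion
    from `model_name` (backend-specific, e.g., a vLLM or HF call)."""
    raise NotImplementedError

def build_preference_pair(prompt: str) -> dict:
    # The "delta" comes from the strength gap between the two models,
    # not from an LLM judge ranking the two completions.
    chosen = generate("Qwen-3-32B", prompt)     # strong model -> chosen
    rejected = generate("Qwen-3-0.6B", prompt)  # weak model -> rejected
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```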
#### Prompt reuse + filtering details

- Prompts are mostly reused from SFT.
- Extra preference sources from earlier OLMo work may be included (e.g., UltraFeedback-like sets).
- Filtering is applied primarily to the chosen completion.
- Rejected completions may remain unfiltered (since they are intentionally weaker).
#### Mixture selection under RL cost constraints

Reasoning DPO experiments are expensive, so they use a hierarchical strategy:

1. do mixture experiments using “non-thinking” outputs first (cheaper)
2. take the top 3 mixtures into full thinking-trace preference tuning
### Stage 3: Thinking RL (RLVR) using GRPO + mixed rewards

As the final stage, OLMo 3 Think runs RL training using:

- GRPO (Group Relative Policy Optimization), and
- a mixture of:
  - verifiable reward domains
  - non-verifiable reward domains (LLM judge)

They frame this as RLVR: RL with verifiable rewards (plus judge rewards for non-verifiable tasks).
#### The RL dataset: Dolci Think RL

- ~100K prompts
- domains include:
  - math
  - code
  - instruction following
  - general chat
#### Rewards: verifiable vs non-verifiable

##### Verifiable rewards (deterministic)

These rewards come from rule-based checks or test execution (a combined sketch follows this list).

- Math
  - normalize answers
  - check equivalence (e.g., via sympy)
  - reward is often binary (correct / incorrect)
- Code
  - execute test cases
  - reward can be binary (all tests pass) or fractional (ratio of passing tests)
- Instruction following
  - verify constraint-specific rules
  - similarly binary or fractional
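Putting the three verifiable checks together, here is a minimal sketch of what the reward functions could look like; the normalization details, the partial-credit choice, and the function names are assumptions:

```python
import sympy

def math_reward(model_answer: str, reference: str) -> float:
    """Binary reward: 1.0 if the (already normalized) answers are symbolically equal."""
    try:
        diff = sympy.simplify(sympy.sympify(model_answer) - sympy.sympify(reference))
        return float(diff == 0)
    except (sympy.SympifyError, TypeError):
        return 0.0

def code_reward(num_passed: int, num_tests: int, partial_credit: bool = True) -> float:
    """Binary (all tests pass) or fractional (ratio of passing tests) reward."""
    if num_tests == 0:
        return 0.0
    return num_passed / num_tests if partial_credit else float(num_passed == num_tests)

def if_reward(constraint_checks: list) -> float:
    """Instruction following: fraction of rule-based constraint checks that pass."""
    return sum(constraint_checks) / len(constraint_checks) if constraint_checks else 0.0
```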
##### Non-verifiable rewards (LLM-judge)

General chat has no deterministic verifier, so they use an LLM judge:

- judge model: Qwen-3-32B
- thinking mode off (the judge produces a score directly)

Judging can be:

- reference-based (if a ground-truth answer exists)
- reference-free (otherwise)
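A sketch of the judge-based reward for chat; the prompt wording and the `query_judge` helper (a call into a Qwen-3-32B endpoint with thinking disabled, returning a parsed scalar) are assumptions:

```python
from typing import Optional

def query_judge(judge_prompt: str) -> float:
    """Hypothetical helper: send `judge_prompt` to the judge model
    (Qwen-3-32B, thinking off) and parse a scalar score from its reply."""
    raise NotImplementedError

def chat_reward(question: str, response: str, reference: Optional[str] = None) -> float:
    if reference is not None:
        # Reference-based judging: compare the response to a known good answer.
        judge_prompt = (f"Question:\n{question}\n\nReference answer:\n{reference}\n\n"
                        f"Candidate response:\n{response}\n\nScore the candidate from 0 to 1.")
    else:
        # Reference-free judging: rate quality/helpfulness directly.
        judge_prompt = (f"Question:\n{question}\n\nCandidate response:\n{response}\n\n"
                        f"Score the candidate from 0 to 1.")
    return query_judge(judge_prompt)
```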
#### Why mix reward signals? (multi-objective RLVR)

Pure RLVR often optimizes for one verifiable domain (like math). But the report emphasizes that they mix domains to:

- prevent over-optimization (reward hacking)
- preserve general usefulness

A key observation they report: mixed-reward training can show lower training reward but better downstream generalization.
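One way to picture the multi-objective setup is a per-domain reward router that dispatches each prompt to the appropriate verifier or judge. This sketch reuses the hypothetical reward functions from the two sections above; the domain labels and sample fields are illustrative:

```python
# Hypothetical dispatch table; in practice each RL prompt would carry metadata
# identifying its domain and the information its verifier or judge needs.
REWARD_FNS = {
    "math": lambda s: math_reward(s["answer"], s["reference"]),
    "code": lambda s: code_reward(s["num_passed"], s["num_tests"]),
    "precise_if": lambda s: if_reward(s["constraint_checks"]),
    "chat": lambda s: chat_reward(s["question"], s["response"], s.get("reference")),
}

def compute_reward(sample: dict) -> float:
    return REWARD_FNS[sample["domain"]](sample)
```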
#### GRPO: enhancements over “vanilla” GRPO

OLMo 3 Think uses GRPO as the base optimizer, but applies several enhancements inspired by recent work.

Key improvements listed:

- Zero-gradient filtering: drop prompts where all rollouts get identical reward (no learning signal).
- Active sampling: keep the batch size constant by replacing filtered prompts.
- Token-level normalization: normalize the GRPO loss by the total tokens in the batch (reduces length bias).
- No KL loss: remove the KL penalty for more flexible updates.
- Asymmetric clipping: set a higher upper clipping bound than lower bound.
- No reward standard-deviation normalization: remove the reward-std term from the group-advantage denominator.

These changes preserve the overall GRPO structure (PPO-like), but adjust:

- the advantage definition,
- clipping, and
- normalization.
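A compact sketch of how these tweaks change the loss relative to vanilla GRPO. The clipping epsilons are placeholders, the advantage uses only the group-mean baseline (no reward-std division), there is no KL term, and the loss is normalized by the total token count in the batch:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: [num_prompts, group_size]. Group-mean baseline only;
    the reward-std denominator of vanilla GRPO is removed."""
    return rewards - rewards.mean(dim=-1, keepdim=True)

def keep_prompt_mask(rewards: torch.Tensor) -> torch.Tensor:
    """Zero-gradient filtering: drop prompts whose rollouts all share one reward
    (active sampling would then top the batch back up with fresh prompts)."""
    return rewards.std(dim=-1) > 0

def grpo_loss(logp_new, logp_old, advantages, token_mask,
              eps_low: float = 0.2, eps_high: float = 0.28):
    """PPO-style clipped objective with the listed tweaks: asymmetric clipping
    (eps_high > eps_low), no KL penalty, and normalization by the total number of
    tokens in the batch rather than per sequence (epsilon values are placeholders).
    Shapes: [num_rollouts, seq_len]; `advantages` is [num_rollouts, 1], broadcast per token."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped) * token_mask
    return per_token_loss.sum() / token_mask.sum()
```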
#### RL data curation details (code + chat)

##### Code: create problems with usable test cases

Because test cases are required for verifiable rewards, they build a synthetic pipeline:

1. rewrite the problem + solution
2. generate test cases
3. execute the test cases
4. keep only problems where >80% of test cases pass
5. remove the remaining failing test cases
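A sketch of the keep/clean logic at the end of that pipeline; `run_test` stands in for a sandboxed executor and the 80% threshold mirrors the rule above:

```python
def finalize_synthetic_problem(problem: dict, tests: list, run_test, threshold: float = 0.8):
    """Keep a rewritten coding problem only if more than `threshold` of its generated
    tests pass against the reference solution, then drop the failing tests so the
    surviving suite can serve as a verifiable reward.
    `run_test(solution, test)` is a hypothetical sandboxed executor returning True/False."""
    results = [(test, run_test(problem["solution"], test)) for test in tests]
    pass_rate = sum(ok for _, ok in results) / len(results) if results else 0.0
    if pass_rate <= threshold:
        return None  # discard the problem entirely
    problem["tests"] = [test for test, ok in results if ok]
    return problem
```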
##### Chat: rewrite + filter by “not too easy, not too hard”

They rewrite chat samples for clarity and extract reference answers. Then:

1. sample 8 responses per prompt
2. compute F1 against the reference
3. drop prompts whose F1 falls outside [0.1, 0.9]

This aims to remove:

- trivial prompts (too easy)
- noisy / impossible prompts (too hard)
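A sketch of that difficulty-band filter using a SQuAD-style token-level F1; whether the report aggregates the 8 samples by mean or some other statistic is an assumption here:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a sampled response and the reference answer."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def keep_chat_prompt(responses: list, reference: str,
                     lo: float = 0.1, hi: float = 0.9) -> bool:
    """Keep prompts whose sampled responses land in the useful difficulty band:
    near-0 F1 suggests a noisy/impossible prompt, near-1 a trivial one.
    (Aggregating the 8 samples by mean is an assumption.)"""
    mean_f1 = sum(token_f1(r, reference) for r in responses) / len(responses)
    return lo <= mean_f1 <= hi
```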
#### Offline difficulty filtering before RL

Before RL training, they filter out prompts the policy already solves easily:

1. generate 8 rollouts per prompt using the DPO checkpoint
2. remove prompts with a pass rate > 62.5%

This improves RL sample efficiency by avoiding “already-solved” data. They run this filtering for the 7B model first, then reuse the results for 32B due to cost.
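In code, the offline difficulty filter is a simple pass-rate check over rollouts from the DPO checkpoint; `sample_rollout` and `is_correct` are hypothetical helpers, and the 62.5% cutoff corresponds to more than 5 of 8 rollouts being correct:

```python
def keep_prompt_for_rl(prompt: str, sample_rollout, is_correct,
                       n_rollouts: int = 8, max_pass_rate: float = 0.625) -> bool:
    """Drop prompts the DPO checkpoint already solves too reliably.
    `sample_rollout(prompt)` draws one completion; `is_correct(prompt, completion)`
    applies the same verifier used for RL rewards. Both are hypothetical helpers."""
    rollouts = [sample_rollout(prompt) for _ in range(n_rollouts)]
    pass_rate = sum(is_correct(prompt, r) for r in rollouts) / n_rollouts
    return pass_rate <= max_pass_rate  # keep only prompts that are still challenging
```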
#### RL experimentation strategy under high cost

RL is expensive, so they adopt proxy tuning strategies:

- short RL runs (~1K steps) to vet mixtures
- test algorithm changes in single-objective settings (math-only)
- mostly tune with 7B, then reuse for 32B

This is a recurring theme in OLMo 3: expensive stages require cheap proxy experiments.
## A practical takeaway you can reuse

If you want to train a reasoning model like OLMo 3 Think, the high-level playbook is:

1. Seed the reasoning format with SFT.
2. Improve preference alignment + capability with DPO.
3. Scale reasoning ability with RLVR.
4. Use a multi-domain reward mix to prevent narrow reward hacking.
5. Invest in RL infrastructure (inference/training engine mismatch, stability, sampling).
## References

- OLMo 3 technical report: https://arxiv.org/abs/2512.13961