OLMo 3 Instruct#
This tutorial explains the OLMo 3 Instruct post-training pipeline. It parallels OLMo 3 Think (SFT → DPO → RLVR) but optimizes for everyday assistant usage: multi-turn chat, conciseness, and tool use rather than long reasoning traces.
Why Instruct models exist (even when Think models are strong)#
Reasoning models (like OLMo 3 Think) can be very powerful, but many real-world user queries are:
simple information seeking
advice seeking
conversational
tool-based tasks (search, retrieval, function calling)
These do not require long “thinking trajectories”.
OLMo 3 Instruct is built to be:
fast at inference
concise by default
strong at multi-turn dialogue
capable of tool use / function calling
A core design goal from the report:
Everyday chat settings often do not require inference-time scaling via long thoughts.
So Instruct models avoid the overhead of generating long reasoning traces unless needed.
High-level pipeline: same stages, different target behavior#
OLMo 3 Instruct uses the same three-stage post-training recipe as Think:
Instruct SFT
Instruct DPO
Instruct RLVR (with GRPO)
But the emphasis shifts:
Think: maximize reasoning via explicit thought traces
Instruct: maximize usability (chat + tools + conciseness)
Evaluation focus for Instruct models#
OLMo 3 Instruct evaluation includes:
most of the Think benchmark suite (general capability + QA + math + code)
plus additional tool-use and tool-augmented QA benchmarks, such as:
Berkeley Function Calling Leaderboard
LitQA2
SimpleQA
Notable results described in the report:
Instruct models benefit significantly from tool usage behavior learned in post-training
At 7B scale, OLMo 3 Instruct can outperform Qwen-3 (thinking disabled) on some tasks
At 32B scale, that gap shrinks or disappears
Stage 1: Dolci Instruct SFT (multi-turn chat + agentic tool use)#
OLMo 3 Instruct starts from a new SFT dataset, Dolci Instruct SFT.
It builds on the prior OLMo 2 instruct data but introduces key improvements:
Key changes vs earlier instruct datasets#
Remove reasoning traces
any existing “chain-of-thought style” traces are stripped out
to encourage short, direct answers (see the sketch after this list)
Upgrade synthetic generations to newer models
use strong modern generators (e.g., GPT-4.1 instead of older GPT-3.5 era models)
this generally yields more coherent, better instruction-following completions
Add extensive supervised function calling
a major focus for OLMo 3 Instruct is realistic tool usage
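As a rough illustration of the first change above, here is a minimal sketch of stripping reasoning traces from chat-formatted SFT examples. The `<think>...</think>` tag convention and the message schema are assumptions for illustration, not the report's actual data format.

```python
import re

# Assumed trace format for illustration: reasoning wrapped in <think>...</think>.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(example: dict) -> dict:
    """Remove chain-of-thought-style spans from assistant turns,
    keeping only the short, direct answer."""
    cleaned = []
    for msg in example["messages"]:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"]).strip()}
        cleaned.append(msg)
    return {**example, "messages": cleaned}

example = {
    "messages": [
        {"role": "user", "content": "What is 17 * 6?"},
        {"role": "assistant", "content": "<think>17 * 6 = 102</think>The answer is 102."},
    ]
}
print(strip_reasoning(example)["messages"][1]["content"])  # -> "The answer is 102."
```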
Function calling data: two strategies#
The report describes two complementary ways of generating function-calling data.
(A) Real trajectories (tool use in executable environments)#
They build datasets like:
ScienceQA
WebSearchQA
These are created using a strong model (GPT-4.1 or GPT-5) equipped with tools, exposed via MCP servers, that connect to:
the web
a corpus of papers
This produces realistic trajectories containing:
multiple interaction steps
real tool outputs
tool errors and edge cases
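As a loose sketch of what collecting one such trajectory could look like, the loop below rolls out a tool-using assistant and records every step. The `chat` and `execute_tool` callables, the message schema, and the `environment` role are placeholders, not the actual GPT-4.1/GPT-5 or MCP client interfaces used for these datasets.

```python
def collect_trajectory(user_request, chat, execute_tool, max_steps=10):
    """Roll out a tool-using assistant on a real request and record every step,
    including genuine tool outputs and errors. `chat` and `execute_tool` are
    placeholders for the generator model and the tool backend."""
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = chat(messages)                      # assistant turn: text and/or a tool call
        messages.append({"role": "assistant", "content": reply["content"]})
        if "tool_call" not in reply:                # no tool requested -> final answer, stop
            break
        try:
            result = execute_tool(reply["tool_call"])   # e.g. web or paper search
        except Exception as exc:                    # keep errors: useful training signal
            result = f"ERROR: {exc}"
        messages.append({"role": "environment", "content": str(result)})
    return messages
```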
(B) Simulated interactions (large-scale synthetic tool use)#
Real tool trajectories are expensive to collect at scale.
To increase tool diversity and volume, they also create simulated tool-use data:
start with a pool of tools + API specs from public sources
prompt a pool of generator LLMs (GPT-4o, GPT-4.1, GPT-5) to synthesize:
user requests
tool calls
tool outputs
assistant follow-ups
This provides broad coverage over:
many tool schemas
multi-turn agent loops
multiple tool calls per conversation
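A minimal sketch of the simulated strategy, assuming a generic `generate(prompt) -> str` helper standing in for the generator LLMs and a pool of JSON tool schemas; the real prompts and post-processing are not described at this level of detail in the report.

```python
import json
import random

def synthesize_tool_conversation(tool_schemas, generate):
    """Ask a generator LLM (placeholder `generate`) to invent a full simulated
    conversation: user request, tool calls, simulated tool outputs, and an
    assistant follow-up, over a random subset of tool schemas."""
    tools = random.sample(tool_schemas, k=min(3, len(tool_schemas)))
    prompt = (
        "You simulate a tool-using assistant. Available tools:\n"
        + json.dumps(tools, indent=2)
        + "\n\nWrite a realistic conversation as a JSON list of messages with "
          "roles 'user', 'assistant', and 'environment'. Include at least one "
          "tool call and its (simulated) output, and a final assistant answer."
    )
    return json.loads(generate(prompt))
```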
Details of the function-calling datasets. Multi-turn refers to multiple user turns per trajectory; multi-step refers to multiple environment interactions per user request.

| Dataset | Env. interactions | # Trajectories | # Unique functions | % Multi-turn | % Multi-step |
|---|---|---|---|---|---|
| Science QA | Real (MCP) | 22.6K | 8 | – | 42.3% |
| Web Search QA | Real (MCP) | 6.6K | 3 | – | 76.1% |
| SimFC | Simulated | 200K | 42.6K | 42.3% | 23.8% |
Unified tool formatting is crucial#
A major lesson emphasized:
a unified format for tool calling data is necessary for strong tool performance
Their unified format includes:
tool specs included in the system prompt
tool calls wrapped in XML tags in assistant messages
tool outputs emitted by a special environment role (with dedicated special tokens)
This consistency matters because tool learning is format-sensitive.
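A hypothetical example of what such a unified conversation could look like; the role names, XML tag, and JSON layout below are illustrative stand-ins, not the actual OLMo 3 chat template or its special tokens.

```python
# Hypothetical unified format (illustrative role names and tags only):
conversation = [
    {"role": "system",
     "content": "You may call tools.\nTools:\n"
                '[{"name": "web_search", "parameters": {"query": "string"}}]'},
    {"role": "user", "content": "Who won the 2022 World Cup?"},
    {"role": "assistant",
     "content": '<tool_call>{"name": "web_search", '
                '"arguments": {"query": "2022 World Cup winner"}}</tool_call>'},
    {"role": "environment",   # tool output returned under a dedicated role
     "content": "Argentina won the 2022 FIFA World Cup, beating France on penalties."},
    {"role": "assistant", "content": "Argentina won the 2022 World Cup."},
]
```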
Mixture tuning for Instruct SFT#
They tune the data mix similarly to Think:
start with a base mixture of ~100K supervised examples
add/ablate domains on top of the OLMo 2 base
evaluate impact per domain
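A minimal sketch of an add-one-domain-at-a-time mixture ablation, under the assumption of a greedy keep-it-if-it-helps rule; the domain names, sizes, and `train_and_eval` helper are hypothetical.

```python
# Hypothetical domain names/sizes; `train_and_eval` is a placeholder that
# fine-tunes on a mixture and returns an average benchmark score.
base_mix = {"general_chat": 60_000, "math": 20_000, "code": 20_000}
candidates = {"function_calling": 30_000, "safety": 10_000}

def tune_mixture(base_mix, candidates, train_and_eval):
    best_mix, best_score = dict(base_mix), train_and_eval(base_mix)
    for domain, n_examples in candidates.items():
        trial = {**best_mix, domain: n_examples}     # add one candidate domain
        score = train_and_eval(trial)
        if score > best_score:                       # keep it only if it helps on average
            best_mix, best_score = trial, score
    return best_mix
```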
Starting point: Instruct is trained from Think SFT#
A notable finding:
training Instruct on top of the Think SFT model improves benchmark performance and does not increase average response length
So the Instruct pipeline does not start from the raw base model; it starts from the OLMo 3 Think SFT checkpoint.
This provides stronger capabilities while still allowing Instruct SFT to reshape output behavior into short/direct responses.
Stage 2: Instruct DPO (preferences for chat helpfulness + conciseness + tools)#
Instruct DPO expands the Delta Learning idea to focus on general chat quality.
They use three main types of preference pairs:
(1) Delta Learning pairs (Qwen-3, thinking OFF)#
Same core contrastive idea as Think DPO, but here:
both chosen and rejected completions come from Qwen-3
thinking mode is turned OFF
to emphasize direct assistant answers
(2) Delta-maximized GPT-judged pairs (modern UltraFeedback-style)#
They generate completions from a pool of diverse models, score them with GPT-4.1, and pick:
best completion = chosen
worst completion = rejected
Key twist:
modern model pools often have small deltas (everyone is “pretty good”), so you must ensure at least one weak model exists in the pool
This is called Delta Maximization:
enforce a big gap between best and worst completions
otherwise preference tuning can be noisy or harmful
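A minimal sketch of delta-maximized pair construction, assuming each completion already carries a judge score (e.g. from GPT-4.1) and using an illustrative `min_delta` threshold.

```python
def build_pair(completions, min_delta=2.0):
    """completions: list of {"model": str, "text": str, "score": float},
    where `score` is an LLM-judge rating (e.g. on a 1-10 scale).
    Returns a (chosen, rejected) pair only if the best-worst gap is large enough."""
    ranked = sorted(completions, key=lambda c: c["score"], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen["score"] - rejected["score"] < min_delta:
        return None        # delta too small -> likely a noisy, unhelpful pair
    return {"chosen": chosen["text"], "rejected": rejected["text"]}
```

In practice, the completion pool is built so that at least one clearly weaker model is present, which makes a sufficiently large gap likely.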
(3) Multi-turn preference pairs#
They generate multi-turn context by:
self-talk / synthetic context expansion around an existing prompt
Then the preference pair differs only in the final assistant response, generated by models with a large capability gap, such as:
GPT-3.5 vs GPT-4.1
Qwen-3-0.6B vs Qwen-3-32B
This teaches:
stable “assistant personality”
turn-level helpfulness
conversational continuity
Length control: fight judge verbosity bias#
LLM judges often prefer longer outputs.
To encourage conciseness:
they filter preference pairs so chosen/rejected length differs by ≤ 100 tokens
This may reduce some benchmark metrics, but improves:
usability
vibe tests
downstream RL stability (it becomes a better starting point)
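A minimal version of the length filter described above, assuming a tokenizer object with an `encode` method (as in Hugging Face tokenizers) and the ≤ 100-token threshold.

```python
def keep_pair(pair, tokenizer, max_gap=100):
    """Keep a preference pair only if the chosen and rejected responses are of
    similar length, so the preference signal is not just 'longer is better'."""
    chosen_len = len(tokenizer.encode(pair["chosen"]))
    rejected_len = len(tokenizer.encode(pair["rejected"]))
    return abs(chosen_len - rejected_len) <= max_gap

# filtered_pairs = [p for p in pairs if keep_pair(p, tokenizer)]
```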
Stage 3: Instruct RLVR (GRPO) with small modifications#
Instruct RL is “the same stack” as Think RLVR, but with a few targeted changes.
Key differences vs Think RL#
Less challenging math/code tasks
remove the hardest problems (since the goal is general usability, not maximal reasoning)
No offline difficulty filtering
Think RL used difficulty filtering to avoid trivial prompts
Instruct RL drops this step because it is not focused on hard reasoning trajectories
Response length cap
maximum response length is capped at 8K tokens
to prevent overly long answers in everyday chat
RL mixture#
Instruct RL is trained on a mix of:
general chat
math
code
Rewards still include:
verifiable rewards for math/code
LLM-judge rewards for chat
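A rough sketch of how these rewards could be routed by domain; `verify_math`, `verify_code`, and `judge_chat` are placeholder functions, and the handling of the 8K cap is an assumption noted in the comments.

```python
MAX_RESPONSE_TOKENS = 8192   # the 8K response cap mentioned above

def reward(prompt, response, domain, num_tokens,
           verify_math, verify_code, judge_chat):
    """Route each rollout to a verifiable checker (math/code) or an LLM judge
    (general chat). All three reward functions are placeholders."""
    if num_tokens > MAX_RESPONSE_TOKENS:
        # Assumption for illustration: treat over-length responses as zero reward;
        # the cap may instead simply be a generation-time limit.
        return 0.0
    if domain == "math":
        return float(verify_math(prompt, response))    # e.g. final-answer match
    if domain == "code":
        return float(verify_code(prompt, response))    # e.g. unit tests pass
    return float(judge_chat(prompt, response))          # scalar judge score
```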
Final model choice is based on:
average performance
response length analysis
vibe tests
Practical summary: how OLMo 3 Instruct differs from OLMo 3 Think#
Same structure:
SFT → DPO → RLVR (GRPO)
Different target behavior:
no explicit thought trajectories
more multi-turn conversation and assistant “polish”
much stronger function calling behaviors (with unified tool format)
explicit conciseness pressure (length filters + response cap)
References#
OLMo 3 technical report: https://arxiv.org/abs/2512.13961