RLVR

RLVR (Reinforcement Learning with Verifiable Rewards) replaces learned reward models with programmatic verifiers that produce clear, objective signals (e.g., answer correctness in math or passing tests in code). This makes the reward more reliable and harder to hack, but it also changes the optimization dynamics and failure modes compared to RLHF.
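To make "verifier" concrete, here is a minimal sketch of a rule-based reward function for math answers. The `####` answer delimiter and the function name are illustrative assumptions, not a fixed standard; real verifiers typically also normalize formatting (whitespace, fractions, units) before comparing.

```python
def math_verifier(response: str, gold_answer: str) -> float:
    """Return 1.0 if the response's final answer matches the reference, else 0.0.

    Assumes (hypothetically) that the model writes its final answer after a
    "####" marker; everything before it is treated as reasoning.
    """
    final = response.split("####")[-1].strip()
    return 1.0 if final == gold_answer.strip() else 0.0


print(math_verifier("Compute 6 * 7. #### 42", "42"))  # 1.0
print(math_verifier("Compute 6 * 7. #### 41", "42"))  # 0.0
```

Because the signal is binary and exact, there is no reward model to exploit; the trade-off is that partially correct responses receive no credit, which affects exploration.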

In this section, we cover GRPO (Group Relative Policy Optimization) as a core RLVR method and discuss its stability improvements and variants.