From Return to Advantage: Full Derivation to GAE#
This tutorial walks step by step through the refinement of policy gradient estimators, starting from the trajectory-level return used in REINFORCE and ending with Generalized Advantage Estimation (GAE), as used in PPO.
The goal is to explain why and how we replace the full return with the advantage, and how GAE naturally emerges from this refinement.
1. REINFORCE: Policy Gradient with Trajectory Return#
We start from the policy optimization objective:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big],
$$

where a trajectory (rollout) is:

$$
\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T),
$$

and the total discounted return is:

$$
R(\tau) = \sum_{t=0}^{T} \gamma^t r_t .
$$

Using the log-derivative trick, the policy gradient is:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \big].
$$

Expanding the trajectory probability:

$$
p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t).
$$

Environment terms (\(\rho_0\) and \(P\)) do not depend on \(\theta\), so:

$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
$$

Thus:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) R(\tau) \right].
$$
Problem: every action is credited with the same full-trajectory return → very high variance and poor credit assignment.
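To make the problem concrete, here is a minimal PyTorch sketch of the REINFORCE surrogate loss for a single rollout. The tensors `log_probs` and `rewards` are illustrative stand-ins (random numbers), not outputs of a real policy or environment.

```python
import torch

# Stand-in per-step data for one rollout of length 5 (illustrative only):
# log_probs[t] ~ log pi_theta(a_t | s_t), rewards[t] = r_t.
log_probs = torch.randn(5, requires_grad=True)
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0])
gamma = 0.99

# Total discounted return R(tau): a single scalar shared by every time step.
discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
R_tau = (discounts * rewards).sum()

# REINFORCE surrogate: every log-prob is weighted by the same R(tau), so an
# action taken after all rewards were collected still gets full credit for them.
loss = -(log_probs * R_tau).sum()
loss.backward()  # single-trajectory estimate of -grad J(theta)
```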
2. Reward-to-Go: Replacing \(R(\tau)\) with \(G_t\)#
Define the reward-to-go from time step \(t\):

$$
G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} .
$$

Then the policy gradient can be rewritten as:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right].
$$
Why this is valid#
Split the full return into rewards collected before and after step \(t\):

$$
R(\tau) = \sum_{t'=0}^{t-1} \gamma^{t'} r_{t'} + \gamma^t G_t .
$$

The past rewards do not depend on \(a_t\), and therefore vanish in expectation because the score function has zero mean under the policy:

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big] = 0 .
$$

(The leftover \(\gamma^t\) factor in front of \(G_t\) is conventionally dropped in practice, which is the convention used above.)
So only future rewards matter for action \(a_t\).
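Reward-to-go is just a backward pass over the reward sequence. Below is a minimal NumPy sketch; the function name and the toy reward sequence are illustrative.

```python
import numpy as np

def reward_to_go(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """G_t = sum_{t'>=t} gamma^(t'-t) * r_{t'}, computed by a single backward pass."""
    G = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Only the final reward is nonzero, so earlier steps see it discounted:
print(reward_to_go(np.array([0.0, 0.0, 1.0]), gamma=0.9))  # [0.81, 0.9, 1.0]
```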
3. Baseline Trick: Subtract Any \(b(s_t)\) Without Bias#
We now introduce a baseline \(b(s_t)\), a function of the state only:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big( G_t - b(s_t) \big) \right].
$$
Proof of unbiasedness#
Because the score function has zero mean:

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big]
= \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
= \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
= \nabla_\theta 1
= 0,
$$

we have:

$$
\mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t) \big]
= \mathbb{E}_{s_t}\Big[ b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \big] \Big]
= 0 .
$$
Thus, subtracting any function of the state does not change the expected gradient.
Purpose: reduce variance.
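The unbiasedness claim is easy to check numerically. The sketch below samples actions from a toy 3-action softmax policy at a single state, for which the score is \(\nabla_\theta \log \pi_\theta(a) = \mathrm{onehot}(a) - \pi_\theta\); the logits and the baseline value are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy softmax policy over 3 actions at a single state; theta are the logits.
theta = np.array([0.5, -0.2, 1.0])
probs = np.exp(theta) / np.exp(theta).sum()

b = 3.7  # an arbitrary baseline value for this state

# Monte Carlo estimate of E_{a ~ pi}[ grad_theta log pi(a) * b ].
samples = rng.choice(len(probs), size=200_000, p=probs)
scores = np.eye(len(probs))[samples] - probs   # row i: grad_theta log pi(a_i)
print(b * scores.mean(axis=0))                 # ~[0, 0, 0] up to sampling noise
```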
4. Choosing the Baseline: Value Function → Advantage#
Define the state-value and action-value functions:

$$
V^\pi(s_t) = \mathbb{E}_\pi\big[ G_t \mid s_t \big],
\qquad
Q^\pi(s_t, a_t) = \mathbb{E}_\pi\big[ G_t \mid s_t, a_t \big].
$$

Then the advantage function is:

$$
A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t).
$$
Now observe that \(\mathbb{E}\big[ G_t \mid s_t, a_t \big] = Q^\pi(s_t, a_t)\), so choosing the baseline \(b(s_t) = V^\pi(s_t)\) turns the weight \(G_t - V^\pi(s_t)\) into an unbiased estimate of the advantage:

$$
\mathbb{E}\big[ G_t - V^\pi(s_t) \mid s_t, a_t \big]
= Q^\pi(s_t, a_t) - V^\pi(s_t)
= A^\pi(s_t, a_t).
$$

So the policy gradient becomes:

$$
\nabla_\theta J(\theta)
= \mathbb{E}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t) \right].
$$
This is the advantage policy gradient.
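In code, the move from the REINFORCE sketch in Section 1 to the advantage policy gradient is a one-line change to the surrogate loss: weight each log-probability by its own advantage estimate rather than by the shared return. A minimal PyTorch sketch with stand-in tensors, not a real policy:

```python
import torch

log_probs  = torch.randn(5, requires_grad=True)           # stand-in for log pi_theta(a_t|s_t)
advantages = torch.tensor([0.3, -0.1, 0.8, -0.5, 0.2])    # stand-in for A_hat(s_t, a_t)

# Advantage policy gradient surrogate: per-step weights, advantages treated as constants.
loss = -(log_probs * advantages.detach()).sum()
loss.backward()
```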
5. Temporal-Difference Residual as Advantage Estimator#
In practice, we approximate \(V^\pi\) with a learned value function \(V_\phi\).

Define the TD error:

$$
\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).
$$

If \(V_\phi = V^\pi\), then:

$$
\mathbb{E}\big[ \delta_t \mid s_t, a_t \big]
= \mathbb{E}\big[ r_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t \big] - V^\pi(s_t)
= Q^\pi(s_t, a_t) - V^\pi(s_t)
= A^\pi(s_t, a_t).
$$
So the TD residual is an unbiased advantage estimator when the value function is correct.
Note that this is exactly the 1-step advantage estimator: \(\delta_t = \hat A_t^{(1)}\).
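Computing TD residuals over a rollout is a vectorized one-liner once value predictions are available. A minimal NumPy sketch (function name, reward sequence, and value predictions are made up for illustration):

```python
import numpy as np

def td_residuals(rewards, values, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has length len(rewards) + 1, including the value of the state
    after the last step (use 0.0 if that state is terminal)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    return rewards + gamma * values[1:] - values[:-1]

# 4-step rollout with made-up value predictions; the final state is terminal.
deltas = td_residuals(rewards=[1.0, 0.0, 0.0, 1.0],
                      values=[0.5, 0.4, 0.6, 0.7, 0.0],
                      gamma=0.9)
print(deltas)
```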
6. \(k\)-Step Advantage Estimators#
Define the \(k\)-step bootstrapped return:

$$
\hat G_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V_\phi(s_{t+k}).
$$

The corresponding advantage estimator is:

$$
\hat A_t^{(k)} = \hat G_t^{(k)} - V_\phi(s_t).
$$
Expressing \(k\)-step advantage via TD residuals#
Using telescoping sums (verified numerically in the sketch at the end of this section):

$$
\hat A_t^{(k)}
= \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V_\phi(s_{t+k}) - V_\phi(s_t)
= \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}.
$$
This reveals a continuum of advantage estimators:
\(k=1\): the pure TD residual \(\delta_t\) (most bootstrapping: lowest variance, most bias from \(V_\phi\))
\(k \to \infty\): the full Monte Carlo return (no bootstrapping: highest variance, no bias from \(V_\phi\))
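A quick numerical check of the telescoping identity on a toy rollout (all numbers below are made up; `gamma`, `t`, and `k` are arbitrary choices):

```python
import numpy as np

gamma = 0.9
rewards = np.array([1.0, 0.0, 0.5, 0.0, 2.0])
values  = np.array([0.8, 0.6, 0.7, 0.4, 0.9, 0.3])    # V(s_0) ... V(s_5), made up
deltas  = rewards + gamma * values[1:] - values[:-1]    # TD residuals delta_t

t, k = 0, 3

# Direct k-step advantage: k discounted rewards, bootstrap with V(s_{t+k}), minus V(s_t).
direct = sum(gamma**l * rewards[t + l] for l in range(k)) \
         + gamma**k * values[t + k] - values[t]

# Telescoped form: discounted sum of the first k TD residuals.
telescoped = sum(gamma**l * deltas[t + l] for l in range(k))

print(np.isclose(direct, telescoped))  # True
```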
7. Generalized Advantage Estimation (GAE)#
GAE forms an exponentially weighted average of all \(k\)-step advantages:

$$
\hat A_t^{\mathrm{GAE}(\gamma, \lambda)}
= (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1}\, \hat A_t^{(k)}.
$$

Substitute the TD-residual form and rearrange sums: each \(\delta_{t+l}\) appears in every \(k\)-step term with \(k \ge l+1\), and \((1-\lambda) \sum_{k=l+1}^{\infty} \lambda^{k-1} = \lambda^{l}\), so

$$
\hat A_t^{\mathrm{GAE}(\gamma, \lambda)}
= (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}
= \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}.
$$
This is the canonical GAE formula.
The role of \(\lambda\)#
\(\lambda \in [0,1]\) controls the bias-variance tradeoff by tuning how much weight to place on longer-horizon returns:
\(\lambda=0\) recovers the 1-step estimator \(\hat A_t^{(1)}=\delta_t\) (low variance, higher bias),
\(\lambda \to 1\) approaches Monte Carlo-style advantages (lower bias, higher variance),
intermediate values interpolate smoothly between these extremes (the sketch below checks the two endpoints numerically).
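A small endpoint check of the forward-sum definition within a single finite episode (toy numbers; the last value is set to zero because the episode ends there):

```python
import numpy as np

def gae_forward(deltas, gamma, lam):
    """Truncated forward sum: A_hat_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    T = len(deltas)
    return np.array([sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
                     for t in range(T)])

gamma = 0.99
rewards = np.array([1.0, 0.0, 0.0, 1.0])
values  = np.array([0.6, 0.5, 0.4, 0.7, 0.0])   # made-up V(s_0)..V(s_4); s_4 is terminal
deltas  = rewards + gamma * values[1:] - values[:-1]

# lambda = 0: each A_hat_t collapses to the single TD residual delta_t.
print(np.allclose(gae_forward(deltas, gamma, lam=0.0), deltas))                 # True

# lambda = 1: A_hat_t equals the Monte Carlo advantage G_t - V(s_t).
G = np.array([sum(gamma ** l * rewards[t + l] for l in range(len(rewards) - t))
              for t in range(len(rewards))])
print(np.allclose(gae_forward(deltas, gamma, lam=1.0), G - values[:-1]))        # True
```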
8. Backward Recursive Form (Used in PPO)#
From the series definition:

$$
\hat A_t = \delta_t + (\gamma \lambda)\, \delta_{t+1} + (\gamma \lambda)^2\, \delta_{t+2} + \cdots
$$

we get the recursion:

$$
\hat A_t = \delta_t + \gamma \lambda\, \hat A_{t+1},
$$
initialized with \(\hat A_{T+1} = 0\) past the end of the rollout, and with the bootstrap term masked at terminal states.
This explains why GAE is computed backward in time.
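Below is a minimal NumPy sketch of this backward pass in the form commonly found in PPO implementations; the array layout (`values` carries one extra bootstrap entry) and the `dones` masking convention are assumptions of this sketch rather than a fixed standard.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward-recursive GAE: A_hat_t = delta_t + gamma * lam * (1 - done_t) * A_hat_{t+1}.

    rewards, dones: length T.  values: length T + 1 (last entry is the bootstrap
    value of the state after the final step).  dones[t] = 1.0 marks termination
    at step t and masks both the bootstrap and the recursion across episodes."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    last_adv = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last_adv = delta + gamma * lam * mask * last_adv
        advantages[t] = last_adv
    return advantages

adv = compute_gae(rewards=np.array([1.0, 0.0, 1.0]),
                  values=np.array([0.5, 0.4, 0.8, 0.0]),
                  dones=np.array([0.0, 0.0, 1.0]))
print(adv)

# Value-function regression targets are typically recovered as advantages + values[:-1].
```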
9. Summary: Why Advantage and GAE#
Replacing \(R(\tau)\) with \(G_t\) improves credit assignment
Subtracting \(V(s_t)\) reduces variance
TD residuals approximate advantages efficiently
GAE interpolates between low-variance TD and low-bias Monte Carlo
PPO uses GAE for stable, sample-efficient policy optimization
This completes the full derivation from trajectory return to GAE-based advantage estimation.