PPO From Scratch: From Expected Return to the Clipped Objective#
This tutorial builds Proximal Policy Optimization (PPO) step by step:
Start from the objective: maximize the expected total reward, written as a probability-weighted average over rollouts.
Derive the policy gradient using the log-derivative trick: \(\nabla_\theta J(\theta)=\mathbb{E}[\nabla_\theta \log p_\theta(\tau)\,R(\tau)]\).
Expand \(\log p_\theta(\tau)\) into a sum of \(\log \pi_\theta(a_t\mid s_t)\).
Derive PPO’s surrogate objective using the probability ratio \(r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\), and then the clipped PPO objective.
1. Objective as a Weighted Average Over Rollouts#
Consider a stochastic policy \(\pi_\theta(a\mid s)\). A rollout (trajectory) is
\[
\tau=(s_0,a_0,r_0,\,s_1,a_1,r_1,\,\dots,\,s_T,a_T,r_T).
\]
Define the (discounted) return of a rollout:
\[
R(\tau)=\sum_{t=0}^{T}\gamma^{t}\,r_t .
\]
The probability of a rollout under policy \(\pi_\theta\) in an MDP with dynamics \(P\) and initial-state distribution \(\rho_0\) is
\[
p_\theta(\tau)=\rho_0(s_0)\prod_{t=0}^{T}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t).
\]
The RL objective is expected return under rollouts induced by the policy:
\[
J(\theta)=\mathbb{E}_{\tau\sim p_\theta}\big[R(\tau)\big]=\sum_{\tau} p_\theta(\tau)\,R(\tau).
\]
That last expression is exactly a weighted average over all rollouts: each rollout’s reward \(R(\tau)\) weighted by its probability \(p_\theta(\tau)\).
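To make the weighted average concrete, here is a tiny sketch in Python. The two-step, two-action setup, its action probabilities, and its rewards are all invented for illustration; the point is only that \(J(\theta)\) is literally \(\sum_\tau p_\theta(\tau)R(\tau)\).

```python
import itertools

# A toy, fully enumerable setting: two time steps, two actions per step.
# Probabilities and rewards below are made up purely for illustration.
actions = [0, 1]
pi = {0: 0.7, 1: 0.3}          # pi_theta(a | s): same at every step for simplicity
reward = {0: 1.0, 1: 2.0}      # deterministic reward per action

expected_return = 0.0
for tau in itertools.product(actions, repeat=2):   # all 4 possible rollouts
    p_tau = 1.0
    R_tau = 0.0
    for a in tau:
        p_tau *= pi[a]         # p_theta(tau) = product of action probabilities
        R_tau += reward[a]     # R(tau) = sum of rewards (gamma = 1 here)
    expected_return += p_tau * R_tau

print(expected_return)         # J(theta) as a probability-weighted average
```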
2. Derivation: The Policy Gradient via the Log-Derivative Trick#
Start from the definition
\[
J(\theta)=\sum_{\tau} p_\theta(\tau)\,R(\tau).
\]
Differentiate with respect to \(\theta\):
\[
\nabla_\theta J(\theta)=\sum_{\tau}\nabla_\theta p_\theta(\tau)\,R(\tau).
\]
Use the log-derivative trick:
\[
\nabla_\theta p_\theta(\tau)=p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau).
\]
Plug it in:
\[
\nabla_\theta J(\theta)=\sum_{\tau} p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau)\,R(\tau)=\mathbb{E}_{\tau\sim p_\theta}\big[\nabla_\theta\log p_\theta(\tau)\,R(\tau)\big].
\]
Now approximate the expectation with a Monte Carlo sample average. If you sample \(N\) rollouts \(\{\tau_i\}_{i=1}^N\) from \(\pi_\theta\):
\[
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^{N}\nabla_\theta\log p_\theta(\tau_i)\,R(\tau_i).
\]
This is the classic REINFORCE-style gradient at the trajectory level.
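As a sanity check of this estimator, here is a minimal sketch in a toy two-step, two-action setting (all numbers invented). The toy has no stochastic dynamics, so \(\log p_\theta(\tau)\) reduces to the sum of action log-probabilities; the general cancellation of environment terms is shown in the next section.

```python
import torch

torch.manual_seed(0)

# Toy setup: two steps, two actions, deterministic reward per action (illustrative numbers).
logits = torch.nn.Parameter(torch.zeros(2))         # stand-in for theta (uniform policy)
reward = torch.tensor([1.0, 2.0])

def mc_policy_gradient(n_rollouts=50_000):
    """Monte Carlo estimate of grad J(theta) = E[grad log p_theta(tau) * R(tau)]."""
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((n_rollouts, 2))           # two sampled actions per rollout
    log_p_tau = dist.log_prob(actions).sum(dim=1)    # log p_theta(tau) (no dynamics terms in this toy)
    R_tau = reward[actions].sum(dim=1)               # R(tau)
    surrogate = (log_p_tau * R_tau).mean()           # its gradient is the estimator above
    grad, = torch.autograd.grad(surrogate, logits)
    return grad

print(mc_policy_gradient())   # roughly [-0.5, 0.5] for the uniform policy in this toy
```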
3. Expand \(\log p_\theta(\tau)\) into a Sum of \(\log \pi_\theta\) Terms#
Recall
\[
p_\theta(\tau)=\rho_0(s_0)\prod_{t=0}^{T}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t).
\]
Take logs:
\[
\log p_\theta(\tau)=\log\rho_0(s_0)+\sum_{t=0}^{T}\log\pi_\theta(a_t\mid s_t)+\sum_{t=0}^{T}\log P(s_{t+1}\mid s_t,a_t).
\]
The environment terms \(\rho_0\) and \(P\) do not depend on \(\theta\), so their gradients vanish:
\[
\nabla_\theta\log p_\theta(\tau)=\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t).
\]
Substitute back into the policy gradient:
\[
\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim p_\theta}\Big[\Big(\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\Big)R(\tau)\Big].
\]
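A minimal PyTorch sketch of this expanded estimator, assuming pre-collected rollouts (all tensors below are placeholders, and the “policy” is a single categorical that ignores the state just to keep the example short):

```python
import torch

torch.manual_seed(0)

N, T, A = 4, 5, 3                                        # rollouts, steps, action-space size (illustrative)
logits = torch.nn.Parameter(torch.zeros(A))              # stand-in for the policy parameters theta
policy = torch.distributions.Categorical(logits=logits)  # ignores the state, purely for brevity

actions = torch.randint(0, A, (N, T))                    # sampled actions a_t for each rollout
returns = torch.randn(N)                                 # R(tau_i) for each rollout (fake numbers)

log_probs = policy.log_prob(actions)                     # log pi_theta(a_t | s_t), shape (N, T)
# Sum the per-step log-probs over each rollout, weight by its return, average over rollouts,
# and negate so that gradient *descent* on `loss` performs gradient *ascent* on J(theta).
loss = -(log_probs.sum(dim=1) * returns).mean()
loss.backward()
print(logits.grad)                                       # Monte Carlo estimate of -grad_theta J(theta)
```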
3.1 Advantage Form (Better Credit Assignment + Lower Variance)#
In practice, we replace the rollout-level return \(R(\tau)\) with a time-dependent signal such as an advantage:
\[
\nabla_\theta J(\theta)\approx\mathbb{E}_{\tau\sim p_\theta}\Big[\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A^{\pi}(s_t,a_t)\Big].
\]
Intuition: \(A^\pi(s_t,a_t)\) measures how much better (or worse) action \(a_t\) is compared with the policy’s baseline behavior at state \(s_t\).
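The text does not commit to a specific advantage estimator. As one common, simple choice (an assumption here, not the only option), compute discounted reward-to-go and subtract a learned value baseline \(V(s_t)\):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go G_t = sum_{k>=t} gamma^(k-t) r_k for one rollout."""
    g = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

# Illustrative data: one rollout's rewards and a (pretend) value baseline V(s_t).
rewards = np.array([1.0, 0.0, 0.0, 2.0])
values = np.array([1.5, 1.0, 1.2, 1.8])         # would come from a learned critic

advantages = rewards_to_go(rewards) - values     # A_t ~= G_t - V(s_t)
print(advantages)
```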
4. From Log-Probability to PPO’s Ratio \(r_t(\theta)\)#
PPO updates a new policy \(\pi_\theta\) using trajectories generated by an older policy \(\pi_{\theta_{\text{old}}}\). That means we are optimizing \(\pi_\theta\) with off-policy data. The standard fix is importance sampling: reweight each sample by how likely the new policy would have produced it compared to the old one.
Define the per-step probability ratio
\[
r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}.
\]
Interpretation:
if \(r_t(\theta)>1\), the new policy assigns higher probability to the sampled action than the old policy did, so its contribution is upweighted,
if \(r_t(\theta)<1\), the new policy is less likely to take that action, so its contribution is downweighted.
Connect it directly to log-probabilities:
\[
r_t(\theta)=\exp\big(\log\pi_\theta(a_t\mid s_t)-\log\pi_{\theta_{\text{old}}}(a_t\mid s_t)\big).
\]
So the chain is literally (see the code sketch after this list):
compute current log-prob \(\log\pi_\theta(a_t\mid s_t)\),
subtract stored old log-prob \(\log\pi_{\theta_{\text{old}}}(a_t\mid s_t)\),
exponentiate to get \(r_t(\theta)\).
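In code the chain is two lines (a minimal sketch; the log-probability tensors are placeholders standing in for the current policy’s output and the values stored in the rollout buffer):

```python
import torch

# Illustrative inputs: per-step log-probabilities for a batch of sampled actions.
new_log_probs = torch.tensor([-0.9, -1.6, -0.3])   # log pi_theta(a_t | s_t), current policy
old_log_probs = torch.tensor([-1.1, -1.2, -0.3])   # log pi_theta_old(a_t | s_t), stored at rollout time

# exp(log new - log old) = pi_theta / pi_theta_old
ratio = torch.exp(new_log_probs - old_log_probs)
print(ratio)   # >1 where the new policy is more likely to pick the sampled action, <1 where less
```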
5. The Unclipped Surrogate Objective (Importance-Weighted Policy Gradient)#
We want to improve \(\pi_\theta\) while using samples from \(\pi_{\theta_{\text{old}}}\).
Starting from the policy-gradient form in Section 3, we can rewrite the expectation under the behavior policy \(\pi_{\theta_{\text{old}}}\) using importance sampling:
\[
\nabla_\theta J(\theta)=\mathbb{E}_{t\sim\pi_{\theta_{\text{old}}}}\!\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A_t\Big]
=\mathbb{E}_{t\sim\pi_{\theta_{\text{old}}}}\!\big[r_t(\theta)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A_t\big].
\]
This is the theoretical justification for using old-policy data: the ratio makes the gradient estimate unbiased, and PPO then adds clipping to control variance and keep updates stable.
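A quick numerical check of the importance-sampling identity, using a toy three-outcome distribution unrelated to any environment: samples drawn under an “old” distribution, reweighted by new/old probabilities, recover the expectation under the “new” distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

values = np.array([1.0, 2.0, 3.0])       # f(x) for three outcomes
p_old = np.array([0.5, 0.3, 0.2])        # behavior distribution (data collected here)
p_new = np.array([0.2, 0.3, 0.5])        # target distribution we care about

exact = (p_new * values).sum()            # E_new[f] computed exactly

samples = rng.choice(3, size=200_000, p=p_old)
weights = p_new[samples] / p_old[samples]             # importance ratios
estimate = (weights * values[samples]).mean()         # E_old[(p_new/p_old) f] ~= E_new[f]

print(exact, estimate)                    # the two numbers should be close
```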
Define the surrogate
\[
L^{\text{PG}}(\theta)=\mathbb{E}_t\big[r_t(\theta)\,A_t\big].
\]
Here \(A_t\) is an advantage estimate and \(r_t(\theta)\) is the importance ratio. The connection to \(\nabla_\theta J(\theta)\) comes from the log-derivative trick,
\[
\nabla_\theta r_t(\theta)=\frac{\nabla_\theta\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}=r_t(\theta)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t),
\]
so differentiating \(L^{\text{PG}}\) gives
\[
\nabla_\theta L^{\text{PG}}(\theta)=\mathbb{E}_t\big[r_t(\theta)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A_t\big],
\]
which matches the importance-sampled form of \(\nabla_\theta J(\theta)\) above. So \(L^{\text{PG}}\) is a surrogate whose gradient equals the policy-gradient estimator.
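A minimal PyTorch sketch of the unclipped surrogate (illustrative tensors; `old_log_probs` and `advantages` stand in for values stored in the rollout buffer and carry no gradient, so autograd differentiates only through the current policy):

```python
import torch

torch.manual_seed(0)

logits = torch.nn.Parameter(torch.zeros(3))                  # stand-in policy parameters theta
policy = torch.distributions.Categorical(logits=logits)

actions = torch.tensor([0, 2, 1, 0])                         # sampled under the old policy
old_log_probs = torch.tensor([-1.1, -1.0, -1.2, -1.1])       # stored at collection time (no grad)
advantages = torch.tensor([0.5, -0.2, 1.0, 0.1])             # advantage estimates (no grad)

new_log_probs = policy.log_prob(actions)                     # log pi_theta(a_t | s_t)
ratio = torch.exp(new_log_probs - old_log_probs)             # r_t(theta)

surrogate = (ratio * advantages).mean()                      # L^PG(theta)
(-surrogate).backward()                                      # maximize by minimizing the negative
print(logits.grad)                                           # importance-sampled policy gradient (negated)
```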
6. Why PPO Modifies the Surrogate: Too-Big Updates#
If you maximize \(\mathbb{E}[r_t(\theta)A_t]\) directly, the ratio \(r_t(\theta)\) can become very large or very small. This can cause:
unstable training,
destructive updates from noisy advantage estimates,
collapsed policies (especially in high-dimensional action spaces).
TRPO addresses this with an explicit KL constraint; PPO instead uses a simpler, effective heuristic: clipping.
7. PPO Clipped Objective#
Define the clipped ratio:
\[
\operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big).
\]
PPO’s clipped objective is:
\[
L^{\text{CLIP}}(\theta)=\mathbb{E}_t\Big[\min\big(r_t(\theta)\,A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big].
\]
7.1 What the Clipping Does (Key Intuition)#
There are two parts:
Clipping the ratio around 1. The ratio \(r_t(\theta)\) is centered at 1 because \(\pi_\theta=\pi_{\theta_{\text{old}}}\) implies no change. Clipping to \([1-\epsilon,\,1+\epsilon]\) prevents \(r_t(\theta)\) from drifting too far, limiting how much the new policy can change its action probabilities in a single update.
Taking the min. We compare the unclipped term \(r_t(\theta)A_t\) with the clipped term and keep the smaller one. This makes the objective conservative: if the policy tries to improve too aggressively, the clipped term takes over and stops extra gain.
Concretely (see the code sketch after this list):
If \(A_t>0\), the objective uses \(\min(r_t, \operatorname{clip}(r_t))A_t\), so once \(r_t(\theta)>1+\epsilon\), further increases no longer help.
If \(A_t<0\), the same min means once \(r_t(\theta)<1-\epsilon\), further decreases no longer help.
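A minimal sketch of the clipped loss in PyTorch (illustrative inputs; `eps` plays the role of \(\epsilon\), and the loss is negated so it can be minimized with a standard optimizer):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Negative of PPO's clipped surrogate, ready to minimize with an optimizer."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages                                # r_t * A_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # pessimistic (min), then negate

# Illustrative check: when the ratio overshoots 1 + eps and A_t > 0, the clipped branch caps the gain.
new_lp = torch.tensor([0.0, -0.1], requires_grad=True)
old_lp = torch.tensor([-0.5, -0.1])
adv = torch.tensor([1.0, -1.0])
print(ppo_clipped_loss(new_lp, old_lp, adv))
```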
8. One Clean Derivation Chain (From Expected Return → Log-Prob → PPO)#
Here is the full chain in one place.
Step 1: Rollout objective#
\[
J(\theta)=\mathbb{E}_{\tau\sim p_\theta}\big[R(\tau)\big]=\sum_{\tau} p_\theta(\tau)\,R(\tau).
\]
Step 2: Differentiate using the log-derivative trick#
\[
\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim p_\theta}\big[\nabla_\theta\log p_\theta(\tau)\,R(\tau)\big].
\]
Step 3: Expand \(\log p_\theta(\tau)\) (environment terms vanish)#
\[
\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim p_\theta}\Big[\Big(\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\Big)R(\tau)\Big].
\]
Step 4: Replace returns with advantages (variance reduction + credit assignment)#
\[
\nabla_\theta J(\theta)\approx\mathbb{E}\Big[\sum_{t=0}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A_t\Big].
\]
Step 5: Express the update using the ratio derived from log-probs#
\[
r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}=\exp\big(\log\pi_\theta(a_t\mid s_t)-\log\pi_{\theta_{\text{old}}}(a_t\mid s_t)\big).
\]
Step 6: Unclipped surrogate objective#
\[
L^{\text{PG}}(\theta)=\mathbb{E}_t\big[r_t(\theta)\,A_t\big].
\]
Step 7: PPO clipped objective#
\[
L^{\text{CLIP}}(\theta)=\mathbb{E}_t\Big[\min\big(r_t(\theta)\,A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big].
\]
That is PPO’s core idea: keep the benefits of a policy-gradient update while preventing overly large policy changes.
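As a closing illustration, here is a compact sketch of a single PPO policy update built from these pieces. The toy actor, the fake rollout buffer, and the hyperparameters are all invented for the example, and the value-function and entropy terms of a full PPO implementation are omitted.

```python
import torch

torch.manual_seed(0)

# Toy actor: maps 4-dim states to logits over 2 actions. Purely illustrative.
actor = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
eps = 0.2

# Fake rollout buffer collected under pi_theta_old.
states = torch.randn(64, 4)
actions = torch.randint(0, 2, (64,))
with torch.no_grad():
    old_log_probs = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
advantages = torch.randn(64)   # stand-in advantage estimates

for _ in range(4):   # a few optimization epochs over the same batch
    dist = torch.distributions.Categorical(logits=actor(states))
    new_log_probs = dist.log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()              # negative clipped surrogate

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```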