PPO From Scratch: From Expected Return to the Clipped Objective#

This tutorial builds Proximal Policy Optimization (PPO) step by step:

  1. Start from the objective: maximize the expected total reward, expressed as a probability-weighted average over rollouts.

  2. Derive the policy gradient using the log-derivative trick: \(\nabla_\theta J(\theta)=\mathbb{E}[\nabla_\theta \log p_\theta(\tau)\,R(\tau)]\).

  3. Expand \(\log p_\theta(\tau)\) into a sum of \(\log \pi_\theta(a_t\mid s_t)\).

  4. Derive PPO’s surrogate objective using the probability ratio \(r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\), and then the clipped PPO objective.


1. Objective as a Weighted Average Over Rollouts#

Consider a stochastic policy \(\pi_\theta(a\mid s)\). A rollout (trajectory) is

\[ \tau = (s_0,a_0,s_1,a_1,\dots,s_T,a_T,s_{T+1}). \]

Define the (discounted) return of a rollout:

\[ R(\tau) = \sum_{t=0}^{T} \gamma^t\, r(s_t,a_t). \]

The probability of a rollout under policy \(\pi_\theta\) in an MDP with dynamics \(P\) is

\[ p_\theta(\tau)=\rho_0(s_0)\prod_{t=0}^{T}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t). \]

The RL objective is expected return under rollouts induced by the policy:

\[ J(\theta)=\mathbb{E}_{\tau\sim p_\theta(\tau)}[R(\tau)] =\sum_{\tau} p_\theta(\tau)\,R(\tau). \]

That last expression is exactly a weighted average over all rollouts: each rollout’s return \(R(\tau)\) weighted by its probability \(p_\theta(\tau)\).
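
To make the weighted average concrete, here is a minimal Monte Carlo sketch in Python: sample rollouts with the current policy, compute each discounted return, and average them. The `ToyEnv`, `random_policy`, and 10-step horizon are made up for illustration; they stand in for a real environment and a parameterized policy.

```python
# Monte Carlo estimate of J(theta) = E_tau[R(tau)]: sample N rollouts,
# compute each discounted return, and average.
import random

GAMMA = 0.99

class ToyEnv:
    """Minimal 2-action environment: action 1 pays +1, action 0 pays 0."""
    def reset(self):
        self.t = 0
        return 0  # single dummy state

    def step(self, action):
        self.t += 1
        reward = float(action)          # r(s_t, a_t)
        done = self.t >= 10             # fixed horizon
        return 0, reward, done

def random_policy(state):
    return random.randint(0, 1)         # pi(a | s): uniform over {0, 1}

def discounted_return(rewards, gamma=GAMMA):
    # R(tau) = sum_t gamma^t r(s_t, a_t)
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_J(policy, env, n_rollouts=1000):
    returns = []
    for _ in range(n_rollouts):
        state, done, rewards = env.reset(), False, []
        while not done:
            state, reward, done = env.step(policy(state))
            rewards.append(reward)
        returns.append(discounted_return(rewards))
    return sum(returns) / len(returns)   # sample average ~ J(theta)

print(estimate_J(random_policy, ToyEnv()))
```

With enough rollouts the sample average converges to the weighted average \(J(\theta)\), without ever enumerating trajectories or knowing the dynamics \(P\).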


2. Derivation: The Policy Gradient via the Log-Derivative Trick#

Start from the definition

\[ J(\theta)=\sum_{\tau} p_\theta(\tau)\,R(\tau). \]

Differentiate with respect to \(\theta\):

\[ \nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta p_\theta(\tau)\,R(\tau). \]

Use the log-derivative trick:

\[ \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau). \]

Plug it in:

\[ \nabla_\theta J(\theta) = \sum_{\tau} p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\,R(\tau) = \mathbb{E}_{\tau\sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\,R(\tau)\right]. \]

Now approximate the expectation with a Monte Carlo sample average. If you sample \(N\) rollouts \(\{\tau_i\}_{i=1}^N\) from \(\pi_\theta\):

\[ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,R(\tau_i). \]

This is the classic REINFORCE-style gradient at the trajectory level.


3. Expand \(\log p_\theta(\tau)\) into a Sum of \(\log \pi_\theta\) Terms#

Recall

\[ p_\theta(\tau)=\rho_0(s_0)\prod_{t=0}^{T}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t). \]

Take logs:

\[ \log p_\theta(\tau) =\log\rho_0(s_0)+\sum_{t=0}^{T}\log\pi_\theta(a_t\mid s_t)+\sum_{t=0}^{T}\log P(s_{t+1}\mid s_t,a_t). \]

The environment terms \(\rho_0\) and \(P\) do not depend on \(\theta\), so their gradients vanish:

\[ \nabla_\theta \log p_\theta(\tau)=\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t). \]

Substitute back into the policy gradient:

\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\left(\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t)\right)R(\tau)\right]. \]
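
This form is what makes the gradient computable: the dynamics terms have dropped out, so only policy log-probabilities are needed. Below is a minimal PyTorch sketch of the resulting REINFORCE-style loss; the network size, rollout shapes, and the random tensors standing in for collected data are illustrative assumptions, not part of the derivation.

```python
# REINFORCE-style loss whose gradient matches the Monte Carlo estimator:
# weight the sum of log pi(a_t | s_t) over each rollout by that rollout's
# return R(tau), average over rollouts, and negate so that gradient
# *descent* on the loss performs gradient *ascent* on J(theta).
import torch
import torch.nn as nn

N, T, STATE_DIM, N_ACTIONS = 8, 16, 4, 2
policy = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(), nn.Linear(32, N_ACTIONS))

# Stand-in rollout data; in practice these come from interacting with the env.
states = torch.randn(N, T, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (N, T))
returns = torch.randn(N)                       # R(tau_i), one scalar per rollout

logits = policy(states)                        # (N, T, N_ACTIONS)
dist = torch.distributions.Categorical(logits=logits)
log_probs = dist.log_prob(actions)             # log pi_theta(a_t | s_t), shape (N, T)

# (1/N) * sum_i [ (sum_t log pi(a_t | s_t)) * R(tau_i) ], negated
loss = -(log_probs.sum(dim=1) * returns).mean()
loss.backward()                                # gradients match the estimator above
```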

3.1 Advantage Form (Better Credit Assignment + Lower Variance)#

In practice, we replace the rollout-level return \(R(\tau)\) with a time-dependent signal such as an advantage:

\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t)\,A^\pi(s_t,a_t)\right]. \]

Intuition: \(A^\pi(s_t,a_t)\) measures how much better (or worse) action \(a_t\) is compared with the policy’s baseline behavior at state \(s_t\).
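
In code, the only change from the previous sketch is the weighting term: each per-step log-probability is multiplied by its own advantage estimate instead of the rollout's total return. A hypothetical sketch with placeholder tensors; how the advantages are estimated (e.g. returns minus a learned value baseline, or GAE) is outside this snippet.

```python
# Advantage-weighted policy-gradient loss.
import torch

N, T = 8, 16
log_probs = torch.randn(N, T, requires_grad=True)   # log pi_theta(a_t | s_t)
advantages = torch.randn(N, T)                       # A^pi(s_t, a_t) estimates

# (1/N) * sum_i sum_t log pi(a_t | s_t) * A(s_t, a_t), negated for descent
loss = -(log_probs * advantages).sum(dim=1).mean()
loss.backward()
```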


4. From Log-Probability to PPO’s Ratio \(r_t(\theta)\)#

PPO updates a new policy \(\pi_\theta\) using trajectories generated by an older policy \(\pi_{\theta_{\text{old}}}\). That means we are optimizing \(\pi_\theta\) with off-policy data. The standard fix is importance sampling: reweight each sample by how likely the new policy would have produced it compared to the old one.

Define the per-step probability ratio

\[ r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}. \]

Interpretation:

  • if \(r_t(\theta)>1\), the new policy assigns higher probability to the sampled action than the old policy did, so its contribution is upweighted,

  • if \(r_t(\theta)<1\), the new policy is less likely to take that action, so its contribution is downweighted.

Connect it directly to log-probabilities:

\[ r_t(\theta)=\exp\left(\log\pi_\theta(a_t\mid s_t)-\log\pi_{\theta_{\text{old}}}(a_t\mid s_t)\right). \]

So the chain is literally:

  • compute current log-prob \(\log\pi_\theta(a_t\mid s_t)\),

  • subtract stored old log-prob \(\log\pi_{\theta_{\text{old}}}(a_t\mid s_t)\),

  • exponentiate to get \(r_t(\theta)\).
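
That three-step chain maps directly onto code. A minimal PyTorch sketch, where the log-probability values are placeholders and the old log-probs are assumed to have been stored (and detached from the graph) when the rollout was collected:

```python
# r_t(theta) computed exactly as the three bullets describe:
# current log-prob minus stored old log-prob, then exponentiate.
import torch

new_log_probs = torch.tensor([-0.9, -1.2, -0.4], requires_grad=True)  # from pi_theta
old_log_probs = torch.tensor([-1.0, -1.0, -1.0])                      # fixed data, no gradient

ratio = torch.exp(new_log_probs - old_log_probs)   # r_t(theta)
print(ratio)  # >1 where the new policy is more likely to take the logged action
```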


5. The Unclipped Surrogate Objective (Importance-Weighted Policy Gradient)#

We want to improve \(\pi_\theta\) while using samples from \(\pi_{\theta_{\text{old}}}\).

Starting from the policy-gradient form in Section 3,

\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t)\,A^\pi(s_t,a_t)\right], \]

we can rewrite the expectation so that it is taken under the behavior policy \(\pi_{\theta_{\text{old}}}\), using per-step importance sampling on the action probabilities (the change in state visitation between the two policies is ignored, which is a good approximation as long as they remain close):

\[ \nabla_\theta J(\theta) = \mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\left[ \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \,\nabla_\theta \log\pi_\theta(a_t\mid s_t)\,A^\pi(s_t,a_t) \right]. \]

This is the justification for using old-policy data: the ratio reweights each sample to account for the mismatch between the policy that collected the data and the policy being optimized, and PPO then adds clipping to keep the two policies close, which both controls variance and keeps the approximation accurate.

Define the surrogate

\[ L^{\text{PG}}(\theta)=\mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\left[r_t(\theta)\,A_t\right]. \]

Here \(A_t\) is an advantage estimate and \(r_t(\theta)\) is the importance ratio. The connection to \(\nabla_\theta J(\theta)\) comes from the log-derivative trick,

\[ \nabla_\theta \pi_\theta(a_t\mid s_t)=\pi_\theta(a_t\mid s_t)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t), \]

so differentiating \(L^{\text{PG}}\) gives

\[ \nabla_\theta L^{\text{PG}}(\theta) = \mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\left[ r_t(\theta)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A_t \right], \]

which matches the importance-sampled form of \(\nabla_\theta J(\theta)\) above. So \(L^{\text{PG}}\) is a surrogate whose gradient equals the policy-gradient estimator.
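
As a sketch, the surrogate is only a couple of lines once the log-probabilities and advantages are in hand; the tensors below are placeholders standing in for a collected batch.

```python
# Unclipped surrogate L^PG(theta) = E[ r_t(theta) * A_t ].
# Differentiating it reproduces the importance-weighted policy gradient,
# because grad r_t = r_t * grad log pi_theta.
import torch

new_log_probs = torch.randn(128, requires_grad=True)   # from the current policy
old_log_probs = torch.randn(128)                        # stored at collection time
advantages = torch.randn(128)                           # A_t estimates

ratio = torch.exp(new_log_probs - old_log_probs)        # r_t(theta)
surrogate = (ratio * advantages).mean()                 # L^PG(theta)
(-surrogate).backward()                                 # ascend the surrogate
```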


6. Why PPO Modifies the Surrogate: Too-Big Updates#

If you maximize \(\mathbb{E}[r_t(\theta)A_t]\) directly, the ratio \(r_t(\theta)\) can become very large or very small. This can cause:

  • unstable training,

  • destructive updates from noisy advantage estimates,

  • collapsed policies (especially in high-dimensional action spaces).

TRPO addresses this with an explicit KL constraint; PPO instead uses a simpler, effective heuristic: clipping.


7. PPO Clipped Objective#

Define the clipped ratio:

\[ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon). \]

PPO’s clipped objective is:

\[ L^{\text{CLIP}}(\theta) =\mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\Big[ \min\big( r_t(\theta)A_t, \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,A_t \big) \Big]. \]

7.1 What the Clipping Does (Key Intuition)#

There are two parts:

  • Clipping the ratio around 1. The ratio \(r_t(\theta)\) is centered at 1 because \(\pi_\theta=\pi_{\theta_{\text{old}}}\) implies no change. Clipping to \([1-\epsilon,\,1+\epsilon]\) prevents \(r_t(\theta)\) from drifting too far, limiting how much the new policy can change its action probabilities in a single update.

  • Taking the min. We compare the unclipped term \(r_t(\theta)A_t\) with the clipped term and keep the smaller one. This makes the objective conservative: if the policy tries to improve too aggressively, the clipped term takes over and there is no additional gain beyond the clip boundary.

Concretely:

  • If \(A_t>0\), the objective uses \(\min(r_t, \operatorname{clip}(r_t))A_t\), so once \(r_t(\theta)>1+\epsilon\), further increases no longer help.

  • If \(A_t<0\), the same min means that once \(r_t(\theta)<1-\epsilon\), further decreases no longer help. If instead \(r_t(\theta)\) rises above \(1+\epsilon\) while \(A_t<0\), the unclipped (more negative) term is the minimum, so the bad move is still penalized without a cap.
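
Here is a minimal PyTorch sketch of \(L^{\text{CLIP}}\) covering both cases; the batch tensors are placeholders, and \(\epsilon=0.2\) is the default clip range from the PPO paper.

```python
# PPO clipped objective for one batch.
import torch

EPS = 0.2                                               # clip range epsilon
new_log_probs = torch.randn(128, requires_grad=True)    # current policy log-probs
old_log_probs = torch.randn(128)                        # stored old log-probs
advantages = torch.randn(128)                           # A_t estimates

ratio = torch.exp(new_log_probs - old_log_probs)        # r_t(theta)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - EPS, 1.0 + EPS) * advantages

l_clip = torch.min(unclipped, clipped).mean()           # L^CLIP(theta)
(-l_clip).backward()                                    # maximize by minimizing the negative
```

Note that wherever the clipped branch wins the min and the ratio sits outside \([1-\epsilon,\,1+\epsilon]\), the gradient through `torch.clamp` is zero, which is exactly how the update on those samples is switched off.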


8. One Clean Derivation Chain (From Expected Return → Log-Prob → PPO)#

Here is the full chain in one place.

Step 1: Rollout objective#

\[ J(\theta)=\mathbb{E}_{\tau\sim p_\theta}[R(\tau)]. \]

Step 2: Differentiate using the log-derivative trick#

\[ \nabla J(\theta)=\mathbb{E}_{\tau\sim p_\theta}[\nabla\log p_\theta(\tau)\,R(\tau)]. \]

Step 3: Expand \(\log p_\theta(\tau)\) (environment terms vanish)#

\[ \nabla\log p_\theta(\tau)=\sum_t \nabla\log \pi_\theta(a_t\mid s_t). \]

Step 4: Replace returns with advantages (variance reduction + credit assignment)#

\[ \nabla J(\theta)=\mathbb{E}\left[\sum_t \nabla\log \pi_\theta(a_t\mid s_t)\,A_t\right]. \]

Step 5: Express the update using the ratio derived from log-probs#

\[ r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)} =\exp(\log\pi_\theta-\log\pi_{\text{old}}). \]

Step 6: Unclipped surrogate objective#

\[ L^{\text{PG}}(\theta)=\mathbb{E}_{t\sim \pi_{\text{old}}}[r_t(\theta)A_t]. \]

Step 7: PPO clipped objective#

\[ L^{\text{CLIP}}(\theta) =\mathbb{E}\left[\min\left(r_t(\theta)A_t,\;\operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]. \]

That is PPO’s core idea: keep the benefits of a policy-gradient update while preventing overly large policy changes.
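
As a closing sketch, here is the whole chain as one policy-update function: compute current log-probs, form the ratio against stored old log-probs, apply the clipped objective, and take a gradient step. It is deliberately minimal, with placeholder data and a made-up network; value-function training, advantage estimation, entropy bonuses, and the usual multiple minibatch epochs per batch are omitted.

```python
# One PPO policy update: log-probs -> ratio -> clipped objective -> gradient step.
import torch
import torch.nn as nn

def ppo_policy_update(policy, optimizer, states, actions,
                      old_log_probs, advantages, eps=0.2):
    dist = torch.distributions.Categorical(logits=policy(states))
    new_log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)

    ratio = torch.exp(new_log_probs - old_log_probs)       # r_t(theta)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    l_clip = torch.min(ratio * advantages, clipped * advantages).mean()

    optimizer.zero_grad()
    (-l_clip).backward()                                   # maximize L^CLIP
    optimizer.step()
    return l_clip.item()

# Toy usage with placeholder data standing in for a collected rollout batch.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
states = torch.randn(256, 4)
actions = torch.randint(0, 2, (256,))
with torch.no_grad():                                      # old log-probs are fixed data
    old_log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
advantages = torch.randn(256)
print(ppo_policy_update(policy, optimizer, states, actions, old_log_probs, advantages))
```

In a full implementation this update would typically be run for several epochs of minibatches over each collected batch, which is precisely the regime where the clipping keeps the new policy from drifting too far from \(\pi_{\theta_{\text{old}}}\).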