PPO From Scratch: From Expected Return to the Clipped Objective#

This tutorial builds Proximal Policy Optimization (PPO) step by step:

  1. Start from the objective: maximize the expected total reward, expressed as a probability-weighted average over rollouts.

  2. Derive the policy gradient using the log-derivative trick: \(\nabla_\theta J(\theta)=\mathbb{E}[\nabla_\theta \log p_\theta(\tau)\,R(\tau)]\).

  3. Expand \(\log p_\theta(\tau)\) into a sum of \(\log \pi_\theta(a_t\mid s_t)\).

  4. Derive PPO’s surrogate objective using the probability ratio \(r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\), and then the clipped PPO objective.


1. Objective as a Weighted Average Over Rollouts#

Consider a stochastic policy \(\pi_\theta(a\mid s)\). A rollout (trajectory) is

\[ \tau = (s_0,a_0,s_1,a_1,\dots,s_T,a_T,s_{T+1}). \]

Define the (discounted) return of a rollout:

\[ R(\tau) = \sum_{t=0}^{T} \gamma^t\, r(s_t,a_t). \]

The probability of a rollout under policy \(\pi_\theta\) in an MDP with dynamics \(P\) is

\[ p_\theta(\tau)=\rho_0(s_0)\prod_{t=0}^{T}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t). \]

The RL objective is expected return under rollouts induced by the policy:

\[ J(\theta)=\mathbb{E}_{\tau\sim p_\theta(\tau)}[R(\tau)] =\sum_{\tau} p_\theta(\tau)\,R(\tau). \]

That last expression is exactly a weighted average over all rollouts: each rollout’s return \(R(\tau)\) weighted by its probability \(p_\theta(\tau)\).
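
To make the weighted average concrete, here is a minimal Monte Carlo sketch in Python: sample rollouts with the current policy, compute each discounted return, and average them. The `ToyEnv`, `random_policy`, and 10-step horizon are made up for illustration; they stand in for a real environment and a parameterized policy.

```python
# Monte Carlo estimate of J(theta) = E_tau[R(tau)]: sample N rollouts,
# compute each discounted return, and average.
import random

GAMMA = 0.99

class ToyEnv:
    """Minimal 2-action environment: action 1 pays +1, action 0 pays 0."""
    def reset(self):
        self.t = 0
        return 0  # single dummy state

    def step(self, action):
        self.t += 1
        reward = float(action)          # r(s_t, a_t)
        done = self.t >= 10             # fixed horizon
        return 0, reward, done

def random_policy(state):
    return random.randint(0, 1)         # pi(a | s): uniform over {0, 1}

def discounted_return(rewards, gamma=GAMMA):
    # R(tau) = sum_t gamma^t r(s_t, a_t)
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_J(policy, env, n_rollouts=1000):
    returns = []
    for _ in range(n_rollouts):
        state, done, rewards = env.reset(), False, []
        while not done:
            state, reward, done = env.step(policy(state))
            rewards.append(reward)
        returns.append(discounted_return(rewards))
    return sum(returns) / len(returns)   # sample average ~ J(theta)

print(estimate_J(random_policy, ToyEnv()))
```

With enough rollouts the sample average converges to the weighted average \(J(\theta)\), without ever enumerating trajectories or knowing the dynamics \(P\).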


2. Derivation: The Policy Gradient via the Log-Derivative Trick#

Start from the definition

\[ J(\theta)=\sum_{\tau} p_\theta(\tau)\,R(\tau). \]

Differentiate with respect to \(\theta\):

\[ \nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta p_\theta(\tau)\,R(\tau). \]

Use the log-derivative trick:

\[ \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau). \]

Plug it in:

\[ \nabla_\theta J(\theta) = \sum_{\tau} p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\,R(\tau) = \mathbb{E}_{\tau\sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\,R(\tau)\right]. \]

Now approximate the expectation with a Monte Carlo sample average. If you sample \(N\) rollouts \(\{\tau_i\}_{i=1}^N\) from \(\pi_\theta\):

\[ \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,R(\tau_i). \]

This is the classic REINFORCE-style gradient at the trajectory level.


3. Expand \(\log p_\theta(\tau)\) into a Sum of \(\log \pi_\theta\) Terms#

Recall

\[ p_\theta(\tau)=\rho_0(s_0)\prod_{t=0}^{T}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t). \]

Take logs:

\[ \log p_\theta(\tau) =\log\rho_0(s_0)+\sum_{t=0}^{T}\log\pi_\theta(a_t\mid s_t)+\sum_{t=0}^{T}\log P(s_{t+1}\mid s_t,a_t). \]

The environment terms \(\rho_0\) and \(P\) do not depend on \(\theta\), so their gradients vanish:

\[ \nabla_\theta \log p_\theta(\tau)=\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t). \]

Substitute back into the policy gradient:

\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\left(\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t)\right)R(\tau)\right]. \]
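
This form is what makes the gradient computable: the dynamics terms have dropped out, so only policy log-probabilities are needed. Below is a minimal PyTorch sketch of the resulting REINFORCE-style loss; the network size, rollout shapes, and the random tensors standing in for collected data are illustrative assumptions, not part of the derivation.

```python
# REINFORCE-style loss whose gradient matches the Monte Carlo estimator:
# weight the sum of log pi(a_t | s_t) over each rollout by that rollout's
# return R(tau), average over rollouts, and negate so that gradient
# *descent* on the loss performs gradient *ascent* on J(theta).
import torch
import torch.nn as nn

N, T, STATE_DIM, N_ACTIONS = 8, 16, 4, 2
policy = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(), nn.Linear(32, N_ACTIONS))

# Stand-in rollout data; in practice these come from interacting with the env.
states = torch.randn(N, T, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (N, T))
returns = torch.randn(N)                       # R(tau_i), one scalar per rollout

logits = policy(states)                        # (N, T, N_ACTIONS)
dist = torch.distributions.Categorical(logits=logits)
log_probs = dist.log_prob(actions)             # log pi_theta(a_t | s_t), shape (N, T)

# (1/N) * sum_i [ (sum_t log pi(a_t | s_t)) * R(tau_i) ], negated
loss = -(log_probs.sum(dim=1) * returns).mean()
loss.backward()                                # gradients match the estimator above
```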

3.1 Advantage Form (Better Credit Assignment + Lower Variance)#

In practice, we replace the rollout-level return \(R(\tau)\) with a time-dependent signal such as an advantage:

\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t)\,A^\pi(s_t,a_t)\right]. \]

Intuition: \(A^\pi(s_t,a_t)\) measures how much better (or worse) action \(a_t\) is compared with the policy’s baseline behavior at state \(s_t\).
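
In code, the only change from the previous sketch is the weighting term: each per-step log-probability is multiplied by its own advantage estimate instead of the rollout's total return. A hypothetical sketch with placeholder tensors; how the advantages are estimated (e.g. returns minus a learned value baseline, or GAE) is outside this snippet.

```python
# Advantage-weighted policy-gradient loss.
import torch

N, T = 8, 16
log_probs = torch.randn(N, T, requires_grad=True)   # log pi_theta(a_t | s_t)
advantages = torch.randn(N, T)                       # A^pi(s_t, a_t) estimates

# (1/N) * sum_i sum_t log pi(a_t | s_t) * A(s_t, a_t), negated for descent
loss = -(log_probs * advantages).sum(dim=1).mean()
loss.backward()
```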


4. From Log-Probability to PPO’s Ratio \(r_t(\theta)\)#

PPO updates a new policy \(\pi_\theta\) using trajectories generated by an older policy \(\pi_{\theta_{\text{old}}}\). That means we are optimizing \(\pi_\theta\) with off-policy data. The standard fix is importance sampling: reweight each sample by how likely the new policy would have produced it compared to the old one.

Define the per-step probability ratio

\[ r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}. \]

Interpretation:

  • if \(r_t(\theta)>1\), the new policy assigns higher probability to the sampled action than the old policy did, so its contribution is upweighted,

  • if \(r_t(\theta)<1\), the new policy is less likely to take that action, so its contribution is downweighted.

Connect it directly to log-probabilities:

\[ r_t(\theta)=\exp\left(\log\pi_\theta(a_t\mid s_t)-\log\pi_{\theta_{\text{old}}}(a_t\mid s_t)\right). \]

So the chain is literally:

  • compute current log-prob \(\log\pi_\theta(a_t\mid s_t)\),

  • subtract stored old log-prob \(\log\pi_{\theta_{\text{old}}}(a_t\mid s_t)\),

  • exponentiate to get \(r_t(\theta)\).
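
That three-step chain maps directly onto code. A minimal PyTorch sketch, where the log-probability values are placeholders and the old log-probs are assumed to have been stored (and detached from the graph) when the rollout was collected:

```python
# r_t(theta) computed exactly as the three bullets describe:
# current log-prob minus stored old log-prob, then exponentiate.
import torch

new_log_probs = torch.tensor([-0.9, -1.2, -0.4], requires_grad=True)  # from pi_theta
old_log_probs = torch.tensor([-1.0, -1.0, -1.0])                      # fixed data, no gradient

ratio = torch.exp(new_log_probs - old_log_probs)   # r_t(theta)
print(ratio)  # >1 where the new policy is more likely to take the logged action
```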


5. The Unclipped Surrogate Objective (Importance-Weighted Policy Gradient)#

We want to improve \(\pi_\theta\) while using samples from \(\pi_{\theta_{\text{old}}}\).

Starting from the policy-gradient form in Section 3,

\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T}\nabla_\theta \log\pi_\theta(a_t\mid s_t)\,A^\pi(s_t,a_t)\right], \]

we can rewrite the expectation so that it is taken under the behavior policy \(\pi_{\theta_{\text{old}}}\), using per-step importance sampling on the action probabilities (the change in state visitation between the two policies is ignored, which is a good approximation as long as they remain close):

\[ \nabla_\theta J(\theta) = \mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\left[ \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} \,\nabla_\theta \log\pi_\theta(a_t\mid s_t)\,A^\pi(s_t,a_t) \right]. \]

This is the justification for using old-policy data: the ratio reweights each sample to account for the mismatch between the policy that collected the data and the policy being optimized, and PPO then adds clipping to keep the two policies close, which both controls variance and keeps the approximation accurate.

Define the surrogate

\[ L^{\text{PG}}(\theta)=\mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\left[r_t(\theta)\,A_t\right]. \]

Here \(A_t\) is an advantage estimate and \(r_t(\theta)\) is the importance ratio. The connection to \(\nabla_\theta J(\theta)\) comes from the log-derivative trick,

\[ \nabla_\theta \pi_\theta(a_t\mid s_t)=\pi_\theta(a_t\mid s_t)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t), \]

so differentiating \(L^{\text{PG}}\) gives

\[ \nabla_\theta L^{\text{PG}}(\theta) = \mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\left[ r_t(\theta)\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,A_t \right], \]

which matches the importance-sampled form of \(\nabla_\theta J(\theta)\) above. So \(L^{\text{PG}}\) is a surrogate whose gradient equals the policy-gradient estimator.
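
As a sketch, the surrogate is only a couple of lines once the log-probabilities and advantages are in hand; the tensors below are placeholders standing in for a collected batch.

```python
# Unclipped surrogate L^PG(theta) = E[ r_t(theta) * A_t ].
# Differentiating it reproduces the importance-weighted policy gradient,
# because grad r_t = r_t * grad log pi_theta.
import torch

new_log_probs = torch.randn(128, requires_grad=True)   # from the current policy
old_log_probs = torch.randn(128)                        # stored at collection time
advantages = torch.randn(128)                           # A_t estimates

ratio = torch.exp(new_log_probs - old_log_probs)        # r_t(theta)
surrogate = (ratio * advantages).mean()                 # L^PG(theta)
(-surrogate).backward()                                 # ascend the surrogate
```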


6. Why PPO Modifies the Surrogate: Too-Big Updates#

If you maximize \(\mathbb{E}[r_t(\theta)A_t]\) directly, the ratio \(r_t(\theta)\) can become very large or very small. This can cause:

  • unstable training,

  • destructive updates from noisy advantage estimates,

  • collapsed policies (especially in high-dimensional action spaces).

TRPO addresses this with an explicit KL constraint; PPO instead uses a simpler, effective heuristic: clipping.


7. PPO Clipped Objective#

Define the clipped ratio:

\[ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon). \]

PPO’s clipped objective is:

\[ L^{\text{CLIP}}(\theta) =\mathbb{E}_{t\sim \pi_{\theta_{\text{old}}}}\Big[ \min\big( r_t(\theta)A_t, \operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,A_t \big) \Big]. \]

7.1 What the Clipping Does (Key Intuition)#

There are two parts:

  • Clipping the ratio around 1. The ratio \(r_t(\theta)\) is centered at 1 because \(\pi_\theta=\pi_{\theta_{\text{old}}}\) implies no change. Clipping to \([1-\epsilon,\,1+\epsilon]\) prevents \(r_t(\theta)\) from drifting too far, limiting how much the new policy can change its action probabilities in a single update.

  • Taking the min. We compare the unclipped term \(r_t(\theta)A_t\) with the clipped term and keep the smaller one. This makes the objective conservative: if the policy tries to improve too aggressively, the clipped term takes over and there is no additional gain beyond the clip boundary.

Concretely:

  • If \(A_t>0\), the objective uses \(\min(r_t, \operatorname{clip}(r_t))A_t\), so once \(r_t(\theta)>1+\epsilon\), further increases no longer help.

  • If \(A_t<0\), the same min means that once \(r_t(\theta)<1-\epsilon\), further decreases no longer help. If instead \(r_t(\theta)\) rises above \(1+\epsilon\) while \(A_t<0\), the unclipped (more negative) term is the minimum, so the bad move is still penalized without a cap.
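
Here is a minimal PyTorch sketch of \(L^{\text{CLIP}}\) covering both cases; the batch tensors are placeholders, and \(\epsilon=0.2\) is the default clip range from the PPO paper.

```python
# PPO clipped objective for one batch.
import torch

EPS = 0.2                                               # clip range epsilon
new_log_probs = torch.randn(128, requires_grad=True)    # current policy log-probs
old_log_probs = torch.randn(128)                        # stored old log-probs
advantages = torch.randn(128)                           # A_t estimates

ratio = torch.exp(new_log_probs - old_log_probs)        # r_t(theta)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - EPS, 1.0 + EPS) * advantages

l_clip = torch.min(unclipped, clipped).mean()           # L^CLIP(theta)
(-l_clip).backward()                                    # maximize by minimizing the negative
```

Note that wherever the clipped branch wins the min and the ratio sits outside \([1-\epsilon,\,1+\epsilon]\), the gradient through `torch.clamp` is zero, which is exactly how the update on those samples is switched off.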


8. One Clean Derivation Chain (From Expected Return → Log-Prob → PPO)#

Here is the full chain in one place.

Step 1: Rollout objective#

\[ J(\theta)=\mathbb{E}_{\tau\sim p_\theta}[R(\tau)]. \]

Step 2: Differentiate using the log-derivative trick#

\[ \nabla J(\theta)=\mathbb{E}_{\tau\sim p_\theta}[\nabla\log p_\theta(\tau)\,R(\tau)]. \]

Step 3: Expand \(\log p_\theta(\tau)\) (environment terms vanish)#

\[ \nabla\log p_\theta(\tau)=\sum_t \nabla\log \pi_\theta(a_t\mid s_t). \]

Step 4: Replace returns with advantages (variance reduction + credit assignment)#

\[ \nabla J(\theta)=\mathbb{E}\left[\sum_t \nabla\log \pi_\theta(a_t\mid s_t)\,A_t\right]. \]

Step 5: Express the update using the ratio derived from log-probs#

\[ r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{old}}(a_t\mid s_t)} =\exp(\log\pi_\theta-\log\pi_{\text{old}}). \]

Step 6: Unclipped surrogate objective#

\[ L^{\text{PG}}(\theta)=\mathbb{E}_{t\sim \pi_{\text{old}}}[r_t(\theta)A_t]. \]

Step 7: PPO clipped objective#

\[ L^{\text{CLIP}}(\theta) =\mathbb{E}\left[\min\left(r_t(\theta)A_t,\;\operatorname{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]. \]

That is PPO’s core idea: keep the benefits of a policy-gradient update while preventing overly large policy changes.
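
As a closing sketch, here is the whole chain as one policy-update function: compute current log-probs, form the ratio against stored old log-probs, apply the clipped objective, and take a gradient step. It is deliberately minimal, with placeholder data and a made-up network; value-function training, advantage estimation, entropy bonuses, and the usual multiple minibatch epochs per batch are omitted.

```python
# One PPO policy update: log-probs -> ratio -> clipped objective -> gradient step.
import torch
import torch.nn as nn

def ppo_policy_update(policy, optimizer, states, actions,
                      old_log_probs, advantages, eps=0.2):
    dist = torch.distributions.Categorical(logits=policy(states))
    new_log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)

    ratio = torch.exp(new_log_probs - old_log_probs)       # r_t(theta)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    l_clip = torch.min(ratio * advantages, clipped * advantages).mean()

    optimizer.zero_grad()
    (-l_clip).backward()                                   # maximize L^CLIP
    optimizer.step()
    return l_clip.item()

# Toy usage with placeholder data standing in for a collected rollout batch.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
states = torch.randn(256, 4)
actions = torch.randint(0, 2, (256,))
with torch.no_grad():                                      # old log-probs are fixed data
    old_log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
advantages = torch.randn(256)
print(ppo_policy_update(policy, optimizer, states, actions, old_log_probs, advantages))
```

In a full implementation this update would typically be run for several epochs of minibatches over each collected batch, which is precisely the regime where the clipping keeps the new policy from drifting too far from \(\pi_{\theta_{\text{old}}}\).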