Trust Region Methods: From REINFORCE to TRPO to PPO
In the REINFORCE post, we built a policy gradient agent from scratch in NumPy and watched it learn CartPole. It worked — eventually. But the reward curve looked like a seismograph. One batch of unluck






