REPPO - Why build a new algorithm?
Recently, some collaborators and I thought to ourselves: what if we built a new on-policy RL algorithm? The result, Relative Entropy Pathwise Policy Optimization (REPPO), is a great algorithm that you should check out and use in your own work! To understand how it works, we wrote a blog post that covers all the technical nitty-gritty. This post, however, is a more tongue-in-cheek explanation of why we felt it was necessary to wade into a very crowded field. So read on for the history of REPPO.
No on-policy algorithm can be built without dealing, somehow, with PPO. Since I assume you are probably familiar with PPO, I will tell the origin story of REPPO by contrasting it with PPO. Other angles are possible, perhaps even more interesting ones, but given that our trying to unseat PPO is like ants trying to fight a T-Rex, this is the path we will take. I hope it is entertaining 😀.
PPO reigns supreme. No other RL algorithm has conquered as many domains as Proximal Policy Optimization, the fast, on-policy policy-gradient scheme devised by John Schulman[^1] et al. at OpenAI in 2017. And while various updates have been made to PPO over the years, it is still the undefeated backbone of applied RL. That doesn't mean no other RL algorithms have been developed since, but none have so far unseated the reigning champion: PPO.
To understand why PPO has become so widespread, we need to appreciate (and embrace!) three of its strengths in detail:
- PPO is simple. While some details of the PPO algorithm can be fiddly (see the unforgettable blog post on the implementation details of PPO), its core components can easily be implemented even by people with only limited knowledge of RL (see the sketch of the clipped objective after this list). Again, this doesn't mean that reasoning about the clipped objective itself is simple; many, many theoretical papers have been written to scrutinize, criticize, and adapt it. But it is simple to code. Many off-policy algorithms need bookkeeping for more complicated components such as replay buffers (and don't even get me started on model-based RL methods), while PPO can be run with minimal coding overhead.
- PPO is fast[^2] and flexible. PPO can be used to train both remarkably tiny networks (such as 2-layer, 64-dimensional controllers for robotics tasks) and massive modern behemoths like Large Language Models. With well-implemented environments, PPO implementations can achieve amazing results in seconds to minutes on relatively complex benchmarks, and PPO can be applied with minimal adaptation to problems as diverse as game playing, language reasoning, robotic control, and even multi-agent systems.
- PPO works[^3]. This is the biggest and most interesting aspect. As mentioned, PPO has been successfully applied to a staggering number of RL problems. This leads many people to conclude that we have fundamentally “solved” RL: PPO will work, and everything else is engineering around it.
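To make the "simple to code" point concrete, here is a minimal sketch of PPO's clipped surrogate objective. It is not taken from any particular codebase; the function name, argument names, and the choice of PyTorch are my own illustrative assumptions, and the clip range of 0.2 is just the commonly used default.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Minimal sketch of the PPO clipped surrogate loss (names are illustrative).

    All arguments are 1-D tensors over a batch of sampled state-action pairs.
    """
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space for stability.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped surrogate: importance-weighted advantage.
    unclipped = ratio * advantages
    # Clipped surrogate: the ratio is restricted to [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the pessimistic (element-wise minimum) surrogate;
    # negate and average so a standard optimizer can minimize it.
    return -torch.min(unclipped, clipped).mean()
```

A full PPO update adds a value-function loss, an entropy bonus, advantage estimation (typically GAE), and several minibatch epochs over each rollout, but the policy objective itself really is just these few lines.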
However, it works[^3]… if you invest a huge amount of effort into tuning. Because here is the thing: PPO is brittle! While a working PPO configuration likely exists for almost any RL problem out there, getting the hyperparameters even a tiny bit wrong can lead to terrible results. And this is where our story starts!
Time to come clean: I hate PPO. This is a (slightly) irrational, petty, but (somewhat) justified hatred. I won't judge you if you love PPO; I know my opinion is unpopular :D. But hear me out!
My view on the matter is this: PPO is the worst RL algorithm that actually works, and that made people stop looking for better alternatives! Let me make an analogy: imagine you want to build a house, but you have no working hammer. People have been making hammers for a while and they simply don't work. You hit nails and they don't go in. Now, in 2017, a group of researchers suddenly comes along and gives you a hammer. It's clunky, prone to breaking, and you have to hit the nail just right, but then it just works. After this, many people, understandably, won't really care whether a better hammer is possible; they just want to finally get to whacking some interesting nails with it. And I respect the builders who care more about the product than the tools! So while I think PPO is bad, I have to admit that it does its job, and that is enough for many tasks.
But I personally care deeply about hammer design, and I think all of our lives would be much, much better with a more ergonomic hammer. So I am here to convince you, master builders, that we have done the annoying job for you and you can simply replace your old clunky hammer with a shiny new one[^4].
Enough metaphors: if you want to understand the REPPO algorithm, check out our main blog post. And if you just want to use our super amazing PPO killer, check out https://github.com/cvoelcker/reppo.
[^1]: Schulman has also written one of my favorite think-pieces on how to conduct research. Check it out!

[^2]: If your environment is fast.

[^3]: And if it doesn't work, you get to come back and shout at us. And since our careers are not already in the stratosphere, unlike for example Schulman's (no shade on him though; he has contributed amazing things to RL and deserves it), we'll even help you fix it 😀.