A very opinionated (and incomplete) guide to choosing your RL algorithm
Purely by Claas Voelcker (AI-free writing since 1995)
A lot of people want to know what the best RL algorithm is. Sadly, the answer to this is a bit like asking what the best woodworking tool is. It depends a lot on what you want to do, and if I say “saw”, some of my friends might get very angry at me. But luckily, I know a lot about RL (and admittedly very little about woodworking), so let’s go!
This guide is not meant to really explain all of the presented algorithms, and I will assume you are familiar with most general ideas in RL. I would need a couple of blog posts to go into detail on most of the algorithms mentioned here, and I might try to do that in the future. Instead, it is meant as a quick survey of algorithms that I have used (or at least thoroughly read about) in the past. Go read the papers and cite the authors; many of them are my friends, or other cool people whose work I happened to stumble across.
Outlining the space of algorithms
For this blog post, the main thing we will look at is how fast your environment is. We will break down everything based on this. The speed of environment sample gathering ranges all the way from “I don’t get to sample from my environment at all, I only have this fixed dataset” to “My simulator spits out so many samples I can fill petabytes of Google datacenters within minutes”. This gives us our major split of RL algorithms: offline RL (meaning you have to find a way to deal with a fixed dataset and you get nothing more), off-policy RL (new samples are possible to come by, but rather slow/expensive to gather, so you have to keep your old ones around), and on-policy RL (your samples are effectively free, so you gather new ones at every opportunity and save almost nothing). This is a spectrum and it will take some trial and error to see where your case lands. If we are talking about a rate of minutes to hours per sample, you will probably be dealing with a pure, or borderline, offline case. If you can generate several thousand samples in less than a second, you can probably afford to train purely on-policy. In between, we are in off-policy but not offline territory.
Quick note on nomenclature: On-policy means that all updates are done with samples gathered from the policy you are currently attempting to update. Off-policy simply means that you can use samples which were gathered by other policies, often older versions of your current policy. In that sense, every offline algorithm is also off-policy, with the complication that you get 0 samples from your current policy. To add a bit more confusion, some methods like PPO are not strictly on-policy, but so close to it that I (and most other people) still simply count them as on-policy methods.
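To make the distinction concrete, here is a minimal Python sketch of the two data regimes. Everything here is an illustrative placeholder, not a real library: the point is only that an on-policy update touches nothing but the freshest rollout, while an off-policy update happily reuses stale transitions from a buffer.

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)          # off-policy: keep old transitions around

def collect(policy, env_step, n):
    """Gather n transitions with the current policy (and remember them)."""
    batch = [env_step(policy) for _ in range(n)]
    buffer.extend(batch)
    return batch

def on_policy_batch(policy, env_step, n):
    # On-policy: update only on data gathered by the policy being updated.
    return collect(policy, env_step, n)

def off_policy_batch(policy, env_step, n):
    # Off-policy: a few fresh samples, the rest reused from the buffer.
    collect(policy, env_step, max(1, n // 10))
    return random.sample(buffer, min(n, len(buffer)))
```

Under this framing, offline RL is simply `off_policy_batch` with the `collect` call deleted.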
This split is important since, roughly speaking, off-policy learning is sample efficient but pretty unstable. If your samples are very cheap, you can simply train on-policy and worry much less about convergence. We will talk about why this happens a bit later.
The second major question you need to deal with is what your action space is. The big split here is between discrete and continuous action spaces. In general, a discrete action space means you can (but are not forced to) use a pure Q learning algorithm. A continuous action space means you probably have to use an actor critic method and keep at least two networks around. In principle, purely policy-based methods without a value function or critic are also possible, but the only widely used method that falls into this category is GRPO, and if you are doing language models, honestly, none of the rest here applies to you. You are probably going to use GRPO anyways, so my work here is done. Have fun, and don’t build ASI before I get a tenured faculty position please.
For this blog, I will focus on continuous actor-critic methods, mostly because that is where my expertise is. Discrete methods will (hopefully) get their own blogpost, whenever I have time to read up more on the wild west of DQN successor methods.
On-policy learning
The most dominant on-policy method by far is PPO1. PPO is fantastic under two conditions: samples are dirt cheap, and you want to spend 0% of your time working on your RL algorithm and 100% of your time on reward tuning. The reason for this is simple: PPO is the fastest algorithm in existence (at least if you can keep your networks small), and it is often awful at exploration (unless you have so many samples that exploration legitimately becomes a non-issue). If you managed to find your way to my blog, you probably know about PPO though, so I won’t waste too much of your time here.
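For completeness, PPO’s core clipped surrogate objective fits in a few lines of NumPy (function and variable names are mine). Clipping the importance ratio is what keeps each update close to the data-collecting policy, which is exactly why PPO can throw its samples away after a handful of epochs and stay approximately on-policy.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective (as a loss, to be minimized)."""
    ratio = np.exp(logp_new - logp_old)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the min makes the clip a pessimistic bound on the improvement.
    return -np.mean(np.minimum(unclipped, clipped))
```

If the new and old policies agree, the loss reduces to the plain policy-gradient surrogate; once the ratio drifts past 1 ± `clip_eps` in the favorable direction, the gradient is cut off.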
Moving along the spectrum of simulation, if you are willing to invest slightly more compute time per sample, but you still get enough to stay on-policy, you can choose between REPPO, FastTD3, and FastSAC. Since REPPO is my algorithm, I’m tempted to simply tell you to use REPPO :) and invite you to read my past blog post if you want to learn more about it (and why I don’t love PPO). However, aside from me liking my own stuff, REPPO, FastTD3, and FastSAC are all built on similar principles, as all three use slightly more compute to train a better critic. The main advantage of REPPO is that you do not need to keep around a giant replay buffer; you can stay strictly on-policy, while FastTD3/SAC are off-policy methods. On the other hand, FastTD3 and FastSAC currently have more support for cool robotics applications, but we are working hard to catch up.
Off-policy learning
Moving to the “fast, but not limitless” sampling regime (meaning environments in the range of 10 - 500 samples per second) we get to the venerable SAC and TD3 algorithms. Do yourself a favour and don’t stick with 2018 SAC, we have 2025 SAC now! You have a wonderful smorgasbord of choices here, among them BRO, BRC, Simba-v2, MrQ, XQC, and many more. Picking the best between them can be a bit tricky, because it is hard to predict which one will give you the best performance out of the box. If you are working on an established benchmark, you can simply check which one reports the best numbers. Otherwise, feel free to sample and see which one you like best, all of these are good methods. These modern off-policy methods all build off of either SAC or TD3, and share one important aspect, which is architectural regularization of the critic network. Between all of the listed ones, the exact architectures vary, but the general principles remain relatively consistent.
As an aside, we start to see an interesting pattern here where on-policy methods mostly focus on training the actor based on the real environment returns, and use the critic only as stabilization. The more off-policy you become, the more weight is placed on learning a correct Q function, with the actor taking on the more “auxiliary” role of finding the Q function’s maximum. This pattern completely reverses again in the most extreme offline case, where imitation learning is a purely actor focused method. This is why you will find much more emphasis on “actor gradient stability” in the on-policy literature, and “correct critic estimation” in the off-policy literature. Breaking this pattern was one of my main motivations for getting a critic-focused method working in on-policy RL, which led to REPPO. Aside from REPPO, I don’t know many other algorithms which try to step outside of this split, let me know if you know of good ones.
Very off-policy learning
As samples become even more expensive, we enter the high update-to-data ratio regime. Update-to-data ratio simply means “how many gradient steps do we take before we need to collect a new sample”. On-policy methods have an extremely low update-to-data ratio, often around 1/1000, while offline RL has an effectively infinite ratio (as you never collect new samples). High UTD algorithms operate in the range of 8 - 128 update steps per new sample, meaning the bulk of your computation time is shifting from simulation to backpropagation.
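The UTD ratio is literally just the inner-loop count of a training loop. A minimal skeleton (with `collect_transition` and `gradient_step` as illustrative placeholders) makes clear where the compute goes as the ratio grows:

```python
def train(collect_transition, gradient_step, num_env_steps, utd_ratio):
    """Toy high-UTD loop: utd_ratio gradient steps per environment sample."""
    updates = 0
    for _ in range(num_env_steps):
        collect_transition()            # one new environment sample...
        for _ in range(utd_ratio):      # ...amortized over many updates
            gradient_step()
            updates += 1
    return updates
```

At `utd_ratio=8` you already spend almost all wall-clock time in `gradient_step`; an on-policy method effectively runs this loop inverted, with many samples per update.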
Now is a good time to explain why off-policy learning is difficult, as these issues become more and more pressing with higher UTD. Obviously, telling the full story is a whole research area, but intuitively, the core of the issue boils down to a distribution shift between your training distribution and the states and actions that your current policy covers. Outside of the areas with good sample coverage, the critic can start overestimating how good actions truly are. The policy is then changed to execute these actions. If you can sample at a high rate, this overestimation will be quickly corrected with real evidence from your environment, but if you only get a few samples, overestimation becomes harder and harder to stop before it completely destabilizes learning. We explored this phenomenon in detail in our Dissecting and MAD-TD papers, check them out if you want to understand the technical details and see some experiments. The three major strategies for dealing with this are architectural regularization, model-based RL, or – in the offline case – explicit pessimistic regularization on out-of-distribution actions.
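You can see the overestimation mechanism in a ten-line Monte Carlo toy: if every action is truly worth zero but the critic’s estimates are noisy, picking the argmax preferentially selects positive noise, so the estimated maximum is biased upward. The clipped double-Q trick used by TD3 and SAC (taking the min of two independent critics before maximizing) trades a little pessimism for a much smaller bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000
true_q = np.zeros(n_actions)            # every action is truly worth 0

# A single noisy critic: max() picks whichever action got lucky noise,
# so the estimated "best" value is biased well above the true value of 0.
noisy = true_q + rng.normal(0, 1, size=(n_trials, n_actions))
single = noisy.max(axis=1).mean()

# Clipped double-Q: min over two independent critics before the max.
noisy2 = true_q + rng.normal(0, 1, size=(n_trials, n_actions))
double = np.minimum(noisy, noisy2).max(axis=1).mean()
```

In this toy, `single` lands around 1.5 (the expected max of ten standard normals) while `double` is noticeably smaller; neither is zero, which is exactly why fresh environment samples, regularization, or explicit pessimism are still needed on top.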
Of the methods listed above, BRO, Simba-v2, and XQC are all explicitly built to accommodate larger UTDs. I am relatively sure that the other methods can also work well at higher UTDs, the papers just do not explicitly test this. In general, increasing the UTD is a good way to get more learning out of a slow environment, but most environment/algorithm combinations hit a point of diminishing returns where more training does not increase the sample efficiency anymore. Where this point lies in your application will need some testing. It is generally a good idea to start with a UTD of around 1 and then to slowly increase if you find yourself waiting on your environment too much.
Model-based off-policy learning
In model-based RL, there is a huge zoo of methods and surveying even a reasonable fraction would blow up this blog post beyond reasonable size. Quite frankly, MBRL is also one of the areas where you can mix-and-match ideas from a bunch of different algorithms if you find that any one doesn’t fulfill your criteria, so the best method often ends up being a hybrid. The main bet you are making by picking a model-based algorithm is that your model generalizes better than your actor and critic. Then it can supply the missing state-action data to stabilize your critic learning, or be used for online planning2. On RL benchmarks, this is surprisingly often the case, probably because Q functions tend to be a bit hard to learn, and actor features barely generalize across training steps. This is also the case if you happen to have access to a nice frontier world model (and the compute to run it).
In some sense using a model does not actually need to change anything too fundamental in your RL setup. One of the main ways to use a (good) model is to simply treat it as a simulator, and then all of the previous points apply. The main question remains “how fast is it, and which algorithm benefits from the sampling speed the most?”
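The “model as simulator” idea is the classic DYNA scheme: branch cheap imagined transitions off each expensive real one and feed both to the same critic update. A hypothetical sketch, where `real_step`, `model_step`, and `update_critic` are all placeholders for whatever your algorithm of choice provides:

```python
def dyna_round(real_step, model_step, update_critic, model_rollouts=16):
    """One DYNA-style round: 1 real transition, model_rollouts imagined ones."""
    s, a, r, s_next = real_step()          # one expensive real transition
    update_critic(s, a, r, s_next)
    # Pad the update batch with imagined transitions branched from the
    # real state; this is where the model's generalization bet pays off.
    for _ in range(model_rollouts):
        a_m, r_m, s_m = model_step(s)
        update_critic(s, a_m, r_m, s_m)
```

Seen this way, the model is just an environment with a different (hopefully much higher) samples-per-second number, and the earlier spectrum applies unchanged.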
Looking at concrete model-based RL methods to start with, your main contenders these days are probably Dreamer-v3/4, and TD-MPC2. If you don’t need to deal with pixels, you definitely want to go with TD-MPC2. If you have to deal with pixels, Dreamer might be worth it, and TD-MPC2 remains a strong contender. I say “might be worth it” because Dreamer is an absolute beast to train in terms of time and size. However, to the best of my knowledge it is one of the few RL algorithms that is tuned to handle environments up to the complexity of Minecraft. Also, big shoutout to Nick, the author of TD-MPC2, for also providing the only open-source implementation of Dreamer-v4. No matter which one you pick, thank Nick3.
Offline learning
As sampling from your environment becomes impossible, we transition to offline RL. This space is relatively crowded, and my expertise becomes a bit limited. In general, the worse your state-action space coverage is, the more aggressively you need to regularize. If you have decent coverage, TD3+BC is a simple and well tested variant to try. As your samples become rarer, you want to transition to an algorithm which avoids querying the Q function outside of the known samples. This is where advantage-weighted regression methods become relevant. ReCOIL and UNI-RL are neat methods from my colleagues which go into depth on regularized value learning, or you can go with the extremely well cited IDQL. As an outsider, I have a hard time identifying which method is truly doing the best in offline RL, so we’ll have to ask some experts to contribute another blogpost :).
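TD3+BC is a good illustration of how light-touch offline regularization can be: the actor loss is just the usual “maximize Q” term plus a behavioral-cloning penalty, with an adaptive weight so one setting works across return scales. A NumPy sketch of that loss as I understand it from the paper (α = 2.5 is the paper’s default; the function name is mine):

```python
import numpy as np

def td3_bc_actor_loss(q_values, pi_actions, data_actions, alpha=2.5):
    """Sketch of the TD3+BC actor objective.

    q_values:     Q(s, pi(s)) for the current batch
    pi_actions:   actions proposed by the policy
    data_actions: the logged dataset actions for the same states
    """
    # Normalizing by the Q scale lets a single alpha balance the two terms
    # across environments with wildly different return magnitudes.
    lam = alpha / (np.abs(q_values).mean() + 1e-8)
    bc = np.mean((pi_actions - data_actions) ** 2)   # behavioral cloning term
    return -lam * q_values.mean() + bc
```

The BC term is what keeps the policy from querying the Q function far outside the dataset; advantage-weighted methods take this further by never evaluating Q on policy-proposed actions at all.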
Offline-to-online
One area that deserves some recognition as well, as it is becoming an important paradigm in all areas of RL (not just LLMs), is the offline pretraining to online finetuning pipeline. Here again, things depend on your sample budget. However, a second important question arises as well: is the majority of your policy learned during pre-training or during finetuning?
If you have a massive pretraining dataset, and you only want to use the post-training stage as a minor finetuning without changing your policy too much, you are probably looking to combine imitation learning with a very constrained RL method. For LLMs, this is the gold standard, IL/SFT + GRPO. In robotics RL my favorite of the well known methods is DSRL. Here, you essentially tweak the noise input to a diffusion model until the diffusion produces good actions, without changing the diffusion policy itself. There are a few other promising methods for steering a policy with RL ideas but without touching the pre-trained policy itself, among them V-GPS and UF-OPS.
If you are mostly interested in using your offline dataset to kickstart your online learning, and keeping your pretraining policy untouched is not as relevant, RLPD is a good bet. Fun fact: RLPD can be combined with pretty much anything from the off-policy section, you don’t have to stick to the implementation in the original paper.
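The part of RLPD that transplants most easily, as I read it, is its symmetric sampling: every training batch is half offline data, half online replay, so the offline dataset keeps anchoring the critic while fresh experience fixes its mistakes. A placeholder sketch (buffer contents and the 50/50 split hard-coded for illustration):

```python
import random

def symmetric_batch(offline_data, online_buffer, batch_size=256):
    """RLPD-style symmetric sampling: half offline, half online replay."""
    half = batch_size // 2
    batch = random.choices(offline_data, k=half)              # with replacement
    batch += random.choices(online_buffer, k=batch_size - half)
    random.shuffle(batch)
    return batch
```

Since this only changes where batches come from, you can bolt it onto any of the off-policy critics above without touching their losses.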
Final advice
One general word of advice: while RL algorithms love to pretend that they are complete and immutable packages, every single one is essentially a big bag of components, tricks, and some good old-fashioned hot glue. Absolutely nothing (except for the size of your GPU and your electricity bill) prevents you from pretraining Dreamer-v4 with UNIVR, then fine-tuning the policy with the model-based data, running online search/planning like TD-MPC at inference time, while also throwing in some architectural ideas from SIMBA4. There is no RL police that will stop you from mixing your favorite ideas and components. OK, the program chairs at RLC might tell you your paper is too convoluted, but that shouldn’t stop you from building a beautiful Frankenstein system. If you do manage to come up with a particularly wild combination of ideas from all of the different strands of work I mentioned here, do shoot me a mail, I would love to see it.
1. The people over at pufferlib will tell you that the relevant part of this blogpost is over here, but I hope you stick around :). ↩
2. There are many interesting connections between the training time use of world models to improve actor and critic (often following the DYNA scheme), and online planning. Let me know if you want to see a blog post on this topic where we pick a (civilized and friendly) fight with Yann LeCun and JEPA. ↩
3. The rhyme was a surprise, but a welcome one! ↩
4. Not gonna lie, this is likely to fail when done naively, but hey, you won’t know until you try it! ↩