Reward Design and Termination
One of my newest pet peeves is reward design, or more precisely, minimum and maximum reward values and their interplay with termination and truncation. As RL is used in different communities such as robotics, video games, LLM fine-tuning, and more, different implicit standards emerge that often clash in hard-to-understand ways. I recently ran into a couple of interesting interactions; shout-out to Stone Tao and (indirectly) Joseph Suarez for helping me navigate them. All of the following is in principle well known in the community, and yet I still deal with environments on a near-daily basis that do not account for all of these problems.
What is a Termination?
The majority of my papers are written on locomotion environments. This is mostly an incidental choice; I am not a roboticist (yet). Rather, locomotion environments combine several attractive properties from the point of view of an RL researcher: we know that we can in principle get RL running on them (which is not true, e.g., for manipulation environments), the simulators are relatively robust, and the benchmarks are very commonly used, which makes direct algorithmic comparisons possible without much additional setup.
Locomotion environments are often a form of infinite horizon problem. In principle, we want the agent to be able to walk indefinitely. However, for practical reasons, we don’t always let the simulation just run forever. For example, in cases where the agent is not yet good at its job, it might fall over and get itself stuck in positions from which it is very hard to get up again. Therefore, pretty much all major locomotion benchmark implementations have some form of either time-based or failure-based reset condition, often both. Before we dive into the details, we need some technical notation. Assuming a standard off-policy Q learning objective, our general loss function looks something like this
\[\left|Q(x_t,a_t) - [r_t + \gamma Q(x_{t+1}, \pi(x_{t+1}))]_\mathrm{sg}\right|^2.\]
We will denote the time-based truncation flag at timestep \(t\) as \(t_t\) (for truncated) and the failure-based termination flag as \(d_t\) (for done).
Time-based Reset
The simplest form of reset is a time-based reset. In this scenario, we simply count the timesteps since the last reset, and once our counter has reached a threshold (often a large number like 1000), we reset the environment to its initial state. Crucially, such resets (often called truncation) are invisible to the agent. This means we do not stop the bootstrapping process, as the agent should learn as if the episode simply continued. Handling this requires some code changes, depending on the exact environment interface.
If the environment returns the actual final observation after a done signal (which is the default behavior in gym), we can simply store the transition and use it in our loss function. We get the observation of the new starting state after explicitly calling env.reset(). In this case, nothing changes in our loss!
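To make this concrete, here is a minimal collection-loop sketch using the gymnasium-style step API, which reports terminated and truncated separately; the environment id and the buffer are just placeholders, not a prescription.

```python
import gymnasium as gym

env = gym.make("Walker2d-v4")  # placeholder environment
replay_buffer = []             # placeholder for a real replay buffer

obs, _ = env.reset()
for _ in range(10_000):
    action = env.action_space.sample()  # stand-in for the policy
    next_obs, reward, terminated, truncated, info = env.step(action)

    # Without auto-reset, next_obs is the true final observation,
    # so the transition can be stored as-is.
    replay_buffer.append((obs, action, reward, next_obs, terminated, truncated))

    if terminated or truncated:
        # The first observation of the new episode comes from an explicit
        # reset, not smuggled in as next_obs.
        obs, _ = env.reset()
    else:
        obs = next_obs
```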
However, many wrappers that are designed for parallel simulation include an auto-reset. In that case, the observation returned together with the done signal is already the first observation of the new trajectory after the reset, so we are missing the true final transition. However, since we do not care much about the final state’s value, we can easily fix this by marking the whole transition as invalid in training.
We can do this by simply using the truncation signal as a mask. If the transition from timestep \(t\) to \(t+1\) contains a reset, marked by \(t_t = 1\), we can compute the loss as follows
\[(1.0 - t_t) \left|Q(x_t,a_t) - [r_t + \gamma Q(x_{t+1}, \pi(x_{t+1}))]_\mathrm{sg}\right|^2.\]
Failure-based Reset
While a time-based reset strategy is often easy to handle in code (especially without auto-reset), it can lead to some training difficulties. If we have a long reset interval, but the agent falls over in the first few timesteps, we often collect a lot of unnecessary data as the agent flops helplessly on the floor. Therefore, some benchmarks add a termination condition to the environment, which returns a done signal and resets the environment once certain conditions are met. For a bipedal walker, for example, it is easy to check that the robot hasn’t fallen over by requiring its center of mass to stay above a certain height threshold.
In the case of a done signal, we have to terminate the bootstrap, as the implicit message to the agent is that it has done something wrong and should not be receiving rewards anymore. Therefore, we use the done signal at timestep \(t\), \(d_t\), to mask the next state’s value:
\[\left|Q(x_t,a_t) - [r_t + (1.0 - d_t) \gamma Q(x_{t+1}, \pi(x_{t+1}))]_\mathrm{sg}\right|^2.\]
So far, so intuitive!
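Putting both cases together, here is a minimal PyTorch-style sketch of the resulting TD loss; the networks, the policy, and the batch layout are placeholders rather than any specific library’s API.

```python
import torch

def td_loss(q_net, target_q_net, policy, batch, gamma=0.99):
    # batch: dict of tensors with keys obs, action, reward, next_obs,
    # terminated (d_t, failure-based) and truncated (t_t, time-based),
    # the last two stored as 0/1 floats.
    with torch.no_grad():
        next_action = policy(batch["next_obs"])
        next_q = target_q_net(batch["next_obs"], next_action)
        # Termination stops the bootstrap...
        target = batch["reward"] + gamma * (1.0 - batch["terminated"]) * next_q

    q = q_net(batch["obs"], batch["action"])
    per_sample = (q - target) ** 2
    # ...while truncation masks out the whole (invalid) transition.
    mask = 1.0 - batch["truncated"]
    return (mask * per_sample).sum() / mask.sum().clamp(min=1.0)
```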
Reward Design
The trouble starts with two innocuous-seeming decisions: introducing penalty terms to the rewards and terminating in finite goal-reaching environments.
Penalty Terms
If we train, e.g., a self-driving car to reach its destinations as quickly as possible, the ride will not be an enjoyable one, and it might become very expensive. The optimal strategy would involve lots of sudden and harsh acceleration and braking, and a huge amount of wasted fuel. Therefore, it is often necessary to (softly) constrain the policy space of the agent by penalizing aggressive behavior. A relatively standard and simple way of doing this is to add an action penalty \(-\|a\|_2^2\) to the reward, which simply subtracts the squared norm of the current action vector. Assuming larger entries in the vector correspond to “stronger” actions, this puts a penalty on sudden jerks and movement.
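As a small sketch (the goal reward and penalty weight below are made-up numbers, not values from any benchmark), such a penalized reward might look like this:

```python
import numpy as np

def penalized_reward(goal_reached: bool, action: np.ndarray,
                     goal_reward: float = 1.0, penalty_weight: float = 0.1) -> float:
    # Sparse task reward minus a squared action-norm penalty.
    task_reward = goal_reward if goal_reached else 0.0
    return task_reward - penalty_weight * float(np.sum(action ** 2))
```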
However, what happens if we have an environment with a sparse reward and termination on failure? If the agent only receives reward if it actually reaches its goal, incurs an action penalty otherwise, and terminates on a crash, a couple of behaviors can emerge:
- Do nothing: The action penalty term is so high and the reward so far away that the agent learns to simply stay still after some training. This myopically maximizes its return, but removes all exploration.
- Crash to end negative: If we give the agent a negative reward for not having reached the goal yet (to avoid the first problem), it can use the termination to “escape” the ongoing punishment. This is obviously the opposite of the behavior we intended the agent to follow. In the presence of negative rewards, termination becomes a goal!
Goal-reaching Environments
Let’s assume that the agent actually reaches its goal. The episode is over and it gets its reward; all is fine in the agent’s world. We can also terminate the interaction, as the goal is reached and nothing interesting will happen.
However, as we just discovered, a sparse reward can lead to all sorts of problems. So let’s densify it.
One simple idea would be to take the distance to our goal as a shaping term and reward the agent according to its distance to the goal state \(s_g\), leading to the following reward: \(e^{-d(s,s_g)}\). The exact form doesn’t matter, only the general idea.
When we run an agent to optimize this objective, a very strange phenomenon emerges: the agent will get close to the goal but never actually attempt to reach it. To understand this, we need to take a look at our Q function. Let’s say the agent can get a reward of 1 for reaching the goal, and 0.9 for getting close to it. The value of the goal state will be \(1\), as the bootstrap ends there. However, the value of the near-goal state, if the agent simply hovers there and keeps collecting the near-goal reward, will be \(\frac{1}{1-\gamma} \cdot 0.9\). Assuming, e.g., a discount factor of \(\gamma = 0.9\), the value of almost reaching our goal is \(\frac{1}{1-0.9} \cdot 0.9 = 9\), much higher than the value of actually reaching it.
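The arithmetic is easy to check with a few lines of Python, mirroring the numbers above:

```python
gamma = 0.9
r_goal, r_near = 1.0, 0.9

# Reaching the goal terminates the episode, so the bootstrap stops
# and the value is just the final reward.
v_goal = r_goal

# Hovering near the goal forever is a geometric series of near-goal rewards.
v_near = r_near / (1.0 - gamma)

print(v_goal, v_near)  # 1.0 vs 9.0
```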
The termination again acts like a punishment for completing the task! There are three possible solutions:
- Give a reward for reaching the goal that is larger than \(\frac{1}{1-\gamma} r_\mathrm{near-goal}.\) This can lead to some discontinuity for learning agents and complicate things.
- Truncate the trajectory instead of terminating it and add a self-transition at the final state. This way, the bootstrap is not actually stopped (per the truncation rules above) and the agent assumes it can collect the final reward “in perpetuity”.
- Shape the reward in such a way that taking an action that does not move the agent closer to the goal does not receive reward. For example, instead of rewarding the distance to the goal, you could use a shaped reward that rewards the delta between the previous and the next state, \(r(s_t, s_{t+1}, s_g) = e^{-d(s_{t+1},s_g)} - e^{-d(s_t,s_g)}\) (a small sketch follows below).
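Here is a minimal sketch of that third option, assuming a Euclidean distance in state space:

```python
import numpy as np

def delta_shaped_reward(s_t: np.ndarray, s_next: np.ndarray, s_goal: np.ndarray) -> float:
    d_prev = float(np.linalg.norm(s_t - s_goal))
    d_next = float(np.linalg.norm(s_next - s_goal))
    # Standing still earns zero reward, so idling near the goal is no
    # longer more attractive than actually reaching it.
    return float(np.exp(-d_next) - np.exp(-d_prev))
```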
Finally, the most important lesson: when using negative rewards, termination is a good thing from the agent’s perspective. When using positive rewards, it is bad and the agent will try to avoid it. If you use both negative and positive rewards together with early termination, all bets are off. May the RL gods have mercy on your soul!
Termination and Risk Tradeoffs
The final strange interaction arises between termination and certain types of unbounded rewards. This happens regularly in the OpenAI Gym locomotion environments, especially the walker environment. The main strangeness here comes from the difference between the training objective and the evaluation objective. In training, we normally optimize the discounted future return in each state. However, at evaluation time, we simply measure the cumulative reward achieved over a set amount of steps, or until termination, whichever comes first.
Assume now that the starting state distribution contains only a single state, and there are two available strategies. With the first, the agent reliably receives a reward of 1 at every step and never risks termination. With the second, the agent receives a reward of 10 at every step, but the episode terminates after 100 steps. Assume further that we are using a discount factor of \(\gamma=0.99\).
From the point of view of the initial state, the first strategy corresponds to a value of \(V_1(s_0) = \frac{1}{1-\gamma} \cdot 1 = 100\). The second strategy achieves \(V_2(s_0) = \sum_{i=0}^{99} \gamma^i \cdot 10 = \frac{1-\gamma^{100}}{1-\gamma} \cdot 10 \approx 630\). So obviously a good RL agent will pick the second strategy.
The real strangeness arises when we run the evaluation. Since the first strategy never terminates, we have to specify a truncation timestep. If we choose 100, the first strategy will obtain a cumulative reward of 100, and the second strategy will obtain 1,000. So far, so good. But if we set the truncation to 10,000, we suddenly end up with a return of 10,000 for strategy 1, and still 1,000 for strategy 2. So without changing anything except how long we run the environment interaction loop, we have made an algorithm that picks strategy 1 look better than one that picks strategy 2.
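To see the mismatch in numbers, here is a short computation of the discounted training-time values and the undiscounted evaluation returns for both strategies:

```python
gamma = 0.99

# Discounted values from the initial state (the training objective).
v_safe = 1.0 / (1.0 - gamma)                            # = 100
v_risky = 10.0 * (1.0 - gamma ** 100) / (1.0 - gamma)   # ~ 634

# Undiscounted evaluation return under a given truncation horizon.
def eval_return(reward_per_step, horizon, terminates_at=None):
    steps = horizon if terminates_at is None else min(horizon, terminates_at)
    return reward_per_step * steps

for horizon in (100, 10_000):
    print(horizon,
          eval_return(1.0, horizon),                        # safe strategy
          eval_return(10.0, horizon, terminates_at=100))    # risky strategy
# -> horizon 100: 100 vs 1000; horizon 10000: 10000 vs 1000
```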
The problem is that the second strategy acts myopically, at least relative to the horizon we are evaluating over. This commonly happens, e.g., in the OpenAI Gym locomotion tasks, where the goal implied by the rewards is to run as fast as possible. Very fast gaits are often unstable, which should make intuitive sense, and can easily lead to tripping and therefore termination. So depending on how an algorithm actually trades off long-term vs. short-term reward, how good it is at estimating the likelihood of falling, and how we evaluate, different algorithms might look “superior” even if they objectively perform worse on the criterion they are supposed to optimize.
A lot of people have commented on the fact that the PPO objective does not optimize the proper discounted return. But the problem goes slightly beyond that: there is an objective mismatch between the training and evaluation criteria in a lot of papers (most papers?). Someone should probably do something about that?