Writing successful reward functions  

In deep reinforcement learning (DRL), an agent interacts with an environment (physical or simulated) by performing actions and receiving rewards. The agent’s goal is to maximize its cumulative reward, and it learns by adjusting its policy (its strategy for choosing actions) based on the rewards it receives. In other words, the agent learns by trial and error, guided by an algorithm that maximizes a measure of the immediate and future rewards earned by each action it takes. In this blog, we discuss some tips for writing effective reward functions that help solve complex real-life problems with reinforcement learning.

What is a reward function? 

Simply put, a reward function is a function that provides a numerical score based on the state of the environment. More formally, it maps each perceived state (or state-action pair) of the environment to a single number that specifies the intrinsic desirability of that state. This score is what lets an RL agent learn which actions to take, rather than only producing a prediction.

Suppose a robotic arm is being trained to pick up an object and place it in another location without dropping or breaking it. Here, the RL agent that controls the robotic arm receives a positive reward for successfully picking up the object and placing it in the desired place, and it is penalized for dropping it. These rewards and penalties are determined by the reward function. Depending on how the reward function is written, the agent could gain partial credit for picking up the object successfully and then placing it in the wrong spot, or it may learn to avoid the penalty for breaking the object by never picking it up at all.
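As a rough illustration of the pick-and-place example, the sketch below scores the outcome of one attempt. The field names (`picked_up`, `placed_on_target`, `dropped`) and the reward values are hypothetical choices made for this example, not values taken from any particular platform.

```python
from dataclasses import dataclass


@dataclass
class ArmOutcome:
    """Hypothetical summary of one pick-and-place attempt."""
    picked_up: bool
    placed_on_target: bool
    dropped: bool


def arm_reward(outcome: ArmOutcome) -> float:
    """Map an outcome to a single number (illustrative values only)."""
    if outcome.dropped:
        return -1.0  # penalty for dropping the object
    if outcome.picked_up and outcome.placed_on_target:
        return 1.0   # full reward for completing the task
    if outcome.picked_up:
        return 0.3   # partial credit: picked up, but placed in the wrong spot
    return 0.0       # never picked the object up: no reward, no penalty
```

With these particular values, never touching the object scores 0.0, which is better than dropping it (-1.0). That is exactly the kind of avoidance behavior described above, so the size of the penalty relative to the partial credit deserves some thought.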

Although this process looks simple, it takes many training cycles before the agent reliably produces the desired results.

The following are some tips to help train an RL agent more effectively:

Provide a gradient 

A reward function is more successful if it gives the RL agent a gradient to work with, meaning the reward changes gradually as the agent gets closer to or farther from the target. This helps the agent tell whether it is moving in the right direction.
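To make the idea concrete, here is a minimal sketch that contrasts a sparse reward with one that provides a gradient. The one-dimensional `position` and `target`, and the `tolerance` and `max_distance` parameters, are assumptions made for this example.

```python
def sparse_reward(position: float, target: float, tolerance: float = 0.01) -> float:
    """Reward only when the target is reached; no signal anywhere else."""
    return 1.0 if abs(position - target) <= tolerance else 0.0


def shaped_reward(position: float, target: float, max_distance: float = 10.0) -> float:
    """Reward rises linearly from 0 to 1 as the agent approaches the target."""
    distance = min(abs(position - target), max_distance)
    return 1.0 - distance / max_distance
```

With the sparse version, positions 2.0 and 4.0 both score 0.0 against a target of 5.0, so the agent cannot tell which one is better. The shaped version scores the closer position higher, which gives the agent a direction to improve in.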

For better understanding, let’s take the CartPole environment as an example. Initially, the pole stands upright on the cart, and the goal is to prevent it from falling over. A positive reward is provided for every timestep that the pole remains upright.

Here, the agent receives observations from the environment (cart position, cart velocity, pole angle, and pole velocity at the tip) and takes the best possible action (pushing the cart either left or right).

We also construct a neural network that returns the probability of selecting each possible action. The agent follows this policy while interacting with the environment by passing the most recent state to the network.

The neural network generates the action probabilities, and the agent samples from those probabilities to choose an action (left or right).
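The sampling step described above can be sketched in a few lines of plain Python. This is not a trained network: a single linear layer with hand-picked, hypothetical weights stands in for the policy network, and the sketch assumes that action 0 means "push left", action 1 means "push right", and a positive pole angle means the pole leans to the right.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(state, weights):
    """Stand-in for the policy network: one linear layer plus softmax.

    state:   [cart position, cart velocity, pole angle, pole tip velocity]
    weights: one row of four hypothetical weights per action (left, right)
    """
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)


def sample_action(probabilities, rng=random):
    """Sample an action index (0 = left, 1 = right) from the probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


# Hypothetical weights: favor pushing toward the side the pole leans to.
WEIGHTS = [
    [0.0, 0.0, -5.0, -1.0],  # left
    [0.0, 0.0, 5.0, 1.0],    # right
]

state = [0.0, 0.0, 0.1, 0.0]  # pole leaning slightly to the right
probs = action_probabilities(state, WEIGHTS)
action = sample_action(probs)
```

Because the action is sampled rather than always taking the most probable one, the agent still occasionally tries the less likely action, which is how it explores.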

Figure: CartPole environment (credit: Medium)

Be careful with negative rewards 

The goal of the RL agent is to maximize reward. If there are regions of the state space where the reward is primarily negative, the agent may attempt to reach a termination as quickly as possible to avoid accumulating lots of negative reward.  
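A small arithmetic sketch shows why this happens. It assumes a constant reward per step and no discounting, both simplifications made for illustration.

```python
def episode_return(step_reward: float, steps: int) -> float:
    """Undiscounted return of an episode with a constant per-step reward."""
    return step_reward * steps


# With a -1 reward per step, ending early looks better to the agent.
quits_early = episode_return(step_reward=-1.0, steps=5)     # -5.0
keeps_trying = episode_return(step_reward=-1.0, steps=200)  # -200.0

# With a +1 reward per step, the incentive flips toward staying in the episode.
survives = episode_return(step_reward=1.0, steps=200)       # 200.0
```

From the agent's point of view, an episode that ends after 5 steps (-5.0) is far better than one that runs for 200 steps (-200.0), even if the longer episode was making real progress toward the goal.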

Combine with terminal conditions for optimal results 

Terminal conditions define when the simulation should end. This can be a pre-set number of iterations (for simulating processes over a fixed time window) or when the simulation gets too far out of specification. Terminal conditions can be used in combination with the reward function to make sure episodes are able to accumulate rewards. 
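The two kinds of terminal condition mentioned above can be sketched as follows. The step budget, the limits, and the +1 reward per step are hypothetical values chosen for this example.

```python
def is_terminal(step: int, measurement: float, *, max_steps: int = 200,
                lower_limit: float = -2.4, upper_limit: float = 2.4) -> bool:
    """End the episode on a step budget or when the value leaves its limits."""
    out_of_spec = not (lower_limit <= measurement <= upper_limit)
    return step >= max_steps or out_of_spec


def run_episode(measurements, **limits) -> float:
    """Accumulate +1 per step until a terminal condition fires."""
    total = 0.0
    for step, value in enumerate(measurements):
        if is_terminal(step, value, **limits):
            break  # stop collecting reward once the episode has ended
        total += 1.0
    return total
```

For example, `run_episode([0.0, 0.5, 1.0, 3.0, 0.0])` stops at the fourth reading because 3.0 is outside the limits, so the episode collects a reward of 3.0.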

Separate tasks 

If you’re training an agent to handle multiple sequential tasks (go to the object, pick up the object, set the object down somewhere else), it’s important to do the learning in stages. Write the reward function so that the reward for picking up the object is only attainable if the previous task (going to the object) has been completed successfully. Use terminal conditions to end the episode if a task is not completed: there is no sense in trying to pick up the object if the agent never reached it.

In reinforcement learning problems, an episode is a sequence of interactions between an agent and its environment that ends when the agent reaches a terminal state, at which point the environment is reset to its initial state.
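One way to express this staging is sketched below. The `Stage` names and the reward of 1.0 per completed stage are hypothetical; the point is only that a later stage's reward cannot be reached without completing the earlier ones, and that a failed stage ends the episode.

```python
from enum import IntEnum


class Stage(IntEnum):
    REACH = 0  # go to the object
    PICK = 1   # pick up the object
    PLACE = 2  # set the object down somewhere else
    DONE = 3   # all tasks completed


def staged_step(stage: Stage, succeeded: bool):
    """Return (reward, next_stage, done) for one attempt at the current stage."""
    if stage is Stage.DONE or not succeeded:
        # Nothing left to do, or the task failed: end the episode with no reward.
        return 0.0, stage, True
    next_stage = Stage(stage + 1)
    # Reward the completed stage and unlock the next one.
    return 1.0, next_stage, next_stage is Stage.DONE
```

A successful reach returns `(1.0, Stage.PICK, False)`, so the pick-up reward becomes available. A failed reach returns `(0.0, Stage.REACH, True)`, which terminates the episode before any pick-up attempt.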

Optimizing for multiple constraints 

When multiple constraints are involved, it’s important to balance and weight the observations in a manner consistent with real-world priorities. Let’s take a manufacturing example where we have specifications for weight, length, and width. When only two of the three conditions are met, which scenario is worse: missing the weight, the length, or the width specification? Or are all of these scenarios equally bad? Should the agent get any reward when two of the three are in spec, or only when all three are? The answers depend entirely on the requirements of the process specification and the allowable tolerances.
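The sketch below shows one possible way to encode such priorities. The targets, tolerances, and priority weights in `SPECS` are invented for illustration; in practice they would come from the process specification.

```python
# name: (target, tolerance, priority weight); all values are hypothetical
SPECS = {
    "weight": (100.0, 2.0, 0.5),
    "length": (50.0, 1.0, 0.25),
    "width": (20.0, 0.5, 0.25),
}


def spec_reward(measured: dict, specs: dict = SPECS, require_all: bool = False) -> float:
    """Score a part by which specifications it meets."""
    met = {
        name: abs(measured[name] - target) <= tolerance
        for name, (target, tolerance, _) in specs.items()
    }
    if require_all:
        # Strict variant: reward only when every specification is met.
        return 1.0 if all(met.values()) else 0.0
    # Weighted variant: add up the priority of each specification that is met.
    return sum((priority for name, (_, _, priority) in specs.items() if met[name]), 0.0)
```

With these weights, a part that misses only the width scores 0.75, while a part that misses only the weight scores 0.5, so missing the weight is treated as the worse case. Setting `require_all=True` gives no reward unless all three are in spec.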

Conclusion

Designing reward functions for complex real-world applications is not an easy job. Using these tips, you can build more effective reward functions and hopefully save some time when training RL agents.  

The Microsoft Project Bonsai platform provides a framework for building, training, and deploying models using machine teaching and simulation. Neal Analytics has worked with the Microsoft AI engineering team and PepsiCo to produce perfect Cheetos using the Microsoft Project Bonsai platform.

Our team combines expertise in reinforcement learning and the Project Bonsai platform to design effective reward functions that solve unique business challenges. Contact us for more information.

 
