Learn how Deep Reinforcement Learning helps businesses solve complex problems (with examples) 

What is Reinforcement Learning? 

Reinforcement Learning (RL) is an area of machine learning that focuses on how AI agents learn to make decisions; Deep Reinforcement Learning (DRL) is the variant in which the agent is a deep neural network. With RL, an AI agent learns and makes decisions based on rewards and punishments. For example, if a goalkeeper stops an opponent's shot, he is rewarded. If he concedes a goal, he gets a negative point. In Deep Reinforcement Learning solutions, the goalkeeper is an AI agent.

RL enables an AI agent to learn in an interactive environment (either physical or simulated) by trial and error. The AI agent uses feedback from its actions and experiences to determine the best way to achieve its goals in a dynamic environment and so maximize the reward. The reward is determined by a so-called reward function. To an AI agent, a better reward equals a better decision.

Although the concept of Reinforcement Learning has been around since the early 1980s, machine decision-making gained the whole world's attention in 1997 when IBM's supercomputer 'Deep Blue' defeated the Russian World Chess Champion Garry Kasparov in a six-game match. The software would calculate candidate moves in response to the opponent and select the best one. The supercomputer would decide which line of play to follow based on the information it had gathered from training with data sets and simulations. Strictly speaking, Deep Blue relied mainly on large-scale search with a hand-tuned evaluation function rather than on RL, but the match showed what machine decision-making could achieve. According to IBM's article, Deep Blue could explore up to 200 million possible positions per second.

How does Reinforcement Learning work? 

Reinforcement learning is a goal-oriented AI training algorithm that uses a trial-and-error approach. In RL, the algorithms are rewarded or punished for the actions they take. The algorithm aims to maximize a function that gauges the immediate and future rewards of taking one action out of many possible ones. 
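
As a rough illustration of the function described above, one common choice is the discounted return, which adds up rewards while giving less weight to those that arrive later. The article does not name a specific formula, so the discounted return, the discount factor gamma, and the reward values below are assumptions made for this sketch.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum a sequence of rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future total.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical rewards: 1.0 now, nothing next step, 2.0 two steps later.
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

A gamma close to 1 makes the agent value future rewards almost as much as immediate ones, while a gamma close to 0 makes it short-sighted.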

Key technical concepts and terms in Reinforcement Learning 

The following terms are needed to understand the remainder of this article.

(AI) Agent: An agent acts to achieve a goal. In DRL, AI agents are deep neural network models trained to determine the best course of action within a given environment.

Environment: The surrounding (physical or simulated) world in which the agent operates.

Policy: A policy defines the way the learning agent will behave at a given time. It’s a rulebook that maps perceived states of the environment to the actions to be taken when in those states. 

Reward function: The reward function defines what is good and bad for the agent. Rewards are given when the agent achieves the desired result.

Value function: The value function specifies what is good in the long run. Values show the long-term desirability of states, considering the states that are likely to follow and the rewards available in those states.
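
To make these terms concrete, here is a minimal sketch that maps each of them to a few lines of code. The corridor environment, the reward values, and the always-move-right policy are all hypothetical and chosen only for illustration; a real DRL agent would use a neural network rather than a lookup table.

```python
GOAL = 4  # hypothetical corridor with states 0..4; reaching state 4 ends the episode


def step(state, action):
    """Environment: move left (-1) or right (+1), staying inside the corridor."""
    return max(0, min(GOAL, state + action))


def reward(next_state):
    """Reward function: +1 for reaching the goal, a small penalty for any other step."""
    return 1.0 if next_state == GOAL else -0.1


# Policy: a rulebook mapping each non-goal state to an action (here, always move right).
policy = {state: +1 for state in range(GOAL)}


def value(state, gamma=0.9):
    """Value function: discounted reward collected by following the policy from `state`."""
    total, discount = 0.0, 1.0
    while state != GOAL:
        state = step(state, policy[state])
        total += discount * reward(state)
        discount *= gamma
    return total


if __name__ == "__main__":
    for s in range(GOAL + 1):
        print(s, round(value(s), 3))
```

States closer to the goal end up with higher values, which is the sense in which values capture long-term desirability rather than the immediate reward alone.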

Deep Reinforcement Learning training cycle diagram

How is Reinforcement Learning used to train AI agents?

When the agent operates within an environment, it selects its action from multiple available options. Unlike in supervised learning, the agent has no human-labeled training data to guide it. Instead, it must start exploring the environment with random decisions and then evaluate the resulting changes. The agent is rewarded based on the efficacy of its actions.
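
The explore-then-evaluate loop described above can be sketched with tabular Q-learning, one classic RL algorithm. This is a simplification chosen for brevity, not the method used by any product mentioned in this article: the corridor environment, the reward values, and the hyperparameters (alpha, gamma, epsilon) are all assumptions, and a DRL agent would replace the table with a neural network.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)      # move left or move right


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # Explore with a random action, otherwise exploit the best known one.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state = max(0, min(GOAL, state + action))
            reward = 1.0 if next_state == GOAL else -0.1
            # The goal is terminal, so no future value is added there.
            if next_state == GOAL:
                best_next = 0.0
            else:
                best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the estimate toward the observed reward plus discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if state == GOAL:
                break
    return q


def greedy_policy(q):
    """Read off the best known action for every non-goal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The agent is never told that moving right is correct; it discovers this only from the rewards it receives.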

Rather than being deployed untrained in the real world, AI agents are first trained using simulators. This allows agents to experience real-world scenarios without altering, or worse destroying, the actual equipment or controls. A simulated environment also provides a cheaper, faster, and more efficient way for the AI agent to learn.

RL in real world and simulator

Imagine you’re trying to train a robot to walk across a floor. It may fall 100,000 times before it successfully learns to walk. This is where simulation-based environments come into the picture. They allow AI practitioners to train AI agents in virtual environments. For example, if the robot is trained in the real world, it may take several months to execute millions of training cycles. Alternatively, the AI agent could use RL in a simulation and be pre-trained before deployment. Its training speed would not be hindered by physical limitations, so the agent can complete far more training cycles, and gain far more experience, than would ever be possible in the real world.

Comparing RL with other ML training methodologies 

Reinforcement Learning is a branch of Machine Learning. However, it differs in some distinct ways from supervised and unsupervised learning.

Supervised vs. Reinforcement Learning 

In supervised learning, an external “supervisor” with in-depth knowledge of the environment shares information with the agent to complete the task. With this method, the Machine Learning algorithm is trained on a labeled dataset. But in complex scenarios like chess, it is impractical to assemble training data that covers enough situations for a model to achieve the objective.

These issues can be eliminated if the agent learns from its own experience, gains knowledge, and acts accordingly instead of relying on human data.  

In Reinforcement Learning, the reward and penalty functions help the agent make or change its decisions to complete its task.

Unsupervised vs. Reinforcement Learning 

In unsupervised learning, the Machine Learning algorithm works with unlabeled data. The training dataset does not need to be labeled, because the algorithm infers the relationships between data points on its own.

For example, if the task is to suggest a pair of sports shoes to a customer, an unsupervised algorithm will search for shoes similar to the ones the customer searched for before and recommend some shoes from this subset.

In contrast, reinforcement learning will obtain constant customer feedback by suggesting a few shoes to the customer and then building a “knowledge graph” of the customer's preferences.
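
The feedback-driven approach can be sketched as an epsilon-greedy bandit, a very simple form of RL in which each recommendation is an action and each click is a reward. This is a deliberate simplification of the “knowledge graph” idea: the shoe names, click probabilities, and the epsilon value are hypothetical, and the simulated customer stands in for real feedback.

```python
import random


def recommend(estimates, epsilon, rng):
    """Pick a shoe: usually the best-rated one so far, sometimes a random one to explore."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))
    return max(estimates, key=estimates.get)


def simulate(click_prob, rounds=5000, epsilon=0.1, seed=0):
    """Recommend shoes repeatedly and learn each shoe's click rate from feedback."""
    rng = random.Random(seed)
    estimates = {shoe: 0.0 for shoe in click_prob}
    counts = {shoe: 0 for shoe in click_prob}
    for _ in range(rounds):
        shoe = recommend(estimates, epsilon, rng)
        # Simulated customer feedback: 1.0 for a click, 0.0 otherwise.
        clicked = 1.0 if rng.random() < click_prob[shoe] else 0.0
        counts[shoe] += 1
        # Running average of the observed feedback for this shoe.
        estimates[shoe] += (clicked - estimates[shoe]) / counts[shoe]
    return estimates, counts


if __name__ == "__main__":
    # Hypothetical true click probabilities, unknown to the recommender.
    estimates, counts = simulate({"trail": 0.1, "road": 0.6, "court": 0.3})
    print(estimates)
    print(counts)
```

Over time the recommender shifts most of its suggestions toward the shoe that earns the most clicks, while still occasionally testing the others in case preferences change.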

Real-life examples of RL 

Industrial automation

Learning-based robots are used in industry to perform various tasks. These robots are designed to perform those tasks more efficiently than humans.

A popular example is DeepMind's use of AI agents to cool Google’s data centers. This led to a reported reduction of up to 40% in the energy used for cooling. The agents help identify actions that minimize power consumption without any human intervention.

Another example can be found in manufacturing with Microsoft Project Bonsai. Here the Project Bonsai AI agent, aka “brain”, is applied to process controllers to optimize production yield. Production yield optimization is based on maximizing throughput while managing and maintaining product quality. An example of such an RL application can be found at PepsiCo, where it is used to produce more consistent Cheetos.

Similar technology is also used in complex business functions like supply chain management, dynamic optimization, and continuous learning and improvement to make operations more efficient and cost-effective.  

Industry automation 

Self-driving cars and more 

Deep Neural Networks and Reinforcement Learning can be combined to create another approach called Deep Reinforcement Learning (DRL). This approach can further equip an AI agent with the ability to handle more complexity and improve its perception of the environment, allowing it to make better and faster decisions.   

Self-driving vehicles are another critical application of RL, and major auto companies like Waymo, Tesla, and Argo AI are leveraging this approach for their self-driving programs. Various aspects like speed limits, braking, avoiding collisions, and safety are considered in self-driving cars. RL can be applied in particular to the following self-driving tasks: motion planning, controller optimization, trajectory optimization, and scenario-based self-learning policies for highways.

Bell, a leading aerospace company, works with Microsoft Autonomous Systems and AirSim to develop safer autonomous flight vehicles, focusing mainly on safer landings. The company is currently working toward the first autonomous precision landing using Project Bonsai to identify safe landing zones and then land autonomously.  

It is unrealistic for an unmanned aircraft to practice identifying landing zones and landing in the real world. It is feasible, however, to practice in a simulated environment like AirSim, where thousands of landing scenarios can be executed in a few minutes.

Autonomous vehicles


Robotics manipulation  

Reinforcement Learning and Deep Learning techniques can be used to train robots that detect, grasp, and manipulate objects. A robot learns the best sequence of actions to complete a task by exploring the environment in search of the maximum reward. Here, humans don’t need to give detailed instructions for solving a problem.

For example, Sberbank, with Neal Analytics’ help, implemented Microsoft Project Bonsai, which leverages Deep Reinforcement Learning, to train an AI agent to manipulate coin bags. The AI agent, or “brain”, was trained to detect, grasp, and place coin bags using a simulator that ran hundreds of thousands of training cycles over 20 hours. Using this trial-and-error approach, the AI-controlled robot could pick up bags 95% of the time on the first attempt. In the remaining 5% of cases, the robot would make a second attempt.

Sberbank AI robot control system


Batching and sequencing using Reinforcement Learning 

In retail scenarios, on-time delivery is critical to customer satisfaction. Using RL, a retailer can determine the right time and way to batch and pick orders in a warehouse to minimize the number of late deliveries. The solution builds a strategy by interacting with the environment and solving problems with the Proximal Policy Optimization (PPO) algorithm.

Another use case is Intelligent Order Sequencing, which can be used to reduce wait times and improve customer satisfaction. These scenarios involve multiple variables, from an individual’s order to the customer’s estimated time of arrival, the current workforce, and other orders in the queue. RL is the key when it comes to managing multiple variables and dynamic settings where other ML training methodologies would struggle.

Batching & sequencing


Challenges of Reinforcement Learning 

Developing a simulation environment 

One of the most challenging parts of Reinforcement Learning is developing a simulation environment. The simulation complexity is directly correlated with the task complexity. Developing a model to compete in chess or Atari games and preparing its simulation environment is relatively simple because of the limited state-space complexity and the use of visual inputs.

But when it comes to developing a model capable of more complex tasks, such as driving an autonomous vehicle, it is necessary to develop a realistic simulator before letting the vehicle drive on the street. The simulator must test the vehicle’s ability to avoid collisions and to brake appropriately, and it does so in a safe environment that avoids putting others in physical danger. Finally, a simulator enables practitioners to test multiple strategies, such as brain designs, reward functions, or state spaces, much faster than by using a real-life system to train the agent.

Communicating with the network 

The only way to communicate with the neural network is through a system of rewards and penalties. This may lead to catastrophic interference, where the neural network abruptly forgets previously learned information upon learning new information. Neural networks trained on a nonstationary data distribution, such as two distinct tasks in sequence, can quickly lose what they learned earlier.

One way to address this DRL pitfall is to store experiences in a replay buffer and use it to interleave old and new data during training.
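
A minimal sketch of such an experience store is shown below, assuming a fixed capacity and uniform random sampling. The class name, method names, and the (state, action, reward, next_state, done) layout are illustrative choices rather than the API of any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past experiences, sampled at random during training."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently drops the oldest experience once it is full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._items.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random batch that mixes older and newer experiences."""
        size = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), size)

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=1000, seed=0)
    for t in range(50):
        buffer.add(state=t, action=1, reward=-0.1, next_state=t + 1, done=False)
    print(buffer.sample(4))
```

Because each training batch is drawn from the whole buffer instead of only the latest experiences, the network keeps seeing older situations while it learns new ones.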

Reaching a local optimum 

Sometimes, the agent may perform a task, but not in the required way, and so gets stuck in a local optimum instead of reaching the intended goal. For example, in the CoastRunners game, the speedboat is supposed to collect reward points and complete the race. However, an agent trained this way may focus on maximizing the reward points and fail to complete the race.
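
A small worked example shows why this happens. The numbers below are hypothetical and are not the game's actual scoring; they only illustrate how a reward function that pays per target can make circling more attractive than finishing.

```python
def episode_score(points_per_target, targets_hit, finish_bonus=0):
    """Total reward for one episode under a points-per-target reward function."""
    return points_per_target * targets_hit + finish_bonus


# Hypothetical strategy A: drive the course, hit 12 targets, and finish the race.
finish_race = episode_score(points_per_target=10, targets_hit=12, finish_bonus=50)

# Hypothetical strategy B: circle a few respawning targets for the whole episode.
loop_forever = episode_score(points_per_target=10, targets_hit=40)

if __name__ == "__main__":
    print(finish_race, loop_forever)  # 170 400
```

Under these assumed numbers, the looping strategy earns more reward, so an agent that only maximizes points has no incentive to finish. Fixing this means changing the reward function, for example by rewarding progress along the course.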


Recent years of innovation in cloud and AI technologies have opened new opportunities for applying Reinforcement Learning in real-life business scenarios. RL fills in the gaps left open by other approaches. Because RL focuses on improving an agent’s self-learning abilities, it makes automation feasible for more complex systems. On the other hand, RL is a trial-and-error methodology that usually consumes a fair amount of time before the desired results are achieved.

Developments in Machine Teaching, Deep Reinforcement Learning, and new platforms like Project Bonsai are bringing AI from the labs to real-life business scenarios, providing companies with new capabilities to train, design, and deploy custom AI agents to solve complex business problems.


