Accelerating Deep Reinforcement Learning training by applying AI Teacher/Student strategy to simulations  

Autonomous Systems can solve many business problems by bringing AI from research labs to real-life use cases. Autonomous Systems leverage Deep Reinforcement Learning (DRL) techniques to train AI agents through trial and error, without the need for a preexisting labeled training dataset. For this trial-and-error approach, DRL relies on advanced simulators to train AI agents, which helps achieve results in a cost-effective and timely way.

Simulators can be built using different mechanisms based on system complexity, types of processes involved, existing models, and existing data. In some situations, the simulator can be a bottleneck from a training speed perspective. Therefore, optimizing the simulator speed can have a significant impact on the overall AI agent (aka Microsoft Project Bonsai “brain”) training speed.  

The brain is an AI model trained to intelligently control and optimize real-world systems. The faster the simulator can train a brain, the more it is possible to optimize it through trial and error by modifying its inputs, outputs (aka state and action space), and reward functions. Besides the obvious solution of running multiple simulators in parallel, which can quickly become quite expensive, another approach is possible. 

The teacher/student approach 

AI practitioners can leverage a common AI training strategy and deployment approach to achieve orders-of-magnitude faster simulations with similar simulator accuracy. This approach is derived from one used in some advanced AI solutions, particularly natural language processing (NLP) ones such as speech recognition and machine translation. This technique is referred to as the teacher-student, or knowledge distillation, training strategy.

Using a similar approach, it is possible to significantly increase simulator speed by designing student AI simulators that can be almost as accurate as the core process simulator (the teacher), but much faster.

The simulator bottleneck in DRL training 

There are multiple ways to build simulators that can train an AI agent (the Bonsai "brain") using deep reinforcement learning. However, whether they are based on custom code or simulation software platforms, complex system simulations can sometimes become the DRL training speed bottleneck.


Deep Reinforcement Learning training cycle diagram

In one Neal Analytics Project Bonsai DRL project, even after simulator optimization, the simulator took 30 minutes to execute one simulation cycle for one DRL training cycle. And that was a system with a small state and action space (i.e., the simulator's inputs and outputs). Even with the simplest state space, training an AI agent using DRL requires at least 100,000 cycles. More complex agents may require hundreds of thousands or even millions of training cycles.

In this example, the math is straightforward: 100,000 cycles x 30 minutes meant that the system would require around 3 million minutes, or more than five years, to train one brain. That was hardly a viable option, especially since the team needed to test multiple brain design and reward function options.
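The back-of-the-envelope math above can be checked in a few lines of Python:

```python
# Back-of-the-envelope check of the training-time estimate:
# 100,000 DRL cycles at 30 simulator minutes per cycle.
cycles = 100_000
minutes_per_cycle = 30

total_minutes = cycles * minutes_per_cycle
total_years = total_minutes / (60 * 24 * 365)

print(f"{total_minutes:,} minutes is about {total_years:.1f} years")
# prints "3,000,000 minutes is about 5.7 years"
```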

To work around this unmanageably long training time, the team leveraged cloud scalability to run hundreds of simulators in parallel. This decreased the total training time to a couple of days. However, even though these runs happened over the weekends to allow for more effective use of team resources during the work week, the approach was hardly cost-effective. The first results were promising, but the team needed a different way to accelerate simulations by several orders of magnitude at a manageable cost. Brute-force cloud scaling would not cut it.
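That brute-force approach can be sketched as fanning independent simulation episodes out across workers. The `run_simulation` stub below is hypothetical; in the real project, each call would drive an external simulator instance, which is why threads suffice in this sketch (CPU-bound Python code would need processes instead):

```python
from concurrent.futures import ThreadPoolExecutor

def run_simulation(seed):
    # Hypothetical stand-in for one slow simulation episode; the real
    # version would block on an external simulator for ~30 minutes.
    return seed * 2  # dummy episode result

def run_parallel(n_episodes, max_workers=16):
    # Fan independent episodes out across workers. Wall-clock time drops
    # roughly with the worker count, but total compute cost does not,
    # which is why brute-force scaling gets expensive quickly.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_simulation, range(n_episodes)))
```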

So, the team decided to leverage a well-known AI technique, the teacher-student approach, aka “knowledge distillation,” to simulations. 

What is the teacher/student AI design strategy?  

As models have grown more complex, the deep neural networks (DNNs) used for advanced AI tasks have reached accuracy levels never seen before. For instance, speech recognition, image captioning, and machine translation reached human-parity performance levels in the late 2010s.

The issue with those models is that they are extremely complex. To deliver the expected results, these DNNs have from hundreds of millions to tens of billions of individual parameters that need to be trained. For instance, OpenAI's GPT-3 text generation model topped out (at least as of March 2021!) at 175 billion (with a "b"!) parameters. In comparison, first-generation DNN-powered machine translation models ran in the 10-15 million parameter range…but they were still far from human-parity quality.

While these results are very impressive from a pure research standpoint, such models are not economically (and, some would argue, environmentally) viable for real-life use cases.

For instance, a model could match human capabilities in speech recognition, but its complexity, training cost, and operational cost would make it impossible to offer at scale and at a price point that makes economic sense for customers. To help solve this research-to-production transfer challenge, AI researchers developed an innovative approach: the teacher-student, or knowledge distillation, method. Looking at the concepts behind it, this is indeed a fitting name.

To illustrate this approach, let’s take the metaphor of a person wanting to learn a new skill such as a new language, calculus, Python, or best practices in search engine optimization, to name a few. There are two main ways to learn new skills:   

  • The first one is to do extensive research, read all the books, watch all the videos, do all the trial-and-error testing until you become proficient.   
  • The other one is to go through someone who already has this knowledge, i.e., a teacher. You then leverage this teacher by having them provide a more distilled version of the content. They will only convey what is important and ignore the more trivial content such as outliers, rarely used concepts, etc. You may end up with 95% of the relevant knowledge compared with the first method, but you will achieve this in 10% or less of the time and effort. 

The AI teacher-student design and training approach uses a similar concept. With AI, the teacher is the most advanced AI, the one with the best output quality. To design an AI that comes close to that level of quality while remaining scalable and economical, the solution is to create a student AI.

The student AI is still complex, but it is much simpler and faster than the teacher. The trick of this approach is that the student is not trained solely on the raw training data the teacher used; it also uses the teacher's outputs as a source of training data.

teacher-student model

This concept is like asking an expert (i.e., “teacher”) to distill the knowledge they possess to only convey what’s important to the student. Here, the student AI will learn from the teacher’s distilled knowledge and not from the raw labeled training data. 
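A minimal NumPy sketch of this idea, assuming a classification-style teacher: the student is scored against the teacher's softened output distribution rather than hard labels. The temperature value and toy logits are illustrative, not from the project:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax; a higher temperature "softens" the
    # distribution, exposing the teacher's relative preferences.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the
    # student's: the student learns which outputs the teacher considers
    # nearly right, not just the single hard label.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()
```

A student whose outputs match the teacher's minimizes this loss; in practice, it is often mixed with a standard hard-label loss.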

Of course, the student AI's results will never be quite as good as the teacher's, but even if it reaches 95% of the teacher's capability, it is economically viable and scalable, making it a fantastic solution that combines output quality, economic viability, and appropriate execution speed.

Applying teacher/student to simulations

If we connect the two problems, one where a simulator is extremely precise but too complex to be fast and cost-effective, and one where a teacher AI can train a student AI, we can devise a strategy to improve simulator performance through a parallel approach.

In the example mentioned above, Neal Analytics experts built a DNN to simulate the system whose original simulator took several days across hundreds of parallel cloud instances to drive one brain training. But, instead of training this AI on data from the system itself, the team trained it using the advanced simulator. This AI simulator became a simplified proxy of the complex simulator: a student AI that used the "teacher" process simulator to train. It is almost as accurate as the full simulator, but it runs much faster and more cost-effectively than the teacher.

simulator-student model
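The idea can be sketched as follows, with an assumed toy dynamic standing in for the real process simulator. The actual project used a DNN surrogate; the linear least-squares fit here is only to keep the example self-contained:

```python
import numpy as np

def teacher_simulator(state, action):
    # Hypothetical stand-in for the slow, high-fidelity process
    # simulator: a simple nonlinear dynamic, purely for illustration.
    return 0.9 * state + 0.1 * np.tanh(action)

# 1. Run the slow teacher offline to build a (state, action) -> next_state
#    training set for the student.
rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(5000, 1))
actions = rng.uniform(-1, 1, size=(5000, 1))
next_states = teacher_simulator(states, actions)

# 2. Fit a fast surrogate (here a linear model via least squares; in
#    practice this would be a small neural network).
X = np.hstack([states, actions, np.ones_like(states)])
w, *_ = np.linalg.lstsq(X, next_states, rcond=None)

def student_simulator(state, action):
    # Orders of magnitude faster than the teacher: one matrix product.
    return np.hstack([state, action, np.ones_like(state)]) @ w
```

DRL training then queries `student_simulator` instead of the teacher, falling back to the teacher only to generate (or periodically refresh) the student's training data.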

Opening new optimization and model sensitivity analysis opportunities to build more robust AI agents  

With the availability of this new AI simulator, it is faster and more cost-effective to train an AI agent such as the Microsoft Project Bonsai “brain.” This has lots of positive implications for Autonomous Systems projects.  

First, testing different sets of inputs and outputs (state and action spaces) and reward functions becomes possible within the project’s timeline and budget. It means that better project outcomes can be achieved. This would not have been possible by only using the advanced simulator.   

In addition, for forward-looking models or for "what if" analyses where several parameters have large potential variance, it now becomes economically and technically feasible to perform advanced sensitivity analysis. In turn, this ensures that decisions are made with the right understanding of each hypothesis' impact on outcomes.

For instance, if the price of a particular piece of equipment, energy costs, battery availability, or climate trends are hard to forecast over an extended period, it can be critical to analyze and predict the AI agent's sensitivity to these variations. This means that, once the agent is deployed, you will know which elements to monitor carefully versus those for which reasonable and predictable variations will not have a significant impact.

For example, you could have a project where temperature variations over the next ten years could significantly impact one of the outcomes, while the kilowatt-hour (kWh) cost from the energy supplier may not impact it much (or vice versa).
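Once a fast student simulator exists, even a simple one-at-a-time sensitivity sweep like the one below becomes affordable. The surrogate function, its parameter names, and the ranges are purely illustrative:

```python
import numpy as np

def fast_student_simulator(energy_cost, temperature_delta):
    # Hypothetical surrogate: returns a project outcome metric as a
    # function of two uncertain inputs (illustration only).
    return 100.0 - 8.0 * temperature_delta - 0.5 * energy_cost

def sensitivity(sim, param_ranges, n_samples=1000, seed=0):
    # One-at-a-time sweep: vary each parameter across its plausible
    # range while holding the others at their midpoint, and record the
    # spread of the outcome. Cheap only because the simulator is fast.
    rng = np.random.default_rng(seed)
    mid = {k: (lo + hi) / 2 for k, (lo, hi) in param_ranges.items()}
    spread = {}
    for name, (lo, hi) in param_ranges.items():
        samples = rng.uniform(lo, hi, n_samples)
        outcomes = [sim(**{**mid, name: v}) for v in samples]
        spread[name] = np.ptp(outcomes)  # peak-to-peak outcome range
    return spread

ranges = {"energy_cost": (5.0, 15.0), "temperature_delta": (0.0, 3.0)}
result = sensitivity(fast_student_simulator, ranges)
# in this toy model, temperature_delta dominates the outcome spread
```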

Are AI simulators going to become the de facto standard for DRL-based AI? 

After reading this article, one may easily conclude that we believe every simulator used for Autonomous Systems training on the Microsoft Project Bonsai platform should be an AI simulator.

On the contrary: our experience showed us that this is far from the case.

In many situations, using a standard software package or custom-built simulators based on the physics of the system will be more effective both in terms of development time and operating cost.  

However, in certain circumstances, those non-AI simulators may become a bottleneck when it comes to training performance, speed, and cost. In those situations, the teacher-student approach of using an AI simulator trained with an advanced process simulator can be the most effective approach.  

It means, as always, that the AI practitioners working on an Autonomous System must be familiar with the different approaches so they can select the most appropriate simulation strategy. If, to a hammer, everything looks like a nail, then AI practitioners need to ensure that they do not try to solve every simulation problem the same way. They need to select the best option based on the problem's complexity and other process-specific aspects.


Learn more: