Reinforcement learning for strategic planning
As companies look for ways to put their data to use, reinforcement learning (RL) is becoming an increasingly inviting and accessible option. General-purpose RL tools – such as Microsoft’s Project Bonsai – are now available for planning, optimization, and automation. Between the promises of next-gen AI and the availability of tools in the cloud, the market seems to be building towards an explosion of RL-powered business solutions.
The remainder of this article assumes the reader is familiar with the concepts of RL. If not, we recommend starting with this article.
Reinforcement learning has had some well-publicized successes over the past decade that have piqued the interest of businesses. Most of the best-known implementations of RL, however, have little to do with strategic planning. Board games and video games highlight the strengths of RL, with agents outplaying human competitors in Go and Dota 2, but finding novel solutions to real-world problems is often more nuanced than beating a high score. Taking game-changing RL from the arcade to the office requires a completely different perspective, one that begins with an assessment of business goals over the long term.
To this end, we will examine an iterative RL project plan for strategic planning from objectives through to assessment.
Table of Contents
- RL for strategic planning
- Defining strategic planning objectives
- Building a simulation to model the business problem
- Use case scenario: Material sourcing and shipping costs
- Rewarding the reinforcement learning agent
- Assessing the strategy
- How your business can apply RL for strategic planning
- Further reading
Reinforcement learning is well suited for a wide range of business problems. In strategic planning, the business looks for ways to navigate a large space of potential approaches to find the one that best suits its objectives. It may be a question of deciding how to minimize acquisition costs while expanding locations, or how to migrate from truck-based to train-based shipping without interfering with operations. These types of problems are good candidates for RL because the learning agent can explore a very large space of possible approaches and report back with a complete policy shaped by the goals of the business.
The main challenge in strategic planning is capturing the unique business problem. For the results to be actionable, you first need to develop a way for an agent to “play” the business’s unique “game” from scratch. In RL problems, simulations are used to represent the problem space. The simulation orchestrates the gameboard and pieces while the learning agent acts like a competitor – looking for the best combination of moves to maximize return (i.e., win the game). The agent is then assessed to ensure that the rewards and simulation are truly producing behaviors that optimize for the original strategic objectives.
Figure 1: Reinforcement learning system
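The loop between simulation and agent can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ToySimulation` class, its `reset` and `step` methods, the five-period episode, and the fixed policies are hypothetical stand-ins, not the API of any particular RL tool.

```python
class ToySimulation:
    """Hypothetical stand-in for a business simulation with a 5-period episode."""

    def reset(self):
        # Start a new episode and return the initial state.
        self.period = 0
        return self.period

    def step(self, action):
        # Apply the agent's action, then return (next state, reward, done).
        reward = 1.0 if action == "invest" else 0.0
        self.period += 1
        done = self.period >= 5
        return self.period, reward, done


def run_episode(sim, policy):
    """Play one episode and return the cumulative reward (the return)."""
    state = sim.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = sim.step(action)
        total += reward
    return total


always_invest = run_episode(ToySimulation(), lambda state: "invest")  # 5.0
always_wait = run_episode(ToySimulation(), lambda state: "wait")      # 0.0
```

A real agent would replace the fixed policies with one that is updated from the rewards it collects, but the division of labor is the same: the simulation owns the state and the rules, and the agent only chooses actions.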
Developing high-value agents is easy when the situation comes with a set of well-established rules and states, but in strategic planning the stakeholders are responsible for deciding how constrained or open the training environment will be. In these real-world problems, each decision comes with its own set of implications.
In an RL problem, the objectives not only guide the design of the simulation and reward functions, but they are also the basis for assessing agent performance. Without explicitly laying out the project goals, it is impossible to say whether the agent is performing well or providing useful planning recommendations. While strategic planning objectives don’t necessarily have a place in the RL architecture, they do represent the first step in building an RL system and the final stage in assessing the impact of the system on the business.
It can be confusing to see how the objectives fit in the system, especially considering that the agent is only trained using rewards. In strategic planning it is useful to think of objectives as the destination, and of rewards as shaping the agent’s path through all possible approaches to get you there quickly. In a simple board game like chess the agent may receive rewards or penalties for capturing and losing pieces, but the impact of these rewards can only be measured against the objective of the system: winning the game.
In more complex problem spaces, it can be tempting to quickly lay out a simple objective like “increased sales” or “better customer engagement”. These may serve as a good starting point, but the more specific the objective is, the more likely it is that the RL results will be useful. When stating an objective, it is also important to call out all the constraints and dependencies that need to be addressed in the simulation and reward functions, such as hard limits on available capital or shipping times. If the dependencies and constraints are not called out in the objectives, the learning agent cannot account for them and is likely to find solutions that are untenable.
Although it’s important to try to define objectives ahead of time, you often learn more about what a realistic plan looks like through trial and error. As the objectives become more defined and explicit, the RL system is better able to provide value to the business.
Consider an example in which the overall objective is to lower the carbon footprint of the business. If you created a simulation of the operational elements of the business and trained an agent with only this objective, it could simply remove everything that emits carbon (including machinery and workers). To get an actionable plan, you need to be specific about the objectives:
What is the timeline or end state?
- 10, 15, 30 years (and from a business standpoint, why?)
- Does the business need to be totally carbon neutral?
- Are there regulatory penalties that can be used to shape objectives?
What are the budget constraints?
- How much can the agent choose to spend each year to reduce carbon emissions?
- How much potential impact on operations can the business handle?
What other sources of business value can the agent interact with?
- Customer relations?
- Tax write-offs?
- Cost of fueling or maintaining vehicles (electric vs. diesel)?
What components of the business can be changed and how?
- Can the agent remove diesel vehicles and purchase electric vehicles?
- Can the agent migrate the operating grid from coal power to geothermal?
The more questions you can identify and assess during objective planning, the fewer cycles it will take to develop an RL system that provides actionable strategies. Once these questions have all been answered the next step is designing a simulation that captures the business scenario.
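One way to keep the answers to these questions explicit is to write them down as a small, checkable specification before any simulation work begins. The sketch below is an assumption-laden illustration: the `CarbonObjective` fields, the action names, and the numbers used with it are hypothetical, chosen only to mirror the questions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarbonObjective:
    """Hypothetical objective spec for the carbon footprint example."""
    target_year: int              # end state deadline, e.g. year 30
    target_emissions_tons: float  # 0.0 would mean fully carbon neutral
    annual_budget: float          # most the agent may spend per year
    allowed_actions: tuple = ("replace_diesel_vehicle", "switch_grid_source")


def violations(obj, year, emissions_tons, spend_this_year):
    """List the constraints that one year of a candidate plan violates."""
    problems = []
    if spend_this_year > obj.annual_budget:
        problems.append("over annual budget")
    if year >= obj.target_year and emissions_tons > obj.target_emissions_tons:
        problems.append("missed emissions target")
    return problems


objective = CarbonObjective(target_year=30, target_emissions_tons=0.0,
                            annual_budget=1_000_000.0)
print(violations(objective, year=5, emissions_tons=500.0,
                 spend_this_year=900_000.0))   # []
```

Writing the objective this way makes gaps visible early: any constraint that cannot be expressed as a field or a check is a constraint the simulation and rewards will not know about either.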
Working with an experienced data science team can help ensure the problem framing is well suited for reinforcement learning.
In most applications of reinforcement learning, the bulk of the work goes into developing a useful simulation on which the model can train. In strategic planning, the simulated environment consists of state information about business operations (financial assets, inventory, etc.) and anything else that might be affected by the agent’s actions.
Like any model, the simulation is a useful simplification and cannot capture all information about the business. In chess and Go it is easy to see that excluding the shape and weight of game pieces will not affect training performance. In other domains the choice is rarely as straightforward. For example, if you wanted a machine to actually move pieces on a board, its performance would probably improve if it were exposed to information about the weights and shapes of pieces during training. If you wanted it to generalize to different boards and game pieces, it would need to learn how to handle the whole space of possible board and set combinations. In such a scenario, the simulation would need to manage the game dynamics, but also generate appropriate boards and pieces at random for each training episode.
The simulation determines the value of the solution more than anything else in RL. Even the most straightforward problems will likely require continuous re-assessment to get the simulation right. Things that seem important to include at first may distract from a more optimal solution, and things that seem to have no role to play may end up making the difference between a good solution and an exceptional one.
When designing the simulation for the business problem, we want to identify the specific aspects of the business that will be impacted by RL agent interactions. If the agent is tasked with ordering material, then the simulation will need to keep track of budget, revenue, and demand. On the other hand, if the problem involves hiring and workforce management then the simulation will likely involve headcount, salaries, and revenue.
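For the material ordering case, a minimal sketch of the state and its update rule might look like the following. Everything here is an illustrative assumption: the `OrderingState` fields, the `step` function, and the unit cost and price are hypothetical, and a real simulation would model demand and pricing in far more detail.

```python
from dataclasses import dataclass


@dataclass
class OrderingState:
    """Hypothetical state tracked by a material ordering simulation."""
    budget: float
    inventory: int
    revenue: float = 0.0


def step(state, order_qty, demand, unit_cost=4.0, unit_price=10.0):
    """Advance one period: buy what the budget allows, then sell up to demand."""
    affordable = int(state.budget // unit_cost)
    bought = min(order_qty, affordable)
    inventory = state.inventory + bought
    sold = min(inventory, demand)
    sales = sold * unit_price
    return OrderingState(
        budget=state.budget - bought * unit_cost + sales,
        inventory=inventory - sold,
        revenue=state.revenue + sales,
    )


start = OrderingState(budget=100.0, inventory=0)
after = step(start, order_qty=50, demand=20)
# The agent asked for 50 units but could only afford 25; 20 were sold.
# after == OrderingState(budget=200.0, inventory=5, revenue=200.0)
```

Note that the budget limit is enforced by the simulation itself, so the agent experiences the constraint directly instead of having to be told about it.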
The impulse is to build the most comprehensive simulation possible – after all, for the agent to be flexible wouldn’t we want it to have access to anything that could affect the scenario? In practice, however, there is a tradeoff between simulation complexity and agent performance/resource consumption. It’s important to remember that the role of the simulation is to produce the most useful RL agent. Rarely is it important for the simulation to capture every detail of the business – from the market valuation to the floor plan. Creating the simulation is best viewed as an iterative process that involves adding components and observing their impact. In the end, what we want is a simulation that is just complex enough to produce a useful agent.
When developing a simulation for strategic planning, there is always going to be a tradeoff between the fidelity of the model and the computing power it takes to train a learning agent. In RL the goal is a model that is high fidelity with respect to only those aspects of the environment with which the learning agent interacts. For a problem involving shipping and sourcing between locations, an important consideration is whether we need to include a complete spatial model in the simulation. This will depend on the factors we want the agent to consider:
Figure 2: Different modeling approaches for a delivery optimization scenario
Reasons for a spatial model:
- Understanding the locations of vehicles in time
- Avoiding traffic, toll roads
- Accounting for geographic features like hills or mountains in route planning
Reasons for a tabular model:
- Units of distance can be reduced to a cost in material weight and fuel
- Average shipping cost is sufficient over a longer period
- Individual anomalies are not a major factor over the duration of modeling
- Nothing gained by tracking vehicle locations in real-time
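To make the tabular option concrete, the sketch below reduces every route to an average cost per ton, with no notion of vehicle position or road conditions. The site names and cost figures are hypothetical and exist only for illustration.

```python
# Hypothetical average cost (in dollars) per ton shipped between two sites.
COST_PER_TON = {
    ("quarry", "plant"): 12.0,
    ("plant", "warehouse"): 7.5,
}


def shipping_cost(origin, destination, tons):
    """Look up the average rate for a route and scale it by weight."""
    return COST_PER_TON[(origin, destination)] * tons


def plan_cost(shipments):
    """Total cost of a list of (origin, destination, tons) shipments."""
    return sum(shipping_cost(o, d, t) for o, d, t in shipments)


total = plan_cost([("quarry", "plant", 10), ("plant", "warehouse", 4)])
# 12.0 * 10 + 7.5 * 4 = 150.0
```

A lookup like this is cheap to evaluate millions of times during training, which is the main argument for preferring it when the spatial details would not change the agent's decisions.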
The impact of market fluctuations on the cost of materials is another factor worth considering for a model of material sourcing. This is another area where we need to choose which assumptions are valid. It’s easy to see where the temptation to model the entire global market comes from since the business will be affected to some degree by each fluctuation in pricing, but we need to whittle this down to the types of more frequent and impactful events that the agent will learn to consider. Some outside effects that may be worth incorporating in the model are:
- Historic patterns of fuel prices
- Effects of inflation and supply on material prices
If this is a significant consideration for the business, the simulation should be able to manage the overall effects for a wide range of scenarios. This way an agent can learn when to purchase or ship materials based on external pricing fluctuations.
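A simple way to expose the agent to a wide range of pricing scenarios is to generate a fresh price path for each training episode. The sketch below assumes a monthly random walk with an inflation drift; the function name, the default inflation and volatility values, and the price floor are all assumptions made for illustration, not a calibrated market model.

```python
import random


def simulate_prices(start_price, periods, annual_inflation=0.03,
                    volatility=0.05, seed=None):
    """Generate a monthly price path: inflation drift plus random shocks."""
    rng = random.Random(seed)
    drift = (1 + annual_inflation) ** (1 / 12) - 1  # monthly inflation rate
    prices = [start_price]
    for _ in range(periods):
        shock = rng.gauss(0.0, volatility)
        next_price = prices[-1] * (1 + drift + shock)
        prices.append(max(next_price, 0.01))  # keep prices strictly positive
    return prices


# Two years of monthly fuel prices for one training episode.
fuel_prices = simulate_prices(start_price=100.0, periods=24, seed=1)
```

In practice the drift and volatility would be fitted to the historic patterns mentioned above, and each episode would use a different seed so the agent does not memorize a single path.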
Finally, it’s important for the simulation to accurately capture internal business conditions that could impact agent decisions. Any limitations in available resources (sustaining capital, vehicles, employees, etc.) will likely need to be included in the configuration. Accurately accounting for elements of the internal business will teach the agent to work within set constraints or learn when to invest in internal resources to achieve the business goals.
RL agents learn to develop strategies that maximize returns (cumulative reward). In many applications the behaviors that drive rewards are already defined, such as getting a high score, handling an object correctly, or avoiding an obstacle. When building RL systems for strategic planning, deciding how behaviors should be rewarded is vital because the definitions of success are based on business goals.
Reward functions take in the current state of the simulation and return a value to the agent that reflects how well it is achieving the objective. Often the individual components of the reward function fall into one of these patterns:
Avoid – full reward given each time something is avoided
- Avoid spending more than the available capital
- Avoid running out of materials
Maximize/minimize – reward is proportional to value
- Maximize revenue in dollars
- Minimize tons of carbon emitted
Remain within – full reward given for remaining within boundaries
- Remain within 5% of headcount goal
- Remain within 1 hr of estimated delivery
Achieve – full reward given only for reaching a set objective
- Achieve zero carbon emissions by year 30
- Achieve 98% rate of delivery within 1 hr over 5 yrs
By using state information coming from the simulation, we can capture most business considerations with a set of these simple reward functions. To learn more about developing rewards, check out this Neal Analytics blog post.
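The four patterns above can each be written as a one-line function. The sketch below is a minimal interpretation: the function names, the 0-to-1 reward scale, and the example values are assumptions, and a real reward function would read these values from the simulation state.

```python
def avoid(violated):
    """Full reward for every step on which the unwanted condition is avoided."""
    return 0.0 if violated else 1.0


def maximize(value, scale):
    """Reward proportional to the value; pass a negated value to minimize."""
    return value / scale


def remain_within(value, target, tolerance):
    """Full reward while the value stays within tolerance of the target."""
    return 1.0 if abs(value - target) <= tolerance else 0.0


def achieve(reached):
    """Full reward only once the objective has been reached."""
    return 1.0 if reached else 0.0


# Hypothetical state values for a single step.
r_capital = avoid(violated=800.0 > 1000.0)            # spend vs. capital -> 1.0
r_revenue = maximize(50_000.0, scale=100_000.0)       # -> 0.5
r_carbon = maximize(-20.0, scale=100.0)               # minimize tons -> -0.2
r_headcount = remain_within(103, 100, tolerance=5.0)  # within 5% -> 1.0
r_neutral = achieve(reached=False)                    # not yet at zero -> 0.0
```

Keeping each component separate like this makes it easier to inspect which part of the reward is driving a given behavior during assessment.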
After mapping the business objectives to individual rewards, we need to think about how the objectives interact. It’s a useful exercise to think about the direction in which the agent is being pushed by different rewards. For example:
- Minimizing debt pushes against increasing salaries
- Reducing emissions pushes against increasing production
- Increasing delivery speed pushes against reducing operations cost
To break the deadlock, one objective may need to be given greater weight. The agent will still be learning to maximize all rewards, just with a preference for whichever direction is more aligned with the goals of the business. We often learn how to weight objectives iteratively by assessing the agent’s recommended strategy against higher-level business goals.
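A common way to express this preference is a weighted sum of the individual reward components. The sketch below assumes that form; the component names, plan values, and weights are hypothetical and are chosen to show how a change in weights can flip which plan the agent prefers.

```python
def combined_reward(components, weights):
    """Weighted sum of named reward components."""
    return sum(weights[name] * value for name, value in components.items())


# Two hypothetical plans: A produces more but emits more than B.
plan_a = {"emissions": -0.5, "production": 0.75}
plan_b = {"emissions": -0.25, "production": 0.25}

equal = {"emissions": 1.0, "production": 1.0}
emissions_first = {"emissions": 3.0, "production": 1.0}

# With equal weights, plan A scores higher (0.25 vs. 0.0).
# With emissions weighted 3x, plan B scores higher (-0.5 vs. -0.75).
prefers_a = combined_reward(plan_a, equal) > combined_reward(plan_b, equal)
prefers_b = (combined_reward(plan_b, emissions_first)
             > combined_reward(plan_a, emissions_first))
```

The weights themselves are a business decision, which is why they are usually tuned by comparing the agent's recommended strategy against the higher-level goals and adjusting.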
After the business strategy has been created it needs to go through assessment. To assess the agent, we need to go all the way back to the original requirements and check that the RL agent’s strategies represent a legitimate course of action for the business. In a way we are not only assessing the quality of the RL solution, but also the clarity and coverage of the requirements. If there is a loophole or ambiguous intention in the framing of the problem, we expect that it will be exploited by the agent.
While it is good practice to include code testing and sanity checks throughout the project, it’s also useful to approach the assessment as if it were another set of unit-tests. To make sure the strategies are valid, we need to probe the boundaries and foundational assumptions of the problem.
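Treating assessment like unit tests can be as simple as running every recommended plan through a list of checks derived from the original requirements. The sketch below is illustrative only: the plan format (one dictionary per year), the field names, and the specific checks are assumptions.

```python
def assess_plan(plan, annual_budget, max_final_emissions):
    """Return the list of failed checks for a plan (one dict per year)."""
    failures = []
    for year, entry in enumerate(plan, start=1):
        if entry["spend"] > annual_budget:
            failures.append(f"year {year}: spend exceeds budget")
        if entry["headcount"] <= 0:
            failures.append(f"year {year}: no workers left")
    if plan and plan[-1]["emissions"] > max_final_emissions:
        failures.append("final emissions above target")
    return failures


# A plan that meets the target while keeping the business running...
sensible = [
    {"spend": 80, "headcount": 50, "emissions": 40},
    {"spend": 90, "headcount": 50, "emissions": 0},
]
# ...and one that exploits a loophole by shutting everything down.
shutdown = [{"spend": 0, "headcount": 0, "emissions": 0}]

print(assess_plan(sensible, annual_budget=100, max_final_emissions=0))  # []
print(assess_plan(shutdown, annual_budget=100, max_final_emissions=0))
# ['year 1: no workers left']
```

Each failed check points back to either a missing requirement, a gap in the simulation, or a reward that incentivizes the wrong behavior.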
The question is: can the business use the strategy as is?
If the answer is no, then we simply iterate back through the steps from the beginning:
- Are the business requirements robust enough to build a useful strategy?
- Is the simulation capturing all the state information the agent needs to make decisions?
- Are the rewards incentivizing the right policies in line with the requirements?
Figure 3: Assessment analyzes the agent’s strategy in order to make improvements to the reward function or simulation
To guide assessment, here are some configuration issues that should be checked:
Hidden state information
Hidden information can be difficult to diagnose during assessment because it doesn’t always have an obvious impact on the strategy. The agent can still produce useful results, but training will likely require many extra cycles and the agent may underperform. This happens when data used in the simulation or reward functions isn’t made available to the agent. When data is hidden, the agent first needs to learn the underlying relationships from scratch before it can produce meaningful results.
For example, imagine the RL simulation state includes item quantities and budgets, but prices are omitted. Before the agent can learn to make the most purchases while staying under budget, it will need to learn individual item prices through trial and error. This situation is represented in the image below.
Figure 4: Prices used to score performance are never made available to the agent. The agent will have to spend more time exploring the effect of actions on rewards before it can develop a useful strategy.
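The purchasing example can be sketched as follows. The state layout, the function names, and the item prices are hypothetical; the point is only that the reward depends on prices regardless of whether the observation includes them.

```python
def observation(sim_state, include_prices):
    """Build what the agent sees; prices are hidden unless explicitly included."""
    obs = {
        "quantities": dict(sim_state["quantities"]),
        "budget": sim_state["budget"],
    }
    if include_prices:
        obs["prices"] = dict(sim_state["prices"])
    return obs


def reward(sim_state, purchases):
    """Items bought if under budget, otherwise a penalty. Always uses prices."""
    cost = sum(sim_state["prices"][item] * n for item, n in purchases.items())
    if cost > sim_state["budget"]:
        return -1.0
    return float(sum(purchases.values()))


sim_state = {
    "quantities": {"bolt": 0, "gear": 0},
    "budget": 20.0,
    "prices": {"bolt": 2.0, "gear": 15.0},
}
hidden = observation(sim_state, include_prices=False)   # no "prices" key
visible = observation(sim_state, include_prices=True)

reward(sim_state, {"bolt": 5, "gear": 0})   # 5.0: cost 10.0 is under budget
reward(sim_state, {"bolt": 0, "gear": 2})   # -1.0: cost 30.0 is over budget
```

With the `hidden` observation the agent can only discover that gears are expensive by repeatedly overspending; with the `visible` one, that relationship is available from the first step.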
The RL agent will always look for the series of actions that maximizes returns in an episode. If you want the agent to complete a task quickly, you need to make sure it can’t attain greater reward by simply avoiding termination indefinitely. This situation is easy to prevent by adding a penalty for each additional action, but there are many more subtle ways that rewards can lead an agent astray.
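The effect of a per-action penalty is easy to see with a little arithmetic. The sketch below assumes a hypothetical setup in which the agent earns 0.5 per step for staying within bounds and a bonus of 10 for completing the task; none of these numbers come from a real system.

```python
def episode_return(steps, completed, step_penalty):
    """Return for an episode with a small per-step reward and a completion bonus."""
    per_step = 0.5 - step_penalty           # hypothetical reward earned each step
    bonus = 10.0 if completed else 0.0      # hypothetical completion bonus
    return steps * per_step + bonus


# Without a penalty, stalling for 100 steps beats finishing in 5.
finish_fast = episode_return(steps=5, completed=True, step_penalty=0.0)    # 12.5
stall = episode_return(steps=100, completed=False, step_penalty=0.0)       # 50.0

# With a penalty of 1.0 per action, finishing quickly is the better strategy.
finish_fast_p = episode_return(steps=5, completed=True, step_penalty=1.0)  # 7.5
stall_p = episode_return(steps=100, completed=False, step_penalty=1.0)     # -50.0
```

The agent is not misbehaving in the first case; it is correctly maximizing the return it was given.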
If you notice suboptimal behavior during assessment, it can often be traced to the rewards. That is to say, the agent is always optimizing on the rewards it’s been given – sometimes the rewards just incentivize the wrong behavior. Before rushing to alter the simulation or RL algorithm, it’s best to check the math on the rewards.
A useful trick is to plot the reward function distribution across all possible state values to check if negative signs can ever be flipped, or if specific states aren’t being accounted for in the design.
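The same check can be done without a plot by sweeping the reward function over a grid of state values and flagging suspicious results. In the sketch below, `budget_reward` is a hypothetical reward with a deliberate flaw, and the grid values are arbitrary; the sweep tabulates the results instead of plotting them.

```python
def budget_reward(spend, budget):
    """Hypothetical reward: fraction of the budget left unspent."""
    return (budget - spend) / budget


def sweep(reward_fn, spends, budgets):
    """Evaluate the reward over a grid of states and collect suspicious cases."""
    findings = []
    for budget in budgets:
        for spend in spends:
            try:
                r = reward_fn(spend, budget)
            except ZeroDivisionError:
                findings.append((spend, budget, "undefined"))
                continue
            if spend > budget and r > 0:
                findings.append((spend, budget, "positive reward for overspending"))
    return findings


findings = sweep(budget_reward, spends=[0, 50, 150], budgets=[-100, 0, 100])
for row in findings:
    print(row)
```

Here the sweep reports six problem states: the sign of the reward flips whenever the budget is negative, and the reward is undefined when the budget is exactly zero. Neither case shows up if the function is only tried with a typical positive budget.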
Missing interactions happen when the simulation is too simple for the situation. To notice the effects of missing interactions, you need an intuitive understanding of what should happen under different conditions. If your autonomous vehicle has learned to drive straight through walls, you’ll want to check how the interaction between vehicles and buildings is being represented. Usually, the impacts are more subtle and only show up under certain conditions, which is why it’s important to perform comprehensive assessments.
You need to be wary of missing interactions because they are difficult to detect but can have a big impact when the resulting strategy is employed in the real world. In strategic planning, it is especially important to think through all the interactions when developing a simulation because each problem will come with a unique set of dynamics to model.
Due to the growing number and complexity of RL problems, assessment is an ongoing area of research. When assessing any RL solution, it is important to recognize that the agent will learn a strategic plan that is optimized for the given set of conditions. It may simply take a few iterations to get the simulation and reward functions fully aligned with the defined business objectives.
Business problems are rarely as simple as choosing between Option A and Option B. They require a holistic approach and the ability to juggle a variety of factors. RL solutions are well suited to the dynamic and complex environment of business operations. Once used to play chess and Go, RL models leverage the same strategic skills to “win” against the competition by finding the best way to achieve their objective.
At its core, reinforcement learning for strategic planning needs the same things as any other RL project: clear objectives, a simulator to model the business problem, and a properly trained agent.
Neal Analytics can help with all three.
With our deep expertise in the cloud, data, and AI, as well as Microsoft Project Bonsai, we provide end-to-end services and flexible engagement models to help customers design, train, and deploy AI agents to solve real-world business problems.
If you’d like to learn more about how Neal Analytics can help you use reinforcement learning for strategic planning, contact us today.
- Autonomous Systems
- Unlocking the potential of AI in manufacturing with machine teaching and deep reinforcement learning
- Understanding the difference between supervised and reinforcement learning for deep neural networks
- Keys for developing an RL-ready simulator
- Writing successful reward functions
D. Silver, J. Schrittwieser, K. Simonyan, et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, pp. 354-359, 2017.
C. Berner, G. Brockman, B. Chan, et al., “Dota 2 with Large Scale Deep Reinforcement Learning,” arXiv:1912.06680, 2019.
“AlphaGo,” DeepMind. [Online]. Available: https://deepmind.com/research/alphago/. [Accessed 24 May 2021].