Reinforcement Learning for Market Entry Strategy

Working paper by Eric Thomas, Dr. Christopher Archibald, Stephen Sorensen, Dr. David Bryce, and Austin McMaster

1 The problem of market diversification

Markets often exist in clusters, where the materials, machinery, personnel, and processes required to supply goods and services in one market may overlap heavily with those needed in another, or hardly at all. For example, a firm that sells sporting goods likely possesses many of the capabilities required to enter the camping goods market, but far fewer of those necessary to enter the financial services market.

A salient question for company executives is whether the varying degrees of market overlap can inform their strategies on timing market entrances and exits. Should firms exploit markets adjacent to those in which they already operate, or should they explore entirely different areas of the economy first? Complicating these decisions is the fact that the pristine, ceteris paribus world in which market entry and exit strategies are often formulated fails to simultaneously account for a myriad of pertinent economic variables, including demand elasticity, the threat of additional market entrants, market size, and the predicted behavior of competitors.

To address these complexities, we simulate an economy that incorporates these variables and apply existing reinforcement learning (RL) techniques to uncover effective market entry and exit strategies. Our results demonstrate that RL-based agents outperform rule-based agents in their ability to maximize capital through market entry and exit decisions. These initial findings serve as proof of concept that modern artificial intelligence (AI) can uncover novel insights in the field of microeconomic strategy. Furthermore, while our simulator may not account for every economic complexity, it provides a robust platform upon which strategists can build to model their specific scenarios.

The rest of the paper proceeds as follows: Section 2 outlines the related literature on market diversification, AI-driven game strategy, agent-based computational economics, and business applications for AI. Section 3 describes our simulator in terms of its economic characteristics and its mechanics. Section 4 provides background on RL and the specific algorithms we employ, as well as details regarding the interactions between our simulator and off-the-shelf implementations of these algorithms. Section 5 details how we train and evaluate our agents and offers insights we glean from the RL agents’ emergent behavior. Section 6 concludes.

2 Related Work

2.1 AI-driven Game Strategy

Most early applications of AI to games focused on classic board games such as checkers and chess [26]. These efforts led to IBM’s monumental breakthrough in 1997, when its Deep Blue program became the first AI system to defeat a reigning world chess champion [2]. The field of AI-driven game strategy has grown significantly since then, particularly in the last decade, in which advances in deep learning have led to major breakthroughs. In 2016, for example, a research team at Google DeepMind combined deep reinforcement learning techniques with Monte Carlo tree search to develop a model that boasted a 99.8% win rate against other Go programs and defeated the human European Go champion by five games to zero [19]. The next year, DeepMind released AlphaZero, which, given no knowledge other than the game rules, progressed from random play to superhuman capability within 24 hours of training in chess, shogi (Japanese chess), and Go [20].

In 2019, DeepMind partnered with Team Liquid to release AlphaStar, an AI model created for the video game StarCraft II. This game presents a more difficult challenge than classic board games given that it is a multi-agent problem, the current state of the game is only partially observable, the state and action spaces are high-dimensional, and the game requires long-term planning over thousands of time steps [25]. Employing a variety of deep RL techniques, AlphaStar managed to outperform 99.8% of officially ranked human players [24].

Recent years have seen AI models succeed in a variety of other game domains, including poker [12], Dota 2 [1], and classic Atari video games [10]. Progress in games is significant because techniques developed in these settings can often be applied to problems of real-world consequence: reinforcement learning, for example, was first applied to the game of checkers [21] but now has broad applicability in robotics [8] and finance [27]. Conversely, some real-world problems become more tractable when formulated as games, and that is what we seek to do in this study. Indeed, our economic simulator bears similarities to the classic board game Monopoly: players are endowed with some starting capital, they are presented with opportunities to invest, and they are eliminated from the game upon bankruptcy. Our simulator, however, differs from Monopoly in that it seeks to plausibly model real-world scenarios, accounting for fundamental economic principles such as supply, demand, fixed and variable costs, prices, and quantities. Formulating economic problems as games thus yields insight both for the economic question at hand and for the study of AI-driven game strategy itself.

2.2 Agent-Based Computational Economics

Agent-Based Computational Economics (ACE) is "the computational study of economic processes modeled as dynamic systems of interacting agents" [23]. ACE is a bottom-up approach to studying dynamic economic systems where outcomes may be intractable or computationally infeasible using traditional analytical methods. Using ACE, researchers create individual economic agents that make decisions according to individual incentives or rewards. The agents are introduced into an economic environment with initial conditions chosen by the researchers. Researchers watch for emergent agent behavior, making no intervention in the economic environment once the simulation has begun.

One application area for ACE is optimal tax policy. In a recent study [28], researchers at Salesforce and Harvard employ a deep RL approach to determine a tax schedule that maximizes a social welfare function accounting for both equality and productivity. The researchers model both the individual economic actors and the policy-setting government as RL agents, resulting in dynamic tax policies that are robust to tax-gaming strategies. Interestingly, the resulting optimal tax schedule is qualitatively different from both the actual U.S. tax schedule and the notable Saez framework: it is U-shaped, combining progressive and regressive elements and rewarding taxpayers for moving closer to the middle of the income distribution.

Meanwhile, researchers at DeepMind study emergent bartering behavior in a complex environment where agents have varying preferences and production abilities and goods vary in abundance by location [7]. The research serves as both a computational confirmation of basic microeconomic principles as well as an environment in which new AI techniques can be explored for the advancement of AI itself.

Earlier studies also use ACE techniques to analyze emergent population-level economic phenomena; however, these studies predate recent breakthroughs in deep learning and thus make primary use of rule-based agents without incorporating AI. We refer the interested reader to section 2.5 of [7] for an excellent overview of this work.

2.3 Business Applications for AI

Market entry and exit strategy is one of numerous business problems where AI has been or could be applied. The last decade has seen an explosion of industry interest in AI, with research from Goldman Sachs showing that mentions of AI in Russell 3000 earnings calls increased more than 23-fold from 2015 to 2023 [5]. Here we highlight some notable applications of AI to business problems.

  • Dynamic pricing: Many e-commerce and travel websites employ dynamic pricing models that incorporate customer and competitor data to determine prices in real time [3].

  • Customer relationship management (CRM): Applications of AI to CRM have been studied extensively, particularly in the areas of one-to-one marketing and loyalty programs [14].

  • Financial fraud detection (FFD): Both the private and public sectors use AI to sift through large amounts of data and detect fraudulent activity. In particular, AI is used extensively to identify insurance, corporate, and credit card fraud [13].

  • Talent acquisition: Many companies use AI to assist in the hiring process [6]. Multinational corporations in particular receive a high number of candidates per job opening and benefit greatly from AI-assisted identification, attraction, and onboarding of new employees [17].

  • Asset trading: Advanced AI models rapidly detect patterns in economic data and perform high-frequency trades [4]. Researchers have developed models for various purposes, including price prediction, portfolio management, stock selection, hedging strategy, and risk management [15].

Our work contributes to the literature on business applications of AI by providing a platform on which industry researchers can model their own strategic scenarios and decide which product markets to enter and exit. While the economic model we employ in this work will require additional sophistication and realism before deployment in a real-world setting, it provides a crucial first step toward this novel application of AI.

3 The Simulator

We implement an economic simulator to provide our RL agents with an environment in which they can learn market entry and exit strategy. Section 3.1 details how we model the pertinent economic characteristics. Section 3.2 details the mechanics of the simulator.

3.1 Market Entry Decisions and Market Similarity

As mentioned in the introduction, markets often exist in clusters, where the materials, machinery, personnel, and processes required to supply goods and services in one market may overlap heavily with those needed in another, or hardly at all. To model this, we conceptualize a large number of capabilities that exist in the economy, with each market requiring a specific subset of those capabilities. For example, a typical economy in our simulator might contain 1000 capabilities, while each individual market within the economy requires a firm to possess a specific subset of 100 of those capabilities in order to produce products in that market. Similarity between product markets is then modeled by how many requisite capabilities a given pair of markets have in common.

In the simulator settings, the user selects how many markets exist in the economy and how related each is to the others. We achieve this by introducing the notion of a cluster: a group of markets whose requisite capabilities are drawn from the same underlying distribution. For each cluster, the user selects a mean and a standard deviation, and capabilities are drawn using these parameters and assigned to each market within the cluster. For example, suppose an economy contains 1000 capabilities, each market within the economy requires 100 capabilities, and cluster A is defined to have a mean of 200 and a standard deviation of 50. It is then highly likely that capabilities 199, 200, and 201 will be selected as part of a cluster-A market’s required capabilities and less likely that capabilities 10 and 400 will be selected. Note that capabilities are selected without replacement, so each market is guaranteed to have the specified number of requisite capabilities (in this case, 100). By carefully choosing means and standard deviations for each cluster, the user can control the degree of overlap between markets from different clusters. For example, if cluster A has a mean of 500 and a standard deviation of 50 and cluster B has a mean of 550 and a standard deviation of 50, their markets will share a relatively high number of requisite capabilities compared to a scenario in which cluster B has a mean of 900 and a standard deviation of 50. See 8 for technical details on how these normal distributions are used to generate discrete sets of capabilities.
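
One plausible way to realize this sampling scheme is to weight each capability index by a normal density centered on the cluster mean and then draw the required number of capabilities without replacement. The sketch below (in Python with NumPy) is illustrative only; the function and variable names are ours, not the simulator’s.

    import numpy as np

    def sample_market_capabilities(num_capabilities, caps_per_market,
                                   cluster_mean, cluster_std, rng=None):
        """Draw a market's required capability IDs, weighted by a normal
        density centered on the cluster mean, without replacement."""
        rng = rng or np.random.default_rng()
        ids = np.arange(num_capabilities)
        # Weight each capability ID by the cluster's normal density.
        weights = np.exp(-0.5 * ((ids - cluster_mean) / cluster_std) ** 2)
        # Sampling without replacement guarantees exactly caps_per_market
        # distinct capabilities per market.
        return set(rng.choice(ids, size=caps_per_market, replace=False,
                              p=weights / weights.sum()))

    # Two cluster-A markets (mean 200) share many capabilities; a cluster-B
    # market centered far away (mean 900) shares few.
    a1 = sample_market_capabilities(1000, 100, cluster_mean=200, cluster_std=50)
    a2 = sample_market_capabilities(1000, 100, cluster_mean=200, cluster_std=50)
    b1 = sample_market_capabilities(1000, 100, cluster_mean=900, cluster_std=50)
    print(len(a1 & a2), len(a1 & b1))  # within-cluster vs. cross-cluster overlap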

Given the varying degrees of overlap between market clusters, the goal of each firm in our simulation is to choose when and where to enter and exit markets so as to maximize its capital. We assume that all firms in a market in a given time step choose production quantities according to a Cournot oligopoly model and that prices are then inferred directly from the demand curve, leaving variation in production-quantity and pricing strategies as a subject for future research. Other pertinent economic details, such as the determination of entry, fixed, and variable costs, are covered in section 3.2.3.

3.2 The Simulator itself

We now overview how the simulator works. Section 3.2.1 outlines the major components of the simulator. Section 3.2.2 explains how agents take turns making market entry and exit decisions. Section 3.2.3 explains how the various quantitative variables in the simulator are calculated.

3.2.1 Major Components of the Simulator

The simulator consists of markets, firms, and agents. A market is a mechanism whereby buyers and sellers of a specific product engage in exchange. Each market is defined by a linear demand curve and a set of capabilities required for sellers to participate in that market. A firm is an entity that can choose to enter and exit markets. When one or more firms are in a market, they choose their production quantities according to their production policy (which, for this paper, is the Cournot oligopoly model) and the market price is then determined by the market demand curve. Firms are endowed with some starting amount of capital, and that capital increases via revenue and decreases via market entry costs, fixed costs, and variable costs. An agent is an entity that seeks to maximize a firm’s capital by controlling its market entry and exit decisions. Agents may be rule-based or AI-powered.

3.2.2 Agent Decision Making

Each simulation consists of a series of macro time steps, which in turn consist of a series of micro time steps. We first explain what happens during one micro time step and then describe how micro steps are combined to create a macro step.

During a micro time step, it is at most one agent’s turn to act. The acting agent may choose to have its firm enter any market it is not in, exit any market it is in, or do nothing. The market portfolios of all other firms remain constant. After the acting agent has made its market entry or exit move (or opted to do nothing), the production quantities are calculated in each market, market prices are determined, profits for each firm-market combination are calculated, and each firm’s capital is adjusted accordingly.

A macro time step is a series of micro time steps in which each agent has one turn to be the acting agent. The simulator settings also allow a number of skip turns (turns in which no agent may enter or exit a market) to be added to each macro time step, and allow the order of agent turns to be shuffled within each macro time step.

For a simple example, suppose a simulation consists of 100 macro steps. Suppose there are two agents and that there are two skip turns per macro step. Then there are four micro steps total per macro step and 400 total micro steps within the simulation. Suppose further that the option to randomize the turn order within each macro step is turned on. Then one possible turn order would be {(1, SKIP, SKIP, 2), (1, 2, SKIP, SKIP), (SKIP, 2, SKIP, 1), . . .}.
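
A minimal sketch of how one such turn schedule could be generated (names are ours; the simulator’s actual implementation may differ):

    import random

    def macro_step_order(agent_ids, skips_per_macro, shuffle=True):
        """Micro-step turn order for one macro step: each agent acts exactly
        once, plus the configured number of skip turns."""
        order = list(agent_ids) + ["SKIP"] * skips_per_macro
        if shuffle:
            random.shuffle(order)
        return order

    # 100 macro steps, two agents, two skip turns per macro step
    # -> 4 micro steps per macro step, 400 micro steps in total.
    schedule = [macro_step_order([1, 2], skips_per_macro=2) for _ in range(100)]
    print(schedule[:3])  # e.g. [[1, 'SKIP', 'SKIP', 2], [1, 2, 'SKIP', 'SKIP'], ...]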

If a firm’s capital falls below zero at any point during the simulation, the firm is considered bankrupt: it is removed from all markets in which it was participating and no longer participates in the simulation. This is the worst possible outcome for a firm.

We posit that our turn-based model, with randomization of turn order within macro steps, is an acceptable model of the real world, where business opportunities arise somewhat randomly and generally at different times for different companies. The adjustable number of skip turns per macro step allows the simulator to model economies with varying frequencies of market entry and exit opportunities. Furthermore, our setup serves as a foundation upon which researchers can build to model their specific use case; in the event that simultaneous action among agents better models a given economy, this can be achieved with a few simple adjustments to the code.

3.2.3 Quantitative Variables Calculation

  • Minimum and maximum market entry costs: Defined by the user in the simulator settings.

  • Capabilities per market: Defined by the user in the simulator settings.

  • Capability Costs: The minimum and maximum capability costs are calculated as (minimum entry cost) / (capabilities per market) and (maximum entry cost) / (capabilities per market), respectively. Specific capability costs are drawn from a uniform distribution between these two values and remain fixed throughout the simulation.

  • Entry Costs: Sum of the costs of the capabilities that the firm must acquire to participate in the market. Capabilities the firm already possesses are not included in this cost.

  • Fixed Costs: A percentage of the entry cost, as specified by the user in the simulator settings. For example, if a firm paid 200 to enter a market and the fixed cost percentage is set to 5, it will pay 10 in fixed costs for that market at each micro time step.

  • Exit Costs: A percentage of the entry cost, as specified by the user in the simulator settings. For example, if a firm paid 200 to enter a market and the exit cost percentage is set to 30, it will pay 60 to exit the market.

  • Demand slope and intercept: The simulator settings require that the user specify a minimum and maximum for the demand slopes and the demand intercepts. For each market, a slope and intercept are chosen from a uniform distribution between these values.

  • Quantity: Firms choose the quantity they produce in each market in their portfolio at each micro time step according to their production policy, which, for this paper, is set to Cournot production for all firms.

  • Price: Price in each market at each micro step is the market’s demand intercept less the product of its demand slope and total quantity produced.

  • Revenue: Revenue for each firm-market combination is the product of price and quantity produced by that firm in a given micro time step.

  • Profit: Profit for each firm is the difference between its total revenues and total costs in a given micro time step (see the sketch following this list).
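
The sketch below shows how these per-step quantities fit together for a single market, using the textbook Cournot solution for linear demand and constant per-unit variable costs. It is our illustrative reading of the model; all names and numbers are placeholders, and the simulator’s internals may differ in detail.

    def cournot_quantities(intercept, slope, variable_costs):
        """Textbook Cournot equilibrium for linear demand P = intercept - slope * Q
        and constant per-unit costs c_i:
        q_i = (intercept - n * c_i + sum of rivals' costs) / (slope * (n + 1)).
        Quantities are clamped at zero for simplicity."""
        n = len(variable_costs)
        total_cost = sum(variable_costs)
        return [max(0.0, (intercept - n * c + (total_cost - c)) / (slope * (n + 1)))
                for c in variable_costs]

    # One micro time step in one market with two participating firms.
    intercept, slope = 100.0, 2.0
    variable_costs = [10.0, 16.0]   # per-unit cost of each firm in this market
    fixed_costs = [10.0, 12.0]      # per-step fixed cost of each firm in this market

    quantities = cournot_quantities(intercept, slope, variable_costs)
    price = intercept - slope * sum(quantities)    # demand curve sets the price
    revenues = [price * q for q in quantities]     # revenue per firm-market pair
    profits = [r - c * q - f                       # revenue less variable and fixed costs
               for r, q, c, f in zip(revenues, quantities, variable_costs, fixed_costs)]
    print(price, profits)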

4 The Artificial Intelligence

We now provide a high-level overview of reinforcement learning (RL) in general and of the three specific RL algorithms we use in our study. We then outline how state observations are generated in our simulator and how we connect our simulator to off-the-shelf implementations of these algorithms.

4.1 Reinforcement Learning

Reinforcement learning (RL) is a class of AI solution methods in which an agent learns through interaction with an environment. Such methods are similar to how humans and animals learn: through trial and error. Three characteristics of problems that lend themselves to RL are: 1) the agent’s actions influence its later inputs, 2) the agent is not told which actions to take, and 3) actions may affect not only immediate rewards but also the next state and, by extension, future rewards [22]. Given such a problem, RL techniques empower agents to map states to actions with the goal of maximizing some reward.
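
For concreteness, the quantity an RL agent seeks to maximize is the expected discounted return, and the action-value (Q) function referenced in section 4.2 is the expected return conditioned on a state-action pair; in the notation of [22], with discount factor γ,

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + · · · ,    Q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ].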

4.2 RL algorithms we employ in this study

4.2.1 Deep Q-Network (DQN)

Basic forms of RL often use a simple data structure, such as a two-dimensional array, to track the expected future rewards, known as Q-values, of state-action pairs. Such an approach, however, does not scale well beyond low-dimensional state spaces. Deep Q-Networks (DQNs) address this problem by employing a deep artificial neural network to estimate the Q-values. This approach is advantageous in complex, real-world situations where agents must process high-dimensional inputs and use them to generalize past experience to new situations [11]. For a given state and action, the network estimates the maximum sum of discounted rewards that can be achieved after taking that action in that state. Network parameters are optimized via gradient descent on a loss derived from the Bellman equation.
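
For reference, the per-update loss in [11] compares the network’s current estimate with a bootstrapped target formed from the observed reward and the discounted value of the best next action, where θ denotes the network parameters and θ^- the periodically updated target-network parameters:

L(θ) = E_{(s, a, r, s')} [ ( r + γ max_{a'} Q(s', a'; θ^-) − Q(s, a; θ) )^2 ].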

4.2.2 Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a policy gradient method: a type of RL in which a parameterized policy is optimized with respect to the expected long-term cumulative reward via gradient ascent. PPO differs from prior policy gradient methods in that, rather than performing a single gradient update per data sample, it collects data through several interactions with the environment and then performs minibatch updates on the policy. The algorithm also introduces a clipped objective function that limits the size of each policy update, preventing large, destabilizing swings in policy parameters [18].
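
For reference, the clipped surrogate objective maximized in [18] is, with r_t(θ) the probability ratio between the new and old policies, Â_t the advantage estimate, and ε the clipping parameter,

L^CLIP(θ) = E_t [ min( r_t(θ) Â_t , clip( r_t(θ), 1 − ε, 1 + ε ) Â_t ) ].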

4.2.3 Advantage Actor-Critic (A2C)

Actor-critic methods incorporate separate structures to represent the policy function, which maps states to actions, and the value function, which maps states to values [22]. Advantage Actor-Critic (A2C) is a specific actor-critic model that uses deep neural networks for both the policy and the value function. In the paper introducing this family of methods (A2C is the synchronous variant of the asynchronous A3C algorithm presented there), Mnih et al. point out that a common issue in RL is non-stationarity in the sequence of data observed by the agent: as the agent acts within the environment, its actions modify the underlying probability distribution from which it samples experiences, which can induce strong correlation between the agent’s policy updates and destabilize learning. The A2C model mitigates this issue by having multiple copies of the agent interact with multiple instances of the environment in parallel while sharing the same policy and value functions [9].

4.3 Generation of state observations

We now detail the state observations that we feed to the RL agent when it is its turn to act. A state observation is a vector that contains pertinent economic data that the agent uses to decide when and where to enter and exit markets. Let F be the number of firms in the simulation. Let M be the number of markets in the simulation. State observations are then structured as follows:

  1. Capital of all firms (vector of dimension F )

  2. Market overlap structure (i.e., percentage of overlap in required capabilities between any pair of markets; matrix of dimension M × M )

  3. Variable costs for all firm-market combinations realized thus far in the simulation (i.e., if firm i is present or has been present in market j, then we provide visibility to the variable cost for firm i–market j, with zero-padding applied otherwise to maintain consistent matrix size; matrix of dimension F × M )

  4. Fixed cost for each firm-market combination (matrix of dimension F × M )

  5. Market portfolio of all firms (matrix of dimension F × M )

  6. Entry cost for every firm-market combination (matrix of dimension F × M )

  7. Demand intercept in each market (vector of dimension M )

  8. Slope in each market (vector of dimension M )

  9. Most recent quantity for each firm-market combination (matrix of dimension F × M )

  10. Most recent price for each firm-market combination (matrix of dimension F × M )

These 10 components of the state representation are each flattened into one-dimensional vectors and then concatenated to create a single state observation vector. Information is ordered according to the following three rules:

  1. For components involving firm-specific information, the RL agent’s information is given first. The control agents’ information is then given in ascending order by agent ID.

  2. For components involving market-specific information, information is given in ascending order by market ID.

  3. For components involving information specific to firm-market combinations, the above two rules apply. Information is ordered first at the firm level and then at the market level (i.e., info pertaining to a firm for all markets is given before the info for the next firm is given).

The total length of state representation is given by:

F + M·M + F·M + F·M + F·M + F·M + M + M + F·M + F·M = M^2 + 6FM + F + 2M.

For example, if we have five firms and eight markets, our state representation would be a vector of 325 real-valued numbers.
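
A minimal sketch of how such an observation vector could be assembled (the function and argument names below are ours; only the component list and ordering rules above come from the simulator):

    import numpy as np

    def build_observation(capital, overlap, var_costs, fixed_costs, portfolios,
                          entry_costs, intercepts, slopes, quantities, prices):
        """Flatten and concatenate the 10 components into a single vector.
        capital: shape (F,); overlap: (M, M); intercepts, slopes: (M,);
        all remaining arguments: (F, M), with the RL agent's firm in row 0."""
        parts = [capital, overlap, var_costs, fixed_costs, portfolios,
                 entry_costs, intercepts, slopes, quantities, prices]
        # Row-major flattening preserves the firm-then-market ordering rule.
        return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])

    # Dimension check for F = 5 firms and M = 8 markets.
    F, M = 5, 8
    obs = build_observation(np.zeros(F), np.zeros((M, M)), np.zeros((F, M)),
                            np.zeros((F, M)), np.zeros((F, M)), np.zeros((F, M)),
                            np.zeros(M), np.zeros(M), np.zeros((F, M)), np.zeros((F, M)))
    assert obs.shape == (M**2 + 6*F*M + F + 2*M,)  # 325 elements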

4.4 Connection to off-the-shelf RL implementations

For this study, we use existing implementations of the algorithms outlined in section 4.2. Adapting these algorithms or creating new algorithms suited specifically for the task of market diversification presents a promising area for future research; our purpose here is to show that existing methods can handle the problem reasonably well.

We borrow off-the-shelf RL algorithm implementations from Stable-Baselines3 (SB3). SB3 offers reliable PyTorch implementations of many commonly used RL algorithms, requiring only that users implement a set of simple application programming interface (API) functions to communicate between the SB3 model and the user environment [16]. These API functions include logic to initialize, reset, and step within the economic simulator. The initialization function prepares the simulator to run; the reset method restores variables to their initial states between simulations; and the step method executes an RL action in the simulator, allows the simulator to run until it is again the RL agent’s turn, and then provides the RL agent with a state representation describing the current state of the simulator (see section 4.3 for details). SB3 handles all the underlying complexity of executing the algorithms described in section 4.2.
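
For illustration, the following sketch shows what such a wrapper can look like. The reset/step signatures, the Gymnasium spaces, and the PPO training call follow standard SB3/Gymnasium conventions; the simulator object, its methods, and the action encoding are placeholders we invent here, not the simulator’s actual API.

    import gymnasium as gym
    import numpy as np
    from gymnasium import spaces
    from stable_baselines3 import PPO

    class MarketEntryEnv(gym.Env):
        """Gymnasium wrapper around the economic simulator (placeholder API)."""

        def __init__(self, sim, obs_dim, num_markets):
            super().__init__()
            self.sim = sim  # placeholder: the simulator object
            # One plausible action encoding: toggle membership in one market, or do nothing.
            self.action_space = spaces.Discrete(num_markets + 1)
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_dim,),
                                                dtype=np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            obs = self.sim.reset()  # placeholder call: restore initial state
            return np.asarray(obs, dtype=np.float32), {}

        def step(self, action):
            # Placeholder call: apply the RL agent's move, then run the simulator
            # forward until it is the RL agent's turn again.
            obs, reward, done = self.sim.apply_and_advance(action)
            return np.asarray(obs, dtype=np.float32), float(reward), bool(done), False, {}

    # env = MarketEntryEnv(sim, obs_dim=325, num_markets=8)
    # model = PPO("MlpPolicy", env, verbose=0)
    # model.learn(total_timesteps=100_000)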

Experiments, Conclusions, Limitations, and Future Work

These sections are in progress! Please check back after February 28, 2025 to view our full paper and results.

References

[1] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

[2] M. Campbell, A. J. Hoane Jr., and F.-h. Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.

[3] L. Chen, A. Mislove, and C. Wilson. An empirical analysis of algorithmic pricing on Amazon Marketplace. In Proceedings of the 25th International Conference on World Wide Web, pages 1339–1349, 2016.

[4] G. Cohen. Algorithmic trading and financial forecasting using advanced artificial intelligence methodologies. Mathematics, 10(18):3302, 2022.

[5] Goldman Sachs. Artificial intelligence (AI) market interest growth 2015 to 2023, by share of companies [graph], August 2023. In Statista.

[6] IBM. Data suggests growth in enterprise adoption of AI is due to widespread deployment by early adopters, but barriers keep 40% in the exploration and experimentation phases, January 2024. IBM Newsroom.

[7] M. B. Johanson, E. Hughes, F. Timbers, and J. Z. Leibo. Emergent bartering behaviour in multi-agent reinforcement learning. arXiv preprint arXiv:2205.06760, 2022.

[8] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

[9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.

[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[12] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

[13] E. W. Ngai, Y. Hu, Y. H. Wong, Y. Chen, and X. Sun. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559–569, 2011.

[14] E. W. Ngai, L. Xiu, and D. C. Chau. Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2):2592–2602, 2009.

[15] K. Olorunnimbe and H. Viktor. Deep learning in the stock market—a systematic survey of practice, backtesting, and applications. Artificial Intelligence Review, 56(3):2057–2109, 2023.

[16] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.

[17] J. S. Roppelt, N. S. Greimel, D. K. Kanbach, S. Stubner, and T. K. Maran. Artificial intelligence in talent acquisition: a multiple case study on multinational corporations. Management Decision, 2024.

[18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[20] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

[21] R. S. Sutton. Introduction: The challenge of reinforcement learning. In Reinforcement learning, pages 1–3. Springer, 1992.

[22] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.

[23] L. Tesfatsion. Agent-based computational economics: A constructive approach to economic theory. In Handbook of Computational Economics, volume 2, pages 831–880. Elsevier, 2006.

[24] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

[25] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.

[26] G. N. Yannakakis and J. Togelius. Artificial intelligence and games, volume 2. Springer, 2018.

[27] Y. Ye, H. Pei, B. Wang, P.-Y. Chen, Y. Zhu, J. Xiao, and B. Li. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1112–1119, 2020.

[28] S. Zheng, A. Trott, S. Srinivasa, N. Naik, M. Gruesbeck, D. C. Parkes, and R. Socher. The ai economist: Improving equality and productivity with ai-driven tax policies. arXiv preprint arXiv:2004.13332, 2020.