Approximate Nash Equilibrium Learning for n-Player Markov Games in Dynamic Pricing
动态定价中 n 人马尔可夫博弈的近似纳什均衡学习

Larkin Liu Technical University of Munich
larkin.liu@tum.de

Abstract 抽象

We investigate Nash equilibrium learning in a competitive Markov Game (MG) environment, where multiple agents compete, and multiple Nash equilibria can exist. In particular, for an oligopolistic dynamic pricing environment, exact Nash equilibria are difficult to obtain due to the curse-of-dimensionality. We develop a new model-free method to find approximate Nash equilibria. Gradient-free black box optimization is then applied to estimate $\epsilon$ , the maximum reward advantage of an agent unilaterally deviating from any joint policy, and to also estimate the $\epsilon$ -minimizing policy for any given state. The policy- $\epsilon$ correspondence and the state to $\epsilon$ -minimizing policy are represented by neural networks, the latter being the Nash Policy Net. During batch update, we perform Nash Q learning on the system, by adjusting the action probabilities using the Nash Policy Net. We demonstrate that an approximate Nash equilibrium can be learned, particularly in the dynamic pricing domain where exact solutions are often intractable.
我们研究了竞争性马尔可夫博弈（MG）环境中的纳什均衡学习，其中多个代理竞争，并且可能存在多个纳什均衡。特别是，对于寡头垄断的动态定价环境，由于维数诅咒，很难获得精确的纳什均衡。我们开发了一种新的无模型方法来找到近似纳什均衡。然后应用无梯度黑盒优化来估计 $\epsilon$ 单方面偏离任何联合策略的代理的最大奖励优势，并估计任何给定状态的 $\epsilon$ -minimize 策略。策略 $\epsilon$ 对应和状态到 $\epsilon$ 最小化策略由神经网络表示，后者是纳什策略网。在批量更新期间，我们通过使用 Nash Policy Net 调整动作概率，在系统上执行 Nash Q 学习。我们证明了可以学习近似纳什均衡，尤其是在动态定价领域，其中精确解决方案通常难以处理。

\ecaisubmission \ecai提交

1 Introduction 1 介绍

The application of deep reinforcement learning to solve single player Markov Decision Processes are relatively well-researched in today’s machine learning literature, however, the application of novel deep reinforcement learning methods to solve multi-agent competitive MDP’s are relatively few. Particularly, the challenge revolves around competing objectives and solving for, often computationally intractable, Nash equilibria [27] in a competitive setting, where agents cannot merely maximize only their respective Q functions. Nash equilibria in Markov Games are particularly useful in the area of dynamic pricing, a well-known problem in today’s online eCommerce marketplaces. In a dynamic pricing game, multiple firms compete for market share of their product on the same online platform. To model this market, we adopt a modified version of Bertrand oligopoly [1], where the marginal cost of production is 0, and where purchasing behaviour is probabilistic.
在当今的机器学习文献中，深度强化学习解决单人马尔可夫决策过程的应用相对深入，但是，新型深度强化学习方法解决多智能体竞争性 MDP 的应用相对较少。特别是，挑战围绕着竞争目标和在竞争环境中解决通常难以计算的纳什均衡 [27]，其中代理不能仅仅最大化他们各自的 Q 函数。马尔可夫游戏中的纳什均衡在动态定价领域特别有用，这是当今在线电子商务市场中众所周知的问题。在动态定价游戏中，多家公司在同一在线平台上争夺其产品的市场份额。为了模拟这个市场，我们采用了 Bertrand 寡头垄断 [1] 的修改版本，其中边际生产成本为 0，并且购买行为是概率性的。

1.1 Market Oligopoly
1.1 市场寡头垄断

Especially in today’s online commerce industries, dynamic pricing provides platform firms with an adversarial advantage and increases profit margins in a competitive economy. In an ideal case, a firm’s pricing strategy should consider the actions of other firms, providing the justification for Nash equilibrium policies over unilateral reward maximization policies. In previous approaches, [22] modelled the dynamic pricing problem using deep reinforcement learning, however, focused primarily on single agent revenue maximization and not solving for multi-agent Nash equilibria. [35] apply stochastic dynamic programming in a simulated environment to estimate consumer demand and maximize expected profit, however, do not consider the equilibrium solution of the market. [38] model market demand using parametric functions, with unknown parameters, yet they do not consider a Nash equilibrium nor the actions of previous pricing on demand. Thus, in almost all recent research on the dynamic pricing problem under uncertainty, there exists a desideratum to compute and promote Nash equilibria for market pricing strategies. Such strategies can be applied to eCommerce platforms such as Amazon auto pricing, allowing firms to opt for an automated pricing algorithm.
尤其是在当今的在线商务行业中，动态定价为平台公司提供了对抗性优势，并在竞争激烈的经济中提高了利润率。在理想情况下，公司的定价策略应考虑其他公司的行为，为纳什均衡政策而不是单边奖励最大化政策提供理由。在以前的方法中，[22] 使用深度强化学习对动态定价问题进行建模，然而，主要关注单个代理收入最大化，而不是解决多代理 Nash 均衡。[35] 在模拟环境中应用随机动态规划来估计消费者需求并最大化预期利润，但是，不要考虑市场的均衡解。[38] 使用参数未知的参数函数对市场需求进行建模，但他们不考虑纳什均衡，也不考虑先前按需定价的作用。因此，在几乎所有关于不确定性下动态定价问题的近期研究中，都存在着计算和促进市场定价策略纳什均衡的职责。此类策略可应用于亚马逊自动定价等电子商务平台，允许公司选择自动定价算法。

A large body of literature on dynamic pricing has assumed an unknown demand function and tried to learn the intrinsic relationship between price and demand. [15] showed that a myopic Bayesian policy may lead to incomplete learning and poor profit performance. [23] studied a joint pricing and inventory problem while learning the price-sensitive demand using a Bayesian dynamic program. Many recent studies have revolved around non-parametric methods. [2] developed a pricing strategy that balances the trade-off between exploration and exploitation. Furthermore, in [22] the demand function was a black box system where the demand was approximated from experience replay and live online testing in a reinforcement learning framework.
大量关于动态定价的文献假设了一个未知的需求函数，并试图了解价格和需求之间的内在关系。[15] 表明，短视的贝叶斯政策可能导致学习不完整和利润表现不佳。[23] 在使用贝叶斯动态程序学习价格敏感型需求的同时，研究了联合定价和库存问题。最近的许多研究都围绕着非参数方法展开。[2] 开发了一种定价策略，用于平衡勘探和开发之间的权衡。此外，在 [22] 中，需求函数是一个黑盒系统，其中需求是通过强化学习框架中的经验重放和实时在线测试来近似的。

Dynamic pricing in oligopolies presents multiple problems. One aspect is the optimization problem, where we learn a Nash equilibrium policy when presented with the parameters of the market demand function, and second is the learning of the demand parameters itself. For example, [7] and [19] studied parametric approaches, using maximum likelihood to estimate unknown parameters of the demand function. Recent research has been conducted in demand learning, for example in cases where multiple products exist, and firms can learn to price on an assortment of items [11] using multinomial logit regression, or when multiple products exist in a demand learning scenario, and a limited inventory exists, Multi-Armed Bandit approaches have shown to be effective at devising profitable policies [12]. However, these approaches do not consider past pricing actions influencing the future market demand parameters. Furthermore, they do not consider competing firms aiming for a Nash equilibrium.
寡头垄断的动态定价存在多个问题。一个方面是优化问题，当面对市场需求函数的参数时，我们学习纳什均衡策略，第二个是需求参数本身的学习。例如，[7] 和 [19] 研究了参数方法，使用最大似然来估计需求函数的未知参数。最近的研究是在需求学习方面进行的，例如，在存在多种产品的情况下，公司可以学习使用多项式 Logit 回归对各种商品进行定价 [11]，或者当需求学习场景中存在多种产品，并且存在有限的库存时，多臂老虎机方法已被证明可以有效地设计有利可图的政策 [12].但是，这些方法并未考虑影响未来市场需求参数的过去定价行为。此外，他们不考虑以纳什均衡为目标的竞争公司。

We propose a demand market where past pricing affects the future market demand in a market competition. To search for a Nash equilibrium, we apply a multi-agent Markov Decision Process, where the state transition is driven by the joint pricing actions of all players (or firms). To simplify the problem, we assume that each firm has unlimited inventory, and there is only a single product with no substitutes.
我们提出了一个需求市场，其中过去的定价会影响市场竞争中的未来市场需求。为了寻找纳什均衡，我们应用了多智能体马尔可夫决策过程，其中状态转换由所有参与者（或公司）的联合定价行动驱动。为了简化问题，我们假设每家公司都有无限的库存，并且只有一种没有替代品的产品。

1.2 Multi-agent Markov Decision Processes
1,2 多智能体马尔可夫决策过程

Multi-agent Markov Decision Processes, or Markov Games (MG’s), constitute an important area of research, especially when multiple agents are self-interested, and an exact or approximate Nash equilibrium for the system is desired. The computation of Nash equilibria additionaly presents great difficulty when searching over the enormous joint state action space of the problem, and approximations to this search problem do exist [29]. Moreover, the existence of multiple Nash equilibria can further complicate the solution, as some solutions may be more desirable than others.
多智能体马尔可夫决策过程或马尔可夫博弈（MG）构成了一个重要的研究领域，尤其是当多个智能体都是自利的，并且需要系统的精确或近似纳什均衡时。此外，在搜索问题的巨大联合状态动作空间时，纳什均衡的计算也带来了很大的困难，并且确实存在该搜索问题的近似值[29]。此外，多重纳什均衡的存在会使解决方案进一步复杂化，因为某些解决方案可能比其他解决方案更可取。

Yet approximate search algorithms require prior knowledge of the joint reward function and are often limited to two players, modelled by a best response function. We treat our payoff function as an oracle [39], meaning we have knowledge regarding the parametric form of the payoff function, however, there is no knowledge to the agents in the MG regarding how the reward function generates rewards for the players. The state of the game is visible to the agents, yet the agents have no knowledge regarding the state transition function. Nevertheless, the market demand parameters and Nash equilibria are known for purposes of experimentation. This eliminates the need for pre-period model fitting to pre-estimate market demand parameters [10] allowing us to compare our solution to the theoretical solution, as opposed to an empirical estimate.
然而，近似搜索算法需要联合奖励函数的先验知识，并且通常仅限于两个玩家，由最佳响应函数建模。我们将 payoff 函数视为 oracle[39]，这意味着我们了解支付函数的参数形式，但是，MG 中的代理不知道 reward 函数如何为玩家生成奖励。游戏的状态对 Agent 是可见的，但 Agent 对 state transition 函数一无所知。尽管如此，市场需求参数和纳什均衡是已知的，用于实验目的。这消除了对期前模型拟合以预先估计市场需求参数的需要 [10]，使我们能够将我们的解决方案与理论解决方案进行比较，而不是经验估计。

Model-based approaches furthermore exist to provably find the existence of a Nash equilibrium [43]. However, we aim for a Nash equilibrium solver which is model-free, where no knowledge regarding the reward or transition function is known to the agents. In recent literature, [31] present a model-free algorithm to solve competitive MDP’s when the environment parameters can be altered by an external party - but this is not always the case in market economics. Furthermore, [34] propose a model-free decentralized deep learning framework, where the agent is blind to the other agents actions, and convergence to a Nash equilibrium is proven. [44] propose a convergent solution for zero-sum MG’s via entropy-regularization. However, both [44] and [34] impose a number of theoretical restrictions on the MG in order for this convergence to occur. Moreover, [20] present a provably convergent MG solver for imperfect information restricted to two agents. Nevertheless, we are concerned with an approximate solution to a full information MG for N agents.
此外，基于模型的方法可以证明纳什均衡的存在 [43]。然而，我们的目标是一个无模型的 Nash 均衡求解器，其中代理不知道有关奖励或转换函数的知识。在最近的文献中，[31] 提出了一种无模型算法，当环境参数可以被外部方改变时，解决竞争性 MDP 问题——但在市场经济学中，情况并非总是如此。此外，[34] 提出了一个无模型的去中心化深度学习框架，其中代理对其他代理的动作视而不见，并且证明了收敛到纳什均衡。[44] 通过熵正则化提出了零和 MG 的收敛解。然而，为了发生这种收敛，[44] 和 [34] 都对 MG 施加了许多理论限制。此外，[20] 提出了一个可证明的收敛 MG 求解器，用于仅限于两个代理的不完美信息。尽管如此，我们关心的是 N 个代理体的完整信息 MG 的近似解。

Contributions: This work outlines a methodology for Deep Q Learning, as introduced in [26], by extending the framework to multi-agent reinforcement learning (MARL) with a Nash equilibrium objective based on the methods in [17] and [41]. Although the framework presented in [17] is theoretically sound, the solution to the Nash equilibrium function is often intractable. We therefore apply approximation methods to compute a Nash equilibrium. Black box optimization is applied to estimate an $\epsilon$ -Nash Equilibrium policy, and this approximation is learned as a series of neural networks. This MARL model is then applied to the domain of dynamic pricing.
贡献：这项工作概述了 [26] 中介绍的深度 Q 学习方法，将框架扩展到基于 [17] 和 [41] 方法的纳什均衡目标的多智能体强化学习（MARL）。尽管 [17] 中提出的框架在理论上是合理的，但 Nash 均衡函数的解往往是棘手的。因此，我们应用近似方法来计算纳什均衡。黑盒优化用于估计 $\epsilon$ -Nash 均衡策略，并且此近似值被学习为一系列神经网络。然后将此 MARL 模型应用于动态定价领域。

2 Dynamic Pricing Game
阿拉伯数字动态定价游戏

An oligopoly across $N$ firms is modelled as a multi-agent Markov Decision Process with a fully observable state action space. The state space is represented by the fully-observable demand-influencing reference price $\bar{p_{t}}$ being the state of the game $s^{t}$ . In discrete intervals at time $t$ , each agent issues a price for the item to be sold to the general market, $x^{t}_{n}$ . The reference price, of the market at time $t$ is determined by a stochastic function of all the agents actions at time $t-1$ . To focus the problem strictly on dynamic pricing, for each firm we assume a sufficiently large inventory, with no marginal costs of production, no possibility for inventory stockouts, no holding costs, and the capacity to always meet any realization of market demand.
跨 $N$ 公司的寡头垄断被建模为具有完全可观察状态动作空间的多代理马尔可夫决策过程。状态空间由完全可观察的需求影响参考价格表示，该价格 $\bar{p_{t}}$ 是游戏 $s^{t}$ 的状态。在 time $t$ 的离散间隔内，每个代理都会为要出售给一般市场的商品发布价格。 $x^{t}_{n}$ 当时 $t$ 市场的参考价格由所有代理在 time $t-1$ 的随机函数决定。为了将问题严格集中在动态定价上，我们为每家公司假设有足够大的库存，没有边际生产成本，没有库存缺货的可能性，没有持有成本，并且有能力始终满足任何实现的市场需求。

2.1 Markov Game Parameters
2.1 马尔可夫博弈参数

The joint action space $\{x_{1}^{t}\ ...\ x_{n}^{t}\}\in\mathbf{X}$ constitutes the current actions of all agents at time $t$ , which drive the state transition by setting prices at time $t$ and represent the set price of the item from agent $n$ at time $t$ . The state-action-reward space is defined as a tuple $(s^{t},x_{1}^{t},...,x_{N}^{t},r_{1}^{t},...,r_{N}^{t},s^{t+1})$ where $r_{n}^{t}$ is the reward for agent $n$ at time $t$ . The joint reward can be written as $\bar{r}_{t}=(r_{1}^{t},...,r_{N}^{t})$ , and the joint action can be written as $\bar{x}_{t}=(x_{1}^{t},...,x_{N}^{t})$ . $N$ denotes the number of agents. The exact transition probabilities are not known to the agents. Each agent must learn the demand function and optimal strategy as the MG progresses. A state $s^{t}$ is determined by the reference price of the market observable to all agents, $s\in\mathbb{R}$ . We discretize the action space into even segments representing a firm’s price. Where each segment represents an action $x_{n}\in X_{n}$ , and $X_{n}$ is the action space for any agent $n\in N$ .
联合操作空间 $\{x_{1}^{t}\ ...\ x_{n}^{t}\}\in\mathbf{X}$ 构成了所有代理的当前操作 time $t$ ，这些操作通过在 time $t$ 设置价格来驱动状态转换，并表示 agent $n$ at time $t$ 的项目的设置价格。state-action-reward 空间定义为一个元组 $(s^{t},x_{1}^{t},...,x_{N}^{t},r_{1}^{t},...,r_{N}^{t},s^{t+1})$ ，其中 $r_{n}^{t}$ 是 agent $n$ 在 time $t$ 的奖励。联合奖励可以写成 $\bar{r}_{t}=(r_{1}^{t},...,r_{N}^{t})$ ，联合行动可以写成 $\bar{x}_{t}=(x_{1}^{t},...,x_{N}^{t})$ 。 $N$ 表示代理的数量。代理不知道确切的转换概率。随着 MG 的进展，每个代理都必须学习需求函数和最佳策略。状态 $s^{t}$ 由所有代理可观察到的市场参考价格确定。 $s\in\mathbb{R}$ 我们将操作空间离散为代表公司价格的偶数段。其中，每个区段表示一个操作 $x_{n}\in X_{n}$ ，并且 $X_{n}$ 是任何代理 $n\in N$ 的操作空间。

2.2 Reference Pricing
2.2 元参考定价

The first pillar of any market dynamic is constituted by the demand function, dictated by both the historical and contemporary prices, of the firms in the market. Demand processes can be ideally linear and continuous, however may not be guaranteed to be stationary or continuous [6], and can be subject to various exogenous factors such as environmental carryover effects [32].
任何市场动态的第一支柱都是由需求函数构成的，需求函数由市场上公司的历史和现代价格决定。理想情况下，需求过程可以是线性和连续的，但不能保证是平稳或连续的 [6]，并且可能受到各种外生因素的影响，例如环境残留效应 [32]。

We create an idealization of a market based on [36], where a dynamic pricing problem, with reference price effects and a linear demand function, is imposed. The expected current demand of a product is a function of the average price set by all firms, denoted as $\tilde{x}$ , and a historical reference price, $\bar{p}$ . The reference price $\bar{p}$ is given, and cannot be modified during the current time $t$ , moreover, it is a function of the immediate past, $t-1$ . Although [36] is not necessarily the only model that incorporates reference pricing [25], [13], [28], [16], we adopt it due to its simplicity and the existence of provable Nash equilibria in an oligopolistic framework.
我们基于 [36] 创建了一个市场理想化，其中施加了一个具有参考价格效应和线性需求函数的动态定价问题。产品的预期当前需求是所有公司设定的平均价格（表示为 $\tilde{x}$ ）和历史参考价格的函数。 $\bar{p}$ 参考价格 $\bar{p}$ 是给定的，并且在当前时间内 $t$ 不能修改，而且，它是过去的函数。 $t-1$ 尽管 [36] 不一定是唯一包含参考定价 [25]、[13]、[28]、[16] 的模型，但我们采用它是因为它的简单性以及寡头垄断框架中存在可证明的纳什均衡。

	$\displaystyle f(\tilde{x})$	$\displaystyle=\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}(\tilde{x}-\bar{p})$		(1)
	$\displaystyle\beta_{0}$	$\displaystyle\geq 0,\ \beta_{1}<0,\ \beta_{2}\leq 0,\ f(\tilde{x})\geq 0,% \tilde{x}\geq 0$		(2)

$\epsilon_{D}$ is defined as the noise from a Poisson process. Such a Poisson process has the arrival rate $\lambda_{t}(\tilde{x})=f(\tilde{x})$ from Eq. (2), and standard deviation $\sigma=\sqrt{\lambda_{t}(\tilde{x})}$ . Furthermore, [36] make the stipulation of decreasing demand with respect to increasing price, as illustrated in Condition (2).
$\epsilon_{D}$ 定义为来自 Poisson 过程的噪声。这样的泊松过程具有方程（2）的到达率 $\lambda_{t}(\tilde{x})=f(\tilde{x})$ 和标准差 $\sigma=\sqrt{\lambda_{t}(\tilde{x})}$ 。此外，[36] 规定了需求减少与价格上涨有关，如条件（2）所示。

This demand-influencing reference price can be affected by a variety of factors, such as inflation, unemployment, psychological perception [30]. Moreover, in many proposed oligopoly models, as in [18] and [3], the reference price is dictated by a historical joint market price of multiple firms. However, modelling a competitive market oligopoly with an autocorrelated reference price in a MG setting is not heavily investigated until now. In our model, we focus on designing a market whose reference price is driven by the historical $t-1$ joint market price, and additional factors which also affect the reference price are represented as noise $\epsilon_{D}$ . Thus the transition of the reference price is determined by the average of the previous joint prices plus some Guassian randomness, $\epsilon_{D}$ .
这个影响需求的参考价格会受到多种因素的影响，例如通货膨胀、失业率、心理认知 [30]。此外，在许多提出的寡头垄断模型中，如 [18] 和 [3]，参考价格由多家公司的历史联合市场价格决定。然而，直到现在，在 MG 环境中使用自相关参考价格对竞争性市场寡头垄断进行建模还没有得到深入研究。在我们的模型中，我们专注于设计一个市场，其参考价格由历史 $t-1$ 联合市场价格驱动，而同样影响参考价格的其他因素则表示为噪声 $\epsilon_{D}$ 。因此，参考价格的转换由先前联合价格的平均值加上一些瓜斯随机数来确定。 $\epsilon_{D}$

In our experiment, the reference price of a product is determined by the previous actions of the firms. We express the reference price function as $\bar{p}(t):\ \mathbb{R}^{N}\xrightarrow[]{}\mathbb{R}$ , mapping a vector of historical pricing actions $\mathbf{x}^{t-1}$ to $\mathbb{R}$ . In the beginning of the Markovian process, the reference price is randomly generated within the action space of the pricing range. The stochastic nature of the market price transition entail the Markov property of the MG.

\displaystyle\bar{p}(t)=\frac{1}{N}\sum^{N}_{n=1}x^{t-1}_{n}+\epsilon_{D},% \quad\text{where}\quad\bar{p}(t=0)=\frac{1}{N}\sum^{\mathbf{x^{0}}}_{n=1}x^{0}% _{n}+\epsilon_{D}

(3)

2.3 Probabilistic Demand Function

The expected profit for player $n$ is $\mathbf{E}[\Pi_{n}(x_{n})]$ , which equals revenue as the marginal cost is of production is assumed to be 0, is defined as,

	$\displaystyle\mathbf{E}[\Pi_{n}(x_{n})]$	$\displaystyle=\Phi_{n}(x_{n},x_{-n})x_{n}f(\tilde{x})$		(4)
	$\displaystyle\Phi_{n}(x_{n},x_{-n})$	$\displaystyle=\frac{e^{\phi(x_{n})}}{\sum^{N}_{i=1}e^{\phi(x_{i})}}$		(5)

$\Phi_{n}(x_{n},x_{-n})$ in Eq. (5) represents the probability that a customer purchases the item sold by firm $n$ , where the player $n$ prices their merchandise with price $x_{n}$ . $f(x_{n})$ is the expected demand during the single stage game for a fixed time period. Following the quantal response function from [24] and [14], we define a purchase elasticity function, $\phi_{n}(x_{n},x_{-n})$ ,

	$\displaystyle\phi(x)=$	$\displaystyle\begin{cases}b_{n}-a_{n}x&\text{for \quad}0<x<b_{n}/a_{n}\\ 0&\text{for \quad}x>b_{n}/a_{n}\end{cases}$		(6)
	where	$\displaystyle x\in\mathbb{R},x>0,a\geq 0,b_{n}=1$		(7)

In $\phi(x)$ , $b_{n}$ represents the maximum weighted contribution to the probability of an item being purchased by a customer given a price. When the price $x$ is 0, this measure is $b_{n}$ . For simplification, we assume the linear marginal decline of this measure of all players $a_{n}$ as equal, $a_{n}=a$ , that is the market has the same elasticity regarding the marginal decrease of the customer willingness to purchase from any firm as price increases, with a negative slope $a$ . The probability of a customer choosing to purchase from firm $n$ at price $x_{n}$ among prices set by other firms $x_{-n}$ is defined by combining a softmax function and purchase elasticity function, $\phi(x)$ . This mechanism will prevent a solution where each firm undercuts each other to set their prices to 0, as the lowest price need not guarantee that a consumer will buy their product.

3 Nash Equilibrium Conditions

Computation of exact Nash, or approximate $\epsilon$ -Nash, equilibria is an NP-hard problem and notoriously difficult to compute [5]. This involves a search over the entire joint state action space, computing the value of each state under a candidate policy. Furthermore, it involves knowledge of the joint Q function and the transition probability of the system. In our scenario, we have the condition that all agents are identical, therefore the solution of one agent can apply to the solution of another.

3.1 Theoretical $\epsilon$ -Nash Equilibria

Under the market outlined in Section 2, given the market inputs $\bar{p}$ and $\tilde{x}$ , in any pricing strategy, there exists an optimal deviation $d^{*}\in\mathbf{R}$ , such that $\mathbf{E}[\Pi_{n}(x_{n}-d^{*})]$ will yield the maximum profit advantage $\epsilon$ .

\displaystyle\epsilon

\displaystyle=\underset{d^{*}\in\mathbf{R}}{\mathrm{max}}\ \Bigg{(}\mathbf{E}[% \Pi_{n}(x_{n}-d^{*})]-\mathbf{E}[\Pi_{n}(x_{n})]\Bigg{)}

(8)

An increase in the individual profit of an agent $n$ from unilaterally deviating is denoted as $\epsilon$ . Given this optimal deviation, $d^{*}$ , a maximum theoretical upper bound on the profit advantage $\epsilon$ that can be obtained. Eq. (9) provides the theoretical value of $d^{*}$ which is a function of both reference price $\bar{p}$ and market price $\tilde{x}$ , as well as the fixed market parameters $\beta_{0},\beta_{1},\beta_{2}$ . A derivation of $d^{*}$ can be found in Appendix C.

	$\displaystyle d^{*}$	$\displaystyle=\frac{\sqrt{c_{1}^{2}-c_{1}+4(c_{2}-1)c_{2}-2c_{2}}}{2c_{2}}$		(9)
	$\displaystyle\text{where}\quad c_{1}$	$\displaystyle=\frac{-(\beta_{1}+\beta_{2})}{f(\tilde{x})}-\frac{1}{\tilde{x}},% \quad c_{2}=\frac{-(\beta_{1}+\beta_{2})}{f(\tilde{x})\tilde{x}}$

Refer to caption — Market Scenario 1: $\beta_{0}=15,\beta_{1}=-1.05,\beta_{2}=-3.1,a=0.1$

Using the Eq. (9), as visualized in Fig. 1, multiple Nash equilibria can exist when $\epsilon=0$ , or when $\epsilon$ is sufficiently small (see Section 5). Such behaviour occurs when one agent unilaterally deviates from $\tilde{x}$ , under reference price $\bar{p}$ . Thus, the lower plateau from Fig. 1 represents the region where a theoretical Nash Equilibrium exists. Bounds on the state space and action space are limited to $|S|=|X_{n}|=10$ , that is there are 10 discrete intervals for each state $s^{t}$ and action $x_{n}^{t}$ , representing price values.

3.2 Nash Equilibrium in Markov Games ( $\delta$ )

The value of a policy, at each state, can be represented by a joint action which maximizes the joint Q function [42] under policy $\pi(s)$ . The probability of agent $n$ taking action $x_{n}$ is defined as $\pi_{n}(s,x_{n})$ .

\displaystyle v(s,\pi_{n},\pi_{-n})=\underset{\bar{x}\in\mathbf{X}}{\mathrm{% max}}\ \prod_{n=1}^{N}\pi_{n}(s,x_{n})Q_{n}(s,x_{n},x_{-n})

(10)

An $\epsilon$ -Nash equilibrium is defined as a joint policy such that the reward of a single stage game will not result in a greater payoff to the agent $n$ by more than $\epsilon$ , when any agent unilaterally deviates from said joint policy. Provided $\epsilon$ , we consider a bound on the corresponding MG, $\delta$ , which defines a bound on the gain of the value of a policy, $v(s,{\pi_{n}}^{*},\pi^{*}_{-n})$ , should agent $n$ unilaterally deviate with an alternative policy. The solution to Eq. (10) can be computed by searching over the joint action space $\{x_{n},x_{-n}\}$ .

	$\displaystyle\underset{\bar{x}\in\mathbf{X}}{\mathrm{argmax}}\prod_{n=1}^{N}% \pi_{n}(s,x_{n})Q_{n}(s,x_{n},x_{-n}^{*})$	(11)
	$\displaystyle\leq\underset{\bar{x}\in\mathbf{X}}{\mathrm{argmax}}\prod_{n=1}^{% N}\pi_{n}(s,x_{n})Q_{n}(s,x_{n}^{},x_{-n}^{})+\delta$	(12)
$\displaystyle v(s,\pi_{n},\pi_{-n}^{*})$	$\displaystyle\leq v(s,\pi_{n}^{},\pi_{-n}^{})+\delta,\quad\forall s\in% \mathbf{S}$	(13)

$\delta$ serves as an upper bound on $\epsilon$ , therefore the minimization of $\delta$ will also minimize any possible existence of $\epsilon$ for each single stage game in the MG. In fact, the existence of $\epsilon$ implies the existence of upper bound $\delta$ , we provide a proof of this existence in Appendix C.3.

4 Multi-Agent Nash Q Learning

In our model, the game provides full information, where information regarding the state and the actions of all agents are visible to any agent. This allows for experience replay for Q learning [26] to occur, and the Q function can be shared for all agents as they are assumed to be identical. The joint action space $\bar{x}=(x_{1},...,x_{N})$ is defined as the combined actions of all agents at time $t$ . Normally, in Q learning, the update mechanism searches for the action that maximizes the Q function, however, in Nash Q learning, we must update the Q function with the solution to the Nash equilibrium. In our model, we represent the tabular Q function via a function approximator, which is denoted simply as $Q(s,\bar{x})$ . As new experiences are obtained when the MG is played, the Nash Q value is updated. We utilize the Q update mechanism defined in [17] as the update mechanism for the Nash Q estimator for Q learning. Given a Q function and a Nash policy, we search over the joint action that maximizes the scaled Q function [21], as in Eq. (15). In our representation, the Q function is a vector, returning the respective Q value for each agent, based on the joint probability input for the joint policy.

$\displaystyle Q^{\prime}(s,\bar{x})$	$\displaystyle\xleftarrow[]{}(1-\alpha)Q(s,\bar{x})+\alpha[r+\gamma\mathcal{N}(% s^{\prime})]$	(14)
$\displaystyle\bar{x}^{*}$	$\displaystyle=\underset{\bar{x}}{\mathrm{argmax}}\ Q(s^{\prime},\bar{x})\prod_% {i=1}^{N}\pi_{n}^{*}(s^{\prime},x_{n})$	(15)
$\displaystyle\mathcal{N}(s^{\prime})$	$\displaystyle=Q(s^{\prime},\bar{x}^{})\prod_{i=1}^{N}\pi_{n}^{}(s^{\prime},x% ^{*}_{n})$	(16)

$\mathcal{N}(s^{\prime})$ represents a Nash Operator, indicating the Q value of a Nash equilibrium, this is approximated by a scaling factor, according to the Nash equilibrium policy, $\pi^{*}$ , multiplied by the Q value of the joint action, $Q(s,\bar{x})$ . Extending from the work of [17], we introduce a neural network for mapping any state action combination to its Nash policy, where the Nash scaling factor derives from and is used to compute $\mathcal{N}(s^{\prime})$ in conjunction with the Q function. The Q-function used in Eq. (16) and (14) is represented by a deep neural network, simply referred to as the Nash Q Net $Q(s)$ . In a full information game with homogenous agents, a shared neural network representing $Q(s,\bar{x})$ can be used for all agents in the Markov Game (if not homogeneous a separate Q network must be retained and updated separately for each agent). The Q Network parameters representing $\mathcal{N}(s^{\prime})$ is learned in the same method as Deep Q learning [26], the key innovation, is that the scaling factor to compute $\mathcal{N}(s^{\prime})$ is obtained via a Nash Policy Net, $\Psi(s)$ (defined later in Section 4.2).

4.1 Estimating Value Advantage via Black Box Optimization

We apply a deep neural network to represent a mapping of a joint policy to its respective $\delta$ from Eq. (13), designated as $\Gamma(\pi)\mapsto\delta$ , where $\delta$ is a vector containing the $\delta_{s}$ of each state. As the gradient of $\Gamma(\pi)$ cannot be easily evaluated, we apply gradient free black box optimization for $\delta$ minimization. Trust Region Optimization (TRO) has been shown to be effective for solving high-dimensional optimization problems [40] [9] [8] [33], particularly when the computational resources are available. To compute the existence of an $\epsilon$ -Nash equilibrium in the high-dimensional policy space efficiently, we apply model-based Bayesian Optimization via Trust Region optimization (TuRBO) from [9]. TuRBO combines standard TRO with a multi-armed bandit system via Thompson Sampling [37] to optimize for local multiple trust regions simultaneously, with sufficient convergence time. However, the candidate generation step in TuRBO is not constrained to account for the valid joint probabilities of each agent, in which the sum of probabilities for each agent in its respective action space must sum to 1. We alter TuRBO by simply normalizing the original candidate probabilities over each set of probabilities belonging to an agent $n$ . The resulting modified algorithm is denoted TuRBO-p, where any candidate value generated by TuRBO-p will have joint probabilities that sum to 1 for policies corresponding to each agent.

	$\displaystyle V(s,\pi)$	$\displaystyle=\underset{\bar{x}}{\mathrm{max}}\ Q(s,\bar{x})\prod_{i=1}^{N}\pi% _{n}(s,x_{n})$		(17)
	$\displaystyle\delta$	$\displaystyle=\underset{\pi_{n}^{\prime}}{\mathrm{max}}\ \Bigg{(}V(s,\pi_{n}^{% \prime},\pi_{-n})-V(s,\pi_{n},\pi_{-n})\Bigg{)}\quad\forall s\in\mathbf{S}$		(18)

The maximization problem is formulated in Eq. (18) representing the maximum gain in value of a policy, where agent $n$ deviates from policy $\pi_{n}$ with $\pi_{n}^{{}^{\prime}}$ .

4.2 Nash Policy Learning

The $\epsilon$ -Nash policy is defined as a joint policy $(\pi^{*}_{n},\pi^{*}_{-n})$ that minimizes $\delta$ from Eq. (13). Drawing inspiration from [4], where a Nash equilibrium is found by effectively minimizing for $\epsilon$ via linear programming, we apply a similar technique of $\delta$ -minimization. However, in [4], the model parameters are known to the agents, and the Markov game was constrained to two players. In our MG, the joint reward function must be learned. Therefore, we perform approximate search over the policy space instead of using exact methods to discover any possible approximate Nash equilibrium policies. In principle, each state has maps to a corresponding probability in accordance with the Nash Policy which minimizes $\delta$ in Eq. 18. However, a table keeping track of such an enormous state policy space is not feasible. Therefore, we implement a Nash Policy Net $\Psi(s)\xrightarrow[]{}\pi^{*}$ to learn the state to policy mapping, which is the joint policy $\pi$ producing an approximate Nash equilibrium as approximated via TuRBO-p, designated as $\hat{\pi}^{*}(s)$ .

\displaystyle\hat{\pi}^{*}(s)=\underset{\pi}{\mathrm{argmin}}\ \Gamma(\pi)_{s}

(19)

Algorithm 1 Nash equilibrium learning

1:Initialize state joint policy

\pi

2:Initialize random parameters for

\Psi(s_{t})

Q(s_{t})

and

\Gamma(\pi)

, as

\theta_{Q}

\theta_{\Psi}

, and

\theta_{\Gamma}

3:Initialize MDP Environment,

\mathcal{M}

4:for

e\in episodes

do:

5: Get initial state

s_{0}

from

\mathcal{M}

6: for

t\in(0,t_{max})

do: Iterate until end of episode.

7: Get action probabilities

\pi(s_{t},x_{n})

from

\Psi(s_{t})

8: for

n\in(1,N)

do: Iterate through agents.

9: Sample

x_{n}\thicksim Multinomial(s_{t},\pi(\overline{x}))

10: end for

11: Obtain joint action

\overline{x}\xleftarrow[]{}\{x_{1},...,x_{N}\}

12: Assign

\delta_{s}

according to Eq. (18) via TuRBO-p.

13: Find

\pi^{*}\xleftarrow[]{}\underset{\pi}{\mathrm{argmin}}\ \Gamma(\pi,s_{t})

via TuRBO-p.

14: Execute joint action

\overline{x}

(\mathcal{M},\ s_{t})

to obtain

s_{t+1},\overline{r}_{t}

15: Append

(s_{t},\overline{x}_{t},\overline{r}_{t},\epsilon_{s},\pi,s_{t+1})

to experience replay

\mathcal{D}

16: Update State

s_{t}\xleftarrow[]{}s_{t+1}

17: if

|\mathcal{D}|>batchsize

then:

18: Sample minibatch

d

from

\mathcal{D}

19: Set

Q^{\prime}(s_{t},\bar{x})\xleftarrow[]{}\begin{cases}r_{t}&\text{for \ }s_{t}% \ \text{terminal}\\ \text{Eq. \eqref{eq:q-update}}&\text{for \ }s_{t}\ \text{not terminal}\end{cases}

20:

\mathcal{L}_{Q}\xleftarrow[]{}(Q^{\prime}(S_{t},\bar{x})-Q(S_{t},\bar{x}))^{2}

Nash Q Update.

21:

\mathcal{L}_{\Psi}\xleftarrow[]{}(\hat{\pi}-\pi^{*})^{2}

, where

\hat{\pi}

is from

\Psi(s)

22:

\mathcal{L}_{\Gamma}\xleftarrow[]{}(\hat{\epsilon}-\epsilon)^{2}

Nash

\epsilon

update, where

\hat{\epsilon}

is from

\Gamma(\pi)

23: Backpropagate

\mathcal{L}_{Q}

\mathcal{L}_{\Psi}

and

\mathcal{L}_{\Gamma}

\mathcal{Q}

\Psi

, and

\Gamma

24: Update

\theta_{Q}

\theta_{\Psi}

, and

\theta_{\Gamma}

via gradient descent.

25: end if

26: Update joint policy

\overline{\pi}\xleftarrow[]{}\overline{\pi}^{*}

27: end for

28:end for

5 Results

Stabilization of realized market rewards to Nash equilibrium: Fig. 2 represents the convergence of the average market reward of a single agent (randomly selected as agent 0) towards a Nash equilibrium. We superimpose the boundaries of the theoretical Nash equilibrium reward over the plot. The reward is obtained from the boundaries of theoretical Nash equilibria, where $\epsilon<\epsilon_{l}$ ( $\epsilon_{l}=0.0001$ ). The topology of this function is illustrated in Fig. 1, and for each state, or reference price, there exists a boundary where the policy deviation of any agent can occur without significant unilateral reward gain. The expected reward is computed using the market model parameters per Scenario ( $\beta_{0},\beta_{1},\beta_{2},a$ ). From the expected reward, we compare it with the positive difference of theoretical equilibrium reward $\Pi^{\prime}=f(x_{n}^{\prime})\ x_{n}^{\prime}$ where $x_{n}^{\prime}$ . This exists as two boundary points indicating the area where $\epsilon<\epsilon_{l}$ . For each episode the average reward per timestep over the episode lengths is recorded. From Fig. 2, we observe the average reward per agent of the system per episode (blue dashed line), and the average reward of a single agent per episode (orange dashed line), to converge within the boundary of Nash equilibrium (blue shade). However, due to randomization from black box optimization and/or stochastic behaviour of the Markov Game this expected reward can sometimes fall outside of the NE region.

We demonstrate empirically in Appendix D, that the loss function of $Q(s)$ and $\Psi(s)$ decrease with each RL episode. This merely indicates that a representation of the function mapping state to its corresponding Nash policy based on policy-to- $\epsilon$ model $\Gamma(\pi)$ is being learned, and a stable solution is proposed. Hyperparameters are presented in Appendix A.

Comparison with baseline cooperative Q-learning model: We compare our NE learning model wit a naive cooperative Q learning model to serve as a baseline measure of model performance. In the naive Q learning model, the objective is simply to maximize each agent’s respective rewards, and avoid any kind of NE computation. The Q update function in Eq. (14) is simply a maximization of the joint Q function, instead of using $\mathcal{N}(s^{\prime})$ . From Fig.3, we generally observe lower average reward values from the baseline model compared to the NE learning model. The NE reward ranges remain very similar. From an economic perspective, the Nash Q learning agents take into account that other agents will compete against them, and the solver optimizes accordingly, whereas, in cooperative Q learning no consideration for the other agents’ strategies are taken, and agents may to over-compete, driving average rewards lower in an oligopoly.

6 Conclusion

We created a Markov Game that represents an n-firm oligopoly, based on previously established market models where theoretical $\epsilon$ bounds for a Nash equilibrium policy exist. A black box algorithm is applied to estimate an upper bound on the $\epsilon$ value of a joint policy, represented by neural network $\Gamma(\pi)$ . Similarly a Nash Policy Net $\Psi(s)$ is learned to represent the $\epsilon$ -minimizing policy from black box optimization, constituting an $\epsilon$ -Nash policy. Thus $\Psi(s)$ can be used together with traditional Deep Q learning, to solve for a Nash equilibrium. Empirically, we show that the average reward of all agents, and the reward of a respective single agent, converges to an approximate Nash equilibrium. The limitations of this research are the limited action space of the agents, as well as the identical nature of the agents. Larger scale experimentation under this framework is suggested. Furthermore, the proposed market model could be enhanced by implementing more complex non-linear market oligopoly environments.

References

[1] Joseph Bertrand, Review of Walras’s Théorie mathématique de la richesse sociale and Cournot’s Recherches sur les principes mathématiques de la théorie des richesses, 73–81, Cambridge University Press, 1889.
[2] Omar Besbes and Assaf Zeevi, ‘Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms’, Operations Research, 57(6), 1407–1420, (2009).
[3] Richard A Briesch, Lakshman Krishnamurthi, Tridib Mazumdar, and Sevilimedu P Raj, ‘A comparative analysis of reference price models’, Journal of Consumer Research, 24(2), 202–214, (1997).
[4] Sofia Ceppi, Nicola Gatti, Giorgio Patrini, and Marco Rocco, ‘Local search methods for finding a nash equilibrium in two-player games’, pp. 335–342, (08 2010).
[5] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou, ‘The complexity of computing a nash equilibrium’, in Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’06, p. 71–78, New York, NY, USA, (2006). Association for Computing Machinery.
[6] Arnoud V. Den Boer and N. Bora Keskin, ‘Discontinuous demand functions: Estimation and pricing’, Management Science, 66, (04 2020).
[7] Arnoud V den Boer and Bert Zwart, ‘Simultaneously learning and optimizing using controlled variance pricing’, Management Science, 60(3), 770–783, (2014).
[8] Youssef Diouane, Victor Picheny, Rodolphe Le Riche, and Alexandre Scotto Di Perrotolo. Trego: a trust-region framework for efficient global optimization, 2021.
[9] David Eriksson, Michael Pearce, Jacob R. Gardner, Ryan Turner, and Matthias Poloczek, ‘Scalable global optimization via local bayesian optimization’, CoRR, abs/1910.01739, (2019).
[10] Kris Johnson Ferreira, Bin Hong Alex Lee, and David Simchi-Levi, ‘Analytics for an online retailer: Demand forecasting and price optimization’, Manufacturing & Service Operations Management, 18(1), 69–88, (2016).
[11] Kris Johnson Ferreira and Emily Mower, ‘Demand learning and pricing for varying assortments’, Manufacturing & Service Operations Management, (2022).
[12] Kris Johnson Ferreira, David Simchi-Levi, and He Wang, ‘Online network revenue management using thompson sampling’, Operations Research, 66(6), 1586–1602, (2018).
[13] Gadi Fibich, Arieh Gavious, and Oded Lowengart, ‘Explicit solutions of optimization models and differential games with nonsmooth (asymmetric) reference-price effects’, Operations Research, 51(5), 721–734, (2003).
[14] Jacob K. Goeree, Charles A. Holt, and Thomas R. Palfrey, Stochastic game theory for social science: a primer on quantal response equilibrium, Edward Elgar Publishing, Cheltenham, UK, 2020.
[15] J Michael Harrison, N Bora Keskin, and Assaf Zeevi, ‘Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution’, Management Science, 58(3), 570–586, (2012).
[16] Paul Heidhues and Botond Kőszegi, ‘Regular prices and sales’, Theoretical Economics, 9(1), 217–251, (2014).
[17] Junling Hu and Michael P. Wellman, ‘Nash q-learning for general-sum stochastic games’, Journal of Machine Learning Research, 4, 1039–1069, (December 2003).
[18] Chris Janiszewski and Donald R Lichtenstein, ‘A range theory account of price perception’, Journal of Consumer Research, 25(4), 353–368, (1999).
[19] N Bora Keskin and Assaf Zeevi, ‘Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies’, Operations Research, 62(5), 1142–1167, (2014).
[20] Tadashi Kozuno, Pierre Ménard, Rémi Munos, and Michal Valko. Model-free learning for two-player zero-sum partially observable markov games with perfect recall, 2021.
[21] Julien Laumônier and Brahim Chaib-draa, ‘Multiagent q-learning: Preliminary study on dominance between the nash and stackelberg equilibriums’, in Proceedings of AAAI-2005 Workshop on Multiagent Learning, Pittsburgh, USA, (2005).
[22] Jiaxi Liu, Yidong Zhang, Xiaoqing Wang, Yuming Deng, and Xingyu Wu, ‘Dynamic pricing on e-commerce platform with deep reinforcement learning’, CoRR, abs/1912.02572, (2019).
[23] Jue Liu, Zhan Pang, and Linggang Qi, ‘Dynamic pricing and inventory management with demand learning: A bayesian approach’, Computers & Operations Research, 124, 105078, (2020).
[24] R. Duncan Luce, Individual Choice Behavior: A Theoretical Analysis, Wiley, New York, NY, USA, 1959.
[25] Tridib Mazumdar, Sevilimedu P Raj, and Indrajit Sinha, ‘Reference price research: Review and propositions’, Journal of Marketing, 69(4), 84–102, (2005).
[26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, ‘Human-level control through deep reinforcement learning’, Nature, 518(7540), 529–533, (February 2015).
[27] John F. Nash, ‘Equilibrium points in n-person games’, Proceedings of the National Academy of Sciences, 36(1), 48–49, (1950).
[28] Ioana Popescu and Yaozhong Wu, ‘Dynamic pricing strategies with reference effects’, Operations Research, 55(3), 413–429, (2007).
[29] Ryan Porter, Eugene Nudelman, and Yoav Shoham, ‘Simple search methods for finding a nash equilibrium’, Games and Economic Behavior, 63(2), 642–662, (2008). Second World Congress of the Game Theory Society.
[30] Kalyan Raman, Frank M Bass, et al., ‘A general test of reference price theory in the presence of threshold effects’, Tijdschrift voor Economie en management, 47(2), 205–226, (2002).
[31] Giorgia Ramponi, Alberto Maria Metelli, Alessandro Concetti, and Marcello Restelli, ‘Learning in non-cooperative configurable markov decision processes’, in Advances in Neural Information Processing Systems, eds., A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, (2021).
[32] Vithala R Rao, ‘Pricing models in marketing’, Handbooks in Operations Research and Management Science, 5, 517–552, (1993).
[33] Rommel G. Regis, ‘Trust regions in kriging-based optimization with expected improvement’, Engineering Optimization, 48(6), 1037–1059, (2016).
[34] Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie, Tamer Basar, and Asuman E. Ozdaglar, ‘Decentralized q-learning in zero-sum markov games’, CoRR, abs/2106.02748, (2021).
[35] Rainer Schlosser and Martin Boissier, ‘Dynamic pricing under competition on online marketplaces: A data-driven approach’, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, p. 705–714, New York, NY, USA, (2018). Association for Computing Machinery.
[36] Alfred Taudes and Christian Rudloff, ‘Integrating inventory control and a price change in the presence of reference price effects: a two-period model’, Mathematical Methods of Operations Research, 75(1), 29–65, (2012).
[37] William R Thompson, ‘On the likelihood that one unknown probability exceeds another in view of evidence of two samples’, Biometrika, 25(3-4), 285–294, (12 1933).
[38] Ruben van de Geer, Arnoud V. den Boer, Christopher Bayliss, Christine S. M. Currie, Andria Ellina, Malte Esders, Alwin Haensel, Xiao Lei, Kyle D. S. Maclean, Antonio Martinez-Sykora, and et al., ‘Dynamic pricing and learning with competition: insights from the dynamic pricing challenge at the 2017 informs rm & pricing conference’, Journal of Revenue and Pricing Management, 18(3), 185–203, (Oct 2018).
[39] Yevgeniy Vorobeychik and Michael Wellman, ‘Stochastic search methods for nash equilibrium approximation in simulation-based games’, pp. 1055–1062, (01 2008).
[40] Linnan Wang, Rodrigo Fonseca, and Yuandong Tian, ‘Learning search space partition for black-box optimization using monte carlo tree search’, CoRR, abs/2007.00708, (2020).
[41] Xiaofeng Wang and Tuomas Sandholm, ‘Reinforcement learning to play an optimal nash equilibrium in team markov games’, in Advances in Neural Information Processing Systems, eds., S. Becker, S. Thrun, and K. Obermayer, volume 15. MIT Press, (2003).
[42] Christopher J. C. H. Watkins and Peter Dayan, ‘Q-learning’, Machine Learning, 8(3), 279–292, (May 1992).
[43] Kaiqing Zhang, Sham M. Kakade, Tamer Basar, and Lin F. Yang, ‘Model-based multi-agent RL in zero-sum markov games with near-optimal sample complexity’, CoRR, abs/2007.07461, (2020).
[44] Qifan Zhang, Yue Guan, and Panagiotis Tsiotras, ‘Learning nash equilibria in zero-sum stochastic games via entropy-regularized policy approximation’, CoRR, abs/2009.00162, (2020).

Supplemental Material

Appendix A Hardware Configurations and Model Hyperparameters

All experiments were run on a computer with an Intel Core i5-10600K CPU processor at 4.10GHz, 32 GB DDR3 RAM, with an NVIDIA GeForce GTX 1080 8GB GDDR5X graphics card. The hyperparameters used for all experiments are listed in Table 1.

Parameter	Value
No. Hidden Layers $\Gamma(\pi)$	3
No. Hidden Layers $\Psi(s)$	3
No. Hidden Layers $Q(s)$	3
Hidden Layer Size $\Gamma(\pi)$	1500
Hidden Layer Size $\Psi(s)$	1500
Hidden Layer Size $Q(s)$	75
No. Episodes	40
Batch Update Frequency	20
Batch Size	10
RL Exploration Constant	0.05
Max. Steps per Episode	30
Discount Factor ( $\gamma$ )	0.9
Learning Rate ( $\alpha$ )	0.001
MDP Action Dimensions	10
TuRBO-p Max. Evaluations	10
TuRBO-p Batch Size	4

Table 1: Hyperparameters used for deep reinforcement learning.

Appendix B Softmax Win Probability

Proposition: In an N-player game when deviating from the equilibrium market price by an amount of $d$ , given the softmax win probability $\Phi(x_{n})$ in Eq. (5), the probability of winning a customer changes by a factor of $\Phi_{d}$ as defined in Eq. (21).

Proof: Suppose N-1 players set an equal price of $\tilde{x}$ and one player deviates with price $x_{n}$ , where $x_{n}=\tilde{x}-d$ .

The players who do not deviate from the equilibrium price, will have a win probability of $\Phi(\tilde{x})$

\displaystyle\Phi(\tilde{x})

\displaystyle=\frac{e^{1-a\tilde{x}}}{\sum^{N}_{i=1}e^{1-ax_{i}}}

(20)

The player that deviates from the market price will have a win probability of $\Phi(x_{n})$ ,

	$\displaystyle\Phi(x_{n})$	$\displaystyle=\frac{e^{1-ax_{n}}}{\sum^{N}_{i=1}e^{1-ax_{i}}}=\frac{e^{1-a(% \tilde{x}-d)}}{\sum^{N}_{i=1}e^{1-ax_{i}}-e^{\tilde{x}}+e^{1-a(\tilde{x}-d)}}$
		$\displaystyle=\frac{e^{1-a(\tilde{x}-d)}}{N_{\Phi}+e^{1-a(\tilde{x}-d)}},\quad% \text{where}\ N_{\Phi}=\sum^{N}_{i=1}e^{1-ax_{i}}-e^{\tilde{x}}$		(21)

The increase in win probability by deviating from the equilibrium price $\tilde{x}$ by an increment of $d$ is denoted by $\Phi_{d}$ ,

\displaystyle\Phi_{d}

\displaystyle=\frac{\Phi(x_{n})}{\Phi(\tilde{x})}=\frac{e^{1-a(\tilde{x}-d)}}{% N_{\Phi}+e^{\tilde{x}-d}}\frac{N_{\Phi}+e^{\tilde{x}}}{e^{1-a\tilde{x}}}=e^{ad% }\frac{N_{\Phi}+e^{\tilde{x}}}{N_{\Phi}+e^{\tilde{x}-d}}

(22)

As follows from Eq. (22) the probability of winning a customer’s purchase by changing price with deviation $d$ with respect to the equilibrium price effectively changes the win probability by a factor $\Phi_{d}$ .

B.1 Admissible Values of Profit Function

Proposition: In an N-player game when deviating from the market price by an amount of $d\in\mathbb{R}$ , there always exists a boundary $(d^{-},d^{+})$ such that the expected profit from deviating $\mathbf{E}[\Pi_{n}]^{\prime}>0$ exists. We define this as the admissible range.

\displaystyle\frac{\mathbf{E}[\Pi_{n}]^{\prime}}{\mathbf{E}[\Pi_{n}]^{0}}>0,% \quad\forall d\in(d^{-},d^{+})

(23)

Proof: We define the gain function $\Omega(d)$ as,

$\displaystyle\Omega(d)$	$\displaystyle=\frac{\mathbf{E}[\Pi_{n}]^{\prime}}{\mathbf{E}[\Pi_{n}]^{0}}=% \frac{\Phi(x_{n})f(x_{n})x_{n}}{\Phi(\tilde{x})f(\tilde{x})\tilde{x}}=\frac{% \Phi(\tilde{x})\Phi_{d}[f(\tilde{x})+\gamma d](\tilde{x}-d)}{\Phi(\tilde{x})f(% \tilde{x})\tilde{x}}$
	$\displaystyle=\Phi_{d}\frac{[f(\tilde{x})+\gamma d](\tilde{x}-d)}{f(\tilde{x})% \tilde{x}}=\Phi_{d}(1+\frac{\gamma}{f(\tilde{x})}d)(1-\frac{1}{\tilde{x}}d)$
$\displaystyle\Omega(d)$	$\displaystyle=\Phi_{d}(1+\frac{\gamma}{f(\tilde{x})}d)(1-\frac{1}{\tilde{x}}d)$	(24)

Given the deviation condition $\Omega(d)>0$ and Condition (2), the admissible range is more precisely defined as a range for $d$ where a solution for Inequality (25) exists,

\displaystyle\Omega(d)\implies(1+\frac{\gamma}{f(\tilde{x})}d)(1-\frac{1}{% \tilde{x}}d)>0

\displaystyle\implies-\frac{f(\tilde{x})}{\gamma}<d<\tilde{x}

(25)

We see from Inequality (25) that there is a bound on admissible values of $d\in(-f(\tilde{x})/\gamma,\tilde{x})$ which results in admissible $\Omega(d)$ values.

Appendix C Equilibrium Study

The Nash equilibrium of a pricing strategy can be either a pure or mixed strategy. We prove that for a pure strategy, in the support of a mixed strategy, an $\epsilon$ -Nash equilibrium exists. Consequently, multiple Nash equilibria can exist in this pricing game. Suppose a hypothetical equilibrium where a market price $\tilde{x}$ exists. We examine the hypothetical situation if one agent were to deviate from the $\tilde{x}$ of with a price of $x_{n}$ . Particularly, we study the case when a player undercuts or prices above a set market price, where $x_{n}=\tilde{x}-d$ and $d$ is the value which the player deviates from the equilibrium price. From the equilibrium setting, as follows from Eq. (26),

$\displaystyle f(\tilde{x})$	$\displaystyle=\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}(\tilde{x}-\bar{p})$
	$\displaystyle=\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}\tilde{x}+c_{N}$	(26)
$\displaystyle\text{where}\quad c_{N}$	$\displaystyle=-\beta_{2}\bar{p}$

We derive Eq. (27) from Eq. (26). From Eq. (27) we see that the expected demand is simply the demand function at equal pricing $f(\tilde{x})$ corrected by a factor of $\gamma d$ as defined in the Eq. (28), where $d$ is the amount the player $n$ deviates from the equilibrium price.

$\displaystyle f(x_{n})$	$\displaystyle=\beta_{0}+\beta_{1}(x_{n}-d)+\beta_{2}(x_{n}-d-\bar{p})$
	$\displaystyle=\beta_{0}+\beta_{1}x_{n}+\beta_{2}x_{n}+c_{N}-(\beta_{1}+\beta_{% 2})d$
	$\displaystyle=\beta_{0}+\beta_{1}x_{n}+\beta_{2}x_{n}+c_{N}+\gamma d$
$\displaystyle f(x_{n})$	$\displaystyle=f(\tilde{x})+\gamma d$	(27)
$\displaystyle\text{where}\quad x_{n}$	$\displaystyle\in\mathbb{R},\gamma=-(\beta_{1}+\beta_{2}),x>0,d\in\mathbb{R}$	(28)

C.1 Proof of a Best Response Function (Market Undercutting Scenario)

Proposition: In an N-player game, under specific market conditions dictated by the reference price $\bar{p}$ and equilibrium price $\tilde{x}$ , under certain proven conditions, stipulated later in Condition 35, there can exist a boundary $(0,d^{\prime})$ such that the expected profit from deviating is greater than not deviating $\mathbf{E}[\Pi_{n}]^{-}>\mathbf{E}[\Pi_{n}]^{0}$ , when deviating from the market price by undercutting with an amount of $d>0$ , as illustrated by Inequality (29).

\displaystyle\frac{\mathbf{E}[\Pi_{n}]^{\prime}}{\mathbf{E}[\Pi_{n}]^{0}}>1,% \quad\forall d\in(0,d^{\prime})

(29)

Given the deviation constraint $d>0$ , therefore $\Phi_{d}>1$ , refer to Supplemental B, we find the solution for the polynomial section of the gain function, defined by Eq. (30).

	$\displaystyle\Omega(d)_{p}$	$\displaystyle=(1+\frac{\gamma}{f(\tilde{x})}d)(1-\frac{1}{\tilde{x}}d)$		(30)
	$\displaystyle\Omega(d)_{p}$	$\displaystyle>1$		(31)

With the constraint $d>0$ , a solution to Inequality (31) is restricted by Condition (32). And such a solution only exists when $\tilde{x}>f(\tilde{x})/\gamma$ .

	$\displaystyle 0<d$	$\displaystyle<\frac{\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}}}{\frac{% \gamma}{f(\tilde{x})\tilde{x}}}$		(32)
	$\displaystyle d$	$\displaystyle\in(0,\tilde{x}-f(\tilde{x})/\gamma)\quad\text{when}\quad\tilde{x% }>f(\tilde{x})/\gamma$		(33)

Making substitutions from, Eq. (26) into Eq. (33),

$\displaystyle f(\tilde{x})/\gamma$	$\displaystyle<\tilde{x}$
$\displaystyle\frac{\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}\tilde{x}+c_{N}}{-(% \beta_{1}+\beta_{2})}$	$\displaystyle<\tilde{x}$
$\displaystyle\frac{\beta_{0}+c_{N}}{-(\beta_{1}+\beta_{2})}+\tilde{x}\frac{% \beta_{1}+\beta_{2}}{-(\beta_{1}+\beta_{2})}$	$\displaystyle<\tilde{x}$
$\displaystyle\frac{\beta_{0}+c_{N}}{-(\beta_{1}+\beta_{2})}$	$\displaystyle<\tilde{x}\Bigg{(}1+\frac{\beta_{1}+\beta_{2}}{\beta_{1}+\beta_{2% }}\Bigg{)}$
$\displaystyle-\frac{\beta_{0}+c_{N}}{2(\beta_{1}+\beta_{2})}$	$\displaystyle<\tilde{x}$
$\displaystyle\frac{\beta_{2}\bar{p}-\beta_{0}}{2(\beta_{1}+\beta_{2})}$	$\displaystyle<\tilde{x}$	(34)

We see from Inequality (34) that if the equilibrium price of the other agents $\tilde{x}$ is priced below a function of the reference price $\bar{p}$ there will be no value of $d$ that satisfies Inequality (33). Therefore, the agent will not gain profit from undercutting if $\tilde{x}$ , the current market price, is under a certain limit with respect a monotonic function the reference price $\bar{p}$ . The exact conditions on which undercutting the $\tilde{x}$ will yield profit gain for any agent $n$ is outlined in Inequality (35).

\displaystyle\frac{\beta_{2}\bar{p}-\beta_{0}}{2(\beta_{1}+\beta_{2})}<\tilde{% x}\And d

\displaystyle\in(0,\tilde{x}-f(\tilde{x})/\gamma)\implies\mathbf{E}[\Pi_{n}]^{% -}>\mathbf{E}[\Pi_{n}]^{0}

(35)

$\square$

C.2 Proof of the Existence of Multiple $\epsilon$ -Nash Equilibria Single Stage Game

Proposition: Suppose the market is in an equilibrium state where all agents price their items at a fixed price, and one player elects to undercut the market. We demonstrate that there is a theoretical maximum expected reward in this oligopoly for a single stage game for undercutting the market.

\displaystyle\mathbf{E}[\Pi_{n}]^{-}\leq\mathbf{E}[\Pi_{n}]^{0}+\epsilon

(36)

Inequality (36) expresses the conditions of an $\epsilon$ -Nash equilibrium, that is no player can obtain a higher reward than a margin of $\epsilon$ by deviating from the equilibrium price of the market $\mathbf{E}[\Pi_{n}]^{0}$ . In effect we acknowledge that the upper bound of $\Phi_{d}$ in Eq. (22) as,

	$\displaystyle e^{ad}\frac{N_{\Phi}+e^{1-a\tilde{x}}}{N_{\Phi}+e^{1-a\tilde{x}+% d}}$	$\displaystyle\leq e^{N_{\phi}ad}$		(37)
	$\displaystyle\Phi_{d}$	$\displaystyle\leq e^{N_{\phi}ad}$		(38)

We take the partial derivative with respect the deviation amount $d$ to obtain theoretical maximum of $\Omega$ .

	$\displaystyle\frac{\partial\mathbf{E}[\Pi_{n}]^{-}}{\partial d}$	$\displaystyle=\frac{\partial}{\partial d}\Bigg{[}e^{N_{\phi}ad}(1+\frac{\gamma% }{f(\tilde{x})}d)(1-\frac{1}{\tilde{x}}d)\Bigg{]}$
	$\displaystyle\frac{\partial\mathbf{E}[\Pi_{n}]^{-}}{\partial d}$	$\displaystyle=e^{N_{\phi}a}\frac{\partial}{\partial d}\Bigg{[}e^{d}(1+\frac{% \gamma}{f(\tilde{x})}d)(1-\frac{1}{\tilde{x}}d)\Bigg{]}$		(39)

Solve the derivative,

$\displaystyle\frac{\partial\mathbf{E}[\Pi_{n}]^{-}}{\partial d}$	$\displaystyle=0$
$\displaystyle e^{N_{\phi}a}\frac{\partial}{\partial d}\Bigg{[}e^{d}(1+\frac{% \gamma}{f(\tilde{x})}d)(1-\frac{1}{\tilde{x}}d)\Bigg{]}$	$\displaystyle=0$
$\displaystyle\frac{\partial}{\partial d}\Bigg{[}e^{d}+\frac{\gamma}{f(\tilde{x% })}de^{d}-\frac{1}{\tilde{x}}de^{d}-\frac{\gamma}{f(\tilde{x})\tilde{x}}d^{2}e% ^{d}\Bigg{]}$	$\displaystyle=0$
$\displaystyle\frac{\partial}{\partial d}\Bigg{[}e^{d}+\Bigg{(}\frac{\gamma}{f(% \tilde{x})}-\frac{1}{\tilde{x}}\Bigg{)}de^{d}-\frac{\gamma}{f(\tilde{x})\tilde% {x}}d^{2}e^{d}\Bigg{]}$	$\displaystyle=0$
$\displaystyle e^{d}\Bigg{[}\Bigg{(}\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde% {x}}\Bigg{)}(d+1)-\frac{\gamma}{f(\tilde{x})\tilde{x}}d(d+2)+1\Bigg{]}$	$\displaystyle=0$
$\displaystyle\Bigg{(}\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}}\Bigg{)}(d% +1)-\frac{\gamma}{f(\tilde{x})\tilde{x}}d(d+2)+1$	$\displaystyle=0$	(40)

The solution to Eq. (40)

	$\displaystyle d^{*}$	$\displaystyle=\frac{\sqrt{c_{1}^{2}-c_{1}+4(c_{2}-1)c_{2}-2c_{2}}}{2c_{2}}$		(41)
	$\displaystyle\text{where}\quad c_{1}$	$\displaystyle=\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}},\quad c_{2}=% \frac{\gamma}{f(\tilde{x})\tilde{x}}$		(42)

We proved that when a player deviates from the market price by a margin of $d$ , there exists an optimal deviation amount $d^{*}$ , outlined in Eq. (41) such that the expected profit gain $\Omega$ is maximized. $\epsilon$ is therefore,

$\displaystyle\epsilon$	$\displaystyle=\mathbf{E}[\Pi_{n}]^{-}-\mathbf{E}[\Pi]^{0}$
	$\displaystyle=\Omega(d^{*})\mathbf{E}[\Pi_{n}]^{0}-\mathbf{E}[\Pi_{n}]^{0}$
	$\displaystyle=\mathbf{E}[\Pi_{n}]^{0}[\Omega(d^{*})-1)]$	(43)

Multiple solutions can exist where $d^{*}=0$ , to give a perfect Nash equilibria solution, as $c_{1}$ and $c_{2}$ are functions of $\tilde{x}$ . The Nash equilibrium condition is,

\displaystyle c_{1}^{2}+4(c_{2}-1)c_{2}=c_{1}+2c_{2}\quad\xrightarrow[]{}\quad% \epsilon=0

(44)

Therefore, by undercutting the market with any price $d>0$ , a player can theoretically yield no higher expected profits than $\epsilon$ greater than its competitors as defined in Eq. (41) and Eq. (43). Suppose a policy of pure strategy exists, $\pi^{p}_{n}$ , where,

\displaystyle\pi^{p}_{n}=

\displaystyle\begin{cases}1&\text{for \quad}x=\tilde{x}\\ 0&\text{for \quad}else\end{cases}

(45)

Thus, $\pi^{p}_{n}$ constitutes an $\epsilon$ -Nash equilibrium resulting from a pure strategy, with $\epsilon$ denoted in Eq. (43), from which varying the parameters of $c_{1}$ and $c_{2}$ in Eq. (41), multiple $\epsilon$ -Nash or Nash equilibria exist. $\square$

C.3 Existence of Nash equilibrium for Markov Game

Proposition: The existence of an $\epsilon$ -Nash equilibrium in single stage game ensures that there exists an $\delta$ -Nash equilibrium in the Markov Game, as defined in Eq. (13), for the value of a joint policy regardless of the state transition behaviour of the Markov Game.

Proof: From [filar:1997], the value of a policy can be defined,

\displaystyle\mathbf{v}(\pi)=\sum_{t=0}^{\infty}\gamma\mathbf{P}^{t}(\pi)% \mathbf{R}(\pi)

(46)

With a specific policy, the transition matrix of the Markov Game is known, and therefore the value of a policy can be expressed in Eq. (46), with defined transition $\mathbf{P}^{t}(\pi)$ and reward $\mathbf{R}(\pi)$ matrices. Provided the Nash equilibrium condition from Eq. (13) we must demonstrate that,

\displaystyle\mathbf{R}(\pi^{\prime})\leq\mathbf{R}(\pi^{*})+\mathbf{\epsilon}% \xrightarrow[]{}\mathbf{V}(\pi^{\prime})\leq\mathbf{V}(\pi^{*})+\delta

(47)

Where $\pi^{\prime}$ denotes any joint policy, and $\pi^{*}$ denotes a Nash equilibrium policy. We know that the Markov transition probability holds such that,

\displaystyle\sum_{i=1}^{|S|}\mathbf{P}^{t}(\pi)_{si}=1,\quad\forall s

(48)

$\mathbf{P}^{t}(\pi)_{si}$ represents the probability of transition from state $s$ to state $i$ . Eq. (48) simply indicates the sum of transition probabilities from state $s$ to any other state must equal 1 (Markov property). Furthermore, suppose,

	$\displaystyle R_{s}$	$\displaystyle=\max_{1\leq s<\|S\|}\{\mathbf{R}(\pi^{*})_{s}+\epsilon\}$		(49)
	$\displaystyle\delta_{s}$	$\displaystyle=R_{s}-\mathbf{P}(\pi^{})_{s}\times\mathbf{R}(\pi^{})$		(50)

$R_{s}$ represents the maximum reward obtainable from state $s$ under policy $\pi^{*}$ , as illustrated in Eq. (49). Provided the Markov property on the transition matrix $\mathbf{P}^{t}(\pi)$ outlined in Eq. (48), this implies that

\displaystyle\mathbf{P}(\pi^{\prime})_{s}\ \times\ \mathbf{R}(\pi^{\prime})% \leq\mathbf{P}(\pi^{*})_{s}\ \times\ \mathbf{R}(\pi^{*})+\delta_{s}\quad% \forall s\in S

(51)

Where $\delta$ is a maximum fixed value added to the value of the $\epsilon$ -Nash equilibrium policy $\mathbf{V}(\pi^{*})$ effectively bounding the value of any other policy deviating from $\pi^{*}$ . $\square$