License: CC BY 4.0
arXiv:2207.06492v3 [cs.GT] 02 Mar 2024

Approximate Nash Equilibrium Learning for n-Player Markov Games in Dynamic Pricing

Larkin Liu, Technical University of Munich
larkin.liu@tum.de
Abstract

We investigate Nash equilibrium learning in a competitive Markov Game (MG) environment, where multiple agents compete and multiple Nash equilibria can exist. In particular, for an oligopolistic dynamic pricing environment, exact Nash equilibria are difficult to obtain due to the curse of dimensionality. We develop a new model-free method to find approximate Nash equilibria. Gradient-free black box optimization is applied to estimate $\epsilon$, the maximum reward advantage of an agent unilaterally deviating from any joint policy, and to estimate the $\epsilon$-minimizing policy for any given state. The policy-to-$\epsilon$ correspondence and the mapping from state to $\epsilon$-minimizing policy are represented by neural networks, the latter being the Nash Policy Net. During batch update, we perform Nash Q learning on the system by adjusting the action probabilities using the Nash Policy Net. We demonstrate that an approximate Nash equilibrium can be learned, particularly in the dynamic pricing domain where exact solutions are often intractable.


1 Introduction

The application of deep reinforcement learning to single-player Markov Decision Processes is relatively well researched in today's machine learning literature; however, applications of novel deep reinforcement learning methods to multi-agent competitive MDPs are relatively few. In particular, the challenge revolves around competing objectives and solving for Nash equilibria [27], which are often computationally intractable, in a competitive setting where agents cannot merely maximize their respective Q functions. Nash equilibria in Markov Games are particularly useful in the area of dynamic pricing, a well-known problem in today's online eCommerce marketplaces. In a dynamic pricing game, multiple firms compete for market share of their product on the same online platform. To model this market, we adopt a modified version of Bertrand oligopoly [1], where the marginal cost of production is 0 and purchasing behaviour is probabilistic.

1.1 Market Oligopoly

Especially in today's online commerce industries, dynamic pricing provides platform firms with a competitive advantage and increases profit margins in a competitive economy. Ideally, a firm's pricing strategy should consider the actions of other firms, which justifies Nash equilibrium policies over unilateral reward maximization policies. Among previous approaches, [22] modelled the dynamic pricing problem using deep reinforcement learning, but focused primarily on single-agent revenue maximization rather than solving for multi-agent Nash equilibria. [35] apply stochastic dynamic programming in a simulated environment to estimate consumer demand and maximize expected profit, but do not consider the equilibrium solution of the market. [38] model market demand using parametric functions with unknown parameters, yet consider neither a Nash equilibrium nor the effect of past pricing actions on demand. Thus, across almost all recent research on the dynamic pricing problem under uncertainty, there remains a need to compute and promote Nash equilibria for market pricing strategies. Such strategies can be applied to eCommerce platforms such as Amazon auto pricing, allowing firms to opt for an automated pricing algorithm.

A large body of literature on dynamic pricing has assumed an unknown demand function and tried to learn the intrinsic relationship between price and demand. [15] showed that a myopic Bayesian policy may lead to incomplete learning and poor profit performance. [23] studied a joint pricing and inventory problem while learning the price-sensitive demand using a Bayesian dynamic program. Many recent studies have revolved around non-parametric methods. [2] developed a pricing strategy that balances the trade-off between exploration and exploitation. Furthermore, in [22] the demand function was a black box system where the demand was approximated from experience replay and live online testing in a reinforcement learning framework.

Dynamic pricing in oligopolies presents multiple problems. One is the optimization problem of learning a Nash equilibrium policy given the parameters of the market demand function; the other is learning the demand parameters themselves. For example, [7] and [19] studied parametric approaches, using maximum likelihood to estimate unknown parameters of the demand function. Recent research has also addressed demand learning: when multiple products exist, firms can learn to price an assortment of items using multinomial logit regression [11], and when multiple products exist with limited inventory, Multi-Armed Bandit approaches have been shown to be effective at devising profitable policies [12]. However, these approaches do not consider past pricing actions influencing future market demand parameters, nor do they consider competing firms aiming for a Nash equilibrium.

We propose a demand model in which past pricing affects future market demand under market competition. To search for a Nash equilibrium, we apply a multi-agent Markov Decision Process, where the state transition is driven by the joint pricing actions of all players (or firms). To simplify the problem, we assume that each firm has unlimited inventory and that there is only a single product with no substitutes.

1.2 Multi-agent Markov Decision Processes

Multi-agent Markov Decision Processes, or Markov Games (MGs), constitute an important area of research, especially when multiple agents are self-interested and an exact or approximate Nash equilibrium of the system is desired. The computation of Nash equilibria additionally presents great difficulty when searching over the enormous joint state-action space of the problem, and approximations to this search problem do exist [29]. Moreover, the existence of multiple Nash equilibria can further complicate the solution, as some solutions may be more desirable than others.

Yet approximate search algorithms require prior knowledge of the joint reward function and are often limited to two players, modelled by a best response function. We treat our payoff function as an oracle [39], meaning that we know the parametric form of the payoff function, but the agents in the MG have no knowledge of how the reward function generates rewards for the players. The state of the game is visible to the agents, yet the agents have no knowledge of the state transition function. Nevertheless, the market demand parameters and Nash equilibria are known for purposes of experimentation. This eliminates the need for pre-period model fitting to pre-estimate market demand parameters [10], allowing us to compare our solution to the theoretical solution, as opposed to an empirical estimate.

Model-based approaches also exist that provably find a Nash equilibrium [43]. However, we aim for a Nash equilibrium solver that is model-free, where the agents have no knowledge of the reward or transition function. In recent literature, [31] present a model-free algorithm to solve competitive MDPs when the environment parameters can be altered by an external party, but this is not always the case in market economics. Furthermore, [34] propose a model-free decentralized deep learning framework, where each agent is blind to the other agents' actions, and convergence to a Nash equilibrium is proven. [44] propose a convergent solution for zero-sum MGs via entropy regularization. However, both [44] and [34] impose a number of theoretical restrictions on the MG in order for this convergence to occur. Moreover, [20] present a provably convergent MG solver for imperfect information restricted to two agents. In contrast, we are concerned with an approximate solution to a full-information MG for N agents.

Contributions: This work outlines a methodology for Deep Q Learning, as introduced in [26], extending the framework to multi-agent reinforcement learning (MARL) with a Nash equilibrium objective based on the methods in [17] and [41]. Although the framework presented in [17] is theoretically sound, the solution to the Nash equilibrium function is often intractable. We therefore apply approximation methods to compute a Nash equilibrium. Black box optimization is applied to estimate an $\epsilon$-Nash equilibrium policy, and this approximation is learned as a series of neural networks. The MARL model is then applied to the domain of dynamic pricing.

2 Dynamic Pricing Game

An oligopoly across $N$ firms is modelled as a multi-agent Markov Decision Process with a fully observable state-action space. The state space is represented by the fully observable, demand-influencing reference price $\bar{p}_t$, which is the state of the game $s^t$. In discrete intervals at time $t$, each agent issues a price $x^t_n$ for the item to be sold to the general market. The reference price of the market at time $t$ is determined by a stochastic function of all the agents' actions at time $t-1$. To focus the problem strictly on dynamic pricing, we assume for each firm a sufficiently large inventory, no marginal costs of production, no possibility of inventory stockouts, no holding costs, and the capacity to always meet any realization of market demand.

2.1 Markov Game Parameters

The joint action space $\{x_1^t, ..., x_N^t\} \in \mathbf{X}$ constitutes the current actions of all agents at time $t$, which drive the state transition by setting prices at time $t$; $x_n^t$ represents the price set by agent $n$ at time $t$. The state-action-reward space is defined as a tuple $(s^t, x_1^t, ..., x_N^t, r_1^t, ..., r_N^t, s^{t+1})$, where $r_n^t$ is the reward for agent $n$ at time $t$. The joint reward can be written as $\bar{r}_t = (r_1^t, ..., r_N^t)$, and the joint action can be written as $\bar{x}_t = (x_1^t, ..., x_N^t)$. $N$ denotes the number of agents. The exact transition probabilities are not known to the agents; each agent must learn the demand function and optimal strategy as the MG progresses. A state $s^t$ is determined by the reference price of the market observable to all agents, $s \in \mathbb{R}$. We discretize the action space into even segments representing a firm's price, where each segment represents an action $x_n \in X_n$, and $X_n$ is the action space of any agent $n \in N$.
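To make the notation concrete, the sketch below (Python) shows one possible representation of the discretized action space and the state-action-reward tuple; the number of firms, the price range, and all identifiers are illustrative assumptions rather than part of the original specification.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

# Hypothetical discretization: |X_n| evenly spaced price levels per agent.
N_AGENTS = 3                        # N, number of firms (assumption for illustration)
N_PRICE_LEVELS = 10                 # |X_n|, discrete price actions per agent
PRICE_MIN, PRICE_MAX = 1.0, 10.0    # placeholder price range

# Each segment of the grid is one action x_n in X_n.
PRICE_GRID = np.linspace(PRICE_MIN, PRICE_MAX, N_PRICE_LEVELS)

@dataclass
class Transition:
    """One entry of the state-action-reward space (s^t, x_1^t..x_N^t, r_1^t..r_N^t, s^{t+1})."""
    s: float            # reference price \bar{p}_t, the observable state s^t
    x: List[float]      # joint action \bar{x}_t = (x_1^t, ..., x_N^t)
    r: List[float]      # joint reward \bar{r}_t = (r_1^t, ..., r_N^t)
    s_next: float       # next reference price, s^{t+1}
```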

2.2 Reference Pricing

The first pillar of any market dynamic is the demand function, dictated by both the historical and contemporary prices of the firms in the market. Demand processes can ideally be linear and continuous, but are not guaranteed to be stationary or continuous [6], and can be subject to various exogenous factors such as environmental carryover effects [32].

We create an idealization of a market based on [36], which poses a dynamic pricing problem with reference price effects and a linear demand function. The expected current demand of a product is a function of the average price set by all firms, denoted $\tilde{x}$, and a historical reference price, $\bar{p}$. The reference price $\bar{p}$ is given and cannot be modified during the current time $t$; moreover, it is a function of the immediate past, $t-1$. Although [36] is not the only model that incorporates reference pricing [25], [13], [28], [16], we adopt it due to its simplicity and the existence of provable Nash equilibria in an oligopolistic framework.

$$f(\tilde{x}) = \beta_0 + \beta_1 \tilde{x} + \beta_2 (\tilde{x} - \bar{p}) \qquad (1)$$
$$\beta_0 \geq 0, \quad \beta_1 < 0, \quad \beta_2 \leq 0, \quad f(\tilde{x}) \geq 0, \quad \tilde{x} \geq 0 \qquad (2)$$

$\epsilon_D$ is defined as the noise from a Poisson process. Such a Poisson process has the arrival rate $\lambda_t(\tilde{x}) = f(\tilde{x})$ from Eq. (1) and standard deviation $\sigma = \sqrt{\lambda_t(\tilde{x})}$. Furthermore, [36] stipulate decreasing demand with respect to increasing price, as expressed in Condition (2).

This demand-influencing reference price can be affected by a variety of factors, such as inflation, unemployment, and psychological perception [30]. Moreover, in many proposed oligopoly models, as in [18] and [3], the reference price is dictated by a historical joint market price of multiple firms. However, modelling a competitive market oligopoly with an autocorrelated reference price in a MG setting has not been heavily investigated until now. In our model, we focus on designing a market whose reference price is driven by the historical $t-1$ joint market price; additional factors that also affect the reference price are represented as noise $\epsilon_D$. Thus the transition of the reference price is determined by the average of the previous joint prices plus some Gaussian randomness, $\epsilon_D$.

In our experiment, the reference price of a product is determined by the previous actions of the firms. We express the reference price function as $\bar{p}(t): \mathbb{R}^N \rightarrow \mathbb{R}$, mapping a vector of historical pricing actions $\mathbf{x}^{t-1}$ to $\mathbb{R}$. At the beginning of the Markovian process, the reference price is randomly generated within the action space of the pricing range. The stochastic nature of the market price transition entails the Markov property of the MG.

$$\bar{p}(t) = \frac{1}{N}\sum_{n=1}^{N} x^{t-1}_n + \epsilon_D, \quad \text{where} \quad \bar{p}(t=0) = \frac{1}{N}\sum_{n=1}^{N} x^{0}_n + \epsilon_D \qquad (3)$$
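A minimal simulation sketch of Eqs. (1) and (3) follows, using illustrative parameters (matching Market Scenario 1 in Fig. 1), Poisson demand arrivals with rate $f(\tilde{x})$, and a small Gaussian disturbance standing in for $\epsilon_D$ in the reference-price transition; all function names and constants are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative market parameters (Market Scenario 1 in Fig. 1).
beta0, beta1, beta2 = 15.0, -1.05, -3.1

def expected_demand(x_mean, p_ref):
    """Linear demand with reference-price effect, Eq. (1): f(x~) = b0 + b1*x~ + b2*(x~ - p_ref)."""
    return max(0.0, beta0 + beta1 * x_mean + beta2 * (x_mean - p_ref))

def realized_demand(x_mean, p_ref):
    """Demand arrivals drawn from a Poisson process with rate lambda_t = f(x~)."""
    return rng.poisson(expected_demand(x_mean, p_ref))

def next_reference_price(prices, noise_std=0.1):
    """Reference-price transition, Eq. (3): mean of the previous joint prices plus noise.
    The Gaussian term is an assumption standing in for the eps_D disturbance."""
    return float(np.mean(prices)) + rng.normal(0.0, noise_std)

# Example: three firms post prices, demand realizes, and the market state (reference price) updates.
prices_t, p_ref_t = [4.0, 5.0, 6.0], 5.0
print(expected_demand(np.mean(prices_t), p_ref_t), realized_demand(np.mean(prices_t), p_ref_t),
      next_reference_price(prices_t))
```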

2.3 Probabilistic Demand Function

The expected profit for player $n$ is $\mathbf{E}[\Pi_n(x_n)]$, which equals revenue since the marginal cost of production is assumed to be 0, and is defined as

$$\mathbf{E}[\Pi_n(x_n)] = \Phi_n(x_n, x_{-n})\, x_n\, f(\tilde{x}) \qquad (4)$$
$$\Phi_n(x_n, x_{-n}) = \frac{e^{\phi(x_n)}}{\sum_{i=1}^{N} e^{\phi(x_i)}} \qquad (5)$$

$\Phi_n(x_n, x_{-n})$ in Eq. (5) represents the probability that a customer purchases the item sold by firm $n$, where player $n$ prices their merchandise at price $x_n$. $f(\tilde{x})$ is the expected demand during the single-stage game for a fixed time period. Following the quantal response function from [24] and [14], we define a purchase elasticity function, $\phi(x)$,

$$\phi(x) = \begin{cases} b_n - a_n x & \text{for } 0 < x < b_n/a_n \\ 0 & \text{for } x > b_n/a_n \end{cases} \qquad (6)$$
$$\text{where } x \in \mathbb{R},\ x > 0,\ a \geq 0,\ b_n = 1 \qquad (7)$$

In $\phi(x)$, $b_n$ represents the maximum weighted contribution to the probability of an item being purchased by a customer given a price; when the price $x$ is 0, this measure equals $b_n$. For simplification, we assume the linear marginal decline of this measure is equal for all players, $a_n = a$; that is, the market has the same elasticity regarding the marginal decrease of the customer's willingness to purchase from any firm as price increases, with a negative slope $a$. The probability of a customer choosing to purchase from firm $n$ at price $x_n$, among prices set by other firms $x_{-n}$, is defined by combining a softmax function with the purchase elasticity function $\phi(x)$. This mechanism prevents a solution where each firm undercuts the others to set prices to 0, as the lowest price does not guarantee that a consumer will buy their product.
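The purchase mechanism of Eqs. (4)-(6) can be sketched as follows; the elasticity slope $a$, the intercept $b_n = 1$, the expected-demand argument, and the helper names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

a, b = 0.1, 1.0   # slope a_n = a and intercept b_n = 1, as in Eq. (7)

def phi(x):
    """Purchase elasticity, Eq. (6): linear in price, clipped to zero beyond b/a."""
    return b - a * x if 0 < x < b / a else 0.0

def purchase_probs(prices):
    """Softmax over phi(x_i), Eq. (5): probability a customer buys from each firm."""
    e = np.exp(np.array([phi(x) for x in prices]))
    return e / e.sum()

def expected_profit(prices, expected_demand):
    """Eq. (4): E[Pi_n] = Phi_n * x_n * f(x~), with zero marginal cost."""
    return purchase_probs(prices) * np.array(prices) * expected_demand

# Example: identical prices give each firm an equal share of expected revenue.
print(expected_profit([5.0, 5.0, 5.0], expected_demand=8.0))
```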

3 Nash Equilibrium Conditions

Computation of exact Nash, or approximate $\epsilon$-Nash, equilibria is an NP-hard problem and notoriously difficult [5]. It involves a search over the entire joint state-action space, computing the value of each state under a candidate policy, and requires knowledge of the joint Q function and the transition probabilities of the system. In our scenario, all agents are identical, therefore the solution of one agent can apply to the solution of another.

3.1 Theoretical $\epsilon$-Nash Equilibria

Under the market outlined in Section 2, given the market inputs $\bar{p}$ and $\tilde{x}$, for any pricing strategy there exists an optimal deviation $d^* \in \mathbb{R}$ such that $\mathbf{E}[\Pi_n(x_n - d^*)]$ yields the maximum profit advantage $\epsilon$.

$$\epsilon = \max_{d^* \in \mathbb{R}} \Big( \mathbf{E}[\Pi_n(x_n - d^*)] - \mathbf{E}[\Pi_n(x_n)] \Big) \qquad (8)$$

The increase in the individual profit of an agent $n$ from unilaterally deviating is denoted $\epsilon$. Given the optimal deviation $d^*$, a maximum theoretical upper bound on the profit advantage $\epsilon$ can be obtained. Eq. (9) provides the theoretical value of $d^*$, which is a function of both the reference price $\bar{p}$ and the market price $\tilde{x}$, as well as the fixed market parameters $\beta_0, \beta_1, \beta_2$. A derivation of $d^*$ can be found in Appendix C.

$$d^* = \frac{\sqrt{c_1^2 - c_1 + 4(c_2 - 1)c_2 - 2c_2}}{2c_2} \qquad (9)$$
$$\text{where} \quad c_1 = \frac{-(\beta_1+\beta_2)}{f(\tilde{x})} - \frac{1}{\tilde{x}}, \qquad c_2 = \frac{-(\beta_1+\beta_2)}{f(\tilde{x})\,\tilde{x}}$$
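As a sanity check on Eq. (8), the deviation advantage $\epsilon$ can also be approximated numerically by scanning a grid of candidate deviations $d$ rather than evaluating the closed form of Eq. (9). The sketch below assumes a caller-supplied `profit_fn` that returns the vector of expected profits for a joint price vector (for instance, one built from the demand and purchase-probability sketches above), and assumes that a deviating price also shifts the average market price.

```python
import numpy as np

def epsilon_by_grid(profit_fn, prices, n, deviations=np.linspace(-5.0, 5.0, 201)):
    """Numerical approximation of Eq. (8): the maximum profit gain firm n obtains by
    unilaterally shifting its price by d, scanning d over a grid instead of using the
    closed form of Eq. (9). `profit_fn(prices)` must return E[Pi_1 .. Pi_N]."""
    base = profit_fn(prices)[n]
    best = -np.inf
    for d in deviations:
        deviated = list(prices)
        deviated[n] = prices[n] - d
        best = max(best, profit_fn(deviated)[n] - base)
    return best

# Example usage (assuming the earlier demand/profit sketches are in scope):
# eps = epsilon_by_grid(
#     lambda p: expected_profit(p, expected_demand(np.mean(p), p_ref=5.0)),
#     prices=[5.0, 5.0, 5.0], n=0)
```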
Figure 1: Surface plot of the potential advantage from deviation ($\epsilon$-deviation) from a Nash equilibrium with respect to market price $\tilde{x}$ and reference price $\bar{p}$, with their respective market parameters $\beta_0, \beta_1, \beta_2, a$, for the arrival of a single sales event. Market Scenario 1: $\beta_0 = 15, \beta_1 = -1.05, \beta_2 = -3.1, a = 0.1$. Market Scenario 2: $\beta_0 = 25, \beta_1 = -0.6, \beta_2 = -6.1, a = 0.1$.

Using Eq. (9), as visualized in Fig. 1, multiple Nash equilibria can exist when $\epsilon = 0$, or when $\epsilon$ is sufficiently small (see Section 5). Such behaviour occurs when one agent unilaterally deviates from $\tilde{x}$ under reference price $\bar{p}$. Thus, the lower plateau in Fig. 1 represents the region where a theoretical Nash equilibrium exists. Bounds on the state space and action space are limited to $|S| = |X_n| = 10$; that is, there are 10 discrete intervals for each state $s^t$ and action $x_n^t$, representing price values.

3.2 Nash Equilibrium in Markov Games ($\delta$)

The value of a policy at each state can be represented by a joint action which maximizes the joint Q function [42] under policy $\pi(s)$. The probability of agent $n$ taking action $x_n$ is defined as $\pi_n(s, x_n)$.

$$v(s, \pi_n, \pi_{-n}) = \max_{\bar{x} \in \mathbf{X}} \prod_{n=1}^{N} \pi_n(s, x_n)\, Q_n(s, x_n, x_{-n}) \qquad (10)$$

An $\epsilon$-Nash equilibrium is defined as a joint policy such that, when any agent unilaterally deviates from said joint policy, the reward of a single-stage game will not yield the agent $n$ a payoff greater by more than $\epsilon$. Given $\epsilon$, we consider a bound $\delta$ on the corresponding MG, which bounds the gain in the value of a policy, $v(s, \pi_n^*, \pi_{-n}^*)$, should agent $n$ unilaterally deviate with an alternative policy. The solution to Eq. (10) can be computed by searching over the joint action space $\{x_n, x_{-n}\}$.
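For small, discretized games this search can be carried out exhaustively. The sketch below assumes tabular stand-ins for the Q function and the joint policy (hypothetical containers, not the networks introduced later) and evaluates Eq. (10) for one agent by enumerating the joint action space.

```python
import itertools
import numpy as np

def policy_value(Q, pi, s, agent, n_agents, n_actions):
    """Eq. (10): value of the joint policy for `agent` at state s, maximized over joint actions.
    Q[s][joint] is a hypothetical table returning a per-agent vector of Q values, and
    pi[n][s] is agent n's action distribution at state s."""
    best = -np.inf
    for joint in itertools.product(range(n_actions), repeat=n_agents):
        # Scale the Q value by the joint probability of this action profile under pi.
        scale = np.prod([pi[n][s][joint[n]] for n in range(n_agents)])
        best = max(best, scale * Q[s][joint][agent])
    return best
```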

$$\underset{\bar{x}\in\mathbf{X}}{\mathrm{argmax}} \prod_{n=1}^{N} \pi_n(s, x_n)\, Q_n(s, x_n, x_{-n}^*) \qquad (11)$$
$$\leq\ \underset{\bar{x}\in\mathbf{X}}{\mathrm{argmax}} \prod_{n=1}^{N} \pi_n(s, x_n)\, Q_n(s, x_n^*, x_{-n}^*) + \delta \qquad (12)$$
$$v(s, \pi_n, \pi_{-n}^*) \leq v(s, \pi_n^*, \pi_{-n}^*) + \delta, \quad \forall s \in \mathbf{S} \qquad (13)$$

$\delta$ serves as an upper bound on $\epsilon$; therefore, minimizing $\delta$ also minimizes any possible $\epsilon$ for each single-stage game in the MG. In fact, the existence of $\epsilon$ implies the existence of the upper bound $\delta$; we provide a proof of this in Appendix C.3.

4 Multi-Agent Nash Q Learning

In our model, the game provides full information: the state and the actions of all agents are visible to every agent. This allows experience replay for Q learning [26] to occur, and the Q function can be shared among all agents as they are assumed to be identical. The joint action $\bar{x} = (x_1, ..., x_N)$ is defined as the combined actions of all agents at time $t$. Normally, in Q learning, the update mechanism searches for the action that maximizes the Q function; in Nash Q learning, however, we must update the Q function with the solution to the Nash equilibrium. In our model, we represent the tabular Q function via a function approximator, denoted simply as $Q(s, \bar{x})$. As new experiences are obtained while the MG is played, the Nash Q value is updated. We utilize the Q update mechanism defined in [17] as the update mechanism for the Nash Q estimator. Given a Q function and a Nash policy, we search for the joint action that maximizes the scaled Q function [21], as in Eq. (15). In our representation, the Q function is a vector, returning the respective Q value for each agent, based on the joint probability input for the joint policy.

$$Q'(s, \bar{x}) \leftarrow (1-\alpha)\, Q(s, \bar{x}) + \alpha \left[ r + \gamma\, \mathcal{N}(s') \right] \qquad (14)$$
$$\bar{x}^* = \underset{\bar{x}}{\mathrm{argmax}}\ Q(s', \bar{x}) \prod_{n=1}^{N} \pi_n^*(s', x_n) \qquad (15)$$
$$\mathcal{N}(s') = Q(s', \bar{x}^*) \prod_{n=1}^{N} \pi_n^*(s', x_n^*) \qquad (16)$$

$\mathcal{N}(s')$ represents a Nash operator, indicating the Q value of a Nash equilibrium; this is approximated by a scaling factor, according to the Nash equilibrium policy $\pi^*$, multiplied by the Q value of the joint action, $Q(s, \bar{x})$. Extending the work of [17], we introduce a neural network for mapping any state-action combination to its Nash policy, from which the Nash scaling factor derives and is used to compute $\mathcal{N}(s')$ in conjunction with the Q function. The Q function used in Eqs. (16) and (14) is represented by a deep neural network, simply referred to as the Nash Q Net, $Q(s)$. In a full-information game with homogeneous agents, a shared neural network representing $Q(s, \bar{x})$ can be used for all agents in the Markov Game (if the agents are not homogeneous, a separate Q network must be retained and updated for each agent). The Q network parameters representing $\mathcal{N}(s')$ are learned in the same way as in Deep Q learning [26]; the key innovation is that the scaling factor used to compute $\mathcal{N}(s')$ is obtained via a Nash Policy Net, $\Psi(s)$ (defined later in Section 4.2).
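A tabular sketch of the Nash Q update of Eqs. (14)-(16) is given below; the discount factor, the learning rate, and the dictionary representations of $Q$ and the Nash policy are assumptions standing in for the Nash Q Net and Nash Policy Net, and a single shared scalar Q value per joint action is tracked since the agents are identical.

```python
import itertools
import numpy as np

GAMMA, ALPHA = 0.95, 0.1   # discount and learning rate (illustrative values)

def nash_operator(Q, nash_pi, s_next, n_agents, n_actions):
    """Eqs. (15)-(16): find the joint action maximizing the policy-scaled Q value at s'
    and return that scaled value N(s'), used as the bootstrap target."""
    best_val = -np.inf
    for joint in itertools.product(range(n_actions), repeat=n_agents):
        scale = np.prod([nash_pi[n][s_next][joint[n]] for n in range(n_agents)])
        best_val = max(best_val, scale * Q[s_next][joint])
    return best_val

def nash_q_update(Q, nash_pi, s, joint_action, reward, s_next, n_agents, n_actions):
    """Eq. (14): soft update of Q(s, x_bar) toward r + gamma * N(s')."""
    target = reward + GAMMA * nash_operator(Q, nash_pi, s_next, n_agents, n_actions)
    Q[s][joint_action] = (1 - ALPHA) * Q[s][joint_action] + ALPHA * target
```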

4.1 Estimating Value Advantage via Black Box Optimization

We apply a deep neural network to represent a mapping from a joint policy to its respective $\delta$ from Eq. (13), designated $\Gamma(\pi) \mapsto \delta$, where $\delta$ is a vector containing the $\delta_s$ of each state. As the gradient of $\Gamma(\pi)$ cannot be easily evaluated, we apply gradient-free black box optimization for $\delta$ minimization. Trust Region Optimization (TRO) has been shown to be effective for solving high-dimensional optimization problems [40] [9] [8] [33], particularly when the computational resources are available. To compute the existence of an $\epsilon$-Nash equilibrium in the high-dimensional policy space efficiently, we apply model-based Bayesian Optimization via Trust Region optimization (TuRBO) from [9]. TuRBO combines standard TRO with a multi-armed bandit system via Thompson Sampling [37] to optimize multiple local trust regions simultaneously, given sufficient convergence time. However, the candidate generation step in TuRBO is not constrained to produce valid joint probabilities for each agent, where the probabilities of each agent over its action space must sum to 1. We alter TuRBO by simply normalizing the original candidate probabilities over each set of probabilities belonging to an agent $n$. The resulting modified algorithm is denoted TuRBO-p, where any candidate value generated by TuRBO-p has joint probabilities that sum to 1 for the policy of each agent.

$$V(s, \pi) = \max_{\bar{x}}\ Q(s, \bar{x}) \prod_{n=1}^{N} \pi_n(s, x_n) \qquad (17)$$
$$\delta = \max_{\pi_n'}\ \Big( V(s, \pi_n', \pi_{-n}) - V(s, \pi_n, \pi_{-n}) \Big) \quad \forall s \in \mathbf{S} \qquad (18)$$

The maximization problem is formulated in Eq. (18), representing the maximum gain in the value of a policy when agent $n$ deviates from policy $\pi_n$ to $\pi_n'$.
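The per-agent normalization that turns unconstrained TuRBO candidates into valid joint policies (the TuRBO-p modification described above) can be sketched as follows; the candidate layout, one block of $|X_n|$ probabilities per agent, is an assumption made for illustration.

```python
import numpy as np

def normalize_candidates(candidates, n_agents, n_actions):
    """TuRBO-p candidate projection: rescale each agent's block of action probabilities
    so that it sums to 1, yielding a valid joint policy for every candidate.
    `candidates` has shape (n_candidates, n_agents * n_actions), as produced by an
    unconstrained candidate-generation step (simplified sketch)."""
    cand = np.asarray(candidates, dtype=float).reshape(-1, n_agents, n_actions)
    cand = np.clip(cand, 1e-12, None)               # keep probabilities strictly positive
    cand = cand / cand.sum(axis=2, keepdims=True)   # per-agent normalization to the simplex
    return cand.reshape(len(cand), n_agents * n_actions)

# Example: two random candidates for 3 agents with 10 price levels each.
print(normalize_candidates(np.random.rand(2, 30), n_agents=3, n_actions=10).shape)
```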

4.2 Nash Policy Learning

The $\epsilon$-Nash policy is defined as a joint policy $(\pi_n^*, \pi_{-n}^*)$ that minimizes $\delta$ from Eq. (13). Drawing inspiration from [4], where a Nash equilibrium is found by effectively minimizing $\epsilon$ via linear programming, we apply a similar technique of $\delta$-minimization. However, in [4], the model parameters are known to the agents and the Markov game is constrained to two players. In our MG, the joint reward function must be learned. Therefore, we perform approximate search over the policy space instead of using exact methods to discover any possible approximate Nash equilibrium policies. In principle, each state maps to a corresponding joint policy in accordance with the Nash policy which minimizes $\delta$ in Eq. (18). However, a table keeping track of such an enormous state-policy space is not feasible. Therefore, we implement a Nash Policy Net $\Psi(s) \rightarrow \pi^*$ to learn the state-to-policy mapping, i.e., the joint policy $\pi$ producing an approximate Nash equilibrium as approximated via TuRBO-p, designated $\hat{\pi}^*(s)$.

$$\hat{\pi}^*(s) = \underset{\pi}{\mathrm{argmin}}\ \Gamma(\pi)_s \qquad (19)$$
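A minimal PyTorch sketch of the two function approximators, the Nash Policy Net $\Psi(s) \rightarrow \pi^*$ and the policy-to-$\delta$ network $\Gamma(\pi) \mapsto \delta$, is given below; the layer sizes, input encodings, and class names are assumptions, as the exact architectures are not specified in this section.

```python
import torch
import torch.nn as nn

class NashPolicyNet(nn.Module):
    """Psi(s) -> pi*: maps a scalar state (reference price) to a joint policy,
    with one softmax per agent so each agent's action probabilities sum to 1."""
    def __init__(self, n_agents, n_actions, hidden=64):
        super().__init__()
        self.n_agents, self.n_actions = n_agents, n_actions
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents * n_actions),
        )

    def forward(self, s):                     # s: tensor of shape (batch, 1)
        logits = self.net(s).view(-1, self.n_agents, self.n_actions)
        return torch.softmax(logits, dim=-1)  # per-agent action probabilities

class DeltaNet(nn.Module):
    """Gamma(pi) -> delta: maps a flattened joint policy to the estimated delta_s of each state."""
    def __init__(self, n_agents, n_actions, n_states, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_states),
        )

    def forward(self, pi_flat):               # pi_flat: tensor of shape (batch, n_agents * n_actions)
        return self.net(pi_flat)
```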
Algorithm 1 Nash equilibrium learning
1: Initialize joint policy π.
2: Initialize random parameters for Ψ(s_t), Q(s_t), and Γ(π), as θ_Q, θ_Ψ, and θ_Γ.
3: Initialize MDP environment ℳ.
4: for e ∈ episodes do:
5:     Get initial state s_0 from ℳ.
6:     for t ∈ (0, t_max) do:  ▷ Iterate until end of episode.
7:         Get action probabilities π(s_t, x_n) from Ψ(s_t).
8:         for n ∈ (1, N) do:  ▷ Iterate through agents.
9:             Sample x_n ∼ Multinomial(s_t, π(x̄)).
10:        end for
11:        Obtain joint action x̄ ← {x_1, …, x_N}.
12:        Assign δ_s according to Eq. (18) via TuRBO-p.
13:        Find π* ← argmin_π Γ(π, s_t) via TuRBO-p.
14:        Execute joint action x̄ in (ℳ, s_t) to obtain s_{t+1}, r̄_t.
15:        Append (s_t, x̄_t, r̄_t, ε_s, π, s_{t+1}) to experience replay 𝒟.
16:        Update state s_t ← s_{t+1}.
17:        if |𝒟| > batch size then:
18:            Sample minibatch d from 𝒟.
19:            Set Q′(s_t, x̄) ← r_t if s_t is terminal; otherwise use Eq. (14).
20:            ℒ_Q ← (Q′(s_t, x̄) − Q(s_t, x̄))²  ▷ Nash Q update.
21:            ℒ_Ψ ← (π̂ − π*)², where π̂ is from Ψ(s).
22:            ℒ_Γ ← (ε̂ − ε)², where ε̂ is from Γ(π).  ▷ Nash ε update.
23:            Backpropagate ℒ_Q, ℒ_Ψ, and ℒ_Γ on Q, Ψ, and Γ.
24:            Update θ_Q, θ_Ψ, and θ_Γ via gradient descent.
25:        end if
26:        Update joint policy π̄ ← π̄*.
27:    end for
28: end for
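The minibatch update of lines 18-24 can be summarized by the following PyTorch-style sketch. The tensor shapes, the flattened joint-action indexing, and the exact form of the bootstrapped Nash-Q target (standing in for Eq. (14)) are assumptions made for illustration.

```python
import torch

def nash_q_batch_update(q_net, psi_net, gamma_net, batch, optimizers, discount=0.9):
    """One minibatch step: Nash Q update (L_Q), policy regression (L_Psi),
    and epsilon regression (L_Gamma). Assumed shapes: q_net(s), psi_net(s) -> (B, A)
    over joint actions; gamma_net(pi) -> (B,) predicted epsilon."""
    s, x, r, eps_target, pi_target, s_next, done = batch  # x: (B, 1) joint-action index

    with torch.no_grad():
        # Nash value of the next state: expectation of Q under the Nash policy of s'.
        nash_value = (q_net(s_next) * psi_net(s_next)).sum(dim=-1)
        q_target = r + discount * nash_value * (1.0 - done)

    q_pred = q_net(s).gather(-1, x).squeeze(-1)
    loss_q = ((q_target - q_pred) ** 2).mean()                       # Nash Q update
    loss_psi = ((psi_net(s) - pi_target) ** 2).mean()                # Nash Policy Net loss
    loss_gamma = ((gamma_net(pi_target) - eps_target) ** 2).mean()   # policy-to-epsilon loss

    for loss, opt in zip((loss_q, loss_psi, loss_gamma), optimizers):
        opt.zero_grad()
        loss.backward()
        opt.step()
```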

5 Results

Stabilization of realized market rewards to Nash equilibrium: Fig. 2 shows the convergence of the average market reward of a single agent (randomly selected as agent 0) towards a Nash equilibrium. We superimpose the boundaries of the theoretical Nash equilibrium reward over the plot; these boundaries delimit the region where ε < ε_l (ε_l = 0.0001). The topology of this function is illustrated in Fig. 1: for each state, or reference price, there exists a boundary within which any agent may deviate in policy without a significant unilateral reward gain. The expected reward is computed using the market model parameters of each Scenario (β_0, β_1, β_2, a) and compared against the theoretical equilibrium reward Π′ = f(x′_n) x′_n, where x′_n is the deviated price; this comparison yields the two boundary points enclosing the region where ε < ε_l. For each episode we record the average reward per timestep over the episode length. In Fig. 2, the average reward per agent of the system per episode (blue dashed line) and the average reward of a single agent per episode (orange dashed line) converge within the Nash equilibrium boundary (blue shade). However, due to the randomization of the black box optimization and/or the stochastic behaviour of the Markov Game, the realized reward can occasionally fall outside the NE region.

We demonstrate empirically in Appendix D that the loss functions of Q(s) and Ψ(s) decrease with each RL episode. This indicates that a representation of the mapping from state to its corresponding Nash policy, based on the policy-to-ε model Γ(π), is being learned, and that a stable solution is proposed. Hyperparameters are presented in Appendix A.

[Two panels: market rewards per episode, Scenario 1 (n = 3); market rewards per episode, Scenario 2 (n = 3).]
Figure 2: In a Nash equilibrium, both the market average reward (blue), and single agent reward (orange) should fall within the Nash equilibria boundary (blue shade), as the MG progresses.

Comparison with baseline cooperative Q-learning model: We compare our NE learning model with a naive cooperative Q-learning model, which serves as a baseline measure of model performance. In the naive Q-learning model the objective is simply to maximize each agent's respective reward, avoiding any NE computation: the Q update in Eq. (14) maximizes the joint Q function instead of using 𝒩(s′). From Fig. 3, we generally observe lower average reward values from the baseline model than from the NE learning model, while the NE reward ranges remain very similar. From an economic perspective, the Nash Q-learning agents take into account that other agents will compete against them, and the solver optimizes accordingly; in cooperative Q-learning no consideration is given to the other agents' strategies, so agents may over-compete, driving average rewards lower in an oligopoly.
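The single line that distinguishes the two models of Fig. 3 is the bootstrap target. The following hedged sketch shows both variants side by side, under the same assumed shapes as the update sketch above; passing `psi_net=None` recovers the naive baseline, while passing the Nash Policy Net gives the Nash target.

```python
import torch

def q_bootstrap_target(r, s_next, done, q_net, psi_net=None, discount=0.9):
    """Naive target maximizes the agent's own Q over joint actions; the Nash target
    takes the expectation of Q under the learned Nash policy of the next state."""
    with torch.no_grad():
        q_next = q_net(s_next)                                    # (B, A)
        if psi_net is None:
            next_value = q_next.max(dim=-1).values                # baseline Q learning
        else:
            next_value = (q_next * psi_net(s_next)).sum(dim=-1)   # Nash Q learning
        return r + discount * next_value * (1.0 - done)
```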

[Two panels: Scenario 1 — NE learning vs. Q learning (n = 3); Scenario 2 — NE learning vs. Q learning (n = 5).]
Figure 3: Average revenue per episode from a NE learning model (blue) versus a baseline Q learning model (green), with NE reward boundaries overlaid in their respective colours.

6 Conclusion

We created a Markov Game that represents an n-firm oligopoly, based on previously established market models for which theoretical ε bounds for a Nash equilibrium policy exist. A black box algorithm is applied to estimate an upper bound on the ε value of a joint policy, represented by the neural network Γ(π). Similarly, a Nash Policy Net Ψ(s) is learned to represent the ε-minimizing policy obtained from black box optimization, constituting an ε-Nash policy. Ψ(s) can thus be used together with traditional deep Q learning to solve for a Nash equilibrium. Empirically, we show that the average reward of all agents, and the reward of a single agent, converge to an approximate Nash equilibrium. The limitations of this research are the limited action space of the agents, as well as the identical nature of the agents. Larger scale experimentation under this framework is suggested. Furthermore, the proposed market model could be enhanced by implementing more complex non-linear market oligopoly environments.

References

  • [1] Joseph Bertrand, Review of Walras’s Théorie mathématique de la richesse sociale and Cournot’s Recherches sur les principes mathématiques de la théorie des richesses, 73–81, Cambridge University Press, 1889.
  • [2] Omar Besbes and Assaf Zeevi, ‘Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms’, Operations Research, 57(6), 1407–1420, (2009).
  • [3] Richard A Briesch, Lakshman Krishnamurthi, Tridib Mazumdar, and Sevilimedu P Raj, ‘A comparative analysis of reference price models’, Journal of Consumer Research, 24(2), 202–214, (1997).
  • [4] Sofia Ceppi, Nicola Gatti, Giorgio Patrini, and Marco Rocco, ‘Local search methods for finding a nash equilibrium in two-player games’, pp. 335–342, (08 2010).
  • [5] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou, ‘The complexity of computing a nash equilibrium’, in Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’06, p. 71–78, New York, NY, USA, (2006). Association for Computing Machinery.
  • [6] Arnoud V. Den Boer and N. Bora Keskin, ‘Discontinuous demand functions: Estimation and pricing’, Management Science, 66, (04 2020).
  • [7] Arnoud V den Boer and Bert Zwart, ‘Simultaneously learning and optimizing using controlled variance pricing’, Management Science, 60(3), 770–783, (2014).
  • [8] Youssef Diouane, Victor Picheny, Rodolphe Le Riche, and Alexandre Scotto Di Perrotolo. Trego: a trust-region framework for efficient global optimization, 2021.
  • [9] David Eriksson, Michael Pearce, Jacob R. Gardner, Ryan Turner, and Matthias Poloczek, ‘Scalable global optimization via local bayesian optimization’, CoRR, abs/1910.01739, (2019).
  • [10] Kris Johnson Ferreira, Bin Hong Alex Lee, and David Simchi-Levi, ‘Analytics for an online retailer: Demand forecasting and price optimization’, Manufacturing & Service Operations Management, 18(1), 69–88, (2016).
  • [11] Kris Johnson Ferreira and Emily Mower, ‘Demand learning and pricing for varying assortments’, Manufacturing & Service Operations Management, (2022).
  • [12] Kris Johnson Ferreira, David Simchi-Levi, and He Wang, ‘Online network revenue management using thompson sampling’, Operations Research, 66(6), 1586–1602, (2018).
  • [13] Gadi Fibich, Arieh Gavious, and Oded Lowengart, ‘Explicit solutions of optimization models and differential games with nonsmooth (asymmetric) reference-price effects’, Operations Research, 51(5), 721–734, (2003).
  • [14] Jacob K. Goeree, Charles A. Holt, and Thomas R. Palfrey, Stochastic game theory for social science: a primer on quantal response equilibrium, Edward Elgar Publishing, Cheltenham, UK, 2020.
  • [15] J Michael Harrison, N Bora Keskin, and Assaf Zeevi, ‘Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution’, Management Science, 58(3), 570–586, (2012).
  • [16] Paul Heidhues and Botond Kőszegi, ‘Regular prices and sales’, Theoretical Economics, 9(1), 217–251, (2014).
  • [17] Junling Hu and Michael P. Wellman, ‘Nash q-learning for general-sum stochastic games’, Journal of Machine Learning Research, 4, 1039–1069, (December 2003).
  • [18] Chris Janiszewski and Donald R Lichtenstein, ‘A range theory account of price perception’, Journal of Consumer Research, 25(4), 353–368, (1999).
  • [19] N Bora Keskin and Assaf Zeevi, ‘Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies’, Operations Research, 62(5), 1142–1167, (2014).
  • [20] Tadashi Kozuno, Pierre Ménard, Rémi Munos, and Michal Valko. Model-free learning for two-player zero-sum partially observable markov games with perfect recall, 2021.
  • [21] Julien Laumônier and Brahim Chaib-draa, ‘Multiagent q-learning: Preliminary study on dominance between the nash and stackelberg equilibriums’, in Proceedings of AAAI-2005 Workshop on Multiagent Learning, Pittsburgh, USA, (2005).
  • [22] Jiaxi Liu, Yidong Zhang, Xiaoqing Wang, Yuming Deng, and Xingyu Wu, ‘Dynamic pricing on e-commerce platform with deep reinforcement learning’, CoRR, abs/1912.02572, (2019).
  • [23] Jue Liu, Zhan Pang, and Linggang Qi, ‘Dynamic pricing and inventory management with demand learning: A bayesian approach’, Computers & Operations Research, 124, 105078, (2020).
  • [24] R. Duncan Luce, Individual Choice Behavior: A Theoretical Analysis, Wiley, New York, NY, USA, 1959.
  • [25] Tridib Mazumdar, Sevilimedu P Raj, and Indrajit Sinha, ‘Reference price research: Review and propositions’, Journal of Marketing, 69(4), 84–102, (2005).
  • [26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, ‘Human-level control through deep reinforcement learning’, Nature, 518(7540), 529–533, (February 2015).
  • [27] John F. Nash, ‘Equilibrium points in n-person games’, Proceedings of the National Academy of Sciences, 36(1), 48–49, (1950).
  • [28] Ioana Popescu and Yaozhong Wu, ‘Dynamic pricing strategies with reference effects’, Operations Research, 55(3), 413–429, (2007).
  • [29] Ryan Porter, Eugene Nudelman, and Yoav Shoham, ‘Simple search methods for finding a nash equilibrium’, Games and Economic Behavior, 63(2), 642–662, (2008). Second World Congress of the Game Theory Society.
  • [30] Kalyan Raman, Frank M Bass, et al., ‘A general test of reference price theory in the presence of threshold effects’, Tijdschrift voor Economie en management, 47(2), 205–226, (2002).
  • [31] Giorgia Ramponi, Alberto Maria Metelli, Alessandro Concetti, and Marcello Restelli, ‘Learning in non-cooperative configurable markov decision processes’, in Advances in Neural Information Processing Systems, eds., A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, (2021).
  • [32] Vithala R Rao, ‘Pricing models in marketing’, Handbooks in Operations Research and Management Science, 5, 517–552, (1993).
  • [33] Rommel G. Regis, ‘Trust regions in kriging-based optimization with expected improvement’, Engineering Optimization, 48(6), 1037–1059, (2016).
  • [34] Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie, Tamer Basar, and Asuman E. Ozdaglar, ‘Decentralized q-learning in zero-sum markov games’, CoRR, abs/2106.02748, (2021).
  • [35] Rainer Schlosser and Martin Boissier, ‘Dynamic pricing under competition on online marketplaces: A data-driven approach’, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, p. 705–714, New York, NY, USA, (2018). Association for Computing Machinery.
  • [36] Alfred Taudes and Christian Rudloff, ‘Integrating inventory control and a price change in the presence of reference price effects: a two-period model’, Mathematical Methods of Operations Research, 75(1), 29–65, (2012).
  • [37] William R Thompson, ‘On the likelihood that one unknown probability exceeds another in view of evidence of two samples’, Biometrika, 25(3-4), 285–294, (12 1933).
  • [38] Ruben van de Geer, Arnoud V. den Boer, Christopher Bayliss, Christine S. M. Currie, Andria Ellina, Malte Esders, Alwin Haensel, Xiao Lei, Kyle D. S. Maclean, Antonio Martinez-Sykora, and et al., ‘Dynamic pricing and learning with competition: insights from the dynamic pricing challenge at the 2017 informs rm & pricing conference’, Journal of Revenue and Pricing Management, 18(3), 185–203, (Oct 2018).
  • [39] Yevgeniy Vorobeychik and Michael Wellman, ‘Stochastic search methods for nash equilibrium approximation in simulation-based games’, pp. 1055–1062, (01 2008).
  • [40] Linnan Wang, Rodrigo Fonseca, and Yuandong Tian, ‘Learning search space partition for black-box optimization using monte carlo tree search’, CoRR, abs/2007.00708, (2020).
  • [41] Xiaofeng Wang and Tuomas Sandholm, ‘Reinforcement learning to play an optimal nash equilibrium in team markov games’, in Advances in Neural Information Processing Systems, eds., S. Becker, S. Thrun, and K. Obermayer, volume 15. MIT Press, (2003).
  • [42] Christopher J. C. H. Watkins and Peter Dayan, ‘Q-learning’, Machine Learning, 8(3), 279–292, (May 1992).
  • [43] Kaiqing Zhang, Sham M. Kakade, Tamer Basar, and Lin F. Yang, ‘Model-based multi-agent RL in zero-sum markov games with near-optimal sample complexity’, CoRR, abs/2007.07461, (2020).
  • [44] Qifan Zhang, Yue Guan, and Panagiotis Tsiotras, ‘Learning nash equilibria in zero-sum stochastic games via entropy-regularized policy approximation’, CoRR, abs/2009.00162, (2020).

Supplemental Material

Appendix A Hardware Configurations and Model Hyperparameters

All experiments were run on a computer with an Intel Core i5-10600K CPU processor at 4.10GHz, 32 GB DDR3 RAM, with an NVIDIA GeForce GTX 1080 8GB GDDR5X graphics card. The hyperparameters used for all experiments are listed in Table 1.

Parameter | Value
No. Hidden Layers Γ(π) | 3
No. Hidden Layers Ψ(s) | 3
No. Hidden Layers Q(s) | 3
Hidden Layer Size Γ(π) | 1500
Hidden Layer Size Ψ(s) | 1500
Hidden Layer Size Q(s) | 75
No. Episodes | 40
Batch Update Frequency | 20
Batch Size | 10
RL Exploration Constant | 0.05
Max. Steps per Episode | 30
Discount Factor (γ) | 0.9
Learning Rate (α) | 0.001
MDP Action Dimensions | 10
TuRBO-p Max. Evaluations | 10
TuRBO-p Batch Size | 4
Table 1: Hyperparameters used for deep reinforcement learning.

Appendix B Softmax Win Probability

Proposition: In an N-player game, when deviating from the equilibrium market price by an amount d, given the softmax win probability Φ(x_n) in Eq. (5), the probability of winning a customer changes by a factor of Φ_d as defined in Eq. (22).

Proof: Suppose N−1 players set an equal price x̃ and one player deviates with price x_n, where x_n = x̃ − d.

The players who do not deviate from the equilibrium price will have a win probability of Φ(x̃),

$$\Phi(\tilde{x})=\frac{e^{1-a\tilde{x}}}{\sum_{i=1}^{N}e^{1-ax_{i}}}\tag{20}$$

The player that deviates from the market price will have a win probability of Φ(x_n),

$$\begin{aligned}
\Phi(x_{n})&=\frac{e^{1-ax_{n}}}{\sum_{i=1}^{N}e^{1-ax_{i}}}=\frac{e^{1-a(\tilde{x}-d)}}{\sum_{i=1}^{N}e^{1-ax_{i}}-e^{1-a\tilde{x}}+e^{1-a(\tilde{x}-d)}}\\
&=\frac{e^{1-a(\tilde{x}-d)}}{N_{\Phi}+e^{1-a(\tilde{x}-d)}},\quad\text{where}\ N_{\Phi}=\sum_{i=1}^{N}e^{1-ax_{i}}-e^{1-a\tilde{x}}
\end{aligned}\tag{21}$$

The increase in win probability from deviating from the equilibrium price x̃ by an increment d is denoted by Φ_d,

$$\Phi_{d}=\frac{\Phi(x_{n})}{\Phi(\tilde{x})}=\frac{e^{1-a(\tilde{x}-d)}}{N_{\Phi}+e^{1-a(\tilde{x}-d)}}\cdot\frac{N_{\Phi}+e^{1-a\tilde{x}}}{e^{1-a\tilde{x}}}=e^{ad}\,\frac{N_{\Phi}+e^{1-a\tilde{x}}}{N_{\Phi}+e^{1-a(\tilde{x}-d)}}\tag{22}$$

As follows from Eq. (22), deviating from the equilibrium price by an amount d changes the probability of winning a customer's purchase by the factor Φ_d.
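The factor Φ_d can also be checked numerically straight from the softmax definition, without the algebra above. The sketch below compares the deviating player's win probability with the all-equal-price probability 1/N; the parameter values in the comment are purely illustrative.

```python
import numpy as np

def win_prob_gain(x_tilde, d, a, n_players):
    """Phi_d from Eq. (22): factor by which player 0's win probability changes when
    it prices at x_tilde - d while the other N-1 players stay at x_tilde."""
    prices = np.full(n_players, x_tilde)
    prices[0] = x_tilde - d
    weights = np.exp(1.0 - a * prices)           # softmax numerators from Eq. (5)
    phi_deviating = weights[0] / weights.sum()   # Phi(x_n) after the deviation
    phi_equilibrium = 1.0 / n_players            # Phi(x_tilde) when all prices are equal
    return phi_deviating / phi_equilibrium

# Example: win_prob_gain(x_tilde=5.0, d=0.5, a=0.8, n_players=3) > 1 for an undercut (d > 0).
```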

B.1 Admissible Values of Profit Function

Proposition: In an N-player game, when deviating from the market price by an amount d ∈ ℝ, there always exists a boundary (d⁻, d⁺) within which the expected profit from deviating satisfies E[Π_n]′ > 0. We define this as the admissible range.

$$\frac{\mathbf{E}[\Pi_{n}]^{\prime}}{\mathbf{E}[\Pi_{n}]^{0}}>0,\quad\forall d\in(d^{-},d^{+})\tag{23}$$

Proof: We define the gain function Ω(d) as,

$$\begin{aligned}
\Omega(d)&=\frac{\mathbf{E}[\Pi_{n}]^{\prime}}{\mathbf{E}[\Pi_{n}]^{0}}=\frac{\Phi(x_{n})f(x_{n})x_{n}}{\Phi(\tilde{x})f(\tilde{x})\tilde{x}}=\frac{\Phi(\tilde{x})\Phi_{d}\,[f(\tilde{x})+\gamma d](\tilde{x}-d)}{\Phi(\tilde{x})f(\tilde{x})\tilde{x}}\\
&=\Phi_{d}\,\frac{[f(\tilde{x})+\gamma d](\tilde{x}-d)}{f(\tilde{x})\tilde{x}}=\Phi_{d}\Big(1+\frac{\gamma}{f(\tilde{x})}d\Big)\Big(1-\frac{1}{\tilde{x}}d\Big)
\end{aligned}\tag{24}$$

Given the deviation condition Ω(d) > 0 and Condition (2), the admissible range is more precisely defined as the range of d for which a solution to Inequality (25) exists,

$$\Omega(d)>0\implies\Big(1+\frac{\gamma}{f(\tilde{x})}d\Big)\Big(1-\frac{1}{\tilde{x}}d\Big)>0\implies-\frac{f(\tilde{x})}{\gamma}<d<\tilde{x}\tag{25}$$

We see from Inequality (25) that the admissible values are bounded by d ∈ (−f(x̃)/γ, x̃), which yields admissible values of Ω(d).
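The admissible range of Inequality (25) is easy to probe numerically. The helper below evaluates the polynomial part of the gain function; it is a sketch with f(x̃), γ, and x̃ passed in directly rather than derived from the market model, and the example values are illustrative.

```python
def gain_polynomial(d, f_x_tilde, gamma, x_tilde):
    """Polynomial part of the gain function, Eq. (24)/(30). It is positive exactly
    on the admissible range (-f(x_tilde)/gamma, x_tilde) of Inequality (25)."""
    return (1.0 + (gamma / f_x_tilde) * d) * (1.0 - d / x_tilde)

# e.g. gain_polynomial(d=0.0, f_x_tilde=2.0, gamma=1.5, x_tilde=5.0) == 1.0, and the
# value falls to zero at the boundary points d = -f(x_tilde)/gamma and d = x_tilde.
```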

Appendix C Equilibrium Study

The Nash equilibrium of a pricing strategy can be either a pure or mixed strategy. We prove that for a pure strategy, in the support of a mixed strategy, an ε-Nash equilibrium exists. Consequently, multiple Nash equilibria can exist in this pricing game. Suppose a hypothetical equilibrium in which a market price x̃ exists. We examine the hypothetical situation in which one agent deviates from x̃ with a price x_n. In particular, we study the case where a player undercuts or prices above the set market price, with x_n = x̃ − d, where d is the amount by which the player deviates from the equilibrium price. From the equilibrium setting, as follows from Eq. (26),

$$\begin{aligned}
f(\tilde{x})&=\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}(\tilde{x}-\bar{p})\\
&=\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}\tilde{x}+c_{N},\quad\text{where}\ c_{N}=-\beta_{2}\bar{p}
\end{aligned}\tag{26}$$

We derive Eq. (27) from Eq. (26). From Eq. (27) we see that the expected demand is simply the demand function at equal pricing, f(x̃), corrected by a factor γd as defined in Eq. (28), where d is the amount by which player n deviates from the equilibrium price.

$$\begin{aligned}
f(x_{n})&=\beta_{0}+\beta_{1}(\tilde{x}-d)+\beta_{2}(\tilde{x}-d-\bar{p})\\
&=\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}\tilde{x}+c_{N}-(\beta_{1}+\beta_{2})d\\
&=\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}\tilde{x}+c_{N}+\gamma d\\
f(x_{n})&=f(\tilde{x})+\gamma d
\end{aligned}\tag{27}$$
$$\text{where}\quad x_{n}\in\mathbb{R},\quad\gamma=-(\beta_{1}+\beta_{2}),\quad x>0,\quad d\in\mathbb{R}\tag{28}$$
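Eq. (27) can be verified directly from the linear demand model of Eq. (26). The sketch below encodes the demand function and checks the γd shift; the parameter values are purely illustrative assumptions.

```python
def demand(x, p_bar, beta0, beta1, beta2):
    """Reference-price demand from Eq. (26): f(x) = b0 + b1*x + b2*(x - p_bar)."""
    return beta0 + beta1 * x + beta2 * (x - p_bar)

# Sanity check of Eq. (27): undercutting the equilibrium price by d shifts demand by
# gamma*d, with gamma = -(beta1 + beta2).
beta0, beta1, beta2, p_bar, x_tilde, d = 10.0, -0.8, -0.4, 6.0, 5.0, 0.5
gamma = -(beta1 + beta2)
assert abs(demand(x_tilde - d, p_bar, beta0, beta1, beta2)
           - (demand(x_tilde, p_bar, beta0, beta1, beta2) + gamma * d)) < 1e-9
```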

C.1 Proof of a Best Response Function (Market Undercutting Scenario)

Proposition: In an N-player game, under market conditions dictated by the reference price p̄ and the equilibrium price x̃ (stipulated later in Condition (35)), there can exist a boundary (0, d′) such that the expected profit from deviating is greater than the profit from not deviating, E[Π_n]⁻ > E[Π_n]⁰, when deviating from the market price by undercutting with an amount d > 0, as illustrated by Inequality (29).

$$\frac{\mathbf{E}[\Pi_{n}]^{\prime}}{\mathbf{E}[\Pi_{n}]^{0}}>1,\quad\forall d\in(0,d^{\prime})\tag{29}$$

Given the deviation constraint d > 0, and therefore Φ_d > 1 (refer to Supplemental B), we find the solution for the polynomial part of the gain function, defined by Eq. (30).

$$\Omega(d)_{p}=\Big(1+\frac{\gamma}{f(\tilde{x})}d\Big)\Big(1-\frac{1}{\tilde{x}}d\Big)\tag{30}$$
$$\Omega(d)_{p}>1\tag{31}$$

With the constraint d > 0, a solution to Inequality (31) is restricted by Condition (32), and such a solution only exists when x̃ > f(x̃)/γ.

$$0<d<\frac{\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}}}{\frac{\gamma}{f(\tilde{x})\tilde{x}}}\tag{32}$$
$$d\in\big(0,\ \tilde{x}-f(\tilde{x})/\gamma\big)\quad\text{when}\quad\tilde{x}>f(\tilde{x})/\gamma\tag{33}$$

Making substitutions from Eq. (26) into Eq. (33),

$$\begin{aligned}
f(\tilde{x})/\gamma&<\tilde{x}\\
\frac{\beta_{0}+\beta_{1}\tilde{x}+\beta_{2}\tilde{x}+c_{N}}{-(\beta_{1}+\beta_{2})}&<\tilde{x}\\
\frac{\beta_{0}+c_{N}}{-(\beta_{1}+\beta_{2})}+\tilde{x}\,\frac{\beta_{1}+\beta_{2}}{-(\beta_{1}+\beta_{2})}&<\tilde{x}\\
\frac{\beta_{0}+c_{N}}{-(\beta_{1}+\beta_{2})}&<\tilde{x}\Big(1+\frac{\beta_{1}+\beta_{2}}{\beta_{1}+\beta_{2}}\Big)\\
-\frac{\beta_{0}+c_{N}}{2(\beta_{1}+\beta_{2})}&<\tilde{x}\\
\frac{\beta_{2}\bar{p}-\beta_{0}}{2(\beta_{1}+\beta_{2})}&<\tilde{x}
\end{aligned}\tag{34}$$

We see from Inequality (34) that if the equilibrium price of the other agents x̃ is below a function of the reference price p̄, there is no value of d that satisfies Inequality (33). Therefore, an agent will not gain profit from undercutting if the current market price x̃ is below a certain limit given by a monotonic function of the reference price p̄. The exact conditions under which undercutting x̃ yields a profit gain for any agent n are outlined in Inequality (35).

$$\frac{\beta_{2}\bar{p}-\beta_{0}}{2(\beta_{1}+\beta_{2})}<\tilde{x}\ \ \&\ \ d\in\big(0,\ \tilde{x}-f(\tilde{x})/\gamma\big)\implies\mathbf{E}[\Pi_{n}]^{-}>\mathbf{E}[\Pi_{n}]^{0}\tag{35}$$

\square
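Condition (35) reduces to the direct check x̃ > f(x̃)/γ. The helper below, a sketch with the market parameters passed in explicitly and assuming downward-sloping demand so that γ = −(β_1+β_2) > 0, returns the interval of profitable undercuts when it exists.

```python
def undercut_profitable_range(x_tilde, p_bar, beta0, beta1, beta2):
    """Return the open interval (0, d') of undercuts that raise expected profit
    under Condition (35), or None when no profitable undercut exists."""
    gamma = -(beta1 + beta2)
    f_tilde = beta0 + beta1 * x_tilde + beta2 * (x_tilde - p_bar)
    if x_tilde <= f_tilde / gamma:
        return None                       # market price too low: undercutting cannot pay
    return (0.0, x_tilde - f_tilde / gamma)
```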

C.2 Proof of the Existence of Multiple ε-Nash Equilibria in a Single Stage Game

Proposition: Suppose the market is in an equilibrium state where all agents price their items at a fixed price, and one player elects to undercut the market. We demonstrate that, for a single stage game in this oligopoly, there is a theoretical maximum expected reward from undercutting the market.

$$\mathbf{E}[\Pi_{n}]^{-}\leq\mathbf{E}[\Pi_{n}]^{0}+\epsilon\tag{36}$$

Inequality (36) expresses the condition of an ε-Nash equilibrium: no player can obtain a reward more than a margin ε above the equilibrium reward E[Π_n]⁰ by deviating from the equilibrium price of the market. In effect, we acknowledge the upper bound of Φ_d from Eq. (22) as,

$$e^{ad}\,\frac{N_{\Phi}+e^{1-a\tilde{x}}}{N_{\Phi}+e^{1-a(\tilde{x}-d)}}\leq e^{N_{\phi}ad}\tag{37}$$
$$\Phi_{d}\leq e^{N_{\phi}ad}\tag{38}$$

We take the partial derivative with respect to the deviation amount d to obtain the theoretical maximum of Ω.

$$\begin{aligned}
\frac{\partial\mathbf{E}[\Pi_{n}]^{-}}{\partial d}&=\frac{\partial}{\partial d}\Big[e^{N_{\phi}ad}\Big(1+\frac{\gamma}{f(\tilde{x})}d\Big)\Big(1-\frac{1}{\tilde{x}}d\Big)\Big]\\
\frac{\partial\mathbf{E}[\Pi_{n}]^{-}}{\partial d}&=e^{N_{\phi}a}\frac{\partial}{\partial d}\Big[e^{d}\Big(1+\frac{\gamma}{f(\tilde{x})}d\Big)\Big(1-\frac{1}{\tilde{x}}d\Big)\Big]
\end{aligned}\tag{39}$$

Setting the derivative to zero and solving,

$$\begin{aligned}
\frac{\partial\mathbf{E}[\Pi_{n}]^{-}}{\partial d}&=0\\
e^{N_{\phi}a}\frac{\partial}{\partial d}\Big[e^{d}\Big(1+\frac{\gamma}{f(\tilde{x})}d\Big)\Big(1-\frac{1}{\tilde{x}}d\Big)\Big]&=0\\
\frac{\partial}{\partial d}\Big[e^{d}+\frac{\gamma}{f(\tilde{x})}de^{d}-\frac{1}{\tilde{x}}de^{d}-\frac{\gamma}{f(\tilde{x})\tilde{x}}d^{2}e^{d}\Big]&=0\\
\frac{\partial}{\partial d}\Big[e^{d}+\Big(\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}}\Big)de^{d}-\frac{\gamma}{f(\tilde{x})\tilde{x}}d^{2}e^{d}\Big]&=0\\
e^{d}\Big[\Big(\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}}\Big)(d+1)-\frac{\gamma}{f(\tilde{x})\tilde{x}}d(d+2)+1\Big]&=0\\
\Big(\frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}}\Big)(d+1)-\frac{\gamma}{f(\tilde{x})\tilde{x}}d(d+2)+1&=0
\end{aligned}\tag{40}$$

The solution to Eq. (40) is then

$$
d^{*} = \frac{\sqrt{c_{1}^{2}-c_{1}+4(c_{2}-1)c_{2}-2c_{2}}}{2c_{2}}
\tag{41}
$$
$$
\text{where}\quad c_{1} = \frac{\gamma}{f(\tilde{x})}-\frac{1}{\tilde{x}}, \qquad c_{2} = \frac{\gamma}{f(\tilde{x})\tilde{x}}
\tag{42}
$$
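To make the derivation concrete, the first-order condition in Eq. (40) can also be solved numerically. The following minimal sketch uses illustrative placeholder values for $\gamma$, $f(\tilde{x})$, and $\tilde{x}$ (not taken from the paper's experiments, with $\tilde{x}$ treated as a normalized market price), forms $c_1$ and $c_2$ as in Eq. (42), and keeps the root that maximizes the deviation objective.

```python
import numpy as np

# Illustrative placeholder values (not the paper's experimental settings);
# x_tilde is treated as a normalized market price.
gamma, f_x, x_tilde = 0.2, 2.0, 1.0

# Constants from Eq. (42).
c1 = gamma / f_x - 1.0 / x_tilde
c2 = gamma / (f_x * x_tilde)

# Eq. (40) expands to the quadratic  -c2*d^2 + (c1 - 2*c2)*d + (c1 + 1) = 0.
roots = np.roots([-c2, c1 - 2.0 * c2, c1 + 1.0])

# Deviation objective from the derivation, up to the constant factor
# e^{N_phi * a}, which does not affect the location of the maximum.
omega = lambda d: np.exp(d) * (1.0 + gamma / f_x * d) * (1.0 - d / x_tilde)

d_star = max(roots[np.isreal(roots)].real, key=omega)
print(f"c1={c1:.3f}, c2={c2:.3f}, d*={d_star:.4f}, Omega(d*)={omega(d_star):.4f}")
```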

We have shown that when a player deviates from the market price by a margin of $d$, there exists an optimal deviation amount $d^{*}$, given in Eq. (41), that maximizes the expected profit gain $\Omega$. The resulting $\epsilon$ is therefore,

$$
\begin{aligned}
\epsilon &= \mathbf{E}[\Pi_{n}]^{-} - \mathbf{E}[\Pi_{n}]^{0} \\
&= \Omega(d^{*})\,\mathbf{E}[\Pi_{n}]^{0} - \mathbf{E}[\Pi_{n}]^{0} \\
&= \mathbf{E}[\Pi_{n}]^{0}\big[\Omega(d^{*})-1\big]
\end{aligned}
\tag{43}
$$

Multiple solutions with $d^{*}=0$ can exist, each yielding an exact Nash equilibrium, since $c_{1}$ and $c_{2}$ are functions of $\tilde{x}$. The Nash equilibrium condition is,

$$
c_{1}^{2}+4(c_{2}-1)c_{2} = c_{1}+2c_{2} \quad\Longrightarrow\quad \epsilon = 0
\tag{44}
$$
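Continuing the sketch above, Eq. (43) and Eq. (44) can be combined into a small helper that reports $\epsilon$ for a given pair $(c_1, c_2)$: when the condition of Eq. (44) holds (within numerical tolerance) the optimal deviation collapses to $d^{*}=0$ and $\epsilon = 0$; otherwise $\epsilon$ follows from the baseline expected profit. The baseline value $\mathbf{E}[\Pi_n]^0 = 10.0$ below is a hypothetical placeholder.

```python
def epsilon_gain(c1, c2, omega, d_star, baseline_profit, tol=1e-9):
    """epsilon per Eq. (43); zero when the Nash condition of Eq. (44) holds."""
    if abs(c1 ** 2 + 4.0 * (c2 - 1.0) * c2 - (c1 + 2.0 * c2)) < tol:
        return 0.0                                    # Eq. (44): exact Nash equilibrium
    return baseline_profit * (omega(d_star) - 1.0)    # Eq. (43)

# Hypothetical baseline expected profit E[Pi_n]^0 (placeholder value).
print(epsilon_gain(c1, c2, omega, d_star, baseline_profit=10.0))
```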

Therefore, by undercutting the market by any margin $d>0$, a player can theoretically gain at most $\epsilon$ in expected profit over its competitors, as defined by Eq. (41) and Eq. (43). Suppose a pure-strategy policy $\pi^{p}_{n}$ exists, where,

$$
\pi^{p}_{n} =
\begin{cases}
1 & \text{for } x=\tilde{x} \\
0 & \text{otherwise}
\end{cases}
\tag{45}
$$

Thus, $\pi^{p}_{n}$ constitutes an $\epsilon$-Nash equilibrium resulting from a pure strategy, with $\epsilon$ given in Eq. (43); by varying the parameters $c_{1}$ and $c_{2}$ in Eq. (41), multiple $\epsilon$-Nash or exact Nash equilibria exist. $\square$

C.3 Existence of Nash Equilibrium for the Markov Game

Proposition: The existence of an $\epsilon$-Nash equilibrium in the single-stage game ensures that there exists a $\delta$-Nash equilibrium in the Markov Game, as defined in Eq. (13), for the value of a joint policy, regardless of the state-transition behaviour of the Markov Game.

Proof: From [filar:1997], the value of a policy can be defined as,

$$
\mathbf{v}(\pi) = \sum_{t=0}^{\infty}\gamma^{t}\,\mathbf{P}^{t}(\pi)\,\mathbf{R}(\pi)
\tag{46}
$$
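As a quick sanity check, Eq. (46) can be evaluated directly once a joint policy fixes the transition matrix $\mathbf{P}(\pi)$ and the reward vector $\mathbf{R}(\pi)$. The following minimal sketch uses an arbitrary two-state example (the matrices are illustrative, not taken from the pricing environment) and compares the truncated series with its geometric-series closed form.

```python
import numpy as np

gamma = 0.9                      # discount factor
P = np.array([[0.7, 0.3],        # P(pi): state-transition matrix under joint policy pi
              [0.4, 0.6]])
R = np.array([1.0, 2.0])         # R(pi): expected one-step reward in each state under pi

# Truncated evaluation of Eq. (46): v(pi) = sum_t gamma^t P^t(pi) R(pi).
v_series = sum((gamma ** t) * np.linalg.matrix_power(P, t) @ R for t in range(500))

# Equivalent closed form of the geometric series: v = (I - gamma*P)^{-1} R.
v_closed = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v_series, v_closed)        # the two evaluations agree up to truncation error
```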

Under a specific joint policy, the transition matrix of the Markov Game is known, and therefore the value of the policy can be expressed as in Eq. (46), with defined transition $\mathbf{P}^{t}(\pi)$ and reward $\mathbf{R}(\pi)$ matrices. Given the Nash equilibrium condition from Eq. (13), we must demonstrate that,

$$
\mathbf{R}(\pi^{\prime}) \leq \mathbf{R}(\pi^{*}) + \epsilon \quad\Longrightarrow\quad \mathbf{V}(\pi^{\prime}) \leq \mathbf{V}(\pi^{*}) + \delta
\tag{47}
$$

where $\pi^{\prime}$ denotes any joint policy and $\pi^{*}$ denotes a Nash equilibrium policy. The Markov transition probabilities satisfy,

$$
\sum_{i=1}^{|S|}\mathbf{P}^{t}(\pi)_{si} = 1, \quad \forall s
\tag{48}
$$

$\mathbf{P}^{t}(\pi)_{si}$ denotes the probability of transitioning from state $s$ to state $i$. Eq. (48) states that the transition probabilities from state $s$ to all other states must sum to 1 (Markov property). Furthermore, suppose,

$$
R_{s} = \max_{1\leq s<|S|}\{\mathbf{R}(\pi^{*})_{s}+\epsilon\}
\tag{49}
$$
$$
\delta_{s} = R_{s} - \mathbf{P}(\pi^{*})_{s}\times\mathbf{R}(\pi^{*})
\tag{50}
$$

$R_{s}$ represents the maximum reward obtainable from state $s$ under policy $\pi^{*}$, as given in Eq. (49). Given the Markov property of the transition matrix $\mathbf{P}^{t}(\pi)$ in Eq. (48), it follows that,

$$
\mathbf{P}(\pi^{\prime})_{s}\times\mathbf{R}(\pi^{\prime}) \leq \mathbf{P}(\pi^{*})_{s}\times\mathbf{R}(\pi^{*}) + \delta_{s} \quad \forall s\in S
\tag{51}
$$

where $\delta$ is the maximum such fixed value added to the value of the $\epsilon$-Nash equilibrium policy $\mathbf{V}(\pi^{*})$, effectively bounding the value of any joint policy deviating from $\pi^{*}$. $\square$
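The bounding argument of Eqs. (48)–(51) can likewise be verified numerically: whenever the one-step rewards of a deviating joint policy $\pi^{\prime}$ exceed those of $\pi^{*}$ by at most $\epsilon$ in every state, the expected one-step return under $\pi^{\prime}$ is bounded by that of $\pi^{*}$ plus $\delta_{s}$. A minimal sketch with arbitrary illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1

R_star = np.array([1.0, 2.0, 1.5])            # R(pi*): per-state reward under pi*
R_prime = R_star + eps * rng.random(3)        # R(pi'): deviating rewards, within eps of R(pi*)

def row_stochastic(n):
    """Random row-stochastic matrix, satisfying Eq. (48)."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

P_star, P_prime = row_stochastic(3), row_stochastic(3)

R_s = R_star.max() + eps                      # Eq. (49)
delta_s = R_s - P_star @ R_star               # Eq. (50), one value per state s
assert np.all(P_prime @ R_prime <= P_star @ R_star + delta_s + 1e-12)   # Eq. (51)
```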

Appendix D Plots and Figures

[Figure 4 panels: Market Scenario 1 (β0 = 25, β1 = −0.6, β2 = −6.1, a = 0.1); Market Scenario 2 (β0 = 15, β1 = −1.05, β2 = −3.1, a = 0.1); Market Scenario 3 (β0 = 27, β1 = −1.1, β2 = −1.0, a = 0.1); Market Scenario 4 (β0 = 27, β1 = −3.05, β2 = −1.1, a = 0.2).]
Figure 4: Surface plot of the maximum theoretical advantage from deviation ($\epsilon$-deviation) from the Nash equilibrium with respect to the market price $\tilde{x}$ and the reference price $\bar{p}$, under the respective market parameters $\beta_0, \beta_1, \beta_2, a$, for the arrival of a single sales event. The lower plateau area, shown in dark blue, corresponds to states where the combination of the joint-action market price $\tilde{x}$ and the reference price $\bar{p}$ yields no significant gain when a player chooses to undercut the market. For each reference price, there is a range of actions that result in an $\epsilon$-Nash equilibrium with $\epsilon < 0.0001$.
[Figure 5 panels: market rewards per episode for Scenario 1 (n = 3), Scenario 2 (n = 3), Scenario 3 (n = 3), Scenario 4 (n = 3), Scenario 2 (n = 5), and Scenario 3 (n = 5).]
Figure 5: Average market rewards per episode overlaid on the theoretical Nash equilibrium bounds (blue shading). The market average is shown in blue and the average reward of a single agent in orange. In a Nash equilibrium, both the market average reward and the single-agent reward should fall within the Nash equilibrium boundary as the MG progresses.
[Figure 6 panels: loss behaviour for Scenario 1 (n = 3), Scenario 2 (n = 3), Scenario 3 (n = 3), Scenario 4 (n = 3), Scenario 2 (n = 5), and Scenario 4 (n = 5).]
Figure 6: Loss behaviour of the batch update during Deep Q Learning (left y-axis) and of the Nash Net update $\Psi(s)$ (right y-axis).