
Deep Reinforcement Learning with Double Q-Learning

Hado van Hasselt, Arthur Guez, and David Silver
Google DeepMind

Abstract

The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

The goal of reinforcement learning (Sutton and Barto 1998) is to learn good policies for sequential decision problems, by optimizing a cumulative future reward signal. Q-learning (Watkins 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.
In previous work, overestimations have been attributed to insufficiently flexible function approximation (Thrun and Schwartz 1993) and noise (van Hasselt 2010, 2011). In this paper, we unify these views and show overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error. Of course, imprecise value estimates are the norm during learning, which indicates that overestimations may be much more common than previously appreciated.
It is an open question whether, if the overestimations do occur, this negatively affects performance in practice. Overoptimistic value estimates are not necessarily a problem in and of themselves. If all values were uniformly higher, the relative action preferences would be preserved and we would not expect the resulting policy to be any worse. Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique (Kaelbling et al. 1996). If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. Thrun and Schwartz (1993) give specific examples in which this leads to suboptimal policies, even asymptotically.
To test whether overestimations occur in practice and at scale, we investigate the performance of the recent DQN algorithm (Mnih et al. 2015). DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games. In some ways, this setting is a best-case scenario for Q-learning, because the deep neural network provides flexible function approximation with the potential for a low asymptotic approximation error, and the determinism of the environments prevents the harmful effects of noise. Perhaps surprisingly, we show that even in this comparatively favorable setting DQN sometimes substantially overestimates the values of the actions.
We show that the Double Q-learning algorithm (van Hasselt 2010), which was first proposed in a tabular setting, can be generalized to arbitrary function approximation, including deep neural networks. We use this to construct a new algorithm called Double DQN. This algorithm not only yields more accurate value estimates, but leads to much higher scores on several games. This demonstrates that the overestimations of DQN indeed lead to poorer policies and that it is beneficial to reduce them. In addition, by improving upon DQN we obtain state-of-the-art results on the Atari domain.

Background

To solve sequential decision problems we can learn estimates for the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. Under a given policy $\pi$, the true value of an action $a$ in a state $s$ is
$$Q_{\pi}(s, a) \equiv \mathbb{E}\left[R_{1}+\gamma R_{2}+\ldots \mid S_{0}=s, A_{0}=a, \pi\right],$$
where $\gamma \in [0,1]$ is a discount factor that trades off the importance of immediate and later rewards. The optimal value is then $Q_{*}(s, a)=\max_{\pi} Q_{\pi}(s, a)$. An optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state.
Estimates for the optimal action values can be learned using Q-learning (Watkins 1989), a form of temporal difference learning (Sutton 1988). Most interesting problems are too large to learn all action values in all states separately. Instead, we can learn a parameterized value function $Q(s, a; \boldsymbol{\theta}_{t})$. The standard Q-learning update for the parameters after taking action $A_{t}$ in state $S_{t}$ and observing the immediate reward $R_{t+1}$ and resulting state $S_{t+1}$ is then
$$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\alpha\left(Y_{t}^{\mathrm{Q}}-Q\left(S_{t}, A_{t} ; \boldsymbol{\theta}_{t}\right)\right) \nabla_{\boldsymbol{\theta}_{t}} Q\left(S_{t}, A_{t} ; \boldsymbol{\theta}_{t}\right), \qquad (1)$$
where $\alpha$ is a scalar step size and the target $Y_{t}^{\mathrm{Q}}$ is defined as
$$Y_{t}^{\mathrm{Q}} \equiv R_{t+1}+\gamma \max_{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right). \qquad (2)$$
This update resembles stochastic gradient descent, updating the current value $Q\left(S_{t}, A_{t} ; \boldsymbol{\theta}_{t}\right)$ towards a target value $Y_{t}^{\mathrm{Q}}$.
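For illustration only, the sketch below implements this semi-gradient update for a simple linear approximator $Q(s, a; \boldsymbol{\theta}) = \boldsymbol{\theta}_a^\top \phi(s)$; the feature map `phi` and the array shapes are assumptions made for the example, not details from the paper.

```python
import numpy as np

def q_learning_update(theta, phi, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step for a linear approximator
    Q(s, a; theta) = theta[a] @ phi(s) (theta has one weight row per action)."""
    features = phi(s)
    # Target Y^Q of Eq. (2): bootstrap with the max over next-state action values.
    target = r + gamma * np.max(theta @ phi(s_next))
    td_error = target - theta[a] @ features
    # For a linear model, the gradient of Q(s, a; theta) w.r.t. theta[a] is phi(s).
    theta[a] += alpha * td_error * features
    return theta
```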

Deep Q Networks

A deep Q network (DQN) is a multi-layered neural network that for a given state $s$ outputs a vector of action values $Q(s, \cdot\,; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are the parameters of the network. For an $n$-dimensional state space and an action space containing $m$ actions, the neural network is a function from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$. Two important ingredients of the DQN algorithm as proposed by Mnih et al. (2015) are the use of a target network, and the use of experience replay. The target network, with parameters $\boldsymbol{\theta}^{-}$, is the same as the online network except that its parameters are copied every $\tau$ steps from the online network, so that then $\boldsymbol{\theta}_{t}^{-}=\boldsymbol{\theta}_{t}$, and kept fixed on all other steps. The target used by DQN is then
$$Y_{t}^{\mathrm{DQN}} \equiv R_{t+1}+\gamma \max_{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}^{-}\right). \qquad (3)$$
For the experience replay (Lin 1992), observed transitions are stored for some time and sampled uniformly from this memory bank to update the network. Both the target network and the experience replay dramatically improve the performance of the algorithm (Mnih et al. 2015).
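A minimal sketch of how these two ingredients fit together is given below; the generic `q_network(params, states)` function, the dictionary-based transitions, and the zeroing of the bootstrap at episode ends are illustrative assumptions rather than details taken from the paper.

```python
import random
import numpy as np

def sample_replay(replay_memory, batch_size=32):
    """Uniformly sample stored transitions from the replay memory (Lin 1992)."""
    return random.sample(replay_memory, batch_size)

def dqn_targets(q_network, target_params, batch, gamma=0.99):
    """Compute the DQN target Y^DQN of Eq. (3) for a batch of transitions,
    bootstrapping with the target-network parameters theta^-."""
    rewards = np.array([t["r"] for t in batch])
    next_states = np.stack([t["s_next"] for t in batch])
    terminal = np.array([float(t["done"]) for t in batch])
    next_q = q_network(target_params, next_states)   # shape: (batch_size, num_actions)
    return rewards + gamma * (1.0 - terminal) * next_q.max(axis=1)
```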

Double Q-learning

The max operator in standard Q-learning and DQN, in (2) and (3), uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation.
In Double Q-learning (van Hasselt 2010), two value functions are learned by assigning experiences randomly to update one of the two value functions, resulting in two sets of weights, $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. For a clear comparison, we can untangle the selection and evaluation in Q-learning and rewrite its target (2) as
$$Y_{t}^{\mathrm{Q}}=R_{t+1}+\gamma Q\left(S_{t+1}, \operatorname*{argmax}_{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right) ; \boldsymbol{\theta}_{t}\right).$$
The Double Q-learning error can then be written as
$$Y_{t}^{\mathrm{DoubleQ}} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \operatorname*{argmax}_{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right) ; \boldsymbol{\theta}_{t}^{\prime}\right). \qquad (4)$$
Notice that the selection of the action, in the argmax, is still due to the online weights $\boldsymbol{\theta}_{t}$. This means that, as in Q-learning, we are still estimating the value of the greedy policy according to the current values, as defined by $\boldsymbol{\theta}_{t}$. However, we use the second set of weights $\boldsymbol{\theta}_{t}^{\prime}$ to fairly evaluate the value of this policy. This second set of weights can be updated symmetrically by switching the roles of $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$.
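For concreteness, a tabular sketch of this update scheme is shown below; the table shapes, learning rate, and coin flip used to assign experiences are illustrative choices consistent with the description above, not the original implementation.

```python
import numpy as np

def double_q_update(Q_a, Q_b, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
    """One tabular Double Q-learning step (van Hasselt 2010).

    Q_a, Q_b: arrays of shape (num_states, num_actions), the two value tables.
    Each experience randomly updates one table, using the other table to
    evaluate the action selected greedily by the first (as in Eq. 4).
    """
    if rng.random() < 0.5:
        selector, evaluator = Q_a, Q_b
    else:
        selector, evaluator = Q_b, Q_a
    a_star = np.argmax(selector[s_next])              # selection with one set of weights
    target = r + gamma * evaluator[s_next, a_star]    # evaluation with the other set
    selector[s, a] += alpha * (target - selector[s, a])
    return Q_a, Q_b
```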

Overoptimism due to estimation errors

Q-learning's overestimations were first investigated by Thrun and Schwartz (1993), who showed that if the action values contain random errors uniformly distributed in an interval $[-\epsilon, \epsilon]$ then each target is overestimated up to $\gamma \epsilon \frac{m-1}{m+1}$, where $m$ is the number of actions. In addition, Thrun and Schwartz give a concrete example in which these overestimations even asymptotically lead to sub-optimal policies, and show the overestimations manifest themselves in a small toy problem when using function approximation. Van Hasselt (2010) noted that noise in the environment can lead to overestimations even when using tabular representation, and proposed Double Q-learning as a solution.
In this section we demonstrate more generally that estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source. This is important, because in practice any method will incur some inaccuracies during learning, simply due to the fact that the true values are initially unknown.
The result by Thrun and Schwartz (1993) cited above gives an upper bound to the overestimation for a specific setup, but it is also possible, and potentially more interesting, to derive a lower bound.

Theorem 1. Consider a state $s$ in which all the true optimal action values are equal at $Q_{*}(s, a)=V_{*}(s)$ for some $V_{*}(s)$. Let $Q_{t}$ be arbitrary value estimates that are on the whole unbiased in the sense that $\sum_{a}\left(Q_{t}(s, a)-V_{*}(s)\right)=0$, but that are not all correct, such that $\frac{1}{m} \sum_{a}\left(Q_{t}(s, a)-V_{*}(s)\right)^{2}=C$ for some $C>0$, where $m \geq 2$ is the number of actions in $s$. Under these conditions, $\max_{a} Q_{t}(s, a) \geq V_{*}(s)+\sqrt{\frac{C}{m-1}}$. This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero. (Proof in appendix.)
Note that we did not need to assume that estimation errors for different actions are independent. This theorem shows that even if the value estimates are on average correct, estimation errors of any source can drive the estimates up and away from the true optimal values.
The lower bound in Theorem 1 decreases with the number of actions. This is an artifact of considering the lower bound, which requires very specific values to be attained. More typically, the overoptimism increases with the number of actions as shown in Figure 1. Q-learning's overestimations there indeed increase with the number of actions, while Double Q-learning is unbiased. As another example, if for all actions $Q_{*}(s, a)=V_{*}(s)$ and the estimation errors $Q_{t}(s, a)-V_{*}(s)$ are uniformly random in $[-1,1]$, then the overoptimism is $\frac{m-1}{m+1}$. (Proof in appendix.)

Figure 1: The orange bars show the bias in a single Q-learning update when the action values are $Q(s, a)=V_{*}(s)+\epsilon_{a}$ and the errors $\{\epsilon_{a}\}_{a=1}^{m}$ are independent standard normal random variables. The second set of action values $Q^{\prime}$, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.
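The bias shown in Figure 1 can be checked with a short Monte Carlo experiment; the sketch below follows the setup described in the caption (equal true values, independent standard normal errors, an independently generated second estimate for the double estimator) and is a reconstruction for illustration, not the authors' code.

```python
import numpy as np

def single_update_bias(m, repetitions=100_000, seed=0):
    """Bias of max_a Q(s, a) versus the double estimator when all true action
    values are equal, so any positive output is pure overestimation."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((repetitions, m))        # Q(s, a)  = V*(s) + eps_a
    eps_prime = rng.standard_normal((repetitions, m))  # Q'(s, a), independent second set
    q_learning_bias = eps.max(axis=1).mean()           # E[max_a Q(s, a)] - V*(s)
    a_star = eps.argmax(axis=1)                        # select the action with Q ...
    double_bias = eps_prime[np.arange(repetitions), a_star].mean()  # ... evaluate it with Q'
    return q_learning_bias, double_bias

for m in (2, 4, 8):
    print(m, single_update_bias(m))
```

Replacing the normal errors with errors drawn uniformly from $[-1,1]$ in the same harness recovers the $\frac{m-1}{m+1}$ overoptimism mentioned above.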
We now turn to function approximation and consider a real-valued continuous state space with 10 discrete actions in each state. For simplicity, the true optimal action values in this example depend only on state so that in each state all actions have the same true value. These true values are shown in the left column of plots in Figure 2 (purple lines) and are defined as either $Q_{*}(s, a)=\sin(s)$ (top row) or $Q_{*}(s, a)=2 \exp(-s^{2})$ (middle and bottom rows). The left plots also show an approximation for a single action (green lines) as a function of state as well as the samples the estimate is based on (green dots). The estimate is a $d$-degree polynomial that is fit to the true values at sampled states, where $d=6$ (top and middle rows) or $d=9$ (bottom row). The samples match the true function exactly: there is no noise and we assume we have ground truth for the action value on these sampled states. The approximation is inexact even on the sampled states for the top two rows because the function approximation is insufficiently flexible. In the bottom row, the function is flexible enough to fit the green dots, but this reduces the accuracy in unsampled states. Notice that the sampled states are spaced further apart near the left side of the left plots, resulting in larger estimation errors. In many ways this is a typical learning setting, where at each point in time we only have limited data.
The middle column of plots in Figure 2 shows estimated action values for all 10 actions (green lines), as functions of state, along with the maximum action value in each state (black dashed line). Although the true value function is the same for all actions, the approximations differ because they are based on different sets of sampled states.¹ The maximum is often higher than the ground truth shown in purple on the left. This is confirmed in the right plots, which show the difference between the black and purple curves in orange. The orange line is almost always positive, indicating an upward bias. The right plots also show the estimates from Double Q-learning in blue,² which are on average much closer to zero. This demonstrates that Double Q-learning indeed can successfully reduce the overoptimism of Q-learning.
The different rows in Figure 2 show variations of the same experiment. The difference between the top and middle rows is the true value function, demonstrating that overestimations are not an artifact of a specific true value function. The difference between the middle and bottom rows is the flexibility of the function approximation. In the left-middle plot, the estimates are even incorrect for some of the sampled states because the function is insufficiently flexible. The function in the bottom-left plot is more flexible but this causes higher estimation errors for unseen states, resulting in higher overestimations. This is important because flexible parametric function approximation is often employed in reinforcement learning (see, e.g., Tesauro 1995, Sallans and Hinton 2004, Riedmiller 2005, and Mnih et al. 2015).
In contrast to van Hasselt (2010), we did not use a statistical argument to find overestimations; the process to obtain Figure 2 is fully deterministic. In contrast to Thrun and Schwartz (1993), we did not rely on inflexible function approximation with irreducible asymptotic errors; the bottom row shows that a function that is flexible enough to cover all samples leads to high overestimations. This indicates that the overestimations can occur quite generally.
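A condensed sketch of the top-row experiment is given below, following the setup described above and in footnotes 1 and 2 (a degree-6 polynomial fit to $Q_{*}(s, a)=\sin(s)$ at integer states, with a different pair of adjacent integers left out for each action); it is a reconstruction under those stated assumptions rather than the original experimental code.

```python
import numpy as np

def mean_overestimation(num_actions=10, degree=6):
    """Fit one polynomial per action on slightly different integer states and
    compare max_a Q(s, a) with the true value (all actions share the same truth)."""
    true_fn = np.sin
    states = np.linspace(-6, 6, 601)
    estimates = []
    for i in range(num_actions):
        # Drop two adjacent integers per action, always keeping -6 and 6 (footnote 1).
        sampled = np.array([s for s in range(-6, 7) if s not in (i - 5, i - 4)], dtype=float)
        coeffs = np.polyfit(sampled, true_fn(sampled), degree)  # noise-free fit to true values
        estimates.append(np.polyval(coeffs, states))
    estimates = np.stack(estimates)                              # (num_actions, num_states)
    return (estimates.max(axis=0) - true_fn(states)).mean()      # max estimate minus truth

print(mean_overestimation())
```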
In the examples above, overestimations occur even when assuming we have samples of the true action value at certain states. The value estimates can further deteriorate if we bootstrap off of action values that are already overoptimistic, since this causes overestimations to propagate throughout our estimates. Although uniformly overestimating values might not hurt the resulting policy, in practice overestimation errors will differ for different states and actions. Overestimation combined with bootstrapping then has the pernicious effect of propagating the wrong relative information about which states are more valuable than others, directly affecting the quality of the learned policies.
The overestimations should not be confused with optimism in the face of uncertainty (Sutton 1990, Agrawal 1995, Kaelbling et al. 1996, Auer et al. 2002, Brafman and Tennenholtz 2003, Szita and Lőrincz 2008, Strehl and Littman 2009), where an exploration bonus is given to states or actions with uncertain values. The overestimations discussed here occur only after updating, resulting in overoptimism in the face of apparent certainty. Thrun and Schwartz (1993) noted that, in contrast to optimism in the face of uncertainty, these overestimations actually can impede learning an optimal policy. We confirm this negative effect on policy quality in our experiments: when we reduce the overestimations using Double Q-learning, the policies improve.

Double DQN

The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation.
Figure 2: Illustration of overestimations during learning. In each state (x-axis), there are 10 actions. The left column shows the true values $V_{*}(s)$ (purple line). All true action values are defined by $Q_{*}(s, a)=V_{*}(s)$. The green line shows estimated values $Q(s, a)$ for one action as a function of state, fitted to the true value at several sampled states (green dots). The middle column plots show all the estimated values (green), and the maximum of these values (dashed black). The maximum is higher than the true value (purple, left plot) almost everywhere. The right column plots show the difference in orange. The blue line in the right plots is the estimate used by Double Q-learning with a second set of samples for each state. The blue line is much closer to zero, indicating less bias. The three rows correspond to different true functions (left, purple) or capacities of the fitted function (left, green). (Details in the text.)

Although not fully decoupled, the target network in the DQN architecture provides a natural candidate for the second value function, without having to introduce additional networks. We therefore propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value. In reference to both Double Q-learning and DQN, we refer to the resulting algorithm as Double DQN. Its update is the same as for DQN, but replacing the target $Y_{t}^{\mathrm{DQN}}$ with
$$Y_{t}^{\mathrm{DoubleDQN}} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \operatorname*{argmax}_{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right), \boldsymbol{\theta}_{t}^{-}\right).$$
In comparison to Double Q-learning (4), the weights of the second network $\boldsymbol{\theta}_{t}^{\prime}$ are replaced with the weights of the target network $\boldsymbol{\theta}_{t}^{-}$ for the evaluation of the current greedy policy. The update to the target network stays unchanged from DQN, and remains a periodic copy of the online network.
This version of Double DQN is perhaps the minimal possible change to DQN towards Double Q-learning. The goal is to get most of the benefit of Double Q-learning, while keeping the rest of the DQN algorithm intact for a fair comparison, and with minimal computational overhead.
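In code, the difference relative to the DQN target is only in how the bootstrap action is chosen; the sketch below reuses the same generic `q_network(params, states)` interface assumed earlier and is illustrative only.

```python
import numpy as np

def double_dqn_targets(q_network, online_params, target_params, batch, gamma=0.99):
    """Compute Y^DoubleDQN: the online network selects the greedy action,
    the target network evaluates it."""
    rewards = np.array([t["r"] for t in batch])
    next_states = np.stack([t["s_next"] for t in batch])
    terminal = np.array([float(t["done"]) for t in batch])
    online_next_q = q_network(online_params, next_states)   # selection with theta_t
    target_next_q = q_network(target_params, next_states)   # evaluation with theta_t^-
    best_actions = online_next_q.argmax(axis=1)
    evaluated = target_next_q[np.arange(len(batch)), best_actions]
    return rewards + gamma * (1.0 - terminal) * evaluated
```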

Empirical results

In this section, we analyze the overestimations of DQN and show that Double DQN improves over DQN both in terms of value accuracy and in terms of policy quality. To further test the robustness of the approach we additionally evaluate the algorithms with random starts generated from expert human trajectories, as proposed by Nair et al. (2015).
Our testbed consists of Atari 2600 games, using the Arcade Learning Environment (Bellemare et al. 2013). The goal is for a single algorithm, with a fixed set of hyperparameters, to learn to play each of the games separately from interaction given only the screen pixels as input. This is a demanding testbed: not only are the inputs high-dimensional, the game visuals and game mechanics vary substantially between games. Good solutions must therefore rely heavily on the learning algorithm; it is not practically feasible to overfit the domain by relying only on tuning.
We closely follow the experimental setup and network architecture used by Mnih et al. (2015). Briefly, the network architecture is a convolutional neural network (Fukushima 1988, LeCun et al. 1998) with 3 convolution layers and a fully-connected hidden layer (approximately 1.5M parameters in total). The network takes the last four frames as input and outputs the action value of each action. On each game, the network is trained on a single GPU for 200M frames.

Results on overoptimism

Figure 3 shows examples of DQN's overestimations in six Atari games. DQN and Double DQN were both trained under the exact conditions described by Mnih et al. (2015). DQN is consistently and sometimes vastly overoptimistic about the value of the current greedy policy, as can be seen by comparing the orange learning curves in the top row of plots to the straight orange lines, which represent the actual discounted value of the best learned policy. More precisely, the (averaged) value estimates are computed regularly during training with full evaluation phases of length $T = 125{,}000$ steps as
$$\frac{1}{T} \sum_{t=1}^{T} \max_{a} Q\left(S_{t}, a ; \boldsymbol{\theta}\right).$$
Figure 3: The top and middle rows show value estimates by DQN (orange) and Double DQN (blue) on six Atari games. The results are obtained by running DQN and Double DQN with 6 different random seeds with the hyper-parameters employed by Mnih et al. (2015). The darker line shows the median over seeds and we average the two extreme values to obtain the shaded area (i.e., 10% and 90% quantiles with linear interpolation). The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded, and averaging the actual discounted return obtained from each visited state. These straight lines would match the learning curves at the right side of the plots if there is no bias. The middle row shows the value estimates (in log scale) for two games in which DQN's overoptimism is quite extreme. The bottom row shows the detrimental effect of this on the score achieved by the agent as it is evaluated during training: the scores drop when the overestimations begin. Learning with Double DQN is much more stable.
The ground truth averaged values are obtained by running the best learned policies for several episodes and computing the actual cumulative rewards. Without overestimations we would expect these quantities to match up (i.e., the curve to match the straight line at the right of each plot). Instead, the learning curves of DQN consistently end up much higher than the true values. The learning curves for Double DQN, shown in blue, are much closer to the blue straight line representing the true value of the final policy. Note that the blue straight line is often higher than the orange straight line. This indicates that Double DQN does not just produce more accurate value estimates but also better policies.
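As an illustration of this comparison, a sketch under the assumption of a generic `q_network` and already-collected evaluation trajectories might look as follows; it is not the evaluation code used in the paper.

```python
import numpy as np

def estimated_value(q_network, params, visited_states):
    """Average of max_a Q(S_t, a; theta) over the states visited during evaluation."""
    q_values = q_network(params, np.stack(visited_states))
    return float(q_values.max(axis=1).mean())

def ground_truth_value(episode_rewards, gamma=0.99):
    """Average actual discounted return obtained from each visited state of one episode."""
    returns, g = [], 0.0
    for r in reversed(episode_rewards):
        g = r + gamma * g
        returns.append(g)
    return float(np.mean(returns))
```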
More extreme overestimations are shown in the middle two plots, where DQN is highly unstable on the games Asterix and Wizard of Wor. Notice the log scale for the values on the $y$-axis. The bottom two plots show the corresponding scores for these two games. Notice that the increases in value estimates for DQN in the middle plots coincide with decreasing scores in the bottom plots. Again, this indicates that the overestimations are harming the quality of the resulting policies. If seen in isolation, one might perhaps be tempted to think the observed instability is related to inherent instability problems of off-policy learning with function approximation (Baird 1995, Tsitsiklis and Van Roy 1997, Maei 2011, Sutton et al. 2015).
|        | DQN (no ops) | DDQN (no ops) | DQN (human starts) | DDQN (human starts) | DDQN tuned (human starts) |
|--------|--------------|---------------|--------------------|---------------------|---------------------------|
| Median | 93%          | 115%          | 47%                | 88%                 | 117%                      |
| Mean   | 241%         | 330%          | 122%               | 273%                | 475%                      |
Table 1: Summarized normalized performance on 49 games for up to 5 minutes with up to 30 no ops at the start of each episode, and for up to 30 minutes with randomly selected human start points. Results for DQN are from Mnih et al. (2015) (no ops) and Nair et al. (2015) (human starts).
However, we see that learning is much more stable with Double DQN, suggesting that the cause for these instabilities is in fact Q-learning's overoptimism. Figure 3 only shows a few examples, but overestimations were observed for DQN in all 49 tested Atari games, albeit in varying amounts.

Quality of the learned policies

Overoptimism does not always adversely affect the quality of the learned policy. For example, DQN achieves optimal behavior in Pong despite slightly overestimating the policy value. Nevertheless, reducing overestimations can significantly benefit the stability of learning; we see clear examples of this in Figure 3. We now assess more generally how much Double DQN helps in terms of policy quality by evaluating on all 49 games that DQN was tested on.
As described by Mnih et al. (2015) each evaluation episode starts by executing a special no-op action that does not affect the environment up to 30 times, to provide different starting points for the agent. Some exploration during evaluation provides additional randomization. For Double DQN we used the exact same hyper-parameters as for DQN, to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an $\epsilon$-greedy policy where $\epsilon=0.05$. The scores are averaged over 100 episodes. The only difference between Double DQN and DQN is the target, using $Y_{t}^{\mathrm{DoubleDQN}}$ rather than $Y_{t}^{\mathrm{DQN}}$. This evaluation is somewhat adversarial, as the used hyperparameters were tuned for DQN but not for Double DQN.
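A schematic of this evaluation protocol is sketched below; the Gym-style environment interface (`env.reset()`, `env.step(a)`, `env.action_space.n`) and the assumption that action index 0 is the no-op are illustrative conventions, not details from the paper.

```python
import numpy as np

def evaluate_episode(env, q_network, params, rng, epsilon=0.05,
                     max_frames=18_000, max_no_ops=30):
    """One evaluation episode: up to 30 initial no-ops for a random start,
    then an epsilon-greedy policy; returns the undiscounted score."""
    state = env.reset()
    for _ in range(rng.integers(1, max_no_ops + 1)):
        state, _, _, _ = env.step(0)                 # assume action index 0 is the no-op
    score = 0.0
    for _ in range(max_frames):
        if rng.random() < epsilon:                   # occasional exploration during evaluation
            action = int(rng.integers(env.action_space.n))
        else:
            action = int(np.argmax(q_network(params, state[None])[0]))
        state, reward, done, _ = env.step(action)
        score += reward
        if done:
            break
    return score
```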
To obtain summary statistics across games, we normalize the score for each game as follows:
$$\text{score}_{\text{normalized}}=\frac{\text{score}_{\text{agent}}-\text{score}_{\text{random}}}{\text{score}_{\text{human}}-\text{score}_{\text{random}}}.$$
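This normalization maps random play to 0% and the human baseline to 100%; the snippet below is a trivial illustration with made-up numbers rather than scores from the appendix.

```python
def normalized_score(agent, random_score, human):
    """Normalize a raw game score so that random play maps to 0% and human play to 100%."""
    return 100.0 * (agent - random_score) / (human - random_score)

# Hypothetical example values, purely for illustration:
print(normalized_score(agent=3000.0, random_score=200.0, human=7000.0))  # ~41.2%
```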
The ‘random’ and ‘human’ scores are the same as used by Mnih et al. (2015), and are given in the appendix.
Table 1, under no ops, shows that on the whole Double DQN clearly improves over DQN. A detailed comparison (in appendix) shows that there are several games in which Double DQN greatly improves upon DQN. Noteworthy examples include Road Runner (from 233% to 617%), Asterix (from 70% to 180%), Zaxxon (from 54% to 111%), and Double Dunk (from 17% to 397%).
The Gorila algorithm (Nair et al. 2015), which is a massively distributed version of DQN, is not included in the table because the architecture and infrastructure is sufficiently different to make a direct comparison unclear. For completeness, we note that Gorila obtained median and mean normalized scores of 96% and 495%, respectively.

Robustness to Human starts

One concern with the previous evaluation is that in deterministic games with a unique starting point the learner could potentially learn to remember sequences of actions without much need to generalize. While successful, the solution would not be particularly robust. By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned polices (Nair et al. 2015).
We obtained 100 starting points sampled for each game from a human expert's trajectory, as proposed by Nair et al. (2015). We start an evaluation episode from each of these starting points and run the emulator for up to 108,000 frames (30 mins at 60 Hz, including the trajectory before the starting point). Each agent is only evaluated on the rewards accumulated after the starting point.
For this evaluation we include a tuned version of Double DQN. Some tuning is appropriate because the hyperparameters were tuned for DQN, which is a different algorithm.

Figure 4: Normalized scores on 57 Atari games, tested for 100 episodes per game with human starts. Compared to Mnih et al. (2015), eight additional games were tested. These are indicated with stars and a bold font.

For the tuned version of Double DQN, we increased the number of frames between each two copies of the target network from 10,000 to 30,000, to reduce overestimations further because immediately after each switch DQN and Double DQN both revert to Q-learning. In addition, we reduced the exploration during learning from $\epsilon=0.1$ to $\epsilon=0.01$, and then used $\epsilon=0.001$ during evaluation. Finally, the tuned version uses a single shared bias for all action values in the top layer of the network. Each of these changes improved performance and together they result in clearly better results.³
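The tuned settings listed above can be summarized as a small configuration delta; the key names below are illustrative, not identifiers from the authors' code.

```python
# Hyper-parameter changes for the tuned Double DQN, relative to the DQN defaults
# of Mnih et al. (2015); key names are made up for this sketch.
TUNED_DOUBLE_DQN = {
    "target_network_update_frequency": 30_000,  # was 10,000 frames between target-network copies
    "exploration_epsilon_train": 0.01,          # was 0.1 during learning
    "exploration_epsilon_eval": 0.001,          # was 0.05 during evaluation
    "shared_output_bias": True,                 # single shared bias for all action values
}
```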
Table 1 reports summary statistics for this evaluation (under human starts) on the 49 games from Mnih et al. (2015). Double DQN obtains clearly higher median and mean scores. Again Gorila DQN (Nair et al. 2015) is not included in the table, but for completeness note it obtained a median of 78% and a mean of 259%. Detailed results, plus results for an additional 8 games, are available in Figure 4 and in the appendix. On several games the improvements from DQN to Double DQN are striking, in some cases bringing scores much closer to human, or even surpassing these.
Double DQN appears more robust to this more challenging evaluation, suggesting that appropriate generalizations occur and that the found solutions do not exploit the determinism of the environments. This is appealing, as it indicates progress towards finding general solutions rather than a deterministic sequence of steps that would be less robust.

Discussion

This paper has five contributions. First, we have shown why Q-learning can be overoptimistic in large-scale problems, even if these are deterministic, due to the inherent estimation errors of learning. Second, by analyzing the value estimates on Atari games we have shown that these overestimations are more common and severe in practice than previously acknowledged. Third, we have shown that Double Q-learning can be used at scale to successfully reduce this overoptimism, resulting in more stable and reliable learning. Fourth, we have proposed a specific implementation called Double DQN, that uses the existing architecture and deep neural network of the DQN algorithm without requiring additional networks or parameters. Finally, we have shown that Double DQN finds better policies, obtaining new state-of-the-art results on the Atari 2600 domain.

Acknowledgments

We would like to thank Tom Schaul, Volodymyr Mnih, Marc Bellemare, Thomas Degris, Georg Ostrovski, and Richard Sutton for helpful comments, and everyone at Google DeepMind for a constructive research environment.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054-1078, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235-256, 2002.

L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, pages 30-37, 1995.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR), 47:253-279, 2013.

R. I. Brafman and M. Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213-231, 2003.

K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks, 1(2):119-130, 1988.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4: 237-285, 1996.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

L. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3):293-321, 1992.

H. R. Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540): 529-533, 2015.

A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning, pages 317-328. Springer, 2005.

B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5: 1063-1088, 2004.

A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. The Journal of Machine Learning Research, 10:2413-2444, 2009.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9-44, 1988.

R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning, pages 216-224, 1990.

R. S. Sutton and A. G. Barto. Introduction to reinforcement learning. MIT Press, 1998.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv preprint arXiv:1503.04269, 2015.

I. Szita and A. Lőrincz. The many faces of optimism: a unifying approach. In Proceedings of the 25th international conference on Machine learning, pages 1048-1055. ACM, 2008.

G. Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58-68, 1995.

S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613-2621, 2010.

H. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011.

C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹ Each action-value function is fit with a different subset of integer states. States $-6$ and $6$ are always included to avoid extrapolations, and for each action two adjacent integers are missing: for action $a_{1}$ states $-5$ and $-4$ are not sampled, for $a_{2}$ states $-4$ and $-3$ are not sampled, and so on. This causes the estimated values to differ.

² We arbitrarily used the samples of action $a_{i+5}$ (for $i \leq 5$) or $a_{i-5}$ (for $i>5$) as the second set of samples for the double estimator of action $a_{i}$.

³ Except for Tennis, where the lower $\epsilon$ during training seemed to hurt rather than help.