这是用户在 2024-3-25 20:18 为 https://ar5iv.labs.arxiv.org/html/1812.02900 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Off-Policy Deep Reinforcement Learning without Exploration
离策略深度强化学习无需探索

Scott Fujimoto  斯科特·藤本    David Meger  大卫·梅格尔    Doina Precup  多伊娜·普雷库普

Off-Policy Deep Reinforcement Learning without Exploration:
Supplementary Material
离策略深度强化学习无需探索:补充材料

Scott Fujimoto  斯科特·藤本    David Meger  大卫·梅格    Doina Precup  多伊娜·普雷库普
Abstract 摘要

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
许多实际应用的强化学习限制了代理只能从已经收集好的固定数据批次中学习,而不提供进一步数据收集的可能性。在本文中,我们展示了由于外推引入的错误,标准的离策略深度强化学习算法,如 DQN 和 DDPG,在没有与当前策略下的分布相关的数据的情况下无法学习,使它们在这种固定批次设置中效果不佳。我们引入了一类新的离策略算法,批量约束强化学习,它限制了动作空间以迫使代理行为接近于对给定数据子集的在策略行为。我们提出了第一个可以从任意固定批次数据中有效学习的连续控制深度强化学习算法,并通过几项任务实证展示了其行为的质量。

Machine Learning, ICML, Deep Reinforcement Learning, Imitation Learning, Batch Reinforcement Learning, Off-Policy
机器学习,ICML,深度强化学习,模仿学习,批量强化学习,离策略

1 Introduction 1 引言

Batch reinforcement learning, the task of learning from a fixed dataset without further interactions with the environment, is a crucial requirement for scaling reinforcement learning to tasks where the data collection procedure is costly, risky, or time-consuming. Off-policy batch reinforcement learning has important implications for many practical applications. It is often preferable for data collection to be performed by some secondary controlled process, such as a human operator or a carefully monitored program. If assumptions on the quality of the behavioral policy can be made, imitation learning can be used to produce strong policies. However, most imitation learning algorithms are known to fail when exposed to suboptimal trajectories, or require further interactions with the environment to compensate (Hester et al., 2017; Sun et al., 2018; Cheng et al., 2018). On the other hand, batch reinforcement learning offers a mechanism for learning from a fixed dataset without restrictions on the quality of the data.
批量强化学习,即从一个固定的数据集中学习而无需进一步与环境交互的任务,对于将强化学习扩展到数据收集过程成本高、风险大或耗时的任务中是至关重要的。离策略批量强化学习对许多实际应用有重要意义。通常更倾向于通过一些次级控制过程进行数据收集,如人类操作员或仔细监控的程序。如果可以对行为策略的质量做出假设,模仿学习可以用来产生强大的策略。然而,大多数模仿学习算法在暴露于次优轨迹时已知会失败,或需要进一步与环境交互来补偿(Hester et al., 2017; Sun et al., 2018; Cheng et al., 2018)。另一方面,批量强化学习提供了一种从固定数据集中学习的机制,不受数据质量的限制。

Most modern off-policy deep reinforcement learning algorithms fall into the category of growing batch learning (Lange et al., 2012), in which data is collected and stored into an experience replay dataset (Lin, 1992), which is used to train the agent before further data collection occurs. However, we find that these “off-policy” algorithms can fail in the batch setting, becoming unsuccessful if the dataset is uncorrelated to the true distribution under the current policy. Our most surprising result shows that off-policy agents perform dramatically worse than the behavioral agent when trained with the same algorithm on the same dataset.
大多数现代的离策略深度强化学习算法都属于增长批量学习(Lange et al., 2012)的范畴,在这种方法中,数据被收集并存储到经验回放数据集中(Lin, 1992),该数据集被用来在进一步收集数据之前训练代理。然而,我们发现这些“离策略”算法在批量设置中可能会失败,如果数据集与当前策略下的真实分布不相关,它们就会变得不成功。我们最令人惊讶的结果显示,当使用相同的算法在相同的数据集上训练时,离策略代理的表现比行为代理差得多。

This inability to learn truly off-policy is due to a fundamental problem with off-policy reinforcement learning we denote extrapolation error, a phenomenon in which unseen state-action pairs are erroneously estimated to have unrealistic values. Extrapolation error can be attributed to a mismatch in the distribution of data induced by the policy and the distribution of data contained in the batch. As a result, it may be impossible to learn a value function for a policy which selects actions not contained in the batch.
这种无法真正学习离策略的能力是由于离策略强化学习的一个根本问题,我们称之为外推误差,这是一种未见状态-动作对被错误估计为具有不切实际值的现象。外推误差可以归因于由策略引起的数据分布与批量中包含的数据分布之间的不匹配。因此,学习一个为批量中未包含的动作选择动作的策略的价值函数可能是不可能的。

To overcome extrapolation error in off-policy learning, we introduce batch-constrained reinforcement learning, where agents are trained to maximize reward while minimizing the mismatch between the state-action visitation of the policy and the state-action pairs contained in the batch. Our deep reinforcement learning algorithm, Batch-Constrained deep Q-learning (BCQ), uses a state-conditioned generative model to produce only previously seen actions. This generative model is combined with a Q-network, to select the highest valued action which is similar to the data in the batch. Under mild assumptions, we prove this batch-constrained paradigm is necessary for unbiased value estimation from incomplete datasets for finite deterministic MDPs.
为了克服离策略学习中的外推错误,我们引入了批量约束强化学习,其中代理被训练以在最大化奖励的同时最小化策略的状态-动作访问与批次中包含的状态-动作对之间的不匹配。我们的深度强化学习算法,批量约束深度 Q 学习(BCQ),使用一个状态条件生成模型来仅产生之前见过的动作。这个生成模型与 Q 网络结合,以选择与批次中的数据相似的最高价值动作。在温和的假设下,我们证明了这种批量约束范式对于有限确定性 MDP 的不完整数据集的无偏值估计是必要的。

Unlike any previous continuous control deep reinforcement learning algorithms, BCQ is able to learn successfully without interacting with the environment by considering extrapolation error. Our algorithm is evaluated on a series of batch reinforcement learning tasks in MuJoCo environments (Todorov et al., 2012; Brockman et al., 2016), where extrapolation error is particularly problematic due to the high-dimensional continuous action space, which is impossible to sample exhaustively. Our algorithm offers a unified view on imitation and off-policy learning, and is capable of learning from purely expert demonstrations, as well as from finite batches of suboptimal data, without further exploration. We remark that BCQ is only one way to approach batch-constrained reinforcement learning in a deep setting, and we hope that it will be serve as a foundation for future algorithms. To ensure reproducibility, we provide precise experimental and implementation details, and our code is made available (https://github.com/sfujim/BCQ).
与以往任何连续控制深度强化学习算法不同,BCQ 能够通过考虑外推误差而不与环境交互的情况下成功学习。我们的算法在 MuJoCo 环境中的一系列批量强化学习任务上进行了评估(Todorov 等人,2012 年;Brockman 等人,2016 年),在这些任务中,由于高维连续动作空间,外推误差尤其成问题,这是不可能穷尽采样的。我们的算法提供了对模仿和离策略学习的统一视角,并且能够仅从纯粹的专家演示中学习,以及从有限批次的次优数据中学习,无需进一步探索。我们指出,BCQ 只是一种在深度设置中处理批量约束强化学习的方法,并且我们希望它能够为未来的算法提供基础。为了确保可重复性,我们提供了精确的实验和实施细节,我们的代码已经可用(https://github.com/sfujim/BCQ)。

2 Background 2 背景

In reinforcement learning, an agent interacts with its environment, typically assumed to be a Markov decision process (MDP) (𝒮,𝒜,pM,r,γ)𝒮𝒜subscript𝑝𝑀𝑟𝛾(\mathcal{S},\mathcal{A},p_{M},r,\gamma), with state space 𝒮𝒮\mathcal{S}, action space 𝒜𝒜\mathcal{A}, and transition dynamics pM(s|s,a)subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎p_{M}(s^{\prime}|s,a). At each discrete time step, the agent receives a reward r(s,a,s)𝑟𝑠𝑎superscript𝑠r(s,a,s^{\prime})\in\mathbb{R} for performing action a𝑎a in state s𝑠s and arriving at the state ssuperscript𝑠s^{\prime}. The goal of the agent is to maximize the expectation of the sum of discounted rewards, known as the return Rt=i=t+1γir(si,ai,si+1)subscript𝑅𝑡superscriptsubscript𝑖𝑡1superscript𝛾𝑖𝑟subscript𝑠𝑖subscript𝑎𝑖subscript𝑠𝑖1R_{t}=\sum_{i={t+1}}^{\infty}\gamma^{i}r(s_{i},a_{i},s_{i+1}), which weighs future rewards with respect to the discount factor γ[0,1)𝛾01\gamma\in[0,1).
在强化学习中,一个智能体与其环境进行交互,通常假设为一个马尔可夫决策过程(MDP) (𝒮,𝒜,pM,r,γ)𝒮𝒜subscript𝑝𝑀𝑟𝛾(\mathcal{S},\mathcal{A},p_{M},r,\gamma) ,具有状态空间 𝒮𝒮\mathcal{S} 、动作空间 𝒜𝒜\mathcal{A} 和转移动力学 pM(s|s,a)subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎p_{M}(s^{\prime}|s,a) 。在每一个离散时间步骤中,智能体因在状态 s𝑠s 执行动作 a𝑎a 并到达状态 ssuperscript𝑠s^{\prime} 而接收到奖励 r(s,a,s)𝑟𝑠𝑎superscript𝑠r(s,a,s^{\prime})\in\mathbb{R} 。智能体的目标是最大化折扣奖励之和的期望值,即回报 Rt=i=t+1γir(si,ai,si+1)subscript𝑅𝑡superscriptsubscript𝑖𝑡1superscript𝛾𝑖𝑟subscript𝑠𝑖subscript𝑎𝑖subscript𝑠𝑖1R_{t}=\sum_{i={t+1}}^{\infty}\gamma^{i}r(s_{i},a_{i},s_{i+1}) ,它根据折扣因子 γ[0,1)𝛾01\gamma\in[0,1) 对未来奖励进行加权。

The agent selects actions with respect to a policy π:𝒮𝒜:𝜋𝒮𝒜\pi:\mathcal{S}\rightarrow\mathcal{A}, which exhibits a distribution μπ(s)superscript𝜇𝜋𝑠\mu^{\pi}(s) over the states s𝒮𝑠𝒮s\in\mathcal{S} visited by the policy. Each policy π𝜋\pi has a corresponding value function Qπ(s,a)=𝔼π[Rt|s,a]superscript𝑄𝜋𝑠𝑎subscript𝔼𝜋delimited-[]conditionalsubscript𝑅𝑡𝑠𝑎Q^{\pi}(s,a)=\mathbb{E}_{\pi}[R_{t}|s,a], the expected return when following the policy after taking action a𝑎a in state s𝑠s. For a given policy π𝜋\pi, the value function can be computed through the Bellman operator 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi}:
代理根据策略 π:𝒮𝒜:𝜋𝒮𝒜\pi:\mathcal{S}\rightarrow\mathcal{A} 选择行动,该策略展示了一个在策略访问的状态 s𝒮𝑠𝒮s\in\mathcal{S} 上的分布 μπ(s)superscript𝜇𝜋𝑠\mu^{\pi}(s) 。每个策略 π𝜋\pi 都有一个对应的价值函数 Qπ(s,a)=𝔼π[Rt|s,a]superscript𝑄𝜋𝑠𝑎subscript𝔼𝜋delimited-[]conditionalsubscript𝑅𝑡𝑠𝑎Q^{\pi}(s,a)=\mathbb{E}_{\pi}[R_{t}|s,a] ,即在状态 s𝑠s 采取行动 a𝑎a 后遵循该策略的预期回报。对于给定的策略 π𝜋\pi ,可以通过贝尔曼算子 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi} 计算价值函数:

𝒯πQ(s,a)=𝔼s[r+γQ(s,π(s))].superscript𝒯𝜋𝑄𝑠𝑎subscript𝔼superscript𝑠delimited-[]𝑟𝛾𝑄superscript𝑠𝜋superscript𝑠\mathcal{T}^{\pi}Q(s,a)=\mathbb{E}_{s^{\prime}}[r+\gamma Q(s^{\prime},\pi(s^{\prime}))]. (1)

The Bellman operator is a contraction for γ[0,1)𝛾01\gamma\in[0,1) with unique fixed point Qπ(s,a)superscript𝑄𝜋𝑠𝑎Q^{\pi}(s,a) (Bertsekas & Tsitsiklis, 1996). Q(s,a)=maxπQπ(s,a)superscript𝑄𝑠𝑎subscript𝜋superscript𝑄𝜋𝑠𝑎Q^{*}(s,a)=\max_{\pi}Q^{\pi}(s,a) is known as the optimal value function, which has a corresponding optimal policy obtained through greedy action choices. For large or continuous state and action spaces, the value can be approximated with neural networks, e.g. using the Deep Q-Network algorithm (DQN) (Mnih et al., 2015). In DQN, the value function Qθsubscript𝑄𝜃Q_{\theta} is updated using the target:
Bellman 算子对于 γ[0,1)𝛾01\gamma\in[0,1) 是一个收缩映射,具有唯一的不动点 Qπ(s,a)superscript𝑄𝜋𝑠𝑎Q^{\pi}(s,a) (Bertsekas & Tsitsiklis, 1996)。 Q(s,a)=maxπQπ(s,a)superscript𝑄𝑠𝑎subscript𝜋superscript𝑄𝜋𝑠𝑎Q^{*}(s,a)=\max_{\pi}Q^{\pi}(s,a) 被认为是最优价值函数,它有一个通过贪婪动作选择获得的相应最优策略。对于大型或连续的状态和动作空间,可以使用神经网络来近似价值,例如使用深度 Q 网络算法(DQN)(Mnih 等人,2015)。在 DQN 中,价值函数 Qθsubscript𝑄𝜃Q_{\theta} 使用以下目标进行更新:

r+γQθ(s,π(s)),π(s)=argmaxaQθ(s,a),𝑟𝛾subscript𝑄superscript𝜃superscript𝑠𝜋superscript𝑠𝜋superscript𝑠subscriptargmax𝑎subscript𝑄superscript𝜃superscript𝑠𝑎r+\gamma Q_{\theta^{\prime}}(s^{\prime},\pi(s^{\prime})),\quad\pi(s^{\prime})=\operatorname{argmax}_{a}Q_{\theta^{\prime}}(s^{\prime},a), (2)

Q-learning is an off-policy algorithm (Sutton & Barto, 1998), meaning the target can be computed without consideration of how the experience was generated. In principle, off-policy reinforcement learning algorithms are able to learn from data collected by any behavioral policy. Typically, the loss is minimized over mini-batches of tuples of the agent’s past data, (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime})\in\mathcal{B}, sampled from an experience replay dataset \mathcal{B} (Lin, 1992). For shorthand, we often write s𝑠s\in\mathcal{B} if there exists a transition tuple containing s𝑠s in the batch \mathcal{B}, and similarly for (s,a)𝑠𝑎(s,a) or (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B}. In batch reinforcement learning, we assume \mathcal{B} is fixed and no further interaction with the environment occurs. To further stabilize learning, a target network with frozen parameters Qθsubscript𝑄superscript𝜃Q_{\theta^{\prime}}, is used in the learning target. The parameters of the target network θsuperscript𝜃\theta^{\prime} are updated to the current network parameters θ𝜃\theta after a fixed number of time steps, or by averaging θτθ+(1τ)θsuperscript𝜃𝜏𝜃1𝜏superscript𝜃\theta^{\prime}\leftarrow\tau\theta+(1-\tau)\theta^{\prime} for some small τ𝜏\tau (Lillicrap et al., 2015).
Q 学习是一种离策略算法(Sutton & Barto, 1998),意味着目标可以在不考虑如何生成经验的情况下计算。原则上,离策略强化学习算法能够从任何行为策略收集的数据中学习。通常,损失在代理过去数据的小批量元组上最小化, (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime})\in\mathcal{B} ,从经验回放数据集 \mathcal{B} (Lin, 1992)中采样。为了简便,如果批次 \mathcal{B} 中存在包含 s𝑠s 的转换元组,我们经常写 s𝑠s\in\mathcal{B} ,对于 (s,a)𝑠𝑎(s,a)(s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B} 也是类似的。在批量强化学习中,我们假设 \mathcal{B} 是固定的,并且不再与环境进行进一步的交互。为了进一步稳定学习,使用了一个具有冻结参数 Qθsubscript𝑄superscript𝜃Q_{\theta^{\prime}} 的目标网络作为学习目标。目标网络的参数 θsuperscript𝜃\theta^{\prime} 在固定的时间步后更新为当前网络参数 θ𝜃\theta ,或者通过平均 θτθ+(1τ)θsuperscript𝜃𝜏𝜃1𝜏superscript𝜃\theta^{\prime}\leftarrow\tau\theta+(1-\tau)\theta^{\prime} 对于一些小的 τ𝜏\tau (Lillicrap 等人,2015)。

In a continuous action space, the analytic maximum of Equation (2) is intractable. In this case, actor-critic methods are commonly used, where action selection is performed through a separate policy network πϕsubscript𝜋italic-ϕ\pi_{\phi}, known as the actor, and updated with respect to a value estimate, known as the critic (Sutton & Barto, 1998; Konda & Tsitsiklis, 2003). This policy can be updated following the deterministic policy gradient theorem (Silver et al., 2014):
在连续动作空间中,方程(2)的解析最大值是难以处理的。在这种情况下,通常使用演员-评论家方法,其中通过一个单独的策略网络 πϕsubscript𝜋italic-ϕ\pi_{\phi} 进行动作选择,这个网络被称为演员,并且根据一个称为评论家的价值估计进行更新(Sutton & Barto, 1998; Konda & Tsitsiklis, 2003)。这个策略可以根据确定性策略梯度定理进行更新(Silver et al., 2014):

ϕargmaxϕ𝔼s[Qθ(s,πϕ(s))],italic-ϕsubscriptargmaxitalic-ϕsubscript𝔼𝑠delimited-[]subscript𝑄𝜃𝑠subscript𝜋italic-ϕ𝑠\phi\leftarrow\operatorname{argmax}_{\phi}\mathbb{E}_{s\in\mathcal{B}}[Q_{\theta}(s,\pi_{\phi}(s))], (3)

which corresponds to learning an approximation to the maximum of Qθsubscript𝑄𝜃Q_{\theta}, by propagating the gradient through both πϕsubscript𝜋italic-ϕ\pi_{\phi} and Qθsubscript𝑄𝜃Q_{\theta}. When combined with off-policy deep Q-learning to learn Qθsubscript𝑄𝜃Q_{\theta}, this algorithm is referred to as Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015).
这对应于通过通过 πϕsubscript𝜋italic-ϕ\pi_{\phi}Qθsubscript𝑄𝜃Q_{\theta} 传播梯度来学习对 Qθsubscript𝑄𝜃Q_{\theta} 的最大值的近似。 当与离策略深度 Q 学习结合使用以学习 Qθsubscript𝑄𝜃Q_{\theta} 时,这个算法被称为深度确定性策略梯度(DDPG)(Lillicrap et al., 2015)。

3 Extrapolation Error 外推误差

Extrapolation error is an error in off-policy value learning which is introduced by the mismatch between the dataset and true state-action visitation of the current policy. The value estimate Q(s,a)𝑄𝑠𝑎Q(s,a) is affected by extrapolation error during a value update where the target policy selects an unfamiliar action asuperscript𝑎a^{\prime} at the next state ssuperscript𝑠s^{\prime} in the backed-up value estimate, such that (s,a)superscript𝑠superscript𝑎(s^{\prime},a^{\prime}) is unlikely, or not contained, in the dataset. The cause of extrapolation error can be attributed to several related factors:
外推误差是离策略价值学习中由数据集与当前策略的真实状态-动作访问不匹配引入的错误。当目标策略在下一个状态 ssuperscript𝑠s^{\prime} 中选择一个不熟悉的动作 asuperscript𝑎a^{\prime} 进行价值更新时,价值估计 Q(s,a)𝑄𝑠𝑎Q(s,a) 会受到外推误差的影响,以至于 (s,a)superscript𝑠superscript𝑎(s^{\prime},a^{\prime}) 不太可能出现在数据集中,或者根本不包含在数据集中。外推误差的原因可以归因于几个相关因素:

Absent Data. If any state-action pair (s,a)𝑠𝑎(s,a) is unavailable, then error is introduced as some function of the amount of similar data and approximation error. This means that the estimate of Qθ(s,π(s))subscript𝑄𝜃superscript𝑠𝜋superscript𝑠Q_{\theta}(s^{\prime},\pi(s^{\prime})) may be arbitrarily bad without sufficient data near (s,π(s))superscript𝑠𝜋superscript𝑠(s^{\prime},\pi(s^{\prime})).
缺失数据。如果任何状态-动作对 (s,a)𝑠𝑎(s,a) 不可用,则随着相似数据量和近似误差的某种函数,将引入误差。这意味着如果没有足够接近 (s,π(s))superscript𝑠𝜋superscript𝑠(s^{\prime},\pi(s^{\prime})) 的数据,对 Qθ(s,π(s))subscript𝑄𝜃superscript𝑠𝜋superscript𝑠Q_{\theta}(s^{\prime},\pi(s^{\prime})) 的估计可能会任意糟糕。

Model Bias. When performing off-policy Q-learning with a batch \mathcal{B}, the Bellman operator 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi} is approximated by sampling transitions tuples (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) from \mathcal{B} to estimate the expectation over ssuperscript𝑠s^{\prime}. However, for a stochastic MDP, without infinite state-action visitation, this produces a biased estimate of the transition dynamics:
模型偏差。在使用一批数据 \mathcal{B} 进行离策略 Q 学习时,贝尔曼算子 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi} 通过从 \mathcal{B} 中抽取转换元组 (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) 来近似,以估计对 ssuperscript𝑠s^{\prime} 的期望。然而,对于一个随机的 MDP,在没有无限状态-动作访问的情况下,这会产生一个偏差的转换动态估计:

𝒯πQ(s,a)𝔼s[r+γQ(s,π(s))],superscript𝒯𝜋𝑄𝑠𝑎subscript𝔼similar-tosuperscript𝑠delimited-[]𝑟𝛾𝑄superscript𝑠𝜋superscript𝑠\mathcal{T}^{\pi}Q(s,a)\approx\mathbb{E}_{s^{\prime}\sim\mathcal{B}}[r+\gamma Q(s^{\prime},\pi(s^{\prime}))], (4)

where the expectation is with respect to transitions in the batch \mathcal{B}, rather than the true MDP.
其中的期望是相对于批次 \mathcal{B} 中的转换,而不是真实的 MDP。

Training Mismatch. Even with sufficient data, in deep Q-learning systems, transitions are sampled uniformly from the dataset, giving a loss weighted with respect to the likelihood of data in the batch:
训练不匹配。即使有足够的数据,在深度 Q 学习系统中,转换是从数据集中均匀抽样的,给出了一个与批次中数据的可能性成比例的损失权重:

1||(s,a,r,s)r+γQθ(s,π(s))Qθ(s,a)2.absent1subscript𝑠𝑎𝑟superscript𝑠superscriptnorm𝑟𝛾subscript𝑄superscript𝜃superscript𝑠𝜋superscript𝑠subscript𝑄𝜃𝑠𝑎2\approx\frac{1}{|\mathcal{B}|}\sum_{(s,a,r,s^{\prime})\in\mathcal{B}}||r+\gamma Q_{\theta^{\prime}}(s^{\prime},\pi(s^{\prime}))-Q_{\theta}(s,a)||^{2}. (5)

If the distribution of data in the batch does not correspond with the distribution under the current policy, the value function may be a poor estimate of actions selected by the current policy, due to the mismatch in training.
如果批次中的数据分布与当前策略下的分布不一致,由于训练中的不匹配,值函数可能是当前策略选定动作的一个差劲估计。

We remark that re-weighting the loss in Equation (5) with respect to the likelihood under the current policy can still result in poor estimates if state-action pairs with high likelihood under the current policy are not found in the batch. This means only a subset of possible policies can be evaluated accurately. As a result, learning a value estimate with off-policy data can result in large amounts of extrapolation error if the policy selects actions which are not similar to the data found in the batch. In the following section, we discuss how state of the art off-policy deep reinforcement learning algorithms fail to address the concern of extrapolation error, and demonstrate the implications in practical examples.
我们注意到,即使根据当前策略下的可能性对方程(5)中的损失进行重新加权,如果批次中没有发现在当前策略下具有高可能性的状态-动作对,仍然可能得到差劲的估计。这意味着只有一部分可能的策略可以被准确评估。因此,使用非策略数据学习值估计可能会导致大量的外推错误,如果策略选择的动作与批次中的数据不相似。在接下来的部分,我们将讨论最先进的非策略深度强化学习算法如何未能解决外推错误的问题,并在实际示例中展示其影响。

3.1 Extrapolation Error in Deep Reinforcement Learning
3.1 深度强化学习中的外推错误

Deep Q-learning algorithms (Mnih et al., 2015) have been labeled as off-policy due to their connection to off-policy Q-learning (Watkins, 1989). However, these algorithms tend to use near-on-policy exploratory policies, such as ϵitalic-ϵ\epsilon-greedy, in conjunction with a replay buffer (Lin, 1992). As a result, the generated dataset tends to be heavily correlated to the current policy. In this section, we examine how these off-policy algorithms perform when learning with uncorrelated datasets. Our results demonstrate that the performance of a state of the art deep actor-critic algorithm, DDPG (Lillicrap et al., 2015), deteriorates rapidly when the data is uncorrelated and the value estimate produced by the deep Q-network diverges. These results suggest that off-policy deep reinforcement learning algorithms are ineffective when learning truly off-policy.
深度 Q 学习算法(Mnih 等,2015 年)因其与离策略 Q 学习(Watkins,1989 年)的联系而被标记为离策略。然而,这些算法倾向于使用接近在策略的探索性策略,例如ε-贪婪,结合重放缓冲区(Lin,1992 年)。因此,生成的数据集往往与当前策略高度相关。在本节中,我们检查这些离策略算法在使用不相关数据集学习时的表现。我们的结果表明,当数据不相关且深度 Q 网络产生的价值估计发散时,最先进的深度演员-评论家算法 DDPG(Lillicrap 等,2015 年)的性能迅速恶化。这些结果表明,当真正离策略学习时,离策略深度强化学习算法是无效的。

Our practical experiments examine three different batch settings in OpenAI gym’s Hopper-v1 environment (Todorov et al., 2012; Brockman et al., 2016), which we use to train an off-policy DDPG agent with no interaction with the environment. Experiments with additional environments and specific details can be found in the Supplementary Material.
我们的实践实验在 OpenAI gym 的 Hopper-v1 环境中检验了三种不同的批处理设置(Todorov 等人,2012; Brockman 等人,2016),我们使用它来训练一个不与环境交互的离策略 DDPG 代理。关于额外环境的实验和具体细节可以在补充材料中找到。

Refer to caption
Refer to caption
(a) Final buffer  (a)最终缓冲区
performance 性能
(b) Concurrent  (b)并发
performance 表现
(c) Imitation  (c)模仿
performance 表现
Refer to caption
(d) Final buffer  (d)最终缓冲区
value estimate 价值估计
(e) Concurrent  (e)同时发生
value estimate 价值估计
(f) Imitation  (f)模仿
value estimate 价值估计
Figure 1: We examine the performance (top row) and corresponding value estimates (bottom row) of DDPG in three batch tasks on Hopper-v1. Each individual trial is plotted with a thin line, with the mean in bold (evaluated without exploration noise). Straight lines represent the average return of episodes contained in the batch (with exploration noise). An estimate of the true value of the off-policy agent, evaluated by Monte Carlo returns, is marked by a dotted line. In all three experiments, we observe a large gap in the performance between the behavioral and off-policy agent, even when learning from the same dataset (concurrent). Furthermore, the value estimates are unstable or divergent across all tasks.
图 1:我们检验了 DDPG 在 Hopper-v1 的三个批处理任务中的表现(顶行)及相应的价值估计(底行)。每个独立试验以细线绘制,平均值以粗体表示(在没有探索噪声的情况下评估)。直线代表批次中包含的剧集的平均回报(带有探索噪声)。通过蒙特卡洛回报评估的离策略代理的真实价值,由点线标记。在所有三个实验中,我们观察到行为代理和离策略代理之间的表现存在大的差距,即使是从相同的数据集(并发)学习时也是如此。此外,所有任务中的价值估计都不稳定或发散。

Batch 1 (Final buffer). We train a DDPG agent for 1 million time steps, adding 𝒩(0,0.5)𝒩00.5\mathcal{N}(0,0.5) Gaussian noise to actions for high exploration, and store all experienced transitions. This collection procedure creates a dataset with a diverse set of states and actions, with the aim of sufficient coverage.
批次 1(最终缓冲区)。我们训练一个 DDPG 代理 1 百万时间步长,为了高度探索,在动作中添加 𝒩(0,0.5)𝒩00.5\mathcal{N}(0,0.5) 高斯噪声,并存储所有经历过的转换。这个收集程序创建了一个具有多样化状态和动作的数据集,目的是足够的覆盖率。

Batch 2 (Concurrent). We concurrently train the off-policy and behavioral DDPG agents, for 1 million time steps. To ensure sufficient exploration, a standard 𝒩(0,0.1)𝒩00.1\mathcal{N}(0,0.1) Gaussian noise is added to actions taken by the behavioral policy. Each transition experienced by the behavioral policy is stored in a buffer replay, which both agents learn from. As a result, both agents are trained with the identical dataset.
批次 2(并行)。我们同时训练离策略和行为 DDPG 代理,为期 1 百万时间步长。为了确保足够的探索,标准 𝒩(0,0.1)𝒩00.1\mathcal{N}(0,0.1) 高斯噪声被添加到行为策略采取的动作中。行为策略经历的每一个转换都存储在一个缓冲回放中,两个代理都从中学习。结果,两个代理都使用相同的数据集进行训练。

Batch 3 (Imitation). A trained DDPG agent acts as an expert, and is used to collect a dataset of 1 million transitions.
批次 3(模仿)。一个训练有素的 DDPG 代理充当专家,并用来收集 1 百万次转换的数据集。

In Figure 1, we graph the performance of the agents as they train with each batch, as well as their value estimates. Straight lines represent the average return of episodes contained in the batch. Additionally, we graph the learning performance of the behavioral agent for the relevant tasks.
在图 1 中,我们绘制了代理在每个批次训练时的表现以及它们的价值估计。直线代表批次中包含的剧集的平均回报。此外,我们还绘制了行为代理在相关任务上的学习表现。

Our experiments demonstrate several surprising facts about off-policy deep reinforcement learning agents. In each task, the off-policy agent performances significantly worse than the behavioral agent. Even in the concurrent experiment, where both agents are trained with the same dataset, there is a large gap in performance in every single trial. This result suggests that differences in the state distribution under the initial policies is enough for extrapolation error to drastically offset the performance of the off-policy agent. Additionally, the corresponding value estimate exhibits divergent behavior, while the value estimate of the behavioral agent is highly stable. In the final buffer experiment, the off-policy agent is provided with a large and diverse dataset, with the aim of providing sufficient coverage of the initial policy. Even in this instance, the value estimate is highly unstable, and the performance suffers. In the imitation setting, the agent is provided with expert data. However, the agent quickly learns to take non-expert actions, under the guise of optimistic extrapolation. As a result, the value estimates rapidly diverge and the agent fails to learn.
我们的实验展示了关于离策略深度强化学习代理的几个令人惊讶的事实。在每项任务中,离策略代理的表现显著低于行为代理。即使在同时进行的实验中,当两个代理使用相同的数据集进行训练时,每一次试验中都存在很大的性能差距。这一结果表明,初始策略下状态分布的差异足以使外推误差极大地影响离策略代理的性能。此外,相应的价值估计表现出发散行为,而行为代理的价值估计非常稳定。在最终的缓冲实验中,离策略代理被提供了一个大型且多样化的数据集,旨在提供初始策略的足够覆盖。即便在这种情况下,价值估计也高度不稳定,性能也受到影响。在模仿设置中,代理被提供了专家数据。然而,代理很快学会采取非专家动作,这是在乐观外推的幌子下进行的。结果,价值估计迅速发散,代理未能学习。

Although extrapolation error is not necessarily positively biased, when combined with maximization in reinforcement learning algorithms, extrapolation error provides a source of noise that can induce a persistent overestimation bias (Thrun & Schwartz, 1993; Van Hasselt et al., 2016; Fujimoto et al., 2018). In an on-policy setting, extrapolation error may be a source of beneficial exploration through an implicit “optimism in the face of uncertainty” strategy (Lai & Robbins, 1985; Jaksch et al., 2010). In this case, if the value function overestimates an unknown state-action pair, the policy will collect data in the region of uncertainty, and the value estimate will be corrected. However, when learning off-policy, or in a batch setting, extrapolation error will never be corrected due to the inability to collect new data.
尽管外推误差不一定是正向偏差的,但当与强化学习算法中的最大化结合时,外推误差提供了一种噪声源,这种噪声源可以诱导持续的过高估计偏差(Thrun & Schwartz, 1993; Van Hasselt et al., 2016; Fujimoto et al., 2018)。在一个按策略的设置中,外推误差可能是通过一种隐含的“面对不确定性时的乐观”策略来进行有益探索的来源(Lai & Robbins, 1985; Jaksch et al., 2010)。在这种情况下,如果价值函数对未知的状态-动作对进行了过高估计,策略将在不确定性区域收集数据,价值估计将被纠正。然而,在离策略学习或批量设置中,由于无法收集新数据,外推误差将永远不会被纠正。

These experiments show extrapolation error can be highly detrimental to learning off-policy in a batch reinforcement learning setting. While the continuous state space and multi-dimensional action space in MuJoCo environments are contributing factors to extrapolation error, the scale of these tasks is small compared to real world settings. As a result, even with a sufficient amount of data collection, extrapolation error may still occur due to the concern of catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2013). Consequently, off-policy reinforcement learning algorithms used in the real-world will require practical guarantees without exhaustive amounts of data.
这些实验表明,在批量强化学习设置中,外推误差可能对离策略学习极为不利。虽然 MuJoCo 环境中的连续状态空间和多维动作空间是外推误差的促成因素,但这些任务的规模与现实世界的设置相比还是小的。因此,即使有足够多的数据收集,由于灾难性遗忘的问题(McCloskey & Cohen, 1989; Goodfellow et al., 2013),外推误差仍可能发生。因此,在现实世界中使用的离策略强化学习算法将需要在没有大量数据的情况下提供实际保证。

4 Batch-Constrained Reinforcement Learning
批量约束强化学习

Current off-policy deep reinforcement learning algorithms fail to address extrapolation error by selecting actions with respect to a learned value estimate, without consideration of the accuracy of the estimate. As a result, certain out-of-distribution actions can be erroneously extrapolated to higher values. However, the value of an off-policy agent can be accurately evaluated in regions where data is available. We propose a conceptually simple idea: to avoid extrapolation error a policy should induce a similar state-action visitation to the batch. We denote policies which satisfy this notion as batch-constrained. To optimize off-policy learning for a given batch, batch-constrained policies are trained to select actions with respect to three objectives:
当前的离策略深度强化学习算法未能解决通过选择行动来估计学习值时的外推错误问题,而没有考虑估计的准确性。结果是,某些超出分布的行动可能会被错误地外推到更高的值。然而,可以在数据可用的区域准确评估离策略代理的价值。我们提出了一个概念上简单的想法:为了避免外推错误,策略应该引导产生与批次相似的状态-行动访问。我们将满足这一概念的策略称为批次约束的。为了优化给定批次的离策略学习,批次约束的策略被训练以根据三个目标选择行动:

  1. (1)

    Minimize the distance of selected actions to the data in the batch.


    (1) 最小化选定行动到批次中数据的距离。
  2. (2)

    Lead to states where familiar data can be observed.


    (2) 导致可以观察到熟悉数据的状态。
  3. (3)

    Maximize the value function.


    (3) 最大化价值函数。

We note the importance of objective (1) above the others, as the value function and estimates of future states may be arbitrarily poor without access to the corresponding transitions. That is, we cannot correctly estimate (2) and (3) unless (1) is sufficiently satisfied. As a result, we propose optimizing the value function, along with some measure of future certainty, with a constraint limiting the distance of selected actions to the batch. This is achieved in our deep reinforcement learning algorithm through a state-conditioned generative model, to produce likely actions under the batch. This generative model is combined with a network which aims to optimally perturb the generated actions in a small range, along with a Q-network, used to select the highest valued action. Finally, we train a pair of Q-networks, and take the minimum of their estimates during the value update. This update penalizes states which are unfamiliar, and pushes the policy to select actions which lead to certain data.
我们注意到目标(1)相对于其他目标的重要性,因为如果没有对应转换的访问,价值函数和未来状态的估计可能会非常糟糕。也就是说,除非(1)得到了充分的满足,否则我们无法正确估计(2)和(3)。因此,我们提出优化价值函数,以及某种未来确定性的度量,同时限制所选动作到批处理的距离。在我们的深度强化学习算法中,通过一个状态条件生成模型实现这一点,以在批处理下产生可能的动作。这个生成模型与一个旨在以小范围最优地扰动生成动作的网络结合在一起,以及一个用于选择最高价值动作的 Q 网络。最后,我们训练一对 Q 网络,并在价值更新期间取它们估计的最小值。这种更新惩罚不熟悉的状态,并推动策略选择导致确定数据的动作。

We begin by analyzing the theoretical properties of batch-constrained policies in a finite MDP setting, where we are able to quantify extrapolation error precisely. We then introduce our deep reinforcement learning algorithm in detail, Batch-Constrained deep Q-learning (BCQ) by drawing inspiration from the tabular analogue.
我们首先分析有限 MDP 设置中批量约束策略的理论属性,我们能够精确量化外推误差。然后,我们详细介绍我们的深度强化学习算法,通过借鉴表格类比,提出了批量约束深度 Q 学习(BCQ)。

4.1 Addressing Extrapolation Error in Finite MDPs
4.1 在有限 MDP 中解决外推误差

In the finite MDP setting, extrapolation error can be described by the bias from the mismatch between the transitions contained in the buffer and the true MDP. We find that by inducing a data distribution that is contained entirely within the batch, batch-constrained policies can eliminate extrapolation error entirely for deterministic MDPs. In addition, we show that the batch-constrained variant of Q-learning converges to the optimal policy under the same conditions as the standard form of Q-learning. Moreover, we prove that for a deterministic MDP, batch-constrained Q-learning is guaranteed to match, or outperform, the behavioral policy when starting from any state contained in the batch. All of the proofs for this section can be found in the Supplementary Material.
在有限 MDP 设置中,外推误差可以通过缓冲区中包含的转换与真实 MDP 之间的偏差来描述。我们发现,通过引入完全包含在批次中的数据分布,批量约束策略可以完全消除确定性 MDP 的外推误差。此外,我们展示了批量约束变体的 Q 学习在与标准形式的 Q 学习相同的条件下收敛到最优策略。此外,我们证明了对于一个确定性 MDP,批量约束 Q 学习保证能够匹配或超越批次中包含的任何状态的行为策略。本节的所有证明都可以在补充材料中找到。

A value estimate Q𝑄Q can be learned using an experience replay buffer \mathcal{B}. This involves sampling transition tuples (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) with uniform probability, and applying the temporal difference update (Sutton, 1988; Watkins, 1989):
可以使用经验回放缓冲区 \mathcal{B} 学习价值估计 Q𝑄Q 。这涉及到以均匀概率抽样转换元组 (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) ,并应用时序差分更新(Sutton, 1988; Watkins, 1989):

Q(s,a)(1α)Q(s,a)+α(r+γQ(s,π(s))).𝑄𝑠𝑎1𝛼𝑄𝑠𝑎𝛼𝑟𝛾𝑄superscript𝑠𝜋superscript𝑠Q(s,a)\leftarrow(1-\alpha)Q(s,a)+\alpha\left(r+\gamma Q(s^{\prime},\pi(s^{\prime}))\right). (6)

If π(s)=argmaxaQ(s,a)𝜋superscript𝑠subscriptargmaxsuperscript𝑎𝑄superscript𝑠superscript𝑎\pi(s^{\prime})=\operatorname*{argmax}_{a^{\prime}}Q(s^{\prime},a^{\prime}), this is known as Q-learning. Assuming a non-zero probability of sampling any possible transition tuple from the buffer and infinite updates, Q-learning converges to the optimal value function.
如果 π(s)=argmaxaQ(s,a)𝜋superscript𝑠subscriptargmaxsuperscript𝑎𝑄superscript𝑠superscript𝑎\pi(s^{\prime})=\operatorname*{argmax}_{a^{\prime}}Q(s^{\prime},a^{\prime}) ,这被称为 Q 学习。假设从缓冲区抽样任何可能的转换元组的非零概率和无限次更新,Q 学习收敛到最优价值函数。

We begin by showing that the value function Q𝑄Q learned with the batch \mathcal{B} corresponds to the value function for an alternate MDP Msubscript𝑀M_{\mathcal{B}}. From the true MDP M𝑀M and initial values Q(s,a)𝑄𝑠𝑎Q(s,a), we define the new MDP Msubscript𝑀M_{\mathcal{B}} with the same action and state space as M𝑀M, along with an additional terminal state sinitsubscript𝑠inits_{\text{init}}. Msubscript𝑀M_{\mathcal{B}} has transition probabilities p(s|s,a)=N(s,a,s)s~N(s,a,s~)subscript𝑝conditionalsuperscript𝑠𝑠𝑎𝑁𝑠𝑎superscript𝑠subscript~𝑠𝑁𝑠𝑎~𝑠p_{\mathcal{B}}(s^{\prime}|s,a)=\frac{N(s,a,s^{\prime})}{\sum_{\tilde{s}}N(s,a,\tilde{s})}, where N(s,a,s)𝑁𝑠𝑎superscript𝑠N(s,a,s^{\prime}) is the number of times the tuple (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime}) is observed in \mathcal{B}. If s~N(s,a,s~)=0subscript~𝑠𝑁𝑠𝑎~𝑠0\sum_{\tilde{s}}N(s,a,\tilde{s})=0, then p(sinit|s,a)=1subscript𝑝conditionalsubscript𝑠init𝑠𝑎1p_{\mathcal{B}}(s_{\text{init}}|s,a)=1, where r(s,a,sinit)𝑟𝑠𝑎subscript𝑠initr(s,a,s_{\text{init}}) is set to the initialized value of Q(s,a)𝑄𝑠𝑎Q(s,a).
我们首先展示用批次 \mathcal{B} 学习到的价值函数 Q𝑄Q 对应于另一个 MDP Msubscript𝑀M_{\mathcal{B}} 的价值函数。从真实的 MDP M𝑀M 和初始值 Q(s,a)𝑄𝑠𝑎Q(s,a) 出发,我们定义了新的 MDP Msubscript𝑀M_{\mathcal{B}} ,它与 M𝑀M 有相同的动作和状态空间,并增加了一个额外的终止状态 sinitsubscript𝑠inits_{\text{init}}Msubscript𝑀M_{\mathcal{B}} 的转移概率为 p(s|s,a)=N(s,a,s)s~N(s,a,s~)subscript𝑝conditionalsuperscript𝑠𝑠𝑎𝑁𝑠𝑎superscript𝑠subscript~𝑠𝑁𝑠𝑎~𝑠p_{\mathcal{B}}(s^{\prime}|s,a)=\frac{N(s,a,s^{\prime})}{\sum_{\tilde{s}}N(s,a,\tilde{s})} ,其中 N(s,a,s)𝑁𝑠𝑎superscript𝑠N(s,a,s^{\prime}) 是在 \mathcal{B} 中观察到元组 (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime}) 的次数。如果 s~N(s,a,s~)=0subscript~𝑠𝑁𝑠𝑎~𝑠0\sum_{\tilde{s}}N(s,a,\tilde{s})=0 ,那么 p(sinit|s,a)=1subscript𝑝conditionalsubscript𝑠init𝑠𝑎1p_{\mathcal{B}}(s_{\text{init}}|s,a)=1 ,其中 r(s,a,sinit)𝑟𝑠𝑎subscript𝑠initr(s,a,s_{\text{init}}) 被设置为 Q(s,a)𝑄𝑠𝑎Q(s,a) 的初始值。

Theorem 1. Performing Q-learning by sampling from a batch \mathcal{B} converges to the optimal value function under the MDP Msubscript𝑀M_{\mathcal{B}}.
定理 1:通过从批次 \mathcal{B} 中采样执行 Q 学习,在 MDP Msubscript𝑀M_{\mathcal{B}} 下收敛到最优值函数。

We define ϵMDPsubscriptitalic-ϵMDP\epsilon_{\text{MDP}} as the tabular extrapolation error, which accounts for the discrepancy between the value function Qπsubscriptsuperscript𝑄𝜋Q^{\pi}_{\mathcal{B}} computed with the batch \mathcal{B} and the value function Qπsuperscript𝑄𝜋Q^{\pi} computed with the true MDP M𝑀M:
我们将 ϵMDPsubscriptitalic-ϵMDP\epsilon_{\text{MDP}} 定义为表格外推误差,它解释了用批次 \mathcal{B} 计算的值函数 Qπsubscriptsuperscript𝑄𝜋Q^{\pi}_{\mathcal{B}} 与用真实 MDP M𝑀M 计算的值函数 Qπsuperscript𝑄𝜋Q^{\pi} 之间的差异:

ϵMDP(s,a)=Qπ(s,a)Qπ(s,a).subscriptitalic-ϵMDP𝑠𝑎superscript𝑄𝜋𝑠𝑎subscriptsuperscript𝑄𝜋𝑠𝑎\epsilon_{\text{MDP}}(s,a)=Q^{\pi}(s,a)-Q^{\pi}_{\mathcal{B}}(s,a). (7)

For any policy π𝜋\pi, the exact form of ϵMDP(s,a)subscriptitalic-ϵMDP𝑠𝑎\epsilon_{\text{MDP}}(s,a) can be computed through a Bellman-like equation:
对于任何策略 π𝜋\piϵMDP(s,a)subscriptitalic-ϵMDP𝑠𝑎\epsilon_{\text{MDP}}(s,a) 的确切形式可以通过类似贝尔曼的方程计算出来:

ϵMDP(s,a)=s(pM(s|s,a)p(s|s,a))(r(s,a,s)+γaπ(a|s)Qπ(s,a))+pM(s|s,a)γaπ(a|s)ϵMDP(s,a).subscriptitalic-ϵMDP𝑠𝑎subscriptsuperscript𝑠subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎subscript𝑝conditionalsuperscript𝑠𝑠𝑎𝑟𝑠𝑎superscript𝑠𝛾subscriptsuperscript𝑎𝜋conditionalsuperscript𝑎superscript𝑠subscriptsuperscript𝑄𝜋superscript𝑠superscript𝑎subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎𝛾subscriptsuperscript𝑎𝜋conditionalsuperscript𝑎superscript𝑠subscriptitalic-ϵMDPsuperscript𝑠superscript𝑎\begin{split}\epsilon_{\text{MDP}}(s,a)~{}&=\sum_{s^{\prime}}\left(p_{M}(s^{\prime}|s,a)-p_{\mathcal{B}}(s^{\prime}|s,a)\right)\\ &\left(r(s,a,s^{\prime})+\gamma\sum_{a^{\prime}}\pi(a^{\prime}|s^{\prime})Q^{\pi}_{\mathcal{B}}(s^{\prime},a^{\prime})\right)\\ &+p_{M}(s^{\prime}|s,a)\gamma\sum_{a^{\prime}}\pi(a^{\prime}|s^{\prime})\epsilon_{\text{MDP}}(s^{\prime},a^{\prime}).\end{split} (8)

This means extrapolation error is a function of divergence in the transition distributions, weighted by value, along with the error at succeeding states. If the policy is chosen carefully, the error between value functions can be minimized by visiting regions where the transition distributions are similar. For simplicity, we denote
这意味着外推误差是转移分布差异的函数,按价值加权,连同在后续状态的误差。如果策略选择得当,通过访问转移分布相似的区域,可以最小化价值函数之间的误差。为了简单起见,我们表示

ϵMDPπ=sμπ(s)aπ(a|s)|ϵMDP(s,a)|.subscriptsuperscriptitalic-ϵ𝜋MDPsubscript𝑠subscript𝜇𝜋𝑠subscript𝑎𝜋conditional𝑎𝑠subscriptitalic-ϵMDP𝑠𝑎\epsilon^{\pi}_{\text{MDP}}=\sum_{s}\mu_{\pi}(s)\sum_{a}\pi(a|s)|\epsilon_{\text{MDP}}(s,a)|. (9)

To evaluate a policy π𝜋\pi exactly at relevant state-action pairs, only ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 is required. We can then determine the condition required to evaluate the exact expected return of a policy without extrapolation error.
要准确评估策略 π𝜋\pi 在相关状态-动作对上,只需要 ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 。然后我们可以确定评估策略的确切预期回报而无需外推误差的所需条件。

Lemma 1. For all reward functions, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 if and only if p(s|s,a)=pM(s|s,a)subscript𝑝conditionalsuperscript𝑠𝑠𝑎subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎p_{\mathcal{B}}(s^{\prime}|s,a)=p_{M}(s^{\prime}|s,a) for all s𝒮superscript𝑠𝒮s^{\prime}\in\mathcal{S} and (s,a)𝑠𝑎(s,a) such that μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0 and π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0.
引理 1:对于所有奖励函数,当且仅当对于所有的 s𝒮superscript𝑠𝒮s^{\prime}\in\mathcal{S}(s,a)𝑠𝑎(s,a) ,满足 μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0 时, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 当且仅当 p(s|s,a)=pM(s|s,a)subscript𝑝conditionalsuperscript𝑠𝑠𝑎subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎p_{\mathcal{B}}(s^{\prime}|s,a)=p_{M}(s^{\prime}|s,a)

Lemma 1 states that if Msubscript𝑀M_{\mathcal{B}} and M𝑀M exhibit the same transition probabilities in regions of relevance, the policy can be accurately evaluated. For a stochastic MDP this may require an infinite number of samples to converge to the true distribution, however, for a deterministic MDP this requires only a single transition. This means a policy which only traverses transitions contained in the batch, can be evaluated without error. More formally, we denote a policy πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}} as batch-constrained if for all (s,a)𝑠𝑎(s,a) where μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0 and π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0 then (s,a)𝑠𝑎(s,a)\in\mathcal{B}. Additionally, we define a batch \mathcal{B} as coherent if for all (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B} then ssuperscript𝑠s^{\prime}\in\mathcal{B} unless ssuperscript𝑠s^{\prime} is a terminal state. This condition is trivially satisfied if the data is collected in trajectories, or if all possible states are contained in the batch. With a coherent batch, we can guarantee the existence of a batch-constrained policy.
引理 1 表明,如果 Msubscript𝑀M_{\mathcal{B}}M𝑀M 在相关区域展示出相同的转移概率,那么策略可以被准确评估。对于随机的 MDP,这可能需要无限多的样本才能收敛到真实分布,然而,对于确定性的 MDP,这只需要一个单一转移。这意味着,一个仅遍历批次中包含的转移的策略,可以无误差地被评估。更正式地,我们将一个策略 πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}} 定义为批次约束的,如果对于所有 (s,a)𝑠𝑎(s,a) ,满足 μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0 ,则 (s,a)𝑠𝑎(s,a)\in\mathcal{B} 。此外,我们定义一个批次 \mathcal{B} 为连贯的,如果对于所有 (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B} ,则 ssuperscript𝑠s^{\prime}\in\mathcal{B} ,除非 ssuperscript𝑠s^{\prime} 是一个终止状态。如果数据是按轨迹收集的,或者批次包含了所有可能的状态,那么这个条件就会被轻易满足。有了一个连贯的批次,我们可以保证存在一个批次约束的策略。

Theorem 2. For a deterministic MDP and all reward functions, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 if and only if the policy π𝜋\pi is batch-constrained. Furthermore, if \mathcal{B} is coherent, then such a policy must exist if the start state s0subscript𝑠0s_{0}\in\mathcal{B}.
定理 2:对于一个确定性的 MDP 和所有奖励函数,当且仅当策略 π𝜋\pi 是批量约束的时, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 成立。此外,如果 \mathcal{B} 是连贯的,那么如果起始状态 s0subscript𝑠0s_{0}\in\mathcal{B} ,则必须存在这样的策略。

Batch-constrained policies can be used in conjunction with Q-learning to form batch-constrained Q-learning (BCQL), which follows the standard tabular Q-learning update while constraining the possible actions with respect to the batch:
批量约束策略可以与 Q 学习结合使用,形成批量约束 Q 学习(BCQL),它遵循标准的表格 Q 学习更新,同时根据批量约束可能的行动:

Q(s,a)(1α)Q(s,a)+α(r+γmaxas.t.(s,a)Q(s,a)).𝑄𝑠𝑎1𝛼𝑄𝑠𝑎𝛼𝑟𝛾subscriptsuperscript𝑎s.t.superscript𝑠superscript𝑎𝑄superscript𝑠superscript𝑎Q(s,a)\leftarrow(1-\alpha)Q(s,a)+\alpha(r+\gamma\max_{a^{\prime}\text{s.t.}(s^{\prime},a^{\prime})\in\mathcal{B}}Q(s^{\prime},a^{\prime})). (10)

BCQL converges under the same conditions as the standard form of Q-learning, noting the batch-constraint is nonrestrictive given infinite state-action visitation.
在与标准 Q 学习相同的条件下,BCQL 收敛,注意到在无限状态-动作访问下,批量约束是非限制性的。

Theorem 3. Given the Robbins-Monro stochastic convergence conditions on the learning rate α𝛼\alpha, and standard sampling requirements from the environment, BCQL converges to the optimal value function Qsuperscript𝑄Q^{*}.
定理 3:给定学习率 α𝛼\alpha 上的 Robbins-Monro 随机收敛条件,以及来自环境的标准采样要求,BCQL 收敛到最优值函数 Qsuperscript𝑄Q^{*}

The more interesting property of BCQL is that for a deterministic MDP and any coherent batch \mathcal{B}, BCQL converges to the optimal batch-constrained policy πΠsuperscript𝜋subscriptΠ\pi^{*}\in\Pi_{\mathcal{B}} such that Qπ(s,a)Qπ(s,a)superscript𝑄superscript𝜋𝑠𝑎superscript𝑄𝜋𝑠𝑎Q^{\pi^{*}}(s,a)\geq Q^{\pi}(s,a) for all πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}} and (s,a)𝑠𝑎(s,a)\in\mathcal{B}.
BCQL 的一个更有趣的特性是,对于一个确定性的 MDP 和任何一致的批次 \mathcal{B} ,BCQL 会收敛到最优的批次约束策略 πΠsuperscript𝜋subscriptΠ\pi^{*}\in\Pi_{\mathcal{B}} ,使得对所有 πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}}(s,a)𝑠𝑎(s,a)\in\mathcal{B}

Theorem 4. Given a deterministic MDP and coherent batch \mathcal{B}, along with the Robbins-Monro stochastic convergence conditions on the learning rate α𝛼\alpha and standard sampling requirements on the batch \mathcal{B}, BCQL converges to Qπ(s,a)subscriptsuperscript𝑄𝜋𝑠𝑎Q^{\pi}_{\mathcal{B}}(s,a) where π(s)=argmaxa s.t.(s,a)Qπ(s,a)superscript𝜋𝑠subscriptargmax𝑎 s.t.𝑠𝑎subscriptsuperscript𝑄𝜋𝑠𝑎\pi^{*}(s)=\operatorname*{argmax}_{a\text{ s.t.}(s,a)\in\mathcal{B}}Q^{\pi}_{\mathcal{B}}(s,a) is the optimal batch-constrained policy.
定理 4。给定一个确定性的 MDP 和一致的批次 \mathcal{B} ,以及对学习率 α𝛼\alpha 的 Robbins-Monro 随机收敛条件和对批次 \mathcal{B} 的标准采样要求,BCQL 收敛到 Qπ(s,a)subscriptsuperscript𝑄𝜋𝑠𝑎Q^{\pi}_{\mathcal{B}}(s,a) ,其中 π(s)=argmaxa s.t.(s,a)Qπ(s,a)superscript𝜋𝑠subscriptargmax𝑎 s.t.𝑠𝑎subscriptsuperscript𝑄𝜋𝑠𝑎\pi^{*}(s)=\operatorname*{argmax}_{a\text{ s.t.}(s,a)\in\mathcal{B}}Q^{\pi}_{\mathcal{B}}(s,a) 是最优的批次约束策略。

This means that BCQL is guaranteed to outperform any behavioral policy when starting from any state contained in the batch, effectively outperforming imitation learning. Unlike standard Q-learning, there is no condition on state-action visitation, other than coherency in the batch.
这意味着 BCQL 保证在从批次中包含的任何状态开始时,都会胜过任何行为策略,有效地胜过模仿学习。与标准的 Q 学习不同,除了批次的一致性外,没有对状态-动作访问的条件。

4.2 Batch-Constrained Deep Reinforcement Learning
4.2 批量约束深度强化学习

We introduce our approach to off-policy batch reinforcement learning, Batch-Constrained deep Q-learning (BCQ). BCQ approaches the notion of batch-constrained through a generative model. For a given state, BCQ generates plausible candidate actions with high similarity to the batch, and then selects the highest valued action through a learned Q-network. Furthermore, we bias this value estimate to penalize rare, or unseen, states through a modification to Clipped Double Q-learning (Fujimoto et al., 2018). As a result, BCQ learns a policy with a similar state-action visitation to the data in the batch, as inspired by the theoretical benefits of its tabular counterpart.
我们介绍了我们对离线批量强化学习的方法,批量约束深度 Q 学习(BCQ)。BCQ 通过一个生成模型来处理批量约束的概念。对于给定的状态,BCQ 生成与批量高度相似的合理候选动作,然后通过学习到的 Q 网络选择最高价值的动作。此外,我们通过对剪切双 Q 学习(Fujimoto 等,2018)的修改,对这一价值估计进行偏差,以惩罚罕见或未见的状态。因此,受其表格式对应物的理论优势的启发,BCQ 学习了一种与批量数据中的状态-动作访问相似的策略。

To maintain the notion of batch-constraint, we define a similarity metric by making the assumption that for a given state s𝑠s, the similarity between (s,a)𝑠𝑎(s,a) and the state-action pairs in the batch \mathcal{B} can be modelled using a learned state-conditioned marginal likelihood PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s). In this case, it follows that the policy maximizing PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) would minimize the error induced by extrapolation from distant, or unseen, state-action pairs, by only selecting the most likely actions in the batch with respect to a given state. Given the difficulty of estimating PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) in high-dimensional continuous spaces, we instead train a parametric generative model of the batch Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s), which we can sample actions from, as a reasonable approximation to argmaxaPG(a|s)subscriptargmax𝑎superscriptsubscript𝑃𝐺conditional𝑎𝑠\operatorname*{argmax}_{a}P_{\mathcal{B}}^{G}(a|s).
为了维持批次约束的概念,我们通过假设对于给定状态 s𝑠s(s,a)𝑠𝑎(s,a) 与批次 \mathcal{B} 中的状态-动作对之间的相似性可以使用学习到的状态条件边缘似然 PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) 来建模来定义一个相似性度量。在这种情况下,可以推断出,最大化 PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) 的策略将通过仅选择批次中相对于给定状态最可能的动作,来最小化由远距离或未见状态-动作对引起的外推误差。鉴于在高维连续空间中估计 PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) 的难度,我们改为训练批次 Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s) 的参数化生成模型,我们可以从中采样动作,作为对 argmaxaPG(a|s)subscriptargmax𝑎superscriptsubscript𝑃𝐺conditional𝑎𝑠\operatorname*{argmax}_{a}P_{\mathcal{B}}^{G}(a|s) 的合理近似。

For our generative model we use a conditional variational auto-encoder (VAE) (Kingma & Welling, 2013; Sohn et al., 2015), which models the distribution by transforming an underlying latent space111See the Supplementary Material for an introduction to VAEs.. The generative model Gωsubscript𝐺𝜔G_{\omega}, alongside the value function Qθsubscript𝑄𝜃Q_{\theta}, can be used as a policy by sampling n𝑛n actions from Gωsubscript𝐺𝜔G_{\omega} and selecting the highest valued action according to the value estimate Qθsubscript𝑄𝜃Q_{\theta}. To increase the diversity of seen actions, we introduce a perturbation model ξϕ(s,a,Φ)subscript𝜉italic-ϕ𝑠𝑎Φ\xi_{\phi}(s,a,\Phi), which outputs an adjustment to an action a𝑎a in the range [Φ,Φ]ΦΦ[-\Phi,\Phi]. This enables access to actions in a constrained region, without having to sample from the generative model a prohibitive number of times. This results in the policy π𝜋\pi:
对于我们的生成模型,我们使用了一个条件变分自编码器(VAE)(Kingma & Welling, 2013; Sohn et al., 2015),它通过转换一个潜在空间 1 来建模分布。生成模型 Gωsubscript𝐺𝜔G_{\omega} ,连同价值函数 Qθsubscript𝑄𝜃Q_{\theta} ,可以通过从 Gωsubscript𝐺𝜔G_{\omega} 采样 n𝑛n 动作并根据价值估计 Qθsubscript𝑄𝜃Q_{\theta} 选择最高价值动作作为策略。为了增加所见动作的多样性,我们引入了一个扰动模型 ξϕ(s,a,Φ)subscript𝜉italic-ϕ𝑠𝑎Φ\xi_{\phi}(s,a,\Phi) ,它输出一个在范围 [Φ,Φ]ΦΦ[-\Phi,\Phi] 内对动作 a𝑎a 的调整。这使得在不需要从生成模型中抽样过多次的情况下,访问一个受限区域内的动作成为可能。这导致了策略 π𝜋\pi

π(s)=argmaxai+ξϕ(s,ai,Φ)Qθ(s,ai+ξϕ(s,ai,Φ)),{aiGω(s)}i=1n.𝜋𝑠subscriptargmaxsubscript𝑎𝑖subscript𝜉italic-ϕ𝑠subscript𝑎𝑖Φsubscript𝑄𝜃𝑠subscript𝑎𝑖subscript𝜉italic-ϕ𝑠subscript𝑎𝑖Φsuperscriptsubscriptsimilar-tosubscript𝑎𝑖subscript𝐺𝜔𝑠𝑖1𝑛\begin{split}\pi(s)=&\operatorname*{argmax}_{a_{i}+\xi_{\phi}(s,a_{i},\Phi)}Q_{\theta}(s,a_{i}+\xi_{\phi}(s,a_{i},\Phi)),\\ &\{a_{i}\sim G_{\omega}(s)\}_{i=1}^{n}.\end{split} (11)

The choice of n𝑛n and ΦΦ\Phi creates a trade-off between an imitation learning and reinforcement learning algorithm. If Φ=0Φ0\Phi=0, and the number of sampled actions n=1𝑛1n=1, then the policy resembles behavioral cloning and as ΦamaxaminΦsubscript𝑎maxsubscript𝑎min\Phi\rightarrow a_{\text{max}}-a_{\text{min}} and n𝑛n\rightarrow\infty, then the algorithm approaches Q-learning, as the policy begins to greedily maximize the value function over the entire action space.
n𝑛nΦΦ\Phi 的选择在模仿学习和强化学习算法之间创造了一种权衡。如果 Φ=0Φ0\Phi=0 ,且采样动作的数量 n=1𝑛1n=1 ,那么策略类似于行为克隆;随着 ΦamaxaminΦsubscript𝑎maxsubscript𝑎min\Phi\rightarrow a_{\text{max}}-a_{\text{min}}n𝑛n\rightarrow\infty ,算法则趋近于 Q 学习,因为策略开始贪婪地在整个动作空间上最大化价值函数。

Algorithm 1 BCQ 算法 1 BCQ
  Input: Batch \mathcal{B}, horizon T𝑇T, target network update rate τ𝜏\tau, mini-batch size N𝑁N, max perturbation ΦΦ\Phi, number of sampled actions n𝑛n, minimum weighting λ𝜆\lambda.
输入:批次 \mathcal{B} ,视界 T𝑇T ,目标网络更新率 τ𝜏\tau ,小批量大小 N𝑁N ,最大扰动 ΦΦ\Phi ,采样动作数量 n𝑛n ,最小加权 λ𝜆\lambda
  Initialize Q-networks Qθ1,Qθ2subscript𝑄subscript𝜃1subscript𝑄subscript𝜃2Q_{\theta_{1}},Q_{\theta_{2}}, perturbation network ξϕsubscript𝜉italic-ϕ\xi_{\phi}, and VAE Gω={Eω1,Dω2}subscript𝐺𝜔subscript𝐸subscript𝜔1subscript𝐷subscript𝜔2G_{\omega}=\{E_{\omega_{1}},D_{\omega_{2}}\}, with random parameters θ1subscript𝜃1\theta_{1}, θ2subscript𝜃2\theta_{2}, ϕitalic-ϕ\phi, ω𝜔\omega, and target networks Qθ1,Qθ2subscript𝑄subscriptsuperscript𝜃1subscript𝑄subscriptsuperscript𝜃2Q_{\theta^{\prime}_{1}},Q_{\theta^{\prime}_{2}}, ξϕsubscript𝜉superscriptitalic-ϕ\xi_{\phi^{\prime}} with θ1θ1,θ2θ2formulae-sequencesubscriptsuperscript𝜃1subscript𝜃1subscriptsuperscript𝜃2subscript𝜃2\theta^{\prime}_{1}\leftarrow\theta_{1},\theta^{\prime}_{2}\leftarrow\theta_{2}, ϕϕsuperscriptitalic-ϕitalic-ϕ\phi^{\prime}\leftarrow\phi.
初始化 Q-网络 Qθ1,Qθ2subscript𝑄subscript𝜃1subscript𝑄subscript𝜃2Q_{\theta_{1}},Q_{\theta_{2}} ,扰动网络 ξϕsubscript𝜉italic-ϕ\xi_{\phi} ,和 VAE Gω={Eω1,Dω2}subscript𝐺𝜔subscript𝐸subscript𝜔1subscript𝐷subscript𝜔2G_{\omega}=\{E_{\omega_{1}},D_{\omega_{2}}\} ,带有随机参数 θ1subscript𝜃1\theta_{1}θ2subscript𝜃2\theta_{2}ϕitalic-ϕ\phiω𝜔\omega ,和目标网络 Qθ1,Qθ2subscript𝑄subscriptsuperscript𝜃1subscript𝑄subscriptsuperscript𝜃2Q_{\theta^{\prime}_{1}},Q_{\theta^{\prime}_{2}}ξϕsubscript𝜉superscriptitalic-ϕ\xi_{\phi^{\prime}}θ1θ1,θ2θ2formulae-sequencesubscriptsuperscript𝜃1subscript𝜃1subscriptsuperscript𝜃2subscript𝜃2\theta^{\prime}_{1}\leftarrow\theta_{1},\theta^{\prime}_{2}\leftarrow\theta_{2}ϕϕsuperscriptitalic-ϕitalic-ϕ\phi^{\prime}\leftarrow\phi
  for t=1𝑡1t=1 to T𝑇T do
对于 t=1𝑡1t=1T𝑇T 执行
     Sample mini-batch of N𝑁N transitions (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) from \mathcal{B}
\mathcal{B} 中抽取 N𝑁N 转换的样本小批量 (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime})
     μ,σ=Eω1(s,a),a~=Dω2(s,z),z𝒩(μ,σ)formulae-sequence𝜇𝜎subscript𝐸subscript𝜔1𝑠𝑎formulae-sequence~𝑎subscript𝐷subscript𝜔2𝑠𝑧similar-to𝑧𝒩𝜇𝜎\mu,\sigma=E_{\omega_{1}}(s,a),\quad\tilde{a}=D_{\omega_{2}}(s,z),\quad z\sim\mathcal{N}(\mu,\sigma)
     ωargminω(aa~)2+DKL(𝒩(μ,σ)||𝒩(0,1))\omega\leftarrow\operatorname*{argmin}_{\omega}\sum(a-\tilde{a})^{2}+D_{\text{KL}}(\mathcal{N}(\mu,\sigma)||\mathcal{N}(0,1))
     Sample n𝑛n actions: {aiGω(s)}i=1nsuperscriptsubscriptsimilar-tosubscript𝑎𝑖subscript𝐺𝜔superscript𝑠𝑖1𝑛\{a_{i}\sim G_{\omega}(s^{\prime})\}_{i=1}^{n}
抽样 n𝑛n 动作: {aiGω(s)}i=1nsuperscriptsubscriptsimilar-tosubscript𝑎𝑖subscript𝐺𝜔superscript𝑠𝑖1𝑛\{a_{i}\sim G_{\omega}(s^{\prime})\}_{i=1}^{n}
     Perturb each action: {ai=ai+ξϕ(s,ai,Φ)}i=1nsuperscriptsubscriptsubscript𝑎𝑖subscript𝑎𝑖subscript𝜉italic-ϕsuperscript𝑠subscript𝑎𝑖Φ𝑖1𝑛\{a_{i}=a_{i}+\xi_{\phi}(s^{\prime},a_{i},\Phi)\}_{i=1}^{n}
扰动每个动作: {ai=ai+ξϕ(s,ai,Φ)}i=1nsuperscriptsubscriptsubscript𝑎𝑖subscript𝑎𝑖subscript𝜉italic-ϕsuperscript𝑠subscript𝑎𝑖Φ𝑖1𝑛\{a_{i}=a_{i}+\xi_{\phi}(s^{\prime},a_{i},\Phi)\}_{i=1}^{n}
     Set value target y𝑦y (Eqn. 13)
设置值目标 y𝑦y (方程 13)
     θargminθ(yQθ(s,a))2𝜃subscriptargmin𝜃superscript𝑦subscript𝑄𝜃𝑠𝑎2\theta\leftarrow\operatorname*{argmin}_{\theta}\sum(y-Q_{\theta}(s,a))^{2}
     ϕargmaxϕQθ1(s,a+ξϕ(s,a,Φ)),aGω(s)formulae-sequenceitalic-ϕsubscriptargmaxitalic-ϕsubscript𝑄subscript𝜃1𝑠𝑎subscript𝜉italic-ϕ𝑠𝑎Φsimilar-to𝑎subscript𝐺𝜔𝑠\phi\leftarrow\operatorname*{argmax}_{\phi}\sum Q_{\theta_{1}}(s,a+\xi_{\phi}(s,a,\Phi)),a\sim G_{\omega}(s)
     Update target networks: θiτθ+(1τ)θisubscriptsuperscript𝜃𝑖𝜏𝜃1𝜏subscriptsuperscript𝜃𝑖\theta^{\prime}_{i}\leftarrow\tau\theta+(1-\tau)\theta^{\prime}_{i}
更新目标网络: θiτθ+(1τ)θisubscriptsuperscript𝜃𝑖𝜏𝜃1𝜏subscriptsuperscript𝜃𝑖\theta^{\prime}_{i}\leftarrow\tau\theta+(1-\tau)\theta^{\prime}_{i}
     ϕτϕ+(1τ)ϕsuperscriptitalic-ϕ𝜏italic-ϕ1𝜏superscriptitalic-ϕ\phi^{\prime}\leftarrow\tau\phi+(1-\tau)\phi^{\prime}
  end for 结束

The perturbation model ξϕsubscript𝜉italic-ϕ\xi_{\phi} can be trained to maximize Qθ(s,a)subscript𝑄𝜃𝑠𝑎Q_{\theta}(s,a) through the deterministic policy gradient algorithm (Silver et al., 2014) by sampling aGω(s)similar-to𝑎subscript𝐺𝜔𝑠a\sim G_{\omega}(s):
通过采样 aGω(s)similar-to𝑎subscript𝐺𝜔𝑠a\sim G_{\omega}(s) ,可以训练扰动模型 ξϕsubscript𝜉italic-ϕ\xi_{\phi} 通过确定性策略梯度算法(Silver 等人,2014 年)来最大化 Qθ(s,a)subscript𝑄𝜃𝑠𝑎Q_{\theta}(s,a)

ϕargmaxϕ(s,a)Qθ(s,a+ξϕ(s,a,Φ)).italic-ϕsubscriptargmaxitalic-ϕsubscript𝑠𝑎subscript𝑄𝜃𝑠𝑎subscript𝜉italic-ϕ𝑠𝑎Φ\phi\leftarrow\operatorname*{argmax}_{\phi}\sum_{(s,a)\in\mathcal{B}}Q_{\theta}(s,a+\xi_{\phi}(s,a,\Phi)). (12)

To penalize uncertainty over future states, we modify Clipped Double Q-learning (Fujimoto et al., 2018), which estimates the value by taking the minimum between two Q-networks {Qθ1,Qθ2}subscript𝑄subscript𝜃1subscript𝑄subscript𝜃2\{Q_{\theta_{1}},Q_{\theta_{2}}\}. Although originally used as a countermeasure to overestimation bias (Thrun & Schwartz, 1993; Van Hasselt, 2010), the minimum operator also penalizes high variance estimates in regions of uncertainty, and pushes the policy to favor actions which lead to states contained in the batch. In particular, we take a convex combination of the two values, with a higher weight on the minimum, to form a learning target which is used by both Q-networks:
为了惩罚对未来状态的不确定性,我们修改了剪辑双 Q 学习(Fujimoto 等人,2018 年),它通过取两个 Q 网络 {Qθ1,Qθ2}subscript𝑄subscript𝜃1subscript𝑄subscript𝜃2\{Q_{\theta_{1}},Q_{\theta_{2}}\} 之间的最小值来估计价值。虽然最初用作过估计偏差的对策(Thrun & Schwartz,1993 年;Van Hasselt,2010 年),最小操作符也在不确定性区域惩罚高方差估计,并推动策略偏好导致批次中包含的状态的行为。特别是,我们取两个值的凸组合,对最小值给予更高的权重,以形成一个学习目标,该目标被两个 Q 网络使用:

r+γmaxai[λminj=1,2Qθj(s,ai)+(1λ)maxj=1,2Qθj(s,ai)]𝑟𝛾subscriptsubscript𝑎𝑖𝜆subscript𝑗12subscript𝑄subscriptsuperscript𝜃𝑗superscript𝑠subscript𝑎𝑖1𝜆subscript𝑗12subscript𝑄subscriptsuperscript𝜃𝑗superscript𝑠subscript𝑎𝑖r+\gamma\max_{a_{i}}\left[\lambda\min_{j=1,2}Q_{\theta^{\prime}_{j}}(s^{\prime},a_{i})+(1-\lambda)\max_{j=1,2}Q_{\theta^{\prime}_{j}}(s^{\prime},a_{i})\right] (13)

where aisubscript𝑎𝑖a_{i} corresponds to the perturbed actions, sampled from the generative model. If we set λ=1𝜆1\lambda=1, this update corresponds to Clipped Double Q-learning. We use this weighted minimum as the constrained updates produces less overestimation bias than a purely greedy policy update, and enables control over how heavily uncertainty at future time steps is penalized through the choice of λ𝜆\lambda.
其中 aisubscript𝑎𝑖a_{i} 对应于从生成模型中采样的扰动动作。如果我们设置 λ=1𝜆1\lambda=1 ,这种更新对应于剪切双 Q 学习。我们使用这种加权最小值作为受限更新,因为与纯粹的贪婪策略更新相比,它产生的过估计偏差更小,并且通过选择 λ𝜆\lambda 可以控制对未来时间步的不确定性的惩罚程度。

This forms Batch-Constrained deep Q-learning (BCQ), which maintains four parametrized networks: a generative model Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s), a perturbation model ξϕ(s,a)subscript𝜉italic-ϕ𝑠𝑎\xi_{\phi}(s,a), and two Q-networks Qθ1(s,a),Qθ2(s,a)subscript𝑄subscript𝜃1𝑠𝑎subscript𝑄subscript𝜃2𝑠𝑎Q_{\theta_{1}}(s,a),Q_{\theta_{2}}(s,a). We summarize BCQ in Algorithm 1. In the following section, we demonstrate BCQ results in stable value learning and a strong performance in the batch setting. Furthermore, we find that only a single choice of hyper-parameters is necessary for a wide range of tasks and environments.
这形成了批量约束深度 Q 学习(BCQ),它维护四个参数化网络:一个生成模型 Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s) ,一个扰动模型 ξϕ(s,a)subscript𝜉italic-ϕ𝑠𝑎\xi_{\phi}(s,a) ,以及两个 Q 网络 Qθ1(s,a),Qθ2(s,a)subscript𝑄subscript𝜃1𝑠𝑎subscript𝑄subscript𝜃2𝑠𝑎Q_{\theta_{1}}(s,a),Q_{\theta_{2}}(s,a) 。我们在算法 1 中总结了 BCQ。在接下来的部分中,我们将展示 BCQ 在稳定价值学习和批处理设置中的强大性能。此外,我们发现对于广泛的任务和环境,只需要一种超参数的选择。

5 Experiments 5 个实验

Refer to caption
Refer to caption
(a) Final buffer performance
(a)最终缓冲性能
Refer to caption
(b) Concurrent performance
Refer to caption
(c) Imitation performance
Refer to caption
(d) Imperfect demonstrations performance
Figure 2: We evaluate BCQ and several baselines on the experiments from Section 3.1, as well as the imperfect demonstrations task. The shaded area represents half a standard deviation. The bold black line measures the average return of episodes contained in the batch. Only BCQ matches or outperforms the performance of the behavioral policy in all tasks.

To evaluate the effectiveness of Batch-Constrained deep Q-learning (BCQ) in a high-dimensional setting, we focus on MuJoCo environments in OpenAI gym (Todorov et al., 2012; Brockman et al., 2016). For reproducibility, we make no modifications to the original environments or reward functions. We compare our method with DDPG (Lillicrap et al., 2015), DQN (Mnih et al., 2015) using an independently discretized action space, a feed-forward behavioral cloning method (BC), and a variant with a VAE (VAE-BC), using Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s) from BCQ. Exact implementation and experimental details are provided in the Supplementary Material.

Refer to caption
Refer to caption
(a) Final Buffer
(b) Concurrent
(c) Imitation
Figure 3: We examine the value estimates of BCQ, along with DDPG and DQN on the experiments from Section 3.1 in the Hopper-v1 environment. Each individual trial is plotted, with the mean in bold. An estimate of the true value of BCQ, evaluated by Monte Carlo returns, is marked by a dotted line. Unlike the state-of-the-art baselines, BCQ exhibits a highly stable value function in each task. Graphs for the other environments and imperfect demonstrations task can be found in the Supplementary Material.

We evaluate each method following the three experiments defined in Section 3.1. In final buffer the off-policy agents learn from the final replay buffer gathered by training a DDPG agent over a million time steps. In concurrent the off-policy agents learn concurrently, with the same replay buffer, as the behavioral DDPG policy, and in imitation, the agents learn from a dataset collected by an expert policy. Additionally, to study the robustness of BCQ to noisy and multi-modal data, we include an imperfect demonstrations task, in which the agents are trained with a batch of 100k transitions collected by an expert policy, with two sources of noise. The behavioral policy selects actions randomly with probability $0.3$ and with high exploratory noise $\mathcal{N}(0, 0.3)$ added to the remaining actions. The experimental results for these tasks are reported in Figure 2. Furthermore, the estimated values of BCQ, DDPG and DQN, and the true value of BCQ are displayed in Figure 3.
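A small sketch of this noisy data-collection procedure is given here, assuming a 1-D action space bounded in [-1, 1]; `expert_action` is a placeholder standing in for the expert policy.

import numpy as np

rng = np.random.default_rng(0)

def noisy_behavioral_action(state, expert_action, p_random=0.3, noise_std=0.3):
    if rng.random() < p_random:
        return float(rng.uniform(-1.0, 1.0))                    # fully random action
    noisy = expert_action(state) + rng.normal(0.0, noise_std)   # expert action plus exploratory noise
    return float(np.clip(noisy, -1.0, 1.0))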

Our approach, BCQ, is the only algorithm which succeeds at all tasks, matching or outperforming the behavioral policy in each instance, and outperforming all other agents, except in the imitation learning task where behavioral cloning unsurprisingly performs best. These results demonstrate that our algorithm can be used as a single approach for both imitation learning and off-policy reinforcement learning, with a single set of fixed hyper-parameters. Furthermore, unlike the deep reinforcement learning algorithms DDPG and DQN, BCQ exhibits a highly stable value function in the presence of off-policy samples, suggesting extrapolation error has been successfully mitigated through the batch-constraint. In the imperfect demonstrations task, we find that both deep reinforcement learning and imitation learning algorithms perform poorly. BCQ, however, is able to strongly outperform the noisy demonstrator, disentangling poor and expert actions. Furthermore, compared to current deep reinforcement learning algorithms, which can require millions of time steps (Duan et al., 2016; Henderson et al., 2017), BCQ attains a high performance in remarkably few iterations. This suggests our approach effectively leverages expert transitions, even in the presence of noise.

6 Related Work

Batch Reinforcement Learning. While batch reinforcement learning algorithms have been shown to be convergent with non-parametric function approximators such as averagers (Gordon, 1995) and kernel methods (Ormoneit & Sen, 2002), they make no guarantees on the quality of the policy without infinite data. Other batch algorithms, such as fitted Q-iteration, have used other function approximators, including decision trees (Ernst et al., 2005) and neural networks (Riedmiller, 2005), but come without convergence guarantees. Unlike many previous approaches to off-policy policy evaluation (Peshkin & Shelton, 2002; Thomas et al., 2015; Liu et al., 2018), our work focuses on constraining the policy to a subset of policies which can be adequately evaluated, rather than the process of evaluation itself. Additionally, off-policy algorithms which rely on importance sampling (Precup et al., 2001; Jiang & Li, 2016; Munos et al., 2016) may not be applicable in a batch setting, requiring access to the action probabilities under the behavioral policy, and scale poorly to multi-dimensional action spaces. Reinforcement learning with a replay buffer (Lin, 1992) can be considered a form of batch reinforcement learning, and is a standard tool for off-policy deep reinforcement learning algorithms (Mnih et al., 2015). It has been observed that a large replay buffer can be detrimental to performance (de Bruin et al., 2015; Zhang & Sutton, 2017) and the diversity of states in the buffer is an important factor for performance (de Bruin et al., 2016). Isele & Cosgun (2018) observed the performance of an agent was strongest when the distribution of data in the replay buffer matched the test distribution. These results defend the notion that extrapolation error is an important factor in the performance of off-policy reinforcement learning.

Imitation Learning. Imitation learning and its variants are well studied problems (Schaal, 1999; Argall et al., 2009; Hussein et al., 2017). Imitation has been combined with reinforcement learning, via learning from demonstrations methods (Kim et al., 2013; Piot et al., 2014; Chemali & Lazaric, 2015), with deep reinforcement learning extensions (Hester et al., 2017; Večerík et al., 2017), and modified policy gradient approaches (Ho et al., 2016; Sun et al., 2017; Cheng et al., 2018; Sun et al., 2018). While effective, these interactive methods are inadequate for batch reinforcement learning as they require either an explicit distinction between expert and non-expert data, further on-policy data collection or access to an oracle. Research in imitation, and inverse reinforcement learning, with robustness to noise is an emerging area (Evans, 2016; Nair et al., 2018), but relies on some form of expert data. Gao et al. (2018) introduced an imitation learning algorithm which learned from imperfect demonstrations, by favoring seen actions, but is limited to discrete actions. Our work also connects to residual policy learning (Johannink et al., 2018; Silver et al., 2018), where the initial policy is the generative model, rather than an expert or feedback controller.

Uncertainty in Reinforcement Learning. Uncertainty estimates in deep reinforcement learning have generally been used to encourage exploration (Dearden et al., 1998; Strehl & Littman, 2008; O’Donoghue et al., 2018; Azizzadenesheli et al., 2018). Other methods have examined approximating the Bayesian posterior of the value function (Osband et al., 2016, 2018; Touati et al., 2018), again using the variance to encourage exploration to unseen regions of the state space. In model-based reinforcement learning, uncertainty has been used for exploration, but also for the opposite effect–to push the policy towards regions of certainty in the model. This is used to combat the well-known problems with compounding model errors, and is present in policy search methods (Deisenroth & Rasmussen, 2011; Gal et al., 2016; Higuera et al., 2018; Xu et al., 2018), or combined with trajectory optimization (Chua et al., 2018) or value-based methods (Buckman et al., 2018). Our work connects to policy methods with conservative updates (Kakade & Langford, 2002), such as trust region (Schulman et al., 2015; Achiam et al., 2017; Pham et al., 2018) and information-theoretic methods (Peters & Mülling, 2010; Van Hoof et al., 2017), which aim to keep the updated policy similar to the previous policy. These methods avoid explicit uncertainty estimates, and rather force policy updates into a constrained range before collecting new data, limiting errors introduced by large changes in the policy. Similarly, our approach can be thought of as an off-policy variant, where the policy aims to be kept close, in output space, to any combination of the previous policies which performed data collection.

7 Conclusion

In this work, we demonstrate a critical problem in off-policy reinforcement learning with finite data, where the value target introduces error by including an estimate of unseen state-action pairs. This phenomenon, which we denote extrapolation error, has important implications for off-policy and batch reinforcement learning, as it is generally implausible to have complete state-action coverage in any practical setting. We present batch-constrained reinforcement learning–acting close to on-policy with respect to the available data, as an answer to extrapolation error. When extended to a deep reinforcement learning setting, our algorithm, Batch-Constrained deep Q-learning (BCQ), is the first continuous control algorithm capable of learning from arbitrary batch data, without exploration. Due to the importance of batch reinforcement learning for practical applications, we believe BCQ will be a strong foothold for future algorithms to build on, while furthering our understanding of the systematic risks in Q-learning (Thrun & Schwartz, 1993; Lu et al., 2018).

References

  • Achiam et al. (2017) Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, pp.  22–31, 2017.
  • Argall et al. (2009) Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • Azizzadenesheli et al. (2018) Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. Efficient exploration through bayesian deep q-networks. arXiv preprint arXiv:1802.04412, 2018.
  • Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena scientific Belmont, MA, 1996.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.
  • Buckman et al. (2018) Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8234–8244, 2018.
  • Chemali & Lazaric (2015) Chemali, J. and Lazaric, A. Direct policy iteration with demonstrations. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • Cheng et al. (2018) Cheng, C.-A., Yan, X., Wagener, N., and Boots, B. Fast policy learning through imitation and reinforcement. arXiv preprint arXiv:1805.10413, 2018.
  • Chua et al. (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, pp. 4759–4770, 2018.
  • Dayan & Watkins (1992) Dayan, P. and Watkins, C. J. C. H. Q-learning. Machine learning, 8(3):279–292, 1992.
  • de Bruin et al. (2015) de Bruin, T., Kober, J., Tuyls, K., and Babuška, R. The importance of experience replay database composition in deep reinforcement learning. In Deep Reinforcement Learning Workshop, NIPS, 2015.
  • de Bruin et al. (2016) de Bruin, T., Kober, J., Tuyls, K., and Babuška, R. Improved deep reinforcement learning for robotics through distribution-based experience retention. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  3947–3952. IEEE, 2016.
  • Dearden et al. (1998) Dearden, R., Friedman, N., and Russell, S. Bayesian q-learning. In AAAI/IAAI, pp.  761–768, 1998.
  • Deisenroth & Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pp. 465–472, 2011.
  • Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
  • Ernst et al. (2005) Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • Evans (2016) Evans, O. Learning the preferences of ignorant, inconsistent agents. In AAAI, pp.  323–329, 2016.
  • Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, volume 80, pp.  1587–1596. PMLR, 2018.
  • Gal et al. (2016) Gal, Y., McAllister, R., and Rasmussen, C. E. Improving pilco with bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, International Conference on Machine Learning, 2016.
  • Gao et al. (2018) Gao, Y., Lin, J., Yu, F., Levine, S., and Darrell, T. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.
  • Goodfellow et al. (2013) Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
  • Gordon (1995) Gordon, G. J. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp.  261–268. Elsevier, 1995.
  • Henderson et al. (2017) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep Reinforcement Learning that Matters. arXiv preprint arXiv:1709.06560, 2017.
  • Hester et al. (2017) Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
  • Higuera et al. (2018) Higuera, J. C. G., Meger, D., and Dudek, G. Synthesizing neural network controllers with probabilistic model based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018.
  • Ho et al. (2016) Ho, J., Gupta, J., and Ermon, S. Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769, 2016.
  • Hussein et al. (2017) Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.
  • Isele & Cosgun (2018) Isele, D. and Cosgun, A. Selective experience replay for lifelong learning. arXiv preprint arXiv:1802.10269, 2018.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jiang & Li (2016) Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.
  • Johannink et al. (2018) Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea, J. A., Solowjow, E., and Levine, S. Residual reinforcement learning for robot control. arXiv preprint arXiv:1812.03201, 2018.
  • Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pp.  267–274, 2002.
  • Kim et al. (2013) Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.
  • Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Konda & Tsitsiklis (2003) Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM journal on Control and Optimization, 42(4):1143–1166, 2003.
  • Lai & Robbins (1985) Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • Lange et al. (2012) Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement learning, pp.  45–73. Springer, 2012.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Lin (1992) Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
  • Liu et al. (2018) Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., and Brunskill, E. Representation balancing mdps for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2644–2653, 2018.
  • Lu et al. (2018) Lu, T., Schuurmans, D., and Boutilier, C. Non-delusional q-learning and value-iteration. In Advances in Neural Information Processing Systems, pp. 9971–9981, 2018.
  • McCloskey & Cohen (1989) McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.
  • Melo (2001) Melo, F. S. Convergence of q-learning: A simple proof. Institute Of Systems and Robotics, Tech. Rep, pp.  1–4, 2001.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
  • Nair et al. (2018) Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.  6292–6299. IEEE, 2018.
  • O’Donoghue et al. (2018) O’Donoghue, B., Osband, I., Munos, R., and Mnih, V. The uncertainty Bellman equation and exploration. In International Conference on Machine Learning, volume 80, pp.  3839–3848. PMLR, 2018.
  • Ormoneit & Sen (2002) Ormoneit, D. and Sen, Ś. Kernel-based reinforcement learning. Machine learning, 49(2-3):161–178, 2002.
  • Osband et al. (2016) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.
  • Osband et al. (2018) Osband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8626–8638, 2018.
  • Peshkin & Shelton (2002) Peshkin, L. and Shelton, C. R. Learning from scarce experience. In International Conference on Machine Learning, pp. 498–505, 2002.
  • Peters & Mülling (2010) Peters, J. and Mülling, K. Relative entropy policy search. In AAAI, pp.  1607–1612, 2010.
  • Pham et al. (2018) Pham, T.-H., De Magistris, G., Agravante, D. J., Chaudhury, S., Munawar, A., and Tachibana, R. Constrained exploration and recovery from experience shaping. arXiv preprint arXiv:1809.08925, 2018.
  • Piot et al. (2014) Piot, B., Geist, M., and Pietquin, O. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.  549–564. Springer, 2014.
  • Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pp. 417–424, 2001.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Riedmiller (2005) Riedmiller, M. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp.  317–328. Springer, 2005.
  • Schaal (1999) Schaal, S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
  • Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395, 2014.
  • Silver et al. (2018) Silver, T., Allen, K., Tenenbaum, J., and Kaelbling, L. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
  • Singh et al. (2000) Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning, 38(3):287–308, 2000.
  • Sohn et al. (2015) Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
  • Strehl & Littman (2008) Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Sun et al. (2017) Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, pp. 3309–3318, 2017.
  • Sun et al. (2018) Sun, W., Bagnell, J. A., and Boots, B. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • Thomas et al. (2015) Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015.
  • Thrun & Schwartz (1993) Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, 1993.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  5026–5033. IEEE, 2012.
  • Touati et al. (2018) Touati, A., Satija, H., Romoff, J., Pineau, J., and Vincent, P. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.
  • Van Hasselt (2010) Van Hasselt, H. Double q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.
  • Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, pp.  2094–2100, 2016.
  • Van Hoof et al. (2017) Van Hoof, H., Neumann, G., and Peters, J. Non-parametric policy search with limited information loss. The Journal of Machine Learning Research, 18(1):2472–2517, 2017.
  • Večerík et al. (2017) Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
  • Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
  • Xu et al. (2018) Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
  • Zhang & Sutton (2017) Zhang, S. and Sutton, R. S. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.

Appendix A Missing Proofs

A.1 Proofs and Details from Section 4.1

Definition 1. We define a coherent batch $\mathcal{B}$ as a batch such that if $(s,a,s') \in \mathcal{B}$ then $s' \in \mathcal{B}$ unless $s'$ is a terminal state.

Definition 2. We define $\epsilon_{\text{MDP}}(s,a) = Q^\pi(s,a) - Q^\pi_{\mathcal{B}}(s,a)$ as the error between the true value of a policy $\pi$ in the MDP $M$ and the value of $\pi$ when learned with a batch $\mathcal{B}$.

Definition 3. For simplicity in notation, we denote

\epsilon^\pi_{\text{MDP}} = \sum_s \mu_\pi(s) \sum_a \pi(a|s) \, |\epsilon_{\text{MDP}}(s,a)|.   (14)

To evaluate a policy $\pi$ exactly at relevant state-action pairs, only $\epsilon^\pi_{\text{MDP}} = 0$ is required.

Definition 4. We define the optimal batch-constrained policy $\pi^* \in \Pi_{\mathcal{B}}$ such that $Q^{\pi^*}(s,a) \geq Q^\pi(s,a)$ for all $\pi \in \Pi_{\mathcal{B}}$ and $(s,a) \in \mathcal{B}$.


Algorithm 1. Batch-Constrained Q-learning (BCQL) maintains a tabular value function $Q(s,a)$ for each possible state-action pair $(s,a)$. A transition tuple $(s,a,r,s')$ is sampled from the batch $\mathcal{B}$ with uniform probability and the following update rule is applied, with learning rate $\alpha$:

Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left( r + \gamma \max_{a' \text{ s.t. } (s',a') \in \mathcal{B}} Q(s',a') \right).   (15)
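The following is a small sketch of the tabular BCQL update in Equation (15): the maximization in the target only ranges over actions $a'$ such that $(s',a')$ appears in the batch. The dictionary-based data structures, the fallback value for unseen successor states, and the toy transition are illustrative assumptions rather than a reference implementation.

from collections import defaultdict

def bcql_update(Q, batch_actions, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # batch_actions[s'] holds the set of actions a' with (s', a') in the batch B.
    allowed = batch_actions.get(s_next, set())
    # If no (s', a') is in the batch, fall back to the initialized value (0 here).
    best_next = max(Q[(s_next, ap)] for ap in allowed) if allowed else 0.0
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

Q = defaultdict(float)
bcql_update(Q, batch_actions={"s1": {"a0"}}, s="s0", a="a1", r=1.0, s_next="s1")
print(dict(Q))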

Theorem 1. Performing Q-learning by sampling from a batch $\mathcal{B}$ converges to the optimal value function under the MDP $M_{\mathcal{B}}$.

Proof. Again, the MDP $M_{\mathcal{B}}$ is defined by the same action and state space as $M$, with an additional terminal state $s_{\text{init}}$. $M_{\mathcal{B}}$ has transition probabilities $p_{\mathcal{B}}(s'|s,a) = \frac{N(s,a,s')}{\sum_{\tilde{s}} N(s,a,\tilde{s})}$, where $N(s,a,s')$ is the number of times the tuple $(s,a,s')$ is observed in $\mathcal{B}$. If $\sum_{\tilde{s}} N(s,a,\tilde{s}) = 0$, then $p_{\mathcal{B}}(s_{\text{init}}|s,a) = 1$, where $r(s_{\text{init}},s,a)$ is set to the initialized value of $Q(s,a)$.

For any given MDP, Q-learning converges to the optimal value function given infinite state-action visitation and some standard assumptions (see Section A.2). Now note that sampling under a batch $\mathcal{B}$ with uniform probability satisfies the infinite state-action visitation assumptions of the MDP $M_{\mathcal{B}}$, where given $(s,a)$, the probability of sampling $(s,a,s')$ corresponds to $p(s'|s,a) = \frac{N(s,a,s')}{\sum_{\tilde{s}} N(s,a,\tilde{s})}$ in the limit. We remark that for $(s,a) \notin \mathcal{B}$, $Q(s,a)$ will never be updated, and will return the initialized value, which corresponds to the terminal transition $s_{\text{init}}$. It follows that sampling from $\mathcal{B}$ is equivalent to sampling from the MDP $M_{\mathcal{B}}$, and Q-learning converges to the optimal value function under $M_{\mathcal{B}}$.
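As a concrete illustration of the empirical MDP $M_{\mathcal{B}}$ used in this argument, the transition probabilities $p_{\mathcal{B}}$ can be computed directly from the counts $N(s,a,s')$; the toy batch below is purely illustrative.

from collections import Counter, defaultdict

def empirical_transitions(batch):
    # batch: iterable of (s, a, s') tuples; returns p_B(s'|s,a) = N(s,a,s') / sum over s~ of N(s,a,s~).
    counts = Counter((s, a, sn) for (s, a, sn) in batch)
    totals = defaultdict(int)
    for (s, a, sn), n in counts.items():
        totals[(s, a)] += n
    return {(s, a, sn): n / totals[(s, a)] for (s, a, sn), n in counts.items()}

batch = [("s0", "a1", "s1"), ("s0", "a1", "s1"), ("s0", "a1", "s0")]
print(empirical_transitions(batch))   # p_B(s1|s0,a1) = 2/3, p_B(s0|s0,a1) = 1/3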


Remark 1. For any policy $\pi$ and state-action pair $(s,a)$, the error term $\epsilon_{\text{MDP}}(s,a)$ satisfies the following Bellman-like equation:

\begin{split}
\epsilon_{\text{MDP}}(s,a) = {}& \sum_{s'} \left( p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a) \right) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi_{\mathcal{B}}(s',a') \right) \\
& + p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon_{\text{MDP}}(s',a').
\end{split}   (16)

Proof. The proof follows by expanding each $Q$, rearranging terms and then simplifying the expression.

\begin{split}
\epsilon_{\text{MDP}}(s,a) ={}& Q^\pi(s,a) - Q^\pi_{\mathcal{B}}(s,a) \\
={}& \sum_{s'} p_M(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right) - Q^\pi_{\mathcal{B}}(s,a) \\
={}& \sum_{s'} p_M(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a') \right) \\
& - \sum_{s'} p_{\mathcal{B}}(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi_{\mathcal{B}}(s',a') \right) \\
={}& \sum_{s'} \left( p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a) \right) r(s,a,s') + p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \left( Q^\pi_{\mathcal{B}}(s',a') + \epsilon_{\text{MDP}}(s',a') \right) \\
& - p_{\mathcal{B}}(s'|s,a) \gamma \sum_{a'} \pi(a'|s') Q^\pi_{\mathcal{B}}(s',a') \\
={}& \sum_{s'} \left( p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a) \right) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi_{\mathcal{B}}(s',a') \right) \\
& + p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon_{\text{MDP}}(s',a')
\end{split}   (17)

Lemma 1. For all reward functions, $\epsilon^\pi_{\text{MDP}} = 0$ if and only if $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$ and $(s,a)$ such that $\mu_\pi(s) > 0$ and $\pi(a|s) > 0$.

Proof. From Remark 1, we examine the form of $\epsilon_{\text{MDP}}(s,a)$. Since no assumptions can be made on the reward function, and therefore on the expression $r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi_{\mathcal{B}}(s',a')$, we have that $\epsilon_{\text{MDP}}(s,a) = 0$ if and only if $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$ and $p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon_{\text{MDP}}(s',a') = 0$.

$(\Rightarrow)$ Now we note that if $\epsilon_{\text{MDP}}(s,a) = 0$ then $p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon_{\text{MDP}}(s',a') = 0$ by the relationship defined by Remark 1 and the condition on the reward function. It follows that we must have $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$.

$(\Leftarrow)$ If we have $\sum_{s'} |p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a)| = 0$ for all $(s,a)$ such that $\mu_\pi(s) > 0$ and $\pi(a|s) > 0$, then for any $(s,a)$ under the given conditions, we have $\epsilon(s,a) = \sum_{s'} p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon(s',a')$. Recursively expanding the $\epsilon$ term, we arrive at $\epsilon(s,a) = 0 + \gamma 0 + \gamma^2 0 + \dots = 0$.


Theorem 2. For a deterministic MDP and all reward functions, $\epsilon^\pi_{\text{MDP}} = 0$ if and only if the policy $\pi$ is batch-constrained. Furthermore, if $\mathcal{B}$ is coherent, then such a policy must exist if the start state $s_0 \in \mathcal{B}$.

Proof. The first part of the Theorem follows from Lemma 1, noting that for a deterministic MDP, if $(s,a) \in \mathcal{B}$ then we must have $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$.

We can construct the batch-constrained policy by selecting $a$ in the state $s \in \mathcal{B}$, such that $(s,a) \in \mathcal{B}$. Since the MDP is deterministic and the batch is coherent, when starting from $s_0$, we must be able to follow at least one trajectory until termination.


Theorem 3. Given the Robbins-Monro stochastic convergence conditions on the learning rate $\alpha$, and standard sampling requirements from the environment, BCQL converges to the optimal value function $Q^*$.

Proof. This follows from the proof of convergence of Q-learning (see Section A.2), noting that the batch-constraint is non-restrictive for a batch which contains all possible transitions.


Theorem 4. Given a deterministic MDP and coherent batch $\mathcal{B}$, along with the Robbins-Monro stochastic convergence conditions on the learning rate $\alpha$ and standard sampling requirements on the batch $\mathcal{B}$, BCQL converges to $Q^\pi_{\mathcal{B}}(s,a)$ where $\pi^*(s) = \operatorname*{argmax}_{a \text{ s.t. } (s,a) \in \mathcal{B}} Q^\pi_{\mathcal{B}}(s,a)$ is the optimal batch-constrained policy.

Proof. The result follows from Theorem 1, which states that Q-learning learns the optimal value for the MDP $M_{\mathcal{B}}$ at the observed state-action pairs $(s,a)$. However, for a deterministic MDP, $M_{\mathcal{B}}$ corresponds to the true MDP at all seen state-action pairs. Noting that batch-constrained policies operate only on state-action pairs where $M_{\mathcal{B}}$ corresponds to the true MDP, it follows that $\pi^*$ will be the optimal batch-constrained policy from the optimality of Q-learning.


A.2 Sketch of the Proof of Convergence of Q-Learning

The proof of convergence of Q-learning relies largely on the following lemma (Singh et al., 2000):

Lemma 2. Consider a stochastic process $(\zeta_t, \Delta_t, F_t), t \geq 0$, where $\zeta_t, \Delta_t, F_t : X \rightarrow \mathbb{R}$ satisfy the equation:

\Delta_{t+1}(x_t) = (1 - \zeta_t(x_t)) \Delta_t(x_t) + \zeta_t(x_t) F_t(x_t),   (18)

where $x_t \in X$ and $t = 0, 1, 2, \dots$. Let $P_t$ be a sequence of increasing $\sigma$-fields such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable and $\zeta_t, \Delta_t$ and $F_{t-1}$ are $P_t$-measurable, $t = 1, 2, \dots$. Assume that the following hold:

  1. The set $X$ is finite.

  2. $\zeta_t(x_t) \in [0,1]$, $\sum_t \zeta_t(x_t) = \infty$, $\sum_t (\zeta_t(x_t))^2 < \infty$ with probability $1$, and $\forall x \neq x_t : \zeta_t(x) = 0$.

  3. $||\mathbb{E}[F_t | P_t]|| \leq \kappa ||\Delta_t|| + c_t$, where $\kappa \in [0,1)$ and $c_t$ converges to $0$ with probability $1$.

  4. $\text{Var}[F_t(x_t) | P_t] \leq K(1 + \kappa ||\Delta_t||)^2$, where $K$ is some constant,

where $||\cdot||$ denotes the maximum norm. Then $\Delta_t$ converges to $0$ with probability $1$.


Sketch of Proof of Convergence of Q-Learning. We set $\Delta_t = Q_t(s,a) - Q^*(s,a)$. Then convergence follows by satisfying the conditions of Lemma 2. Condition 1 is satisfied by the finite MDP, setting $X = \mathcal{S} \times \mathcal{A}$. Condition 2 is satisfied by the assumption of Robbins-Monro stochastic convergence conditions on the learning rate $\alpha_t$, setting $\zeta_t = \alpha_t$. Condition 4 is satisfied by the bounded reward function, where $F_t(s,a) = r(s,a,s') + \gamma \max_{a'} Q(s',a') - Q^*(s,a)$, and the sequence $P_t = \{Q_0, s_0, a_0, \alpha_0, r_1, s_1, \dots, s_t, a_t\}$. Finally, Condition 3 follows from the contraction of the Bellman operator $\mathcal{T}$, requiring infinite state-action visitation, infinite updates and $\gamma < 1$.

Additional and more complete details can be found in numerous resources (Dayan & Watkins, 1992; Singh et al., 2000; Melo, 2001).

Appendix B Missing Graphs

B.1 Extrapolation Error in Deep Reinforcement Learning

(a) Final buffer performance (b) Final buffer value estimates (c) Concurrent performance (d) Concurrent value estimates (e) Imitation performance (f) Imitation value estimates
Figure 4: We examine the performance of DDPG in three batch tasks. Each individual trial is plotted with a thin line, with the mean in bold (evaluated without exploration noise). Straight lines represent the average return of episodes contained in the batch (with exploration noise). An estimate of the true value of the off-policy agent, evaluated by Monte Carlo returns, is marked by a dotted line. In the final buffer experiment, the off-policy agent learns from a large, diverse dataset, but exhibits poor learning behavior and value estimation. In the concurrent setting the agent learns alongside a behavioral agent, with access to the same data, but suffers in performance. In the imitation setting, the agent receives data from an expert policy but is unable to learn, and exhibits highly divergent value estimates.

B.2 Complete Experimental Results

(a) Final buffer performance (b) Final buffer value estimates (c) Concurrent performance (d) Concurrent value estimates (e) Imitation performance (f) Imitation value estimates (g) Imperfect demonstrations performance (h) Imperfect demonstrations value estimates
Figure 5: We evaluate BCQ and several baselines on the experiments from Section 3.1, as well as a new imperfect demonstration task. Performance is graphed on the left, and value estimates are graphed on the right. The shaded area represents half a standard deviation. The bold black line measures the average return of episodes contained in the batch. For the value estimates, each individual trial is plotted, with the mean in bold. An estimate of the true value of BCQ, evaluated by Monte Carlo returns, is marked by a dotted line. Only BCQ matches or outperforms the performance of the behavioral policy in all tasks, while exhibiting a highly stable value function in each task.

We present the complete set of results across each task and environment in Figure 5. These results show that BCQ successfully mitigates extrapolation error and learns from a variety of fixed batch settings. Although BCQ with our current hyper-parameters was never found to fail, we noted that with slight changes to hyper-parameters, BCQ failed periodically on the concurrent learning task in the HalfCheetah-v1 environment, exhibiting instability in the value function after 750,000 or more iterations on some seeds. We hypothesize that this instability could occur if the generative model failed to output in-distribution actions, and could be corrected through additional training or improvements to the vanilla VAE. Interestingly, BCQ still performs well in these instances, due to the behavioral cloning-like elements in the algorithm.

Appendix C Extrapolation Error in Kernel-Based Reinforcement Learning

This problem of extrapolation persists in traditional batch reinforcement learning algorithms, such as kernel-based reinforcement learning (KBRL) (Ormoneit & Sen, 2002). For a given batch $\mathcal{B}$ of transitions $(s,a,r,s')$, non-negative density function $\phi : \mathbb{R}^+ \rightarrow \mathbb{R}^+$, hyper-parameter $\tau \in \mathbb{R}$, and norm $||\cdot||$, KBRL evaluates the value of a state-action pair $(s,a)$ as follows:

Q(s,a) = \sum_{(s_{\mathcal{B}}^a, a, r, s_{\mathcal{B}}') \in \mathcal{B}} \kappa^a_\tau(s, s_{\mathcal{B}}^a) \left[ r + \gamma V(s_{\mathcal{B}}') \right],   (19)

\kappa_\tau^a(s, s_{\mathcal{B}}^a) = \frac{k_\tau(s, s_{\mathcal{B}}^a)}{\sum_{\tilde{s}_{\mathcal{B}}^a} k_\tau(s, \tilde{s}_{\mathcal{B}}^a)}, \qquad k_\tau(s, s_{\mathcal{B}}^a) = \phi\left( \frac{||s - s_{\mathcal{B}}^a||}{\tau} \right),   (20)

where $s_{\mathcal{B}}^a \in \mathcal{S}$ represents states corresponding to the action $a$ for some tuple $(s_{\mathcal{B}}, a) \in \mathcal{B}$, and $V(s_{\mathcal{B}}') = \max_{a \text{ s.t. } (s_{\mathcal{B}}', a) \in \mathcal{B}} Q(s_{\mathcal{B}}', a)$. At each iteration, KBRL updates the estimates of $Q(s_{\mathcal{B}}, a_{\mathcal{B}})$ for all $(s_{\mathcal{B}}, a_{\mathcal{B}}) \in \mathcal{B}$ following Equation (19), then updates $V(s_{\mathcal{B}}')$ by evaluating $Q(s_{\mathcal{B}}', a)$ for all $s_{\mathcal{B}} \in \mathcal{B}$ and $a \in \mathcal{A}$.
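To make the evaluation in Equations (19)-(20) concrete, the sketch below computes the KBRL estimate for a single state-action pair with 1-D states and the density choice $\phi(x) = e^{-x}$; the data layout, discount factor and kernel choice are assumptions made for illustration only.

import numpy as np

def kbrl_q(s, a, batch, V, tau=1.0, gamma=0.99):
    # batch: list of (s_b, a_b, r, s_b_next); only tuples with a_b == a contribute.
    rel = [(sb, r, sn) for (sb, ab, r, sn) in batch if ab == a]
    if not rel:
        return 0.0
    k = np.array([np.exp(-abs(s - sb) / tau) for (sb, _, _) in rel])  # k_tau(s, s_b^a)
    w = k / k.sum()                                                   # normalized kernel kappa
    return float(sum(wi * (r + gamma * V[sn]) for wi, (_, r, sn) in zip(w, rel)))

# Toy usage: one transition per action, with V stored as a dictionary over successor states.
batch = [(0.0, "a1", 1.0, 1.0), (1.0, "a0", 0.0, 0.0)]
V = {0.0: 0.0, 1.0: 0.0}
print(kbrl_q(s=0.5, a="a1", batch=batch, V=V))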

Figure 6: Toy MDP with two states $s_0$ and $s_1$, and two actions $a_0$ and $a_1$. The agent receives a reward of $1$ for selecting $a_1$ at $s_0$ and $0$ otherwise.

Given access to the entire deterministic MDP, KBRL will provably converge to the optimal value; however, when limited to only a subset, we find the value estimation susceptible to extrapolation. In Figure 6, we provide a deterministic two-state, two-action MDP in which KBRL fails to learn the optimal policy when provided with state-action pairs from the optimal policy. Given the batch $\{(s_0, a_1, r=1, s_1), (s_1, a_0, r=0, s_0)\}$, corresponding to the optimal behavior, and noting that there is only one example of each action, Equation (19) provides the following:

Q(\cdot, a_1) = 1 + \gamma V(s_1) = 1 + \gamma Q(s_1, a_0), \qquad Q(\cdot, a_0) = \gamma V(s_0) = \gamma Q(s_0, a_1).   (21)

After sufficient iterations KBRL will converge correctly to $Q(s_0, a_1) = \frac{1}{1-\gamma^2}$, $Q(s_1, a_0) = \frac{\gamma}{1-\gamma^2}$. However, when evaluating actions, KBRL erroneously extrapolates the values of each action, $Q(\cdot, a_1) = \frac{1}{1-\gamma^2}$, $Q(\cdot, a_0) = \frac{\gamma}{1-\gamma^2}$, and its behavior, $\operatorname*{argmax}_a Q(s,a)$, will result in the degenerate policy of continually selecting $a_1$. KBRL fails this example by estimating the values of unseen state-action pairs. In methods where the extrapolated estimates can be included in the learning update, such as fitted Q-iteration or DQN (Ernst et al., 2005; Mnih et al., 2015), this could cause an unbounded sequence of value estimates, as demonstrated by our results in Section 3.1.
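A quick numerical check of this example (under the illustrative assumption $\gamma = 0.99$) confirms the fixed point: iterating Equation (21) converges to $Q(s_0,a_1) = 1/(1-\gamma^2)$ and $Q(s_1,a_0) = \gamma/(1-\gamma^2)$, and since KBRL assigns these same values to every state, $\operatorname*{argmax}_a Q(s,a)$ selects $a_1$ everywhere.

gamma = 0.99
q_a1, q_a0 = 0.0, 0.0
for _ in range(5000):
    # Jointly iterate Q(s0,a1) = 1 + gamma*Q(s1,a0) and Q(s1,a0) = gamma*Q(s0,a1).
    q_a1, q_a0 = 1.0 + gamma * q_a0, gamma * q_a1
print(q_a1, 1.0 / (1.0 - gamma ** 2))    # both approximately 50.25
print(q_a0, gamma / (1.0 - gamma ** 2))  # both approximately 49.75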

Appendix D Additional Experiments

D.1 Ablation Study of Perturbation Model

BCQ includes a perturbation model $\xi_\phi(s, a, \Phi)$ which outputs a small residual update to the actions sampled by the generative model in the range $[-\Phi, \Phi]$. This enables the policy to select actions which may not have been sampled by the generative model. If $\Phi = a_{\text{max}} - a_{\text{min}}$, then all actions can be plausibly selected by the model, similar to standard deep reinforcement learning algorithms, such as DQN and DDPG (Mnih et al., 2015; Lillicrap et al., 2015). In Figure 7 we examine the performance and value estimates of BCQ when varying the hyper-parameter $\Phi$, which corresponds to how much the model is able to move away from the actions sampled by the generative model.
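A minimal sketch of this residual update is shown below, under the assumption that the raw network output is squashed by tanh and scaled to $[-\Phi, \Phi]$ before the perturbed action is clipped back into a $[-1, 1]$ action range; the exact parametrization here is an illustrative assumption rather than the reported architecture.

import numpy as np

def perturb(action, raw_residual, phi=0.05, a_min=-1.0, a_max=1.0):
    residual = phi * np.tanh(raw_residual)            # residual confined to [-phi, phi]
    return float(np.clip(action + residual, a_min, a_max))

print(perturb(action=0.98, raw_residual=2.0))         # result stays inside the action range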

(a) Imitation performance (b) Imitation value estimates
Figure 7: We perform an ablation study on the perturbation model of BCQ, on the imitation task from Section 3.1. Performance is graphed on the left, and value estimates are graphed on the right. The shaded area represents half a standard deviation. The bold black line measures the average return of episodes contained in the batch. For the value estimates, each individual trial is plotted, with the mean in bold.

We observe a clear drop in performance with the increase of $\Phi$, along with an increase in instability in the value function. Given that the data is based on expert performance, this is consistent with our understanding of extrapolation error. With larger $\Phi$ the agent learns to take actions that are further away from the data in the batch after erroneously overestimating the value of suboptimal actions. This suggests the ideal value of $\Phi$ should be small enough to stay close to the generated actions, but large enough such that learning can be performed when exploratory actions are included in the dataset.

D.2 Uncertainty Estimation for Batch-Constrained Reinforcement Learning

Section 4.2 proposes using a generative model as a method for constraining the output of the policy $\pi$ to eliminate actions which are unlikely under the batch. However, a more natural approach would be through approximate uncertainty-based methods (Osband et al., 2016; Gal et al., 2016; Azizzadenesheli et al., 2018). These methods are well known to be effective for exploration; here, however, we examine their properties for exploitation, where we would like to avoid uncertain actions.

To measure the uncertainty of the value network, we use ensemble-based methods, with an ensemble of size $4$ and $10$ to mimic the models used by Buckman et al. (2018) and Osband et al. (2016) respectively. Each network is trained with separate mini-batches, following the standard deep Q-learning update with a target network. These networks use the default architecture and hyper-parameter choices as defined in Section G. The policy $\pi_\phi$ is trained to minimize the standard deviation $\sigma$ across the ensemble:

\phi \leftarrow \operatorname*{argmin}_\phi \sum_{(s,a) \in \mathcal{B}} \sigma\left( \{Q_{\theta_i}(s,a)\}_{i=1}^N \right).   (22)
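A sketch of this objective, under the assumption of a differentiable toy ensemble in PyTorch, is given below: the loss is the per-state standard deviation of the ensemble's Q-estimates, summed over the batch. The critic definitions, shapes and batch size are illustrative only.

import torch

def ensemble_std_loss(q_ensemble, states, actions):
    # q_ensemble: list of callables mapping (states, actions) -> value tensors.
    values = torch.stack([q(states, actions) for q in q_ensemble], dim=0)  # shape (N, batch)
    return values.std(dim=0).sum()

# Toy ensemble of four quadratic critics and a batch of five 1-D actions.
critics = [lambda s, a, w=w: -(a - w) ** 2 for w in (0.1, 0.2, 0.3, 0.4)]
states = torch.zeros(5)
actions = torch.zeros(5, requires_grad=True)
ensemble_std_loss(critics, states, actions).backward()   # gradients flow to the actions
print(actions.grad)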

If the ensembles were a perfect estimate of the uncertainty, the policy would learn to select the most certain action for a given state, minimizing the extrapolation error and effectively imitating the data in the batch.

To test these uncertainty-based methods, we examine their performance on the imitation task in the Hopper-v1 environment (Todorov et al., 2012; Brockman et al., 2016), in which a dataset of 1 million expert transitions is provided to the agents. Additional experimental details can be found in Section F. The performance, alongside the value estimates of the agents, is displayed in Figure 8.

(a) Imitation performance (b) Imitation value estimates
Figure 8: A comparison with uncertainty-based methods on the Hopper-v1 environment from OpenAI gym, on the imitation task. Although ensemble based methods are able to generate stable value functions (right) under these conditions, they fail to constrain the action space to the demonstrated expert actions and suffer in performance compared to our approach, BCQ (left).

We find that neither ensemble method is sufficient to constrain the action space to only the expert actions. However, the value function is stabilized, suggesting that ensembles are an effective strategy for eliminating outliers or large deviations in the value from erroneous extrapolation. Unsurprisingly, the larger ensemble provides a more accurate estimate of the uncertainty. While scaling the ensemble to larger sizes could possibly enable an effective batch-constraint, increasing the size of the ensemble induces a large computational cost. Finally, in this task, where only expert data is provided, the policy can attempt to imitate the data without consideration of the value; however, in other tasks, a weighting between value and uncertainty would need to be carefully tuned. On the other hand, BCQ offers a computationally cheap approach without requiring difficult hyper-parameter tuning.

D.3 Random Behavioral Policy Study

The experiments in Section 3.1 and 5 use a learned, or partially learned, behavioral policy for data collection. This is a necessary requirement for learning meaningful behavior, as a random policy generally fails to provide sufficient coverage over the state space. However, in simple toy problems, such as the pendulum swing-up task and the reaching task with a two-joint arm from OpenAI gym (Brockman et al., 2016), a random policy can sufficiently explore the environment, enabling us to examine the properties of algorithms with entirely non-expert data.

In Figure 9, we examine the performance of our algorithm, BCQ, as well as DDPG (Lillicrap et al., 2015), on these two toy problems, when learning off-policy from a small batch of 5000 time steps, collected entirely by a random policy. We find that both BCQ and DDPG are able to learn successfully in these off-policy tasks. These results suggest that BCQ is less restrictive than imitation learning algorithms, which require expert data to learn. We also find that unlike previous environments, given the small scale of the state and action space, the random policy is able to provide sufficient coverage for DDPG to learn successfully.

(a) Random behavioral performance (b) Random behavioral value estimates
Figure 9: We evaluate BCQ and DDPG on a batch collected by a random behavioral policy. The shaded area represents half a standard deviation. Value estimates include a plot of each trial, with the mean in bold. An estimate of the true value of BCQ and DDPG, evaluated by Monte Carlo returns, is marked by a dotted line. While both BCQ and DDPG perform well, BCQ exhibits more stability in the value function for the Pendulum task, and outperforms DDPG in the Reacher task. These tasks demonstrate examples where any imitation learning would fail as the data collection procedure is entirely random.

Appendix E Missing Background

E.1 Variational Auto-Encoder

A variational auto-encoder (VAE) (Kingma & Welling, 2013) is a generative model which aims to maximize the marginal log-likelihood $\log p(X) = \sum_{i=1}^{N} \log p(x_i)$, where $X = \{x_1, ..., x_N\}$ is the dataset. While computing the marginal likelihood is intractable, we can instead optimize the variational lower bound:

\log p(X) \geq \mathbb{E}_{q(z|X)}[\log p(X|z)] - D_{\text{KL}}(q(z|X) \,||\, p(z)), \qquad (23)

where $p(z)$ is a chosen prior, generally the multivariate normal distribution $\mathcal{N}(0, I)$. We define the posterior $q(z|X) = \mathcal{N}(z|\mu(X), \sigma^2(X)I)$ as the encoder and $p(X|z)$ as the decoder. Simply put, a VAE is an auto-encoder in which a given sample $x$ is passed through the encoder to produce a random latent vector $z$, which is then given to the decoder to reconstruct the original sample $x$. The VAE is trained on a reconstruction loss, along with a KL-divergence term over the distribution of the latent vectors. To perform gradient descent on the variational lower bound we can use the re-parametrization trick (Kingma & Welling, 2013; Rezende et al., 2014):

\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma)}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[f(\mu + \sigma\epsilon)]. \qquad (24)

This formulation allows for back-propagation through stochastic nodes, by noting that $\mu$ and $\sigma$ can be represented by deterministic functions. During inference, random values of $z$ are sampled from the multivariate normal and passed through the decoder to produce samples $x$.
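For concreteness, the following is a minimal PyTorch sketch of this reparameterized sampling step and the resulting loss. The single-layer encoder and decoder here are placeholders for illustration only, not the architecture used in our experiments.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    # Minimal VAE: the encoder outputs (mu, log sigma), the decoder reconstructs x.
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.encoder = nn.Linear(x_dim, 2 * z_dim)
        self.decoder = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, log_sigma = self.encoder(x).chunk(2, dim=-1)
        sigma = log_sigma.exp()
        # Re-parametrization trick (Equation 24): z = mu + sigma * eps, eps ~ N(0, I),
        # so gradients flow through mu and sigma rather than the sampling operation.
        eps = torch.randn_like(sigma)
        z = mu + sigma * eps
        recon = self.decoder(z)
        # Reconstruction term plus KL(N(mu, sigma) || N(0, I)).
        recon_loss = ((recon - x) ** 2).sum(-1).mean()
        kl = -0.5 * (1 + 2 * log_sigma - mu.pow(2) - sigma.pow(2)).sum(-1).mean()
        return recon_loss + kl
```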

Appendix F Experimental Details

Each environment is run for 1 million time steps, unless stated otherwise, with evaluations every 5000 time steps, where an evaluation measures the average reward over 10 episodes with no exploration noise. Our results are reported over 5 random seeds of the behavioral policy, OpenAI Gym simulator and network initialization. Value estimates are averaged over mini-batches of 100 and sampled every 2500 iterations. The true value is estimated by sampling 100 state-action pairs from the replay buffer and computing the discounted return by running the episode until completion while following the current policy.
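A sketch of this Monte Carlo estimate of the true value is given below, assuming a generic policy, replay buffer and environment interface; `buffer.sample_states`, `policy.select_action` and `env.reset_to` (restoring a saved simulator state) are illustrative names, not part of our released code.

```python
import numpy as np

def monte_carlo_value(env, policy, buffer, num_pairs=100, gamma=0.99):
    """Estimate the true value of `policy` by rolling out full episodes from
    states sampled out of the replay buffer and summing discounted rewards."""
    returns = []
    for state in buffer.sample_states(num_pairs):
        s = env.reset_to(state)          # assumed helper that restores a saved state
        done, discount, total = False, 1.0, 0.0
        while not done:
            a = policy.select_action(s)
            s, r, done, _ = env.step(a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return np.mean(returns)
```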

Each agent is trained after each episode by applying one training iteration per time step in the episode. The agent is trained with transition tuples $(s, a, r, s')$ sampled from an experience replay that is defined by each experiment. We define four possible experiments. Unless stated otherwise, the default implementation settings, as defined in Appendix G, are used.

Batch 1 (Final buffer). We train a DDPG (Lillicrap et al., 2015) agent for 1 million time steps, adding a large amount of Gaussian noise ($\mathcal{N}(0, 0.5)$) to the actions to induce exploration, and store all experienced transitions in a replay buffer. This training procedure creates a replay buffer with a diverse set of states and actions. A second, randomly initialized agent is then trained using the 1 million stored transitions.

Batch 2 (Concurrent learning). We simultaneously train two agents for 1 million time steps. The first, a DDPG agent, performs data collection, and each transition is stored in a replay buffer which both agents learn from. This means the behavioral agent learns under the standard training regime for most off-policy deep reinforcement learning algorithms, while the second agent learns off-policy, as the data is collected without any direct relationship to its current policy. Batch 2 differs from Batch 1 in that the agents are trained with the same version of the buffer, while in Batch 1 the second agent learns from the final replay buffer after the behavioral agent has finished training.

Batch 3 (Imitation). A DDPG agent is trained for 1 million time steps. The trained agent then acts as an expert policy and is used to collect a dataset of 1 million transitions. This dataset is used to train a second, randomly initialized agent. In particular, we train DDPG across 15 seeds and select the 5 top-performing seeds as the expert policies.

Batch 4 (Imperfect demonstrations). The expert policies from Batch 3 are used to collect a dataset of 100k transitions, while selecting actions randomly with probability 0.3 and adding Gaussian noise $\mathcal{N}(0, 0.3)$ to the remaining actions. This dataset is used to train a second, randomly initialized agent.
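The noisy data-collection procedure of Batch 4 can be sketched as follows, assuming the classic Gym step interface and a trained expert policy; `expert.select_action` and `buffer.add` are illustrative interfaces, not part of our released code.

```python
import numpy as np

def collect_imperfect_demos(env, expert, buffer, num_steps=100_000,
                            p_random=0.3, noise_std=0.3):
    # With probability p_random take a uniformly random action; otherwise perturb
    # the expert action with Gaussian noise N(0, noise_std) and clip to the bounds.
    state, done = env.reset(), False
    for _ in range(num_steps):
        if np.random.rand() < p_random:
            action = env.action_space.sample()
        else:
            action = expert.select_action(state)
            action = action + np.random.normal(0, noise_std, size=action.shape)
            action = np.clip(action, env.action_space.low, env.action_space.high)
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, float(done))
        state = env.reset() if done else next_state
    return buffer
```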

Appendix G Implementation Details

Across all methods and experiments, for fair comparison, each network generally uses the same hyper-parameters and architecture, which are defined in Table 1 and Figure 10 respectively. The value functions follow the standard practice (Mnih et al., 2015) in which the Bellman update differs for terminal transitions. When the episode ends by reaching some terminal state, the value is set to $0$ in the learning target $y$:

y = \begin{cases} r & \text{if } s' \text{ is terminal} \\ r + \gamma Q_{\theta'}(s', a') & \text{otherwise} \end{cases} \qquad (25)

where the termination signal from time-limited environments is ignored; thus, we only consider a state $s_t$ terminal if $t <$ max horizon.
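A minimal sketch of this target computation is shown below, assuming the replay buffer stores, for each transition, a `done` flag and a separate `timed_out` flag marking time-limit terminations (both names are illustrative).

```python
import torch

def bellman_target(reward, next_q, done, timed_out, gamma=0.99):
    """Compute y = r for true terminal states and y = r + gamma * Q'(s', a') otherwise
    (Equation 25). Transitions cut off by the environment time limit are not treated
    as terminal, matching the discussion above."""
    terminal = done * (1.0 - timed_out)   # ignore time-limit terminations
    return reward + gamma * (1.0 - terminal) * next_q
```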

Table 1: Default hyper-parameters.

Hyper-parameter                 Value
Optimizer                       Adam (Kingma & Ba, 2014)
Learning Rate                   $10^{-3}$
Batch Size                      100
Normalized Observations         False
Gradient Clipping               False
Discount Factor                 0.99
Target Update Rate ($\tau$)     0.005
Exploration Policy              $\mathcal{N}(0, 0.1)$

(input dimension, 400) ReLU (400, 300) ReLU (300, output dimension)

Figure 10: Default network architecture, based on DDPG (Lillicrap et al., 2015). All actor networks are followed by a tanh activation multiplied by the maximum action size.
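A PyTorch sketch of the default actor of Figure 10 is given below; it is a reconstruction for illustration under the architecture described above, not our exact implementation.

```python
import torch
import torch.nn as nn

class DefaultActor(nn.Module):
    # (state dim, 400) ReLU (400, 300) ReLU (300, action dim), then tanh * max_action.
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim),
        )
        self.max_action = max_action

    def forward(self, state):
        # Bound the output to the valid action range.
        return self.max_action * torch.tanh(self.net(state))
```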

BCQ. BCQ uses four main networks: a perturbation model $\xi_\phi(s, a)$, a state-conditioned VAE $G_\omega(s)$, and a pair of value networks $Q_{\theta_1}(s, a), Q_{\theta_2}(s, a)$, along with three corresponding target networks $\xi_{\phi'}(s, a), Q_{\theta'_1}(s, a), Q_{\theta'_2}(s, a)$. Each network, other than the VAE, follows the default architecture (Figure 10) and the default hyper-parameters (Table 1). For $\xi_\phi(s, a, \Phi)$, the constraint $\Phi$ is implemented through a tanh activation multiplied by $I \cdot \Phi$ following the final layer.

As suggested by Fujimoto et al. (2018), the perturbation model is trained only with respect to $Q_{\theta_1}$, following the deterministic policy gradient algorithm (Silver et al., 2014), by performing a residual update on a single action sampled from the generative model:

\phi \leftarrow \operatorname*{argmax}_{\phi} \sum_{(s, a) \in \mathcal{B}} Q_{\theta_1}(s, a + \xi_\phi(s, a, \Phi)), \qquad a \sim G_\omega(s). \qquad (26)
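A sketch of one gradient step of this update is shown below, assuming `vae.decode(s)` samples $a \sim G_\omega(s)$, `perturbation(s, a)` returns $\xi_\phi(s, a, \Phi)$, and `q1(s, a)` is the first Q-network; all three interfaces are illustrative.

```python
import torch

def perturbation_update(vae, perturbation, q1, optimizer, state):
    """One gradient step on xi_phi (Equation 26), maximizing Q_theta1 over the
    perturbed action for a mini-batch of states."""
    with torch.no_grad():
        sampled_action = vae.decode(state)            # a ~ G_w(s)
    perturbed = sampled_action + perturbation(state, sampled_action)
    loss = -q1(state, perturbed).mean()               # maximize Q => minimize -Q
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```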

To penalize uncertainty over future states, we train a pair of value estimates $\{Q_{\theta_1}, Q_{\theta_2}\}$ and take a weighted minimum between the two values as the learning target $y$ for both networks. First, $n$ actions are sampled with respect to the generative model, and then adjusted by the target perturbation model, before being passed to each target Q-network:

y = r + \gamma \max_{a_i} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i) \right],
\{\tilde{a}_i = a_i + \xi_{\phi'}(s', a_i, \Phi)\}_{i=0}^{n}, \qquad \{a_i \sim G_\omega(s')\}_{i=0}^{n},
\mathcal{L}_{\text{value},i} = \sum_{(s, a, r, s') \in \mathcal{B}} (y - Q_{\theta_i}(s, a))^2. \qquad (27)

Both networks are trained with the same target $y$, where $\lambda = 0.75$.
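The target computation of Equation 27 can be sketched as follows, under the same assumed interfaces as above (`vae.decode`, target perturbation model `xi_t`, target Q-networks `q1_t`, `q2_t`); the batch is repeated $n$ times so that all candidate actions are evaluated in a single forward pass, as described in the text below.

```python
import torch

def bcq_target(vae, xi_t, q1_t, q2_t, next_state, reward, not_done,
               n=10, lam=0.75, gamma=0.99):
    """Compute the BCQ learning target y (Equation 27) for a mini-batch.
    `next_state` has shape (B, state_dim); `reward` and `not_done` have shape (B, 1)."""
    with torch.no_grad():
        batch = next_state.shape[0]
        # Repeat each next state n times and sample n candidate actions per state.
        s_rep = next_state.repeat_interleave(n, dim=0)     # (B * n, state_dim)
        a = vae.decode(s_rep)                              # a_i ~ G_w(s')
        a = a + xi_t(s_rep, a)                             # a_i + xi_phi'(s', a_i, Phi)
        q1, q2 = q1_t(s_rep, a), q2_t(s_rep, a)
        # Weighted minimum of the two target estimates (lambda = 0.75).
        q = lam * torch.min(q1, q2) + (1.0 - lam) * torch.max(q1, q2)
        # Maximize over the n candidate actions of each state.
        q = q.reshape(batch, n).max(dim=1, keepdim=True).values
        return reward + not_done * gamma * q
```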

The VAE $G_\omega$ is defined by two networks, an encoder $E_{\omega_1}(s, a)$ and a decoder $D_{\omega_2}(s, z)$, where $\omega = \{\omega_1, \omega_2\}$. The encoder takes a state-action pair and outputs the mean $\mu$ and standard deviation $\sigma$ of a Gaussian distribution $\mathcal{N}(\mu, \sigma)$. The state $s$, along with a latent vector $z$ sampled from the Gaussian, is passed to the decoder $D_{\omega_2}(s, z)$, which outputs an action. Each network follows the default architecture (Figure 10), with two hidden layers of size 750, rather than 400 and 300. The VAE is trained with respect to the mean squared error of the reconstruction, along with a KL regularization term:

\mathcal{L}_{\text{reconstruction}} = \sum_{(s, a) \in \mathcal{B}} (D_{\omega_2}(s, z) - a)^2, \quad z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1), \qquad (28)
\mathcal{L}_{\text{KL}} = D_{\text{KL}}(\mathcal{N}(\mu, \sigma) \,||\, \mathcal{N}(0, 1)), \qquad (29)
\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{reconstruction}} + \lambda \mathcal{L}_{\text{KL}}. \qquad (30)

Noting the Gaussian form of both distributions, the KL divergence term can be simplified (Kingma & Welling, 2013):

D_{\text{KL}}(\mathcal{N}(\mu, \sigma) \,||\, \mathcal{N}(0, 1)) = -\frac{1}{2} \sum_{j=1}^{J} (1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2), \qquad (31)

where $J$ denotes the dimensionality of $z$. For each experiment, $J$ is set to twice the dimensionality of the action space. The KL divergence term in $\mathcal{L}_{\text{VAE}}$ is normalized across experiments by setting $\lambda = \frac{1}{2J}$. During inference with the VAE, the latent vector $z$ is clipped to the range $[-0.5, 0.5]$ to limit generalization beyond previously seen actions. For the small-scale experiments in Supplementary Material D.3, $L_2$ regularization with weight $10^{-3}$ was used for the VAE to compensate for the small number of samples; the other networks remain unchanged. In the value network update (Equation 27), the VAE is sampled multiple times and the samples are passed to both the perturbation model and each Q-network. For each state in the mini-batch the VAE is sampled $n = 10$ times. This can be implemented efficiently by passing a latent vector with a batch size of $10 \cdot$ batch size, effectively 1000, to the VAE and treating the output as a new mini-batch for the perturbation model and each Q-network. When running the agent in the environment, we sample from the VAE 10 times, perturb each action with $\xi_\phi$, and select the highest-valued action with respect to $Q_{\theta_1}$, as sketched below.
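The following is a minimal sketch of this action-selection step for a single state, under the same assumed `vae`, `xi` and `q1` interfaces used in the earlier sketches.

```python
import torch

def bcq_select_action(vae, xi, q1, state, n=10):
    """Pick the highest-valued perturbed candidate action for a single state.
    `state` is a 1-D tensor of shape (state_dim,)."""
    with torch.no_grad():
        s = state.unsqueeze(0).repeat(n, 1)    # (n, state_dim)
        a = vae.decode(s)                      # n candidate actions from G_w(s)
        a = a + xi(s, a)                       # perturb each candidate
        idx = q1(s, a).argmax()                # index of the highest-valued candidate
        return a[idx]
```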

DDPG. Our DDPG implementation deviates from some of the default architecture and hyper-parameters to mimic the original implementation more closely (Lillicrap et al., 2015). In particular, the action is only passed to the critic at the second layer (Figure 11), the critic uses $L_2$ regularization with weight $10^{-2}$, and the actor uses a reduced learning rate of $10^{-4}$.

(state dimension, 400) ReLU (400 + action dimension, 300) ReLU (300, 1)

Figure 11: DDPG Critic Network Architecture.

As done by Fujimoto et al. (2018), our DDPG agent randomly selects actions for the first 10k time steps for HalfCheetah-v1, and 1k time steps for Hopper-v1 and Walker2d-v1. This was found to improve performance and reduce the likelihood of local minima in the policy during early iterations.

DQN. Given the high-dimensional nature of the action space of the experimental environments, our DQN implementation selects actions over an independently discretized action space. Each action dimension is discretized separately into 10 possible actions, giving $10J$ possible sub-actions, where $J$ is the dimensionality of the action space. A given state-action pair $(s, a)$ then corresponds to a set of state-sub-action pairs $(s, a_{ij})$, where $i \in \{1, ..., 10\}$ indexes the bins and $j \in \{1, ..., J\}$ indexes the dimensions. In each DQN update, all state-sub-action pairs $(s, a_{ij})$ are updated with respect to the average value of the target state-sub-action pairs $(s', a'_{ij})$. The learning update of the discretized DQN is as follows:

y = r + \frac{\gamma}{J} \sum_{j=1}^{J} \max_{i} Q_{\theta'}(s', a'_{ij}), \qquad (32)
\theta \leftarrow \operatorname*{argmin}_{\theta} \sum_{(s, a, r, s') \in \mathcal{B}} \sum_{j=1}^{J} \left( y - Q_{\theta}(s, a_{ij}) \right)^2, \qquad (33)

where $a_{ij}$ is chosen by selecting the corresponding bin $i$ in the discretized action space for each dimension $j$. For clarity, we provide the exact DQN network architecture in Figure 12.

(state dimension, 400) ReLU (400, 300) ReLU (300, 10 $\cdot$ action dimension)

Figure 12: DQN Network Architecture.
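A sketch of the discretized DQN update of Equations 32 and 33 is shown below, assuming the network output of Figure 12 is reshaped to one value per bin and per action dimension; the function signature and tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F

def discretized_dqn_loss(q_net, q_target, state, next_state, action_bins,
                         reward, not_done, gamma=0.99):
    """`q_net(s)` and `q_target(s)` return values of shape (B, 10, J): one value per
    bin i and action dimension j. `action_bins` holds the chosen bin index for each
    dimension, shape (B, J); `reward` and `not_done` have shape (B, 1)."""
    with torch.no_grad():
        next_q = q_target(next_state)                              # (B, 10, J)
        # Average over dimensions of the per-dimension maxima (Equation 32).
        y = reward + not_done * gamma * next_q.max(dim=1).values.mean(dim=1, keepdim=True)
    q = q_net(state)                                               # (B, 10, J)
    # Gather Q(s, a_ij) for the selected bin in every dimension.
    chosen = q.gather(1, action_bins.unsqueeze(1)).squeeze(1)      # (B, J)
    # Every sub-action regresses to the same shared target y (Equation 33).
    return F.mse_loss(chosen, y.expand_as(chosen))
```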

Behavioral Cloning. We use two behavioral cloning methods, VAE-BC and BC. VAE-BC is implemented and trained exactly as $G_\omega(s)$, defined for BCQ. BC uses a feed-forward network with the default architecture and hyper-parameters, and is trained with a mean-squared error reconstruction loss.