
Decision Transformer: Reinforcement
Learning via Sequence Modeling

Lili Chen∗,1, Kevin Lu∗,1, Aravind Rajeswaran2, Kimin Lee1,

Aditya Grover2, Michael Laskin1, Pieter Abbeel1, Aravind Srinivas†,1, Igor Mordatch†,3

∗ equal contribution    † equal advising

1UC Berkeley    2Facebook AI Research    3Google Brain

{lilichen, kzl}@berkeley.edu
Abstract

We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Figure 1: Decision Transformer architecture. States, actions, and returns are fed into modality-specific linear embeddings and a positional episodic timestep encoding is added. Tokens are fed into a GPT architecture which predicts actions autoregressively using a causal self-attention mask. (Footnote 1: Our code is available at https://sites.google.com/berkeley.edu/decision-transformer.)

1 Introduction

Recent work has shown transformers [1] can model high-dimensional distributions of semantic concepts at scale, including effective zero-shot generalization in language [2] and out-of-distribution image generation [3]. Given the diversity of successful applications of such models, we seek to examine their application to sequential decision making problems formalized as reinforcement learning (RL). In contrast to prior work using transformers as an architectural choice for components within traditional RL algorithms [4, 5], we seek to study if generative trajectory modeling – i.e. modeling the joint distribution of the sequence of states, actions, and rewards – can serve as a replacement for conventional RL algorithms.

We consider the following shift in paradigm: instead of training a policy through conventional RL algorithms like temporal difference (TD) learning [6], we will train transformer models on collected experience using a sequence modeling objective. This will allow us to bypass the need for bootstrapping for long term credit assignment – thereby avoiding one of the “deadly triad” [6] known to destabilize RL. It also avoids the need for discounting future rewards, as typically done in TD learning, which can induce undesirable short-sighted behaviors. Additionally, we can make use of existing transformer frameworks widely used in language and vision that are easy to scale, utilizing a large body of work studying stable training of transformer models.

In addition to their demonstrated ability to model long sequences, transformers also have other advantages. Transformers can perform credit assignment directly via self-attention, in contrast to Bellman backups which slowly propagate rewards and are prone to “distractor” signals [7]. This can enable transformers to still work effectively in the presence of sparse or distracting rewards. Finally, empirical evidence suggests that a transformer modeling approach can model a wide distribution of behaviors, enabling better generalization and transfer [3].

We explore our hypothesis by considering offline RL, where we will task agents with learning policies from suboptimal data – producing maximally effective behavior from fixed, limited experience. This task is traditionally challenging due to error propagation and value overestimation [8]. However, it is a natural task when training with a sequence modeling objective. By training an autoregressive model on sequences of states, actions, and returns, we reduce policy sampling to autoregressive generative modeling. We can specify the expertise of the policy – which “skill” to query – by selecting the desired return tokens, acting as a prompt for generation.

Figure 2: Illustrative example of finding shortest path for a fixed graph (left) posed as reinforcement learning. Training dataset consists of random walk trajectories and their per-node returns-to-go (middle). Conditioned on a starting state and generating largest possible return at each node, Decision Transformer sequences optimal paths.

Illustrative example. To get an intuition for our proposal, consider the task of finding the shortest path on a directed graph, which can be posed as an RL problem. The reward is $0$ when the agent is at the goal node and $-1$ otherwise. We train a GPT [9] model to predict next token in a sequence of returns-to-go (sum of future rewards), states, and actions. Training only on random walk data – with no expert demonstrations – we can generate optimal trajectories at test time by adding a prior to generate highest possible returns (see more details and empirical results in the Appendix) and subsequently generate the corresponding sequence of actions via conditioning. Thus, by combining the tools of sequence modeling with hindsight return information, we achieve policy improvement without the need for dynamic programming.

Motivated by this observation, we propose Decision Transformer, where we use the GPT architecture to autoregressively model trajectories (shown in Figure 1). We study whether sequence modeling can perform policy optimization by evaluating Decision Transformer on offline RL benchmarks in Atari [10], OpenAI Gym [11], and Key-to-Door [12] environments. We show that – without using dynamic programming – Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL algorithms [13, 14]. Furthermore, in tasks where long-term credit assignment is required, Decision Transformer capably outperforms the RL baselines. With this work, we aim to bridge sequence modeling and transformers with RL, and hope that sequence modeling serves as a strong algorithmic paradigm for RL.

2 Preliminaries

2.1 Offline reinforcement learning

We consider learning in a Markov decision process (MDP) described by the tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R})$. The MDP tuple consists of states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, transition dynamics $P(s' \mid s, a)$, and a reward function $r = \mathcal{R}(s, a)$. We use $s_t$, $a_t$, and $r_t = \mathcal{R}(s_t, a_t)$ to denote the state, action, and reward at timestep $t$, respectively. A trajectory is made up of a sequence of states, actions, and rewards: $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$. The return of a trajectory at timestep $t$, $R_t = \sum_{t'=t}^{T} r_{t'}$, is the sum of future rewards from that timestep. The goal in reinforcement learning is to learn a policy which maximizes the expected return $\mathbb{E}\bigl[\sum_{t=1}^{T} r_t\bigr]$ in an MDP. In offline reinforcement learning, instead of obtaining data via environment interactions, we only have access to some fixed limited dataset consisting of trajectory rollouts of arbitrary policies. This setting is harder as it removes the ability for agents to explore the environment and collect additional feedback.
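The return $R_t$ defined above is a suffix sum over the rewards of a trajectory. As a minimal sketch, assuming the rewards of one trajectory are available as a plain Python list (the function name `returns_to_go` is illustrative and not taken from the paper's code), it can be computed in a single backward pass:

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Return R_t = sum of rewards[t:] for every timestep t (undiscounted)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum of the later rewards.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


# Sparse-reward trajectory: only the final step is rewarded.
print(returns_to_go([0.0, 0.0, 1.0]))    # [1.0, 1.0, 1.0]
# Shortest-path style rewards: -1 per step until the goal is reached.
print(returns_to_go([-1.0, -1.0, 0.0]))  # [-2.0, -1.0, 0.0]
```

No discount factor appears because the return is defined here as a plain sum of future rewards.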

2.2 Transformers

Transformers were proposed by Vaswani et al. [1] as an architecture to efficiently model sequential data. These models consist of stacked self-attention layers with residual connections. Each self-attention layer receives $n$ embeddings $\{x_i\}_{i=1}^{n}$ corresponding to unique input tokens, and outputs $n$ embeddings $\{z_i\}_{i=1}^{n}$, preserving the input dimensions. The $i$-th token is mapped via linear transformations to a key $k_i$, query $q_i$, and value $v_i$. The $i$-th output of the self-attention layer is given by weighting the values $v_j$ by the normalized dot product between the query $q_i$ and other keys $k_j$:

$$z_i = \sum_{j=1}^{n} \texttt{softmax}\bigl(\{\langle q_i, k_{j'} \rangle\}_{j'=1}^{n}\bigr)_j \cdot v_j. \tag{1}$$

As we shall see later, this allows the layer to assign “credit” by implicitly forming state-return associations via similarity of the query and key vectors (maximizing the dot product). In this work, we use the GPT architecture [9], which modifies the transformer architecture with a causal self-attention mask to enable autoregressive generation, replacing the summation/softmax over the $n$ tokens with only the previous tokens in the sequence ($j \in [1, i]$). We defer the other architecture details to the original papers.
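To make Equation 1 and the causal restriction $j \in [1, i]$ concrete, the following is a minimal pure-Python sketch. It assumes the queries, keys, and values are already given as lists of vectors (the linear projections are omitted), uses 0-based indices, and follows Equation 1 literally, so no scaling of the dot product is applied. The helper names are illustrative only.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """z_i = sum over j <= i of softmax({<q_i, k_j'>}_{j' <= i})_j * v_j."""
    outputs = []
    dim = len(v[0])
    for i in range(len(q)):
        # Causal mask: token i only sees keys and values at positions 0..i.
        weights = softmax([dot(q[i], k[j]) for j in range(i + 1)])
        z_i = [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        outputs.append(z_i)
    return outputs


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
z = causal_self_attention(q, k, v)
print(z[0])  # the first token can only attend to itself, so z[0] equals v[0]
```

Dropping the `range(i + 1)` restriction and attending over all positions recovers the unmasked form of Equation 1.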

3 Method

In this section, we present Decision Transformer, which models trajectories autoregressively with minimal modification to the transformer architecture, as summarized in Figure 1 and Algorithm 1.

Trajectory representation. The key desiderata in our choice of trajectory representation are that it should enable transformers to learn meaningful patterns and we should be able to conditionally generate actions at test time. It is nontrivial to model rewards since we would like the model to generate actions based on future desired returns, rather than past rewards. As a result, instead of feeding the rewards directly, we feed the model with the returns-to-go $\widehat{R}_t = \sum_{t'=t}^{T} r_{t'}$. This leads to the following trajectory representation which is amenable to autoregressive training and generation:

$$\tau = \left(\widehat{R}_1, s_1, a_1, \widehat{R}_2, s_2, a_2, \dots, \widehat{R}_T, s_T, a_T\right). \tag{2}$$

At test time, we can specify the desired performance (e.g. 1 for success or 0 for failure), as well as the environment starting state, as the conditioning information to initiate generation. After executing the generated action for the current state, we decrement the target return by the achieved reward and repeat until episode termination.
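The representation in Equation 2 and the test-time update of the target return can be sketched in a few lines. This is an illustrative sketch only: tokens are represented as tagged tuples rather than embeddings, and the function names and the rewards in the usage example are made up.

```python
from typing import Any, List, Tuple


def interleave(rtg: List[float], states: List[Any], actions: List[Any]) -> List[Tuple[str, Any]]:
    """Flatten per-timestep triples into (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    tokens: List[Tuple[str, Any]] = []
    for R, s, a in zip(rtg, states, actions):
        tokens += [("R", R), ("s", s), ("a", a)]
    return tokens


def update_target(target_return: float, reward: float) -> float:
    """Test-time rule: decrement the target return by the reward just achieved."""
    return target_return - reward


print(interleave([1.0, 1.0], ["s1", "s2"], ["a1", "a2"]))
# [('R', 1.0), ('s', 's1'), ('a', 'a1'), ('R', 1.0), ('s', 's2'), ('a', 'a2')]

target = 1.0  # desired return, e.g. 1 for success
for r in [0.0, 0.0, 1.0]:  # made-up rewards observed during a rollout
    target = update_target(target, r)
print(target)  # 0.0
```

Once the remaining target reaches zero in this example, the agent has collected exactly the return it was asked for.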

Architecture. We feed the last $K$ timesteps into Decision Transformer, for a total of $3K$ tokens (one for each modality: return-to-go, state, or action). To obtain token embeddings, we learn a linear layer for each modality, which projects raw inputs to the embedding dimension, followed by layer normalization [15]. For environments with visual inputs, the state is fed into a convolutional encoder instead of a linear layer. Additionally, an embedding for each timestep is learned and added to each token – note this is different than the standard positional embedding used by transformers, as one timestep corresponds to three tokens. The tokens are then processed by a GPT [9] model, which predicts future action tokens via autoregressive modeling.
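The difference between a per-token position and the per-timestep index used for the learned embedding can be seen by listing both for a short context. The sketch below is purely illustrative (the function name and the chosen timesteps are assumptions); it only enumerates indices and does not build any embeddings.

```python
from typing import List, Tuple


def token_layout(timesteps: List[int]) -> List[Tuple[int, int, str]]:
    """List (token position, episode timestep, modality) for each of the 3K tokens."""
    layout = []
    position = 0
    for t in timesteps:
        for modality in ("return-to-go", "state", "action"):
            layout.append((position, t, modality))
            position += 1
    return layout


# K = 2 timesteps taken from the middle of an episode (timesteps 7 and 8).
for row in token_layout([7, 8]):
    print(row)
```

The three tokens of a timestep get distinct positions but share one episode timestep, which is the index the learned timestep embedding is looked up with.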

Training. We are given a dataset of offline trajectories. We sample minibatches of sequence length $K$ from the dataset. The prediction head corresponding to the input token $s_t$ is trained to predict $a_t$ – either with cross-entropy loss for discrete actions or mean-squared error for continuous actions – and the losses for each timestep are averaged. We did not find predicting the states or returns-to-go to improve performance, although it is easily permissible within our framework (as shown in Section 5.4) and would be an interesting study for future work.
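The two action losses mentioned above can be written out explicitly. The following is a minimal pure-Python sketch for a single sequence, assuming continuous actions are lists of floats and discrete actions are integer indices with one vector of logits per timestep; the function names are illustrative, and a real implementation would use a tensor library.

```python
import math
from typing import List


def mse_loss(pred: List[List[float]], target: List[List[float]]) -> float:
    """Mean-squared error averaged over timesteps and action dimensions."""
    errors = [(p - a) ** 2 for p_t, a_t in zip(pred, target) for p, a in zip(p_t, a_t)]
    return sum(errors) / len(errors)


def cross_entropy_loss(logits: List[List[float]], target: List[int]) -> float:
    """Cross-entropy of the target action under softmax(logits), averaged over timesteps."""
    total = 0.0
    for logit_t, a_t in zip(logits, target):
        m = max(logit_t)  # log-sum-exp with the max subtracted for stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in logit_t))
        total += log_norm - logit_t[a_t]
    return total / len(target)


print(mse_loss([[1.0, 2.0]], [[1.0, 4.0]]))   # 2.0
print(cross_entropy_loss([[0.0, 0.0]], [0]))  # log(2), about 0.693
```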

Algorithm 1 Decision Transformer Pseudocode (for continuous actions)
# R, s, a, t: returns-to-go, states, actions, or timesteps
# transformer: transformer with causal masking (GPT)
# embed_s, embed_a, embed_R: linear embedding layers
# embed_t: learned episode positional embedding
# pred_a: linear action prediction layer

# main model
def DecisionTransformer(R, s, a, t):
    # compute embeddings for tokens
    pos_embedding = embed_t(t)  # per-timestep (note: not per-token)
    s_embedding = embed_s(s) + pos_embedding
    a_embedding = embed_a(a) + pos_embedding
    R_embedding = embed_R(R) + pos_embedding

    # interleave tokens as (R_1, s_1, a_1, ..., R_K, s_K)
    input_embeds = stack(R_embedding, s_embedding, a_embedding)

    # use transformer to get hidden states
    hidden_states = transformer(input_embeds=input_embeds)

    # select hidden states for action prediction tokens
    a_hidden = unstack(hidden_states).actions

    # predict action
    return pred_a(a_hidden)

# training loop
for (R, s, a, t) in dataloader:  # dims: (batch_size, K, dim)
    a_preds = DecisionTransformer(R, s, a, t)
    loss = mean((a_preds - a)**2)  # L2 loss for continuous actions
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# evaluation loop
target_return = 1  # for instance, expert-level return
R, s, a, t, done = [target_return], [env.reset()], [], [1], False
while not done:  # autoregressive generation/sampling
    # sample next action
    action = DecisionTransformer(R, s, a, t)[-1]  # for cts actions
    new_s, r, done, _ = env.step(action)

    # append new tokens to sequence
    R = R + [R[-1] - r]  # decrement returns-to-go with reward
    s, a, t = s + [new_s], a + [action], t + [len(R)]
    R, s, a, t = R[-K:], ...  # only keep context length of K

4 Evaluations on Offline RL Benchmarks

In this section, we investigate the performance of Decision Transformer relative to dedicated offline RL and imitation learning algorithms. In particular, our primary points of comparison are model-free offline RL algorithms based on TD-learning, since our Decision Transformer architecture is fundamentally model-free in nature as well. Furthermore, TD-learning is the dominant paradigm in RL for sample efficiency, and also features prominently as a sub-routine in many model-based RL algorithms [16, 17]. We also compare with behavior cloning and variants, since it also involves a likelihood based policy learning formulation similar to ours. The exact algorithms depend on the environment but our motivations are as follows:

  • TD learning: most of these methods use an action-space constraint or value pessimism, and will be the most faithful comparison to Decision Transformer, representing standard RL methods. A state-of-the-art model-free method is Conservative Q-Learning (CQL) [14] which serves as our primary comparison. In addition, we also compare against other prior model-free RL algorithms like BEAR [18] and BRAC [19].

  • Imitation learning: this regime similarly uses supervised losses for training, rather than Bellman backups. We use behavior cloning here, and include a more detailed discussion in Section 5.1.


We evaluate on both discrete (Atari [10]) and continuous (OpenAI Gym [11]) control tasks. The former involves high-dimensional observation spaces and requires long-term credit assignment, while the latter requires fine-grained continuous control, representing a diverse set of tasks. Our main results are summarized in Figure 3, where we show averaged normalized performance for each domain.

Figure 3: Results comparing Decision Transformer (ours) to TD learning (CQL) and behavior cloning across Atari, OpenAI Gym, and Minigrid. On a diverse set of tasks, Decision Transformer performs comparably or better than traditional approaches. Performance is measured by normalized episode return (see text for details).

4.1 Atari

The Atari benchmark [10] is challenging due to its high-dimensional visual inputs and difficulty of credit assignment arising from the delay between actions and resulting rewards. We evaluate our method on 1% of all samples in the DQN-replay dataset as per Agarwal et al. [13], representing 500 thousand of the 50 million transitions observed by an online DQN agent [20] during training; we report the mean and standard deviation of 3 seeds. We normalize scores based on a professional gamer, following the protocol of Hafner et al. [21], where 100 represents the professional gamer score and 0 represents a random policy.
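The text above fixes only the two anchor points of the normalization: 0 for a random policy and 100 for the professional gamer. A linear rescaling between those anchors is the natural reading, and the sketch below assumes that form. The function name is illustrative and the numbers in the usage line are made up, not taken from any Atari game.

```python
def normalized_score(raw: float, random_score: float, reference_score: float) -> float:
    """Linearly rescale so that random_score maps to 0 and reference_score maps to 100."""
    if reference_score == random_score:
        raise ValueError("reference and random scores must differ")
    return 100.0 * (raw - random_score) / (reference_score - random_score)


# Hypothetical raw scores: random policy 1.0, professional gamer 29.0, agent 15.0.
print(normalized_score(15.0, 1.0, 29.0))  # 50.0
```

Under this form an agent that beats the reference scores above 100, which is how values such as those above 100 in the tables can arise.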

| Game | DT (Ours) | CQL | QR-DQN | REM | BC |
|---|---|---|---|---|---|
| Breakout | $\mathbf{267.5} \pm 97.5$ | 211.1 | | | $138.9 \pm 61.7$ |
| Qbert | $15.4 \pm 11.4$ | $\mathbf{104.2}$ | | | $17.3 \pm 14.7$ |
| Pong | $106.1 \pm 8.1$ | $\mathbf{111.9}$ | | | $85.2 \pm 20.0$ |
| Seaquest | $\mathbf{2.5} \pm 0.4$ | | | | $2.1 \pm 0.3$ |

Table 1: Gamer-normalized scores for the 1% DQN-replay Atari dataset. We report the mean and variance across 3 seeds. Best mean scores are highlighted in bold. Decision Transformer (DT) performs comparably to CQL on 3 out of 4 games, and outperforms other baselines in most games.

We compare to CQL [14], REM [13], and QR-DQN [22] on four Atari tasks (Breakout, Qbert, Pong, and Seaquest) that are evaluated in Agarwal et al. [13]. We use context lengths of $K=30$ for Decision Transformer (except $K=50$ for Pong). We also report the performance of behavior cloning (BC), which utilizes the same network architecture and hyperparameters as Decision Transformer but does not have return-to-go conditioning (footnote 2: we also tried using an MLP with $K=1$ as in prior work, but found this was worse than the transformer). For CQL, REM, and QR-DQN baselines, we report numbers directly from the CQL and REM papers. We show results in Table 1. Our method is competitive with CQL in 3 out of 4 games and outperforms or matches REM, QR-DQN, and BC on all 4 games.

4.2 OpenAI Gym

In this section, we consider the continuous control tasks from the D4RL benchmark [23]. We also consider a 2D reacher environment that is not part of the benchmark, and generate the datasets using a similar methodology to the D4RL benchmark. Reacher is a goal-conditioned task and has sparse rewards, so it represents a different setting than the standard locomotion environments (HalfCheetah, Hopper, and Walker). The different dataset settings are described below.

  1. Medium: 1 million timesteps generated by a “medium” policy that achieves approximately one-third the score of an expert policy.
  2. Medium-Replay: the replay buffer of an agent trained to the performance of a medium policy (approximately 25k-400k timesteps in our environments).
  3. Medium-Expert: 1 million timesteps generated by the medium policy concatenated with 1 million timesteps generated by an expert policy.

We compare to CQL [14], BEAR [18], BRAC [19], and AWR [24]. CQL represents the state-of-the-art in model-free offline RL, an instantiation of TD learning with value pessimism. Scores are normalized so that 100 represents an expert policy, as per Fu et al. [23]. CQL numbers are reported from the original paper; BC numbers are run by us; and the other methods are reported from the D4RL paper. Our results are shown in Table 2. Decision Transformer achieves the highest scores in a majority of the tasks and is competitive with the state of the art in the remaining tasks.

| Dataset | Environment | DT (Ours) | CQL | BEAR | BRAC-v | AWR | BC |
|---|---|---|---|---|---|---|---|
| Medium-Expert | HalfCheetah | $\mathbf{86.8} \pm 1.3$ | 62.4 | 53.4 | 41.9 | 52.7 | 59.9 |
| Medium-Expert | Hopper | $107.6 \pm 1.8$ | $\mathbf{111.0}$ | 96.3 | | | 79.6 |
| Medium-Expert | Walker | $\mathbf{108.1} \pm 0.2$ | 98.7 | | 81.6 | 53.8 | 36.6 |
| Medium-Expert | Reacher | $\mathbf{89.1 \pm 1.3}$ | 30.6 | - | - | - | 73.3 |
| Medium | HalfCheetah | $42.6 \pm 0.1$ | 44.4 | 41.7 | $\mathbf{46.3}$ | 37.4 | |
| Medium | Hopper | $\mathbf{67.6} \pm 1.0$ | | | | 35.9 | 63.9 |
| Medium | Walker | $74.0 \pm 1.4$ | | | $\mathbf{81.1}$ | 17.4 | 77.3 |
| Medium | Reacher | $\mathbf{51.2 \pm 3.4}$ | | - | - | - | $\mathbf{48.9}$ |
| Medium-Replay | HalfCheetah | $36.6 \pm 0.8$ | | 38.6 | $\mathbf{47.7}$ | 40.3 | |
| Medium-Replay | Hopper | $\mathbf{82.7 \pm 7.0}$ | 48.6 | 33.7 | | 28.4 | 27.6 |
| Medium-Replay | Walker | $\mathbf{66.6 \pm 3.0}$ | 26.7 | | | 15.5 | 36.9 |
| Medium-Replay | Reacher | $\mathbf{18.0 \pm 2.4}$ | $\mathbf{19.0}$ | - | - | - | |
| Average (Without Reacher) | | $\mathbf{74.7}$ | 63.9 | | 36.9 | 34.3 | 46.4 |
| Average (All Settings) | | $\mathbf{69.2}$ | | - | - | - | 47.7 |

Table 2: Results for D4RL datasets. We report the mean and variance for three seeds. Decision Transformer (DT) outperforms conventional RL algorithms on almost all tasks. (Footnote 3: Given that CQL is generally the strongest TD learning method, for Reacher we only run the CQL baseline.)

5 Discussion

5.1 Does Decision Transformer perform behavior cloning on a subset of the data?

In this section, we seek to gain insight into whether Decision Transformer can be thought of as performing imitation learning on a subset of the data with a certain return. To investigate this, we propose a new method, Percentile Behavior Cloning (%BC), where we run behavior cloning on only the top $X\%$ of timesteps in the dataset, ordered by episode returns. The percentile $X\%$ interpolates between standard BC ($X = 100\%$) that trains on the entire dataset and only cloning the best observed trajectory ($X \to 0\%$), trading off better generalization from training on more data against training a specialized model that focuses on a desirable subset of the data.
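The data selection step of %BC can be sketched as follows. This is one possible reading of "top $X\%$ of timesteps, ordered by episode returns": whole trajectories are ranked by return and kept until the timestep budget is used up, and the best trajectory is always kept so that $X \to 0\%$ reduces to cloning the best observed trajectory. The dictionary layout, the function name, and the toy data are assumptions made for illustration.

```python
from typing import Dict, List


def top_percent_trajectories(trajectories: List[Dict], percent: float) -> List[Dict]:
    """Keep the highest-return trajectories covering about `percent`% of all timesteps."""
    total_steps = sum(len(traj["rewards"]) for traj in trajectories)
    budget = total_steps * percent / 100.0
    ranked = sorted(trajectories, key=lambda traj: sum(traj["rewards"]), reverse=True)
    kept, used = [], 0
    for traj in ranked:
        # Always keep the single best trajectory, then stop once the budget is exceeded.
        if kept and used + len(traj["rewards"]) > budget:
            break
        kept.append(traj)
        used += len(traj["rewards"])
    return kept


# Toy dataset with made-up rewards: returns are A = 4, B = 1, C = 8.
trajs = [
    {"name": "A", "rewards": [1, 1, 1, 1]},
    {"name": "B", "rewards": [0, 1]},
    {"name": "C", "rewards": [2, 2, 2, 2]},
]
print([t["name"] for t in top_percent_trajectories(trajs, 40)])   # ['C']
print([t["name"] for t in top_percent_trajectories(trajs, 100)])  # ['C', 'A', 'B']
```

Standard behavior cloning would then be run on the kept trajectories only.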

We show full results comparing %BC to Decision Transformer and CQL in Table 3, sweeping over $X \in [10\%, 25\%, 40\%, 100\%]$. Note that the only way to choose the optimal subset for cloning is to evaluate using rollouts from the environment, so %BC is not a realistic approach; rather, it serves to provide insight into the behavior of Decision Transformer. When data is plentiful – as in the D4RL regime – we find %BC can match or beat other offline RL methods. On most environments, Decision Transformer is competitive with the performance of the best %BC, indicating it can home in on a particular subset after training on the entire dataset distribution.

Dataset  Environment  DT (Ours)  10%BC  25%BC  40%BC  100%BC  CQL
Medium  HalfCheetah  $42.6 \pm 0.1$  42.9  $\mathbf{44.4}$
Medium  Hopper  $\mathbf{67.6 \pm 1.0}$  65.9  65.3  63.9
Medium  Walker  $74.0 \pm 1.4$  78.8  $\mathbf{80.9}$  78.8  77.3
Medium  Reacher  $51.2 \pm 3.4$  48.9  $\mathbf{58.4}$
Medium-Replay  HalfCheetah  $36.6 \pm 0.8$  40.8  40.9  $\mathbf{46.2}$
Medium-Replay  Hopper  $\mathbf{82.7 \pm 7.0}$  70.6  58.6  27.6  48.6
Medium-Replay  Walker  $66.6 \pm 3.0$  $\mathbf{70.4}$  67.8  36.9  26.7
Medium-Replay  Reacher  $18.0 \pm 2.4$  $\mathbf{33.1}$  10.7
Average  $\mathbf{56.7}$  52.7  49.4  39.5  43.5
Table 3: Comparison between Decision Transformer (DT) and Percentile Behavior Cloning (%BC).

In contrast, when we study low data regimes – such as Atari, where we use 1% of a replay buffer as the dataset – %BC is weak (shown in Table 4). This suggests that in scenarios with relatively low amounts of data, Decision Transformer can outperform %BC by using all trajectories in the dataset to improve generalization, even if those trajectories are dissimilar from the return conditioning target. Our results indicate that Decision Transformer can be more effective than simply performing imitation learning on a subset of the dataset. On the tasks we considered, Decision Transformer either outperforms or is competitive with %BC, without the confound of having to select the optimal subset.

| Game | DT (Ours) | 10%BC | 25%BC | 40%BC | 100%BC |
|---|---|---|---|---|---|
| Breakout | $\mathbf{267.5} \pm 97.5$ | $28.5 \pm 8.2$ | $73.5 \pm 6.4$ | $108.2 \pm 67.5$ | $138.9 \pm 61.7$ |
| Qbert | $15.4 \pm 11.4$ | $6.6 \pm 1.7$ | $16.0 \pm 13.8$ | $11.8 \pm 5.8$ | $\mathbf{17.3} \pm 14.7$ |
| Pong | $\mathbf{106.1} \pm 8.1$ | $2.5 \pm 0.2$ | $13.3 \pm 2.7$ | $72.7 \pm 13.3$ | $85.2 \pm 20.0$ |
| Seaquest | $\mathbf{2.5} \pm 0.4$ | $1.1 \pm 0.2$ | $1.1 \pm 0.2$ | $1.6 \pm 0.4$ | $2.1 \pm 0.3$ |

Table 4: %BC scores for Atari. We report the mean and variance across 3 seeds. Decision Transformer (DT) outperforms all versions of %BC in most games.

5.2 How well does Decision Transformer model the distribution of returns?

We evaluate the ability of Decision Transformer to understand return-to-go tokens by varying the desired target return over a wide range – evaluating the multi-task distribution modeling capability of transformers. Figure 4 shows the average sampled return accumulated by the agent over the course of the evaluation episode for varying values of target return. On every task, the desired target returns and the true observed returns are highly correlated. On some tasks like Pong, HalfCheetah and Walker, Decision Transformer generates trajectories that almost perfectly match the desired returns (as indicated by the overlap with the oracle line). Furthermore, on some Atari tasks like Seaquest, we can prompt the Decision Transformer with higher returns than the maximum episode return available in the dataset, demonstrating that Decision Transformer is sometimes capable of extrapolation.

Figure 4: Sampled (evaluation) returns accumulated by Decision Transformer when conditioned on the specified target (desired) returns. Top: Atari. Bottom: D4RL medium-replay datasets.
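
For readers who want the return-to-go quantity spelled out, here is a minimal sketch of how return-to-go values can be computed from the rewards of a logged trajectory: each entry is the sum of rewards from that timestep to the end of the episode. The function name `returns_to_go` and the toy reward list are assumptions for illustration.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Entry t is the sum of rewards from timestep t to the end of the episode."""
    rtg: List[float] = []
    running = 0.0
    # Accumulate from the last reward backwards, then restore time order.
    for reward in reversed(rewards):
        running += reward
        rtg.append(running)
    rtg.reverse()
    return rtg


rewards = [1.0, 0.0, 2.0, 1.0]
print(returns_to_go(rewards))  # [4.0, 3.0, 3.0, 1.0]
```

At evaluation time there are no logged rewards to sum, so the first entry is replaced by the user-specified target return; that is the quantity swept along the horizontal axis in Figure 4.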

5.3 What is the benefit of using a longer context length?

To assess the importance of access to previous states, actions, and returns, we ablate on the context length $K$. This is interesting since it is generally considered that the previous state (i.e. $K=1$) is enough for reinforcement learning algorithms when frame stacking is used, as we do. Table 5 shows that performance of Decision Transformer is significantly worse when $K=1$, indicating that past information is useful for Atari games. One hypothesis is that when we are representing a distribution of policies, as with sequence modeling, the context allows the transformer to identify which policy generated the actions, enabling better learning and/or improving the training dynamics.

| Game | DT (Ours) | DT with no context ($K=1$) |
|---|---|---|
| Breakout | $\mathbf{267.5} \pm 97.5$ | $73.9 \pm 10$ |
| Qbert | $\mathbf{15.1} \pm 11.4$ | $13.6 \pm 11.3$ |
| Pong | $\mathbf{106.1} \pm 8.1$ | $2.5 \pm 0.2$ |
| Seaquest | $\mathbf{2.5} \pm 0.4$ | $0.6 \pm 0.1$ |

Table 5: Ablation on context length. Decision Transformer (DT) performs better when using a longer context length ($K=50$ for Pong, $K=30$ for others).
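
To make the role of $K$ concrete, the following sketch truncates an episode history to its last $K$ timesteps before it would be handed to the model. The `Step` tuple layout and the helper name `context_window` are hypothetical; the count of three tokens per timestep follows the (return, state, action) layout described for Figure 1.

```python
from typing import Any, List, Tuple

# Hypothetical per-timestep record: (return-to-go, state, action).
Step = Tuple[float, Any, Any]


def context_window(history: List[Step], k: int) -> List[Step]:
    """Keep only the most recent `k` timesteps of the episode history."""
    if k < 1:
        raise ValueError("context length k must be at least 1")
    return history[-k:]


# Toy history of five timesteps with placeholder states and actions.
history = [(10.0 - t, f"s{t}", f"a{t}") for t in range(5)]

no_context = context_window(history, 1)     # only the current timestep (K=1)
long_context = context_window(history, 30)  # whole history while it is shorter than K
num_tokens = 3 * len(long_context)          # one token each for return, state, action
```

With `k=1` the model sees nothing about how earlier actions were chosen, which matches the ablated setting in Table 5.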

5.4 Does Decision Transformer perform effective long-term credit assignment?

To evaluate long-term credit assignment capabilities of our model, we consider a variant of the Key-to-Door environment proposed in Mesnard et al. [12]. This is a grid-based environment with a sequence of three phases: (1) in the first phase, the agent is placed in a room with a key; (2) then, the agent is placed in an empty room; (3) and finally, the agent is placed in a room with a door. The agent receives a binary reward when reaching the door in the third phase, but only if it picked up the key in the first phase. This problem is difficult for credit assignment because credit must be propagated from the beginning to the end of the episode, skipping over actions taken in the middle.

We train on datasets of trajectories generated by applying random actions and report success rates in Table 6. Furthermore, for the Key-to-Door environment we use the entire episode length as the context, rather than having a fixed context window as in the other environments. Methods that use hindsight return information, namely our Decision Transformer model and %BC (trained only on successful episodes), are able to learn effective policies, producing near-optimal paths despite training only on random walks. TD learning (CQL) cannot effectively propagate Q-values over the long horizons involved and performs poorly.

| Dataset | DT (Ours) | CQL | BC | %BC | Random |
|---|---|---|---|---|---|
| 1K Random Trajectories | $\mathbf{71.8\%}$ | $13.1\%$ | $1.4\%$ | $69.9\%$ | $3.1\%$ |
| 10K Random Trajectories | $94.6\%$ | $13.3\%$ | $1.6\%$ | $\mathbf{95.1\%}$ | $3.1\%$ |

Table 6: Success rate for Key-to-Door environment. Methods using hindsight (Decision Transformer, %BC) can learn successful policies, while TD learning struggles to perform credit assignment.
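
The structure of the task can be sketched with a heavily simplified stand-in: a one-dimensional corridor replaces each grid room, and the key and the door both sit at the right end. Everything here (`random_episode`, the corridor size, the number of steps per phase) is a hypothetical toy, not the environment of Mesnard et al. or the authors' variant. It only shows the three-phase layout, the single binary reward, and how successful random episodes can be filtered in hindsight.

```python
import random
from typing import List, Tuple


def random_episode(rng: random.Random, size: int = 4,
                   steps_per_phase: int = 6) -> Tuple[List[int], int]:
    """Roll out uniformly random moves; return (actions, binary episode reward)."""
    actions: List[int] = []
    has_key = False
    reward = 0
    for phase in ("key", "empty", "door"):
        pos = 0  # the agent starts every phase at the left end of the corridor
        for _ in range(steps_per_phase):
            move = rng.choice((-1, 1))
            actions.append(move)
            pos = min(size - 1, max(0, pos + move))
            if pos == size - 1:
                if phase == "key":
                    has_key = True   # key picked up in the first phase
                elif phase == "door" and has_key:
                    reward = 1       # the door only pays off if the key was taken
    return actions, reward


rng = random.Random(0)
episodes = [random_episode(rng) for _ in range(1000)]
# Hindsight filtering: keep only episodes that turned out to succeed.
successes = [episode for episode in episodes if episode[1] == 1]
print(f"random-walk success rate: {len(successes) / len(episodes):.1%}")
```

Note that the reward depends on a first-phase event but is only observed in the third phase, so the middle-phase actions carry no signal. A %BC-style baseline would clone only `successes`, while a return-conditioned model would train on all `episodes` and be prompted with a return of 1.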

5.5 Can transformers be accurate critics in sparse reward settings?

In previous sections, we established that Decision Transformer can produce effective policies (actors). We now evaluate whether transformer models can also be effective critics. We modify Decision Transformer to output return tokens in addition to action tokens on the Key-to-Door environment. Additionally, the first return token is not given, but it is predicted instead (i.e. the model learns the initial distribution $p(\hat{R}_1)$), similar to standard autoregressive generative models. We find that the transformer continuously updates reward probability based on events during the episode, shown in Figure 5 (Left). Furthermore, we find the transformer attends to critical events in the episode (picking up the key or reaching the door), shown in Figure 5 (Right), indicating formation of state-reward associations as discussed in Raposo et al. [25] and enabling accurate value prediction.

Figure 5: Left: Averages of running return probabilities predicted by the transformer model for three types of episode outcomes. Right: Transformer attention weights from all timesteps superimposed for a particular successful episode. The model attends to steps near pivotal events in the episode, such as picking up the key and reaching the door.

5.6 Does Decision Transformer perform well in sparse reward settings?

A known weakness of TD learning algorithms is that they require densely populated rewards in order to perform well, which can be unrealistic and/or expensive. In contrast, Decision Transformer can improve robustness in these settings since it makes minimal assumptions on the density of the reward. To evaluate this, we consider a delayed return version of the D4RL benchmarks where the agent does not receive any rewards along the trajectory, and instead receives the cumulative reward of the trajectory in the final timestep. Our results for delayed returns are shown in Table 7. Delayed returns minimally affect Decision Transformer, owing to the nature of its training process, while imitation learning methods are reward agnostic. While TD learning collapses, Decision Transformer and %BC still perform well, indicating that Decision Transformer can be more robust to delayed rewards.

Dataset  Environment  DT (Ours, Delayed)  CQL (Delayed)  BC (Agnostic)  %BC (Agnostic)  DT (Ours, Original)  CQL (Original)
Medium-Expert  Hopper  $\mathbf{107.3 \pm 3.5}$  $59.9$  $102.6$  $107.6$  $111.0$
Medium  Hopper  $60.7 \pm 4.5$  $63.9$  $\mathbf{65.9}$  $67.6$
Medium-Replay  Hopper  $\mathbf{78.5 \pm 3.7}$  $27.6$  $70.6$  $82.7$  $48.6$
Table 7: Results for D4RL datasets with delayed (sparse) reward. Decision Transformer (DT) and imitation learning are minimally affected by the removal of dense rewards, while CQL fails.
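
The delayed-return transformation described above can be sketched in a few lines: every reward is zeroed except the last, which receives the trajectory's cumulative reward. The helper names `delay_rewards` and `returns_to_go` are assumptions for this illustration.

```python
from typing import List


def delay_rewards(rewards: List[float]) -> List[float]:
    """Move the whole episode return onto the final timestep."""
    if not rewards:
        return []
    return [0.0] * (len(rewards) - 1) + [float(sum(rewards))]


def returns_to_go(rewards: List[float]) -> List[float]:
    """Entry t is the sum of rewards from timestep t to the end of the episode."""
    rtg: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running += reward
        rtg.append(running)
    rtg.reverse()
    return rtg


dense = [1.0, 2.0, 3.0]
delayed = delay_rewards(dense)
print(delayed)                 # [0.0, 0.0, 6.0]
print(returns_to_go(dense))    # [6.0, 5.0, 3.0]
print(returns_to_go(delayed))  # [6.0, 6.0, 6.0]
```

The return-to-go at the first timestep is identical in both versions, so the initial conditioning target is unchanged. Only the intermediate return-to-go values differ, which is one plausible reading of why a return-conditioned model is affected so little.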

5.7 Why does Decision Transformer avoid the need for value pessimism or behavior regularization?

One key difference between Decision Transformer and prior offline RL algorithms is that we do not require policy regularization or conservatism to achieve good performance. Our conjecture is that TD-learning based algorithms learn an approximate value function and improve the policy by optimizing this value function. This act of optimizing a learned function can exacerbate and exploit any inaccuracies in the value function approximation, causing failures in policy improvement. Since Decision Transformer does not require explicit optimization using learned functions as objectives, it avoids the need for regularization or conservatism.
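
The conjecture about optimizing a learned function can be illustrated with a small numerical experiment that is not from the paper. Assume every action has a true value of 0 and the learned value estimate is corrupted by zero-mean Gaussian noise. Picking the action with the highest estimate then systematically selects positive noise. The function name and all parameters below are hypothetical.

```python
import random


def mean_max_estimate(num_actions: int = 10, noise_std: float = 1.0,
                      trials: int = 1000, seed: int = 0) -> float:
    """Average, over trials, of the largest noisy value estimate.

    Every action's true value is 0, so any positive result is pure
    overestimation caused by maximizing over estimation errors.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


print(mean_max_estimate(num_actions=1))   # close to 0: no maximization, no bias
print(mean_max_estimate(num_actions=10))  # clearly positive: errors are exploited
```

A method that never maximizes over learned estimates, as is the case for a purely supervised sequence modeling objective, does not incur this particular bias, which is consistent with the explanation offered above.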

5.8 How can Decision Transformer benefit online RL regimes?

Offline RL and the ability to model behaviors have the potential to enable sample-efficient online RL for downstream tasks. Works studying the transition from offline to online generally find that likelihood-based approaches, like our sequence modeling objective, are more successful [26, 27]. As a result, although we studied offline RL in this work, we believe Decision Transformer can meaningfully improve online RL methods by serving as a strong model for behavior generation. For instance, Decision Transformer can serve as a powerful “memorization engine” and, in conjunction with powerful exploration algorithms like Go-Explore [28], has the potential to simultaneously model and generate a diverse set of behaviors.

6 Related Work

6.1 Offline reinforcement learning

To mitigate the impact of distribution shift in offline RL, prior algorithms (a) constrain the policy action space [29, 30, 31], (b) incorporate value pessimism [29, 14], or (c) incorporate pessimism into learned dynamics models [32, 33]. Since we do not use Decision Transformers to explicitly learn the dynamics model, we primarily compare against model-free algorithms in our work; in particular, adding a dynamics model tends to improve the performance of model-free algorithms. Another line of work explores learning a wide behavior distribution from an offline dataset by learning a task-agnostic set of skills, either with likelihood-based approaches [34, 35, 36, 37] or by maximizing mutual information [38, 39, 40]. Our work is similar to the likelihood-based approaches, which do not use iterative Bellman updates, although we use a simpler sequence modeling objective instead of a variational method, and use rewards for conditional generation of behaviors.

6.2 Supervised learning in reinforcement learning settings

Some prior methods for reinforcement learning bear more resemblance to static supervised learning, such as Q-learning [41, 42], which still uses iterative backups, or likelihood-based methods such as behavior cloning, which do not (discussed in previous section). Recent work [43, 44, 45] studies “upside-down” reinforcement learning (UDRL), which is similar to our method in seeking to model behaviors with a supervised loss conditioned on the target return. A key difference in our work is the shift of motivation to sequence modeling rather than supervised learning: while the practical methods differ primarily in the context length and architecture, sequence modeling enables behavior modeling even without access to the reward, in a similar style to language [9] or images [46], and is known to scale well [2]. The method proposed by Kumar et al. [44] is most similar to our method with $K=1$, which we find sequence modeling/long contexts to outperform (see Section 5.3). Ghosh et al. [47] extend prior UDRL methods to use state goal conditioning, rather than rewards, and Paster et al. [48] further use an LSTM with state goal conditioning for goal-conditioned online RL settings.

Concurrent to our work, Janner et al. [49] propose Trajectory Transformer, which is similar to Decision Transformer but additionally uses state and return prediction, as well as discretization, which incorporates model-based components. We believe that their experiments, in addition to our results, highlight the potential for sequence modeling to be a generally applicable idea for reinforcement learning.

6.3 Credit assignment

Many works have studied better credit assignment via state-association, learning an architecture which decomposes the reward function such that certain “important” states comprise most of the credit [50, 51, 12]. They use the learned reward function to change the reward of an actor-critic algorithm to help propagate signal over long horizons. In particular, similar to our long-term setting, some works have specifically shown such state-associative architectures can perform better in delayed reward settings [52, 7, 53, 25]. In contrast, we allow these properties to naturally emerge in a transformer architecture, without having to explicitly learn a reward function or a critic.

6.4 Conditional language generation

Various works have studied guided generation for images [54] and language [55, 56]. Several works [57, 58, 59, 60, 61, 62] have explored training or fine-tuning of models for controllable text generation. Class-conditional language models can also be used to learn discriminators to guide generation [63, 55, 64, 65]. However, these approaches mostly assume constant “classes”, while in reinforcement learning the reward signal is time-varying. Furthermore, it is more natural to prompt the model with a desired target return and continuously decrease it by the observed rewards over time, since the transformer model and environment jointly generate the trajectory.
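
The prompting scheme in the last sentence can be sketched as an evaluation loop: the conditioning value starts at the desired target return and is decreased by each observed reward. The `policy` and `step` callables, the toy environment, and the toy threshold policy are all hypothetical stand-ins; in particular, `toy_policy` is a hand-written rule, not a trained transformer.

```python
from typing import Callable, List, Tuple

Policy = Callable[[List[float], List[int], List[int]], int]
StepFn = Callable[[int, int], Tuple[int, float, bool]]


def rollout_with_target(policy: Policy, step: StepFn, initial_state: int,
                        target_return: float, max_steps: int) -> Tuple[float, List[float]]:
    """Run one episode, decreasing the conditioning return by each observed reward."""
    returns_to_go = [target_return]
    states = [initial_state]
    actions: List[int] = []
    total = 0.0
    for _ in range(max_steps):
        action = policy(returns_to_go, states, actions)
        next_state, reward, done = step(states[-1], action)
        actions.append(action)
        total += reward
        if done:
            break
        # The next conditioning token is the target minus the reward achieved so far.
        returns_to_go.append(returns_to_go[-1] - reward)
        states.append(next_state)
    return total, returns_to_go


def toy_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Five-step episode; action 1 earns a reward of 1, action 0 earns nothing."""
    next_state = state + 1
    return next_state, float(action), next_state >= 5


def toy_policy(returns_to_go: List[float], states: List[int], actions: List[int]) -> int:
    """Collect reward only while some of the target return is still outstanding."""
    return 1 if returns_to_go[-1] > 0 else 0


total, conditioning = rollout_with_target(toy_policy, toy_step, 0, 3.0, 10)
print(total)         # 3.0
print(conditioning)  # [3.0, 2.0, 1.0, 0.0, 0.0]
```

Unlike a fixed class label, the conditioning sequence changes at every step because it depends on rewards produced by the environment, which is the time-varying aspect the paragraph contrasts with class-conditional generation.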

6.5 Attention and transformer models

Transformers [1] have been applied successfully to many tasks in natural language processing [66, 9] and computer vision [67, 68]. However, transformers are relatively unstudied in RL, mostly due to the differing nature of the problem, such as higher variance in training. Zambaldi et al. [5] showed that augmenting transformers with relational reasoning improves performance in combinatorial environments, and Ritter et al. [69] showed that iterative self-attention allowed RL agents to better utilize episodic memories. Parisotto et al. [4] discussed design decisions for more stable training of transformers in the high-variance RL setting. Unlike our work, these still use actor-critic algorithms for optimization, focusing on novelty in architecture. Additionally, in imitation learning, some works have studied transformers as a replacement for LSTMs: Dasari and Gupta [70] study one-shot imitation learning, and Abramson et al. [71] combine language and image modalities for text-conditioned behavior generation.

7 Conclusion

We proposed Decision Transformer, seeking to unify ideas in language/sequence modeling and reinforcement learning. On standard offline RL benchmarks, we showed Decision Transformer can match or outperform strong algorithms designed explicitly for offline RL with minimal modifications from standard language modeling architectures.

We hope this work inspires more investigation into using large transformer models for RL. We used a simple supervised loss that was effective in our experiments, but applications to large-scale datasets could benefit from self-supervised pretraining tasks. In addition, one could consider more sophisticated embeddings for returns, states, and actions; for instance, conditioning on return distributions to model stochastic settings instead of deterministic returns. Transformer models can also be used to model the state evolution of a trajectory, potentially serving as an alternative to model-based RL, and we hope to explore this in future work.

For real-world applications, it is important to understand the types of errors transformers make in MDP settings and possible negative consequences, which are underexplored. It will also be important to consider the datasets we train models on, which can potentially add destructive biases, particularly as we consider studying augmenting RL agents with more data which may come from questionable sources. For instance, reward design by nefarious actors can potentially generate unintended behaviors as our model generates behaviors by conditioning on desired returns.

Acknowledgements

This research was supported by Berkeley Deep Drive, Open Philanthropy, and the National Science Foundation under NSF:NRI #2024675. Part of this work was completed when Aravind Rajeswaran was a PhD student at the University of Washington, where he was supported by the J.P. Morgan PhD Fellowship in AI (2020-21). We also thank Luke Metz and Daniel Freeman for valuable feedback and discussions, as well as Justin Fu for assistance in setting up D4RL benchmarks, and Aviral Kumar for assistance with the CQL baselines and hyperparameters.

References

  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  • Brown et al. [2020] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
  • Parisotto et al. [2020] Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International Conference on Machine Learning, 2020.
  • Zambaldi et al. [2018] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2018.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • Hung et al. [2019] Chia-Chun Hung, Timothy Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. Optimizing agent behavior over long time scales by transporting value. Nature communications, 10(1):1–12, 2019.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
  • Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Mesnard et al. [2020] Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, et al. Counterfactual credit assignment in model-free reinforcement learning. arXiv preprint arXiv:2011.09464, 2020.
  • Agarwal et al. [2020] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.
  • Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Sutton [1990] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
  • Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498–12509, 2019.
  • Kumar et al. [2019a] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019a.
  • Wu et al. [2019] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Hafner et al. [2020] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020.
  • Dabney et al. [2018] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Conference on Artificial Intelligence, 2018.
  • Fu et al. [2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Peng et al. [2019] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Raposo et al. [2021] David Raposo, Sam Ritter, Adam Santoro, Greg Wayne, Theophane Weber, Matt Botvinick, Hado van Hasselt, and Francis Song. Synthetic returns for long-term credit assignment. arXiv preprint arXiv:2102.12425, 2021.
  • Gao et al. [2018] Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.
  • Nair et al. [2020] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
  • Ecoffet et al. [2019] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.
  • Kumar et al. [2019b] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019b.
  • Siegel et al. [2020] Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020.
  • Kidambi et al. [2020] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
  • Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, 2020.
  • Ajay et al. [2020] Ajay 等人[2020] Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. Opal: Offline primitive discovery for accelerating offline reinforcement learning. arXiv preprint arXiv:2010.13611, 2020.
    Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine 和 Ofir Nachum. Opal: 用于加速离线强化学习的离线原始发现。arXiv 预印本 arXiv:2010.13611,2020 年。
  • Campos et al. [2020] Campos 等人[2020] Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i Nieto, and Jordi Torres. Explore, discover and learn: Unsupervised discovery of state-covering skills. In International Conference on Machine Learning, 2020.
    Victor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i Nieto 和 Jordi Torres. 探索、发现和学习:无监督发现状态覆盖技能。在 2020 年机器学习国际会议上。
  • Pertsch et al. [2020] Pertsch 等人[2020] Karl Pertsch, Youngwoon Lee, and Joseph J Lim. Accelerating reinforcement learning with learned skill priors. arXiv preprint arXiv:2010.11944, 2020.
    Karl Pertsch, Youngwoon Lee 和 Joseph J Lim. 利用学习的技能先验加速强化学习。arXiv 预印本 arXiv:2010.11944,2020。
  • Singh et al. [2021] Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. In International Conference on Learning Representations, 2021.
  • Eysenbach et al. [2019] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019.
  • Lu et al. [2020] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020.
  • Sharma et al. [2020] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020.
  • Watkins [1989] Christopher Watkins. Learning from delayed rewards. 01 1989.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Srivastava et al. [2019] Rupesh Kumar Srivastava, Pranav Shyam, Filipe Mutz, Wojciech Jaśkowski, and Jürgen Schmidhuber. Training agents using upside-down reinforcement learning. arXiv preprint arXiv:1912.02877, 2019.
  • Kumar et al. [2019c] Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies. arXiv preprint arXiv:1912.13465, 2019c.
  • ogm [2019] Acting without rewards. 2019. URL https://ogma.ai/2019/08/acting-without-rewards/.
  • Chen et al. [2020] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
  • Ghosh et al. [2019] Dibya Ghosh, Abhishek Gupta, Justin Fu, Ashwin Reddy, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals without reinforcement learning. arXiv preprint arXiv:1912.06088, 2019.
  • Paster et al. [2020] Keiran Paster, Sheila A McIlraith, and Jimmy Ba. Planning from pixels using inverse dynamics models. arXiv preprint arXiv:2012.02419, 2020.
  • Janner et al. [2021] Michael Janner, Qiyang Li, and Sergey Levine. Reinforcement learning as one big sequence modeling problem. arXiv preprint arXiv:2106.02039, 2021.
  • Ferret et al. [2019] Johan Ferret, Raphaël Marinier, Matthieu Geist, and Olivier Pietquin. Self-attentional credit assignment for transfer in reinforcement learning. arXiv preprint arXiv:1907.08027, 2019.
  • Harutyunyan et al. [2019] Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Greg Wayne, Satinder Singh, Doina Precup, et al. Hindsight credit assignment. arXiv preprint arXiv:1912.02503, 2019.
  • Arjona-Medina et al. [2018] Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. RUDDER: Return decomposition for delayed rewards. arXiv preprint arXiv:1806.07857, 2018.
  • Liu et al. [2019] Yang Liu, Yunan Luo, Yuanyi Zhong, Xi Chen, Qiang Liu, and Jian Peng. Sequence modeling of temporal credit assignment for episodic reinforcement learning. arXiv preprint arXiv:1905.13420, 2019.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Conference on Computer Vision and Pattern Recognition, 2019.
  • Ghazvininejad et al. [2017] Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an interactive poetry generation system. In Proceedings of ACL, System Demonstrations, 2017.
  • Weng [2021] Lilian Weng. Controllable neural text generation. lilianweng.github.io/lil-log, 2021. URL https://lilianweng.github.io/lil-log/2021/01/02/controllable-neural-text-generation.html.
  • Ficler and Goldberg [2017] Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. arXiv preprint arXiv:1707.02633, 2017.
  • Hu et al. [2017] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Toward controlled generation of text. In International Conference on Machine Learning, 2017.
  • Rajani et al. [2019] Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! Leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019.
  • Yu et al. [2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI Conference on Artificial Intelligence, 2017.
  • Ziegler et al. [2019] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  • Keskar et al. [2019] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
  • Dathathri et al. [2019] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.
  • Holtzman et al. [2018] Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to write with cooperative discriminators. arXiv preprint arXiv:1805.06087, 2018.
  • Krause et al. [2020] Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367, 2020.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, 2020.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Ritter et al. [2020] Sam Ritter, Ryan Faulkner, Laurent Sartran, Adam Santoro, Matt Botvinick, and David Raposo. Rapid task-solving in novel environments. arXiv preprint arXiv:2006.03662, 2020.
  • Dasari and Gupta [2020] Sudeep Dasari and Abhinav Gupta. Transformers for one-shot visual imitation. arXiv preprint arXiv:2011.05970, 2020.
  • Abramson et al. [2020] Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, et al. Imitating interactive intelligence. arXiv preprint arXiv:2012.05672, 2020.
  • Wolf et al. [2020] Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Appendix A Experimental Details

Code for experiments can be found in the supplementary material.

A.1 Atari

We build our Decision Transformer implementation for Atari games off of minGPT (https://github.com/karpathy/minGPT), a publicly available re-implementation of GPT. We use most of the default hyperparameters from their character-level GPT example (https://github.com/karpathy/minGPT/blob/master/play_char.ipynb). We reduce the batch size (except in Pong), block size, number of layers, attention heads, and embedding dimension for faster training. For processing the observations, we use the DQN encoder from Mnih et al. [20] with an additional linear layer to project to the embedding dimension.
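
As a rough check on the encoder dimensions, the sketch below computes the feature-map sizes produced by the convolutional stack, using the channels, filter sizes, and strides listed in Table 8. The 84 × 84 input resolution and the absence of padding are assumptions (they are the usual DQN preprocessing choices and are not stated in this appendix), and `conv_out` and `encoder_shapes` are illustrative helper names rather than functions from the authors' code.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one spatial axis."""
    return (size - kernel) // stride + 1


def encoder_shapes(input_size=84,           # assumed frame resolution
                   channels=(32, 64, 64),   # Table 8
                   kernels=(8, 4, 3),       # Table 8
                   strides=(4, 2, 1)):      # Table 8
    """Return (channels, height, width) after each convolutional layer."""
    shapes, size = [], input_size
    for c, k, s in zip(channels, kernels, strides):
        size = conv_out(size, k, s)
        shapes.append((c, size, size))
    return shapes


shapes = encoder_shapes()
c, h, w = shapes[-1]
flat_features = c * h * w   # input width of the extra linear projection
embedding_dim = 128         # Table 8
print(shapes, flat_features, "->", embedding_dim)
```

Under these assumptions the last feature map is 64 × 7 × 7, so the additional linear layer would map 3136 flattened features to the 128-dimensional embedding.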

For return-to-go conditioning, we use either $1\times$ or $5\times$ the maximum return in the dataset, but more possibilities exist for principled return-to-go conditioning. In Atari experiments, we use Tanh instead of LayerNorm (as described in Section 3) after embedding each modality, but this does not make a significant difference in performance. The full list of hyperparameters can be found in Table 8.
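
The conditioning rule can be sketched as follows: the initial target is a multiple of the best return in the dataset, and the target is then reduced by each reward received during the rollout. The decrement step reflects our reading of the return-to-go definition rather than anything stated in this appendix, and the function names and toy numbers are purely illustrative.

```python
def initial_target(dataset_returns, multiplier=1.0):
    """Target return-to-go at t = 0: a multiple of the best dataset return."""
    return multiplier * max(dataset_returns)


def update_target(target, reward):
    """After a step, the remaining return to achieve shrinks by the reward."""
    return target - reward


# Toy episode returns (made-up numbers, for illustration only).
returns = [12.0, 40.0, 90.0]
target = initial_target(returns, multiplier=1.0)
for r in [1.0, 0.0, 4.0]:          # rewards observed during a rollout
    target = update_target(target, r)
print(target)
```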

Table 8: Hyperparameters of DT (and %BC) for Atari experiments.

Hyperparameter              Value
Number of layers            6
Number of attention heads
Embedding dimension         128
Batch size                  512 Pong
                            128 Breakout, Qbert, Seaquest
Context length $K$          50 Pong
                            30 Breakout, Qbert, Seaquest
Return-to-go conditioning   90 Breakout ($\approx 1\times$ max in dataset)
                            2500 Qbert ($\approx 5\times$ max in dataset)
                            20 Pong ($\approx 1\times$ max in dataset)
                            1450 Seaquest ($\approx 5\times$ max in dataset)
Nonlinearity                ReLU, encoder
                            GeLU, otherwise
Encoder channels            32, 64, 64
Encoder filter sizes        $8\times 8$, $4\times 4$, $3\times 3$
Encoder strides             4, 2, 1
Max epochs                  5
Dropout
Learning rate               $6 \times 10^{-4}$
Adam betas                  $(0.9, 0.95)$
Grad norm clip
Weight decay
Learning rate decay         Linear warmup and cosine decay (see code for details)
Warmup tokens               $512 \times 20$
Final tokens                $2 \times 500000 \times K$
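
The "Learning rate decay" row can be read as a token-indexed schedule in the style of minGPT's trainer: the multiplier rises linearly over the warmup tokens and then follows a half cosine until the final token count. The following is a sketch based on that reading, not the authors' code; in particular the floor `min_mult = 0.1` is an assumption, and `lr_multiplier` is a hypothetical name.

```python
import math


def lr_multiplier(tokens, warmup_tokens, final_tokens, min_mult=0.1):
    """Linear warmup followed by cosine decay, indexed by tokens processed."""
    if tokens < warmup_tokens:
        return tokens / max(1, warmup_tokens)
    progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    return max(min_mult, 0.5 * (1.0 + math.cos(math.pi * progress)))


K = 30                          # context length for Breakout, Qbert, Seaquest
base_lr = 6e-4                  # Table 8
warmup = 512 * 20               # Table 8
final = 2 * 500000 * K          # Table 8

for t in (0, warmup // 2, warmup, final // 2, final):
    print(t, base_lr * lr_multiplier(t, warmup, final))
```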

A.2 OpenAI Gym

A.2.1 Decision Transformer

Our code is based on the Huggingface Transformers library [72]. Our hyperparameters on all OpenAI Gym tasks are shown below in Table 9. Heuristically, we find that using larger models than standard RL model sizes (which learn one policy) helps to model the distribution of returns. For Reacher we use a smaller context length than for the other environments, which we find to be helpful as the environment is goal-conditioned and the episodes are shorter. We choose return targets based on expert performance for each environment, except for HalfCheetah, where we find a target of 50% of expert performance to be better because its datasets contain lower returns relative to those of the other environments. Models were trained for $10^{5}$ gradient steps using the AdamW optimizer [73] following PyTorch defaults.

Table 9: Hyperparameters of Decision Transformer for OpenAI Gym experiments.

Hyperparameter              Value
Number of layers            3
Number of attention heads
Embedding dimension         128
Nonlinearity function       ReLU
Batch size                  64
Context length $K$          20 HalfCheetah, Hopper, Walker
                            5 Reacher
Return-to-go conditioning   6000 HalfCheetah
                            3600 Hopper
                            5000 Walker
                            50 Reacher
Dropout
Learning rate               $10^{-4}$
Grad norm clip
Weight decay                $10^{-4}$
Learning rate decay         Linear warmup for first $10^{5}$ training steps
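
The "linear warmup" row can be expressed as a step-indexed multiplier on the base learning rate. The sketch below is one possible reading of that row; the exact form (including the `+ 1` offset) is an assumption, and `warmup_multiplier` is a hypothetical name.

```python
def warmup_multiplier(step, warmup_steps=10**5):
    """Scale factor that grows linearly to 1 over the first warmup_steps."""
    return min((step + 1) / warmup_steps, 1.0)


base_lr = 1e-4   # Table 9
for step in (0, 49_999, 99_999, 200_000):
    print(step, base_lr * warmup_multiplier(step))
```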

A.2.2 Behavior Cloning

As briefly mentioned in Section 4.2, we found previously reported behavior cloning baselines to be weak, and so ran them ourselves using a setup similar to that of Decision Transformer. We tried using a transformer architecture, but found an MLP (as in previous work) to be stronger. We train for $2.5\times 10^{4}$ gradient steps; training for longer did not improve performance. Other hyperparameters are shown in Table 10. The percentile behavior cloning experiments use the same hyperparameters.

Table 10: Hyperparameters of Behavior Cloning for OpenAI Gym experiments.

Hyperparameter          Value
Number of layers        3
Embedding dimension     256
Nonlinearity function   ReLU
Batch size              64
Dropout
Learning rate           $10^{-4}$
Weight decay            $10^{-4}$
Learning rate decay     Linear warmup for first $10^{5}$ training steps

A.3 Graph Shortest Path

We give details of the illustrative example discussed in the introduction. The task is to find the shortest path on a fixed directed graph, which can be formulated as an MDP where the reward is $0$ when the agent is at the goal node and $-1$ otherwise. The observation is the integer index of the graph node the agent is in. The action is the integer index of the graph node to move to next. The transition dynamics transport the agent to the action's node index if there is an edge in the graph, while the agent remains at the past node otherwise. The returns-to-go in this problem correspond to negative path lengths and maximizing them corresponds to generating shortest paths.
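
A minimal sketch of this MDP, written from the description above. The class name `GraphEnv`, the adjacency-set representation, and the choice to compute the reward from the node the agent occupies when it acts (so that the return equals the negative path length) are assumptions made for illustration.

```python
class GraphEnv:
    """Shortest-path MDP on a fixed directed graph."""

    def __init__(self, edges, num_nodes, goal):
        # adj[u] is the set of nodes reachable from u along one directed edge.
        self.adj = {u: set() for u in range(num_nodes)}
        for u, v in edges:
            self.adj[u].add(v)
        self.goal = goal
        self.node = None

    def reset(self, start):
        self.node = start
        return self.node                     # observation = node index

    def step(self, action):
        # Reward depends on the node the agent occupies when it acts.
        reward = 0 if self.node == self.goal else -1
        # Move only if the edge (current node -> action) exists; otherwise stay.
        if action in self.adj[self.node]:
            self.node = action
        return self.node, reward


def returns_to_go(rewards):
    """Suffix sums of the rewards: R_t = r_t + r_{t+1} + ..."""
    out, total = [], 0
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]


env = GraphEnv(edges=[(0, 1), (1, 2), (0, 2)], num_nodes=3, goal=2)
env.reset(0)
rewards = [env.step(a)[1] for a in (1, 2, 2)]   # path 0 -> 1 -> 2, then stay
print(rewards, returns_to_go(rewards))
```

In this toy run the two-step path 0 → 1 → 2 yields an initial return-to-go of −2, whereas taking the direct edge 0 → 2 would yield −1, so the larger return corresponds to the shorter path.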

In this environment, we use the GPT model as described in Section 3 to generate both actions and return-to-go tokens. This makes it possible for the model to generate its own (realizable) returns-to-go $\hat{R}$. Since we require a return prompt to generate actions and we do not assume knowledge of the optimal path length upfront, we use a simple prior over returns that favors shorter paths: $P_{\text{prior}}(\hat{R}=k)\propto T+1-k$, where $T$ is the maximum trajectory length. Then, it is combined with the return probabilities generated by the GPT model: $P(\hat{R}_{t}|s_{0:t},a_{0:t-1},\hat{R}_{0:t-1})=P_{\text{GPT}}(\hat{R}_{t}|s_{0:t},a_{0:t-1},\hat{R}_{0:t-1})\times P_{\text{prior}}(\hat{R}_{t})^{10}$. Note that the prior and return-to-go predictions are entirely computable by the model, and thus avoid the need for any external or oracle information like the optimal path length. Adjustment of generation by a prior has also been used for similar purposes in controllable text generation in prior work [65].
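
The combination of the model's return distribution with the prior can be sketched as below. We index returns by the path length $k \in \{0, \dots, T\}$ and renormalize the product so that it sums to one; both choices are assumptions for illustration, as is the made-up model output `p_gpt`.

```python
def return_prior(T):
    """P_prior(k) proportional to T + 1 - k for path lengths k = 0..T."""
    weights = [T + 1 - k for k in range(T + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def combine(p_gpt, p_prior, power=10):
    """Multiply model probabilities by prior**power and renormalize."""
    scores = [g * (p ** power) for g, p in zip(p_gpt, p_prior)]
    total = sum(scores)
    return [s / total for s in scores]


T = 10
prior = return_prior(T)
# Hypothetical model output: uniform mass on path lengths 3, 4 and 5 only.
p_gpt = [0.0] * (T + 1)
for k in (3, 4, 5):
    p_gpt[k] = 1.0 / 3.0
posterior = combine(p_gpt, prior)
best = max(range(T + 1), key=lambda k: posterior[k])
print(best, [round(p, 3) for p in posterior])
```

Because the prior is raised to the 10th power, it sharply favors the shortest path length that the model still considers realizable, while lengths to which the model assigns zero probability stay at zero.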

We train on a dataset of $1{,}000$ graph random walk trajectories of $T=10$ steps each, with a random graph of $20$ nodes and an edge sparsity coefficient of [...]. We report the results in Figure 6, where we find that the transformer model is able to significantly improve upon the number of steps required to reach the goal, closely matching the performance of optimal paths.
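
A sketch of how such a dataset and the shortest-path reference could be produced. Interpreting the edge sparsity coefficient as an independent per-edge probability is an assumption, the value `0.2` is a placeholder and not the value used in the paper, and the uniformly random start node is likewise assumed; all function names are illustrative.

```python
import random
from collections import deque


def random_graph(num_nodes, edge_prob, rng):
    """Directed graph: each ordered pair (u, v), u != v, is an edge w.p. edge_prob."""
    return {u: [v for v in range(num_nodes) if v != u and rng.random() < edge_prob]
            for u in range(num_nodes)}


def random_walk(adj, steps, rng):
    """Node sequence of a random walk; the agent stays put at dead ends."""
    node = rng.randrange(len(adj))
    path = [node]
    for _ in range(steps):
        if adj[node]:
            node = rng.choice(adj[node])
        path.append(node)
    return path


def shortest_path_length(adj, start, goal):
    """Breadth-first search; returns float('inf') if the goal is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")


rng = random.Random(0)
adj = random_graph(num_nodes=20, edge_prob=0.2, rng=rng)   # 0.2 is a placeholder
dataset = [random_walk(adj, steps=10, rng=rng) for _ in range(1000)]
print(len(dataset), len(dataset[0]), shortest_path_length(adj, 0, 19))
```

The breadth-first search gives the "shortest possible path" reference against which generated paths can be compared, with infinity marking start and goal pairs that are not connected.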

Figure 6: Histogram of steps to reach the goal node for random walks on the graph, shortest possible paths to the goal, and attempted shortest paths generated by the transformer model. $\infty$ indicates the goal was not reached during the trajectory.

There are two reasons for the favorable performance on this task. In one case, the training dataset of random walk trajectories may contain a segment that directly corresponds to the desired shortest path, in which case it will be generated by the model. In the second case, generated paths are entirely original and are not subsets of trajectories in the training dataset; they are generated by stitching together sub-optimal segments. We find this case accounts for $15.8\%$ of generated paths in the experiment.

While this is a simple example and uses a prior on generation that we do not use in other experiments for simplicity, it illustrates how hindsight return information can be used with generation priors to avoid the need for explicit dynamic programming.

Appendix B Atari Task Scores

Table 11 shows the baseline scores used for normalization, as used in Hafner et al. [21]. Tables 12 and 13 show the raw scores corresponding to Tables 1 and 4, respectively. For %BC scores, we use the same hyperparameters as Decision Transformer for fair comparison. For REM and QR-DQN, there is a slight discrepancy between Agarwal et al. [13] and Kumar et al. [14]; we report raw data provided to us by the REM authors.

Game       Random   Gamer
Breakout   2        30
Qbert      164      13455
Pong       -21      15
Seaquest   68       42055
Table 11: Atari baseline scores used for normalization.
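
The baselines in Table 11 can be applied with the usual gamer-normalization, under which random play maps to 0 and the gamer baseline to 100. The linear formula below is that common convention and is an assumption about how the table is used; `gamer_normalized` is a hypothetical helper name.

```python
# Random and gamer baselines from Table 11.
BASELINES = {
    "Breakout": (2.0, 30.0),
    "Qbert": (164.0, 13455.0),
    "Pong": (-21.0, 15.0),
    "Seaquest": (68.0, 42055.0),
}


def gamer_normalized(game, raw_score):
    """Map a raw score so that random play is 0 and the gamer baseline is 100."""
    random_score, gamer_score = BASELINES[game]
    return 100.0 * (raw_score - random_score) / (gamer_score - random_score)


# Raw Decision Transformer mean score on Breakout, taken from Table 12.
print(round(gamer_normalized("Breakout", 76.9), 1))
```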

Game       DT (Ours)                     CQL                  QR-DQN    REM       BC
Breakout   $\mathbf{76.9} \pm 27.3$                                               $40.9 \pm 17.3$
Qbert      $2215.8 \pm 1523.7$           $\mathbf{14012.0}$   $156.0$   $160.1$   $2464.1 \pm 1948.2$
Pong       $17.1 \pm 2.9$                $\mathbf{19.3}$      $-14.5$   $-20.8$   $9.7 \pm 7.2$
Seaquest   $\mathbf{1129.3} \pm 189.0$   $779.4$              $250.1$   $370.5$   $968.6 \pm 133.8$
Table 12: Raw scores for the 1% DQN-replay Atari dataset. We report the mean and variance across 3 seeds. Best mean scores are highlighted in bold. Decision Transformer performs comparably to CQL on 3 out of 4 games, and usually outperforms other baselines.

Game       DT (Ours)                     10%BC               25%BC                 40%BC                100%BC
Breakout   $\mathbf{76.9} \pm 27.3$      $10.0 \pm 2.3$      $22.6 \pm 1.8$        $32.3 \pm 18.9$      $40.9 \pm 17.3$
Qbert      $2215.8 \pm 1523.7$           $1045 \pm 232.0$    $2302.5 \pm 1844.1$   $1674.1 \pm 776.0$   $\mathbf{2464.1} \pm 1948.2$
Pong       $\mathbf{17.1} \pm 2.9$       $-20.3 \pm 0.1$     $-16.2 \pm 1.0$       $5.2 \pm 4.8$        $9.7 \pm 7.2$
Seaquest   $\mathbf{1129.3} \pm 189.0$   $521.3 \pm 103.0$   $549.3 \pm 96.2$      $758 \pm 169.1$      $968.6 \pm 133.8$
Table 13: %BC scores for Atari. We report the mean and variance across 3 seeds. Decision Transformer usually outperforms %BC.