
Do Embodied Agents Dream of Pixelated Sheep?:
Embodied Decision Making using Language Guided World Modelling

Kolby Nottingham    Prithviraj Ammanabrolu    Alane Suhr    Yejin Choi    Hannaneh Hajishirzi    Sameer Singh    Roy Fox
Abstract

Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world. However, if initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, that will be verified through world experience, to improve sample efficiency of RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM—successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.

Reinforcement Learning, Natural Language Processing, Embodied Agents, World Models

1 Introduction

Figure 1: During the Dream phase, DECKARD uses the LLM-predicted DAG of subgoals, the hypothesized Abstract World Model (AWM), to sample a node on the path to the current task. Then, during the Wake phase, the agent executes subgoals and explores until reaching the sampled node. The AWM is corrected and discovered nodes marked as verified.

Despite evidence that practical sequential decision making systems require efficient exploitation of prior knowledge regarding a task, the current prevailing paradigm in reinforcement learning (RL) is to train tabula rasa, without any pretraining or external knowledge (Agarwal et al., 2022). In an effort to shift away from this paradigm, we focus on the task of creating embodied RL agents that can effectively exploit large-scale external knowledge sources presented in the form of pretrained large language models (LLMs).

LLMs contain potentially useful knowledge for completing tasks and compiling knowledge sources (Petroni et al., 2019). Previous work has attempted to apply knowledge from LLMs to decision-making by generating action plans for executing in an embodied environment (Ichter et al., 2022; Huang et al., 2022b; Song et al., 2022b; Singh et al., 2022; Liang et al., 2022b; Huang et al., 2022a). However, LLMs still often fail when generating plans due to a lack of grounding (Valmeekam et al., 2022). Additionally, many of these agents that rely on LLM knowledge at execution time are limited in performance by the accuracy of LLM output. We hypothesize that if LLMs are instead applied to improving exploration during training, resulting policies will not be constrained by the accuracy of an LLM.

Exploration in environments with sparse rewards becomes increasingly difficult as the size of the explorable state space increases. For example, the popular 3D embodied environment Minecraft has a large technology tree of craftable items with complex dependencies and a high branching factor. Before crafting a stone pickaxe in Minecraft an agent must: collect logs, craft logs into planks and then sticks, craft a crafting table from planks, use the crafting table to craft a wooden pickaxe from sticks and planks, use the wooden pickaxe to collect cobblestone, and finally use the crafting table to craft a stone pickaxe from sticks and cobblestone. Reaching a goal item is difficult without expert knowledge of Minecraft via dense rewards (Baker et al., 2022; Hafner et al., 2023) or expert demonstrations (Skrynnik et al., 2021; Patil et al., 2020), making item crafting in Minecraft a long-standing AI challenge (Guss et al., 2019; Fan et al., 2022).

We propose DECKARD (DECision-making for Knowledgable Autonomous Reinforcement-learning Dreamers; https://deckardagent.github.io/), an agent that hypothesizes an Abstract World Model (AWM) over subgoals by few-shot prompting an LLM, then exploits the AWM for exploration and verifies the AWM with grounded experience. As seen in Figure 1, DECKARD operates in two phases: (1) the Dream phase where it uses the hypothesized AWM to suggest the next node to explore from the directed acyclic graph (DAG) of subgoals; and (2) the Wake phase where it learns a modular policy of subgoals, each trained on RL objectives, and verifies the hypothesized AWM with grounded environment dynamics. Figure 1 shows two iterations of the DECKARD agent learning the “craft a stone pickaxe” task in Minecraft. During the first Dream phase, the agent has already verified the nodes log and plank, and DECKARD suggests exploring towards the stick subgoal, ignoring nodes such as door that are not predicted to complete the task. Then, during the following Wake phase, DECKARD executes each subgoal in the branch ending in the stick node and then explores until it successfully crafts a stick. If successful, the agent marks the newly discovered node as verified in its AWM before proceeding to the next iteration.

We evaluate DECKARD on learning to craft items in the Minecraft technology tree. We show that LLM-guidance is essential to exploration in DECKARD, with a version of our agent without LLM-guidance taking over twice as long to craft most items during open-ended exploration. When exploring towards a specific task, DECKARD improves sample efficiency by an order of magnitude versus comparable agents (12x the ablated DECKARD without LLM-guidance). Our method is also robust to task decomposition errors in the LLM, consistently outperforming baselines as we introduce errors in the LLM output. DECKARD demonstrates the potential for robustly applying LLMs to RL, thus enabling RL agents to effectively use large-scale, noisy prior knowledge sources for exploration.

2 Related Work

2.1 Language-Assisted Decision Making

Textual knowledge can be used to improve generalization in reinforcement learning through environment descriptions (Branavan et al., 2011; Zhong et al., 2020; Hanjie et al., 2021) or language instructions (Chevalier-Boisvert et al., 2019; Anderson et al., 2018; Ku et al., 2020; Shridhar et al., 2020). However, task specific textual knowledge is expensive to obtain, prompting the use of web queries (Nottingham et al., 2022) or models pretrained on general world knowledge (Dambekodi et al., 2020; Suglia et al., 2021; Ichter et al., 2022; Huang et al., 2022b; Song et al., 2022b).

LLMs can also be used as an external knowledge source by prompting or finetuning them to generate action plans. However, by default, the generated plans are not grounded in environment dynamics and constraining output can harm model performance, both of which lead to subpar performance of out-of-the-box LLMs on decision-making tasks (Valmeekam et al., 2022). Existing work that uses LLMs for generating action plans focuses on methods for grounding language in environment states (Ichter et al., 2022; Huang et al., 2022b; Song et al., 2022b), or improving LLM plans through more structured output (Singh et al., 2022; Liang et al., 2022b). In this work, we focus on using LLMs for exploration rather than directly generating action plans.

Tam et al. (2022) and Mu et al. (2022) recently demonstrated that language is a meaningful state abstraction when used for exploration. Additionally, Tam et al. (2022) experiment with using LLM latent representations of state descriptions for novelty exploration, relying on pretrained LLM encodings to detect novel textual states. To the best of our knowledge, we are the first to apply language-assisted decision-making to exploration by using LLMs to predict and verify environment dynamics through experience.

2.2 Language Grounded in Interaction

Without grounding, LLMs often fail to reason about real world dynamics (Bisk et al., 2020). Instruction following tasks have been a popular testbed for language grounding (Chevalier-Boisvert et al., 2019; Anderson et al., 2018; Ku et al., 2020; Shridhar et al., 2020) prompting many improvements to decision making conditioned on language instructions (Yu et al., 2018; Lynch & Sermanet, 2020; Nottingham et al., 2021; Suglia et al., 2021; Kuo et al., 2021; Zellers et al., 2021; Song et al., 2022a; Blukis et al., 2022). Other prior work used environment interactions to ground responses from question answering models in environment state (Gordon et al., 2018; Das et al., 2018) or physics (Liu et al., 2022). Finally, Ammanabrolu & Riedl (2021) learn a grounded textual world model from environment interactions to assist an RL agent in planning and action selection. In this work, our DECKARD agent also uses a type of textual world model but it is obtained few-shot from an LLM and then grounded in environment dynamics by verifying hypotheses through interaction.

2.3 Modularity in RL

Modular RL proposes to learn several independent policies in a composable way to facilitate training and generalization (Simpkins & Isbell, 2019). Ammanabrolu et al. (2020) and Patil et al. (2020) demonstrate how modular policies can improve exploration by reducing policy horizons, the former using the text-based game Zork and the latter using Minecraft. We implement modularity for Minecraft by finetuning a pretrained transformer policy with adapters, a technique recently implemented for RL by Liang et al. (2022a) for multi-task robotic policies.

2.4 Minecraft

Minecraft is a vast open-ended world with complex dynamics and sparse rewards. Crafting items in the Minecraft technology tree has long been considered a challenging task for reinforcement learning, requiring agents to overcome extremely delayed rewards and difficult exploration (Skrynnik et al., 2021; Patil et al., 2020; Hafner et al., 2023). This is partially due to the scarcity of items in the environment, but also due to the depth of some items in the game’s technology tree. The purpose of our work is to overcome the latter of these two difficulties by better learning and navigating Minecraft’s technology tree.

Several existing agents overcome the problem of item scarcity in Minecraft by simplifying environment parameters such as action duration (Patil et al., 2020) or block break time (Hafner et al., 2023), making comparison between methods difficult. For this reason we compare minimally to other Minecraft agents (see Table 2), focusing our evaluation on the benefits of LLM-guided exploration with DECKARD. We use the video pretrained (VPT) Minecraft agent (Baker et al., 2022) as a starting point for exploration and finetuning, and we use the Minedojo implementation of the Minecraft Environment (Fan et al., 2022).

3 Background

3.1 Modular Reinforcement Learning

Rather than train a single policy with sparse rewards, modular RL advocates learning compositional policy modules (Simpkins & Isbell, 2019). DECKARD automatically discovers subgoals in Minecraft—each of which maps to an independently trained policy module—and learns a DAG of dependencies (the AWM) to transition between subgoals. Policy modules are trained in an environment modeled by a POMDP with states $s \in \mathcal{S}$, observations $o \in O$, actions $a \in \mathcal{A}$, and environment dynamics $\mathcal{T}: \mathcal{S}, \mathcal{A} \to \mathcal{S}'$. These elements are common between modules, but each subgoal defines different initial states $\mathcal{S}_0$ and observations $O_0$, terminal states $\mathcal{S}_t$, and reward functions $\mathcal{R}: \mathcal{S}, \mathcal{A} \to \mathbb{R}$, according to the particular subgoal. $\mathcal{S}_0$ and $O_0$ are defined by the current subgoal's parents in the DAG, and $\mathcal{S}_t$ and $\mathcal{R}$ are defined by the current subgoal. For example, the craft wooden pickaxe subgoal has parents craft planks and craft stick, so $\mathcal{S}_0$ includes these items in the agent's starting inventory. This subgoal receives a reward and terminates when a wooden pickaxe is added to the agent's inventory. Section 5 and Appendix B provide more details.
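To make this decomposition concrete, the following minimal sketch (ours, with illustrative names such as `SubgoalSpec`; not taken from the DECKARD codebase) shows how a subgoal module's episode pieces could be derived from its AWM parents:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SubgoalSpec:
    """Illustrative per-subgoal POMDP pieces: S_0/O_0 come from the subgoal's
    AWM parents, while S_t and R are defined by the subgoal item itself."""
    item: str                                               # e.g. "wooden_pickaxe"
    parents: Dict[str, int] = field(default_factory=dict)   # required parent items

    def initial_inventory(self) -> Dict[str, int]:
        # S_0: each episode starts with the parent items already in the inventory.
        return dict(self.parents)

    def reward_and_done(self, inventory: Dict[str, int]) -> Tuple[float, bool]:
        # Sparse reward: +1 and terminate once the target item is in the inventory.
        success = inventory.get(self.item, 0) > 0
        return (1.0 if success else 0.0), success

# The "craft wooden pickaxe" subgoal has parents "planks" and "stick".
wooden_pickaxe = SubgoalSpec("wooden_pickaxe", {"planks": 3, "stick": 2})
print(wooden_pickaxe.reward_and_done({"wooden_pickaxe": 1}))  # (1.0, True)
```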

Due to the compositionality of modular RL, individual modules can be chained together to achieve complex tasks. In our case, given a goal state $s_g$, we use the subgoal DAG to create a path from our current state to $s_g$, $[s_0, s_1, \ldots, s_g]$, where each $s$ represents the terminal state for a subgoal. By chaining together sequences of subgoal modules, we can successfully navigate to connected portions of the currently discovered DAG and reach arbitrary goal states.
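A minimal sketch of this chaining, assuming the AWM is stored as a networkx `DiGraph` whose edges point from prerequisite items to the items that require them; the helper name `subgoal_chain` is ours:

```python
import networkx as nx

def subgoal_chain(awm: nx.DiGraph, goal: str) -> list:
    """Return subgoals [s_0, ..., s_g] in an order that respects AWM dependencies,
    keeping only the ancestors of the goal plus the goal itself."""
    needed = nx.ancestors(awm, goal) | {goal}
    # A topological order over the induced subgraph is a valid execution order.
    return list(nx.topological_sort(awm.subgraph(needed)))

awm = nx.DiGraph()
awm.add_edges_from([("log", "planks"), ("planks", "stick"),
                    ("planks", "wooden_pickaxe"), ("stick", "wooden_pickaxe")])
print(subgoal_chain(awm, "wooden_pickaxe"))
# e.g. ['log', 'planks', 'stick', 'wooden_pickaxe']
```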

3.2 Large Language Models

Large language models (LLM) are trained with a language modeling objective to maximize the likelihood of training data from large text corpora. As LLMs have grown in size and representational power, they have seen success on various downstream tasks by simply modifying their input, referred to as prompt engineering (Brown et al., 2020). Recent applications of LLMs to decision-making have relied partially or entirely on prompt engineering for their action planning (Ichter et al., 2022; Song et al., 2022b; Huang et al., 2022b; Singh et al., 2022; Liang et al., 2022b). We follow this pattern to extract knowledge from LLMs and construct our AWM. We prompt OpenAI’s Codex model (OpenAI, 2022) to generate DECKARD’s hypothesized AWM. Codex is trained to generate code samples from natural language. As with previous work (Singh et al., 2022; Liang et al., 2022b), we find that structured code output works well for extracting knowledge from LLMs. We structure LLM output by prompting Codex to generate a python dictionary of Minecraft item dependencies, which we then map to a graph of items and their dependencies (see Section 5.1 and Appendix A).
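As an illustration of the kind of structured output this targets, the hypothetical dictionary below mimics the recipe format described in Section 5.1 and Appendix A and is mapped to graph edges; the exact schema produced by Codex may differ:

```python
import networkx as nx

# Hypothetical Codex output: a nested dict of recipes and requirements.
llm_output = {
    "stone_pickaxe": {"recipe": {"stick": 2, "cobblestone": 3}, "requires": "crafting_table"},
    "cobblestone":   {"recipe": {}, "requires": "wooden_pickaxe"},
    "stick":         {"recipe": {"planks": 2}, "requires": None},
}

def dict_to_awm(recipes: dict) -> nx.DiGraph:
    """Map each ingredient or tool/table requirement to a directed edge prerequisite -> item."""
    awm = nx.DiGraph()
    for item, info in recipes.items():
        for ingredient in info.get("recipe", {}):
            awm.add_edge(ingredient, item)
        if info.get("requires"):
            awm.add_edge(info["requires"], item)
    return awm

print(sorted(dict_to_awm(llm_output).edges()))
```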

4 DECKARD

Algorithm 1 DECKARD

  G ← LLM()                          // hypothesize AWM with LLM
  C ← {x : 0 for all x ∈ X}          // dict of visit counts
  V ← ∅                              // set of verified nodes
  while training do
    // Dream Phase
    F ← Frontier(G, V)
    if any(C(F) ≤ c₀) then
      x̄ ← SampleBranch(F | C(F) ≤ c₀)
    else
      x̄ ← SampleBranch(F ∪ V)
    end if
    // Wake Phase
    x ← x₀
    for t = 1 … |x̄| do
      x′ ← ExecuteSubgoal(x̄ₜ)
      C(x′) ← C(x′) + 1
      if x′ ∉ V then
        G ← AddEdge(G, x, x′)
        V ← V ∪ {x′}
      end if
      x ← x′
    end for
  end while

4.1 Abstract World Model

Our method, DECision-making for Knowledgable Autonomous Reinforcement-learning Dreamers (DECKARD), builds an Abstract World Model (AWM) of subgoal dependencies from state abstractions. We begin by assuming a textual state representation function $\phi: O \to X$. Textual state representations $x \in X$ make up the nodes for our AWM $G: X, E$ with directed edges $E$ defining the dependencies between $X$. We further constrain $G$ to a directed acyclic graph (DAG) so that the nodes of the DAG represent subgoals useful in navigating towards a target goal. In our experiments, we use the agent's current inventory as $X$, a common component of the Minecraft observation space (Fan et al., 2022; Hafner et al., 2023).
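For instance, assuming a Minedojo-style observation that exposes an inventory mapping (an assumption about the observation schema, not the exact Minedojo API), the textual abstraction $\phi$ could be as simple as reading off the item names currently held:

```python
from typing import Dict, FrozenSet

def phi(obs: Dict) -> FrozenSet[str]:
    """Illustrative textual state abstraction: the names of items currently in
    the agent's inventory, which serve as candidate AWM nodes."""
    inventory = obs.get("inventory", {})
    return frozenset(item for item, count in inventory.items() if count > 0)

obs = {"inventory": {"log": 3, "planks": 0}, "rgb": "..."}  # toy observation
print(phi(obs))  # frozenset({'log'})
```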

We update $G$ from agent experience through environment exploration. When the agent experiences node $x_t$ for the first time, $G$ is updated by adding edges between the previous node $x_{t-1}$ and the new node $x_t$. When trying to reach a previously experienced node, DECKARD recovers the path from the current node $x_0$ to the target node $x_t$ from the AWM. DECKARD then executes policies for each recovered node until it reaches the target goal.
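A compact sketch of this update rule under the same networkx representation assumed above; `verify_transition` is an illustrative name rather than a function from the DECKARD codebase:

```python
import networkx as nx

def verify_transition(awm: nx.DiGraph, verified: set, prev_node: str, new_node: str) -> None:
    """On first reaching new_node, ground the hypothesized AWM: add the observed
    edge prev_node -> new_node and mark new_node as verified."""
    if new_node not in verified:
        awm.add_edge(prev_node, new_node)
        verified.add(new_node)

awm, verified = nx.DiGraph(), {"log"}
verify_transition(awm, verified, "log", "planks")
print(list(awm.edges()), verified)  # [('log', 'planks')] {'log', 'planks'}
```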

4.2 LLM Guidance

The setup so far (referred to in our experiments as “DECKARD (No LLM)”) allows the construction of a modular RL policy for navigating subgoals. However, the agent is still learning the AWM tabula rasa. The key insight of DECKARD is that we can hypothesize the AWM with knowledge from an LLM. We use in-context learning, as described in Section 5.1, to predict $G$ from an LLM with predicted edges $\hat{E}$. While acting in the environment, we verify or correct edges of $G$ and track the set of nodes that have been verified, $V$, thus grounding the AWM hypothesized by the LLM in environment dynamics.

4.2.1 Dream Phase

Equipped with a hypothesized AWM, we iterate between Dream and Wake phases for guided exploration toward a goal (see Algorithm 1). During the Dream phase, we compute the verified frontier $F$ of $G$, composed of verified nodes $V$ with predicted edges to unverified nodes $G - V$. In addition, if a path between $V$ and the current task's goal exists, $F$ is pruned to only include nodes along the predicted path to the goal. For example, after learning to craft planks, subgoals door and stick are potential frontier nodes. However, if the target item is wooden pickaxe, DECKARD will eliminate door as a candidate node for exploration since stick is part of the LLM-predicted recipe for the target item and door is not. Finally, we sample a branch $\bar{x}$ terminating with an element from $F$ to explore during the Wake phase. If all nodes in $F$ have been sampled at least $c_0$ times (where $c_0$ is an exploration hyperparameter) without success, we then sample from all of $V$ rather than from $F$ only.
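One reasonable reading of this frontier computation is sketched below with the same networkx representation: candidate nodes are unverified items whose predicted prerequisites are all verified, optionally pruned to ancestors of the goal. This is our interpretation of the description above, not the paper's code:

```python
import networkx as nx

def frontier(awm: nx.DiGraph, verified: set, goal: str = None) -> set:
    """Unverified nodes all of whose predicted prerequisites are verified, i.e.
    candidates reachable with one new discovery from grounded experience."""
    candidates = {n for n in set(awm) - verified
                  if all(p in verified for p in awm.predecessors(n))}
    if goal is not None and goal in awm:
        # Prune to nodes predicted to lie on the path to the goal.
        candidates &= nx.ancestors(awm, goal) | {goal}
    return candidates

awm = nx.DiGraph([("log", "planks"), ("planks", "stick"), ("planks", "door"),
                  ("stick", "wooden_pickaxe"), ("planks", "wooden_pickaxe")])
print(frontier(awm, {"log", "planks"}, goal="wooden_pickaxe"))  # {'stick'}
```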

4.2.2 Wake Phase

Next, during the Wake phase, the agent executes the sequence of subgoals $\bar{x}$, updating $G$ with learned experience and adding verified nodes to $V$. If sampled from $F$, the final node in $\bar{x}$ will be unlearned, allowing the agent to explore in an attempt to reach the unverified node. If successful, the AWM is updated and the new node is also added to $V$. When adding a newly verified node $x$, we begin finetuning a new subgoal policy for $x$ (see Section 5). Beyond reducing the number of iterations it takes to construct $G$, one benefit of initializing $G$ with an LLM is that we do not finetune subgoals for nodes outside of the predicted path to our target goal. If the predicted recipes fail, then DECKARD begins training additional subgoal policies to assist in exploration. This drastically reduces the number of environment steps required to train DECKARD.
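The sketch below strings these pieces together into a single Wake-phase rollout over a sampled branch; `execute_subgoal` stands in for running (or exploring with) the corresponding policy module and is purely illustrative:

```python
import networkx as nx

def wake_phase(awm, verified, counts, branch, execute_subgoal):
    """Execute each subgoal in the sampled branch, verifying newly reached nodes.
    `execute_subgoal(node) -> bool` runs the node's policy module (or explores,
    for the final unverified node) and reports whether the node was reached."""
    prev = None
    for node in branch:
        reached = execute_subgoal(node)
        counts[node] = counts.get(node, 0) + 1
        if not reached:
            break
        if node not in verified and prev is not None:
            awm.add_edge(prev, node)   # confirm or correct the hypothesized edge
            verified.add(node)         # grounded in experience: mark as verified
        prev = node

awm, verified, counts = nx.DiGraph(), {"log", "planks"}, {}
wake_phase(awm, verified, counts, ["log", "planks", "stick"],
           execute_subgoal=lambda node: True)  # stub: every subgoal succeeds
print(sorted(verified), counts)
```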

5 Experiment Setup

We apply DECKARD to crafting items in Minecraft, an embodied learning environment that requires agents to perform sequences of subgoals with sparse rewards. Our agent maps inventory items to AWM subgoals and learns a modular policy that can be composed to achieve complex tasks. By learning modular policies, our agent is able to collect and craft arbitrary items in the Minecraft technology tree.

5.1 Predicting the Abstract World Model

In our experiments, we predict the AWM using OpenAI’s Codex model (OpenAI, 2022) by prompting the LLM to generate recipes for Minecraft items. We prompt Codex to “Create a nested python dictionary containing crafting recipes and requirements for minecraft items” along with additional instructions about the dictionary contents and two examples: diamond pickaxe and diamond (see Appendix A). We iterate over 391 Minecraft items, generating recipes as well as tool requirements (mining stone requires a pickaxe) and workbench requirements (crafting a pickaxe requires a crafting table). The hypothesized AWM is generated at the start of training, so no forward passes of the LLM are necessary during training or inference. Table 1 shows the accuracy of the hypothesized un-grounded AWM.
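A hedged sketch of this prompting step, using the legacy (pre-1.0) OpenAI completions API through which Codex was served at the time; the prompt text and few-shot entries below paraphrase the description above rather than reproducing the exact prompt from Appendix A:

```python
import ast
import openai  # legacy (pre-1.0) completion client, as used with code-davinci-002

PROMPT = (
    "# Create a nested python dictionary containing crafting recipes and\n"
    "# requirements for minecraft items.\n"
    "recipes = {\n"
    '    "diamond_pickaxe": {"recipe": {"diamond": 3, "stick": 2}, "requires": "crafting_table"},\n'
    '    "diamond": {"recipe": {}, "requires": "iron_pickaxe"},\n'
)

def hypothesize_recipe(item: str) -> dict:
    """Ask the code LLM to continue the dictionary with an entry for `item`."""
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=PROMPT + f'    "{item}": ',
        max_tokens=128,
        temperature=0.0,
        stop=["\n"],
    )
    completion = response["choices"][0]["text"].strip().rstrip(",")
    return ast.literal_eval(completion)  # e.g. {"recipe": {...}, "requires": ...}
```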

Metric | All Items | Tools Only
Collectable vs. Craftable | 57 | 100
Crafting Table / Furnace | 84 | 96
Recipe Correct Items | 66 | 81
Recipe Exact Match | 55 | 69

Table 1: LLM accuracy when predicting various node features: whether an item is collectable (no parents) or craftable (has a recipe), whether it requires a crafting table or furnace to craft, whether recipe ingredients are correct, and whether the recipe is an exact match (including ingredient quantities). The first results column includes all 391 Minecraft items, whereas the second column only includes the 37 items in the tool technology tree.

5.2 Subgoal Finetuning

Rather than train each module from scratch, we finetune transformer adapters for each module with an RL objective following the adapter architecture from Houlsby et al. (2019). We use the Video-Pretrained (VPT) Minecraft model as our starting policy (Baker et al., 2022). We chose to finetune VPT as it proved to be more sample efficient and more stable than training policies from scratch. Moreover, since VPT is pretrained on a variety of Minecraft skills, the non-finetuned VPT model explores the environment more thoroughly than a random agent. Our implementation of VPT finetuned with adapters is referred to as VPT-a.
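For reference, a minimal PyTorch sketch of a Houlsby-style bottleneck adapter; the layer sizes are illustrative and not the exact VPT/DECKARD configuration:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Houlsby-style bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection. Only these parameters are trained per subgoal."""
    def __init__(self, hidden_dim: int = 1024, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# One adapter set per subgoal; the frozen transformer body is shared across subgoals.
adapter = Adapter()
print(sum(p.numel() for p in adapter.parameters()))  # trainable params for this layer
```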

Adapters are especially well suited for modular finetuning due to their lightweight architecture (Liang et al., 2022a). In our agent, each subgoal module corresponds to one set of adapters and only contains 9.5 million trainable parameters, approximately 2% of the 0.5 billion parameter VPT model. This allows us to train a separate set of adapters for each subgoal and still keep all parameters in memory concurrently, a practical benefit of using adapters for modular, compositional RL policies.

5.3 Environment Details

We use Minedojo's Minecraft implementation for our experiments (Fan et al., 2022). As with VPT (Baker et al., 2022), our subgoal policies use a pixel-only observation space and a large multi-discrete action space, while our overall policy transitions between subgoals based on the agent's current inventory. Unlike VPT, we use standard high-level crafting actions that instantly craft target items from inventory items. At the time of this writing, Minedojo does not support the VPT style of human-like crafting in a GUI, so we instead remove the VPT action for opening the crafting GUI and replace it with 254 crafting actions (one for each item). This brings our multi-discrete action space to 121 camera actions and 8714 keyboard/crafting actions, and our observation space to 128x128x3 pixels plus 391 inventory item quantities (only used to transition between subgoals).

Because our subgoals map to individual items, there is an intrinsic separation between items that are collected from the environment versus those that require crafting. While we must finetune a set of adapters for subgoals that require navigating or collecting items from the environment, crafting subgoal policies map to a single craft action—making them much more space and sample efficient compared to collectable item subgoals.

5.4 Experiments

We evaluate DECKARD on both crafting tasks—in which the agent learns to collect ingredients and craft a target item—and open-ended exploration. In open-ended exploration, although there is no extrinsic learning signal, DECKARD is intrinsically motivated to explore new AWM nodes. We compare the growth of the agent's verified AWM during open-ended exploration for DECKARD with and without LLM guidance along with a VPT baseline. Next, we compare LLM-guided DECKARD to RL baselines and DECKARD without LLM guidance on goal-driven tasks for collecting/crafting: logs, wooden pickaxes, cobblestone, stone pickaxes, furnaces, sand, and glass. We also compare to several popular Minecraft agents on the “craft a stone pickaxe” task (see Table 2). Finally, we evaluate the effect of artificial errors in the hypothesized AWM to simulate errors in LLM output and demonstrate DECKARD's robustness to LLM accuracy.

6 Experiment Results

6.1 Open-Ended Exploration

Figure 2: Rate of exploration during open-ended exploration, measured by the size of the verified AWM per iteration. Each iteration includes one Dream and one Wake phase. VPT measures the number of items discovered by a non-finetuned VPT policy and No LLM ablates LLM guidance. LLM guidance more than halves the time it takes to discover difficult items such as stone tools and glass.
Figure 3: DECKARD prunes the AWM by only sampling from the frontier of verified and hypothesized AWM nodes. Without LLM guidance, our agent would sample from the entire AWM during exploration. However, the AWM grows in size throughout training and many nodes become dead ends, slowing exploration.
Figure 4: Success rates for item tasks on random world seeds. The VPT agent shows success rates of the pretrained VPT policy without any additional finetuning. VPT-a finetunes VPT using the same training setup as DECKARD without modularity or LLM guidance. As indicated by the results for log and sand (item tasks composed of a single subgoal), VPT-a is equivalent to a single subgoal policy. DECKARD without LLM-guidance has the same success rate as the full DECKARD agent.

DECKARD is intrinsically motivated to explore new nodes, always sampling and attempting to craft new items, and thus does not require a target task to improve exploration. We can measure the effect of DECKARD on exploration by tracking the growth of the agent's verified AWM nodes. Figure 2 shows the speed of exploration when using DECKARD with and without LLM guidance. We also compare DECKARD to a VPT baseline that explores the environment without an AWM with a non-finetuned VPT policy. Although VPT does not construct an AWM, it gathers Minecraft items and randomly attempts to craft new items from the gathered ingredients. We track how many items it has discovered and plot that quantity in Figure 2. DECKARD without LLM guidance constructs an AWM from scratch, but only the LLM-guided DECKARD agent uses LLM guidance to decide which items to collect and which recipes to attempt next. Note that DECKARD subgoal policies are initialized with VPT, so VPT starts out exploring at a similar rate to DECKARD.

The DECKARD and VPT agents quickly learn to mine logs and craft wooden items. However, one exploration hurdle is discovering that wooden pickaxes are a prerequisite for mining cobblestone. As seen in Figure 2, it takes DECKARD without LLM guidance and the VPT baseline 2x and 3x longer respectively to learn to use a wooden pickaxe to mine cobblestone. Once the agents learn how to mine cobblestone, they can begin adding stone items to their AWM. However, only DECKARD avoids oversampling dead ends in the crafting tree allowing it to quickly explore new states. Also, the LLM incorrectly predicts that glass can be collected without crafting or tools of any kind, but DECKARD overcomes and corrects this error, successfully crafting glass and adding the correct recipe to the AWM.

In general, the frontier $F$ of the verified AWM nodes is much smaller than $G$. This difference increases as the agent continues to explore and add verified nodes to $G$. Figure 3 shows the sizes of $G$ and $F$ throughout open-ended exploration for DECKARD. The smaller size of $F$ means that each iteration DECKARD is more likely to sample items that are useful for crafting something new. Eventually, difficult to reach or erroneous nodes in $F$ could limit exploration, so we stop prioritizing sampling from the frontier after $c_0$ failed attempts to reach nodes from $F$.

6.2 Crafting Tasks

Figure 5: LLM guidance improves environment timestep efficiency by an order of magnitude by only learning policies for predicted prerequisites of target items.

We also evaluate DECKARD on tasks that challenge the agent to collect or craft a specific item. The training procedure for item tasks is the same, but rather than sample from the entire frontier $F$ as with open-ended exploration, we only sample nodes from $F$ predicted to lead to the target item. We conclude training when the target item is obtained.

Figure 4 compares DECKARD success rates to baselines across item tasks: logs, wooden and stone pickaxes, cobblestone, furnace, sand, and glass. The VPT baseline is the non-finetuned VPT policy acting in the environment, and VPT-a follows the same training setup as our subgoal policies (see Section 5.2). Agents are allowed a maximum of 1,000 environment steps to obtain collectable items (log and sand), and 5,000 steps for all other craftable items. Training for each agent is limited to 6 million steps, although DECKARD only takes that many for the “craft glass” task. DECKARD outperforms directly training on item tasks with a traditional reinforcement learning signal and learns to craft items further up the technology tree where the baseline completely fails.

Method | Demos | Dense Rewards | Auto-crafting | Observations | Actions | Params | Steps
Align-RUDDER (Patil et al., 2020) | Expert | – | ✓ | Pixels & Meta | 61 | 2.5M/subgoal | 2M
VPT+RL (Baker et al., 2022) | Videos | ✓ | – | Pixels Only | 121, 8461 | 248M | 2.4B
DreamerV3 (Hafner et al., 2023) | None | ✓ | ✓ | Pixels & Meta | 25 | 200M | 6M
DECKARD (No LLM) | Videos | – | ✓ | Pixels & Inventory | 121, 8714 | 9.5M/subgoal | 32M
DECKARD | Videos | – | ✓ | Pixels & Inventory | 121, 8714 | 9.5M/subgoal | 2.6M

Table 2: We limit comparison between Minecraft agents because of the various shortcuts used to solve the difficult exploration task. Align-RUDDER relies on expert demonstrations. DreamerV3 and Align-RUDDER simplify the action space. VPT+RL and DreamerV3 provide intermediate crafting rewards. The final column above compares how long each method takes to learn the “craft stone pickaxe” task. Despite its challenging learning setup, DECKARD achieves sample efficiency equal to or better than existing agents.

Note that we use random world seeds for all evaluation, making scarce items more difficult to reliably collect. For example, the fact that sand is rarer than logs is reflected in their respective success rates in Figure 4. Also, items that depend on logs (pickaxes, cobblestone, furnace) and sand (glass) will have success rates bounded by that of their parent nodes in the AWM.

The sample efficiency of DECKARD with LLM guidance is especially notable when applied to item crafting and collecting tasks. With LLM guidance, DECKARD can avoid learning subgoal policies for items it predicts are unnecessary for the current goal (see Section 4.2.2). Figure 5 demonstrates the large difference in sample efficiency when only training policies for predicted subgoals. Without LLM guidance, DECKARD finetunes subgoal policies for an average of fifteen different collectable items when learning to craft a stone pickaxe. With guidance, DECKARD only finetunes subgoal policies for collecting needed items (such as logs and cobblestone)—resulting in an order of magnitude improvement in sample efficiency.

Although not the primary goal of this work, we compare DECKARD to several agents from previous work trained to craft items along the Minecraft technology tree. Table 2 includes a high level overview of these agents and shows the number of environment samples for each to learn the “craft stone pickaxe” task. Note that each of these agents uses different action and observation spaces as well as pretraining data. For example, DECKARD does not require any reward shaping from domain expertise, expert demonstrations, or simplifications of the observation and action spaces. As mentioned in Section 5.3, we do follow previous work and use discrete actions for item crafting. At the time of writing, VPT is the only agent that learns low-level item crafting using the graphical crafting interface. Table 2 shows that DECKARD’s sample efficiency is equal to or better than that of previous work.

6.3 Robustness

Figure 6: Effect of errors in the initial AWM, measured by time to craft a stone pickaxe. Starting from a ground truth AWM, the error rate indicates the percentage of artificially inserted/deleted edges to simulate errors in LLM output.

Finally, we evaluate our claim that DECKARD is robust to errors in LLM output. While LLMs are becoming surprisingly knowledgeable, they are not grounded in environment knowledge and sometimes output erroneous facts (Valmeekam et al., 2022). Figure 6 shows training time for DECKARD on the target task “craft stone pickaxe” for various error types and rates in the hypothesized AWM. For each run, we start with a ground truth AWM and artificially introduce errors over at least three different random seeds for each error type and rate.

The most common error in our LLM-hypothesized AWM was ingredient quantity (see Table 1), but we found that DECKARD was robust to this error and often ended up with a surplus of ingredients. Figure 6 shows the effect of inserting and deleting edges from the ground truth AWM. Inserted edges always add sand as an ingredient for the current item, and deleted edges may remove recipe ingredients or a required tool/crafting table. DECKARD with LLM guidance successfully outperforms DECKARD without LLM guidance even when faced with large errors in LLM output, demonstrating DECKARD’s robustness to LLM output as an exploration method.
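A sketch of how such errors might be injected into a ground-truth AWM for this ablation (our reconstruction of the described procedure, not the paper's evaluation script): inserted edges add sand as a spurious prerequisite of randomly chosen items, and deleted edges drop randomly chosen prerequisites:

```python
import random
import networkx as nx

def corrupt_awm(awm: nx.DiGraph, error_rate: float, mode: str, seed: int = 0) -> nx.DiGraph:
    """Simulate LLM errors: 'insert' adds sand as a prerequisite of random items,
    'delete' removes random prerequisite edges."""
    rng = random.Random(seed)
    corrupted = awm.copy()
    n_errors = int(error_rate * corrupted.number_of_edges())
    if mode == "insert":
        targets = rng.sample(sorted(corrupted.nodes()), n_errors)
        corrupted.add_edges_from(("sand", t) for t in targets)
    elif mode == "delete":
        removed = rng.sample(sorted(corrupted.edges()), n_errors)
        corrupted.remove_edges_from(removed)
    return corrupted

truth = nx.DiGraph([("log", "planks"), ("planks", "stick"), ("stick", "stone_pickaxe"),
                    ("planks", "stone_pickaxe"), ("cobblestone", "stone_pickaxe")])
print(sorted(corrupt_awm(truth, 0.4, "delete").edges()))
```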

7 Discussion & Conclusion

In line with proposals to utilize pretrained models in RL (Agarwal et al., 2022), we extract knowledge from LLMs in the form of an Abstract World Model (AWM) that defines transitions between subgoals in a directed acyclic graph. Our agent, DECKARD (DECision-making for Knowledgable Autonomous Reinforcement-learning Dreamers), successfully uses the AWM to intelligently explore Minecraft item crafting, learning to craft arbitrary items through a modular RL policy. Initializing DECKARD with an LLM-predicted AWM improves sample efficiency by an order of magnitude. Additionally, we use environment dynamics to ground the hypothesized AWM by verifying and correcting it with agent experience, robustly applying large-scale, noisy knowledge sources to aid in sequential decision-making.

We, along with many others, hope to utilize the potential of LLMs for unlocking internet-scale knowledge for decision-making. Throughout this effort, we encourage the pursuit of robust and generalizable methods, like DECKARD. One drawback of DECKARD, along with many other LLM-assisted RL methods, is that it requires an environment that is already grounded in language. Some preliminary methods for generating state descriptions from images are used by Tam et al. (2022), but this remains an open area of research. Additionally, we assume an abstraction over environment states to make predicting dependencies scalable. We leave the problem of automatically identifying state abstractions to future work. Finally, DECKARD considers only deterministic transitions in the AWM. While a similar approach to ours could be applied to stochastic AWMs, that is out of the scope of this work.

DECKARD introduces a general approach for utilizing pretrained LLMs for guiding agent exploration. By alternating between sampling predicted next subgoals on the frontier of agent experience (The Dream phase) and executing subgoal policies to expand the frontier (The Wake phase), we successfully ground noisy LLM world knowledge with environment dynamics and learn a modular policy over compositional subgoals.

Acknowledgements

We would like to thank Dheeru Dua, Dylan Slack, Anthony Chen, Catarina Belem, Shivanshu Gupta, Tamanna Hossain, Yasaman Razeghi, Preethi Seshardi, and the anonymous reviewers for their discussions and feedback. Roy Fox is partly funded by the Hasso Plattner Foundation.

References

  • Agarwal et al. (2022) Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A., and Bellemare, M. G. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=t3X5yMI_4G2.
  • Ammanabrolu & Riedl (2021) Ammanabrolu, P. and Riedl, M. Learning knowledge graph-based world models of textual environments. Advances in Neural Information Processing Systems, 34:3720–3731, 2021.
  • Ammanabrolu et al. (2020) Ammanabrolu, P., Tien, E., Hausknecht, M., and Riedl, M. O. How to avoid being eaten by a grue: Structured exploration strategies for textual worlds. arXiv preprint arXiv:2006.07409, 2020.
  • Anderson et al. (2018) Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and van den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Baker et al. (2022) Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., and Clune, J. Video pretraining (vpt): Learning to act by watching unlabeled online videos. In Advances in Neural Information Processing Systems, 2022.
  • Bisk et al. (2020) Bisk, Y., Holtzman, A., Thomason, J., Andreas, J., Bengio, Y., Chai, J., Lapata, M., Lazaridou, A., May, J., Nisnevich, A., et al. Experience grounds language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  8718–8735, 2020.
  • Blukis et al. (2022) Blukis, V., Paxton, C., Fox, D., Garg, A., and Artzi, Y. A persistent spatial semantic representation for high-level natural language instruction execution. In Faust, A., Hsu, D., and Neumann, G. (eds.), Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pp.  706–717. PMLR, 08–11 Nov 2022. URL https://proceedings.mlr.press/v164/blukis22a.html.
  • Branavan et al. (2011) Branavan, S., Silver, D., and Barzilay, R. Learning to win by reading manuals in a monte-carlo framework. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.  268–277, 2011.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chevalier-Boisvert et al. (2019) Chevalier-Boisvert, M., Bahdanau, D., Lahlou, S., Willems, L., Saharia, C., Nguyen, T. H., and Bengio, Y. BabyAI: First steps towards grounded language learning with a human in the loop. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJeXCo0cYX.
  • Dambekodi et al. (2020) Dambekodi, S., Frazier, S., Ammanabrolu, P., and Riedl, M. Playing text-based games with common sense. In Proceedings of the NeurIPS Workshop on Wordplay: When Language Meets Games, 2020.
  • Das et al. (2018) Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., and Batra, D. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Fan et al. (2022) Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.-A., Zhu, Y., and Anandkumar, A. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Gordon et al. (2018) Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., and Farhadi, A. Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4089–4098, 2018.
  • Guss et al. (2019) Guss, W. H., Houghton, B., Topin, N., Wang, P., Codel, C., Veloso, M., and Salakhutdinov, R. Minerl: A large-scale dataset of minecraft demonstrations. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI’19, pp.  2442–2448. AAAI Press, 2019. ISBN 9780999241141.
  • Hafner et al. (2023) Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models, 2023. URL https://arxiv.org/abs/2301.04104.
  • Hanjie et al. (2021) Hanjie, A. W., Zhong, V. Y., and Narasimhan, K. Grounding language to entities and dynamics for generalization in reinforcement learning. In International Conference on Machine Learning, pp. 4051–4062. PMLR, 2021.
  • Houlsby et al. (2019) Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.
  • Huang et al. (2022a) Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pp. 9118–9147. PMLR, 2022a.
  • Huang et al. (2022b) Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y., et al. Inner monologue: Embodied reasoning through planning with language models. In 6th Annual Conference on Robot Learning, 2022b.
  • Ichter et al. (2022) Ichter, B., Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., Kalashnikov, D., Levine, S., Lu, Y., Parada, C., Rao, K., Sermanet, P., Toshev, A. T., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Yan, M., Brown, N., Ahn, M., Cortes, O., Sievers, N., Tan, C., Xu, S., Reyes, D., Rettinghouse, J., Quiambao, J., Pastor, P., Luu, L., Lee, K.-H., Kuang, Y., Jesmonth, S., Jeffrey, K., Ruano, R. J., Hsu, J., Gopalakrishnan, K., David, B., Zeng, A., and Fu, C. K. Do as i can, not as i say: Grounding language in robotic affordances. In 6th Annual Conference on Robot Learning, 2022. URL https://openreview.net/forum?id=bdHkMjBJG_w.
  • Ku et al. (2020) Ku, A., Anderson, P., Patel, R., Ie, E., and Baldridge, J. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  4392–4412, 2020.
  • Kuo et al. (2021) Kuo, Y.-L., Katz, B., and Barbu, A. Compositional networks enable systematic generalization for grounded language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp.  216–226, 2021.
  • Liang et al. (2022a) Liang, A., Singh, I., Pertsch, K., and Thomason, J. Transformer adapters for robot learning. In CoRL 2022 Workshop on Pre-training Robot Learning, 2022a. URL https://openreview.net/forum?id=H--wvRYBmF.
  • Liang et al. (2022b) Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Florence, P., Zeng, A., et al. Code as policies: Language model programs for embodied control. In Workshop on Language and Robotics at CoRL 2022, 2022b.
  • Liu et al. (2022) Liu, R., Wei, J., Gu, S. S., Wu, T.-Y., Vosoughi, S., Cui, C., Zhou, D., and Dai, A. M. Mind’s eye: Grounded language model reasoning through simulation. arXiv preprint arXiv:2210.05359, 2022.
  • Lynch & Sermanet (2020) Lynch, C. and Sermanet, P. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems XVII, 2020.
  • Mu et al. (2022) Mu, J., Zhong, V., Raileanu, R., Jiang, M., Goodman, N., Rocktäschel, T., and Grefenstette, E. Improving intrinsic exploration with language abstractions. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=ALIYCycCsTy.
  • Nottingham et al. (2021) Nottingham, K., Liang, L., Shin, D., Fowlkes, C. C., Fox, R., and Singh, S. Modular framework for visuomotor language grounding. In Embodied AI Workshop @ CVPR, 2021.
  • Nottingham et al. (2022) Nottingham, K., Pyla, A., Singh, S., and Fox, R. Learning to query internet text for informing reinforcement learning agents. In Reinforcement Learning and Decision Making Conference, 2022.
  • OpenAI (2022) OpenAI. Powering next generation applications with openai codex, 2022. URL https://openai.com/blog/codex-apps/.
  • Patil et al. (2020) Patil, V., Hofmarcher, M., Dinu, M.-C., Dorfer, M., Blies, P. M., Brandstetter, J., Arjona-Medina, J. A., and Hochreiter, S. Align-rudder: Learning from few demonstrations by reward redistribution. In International Conference on Machine Learning, 2020.
  • Petroni et al. (2019) Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., and Miller, A. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2463–2473, 2019.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Shridhar et al. (2020) Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., Zettlemoyer, L., and Fox, D. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Simpkins & Isbell (2019) Simpkins, C. and Isbell, C. Composable modular reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4975–4982, Jul. 2019. doi: 10.1609/aaai.v33i01.33014975. URL https://ojs.aaai.org/index.php/AAAI/article/view/4428.
  • Singh et al. (2022) Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. In Second Workshop on Language and Reinforcement Learning, 2022.
  • Skrynnik et al. (2021) Skrynnik, A., Staroverov, A., Aitygulov, E., Aksenov, K., Davydov, V., and Panov, A. I. Forgetful experience replay in hierarchical reinforcement learning from expert demonstrations. Know.-Based Syst., 218(C), apr 2021. ISSN 0950-7051. doi: 10.1016/j.knosys.2021.106844. URL https://doi.org/10.1016/j.knosys.2021.106844.
  • Song et al. (2022a) Song, C. H., Kil, J., Pan, T.-Y., Sadler, B. M., Chao, W.-L., and Su, Y. One step at a time: Long-horizon vision-and-language navigation with milestones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  15482–15491, June 2022a.
  • Song et al. (2022b) Song, C. H., Wu, J., Washington, C., Sadler, B. M., Chao, W.-L., and Su, Y. Llm-planner: Few-shot grounded planning for embodied agents with large language models. arXiv preprint arXiv:2212.04088, 2022b.
  • Suglia et al. (2021) Suglia, A., Gao, Q., Thomason, J., Thattai, G., and Sukhatme, G. Embodied bert: A transformer model for embodied, language-guided visual task completion. CoRR, abs/2108.04927, 2021. URL https://arxiv.org/abs/2108.04927.
  • Tam et al. (2022) Tam, A. C., Rabinowitz, N. C., Lampinen, A. K., Roy, N. A., Chan, S. C., Strouse, D., Wang, J. X., Banino, A., and Hill, F. Semantic exploration from language abstractions and pretrained representations. arXiv preprint arXiv:2204.05080, 2022.
  • Valmeekam et al. (2022) Valmeekam, K., Olmo, A., Sreedharan, S., and Kambhampati, S. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
  • Yu et al. (2018) Yu, H., Zhang, H., and Xu, W. Interactive grounded language acquisition and generalization in a 2d world. In International Conference on Learning Representations, 2018.
  • Zellers et al. (2021) Zellers, R., Holtzman, A., Peters, M. E., Mottaghi, R., Kembhavi, A., Farhadi, A., and Choi, Y. Piglet: Language grounding through neuro-symbolic interaction in a 3d world. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  2040–2050, 2021.
  • Zhong et al. (2020) Zhong, V., Rocktäschel, T., and Grefenstette, E. Rtfm: Generalising to new environment dynamics via reading. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJgob6NKvH.

Appendix A Codex In-Context Learning

A.1 Prompting Details

We use OpenAI’s Codex model (code-davinci-002) (OpenAI, 2022) to predict an Abstract World Model (AWM) for DECKARD. We prompt the model with code comments instructing it to generate a python dictionary containing Minecraft item requirements, and we provide example entries for “diamond pickaxe” and “diamond”. We then iterate over all 391 Minecraft items, prompting the model to generate the next entry in the python dictionary. Each dictionary entry contains the following item attributes:

  • requires_crafting_table: whether an item requires the agent to have a crafting table prior to crafting

  • requires_furnace: whether the item is smelted with a furnace

  • required_tool: what tool is required to collect the item from the environment

  • recipe: list of ingredients and ingredient quantities to craft the item

The full prompt we use can be found below:

# Create a nested python dictionary containing crafting recipes and requirements for minecraft items.
# Each crafting item should have a recipe and booleans indicating whether a furnace or crafting table is required.
# Non craftable blocks should have their recipe set to an empty list and indicate which tool is required to mine.

minecraft_info = {
    "diamond_pickaxe": {
        "requires_crafting_table": True,
        "requires_furnace": False,
        "required_tool": None,
        "recipe": [
            {
                "item": "stick",
                "quantity": "2"
            },
            {
                "item": "diamond",
                "quantity": "3"
            }
        ]
    },
    "diamond": {
        "requires_crafting_table": False,
        "requires_furnace": False,
        "required_tool": "iron_pickaxe",
        "recipe": []
    },
    "[insert item name]": {

A.2 Parsing Details

When parsing output, we consider any item with a recipe of length zero to be a collectable item (it will have no parents in the AWM). In our experiments, Codex generated parsable entries for all but one Minecraft item (brown mushroom block).

In general, Codex predicts the same item identifiers that Minedojo (Fan et al., 2022) uses. One major exception is planks, a common item essential for many recipes. We parse plank and wood, as well as any variant of the two (e.g., oak plank), as planks. We also parse cane as reeds. Note that in all of these cases the predicted names are also common identifiers for these items in Minecraft, but they do not match the Minedojo identifiers.

Finally, we remove circular dependencies from the predicted AWM. First, we remove edges from crafting table, furnace, and tool nodes to items that are found in the recipes for those nodes. Then, we remove edges both to and from items found in each other’s recipes. There were four cases of circular dependencies in our hypothesized AWM: between planks and crafting table, log and wooden axe, fermented spider eye and spider eye, and purpur block and purpur pillar.
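For concreteness, the following minimal sketch illustrates this parsing step. The alias table, helper names, and the simplified cycle handling (only mutual recipe edges are checked) are illustrative rather than our exact implementation, and ingredient quantities are omitted.

ALIASES = {"plank": "planks", "oak_plank": "planks", "wood": "planks", "cane": "reeds"}

def normalize(name: str) -> str:
    # Map a Codex-predicted item name onto a Minedojo-style identifier.
    name = name.strip().lower().replace(" ", "_")
    return ALIASES.get(name, name)

def build_awm(codex_dict: dict) -> dict:
    # Convert the generated dictionary into AWM nodes. Items with an empty
    # recipe are treated as collectable; their required tool is stored as a
    # separate requirement rather than as a recipe parent.
    awm = {}
    for item, info in codex_dict.items():
        awm[normalize(item)] = {
            "parents": {normalize(i["item"]) for i in info.get("recipe", [])},
            "required_tool": normalize(info["required_tool"]) if info.get("required_tool") else None,
            "requires_crafting_table": info.get("requires_crafting_table", False),
            "requires_furnace": info.get("requires_furnace", False),
        }
    # Break circular dependencies (simplified: only mutual recipe edges are
    # handled here; the crafting table/furnace/tool passes are analogous).
    for a in list(awm):
        for b in list(awm[a]["parents"]):
            if b in awm and a in awm[b]["parents"]:
                awm[a]["parents"].discard(b)
                awm[b]["parents"].discard(a)
    return awm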

A.3 Additional Results

“Tool Only” Items: coal, furnace, crafting_table, log, planks, stick, cobblestone, iron_ore, iron_ingot, gold_ore, gold_ingot, diamond, wooden_hoe, wooden_sword, wooden_axe, wooden_pickaxe, wooden_shovel, stone_hoe, stone_sword, stone_axe, stone_pickaxe, stone_shovel, iron_hoe, iron_sword, iron_axe, iron_pickaxe, iron_shovel, golden_hoe, golden_sword, golden_axe, golden_pickaxe, golden_shovel, diamond_hoe, diamond_sword, diamond_axe, diamond_pickaxe, diamond_shovel
Table 3: The 37 Minecraft items from the tool technology tree.
Metric | All Items | Tools Only
Accuracy: Collectable vs. Craftable Label (%) | 57 | 100
Accuracy: Workbench (Crafting Table/Furnace) (%) | 84 | 96
Accuracy: Recipe Ingredients (%) | 66 | 81
Accuracy: Recipe Ingredients & Quantities (%) | 55 | 69
% Items w/ Incorrectly Inserted Dependencies | 42 | 8
% Items w/ Missing Dependencies | 35 | 26
Standard Deviation In Predicted Ingredient Quantity | 0.98 | 0.34
Absolute Error In Predicted Ingredient Quantity | 2.77 | 1.50
Average Error In Predicted Ingredient Quantity | -1.07 | 0.50
Table 4: Additional Codex metrics for predicting the Minecraft AWM.

Our experiments with few-shot prompting Codex to generate the AWM for Minecraft show that LLMs can produce structured knowledge for decision making. However, the predictions are not perfect, so we treat them as hypotheses to be verified through environment interaction. Codex performs better on the tool technology tree, whose items are both more common and more relevant for crafting agents. A large share of the errors also appears to come from incorrectly predicted ingredient quantities.

Appendix B Subgoal Finetuning

Figure 7: Results for finetuning subgoal policies. DECKARD’s VPT-based subgoal policies are trained on seeds where the target item is nearby and evaluated on random world seeds. Of the items shown, cobblestone is the most ubiquitous and sand the least, as reflected in how well each policy generalizes to random Minecraft world seeds.
Hyperparameter | Value
VPT Checkpoint | bc-house-3x
Environment Steps per Actor per Iteration | 500
Number of Actors | 4
Batch Size | 40
Iteration Epochs | 5
Learning Rate | 0.0001
Discount Factor γ | 0.999
Value Loss Coefficient | 1
Initial KL Loss Coefficient | 0.1
KL Coefficient Decay per Iteration | 0.999
Adapter Downsize Factor | 16
Table 5: DECKARD subgoal finetuning hyperparameters.

B.1 VPT Finetuning

We finetune VPT (3x, with behavior cloning on house contractor data) (Baker et al., 2022) with reinforcement learning (RL) using transformer adapters as described by Houlsby et al. (2019). That is, we insert two adapter modules with residual connections into each transformer layer, each with a 16x reduction in hidden state size. We update the adapters and the agent’s value head using proximal policy optimization (PPO) (Schulman et al., 2017) but leave the rest of the agent unchanged (including the policy head).

Following Baker et al. (2022), we replace the traditional entropy loss in the PPO algorithm with a KL loss between the current policy and the non-finetuned VPT policy. The purpose of this loss is to prevent catastrophic forgetting early in training. Our experiments reaffirmed the importance of this term, even when leaving the majority of the VPT weights unchanged. The KL loss coefficient decays throughout training to allow the agent to reach an optimal policy.
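As a concrete illustration, the sketch below shows the shape of the adapter modules and of the KL-regularized PPO objective. The hidden dimension, the ReLU nonlinearity, the PPO clipping threshold, and the function names are assumptions for the sketch, not details specified above; the downsize factor, value loss coefficient, and initial KL coefficient follow Table 5.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    # Bottleneck adapter with a residual connection (Houlsby et al., 2019).
    # downsize=16 matches the adapter downsize factor in Table 5; the ReLU
    # nonlinearity is an assumption.
    def __init__(self, hidden_dim: int, downsize: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_dim, hidden_dim // downsize)
        self.up = nn.Linear(hidden_dim // downsize, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.relu(self.down(x)))  # residual connection

def ppo_loss_with_kl(ratio, advantage, value_pred, value_target,
                     logp_current, logp_frozen_vpt,
                     clip_eps=0.2, value_coef=1.0, kl_coef=0.1):
    # PPO objective where the usual entropy bonus is replaced by a KL penalty
    # between the finetuned policy and the frozen, non-finetuned VPT policy.
    # clip_eps is an assumed value; value_coef and the initial kl_coef follow Table 5.
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(ratio * advantage, clipped).mean()
    value_loss = F.mse_loss(value_pred, value_target)
    # KL(pi_current || pi_vpt) over the action distribution, averaged over states
    kl_to_vpt = (logp_current.exp() * (logp_current - logp_frozen_vpt)).sum(-1).mean()
    return policy_loss + value_coef * value_loss + kl_coef * kl_to_vpt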

B.2 MineClip Reward

Along with their Minedojo Minecraft implementation, Fan et al. (2022) introduced a text and video alignment model for Minecraft called MineClip and showed how the model could be used for automatic reward shaping given a text goal. We use MineClip to provide reward shaping for finetuning DECKARD subgoal and VPT-a policies. Unlike Fan et al. (2022), we implement MineClip reward shaping by subtracting $clip_{low}=21$ from the MineClip alignment score and scaling by $clip_{\alpha}=0.005$, smoothed over $smooth=50$ steps:

$reward_{clip}=clip_{\alpha}\times\max\left(0,\;\mathrm{mean}(score\_buffer_{-smooth:})-clip_{low}\right)$

Additionally, we only provide the agent with a non-zero reward when the MineClip alignment score reaches a new maximum for the episode. Finally, we provide a reward of $+1$ when the agent successfully adds the target item to its inventory.
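A minimal sketch of this reward computation is shown below. The class and method names are illustrative, and mineclip_score is assumed to be the raw alignment score between the goal text and the recent video frames returned by MineClip.

from collections import deque

class MineClipRewardShaper:
    # Sketch of the shaped reward described above (names are illustrative).
    def __init__(self, clip_low: float = 21.0, clip_alpha: float = 0.005, smooth: int = 50):
        self.clip_low = clip_low
        self.clip_alpha = clip_alpha
        self.scores = deque(maxlen=smooth)  # last `smooth` MineClip alignment scores
        self.best = 0.0                     # episode maximum of the shaped score

    def reset(self) -> None:
        self.scores.clear()
        self.best = 0.0

    def step(self, mineclip_score: float, obtained_target_item: bool) -> float:
        self.scores.append(mineclip_score)
        smoothed = sum(self.scores) / len(self.scores)
        shaped = self.clip_alpha * max(0.0, smoothed - self.clip_low)
        reward = 0.0
        if shaped > self.best:           # non-zero reward only on a new episode maximum
            self.best = shaped
            reward = shaped
        if obtained_target_item:         # bonus for adding the target item to the inventory
            reward += 1.0
        return reward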

B.3 Minecraft Settings

We use the Minedojo simulator (Fan et al., 2022) with the “creative” metatask for our experiments. We found Minedojo preferable to MineRL (Guss et al., 2019) due to a reduced tendency to crash when running many parallel environment instances. We follow the VPT (Baker et al., 2022) observation and action spaces (a 128x128x3 pixel observation space and a 121x8461 multi-discrete action space), with the modification of replacing the “open inventory” action with 254 discrete crafting actions.

When training subgoal policies, we initialize the agent with items from the current node’s parents. For example, when training the collect cobblestone subgoal, we initialize the agent with a wooden pickaxe, the required tool for cobblestone in the AWM. We terminate each episode after 1,000 environment steps, generating a new world.

We also found that finetuning was sensitive to the world seed used during training. For example, many world seeds spawned the agent far from target items, stranded on islands, or underwater. To mitigate the effect of poor world initialization on training, we use a single world seed when training each subgoal policy and then evaluate on random world seeds. We find that VPT is able to generalize to random seeds after training on a single seed.
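The sketch below illustrates how a subgoal-training episode is set up under these settings. The environment wrapper (env.reset, env.set_inventory, env.step) and policy.act are hypothetical stand-ins for the corresponding Minedojo and VPT calls, and the AWM node format follows the parsing sketch in Appendix A.2.

def run_subgoal_episode(env, awm, target_item, policy, train_seed, max_steps=1000):
    # Illustrative episode loop for training one subgoal policy. `env` is a
    # hypothetical wrapper around the Minedojo simulator and `policy.act` a
    # hypothetical method of the finetuned VPT policy.
    obs = env.reset(seed=train_seed)        # single fixed seed during training, random seeds at evaluation
    node = awm[target_item]
    starting_items = set(node["parents"])   # e.g., a wooden pickaxe when collecting cobblestone
    if node.get("required_tool"):
        starting_items.add(node["required_tool"])
    env.set_inventory(starting_items)       # initialize with the node's parents from the AWM
    for _ in range(max_steps):              # terminate after 1,000 environment steps
        action = policy.act(obs)
        obs, reward, done, info = env.step(action)
        if done:                            # the target item was added to the inventory
            break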

Appendix C Abstract World Model

C.1 Disambiguating the World Model

In many environments, multiple possible transitions between subgoals may exist. For example, in Minecraft, an agent can obtain coal through mining or by burning wood in a furnace. Ideally, edges of the AWM would provide paths with high success rate to each node. In our implementation we keep the first experienced edge between nodes, assuming it to be the simplest path.
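Concretely, the update rule amounts to something like the following sketch, where verified tracks which nodes have already been confirmed by experience; the names are illustrative and the node format follows the parsing sketch in Appendix A.2.

verified = set()  # nodes whose transition has already been confirmed by experience

def update_awm_edge(awm, item, experienced_parents):
    # Keep only the first experienced transition into `item`; later alternative
    # routes to the same item (e.g., coal from smelting rather than mining) are ignored.
    if item not in verified:
        awm[item]["parents"] = set(experienced_parents)
        verified.add(item)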

C.2 Additional Results

Figure 8: AWM growth over the course of open-ended exploration. The first three quadrants are identical to Figure 2. The last quadrant adds results for a ground-truth AWM. The agent learns to craft glass much sooner and also learns to craft glass bottles, an item none of the other methods reached.