
Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks

Haoqi Yuan1, Chi Zhang2, Hongcheng Wang1,4, Feiyang Xie3,

Penglin Cai3, Hao Dong1, Zongqing Lu1,4


1School of Computer Science, Peking University

2School of EECS, Peking University

3Yuanpei College, Peking University

4Beijing Academy of Artificial Intelligence
Correspondence to Zongqing Lu <zongqing.lu@pku.edu.cn>, Haoqi Yuan <yhq@pku.edu.cn>
Abstract

We study building multi-task agents in open-world environments. Without human demonstrations, learning to accomplish long-horizon tasks in a large open-world environment with reinforcement learning (RL) is extremely inefficient. To tackle this challenge, we convert the multi-task learning problem into learning basic skills and planning over the skills. Using the popular open-world game Minecraft as the testbed, we propose three types of fine-grained basic skills, and use RL with intrinsic rewards to acquire skills. A novel Finding-skill that performs exploration to find diverse items provides better initialization for other skills, improving the sample efficiency for skill learning. In skill planning, we leverage the prior knowledge in Large Language Models to find the relationships between skills and build a skill graph. When the agent is solving a task, our skill search algorithm walks on the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing more than 10 skills. Our method outperforms baselines by a large margin and is the most sample-efficient demonstration-free RL method to solve Minecraft Tech Tree tasks. The project’s website and code can be found at https://sites.google.com/view/plan4mc.

1 Introduction

Learning diverse tasks in open-ended worlds is a significant milestone toward building generally capable agents. Recent studies in multi-task reinforcement learning (RL) have achieved great success in many narrow domains like games [31] and robotics [39]. However, transferring prior methods to open-world domains [34, 9] remains unexplored. Minecraft, a popular open-world game with an infinitely large world size and a huge variety of tasks, has been regarded as a challenging benchmark [10, 9].

Previous works usually build policies in Minecraft upon imitation learning, which requires expert demonstrations [10, 4, 37] or large-scale video datasets [2]. Without demonstrations, RL in Minecraft is extremely sample-inefficient. A state-of-the-art model-based method [12] takes over 10M environmental steps to harvest cobblestones, even when the game simulator's block-breaking speed is additionally set to be very fast. This difficulty comes from at least two aspects. First, the world size is too large and the requisite resources are distributed far away from the agent. With partially observed visual input, the agent cannot easily identify its state or explore effectively. Second, a task in Minecraft usually has a long horizon with many sub-goals. For example, mining a cobblestone involves more than 10 sub-goals (from harvesting logs to crafting wooden pickaxes) and requires thousands of environmental steps.

To mitigate the issue of learning long-horizon tasks, we propose to solve diverse tasks in a hierarchical fashion. In Minecraft, we define a set of basic skills. Then, solving a task can be decomposed into planning for a proper sequence of basic skills and executing the skills interactively. We train RL agents to acquire skills and build a high-level planner upon the skills.

Figure 1: Overview of Plan4MC. We categorize the basic skills in Minecraft into three types: Finding-skills, Manipulation-skills, and Crafting-skills. We train policies to acquire skills with reinforcement learning. With the help of LLM, we extract relationships between skills and construct a skill graph in advance, as shown in the dashed box. During online planning, the skill search algorithm walks on the pre-generated graph, decomposes the task into an executable skill sequence, and interactively selects policies to solve complex tasks.

We find that training skills with RL remains challenging due to the difficulty of finding the required resources in the vast world. For example, if we use RL to train the skill of harvesting logs, the agent keeps receiving 0 reward under random exploration because it cannot find a tree nearby. In contrast, if a tree is always initialized close to the agent, the skill can be learned efficiently (Table 1). Thus, we propose to learn a Finding-skill that performs exploration to find items in the world and provides better initialization for all other skills, improving the sample efficiency of learning skills with RL. The Finding-skill is implemented with a hierarchical policy that maximizes the area traversed by the agent.

We split the skills in the recent work [37] into more fine-grained basic skills and classify them into three types: Finding-skills, Manipulation-skills, and Crafting-skills. Each basic skill solves an atomic task that cannot be further divided. Such tasks have a shorter horizon and require exploration in smaller regions of the world. Thus, using RL to learn these basic skills is more feasible. To improve the sample efficiency of RL, we introduce intrinsic rewards to train policies for different types of skills.

For high-level skill planning, recent works [3, 37, 36] demonstrate promising results via interacting with Large Language Models (LLMs). Though LLMs generalize to open-ended environments well and produce reasonable skill sequences, fixing their uncontrollable mistakes requires careful prompt engineering [14, 37]. To produce more reliable skill plans, we propose a complementary skill search approach. In the preprocessing stage, we use an LLM to generate the relationships between skills and construct a skill dependency graph. Then, given any task and the agent’s condition (e.g., available resources/tools), we propose a search algorithm to interactively plan for the skill sequence. Figure 1 illustrates our proposed framework, Plan4MC.

In experiments, we build 40 diverse tasks in the MineDojo [9] simulator. These tasks involve executing diverse skills, including collecting basic materials, crafting useful items, and interacting with mobs. Each task requires planning and execution of 2~30 basic skills and takes thousands of environmental steps. Results show that Plan4MC accomplishes all the tasks and outperforms the baselines significantly. Also, Plan4MC can craft iron pickaxes in the Minecraft Tech Tree and is much more sample-efficient than existing demonstration-free RL methods.

To summarize, our main contributions are:

  • To enable RL methods to efficiently solve diverse open-world tasks, we propose to learn fine-grained basic skills including a Finding-skill and train RL policies with intrinsic rewards. Thus, solving long-horizon tasks is transformed into planning over basic skills.


  • Unlike previous LLM-based planning methods, we propose the skill graph and the skill search algorithm for interactive planning. The LLM only assists in the generation of the skill graph before task execution, avoiding uncontrollable failures caused by the LLM.


  • Our hierarchical agent achieves promising performance in diverse and long-horizon Minecraft tasks, demonstrating the great potential of using RL to build multi-task agents in open-ended worlds.



2 Preliminaries

2.1 Problem Formulation

In Minecraft, a task $\tau=(g,I)$ is defined with the combination of a goal $g$ and the agent's initial condition $I$, where $g$ represents the target entity to acquire in the task and $I$ represents the initial tools and conditions provided for the agent. For example, a task can be ‘harvest cooked_beef with sword in plains’. We model the task as a partially observable Markov decision process (POMDP) [16]. $I$ determines the environment's initial state distribution. At each timestep $t$, the agent obtains the partial observation $o_t$, takes an action $a_t$ following its policy $\pi(a_t|o_{0:t},\tau)$, and receives a sparse reward $r_t$ indicating task completion. The agent aims to maximize its expected return $R=\mathbb{E}_{\pi}\sum_{t}\gamma^{t}r_{t}$.
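
To make this formulation concrete, here is a minimal Python sketch of a task as a (goal, initial condition) pair and of the discounted return being maximized; the class and function names are illustrative and not part of the simulator's API.

from dataclasses import dataclass, field

@dataclass
class Task:
    goal: str                                               # target entity g, e.g. "cooked_beef"
    initial_condition: dict = field(default_factory=dict)   # initial tools/items I, e.g. {"sword": 1}

def discounted_return(rewards, gamma=0.99):
    # R = sum_t gamma^t r_t; with the sparse reward, r_t is 1 only when the task is completed.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

task = Task(goal="cooked_beef", initial_condition={"sword": 1})
print(discounted_return([0.0] * 99 + [1.0]))  # the sparse reward arrives at the 100th step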

To solve complex tasks, humans acquire and reuse skills in the world, rather than learning each task independently from scratch. Similarly, to solve the aforementioned task, the agent can sequentially use the skills: harvest log, …, craft furnace, harvest beef, place furnace, and craft cooked_beef. Each skill solves a simple sub-task in a shorter time horizon, with the necessary tools and conditions provided. For example, the skill ‘craft cooked_beef’ solves the task ‘harvest cooked_beef with beef, log, and placed furnace’. Once the agent acquires an abundant set of skills $S$, it can solve any complex task by decomposing it into a sequence of sub-tasks and executing the skills in order. Meanwhile, by reusing skills across different tasks, the agent becomes much more efficient in memory and learning.

To this end, we convert the goal of solving diverse and long-horizon tasks in Minecraft into building a hierarchical agent. At the low level, we train policies $\pi_s$ to learn all the skills $s\in S$, where $\pi_s$ takes as input the RGB image and some auxiliary information (compass, location, biome, etc.), then outputs an action. At the high level, we study planning methods to convert a task $\tau$ into a skill sequence $(s_{\tau,1},s_{\tau,2},\cdots)$.

2.2 Skills in Minecraft

Recent works mainly rely on imitation learning to learn Minecraft skills efficiently. In the MineRL competition [17], a human gameplay dataset is accessible along with the Minecraft environment. All of the top methods in the competition use imitation learning to some degree to learn useful behaviors from limited interactions. In VPT [2], a large policy model is pre-trained on a massive labeled dataset using behavior cloning. By fine-tuning on smaller datasets, policies are acquired for diverse skills.

However, without demonstration datasets, learning Minecraft skills with reinforcement learning (RL) is difficult. MineAgent [9] shows that PPO [32] can only learn a small set of skills. PPO with sparse reward fails at ‘milk a cow’ and ‘shear a sheep’, even though the distance between the target mobs and the agent is set within 10 blocks. We argue that with the high-dimensional state and action space, the open-ended large world, and partial observation, exploration in Minecraft tasks is extremely difficult.

In Table 1, we study how RL learns skills of different difficulties. We observe that RL has performance comparable to imitation learning only when the task-relevant entities are initialized very close to the agent. Otherwise, RL performance decreases significantly. This motivates us to further divide skills into fine-grained skills. We propose a Finding-skill to provide a good initialization for other skills. For example, the skill ‘milk a cow’ is decomposed into ‘find a cow’ and ‘harvest milk_bucket’. After finding a cow nearby, ‘harvest milk_bucket’ can be accomplished by RL with acceptable sample efficiency. Thus, learning such fine-grained skills is easier for RL, and together they can still accomplish the original task.

Table 1: Minecraft skill performance of imitation learning (behavior cloning with the MineCLIP backbone, reported in [4]) versus reinforcement learning. Better init. means target entities are closer to the agent at initialization. The RL method for each task is trained with proper intrinsic rewards. All RL results are averaged over the last 100 training epochs and 3 training seeds.
Skill | [item] | [item] | [item] | [item] | [item]
Behavior Cloning | 0.25 | 0.27 | 0.16 | – | –
RL | 0.40±0.20 | 0.26±0.22 | 0.04±0.02 | 0.04±0.01 | 0.00±0.00
RL (better init.) | 0.99±0.01 | 0.81±0.02 | 0.16±0.06 | 0.14±0.07 | 0.44±0.10

3 Learning Basic Skills with Reinforcement Learning

Based on the discussion above, we propose three types of fine-grained basic skills, from which all Minecraft tasks can be composed.

  • Finding-skills: starting from any location, the agent explores to find a target and approaches it. The target can be any block or entity that exists in the world.


  • Manipulation-skills: given proper tools and the target in sight, the agent interacts with the target to obtain materials. These skills include diverse behaviors, like mining ores, killing mobs, and placing blocks.


  • Crafting-skills: with the requisite materials in the inventory and a crafting table or furnace placed nearby, the agent crafts advanced materials or tools.



3.1 Learning to Find with a Hierarchical Policy

Finding items is a difficult long-horizon task for RL. To find an unseen tree on the plains, the agent may need to take thousands of steps to explore as much of the world map as possible. A random policy fails to perform such exploration, as shown in Appendix A. Also, it is too costly to train different policies for various target items. To simplify this problem, considering exploration on the world's surface only, we propose to train a target-free hierarchical policy that handles all the Finding-skills.

Figure 2 demonstrates the hierarchical policy for Finding-skills. The high-level policy $\pi^H\left((x,y)^g \mid (x,y)_{0:t}\right)$ observes the agent's historical locations $(x,y)_{0:t}$ and outputs a goal location $(x,y)^g$. It drives the low-level policy $\pi^L\left(a_t \mid o_t,(x,y)^g\right)$ to reach the goal location. We assume that target items are uniformly distributed on the world's surface. To maximize the chance of finding diverse targets, the objective of the high-level policy is to maximize the area it reaches. We divide the world's surface into discrete grids, where each grid represents a $10\times 10$ area. We use the state count over the grids as the reward for the high-level policy. The low-level policy obtains the environmental observation $o_t$ and the goal location $(x,y)^g$ proposed by the high-level policy, and outputs an action $a_t$. We reward the low-level policy with the change in distance to the goal location.
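
Below is a minimal sketch of the two rewards described above, assuming the agent's surface position is available as (x, y) coordinates; the 10x10 grid size follows the text, while the reward scales and function names are our own illustration.

def high_level_reward(visited_cells, x, y, grid_size=10):
    # State-count reward: bonus when the agent enters a 10x10 surface grid cell
    # that has not been visited before in the episode.
    cell = (int(x // grid_size), int(y // grid_size))
    if cell in visited_cells:
        return 0.0
    visited_cells.add(cell)
    return 1.0

def low_level_reward(prev_pos, pos, goal_pos):
    # Distance-change reward: positive when this step moves the agent closer to (x, y)^g.
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    return dist(prev_pos, goal_pos) - dist(pos, goal_pos)

visited = set()
print(high_level_reward(visited, 23.0, 47.0))             # 1.0: a new grid cell
print(low_level_reward((0, 0), (1, 0), goal_pos=(5, 0)))  # 1.0: one block closer to the goal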

To train the hierarchical policy with acceptable sample complexity, we first pre-train the low-level policy with randomly generated goal locations using DQN [26], and then train the high-level policy using PPO [32] with the low-level policy fixed. At test time, to find a specific item, the agent first explores the world with the hierarchical policy until a target item is detected in its lidar observations. Then, the agent executes the low-level policy conditioned on the detected target's location to reach the target item. Though we use additional lidar information here, we believe that without this information the success detector for Finding-skills could also be implemented with computer vision models [7].

Figure 2: The proposed hierarchical policy for Finding-skills. The high-level recurrent policy $\pi^H$ observes historical positions $(x,y)_{0:t}$ from the environment and generates a goal position $(x,y)^g$. The low-level policy $\pi^L$ is a goal-based policy to reach the goal position. The right figure shows a top view of the agent's exploration trajectory, where the walking paths of the low-level policy are shown in blue dotted lines, and the goal is changed by the high-level policy at each black spot. The high-level policy is optimized to maximize the state count in the grid world, which is shown in the grey background.

3.2 Manipulation and Crafting

By executing the pre-trained Finding-skills, we can instantiate the manipulation tasks with the requisite target items nearby, making the manipulation tasks much easier. To train the Manipulation-skills in Minecraft, we can either make a training environment with the target item initialized nearby or run the Finding-skills to reach a target item. For example, to train the skill ‘harvest milk_bucket’, we can either spawn a cow close to the agent using the Minecraft built-in commands, or execute the Finding-skills until a cow is reached. The latter is similar in spirit to Go-Explore [8], and is more suitable for other environments that do not have commands to initialize the target items nearby.

We adopt MineCLIP [9] to guide the agent with intrinsic rewards. The pre-trained MineCLIP model computes the CLIP reward based on the similarity between environmental observations (frames) and the language description of the skill. We train the agent using PPO with self-imitation learning to maximize a weighted sum of the intrinsic reward and the sparse extrinsic success reward. Details for training basic skills can be found in Appendix D.
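
A minimal sketch of how such a weighted reward can be combined, assuming a MineCLIP-style similarity score for the recent frames is already computed elsewhere; the weighting coefficient and the function name are assumptions for illustration.

def skill_reward(clip_similarity, success, intrinsic_weight=0.1):
    # clip_similarity: similarity between recent frames and the skill's language
    #                  description, produced by the pre-trained MineCLIP model.
    # success:         sparse extrinsic reward, 1.0 when the skill's target is obtained.
    return intrinsic_weight * clip_similarity + success

print(skill_reward(clip_similarity=0.7, success=0.0))  # shaped reward before completion
print(skill_reward(clip_similarity=0.7, success=1.0))  # completion adds the sparse reward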

Crafting-skills can each be executed with a single action in MineDojo [9].

4 Solving Minecraft Tasks via Skill Planning

In this section, we present our skill planning method for solving diverse hard tasks. A skill graph is generated in advance with a Large Language Model (LLM), enabling the agent to search for correct skill sequences on the fly.

4.1 Constructing Skill Graph with Large Language Models

A correct plan $(s_{\tau,1},s_{\tau,2},\cdots)$ for a task $\tau=(g,I)$ should satisfy two conditions. (1) For each $i$, $s_{\tau,i}$ is executable after $(s_{\tau,1},\cdots,s_{\tau,i-1})$ are accomplished sequentially under the initial condition $I$. (2) The target item $g$ is obtained after all the skills are accomplished sequentially, given the initial condition $I$. To enable searching for such plans, we should be able to verify whether a plan is correct. Thus, we should know what conditions each skill requires and what it obtains. We define such information of skills in a structured format. As an example, the information for the skill ‘crafting stone_pickaxe’ is:

stone_pickaxe {consume: {cobblestone: 3, stick: 2},
require: {crafting_table_nearby: 1}, obtain: {stone_pickaxe: 1}}

Each item in this format is also a skill. Regarding these items as graph nodes, this format defines a graph structure between the skill ‘stone_pickaxe’ and the skills ‘cobblestone’, ‘stick’, and ‘crafting_table_nearby’. The directed edge from ‘cobblestone’ to ‘stone_pickaxe’ is represented as (3, 1, consume), encoding the quantity relationship between parent and child and indicating that the parent item is consumed during skill execution. In fact, in this format, all the basic skills in Minecraft construct a large directed acyclic graph with hundreds of nodes. The dashed box in Figure 1 shows a small part of this graph, where grey arrows denote ‘consume’ and red arrows denote ‘require’.
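
For illustration, the sketch below turns a few entries of this format into graph edges and checks whether a candidate skill sequence satisfies conditions (1) and (2) above, given an initial condition. The entries and function names are illustrative; the full set of entries in Plan4MC is generated by the LLM.

# A small subset of skill information in the structured format above.
SKILLS = {
    "stone_pickaxe": {"consume": {"cobblestone": 3, "stick": 2},
                      "require": {"crafting_table_nearby": 1},
                      "obtain": {"stone_pickaxe": 1}},
    "stick": {"consume": {"planks": 2}, "require": {}, "obtain": {"stick": 4}},
}

def skill_edges(skills):
    # Each directed edge parent -> child carries (parent quantity, child quantity, edge type).
    edges = []
    for child, info in skills.items():
        child_qty = info["obtain"][child]
        for edge_type in ("consume", "require"):
            for parent, parent_qty in info[edge_type].items():
                edges.append((parent, child, (parent_qty, child_qty, edge_type)))
    return edges

def plan_is_correct(plan, goal, initial_items, skills):
    # Condition (1): every skill is executable when its turn comes;
    # condition (2): the target item g is obtained in the end.
    items = dict(initial_items)
    for name in plan:
        info = skills[name]
        needed = {**info["consume"], **info["require"]}
        if any(items.get(k, 0) < v for k, v in needed.items()):
            return False
        for k, v in info["consume"].items():
            items[k] -= v                   # consumed items are removed
        for k, v in info["obtain"].items():
            items[k] = items.get(k, 0) + v  # obtained items are added
    return items.get(goal, 0) > 0

print(skill_edges(SKILLS))
print(plan_is_correct(["stick", "stone_pickaxe"], "stone_pickaxe",
                      {"planks": 2, "cobblestone": 3, "crafting_table_nearby": 1}, SKILLS))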

To construct the skill graph, we generate structured information for all the skills by interacting with ChatGPT (GPT-3.5) [29], a high-performance LLM. Since LLMs are trained on large-scale internet datasets, they acquire rich knowledge of the popular game Minecraft. In the prompt, we give a few demonstrations and explanations of the format, then ask ChatGPT to generate the information for the other skills. The dialog with ChatGPT can be found in Appendix E.

4.2 Skill Search Algorithm

Our skill planning method is a depth-first search (DFS) algorithm on the skill graph. Given a task $\tau=(g,I)$, we start from the node $g$ and perform DFS toward its parents, opposite to the edge directions. In this process, we maintain the set of possessed items, initialized from $I$. Once the conditions for a skill are satisfied or the skill node has no parent, we append this skill to the planned skill list and update the maintained items according to the skill information. The resulting skill list is guaranteed to be executable and target-reaching.

To solve a long-horizon task, since the learned low-level skills may fail, we alternate skill planning and skill execution until the episode terminates. After each skill execution, we update the agent's condition $I^{\prime}$ based on its inventory and the last executed skill, and search for the next skill with $\tau^{\prime}=(g,I^{\prime})$.

We present the pseudocode for the skill search algorithm and the testing process in Appendix C.
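
Complementary to the pseudocode in Appendix C, the following is a minimal sketch of the idea under the same consume/require/obtain format, including the alternation between planning and execution. The leaf-node handling and the environment interface (inventory, execute_skill, episode_done) are illustrative assumptions, not the actual MineDojo API.

def search_plan(goal, items, skills, need=1, plan=None):
    # DFS from the goal node toward its parents (opposite to the edge directions),
    # maintaining the possessed items; skills without parents are appended directly.
    if plan is None:
        plan = []
    info = skills.get(goal, {"consume": {}, "require": {}, "obtain": {goal: 1}})
    while items.get(goal, 0) < need:
        for parent, qty in {**info["consume"], **info["require"]}.items():
            search_plan(parent, items, skills, need=qty, plan=plan)
        for k, v in info["consume"].items():
            items[k] = items.get(k, 0) - v
        for k, v in info["obtain"].items():
            items[k] = items.get(k, 0) + v
        plan.append(goal)
    return plan

def solve_task(goal, env, skills):
    # Alternate planning and execution: after every skill execution, re-plan
    # from the agent's updated condition until the episode terminates.
    while not env.episode_done():
        plan = search_plan(goal, dict(env.inventory()), skills)
        if not plan:
            return True             # the goal is already satisfied
        env.execute_skill(plan[0])  # run the corresponding learned policy or crafting action
    return False

# With the same SKILLS subset as in the sketch above, planning a stone pickaxe from
# scratch yields a sequence whose leaves (cobblestone, planks, crafting_table_nearby)
# stand for the Finding-/Manipulation-skills that gather or place those items.
SKILLS = {
    "stone_pickaxe": {"consume": {"cobblestone": 3, "stick": 2},
                      "require": {"crafting_table_nearby": 1},
                      "obtain": {"stone_pickaxe": 1}},
    "stick": {"consume": {"planks": 2}, "require": {}, "obtain": {"stick": 4}},
}
print(search_plan("stone_pickaxe", {}, SKILLS))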

5 Experiments

In this section, we evaluate and analyze our method against baselines and ablations on challenging Minecraft tasks. Section 5.1 introduces the implementation of basic skills. Section 5.2 introduces the setup of our evaluation task suite. Sections 5.3 and 5.4 present the experimental results and analyze skill learning and planning, respectively.

5.1 Training Basic Skills

To pre-train basic skills with RL, we use the environments of programmatic tasks in MineDojo [9]. To train Manipulation-skills, for simplicity, we specify environments that initialize target mobs or resources close to the agent. For the Go-Explore-like training method without specified environments discussed in Section 3.2, we present the results in Appendix H; it does not underperform the former.

For Manipulation-skills and the low-level policy of Finding-skills, we adopt the policy architecture of MineAgent [9], which uses a fixed pre-trained MineCLIP image encoder and processes features using MLPs. To explore in a compact action space, we compress the original large action space into $12\times 3$ discrete actions. For the high-level policy of Finding-skills, which observes the agent's past locations, we use an LSTM policy and train it with truncated BPTT [30]. We pick the model with the highest success rate on the smoothed training curve for each skill, and fix these policies in all tasks. Implementation details can be found in Appendix D.

Note that Plan4MC takes a total of 7M environmental steps in training and can unlock the iron pickaxe in the Minecraft Tech Tree at test time. Its sample efficiency greatly outperforms that of all other existing demonstration-free RL methods [12, 2].

5.2 Task Setup

Based on MineDojo [9] programmatic tasks, we set up an evaluation benchmark consisting of four groups of diverse tasks: cutting trees to craft primary items, mining cobblestones to craft intermediate items, mining iron ores to craft advanced items, and interacting with mobs to harvest food and materials. Each task set has 10 tasks, adding up to a total of 40 tasks. With our settings of basic skills, these tasks require 25 planning steps on average and at most 121 planning steps. We estimate the number of required environmental steps for each task as the sum of the steps of the initially planned skills, and double this number to obtain the maximum episode length for the task, allowing skill executions to fail. The easiest tasks have 3000 maximum steps, while the hardest tasks have 12000. More details about the task setup are listed in Appendix F. To evaluate the success rate on each task, we average the results over 30 test episodes.
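
As a small illustration of this episode-length budget, assuming per-skill step estimates are available (the numbers below are placeholders rather than the paper's actual estimates):

def max_episode_length(planned_skills, estimated_steps):
    # Double the estimated steps of the initially planned skills,
    # leaving room for failed skill executions.
    return 2 * sum(estimated_steps[s] for s in planned_skills)

print(max_episode_length(["find_tree", "harvest_log", "craft_planks"],
                         {"find_tree": 1000, "harvest_log": 400, "craft_planks": 100}))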

5.3 Skill Learning

We first analyze learning basic skills. While we propose three types of fine-grained basic skills, other methods directly learn more complicated, long-horizon skills. We introduce two baselines to study learning skills with RL.

MineAgent [9]. Without decomposing tasks into basic skills, MineAgent solves tasks using PPO and self-imitation learning with the CLIP reward. For fairness, we train MineAgent in the test environment of each task. The training takes 7M environmental steps, which is equal to the sum of the environmental steps we take to train all the basic skills. We average the success rate of trajectories over the last 100 training epochs (around 1M environment steps) as its test success rate. Since MineAgent has no actions for crafting items, we hardcode the crafting actions into the training code. During trajectory collection, at each time step where the skill search algorithm returns a Crafting-skill, the corresponding crafting action is executed. Note that if we expanded the action space for MineAgent rather than automatically executing crafting actions, the exploration would be much harder.

Plan4MC w/o Find-skill. None of the previous works decomposes a skill into executing Finding-skills and Manipulation-skills; instead, finding items and manipulating them are done within a single skill. Plan4MC w/o Find-skill implements such a method. It skips all the Finding-skills in the skill plans during testing, and the Manipulation-skills take over the whole process of finding items and manipulating them.

Table 2: Average success rates of our method, all baselines, and ablation methods on the four task sets. Success rates on all individual tasks are listed in Appendix G.
Task Set | Cut-Trees | Mine-Stones | Mine-Ores | Interact-Mobs
MineAgent | 0.003 | 0.026 | 0.000 | 0.171
Plan4MC w/o Find-skill | 0.187 | 0.097 | 0.243 | 0.170
Interactive LLM | 0.260 | 0.067 | 0.030 | 0.247
Plan4MC Zero-shot | 0.183 | 0.000 | 0.000 | 0.133
Plan4MC 1/2-steps | 0.337 | 0.163 | 0.143 | 0.277
Plan4MC | 0.417 | 0.293 | 0.267 | 0.320

Table 2 shows the test results for these methods. Plan4MC outperforms the two baselines on all four task sets. MineAgent fails on the task sets of Cut-Trees, Mine-Stones, and Mine-Ores, since continually taking many attack actions to mine blocks in Minecraft poses an exploration difficulty for RL on long-horizon tasks. In contrast, MineAgent achieves performance comparable to Plan4MC's on some easier tasks in Interact-Mobs, which require fewer environmental steps and planning steps. Plan4MC w/o Find-skill consistently underperforms Plan4MC on all the tasks, showing that introducing Finding-skills is beneficial for solving hard tasks with basic skills trained by RL. Because there is no Finding-skill in harvesting iron ores, their performance gap on Mine-Ores tasks is small.

To further study the Finding-skills, we present the success rate at each planning step in Figure 3 for three tasks. The curves of Plan4MC and Plan4MC w/o Find-skill show large drops at Finding-skills. In particular, the success rates of finding cobblestones and logs decrease the most, because these items are harder to find in the environment than mobs. In these tasks, we compute the average success rate of Manipulation-skills, conditioned on the skill before the last Finding-skill being accomplished. While Plan4MC has a conditional success rate of 0.40, Plan4MC w/o Find-skill decreases to 0.25, showing that solving sub-tasks with additional Finding-skills is more effective.

As shown in Table 3, most Manipulation-skills have slightly lower success rates in test than in training, due to the domain gap between the test and training environments. However, this decrease does not occur in the skills that are trained with a large initial distance to mobs/items, as the pre-executed Finding-skills provide better initialization for Manipulation-skills during the test, so the success rate may even increase. In contrast, the success rates in the test without Finding-skills are significantly lower.

Figure 3: Success rates of Plan4MC with/without Finding-skills at each skill planning step, on three long-horizon tasks. We arrange the initially planned skill sequence on the horizontal axis and remove the repeated skills. The success rate of each skill represents the probability of successfully executing this skill at least once in a test episode. Specifically, the success rate is always 1 at task initialization, and the success rate of the last skill is equal to the task’s success rate.
Table 3: Success rates of Manipulation-skills in training and test. Training init. distance is the maximum distance at which mobs/items are initialized when training the skills. Note that in test, executing Finding-skills reaches the target items within a distance of 3. The training success rate is averaged over 100 training epochs around the selected model's epoch. The test success rate is computed from the test rollouts of all the tasks, while w/o Find refers to Plan4MC w/o Find-skill.
Manipulation-skills | Place | [item] | [item] | [item] | [item] | [item] | [item]
Training init. distance | 10 | 10 | 2 | 2 | | |
Training success rate | 0.98 | 0.50 | 0.27 | 0.21 | 0.30 | 0.56 | 0.47
Test success rate | 0.77 | 0.71 | 0.26 | 0.27 | 0.16 | 0.33 | 0.26
Test success rate (w/o Find) | 0.79 | 0.07 | 0.03 | 0.03 | 0.02 | 0.05 | 0.06

5.4 Skill Planning

For skill planning in open-ended worlds, recent works [13, 14, 3, 21, 37] generate plans or sub-tasks with LLMs. We study these methods on our task sets and implement a best-performing baseline to compare with Plan4MC.

Interactive LLM. We implement an interactive planning baseline using LLMs. We take ChatGPT [29] as the planner, which proposes skill plans based on prompts including descriptions of the task and observations. Similar to chain-of-thought prompting [38], we provide few-shot demonstrations with explanations to the planner at the initial planning step. In addition, we add several rules for planning into the prompt to fix common errors that the model encountered during testing. At each subsequent planning step, the planner encounters one of the following cases: the proposed skill name is invalid, the skill is already done, the skill execution succeeds, or the skill execution fails. We carefully design language feedback for each case and ask the planner to re-plan based on inventory changes. For low-level skills, we use the same pre-trained skills as Plan4MC.
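
A minimal sketch of how the feedback for the four cases could be phrased, with the function name and message wording as assumptions rather than the exact prompts used in this baseline:

def planner_feedback(case, skill, inventory_change=None):
    # One message per case the planner can encounter after proposing a skill.
    messages = {
        "invalid_name": f"'{skill}' is not a valid skill name. Please choose a known skill.",
        "already_done": f"'{skill}' is already satisfied. Please propose the next skill.",
        "success": f"'{skill}' succeeded. Inventory change: {inventory_change}. Propose the next skill.",
        "failure": f"'{skill}' failed. Inventory change: {inventory_change}. Please re-plan.",
    }
    return messages[case]

print(planner_feedback("success", "harvest_log", {"log": 1}))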

Also, we conduct ablations on our skill planning designs.

Plan4MC Zero-shot. This is a zero-shot variant of our interactive planning method, which proposes a skill sequence only at the beginning of each task. The agent executes the planned skills sequentially until a skill fails or the environment terminates. This planner has no fault tolerance for skill execution.

Plan4MC 1/2-steps. In this ablation, we halve the test episode length and require the agent to solve tasks more efficiently.

Success rates for each method are listed in Table 2. We find that Interactive LLM has performance comparable to Plan4MC on the task set of Interact-Mobs, where most tasks require fewer than 10 planning steps. On Mine-Stones and Mine-Ores tasks with long-horizon planning, the LLM planner is more likely to make mistakes, resulting in worse performance. The performance of Plan4MC Zero-shot is much worse than that of Plan4MC on all the tasks, since a successful test episode requires accomplishing each skill in a single trial. The decrease is related to the number of planning steps and the skill success rates in Table 3. Plan4MC 1/2-steps shows the smallest performance decrease relative to Plan4MC, showing that Plan4MC can solve tasks within a more limited episode length.

6 Related Work

Minecraft. In recent years, the open-ended world of Minecraft has received wide attention in machine learning research. Malmo [15], MineRL [10], and MineDojo [9] build benchmark environments and datasets for Minecraft. Previous works in the MineRL competition [25, 11, 17] study the ObtainDiamond task with hierarchical RL [25, 33, 24, 23] and imitation learning [1, 11]. Other works explore multi-task learning [35, 18, 4, 28], unsupervised skill discovery [27], LLM-based planning [37, 36, 40], and pre-training from videos [2, 22, 9, 6]. Our work falls under reinforcement learning and planning in Minecraft.

Learning Skills in Minecraft. Acquiring skills is crucial for solving long-horizon tasks in Minecraft. Hierarchical approaches [24, 23] in the MineRL competition learn low-level skills with imitation learning. VPT [2] labels internet-scale datasets and pre-trains a behavior-cloning agent as an initialization for diverse tasks. Recent works [4, 37, 28] learn skills based on VPT. Without expert demonstrations, MineAgent [9] and CLIP4MC [6] learn skills with RL and vision-language rewards, but they can only acquire a small set of skills. Unsupervised skill discovery [27] learns skills that only produce different navigation behaviors. In our work, to enable RL to acquire diverse skills, we learn fine-grained basic skills with intrinsic rewards.

Planning with Large Language Models. With the rapid progress of LLMs [29, 5], many works study LLMs as planners in open-ended worlds. To ground language models, SayCan [3] combines LLMs with skill affordances to produce feasible plans, Translation LMs [13] select demonstrations to prompt LLMs, and LID [20] finetunes language models with tokenized interaction data. Other works study interactive planning for error correction. Inner Monologue [14] provides environment feedback to the planner. DEPS [37] introduces a descriptor, an explainer, and a selector to generate plans with LLMs. In our work, we leverage the LLM to generate a skill graph and introduce a skill search algorithm to eliminate planning mistakes.

7 Conclusion and Discussion

In this paper, we propose a framework to solve diverse long-horizon open-world tasks with reinforcement learning and planning. To tackle the exploration and sample efficiency issues, we propose to learn fine-grained basic skills with RL and introduce a general Finding-skill to provide good environment initialization for skill learning. In Minecraft, we design a graph-based planner, taking advantage of the prior knowledge in LLMs and the planning accuracy of the skill search algorithm. Experiments on 40 challenging Minecraft tasks verify the advantages of Plan4MC over various baselines.

Though we implement Plan4MC in Minecraft, our method is extendable to other similar open-world environments and offers insights into building multi-task learning systems. We leave the detailed discussion to Appendix I.

A limitation of this work is that the Finding-skill is not aware of its goal during exploration, making the goal-reaching policy sub-optimal. Future work needs to improve its efficiency by training a goal-based policy. Moreover, if the LLM lacks domain knowledge, how to correct the LLM’s outputs is a problem worth studying in the future. Providing documents and environmental feedback to the LLM is a promising direction.

References

  • [1] Artemij Amiranashvili, Nicolai Dorka, Wolfram Burgard, Vladlen Koltun, and Thomas Brox. Scaling imitation learning in Minecraft. arXiv preprint arXiv:2007.02701, 2020.
  • [2] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [3] Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning (CORL), 2023.
  • [4] Shaofei Cai, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. Open-world multi-task control through goal-aware representation learning and adaptive horizon prediction. arXiv preprint arXiv:2301.10034, 2023.
  • [5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • [6] Ziluo Ding, Hao Luo, Ke Li, Junpeng Yue, Tiejun Huang, and Zongqing Lu. CLIP4MC: An rl-friendly vision-language model for Minecraft. arXiv preprint arXiv:2303.10571, 2023.
  • [7] Yuqing Du, Ksenia Konyushkova, Misha Denil, Akhil Raju, Jessica Landon, Felix Hill, Nando de Freitas, and Serkan Cabi. Vision-language models as success detectors. arXiv preprint arXiv:2303.07280, 2023.
  • [8] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  • [9] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. MineDojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • [10] William H. Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. MineRL: A large-scale dataset of Minecraft demonstrations. Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019.
  • [11] William Hebgen Guss, Stephanie Milani, Nicholay Topin, Brandon Houghton, Sharada Mohanty, Andrew Melnik, Augustin Harter, Benoit Buschmaas, Bjarne Jaster, Christoph Berganski, et al. Towards robust and domain agnostic reinforcement learning competitions: MineRL 2020. In NeurIPS 2020 Competition and Demonstration Track, 2021.
  • [12] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  • [13] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning (ICML), 2022.
  • [14] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
  • [15] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In International Joint Conference on Artificial Intelligence (IJCAI), 2016.
  • [16] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  • [17] Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, et al. MineRL diamond 2021 competition: Overview, results, and lessons learned. NeurIPS 2021 Competitions and Demonstrations Track, 2022.
  • [18] Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton, Raul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, et al. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv preprint arXiv:2106.14876, 2021.
  • [19] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning (CORL), 2023.
  • [20] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [21] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022.
  • [22] Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila McIlraith. Steve-1: A generative model for text-to-behavior in minecraft. arXiv preprint arXiv:2306.00937, 2023.
  • [23] Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, and Wei Yang. Juewu-mc: Playing Minecraft with sample-efficient hierarchical reinforcement learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), 2022.
  • [24] Hangyu Mao, Chao Wang, Xiaotian Hao, Yihuan Mao, Yiming Lu, Chengjie Wu, Jianye Hao, Dong Li, and Pingzhong Tang. Seihai: A sample-efficient hierarchical ai for the MineRL competition. In Distributed Artificial Intelligence (DAI), 2022.
  • [25] Stephanie Milani, Nicholay Topin, Brandon Houghton, William H Guss, Sharada P Mohanty, Keisuke Nakata, Oriol Vinyals, and Noboru Sean Kuno. Retrospective analysis of the 2019 MineRL competition on sample efficient reinforcement learning. In NeurIPS 2019 Competition and Demonstration Track, 2020.
  • [26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [27] Juan José Nieto, Roger Creus, and Xavier Giro-i Nieto. Unsupervised skill-discovery and skill-learning in Minecraft. arXiv preprint arXiv:2107.08398, 2021.
  • [28] Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, and Roy Fox. Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling. arXiv preprint arXiv:2301.12050, 2023.
  • [29] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [30] Marco Pleines, Matthias Pallasch, Frank Zimmer, and Mike Preuss. Memory gym: Partially observable challenges to memory-based agents. In International Conference on Learning Representations (ICLR), 2023.
  • [31] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
  • [32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [33] Alexey Skrynnik, Aleksey Staroverov, Ermek Aitygulov, Kirill Aksenov, Vasilii Davydov, and Aleksandr I Panov. Hierarchical deep q-network from imperfect demonstrations in Minecraft. Cognitive Systems Research, 65:74–78, 2021.
  • [34] Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, et al. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808, 2021.
  • [35] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. In Proceedings of the AAAI conference on artificial intelligence (AAAI), 2017.
  • [36] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  • [37] Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023.
  • [38] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [39] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CORL), 2020.
  • [40] Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144, 2023.

Appendix A The Necessity of Learning the Finding-skill

We demonstrate the exploration difficulty of learning skills in Minecraft. Figure 4 shows that, within 500 steps, a random policy can only travel about 5 blocks away on plains. Since trees are rare on plains and are usually more than 20 blocks away from the player, an RL agent initialized from a random policy can fail to collect logs there.

Figure 4: Maximal distance from the spawning point that a random policy can reach in Minecraft, under different episode lengths. We test for 100 episodes with different randomly generated worlds and agent parameters. Note that all Manipulation-skills we trained have episode lengths of less than 1000 to ensure sample efficiency.

In Table 4, we compare the travel distances of a random policy, a hand-coded walking policy, and our Finding-skill pre-trained with RL. We find that the Finding-skill has a stronger exploration ability than the other two policies.

Table 4: Maximal travel distance on plains of a random policy, a hand-coded policy which always takes forward+jump and randomly turns left or right, and our Finding-skill.
Episode length | 200 | 500 | 1000
Random Policy | 3.0 ± 2.1 | 5.0 ± 3.6 | 7.1 ± 4.9
Hand-coded Policy | 7.1 ± 2.7 | 11.7 ± 4.4 | 18.0 ± 6.6
Finding-skill | 12.6 ± 5.6 | 18.5 ± 9.3 | 25.7 ± 12.1

Appendix B Pipeline Demonstration

Here we visually demonstrate the steps Plan4MC takes to solve a long-horizon task. Figure 5 shows the interactive planning and execution process for crafting a bed. Figure 6 shows the key frames of Plan4MC solving the challenging Tech Tree task of crafting an iron pickaxe with bare hands.

Figure 5: Demonstration of a planning and execution episode for the task "craft a bed". Following the direction of the arrows, the planner iteratively proposes the skill sequence based on the agent's state, and the policy executes the first skill. Although one execution of "harvest wool" fails midway, the planner replans "find a sheep" to recover from the error and finally completes the task. The lower right shows the skill graph for this task, where the red circle indicates the target and the blue circles indicate the initial items.
Figure 6: A playing episode of Plan4MC crafting an iron pickaxe with bare hands. This is a challenging task in the Minecraft Tech Tree, requiring 16 different basic skills and 117 steps in the initial plan.

Appendix C Algorithms

We present our algorithm sketches for skill planning and solving hard tasks here.

Input: Pre-generated skill graph G; target item g; target item quantity n;
Global variables: possessed items I and skill sequence S.
for g' in parents(G, g) do
    (n_{g'}, n_g, consume) ← attributes of edge <g', g>;
    n_{g'}^{todo} ← n_{g'};
    if (quantity of g' in I) > n_{g'} then
        decrease the quantity of g' in I by n_{g'}, if consume;
    else
        n_{g'}^{todo} ← n_{g'}^{todo} - (quantity of g' in I);
    while n_{g'}^{todo} > 0 do
        DFS(G, g', n_{g'}^{todo}, I, S);
        if g' is not a Crafting-skill then
            remove all nearby items from I;
        n_{g'}^{obtain} ← quantity of g' obtained after executing skill g';
        if n_{g'}^{obtain} > n_{g'}^{todo} then
            increase the quantity of g' in I by n_{g'}^{obtain} - n_{g'}^{todo};
        increase, in I, the other items obtained after executing skill g';
        n_{g'}^{todo} ← n_{g'}^{todo} - n_{g'}^{obtain};
Append skill g to S.
Algorithm 1: DFS.
Input: Pre-generated skill graph G; target item g; initial items I.
Output: Skill sequence (s_1, s_2, ...).
S' ← ();
I' ← I;
DFS(G, g, 1, I', S');
return S'.
Algorithm 2: Skill search algorithm.
Input: Task T = (g, I); pre-trained skills {π_s}_{s∈S}; pre-generated skill graph G; skill search algorithm Search.
Output: Task success.
I' ← I;
while task not done do
    (s_1, s_2, ...) ← Search(G, g, I');
    execute π_{s_1} for several steps;
    if task success then
        return True;
    I' ← inventory items ∪ nearby items;
return False.
Algorithm 3: Process for solving a task.

Appendix D Details in Training Basic Skills

Table 5 shows the environment and algorithm configurations for training the basic skills. Except for the mining [Uncaptioned image][Uncaptioned image] skill, whose block breaking speed multiplier in the simulator is set to 10, all skills are trained with the unmodified MineDojo simulator.

Though the MineCLIP reward improves the learning of many skills, it is still not enough to encourage some complicated behaviors. In combat [Uncaptioned image][Uncaptioned image], we introduce a distance reward and an attack reward to further encourage the agent to chase and attack the mobs. In mining [Uncaptioned image][Uncaptioned image], we introduce a distance reward to keep the agent close to the target blocks. To mine underground ores [Uncaptioned image][Uncaptioned image], we add a depth reward to encourage the agent to mine deeper and then return to the ground. These item-based intrinsic rewards are easy to implement for all items and are also applicable to many other open-world domains, such as robotics. The intrinsic rewards are implemented as follows.

State count. The high-level recurrent policy for Finding-skills optimizes the visited area in a $110 \times 110$ square, where the agent's spawn location is at the center. We divide the square into $11 \times 11$ grids and keep a visitation flag for each grid. Once the agent walks into an unvisited grid, it receives a $+1$ state count reward.
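
A minimal sketch of this state-count bonus is given below; the cell indexing and boundary clipping are implementation assumptions made for illustration.

import numpy as np

class StateCountReward:
    # Divides the 110 x 110 area around the spawn point into 11 x 11 cells
    # and pays +1 the first time the agent enters a cell.
    def __init__(self, grid_cells=11, cell_size=10):
        self.cell_size = cell_size
        self.half = grid_cells // 2
        self.visited = np.zeros((grid_cells, grid_cells), dtype=bool)

    def __call__(self, dx, dz):
        # dx, dz: displacement (in blocks) from the spawn location.
        i = int(np.clip(dx // self.cell_size + self.half, 0, self.visited.shape[0] - 1))
        j = int(np.clip(dz // self.cell_size + self.half, 0, self.visited.shape[1] - 1))
        if self.visited[i, j]:
            return 0.0
        self.visited[i, j] = True
        return 1.0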

Goal navigation. The low-level policy for Finding-skills is encouraged to reach the goal position. The goal location is randomly sampled in 4 directions at a distance of 10 from the agent. To get closer to the goal, we compute the distance change between the goal and the agent: $r_d = -(d_t - d_{t-1})$, where $d_t$ is the distance in the plane coordinates at time step $t$. Additionally, to encourage the agent to look in its walking direction, we add rewards to regularize the agent's yaw and pitch angles: $r_{yaw} = yaw \cdot g$, $r_{pitch} = \cos(pitch)$, where $g$ is the goal direction. The total reward is:

r = r_{yaw} + r_{pitch} + 10 r_d.  (1)
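
The reward of Eq. (1) can be computed as in the sketch below; converting the yaw angle into a facing vector and the exact angle conventions are assumptions for illustration.

import math

def goal_navigation_reward(pos, prev_pos, goal, yaw, pitch):
    # pos, prev_pos, goal: (x, z) plane coordinates; yaw, pitch in radians.
    d_t = math.hypot(goal[0] - pos[0], goal[1] - pos[1])
    d_prev = math.hypot(goal[0] - prev_pos[0], goal[1] - prev_pos[1])
    r_d = -(d_t - d_prev)                          # progress toward the goal
    facing = (math.cos(yaw), math.sin(yaw))        # agent's facing direction
    g = ((goal[0] - pos[0]) / max(d_t, 1e-6),      # unit vector toward the goal
         (goal[1] - pos[1]) / max(d_t, 1e-6))
    r_yaw = facing[0] * g[0] + facing[1] * g[1]
    r_pitch = math.cos(pitch)
    return r_yaw + r_pitch + 10.0 * r_d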

CLIP reward. This reward encourages the agent to produce behaviors that match the task prompt. We sample 31 task prompts among all the MineDojo programmatic tasks as negative samples. The pre-trained MineCLIP [9] model computes the similarities between the features of the past 16 frames and the prompts. We compute the probability that the frames are most similar to the task prompt: $p = \left[\mathrm{softmax}\left(S(f_v, f_l), \{S(f_v, f_{l^-})\}_{l^-}\right)\right]_0$, where $f_v, f_l$ are the video and prompt features, $l$ is the task prompt, and $l^-$ are the negative prompts. The CLIP reward is:

r_{CLIP} = \max\{p - 1/32, 0\}.  (2)
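
In code, Eq. (2) amounts to a softmax over prompt similarities. The sketch below assumes cosine similarity for $S(\cdot,\cdot)$ and precomputed MineCLIP features; both are assumptions about the interface rather than the exact implementation.

import torch
import torch.nn.functional as F

def clip_reward(video_feat, prompt_feat, negative_feats):
    # video_feat: (d,) feature of the past 16 frames; prompt_feat: (d,);
    # negative_feats: (31, d) features of the sampled negative prompts.
    feats = torch.cat([prompt_feat.unsqueeze(0), negative_feats], dim=0)  # (32, d)
    sims = F.cosine_similarity(video_feat.unsqueeze(0), feats, dim=-1)    # (32,)
    p = F.softmax(sims, dim=0)[0]   # probability mass on the task prompt
    return torch.clamp(p - 1.0 / 32, min=0.0).item()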

Distance. The distance reward provides dense reward signals to reach the target items. For combat tasks, the agent gets a distance reward when the distance is closer than the minimal distance in history:

r_{distance} = \max\{ \min_{t'<t} d_{t'} - d_t, 0 \}.  (3)

For mining [Uncaptioned image][Uncaptioned image] tasks, since the agent should stay close to the block for many time steps, we modify the distance reward to encourage keeping a small distance:

r_{distance} = \begin{cases} d_{t-1} - d_t, & 1.5 \leq d_t < +\infty \\ 2, & d_t < 1.5 \\ -2, & d_t = +\infty, \end{cases}  (4)

where $d_t$ is the distance between the agent and the target item at time step $t$, detected by lidar rays in the simulator.
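
The two distance rewards can be computed as below; maintaining the running minimum distance and signalling an undetected target with an infinite distance are assumptions about the caller's interface.

def combat_distance_reward(d_t, history_min_d):
    # Eq. (3): reward only when the agent gets closer than ever before.
    return max(history_min_d - d_t, 0.0)

def mining_distance_reward(d_t, d_prev):
    # Eq. (4): keep a small distance to the target block.
    if d_t == float("inf"):   # target not detected by the lidar rays
        return -2.0
    if d_t < 1.5:
        return 2.0
    return d_prev - d_t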

Attack. For combat tasks, we reward the agent for attacking the target mobs. We use the tool's durability information to detect valid attacks and use lidar rays to detect the target mob. The attack reward is:

r_{attack} = \begin{cases} 90, & \text{if valid attack and the target at center} \\ 1, & \text{if valid attack but the target not at center} \\ 0, & \text{otherwise.} \end{cases}  (5)
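
Eq. (5) reduces to a simple lookup once the simulator signals a valid attack and whether the target mob is at the crosshair; extracting those two flags from the durability and lidar information is assumed here.

def attack_reward(valid_attack: bool, target_at_center: bool) -> float:
    # Eq. (5): large bonus for a hit while aiming at the target.
    if not valid_attack:
        return 0.0
    return 90.0 if target_at_center else 1.0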

Depth. For mining [Uncaptioned image][Uncaptioned image] tasks, the agent should dig down first, then go back to the ground. We use the y-axis to calculate the change of the agent's depth, and use the depth reward to encourage such behaviors. To train the dig-down policy, the depth reward is:

r_{down} = \max\{ \min_{t'<t} y_{t'} - y_t, 0 \}.  (6)

To train the go-back policy, the depth reward is:

r_{up} = \max\{ y_t - \max_{t'<t} y_{t'}, 0 \}.  (7)
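
Both depth rewards only pay out when the agent reaches a new extreme of its y coordinate. A sketch, with the running extremes tracked by the caller, is:

def dig_down_reward(y_t, history_min_y):
    # Eq. (6): reward for reaching a new lowest y (digging deeper).
    return max(history_min_y - y_t, 0.0)

def go_back_reward(y_t, history_max_y):
    # Eq. (7): reward for reaching a new highest y (returning upward).
    return max(y_t - history_max_y, 0.0)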

For each Manipulation-skill, we use a linear combination of intrinsic reward and extrinsic success reward to train the policy.

Table 5: Training configurations for all the basic skills. Max Steps is the maximal episode length. Training Steps is the number of environment steps used to train each skill. Init. is the maximal distance at which mobs are spawned at environment reset. The high-level and low-level policies of the Finding-skill are listed together in one row (high / low).
Skill | Max Steps | Method | Intrinsic Reward | Training Steps | Biome | Init.
Find (high / low) | 40 / 50 | PPO / DQN | state count / goal navigation | 1M / 0.5M | plains | -
Place [Uncaptioned image][Uncaptioned image] | 200 | PPO | CLIP reward | 0.3M | - | -
Harvest [Uncaptioned image] | 200 | PPO | CLIP reward | 1M | plains | 10
Harvest [Uncaptioned image] | 200 | PPO | CLIP reward | 1M | plains | 10
Combat [Uncaptioned image] | 400 | PPO | CLIP, distance, attack | 1M | plains | 2
Combat [Uncaptioned image] | 400 | PPO | CLIP, distance, attack | 1M | plains | 2
Harvest [Uncaptioned image] | 500 | PPO | distance | 0.5M | forest | -
Harvest [Uncaptioned image] | 1000 | PPO | distance | 0.3M | hills | -
Mine [Uncaptioned image][Uncaptioned image] | 50 | PPO | depth | 0.4M | forest | -
Craft | 1 | - | - | 0 | - | -
Table 6: Information for all the selected basic skill policies. Success Rate is the success rate of the selected policy on the smoothed training curve.
Skill | Parameters | Execute Steps | Success Rate
Find | 0.9M | 1000 | -
Place [Uncaptioned image][Uncaptioned image] | 2.0M | 200 | 0.98
Harvest [Uncaptioned image] | 2.0M | 200 | 0.50
Harvest [Uncaptioned image] | 2.0M | 200 | 0.27
Combat [Uncaptioned image] | 2.0M | 400 | 0.21
Combat [Uncaptioned image] | 2.0M | 400 | 0.30
Harvest [Uncaptioned image] | 2.0M | 500 | 0.56
Harvest [Uncaptioned image] | 2.0M | 200 | 0.47
Mine [Uncaptioned image][Uncaptioned image] | 4.0M | 1000 | -
Craft | 0 | 1 | 1.00

It takes one day on a single TITAN Xp GPU to train each skill for 1M environment steps. Table 6 shows the selected basic skill policies for downstream tasks. Since the Finding-skill and the Mine [Uncaptioned image][Uncaptioned image] skill have no success rate during training, we pick the models with the highest returns on the smoothed training curves. For the other skills, we pick the models with the highest success rates on the smoothed training curves.

Appendix E LLM Prompt Design

To construct the skill graph, we use ChatGPT to generate skill information showing relationships between skills. For both Manipulation-skills and Crafting-skills, we demonstrate several skills, explain the format, and then ask ChatGPT to generate all the skills in the same format.

Listing 1: Prompt for generating some Manipulation-skills and ChatGPT's response. We only modify the last querying sentence in the prompt to generate all the skills.
Prompt:
I am playing the game Minecraft. I define some basic skills, like attack something, collect something and place something nearby. I list the skills in a special format.
As an example:
furnace_nearby:
  consume:
    'furnace': 1
  require:
  equip: ['furnace']
  obtain:
    'furnace_nearby': 1
To understand this skill line by line: the skill is to get a furnace_nearby. 'consume' means things will be consumed or killed. In this skill, furnace*1 will be consumed. 'require' means things are needed but will not be consumed. In this skill, nothing else is required. We should equip furnace to the first slot. If you do not have to equip anything, write 'equip: []'. Finally, we will obtain furnace_nearby*1.
Another example:
cobblestone:
  consume:
    'cobblestone_nearby': 1
  require:
    'wooden_pickaxe': 1
  equip: ['wooden_pickaxe']
  obtain:
    'cobblestone': 1
To understand: to mine a cobblestone, we will consume a nearby cobblestone. A wooden_pickaxe is required and should be equipped, but will not be consumed.
Now you understand the rule of this format. Please help me generate the following skills:
crafting_table_nearby, wool, beef, diamond
ChatGPT:
Here are the generated skills in the specified format:
crafting_table_nearby:
  consume:
    'crafting_table': 1
  require:
  equip: ['crafting_table']
  obtain:
    'crafting_table_nearby': 1
wool:
  consume:
    'sheep_nearby': 1
  require:
    'shears': 1
  equip: ['shears']
  obtain:
    'wool': 1
beef:
  consume:
    'cow_nearby': 1
  require:
  equip: []
  obtain:
    'beef': 1
diamond:
  consume:
    'diamond_ore_nearby': 1
  require:
    'iron_pickaxe': 1
  equip: ['iron_pickaxe']
  obtain:
    'diamond': 1
Listing 2: Prompt for generating some Crafting-skills and ChatGPT's response. We only modify the last querying sentence in the prompt to generate all the skills.
Prompt:
I am playing the game Minecraft. I list all the crafting skills in this game in a special format.
As an example:
bed:
  consume:
    'planks': 3
    'wool': 3
  require:
    'crafting_table_nearby': 1
  equip: []
  obtain:
    'bed': 1
To understand this skill line by line: the skill is to craft a bed. We will consume planks*3 and wool*3. To craft bed, a crafting_table_nearby is necessary and must be in the 'require:' line.
For crafting skills the equipment is always []. Finally, we will obtain bed*1.
Another example:
cooked_beef:
  consume:
    'planks': 1
    'beef': 1
  require:
    'furnace_nearby': 1
  equip: []
  obtain:
    'cooked_beef': 1
Because we require a furnace_nearby. The fuel, which is always 'planks'*1, should be added to the consume items.
For the 'require:' line, you should know whether the skill needs crafting_table_nearby, furnace_nearby, or nothing. If require nothing, just don't write anything in this line.
Now you understand the rule of this format. Please help me generate the following skills:
stick, chest, furnace, item_frame
ChatGPT:
Sure, here are the crafting skills for the items you requested:
stick:
  consume:
    'planks': 2
  require:
  equip: []
  obtain:
    'stick': 4
chest:
  consume:
    'planks': 8
  require:
    'crafting_table_nearby': 1
  equip: []
  obtain:
    'chest': 1
furnace:
  consume:
    'cobblestone': 8
  require:
    'crafting_table_nearby': 1
  equip: []
  obtain:
    'furnace': 1
item_frame:
  consume:
    'stick': 8
    'leather': 1
  require:
    'crafting_table_nearby': 1
  equip: []
  obtain:
    'item_frame': 1

All the Finding-skills have the same format:

(item)_nearby:
  consume:
  require:
  equip: []
  obtain:
    (item)_nearby: 1

In our experiments, after generating all the skills, we check and correct them according to the Minecraft database. Among 55 skills, we find 6 mistakes in total.
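
As a sketch of how the generated specs can be turned into the skill graph, the snippet below parses the format above into directed edges from each consumed or required item to the skill that uses it. The YAML representation and the edge attributes are assumptions made for illustration, not the exact data layout of the code release.

import yaml  # assumes the generated skills are stored in a YAML-like file

def build_skill_graph(skill_text: str):
    # Returns edges (parent_item, skill, quantity_needed, quantity_obtained, consumed).
    skills = yaml.safe_load(skill_text)
    edges = []
    for name, spec in skills.items():
        obtained = (spec.get("obtain") or {}).get(name, 1)
        for parent, qty in (spec.get("consume") or {}).items():
            edges.append((parent, name, qty, obtained, True))
        for parent, qty in (spec.get("require") or {}).items():
            edges.append((parent, name, qty, obtained, False))
    return edges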

Appendix F Task Setup

Tables 7 and 8 list the settings for the 40 evaluation tasks. To make sure the agent is spawned in an unseen environment in each test episode, we randomly teleport the agent within a maximum distance of 500 at environment reset. For tasks involving interaction with mobs, we spawn cows and sheep within a maximum distance of 30, which is much larger than the spawning distance used when training the basic skills. For the Mine-Ores task set, we set the block breaking speed multiplier to 10. For the other three task sets, we use the default simulator.

Table 7: Settings for Cut-Trees and Mine-Stones tasks. Initial Tools are provided in the inventory at the beginning of each episode. Involved Skills is the minimum number of basic skills the agent must master to accomplish the task. Planning Steps is the number of basic skills to be executed sequentially in the initial plan.
Task Icon Target Name Initial Tools Biome Max Steps Involved Skills Planning Steps
[Uncaptioned image] stick plains 3000 4 4
[Uncaptioned image] crafting_table_nearby plains 3000 5 5
[Uncaptioned image] bowl forest 3000 6 9
[Uncaptioned image] chest forest 3000 6 12
[Uncaptioned image] trap_door forest 3000 6 12
[Uncaptioned image] sign forest 3000 7 13
[Uncaptioned image] wooden_shovel forest 3000 7 10
[Uncaptioned image] wooden_sword forest 3000 7 10
[Uncaptioned image] wooden_axe forest 3000 7 13
[Uncaptioned image] wooden_pickaxe forest 3000 7 13
[Uncaptioned image] furnace_nearby [Uncaptioned image]*10 hills 5000 9 28
[Uncaptioned image] stone_stairs [Uncaptioned image]*10 hills 5000 8 23
[Uncaptioned image] stone_slab [Uncaptioned image]*10 hills 3000 8 17
[Uncaptioned image] cobblestone_wall [Uncaptioned image]*10 hills 5000 8 23
[Uncaptioned image] lever [Uncaptioned image] forest_hills 5000 7 7
[Uncaptioned image] torch [Uncaptioned image]*10 hills 5000 11 30
[Uncaptioned image] stone_shovel [Uncaptioned image] forest_hills 10000 9 12
[Uncaptioned image] stone_sword [Uncaptioned image] forest_hills 10000 9 14
[Uncaptioned image] stone_axe [Uncaptioned image] forest_hills 10000 9 16
[Uncaptioned image] stone_pickaxe [Uncaptioned image] forest_hills 10000 9 16
Table 8: Settings for Mine-Ores and Interact-Mobs tasks. Initial Tools are provided in the inventory at the beginning of each episode. Involved Skills is the minimum number of basic skills the agent must master to accomplish the task. Planning Steps is the number of basic skills to be executed sequentially in the initial plan.
Task Icon Target Name Initial Tools Biome Max Steps Involved Skills Planning Steps
[Uncaptioned image] iron_ingot [Uncaptioned image]*5, [Uncaptioned image]*64 forest 8000 12 30
[Uncaptioned image] tripwire_hook [Uncaptioned image]*5, [Uncaptioned image]*64 forest 8000 14 35
[Uncaptioned image] heavy_weighted_pressure_plate [Uncaptioned image]*5, [Uncaptioned image]*64 forest 10000 13 61
[Uncaptioned image] shears [Uncaptioned image]*5, [Uncaptioned image]*64 forest 10000 13 61
[Uncaptioned image] bucket [Uncaptioned image]*5, [Uncaptioned image]*64 forest 12000 13 91
[Uncaptioned image] iron_trapdoor [Uncaptioned image]*5, [Uncaptioned image]*64 forest 12000 13 121
[Uncaptioned image] iron_shovel [Uncaptioned image]*5, [Uncaptioned image]*64 forest 8000 14 35
[Uncaptioned image] iron_sword [Uncaptioned image]*5, [Uncaptioned image]*64 forest 10000 14 65
[Uncaptioned image] iron_axe [Uncaptioned image]*5, [Uncaptioned image]*64 forest 12000 14 95
[Uncaptioned image] iron_pickaxe [Uncaptioned image]*5, [Uncaptioned image]*64 forest 12000 14 95
[Uncaptioned image] milk_bucket [Uncaptioned image], [Uncaptioned image]*3 plains 3000 4 4
[Uncaptioned image] wool [Uncaptioned image], [Uncaptioned image]*2 plains 3000 3 3
[Uncaptioned image] beef [Uncaptioned image] plains 3000 2 2
[Uncaptioned image] mutton [Uncaptioned image] plains 3000 2 2
[Uncaptioned image] bed [Uncaptioned image], [Uncaptioned image] plains 10000 7 11
[Uncaptioned image] painting [Uncaptioned image], [Uncaptioned image] plains 10000 8 9
[Uncaptioned image] carpet [Uncaptioned image] plains 3000 3 5
[Uncaptioned image] item_frame [Uncaptioned image], [Uncaptioned image] plains 10000 8 9
[Uncaptioned image] cooked_beef [Uncaptioned image], [Uncaptioned image] plains 10000 7 7
[Uncaptioned image] cooked_mutton [Uncaptioned image], [Uncaptioned image] plains 10000 7 7

Appendix G Experiment Results for All the Tasks

Table 9 shows the success rates of all the methods on all the tasks, grouped into 4 task sets.

Table 9: Success rates in all the tasks. Each task is tested for 30 episodes, set with the same random seeds across different methods.
Task MineAgent Plan4MC w/o Find-skill Interactive LLM Plan4MC Zero-shot Plan4MC 1/2-steps Plan4MC
[Uncaptioned image] 0.00 0.03 0.30 0.27 0.30 0.30
[Uncaptioned image] 0.03 0.07 0.17 0.27 0.20 0.30
[Uncaptioned image] 0.00 0.40 0.07 0.27 0.57 0.47
[Uncaptioned image] 0.00 0.23 0.00 0.07 0.10 0.23
[Uncaptioned image] 0.00 0.07 0.03 0.20 0.27 0.37
[Uncaptioned image] 0.00 0.07 0.00 0.10 0.30 0.43
[Uncaptioned image] 0.00 0.37 0.73 0.23 0.50 0.70
[Uncaptioned image] 0.00 0.33 0.63 0.30 0.60 0.47
[Uncaptioned image] 0.00 0.23 0.47 0.13 0.27 0.37
[Uncaptioned image] 0.00 0.07 0.20 0.00 0.27 0.53
[Uncaptioned image] 0.00 0.17 0.00 0.00 0.13 0.37
[Uncaptioned image] 0.00 0.30 0.20 0.00 0.33 0.47
[Uncaptioned image] 0.00 0.20 0.03 0.00 0.37 0.53
[Uncaptioned image] 0.21 0.13 0.13 0.00 0.33 0.57
[Uncaptioned image] 0.00 0.00 0.00 0.00 0.10 0.10
[Uncaptioned image] 0.05 0.10 0.00 0.00 0.17 0.37
[Uncaptioned image] 0.00 0.00 0.10 0.00 0.03 0.20
[Uncaptioned image] 0.00 0.07 0.13 0.00 0.07 0.10
[Uncaptioned image] 0.00 0.00 0.07 0.00 0.10 0.07
[Uncaptioned image] 0.00 0.00 0.00 0.00 0.00 0.17
[Uncaptioned image] 0.00 0.53 0.20 0.00 0.30 0.47
[Uncaptioned image] 0.00 0.27 0.00 0.00 0.27 0.33
[Uncaptioned image] 0.00 0.37 0.00 0.00 0.13 0.30
[Uncaptioned image] 0.00 0.30 0.03 0.00 0.20 0.43
[Uncaptioned image] 0.00 0.27 0.00 0.00 0.03 0.20
[Uncaptioned image] 0.00 0.10 0.00 0.00 0.03 0.13
[Uncaptioned image] 0.00 0.27 0.03 0.00 0.27 0.37
[Uncaptioned image] 0.00 0.13 0.00 0.00 0.07 0.20
[Uncaptioned image] 0.00 0.07 0.03 0.00 0.07 0.07
[Uncaptioned image] 0.00 0.13 0.00 0.00 0.07 0.17
[Uncaptioned image] 0.46 0.57 0.57 0.60 0.63 0.83
[Uncaptioned image] 0.50 0.40 0.76 0.30 0.60 0.53
[Uncaptioned image] 0.33 0.23 0.43 0.10 0.27 0.43
[Uncaptioned image] 0.35 0.17 0.30 0.07 0.13 0.33
[Uncaptioned image] 0.00 0.00 0.00 0.00 0.07 0.17
[Uncaptioned image] 0.00 0.03 0.00 0.10 0.23 0.13
[Uncaptioned image] 0.06 0.27 0.37 0.10 0.50 0.37
[Uncaptioned image] 0.00 0.00 0.00 0.03 0.10 0.07
[Uncaptioned image] 0.00 0.03 0.03 0.03 0.20 0.20
[Uncaptioned image] 0.00 0.00 0.00 0.00 0.03 0.13

Appendix H Training Manipulation-skills without Nearby Items

For all the Manipulation-skills that are trained in specified environments in the paper, we use a Go-Explore-like approach to re-train them in environments where target items are not initialized nearby. In a training episode, the pre-trained Finding-skill first explores the environment and finds the target item; the policy then collects data for RL training. In the following, we denote the previous method as Plan4MC and the new method as Plan4MC-go-explore.
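
A sketch of this Go-Explore-like training episode is given below. The helper names (e.g. target_visible), the two-phase structure, and the gym-style environment interface are illustrative assumptions rather than the exact training code.

def go_explore_episode(env, finding_skill, policy, target, buffer, max_find_steps=1000):
    # Phase 1: the frozen, pre-trained Finding-skill explores until the target
    # item (block or mob) is detected nearby; no training data is collected.
    obs = env.reset()
    for _ in range(max_find_steps):
        if target_visible(obs, target):   # hypothetical detector, e.g. lidar-based
            break
        obs, _, done, _ = env.step(finding_skill(obs))
        if done:
            return
    # Phase 2: collect data for training the Manipulation-skill from this state.
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = next_obs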

Table 10 shows the maximal success rates of these skills over 100 training epochs. None of the skills trained with Go-Explore fail, and their success rates are comparable to those of the previously trained skills. This is because the Finding-skill provides good environment initialization for the training policies. On harvesting milk and wool, Plan4MC-go-explore even outperforms Plan4MC, because the agent starts closer to the target mobs.

Table 11 shows the test performance on the four task sets. Plan4MC-go-explore even outperforms Plan4MC on three of the four task sets, demonstrating that the skills trained with Go-Explore generalize well to unseen environments.

Table 10: Training success rates of the Manipulation-skills under the two environment settings. Results are the maximal success rates averaged on 100 training epochs.
Skill [Uncaptioned image] [Uncaptioned image] [Uncaptioned image] [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
Plan4MC 0.50 0.27 0.21 0.30 0.56 0.47
Plan4MC-go-explore 0.82 0.34 0.22 0.19 0.25 0.71
Table 11: Average success rates on the four task sets of Plan4MC, with the Manipulation-skills trained in the two settings.
Task Set Cut-Trees Mine-Stones Mine-Ores Interact-Mobs
Plan4MC 0.417 0.293 0.267 0.320
Plan4MC-go-explore 0.543 0.349 0.197 0.383

We further study the generalization capabilities of the learned skills. Table 12 shows the test success rates of these skills in the 40 tasks and the generalization gap. We observe that Plan4MC-go-explore has a small generalization gap on the first four mob-related skills. This is because Plan4MC-go-explore uses the same policy for approaching the target mob in training and test, yielding more similar initial state distributions for the Manipulation-skills. We also find that in Harvest Log, Plan4MC-go-explore often finds trees that have already been cut. Thus, harvesting logs is harder during training, and the test success rate exceeds the training success rate.

Table 12: The test success rates of the skills in solving the 40 tasks, and the generalization gap (test success rate - training success rate).
Skill [Uncaptioned image] [Uncaptioned image] [Uncaptioned image] [Uncaptioned image] [Uncaptioned image] [Uncaptioned image]
Plan4MC 0.71(+0.21) 0.26(-0.01) 0.27(+0.06) 0.16(-0.14) 0.33(-0.23) 0.26(-0.21)
Plan4MC-go-explore 0.86(-0.04) 0.47(+0.13) 0.16(-0.06) 0.16(-0.03) 0.45(+0.20) 0.47(-0.24)

Appendix I Discussion on the Generalization of Plan4MC

Plan4MC contributes a pipeline combining LLM-assisted planning and RL for skill acquisition. It is broadly applicable to open-world domains [3, 19] in which the agent can combine basic skills to solve diverse long-horizon tasks.

Our key insight is that a skill can be divided into fine-grained basic skills, which enables sample-efficient skill acquisition with demonstration-free RL. The Finding-skill in Plan4MC can be replaced with any learning-to-explore RL policy, or with a navigation policy in robotics. As an example, for indoor robotic tasks, a skill is defined as an action (pick/drop/open) plus an object. We can break such a skill into navigation, arm positioning, and object manipulation, which can be acquired more easily with demonstration-free RL since the exploration difficulty is substantially reduced.

Our experiments on learning skills in Minecraft demonstrate that object-based intrinsic rewards improve sample efficiency. Figure 7 shows that both the MineCLIP reward and the distance reward have a positive impact on skill reinforcement learning. This motivates using vision-language models, object detectors, or distance estimation for reward design in skill learning.

For planning, our approach is a novel extension of LLM-based planners that incorporates LLM knowledge into a graph-based planner, improving planning accuracy. It can be extended to settings where the agent's state can be abstracted as text or entities.

Figure 7: Using different intrinsic rewards for training Harvest Milk with PPO. Results are averaged over 3 random seeds.