MineDojo: Building Open-Ended
Embodied Agents with Internet-Scale Knowledge

Linxi Fan1, Guanzhi Wang2∗, Yunfan Jiang3∗, Ajay Mandlekar1, Yuncong Yang4,

Haoyi Zhu5, Andrew Tang4, De-An Huang1, Yuke Zhu1 6†, Anima Anandkumar1 2†

1NVIDIA, 2Caltech, 3Stanford, 4Columbia, 5SJTU, 6UT Austin

Abstract

Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo’s data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite, knowledge bases, algorithm implementation, and pretrained models (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.

1 Introduction

Figure 1: MineDojo is a novel framework for developing open-ended, generally capable agents that can learn and adapt continually to new goals. MineDojo features a benchmarking suite with thousands of diverse open-ended tasks specified in natural language prompts, and also provides an internet-scale, multimodal knowledge base of YouTube videos, Wiki pages, and Reddit posts. The database captures the collective experience and wisdom of millions of Minecraft gamers for an AI agent to learn from. Best viewed zoomed in.

Developing autonomous embodied agents that can attain human-level performance across a wide spectrum of tasks has been a long-standing goal for AI research. There has been impressive progress towards this goal, most notably in games [80, 85, 126] and robotics [68, 99, 146, 134, 107]. These embodied agents are typically trained tabula rasa in isolated worlds with limited complexity and diversity. Although highly performant, they are specialist models that do not generalize beyond a narrow set of tasks. In contrast, humans inhabit an infinitely rich reality, continuously learn from and adapt to a wide variety of open-ended tasks, and are able to leverage a large amount of prior knowledge from their own experiences as well as those of others.

We argue that three main pillars are necessary for generalist embodied agents to emerge. First, the environment in which the agent acts needs to enable an unlimited variety of open-ended goals [116, 71, 120, 117]. Natural evolution is able to nurture an ever-expanding tree of diverse life forms thanks to the infinitely varied ecological settings that the Earth supports [117, 129]. This process has not stagnated for billions of years. In contrast, today’s agent training algorithms cease to make new progress after convergence in narrow environments [80, 146]. Second, a large-scale database of prior knowledge is necessary to facilitate learning in open-ended settings. Just as humans frequently learn from the internet, agents should also be able to harvest practical knowledge encoded in large amounts of video demos [42, 77], multimedia tutorials [79], and forum discussions [127, 65, 54]. In a complex world, it would be extremely inefficient for an agent to learn everything from scratch through trial and error. Third, the agent’s architecture needs to be flexible enough to pursue any task in open-ended environments, and scalable enough to convert large-scale knowledge sources into actionable insights [19, 96]. This motivates the design of an agent that has a unified observation/action space, conditions on natural language task prompts, and adopts the Transformer pre-training paradigm [27, 91, 15] to internalize knowledge effectively.

In light of these three pillars, we introduce MineDojo, a new framework to help the community develop open-ended, generally-capable agents. It is built on the popular Minecraft game, where a player explores a procedurally generated 3D world with diverse types of terrains to roam, materials to mine, tools to craft, structures to build, and wonders to discover. Unlike most other games [80, 85, 126], Minecraft defines no specific reward to maximize and no fixed storyline to follow, making it well suited for developing open-ended environments for embodied AI research. We make the following three major contributions:

1. Simulation platform with thousands of diverse open-ended tasks.

MineDojo provides convenient APIs on top of Minecraft that standardize task specification, world settings, and agent’s observation/action spaces. We introduce a benchmark suite that consists of thousands of natural language-prompted tasks, making it two orders of magnitude larger than prior Minecraft benchmarks like the MineRL Challenge [48, 62]. The suite includes long-horizon, open-ended tasks that cannot be easily evaluated through automated procedures, such as “build an epic modern house with two floors and a swimming pool”. Inspired by the Inception score [98] and FID score [55] that are commonly used to assess AI-generated image quality, we introduce a novel agent evaluation protocol using a large video-language model pre-trained on Minecraft YouTube videos. This complements human scoring [104] that is precise but more expensive. Our learned evaluation metric has good agreement with human judgment in a subset of the full task suite considered in the experiments.

2. Internet-scale multimodal Minecraft knowledge base.

Minecraft has more than 100 million active players [131], who have collectively generated an enormous wealth of data. They record tutorial videos, stream live play sessions, compile recipes, and discuss tips and tricks on forums. MineDojo features a massive collection of 730K+ YouTube videos with time-aligned transcripts, 6K+ free-form Wiki pages, and 340K+ Reddit posts with multimedia content (Fig. 3). We hope that this enormous knowledge base can help the agent acquire diverse skills, develop complex strategies, discover interesting objectives, and learn actionable representations automatically.

3. Novel algorithm for embodied agents with large-scale pre-training.

We develop a new learning algorithm for embodied agents that makes use of the internet-scale domain knowledge we have collected from the web. Using the massive volume of YouTube videos from MineDojo, we train a video-text contrastive model in the spirit of CLIP [92], which associates natural language subtitles with their time-aligned video segments. We demonstrate that this learned correlation score can be used effectively as an open-vocabulary, massively multi-task reward function for RL training. Our agent solves the majority of 12 tasks in our experiment using the learned reward model (Fig. 2). It achieves performance competitive with agents trained with meticulously engineered dense-shaping rewards, and in some cases outperforms them, with up to 73% improvement in success rates. For open-ended tasks that do not have a simple success criterion, our agents also perform well without any special modifications.

In summary, this paper proposes an open-ended task suite, internet-scale domain knowledge, and agent learning with recent advances on large pre-trained models [13]. We have open-sourced MineDojo’s simulator, knowledge bases, algorithm implementations, pretrained model checkpoints, and task curation tools at https://minedojo.org/. We hope that MineDojo will serve as an effective starter framework for the community to develop new algorithms and advance towards generally capable embodied agents.

Figure 2: Visualization of our agent’s learned behaviors on four selected tasks. Leftmost texts are the task prompts used in training. Best viewed on a color display.

2 MineDojo Simulator & Benchmark Suite

MineDojo offers a set of simulator APIs to help researchers develop generally capable, open-ended agents in Minecraft. It builds upon the open-source MineRL codebase [48] and makes the following upgrades: 1) We provide unified observation and action spaces across all tasks, facilitating the development of multi-task and continually learning agents that can constantly adapt to new scenarios and novel tasks. This deviates from the MineRL Challenge design that tailors observation and action spaces to individual tasks; 2) Our simulation unlocks all three types of worlds in Minecraft, including the Overworld, the Nether, and the End, which substantially expands the possible task space, while MineRL only supports the Overworld natively; and 3) We provide convenient APIs to configure initial conditions and world settings to standardize our tasks.

With this MineDojo simulator, we define thousands of benchmarking tasks, which are divided into two categories: 1) Programmatic tasks that can be automatically assessed based on the ground-truth simulator states; and 2) Creative tasks that do not have well-defined or easily-automated success criteria, which motivates our novel evaluation protocol using a learned model (Sec. 4). To scale up the number of Creative tasks, we mine ideas from YouTube tutorials and use OpenAI’s GPT-3 [15] service to generate substantially more task definitions. Compared to Creative tasks, Programmatic tasks are simpler to get started with, but tend to have restricted scope, limited language variations, and less open-endedness in general.

2.1 Task Suite I: Programmatic Tasks

We formalize each programmatic task as a 5-tuple: $T=(G,\mathcal{G},\mathcal{I},f_{\mathcal{S}},f_{\mathcal{R}})$. $G$ is an English description of the task goal, such as “find material and craft a gold pickaxe”. $\mathcal{G}$ is natural language guidance that provides helpful hints, recipes, or advice to the agent. We leverage OpenAI’s GPT-3-davinci API to automatically generate detailed guidance for a subset of the tasks. For the example goal “bring a pig into Nether”, GPT-3 returns: 1) Find a pig in the overworld; 2) Right-click on the pig with a lead; 3) Right-click on the Nether Portal with the lead and pig selected; 4) The pig will be pulled through the portal! $\mathcal{I}$ specifies the initial conditions of the agent and the world, such as the initial inventory, spawn terrain, and weather. $f_{\mathcal{S}}: s_t \rightarrow \{0,1\}$ is the success criterion, a deterministic function that maps the current world state $s_t$ to a Boolean success label. $f_{\mathcal{R}}: s_t \rightarrow \mathbb{R}$ is an optional dense reward function. We only provide $f_{\mathcal{R}}$ for a small subset of the tasks in MineDojo due to the high costs of meticulously crafting dense rewards. For our current agent implementation (Sec. 4.1), we do not use detailed guidance. Inspired by concurrent works SayCan [3] and Socratic Models [143], one potential idea is to feed each step in the guidance to our learned reward model sequentially so that it becomes a stagewise reward function for a complex multi-stage task.
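As a rough illustration of the 5-tuple above, the sketch below models a programmatic task as a Python dataclass. The class name `ProgrammaticTask`, the dictionary-based world state, and the `gold_pickaxe` inventory key are hypothetical choices made for this example and are not taken from MineDojo's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Assumption for illustration: a world state is a plain dict, which is not
# necessarily how the simulator represents its ground-truth state.
WorldState = Dict[str, Any]


@dataclass
class ProgrammaticTask:
    goal: str                                  # G: English description of the goal
    guidance: str                              # \mathcal{G}: optional hints or recipes
    initial_conditions: Dict[str, Any]         # \mathcal{I}: inventory, terrain, weather
    success_fn: Callable[[WorldState], bool]   # f_S: s_t -> {0, 1}
    reward_fn: Optional[Callable[[WorldState], float]] = None  # f_R: s_t -> R (optional)

    def evaluate(self, state: WorldState) -> Dict[str, Any]:
        """Return the success label and, if one is defined, the dense reward."""
        reward = self.reward_fn(state) if self.reward_fn is not None else None
        return {"success": self.success_fn(state), "reward": reward}


def has_gold_pickaxe(state: WorldState) -> bool:
    # Hypothetical success criterion: at least one gold pickaxe in the inventory.
    return state.get("inventory", {}).get("gold_pickaxe", 0) >= 1


task = ProgrammaticTask(
    goal="find material and craft a gold pickaxe",
    guidance="",
    initial_conditions={"inventory": {}, "weather": "clear"},
    success_fn=has_gold_pickaxe,
)

print(task.evaluate({"inventory": {"gold_pickaxe": 1}}))
```

Leaving `reward_fn` unset mirrors the fact that dense rewards are only provided for a small subset of tasks.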

MineDojo provides 4 categories of programmatic tasks with 1,581 template-generated natural language goals to evaluate the agent’s different capabilities systematically and comprehensively:

  1. Survival: surviving for a designated number of days.
  2. Harvest: finding, obtaining, cultivating, or manufacturing hundreds of materials and objects.
  3. Tech Tree: crafting and using a hierarchy of tools.
  4. Combat: fighting various monsters and creatures that require fast reflexes and martial skills.


Each task template has a number of variations based on the terrain, initial inventory, quantity, etc., which form a flexible spectrum of difficulty. In comparison, the NeurIPS MineRL Diamond challenge [48] is a subset of our programmatic task suite, defined by the task goal “obtain 1 diamond” in MineDojo.
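To make the idea of template-generated goals concrete, here is a minimal sketch that expands one goal template over a few variation axes. The template string, the axis values, and the helper name `expand_template` are all made up for illustration; the templates actually used in MineDojo may look different.

```python
from itertools import product
from typing import Dict, List


def expand_template(template: str, axes: Dict[str, List[str]]) -> List[str]:
    """Fill a goal template with every combination of the given variation axes."""
    keys = list(axes)
    goals = []
    for values in product(*(axes[k] for k in keys)):
        goals.append(template.format(**dict(zip(keys, values))))
    return goals


# Hypothetical harvest template varying the item, terrain, and initial inventory.
goals = expand_template(
    "obtain {item} in the {terrain} starting with {inventory}",
    {
        "item": ["diamond", "wool"],
        "terrain": ["plains", "forest"],
        "inventory": ["an empty inventory", "an iron pickaxe"],
    },
)

for goal in goals:
    print(goal)
```

Each added axis multiplies the number of goals, which is one way a modest set of templates can yield a large and graded task suite.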

2.2 Task Suite II: Creative Tasks

We define each creative task as a 3-tuple, $T=(G,\mathcal{G},\mathcal{I})$, which differs from programmatic tasks due to the lack of straightforward success criteria. Inspired by model-based metrics like the Inception score [98] and FID score [55] for image generation, we design a novel task evaluation metric based on a pre-trained contrastive video-language model (Sec. 4.1). In the experiments, we find that the learned metric exhibits a high level of agreement with human evaluations (see Table 2).

We brainstorm and author 216 Creative tasks, such as “build a haunted house with zombie inside” and “race by riding a pig”. Nonetheless, such a manual approach is not scalable. Therefore, we develop two systematic approaches to extend the total number of task definitions to 1,560. This makes our Creative task suite 3 orders of magnitude larger than the Minecraft BASALT challenge [104], which has 4 Creative tasks.

Approach 1. Task Mining from YouTube Tutorial Videos.

We identify our YouTube dataset as a rich source of tasks, as many human players demonstrate and narrate creative missions in the tutorial playlists. To collect high-quality tasks and accompanying videos, we design a 3-stage pipeline that makes it easy to find and annotate interesting tasks (see Sec. C.2 for details). Through this pipeline, we extract 1,042 task ideas from the common wisdom of a huge number of veteran Minecraft gamers, such as “make an automated mining machine” and “grow cactus up to the sky”.

Approach 2. Task Creation by GPT-3.

We leverage GPT-3’s few-shot capability to generate new task ideas by seeding it with the tasks we manually author or mine from YouTube. The prompt template is: Here are some example creative tasks in Minecraft: {a few examples}. Let’s brainstorm more detailed while reasonable creative tasks in Minecraft.

GPT-3 contributes 302 creative tasks after de-duplication, and demonstrates a surprisingly proficient understanding of Minecraft terminology.
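The prompt construction and de-duplication steps described above could be sketched as follows. The wording of the prompt follows the template quoted in the text, but the bullet formatting of the examples, the normalization rule used for de-duplication, and both function names are assumptions made for this example. No language model is called here.

```python
from typing import List


def build_creative_task_prompt(examples: List[str]) -> str:
    """Assemble a few-shot prompt from seed tasks, following the quoted template."""
    example_block = "\n".join(f"- {task}" for task in examples)
    return (
        "Here are some example creative tasks in Minecraft:\n"
        f"{example_block}\n"
        "Let's brainstorm more detailed while reasonable creative tasks in Minecraft."
    )


def deduplicate(tasks: List[str]) -> List[str]:
    """Drop repeated task ideas, ignoring case and extra whitespace (an assumed rule)."""
    seen = set()
    unique = []
    for task in tasks:
        key = " ".join(task.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(task)
    return unique


prompt = build_creative_task_prompt(
    ["build a haunted house with zombie inside", "race by riding a pig"]
)
print(prompt)
```

Exact string matching after normalization is a deliberately simple choice; near-duplicate task ideas with different wording would need a fuzzier comparison.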

2.3 Collection of Starter Tasks

We curate a set of 64 core tasks for future researchers to get started more easily. If their agent works well on these tasks, they can more confidently scale to the full benchmark.

  • 32 programmatic tasks: 16 “standard” and 16 “difficult”, spanning all 4 categories (survival, harvesting, combat, and tech tree). We rely on our Minecraft knowledge to decide the difficulty level. “Standard” tasks require fewer steps and lower resource dependencies to complete.
  • 32 creative tasks: 16 “standard” and 16 “difficult”. Similarly, tasks labeled with “standard” are typically short-horizon tasks.

We recommend that researchers run 100 evaluation episodes for each task and report the percentage success rate. The programmatic tasks have ground-truth success, while the creative tasks need our novel evaluation protocol (Sec. 5).
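A minimal sketch of the recommended reporting is given below: the percentage success rate per seed, then the mean and spread across seeds. The function names and the toy outcomes are invented for this example, and the use of the population standard deviation is an assumption, since this section does not say which estimator is behind the reported spreads.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of successful episodes for a single seed."""
    if not outcomes:
        raise ValueError("need at least one evaluation episode")
    return 100.0 * sum(outcomes) / len(outcomes)


def aggregate_over_seeds(per_seed_outcomes: List[Sequence[bool]]) -> Tuple[float, float]:
    """Mean and population standard deviation of the per-seed success rates."""
    rates = [success_rate(outcomes) for outcomes in per_seed_outcomes]
    return mean(rates), pstdev(rates)


# Toy data: 3 seeds with 100 evaluation episodes each (values are made up).
seeds = [
    [True] * 60 + [False] * 40,
    [True] * 70 + [False] * 30,
    [True] * 80 + [False] * 20,
]
avg, std = aggregate_over_seeds(seeds)
print(f"{avg:.1f} ± {std:.1f}")
```

For programmatic tasks each Boolean outcome would come from the ground-truth success criterion, while for creative tasks it would come from the learned evaluation protocol.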

Figure 3: MineDojo’s internet-scale, multimodal knowledge base. Left, YouTube videos: Minecraft gamers showcase the impressive feats they are able to achieve. Clockwise order: an archery range, Hogwarts castle, Taj Mahal, a Nether homebase. Middle, Wiki: Wiki pages contain multimodal knowledge in structured layouts, such as comprehensive catalogs of creatures and recipes for crafting. More examples in Fig. A.4 and A.5. Right, Reddit: We create a word cloud from Reddit posts and comment threads. Gamers ask questions, share achievements, and discuss strategies extensively. Sample posts in Fig. A.7. Best viewed zoomed in.

3 Internet-scale Knowledge Base

Two commonly used approaches [112, 126, 85, 36] to train embodied agents are training agents from scratch using RL with well-tuned reward functions for each task, and using a large amount of human demonstrations to bootstrap agent learning. However, crafting well-tuned reward functions is challenging or infeasible for our task suite (Sec. 2.2), and employing expert gamers to provide large amounts of demonstration data would also be costly and infeasible [126].

Instead, we turn to the open web as an ever-growing, virtually unlimited source of learning material for embodied agents. The internet provides a vast amount of domain knowledge about Minecraft, which we harvest by extensive web scraping and filtering. We collect 33 years’ worth of YouTube videos, 6K+ Wiki pages, and millions of Reddit comment threads. Instead of hiring a handful of human demonstrators, we capture the collective wisdom of millions of Minecraft gamers around the world. Furthermore, language is a key and pervasive component of our database that takes the form of YouTube transcripts, textual descriptions in Wiki, and Reddit discussions. Language facilitates open-vocabulary understanding, provides grounding for image and video modalities, and unlocks the power of large language models [27, 109, 15] for embodied agents. To ensure socially responsible model development, we take special measures to filter out low-quality and toxic content [13, 51] from our databases, detailed in the Appendix (Sec. D).

YouTube Videos and Transcripts.

Minecraft is among the most streamed games on YouTube [41]. Human players have demonstrated a stunning range of creative activities and sophisticated missions that take hours to complete (examples in Fig. 3). We collect 730K+ narrated Minecraft videos, which add up to 33 years of duration and 2.2B words in English transcripts. In comparison, HowTo100M [77] is a large-scale human instructional video dataset that includes 15 years of experience in total – about half of our volume. The time-aligned transcripts enable the agent to ground free-form natural language in video pixels and learn the semantics of diverse activities without laborious human labeling. We operationalize this insight in our pre-trained video-language model (Sec. 4.1).

Minecraft Wiki.

The Wiki pages cover almost every aspect of the game mechanics, and supply a rich source of unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials. We use Selenium [103] to scrape 6,735 pages that interleave text, images, tables, and diagrams. The pages are highly unstructured and do not share any common schema, as the Wiki is meant for human consumption rather than AI training. To preserve the layout information, we additionally save the screenshots of entire pages and extract 2.2M bounding boxes of the visual elements (visualization in Fig. A.4 and A.5). We do not use Wiki data in our current experiments. Since the Wiki contains detailed recipes for all crafted objects, these recipes could be provided as input or training data for hierarchical planning methods and policy sketches [8]. Another promising future direction is to apply document understanding models such as LayoutLM [138, 137] and DocFormer [9] to learn actionable knowledge from these unstructured Wiki data.


Reddit.

We scrape 340K+ posts along with 6.6M comments from the “r/Minecraft” subreddit. These posts ask questions on how to solve certain tasks, showcase cool architectures and achievements in image/video snippets, and discuss general tips and tricks for players of all expertise levels. We do not use Reddit data for training in Sec. 5, but a potential idea is to finetune large language models [27, 91] on our Reddit corpus to generate instructions and execution plans that are better grounded in the Minecraft domain. Concurrent works [3, 56, 143] have explored similar ideas and showed excellent results on robot learning, which is encouraging for more future research in MineDojo.

4 Agent Learning with Large-scale Pre-training

Figure 4: Algorithm design. MineCLIP is a contrastive video-language model pre-trained on MineDojo’s massive YouTube database. It computes the correlation between an open-vocabulary language goal string and a 16-frame video snippet. The correlation score can be used as a learned dense reward function to train a strong multi-task RL agent.

One of the grand challenges of embodied AI is to build a single agent that can complete a wide range of open-world tasks. The MineDojo framework aims to facilitate new techniques towards this goal by providing an open-ended task suite (Sec. 2) and large-scale internet knowledge base (Sec. 3). Here we take an initial step towards this goal by developing a proof of concept that demonstrates how a single language-prompted agent can be trained in MineDojo to complete several complex Minecraft tasks. To this end, we propose a novel agent learning algorithm that takes advantage of the massive YouTube data offered by MineDojo. We note that this is only one of the numerous possible ways to use MineDojo’s internet database; the Wiki and Reddit corpora also hold great potential to drive new algorithm discoveries for the community in future works.

In this paper, we consider a multi-task reinforcement learning (RL) setting, where an agent is tasked with completing a collection of MineDojo tasks specified by language instructions (Sec. 2). Solving these tasks often requires the agent to interact with the Minecraft world in a prolonged fashion. Agents developed in popular RL benchmarks [119, 146] often rely on meticulously crafted dense and task-specific reward functions to guide random explorations. However, these rewards are hard or even infeasible to define for our diverse and open-ended tasks in MineDojo. To address this challenge, our key insight is to learn a dense, language-conditioned reward function from in-the-wild YouTube videos and their transcripts. Therefore, we introduce MineCLIP, a contrastive video-language model that learns to correlate video snippets and natural language descriptions (Fig. 4). MineCLIP is multi-task by design, as it is trained on open-vocabulary and diverse English transcripts.
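For readers unfamiliar with CLIP-style training, the sketch below shows a symmetric contrastive (InfoNCE) objective over a small batch of paired video and text embeddings. It is a generic illustration of the technique rather than MineCLIP's actual training code: the encoders are omitted, the embeddings are toy vectors, and the temperature of 0.07 is an arbitrary choice for the example.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits: Vector, target: int) -> float:
    """Numerically stable softmax cross-entropy for one row of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(video_emb: List[Vector], text_emb: List[Vector],
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: the i-th video and the i-th text are the positive pair."""
    n = len(video_emb)
    sim = [[cosine(v, t) / temperature for t in text_emb] for v in video_emb]
    # Video-to-text direction: each row of the similarity matrix.
    video_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each column of the similarity matrix.
    text_to_video = sum(
        cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each video snippet is most similar to its own transcript and large when the pairing is scrambled, which is the property that lets the similarity score act as a language-conditioned signal.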

During RL training, MineCLIP provides a high-quality reward signal without any domain adaptation techniques, despite the domain gap between noisy YouTube videos and clean simulator-rendered frames. MineCLIP eliminates the need to manually engineer reward functions for each and every MineDojo task. For Creative tasks that lack a simple success criterion (Sec. 2.2), MineCLIP also serves the dual purpose of an automatic evaluation metric that agrees well with human judgement on a subset of tasks we investigate (Sec. 4.2, Table 2). Because the learned reward model incurs a non-trivial computational overhead, we introduce several techniques to significantly improve RL training efficiency, making MineCLIP a practical module for open-ended agent learning in Minecraft (Sec. 4.2).
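One way to picture the learned reward during RL training is a sliding window over the agent's most recent frames, scored against the embedding of the language goal. The sketch below makes several assumptions that are not stated in this section: frames are represented by precomputed feature vectors, the window is summarized by mean pooling, the reward is zero until 16 frames are available, and the class name `SnippetReward` is hypothetical.

```python
import math
from collections import deque
from typing import Deque, List

Vector = List[float]
WINDOW = 16  # snippet length in frames, matching the 16-frame snippets in Fig. 4


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


class SnippetReward:
    """Scores the most recent 16-frame window against a fixed goal embedding."""

    def __init__(self, goal_embedding: Vector) -> None:
        self.goal = goal_embedding
        self.frames: Deque[Vector] = deque(maxlen=WINDOW)

    def step(self, frame_embedding: Vector) -> float:
        """Add one frame's features and return the reward for the current window."""
        self.frames.append(frame_embedding)
        if len(self.frames) < WINDOW:
            return 0.0  # assumption: no reward until a full snippet exists
        dim = len(self.goal)
        pooled = [sum(f[d] for f in self.frames) / WINDOW for d in range(dim)]
        return cosine(pooled, self.goal)


reward_model = SnippetReward(goal_embedding=[1.0, 0.0])
on_goal = [reward_model.step([1.0, 0.0]) for _ in range(WINDOW)]
off_goal = [reward_model.step([0.0, 1.0]) for _ in range(WINDOW)]
print(on_goal[-1], off_goal[-1])
```

Because a score is produced at every step once the buffer is full, the signal is dense, and changing the goal embedding changes the task without writing a new reward function.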

Table 1: Our novel MineCLIP reward model achieves performance competitive with manually written dense reward functions on Programmatic tasks, and significantly outperforms the $\text{CLIP}_{\text{OpenAI}}$ method across all Creative tasks. Entries represent percentage success rates averaged over 3 seeds, each tested for 200 episodes. Success conditions are precise in Programmatic tasks, but estimated by MineCLIP for Creative tasks.
| Group | Task | Ours (Attn) | Ours (Avg) | Manual Reward | Sparse-only | CLIP_OpenAI |
|---|---|---|---|---|---|---|
| Animal-Zoo | Milk Cow | **64.5 ± 37.1** | 6.5 ± 3.5 | 62.8 ± 40.1 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Animal-Zoo | Hunt Cow | **83.5 ± 7.1** | 0.0 ± 0.0 | 48.3 ± 35.9 | 0.3 ± 0.4 | 0.0 ± 0.0 |
| Animal-Zoo | Shear Sheep | 12.1 ± 9.1 | 0.6 ± 0.2 | **52.3 ± 33.2** | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Animal-Zoo | Hunt Sheep | 8.1 ± 4.1 | 0.0 ± 0.0 | **41.9 ± 33.0** | 0.3 ± 0.4 | 0.0 ± 0.0 |
| Mob-Combat | Combat Spider | 80.5 ± 13.0 | 60.1 ± 42.5 | **87.5 ± 4.6** | 47.8 ± 33.8 | 0.0 ± 0.0 |
| Mob-Combat | Combat Zombie | 47.3 ± 10.6 | **72.3 ± 6.4** | 49.8 ± 26.9 | 8.8 ± 12.4 | 0.0 ± 0.0 |
| Mob-Combat | Combat Pigman | 1.6 ± 2.3 | 0.0 ± 0.0 | **13.6 ± 9.8** | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Mob-Combat | Combat Enderman | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.3 ± 0.2 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Creative | Find Nether Portal | 37.4 ± 40.8 | **89.8 ± 5.7** | N/A | N/A | 26.3 ± 32.6 |
| Creative | Find Ocean | 33.4 ± 45.6 | **54.3 ± 40.7** | N/A | N/A | 9.9 ± 14.1 |
| Creative | Dig Hole | **91.6 ± 5.9** | 88.1 ± 13.3 | N/A | N/A | 0.0 ± 0.0 |
| Creative | Lay Carpet | 97.6 ± 1.9 | **98.8 ± 1.0** | N/A | N/A | 0.0 ± 0.0 |

4.1 Pre-Training MineCLIP on Large-scale Videos

Formally, the learned reward function can be defined as $\Phi_{\mathcal{R}}: (G, V) \rightarrow \mathbb{R}$ that maps a language goal $G$ and a video snippet $V$ to a scalar reward. An ideal $\Phi_{\mathcal{R}}$ should return a high reward if the behavior depicted in the video faithfully follows the language description, and a low reward otherwise. This can be achieved by optimizing the InfoNCE objective [125, 52, 20], which learns to correlate positive video and text pairs [118, 6, 78, 4, 136].
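The contrastive objective can be illustrated with a small, framework-free sketch. It assumes a symmetric, CLIP-style InfoNCE loss over a batch of L2-normalized video and text embeddings with a temperature. The function names, the temperature value, and the toy embeddings are illustrative assumptions and are not taken from the MineCLIP implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the positive pair."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i is the positive, all other pairs are negatives."""
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    n = len(v)
    logits = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    video_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    text_to_video = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


# Toy batch (hypothetical 2-D embeddings): matched pairs vs. swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(videos, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce(videos, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is lower when each video embedding is closest to its own transcript embedding, which is the property a reward function built on these embeddings relies on.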

Similar to the image-text CLIP model [92], MineCLIP is composed of a separate text encoder $\phi_{G}$ that embeds a language goal and a video encoder $\phi_{V}$ that embeds a moving window of 16 consecutive frames with $160 \times 256$ resolution (Fig. 4). Our neural architecture has a design similar to CLIP4Clip [75], where $\phi_{G}$ reuses OpenAI CLIP’s pretrained text encoder, and $\phi_{V}$ is factorized into a frame-wise image encoder $\phi_{I}$ and a temporal aggregator $\phi_{a}$ that summarizes the sequence of 16 image features into a single video embedding. Unlike CLIP4Clip, we insert two extra layers of residual CLIP Adapter [38] after the aggregator $\phi_{a}$ to produce a better video feature, and finetune only the last two layers of the pretrained $\phi_{I}$ and $\phi_{G}$.

From the MineDojo YouTube database, we follow the procedure in VideoCLIP [136] to sample 640K pairs of 16-second video snippets and time-aligned English transcripts, after applying a keyword filter. We train two MineCLIP variants with different types of aggregator $\phi_{a}$: (1) MineCLIP[avg] does simple average pooling, which is fast but agnostic to the temporal ordering; (2) MineCLIP[attn] encodes the sequence by two transformer layers, which is relatively slower but captures more temporal information, and thus produces a better reward signal in general. Details of data preprocessing, architecture, and hyperparameters are listed in the Appendix (Sec. E).
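To make "agnostic to the temporal ordering" concrete, the sketch below pools 16 toy frame features in forward and reversed order. The position-weighted pooling is only a hypothetical stand-in for an order-sensitive aggregator. It is not the two-layer transformer used by MineCLIP[attn], and the 1-D features are invented for illustration.

```python
def average_pool(frame_features):
    """Average per-frame feature vectors into a single video embedding."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def position_weighted_pool(frame_features):
    """Order-sensitive stand-in (NOT a transformer): later frames weigh more."""
    weights = [i + 1 for i in range(len(frame_features))]
    total = sum(weights)
    dim = len(frame_features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, frame_features)) / total
        for d in range(dim)
    ]


# Hypothetical 1-D features for a 16-frame window.
frames = [[float(t)] for t in range(16)]
reversed_frames = list(reversed(frames))

avg_forward = average_pool(frames)
avg_backward = average_pool(reversed_frames)
weighted_forward = position_weighted_pool(frames)
weighted_backward = position_weighted_pool(reversed_frames)

print(avg_forward == avg_backward)            # same embedding for both orders
print(weighted_forward != weighted_backward)  # order changes the embedding
```

Average pooling cannot tell a clip from its reversal, whereas any aggregator that uses position information can, which is the trade-off between the two variants described above.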

4.2 RL with MineCLIP Reward

We train a language-conditioned policy network that takes as input raw pixels and predicts discrete control. The policy is trained with PPO [102] on the MineCLIP rewards. In each episode, the agent is prompted with a language goal and takes a sequence of actions to fulfill this goal. When calculating the MineCLIP rewards, we concatenate the agent’s latest 16 egocentric RGB frames in a temporal window to form a video snippet. MineCLIP handles all task prompts zero-shot without any further finetuning. In our experiments (Sec. 5), we show that MineCLIP provides effective dense rewards out of the box, despite the domain shift between in-the-wild YouTube frames and simulator frames. Besides regular video data augmentation, we do not employ any special domain adaptation methods during pre-training. Our finding is consistent with CLIP’s strong zero-shot performances on robustness benchmarks in object recognition [92].
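The windowed reward computation can be sketched as follows. This is a minimal illustration that assumes the reward is the cosine similarity between the embedding of the latest 16 frames and a fixed goal embedding; the exact reward formulation used in the paper may differ. `WindowedReward`, `toy_encoder`, and the zero reward before the window fills are assumptions made for this example.

```python
import math
from collections import deque

WINDOW = 16  # latest frames scored together, as described in the text


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


class WindowedReward:
    """Scores the most recent WINDOW frames against a fixed goal embedding."""

    def __init__(self, encode_video, goal_embedding):
        self.encode_video = encode_video      # hypothetical stand-in for phi_V
        self.goal_embedding = goal_embedding  # hypothetical stand-in for phi_G(G)
        self.frames = deque(maxlen=WINDOW)

    def step(self, frame):
        self.frames.append(frame)
        if len(self.frames) < WINDOW:
            return 0.0  # assumption: no reward until a full window is available
        return cosine(self.encode_video(list(self.frames)), self.goal_embedding)


def toy_encoder(frames):
    """Toy video encoder: the mean of 2-D 'frames' (not a real model)."""
    return [sum(f[d] for f in frames) / len(frames) for d in range(2)]


reward_fn = WindowedReward(toy_encoder, goal_embedding=[1.0, 0.0])
rewards = [reward_fn.step([1.0, 0.0]) for _ in range(WINDOW)]
print(rewards[0], rewards[-1])
```

Because the deque has a fixed maximum length, appending a new frame automatically drops the oldest one, so the reward always reflects the most recent window.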

Compared to hard-coded reward functions in popular benchmarks [146, 119, 34], the MineCLIP model has 150M parameters and is thus much more expensive to query. We make several design choices to greatly accelerate RL training with MineCLIP in the loop:

  1. The language goal $G$ is fixed for a specific task, so the text features $\phi_{G}$ can be precomputed to avoid invoking the text encoder repeatedly.

  2. Our agent’s RGB encoder reuses the pre-trained weights of $\phi_{I}$ from MineCLIP. We do not finetune $\phi_{I}$ during RL training, which saves computation and endows the agent with good visual representations from the beginning.

  3. MineCLIP’s video encoder $\phi_{V}$ is factorized into an image encoder $\phi_{I}$ and a light-weight aggregator $\phi_{a}$. This design choice enables efficient image feature caching. Consider two overlapping video sequences of 8 frames, V[0:8] and V[1:9]. We can cache the image features of the 7 overlapping frames V[1] to V[7] to maximize compute savings. If $\phi_{V}$ is a monolithic model like S3D [135] in VideoCLIP [136], then the video features from every sliding window must be recomputed, which would incur a much higher cost per time step.

  4. We leverage Self-Imitation Learning [84] to store the trajectories with high MineCLIP reward values in a buffer, and alternate between PPO and self-imitation gradient steps. It further improves sample efficiency as shown in the Appendix (Fig. A.8).
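The image feature caching in item 3 can be sketched with a fixed-length queue of per-frame features. The class and the toy encoder and aggregator below are hypothetical stand-ins for $\phi_{I}$ and $\phi_{a}$. The sketch only counts how often the expensive image encoder would be invoked, using the 8-frame window from the example above.

```python
from collections import deque


class CachedVideoEncoder:
    """Factorized video encoder that caches per-frame image features."""

    def __init__(self, image_encoder, aggregator, window):
        self.image_encoder = image_encoder  # hypothetical stand-in for phi_I
        self.aggregator = aggregator        # hypothetical stand-in for phi_a
        self.features = deque(maxlen=window)

    def push(self, frame):
        """Encode only the new frame; earlier frame features come from the cache."""
        self.features.append(self.image_encoder(frame))
        return self.aggregator(list(self.features))


calls = {"image": 0}


def toy_image_encoder(frame):
    calls["image"] += 1  # count invocations of the expensive per-frame model
    return [float(frame)]


def toy_aggregator(features):
    return [sum(f[0] for f in features) / len(features)]


WINDOW, STEPS = 8, 20
encoder = CachedVideoEncoder(toy_image_encoder, toy_aggregator, window=WINDOW)
embeddings = [encoder.push(frame) for frame in range(STEPS)]

cached_calls = calls["image"]
# A monolithic encoder would re-process all frames of every full window.
naive_calls = WINDOW * (STEPS - WINDOW + 1)
print(cached_calls, naive_calls)
```

With caching, the image encoder runs once per environment step, and only the light-weight aggregator is re-run on each new window.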
Table 2: MineCLIP agrees well with the ground-truth human judgment on the Creative tasks we consider. Numbers are F1 scores between MineCLIP’s binary classification of task success and human labels (scaled to percentages for better readability).
| Method | Find Nether Portal | Find Ocean | Dig Hole | Lay Carpet |
|---|---|---|---|---|
| Ours (Attn) | 98.7 | **100.0** | 99.4 | 97.4 |
| Ours (Avg) | **100.0** | **100.0** | **100.0** | **98.4** |
| CLIP_OpenAI | 48.7 | 98.4 | 80.6 | |

5 Experiments

We evaluate our agent-learning approach (Section 4) on 8 Programmatic tasks and 4 Creative tasks from the MineDojo benchmarking suite. We select these 12 tasks due to the diversity of skills required to solve them (e.g., harvesting, combat, building, navigation) and domain-specific entities (e.g., animals, resources, monsters, terrains, and structures). We split the tasks into 3 groups and train one multi-task agent for each group: Animal-Zoo (4 Programmatic tasks on hunting or harvesting resources from animals), Mob-Combat (4 Programmatic tasks on fighting different types of hostile monsters), and Creative (4 tasks).

In the experiments, we empirically check the quality of MineCLIP against manually written reward functions, and quantify how different variants of our learned model affect the RL performance. Table 1 presents our main results, and Fig. 2 visualizes our learned agent behavior in 4 of the considered tasks. Policy networks of all methods share the same architecture and are trained by PPO + Self-Imitation (Sec. 4.2, training details in the Appendix, Sec. F). We compare the following methods:

  • Ours (Attn): our agent trained with the MineCLIP[attn] reward model. For Programmatic tasks, we also add the final success condition as a binary reward. For Creative tasks, MineCLIP is the only source of reward.

  • Ours (Avg): the average-pooling variant of our method.

  • Manual Reward: hand-engineered dense reward using ground-truth simulator states.

  • Sparse-only: the final binary success as a single sparse reward. Note that neither sparse-only nor manual reward is available for Creative tasks.

  • CLIP_OpenAI: pre-trained OpenAI CLIP model that has not been finetuned on any MineDojo videos.

MineCLIP is competitive with manual reward.

For Programmatic tasks (first 8 rows), RL agents guided by MineCLIP achieve performance competitive with those trained by manual reward. In three of the tasks, they even outperform the hand-engineered reward functions, which rely on privileged simulator states unavailable to MineCLIP. For a more statistically sound analysis, we conduct a paired Student’s t-test to compare the mean success rate of each task (pairing column 3 “Ours (Attn)” and column 5 “Manual Reward” in Table 1). The test yields a p-value of $0.3991 \gg 0.05$, which indicates that the difference between our method and manual reward is not statistically significant, so the two are comparable. Because all tasks require nontrivial exploration, our approach also dominates the Sparse-only baseline. Note that the original OpenAI CLIP model fails to achieve any success. We hypothesize that the creatures in Minecraft look dramatically different from their real-world counterparts, which causes CLIP to produce misleading signals worse than no shaping reward at all. This implies the importance of finetuning on MineDojo’s YouTube data.
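The paired test can be re-derived from the rounded means in Table 1. The sketch below uses only the standard library. The helper names are hypothetical, and the p-value is obtained by numerically integrating the Student t density rather than by calling a statistics package (in practice `scipy.stats.ttest_rel` would be the usual tool). Since the inputs are the rounded table entries, the result should land close to, but not necessarily exactly on, the reported 0.3991.

```python
import math

# Mean success rates of the 8 Programmatic tasks, copied from Table 1.
ours_attn = [64.5, 83.5, 12.1, 8.1, 80.5, 47.3, 1.6, 0.0]
manual = [62.8, 48.3, 52.3, 41.9, 87.5, 49.8, 13.6, 0.3]


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom of the paired differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1


def student_t_pdf(x, df):
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps
    total = student_t_pdf(0.0, df) + student_t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    central_mass = total * h / 3  # probability mass between 0 and |t|
    return 1.0 - 2.0 * central_mass


t_stat, df = paired_t_statistic(ours_attn, manual)
p_value = two_sided_p_value(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.4f}")
```

A p-value this far above 0.05 means the test fails to detect a difference between the two reward sources; with only 8 paired tasks, it does not by itself prove that they are equivalent.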

Table 3: MineCLIP agents have stronger zero-shot visual generalization ability to unseen terrains, weather, and lighting. Numbers outside parentheses are percentage success rates averaged over 3 seeds (each tested for 200 episodes), while those inside parentheses are relative performance changes.
| Task | Ours (Attn), train | Ours (Attn), unseen test | CLIP_OpenAI, train | CLIP_OpenAI, unseen test |
|---|---|---|---|---|
| Milk Cow | 64.5 ± 37.1 | **64.8 ± 31.3 (+0.8%)** | 90.0 ± 0.4 | 29.2 ± 3.7 (-67.6%) |
| Hunt Cow | 83.5 ± 7.1 | **55.9 ± 7.2 (-32.9%)** | 72.7 ± 3.5 | 16.7 ± 1.6 (-77.0%) |
| Combat Spider | 80.5 ± 13.0 | **62.1 ± 29.7 (-22.9%)** | 79.5 ± 2.5 | 54.2 ± 9.6 (-31.8%) |
| Combat Zombie | 47.3 ± 10.6 | **39.9 ± 25.3 (-15.4%)** | 50.2 ± 7.5 | 30.8 ± 14.4 (-38.6%) |

MineCLIP provides automated evaluation.

For Creative tasks (last 4 rows), there are no programmatic success criteria available. We convert MineCLIP into a binary success classifier by thresholding the reward value it outputs for an episode. To test the quality of MineCLIP as an automatic evaluation metric, we ask human judges to curate a dataset of 100 successful and 100 failed trajectories for each task. We then run both MineCLIP variants and CLIP_OpenAI on the dataset and report the binary F1 score of their judgment against human ground-truth in Table 2. The results demonstrate that both MineCLIP[attn] and MineCLIP[avg] attain a very high degree of agreement with human evaluation results on this subset of the Creative task suite that we investigate. The CLIP_OpenAI baseline also achieves nontrivial agreement on the Find Ocean and Dig Hole tasks, likely because real-world oceans and holes have similar textures. We use the attn variant as an automated success criterion to score the 4 Creative task results in Table 1. Our proposed method consistently learns better than CLIP_OpenAI-guided agents. This shows that MineCLIP is an effective approach to solving open-ended tasks when no straightforward reward signal is available. We provide further analysis beyond these 4 tasks in the Appendix (Sec. G.4).
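The evaluation protocol above (threshold an episode-level score, then compare against human labels with F1) can be sketched as follows. The scores, labels, and threshold are invented toy values, and `f1_from_threshold` is a hypothetical helper; how the threshold is chosen in the paper is not specified in this section.

```python
def f1_from_threshold(scores, labels, threshold):
    """Binary F1 of the rule `score >= threshold` against human success labels."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy episode-level scores and human labels (True = judged successful).
scores = [0.91, 0.85, 0.40, 0.77, 0.30, 0.62]
labels = [True, True, False, True, False, False]

f1 = f1_from_threshold(scores, labels, threshold=0.6)
print(f"F1 = {100 * f1:.1f}")  # scaled to a percentage, as in Table 2
```

In this toy data one failed episode scores above the threshold, so precision is 3/4 while recall is 1, giving an F1 of 6/7.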

MineCLIP shows good zero-shot generalization to significant visual distribution shift.

We evaluate the learned policy without finetuning on a combination of unseen weather, lighting conditions, and terrains (27 scenarios in total). For the baseline, we train agents with the original CLIP_OpenAI image encoder (not trained on our YouTube videos) by imitation learning. The robustness against visual shift can be quantitatively measured by the relative performance degradation on novel test scenarios for each task. Table 3 shows that while both methods lose performance on most tasks, agents with the MineCLIP encoder are more robust to visual changes than the baseline across all tasks. Pre-training on diverse in-the-wild YouTube videos is important for boosting zero-shot visual generalization capability, a finding consistent with the literature [92, 82].
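The relative performance change in Table 3 appears to be the test-minus-train difference divided by the train success rate. The sketch below applies that formula (an assumption inferred from the numbers, not stated explicitly in the text) to the CLIP_OpenAI columns, using the rounded means from the table.

```python
def relative_change(train, test):
    """Relative change in percent when moving from train to unseen test scenarios."""
    return 100.0 * (test - train) / train


# (train, unseen test) mean success rates of the CLIP_OpenAI baseline in Table 3.
clip_baseline = {
    "Milk Cow": (90.0, 29.2),
    "Hunt Cow": (72.7, 16.7),
    "Combat Spider": (79.5, 54.2),
    "Combat Zombie": (50.2, 30.8),
}

changes = {
    task: round(relative_change(train, test), 1)
    for task, (train, test) in clip_baseline.items()
}
for task, change in changes.items():
    print(f"{task}: {change:+.1f}%")
```

Applying the same formula to the rounded MineCLIP means gives values within about half a percentage point of the table, presumably because the reported changes were computed from unrounded numbers.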

Learning a Single Agent for All 12 Tasks

We have also trained a single agent for all 12 tasks. To reduce the computational burden without loss of generality, the agent is trained by self-imitating from successful trajectories generated from the self-imitation learning pipeline (Section F.3). We summarize the result in Table 4. Similar to our main experiments, all numbers represent percentage success rates averaged over 3 training seeds, each tested for 200 episodes. Compared to the original agents, the new 12-multitask agent sees a performance boost in 6 tasks, degradation in 4 tasks, and roughly the same success rates in 2 tasks. This result suggests that there are both positive and negative task transfers happening. To improve the multi-task performance, more advanced algorithms [141, 133] can be employed to mitigate the negative transfer effects.
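The 6/4/2 split can be reproduced from the mean success rates in Table 4. The sketch below assumes that a gap of at most 2 percentage points counts as "roughly the same". That tolerance is a hypothetical choice made for this example and is not stated in the text.

```python
# Mean success rates from Table 4: single 12-task agent vs. original group agents.
single = {
    "Milk Cow": 91.5, "Hunt Cow": 46.8, "Shear Sheep": 73.5, "Hunt Sheep": 27.0,
    "Combat Spider": 72.1, "Combat Zombie": 27.1, "Combat Pigman": 6.5,
    "Combat Enderman": 0.0, "Find Nether Portal": 99.1, "Find Ocean": 95.1,
    "Dig Hole": 85.8, "Lay Carpet": 96.5,
}
original = {
    "Milk Cow": 64.5, "Hunt Cow": 83.5, "Shear Sheep": 12.1, "Hunt Sheep": 8.1,
    "Combat Spider": 80.5, "Combat Zombie": 47.3, "Combat Pigman": 1.6,
    "Combat Enderman": 0.0, "Find Nether Portal": 37.4, "Find Ocean": 33.4,
    "Dig Hole": 91.6, "Lay Carpet": 97.6,
}

TOLERANCE = 2.0  # assumption: gaps within 2 points count as "roughly the same"


def tally(new, old, tol):
    """Count tasks that improved, degraded, or stayed roughly the same."""
    counts = {"up": 0, "down": 0, "same": 0}
    for task, value in new.items():
        delta = value - old[task]
        if delta > tol:
            counts["up"] += 1
        elif delta < -tol:
            counts["down"] += 1
        else:
            counts["same"] += 1
    return counts


counts = tally(single, original, TOLERANCE)
print(counts)
```

Under this tolerance the tally matches the arrows in Table 4: six tasks improve, four degrade, and two (Combat Enderman and Lay Carpet) stay roughly unchanged.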

We also perform a paired Student’s t-test to statistically compare the performance of the 12-multitask agent and those separately trained on each task group. We obtain a p-value of $0.3720 \gg 0.05$, which suggests that the difference between the two training settings is not statistically significant.

Table 4: We train a single multi-task agent for all 12 tasks. All numbers represent percentage success rates averaged over 3 seeds, each tested for 200 episodes.
| Group | Task | Single Agent on All Tasks | Original | Performance Change |
|---|---|---|---|---|
| Animal-Zoo | Milk Cow | **91.5 ± 0.7** | 64.5 ± 37.1 | ↑ |
| Animal-Zoo | Hunt Cow | 46.8 ± 3.7 | **83.5 ± 7.1** | ↓ |
| Animal-Zoo | Shear Sheep | **73.5 ± 0.8** | 12.1 ± 9.1 | ↑ |
| Animal-Zoo | Hunt Sheep | **27.0 ± 1.0** | 8.1 ± 4.1 | ↑ |
| Mob-Combat | Combat Spider | 72.1 ± 1.3 | **80.5 ± 13.0** | ↓ |
| Mob-Combat | Combat Zombie | 27.1 ± 2.7 | **47.3 ± 10.6** | ↓ |
| Mob-Combat | Combat Pigman | **6.5 ± 1.2** | 1.6 ± 2.3 | ↑ |
| Mob-Combat | Combat Enderman | 0.0 ± 0.0 | 0.0 ± 0.0 | = |
| Creative | Find Nether Portal | **99.1 ± 0.4** | 37.4 ± 40.8 | ↑ |
| Creative | Find Ocean | **95.1 ± 1.5** | 33.4 ± 45.6 | ↑ |
| Creative | Dig Hole | 85.8 ± 1.2 | **91.6 ± 5.9** | ↓ |
| Creative | Lay Carpet | 96.5 ± 0.8 | **97.6 ± 1.9** | = |
Table 5: We test the open-vocabulary generalization ability on two unseen tasks. All numbers represent percentage success rates averaged over 3 seeds, each tested for 200 episodes.
| Task | Ours (zero-shot) | Ours (after RL finetune) | Baseline (RL from scratch) |
|---|---|---|---|
| Hunt Pig | 1.3 ± 0.6 | **46.0 ± 15.3** | 0.0 ± 0.0 |
| Harvest Spider String | 1.6 ± 1.3 | **36.5 ± 16.9** | 12.5 ± 12.7 |

Generalize to Novel Tasks

To test the ability to generalize to new open-vocabulary commands, we evaluate the agent on two novel tasks: “harvest spider string” and “hunt pig”. Table 5 shows that the agent struggles in the zero-shot setting because it has not interacted with pigs or spider strings during training, and thus does not know how to interact with them effectively. However, the performance improves substantially by finetuning with the MineCLIP reward. Here the baseline methods are trained from scratch using RL with the MineCLIP encoders and reward. Therefore, the only difference is whether the policy has been pre-trained on the 12 tasks or not. Given the same environment sampling budget (only around 5% of total samples), the finetuned agent significantly outperforms the baselines. This suggests that the multitask agent has learned transferable knowledge on hunting and resource collection, which enables it to quickly adapt to novel tasks.

6 Related work

Open-ended Environments for Decision-making Agents.

There are many environments developed with the goal of open-ended agent learning. Prior works include maze-style worlds [121, 129, 61], purely text-based games [69], grid worlds [21, 16], browser/GUI-based environments [108, 124], and indoor simulators for robotics [1, 107, 114, 34, 110, 99, 89]. Minecraft offers an exciting alternative for open-ended agent learning. It is a 3D visual world with procedurally generated landscapes and extremely flexible game mechanics that support an enormous variety of activities. Prior methods in open-ended agent learning [30, 57, 130, 63, 26] do not make use of external knowledge, but our approach leverages an internet-scale database to learn open-vocabulary reward models, thanks to Minecraft’s abundance of gameplay data online.

Minecraft for AI Research.

The Malmo platform [60] is the first comprehensive release of a Gym-style agent API [14] for Minecraft. Based on Malmo, MineRL [48] provides a codebase and human play trajectories for the annual Diamond Challenge at NeurIPS [47, 49, 62]. MineDojo’s simulator builds upon the pioneering work of MineRL, but greatly expands the API and benchmarking task suite. Other Minecraft benchmarks exist with different focuses. For example, CraftAssist [44] and IGLU [66] study agents with interactive dialogues. BASALT [104] applies human evaluation to 4 open-ended tasks. EvoCraft [45] is designed for structure building, and Crafter [50] optimizes for fast experimentation. Unlike prior works, MineDojo’s core mission is to facilitate the development of generally capable embodied agents using internet-scale knowledge. We include a feature comparison table of different Minecraft platforms for AI research in Table A.1.

Internet-scale Multimodal Knowledge Bases.

Big datasets such as Common Crawl [24], the Pile [37], LAION [100], YouTube-8M [2] and HowTo100M [77] have been fueling the success of large pre-trained language models [27, 91, 15] and multimodal models [118, 6, 78, 145, 7, 4, 136]. While generally useful for learning representations, these datasets are not specifically targeted at embodied agents. To provide agent-centric training data, RoboNet [25] collects video frames from 7 robot platforms, and Ego4D [43] recruits volunteers to record egocentric videos of household activities. In comparison, MineDojo’s knowledge base is constructed without human curation efforts, is much larger in volume and more diverse in data modalities, and comprehensively covers all aspects of the Minecraft environment.

Embodied Agents with Large-scale Pre-training.

Inspired by the success in NLP, embodied agent research [29, 11, 94, 23] has seen a surge in adoption of the large-scale pre-training paradigm. The recent advances can be roughly divided into 4 categories. 1) Novel agent architecture: Decision Transformer [19, 58, 144] applies the powerful self-attention models to sequential decision making. GATO [95] and Unified-IO [74] learn a single model to solve various decision-making tasks with different control interfaces. VIMA [59] unifies a wide range of robot manipulation tasks with multimodal prompting. 2) Pre-training for better representations: R3M [82] trains a general-purpose visual encoder for robot perception on Ego4D videos [43]. CLIPort [111] leverages the pre-trained CLIP model [92] to enable free-form language instructions for robot manipulation. 3) Pre-training for better policies: AlphaStar [126] achieves champion-level performance on StarCraft by imitating numerous human demos. SayCan [3] leverages large language models (LMs) to ground value functions in the physical world. [72] and [96] directly reuse pre-trained LMs as the policy backbone. VPT [10] is a concurrent work that learns an inverse dynamics model from human contractors to pseudo-label YouTube videos for behavior cloning. VPT is complementary to our approach, and can be finetuned to solve language-conditioned open-ended tasks with our learned reward model. 4) Data-driven reward functions: Concept2Robot [105] and DVD [18] learn a binary classifier to score behaviors from in-the-wild videos [42]. LOReL [81] crowd-sources human labels to train a language-conditioned reward function for offline RL. AVID [113] and XIRL [142] extract reward signals via cycle consistency. MineDojo’s task benchmark and internet knowledge base are generally useful for developing new algorithms in all the above categories. In Sec. 4, we also propose an open-vocabulary, multi-task reward model using MineDojo YouTube videos.

7 Conclusion

In this work, we introduce the MineDojo framework for developing generally capable embodied agents. MineDojo features a benchmarking suite of thousands of Programmatic and Creative tasks, and an internet-scale multimodal knowledge base of videos, wiki, and forum discussions. As an example of the novel research possibilities enabled by MineDojo, we propose MineCLIP as an effective language-conditioned reward function trained with in-the-wild YouTube videos. MineCLIP achieves strong performance empirically and agrees well with human evaluation results, making it a good automatic metric for Creative tasks. We look forward to seeing how MineDojo empowers the community to make progress on the important challenge of open-ended agent learning.

8 Acknowledgement

We are extremely grateful to Anssi Kanervisto, Shikun Liu, Zhiding Yu, Chaowei Xiao, Weili Nie, Jean Kossaifi, Jonathan Raiman, Neel Kant, Saad Godil, Jaakko Haapasalo, Bryan Catanzaro, John Spitzer, Zhiyuan “Jerry” Lin, Yingqi Zheng, Chen Tessler, Dieter Fox, Oli Wright, Jeff Clune, Jack Parker-Holder, and many other colleagues and friends for their helpful feedback and insightful discussions. We also thank the anonymous reviewers for offering us highly constructive advice and kind encouragement during the review and rebuttal period. NVIDIA provides the necessary computing resource and infrastructure for this project. Guanzhi Wang is supported by the Kortschak fellowship in Computing and Mathematical Sciences at Caltech.

References

Appendix A Minecraft Framework Comparison

Table A.1: Comparison table of different Minecraft platforms for AI research.
Environment | Simulator: Features | Simulator: Real Minecraft | Task Suite: Number of Tasks | Task Suite: Language-grounded | Knowledge Base: Features | Knowledge Base: Data Scale
MineDojo | Unified observation and action space; unlocks all three types of world (the Overworld, the Nether, and the End) | | 3,000+ | | Automatically scraped from the Internet; multimodal data (videos, images, texts, tables and diagrams) | 740K YouTube videos; 7K Wiki pages; 350K Reddit posts
MineRL v0.4 [48] | Built on top of Malmo; actively maintained | | 11 | | Annotated state-action pairs of human demonstrations | 60M frames of recorded human player data
MineRL v1.0 (VPT) [10] | Mouse and keyboard control | | 5 | | Labeled contractor data; unlabeled videos scraped from the Internet | 2K hours of contractor data; 270K hours of unlabeled videos
MarLÖ [87] | Cooperative and competitive multiagent tasks; parameterizable environments | | | | |
Malmo [60] | First comprehensive release of a Gym-style agent API for Minecraft | | N/A | | |
CraftAssist [44] | Bot assistant; dialogue interactions | | N/A | | Interactive dialogues; crowd-sourced house building dataset | 800K dialogue-action dictionary pairs; 2.6K houses with atomic building actions
IGLU [66] | Interactive dialogues with humans; aimed at building structures described by natural language | | | | |
EvoCraft [45] | Aimed at generating creative artifacts; allows for direction manipulation of blocks | | | | |
Crafter [50] | 2D clone of Minecraft; fast experimentation | | 22 | | Human experts dataset | 100 episodes

Appendix B MineDojo Simulator

We design unified observation and action spaces across all tasks to facilitate the development of multi-tasking and continually learning agents that can adapt to novel tasks and scenarios. The codebase is open sourced at github.com/MineDojo/MineDojo.

B.1 Observation Space

Our observation space contains multiple modalities. The agent perceives the world mainly through the RGB screen. To provide the same information as human players receive, we also supplement the agent with observations about its inventory, location, health, surrounding blocks, etc. The full observation space is shown below. We refer readers to our code documentation for technical details such as the data type of each observable item.

We also support a LIDAR sensor that returns the ground-truth type of the blocks that the agent sees. This is considered privileged information and does not go into the benchmarking specification, but it is still useful for hand-engineering the dense reward function, which we use in our experiments (Sec. 5). The number and directions of LIDAR rays can be arbitrarily configured at the cost of lower simulation throughput.

Modality Shape Description
RGB (3, H, W) Ego-centric RGB frames.
Equipment (6,) Names, quantities, variants, and durabilities of equipped items.
Inventory (36,) Names, quantities, variants, and durabilities of inventory items.
Voxel (3, 3, 3) Names, variants, and properties of 3×3×3 surrounding blocks.
Life statistics (1,) Agent’s health, oxygen, food saturation, etc.
GPS (3,) GPS location of the agent.
Compass (2,) Yaw and pitch of the agent.
Nearby tools (2,) Indicates whether a crafting table and a furnace are near the agent.
Damage source (1,) Information about the damage on the agent.
Lidar (Num rays,) Ground-truth lidar observation.

B.2 Action Space

We design a compound action space. At each step the agent chooses one movement action (forward, backward, camera actions, etc.) and one optional functional action as listed in the table below. Some functional actions such as craft take one argument, while others like attack do not take any argument. This compound action space can be modeled in an autoregressive manner [126]. We refer readers to our code documentation for example usages of our action space.

Name Description Argument
no_op Do nothing. ∅
use Use the item held in the main hand. ∅
drop Drop the item held in the main hand. ∅
attack Attack with bare hand or tool held in the main hand. ∅
craft Execute a crafting recipe to obtain new items. Index of recipe
equip Equip an inventory item. Slot index of the item
place Place an inventory item on the ground. Slot index of the item
destroy Destroy an inventory item. Slot index of the item
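
One way to picture the compound action is as a (movement, functional action, optional argument) triple. The sketch below is illustrative only: the class name, field names, and movement strings are hypothetical and are not the MineDojo API. It merely encodes which functional actions in the table above take an argument.

```python
from dataclasses import dataclass
from typing import Optional

# Functional actions from the table above, split by whether they take an argument.
NO_ARG_ACTIONS = {"no_op", "use", "drop", "attack"}
ARG_ACTIONS = {"craft", "equip", "place", "destroy"}


@dataclass(frozen=True)
class CompoundAction:
    """Hypothetical container: one movement action plus one functional action."""

    movement: str                    # e.g. "forward"; movement names are illustrative
    functional: str = "no_op"        # optional functional action
    argument: Optional[int] = None   # recipe index or inventory slot index

    def __post_init__(self):
        if self.functional in ARG_ACTIONS:
            if self.argument is None:
                raise ValueError(f"{self.functional} requires an argument")
        elif self.functional in NO_ARG_ACTIONS:
            if self.argument is not None:
                raise ValueError(f"{self.functional} takes no argument")
        else:
            raise ValueError(f"unknown functional action: {self.functional}")


# Move forward while executing crafting recipe 3 (the index is made up).
action = CompoundAction(movement="forward", functional="craft", argument=3)
```

Validating the argument at construction time mirrors the table: an autoregressive policy would first pick the functional action and only then, if needed, its argument.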

B.3 Customizing the Environment

Environments in the MineDojo simulator can be easily and flexibly customized. Through our simulator API, users can control terrain, weather, day-night condition (different lighting), the spawn rate and range of specified entities and materials, etc. We support a wide range of terrains, such as desert, jungle, taiga, and iced plain, and special in-game structures, such as ocean monument, desert temple, and End city. Please visit our website for video demonstrations.

Appendix C MineDojo Task Suite

In this section, we explain how we collect the Programmatic (Sec. 2.1) and Creative tasks (Sec. 2.2).

category: survival
prompt: survive as long as possible given a sword and some food

category: harvest
prompt: harvest wool from a sheep with shears and a sheep nearby

category: tech-tree
prompt: find material and craft a wooden sword

category: combat
prompt: combat a zombie pigman in nether with a diamond sword, shield, and a full suite of diamond armors
Figure A.1: Example specifications.

C.1 Programmatic Tasks

Programmatic tasks are constructed by filling manually written templates for four categories of tasks, namely “Survival”, “Harvest”, “Tech Tree”, and “Combat”. The task specifications are included in our codebase. Please refer to Fig. A.1 for a few samples. We briefly explain each task category:


Survival.

This task group tests the ability to stay alive in the game. It is nontrivial to survive in Minecraft, because the agent grows hungry as time passes and the health bar drops gradually. Hostile mobs like zombies and skeletons spawn at night and are very dangerous if the agent does not have the appropriate armor to protect itself or weapons to fight back. We provide two Survival tasks with different levels of difficulty. One is to start from scratch without any assistance. The other is to start with initial weapons and food.


Harvest.

This task group tests the agent’s ability to collect useful resources such as minerals (iron, diamond, obsidian), food (beef, pumpkin, carrots, milk), and other useful items (wool, oak wood, coal). We construct these tasks by enumerating the Cartesian product between target items to collect, initial inventory, and world conditions (terrain, weather, lighting, etc.) so that they cover a spectrum of difficulty. For instance, if the task is to harvest wool, then it is relatively easy if the agent has shears in its initial inventory with a sheep nearby, but more difficult if the agent has to craft the shears from raw materials and explore extensively to find a sheep. We filter out combinations that are impossible (such as farming certain plants in the desert) from the Cartesian product.
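
The enumeration described above can be sketched in a few lines. This is a minimal illustration, not the task generator from the codebase: the item lists are toy values, and the `IMPOSSIBLE` rule set is a hypothetical stand-in for whatever feasibility checks are actually used.

```python
from itertools import product

# Toy values for illustration; the real templates live in the MineDojo codebase.
targets = ["wool", "carrots"]
inventories = ["shears", "nothing"]
terrains = ["plains", "desert"]

# Hypothetical rule set: (target, terrain) pairs assumed to be infeasible.
IMPOSSIBLE = {("carrots", "desert")}


def enumerate_tasks(targets, inventories, terrains, impossible):
    """Enumerate the Cartesian product and drop infeasible combinations."""
    return [
        {"target": t, "inventory": inv, "terrain": ter}
        for t, inv, ter in product(targets, inventories, terrains)
        if (t, ter) not in impossible
    ]


tasks = enumerate_tasks(targets, inventories, terrains, IMPOSSIBLE)
print(len(tasks), "task specifications")
```

Because difficulty comes from the combination of factors rather than from the target alone, a small number of templates expands into many graded variants of the same goal.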

Tech Tree.

Minecraft includes several levels of tools and armors with different properties and difficulties to unlock. To progress to a higher level of tools and armors, the agent needs to develop systematic and compositional skills to navigate the technology tree (e.g. wood → stone → iron → diamond). In this task group, the agent is asked to craft and use a hierarchy of tools starting from a less advanced level. For example, one task asks the agent to craft a wooden sword starting bare-handed. Another task may ask the agent to craft a gold helmet. An agent that can successfully complete these tasks should have the ability to transfer similar exploration strategies to different tech levels.


Combat.

We test the agent’s reflexes and martial skills in fighting various monsters and creatures. Similar to how we develop the Harvest task group, we generate these tasks by enumerating the Cartesian product between the target entity to combat, initial inventory, and world conditions to cover a spectrum of difficulty.

C.2 Creative Tasks

We construct Creative tasks using three approaches: 1) manual brainstorming, 2) mining from YouTube tutorial videos, and 3) generation by querying the GPT-3 API. We elaborate on the second and third approaches below.

Task Mining from YouTube Tutorial Videos.

Figure A.2: Labeling UI to mine tasks from YouTube. A human annotator can choose to reject the video (Invalid), adjust the timestamps, select the title, or edit and expand the original description to be the new task goal.

Our YouTube dataset doubles as a rich task source, as many human players demonstrate and narrate creative missions in the tutorial playlists. To collect high-quality tasks and accompanying videos, we design a 3-stage pipeline that makes it easy to find and annotate interesting tasks.

  1. Stage 1: We search for YouTube playlists with the key phrases “Minecraft Tutorial” and “Minecraft Guide”. Then we apply heuristic rules (see Sec. D.1) to filter out low-quality videos;

  2. Stage 2: We only show the title of the video to a human annotator through a command-line interface, who makes a binary decision to accept or reject it as a potential task. This step is typically very fast, taking a few seconds on average;

  3. Stage 3: For the accepted tasks in stage 2, we design a labeling UI using Label Studio [70] that displays the full video and YouTube description. A human annotator can choose to reject the video, adjust the timestamps, select the title, or refine the description to be the task goal (Fig. A.2). Through this pipeline, we extract 1,042 task ideas from the common wisdom of a huge number of veteran Minecraft gamers. Some examples are “make an automated mining machine” and “grow cactus up to the sky”.

C.3 GPT-3 Guidance

We leverage OpenAI’s GPT-3-davinci API to automatically generate detailed guidance for a subset of the tasks. Inspired by [67], we adopt the following template to prompt GPT-3: How to {task goal} in Minecraft? Let’s think step by step. Here are some examples:

The guidance for the task “find material and craft a gold pickaxe” is 1) Find a place with a lot of trees; 2) Cut down the trees and gather the wood; 3) Find a place with a lot of stone; 4) Mine the stone and gather the cobblestone; 5) Find a place with a lot of iron; 6) Mine the iron and gather the iron ingots; 7) Find a place with a lot of gold; 8) Mine the gold and gather the gold ingots; 9) Craft a gold pickaxe.

The guidance for the task “sail on boat with a sheep” is 1) Find a boat; 2) Place the sheep in the boat; 3) Right-click on the boat with an empty hand to get in; 4) Use the WASD keys to move the boat. The sheep should stay in the boat.
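
The prompt construction is simple enough to state as code. The sketch below only fills the template quoted above; the function name is hypothetical, and the actual API call to GPT-3 is deliberately left out because its parameters are not given in the text.

```python
# Template quoted in the text; {task_goal} is the only slot.
PROMPT_TEMPLATE = "How to {task_goal} in Minecraft? Let's think step by step."


def build_guidance_prompt(task_goal: str) -> str:
    """Fill the template with a task goal (hypothetical helper, no API call)."""
    return PROMPT_TEMPLATE.format(task_goal=task_goal.strip())


print(build_guidance_prompt("find material and craft a gold pickaxe"))
```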

C.4 Playthrough: Defeat the Ender Dragon

Our benchmarking suite includes a special task called “Playthrough”. The agent is initialized bare-handed in a freshly created world and aims to defeat the Ender dragon, which is considered the final boss of Minecraft. This task holds a unique position in our benchmark because killing the dragon means “beating the game” in the traditional sense of the phrase, and is considered the most significant achievement for a new player. This boss is optional and plenty of people choose to skip it without affecting their open-ended game experience.

“Playthrough” is technically a programmatic task, because we can check the simulator state for the defeat of the Ender dragon. However, we decided to give it its own category due to its uniqueness as well as the sheer difficulty of the task. The mission requires lots of preparation, exploration, agility, and trial-and-error, which may take a regular human player many days to complete. It would be an extremely long-horizon task (hundreds of thousands of steps) and difficult for an agent to tackle. We consider this one of the moonshot goals in MineDojo.

Appendix D Internet-Scale Database

We upload our databases to zenodo.org, which is an open repository platform operated by CERN. The data DOIs, URLs, and licenses are listed below. In this section, we describe our database properties and data collection process in detail.

Database DOI License
YouTube 10.5281/zenodo.6641142 Creative Commons Attribution 4.0 International (CC BY 4.0)
Wiki 10.5281/zenodo.6640448 Creative Commons Attribution Non Commercial Share Alike 3.0 Unported
Reddit 10.5281/zenodo.6641114 Creative Commons Attribution 4.0 International (CC BY 4.0)

D.1 YouTube Videos and Transcripts

Figure A.3: Distribution of YouTube video duration. The histogram is trimmed by the 85th percentile to hide much longer videos that can run for many hours.

Minecraft is among the most streamed games on YouTube [41]. Human players have demonstrated a stunning range of creative activities and sophisticated missions that take hours to complete. We collect 33 years’ worth of video and 2.2B words in the accompanying English transcripts. The distribution of video duration is shown in Fig. A.3. The time-aligned transcripts enable the agent to ground free-form natural language in video pixels and learn the semantics of diverse activities without laborious human labeling.

We use the official YouTube Data API [140] to collect our database, following the procedure below:

  a) Search for channels that contain Minecraft videos using a list of keywords, e.g., “Minecraft”, “Minecraft Guide”, “Minecraft Walkthrough”, “Minecraft Beginner”. We do not directly search for videos at this step because there is a limit on the total number of results returned by the API;

  b) Search for all the video IDs uploaded by each channel that we obtain at the previous step. There are many false positives at this step because some channels (like gaming news channels) may cover a range of topics other than Minecraft;

  c) To remove the false positives, we rely on the video category chosen by the user when the video was uploaded and filter out all the videos that do not belong to the Minecraft category;

  d) To curate a language-grounded dataset, we favor videos that have English transcripts, which can be manually uploaded by the user, automatically transcribed from audio, or automatically translated from another language by the YouTube engine. We filter out a video if 1) the view count is less than 100; or 2) the aspect ratio is less than 1; or 3) the duration is less than 1 minute; or 4) it is marked as age-restricted.

  e) To further clean the dataset and remove potentially harmful contents, we employ the Detoxify [51] tool to process each video title and description. Detoxify is trained on Wikipedia comments to predict multiple types of toxicity like threats, obscenity, insults, and identity-based hate speech. We delete a video if the toxicity probability in any category is above 0.5.
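
The filtering rules in steps d) and e) can be summarized as a single predicate. This is a sketch under stated assumptions: the metadata field names are hypothetical, the aspect ratio is assumed to mean width divided by height, and the toxicity scores are passed in as a plain dictionary of per-category probabilities rather than computed here.

```python
def keep_video(meta: dict, toxicity: dict, tox_threshold: float = 0.5) -> bool:
    """Return True if a video passes the heuristic and toxicity filters.

    `meta` uses hypothetical field names. `toxicity` maps a toxicity category
    to a probability for the video title and description, e.g. as produced by
    a classifier such as Detoxify.
    """
    if meta["view_count"] < 100:
        return False
    if meta["width"] / meta["height"] < 1:   # aspect ratio below 1 (assumed w/h)
        return False
    if meta["duration_sec"] < 60:            # shorter than 1 minute
        return False
    if meta["age_restricted"]:
        return False
    # Delete the video if any toxicity category is above the threshold.
    return all(p <= tox_threshold for p in toxicity.values())


example = {
    "view_count": 1000,
    "width": 1280,
    "height": 720,
    "duration_sec": 300,
    "age_restricted": False,
}
print(keep_video(example, {"toxicity": 0.01, "threat": 0.0}))
```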

We release all the video IDs along with metadata such as video titles, view counts, like counts, duration, and FPS. In line with prior practice [64], we do not release the actual MP4 files and transcripts due to legal concerns.

D.2 Minecraft Wiki

Figure A.4: Wiki dataset examples. Clockwise order: Villager trade table, mineral ingredient descriptions, monster gallery, and terrain explanation.
Figure A.5: More Wiki database examples with bounding boxes (annotated in red). Left: wood block introduction; right: first day tutorial.

The Wiki pages cover almost every aspect of the game mechanics, and supply a rich source of unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials (example screenshots in Fig. A.4 and Fig. A.5). We use Selenium [103] to scrape 6,735 pages that interleave text, images, tables, and diagrams. We elaborate the details of each web element scraped by Selenium:

  a) Screenshot. Using Selenium’s built-in function, we take a full screenshot of the rendered Wiki page in order to preserve the human-readable visual formatting. We also record the bounding boxes of each salient web element on the page.

  b) Text. We hand-select several HTML tags that likely contain meaningful text data, such as p, h1, h2, ul, dl.

  c) Images and Animations. We download the raw source file of each image element (JPG, PNG, GIF, etc.), as well as the corresponding caption if available. There are also animation effects enabled by JavaScript on the Wiki. We save all image frames in the animation.

  d) Sprites. Sprite elements are micro-sized image icons that are typically embedded in text to create multimodal tutorials and explanations. We save all the sprites and locate their bounding boxes within the text too.

  e) Tables. We save the text content and bounding box of each cell that a table element contains. We store the header cells separately as they carry the semantic meaning of each column. A table can be easily reconstructed with the stored text strings and bounding boxes.
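
To make the last point concrete, the following sketch rebuilds rows and columns from cell text and bounding boxes. The record layout `(text, (x, y, w, h))` and the pixel tolerance `row_tol` are assumptions made for illustration; the released dataset may store cells differently.

```python
def reconstruct_table(cells, row_tol=5):
    """Group cells into rows by the top coordinate of their bounding box.

    `cells` is a list of (text, (x, y, w, h)) tuples. Cells whose top edges
    lie within `row_tol` pixels of each other are treated as the same row.
    """
    rows = []
    for text, (x, y, w, h) in sorted(cells, key=lambda c: (c[1][1], c[1][0])):
        if rows and abs(rows[-1]["y"] - y) <= row_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order cells left to right by their x coordinate.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


cells = [
    ("Emerald", (100, 42, 50, 20)),
    ("Item", (0, 0, 50, 20)),
    ("Price", (100, 1, 50, 20)),
    ("Bread", (0, 40, 50, 20)),
]
table = reconstruct_table(cells)
```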

D.3 Reddit

Figure A.6: Distribution of Reddit post types.
Figure A.7: Examples of posts and comment threads from the Reddit database.

There are more than 1M subreddits (i.e., Reddit topics) where people can discuss a wide range of themes and subjects. Prior works use Reddit data for conversational response selection [5, 139, 54] and abstractive summarization [127, 65]. The r/Minecraft subreddit contains free-form discussions of game strategies and image/video showcases of Minecraft builds and creations (examples in Fig. A.7). The distribution of post types is shown in Fig. A.6.

To scrape the Reddit contents, we use PRAW [88], a Python wrapper on top of the official Reddit API. Our procedure is as follows:

  a) Obtain the ID and metadata (e.g. post title, number of comments, content, score) of every post in the “r/Minecraft” subreddit since it was created. For quality control, we only consider posts with scores (upvotes) $\geq 5$ and not marked as NSFW.

  b) Determine each post’s type. There are 4 native post types - text, image/video, link, and poll. We group text and poll posts together as text posts, and store their body text. For image/video and link posts, we store the source file URLs on external media hosting sites like Imgur and Gfycat. Based on the URL of each link post, we classify it as an image post, a video post or a general link post.

  c) Scrape the comments and store the parent ID of each comment so that we can reconstruct the threaded discussion.

  d) Similarly to our YouTube database (Sec. D.1), we run Detoxify [51] on the scraped Reddit contents to filter out potentially toxic and harmful posts.
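
Step c) stores only a parent ID per comment, which is enough to rebuild the thread. The sketch below shows one way to do so; the flat record layout `(comment_id, parent_id, body)` and the convention that top-level comments point at the post ID are assumptions for illustration, not the schema of the released data.

```python
from collections import defaultdict


def render_thread(post_id, comments):
    """Rebuild a threaded discussion from flat (comment_id, parent_id, body) records.

    Returns the comment bodies in depth-first order, indented by reply depth.
    """
    children = defaultdict(list)
    for comment_id, parent_id, body in comments:
        children[parent_id].append((comment_id, body))

    def walk(parent_id, depth):
        lines = []
        for comment_id, body in children.get(parent_id, []):
            lines.append("  " * depth + body)
            lines.extend(walk(comment_id, depth + 1))
        return lines

    return walk(post_id, 0)


comments = [
    ("c1", "p1", "Nice build!"),
    ("c2", "c1", "Thanks"),
    ("c3", "p1", "How long did it take?"),
]
thread = render_thread("p1", comments)
print("\n".join(thread))
```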

We release all post IDs and their corresponding metadata. We also provide a Python function based on PRAW for researchers to download the post contents after obtaining a license key for the official Reddit API.

Appendix E MineCLIP Algorithm Details

We implement all our neural networks in PyTorch v1.11 [86]. Training MineCLIP uses the PyTorch-Lightning framework [32], pre-trained models hosted on HuggingFace [132], and the x-transformers library for Transformer variants [128].

E.1 Video-Text Pair Extraction

Similar to VideoCLIP [136], we sample 640K pairs of 16-second video snippets and time-aligned English transcripts by the following procedure:

  1) Collect a list of keywords corresponding to the supported entities, blocks, and items in Minecraft;

  2) Perform string matching over our YouTube video transcripts to obtain 640K text segments;

  3) For each matched transcript segment, randomly grow it to $16 \sim 77$ tokens (limited by CLIP’s context length);

  4) Randomly sample a timestamp within the start and end time of the matched transcript as the center for the video clip;

  5) Randomly grow the video clip from the center timestamp to $8 \sim 16$ seconds.
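
Steps 4) and 5) can be sketched as follows. The text does not say how the clip is grown around its center, so growing it symmetrically, drawing the length uniformly, and clamping the start at zero are all assumptions of this illustration, as is the function name.

```python
import random


def sample_clip(seg_start, seg_end, min_len=8.0, max_len=16.0, rng=random):
    """Sample a video clip (start, end) in seconds around a matched transcript.

    Picks a center timestamp inside the transcript segment, then grows the
    clip to a random length between `min_len` and `max_len` seconds.
    """
    center = rng.uniform(seg_start, seg_end)
    length = rng.uniform(min_len, max_len)
    start = max(0.0, center - length / 2)   # assumption: symmetric growth
    return start, start + length


# A transcript segment matched between 30 s and 34 s of a video.
clip_start, clip_end = sample_clip(30.0, 34.0, rng=random.Random(0))
```

Sampling the center inside the transcript span keeps the spoken keyword within the clip while still randomizing its temporal position.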

E.2 Architecture

MineCLIP architecture is composed of three parts:

Frame-wise image encoder $\phi_I$

We use the ViT-B/16 architecture [28] to compute a 512-D embedding for each RGB frame. We initialize the weights from OpenAI CLIP’s public checkpoint [92] and only finetune the last two layers during training. The input resolution is $160\times 256$, which is different from CLIP’s default $224\times 224$ resolution. We adapt the positional embeddings via bicubic interpolation, which does not introduce any new learnable parameters.
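
To see why the positional embeddings must be adapted, it helps to count patch tokens. Assuming the standard ViT-B/16 patch size of 16 pixels (implied by the model name rather than stated in the text), the two input resolutions yield different patch grids:

```latex
\begin{aligned}
224 \times 224 &: \quad \tfrac{224}{16} \times \tfrac{224}{16} = 14 \times 14 = 196 \text{ patches},\\
160 \times 256 &: \quad \tfrac{160}{16} \times \tfrac{256}{16} = 10 \times 16 = 160 \text{ patches}.
\end{aligned}
```

Resampling the pre-trained $14 \times 14$ grid of positional embeddings to $10 \times 16$ reuses the existing weights, which is consistent with the statement that no new learnable parameters are introduced.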

Temporal aggregator $\phi_a$

Given a sequence of frame-wise RGB features, a temporal aggregator network summarizes the sequence into one video embedding. After the aggregator, we insert two extra layers of residual CLIP Adapter [38]. The residual weight is initialized such that it is very close to an identity function at the beginning of training. We consider two variants of $\phi_a$:

  1. Average pooling (MineCLIP[avg]): a simple, parameter-free operator. It is fast to execute but loses the temporal information, because average pooling is permutation-invariant.

  2. Self-Attention (MineCLIP[attn]): a 2-layer transformer encoder with 512 embedding size, 8 attention heads, and Gated Linear Unit variant with Swish activation [106, 22]. The transformer sequence encoder is relatively slower, but captures more temporal information and achieves better performance in our experiments (Table 1).
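
The claim that average pooling discards temporal order is easy to verify on toy data. The sketch below uses plain Python lists in place of real frame embeddings; the feature values are made up.

```python
def average_pool(frame_features):
    """Average a list of per-frame feature vectors into one video embedding."""
    n = len(frame_features)
    return [sum(dim) / n for dim in zip(*frame_features)]


frames = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]   # toy 2-D features for 3 frames
shuffled = [frames[2], frames[0], frames[1]]     # same frames, different order

# Both orderings give the same embedding, so order information is lost.
same = average_pool(frames) == average_pool(shuffled)
```

A sequence model such as the self-attention variant can distinguish the two orderings when positional information is supplied, which is where its extra temporal capacity comes from.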

Table A.2: Training hyperparameters for MineCLIP.
Hyperparameter Value
LR schedule Cosine with warmup [73]
Warmup steps 500
Peak LR 1.5e-4
Final LR 1e-5
Weight decay 0.2
Layerwise LR decay 0.65
Pre-trained layers LR multiplier 0.5×
Batch size per GPU 64
Parallel GPUs 8
Video resolution 160 × 256
Number of frames 16
Image encoder ViT-B/16 [28]

Text encoder $\phi_G$

We use a 12-layer 512-wide GPT model with 8 attention heads [90, 91]. The input string is converted to lower-case byte pair encoding with a 49,152 vocabulary size, and capped at 77 tokens. We exactly follow the text encoder settings in CLIP and initialize the weights from their public checkpoint. Only the last two layers of $\phi_G$ are finetuned during training.

E.3 Training

We train MineCLIP on the 640K video-text pairs for 2 epochs. We sample 16 RGB frames from each video uniformly, and apply temporally-consistent random resized crop [17, 33] as data augmentation. We use Cosine learning rate annealing with 500 gradient steps of warming up [73]. We apply a lower learning rate ($\times 0.5$) on the pre-trained weights and layer-wise learning rate decay for better finetuning [53]. Training is performed on 1 node of 8 × V100 GPUs with FP16 mixed precision [76] via the PyTorch native amp module. All hyperparameters are listed in Table A.2.
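
As a worked illustration of the schedule in Table A.2, the function below computes the learning rate at a given step. The linear shape of the warmup is an assumption, since the text only states that 500 warmup steps are used, and the total step count is derived from the stated numbers by treating 640K as exactly 640,000.

```python
import math


def lr_at(step, total_steps, warmup_steps=500, peak_lr=1.5e-4, final_lr=1e-5):
    """Linear warmup to `peak_lr`, then cosine annealing down to `final_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


# 2 epochs over 640K pairs with a global batch of 64 per GPU x 8 GPUs.
total_steps = 2 * 640_000 // (64 * 8)
print(total_steps, lr_at(0, total_steps), lr_at(total_steps, total_steps))
```

Under these assumptions training lasts about 2,500 gradient steps, so the warmup covers roughly the first fifth of training.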

Appendix F Policy Learning Details

Input: policy $\pi_\theta$, value function $VF(\cdot)$, SI buffer threshold $\Delta$, SI frequency $\omega$
Initialize empty SI buffers for all tasks: $\mathbf{D}_{SI} \leftarrow \{\emptyset, \forall T \in \text{training tasks}\}$;
Initialize a counter for simulator steps: $counter \leftarrow 0$;
while not done do
    Collect a set of trajectories for all tasks $\{\tau_T, \forall T \in \text{training tasks}\}$ by running policy $\pi_\theta$ in (parallel) environments;
    forall $\mathcal{D}_{SI,T}$ do
        if $\tau_T$ is successful then
            $\mathcal{D}_{SI,T} \leftarrow \mathcal{D}_{SI,T} \cup \tau_T$
        else if $\tau_T$’s episode return $\geq \mu_{\text{return}}(\mathcal{D}_{SI,T}) + \Delta \times \sigma_{\text{return}}(\mathcal{D}_{SI,T})$ then
            $\mathcal{D}_{SI,T} \leftarrow \mathcal{D}_{SI,T} \cup \tau_T$
        end if
    end forall
    Increase $counter$ accordingly;
    Update $\pi_\theta$ following Equation 2;
    Fit $VF(\cdot)$ by regression on mean-squared error;
    if $\mathbb{1}(counter \bmod \omega = 0)$ then
        Determine the number of trajectories to sample from each buffer: $\#_{\text{sample}} = \min(\{|\mathcal{D}_{SI,T}|, \forall T \in \text{training tasks}\})$;
        Sample $\#_{\text{sample}}$ trajectories from each buffer in a prioritized manner to construct $\mathcal{D}_{SI}$;
        Update $\pi_\theta$ on $\mathcal{D}_{SI}$ with supervised objective;
    end if
end while
Algorithm 1 PPO-SI Interleaved Training
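
The buffer admission rule inside the forall loop of Algorithm 1 can be written as a small predicate. This is a sketch, not the released implementation: the use of the population standard deviation and the behavior for an empty buffer are assumptions, since the algorithm does not specify either.

```python
from statistics import mean, pstdev


def should_add_to_buffer(success, episode_return, buffer_returns, delta):
    """Decide whether a trajectory enters a task's self-imitation buffer.

    A trajectory is admitted if it is successful, or if its episode return is
    at least mean + delta * std of the returns already in the buffer.
    """
    if success:
        return True
    if not buffer_returns:
        # Not specified in the algorithm; this sketch rejects in that case.
        return False
    threshold = mean(buffer_returns) + delta * pstdev(buffer_returns)
    return episode_return >= threshold


# Buffer returns 1.0 and 3.0 give mean 2.0 and std 1.0, so the bar is 4.0.
print(should_add_to_buffer(False, 10.0, [1.0, 3.0], delta=2.0))
```

Tying the threshold to the buffer statistics raises the bar as better trajectories accumulate, so the buffer keeps improving instead of filling with early mediocre rollouts.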

In this section, we elaborate how a trained MineCLIP can be adapted as a reward function with two different formulations. We then discuss the algorithm for policy learning. Finally, we demonstrate how we combine self imitation learning and on-policy learning to further improve sample efficiency.

F.1 Adapt MineCLIP as Reward Function

We investigate two ways to convert the MineCLIP output to a scalar reward, dubbed Direct and Delta. The ablation results for the Animal-Zoo task group are presented in Table A.3.


Direct.

For a task $T$ with the goal description $G$, MineCLIP outputs the probability $P_G$ that the observation video semantically corresponds to $G$, against a set of negative goal descriptions $\mathcal{G}^-$. Note that we omit the timestep subscript for simplicity. As an example, for the task “shear sheep”, $G$ is “shear a sheep” and $\mathcal{G}^-$ may include negative prompts like “milk a cow”, “hunt a sheep”, “hunt a cow”, etc. To compute the Direct reward, we further process the raw probability using the formula $r=\max(P_G-\frac{1}{N_T},0)$, where $N_T$ is the number of prompts passed to MineCLIP. $\frac{1}{N_T}$ is the baseline probability of randomly guessing which text string corresponds to the video. We threshold $r$ at zero to avoid highly uncertain probability estimates below the random baseline. We call the variant without the post-processing Direct-Naive, which uses $r=P_G$ as the reward signal for every time step.
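
The Direct formula can be checked on a toy example. In the sketch below, $P_G$ is obtained with a softmax over video-text similarity scores, which is an assumption in the style of CLIP rather than something the text spells out; the function names and the scores are made up.

```python
import math


def goal_probability(similarities, goal_index=0):
    """Softmax over similarity scores; returns the probability of the goal prompt."""
    m = max(similarities)
    exps = [math.exp(s - m) for s in similarities]
    return exps[goal_index] / sum(exps)


def direct_reward(p_goal, num_prompts):
    """r = max(P_G - 1/N_T, 0): keep only what exceeds the random-guess baseline."""
    return max(p_goal - 1.0 / num_prompts, 0.0)


# One goal prompt ("shear a sheep") and three negative prompts.
scores = [2.0, 0.5, 0.1, 0.3]
p_goal = goal_probability(scores)
reward = direct_reward(p_goal, num_prompts=len(scores))
```

With four prompts the baseline is 0.25, so a video that matches all prompts equally well earns zero reward instead of a spurious 0.25 at every step.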


Delta.

The Direct formulation yields strong performance when the task is concerned with moving creatures, e.g. farm animals and monsters that run around constantly. However, we discover that Direct is suboptimal if the task deals with static objects, e.g., “find a nether portal”. Simply using the raw probability from MineCLIP as reward can cause the learned agent to stare at the object of interest but fail to move closer and interact. Therefore, we propose to use an alternative formulation, Delta, to remedy this issue. Concretely, the reward value at timestep $t$ becomes $r_t=P_{G,t}-P_{G,t-1}$. We empirically validate that this formulation provides a better shaped reward for the task group with static entities.
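
A short sketch makes the difference from Direct visible. The probabilities below are invented, and the function name is hypothetical.

```python
def delta_rewards(goal_probs):
    """r_t = P_{G,t} - P_{G,t-1} for t >= 1, given per-step goal probabilities."""
    return [curr - prev for prev, curr in zip(goal_probs, goal_probs[1:])]


# The agent spots the object (0.25 -> 0.5), stares at it, then looks away.
probs = [0.25, 0.5, 0.5, 0.25]
rewards = delta_rewards(probs)
```

Staring at the object keeps the probability constant and therefore earns nothing, and the rewards telescope to the final probability minus the initial one, so only lasting improvements in the match are rewarded.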

Table A.3: Ablation on different MineCLIP reward formulations.
Group | Tasks | Direct | Direct-Naive | Delta
Animal-Zoo | Milk Cow | 64.5 ± 37.1 | 8.6 ± 1.2 | 7.6 ± 5.2
 | Hunt Cow | 83.5 ± 7.1 | 0.0 ± 0.0 | 0.0 ± 0.0
 | Shear Sheep | 12.1 ± 9.1 | 0.8 ± 0.6 | 1.8 ± 1.5
 | Hunt Sheep | 8.1 ± 4.1 | 0.1 ± 0.2 | 0.0 ± 0.0

F.2 Policy Network Architecture

Our policy architecture consists of three parts: an input feature encoder, a policy head, and a value function. To handle multimodal observations (Sec. B.1), the feature extractor contains several modality-specific components:

  • RGB frame: we use the frozen frame-wise image encoder $\phi_I$ in MineCLIP to optimize for compute efficiency and provide the agent with good visual representations from the beginning (Sec. 4.2).

  • Task goal: $\phi_G$ computes the text embedding of the natural language task goal.

  • Yaw and Pitch: compute $\sin(\cdot)$ and $\cos(\cdot)$ features respectively, then pass through an MLP.

  • GPS: normalize and featurize via MLP.

  • Voxel: to process the $3\times 3\times 3$ surrounding voxels, we embed discrete block names to dense vectors, flatten them, and pass through an MLP.

  • Past action: our agent is conditioned on its immediate past action, which is embedded and featurized by MLP.
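
As an example of the modality-specific preprocessing, the yaw and pitch encoding can be sketched as below. The assumption that the angles arrive in degrees, and the function name, are ours; the MLP that follows is omitted.

```python
import math


def compass_features(yaw_deg, pitch_deg):
    """Encode yaw and pitch as (sin, cos) pairs before feeding them to an MLP."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return [math.sin(yaw), math.cos(yaw), math.sin(pitch), math.cos(pitch)]


# 0 and 360 degrees describe the same heading and map to the same features.
a = compass_features(0.0, 0.0)
b = compass_features(360.0, 0.0)
```

The sine and cosine pair removes the discontinuity at the wrap-around point, so nearby headings always have nearby features.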

Features from all modalities are concatenated, passed through another fusion MLP, and finally fed into the policy head and value function head. We use an MLP to model the policy head that maps from the input feature vectors to the action probability distribution. We use another MLP to estimate the value function, conditioned on the same input features.

F.3 RL Training


PPO.

We use the popular PPO algorithm [102] (Proximal Policy Optimization) as our RL training backbone. PPO is an on-policy method that optimizes a surrogate objective while ensuring that the deviation from the previous policy is relatively small. PPO updates the policy network by

$$\underset{\theta}{\text{maximize}}\;\mathbb{E}_{s,a\sim\pi_{\theta_{\text{old}}}}L(s,a,\theta_{\text{old}},\theta), \tag{1}$$

where

$$L(s,a,\theta_{\text{old}},\theta)=\min\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}A^{\pi_{\theta_{\text{old}}}}(s,a),\;\text{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)},1-\epsilon,1+\epsilon\right)A^{\pi_{\theta_{\text{old}}}}(s,a)\right). \tag{2}$$

A𝐴A is an estimator of the advantage function (GAE [101] in our case) and ϵitalic-ϵ\epsilon is a hyperparameter that controls the deviation between the new policy and the old one.
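To make Eq. 2 concrete, the clipped surrogate can be evaluated per sample as in the sketch below. The function names are hypothetical, the inputs are assumed to be log-probabilities under the new and old policies plus a precomputed advantage, and the default $\epsilon=0.2$ follows Table A.4. A real implementation would compute this on batched tensors with automatic differentiation.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample clipped surrogate L(s, a, theta_old, theta) from Eq. 2."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = math.exp(logp_new - logp_old)
    # clip(ratio, 1 - eps, 1 + eps)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def ppo_objective(batch, eps=0.2):
    """Monte Carlo estimate of Eq. 1 over (logp_new, logp_old, advantage) tuples."""
    return sum(ppo_clip_term(n, o, a, eps) for n, o, a in batch) / len(batch)


if __name__ == "__main__":
    # Ratio 2 with a positive advantage is clipped at 1 + eps.
    print(ppo_clip_term(math.log(2.0), 0.0, 1.0))
```

The `min` makes the bound pessimistic: with a positive advantage the gain from increasing the ratio is capped at $1+\epsilon$, while with a negative advantage a ratio above $1+\epsilon$ is still penalized in full.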

Self-Imitation Learning.

We apply self-imitation learning [84] (SI) to further improve sample efficiency because computing the reward with MineCLIP in the loop makes the training more expensive. Self-imitation learning is essentially supervised learning on a buffer $\mathcal{D}_{SI}$ of good trajectories generated by the agent’s past self. In our case, the trajectories are generated by the behavior policy during PPO rollouts, and a trajectory is added to $\mathcal{D}_{SI}$ only if it is a successful trial or if its episodic return exceeds a certain threshold. Self-imitation optimizes $\pi_{\theta}$ for the objective $\mathcal{J}_{SI}=\mathbb{E}_{s,a\sim\mathcal{D}_{SI}}\log\pi_{\theta}(a|s)$ with respect to $\theta$.

We alternate between the PPO phase and the SI phase. Pseudocode of our interleaved training procedure is given in Algorithm 1. We use a prioritized strategy to sample trajectories from the buffer $\mathcal{D}_{SI}$. Specifically, we assign equal probability to all successful trajectories. Unsuccessful trajectories can still be sampled but with lower probabilities proportional to their episodic returns.
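The buffer admission rule and the prioritized sampling can be sketched as below. This is an illustrative sketch under stated assumptions rather than the released implementation: the class name `SelfImitationBuffer`, the `failure_scale` factor that keeps unsuccessful trajectories below successful ones, and the normalization by the largest unsuccessful return are all hypothetical choices, and returns are assumed to be non-negative.

```python
import random


class SelfImitationBuffer:
    """Stores good trajectories and samples them with simple priorities."""

    def __init__(self, return_threshold):
        self.return_threshold = return_threshold
        self.entries = []  # (trajectory, episodic_return, success)

    def maybe_add(self, trajectory, episodic_return, success):
        """Admit a trajectory if it succeeded or its return exceeds the threshold."""
        if success or episodic_return > self.return_threshold:
            self.entries.append((trajectory, episodic_return, success))
            return True
        return False

    def sampling_weights(self, failure_scale=0.5):
        """Equal weight for successes; smaller, return-proportional weight otherwise.

        Assumption: failure_scale < 1 and the max-return normalization are
        illustrative choices, not values taken from the paper.
        """
        failed = [max(r, 0.0) for _, r, s in self.entries if not s]
        max_failed = max(failed, default=0.0)
        weights = []
        for _, ret, success in self.entries:
            if success:
                weights.append(1.0)
            elif max_failed > 0.0:
                weights.append(failure_scale * max(ret, 0.0) / max_failed)
            else:
                weights.append(0.0)
        return weights

    def sample(self, k, rng=random):
        """Draw k trajectories (with replacement) according to the weights."""
        weights = self.sampling_weights()
        if sum(weights) <= 0.0:
            # Fall back to uniform sampling if every weight is zero.
            weights = [1.0] * len(self.entries)
        trajectories = [t for t, _, _ in self.entries]
        return rng.choices(trajectories, weights=weights, k=k)
```

The sampled trajectories would then feed the supervised objective $\mathcal{J}_{SI}$ during the SI phase.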

In Fig. A.8, we demonstrate that adding self-imitation dramatically improves the stability, performance, and sample efficiency of RL training in MineDojo.

Figure A.8: Adding the self-imitation technique [84] significantly improves the performance of RL training in MineDojo. Panels: (a) “Milk Cow”, (b) “Shear Sheep”.

Appendix G Experiment Details

G.1 Task Details

We experiment with three task groups with four tasks per group. We train one multi-task agent for each group. In this section, we describe each task’s goal, initial setup, and manual dense-shaping reward function.

Animal Zoo:

4 Programmatic tasks on hunting animals or harvesting resources from them. We spawn various animal types (pig, sheep, and cow) in the same environment to serve as distractors. It is considered a failure if the agent does not take action on the correct animal specified by the prompt.

  • Milk Cow: find and approach a cow, then obtain milk from it with an empty bucket. The prompt is milk a cow. We initialize the agent with an empty bucket to collect milk. We also spawn sheep, cow, and pig near the agent. The manual dense reward shaping is a navigation reward based on geodesic distance obtained from privileged LIDAR. The combined reward passed to PPO can be formulated as $r_{t}=\lambda_{\text{nav}}\max(d_{\text{min},t-1}-d_{\text{min},t},0)+\lambda_{\text{success}}\mathbbm{1}(\text{milk collected})$, where $\lambda_{\text{nav}}=10$ and $\lambda_{\text{success}}=200$. Here $d_{\text{min},t}=\min(d_{\text{min}},d_{t})$, where $d_{\text{min}}$ denotes the minimal distance to the cow that the agent has achieved so far in the episode history.

  • Hunt Cow: find and approach a cow, then hunt with a sword. The cow will run away so the agent needs to chase after it. The prompt is hunt a cow. We initialize the agent with a diamond sword. The manual dense reward shaping consists of two parts, a valid attack reward and a navigation reward based on geodesic distance obtained from privileged LIDAR. Mathematically, the reward is $r_{t}=\lambda_{\text{attack}}\mathbbm{1}(\text{valid attack})+\lambda_{\text{nav}}\max(d_{\text{min},t-1}-d_{\text{min},t},0)+\lambda_{\text{success}}\mathbbm{1}(\text{cow hunted})$, where $\lambda_{\text{attack}}=5$, $\lambda_{\text{nav}}=1$, and $\lambda_{\text{success}}=200$. We additionally reset $d_{\text{min}}$ every time the agent hits the cow to encourage the chasing behavior.

  • Shear Sheep: find and approach a sheep, then collect wool from the sheep with shears. The prompt is shear a sheep. We initialize the agent with shears. The manual dense reward shaping is a navigation reward based on geodesic distance obtained from the privileged LIDAR sensor, similar to “Milk Cow”.

  • Hunt Sheep: find and approach a sheep, then hunt with a sword. The sheep will run away so the agent needs to chase after it. An episode will terminate once any entity is hunted. The prompt is hunt a sheep. We initialize the agent with a diamond sword. The manual dense reward shaping consists of two parts, a valid attack reward and a navigation reward based on geodesic distance obtained from the privileged LIDAR sensor, similar to “Hunt Cow”.
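The manual shaping rewards for the Animal Zoo tasks share one structure, which the following sketch makes explicit. It is a hedged illustration, not the released code: the class name `NavigationShapingReward` is hypothetical, and two details are assumptions, namely that the first observed distance initializes $d_{\text{min}}$ and that "resetting" $d_{\text{min}}$ after a hit means restarting it from the next observed distance.

```python
class NavigationShapingReward:
    """r_t = l_attack * 1(valid attack) + l_nav * max(d_min,t-1 - d_min,t, 0)
             + l_success * 1(success)."""

    def __init__(self, lambda_nav, lambda_success,
                 lambda_attack=0.0, reset_on_hit=False):
        self.lambda_nav = lambda_nav
        self.lambda_success = lambda_success
        self.lambda_attack = lambda_attack
        self.reset_on_hit = reset_on_hit
        self.d_min = None  # minimal distance achieved so far

    def reset(self):
        """Call at the start of every episode."""
        self.d_min = None

    def step(self, distance, success=False, valid_attack=False):
        if self.d_min is None:
            # Assumption: the first observed distance sets the baseline.
            self.d_min = distance
        new_d_min = min(self.d_min, distance)
        progress = max(self.d_min - new_d_min, 0.0)
        self.d_min = new_d_min
        reward = (self.lambda_attack * float(valid_attack)
                  + self.lambda_nav * progress
                  + self.lambda_success * float(success))
        if valid_attack and self.reset_on_hit:
            # Assumption: restart d_min so that chasing is rewarded again.
            self.d_min = None
        return reward


if __name__ == "__main__":
    milk_cow = NavigationShapingReward(lambda_nav=10.0, lambda_success=200.0)
    for d in (10.0, 8.0, 9.0):
        print(milk_cow.step(d))
```

With the coefficients given above, “Milk Cow” corresponds to `lambda_nav=10, lambda_success=200`, and “Hunt Cow” to `lambda_attack=5, lambda_nav=1, lambda_success=200, reset_on_hit=True`. Because only improvements over $d_{\text{min}}$ are rewarded, moving away and back again earns nothing.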

Mob Combat:

Fight 4 different types of hostile monsters: Spider, Zombie, Zombie Pigman (a creature in the Nether world), and Enderman (a creature in the End world). The prompt template is "Combat {monster}". For all tasks within this group, we initialize the agent with a diamond sword, a shield, and a full suite of diamond armor. The agent is spawned in the Nether for the Zombie Pigman task, and in the End for the Enderman task. The manual dense-shaping reward can be expressed as $r_{t}=\lambda_{\text{attack}}\mathbbm{1}(\text{valid attack})+\lambda_{\text{success}}\mathbbm{1}(\texttt{\{monster\}}\text{ hunted})$, where $\lambda_{\text{attack}}=5$ and $\lambda_{\text{success}}=200$.


Creative:

4 tasks that do not have manual dense reward shaping or a code-defined success criterion.

  • Find Nether Portal: find and move close to a Nether Portal, then enter the Nether world through the portal. The prompt is find a nether portal.

  • Find Ocean: find and move close to an ocean. The prompt is find an ocean.

  • Dig Hole: dig holes in the ground. The prompt is dig a hole. We initialize the agent with an iron shovel.

  • Lay Carpet: lay down carpets to cover the wooden floor inside a house. The prompt is put carpets on the floor. We initialize the agent with a number of carpets in its inventory.

Note that we categorize “Find Nether Portal” and “Find Ocean” as Creative tasks even though they seem similar to object navigation [12]. While finding terrains and other structures is semantically well defined, it is not easy to define a function to evaluate success automatically because the simulator does not have the exact location information of these structures given a randomly generated world. In principle, we can make a sweep by querying each chunk of voxels in the world to recognize the terrains, but that would be prohibitively expensive. Therefore, we opt to use MineCLIP as the reward signal and treat these tasks as Creative.

Table A.4: Hyperparameters in RL experiments. “{state} MLP” refers to MLPs to process observations of compass, GPS, and voxel blocks. “Embed Dim” denotes the same dimension size used to embed all discrete observations into dense vectors.

NN Architecture

| Hyperparameter | Value |
| --- | --- |
| RGB Feature Size | 512 |
| Task Prompt Feature Size | 512 |
| {state} MLP Hidden Size | 128 |
| {state} MLP Output Size | 128 |
| {state} MLP Hidden Depth | 2 |
| Embed Dim | 8 |
| Num Feature Fusion Layers | 1 |
| Feature Fusion Output Size | 512 |
| Prev Action Conditioning | True |
| Policy Head Hidden Size | 256 |
| Policy Head Hidden Depth | 3 |
| VF Hidden Size | 256 |
| VF Hidden Depth | 3 |

Training

| Hyperparameter | Animal-Zoo | Mob-Combat | Creative |
| --- | --- | --- | --- |
| Learning Rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| Cosine Decay Minimal LR | $5\times 10^{-6}$ | $5\times 10^{-6}$ | $5\times 10^{-6}$ |
| $\gamma$ | 0.99 | 0.99 | 0.99 |
| Entropy Weight (Stage 1) | $5\times 10^{-3}$ | $5\times 10^{-3}$ | $5\times 10^{-3}$ |
| Entropy Weight (Stage 2) | $10^{-2}$ | N/A | $10^{-2}$ |
| PPO Optimizer | Adam | Adam | Adam |
| SI Learning Rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| SI Cosine Decay Minimal LR | $10^{-6}$ | $10^{-6}$ | $10^{-6}$ |
| SI Epoch | 10 | 10 | 10 |
| SI Frequency (Env Steps) | 100K | 100K | 100K |
| SI Optimizer | Adam | Adam | Adam |
| SI Buffer Threshold | $2\sigma$ | $2\sigma$ | $0.5\sigma$ |
| PPO Buffer Size | 100K | 100K | 100K |
| Frame Stack | 1 | 1 | 1 |
| VF Loss Weight | 0.5 | 0.5 | 0.5 |
| GAE $\lambda$ | 0.95 | 0.95 | 0.95 |
| Gradient Clip | 10 | 10 | 10 |
| PPO $\epsilon$ | 0.2 | 0.2 | 0.2 |
| Action Smooth Weight | $10^{-7}$ | $10^{-7}$ | $10^{-7}$ |
| Action Smooth Window Size | 3 | 3 | 3 |
| MineCLIP Reward Formulation | Direct | Direct | Delta |

G.2 Observation and Action Space

We use a subset of the full observation and action space listed in Sec. B.1 and B.2, because the tasks in our current experiments do not involve actions like crafting or inventory management. Our observation space consists of RGB frame, compass, GPS, and Voxels.

Our action space is a trimmed version of the full action space. It consists of movement control, camera control, the “use” action, and the “attack” action, which add up to 89 discrete choices. Concretely, it includes 81 actions for discrete camera control ($9\times 9$, resulting from the Cartesian product of yaw and pitch, each ranging from $-60$ degrees to $60$ degrees at a discrete interval of $15$ degrees). It also includes 6 movement actions (forward, forward + jump, jump, back, move left, and move right) and 2 functional actions, “use” and “attack”. Note that the “no-op” action is merged into the 81 camera actions.
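The count of 89 discrete actions can be checked by enumerating the space directly. The tuple encoding and the action names below are hypothetical labels chosen for illustration; only the counts and the camera grid come from the description above.

```python
from itertools import product

# 9 discrete camera deltas per axis: -60, -45, ..., 45, 60 degrees.
CAMERA_DELTAS = list(range(-60, 61, 15))

# 9 x 9 = 81 camera actions from the Cartesian product of yaw and pitch.
# The (0, 0) entry changes nothing and plays the role of the "no-op".
camera_actions = [("camera", yaw, pitch)
                  for yaw, pitch in product(CAMERA_DELTAS, CAMERA_DELTAS)]

movement_actions = [("move", name) for name in
                    ("forward", "forward+jump", "jump", "back", "left", "right")]

functional_actions = [("functional", name) for name in ("use", "attack")]

ACTIONS = camera_actions + movement_actions + functional_actions

if __name__ == "__main__":
    print(len(camera_actions), len(movement_actions),
          len(functional_actions), len(ACTIONS))
```

A policy head over this space would then output a categorical distribution with `len(ACTIONS)` entries.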

G.3 RL Training

All hyperparameters used in our RL experiment are listed in Table A.4. We visualize the learned behaviors of 4 tasks in Figure 2. Demos of more tasks can be found on our website https://minedojo.org.

Action smoothing.

Due to the stochastic nature of PPO, we observe a lot of action jittering in the agent’s behavior during training. This leads to two negative effects that degrade the learning performance: 1) exploration becomes difficult because action sequences are inconsistent, for example when the agent is required to take multiple consecutive attack actions in order to complete certain tasks; and 2) rapid switching between different movement and camera motions results in videos that are highly non-smooth and disorienting. The second effect causes a domain gap from the training data of MineCLIP, which are typically smooth human gameplay videos, so the reward signal quality deteriorates significantly.

To remedy the issue, we impose an action smoothing loss to be jointly optimized with the PPO objective (Eq. 2) during training. Concretely, consider a sliding window $\mathcal{W}$ with window size $|\mathcal{W}|$ that contains $|\mathcal{W}|$ consecutive action distributions $\mathcal{W}=\{\pi_{t-|\mathcal{W}|+1},\pi_{t-|\mathcal{W}|+2},\ldots,\pi_{t}\}$. The action smoothing loss is defined as

$$\mathcal{L}_{\text{smooth}}=\frac{1}{|\mathcal{W}|}\sum_{i=1}^{|\mathcal{W}|-1}KL(\pi_{t}\|\pi_{t-|\mathcal{W}|+i}), \tag{3}$$

where $KL(\cdot)$ denotes Kullback–Leibler divergence.
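Eq. 3 can be evaluated on discrete action distributions as in the sketch below. The function names are hypothetical, the distributions are assumed to be plain lists of probabilities over the same action set, and the small constant `tiny` is an illustrative guard against division by zero rather than part of the definition.

```python
import math


def kl_divergence(p, q, tiny=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + tiny) / (qi + tiny))
               for pi, qi in zip(p, q) if pi > 0.0)


def action_smoothing_loss(window):
    """Eq. 3 for window = [pi_{t-|W|+1}, ..., pi_{t-1}, pi_t].

    Sums KL(pi_t || pi_past) over the |W| - 1 earlier distributions and
    divides by the window size |W|.
    """
    pi_t = window[-1]
    return sum(kl_divergence(pi_t, past) for past in window[:-1]) / len(window)


if __name__ == "__main__":
    # Window size 3, as in Table A.4.
    steady = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    jittery = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]
    print(action_smoothing_loss(steady), action_smoothing_loss(jittery))
```

The loss is zero when the current distribution matches every earlier one in the window and grows as the policy changes abruptly between consecutive steps.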

Multi-stage training for multi-task RL.

Due to hardware limitations, we are not able to run a large number of parallel simulators for all tasks in a task group. Therefore, we adopt a multi-stage strategy to split the tasks and train them sequentially with a single policy network. For the task groups Animal-Zoo and Creative, we split the four tasks into two stages of two parallel training tasks each. We carry over the self-imitation buffers when switching to the next stage. We also follow the recommended practice in [83] and reset the policy head at the beginning of stage 2 to encourage exploration and reduce overfitting. We adopt a replay buffer balancing strategy similar to [46] to prevent any task from dominating the training.

G.4 Evaluation

In this section, we elaborate on our human and automatic evaluation procedure for Creative tasks. We first ask the human annotators to manually label 100 successful and 100 failure trajectories. This produces a combined dataset of 200 trajectories with groundtruth binary labels to evaluate the learned reward functions. On this dataset, we run MineCLIP to produce step-wise rewards and compute a score for each trajectory by averaging its rewards. We then apply K-means clustering with $K=2$ to all scores and determine a decision boundary $\delta$ from the mean of the two centroids. A trajectory with a score greater than $\delta$ is classified as successful, and otherwise as a failure. In this way, we essentially convert MineCLIP to a binary classifier. The quality of MineCLIP can be measured by the F1 score of its binary classification output against the human labels. We demonstrate that MineCLIP has high agreement with humans (Table 2), and thus qualifies as an effective automatic evaluation metric for Creative tasks in the absence of human judges.
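The thresholding procedure can be sketched in a few lines for one-dimensional scores. This is an illustrative sketch, not the evaluation code: the function names are hypothetical, the K-means initialization at the minimum and maximum score is an assumption, and the example scores are made-up placeholders rather than MineCLIP outputs.

```python
def two_means_threshold(scores, max_iters=100):
    """Run 1-D K-means with K=2 and return the mean of the two centroids."""
    lo, hi = min(scores), max(scores)  # assumption: initialize at the extremes
    for _ in range(max_iters):
        low = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high = [s for s in scores if abs(s - lo) > abs(s - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if new_lo == lo and new_hi == hi:
            break  # assignments no longer change
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def f1_score(predictions, labels):
    """F1 of boolean predictions against boolean ground-truth labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85]  # placeholder trajectory scores
    human_labels = [False, False, False, True, True, True]
    delta = two_means_threshold(scores)
    predictions = [s > delta for s in scores]
    print(delta, f1_score(predictions, human_labels))
```

Note that the clustering itself never sees the human labels; they are used only to score the resulting classifier.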

To further investigate MineCLIP’s evaluation on more complex Creative tasks, we annotate 50 YouTube video segments each for 5 more tasks that are much more semantically complex: “build a farm”, “build a fence”, “build a house”, “ride a minecart”, and “build a swimming pool”. We then run MineCLIP evaluation on these videos against a negative set. As shown in Table A.5, though not perfect, MineCLIP generally has a positive agreement with human judgment. We note that the current MineCLIP is a proof-of-concept step in leveraging internet data for automated evaluation, and further scaling on more training data and parameters may lead to more improvements. Meanwhile, human judgment remains a useful and important alternative [93, 97].

Table A.5: MineCLIP’s evaluation on more complex Creative tasks. Numbers represent F1 scores between MineCLIP’s evaluation of task success and human labels. Scaled to percentage for better readability.
Tasks Build a Farm Build a Fence Build a House Ride a Minecart Build a Swimming Pool
Ours (Attn) $\mathbf{78.7}$ $\mathbf{91.4}$ $\mathbf{63.7}$ $95.9$
Ours (Avg) $73.4$ $37.4$ $\mathbf{96.9}$ $\mathbf{94.7}$
CLIPOpenAI $62.5$ $24.5$ $52.9$ $71.7$

Appendix H Limitations and Potential Societal Impact

Unlike human demonstrations [126] or offline RL datasets [35], our YouTube dataset contains only the video screen observations but not the actual control actions. This allows us to scale up the dataset tremendously, but at the same time poses a challenge to imitation learning algorithms that require observation-action pairs to learn. Our proposed algorithm, MineCLIP, side-steps this problem by learning a reward model, but we believe that directly inferring the human expert policy from YouTube is another important direction complementary to our approach. There are promising techniques that can potentially overcome this limitation, such as the Learning-from-Observation (LfO) family of algorithms [123, 122, 115, 31].

Our database is scraped from the internet, which inevitably contains offensive YouTube videos or toxic Reddit posts. While we have made our best effort to filter out such harmful content (Sec. D.1), there can still be undesirable biases and toxicity that elude our automatic filters. Furthermore, we advocate the use of large pre-trained language models in our main paper, and MineCLIP is finetuned from the pre-trained weights of OpenAI CLIP [92]. These foundation models are known to contain harmful stereotypes and generate hateful commentary [15, 13, 40]. We ask the researchers who will use our code and database to exercise their best judgment during new model development to avoid any negative social impact.

Appendix I Datasheet

We present a Datasheet [39] for documentation and responsible usage of our internet knowledge databases.

I.1 Motivation

For what purpose was the dataset created?

We create this internet-scale multimodal knowledge base to facilitate research towards open-ended, generally capable embodied agents.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

This knowledge base was created by Linxi Fan (Nvidia), Guanzhi Wang (Caltech), Yunfan Jiang (Stanford), Ajay Mandlekar (Nvidia), Yuncong Yang (Columbia), Haoyi Zhu (SJTU), Andrew Tang (Columbia), De-An Huang (Nvidia), Yuke Zhu (Nvidia and UT Austin), and Anima Anandkumar (Nvidia and Caltech).

I.2 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

Yes, the dataset is publicly available on the internet.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)?

All datasets can be downloaded from https://zenodo.org/. Please refer to this table of URL, DOI, and licensing:

| Database | DOI | License |
| --- | --- | --- |
| YouTube | 10.5281/zenodo.6641142 | Creative Commons Attribution 4.0 International (CC BY 4.0) |
| Wiki | 10.5281/zenodo.6640448 | Creative Commons Attribution Non Commercial Share Alike 3.0 Unported |
| Reddit | 10.5281/zenodo.6641114 | Creative Commons Attribution 4.0 International (CC BY 4.0) |

Have any third parties imposed IP-based or other restrictions on the data associated with the instances?


Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?


I.3 Maintenance

Who will be supporting/hosting/maintaining the dataset?

The authors will be supporting, hosting, and maintaining the dataset.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Please contact Linxi Fan (linxif@nvidia.com), Guanzhi Wang (guanzhi@caltech.edu), and Yunfan Jiang (yunfanj@cs.stanford.edu).

Is there an erratum?

No. We will make announcements if there is any.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

Yes. New updates will be posted on https://minedojo.org.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)?


Will older versions of the dataset continue to be supported/hosted/maintained?

Yes, old versions will be permanently accessible on zenodo.org.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

Yes, please refer to https://minedojo.org.

I.4 Composition

What do the instances that comprise the dataset represent?

For YouTube videos, our data is in JSON format with video URLs and metadata. We do not provide the raw MP4 files for legal concerns. For Wiki, we provide the text, images, tables, and diagrams embedded on the web pages. For Reddit, our data is in JSON format with post IDs and metadata, similar to YouTube. Users can reconstruct the Reddit dataset by running our script after obtaining an official Reddit API license key.

How many instances are there in total (of each type, if appropriate)?

There are more than 730K YouTube videos with 2.2B words of transcripts, 6,735 Wiki pages with 2.2M bounding boxes of visual elements, and more than 340K Reddit posts with 6.6M comments.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

We provide all instances in our Zenodo data repositories.

Is there a label or target associated with each instance?


Is any information missing from individual instances?


Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?

We provide metadata for each YouTube video link and Reddit post ID.

Are there recommended data splits (e.g., training, development/validation, testing)?

No. The entire database is intended for pre-training.

Are there any errors, sources of noise, or redundancies in the dataset?

Please refer to Sec. D.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

We follow prior works [64] and only release the video URLs of YouTube videos due to legal concerns. Researchers need to acquire the MP4 and transcript files separately. Similarly, we only release the post IDs for the Minecraft Reddit database, but we also provide a script that can reconstruct the full Reddit dataset given a free official license key.

Does the dataset contain data that might be considered confidential?


Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

We have made our best efforts to detoxify the contents via an automated procedure. Please refer to Sec. D.

I.5 Collection Process

The collection procedure, preprocessing, and cleaning are explained in detail in Sec. D.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

All data collection, curation, and filtering are done by MineDojo coauthors.

Over what timeframe was the data collected?

The data was collected between Dec. 2021 and May 2022.

I.6 Uses

Has the dataset been used for any tasks already?

Yes, we have used the MineDojo YouTube database for agent pre-training. Please refer to Sec. 5 and Sec. G for algorithmic and training details.

What (other) tasks could the dataset be used for?

Our knowledge base is primarily intended to facilitate research in open-ended, generally capable embodied agents. However, it can also be broadly applicable to research in video understanding, document understanding, language modeling, multimodal learning, and so on.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?


Are there tasks for which the dataset should not be used?

We strongly oppose any research that intentionally generates harmful or toxic content using our YouTube, Wiki, and Reddit data.